Model Training Notes 2

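PyCharm remote interpreter: cannot find declaration

Symptom:

"Cannot find declaration" on the remote interpreter makes debugging very inconvenient.

Fix:

1. Change the interpreter path in the Project Interpreter settings: /bin/python3 ---> /bin/python. A small detail, but an annoying trap.

2. Restart PyCharm and let it download the code from the remote environment. How long this takes depends on the network and the amount of data. Because this was a remote environment on a poor network, it took an annoying 30+ minutes, so be patient.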

Error when evaluating a COCO dataset

Traceback (most recent call last):
  File "/root/dxq/question-split-mask-rcnn/doc/evaluater.py", line 15, in <module>
    from pycocotools.coco import COCO
  File "/root/anaconda3/lib/python3.7/site-packages/pycocotools-2.0-py3.7-linux-x86_64.egg/pycocotools/coco.py", line 55, in <module>
    from . import mask as maskUtils
  File "/root/anaconda3/lib/python3.7/site-packages/pycocotools-2.0-py3.7-linux-x86_64.egg/pycocotools/mask.py", line 3, in <module>
    import pycocotools._mask as _mask
  File "__init__.pxd", line 918, in init pycocotools._mask
ValueError: numpy.ufunc size changed, may indicate binary incompatibility. Expected 216 from C header, got 192 from PyObject

The environment was set up incorrectly...

pip uninstall numpy
pip install numpy==1.16.2

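This is a binary compatibility problem between pycocotools and numpy. The fix was to downgrade numpy to a pinned version. An annoying trap, and there were plenty of others that I painfully worked through one by one... Without a decent understanding of the COCO format, this would not have been easy.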
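Before and after reinstalling numpy, it can help to confirm which version is actually installed and whether the compiled extension imports cleanly. The following is a minimal sketch, not part of the original notes: the helper names `installed_version` and `try_import` are hypothetical, and it assumes Python 3.8+ for `importlib.metadata` (the environment in the traceback is Python 3.7, where the `importlib_metadata` backport would be needed).

```python
import importlib
from importlib import metadata  # standard library from Python 3.8


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def try_import(module_name):
    """Try to import a module and return (ok, message) instead of raising."""
    try:
        importlib.import_module(module_name)
        return True, "ok"
    except Exception as exc:
        # A binary mismatch surfaces as ValueError, a missing package as ImportError.
        return False, f"{type(exc).__name__}: {exc}"


if __name__ == "__main__":
    print("numpy version:", installed_version("numpy"))
    print("pycocotools._mask:", try_import("pycocotools._mask"))
```

If the import still fails with the same `numpy.ufunc size changed` message after changing the numpy version, rebuilding pycocotools against the installed numpy is the other direction to try.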

Interpreting the pycocotools evaluation output.

cocodataset.org/#detection-…

Average Precision (AP):
  AP              % AP at IoU=.50:.05:.95 (primary challenge metric)
  AP (IoU=.50)    % AP at IoU=.50 (PASCAL VOC metric)
  AP (IoU=.75)    % AP at IoU=.75 (strict metric)

AP Across Scales:
  AP (small)      % AP for small objects: area < 32**2
  AP (medium)     % AP for medium objects: 32**2 < area < 96**2
  AP (large)      % AP for large objects: area > 96**2

Average Recall (AR):
  AR (max=1)      % AR given 1 detection per image
  AR (max=10)     % AR given 10 detections per image
  AR (max=100)    % AR given 100 detections per image

AR Across Scales:
  AR (small)      % AR for small objects: area < 32**2
  AR (medium)     % AR for medium objects: 32**2 < area < 96**2
  AR (large)      % AR for large objects: area > 96**2
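The twelve metrics above are printed in a fixed order, and pycocotools stores them in the same order in `COCOeval.stats`. As a rough sketch, the values can be labeled like this. The short names in `STAT_NAMES` and the helpers `label_stats` and `size_bucket` are my own, not pycocotools API, and assigning an area exactly on a boundary to the larger bucket is an assumption, since the definitions above use strict inequalities.

```python
# Short names for the 12 summary values, in the order they are printed.
STAT_NAMES = [
    "AP", "AP50", "AP75", "AP_small", "AP_medium", "AP_large",
    "AR_max1", "AR_max10", "AR_max100", "AR_small", "AR_medium", "AR_large",
]


def label_stats(stats):
    """Pair the 12 summary values with readable names."""
    if len(stats) != len(STAT_NAMES):
        raise ValueError(f"expected {len(STAT_NAMES)} values, got {len(stats)}")
    return dict(zip(STAT_NAMES, (float(s) for s in stats)))


def size_bucket(area):
    """Classify an object by pixel area using the 32**2 and 96**2 thresholds."""
    if area < 32 ** 2:
        return "small"
    if area < 96 ** 2:  # assumption: boundary values go to the larger bucket
        return "medium"
    return "large"


if __name__ == "__main__":
    # Values copied from the first evaluation log in these notes.
    stats = [0.384, 0.626, 0.420, 0.148, 0.291, 0.386,
             0.184, 0.436, 0.452, 0.183, 0.364, 0.449]
    for name, value in label_stats(stats).items():
        print(f"{name:10s} {value:.3f}")
```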
 
-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------

DETECTION_MIN_CONFIDENCE = 0.5
-----------------------------------------------------------------------------------
20200522T1052_0601.h5
100%|█████████████████████████████████████| 2346/2346 [3:12:10<00:00,  4.57s/it]

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.384
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.626
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.420
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.148
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.291
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.386
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.184
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.436
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.452
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.183
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.364
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.449
Prediction time: 3965.8749873638153. Average 1.6904837968302708/image
Total time:  11549.180015802383
-----------------------------------------------------------------------------------

20200522T1052_0656.h5
100%|█████████████████████████████████████| 2346/2346 [3:34:07<00:00,  5.06s/it]

 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.381
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.621
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.415
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.154
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.290
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.382
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.183
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.432
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.448
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.187
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.368
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.443
Prediction time: 4758.105183124542. Average 2.028177827418816/image

-----------------------------------------------------------------------------------
20200211T1100_0789.h5

100%|██████████| 2346/2346 [3:58:21<00:00,  6.23s/it]
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.288
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.494
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.302
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.132
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.215
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.288
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.144
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.351
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.364
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.149
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.284
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.361
Prediction time: 5515.610037565231. Average 2.351069922235819/image
Total time:  14321.094527959824

-----------------------------------------------------------------------------------
-----------------------------------------------------------------------------------

DETECTION_MIN_CONFIDENCE = 0
-----------------------------------------------------------------------------------
20200522T1052_0601.h5
100%|██████████| 2346/2346 [4:04:43<00:00,  5.23s/it]
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.397
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.653
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.430
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.156
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.302
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.399
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.191
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.454
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.471
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.192
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.381
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.468
Prediction time: 5827.201899766922. Average 2.4838882778205122/image
Total time:  14703.1005589962

-----------------------------------------------------------------------------------
20200211T1100_0789.h5

 100%|██████████| 2346/2346 [3:50:37<00:00,  5.99s/it]
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.303
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.313
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.146
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.226
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.304
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.155
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.376
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.389
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.168
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.304
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.387
Prediction time: 4823.234839677811. Average 2.0559398293596804/image

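Evaluate the previous model and the models from this training run, then pick the one to deploy. Selection criteria: train loss, val loss, AP, AP50, AP75 ---> visual inspection.
confidence = 0.5: checkpoints 601 and 656, plus 789 from the previous run.
confidence = 0: checkpoints 601 and 789. (From now on this setting will mainly be used to compare different network architectures, since it is the more uniform standard.)
Same model, different confidence thresholds (AP and the other metrics):
- confidence = 0 > confidence = 0.5
Same confidence, different models (is this run better than the previous model?):
- confidence = 0.5: 601 > 656 > 789 from the previous run.
Train loss was a more useful reference than val loss. Next time, just save the checkpoint with the best train loss.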
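Comparing runs by eye across several pasted logs is error-prone, so one option is to parse the summary lines and compute the differences. This is only a sketch: the regular expression is written against the line format shown in the logs above, and `parse_summary` and `diff_summaries` are hypothetical helper names.

```python
import re

# Matches lines such as:
#  Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.384
SUMMARY_RE = re.compile(
    r"Average (?:Precision|Recall)\s+\((AP|AR)\) @\[ IoU=([\d.:]+)\s*\| "
    r"area=\s*(\w+) \| maxDets=\s*(\d+) \] = (-?[\d.]+)"
)


def parse_summary(text):
    """Return {(metric, iou, area, max_dets): value} for every summary line."""
    results = {}
    for metric, iou, area, max_dets, value in SUMMARY_RE.findall(text):
        results[(metric, iou, area, int(max_dets))] = float(value)
    return results


def diff_summaries(baseline, candidate):
    """Per-metric difference (candidate - baseline) for keys present in both."""
    return {
        key: round(candidate[key] - baseline[key], 3)
        for key in baseline
        if key in candidate
    }


if __name__ == "__main__":
    # Two lines each from the 0601 checkpoint logs above.
    conf_05 = """
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.384
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.626
"""
    conf_0 = """
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.397
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.653
"""
    print(diff_summaries(parse_summary(conf_05), parse_summary(conf_0)))
```

A positive difference means the candidate run scored higher on that metric.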

Error when exporting the model

~/anaconda3/envs/dl/lib/python3.6/site-packages/keras/engine/saving.py in load_weights_from_hdf5_group_by_name(f, layers, skip_mismatch, reshape)
   1147                                          ' has shape {}'.format(symbolic_shape) +
   1148                                          ', but the saved weight has shape ' +
-> 1149                                          str(weight_values[i].shape) + '.')
   1150                 else:
   1151                     weight_value_tuples.append((symbolic_weights[i],

ValueError: Layer #391 (named "mrcnn_bbox_fc"), weight <tf.Variable 'mrcnn_bbox_fc/kernel:0' shape=(1024, 48) dtype=float32_ref> has shape (1024, 48), but the saved weight has shape (1024, 68).

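Cause: the number of output classes was configured incorrectly. The configured model has 48/4 = 12 classes, while the saved weights were trained with 68/4 = 17 classes. Changing the class-count parameter to match fixes it.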
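The arithmetic behind that diagnosis can be written down as a small check. This sketch assumes, as the division by 4 above implies, that the `mrcnn_bbox_fc` layer emits four box-refinement values per class; `classes_from_bbox_fc` is a hypothetical helper name. In Mask R-CNN implementations where the class count includes the background class, the result includes background too.

```python
def classes_from_bbox_fc(kernel_shape):
    """Infer the class count from the kernel shape of the bbox regression layer.

    Assumes the layer outputs 4 values (one box refinement) per class.
    """
    units = kernel_shape[-1]
    if units % 4 != 0:
        raise ValueError(f"output size {units} is not a multiple of 4")
    return units // 4


if __name__ == "__main__":
    # Shapes taken from the ValueError message above.
    print("configured model:", classes_from_bbox_fc((1024, 48)), "classes")
    print("saved weights:   ", classes_from_bbox_fc((1024, 68)), "classes")
```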

detectron2 training interrupted

Symptom:

Traceback (most recent call last):
  File "train.py", line 46, in <module>
    trainer.train()
  File "/root/dxq/detectron2/detectron2/engine/defaults.py", line 401, in train
    super().train(self.start_iter, self.max_iter)
  File "/root/dxq/detectron2/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/root/dxq/detectron2/detectron2/engine/train_loop.py", line 209, in run_step
    data = next(self._data_loader_iter)
  File "/root/dxq/detectron2/detectron2/data/common.py", line 142, in __iter__
    for d in self.dataset:
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 345, in __next__
    data = self._next_data()
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 838, in _next_data
    return self._process_data(data)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 881, in _process_data
    data.reraise()
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/_utils.py", line 395, in reraise
    raise self.exc_type(msg)
OSError: Caught OSError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/worker.py", line 178, in _worker_loop
    data = fetcher.fetch(index)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/root/dxq/detectron2/detectron2/data/common.py", line 41, in __getitem__
    data = self._map_func(self._dataset[cur_idx])
  File "/root/dxq/detectron2/detectron2/utils/serialize.py", line 23, in __call__
    return self._obj(*args, **kwargs)
  File "/root/dxq/detectron2/detectron2/data/dataset_mapper.py", line 77, in __call__
    image = utils.read_image(dataset_dict["file_name"], format=self.img_format)
  File "/root/dxq/detectron2/detectron2/data/detection_utils.py", line 120, in read_image
    return convert_PIL_to_numpy(image, format)
  File "/root/dxq/detectron2/detectron2/data/detection_utils.py", line 57, in convert_PIL_to_numpy
    image = image.convert(conversion_format)
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/PIL/Image.py", line 860, in convert
    self.load()
  File "/root/anaconda3/envs/pytorch/lib/python3.6/site-packages/PIL/ImageFile.py", line 231, in load
    "(%d bytes not processed)" % len(b))
OSError: image file is truncated (4 bytes not processed)

Fix:

Adding the following lines to detection_utils.py is enough.

from PIL import ImageFile
ImageFile.LOAD_TRUNCATED_IMAGES = True
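The setting above makes PIL tolerate the damaged file rather than identifying it. To find which files are affected, one rough option is to scan the dataset for JPEGs that do not end with the end-of-image marker. This is a heuristic sketch using only the standard library, not a full decoder: it assumes the images are JPEGs, it can flag valid files that carry extra trailing bytes, and it can miss other kinds of corruption. The function names are hypothetical.

```python
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # start-of-image marker
JPEG_EOI = b"\xff\xd9"  # end-of-image marker


def looks_truncated_jpeg(path):
    """Heuristic: a complete JPEG starts with SOI and ends with EOI."""
    data = Path(path).read_bytes()
    # Ignore trailing zero padding, which some writers append.
    body = data.rstrip(b"\x00")
    return not (body.startswith(JPEG_SOI) and body.endswith(JPEG_EOI))


def find_suspect_jpegs(root):
    """Return sorted paths under root whose JPEG content looks incomplete."""
    extensions = {".jpg", ".jpeg"}
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions and looks_truncated_jpeg(p)
    )


if __name__ == "__main__":
    import sys

    for suspect in find_suspect_jpegs(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(suspect)
```

Files reported by the scan can then be re-downloaded or removed from the annotation file instead of being silently loaded with missing pixels.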

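Very poor detection for one class.

Symptom:

The other classes look reasonably well detected.
One particular class is detected very poorly; many instances are missed entirely.
The loss curve was already very flat.
Lowering the confidence threshold to 0 showed that this class's bboxes overlap very heavily.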
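To put a number on "overlap very heavily", one could compute pairwise IoU between the predicted boxes of that class and count the pairs above a threshold. The sketch below is illustrative only: it assumes boxes in (x1, y1, x2, y2) order, the 0.5 threshold is an arbitrary choice, and `box_iou` and `overlapping_pairs` are hypothetical helper names.

```python
from itertools import combinations


def box_area(box):
    """Area of a box given as (x1, y1, x2, y2); degenerate boxes have area 0."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def overlapping_pairs(boxes, threshold=0.5):
    """Index pairs (i, j) with i < j whose IoU is at least the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(boxes), 2)
        if box_iou(a, b) >= threshold
    ]


if __name__ == "__main__":
    # Made-up boxes purely for illustration.
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    print(overlapping_pairs(boxes))
```

Many high-IoU pairs within a single class suggest that the model is producing several low-confidence candidates for the same object rather than one confident detection.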

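Hypotheses:

First suspicion: training had not actually converged yet. The features of this class really are more complex and need more training.
Other classes in the data may be interfering.

What was tried:

Continue training, raising iterations from 30000 to 100000.
Remove some less important classes to rule out interference, then retrain.

Verification:

It turned out training simply had not gone far enough. Brute force works wonders: training longer fixed it. The odd part is that the loss had barely been changing, so why did training longer help?

Guess: even a loss difference on the order of 0.01 may have a large effect.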