Data Collection and Image Labeling

A collection of 2,227 images of cattle feet from two commercial dairy farms in the midwestern United States was labeled for M-stages of DD. GoPro Hero 5 Black cameras were used to collect MP4 video recordings of the interdigital space of cows’ hind feet. Additionally, traditional and cell phone cameras were used to obtain JPG images from other studies in Wisconsin; Manitoba, Canada; and Utrecht, the Netherlands. All images were collected between June 2018 and March 2019. A rich, diverse library of images was compiled from various settings and scenarios, including foot-level images of hooves lifted in trimming chutes and of cattle standing in the housing area, automated milking system, or rotary milking parlor. All images were collected with the camera facing the rear foot and a clear view of the interdigital space of the hoof. Images were scored for M-stages of DD by a trained investigator using the M-stage DD classification system. Overall, the library of images includes 1,177 M0 images (936 JPG images via MP4 video recordings and 241 JPG images via camera stills) and 1,050 M2 images (660 screenshot JPG images via MP4 video recordings and 390 JPG images via camera stills).

The images were renamed and corresponding annotations were generated in Python. Labels were created as Pascal VOC-format XML files for the Faster and Cascade R-CNN models and converted to the TXT files required by the YOLO models. The images were split into a training set (90%) and a testing set (10%).

The YOLO models were trained to differentiate between M0 and M2 lesions using the Darknet framework in Google Colab on a 12GB NVIDIA Tesla K80 GPU, with a batch size of 64, a subdivision size of 16, a network input size of 416 × 416 pixels, and a learning rate of 0.001 for a maximum of 4,000 batches. An ignore threshold of 0.70 was used for object detection: a detection was predicted when its class probability was greater than or equal to the threshold.

The YOLOv4 network uses CSPDarknet53 as its backbone, a spatial pyramid pooling module, a PANet path-aggregation neck, and the YOLOv3 head. The YOLOv5 network keeps the same architecture as YOLOv4 but is implemented in PyTorch rather than C++. Faster R-CNN and Cascade R-CNN use ResNet-50 backbones. Faster R-CNN improves on Fast R-CNN by using a Region Proposal Network to generate high-quality region proposals, which Fast R-CNN then uses for detection. Cascade R-CNN extends Faster R-CNN with a sequence of detectors trained stage by stage, each leveraging the output of one detector to train the next.

Accuracy, precision-recall, and mean average precision (mAP) at an intersection over union (IOU) threshold of 0.5 were used as performance measures to compare the predictions made by the computer vision models against those of a trained investigator (ground truth).
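The stated training hyperparameters correspond to a Darknet `[net]` configuration section along these lines. This is a sketch from the reported values, not the authors' actual .cfg file, which would also contain layer definitions and other defaults.

```ini
[net]
batch=64
subdivisions=16
width=416
height=416
learning_rate=0.001
max_batches=4000
```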
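The VOC-to-YOLO label conversion described above can be sketched as follows. This is a minimal illustration, not the study's actual script; the class ordering (`M0` = 0, `M2` = 1) is an assumption.

```python
# Sketch of converting one Pascal VOC XML annotation to YOLO TXT lines.
# The class list and its ordering are assumptions, not from the study.
import xml.etree.ElementTree as ET

CLASSES = ["M0", "M2"]  # assumed class index order

def voc_to_yolo(xml_path):
    """Return YOLO-format lines: 'class x_center y_center width height',
    with coordinates normalized to the image size."""
    root = ET.parse(xml_path).getroot()
    w = float(root.find("size/width").text)
    h = float(root.find("size/height").text)
    lines = []
    for obj in root.iter("object"):
        cls = CLASSES.index(obj.find("name").text)
        box = obj.find("bndbox")
        xmin, ymin, xmax, ymax = (float(box.find(tag).text)
                                  for tag in ("xmin", "ymin", "xmax", "ymax"))
        xc = (xmin + xmax) / 2 / w   # normalized box center x
        yc = (ymin + ymax) / 2 / h   # normalized box center y
        bw = (xmax - xmin) / w       # normalized box width
        bh = (ymax - ymin) / h       # normalized box height
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    return lines
```

Each output line is then written to a TXT file with the same base name as its image, which is the layout Darknet expects.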
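The 90/10 train/test split could be produced along these lines; the shuffling, seed, and file names here are illustrative assumptions rather than details from the study.

```python
# Hedged sketch of a 90/10 random split of the image list.
# The seed and shuffling strategy are assumptions for illustration.
import random

def split_dataset(image_paths, train_frac=0.9, seed=42):
    """Shuffle the paths and return (train, test) lists."""
    paths = list(image_paths)
    rng = random.Random(seed)
    rng.shuffle(paths)
    cut = int(len(paths) * train_frac)
    return paths[:cut], paths[cut:]
```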
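The IOU-based matching that underlies the mAP@0.5 metric can be sketched as below: a prediction counts as a true positive when its overlap with a ground-truth box reaches the 0.5 threshold. Box representation as `(xmin, ymin, xmax, ymax)` tuples is an assumption for illustration.

```python
# Minimal sketch of the IoU criterion used to match predicted boxes to
# ground truth at the 0.5 threshold; boxes are (xmin, ymin, xmax, ymax).
def iou(a, b):
    """Intersection over union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # intersection width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # intersection height
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def is_true_positive(pred, gt, threshold=0.5):
    """A detection is correct when IoU with ground truth >= threshold."""
    return iou(pred, gt) >= threshold
```

mAP@0.5 then averages, over the classes (here M0 and M2), the area under the precision-recall curve computed from these true/false-positive decisions.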