Deep learning-based methods for individual recognition in small birds

Individual identification is a crucial step in answering many questions in evolutionary biology and is mostly performed by marking animals with tags. Such methods are well established but often make data collection and analysis time consuming, and consequently are not suited to collecting very large datasets. Recent technological and analytical advances, such as deep learning, can help overcome these limitations by automating data collection and analysis. Currently, one of the bottlenecks preventing the application of deep learning to individual identification is the need for the hundreds to thousands of labelled pictures required to train convolutional neural networks (CNNs). Here, we describe procedures that improve data collection and allow individual identification in captive and wild birds, and we apply them to three small bird species: the sociable weaver Philetairus socius, the great tit Parus major and the zebra finch Taeniopygia guttata. First, we present an automated method that allows the collection of large samples of individually labelled images. Second, we describe how to train a CNN to identify individuals. Third, we illustrate the general applicability of CNNs for individual identification in animal studies by showing that the trained CNNs can predict the identity of birds from images collected in contexts that differ from the ones originally used for training. Fourth, we present a potential solution to the issue of new individuals joining the population. Overall, our work demonstrates the feasibility of applying state-of-the-art deep learning tools to the individual identification of birds, both in the lab and in the wild. These techniques are made possible by our approaches, which allow efficient collection of training data. The ability to identify individual birds without requiring external markers that can be visually identified by human observers represents a major advance over current methods.

[...] the application of deep learning to smaller taxa, and specifically birds, remains unexplored.
In birds, manual examination of pictures or video recordings of visually marked populations (e.g. using colour rings) is a well-established method. However, relying on humans for individual identification and data collection is time consuming (Weinstein, 2018). In many cases, recently developed animal-tracking devices (e.g. GPS) and sensor technologies (e.g. RFID) can be used (reviewed in Krause et al., 2013 [...] are not suitable to be fitted to many animals, especially in the wild. Deep learning has the potential to overcome some of the limitations of the current automated methods, as it can identify individuals by relying only on their natural variation in appearance and can be tolerant to spurious variation in the recording conditions.

A major challenge for the application of individual recognition using deep learning methods is the need to collect extensive training data. Acquiring training data typically involves labelling images with the location and/or identity (or an attribute) of each individual. The amount of data required to train a CNN is expected to increase with the difficulty of the classification challenge, i.e. a bear and a bird would be easier to differentiate than two bears of the same species. Usually, CNNs that achieve large generalization capability are trained on thousands to millions of pictures (Marcus, 2018). Such large datasets are required because CNNs have to generalize from the specific data that they have been exposed to during training. For example, if a CNN was trained to distinguish two [...] individuals may join the population (e.g. immigrants or recruited offspring). These cases require that the process of identifying individuals and labelling photos is routinely repeated.
Therefore, relying on human observers for collecting labelled data in this type of system might hinder the implementation of deep learning techniques for individual identification, or restrict its application to short-term projects.

Here, we provide guidance on how training data can be efficiently collected, both in captivity and in the wild, and on the subsequent steps required to train a CNN for individual identification. We demonstrate the feasibility of our approaches using data from two wild pit-tagged populations of birds from two different species, the sociable weaver Philetairus socius and the great tit Parus major, and a population of captive zebra finches Taeniopygia guttata.
We start by 1) focusing on the problem of efficiently collecting large training datasets. We provide simple and automated methods for collecting a very large number of labelled pictures by using RFID tags associated with camera traps (in the wild sociable weaver and great tit populations) or by temporarily isolating the target individuals (in captive zebra finches). In all cases, we used low-cost RFIDs and low-cost cameras that can be programmed to take labelled pictures of the birds' back feathers. 2) We provide details of the data pre- [...] Möggingen, southern Germany. For both species, birds were fitted with pit-tags as nestlings, or when trapped in mist-nets as adults, and are habituated to artificial feeders that are fitted with RFID antennas, as part of two independent on-going studies in these populations. For the zebra finches, pictures were collected from a captive population housed in Möggingen, southern Germany. Birds were kept in indoor cages in pairs and small flocks.

Collecting training data:
Sociable weavers:
The collection of labelled pictures was automated by combining RFID technology (Priority1Design, Australia) with single-board computers (Raspberry Pi), cameras and artificial feeders. We fitted RFID antennas to small perches placed in front of plastic feeders filled with mixed seeds (Fig. 1a). Each RFID data logger was connected to a Raspberry Pi (a detailed explanation of the setup is available at github.com/AndreCFerreira/Bird_individualID), which was connected to a Pi camera (we used Pi camera V1 5 MP and V2 8 MP). We programmed the Raspberry Pi to take a picture every time a bird was detected by the RFID logger, with a 2-second gap between pictures. This interval was introduced to avoid near-identical frames of the same bird, which would increase overfitting of the CNN and jeopardize the generalization capability of the models (see "Convolutional neural networks" section). The Raspberry Pi was programmed to take pictures with different shutter speeds to account for variation in light conditions over the day. Each picture file was automatically labelled with the bird identity, known from the RFID logger, and the time of shooting in the filename. Training data collection is therefore automated by linking the identity of the bird perching on the antenna while feeding to its pictures, without the need for manual identification and annotation. Three Pi cameras and three feeders, placed ca. two meters apart from each other, were used. The cameras were positioned to take pictures from a top perspective in order to photograph both the scaled pattern of the back and the wing feathers (Fig. 1b). The birds' back was chosen as the distinctive mark since it is the body part that is most easily observed and recorded in multiple contexts (e.g. when perching at the feeders or building at the nest), [...]

Figure 1 (caption fragment): [...] Picture taken by the Pi camera of a great tit perching at the RFID antenna on a feeder and d) of a male zebra finch taken from inside the cage.
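The capture rule described above (one picture per RFID detection, at least 2 seconds apart, with identity and shooting time written into the file name) can be sketched as follows. This is a minimal illustration and not the authors' code, which is available in the repository cited above. The tag strings, the shutter-speed values, the file-name layout and the idea of cycling through shutter speeds are all assumptions made for the example, and the hardware calls (RFID reader, Pi camera) are deliberately left out.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(seconds=2)  # gap between pictures stated in the text


def make_filename(bird_id, when, shutter_us):
    """Build a file name carrying the RFID identity and the shooting time.

    The layout <id>_<date>_<time>_ss<shutter>.jpg is hypothetical.
    """
    stamp = when.strftime("%Y%m%d_%H%M%S")
    return f"{bird_id}_{stamp}_ss{shutter_us}.jpg"


def select_captures(detections, shutter_speeds_us=(2000, 4000, 8000)):
    """Return file names for the detections that would trigger a picture.

    detections: iterable of (bird_id, datetime) pairs in time order.
    A detection is skipped when it falls less than MIN_GAP after the
    previous picture. Shutter speeds (hypothetical values, microseconds)
    are cycled so that successive pictures differ in exposure.
    """
    names = []
    last_shot = None
    for bird_id, when in detections:
        if last_shot is not None and when - last_shot < MIN_GAP:
            continue  # too soon after the previous picture
        shutter = shutter_speeds_us[len(names) % len(shutter_speeds_us)]
        names.append(make_filename(bird_id, when, shutter))
        last_shot = when
    return names


if __name__ == "__main__":
    t0 = datetime(2019, 8, 20, 9, 0, 0)
    reads = [
        ("TAG_A", t0),
        ("TAG_A", t0 + timedelta(seconds=1)),  # skipped: within 2 s
        ("TAG_A", t0 + timedelta(seconds=2)),  # kept
        ("TAG_B", t0 + timedelta(seconds=5)),  # kept
    ]
    for name in select_captures(reads):
        print(name)
```

Because the label lives in the file name, a later training script can recover the class of each picture by splitting the name on the first underscore, without any separate annotation file.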

Great tits:
We collected pictures of the individuals using a setup similar to the one described above, by placing an RFID antenna at an artificial feeder hanging on a tree branch (Fig. 1c). We used a single Pi camera and one feeder to collect pictures during seven days over the course of the last two weeks of August 2019.

Zebra finches:
We temporarily divided aviaries into equally-sized partitions with a net to take pictures of individual birds without completely socially isolating them. We collected data from 10 zebra finches (five males and five females). In each partition, we placed two Raspberry Pi cameras to photograph (every two seconds) the birds sitting on the wooden perches (Fig. 1d). Each bird was recorded for four hours. Since we knew which Raspberry Pi photographed which bird, we avoided the need to manually link the identity of the birds to the pictures. [...]

[...] in which the region corresponding to the bird was manually delimited using "VGG Image [...] interest is time consuming, we started by training the model for 10 epochs with 200 pictures. If the model was found to perform badly, additional pictures were manually labelled and added to the training dataset. This process was repeated until a satisfactory performance was achieved. For the great tits, 500 pictures were used for training and 125 for validation (see "Convolutional neural networks" section below for an explanation of training and validation datasets); for the zebra finches, we used 400 pictures for training and 100 for validation. [...]

[...] were replaced by layers with random weights that fit our particular task of interest and the corresponding number of classes (30 individuals).
To further increase our training sample, we used data augmentation, which consists of artificially increasing the sample size by applying transformations to the existing sample. Using the data generator available in Keras, images were randomly rotated (from 0 to 40º) and zoomed (zoom range of 0.2). One dropout layer (rate 0.5) was added just before the first dense layer to limit overfitting (see github.com/AndreCFerreira/Bird_individualID for details on the network architecture). We used a softmax activation function for the classifier. The ADAM optimizer (Kingma & Ba, 2014) was used with a learning rate of 1e-5. A batch size of eight was used since it has been shown that small batch sizes improve models' generalization capability (Masters & Luschi, 2018). If there was no decrease in loss for more than 10 consecutive epochs we stopped training, and then retrained the model that achieved the lowest loss with an SGD optimizer and a learning rate 10 times smaller, until there was no further decrease in the loss for more than 10 consecutive epochs. All analyses were conducted with Python 3.7 using Keras and TensorFlow 1.9, on an NVIDIA RTX 2070 GPU.

Even though our model achieved ca. 90% accuracy on the validation dataset, exploratory tests showed that the accuracy was significantly lower when generalizing to other contexts (see results). We suspected that such differences could be due to the lower quality of pictures collected in those other contexts (with different cameras, capture distances and conditions; see "Testing models" section). To account for this possibility we trained a model using the same settings that yielded the best results, but applying Gaussian blur, [...]

For the great tits we trained the CNN with 1000 pictures per bird, 900 pictures for training and 100 for validation.
For birds with fewer than 1000 pictures (six birds) we oversampled by creating copies of the available pictures, following the same procedure as for the sociable weavers. We used 7605 unique pictures, 760.50 ± 222.56 (mean ± SD) per bird. Pictures in the validation dataset were also taken on different days from the pictures used for training.
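The two data-preparation steps mentioned here, keeping validation pictures from different days and padding under-represented birds with copies, can be sketched in a few lines. This is only an illustration under stated assumptions: the text says copies of the available pictures were created but does not specify how they were chosen, so drawing them at random with replacement is an assumption, as are the function names, the `(path, day)` representation and the choice to oversample only the training portion.

```python
import random


def split_by_day(pictures, validation_days):
    """Split one bird's pictures so training and validation never share a day.

    pictures: list of (path, day) pairs; validation_days: set of day labels.
    """
    train = [path for path, day in pictures if day not in validation_days]
    val = [path for path, day in pictures if day in validation_days]
    return train, val


def oversample(paths, target, rng):
    """Return exactly `target` items for one bird.

    Birds with too few pictures are padded with randomly chosen copies
    (assumption: sampling with replacement); birds with too many are
    randomly subsampled.
    """
    if not paths:
        raise ValueError("cannot oversample an empty list")
    if len(paths) >= target:
        return rng.sample(paths, target)
    extra = rng.choices(paths, k=target - len(paths))
    return list(paths) + extra


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical bird with 700 pictures taken over two days.
    pics = [(f"bird01_{i:04d}.jpg", "day1" if i < 600 else "day2")
            for i in range(700)]
    train, val = split_by_day(pics, {"day2"})
    train = oversample(train, 900, rng)
    print(len(train), len(val))
```

Copies add no new information by themselves; in the workflow described in the text they are subsequently varied by the data augmentation step.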
The same architecture and hyperparameters as for the sociable weavers were used, except that the dropout value was reduced to 0.2, as the model's accuracy did not improve beyond a random guess for 10 epochs when the dropout was at its initial value of 0.5. In addition to the zooming and rotation data transformations, horizontal and vertical flips were also used, as the great tits, contrary to the sociable weavers, could be photographed from any orientation (as they perched all around the RFID antenna). Blur and noise transformations were not used as there were no differences in the overall quality of the pictures used for training and for testing the model generalization capability (see "Testing models" section).
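The stopping rule reused here from the sociable weaver training (stop once the loss has not decreased for more than 10 consecutive epochs, then continue from the best model with a learning rate 10 times smaller) can be written down independently of any deep learning library. The sketch below is an illustration, not the authors' implementation: the text does not state whether the training or the validation loss was monitored, so the function simply takes a list of per-epoch loss values, and the function names are hypothetical.

```python
def stopping_epoch(losses, patience=10):
    """Return (stop_epoch, best_epoch) as 0-based epoch indices.

    Training stops at the first epoch at which the loss has failed to
    drop below its best value for more than `patience` consecutive
    epochs. If that never happens, stop_epoch is the last epoch.
    best_epoch is the epoch whose model would be kept.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch > patience:
            return epoch, best_epoch
    return len(losses) - 1, best_epoch


def two_stage_learning_rates(initial_lr=1e-5, factor=10.0):
    """Learning rates for the first (ADAM) and second (SGD) stage."""
    return initial_lr, initial_lr / factor


if __name__ == "__main__":
    # Hypothetical loss curve: improves for three epochs, then plateaus.
    history = [1.0, 0.8, 0.6] + [0.7] * 20
    stop, best = stopping_epoch(history)
    print("stop at epoch", stop, "keep model from epoch", best)
    print("learning rates:", two_stage_learning_rates())
```

In Keras the same behaviour is usually obtained with an early-stopping callback rather than a hand-written loop; the plain function is used here only to make the rule explicit.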

Zebra finches:
There were more pictures available per bird for the zebra finches than for the other species. [...] Finally, the CNN was trained using the same procedures as for the great tits except that the dropout layer was set to 0.5 rather than 0.2.

Testing models:

Sociable weavers:
To test the efficiency of our models, we collected images of the sociable weavers from different viewing perspectives, using different cameras and in different contexts than the original feeding station setup. The aim was to evaluate the ability of our trained CNN to identify individuals in different experiments and contexts.

We used four different setups for testing. We filmed birds feeding at the same plastic RFID feeders, but recorded using a Sony handycam (rather than a Raspberry Pi camera), from two different perspectives: 1) close (95 pictures from 26 birds, 3.65 ± 0.68 (mean ± SD); Fig. 4a) and 2) far (71 pictures from 21 birds, 3.43 ± 0.58; Fig. 4b). In addition, a round plastic feeder with seeds was positioned on the floor to record from both 3) a ground perspective (90 pictures from 28 birds, 3.21 ± 1.21; Fig. 4c) and 4) a top perspective (83 pictures from 25 birds, 3.32 ± 1.01; Fig. 4d).

The birds were manually cropped out from pictures using ImageJ (Schneider, Rasband & Eliceiri, 2012) and individually identified using their colour rings. The colour rings were then erased directly from the image to guarantee that the model did not use them for identification. Videos were recorded within the same time window as the training picture collection, and we aimed at extracting five non-identical frames per bird in which the back was fully visible; however, this was not possible for all birds, as not all of them were recorded in these testing videos, or they were not recorded for long enough.

Great tits:
We recorded birds feeding on a table from a top perspective with a Raspberry Pi camera (Fig. 5).
Since these birds had no colour rings or any other mark for visual identification, we identified them using their pit-tags by placing seeds on top of an RFID antenna in order to induce the birds to activate the RFID antenna and obtain the identity of the birds feeding (similar to the pictures collected for training described above). Birds were recorded feeding on the table for 3 days, but 4 out of the 10 birds in the training dataset did not use this new feeding spot. Additionally, the number of pictures collected at this setup varied greatly between birds (from 2 to 38 pictures, mean ± SD: 15.7 ± 11.3). We did not attempt to make a [...]

For the zebra finches we did not have a second setup that differed from the one used to collect the pictures to train a CNN and that could be used for testing the CNN generalization. Therefore, we ran an additional trial which consisted of recording the birds together to see how well the model would predict the identity of the birds when they are in small groups, interacting with each other (Fig. 6). Since these birds did not have any visual tags and it was not possible to distinguish them when in a group, we used one flock of three birds and another flock of two birds for each sex to estimate the model's accuracy by calculating the number of times that the CNN wrongly attributed the identity of a bird to an individual that was not actually present in that flock. To avoid near-identical pictures, we used the same procedure as for the validation dataset to select 160 pictures from each trial. [...]

The model was able to achieve an accuracy of 92.4% (Table 1) after training for 21 epochs. When the model was used to predict identity in the four other contexts, the accuracy in the top perspective context was lower (67.5%).
After adding blur and noise to the training images, the model achieved a validation accuracy of 90.3%, while successfully increasing the accuracy from the top perspective to 91.6% (Table 1).

Zebra finches:
The model reached 87.0% accuracy after training for 11 epochs, with similar accuracies for males and females (85% for males, 88.9% for females). When using the trained model to predict the identity of the birds when they were in small groups, the model correctly predicted the identity of a bird present in that group 93.6% of the time.

New birds:

The entropy of the softmax outputs (i.e. probabilities) was smaller when predicting the identity of birds present in the training dataset than when predicting the identity of new birds (Fig. 7). This is because, when predicting the identity of a bird from the training dataset, there is usually one probability that stands out as very high (indicating the bird's identity) while the remaining probabilities are very low (other birds' identities). In [...]

Furthermore, we found high generalization capacities of the trained CNNs, meaning that the rate of successful identification remained high in various contexts. This is particularly relevant as researchers often need to collect data in contexts that may be challenging, from parental behaviour at the nest to dominance interactions at artificial feeders. However, we also show that the models' performance can decrease when new individuals join the population, especially when new individuals are common.

The first critical step when attempting to implement deep learning is to guarantee that enough training data can be collected to train a model. In this study, for the two wild populations, we showed that we can rely on RFID technology to gather large amounts of automatically labelled data. Since this technology has been increasingly used on birds, we [...] needed. However, while these datasets are not available, the automation of training data collection is an immediate and effective solution, i.e. it is possible to continuously collect training pictures and routinely re-train the CNNs using the updated dataset.

The arrival of new individuals in the study population is another challenge that needs to be carefully addressed. If these new birds are marked with a pit-tag, the CNN could be updated similarly to the problem of changes in appearance discussed above.
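For birds that carry no tag, the entropy difference reported in the results (Fig. 7) suggests one possible screening rule: compute the Shannon entropy of each softmax output and flag predictions whose entropy is unusually high. The sketch below illustrates only the calculation. The threshold value and the function names are hypothetical, and the text does not state that a fixed cut-off was used, so this should be read as one way the idea could be applied rather than as the method of the study.

```python
import math


def entropy(probabilities):
    """Shannon entropy (in nats) of a softmax output vector.

    Zero probabilities contribute nothing, following the usual
    convention that 0 * log(0) is taken as 0.
    """
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def looks_new(probabilities, threshold):
    """Flag a prediction as a possible new bird.

    threshold is a hypothetical cut-off that would have to be chosen
    from data, e.g. from the entropies observed for known birds.
    """
    return entropy(probabilities) > threshold


if __name__ == "__main__":
    known_bird = [0.97, 0.01, 0.01, 0.01]  # one identity stands out
    unknown_bird = [0.25, 0.25, 0.25, 0.25]  # no identity stands out
    print(entropy(known_bird), entropy(unknown_bird))
    print(looks_new(known_bird, 1.0), looks_new(unknown_bird, 1.0))
```

A peaked output gives an entropy close to zero, whereas a flat output over n classes gives the maximum value, log(n), which is why low-confidence predictions on unfamiliar birds tend to score higher.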
If the new individuals are not marked and cannot be captured the problem fits in the anomaly ( [...] are able to cope with the challenges presented here, among others.

Having large datasets will also allow optimizing CNN performance. Other network architectures (e.g. ResNet; He, Zhang, Ren & Sun, 2016) and different hyper-parameter settings (e.g. learning rate) than the ones used here can yield different, and potentially improved, results. There are also other pre-processing steps that can greatly improve model training and reduce the number of images needed, such as image alignment (e.g. [...]

[...] to work at Benfontein Reserve. We also thank Gustavo Alarcón-Nieto and Adriana Maldonado-Chaparro for the assistance with the material needed to collect pictures of the great tits and the zebra finches. Data collection for the sociable weavers was supported by funding from the FitzPatrick Institute of African Ornithology (DST-NRF Centre of