Thinking like a naturalist: Enhancing computer vision of citizen science images by harnessing contextual data

The accurate identification of species in images submitted by citizen scientists is currently a bottleneck for many data uses. Machine learning tools offer the potential to provide rapid, objective and scalable species identification for the benefit of many aspects of ecological science. Currently, most approaches only make use of image pixel data for classification. However, an experienced naturalist would also use a wide variety of contextual information such as the location and date of recording. Here, we examine the automated identification of ladybird (Coccinellidae) records from the British Isles submitted to the UK Ladybird Survey, a volunteer-led mass participation recording scheme. Each image is associated with metadata: a date, location and recorder ID, which can be cross-referenced with other data sources to determine local weather at the time of recording, habitat types and the experience of the observer. We built multi-input neural network models that synthesize metadata and images to identify records to species level. We show that machine learning models can effectively harness contextual information to improve the interpretation of images. Against an image-only baseline of 48.2%, we observe a 9.1 percentage-point improvement in top-1 accuracy with a multi-input model, compared to only a 3.6 percentage-point increase when using an ensemble of image and metadata models. This suggests that contextual data are being used to interpret an image, beyond just providing a prior expectation. We show that our neural network models appear to be utilizing similar pieces of evidence as human naturalists to make identifications. Metadata is a key tool for human naturalists. We show it can also be harnessed by computer vision systems. Contextualization offers considerable extra information, particularly for challenging species, even within small and relatively homogeneous areas such as the British Isles.
Although complex relationships between disparate sources of information can be profitably interpreted by simple neural network architectures, there is likely considerable room for further progress. Contextualizing images has the potential to lead to a step change in the accuracy of automated identification tools, with considerable benefits for large‐scale verification of submitted records.


| INTRODUCTION
Large-scale and accurate biodiversity monitoring is a cornerstone of understanding ecosystems and human impacts upon them (IPBES, 2019). Recent advances in artificial intelligence have revolutionized the outlook for automated tools to provide rapid, scalable, objective and accurate species identification and enumeration (Torney et al., 2019; Weinstein, 2018; Willi et al., 2019). Improved accuracy levels could revolutionize the capacity of biodiversity monitoring (Isaac, Strien, August, Zeeuw, & Roy, 2014) and invasive non-native species surveillance programmes (August et al., 2015). Nonetheless, at present, general-purpose automated classification of animal species is still some distance from the level of accuracy obtained by human experts. Recent studies have achieved percentage classification accuracy ranging between the mid-60s and high 90s depending on the difficulty of the problem (Weinstein, 2018), and their potential remains underutilized (Christin, Hervet, & Lecomte, 2019).
The large data requirements and capacity of machine learning has led to a close association with citizen science projects, where volunteers contribute scientific data (Silvertown, 2009). Citizen scientists can accurately crowdsource identification of researcher-gathered images (e.g. Snapshot Serengeti; Swanson et al., 2015), generate records to be validated by experts (e.g. iRecord; Pocock, Roy, Preston, & Roy, 2015) or both simultaneously (e.g. iNaturalist; iNaturalist.org). However, there can be a considerable lag between record submission and human verification. If computer vision tools could generate more rapid, or even instantaneous, identifications it could assist with citizen scientist recruitment and retention. While image acquisition by researchers can be directly controlled and lead to high accuracies (Marques et al., 2018; Rzanny, Seeland, Wäldchen, & Mäder, 2017), images from citizen science projects are highly variable and pose considerable challenges for computer vision (Van Horn et al., 2017).
Most automatic species identification tools only make use of images (Weinstein, 2018). However, an experienced naturalist would utilize a wide variety of contextual information when making an identification. This is particularly the case when distinguishing 'difficult' species, where background information about the record may be essential for a confident identification. In a machine learning context, this supporting information about an image (metadata) can be split into two categories (Figure 1). Primary metadata is directly associated with a record, such as GPS coordinates, date of recording and the identity of the recorder. Derived (secondary) metadata is generated through cross-referencing with other sources of information to place this metadata into a more informative context (Tang, Paluri, Fei-Fei, Fergus, & Bourdev, 2015). In an ecological context, this may include weather records, maps of species distribution, climate or habitat, phenology records, recorder experience, or any other information source that could support an identification.

KEYWORDS
citizen science, computer vision, convolutional neural network, ladybird, machine learning, metadata, naturalists, species identification

FIGURE 1 Relationships between categories of metadata. Primary metadata are basic attributes of the record directly associated with an image such as the date or location. By contrast, derived (or secondary) metadata requires cross-reference to external databases, which may include physical, ecological or social data. External sources of information may be fixed and stable (such as habitat maps) or dynamic and require updating in order to keep the model up to date (such as weather records or recorder experience)

To date, metadata has principally been used to filter candidate species lists, for example to generate shortlists of German plants (Wittich, Seeland, Wäldchen, Rzanny, & Mäder, 2018). This approach can greatly reduce the risk of non-sensical identifications that otherwise lead to considerable scepticism over the use of automated methods (Gaston & O'Neill, 2004).
Nevertheless, this 'filtering' approach does not make full use of the available data, and it has recently been shown that the identification of plankton from images can be improved by incorporating sample metadata directly into a neural network (Ellen, Graff, & Ohman, 2019). Many species vary in appearance seasonally or across their range. For example, the proportion of the melanic form of the two-spot ladybird Adalia bipunctata varies greatly across the UK (Creed, 1966). Here, we examine the value of contextual metadata by: 1. Testing the extent to which incorporating metadata alongside images improves the accuracy of automated species identification; and 2. Exploring whether neural networks make use of the same pieces of metadata information that human experts do.

| Data
Records of ladybirds (Coccinellidae) were sourced from the UK Biological Records Centre (www.brc.ac.uk). These were filtered to include only those from within the British Isles, from 2013 to 2018 inclusive, that contained an image and had been verified by an expert assessor. Records were distributed across the whole of the British Isles, although records were more frequent near more heavily populated areas (Figure S1).

| Images
Records were manually scanned to remove the majority of images predominantly of eggs, larvae or pupae, 'contextual' images of habitat area, images including multiple species, and images that had been uploaded repeatedly. Larval and pupal images were overwhelmingly dominated by the highly distinctive Harlequin ladybird larvae or pupae (78%). Where a single record had multiple associated images, only the first was used.
Images were centre cropped to square and then rescaled to 299 × 299 pixels. Example images for each species are shown in Figure 2. After all data cleaning steps, the dataset had 39,877 records in total.
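The centre-crop step can be sketched as follows; this is a minimal pure-Python illustration (the study's code is in R, and in practice an image library would perform both the crop and the 299 × 299 rescale; the function name is ours):

```python
def centre_crop_square(pixels):
    """Crop a 2D pixel grid (list of rows) to a centred square,
    mirroring the preprocessing applied before rescaling to 299 x 299."""
    h, w = len(pixels), len(pixels[0])
    side = min(h, w)
    top = (h - side) // 2
    left = (w - side) // 2
    # Keep only the central square region of the image
    return [row[left:left + side] for row in pixels[top:top + side]]
```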

| Metadata
We constructed models that made use of different subsets of the available metadata. The first (the primary metadata model) took only three pieces of primary metadata, drawn directly from the UK Ladybird Survey dataset: longitude, latitude and date. We represented date by day-of-year, excluding year values since information on 'year' would not be transferable to future records. The second model (the derived metadata model) supplemented the primary metadata with secondary metadata: data generated with additional reference to external sources of information, namely weather records, habitat and recorder expertise. We did not use the original citizen scientist species determination in our models, since it was too powerful compared to other sources of information (correct over 92% of the time) and did not align with the goal of fully automated identification.
Temperature records were accessed from the MIDAS database (Met Office, 2012), selecting data from the 88 UK stations with fewer than 20 missing records (2013-2018). Occasional missing values were imputed with a polynomial spline. Using the closest weather station to the record, the maximum daily temperature for each day in the 14 preceding days (d-1:d-15) and the weekly average maximum temperature for each of the 8 preceding weeks were extracted.

TABLE 1 Average per-species top-1 accuracy across the suite of models. Citizen scientist accuracy is determined by the frequency with which the label assigned by the recorder corresponds to the verified species name. Equivalent tables for top-3 accuracy and for accuracy including a prior weighting based on relative frequency are given in Tables S2 and S3. The top performing model in each row is marked with an asterisk (*)

Local habitat information was derived from a 1 km resolution land cover map (Rowland et al., 2017). This provides percentages in each 1 km grid of 21 target habitat classes (e.g. 'urban', 'coniferous woodland', 'heather', etc.). Where no data was available, each habitat was assumed to be 0.
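The gap-filling of weather series can be illustrated with a simple polynomial fit (a stand-in sketch for the polynomial-spline imputation described above; the exact spline used in the study is not specified here, and the function name is ours):

```python
import numpy as np

def impute_missing(series, degree=3):
    """Fill NaN gaps in a daily temperature series by fitting a
    polynomial to the observed values and evaluating it at the gaps."""
    series = np.asarray(series, dtype=float)
    x = np.arange(len(series))
    obs = ~np.isnan(series)
    # Fit to the observed points only, then predict at missing points
    coeffs = np.polyfit(x[obs], series[obs], deg=degree)
    filled = series.copy()
    filled[~obs] = np.polyval(coeffs, x[~obs])
    return filled
```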
We calculated a 'recorder experience' variable as the cumulative count of records submitted by that recorder at the time of each record. Only records of ladybirds in our dataset were included in this count. Where no unique recorder ID was available, that record was assumed to be a first record.
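The recorder-experience variable can be computed in a single pass over date-sorted records, as in this illustrative Python sketch (function name ours):

```python
from collections import defaultdict

def recorder_experience(recorder_ids):
    """For records sorted by date, return the cumulative count of prior
    records by the same recorder: 0 for a first record, and 0 where no
    unique recorder ID is available, per the text."""
    counts = defaultdict(int)
    experience = []
    for rid in recorder_ids:
        if rid is None:  # no unique recorder ID: treat as a first record
            experience.append(0)
            continue
        experience.append(counts[rid])
        counts[rid] += 1
    return experience
```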
This led to a one-dimensional metadata vector of length 47 (day-of-year, latitude, longitude, 14 daily maximum temperature records, 8 weekly average temperature records, 21 habitat frequencies and recorder experience) associated with each image.
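As a minimal illustration, assembling this vector might look like the following sketch (in Python rather than the R used for the study; the function and argument names are ours):

```python
def build_metadata_vector(day_of_year, lat, lon, daily_max_temps,
                          weekly_avg_temps, habitat_fracs, experience):
    """Assemble the flat metadata vector described in the text:
    day-of-year, latitude, longitude, 14 daily maximum temperatures,
    8 weekly average temperatures, 21 habitat fractions and recorder
    experience, giving 47 values in total."""
    assert len(daily_max_temps) == 14
    assert len(weekly_avg_temps) == 8
    assert len(habitat_fracs) == 21
    vec = ([day_of_year, lat, lon] + list(daily_max_temps)
           + list(weekly_avg_temps) + list(habitat_fracs) + [experience])
    assert len(vec) == 47
    return vec
```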

| Machine learning model architecture
We built and fit convolutional neural network models (Goodfellow, Bengio, & Courville, 2016) in R 3.5.3 using the functional model framework of the keras package (Allaire & Chollet, 2019). We used the TensorFlow backend on a Nvidia GTX 1080 Ti GPU. R code used to train the models is available at github.com/jcdterry/LadybirdID_Public and the core model architecture code is summarized in Supporting Information. We first constructed and trained image-only and metadata-only models. Once these had separately attained maximum performance, these were then combined to form the core of a multi-input model that takes both an image and metadata as input variables. For all models, we conducted extensive hyperparameter searches to determine model architecture, extent of data-augmentation, regularization parameters, learning rates and training times.
A schematic of the model architectures is shown in Figure 3. The metadata models were built with a simple architecture of two dense layers (Figure 3a).
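The fusion step at the core of the combined model can be sketched in a few lines of numpy: the two branches are concatenated, then passed through dense and softmax layers. The layer sizes and random weights below are purely illustrative (1536 is the length of a global-max-pooled Inception-ResNet-v2 output); in the actual models the image branch is the retrained Inception-ResNet-v2 feature extractor and the weights are learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Stand-ins for the two input branches
image_features = rng.normal(size=1536)  # pooled image-branch output
metadata = rng.normal(size=47)          # the 47-value metadata vector

# Concatenate branches, then apply dense (ReLU) and softmax layers
combined = np.concatenate([image_features, metadata])
hidden = relu(rng.normal(size=(64, combined.size)) @ combined)
probs = softmax(rng.normal(size=(18, 64)) @ hidden)  # 18 species
```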

| Model training
Species records in the UK Ladybird Survey, like most biological record datasets (Van Horn et al., 2017), are highly skewed towards certain common species (Table 1). As predictive models are not perfect, such class-imbalanced data leads to critical choices about how to best assess 'accuracy'. Overall accuracy may be maximized by rarely or never assigning species to infrequent categories. A citizen scientist may prefer the maximum accuracy for the species in front of them (which is likely to be a commonly reported species).
However, in an ecological science context, rare (or more precisely, rarely reported) species are often of particular interest to researchers managing citizen science projects.
The total dataset was randomly partitioned into training (70%), validation (15%) and test (15%) sets. To address the class-imbalance, we followed the approach suggested by Buda, Maki, and Mazurowski (2018) and rebalanced our training set through up-sampling and down-sampling the available records. We did this so that each species had 2,000 effective training records. To ensure a consistent batch-size of 32, we removed records of the most common species where necessary. Consequently, our underlying models did not have direct access to the information that, all else being equal, certain species are far more likely than others. This reduces the potential for the model 'cheating' during training by fixating on common species and ignoring rare species. To demonstrate the potential to improve overall accuracy by taking into account the relative frequency of each species, we tested weighted versions of each of the models. In these, the relative probability assigned to each species from each unweighted model (P_i) was scaled by the relative frequency of each of the species (F_i) in the training data as: P_i^weighted ∝ P_i × F_i.
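This reweighting of model outputs by species frequency amounts to an elementwise product followed by renormalization; a minimal sketch (function name ours):

```python
def frequency_weight(probs, freqs):
    """Scale each species' model probability P_i by its training-set
    relative frequency F_i and renormalize, so that the weighted
    probability is proportional to P_i * F_i."""
    weighted = [p * f for p, f in zip(probs, freqs)]
    total = sum(weighted)
    return [w / total for w in weighted]
```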
To reduce overfitting, we made extensive use of image augmentation, weight regularization, batch normalization, dropout layers during training and introduced Gaussian noise on the metadata vector. Training optimization was based on a categorical cross-entropy loss function using the 'Adam' adaptive moment estimation optimizer (Kingma & Ba, 2014). During training, if validation loss had reached a plateau, learning rate was reduced automatically. Training was stopped (and the best model restored) if there had been no further improvement in validation loss over at least four epochs.
After fitting the derived metadata, image-only and combined models, a simple ensemble model taking a weighted average of the derived metadata and image-only model predictions was also constructed and tested. This could be considered equivalent to using the metadata to construct a prior expectation for the predictions of the image model: P_ensemble,i = ω P_metadata,i + (1 − ω) P_image,i, where the weighting (ω) between the metadata and image model probabilities was determined by optimizing the ensemble model top-1 accuracy on the validation set.

| Model testing and evaluation
Overall and species-level model performance was assessed in terms of top-1 (was the true ID rated most likely) and top-3 (was the true ID amongst the three options rated most highly) accuracy. Because model accuracy will be dependent on the split of data into testing and training sets, and because model optimization is a non-deterministic process, we repeated the entire model fitting process five times.

FIGURE 3 Outline schematic of the difference in model architectures between the single-input models that take either just metadata (a) or image (b) information, and the two multi-input models combining (c) or ensembling (d) both data sources. Dense layers are the principal component of neural networks, fitting linkages between every input and output node. All our dense layers incorporated a rectified linear unit (ReLU) nonlinear activation function. Inception-ResNet-v2 is a very deep feature extraction model incorporating many convolutional layers and originally trained to classify a diverse set of objects, which we refined by retraining on our ladybird dataset. The global max pooling stage summarizes the outputs of the image feature extractor for further computation by dense layers. Softmax layers output a vector that sums to one, which can be interpreted as probabilities of each potential category. Dropout, noise, batch normalization and other regularization features enacted only during training time are not shown here for simplicity. R code to build models using the keras R package (Allaire & Chollet, 2019) is given in Supporting Information, which also details further hyperparameters such as the size of each layer

| Role of metadata components
To examine the dependence of the model on each aspect of the metadata, we examined the decline in top-3 accuracy for each species when elements of metadata were randomized by reshuffling sets of values within the test set. We did this separately for the spatial coordinates, day-of-year, temperatures data, habitats data and recorder expertise.
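The randomization test can be sketched generically: shuffle one block of metadata values across the test set, re-predict, and measure the resulting drop in accuracy (an illustrative sketch with a single feature index rather than a block; names are ours):

```python
import random

def randomization_drop(records, labels, predict, feature_idx, seed=1):
    """Estimate a feature's importance by shuffling its values across
    the test set and measuring the resulting drop in accuracy."""
    def accuracy(recs):
        return sum(predict(r) == y for r, y in zip(recs, labels)) / len(recs)

    base = accuracy(records)
    rng = random.Random(seed)
    # Reshuffle this feature's values among records, keeping all else fixed
    column = [r[feature_idx] for r in records]
    rng.shuffle(column)
    shuffled = [r[:feature_idx] + [v] + r[feature_idx + 1:]
                for r, v in zip(records, column)]
    return base - accuracy(shuffled)
```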

| RESULTS
Across each of our training-test split realizations, combined multi-input models showed a marked and consistent improvement on both the image-only (+9.1 percentage points) and the ensemble models (+3.6 percentage points) (Figure 4). Species-level accuracies (averaged across the 5 split realizations) for each of the models are reported in Table 1. The overall accuracy of all models could be greatly improved by weighting the output probabilities by the prior expectation given the relative frequency of each species. For example, the average top-1 accuracy of the combined model rises from 57% to 69%. The model ranking in terms of overall accuracy was maintained (Table S2). However, these gains are made at the cost of very infrequently identifying rare species correctly. With a weighted model, the two most commonly observed species, Harlequin and Seven-spot ladybirds, are correctly identified 90% and 89% of the time respectively.
However, 12 infrequently observed species are correctly identified in less than 12% of cases.

The metadata model appeared to be making more use of the weekly temperature data (2-10 weeks before the record), where randomization caused an 8.1 percentage-point decrease in accuracy, than of the more proximate daily records for the preceding fortnight (a 5.4 percentage-point decrease). The remaining metadata components had smaller influences on overall top-3 accuracy: randomizing habitat data led to a 2.8 percentage-point decrease, while randomizing recorder experience led to a 3.1 percentage-point decrease.
These overall results are highly influenced by the dominant species (particularly the Harlequin ladybird) in the test set, masking variation in decline in accuracy on a per-species level (Table S1).
The apparent importance of each metadata component appears to align with ecological expectations. The five species with greatest decline in accuracy when habitat is randomized are all considered habitat specialists (Roy & Brown, 2018): Coccinella undecimpunctata (dunes), Anatis ocellata (conifers), Tytthaspis sedecimpunctata (grassland and dunes), Subcoccinella vigintiquattuorpunctata (grassland), and Aphidecta obliterata (conifers). Similarly, the randomization of location had the greatest effect on the localized species ( Figure S1).
The top three most affected were: Aphidecta obliterata (frequently reported in Scotland), Scymnus interruptus (South-East England) and Coccinella undecimpunctata (coastal). By contrast, Coccinella septempunctata, a widespread and generalist species, was poorly identified by the metadata model and showed a minimal response to randomization. The species affected most by the randomization of temperature was Propylea quattuordecimpunctata, known as the 'dormouse' ladybird (Roy & Brown, 2018, p. 112) because of its late emergence.
The randomization of recorder experience had the greatest impact on Scymnus interruptus (a 33.6 percentage-point decrease). This was the only 'inconspicuous' ladybird in our dataset, which inexperienced recorders may not even realize is a ladybird (see Figure 2g, right column). There was also a 5.5 percentage-point decrease in the identification of Harlequin ladybirds when recorder experience was randomized. Novice recorders are notably more likely to record Harlequin ladybirds than more experienced recorders. The first record submitted by a new recorder is a Harlequin ladybird 57.4% of the time, which rapidly declines to 38% by the 10th.

| DISCUSSION
The use of metadata within computer vision models considerably improves their reliability for species identification. This exciting finding has implications for biological recording, demonstrating the potential to use innovative approaches to assist in processing large occurrence datasets accrued through mass participation citizen science. Basic primary metadata is straightforward to incorporate within machine learning models and, since this information is already collected alongside the biological records, can be widely adopted.

| Interpretation of results
Evidence for how the combined multi-input model achieves its notable gain in accuracy can be derived from the change in the relative confidence assigned to the true classification when metadata is incorporated (Appendix 3).
While it is not possible to determine exactly what interpretations the artificial intelligence is making, we can discern plausible scenarios. In autumn, ladybirds select suitable overwintering sites and enter dormancy through the adverse months (Roy & Brown, 2018). Each species exhibits a specific preference in overwintering habitat. Harlequin ladybirds favour buildings, leading to a high proportion of autumn records of Harlequin ladybirds being submitted from inside homes as they move indoors to overwinter. Determining how deep learning models make decisions is complex (Goodfellow et al., 2016). Multiple interwoven contributing factors combine to produce a result, much akin to human decisions. The nature of metadata means much of the gain likely comes from ruling species out rather than positively identifying them, which makes the interpretation of 'accuracy' metrics even more challenging. Our randomization analysis to determine the features used by the metadata model can only be a rough guide to the basis of decisions. The randomization process will reflect the pre-existing imbalance of our dataset and will produce illogical combinations of metadata, such as hot temperatures during the winter, or coastal habitat within inland areas. Nonetheless, it does show evidence that the model operates along similar lines to expert identifiers. Where certain aspects of information are lost, this translates into inaccuracies for species for which that information is relevant. This is aligned with the results of Miao et al. (2018), who found that their image recognition tool for savanna mammals also used similar features to humans to identify species. Equally, for widespread and generalist species, metadata is not able to contribute to the accuracy. For instance, the identification of Seven-spot ladybird is essentially unchanged by the inclusion of metadata.
In theory, given enough records, a deep-learning model would be able to infer the information content of the cross-referenced database based only on primary metadata. For example, a neural network could learn to identify a set of location coordinates with a high likelihood of a given species, without knowing that those coordinates contained favoured habitat, simply because the species is frequently recorded at these locations in the training dataset. In this respect, the inclusion of derived metadata could be considered a feature extractor technique that interprets the primary metadata, rather than providing additional information. In practice, the level of data required to internally reconstruct a sufficient mapping purely from primary metadata would be very high, particularly when the features are at very high resolution (Tang et al., 2015). A core challenge for automated species identification is the long tail of species for which there are very sparse records (Van Horn et al., 2017), for which the advantage of including derived metadata is likely to be considerably larger than for frequently recorded species.

| Further improvements to model
The design and training of deep learning models is an art rather than an exact science (Chollet & Allaire, 2018). Standard image-only classifiers are well served by existing pretrained networks and high-level tools. In contrast, currently available support for multi-input models is comparatively lacking and requires direct specification of the model architecture, as well as data manipulation pipelines to combine disparate information sources. Fortunately, tools such as the keras R package (Allaire & Chollet, 2019) provide straightforward frameworks for multi-input models that are well within the reach of ecologists without a formal computational science background.
We have also shared our code (Supporting Information) to help others make use of this methodology.
We have demonstrated the improvement gained through the use of metadata. Further improvements in accuracy could likely be made by employing test-time augmentation, where multiple crops or rotations of an image are presented to the classifier, by ensembling multiple models, and by increasing the size of the dataset through supplementary images and historical records (Chollet & Allaire, 2018). Our approach to augmenting metadata (adding Gaussian noise to each element) was relatively basic, and more targeted approaches to generating additional synthetic training data (Chawla, Bowyer, Hall, & Kegelmeyer, 2002) could lead to better results.
The overall accuracy of a species classifier can be considerably enhanced by incorporating a prior likelihood of each species' relative frequency. Approaches that allow the model to directly learn the relative frequencies of the species could attain even higher overall accuracy. However, in contrast to improvements discussed in the previous paragraph, this would significantly reduce the accuracy for rarely observed species. A model that only learnt to accurately distinguish between Harlequin and Seven-spot ladybirds (that constitute the majority of records) could attain an accuracy of 70%, but this would be of limited applied use.
The challenge of species identification has in the past attracted computer scientists who can view species identification as an interesting example of large real-world labelled datasets (Weinstein, 2018). Open competitions such as the annual iNaturalist (Van Horn et al., 2017) and LifeCLEF competitions (Goëau, Bonnet, & Joly, 2017) have spurred considerable improvements in identification accuracy. Including metadata in these datasets (such as the PlantCLEF 2019 competition) could lead to considerable improvements. However, any release of metadata must consider the geoprivacy of citizen scientists and potential risk to endangered species. Due consideration of the appropriate resolution of location data, and the identifiability of individuals in any data publicly released is essential.

| Transferability of models including metadata
The inclusion of metadata in an automatic identification tool will influence its transferability to new contexts. With all machine learning approaches, any automatic identification process is only as good as the extent and scope of the training data used. A model that has been trained on the location of UK records would need to be retrained for use in continental Europe, whereas an image-only model could be expected to be at least somewhat useful in both contexts.
As such, a model trained on derived metadata such as habitat types or local weather may be more transferable than one trained on coordinates and specific dates. Understanding the domain a model will be applied to is essential. Transferability will be critical for expanding from well-studied areas (such as UK), to understudied areas where there is great potential for citizen science to fill gaps in knowledge (Pocock et al., 2018).
Transferability of models can be a challenge even within a region, since records generated through unstructured broad-based citizen science are distinctive from those generated by committed amateur recorders, structured citizen science projects or professional surveys (Boakes et al., 2016). Submitted records are the result of interactions between human behaviour and species ecology (Boakes et al., 2016). Highly visited sites may show an over-abundance of common species that are new to citizen scientists with relatively limited experience. In our dataset, uploaded records of ladybirds correlate strongly with the first appearance of species and news reports of invasive species (M. Logie & T. A. August, unpublished data). In comparison to ecological data, the inclusion of observer behaviour needs to be treated with particular care. While 'ecological' factors could be expected to transfer well between datasets, observer behaviour is likely to be considerably less transferable. Nevertheless, when working with citizen science data, including observer behaviour can provide additional information (Johnston, Fink, Hochachka, & Kelling, 2018). In our dataset, we could gain additional information at either end of the recorder experience spectrum: novice recorders were much more likely to record Harlequin ladybirds. More detailed metrics, such as observer range, recording frequency or previous identification accuracy, could further improve model accuracy.
Our choice of what contextual data to include was guided by our knowledge of variables that are likely to influence ladybirds in the British Isles. For more taxonomically diverse tools, it would be beneficial to use a wider range of derived metadata variables. This could include more diverse weather information, climate maps, and topography. We did not include species range maps (Roy, Brown, Frost, & Poland, 2011) in this study since most (>90%) records came from areas within the range of 15 out of the 18 focal species considered in this study. Binary species range maps cannot account for the relative frequency of species across a region, but this can be learnt by a deep learning network provided with location data of records. Although range maps could be informative within models with a wide spatial scope or for highly localized species, they are comparatively verbose to encode in deep learning networks. When using a model to identify large numbers of species, the intersection or otherwise of a record with each species range map may need to be encoded in a separate variable. This greatly increases the length of the metadata vector associated with each record, and it could become challenging for models to identify relevant information. Although deep learning networks can in principle ignore irrelevant data, presenting too much irrelevant information risks slowing the fitting procedure. Where accurate species range map data are available (and may impart additional information beyond that contained in the training set of records), an approach that combines machine learning with a range-map-based shortlist may be the most useful (Wittich et al., 2018).

| CONCLUSIONS
Identification of insects poses a considerable challenge for computer vision (Martineau et al., 2017). Insect diversity is extraordinarily large: as an example, there are over 6,000 ladybird species worldwide (Roy & Brown, 2018), most of which do not have accessible labelled images. For difficult challenges, such as species identification in the field, the optimal solutions will involve humans and artificial intelligence working in tandem (Trouille, Lintott, & Fortson, 2019). Our results demonstrate the potential for considerable improvement in the accuracy of automatic identification when incorporating contextual information directly within the model. This is also likely to apply to passive acoustic monitoring tools (Gibb, Browning, Glover-Kapfer, & Jones, 2019). Researchers building automatic identification methods will benefit from training models to place images in context, just as a human naturalist would, to best unlock the potential of artificial intelligence in ecology.

ACKNOWLEDGEMENTS
Our thanks to Mark Logie for assistance accessing the iRecord database. We thank two reviewers for their constructive comments.

DATA AVAILABILITY STATEMENT
R code used to build models and analyse results is available at https://github.com/jcdterry/LadybirdID_Public (https://doi.org/10.5281/zenodo.3530383) (Terry, 2019) and summarized in Supporting Information 2. Images, user IDs and location data used in this paper are not publicly archived due to image licensing and data protection constraints. Data can be accessed from the Biological Records Centre indicia database by searching for records in the family Coccinellidae, or with a common name that included '*adybird', that were not marked as rejected or dubious. For access to the indicia database contact brc@ceh.ac.uk.