AIDE: Accelerating image-based ecological surveys with interactive machine learning

1. Ecological surveys increasingly rely on large-scale image datasets, typically terabytes of imagery for a single survey. The ability to collect this volume of data allows surveys of unprecedented scale, at the cost of expansive volumes of photo-interpretation labour.
2. We present Annotation Interface for Data-driven Ecology (AIDE), an open-source web framework designed to alleviate the task of image annotation for ecological surveys. AIDE employs an easy-to-use and customisable labelling interface that supports multiple users, database storage and scalability to the cloud and/or multiple machines.
3. Moreover, AIDE closely integrates users and machine learning models into a feedback loop, where user-provided annotations are employed to re-train the model, and the latter is applied over unlabelled images to e.g. identify wildlife. These predictions are then presented to the users in an optimised order, according to a customisable active learning criterion. AIDE has a number of deep learning models built-in, but also accepts custom model implementations.
4. Annotation Interface for Data-driven Ecology has the potential to greatly accelerate annotation tasks for a wide range of researchers employing image data. AIDE is open-source and can be downloaded for free at https://github.com/microsoft/aerial_wildlife_detection.

To this end, software solutions have been proposed, such as Trapper (Bubnicki et al., 2016), Aardwolf (Krishnappa & Turner, 2014) and camtrapR (Niedballa et al., 2016). While these facilitate data management, they lack labelling assistance and require users to carry out all annotation work manually. On a different track, some interfaces were designed with an explicit focus on annotation, like VATIC (Vondrick et al., 2013), LabelImg, VGG Image Annotator (Dutta & Zisserman, 2019), VIOLA (Bondi et al., 2017), LabelMe (Russell et al., 2008) and commercial tools like LabelBox. A few of them have some form of simple annotation assistance; for example, both VATIC and VIOLA offer interpolation for video data to reduce the number of annotations required. However, more elaborate labelling assistance is often absent.
Recently, computer vision research has focused on automatically interpreting ecological imagery (Kellenberger et al., 2018; Norouzzadeh et al., 2018; Schneider et al., 2019; Tabak et al., 2019; Willi et al., 2019) through machine learning (ML) models, in particular convolutional neural networks (CNNs; LeCun et al., 2015). CNNs are a family of deep learning models designed for recognition tasks in images, such as image classification (Krizhevsky et al., 2012) or object detection (Lin, Goyal, et al., 2017), and have become the most widely used variant of ML models in computer vision. However, employing these models requires substantial programming effort, as well as a very large collection of labelled images for training. In ecological applications, data acquisition campaigns often result in large quantities of images, but no annotations, which prevents CNN training. Furthermore, although methodologies like pre-training and transfer learning exist that can reduce the required number of images and annotations (Kornblith et al., 2019), obtaining a model that can generalise across an entire image dataset still requires large amounts of annotated data from the target image campaign. This can be attributed to the visual heterogeneity of the objects of interest in an image, as well as of the images themselves: for example, objects (animals, plants, etc.) may exhibit viewpoint or pose variations, they may be of different sizes depending on their age and distance to the camera, or they might have different fur colours and patterns. Similarly, images may be taken with different camera models, at different resolutions, or during the day or at night. ML models need to be exposed to these variations by means of training data and labels in order to generalise and yield high-quality predictions throughout the full dataset. Such data may not be readily available for image labelling campaigns, which limits the usefulness of CNNs, unless the models can be included in the annotation process and incrementally trained on new annotations provided by the users.
In this work, we address both problems (the tedium of manual photo-interpretation and the constraints of ML models) by unifying them into one labelling framework, which we denote Annotation Interface for Data-driven Ecology (AIDE). AIDE is a web-based, open-source collaboration platform that integrates a versatile labelling tool and ML models for image annotation, without requiring users to write code. The incorporation of ML models into annotation platforms has been proposed before, e.g. by the camera trap image tool Timelapse (Greenberg et al., 2019). However, AIDE does so by means of a feedback loop, leveraging a heuristic known as active learning (AL; Settles, 2009). In AIDE, the ML model is repeatedly trained on the latest, user-provided annotations. Once training has finished, the model is used to obtain predictions on (yet) unlabelled images.
Critically, the images are further sorted by an AL criterion, which e.g. prioritises images that contain highly unconfident ML model predictions. The promise of using AL is that fewer annotated images are required to train an ML model for the task at hand. AIDE has a number of CNN-based ML models and AL criteria built-in, but also accepts custom, user-provided implementations. The result is a collaborative platform that (a) has the potential to greatly accelerate large-scale image annotation projects and (b) allows training ML models with potentially smaller amounts of training data. To the best of our knowledge, AIDE is the first open-source software suite that integrates ML models in an AL manner for image annotation.

| Overview
Annotation Interface for Data-driven Ecology is a web-based, collaborative annotation platform that includes humans and a prediction model in a loop, with both parties reinforcing each other for accelerated label retrieval. Figure 1 illustrates this loop and the key components of AIDE, including:
• Labelling interface, the primary access point for annotators and a window into the dataset to be annotated (Section 2.2).
• Database, the storage solution for annotations and metadata (Section 2.3).
• Integrated model training, which allows training an ML model on user-provided annotations and obtaining predictions on (yet) unlabelled images (Section 2.4).
• Active learning (AL) criterion, responsible for ordering the model predictions, e.g. to maximise model accuracy gain during retraining (Section 2.5).
By default, AIDE iterates this loop until the entire dataset has been annotated. The annotation process can also be terminated earlier, e.g. upon satisfactory prediction quality of the model. The following sections outline this loop and the individual components.
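To make the interplay of these components concrete, the loop can be summarised as in the following sketch. All names in it (the model object, the AL criterion, the function collecting user annotations) are hypothetical stand-ins rather than AIDE's actual API, and the real platform runs training and prediction as asynchronous jobs rather than a single sequential loop.

```python
# Illustrative sketch of the human-machine loop described above. All names
# (model, al_criterion, request_user_annotations) are hypothetical stand-ins,
# not AIDE's actual API.
def annotation_loop(unlabelled, labelled, model, al_criterion,
                    request_user_annotations, batch_size=16):
    while unlabelled:
        # 1. Re-train the model on the latest user-provided annotations.
        model.train(labelled)

        # 2. Predict on the images that have not been annotated yet.
        predictions = {img: model.predict(img) for img in unlabelled}

        # 3. Rank unlabelled images by the active learning criterion
        #    (higher score = more informative to annotate next).
        ranked = sorted(unlabelled,
                        key=lambda img: al_criterion(predictions[img]),
                        reverse=True)

        # 4. Show the highest-priority batch to annotators and store their labels.
        batch = ranked[:batch_size]
        labelled.update(request_user_annotations(batch, predictions))
        unlabelled = [img for img in unlabelled if img not in batch]
    return labelled
```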

| Labelling interface
The labelling interface (Figure 2) is written in JavaScript with the jQuery library and is accessible through any modern web browser. Since the main goal of AIDE is to obtain labels in the most efficient way, multi-step workflows, nested dialogues and pop-up messages have been avoided as much as possible.

| Annotation types
Annotation Interface for Data-driven Ecology supports a number of annotation types, namely image labels, points (with pixel coordinates), bounding boxes and segmentation maps (where every pixel gets assigned a label). The interface and tool set are automatically adjusted depending on the annotation type selected for a project.
AIDE has been designed to allow one type of annotation per project, rather than e.g. a fully customisable cascade of dialogues or annotation tags. This allows for a leaner annotation interface and more straightforward integration of the ML model (Section 2.4). Figure 3 illustrates examples of the interface set up for the four currently supported annotation types.
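To illustrate how the four types differ, an annotation of each kind could be represented roughly as follows. The field names are examples chosen for illustration only and do not correspond to AIDE's actual database schema.

```python
# Illustrative records for the four supported annotation types.
# Field names are examples only, not AIDE's actual database schema.
image_label  = {"image": "img_0001.jpg", "label": "zebra"}

point        = {"image": "img_0002.jpg", "label": "zebra",
                "x": 412, "y": 305}                          # pixel coordinates

bounding_box = {"image": "img_0003.jpg", "label": "elephant",
                "x": 120, "y": 88, "width": 64, "height": 42}

# Segmentation maps assign a label class to every pixel, e.g. as a 2D array
# of class indices with the same height and width as the image.
segmentation = {"image": "img_0004.jpg",
                "mask": [[0, 0, 1], [0, 1, 1], [0, 0, 0]]}   # toy 3x3 example
```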

| Annotating images
Users can create, modify and delete annotations; the precise interaction depends on the annotation type. Most of the labelling tools are assigned keyboard shortcuts, so that users can keep their focus on the images without having to look around to find the necessary tool. This also applies to the list of label classes, whose entries can be organised into hierarchical groups, collapsed and searched. For instance, the search field can be accessed through a keystroke; this way, users can keep the mouse cursor in the image view and select the desired label class through simple keyboard operations, without having to scroll through the list of classes.
After a user annotates a set of images, clicking 'Next' commits the annotations to the database (see Section 2.3.1 below) and presents a new set of images. Metadata related to the annotation process are stored as well, e.g. annotation author, image view count, date and time of creation, time required, browser agent, window size, number of interactions and more. Clicking 'Previous' re-displays the image (or batch of images, depending on the configuration) the user has seen before and allows modifying the annotations therein. Finally, the platform also supports re-visiting existing annotations, filterable by date and annotation presence/absence to skip empty images.
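As an illustration, the metadata stored alongside a single annotation could look roughly like the record below; the field names are examples only and the actual schema may differ.

```python
# Example of the kind of metadata stored alongside each annotation
# (illustrative field names; the actual database schema may differ).
annotation_meta = {
    "username": "annotator_01",
    "viewcount": 3,                       # how often the image was shown
    "time_created": "2020-06-15T14:32:10",
    "time_required_ms": 5400,             # time spent annotating the image
    "browser_agent": "Mozilla/5.0 ...",
    "window_size": [1920, 1080],
    "num_interactions": 7,
}
```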

| Server backend
Annotation Interface for Data-driven Ecology stores annotations and metadata in a relational database (RDB), specifically Postgres (https://www.postgresql.org), an open-source database system. RDBs enable concurrent (i.e. multi-user) access, scalability and security on the one hand, and facilitate tabular data download for further analyses on the other. Note that images are only referenced through the database, but stored as files on disk for easier organisation. Images can be uploaded and managed through the web browser; large images can automatically be split into patches on a regular grid during upload, if requested. Data input and output between the RDB and the annotation interface are handled by the server-side logic of AIDE, which is written in Python and based around bottle.py (https://bottlepy.org), a lightweight web server engine.
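The division of labour between interface, server and database can be illustrated with a minimal endpoint in the same technology stack. The route name, table and columns below are hypothetical, chosen only to sketch the general pattern of a bottle.py route writing submitted annotations to Postgres; AIDE's actual routes and schema differ.

```python
# Minimal sketch, in the spirit of AIDE's backend, of a bottle.py route that
# stores submitted annotations in PostgreSQL. Route name, table and columns
# are hypothetical, not AIDE's actual schema.
import json
import psycopg2
from bottle import Bottle, request

app = Bottle()
db = psycopg2.connect(dbname="aide", user="aide", password="secret", host="localhost")

@app.post("/submitAnnotations")
def submit_annotations():
    data = request.json  # e.g. {"annotations": [{"image": ..., "label": ..., "meta": {...}}]}
    with db, db.cursor() as cur:
        for ann in data["annotations"]:
            cur.execute(
                "INSERT INTO annotation (image, label, meta) VALUES (%s, %s, %s)",
                (ann["image"], ann["label"], json.dumps(ann.get("meta", {}))),
            )
    return {"status": "ok"}

# from bottle import run; run(app, host="0.0.0.0", port=8080)  # start the server
```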

| User performance evaluation
Figure 3. AIDE's labelling interface can be customised in many ways and supports multiple annotation types (clockwise, from top left): image labels, points, bounding boxes and segmentation masks.

Expertise and diligence of annotators may vary, which can become a challenge in collaborative labelling projects. To assist project administrators, AIDE offers tools for assessing the performance and annotation accuracy of users. All users' annotations can be compared to each other (including those of project administrators) through the web interface (Figure 4). The returned statistics are calculated on the server and adjusted to the annotation type: for image labels and segmentation masks, the overall accuracy is returned; for points and bounding boxes, AIDE provides precision and recall scores, as well as average spatial point distances and intersection-over-union (IoU) scores, respectively.
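For bounding boxes, the IoU score relates the overlap of two boxes to their combined area. The following is a generic sketch of this standard computation for axis-aligned boxes, shown for illustration only; it is not AIDE's internal code.

```python
# Generic intersection-over-union (IoU) for two axis-aligned bounding boxes
# given as (x_min, y_min, x_max, y_max). Standard formulation, shown for
# illustration; not AIDE's internal code.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # union = area A + area B - intersection
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25 / 175 = 0.1428...
```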
Furthermore, AIDE allows the specification of 'golden questions', which are images that serve as a reference for evaluation: project administrators can flag an arbitrarily large set of images as 'golden questions'. Every annotator then sees only the golden question images when they first begin the labelling process in a specific project. The platform can further be configured to only allow new users to continue if they pass a certain accuracy criterion (e.g. a recall of 80% or more) on the golden questions, or after explicit admission by the project administrator.
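Such a gating rule could, for example, compute each new user's recall over the golden questions and compare it against the configured threshold. The sketch below is purely illustrative; the data structures, names and the 80% threshold are only the example quoted in the text, not AIDE's implementation.

```python
# Sketch of a golden-question gating rule as described above.
# 'golden' and 'user_annotations' map image IDs to sets of reference and
# user-provided labels; names and structure are illustrative only.
def passes_golden_questions(golden, user_annotations, min_recall=0.8):
    true_positives = 0
    total_positives = 0
    for image_id, reference_labels in golden.items():
        predicted = user_annotations.get(image_id, set())
        true_positives += len(reference_labels & predicted)
        total_positives += len(reference_labels)
    recall = true_positives / total_positives if total_positives else 1.0
    return recall >= min_recall
```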

| ML backend
At the heart of AIDE lies its capability of training ML models based on the annotations provided by the users. Including ML models in the labelling process offers a number of potential advantages. Model predictions can be displayed in the images for annotators to verify or correct; although we did not observe any speed-up or accuracy improvement of annotators when showing predictions in the images in our tests, the option is available. In addition, AIDE can alter the order of images based on the model predictions, e.g. to prioritise particularly difficult images (i.e. those with low-confidence predictions) or images with a high number of predictions (Figure 6).
Annotation Interface for Data-driven Ecology is designed to accommodate any ML model, as long as it can be trained in a supervised way on images annotated by the users of the interface. To this end, AIDE comes with a number of ML models built-in (Section 2.4.2), but also accepts third-party models (Section 2.4.3).

| Model training
Upon project creation, or at any later point, administrators can select one of the available model types that is compatible with their project's annotation and prediction types. AIDE has a number of ML models built-in, but these can be replaced by almost any user-provided ML model. Furthermore, all statistical evaluation functionalities described in Section 2.3.2 are also available for evaluating model performance.

| Built-in models
Annotation Interface for Data-driven Ecology has a number of deep learning models built-in that have been shown to yield high performance on computer vision tasks. These include:
• RetinaNet for object detection and classification with bounding boxes (Lin, Goyal, et al., 2017). RetinaNet is an evolution of Faster R-CNN (Ren et al., 2015), which is widely used in computer vision research and ecology (Schneider et al., 2018). RetinaNet provides two advantages over Faster R-CNN: the first is a sequence of layers called the 'Feature Pyramid Network', which yields both high-resolution and semantically expressive features for each location in the image, enabling object detection with high accuracy. The second is the 'focal loss', which reduces the penalty for correct predictions whose confidence is not perfect but already good enough, making the model more robust to datasets that exhibit strong class imbalances (a minimal sketch of this loss follows the list below). RetinaNet has been successfully used for aerial wildlife counting (Eikelboom et al., 2019) and coral detection (Modasshir et al., 2018).
• U-Net for semantic segmentation (Ronneberger et al., 2015). U-Net consists of an encoder-decoder pair: the encoder maps the image to a lower spatial resolution but high-dimensional feature representation, and the decoder maps it back to the original resolution to assign a label to every pixel, with skip connections between the two parts helping to preserve fine spatial detail.
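The focal loss mentioned in the RetinaNet item can be written, in its unweighted binary form, as FL(p) = -(1 - p)^gamma * log(p), where p is the predicted probability of the true class and gamma controls how strongly easy examples are down-weighted. The following is a small NumPy sketch for illustration only; the alpha-balancing term of the original formulation is omitted, and this is not the implementation used inside AIDE.

```python
# Binary focal loss (Lin, Goyal, et al., 2017), illustrative NumPy sketch.
# p: predicted probability of the true class; gamma down-weights easy examples.
import numpy as np

def focal_loss(p, gamma=2.0, eps=1e-7):
    p = np.clip(p, eps, 1.0 - eps)
    return -((1.0 - p) ** gamma) * np.log(p)

# An easy example (p = 0.9) contributes far less than a hard one (p = 0.1):
print(focal_loss(np.array([0.9, 0.1])))  # ~[0.00105, 1.865]
```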

| Custom models
In some cases, the built-in models of AIDE might not be adequate, or else users of the system may already have an ML model available that they would like to use in the annotation process. For these cases, AIDE supports the integration of third-party models.
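The exact interface a custom model has to implement is documented in the AIDE repository; the sketch below only illustrates the general shape of such a wrapper. Class and method names here are placeholders chosen for illustration, not AIDE's actual API.

```python
# Schematic sketch of a third-party model wrapper. Class and method names are
# placeholders; consult AIDE's documentation for the actual interface a custom
# model has to implement.
class MyCustomModel:
    def __init__(self, project_config, label_classes, model_options):
        self.config = project_config        # project settings (annotation type, etc.)
        self.label_classes = label_classes  # label classes defined for the project
        self.options = model_options        # user-configurable hyperparameters

    def train(self, state_dict, annotated_images):
        """Update the model on the latest user annotations and return new weights."""
        # ... load state_dict, run one or more training passes ...
        return state_dict

    def inference(self, state_dict, unlabelled_images):
        """Predict on unlabelled images; predictions are later ranked by the AL criterion."""
        # ... load state_dict, run the model ...
        return {image_id: [] for image_id in unlabelled_images}
```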

| Active Learning for human-machine collaboration
In most ML workflows, a model is trained once on part of a dataset and then kept static during a prediction phase on the rest of the images. While this may work if sufficient data have been labelled, it is less than optimal for situations where the initial number of existing annotations is low, or where a model is to be re-used on e.g. a new set of images whose visual appearance is very different from that of the training set. In this case, specific domain adaptation strategies can be devised to compensate for the domain shift (Tuia et al., 2016), but at the cost of custom-built ML models that are difficult to use for non-specialists.
Instead, AIDE integrates prediction models in an active learning (AL) loop (Kellenberger et al., 2019; Settles, 2009), also known as a 'human-in-the-loop' system (Brodley, 2017). In this loop, the model is repeatedly re-trained on the latest user annotations and applied to the remaining unlabelled images, and the AL criterion assigns a priority score to each prediction; when users click 'Next', they are automatically presented with the newly predicted images, sorted by the priority score. In the end, this means that more relevant images are shown to the user with higher priority throughout the entire labelling process, with the notion of relevance depending on the task.
As an example, AL can be used to improve model performance after a given number of annotated images. Figure 9 shows precision-recall curves of CNNs on large mammal detection in aerial images, before fine-tuning (grey) and after five (dashed) and ten (solid) iterations with different AL criteria (see Appendix 5.4 for details).
Figure 8. AIDE allows configuring model options for each project through the web browser. If model settings are provided in the right format, they will be rendered with graphical elements and can incorporate explanation texts and links for each parameter; this is also available for third-party models (see Appendix 5.2).

Figure 9. Precision-recall curves of an object detector CNN with initial performance (grey) and after five (dashed) and ten (solid) AL iterations with three different AL criteria.

Note that the prediction quality of the CNN improves with all tested criteria, including simple random image ordering (i.e. no AL) over the original base model (grey), but the improvement after only five AL iterations is highest with a dedicated AL criterion (Breaking Ties; Luo et al., 2005).
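For reference, the Breaking Ties criterion scores a prediction by the margin between its two most probable classes: a small margin indicates high uncertainty and therefore high priority. The sketch below is illustrative only; AIDE's built-in criteria may be implemented differently.

```python
# Breaking Ties criterion (Luo et al., 2005), sketched with NumPy.
# 'probs' holds per-class probabilities for one prediction; a small gap between
# the two largest values means high uncertainty, hence high priority.
# Illustrative only; AIDE's built-in criteria may be implemented differently.
import numpy as np

def breaking_ties_priority(probs):
    top_two = np.sort(probs)[-2:]           # two highest class probabilities
    return 1.0 - (top_two[1] - top_two[0])  # small margin -> priority close to 1

print(breaking_ties_priority(np.array([0.48, 0.47, 0.05])))  # ~0.99 (ambiguous)
print(breaking_ties_priority(np.array([0.95, 0.03, 0.02])))  # ~0.08 (confident)
```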

| AIDE FOR COMMUNITY DEVELOPMENT
Integrating ML models into the labelling process eventually results in model states that are highly optimised for the data at hand. This is particularly helpful for large-scale image campaigns, where a well-trained model may result in reduced annotation efforts.
However, the benefits of ML reach further: once trained, models can be used across individual projects. Oftentimes, ecologists conduct image campaigns with similar targets in mind, e.g. with images containing the same species, comparable types of background, or from the same viewpoint (ground-based, airborne, etc.).
In these cases, re-using ML model states from other, similar projects provides a starting point that has the potential to accelerate labelling campaigns even further.
To this end, an upcoming release of AIDE will include a 'model marketplace' where users will be able to share trained ML model states across projects. At the start of each annotation project, users will be able to browse through a catalogue of available model states. Each state is accompanied with a description, a list of label classes the model supports and other related metadata.
This way, users can select the most appropriate model state as a starting point and obtain higher-quality predictions right from the start of the annotation process. Likewise, once users decide that the model in their own project is sufficiently trained, they can share its state with others by providing the mentioned metadata (name, description, etc.) and publishing it on the marketplace. For privacy reasons, only the aforementioned metadata and model parameters will be shared, which prevents conclusions from being drawn about the images of the originating project.
Also, model states have to be shared explicitly by a project administrator and will be shareable either only across the administrator's own projects, or globally. Owners of the model states can further discard any information about the origin, such as their AIDE account name.
Eventually, we foresee AIDE and the model marketplace as a platform to enhance ecological image analysis in a collaborative way, beyond the individual project. Once a sufficient number of applications and image types have been covered by shared model states, labelling efforts will be reduced to a minimum for any new image campaign. This will enable ecologists to allot more time for the data interpretation, rather than the annotation process.

| LIMITATIONS OF AIDE
Annotation Interface for Data-driven Ecology was designed to enable large-scale, collaborative annotation projects for ecological applications by means of interactive integration of ML models in an easy-to-use manner. Effectively, AIDE does not require users to write a single line of code if they decide to use one of the built-in or contributed third-party models. However, AIDE is still a growing project, and as such has a number of limitations, including the following:
• Annotation Interface for Data-driven Ecology currently only supports RGB images and is not compatible with multi-band images, georeferenced data or other media types like videos.
• Only the four annotation types mentioned are supported at this moment. We plan to add compatibility for other types, such as more complex polygons or instance segmentation maps, in upcoming releases, and will also include appropriate ML models for them.
• Models need to be trained to a certain degree on the data to be useful in interactive setups. In the case of deep learning models, this requires a comparably large set of existing labels, limiting their use at the start of annotation projects. If a new project is started with a completely untrained deep learning model, the latter will usually provide random labels per image or per pixel in the case of image classification and semantic segmentation, respectively, or predictions in all possible locations of the image for points and bounding boxes. We intend to address this obstacle in a future release of AIDE through the 'model marketplace' highlighted in Section 3.
• While AIDE offers tools to train ML models and evaluate model prediction and user performances (cf. Section 2.3.2), it does not guarantee high-quality annotations or well-performing ML models by itself. Eventually, it will always be the project administrators' responsibility to verify the accuracy of provided annotations, and to ensure that ML models are trained to the degree required for the individual annotation project.
Finally, we would like to note that AIDE is still work in progress and will grow in functionality over time. We hope to be able to deliver a solution that facilitates using ML models in as many ecological applications as possible.

| CONCLUSION
Ecological research increasingly relies on large-scale visual datasets, which can dramatically scale the spatial coverage of wildlife surveys, but require tedious and expensive photo-interpretation of the acquired images. ML models, in particular convolutional neural networks (CNNs), have demonstrated high potential for accelerating this manual work. However, they often require involved coding efforts, which has likely prevented broad adoption in many ecology projects.
In this study we presented Annotation Interface for Data-driven Ecology (AIDE), an open-source web framework that integrates a flexible and easy-to-use annotation platform with CNN-based prediction models. AIDE is a versatile labelling tool that offers a high degree of customisability, support for various annotation types and support for multiple users. It is also one of the first annotation platforms that employs ML models to assist annotators in their task. Critically, AIDE employs these models through active learning, where humans and the machine work hand-in-hand: humans provide annotations the model can learn from, and the model returns suggested predictions and prioritises images with respect to their relevance.
Annotation Interface for Data-driven Ecology is under active development and will be expanded in functionality in upcoming releases. This includes addressing the shortcomings mentioned above, such as support for more annotation types and the ability to share pre-trained models across projects, as well as implementing new functionalities that have the potential to enhance image labelling projects for ecology.
Annotation Interface for Data-driven Ecology is an open-source platform that is free to use. The source code is available at https://github.com/microsoft/aerial_wildlife_detection.

ACKNOWLEDGEMENTS
The authors would like to acknowledge the SAVMAP consortium

PEER REVIEW
The peer review history for this article is available at https://publons.

DATA AVAILABILITY STATEMENT
The proposed platform (AIDE) is open source and available for download at https://github.com/microsoft/aerial_wildlife_detection. The version of AIDE used in this manuscript (Microsoft, 2020) can be obtained at https://doi.org/10.5281/zenodo.4028309. Note that this is a frozen code base that will not contain the latest updates and developments beyond the state at publication of this manuscript. For the official and latest release, please refer to the GitHub link above. The images used for the studies behind Figures 6 and 9 are available at https://doi.org/10.5281/zenodo.1204408 (Reinhard et al., 2015).
Labels are available from the authors upon request.