Multi-label classification for multi-temporal, multi-spatial coral reef condition monitoring using vision foundation model with adapter learning
While coral reef ecosystems provide important ecological services, they are increasingly threatened by global warming, ocean acidification, and human disturbance. Around the world, citizen-science-based coral reef conservation programs have been implemented to support local monitoring and management. These programs typically rely on volunteer divers to record text-based observations, a process that is labor-intensive and often inconsistent across observers. With the growing availability of underwater imagery, deep learning models offer a promising route to automated coral condition assessment. However, conventional models still struggle to maintain high accuracy and generalizability on complex ecological monitoring tasks.
Building on our previous work published in Aquatic Conservation: Marine and Freshwater Ecosystems, this research broadens both the data scope and the methodological depth. First, we expand the dataset to multi-temporal, multi-spatial field images collected over two years at 15 dive sites in Koh Tao, Thailand. All images are annotated with standardized categories commonly used in citizen science programs, enabling direct alignment with existing monitoring protocols. Second, we improve accuracy, generalizability, and training efficiency by integrating a vision foundation model with an adapter learning technique.
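Because a single image can show several conditions at once (for example, healthy and bleached coral in the same frame), the task is framed as multi-label classification: each image is encoded as a multi-hot target vector rather than a single class. The minimal sketch below illustrates this encoding; the category names are illustrative placeholders, not the exact annotation scheme used in the study.

```python
# A minimal sketch of multi-hot label encoding for multi-label coral
# condition classification. CATEGORIES below is a hypothetical label set,
# not the study's actual annotation scheme.
import torch

CATEGORIES = ["healthy_coral", "bleached_coral", "dead_coral", "rubble"]

def encode_labels(labels: list[str]) -> torch.Tensor:
    """Convert the condition labels of one image into a multi-hot vector."""
    target = torch.zeros(len(CATEGORIES))
    for label in labels:
        target[CATEGORIES.index(label)] = 1.0
    return target

# One image can carry several conditions at once, e.g. a partly bleached reef:
print(encode_labels(["healthy_coral", "bleached_coral"]))  # tensor([1., 1., 0., 0.])
```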
Vision foundation models, known for their strong cross-domain performance, offer substantial advantages for ecological image analysis. However, fully fine-tuning them requires data-center-scale GPU resources and carries a large carbon footprint, making it impractical for many conservation communities. To address this challenge, this study introduces a lightweight approach that combines the DINOv2 foundation model with Low-Rank Adaptation (LoRA). This adapter-based method freezes the pretrained backbone and trains only small low-rank matrices injected into it, enabling efficient fine-tuning while greatly reducing the number of trainable parameters.
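As a concrete illustration, the following minimal sketch wraps a pretrained DINOv2 backbone with LoRA adapters using the Hugging Face transformers and peft libraries, then attaches a multi-label classification head. The model size, LoRA rank, and target modules shown here are assumptions for illustration, not the configuration reported in the paper.

```python
# A minimal sketch of the adapter-learning idea: wrap a pretrained DINOv2
# backbone with LoRA so that only the low-rank adapter weights (plus a small
# classification head) are trained. Model size, rank, and target modules are
# assumed for illustration.
import torch
import torch.nn as nn
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

backbone = AutoModel.from_pretrained("facebook/dinov2-base")

lora_config = LoraConfig(
    r=8,                                # low-rank dimension (assumed)
    lora_alpha=16,
    target_modules=["query", "value"],  # attention projections in DINOv2
    lora_dropout=0.1,
)
backbone = get_peft_model(backbone, lora_config)
backbone.print_trainable_parameters()   # only the LoRA weights remain trainable

class CoralConditionClassifier(nn.Module):
    """LoRA-adapted DINOv2 with a multi-label head (one logit per condition)."""
    def __init__(self, backbone: nn.Module, num_labels: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(backbone.config.hidden_size, num_labels)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        features = self.backbone(pixel_values=pixel_values).pooler_output
        return self.head(features)  # raw logits; pair with BCEWithLogitsLoss

model = CoralConditionClassifier(backbone, num_labels=4)
```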
Experiments show that the DINOv2-LoRA model significantly improves multi-label coral condition classification accuracy, achieving a match ratio of 64.77%, compared with 60.34% for the best conventional deep learning baseline (Swin Transformer). The method reduces trainable parameters from 1136.50 M to only 5.91 M and peak allocated GPU memory from 21.39 GB to 13.00 GB, offering substantial gains in training efficiency. Transfer learning experiments across seasons and sites further demonstrate strong generalizability under multi-temporal and multi-spatial settings, supporting long-term ecological monitoring.
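The match ratio used here is the exact-match (subset) accuracy for multi-label classification: a prediction counts only if every label for an image is correct, making it stricter than per-label accuracy. A minimal sketch, assuming sigmoid outputs thresholded at 0.5:

```python
# A minimal sketch of the match ratio (exact-match / subset accuracy) metric
# for multi-label classification. The 0.5 decision threshold is an assumption.
import torch

def match_ratio(logits: torch.Tensor, targets: torch.Tensor, threshold: float = 0.5) -> float:
    """Fraction of samples whose full predicted label set matches the ground truth."""
    preds = (torch.sigmoid(logits) >= threshold).float()
    exact = (preds == targets).all(dim=1)  # True only if every label matches
    return exact.float().mean().item()

# Example: 3 images, 4 condition labels; only the first row matches exactly.
logits = torch.tensor([[ 3.0, -2.0, -3.0,  2.5],
                       [ 2.0,  2.0, -1.0, -1.0],
                       [-2.0, -2.0,  3.0, -2.0]])
targets = torch.tensor([[1., 0., 0., 1.],
                        [1., 0., 0., 0.],
                        [0., 0., 1., 1.]])
print(match_ratio(logits, targets))  # 0.333...
```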
To our knowledge, this research presents the first efficient adaptation of a vision foundation model for multi-label classification of coral reef conditions using diverse field-collected imagery. The proposed framework advances automated coral reef monitoring and provides an accessible tool to support environmental management and citizen-science-based conservation efforts.
Fig. 1. Photo transect survey method used for data collection in this study.
Fig. 2. Schematic diagram illustrating the image classification process.
Fig. 3. Two volunteers from a local conservation team conducting a predator removal activity (crown-of-thorns starfish).
Publication
Shao, X., Chen, H., Zhao, F., Magson, K., Chen, J., Li, P., Wang, J., Sasaki, J.: Multi-label classification for multi-temporal, multi-spatial coral reef condition monitoring using vision foundation model with adapter learning. Marine Pollution Bulletin, 223, 119054, 2026. DOI