Self-supervised Fine-grained Image Recognition Method Based on Multi-scale Attention and Contrastive Learning

Chih-Hao Lin; Yu-Hsuan Tseng; Pei-Chen Wu; Cheng-Yu Huang; Meng-Ying Lai

doi:10.5281/zenodo.15232950

Authors

Chih-Hao Lin, Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan
Yu-Hsuan Tseng, Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan
Pei-Chen Wu, Department of Electrical Engineering, National Taiwan University, Taipei 10617, Taiwan
Cheng-Yu Huang, Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan
Meng-Ying Lai, Department of Computer Science, National Tsing Hua University, Hsinchu 30013, Taiwan

DOI:

https://doi.org/10.5281/zenodo.15232950

Keywords:

Fine-grained image recognition, Self-supervised learning, Contrastive learning, Multi-scale attention, Image representation

Abstract

Fine-grained image recognition aims to distinguish subcategories within the same broad category. Because inter-class differences are subtle and annotation is costly, it has long been a significant challenge in computer vision. This study proposes a self-supervised image recognition framework that integrates multi-scale attention mechanisms with contrastive learning, enabling feature extraction without manual annotation. A multi-level attention module captures both local and global image information, while a momentum encoder and data augmentation generate the positive and negative sample pairs used for contrastive training. On the standard datasets CUB-200-2011 and FGVC-Aircraft, the proposed method achieves Top-1 recognition accuracies of 89.2% and 87.5%, respectively, a significant improvement over current mainstream methods.
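
The contrastive training step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the loss or the update rule, so the sketch assumes a MoCo-style exponential-moving-average update for the momentum encoder and an InfoNCE loss over one positive and several negative keys. The function names, the momentum value `m=0.999`, and the temperature `0.07` are illustrative assumptions, and plain Python lists stand in for encoder parameters and feature vectors.

```python
import math


def momentum_update(key_params, query_params, m=0.999):
    """Assumed EMA rule: key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(query, positive_key, negative_keys, temperature=0.07):
    """InfoNCE loss for one query: cross-entropy with the positive at index 0."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(n)) / temperature for n in negative_keys]
    # Numerically stable log-sum-exp over all logits.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[0]


# Hypothetical features: two augmented views of one image, plus negatives.
view_a = [0.9, 0.1, 0.0]
view_b = [0.8, 0.2, 0.1]
negatives = [[0.0, 1.0, 0.0], [0.1, 0.0, 1.0]]
loss = info_nce_loss(view_a, view_b, negatives)
```

In this sketch the loss shrinks as the two views of the same image align and the negatives move away from the query. The slowly updated key encoder is what keeps the negative features consistent across training steps in momentum-based methods of this kind.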

