Volume 3, Issue 9
AI-Driven Data Annotation: A Paradigm Revolution from Assistant to Protagonist
As the starting point of AI model training, data annotation directly determines the upper bound of algorithm capability through its efficiency and quality. Purely manual annotation suffers from high cost, poor scalability, and unstable quality, creating an "annotation bottleneck" that constrains AI development. This paper systematically reviews the evolution of AI-empowered data annotation and advances a central thesis: AI's role in data annotation is shifting from "assistant" to "protagonist." We first analyze the cost-quality-scale trilemma of traditional annotation. We then examine the AI-assisted annotation paradigm represented by Active Learning, assessing both its effectiveness and the inherent dilemma of the "active learning paradox." Next, we focus on the new paradigm represented by large language models (LLMs) and vision foundation models, analyzing how zero-shot and few-shot capabilities are reshaping annotation workflows, as well as emerging problems such as the "generalization-specialization gap." Finally, we discuss algorithmic bias and model reliability under the new paradigm and look ahead to a new mode of human-machine collaboration. We argue that the core of future data annotation lies in building a robust human-machine collaborative ecosystem, in which foundation models handle work at scale while human experts provide higher-level supervision, governance, and alignment.
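The Active Learning paradigm mentioned above can be illustrated with a minimal sketch of pool-based uncertainty sampling: the model ranks unlabeled examples by how uncertain it is about them, and only the most ambiguous ones are routed to human annotators. All names below (`uncertainty`, `select_for_labeling`, the toy probability model) are illustrative, not from any specific library.

```python
# Minimal sketch of pool-based active learning via least-confidence
# uncertainty sampling for a binary classifier.

def uncertainty(prob_positive: float) -> float:
    """Least-confidence score: highest when the predicted
    probability of the positive class is closest to 0.5."""
    return 1.0 - max(prob_positive, 1.0 - prob_positive)

def select_for_labeling(pool, predict_proba, budget):
    """Rank the unlabeled pool by model uncertainty and return the
    `budget` examples most worth sending to a human annotator."""
    ranked = sorted(pool, key=lambda x: uncertainty(predict_proba(x)),
                    reverse=True)
    return ranked[:budget]

# Toy model: predicted probability grows linearly with the feature value,
# so examples near 5.0 sit on the decision boundary.
predict_proba = lambda x: min(max(x / 10.0, 0.0), 1.0)
pool = [0.5, 4.8, 5.1, 9.7, 2.0]
print(select_for_labeling(pool, predict_proba, 2))  # → [5.1, 4.8]
```

In a real annotation loop, the model is retrained after each batch of human labels and the ranking is recomputed, which is where the "active learning paradox" arises: the examples the model finds hardest are often also the hardest, slowest, and most error-prone for human annotators.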