Boosting data findability: The role of AI-enhanced keywords
DOI:
https://doi.org/10.29173/iq1127
Keywords:
FAIR, user-defined keywords, text classification, Artificial Intelligence, findability, Controlled vocabularies
Abstract
In today’s data-driven world, finding relevant data in a vast expanse of information is increasingly important. Researchers have been exploring various methods to improve the findability, accessibility, interoperability, and reusability of data, for example, by using controlled vocabularies to enhance data findability. Although the use of controlled vocabularies is growing, challenges remain for findability when users provide their own keywords, known as user-defined keywords, or do not provide keywords at all. Finding data in data archives based on metadata fields with user-defined or missing keywords is challenging, or even impossible. Here, we show the use of artificial intelligence (AI) techniques from the subfield of deep learning to automate the assignment of keywords from a controlled vocabulary, leading to improved data findability. The main results demonstrate that AI automation performs well on the test set. In addition, we compare our deep learning model against a large language model (LLM) on the task of automated topic assignment. Automated topic assignment will reduce the time and effort required for data curation, enhancing data findability and usability for data producers and consumers. The application of AI to automate metadata assignment offers practical solutions for improving data findability and reusability, not only in research data archives but across various data-driven domains. Overall, this approach highlights the potential of AI in addressing data findability challenges, paving the way for more efficient and effective data discovery and utilization in the era of big data and information abundance.
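To make the idea of automated, multi-label topic assignment from a controlled vocabulary concrete, the following is a minimal sketch under stated assumptions, not the model described in the article. It swaps the deep learning model for a simple bag-of-words cosine similarity so that it runs with the Python standard library alone. The vocabulary entries, the function names, and the threshold value are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical controlled vocabulary: topic label -> short description.
# These entries are invented for illustration; a real archive would use
# its own vocabulary.
VOCABULARY = {
    "Health": "health disease medical care hospital patients illness",
    "Education": "education school students teachers learning university",
    "Labour and employment": "employment labour work jobs wages unemployment",
}


def bag_of_words(text):
    """Lowercase the text and count its alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_topics(description, vocabulary=VOCABULARY, threshold=0.2):
    """Return every label whose similarity reaches the threshold.

    More than one label may be returned, which mirrors the multi-label
    nature of topic assignment. The threshold is an assumed value.
    """
    doc = bag_of_words(description)
    scores = {
        label: cosine(doc, bag_of_words(text))
        for label, text in vocabulary.items()
    }
    return sorted(label for label, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    print(assign_topics("Survey of unemployment and wages among university students"))
```

In this sketch a dataset description with no keywords still receives controlled-vocabulary topics, and a description that touches several subjects receives several. A trained classifier or sentence embeddings would replace `bag_of_words` and `cosine` in a realistic setting.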
License
Copyright (c) 2024 Kokila Jamwal
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.
This license lets others remix, tweak, and build upon your work non-commercially, and although their new works must also acknowledge you and be non-commercial, they don’t have to license their derivative works on the same terms.
The Creative Commons-Attribution-Noncommercial License 4.0 International applies to all works published by IASSIST Quarterly. Authors will retain copyright of the work. Your contribution will be available at the IASSIST Quarterly website when announced on the IASSIST list server.