Boosting data findability: The role of AI-enhanced keywords

Authors

DOI:

https://doi.org/10.29173/iq1127

Keywords:

FAIR, user-defined keywords, text classification, artificial intelligence, findability, controlled vocabularies

Abstract

In today’s data-driven world, finding relevant data in a vast expanse of information is increasingly important. Researchers have been exploring various methods to improve the findability, accessibility, interoperability, and reusability of data, for example by using controlled vocabularies to enhance data findability. Although the use of controlled vocabularies is growing, challenges remain for findability when users provide their own keywords, known as user-defined keywords, or do not provide keywords at all. Finding data in data archives based on metadata fields with user-defined or missing keywords is challenging, or even impossible. Here, we show how artificial intelligence (AI) techniques from the subfield of deep learning can automate the assignment of keywords from a controlled vocabulary, leading to improved data findability. The main results demonstrate that AI automation performs well on the test set. In addition, we compare our deep learning model against a large language model (LLM) on the task of automated topic assignment. Automated topic assignment will reduce the time and effort required for data curation, enhancing data findability and usability for data producers and consumers. The application of AI to automate metadata assignment offers practical solutions for improving data findability and reusability, not only in research data archives but across various data-driven domains. Overall, this approach highlights the potential of AI in addressing data findability challenges, paving the way for more efficient and effective data discovery and utilization in the era of big data and information abundance.
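
The task described in the abstract, assigning zero or more controlled-vocabulary keywords to a dataset description, can be illustrated with a minimal sketch. The abstract does not specify the model, so this example deliberately swaps in a simple term-count cosine similarity for the deep learning embeddings; the vocabulary labels, their descriptive texts, the `assign_keywords` function, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical controlled vocabulary: topic label -> short descriptive text.
# These labels and descriptions are illustrative, not taken from the article.
CONTROLLED_VOCABULARY = {
    "Health": "health disease hospital patients medical care",
    "Education": "education school students teachers learning",
    "Labour and employment": "employment labour jobs wages workers unemployment",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def embed(text):
    """Stand-in for a learned embedding: a sparse term-count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_keywords(description, vocabulary=CONTROLLED_VOCABULARY, threshold=0.2):
    """Return every vocabulary label whose similarity meets the threshold.

    Multi-label: zero, one, or several labels may be returned.
    The threshold is an assumed value, not one reported in the article.
    """
    doc_vec = embed(description)
    scores = {label: cosine(doc_vec, embed(text)) for label, text in vocabulary.items()}
    return sorted(label for label, score in scores.items() if score >= threshold)


if __name__ == "__main__":
    print(assign_keywords(
        "Survey of wages and unemployment among hospital workers and patients"
    ))
```

In a setup closer to the one the abstract describes, `embed` would presumably be replaced by a trained sentence-embedding or classification model, while the thresholded multi-label decision in `assign_keywords` could stay the same.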

Published

2024-12-23

How to Cite

Jamwal, K. (2024). Boosting data findability: The role of AI-enhanced keywords. IASSIST Quarterly, 48(4). https://doi.org/10.29173/iq1127