Pedram Yamini, Fatemeh Daneshfar, Abuzar Ghorbani
Volume 20, Issue 4 (December 2024, Special Issue on ADLEEE)
Abstract
With the exponential growth of unstructured data on the Web and social networks, extracting relevant information from multiple sources has become increasingly challenging, creating a need for automated summarization systems. However, developing machine learning-based summarization systems depends largely on datasets, which must be evaluated to determine their usefulness in retrieving data. In most cases, the summaries in these datasets are produced with human involvement. This approach is inadequate for some low-resource languages, which makes summarization in those languages a daunting task. To address this, this paper proposes a method for developing the first abstractive text summarization corpus for the Sorani Kurdish language, with human evaluation, together with an automated summarization model. The researchers compiled various documents from information available on the Web (Rudaw), and the resulting corpus was released publicly. A customized and simplified version of the mT5-base transformer was then developed to evaluate the corpus. The model's performance was assessed using ROUGE-1, ROUGE-2, ROUGE-L, n-gram novelty, and manual evaluation, and the generated summaries are close to the reference summaries on all of these criteria. This Sorani Kurdish corpus and automated summarization model have the potential to pave the way for future studies and to facilitate the development of improved summarization systems for low-resource languages.
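To make the automatic criteria named above concrete, the following minimal sketch shows how ROUGE-N overlap and n-gram novelty are commonly computed. It is an illustration only, not the evaluation code used in the paper: the function names `rouge_n` and `ngram_novelty` are hypothetical, whitespace tokenization is an assumption (the tokenizer applied to Sorani Kurdish text is not stated here), and ROUGE-L and manual evaluation are not covered.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall, and F1 from clipped n-gram overlap.

    Assumption: whitespace tokenization, chosen for illustration only.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def ngram_novelty(summary, source, n=1):
    """Fraction of summary n-grams that do not appear in the source text."""
    summ = ngrams(summary.split(), n)
    if not summ:
        return 0.0
    src = set(ngrams(source.split(), n))
    return sum(1 for gram in summ if gram not in src) / len(summ)


if __name__ == "__main__":
    # Toy English strings stand in for real article/summary pairs.
    reference = "the cat sat on the mat"
    candidate = "the cat sat"
    print(rouge_n(candidate, reference, n=1))
    print(rouge_n(candidate, reference, n=2))
    print(ngram_novelty("a cat slept", reference, n=1))
```

A higher ROUGE score indicates more n-gram overlap with the reference summary, while a higher novelty score indicates that the summary uses more n-grams absent from the source document, which is one rough signal of abstractiveness.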