Please use this identifier to cite or link to this item: http://buratest.brunel.ac.uk/handle/2438/13733
Title: Enhanced topic identification algorithm for Arabic corpora
Authors: Alsaad, A; Abbod, M
Keywords: Root extraction;Feature selection;Topic identification;Natural language processing;Data mining;Text mining
Issue Date: 2016
Publisher: IEEE
Citation: Proceedings - UKSim-AMSS 17th International Conference on Computer Modelling and Simulation, UKSim 2015, pp. 90 - 94, (2016)
Abstract: Over the past few years, the production of digitized content has increased rapidly, raising the demand for information retrieval, data mining and automatic data tagging applications. Little research in this field addresses Arabic data, owing to the complex nature of the Arabic language and the lack of standard corpora. In addition, most existing work focuses either on improving Arabic stemming algorithms or on topic identification and classification methods and experiments. No work has incorporated an efficient stemming method within the classification algorithm, which would lead to a more efficient outcome. In this paper, we propose a new approach to identifying significant keywords in Arabic corpora. It combines an advanced stemming and root extraction algorithm with the Term Frequency/Inverse Document Frequency (TFIDF) topic identification method. Our results show that combining advanced stemming, root extraction and TFIDF leads to the extraction of highly significant terms represented by Arabic roots. These roots receive higher TFIDF weights than terms extracted without advanced stemming and root extraction, which decreases the number of indexed words and improves the feature selection process.
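Illustration: the effect described in the abstract, where mapping surface words to a shared root raises that root's TFIDF weight and shrinks the index, can be sketched with a toy example. This is not the authors' algorithm. The lookup table `TOY_ROOTS`, the helper `to_root`, the transliterated sample words and the particular TFIDF formula (term frequency times log(N / document frequency)) are all assumptions made for illustration; the record does not specify the stemmer or the weighting variant used in the paper.

```python
import math
from collections import Counter

# Hypothetical lookup standing in for a real Arabic root extractor.
# Words are transliterated for readability; this table is an assumption.
TOY_ROOTS = {
    "kitab": "ktb",    # book
    "katib": "ktb",    # writer
    "maktaba": "ktb",  # library
    "madrasa": "drs",  # school
    "durus": "drs",    # lessons
}


def to_root(token):
    """Map a token to its root if known, otherwise leave it unchanged."""
    return TOY_ROOTS.get(token, token)


def tfidf(docs):
    """Return one {term: weight} dict per non-empty document.

    Uses weight = (count / doc length) * log(N / df), one common
    TFIDF formulation, assumed here rather than taken from the paper.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["kitab", "katib", "maktaba", "jadid"],
    ["madrasa", "durus", "jadid"],
    ["suq", "madina", "jadid"],
]

surface = tfidf(docs)
rooted = tfidf([[to_root(t) for t in doc] for doc in docs])

surface_vocab = {t for doc in docs for t in doc}
rooted_vocab = {to_root(t) for doc in docs for t in doc}

print(surface[0]["kitab"], rooted[0]["ktb"])
print(len(surface_vocab), len(rooted_vocab))
```

In this toy corpus the three surface forms in the first document collapse into the single root `ktb`, so its term frequency triples while its document frequency stays the same, and the indexed vocabulary drops from eight entries to five.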
URI: http://bura.brunel.ac.uk/handle/2438/13733
DOI: http://dx.doi.org/10.1109/UKSim.2015.77
ISBN: 9781479987122
Appears in Collections: Dept of Electronic and Computer Engineering Research Papers

Files in This Item:
File: FullText.docx
Size: 50.74 kB
Format: Unknown


Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.