JSW 2014 Vol.9(5): 1202-1209 ISSN: 1796-217X
doi: 10.4304/jsw.9.5.1202-1209
Lemmatization Technique in Bahasa: Indonesian Language
Derwin Suhartono, David Christiandy, Rolando
Computer Science Department, Bina Nusantara University, Jakarta, Indonesia
Abstract—Much research has been done at the intersection of linguistics and technology. Even so, techniques that combine the two do not work reliably for every language, because every language is unique in its linguistic nature and rules. This paper presents a lemmatization technique for Bahasa (the Indonesian language). It achieves good precision by using the Indonesian Dictionary and a set of rules to remove affixes. The technique builds on an earlier algorithm, the Indonesian stemmer. Indonesian stemming and lemmatization share the same characteristics but differ slightly in implementation. The core difference lies in the goal each method pursues, so the stemming algorithm can be modified to perform lemmatization. The results show that the algorithm achieved roughly 98% precision on a collection of 57,261 valid words, 7,839 of them unique, gathered from articles on Kompas.com, an Indonesian online news site.
Index Terms—stemmer, algorithm, lemmatization, language, Bahasa, Indonesian
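The abstract describes the approach only at a high level: a dictionary lookup combined with affix-removal rules. As a rough illustration of that general idea, and not the authors' actual rule set, a minimal Python sketch might look like the following. The toy dictionary, the short affix lists, and the function name `lemmatize` are all assumptions made for this example. Real Indonesian morphology also requires recoding rules (for example for the me- and pe- prefixes) that are omitted here.

```python
# Toy stand-in for the Indonesian Dictionary (assumption, not the real resource).
TOY_DICTIONARY = {"makan", "main", "baca", "ajar"}

# Small, incomplete affix lists chosen only for illustration (assumption).
SUFFIXES = ("lah", "kah", "nya", "kan", "an", "i")
PREFIXES = ("ber", "ter", "di", "ke", "se")


def lemmatize(word, dictionary=TOY_DICTIONARY):
    """Return a dictionary lemma for `word`, or `word` unchanged if none is found."""
    word = word.lower()
    if word in dictionary:
        return word

    # Build candidates: the word itself, plus the word with one suffix removed.
    candidates = [word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            candidates.append(word[: -len(suffix)])

    # Accept a candidate only if it, or it minus one prefix, is in the dictionary.
    for candidate in candidates:
        if candidate in dictionary:
            return candidate
        for prefix in PREFIXES:
            if candidate.startswith(prefix):
                stripped = candidate[len(prefix):]
                if stripped in dictionary:
                    return stripped

    # No rule produced a dictionary entry; leave the word as it is.
    return word


if __name__ == "__main__":
    for w in ["dimakan", "bermain", "bacalah", "makanan"]:
        print(w, "->", lemmatize(w))
```

Checking every candidate against the dictionary before accepting it is what keeps the output a valid lemma rather than an arbitrary truncated stem, which is the distinction between lemmatization and stemming that the abstract draws.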
Cite: Derwin Suhartono, David Christiandy, Rolando, "Lemmatization Technique in Bahasa: Indonesian Language," Journal of Software vol. 9, no. 5, pp. 1202-1209, 2014.