Volume 6 Number 9 (Sep. 2011)
JSW 2011 Vol.6(9): 1713-1720 ISSN: 1796-217X
doi: 10.4304/jsw.6.9.1713-1720

Enriched Format Text Categorization Using A Component Similarity Approach

Fei Zhu, Jiong Yang, Yong Zhou
School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu, China, 215006; Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, China

Abstract—Text categorization has been studied extensively for years. However, conventional approaches that work well on plain text perform poorly when they are applied unchanged to enriched format texts. We propose a categorization approach that is applicable to enriched format text. During feature selection, we obtain a feature structure distribution weight from an extended structure model, so that the effect of document structure on categorization is fully considered. Text formats are also taken into account in feature weighting. The combined feature weighting approach strengthens important parts and weakens less important ones. Categorization is carried out by document component similarity: the document is first decomposed, features are gathered by component and by other user-defined rules, a document component tree is completed, and the document is then categorized using that tree. We implement a CSBC-based Naïve Bayes classifier in which the final result is the combination of all classifiers in the component tree. Finally, we parse OpenOffice.org documents, extract the components most relevant to classification, and use the classifier to categorize the documents. The experimental results show that the classifier can classify OpenOffice.org documents automatically and performs well.

Index Terms—text classification, enriched format text classification, OpenDocument, OpenOffice.org, Naïve Bayes
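
The classification scheme outlined in the abstract (one Naïve Bayes classifier per document component, with the final result combined across components) can be illustrated with a minimal sketch. The abstract does not state the combination rule or the weighting formula, so everything specific here is an assumption made for illustration: the component names (`title`, `body`), the weights in `COMPONENT_WEIGHTS`, the helper names, and the choice to combine components by a weighted sum of log scores. It is not the authors' implementation.

```python
import math
from collections import Counter, defaultdict


class ComponentNaiveBayes:
    """Multinomial Naive Bayes over the tokens of a single document component."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                         # Laplace smoothing constant
        self.class_docs = Counter()                # class -> number of training documents
        self.token_counts = defaultdict(Counter)   # class -> token -> count
        self.vocab = set()

    def fit(self, token_lists, labels):
        for tokens, label in zip(token_lists, labels):
            self.class_docs[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def log_scores(self, tokens):
        """Return {class: log P(class) + sum of log P(token | class)}."""
        total_docs = sum(self.class_docs.values())
        vocab_size = max(len(self.vocab), 1)       # avoid a zero denominator
        scores = {}
        for label, n_docs in self.class_docs.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            scores[label] = score
        return scores


# Hypothetical weights: components assumed to matter more get a larger weight.
COMPONENT_WEIGHTS = {"title": 2.0, "body": 1.0}


def train_classifiers(docs, labels, components):
    """Train one classifier per component (assumed stand-in for the component tree)."""
    return {
        name: ComponentNaiveBayes().fit([d.get(name, []) for d in docs], labels)
        for name in components
    }


def classify(document, classifiers, weights):
    """Combine per-component log scores by a weighted sum (an assumed rule)."""
    combined = Counter()
    for name, clf in classifiers.items():
        for label, score in clf.log_scores(document.get(name, [])).items():
            combined[label] += weights[name] * score
    return max(combined, key=combined.get)


# Toy training data, invented for this sketch.
docs = [
    {"title": ["stock", "market"], "body": ["shares", "fell", "market"]},
    {"title": ["football", "final"], "body": ["team", "won", "match"]},
]
labels = ["finance", "sport"]
classifiers = train_classifiers(docs, labels, COMPONENT_WEIGHTS)

prediction = classify({"title": ["market"], "body": ["shares"]},
                      classifiers, COMPONENT_WEIGHTS)
```

With the weights above, a document whose title and body disagree is pulled toward the class suggested by the title, which is one simple way to let a structurally important component count for more than running text.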


Cite: Fei Zhu, Jiong Yang, Yong Zhou, "Enriched Format Text Categorization Using A Component Similarity Approach," Journal of Software, vol. 6, no. 9, pp. 1713-1720, 2011.
