Volume 6 Number 4 (Apr. 2011)
Home > Archive > 2011 > Volume 6 Number 4 (Apr. 2011) >
JSW 2011 Vol.6(4): 620-627 ISSN: 1796-217X
doi: 10.4304/jsw.6.4.620-627

A Data-drive Feature Selection Method in Text Categorization

Yan Xu
Beijing Language And Culture University ,Beijing, China; Institute of Computing Technology,Chinese Academy of Sciences

Abstract—Text Categorization (TC) is the process of grouping texts into one or more predefined categories based on their content. It has become a key technique for handling and organizing text data. One of the most important issues in TC is Feature Selection (FS). Many FS methods have been put forward and widely used in TC field, such as Information Gain (IG), Document Frequency thresholding (DF) and Mutual Information. Empirical studies show that some of these (e.g. IG, DF) produce better categorization performance than others (e.g. MI). A basic research question is why these FS methods cause different performance. Many existing works seek to answer this question based on empirical studies. In this paper, we present a formal study of FS in TC. We first define three desirable constraints that any reasonable FS function should satisfy, then check these constraints on some popular FS methods, including IG, DF, MI and two other methods. We find that IG satisfies the first two constraints, and that there are strong statistical correlations between DF and the first constraint, whilst MI does not satisfy any of the constraints. Experimental results indicate that the empirical performance of a FS function is tightly related to how well it satisfies these constraints and none of the investigated FS functions can satisfy all the three constraints at the same time. Finally we present a novel framework for developing FS functions which satisfy all the three constraints, and design several new FS functions using this framework. Experimental results on Reuters21578 and Newsgroup corpora show that our new FS function DFICF outperforms IG and DF when using either Micro- or Macro-averagedmeasures.

Index Terms—Feature selection, text categorization, Constraints


Cite: Yan Xu, "A Data-drive Feature Selection Method in Text Categorization," Journal of Software vol. 6, no. 4, pp. 620-627, 2011.

General Information

ISSN: 1796-217X (Online)
Frequency:  Bimonthly (Since 2020)
Editor-in-Chief: Prof. Antanas Verikas
Executive Editor: Ms. Yoyo Y. Zhou
Abstracting/ Indexing: DBLP, EBSCO, Google Scholar, ProQuest, INSPEC(IET), ULRICH's Periodicals Directory, WorldCat, etc
E-mail: jsw@iap.org
  • Apr 26, 2021 News!

    Vol 14, No 4- Vol 14, No 12 has been indexed by IET-(Inspec)     [Click]

  • Jun 22, 2020 News!

    Papers published in JSW Vol 14, No 1- Vol 15 No 4 have been indexed by DBLP     [Click]

  • Sep 13, 2021 News!

    The papers published in Vol 16, No 6 have all received dois from Crossref    [Click]

  • Jan 28, 2021 News!

    [CFP] 2021 the annual meeting of JSW Editorial Board, ICCSM 2021, will be held in Rome, Italy, July 21-23, 2021   [Click]

  • Sep 13, 2021 News!

    Vol 16, No 6 has been published with online version     [Click]