Volume 6 Number 4 (Apr. 2011)
JSW 2011 Vol.6(4): 620-627 ISSN: 1796-217X
doi: 10.4304/jsw.6.4.620-627

A Data-drive Feature Selection Method in Text Categorization

Yan Xu
Beijing Language and Culture University, Beijing, China; Institute of Computing Technology, Chinese Academy of Sciences

Abstract—Text Categorization (TC) is the process of grouping texts into one or more predefined categories based on their content. It has become a key technique for handling and organizing text data. One of the most important issues in TC is Feature Selection (FS). Many FS methods have been proposed and are widely used in the TC field, such as Information Gain (IG), Document Frequency thresholding (DF) and Mutual Information (MI). Empirical studies show that some of these (e.g. IG, DF) produce better categorization performance than others (e.g. MI). A basic research question is why these FS methods perform differently. Most existing work seeks to answer this question through empirical studies. In this paper, we present a formal study of FS in TC. We first define three desirable constraints that any reasonable FS function should satisfy, then check these constraints on some popular FS methods, including IG, DF, MI and two other methods. We find that IG satisfies the first two constraints and that there are strong statistical correlations between DF and the first constraint, while MI does not satisfy any of the constraints. Experimental results indicate that the empirical performance of an FS function is tightly related to how well it satisfies these constraints, and that none of the investigated FS functions satisfies all three constraints at the same time. Finally, we present a novel framework for developing FS functions that satisfy all three constraints, and design several new FS functions using this framework. Experimental results on the Reuters21578 and Newsgroup corpora show that our new FS function DFICF outperforms IG and DF under both micro- and macro-averaged measures.

Index Terms—Feature selection, text categorization, constraints
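
For readers unfamiliar with the three baseline FS functions named in the abstract, the following is a minimal sketch of DF, IG and MI using their common textbook definitions over binary term presence. It is not taken from the paper: the toy corpus `DOCS`, the function names and the helper `_entropy` are all hypothetical, and the exact formulations used by the author may differ. The paper's three constraints and the DFICF function are not defined in the abstract, so they are not sketched here.

```python
import math
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration only):
# each entry is (set of terms in the document, category label).
DOCS = [
    ({"wheat", "price", "export"}, "grain"),
    ({"wheat", "harvest"}, "grain"),
    ({"oil", "price", "barrel"}, "crude"),
    ({"oil", "export"}, "crude"),
]

def document_frequency(term, docs):
    """DF: number of documents that contain the term."""
    return sum(1 for terms, _ in docs if term in terms)

def _entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(term, docs):
    """IG: category entropy minus the entropy left after splitting on term presence."""
    with_t = [label for terms, label in docs if term in terms]
    without_t = [label for terms, label in docs if term not in terms]
    n = len(docs)
    remaining = (len(with_t) / n) * _entropy(with_t) \
        + (len(without_t) / n) * _entropy(without_t)
    return _entropy([label for _, label in docs]) - remaining

def mutual_information(term, category, docs):
    """Pointwise MI of term presence and one category: log2(P(t,c) / (P(t) P(c)))."""
    n = len(docs)
    p_t = document_frequency(term, docs) / n
    p_c = sum(1 for _, label in docs if label == category) / n
    p_tc = sum(1 for terms, label in docs
               if term in terms and label == category) / n
    if p_tc == 0:
        return float("-inf")  # term never occurs in this category
    return math.log2(p_tc / (p_t * p_c))

if __name__ == "__main__":
    for t in ("wheat", "harvest", "price"):
        print(t, document_frequency(t, DOCS),
              round(information_gain(t, DOCS), 4),
              mutual_information(t, "grain", DOCS))
```

On this toy corpus, `mutual_information` assigns the same score to "harvest" (one document) as to "wheat" (two documents), whereas `information_gain` ranks "wheat" higher. This only illustrates how the scores can diverge on one small example; it says nothing about the constraint analysis in the paper.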

Cite: Yan Xu, "A Data-drive Feature Selection Method in Text Categorization," Journal of Software, vol. 6, no. 4, pp. 620-627, 2011.
