Volume 6 Number 4 (Apr. 2011)
JSW 2011 Vol.6(4): 620-627 ISSN: 1796-217X
doi: 10.4304/jsw.6.4.620-627

A Data-drive Feature Selection Method in Text Categorization

Yan Xu
Beijing Language and Culture University, Beijing, China; Institute of Computing Technology, Chinese Academy of Sciences

Abstract—Text Categorization (TC) is the process of grouping texts into one or more predefined categories based on their content. It has become a key technique for handling and organizing text data. One of the most important issues in TC is Feature Selection (FS). Many FS methods have been proposed and are widely used in the TC field, such as Information Gain (IG), Document Frequency thresholding (DF) and Mutual Information (MI). Empirical studies show that some of these (e.g. IG, DF) produce better categorization performance than others (e.g. MI). A basic research question is why these FS methods perform differently. Many existing works seek to answer this question through empirical studies. In this paper, we present a formal study of FS in TC. We first define three desirable constraints that any reasonable FS function should satisfy, then check these constraints against some popular FS methods, including IG, DF, MI and two other methods. We find that IG satisfies the first two constraints, and that there are strong statistical correlations between DF and the first constraint, whilst MI does not satisfy any of the constraints. Experimental results indicate that the empirical performance of an FS function is tightly related to how well it satisfies these constraints, and that none of the investigated FS functions satisfies all three constraints at the same time. Finally, we present a novel framework for developing FS functions that satisfy all three constraints, and design several new FS functions using this framework. Experimental results on the Reuters21578 and Newsgroup corpora show that our new FS function DFICF outperforms IG and DF on both micro- and macro-averaged measures.
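
For readers unfamiliar with the three baseline scores named in the abstract, the following is a minimal sketch of DF, IG and MI computed over a toy corpus. It uses the common textbook definitions, which may differ in detail from the formulations used in the paper; the corpus, the category names and all function names are hypothetical and chosen only for illustration. DFICF is not sketched because the abstract does not define it.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is (set of terms, category label).
DOCS = [
    ({"wheat", "price", "export"}, "grain"),
    ({"wheat", "harvest"}, "grain"),
    ({"oil", "price", "barrel"}, "crude"),
    ({"oil", "export"}, "crude"),
]


def document_frequency(term, docs):
    """DF: number of documents that contain the term."""
    return sum(1 for terms, _ in docs if term in terms)


def _label_entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    if not labels:
        return 0.0
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(term, docs):
    """IG: drop in category entropy once we know whether the term occurs."""
    present = [label for terms, label in docs if term in terms]
    absent = [label for terms, label in docs if term not in terms]
    n = len(docs)
    conditional = (
        len(present) / n * _label_entropy(present)
        + len(absent) / n * _label_entropy(absent)
    )
    return _label_entropy([label for _, label in docs]) - conditional


def mutual_information(term, category, docs):
    """Pointwise MI between a term and one category: log2 P(t,c) / (P(t) P(c))."""
    n = len(docs)
    joint = sum(1 for terms, label in docs if term in terms and label == category)
    df = document_frequency(term, docs)
    in_category = sum(1 for _, label in docs if label == category)
    if joint == 0:
        return float("-inf")  # term never co-occurs with the category
    return math.log2(joint * n / (df * in_category))


if __name__ == "__main__":
    for term in ("wheat", "harvest", "price"):
        print(
            term,
            document_frequency(term, DOCS),
            round(information_gain(term, DOCS), 4),
            mutual_information(term, "grain", DOCS),
        )
```

On this toy corpus, "wheat" (in both grain documents) and "harvest" (in only one) receive the same MI score for the "grain" category, while IG ranks "wheat" clearly higher. This is only an illustration of how the scores can disagree on rare terms, not a reproduction of the paper's analysis.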

Index Terms—Feature selection, text categorization, constraints


Cite: Yan Xu, "A Data-drive Feature Selection Method in Text Categorization," Journal of Software, vol. 6, no. 4, pp. 620-627, 2011.
