Volume 6 Number 3 (Mar. 2011)
JSW 2011 Vol.6(3): 413-420 ISSN: 1796-217X
doi: 10.4304/jsw.6.3.413-420

An Improvement to TF-IDF: Term Distribution based Term Weight Algorithm

Tian Xia1, Yanmei Chai2

1Shanghai Second Polytechnic University, Shanghai, China
2Central University of Finance and Economics, Beijing, China


Abstract—In document formalization, the term weight algorithm plays an important role: it strongly affects the precision and recall of natural language processing (NLP) systems. Currently, the TF-IDF term weight algorithm is widely applied in language models to build NLP systems. Because term frequency is not the only discriminator that should be considered when weighting terms so that each weight reflects the term's importance, we investigated other statistical characteristics of terms and found an important discriminator: term distribution. Furthermore, we found that, in a single document, a term with higher frequency and a distribution close to hypo-dispersion usually carries much semantic information and should be given a higher weight. On the other hand, in a document collection, a term with higher frequency and a hypo-dispersion distribution usually carries less information. Based on this hypothesis, and leveraging the Pearson chi-square test statistic, this paper puts forward a Term Distribution based Local Term Weight Algorithm and a Global Term Weight Algorithm. The experimental results at the end of the paper confirm the reliability and efficiency of the algorithms.

Index Terms—TF, IDF, Term Weight, Natural Language Processing
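
As a rough illustration of the statistic the abstract names, the sketch below computes a Pearson chi-square value for how one term's occurrences spread across equal slices of a single document, compared with a perfectly even spread. The function name `chi_square_dispersion`, the `num_segments` parameter, and the segmentation scheme are assumptions made for this example; the abstract does not give the paper's actual formulas or say how the statistic is turned into a local or global weight.

```python
def chi_square_dispersion(tokens, term, num_segments=4):
    """Pearson chi-square statistic of a term's counts across document segments.

    Hypothetical helper: the document is cut into `num_segments` contiguous
    slices, and the observed count of `term` in each slice is compared with
    the count expected if occurrences were spread in proportion to slice length.
    A value of 0.0 means a perfectly even spread; larger values mean the
    occurrences are more concentrated in some slices.
    """
    if num_segments < 1:
        raise ValueError("num_segments must be at least 1")
    n = len(tokens)
    if n == 0:
        return 0.0

    observed = [0] * num_segments
    seg_len = [0] * num_segments
    for i, tok in enumerate(tokens):
        seg = i * num_segments // n  # index of the slice holding position i
        seg_len[seg] += 1
        if tok == term:
            observed[seg] += 1

    total = sum(observed)
    if total == 0:
        return 0.0  # the term does not occur, so there is nothing to measure

    stat = 0.0
    for obs, length in zip(observed, seg_len):
        if length == 0:
            continue  # empty slice (document shorter than num_segments)
        expected = total * length / n
        stat += (obs - expected) ** 2 / expected
    return stat


even = ["a", "b", "a", "b", "a", "b", "a", "b"]
clumped = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(chi_square_dispersion(even, "a"), chi_square_dispersion(clumped, "a"))
```

Both sample documents contain the term "a" four times, so plain term frequency cannot tell them apart, while the statistic can. That is the kind of extra signal the abstract refers to as term distribution.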

Cite: Tian Xia, Yanmei Chai, "An Improvement to TF-IDF: Term Distribution based Term Weight Algorithm," Journal of Software vol. 6, no. 3, pp. 413-420, 2011.
