Volume 13 Number 11 (Nov. 2018)
JSW 2018 Vol.13(11): 621-629 ISSN: 1796-217X
doi: 10.17706/jsw.13.11.621-629

An Effective Method to Extract Web Content Information

Pan Suhan, Li Zhiqiang*, Dai Juan

College of Information Engineering, Yangzhou University, Yangzhou, China.

Abstract—To simplify web text content extraction and improve its accuracy, a new extraction method based on text-punctuation distribution and tag features (TPDT) is proposed. The method combines the distribution of text punctuation with tag features: it calculates the text-punctuation density of different text blocks and takes the maximum continuous sum of density to extract the best text content from web pages. The method effectively filters noisy information and extracts text content without training or manual processing. Experimental results on web pages randomly selected from different portal websites show that the TPDT method has good applicability to various news pages.
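
The idea summarized in the abstract (score each text block by its punctuation density, then keep the contiguous run of blocks with the largest total score) can be illustrated with a minimal sketch. This is not the authors' implementation: the density formula, the `threshold` offset, the punctuation set, and the function names `block_score`, `max_contiguous_run`, and `extract_content` are all assumptions made for illustration, and the tag features of TPDT are reduced here to naive tag stripping with a regular expression.

```python
import re

# Assumption: which marks count as "text punctuation" is not given in the
# abstract; this set (ASCII plus common full-width marks) is illustrative.
PUNCTUATION = set(",.;:!?" + ",。;:!?、")

# Naive tag stripper; a real extractor would use an HTML parser.
TAG_RE = re.compile(r"<[^>]+>")


def block_score(block_html, threshold=0.02):
    """Hypothetical score: punctuation marks per visible character,
    shifted by `threshold` so punctuation-free blocks score negative."""
    text = TAG_RE.sub("", block_html).strip()
    if not text:
        return -threshold
    punct = sum(1 for ch in text if ch in PUNCTUATION)
    return punct / len(text) - threshold


def max_contiguous_run(scores):
    """Kadane's algorithm: return (start, end, best_sum), with inclusive
    indices, for the contiguous run of scores with the largest sum."""
    best_sum = float("-inf")
    best_start = best_end = 0
    cur_sum = 0
    cur_start = 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_sum = score
            cur_start = i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum = cur_sum
            best_start, best_end = cur_start, i
    return best_start, best_end, best_sum


def extract_content(blocks):
    """Return the visible text of the best-scoring contiguous run of blocks."""
    scores = [block_score(b) for b in blocks]
    start, end, _ = max_contiguous_run(scores)
    return [TAG_RE.sub("", b).strip() for b in blocks[start:end + 1]]


if __name__ == "__main__":
    page = [
        "<a>Home</a>",
        "<p>First sentence, with detail. Second one!</p>",
        "<p>More text; and, again, more.</p>",
        "<a>About us</a>",
    ]
    for line in extract_content(page):
        print(line)
```

In this sketch, navigation links contain no punctuation and therefore score below zero, so the maximum-sum run tends to start and end at punctuation-rich paragraphs. The `threshold` value controls how aggressively sparse blocks are excluded and would need tuning on real pages.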

Cite: Pan Suhan, Li Zhiqiang, Dai Juan, "An Effective Method to Extract Web Content Information," Journal of Software, vol. 13, no. 11, pp. 621-629, 2018.
