JSW 2018 Vol.13(11): 621-629 ISSN: 1796-217X
doi: 10.17706/jsw.13.11.621-629
An Effective Method to Extract Web Content Information
Pan Suhan, Li Zhiqiang*, Dai Juan
College of Information Engineering, Yangzhou University, Yangzhou, China.
Abstract—To simplify web text content extraction and improve its accuracy, a new extraction method based on text-punctuation distribution and tag features (TPDT) is proposed. The method combines the distribution of text punctuation with tag features: it calculates the text-punctuation density in different text blocks and takes the maximum continuous sum of density to extract the best text content from web pages. The method effectively solves the problems of noisy-information filtering and text content extraction without training or manual processing. Experimental results on web pages randomly selected from different portal websites show that the TPDT method has good applicability to various news pages.
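The abstract's core step, scoring text blocks by punctuation density and keeping the contiguous run of blocks with the maximum summed score, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not give the TPDT density formula or say how tag features are used, so the `PUNCTUATION` set, the `block_score` formula, the `threshold` value, and all function names below are assumptions made purely for illustration. The maximum-sum search itself is the standard Kadane algorithm.

```python
# Illustrative sketch only: the scoring formula, threshold, and names are
# assumptions; the abstract does not specify the TPDT formulas.

PUNCTUATION = set(",.;:!?，。；：！？、")  # assumed set of sentence punctuation


def block_score(block: str, threshold: float = 0.02) -> float:
    """Assumed score: punctuation count minus threshold * block length.

    Blocks with little punctuation (menus, footers) come out negative,
    and punctuation-rich prose comes out positive.
    """
    text = block.strip()
    punct_count = sum(ch in PUNCTUATION for ch in text)
    return punct_count - threshold * len(text)


def max_sum_run(scores):
    """Kadane's algorithm: (start, end) of the contiguous run with the
    largest sum, end exclusive. Returns (0, 0) for an empty list."""
    best_sum, best_start, best_end = float("-inf"), 0, 0
    cur_sum, cur_start = 0.0, 0
    for i, score in enumerate(scores):
        if cur_sum <= 0:
            # A non-positive running sum cannot help; restart the run here.
            cur_sum, cur_start = score, i
        else:
            cur_sum += score
        if cur_sum > best_sum:
            best_sum, best_start, best_end = cur_sum, cur_start, i + 1
    return best_start, best_end


def extract_content(blocks, threshold: float = 0.02):
    """Return the contiguous blocks whose summed score is maximal."""
    scores = [block_score(b, threshold) for b in blocks]
    start, end = max_sum_run(scores)
    return blocks[start:end]


if __name__ == "__main__":
    page_blocks = [
        "Home | News | Sports",
        "The city council met on Monday, and it approved the budget.",
        "Officials said the plan, which was debated for weeks, will take effect in May.",
        "Copyright 2018 All rights reserved",
    ]
    for block in extract_content(page_blocks):
        print(block)
```

Subtracting a length-proportional threshold is one simple way to give noise blocks negative scores, which the maximum-sum search needs in order to cut off navigation and footer text; the paper may weight blocks differently.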
Cite: Pan Suhan, Li Zhiqiang, Dai Juan, "An Effective Method to Extract Web Content Information," Journal of Software, vol. 13, no. 11, pp. 621-629, 2018.