The Automatic Extraction of Web Information Based on Regular Expression
2Zhejiang Provincial Key Laboratory of Forestry Intelligent Monitoring and Information Technology, Zhejiang A&F University, Lin’an 311300, China.
3Jiande Xin'anjiang Woodland, Jiande 311600.
Abstract—This paper builds a search-engine-based model for retrieving, matching, and structurally extracting Web information, and implements an algorithm that locates and automatically extracts Baidu news results across multiple result pages. The search-result URLs are analyzed to derive a standard mathematical expression for them, and the DOM tree structure of the web pages is analyzed to design regular expressions for the key tags. Together, these realize multi-page location retrieval and structured extraction based on a search engine. The experimental results show an average extraction rate of 99.60% and a matching ratio of 99.56%. The method can be used to structure Web information, extract it automatically, and save it locally.
Index Terms—Search engine; extraction model of web information; regular expression; web information.
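The two steps named in the abstract, generating result-page URLs from a template and matching key tags with a regular expression, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the base address, the `q` and `offset` parameter names, the `<h3 class="title">` markup, and the function names `page_urls` and `extract_results` are all hypothetical, chosen only to show the general shape of the approach.

```python
import re
from urllib.parse import urlencode

# Hypothetical URL template: the base address and the parameter names
# ("q", "offset") are assumptions, not the search engine's real format.
BASE_URL = "https://search.example.com/news"


def page_urls(query, pages, page_size=10):
    """Return one result-page URL per page, stepping the offset by page_size."""
    return [
        BASE_URL + "?" + urlencode({"q": query, "offset": n * page_size})
        for n in range(pages)
    ]


# Hypothetical key-tag pattern: each result is assumed to be marked up as
# <h3 class="title"><a href="...">...</a></h3>.
RESULT_RE = re.compile(
    r'<h3\s+class="title">\s*<a\s+href="(?P<url>[^"]+)"[^>]*>'
    r"(?P<title>.*?)</a>\s*</h3>",
    re.S,
)
TAG_RE = re.compile(r"<[^>]+>")  # strips inline tags such as <em> from titles


def extract_results(html):
    """Return a list of {'url', 'title'} records matched in one page of HTML."""
    records = []
    for match in RESULT_RE.finditer(html):
        title = TAG_RE.sub("", match.group("title")).strip()
        records.append({"url": match.group("url"), "title": title})
    return records


# Invented sample markup, used only to exercise the pattern above.
SAMPLE = """
<div><h3 class="title"><a href="http://a.example/1" target="_blank">Forest <em>fire</em> warning</a></h3></div>
<div><h3 class="title">
  <a href="http://b.example/2">Timber prices</a></h3></div>
"""

if __name__ == "__main__":
    for url in page_urls("forest", 2):
        print(url)
    for record in extract_results(SAMPLE):
        print(record)
```

In a sketch like this, each extracted record is a plain dictionary, so the results from several pages could be concatenated and written to a local file in whatever structured form is convenient.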
Cite: Li Ji, Jiang Guangyu, Xu Aijun, Wang Yunzhen, "The Automatic Extraction of Web Information Based on Regular Expression," Journal of Software, vol. 12, no. 3, pp. 180-188, 2017.