The Automatic Extraction of Web Information Based on Regular Expression

JSW 2017 Vol.12(3): 180-188 ISSN: 1796-217X
doi: 10.17706/jsw.12.3.180-188

Li Ji¹,², Jiang Guangyu¹,², Xu Aijun¹,²*, Wang Yunzhen³

¹School of Information Engineering, Zhejiang A&F University, Lin’an 311300, China.
²Zhejiang Provincial Key Laboratory of Forestry Intelligent Monitoring and Information Technology, Zhejiang A&F University, Lin’an 311300, China.
³Jiande Xin'anjiang Woodland, Jiande 311600.

Abstract—This paper builds a search-engine-based model for Web information retrieval matching and structured extraction, and implements an algorithm that locates and automatically extracts Baidu news information across multiple result pages. The standard mathematical expression of the result URLs is obtained by analyzing the search-result URLs, and the regular expressions for the key tags are designed by analyzing the DOM tree structure of the web pages. Together, these realize multi-page location retrieval and structured extraction based on a search engine. The experimental results show an average extraction result of 99.60% and a matching ratio of 99.56%. The method can be used for structuring, automatically extracting, and locally storing Web information.

Index Terms—Search engine; extraction model of web information; regular expression; web information.
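
The two steps named in the abstract, a regular URL pattern for paging through search results and a regular expression over key tags, can be illustrated with a minimal sketch. It is not the authors' implementation: the URL template, the `word` and `pn` parameter names, the page size, and the `<h3 class="c-title">` markup are all hypothetical placeholders, since the abstract does not give the actual URL expression or tag patterns.

```python
import re
from html import unescape
from urllib.parse import quote

# Hypothetical URL template and page size; the abstract does not state the real ones.
URL_TEMPLATE = "https://news.example.com/search?word={query}&pn={offset}"
RESULTS_PER_PAGE = 10

def page_urls(query, pages):
    """Build one URL per result page, assuming a linear offset parameter."""
    encoded = quote(query)
    return [
        URL_TEMPLATE.format(query=encoded, offset=i * RESULTS_PER_PAGE)
        for i in range(pages)
    ]

# Hypothetical key-tag pattern: a heading that wraps a single link.
ITEM_RE = re.compile(
    r'<h3\s+class="c-title"[^>]*>\s*'
    r'<a\s+href="(?P<url>[^"]+)"[^>]*>(?P<title>.*?)</a>\s*</h3>',
    re.IGNORECASE | re.DOTALL,
)
TAG_RE = re.compile(r"<[^>]+>")  # strips inline markup left inside a title

def extract_items(html):
    """Return a list of {'title', 'url'} records matched in one result page."""
    items = []
    for match in ITEM_RE.finditer(html):
        title = unescape(TAG_RE.sub("", match.group("title"))).strip()
        items.append({"title": title, "url": unescape(match.group("url"))})
    return items

# Made-up page fragment used only to exercise the pattern.
SAMPLE_HTML = """
<div class="result">
  <h3 class="c-title"><a href="https://example.com/a?id=1&amp;x=2" target="_blank">First <em>news</em> item</a></h3>
</div>
<div class="result">
  <h3 class="c-title"><a href="https://example.com/b">Second item</a></h3>
</div>
"""

if __name__ == "__main__":
    for url in page_urls("web news", 3):
        print(url)
    for item in extract_items(SAMPLE_HTML):
        print(item)
```

In a real crawler each URL from `page_urls` would be fetched and its HTML passed to `extract_items`; the sketch leaves fetching out so that it runs offline. A regular expression tied to one site's markup is brittle by nature, so the pattern would need to be re-derived whenever the page structure changes.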

Cite: Li Ji, Jiang Guangyu, Xu Aijun, Wang Yunzhen, "The Automatic Extraction of Web Information Based on Regular Expression," Journal of Software, vol. 12, no. 3, pp. 180-188, 2017.

Volume 12 Number 3 (Mar. 2017)
