Volume 9 Number 3 (Mar. 2014)
JSW 2014 Vol.9(3): 641-647 ISSN: 1796-217X
doi: 10.4304/jsw.9.3.641-647

A DOM-based Anchor-Hop-T Method for Web Application Information Extraction

Yuanyuan Zhang1, Qinyan Zhang2, Guanfu Jiang3
1College of Information Technology, Zhejiang Chinese Medical University, Hangzhou 310053, China
2Computer Center, Zhejiang University, Hangzhou 310058, China
3College of Computer Science, Zhejiang University, Hangzhou 315100, China

Abstract—A widely adopted approach to fusing information about electronic products is to extract it from the HTML structure of business websites and then process the data in depth. However, Web applications are hard to model because the data in HTML is semi-formal and appears as a DOM (Document Object Model) tree when it is analyzed with an XML schema. How to understand and extract this information must therefore be studied first. The general Anchor-Hop model, which considers the text property and the label property, handles this problem simply, but its effectiveness is low. The model is sensitive to the HTML structure: if the website structure changes even slightly, extraction accuracy suffers, and the extraction rules must be redefined for the changed structure. To improve extraction efficiency, this paper proposes Anchor-Hop-T, a DOM-based dynamic information extraction model. HTML tags including table, ol, and ul can be searched and processed using XPath, which makes it convenient to extract the corresponding Anchor data block. Furthermore, the location of the Hop point is treated as invariant, and on this basis the new model, built on Anchor and Hop points, introduces further concepts for information extraction, such as the Anchor data block, the Anchor locating library, and the AH relevance value. Finally, an experiment is presented to demonstrate the applicability of the approach.

Index Terms—Web Application, Information Extraction, DOM, Anchor-Hop-T Model
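
The XPath step mentioned in the abstract (searching table, ol, and ul tags to find candidate Anchor data blocks) can be illustrated with a minimal sketch. It is not the authors' implementation: the page fragment, the anchor labels, and the function names `candidate_blocks` and `anchor_blocks` are all hypothetical, and the sketch uses the limited XPath subset of Python's standard `xml.etree.ElementTree`, which requires well-formed markup. It covers only the block search and does not model Hop points, the Anchor locating library, or the AH relevance value.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed product page fragment (not taken from the paper).
PAGE = """
<html><body>
  <div id="nav"><ul><li>Home</li><li>Phones</li></ul></div>
  <table>
    <tr><td>Brand</td><td>Acme</td></tr>
    <tr><td>Screen size</td><td>5.0 inch</td></tr>
  </table>
  <ol><li>Price: 199</li></ol>
</body></html>
"""

# Hypothetical anchor labels that mark the data we want to extract.
ANCHORS = {"Brand", "Screen size", "Price"}


def candidate_blocks(root):
    """Collect all table, ol and ul elements using simple XPath queries."""
    blocks = []
    for tag in ("table", "ol", "ul"):
        blocks.extend(root.findall(f".//{tag}"))
    return blocks


def anchor_blocks(root, anchors):
    """Keep only the blocks whose text contains at least one anchor label."""
    result = []
    for block in candidate_blocks(root):
        text = " ".join(t.strip() for t in block.itertext() if t.strip())
        hits = sorted(a for a in anchors if a in text)
        if hits:
            result.append((block.tag, hits))
    return result


root = ET.fromstring(PAGE)
found = anchor_blocks(root, ANCHORS)
for tag, hits in found:
    print(tag, hits)
```

In this sketch the navigation list is discarded because it contains no anchor label, while the table and the ordered list are kept as candidate blocks. Real pages are rarely well-formed XML, so a tolerant HTML parser would be needed in practice.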


Cite: Yuanyuan Zhang, Qinyan Zhang, Guanfu Jiang, "A DOM-based Anchor-Hop-T Method for Web Application Information Extraction," Journal of Software, vol. 9, no. 3, pp. 641-647, 2014.

General Information

ISSN: 1796-217X (Online)
Frequency:  Bimonthly (Since 2020)
Editor-in-Chief: Prof. Antanas Verikas
Executive Editor: Ms. Yoyo Y. Zhou
Abstracting/Indexing: DBLP, EBSCO, Google Scholar, ProQuest, INSPEC (IET), ULRICH's Periodicals Directory, WorldCat, etc.
E-mail: jsw@iap.org