Volume 6 Number 8 (Aug. 2011)
JSW 2011 Vol.6(8): 1409-1416 ISSN: 1796-217X
doi: 10.4304/jsw.6.8.1409-1416

Automatically Extracting Academic Papers from Web Pages Using Conditional Random Fields Model

Wei Liu, Jianxun Zeng
Institute of Scientific and Technical Information of China, China, 100038

Abstract—A large number of academic papers (including research reports) are being published on web pages. Extracting these papers in a structured form is important for many popular applications, such as science and technology information retrieval and digital libraries. However, few investigations have addressed the problem of academic paper extraction. This paper proposes a unified approach for automatically extracting academic papers from web pages based on the Conditional Random Fields (CRF) model. In the proposed approach, academic paper extraction and semantic labeling are performed simultaneously within a single CRF model. Experimental results show that our approach achieves significantly better extraction results.

Index Terms—Web data extraction, Web intelligence, Machine learning, Conditional Random Fields
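
To illustrate the idea of performing extraction and labeling in one pass, the following is a minimal sketch of Viterbi decoding for a linear-chain CRF over the text lines of a page. It is not the authors' model: the label set, the features, and every weight below are hypothetical and hand-set for illustration, whereas a real CRF would learn its weights from annotated pages.

```python
# Hypothetical label set: lines labeled TITLE or AUTHOR are "extracted",
# lines labeled OTHER are page noise. Labeling and extraction happen together.
LABELS = ["TITLE", "AUTHOR", "OTHER"]


def features(line):
    """Hypothetical binary features for one text line of a web page."""
    words = line.split()
    return {
        "has_comma": "," in line,
        "long": len(words) >= 5,
        "short": len(words) <= 2,
    }


# Hand-set weights (assumptions for this sketch, not learned values).
EMISSION = {
    ("long", "TITLE"): 2.0,
    ("has_comma", "AUTHOR"): 2.0,
    ("short", "OTHER"): 2.0,
}
TRANSITION = {
    ("TITLE", "AUTHOR"): 1.5,
    ("AUTHOR", "OTHER"): 0.5,
    ("OTHER", "TITLE"): 0.5,
    ("TITLE", "TITLE"): -1.0,
}


def emission_score(line, label):
    """Sum the weights of all active features paired with this label."""
    return sum(
        EMISSION.get((name, label), 0.0)
        for name, active in features(line).items()
        if active
    )


def viterbi(lines):
    """Return the highest-scoring label sequence for a list of lines."""
    if not lines:
        return []
    # best[t][y]: best score of any labeling of lines[:t + 1] that ends in y.
    best = [{y: emission_score(lines[0], y) for y in LABELS}]
    back = []
    for line in lines[1:]:
        scores, pointers = {}, {}
        for y in LABELS:
            prev = max(
                LABELS,
                key=lambda p: best[-1][p] + TRANSITION.get((p, y), 0.0),
            )
            scores[y] = (
                best[-1][prev]
                + TRANSITION.get((prev, y), 0.0)
                + emission_score(line, y)
            )
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(LABELS, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


page_lines = [
    "Home",
    "Automatically Extracting Academic Papers from Web Pages",
    "Wei Liu, Jianxun Zeng",
    "PDF",
]
print(list(zip(viterbi(page_lines), page_lines)))
```

Because the transition weights reward a TITLE line followed by an AUTHOR line, the decoder scores the whole sequence jointly rather than classifying each line in isolation, which is the property that lets one model both locate a paper record and label its fields.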


Cite: Wei Liu, Jianxun Zeng, "Automatically Extracting Academic Papers from Web Pages Using Conditional Random Fields Model," Journal of Software, vol. 6, no. 8, pp. 1409-1416, 2011.

General Information

ISSN: 1796-217X (Online)
Frequency: Monthly
Editor-in-Chief: Prof. Antanas Verikas
Executive Editor: Ms. Yoyo Y. Zhou
Abstracting/Indexing: DBLP, EBSCO, ProQuest, INSPEC, ULRICH's Periodicals Directory, WorldCat, etc.
E-mail: jsw@iap.org