RMIT University

Sampling, information extraction and summarisation of hidden web databases

journal contribution
posted on 2024-11-01, 08:32 authored by Yih-Ling Hedley, Muhammad Younas, Anne James, Mark Sanderson
Hidden Web databases maintain a collection of specialised documents, which are dynamically generated using page templates. This paper presents the Two-Phase Sampling (2PS) technique that detects and extracts query-related information from documents contained in databases. 2PS is based on a two-phase framework for the sampling, information extraction and summarisation of Hidden Web documents. In the first phase, 2PS samples and stores documents for further analysis. In the second phase, it detects Web page templates from sampled documents and extracts relevant information from which a content summary is then generated. Experimental results demonstrate that 2PS effectively eliminates irrelevant information from sampled documents and generates terms and frequencies with improved accuracy.
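
The second phase described in the abstract can be illustrated with a small sketch. It is not the 2PS algorithm itself: the abstract does not specify how templates are detected, so this example assumes, purely for illustration, that a line counts as template text when it appears in most sampled documents. The function names, the `min_share` threshold and the sample documents are all hypothetical, and the first phase (query-based sampling) is represented only by a ready-made list of documents.

```python
import re
from collections import Counter


def detect_template_lines(documents, min_share=0.8):
    """Return lines that occur in at least min_share of the documents.

    Assumption for illustration only: repeated lines are template text.
    """
    line_counts = Counter()
    for doc in documents:
        # Count each distinct non-empty line once per document.
        line_counts.update({line.strip() for line in doc.splitlines() if line.strip()})
    threshold = min_share * len(documents)
    return {line for line, count in line_counts.items() if count >= threshold}


def summarise(documents, min_share=0.8):
    """Build a term-frequency summary from non-template lines only."""
    template = detect_template_lines(documents, min_share)
    term_freq = Counter()
    for doc in documents:
        for line in doc.splitlines():
            line = line.strip()
            if line and line not in template:
                term_freq.update(re.findall(r"[a-z]+", line.lower()))
    return term_freq


# Hypothetical stand-in for documents gathered in the sampling phase.
sampled = [
    "Search results\nAspirin relieves pain\nCopyright notice",
    "Search results\nIbuprofen reduces inflammation\nCopyright notice",
    "Search results\nAspirin thins blood\nCopyright notice",
]
summary = summarise(sampled)
```

In this toy input, the lines shared by every document are dropped before counting, so only terms from the document-specific lines contribute to the summary.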

Related Materials

  1. DOI - Is published in 10.1016/j.datak.2006.01.009
  2. ISSN - Is published in 0169023X

Journal

Data & Knowledge Engineering

Volume

59

Issue

2

Start page

213

End page

230

Total pages

18

Publisher

Elsevier BV, North-Holland

Place published

Netherlands

Language

English

Copyright

© 2006 Published by Elsevier B.V.

Former Identifier

2006021649

Esploro creation date

2020-06-22

Fedora creation date

2013-02-25
