International Journal of Innovative Research in Engineering & Multidisciplinary Physical Sciences
E-ISSN: 2349-7300
Impact Factor: 9.907

Web Content Extraction Using Machine Learning

Authors: Prof. Yagnesh Tiwari

DOI: https://doi.org/10.17605/OSF.IO/2A35N

Short DOI: https://doi.org/ggkv66


Abstract: An extraction model aims to separate the main content of a web page from noise. We define content as a continuous, meaningful body of text on a web page that can be used to summarize the required topic concisely. Noise, on the other hand, is any element of a web page that deviates from the main content, such as copyright disclaimers, advertisements, and navigation elements. Boilerplate templates form a major extraction criterion for boilerplate detection algorithms.

Keywords: Content Extraction, Boilerplate Removal, Template Detection, CETR.
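
The abstract does not spell out an algorithm, but the keyword CETR is commonly expanded as Content Extraction via Tag Ratios, which scores each line of HTML by how much visible text it carries per tag. The following is a minimal sketch of that idea in Python, not the paper's implementation. The function names, the sample page, and the fixed threshold of 10.0 are all assumptions made for illustration; the published tag-ratio approach smooths the ratios and clusters them rather than applying a hard cutoff.

```python
import re
from typing import List

# Matches any HTML tag, opening or closing (a simplification: no comment or script handling).
TAG_RE = re.compile(r"<[^>]*>")


def tag_ratio(line: str) -> float:
    """Return non-tag characters per tag for one line of HTML.

    A line with no tags is treated as having one tag to avoid division by zero.
    """
    tags = TAG_RE.findall(line)
    text = TAG_RE.sub("", line).strip()
    return len(text) / max(len(tags), 1)


def extract_content(html: str, threshold: float = 10.0) -> List[str]:
    """Keep the text of lines whose tag ratio exceeds the threshold.

    The threshold is a hypothetical fixed value chosen for this example.
    """
    kept = []
    for line in html.splitlines():
        if tag_ratio(line) > threshold:
            kept.append(TAG_RE.sub("", line).strip())
    return kept


# Hypothetical three-line page: navigation, main content, footer.
SAMPLE = """<div class="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<p>Content extraction separates the main text of a page from the surrounding template.</p>
<div class="footer"><span>Copyright 2014</span></div>"""

if __name__ == "__main__":
    for text in extract_content(SAMPLE):
        print(text)
```

On this sample, the navigation and footer lines are dense in tags and short on text, so they fall below the threshold, while the paragraph line is kept. Real pages rarely separate this cleanly, which is why a fixed threshold should be read only as a stand-in for a learned or clustered decision boundary.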


Paper Id: 549

Published On: 2014-04-27

Published In: Volume 2, Issue 2, March-April 2014

Cite This: Web Content Extraction Using Machine Learning - Prof. Yagnesh Tiwari - IJIRMPS Volume 2, Issue 2, March-April 2014. DOI 10.17605/OSF.IO/2A35N