Boilerplate removal and Content Extraction from Dynamic Web Pages

San, Pan Ei


http://hdl.handle.net/20.500.12678/0000004270

Name / File	License	Actions
Boilerplate removal and content extraction(ijren).pdf (596 Kb)

Publication type
		Article
Upload type
		Publication
Title
	Title	Boilerplate removal and Content Extraction from Dynamic Web Pages
	Language	en
Publication date		2014-12
Authors
		San, Pan Ei
Description
		Web pages contain not only main content but also other elements such as navigation panels, advertisements, and links to related documents. To ensure high-quality web page content, a good boilerplate removal algorithm is needed to extract only the relevant content from a web page. The main textual content is embedded in the HTML source code that makes up the page files. The goal of content extraction, or boilerplate detection, is to separate the main content from navigation chrome, advertising blocks, and copyright notices in web pages. The system removes boilerplate and extracts the main content in two phases: a Feature Extraction phase and a Clustering phase. It classifies each part of an HTML web page as noise or content. The content extraction algorithm is designed to achieve high performance without parsing DOM trees. Observation of the HTML tags shows that a single line may not contain a complete piece of information and that long texts are distributed across nearby lines, so the system uses the Line-Block concept to determine the distance between any two neighboring lines that contain text. It then extracts features, namely text-to-tag ratio (TTR), anchor text-to-text ratio (ATTR), and a new content feature, Title Keywords Density (TKD), to distinguish noise from content. After extracting the features, the system uses them as parameters in a threshold method to classify each block as content or non-content.
Keywords
		content extraction, line-block, TKD, TTR, ATTR
Identifier		http://onlineresource.ucsy.edu.mm/handle/123456789/2127
Journal articles
		International Journal of Computer Science, Engineering and Applications (IJCSEA)
Conference papers
Books/reports/chapters
Thesis/dissertations
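
The description names the Line-Block concept, three features (TTR, ATTR, TKD), and a threshold method, but gives no formulas. The Python sketch below is one possible reading, not the author's implementation. Everything specific in it is an assumption made for illustration: the function names, the max_gap parameter, the per-block definitions (TTR as text characters per tag, ATTR as anchor-text characters over all text characters, TKD as the share of block words that also occur in the page title), and the threshold values. It works on raw HTML lines with regular expressions, in keeping with the stated goal of avoiding DOM parsing.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def strip_tags(html):
    """Remove anything that looks like an HTML tag (no DOM parsing)."""
    return TAG_RE.sub("", html)


def line_blocks(html, max_gap=2):
    """Group text-bearing lines whose line-number distance is <= max_gap.

    Lines that hold no visible text are skipped, so a block contains only
    the lines that carry text. max_gap is a hypothetical parameter.
    """
    blocks, current, last = [], [], None
    for i, line in enumerate(html.splitlines()):
        if not strip_tags(line).strip():
            continue
        if last is not None and i - last > max_gap:
            blocks.append("\n".join(current))
            current = []
        current.append(line)
        last = i
    if current:
        blocks.append("\n".join(current))
    return blocks


def block_features(block_html, title):
    """Compute assumed versions of TTR, ATTR and TKD for one block."""
    text = strip_tags(block_html).strip()
    tag_count = len(TAG_RE.findall(block_html))
    anchor_text = "".join(strip_tags(m) for m in ANCHOR_RE.findall(block_html))
    title_words = {w.lower() for w in WORD_RE.findall(title)}
    words = [w.lower() for w in WORD_RE.findall(text)]
    return {
        # text characters per tag; max() avoids division by zero
        "ttr": len(text) / max(tag_count, 1),
        # share of the text that sits inside <a> elements
        "attr": len(anchor_text.strip()) / len(text) if text else 0.0,
        # share of block words that also appear in the title
        "tkd": sum(w in title_words for w in words) / len(words) if words else 0.0,
    }


def is_content(features, min_ttr=20.0, max_attr=0.5, min_tkd=0.0):
    """Threshold rule; the cut-off values are illustrative, not from the paper."""
    return (
        features["ttr"] >= min_ttr
        and features["attr"] <= max_attr
        and features["tkd"] >= min_tkd
    )
```

In this sketch a navigation line made mostly of links has a low TTR and a high ATTR and is rejected, while consecutive paragraph lines merge into one block with a high TTR and pass. Real thresholds would have to be tuned on labeled pages.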

Versions

Ver.1

2020-09-01 14:26:41.599257
