Main Content Extraction from Dynamic Web Pages

San, Pan Ei; Aye, Nilar

MERAL Myanmar Education Research and Learning Portal



http://hdl.handle.net/20.500.12678/0000004274

File
		IJAECS.pdf (214 Kb)

Publication type
		Article
Upload type
		Publication
Title
	Title	Main Content Extraction from Dynamic Web Pages
	Language	en
Publication date		2015-03
Authors
		San, Pan Ei
		Aye, Nilar
Description
		Web pages contain not only main content but also other elements such as navigation panels, advertisements, and links to related documents. To ensure high quality of web page content, a good boilerplate removal algorithm is needed to extract only the relevant content from a web page. The main textual content is embedded in the HTML source code that makes up the files. The goal of content extraction, or boilerplate detection, is to separate the main content from navigation chrome, advertising blocks, and copyright notices in web pages. The system removes boilerplate and extracts the main content in two phases: a Feature Extraction phase and a Clustering phase. It classifies each part of an HTML web page as noise or content. The content extraction algorithm is designed to achieve high performance without parsing DOM trees. Observation of HTML tags shows that one line may not contain a complete piece of information and that long texts are distributed across nearby lines, so the system uses the Line-Block concept to determine the distance between any two neighboring lines with text. Features such as text-to-tag ratio (TTR), anchor text-to-text ratio (ATTR), and a new content feature, Title Keywords Density (TKD), are used to distinguish noise from content. After extracting the features, the system uses them as parameters in a threshold method to classify each block as content or non-content.
Keywords
		Content Extraction, Line-Block, TKD, TTR, ATTR
Identifier		2393-2835
Journal articles
		International Journal of Advances in Electronics and Computer Science
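
The description above names three line-based features (TTR, ATTR, TKD), the Line-Block grouping, and a threshold classifier, but the record does not give their formulas. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: TTR is taken as text characters per tag, ATTR as anchor-text characters per text character, TKD as the share of words that are title keywords, and a line-block as a run of text-bearing lines whose distance does not exceed `max_gap`. All function names, regular expressions, and threshold values are hypothetical.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def strip_tags(fragment):
    """Remove tags and collapse whitespace (no DOM tree is built)."""
    return " ".join(TAG_RE.sub(" ", fragment).split())


def line_blocks(lines, max_gap=1):
    """Group indices of text-bearing lines. A new block starts when the
    distance to the previous text-bearing line exceeds max_gap (assumed rule)."""
    blocks, current, last = [], [], None
    for i, line in enumerate(lines):
        if not strip_tags(line):
            continue  # lines without text do not belong to any block
        if last is not None and i - last > max_gap:
            blocks.append(current)
            current = []
        current.append(i)
        last = i
    if current:
        blocks.append(current)
    return blocks


def block_features(block_html, title_keywords):
    """Compute the assumed TTR, ATTR, and TKD definitions for one block."""
    text = strip_tags(block_html)
    tag_count = len(TAG_RE.findall(block_html))
    anchor_text = " ".join(strip_tags(a) for a in ANCHOR_RE.findall(block_html))
    words = [w.lower() for w in WORD_RE.findall(text)]
    hits = sum(1 for w in words if w in title_keywords)
    return {
        "ttr": len(text) / max(tag_count, 1),
        "attr": len(anchor_text) / max(len(text), 1),
        "tkd": hits / max(len(words), 1),
    }


def is_content(features, min_ttr=10.0, max_attr=0.5, min_tkd=0.0):
    """Threshold rule; the cut-off values are illustrative, not from the paper."""
    return (features["ttr"] >= min_ttr
            and features["attr"] <= max_attr
            and features["tkd"] >= min_tkd)


# Hypothetical page: a navigation list followed by two article paragraphs.
html = """<ul><li><a href="/">Home</a></li><li><a href="/news">News</a></li></ul>

<p>Content extraction separates the main article text from boilerplate such as menus.</p>
<p>Line blocks group nearby text lines so that long passages stay together.</p>"""

lines = html.splitlines()
title_keywords = {"content", "extraction"}
results = []
for block in line_blocks(lines):
    block_html = "\n".join(lines[i] for i in block)
    features = block_features(block_html, title_keywords)
    results.append((block, features, is_content(features)))
    print(block, features, is_content(features))
```

In this sketch the link-heavy navigation line forms its own block with a low TTR and a high ATTR, while the two adjacent paragraphs merge into one block. With `min_tkd` left at 0.0 the TKD feature does not filter anything; raising it would require blocks to share vocabulary with the page title.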