Noise Block Cleaning and Main Content Block Extraction from Dynamic Web Page

San, Pan Ei; Aye, Nilar

MERAL Myanmar Education Research and Learning Portal

http://hdl.handle.net/20.500.12678/0000005003

Name / File	License	Actions
12020.pdf (115 Kb)	license_free	https://meral.edu.mm/record/5003/files/12020.pdf

Publication type
		Article
Upload type
		Publication
Title
	Title	Noise Block Cleaning and Main Content Block Extraction from Dynamic Web Page
	Language	en
Publication date		2014-02-17
Authors
		San, Pan Ei
		Aye, Nilar
Description
		Web information extraction systems are becoming more complex and time-consuming. A web page contains many informative blocks and noise blocks. Noise blocks are navigational elements, templates, and advertisements that are not the main content blocks of the web page; they can be described as noisy blocks or boilerplate text. This boilerplate text is typically not related to the main content, may deteriorate search precision, and thus needs to be detected properly. This paper proposes a web page cleaning and main content block extraction approach for the purpose of improving the accuracy and efficiency of web content mining. The system uses structural features and shallow text features, such as number of words, link density, and average word length, to classify text from the web page as main content or boilerplate. The system then extracts the main content block using three parameters: title keyword, keyword-frequency-based block selection, and position features. The relevant content blocks are identified as having a high importance level by the similarity of their contents to other blocks. Experiments show that web page cleaning based on shallow features leads to more accurate and efficient classification results for boilerplate or other content than existing approaches.
Keywords
		Boilerplate Detection, Decision Tree, Shallow Text features, Web Content Mining
Identifier		http://onlineresource.ucsy.edu.mm/handle/123456789/90
Journal articles
		Twelfth International Conference On Computer Applications (ICCA 2014)
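
The Description above names three shallow text features (number of words, link density, and average word length) used to separate main content from boilerplate. The following is a minimal Python sketch of how such features might be computed for one block of text. It is not the authors' implementation: the names `BlockFeatures`, `shallow_features`, and `looks_like_boilerplate` are hypothetical, link density is assumed here to mean the share of a block's words that sit inside links, and the thresholds are hand-picked placeholders. The record's keywords list "Decision Tree", so in the paper the decision rule is presumably learned rather than fixed.

```python
from dataclasses import dataclass


@dataclass
class BlockFeatures:
    num_words: int          # total words in the block
    link_density: float     # assumed: words inside links / total words
    avg_word_length: float  # mean number of characters per word


def shallow_features(block_text: str, anchor_text: str = "") -> BlockFeatures:
    """Compute three shallow text features for one block.

    block_text  -- all visible text of the block
    anchor_text -- the part of that text that appears inside links
    """
    words = block_text.split()
    num_words = len(words)
    if num_words == 0:
        # An empty block carries no content; return neutral zero values.
        return BlockFeatures(0, 0.0, 0.0)
    link_words = len(anchor_text.split())
    link_density = min(1.0, link_words / num_words)
    avg_word_length = sum(len(w) for w in words) / num_words
    return BlockFeatures(num_words, link_density, avg_word_length)


def looks_like_boilerplate(features: BlockFeatures,
                           min_words: int = 10,
                           max_link_density: float = 0.5) -> bool:
    """Illustrative rule with hypothetical thresholds (not from the paper)."""
    return (features.num_words < min_words
            or features.link_density > max_link_density)


if __name__ == "__main__":
    nav = shallow_features("Home About Contact", "Home About Contact")
    print(nav, looks_like_boilerplate(nav))
```

A short navigation bar made up entirely of links gets a link density of 1.0 and is flagged by this rule, while a longer block with no links is not. A trained classifier would replace the fixed thresholds in practice.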