MERAL Myanmar Education Research and Learning Portal
Noise Block Cleaning and Main Content Block Extraction from Dynamic Web Page
http://hdl.handle.net/20.500.12678/0000005003
Name / File | License | Actions |
---|---|---|
12020.pdf (115 Kb) | | |
Publication type | ||||||
---|---|---|---|---|---|---|
Article | ||||||
Upload type | ||||||
Publication | ||||||
Title | ||||||
Title | Noise Block Cleaning and Main Content Block Extraction from Dynamic Web Page | |||||
Language | en | |||||
Publication date | 2014-02-17 | |||||
Authors | ||||||
San, Pan Ei | ||||||
Aye, Nilar | ||||||
Description | ||||||
Web information extraction systems have become more complex and time-consuming. A web page contains many informative blocks and noise blocks. Noise blocks are navigational elements, templates, and advertisements that are not the main content blocks of the web page; they can be defined as noisy blocks or boilerplate text. This boilerplate text is typically unrelated to the main content, may degrade search precision, and thus needs to be detected properly. This paper proposes a web page cleaning and main content block extraction approach with the purpose of improving the accuracy and efficiency of web content mining. The system uses structural features and shallow text features, such as number of words, link density, and average word length, to classify the text of a web page as main content or boilerplate. The system then extracts the main content block using three parameters: title keyword, keyword-frequency-based block selection, and position features. The relevant content blocks are identified as having a high importance level by the similarity of their contents to other blocks. Experiments show that web page cleaning based on shallow features leads to more accurate and efficient classification of boilerplate and other content than existing approaches. | ||||||
Keywords | ||||||
Boilerplate Detection, Decision Tree, Shallow Text features, Web Content Mining | ||||||
Identifier | http://onlineresource.ucsy.edu.mm/handle/123456789/90 | |||||
Journal articles | ||||||
Twelfth International Conference On Computer Applications (ICCA 2014) | ||||||
Conference papers | ||||||
Books/reports/chapters | ||||||
Thesis/dissertations |
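
The shallow text features named in the description (number of words, link density, average word length) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Block` type, the function names, and the threshold values are all hypothetical, and a simple threshold rule stands in for the decision tree listed in the keywords.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical container for one block of a web page."""
    text: str        # all visible text in the block
    link_text: str   # the part of that text that sits inside anchor tags


def shallow_features(block: Block) -> dict:
    """Compute the three shallow text features named in the description."""
    words = block.text.split()
    link_words = block.link_text.split()
    n = len(words)
    return {
        "num_words": n,
        # Share of the block's words that are link text.
        "link_density": len(link_words) / n if n else 0.0,
        "avg_word_length": sum(len(w) for w in words) / n if n else 0.0,
    }


def is_boilerplate(features: dict,
                   max_link_density: float = 0.33,
                   min_words: int = 10) -> bool:
    """Toy rule with assumed thresholds; the paper's classifier is not given here."""
    return (features["link_density"] > max_link_density
            or features["num_words"] < min_words)


nav = Block(text="Home About Contact", link_text="Home About Contact")
article = Block(
    text="The system extracts the main content block from each dynamic web page",
    link_text="",
)

nav_features = shallow_features(nav)
article_features = shallow_features(article)
print(nav_features, is_boilerplate(nav_features))
print(article_features, is_boilerplate(article_features))
```

Under these assumed thresholds, the navigation block is flagged as boilerplate because all of its words are link text, while the article block is kept.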