MERAL Myanmar Education Research and Learning Portal

Deposited date 2019-08-13
  1. University of Computer Studies, Yangon
  2. Faculty of Computer Science

Main Content Extraction from Dynamic Web Pages

http://hdl.handle.net/20.500.12678/0000004272
File MAIN CONTENT EXTRACTION FROM DYNAMIC WEB PAGES(IIER).pdf (678 Kb)
Publication type Article
Upload type Publication
Title Main Content Extraction from Dynamic Web Pages
Language en
Publication date 2015-01
Authors
San, Pan Ei
Aye, Nilar
Description
Web pages contain not only main content but also other elements such as navigation panels, advertisements, and links to related documents. To ensure the high quality of web pages, a good boilerplate removal algorithm is needed to extract only the relevant content from a web page. The main textual content is embedded in the HTML source code that makes up the files. The goal of content extraction, or boilerplate detection, is to separate the main content from navigation chrome, advertising blocks, and copyright notices in web pages. The system removes boilerplate and extracts the main content in two phases: a Feature Extraction phase and a Clustering phase. It classifies the parts of an HTML web page as noise or content. The content extraction algorithm is designed to achieve high performance without parsing DOM trees. Observation of HTML tags shows that a single line may not contain a complete piece of information and that long texts are distributed across nearby lines. The system therefore uses the Line-Block concept to determine the distance between any two neighboring lines that contain text, and uses features such as text-to-tag ratio (TTR), anchor text-to-text ratio (ATTR), and a new content feature, Title Keywords Density (TKD), to classify noise or content. After extracting the features, the system uses them as parameters in a threshold method to classify each block as content or non-content.
Keywords
Content Extraction, Line-Block, TKD, TTR, ATTR
Identifier http://onlineresource.ucsy.edu.mm/handle/123456789/2129
Journal articles
Seventh TheIIER International Conference, Singapore
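
The Description above names the ingredients of the approach (Line-Blocks, TTR, ATTR, TKD, and a threshold step) but does not give formulas or threshold values. The following is a minimal sketch of how such a pipeline could look, not the authors' implementation. Everything specific in it is an assumption made for illustration: the feature definitions (TTR as text characters per tag, ATTR as anchor-text characters over text characters, TKD as the share of block words that appear in the page title), the `max_gap` rule for forming Line-Blocks, the threshold values, the decision rule, and all function names. The Clustering phase mentioned in the Description is not modeled.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def strip_tags(html):
    """Remove anything that looks like an HTML tag (regex only, no DOM parsing)."""
    return TAG_RE.sub("", html)


def line_blocks(lines, max_gap=2):
    """Group indices of text-bearing lines into Line-Blocks.

    Assumption: a new block starts when two consecutive text-bearing
    lines are more than `max_gap` lines apart.
    """
    blocks, last = [], None
    for i, line in enumerate(lines):
        if not strip_tags(line).strip():
            continue  # line holds no visible text
        if last is None or i - last > max_gap:
            blocks.append([])
        blocks[-1].append(i)
        last = i
    return blocks


def features(block_html, title_keywords):
    """Return (TTR, ATTR, TKD) for one block, using the assumed definitions."""
    text = strip_tags(block_html).strip()
    tag_count = len(TAG_RE.findall(block_html))
    anchor_len = sum(len(strip_tags(a).strip()) for a in ANCHOR_RE.findall(block_html))
    words = re.findall(r"\w+", text.lower())
    ttr = len(text) / max(tag_count, 1)
    attr = anchor_len / len(text) if text else 0.0
    tkd = sum(w in title_keywords for w in words) / len(words) if words else 0.0
    return ttr, attr, tkd


def extract_main_content(body_html, title, ttr_min=10.0, attr_max=0.5, tkd_min=0.1):
    """Return the text of blocks classified as content.

    Hypothetical rule: a block is content if it is not link-heavy and is
    either text-dense or rich in title keywords.
    """
    title_keywords = set(re.findall(r"\w+", title.lower()))
    lines = body_html.splitlines()
    kept = []
    for block in line_blocks(lines):
        block_html = "\n".join(lines[i] for i in block)
        ttr, attr, tkd = features(block_html, title_keywords)
        if attr <= attr_max and (ttr >= ttr_min or tkd >= tkd_min):
            kept.append(strip_tags(block_html).strip())
    return kept


# Made-up sample page body: a navigation bar, two paragraphs, a footer.
SAMPLE_BODY = "\n".join([
    '<div><a href="/">Home</a> <a href="/news">News</a> <a href="/about">About</a></div>',
    "",
    "",
    "<p>Solar power capacity grew quickly last year as panel prices fell.</p>",
    "<p>Analysts expect the growth of solar installations to continue.</p>",
    "",
    "",
    '<div><a href="/terms">Terms</a> | <a href="/privacy">Privacy</a></div>',
])

if __name__ == "__main__":
    for block_text in extract_main_content(SAMPLE_BODY, "Solar power growth"):
        print(block_text)
```

In this sketch the navigation bar and footer are rejected because almost all of their text sits inside anchors, while the two adjacent paragraphs form one Line-Block with a high text-to-tag ratio. Real pages would need tuned thresholds and handling for scripts, styles, and comments, none of which is covered here.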

Versions

Ver.1 2020-09-01 14:27:01.047324