MERAL Myanmar Education Research and Learning Portal

  1. University of Computer Studies, Yangon
  2. Faculty of Computer Science

Boilerplate removal and Content Extraction from Dynamic Web Pages

http://hdl.handle.net/20.500.12678/0000004270
Boilerplate removal and content extraction(ijren).pdf (596 Kb)
Publication type
Article
Upload type
Publication
Title
Title Boilerplate removal and Content Extraction from Dynamic Web Pages
Language en
Publication date 2014-12
Authors
San, Pan Ei
Description
Web pages contain not only main content but also other elements such as navigation panels, advertisements, and links to related documents. To ensure the high quality of a web page, a good boilerplate removal algorithm is needed to extract only the relevant content from the page. The main textual content is embedded in the HTML source code that makes up the files. The goal of content extraction, or boilerplate detection, is to separate the main content from navigation chrome, advertising blocks, and copyright notices in web pages. The system removes boilerplate and extracts the main content in two phases: a Feature Extraction phase and a Clustering phase. The system classifies each part of an HTML web page as noise or content. The content extraction algorithm is designed to achieve high performance without parsing DOM trees. Having observed from the HTML tags that a single line may not contain a complete piece of information and that long texts are distributed over nearby lines, the system uses the Line-Block concept to determine the distance between any two neighboring lines with text. It classifies noise or content using features such as text-to-tag ratio (TTR), anchor text-to-text ratio (ATTR), and a new content feature, Title Keywords Density (TKD). After extracting the features, the system uses them as parameters in a threshold method to classify each block as content or non-content.
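The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation: the abstract does not give formulas, so the definitions used here are assumptions (TTR as text characters per tag, ATTR as anchor-text characters per text character, TKD as the share of words in a block that also occur in the page title). The function names, the `max_gap` parameter, and all threshold values are hypothetical and chosen only for illustration.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a>", re.IGNORECASE | re.DOTALL)


def strip_tags(html):
    """Return an HTML fragment with all tags removed (no DOM parsing)."""
    return TAG_RE.sub("", html)


def line_blocks(lines, max_gap=1):
    """Group text-bearing lines into blocks of line indices.

    Assumed reading of "Line-Block": two text-bearing lines share a block
    when at most `max_gap` lines without text lie between them.
    """
    blocks, current, last = [], [], None
    for i, line in enumerate(lines):
        if not strip_tags(line).strip():
            continue  # this line carries no visible text
        if last is not None and i - last - 1 > max_gap:
            blocks.append(current)
            current = []
        current.append(i)
        last = i
    if current:
        blocks.append(current)
    return blocks


def features(block_html, title_keywords):
    """Compute the assumed TTR, ATTR and TKD values for one block."""
    text = strip_tags(block_html)
    text_len = len(text.strip())
    tag_count = len(TAG_RE.findall(block_html))
    anchor_len = sum(len(strip_tags(a).strip())
                     for a in ANCHOR_RE.findall(block_html))
    words = re.findall(r"\w+", text.lower())
    hits = sum(1 for w in words if w in title_keywords)
    return {
        "ttr": text_len / max(tag_count, 1),
        "attr": anchor_len / max(text_len, 1),
        "tkd": hits / max(len(words), 1),
    }


def is_content(f, min_ttr=20.0, max_attr=0.5, min_tkd=0.05):
    """Threshold rule with hypothetical cut-offs: link-heavy blocks are noise."""
    return f["attr"] <= max_attr and (f["ttr"] >= min_ttr or f["tkd"] >= min_tkd)


def extract(html, title):
    """Return the text of every line block classified as content."""
    lines = html.splitlines()
    keywords = set(re.findall(r"\w+", title.lower()))
    kept = []
    for block in line_blocks(lines):
        block_html = "\n".join(lines[i] for i in block)
        if is_content(features(block_html, keywords)):
            kept.append(" ".join(strip_tags(lines[i]).strip() for i in block))
    return kept
```

Calling `extract(html, title)` on a page's source and its title returns the text of the blocks that pass the thresholds. In practice the cut-offs would have to be tuned on labelled pages, since the values above are placeholders.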
Keywords
content extraction, line-block, TKD, TTR, ATTR
Identifier http://onlineresource.ucsy.edu.mm/handle/123456789/2127
Journal articles
International Journal of Computer Science, Engineering and Applications (IJCSEA)
Versions

Ver.1 2020-09-01 14:26:41.599257