Discovering Informative Content Blocks for Efficient Web Data Extraction

Hlaing, Nwe Nwe; Nyunt, Thi Thi Soe

http://hdl.handle.net/20.500.12678/0000003523

Name / File	License	Actions
psc2010paper (62).pdf (148 Kb)

Publication type
		Article
Upload type
		Publication
Title
	Title	Discovering Informative Content Blocks for Efficient Web Data Extraction
	Language	en
Publication date		2010-12-16
Authors
		Hlaing, Nwe Nwe
		Nyunt, Thi Thi Soe
Description
		As web sites are getting more complicated, the construction of web information extraction systems becomes more troublesome and time-consuming. A common theme is the difficulty in locating the segments of a page in which the target information is contained, which we call the informative blocks. So discriminating informative blocks from the noisy blocks and then extracting the informative blocks from a web page is an important task. In this paper, we propose a method that utilizes both visual features and semantic information to extract the information block. First, the VIPS (Vision-based Page Segmentation) algorithm is used to partition a web page into semantic blocks with a hierarchical structure. Then spatial features (such as position and size) and content features (the number of images and links) are extracted to construct a feature vector for each block. Second, based on these features, the blocks with similar content structures and spatial structures are clustered by means of similarity computation. After clustering blocks with similar structures, the cluster with the largest size and the nearest distance to the centre of the page is determined to be the informative block.
Keywords
		Vision-based Page Segmentation, Information Extraction, Block Clustering
Identifier		http://onlineresource.ucsy.edu.mm/handle/123456789/1265
Journal articles
		Fifth Local Conference on Parallel and Soft Computing
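
The pipeline outlined in the description (segment the page, build a feature vector per block, cluster similar blocks, select one cluster) can be sketched roughly as below. This is an illustrative sketch only, not the authors' implementation: the record does not give the similarity measure, the clustering procedure, or the exact selection rule, so the names `Block`, `features`, `similarity`, `cluster_blocks`, and `pick_informative`, the threshold of 0.65, the greedy seed-based clustering, and the "area minus offset" score are all assumptions. VIPS segmentation itself is not implemented; the blocks are hand-written stand-ins for its output.

```python
from dataclasses import dataclass
from math import dist


@dataclass(frozen=True)
class Block:
    """One leaf block, standing in for the output of a VIPS-style segmenter."""
    x: float        # left edge in pixels
    y: float        # top edge in pixels
    width: float
    height: float
    n_images: int
    n_links: int


def features(b, page_w, page_h):
    """Feature vector: centre position and size scaled by the page, plus squashed counts."""
    return (
        (b.x + b.width / 2) / page_w,
        (b.y + b.height / 2) / page_h,
        b.width / page_w,
        b.height / page_h,
        b.n_images / (1 + b.n_images),   # assumed squashing so counts stay in [0, 1)
        b.n_links / (1 + b.n_links),
    )


def similarity(u, v):
    """Assumed similarity: 1 for identical vectors, falling toward 0 with Euclidean distance."""
    return 1.0 / (1.0 + dist(u, v))


def cluster_blocks(blocks, page_w, page_h, threshold=0.65):
    """Greedy clustering (an assumption): join the first cluster whose seed is similar enough."""
    clusters = []
    for b in blocks:
        fb = features(b, page_w, page_h)
        for cluster in clusters:
            if similarity(fb, features(cluster[0], page_w, page_h)) >= threshold:
                cluster.append(b)
                break
        else:
            clusters.append([b])
    return clusters


def pick_informative(clusters, page_w, page_h):
    """Prefer clusters covering a large share of the page and lying near its centre."""
    def score(cluster):
        area = sum(b.width * b.height for b in cluster) / (page_w * page_h)
        offset = sum(
            dist(features(b, page_w, page_h)[:2], (0.5, 0.5)) for b in cluster
        ) / len(cluster)
        return area - offset   # assumed way of combining "largest" and "nearest"
    return max(clusters, key=score)


# Hypothetical 1000 x 2000 px page: header, sidebar, three article blocks, footer.
page_w, page_h = 1000, 2000
blocks = [
    Block(0, 0, 1000, 100, 0, 15),
    Block(0, 100, 200, 1500, 0, 20),
    Block(220, 120, 600, 400, 1, 2),
    Block(220, 540, 600, 400, 1, 1),
    Block(220, 960, 600, 420, 1, 2),
    Block(0, 1900, 1000, 100, 0, 8),
]
clusters = cluster_blocks(blocks, page_w, page_h)
informative = pick_informative(clusters, page_w, page_h)
print(len(clusters), "clusters;", len(informative), "blocks chosen as informative")
```

In this toy page the three article blocks share size and content counts, so they group together, while the header, sidebar, and footer each stay alone. The greedy scheme depends on block order and on the threshold; a centroid-based or hierarchical clustering would be a reasonable alternative.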

Versions

Ver.1

2020-09-01 13:06:18.598462
