Mining Web Content Outliers by using Term Weighting Technique and Rank Correlation Coefficient Approach

Thinzar Tun; Khin Mo Mo Tun



http://hdl.handle.net/20.500.12678/0000006277

Name / File	License	Actions
Mining Web Content Outliers by using Term Weighting Technique and Rank Correlation Coefficient Approach.pdf (1.3 Mb)	© 2017 ICAIT

Publication type
		Conference paper
Upload type
		Publication
Title
	Title	Mining Web Content Outliers by using Term Weighting Technique and Rank Correlation Coefficient Approach
	Language	en
Publication date		2017-11-02
Authors
		Thinzar Tun
		Khin Mo Mo Tun
Description
		The World Wide Web (WWW) holds a voluminous amount of information, much of it in redundant and irrelevant web pages. Outliers are data that differ significantly from the rest of the data. Web content mining is a subarea of web mining that extracts required and useful knowledge or information from web page content. Web content outlier mining concentrates on finding outliers, such as irrelevant and redundant pages, among web pages. The web contains unstructured and semi-structured documents, so web content mining algorithms need to handle both kinds. The proposed system is based on big web data, and its objective is to obtain more accurate results. In this proposal, the Term Frequency Inverse Document Frequency (TF.IDF) technique, based on full-word matching with a domain dictionary, is used to remove irrelevant documents from the unstructured web documents according to the user’s input query. Removing outliers (irrelevant and redundant content) from the web not only reduces indexing space and time complexity but also improves the accuracy of search results. Documents that share very few words with the user’s input query are treated as web outliers. A mathematical approach, Spearman’s rank correlation coefficient, is then used to remove redundant web documents and to retrieve ranked relevant web documents.
Keywords
		outliers, web content mining, term frequency, correlation coefficient
Conference papers
		ICAIT-2017
		1-2 November, 2017
		1st International Conference on Advanced Information Technologies
		Yangon, Myanmar
		Software Engineering and Web Mining
		https://www.uit.edu.mm/icait-2017/
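
The two-stage idea in the Description can be illustrated with a minimal Python sketch. It is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the thresholds `min_score` and `max_rho`, and all function names are assumptions made for illustration only. Stage one scores each document against the query using TF.IDF weights over domain-dictionary terms and drops low-scoring documents as irrelevant outliers; stage two drops any document whose term-weight ranking is almost perfectly correlated (Spearman) with a document already kept.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: simple lowercase whitespace tokenizer (whole-word matching).
    return text.lower().split()

def tfidf(docs, dictionary):
    """Return one TF-IDF vector per document over the sorted dictionary terms."""
    n = len(docs)
    terms = sorted(dictionary)
    tokens = [tokenize(d) for d in docs]
    counts = [Counter(t for t in toks if t in dictionary) for toks in tokens]
    # Assumption: smoothed IDF so that common terms keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + sum(1 for c in counts if c[t]))) + 1
           for t in terms}
    vectors = [[(c[t] / max(len(toks), 1)) * idf[t] for t in terms]
               for toks, c in zip(tokens, counts)]
    return vectors, terms

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho computed as the Pearson correlation of average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0

def mine(docs, query, dictionary, min_score=0.05, max_rho=0.95):
    """Return indices of relevant, non-redundant documents, best score first."""
    vectors, terms = tfidf(docs, dictionary)
    q_terms = {t for t in tokenize(query) if t in dictionary}
    scores = [sum(w for t, w in zip(terms, v) if t in q_terms) for v in vectors]
    # Stage 1: documents scoring below min_score are irrelevant outliers.
    ranked = sorted((i for i, s in enumerate(scores) if s >= min_score),
                    key=lambda i: -scores[i])
    # Stage 2: skip documents too strongly correlated with one already kept.
    kept = []
    for i in ranked:
        if all(spearman(vectors[i], vectors[j]) < max_rho for j in kept):
            kept.append(i)
    return kept

docs = [
    "web mining finds outliers in web pages",
    "web mining finds outliers in web pages",
    "cooking pasta with tomato sauce",
    "outliers differ from the rest of the data in mining",
]
dictionary = {"web", "mining", "outliers", "pages", "data"}
result = mine(docs, "web mining outliers", dictionary)
```

In this toy corpus the cooking page shares no dictionary terms with the query and is filtered in stage one, while the verbatim duplicate is filtered in stage two. Computing Spearman's coefficient from average ranks, rather than from the shortcut formula based on squared rank differences, keeps the value well defined when many term weights are tied at zero.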


Versions

Ver.1

2020-11-19 15:50:31.399011
