Modified K-Means for Document Clustering System

Tin Thu Zar Win; Moe Moe Aye

MERAL Myanmar Education Research and Learning Portal


Deposited date		2019-07-04
Access		open_access
File URL		https://meral.edu.mm/record/3116/files/Modified K-Means for Document Clustering System-2016.pdf

http://hdl.handle.net/20.500.12678/0000003116

File
		Modified K-Means for Document Clustering System-2016.pdf (311 KB)

Publication type
		Conference paper
Upload type
		Publication
Title
	Title	Modified K-Means for Document Clustering System
	Language	en
Publication date		2016-10-01
Authors
		Tin Thu Zar Win
		Moe Moe Aye
Description
		In today’s era of the World Wide Web, the amount of digitized text documents is growing tremendously. Because the web holds a huge collection of documents, they need to be grouped into clusters. Document clustering plays an important role in navigating and organizing documents effectively. K-Means is the most commonly used document clustering algorithm because it is easy to implement and is the most efficient in terms of execution time. Its major problem is that it is quite sensitive to the selection of the initial cluster centroids. The algorithm chooses the initial cluster centers arbitrarily, so it does not always produce good clustering results. If the initial centroids are poorly chosen, data points with the same similarity scores may fall into different clusters instead of the same cluster. To overcome this problem, this paper proposes a modified K-Means approach that improves the quality of clustering. Unlike traditional K-Means clustering, the proposed method can generate the most compact and stable clustering results by using maximum-distance initial centroids instead of random initial centroids. Moreover, similar data points are clustered based on the maximum probability distribution of the data points. The proposed method is therefore more effective and converges to more accurate clusters than the original K-Means clustering method. Experimental results are reported as F-measure on the standard 20 Newsgroups dataset.
Keywords
		Document clustering, F-measure, Initial centroid, K-Means
Identifier		10.5281/zenodo.3268423
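
The description above contrasts random initial centroids with maximum-distance initial centroids. This record does not give the paper's exact procedure, so the following is only a minimal sketch of one plausible reading: farthest-first seeding followed by ordinary K-Means iterations. The function names, the choice of the first seed (the point farthest from the overall mean), the Euclidean distance, and the toy data are all assumptions made for illustration. The probability-based assignment step mentioned in the description is not modeled here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_first_centroids(points, k):
    """Pick k seeds by farthest-first traversal.

    This is an assumed interpretation of "maximum distance" initialization,
    not necessarily the rule used in the paper.
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    # First seed: the point farthest from the overall mean (hypothetical choice).
    centroids = [max(points, key=lambda p: euclidean(p, mean))]
    while len(centroids) < k:
        # Next seed: the point whose nearest chosen seed is farthest away.
        nxt = max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans(points, k, max_iter=100):
    """Standard Lloyd iterations, seeded deterministically instead of randomly."""
    centroids = [list(c) for c in farthest_first_centroids(points, k)]
    labels = []
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: euclidean(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy 2-D vectors standing in for document feature vectors (made-up data).
docs = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9), (9.0, 0.0), (9.1, 0.2)]
labels, centroids = kmeans(docs, k=3)
```

Because the seeds are chosen deterministically, repeated runs on the same data give the same clusters, which is the stability property the description attributes to non-random initialization. Real document clustering would typically use term-weight vectors and possibly cosine similarity rather than these toy coordinates.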