Systematic Selection of Initial Centroid for K-Means Document Clustering System

Tin Thu Zar Win; Moe Moe Aye

MERAL Myanmar Education Research and Learning Portal

http://hdl.handle.net/20.500.12678/0000003117

File		Systematic Selection of Initial Centroid for K-Means Document Clustering System.pdf (251 KB)

Publication type		Conference paper
Upload type		Publication
Title		Systematic Selection of Initial Centroid for K-Means Document Clustering System
Language		en
Publication date		2016-12-29
Deposited date		2019-07-04
Authors		Tin Thu Zar Win; Moe Moe Aye
Description
		As the number of electronic documents generated from worldwide sources increases, it becomes hard to organize, analyze and present these documents manually and efficiently. Document clustering is one of the traditional data mining techniques and an unsupervised learning paradigm. Fast, high-quality document clustering algorithms play an important role in helping users navigate, summarize and organize information effectively. The K-Means algorithm is the most commonly used partitional clustering algorithm because it is easy to implement and is the most efficient in terms of execution time. Its major problem, however, is that it is sensitive to the selection of initial centroids and may converge to a local optimum. Because the algorithm chooses the initial cluster centres arbitrarily, it does not always guarantee good clustering results: different initial cluster centres often lead to different clusterings and thus to unstable results. To overcome this problem, this paper proposes the Systematic Selection of Initial Centroid for K-Means (SSIC K-Means) approach to improve clustering quality. Unlike traditional K-Means clustering, the proposed SSIC K-Means method can generate the most compact and stable clustering results by using maximum-distance points as initial centroids instead of random points. Experimental results are presented as F-measures on the 20 Newsgroups standard dataset. The evaluations demonstrate that the proposed solution outperforms the other initialization methods and can be applied to various other standard datasets.
Keywords		Document clustering, Data mining, K-Means, Initial centroid, SSIC K-Means
Identifier		10.5281/zenodo.3268434
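
The description above contrasts random initial centroids with maximum-distance initial centroids. The record does not spell out the exact SSIC K-Means selection rule, so the following is only a minimal sketch of a generic farthest-point initialization followed by standard K-Means iterations, not the authors' method. The function names (`euclidean`, `max_distance_centroids`, `kmeans`), the use of Euclidean distance, and the choice of the point farthest from the overall mean as the first centroid are all assumptions made for illustration; in a document clustering setting the input vectors would typically be term-weight vectors rather than the toy 2-D points used here.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_distance_centroids(points, k):
    """Pick k initial centroids deterministically by farthest-point selection.

    Assumed rule (not taken from the paper): start from the point farthest
    from the overall mean, then repeatedly add the point whose distance to
    its nearest already-chosen centroid is largest.
    """
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    mean = [sum(col) / n for col in zip(*points)]
    first = max(range(n), key=lambda i: euclidean(points[i], mean))
    chosen = [first]
    # Distance from every point to its nearest chosen centroid so far.
    min_dist = [euclidean(p, points[first]) for p in points]
    while len(chosen) < k:
        nxt = max(range(n), key=lambda i: min_dist[i])
        chosen.append(nxt)
        min_dist = [min(d, euclidean(p, points[nxt]))
                    for d, p in zip(min_dist, points)]
    return [list(points[i]) for i in chosen]


def kmeans(points, centroids, max_iter=100):
    """Standard K-Means iterations starting from the given centroids."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members)
                                for col in zip(*members)])
            else:
                updated.append(list(old))  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
           (20, 0), (20, 1), (21, 0)]
    start = max_distance_centroids(toy, 3)
    print(start)
    print(kmeans(toy, start)[0])
```

Because the selection involves no random draws, repeated runs on the same data start from the same centroids, which is the stability property the description emphasizes. Whether this particular rule matches the paper's procedure would need to be checked against the full text.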