Social Media Text Normalization

Thet Thet Zin

MERAL Myanmar Education Research and Learning Portal

http://hdl.handle.net/20.500.12678/0000006329

Name / File	License
Social Media Text Normalization.pdf (1.5 Mb)	© 2018 ICAIT

Publication type
		Conference paper
Upload type
		Publication
Title
	Title	Social Media Text Normalization
	Language	en
Publication date		2018-11-02
Authors
		Thet Thet Zin
Description
		In recent years, researchers have become interested in text normalization for social media because of the informal writing styles found in Twitter and other social media data. These informal texts often cause problems for Natural Language Processing (NLP) applications such as text mining and translation of social media data. Facebook currently supports English translation of posts and statuses written in the Myanmar language. However, most of these translations do not convey the meaning of the Myanmar words. The complex syntactic structure of the Myanmar language, informal writing style, slang words, and spelling mistakes are challenges for social media text translation. This paper proposes a text normalization method that can be deployed as a preprocessing step for opinion mining, machine translation, and other NLP applications that handle social media text. The work has three steps. First, candidate words for normalization are selected from the collected raw dataset: Out-Of-Vocabulary (OOV) words are extracted and, because not all OOV words need to be normalized, ill-formed words are detected from the OOV word list. Second, a slang word dictionary is generated. Third, text similarity methods are applied to the ill-formed words to normalize them. The method is evaluated on translation by applying normalization in the preprocessing step, using Myanmar-English machine translation [14]. The experimental results show that applying the proposed normalization improves translation, especially for social media text.
Keywords
		informal text, social media, normalization, Out-Of-Vocabulary word (OOV), translation
Conference papers
		Acronym	ICAIT-2018
		Date	1-2 November, 2018
		Conference title	2nd International Conference on Advanced Information Technologies
		Place	Yangon, Myanmar
		Session	Natural Language Processing
		Website	https://www.uit.edu.mm/icait-2018/
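
The three steps named in the Description (OOV and ill-formed word detection, slang dictionary lookup, similarity-based normalization) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the lexicon, the slang dictionary, the English toy tokens, the 0.7 threshold, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all hypothetical stand-ins, since this record does not specify the actual resources or methods.

```python
from difflib import SequenceMatcher

# Hypothetical toy resources (assumptions): the paper's lexicon, slang
# dictionary, and similarity measure are not given in this record.
LEXICON = {"see", "you", "tomorrow", "great", "thanks"}
SLANG = {"u": "you", "thx": "thanks"}


def is_oov(token, lexicon=LEXICON):
    """Step 1: a token is out-of-vocabulary if the lexicon lacks it."""
    return token not in lexicon


def best_match(token, lexicon=LEXICON, threshold=0.7):
    """Return the most similar lexicon word, or None if below the threshold.

    A token with no sufficiently similar lexicon word is treated here as an
    OOV word that is not ill-formed (for example, a proper noun).
    """
    scored = [(SequenceMatcher(None, token, w).ratio(), w) for w in sorted(lexicon)]
    score, word = max(scored)
    return word if score >= threshold else None


def normalize(tokens):
    """Normalize a token list using the three steps sketched above."""
    out = []
    for tok in tokens:
        if not is_oov(tok):
            out.append(tok)              # in-vocabulary: keep unchanged
        elif tok in SLANG:
            out.append(SLANG[tok])       # step 2: slang dictionary lookup
        else:
            match = best_match(tok)      # step 3: similarity-based repair
            out.append(match if match else tok)
    return out


if __name__ == "__main__":
    print(normalize(["see", "u", "tomorow", "thx", "Yangon"]))
```

In this sketch the slang lookup runs before the similarity step, so known slang is never "corrected" toward an unrelated lexicon word; whether the paper orders the steps this way at run time is not stated in the record.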