For the purpose of this assignment, we have chosen to work on the data provided
for the SemEval-2018 shared task 1, more specifically for subtask 1, Emoji
Prediction in English. The goal of the task is to design a system that, given a
tweet in English, predicts the emoji that is most likely to be associated with
it.
The data that we will use consists of 500,000 tweets in English that were
collected with the Twitter APIs from October 2015 to February 2017 and are
geolocalized in the United States. The emojis used in the tweets were then
removed and used as labels; it is important to point out that only tweets with
one emoji were used. The emojis chosen as labels are the 20 most frequent in
English tweets. The emojis and the numbers used to identify them in the data are
shown in Figure 1, while Figure 2 summarizes the distribution of tweets across
the emojis: the x-axis represents the labels, and the y-axis the total number of
tweets in which a given emoji appeared. Finally, the data is split into training
(90%) and test (10%) sets.

The methods we will be using for emoji prediction are Multinomial Naive
Bayes (MNB) and Support Vector Machines (SVM). These methods are very
frequently used in the literature on Twitter sentiment analysis (for example in
[1, 25, 23]). MNB is a popular classification method because it is
computationally efficient and has shown relatively good predictive performance
[12], while linear SVM classifiers have been shown to perform better than other
well-known machine learning techniques, such as the aforementioned Naive Bayes
classifiers or k-Nearest Neighbour classifiers [12, 19]. We intend to test the
performance of these algorithms on our data and thus evaluate them on a
different kind of sentiment analysis, one that uses emojis as labels, since few
papers have focused on this kind of problem (for example [5, 14, 17]). This
project provides a good opportunity to learn in detail and implement two
techniques that are commonly used in natural language processing.
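
To make the planned setup more concrete, the following is a minimal sketch in
plain Python of a 90/10 split and a Multinomial Naive Bayes classifier over
word counts. It is an illustration only, not the implementation we will use:
the toy tweets, the label numbers 0 and 2, the whitespace tokenizer, the
Laplace smoothing constant alpha = 1, and the names split_train_test and
MultinomialNB are all assumptions made for the example.

```python
import math
import random
from collections import Counter, defaultdict


def split_train_test(examples, test_fraction=0.1, seed=0):
    """Shuffle the examples and hold out `test_fraction` of them for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


class MultinomialNB:
    """Multinomial Naive Bayes on bag-of-words counts (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, examples):
        # examples: iterable of (tweet_text, emoji_label) pairs
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in examples:
            self.class_counts[label] += 1
            for token in text.lower().split():  # naive whitespace tokenizer
                self.word_counts[label][token] += 1
                self.vocab.add(token)
        self.total = sum(self.class_counts.values())
        return self

    def predict(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in text.lower().split() if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # log prior of the class
            score = math.log(n_docs / self.total)
            denom = (sum(self.word_counts[label].values())
                     + self.alpha * len(self.vocab))
            # add the smoothed log likelihood of each token
            for token in tokens:
                count = self.word_counts[label][token]
                score += math.log((count + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data: (tweet with its emoji removed, numeric emoji label).
toy = [
    ("love you so much", 0),
    ("love this city", 0),
    ("i love my family", 0),
    ("so funny lol", 2),
    ("lol that was funny", 2),
    ("this is hilarious lol", 2),
]
model = MultinomialNB().fit(toy)
print(model.predict("love you"))   # 0
print(model.predict("funny lol"))  # 2
```

On the real data, the same steps would be applied to the training portion
returned by split_train_test, with accuracy measured on the held-out 10%. A
linear SVM would consume the same word-count features; it is left out here
because a faithful version does not fit in a short sketch.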