Tweet Sentiment Extraction

Dipanshu Rana
Nov 3, 2021

Extract support phrases for sentiment labels

Table of Contents:

  1. Business Problem
  2. Source of Data
  3. Data Overview
  4. Mapping the Real-World Problem to an ML/DL Problem
  5. Performance Metrics
  6. Exploratory Data Analysis
  7. Data Preprocessing
  8. Deep Learning Models
  9. Models Comparison
  10. Model Deployment
  11. Future Work
  12. References

1. Business Problem

With all the tweets circulating every second, it is hard to tell whether the sentiment behind a specific tweet will boost a company’s or a person’s brand by going viral (positive) or hurt profits because it strikes a negative tone. Capturing sentiment in language is important at a time when decisions and reactions are created and updated in seconds. But which words actually lead to the sentiment description? In this problem we need to pick out the part of the tweet (a word or phrase) that reflects its sentiment.

2. Source of Data

This is a Kaggle competition. The dataset of tweets and their support phrases was built from Figure Eight’s Data for Everyone platform. The data is available here.

3. Data Overview

It consists of two data files: train.csv with 27481 rows and test.csv with 3534 rows.

List of columns in the dataset:

textID: unique id for each row of data.

text: contains text data of the tweet.

sentiment: sentiment of the text (positive/negative/neutral).

selected_text: the phrase/words from the text that best support the sentiment.

4. Mapping the Real-World Problem to an ML/DL Problem

The input to the model is a sentence, and the output is a word or phrase that is part of the input itself. That is, we have to predict the word or phrase from the tweet that exemplifies the provided sentiment. For example:

Sentence: So sad I will miss you here in San Diego.

Sentiment: negative

Output: So sad

5. Performance Metrics

The metric in this problem is the word-level Jaccard score. The Jaccard score, or Jaccard similarity, is defined as the size of the intersection divided by the size of the union of two sets.

Let’s see an example of how Jaccard similarity works:

doc_1 = “Data is the new oil of the digital economy”

doc_2 = “Data is a new oil”
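A minimal word-level implementation of this metric (this set-based form matches the definition above):

    def jaccard(str1, str2):
        """Word-level Jaccard: |intersection| / |union| of the word sets."""
        a = set(str1.lower().split())
        b = set(str2.lower().split())
        c = a.intersection(b)
        return len(c) / (len(a) + len(b) - len(c))

    doc_1 = "Data is the new oil of the digital economy"
    doc_2 = "Data is a new oil"
    print(jaccard(doc_1, doc_2))  # 4 shared words / 9 words in the union

Here the two documents share the four words {data, is, new, oil}, and their union contains nine unique words, so the Jaccard score is 4/9 ≈ 0.44.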

6. Exploratory Data Analysis

  • About 40 percent of the tweets are neutral, followed by positive and negative tweets (no. of neutral tweets: 11117, positive: 8582, negative: 7781).
  • The histogram shows that the length of the cleaned text ranges up to approximately 140 characters; very few tweets are longer than 140.
  • Positive tweets: most are 30 to 50 characters long.
  • Negative tweets: most are 20 to 40 characters long.
  • Neutral tweets: most are 20 to 40 characters long.
  • Very few tweets in any category are longer than 130 characters or shorter than 10; across categories, most fall between roughly 25 and 60 characters.
  • The word-count histogram shows that tweets range up to approximately 35 words.
  • In every category, most tweets contain 5 to 15 words; very few contain more than 30 or fewer than 5.
  • The word clouds give an idea of the words which might influence the polarity of the tweet.
Common Words in Text
Common Words in Selected Text
  • The plots above show the most common words found in the text and selected_text columns for each sentiment.

7. Data Preprocessing

In any machine learning task, cleaning or preprocessing the data is as important as model building, and when it comes to unstructured data like text this step is even more important.

Some of the common text preprocessing tasks are:

  • Lower case
  • Removing Hyper-links
  • Removing numbers, angular brackets, square brackets and ‘\n’ characters, and replacing masked profanity (****) with an <ABUSE> token
  • Removing punctuation
  • Spelling correction (a short sketch follows this list):

  1. First, identify wrong spellings in selected_text.

  2. If the misspelled word is only one character long, drop those rows: such words carry no meaning and have no impact on modelling.

  3. If the misspelled word is longer than one character, use the fuzzywuzzy library, which gives a similarity score out of 100 denoting how close two strings are.
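A minimal cleaning sketch covering the steps above; the exact regex patterns and the fuzzywuzzy usage are my assumptions about the author’s implementation:

    import re
    from fuzzywuzzy import fuzz

    def clean_text(text):
        """Apply the preprocessing steps listed above."""
        text = text.lower()                                 # lower case
        text = re.sub(r"https?://\S+|www\.\S+", "", text)   # remove hyper-links
        text = re.sub(r"<[^>]*>|\[[^\]]*\]", "", text)      # angular / square brackets
        text = text.replace("\n", " ")                      # '\n' characters
        text = re.sub(r"\*{2,}", "<ABUSE>", text)           # masked profanity -> token
        text = re.sub(r"\d+", "", text)                     # numbers
        text = re.sub(r"[^\w\s<>]", "", text)               # punctuation (keep <ABUSE>)
        return text.strip()

    # Step 3: score a suspected misspelling against a candidate correction.
    print(fuzz.ratio("goood", "good"))  # high similarity (~89 out of 100)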

8. Deep Learning Models

8.1. RNN

A Recurrent Neural Network (RNN) is a type of neural network where the output from the previous step is fed as input to the current step. In traditional neural networks all the inputs and outputs are independent of each other, but in cases such as predicting the next word of a sentence, the previous words are required, so there is a need to remember them. RNNs solve this with a hidden state that helps remember information about the sequence.

Vanishing Gradient Problem :

As gradients are propagated back through time, it is possible that their values become too small, which makes it difficult to learn long-range dependencies. This problem is called the vanishing gradient problem.

Solutions to the vanishing gradient problem: careful weight initialization, choosing the right activation function, and LSTMs.

8.2. LSTMs(Long Short-Term Memory Networks)

LSTMs are a special kind of recurrent neural network, capable of learning long-term dependencies; remembering information for long periods of time is their default behavior. LSTMs also have a chain-like structure, but the repeating module is different: instead of a single neural network layer, there are four interacting layers communicating in a very special way.

Forget gate: decides which information from the previous time step is no longer important and should be discarded.

Input gate: Determines which information to let through based on its significance in the current time step.

Output gate: Allows the passed in information to impact the output in the current time step.

Each tweet in the ‘text’ column has a different length, so it is better to convert the ‘text’ data into sequences of numbers and pad all the tweets so that every input has the same length.
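A sketch of that step with Keras; the sample tweets and the max_length_text value of 35 (taken from the word-count EDA above) are illustrative assumptions:

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences

    tweets = ["So sad I will miss you here in San Diego.",
              "Today is a wonderful day"]

    max_length_text = 35  # assumed cap, based on the word-count EDA above

    tokenizer = Tokenizer(oov_token="<OOV>")
    tokenizer.fit_on_texts(tweets)                    # build the vocabulary
    sequences = tokenizer.texts_to_sequences(tweets)  # words -> integer ids
    padded_text = pad_sequences(sequences, maxlen=max_length_text, padding="post")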

Preparing the Embedding layer:

We compute an index mapping words to known embedding vectors by parsing the data dump of pre-trained embeddings.

We then use our embeddings_index dictionary together with word_index_text and word_index_sentiment to compute the respective embedding matrices, and load each matrix into an Embedding layer.

Note that we set trainable=False in Embedding Layer to prevent the weights from being updated during training.
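A sketch of both steps, reusing the tokenizer fitted above. The GloVe file name and the 100-dimension size are assumptions (the post does not say which pre-trained embeddings were used), and only the text branch is shown:

    import numpy as np
    from tensorflow.keras.layers import Embedding

    embedding_dim = 100  # assumed; must match the pre-trained vectors

    # Parse the pre-trained embedding dump into a word -> vector index.
    embeddings_index = {}
    with open("glove.6B.100d.txt", encoding="utf-8") as f:
        for line in f:
            values = line.split()
            embeddings_index[values[0]] = np.asarray(values[1:], dtype="float32")

    # Build the embedding matrix for the tweet vocabulary.
    word_index_text = tokenizer.word_index
    embedding_matrix = np.zeros((len(word_index_text) + 1, embedding_dim))
    for word, i in word_index_text.items():
        vector = embeddings_index.get(word)
        if vector is not None:   # words without a pre-trained vector stay zero
            embedding_matrix[i] = vector

    embedding_layer = Embedding(
        input_dim=len(word_index_text) + 1,
        output_dim=embedding_dim,
        weights=[embedding_matrix],
        trainable=False,         # freeze the pre-trained vectors during training
    )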

8.3. Bi-directional LSTM

With a regular LSTM, the input flows in one direction only, either backwards or forwards. A bi-directional LSTM makes the input flow in both directions, preserving both past and future information.

For example, in the sentence “boys go to …..” we cannot fill in the blank unless we know the continuation “boys come out of school”. We want our model to use context in the same way, and a bidirectional LSTM allows the network to do this.

Here the output is a vector of length max_length_text (the maximum input length). Words that are part of the predicted text are given a value of 1 and the others a value of 0. Using this output vector we can then extract the selected words from the ‘text’ column.

For example:
Input (text): Today is a wonderful day when everything moves smoothly and harmoniously

Input (sentiment): positive

Output Vector: 0,0,0,1,0,0,0,0,1,0,1

Predicted_text: wonderful, smoothly, harmoniously
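A minimal Keras sketch of such a per-token tagging model, reusing the embedding layer above. Only the text branch is shown (the sentiment input is omitted for brevity) and the layer sizes are assumptions:

    from tensorflow.keras.layers import Input, Bidirectional, LSTM, Dense, TimeDistributed
    from tensorflow.keras.models import Model

    text_input = Input(shape=(max_length_text,), name="text")
    x = embedding_layer(text_input)                # frozen pre-trained embeddings
    x = Bidirectional(LSTM(128, return_sequences=True))(x)
    # One sigmoid per token: probability that the token is part of selected_text.
    token_probs = TimeDistributed(Dense(1, activation="sigmoid"))(x)

    model = Model(text_input, token_probs)
    model.compile(optimizer="adam", loss="binary_crossentropy")

    def decode(tweet_words, probs, threshold=0.5):
        """Keep the words whose predicted probability exceeds the threshold.
        probs is the flattened per-token output, e.g. model.predict(x)[0].ravel()."""
        return [w for w, p in zip(tweet_words, probs) if p > threshold]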

9. Models Comparison


LSTM Jaccard Score : 0.57

Bi-LSTM Jaccard Score : 0.63

One can observe that performance with the bi-directional flow is noticeably better, so we save the Bi-LSTM model for future use.

10. Model Deployment

After training the Bi-LSTM (the best model), I stored it in a pickle file and deployed it on my local system behind a Flask API built around the final pipeline, which takes a tweet and a sentiment as input and returns the phrase/words from the text that best support the sentiment.

I designed an HTML page (home.html) with two inputs, tweet and sentiment; it returns the tweet, the sentiment, and the extracted phrase/words as output.
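A minimal Flask sketch of that pipeline; extract_phrase is a hypothetical helper wrapping the saved Bi-LSTM model, tokenizer, and decoding step:

    from flask import Flask, render_template, request

    app = Flask(__name__)

    @app.route("/")
    def home():
        return render_template("home.html")

    @app.route("/predict", methods=["POST"])
    def predict():
        tweet = request.form["tweet"]
        sentiment = request.form["sentiment"]
        # extract_phrase (hypothetical) cleans and tokenizes the tweet, runs the
        # saved Bi-LSTM model, and decodes the 0/1 output vector back into words.
        phrase = extract_phrase(tweet, sentiment)
        return render_template("home.html",
                               tweet=tweet, sentiment=sentiment, phrase=phrase)

    if __name__ == "__main__":
        app.run(debug=True)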

11. Future Work

  • GRUs could be tried; they reduce computational cost and train faster than LSTMs.
  • The modelling above is word-level; one could also try character-level modelling.
  • Using Transformer models such as BERT should increase performance.

12. References

The complete project is available on GitHub. For any queries regarding the project, contact me on LinkedIn.
