How to Fight Social Media Anxiety with Machine Learning
According to Wikipedia,
Fear of missing out, or FOMO, is “a pervasive apprehension that others might be having rewarding experiences from which one is absent”. This social anxiety is characterised by “a desire to stay continually connected with what others are doing”.
FOMO is one of the reasons why we see people spend hours scrolling endlessly through their feeds. At the same time, speaking from a personal perspective, I get genuinely confused seeing new trending topics popping up throughout the day. Most of the time, it takes a lot of effort and research to get to the bottom of a story.
Trying to keep up with everything that’s going on can be distracting, confusing, and both time- and energy-consuming…
Machine Learning be like…
What if we could train a machine to go through hundreds (or thousands) of tweets, extract the relevant data, classify people’s tweets based on their opinions, and tell us only what we actually HAVE to know?
That sounds neat, doesn’t it?
Translating that to code
If you made it this far, you are ready to go ahead and clone the project on GitHub.
Harnessing Machine Learning for classifying and estimating Twitter users’ opinions and thoughts on a given topic. …
The Project in Action
Let us take as an example a term that everyone on the planet has seen or heard at least a hundred times, but that the majority (Britons included) don’t quite understand. You guessed it: “Brexit”.
If you search for that term on Twitter, you will find a huge number of related tweets. Those tweets can be news updates, links, memes ( ͡° ͜ʖ ͡°), opinions, fake news, etc.
After crunching that data, we will get something like this as output:
Now, for a person on their device that’s a lot of time and effort, but for a machine it takes what, two seconds? No useless data, no spam, only relevant data.
Wait… how can we achieve that?
Good question. Here is a typical way to classify and make sense of data in a machine learning project, and this project is no exception:
- We retrieve the data
- We clean the data
- We extract the relevant words
- We create a dictionary of words used and their frequency
- We train our model to recognise the pattern of words, using a dataset of already classified comments/feedback
- We use our model to classify each tweet based on the pattern of most used words.
- We display the cloud of most used words
- We display the stats of opinions
- We try to create a comprehensive overall description from the dictionary of most frequently used words!
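The cleaning, word-extraction, and frequency-dictionary steps above can be sketched in a few lines of Python. Note that the stopword list and the `clean_tweet` helper below are illustrative placeholders, not the project's actual code (in practice you would use NLTK's full stopword corpus):

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real pipeline would use NLTK's corpus.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "i", "it"}

def clean_tweet(text):
    """Strip links, mentions/hashtags, and punctuation, then lowercase."""
    text = re.sub(r"https?://\S+", "", text)   # remove links
    text = re.sub(r"[@#]\w+", "", text)        # remove mentions and hashtags
    text = re.sub(r"[^a-zA-Z\s]", "", text)    # remove punctuation, digits, emoji
    return text.lower()

def relevant_words(text):
    """Keep only the words that are not stopwords."""
    return [w for w in clean_tweet(text).split() if w not in STOPWORDS]

tweets = [
    "Brexit is a total mess... https://t.co/xyz",
    "Brexit deal talks are a mess again! #Brexit",
]

# Frequency dictionary over all tweets (steps 2-4 of the pipeline).
word_freq = Counter(w for t in tweets for w in relevant_words(t))
print(word_freq.most_common(3))
```

The `Counter` doubles as the “dictionary of words used and their frequency”, which later feeds both the word cloud and the description generator.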
Let’s get technical
Tweets Sentiment Classification
In order to guess what people actually think about a given topic (positive, negative, or neutral), we are going to use a Naive Bayes classifier.
For this specific project, we will have three classes, one per opinion. The features are the dictionary of words found in the tweets (without stopwords, punctuation, links, etc.).
Naive Bayes classifiers can use any sort of feature (URLs, email addresses, dictionaries, network features…).
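To make the idea concrete, here is a minimal from-scratch sketch of a three-class Naive Bayes classifier over bag-of-words features. This is an illustration of the principle only; the project itself relies on NLTK's implementation, and the toy training data below is made up:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_tweets):
    """labeled_tweets: list of (word_list, label). Returns model parameters."""
    class_counts = Counter(label for _, label in labeled_tweets)
    word_counts = defaultdict(Counter)   # per-class word frequencies
    vocab = set()
    for words, label in labeled_tweets:
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab

def classify(words, model):
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label in class_counts:
        # log prior + log likelihoods, with Laplace smoothing for unseen words
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for w in words:
            score += math.log(
                (word_counts[label][w] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data, one word list per "tweet".
training = [
    (["great", "deal"], "positive"),
    (["love", "great"], "positive"),
    (["awful", "mess"], "negative"),
    (["hate", "mess"], "negative"),
    (["news", "update"], "neutral"),
]
model = train_nb(training)
print(classify(["great", "news"], model))  # → positive
```

Each class keeps its own word-frequency table, and an unseen tweet is assigned to whichever class makes its words most probable.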
If you want to learn more about this algorithm, check this article out.
How do we train our NB model to do that?
The project allows you either to train your own model with a dataset of your choosing or to use the pre-trained model. In our example, we used the basic model included in the NLTK corpora.
Optional: if you have your own dataset and want to train your own model, you can use the training method (see line 111 of main.py).
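As a rough sketch of what such a training method can look like with NLTK (this is not the project's actual code; the feature extractor and the toy dataset here are assumptions for illustration):

```python
from nltk import NaiveBayesClassifier

def tweet_features(words):
    """Bag-of-words feature dict in the shape NLTK classifiers expect."""
    return {word: True for word in words}

# Toy labeled dataset; in practice this would come from a corpus of
# already-classified comments/feedback.
train_set = [
    (tweet_features(["great", "deal", "happy"]), "positive"),
    (tweet_features(["love", "this"]), "positive"),
    (tweet_features(["awful", "mess"]), "negative"),
    (tweet_features(["hate", "chaos"]), "negative"),
    (tweet_features(["vote", "scheduled"]), "neutral"),
]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(tweet_features(["what", "a", "mess"])))
```

NLTK simply ignores feature names it never saw during training, so only the informative words (“mess” here) drive the decision.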
Generating a Brief Description
The script basically takes the 5–6 most frequently used words and tries to order them in different ways while testing the resulting paragraph’s coherence. We use language_check to achieve that. Really simple, right?
I have described the installation, prerequisites, and getting-started steps in the Readme file, so please go ahead and check it out.
This is the initial version of this project, and I intend to improve it as I progress. Please do not hesitate to contribute or share your suggestions.
I have tried to simplify things as much as possible and include plenty of details, since I am still a beginner in ML myself. The code is self-explanatory and carefully commented. If you think there is a mistake or something is missing, please let me know in the comments.