Stock Sentiment Analysis in NLP

Stock Sentiment Analysis using News Headlines

In this tutorial, we are going to train a stock sentiment analysis model using news headlines. The model predicts whether a stock price increased or decreased on a given day from that day's news headlines. This is a Natural Language Processing (NLP) based project, and the data set stores different news headlines. Sentiment analysis is the process of classifying whether a written sentence or headline is positive, negative or neutral. A sentiment analysis system for text combines NLP and machine learning (ML) techniques to assign sentiment scores to the categories within a sentence. We are going to train an NLP model with the help of the following steps :-

  • Import all required libraries
  • Load data
  • Split data into training and testing
  • Apply a regular expression to remove all characters and symbols except English letters
  • Apply CountVectorizer
  • Define model
  • Prepare test set
  • Apply predictions 
  • Check the overall accuracy of model 

About the problem and the dataset used.

  • The data set in consideration is a combination of the world news and stock price shifts.

  • Data ranges from 2008 to 2016, and the data from 2000 to 2008 was scraped from Yahoo Finance.

  • There are 25 columns of top news headlines for each day in the data frame.

  • Class 0: the stock price stayed the same or decreased.

  • Class 1: the stock price increased.




Let’s start :-

First, import all the required libraries, such as pandas, NLTK, re (regular expressions), scikit-learn and NumPy.



Load the stock sentiment analysis data and display the top five rows of data using the head() function.
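As a minimal sketch of this step, the snippet below loads a tiny in-memory CSV instead of the real file. The column names (Date, Label, Top1 to Top3) and the rows are assumptions made up for illustration; the actual data set has 25 headline columns.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real file: column names and rows are
# invented for illustration only.
csv_text = """Date,Label,Top1,Top2,Top3
2014-12-30,0,Markets slide on weak data,Oil prices fall again,Storm hits coast
2014-12-31,1,Stocks rally at year end,Tech firms report gains,New trade deal signed
2015-01-02,1,Shares open higher,Bank posts record profit,Talks resume in capital
"""

# With a real file this would be pd.read_csv("<path to your csv>").
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # head() shows up to the first five rows
print(df.shape)    # (rows, columns)
```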


Now check the minimum and maximum dates using the min() and max() functions. These two values tell us the date range of the data, which we need for choosing the split point.
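A small sketch of the date check, assuming the dates live in a column named Date (a hypothetical name) and use the YYYY-MM-DD format:

```python
import pandas as pd

# Toy frame; the "Date" column name and values are illustrative assumptions.
df = pd.DataFrame({
    "Date": ["2014-12-30", "2015-01-02", "2014-12-31"],
    "Label": [0, 1, 1],
})

# Converting to datetime makes min()/max() independent of the string format.
df["Date"] = pd.to_datetime(df["Date"])

first_day = df["Date"].min()
last_day = df["Date"].max()
print(first_day.date(), last_day.date())
```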

Split the data into training and testing sets. The training data contains all the news headlines dated before ‘2015-01-01’ and the testing data contains all the news headlines dated after ‘2014-12-31’. Then extract the y_train and y_test labels from the training and testing sets, and check the shape of train, test, y_train and y_test.
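The split could look roughly like this. The sketch assumes the dates are stored as YYYY-MM-DD strings (so plain string comparison orders them chronologically) and that the label column is called Label; both are assumptions about the data set.

```python
import pandas as pd

# Toy frame with hypothetical column names and values.
df = pd.DataFrame({
    "Date": ["2014-12-30", "2014-12-31", "2015-01-01", "2015-01-02"],
    "Label": [0, 1, 1, 0],
    "Top1": ["markets slide", "stocks rally", "shares open higher", "oil falls"],
})

# Everything before 2015-01-01 is training data, the rest is test data.
train = df[df["Date"] < "2015-01-01"]
test = df[df["Date"] > "2014-12-31"]

# Labels for each split.
y_train = train["Label"]
y_test = test["Label"]

print(train.shape, test.shape, y_train.shape, y_test.shape)
```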



Now apply a regular expression to the training data. With the help of the regular expression, we remove all special symbols and digits, keeping only English letters. Then display the complete training data after applying the regular expression.
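One way to do this cleaning is with DataFrame.replace and the pattern [^a-zA-Z], which matches anything that is not an English letter. The sketch below replaces each such character with a space; the toy headlines and column names are invented for illustration.

```python
import pandas as pd

# Toy headline columns; names and text are illustrative assumptions.
train = pd.DataFrame({
    "Top1": ["Stocks up 3% today!", "Oil falls to $50"],
    "Top2": ["G-20 leaders meet", "Bank's profit: +12%"],
})

# Replace every character that is not an English letter with a space.
data = train.replace("[^a-zA-Z]", " ", regex=True)

print(data)
```

Replacing with a space rather than an empty string keeps neighbouring words from being glued together.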


Change the column names of all the news headline columns. Rename the 25 headline columns to the numbers 0 to 24, so they are easy to access by index. Then display the training data.
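A sketch of the renaming, shown on three columns instead of 25. The original column names (Top1 to Top3) are assumptions.

```python
import pandas as pd

# Toy headline frame; the real one would have 25 columns.
data = pd.DataFrame(
    [["a", "b", "c"], ["d", "e", "f"]],
    columns=["Top1", "Top2", "Top3"],
)

# Name the columns "0", "1", ... up to (number of columns - 1).
data.columns = [str(i) for i in range(data.shape[1])]

print(data.columns.tolist())
```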


Now convert all headlines to lowercase and display the data.
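Lowercasing can be sketched with the pandas string accessor, applied column by column. The toy data below is made up for illustration.

```python
import pandas as pd

# Toy headline frame with numbered columns, as in the previous step.
data = pd.DataFrame({
    "0": ["Stocks Rally", "OIL Falls"],
    "1": ["Bank Posts Profit", "Talks RESUME"],
})

# Lowercase every headline column.
for col in data.columns:
    data[col] = data[col].str.lower()

print(data)
```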


Now combine all the news headline columns of each row into a single string and append that string to a list using a for loop. Run the loop from 0 to the length of the training data, select columns 0 to 25 of the current row by slicing, join them with spaces and append the result to the list. It helps to try this on a single row (index) first and then on all rows. Finally, check the length of the list.
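The loop described above can be sketched as follows, here with three headline columns rather than 25. The variable name headlines is a hypothetical choice.

```python
import pandas as pd

# Toy cleaned, lowercased headline frame (3 columns instead of 25).
data = pd.DataFrame({
    "0": ["stocks rally", "oil falls"],
    "1": ["bank posts profit", "talks resume"],
    "2": ["new deal signed", "storm hits coast"],
})
n_cols = data.shape[1]  # would be 25 for the tutorial's data set

# How it works on a single row (index 0): join all columns with spaces.
print(" ".join(str(x) for x in data.iloc[0, 0:n_cols]))

# The same for every row, collected in a list.
headlines = []
for row in range(len(data.index)):
    headlines.append(" ".join(str(x) for x in data.iloc[row, 0:n_cols]))

print(len(headlines))
```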


Now apply the CountVectorizer to convert the text data into numerical data. CountVectorizer is a tool available in the scikit-learn library in Python. It transforms a collection of texts into vectors based on the frequency (count) of each word: every distinct word in the whole collection becomes a column, and each text becomes a row of word counts.
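To make the idea concrete, here is a small sketch with two made-up headlines and default CountVectorizer settings. The settings used in the original notebook are not shown in the text, so the defaults here are an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two toy combined headlines (illustrative only).
headlines = ["stocks rally as oil rises", "oil falls and stocks fall"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(headlines)  # learns the vocabulary, then counts

# Columns follow the alphabetically sorted vocabulary.
print(sorted(vectorizer.vocabulary_))
print(X.toarray())
```

Each row of the matrix is one headline and each column counts how often one vocabulary word occurs in it.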


Now define the RandomForestClassifier model, a machine learning model provided by the scikit-learn library, and train it on train_df and y_train. Here train_df stores the independent features (the vectorized news headlines) and y_train stores the dependent variable (the labels of the news headlines).
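A minimal training sketch on toy data follows. The hyperparameters (n_estimators=50, random_state=0) are illustrative assumptions, not the values from the original notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy combined headlines and labels (1 = price increased, 0 = otherwise).
headlines = [
    "stocks rally as profits rise",
    "markets fall on weak data",
    "shares climb after strong earnings",
    "prices drop amid trade fears",
]
y_train = [1, 0, 1, 0]

# Vectorize the text, as in the previous step.
vectorizer = CountVectorizer()
train_df = vectorizer.fit_transform(headlines)

# Hyperparameters here are assumptions chosen for a quick, repeatable demo.
model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train_df, y_train)

print(model.classes_)
```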


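Now prepare the test data in the same way as the training data. First combine all the headline columns of each row into a single string and append it to another list. Then apply the CountVectorizer that was fitted on the training data to transform the test text into vectors (use transform only, without fitting again, so the test vectors share the training vocabulary). Finally, check the length of the data.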
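A sketch of the test preparation, assuming the headline columns start at position 2 (after hypothetical Date and Label columns):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Vectorizer fitted on toy training headlines (stands in for the earlier step).
train_headlines = ["stocks rally as oil rises", "oil falls and stocks fall"]
vectorizer = CountVectorizer().fit(train_headlines)

# Toy test frame; column names and positions are assumptions.
test = pd.DataFrame({
    "Date": ["2015-01-02", "2015-01-05"],
    "Label": [1, 0],
    "Top1": ["Stocks rally again", "Oil falls sharply"],
    "Top2": ["Oil rises too", "Stocks fall as well"],
})

# Combine the headline columns of each row into one string.
test_headlines = []
for row in range(len(test.index)):
    test_headlines.append(" ".join(str(x) for x in test.iloc[row, 2:]))

# transform() only: reuse the vocabulary learned from the training data.
test_df = vectorizer.transform(test_headlines)

print(len(test_headlines), test_df.shape)
```

Words that never appeared in the training headlines (such as "again" here) are simply ignored by transform.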


Finally, run the model's predictions on the test data, and display the actual and predicted labels side by side to see where they differ.
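Putting the pieces together on toy data, prediction and comparison might look like this. All headlines, labels and hyperparameters below are invented for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer

# Toy training and test data (illustrative only).
train_headlines = [
    "stocks rally as profits rise",
    "markets fall on weak data",
    "shares climb after strong earnings",
    "prices drop amid trade fears",
]
y_train = [1, 0, 1, 0]
test_headlines = ["stocks climb on strong profits", "markets drop on weak earnings"]
y_test = [1, 0]

# Fit the vectorizer on training text, reuse it for the test text.
vectorizer = CountVectorizer()
train_df = vectorizer.fit_transform(train_headlines)
test_df = vectorizer.transform(test_headlines)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(train_df, y_train)

# Predict a label (0 or 1) for every test row.
predictions = model.predict(test_df)

# Actual and predicted labels side by side.
comparison = pd.DataFrame({"actual": y_test, "predicted": predictions})
print(comparison)
```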


Now look at the classification report, confusion matrix and accuracy score to evaluate the overall performance of the random forest classifier model.

  • A confusion matrix is a summary of prediction results on a classification problem. The numbers of correct and incorrect predictions are summarized as counts and broken down by class.
  • A classification report is used to measure the quality of predictions on a classification problem. It shows the precision, recall and f1-score metrics on a per-class basis.
  • An accuracy score computes the overall accuracy of the model, the fraction of predictions that match the actual labels, from y_actual and y_pred.
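The three metrics above can be sketched with scikit-learn's metrics module. The label lists below are made up so the numbers are easy to check by hand; they are not results from the tutorial's model.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hypothetical actual and predicted labels.
y_test = [0, 0, 1, 1, 1, 0]
predictions = [0, 1, 1, 1, 0, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_test, predictions)
print(cm)

# Precision, recall and f1-score per class.
print(classification_report(y_test, predictions))

# Fraction of predictions that match the actual labels (4 of 6 here).
acc = accuracy_score(y_test, predictions)
print(acc)
```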



Source code and How to use :-

  1. Go to my GitHub account and download or fork the repository : Stock Sentiment Analysis
  2. Open Jupyter Notebook in the project directory.
  3. Then open the project notebook and it is ready to use.


Thank You !!!!!!!!!

