How to handle Imbalanced dataset

In this tutorial, we are going to see how to handle an imbalanced dataset using different techniques. We will work with the credit card fraud detection dataset from Kaggle, which is a highly imbalanced dataset, and try to balance it. There are several techniques for handling imbalanced data; here we use two of them:

  1. Oversampling
  2. Undersampling

Tutorial Overview

This tutorial is divided into six parts:

  1. Read the data
  2. Visualize the data
  3. Prepare the data
  4. Apply oversampling to the training set and train a RandomForestClassifier model
  5. Apply undersampling to the training set and train a RandomForestClassifier model
  6. Compare the accuracy of both models

About data set

The dataset contains transactions made by credit cards in September 2013 by European cardholders.

This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced: the positive class (frauds) accounts for 0.172% of all transactions.


Let’s start:

First of all, import all the required libraries, such as pandas, NumPy, Counter, and warnings. Then suppress the warnings generated during code execution.



Now load the credit card fraud detection dataset and display its top five rows using the head() function. The dataset contains independent and dependent features: every feature is independent except the Class feature, which is the target.
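Loading the data might look like the sketch below. So that the snippet runs without the Kaggle download, it first writes a tiny stand-in file with the same column layout (`Time`, `V1`…`V28`, `Amount`, `Class`, per the Kaggle description); with the real file in your working directory you would only need the `read_csv` line onward.

```python
import numpy as np
import pandas as pd

# Stand-in for creditcard.csv so this sketch runs without the Kaggle file.
cols = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]
pd.DataFrame(np.zeros((5, len(cols))), columns=cols).to_csv(
    "creditcard.csv", index=False
)

df = pd.read_csv("creditcard.csv")
print(df.head())             # top five rows
print(df.columns.tolist())   # Class is the dependent (target) feature
```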


Now apply the value_counts() function to the target feature to count how many transactions are fraudulent and how many are normal. Then separate the normal and fraud transactions into separate variables. As you can see below, there are 492 fraud transactions and 284,315 normal ones. Next, try to visualize the data.
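The counting and separating steps are roughly as follows (a synthetic `Class` column stands in for the real one so the sketch is self-contained; the real counts are 284,315 vs. 492):

```python
import pandas as pd

# Synthetic stand-in: 0 = normal, 1 = fraud.
df = pd.DataFrame({"Class": [0] * 990 + [1] * 10, "Amount": range(1000)})

print(df["Class"].value_counts())   # count of each class

normal = df[df["Class"] == 0]       # normal transactions
fraud = df[df["Class"] == 1]        # fraud transactions
print(normal.shape, fraud.shape)
```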


Now visualize the Class feature, as shown in the image below.
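A quick way to see the imbalance is a bar chart of the class counts. A sketch with matplotlib (the non-interactive `Agg` backend is set here so it also runs headless; drop that line in a notebook):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Stand-in counts for the two classes.
df = pd.DataFrame({"Class": [0] * 990 + [1] * 10})

df["Class"].value_counts().plot(kind="bar")
plt.xlabel("Class (0 = normal, 1 = fraud)")
plt.ylabel("Number of transactions")
plt.savefig("class_counts.png")
```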


Now it's time to prepare the dataset. X stores the independent features and Y stores the dependent (target) feature. Split the data into training and testing sets using the train_test_split() function: X_train and Y_train are used to train the model, while X_test and Y_test are used to test it and check its accuracy.

Then check the shapes of X_train, Y_train, X_test, and Y_test, as in the image below, and apply value_counts() to Y_train to see the counts of normal and fraud transactions in the training set.
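The preparation and split might look like this sketch. Synthetic imbalanced data from `make_classification` stands in for the real features; `stratify=y` (an assumption, not stated in the original) keeps the class ratio the same in both splits:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic imbalanced stand-in: roughly 1% positive class.
X, y = make_classification(n_samples=2000, weights=[0.99], random_state=42)

X_train, X_test, Y_train, Y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(X_train.shape, X_test.shape)  # shapes of the two feature splits
print(Counter(Y_train))             # class counts in the training split
```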



Now, first apply oversampling to the dataset. Import the RandomOverSampler class from the imblearn library, create a RandomOverSampler object, and apply oversampling to X_train and Y_train using the fit_resample() function (called fit_sample() in older imblearn versions).
Compare the original Y_train counts with Y_train_os after applying the RandomOverSampler: Y_train contains 394 fraud and 227,451 normal transactions, while Y_train_os contains 170,588 fraud and 227,451 normal transactions. Random oversampling duplicates examples from the minority class in the training set, which can lead to overfitting for some models.
Now define a RandomForestClassifier model, create an object of it, and train it on the new X_train_os and Y_train_os data. You could also use a naive Bayes model, a DecisionTreeClassifier, an SVC, etc.


Test the model on the test set by calling the predict() function on the RandomForestClassifier object, which returns the predictions (pred).


Now create a DataFrame showing the actual and predicted results, as in the image below.



Check the accuracy_score, classification_report, and confusion_matrix of the model. The overall accuracy is 99% after applying random oversampling.


Now it's time to apply undersampling to the training set. First import the NearMiss class, create a NearMiss object, and apply undersampling to X_train and Y_train using the fit_resample() function (fit_sample() in older imblearn versions).

Compare the original Y_train counts with Y_train_us after undersampling: Y_train contains 394 fraud and 227,451 normal transactions, while Y_train_us contains 394 fraud and 394 normal transactions. Undersampling deletes examples from the majority class and can discard information valuable to a model. (Note that NearMiss is not random undersampling: it keeps the majority-class samples closest to the minority class.)

Now follow the same process: define a RandomForestClassifier model, create an object of it, and train it on the new X_train_us and Y_train_us data.



Now check the accuracy_score, classification_report, and confusion_matrix of this model. The overall accuracy is 61% after applying NearMiss undersampling.

Source code and how to use

  1. Go to my GitHub and fork or download the repo: Credit card fraud detection
  2. Download the dataset: Dataset
  3. Open the .ipynb file in Jupyter or Google Colab
  4. Run the notebook

Video Tutorial

Thank you!


