In this tutorial, we are going to tune the diabetes classification model using hyperparameter tuning. Hyperparameter tuning is the process of choosing a set of optimal hyperparameters for a learning algorithm. It involves the following steps:
- Defining our model.
- Defining the range of possible values for all hyperparameters.
- Defining a method for sampling hyperparameter values.
- Defining a cross-validation method.
In the previous tutorial, we learnt how to preprocess and prepare the data and train the diabetes classification model using a machine learning algorithm. If you haven't read my previous post, just follow the link below.
Diabetes classification : Click here to read
In this tutorial, we retrain that model to increase its performance. This time we train the diabetes model using the random forest classifier instead of the naive Bayes algorithm, because the random forest classifier has many hyperparameters to tune.
Let’s start :-
Firstly, the basic step is to load our final preprocessed data and suppress the runtime warnings.
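A minimal sketch of this step. The real data comes from the previous post's preprocessing; here a tiny stand-in DataFrame (with made-up values) illustrates the expected layout, since the actual file path is specific to your setup.

```python
import warnings

import pandas as pd

# Suppress runtime warnings so the notebook output stays clean
warnings.filterwarnings("ignore")

# Stand-in for the preprocessed diabetes data from the previous tutorial;
# in practice you would load it with pd.read_csv(...)
df = pd.DataFrame({
    "Glucose": [148, 85, 183],
    "BMI": [33.6, 26.6, 23.3],
    "Outcome": [1, 0, 1],
})
print(df.shape)
```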
Split the data into independent and dependent features.
Then split the data into training and testing sets using the train_test_split() function, so we can train and test the random forest classifier model.
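These two splits might look like the following sketch. The column names and the random data are placeholders; only the `Outcome` label column matches the diabetes dataset's convention.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Random stand-in frame; the tutorial uses the preprocessed diabetes data
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 4)), columns=["Glucose", "BMI", "Age", "Insulin"])
df["Outcome"] = rng.integers(0, 2, 100)

X = df.drop("Outcome", axis=1)  # independent features
y = df["Outcome"]               # dependent feature (the label)

# 80% training / 20% testing split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)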
The next step is to normalize the training and testing data, which scales each feature into a particular range.
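The tutorial doesn't name the scaler, so MinMaxScaler is an assumption here (it matches "scale into a particular range"). The key point the sketch shows: fit the scaler on the training data only, then apply the same transform to the test data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy feature matrices standing in for the real split
X_train = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])
X_test = np.array([[1.5, 300.0]])

scaler = MinMaxScaler()                       # scales each feature to [0, 1]
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same scaling
print(X_train_scaled.min(), X_train_scaled.max())
```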
Firstly, we train the diabetes classification model using the random forest classifier without tuning and see what accuracy we get. Simply load the model, train it on the training data, and check the confusion matrix, accuracy score and classification report to evaluate it. You can see that we got 79.87% overall accuracy.
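The untuned baseline might be trained like this. Synthetic data stands in for the diabetes set, so the accuracy printed here will not match the tutorial's 79.87%, which comes from the real data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed diabetes data
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline model with default hyperparameters (no tuning yet)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```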
And now we tune the model using two techniques to increase the overall performance of the model.
So firstly, we use RandomizedSearchCV to tune the random forest classifier. RandomizedSearchCV sets up a grid of hyperparameter values and selects random combinations of parameters to train and evaluate the model. This allows you to explicitly control the number of parameter combinations that are attempted; the number of search iterations is set based on time or resources.
Now define all the parameters of the random forest classifier, such as n_estimators, max_features, max_depth, min_samples_split and min_samples_leaf, which are used to tune the model; you can also add more parameters or widen their ranges. After defining the parameters, create a dictionary of key-value pairs, where the key is the parameter's name and the value is its list of candidate values.
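One way to build that dictionary (the exact ranges below are illustrative, not the tutorial's):

```python
import numpy as np

# Candidate values for each hyperparameter (adjust to taste/resources)
n_estimators = [int(x) for x in np.linspace(100, 1000, 10)]       # number of trees
max_features = ["sqrt", "log2"]    # features per split ("auto" was removed in newer sklearn)
max_depth = [int(x) for x in np.linspace(10, 100, 10)] + [None]   # tree depth
min_samples_split = [2, 5, 10]     # min samples to split a node
min_samples_leaf = [1, 2, 4]       # min samples at a leaf

random_grid = {
    "n_estimators": n_estimators,
    "max_features": max_features,
    "max_depth": max_depth,
    "min_samples_split": min_samples_split,
    "min_samples_leaf": min_samples_leaf,
}
print(len(random_grid))
```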
Create the object of the random forest classifier and define the RandomizedSearchCV, which takes parameters such as estimator, param_distributions, random_state and cv (cross-validation), and train the model. It takes some time to execute: it performs 300 fits on the data, which means the model is trained 300 times with new parameter combinations.
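A runnable sketch of the search on synthetic data. Note the total number of fits is n_iter × cv; the tutorial's 300 fits would correspond to, say, n_iter=100 with cv=3 (an assumption, since the exact values aren't stated). A small n_iter keeps this sketch fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the preprocessed diabetes data
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Small illustrative distributions; the tutorial's dictionary is larger
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
}

rf = RandomForestClassifier(random_state=42)

# Total fits = n_iter * cv (e.g. 100 * 3 = 300 in the tutorial)
rf_random = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_distributions,
    n_iter=5,
    cv=3,
    random_state=42,
)
rf_random.fit(X, y)

print(rf_random.best_params_)            # best combination sampled
best_model = rf_random.best_estimator_   # model refitted with those parameters
```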
You can also check the best parameters found for the random forest classifier using the best_params_ attribute of the RandomizedSearchCV object, and retrieve the tuned model for evaluation using best_estimator_.
Now it's time to evaluate the model using the accuracy score, confusion matrix and classification report and you can see that the overall accuracy is 81.16%.
So secondly, we use GridSearchCV to tune the random forest classifier. Grid search can be thought of as an exhaustive search for selecting a model: the developer sets up a grid of hyperparameter values and, for each combination, trains a model and scores it on the testing data. Every combination of hyperparameter values is tried, which can be very inefficient.
Now define the parameters of the random forest classifier; we'll take values near the best parameters we got from RandomizedSearchCV and create a dictionary. As you can see in the below image.
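A sketch of such a narrowed grid. The numbers below are illustrative (centred on hypothetical best values), not the tutorial's exact parameters; the point is that grid search tries every combination, so a small grid keeps the fit count manageable.

```python
# Narrow grid centred on the parameters RandomizedSearchCV reported
# (values here are placeholders, not the tutorial's numbers)
param_grid = {
    "n_estimators": [90, 100, 110],
    "max_depth": [8, 10, 12],
    "min_samples_split": [2, 3],
    "min_samples_leaf": [1, 2],
}

# Grid search tries every combination: 3 * 3 * 2 * 2 = 36,
# and each one is trained cv times
n_combinations = 1
for values in param_grid.values():
    n_combinations *= len(values)
print(n_combinations)
```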
Same as above, create the object of random forest classifier and define GridSearchCV which takes some parameters like estimator, param_grid, cv etc., and train the model using the training data.
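The grid search step could be sketched like this, again on synthetic stand-in data with a deliberately tiny grid:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the preprocessed diabetes data
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Tiny illustrative grid; in the tutorial it is built around the
# best parameters reported by RandomizedSearchCV
param_grid = {"n_estimators": [50, 100], "max_depth": [5, None]}

grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    cv=3,
)
grid_search.fit(X_train, y_train)

print(grid_search.best_params_)            # winning combination
best_model = grid_search.best_estimator_   # model refitted with it
print(accuracy_score(y_test, best_model.predict(X_test)))
```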
And we can also check the best parameters with which our model was trained using the best_params_ attribute of the GridSearchCV object, and retrieve the tuned model with best_estimator_.
Now again it's time to evaluate the random forest classifier model using the accuracy score, confusion matrix and classification report and you can see that the overall accuracy is 81.16%.
So there is only a minor difference between the accuracies from GridSearchCV and RandomizedSearchCV, as you can see in the output. Now it's time for you to perform hyperparameter tuning on other models and then evaluate them with different metrics, such as the accuracy score.
If you face any difficulty, you can comment below. I’ll solve your problem.
Source code :-
1. Go to my GitHub account and download or fork the repo: Hyperparameter Tuning
2. Open the .ipynb file in Jupyter or Google Colab.
Video Tutorial :-
Thank you!











