Language Identification Using NLP

Introduction

Read on for a walkthrough of an NLP model for “Language Identification”.

In this tutorial, we are going to implement a language identification model using natural language processing (NLP). The model identifies the language of a given sentence. For example, if we give it the English sentence “what are you doing now-a-days?”, it returns the result “English”.

I trained the NLP model on 22 different languages, including English, Dutch, French, Latin, Hindi, Romanian and Tamil. After training on a large data set covering all 22 languages, the model gave me 92% accuracy. It's a useful project which can serve as a minor project for your college. You can also deploy the model on localhost or on a hosting platform such as Heroku or PythonAnywhere.

Let’s start:

The first steps are to load all the required libraries and then load the language data. The data set has a “Text” column, which holds the sentence, and a target feature “Language”, which holds one of 22 different languages. After that, check the shape of the data: it shows 22000 rows and 2 columns.
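As a rough sketch of this step, the snippet below loads a tiny stand-in for the data set with pandas and checks its shape. The inline CSV and its four rows are assumptions made up for illustration; with the real data you would pass the path of the CSV file to read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file, which has 22000 rows.
csv_text = """Text,Language
what are you doing today,English
que fais tu ce soir,French
how are you my friend,English
comment vas tu mon ami,French
"""

# With the real data you would pass the file path instead of a StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())
print(df.shape)  # (rows, columns)
```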

Check the count of each language and you will see that each language has 1000 different sentences. For example, the English language has 1000 different English sentences.
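One way to check the count per language is value_counts from pandas. This is only a sketch on a made-up four-row data frame; the column names follow the article, the rows do not come from the real data set.

```python
import pandas as pd

# Toy data frame; the rows are made up for illustration.
df = pd.DataFrame({
    "Text": ["hello there", "good morning", "bonjour", "bonne nuit"],
    "Language": ["English", "English", "French", "French"],
})

# Number of sentences for each language
counts = df["Language"].value_counts()
print(counts)
```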

Visualise the target feature using the countplot function of seaborn.
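A minimal sketch of the count plot, again on made-up rows. The Agg backend line is an assumption so that the snippet also runs as a plain script without a display; in a Jupyter notebook you can drop it and the figure shows inline.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy target column; the real one has 22 languages.
df = pd.DataFrame({"Language": ["English", "English", "French", "Tamil"]})

fig, ax = plt.subplots(figsize=(8, 4))
sns.countplot(x="Language", data=df, ax=ax)  # one bar per language
ax.set_title("Sentences per language")
```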

Then preprocess each sentence: remove all the characters except letters, convert the sentence to lower case, remove the stop words, and append the result to the corpus list. The corpus list is our final list, which contains the sentences after preprocessing.
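The preprocessing could look roughly like the sketch below. Two things here are assumptions and not taken from the original notebook: the tiny STOP_WORDS set is hypothetical (a real project would use a proper stop word list), and letters are detected through Unicode categories so that scripts such as Hindi or Tamil are not stripped away.

```python
import unicodedata

# Hypothetical, tiny stop word list used only for this illustration.
STOP_WORDS = {"the", "is", "a", "an", "are", "you", "what"}


def clean_sentence(text, stop_words=STOP_WORDS):
    """Keep only letters, lower-case the text and drop stop words."""
    # Unicode categories starting with L (letters) or M (combining marks)
    # are kept; everything else becomes a space.
    kept = "".join(
        ch if unicodedata.category(ch)[0] in ("L", "M") else " "
        for ch in text
    )
    tokens = kept.lower().split()
    return " ".join(t for t in tokens if t not in stop_words)


sentences = ["What are you doing now-a-days?", "The weather is nice today!"]

# The corpus list holds the preprocessed sentences.
corpus = [clean_sentence(s) for s in sentences]
print(corpus)
```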

Then apply CountVectorizer to convert the sentences from strings into numeric vectors. CountVectorizer represents each sentence by how often each word of the vocabulary occurs in it. After that, check the shape of the final data set.
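Here is a small sketch of the vectorizing step with scikit-learn's CountVectorizer, using its default settings on a made-up three-sentence corpus. Whether the original notebook changed any of the defaults is not stated, so treat the settings as an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the preprocessed sentences.
corpus = ["doing now days", "que fais tu", "doing fine now"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix of word counts

print(sorted(vectorizer.vocabulary_))  # the learned vocabulary
print(X.shape)                         # (number of sentences, vocabulary size)
print(X.toarray())
```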

Next, perform label encoding on our target feature “Language”. It converts the language column into integers, one per language in sorted order: 0, 1, 2, 3, 4, 5, etc.

Then check the target feature after applying the label encoding and check its length, which is 22000. You can see all the labels using the encoder's classes_ attribute, which holds the original language names.
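The label encoding and the checks on it can be sketched as follows with scikit-learn's LabelEncoder. The four labels are made up; on the real data the length would be 22000 and classes_ would list 22 languages.

```python
from sklearn.preprocessing import LabelEncoder

# Toy target column.
languages = ["English", "French", "English", "Tamil"]

le = LabelEncoder()
y = le.fit_transform(languages)  # codes follow the sorted order of the names

print(y)            # integer code for each sentence
print(len(y))       # length of the target feature
print(le.classes_)  # original names; position i is the name for code i
```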

Now build a new data frame with pandas which contains the preprocessed sentences and the target column in numeric form.
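A short sketch of that data frame, assuming the preprocessed sentences are in a list and the encoded target is a list of integers (both made up here):

```python
import pandas as pd

corpus = ["doing now days", "que fais tu", "doing fine now"]  # toy sentences
y = [0, 1, 0]                                                 # toy encoded labels

data = pd.DataFrame({"Text": corpus, "Language": y})
print(data)
```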

Now it's time to prepare the data by splitting it into training and testing sets. The training data is 80% and the testing data is 20%. Our model is trained on the 80% and evaluated on the remaining 20%. Then check the shape of the resulting data sets.
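The 80/20 split can be sketched with train_test_split from scikit-learn. The ten-row arrays are placeholders, and random_state=42 is my own assumption to make the split repeatable; the article does not say which seed, if any, was used.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and labels: 10 samples with 2 features each.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# test_size=0.2 keeps 80% for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```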

Now define the naive Bayes model, which classifies based on probability, and fit it on the training data. Then test the model using the 20% testing data, which returns the predicted output. This output is used to evaluate the model.
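Below is a minimal sketch of training and predicting with a naive Bayes classifier. The article does not say which naive Bayes variant was used, so MultinomialNB (a common choice for word counts) is an assumption, and the sentences are toy examples.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training and testing sentences.
train_texts = [
    "what are you doing today",
    "how are you my friend",
    "que fais tu ce soir",
    "comment vas tu mon ami",
]
train_labels = ["English", "English", "French", "French"]
test_texts = ["what are you doing", "comment vas tu"]

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)  # reuse the training vocabulary

classifier = MultinomialNB()  # assumed variant, see the note above
classifier.fit(X_train, train_labels)

y_pred = classifier.predict(X_test)
print(y_pred)
```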

Now it's time to evaluate the model by comparing the actual testing labels with the predicted output. The accuracy score is 92%, and you can also inspect the confusion matrix.
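The evaluation can be sketched with accuracy_score and confusion_matrix from scikit-learn. The labels below are made up to keep the example small, so the numbers it prints are not the article's 92% result.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up actual and predicted labels for six test sentences.
y_test = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

acc = accuracy_score(y_test, y_pred)   # share of correct predictions
cm = confusion_matrix(y_test, y_pred)  # rows: actual, columns: predicted

print(acc)
print(cm)
```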

To make the confusion matrix easier to read, we plot it using the heatmap function of seaborn.
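A possible way to draw that heatmap is sketched below. The labels, the colour map and the figure size are my own choices for illustration, and the Agg backend line is only there so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside a notebook
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

labels = ["English", "French", "Tamil"]
y_test = ["English", "English", "French", "French", "Tamil", "Tamil"]
y_pred = ["English", "English", "French", "Tamil", "Tamil", "Tamil"]

cm = confusion_matrix(y_test, y_pred, labels=labels)

fig, ax = plt.subplots(figsize=(6, 5))
# annot=True writes the count into each cell; fmt="d" formats it as an integer.
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=labels, yticklabels=labels, ax=ax)
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
```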

Now create a new data frame to compare the actual data and the predicted output side by side.
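A small sketch of such a comparison frame, on made-up labels. The extra "Correct" column is my own addition to make mismatches easy to spot.

```python
import pandas as pd

y_test = ["English", "French", "Tamil", "French"]   # made-up actual labels
y_pred = ["English", "French", "Tamil", "English"]  # made-up predictions

results = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
results["Correct"] = results["Actual"] == results["Predicted"]
print(results)
```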

Now save the model using the joblib library. Its dump function takes two arguments: one is our classifier and the other is the file name for the model.

Then load the model using the load function of joblib, which takes the path to the saved model.
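Saving and loading can be sketched like this. The file name language_model.joblib and the temporary directory are hypothetical, and the classifier is a toy one trained on two rows just to have something to save.

```python
import os
import tempfile

import joblib
from sklearn.naive_bayes import MultinomialNB

# Toy classifier standing in for the trained model.
classifier = MultinomialNB().fit([[1, 0], [0, 1]], ["English", "French"])

# Hypothetical file name inside a temporary directory.
model_path = os.path.join(tempfile.mkdtemp(), "language_model.joblib")

joblib.dump(classifier, model_path)  # arguments: the object, then the file name
loaded = joblib.load(model_path)     # argument: the path to the saved file

print(loaded.predict([[2, 0]]))
```

Keep in mind that predicting on new sentences also needs the fitted vectorizer and label encoder, so you may want to save those the same way.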

Now create a function which takes a custom sentence as input and returns the predicted language. This function applies the same steps that we applied to our data set above.
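Such a function could be sketched as below. The name predict_language is hypothetical, the model is trained on four toy sentences so the snippet runs on its own, and lower-casing stands in for the full preprocessing step; MultinomialNB is again an assumed naive Bayes variant.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

# Toy training data so that the example is self-contained.
texts = [
    "what are you doing today",
    "how are you my friend",
    "que fais tu ce soir",
    "comment vas tu mon ami",
]
languages = ["English", "English", "French", "French"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
le = LabelEncoder()
y = le.fit_transform(languages)
classifier = MultinomialNB().fit(X, y)


def predict_language(sentence):
    """Return the predicted language name for one sentence (hypothetical helper)."""
    cleaned = sentence.lower()                  # stand-in for the full preprocessing
    features = vectorizer.transform([cleaned])  # reuse the fitted vocabulary
    code = classifier.predict(features)         # numeric label
    return le.inverse_transform(code)[0]        # back to the language name


print(predict_language("What are you doing"))
print(predict_language("comment vas tu"))
```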

We pass 6 custom sentences in Tamil, Thai, English and other languages as input to our function. The model predicts 5 of the sentences correctly and 1 incorrectly.

If you have any doubts or questions, you can comment below.

Source Code:

Go to my GitHub account and fork the repo: Language Identification
Use Jupyter Notebook to open the .ipynb file.

Video Tutorial

Thank You!
