Automated Multi-Class Text Classification for Consumer Reviews

This article was published as a part of the Data Science Blogathon.

Overview

In this article, we are going to discuss automated multi-class classification on a mixed data type. Think about text classification: we have a bunch of text and a target label, and based on the incoming text we create a model to learn the target label and finally predict it. We typically perform all our NLP steps like tokenization, etc., to classify our target values. But with real-world data, sometimes along with the text you will also have some continuous or categorical variables. For instance, consider a call center where you have a customer voice-to-text conversion model and you are trying to classify whether a sentence is positive or negative. Here, along with the text data, you might also have data on how much time the customer was on the line with the call center agent and how many times the customer was transferred from one agent to another. These things will also have an impact on identifying whether the customer had a positive or negative interaction with the agent. Here, we are going to work with and classify such data.
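
To make this concrete, here is a tiny, purely hypothetical sketch of such a mixed dataset: a free-text transcript column sitting next to continuous features and a sentiment label. None of these column names come from the article; they just illustrate the call-center scenario.

import pandas as pd

# Hypothetical call-center data: text plus continuous features and a label
calls = pd.DataFrame({
    'transcript': ['thanks, that solved it', 'I was transferred three times and nothing worked'],
    'call_minutes': [4.5, 23.0],     # how long the customer was on the line
    'num_transfers': [0, 3],         # how many times the customer was transferred
    'sentiment': [1, 0]              # target label: 1 = positive, 0 = negative
})
print(calls)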

Getting Started with Automated Text Classification

Here, I am using AutoViML. It is a package for automated machine learning.

It can be installed using:

pip install autoviml

I am also using TensorFlow Datasets, from which I am using the Amazon Personal Care Appliances reviews dataset. I am also using other Python libraries like NumPy and Pandas.

Importing Libraries and Pre-processing for Text Classification

Let's import these libraries.

import tensorflow_datasets as tfds
import numpy as np
import pandas as pd

Next, I am loading our dataset from TensorFlow Datasets and assigning it to the variable dataset. The info variable holds metadata about the dataset. Then I am loading the training split into train_dataset.

dataset, info = tfds.load('amazon_us_reviews/Personal_Care_Appliances_v1_00', with_info = True, batch_size = -1)
train_dataset = dataset['train']

Now, let's print the info.
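
The original shows this only as a screenshot; printing the DatasetInfo object returned by tfds.load is all it takes:

print(info)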

[Image: info of the dataset for text classification]

It lists all the columns in the dataset along with their data types.

Next, I am converting the dataset into a NumPy array.

dataset = tfds.as_numpy(train_dataset)

If you print the dataset output, you can see some of the initial rows because it is a NumPy array.

dataset

Now, I am taking a few selected columns: a combination of continuous values, categorical values, and text columns.

verified_purchase = dataset['data']['verified_purchase']
helpful_votes = dataset['data']['helpful_votes']
review_headline = dataset['data']['review_headline']
review_body = dataset['data']['review_body']
rating = dataset['data']['star_rating']

Next, I am combining these variables into a Pandas data frame.

reviews_df = pd.DataFrame(np.hstack((verified_purchase[:, None], helpful_votes[:, None],
                                     review_headline[:, None], review_body[:, None], rating[:, None])),
                          columns = ['verified', 'votes', 'headline', 'reviews', 'rating'])

Now, I am declaring data types for the columns.

convert_dict = {'verified': int,
                'votes': int,
                'headline': str,
                'reviews': str,
                'rating': int
                }

Then I am passing it to the reviews data frame.

reviews_df = reviews_df.astype(convert_dict)

Let me print it.

[Image: the reviews data frame]

You can see the output.

I have the verified column showing whether it is a verified purchase, the votes given to the review, the headline of the review, the body of the review, and then the rating between 1 and 5.

Automated Multi-Class Text Classification

At present, I am going to create a multi-class classification.

Rather than converting the reviews to just positive or negative, I am converting them into positive, negative, and neutral. For that, I am creating a function. In that function, if the rating is less than or equal to 2, I am considering it a negative review. If the rating is equal to 3, I am considering it a neutral review. A rating above 3 is a positive review.

def convert_rating(rating):
  if rating <= 2:
    out = 0
  elif rating == 3:
    out = 1
  else:
    out = 2
  return out

Now, I am creating a target variable. I take the target column of reviews_df and fill it by applying a lambda function, which calls this function, to the rating column.

reviews_df["target"] = reviews_df["rating"].apply(lambda x: convert_rating(x)) reviews_df
[Image: the data frame with the new target column]

The number of rows in the data frame is

reviews_df.shape[0]

The target value count is

reviews_df["target"].value_counts()

In this imbalanced data set, most of the reviews are positive, some are negative, and only a few are neutral.
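
To see proportions rather than raw counts, a normalized value count also works (a small addition of mine, not shown in the original):

reviews_df["target"].value_counts(normalize = True)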

Here, the neutral class tends not to be predicted properly. The reason is that when people give a rating of 3, they are more or less okay with the product, so their wording overlaps with positive reviews.

Train-Test Split

Allow'south meet how the model performs.

AutoML is just for getting an intuition of how the modeling technique is performing. It is not going to give you the perfect model that you tune and deploy. It is used to get an initial understanding of the model's output, and then you can further fine-tune it and apply it.

Now, I am going to drop the rating column because I don't require it any more now that the target has been created from it. If I keep the rating column and pass it to the model, the model may just learn a mapping from rating to the target variable.

reviews_df = reviews_df.drop('rating', axis = 1)

Now, I am splitting the dataset 75-25, that is, 75% into the train set and 25% into the test set.

from sklearn.model_selection import train_test_split
train, test = train_test_split(reviews_df, test_size = 0.25)
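
Given the class imbalance noted earlier, one optional tweak (my suggestion, not part of the original walkthrough) is to stratify the split on the target so that all three classes keep roughly the same proportions in both sets:

train, test = train_test_split(reviews_df, test_size = 0.25,
                               stratify = reviews_df["target"], random_state = 42)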

Training the Automated Text Classification Model using AutoViML

After that, I am importing the AutoViML package.

from autoviml.Auto_ViML import Auto_ViML

I am defining the target variable as target.
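
The article does not show this line explicitly, but given the call below it is presumably just the name of the target column:

target = 'target'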

I am calling the Auto_ViML function. Here, I am passing the training data frame and the test data frame along with the target variable.

m, feats, trainm, testm = Auto_ViML(train, target, test,
                                    sample_submission = '',
                                    scoring_parameter = '', KMeans_Featurizer = False,
                                    hyper_param = 'RS', feature_reduction = True,
                                    Boosting_Flag = 'CatBoost', Binning_Flag = False,
                                    Add_Poly = 0, Stacking_Flag = False, Imbalanced_Flag = False,
                                    verbose = 2)

Now the training is completed.

It gives you the distribution of words within the review column.

It tries out different combinations. First, it tries out a count vectorizer and then a TF-IDF vectorizer, with and without binary counts. Then it picks the best one (the count vectorizer here) that works for the particular model.
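
For intuition, here is a rough sketch of what that comparison would look like done by hand with scikit-learn. This is a simplification of what AutoViML does internally; the logistic-regression scorer and 5-fold cross-validation are my own choices, not the package's.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score

candidates = {
    'count': CountVectorizer(),
    'count_binary': CountVectorizer(binary = True),
    'tfidf': TfidfVectorizer(),
    'tfidf_binary': TfidfVectorizer(binary = True),
}

# Score each vectorizer on the review text alone and compare
for name, vectorizer in candidates.items():
    pipeline = make_pipeline(vectorizer, LogisticRegression(max_iter = 1000))
    score = cross_val_score(pipeline, train['reviews'], train['target'], cv = 5).mean()
    print(name, round(score, 4))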

[Image: count vectorization output]

It also gives you the ROC curve for each variable.

[Image: ROC curves]

It also gives the feature importances.

[Image: importance of features in prediction]

As I said earlier, this is not the finalized model. It just gives an idea of how it performs. You can take the features and techniques that were created and then do your feature engineering further.

Next, let me print feats. It will give all the features that are used.
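
The feature list appears only as a screenshot in the original; printing the returned object is enough:

print(feats)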

[Image: features used by the text classifier]

You can see the various texts that are considered as features.

Next, let's see the test frame.
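
This is also shown only as a screenshot in the original; you can peek at the transformed test frame returned by Auto_ViML with:

testm.head()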

[Image: the transformed test data frame]

It prints all the features, the different models that are used, the target column, and the predictions based on the test data.

Now, we can save the model.

m.save_model('model', format = 'cbm')
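
Since Boosting_Flag = 'CatBoost' was used, the returned model m is a CatBoost model, so the saved .cbm file can presumably be loaded back like this (not part of the original walkthrough):

from catboost import CatBoostClassifier

loaded_model = CatBoostClassifier()
loaded_model.load_model('model', format = 'cbm')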

Testing the Automated Text Classification model

Let's get into the prediction part of text classification.

Finally, I am plotting the confusion matrix.

from autoviml.Auto_NLP import plot_confusion_matrix, plot_classification_matrix
plot_confusion_matrix(test[target].values, m.predict(testm[feats]))
[Image: confusion matrix]

From the confusion matrix, it can be seen that the model has done very well on class 2. It has made decent predictions on class 0 as well. But class 1 is not handled correctly. The reason is perhaps that this data does not really split cleanly into a neutral class. Another reason may be that neutral reviewers may like the product but don't want to express themselves a lot.
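
To put numbers on that per-class behaviour, you could also print a classification report for the same predictions (my addition, reusing the test[target] and testm[feats] objects from above):

from sklearn.metrics import classification_report

print(classification_report(test[target].values, m.predict(testm[feats])))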

End Notes

In this article, we learned how to apply AutoViML. I hope you find this useful in your NLP journey. As ever, I wish you the best in your learning endeavors!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.
