Diving into Sentiments: Building, Fine-Tuning, and Deploying a COVID-19 Vaccine Tweet Classifier with Hugging Face and User Interface Tools.
INTRODUCTION
Greetings once again! Today, I am thrilled to share with you an exciting project that has surpassed all my expectations, but hey, I say that every time, right? Throughout my time with Azubi, I have worked on various projects, each one seemingly better than the last. However, this particular project stands out because it introduced me to unstructured datasets, deep learning (specifically natural language processing), and the concept of transfer learning through fine-tuning a model.
So, what sets this particular project apart, you might wonder? In our initial four projects, we delved into structured datasets and explored fundamental machine learning algorithms.
However, what makes this endeavor truly extraordinary is its jump into unstructured datasets, deep learning (specifically, natural language processing), and the intricate
realm of transfer learning – fine-tuning a model for specialized training on unique data.
In this project, we had the opportunity to utilize the dataset from the '#ZindiWeekendz To Vaccinate or Not to Vaccinate: It’s not a Question Challenge'.
Our goal was to fine-tune an NLP model to classify tweets as negative, neutral, or positive sentiments regarding the COVID vaccine.
The journey was both challenging and exhilarating, and I am excited to take you through the steps we followed to bring this project to life.
Moreover, I will also delve into the deployment process, where we leveraged the power of Hugging Face and Streamlit Cloud.
By combining the user-friendly interfaces of Streamlit and Gradio, we created an intuitive and interactive platform to showcase the functionality of our sentiment analysis model.
So jump on board as we explore the world of sentiment analysis, dive into the intricacies of fine-tuning a Hugging Face model, and witness the
seamless integration of Streamlit and Gradio in our user interface.
WHAT IS SENTIMENT ANALYSIS?
Sentiment analysis, also known as opinion mining, is a natural language processing (NLP) technique that involves extracting the sentiment or emotional tone expressed in a piece of text. It aims to classify text data into categories or labels that represent sentiment, such as positive, negative, or neutral, or even more specific emotions like happy, sad, or angry.
WHAT IS HUGGING FACE?
Hugging Face serves as a model hub that greatly simplifies the workflow for practitioners leveraging cutting-edge natural language processing models through its Transformers library. To integrate the Hugging Face library into our notebook, we can easily install two essential modules using the following commands:
pip install transformers
pip install datasets
These modules facilitate the seamless utilization of state-of-the-art NLP models and datasets, enhancing the efficiency and convenience of
NLP-related tasks in our workflow.
1.0 DATA CLEANING, EXPLORATION AND FILE SAVING
While the task at hand appeared relatively straightforward, it demanded considerable resources, especially in terms of computational runtime. Given these constraints, the best approach was to give each pivotal phase of the project its own notebook. The key phases were exploratory data analysis (EDA), model training, inference, and deployment, resulting in the creation of five distinct notebooks. This segmentation made the project easier to manage and run from start to finish.
1.1 Data Loading and Exploration:
Since we needed more computational resources, the best decision was to do the work in Google Colab. To get access to the dataset and other resources, we placed them in Google Drive and started the code by mounting Google Drive and installing the necessary packages, such as transformers, datasets, wordcloud, and nlp. The code then loads the training and test datasets from CSV files, which are expected to contain the tweets and their associated labels.
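A minimal sketch of this setup step is given here, assuming a Colab runtime. The folder name in `DATA_DIR` and the file names `Train.csv` and `Test.csv` are placeholders for illustration, not the project's actual layout.

```python
from pathlib import Path

# Hypothetical location of the project files on Google Drive.
DATA_DIR = Path("/content/drive/MyDrive/sentiment_project")


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running in Colab; do nothing elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def load_datasets(data_dir=DATA_DIR):
    """Read the training and test CSV files into pandas DataFrames."""
    import pandas as pd  # imported lazily so the sketch loads without pandas

    # File names are assumptions; adjust them to match the saved files.
    train = pd.read_csv(Path(data_dir) / "Train.csv")
    test = pd.read_csv(Path(data_dir) / "Test.csv")
    return train, test
```

In a notebook, calling `mount_drive_if_colab()` followed by `train, test = load_datasets()` would cover the loading step described above.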
1.2 Data Preprocessing:
After loading and exploring the dataset, including checking its structure, we realized that the data had some missing values. We could have dropped them, since there weren't many and doing so wouldn't necessarily have affected the dataset. Instead, we handled the missing values and addressed the issues with the 'tweet_id' column. Once we were satisfied with the results, we moved on to the text preprocessing step, which included converting text to lowercase, tokenization, removing stopwords, and lemmatization.
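To make the preprocessing steps concrete, here is a small standard-library sketch. The stopword set and the lemma lookup table are tiny stand-ins chosen for illustration; a real run would more likely use a full stopword list and a proper lemmatizer from an NLP library.

```python
import re

# Tiny illustrative stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "was", "to", "of",
             "and", "in", "on", "for", "it", "this"}

# Toy lemma lookup standing in for a real lemmatizer.
LEMMAS = {"vaccines": "vaccine", "vaccinated": "vaccinate",
          "kids": "kid", "doses": "dose"}


def preprocess_text(text):
    """Lowercase, tokenize, drop stopwords, and map tokens to lemmas."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    kept = [tok for tok in tokens if tok not in STOPWORDS]
    return [LEMMAS.get(tok, tok) for tok in kept]
```

For example, `preprocess_text("The vaccines ARE safe for kids")` yields `['vaccine', 'safe', 'kid']`.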
1.3 Data Visualization:
Data visualization is used to explore the distribution of sentiment labels, text lengths, and word clouds for different sentiment labels.
Word clouds provide a visual representation of the most frequent words in the text data.
1.4 Word Frequency Analysis:
The code analyzes the frequency of words in the dataset and identifies the top 10 most common words. This information can be valuable
for understanding the characteristics of the text data.
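A word count like this can be sketched with the standard library's `Counter`. The function name `top_words` is ours, and it assumes the text has already been cleaned so that splitting on whitespace gives sensible tokens.

```python
from collections import Counter


def top_words(texts, n=10):
    """Return the n most common whitespace-separated words across texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)
```

Calling `top_words(lemmatized_texts)` on a list of cleaned tweets returns a list of `(word, count)` pairs, most frequent first.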
1.5 Data Cleaning:
Certain words like 'user' and 'url' are identified as common and are removed from the lemmatized text to clean the data and improve the model's performance.
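Removing these placeholder words takes only a few lines. This is a sketch under the assumption that the lemmatized text is a whitespace-separated string; the helper name is hypothetical.

```python
# Placeholder tokens that carry no sentiment signal.
NOISE_WORDS = {"user", "url"}


def remove_noise_words(text, noise_words=NOISE_WORDS):
    """Drop the given tokens from a whitespace-tokenized string."""
    return " ".join(tok for tok in text.split() if tok not in noise_words)
```

For example, `remove_noise_words("user vaccine work url")` returns `"vaccine work"`.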
1.6 Dataset Splitting and Saving:
The dataset is split into training and test subsets. The final datasets include columns like 'text', 'label', 'agreement', and 'lemmatized'. The 'text' column contains the preprocessed and cleaned text data. The resulting datasets are saved as CSV files for future use.
1.7 Dataset Saving:
Finally, the processed and split datasets are saved to CSV files, which can be loaded later for model training and evaluation.
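The split-and-save step could look roughly like the sketch below. It uses only the standard library so it runs anywhere; the original notebook may well have used pandas or scikit-learn instead. The 20% test fraction and the seed are assumptions.

```python
import csv
import random


def split_rows(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def save_csv(rows, path, fieldnames):
    """Write a list of dict rows to a CSV file with a header."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

With rows holding the 'text', 'label', 'agreement', and 'lemmatized' fields, `save_csv(train_rows, "train.csv", ["text", "label", "agreement", "lemmatized"])` writes one of the two files.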
2.0 MODEL TRAINING AND FINE-TUNING
In choosing a pre-trained model to fine-tune, we considered the following factors:
- The data we were going to use for fine-tuning the model
- The task specificity of the project
- The trade-off between model size and performance
- The model's popularity and endorsement on Hugging Face (community feedback)
Based on these, we worked with two models:
- DistilBERT: This model is a distilled version of BERT, hence the name, which retains most of its accuracy but is much lighter and faster.
- BERT: This model is designed for a wide range of NLP tasks, including question answering and sentiment analysis.
2.1 Installing Required Libraries:
The code installs various Python libraries required for the project, including transformers, datasets, wordcloud, plotly, nlp, and huggingface_hub.
2.2 Data Import and Exploration:
The code imports essential data manipulation and visualization libraries. It loads a training dataset saved earlier as a CSV file, which is part of the project dataset stored on
Google Drive. The dataset is then split into training and evaluation subsets.
2.3 Examples of Tweets:
The code showcases examples of tweets with positive, neutral, and negative sentiments from the training dataset.
2.4 Tokenization and Data Preparation:
The script tokenizes the lemmatized text data using the model's tokenizer and prepares the data for model input. Labels are transformed into numerical values.
The dataset is split into training and evaluation subsets.
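Here is a rough sketch of the label encoding and tokenization. It assumes the raw labels are -1, 0, and 1 for negative, neutral, and positive, and that the cleaned text lives in a 'lemmatized' column; the `max_length` of 128 is also an assumption.

```python
# Assumed raw labels: -1 (negative), 0 (neutral), 1 (positive).
LABEL_TO_ID = {-1: 0, 0: 1, 1: 2}


def encode_label(raw_label):
    """Map a raw sentiment label to a class index starting at 0."""
    return LABEL_TO_ID[int(raw_label)]


def tokenize_batch(batch, tokenizer, max_length=128):
    """Tokenize the 'lemmatized' column of a batch for model input."""
    return tokenizer(
        batch["lemmatized"],
        truncation=True,
        padding="max_length",
        max_length=max_length,
    )
```

With a tokenizer loaded through `AutoTokenizer.from_pretrained(...)` and a `datasets.Dataset`, this helper can be applied with `dataset.map(lambda b: tokenize_batch(b, tokenizer), batched=True)`.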
2.5 Model Fine-Tuning:
The script fine-tunes the model for sequence classification on the sentiment analysis task. Training parameters and configurations are set using the TrainingArguments class. The model is fine-tuned using the Trainer class, and the final evaluation is performed.
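The fine-tuning setup can be sketched as follows. The hyperparameter values are illustrative defaults, not the ones used in the project, and the helper name `build_trainer` is ours.

```python
# Illustrative hyperparameters, not the values used in the project.
TRAINING_CONFIG = {
    "output_dir": "./results",
    "num_train_epochs": 3,
    "per_device_train_batch_size": 16,
    "per_device_eval_batch_size": 16,
    "learning_rate": 2e-5,
    "weight_decay": 0.01,
}


def build_trainer(checkpoint, train_dataset, eval_dataset, compute_metrics=None):
    """Create a Trainer that fine-tunes checkpoint for 3-class classification."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import (
        AutoModelForSequenceClassification,
        Trainer,
        TrainingArguments,
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=3
    )
    args = TrainingArguments(**TRAINING_CONFIG)
    return Trainer(
        model=model,
        args=args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )
```

Assuming tokenized datasets `train_ds` and `eval_ds`, a run would look like `trainer = build_trainer("distilbert-base-uncased", train_ds, eval_ds)` followed by `trainer.train()` and `trainer.evaluate()`. The checkpoint name here is an assumption; the exact checkpoints we used may differ.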
2.6 Evaluation Metrics:
The script evaluates the fine-tuned model on the evaluation dataset and computes accuracy as a metric.
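An accuracy function of the kind the Trainer accepts through its `compute_metrics` argument can be sketched in plain Python. It assumes the input unpacks into `(logits, labels)`, with one row of class scores per example.

```python
def compute_metrics(eval_pred):
    """Compute accuracy from (logits, labels) as passed by Trainer."""
    logits, labels = eval_pred
    # Pick the highest-scoring class index for each row of logits.
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}
```

In practice the same result is often computed with NumPy's `argmax`; the plain-Python version is used here only to keep the sketch dependency-free.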
2.7 Pushing Model to the Hugging Face Model Hub:
The code pushes the fine-tuned model, tokenizer, and configuration to the Hugging Face Model Hub for versioning and sharing.
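Pushing to the Hub comes down to calling `push_to_hub` on the model and the tokenizer. The sketch below assumes you are already authenticated (for example via `huggingface_hub.notebook_login()`); the repository ID in the usage note is a placeholder.

```python
def push_artifacts(model, tokenizer, repo_id):
    """Upload the model (with its config) and the tokenizer to one Hub repo."""
    model.push_to_hub(repo_id)      # also uploads the model's config
    tokenizer.push_to_hub(repo_id)
    return repo_id
```

For example, `push_artifacts(trainer.model, tokenizer, "your-username/your-model-name")` would upload both to the same repository.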
2.8 Comparison of Model Metrics:
The script provides a summary table comparing key metrics (training loss, runtime, samples per second, evaluation loss, accuracy, and evaluation runtime) between two models.
2.9 Observations:
The observations highlight differences in training loss, runtime, samples per second, evaluation loss, and accuracy between the two models.
3.0 INFERENCE
Inference is the process of running data points into a machine learning model to calculate an output such as a single numerical score. This process is also referred to as "operationalizing a machine learning model" or "putting a machine learning model into production."
In this notebook, the code shows how to load a fine-tuned sentiment analysis model, preprocess input text, classify the sentiment using the PyTorch-based model, and display the results.
3.1 Library Installation:
The notebook starts by installing the necessary libraries, including the transformers library for working with pre-trained models and the gradio library for creating simple UIs.
3.2 Loading Pre-trained Model:
The notebook then loads a pre-trained sentiment analysis model that has been fine-tuned on a specific task. The model is loaded using the AutoModelForSequenceClassification class from the transformers library.
3.3 Tokenizer Initialization:
The tokenizer is initialized using the AutoTokenizer class, which is essential for converting input text into a format suitable for the model.
3.4 Text Preprocessing Function:
The preprocess function is defined to handle some basic text preprocessing, such as replacing usernames with '@user' and identifying links ('http').
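A function matching that description could look like this. It is a sketch of the common pattern for tweet models, assuming tokens are separated by single spaces.

```python
def preprocess(text):
    """Replace @mentions with '@user' and links with 'http'."""
    new_text = []
    for token in text.split(" "):
        if token.startswith("@") and len(token) > 1:
            token = "@user"
        elif token.startswith("http"):
            token = "http"
        new_text.append(token)
    return " ".join(new_text)
```

For example, `preprocess("@john check https://t.co/abc now")` returns `"@user check http now"`.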
3.5 Pipeline for Text Classification:
The pipeline function from the transformers library is used to create a text classification pipeline. The pipeline is created for the "text-classification" task and uses the fine-tuned model.
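Loading the model and tokenizer and wiring them into a pipeline can be sketched as below. The `MODEL_ID` value is a made-up placeholder; substitute the Hub ID of the fine-tuned model.

```python
# Hypothetical Hub repository ID; replace with the fine-tuned model's ID.
MODEL_ID = "your-username/covid-vaccine-sentiment"


def build_classifier(model_id=MODEL_ID):
    """Load the fine-tuned model and tokenizer into a text-classification pipeline."""
    # Imported lazily because transformers is a heavy dependency.
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        pipeline,
    )

    model = AutoModelForSequenceClassification.from_pretrained(model_id)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return pipeline("text-classification", model=model, tokenizer=tokenizer)
```

Once built, `classifier = build_classifier()` can be called directly on a string, for example `classifier("This covid came with its own agenda")`.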
3.6 Input Text Preprocessing:
An example text ("This covid came with its own agenda") is preprocessed using the preprocess function.
3.7 PyTorch Model Inference:
The preprocessed text is tokenized, and the model's PyTorch version is used to obtain the model's output scores without applying the softmax function. The output scores represent the model's confidence for each sentiment class.
3.8 Softmax Transformation:
The softmax function is applied to the output scores to obtain probability-like values, indicating the model's confidence in each sentiment class.
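For reference, softmax itself is short enough to write by hand. This standard-library sketch subtracts the maximum score first, which leaves the result unchanged but avoids overflow for large scores; the notebook itself more likely relied on a library implementation.

```python
import math


def softmax(scores):
    """Convert raw scores (logits) into values that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The ordering of the classes is preserved: the class with the highest raw score also gets the highest probability.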
3.9 Configuring Labels:
Here, the code configures labels for different sentiment classes (negative, neutral, positive) using the config object.
3.10 Printing Classification Results:
The code prints the original text and the model's classification results, including the sentiment label, sorted by the model's confidence scores.
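The ranking and printing step might be sketched like this. The `ID2LABEL` mapping is an assumption about the order of the classes in the model config; check the actual config before relying on it.

```python
# Assumed mapping from class index to label, as stored in the model config.
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}


def rank_labels(probabilities, id2label=ID2LABEL):
    """Return (label, probability) pairs sorted from most to least likely."""
    pairs = [(id2label[i], p) for i, p in enumerate(probabilities)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


def print_results(text, probabilities):
    """Print the input text followed by its ranked sentiment labels."""
    print(f"Text: {text}")
    for rank, (label, prob) in enumerate(rank_labels(probabilities), start=1):
        print(f"{rank}) {label} {prob:.4f}")
```

Calling `print_results(text, probabilities)` with the softmax output prints one line per class, highest confidence first.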
3.11 Displaying Scores:
The raw scores without the softmax function and the scores after applying softmax are displayed for reference.
4.0 DEPLOYMENT
After the inference notebook, we moved to the deployment phase. At this point, we wanted to give our models a simple user interface, as everything had been going according to plan so far.
We used Streamlit and Gradio to build the user interface for our models.
- Streamlit is an open-source Python library that is used to create web applications for data science and machine learning projects with minimal effort. It is designed to be user-friendly and allows developers to create interactive and visually appealing applications by writing simple Python scripts.
- Gradio is another Python library for creating UIs for machine learning models. It's focused on simplifying the deployment and interaction with machine learning models through intuitive user interfaces. Gradio allows you to turn your ML models into shareable APIs with minimal code. It supports a wide range of models, from deep learning to traditional machine learning models.
There is an article on Streamlit here and another on Gradio here for more info.
The user interfaces were built locally, but you can get access to the code here. The models were later deployed on Hugging Face and GitHub and can also be accessed via the Gradio and Streamlit apps to get a feel for the model.
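To give a feel for how little code such an interface needs, here is a hedged Gradio sketch. It assumes `classifier` is a transformers text-classification pipeline; the helper names and the interface title are ours, not taken from the deployed app.

```python
def to_label_scores(results):
    """Convert pipeline output into the {label: score} dict gr.Label accepts."""
    # Some transformers versions nest the per-class scores in an extra list.
    if results and isinstance(results[0], list):
        results = results[0]
    return {item["label"]: float(item["score"]) for item in results}


def build_demo(classifier):
    """Wrap a text-classification pipeline in a simple Gradio interface."""
    import gradio as gr  # imported lazily so the helper above runs without it

    def predict(text):
        # top_k=None asks the pipeline for scores of all classes.
        return to_label_scores(classifier(text, top_k=None))

    return gr.Interface(
        fn=predict,
        inputs=gr.Textbox(label="Tweet"),
        outputs=gr.Label(num_top_classes=3),
        title="COVID-19 Vaccine Tweet Sentiment",
    )
```

Running `build_demo(classifier).launch()` starts the app locally. A Streamlit version follows the same idea, with a text input widget feeding the classifier and the scores written back to the page.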
5.0 CONCLUSION
As stated earlier, carrying out this project was really exciting and we learnt quite a lot. We are really grateful to our tutors for their guidance on this project. If you are interested in carrying out this project yourself, you can check out my repo on GitHub here.