A few months ago, I worked on a project that attempted to classify a person's mental state from their statements. I used BERT to generate embeddings from the statements, which made them easier to process with different models. I tried both linear and non-linear models, and also included a simple ANN (Artificial Neural Network), to identify which model performed best. By the end of the tests, I had settled on an XGBoost model, which scored an accuracy of 80% — a result I thought was good. However, I also noticed that my Linear SVM and Logistic Regression models scored relatively high as well: 77% and 76% respectively. This made me question whether my data was largely linear, and how I could go about improving the training data to get better results.
There are a couple of techniques for improving the training data for a linear model. You can test for multicollinearity and remove correlated features. You could use PCA to get a more compact dataset for training your model. You could remove outliers from your dataset, as they tend to pull on a linear model's coefficients. I did not want to remove any features, since they represented the different dimensions of my embeddings, so PCA and the removal of correlated features were not options. Removing outliers, however, was doable. But how do you approach that with a dataset that has 5 classes of mental states?
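For completeness, here is a quick sketch of what the first two techniques look like in practice. The data here is synthetic (a random matrix with one deliberately correlated column injected for illustration), and the 0.9 correlation cutoff and 95% variance target are arbitrary choices, not recommendations:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Synthetic stand-in for an embedding matrix (illustration only)
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(200, 10)),
                 columns=[f"dim_{k}" for k in range(10)])
# Inject a near-duplicate feature so the correlation check has something to find
X["dim_9"] = X["dim_0"] * 0.95 + rng.normal(scale=0.1, size=200)

# 1. Multicollinearity check: flag feature pairs with |r| above a threshold
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print("Highly correlated features:", to_drop)

# 2. PCA: keep just enough components to explain 95% of the variance
pca = PCA(n_components=0.95)
X_compact = pca.fit_transform(X)
print("Reduced from", X.shape[1], "to", X_compact.shape[1], "dimensions")
```

Both approaches shrink the feature space, which is exactly why they were off the table for embedding dimensions that all needed to be kept.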
One way to think about the problem is that each mental state class represents a different linear problem and will have outliers specific to that group. It is therefore necessary to cycle through each group and remove its associated outliers. Here is an approach to that problem, broken down into steps:
- Separate your data into the 5 groups representing the 5 classes.
- Generate a centroid embedding by averaging each dimension of the embedding.
- Calculate the distance of each observation from the centroid.
- Remove any observation whose distance falls above, say, the 90th percentile for its group. This cutoff is arbitrary, because it depends on what you consider an outlier; it could just as easily be the 95th percentile.
Here is an example of some code I used to do this. We will assume that the embeddings live in a dataframe df alongside a status column, and that we have already identified the different classes of sentiments, which I will keep in the statuses variable. Let's start with generating the centroids and calculating the distances from them.
from sklearn.metrics.pairwise import cosine_distances
import pandas as pd
import numpy as np

# Create a dictionary for keeping track of the generated centroids
centroids = {}

# Make a copy of the dataframe
working_df = df.copy()

# Loop through the groups, calculate each centroid, and flag outliers
for i in statuses:
    # Separate out the rows belonging to this group
    group = df[df['status'] == i]
    vectors = group.drop(columns=['status']).values

    # Calculate the centroid by averaging each embedding dimension
    centroids[i] = np.mean(vectors, axis=0)

    # Calculate each observation's cosine distance from the centroid
    distances = cosine_distances(vectors, centroids[i].reshape(1, -1)).flatten()

    # Define the threshold for outliers (here, the top 10% furthest from the centroid)
    threshold = np.percentile(distances, 90)

    # Store the results back on the working copy
    working_df.loc[group.index, 'cosine_distance'] = distances
    working_df.loc[group.index, 'is_outlier'] = distances > threshold
Lastly, let's remove the outliers from the data.
# The loc-assigned column may be object dtype, so cast to bool before inverting
cleaned_df = working_df[~working_df['is_outlier'].astype(bool)]
cleaned_df = cleaned_df.drop(columns=['cosine_distance', 'is_outlier'])
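To sanity-check the whole pipeline end to end, here is a small self-contained run on synthetic data. The two toy class labels and the 2-D "embeddings" are made up purely for illustration; with 50 points per group and a 90th-percentile cutoff, roughly the 5 furthest points in each group get dropped:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_distances

rng = np.random.default_rng(0)

# Two toy "mental state" classes with 2-D stand-in embeddings
frames = []
for status, center in [("calm", [1.0, 0.0]), ("anxious", [0.0, 1.0])]:
    points = center + rng.normal(scale=0.1, size=(50, 2))
    frame = pd.DataFrame(points, columns=["e0", "e1"])
    frame["status"] = status
    frames.append(frame)
df = pd.concat(frames, ignore_index=True)

working_df = df.copy()
working_df["is_outlier"] = False  # initialise as bool to keep the dtype clean

for status in df["status"].unique():
    group = df[df["status"] == status]
    vectors = group.drop(columns=["status"]).values
    centroid = vectors.mean(axis=0)
    distances = cosine_distances(vectors, centroid.reshape(1, -1)).flatten()
    threshold = np.percentile(distances, 90)
    working_df.loc[group.index, "is_outlier"] = distances > threshold

cleaned_df = working_df[~working_df["is_outlier"]].drop(columns=["is_outlier"])
print(len(df), "->", len(cleaned_df))  # the top ~10% per group are removed
```

Initialising the is_outlier column as False before the loop keeps it boolean, which is what makes the `~` inversion at the end safe.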
Incidentally, when I applied this change to my data, it did not change my results, except for my XGBoost model, whose accuracy dropped by 1%. This may imply that my data is not particularly linear, and that I lost some predictive capability by removing observations from my dataset. Nevertheless, by grouping your sentiments, identifying a centroid within each group, and removing the observations furthest from those centroids, you can remove outliers from sentiment data.