Data Science

Removing Outliers from Sentiment Data

6 July 2025
Textual data demands different techniques for preparation and model training. Here, I present a technique for identifying outliers in textual data. It applies to data that has linear properties, so it should improve the performance of models trained on linear textual data.

Text Data Augmentation Techniques

14 May 2025
In the final semester of my Master’s program, I built an NLP model to classify mental health states such as Bipolar, Depression, Suicidal, and others based on written text. I worked with a labeled but highly imbalanced Kaggle dataset, and my initial efforts with over-sampling and under-sampling yielded limited accuracy and generalizability. This prompted a shift to data augmentation, specifically synonym replacement and back translation, to enhance model performance.

What I learned is that data augmentation via synonym replacement and back translation offers a practical approach for handling class imbalance in NLP datasets. While not a silver bullet, when applied carefully, these methods improve both the robustness and generalizability of mental health classification models.
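To make the idea concrete, here is a minimal sketch of synonym replacement used to add samples to an under-represented class. It is not the code from the project: the SYNONYMS table, the function names, and the example sentences are all hypothetical, and a real pipeline would more likely draw synonyms from a thesaurus such as WordNet. Back translation is not shown because it needs an external translation model.

```python
import random

# Hypothetical synonym table, for illustration only. A real project would
# likely look synonyms up in a thesaurus such as WordNet instead.
SYNONYMS = {
    "sad": ["unhappy", "down"],
    "tired": ["exhausted", "weary"],
    "happy": ["glad", "cheerful"],
}


def synonym_replace(text, rng, p=0.5):
    """Swap each word that has a known synonym with probability p.

    Tokenization is a plain whitespace split, so words with attached
    punctuation (for example "sad.") are left unchanged.
    """
    out = []
    for word in text.split():
        options = SYNONYMS.get(word.lower())
        if options and rng.random() < p:
            out.append(rng.choice(options))
        else:
            out.append(word)
    return " ".join(out)


def augment_minority(texts, labels, target_label, n_new, seed=0):
    """Return copies of texts and labels with n_new augmented samples added.

    New samples are synonym-replaced variants of randomly chosen
    existing samples that carry target_label.
    """
    rng = random.Random(seed)
    pool = [t for t, y in zip(texts, labels) if y == target_label]
    if not pool:
        raise ValueError(f"no samples with label {target_label!r}")
    new_texts = [synonym_replace(rng.choice(pool), rng) for _ in range(n_new)]
    return texts + new_texts, labels + [target_label] * n_new


texts = ["I feel sad and tired", "I am happy today", "Life is good"]
labels = ["Depression", "Normal", "Normal"]
aug_texts, aug_labels = augment_minority(texts, labels, "Depression", n_new=2)
```

Because the replacement is random, some generated samples may be identical to their source sentence; deduplicating afterwards, or raising p, are both reasonable adjustments.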