Data Science

Understanding Your Dataset: Bounded vs Unbounded Data

22 February 2026
Understanding your dataset is the key to choosing the right technique for scaling your data before training your model. This blog entry explains the difference between bounded and unbounded data and how that difference dictates whether we use normalization or standardization to scale the data.
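As a rough illustration of the two scaling options, here is a minimal sketch in plain Python. It assumes the common pairing of min-max normalization for bounded features and z-score standardization for unbounded ones; the summary above does not spell out that mapping. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values to [0, 1] using the observed minimum and maximum."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Center values on mean 0 and scale to unit population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical bounded feature: exam scores that can only fall in [0, 100].
exam_scores = [0, 25, 50, 75, 100]
# Hypothetical unbounded feature: incomes with no natural upper limit.
incomes = [20_000, 35_000, 50_000, 65_000, 80_000]

scaled_scores = min_max_normalize(exam_scores)
scaled_incomes = standardize(incomes)
```

Min-max scaling relies on the minimum and maximum being meaningful, which is why it tends to suit bounded data. Standardization makes no assumption about fixed limits, only about the mean and spread.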

Removing Outliers from Sentiment Data

6 July 2025
Textual data demands different techniques for model training and preparation. Here, I present a technique for identifying outliers in textual data. The technique applies to data that has linear properties, so it should improve the performance of models trained on linear textual data.

Text Data Augmentation Techniques

14 May 2025
In the final semester of my Master’s program, I built an NLP model to classify mental health states such as Bipolar, Depression, Suicidal, and others based on written text. Using a labeled but highly imbalanced Kaggle dataset, initial efforts with over-sampling and under-sampling led to limited accuracy and generalizability. This prompted a shift to data augmentation, specifically through synonym replacement and back translation, to enhance model performance.

What I learned is that data augmentation via synonym replacement and back translation offers a practical approach for handling class imbalance in NLP datasets. While not a silver bullet, when applied carefully, these methods improve both the robustness and generalizability of mental health classification models.
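To make synonym replacement concrete, here is a minimal sketch, not the implementation used in the project. The `SYNONYMS` table, the `synonym_replace` function, and the sample sentence are all hypothetical. A real pipeline would draw synonyms from a thesaurus resource, and back translation would additionally need a translation model or service, which is not shown here.

```python
import random

# Hypothetical toy synonym table, used only for illustration.
SYNONYMS = {
    "sad": ["unhappy", "down"],
    "tired": ["exhausted", "weary"],
    "happy": ["glad", "cheerful"],
}

def synonym_replace(text, n_replacements=1, rng=None):
    """Return a copy of text with up to n_replacements words swapped for synonyms."""
    rng = rng or random.Random()
    tokens = text.split()
    # Positions of words that have an entry in the synonym table.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n_replacements]:
        tokens[i] = rng.choice(SYNONYMS[tokens[i].lower()])
    return " ".join(tokens)

# A seeded generator keeps the augmentation reproducible.
rng = random.Random(0)
original = "i feel sad and tired all the time"
augmented = [synonym_replace(original, n_replacements=2, rng=rng) for _ in range(3)]
```

Each augmented sentence keeps the label of the original, so generating a few variants per minority-class example is one way to grow under-represented classes without duplicating text verbatim.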