Data Science

Text Data Augmentation Techniques

14 May 2025
In the final semester of my Master’s program, I built an NLP model to classify mental health states (Bipolar, Depression, Suicidal, and others) from written text. The labeled Kaggle dataset I used was highly imbalanced, and my initial attempts at over-sampling and under-sampling yielded limited accuracy and poor generalization. That prompted a shift to data augmentation, specifically synonym replacement (swapping words for close synonyms) and back translation (translating text into another language and back to get a paraphrase), to improve model performance.
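To make the synonym replacement idea concrete, here is a minimal sketch in Python. It is not the code from my project: the `SYNONYMS` table, the `synonym_replace` function, and the example sentence are all hypothetical and made up for illustration, and a real pipeline would more likely pull synonyms from a lexical resource than from a hand-written dictionary.

```python
import random

# Hypothetical, hand-written synonym table for illustration only.
SYNONYMS = {
    "sad": ["unhappy", "down"],
    "tired": ["exhausted", "drained"],
    "happy": ["glad", "cheerful"],
}

def synonym_replace(text, n_replacements=1, rng=None):
    """Return a copy of `text` with up to `n_replacements` words swapped for synonyms."""
    if rng is None:
        rng = random.Random()
    # Naive whitespace tokenization: a word with attached punctuation
    # (e.g. "sad,") will not match the table.
    words = text.split()
    # Positions of words that have an entry in the synonym table.
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    rng.shuffle(candidates)
    for i in candidates[:n_replacements]:
        words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

original = "i feel sad and tired all the time"
# A seeded generator keeps the augmentation reproducible.
augmented = synonym_replace(original, n_replacements=2, rng=random.Random(0))
print(augmented)
```

The label of the original sample is reused for the augmented one, so the approach only makes sense when the swapped words do not change the meaning that the label depends on.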

What I learned is that data augmentation via synonym replacement and back translation is a practical way to handle class imbalance in NLP datasets. While not a silver bullet, these methods, when applied carefully, can improve both the robustness and the generalizability of mental health classification models.
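As a rough illustration of how augmentation can address class imbalance, the sketch below tops up each minority class with augmented copies until it matches the largest class. The function `balance_with_augmentation`, the placeholder `toy_augment`, the "Normal" label, and the sample texts are all assumptions made up for this example; in practice the `augment` argument would be a synonym replacement or back translation function.

```python
import random
from collections import Counter

def balance_with_augmentation(texts, labels, augment, seed=0):
    """Add augmented copies of minority-class texts until every class
    is as large as the largest class."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    new_texts, new_labels = list(texts), list(labels)
    for label, count in counts.items():
        # Original samples of this class, used as sources for augmentation.
        pool = [t for t, l in zip(texts, labels) if l == label]
        for _ in range(target - count):
            new_texts.append(augment(rng.choice(pool)))
            new_labels.append(label)
    return new_texts, new_labels

# Placeholder augmenter (hypothetical); swap in a real augmentation function.
def toy_augment(text):
    return text.replace("sad", "unhappy")

# Made-up toy data, not taken from the Kaggle dataset.
texts = ["i feel fine", "all good today", "nothing wrong", "i feel sad"]
labels = ["Normal", "Normal", "Normal", "Depression"]

aug_texts, aug_labels = balance_with_augmentation(texts, labels, toy_augment)
print(Counter(aug_labels))
```

One design point worth keeping in mind: augmentation of this kind is normally applied to the training split only, so that near-duplicates of training samples do not leak into the validation or test data.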