Text Data Augmentation Techniques

In the last semester of my Master’s degree, I worked on a project that classified a person’s mental health state from their writing using NLP techniques. Thankfully, I had gotten my hands on a labeled dataset from Kaggle covering seven mental health states – Normal, Bipolar, Personality Disorder, Anxiety, Suicidal, Depression, and Stress. Like many datasets in the wild, the data was severely imbalanced. Concerned about overfitting and bias in the resulting model, I did what every data scientist does: I applied over-sampling and under-sampling techniques to build a model with as little bias as possible. Although that worked, the model’s accuracy and ability to generalize were limited. I later abandoned resampling in favor of data augmentation and learned two techniques that improved both accuracy and generalization. The techniques I will explain here are synonym replacement and back translation.

Synonym Replacement


Synonym replacement is a data augmentation technique that substitutes synonyms for tokenized words without changing the overall meaning of the statement. This creates a new statement containing words that were not already present in the original dataset, while retaining the meaning of the original. A Python excerpt of the code I used is shown below.

import random

import nltk
from nltk.corpus import wordnet

nltk.download('wordnet')

random.seed(42)  # for reproducibility

def synonym_replacement(sentence, n=1):
    """Replace up to n randomly chosen words with a WordNet synonym."""
    words = sentence.split()
    new_words = words.copy()
    random_idx = random.sample(range(len(words)), min(n, len(words)))

    for idx in random_idx:
        synonyms = wordnet.synsets(words[idx])
        if synonyms:
            synonym = synonyms[0].lemmas()[0].name().replace('_', ' ')
            new_words[idx] = synonym

    return ' '.join(new_words)

def augment_data(minority_classes, augmented_data, amount):
    """Sample `amount` statements per minority class from df and append
    synonym-replaced copies (df has 'status' and 'statement' columns)."""
    for label in minority_classes:
        subset = df[df['status'] == label]['statement'].dropna().sample(amount, replace=True)
        for text in subset:
            if isinstance(text, str):
                augmented_data.append((synonym_replacement(text), label))
    return augmented_data

minority_classes = ['Bipolar', 'Stress']
augmented_data = augment_data(minority_classes, augmented_data, 2000)

The code produced 2,000 new observations for each of the Bipolar and Stress classes.
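To make the replacement step concrete without the WordNet dependency, here is a toy sketch of the same idea: a small hand-built synonym table (`SYNONYMS`, invented for illustration) stands in for the WordNet lookup, and a seeded generator makes the result reproducible.

```python
import random

# Toy synonym table standing in for WordNet lookups (illustrative only).
SYNONYMS = {
    "sad": "unhappy",
    "worried": "anxious",
    "tired": "exhausted",
}

def simple_synonym_replacement(sentence, n=1, seed=42):
    """Replace up to n words that have an entry in the synonym table."""
    rng = random.Random(seed)
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w in SYNONYMS]
    for idx in rng.sample(candidates, min(n, len(candidates))):
        words[idx] = SYNONYMS[words[idx]]
    return " ".join(words)

print(simple_synonym_replacement("i feel sad and tired today", n=2))
# i feel unhappy and exhausted today
```

The augmented sentence keeps the original meaning but introduces vocabulary the dataset did not contain, which is exactly what the WordNet version above does at scale.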

Back Translation

Whereas synonym replacement swaps individual words for words of similar meaning, back translation replaces entire statements with new ones of similar meaning. It does this by translating each statement from English into another language of your choice (French, Spanish, German, etc.) and then back into English. This allows for more diversity in how statements are expressed. An example of the code I used to accomplish this is shown below.

!pip install googletrans==3.1.0a0

from googletrans import Translator

translator = Translator()

def random_translate(minority_classes, augmented_data, amount):
    """Sample `amount` statements per minority class from df and append
    back-translated copies (English -> French -> English)."""
    for label in minority_classes:
        subset = df[df['status'] == label]['statement'].dropna().sample(amount, replace=True)
        for text in subset:
            if isinstance(text, str):
                french = translator.translate(text, dest='fr')       # English -> French
                back = translator.translate(french.text, dest='en')  # French -> English
                augmented_data.append((back.text, label))
    return augmented_data

# Personality Disorder
augmented_data = random_translate(['Personality disorder'], augmented_data, 700)

# Suicidal
augmented_data = random_translate(['Suicidal'], augmented_data, 700)

Advantages and Disadvantages

One advantage of applying both augmentation techniques is that you increase the number of observations available for the training dataset, effectively rebalancing selected classes. They also introduce new ways of expressing the same statements, exposing the model to different approaches to sentence construction, which pushes the model’s ability, ever so slightly, to generalize over new content. Lastly, they expand the vocabulary the model is exposed to.
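A quick way to confirm the rebalancing worked is to count the labels in the augmented list. The snippet below assumes `augmented_data` holds `(statement, label)` tuples as in the code above, with a small made-up sample standing in for the real list.

```python
from collections import Counter

# (statement, label) tuples, as built by augment_data / random_translate;
# this tiny sample is invented for illustration.
augmented_data = [
    ("i can't sleep at night", "Bipolar"),
    ("i cannot sleep at night", "Bipolar"),
    ("deadlines everywhere i look", "Stress"),
]

label_counts = Counter(label for _, label in augmented_data)
print(label_counts)
```

Comparing these counts with the original class distribution shows exactly how much each minority class was boosted.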

However, there is a downside. Both techniques provide alternative ways of saying the same thing; they do not provide novel content the model has not seen before. They offer new ways of representing the same semantic meaning. For genuinely new content, we would have to collect new observations distinct from what we have already gathered.

Another disadvantage is that we run the risk of producing duplicates. This can be addressed through post-augmentation processing: removing duplicates from the training set is important to avoid overfitting.
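One minimal sketch of that post-augmentation step, assuming the augmented rows are `(statement, status)` tuples as in the snippets above, is to load them into a DataFrame and drop exact duplicate statements before training:

```python
import pandas as pd

# Augmented rows as (statement, status) tuples; a made-up sample for illustration.
augmented_data = [
    ("i feel unhappy today", "Depression"),
    ("i feel unhappy today", "Depression"),  # duplicate produced by augmentation
    ("everything stresses me out", "Stress"),
]

aug_df = pd.DataFrame(augmented_data, columns=["statement", "status"])
aug_df = aug_df.drop_duplicates(subset=["statement"]).reset_index(drop=True)
print(len(aug_df))  # 2
```

Exact-match deduplication is the cheapest option; near-duplicates (e.g. differing only in punctuation) would need normalization or fuzzier matching first.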

Nevertheless, data augmentation is an important technique for dealing with imbalanced datasets. Synonym replacement and back translation are only two methods for applying data augmentation to text data. When applied effectively, both methods increase a model’s ability to generalize and detect alternative ways of expressing similar ideas while reducing biases in the dataset.

