Formatting Functions

Available functions:


Remove stopwords

Remove stopwords from the input text using NLTK's stopwords.
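
As a rough illustration of the idea (not DupliPy's actual implementation), the sketch below filters tokens against a stopword set. The function name `remove_stopwords_sketch` and the tiny hard-coded `STOPWORDS` set are assumptions made so the example runs without downloading corpus data; with NLTK installed, the list would normally come from `nltk.corpus.stopwords.words("english")`.

```python
# Hypothetical stand-in for NLTK's English stopword list, kept tiny so the
# sketch runs without downloading any corpus data.
STOPWORDS = {"a", "an", "the", "is", "on", "of", "and", "to", "in"}


def remove_stopwords_sketch(text, stopwords=STOPWORDS):
    """Drop whitespace-separated tokens whose lowercase form is a stopword."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


print(remove_stopwords_sketch("The cat is on the mat"))
```

Matching on the lowercase form means capitalized stopwords such as "The" are removed as well.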

Remove numbers

Remove numbers from the input text.
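
One plausible way to do this is a regular expression over digit runs. The helper name `remove_numbers_sketch` is hypothetical, and whether the library also tidies the leftover spaces is an assumption of this sketch.

```python
import re


def remove_numbers_sketch(text):
    """Delete every run of digits, then tidy the spaces left behind."""
    without_digits = re.sub(r"\d+", "", text)
    # Assumption: collapse the double spaces that deleting a number creates.
    return re.sub(r"\s+", " ", without_digits).strip()


print(remove_numbers_sketch("Order 66 shipped in 2 days"))
```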

Remove whitespace

Remove excess whitespace from the input text.

Normalize whitespace

Collapse each run of consecutive whitespace characters in the input text into a single space.
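
A minimal sketch of this kind of normalization, assuming that tabs and newlines count as whitespace and that leading and trailing whitespace is trimmed too. The name `normalize_whitespace_sketch` is hypothetical.

```python
import re


def normalize_whitespace_sketch(text):
    """Replace each run of whitespace with one space and trim both ends."""
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace_sketch("  Hello \t\n  world  "))
```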

Separate symbols

Separate symbols and words with a space to ease tokenization.
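
The effect can be approximated with a regular expression that pads every non-word, non-space character with spaces. This is only a sketch under that assumption; `separate_symbols_sketch` is a hypothetical name, and which characters the library treats as symbols is not stated here.

```python
import re


def separate_symbols_sketch(text):
    """Surround each symbol with spaces so a whitespace split isolates it."""
    # Assumption: a "symbol" is any character that is neither a word
    # character (letter, digit, underscore) nor whitespace.
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    return re.sub(r"\s+", " ", spaced).strip()


print(separate_symbols_sketch("Hello,world!"))
```

Note that under this definition apostrophes are split off as well, so "don't" becomes "don ' t".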

Remove special characters

Remove special characters from the input text.
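
What counts as a special character is not defined here. The sketch below assumes it means anything other than ASCII letters, digits, and whitespace; the name `remove_special_characters_sketch` is hypothetical.

```python
import re


def remove_special_characters_sketch(text):
    """Keep only ASCII letters, digits, and whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


print(remove_special_characters_sketch("Hello, World! #100%"))
```

Under this assumption accented letters are removed too, so a Unicode-aware pattern may be preferable for non-English text.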

Standardize text

Standardize the formatting of the input text.

Tokenize text

Tokenize the input text into individual words.
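
For illustration, a simple regex tokenizer is sketched below. It is not the library's tokenizer: `tokenize_sketch` is a hypothetical name, and emitting punctuation marks as separate tokens is an assumption of this example.

```python
import re


def tokenize_sketch(text):
    """Return word tokens, with each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize_sketch("Hello, world!"))
```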

Stem words

Stem the input words using the Porter stemming algorithm.

Lemmatize words

Lemmatize the input words using WordNet lemmatization.

POS tag

Perform part-of-speech (POS) tagging on the input text.

Remove profane words from text

This ensures that the text is clean and does not contain inappropriate language.
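
A word-list filter is one simple way to picture this. The sketch assumes a blocklist approach, which may differ from what the library does; `remove_profanity_sketch` and the mild placeholder `BLOCKLIST` are hypothetical.

```python
import re

# Hypothetical placeholder blocklist; a real one would be far larger.
BLOCKLIST = {"darn", "heck"}


def remove_profanity_sketch(text, blocklist=BLOCKLIST):
    """Delete whole-word, case-insensitive matches of blocklisted words."""
    alternatives = "|".join(re.escape(word) for word in sorted(blocklist))
    pattern = re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)
    cleaned = pattern.sub("", text)
    # Tidy the gaps left where words were removed.
    return re.sub(r"\s+", " ", cleaned).strip()


print(remove_profanity_sketch("Well darn that is a heck of a day"))
```

The word boundaries (`\b`) keep the filter from cutting pieces out of longer, harmless words.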

Remove sensitive information from text

This can be useful for anonymizing text data.
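
As a sketch of the general technique, the example below masks email addresses and phone-like numbers with regular expressions. The patterns, the `[EMAIL]` and `[PHONE]` placeholders, and the name `remove_sensitive_info_sketch` are all assumptions for illustration; which kinds of information the library detects is not stated here.

```python
import re

# Deliberately simple, hypothetical patterns; real detection needs more care.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def remove_sensitive_info_sketch(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


print(remove_sensitive_info_sketch(
    "Mail jane.doe@example.com or call +1 555-123-4567."
))
```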

Remove hate speech from text using AI

This function removes whole sentences rather than individual words, because whether language is hateful depends on its context.
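
The sentence-level behavior can be sketched as follows: split the text into sentences, ask a classifier about each one, and keep only those it does not flag. Everything named here is hypothetical. In particular, `toy_classifier` is a keyword check that exists only to make the sketch runnable; it stands in for a real model and is not how the library decides.

```python
import re


def remove_flagged_sentences_sketch(text, is_hateful):
    """Drop every sentence for which the classifier callback returns True."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(s for s in sentences if s and not is_hateful(s))


def toy_classifier(sentence):
    """Placeholder for a real model; flags sentences containing 'hate'."""
    return "hate" in sentence.lower()


print(remove_flagged_sentences_sketch(
    "I like tea. I hate group X. See you soon.", toy_classifier
))
```

Passing the classifier in as a callback keeps the splitting logic independent of whichever model makes the decision.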

Post-format text using regex

This function cleans up the formatting of text after DupliPy's augmentation or other processing.
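
The exact rules applied are not listed here. As an assumption, the sketch below shows two typical post-formatting fixes: collapsing repeated whitespace and removing the space that processing steps often leave before punctuation. The name `post_format_sketch` is hypothetical.

```python
import re


def post_format_sketch(text):
    """Collapse whitespace and reattach punctuation to the preceding word."""
    text = re.sub(r"\s+", " ", text).strip()
    # Remove the space before closing punctuation such as "," or "!".
    return re.sub(r"\s+([.,!?;:])", r"\1", text)


print(post_format_sketch("Hello , world !  How are you ?"))
```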
