Formatting Functions

Available functions:

  • remove_stopwords(text): Remove stopwords from the input text using NLTK's stopwords.

  • remove_numbers(text): Remove numbers from the input text.

  • remove_whitespace(text): Remove excess whitespace from the input text.

  • normalize_whitespace(text): Collapse each run of consecutive whitespace in the input text into a single whitespace character.

  • seperate_symbols(text): Separate symbols and words with a space to ease tokenization.

  • remove_special_characters(text): Remove special characters from the input text.

  • standardize_text(text): Standardize the formatting of the input text.

  • tokenize_text(text): Tokenize the input text into individual words.

  • stem_words(words): Stem the input words using the Porter stemming algorithm.

  • lemmatize_words(words): Lemmatize the input words using WordNet lemmatization.

  • pos_tag(text): Perform part-of-speech (POS) tagging on the input text.

  • remove_profanity_from_text(text): Remove profane words from the input text.

  • remove_sensitive_info_from_text(text): Remove sensitive information from the input text.

  • remove_hate_speech_from_text(text): Remove hate speech or offensive speech from the input text.


Remove stopwords

Remove stopwords from the input text using NLTK's stopwords.

Parameters:
- `text` (str): The input text from which stopwords should be removed.

Returns:
- `str`: The text without stopwords.
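
The idea can be sketched as follows. This is a minimal illustration, not the library's implementation: it assumes whitespace tokenization, and the small `STOPWORDS` set is a hand-written stand-in for NLTK's English list (normally obtained with `nltk.corpus.stopwords.words("english")` once the corpus is downloaded). The name `remove_stopwords_sketch` is hypothetical.

```python
# Tiny stand-in for NLTK's English stopword list (assumption for illustration).
STOPWORDS = {"a", "an", "the", "is", "on", "of", "and", "to", "in"}


def remove_stopwords_sketch(text: str) -> str:
    """Drop every whitespace-separated token whose lowercase form is a stopword."""
    kept = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(kept)


print(remove_stopwords_sketch("The cat is on the mat"))  # cat mat
```

Comparing the lowercase form means that capitalized stopwords such as "The" are also removed.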

Remove numbers

Remove numbers from the input text.

Parameters:
- `text` (str): The input text from which numbers should be removed.

Returns:
- `str`: The text without numbers.
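
One plausible way to do this is a single regular-expression substitution. The sketch below assumes that "numbers" means runs of digits; whether the library also handles decimals, signs, or spelled-out numbers is not stated here. The name `remove_numbers_sketch` is hypothetical.

```python
import re


def remove_numbers_sketch(text: str) -> str:
    """Delete every run of digits; surrounding whitespace is left untouched."""
    return re.sub(r"\d+", "", text)


print(repr(remove_numbers_sketch("Room 101 has 2 doors")))  # 'Room  has  doors'
```

Note that deleting digits leaves doubled spaces behind, which is one reason to run whitespace normalization afterwards.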

Remove whitespace

Remove excess whitespace from the input text.

Parameters:
- `text` (str): The input text from which excess whitespace should be removed.

Returns:
- `str`: The text with excess whitespace removed.

Normalize whitespace

Collapse each run of consecutive whitespace in the input text into a single whitespace character.

Parameters:
- `text` (str): The input text in which whitespace should be normalized.

Returns:
- `str`: The text with normalized whitespace.
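
A minimal sketch of this behavior, assuming that tabs and newlines count as whitespace and that each run is replaced by one space. The sketch does not trim leading or trailing whitespace, and the name `normalize_whitespace_sketch` is hypothetical.

```python
import re


def normalize_whitespace_sketch(text: str) -> str:
    """Replace each run of whitespace (spaces, tabs, newlines) with one space."""
    return re.sub(r"\s+", " ", text)


print(repr(normalize_whitespace_sketch("a  b\t\tc\n\nd")))  # 'a b c d'
```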

Separate symbols

Separate symbols and words with a space to ease tokenization.

Parameters:
- `text` (str): The input text in which symbols should be separated from words.

Returns:
- `str`: The text with symbols separated from words by spaces.
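
To show why this eases tokenization, here is a minimal sketch. It assumes a "symbol" is any character that is neither a word character nor whitespace, which may differ from the library's definition. The name `separate_symbols_sketch` is hypothetical.

```python
import re


def separate_symbols_sketch(text: str) -> str:
    """Surround each non-word, non-whitespace character with spaces."""
    spaced = re.sub(r"([^\w\s])", r" \1 ", text)
    # Collapse the doubled spaces introduced by the padding step.
    return re.sub(r"\s+", " ", spaced).strip()


print(separate_symbols_sketch("Hello,world!(test)"))  # Hello , world ! ( test )
```

After this step, a plain `split()` yields words and symbols as separate tokens.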

Remove special characters

Remove special characters from the input text.

Parameters:
- `text` (str): The input text from which special characters should be removed.

Returns:
- `str`: The text with special characters removed.
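
Which characters count as "special" is not defined above. The sketch below assumes everything other than ASCII letters, digits, and whitespace is special; the name `remove_special_characters_sketch` is hypothetical.

```python
import re


def remove_special_characters_sketch(text: str) -> str:
    """Keep only ASCII letters, digits, and whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


print(repr(remove_special_characters_sketch("Hi! #NLP @ 100%")))  # 'Hi NLP  100'
```

Under this assumption accented letters are removed as well, so a Unicode-aware character class would be needed for non-English text.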

Standardize text

Standardize the formatting of the input text.

Parameters:
- `text` (str): The input text to be standardized.

Returns:
- `str`: The standardized text.

Tokenize text

Tokenize the input text into individual words.

Parameters:
- `text` (str): The input text to be tokenized.

Returns:
- `list`: A list of tokens (words) from the input text.
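
The tokenizer used by the library is not specified here. As a rough stand-in, the sketch below uses a regular expression that treats runs of word characters as tokens and each punctuation mark as its own token. The name `tokenize_text_sketch` is hypothetical.

```python
import re


def tokenize_text_sketch(text: str) -> list:
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize_text_sketch("Hello, world!"))  # ['Hello', ',', 'world', '!']
```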

Stem words

Stem the input words using the Porter stemming algorithm.

Parameters:
- `words` (list): A list of words to be stemmed.

Returns:
- `list`: A list of stemmed words.

Lemmatize words

Lemmatize the input words using WordNet lemmatization.

Parameters:
- `words` (list): A list of words to be lemmatized.

Returns:
- `list`: A list of lemmatized words.

POS tag

Perform part-of-speech (POS) tagging on the input text.

Parameters:
- `text` (str): The input text to be POS tagged.

Returns:
- `list`: A list of tuples containing (word, tag) pairs.

Remove profane words from text

Remove profane words from the input text so that the result does not contain inappropriate language.

Parameters:
- `text` (str): The input text to remove profanity from.

Returns:
- `str`: The cleaned output text.
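
How profanity is detected is not described above. One common approach is a word blocklist, sketched below under that assumption. `BLOCKLIST` holds mild placeholder words, and the name `remove_profanity_sketch` is hypothetical.

```python
import re

# Placeholder blocklist; a real list would be far longer (assumption).
BLOCKLIST = {"darn", "heck"}

# Match any blocklisted word as a whole word, ignoring case.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(word) for word in sorted(BLOCKLIST)) + r")\b",
    re.IGNORECASE,
)


def remove_profanity_sketch(text: str) -> str:
    """Delete blocklisted words, then tidy the leftover whitespace."""
    cleaned = _PATTERN.sub("", text)
    return re.sub(r"\s+", " ", cleaned).strip()


print(remove_profanity_sketch("What the heck is this darn thing"))
# What the is this thing
```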

Remove sensitive information from text

Remove sensitive information from the input text. This can be useful for depersonalizing text data.

Parameters:
- `text` (str): The input text to remove sensitive information from.

Returns:
- `str`: The cleaned output text.
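
What counts as sensitive information is not listed above. The sketch below assumes two common cases, email addresses and phone numbers, and removes them with deliberately simple patterns that will miss many real-world formats. The name `remove_sensitive_info_sketch` and both patterns are hypothetical.

```python
import re

# Simplified patterns for illustration only (assumptions, not exhaustive).
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def remove_sensitive_info_sketch(text: str) -> str:
    """Delete email addresses and phone numbers, then tidy whitespace."""
    text = EMAIL.sub("", text)
    text = PHONE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


print(remove_sensitive_info_sketch(
    "Mail jane.doe@example.com or call +1 555-123-4567 today"
))
# Mail or call today
```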

Remove hate speech from text using AI

Remove hate speech or offensive speech from the input text. This function removes whole sentences rather than individual words, because whether speech is offensive depends on its context.

Parameters:
- `text` (str): The input text to remove hate speech and offensive speech from.

Returns:
- `str`: The cleaned output text.
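
Sentence-level removal can be sketched as below. The model the library uses is not described here, so the classifier is passed in as a callable. `toy_classifier` is a trivial stand-in for a real model, and all names in the sketch are hypothetical.

```python
import re
from typing import Callable


def remove_flagged_sentences(text: str, is_offensive: Callable[[str], bool]) -> str:
    """Split text into sentences and drop those the classifier flags."""
    # Naive split: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(s for s in sentences if s and not is_offensive(s))


def toy_classifier(sentence: str) -> bool:
    """Stand-in for a real model: flags sentences containing a marker word."""
    return "awful" in sentence.lower()


print(remove_flagged_sentences(
    "I like tea. You are awful! See you soon.", toy_classifier
))
# I like tea. See you soon.
```

Passing the classifier as a parameter keeps the sentence filtering independent of whichever model makes the decision.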
