Formatting Functions
Available functions:
- `remove_stopwords(text)`: Remove stopwords from the input text using NLTK's stopwords.
- `remove_numbers(text)`: Remove numbers from the input text.
- `remove_whitespace(text)`: Remove excess whitespace from the input text.
- `normalize_whitespace(text)`: Normalize multiple whitespaces into a single whitespace in the input text.
- `seperate_symbols(text)`: Separate symbols and words with a space to ease tokenization.
- `remove_special_characters(text)`: Remove special characters from the input text.
- `standardize_text(text)`: Standardize the formatting of the input text.
- `tokenize_text(text)`: Tokenize the input text into individual words.
- `stem_words(words)`: Stem the input words using the Porter stemming algorithm.
- `lemmatize_words(words)`: Lemmatize the input words using WordNet lemmatization.
- `pos_tag(text)`: Perform part-of-speech (POS) tagging on the input text.
- `remove_profanity_from_text(text)`: Remove profane words from the input text.
- `remove_sensitive_info_from_text(text)`: Remove sensitive information from the input text.
- `remove_hate_speech_from_text(text)`: Remove hate speech or offensive speech from the input text.
- `post_format_text(text)`: Post-format the text using regex.
Remove stopwords
Remove stopwords from the input text using NLTK's stopwords.
Parameters:
- `text` (str): The input text from which stopwords should be removed.
Returns:
- `str`: The text without stopwords.
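As a rough illustration of the idea, the sketch below filters tokens against a stopword set. The small hard-coded `STOPWORDS` set and the name `remove_stopwords_sketch` are assumptions made for this example; the documented function uses NLTK's stopword list, and its exact tokenization may differ.

```python
# Illustrative stand-in for NLTK's English stopword list (assumption).
STOPWORDS = {"a", "an", "the", "is", "on", "of", "and", "in", "to"}


def remove_stopwords_sketch(text: str) -> str:
    """Drop whitespace-separated tokens found in STOPWORDS (case-insensitive)."""
    kept = [word for word in text.split() if word.lower() not in STOPWORDS]
    return " ".join(kept)


print(remove_stopwords_sketch("The cat is on the mat"))  # cat mat
```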
Remove numbers
Remove numbers from the input text.
Parameters:
- `text` (str): The input text from which numbers should be removed.
Returns:
- `str`: The text without numbers.
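One plausible way to get this behavior is a single regular-expression substitution. The sketch below is an assumption about the approach, not the library's actual code, and `remove_numbers_sketch` is a hypothetical name. Note that it leaves the surrounding spaces in place, so a whitespace cleanup step may be needed afterwards.

```python
import re


def remove_numbers_sketch(text: str) -> str:
    """Delete every run of digits; surrounding whitespace is left untouched."""
    return re.sub(r"\d+", "", text)


print(repr(remove_numbers_sketch("Room 101 has 2 doors")))  # 'Room  has  doors'
```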
Remove whitespace
Remove excess whitespace from the input text.
Parameters:
- `text` (str): The input text from which excess whitespace should be removed.
Returns:
- `str`: The text with excess whitespace removed.
Normalize whitespace
Normalize multiple whitespaces into a single whitespace in the input text.
Parameters:
- `text` (str): The input text in which whitespace should be normalized.
Returns:
- `str`: The text with normalized whitespace.
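A minimal sketch of whitespace normalization, assuming that "whitespace" covers spaces, tabs, and newlines and that each run is replaced by one space. The name `normalize_whitespace_sketch` is hypothetical, and the library may treat leading or trailing whitespace differently.

```python
import re


def normalize_whitespace_sketch(text: str) -> str:
    """Replace each run of whitespace characters with a single space."""
    return re.sub(r"\s+", " ", text)


print(normalize_whitespace_sketch("a  b\t\tc\n d"))  # a b c d
```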
Separate symbols
Separate symbols and words with a space to ease tokenization.
Parameters:
- `text` (str): The input text in which symbols need to be separated from words.
Returns:
- `str`: The text with symbols separated from words by spaces.
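To show what separating symbols can look like in practice, the sketch below pads every character that is neither a word character nor whitespace with spaces. This is an assumed approach under a hypothetical name (`separate_symbols_sketch`); the library's definition of "symbol" may be narrower or broader.

```python
import re


def separate_symbols_sketch(text: str) -> str:
    """Surround each symbol with spaces, then collapse repeated spaces."""
    # Treat any character that is not a word character or whitespace as a symbol.
    padded = re.sub(r"([^\w\s])", r" \1 ", text)
    return re.sub(r"\s+", " ", padded).strip()


print(separate_symbols_sketch("Hello,world!"))  # Hello , world !
```

With symbols isolated this way, a plain `str.split()` already yields words and symbols as separate tokens.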
Remove special characters
Remove special characters from the input text.
Parameters:
- `text` (str): The input text from which special characters should be removed.
Returns:
- `str`: The text with special characters removed.
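The sketch below illustrates one possible reading of "special characters": anything other than ASCII letters, digits, and whitespace. That definition and the name `remove_special_characters_sketch` are assumptions; the library may keep accented letters or some punctuation.

```python
import re


def remove_special_characters_sketch(text: str) -> str:
    """Keep only ASCII letters, digits, and whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


print(remove_special_characters_sketch("Hello, world! #2024"))  # Hello world 2024
```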
Standardize text
Standardize the formatting of the input text.
Parameters:
- `text` (str): The input text which needs to be standardized.
Returns:
- `str`: The standardized text.
Tokenize text
Tokenize the input text into individual words.
Parameters:
- `text` (str): The input text to be tokenized.
Returns:
- `list`: A list of tokens (words) from the input text.
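For readers unfamiliar with tokenization, here is a small regex-based sketch that returns words and punctuation marks as separate tokens. It is a simplified stand-in under a hypothetical name (`tokenize_sketch`); the documented function may rely on a different tokenizer and can produce different tokens for contractions or hyphenated words.

```python
import re


def tokenize_sketch(text: str) -> list[str]:
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


print(tokenize_sketch("Hello, world!"))  # ['Hello', ',', 'world', '!']
```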
Stem words
Stem the input words using the Porter stemming algorithm.
Parameters:
- `words` (list): A list of words to be stemmed.
Returns:
- `list`: A list of stemmed words.
Lemmatize words
Lemmatize the input words using WordNet lemmatization.
Parameters:
- `words` (list): A list of words to be lemmatized.
Returns:
- `list`: A list of lemmatized words.
POS tag
Perform part-of-speech (POS) tagging on the input text.
Parameters:
- `text` (str): The input text to be POS tagged.
Returns:
- `list`: A list of tuples containing (word, tag) pairs.
Remove profane words from text
This ensures that the text is clean and does not contain inappropriate language.
Parameters:
- `text` (str): The input text to remove profanity from.
Returns:
- `str`: The cleaned output text.
Remove sensitive information from text
This can be useful for depersonalization of text data.
Parameters:
- `text` (str): The input text to remove sensitive information from.
Returns:
- `str`: The cleaned output text.
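The page does not say which kinds of sensitive information are detected or how. As a hedged illustration of depersonalization only, the sketch below masks email addresses with a placeholder. The pattern, the `[EMAIL]` placeholder, and the name `redact_emails_sketch` are all assumptions for this example, and the library may remove matches outright or cover other data types.

```python
import re

# Simplified email pattern for illustration; it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails_sketch(text: str) -> str:
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_PATTERN.sub("[EMAIL]", text)


print(redact_emails_sketch("Contact jane.doe@example.com now"))  # Contact [EMAIL] now
```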
Remove hate speech from text using AI
This function removes entire sentences rather than individual words, because whether speech is hateful or offensive depends on its context.
Parameters:
- `text` (str): The input text to remove hate speech and offensive speech from.
Returns:
- `str`: The cleaned output text.
Post-format text using regex
This function post-formats the text after DupliPy's augmentation or other processes.
Parameters:
- `text` (str): The input text to be post-formatted.
Returns:
- `str`: The post-formatted text.
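The exact rules applied during post-formatting are not listed here. The sketch below shows the kind of regex cleanup such a step might perform, namely collapsing whitespace and removing the space left before punctuation after tokens have been rejoined. Both rules and the name `post_format_sketch` are assumptions, not the library's implementation.

```python
import re


def post_format_sketch(text: str) -> str:
    """Collapse whitespace and remove spaces that precede punctuation."""
    text = re.sub(r"\s+", " ", text).strip()
    return re.sub(r"\s+([.,!?;:])", r"\1", text)


print(post_format_sketch("Hello , world !  How are you ?"))  # Hello, world! How are you?
```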