Here is a list of 100 command prompts for text analysis, categorized by the typical workflow of a Natural Language Processing (NLP) project.
1. Data Collection & Ingestion
1. Scrape [website/URL] for all text content in [HTML tag].
2. Fetch the top [N] tweets containing the [keyword or hashtag].
3. Load all .txt files from [directory] into a single corpus (see the sketch after this list).
4. Parse the [JSON/XML] file to extract the [text field] from each record.
5. Query the [SQL database] to retrieve all entries from the [customer_feedback] column.
6. Transcribe the [audio_file.mp3] using a Speech-to-Text API.
7. Read and extract all text from the [document.pdf] file.
8. Download the [N] most recent articles from the [RSS feed].
9. Connect to the [API endpoint] and retrieve text data.
10. Concatenate [file_A.csv] and [file_B.csv] into a single dataset.
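As an illustration of prompt 3, here is a minimal Python sketch that reads every .txt file in a directory into a list of document strings. The directory name "data" and the function name load_corpus are placeholders, not part of any particular library.

```python
from pathlib import Path

def load_corpus(directory: str) -> list[str]:
    """Read every .txt file under `directory` into a list of documents."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):  # sorted for a deterministic order
        docs.append(path.read_text(encoding="utf-8"))
    return docs

corpus = load_corpus("data")  # "data" is a hypothetical directory
print(f"Loaded {len(corpus)} documents")
```

The other ingestion prompts follow the same pattern: swap the file-reading loop for a database cursor, an HTTP client, or a PDF parser, while keeping the output shape (a list of raw document strings) constant.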
2. Basic Preprocessing & Cleaning
11. Convert all text in the corpus to lowercase (a pipeline sketch combining several of these steps follows this list).
12. Remove all punctuation (e.g., '!', '.', '?') from the text.
13. Remove all numerical digits from the text.
14. Strip all HTML/XML tags from the raw text.
15. Expand all contractions (e.g., "don't" -> "do not", "we'll" -> "we will").
16. Remove all URLs and email addresses using regex.
17. Standardize all whitespace (remove extra spaces, tabs, and newlines).
18. Identify and remove all stopwords (e.g., 'the', 'is', 'a') using the [language] list.
19. Perform stemming on all words using the [Porter/Snowball] stemmer.
20. Perform lemmatization on all words (e.g., "running" -> "run") using WordNet.
21. Correct common misspellings using a [spell-checking library/custom dictionary].
22. Remove all non-ASCII or special characters.
23. Normalize Unicode characters (e.g., NFD, NFKC).
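A minimal sketch chaining prompts 11, 12, 13, 16, 17, 18, and 20 into a single cleaning function, assuming NLTK is installed with the 'stopwords' and 'wordnet' data packages downloaded (nltk.download('stopwords'); nltk.download('wordnet')). Note that the step order matters: URLs are removed before punctuation is stripped, otherwise the mangled URLs would leak word fragments into the token stream.

```python
import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))  # prompt 18: English stopword list
LEMMATIZER = WordNetLemmatizer()

def clean(text: str) -> list[str]:
    text = text.lower()                                # prompt 11: lowercase
    text = re.sub(r"https?://\S+|\S+@\S+", " ", text)  # prompt 16: URLs and emails
    text = text.translate(str.maketrans("", "", string.punctuation))  # prompt 12
    text = re.sub(r"\d+", " ", text)                   # prompt 13: digits
    tokens = text.split()                              # prompt 17: split() collapses all whitespace
    tokens = [t for t in tokens if t not in STOPWORDS] # prompt 18: drop stopwords
    # Prompt 20: WordNet lemmatization. Without part-of-speech tags this
    # defaults to noun lemmas; pass pos="v" per token for verbs like "running".
    return [LEMMATIZER.lemmatize(t) for t in tokens]

print(clean("Visit https://example.com for 3 FREE samples!!!"))
# -> ['visit', 'free', 'sample']
```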
3. Exploratory Text Analysis (ETA)
24. Tokenize the corpus into individual words (word tokens).
25. Tokenize the corpus into individual sentences (sentence tokens).
26. Calculate the total word count for the entire corpus.
27. Calculate the unique word count (vocabulary size).
28. Calculate the lexical diversity (unique words / total words).
29. Calculate the average sentence length (in words).
30. Calculate the average word length (in characters).
31. Generate a frequency distribution for the top [N] most common words (see the sketch after this list).
32. Plot the word frequency distribution on a log-log scale (Zipf's Law).
33. Generate a bar chart of the top [N] most frequent n-grams (bigrams/trigrams).
34. Create a word cloud from the corpus.
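A minimal sketch of prompts 24, 26, 27, 28, and 31, assuming plain whitespace tokenization over an already cleaned corpus (a list of document strings); a real project would likely substitute nltk.word_tokenize or a spaCy pipeline for the split() call.

```python
from collections import Counter

def explore(corpus: list[str], top_n: int = 10) -> None:
    tokens = [tok for doc in corpus for tok in doc.split()]  # prompt 24: word tokens
    total = len(tokens)                                      # prompt 26: total word count
    vocab = len(set(tokens))                                 # prompt 27: vocabulary size
    print(f"Total words:       {total}")
    print(f"Vocabulary size:   {vocab}")
    print(f"Lexical diversity: {vocab / total:.3f}")         # prompt 28
    for word, count in Counter(tokens).most_common(top_n):   # prompt 31: top-N frequencies
        print(f"{word:>12} {count}")

explore(["the cat sat on the mat", "the dog sat on the log"])
```

The same Counter output feeds prompts 32 and 33 directly: plot rank against frequency on log-log axes for the Zipf check, or count n-gram tuples (e.g., zip(tokens, tokens[1:]) for bigrams) instead of single tokens for the n-gram bar chart.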