Here is a list of 100 command prompts for text analysis, categorized by the typical workflow of a Natural Language Processing (NLP) project.
1. Data Collection & Ingestion
1. Scrape [website/URL] for all text content in [HTML tag].
2. Fetch the top [N] tweets containing the [keyword or hashtag].
3. Load all .txt files from [directory] into a single corpus (see the sketch after this list).
4. Parse the [JSON/XML] file to extract the [text field] from each record.
5. Query the [SQL database] to retrieve all entries from the [customer_feedback] column.
6. Transcribe the [audio_file.mp3] using a Speech-to-Text API.
7. Read and extract all text from the [document.pdf] file.
8. Download the [N] most recent articles from the [RSS feed].
9. Connect to the [API endpoint] and retrieve text data.
10. Concatenate [file_A.csv] and [file_B.csv] into a single dataset.
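As an illustration of prompt 3, here is a minimal Python sketch that reads every .txt file in a directory into a list of document strings. The directory name "data" and the function name load_corpus are placeholders, not part of any particular library.

```python
from pathlib import Path

def load_corpus(directory: str) -> list[str]:
    """Read every .txt file under `directory` into a list of documents."""
    docs = []
    for path in sorted(Path(directory).glob("*.txt")):  # sorted for a deterministic order
        docs.append(path.read_text(encoding="utf-8"))
    return docs

corpus = load_corpus("data")  # "data" is a hypothetical directory
print(f"Loaded {len(corpus)} documents")
```

The other ingestion prompts follow the same pattern: swap the file-reading loop for a database cursor, an HTTP client, or a PDF parser, while keeping the output shape (a list of raw document strings) constant.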
2. Basic Preprocessing & Cleaning
11. Convert all text in the corpus to lowercase (a pipeline sketch combining several of these steps follows this list).
12. Remove all punctuation (e.g., '!', '.', '?') from the text.
13. Remove all numerical digits from the text.
14. Strip all HTML/XML tags from the raw text.
15. Expand all contractions (e.g., "don't" -> "do not", "we'll" -> "we will").
16. Remove all URLs and email addresses using regex.
17. Standardize all whitespace (remove extra spaces, tabs, and newlines).
18. Identify and remove all stopwords (e.g., 'the', 'is', 'a') using the [language] list.
19. Perform stemming on all words using the [Porter/Snowball] stemmer.
20. Perform lemmatization on all words (e.g., "running" -> "run") using WordNet.
21. Correct common misspellings using a [spell-checking library/custom dictionary].
22. Remove all non-ASCII or special characters.
23. Normalize Unicode characters (e.g., NFD, NFKC).
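A minimal sketch chaining prompts 11, 12, 13, 16, 17, 18, and 20 into a single cleaning function, assuming NLTK is installed with the 'stopwords' and 'wordnet' data packages downloaded (nltk.download('stopwords'); nltk.download('wordnet')). Note that the step order matters: URLs are removed before punctuation is stripped, otherwise the mangled URLs would leak word fragments into the token stream.

```python
import re
import string

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

STOPWORDS = set(stopwords.words("english"))  # prompt 18: English stopword list
LEMMATIZER = WordNetLemmatizer()

def clean(text: str) -> list[str]:
    text = text.lower()                                # prompt 11: lowercase
    text = re.sub(r"https?://\S+|\S+@\S+", " ", text)  # prompt 16: URLs and emails
    text = text.translate(str.maketrans("", "", string.punctuation))  # prompt 12
    text = re.sub(r"\d+", " ", text)                   # prompt 13: digits
    tokens = text.split()                              # prompt 17: split() collapses all whitespace
    tokens = [t for t in tokens if t not in STOPWORDS] # prompt 18: drop stopwords
    # Prompt 20: WordNet lemmatization. Without part-of-speech tags this
    # defaults to noun lemmas; pass pos="v" per token for verbs like "running".
    return [LEMMATIZER.lemmatize(t) for t in tokens]

print(clean("Visit https://example.com for 3 FREE samples!!!"))
# -> ['visit', 'free', 'sample']
```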
3. Exploratory Text Analysis (ETA)
24. Tokenize the corpus into individual words (word tokens).
25. Tokenize the corpus into individual sentences (sentence tokens).
26. Calculate the total word count for the entire corpus.
27. Calculate the unique word count (vocabulary size).
28. Calculate the lexical diversity (unique words / total words).
29. Calculate the average sentence length (in words).
30. Calculate the average word length (in characters).
31. Generate a frequency distribution for the top [N] most common words (see the sketch after this list).
32. Plot the word frequency distribution on a log-log scale (Zipf's Law).
33. Generate a bar chart of the top [N] most frequent n-grams (bigrams/trigrams).
34. Create a word cloud from the corpus.
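A minimal sketch of prompts 24, 26, 27, 28, and 31, assuming plain whitespace tokenization over an already cleaned corpus (a list of document strings); a real project would likely substitute nltk.word_tokenize or a spaCy pipeline for the split() call.

```python
from collections import Counter

def explore(corpus: list[str], top_n: int = 10) -> None:
    tokens = [tok for doc in corpus for tok in doc.split()]  # prompt 24: word tokens
    total = len(tokens)                                      # prompt 26: total word count
    vocab = len(set(tokens))                                 # prompt 27: vocabulary size
    print(f"Total words:       {total}")
    print(f"Vocabulary size:   {vocab}")
    print(f"Lexical diversity: {vocab / total:.3f}")         # prompt 28
    for word, count in Counter(tokens).most_common(top_n):   # prompt 31: top-N frequencies
        print(f"{word:>12} {count}")

explore(["the cat sat on the mat", "the dog sat on the log"])
```

The same Counter output feeds prompts 32 and 33 directly: plot rank against frequency on log-log axes for the Zipf check, or count n-gram tuples (e.g., zip(tokens, tokens[1:]) for bigrams) instead of single tokens for the n-gram bar chart.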