That's a massive request! Generating 100 specific, unique, and universally applicable command prompts for data analysis is challenging because the exact commands depend heavily on the software, programming language, and environment being used (e.g., Python/pandas, R, SQL, MATLAB, a statistical package such as SPSS or SAS, or command-line tools such as awk and sed).
However, I can provide 100 distinct conceptual tasks or operations common in data analysis, phrased as prompts you would translate into a specific command in your chosen environment.
Here are 100 Conceptual Command Prompts for Data Analysis, broken down by common analytical phases:
Data Acquisition & Inspection (1-20)
These focus on loading data and getting an initial look at its structure; a short pandas sketch follows the list.
1. Load the primary dataset named sales_data.csv into a dataframe.
2. Display the first 10 rows of the dataframe.
3. Show the last 5 rows of the dataframe.
4. Print the total number of rows and columns (the shape/dimensions).
5. Summarize the data types (schema) of all columns.
6. List all the column names in the dataset.
7. Calculate the memory usage of the dataframe.
8. Count the total number of non-missing values per column.
9. Display a quick statistical summary (mean, std, min, max, quartiles) for all numerical columns.
10. Check for the presence of duplicate rows.
11. Display the unique values and their counts for the region column.
12. Read a dataset from a SQL query using a connection string.
13. Import data from an Excel file, specifically from the sheet named "Q3_Results".
14. Inspect the date range of the transaction_date column.
15. Convert the entire dataframe to a dictionary of records.
16. Show the metadata or header information of the underlying data file.
17. Preview the raw text content of the first few lines of a log file.
18. Establish a live connection to a cloud database (e.g., BigQuery, S3).
19. List all files in the current working directory that end with .json.
20. Assign a new index to the dataframe starting from 1.
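To make this concrete, here is a minimal pandas sketch of how a few of these inspection prompts might translate into actual commands. The file name sales_data.csv and the region column are the hypothetical examples used in the prompts above; substitute your own.

```python
import pandas as pd

# Prompt 1: load the primary dataset into a dataframe
df = pd.read_csv("sales_data.csv")

# Prompts 2-4: first 10 rows, last 5 rows, and the shape
print(df.head(10))
print(df.tail(5))
print(df.shape)

# Prompt 5: data types (schema) of all columns
print(df.dtypes)

# Prompt 9: statistical summary of all numerical columns
print(df.describe())

# Prompt 11: unique values and their counts for the region column
print(df["region"].value_counts())
```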
Data Cleaning & Preparation (21-40)
These focus on handling missing values, duplicates, and converting data types; a pandas sketch follows the list.
21. Drop all rows that contain any missing values (NaN).
22. Fill missing values in the customer_age column with the mean age.
23. Replace all instances of the string 'N/A' with actual missing values (NaN).
24. Drop the customer_id column as it's not needed for analysis.
25. Remove any fully duplicated rows.
26. Convert the price column from a string to a float data type.
27. Standardize the case of the product_name column to lowercase.
28. Extract the year from the order_date column into a new column.
29. Split the full_name column into two new columns: first_name and last_name.
30. Remove leading/trailing whitespace from the category column.
31. Filter the data to keep only transactions where status is 'Completed'.
32. Identify and count any outliers in the revenue column using the IQR method.
33. Apply a log transformation to the sales_volume column.
34. Recode the values in the gender column ('M', 'F') to (0, 1).
35. One-hot encode the marital_status categorical column.
36. Bin the continuous age column into 5 equal-width bins.
37. Standardize (Z-score normalize) the income column.
38. Impute missing values in the city column using the most frequent (mode) city.
39. Validate that the sum of cost and profit equals the price for every row.
40. Rename the column txn_id to transaction_identifier.
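As before, here is a hedged pandas sketch of a few of the cleaning prompts. The column names (customer_age, price, full_name, txn_id) are the hypothetical ones from the list; the exact calls will differ in R, SQL, or another environment.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("sales_data.csv")  # hypothetical input file

# Prompt 23: replace the string 'N/A' with real missing values
df = df.replace("N/A", np.nan)

# Prompt 22: fill missing customer_age values with the mean age
df["customer_age"] = df["customer_age"].fillna(df["customer_age"].mean())

# Prompt 26: convert price from string to float
df["price"] = df["price"].astype(float)

# Prompt 29: split full_name into first_name and last_name
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Prompt 40: rename txn_id to transaction_identifier
df = df.rename(columns={"txn_id": "transaction_identifier"})
```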
Data Exploration & Manipulation (41-70)
These focus on slicing, aggregating, pivoting, and exploring relationships; a pandas sketch follows the list.
41. Filter the data for transactions in 'California' OR 'Texas'.
42. Select only the date, product, and quantity columns.
43. Sort the entire dataset by transaction_amount in descending order.
44. Calculate the total sales across the entire dataset.
45. Find the average rating for the product 'X-2000'.
46. Group the data by category and calculate the mean price for each category.
47. Find the maximum profit achieved by any single transaction.
48. Count the number of unique customers.
49. Create a cross-tabulation (contingency table) of region and product_type.
50. Merge the current dataframe with a customer_details dataframe using customer_id as the key (left join).
51. Append a new dataset (Q4_data) to the bottom of the current dataframe.
52. Pivot the data to show product_type in the index, year in the columns, and the sum of sales as the values.
53. Calculate the percent difference in sales between the current year and the previous year.
54. Identify the top 5 products by total revenue.
55. Calculate a rolling 7-day average of the daily_visitors column.
56. Apply a custom function to clean text data in the notes column.
57. Sample 10% of the data randomly.
58. Shift the stock_price column by one row to enable comparison with the next day's price.
59. Generate a cumulative sum of the daily_views column.
60. Group by country and return the name of the city with the highest sales within each country.
61. Calculate the interquartile range (IQR) for the delivery_time column.
62. Compute the correlation matrix for all numerical variables.
63. Create a new column that categorizes transactions as 'High Value' (≥ $500) or 'Standard' (< $500).
64. Select rows where product_type is 'Electronics' AND quantity is greater than 10.
65. Calculate the mode of the payment_method column.
66. Rank the products based on their total profit.
67. Calculate the variance of the inventory_level column.
68. Compute the coefficient of variation for the monthly_expense column.
69. Filter the data to exclude the category 'Returns'.
70. Perform a full outer join between the main dataset and a supplier_info table.
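A pandas sketch of a few exploration prompts follows. Again, the column names are the hypothetical ones from the list, and these one-liners are illustrations rather than the only correct translations.

```python
import pandas as pd

df = pd.read_csv("sales_data.csv")  # hypothetical input file

# Prompt 46: mean price per category
mean_price = df.groupby("category")["price"].mean()

# Prompt 52: pivot with product_type as the index, year as the
# columns, and summed sales as the values
pivot = df.pivot_table(index="product_type", columns="year",
                       values="sales", aggfunc="sum")

# Prompt 55: rolling 7-day average of daily visitors
df["visitors_7d_avg"] = df["daily_visitors"].rolling(window=7).mean()

# Prompt 62: correlation matrix for all numerical variables
corr = df.select_dtypes(include="number").corr()

# Prompt 64: Electronics transactions with quantity above 10
subset = df[(df["product_type"] == "Electronics") & (df["quantity"] > 10)]
```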
Statistical Modeling & Visualization (71-100)
These focus on advanced analysis, model preparation, and output; a scikit-learn/matplotlib sketch follows the list.
71. Run a simple linear regression with sales as the dependent variable and advertising_spend as the independent variable.
72. Fit a k-means clustering model (with a chosen number of clusters k) to the feature data.
73. Calculate the p-value from a chi-squared test between gender and purchase_decision.
74. Perform an ANOVA test to compare the mean scores across three different treatment groups.
75. Split the dataset into training (80%) and testing (20%) sets.
76. Calculate the precision, recall, and F1-score for a classification model.
77. Visualize the distribution of the income column using a histogram.
78. Generate a scatter plot of price vs. rating.
79. Create a box plot to show the distribution of salary across different departments.
80. Plot a time series chart of daily_visitors.
81. Display the coefficients and intercept of the fitted regression model.
82. Compute the eigenvalues and eigenvectors of a covariance matrix for Principal Component Analysis (PCA).
83. Plot a bar chart showing the total quantity sold per country.
84. Save the resulting cleaned dataframe to a new CSV file named cleaned_data.csv.
85. Export the summary statistics of the model to a text file.
86. Generate a heatmap of the correlation matrix.
87. Calculate the root mean squared error (RMSE) of the model predictions.
88. Perform a Grubbs' test to statistically identify outliers.
89. Save the resulting visualization (e.g., the bar chart) as a PNG image file.
90. Calculate the log-likelihood of a fitted model.
91. Perform feature selection using Recursive Feature Elimination (RFE).
92. Apply a MinMaxScaler to the numerical features.
93. Generate a k-fold cross-validation split.
94. Display the ROC curve for a binary classification model.
95. Serialize (pickle) the trained machine learning model for later use.
96. Compute the variance inflation factor (VIF) for model variables to check for multicollinearity.
97. Execute a t-test to compare the mean response_time of two different server groups.
98. Aggregate the data by week and find the sum of sales for each week.
99. Save the output of a lengthy group-by calculation to an intermediate file (e.g., HDF5).
100. Print the memory address of the current dataframe object (for debugging/performance).
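Finally, a scikit-learn/matplotlib sketch of a few modeling and output prompts. It assumes a numeric dataframe loaded from the hypothetical cleaned_data.csv produced in prompt 84; it is one possible translation, not the canonical one.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("cleaned_data.csv")  # hypothetical cleaned input

# Prompt 75: split into 80% training and 20% testing sets
X, y = df[["advertising_spend"]], df["sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Prompt 71: simple linear regression of sales on advertising_spend
model = LinearRegression().fit(X_train, y_train)

# Prompt 81: coefficients and intercept of the fitted model
print(model.coef_, model.intercept_)

# Prompt 87: RMSE of the model predictions on the test set
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(rmse)

# Prompts 77 and 89: histogram of income, saved as a PNG file
df["income"].plot(kind="hist")
plt.savefig("income_histogram.png")
```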