That's a massive request! Generating 100 specific, unique, and universally applicable command prompts for data analysis is challenging because the exact commands depend heavily on the software, programming language, and environment being used (e.g., Python/pandas, R, SQL, MATLAB, a statistical package such as SPSS or SAS, or command-line tools such as awk and sed).
However, I can provide 100 distinct conceptual tasks or operations common in data analysis, phrased as prompts you would translate into a specific command in your chosen environment.
Here are 100 Conceptual Command Prompts for Data Analysis, broken down by common analytical phases:
Data Acquisition & Inspection (1-20)
These focus on loading data and getting an initial look at its structure; a short pandas sketch follows the list.
1. Load the primary dataset named sales_data.csv into a dataframe.
2. Display the first 10 rows of the dataframe.
3. Show the last 5 rows of the dataframe.
4. Print the total number of rows and columns (the shape/dimensions).
5. Summarize the data types (schema) of all columns.
6. List all the column names in the dataset.
7. Calculate the memory usage of the dataframe.
8. Count the total number of non-missing values per column.
9. Display a quick statistical summary (mean, std, min, max, quartiles) for all numerical columns.
10. Check for the presence of duplicate rows.
11. Display the unique values and their counts for the region column.
12. Read a dataset from a SQL query using a connection string.
13. Import data from an Excel file, specifically from the sheet named "Q3_Results".
14. Inspect the date range of the transaction_date column.
15. Convert the entire dataframe to a dictionary of records.
16. Show the metadata or header information of the underlying data file.
17. Preview the raw text content of the first few lines of a log file.
18. Establish a live connection to a cloud database (e.g., BigQuery, S3).
19. List all files in the current working directory that end with .json.
20. Assign a new index to the dataframe starting from 1.
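To make this concrete, here is a minimal pandas sketch of how a few of these inspection prompts might translate into actual commands. The file name sales_data.csv and the region column are the hypothetical examples used in the prompts above; substitute your own.

```python
import pandas as pd

# Prompt 1: load the primary dataset into a dataframe
df = pd.read_csv("sales_data.csv")

# Prompts 2-4: first 10 rows, last 5 rows, and the shape
print(df.head(10))
print(df.tail(5))
print(df.shape)

# Prompt 5: data types (schema) of all columns
print(df.dtypes)

# Prompt 9: statistical summary of all numerical columns
print(df.describe())

# Prompt 11: unique values and their counts for the region column
print(df["region"].value_counts())
```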
Data Cleaning & Preparation (21-40)
These focus on handling missing values, duplicates, and converting data types; a pandas sketch follows the list.
21. Drop all rows that contain any missing values (NaN).
22. Fill missing values in the customer_age column with the mean age.
23. Replace all instances of the string 'N/A' with actual missing values (NaN).
24. Drop the customer_id column as it's not needed for analysis.
25. Remove any fully duplicated rows.
26. Convert the price column from a string to a float data type.
27. Standardize the case of the product_name column to lowercase.
28. Extract the year from the order_date column into a new column.
29. Split the full_name column into two new columns: first_name and last_name.
30. Remove leading/trailing whitespace from the category column.
31. Filter the data to keep only transactions where status is 'Completed'.
32. Identify and count any outliers in the revenue column using the IQR method.
33. Apply a log transformation to the sales_volume column.
34. Recode the values in the gender column ('M', 'F') to (0, 1).
35. One-hot encode the marital_status categorical column.
36. Bin the continuous age column into 5 equal-width bins.
37. Standardize (Z-score normalize) the income column.
38. Impute missing values in the city column using the most frequent (mode) city.
39. Validate that the sum of cost and profit equals the price for every row.
40. Rename the column txn_id to transaction_identifier.
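As before, here is a hedged pandas sketch of a few of the cleaning prompts. The column names (customer_age, price, full_name, txn_id) are the hypothetical ones from the list; the exact calls will differ in R, SQL, or another environment.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("sales_data.csv")  # hypothetical input file

# Prompt 23: replace the string 'N/A' with real missing values
df = df.replace("N/A", np.nan)

# Prompt 22: fill missing customer_age values with the mean age
df["customer_age"] = df["customer_age"].fillna(df["customer_age"].mean())

# Prompt 26: convert price from string to float
df["price"] = df["price"].astype(float)

# Prompt 29: split full_name into first_name and last_name
df[["first_name", "last_name"]] = df["full_name"].str.split(" ", n=1, expand=True)

# Prompt 40: rename txn_id to transaction_identifier
df = df.rename(columns={"txn_id": "transaction_identifier"})
```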
Data Exploration & Manipulation (41-70)
These focus on slicing, aggregating, pivoting, and exploring relationships; a pandas sketch follows the list.
41. Filter the data for transactions in 'California' OR 'Texas'.
42. Select only the date, product, and quantity columns.
43. Sort the entire dataset by transaction_amount in descending order.
44. Calculate the total sales across the entire dataset.
45. Find the average rating for the product 'X-2000'.
46. Group the data by category and calculate the mean price for each category.
47. Find the maximum profit achieved by any single transaction.
48. Count the number of unique customers.
49. Create a cross-tabulation (contingency table) of region and product_type.
50. Merge the current dataframe with a customer_details dataframe using customer_id as the key (left join).
51. Append a new dataset (Q4_data) to the bottom of the current dataframe.
52. Pivot the data to show product_type in the index, year in the columns, and the sum of sales as the values.
53. Calculate the percent difference in sales between the current year and the previous year.
54. Identify the top 5 products by total revenue.
55. Calculate a rolling 7-day average of the daily_visitors column.
56. Apply a custom function to clean text data in the notes column.
57. Sample 10% of the data randomly.
58. Shift the stock_price column by one row to enable comparison with the next day's price.
59. Generate a cumulative sum of the daily_views column.
60. Group by country and return the name of the city with the highest sales within each country.
61. Calculate the interquartile range (IQR) for the delivery_time column.
62. Compute the correlation matrix for all numerical variables.
63. Create a new column that categorizes transactions as 'High Value' (≥ $500) or 'Standard' (< $500).
64. Select rows where product_type is 'Electronics' AND quantity is greater than 10.
65. Calculate the mode of the payment_method column.
66. Rank the products based on their total profit.
67. Calculate the variance of the inventory_level column.
68. Compute the coefficient of variation for the monthly_expense column.
69. Filter the data to exclude the category 'Returns'.
70. Perform a full outer join between the main dataset and a supplier_info table.
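A pandas sketch of a few exploration prompts follows. Again, the column names are the hypothetical ones from the list, and these one-liners are illustrations rather than the only correct translations.

```python
import pandas as pd

df = pd.read_csv("sales_data.csv")  # hypothetical input file

# Prompt 46: mean price per category
mean_price = df.groupby("category")["price"].mean()

# Prompt 52: pivot with product_type as the index, year as the
# columns, and summed sales as the values
pivot = df.pivot_table(index="product_type", columns="year",
                       values="sales", aggfunc="sum")

# Prompt 55: rolling 7-day average of daily visitors
df["visitors_7d_avg"] = df["daily_visitors"].rolling(window=7).mean()

# Prompt 62: correlation matrix for all numerical variables
corr = df.select_dtypes(include="number").corr()

# Prompt 64: Electronics transactions with quantity above 10
subset = df[(df["product_type"] == "Electronics") & (df["quantity"] > 10)]
```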
Statistical Modeling & Visualization (71-100)
These focus on advanced analysis, model preparation, and output; a scikit-learn/matplotlib sketch follows the list.
71. Run a simple linear regression with sales as the dependent variable and advertising_spend as the independent variable.
72. Fit a k-means clustering model (with a chosen number of clusters k) to the feature data.
73. Calculate the p-value from a chi-squared test between gender and purchase_decision.
74. Perform an ANOVA test to compare the mean scores across three different treatment groups.
75. Split the dataset into training (80%) and testing (20%) sets.
76. Calculate the precision, recall, and F1-score for a classification model.
77. Visualize the distribution of the income column using a histogram.
78. Generate a scatter plot of price vs. rating.
79. Create a box plot to show the distribution of salary across different departments.
80. Plot a time series chart of daily_visitors.
81. Display the coefficients and intercept of the fitted regression model.
82. Compute the eigenvalues and eigenvectors of a covariance matrix for Principal Component Analysis (PCA).
83. Plot a bar chart showing the total quantity sold per country.
84. Save the resulting cleaned dataframe to a new CSV file named cleaned_data.csv.
85. Export the summary statistics of the model to a text file.
86. Generate a heatmap of the correlation matrix.
87. Calculate the root mean squared error (RMSE) of the model predictions.
88. Perform a Grubbs' test to statistically identify outliers.
89. Save the resulting visualization (e.g., the bar chart) as a PNG image file.
90. Calculate the log-likelihood of a fitted model.
91. Perform feature selection using Recursive Feature Elimination (RFE).
92. Apply a MinMaxScaler to the numerical features.
93. Generate a k-fold cross-validation split.
94. Display the ROC curve for a binary classification model.
95. Serialize (pickle) the trained machine learning model for later use.
96. Compute the variance inflation factor (VIF) for model variables to check for multicollinearity.
97. Execute a t-test to compare the mean response_time of two different server groups.
98. Aggregate the data by week and find the sum of sales for each week.
99. Save the output of a lengthy group-by calculation to an intermediate file (e.g., HDF5).
100. Print the memory address of the current dataframe object (for debugging/performance).
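Finally, a scikit-learn/matplotlib sketch of a few modeling and output prompts. It assumes a numeric dataframe loaded from the hypothetical cleaned_data.csv produced in prompt 84; it is one possible translation, not the canonical one.

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("cleaned_data.csv")  # hypothetical cleaned input

# Prompt 75: split into 80% training and 20% testing sets
X, y = df[["advertising_spend"]], df["sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Prompt 71: simple linear regression of sales on advertising_spend
model = LinearRegression().fit(X_train, y_train)

# Prompt 81: coefficients and intercept of the fitted model
print(model.coef_, model.intercept_)

# Prompt 87: RMSE of the model predictions on the test set
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(rmse)

# Prompts 77 and 89: histogram of income, saved as a PNG file
df["income"].plot(kind="hist")
plt.savefig("income_histogram.png")
```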