That's a massive request! Generating 100 specific, unique, and universally applicable command prompts for "data analysis" is challenging because the exact commands depend heavily on the specific software, programming language, and environment (e.g., Python/Pandas, R, SQL, MATLAB, a statistical package like SPSS or SAS, or a command-line tool like Awk/Sed) being used.
However, I can provide 100 distinct conceptual tasks or operations common in data analysis, phrased as prompts you would translate into a specific command in your chosen environment.
Here are 100 Conceptual Command Prompts for Data Analysis, broken down by common analytical phases:
Data Acquisition & Inspection (1-20)
These focus on loading data and getting an initial look at its structure.
1. Load the primary dataset named `sales_data.csv` into a dataframe.
2. Display the first 10 rows of the dataframe.
3. Show the last 5 rows of the dataframe.
4. Print the total number of rows and columns (the shape/dimensions).
5. Summarize the data types (schema) of all columns.
6. List all the column names in the dataset.
7. Calculate the memory usage of the dataframe.
8. Count the total number of non-missing values per column.
9. Display a quick statistical summary (mean, std, min, max, quartiles) for all numerical columns.
10. Check for the presence of duplicate rows.
11. Display the unique values and their counts for the `region` column.
12. Read a dataset from a SQL query using a connection string.
13. Import data from an Excel file, specifically from the sheet named "Q3_Results".
14. Inspect the date range of the `transaction_date` column.
15. Convert the entire dataframe to a dictionary of records.
16. Show the metadata or header information of the underlying data file.
17. Preview the raw text content of the first few lines of a log file.
18. Establish a live connection to a cloud database (e.g., BigQuery, S3).
19. List all files in the current working directory that end with `.json`.
20. Assign a new index to the dataframe starting from 1.
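In Python/pandas, for example, many of these inspection prompts translate almost one-to-one into single method calls. A minimal sketch (the tiny inline CSV stands in for the hypothetical `sales_data.csv`; the `region` column matches prompt 11):

```python
import io
import pandas as pd

# Hypothetical stand-in for sales_data.csv (prompt 1: load the dataset).
csv_text = """region,price,quantity
West,10.5,2
East,20.0,1
West,15.25,3
"""
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(10))                   # prompt 2: first rows
print(df.tail(5))                    # prompt 3: last rows
print(df.shape)                      # prompt 4: (rows, columns)
print(df.dtypes)                     # prompt 5: schema / data types
print(list(df.columns))              # prompt 6: column names
print(df.memory_usage(deep=True))    # prompt 7: memory usage per column
print(df.count())                    # prompt 8: non-missing values per column
print(df.describe())                 # prompt 9: statistical summary
print(df.duplicated().any())         # prompt 10: any duplicate rows?
print(df["region"].value_counts())   # prompt 11: unique values and counts
df.index = range(1, len(df) + 1)     # prompt 20: index starting from 1
```

The same conceptual prompts map to `SELECT ... LIMIT`, `DESCRIBE`, and `COUNT(*)` in SQL, or to `str()`, `head()`, and `summary()` in R.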
Data Cleaning & Preparation (21-40)
These focus on handling missing values, duplicates, and converting data types.
21. Drop all rows that contain any missing values (`NaN`).
22. Fill missing values in the `customer_age` column with the mean age.
23. Replace all instances of the string 'N/A' with actual missing values (`NaN`).
24. Drop the `customer_id` column as it's not needed for analysis.
25. Remove any fully duplicated rows.
26. Convert the `price` column from a string to a float data type.
27. Standardize the case of the `product_name` column to lowercase.
28. Extract the year from the `order_date` column into a new column.
29. Split the `full_name` column into two new columns: `first_name` and `last_name`.
30. Remove leading/trailing whitespace from the `category` column.
31. Filter the data to keep only transactions where `status` is 'Completed'.
32. Identify and count any outliers in the `revenue` column using the IQR method.
33. Apply a log transformation to the `sales_volume` column.
34. Recode the values in the `gender` column ('M', 'F') to (0, 1).
35. One-hot encode the `marital_status` categorical column.
36. Bin the continuous `age` column into 5 equal-width bins.
37. Standardize (Z-score normalize) the `income` column.
38. Impute missing values in the `city` column using the most frequent (mode) city.
39. Validate that the sum of `cost` and `profit` equals the `price` for every row.
40. Rename the column `txn_id` to `transaction_identifier`.
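A handful of the cleaning prompts above, chained together in pandas (the frame and its values are invented for illustration; column names follow the list):

```python
import numpy as np
import pandas as pd

# Hypothetical messy data illustrating prompts 22, 23, 26, 27, 31, and 40.
df = pd.DataFrame({
    "txn_id": [1, 2, 3, 4],
    "price": ["10.5", "N/A", "20.0", "30.0"],
    "customer_age": [25.0, np.nan, 35.0, 45.0],
    "product_name": ["Widget", "GADGET", "Widget", "Doohickey"],
    "status": ["Completed", "Pending", "Completed", "Completed"],
})

df = df.replace("N/A", np.nan)                          # prompt 23: 'N/A' -> NaN
df["customer_age"] = df["customer_age"].fillna(
    df["customer_age"].mean())                          # prompt 22: mean imputation
df["price"] = df["price"].astype(float)                 # prompt 26: string -> float
df["product_name"] = df["product_name"].str.lower()     # prompt 27: lowercase
df = df[df["status"] == "Completed"]                    # prompt 31: keep 'Completed'
df = df.rename(columns={"txn_id": "transaction_identifier"})  # prompt 40
```

Note the order matters: the 'N/A' strings must become `NaN` (prompt 23) before the `float` cast in prompt 26 can succeed.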
Data Exploration & Manipulation (41-70)
These focus on slicing, aggregating, pivoting, and exploring relationships.
41. Filter the data for transactions in 'California' OR 'Texas'.
42. Select only the `date`, `product`, and `quantity` columns.
43. Sort the entire dataset by `transaction_amount` in descending order.
44. Calculate the total sales across the entire dataset.
45. Find the average rating for the product 'X-2000'.
46. Group the data by `category` and calculate the mean `price` for each category.
47. Find the maximum profit achieved by any single transaction.
48. Count the number of unique customers.
49. Create a cross-tabulation (contingency table) of `region` and `product_type`.
50. Merge the current dataframe with a `customer_details` dataframe using `customer_id` as the key (left join).
51. Append a new dataset (`Q4_data`) to the bottom of the current dataframe.
52. Pivot the data to show `product_type` in the index, `year` in the columns, and the sum of `sales` as the values.
53. Calculate the percent difference in sales between the current year and the previous year.
54. Identify the top 5 products by total revenue.
55. Calculate a rolling 7-day average of the `daily_visitors` column.
56. Apply a custom function to clean text data in the `notes` column.
57. Sample 10% of the data randomly.
58. Shift the `stock_price` column by one row to enable comparison with the next day's price.
59. Generate a cumulative sum of the `daily_views` column.
60. Group by `country` and return the name of the city with the highest sales within each country.
61. Calculate the interquartile range (IQR) for the `delivery_time` column.
62. Compute the correlation matrix for all numerical variables.
63. Create a new column that categorizes transactions as 'High Value' (> $500) or 'Low Value' (≤ $500).
64. Select rows where `product_type` is 'Electronics' AND `quantity` is greater than 10.
65. Calculate the mode of the `payment_method` column.
66. Rank the products based on their total `profit`.
67. Calculate the variance of the `inventory_level` column.
68. Compute the coefficient of variation for the `monthly_expense` column.
69. Filter the data to exclude the category 'Returns'.
70. Perform a full outer join between the main dataset and a `supplier_info` table.
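Several of these manipulation prompts combine naturally in pandas. A small sketch with invented data (column names follow the list; the $500 threshold is the one from prompt 63):

```python
import pandas as pd

# Hypothetical transactions and lookup table for prompts 41, 46, 48, 50, 63.
sales = pd.DataFrame({
    "customer_id": [1, 2, 1, 3],
    "category": ["A", "B", "A", "B"],
    "transaction_amount": [100.0, 250.0, 50.0, 600.0],
})
customer_details = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "state": ["California", "Texas", "Ohio"],
})

# Prompt 46-style aggregation: mean amount per category.
mean_by_cat = sales.groupby("category")["transaction_amount"].mean()

# Prompt 50: left join on customer_id.
merged = sales.merge(customer_details, on="customer_id", how="left")

# Prompt 41: filter for California OR Texas.
ca_tx = merged[merged["state"].isin(["California", "Texas"])]

# Prompt 63: categorize by the $500 threshold.
merged["value_tier"] = merged["transaction_amount"].apply(
    lambda amt: "High Value" if amt > 500 else "Low Value")

# Prompt 48: count unique customers.
n_customers = sales["customer_id"].nunique()
```

In SQL the same ideas would be `GROUP BY` with `AVG()`, a `LEFT JOIN`, a `WHERE ... IN (...)` clause, a `CASE WHEN` expression, and `COUNT(DISTINCT customer_id)`.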
Statistical Modeling & Visualization (71-100)
These focus on advanced analysis, model preparation, and output.
71. Run a simple linear regression with `sales` as the dependent variable and `advertising_spend` as the independent variable.
72. Fit a k-means clustering model, with a chosen number of clusters k, to the feature data.
73. Calculate the p-value from a chi-squared (χ²) test of independence between `gender` and `purchase_decision`.
74. Perform an ANOVA test to compare the mean scores across three different treatment groups.
75. Split the dataset into training (80%) and testing (20%) sets.
76. Calculate the precision, recall, and F1-score for a classification model.
77. Visualize the distribution of the `income` column using a histogram.
78. Generate a scatter plot of `price` vs. `rating`.
79. Create a box plot to show the distribution of `salary` across different `departments`.
80. Plot a time series chart of `daily_visitors`.
81. Display the coefficients and intercept of the fitted regression model.
82. Compute the eigenvalues and eigenvectors of a covariance matrix for Principal Component Analysis (PCA).
83. Plot a bar chart showing the total `quantity` sold per `country`.
84. Save the resulting cleaned dataframe to a new CSV file named `cleaned_data.csv`.
85. Export the summary statistics of the model to a text file.
86. Generate a heatmap of the correlation matrix.
87. Calculate the root mean squared error (RMSE) of the model predictions.
88. Perform a Grubbs' test to statistically identify outliers.
89. Save the resulting visualization (e.g., the bar chart) as a PNG image file.
90. Calculate the log-likelihood of a fitted model.
91. Perform feature selection using Recursive Feature Elimination (RFE).
92. Apply a MinMaxScaler to the numerical features.
93. Generate a k-fold cross-validation split (for a chosen value of k).
94. Display the ROC curve for a binary classification model.
95. Serialize (pickle) the trained machine learning model for later use.
96. Compute the variance inflation factor (VIF) for model variables to check for multicollinearity.
97. Execute a t-test to compare the mean `response_time` of two different server groups.
98. Aggregate the data by week and find the sum of sales for each week.
99. Save the output of a lengthy group-by calculation to an intermediate file (e.g., HDF5).
100. Print the memory address of the current dataframe object (for debugging/performance).
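To show how prompts 71, 75, 81, and 87 might fit together, here is a minimal numpy-only sketch on synthetic data (in practice you would more likely reach for scikit-learn's `train_test_split` and `LinearRegression`; the coefficients 3.0 and 50.0 below are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data for prompt 71: sales driven by advertising_spend plus noise.
advertising_spend = rng.uniform(0, 100, size=200)
sales = 3.0 * advertising_spend + 50.0 + rng.normal(0, 5, size=200)

# Prompt 75: 80/20 train/test split via shuffled indices.
idx = rng.permutation(len(sales))
cut = int(0.8 * len(sales))
train, test = idx[:cut], idx[cut:]

# Prompts 71 and 81: ordinary least squares fit; report slope and intercept.
slope, intercept = np.polyfit(advertising_spend[train], sales[train], deg=1)

# Prompt 87: RMSE of the predictions on the held-out 20%.
pred = slope * advertising_spend[test] + intercept
rmse = float(np.sqrt(np.mean((sales[test] - pred) ** 2)))
```

Fitting only on the training indices and scoring only on the held-out ones is the point of prompt 75: the RMSE then estimates out-of-sample error rather than memorization.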