Here is a list of 100 command prompts for data analysis, organized by the stages of a typical data analysis workflow. Replace the bracketed placeholders (e.g., [feature], [Dataset A]) with your own column names, datasets, or parameters.
1. Data Definition & Acquisition
1. Define the primary business question(s) to be answered.
2. Identify all necessary data sources (e.g., SQL DB, CSVs, APIs).
3. Query the database to extract [specific data] using [SQL join/filter].
4. Scrape data from [webpage/API] using [Python library/tool].
5. Merge [Dataset A] and [Dataset B] on the [common key/index].
6. Concatenate [Dataset A] and [Dataset B] vertically.
7. Create a data dictionary (metadata) for all available variables.
8. Verify the integrity, origin, and freshness of the data.
9. Load the [CSV/Excel/JSON] file into a pandas DataFrame (see the sketch after this list).
10. Sample the data to create a smaller, representative subset.
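To make a few of these concrete, here is a minimal pandas sketch of prompts 5, 9, and 10. The file names (customers.csv, orders.csv) and the join key (customer_id) are hypothetical stand-ins for your own sources:

```python
import pandas as pd

# Load each source file into a DataFrame (prompt 9); file names are hypothetical.
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Merge the two datasets on an assumed common key (prompt 5);
# how="left" keeps every customer even if it has no matching orders.
merged = customers.merge(orders, on="customer_id", how="left")

# Draw a 10% sample as a smaller, representative subset (prompt 10).
subset = merged.sample(frac=0.1, random_state=42)
print(subset.head())
```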
2. Data Cleaning & Preprocessing
11. Identify and count all missing (null/NaN) values per column.
12. Calculate the percentage of missing data for each feature.
13. Formulate a strategy for handling missing data (e.g., deletion, imputation).
14. Impute missing numerical values using the [mean/median].
15. Impute missing categorical values using the [mode/a constant].
16. Perform advanced imputation using [k-NN Imputer/MICE].
17. Identify and count all duplicate rows in the dataset.
18. Remove duplicate records based on [all columns/a subset of columns].
19. Verify and correct the data type for each column (e.g., string to datetime).
20. Standardize categorical text (e.g., 'USA', 'U.S.', 'America' -> 'USA').
21. Parse and extract components from a [datetime/text] column.
22. Remove or replace special characters and whitespace from [text column].
23. Identify outliers using the Z-score method (threshold: 3).
24. Identify outliers using the Interquartile Range (IQR) method (illustrated in the sketch below).
25. Visualize potential outliers using a box plot for [feature].
26. Apply a transformation (e.g., log, square root) to the [skewed feature].
27. Normalize [feature] using Min-Max scaling.
28. Standardize [feature] using Z-score (Standard Scaler).
29. Convert [categorical feature] into numerical form using one-hot encoding.
30. Convert [ordinal feature] into numerical form using label encoding.
31. Bin (discretize) the [continuous feature] into [N] categories.
32. Engineer a new feature by [combining/dividing two existing features].
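A minimal sketch of prompts 11-12, 14, 24, and 29, assuming a hypothetical data.csv with illustrative price and region columns:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file

# Percentage of missing values per column (prompts 11-12).
missing_pct = df.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Median imputation for a numeric column (prompt 14).
df["price"] = df["price"].fillna(df["price"].median())

# Flag IQR outliers (prompt 24): values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} outlier rows flagged")

# One-hot encode a categorical column (prompt 29).
df = pd.get_dummies(df, columns=["region"], drop_first=True)
```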
3. Exploratory Data Analysis (EDA)
33. Calculate descriptive statistics (mean, median, mode, std dev, quartiles) for all numerical features.
34. Generate a frequency distribution and count plot for [categorical feature].
35. Plot a histogram to understand the distribution of [continuous feature].
36. Plot a density plot (KDE) for [continuous feature].
37. Assess the skewness and kurtosis of [feature distribution].
38. Create a bar chart to compare [numerical feature] across [categorical feature].
39. Create a scatter plot for [variable 1] vs. [variable 2] to check for correlation.
40. Add a regression line to the scatter plot.
41. Calculate the Pearson correlation coefficient between [variable 1] and [variable 2].
42. Generate a full correlation matrix for all numerical variables.
43. Visualize the correlation matrix using a heatmap (see the example after this list).
44. Generate a cross-tabulation (contingency table) for [categorical var 1] and [categorical var 2].
45. Plot a stacked bar chart to show the relationship between [cat var 1] and [cat var 2].
46. Plot side-by-side box plots to compare [continuous var] across [categorical var].
47. Use violin plots to compare the distribution shape of [continuous var] across [categorical var].
48. Create a scatter plot matrix (pairs plot) for key numerical features.
49. Plot a bubble chart using [var 1 (x-axis)], [var 2 (y-axis)], and [var 3 (size)].
50. Plot [time-series variable] over time to identify trends.
51. Decompose the time series into trend, seasonality, and residual components.
52. Perform a cohort analysis to track [user retention/customer churn].
53. Conduct an RFM (Recency, Frequency, Monetary) analysis for customer segmentation.
54. Analyze the user journey funnel to identify key drop-off points.
55. Map geospatial data using a choropleth or scatter map.
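One way prompts 33, 37, and 42-43 might look in pandas and seaborn; the input file data.csv is again a placeholder:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical input file

# Descriptive statistics for all numeric features (prompt 33).
print(df.describe())

# Skewness and kurtosis of each numeric feature (prompt 37).
print(df.select_dtypes("number").skew())
print(df.select_dtypes("number").kurtosis())

# Correlation matrix rendered as a heatmap (prompts 42-43).
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation matrix")
plt.tight_layout()
plt.show()
```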
4. Statistical Analysis & Hypothesis Testing
56. Formulate a clear null hypothesis (H0) and alternative hypothesis (H1).
57. Set the significance level (alpha) for the hypothesis test (e.g., 0.05).
58. Check the assumptions for the chosen statistical test (e.g., normality, homogeneity of variance).
59. Perform a Shapiro-Wilk test to check for normality.
60. Perform Levene's test to check for homogeneity of variances.
61. Perform a one-sample t-test to compare [sample mean] against a [known population mean].
62. Perform an independent two-sample t-test to compare the means of [Group A] and [Group B] (sketched in code below).
63. Perform a paired t-test to compare [before] and [after] measurements.
64. Perform an Analysis of Variance (ANOVA) to compare means across [3+ groups].
65. If ANOVA is significant, perform a [Tukey's HSD] post-hoc test.
66. Perform a Chi-squared test for independence between [categorical var 1] and [categorical var 2].
67. Perform a non-parametric equivalent test (e.g., Mann-Whitney U, Kruskal-Wallis) if assumptions are violated.
68. Calculate the p-value and determine statistical significance.
69. Calculate the [95%/99%] confidence interval for the [mean/proportion].
70. Perform a simple linear regression analysis with [Y] as the dependent variable.
71. Interpret the R-squared, coefficients, and p-values of the regression model.
72. Analyze the A/B test results to determine the winning variant.
73. Calculate the statistical power (and Type II error) of the test.
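A sketch of the assumption checks and t-test in prompts 57, 59-60, 62, and 68, using SciPy on simulated data in place of real groups:

```python
import numpy as np
from scipy import stats

# Simulated placeholder data; substitute your real Group A / Group B samples.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=50)
group_b = rng.normal(loc=108, scale=15, size=50)

# Check assumptions first: normality (prompt 59) and equal variances (prompt 60).
print("Shapiro-Wilk A p-value:", stats.shapiro(group_a).pvalue)
print("Shapiro-Wilk B p-value:", stats.shapiro(group_b).pvalue)
print("Levene p-value:", stats.levene(group_a, group_b).pvalue)

# Independent two-sample t-test at alpha = 0.05 (prompts 57, 62, 68).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the group means differ significantly.")
else:
    print("Fail to reject H0.")
```

If the Shapiro-Wilk or Levene p-values were small, prompt 67's non-parametric alternatives (e.g., stats.mannwhitneyu) would be the safer route.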
5. Modeling & Machine Learning
74. Split the data into training, validation, and test sets (e.g., 70/15/15 split).
75. Establish a baseline model for performance comparison.
76. Train a [Linear/Logistic] Regression model.
77. Train a [Decision Tree] classifier and visualize the tree.
78. Train an ensemble model (e.g., Random Forest, Gradient Boosting/XGBoost) (see the scikit-learn sketch below).
79. Train a [k-Nearest Neighbors (k-NN)] model and find the optimal 'k'.
80. Train a [Support Vector Machine (SVM)] model.
81. For classification, generate a confusion matrix.
82. For classification, plot the ROC curve and calculate the AUC score.
83. For classification, calculate precision, recall, and F1-score.
84. For regression, calculate Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
85. Perform hyperparameter tuning using [Grid Search/Random Search CV].
86. Perform K-means clustering to identify [N] distinct groups.
87. Use the elbow method to determine the optimal number of clusters (k).
88. Perform Principal Component Analysis (PCA) for dimensionality reduction.
89. Analyze the feature importance scores from the [Random Forest/XGBoost] model.
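A compact scikit-learn sketch covering prompts 74, 78, 81-83, and 89; it uses a synthetic dataset from make_classification purely as a stand-in for your own features and target:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with your real X and y.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out a test set (prompt 74); a validation split can be carved out the same way.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train an ensemble classifier (prompt 78).
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate: confusion matrix, precision/recall/F1, ROC AUC (prompts 81-83).
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Feature importance scores (prompt 89).
print(model.feature_importances_)
```

Swapping RandomForestClassifier for another estimator leaves the evaluation code unchanged, which makes the baseline comparison in prompt 75 straightforward.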
6. Interpretation & Reporting
90. Summarize the top 3-5 key findings from the exploratory analysis.
91. Interpret the results of the hypothesis test in plain business language.
92. Translate the model's coefficients into actionable insights.
93. Analyze the model's prediction errors: where and why does it fail?
94. Identify any potential sources of bias in the data or model.
95. List the limitations of the analysis (e.g., data quality, sample size).
96. Formulate 3-5 data-driven recommendations based on the findings.
97. Create a narrative (story) that explains the data from problem to solution.
98. Prepare a slide deck for [technical/non-technical] stakeholders.
99. Build an interactive dashboard (e.g., in Tableau, Power BI, Streamlit) to present the key findings (a minimal Streamlit sketch follows this list).
100. Outline the next steps and proposals for future analysis.
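Prompt 99 mentions Streamlit; as one possible starting point, here is a minimal dashboard sketch. The file results.csv and the category, month, and revenue columns are hypothetical placeholders for your own analysis output:

```python
# app.py -- run with: streamlit run app.py
import pandas as pd
import streamlit as st

st.title("Key Findings Dashboard")

df = pd.read_csv("results.csv")  # hypothetical file of analysis output

# Let stakeholders filter the view interactively.
category = st.selectbox("Category", sorted(df["category"].unique()))
filtered = df[df["category"] == category]

st.metric("Rows in view", len(filtered))
st.bar_chart(filtered, x="month", y="revenue")  # hypothetical columns
st.dataframe(filtered)
```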