Here is a list of 100 command prompts for data analysis, organized by the stages of a typical data analysis workflow. Replace the bracketed placeholders (e.g., [feature], [Dataset A]) with your own column names, datasets, or parameters.
1. Data Definition & Acquisition
1. Define the primary business question(s) to be answered.
2. Identify all necessary data sources (e.g., SQL DB, CSVs, APIs).
3. Query the database to extract [specific data] using [SQL join/filter].
4. Scrape data from [webpage/API] using [Python library/tool].
5. Merge [Dataset A] and [Dataset B] on the [common key/index].
6. Concatenate [Dataset A] and [Dataset B] vertically.
7. Create a data dictionary (metadata) for all available variables.
8. Verify the integrity, origin, and freshness of the data.
9. Load the [CSV/Excel/JSON] file into a pandas DataFrame (see the sketch after this list).
10. Sample the data to create a smaller, representative subset.
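To make a few of these concrete, here is a minimal pandas sketch of prompts 5, 9, and 10. The file names (customers.csv, orders.csv) and the join key (customer_id) are hypothetical stand-ins for your own sources:

```python
import pandas as pd

# Load each source file into a DataFrame (prompt 9); file names are hypothetical.
customers = pd.read_csv("customers.csv")
orders = pd.read_csv("orders.csv")

# Merge the two datasets on an assumed common key (prompt 5);
# how="left" keeps every customer even if it has no matching orders.
merged = customers.merge(orders, on="customer_id", how="left")

# Draw a 10% sample as a smaller, representative subset (prompt 10).
subset = merged.sample(frac=0.1, random_state=42)
print(subset.head())
```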
2. Data Cleaning & Preprocessing
11. Identify and count all missing (null/NaN) values per column.
12. Calculate the percentage of missing data for each feature.
13. Formulate a strategy for handling missing data (e.g., deletion, imputation).
14. Impute missing numerical values using the [mean/median].
15. Impute missing categorical values using the [mode/a constant].
16. Perform advanced imputation using [k-NN Imputer/MICE].
17. Identify and count all duplicate rows in the dataset.
18. Remove duplicate records based on [all columns/a subset of columns].
19. Verify and correct the data type for each column (e.g., string to datetime).
20. Standardize categorical text (e.g., 'USA', 'U.S.', 'America' -> 'USA').
21. Parse and extract components from a [datetime/text] column.
22. Remove or replace special characters and whitespace from [text column].
23. Identify outliers using the Z-score method (threshold: 3).
24. Identify outliers using the Interquartile Range (IQR) method (illustrated in the sketch below).
25. Visualize potential outliers using a box plot for [feature].
26. Apply a transformation (e.g., log, square root) to the [skewed feature].
27. Normalize [feature] using Min-Max scaling.
28. Standardize [feature] using Z-score (Standard Scaler).
29. Convert [categorical feature] into numerical form using one-hot encoding.
30. Convert [ordinal feature] into numerical form using label encoding.
31. Bin (discretize) the [continuous feature] into [N] categories.
32. Engineer a new feature by [combining/dividing two existing features].
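A minimal sketch of prompts 11-12, 14, 24, and 29, assuming a hypothetical data.csv with illustrative price and region columns:

```python
import pandas as pd

df = pd.read_csv("data.csv")  # hypothetical input file

# Percentage of missing values per column (prompts 11-12).
missing_pct = df.isna().mean() * 100
print(missing_pct.sort_values(ascending=False))

# Median imputation for a numeric column (prompt 14).
df["price"] = df["price"].fillna(df["price"].median())

# Flag IQR outliers (prompt 24): values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)]
print(f"{len(outliers)} outlier rows flagged")

# One-hot encode a categorical column (prompt 29).
df = pd.get_dummies(df, columns=["region"], drop_first=True)
```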
3. Exploratory Data Analysis (EDA)
33. Calculate descriptive statistics (mean, median, mode, std dev, quartiles) for all numerical features.
34. Generate a frequency distribution and count plot for [categorical feature].
35. Plot a histogram to understand the distribution of [continuous feature].
36. Plot a density plot (KDE) for [continuous feature].
37. Assess the skewness and kurtosis of [feature distribution].
38. Create a bar chart to compare [numerical feature] across [categorical feature].
39. Create a scatter plot for [variable 1] vs. [variable 2] to check for correlation.
40. Add a regression line to the scatter plot.
41. Calculate the Pearson correlation coefficient between [variable 1] and [variable 2].
42. Generate a full correlation matrix for all numerical variables.
43. Visualize the correlation matrix using a heatmap (see the example after this list).
44. Generate a cross-tabulation (contingency table) for [categorical var 1] and [categorical var 2].
45. Plot a stacked bar chart to show the relationship between [cat var 1] and [cat var 2].
46. Plot side-by-side box plots to compare [continuous var] across [categorical var].
47. Use violin plots to compare the distribution shape of [continuous var] across [categorical var].
48. Create a scatter plot matrix (pairs plot) for key numerical features.
49. Plot a bubble chart using [var 1 (x-axis)], [var 2 (y-axis)], and [var 3 (size)].
50. Plot [time-series variable] over time to identify trends.
51. Decompose the time series into trend, seasonality, and residual components.
52. Perform a cohort analysis to track [user retention/customer churn].
53. Conduct an RFM (Recency, Frequency, Monetary) analysis for customer segmentation.
54. Analyze the user journey funnel to identify key drop-off points.
55. Map geospatial data using a choropleth or scatter map.
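One way prompts 33, 37, and 42-43 might look in pandas and seaborn; the input file data.csv is again a placeholder:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # hypothetical input file

# Descriptive statistics for all numeric features (prompt 33).
print(df.describe())

# Skewness and kurtosis of each numeric feature (prompt 37).
print(df.select_dtypes("number").skew())
print(df.select_dtypes("number").kurtosis())

# Correlation matrix rendered as a heatmap (prompts 42-43).
corr = df.select_dtypes("number").corr()
sns.heatmap(corr, annot=True, cmap="coolwarm", center=0)
plt.title("Correlation matrix")
plt.tight_layout()
plt.show()
```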
4. Statistical Analysis & Hypothesis Testing
56. Formulate a clear null hypothesis (H0) and alternative hypothesis (H1).
57. Set the significance level (alpha) for the hypothesis test (e.g., 0.05).
58. Check the assumptions for the chosen statistical test (e.g., normality, homogeneity of variance).
59. Perform a Shapiro-Wilk test to check for normality.
60. Perform Levene's test to check for homogeneity of variances.
61. Perform a one-sample t-test to compare [sample mean] against a [known population mean].
62. Perform an independent two-sample t-test to compare the means of [Group A] and [Group B] (sketched in code below).
63. Perform a paired t-test to compare [before] and [after] measurements.
64. Perform an Analysis of Variance (ANOVA) to compare means across [3+ groups].
65. If ANOVA is significant, perform a [Tukey's HSD] post-hoc test.
66. Perform a Chi-squared test for independence between [categorical var 1] and [categorical var 2].
67. Perform a non-parametric equivalent test (e.g., Mann-Whitney U, Kruskal-Wallis) if assumptions are violated.
68. Calculate the p-value and determine statistical significance.
69. Calculate the [95%/99%] confidence interval for the [mean/proportion].
70. Perform a simple linear regression analysis with [Y] as the dependent variable.
71. Interpret the R-squared, coefficients, and p-values of the regression model.
72. Analyze the A/B test results to determine the winning variant.
73. Calculate the statistical power (and Type II error) of the test.
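A sketch of the assumption checks and t-test in prompts 57, 59-60, 62, and 68, using SciPy on simulated data in place of real groups:

```python
import numpy as np
from scipy import stats

# Simulated placeholder data; substitute your real Group A / Group B samples.
rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=50)
group_b = rng.normal(loc=108, scale=15, size=50)

# Check assumptions first: normality (prompt 59) and equal variances (prompt 60).
print("Shapiro-Wilk A p-value:", stats.shapiro(group_a).pvalue)
print("Shapiro-Wilk B p-value:", stats.shapiro(group_b).pvalue)
print("Levene p-value:", stats.levene(group_a, group_b).pvalue)

# Independent two-sample t-test at alpha = 0.05 (prompts 57, 62, 68).
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=True)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
if p_value < 0.05:
    print("Reject H0: the group means differ significantly.")
else:
    print("Fail to reject H0.")
```

If the Shapiro-Wilk or Levene p-values were small, prompt 67's non-parametric alternatives (e.g., stats.mannwhitneyu) would be the safer route.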
5. Modeling & Machine Learning
74. Split the data into training, validation, and test sets (e.g., 70/15/15 split).
75. Establish a baseline model for performance comparison.
76. Train a [Linear/Logistic] Regression model.
77. Train a [Decision Tree] classifier and visualize the tree.
78. Train an ensemble model (e.g., Random Forest, Gradient Boosting/XGBoost) (see the scikit-learn sketch below).
79. Train a [k-Nearest Neighbors (k-NN)] model and find the optimal 'k'.
80. Train a [Support Vector Machine (SVM)] model.
81. For classification, generate a confusion matrix.
82. For classification, plot the ROC curve and calculate the AUC score.
83. For classification, calculate precision, recall, and F1-score.
84. For regression, calculate Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE).
85. Perform hyperparameter tuning using [Grid Search/Random Search CV].
86. Perform K-means clustering to identify [N] distinct groups.
87. Use the elbow method to determine the optimal number of clusters (k).
88. Perform Principal Component Analysis (PCA) for dimensionality reduction.
89. Analyze the feature importance scores from the [Random Forest/XGBoost] model.
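A compact scikit-learn sketch covering prompts 74, 78, 81-83, and 89; it uses a synthetic dataset from make_classification purely as a stand-in for your own features and target:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with your real X and y.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Hold out a test set (prompt 74); a validation split can be carved out the same way.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Train an ensemble classifier (prompt 78).
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate: confusion matrix, precision/recall/F1, ROC AUC (prompts 81-83).
y_pred = model.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

# Feature importance scores (prompt 89).
print(model.feature_importances_)
```

Swapping RandomForestClassifier for another estimator leaves the evaluation code unchanged, which makes the baseline comparison in prompt 75 straightforward.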
6. Interpretation & Reporting
90. Summarize the top 3-5 key findings from the exploratory analysis.
91. Interpret the results of the hypothesis test in plain business language.
92. Translate the model's coefficients into actionable insights.
93. Analyze the model's prediction errors: where and why does it fail?
94. Identify any potential sources of bias in the data or model.
95. List the limitations of the analysis (e.g., data quality, sample size).
96. Formulate 3-5 data-driven recommendations based on the findings.
97. Create a narrative (story) that explains the data from problem to solution.
98. Prepare a slide deck for [technical/non-technical] stakeholders.
99. Build an interactive dashboard (e.g., in Tableau, Power BI, Streamlit) to present the key findings (a minimal Streamlit sketch follows this list).
100. Outline the next steps and proposals for future analysis.
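Prompt 99 mentions Streamlit; as one possible starting point, here is a minimal dashboard sketch. The file results.csv and the category, month, and revenue columns are hypothetical placeholders for your own analysis output:

```python
# app.py -- run with: streamlit run app.py
import pandas as pd
import streamlit as st

st.title("Key Findings Dashboard")

df = pd.read_csv("results.csv")  # hypothetical file of analysis output

# Let stakeholders filter the view interactively.
category = st.selectbox("Category", sorted(df["category"].unique()))
filtered = df[df["category"] == category]

st.metric("Rows in view", len(filtered))
st.bar_chart(filtered, x="month", y="revenue")  # hypothetical columns
st.dataframe(filtered)
```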