11. Explain Type I and Type II errors
Type I Error (False Positive): Rejecting a true null hypothesis. Example: Saying a drug works when it doesn’t.
Type II Error (False Negative): Failing to reject a false null hypothesis. Example: Saying a drug doesn’t work when it actually does.
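To make the Type I error rate concrete, here is a minimal simulation sketch (assuming NumPy and SciPy; the 0.05 significance level, sample sizes, and seed are arbitrary choices): when the null hypothesis is actually true, a test at alpha = 0.05 should falsely reject about 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05              # significance level (hypothetical choice)
false_positives = 0
n_trials = 10_000

for _ in range(n_trials):
    # The null hypothesis is TRUE here: both groups share the same mean.
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:         # rejecting a true null = Type I error
        false_positives += 1

print(f"Observed Type I error rate: {false_positives / n_trials:.3f}")  # ~0.05
```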
12. What are descriptive vs inferential statistics?
Descriptive: Summarizes data using charts, graphs, and metrics such as the mean and median.
Inferential: Makes predictions or inferences about a population using a sample (e.g., confidence intervals, hypothesis testing).
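A short sketch of the distinction, assuming NumPy/SciPy and a made-up sample: the descriptive statistics summarize the data we actually observed, while the confidence interval is an inferential statement about the unseen population mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=10, size=100)  # hypothetical sample

# Descriptive: summarize the data we have
print("mean:", sample.mean(), "median:", np.median(sample),
      "std:", sample.std(ddof=1))

# Inferential: a 95% confidence interval for the POPULATION mean
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% CI for the population mean:", ci)
```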
13. What is correlation vs causation?
Correlation: Two variables move together, but one doesn’t necessarily cause the other.
Causation: One variable directly affects the other.
Important: Correlation ≠ Causation.
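A quick simulated illustration (NumPy assumed; the variable names are hypothetical): two effects of a hidden common cause correlate strongly even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(2)
heat = rng.normal(size=1000)                     # hidden common cause (e.g., temperature)
ice_cream = heat + rng.normal(scale=0.5, size=1000)
drownings = heat + rng.normal(scale=0.5, size=1000)

# Strong correlation, yet neither variable causes the other:
print(np.corrcoef(ice_cream, drownings)[0, 1])   # roughly 0.8
```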
14. What is a normal distribution?
A bell-shaped curve where data is symmetrically distributed around the mean.
Mean = Median = Mode
68% of data falls within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD (the empirical rule).
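The empirical rule is easy to verify numerically; a minimal NumPy sketch with an arbitrary seed and sample size:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=0, scale=1, size=1_000_000)   # standard normal draws

for k in (1, 2, 3):
    pct = np.mean(np.abs(x) <= k) * 100          # share within k SDs of the mean
    print(f"within {k} SD: {pct:.1f}%")          # ~68.3, ~95.4, ~99.7
```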
15. What is the central limit theorem (CLT)?
As sample size increases, the sampling distribution of the sample mean approaches a normal distribution — even if the population isn’t normal.
Used in: Confidence intervals, hypothesis testing.
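A small simulation makes the CLT visible (NumPy assumed; the exponential population, n = 50, and the trial count are arbitrary): even for a skewed population, means of repeated samples cluster around the population mean with spread close to the population SD divided by sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(4)

# Heavily skewed population (exponential), far from normal.
population = rng.exponential(scale=2.0, size=100_000)

# Sampling distribution of the mean for n = 50:
# draw many samples and record each sample's mean.
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(5_000)
])

# CLT prediction: mean of sample means ~ population mean,
# and their SD ~ population SD / sqrt(50).
print("population mean:     ", population.mean())
print("mean of sample means:", sample_means.mean())
print("predicted SE:        ", population.std() / np.sqrt(50))
print("observed SE:         ", sample_means.std())
```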
16. What is feature engineering?
Creating or transforming features to improve model performance.
Examples: Creating age from DOB, binning values, log transformations, creating interaction terms.
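A minimal pandas sketch of the examples above (the column names and the 2024-01-01 reference date are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dob": pd.to_datetime(["1990-05-01", "1985-11-23", "2000-02-14"]),
    "income": [30_000, 120_000, 55_000],
})

# Age from DOB (relative to a fixed reference date)
df["age"] = (pd.Timestamp("2024-01-01") - df["dob"]).dt.days // 365

# Binning a continuous value
df["income_band"] = pd.cut(df["income"], bins=[0, 50_000, 100_000, np.inf],
                           labels=["low", "mid", "high"])

# Log transform to tame skew
df["log_income"] = np.log1p(df["income"])

# Interaction term
df["age_x_log_income"] = df["age"] * df["log_income"]
print(df)
```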
17. What is missing value imputation?
Filling missing data using any of the following (sketched after this list):
• Mean/Median/Mode
• KNN Imputation
• Regression or ML models
• Forward/Backward fill (time series)
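A minimal sketch of these options using scikit-learn and pandas (the toy DataFrame is hypothetical; regression/model-based imputation is omitted for brevity):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

X = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                  "b": [10.0, 20.0, np.nan, 40.0]})

# Mean imputation ("median" / "most_frequent" work the same way)
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: fill gaps from the k most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Forward fill for time series
ts = pd.Series([1.0, np.nan, np.nan, 4.0])
ts_filled = ts.ffill()
```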
18. Explain one-hot encoding vs label encoding
One-hot encoding: Converts categories into binary columns. Best for non-ordinal data.
Label encoding: Assigns an integer code to each category (e.g., Red=1, Blue=2). Suitable for ordinal data, where the numeric order is meaningful.
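A brief sketch, assuming pandas and scikit-learn (toy data; note that for ordinal features sklearn's OrdinalEncoder with an explicit category order is usually preferred over LabelEncoder, which is intended for targets):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["small", "large", "medium"]})

# One-hot: one binary column per category (non-ordinal data)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: integer codes with an explicit order (ordinal data)
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_code"] = enc.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(df)
```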
19. What is multicollinearity? How to detect it?
When two or more independent variables are highly correlated, making it hard to isolate their effects.
Detection:
• Correlation matrix
• Variance Inflation Factor (a VIF above 5–10 is generally considered problematic; see the sketch below)
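A minimal VIF sketch, assuming statsmodels (synthetic data in which x2 nearly duplicates x1, so both should show inflated VIFs):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i in range(1, X.shape[1]):              # skip the constant column
    print(X.columns[i], variance_inflation_factor(X.values, i))
# x1 and x2 show very large VIFs; x3 stays near 1.
```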
20. What is dimensionality reduction?
Reducing the number of input features while retaining the important information.
Benefits: Simplifies models, reduces overfitting, speeds up training.
Techniques: PCA, LDA, t-SNE (PCA sketched below).
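A minimal PCA sketch with scikit-learn (synthetic data whose variance is concentrated in three latent directions, so three components capture most of it):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# 200 samples, 10 features, but most variance lives in 3 latent directions
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + rng.normal(scale=0.05, size=(200, 10))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)        # 10 features -> 3 components
print(pca.explained_variance_ratio_)    # most variance captured by 3 PCs
print(X_reduced.shape)                  # (200, 3)
```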