11. Explain Type I and Type II errors
Type I Error (False Positive): Rejecting a true null hypothesis. Example: Saying a drug works when it doesn’t.
Type II Error (False Negative): Failing to reject a false null hypothesis. Example: Saying a drug doesn’t work when it actually does.
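To make the Type I error rate concrete, here is a minimal simulation sketch (assuming NumPy and SciPy; the 0.05 significance level, sample sizes, and seed are arbitrary choices): when the null hypothesis is actually true, a test at alpha = 0.05 should falsely reject about 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05              # significance level (hypothetical choice)
false_positives = 0
n_trials = 10_000

for _ in range(n_trials):
    # The null hypothesis is TRUE here: both groups share the same mean.
    a = rng.normal(loc=0, scale=1, size=30)
    b = rng.normal(loc=0, scale=1, size=30)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:         # rejecting a true null = Type I error
        false_positives += 1

print(f"Observed Type I error rate: {false_positives / n_trials:.3f}")  # ~0.05
```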
12. What are descriptive vs inferential statistics?
Descriptive: Summarizes data using charts, graphs, and metrics such as the mean and median.
Inferential: Makes predictions or inferences about a population using a sample (e.g., confidence intervals, hypothesis testing).
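A short sketch of the distinction, assuming NumPy/SciPy and a made-up sample: the descriptive statistics summarize the data we actually observed, while the confidence interval is an inferential statement about the unseen population mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
sample = rng.normal(loc=50, scale=10, size=100)  # hypothetical sample

# Descriptive: summarize the data we have
print("mean:", sample.mean(), "median:", np.median(sample),
      "std:", sample.std(ddof=1))

# Inferential: a 95% confidence interval for the POPULATION mean
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print("95% CI for the population mean:", ci)
```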
13. What is correlation vs causation?
Correlation: Two variables move together, but one doesn’t necessarily cause the other.
Causation: One variable directly affects the other.
Important: Correlation ≠ Causation.
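A quick simulated illustration (NumPy assumed; the variable names are hypothetical): two effects of a hidden common cause correlate strongly even though neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(2)
heat = rng.normal(size=1000)                     # hidden common cause (e.g., temperature)
ice_cream = heat + rng.normal(scale=0.5, size=1000)
drownings = heat + rng.normal(scale=0.5, size=1000)

# Strong correlation, yet neither variable causes the other:
print(np.corrcoef(ice_cream, drownings)[0, 1])   # roughly 0.8
```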
14. What is a normal distribution?
A bell-shaped curve where data is symmetrically distributed around the mean.
Mean = Median = Mode
68% of data falls within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD (the empirical rule).
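The empirical rule is easy to verify numerically; a minimal NumPy sketch with an arbitrary seed and sample size:

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=0, scale=1, size=1_000_000)   # standard normal draws

for k in (1, 2, 3):
    pct = np.mean(np.abs(x) <= k) * 100          # share within k SDs of the mean
    print(f"within {k} SD: {pct:.1f}%")          # ~68.3, ~95.4, ~99.7
```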
15. What is the central limit theorem (CLT)?
As sample size increases, the sampling distribution of the sample mean approaches a normal distribution — even if the population isn’t normal.
Used in: Confidence intervals, hypothesis testing.
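A small simulation makes the CLT visible (NumPy assumed; the exponential population, n = 50, and the trial count are arbitrary): even for a skewed population, means of repeated samples cluster around the population mean with spread close to the population SD divided by sqrt(n).

```python
import numpy as np

rng = np.random.default_rng(4)

# Heavily skewed population (exponential), far from normal.
population = rng.exponential(scale=2.0, size=100_000)

# Sampling distribution of the mean for n = 50:
# draw many samples and record each sample's mean.
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(5_000)
])

# CLT prediction: mean of sample means ~ population mean,
# and their SD ~ population SD / sqrt(50).
print("population mean:     ", population.mean())
print("mean of sample means:", sample_means.mean())
print("predicted SE:        ", population.std() / np.sqrt(50))
print("observed SE:         ", sample_means.std())
```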
16. What is feature engineering?
Creating or transforming features to improve model performance.
Examples: Creating age from DOB, binning values, log transformations, creating interaction terms.
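A minimal pandas sketch of the examples above (the column names and the 2024-01-01 reference date are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dob": pd.to_datetime(["1990-05-01", "1985-11-23", "2000-02-14"]),
    "income": [30_000, 120_000, 55_000],
})

# Age from DOB (relative to a fixed reference date)
df["age"] = (pd.Timestamp("2024-01-01") - df["dob"]).dt.days // 365

# Binning a continuous value
df["income_band"] = pd.cut(df["income"], bins=[0, 50_000, 100_000, np.inf],
                           labels=["low", "mid", "high"])

# Log transform to tame skew
df["log_income"] = np.log1p(df["income"])

# Interaction term
df["age_x_log_income"] = df["age"] * df["log_income"]
print(df)
```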
17. What is missing value imputation?
Filling missing data using any of the following (sketched after this list):
• Mean/Median/Mode
• KNN Imputation
• Regression or ML models
• Forward/Backward fill (time series)
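A minimal sketch of these options using scikit-learn and pandas (the toy DataFrame is hypothetical; regression/model-based imputation is omitted for brevity):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

X = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                  "b": [10.0, 20.0, np.nan, 40.0]})

# Mean imputation ("median" / "most_frequent" work the same way)
X_mean = SimpleImputer(strategy="mean").fit_transform(X)

# KNN imputation: fill gaps from the k most similar rows
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)

# Forward fill for time series
ts = pd.Series([1.0, np.nan, np.nan, 4.0])
ts_filled = ts.ffill()
```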
18. Explain one-hot encoding vs label encoding
One-hot encoding: Converts categories into binary columns. Best for non-ordinal data.
Label encoding: Assigns an integer code to each category (e.g., Red=1, Blue=2). Suitable for ordinal data, where the numeric order is meaningful.
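A brief sketch, assuming pandas and scikit-learn (toy data; note that for ordinal features sklearn's OrdinalEncoder with an explicit category order is usually preferred over LabelEncoder, which is intended for targets):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({"color": ["red", "blue", "red"],
                   "size": ["small", "large", "medium"]})

# One-hot: one binary column per category (non-ordinal data)
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal encoding: integer codes with an explicit order (ordinal data)
enc = OrdinalEncoder(categories=[["small", "medium", "large"]])
df["size_code"] = enc.fit_transform(df[["size"]]).ravel()

print(one_hot)
print(df)
```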
19. What is multicollinearity? How to detect it?
When two or more independent variables are highly correlated, making it hard to isolate their effects.
Detection:
• Correlation matrix
• Variance Inflation Factor (a VIF above 5–10 is generally considered problematic; see the sketch below)
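A minimal VIF sketch, assuming statsmodels (synthetic data in which x2 nearly duplicates x1, so both should show inflated VIFs):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(5)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.1, size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)
X = add_constant(pd.DataFrame({"x1": x1, "x2": x2, "x3": x3}))

for i in range(1, X.shape[1]):              # skip the constant column
    print(X.columns[i], variance_inflation_factor(X.values, i))
# x1 and x2 show very large VIFs; x3 stays near 1.
```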
20. What is dimensionality reduction?
Reducing the number of input features while retaining the important information.
Benefits: Simplifies models, reduces overfitting, speeds up training.
Techniques: PCA, LDA, t-SNE (PCA sketched below).
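A minimal PCA sketch with scikit-learn (synthetic data whose variance is concentrated in three latent directions, so three components capture most of it):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(6)
# 200 samples, 10 features, but most variance lives in 3 latent directions
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + rng.normal(scale=0.05, size=(200, 10))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)        # 10 features -> 3 components
print(pca.explained_variance_ratio_)    # most variance captured by 3 PCs
print(X_reduced.shape)                  # (200, 3)
```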