Top Data Science Interview Questions with Answers: Part-2

11. Explain Type I and Type II errors  

Type I Error (False Positive): Rejecting a true null hypothesis. Example: Saying a drug works when it doesn’t.

Type II Error (False Negative): Failing to reject a false null hypothesis. Example: Saying a drug doesn’t work when it actually does.
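
A quick way to see both errors is a simulation sketch: run many one-sample t-tests at α = 0.05, first with a true null and then with a false one (the effect size of 0.3 below is a made-up illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_trials, n = 0.05, 10_000, 30

# Type I error: the null is true (true mean really is 0), yet we sometimes reject it.
false_positives = sum(
    stats.ttest_1samp(rng.normal(0, 1, n), 0).pvalue < alpha
    for _ in range(n_trials)
)
print(f"Type I error rate: {false_positives / n_trials:.3f}")   # ~0.05, i.e. alpha

# Type II error: the null is false (true mean is 0.3), yet we fail to reject it.
false_negatives = sum(
    stats.ttest_1samp(rng.normal(0.3, 1, n), 0).pvalue >= alpha
    for _ in range(n_trials)
)
print(f"Type II error rate: {false_negatives / n_trials:.3f}")
```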

12. What are descriptive vs inferential statistics?  

Descriptive: Summarizes data using charts, graphs, and metrics such as the mean and median.  

Inferential: Makes predictions or inferences about a population using a sample (e.g., confidence intervals, hypothesis testing).
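
A short sketch of the contrast, using a made-up sample of exam scores — descriptive summaries of the sample itself vs. an inferential confidence interval for the population mean:

```python
import numpy as np
from scipy import stats

sample = np.array([72, 85, 90, 66, 78, 88, 95, 70, 82, 76])

# Descriptive: summarize the data we actually have.
print(f"mean={sample.mean():.1f}, median={np.median(sample):.1f}, "
      f"std={sample.std(ddof=1):.1f}")

# Inferential: estimate the population mean with a 95% confidence interval.
ci = stats.t.interval(0.95, df=len(sample) - 1,
                      loc=sample.mean(), scale=stats.sem(sample))
print(f"95% CI for the population mean: ({ci[0]:.1f}, {ci[1]:.1f})")
```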

13. What is correlation vs causation?  

Correlation: Two variables move together, but one doesn’t necessarily cause the other.  

Causation: One variable directly affects the other.  

Important: Correlation ≠ Causation.
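
A simulated illustration: two variables driven by a hidden common cause (a classic confounder) correlate strongly even though neither causes the other. All numbers below are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
temperature = rng.normal(25, 5, 500)                        # hidden common cause
ice_cream_sales = 10 * temperature + rng.normal(0, 20, 500)
sunburn_cases = 3 * temperature + rng.normal(0, 10, 500)

r = np.corrcoef(ice_cream_sales, sunburn_cases)[0, 1]
print(f"correlation: {r:.2f}")  # high, yet ice cream does not cause sunburn
```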

14. What is a normal distribution?  

A bell-shaped curve where data is symmetrically distributed around the mean.  

Mean = Median = Mode  

68% of the data falls within 1 SD of the mean, 95% within 2 SD, and 99.7% within 3 SD.
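
These percentages can be checked directly from the standard normal CDF, for example with scipy:

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean.
for k in (1, 2, 3):
    coverage = norm.cdf(k) - norm.cdf(-k)
    print(f"P(|X - mean| <= {k} SD) = {coverage:.4f}")
# -> 0.6827, 0.9545, 0.9973
```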

15. What is the central limit theorem (CLT)?  

As sample size increases, the sampling distribution of the sample mean approaches a normal distribution — even if the population isn’t normal.

Used in: Confidence intervals, hypothesis testing.
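
A quick simulation sketch: sample means drawn from a heavily skewed exponential population lose their skew as the sample size grows, just as the CLT predicts:

```python
import numpy as np

rng = np.random.default_rng(1)
for n in (2, 30, 200):
    # 10,000 sample means, each from n draws of a skewed exponential population.
    means = rng.exponential(scale=1.0, size=(10_000, n)).mean(axis=1)
    # Skewness shrinks toward 0, the value for a normal distribution.
    skew = ((means - means.mean()) ** 3).mean() / means.std() ** 3
    print(f"n={n:>3}: skewness of sample means = {skew:.2f}")
```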

16. What is feature engineering?  

Creating or transforming features to improve model performance.  

Examples: Creating age from DOB, binning values, log transformations, creating interaction terms.
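
A pandas sketch of these examples (all column names and values below are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "dob": pd.to_datetime(["1990-05-01", "1985-11-23", "2000-02-14"]),
    "income": [32_000, 150_000, 58_000],
    "rooms": [3, 8, 4],
    "area": [70, 220, 95],
})

df["age"] = (pd.Timestamp("2026-01-01") - df["dob"]).dt.days // 365  # age from DOB
df["income_band"] = pd.cut(df["income"],                             # binning
                           bins=[0, 50_000, 100_000, np.inf],
                           labels=["low", "mid", "high"])
df["log_income"] = np.log1p(df["income"])                            # log transform
df["rooms_x_area"] = df["rooms"] * df["area"]                        # interaction term
print(df)
```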

17. What is missing value imputation?  

Filling missing data using (a short sketch follows the list):  

• Mean/Median/Mode  

• KNN Imputation  

• Regression or ML models  

• Forward/Backward fill (time series)
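
A minimal sketch of these options with scikit-learn and pandas on toy data (regression/ML-based imputation follows the same fit/transform pattern, e.g. scikit-learn's IterativeImputer, so it is noted only in a comment):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

X = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0],
                  "b": [10.0, 12.0, np.nan, 16.0]})

X_mean = SimpleImputer(strategy="mean").fit_transform(X)   # mean imputation
X_knn = KNNImputer(n_neighbors=2).fit_transform(X)         # KNN imputation
X_ffill = X.ffill()                                        # forward fill (time series)
# Regression/ML-based: sklearn.impute.IterativeImputer models each
# incomplete column from the others, with the same fit_transform call.
print(X_mean, X_knn, X_ffill, sep="\n\n")
```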

18. Explain one-hot encoding vs label encoding  

One-hot encoding: Converts categories into binary columns. Best for non-ordinal data.  

Label encoding: Assigns numerical labels (e.g., Red=1, Blue=2). Suitable for ordinal data.
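
A minimal sketch of both on a made-up color column (note that scikit-learn's LabelEncoder is designed for target labels; OrdinalEncoder is its feature-side equivalent):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"color": ["Red", "Blue", "Green", "Blue"]})

# One-hot encoding: one binary column per category (no order implied).
print(pd.get_dummies(df, columns=["color"]))

# Label encoding: one integer per category (an order is implied).
df["color_label"] = LabelEncoder().fit_transform(df["color"])
print(df)
```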

19. What is multicollinearity? How to detect it?  

When two or more independent variables are highly correlated, making it hard to isolate their effects.  

Detection:  

• Correlation matrix  

• Variance Inflation Factor (VIF > 5 or 10 = problematic)
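
Both checks in a short sketch, using statsmodels' variance_inflation_factor on simulated features where x2 is nearly collinear with x1:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(7)
x1 = rng.normal(size=200)
X = pd.DataFrame({
    "x1": x1,
    "x2": 0.95 * x1 + rng.normal(scale=0.1, size=200),  # nearly collinear with x1
    "x3": rng.normal(size=200),
})

print(X.corr().round(2))        # correlation matrix check

Xc = sm.add_constant(X)         # VIF expects an intercept column
for i, col in enumerate(Xc.columns):
    if col != "const":
        print(f"VIF({col}) = {variance_inflation_factor(Xc.values, i):.1f}")
```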

20. What is dimensionality reduction?  

Reducing the number of input features while retaining important info.  

Benefits: Simplifies models, reduces overfitting, and speeds up training.  

Techniques: PCA, LDA, t-SNE
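
A minimal PCA sketch with scikit-learn, reducing the four iris features to two components and reporting how much variance each component retains:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_iris().data)   # 4 original features
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)                       # down to 2 components

print(X_reduced.shape)                # (150, 2)
print(pca.explained_variance_ratio_) # share of variance retained per component
```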
