Top Data Science Interview Questions with Answers: Part-3

21. Difference between PCA and LDA

•⁠ ⁠PCA (Principal Component Analysis):

Unsupervised technique that reduces dimensionality by projecting the data onto the orthogonal directions of maximum variance. It doesn’t consider class labels.

•⁠ ⁠LDA (Linear Discriminant Analysis):

Supervised technique that reduces dimensionality by finding projections that maximize class separability (between-class variance relative to within-class variance), so it requires labeled data.
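To make the contrast concrete, here is a minimal pure-Python sketch for 2-D data with two classes. The toy points and the helper names (pca_direction, lda_direction) are illustrative assumptions, not a library API. PCA is computed from the closed-form angle of the principal axis of a 2x2 covariance matrix, and LDA uses Fisher's two-class direction, which is proportional to Sw^-1 (m1 − m0).

```python
import math

def mean(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points):
    """Return (sxx, sxy, syy): sums of squared/cross deviations from the mean."""
    mx, my = mean(points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxx, sxy, syy

def pca_direction(points):
    """First principal axis of 2-D data; class labels are never used."""
    sxx, sxy, syy = scatter(points)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))

def lda_direction(class0, class1):
    """Fisher's two-class discriminant direction: Sw^-1 (m1 - m0), normalized."""
    a0, b0, c0 = scatter(class0)
    a1, b1, c1 = scatter(class1)
    a, b, c = a0 + a1, b0 + b1, c0 + c1      # within-class scatter Sw = [[a, b], [b, c]]
    m0, m1 = mean(class0), mean(class1)
    dx, dy = m1[0] - m0[0], m1[1] - m0[1]
    det = a * c - b * b                      # assumes Sw is invertible
    wx = (c * dx - b * dy) / det
    wy = (-b * dx + a * dy) / det
    norm = math.hypot(wx, wy)
    return (wx / norm, wy / norm)

# Hypothetical toy data: wide spread along x, classes separated along y.
class0 = [(-4, 0.1), (-2, -0.1), (0, 0.0), (2, 0.1), (4, -0.1)]
class1 = [(-4, 1.1), (-2, 0.9), (0, 1.0), (2, 1.1), (4, 0.9)]

pca_dir = pca_direction(class0 + class1)   # close to the x-axis (largest variance)
lda_dir = lda_direction(class0, class1)    # close to the y-axis (best class separation)
```

On this toy data the two methods pick nearly perpendicular directions, which is the point of the comparison: the direction with the most variance is not necessarily the one that separates the classes.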

22. What is Logistic Regression?

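A classification algorithm that predicts the probability of a binary outcome (0 or 1).

It applies the sigmoid function to a linear combination of the input features, which maps the result to a probability between 0 and 1; a threshold (often 0.5) then turns that probability into a class. Commonly used in spam detection, churn prediction, etc.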
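As a rough illustration of the sigmoid step, the sketch below scores one example with hand-picked weights. The weights, bias, and function names are hypothetical; in practice the coefficients would be learned from training data.

```python
import math

def sigmoid(z):
    """Map any real number to (0, 1); written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(features, weights, bias):
    """Probability of class 1 for a single example."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

def predict_class(features, weights, bias, threshold=0.5):
    """Turn the probability into a 0/1 label using a decision threshold."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical coefficients, for illustration only.
weights, bias = [1.5, -2.0], 0.25
prob = predict_proba([2.0, 0.5], weights, bias)   # z = 0.25 + 3.0 - 1.0 = 2.25
label = predict_class([2.0, 0.5], weights, bias)
```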

23. What is Linear Regression?

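A supervised learning method that models the relationship between a dependent variable and one or more independent variables with a linear equation. With a single predictor this is a straight line, Y = a + bX + e, where a is the intercept, b the slope, and e the error term. It’s widely used for forecasting and trend analysis.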
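For the single-predictor case, the intercept and slope can be estimated with the closed-form ordinary least squares formulas. The sketch below assumes one predictor and a small noise-free toy dataset; the function name is illustrative.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for Y = a + bX; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope; assumes the x values are not all identical
    a = mean_y - b * mean_x    # intercept
    return a, b

# Toy data generated from Y = 1 + 2X with no noise.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
a, b = fit_simple_linear_regression(xs, ys)   # a = 1.0, b = 2.0
prediction = a + b * 5                        # forecast for X = 5
```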

24. What are the assumptions of Linear Regression?

•⁠ ⁠Linearity between independent and dependent variables

•⁠ ⁠No multicollinearity among predictors

•⁠ ⁠Homoscedasticity (equal variance of residuals)

•⁠ ⁠Residuals are normally distributed

•⁠ ⁠No autocorrelation in residuals

25. What is R-squared and Adjusted R-squared?

•⁠ ⁠R-squared: Proportion of variance in the dependent variable explained by the model

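•⁠ ⁠Adjusted R-squared: Penalizes R-squared for the number of predictors, so it increases only when a new predictor improves the fit more than would be expected by chance; useful for comparing models with different numbers of variables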
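Both quantities follow from standard formulas: R² = 1 − SS_res / SS_tot and Adjusted R² = 1 − (1 − R²)(n − 1) / (n − p − 1), where n is the number of observations and p the number of predictors. A minimal sketch with made-up numbers (the data and function names are illustrative):

```python
def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjust R-squared for p predictors fitted on n observations."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Made-up observations and model predictions.
actual = [3, 5, 7, 9, 11]
predicted = [2.8, 5.1, 7.2, 8.9, 11.0]

r2 = r_squared(actual, predicted)
adj_one = adjusted_r_squared(r2, n=len(actual), p=1)
adj_two = adjusted_r_squared(r2, n=len(actual), p=2)   # same fit, more predictors: lower value
```

For the same R², the adjusted value drops as p grows, which is how it discourages adding predictors that do not earn their place.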

26. What are Residuals?

The difference between the observed value and the predicted value.

Residual = Actual − Predicted. Small residuals indicate a good fit. Ideally they scatter randomly around zero; a visible pattern suggests the model is missing some structure in the data.

27. What is Regularization (L1 vs L2)?

Regularization prevents overfitting by penalizing large coefficients:

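•⁠ ⁠L1 (Lasso): Adds the sum of the absolute values of the coefficients to the loss; can shrink some coefficients exactly to zero, effectively removing irrelevant features

•⁠ ⁠L2 (Ridge): Adds the sum of the squared coefficients to the loss; shrinks them toward zero but rarely exactly to zero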
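The difference between the two penalties is easiest to see on a single coefficient. The sketch below assumes a simplified one-dimensional problem: minimizing 0.5 (w − b)² + λ|w| for L1 and 0.5 (w − b)² + 0.5 λ w² for L2, where b is the unpenalized estimate. The function names and the example coefficients are illustrative.

```python
def l1_penalty(weights, lam):
    """Lasso penalty term: lam * sum of absolute coefficient values."""
    return lam * sum(abs(w) for w in weights)

def l2_penalty(weights, lam):
    """Ridge penalty term: lam * sum of squared coefficient values."""
    return lam * sum(w * w for w in weights)

def lasso_shrink(b, lam):
    """Minimizer of 0.5*(w - b)**2 + lam*|w| (soft-thresholding)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0               # small coefficients are set exactly to zero

def ridge_shrink(b, lam):
    """Minimizer of 0.5*(w - b)**2 + 0.5*lam*w**2 (proportional shrinkage)."""
    return b / (1.0 + lam)   # moves toward zero but stays nonzero unless b is 0

# Hypothetical unpenalized coefficients.
coefs = [2.0, 0.3, -1.2]
lasso = [lasso_shrink(b, 0.5) for b in coefs]   # the 0.3 coefficient becomes exactly 0.0
ridge = [ridge_shrink(b, 0.5) for b in coefs]   # every coefficient shrinks, none hits 0
```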

28. What is k-Nearest Neighbors (KNN)?

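A lazy (no explicit training phase), non-parametric algorithm used for classification and regression. For classification it assigns the majority label among the k closest training points; for regression it averages their values. Closeness is measured with a distance metric such as Euclidean distance.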
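A minimal classification sketch using only the standard library is shown below. The training points, labels, and function name are made up for illustration, and ties are resolved by whichever label is encountered first among the neighbors.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs. Returns the majority label
    among the k training points closest to query (Euclidean distance)."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical labeled points forming two groups.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((8, 8), "B"), ((9, 8), "B"), ((8, 9), "B"),
]
near_a = knn_predict(train, (2, 2), k=3)   # all three nearest neighbors are "A"
near_b = knn_predict(train, (7, 8), k=3)   # all three nearest neighbors are "B"
```

Because there is no training step, every prediction scans the stored data, which is why KNN becomes slow on large datasets.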

29. What is k-Means Clustering?

An unsupervised algorithm that groups data into k clusters, where k is chosen in advance. It assigns each point to the nearest centroid, then recalculates each centroid as the mean of its assigned points, repeating until the centroids stop changing.
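The assign-and-update loop can be sketched in a few lines of plain Python. This version takes the initial centroids as an argument so the result is deterministic; the function name, the toy points, and the choice to leave an empty cluster's centroid unchanged are assumptions made for the example.

```python
import math

def kmeans(points, initial_centroids, max_iter=100):
    """Basic k-means (Lloyd's algorithm). Returns (centroids, labels)."""
    centroids = [tuple(c) for c in initial_centroids]

    def nearest(p):
        return min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))

    for _ in range(max_iter):
        # Assignment step: attach each point to its closest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p)].append(p)

        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                dims = range(len(old))
                new_centroids.append(
                    tuple(sum(m[d] for m in members) / len(members) for d in dims)
                )
            else:
                new_centroids.append(old)   # empty cluster: keep the old centroid

        if new_centroids == centroids:      # converged: nothing moved
            break
        centroids = new_centroids

    labels = [nearest(p) for p in points]
    return centroids, labels

# Hypothetical 2-D points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
# labels == [0, 0, 0, 1, 1, 1]
```

In practice the result depends on the starting centroids, so implementations usually run several initializations and keep the best clustering.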

30. Difference between Classification and Regression?

•⁠ ⁠Classification: Predicts discrete categories (e.g., Yes/No, Cat/Dog)

•⁠ ⁠Regression: Predicts continuous values (e.g., temperature, price)
