Prepare smart with these frequently asked questions covering ML, stats, Python, and more!
Core Concepts
1. Q: Supervised vs Unsupervised Learning?
Supervised: Learns from labeled data to predict a target (e.g., regression, classification)
Unsupervised: Finds structure in unlabeled data (e.g., clustering, PCA)
2. Q: What is overfitting? How to prevent it?
Overfitting: The model performs well on training data but poorly on new, unseen data.
Detect it with cross-validation; reduce it with regularization (L1/L2), pruning, or more training data.
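A minimal sketch of what overfitting looks like in numbers, assuming NumPy is installed. The data, the noise level, and the polynomial degrees are made up for illustration: a straight line and a degree-7 polynomial are fit to the same noisy points, then scored on training and held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (assumption): a straight line y = 2x plus Gaussian noise.
x_train = np.linspace(0, 1, 12)
y_train = 2 * x_train + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0.04, 0.96, 12)
y_test = 2 * x_test + rng.normal(0, 0.2, x_test.size)


def mse(coeffs, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))


errors = {}
for degree in (1, 7):
    coeffs = np.polyfit(x_train, y_train, degree)
    errors[degree] = (mse(coeffs, x_train, y_train), mse(coeffs, x_test, y_test))
    print(f"degree={degree} train_mse={errors[degree][0]:.4f} test_mse={errors[degree][1]:.4f}")
```

The flexible model always matches the training points at least as closely as the line. A large gap between its training error and its held-out error is the usual sign of overfitting.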
3. Q: Bias vs Variance?
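Bias: Error from overly simple or incorrect model assumptions (leads to underfitting)
Variance: Error from sensitivity to small fluctuations in the training data (leads to overfitting)
Reducing one usually increases the other, so the goal is the balance that minimizes total error.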
Machine Learning
4. Q: What is the difference between classification and regression?
Classification: Predict categories (spam or not)
Regression: Predict continuous values (price, temperature)
5. Q: What is precision, recall, and F1 score?
Precision: TP / (TP + FP)
Recall: TP / (TP + FN)
F1: Harmonic mean of precision and recall = 2 * Precision * Recall / (Precision + Recall)
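The three formulas can be checked by hand on a small example. This is a plain-Python sketch; the function name `precision_recall_f1` and the sample labels are illustrative, not from any library.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 from paired true/predicted labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels: 4 actual positives, 3 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)  # TP=2, FP=1, FN=2 -> 2/3, 1/2, 4/7
```

In practice scikit-learn's `precision_score`, `recall_score` and `f1_score` do the same job; the hand-rolled version just makes the counts visible.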
6. Q: What’s the purpose of ROC-AUC?
Measures how well a classifier ranks positives above negatives across all decision thresholds: 0.5 is random guessing, 1.0 is perfect separation.
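One way to build intuition: AUC equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below computes it by brute force over all pairs. It is a teaching aid with made-up scores, not how libraries implement it (for real work, scikit-learn's `roc_auc_score` is the usual choice).

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Made-up model scores: one positive (0.4) is ranked below one negative (0.5).
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]

auc = roc_auc(y_true, scores)
print(round(auc, 4))  # 8 of 9 pairs are ordered correctly
```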
7. Q: What is feature engineering?
Creating new input features or transforming data to improve model performance.
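A small illustration of the idea, using only the standard library. The `orders` records and the three derived features are hypothetical; which features actually help depends on the problem.

```python
import math
from datetime import date

# Hypothetical raw records.
orders = [
    {"total": 120.0, "items": 4, "ordered_on": date(2024, 3, 2)},
    {"total": 45.0, "items": 3, "ordered_on": date(2024, 3, 6)},
]


def add_features(row):
    """Return a copy of the row with three derived features added."""
    features = dict(row)
    # Ratio feature: combines two raw columns into one signal.
    features["avg_item_price"] = row["total"] / row["items"]
    # Log transform: compresses a skewed, strictly positive value.
    features["log_total"] = math.log(row["total"])
    # Date part: weekday() returns 5 for Saturday and 6 for Sunday.
    features["ordered_on_weekend"] = row["ordered_on"].weekday() >= 5
    return features


engineered = [add_features(row) for row in orders]
for row in engineered:
    print(row)
```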
Python & Tools
8. Q: How is NumPy different from lists?
NumPy arrays store elements of a single type in contiguous memory, so they are faster and more memory-efficient than lists and support vectorized operations.
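A quick side-by-side, assuming NumPy is installed. The `prices` values are arbitrary. Note how the same `*` operator means different things for the two types.

```python
import numpy as np

prices = [10.0, 20.0, 30.0]
prices_arr = np.array(prices)

# List: element-wise math needs an explicit loop or comprehension.
with_tax_list = [p * 1.2 for p in prices]

# Array: the operation is applied to every element at once (vectorized).
with_tax_arr = prices_arr * 1.2

# Same operator, different meaning:
doubled_list = prices * 2      # repeats the list -> 6 elements
doubled_arr = prices_arr * 2   # doubles each value -> 3 elements

print(with_tax_arr, len(doubled_list), doubled_arr, prices_arr.dtype)
```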
9. Q: Difference between apply() and map() in Pandas?
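map() is a Series method that transforms values element-wise (it accepts a function, dict, or Series). apply() works on a Series element-wise or on a DataFrame along rows or columns.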
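The difference is easiest to see on a toy DataFrame. This sketch assumes pandas is installed; the column names and values are invented for the example.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "dept": ["eng", "ops", "eng"],
        "salary": [100, 80, 120],
        "bonus": [10, 5, 20],
    }
)

# Series.map with a dict: look up each value element-wise.
df["dept_name"] = df["dept"].map({"eng": "Engineering", "ops": "Operations"})

# Series.apply with a function: also element-wise.
df["salary_band"] = df["salary"].apply(lambda s: "high" if s >= 100 else "standard")

# DataFrame.apply with axis=1: the function receives one row at a time.
df["total_pay"] = df[["salary", "bonus"]].apply(
    lambda row: row["salary"] + row["bonus"], axis=1
)

# DataFrame.apply with the default axis=0: the function receives one column at a time.
col_max = df[["salary", "bonus"]].apply(lambda col: col.max())

print(df)
print(col_max)
```

For simple arithmetic like `total_pay`, plain column math (`df["salary"] + df["bonus"]`) is faster than a row-wise `apply`; the row-wise form is shown only to illustrate the API.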
10. Q: How to handle missing data?
Drop rows/columns, fill with mean/median/mode, or use model-based imputation.
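The first two options can be sketched in a few lines of pandas (assumed installed). The toy `age` and `city` columns are made up; model-based imputation is not shown here.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "age": [25.0, None, 40.0, 31.0],
        "city": ["Paris", "Lima", None, "Paris"],
    }
)

# Option 1: drop every row that has at least one missing value.
dropped = df.dropna()

# Option 2: fill gaps with a summary statistic of the column.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())      # numeric -> median
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # categorical -> mode

print(dropped)
print(filled)
```

Dropping is simplest but discards data; filling keeps every row at the cost of slightly distorting the column's distribution.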
SQL & Data Handling
11. Q: How to get top 3 salaries from an Employee table?
Returns the three highest distinct salary values (LIMIT syntax: MySQL, PostgreSQL, SQLite):
SELECT DISTINCT salary
FROM employee
ORDER BY salary DESC
LIMIT 3;
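To try the query without setting up a database, it can be run against an in-memory SQLite table from Python. The rows below are made up, with two employees sharing the top salary to show what DISTINCT changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (name TEXT, salary INTEGER)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Ana", 90000), ("Ben", 90000), ("Cy", 75000), ("Di", 60000), ("Eli", 50000)],
)

# With DISTINCT: the three highest salary *values*.
top3 = [
    row[0]
    for row in conn.execute(
        "SELECT DISTINCT salary FROM employee ORDER BY salary DESC LIMIT 3"
    )
]

# Without DISTINCT: the tied top salary takes two of the three slots.
top3_with_ties = [
    row[0]
    for row in conn.execute("SELECT salary FROM employee ORDER BY salary DESC LIMIT 3")
]
conn.close()

print(top3)            # [90000, 75000, 60000]
print(top3_with_ties)  # [90000, 90000, 75000]
```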
12. Q: What is a JOIN?
Combines rows from two or more tables using a related column.
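A short illustration of the two most commonly asked join types, again using in-memory SQLite from Python. The `employee` and `department` tables are hypothetical; one employee has no department so the difference is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE employee (name TEXT, dept_id INTEGER)")
conn.executemany("INSERT INTO department VALUES (?, ?)", [(1, "Engineering"), (2, "Sales")])
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("Ana", 1), ("Ben", 2), ("Cy", None)],  # Cy has no department
)

# INNER JOIN: only rows with a match in both tables.
inner_rows = conn.execute(
    """
    SELECT e.name, d.name
    FROM employee AS e
    INNER JOIN department AS d ON e.dept_id = d.id
    ORDER BY e.name
    """
).fetchall()

# LEFT JOIN: every employee, with NULL where no department matches.
left_rows = conn.execute(
    """
    SELECT e.name, d.name
    FROM employee AS e
    LEFT JOIN department AS d ON e.dept_id = d.id
    ORDER BY e.name
    """
).fetchall()
conn.close()

print(inner_rows)  # Cy is missing
print(left_rows)   # Cy appears with None as the department
```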
Projects & Deployment
13. Q: How to deploy a data science model?
Save the trained model (pickle/joblib), wrap it in an API with Flask or FastAPI, and host it on a platform such as Render, Heroku, or AWS.
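The save, load, and predict steps can be sketched with the standard library alone. Everything here is a stand-in: the "model" is a plain dict of made-up linear coefficients rather than a fitted estimator, and `predict_handler` is a hypothetical function that a Flask or FastAPI route would call with the request body. Hosting is not shown.

```python
import json
import pickle
import tempfile
from pathlib import Path

# Stand-in for a trained model (assumption): linear weights and a bias.
model = {"weights": [0.5, -1.0], "bias": 2.0}

# Step 1: serialize the model to disk at training time.
model_path = Path(tempfile.mkdtemp()) / "model.pkl"
with open(model_path, "wb") as f:
    pickle.dump(model, f)


def load_model(path):
    """Load the model once at service startup. Only unpickle files you trust."""
    with open(path, "rb") as f:
        return pickle.load(f)


def predict_handler(request_body, loaded_model):
    """Hypothetical request handler: JSON in, JSON out."""
    features = json.loads(request_body)["features"]
    score = loaded_model["bias"] + sum(
        w * x for w, x in zip(loaded_model["weights"], features)
    )
    return json.dumps({"prediction": score})


# Step 2: at serving time, load once and reuse for every request.
served_model = load_model(model_path)
response = predict_handler('{"features": [4.0, 1.0]}', served_model)
print(response)  # 2.0 + 0.5*4.0 - 1.0*1.0 = 3.0
```

Loading the model once at startup, rather than inside the handler, avoids re-reading the file on every request.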
14. Q: How to explain a model to non-tech stakeholders?
Use visuals and simple analogies, and focus on business impact rather than technical metrics.
Bonus: Key Libraries to Know
NumPy, Pandas, Matplotlib, Scikit-learn, Seaborn, TensorFlow/PyTorch (for DL)