1. What is data science?
Data science is an interdisciplinary field that uses statistics, computer science, and domain knowledge to extract insights and knowledge from data (structured and unstructured). It involves data collection, cleaning, analysis, visualization, and model building.
2. Difference between data science, data analytics, and machine learning
• Data Science: Broad field involving analysis, prediction, and decision-making using data.
• Data Analytics: Focused on examining past data to find insights and trends.
• Machine Learning: A branch of artificial intelligence, used heavily within data science, in which algorithms learn patterns from data to make predictions.
3. What is the data science lifecycle?
• Problem Definition
• Data Collection
• Data Cleaning
• Exploratory Data Analysis (EDA)
• Feature Engineering
• Model Building
• Model Evaluation
• Deployment
• Monitoring
4. Explain structured vs unstructured data
• Structured: Organized in rows and columns (e.g., SQL tables)
• Unstructured: No predefined format (e.g., text, images, videos)
5. What is data wrangling or data munging?
It is the process of cleaning, transforming, and preparing raw data into a usable format for analysis or modeling.
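A minimal sketch of what wrangling can look like in practice, using only the Python standard library. The records, field names, and the `wrangle` helper are hypothetical and exist only for illustration; real projects often use a library such as pandas for the same steps.

```python
# Hypothetical raw records, as they might arrive from a CSV export.
raw_rows = [
    {"name": " Alice ", "age": "34", "city": "paris"},
    {"name": "Bob", "age": "", "city": "London"},        # missing age
    {"name": " Alice ", "age": "34", "city": "paris"},   # exact duplicate
    {"name": "carol", "age": "29", "city": " Berlin"},
]


def wrangle(rows):
    """Trim whitespace, normalize case, convert types, drop duplicates."""
    cleaned, seen = [], set()
    for row in rows:
        record = {
            "name": row["name"].strip().title(),
            # Empty strings become None instead of raising on int("").
            "age": int(row["age"]) if row["age"].strip() else None,
            "city": row["city"].strip().title(),
        }
        key = tuple(record.items())
        if key not in seen:  # keep only the first copy of each record
            seen.add(key)
            cleaned.append(record)
    return cleaned


clean_rows = wrangle(raw_rows)
for record in clean_rows:
    print(record)
```

The cleaned list keeps one record per person, with ages as integers (or None when missing) and consistently capitalized text fields.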
6. What is the role of statistics in data science?
Statistics helps in understanding data distributions, making inferences, identifying relationships, and building predictive models. It is foundational to hypothesis testing and model evaluation.
7. Difference between population and sample
• Population: Entire group you want to study
• Sample: Subset of the population used for analysis
Sampling helps in making generalizations without studying the whole population.
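The idea can be sketched with simulated data: a statistic computed on a modest sample usually lands close to the same statistic for the full population. The population below is synthetic (simulated heights), and the seed, sizes, and variable names are assumptions chosen for illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 simulated heights in cm.
population = [rng.gauss(170, 10) for _ in range(10_000)]

# Sample: 200 members drawn at random, without replacement.
sample = rng.sample(population, 200)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means will typically differ by well under one centimetre here, even though the sample covers only 2% of the population.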
8. What is sampling? Types of sampling?
Sampling is selecting a portion of data from a larger set.
Types:
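• Random Sampling: every member has an equal chance of being selected
• Stratified Sampling: the population is split into subgroups (strata) and each subgroup is sampled separately
• Systematic Sampling: every k-th member is selected after a random start
• Cluster Sampling: the population is split into clusters and whole clusters are selected at random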
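The four types listed above can be sketched in a few lines each. This is an illustrative sketch, not a reference implementation: the population, the `group` field, and all function names are hypothetical.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical population: 100 records, 70 in group A and 30 in group B.
population = [{"id": i, "group": "A" if i < 70 else "B"} for i in range(100)]


def simple_random_sample(items, n):
    """Every item has the same chance of being picked."""
    return rng.sample(items, n)


def stratified_sample(items, fraction, key):
    """Sample the same fraction from each subgroup (stratum)."""
    strata = {}
    for item in items:
        strata.setdefault(item[key], []).append(item)
    picked = []
    for members in strata.values():
        picked.extend(rng.sample(members, round(len(members) * fraction)))
    return picked


def systematic_sample(items, step):
    """Pick every step-th item after a random starting offset."""
    start = rng.randrange(step)
    return items[start::step]


def cluster_sample(items, cluster_size, n_clusters):
    """Split items into consecutive clusters, then keep whole clusters."""
    clusters = [items[i:i + cluster_size]
                for i in range(0, len(items), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [item for cluster in chosen for item in cluster]


simple = simple_random_sample(population, 10)
stratified = stratified_sample(population, 0.1, "group")
systematic = systematic_sample(population, 10)
clustered = cluster_sample(population, 10, 2)

print(len(simple), len(stratified), len(systematic), len(clustered))
```

Note that the stratified sample preserves the 70/30 split between groups, which a simple random sample of the same size does not guarantee.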
9. What is hypothesis testing?
A statistical method for testing a claim (hypothesis) about a population parameter using sample data. A null hypothesis is weighed against an alternative to decide whether an observed result is statistically significant or plausibly due to chance.
10. What is p-value?
The p-value indicates the probability of observing results at least as extreme as the ones in your sample, assuming the null hypothesis is true.
• p < 0.05 → Reject the null hypothesis (significant at the conventional 5% level)
• p ≥ 0.05 → Fail to reject the null hypothesis (not significant at that level)
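As a rough illustration of the decision rule above, the sketch below computes an exact two-sided p-value for a coin-flip experiment, where the null hypothesis is that the coin is fair. The scenario, the function name `binomial_two_sided_p`, and the choice of an exact binomial test are assumptions made for this example; other tests apply to other kinds of data.

```python
from math import comb


def binomial_two_sided_p(heads, flips):
    """Exact two-sided p-value under H0: the coin is fair (p = 0.5)."""
    # Under H0 the distribution is symmetric around flips / 2, so
    # "at least as extreme" means at least as far from flips / 2
    # as the observed number of heads.
    distance = abs(heads - flips / 2)
    extreme = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= distance
    )
    return extreme / 2 ** flips


alpha = 0.05  # conventional threshold, fixed before looking at the data
for heads in (60, 61):
    p = binomial_two_sided_p(heads, 100)
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"{heads} heads in 100 flips: p = {p:.4f} -> {decision}")
```

With 100 flips, 60 heads gives a p-value of roughly 0.057 and 61 heads roughly 0.035, so one extra head moves the result across the 0.05 line. This is a reminder that the threshold is a convention, not a sharp boundary between real and unreal effects.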