Here's a list of 50 statistics-related interview questions and answers tailored for Data Scientist roles.
These cover fundamentals, probability, distributions, sampling, hypothesis testing, and practical applications.
Q: What is the difference between population and sample?
A: A population includes all elements from a set of data, while a sample is a subset of the population used to make inferences.
Q: Define descriptive vs inferential statistics.
A: Descriptive statistics summarize the data at hand (mean, median); inferential statistics draw conclusions about a population from a sample (hypothesis testing, confidence intervals).
Q: What are the types of data?
A: Nominal, ordinal, interval, and ratio.
Q: What are measures of central tendency?
A: Mean, median, and mode.
Q: What are measures of dispersion?
A: Range, variance, standard deviation, IQR.
Q: Define probability.
A: Probability is the measure of the likelihood that an event will occur.
Q: What is conditional probability?
A: Probability of event A given B has occurred: P(A|B) = P(A ∩ B) / P(B)
Q: What is Bayes' Theorem?
A: It updates the probability of a hypothesis given new evidence, using prior knowledge: P(A|B) = P(B|A) · P(A) / P(B)
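As a worked illustration of Bayes' Theorem, the sketch below computes the probability of having a condition given a positive test result. The function name and all numbers (prevalence, sensitivity, false positive rate) are hypothetical and chosen only for the example.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(condition | positive test) using Bayes' Theorem."""
    # P(positive) via the law of total probability.
    p_positive = sensitivity * prior + false_positive_rate * (1 - prior)
    # P(condition | positive) = P(positive | condition) * P(condition) / P(positive)
    return sensitivity * prior / p_positive


# Hypothetical inputs: 1% prevalence, 95% sensitivity, 5% false positive rate.
p = posterior(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {p:.3f}")
```

Even with an accurate test, the posterior stays low when the prior is small, which is the usual point interviewers look for.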
Q: Name some probability distributions.
A: Normal, Binomial, Poisson, Bernoulli, Exponential, Uniform.
Q: What is the Central Limit Theorem?
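A: It states that the sampling distribution of the sample mean approaches a normal distribution as the sample size increases, regardless of the population's distribution, provided the observations are independent and have finite variance.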
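The Central Limit Theorem can be checked with a small simulation. This is a rough sketch, assuming a skewed Exponential(1) population (true mean 1); the helper name `sample_means` and the trial counts are illustrative choices, not part of any standard recipe.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable


def sample_means(n, trials=2000):
    """Means of `trials` samples, each of size n, from Exponential(1)."""
    return [
        statistics.fmean(random.expovariate(1.0) for _ in range(n))
        for _ in range(trials)
    ]


for n in (2, 30, 200):
    means = sample_means(n)
    # The average of the sample means stays near 1, while their spread
    # shrinks roughly like 1 / sqrt(n).
    print(n, round(statistics.fmean(means), 3), round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` for each n would show the shape moving from skewed toward bell-shaped.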
Q: What is a null and alternative hypothesis?
A: H₀: no effect or status quo; H₁: effect or difference exists.
Q: What is a p-value?
A: The probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.
Q: When do you reject the null hypothesis?
A: When p-value < significance level (e.g., 0.05).
Q: What are Type I and Type II errors?
A: Type I: Rejecting true H₀ (false positive), Type II: Failing to reject false H₀ (false negative).
Q: What is a confidence interval?
A: A range computed from sample data that, under repeated sampling, would contain the true population parameter a specified proportion of the time (e.g., 95%).
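A minimal sketch of a confidence interval for a mean, assuming the normal approximation (a z critical value). For small samples a t critical value would be more appropriate; that is omitted here to stay within the standard library. The function name and the data are hypothetical.

```python
import math
import statistics


def mean_confidence_interval(data, confidence=0.95):
    """Approximate confidence interval for the mean (normal approximation)."""
    n = len(data)
    mean = statistics.fmean(data)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(data) / math.sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * sem, mean + z * sem


# Hypothetical measurements.
low, high = mean_confidence_interval([4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0])
print(f"95% CI: ({low:.3f}, {high:.3f})")
```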
Q: What is sampling?
A: Selecting a subset from a population to estimate characteristics.
Q: Name different sampling methods.
A: Random, stratified, cluster, systematic, convenience.
Q: What is sampling bias?
A: It occurs when the sample isn't representative of the population.
Q: What is the law of large numbers?
A: As the sample size increases, the sample mean converges to the population mean.
Q: What is oversampling and undersampling?
A: Techniques to balance classes in imbalanced datasets.
Q: Difference between correlation and causation?
A: Correlation: association, Causation: one causes the other.
Q: What is multicollinearity?
A: When independent variables in regression are highly correlated.
Q: What is R² in regression?
A: Proportion of variance in the dependent variable explained by the model.
Q: What are residuals?
A: Differences between observed and predicted values.
Q: What is adjusted R²?
A: R² adjusted for the number of predictors in the model.
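The adjustment can be written as 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors. The helper below is a small sketch of that formula; the function name and the input values are made up for illustration.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R-squared: penalizes R-squared for each added predictor."""
    if n_obs - n_predictors - 1 <= 0:
        raise ValueError("need more observations than predictors + 1")
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Same R-squared, more predictors: the adjusted value drops.
print(adjusted_r2(0.80, n_obs=50, n_predictors=5))
print(adjusted_r2(0.80, n_obs=50, n_predictors=20))
```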
Q: What is ANOVA?
A: Analysis of variance: tests whether the means of three or more groups differ.
Q: When to use t-test vs z-test?
A: Use a t-test when the population variance is unknown (especially with small samples); use a z-test when the population variance is known and the sample is large.
Q: What is a Chi-square test?
A: Tests the association between categorical variables.
Q: What is logistic regression used for?
A: Predicting binary outcomes (0 or 1).
Q: What is heteroscedasticity?
A: When residuals have unequal variance.
Q: How do you handle missing data?
A: Imputation (mean, median, mode), removal, or prediction.
Q: How do you check if data is normally distributed?
A: Histograms, Q-Q plots, Shapiro-Wilk test.
Q: Why normalize or standardize data?
A: To put variables on a comparable scale so that features with large magnitudes do not dominate distance-based or gradient-based models.
Q: How to detect outliers?
A: Z-score, IQR method, boxplot.
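The IQR method can be sketched as follows: flag any point more than 1.5 × IQR below Q1 or above Q3. The multiplier 1.5 is the common convention, and the function name and data are hypothetical. Note that quartile definitions vary between libraries; this uses the default method of Python's `statistics.quantiles`.

```python
import statistics


def iqr_outliers(data, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


# Hypothetical data with one obviously extreme value.
values = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(values))
```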
Q: What is the purpose of A/B testing?
A: Comparing two versions (A vs B) to determine statistically significant improvement.
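One common way to analyze an A/B test on conversion rates is a two-proportion z-test. The sketch below assumes large samples and independent users; the function name and the conversion counts are invented for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference in conversion rates (B minus A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that A and B convert equally.
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical experiment: 200/2000 conversions for A, 250/2000 for B.
z, p_value = two_proportion_z_test(200, 2000, 250, 2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

If the p-value falls below the chosen significance level, the difference between versions is treated as statistically significant.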
Q: You get a p-value of 0.07. What do you conclude?
A: Fail to reject the null at 5% significance level.
Q: What does a high variance in data indicate?
A: More spread out data; potential model instability.
Q: When is a non-parametric test used?
A: When data doesn’t meet assumptions of normality.
Q: What is bootstrapping?
A: A resampling technique that repeatedly draws samples with replacement from the observed data to estimate the sampling distribution of a statistic.
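A minimal sketch of a percentile bootstrap interval, here for the median. The function name, the number of resamples, and the data are assumptions made for the example; other bootstrap variants (such as BCa) exist and may behave better on small samples.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.median, n_resamples=5000,
                 confidence=0.95, seed=0):
    """Percentile bootstrap confidence interval for `stat`."""
    rng = random.Random(seed)
    # Resample with replacement, same size as the original data each time.
    estimates = sorted(
        stat(rng.choices(data, k=len(data))) for _ in range(n_resamples)
    )
    alpha = 1 - confidence
    lower = estimates[round(alpha / 2 * n_resamples)]
    upper = estimates[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical sample.
sample = [12, 15, 11, 19, 14, 13, 18, 16, 12, 17]
print(bootstrap_ci(sample))
```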
Q: Explain the difference between parametric and non-parametric tests.
A: Parametric tests assume the data follow a specific distribution (e.g., normal); non-parametric tests do not.
Q: What is skewness?
A: Measure of data asymmetry.
Q: What is kurtosis?
A: Measure of how heavy a distribution's tails are relative to a normal distribution; often loosely described as peak sharpness.
Q: What is a time series?
A: Data collected at successive, equally spaced points in time.
Q: What is autocorrelation?
A: Correlation of a variable with itself over successive time intervals.
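A small sketch of the sample autocorrelation at a given lag, using the usual estimator (lagged covariance divided by the overall variance). The function name and the example series are hypothetical.

```python
def autocorrelation(series, lag=1):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    # Total sum of squared deviations (denominator).
    denom = sum((x - mean) ** 2 for x in series)
    # Sum of products of deviations `lag` steps apart (numerator).
    num = sum(
        (series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag)
    )
    return num / denom


trend = list(range(1, 11))          # steadily increasing series
alternating = [1, -1] * 5           # flips sign every step
print(autocorrelation(trend), autocorrelation(alternating))
```

A trending series gives a positive lag-1 value, while a series that flips at every step gives a negative one.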
Q: What is cross-validation?
A: Technique to assess model performance on unseen data.
Q: If a model has high accuracy but low precision, what does it mean?
A: Many of its positive predictions are false positives; the high accuracy usually comes from correctly classifying a dominant negative class.
Q: How do you handle class imbalance?
A: Resampling, SMOTE, change evaluation metric (e.g., F1-score).
Q: What is the difference between recall and precision?
A: Precision = TP / (TP + FP); Recall = TP / (TP + FN)
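These formulas can be computed directly from confusion-matrix counts. The sketch below uses invented counts for an imbalanced dataset to show how accuracy can be high while precision is low; the function name is hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Hypothetical imbalanced case: 15 actual positives out of 1000 examples.
accuracy, precision, recall, f1 = classification_metrics(tp=10, fp=40, fn=5, tn=945)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```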
Q: What metric is best for imbalanced classes?
A: F1-score, AUC-ROC.
Q: How do you ensure statistical significance in results?
A: Use appropriate tests, ensure an adequate sample size (e.g., via power analysis), and check p-values and confidence intervals.