What is Statistics?

Statistics is the science of collecting, summarizing, analyzing, and interpreting data to make decisions or understand patterns. It helps turn raw data into useful information.

Why learn statistics?

  • To describe and summarize data clearly.
  • To make informed conclusions from samples about larger populations.
  • To evaluate uncertainty and quantify how confident we are in conclusions.
  • To model relationships between variables (prediction and explanation).

Two main branches

  1. Descriptive statistics — summarize and visualize data (means, medians, charts).
  2. Inferential statistics — draw conclusions about a population from a sample (confidence intervals, hypothesis tests, regression).

Types of data

  • Quantitative (numerical)
    • Continuous: can take any value within a range, limited only by measurement precision (height, time).
    • Discrete: takes separate, countable values, typically integer counts (number of children).
  • Qualitative (categorical)
    • Nominal: categories without order (colors).
    • Ordinal: categories with a natural order (ratings: poor, fair, good).

Basic steps in a statistical study (step-by-step)

  1. Define the question — What do you want to know? State your goals clearly.
  2. Collect data — Choose sampling method, design surveys/experiments.
  3. Summarize data — Use tables, graphs, and summary statistics to explore patterns.
  4. Model and analyze — Choose appropriate statistical methods (tests, regression).
  5. Interpret results — Draw conclusions, consider limitations and assumptions.
  6. Communicate — Present findings clearly with visualizations and plain language.

Key descriptive statistics

  • Mean (average): mean = (x1 + x2 + ... + xn) / n.
  • Median: middle value when data are ordered (or average of two middle values if n is even).
  • Mode: most frequent value.
  • Variance and standard deviation:
    • Population variance: sigma^2 = (1/N) sum (xi - mu)^2.
    • Sample variance: s^2 = (1/(n-1)) sum (xi - xbar)^2 (uses n-1 for an unbiased estimate).
    • Standard deviation = sqrt(variance) — same units as the data.
  • Range and interquartile range (IQR): range = maximum - minimum; IQR = Q3 - Q1 measures the spread of the middle 50% of the data and is less sensitive to outliers than the range.

Quick example (descriptive)

Data: 3, 5, 7, 8, 10

  • Mean = (3+5+7+8+10)/5 = 33/5 = 6.6
  • Median = 7 (middle value)
  • Range = 10 - 3 = 7
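
The same numbers can be checked with Python's standard statistics module. This is a minimal sketch rather than part of the original example: the variable names are illustrative, and the quartiles use the module's default convention, which can differ slightly from other textbooks or software.

```python
import statistics

data = [3.0, 5.0, 7.0, 8.0, 10.0]

mean = statistics.mean(data)            # (3 + 5 + 7 + 8 + 10) / 5
median = statistics.median(data)        # middle value of the sorted data
data_range = max(data) - min(data)      # largest minus smallest value

pop_var = statistics.pvariance(data)    # divides by N
sample_var = statistics.variance(data)  # divides by n - 1
sample_sd = statistics.stdev(data)      # square root of the sample variance

# Quartile conventions vary; this uses the module's default method.
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

print(f"mean={mean:.2f} median={median} range={data_range}")
print(f"population variance={pop_var:.2f} sample variance={sample_var:.2f}")
print(f"sample sd={sample_sd:.3f} IQR={iqr}")
```

The sample variance (7.3) is larger than the population variance (5.84) because the same sum of squared deviations, 29.2, is divided by n - 1 = 4 instead of N = 5.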

Basic probability ideas (foundation for inference)

  • Probability quantifies how likely an event is, on a scale from 0 (impossible) to 1 (certain).
  • Independence, conditional probability, and common distributions (binomial, normal, Poisson) are foundational.
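  • The Central Limit Theorem (CLT): for independent observations from a population with finite variance, the sampling distribution of the sample mean is approximately normal for large n, whatever the shape of the population distribution. This is why many methods based on the normal distribution work even when the data themselves are not normal.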
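
A small simulation can make the CLT concrete. The sketch below is an illustration under assumptions of my choosing: the population is exponential with mean 1 (strongly skewed), each sample has n = 50 observations, and the function name sample_means is hypothetical.

```python
import math
import random
import statistics


def sample_means(n, repetitions, rng):
    """Return `repetitions` sample means, each computed from n exponential draws."""
    means = []
    for _ in range(repetitions):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return means


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 50
means = sample_means(n, repetitions=5000, rng=rng)

center = statistics.fmean(means)   # should be near the population mean, 1
spread = statistics.stdev(means)   # should be near 1 / sqrt(n)
expected_spread = 1 / math.sqrt(n)

# For a normal distribution, about 95% of values lie within 1.96 sd of the mean.
within = sum(abs(m - 1.0) <= 1.96 * expected_spread for m in means) / len(means)

print(f"mean of sample means: {center:.3f}")
print(f"sd of sample means:   {spread:.3f} (theory: {expected_spread:.3f})")
print(f"fraction within 1.96 sd of 1: {within:.3f}")
```

Even though individual exponential values are far from normal, the sample means cluster around 1 with a spread close to 1/sqrt(n), and roughly 95% of them fall within 1.96 standard deviations of the center.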

Inferential statistics: hypothesis testing (step-by-step)

  1. State null hypothesis H0 and alternative H1 (example: H0: mu = mu0).
  2. Choose significance level alpha (common choices: 0.05, 0.01).
  3. Select an appropriate test (t-test, chi-square, ANOVA, etc.).
  4. Compute the test statistic from sample data.
  5. Compute the p-value, or compare the test statistic to the critical value for the chosen alpha.
  6. Make a decision: reject H0 if p-value < alpha; otherwise fail to reject H0.
  7. State conclusion in context and consider practical significance.
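
The steps above can be sketched as a one-sample t-test using the critical-value route from step 5. Everything specific here is an assumption for illustration: the sample reuses the five values from the descriptive example, the null value mu0 = 5 is hypothetical, and the critical value 2.776 is the standard two-sided t-table entry for alpha = 0.05 with 4 degrees of freedom.

```python
import math
import statistics

sample = [3.0, 5.0, 7.0, 8.0, 10.0]

# Steps 1-2: H0: mu = mu0 versus H1: mu != mu0, at significance level alpha.
mu0 = 5.0     # hypothetical null value, chosen only for illustration
alpha = 0.05

# Step 3: one-sample t-test (population sd unknown, small sample).
n = len(sample)
xbar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample sd, divides by n - 1

# Step 4: test statistic.
t_stat = (xbar - mu0) / (s / math.sqrt(n))

# Step 5: two-sided critical value for alpha = 0.05, df = n - 1 = 4 (t table).
t_critical = 2.776

# Step 6: decision.
reject_h0 = abs(t_stat) > t_critical

print(f"t = {t_stat:.3f}, critical value = {t_critical}")
print("reject H0" if reject_h0 else "fail to reject H0")
```

Here t is about 1.32, which is below 2.776, so the sample does not provide enough evidence to reject H0 at the 5% level. With only five observations the test has little power, so this is not evidence that mu equals 5.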

Confidence intervals (CI)

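A CI gives a range of plausible values for a population parameter. Example: a 95% CI for a mean has the form xbar ± margin of error. The margin of error grows with the variability of the data and with the confidence level, and shrinks as the sample size increases. Interpreting a 95% CI: if we repeated the sampling many times and built an interval each time, about 95% of those intervals would contain the true parameter. It does not mean that any single computed interval has a 95% probability of containing the parameter.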
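
The repeated-sampling interpretation can be checked by simulation. The sketch below makes simplifying assumptions that are mine, not the text's: the population is normal with a known standard deviation, so a z-interval applies, and the values true_mu = 10, sigma = 2, and n = 30 are arbitrary. The helper name z_interval is hypothetical. In practice sigma is usually unknown and a t-interval is used instead.

```python
import math
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, confidence=0.95):
    """CI for the mean when the population sd `sigma` is assumed known."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    margin = z * sigma / math.sqrt(len(sample))     # margin of error
    center = fmean(sample)
    return center - margin, center + margin


rng = random.Random(1)  # fixed seed so the run is reproducible
true_mu, sigma, n, repetitions = 10.0, 2.0, 30, 5000

covered = 0
for _ in range(repetitions):
    sample = [rng.gauss(true_mu, sigma) for _ in range(n)]
    low, high = z_interval(sample, sigma)
    covered += low <= true_mu <= high  # True counts as 1

coverage = covered / repetitions
print(f"fraction of intervals containing the true mean: {coverage:.3f}")
```

The printed fraction should be close to 0.95. Each individual interval either contains the true mean or does not; the 95% describes the long-run behavior of the procedure.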

Relationship and prediction: correlation and regression

  • Correlation measures linear association between two variables (r between -1 and 1). Correlation does not imply causation.
  • Simple linear regression fits a line y = b0 + b1 x to predict y from x. The intercept b0 is the predicted y when x = 0, and the slope b1 is the estimated average change in y for a one-unit increase in x.
  • Assess fit with R-squared (proportion of variance explained) and check residuals for model assumptions.
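
As a rough illustration of these quantities, the sketch below computes r, the least-squares line, and R-squared from their textbook formulas. The five (x, y) pairs are made up for the example and the function name fit_line is hypothetical.

```python
import math


def fit_line(x, y):
    """Return (b0, b1, r) for the least-squares line y = b0 + b1 * x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                  # slope
    b0 = ybar - b1 * xbar           # intercept
    r = sxy / math.sqrt(sxx * syy)  # Pearson correlation
    return b0, b1, r


# Made-up data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

b0, b1, r = fit_line(x, y)
r_squared = r ** 2  # in simple linear regression, R-squared equals r squared

# Residuals: observed y minus fitted y; inspect them for patterns.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(f"y = {b0:.2f} + {b1:.2f} x, r = {r:.3f}, R^2 = {r_squared:.2f}")
```

For this data the fitted line is y = 2.2 + 0.6 x with r of about 0.775, so the line explains about 60% of the variance in y. The residuals of a least-squares fit with an intercept always sum to zero, so the useful check is whether they show a pattern when plotted against x.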

Common pitfalls and things to watch for

  • Biased sampling (nonrepresentative samples) leads to wrong inferences.
  • Confusing correlation with causation.
  • Overfitting: an overly complex model fits the noise in the sample rather than the underlying signal, so it predicts new data poorly.
  • Ignoring assumptions (normality, independence, equal variances) can invalidate tests.
  • Multiple comparisons: many tests increase the chance of false positives unless corrected.
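
The multiple-comparisons point can be quantified with a short calculation. This sketch assumes the tests are independent and every null hypothesis is true, which is a simplification. The function names are hypothetical, and Bonferroni is shown as one common correction, not the only one.

```python
def familywise_error_rate(alpha, m):
    """Chance of at least one false positive in m independent tests at level alpha."""
    return 1 - (1 - alpha) ** m


def bonferroni_alpha(alpha, m):
    """Per-test significance level under the Bonferroni correction."""
    return alpha / m


alpha, m = 0.05, 20

uncorrected = familywise_error_rate(alpha, m)
per_test = bonferroni_alpha(alpha, m)
corrected = familywise_error_rate(per_test, m)

print(f"{m} tests at alpha={alpha}: P(at least one false positive) = {uncorrected:.3f}")
print(f"Bonferroni per-test alpha = {per_test}")
print(f"family-wise error rate after correction = {corrected:.3f}")
```

With 20 tests at alpha = 0.05, the chance of at least one false positive is about 64%. Testing each at 0.05 / 20 = 0.0025 brings the family-wise rate back under 5%, at the cost of lower power for each individual test.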

Practical tips for learning statistics

  • Work with real datasets and practice calculating summaries and plots.
  • Learn to visualize first: histograms, boxplots, scatterplots give intuition.
  • Understand formulas conceptually (what the mean and variance tell you) rather than just memorizing them.
  • Use software (R, Python, Excel, or statistical calculators) to handle larger problems, but know how to do small examples by hand.
  • Check assumptions when applying methods and perform sensitivity checks.

Where to go next

If you want to continue: study probability theory, sampling distributions, t-tests and ANOVA, multiple regression, logistic regression, nonparametric methods, and Bayesian statistics. Practice with real datasets and worked problems.

If you'd like, tell me your background and goals (class level or a specific dataset), and I will create a focused lesson or exercises for you.

