A Primer On Statistics

What Is Statistics?

Statistics is a branch of mathematics and a scientific discipline concerned with the collection, organization, analysis, interpretation, and presentation of data. Its primary goals are to extract meaningful information from data, make inferences, and support decision-making. Statistics is broadly divided into two main branches:

Descriptive Statistics: Descriptive statistics comprises methods for summarizing and organizing data to describe its main features. Common descriptive measures include measures of central tendency (mean, median, mode) and measures of dispersion (range, variance, standard deviation). Together they provide a concise summary of a dataset.
Inferential Statistics: Inferential statistics uses a sample of data to draw conclusions and make predictions about the population from which the sample was taken. This branch includes hypothesis testing, confidence intervals, and regression analysis. Inferential statistics helps researchers generalize findings from a sample to a larger population.
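
The descriptive measures listed above can be computed in a few lines. The following is a minimal sketch using only Python's standard `statistics` module; the exam scores are hypothetical values made up for illustration, not data from any of the course books.

```python
import statistics

# Hypothetical sample of exam scores (illustrative values only)
scores = [72, 85, 85, 90, 68, 77, 85, 93]

# Measures of central tendency
mean = statistics.mean(scores)
median = statistics.median(scores)
mode = statistics.mode(scores)

# Measures of dispersion
value_range = max(scores) - min(scores)
variance = statistics.variance(scores)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(scores)      # sample standard deviation

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={value_range}, variance={variance:.2f}, sd={std_dev:.2f}")
```

Note that `statistics.variance` and `statistics.stdev` treat the data as a sample; `statistics.pvariance` and `statistics.pstdev` are the population versions.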

Key Concepts in Statistics:

Population and Sample: A population is the entire group of individuals or instances that the study is concerned with. A sample is a subset of the population selected for analysis.
Variable: A characteristic that varies among individuals or objects in a population. It can be quantitative (e.g., height, weight) or qualitative (e.g., color, gender).
Data: Information collected from observations, measurements, or surveys. Data can be classified as numerical (quantitative) or categorical (qualitative).
Statistical Methods: Various methods, techniques, and tools are used to analyze and interpret data. These include probability theory, hypothesis testing, regression analysis, and data visualization.
Probability: A numerical measure, between 0 and 1, of how likely an event is to occur. Probability theory is fundamental to statistical reasoning and inference.
Statistical Software: Tools such as R, Python (with libraries such as NumPy and pandas), SPSS, and SAS are commonly used to perform statistical analyses.
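
To make the distinction between a population and a sample concrete, here is a small sketch using Python's standard library. The population is simulated (10,000 hypothetical heights), so every number in it is an assumption chosen for illustration; in practice the population mean is usually unknown and the sample mean is used to estimate it.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is reproducible

# Hypothetical population: 10,000 heights in cm (simulated, not real data)
population = [rng.gauss(170, 10) for _ in range(10_000)]

# A simple random sample of 200 individuals drawn without replacement
sample = rng.sample(population, 200)

population_mean = statistics.mean(population)  # parameter (usually unknown)
sample_mean = statistics.mean(sample)          # statistic (computed from data)

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f}")
```

The two means should be close but not identical; that gap is the sampling variability that inferential statistics quantifies.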

Statistics is applied in numerous fields, including science, economics, finance, social sciences, healthcare, and engineering, to draw insights, make predictions, and support evidence-based decision-making.

Statistical Models

A statistical model is a mathematical representation or abstraction that attempts to describe the underlying structure of a set of data and the relationships between its variables. It provides a framework for making inferences and predictions based on observed data. Statistical models are widely used in various fields, including science, engineering, economics, and social sciences.

Key components of a statistical model include:

Parameters: These are the unknown constants in the model that need to be estimated from the data. Parameters represent features of the population being studied.
Variables: The variables in a statistical model are the characteristics or factors being observed or measured. These can be either dependent (response) variables or independent (predictor) variables.
Assumptions: Statistical models are built on certain assumptions about the data and the relationships between variables. These assumptions may include assumptions about the distribution of the data, the independence of observations, or the linearity of relationships.
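Error Term/Residuals: The error term represents the part of the response that the model does not explain. It is not observed directly; it is estimated by the residuals, which are the differences between the observed values and the values predicted by the fitted model. Fitting typically aims to make these discrepancies small, for example by minimizing the sum of squared residuals.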
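
The components above can be seen together in a simple least-squares fit. This is a minimal sketch, assuming a model with one predictor and a made-up dataset (hours studied versus exam score); it uses the standard closed-form least-squares formulas rather than any particular statistical package.

```python
import statistics

# Hypothetical data: hours studied (predictor) and exam score (response)
x = [1, 2, 3, 4, 5]
y = [52, 55, 61, 64, 68]

x_bar = statistics.mean(x)
y_bar = statistics.mean(y)

# Least-squares estimates of the parameters (slope and intercept)
slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
intercept = y_bar - slope * x_bar

# Residuals: observed values minus the values predicted by the fitted model
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(f"intercept={intercept:.2f}, slope={slope:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```

When the model includes an intercept, the least-squares residuals sum to zero (up to rounding), which is a quick sanity check on a fit.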

Types of statistical models include:

Explanatory Models: Describe and explain the relationships between variables, for example by interpreting the coefficients of a regression model.
Predictive Models: Predict future values or outcomes based on historical data. Examples include linear regression and machine learning models.

Statistical modeling is a fundamental tool in statistical analysis, allowing researchers and analysts to gain insights from data, test hypotheses, and make predictions. The choice of a specific model depends on the nature of the data and the objectives of the analysis.

Examples of Different Model Structures

Examples of different model structures commonly used in statistical and machine learning contexts are:

Linear Regression Model: The model structure is \[Y=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p + \epsilon\]

where \(Y\) is called the response variable, the \(p\) \(X\)-variables are called the explanatory variables or predictors, the \(p+1\) \(\beta\)’s are called the coefficients, and \(\epsilon\) is called the error term. A commonly used assumption is that the error term \(\epsilon\) has a normal distribution with mean 0 and standard deviation \(\sigma\). Writing \(N(\mu, \sigma)\) for a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), the model can then be written as

\[Y\sim N(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p, \sigma)\]
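
The two ways of writing the linear regression model describe the same thing. The sketch below illustrates this by simulating responses both ways from the same random draws; the coefficient values, \(\sigma\), and the single predictor are hypothetical choices made only for this example.

```python
import random

# Hypothetical parameter values chosen only for illustration
beta0, beta1, sigma = 2.0, 0.5, 1.5
x_values = [0.0, 1.0, 2.0, 3.0, 4.0]

# Two generators with the same seed produce the same underlying random draws
rng_a = random.Random(0)
rng_b = random.Random(0)

# Form 1: Y = beta0 + beta1 * X + epsilon, with epsilon ~ N(0, sigma)
y_additive = [beta0 + beta1 * x + rng_a.gauss(0.0, sigma) for x in x_values]

# Form 2: Y ~ N(beta0 + beta1 * X, sigma)
y_distributional = [rng_b.gauss(beta0 + beta1 * x, sigma) for x in x_values]

for x, ya, yd in zip(x_values, y_additive, y_distributional):
    print(f"x={x:.0f}  additive={ya:.4f}  distributional={yd:.4f}")
```

Adding normal noise with mean 0 to the linear predictor gives the same values as drawing directly from a normal distribution centered on the linear predictor.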

Logistic Regression Model: The model structure is \[\log\left(\frac{p}{1-p}\right)=\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\] or, equivalently, \[p=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}}\] where \(p=\text{Prob}(Y=1 \mid X_1, X_2, \ldots, X_p)\) and \(Y\) is the binary response variable with values 0 or 1. Note that \(p\) on its own denotes this probability, while the subscript \(p\) in \(\beta_p X_p\) still denotes the number of predictors. Again, we can write the logistic regression model as

\[Y\sim \text{Bernoulli}\left(p=\frac{1}{1+e^{-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p)}}\right)\]
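
The log-odds form and the probability form of the logistic regression model are inverses of each other. Here is a minimal sketch of that relationship, assuming a single predictor and hypothetical coefficient values; the function names `logit` and `inverse_logit` are chosen for this example.

```python
import math

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1 - p))

def inverse_logit(eta):
    """Probability corresponding to a linear predictor (log-odds) eta."""
    return 1 / (1 + math.exp(-eta))

# Hypothetical coefficients and a single predictor value, for illustration only
beta0, beta1 = -1.0, 0.8
x1 = 2.0

eta = beta0 + beta1 * x1   # linear predictor = log-odds
p = inverse_logit(eta)     # Prob(Y = 1 | X1 = x1)

print(f"log-odds={eta:.3f}, probability={p:.3f}")
print(f"logit(p) recovers the log-odds: {logit(p):.3f}")
```

Whatever value the linear predictor takes, the resulting probability always lies strictly between 0 and 1, which is why this transformation suits a binary response.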

Decision Tree Model: The model structure is a tree-like model where each internal node represents a decision based on a feature, each branch represents the outcome of the decision, and each leaf node represents the predicted target value. Search Google images for “Decision Tree Model” to see example structures.
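
The tree structure described above can be sketched directly as nested nodes. In this illustration the features (`income`, `age`), thresholds, and labels are all hypothetical and hand-written; a real decision tree would learn its splits from data.

```python
# A tiny hand-built tree (hypothetical features and thresholds, not fitted to data).
# Internal nodes test one feature; leaves hold the predicted target value.
tree = {
    "feature": "income",
    "threshold": 50_000,
    "left": {"prediction": "decline"},          # income <= 50,000
    "right": {                                  # income > 50,000
        "feature": "age",
        "threshold": 30,
        "left": {"prediction": "decline"},      # age <= 30
        "right": {"prediction": "approve"},     # age > 30
    },
}

def predict(node, record):
    """Follow branches from the root until a leaf is reached."""
    while "prediction" not in node:
        branch = "left" if record[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    return node["prediction"]

print(predict(tree, {"income": 72_000, "age": 41}))
print(predict(tree, {"income": 38_000, "age": 41}))
```

Each prediction corresponds to one path from the root to a leaf, which is what makes a single tree easy to read and explain.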

Random Forest Model: The model structure is an ensemble of decision trees, where predictions are made by aggregating the predictions of individual trees. Search Google images for “Random Forest Model” to see example structures. A reference: https://serokell.io/blog/random-forest-classification
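
The aggregation step of an ensemble can be illustrated with a majority vote. The sketch below is not a trained random forest: the three one-split trees and their features are hypothetical and hand-written, and only the voting logic is being shown. Majority voting is the usual aggregation for classification; for a numeric response the tree predictions would typically be averaged instead.

```python
from collections import Counter

# Three hypothetical one-split trees ("stumps"); a real random forest would
# learn each tree from a bootstrap sample of the training data.
def tree_1(record):
    return "approve" if record["income"] > 50_000 else "decline"

def tree_2(record):
    return "approve" if record["age"] > 30 else "decline"

def tree_3(record):
    return "approve" if record["debt"] < 10_000 else "decline"

forest = [tree_1, tree_2, tree_3]

def forest_predict(record):
    """Aggregate the individual tree predictions by majority vote."""
    votes = Counter(tree(record) for tree in forest)
    return votes.most_common(1)[0][0]

applicant = {"income": 72_000, "age": 25, "debt": 4_000}
print([tree(applicant) for tree in forest])
print(forest_predict(applicant))
```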

Neural Network Model: The model structure is a network of interconnected nodes (neurons) organized in layers, including an input layer, one or more hidden layers, and an output layer. Each connection has an associated weight. Search Google images for “Neural Network Model” to see example structures. A reference: https://towardsdatascience.com/designing-your-neural-networks-a5e4617027ed
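
To show how layers and weights fit together, here is a minimal sketch of a forward pass through a network with two inputs, one hidden layer of two neurons, and one output neuron. The weights and biases are hypothetical values rather than trained ones, and the sigmoid activation is an assumption made for this example; other activation functions are common.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def layer(inputs, weights, biases):
    """One fully connected layer: weighted sum per neuron, then sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

# Hypothetical weights and biases (illustrative values, not trained)
hidden_weights = [[0.5, -0.2], [0.3, 0.8]]   # 2 inputs -> 2 hidden neurons
hidden_biases = [0.1, -0.1]
output_weights = [[1.0, -1.5]]               # 2 hidden neurons -> 1 output
output_biases = [0.2]

inputs = [1.0, 2.0]                          # input layer
hidden = layer(inputs, hidden_weights, hidden_biases)
output = layer(hidden, output_weights, output_biases)

print("hidden activations:", [round(h, 3) for h in hidden])
print("output:", round(output[0], 3))
```

Training a network means adjusting these weights and biases so that the outputs match the observed responses as closely as possible.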

These examples illustrate diverse model structures for different types of tasks: linear regression for a numeric response, logistic regression for a binary response (classification), decision trees and random forests as tree-based methods that can handle either kind of response, and neural networks, which underlie deep learning. The specific structure of a model depends on the nature of the data and the problem being addressed.

Stat 321

Book: https://www.stat2.org/

Data manual: https://www.stat2.org/manuals/Stat2DataManual.pdf

Book R Code: https://www.stat2.org/manuals/Stat2RCompanion.pdf

Stat 325

Book: https://bookdown.org/speegled/foundations-of-statistics/RData.html#data-types

We focus on Chapter 1, Sections 2.1 and 2.4.1, and Chapters 3-7, 9, and 10.

Stat 415/615

Copy and paste the following code into the R console to install the “mlba” package (https://github.com/gedeck/mlba), which gives you access to some utility functions and all datasets in the book.

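# Install the mlba package from GitHub only if it is not already available
if (!requireNamespace("mlba", quietly = TRUE)) {
  # install_github() comes from devtools; install devtools first if it is missing
  if (!requireNamespace("devtools", quietly = TRUE)) install.packages("devtools")
  devtools::install_github("gedeck/mlba/mlba", force = TRUE)
}
library(mlba)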