POL 682: Linear Regression Analysis
2026-01-20
R and GitHub
GitHub is a web-based platform for hosting data, source code, and projects. It’s useful for a variety of reasons, some perhaps more important than others in your own research.
A repository is a folder that contains:
- Your code files
- Data files
- Documentation (README.md)
- Configuration files
- Complete history of changes

Let’s have a look at the structure of a repo:
my_analysis/
├── README.md
├── data/
│   ├── raw_data.csv
│   └── processed_data.csv
├── code/
│   ├── 01_load_data.R
│   ├── 02_analysis.R
│   └── 03_visualize.R
└── output/
    ├── figures/
    └── tables/
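A layout like this can be created by hand, but a short script makes it repeatable. The following is a minimal sketch in base R, assuming a throwaway location under `tempdir()`; the folder and file names are taken from the tree above, and the files are created empty as placeholders.

```r
# Hypothetical project root inside a temporary directory.
root <- file.path(tempdir(), "my_analysis")

# Sub-folders from the tree; recursive = TRUE also creates missing parents.
dirs <- file.path(root, c("data", "code", "output/figures", "output/tables"))
for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)

# Empty placeholder files for the README and the numbered scripts.
files <- file.path(root, c("README.md",
                           "code/01_load_data.R",
                           "code/02_analysis.R",
                           "code/03_visualize.R"))
invisible(file.create(files))

# Show what now exists, relative to the project root.
created <- list.files(root, recursive = TRUE, include.dirs = TRUE)
print(created)
```

Numbering the scripts (01_, 02_, 03_) keeps them sorted in the order they are meant to run.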
RStudio has a Git tab. We will use Git a lot in this course, but instead of an exhaustive presentation at the beginning, let’s take things in steps, starting with accessing the files for this course.

Linear regression is one of the most widely used statistical methods. In its simplest form, it describes \(y\) as a linear function of \(x\):
\[y_i = \alpha + \beta x_i\]
Intercept (\(\alpha\)): The point at which the regression line crosses the \(y\)-axis, or the value of \(y\) when \(x=0\).
Slope (\(\beta\)): The change in \(y\) for every unit change in \(x\). Formally: \(\frac{\partial y}{\partial x} = \beta\)
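As a purely illustrative case, take the made-up values \(\alpha = 2\) and \(\beta = 0.5\) (these numbers are assumptions, not estimates from any data):

\[
\begin{aligned}
x = 0 &: \quad y = 2 + 0.5(0) = 2 \\
x = 1 &: \quad y = 2 + 0.5(1) = 2.5 \\
x = 2 &: \quad y = 2 + 0.5(2) = 3
\end{aligned}
\]

At \(x = 0\) the line sits at the intercept, and each one-unit step in \(x\) raises \(y\) by \(0.5\), the slope.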
Examples…..
\(x\) may be caused by \(y\) rather than the reverse.
\(x\) may be related to \(y\), but there is a common variable affecting both.
A relationship may be indirect through another variable.
It is difficult to sort these out using a regression model with observed cross-sectional data.
In most applications, we observe the outcome of a process. It is often difficult to make claims about the process itself.
Is \(x\) observed or under the control of the researcher?
The two are intimately related, but we make several key assumptions in the regression equation.
Variables that take on a set of values drawn from a probability distribution.
Units observed once. Indexed with \(i\): \(x_i\)
A single unit observed multiple times. Indexed with \(t\): \(x_t\)
- May trend (systematic change) or be stationary (randomly varying around a mean)
Multiple units observed over multiple time periods. Indexed with \(i\) and \(t\): \(x_{it}\)
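The three layouts above (usually called cross-sectional, time-series, and panel data) differ mainly in which index identifies a row. A small sketch in base R, with all values and unit labels invented solely to show the shape of each data set:

```r
# All values below are made up; only the structure matters.

# Units observed once: one row per unit i (x_i).
cross_section <- data.frame(unit = c("A", "B", "C"),
                            x    = c(1.2, 0.7, 2.1))

# One unit observed repeatedly: one row per period t (x_t).
time_series <- data.frame(year = 2001:2004,
                          x    = c(1.0, 1.3, 1.1, 1.6))

# Several units over several periods: one row per unit-period pair (x_it).
panel <- expand.grid(unit = c("A", "B", "C"), year = 2001:2002)
panel$x <- c(1.2, 0.7, 2.1, 1.4, 0.9, 2.0)

print(cross_section)
print(time_series)
print(panel)
```

In the panel case a row is identified only by the combination of `unit` and `year`, which is why the variable carries both subscripts.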
| Type | Zero Point | Distances | Ordering | Examples |
|---|---|---|---|---|
| Ratio | Natural | Meaningful | Natural | Income, height, age |
| Interval | Arbitrary | Meaningful | Natural | Temperature (°C), IQ |
| Ordinal | None | Meaningless | Natural | Education level, satisfaction |
| Nominal | None | Meaningless | None | Religion, ethnicity, region |
Quantitative variables: Ratio and Interval
Qualitative variables: Ordinal and Nominal
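In R, these levels of measurement map roughly onto different vector types. The sketch below uses hypothetical responses invented for illustration; the mapping (unordered factor for nominal, ordered factor for ordinal, numeric for ratio) is a common convention rather than a rule.

```r
# Hypothetical responses, invented for illustration.
region    <- factor(c("South", "West", "South", "Northeast"))   # nominal
education <- factor(c("HS", "BA", "HS", "PhD"),
                    levels = c("HS", "BA", "PhD"),
                    ordered = TRUE)                              # ordinal
income    <- c(32000, 58000, 41000, 90000)                       # ratio

# Ordering comparisons are defined for ordered factors.
ba_above_hs <- education[2] > education[1]
print(ba_above_hs)

# For an unordered factor such as region, the same comparison
# returns NA with a warning, so it is not attempted here.

# Arithmetic summaries such as the mean apply to quantitative variables.
print(mean(income))
```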
The basic functional form:
\[y_i = \alpha + \beta x_i\]
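Where:
- \(\alpha\) is the intercept: the value of \(y\) when \(x = 0\)
- \(\beta\) is the slope: the change in \(y\) for a one-unit change in \(x\)

More formally: \(\partial y / \partial x = \beta\)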
Real-world relationships are stochastic rather than deterministic, so the model adds an error term \(\epsilon_i\):
\[y_i = \alpha + \beta x_i + \epsilon_i\]
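One way to get a feel for the stochastic model is to simulate from it and then fit a line with `lm()`. This is a minimal sketch under assumptions that are not in the slides: the parameter values, the sample size, the uniform distribution for \(x\), and normally distributed errors are all chosen for illustration.

```r
set.seed(682)                       # arbitrary seed, for reproducibility only

n     <- 200                        # hypothetical sample size
alpha <- 2                          # hypothetical intercept
beta  <- 0.5                        # hypothetical slope

x   <- runif(n, min = 0, max = 10)  # predictor values
eps <- rnorm(n, mean = 0, sd = 1)   # stochastic component (assumed normal)
y   <- alpha + beta * x + eps       # y_i = alpha + beta * x_i + eps_i

fit <- lm(y ~ x)                    # least-squares fit of y on x
print(coef(fit))                    # estimates should land near alpha and beta
```

Because of \(\epsilon_i\), the points scatter around the line rather than falling on it, and the estimates differ slightly from the values used to generate the data.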
With multiple predictors:
\[y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \epsilon_i\]
Or more compactly:
\[y_i = \alpha + \sum_{j=1}^{k} \beta_j x_{ji} + \epsilon_i\]
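The written-out and summation forms describe the same quantity. The sketch below checks this numerically for the systematic part of the model (leaving out \(\epsilon_i\)), using made-up coefficients and a made-up two-row data matrix with \(k = 3\) predictors.

```r
alpha <- 1                          # hypothetical intercept
beta  <- c(0.5, -2, 3)              # hypothetical beta_1, beta_2, beta_3

# Rows are units i; columns are the predictors x_1, x_2, x_3 (made-up values).
X <- matrix(c(1, 0, 2,
              4, 1, 0), nrow = 2, byrow = TRUE)

# Term-by-term form: alpha + beta_1 x_1i + beta_2 x_2i + beta_3 x_3i
long_form <- alpha + beta[1] * X[, 1] + beta[2] * X[, 2] + beta[3] * X[, 3]

# Summation form: alpha + sum_j beta_j x_ji, computed as a matrix product
sum_form <- alpha + as.vector(X %*% beta)

print(long_form)
print(sum_form)
print(all.equal(long_form, sum_form))
```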
We model the conditional distribution of \(y\) given \(x\):
\[p(y \mid x_1, x_2, \dots, x_k) = f(x_1, x_2, \dots, x_k; \alpha, \beta_1, \beta_2, \dots, \beta_k)\]
Key Question: What is the expected value of \(y\) given \(x_1, x_2, \dots, x_k\)?
\[E(y \mid x_1, x_2, \dots, x_k) = \alpha + \sum_{j=1}^{k} \beta_j x_{j}\]
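One way to see where this expression comes from is to take the conditional expectation of the stochastic model term by term. The step relies on an assumption not stated above, namely that the error has conditional mean zero, \(E(\epsilon \mid x_1, \dots, x_k) = 0\):

\[
\begin{aligned}
E(y \mid x_1, \dots, x_k) &= E\Big(\alpha + \sum_{j=1}^{k} \beta_j x_j + \epsilon \,\Big|\, x_1, \dots, x_k\Big) \\
&= \alpha + \sum_{j=1}^{k} \beta_j x_j + E(\epsilon \mid x_1, \dots, x_k) \\
&= \alpha + \sum_{j=1}^{k} \beta_j x_j
\end{aligned}
\]

The second line uses the fact that \(\alpha\), the \(\beta_j\), and the \(x_j\) are fixed once we condition on the predictors, so only \(\epsilon\) remains random.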

POL 682 | Introduction to Linear Regression