Introduction to Linear Regression

POL 682: Linear Regression Analysis

Chris Weber

University of Arizona

School of Government and Public Policy

2026-01-20

Welcome to the Course

For today:

  • Syllabus review, expectations, and questions
  • Getting started with R and GitHub
  • Overview of the linear model

What is GitHub?

GitHub Overview

GitHub is a web-based platform for hosting data, source code, and projects. It is useful for several reasons, some of which will matter more than others in your own research.

  • Version Control: Track changes to your code and documents
  • Collaboration: Work with others on the same project
  • Project Management: Organize tasks and track progress
  • Backup: Store your work in the cloud
  • Questions, Instructions

Why Use GitHub?

For Your Research

  • GitHub makes computer code transparent, reproducible, and collaborative. It’s transparent because a public repository is open to anyone, reproducible because others can recreate and extend your work, and collaborative because multiple users can contribute to the same project.
  • Version control: Git tracks every change you make, so it is trivial to revert to an earlier version of your code.

Key Concepts

Repository (Repo)

A repository is a folder that contains:

  • Your code files
  • Data files
  • Documentation (README.md)
  • Configuration files
  • Complete history of changes

Let’s have a look at the structure of a repo.

Example Structure

my_analysis/
├── README.md
├── data/
│   ├── raw_data.csv
│   └── processed_data.csv
├── code/
│   ├── 01_load_data.R
│   ├── 02_analysis.R
│   └── 03_visualize.R
└── output/
    ├── figures/
    └── tables/
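
As a rough illustration, the layout above could be created programmatically. This is a minimal sketch in Python (not the course's R workflow); the function name `create_skeleton` is hypothetical, and the folder names are simply copied from the example structure. It writes into a temporary directory so nothing on your machine is touched.

```python
from pathlib import Path
import tempfile


def create_skeleton(root):
    """Create the example project layout under `root` and list what exists."""
    root = Path(root)
    # Sub-folders taken from the example structure above.
    for sub in ["data", "code", "output/figures", "output/tables"]:
        (root / sub).mkdir(parents=True, exist_ok=True)
    # A README is the conventional place for project documentation.
    readme = root / "README.md"
    if not readme.exists():
        readme.write_text("# my_analysis\n")
    # Return every path relative to the project root, sorted for readability.
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


# Build the skeleton in a throwaway directory and keep the listing.
with tempfile.TemporaryDirectory() as tmp:
    created = create_skeleton(Path(tmp) / "my_analysis")

for path in created:
    print(path)
```

The data and script files themselves (for example `raw_data.csv` or `01_load_data.R`) are not created here; they would be added as the project develops.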

Setup

  • Create an account at github.com
  • Log into your account
  • Search for this repository, fork it to your account, and clone it to your computer
  • Create a sample repository to practice with

RStudio

  • In general, run Git commands from the terminal.
  • You can access the terminal from within RStudio: Tools \(\rightarrow\) Terminal \(\rightarrow\) New Terminal
  • Alternatively, RStudio’s Git tab lets you perform Git operations through a graphical interface.
  • We’ll use Git a lot in this course, but instead of an exhaustive presentation at the beginning, let’s take things in steps, starting with accessing the files for this course.

Step 1: Fork the Repository

Step 2: Clone to Your Computer

Step 3: Open in RStudio

Step 4: Work with Git in RStudio

Step 5: Push Your Changes

The Linear Regression Model

Linear regression is one of the most widely used statistical methods for several reasons:

  • Straightforward and foundational - ideal starting point for advanced techniques
  • Data Exploration and Inference - descriptive analysis, exploratory data analysis, inference
  • Practical utility - hypothesis testing and predictive modeling

A Linear Model

  • Linear regression is a common technique in the social sciences.
  • Distills complex relationships to a simple linear relationship.

\[y_i = \alpha + \beta x_i\]

  • Intercept (\(\alpha\)): The point at which the regression line crosses the \(y\)-axis, or the value of \(y\) when \(x=0\).

  • Slope (\(\beta\)): The change in \(y\) for every unit change in \(x\). Formally: \(\frac{\partial y}{\partial x} = \beta\)

  • Examples…..
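
To make the two definitions concrete, here is a small sketch in Python (used only for illustration; the course itself works in R). The values of `alpha` and `beta` are made up, and `predict` is a hypothetical helper, not something defined in the course materials.

```python
# Hypothetical parameter values chosen only for illustration.
alpha = 2.0  # intercept
beta = 0.5   # slope


def predict(x):
    """Return y for a given x under the line y = alpha + beta * x."""
    return alpha + beta * x


# The intercept is the value of y when x = 0.
intercept_check = predict(0)

# The slope is the change in y for a one-unit change in x.
slope_check = predict(4) - predict(3)

print(intercept_check, slope_check)
```

Any pair of x values one unit apart gives the same difference, because the slope of a line is constant.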

Some Common Misconceptions

Reverse Causality

\(x\) may be caused by \(y\) rather than the reverse.

Confounding

\(x\) may be related to \(y\), but there is a common variable affecting both.
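
A small simulation can show how confounding produces a slope even when \(x\) has no effect on \(y\). The sketch below is in Python for illustration only. The data-generating process, the seed, and the sample size are all assumptions: a hypothetical variable `z` drives both `x` and `y`, and `x` never enters the equation for `y`. The slope is computed with the usual least-squares formula (covariance over variance), which the slides have not yet introduced.

```python
import random

random.seed(1)  # arbitrary seed so the run is repeatable
n = 5000

# z is the common cause; x does NOT appear in the equation for y.
z = [random.gauss(0, 1) for _ in range(n)]
x = [zi + random.gauss(0, 1) for zi in z]
y = [2 * zi + random.gauss(0, 1) for zi in z]

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope of y on x: sum of cross-products over sum of squares.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
naive_slope = sxy / sxx

print(round(naive_slope, 2))
```

Under these assumptions the population slope is cov(x, y) / var(x) = 2 / 2 = 1, so the estimate lands near 1 even though the true effect of `x` on `y` is zero.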

Mediation

A relationship may be indirect through another variable.

The Challenge

It is difficult to sort these out using a regression model with observed cross-sectional data.

In most applications, we observe the outcome of a process. It is often difficult to make claims about the process itself.

Observation versus Manipulation

Experimental versus Observational

Is \(x\) observed or under the control of the researcher?

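  • If experimental: The researcher sets \(x\) (ideally by random assignment), so differences in \(y\) can more credibly be attributed to \(x\).
  • If observational: We can examine whether a relationship exists, but that relationship does not imply causation.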

Correlation versus Regression

The two are intimately related, but we make several key assumptions in the regression equation.

  • Dependent variable (regressand): The outcome
  • Explanatory variable (regressor): The predictor
  • The explanatory variable is fixed and exogenous; the dependent variable is random and endogenous
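
One way to see both the connection and the asymmetry is to compute the quantities side by side. The sketch below is in Python for illustration only, and the six data points are invented. It relies on two standard identities: the least-squares slope of \(y\) on \(x\) equals \(r \cdot s_y / s_x\), and the product of the two possible regression slopes equals \(r^2\).

```python
from statistics import mean, stdev

# Invented data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Sample covariance and standard deviations (both use n - 1).
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x, s_y = stdev(x), stdev(y)

# Correlation is symmetric: swapping x and y gives the same r.
r = cov_xy / (s_x * s_y)

# Regression is not symmetric: the slope depends on which variable
# is treated as the dependent variable.
slope_y_on_x = cov_xy / s_x ** 2
slope_x_on_y = cov_xy / s_y ** 2

print(r, slope_y_on_x, slope_x_on_y)
```

Correlation treats the two variables interchangeably, while regression requires choosing which one is the outcome.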

Data Types

Random/Stochastic Variables

Variables whose values are drawn from a probability distribution.

Cross-Sectional Data

Units observed once. Indexed with \(i\): \(x_i\)

Time Series Data

A single unit observed multiple times. Indexed with \(t\): \(x_t\)

  • May trend (systematic change) or be stationary (randomly varying around a mean)

Panel Data

Multiple units observed over multiple time periods. Indexed with \(i\) and \(t\): \(x_{it}\)

Variable Types

Type     | Zero Point | Distances   | Ordering | Examples
Ratio    | Natural    | Meaningful  | Natural  | Income, height, age
Interval | None       | Meaningful  | Natural  | Temperature (°C), IQ
Ordinal  | None       | Meaningless | Natural  | Education level, satisfaction
Nominal  | None       | Meaningless | None     | Religion, ethnicity, region

Quantitative variables: Ratio and Interval
Qualitative variables: Ordinal and Nominal

The Linear Model

The basic functional form:

\[y_i = \alpha + \beta x_i\]

Where:

  • \(\alpha\) = intercept (value of \(y\) when \(x=0\))
  • \(\beta\) = slope (change in \(y\) for every unit change in \(x\))

More formally: \(\partial y / \partial x = \beta\)

The Stochastic Model

Real-world relationships are stochastic rather than deterministic:

\[y_i = \alpha + \beta x_i + \epsilon_i\]

  • \(\epsilon_i\) = error term
  • Often assume: \(\epsilon_i \sim N(0, \sigma^2)\)
    • Normally distributed
    • Mean zero
    • Constant variance \(\sigma^2\)
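
The stochastic model can be simulated directly. The sketch below is in Python for illustration only. The parameter values, seed, and sample size are assumptions chosen for the example, and the estimates use the standard least-squares formulas, which the slides have not yet introduced.

```python
import random

random.seed(682)  # arbitrary seed so the run is repeatable

# Hypothetical "true" parameters.
alpha, beta, sigma = 1.0, 2.0, 0.5
n = 2000

# Draw x, then generate y = alpha + beta * x + error, with error ~ N(0, sigma^2).
x = [random.uniform(0, 10) for _ in range(n)]
y = [alpha + beta * xi + random.gauss(0, sigma) for xi in x]

x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares estimates of the slope and intercept.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
beta_hat = sxy / sxx
alpha_hat = y_bar - beta_hat * x_bar

# Residuals from the fitted line average to zero by construction.
residuals = [yi - (alpha_hat + beta_hat * xi) for xi, yi in zip(x, y)]
mean_residual = sum(residuals) / n

print(round(alpha_hat, 2), round(beta_hat, 2))
```

Because each observation carries its own random error, the estimates are close to, but not exactly equal to, the values used to generate the data.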

Multiple Predictors

With multiple predictors:

\[y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i\]

Or more compactly:

\[y_i = \alpha + \sum_{j=1}^{k} \beta_j x_{ji} + \epsilon_i\]
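
As a sketch of how the coefficients in this equation could be recovered from data, the Python example below (illustration only; the course uses R) fits two predictors by solving the least-squares normal equations. The helper names `solve` and `fit_ols`, the data, and the coefficient values are all hypothetical. The data contain no error term, so the fit should return the generating coefficients up to floating-point rounding.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [b_i] for row, b_i in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x


def fit_ols(predictors, y):
    """Return [alpha, beta_1, ..., beta_k] from the normal equations (X'X) b = X'y."""
    n = len(y)
    # A column of ones carries the intercept.
    columns = [[1.0] * n] + [[float(v) for v in col] for col in predictors]
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)


# Invented predictors; y is generated with alpha = 1, beta_1 = 2, beta_2 = -3.
x1 = [0, 1, 2, 3, 4, 5]
x2 = [1, 0, 2, 1, 3, 0]
y = [1 + 2 * a - 3 * b for a, b in zip(x1, x2)]

coefs = fit_ols([x1, x2], y)
print([round(c, 6) for c in coefs])
```

In practice a statistics package handles this step; the point here is only that each \(\beta_j\) is estimated jointly with the others rather than one predictor at a time.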

Conditional Distribution

We model the conditional distribution of \(y\) given \(x\):

\[p(y \mid x_1, x_2, \dots, x_k) = f(x_1, x_2, \dots, x_k; \alpha, \beta_1, \beta_2, \dots, \beta_k)\]

Key Question: What is the expected value of \(y\) given \(x_1, x_2, \dots, x_k\)?

\[E(y \mid x_1, x_2, \dots, x_k) = \alpha + \sum_{j=1}^{k} \beta_j x_{j}\]
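
The conditional expectation is just the linear combination evaluated at particular predictor values. Below is a minimal Python sketch (illustration only; the course uses R). The function `expected_y` and the coefficient values are hypothetical.

```python
def expected_y(alpha, betas, xs):
    """Return E(y | x_1..x_k) = alpha + sum_j beta_j * x_j."""
    if len(betas) != len(xs):
        raise ValueError("need exactly one x value per coefficient")
    return alpha + sum(b * x for b, x in zip(betas, xs))


# Hypothetical coefficients for a model with two predictors.
alpha = 1.0
betas = [2.0, -0.5]

# Expected y at x_1 = 3, x_2 = 4, and again with x_1 one unit higher.
e_base = expected_y(alpha, betas, [3.0, 4.0])
e_plus_one = expected_y(alpha, betas, [4.0, 4.0])

# Holding x_2 fixed, a one-unit increase in x_1 shifts E(y) by beta_1.
change = e_plus_one - e_base

print(e_base, e_plus_one, change)
```

The difference between the two expectations equals \(\beta_1\): each coefficient is the change in the expected value of \(y\) for a one-unit change in its predictor, holding the other predictors constant.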