Introduction to Linear Regression

POL 682: Linear Regression Analysis

Chris Weber

chrisweber@arizona.edu

University of Arizona

School of Government and Public Policy

2026-01-26

Welcome to the Course

For today:
Syllabus review, expectations and questions
Getting started with R and GitHub
Overview of the linear model

What is GitHub?

GitHub Overview

GitHub is a web-based platform for hosting data, source code, and projects. It is useful for several reasons, some of which will matter more than others in your own research.

Version Control: Track changes to your code and documents
Collaboration: Work with others on the same project
Project Management: Organize tasks and track progress
Backup: Store your work in the cloud
Questions, Instructions

Why Use GitHub?

For Your Research

GitHub makes computer code transparent, reproducible, and collaborative. It’s transparent because public repositories are open to anyone, it’s reproducible because others can recreate and expand upon your work, and it’s collaborative because it gives users a platform to jointly contribute to a project.
Version control: Git tracks every change you make, so reverting to an earlier version of your code is straightforward.

Key Concepts

Repository (Repo)

A repository is a folder that contains:
- Your code files
- Data files
- Documentation (README.md)
- Configuration files
- Complete history of changes

Let’s have a look at the structure of a repo.

Example Structure

my_analysis/
├── README.md
├── data/
│   ├── raw_data.csv
│   └── processed_data.csv
├── code/
│   ├── 01_load_data.R
│   ├── 02_analysis.R
│   └── 03_visualize.R
└── output/
    ├── figures/
    └── tables/
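
A layout like the one above can be scaffolded from the terminal. The following is a minimal sketch, assuming a Unix-style shell (for example, the RStudio terminal on macOS or Linux, or Git Bash on Windows); the folder and file names simply mirror the example and have no special meaning to Git.

```bash
# Hypothetical scaffold for the example layout; names are taken from the slide.
set -eu

project="my_analysis"   # assumed project name, copied from the example

# Create the nested folders in one pass (-p also creates missing parents)
mkdir -p "$project/data" "$project/code" "$project/output/figures" "$project/output/tables"

# Create empty placeholder files to fill in later
touch "$project/README.md"
touch "$project/code/01_load_data.R" "$project/code/02_analysis.R" "$project/code/03_visualize.R"

# The data/ folder is left empty here; the CSV files would come from your own data

# List everything that was created
find "$project" | sort
```

Note that Git tracks files rather than folders, so empty folders such as output/figures/ will not appear on GitHub until they contain at least one file.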

Setup

Create an account at github.com
Log into your account
Search for this repository, fork it to your account, and clone it to your computer
Finally, create a sample repository to practice with.

`RStudio`

In general, run Git commands from the terminal.
You can access the terminal from within RStudio: Tools \(\rightarrow\) Terminal \(\rightarrow\) New Terminal
Alternatively, RStudio’s Git tab lets you perform Git operations through a graphical interface.
We’ll use Git a lot in this course, but instead of an exhaustive presentation at the beginning, let’s take things in steps, starting with accessing the files for this course.

Understanding Git Workflows

Three Key Concepts

Forking: Creating your own copy of someone else’s repository
Committing: Saving changes to your local repository
Push/Pull: Synchronizing between local and remote repositories

Forking vs. Cloning

Fork

Creates a copy of a repository under your GitHub account
A forked repository is useful for collaboration
Use when you want to contribute to someone else’s project
Common in open-source collaboration

Clone

Downloads a repository to your local computer
You can clone your own repos or others’ repos
Changes are made locally until you push them
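You can commit to any clone locally, but you can push only to repositories where you have write access (for example, repos in your own account).
Fork + Clone = Collaboration
Forking and then cloning lets you push changes to your own fork and open a “pull request” against the original repository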
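
The relationship between a fork and the original repository shows up in a clone's list of remotes. The sketch below is an illustration only: both URLs are placeholders, `git init` stands in for `git clone <url-of-your-fork>` so that it runs without network access, and the remote name `upstream` is a common convention rather than something Git requires.

```bash
set -eu
cd "$(mktemp -d)"          # work in a throwaway folder

# Stand-in for cloning your fork: a real clone would create the folder
# and register your fork as the remote named "origin" automatically.
git init -q course-fork
cd course-fork
git remote add origin "https://github.com/YOUR-USERNAME/COURSE-REPO.git"    # placeholder URL

# Register the original repository under the conventional name "upstream"
git remote add upstream "https://github.com/ORIGINAL-OWNER/COURSE-REPO.git" # placeholder URL

# Show both remotes with their fetch and push URLs
git remote -v
```

With this setup, you would push to `origin` (your fork) and fetch new material from `upstream` (the original).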

The Commit Workflow

What is a Commit?

A snapshot of your changes at a point in time
Each commit has a unique ID and message
Creates a history of your work

Basic Terminal Commands

# Check status of your files
git status

# Add files to staging area
git add data_683.R
git add .  # adds all changed files

# Commit changes with a message
git commit -m "My new 683 repo"
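
To see the ID and message attached to each commit, the commands above can be tried in a throwaway folder. This is a minimal sketch: the file name `analysis.R`, the commit messages, and the placeholder identity are all hypothetical.

```bash
set -eu
sandbox="$(mktemp -d)"     # temporary folder, safe to delete afterwards
cd "$sandbox"

git init -q                                   # start an empty local repository
git config user.name  "Student"               # placeholder identity, local to this repo
git config user.email "student@example.com"

echo 'x <- 1' > analysis.R                    # hypothetical script
git status --short                            # prints "?? analysis.R" (untracked file)

git add analysis.R                            # stage the file
git commit -q -m "Add analysis script"        # first snapshot

echo 'y <- 2' >> analysis.R                   # edit the file
git add analysis.R
git commit -q -m "Add second variable"        # second snapshot

git log --oneline                             # one line per commit: short ID + message
```

The short IDs printed by `git log --oneline` will differ on every machine, because each ID is derived from the commit's content, author, and timestamp.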

Pull, Push, and Staying in Sync

The Basic Workflow

# 1. Pull latest changes from remote
git pull origin main

# 2. Make your changes to files
# (edit your code, data, etc.)

# 3. Stage and commit changes
git add .
git commit -m "Update analysis"

# 4. Push changes to remote
git push origin main

Practical advice: always pull before you start working to avoid conflicts!
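
The pull-before-you-work habit can be rehearsed without touching GitHub. In the sketch below, a local bare repository stands in for the remote, and `laptop` and `office` are two hypothetical working copies of the same project; none of these names come from the course repository.

```bash
set -eu
sandbox="$(mktemp -d)"
cd "$sandbox"

# A bare repository plays the role of the copy hosted on GitHub (assumption)
git init -q --bare remote.git
git --git-dir=remote.git symbolic-ref HEAD refs/heads/main   # make "main" its default branch

# Working copy 1: create a file, commit, and push
git clone -q remote.git laptop 2>/dev/null   # hide the "empty repository" warning
cd laptop
git config user.name "Student"
git config user.email "student@example.com"
echo "Notes" > README.md
git add README.md
git commit -q -m "Add README"
git branch -M main                           # make sure the branch is called main
git push -q origin main

# Working copy 2: clone, edit, commit, and push
cd "$sandbox"
git clone -q remote.git office
cd office
git config user.name "Student"
git config user.email "student@example.com"
echo "Edited at the office" >> README.md
git add .
git commit -q -m "Update README"
git push -q origin main

# Back in working copy 1: pull first, then start working
cd "$sandbox/laptop"
git pull -q origin main
git log --oneline                            # now lists both commits
```

Skipping the final pull and editing README.md in `laptop` instead would leave the two copies with diverging histories, which is the situation the advice is meant to prevent.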

When to Use What?

Action	When to Use	Command
Fork	Contributing to others’ projects	Done on GitHub website
Clone	Getting a repo onto your computer	`git clone <url>`
Pull	Getting latest changes from remote	`git pull origin main`
Commit	Saving changes locally	`git commit -m "message"`
Push	Sending local changes to remote	`git push origin main`

Class Activity

Fork the Class Repository

Clone the Repo to your Computer

Create a .qmd file in the tmp folder

Add content and save

Add, Commit, and Push

The Linear Regression Model

Linear regression is one of the most widely used statistical methods for several reasons:

Straightforward and foundational - ideal starting point for advanced techniques
Data Exploration and Inference - descriptive analysis, exploratory data analysis, inference
Practical utility - hypothesis testing and predictive modeling

The Logic Underlying the Regression Model

A Brief (Troubled) History

Francis Galton (1886) used regression and coined the term to describe a “regression to the mean,” often in studies of the heritability of physical characteristics.
For instance, he compared characteristics using inter-generational data (e.g., average height in father-son pairs; Pearl & Mackenzie 2018)
Observed “regression to the mean”: If a father is 5’2’’ (unusually short), his son tends to be short but closer to the population average (taller than the father, closer to the mean; problematically referred to as “regression to mediocrity” by Galton)
Galton was a founder of the eugenics movement
The broader method – a statistical model that can be used to quantify the relationship between two variables – is value neutral
But a word of caution: Statistical methods are routinely abused and misinterpreted, sometimes in tragic and unexpected ways; all the more reason to thoroughly understand how they work

A Linear Model

Linear regression is a common technique in the social sciences.
Distills complex relationships to a simple linear relationship.

\[y_i = \alpha + \beta x_i\]

Intercept (\(\alpha\)): The point at which the regression line crosses the \(y\)-axis, or the value of \(y\) when \(x=0\).
Slope (\(\beta\)): The change in \(y\) for every unit change in \(x\). Formally: \(\frac{\partial y}{\partial x} = \beta\)
Examples…..
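
As one illustration with made-up numbers (not estimates from any data), let \(x_i\) be years of education and \(y_i\) be annual income in thousands of dollars, with \(\alpha = 20\) and \(\beta = 1.5\):

\[y_i = 20 + 1.5 x_i\]

\[x_i = 0:\; y_i = 20 \qquad x_i = 12:\; y_i = 20 + 1.5(12) = 38 \qquad x_i = 13:\; y_i = 20 + 1.5(13) = 39.5\]

The intercept is the value of \(y_i\) at \(x_i = 0\), and moving from 12 to 13 years raises \(y_i\) by \(39.5 - 38 = 1.5 = \beta\).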

Some Common Misconceptions

Reverse Causality

\(x\) may be caused by \(y\) rather than the reverse.

Confounding

\(x\) may be related to \(y\), but there is a common variable affecting both.

Mediation

A relationship may be indirect through another variable.

The Challenge

It is difficult to sort these out using a regression model with observed cross-sectional data.

In most applications, we observe the outcome of a process. It is often difficult to make claims about the process itself.

Observation versus Manipulation

Experimental versus Observational

Is \(x\) observed or under the control of the researcher?

If observational: We can examine whether a relationship exists, but that relationship does not imply causation.

Correlation versus Regression

The two are intimately related, but we make several key assumptions in the regression equation.

Dependent variable (regressand): The outcome
Explanatory variable (regressor): The predictor
The explanatory variable is fixed and exogenous; the dependent variable is random and endogenous
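
One way to see the connection, assuming the slope is estimated by ordinary least squares with a single predictor (an estimation method not yet introduced at this point in the slides): the estimated slope is the correlation rescaled by the two standard deviations.

\[\hat{\beta} = \frac{\widehat{\mathrm{Cov}}(x, y)}{\widehat{\mathrm{Var}}(x)} = r_{xy}\,\frac{s_y}{s_x}\]

Here \(r_{xy}\) is the sample correlation and \(s_x\), \(s_y\) are the sample standard deviations; the hat and these symbols are notation introduced for this sketch. The correlation is symmetric in \(x\) and \(y\), while the slope changes when the roles of regressand and regressor are swapped.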

Data Types

Random/Stochastic Variables: Variables that take on a set of values drawn from a probability distribution.
Cross-Sectional Data: Units observed once. Indexed with \(i\): \(x_i\)
Time Series Data: A single unit observed multiple times. Indexed with \(t\): \(x_t\). A series may trend (systematic change) or be stationary (randomly varying around a mean).
Panel Data: Multiple units observed over multiple time periods. Indexed with \(i\) and \(t\): \(x_{it}\)

Variable Types

Type	Zero Point	Distances	Ordering	Examples
Ratio	Natural	Meaningful	Natural	Income, height, age
Interval	None	Meaningful	Natural	Temperature (°C), IQ
Ordinal	None	Meaningless	Natural	Education level, satisfaction
Nominal	None	Meaningless	None	Religion, ethnicity, region

Quantitative variables: Ratio and Interval
Qualitative variables: Ordinal and Nominal

The Linear Model

The basic functional form:

\[y_i = \alpha + \beta x_i\]

Where:

\(\alpha\) = intercept (value of \(y\) when \(x=0\))
\(\beta\) = slope (change in \(y\) for every unit change in \(x\))

More formally: \(\partial y / \partial x = \beta\)

The Stochastic Model

Real-world relationships are stochastic rather than deterministic

\[y_i = \alpha + \beta x_i + \epsilon_i\]

\(\epsilon_i\) = error term
Often assume: \(\epsilon_i \sim N(0, \sigma^2)\)
- Normally distributed
- Mean zero
- Constant variance \(\sigma^2\)
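
Taken together, and treating \(x_i\) as fixed, these assumptions imply a normal distribution for each outcome centered on the regression line:

\[y_i \mid x_i \sim N(\alpha + \beta x_i,\; \sigma^2)\]

As a sketch with hypothetical values \(\alpha = 20\), \(\beta = 1.5\), \(\sigma = 5\), and \(x_i = 12\), the mean is \(20 + 1.5(12) = 38\), and about 95% of outcomes at that value of \(x\) fall within

\[38 \pm 1.96(5) = [28.2,\; 47.8]\]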

Multiple Predictors

With multiple predictors:

\[y_i = \alpha + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \epsilon_i\]

Or more compactly:

\[y_i = \alpha + \sum_{j=1}^{k} \beta_j x_{ji} + \epsilon_i\]

Conditional Distribution

We model the conditional distribution of \(y\) given \(x\):

\[p(y | x_1, x_2, \dots, x_k) = f(x_1, x_2, \dots, x_k; \alpha, \beta_1, \beta_2, \dots, \beta_k)\]

Key Question: What is the expected value of \(y\) given \(x_1, x_2, \dots, x_k\)?

\[E(y | x_1, x_2, \dots, x_k) = \alpha + \sum_{j=1}^{k} \beta_j x_{j}\]
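
A worked case with \(k = 2\) and hypothetical coefficients \(\alpha = 10\), \(\beta_1 = 1.5\), \(\beta_2 = 0.4\), evaluated at \(x_1 = 12\) and \(x_2 = 30\):

\[E(y | x_1 = 12, x_2 = 30) = 10 + 1.5(12) + 0.4(30) = 10 + 18 + 12 = 40\]

Raising \(x_1\) to 13 while holding \(x_2\) at 30 gives \(10 + 19.5 + 12 = 41.5\), a change of \(1.5 = \beta_1\). Each slope is therefore the change in the expected value of \(y\) for a one-unit change in its own predictor, with the other predictors held fixed:

\[\frac{\partial E(y | x_1, x_2)}{\partial x_1} = \beta_1\]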
