James Scott (UT-Austin)
Reference: “Data Science” Chapter 2.
So far we've concentrated on relatively simple visual and numerical summaries of data sets.
In many cases we will want to go further, by fitting an explicit equation, called a regression model, that describes how one variable (\( y \)) changes as a function of some other variable (\( x \)).
This process is called regression, or curve fitting: estimating the conditional expected value of \( y \), given \( x \).
For example, you may have heard the following rule of thumb: to calculate your maximum heart rate, subtract your age from 220.
We can express this rule as an equation:
\[ \mbox{MHR} = 220 - \mbox{Age} \]
This equation comes from data. The study probably went something like this:
Data from this kind of study looks like this…
It turns out that a different equation, \( \mbox{MHR} = 208 - 0.7 \cdot \mbox{Age} \), is better: it makes smaller errors, on average.
Top three reasons for fitting an equation: making predictions, summarizing how \( y \) changes with \( x \), and making fair comparisons that adjust for \( x \).
Our heart rate example illustrates all three of these concepts.
Alice is 28. What is her predicted max heart rate?
Our equation expresses the conditional expected value of MHR, given a known value of age:
\[ \mbox{E(MHR | Age = 28)} = 208 - 0.7 \cdot 28 = 188.4 \]
This is our best guess without actually putting Alice on a treadmill test until she vomits.
How does max heart rate tend to change with age?
\[ \mbox{E(MHR | Age)} = 208 - 0.7 \cdot \mbox{Age} \]
So about 0.7 BPM slower, on average, with every additional year we age.
This isn't a guarantee that your MHR will decline at this rate; it's just a population-level average.
A third, very common use of regression modeling is to make fair comparisons that adjust for the systematic effect of some common variable.
It's kind of like asking: how big of a head start should The Freeze get?
This is not a fair race!
To make the race “fair” (and thus interesting), we need to adjust for how fast we expect these guys to run, given what we know about them.
Key fact: regression models are great at estimating conditional expectations.
Let's compare two people whose max heart rates were measured using an actual treadmill test: Alice (age 28, measured MHR of 185) and Abigail (age 55, measured MHR of 174).
Clearly Alice has a higher MHR, but let's make things fair! We need to give Abigail a “head start,” since max heart rate declines with age.
So who has a higher maximum heart rate for her age?
Key idea: compare actual MHR with expected MHR.
Alice's actual MHR is 185, versus an expected MHR of 188.4
\[ \begin{aligned} \mbox{Actual} - \mbox{Predicted} &= 185 - (208 - 0.7 \cdot 28) \\ &= 185 - 188.4 \\ & = -3.4 \end{aligned} \]
Abigail's actual MHR is 174, versus an expected MHR of 169.5
\[ \begin{aligned} \mbox{Actual} - \mbox{Predicted} &= 174 - (208 - 0.7 \cdot 55) \\ &= 174 - 169.5 \\ & = 4.5 \end{aligned} \]
So Abigail has a lower absolute MHR, but a higher age-adjusted MHR. Her “head start” was the difference between her and Alice's expected MHRs: 188.4 - 169.5 = 18.9.
The equation that relates MHR to age shows us how to place everyone on a level playing field, regardless of age. There are a lot of synonyms for this idea: adjusting for age, controlling for age, holding age constant.
It's all just subtraction! Compare the difference between actual and expected outcomes.
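If you want to check the arithmetic, here's a quick sketch of that comparison in R, using the numbers from the example above (nothing here depends on any particular data file):

    age      <- c(Alice = 28, Abigail = 55)
    actual   <- c(Alice = 185, Abigail = 174)
    expected <- 208 - 0.7 * age      # expected MHR, given age
    actual - expected                # Alice: -3.4, Abigail: +4.5 (age-adjusted comparison)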
The workhorse here is a linear model:
\[ y_i = \beta_0 + \beta_1 x_i + e_i \]
Notation: \( y_i \) is the response (outcome) for case \( i \), \( x_i \) is the predictor, \( \beta_0 \) is the intercept, \( \beta_1 \) is the slope, and \( e_i \) is the residual (model error). The fitted value is the conditional expectation:
\[ \hat{y} = E(y \mid x) = \beta_0 + \beta_1 x \]
“Fitting a model” = choosing \( \beta_0 \) and \( \beta_1 \) to make the model errors as small as possible on your data set. In practice “as small as possible” means “least squares.” Define the loss function
\[ \begin{aligned} l(\beta_0, \beta_1) &= \sum_{i=1}^N e_i^2 \\ &= \sum_{i=1}^N \left[ y_i - (\beta_0 + \beta_1 x_i) \right]^2 \end{aligned} \]
Ordinary least squares (OLS): choose \( \beta_0 \) and \( \beta_1 \) to make \( l(\beta_0, \beta_1) \) as small as possible, i.e. to minimize the sum of squared model errors.
It's a straightforward calculus problem to show that the OLS solution is:
\[ \begin{aligned} \hat{\beta}_1 &= \frac{ \sum_{i=1}^N (x_i - \bar{x}) (y_i - \bar{y}) } {\sum_{i=1}^N (x_i - \bar{x})^2 } \\ \hat{\beta}_0 &= \bar{y} - \hat{\beta}_1 \bar{x} \end{aligned} \]
Don't bother memorizing this! Every statistical package on the planet has it built in. (Plus, they'll probably make you derive it in Econometrics.)
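To see that there's nothing exotic here, below is a minimal sketch in R with simulated data (the numbers are made up purely for illustration), showing that the closed-form formulas above match what lm() reports:

    # simulate a small fake data set
    set.seed(1)
    x <- runif(50, 20, 60)
    y <- 208 - 0.7 * x + rnorm(50, 0, 8)

    # closed-form OLS solution
    beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
    beta0_hat <- mean(y) - beta1_hat * mean(x)
    c(beta0_hat, beta1_hat)

    # same answer from R's built-in routine
    coef(lm(y ~ x))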
Instead, let's focus on using the fitted equation for our three goals: prediction, summarizing the relationship, and making adjusted comparisons.
Here are a few lines from a data set on Austin restaurants c.2013:
Name                   Type               FoodScore  FeelScore  Price ($ per person)
Franklin Barbecue      Barbecue                 9.5        5.5    15
Kerbey Lane Cafe       Vegefusion               6.5        8.5    20
Shoal Creek Saloon     Southern                 4.6        8.5    25
Uchi                   Japanese, Modern         9.8        8.5    85
Second Bar + Kitchen   Modern                   8.7        9.0    50
Lamberts               Southwestern             8.5        9.0    75
Our fitted line (from OLS):
\[ \mbox{Price} = -6.2 + 7.9 \cdot \mbox{FoodScore} + \mbox{Error} \]
Suppose you're opening a new restaurant and you've hired a chef with a proven track record of cooking at a 7.5 level.
What price would you expect the Austin market to support for an average meal at your restaurant?
We know that \( x = 7.5 \). Remember that our fitted equation expresses the conditional expected value of \( y \), for a known value of \( x \).
So in light of our data on the Austin market, our best guess for price is
\[ E(y \mid x = 7.5) = -6.2 + 7.9 \cdot 7.5 = 53.05 \]
Maybe a sensible starting point for thinking about pricing.
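In R, this kind of prediction might look like the sketch below. It assumes the restaurant data lives in afc.csv with columns named Price and FoodScore, matching the excerpt above; the actual file on the class website may differ slightly.

    afc <- read.csv("afc.csv")

    # fit price versus food score by OLS
    fit <- lm(Price ~ FoodScore, data = afc)
    coef(fit)   # roughly: intercept -6.2, slope 7.9

    # expected price for a restaurant with a 7.5-level chef
    predict(fit, newdata = data.frame(FoodScore = 7.5))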
What dollar value does the Austin restaurant market seem to place on one extra point of food deliciousness?
Recall our equation
\[ E(y \mid x) = -6.2 + 7.9 \cdot x \]
That's $7.90 per point of deliciousness, on average.
What's the “best value” restaurant in Austin, i.e. the one that offers the most delicious food for its price?
Recall our key idea: compare actual with predicted.
No surprises here: it's Franklin Barbecue! Actual price is $15 per person; predicted price is nearly $70.
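Here's one hedged sketch of how you might find this in R, again assuming columns named Name, Price, and FoodScore in afc.csv:

    afc <- read.csv("afc.csv")
    fit <- lm(Price ~ FoodScore, data = afc)

    # actual minus predicted price: large negative values = good deals
    value <- afc$Price - fitted(fit)
    afc$Name[which.min(value)]   # the best-value restaurant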
Let's dig in to afc_intro.R and afc.csv on the class website.
Download the data in creatinine.csv from the course website. Each row is a patient in a doctor's office.
Load this data into RStudio and start with a blank script.
Use this data, coupled with your knowledge of linear modeling, to answer three questions:
Ordinary least squares is built for linear models. However, we can also use OLS to fit certain kinds of nonlinear models. We'll focus on four: polynomial models, piecewise polynomials (splines), exponential growth and decay models, and power laws.
These models are special! (Most nonlinear models cannot be fit using ordinary least squares.)
Data on gas consumption versus temperature for a single-family house in Minnesota:
The linear model doesn't fit so well!
But a quadratic model does!
\[ \mbox{Gas Bill} = \$289 - 6.4 \cdot \mbox{Temp} + 0.03 \cdot \mbox{Temp}^2 + \mbox{Residual} \, . \]
In general, a polynomial model of degree \( K \) takes the form
\[ E(y \mid x) = \beta_0 + \sum_{j = 1}^K \beta_j x^j \]
This model is nonlinear in \( x \), but it can still be fit using OLS.
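For example, the quadratic gas-bill model above can be fit with plain lm(). The sketch below assumes the data has columns named gasbill and temp; check the actual utilities.csv for the real column names.

    utilities <- read.csv("utilities.csv")

    # quadratic model: nonlinear in temp, but linear in the coefficients, so OLS applies
    fit2 <- lm(gasbill ~ temp + I(temp^2), data = utilities)
    coef(fit2)

    # equivalent fit using poly() with raw = TRUE
    fit2b <- lm(gasbill ~ poly(temp, 2, raw = TRUE), data = utilities)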
There is a temptation to get a better fit by choosing a larger \( K \). This can get ridiculous:
For most data sets, beyond \( K=2 \) (quadratic) or \( K=3 \) (cubic), we rapidly get into dangerous over-fitting territory:
Over-fitting occurs when a model just memorizes random noise in the data set, rather than describes the systematic relationship between \( x \) and \( y \).
Severely overfit models usually have one or two dead giveaways:
In later courses, you'll learn formal diagnostics for over-fitting. In the meantime: you'll pretty much know it when you see it.
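If you want to see over-fitting for yourself, here's a purely illustrative sketch with simulated data: a quadratic fit versus a degree-12 polynomial on the same noisy sample. The high-degree fit always wins on in-sample error, but its curve wiggles wildly between the observed points.

    set.seed(42)
    x <- seq(0, 70, length.out = 30)
    y <- 280 - 6 * x + 0.04 * x^2 + rnorm(30, 0, 25)

    fit_quad <- lm(y ~ poly(x, 2))
    fit_wild <- lm(y ~ poly(x, 12))

    # the degree-12 model has smaller in-sample error...
    c(quadratic = sum(resid(fit_quad)^2), degree12 = sum(resid(fit_wild)^2))

    # ...but its fitted curve oscillates between the data points
    x_grid <- seq(0, 70, length.out = 500)
    plot(x, y)
    lines(x_grid, predict(fit_quad, data.frame(x = x_grid)), lty = 1)
    lines(x_grid, predict(fit_wild, data.frame(x = x_grid)), lty = 2)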
Let's dive in to utilities.csv and utilities.R.
Another nice extension: piecewise-polynomial models, also known as splines.
To fit a spline, we divide the range of the \( x \) variable into nonoverlapping intervals \( I_0, I_1, I_2, \cdots, I_K \). The breakpoints between the intervals are called knots.
We then fit a polynomial equation separately on each interval, “gluing” the individual polynomials together so that the overall curve is smooth.
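In R, the bs() function from the splines package builds the piecewise-polynomial basis for you, so the whole model can still be fit with lm(). Here's a minimal sketch on simulated data (the race example below does the same thing with real data):

    library(splines)

    # simulated data with a wiggly trend
    set.seed(5)
    x <- runif(200, 0, 10)
    y <- sin(x) + 0.1 * x^2 + rnorm(200, 0, 0.3)

    # knots at the quartiles of x; bs() gives a cubic piecewise-polynomial basis
    knots <- quantile(x, probs = c(0.25, 0.5, 0.75))
    fit_spline <- lm(y ~ bs(x, knots = knots))

    # plot the smooth fitted curve over a fine grid
    x_grid <- seq(0, 10, length.out = 400)
    plot(x, y, col = "grey")
    lines(x_grid, predict(fit_spline, data.frame(x = x_grid)))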
Here's some data on finishing times from runners in the 10-mile Cherry Blossom Road Race in Washington, D.C., held every April:
state time net age sex
1 MD 5100 5032 58 M
2 PA 4675 4400 26 M
3 DC 6564 6564 25 M
4 NJ 5134 5031 35 M
5 NY 5669 5187 37 F
6 MD 4478 4461 23 F
7 VA 7611 7259 34 M
8 VA 6075 5699 25 M
Three knots create four disjoint intervals (knots at the 25th, 50th, and 75th percentiles of age).
Separate polynomials on each interval, glued together in a smooth fashion. What might explain the non-monotone behavior? (Code in race_splines.R.)
There are probably at least two things going on here:
Bottom line: the flexible piece-wise polynomial (spline) model allowed us to practice good data science. We could:
An exponential growth or decay model looks like this:
\[ y = \alpha e^{\beta_1 x} \]
where \( \alpha \) is the value of \( y \) at \( x = 0 \), and \( \beta_1 \) is the growth rate (if positive) or decay rate (if negative).
This formula comes from an analogy with continuously compounded interest. Suppose you start with \( \alpha \) dollars, invested at rate \( \beta_1 \) and compounded \( n \) times annually. Then after \( t \) years, your investment is worth
\[ y = \alpha \left(1 + \frac{\beta_1}{n} \right)^{nt} \]
If we take the limit as \( n \) gets large, we get
\[ y = \alpha e^{\beta_1 t} \]
This is a non-linear model, but it's easy to fit using OLS!
Here's the trick: if \( y = \alpha e^{\beta_1 x} \), then
\[ \begin{aligned} \log(y) &= \log \left( \alpha e^{\beta_1 x} \right) \\ &= \log \alpha + \beta_1 x \end{aligned} \]
This is a linear function after all: not in \( y \) versus \( x \), but in \( \log(y) \) versus \( x \).
This gives us a simple recipe: Fit a linear regression model where the \( y \) variable has been log-transformed:
\[ \log(y_i) = \beta_0 + \beta_1 x_i + e_i \]
This tells us the parameters of our exponential growth or decay model: \( \hat{\alpha} = e^{\hat{\beta}_0} \), and the growth (or decay) rate is \( \hat{\beta}_1 \).
See example in ebola.R.
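Here's a minimal sketch of that recipe in R with simulated data (so the specific numbers mean nothing); the variable names are invented for illustration:

    set.seed(7)
    x <- 0:40                                           # e.g. days since an outbreak began
    y <- 50 * exp(0.08 * x) * exp(rnorm(41, 0, 0.1))    # noisy exponential growth

    # fit the linear model on the log scale
    fit_exp <- lm(log(y) ~ x)

    # recover the parameters of y = alpha * exp(beta1 * x)
    alpha_hat <- exp(coef(fit_exp)[1])
    beta1_hat <- coef(fit_exp)[2]
    c(alpha_hat, beta1_hat)    # should be close to 50 and 0.08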
A similar trick works for power laws, where
\[ y = \alpha x^{\beta_1} \]
This is a very common model in microeconomics, where we often use power laws to model change in consumer demand as a function of price:
\[ Q = K P^E \]
where \( Q \) is quantity demanded, \( P \) is price, \( E \) is price elasticity of demand (PED), and \( K \) is a constant.
We can fit a model like this using OLS as well.
Here's the trick: if \( y = \alpha x^{\beta_1} \), then
\[ \begin{aligned} \log(y) &= \log \left( \alpha x^{\beta_1} \right) \\ &= \log \alpha + \beta_1 \log(x) \end{aligned} \]
This is also a linear function: not in \( y \) versus \( x \), but in \( \log(y) \) versus \( \log(x) \).
Now we fit a linear regression model with both variables log-transformed:
\[ \log(y_i) = \beta_0 + \beta_1 \log(x_i) + e_i \]
This tells us the parameters of our power-law model: \( \hat{\alpha} = e^{\hat{\beta}_0} \), and the exponent is \( \hat{\beta}_1 \) (in the demand example, the price elasticity \( E \)).
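As a sketch, here's the same recipe with simulated demand data and invented variable names (not the milk data itself):

    set.seed(3)
    price    <- runif(100, 1, 5)
    quantity <- 110 * price^(-1.5) * exp(rnorm(100, 0, 0.2))   # true elasticity: -1.5

    # fit the power law on the log-log scale
    fit_pow <- lm(log(quantity) ~ log(price))

    K_hat <- exp(coef(fit_pow)[1])   # the constant K
    E_hat <- coef(fit_pow)[2]        # the price elasticity of demand
    c(K_hat, E_hat)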
Your turn! In milk.csv, you're given a data set on consumer demand for milk, along with price. Your ultimate goal is to answer the question: how much should the store charge for a carton of milk?
Let's dive in to the case study on milk prices on the class website.
Remember: demand versus price often follows a power law!