Linear Regression

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.1.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

Regression, much like t-tests and correlations, is all about relationships. What is the relationship between X and Y? Or between X, Y and Z?

For very simple data, this is easy enough to see. You can just plot it:

x<-(1:50)
y<-(100:149)
example_data<-data.frame(x=x,y=y)

ggplot(example_data,aes(x=x,y=y)) + geom_point()

We see here that as "X" goes up, "Y" goes up. In fact, they seem to move together exactly. Just for fun, we can add a regression line to this data to make sure:

ggplot(example_data,aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm',color='blue')
## `geom_smooth()` using formula = 'y ~ x'

geom_smooth(method = 'lm') fits a regression line to the data.

R-squared: the percentage of the variance in the dependent variable explained by the independent variables. In other words, how much of the variation in the dots can be explained by the line?

In this model, there are no errors: both X and Y match the expected model (the line) perfectly.

Remember last week when we did some work in regression? What would this perfect relationship look like, mathematically? Let’s run a regression!

example_model<-lm(x~y,data=example_data)

summary(example_model)
## Warning in summary.lm(example_model): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = x ~ y, data = example_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.034e-14 -4.609e-16  6.939e-16  1.400e-15  4.102e-15 
## 
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)    
## (Intercept) -9.900e+01  4.216e-15 -2.348e+16   <2e-16 ***
## y            1.000e+00  3.364e-17  2.972e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.433e-15 on 48 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.836e+32 on 1 and 48 DF,  p-value: < 2.2e-16

The first thing you'll notice is that we get a warning: "essentially perfect fit: summary may be unreliable". This is the computer telling you that the data has such a perfect relationship that you should be deeply suspicious of it. And you should be – we made it up, after all. In real life, data is never this perfectly related.

The second thing you might notice is that there is an adjusted R-squared of “1” – this means that the model (the line) explains 100% of the variance in the data. Or, to put it another way, the model perfectly explains the data. Again, in real life, this is unlikely-to-impossible.

How about data that looks a bit more realistic:

ggplot(mtcars,aes(x=mpg,y=hp)) + geom_point() + ggtitle("MPG vs. HP in MtCars")

This is real-world data. What do you notice? The relationship between mpg and hp is broadly negative: as HP goes up, MPG tends to go down. But it’s not a perfectly straight line, is it?

ggplot(mtcars,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line")
## `geom_smooth()` using formula = 'y ~ x'

Notice that the regression line touches only a few of the dots. This means the model perfectly predicts only a few values; the rest sit above or below the predicted line.

We can do this mathematically as well:

hp_mpg_model<-lm(hp~mpg,data=mtcars)
summary(hp_mpg_model)
## 
## Call:
## lm(formula = hp ~ mpg, data = mtcars)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -59.26 -28.93 -13.45  25.65 143.36 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   324.08      27.43  11.813 8.25e-13 ***
## mpg            -8.83       1.31  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 43.95 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

The adjusted R-squared is only about 59% (so this regression explains about 59% of the variance in the data). [MPG explains about 59% of the variance in horsepower] It's not bad, but it's far from perfect.

What is the estimate? For every one-unit increase in MPG, horsepower goes down by 8.83. The estimate tells you the magnitude (and direction) of the effect.
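If you want to pull that estimate straight out of the model object (rather than reading it off the summary table), a quick sketch:

# the coefficients of the model: the mpg entry should be the same -8.83 from the summary
coef(hp_mpg_model)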

We can further visualize the prediction and error of the model by attaching it to the mtcars dataframe:

cars_test<-mtcars %>% mutate(predicted_hp_value=predict(hp_mpg_model)) %>% select(hp,mpg,predicted_hp_value)

# and now the error:

cars_test<-cars_test %>% mutate(predicted_error=hp-predicted_hp_value)

head(cars_test)
##                    hp  mpg predicted_hp_value predicted_error
## Mazda RX4         110 21.0           138.6580       -28.65796
## Mazda RX4 Wag     110 21.0           138.6580       -28.65796
## Datsun 710         93 22.8           122.7644       -29.76445
## Hornet 4 Drive    110 21.4           135.1261       -25.12607
## Hornet Sportabout 175 18.7           158.9663        16.03366
## Valiant           105 18.1           164.2642       -59.26418

The predicted_hp_value is the regression model's guess at what the horsepower will be, based only on the mpg.

If you wanted to view your residuals, you don't even need to calculate them (like we did above). You can just call "residuals()" on the model that we created:

cars_test$model_error<-residuals(hp_mpg_model)
head(cars_test)
##                    hp  mpg predicted_hp_value predicted_error model_error
## Mazda RX4         110 21.0           138.6580       -28.65796   -28.65796
## Mazda RX4 Wag     110 21.0           138.6580       -28.65796   -28.65796
## Datsun 710         93 22.8           122.7644       -29.76445   -29.76445
## Hornet 4 Drive    110 21.4           135.1261       -25.12607   -25.12607
## Hornet Sportabout 175 18.7           158.9663        16.03366    16.03366
## Valiant           105 18.1           164.2642       -59.26418   -59.26418

They’re the same!
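If you'd rather have R confirm it than eyeball the table, a quick sketch:

# the largest difference between the hand-made errors and residuals() should be essentially zero
max(abs(cars_test$predicted_error - cars_test$model_error))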

The errors can now be visualized like so (you don’t have to know how to do this but it’s a fun lil exercise). This is useful for judging different models:

ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line AND errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=predicted_hp_value),linetype = 'dashed') 
## `geom_smooth()` using formula = 'y ~ x'

You may be wondering: how does R "know" which line fits the data the best? Without getting too technical, R chooses the line that makes the total squared error (the sum of the squared residuals) as small as possible – this is called "least squares".
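More precisely, with a single predictor the least-squares line can be written down directly: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept follows from the means. A small sketch (the object names here are just for illustration):

# ordinary least squares "by hand" for hp ~ mpg
slope_by_hand     <- cov(mtcars$mpg, mtcars$hp) / var(mtcars$mpg)
intercept_by_hand <- mean(mtcars$hp) - slope_by_hand * mean(mtcars$mpg)

# these should match coef(hp_mpg_model): roughly 324.08 and -8.83
c(intercept_by_hand, slope_by_hand)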

The “R squared” measurement we have been talking about can be derived manually. I do that here just to give you an idea of where the measurement is coming from:

# let's add the square of the model error, along with the mean horsepower for the whole dataset

cars_test<-cars_test %>% mutate(residual_sq=model_error*model_error)
cars_test<-cars_test %>% mutate(mean_hp=mean(hp))

# now we can get the total sum of squares: subtract the mean HP from each car's HP, square, and sum:

total_sum_of_squares<-sum((cars_test$hp-cars_test$mean_hp)^2)

# now let's get the residual sum of squares (hp minus the predicted HP value, squared, then summed):

residual_sum_of_squares<-sum((cars_test$hp-cars_test$predicted_hp_value)^2)

# one minus the ratio of the residual to the total sum of squares gives the r-squared: a measurement of how much variance the model "explains":

hp_r_squared<-1-(residual_sum_of_squares/total_sum_of_squares)
print(hp_r_squared)
## [1] 0.6024373

This matches the Multiple R-squared from the summary above (0.6024); the adjusted R-squared we quoted earlier (.59) is slightly lower because it penalizes the model for the number of predictors.

Of course, there is an easier way to get it:

summary(hp_mpg_model)$r.squared
## [1] 0.6024373

This is another way of saying “the regression model here explains 60% of the variance”, which can be said even more simply: “the regression model explains 60% of the data”.

So, practically speaking, 60% of horsepower can be “explained” by miles per gallon.

MULTIPLE REGRESSION

The really useful thing about regression (as we discussed in the last lecture) is that we can start to fit models not just on one variable (hp vs. mpg, for example), but upon multiple variables (hp vs. mpg, wt, disp, and so forth).

Think about this: does mpg really explain hp? That is, does a car’s mpg really cause the horsepower to go up or down? MPG is a proxy for something else…

It makes more sense when you realize that mpg is probably a proxy for different variables: weight, or maybe engine power. More weight and more engine power tend to make mpg go down, while making HP go up. Engine power is probably the biggest contributor to HP, while weight and HP are only somewhat related.

But how can we get this from the data?

# let's add "wt" (weight) and "cyl" (number of cylinders) to the linear regression. They appear after the tilde (~).

cars_multiple<-lm(hp~mpg+wt+cyl,data=mtcars)

summary(cars_multiple)
## 
## Call:
## lm(formula = hp ~ mpg + wt + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.041 -21.622  -7.177  15.235 125.753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  115.662    113.207   1.022  0.31567   
## mpg           -4.220      2.778  -1.519  0.14002   
## wt           -12.135     14.382  -0.844  0.40596   
## cyl           25.025      7.486   3.343  0.00237 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.41 on 28 degrees of freedom
## Multiple R-squared:  0.7165, Adjusted R-squared:  0.6861 
## F-statistic: 23.58 on 3 and 28 DF,  p-value: 8.064e-08

R-squared went up! But the p-values for the individual variables went up too. The p-value for the whole model is still very small, but the p-value for mpg is now above the critical value of .05. When you account for weight and cylinders, mpg is no longer a significant predictor of horsepower. Why? Whatever was in mpg that was 'causing' horsepower is better represented by cylinders. MPG was not as good a predictor of horsepower as we originally thought.

As we expected earlier, mpg and wt have evaporated in significance (they have p-values well above 0.05). The only variable that is still significant is cyl (number of cylinders – basically, engine power). We also see that the r-squared has increased from around 60% to almost 70%.

Again, this is because "mpg" was a poor variable for predicting HP. MPG doesn't really explain HP – it was just standing in for CYL, the variable that actually drives HP.
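One way to see the "proxy" problem in the data itself is to check how strongly mpg and cyl move together; a quick sketch:

# mpg and cyl are strongly (negatively) correlated in mtcars (around -0.85),
# so once cyl is in the model there is little left for mpg to explain
cor(mtcars$mpg, mtcars$cyl)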

How do you know if a variable is causing something – or just a proxy for the actual cause? This is extremely complex and there are no great answers; ultimately, though, this question is answered in your literature review. You have to convince the reader that "variables x, y and z" actually cause "effect a", rather than just being associated with it or being a proxy for something else.

The whole kitchen sink

Here’s an idea: why not just throw the whole kitchen sink into the linear regression, and see what “sticks”? Isn’t this a good way to come up with cause and effect?

Short answer: no.

Longer answer: eh. maybe. sort of. it depends on who you ask.

Even longer answer:

#dependent value is still horsepower
kitchen_sink_model<-lm(hp~wt+cyl+disp+drat+qsec+vs+am+gear+carb+mpg,data=mtcars)
#if you think two variables have a strong effect together (an interaction), you can use a * instead of a +
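# (a hedged sketch, not run here: hp ~ wt*cyl expands to wt + cyl + wt:cyl,
#  i.e. both variables plus their interaction term)
# interaction_example <- lm(hp ~ wt*cyl, data = mtcars)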
summary(kitchen_sink_model)
## 
## Call:
## lm(formula = hp ~ wt + cyl + disp + drat + qsec + vs + am + gear + 
##     carb + mpg, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.681 -15.558   0.799  18.106  34.718 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  79.0484   184.5041   0.428  0.67270   
## wt          -27.6600    19.2704  -1.435  0.16591   
## cyl           8.2037    10.0861   0.813  0.42513   
## disp          0.4390     0.1492   2.942  0.00778 **
## drat         -4.6185    16.0829  -0.287  0.77680   
## qsec         -1.7844     7.3639  -0.242  0.81089   
## vs           25.8129    19.8512   1.300  0.20758   
## am            9.4863    20.7599   0.457  0.65240   
## gear          7.2164    14.6160   0.494  0.62662   
## carb         18.7487     7.0288   2.667  0.01441 * 
## mpg          -2.0631     2.0906  -0.987  0.33496   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.97 on 21 degrees of freedom
## Multiple R-squared:  0.9028, Adjusted R-squared:  0.8565 
## F-statistic:  19.5 on 10 and 21 DF,  p-value: 1.898e-08

On the surface, it appears that we’ve created a better model of HP: when we throw everything into the linear model, we discover that the only two significant variables are “disp” (engine displacement - size) and “carb” (number of carburetors in the engine). R-squared has increased to almost 90%. So what’s the problem?

[for every cubic inch of displacement added, horsepower goes up by .4390 units] The Estimate in the (Intercept) line is the horsepower of a hypothetical car where all the other measurements are 0 – so the predicted horsepower would be about 79 if wt/cyl/disp/etc. were all 0.

Reminder- adjusted R-squared is the best one to pay attention to

In this specific case, engine size and number of carburetors are still just a proxy for engine power. If you think about it, increasing an engine’s size doesn’t automatically increase HP. Imagine if you made a huge engine 100 feet long but instead of extra cylinders, you just filled it up full of sand. You get the idea. Size is still a proxy for “something else” that is not being fully captured by the measurements.

Probably the best thing we can say about this model is that “disp” and “carb” are the best proxies for what HP actually is: the ability of the engine to move the weight of the car a certain distance.

tl;dr: if you try to “kitchen sink” everything, you might be fooled into thinking that the effect is being “caused” by something that is itself a proxy for something else. The variables you put into the model should be guided by the literature….

The best thing to do is have a good theoretical understanding of which variables are most important and why. This typically comes out in your literature review.

A Note on Interpreting Coefficients

In addition to seeing which variables are most significant, multiple regression allows us to "estimate" the size of a significant effect on a dependent variable, as our cars_multiple model shows:

summary(cars_multiple)
## 
## Call:
## lm(formula = hp ~ mpg + wt + cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -60.041 -21.622  -7.177  15.235 125.753 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  115.662    113.207   1.022  0.31567   
## mpg           -4.220      2.778  -1.519  0.14002   
## wt           -12.135     14.382  -0.844  0.40596   
## cyl           25.025      7.486   3.343  0.00237 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 38.41 on 28 degrees of freedom
## Multiple R-squared:  0.7165, Adjusted R-squared:  0.6861 
## F-statistic: 23.58 on 3 and 28 DF,  p-value: 8.064e-08

[hp is the dependent variable, cyl is the independent variable. As cyl goes up by 1, horsepower goes up by 25.025]

“CYL” has an “Estimate” (sometimes called a “Beta” or “Beta coefficient”) of 25.025. Because CYL is significant, this means that, all things being equal, if you add an extra cylinder to a car, it increases the HP by 25 points.

"WT" is not significant in this model, but let's pretend it was. WT has an Estimate of -12.1. This means that every time "WT" goes up by 1, "HP" goes down by 12.1.

In this case, "WT" is the weight in thousands of pounds (the mtcars unit for wt is 1,000 lbs), so it makes sense that an extra 1,000 lbs of weight goes along with lower HP.

An easy way to interpret the Estimate is by saying: "when [independent variable] goes up by 1, the [dependent variable] goes [up/down] by [Estimate]".
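One way to convince yourself of this is to predict HP for two hypothetical cars that differ only in cylinders; a small sketch (the two cars below are made up purely for illustration):

# two made-up cars: identical mpg and wt, but 4 vs. 5 cylinders
two_cars <- data.frame(mpg = c(20, 20), wt = c(3, 3), cyl = c(4, 5))

# the difference between the two predictions should equal the cyl Estimate (about 25)
diff(predict(cars_multiple, newdata = two_cars))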

diamonds_price<-lm(price~carat+depth+table,data=diamonds)
summary(diamonds_price)
## 
## Call:
## lm(formula = price ~ carat + depth + table, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18288.0   -785.9    -33.2    527.2  12486.7 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13003.441    390.918   33.26   <2e-16 ***
## carat        7858.771     14.151  555.36   <2e-16 ***
## depth        -151.236      4.820  -31.38   <2e-16 ***
## table        -104.473      3.141  -33.26   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1526 on 53936 degrees of freedom
## Multiple R-squared:  0.8537, Adjusted R-squared:  0.8537 
## F-statistic: 1.049e+05 on 3 and 53936 DF,  p-value: < 2.2e-16

An increase of 1 carat (a measurement of weight) tends to make the price of a diamond go up by how much? $7,858.77

An increase in the depth measurement by "1" tends to make the price of a diamond go down by how much? By about $151.24 (the Estimate is -151.24).
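Those numbers are just the Estimates from the coefficient table; if you want to grab them in code rather than reading them off the summary, a quick sketch:

# all the estimates, then the carat estimate on its own
coef(diamonds_price)
coef(diamonds_price)["carat"]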

Testing assumptions

Linear Regression (single and multiple) has some assumptions that must be met for the p-values to be accurate. They are:

  1. Linearity of variables
  2. Independence of errors
  3. Homoscedasticity
  4. Normality of residuals
  5. No multicollinearity

We could spend a whole semester discussing these, and we will spend most of the next class discussing them, but for now here is the most important thing to know:

  1. Linearity of variables: the variables in the model should have a linear relationship with each other. This is easy to visualize with two variables (just do a scatterplot), but with multiple variables it's harder. The easiest check is to call "plot(x, which=1)" where "x" is the linear model.

Here is the cars_multiple model:

plot(cars_multiple,which=1)

Here we see that the line of residuals versus fitted (the red line) is mostly straight. This suggests that the relationship between variables is mostly linear.

For an example of one that isn’t linear, we can look at the “kitchen sink” model we created earlier:

plot(kitchen_sink_model,which=1)

The residuals-versus-fitted line is clearly curved (and much more curved than in the previous model). This means the assumption of linearity is probably violated. There is no strict mathematical cutoff – it's more a matter of "eyeing" it. Consequently, because the "kitchen sink" model does not meet the linearity assumption, its (higher) r-squared values should not be trusted.
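We'll look at the other assumptions properly next class, but base R's plot() on a model already gives quick visual checks for a couple of them; a hedged sketch using the cars_multiple model:

# normality of residuals: points should hug the diagonal line in the Q-Q plot
plot(cars_multiple, which = 2)

# homoscedasticity: the scale-location plot should look like a flat, even band
plot(cars_multiple, which = 3)

# a rough multicollinearity check: very high correlations between predictors are a warning sign
cor(mtcars[, c("mpg", "wt", "cyl")])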

Best practice- include only the variables the literature supports, and as few of them as possible

ADVANCED TECHNIQUE

BONUS: there is a method that probably fits this data better than a plain straight-line model:

hp_mpg_poly<-lm(hp~poly(mpg,degree = 2),data=mtcars)
cars_test$pred_poly<-predict(hp_mpg_poly,newdata=cars_test)


ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line AND errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=predicted_hp_value),linetype = 'dashed') 
## `geom_smooth()` using formula = 'y ~ x'

ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='gam',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With polynomial regression and errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=pred_poly),linetype = 'dashed')
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

This is called a polynomial model. It's similar to a linear model, but it accounts for curvature in the relationship between hp and mpg. You can tell that the dashed error lines are a bit shorter for the polynomial model, which means the curve fits the dots a bit better than just a straight line. The polynomial model basically lets a linear model curve a little bit!
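One note: the blue curve in the second plot above was drawn with geom_smooth(method='gam'), which is a different (but similar-looking) kind of flexible fit. If you want the curve to be the actual degree-2 polynomial fit itself, you can hand geom_smooth the formula; a sketch:

# same plot, but the smooth line is the degree-2 polynomial fit
ggplot(cars_test, aes(x = mpg, y = hp)) +
  geom_point() +
  geom_smooth(method = 'lm', formula = y ~ poly(x, 2), color = 'blue') +
  geom_segment(aes(xend = mpg, yend = pred_poly), linetype = 'dashed') +
  ggtitle("MPG vs. HP in MtCars", subtitle = "With the degree-2 polynomial fit and errors")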

There is also a mathematical way to determine how well a model fits the data:

AIC(hp_mpg_model,hp_mpg_poly)
##              df      AIC
## hp_mpg_model  3 336.8553
## hp_mpg_poly   4 335.3593

The AIC for hp_mpg_poly is slightly lower than for hp_mpg_model; a lower AIC means a better balance of fit and complexity, so hp_mpg_poly is a slightly better model.

The downside to polynomial models is that they are more difficult to interpret than a linear model, so don’t worry too much about them right now.

HOMEWORK

  1. Load your chosen dataset into Rmarkdown
  2. Select the dependent variable you are interested in, along with independent variables which you believe are causing the dependent variable [between 2-5]
  3. create a linear model using the “lm()” command, save it to some object [like we did with cars_multiple]
  4. call a “summary()” on your new model
  5. interpret the model’s r-squared and p-values. How much of the dependent variable does the overall model explain? What are the significant variables? What are the insignificant variables?
  6. Choose some significant independent variables. Interpret their Estimates (or Beta Coefficients). How do the independent variables individually affect the dependent variable? [if you have something significant, say this going up by 1 increases this by 1]
  7. Does the model you create meet or violate the assumption of linearity? Show your work with “plot(x,which=1)”

Class Example

theswiss<-swiss
#examine for relationships

pairs(theswiss)

Let’s try running some linear models on this!

model1<-lm(Fertility~Catholic+Agriculture+Infant.Mortality, data=theswiss)

Are these variables linear with each other?

plot(model1, which=1)

There's one district on the left whose fertility is well out of the norm. If we were writing a paper, we could run the model with and without that district to see whether the results change (a quick sketch of that is below).
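If you did want to try that, here is a hedged sketch: it drops whichever district has the single largest (absolute) residual and refits, which is one crude way to check how sensitive the model is.

# find the row with the biggest absolute residual and refit without it
biggest_resid <- which.max(abs(residuals(model1)))

model1_trimmed <- lm(Fertility ~ Catholic + Agriculture + Infant.Mortality,
                     data = theswiss[-biggest_resid, ])

# compare the two sets of estimates side by side
cbind(full = coef(model1), trimmed = coef(model1_trimmed))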

For the most part, though, the original model looks pretty linear. Good enough!

summary(model1)
## 
## Call:
## lm(formula = Fertility ~ Catholic + Agriculture + Infant.Mortality, 
##     data = theswiss)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.0367  -5.6899   0.3825   5.9029  18.9091 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)   
## (Intercept)      26.74755   11.27417   2.372  0.02221 * 
## Catholic          0.08778    0.04004   2.192  0.03383 * 
## Agriculture       0.14229    0.07253   1.962  0.05627 . 
## Infant.Mortality  1.63342    0.52621   3.104  0.00337 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.12 on 43 degrees of freedom
## Multiple R-squared:  0.3859, Adjusted R-squared:  0.343 
## F-statistic: 9.007 on 3 and 43 DF,  p-value: 9.578e-05

As infant mortality goes up by 1, fertility goes up by 1.63. The p value is very small. It is significant.

R-squared– the variables explain about 34% of the variance in fertility (the adjusted R-squared is .343). There's a lot that goes into fertility that is not explained by these variables.

Agriculture is on the edge of significance– .056.

R-squared- how much of the dependent variable is being explained by the independent variable(s)

Binary variables– we can use them in a model, but the estimate is interpreted differently (it ends up being about probabilities). We'll talk about that more later…

(note to self- I can use nonprofit/private/government nursing homes and the total health score)