library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
Regression, much like t-tests and correlations, is all about relationships. What is the relationship between X and Y? Or between X, Y and Z?
For very simple data, this is easy enough to see. You can just plot it:
x<-(1:50)
y<-(100:149)
example_data<-data.frame(x=x,y=y)
ggplot(example_data,aes(x=x,y=y)) + geom_point()
We see here that as “X” goes up, “Y” goes up. In fact, they seem to go with each other exactly. Just for fun, we can add a regression model to this data just to make sure:
ggplot(example_data,aes(x=x,y=y)) + geom_point() + geom_smooth(method='lm',color='blue')
## `geom_smooth()` using formula = 'y ~ x'
geom_smooth method=lm fits a regression line to the data
R-squared- percentage of the variance explained by the independent variables. How much of the line can be explained by the dots?
In this model, there are no errors: both X and Y match the expected model (the line) perfectly.
Remember last week when we did some work in regression? What would this perfect relationship look like, mathematically? Let’s run a regression!
example_model<-lm(x~y,data=example_data)
summary(example_model)
## Warning in summary.lm(example_model): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y, data = example_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.034e-14 -4.609e-16 6.939e-16 1.400e-15 4.102e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.900e+01 4.216e-15 -2.348e+16 <2e-16 ***
## y 1.000e+00 3.364e-17 2.972e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.433e-15 on 48 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.836e+32 on 1 and 48 DF, p-value: < 2.2e-16
The first thing you’ll notice is that we get a warning: “essentially perfect fit: summary may be unreliable”. This is the computer telling you that the data has such a perfect relationship, that you should be deeply suspicious of it. And you should – we made it up, after all. In real life, data is never this perfectly related.
The second thing you might notice is that there is an adjusted R-squared of “1” – this means that the model (the line) explains 100% of the variance in the data. Or, to put it another way, the model perfectly explains the data. Again, in real life, this is unlikely-to-impossible.
How about data that looks a bit more realistic:
ggplot(mtcars,aes(x=mpg,y=hp)) + geom_point() + ggtitle("MPG vs. HP in MtCars")
This is real-world data. What do you notice? The relationship between mpg and hp is broadly negative: as HP goes up, MPG tends to go down. But it’s not a perfectly straight line, is it?
ggplot(mtcars,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line")
## `geom_smooth()` using formula = 'y ~ x'
Notice that the regression line barely touches on only a few dots. This means that the model only perfectly predicts a few values. The rest of the values are sitting outside the predicted line.
We can do this mathematically as well:
hp_mpg_model<-lm(hp~mpg,data=mtcars)
summary(hp_mpg_model)
##
## Call:
## lm(formula = hp ~ mpg, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -59.26 -28.93 -13.45 25.65 143.36
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 324.08 27.43 11.813 8.25e-13 ***
## mpg -8.83 1.31 -6.742 1.79e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 43.95 on 30 degrees of freedom
## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892
## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07
The adjusted R-squared is only 59% (so this regression explains 59% of the data). [MPG expalins 59% of the horsepower] It’s not bad, but its far from
What is the estimate? For every MPG that goes up, horsepower goes down by 8.83. The estimate tells you the magnitutde of the effect. The estimate of the effect.
We can further visualize the prediction and error of the model by attaching it to the mtcars dataframe:
cars_test<-mtcars %>% mutate(predicted_hp_value=predict(hp_mpg_model)) %>% select(hp,mpg,predicted_hp_value)
# and now the error:
cars_test<-cars_test %>% mutate(predicted_error=hp-predicted_hp_value)
head(cars_test)
## hp mpg predicted_hp_value predicted_error
## Mazda RX4 110 21.0 138.6580 -28.65796
## Mazda RX4 Wag 110 21.0 138.6580 -28.65796
## Datsun 710 93 22.8 122.7644 -29.76445
## Hornet 4 Drive 110 21.4 135.1261 -25.12607
## Hornet Sportabout 175 18.7 158.9663 16.03366
## Valiant 105 18.1 164.2642 -59.26418
The predicted_hp_value is trying to guess what the horsepower will be based on the mpg using the regression mode.
If you wanted to view your residuals, you don’t even need to calculate them (like we did above). You can just call “residuals()” on the model that we created
cars_test$model_error<-residuals(hp_mpg_model)
head(cars_test)
## hp mpg predicted_hp_value predicted_error model_error
## Mazda RX4 110 21.0 138.6580 -28.65796 -28.65796
## Mazda RX4 Wag 110 21.0 138.6580 -28.65796 -28.65796
## Datsun 710 93 22.8 122.7644 -29.76445 -29.76445
## Hornet 4 Drive 110 21.4 135.1261 -25.12607 -25.12607
## Hornet Sportabout 175 18.7 158.9663 16.03366 16.03366
## Valiant 105 18.1 164.2642 -59.26418 -59.26418
They’re the same!
The errors can now be visualized like so (you don’t have to know how to do this but it’s a fun lil exercise). This is useful for judging different models:
ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line AND errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=predicted_hp_value),linetype = 'dashed')
## `geom_smooth()` using formula = 'y ~ x'
You may be wondering: how does R “know” which line fits the data the best? Without getting too technical, R tries to calculate a bunch of different lines and then chooses the one that has the least amount of errors.
The “R squared” measurement we have been talking about can be derived manually. I do that here just to give you an idea of where the measurement is coming from:
# lets add the square of the model error, along with the mean horsepower for the whole dataset
cars_test<-cars_test %>% mutate(residual_sq=model_error*model_error)
cars_test<-cars_test %>% mutate(mean_hp=mean(hp))
# now we can get the total sum of squares by subtracting mean HP from total HP, squared:
total_sum_of_squares<-sum((cars_test$hp-cars_test$mean_hp)^2)
# now lets get the residual sum of squares (hp minus the predicted HP value, squared):
residual_sum_of_squares<-sum((cars_test$hp-cars_test$predicted_hp_value)^2)
# by dividing the residual by the total sum of squares, we get the r-squared: a measurement of how much variance the model "explains":
hp_r_squared<-1-(residual_sum_of_squares/total_sum_of_squares)
print(hp_r_squared)
## [1] 0.6024373
This is fairly close to the .59 calculated earlier.
Of course, there is an easier way to get it:
summary(hp_mpg_model)$r.squared
## [1] 0.6024373
This is another way of saying “the regression model here explains 60% of the variance”, which can be said even more simply: “the regression model explains 60% of the data”.
So, practically speaking, 60% of horsepower can be “explained” by miles per gallon.
The really useful thing about regression (as we discussed in the last lecture) is that we can start to fit models not just on one variable (hp vs. mpg, for example), but upon multiple variables (hp vs. mpg, wt, disp, and so forth).
Think about this: does mpg really explain hp? That is, does a car’s mpg really cause the horsepower to go up or down? MPG is a proxy for something else…
It makes more sense when you realize that mpg is probably a proxy for different variables: weight, or maybe engine power. More weight and more engine power tend to make mpg go down, while making HP go up. Engine power is probably the biggest contributor to HP, while weight and HP are only somewhat related.
But how can we get this from the data?
# lets add "wt" (weight) and "cyl" (number of cylinders) to the linear regression. They appear after the tilde (~).
cars_multiple<-lm(hp~mpg+wt+cyl,data=mtcars)
summary(cars_multiple)
##
## Call:
## lm(formula = hp ~ mpg + wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.041 -21.622 -7.177 15.235 125.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.662 113.207 1.022 0.31567
## mpg -4.220 2.778 -1.519 0.14002
## wt -12.135 14.382 -0.844 0.40596
## cyl 25.025 7.486 3.343 0.00237 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.41 on 28 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.6861
## F-statistic: 23.58 on 3 and 28 DF, p-value: 8.064e-08
R-squared went up! But p-values went up. P-value for the whole model is very small, but p-value for mpg is above the critical value of .05. When you account for weight and cylinders, mpg is not significant on horsepower. Why? Whatever was in mpg that was ‘causing’ horsepower is better represented by cylinders. MPG was not as good of a predictor of horsepower as we originally thought.
– As we expected earlier, mpg and wt have totally evaporated in significance (they have p-values far larger than 0.05). The only variable that is still significant is cyl (number of cylinders – basically, engine power). We also see that the r-squared has increased from around 60% to almost 70%.
Again, this is because “mpg” was actually a bad variable to predict HP. MPG doesn’t really explain HP – mpg was actually just a substitute for CYL – the variable that actually causes HP.
How do you know if a variable is causing something – or just a proxy for the actual cause? This is extremely complex and there are no great answers, however this question will ultimately be answered in your literature review. You have to convince the reader that “variables x, y and z” are actually causing “effect a” rather than just being associated with it, or rather than being a proxy for something else.
Here’s an idea: why not just throw the whole kitchen sink into the linear regression, and see what “sticks”? Isn’t this a good way to come up with cause and effect?
Short answer: no.
Longer answer: eh. maybe. sort of. it depends on who you ask.
Even longer answer:
#dependent value is still horsepower
kitchen_sink_model<-lm(hp~wt+cyl+disp+drat+qsec+vs+am+gear+carb+mpg,data=mtcars)
#if you think there are two variables that have a stong effect together, you can use a * instead of a +
summary(kitchen_sink_model)
##
## Call:
## lm(formula = hp ~ wt + cyl + disp + drat + qsec + vs + am + gear +
## carb + mpg, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.681 -15.558 0.799 18.106 34.718
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 79.0484 184.5041 0.428 0.67270
## wt -27.6600 19.2704 -1.435 0.16591
## cyl 8.2037 10.0861 0.813 0.42513
## disp 0.4390 0.1492 2.942 0.00778 **
## drat -4.6185 16.0829 -0.287 0.77680
## qsec -1.7844 7.3639 -0.242 0.81089
## vs 25.8129 19.8512 1.300 0.20758
## am 9.4863 20.7599 0.457 0.65240
## gear 7.2164 14.6160 0.494 0.62662
## carb 18.7487 7.0288 2.667 0.01441 *
## mpg -2.0631 2.0906 -0.987 0.33496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25.97 on 21 degrees of freedom
## Multiple R-squared: 0.9028, Adjusted R-squared: 0.8565
## F-statistic: 19.5 on 10 and 21 DF, p-value: 1.898e-08
On the surface, it appears that we’ve created a better model of HP: when we throw everything into the linear model, we discover that the only two significant variables are “disp” (engine displacement - size) and “carb” (number of carburetors in the engine). R-squared has increased to almost 90%. So what’s the problem?
[for every cubic inch of displacement added, horsepower goes up .4390 units] The estimates in the (intercept) line is the horsepower of a hypothetical car if all the measurements were 0– so the horsepower of the car would be 79 if the wt/cyl/disp/etc. were 0
Reminder- adjusted R-squared is the best one to pay attention to
In this specific case, engine size and number of carburetors are still just a proxy for engine power. If you think about it, increasing an engine’s size doesn’t automatically increase HP. Imagine if you made a huge engine 100 feet long but instead of extra cylinders, you just filled it up full of sand. You get the idea. Size is still a proxy for “something else” that is not being fully captured by the measurements.
Probably the best thing we can say about this model is that “disp” and “carb” are the best proxies for what HP actually is: the ability of the engine to move the weight of the car a certain distance.
tl;dr: if you try to “kitchen sink” everything, you might be fooled into thinking that the effect is being “caused” by something that is itself a proxy for something else. The variables you put into the model should be guided by the literature….
The best thing to do is have a good theoretical understanding of which variables are most important and why. This typically comes out in your literature review.
In addition to seeing which variables are most significant, multiple regression allows us to “estimate” the size of a significant effect upon a dependent variable. Our as our cars_multiple model shows:
summary(cars_multiple)
##
## Call:
## lm(formula = hp ~ mpg + wt + cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -60.041 -21.622 -7.177 15.235 125.753
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 115.662 113.207 1.022 0.31567
## mpg -4.220 2.778 -1.519 0.14002
## wt -12.135 14.382 -0.844 0.40596
## cyl 25.025 7.486 3.343 0.00237 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 38.41 on 28 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.6861
## F-statistic: 23.58 on 3 and 28 DF, p-value: 8.064e-08
[hp is the dependent value, cyl is the independent variable. as cyl go up by 1, horsepower goes up by 25.025]
“CYL” has an “Estimate” (sometimes called a “Beta” or “Beta coefficient”) of 25.025. Because CYL is significant, this means that, all things being equal, if you add an extra cylinder to a car, it increases the HP by 25 points.
“WT” is not significant in this model, but lets pretend it was. WT has an Estimate of -12.1. This means that every time “WT” goes up by 1, “HP” goes down by 12.1.
In this case, “WT” is the weight in tons, so it makes sense that an extra ton of weight reduces HP.
An easy way to interpret the Estimate is by saying “when [independent variable] does up by”1”, the [dependent variable] goes [up/down] by [Estimate]“.
diamonds_price<-lm(price~carat+depth+table,data=diamonds)
summary(diamonds_price)
##
## Call:
## lm(formula = price ~ carat + depth + table, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18288.0 -785.9 -33.2 527.2 12486.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13003.441 390.918 33.26 <2e-16 ***
## carat 7858.771 14.151 555.36 <2e-16 ***
## depth -151.236 4.820 -31.38 <2e-16 ***
## table -104.473 3.141 -33.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1526 on 53936 degrees of freedom
## Multiple R-squared: 0.8537, Adjusted R-squared: 0.8537
## F-statistic: 1.049e+05 on 3 and 53936 DF, p-value: < 2.2e-16
An increase of 1 carat (a measurement of weight), tends to make the price of a diamond go up by how much? $7,858.77
An increase in the depth measurement by “1” tends to make the price of a diamond go down by how much? -$151.24
Linear Regression (single and multiple) has some assumptions that must be met for the p-values to be accurate. They are:
We could spend a whole semester discussing these, and we will spend most of the next class discussing them, but for now here is the most important thing to know:
Here is the hp_mpg_model:
plot(cars_multiple,which=1)
Here we see that the line of residuals versus fitted (the red line) is mostly straight. This suggests that the relationship between variables is mostly linear.
For an example of one that isn’t linear, we can look at the “kitchen sink” model we created earlier:
plot(kitchen_sink_model,which=1)
The residuals versus fitted is clearly curved (and much more curved than the previous model). This means that the assumption of linearity is probably violated. There is no strict mathematically definition – it’s more a matter of “eyeing” it. Consequently, because the “kitchen sink” model does not meet the linearity assumption, the (higher) r-squared values will be inaccurate.
Best practice- have as few variables as possible in your model as the literature supports
BONUS: there is probably a method which fits the data more correctly than a linear model:
hp_mpg_poly<-lm(hp~poly(mpg,degree = 2),data=mtcars)
cars_test$pred_poly<-predict(hp_mpg_poly,newdata=cars_test)
ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='lm',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With linear regression line AND errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=predicted_hp_value),linetype = 'dashed')
## `geom_smooth()` using formula = 'y ~ x'
ggplot(cars_test,aes(x=mpg,y=hp)) + geom_point() + geom_smooth(method='gam',color='blue') + ggtitle("MPG vs. HP in MtCars",subtitle = "With polynomial regression and errors") + geom_segment(aes(x=mpg,xend=mpg,y=hp,yend=pred_poly),linetype = 'dashed')
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'
This is called a polynomial model. It’s similar to a linear model but it accounts for curvature in the relationship between hp and mpg.You can tell that the error bars are a bit shorter in the polynomial model, which means the curve fits the dots a bit better than just a straight line. The polynomial model basically makes a linear model curve a little bit!
There is also a mathematical way to determine how well a model fits the data:
AIC(hp_mpg_model,hp_mpg_poly)
## df AIC
## hp_mpg_model 3 336.8553
## hp_mpg_poly 4 335.3593
hp_mpg_poly is slightly lower than hp_mpg_model, which means that hp_mpg_poly is a slightly better model.
The downside to polynomial models is that they are more difficult to interpret than a linear model, so don’t worry too much about them right now.
#Class Example
theswiss<-swiss
#examine for relationships
pairs(theswiss)
Let’s try running some linear models on this!
model1<-lm(Fertility~Catholic+Agriculture+Infant.Mortality, data=theswiss)
Are these variables linear with each other?
plot(model1, which=1)
There’s one county on the left that is very out of the norm with
fertility. If we were writing a paper, we could run a model with and
without the county to see what the outcome is.
For the most part, this is pretty linear. Good enough!
summary(model1)
##
## Call:
## lm(formula = Fertility ~ Catholic + Agriculture + Infant.Mortality,
## data = theswiss)
##
## Residuals:
## Min 1Q Median 3Q Max
## -25.0367 -5.6899 0.3825 5.9029 18.9091
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.74755 11.27417 2.372 0.02221 *
## Catholic 0.08778 0.04004 2.192 0.03383 *
## Agriculture 0.14229 0.07253 1.962 0.05627 .
## Infant.Mortality 1.63342 0.52621 3.104 0.00337 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.12 on 43 degrees of freedom
## Multiple R-squared: 0.3859, Adjusted R-squared: 0.343
## F-statistic: 9.007 on 3 and 43 DF, p-value: 9.578e-05
As infant mortality goes up by 1, fertility goes up by 1.63. The p value is very small. It is significant.
R-squared– the variables explained about 34% of fertility. There’s a lot that goes into fertility that is not explained by these variables.
Agriculture is on the edge of significance– .056.
R-squared- how much of the dependent variable is being explained by the independent variable(s)
Binary– we can use them as a variable, but the estimate is interpreted differently; it turns into probability. We’ll talk about that more later…
(note to self- I can use nonprofit/private/government nursing homes and the total health score)