Multiple Linear Regression Additional Variable Types
Author
Spencer Hamilton
GRADING RUBRIC (46 possible points)
45 possible points for correctly answering questions
1 possible point for correctly formatting/submitting the assignment.
Data and Description
Note that for the sake of length for this homework assignment, I am not having you check the model assumptions. You certainly can, if you would like, and in “real life” you would definitely need to do this prior to any statistical inference.
Macroeconomists often speculate that life expectancy is linked with the economic well-being of a country. Macroeconomists also hypothesize that Organisation for Economic Co-operation and Development (OECD) (an international think tank charged with promoting policies that will improve global social and economic well-being) members will have longer life expectancy. To test these hypotheses, the LifeExpectancy.txt data set (found on Canvas) contains the following information:
Variable
Description
LifeExp
Average life expectancy in years
Country
Country name
Group
Is the country a member of OECD, Africa, or other?
PPGDP
Per person GDP (on the log scale)
The Group variable indicates if the country is a member of the OECD, a member of the African continent, or belonging to neither group (other). Note that the Country variable is just for your reference - you will not use this variable in your model.
Download LifeExpectancy.txt, and put it in the same folder as this .qmd file.
0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.
1. Read in the data set, call it “life”, remove the “Row” column, and change the class of any categorical variables to factor variables. Print a summary of the data and make sure the data makes sense.
Country Group PPGDP LifeExp
Albania : 1 africa: 37 Min. : 4.743 Min. :51.86
Anguilla : 1 oecd : 31 1st Qu.: 7.157 1st Qu.:70.81
Argentina: 1 other :113 Median : 8.542 Median :75.29
Armenia : 1 Mean : 8.492 Mean :73.14
Aruba : 1 3rd Qu.: 9.844 3rd Qu.:79.80
Australia: 1 Max. :11.563 Max. :85.62
(Other) :175
2. Show a scatterplot with the response on the \(y\)-axis and the other continuous variable on the \(x\)-axis. Comment on the the relationship between these two variables.
# your code hereggplot(data = life, mapping =aes(y = LifeExp, x = PPGDP)) +geom_point()
You can see that there is probably a strong linear positive relationship between PPGDP and Life Expectancy. But, its interesting how there’s that gi
3. Create and print a boxplot with the response on the \(y\)-axis and the categorical variable on the \(x\)-axis. Comment on the the relationship between these two variables.
# your code hereggplot(data = life) +geom_boxplot(mapping =aes(x = PPGDP, y = LifeExp)) +theme(aspect.ratio =1)
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?
This box plot is interesting because there is such a big skew. We can see that there are so many lower out liars. We see that the average life expextancy overall is probably around 75. But, there is such a weird variation with the downward skew.
4. Create and print a color-coded scatterplot using all of the variables that will be in your model. Hint: plot the response on the \(y\)-axis, the other continuous variable on the \(x\)-axis, and color the points by the categorical variable.
# your code hereggplot(data = life, mapping =aes(y = LifeExp, x = PPGDP, color = Group)) +geom_point()
5. Write out the theoretical model (using Greek letters/parameters) that includes main effects for Per Person GDP and the group of the country (you will not write out the fitted model using coefficients, because you have not fit a model yet). DO NOT include interactions at this step. Remember, you will need to use dummy variables for Group. USE “other” AS THE BASELINE CATEGORY. Use variable names that are descriptive (not \(y\), \(x_1\), etc.) and be sure to include proper subscripts.
6. Fit the multiple linear regression model from question 5 to the data (no transformations, no interactions, etc.) using dummy variables that you create manually. USE “other” AS THE BASELINE CATEGORY FOR GROUP. Print a summary of the results.
# your code herelevels(life$Group) # order of levels originally
[1] "africa" "oecd" "other"
# Create dummy variables for Group: "other" will be the baseline categorylife$Group_oecd <-as.numeric(life$Group =="oecd")life$Group_Africa <-as.numeric(life$Group =="africa")# Fit the multiple linear regression modellife_6_lm <-lm(LifeExp ~ PPGDP + Group_oecd + Group_Africa, data = life)summary(life_6_lm)
Call:
lm(formula = LifeExp ~ PPGDP + Group_oecd + Group_Africa, data = life)
Residuals:
Min 1Q Median 3Q Max
-2.8136 -0.6546 -0.0327 0.6861 3.0454
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.95789 0.65272 78.070 < 2e-16 ***
PPGDP 2.87690 0.07478 38.470 < 2e-16 ***
Group_oecd 1.52983 0.25418 6.019 9.88e-09 ***
Group_Africa -12.29427 0.25726 -47.789 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.077 on 177 degrees of freedom
Multiple R-squared: 0.9857, Adjusted R-squared: 0.9855
F-statistic: 4080 on 3 and 177 DF, p-value: < 2.2e-16
7. Fit the multiple linear regression model from question 5 again, but this time let R create the dummy variables for you in the lm function. As before, USE “other” AS THE BASELINE CATEGORY FOR GROUP. Print a summary of the results and make sure they are identical to the results from question 6.
# your code here# Reorder the levels of the 'Group' variable so that "other" becomes the baseline categorylife$Group <-relevel(life$Group, ref ="other")life_plain_lm <-lm(LifeExp ~ PPGDP + Group, data = life)summary(life_plain_lm)
Call:
lm(formula = LifeExp ~ PPGDP + Group, data = life)
Residuals:
Min 1Q Median 3Q Max
-2.8136 -0.6546 -0.0327 0.6861 3.0454
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.95789 0.65272 78.070 < 2e-16 ***
PPGDP 2.87690 0.07478 38.470 < 2e-16 ***
Groupafrica -12.29427 0.25726 -47.789 < 2e-16 ***
Groupoecd 1.52983 0.25418 6.019 9.88e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.077 on 177 degrees of freedom
Multiple R-squared: 0.9857, Adjusted R-squared: 0.9855
F-statistic: 4080 on 3 and 177 DF, p-value: < 2.2e-16
8. Briefly interpret the intercept (like we did in class). Note that you will need to use the word “average” twice since you are predicting an average already (i.e. the response variable is a country’s average life expectancy). You will need to do this here and with the questions following, when interpreting.
On average, the average person from other countries (outside africa and oecd), who have 0 PPGDP, will live 50.95 years.
9. Briefly interpret the coefficient for PPGDP. You do not need to un-transform anything or interpret this in the percentage change framework - you can just write something like “for every one unit increase in per person GDP (log scale)” in your response.
For every unit increase in per person GDP (in the log scale), the average life expectancy rises by 2.8769, on average.
10. For equal per person GDP (log scale), how does life expectancy change for countries that are members of the OECD compared to countries that are on the African continent? Show how you obtained this number, and briefly interpret this number (like we did in class).
The average life expectancy of the countries who are members of OECD are (1.52983 - (-12.29427) = 13.8241) higher than African countries with the same GDP (on the log scale).
11. Create 95% confidence intervals for all coefficients (use the confint function). You do not need to interpret them in this question.
# your code here# Calculate 95% confidence intervals for all coefficientsconf_intervals <-confint(life_plain_lm)# Print the confidence intervalsconf_intervals
12. Briefly interpret the 95% confidence interval for I(Group=Africa).
We are 95% confident that the true effect of being in the “Africa” group on the average life expectancy lies between -12.801967 years and -11.786572 years when compared to countries in the “other” group, after controlling for other variables (PPGDP) in the model.
13. Use the anova function to conduct a hypothesis test that tests a reduced model compared to the full model. Specifically, test if Group has a significant effect on LifeExp. What do you conclude from the result of the test? Hint: you will need to create another linear model and compare it with the one you made previously.
# your code here# Fit the reduced model (without the Group variable)reduced_lm <-lm(LifeExp ~ PPGDP, data = life)# Perform the ANOVA test to compare the modelsanova_result <-anova(reduced_lm, life_plain_lm)# Print the ANOVA tableanova_result
Analysis of Variance Table
Model 1: LifeExp ~ PPGDP
Model 2: LifeExp ~ PPGDP + Group
Res.Df RSS Df Sum of Sq F Pr(>F)
1 179 2858.84
2 177 205.48 2 2653.4 1142.8 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The very low p-value (< 2.2e-16) associated with the F-statistic indicates strong evidence against the null hypothesis. Therefore, we reject the null hypothesis and conclude that there is a significant effect of the Group variable on the life expectancy (LifeExp) after controlling for the Gross Domestic Product per capita (PPGDP).
14. Create a 95% prediction interval for the life expectancy of a country in the OECD with an average per person GDP (log scale) of 9.5. Print the result, and briefly interpret this interval (like we did in class). (Use the predict function.)
# your code here# Create a data frame with the predictor values for predictionnew_data <-data.frame(PPGDP =9.5, Group ="oecd")# Predict life expectancy for the new dataprediction <-predict(life_plain_lm, newdata = new_data, interval ="prediction", level =0.95)# Print the prediction intervalprint(prediction)
fit lwr upr
1 79.81829 77.65424 81.98233
We are 95% confident that if we were predict the life expectancy of a country in the OECD with an average per person GDP (log scale) of 9.5, it would be between 77.65424, and 81.98233 years.
15. Plot the fitted model on the scatterplot with the two continuous variables on the axes, colored by the categorical variable. Hint: you should have 3 different lines on your plot, and you will not need to have different line types or point shapes (you will need to have different colors).
# your code herelibrary(ggplot2)ggplot(life) +geom_point(mapping =aes(x = PPGDP,y = LifeExp, color = Group)) +geom_line(mapping =aes(x = PPGDP,y =predict(life_plain_lm), color = Group)) +theme(aspect.ratio =1)
16. Fit a multiple linear regression model to the data, where this time you include an interaction term between PPGDP and Group. USE “other” AS THE BASELINE CATEGORY FOR GROUP. Print a summary of the results.
# your code here# Fit the multiple linear regression model with an interaction termlife_lm_interaction <-lm(LifeExp ~ PPGDP + Group + PPGDP:Group, data = life)# Print summary of the resultssummary(life_lm_interaction)
Call:
lm(formula = LifeExp ~ PPGDP + Group + PPGDP:Group, data = life)
Residuals:
Min 1Q Median 3Q Max
-2.7772 -0.6729 -0.1000 0.6446 3.0438
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.42403 0.71286 70.734 < 2e-16 ***
PPGDP 2.93882 0.08187 35.896 < 2e-16 ***
Groupafrica -11.89511 1.47535 -8.063 1.13e-13 ***
Groupoecd 11.29201 3.21337 3.514 0.000562 ***
PPGDP:Groupafrica -0.04128 0.21249 -0.194 0.846187
PPGDP:Groupoecd -0.95268 0.31279 -3.046 0.002680 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.056 on 175 degrees of freedom
Multiple R-squared: 0.9865, Adjusted R-squared: 0.9861
F-statistic: 2551 on 5 and 175 DF, p-value: < 2.2e-16
17. Write out the fitted model (using coefficients values from above) for a model with PPGDP, Group, and an interaction between PPGDG and Group. Remember, you will need to use dummy variables for Group. USE “other” AS THE BASELINE CATEGORY. Use variable names that are descriptive (not \(y\), \(x_1\), etc.) and use proper subscripts.
18. Use the anova function to test if the overall interaction between PPGDP and Group is significant. Print the result. What do you conclude (full sentance)?
# your code here# Perform ANOVA to test the overall interactioninteraction_test <-anova(life_plain_lm, life_lm_interaction)# Print the resultprint(interaction_test)
Analysis of Variance Table
Model 1: LifeExp ~ PPGDP + Group
Model 2: LifeExp ~ PPGDP + Group + PPGDP:Group
Res.Df RSS Df Sum of Sq F Pr(>F)
1 177 205.48
2 175 195.12 2 10.357 4.6447 0.01083 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We have a p-value of 0.01083, so we reject the null hypothesis and conclude that the interaction is significant between Group and PPGDP.
19. Plot the fitted model (with the interaction included) on the scatterplot with the two continuous variables on the axes, colored by the categorical variable. Hint: you should have 3 different lines on your plot, and you will not need to have different line types or point shapes (you will need to have different colors).
# your code here# Create the scatterplotggplot(life) +geom_point(mapping =aes(x = PPGDP, y = LifeExp, color = Group)) +geom_line(mapping =aes(x = PPGDP,y =predict(life_lm_interaction), color = Group)) +theme(aspect.ratio =1)
20. How did the fitted lines change when you included an interaction term compared with when you did not include an interaction term?
It seems like the OECD and other groups have the same slope when the interaction term is not added. They seem to be about the same. However, when we add the interaction, we can see that the slope of OECD is about to hit the other group near 11.5 PPGDP. Thus, we know that the graph was really effected, even though it is a little hard to see it physically.
21. What is the estimated effect of PPGDP on LifeExp for countries in a country other than those in the OECD or Africa (i.e. in the “other” category)? You should report this number in a complete sentence. Since this is a continuous-categorical interaction, and since we are focusing on the effect of the continuous variable, you should use the “one unit increase” terminology in your response.
The estimated effect of PPGDP on LifeExp for countries in the “other” category, which excludes those in the OECD or Africa, is approximately -0.95268 years for every one unit increase in PPGDP. This estimate is based on the coefficient associated with the interaction term “PPGDP:Groupoecd” from the linear regression model.
22. What is the p-value for the test of whether the effect of PPGDP on LifeExp is different between countries in the OECD group and countries in the “other” group?
With a pvalue of 0.002680, from the lm in question 16, We can conclude that the effect of PPGDP on LifeExp is significantly different between countries in the OECD group and countries in the “other” group, as the p-value is less than the conventional significance level of 0.05.
23. What is the p-value for the test of whether the effect of PPGDP on LifeExp is different between countries in the OECD group and countries in the African continent? (use the glht() function from the multcomp package to get an answer to this question.)
# your code herelibrary(multcomp)# Define the contrast matrixcontrast_matrix <-matrix(c(0, 0, 0, 0, 1, -1), ncol =1)# Perform the test#test_result <- glht(life_lm_interaction, linfct = mcp(Group = contrast_matrix))#summary(test_result)
< your response here >
24. What is the effect of PPGDP on LifeExp for countries in the OECD (relative to the reference group)? You should report a number in a complete sentence. Since this is a continuous-categorical interaction, and since we are focusing on the effect of the continuous variable, you should use the “one unit increase” terminology in your response.
# your code here
Relative to the other group, we know that on average, for each additional unit of PPGDP (log scale) life expectancy on average for countries in the OECD decreases by 0.95268, or in the LM model it was given as ( -0.95268).
25. Conditional on having a PPGDP of 9, what is the estimated effect of belonging to the OECD relative to being in the “other” country group? You should report a number in a complete sentence.
# your code here# Define the PPGDP valueppgdp_value <-9# Define the coefficient for the interaction term "PPGDP:Groupoecd"coefficient_interaction <--0.95268# Calculate the estimated effect for the OECD group relative to the "other" groupeffect_oecd_relative_to_other <- ppgdp_value * coefficient_interaction# Print the resultprint(effect_oecd_relative_to_other)
[1] -8.57412
So, conditional on having a ppgdp of 9, the estimated effect of belonging to the OECD relative to being in the “other” country group is a life expectancy that is on average -8.57412 years lower compared to the “other” countries.
26. Briefly summarize what you learned from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.
To a non-statistician I would say that the purpose of this analysis was to look at life expectancy. This means that we are already looking at big gigantic averages, as each country probably can only give us their average life expectancy, and then we group them, to see if they have differences in life expectancy based on geography. From this data set, we learned that there really is an measurable difference in the groupings we were using. Your life expectancy average is effected by which geographical area that you live in, be it Africa, OECD, or other. This is what I learned, and it was such an astonishing realization that is is not all about your total PPGDP (per person gpd). We can now predict what the life expectancy will be for any PPGDP and geographic area. This is what I learned. As you might have assumed, Africa has a lower life expectancy, and then other, and then OECD.