Homework 6



Multiple Linear Regression Additional Variable Types


Author


Spencer Hamilton

GRADING RUBRIC (46 possible points)

Data and Description

Note that for the sake of length for this homework assignment, I am not having you check the model assumptions. You certainly can, if you would like, and in “real life” you would definitely need to do this prior to any statistical inference.

Macroeconomists often speculate that life expectancy is linked with the economic well-being of a country. Macroeconomists also hypothesize that Organisation for Economic Co-operation and Development (OECD) (an international think tank charged with promoting policies that will improve global social and economic well-being) members will have longer life expectancy. To test these hypotheses, the LifeExpectancy.txt data set (found on Canvas) contains the following information:

Variable Description
LifeExp Average life expectancy in years
Country Country name
Group Is the country a member of OECD, Africa, or other?
PPGDP Per person GDP (on the log scale)

The Group variable indicates if the country is a member of the OECD, a member of the African continent, or belonging to neither group (other). Note that the Country variable is just for your reference - you will not use this variable in your model.

Download LifeExpectancy.txt, and put it in the same folder as this .qmd file.

0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.

1. Read in the data set, call it “life”, remove the “Row” column, and change the class of any categorical variables to factor variables. Print a summary of the data and make sure the data makes sense.

# your code here
life <- read.table("LifeExpectancy.txt", header = TRUE)
life <- subset(life, select = -Row)
life$Country <- as.factor(life$Country)
life$Group <- as.factor(life$Group)
summary(life)
      Country       Group         PPGDP           LifeExp     
 Albania  :  1   africa: 37   Min.   : 4.743   Min.   :51.86  
 Anguilla :  1   oecd  : 31   1st Qu.: 7.157   1st Qu.:70.81  
 Argentina:  1   other :113   Median : 8.542   Median :75.29  
 Armenia  :  1                Mean   : 8.492   Mean   :73.14  
 Aruba    :  1                3rd Qu.: 9.844   3rd Qu.:79.80  
 Australia:  1                Max.   :11.563   Max.   :85.62  
 (Other)  :175                                                

2. Show a scatterplot with the response on the \(y\)-axis and the other continuous variable on the \(x\)-axis. Comment on the the relationship between these two variables.

# your code here
ggplot(data = life, mapping = aes(y = LifeExp, x = PPGDP)) + geom_point()

You can see that there is probably a strong linear positive relationship between PPGDP and Life Expectancy. But, its interesting how there’s that gi

3. Create and print a boxplot with the response on the \(y\)-axis and the categorical variable on the \(x\)-axis. Comment on the the relationship between these two variables.

# your code here
ggplot(data = life) +
  geom_boxplot(mapping = aes(x = PPGDP, y = LifeExp)) +
  theme(aspect.ratio = 1)
Warning: Continuous x aesthetic
ℹ did you forget `aes(group = ...)`?

This box plot is interesting because there is such a big skew. We can see that there are so many lower out liars. We see that the average life expextancy overall is probably around 75. But, there is such a weird variation with the downward skew.

4. Create and print a color-coded scatterplot using all of the variables that will be in your model. Hint: plot the response on the \(y\)-axis, the other continuous variable on the \(x\)-axis, and color the points by the categorical variable.

# your code here
ggplot(data = life, mapping = aes(y = LifeExp, x = PPGDP, color = Group)) + geom_point()

5. Write out the theoretical model (using Greek letters/parameters) that includes main effects for Per Person GDP and the group of the country (you will not write out the fitted model using coefficients, because you have not fit a model yet). DO NOT include interactions at this step. Remember, you will need to use dummy variables for Group. USE “other” AS THE BASELINE CATEGORY. Use variable names that are descriptive (not \(y\), \(x_1\), etc.) and be sure to include proper subscripts.

\(\text{Life Expectancy}_i = \beta_1(baseline) + \beta_2 × \text{PPGDP} + \beta_3 × \text{I(Group = "Africa")} + \beta_4 × \text{I(Group = "oecd")} + \epsilon_i\)

$ iid N(0,^2) $

6. Fit the multiple linear regression model from question 5 to the data (no transformations, no interactions, etc.) using dummy variables that you create manually. USE “other” AS THE BASELINE CATEGORY FOR GROUP. Print a summary of the results.

# your code here
levels(life$Group)  # order of levels originally
[1] "africa" "oecd"   "other" 
# Create dummy variables for Group: "other" will be the baseline category
life$Group_oecd <- as.numeric(life$Group == "oecd")
life$Group_Africa <- as.numeric(life$Group == "africa")

# Fit the multiple linear regression model
life_6_lm <- lm(LifeExp ~ PPGDP + Group_oecd + Group_Africa, data = life)

summary(life_6_lm)

Call:
lm(formula = LifeExp ~ PPGDP + Group_oecd + Group_Africa, data = life)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8136 -0.6546 -0.0327  0.6861  3.0454 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   50.95789    0.65272  78.070  < 2e-16 ***
PPGDP          2.87690    0.07478  38.470  < 2e-16 ***
Group_oecd     1.52983    0.25418   6.019 9.88e-09 ***
Group_Africa -12.29427    0.25726 -47.789  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.077 on 177 degrees of freedom
Multiple R-squared:  0.9857,    Adjusted R-squared:  0.9855 
F-statistic:  4080 on 3 and 177 DF,  p-value: < 2.2e-16

7. Fit the multiple linear regression model from question 5 again, but this time let R create the dummy variables for you in the lm function. As before, USE “other” AS THE BASELINE CATEGORY FOR GROUP. Print a summary of the results and make sure they are identical to the results from question 6.

# your code here
# Reorder the levels of the 'Group' variable so that "other" becomes the baseline category
life$Group <- relevel(life$Group, ref = "other")

life_plain_lm <- lm(LifeExp ~ PPGDP + Group, data = life)
summary(life_plain_lm)

Call:
lm(formula = LifeExp ~ PPGDP + Group, data = life)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8136 -0.6546 -0.0327  0.6861  3.0454 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  50.95789    0.65272  78.070  < 2e-16 ***
PPGDP         2.87690    0.07478  38.470  < 2e-16 ***
Groupafrica -12.29427    0.25726 -47.789  < 2e-16 ***
Groupoecd     1.52983    0.25418   6.019 9.88e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.077 on 177 degrees of freedom
Multiple R-squared:  0.9857,    Adjusted R-squared:  0.9855 
F-statistic:  4080 on 3 and 177 DF,  p-value: < 2.2e-16

8. Briefly interpret the intercept (like we did in class). Note that you will need to use the word “average” twice since you are predicting an average already (i.e. the response variable is a country’s average life expectancy). You will need to do this here and with the questions following, when interpreting.

On average, the average person from other countries (outside africa and oecd), who have 0 PPGDP, will live 50.95 years.

9. Briefly interpret the coefficient for PPGDP. You do not need to un-transform anything or interpret this in the percentage change framework - you can just write something like “for every one unit increase in per person GDP (log scale)” in your response.

For every unit increase in per person GDP (in the log scale), the average life expectancy rises by 2.8769, on average.

10. For equal per person GDP (log scale), how does life expectancy change for countries that are members of the OECD compared to countries that are on the African continent? Show how you obtained this number, and briefly interpret this number (like we did in class).

The average life expectancy of the countries who are members of OECD are (1.52983 - (-12.29427) = 13.8241) higher than African countries with the same GDP (on the log scale).

11. Create 95% confidence intervals for all coefficients (use the confint function). You do not need to interpret them in this question.

# your code here
# Calculate 95% confidence intervals for all coefficients
conf_intervals <- confint(life_plain_lm)

# Print the confidence intervals
conf_intervals
                 2.5 %     97.5 %
(Intercept)  49.669772  52.246000
PPGDP         2.729321   3.024483
Groupafrica -12.801967 -11.786572
Groupoecd     1.028207   2.031453

12. Briefly interpret the 95% confidence interval for I(Group=Africa).

We are 95% confident that the true effect of being in the “Africa” group on the average life expectancy lies between -12.801967 years and -11.786572 years when compared to countries in the “other” group, after controlling for other variables (PPGDP) in the model.

13. Use the anova function to conduct a hypothesis test that tests a reduced model compared to the full model. Specifically, test if Group has a significant effect on LifeExp. What do you conclude from the result of the test? Hint: you will need to create another linear model and compare it with the one you made previously.

# your code here

# Fit the reduced model (without the Group variable)
reduced_lm <- lm(LifeExp ~ PPGDP, data = life)

# Perform the ANOVA test to compare the models
anova_result <- anova(reduced_lm, life_plain_lm)

# Print the ANOVA table
anova_result
Analysis of Variance Table

Model 1: LifeExp ~ PPGDP
Model 2: LifeExp ~ PPGDP + Group
  Res.Df     RSS Df Sum of Sq      F    Pr(>F)    
1    179 2858.84                                  
2    177  205.48  2    2653.4 1142.8 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The very low p-value (< 2.2e-16) associated with the F-statistic indicates strong evidence against the null hypothesis. Therefore, we reject the null hypothesis and conclude that there is a significant effect of the Group variable on the life expectancy (LifeExp) after controlling for the Gross Domestic Product per capita (PPGDP).

14. Create a 95% prediction interval for the life expectancy of a country in the OECD with an average per person GDP (log scale) of 9.5. Print the result, and briefly interpret this interval (like we did in class). (Use the predict function.)

# your code here
# Create a data frame with the predictor values for prediction
new_data <- data.frame(PPGDP = 9.5, Group = "oecd")

# Predict life expectancy for the new data
prediction <- predict(life_plain_lm, newdata = new_data, interval = "prediction", level = 0.95)

# Print the prediction interval
print(prediction)
       fit      lwr      upr
1 79.81829 77.65424 81.98233

We are 95% confident that if we were predict the life expectancy of a country in the OECD with an average per person GDP (log scale) of 9.5, it would be between 77.65424, and 81.98233 years.

15. Plot the fitted model on the scatterplot with the two continuous variables on the axes, colored by the categorical variable. Hint: you should have 3 different lines on your plot, and you will not need to have different line types or point shapes (you will need to have different colors).

# your code here
library(ggplot2)

ggplot(life) +
  geom_point(mapping = aes(x = PPGDP,
                           y = LifeExp, 
                           color = Group)) +
  geom_line(mapping = aes(x = PPGDP,
                          y = predict(life_plain_lm), 
                          color = Group)) +
  theme(aspect.ratio = 1)

16. Fit a multiple linear regression model to the data, where this time you include an interaction term between PPGDP and Group. USE “other” AS THE BASELINE CATEGORY FOR GROUP. Print a summary of the results.

# your code here
# Fit the multiple linear regression model with an interaction term
life_lm_interaction <- lm(LifeExp ~ PPGDP + Group + PPGDP:Group, data = life)

# Print summary of the results
summary(life_lm_interaction)

Call:
lm(formula = LifeExp ~ PPGDP + Group + PPGDP:Group, data = life)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7772 -0.6729 -0.1000  0.6446  3.0438 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        50.42403    0.71286  70.734  < 2e-16 ***
PPGDP               2.93882    0.08187  35.896  < 2e-16 ***
Groupafrica       -11.89511    1.47535  -8.063 1.13e-13 ***
Groupoecd          11.29201    3.21337   3.514 0.000562 ***
PPGDP:Groupafrica  -0.04128    0.21249  -0.194 0.846187    
PPGDP:Groupoecd    -0.95268    0.31279  -3.046 0.002680 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.056 on 175 degrees of freedom
Multiple R-squared:  0.9865,    Adjusted R-squared:  0.9861 
F-statistic:  2551 on 5 and 175 DF,  p-value: < 2.2e-16

17. Write out the fitted model (using coefficients values from above) for a model with PPGDP, Group, and an interaction between PPGDG and Group. Remember, you will need to use dummy variables for Group. USE “other” AS THE BASELINE CATEGORY. Use variable names that are descriptive (not \(y\), \(x_1\), etc.) and use proper subscripts.

$ = + 2.93882 × _i - 11.89511 × _i + 11.29201 × _i - 0.04128 × _i * _i - 0.95268 _i * _i $

18. Use the anova function to test if the overall interaction between PPGDP and Group is significant. Print the result. What do you conclude (full sentance)?

# your code here

# Perform ANOVA to test the overall interaction
interaction_test <- anova(life_plain_lm, life_lm_interaction)

# Print the result
print(interaction_test)
Analysis of Variance Table

Model 1: LifeExp ~ PPGDP + Group
Model 2: LifeExp ~ PPGDP + Group + PPGDP:Group
  Res.Df    RSS Df Sum of Sq      F  Pr(>F)  
1    177 205.48                              
2    175 195.12  2    10.357 4.6447 0.01083 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We have a p-value of 0.01083, so we reject the null hypothesis and conclude that the interaction is significant between Group and PPGDP.

19. Plot the fitted model (with the interaction included) on the scatterplot with the two continuous variables on the axes, colored by the categorical variable. Hint: you should have 3 different lines on your plot, and you will not need to have different line types or point shapes (you will need to have different colors).

# your code here

# Create the scatterplot
ggplot(life) +
  geom_point(mapping = aes(x = PPGDP, 
                           y = LifeExp, 
                           color = Group)) +
  geom_line(mapping = aes(x = PPGDP,
                          y = predict(life_lm_interaction), 
                          color = Group)) +
  theme(aspect.ratio = 1)

20. How did the fitted lines change when you included an interaction term compared with when you did not include an interaction term?

It seems like the OECD and other groups have the same slope when the interaction term is not added. They seem to be about the same. However, when we add the interaction, we can see that the slope of OECD is about to hit the other group near 11.5 PPGDP. Thus, we know that the graph was really effected, even though it is a little hard to see it physically.

21. What is the estimated effect of PPGDP on LifeExp for countries in a country other than those in the OECD or Africa (i.e. in the “other” category)? You should report this number in a complete sentence. Since this is a continuous-categorical interaction, and since we are focusing on the effect of the continuous variable, you should use the “one unit increase” terminology in your response.

The estimated effect of PPGDP on LifeExp for countries in the “other” category, which excludes those in the OECD or Africa, is approximately -0.95268 years for every one unit increase in PPGDP. This estimate is based on the coefficient associated with the interaction term “PPGDP:Groupoecd” from the linear regression model.

22. What is the p-value for the test of whether the effect of PPGDP on LifeExp is different between countries in the OECD group and countries in the “other” group?

With a pvalue of 0.002680, from the lm in question 16, We can conclude that the effect of PPGDP on LifeExp is significantly different between countries in the OECD group and countries in the “other” group, as the p-value is less than the conventional significance level of 0.05.

23. What is the p-value for the test of whether the effect of PPGDP on LifeExp is different between countries in the OECD group and countries in the African continent? (use the glht() function from the multcomp package to get an answer to this question.)

# your code here
library(multcomp)

# Define the contrast matrix
contrast_matrix <- matrix(c(0, 0, 0, 0, 1, -1), ncol = 1)

# Perform the test
#test_result <- glht(life_lm_interaction, linfct = mcp(Group = contrast_matrix))

#summary(test_result)

< your response here >

24. What is the effect of PPGDP on LifeExp for countries in the OECD (relative to the reference group)? You should report a number in a complete sentence. Since this is a continuous-categorical interaction, and since we are focusing on the effect of the continuous variable, you should use the “one unit increase” terminology in your response.

# your code here

Relative to the other group, we know that on average, for each additional unit of PPGDP (log scale) life expectancy on average for countries in the OECD decreases by 0.95268, or in the LM model it was given as ( -0.95268).

25. Conditional on having a PPGDP of 9, what is the estimated effect of belonging to the OECD relative to being in the “other” country group? You should report a number in a complete sentence.

# your code here
# Define the PPGDP value
ppgdp_value <- 9

# Define the coefficient for the interaction term "PPGDP:Groupoecd"
coefficient_interaction <- -0.95268

# Calculate the estimated effect for the OECD group relative to the "other" group
effect_oecd_relative_to_other <- ppgdp_value * coefficient_interaction

# Print the result
print(effect_oecd_relative_to_other)
[1] -8.57412

So, conditional on having a ppgdp of 9, the estimated effect of belonging to the OECD relative to being in the “other” country group is a life expectancy that is on average -8.57412 years lower compared to the “other” countries.

26. Briefly summarize what you learned from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing a business manager (avoid using statistics jargon) and just provide the main take-aways.

To a non-statistician I would say that the purpose of this analysis was to look at life expectancy. This means that we are already looking at big gigantic averages, as each country probably can only give us their average life expectancy, and then we group them, to see if they have differences in life expectancy based on geography. From this data set, we learned that there really is an measurable difference in the groupings we were using. Your life expectancy average is effected by which geographical area that you live in, be it Africa, OECD, or other. This is what I learned, and it was such an astonishing realization that is is not all about your total PPGDP (per person gpd). We can now predict what the life expectancy will be for any PPGDP and geographic area. This is what I learned. As you might have assumed, Africa has a lower life expectancy, and then other, and then OECD.