Week 12 Assignment

The attached who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country LifeExp: average life expectancy for the country in years InfantSurvival: proportion of those surviving to one year or more Under5Survival: proportion of those surviving to five years or more TBFree: proportion of the population without TB. PropMD: proportion of the population who are MDs PropRN: proportion of the population who are RNs PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate TotExp: sum of personal and government expenditures.

Problem 1:

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

# Read CSV file 
data <- read.csv("who.csv")

# View the first few rows of the dataset
head(data)
##               Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
## 1         Afghanistan      42          0.835          0.743 0.99769 0.000228841
## 2             Albania      71          0.985          0.983 0.99974 0.001143127
## 3             Algeria      71          0.967          0.962 0.99944 0.001060478
## 4             Andorra      82          0.997          0.996 0.99983 0.003297297
## 5              Angola      41          0.846          0.740 0.99656 0.000070400
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991 0.000142857
##        PropRN PersExp GovtExp TotExp
## 1 0.000572294      20      92    112
## 2 0.004614439     169    3128   3297
## 3 0.002091362     108    5184   5292
## 4 0.003500000    2589  169725 172314
## 5 0.001146162      36    1620   1656
## 6 0.002773810     503   12543  13046
# Load ggplot2 package
library(ggplot2)
# Create scatter plot
ggplot(data, aes(x = TotExp, y = LifeExp)) +
  geom_point(color = "red") +
  xlab("Sum of Personal and Government Expenditures.") +
  ylab("Life Expectancy")

Run Simple Regression

simple.model.lm <- lm(LifeExp~TotExp, data = data)
print(summary(simple.model.lm))
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The F-statistic can be seen above it is 65.26 which is large enough to be significant. The p-value can also be seen as 7.714e-14 which is small enough to also be significant. These two values indicate that we can reject the null hypothesis. The \(R^2\) value of 0.2577 shows that about 25% of the variability in LifeExp can be explained with TotExp. This is not a very strong \(R^2\) value as we would like it to be closer to 1. The standard error of 9.31 suggests some variability with the predicted values but overall its not too bad. However, we would ideally like for that value to be closer to 0.

Based on the information above and the scatter plot the assumptions of simple linear regression have been met.

Problem 2:

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

data$newLifeExp = data$LifeExp^4.6
data$newTotExp = data$TotExp^.06
head(data)
##               Country LifeExp InfantSurvival Under5Survival  TBFree      PropMD
## 1         Afghanistan      42          0.835          0.743 0.99769 0.000228841
## 2             Albania      71          0.985          0.983 0.99974 0.001143127
## 3             Algeria      71          0.967          0.962 0.99944 0.001060478
## 4             Andorra      82          0.997          0.996 0.99983 0.003297297
## 5              Angola      41          0.846          0.740 0.99656 0.000070400
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991 0.000142857
##        PropRN PersExp GovtExp TotExp newLifeExp newTotExp
## 1 0.000572294      20      92    112   29305338  1.327251
## 2 0.004614439     169    3128   3297  327935478  1.625875
## 3 0.002091362     108    5184   5292  327935478  1.672697
## 4 0.003500000    2589  169725 172314  636126841  2.061481
## 5 0.001146162      36    1620   1656   26230450  1.560068
## 6 0.002773810     503   12543  13046  372636298  1.765748
transformed.model.lm <- lm(newLifeExp~newTotExp, data = data)
print(summary(transformed.model.lm))
## 
## Call:
## lm(formula = newLifeExp ~ newTotExp, data = data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## newTotExp    620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The F-statistic can be seen above it is 507.7 which is large increase from 65 and is significant. The p-value can also be seen as 2.2e-16 which is a big decrease from 7.714e-14 and also significant. These two values indicate that we can reject the null hypothesis. The \(R^2\) value of 0.7298 is a huge increase from 0.2577 and shows that now about 72% of the variability in LifeExp can be explained with TotExp. This is a much better \(R^2\) value as it is closer 1. The standard error of 90490000 is a substantial increase from 9.31 and suggests a lot variability with the predicted values.

We see improvements in the F-statistic, p-value and \(R^2\) value for this model over the non-transformed one. We also see a big increase in the residual standard error which can possible be explained in the large values since we raised LifeExp to the 4.6 power. Overall, I would say that this model is a better fit to the data even with the higher RSE.

Problem 3:

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

# create the test data 
test_data <- data.frame(newTotExp = c(1.5, 2.5))  # example values
print(test_data)
##   newTotExp
## 1       1.5
## 2       2.5
# Predict life expectancy using the model
predicted_life_exp <- predict(transformed.model.lm, newdata = test_data)

# Print the predicted values
print(predicted_life_exp)
##         1         2 
## 193562414 813622630

The forecast for LifeExp when TotExp^.06 = 1.5 is 193562414.

The forecast for LifeExp when TotExp^.06 = 2.5 is 813622630.

Problem 4:

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

multiple.model.lm <- lm(LifeExp~PropMD+TotExp + PropMD*TotExp, data = data)
print(summary(multiple.model.lm))
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The F-statistic is 34.49 with a p-value of < \(2.2e^{-16}\) which is statistically significant. We have a \(R^2\) value of 0.3574 which is an increase from the first model but a decrease from the model with the transformed data. Generally, we would like for our \(R^2\) value to be closer to 1. The Standard Residual Error is 8.765 which is a slight decrease from the first model and a very large decrease from the model with the transformed data. Overall, from this info I dont think that this is a great model although it is statistically significant.

Problem 5

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# create the test data 
test_data_mul <- data.frame(TotExp = c(14), PropMD = c(.03))  # example values
print(test_data_mul)
##   TotExp PropMD
## 1     14   0.03
# Predict life expectancy using the model
predicted_life_exp <- predict(multiple.model.lm, newdata = test_data_mul)

# Print the predicted values
print(predicted_life_exp)
##       1 
## 107.696

Just at first glance this seems like an unrealistic prediction but buts inspect the data a little bit.

hist(data$LifeExp, xlim = c(30,100))

print(max(data$LifeExp))
## [1] 83
hist(data$PropMD, xlim= c(0,.06))

print(max(data$PropMD))
## [1] 0.03512903
hist(data$TotExp)

print(min(data$TotExp))
## [1] 13
print(mean(data$TotExp))
## [1] 41695.49
print(max(data$TotExp))
## [1] 482750

From looking at the data above we can conclude that this is not an accurate prediction. We can see that the longest life in our data was 83 years living to 107 would be an extreme outlier. Also from inspecting the PropMD and TotExp columns we can also see that the values passed into the model for the prediction are also outliers themselves. We can see that the largest PropMD value is .03 which is an outlier and the TotExp minimum value is 13 which is also an outlier and the value that we passed in. Overall this is not an accurate prediction but we also gave the model outliers for the inputs.