Background Information

The who.csv dataset contains real-world data from 2008. The variables included follow.

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

library(ggplot2)

raw_link <- 'https://raw.githubusercontent.com/rkasa01/DATA605_HW12/main/who.csv'
who_df <- read.csv(raw_link)
lm_model <- lm(LifeExp ~ TotExp, data=who_df)


plot(LifeExp~TotExp, data=who_df, 
     xlab="Total Expenditures", ylab="Life Expectancy",
     main="Life Expectancy vs Total Expenditures")
abline(lm_model, col = "red")

Here, we see a plot of Life Expectancy vs Total Expenditures for the given data, with the regression line in red. Many points sit at very low expenditure values, where life expectancy varies widely, so the points do not follow the line well. This indicates that a straight line is a poor fit for the untransformed variables.

summary(lm_model)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The R-squared value is 0.2577. This indicates that about 25.77% of the variability in LifeExp, the average life expectancy for a given country, is explained by TotExp, the sum of personal and government expenditures. This is a low value: it corresponds to a correlation of about 0.51 and leaves roughly three quarters of the variability unexplained.

The residual standard error is 9.371 on 188 degrees of freedom. This means that observed life expectancy typically differs from the fitted line by about 9.4 years, which is large for this variable. The main reason is the wide spread at the low end of spending: at total expenditures near 0, average life expectancy ranges from less than 40 years to almost 80 years. The standard error of the TotExp coefficient is 7.795e-06, which is small relative to the estimate of 6.297e-05 (t = 8.079), so the slope itself is estimated fairly precisely.

The F-statistic is 65.26 on 1 and 188 DF, and the p-value is 7.714e-14, which is far below 0.05. This is strong evidence for rejecting the null hypothesis, which is that the slope on ‘TotExp’ is zero (no linear relationship with ‘LifeExp’), in favor of the alternative hypothesis, which is that ‘TotExp’ is linearly related to ‘LifeExp’.
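As a quick check on how the reported numbers fit together, the sketch below recomputes the F-statistic and its p-value from values copied out of the summary output. In simple linear regression the F-statistic equals the square of the slope's t value, so 8.079 squared should land close to 65.26 (about 65.27, with the gap due to rounding). The variable names are my own and are not part of the original analysis.

```r
# Values copied from the summary(lm_model) output (rounded as printed)
t_slope <- 8.079   # t value for the TotExp coefficient
f_stat  <- 65.26   # F-statistic of the model
df1 <- 1           # numerator degrees of freedom
df2 <- 188         # denominator (residual) degrees of freedom

# With a single predictor, F is the square of the slope t statistic
t_squared <- t_slope^2

# Upper-tail area of the F distribution gives the overall p-value
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

print(t_squared)
print(p_value)
```

Because the inputs are rounded, the recomputed p-value will only approximately match the 7.714e-14 printed by summary().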

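Overall, I would conclude that the relationship between total expenditures and life expectancy is statistically significant, but the linear fit is weak. On its original scale, total expenditures does not sufficiently predict the average life expectancy of individuals in a given country. The assumptions of simple linear regression also do not appear to be met: the residuals are skewed (median 3.154, minimum -24.764, maximum 13.292), which argues against normally distributed residuals, and the spread of the points around the line is much wider at low expenditures, which argues against constant variance.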
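One way to check the regression assumptions more formally is to look at the residuals directly. The sketch below is a minimal example that uses R's built-in cars data so that it runs on its own; check_residuals is a hypothetical helper name of mine, and passing lm_model instead of demo_model would apply the same checks to the model above.

```r
# Hypothetical helper: numeric summaries of the residuals of a fitted lm object
check_residuals <- function(model) {
  res <- resid(model)
  fit <- fitted(model)
  list(
    mean_residual   = mean(res),                  # essentially 0 for OLS with an intercept
    median_residual = median(res),                # far from 0 suggests skewness
    shapiro_p       = shapiro.test(res)$p.value,  # small p suggests non-normal residuals
    abs_res_vs_fit  = cor(abs(res), fit)          # far from 0 suggests non-constant variance
  )
}

# Demo on built-in data (an assumption for illustration, not the WHO data)
demo_model <- lm(dist ~ speed, data = cars)
checks <- check_residuals(demo_model)
print(checks)

# Visual checks: residuals vs fitted values, and a normal Q-Q plot
par(mfrow = c(1, 2))
plot(fitted(demo_model), resid(demo_model),
     xlab = "Fitted values", ylab = "Residuals", main = "Residuals vs Fitted")
abline(h = 0, lty = 2)
qqnorm(resid(demo_model))
qqline(resid(demo_model))
```

A curved band in the first panel points to a violation of linearity, a funnel shape points to non-constant variance, and points bending away from the Q-Q line point to non-normal residuals.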

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

LifeExp_4_6=who_df$LifeExp^4.6
TotExp_4_6=who_df$TotExp^.06

lm_model_4_6 <- lm(LifeExp_4_6 ~ TotExp_4_6, data=who_df)
plot(LifeExp_4_6~TotExp_4_6, data=who_df, 
     xlab="Total Expenditures^0.06", ylab="Life Expectancy^4.6",
     main="Life Expectancy^4.6 vs Total Expenditures^0.06")
abline(lm_model_4_6, col = "red")

Here, we see a plot of Life Expectancy vs Total Expenditures using the transformed values, with the regression line once again in red. The points lie much closer to the line than in the first plot, which suggests that the transformed total expenditures predicts the transformed life expectancy far better than the untransformed variables did.

summary(lm_model_4_6)
## 
## Call:
## lm(formula = LifeExp_4_6 ~ TotExp_4_6, data = who_df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_4_6   620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The R-squared value is 0.7298. This indicates that about 72.98% of the variability in LifeExp (when raised to the 4.6 power) is explained by TotExp (when raised to the .06 power). This is a high value, corresponding to a correlation of about 0.85 between the transformed variables.

The residual standard error is 90490000 on 188 degrees of freedom. This number looks extremely high only because it is measured in units of LifeExp^4.6, so it cannot be compared directly with the 9.371 years from the first model. The standard error of the slope, 27518940, is small relative to the estimate of 620060216 (t = 22.53).

The F-statistic is 507.7 on 1 and 188 DF, and the p-value is less than 2.2e-16, making it significant. This is strong evidence that TotExp^.06 is a useful linear predictor of LifeExp^4.6.

Overall, I would conclude that there is a significant and strong linear relationship between total expenditures and life expectancy once both have been transformed. Raising life expectancy to the power of 4.6 stretches the upper end of its range, and raising total expenditures to the power of .06 compresses its very wide range, much like a log transform. Together these make the relationship much closer to a straight line. The transformed model is the better one: its R-squared is 0.7298 versus 0.2577, and its F-statistic is 507.7 versus 65.26.
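Because the two residual standard errors are on different scales, one way to compare the models more fairly is to back-transform the fitted values and measure the error in the original units. The sketch below illustrates the idea on R's built-in cars data so that it runs on its own; rmse_original_scale is a hypothetical helper of mine, not something from the original analysis.

```r
# Hypothetical helper: root mean squared error on the original response scale
# for a model that was fitted to y^power.
# Assumes every fitted value is positive; otherwise the back-transform
# is undefined and the result is NaN.
rmse_original_scale <- function(model, y, power = 1) {
  fitted_original <- fitted(model)^(1 / power)
  sqrt(mean((y - fitted_original)^2))
}

# Demo on built-in data (an assumption for illustration, not the WHO data)
m_linear <- lm(dist ~ speed, data = cars)
m_power  <- lm(I(dist^0.5) ~ speed, data = cars)

rmse_linear <- rmse_original_scale(m_linear, cars$dist, power = 1)
rmse_power  <- rmse_original_scale(m_power,  cars$dist, power = 0.5)

print(c(linear = rmse_linear, power = rmse_power))
```

For the models here, the equivalent calls would be rmse_original_scale(lm_model, who_df$LifeExp) and rmse_original_scale(lm_model_4_6, who_df$LifeExp, power = 4.6), after first confirming that all fitted values of lm_model_4_6 are positive.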

3. Using the results from 2, forecast life expectancy when TotExp^.06 = 1.5. Then forecast life expectancy when TotExp^.06 = 2.5.

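# Predicted LifeExp^4.6 at TotExp^.06 = 1.5, using the fitted coefficients
# (stored in pred_4_6 so the LifeExp_4_6 data vector is not overwritten)
pred_4_6 = -736527910 +  620060216 * (1.5)
# Back-transform to years by taking the 1/4.6 power
LifeExp_1_5 = exp(log(pred_4_6)/4.6)
print(LifeExp_1_5)
## [1] 63.31153
# Same steps at TotExp^.06 = 2.5
pred_4_6 = -736527910 +  620060216 * (2.5)
LifeExp_2_5 = exp(log(pred_4_6)/4.6)
print(LifeExp_2_5)
## [1] 86.50645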
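The same forecast can be wrapped in a small vectorized function, which also makes it easy to see what the inputs mean on the original scale. This is a sketch using the rounded coefficients printed in the summary; forecast_life_exp and its argument names are hypothetical. Undoing the 0.06 power shows that TotExp^.06 = 1.5 corresponds to a TotExp of roughly 860, and 2.5 to roughly 4.3 million.

```r
# Coefficients copied from summary(lm_model_4_6) (rounded as printed)
b0 <- -736527910
b1 <-  620060216

# Hypothetical helper: forecast LifeExp in years from a value of TotExp^.06
forecast_life_exp <- function(tot_exp_06, intercept = b0, slope = b1,
                              y_power = 4.6, x_power = 0.06) {
  pred_transformed <- intercept + slope * tot_exp_06   # prediction of LifeExp^4.6
  data.frame(
    tot_exp_06 = tot_exp_06,
    tot_exp    = tot_exp_06^(1 / x_power),             # TotExp on its original scale
    life_exp   = pred_transformed^(1 / y_power)        # back-transformed to years
  )
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
print(forecasts)
```

The life_exp column should reproduce the two values printed above, since x^(1/4.6) and exp(log(x)/4.6) are the same calculation.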

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

lm_model_PropMD <- lm(who_df$LifeExp ~ who_df$PropMD + who_df$TotExp + who_df$PropMD * who_df$TotExp)

summary(lm_model_PropMD)
## 
## Call:
## lm(formula = who_df$LifeExp ~ who_df$PropMD + who_df$TotExp + 
##     who_df$PropMD * who_df$TotExp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  6.277e+01  7.956e-01  78.899  < 2e-16 ***
## who_df$PropMD                1.497e+03  2.788e+02   5.371 2.32e-07 ***
## who_df$TotExp                7.233e-05  8.982e-06   8.053 9.39e-14 ***
## who_df$PropMD:who_df$TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The R-squared value is 0.3574. This indicates that about 35.74% of the variability in LifeExp is explained by PropMD, TotExp, and their interaction. This is still a fairly low value, leaving almost two thirds of the variability unexplained.

The residual standard error is 8.765 on 186 degrees of freedom. This is slightly lower than the 9.371 of the first model, but observed life expectancy still typically differs from the fitted values by almost 9 years.

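The F-statistic is 34.49 on 3 and 186 DF, and the p-value is less than 2.2e-16, making it significant. This is strong evidence for rejecting the null hypothesis, which is that all three slope coefficients are zero, and accepting the alternative hypothesis, which is that at least one of the terms is a significant predictor of average life expectancy in a given country. The individual p-values for PropMD (2.32e-07), TotExp (9.39e-14), and the interaction (6.35e-05) are all below 0.001, so each term is significant on its own.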

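Overall, I would conclude that this model is better than the first one (adjusted R-squared of 0.3471 versus 0.2537), but it is still a weak fit. The model is statistically significant, yet it explains only about a third of the variability in life expectancy.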
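Since this model has more predictors than the first one, adjusted R-squared is the fairer number to compare. The sketch below recomputes it from the printed R-squared values using the standard formula 1 - (1 - R^2)(n - 1)/(n - p - 1). Here n = 190 is inferred from the residual degrees of freedom plus the number of coefficients (188 + 2 and 186 + 4), and adjusted_r2 is a hypothetical helper name.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of predictors p (excluding the intercept)
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n <- 190  # inferred: residual df plus number of estimated coefficients

adj_simple      <- adjusted_r2(0.2577, n, p = 1)  # LifeExp ~ TotExp
adj_interaction <- adjusted_r2(0.3574, n, p = 3)  # PropMD, TotExp, and interaction

print(round(c(simple = adj_simple, interaction = adj_interaction), 4))
```

The results should agree with the 0.2537 and 0.3471 in the summary output up to rounding of the inputs.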

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

PropMD=.03
TotExp = 14
b0=6.277e+01 
b1=1.497e+03
b2=7.233e-05 
b3=-6.026e-03

LifeExp_b_0123 <- b0 + (b1*PropMD) + (b2*TotExp) + (b3*PropMD*TotExp)

print(LifeExp_b_0123)
## [1] 107.6785

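This does not seem realistic to me, especially with an expenditure this low. No country has an average life expectancy anywhere near 107 years, so this cannot be a valid forecast. The number is driven almost entirely by the PropMD term: 1.497e+03 x .03 adds about 45 years to the intercept of 62.77. A country where 3% of the population are doctors but total expenditures are only 14 is an unusual combination, so the model is probably extrapolating beyond the data it was fitted on.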
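To see where the forecast comes from, it helps to split it into the contribution of each term. This is a sketch using the rounded coefficients from the summary output; forecast_terms and the names in coefs are hypothetical labels of mine.

```r
# Coefficients copied from summary(lm_model_PropMD) (rounded as printed)
coefs <- c(intercept   =  6.277e+01,
           PropMD      =  1.497e+03,
           TotExp      =  7.233e-05,
           interaction = -6.026e-03)

# Hypothetical helper: contribution of each model term to the forecast, in years
forecast_terms <- function(prop_md, tot_exp, b = coefs) {
  c(intercept   = b[["intercept"]],
    PropMD      = b[["PropMD"]] * prop_md,
    TotExp      = b[["TotExp"]] * tot_exp,
    interaction = b[["interaction"]] * prop_md * tot_exp)
}

contrib <- forecast_terms(prop_md = 0.03, tot_exp = 14)
print(round(contrib, 4))
print(sum(contrib))
```

The TotExp and interaction terms contribute almost nothing at this expenditure level, so the forecast is essentially the intercept plus the PropMD term.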