DATA 605 Week 12 Homework

Import Data

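library(readr)  # provides read_csv()
# url is assumed to be defined earlier as the location of the WHO data CSV
who_data <- read_csv(url, show_col_types = FALSE)
head(who_data)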

## # A tibble: 6 × 10
##   Country   LifeExp InfantSurvival Under5Survival TBFree  PropMD  PropRN PersExp
##   <chr>       <dbl>          <dbl>          <dbl>  <dbl>   <dbl>   <dbl>   <dbl>
## 1 Afghanis…      42          0.835          0.743  0.998 2.29e-4 5.72e-4      20
## 2 Albania        71          0.985          0.983  1.00  1.14e-3 4.61e-3     169
## 3 Algeria        71          0.967          0.962  0.999 1.06e-3 2.09e-3     108
## 4 Andorra        82          0.997          0.996  1.00  3.30e-3 3.5 e-3    2589
## 5 Angola         41          0.846          0.74   0.997 7.04e-5 1.15e-3      36
## 6 Antigua …      73          0.99           0.989  1.00  1.43e-4 2.77e-3     503
## # ℹ 2 more variables: GovtExp <dbl>, TotExp <dbl>

Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

Scatterplot

plot(who_data$TotExp, who_data$LifeExp, main="Life Expectancy vs Total Expenditure",
     xlab="Total Expenditure", ylab="Life Expectancy")

Linear Model

lm1 <- lm(LifeExp~TotExp, data = who_data)
summary(lm1)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The F statistic is 65.26 on 1 and 188 degrees of freedom. It tests the null hypothesis that all slope coefficients are zero; with a single predictor, this is the same as testing whether the coefficient of TotExp is zero. The p-value, 7.714e-14, is extremely small, so a relationship this strong would be very unlikely if the true slope were zero. Together, the large F statistic and the small p-value indicate that the model as a whole is statistically significant.
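As a quick check on how the F statistic and its p-value relate, the sketch below recomputes the p-value from the numbers printed in the summary output. It is only an illustration: the values are copied by hand from the output rather than extracted from `lm1`, and the variable names are my own.

```r
# Values copied from the summary(lm1) output (rounded as printed)
f_stat  <- 65.26   # F-statistic
df_num  <- 1       # numerator degrees of freedom
df_den  <- 188     # residual degrees of freedom
t_slope <- 8.079   # t value for the TotExp coefficient

# The overall p-value is the upper-tail probability of the F distribution
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)
print(p_value)

# With a single predictor, F equals the squared t statistic of the slope
print(t_slope^2)
```

The recomputed p-value should be close to the reported 7.714e-14, and the squared t value should agree with the F statistic up to rounding.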

The multiple \(R^2\) is 0.2577, meaning that 25.77% of the variability in life expectancy is explained by total expenditure in this model. The large share left unexplained suggests that other predictors likely help explain the variability in life expectancy.
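The relationship between \(R^2\), adjusted \(R^2\), and the F statistic can be sketched from the printed summary values. This is a hand check under the assumption that the model has one predictor and 190 observations (188 residual degrees of freedom plus two estimated coefficients); the variable names are illustrative.

```r
# Values copied from the summary(lm1) output (rounded as printed)
r2     <- 0.2577
f_stat <- 65.26
p      <- 1            # number of predictors
df_res <- 188          # residual degrees of freedom
n      <- df_res + p + 1

# Adjusted R^2 penalizes R^2 for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(adj_r2)

# R^2 can also be recovered from the F statistic and its degrees of freedom
r2_from_f <- (f_stat * p) / (f_stat * p + df_res)
print(r2_from_f)
```

Both results should match the reported 0.2537 and 0.2577 up to the rounding of the inputs.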

The residual standard error is 9.371. A lower RSE means that the observed values are closer to the predicted values. An RSE of 9.371 suggests that observed life expectancy typically deviates from the predicted value by about 9.371 years.
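To make the definition concrete, the sketch below computes the residual standard error by hand and compares it with the value R reports. Since the WHO data is not bundled with R, it uses the built-in `cars` dataset as a stand-in; the same steps would apply to `lm1`.

```r
# Stand-in model on a built-in dataset (not the WHO data)
fit <- lm(dist ~ speed, data = cars)

# RSE = sqrt(residual sum of squares / residual degrees of freedom)
rss         <- sum(resid(fit)^2)
rse_manual  <- sqrt(rss / df.residual(fit))
rse_summary <- summary(fit)$sigma

print(c(manual = rse_manual, summary = rse_summary))
```

Because of the square root of an averaged squared residual, the RSE is a typical size of a residual rather than a literal average of absolute deviations.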

Assumptions of Linear Model

Normality

From the Q-Q plot below, we can see that the residuals do not fall along the normal line, so we cannot confirm normality.

qqnorm(resid(lm1))
qqline(resid(lm1))

Linearity

To check linearity, we plot the residuals against the fitted values. For linearity to hold, there should be no discernible trend in the points. In the plot below, the points are clustered on the left side of the graph, so we cannot confirm linearity.

plot(fitted(lm1),resid(lm1))

Homoscedasticity

To check homoscedasticity, we plot the residuals against the fitted values again, with a horizontal line at 0. Ideally, the points should be randomly dispersed about 0 with no discernible pattern. From the plot below, we cannot confirm homoscedasticity because there is a discernible pattern and the points are not evenly scattered about 0.

plot(fitted(lm1),resid(lm1))
abline(h = 0, col = "red", lty = 2)

Independence

Using the residuals vs fitted values plot above, we cannot assume independence, as there is a discernible trend in the points.

Conclusion

In conclusion, the assumptions for linear regression are not satisfied.

Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is "better?"

Scatterplot

plot(who_data$TotExp^0.06, who_data$LifeExp^4.6,
     main = "Life Expectancy^4.6 vs Total Expenditure^0.06",
     xlab = "TotExp^0.06", ylab = "LifeExp^4.6")

Linear Model

lm2 <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = who_data)
summary(lm2)

## 
## Call:
## lm(formula = I(LifeExp^4.6) ~ I(TotExp^0.06), data = who_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -736527910   46817945  -15.73   <2e-16 ***
## I(TotExp^0.06)  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

The F statistic is 507.7 on 1 and 188 degrees of freedom. The p-value is below 2.2e-16, which is extremely small, so a relationship this strong would be very unlikely if the true slope were zero. Since the F statistic is large and the p-value is small, TotExp^0.06 is a significant predictor of LifeExp^4.6. The F statistic is much higher than in the previous model, and the p-value is lower than in the previous model.

The multiple \(R^2\) is 0.7298. This means that 72.98% of the variability in LifeExp^4.6 is explained by TotExp^0.06 in this model. This is about 47 percentage points higher than the \(R^2\) of the previous model (0.2577).

The residual standard error is 90,490,000. This value is measured in units of LifeExp^4.6, not in years, so it suggests that observed values of LifeExp^4.6 typically deviate from the predicted values by about 90,490,000. Because the response is on a different scale, this RSE cannot be compared directly with the 9.371 from the first model, and its size alone does not indicate a poor fit.
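Because the transformed model's residual standard error is in units of LifeExp^4.6, one way to compare the two models on a common scale is to back-transform the fitted values to years and compute a root mean squared error for each. The sketch below illustrates the idea on simulated data, since the WHO file is not available here; the data-generating values and all names are assumptions made for the example, so the printed numbers say nothing about the real data.

```r
set.seed(605)

# Simulated stand-in data (hypothetical, not the WHO dataset)
n        <- 200
tot_exp  <- exp(runif(n, log(10), log(1e5)))
life_exp <- 30 + 22 * tot_exp^0.06 + rnorm(n, sd = 4)
sim      <- data.frame(TotExp = tot_exp, LifeExp = life_exp)

# Same two model forms as in the report
fit_raw <- lm(LifeExp ~ TotExp, data = sim)
fit_tr  <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = sim)

# Back-transform the transformed model's fitted values to years
pred_raw <- fitted(fit_raw)
pred_tr  <- fitted(fit_tr)^(1 / 4.6)

# Root mean squared error on the original (years) scale
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
rmse_raw <- rmse(sim$LifeExp, pred_raw)
rmse_tr  <- rmse(sim$LifeExp, pred_tr)

print(c(untransformed = rmse_raw, transformed = rmse_tr))
```

Applied to `lm1` and `lm2` with `who_data` in place of `sim`, the same steps would give two error figures in years that can be compared directly.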

Overall, while the transformed model represents a significant improvement over the original model, further analysis may be necessary to better understand and account for the remaining variability in Life Expectancy.

Assumptions of Linear Model

Normality

From the Q-Q plot below, we can see that the majority of the points fall along the normal line, although the points trail off at both ends. This is an improvement over the previous model's Q-Q plot. We can assume normality based on this plot.

qqnorm(resid(lm2))
qqline(resid(lm2))

Linearity

In the plot below, we can see that the points are scattered about the plot with no discernible pattern. We can assume linearity based on this plot.

plot(fitted(lm2),resid(lm2))

Homoscedasticity

Based on the plot below, the points are scattered about 0 with no discernible pattern. We can assume homoscedasticity.

plot(fitted(lm2),resid(lm2))
abline(h = 0, col = "red", lty = 2)

Independence

Using the residuals vs fitted values plot above, we can assume independence, as there is no discernible trend in the plotted residuals.

Conclusion

In conclusion, the assumptions for linear regression are satisfied.

Which is Better?

The second linear model is better than the first. The second model has a much larger F statistic (507.7 vs 65.26) and a lower p-value than the first model. This indicates a stronger overall relationship between the predictor and response variables and a higher level of confidence in the observed results. The \(R^2\) has also improved substantially, from 0.2577 in the first model to 0.7298 in the second.

Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

# Given coefficients from the linear model
beta0 <- coef(lm2)[1]
beta1 <- coef(lm2)[2]

# Given TotExp^0.06 values
TotExp1 <- 1.5
TotExp2 <- 2.5

# Forecast LifeExp^4.6 for TotExp^0.06 = 1.5
LifeExp1 <- beta0 + beta1 * TotExp1

# Forecast LifeExp^4.6 for TotExp^0.06 = 2.5
LifeExp2 <- beta0 + beta1 * TotExp2

# Taking the 4.6th root to obtain forecasted life expectancy
LifeExpForecast1 <- LifeExp1^(1/4.6)
LifeExpForecast2 <- LifeExp2^(1/4.6)

print(paste("Forecasted Life Expectancy when TotExp^0.06 = 1.5:", round(LifeExpForecast1, 2)))

## [1] "Forecasted Life Expectancy when TotExp^0.06 = 1.5: 63.31"

print(paste("Forecasted Life Expectancy when TotExp^0.06 = 2.5:", round(LifeExpForecast2, 2)))

## [1] "Forecasted Life Expectancy when TotExp^0.06 = 2.5: 86.51"
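The two forecasts can also be reproduced without the fitted model object, using the coefficients printed in the `summary(lm2)` output. The sketch below is a hand check under that assumption; `forecast_life_exp` is a hypothetical helper name introduced only for this example.

```r
# Coefficients copied from the summary(lm2) output (as printed)
b0 <- -736527910   # intercept
b1 <- 620060216    # slope for TotExp^0.06

# Hypothetical helper: predict on the LifeExp^4.6 scale,
# then invert the power to return life expectancy in years
forecast_life_exp <- function(tot_exp_pow) {
  (b0 + b1 * tot_exp_pow)^(1 / 4.6)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
print(round(forecasts, 2))
```

The results should agree with the forecasts of about 63.31 and 86.51 years shown above.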

Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
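In R's formula syntax, `a * b` expands to both main effects plus their interaction, and `a:b` denotes the interaction term alone. The sketch below illustrates this on the built-in `mtcars` dataset, used here only as a stand-in for the WHO data; the three ways of writing the formula should give the same coefficients.

```r
# Stand-in example on a built-in dataset (not the WHO data)
fit_long     <- lm(mpg ~ wt + hp + wt * hp, data = mtcars)  # main effects written twice
fit_short    <- lm(mpg ~ wt * hp, data = mtcars)            # shorthand
fit_explicit <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)    # interaction spelled out

# All three formulas produce the same four coefficients
print(names(coef(fit_long)))
print(all.equal(coef(fit_long), coef(fit_short)))
print(all.equal(coef(fit_long), coef(fit_explicit)))
```

By the same rule, a formula such as `LifeExp ~ PropMD * TotExp` would describe the model stated above.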

Linear Model

lm3 <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data  = who_data)
summary(lm3)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The F statistic is 34.49 on 3 and 186 degrees of freedom. The p-value is below 2.2e-16, which is extremely small, so results this strong would be very unlikely if all of the coefficients were zero. Since the F statistic is fairly large and the p-value is small, the model as a whole is significant. The individual p-values for PropMD, TotExp, and their interaction are all below 0.001, so each term is significant as well.

The adjusted \(R^2\) is 0.3471. This means that about 34.71% of the variability in life expectancy is explained by PropMD, TotExp, and their interaction, after adjusting for the number of predictors. This suggests that there may be other variables not in this model that can further explain the variability.

The residual standard error is 8.765. An RSE of 8.765 suggests that observed life expectancy typically deviates from the predicted value by about 8.765 years. This is only slightly lower than the 9.371 of the first model, so the observed values are only marginally closer to the predicted values.

Assumptions of Linear Model

Normality

From the below Q-Q plot, we can see that the residuals do not fall along the normal line perfectly. It is hard to confirm normality based on this plot.

qqnorm(resid(lm3))
qqline(resid(lm3))

Linearity

In the below residuals vs fitted values plot, we can see that the points are populated towards the top left-hand corner of the plot. In order to confirm linearity, there would need to be no discernible pattern in the points. We cannot confirm linearity based on this plot.

plot(fitted(lm3),resid(lm3))

Homoscedasticity

To confirm homoscedasticity, we would need to see the points scattered about 0 with no discernible pattern. In this case, we cannot confirm homoscedasticity.

plot(fitted(lm3),resid(lm3))
abline(h = 0, col = "red", lty = 2)

Independence

Using the residuals vs fitted values plot above, we cannot assume independence, as there is a discernible trend in the plotted residuals.

Conclusion

Overall, the assumptions for linear regression are not satisfied. This is not the best model.

Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

# Extract coefficients from the linear model lm3
beta0 <- coef(lm3)[1]
beta1 <- coef(lm3)[2]
beta2 <- coef(lm3)[3]
beta3 <- coef(lm3)[4]

# Define the values of PropMD and TotExp
PropMD_value <- 0.03
TotExp_value <- 14

# Compute the predicted LifeExp
LifeExp_prediction <- beta0 + beta1 * PropMD_value + beta2 * TotExp_value + beta3 * PropMD_value * TotExp_value

print(LifeExp_prediction)

## (Intercept) 
##     107.696

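A life expectancy of 107.696 years does not seem realistic, as the average life expectancy worldwide is around 71.33. A value of 107.696 would be considered an outlier. The forecast is also an extrapolation: PropMD = 0.03 is roughly ten times the largest PropMD value visible in the data preview (3.30e-3), so the model is being applied well outside the values seen there.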
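To see which part of the model drives this forecast, the prediction can be broken into its four terms. The sketch below uses the rounded coefficients printed in the `summary(lm3)` output, so the total differs slightly from the 107.696 computed from the full-precision coefficients; the variable names are illustrative.

```r
# Coefficients copied from the summary(lm3) output (rounded as printed)
b <- c(intercept = 62.77, PropMD = 1497,
       TotExp = 7.233e-05, interaction = -6.026e-03)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the predicted life expectancy (years)
contrib <- c(
  intercept   = b[["intercept"]],
  PropMD      = b[["PropMD"]] * prop_md,
  TotExp      = b[["TotExp"]] * tot_exp,
  interaction = b[["interaction"]] * prop_md * tot_exp
)

print(round(contrib, 3))
print(sum(contrib))
```

Almost all of the increase over the intercept comes from the PropMD term, while the TotExp and interaction terms contribute only a tiny fraction of a year at these input values.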

Kristin Lussi

2024-04-14
