BrianSingh_Data605

0.1 Load Libraries and Data

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.2.3

## Warning: package 'ggplot2' was built under R version 4.2.3

## Warning: package 'tibble' was built under R version 4.2.3

## Warning: package 'tidyr' was built under R version 4.2.3

## Warning: package 'readr' was built under R version 4.2.3

## Warning: package 'purrr' was built under R version 4.2.3

## Warning: package 'dplyr' was built under R version 4.2.3

## Warning: package 'stringr' was built under R version 4.2.3

## Warning: package 'forcats' was built under R version 4.2.3

## Warning: package 'lubridate' was built under R version 4.2.3

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(car)

## Warning: package 'car' was built under R version 4.2.3

## Loading required package: carData

## Warning: package 'carData' was built under R version 4.2.3

## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

library(olsrr)

## Warning: package 'olsrr' was built under R version 4.2.3

## 
## Attaching package: 'olsrr'
## 
## The following object is masked from 'package:datasets':
## 
##     rivers

who_data <- read.csv("https://raw.githubusercontent.com/brsingh7/Data/main/who.csv")

0.2 1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

plot(who_data$TotExp, who_data$LifeExp, xlab = "TotExp", ylab = "LifeExp", main = "LifeExp vs TotExp")

lm_mod <- lm(LifeExp ~ TotExp, data = who_data)
summary(lm_mod)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

The model has an f-statistic of 65.26 and a pvalue less than 0.05 indicating that the model fits the data well. However, the multiple r squared of 0.2577 is fairly low, meaning the variation in lifeexp is only be explained by 25.77% of the explanatory variable. Further, the standard error of 9.371 is larger than we want it to be, showing a large deviation in the predicted and actual values.

hist(lm_mod$residuals)

qqnorm(lm_mod$residuals)
qqline(lm_mod$residuals)

In looking at the residuals, we can see there is a skew to the right, not normally distributed, which can also raise questions about the model’s fit. We can see the residual values are not normally distributed.

0.3 2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

who_data$LifeExp_transformed <- who_data$LifeExp^4.6
who_data$TotExp_transformed <- who_data$TotExp^0.06

plot(who_data$TotExp_transformed, who_data$LifeExp_transformed, xlab = "TotExp_Transformed", ylab = "LifeExp_Transformed", main = "LifeExp vs TotExp")

lm_mod2 <- lm(LifeExp_transformed ~ TotExp_transformed, data = who_data)
summary(lm_mod2)

## 
## Call:
## lm(formula = LifeExp_transformed ~ TotExp_transformed, data = who_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -736527910   46817945  -15.73   <2e-16 ***
## TotExp_transformed  620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

After transforming the variables, there appears to be a much more linear relationship than the plot not transformed. Since the variables are transformed and scaled, the standard error appears out of wack, but we can’t rely on it or use for comparison based on this. The Multiple-R squared is much higher than the first model, and the pvalue is well below 0.05 significance. Overall, the model appears to be a better fit than the first.

hist(lm_mod2$residuals)

qqnorm(lm_mod2$residuals)
qqline(lm_mod2$residuals)

There is still some skew in the distribution of the residuals but a much tighter fit to in the qqplot than the first model.

0.4 3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

#results will be scaled, so have to invert for correct prediction
p_1.5 = (predict(lm_mod2, data.frame(TotExp_transformed=1.5)))^(1/4.6)
p_2.5 = (predict(lm_mod2, data.frame(TotExp_transformed=2.5)))^(1/4.6)

The predicted life expectancy when TotExp = 1.5 is 63.3 years old and 86.5 years old when TotExp = 2.5.

0.5 4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

lm_mod3 <- lm(LifeExp ~ PropMD + TotExp + PropMD*TotExp, data = who_data)
summary(lm_mod3)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

The explanatory variables all have a pvalue less than 0.05, therefore are statistically significant. However, the multiple r squared of 0.3574 is much lower than the 2nd model. The FStatistic shows that the model is statistically significant at the pvalue below 0.05. However, the residual error is near the first model’s value.

hist(lm_mod3$residuals)

qqnorm(lm_mod3$residuals)
qqline(lm_mod3$residuals)

The residual plots for this model appear to be close to the first, with a right skewness and high variance in the distribution.

0.6 5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

p_prodmd_totexp = predict(lm_mod3, data.frame(PropMD=0.03,TotExp=14))
p_prodmd_totexp

##       1 
## 107.696

The model predicts lifeexp at 107.69 years old, which does not seem to be an accurate prediction. Its prediction is significantly higher than the other models, and a high proportion of MDs with a low personal and govt expenditure does not appear to be the case with the actual data (e.g. Japan has the highest lifeexpectancy with a lower proportion of MDs and higher totalExp, as is the case with Andorra, Australia and so on)

BrianSingh_Data605_HW12

Brian Singh

2023-11-19

0.1 Load Libraries and Data

0.2 1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

0.4 3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

0.5 4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

0.6 5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?