Data605 - Homework12

Introduction

The who.csv dataset contains real-world data from 2008. The variables included follow.

Country: name of the country
LifeExp: average life expectancy for the country in years
InfantSurvival: proportion of those surviving to one year or more
Under5Survival: proportion of those surviving to five years or more
TBFree: proportion of the population without TB.
PropMD: proportion of the population who are MDs
PropRN: proportion of the population who are RNs
PersExp: mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp: mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp: sum of personal and government expenditures.

#github URL
url <- getURL("https://raw.githubusercontent.com/amit-kapoor/data605/master/who.csv")
# Read csv from github
who_df <- read.csv(text = url,stringsAsFactors = FALSE)
# glimpse data
glimpse(who_df)

## Observations: 190
## Variables: 10
## $ Country        <chr> "Afghanistan", "Albania", "Algeria", "Andorra", "Angol…
## $ LifeExp        <int> 42, 71, 71, 82, 41, 73, 75, 69, 82, 80, 64, 74, 75, 63…
## $ InfantSurvival <dbl> 0.835, 0.985, 0.967, 0.997, 0.846, 0.990, 0.986, 0.979…
## $ Under5Survival <dbl> 0.743, 0.983, 0.962, 0.996, 0.740, 0.989, 0.983, 0.976…
## $ TBFree         <dbl> 0.99769, 0.99974, 0.99944, 0.99983, 0.99656, 0.99991, …
## $ PropMD         <dbl> 0.000228841, 0.001143127, 0.001060478, 0.003297297, 0.…
## $ PropRN         <dbl> 0.000572294, 0.004614439, 0.002091362, 0.003500000, 0.…
## $ PersExp        <int> 20, 169, 108, 2589, 36, 503, 484, 88, 3181, 3788, 62, …
## $ GovtExp        <int> 92, 3128, 5184, 169725, 1620, 12543, 19170, 1856, 1876…
## $ TotExp         <int> 112, 3297, 5292, 172314, 1656, 13046, 19654, 1944, 190…

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

who_df2 <- select(who_df, LifeExp, TotExp)
ggpairs(who_df2)

Seeing above plot, we can see that the correleation between LifeExp & TotExp variables is 0.508 that shows weak relationship between them. Scatter lot shows relationship is not linear between these 2 variables.

lm1 <- lm(LifeExp~TotExp, data = who_df2)
summary(lm1)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

# residuals
lm1 %>% 
  ggplot(aes(fitted(lm1), resid(lm1))) +
  geom_point() +
  geom_smooth(method = lm, se =F) +
  labs(title = "Residual Analysis",
       x = "Fitted Line", y = "Residuals") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

# residuals histogram
hist(lm1$residuals)

# qq plot
lm1 %>% 
  ggplot(aes(sample = resid(lm1))) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot") +
  theme_minimal()

F-statistic: This F-statistic describes if there is a relationship between the dependent and independent variables. Generally, a large F-statistic shows a stronger relationship. In this case the value is 65.26 which indiates there is a relation between the variables.
\(R^{2}\): The R-squared is a statistical measure of how close the data are to the fitted regression line. \(R^{2}\) values are between 0 and 1; where numbers closer to 1 represent well-fitting models. The adjusted \(R^{2}\) is to account for the number of independent variables used to make the model. In our case R-squared value is 0.2577 which shows that the models accounts for only 25.77% of varition in the data so it doesnt look to be a good fit.
Residual standard error: The Residual Standard Error is the average amount that the response (dist) will deviate from the true regression line. In this case it is high.
p-value: It is used for rejecting or accpeting the Null hypothesis. In general we use 0.05 as the cutoff for significance; when p-value is smaller than 0.05, we reject H0. In this case p-value is smaller than 0.05, hence we can reject Null hypothesis. Thus there is some relationship between the 2 variables.

Seeing the residual and QQ plot, it doesnt look a good model.

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

LifeExp2 <- who_df$LifeExp^4.6
TotExp2 <- who_df$TotExp^.06
# create new data frame
who_df3 <- data.frame(LifeExp2, TotExp2)
#glimpse data
glimpse(who_df3)

## Observations: 190
## Variables: 2
## $ LifeExp2 <dbl> 29305338, 327935478, 327935478, 636126841, 26230450, 3726362…
## $ TotExp2  <dbl> 1.327251, 1.625875, 1.672697, 2.061481, 1.560068, 1.765748, …

# pairwise plot
ggpairs(who_df3)

# new model
lm2 <- lm(LifeExp2~TotExp2, data = who_df3)
summary(lm2)

## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = who_df3)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp2      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

# residuals
lm2 %>% 
  ggplot(aes(fitted(lm2), resid(lm2))) +
  geom_point() +
  geom_smooth(method = lm, se =F) +
  labs(title = "Residual Analysis",
       x = "Fitted Line", y = "Residuals") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

# residuals histogram
hist(lm2$residuals)

# qq plot
lm2 %>% 
  ggplot(aes(sample = resid(lm2))) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot") +
  theme_minimal()

F-statistic is 507.7 now which is better than previous value of 65.26. R-squared is improved to 0.7298 than its previous value of 0.2577. p-value < 2.2e-16 is better too. Residual standard error is 90490000 now.

Seeing the residuals plot, it appears more normal as the earlier one was left skewed. QQ plot also shows great improvement. Hence this second model (lm2) came out after transformation is an improved model in comparison to model1 (lm1).

3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

# y = -736527910 + 620060216 * x

forecast <- function(x) {
  y <- -736527910 + 620060216 * x
  y <- y^(1/4.6)
  return(y)
}

#when TotExp^.06 =1.5
forecast(1.5)

## [1] 63.31153

#when TotExp^.06 =2.5
forecast(2.5)

## [1] 86.50645

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

Since it is not mentioned which one to pick (question 1 or question 2). So, I choose input from question 1.

# new model
lm3 <- lm(LifeExp ~ PropMD + TotExp + (PropMD*TotExp), data=who_df)
summary(lm3)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

# residuals
lm3 %>% 
  ggplot(aes(fitted(lm3), resid(lm3))) +
  geom_point() +
  geom_smooth(method = lm, se =F) +
  labs(title = "Residual Analysis",
       x = "Fitted Line", y = "Residuals") +
  theme_minimal()

## `geom_smooth()` using formula 'y ~ x'

# residuals histogram
hist(lm3$residuals)

# qq plot
lm3 %>% 
  ggplot(aes(sample = resid(lm2))) +
  stat_qq() +
  stat_qq_line() +
  labs(title = "Q-Q Plot") +
  theme_minimal()

The adj R2 accounts for 0.3471 of the variability of the data, which means that only 34% of the variance in the response variable can be explained by the independent variable. It means that this model too can be improved some transformation as performed in solution 2.

The F-statistic value is quiet less which means that this model is not good for prediction. The p-value indicates that we should reject the null hypothesis, that there is norelationship between the variables.

The residual histogram is left skewed and qq plot is not aligned on edges. This model doesnt appear as good as the one described in solution 2.

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

coef(lm3)

##   (Intercept)        PropMD        TotExp PropMD:TotExp 
##  6.277270e+01  1.497494e+03  7.233324e-05 -6.025686e-03

givendata <- data.frame(PropMD=0.03, TotExp=14)
predict(lm3, givendata, interval="predict")

##       fit      lwr      upr
## 1 107.696 84.24791 131.1441

# max average life expectancy
max(who_df$LifeExp)

## [1] 83

The prediction is 107.70 years with 95% confidence interval between 84.25 and 131.14. The prediction is completely unrealistic. The highest life expectancy in the data is 83 years and there is nothing in original data to support this prediction. As mentioned in solution 4, this is not a good model.