R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(tidyverse)
library(readxl)
who_df <- read.csv("who.csv")

1. Explore LifeExp~TotExp

ScatterPLot LifeExp~TotExp

  • It looks like there is a clear polynomial level relationship between expenditure and life expectancy
library(ggplot2)


ggplot(who_df, aes(x=TotExp, y=LifeExp)) + geom_point()+
  geom_point(size=2, shape=23)

#scawho_df$LifeExp

Build model

my_fit <- lm(LifeExp ~TotExp, data = who_df)
#df_2[1268,]
layout(matrix(c(1, 2, 3, 4), 2, 2))
summary(my_fit)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14
plot(my_fit)

anova(my_fit)
## Analysis of Variance Table
## 
## Response: LifeExp
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## TotExp      1  5731.3  5731.3  65.264 7.714e-14 ***
## Residuals 188 16509.5    87.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(my_fit)
##                    2.5 %       97.5 %
## (Intercept) 6.326690e+01 6.623985e+01
## TotExp      4.759394e-05 7.834643e-05

Interpret summary stats

  • our f statistic is low, but it still shows that our independent variable TotExp improves our model
  • Our P value statistic, shows that TotExp is in fact significant.
  • Our standard error, shows a confidence interval of 4.759394e-05- 7.834643e-05.
    • This shows that our coefficient’s confidence interval doesn’t span 0, and is therefore further proof that totexp improves our model
  • looking at plots, our residuals show a clear linear trend, there seems to be some deviation in qq plot from normality and several observations are close to cooks distance
    • it looks pretty clear that some sort of log transformation may be needed when looking at the residual plot

2. Transform variables

who_df$LifeExp2 <- who_df$LifeExp**4.6
who_df$TotExp2 <- who_df$TotExp**.06
ggplot(who_df, aes(x=TotExp2, y=LifeExp2)) + geom_point()+
  geom_point(size=2, shape=23)

my_fit <- lm(LifeExp2 ~TotExp2, data = who_df)
#df_2[1268,]
layout(matrix(c(1, 2, 3, 4), 2, 2))
my_sum <- summary(my_fit)
print(my_sum)
## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = who_df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp2      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16
plot(my_fit)

anova(my_fit)
## Analysis of Variance Table
## 
## Response: LifeExp2
##            Df     Sum Sq    Mean Sq F value    Pr(>F)    
## TotExp2     1 4.1575e+18 4.1575e+18   507.7 < 2.2e-16 ***
## Residuals 188 1.5395e+18 8.1889e+15                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
confint(my_fit)
##                  2.5 %     97.5 %
## (Intercept) -828883923 -644171896
## TotExp2      565774630  674345801
  • From the scatter plot, we can see what looks like a much more linear relationship between life exp and totalexp
  • Our rsquared and fstatistic are significantly stronger than the previous model
  • Our residual plots, show much more constant variance in the residuals, no outliers near cooks distance, and a much more normal qq plot
  • this model clearly fits the data better

3 make predictions

+TotExp^.06 =1.5== 63.3 + TotExp^.06=2.5== 86.5

my_intercept <- my_sum$coefficients[1] 
expenditure_coeff <- my_sum$coefficients[2] 
(expenditure_coeff*1.5 +my_intercept)^(1/4.6)
## [1] 63.31153
(expenditure_coeff*2.5 +my_intercept)^(1/4.6)
## [1] 86.50645

4 build new model

my_fit <- lm(LifeExp ~PropMD+TotExp+I(PropMD*TotExp), data = who_df)
new_sum <- summary(my_fit)
print(new_sum)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + I(PropMD * TotExp), 
##     data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD              1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp              7.233e-05  8.982e-06   8.053 9.39e-14 ***
## I(PropMD * TotExp) -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

  • the model shows an improvement form original model, but it still is far inferior to our transformed model
    • Residuals still violate constant variance assumption

5 predict

  • PropMD=.03 and TotExp = 14
  • the prediction does not seem realistic. when we look at the scatter plot we see there is a clear approaching of a maximum for average life expectancy around the low 80’s. 107 is unrealistic and shows us that our model may not be behaving well
my_intercept <- new_sum$coefficients[1] 
propmd_coeff <- new_sum$coefficients[2] 
totexp_coeff <- new_sum$coefficients[3] 
propmd_totexp <- new_sum$coefficients[4]

prop <- .03
expended <- 14

my_intercept+propmd_coeff*prop+ expended *totexp_coeff + (prop*expended*propmd_totexp)
## [1] 107.696

Attempt alternative model from question 2

  • age prediction of 66.9
my_fit <- lm(LifeExp2 ~PropMD+TotExp2+I(PropMD*TotExp2), data = who_df)
new_sum <- summary(my_fit)
plot(my_fit)

my_intercept <- new_sum$coefficients[1] 
propmd_coeff <- new_sum$coefficients[2] 
totexp_coeff <- new_sum$coefficients[3] 
propmd_totexp <- new_sum$coefficients[4]

prop <- .03
expended <- 14

(my_intercept+propmd_coeff*prop+ expended *totexp_coeff + (prop*expended*propmd_totexp))^(1/4.6)
## [1] 66.97703