R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

library(tidyverse)
library(readxl)
who_df <- read.csv("who.csv")

1. Explore LifeExp~TotExp

ScatterPLot LifeExp~TotExp

It looks like there is a clear polynomial level relationship between expenditure and life expectancy

library(ggplot2)


ggplot(who_df, aes(x=TotExp, y=LifeExp)) + geom_point()+
  geom_point(size=2, shape=23)

#scawho_df$LifeExp

Build model

my_fit <- lm(LifeExp ~TotExp, data = who_df)
#df_2[1268,]
layout(matrix(c(1, 2, 3, 4), 2, 2))
summary(my_fit)

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

plot(my_fit)

anova(my_fit)

## Analysis of Variance Table
## 
## Response: LifeExp
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## TotExp      1  5731.3  5731.3  65.264 7.714e-14 ***
## Residuals 188 16509.5    87.8                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(my_fit)

##                    2.5 %       97.5 %
## (Intercept) 6.326690e+01 6.623985e+01
## TotExp      4.759394e-05 7.834643e-05

Interpret summary stats

our f statistic is low, but it still shows that our independent variable TotExp improves our model
Our P value statistic, shows that TotExp is in fact significant.
Our standard error, shows a confidence interval of 4.759394e-05- 7.834643e-05.
- This shows that our coefficient’s confidence interval doesn’t span 0, and is therefore further proof that totexp improves our model
looking at plots, our residuals show a clear linear trend, there seems to be some deviation in qq plot from normality and several observations are close to cooks distance
- it looks pretty clear that some sort of log transformation may be needed when looking at the residual plot

2. Transform variables

who_df$LifeExp2 <- who_df$LifeExp**4.6
who_df$TotExp2 <- who_df$TotExp**.06
ggplot(who_df, aes(x=TotExp2, y=LifeExp2)) + geom_point()+
  geom_point(size=2, shape=23)

my_fit <- lm(LifeExp2 ~TotExp2, data = who_df)
#df_2[1268,]
layout(matrix(c(1, 2, 3, 4), 2, 2))
my_sum <- summary(my_fit)
print(my_sum)

## 
## Call:
## lm(formula = LifeExp2 ~ TotExp2, data = who_df)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp2      620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

plot(my_fit)

anova(my_fit)

## Analysis of Variance Table
## 
## Response: LifeExp2
##            Df     Sum Sq    Mean Sq F value    Pr(>F)    
## TotExp2     1 4.1575e+18 4.1575e+18   507.7 < 2.2e-16 ***
## Residuals 188 1.5395e+18 8.1889e+15                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

confint(my_fit)

##                  2.5 %     97.5 %
## (Intercept) -828883923 -644171896
## TotExp2      565774630  674345801

From the scatter plot, we can see what looks like a much more linear relationship between life exp and totalexp
Our rsquared and fstatistic are significantly stronger than the previous model
Our residual plots, show much more constant variance in the residuals, no outliers near cooks distance, and a much more normal qq plot
this model clearly fits the data better

3 make predictions

+TotExp^.06 =1.5== 63.3 + TotExp^.06=2.5== 86.5

my_intercept <- my_sum$coefficients[1] 
expenditure_coeff <- my_sum$coefficients[2] 
(expenditure_coeff*1.5 +my_intercept)^(1/4.6)

## [1] 63.31153

(expenditure_coeff*2.5 +my_intercept)^(1/4.6)

## [1] 86.50645

4 build new model

my_fit <- lm(LifeExp ~PropMD+TotExp+I(PropMD*TotExp), data = who_df)
new_sum <- summary(my_fit)
print(new_sum)

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + I(PropMD * TotExp), 
##     data = who_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD              1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp              7.233e-05  8.982e-06   8.053 9.39e-14 ***
## I(PropMD * TotExp) -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(my_fit)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

the model shows an improvement form original model, but it still is far inferior to our transformed model
- Residuals still violate constant variance assumption

5 predict

PropMD=.03 and TotExp = 14
the prediction does not seem realistic. when we look at the scatter plot we see there is a clear approaching of a maximum for average life expectancy around the low 80’s. 107 is unrealistic and shows us that our model may not be behaving well

my_intercept <- new_sum$coefficients[1] 
propmd_coeff <- new_sum$coefficients[2] 
totexp_coeff <- new_sum$coefficients[3] 
propmd_totexp <- new_sum$coefficients[4]

prop <- .03
expended <- 14

my_intercept+propmd_coeff*prop+ expended *totexp_coeff + (prop*expended*propmd_totexp)

## [1] 107.696

Attempt alternative model from question 2

age prediction of 66.9

my_fit <- lm(LifeExp2 ~PropMD+TotExp2+I(PropMD*TotExp2), data = who_df)
new_sum <- summary(my_fit)
plot(my_fit)

my_intercept <- new_sum$coefficients[1] 
propmd_coeff <- new_sum$coefficients[2] 
totexp_coeff <- new_sum$coefficients[3] 
propmd_totexp <- new_sum$coefficients[4]

prop <- .03
expended <- 14

(my_intercept+propmd_coeff*prop+ expended *totexp_coeff + (prop*expended*propmd_totexp))^(1/4.6)

## [1] 66.97703

J_herman_assignment12

Justin Herman

November 18, 2018