Load Data

who <- read.csv("https://raw.githubusercontent.com/smithchad17/Class605/master/who.csv", header = TRUE, stringsAsFactors = FALSE)
head(who)
##               Country LifeExp InfantSurvival Under5Survival  TBFree
## 1         Afghanistan      42          0.835          0.743 0.99769
## 2             Albania      71          0.985          0.983 0.99974
## 3             Algeria      71          0.967          0.962 0.99944
## 4             Andorra      82          0.997          0.996 0.99983
## 5              Angola      41          0.846          0.740 0.99656
## 6 Antigua and Barbuda      73          0.990          0.989 0.99991
##        PropMD      PropRN PersExp GovtExp TotExp
## 1 0.000228841 0.000572294      20      92    112
## 2 0.001143127 0.004614439     169    3128   3297
## 3 0.001060478 0.002091362     108    5184   5292
## 4 0.003297297 0.003500000    2589  169725 172314
## 5 0.000070400 0.001146162      36    1620   1656
## 6 0.000142857 0.002773810     503   12543  13046

1

plot(x = who$TotExp, y = who$LifeExp)

who_lm <- lm(LifeExp ~ TotExp, data = who)
summary(who_lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##                 Estimate   Std. Error t value             Pr(>|t|)    
## (Intercept) 64.753374534  0.753536611  85.933 < 0.0000000000000002 ***
## TotExp       0.000062970  0.000007795   8.079   0.0000000000000771 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 0.00000000000007714

From the scatterplot, it is hard to tell whether there is a linear relationship between total expenditures and life expectancy. At a glance, it looks as if a country that wants its citizens to live beyond 75 needs to spend at least 100,000 US dollars on healthcare. There also appears to be a ceiling at a life expectancy of around 85: beyond that point, exponentially more money is spent for only a small increase in life expectancy.

The linear model tells a different story. The residuals are roughly centered on zero, and the 1Q and 3Q values are of similar magnitude (-4.778 and 7.116), although the median of 3.154 and the long negative tail (a minimum of -24.764 against a maximum of 13.292) point to some left skew. The standard error of the slope is about 8 times smaller than the coefficient, meaning there is little variability in the slope estimate. The p-values are extremely small, meaning that both TotExp and the intercept are statistically significant. The \(R^2\) value of .2577 means the model explains only about 25.8% of the variation in life expectancy, which isn’t very good. With a single predictor, the F-statistic tests the same hypothesis as the t-test on the slope (F is the square of the t value), so it adds no new information. Judging from the summary alone, the residuals look only approximately normal around zero, so the model conditions are at best roughly met; residual plots would be needed to confirm this.
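The link between the slope's t value and the F-statistic can be checked by hand from the printed summary. The sketch below only re-uses numbers copied from the first summary output above; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the first lm summary printed above
slope    <- 0.000062970   # TotExp estimate
slope_se <- 0.000007795   # TotExp standard error

# t value is the estimate divided by its standard error
t_slope <- slope / slope_se

# With a single predictor, the overall F-statistic is the squared t value
f_stat <- t_slope^2

round(t_slope, 2)  # about 8.08, matching the reported 8.079
round(f_stat, 2)   # about 65.26, matching the reported F-statistic
```

Small differences from the printed values come from the rounding in the summary table.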

2

#make a copy of data to transform
trans_who <- who

trans_who <- transform(trans_who, LifeExp = LifeExp^4.6, TotExp = TotExp^.06)

#fit the transformed model under its own name so the first model is kept
trans_lm <- lm(LifeExp ~ TotExp, data = trans_who)
summary(trans_lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = trans_who)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73 <0.0000000000000002 ***
## TotExp       620060216   27518940   22.53 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 0.00000000000000022

Since we only have one predictor, the F-statistic carries the same information as the t-test on the slope. The Std. Error is 22.5 times smaller than the coefficient, which means there is little variability in the slope estimate. The \(R^2\) statistic means the model explains about 73% of the variation in the transformed response, \(LifeExp^{4.6}\), which is decent. The p-values are extremely small, meaning the slope and intercept are both statistically significant. This looks to be the ‘better’ model compared to the first one, although the two \(R^2\) values are measured on different response scales.
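As an aside, the same transformed fit can be written without copying the data by wrapping the powers in `I()` inside the formula. The sketch below uses only the six rows printed by head(who) as a toy subset (so its coefficients will not match the full-data output above), and the object names are made up for illustration.

```r
# Six rows copied from the head(who) output; a toy subset, not the full data
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# Approach used in the text: transform a copy, then fit
trans_head <- transform(who_head, LifeExp = LifeExp^4.6, TotExp = TotExp^0.06)
fit_copy <- lm(LifeExp ~ TotExp, data = trans_head)

# Alternative: apply the powers inside the formula with I()
fit_formula <- lm(I(LifeExp^4.6) ~ I(TotExp^0.06), data = who_head)

# The coefficient names differ, but the values agree
same_fit <- all.equal(unname(coef(fit_copy)), unname(coef(fit_formula)))
same_fit
```

Keeping the transformation in the formula leaves the original columns untouched, at the cost of longer coefficient names in the summary.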

3

First we need to build the equation. The response variable is \(LifeExp^{4.6}\) and the predictor is \(TotExp^{.06}\), so each forecast has to be raised to the power \(1/4.6\) to get back to years.

\(TotExp^{.06} = 1.5\)

\(\hat{LifeExp}^{4.6} = intercept + b_1 * TotExp^{.06}\)

\(\hat{LifeExp}^{4.6} = -736527910 + (620060216 * 1.5)\)

\(\hat{LifeExp}^{4.6} = 193562414\)

\(\hat{LifeExp} = 193562414^{1/4.6} = 63.3\)

\(TotExp^{.06} = 2.5\)

\(\hat{LifeExp}^{4.6} = intercept + b_1 * TotExp^{.06}\)

\(\hat{LifeExp}^{4.6} = -736527910 + (620060216 * 2.5)\)

\(\hat{LifeExp}^{4.6} = 813622630\)

\(\hat{LifeExp} = 813622630^{1/4.6} = 86.5\)
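The two forecasts can be reproduced in a few lines of R. This is a minimal sketch that hard-codes the coefficients from the transformed-model summary above; the function name `forecast_life_exp` is hypothetical.

```r
# Coefficients copied from the transformed-model summary above
b0 <- -736527910   # intercept
b1 <-  620060216   # slope on the transformed predictor

# Predict on the transformed scale, then undo the 4.6 power on the response
forecast_life_exp <- function(tot_exp_06) {
  (b0 + b1 * tot_exp_06)^(1 / 4.6)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
round(forecasts, 1)  # about 63.3 and 86.5 years
```

The input here is the already transformed predictor (the value 1.5 or 2.5), not raw expenditure in dollars.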

4

#Make a copy of the data in case I screw up
who_4 <- who

#Make a third variable that is the multiplication of PropMD and TotExp
who_4 <- transform(who_4, other = PropMD * TotExp)

model_4 <- lm(LifeExp ~ PropMD + TotExp + other, data = who_4)
summary(model_4)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + other, data = who_4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                   Estimate     Std. Error t value             Pr(>|t|)    
## (Intercept)   62.772703255    0.795605238  78.899 < 0.0000000000000002 ***
## PropMD      1497.493952519  278.816879652   5.371   0.0000002320602774 ***
## TotExp         0.000072333    0.000008982   8.053   0.0000000000000939 ***
## other         -0.006025686    0.001472357  -4.093   0.0000635273294941 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 0.00000000000000022

The F-statistic of 34.49 with a very small p-value shows that the current model fits better than an intercept-only model, i.e. at least one of the three predictors is useful. The \(R^2\) value of .357 means the model explains 35.7% of the variation in life expectancy, which isn’t very good. The Adjusted \(R^2\) penalizes the extra predictors, giving a slightly lower value of .347. The p-values are still very small, meaning all the variables are statistically significant. The Std. Errors are small relative to the estimates for the intercept, PropMD and TotExp; other (TotExp X PropMD) has the largest relative Std. Error, though its estimate is still about 4 times its Std. Error. The residuals are roughly centered on zero, but the median of 2.098 and the minimum of -27.320 against a maximum of 13.074 suggest some left skew.
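The hand-built product column is one way to get an interaction term; R's formula syntax can also create it with `PropMD * TotExp`, which expands to both main effects plus `PropMD:TotExp`. The sketch below compares the two on the six rows printed by head(who), a toy subset whose coefficients will not match the full-data output, with illustrative object names.

```r
# Six rows copied from the head(who) output; a toy subset, not the full data
who_head <- data.frame(
  LifeExp = c(42, 71, 71, 82, 41, 73),
  PropMD  = c(0.000228841, 0.001143127, 0.001060478,
              0.003297297, 0.000070400, 0.000142857),
  TotExp  = c(112, 3297, 5292, 172314, 1656, 13046)
)

# Approach used in the text: build the product column by hand
who_head$other <- who_head$PropMD * who_head$TotExp
fit_manual <- lm(LifeExp ~ PropMD + TotExp + other, data = who_head)

# Formula shorthand: main effects plus the PropMD:TotExp interaction
fit_star <- lm(LifeExp ~ PropMD * TotExp, data = who_head)

# Only the name of the last coefficient differs between the two fits
same_fit <- all.equal(unname(coef(fit_manual)), unname(coef(fit_star)))
same_fit
```

With the shorthand, `predict()` builds the product from PropMD and TotExp on its own, so new data does not need an `other` column.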

5

PropMD = .03 and TotExp = 14 (other = .42)

\(\hat{LifeExp} = intercept + b_1 * PropMD + b_2 * TotExp + b_3 * other\)

\(\hat{LifeExp} = 62.7727 + 1497.49 * PropMD + .0000723 * TotExp - .00603 * other\)

\(\hat{LifeExp} = 107.7\)

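A life expectancy of 107.7 years is highly unlikely, so this forecast does not seem realistic. The model doesn’t take into account the way life expectancy levels off around 85. Both inputs also lie outside the values printed by head(who) above (PropMD no higher than .0033, TotExp no lower than 112), so the forecast may be an extrapolation.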
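The arithmetic behind the 107.7 figure can be sketched as follows, using the coefficients copied from the summary of model_4 above. The variable names are illustrative only.

```r
# Coefficients copied from the model_4 summary above
coefs <- c(intercept = 62.772703255,
           PropMD    = 1497.493952519,
           TotExp    = 0.000072333,
           other     = -0.006025686)

# Inputs for the forecast
prop_md <- 0.03
tot_exp <- 14

# Predictor vector: 1 for the intercept, then PropMD, TotExp and their product
x <- c(1, prop_md, tot_exp, prop_md * tot_exp)

life_exp_hat <- sum(coefs * x)
round(life_exp_hat, 1)  # about 107.7
```

Nearly all of the forecast above the intercept comes from the PropMD term (1497.49 * .03 is roughly 44.9), which is why such a large PropMD value pushes the result past any plausible life expectancy.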