R Markdown

who_data <- read.csv(text = getURL("https://raw.githubusercontent.com/charlsjoseph/data605wk12/master/who.csv"), header = T, stringsAsFactors =F)

kable(who_data[sample(nrow(who_data), 10), ], align='l', row.names=FALSE)
Country LifeExp InfantSurvival Under5Survival TBFree PropMD PropRN PersExp GovtExp TotExp
Brazil 72 0.981 0.980 0.99945 0.0010466 0.0034814 371 13940 14311
Namibia 61 0.955 0.939 0.99342 0.0002921 0.0030020 165 3888 4053
Belarus 69 0.994 0.992 0.99929 0.0047587 0.0124571 204 11315 11519
Turkey 73 0.976 0.974 0.99968 0.0015694 0.0029448 383 18632 19015
Armenia 69 0.979 0.976 0.99920 0.0036987 0.0049189 88 1856 1944
Chad 46 0.876 0.791 0.99430 0.0000330 0.0002387 22 234 256
Bulgaria 73 0.990 0.988 0.99959 0.0002535 0.0045532 272 11550 11822
Venezuela (Bolivarian Republic of) 74 0.982 0.979 0.99948 0.0017653 0.0010298 247 10528 10775
Slovenia 78 0.997 0.996 0.99985 0.0023603 0.0078516 1495 55233 56728
South Africa 51 0.944 0.931 0.99002 0.0007214 0.0038205 437 10920 11357
summary(who_data)
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN         
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455  
##  Median :0.9992   Median :0.0010474   Median :0.0027584  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387  
##     PersExp           GovtExp             TotExp      
##  Min.   :   3.00   Min.   :    10.0   Min.   :    13  
##  1st Qu.:  36.25   1st Qu.:   559.5   1st Qu.:   584  
##  Median : 199.50   Median :  5385.0   Median :  5541  
##  Mean   : 742.00   Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 515.25   3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :6350.00   Max.   :476420.0   Max.   :482750

Qn1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

who.lm <- lm(LifeExp ~ TotExp, data = who_data)
plot(who_data$TotExp, who_data$LifeExp, xlab = 'Total Expenditure - $', ylab = 'Average Life Expectancy- in Years', main='Avg. Life Expectancy Vs. Expenditure')
abline(who.lm, col="red")

The scatter plot doesnt fit properly into the linear line. It shows the expenditure is constant until the age 70-75 and slowly increases beyond 70.

Lets find the summary of linear regression and analyze the key parameters.

summary(who.lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##                 Estimate   Std. Error t value Pr(>|t|)    
## (Intercept) 64.753374534  0.753536611  85.933  < 2e-16 ***
## TotExp       0.000062970  0.000007795   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Regression model

\[ Avg Life Expectancy=64.75337453+0.00006297∗Total Expenditure \] The model explains on every unit of increase in Total expenditure, the life expenditure goes up by 0.00006297.

plot(who.lm, which=c(1,1))

The resduals are not scattererd, but having an unsual pattern which tells that relation doesnt hold the linear behaviour. P-value is very less which explains there is a strong relation between LifeExp and TotExp. Adjusted R^2 and R^2 is very less which explains the totalExp is not only an explanatory variable and model can be possibly more accurate or better.

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

#raise total expenditure to the power of 0.06
who_data$TotExp_1 <- (who_data$TotExp)^0.06

#raise life expectancy to the power of 4.6
who_data$LifeExp_1 <- (who_data$LifeExp)^4.6

#build new linear model
who_1.lm <- lm(LifeExp_1 ~ TotExp_1, data = who_data)
plot(who_data$TotExp_1, who_data$LifeExp_1, xlab = 'Total Expenditure(in US Dollars) raised to 0.06', ylab = 'Average Life Expectancy(in Years) raised to 4.6', main='Avg. Life Expectancy Vs. Expenditure')
abline(who_1.lm, col="red")

This explains a better linear relation. Lets check the model key paramaters.

summary(who_1.lm)
## 
## Call:
## lm(formula = LifeExp_1 ~ TotExp_1, data = who_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_1     620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Regression model is

\[LifeExp1=−736527909+620060216∗TotExp1 \]

plot(who_1.lm, which=c(1,1))

The residual looks to be scatter and model looks to follow the linear pattern. Adjusted R^2 and R^2 looks to be closer to 1 and better than previous model.

3. Using the results from 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

result1 <- (-736527909 + (620060216 * 1.5))^(1/4.6)
result2 <- (-736527909 + (620060216 * 2.5))^(1/4.6)
result1
## [1] 63.31153
result2
## [1] 86.50645

Estimated life expectancy is 63.31153 years when TotExp(0.06) is 1.5. Estimated life expectancy is 86.50645 years when TotExp(0.06) is 2.5.

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0+b1 x PropMd + b2 x TotExp +b3 x PropMD x TotExp

who_2.lm <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
summary(who_2.lm)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                     Estimate     Std. Error t value Pr(>|t|)    
## (Intercept)     62.772703255    0.795605238  78.899  < 2e-16 ***
## PropMD        1497.493952519  278.816879652   5.371 2.32e-07 ***
## TotExp           0.000072333    0.000008982   8.053 9.39e-14 ***
## PropMD:TotExp   -0.006025686    0.001472357  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Regression model is \[ Average Life Expectancy=62.77270326+1497.49395252×PropMD+0.00007233×TotExp−0.00602569×PropMD∗TotExp \]

p-Value for PropMD, TotExp and PropMD * TotExp are highly relevant to estimate average life expectancy. R2 and Adj. R2 values decreased compared to earlier

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

TotExp = 14
PropMD = 0.03
forcast.life.exp = 62.77270326 + (1497.49395252 * PropMD) + (0.00007233 * TotExp) - (0.00602569 * PropMD * TotExp)
forcast.life.exp
## [1] 107.696

The result doesnt looks to be realistic. This could be due to in-efficiency of the model variables.