library(RCurl)
library(car)
library(knitr)
options(scipen=5)
set.seed(1973)
# download the data file from GitHub
who.data <- read.csv(text = getURL("https://raw.githubusercontent.com/akulapa/Akula-DATA605-Week12-HW12/master/who.csv"), header = TRUE, stringsAsFactors = FALSE)
kable(who.data[sample(nrow(who.data), 20), ], align='l', caption = "Sample 20 rows", row.names=FALSE)
Sample 20 rows
Country                      LifeExp  InfantSurvival  Under5Survival  TBFree   PropMD     PropRN     PersExp  GovtExp  TotExp
Grenada                      68       0.983           0.980           0.99992  0.0007547  0.0030755  342      6944     7286
Denmark                      79       0.997           0.996           0.99993  0.0035519  0.0099582  4350     314588   318938
Latvia                       71       0.992           0.991           0.99940  0.0031455  0.0056094  443      18224    18667
Netherlands                  80       0.996           0.995           0.99994  0.0036949  0.0146024  3560     187191   190751
Saudi Arabia                 70       0.979           0.974           0.99938  0.0014172  0.0030657  448      27621    28069
Antigua and Barbuda          73       0.990           0.989           0.99991  0.0001429  0.0027738  503      12543    13046
Mongolia                     66       0.965           0.958           0.99809  0.0025843  0.0033881  35       1539     1574
Afghanistan                  42       0.835           0.743           0.99769  0.0002288  0.0005723  20       92       112
Mozambique                   50       0.904           0.862           0.99376  0.0000245  0.0002948  14       315      329
Saint Lucia                  75       0.988           0.986           0.99978  0.0045951  0.0020307  323      5068     5391
Kyrgyzstan                   66       0.964           0.959           0.99863  0.0024168  0.0058612  28       396      424
Nauru                        61       0.975           0.970           0.99866  0.0010000  0.0063000  567      30200    30767
United Republic of Tanzania  50       0.926           0.882           0.99541  0.0000208  0.0003369  17       225      242
Papua New Guinea             62       0.946           0.927           0.99487  0.0000443  0.0004581  34       390      424
Marshall Islands             63       0.950           0.944           0.99759  0.0004138  0.0026207  294      18876    19170
Andorra                      82       0.997           0.996           0.99983  0.0032973  0.0035000  2589     169725   172314
El Salvador                  71       0.978           0.975           0.99936  0.0011739  0.0007547  177      5700     5877
China                        73       0.980           0.976           0.99799  0.0014021  0.0009795  81       1302     1383
Timor-Leste                  66       0.953           0.945           0.99211  0.0000709  0.0016113  45       1053     1098
Nicaragua                    71       0.971           0.964           0.99926  0.0003697  0.0010597  75       2183     2258

The table shows 20 randomly sampled rows from the dataset, which describes population health indicators for each country. Question 1 uses the columns LifeExp and TotExp.

Summary of data

summary(who.data)
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN         
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455  
##  Median :0.9992   Median :0.0010474   Median :0.0027584  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387  
##     PersExp           GovtExp             TotExp      
##  Min.   :   3.00   Min.   :    10.0   Min.   :    13  
##  1st Qu.:  36.25   1st Qu.:   559.5   1st Qu.:   584  
##  Median : 199.50   Median :  5385.0   Median :  5541  
##  Mean   : 742.00   Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 515.25   3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :6350.00   Max.   :476420.0   Max.   :482750

1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

Answer:

Scatterplot

who.lm <- lm(LifeExp ~ TotExp, data = who.data)
plot(who.data$TotExp, who.data$LifeExp, xlab = 'Total Expenditure(in US Dollars)', ylab = 'Average Life Expectancy(in Years)', main='Avg. Life Expectancy Vs. Expenditure')
abline(who.lm, col="red")

The scatterplot shows that total expenditure stays low for countries with an average life expectancy of up to about 70 years and rises sharply beyond that. The points follow a curve rather than a straight line as we move from left to right along the x-axis, so average life expectancy and total expenditure do not appear to be linearly related.

Regression Model

who.lm
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who.data)
## 
## Coefficients:
## (Intercept)       TotExp  
## 64.75337453   0.00006297

Regression model \(Average~ Life~ Expectancy = 64.75337453 + 0.00006297 * Total~ Expenditure\)

The model suggests that for every one-unit increase in total expenditure, average life expectancy goes up by \(0.00006297\) years.
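Because a one-unit change is tiny on the scale of TotExp, the slope is easier to read over a larger step. The sketch below simply re-uses the printed coefficients; the helper name `predict_life_exp` and the step of 10,000 units are illustrative choices, not part of the original analysis.

```r
# Coefficients copied from the fitted model printed above.
intercept <- 64.75337453
slope     <- 0.00006297

# Hypothetical helper: predicted life expectancy (years) for a given TotExp.
predict_life_exp <- function(tot_exp) intercept + slope * tot_exp

# Gain in predicted life expectancy for an extra 10,000 units of TotExp.
gain_10k <- predict_life_exp(20000) - predict_life_exp(10000)
print(gain_10k)
```

Under this model an additional 10,000 units of total expenditure corresponds to well under one extra year of predicted life expectancy.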

summary(who.lm)
## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = who.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##                 Estimate   Std. Error t value Pr(>|t|)    
## (Intercept) 64.753374534  0.753536611  85.933  < 2e-16 ***
## TotExp       0.000062970  0.000007795   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

Standard Error and p-Value

The slope of the regression line is \(0.000062970\) and its standard error is \(0.000007795\). The t statistic is \(8.079\). The two-sided p-value is \(7.71 \times 10^{-14}\), far below the 0.01 significance level, which indicates that total expenditure is a statistically significant predictor of average life expectancy.
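As a rough check, the t statistic and its two-sided p-value can be recomputed from the printed estimate and standard error. This is a sketch that relies only on the rounded values shown in the summary, so small differences from the printed output are expected.

```r
# Values copied from the summary(who.lm) coefficient table.
estimate  <- 0.000062970
std_error <- 0.000007795
df_resid  <- 188   # residual degrees of freedom reported by summary()

# t statistic: estimate divided by its standard error.
t_value <- estimate / std_error

# Two-sided p-value from the t distribution.
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

print(c(t = t_value, p = p_value))
```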

Residual standard error (\(S\)): 9.371 on 188 degrees of freedom. This is the standard error of the regression.

A standard error measures the uncertainty associated with a point estimate and guides how wide the confidence interval should be. If the sampling distribution is approximately normal, the point estimate falls within about two standard errors of the true population value roughly 95% of the time.
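To make the link between the standard error and a confidence interval concrete, here is a minimal sketch of a 95% interval for the TotExp slope, built from the printed estimate and standard error. With the fitted object available, `confint(who.lm)` should give essentially the same interval.

```r
# Values copied from the summary(who.lm) coefficient table.
estimate  <- 0.000062970
std_error <- 0.000007795
df_resid  <- 188

# Critical t value for a two-sided 95% interval.
t_crit <- qt(0.975, df = df_resid)

# Interval: estimate plus or minus t_crit standard errors.
ci <- estimate + c(-1, 1) * t_crit * std_error
print(ci)
```

The interval lies entirely above zero, which is consistent with the very small p-value for the slope.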

plot(who.lm, which = 1)

The plot above shows the residuals against the fitted values of average life expectancy. The residual standard error (\(S\)) is 9.371, which tells us that the typical distance of the data points from the fitted line is about \(9.371\) years.

If the residuals are normally distributed, about 95% of the observations should fall within \(\pm 18.742\) years of the fitted line.
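The size of that band is easier to judge next to the observed spread of life expectancy. The sketch below uses only numbers from the outputs above (the residual standard error and the minimum and maximum of LifeExp from the data summary); the comparison itself is an illustrative heuristic.

```r
# Residual standard error from summary(who.lm).
s <- 9.371

# Approximate 95% band around the fitted line: plus or minus two S.
band <- 2 * s
print(band)

# Observed range of LifeExp from summary(who.data): min 40, max 83.
observed_range <- 83 - 40

# Share of the observed range covered by the full width of the band.
print(2 * band / observed_range)
```

A band that spans most of the observed range suggests the model's predictions are imprecise.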

In this case the spread is not even on either side of the fitted line, and there may be outliers influencing the model, which suggests that the underlying data are highly skewed.

hist(who.data$TotExp, main = 'Total Expenditure per Capita on Healthcare Across Countries', xlab = 'Expenditure', ylab = 'Number of Countries')

The histogram shows that expenditures are right-skewed.
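A quick numeric companion to the histogram is to compare the mean and median of TotExp, both taken from the data summary above. A mean far above the median is a common informal sign of right skew; this is a rule of thumb, not a formal test.

```r
# Mean and median of TotExp copied from summary(who.data).
mean_tot_exp   <- 41696
median_tot_exp <- 5541

# A ratio well above 1 points to a long right tail.
skew_ratio <- mean_tot_exp / median_tot_exp
print(skew_ratio)
```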

R-Squared(\(R^2\))

R-squared statistic provides an overall measure of how well the model fits the data. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.

In this case \(R^2\) is \(0.2577\), which means the model explains about \(26\%\) of the variation in average life expectancy.

The adjusted \(R^2\) is \(0.2537\). It is smaller than \(R^2\) because it applies a penalty for the number of predictors in the model.
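The adjustment can be reproduced with the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\), where \(n\) is the number of observations and \(p\) the number of predictors. The sketch below assumes \(n = 190\) (the row count shown in the data summary) and uses the rounded \(R^2\), so the last digit may differ slightly from the printed value.

```r
# Hypothetical helper implementing the standard adjusted R-squared formula.
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared from summary(who.lm); 190 countries, one predictor.
adj <- adjusted_r2(r2 = 0.2577, n = 190, p = 1)
print(adj)
```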

F-statistic

The F-statistic tests whether the model with its predictors fits significantly better than an intercept-only model. Because this model has a single predictor, the F-test is equivalent to the t-test on the slope (\(F = t^2\), and \(8.079^2 \approx 65.3\)), so it adds no new information. Accordingly, the p-value for the coefficient TotExp and the p-value for the entire regression model are the same, \(7.71 \times 10^{-14}\).
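For a regression with one predictor, the F statistic equals the square of the slope's t statistic, which is why the two p-values coincide. The following sketch checks this with the rounded values from the summary output; tiny discrepancies come from rounding.

```r
# Values copied from summary(who.lm).
t_value <- 8.079
f_value <- 65.26

# With one predictor, F should be close to t squared.
print(t_value^2)

# The two tests give the same p-value.
p_from_t <- 2 * pt(t_value, df = 188, lower.tail = FALSE)
p_from_f <- pf(t_value^2, df1 = 1, df2 = 188, lower.tail = FALSE)
print(c(p_from_t = p_from_t, p_from_f = p_from_f))
```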

Conclusion

2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Answer:

#raise total expenditure to the power of 0.06
who.data$TotExp_01 <- (who.data$TotExp)^0.06

#raise life expectancy to the power of 4.6
who.data$LifeExp_01 <- (who.data$LifeExp)^4.6

#build new linear model
who_01.lm <- lm(LifeExp_01 ~ TotExp_01, data = who.data)
plot(who.data$TotExp_01, who.data$LifeExp_01, xlab = 'Total Expenditure(in US Dollars) raised to 0.06', ylab = 'Average Life Expectancy(in Years) raised to 4.6', main='Avg. Life Expectancy Vs. Expenditure')
abline(who_01.lm, col="red")

The scatterplot shows the transformed points spread evenly around the fitted line, with average life expectancy increasing as total expenditure increases. The curvature has disappeared, so after the transformation average life expectancy and total expenditure appear to be linearly related.

Regression Model

who_01.lm
## 
## Call:
## lm(formula = LifeExp_01 ~ TotExp_01, data = who.data)
## 
## Coefficients:
## (Intercept)    TotExp_01  
##  -736527909    620060216

Regression model \(LifeExp_{01} = -736527909 + 620060216 * TotExp_{01}\)

The model suggests that for every one unit increase in TotExp_01, LifeExp_01 goes up by \(620060216\) units.

summary(who_01.lm)
## 
## Call:
## lm(formula = LifeExp_01 ~ TotExp_01, data = who.data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp_01    620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

Standard Error and p-Value

The slope of the regression line is \(620060216\) and its standard error is \(27518940\). The t statistic is \(22.53\). The two-sided p-value is reported as less than \(2 \times 10^{-16}\), far below the 0.01 significance level, which indicates that transformed total expenditure is a statistically significant predictor of transformed average life expectancy.

Residual standard error(\(S\)): 90490000 on 188 degrees of freedom. It indicates standard error of the regression.

plot(who_01.lm, which = 1)

The plot above shows the residuals against the fitted values of the transformed model. The residual standard error (\(S\)) is 90490000, which tells us that the typical distance of the data points from the fitted line is about \(90490000\), measured in units of LifeExp^4.6 rather than years.

If the residuals are normally distributed, about 95% of the observations should fall within \(\pm 180980000\) of the fitted line.

In this case the spread is even on either side of the fitted line. However, we can use histograms to check whether the data are normally distributed.

hist(who.data$TotExp_01, main = 'Total Expenditure per Capita on Healthcare Across Countries', xlab = 'Expenditure Raised to 0.06', ylab = 'Number of Countries')

hist(who.data$LifeExp_01, main = 'Average Life Expectancy Across Countries', xlab = 'Average Life Expectancy Raised to 4.6', ylab = 'Number of Countries')

The expenditure histogram suggests some skewness on the left side. The average life expectancy histogram does not suggest skewness, but it does suggest the existence of outliers between \(0\) and \(100000000\).

R-Squared(\(R^2\))

R-squared statistic provides an overall measure of how well the model fits the data. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.

In this case \(R^2\) is \(0.7298\), which means the model explains about \(73\%\) of the variation in transformed average life expectancy.

The adjusted \(R^2\) is \(0.7283\). It is smaller than \(R^2\) because it applies a penalty for the number of predictors in the model.

F-statistic

The F-statistic tests whether the model with its predictors fits significantly better than an intercept-only model. Because this model has a single predictor, the F-test is equivalent to the t-test on the slope (\(F = t^2\), and \(22.53^2 \approx 507.6\)), so it adds no new information. The p-values for the coefficient TotExp_01 and for the entire regression model are both reported as below \(2.2 \times 10^{-16}\).
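The F statistic can also be recovered from \(R^2\) through the standard identity \(F = \frac{R^2 / p}{(1 - R^2)/(n - p - 1)}\). The sketch below applies it to both simple models, assuming \(n = 190\) and using the rounded \(R^2\) values from the two summaries, so the results only approximately match the printed F statistics.

```r
# Hypothetical helper: overall F statistic implied by R-squared.
f_from_r2 <- function(r2, n, p) {
  (r2 / p) / ((1 - r2) / (n - p - 1))
}

# Untransformed model (R-squared 0.2577) and transformed model (0.7298).
f_raw         <- f_from_r2(0.2577, n = 190, p = 1)
f_transformed <- f_from_r2(0.7298, n = 190, p = 1)

print(c(raw = f_raw, transformed = f_transformed))
```

The much larger F for the transformed model mirrors its higher \(R^2\). The two models have different response scales, though, so the comparison is indicative rather than exact.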

Conclusion

3. Using the results from 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

Answer:

#using LifeExp_01 = -736527909 + 620060216 * TotExp_01
#LifeExp_01 <- Actual value ^ 4.6
#Actual value <- LifeExp_01 ^ (1/4.6)
#Actual value <- (-736527909 + 620060216 * TotExp_01) ^ (1/4.6)

#forecasts for TotExp^0.06 = 1.5 and 2.5
results <- c((-736527909 + (620060216 * 1.5))^(1/4.6), (-736527909 + (620060216 * 2.5))^(1/4.6))

Estimated life expectancy is \(63.31\) years when \(TotExp^{0.06}\) is 1.5.

Estimated life expectancy is \(86.51\) years when \(TotExp^{0.06}\) is 2.5.
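The same calculation can be wrapped in a small function so that the back-transformation step is explicit. This is a sketch using the printed coefficients; the function name `forecast_life_exp` is an illustrative choice.

```r
# Hypothetical helper: forecast life expectancy (years) from TotExp^0.06.
forecast_life_exp <- function(tot_exp_06) {
  # Linear predictor on the transformed scale (LifeExp^4.6).
  life_exp_46 <- -736527909 + 620060216 * tot_exp_06
  # Undo the 4.6 power to return to years.
  life_exp_46^(1 / 4.6)
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
print(round(forecasts, 2))
```

With the fitted object available, `predict(who_01.lm, newdata = data.frame(TotExp_01 = c(1.5, 2.5)))^(1/4.6)` should give the same forecasts. Note that the linear predictor turns negative for inputs below roughly 1.19, where the fractional power returns `NaN`.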

4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp

Answer:

#build new linear model
who_02.lm <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who.data)
who_02.lm
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who.data)
## 
## Coefficients:
##   (Intercept)         PropMD         TotExp  PropMD:TotExp  
##   62.77270326  1497.49395252     0.00007233    -0.00602569

Regression Model

\(Average~ Life~ Expectancy = 62.77270326 + 1497.49395252 \times PropMD + 0.00007233 \times TotExp -0.00602569 \times PropMD*TotExp\)
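Because of the interaction term, the effect of each predictor depends on the level of the other: the slope for TotExp is \(b_2 + b_3 \times PropMD\) and the slope for PropMD is \(b_1 + b_3 \times TotExp\). The sketch below evaluates these at the minimum, median, and maximum values from the data summary; the helper names are illustrative.

```r
# Coefficients copied from the printed model.
b_propmd <- 1497.49395252
b_totexp <- 0.00007233
b_inter  <- -0.00602569

# Conditional slopes implied by the interaction term.
slope_totexp <- function(prop_md) b_totexp + b_inter * prop_md
slope_propmd <- function(tot_exp) b_propmd + b_inter * tot_exp

# Evaluate at min, median, and max of the other predictor (from summary(who.data)).
print(slope_totexp(c(0.0000196, 0.0010474, 0.0351290)))
print(slope_propmd(c(13, 5541, 482750)))
```

Both conditional slopes change sign toward the upper end of the other variable, so the main-effect coefficients should not be read in isolation.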

summary(who_02.lm)
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who.data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                     Estimate     Std. Error t value Pr(>|t|)    
## (Intercept)     62.772703255    0.795605238  78.899  < 2e-16 ***
## PropMD        1497.493952519  278.816879652   5.371 2.32e-07 ***
## TotExp           0.000072333    0.000008982   8.053 9.39e-14 ***
## PropMD:TotExp   -0.006025686    0.001472357  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

Standard Error and p-Value

The slope for the predictor PropMD is \(1497.493952519\) and its standard error is \(278.816879652\). The t statistic is \(5.371\). The two-sided p-value is \(2.32 \times 10^{-7}\), far below the 0.01 significance level, which indicates that PropMD is a statistically significant predictor of average life expectancy.

The slope for the predictor TotExp is \(0.000072333\) and its standard error is \(0.000008982\). The t statistic is \(8.053\). The two-sided p-value is \(9.39 \times 10^{-14}\), far below the 0.01 significance level, which indicates that TotExp is a statistically significant predictor of average life expectancy.

The slope for the interaction term PropMD * TotExp is \(-0.006025686\) and its standard error is \(0.001472357\). The t statistic is \(-4.093\). The two-sided p-value is \(6.35 \times 10^{-5}\), far below the 0.01 significance level, which indicates that the interaction PropMD * TotExp is also statistically significant.

Residual standard error: 8.765 on 186 degrees of freedom. It indicates standard error of the regression.

plot(who_02.lm, which = 1)

The plot above shows the residuals against the fitted values of average life expectancy. The residual standard error (\(S\)) is \(8.765\), which tells us that the typical distance of the data points from the fitted values is about \(8.765\) years.

If the residuals are normally distributed, about 95% of the observations should fall within \(\pm 17.53\) years of the fitted values. In this case that does not hold.

R-Squared(\(R^2\))

R-squared statistic provides an overall measure of how well the model fits the data. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.

In this case \(R^2\) is \(0.3574\), which means the model explains about \(36\%\) of the variation in average life expectancy.

The adjusted \(R^2\) is \(0.3471\). It is smaller than \(R^2\) because it applies a penalty for the number of predictors in the model.

F-statistic

The F-statistic tests whether the model with all its predictors fits significantly better than an intercept-only model. This model has three predictor terms, and all three are highly significant, so the F-statistic is significant as well.

The F-statistic provides a formal hypothesis test of whether the overall relationship is statistically significant.

Since the p-value for the F-statistic is below \(2.2 \times 10^{-16}\), far less than the 0.01 significance level, we can conclude that \(R^2\) is significantly different from zero.
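The overall F-test can be reproduced from the printed statistic and its degrees of freedom. The sketch below computes the upper-tail p-value and, for comparison, the critical value at the 0.01 level. R prints p-values this small only as `< 2.2e-16`, so the exact number is of little practical interest.

```r
# F statistic and degrees of freedom from summary(who_02.lm).
f_value <- 34.49
df1 <- 3
df2 <- 186

# Upper-tail p-value of the overall F-test.
p_value <- pf(f_value, df1 = df1, df2 = df2, lower.tail = FALSE)

# Critical value at the 0.01 significance level.
f_crit <- qf(0.99, df1 = df1, df2 = df2)

print(c(p_value = p_value, f_crit = f_crit))
```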

Conclusion

5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

Answer:

TotExp = 14
PropMD = 0.03
ALE = 62.77270326 + (1497.49395252 * PropMD) + (0.00007233 * TotExp) - (0.00602569 * PropMD * TotExp)

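When PropMD = 0.03 and TotExp = 14, the forecast average life expectancy is \(107.7\) years. This value is very high: it is statistically possible under the model but not realistic in practice. It seems unrealistic because, among other reasons, the highest life expectancy in the data is 83 years, and the inputs pair a PropMD close to the observed maximum (0.0351) with a TotExp close to the observed minimum (13), a combination that is unlikely to be well represented in the data.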
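One simple plausibility check is to compare the forecast and its inputs against the ranges in the data summary. The sketch below re-uses the printed coefficients; the helper name `predict_ale` and the range check are illustrative, not part of the original analysis.

```r
# Hypothetical helper: forecast from the printed interaction model.
predict_ale <- function(prop_md, tot_exp) {
  62.77270326 + 1497.49395252 * prop_md + 0.00007233 * tot_exp -
    0.00602569 * prop_md * tot_exp
}

forecast <- predict_ale(prop_md = 0.03, tot_exp = 14)
print(round(forecast, 1))

# Ranges copied from summary(who.data).
max_life_exp <- 83
print(forecast > max_life_exp)        # forecast exceeds every observed value
print(0.03 / 0.0351290)               # PropMD as a share of the observed maximum
print(14 / 13)                        # TotExp relative to the observed minimum
```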

References