library(RCurl)
library(car)
library(knitr)
options(scipen=5)
set.seed(1973)
# download the file from GitHub
who.data <- read.csv(text = getURL("https://raw.githubusercontent.com/akulapa/Akula-DATA605-Week12-HW12/master/who.csv"), header = TRUE, stringsAsFactors = FALSE)
kable(who.data[sample(nrow(who.data), 20), ], align='l', caption = "Sample 20 rows", row.names=FALSE)
| Country | LifeExp | InfantSurvival | Under5Survival | TBFree | PropMD | PropRN | PersExp | GovtExp | TotExp |
|---|---|---|---|---|---|---|---|---|---|
| Grenada | 68 | 0.983 | 0.980 | 0.99992 | 0.0007547 | 0.0030755 | 342 | 6944 | 7286 |
| Denmark | 79 | 0.997 | 0.996 | 0.99993 | 0.0035519 | 0.0099582 | 4350 | 314588 | 318938 |
| Latvia | 71 | 0.992 | 0.991 | 0.99940 | 0.0031455 | 0.0056094 | 443 | 18224 | 18667 |
| Netherlands | 80 | 0.996 | 0.995 | 0.99994 | 0.0036949 | 0.0146024 | 3560 | 187191 | 190751 |
| Saudi Arabia | 70 | 0.979 | 0.974 | 0.99938 | 0.0014172 | 0.0030657 | 448 | 27621 | 28069 |
| Antigua and Barbuda | 73 | 0.990 | 0.989 | 0.99991 | 0.0001429 | 0.0027738 | 503 | 12543 | 13046 |
| Mongolia | 66 | 0.965 | 0.958 | 0.99809 | 0.0025843 | 0.0033881 | 35 | 1539 | 1574 |
| Afghanistan | 42 | 0.835 | 0.743 | 0.99769 | 0.0002288 | 0.0005723 | 20 | 92 | 112 |
| Mozambique | 50 | 0.904 | 0.862 | 0.99376 | 0.0000245 | 0.0002948 | 14 | 315 | 329 |
| Saint Lucia | 75 | 0.988 | 0.986 | 0.99978 | 0.0045951 | 0.0020307 | 323 | 5068 | 5391 |
| Kyrgyzstan | 66 | 0.964 | 0.959 | 0.99863 | 0.0024168 | 0.0058612 | 28 | 396 | 424 |
| Nauru | 61 | 0.975 | 0.970 | 0.99866 | 0.0010000 | 0.0063000 | 567 | 30200 | 30767 |
| United Republic of Tanzania | 50 | 0.926 | 0.882 | 0.99541 | 0.0000208 | 0.0003369 | 17 | 225 | 242 |
| Papua New Guinea | 62 | 0.946 | 0.927 | 0.99487 | 0.0000443 | 0.0004581 | 34 | 390 | 424 |
| Marshall Islands | 63 | 0.950 | 0.944 | 0.99759 | 0.0004138 | 0.0026207 | 294 | 18876 | 19170 |
| Andorra | 82 | 0.997 | 0.996 | 0.99983 | 0.0032973 | 0.0035000 | 2589 | 169725 | 172314 |
| El Salvador | 71 | 0.978 | 0.975 | 0.99936 | 0.0011739 | 0.0007547 | 177 | 5700 | 5877 |
| China | 73 | 0.980 | 0.976 | 0.99799 | 0.0014021 | 0.0009795 | 81 | 1302 | 1383 |
| Timor-Leste | 66 | 0.953 | 0.945 | 0.99211 | 0.0000709 | 0.0016113 | 45 | 1053 | 1098 |
| Nicaragua | 71 | 0.971 | 0.964 | 0.99926 | 0.0003697 | 0.0010597 | 75 | 2183 | 2258 |
The table shows 20 random rows from the dataset, which describes population health data for each country. To answer question 1 we will use the columns LifeExp and TotExp.
summary(who.data)
## Country LifeExp InfantSurvival Under5Survival
## Length:190 Min. :40.00 Min. :0.8350 Min. :0.7310
## Class :character 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Mode :character Median :70.00 Median :0.9785 Median :0.9745
## Mean :67.38 Mean :0.9624 Mean :0.9459
## 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Max. :83.00 Max. :0.9980 Max. :0.9970
## TBFree PropMD PropRN
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455
## Median :0.9992 Median :0.0010474 Median :0.0027584
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387
## PersExp GovtExp TotExp
## Min. : 3.00 Min. : 10.0 Min. : 13
## 1st Qu.: 36.25 1st Qu.: 559.5 1st Qu.: 584
## Median : 199.50 Median : 5385.0 Median : 5541
## Mean : 742.00 Mean : 40953.5 Mean : 41696
## 3rd Qu.: 515.25 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :6350.00 Max. :476420.0 Max. :482750
1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
Answer:
who.lm <- lm(LifeExp ~ TotExp, data = who.data)
plot(who.data$TotExp, who.data$LifeExp, xlab = 'Total Expenditure(in US Dollars)', ylab = 'Average Life Expectancy(in Years)', main='Avg. Life Expectancy Vs. Expenditure')
abline(who.lm, col="red")
The scatter plot shows that total expenditure stays low for countries whose average life expectancy is below about 70 years and rises sharply once life expectancy exceeds 70 years. The points curve as we move from left to right along the x-axis, so average life expectancy and total expenditure do not appear to be linearly related.
who.lm
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who.data)
##
## Coefficients:
## (Intercept) TotExp
## 64.75337453 0.00006297
Regression model \(Average~ Life~ Expectancy = 64.75337453 + 0.00006297 * Total~ Expenditure\)
The model suggests that each one-unit increase in total expenditure is associated with an increase of \(0.00006297\) years in average life expectancy.
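Because the slope is expressed per single unit of TotExp, it can be easier to read on a larger scale. A small sketch that uses only the coefficient printed above; the variable names are illustrative and not part of the original analysis:

```r
# Slope copied from the fitted model above
# (years of life expectancy per one unit of TotExp)
slope <- 0.00006297

# Illustrative increments in total expenditure (chosen for this example)
increase <- c(1, 1000, 10000)

# Implied change in average life expectancy, in years
gain_years <- slope * increase
data.frame(increase, gain_years)
```

Under this linear model, an increase of 1000 units in TotExp corresponds to roughly 0.06 years of additional life expectancy.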
summary(who.lm)
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = who.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 64.753374534 0.753536611 85.933 < 2e-16 ***
## TotExp 0.000062970 0.000007795 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
In this case the slope of the regression line is \(0.000062970\) and its standard error is \(0.000007795\). The t-test statistic is \(8.079\), and the p-value for the two-sided test is \(7.71 \times 10^{-14}\). Since the p-value is far below even the 0.01 level of significance, total expenditure is highly relevant for estimating average life expectancy.
Residual standard error (\(S\)): 9.371 on 188 degrees of freedom. This is the standard error of the regression.
The standard error of a coefficient measures the uncertainty associated with the point estimate and indicates how wide the confidence interval should be. If the sampling distribution is approximately normal, the point estimate falls within about two standard errors of the true parameter roughly 95% of the time.
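As an illustration of how the standard error turns into a confidence interval, the sketch below builds an approximate 95% interval for the TotExp slope from the estimate and standard error printed by `summary(who.lm)`. The variable names are hypothetical; with the fitted model in memory, `confint(who.lm)` should give the same interval.

```r
# Estimate and standard error of the TotExp slope, copied from summary(who.lm)
estimate  <- 0.000062970
std_error <- 0.000007795
df_resid  <- 188   # residual degrees of freedom reported in the summary

# Critical t value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

# Approximate 95% confidence interval for the slope
ci <- estimate + c(-1, 1) * t_crit * std_error
ci
```

The interval lies entirely above zero, which agrees with the very small p-value for the slope.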
plot(who.lm, which=c(1,1))
The residuals-versus-fitted plot above shows how far each observation falls from the fitted line (the dotted horizontal line at zero). The standard error of the regression (\(S\)) is 9.371, which tells us that the typical distance of the data points from the fitted line is about \(9.371\) years.
If the residuals are normally distributed, about 95% of the observations should fall within \(\pm 18.742\) (that is, \(2 \times 9.371\)) of the fitted line.
In this case the spread is not even on either side of the fitted line and there may be outliers influencing the model, which suggests that the underlying data are highly skewed and that the linearity and constant-variance assumptions do not hold.
hist(who.data$TotExp, main = 'Total Expenditure per Capita on Healthcare Across Countries', xlab = 'Expenditure', ylab = 'Number of Countries')
The histogram shows that expenditures are right-skewed.
R-squared statistic provides an overall measure of how well the model fits the data. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.
In this case \(R^2\) is \(0.2577\), which means the model explains about \(26\%\) of the variation in life expectancy.
Adjusted \(R^2\) is \(0.2537\). It is slightly smaller than \(R^2\) because it penalizes the fit for the number of predictors in the model.
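The relationship between \(R^2\) and adjusted \(R^2\) can be checked by hand. The sketch below uses the standard adjustment formula; `adj_r2` is a hypothetical helper, and the sample size of 190 countries is taken from the data summary above.

```r
# Hypothetical helper: adjusted R-squared from R-squared,
# sample size n, and number of predictors p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared 0.2577 with 190 countries and one predictor (188 residual df)
adj_r2(0.2577, n = 190, p = 1)
```

The result agrees with the reported adjusted value up to rounding, because the printed \(R^2\) is itself rounded to four decimals.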
The F-statistic tests whether the model as a whole fits the data better than an intercept-only model. Because this model has only one predictor, the F-statistic adds no information beyond the t-test on the slope: the p-value of the coefficient TotExp and that of the entire regression model are the same, \(7.71 \times 10^{-14}\).
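The two p-values coincide because, with a single predictor, the F-statistic is the square of the slope's t-statistic. A quick check using only numbers copied from `summary(who.lm)` (variable names are illustrative):

```r
# Values copied from summary(who.lm)
t_value <- 8.079
f_value <- 65.26

# In simple linear regression the F-statistic equals the squared t-statistic
# (the small difference here comes from rounding in the printed output)
t_value^2

# p-value of the F-statistic on 1 and 188 degrees of freedom
p_f <- pf(f_value, df1 = 1, df2 = 188, lower.tail = FALSE)

# p-value of the two-sided t-test on 188 degrees of freedom
p_t <- 2 * pt(-abs(t_value), df = 188)

c(p_f = p_f, p_t = p_t)
```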
If we build the hypotheses as
\(H_0:\) Total expenditure has no impact on average life expectancy.
\(H_A:\) Total expenditure has impact on average life expectancy.
The p-value for the slope suggests that total expenditure is highly relevant for estimating average life expectancy. Using the overall p-value we would reject the null hypothesis (\(H_0\)). Overall, one needs domain knowledge of the expenditures and currency of each country; the large expenditure figures are not easy to interpret on their own.
2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
Answer:
#raise total expenditure to the power of 0.06
who.data$TotExp_01 <- (who.data$TotExp)^0.06
#raise life expectancy to the power of 4.6
who.data$LifeExp_01 <- (who.data$LifeExp)^4.6
#build new linear model
who_01.lm <- lm(LifeExp_01 ~ TotExp_01, data = who.data)
plot(who.data$TotExp_01, who.data$LifeExp_01, xlab = 'Total Expenditure(in US Dollars) raised to 0.06', ylab = 'Average Life Expectancy(in Years) raised to 4.6', main='Avg. Life Expectancy Vs. Expenditure')
abline(who_01.lm, col="red")
The scatter plot shows that the transformed total expenditures and average life expectancy are spread evenly around the fitted line. Average life expectancy increases as total expenditure increases, and the curvature has disappeared. Looking at the plot, the transformed average life expectancy and total expenditure appear to be linearly related.
who_01.lm
##
## Call:
## lm(formula = LifeExp_01 ~ TotExp_01, data = who.data)
##
## Coefficients:
## (Intercept) TotExp_01
## -736527909 620060216
Regression model \(LifeExp_{01} = -736527909 + 620060216 * TotExp_{01}\)
The model suggests that for every one unit increase in TotExp_01, LifeExp_01 goes up by \(620060216\) units.
summary(who_01.lm)
##
## Call:
## lm(formula = LifeExp_01 ~ TotExp_01, data = who.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp_01 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
In this case the slope of the regression line is \(620060216\) and its standard error is \(27518940\). The t-test statistic is \(22.53\), and the p-value for the two-sided test is less than \(2 \times 10^{-16}\). Since the p-value is far below even the 0.01 level of significance, total expenditure is highly relevant for estimating average life expectancy.
Residual standard error (\(S\)): 90490000 on 188 degrees of freedom. This is the standard error of the regression.
plot(who_01.lm, which=c(1,1))
The residuals-versus-fitted plot above shows how far each observation falls from the fitted line (the dotted horizontal line at zero). The standard error of the regression (\(S\)) is 90490000, which tells us that the typical distance of the data points from the fitted line is about \(90490000\) in units of LifeExp^4.6.
If the residuals are normally distributed, about 95% of the observations should fall within \(\pm 180980000\) of the fitted line.
In this case the spread is even on either side of the fitted line. However, we can use histograms to see whether the distribution of the transformed data is normal.
hist(who.data$TotExp_01, main = 'Total Expenditure per Capita on Healthcare Across Countries', xlab = 'Expenditure Raised to 0.06', ylab = 'Number of Countries')
hist(who.data$LifeExp_01, main = 'Average Life Expectancy Across Countries', xlab = 'Average Life Expectancy Raised to 4.6', ylab = 'Number of Countries')
The expenditure histogram suggests some skewness on the left side. The average life expectancy histogram does not suggest skewness, although it does suggest the existence of outliers between \(0\) and \(100000000\).
R-squared statistic provides an overall measure of how well the model fits the data. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.
In this case \(R^2\) is \(0.7298\), which means the model explains about \(73\%\) of the variation in the transformed life expectancy.
Adjusted \(R^2\) is \(0.7283\). It is slightly smaller than \(R^2\) because it penalizes the fit for the number of predictors in the model.
The F-statistic tests whether the model as a whole fits the data better than an intercept-only model. Because this model has only one predictor, the F-statistic adds no information beyond the t-test on the slope: the p-value of the coefficient TotExp_01 and that of the entire regression model are both reported as less than about \(2 \times 10^{-16}\).
If we build the hypotheses as
\(H_0:\) Total expenditure has no impact on average life expectancy.
\(H_A:\) Total expenditure has impact on average life expectancy.
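The p-value for the slope suggests that total expenditure is highly relevant for estimating average life expectancy. Using the overall p-value we would reject the null hypothesis (\(H_0\)). Overall, one needs domain knowledge of the expenditures and currency of each country; the large transformed life expectancy figures are not easy to interpret on their own. Compared with the first model, \(R^2\) rises from \(0.2577\) to \(0.7298\) and the residuals are spread more evenly, so the transformed model appears to be the better of the two.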
3. Using the results from 2, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
Answer:
#using LifeExp_01 = -736527909 + 620060216 * TotExp_01
#LifeExp_01 <- Actual value ^ 4.6
#Actual value <- LifeExp_01 ^ (1/4.6)
#Actual value <- (-736527909 + 620060216 * TotExp_01) ^ (1/4.6)
#if TE = 1.5 and 2.5
results <- c((-736527909 + (620060216 * 1.5))^(1/4.6), (-736527909 + (620060216 * 2.5))^(1/4.6))
Estimated life expectancy is \(63.31\) years when \(TotExp^{0.06}\) is 1.5.
Estimated life expectancy is \(86.51\) years when \(TotExp^{0.06}\) is 2.5.
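The back-transformation can also be wrapped in a small function, which makes it easy to compare the requested inputs with the range of the data. This is a sketch only: `forecast_life_exp` is a hypothetical helper, the coefficients are copied from the fitted model, and the maximum TotExp of 482750 is taken from the data summary above.

```r
# Coefficients copied from the fitted model who_01.lm
b0 <- -736527909
b1 <- 620060216

# Hypothetical helper: predict LifeExp^4.6, then undo the 4.6 power
forecast_life_exp <- function(tot_exp_06) {
  (b0 + b1 * tot_exp_06)^(1 / 4.6)
}

# Forecasts for the two requested values of TotExp^0.06
forecasts <- forecast_life_exp(c(1.5, 2.5))
round(forecasts, 2)

# Largest observed TotExp (from summary(who.data)) on the transformed scale
max_tot_exp_06 <- 482750^0.06
max_tot_exp_06
```

The largest transformed expenditure in the data is roughly 2.19, so the forecast at 2.5 extrapolates beyond the observed range and should be read with more caution than the forecast at 1.5.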
4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model? LifeExp = b0 + b1 x PropMD + b2 x TotExp + b3 x PropMD x TotExp
Answer:
#build new linear model
who_02.lm <- lm(LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who.data)
who_02.lm
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who.data)
##
## Coefficients:
## (Intercept) PropMD TotExp PropMD:TotExp
## 62.77270326 1497.49395252 0.00007233 -0.00602569
\(Average~ Life~ Expectancy = 62.77270326 + 1497.49395252 \times PropMD + 0.00007233 \times TotExp - 0.00602569 \times PropMD \times TotExp\)
summary(who_02.lm)
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + PropMD * TotExp, data = who.data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.772703255 0.795605238 78.899 < 2e-16 ***
## PropMD 1497.493952519 278.816879652 5.371 2.32e-07 ***
## TotExp 0.000072333 0.000008982 8.053 9.39e-14 ***
## PropMD:TotExp -0.006025686 0.001472357 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
In this case the slope for the predictor PropMD is \(1497.493952519\) and its standard error is \(278.816879652\). The t-test statistic is \(5.371\), and the p-value for the two-sided test is \(2.32 \times 10^{-7}\). Since the p-value is far below even the 0.01 level of significance, PropMD is highly relevant for estimating average life expectancy.
The slope of the predictor TotExp is \(0.000072333\) and its standard error is \(0.000008982\). The t-test statistic is \(8.053\), and the p-value for the two-sided test is \(9.39 \times 10^{-14}\). Since the p-value is far below even the 0.01 level of significance, TotExp is highly relevant for estimating average life expectancy.
The slope of the interaction term PropMD * TotExp is \(-0.006025686\) and its standard error is \(0.001472357\). The t-test statistic is \(-4.093\), and the p-value for the two-sided test is \(6.35 \times 10^{-5}\). Since the p-value is far below even the 0.01 level of significance, PropMD * TotExp is highly relevant for estimating average life expectancy.
Residual standard error (\(S\)): 8.765 on 186 degrees of freedom. This is the standard error of the regression.
plot(who_02.lm, which=c(1,1))
The residuals-versus-fitted plot above shows how far each observation falls from the fitted values (the dotted horizontal line at zero). The standard error of the regression (\(S\)) is \(8.765\), which tells us that the typical distance of the data points from the fitted values is about \(8.765\) years.
If the residuals are normally distributed, about 95% of the observations should fall within \(\pm 17.53\) of the fitted values. In this case that does not hold.
R-squared statistic provides an overall measure of how well the model fits the data. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively.
In this case \(R^2\) is \(0.3574\), which means the model can explain \(36\%\) of data variation.
Adjusted \(R^2\) is \(0.3471\). It is smaller than \(R^2\) because it penalizes the fit for the number of predictors in the model.
The F-statistic tests whether the model as a whole fits the data better than an intercept-only model. This model has three terms, each of which is individually highly significant, and the overall F-statistic (\(34.49\) on 3 and 186 degrees of freedom) is significant as well.
The F-statistic gives a formal hypothesis test for this relationship: it determines whether the relationship between the predictors and the response is statistically significant.
Since the p-value for the F-statistic is less than \(2.2 \times 10^{-16}\), well below the 0.01 significance level, we can conclude that the R-squared value is significantly different from zero.
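The link between \(R^2\) and the overall F-statistic can be verified from the printed summary. The sketch below uses the standard identity relating the two; `f_from_r2` is a hypothetical helper, and the sample size of 190 countries comes from the data summary above.

```r
# Hypothetical helper: overall F-statistic from R-squared,
# sample size n, and number of predictors p
f_from_r2 <- function(r2, n, p) {
  (r2 / p) / ((1 - r2) / (n - p - 1))
}

# Interaction model: R-squared 0.3574, 190 countries, 3 terms (186 residual df)
f_value <- f_from_r2(0.3574, n = 190, p = 3)
f_value

# Upper-tail p-value on 3 and 186 degrees of freedom
p_value <- pf(f_value, df1 = 3, df2 = 186, lower.tail = FALSE)
p_value
```

The recomputed F-statistic matches the reported \(34.49\) up to rounding of the printed \(R^2\).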
PropMD and TotExp have positive coefficients, while the interaction term PropMD * TotExp has a negative coefficient. If we build the hypotheses as
\(H_0:\) PropMD, TotExp and PropMD * TotExp have no impact on average life expectancy. It means the model is no better than an intercept-only model.
\(H_A:\) PropMD, TotExp and PropMD * TotExp have an impact on average life expectancy. It means our model differs from an intercept-only model.
The p-value for the F-statistic suggests that PropMD, TotExp and PropMD * TotExp are jointly relevant for estimating average life expectancy. Since the overall p-value is less than \(2.2 \times 10^{-16}\), well below 0.01, we would reject the null hypothesis (\(H_0\)). However, \(R^2\) indicates that the model explains only about \(36\%\) of the variation in life expectancy.
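The negative interaction coefficient means that the estimated effect of TotExp on life expectancy depends on PropMD. A short sketch of that dependence, using coefficients copied from `summary(who_02.lm)` and PropMD quantiles from the data summary above (`effect_of_totexp` is a hypothetical helper):

```r
# Coefficients copied from summary(who_02.lm)
b_totexp <- 0.000072333
b_inter  <- -0.006025686

# Hypothetical helper: marginal effect of TotExp on LifeExp at a given PropMD
effect_of_totexp <- function(prop_md) {
  b_totexp + b_inter * prop_md
}

# PropMD at which the marginal effect of TotExp changes sign
threshold <- -b_totexp / b_inter
threshold

# Effect at the median (0.0010474) and maximum (0.0351290) PropMD
effects <- effect_of_totexp(c(0.0010474, 0.0351290))
effects
```

Under this model, additional expenditure is associated with higher life expectancy at the median PropMD, but the estimated effect turns negative once PropMD exceeds about 0.012, which covers only the upper end of the observed range.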
5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
Answer:
TotExp = 14
PropMD = 0.03
ALE = 62.77270326 + (1497.49395252 * PropMD) + (0.00007233 * TotExp) - (0.00602569 * PropMD * TotExp)
When PropMD = 0.03 and TotExp = 14, the forecast average life expectancy is \(107.7\) years. This value is very high: the model can produce it statistically, but it is not realistic in practice, since the highest life expectancy in the dataset is 83 years. This seems unreal because among many other conditions,