Data 605 HW11: Regression Analysis in R 2
Please refer to the Assignment 12 Document.
1 Regression Problem
The given “who.csv” dataset contains real-world data from 2008. The variables included follow.
| Variable | Description |
|---|---|
| Country | Name of the country |
| LifeExp | Average life expectancy for the country in years |
| InfantSurvival | Proportion of those surviving to one year or more |
| Under5Survival | Proportion of those surviving to five years or more |
| TBFree | Proportion of the population without TB |
| PropMD | Proportion of the population who are MDs |
| PropRN | Proportion of the population who are RNs |
| PersExp | Mean personal expenditures on healthcare in US dollars at average exchange rate |
| GovtExp | Mean government expenditures per capita on healthcare, US dollars at average exchange rate |
| TotExp | Sum of personal and government expenditures |
Q1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.
Q2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”
Q3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
Q4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?
\[LifeExp = b0+b1 \times PropMd + b2 \times TotExp +b3 \times PropMD \times TotExp\]
Q5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
1.1 Import Data
The who.csv dataset contains 10 variables and 190 observations without NA values.
## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'readr' was built under R version 3.6.3
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'lmSupport' was built under R version 3.6.3
download.file("https://raw.githubusercontent.com/shirley-wong/Data-605/master/Assignment12/who.csv", destfile = "who.csv")
df <- read_csv("who.csv")
df## [1] 190 10
## tibble [190 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Country : chr [1:190] "Afghanistan" "Albania" "Algeria" "Andorra" ...
## $ LifeExp : num [1:190] 42 71 71 82 41 73 75 69 82 80 ...
## $ InfantSurvival: num [1:190] 0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
## $ Under5Survival: num [1:190] 0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
## $ TBFree : num [1:190] 0.998 1 0.999 1 0.997 ...
## $ PropMD : num [1:190] 2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
## $ PropRN : num [1:190] 0.000572 0.004614 0.002091 0.0035 0.001146 ...
## $ PersExp : num [1:190] 20 169 108 2589 36 ...
## $ GovtExp : num [1:190] 92 3128 5184 169725 1620 ...
## $ TotExp : num [1:190] 112 3297 5292 172314 1656 ...
## - attr(*, "spec")=
## .. cols(
## .. Country = col_character(),
## .. LifeExp = col_double(),
## .. InfantSurvival = col_double(),
## .. Under5Survival = col_double(),
## .. TBFree = col_double(),
## .. PropMD = col_double(),
## .. PropRN = col_double(),
## .. PersExp = col_double(),
## .. GovtExp = col_double(),
## .. TotExp = col_double()
## .. )
## Country LifeExp InfantSurvival Under5Survival
## Length:190 Min. :40.00 Min. :0.8350 Min. :0.7310
## Class :character 1st Qu.:61.25 1st Qu.:0.9433 1st Qu.:0.9253
## Mode :character Median :70.00 Median :0.9785 Median :0.9745
## Mean :67.38 Mean :0.9624 Mean :0.9459
## 3rd Qu.:75.00 3rd Qu.:0.9910 3rd Qu.:0.9900
## Max. :83.00 Max. :0.9980 Max. :0.9970
## TBFree PropMD PropRN
## Min. :0.9870 Min. :0.0000196 Min. :0.0000883
## 1st Qu.:0.9969 1st Qu.:0.0002444 1st Qu.:0.0008455
## Median :0.9992 Median :0.0010474 Median :0.0027584
## Mean :0.9980 Mean :0.0017954 Mean :0.0041336
## 3rd Qu.:0.9998 3rd Qu.:0.0024584 3rd Qu.:0.0057164
## Max. :1.0000 Max. :0.0351290 Max. :0.0708387
## PersExp GovtExp TotExp
## Min. : 3.00 Min. : 10.0 Min. : 13
## 1st Qu.: 36.25 1st Qu.: 559.5 1st Qu.: 584
## Median : 199.50 Median : 5385.0 Median : 5541
## Mean : 742.00 Mean : 40953.5 Mean : 41696
## 3rd Qu.: 515.25 3rd Qu.: 25680.2 3rd Qu.: 26331
## Max. :6350.00 Max. :476420.0 Max. :482750
1.2 Q1
1.2.1 Build Model
Q: Plot a scatterplot of LifeExp~TotExp and run simple linear regression.
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24.764 -4.778 3.154 7.116 13.292
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.475e+01 7.535e-01 85.933 < 2e-16 ***
## TotExp 6.297e-05 7.795e-06 8.079 7.71e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared: 0.2577, Adjusted R-squared: 0.2537
## F-statistic: 65.26 on 1 and 188 DF, p-value: 7.714e-14
plot(LifeExp~TotExp, data=df, main = "Total Expenditures vs Average Life Expectancy", xlab = "Sum of personal and government expenditures (dollars)", ylab = "Average life expectancy for the country (years)")
abline(reg=df_lm, col="blue")1.2.2 Evaluate the Model
Q: Provide and interpret the F statistics, R^2, standard error,and p-values only.
The \(R^{2}\) is 0.2577, which is quite small and weak. The model only explains 25.77% variability of the response data around its mean, indicating a poor fit.
The standard error is 9.371 on 188 degrees of freedom, which is quite high. This means some sample data points are significantly off the regression line.
The p-value is 7.714e-14, nearly zero, which means we can confidently reject the null hypothesis and accept the alternative hypothesis. We can conclude that there is a strong relationship between LifeExp and TotExp.
As we rejected the null hypothesis, we consider the f-statistics. The F statistics is 65.26, which indicates our model significance.
1.2.3 Regression Assumptions
Q: Discuss whether the assumptions of simple linear regression met.
We need to study four conditions, linearity, normality of residuals, constant variability and independence of errors, of our linear regression model.
As 3 out of the 4 conditions were failed, the regression assumptions are not met.
1.2.3.1 Linearity
From the residual plot below, it is clear that the data is not linear. The residuals does not randomly disperse around y=0 but shows obvious pattern.
plot(fitted(df_lm), resid(df_lm),
main = "Total Expenditures vs Average Life Expectancy",
xlab = "Fitted Life Expectancy (years)", ylab = "Residuals")
abline(h=0, col="blue")1.2.3.2 Normality of Residuals
We can study the normality of residuals through qqplot and histogram of residuals.
The QQ plot shows many outliers below the line at the upper end and the lower end. The data points barely follow the line.
The histogram is bimodal and seriously left-skewed.
Base on the two plots, we can conslude that the residuals are barely normal.
hist(resid(df_lm), breaks=15, prob=TRUE)
curve(dnorm(x, mean=mean(resid(df_lm)), sd=sd(resid(df_lm))), col="red", add=TRUE)1.2.3.3 Variability
From the scatterplot below, the variation of the residuals is relatively large and range between -30 and 15. However, if we look at the statistics from the table, the heteroscedasticity assumption is acceptable.
Therefore, the constant variability condition is poorly met.
plot(fitted(df_lm), resid(df_lm),
main = "Total Expenditures vs Average Life Expectancy",
xlab = "Fitted Life Expectancy (years)", ylab = "Residuals")
abline(h=0, col="blue")##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
##
## Coefficients:
## (Intercept) TotExp
## 6.475e+01 6.297e-05
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = Model)
##
## Value p-value Decision
## Global Stat 56.737011 1.405e-11 Assumptions NOT satisfied!
## Skewness 30.532757 3.283e-08 Assumptions NOT satisfied!
## Kurtosis 0.002804 9.578e-01 Assumptions acceptable.
## Link Function 26.074703 3.285e-07 Assumptions NOT satisfied!
## Heteroscedasticity 0.126747 7.218e-01 Assumptions acceptable.
1.2.3.4 Independence of errors
From the graph below, the residuals are randomly dispersed. It shows that the deviation of errors are independent of the time of data collection, independent of the errors collected earlier, which means it is not a time series dataset. Therefore, this condition appears to be met.
1.3 Q2
1.3.1 Raise LifExp and TotExp
Q: Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06).
1.3.2 Build Model
Q: Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables.
##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -308616089 -53978977 13697187 59139231 211951764
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -736527910 46817945 -15.73 <2e-16 ***
## TotExp 620060216 27518940 22.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared: 0.7298, Adjusted R-squared: 0.7283
## F-statistic: 507.7 on 1 and 188 DF, p-value: < 2.2e-16
plot(LifeExp~TotExp, data=df2, main = "Total Expenditure^.06 vs Average Life Expectancy^4.6", xlab = "Sum of personal and government expenditures^.06 (dollars)", ylab = "Average life expectancy^4.6 for the country (years)")
abline(reg=df2_lm, col="blue")1.3.3 Evaluate the Model
Q: Provide and interpret the F statistics, R^2, standard error, and p-values.
The \(R^{2}\) is 0.7298, which is quite strong. This model explains 72.98% variability of the response data around its mean, indicating a good fit.
The standard error is 90,490,000 on 188 degrees of freedom, which is very high. This huge increase might due to the exponential increase to the life expectancy variable.
The p-value is <2.2e-16, nearly zero, which means we can confidently reject the null hypothesis and accept the alternative hypothesis. We can conclude that there is a strong relationship between LifeExp and TotExp.
As we rejected the null hypothesis, we consider the f-statistics. The F statistics is 507.7, which indicates our model significance. It shows an even greater relationship between the two variables compared to our first model.
1.3.4 Regression Assumptions
We need to study four conditions, linearity, normality of residuals, constant variability and independence of errors, of our linear regression model.
All 4 conditions are met, the regression assumptions are all met.
1.3.4.1 Linearity
From the residual plot below, it shows that the data is somewhat linear. The residuals are quite randomly dispersed around y=0 with no obvious pattern.
plot(fitted(df2_lm), resid(df2_lm),
main = "Total Expenditures^.06 vs Average Life Expectancy^4.6",
xlab = "Fitted Life Expectancy (years)", ylab = "Residuals")
abline(h=0, col="blue")1.3.4.2 Normality of Residuals
We can study the normality of residuals through qqplot and histogram of residuals.
The QQ plot shows only some outliers below the line at the upper end and the lower end. Most of the data points follow the line in the middle.
The histogram is unimodal and fairly normal.
Base on the two plots, we can conslude that the residuals are quite normal.
hist(resid(df2_lm), breaks=15, xlim=c(-400000000, 400000000), prob=TRUE)
curve(dnorm(x, mean=mean(resid(df2_lm)), sd=sd(resid(df2_lm))), col="red", add=TRUE)1.3.4.3 Variability
From the scatterplot below, the variation of the residuals is extremely large and range between -3e+08 and 3e+08.
However, if we study the heteroscedasticity, the assumptions are acceptable. Therefore, the constant variability condition is poorly met.
plot(fitted(df2_lm), resid(df2_lm),
main = "Total Expenditures^.06 vs Average Life Expectancy^4.6",
xlab = "Fitted Life Expectancy (years)", ylab = "Residuals")
abline(h=0, col="blue")##
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
##
## Coefficients:
## (Intercept) TotExp
## -736527909 620060216
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = Model)
##
## Value p-value Decision
## Global Stat 27.8117 1.362e-05 Assumptions NOT satisfied!
## Skewness 17.1372 3.478e-05 Assumptions NOT satisfied!
## Kurtosis 7.4581 6.315e-03 Assumptions NOT satisfied!
## Link Function 2.9866 8.396e-02 Assumptions acceptable.
## Heteroscedasticity 0.2299 6.316e-01 Assumptions acceptable.
1.3.4.4 Independence of errors
From the graph below, the residuals are randomly dispersed. It shows that the deviation of errors are independent of the time of data collection, independent of the errors collected earlier, which means it is not a time series dataset. Therefore, this condition appears to be met.
1.3.5 Which model is “better?”
Based on the statistics, the second model shows a greater relationship between the two variables LifExp and TotExp. Also, all 4 regression assumptions are met with the second model, therefore our second model is better than the first model.
1.4 Q3
Q: Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.
## 1 2
## 63.31153 86.50645
The life expectancy when TotExp^.06=1.5 is 63.31153.
The life expectancy when TotExp^0.6=2.5 is 86.50645.
1.5 Q4
1.5.1 Build Model
Q: Build the following multiple regression model.
\[LifeExp = b0+b1 \times PropMd + b2 \times TotExp +b3 \times PropMD \times TotExp\]
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.320 -4.132 2.098 6.540 13.074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.277e+01 7.956e-01 78.899 < 2e-16 ***
## PropMD 1.497e+03 2.788e+02 5.371 2.32e-07 ***
## TotExp 7.233e-05 8.982e-06 8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03 1.472e-03 -4.093 6.35e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared: 0.3574, Adjusted R-squared: 0.3471
## F-statistic: 34.49 on 3 and 186 DF, p-value: < 2.2e-16
1.5.2 Evaluate the Model
Q: Interpret the F Statistics, R^2, standard error, and p-values.
The \(R^{2}\) is 0.3574, which is quite small and weak. This model explains only 35.75% variability of the response data around its mean, indicating a poor fit.
The standard error is 8.765 on 186 degrees of freedom, which is quite high. This means some sample data points are significantly off the regression line.
The p-value is <2.2e-16, nearly zero, which means we can confidently reject the null hypothesis and accept the alternative hypothesis. We can conclude that there is a strong relationship between LifeExp and TotExp.
As we rejected the null hypothesis, we consider the f-statistics. The F statistics is 34.49, which indicates our model significance.
1.5.3 Regression Assumptions
We need to study four conditions, linearity, normality of residuals, constant variability and independence of errors, of our regression model.
All 4 conditions are met, the regression assumptions are all met.
1.5.3.1 Linearity
From the residual plot below, it is clear that the data is not linear. The residuals does not randomly disperse around y=0 but shows obvious pattern.
plot(fitted(df3_lm), resid(df3_lm),
main = "Total Expenditures and Proportion of MDs vs Average Life Expectancy",
xlab = "Fitted Life Expectancy (years)", ylab = "Residuals")
abline(h=0, col="blue")1.5.3.2 Normality of Residuals
We can study the normality of residuals through qqplot and histogram of residuals.
The QQ plot shows many outliers below the line at the upper end and the lower end. The data points barely follow the line.
The histogram is trimodal and seriously left-skewed.
Base on the two plots, we can conslude that the residuals are barely normal.
hist(resid(df3_lm), breaks=15, prob=TRUE)
curve(dnorm(x, mean=mean(resid(df3_lm)), sd=sd(resid(df3_lm))), col="red", add=TRUE)1.5.3.3 Variability
From the scatterplot below, the variation of the residuals is relatively large and range between -30 and 15. However, if we look at the statistics from the table, the heteroscedasticity assumption is acceptable.
Therefore, the constant variability condition is poorly met.
plot(fitted(df3_lm), resid(df3_lm),
main = "Total Expenditures and Proportion of MDs vs Average Life Expectancy",
xlab = "Fitted Life Expectancy (years)", ylab = "Residuals")
abline(h=0, col="blue")## Descriptive Statistics for Studentized Residuals
##
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = df)
##
## Coefficients:
## (Intercept) PropMD TotExp PropMD:TotExp
## 6.277e+01 1.497e+03 7.233e-05 -6.026e-03
##
##
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance = 0.05
##
## Call:
## gvlma(x = Model)
##
## Value p-value Decision
## Global Stat 87.0703 0.000e+00 Assumptions NOT satisfied!
## Skewness 33.4219 7.418e-09 Assumptions NOT satisfied!
## Kurtosis 0.5600 4.543e-01 Assumptions acceptable.
## Link Function 52.7284 3.830e-13 Assumptions NOT satisfied!
## Heteroscedasticity 0.3599 5.486e-01 Assumptions acceptable.
1.5.3.4 Independence of errors
From the graph below, the residuals are randomly dispersed. It shows that the deviation of errors are independent of the time of data collection, independent of the errors collected earlier, which means it is not a time series dataset. Therefore, this condition appears to be met.
1.5.3.5 How good is the model?
This model shows a similar statistics with our first model and failed the regression assumptions. Therefore it is not a good model.
1.6 Q5
Q5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?
## 1
## 107.696
The forecast LifeExp is 107.696 years, which does not seem realistic.
The model is built with PropMD=0.03 and TotExp=14, which is very unrealistic by increasing the proportion of MDs to 3% but having a extremely low total expenditures at 14. If we look at the original dataset, there are only two countries, Cyprus and San Marino, reach 3% of PropMD. But their total expenditures are 40k+ and 281k+ respectively.
Based on the unrealistic condition of PropMD and TotExp, it seems unrealistic to have a long life expectancy of 107.696.