Please refer to the Assignment 12 Document.

1 Regression Problem

The given “who.csv” dataset contains real-world data from 2008. The variables included follow.

Variable Description
Country Name of the country
LifeExp Average life expectancy for the country in years
InfantSurvival Proportion of those surviving to one year or more
Under5Survival Proportion of those surviving to five years or more
TBFree Proportion of the population without TB
PropMD Proportion of the population who are MDs
PropRN Proportion of the population who are RNs
PersExp Mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp Mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp Sum of personal and government expenditures

Q1. Provide a scatterplot of LifeExp~TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, R^2, standard error,and p-values only. Discuss whether the assumptions of simple linear regression met.

Q2. Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06). Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, R^2, standard error, and p-values. Which model is “better?”

Q3. Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

Q4. Build the following multiple regression model and interpret the F Statistics, R^2, standard error, and p-values. How good is the model?

\[LifeExp = b0+b1 \times PropMd + b2 \times TotExp +b3 \times PropMD \times TotExp\]

Q5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

1.1 Import Data

The who.csv dataset contains 10 variables and 190 observations without NA values.

## Warning: package 'ggplot2' was built under R version 3.6.3
## Warning: package 'tibble' was built under R version 3.6.3
## Warning: package 'tidyr' was built under R version 3.6.3
## Warning: package 'readr' was built under R version 3.6.3
## Warning: package 'purrr' was built under R version 3.6.3
## Warning: package 'dplyr' was built under R version 3.6.3
## Warning: package 'lmSupport' was built under R version 3.6.3
## [1] 190  10
## tibble [190 x 10] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Country       : chr [1:190] "Afghanistan" "Albania" "Algeria" "Andorra" ...
##  $ LifeExp       : num [1:190] 42 71 71 82 41 73 75 69 82 80 ...
##  $ InfantSurvival: num [1:190] 0.835 0.985 0.967 0.997 0.846 0.99 0.986 0.979 0.995 0.996 ...
##  $ Under5Survival: num [1:190] 0.743 0.983 0.962 0.996 0.74 0.989 0.983 0.976 0.994 0.996 ...
##  $ TBFree        : num [1:190] 0.998 1 0.999 1 0.997 ...
##  $ PropMD        : num [1:190] 2.29e-04 1.14e-03 1.06e-03 3.30e-03 7.04e-05 ...
##  $ PropRN        : num [1:190] 0.000572 0.004614 0.002091 0.0035 0.001146 ...
##  $ PersExp       : num [1:190] 20 169 108 2589 36 ...
##  $ GovtExp       : num [1:190] 92 3128 5184 169725 1620 ...
##  $ TotExp        : num [1:190] 112 3297 5292 172314 1656 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Country = col_character(),
##   ..   LifeExp = col_double(),
##   ..   InfantSurvival = col_double(),
##   ..   Under5Survival = col_double(),
##   ..   TBFree = col_double(),
##   ..   PropMD = col_double(),
##   ..   PropRN = col_double(),
##   ..   PersExp = col_double(),
##   ..   GovtExp = col_double(),
##   ..   TotExp = col_double()
##   .. )
##    Country             LifeExp      InfantSurvival   Under5Survival  
##  Length:190         Min.   :40.00   Min.   :0.8350   Min.   :0.7310  
##  Class :character   1st Qu.:61.25   1st Qu.:0.9433   1st Qu.:0.9253  
##  Mode  :character   Median :70.00   Median :0.9785   Median :0.9745  
##                     Mean   :67.38   Mean   :0.9624   Mean   :0.9459  
##                     3rd Qu.:75.00   3rd Qu.:0.9910   3rd Qu.:0.9900  
##                     Max.   :83.00   Max.   :0.9980   Max.   :0.9970  
##      TBFree           PropMD              PropRN         
##  Min.   :0.9870   Min.   :0.0000196   Min.   :0.0000883  
##  1st Qu.:0.9969   1st Qu.:0.0002444   1st Qu.:0.0008455  
##  Median :0.9992   Median :0.0010474   Median :0.0027584  
##  Mean   :0.9980   Mean   :0.0017954   Mean   :0.0041336  
##  3rd Qu.:0.9998   3rd Qu.:0.0024584   3rd Qu.:0.0057164  
##  Max.   :1.0000   Max.   :0.0351290   Max.   :0.0708387  
##     PersExp           GovtExp             TotExp      
##  Min.   :   3.00   Min.   :    10.0   Min.   :    13  
##  1st Qu.:  36.25   1st Qu.:   559.5   1st Qu.:   584  
##  Median : 199.50   Median :  5385.0   Median :  5541  
##  Mean   : 742.00   Mean   : 40953.5   Mean   : 41696  
##  3rd Qu.: 515.25   3rd Qu.: 25680.2   3rd Qu.: 26331  
##  Max.   :6350.00   Max.   :476420.0   Max.   :482750

1.2 Q1

1.2.1 Build Model

Q: Plot a scatterplot of LifeExp~TotExp and run simple linear regression.

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.764  -4.778   3.154   7.116  13.292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.475e+01  7.535e-01  85.933  < 2e-16 ***
## TotExp      6.297e-05  7.795e-06   8.079 7.71e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.371 on 188 degrees of freedom
## Multiple R-squared:  0.2577, Adjusted R-squared:  0.2537 
## F-statistic: 65.26 on 1 and 188 DF,  p-value: 7.714e-14

1.2.2 Evaluate the Model

Q: Provide and interpret the F statistics, R^2, standard error,and p-values only.

The \(R^{2}\) is 0.2577, which is quite small and weak. The model only explains 25.77% variability of the response data around its mean, indicating a poor fit.

The standard error is 9.371 on 188 degrees of freedom, which is quite high. This means some sample data points are significantly off the regression line.

The p-value is 7.714e-14, nearly zero, which means we can confidently reject the null hypothesis and accept the alternative hypothesis. We can conclude that there is a strong relationship between LifeExp and TotExp.

As we rejected the null hypothesis, we consider the f-statistics. The F statistics is 65.26, which indicates our model significance.

1.2.3 Regression Assumptions

Q: Discuss whether the assumptions of simple linear regression met.

We need to study four conditions, linearity, normality of residuals, constant variability and independence of errors, of our linear regression model.

As 3 out of the 4 conditions were failed, the regression assumptions are not met.

1.2.3.1 Linearity

From the residual plot below, it is clear that the data is not linear. The residuals does not randomly disperse around y=0 but shows obvious pattern.

1.2.3.2 Normality of Residuals

We can study the normality of residuals through qqplot and histogram of residuals.

The QQ plot shows many outliers below the line at the upper end and the lower end. The data points barely follow the line.

The histogram is bimodal and seriously left-skewed.

Base on the two plots, we can conslude that the residuals are barely normal.

1.2.3.3 Variability

From the scatterplot below, the variation of the residuals is relatively large and range between -30 and 15. However, if we look at the statistics from the table, the heteroscedasticity assumption is acceptable.

Therefore, the constant variability condition is poorly met.

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df)
## 
## Coefficients:
## (Intercept)       TotExp  
##   6.475e+01    6.297e-05  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = Model) 
## 
##                        Value   p-value                   Decision
## Global Stat        56.737011 1.405e-11 Assumptions NOT satisfied!
## Skewness           30.532757 3.283e-08 Assumptions NOT satisfied!
## Kurtosis            0.002804 9.578e-01    Assumptions acceptable.
## Link Function      26.074703 3.285e-07 Assumptions NOT satisfied!
## Heteroscedasticity  0.126747 7.218e-01    Assumptions acceptable.

1.2.3.4 Independence of errors

From the graph below, the residuals are randomly dispersed. It shows that the deviation of errors are independent of the time of data collection, independent of the errors collected earlier, which means it is not a time series dataset. Therefore, this condition appears to be met.

1.3 Q2

1.3.1 Raise LifExp and TotExp

Q: Raise life expectancy to the 4.6 power (i.e., LifeExp^4.6). Raise total expenditures to the 0.06 power (nearly a log transform, TotExp^.06).

1.3.2 Build Model

Q: Plot LifeExp^4.6 as a function of TotExp^.06, and r re-run the simple regression model using the transformed variables.

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -308616089  -53978977   13697187   59139231  211951764 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -736527910   46817945  -15.73   <2e-16 ***
## TotExp       620060216   27518940   22.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90490000 on 188 degrees of freedom
## Multiple R-squared:  0.7298, Adjusted R-squared:  0.7283 
## F-statistic: 507.7 on 1 and 188 DF,  p-value: < 2.2e-16

1.3.3 Evaluate the Model

Q: Provide and interpret the F statistics, R^2, standard error, and p-values.

The \(R^{2}\) is 0.7298, which is quite strong. This model explains 72.98% variability of the response data around its mean, indicating a good fit.

The standard error is 90,490,000 on 188 degrees of freedom, which is very high. This huge increase might due to the exponential increase to the life expectancy variable.

The p-value is <2.2e-16, nearly zero, which means we can confidently reject the null hypothesis and accept the alternative hypothesis. We can conclude that there is a strong relationship between LifeExp and TotExp.

As we rejected the null hypothesis, we consider the f-statistics. The F statistics is 507.7, which indicates our model significance. It shows an even greater relationship between the two variables compared to our first model.

1.3.4 Regression Assumptions

We need to study four conditions, linearity, normality of residuals, constant variability and independence of errors, of our linear regression model.

All 4 conditions are met, the regression assumptions are all met.

1.3.4.1 Linearity

From the residual plot below, it shows that the data is somewhat linear. The residuals are quite randomly dispersed around y=0 with no obvious pattern.

1.3.4.2 Normality of Residuals

We can study the normality of residuals through qqplot and histogram of residuals.

The QQ plot shows only some outliers below the line at the upper end and the lower end. Most of the data points follow the line in the middle.

The histogram is unimodal and fairly normal.

Base on the two plots, we can conslude that the residuals are quite normal.

1.3.4.3 Variability

From the scatterplot below, the variation of the residuals is extremely large and range between -3e+08 and 3e+08.

However, if we study the heteroscedasticity, the assumptions are acceptable. Therefore, the constant variability condition is poorly met.

## 
## Call:
## lm(formula = LifeExp ~ TotExp, data = df2)
## 
## Coefficients:
## (Intercept)       TotExp  
##  -736527909    620060216  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = Model) 
## 
##                      Value   p-value                   Decision
## Global Stat        27.8117 1.362e-05 Assumptions NOT satisfied!
## Skewness           17.1372 3.478e-05 Assumptions NOT satisfied!
## Kurtosis            7.4581 6.315e-03 Assumptions NOT satisfied!
## Link Function       2.9866 8.396e-02    Assumptions acceptable.
## Heteroscedasticity  0.2299 6.316e-01    Assumptions acceptable.

1.3.4.4 Independence of errors

From the graph below, the residuals are randomly dispersed. It shows that the deviation of errors are independent of the time of data collection, independent of the errors collected earlier, which means it is not a time series dataset. Therefore, this condition appears to be met.

1.3.5 Which model is “better?”

Based on the statistics, the second model shows a greater relationship between the two variables LifExp and TotExp. Also, all 4 regression assumptions are met with the second model, therefore our second model is better than the first model.

1.4 Q3

Q: Using the results from 3, forecast life expectancy when TotExp^.06 =1.5. Then forecast life expectancy when TotExp^.06=2.5.

##        1        2 
## 63.31153 86.50645

The life expectancy when TotExp^.06=1.5 is 63.31153.

The life expectancy when TotExp^0.6=2.5 is 86.50645.

1.5 Q4

1.5.1 Build Model

Q: Build the following multiple regression model.

\[LifeExp = b0+b1 \times PropMd + b2 \times TotExp +b3 \times PropMD \times TotExp\]

## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -27.320  -4.132   2.098   6.540  13.074 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.277e+01  7.956e-01  78.899  < 2e-16 ***
## PropMD         1.497e+03  2.788e+02   5.371 2.32e-07 ***
## TotExp         7.233e-05  8.982e-06   8.053 9.39e-14 ***
## PropMD:TotExp -6.026e-03  1.472e-03  -4.093 6.35e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.765 on 186 degrees of freedom
## Multiple R-squared:  0.3574, Adjusted R-squared:  0.3471 
## F-statistic: 34.49 on 3 and 186 DF,  p-value: < 2.2e-16

1.5.2 Evaluate the Model

Q: Interpret the F Statistics, R^2, standard error, and p-values.

The \(R^{2}\) is 0.3574, which is quite small and weak. This model explains only 35.75% variability of the response data around its mean, indicating a poor fit.

The standard error is 8.765 on 186 degrees of freedom, which is quite high. This means some sample data points are significantly off the regression line.

The p-value is <2.2e-16, nearly zero, which means we can confidently reject the null hypothesis and accept the alternative hypothesis. We can conclude that there is a strong relationship between LifeExp and TotExp.

As we rejected the null hypothesis, we consider the f-statistics. The F statistics is 34.49, which indicates our model significance.

1.5.3 Regression Assumptions

We need to study four conditions, linearity, normality of residuals, constant variability and independence of errors, of our regression model.

All 4 conditions are met, the regression assumptions are all met.

1.5.3.1 Linearity

From the residual plot below, it is clear that the data is not linear. The residuals does not randomly disperse around y=0 but shows obvious pattern.

1.5.3.2 Normality of Residuals

We can study the normality of residuals through qqplot and histogram of residuals.

The QQ plot shows many outliers below the line at the upper end and the lower end. The data points barely follow the line.

The histogram is trimodal and seriously left-skewed.

Base on the two plots, we can conslude that the residuals are barely normal.

1.5.3.3 Variability

From the scatterplot below, the variation of the residuals is relatively large and range between -30 and 15. However, if we look at the statistics from the table, the heteroscedasticity assumption is acceptable.

Therefore, the constant variability condition is poorly met.

## Descriptive Statistics for Studentized Residuals
## 
## Call:
## lm(formula = LifeExp ~ PropMD + TotExp + (PropMD * TotExp), data = df)
## 
## Coefficients:
##   (Intercept)         PropMD         TotExp  PropMD:TotExp  
##     6.277e+01      1.497e+03      7.233e-05     -6.026e-03  
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = Model) 
## 
##                      Value   p-value                   Decision
## Global Stat        87.0703 0.000e+00 Assumptions NOT satisfied!
## Skewness           33.4219 7.418e-09 Assumptions NOT satisfied!
## Kurtosis            0.5600 4.543e-01    Assumptions acceptable.
## Link Function      52.7284 3.830e-13 Assumptions NOT satisfied!
## Heteroscedasticity  0.3599 5.486e-01    Assumptions acceptable.

1.5.3.4 Independence of errors

From the graph below, the residuals are randomly dispersed. It shows that the deviation of errors are independent of the time of data collection, independent of the errors collected earlier, which means it is not a time series dataset. Therefore, this condition appears to be met.

1.5.3.5 How good is the model?

This model shows a similar statistics with our first model and failed the regression assumptions. Therefore it is not a good model.

1.6 Q5

Q5. Forecast LifeExp when PropMD=.03 and TotExp = 14. Does this forecast seem realistic? Why or why not?

##       1 
## 107.696

The forecast LifeExp is 107.696 years, which does not seem realistic.

The model is built with PropMD=0.03 and TotExp=14, which is very unrealistic by increasing the proportion of MDs to 3% but having a extremely low total expenditures at 14. If we look at the original dataset, there are only two countries, Cyprus and San Marino, reach 3% of PropMD. But their total expenditures are 40k+ and 281k+ respectively.

Based on the unrealistic condition of PropMD and TotExp, it seems unrealistic to have a long life expectancy of 107.696.