library(ggplot2)
library(dplyr)
library(knitr)
library(kableExtra)
#library(broom)
# Load the data into a dataframe
country_stats <- read.csv('who.csv')

Week 12, Regression Analysis in R part 2: 11-15 Apr

About the data

The attached who.csv dataset contains real-world data from 2008. The variables are listed below.

Variable         Description
Country          name of the country
LifeExp          average life expectancy for the country in years
InfantSurvival   proportion of those surviving to one year or more
Under5Survival   proportion of those surviving to five years or more
TBFree           proportion of the population without TB
PropMD           proportion of the population who are MDs
PropRN           proportion of the population who are RNs
PersExp          mean personal expenditures on healthcare in US dollars at average exchange rate
GovtExp          mean government expenditures per capita on healthcare, US dollars at average exchange rate
TotExp           sum of personal and government expenditures


Exercise 1

Provide a scatterplot of LifeExp ~ TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.

Generate a scatter plot

plot(x = country_stats$TotExp, y = country_stats$LifeExp)

The plot above does not show a linear relationship between TotExp and LifeExp.

Life expectancy varies widely (between about 40 and 75) for Total Expense values under 70,000, whereas it tends to remain above 75 for Total Expense values above 70,000.

Fit a simple linear regression model

ex1_lm <- lm(formula = LifeExp ~ TotExp, data = country_stats)
ex1_lm_summary <- summary(ex1_lm)

#[1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"     
#[7] "df"            "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled" 

ex1_fstat       <- ex1_lm_summary$fstatistic
ex1_rsquared    <- ex1_lm_summary$r.squared
ex1_res_std_err <- ex1_lm_summary$sigma
ex1_pVal        <- anova(ex1_lm)$'Pr(>F)'[1]

Show model’s quality measurement statistics

Item                      Value
F-Statistic               65.2641982 (on 1 and 188 degrees of freedom)
Multiple R-squared        0.2576922
Residual standard error   9.3710333
p-value                   \(7.7139931\times 10^{-14}\)

Evaluate model’s quality measurement statistics

Evaluate F-Statistic

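The F-statistic compares the fitted model with an intercept-only model, that is, it tests the null hypothesis that the slope is zero. With a single predictor this test is equivalent to the t-test on the slope (F is the square of the slope's t-statistic), so it adds nothing beyond the slope's p-value. Here F = 65.26 on 1 and 188 degrees of freedom, with a p-value of \(7.7139931\times 10^{-14}\), so the slope on TotExp differs significantly from zero.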
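The link between the F-statistic and the slope's t-statistic in a one-predictor model can be checked directly. The sketch below is only an illustration: it uses R's built-in mtcars data rather than who.csv, and the variable names are assumptions made for this example.

```r
# Illustration on the built-in mtcars data (not the WHO data).
fit <- lm(mpg ~ wt, data = mtcars)
fit_summary <- summary(fit)

# fstatistic is a named vector: "value", "numdf", "dendf".
f_stat  <- fit_summary$fstatistic
f_value <- unname(f_stat["value"])
t_slope <- coef(fit_summary)["wt", "t value"]

# With one predictor, F equals the squared t-statistic of the slope.
f_equals_t2 <- isTRUE(all.equal(f_value, t_slope^2))

# The overall F-test p-value therefore matches the slope's p-value.
p_overall <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                       lower.tail = FALSE))
p_slope   <- coef(fit_summary)["wt", "Pr(>|t|)"]
p_match   <- isTRUE(all.equal(p_overall, p_slope))

c(f_equals_t2 = f_equals_t2, p_match = p_match)
```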

Evaluate Multiple R-squared

The Multiple R-squared value is a number between 0 and 1 that measures how well the model describes the observed data. The reported \(R^2\) of 0.2576922 means that the model explains 25.77 percent of the variation in LifeExp, leaving roughly three quarters of it unexplained.
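To make the "share of variation explained" reading concrete, \(R^2\) can be recomputed from the residual and total sums of squares. This is a minimal sketch on the built-in mtcars data, which stands in for who.csv; the variable names are assumptions for the example.

```r
# Illustration on the built-in mtcars data (not the WHO data).
fit <- lm(mpg ~ wt, data = mtcars)

# Variation left unexplained by the model.
ss_res <- sum(residuals(fit)^2)

# Total variation of the response around its mean.
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)

# R-squared is the share of total variation that the model accounts for.
r2_manual <- 1 - ss_res / ss_tot

all.equal(r2_manual, summary(fit)$r.squared)
```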

Evaluate Residual standard error

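The residual standard error of 9.371 is the typical size of a residual, in years of life expectancy. If the residuals are normally distributed with mean zero, their first and third quartiles should fall at about -0.67 and +0.67 times the residual standard error, and the median should be close to zero.

Show the Q1 and Q3 for the model’s residuals

summary(ex1_lm_summary$residuals)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -24.764  -4.778   3.154   0.000   7.116  13.292

From the summary above we see that,

  • Q1 of -4.778 is -0.51 times the residual standard error (9.371).

  • Q3 of 7.116 is 0.76 times the residual standard error (9.371).

The quartiles are asymmetric, the median (3.154) sits well above the mean of zero, and the largest negative residual (-24.764) is almost twice the size of the largest positive one (13.292). The residuals are therefore skewed to the left and do not appear to be normally distributed.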
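As a reference point for the comparison above, the quartiles of a normal distribution lie about 0.674 standard deviations either side of the mean. The sketch below confirms this with qnorm and computes the same ratios for a fitted model. quartile_ratios is a hypothetical helper written for this example, and the built-in mtcars data stands in for who.csv.

```r
# Quartiles of a standard normal distribution: about -0.674 and +0.674.
normal_quartiles <- qnorm(c(0.25, 0.75))

# Hypothetical helper: residual quartiles as multiples of the
# residual standard error of a fitted lm model.
quartile_ratios <- function(model) {
  res <- residuals(model)
  unname(quantile(res, probs = c(0.25, 0.75))) / summary(model)$sigma
}

# Illustration on the built-in mtcars data (not the WHO data).
fit <- lm(mpg ~ wt, data = mtcars)
ratios <- quartile_ratios(fit)
ratios

# A normal Q-Q plot is a more direct visual check of the same assumption:
# points far from the reference line indicate non-normal residuals.
qqnorm(residuals(fit))
qqline(residuals(fit))
```

Ratios near -0.67 and +0.67 are consistent with normal residuals; strongly asymmetric ratios suggest skew.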

Evaluate p-value

The Pr(>|t|) column gives, for each coefficient, the probability of observing a t-statistic at least as extreme as the one obtained if the true coefficient were zero. This value is the p-value of the coefficient; a small p-value is evidence that the coefficient differs from zero.

In this model,

  • the p-value for “TotExp” is \(7.7139931\times 10^{-14}\), far below 0.05. This means that “TotExp” (sum of personal and government expenditures) is a statistically significant predictor of LifeExp.
  • the p-value for the intercept is \(7.8070708\times 10^{-153}\), also far below 0.05. This means that the intercept differs significantly from zero.
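The per-coefficient p-values discussed above live in the coefficient table of the model summary. The following sketch, on the built-in mtcars data rather than who.csv, pulls them out and reproduces the slope's p-value by hand as a two-sided tail probability of the t distribution. The variable names are assumptions for this example.

```r
# Illustration on the built-in mtcars data (not the WHO data).
fit <- lm(mpg ~ wt, data = mtcars)

# Matrix with columns "Estimate", "Std. Error", "t value", "Pr(>|t|)".
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|t|)"]
p_values

# Reproduce the slope's p-value: the probability of a t-statistic at least
# this extreme (in either direction) if the true slope were zero.
t_value  <- coef_table["wt", "t value"]
p_manual <- 2 * pt(abs(t_value), df = df.residual(fit), lower.tail = FALSE)

all.equal(p_manual, unname(p_values["wt"]))
```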


Exercise 2

Raise life expectancy to the 4.6 power (i.e., \(LifeExp^{4.6}\)). Raise total expenditures to the 0.06 power (nearly a log transform, \(TotExp^{.06}\)). Plot \(LifeExp^{4.6}\) as a function of \(TotExp^{.06}\), and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values. Which model is “better?”


Generate a scatter plot

plot(x = I(country_stats$TotExp ^ 0.06), y = I(country_stats$LifeExp ^ 4.6))

The plot above shows a roughly linear, positive relationship between \(TotExp^{0.06}\) and \(LifeExp^{4.6}\). Transformed life expectancy tends to increase as transformed Total Expense increases.

Fit a simple linear regression model

ex2_lm <- lm(formula = I(LifeExp ^ 4.6) ~ I(TotExp ^ 0.06), data = country_stats)
ex2_lm_summary <- summary(ex2_lm)

#[1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"     
#[7] "df"            "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled" 

ex2_fstat       <- ex2_lm_summary$fstatistic
ex2_rsquared    <- ex2_lm_summary$r.squared
ex2_res_std_err <- ex2_lm_summary$sigma
ex2_pVal        <- anova(ex2_lm)$'Pr(>F)'[1]

Show model’s quality measurement statistics

Item                      Value
F-Statistic               507.6967054 (on 1 and 188 degrees of freedom)
Multiple R-squared        0.7297673
Residual standard error   \(9.0492393\times 10^{7}\)
p-value                   \(2.6014284\times 10^{-55}\)

Evaluate model’s quality measurement statistics

Evaluate F-Statistic

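The F-statistic compares the fitted model with an intercept-only model, that is, it tests the null hypothesis that the slope is zero. With a single predictor this test is equivalent to the t-test on the slope (F is the square of the slope's t-statistic), so it adds nothing beyond the slope's p-value. Here F = 507.70 on 1 and 188 degrees of freedom, with a p-value of \(2.6014284\times 10^{-55}\), so the slope on \(TotExp^{0.06}\) differs significantly from zero.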

Evaluate Multiple R-squared

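The Multiple R-squared value is a number between 0 and 1 that measures how well the model describes the observed data. The reported \(R^2\) of 0.7297673 means that the model explains 72.98 percent of the variation in \(LifeExp^{4.6}\). This is a big improvement over the 25.77 percent reported in Exercise 1, so by this measure the transformed model is the better of the two, keeping in mind that the two \(R^2\) values describe different response variables (\(LifeExp\) and \(LifeExp^{4.6}\)).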
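Because the two models have different response variables, one way to compare them on equal footing is to back-transform the fitted values of the transformed model and measure the error on the original scale. The sketch below illustrates the idea on the built-in mtcars data, not on who.csv. The rmse helper and the exponents 0.5 and 0.1 are assumptions chosen only for this example.

```r
# Hypothetical helper: root mean squared error on the original scale.
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Illustration on the built-in mtcars data (not the WHO data).
fit_raw <- lm(mpg ~ hp, data = mtcars)
fit_pow <- lm(I(mpg^0.5) ~ I(hp^0.1), data = mtcars)

# fitted(fit_pow) is on the mpg^0.5 scale, so square it to return to mpg.
pred_raw <- fitted(fit_raw)
pred_pow <- fitted(fit_pow)^2

# Both errors are now in the same units and can be compared directly.
errors <- c(raw = rmse(mtcars$mpg, pred_raw),
            transformed = rmse(mtcars$mpg, pred_pow))
errors
```

For the WHO models the same idea would mean raising the Exercise 2 fitted values to the power 1/4.6 before comparing them with LifeExp.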


Evaluate Residual standard error

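The residual standard error of \(9.0492393\times 10^{7}\) is measured in units of \(LifeExp^{4.6}\), so it cannot be compared directly with the 9.371 years from Exercise 1. If the residuals are normally distributed with mean zero, their first and third quartiles should fall at about -0.67 and +0.67 times the residual standard error, and the median should be close to zero.

Show the Q1 and Q3 for the model’s residuals

summary(ex2_lm_summary$residuals)
#>       Min.    1st Qu.     Median       Mean    3rd Qu.       Max. 
#> -308616089  -53978977   13697187          0   59139231  211951764

From the summary above we see that,

  • Q1 of \(-5.3978977\times 10^{7}\) is -0.6 times the residual standard error (\(9.0492393\times 10^{7}\)).

  • Q3 of \(5.9139231\times 10^{7}\) is 0.65 times the residual standard error (\(9.0492393\times 10^{7}\)).

Both ratios are close to the ±0.67 expected under normality, so the quartiles give no strong evidence against normally distributed residuals. The median (13697187) is above the mean of zero and the largest negative residual (-308616089) is larger in size than the largest positive one (211951764), which points to a mild left skew, but the departure is smaller than in Exercise 1.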

Evaluate p-value

The Pr(>|t|) column gives, for each coefficient, the probability of observing a t-statistic at least as extreme as the one obtained if the true coefficient were zero. This value is the p-value of the coefficient; a small p-value is evidence that the coefficient differs from zero.

In this model,

  • the p-value for \(TotExp ^ {0.06}\) is \(2.6014284\times 10^{-55}\), far below 0.05. This means that \(TotExp ^ {0.06}\) (the transformed sum of personal and government expenditures) is a statistically significant predictor of \(LifeExp^{4.6}\).
  • the p-value for the intercept is \(3.9117741\times 10^{-36}\), also far below 0.05. This means that the intercept differs significantly from zero.

Exercise 3

Using the results from Exercise 2, forecast life expectancy when \(TotExp^{.06} = 1.5\). Then forecast life expectancy when \(TotExp^{.06} = 2.5\).

From Exercise 2, the y-intercept is \(a_0 = -736527910\) and the slope is \(a_1 = 620060216\). Thus, the final regression model is:

\(LifeExp^{4.6} = -736527910 + 620060216 \times TotExp^{0.06}\)

Because the model predicts \(LifeExp^{4.6}\), each result must be raised to the power \(1/4.6\) to obtain life expectancy in years:

  • If \(TotExp^{.06} = 1.5\), then \(LifeExp^{4.6} = -736527910 + 620060216 \times 1.5 = 1.9356241\times 10^{8}\), so \(LifeExp = (1.9356241\times 10^{8})^{1/4.6} \approx 63.31\) years.

  • If \(TotExp^{.06} = 2.5\), then \(LifeExp^{4.6} = -736527910 + 620060216 \times 2.5 = 8.1362263\times 10^{8}\), so \(LifeExp = (8.1362263\times 10^{8})^{1/4.6} \approx 86.51\) years.
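The forecast can be reproduced from the coefficients reported in Exercise 2. Since the model predicts \(LifeExp^{4.6}\), the sketch below raises each prediction to the power 1/4.6 to return to years. The variable names are assumptions for this example; with the fitted model available, predict(ex2_lm, ...) followed by the same back-transformation would be an alternative.

```r
# Coefficients as reported for the Exercise 2 model.
a0 <- -736527910   # intercept
a1 <- 620060216    # slope on TotExp^0.06

# Values of TotExp^0.06 to forecast for.
tot_exp_pow <- c(1.5, 2.5)

# Prediction on the transformed scale (LifeExp^4.6).
life_exp_pow <- a0 + a1 * tot_exp_pow

# Back-transform to life expectancy in years.
life_exp <- life_exp_pow^(1 / 4.6)
round(life_exp, 2)
#> [1] 63.31 86.51
```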


Exercise 4

Build the following multiple regression model and interpret the F Statistics, \(R^2\), standard error, and p-values. How good is the model?

\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)

Fit a multiple linear regression model

ex4_lm <- lm(formula = LifeExp ~ PropMD + TotExp + I(PropMD * TotExp), data = country_stats)
ex4_lm_summary <- summary(ex4_lm)

#[1] "call"          "terms"         "residuals"     "coefficients"  "aliased"       "sigma"     
#[7] "df"            "r.squared"     "adj.r.squared" "fstatistic"    "cov.unscaled" 

ex4_fstat       <- ex4_lm_summary$fstatistic
ex4_rsquared    <- ex4_lm_summary$r.squared
ex4_res_std_err <- ex4_lm_summary$sigma
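# Note: row 1 of anova() is the sequential F-test for PropMD (the first term),
# not the overall model test and not the test for TotExp
ex4_pVal        <- anova(ex4_lm)$'Pr(>F)'[1]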

Show model’s quality measurement statistics

Item                                 Value
F-Statistic                          34.4883268 (on 3 and 186 degrees of freedom)
Multiple R-squared                   0.3574352
Residual standard error              8.7654934
p-value (first ANOVA row, PropMD)    \(3.3134579\times 10^{-9}\)

Evaluate model’s quality measurement statistics

Evaluate F-Statistic

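The F-statistic compares the fitted model with an intercept-only model, that is, it tests the null hypothesis that all three slope coefficients (PropMD, TotExp and their interaction) are zero. Unlike in the one-predictor models of Exercises 1 and 2, this test is not equivalent to any single t-test. An F of 34.49 on 3 and 186 degrees of freedom lies far beyond the conventional critical values, so the null hypothesis is rejected: at least one of the predictors is associated with LifeExp.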
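The p-value that belongs to the overall F-statistic can be obtained from the F distribution with pf(). The sketch below shows this on the built-in mtcars data with two predictors and their interaction, mirroring the structure of the Exercise 4 model; the data set and variable names are assumptions for illustration only.

```r
# Illustration on the built-in mtcars data (not the WHO data).
fit <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Named vector: "value" (F), "numdf" and "dendf" (degrees of freedom).
f_stat <- summary(fit)$fstatistic

# Upper-tail probability of the F distribution: the overall model p-value.
p_overall <- unname(pf(f_stat["value"], f_stat["numdf"], f_stat["dendf"],
                       lower.tail = FALSE))
p_overall

# The same F-statistic arises from comparing against an intercept-only model.
null_fit   <- lm(mpg ~ 1, data = mtcars)
comparison <- anova(null_fit, fit)
all.equal(comparison$F[2], unname(f_stat["value"]))
```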

Evaluate Multiple R-squared

The Multiple R-squared value is a number between 0 and 1 that measures how well the model describes the observed data. The reported \(R^2\) of 0.3574352 means that the model explains 35.74 percent of the variation in LifeExp. This is better than the 25.77 percent of Exercise 1, but it still leaves most of the variation unexplained.

Evaluate Residual standard error

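The residual standard error of 8.765 is the typical size of a residual, in years of life expectancy. If the residuals are normally distributed with mean zero, their first and third quartiles should fall at about -0.67 and +0.67 times the residual standard error, and the median should be close to zero.

Show the Q1 and Q3 for the model’s residuals

summary(ex4_lm_summary$residuals)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#> -27.320  -4.132   2.098   0.000   6.540  13.074

From the summary above we see that,

  • Q1 of -4.132 is -0.47 times the residual standard error (8.765).

  • Q3 of 6.54 is 0.75 times the residual standard error (8.765).

The quartiles are asymmetric, the median (2.098) sits above the mean of zero, and the largest negative residual (-27.320) is about twice the size of the largest positive one (13.074). The residuals are therefore skewed to the left and do not appear to be normally distributed.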

Evaluate p-value

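The Pr(>|t|) column gives, for each coefficient, the probability of observing a t-statistic at least as extreme as the one obtained if the true coefficient were zero. This value is the p-value of the coefficient; a small p-value is evidence that the coefficient differs from zero.

In this model,

  • the p-value of \(3.3134579\times 10^{-9}\) reported above comes from the first row of anova(ex4_lm), so it is the sequential F-test for “PropMD” (the first term in the formula), not for “TotExp”. It is far below 0.05, which means that “PropMD” (proportion of the population who are MDs) is a statistically significant predictor of LifeExp.
  • the p-value for the intercept is \(6.2071874\times 10^{-145}\), also far below 0.05. This means that the intercept differs significantly from zero.

The Pr(>|t|) values for “TotExp” and for the interaction term are held in ex4_lm_summary$coefficients and are not shown in the table above.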
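In a model with several terms, the first row of anova() and the rows of the coefficient table answer different questions. The sketch below contrasts the two on the built-in mtcars data, which stands in for who.csv; the variable names are assumptions for this example.

```r
# Illustration on the built-in mtcars data (not the WHO data).
fit <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Sequential (Type I) ANOVA: one row per term, in formula order.
anova_table <- anova(fit)
rownames(anova_table)
#> [1] "wt"        "hp"        "wt:hp"     "Residuals"

# Element 1 of the Pr(>F) column is the test for the first term only.
p_first_term <- anova_table[["Pr(>F)"]][1]

# The coefficient table has one t-test p-value per coefficient.
coef_table <- coef(summary(fit))
p_coefs <- coef_table[, "Pr(>|t|)"]
p_coefs

# Only for the last term do the two approaches coincide (F equals t squared).
all.equal(anova_table[["F value"]][3], coef_table["wt:hp", "t value"]^2)
```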


Exercise 5

Forecast LifeExp when \(PropMD = .03\) and \(TotExp = 14\). Does this forecast seem realistic? Why or why not?

Based on the results from Exercise 4, the model

\(LifeExp = b_0 + b_1 \times PropMD + b_2 \times TotExp + b_3 \times PropMD \times TotExp\)

can be written with the estimated coefficients as

\(LifeExp = 62.77 + 1497 \times PropMD + 0.00007233 \times TotExp - 0.006026 \times PropMD \times TotExp\)

Using the above equation we can calculate Life Expectancy as:

  • If \(PropMD = .03\) and \(TotExp = 14\), then \(LifeExp = 62.77 + 1497 * .03 + 0.00007233 * 14 - 0.006026 * .03 * 14 = 107.68\)

The life expectancy forecast of 107.68 does not seem realistic because

  • the actual life expectancy values in the original source data range only from 40 to 83.
  • a TotExp (sum of personal and government expenditures) of 14 is very small, and it does not make sense that such low spending would be associated with a life expectancy above that of every country in the data.
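The forecast can be reproduced from the rounded coefficients given above, and splitting it into its terms shows where the implausible value comes from. The variable names in this sketch are assumptions for the example.

```r
# Rounded coefficients of the Exercise 4 model, as reported above.
b <- c(intercept = 62.77, PropMD = 1497,
       TotExp = 0.00007233, interaction = -0.006026)

prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast.
terms <- c(
  intercept   = b[["intercept"]],
  PropMD      = b[["PropMD"]] * prop_md,
  TotExp      = b[["TotExp"]] * tot_exp,
  interaction = b[["interaction"]] * prop_md * tot_exp
)
round(terms, 4)

# The forecast is the sum of the four terms.
forecast <- sum(terms)
round(forecast, 2)
#> [1] 107.68
```

The PropMD term alone adds 44.91 years to the intercept, while the two terms involving TotExp contribute almost nothing, so the forecast is driven almost entirely by the assumed share of MDs.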


