library(ggplot2)
library(dplyr)
library(knitr)
library(kableExtra)
#library(broom)
# Load the data into a dataframe
country_stats <- read.csv('who.csv')
Week 12, Regression Analysis in R part 2: 11-15 Apr
About the data
The attached who.csv dataset contains real-world data from 2008. The variables are listed below.
| Variable | Description |
|---|---|
| Country | name of the country |
| LifeExp | average life expectancy for the country in years |
| InfantSurvival | proportion of those surviving to one year or more |
| Under5Survival | proportion of those surviving to five years or more |
| TBFree | proportion of the population without TB |
| PropMD | proportion of the population who are MDs |
| PropRN | proportion of the population who are RNs |
| PersExp | mean personal expenditures on healthcare in US dollars at average exchange rate |
| GovtExp | mean government expenditures per capita on healthcare, US dollars at average exchange rate |
| TotExp | sum of personal and government expenditures |
Exercise 1
Provide a scatterplot of LifeExp ~ TotExp, and run simple linear regression. Do not transform the variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values only. Discuss whether the assumptions of simple linear regression are met.
Generate a scatter plot
plot(x = country_stats$TotExp, y = country_stats$LifeExp)
The plot above does not show a roughly linear relationship between TotExp and LifeExp.
This figure shows that life expectancy fluctuates widely (large spread) between 40 and 75 for Total Expense values under 70,000. However, life expectancy tends to remain above 75 for Total Expense values above 70,000.
Fit a simple linear regression model
ex1_lm <- lm(formula = LifeExp ~ TotExp, data = country_stats)
ex1_lm_summary <- summary(ex1_lm)
#[1] "call" "terms" "residuals" "coefficients" "aliased" "sigma"
#[7] "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled"
ex1_fstat <- ex1_lm_summary$fstatistic
ex1_rsquared <- ex1_lm_summary$r.squared
ex1_res_std_err <- ex1_lm_summary$sigma
ex1_pVal <- anova(ex1_lm)$'Pr(>F)'[1]
Show model’s quality measurement statistics
| Statistic | Value |
|---|---|
| F-Statistic | 65.2641982 on 1 and 188 DF |
| Multiple R-squared | 0.2576922 |
| Residual standard error | 9.3710333 |
| p-value | \(7.7139931\times 10^{-14}\) |
Evaluate model’s quality measurement statistics
Evaluate F-Statistic
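The F-statistic tests the null hypothesis that every slope coefficient is zero, that is, it compares the fitted model with an intercept-only model. With a single predictor this is equivalent to the t-test on the TotExp slope (\(F = t^2\)), so it adds no information beyond that coefficient's p-value. Here F = 65.26 on 1 and 188 degrees of freedom, with a p-value of \(7.7139931\times 10^{-14}\), so the null hypothesis is rejected.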
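As a rough illustration of how the F-test and the slope's t-test coincide in a one-predictor model, the sketch below recomputes the p-value from the F-statistic reported in the table above (65.2641982 on 1 and 188 degrees of freedom). It uses only those reported values; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the model summary table above
f_value <- 65.2641982
df_num  <- 1
df_den  <- 188

# Upper-tail probability of the F distribution: the overall model p-value
p_from_f <- pf(f_value, df_num, df_den, lower.tail = FALSE)

# With one predictor, F = t^2, so the two-sided t-test gives the same p-value
t_value  <- sqrt(f_value)
p_from_t <- 2 * pt(-t_value, df_den)

print(c(p_from_f = p_from_f, p_from_t = p_from_t))
```

Both values should agree with the p-value shown in the table, which is why the F-test carries no extra information when there is only one predictor.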
Evaluate Multiple R-squared
The Multiple R-squared value is a number between 0 and 1. It is a statistical measure of how well the model describes the measured data. The reported \(R^2\) of 0.2576922 for this model means that the model explains 25.77 percent of the data’s variation.
Evaluate Residual standard error
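If the residuals are normally distributed around zero, their first and third quartiles should be roughly symmetric and lie near \(\pm 0.67\) times the residual standard error. For this model that is about \(\pm 6.32\).
Show the Q1 and Q3 for the model’s residuals
summary(ex1_lm_summary$residuals)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -24.764 -4.778 3.154 0.000 7.116 13.292
From the summary above we see that the first quartile (-4.778) and third quartile (7.116) are not symmetric about zero, the median (3.154) sits well above the mean of 0, and the minimum (-24.764) is almost twice as large in magnitude as the maximum (13.292).
Therefore, the residuals are left-skewed and do not seem to be distributed normally.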
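One way to sanity-check this comparison is to compute the quartiles that a normal distribution with the reported residual standard error would have and set them next to the observed quartiles. The sketch below does that using only numbers printed above; the variable names are illustrative.

```r
# Residual standard error reported for the Exercise 1 model
res_std_err <- 9.3710333

# Quartiles of a normal distribution with mean 0 and this standard deviation
expected_quartiles <- qnorm(c(0.25, 0.75), mean = 0, sd = res_std_err)

# Quartiles observed in the residual summary above
observed_quartiles <- c(-4.778, 7.116)

comparison <- data.frame(
  quartile = c("Q1", "Q3"),
  expected = round(expected_quartiles, 3),
  observed = observed_quartiles
)
print(comparison)
```

With the fitted model in memory, `qqnorm(resid(ex1_lm))` followed by `qqline(resid(ex1_lm))` would give a visual version of the same check.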
Evaluate p-value
The Pr(>|t|) value is the probability of observing a t-statistic at least as extreme as the one reported if the true coefficient were zero. This value is also known as the significance or p-value of the coefficient.
In this model,
- the p-value for “TotExp” is \(7.7139931\times 10^{-14}\), far below 0.05. We reject the null hypothesis that its slope is zero, so “TotExp” (sum of personal and government expenditures) is a statistically significant predictor.
- the p-value for the intercept is \(7.8070708\times 10^{-153}\), also far below 0.05, so the intercept is significantly different from zero.
Exercise 2
Raise life expectancy to the 4.6 power (i.e., \(LifeExp^{4.6}\)). Raise total expenditures to the 0.06 power (nearly a log transform, \(TotExp^{.06}\)). Plot \(LifeExp^{4.6}\) as a function of \(TotExp^{.06}\), and re-run the simple regression model using the transformed variables. Provide and interpret the F statistics, \(R^2\), standard error, and p-values. Which model is “better?”
Generate a scatter plot
plot(x = I(country_stats$TotExp ^ 0.06), y = I(country_stats$LifeExp ^ 4.6))

The plot above does show a roughly linear, positive relationship between \(TotExp^{0.06}\) and \(LifeExp^{4.6}\). Transformed life expectancy tends to increase as transformed Total Expense increases.
Fit a simple linear regression model
ex2_lm <- lm(formula = I(LifeExp ^ 4.6) ~ I(TotExp ^ 0.06), data = country_stats)
ex2_lm_summary <- summary(ex2_lm)
#[1] "call" "terms" "residuals" "coefficients" "aliased" "sigma"
#[7] "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled"
ex2_fstat <- ex2_lm_summary$fstatistic
ex2_rsquared <- ex2_lm_summary$r.squared
ex2_res_std_err <- ex2_lm_summary$sigma
ex2_pVal <- anova(ex2_lm)$'Pr(>F)'[1]
Show model’s quality measurement statistics
| Statistic | Value |
|---|---|
| F-Statistic | 507.6967054 on 1 and 188 DF |
| Multiple R-squared | 0.7297673 |
| Residual standard error | \(9.0492393\times 10^{7}\) |
| p-value | \(2.6014284\times 10^{-55}\) |
Evaluate model’s quality measurement statistics
Evaluate F-Statistic
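The F-statistic tests the null hypothesis that every slope coefficient is zero, that is, it compares the fitted model with an intercept-only model. With a single predictor this is equivalent to the t-test on the \(TotExp^{0.06}\) slope (\(F = t^2\)). Here F = 507.70 on 1 and 188 degrees of freedom, with a p-value of \(2.6014284\times 10^{-55}\), so the null hypothesis is rejected even more decisively than in Exercise 1.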
Evaluate Multiple R-squared
The Multiple R-squared value is a number between 0 and 1. It is a statistical measure of how well the model describes the measured data. The reported \(R^2\) of 0.7297673 for this model means that the model explains 72.98 percent of the variation in \(LifeExp^{4.6}\). This is a big improvement over the 25.77 percent reported in Exercise 1, although the two values are measured on different response scales.
Evaluate Residual standard error
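If the residuals are normally distributed around zero, their first and third quartiles should be roughly symmetric and lie near \(\pm 0.67\) times the residual standard error. For this model that is about \(\pm 6.1\times 10^{7}\).
Show the Q1 and Q3 for the model’s residuals
summary(ex2_lm_summary$residuals)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -308616089 -53978977 13697187 0 59139231 211951764
From the summary above we see that the first quartile (-53978977) and third quartile (59139231) are close to \(\pm 6.1\times 10^{7}\) and roughly symmetric, although the median (13697187) lies above zero and the minimum (-308616089) is larger in magnitude than the maximum (211951764).
Therefore, the residuals are closer to a normal distribution than in Exercise 1, but they still show some left skew and do not seem to be exactly normal.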
Evaluate p-value
The Pr(>|t|) value is the probability of observing a t-statistic at least as extreme as the one reported if the true coefficient were zero. This value is also known as the significance or p-value of the coefficient.
In this model,
- the p-value for \(TotExp ^ {0.06}\) is \(2.6014284\times 10^{-55}\), far below 0.05. We reject the null hypothesis that its slope is zero, so \(TotExp ^ {0.06}\) (transformed sum of personal and government expenditures) is a statistically significant predictor.
- the p-value for the intercept is \(3.9117741\times 10^{-36}\), also far below 0.05, so the intercept is significantly different from zero.
Exercise 3
Using the results from Exercise 2, forecast life expectancy when \(TotExp^{.06} = 1.5\). Then forecast life expectancy when \(TotExp^{.06} = 2.5\).
From exercise 2, the y-intercept is \(a_0 = -736527910\) and the slope is \(a_1 = 620060216\). Thus, the final regression model is:
\(LifeExp^{4.6} = -736527910 + 620060216 * TotExp^{0.06}\)
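Using the above equation we can forecast \(LifeExp^{4.6}\) and then take the 4.6th root to recover life expectancy in years:
If \(TotExp^{.06} = 1.5\), then \(LifeExp^{4.6} = -736527910 + 620060216 * 1.5 = 1.9356241\times 10^{8}\), so \(LifeExp = (1.9356241\times 10^{8})^{1/4.6} \approx 63.31\) years.
If \(TotExp^{.06} = 2.5\), then \(LifeExp^{4.6} = -736527910 + 620060216 * 2.5 = 8.1362263\times 10^{8}\), so \(LifeExp = (8.1362263\times 10^{8})^{1/4.6} \approx 86.51\) years.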
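The model predicts \(LifeExp^{4.6}\), so a forecast in years needs a back-transformation with the power \(1/4.6\). The sketch below shows that step using the intercept and slope quoted above; `forecast_life_exp` is a hypothetical helper name introduced only for this illustration.

```r
# Coefficients reported for the Exercise 2 model
intercept <- -736527910
slope     <- 620060216

# Hypothetical helper: forecast LifeExp (in years) from a value of TotExp^0.06
forecast_life_exp <- function(tot_exp_pow) {
  life_exp_pow <- intercept + slope * tot_exp_pow  # prediction of LifeExp^4.6
  life_exp_pow^(1 / 4.6)                           # back-transform to years
}

forecasts <- forecast_life_exp(c(1.5, 2.5))
print(round(forecasts, 2))
```

This gives forecasts of about 63.3 and 86.5 years. With the fitted model in memory, the same result should be obtainable from `predict(ex2_lm, newdata = data.frame(TotExp = 1.5^(1 / 0.06)))^(1 / 4.6)`, since `newdata` has to supply the untransformed TotExp.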
Exercise 4
Build the following multiple regression model and interpret the F Statistics, \(R^2\), standard error, and p-values. How good is the model?
\(LifeExp = b0 + b1 * PropMD + b2 * TotExp + b3 * PropMD * TotExp\)
Fit a multiple linear regression model
ex4_lm <- lm(formula = LifeExp ~ PropMD + TotExp + I(PropMD * TotExp), data = country_stats)
ex4_lm_summary <- summary(ex4_lm)
#[1] "call" "terms" "residuals" "coefficients" "aliased" "sigma"
#[7] "df" "r.squared" "adj.r.squared" "fstatistic" "cov.unscaled"
ex4_fstat <- ex4_lm_summary$fstatistic
ex4_rsquared <- ex4_lm_summary$r.squared
ex4_res_std_err <- ex4_lm_summary$sigma
ex4_pVal <- anova(ex4_lm)$'Pr(>F)'[1]
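One caveat on the last line: `anova(model)$'Pr(>F)'[1]` returns the sequential F-test p-value for the first term in the formula (here PropMD). It matches the overall model p-value only when there is a single predictor. A minimal sketch of one way to get the overall F-test p-value from the `fstatistic` element follows. The helper name `overall_p_value` is hypothetical, and the built-in `mtcars` data stands in for who.csv so that the example runs on its own.

```r
# Hypothetical helper: overall F-test p-value from a summary.lm object
overall_p_value <- function(model_summary) {
  f <- model_summary$fstatistic  # named vector: value, numdf, dendf
  unname(pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE))
}

# Stand-in model with the same structure (two predictors plus their product)
demo_lm <- lm(mpg ~ wt + hp + I(wt * hp), data = mtcars)

p_first_term <- anova(demo_lm)$"Pr(>F)"[1]         # sequential test for wt only
p_overall    <- overall_p_value(summary(demo_lm))  # test of all slopes at once

print(c(first_term = p_first_term, overall = p_overall))

# Same calculation from the F-statistic of this exercise's model
# (34.4883268 on 3 and 186 degrees of freedom)
p_ex4 <- pf(34.4883268, 3, 186, lower.tail = FALSE)
print(p_ex4)
```

Applied to this exercise, `overall_p_value(ex4_lm_summary)` would return the p-value that `summary(ex4_lm)` prints next to the F-statistic.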
Show model’s quality measurement statistics
| Statistic | Value |
|---|---|
| F-Statistic | 34.4883268 on 3 and 186 DF |
| Multiple R-squared | 0.3574352 |
| Residual standard error | 8.7654934 |
| p-value (first row of anova(ex4_lm), the PropMD term) | \(3.3134579\times 10^{-9}\) |
Evaluate model’s quality measurement statistics
Evaluate F-Statistic
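The F-statistic tests the null hypothesis that all three slope coefficients (PropMD, TotExp, and their interaction) are zero, that is, it compares the fitted model with an intercept-only model. Here F = 34.49 on 3 and 186 degrees of freedom, far above the 5% critical value of roughly 2.65, so the null hypothesis is rejected: at least one of the predictors is related to LifeExp.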
Evaluate Multiple R-squared
The Multiple R-squared value is a number between 0 and 1. It is a statistical measure of how well the model describes the measured data. The reported \(R^2\) of 0.3574352 for this model means that the model explains 35.74 percent of the data’s variation.
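Because this model has three predictors, the adjusted \(R^2\) is a useful companion to the multiple \(R^2\) when judging how good the model is. The sketch below recovers both from the F-statistic and degrees of freedom reported for this model (34.4883268 on 3 and 186), using the identity \(R^2 = F \cdot df_1 / (F \cdot df_1 + df_2)\). The variable names are illustrative.

```r
# F-statistic and degrees of freedom reported for the Exercise 4 model
f_value <- 34.4883268
df_num  <- 3    # number of slope coefficients
df_den  <- 186  # residual degrees of freedom

# Multiple R-squared implied by the F-statistic
r_squared <- (f_value * df_num) / (f_value * df_num + df_den)

# Number of observations: residual df + slopes + intercept
n_obs <- df_den + df_num + 1

# Adjusted R-squared penalises the extra predictors
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / df_den

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4))
```

The adjusted value is only slightly below the multiple \(R^2\), so the modest fit is not an artifact of the number of predictors.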
Evaluate Residual standard error
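If the residuals are normally distributed around zero, their first and third quartiles should be roughly symmetric and lie near \(\pm 0.67\) times the residual standard error. For this model that is about \(\pm 5.91\).
Show the Q1 and Q3 for the model’s residuals
summary(ex4_lm_summary$residuals)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> -27.320 -4.132 2.098 0.000 6.540 13.074
From the summary above we see that the first quartile (-4.132) and third quartile (6.540) are not symmetric about zero, the median (2.098) sits above the mean of 0, and the minimum (-27.320) is more than twice as large in magnitude as the maximum (13.074).
Therefore, the residuals are left-skewed and do not seem to be distributed normally.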
Evaluate p-value
The Pr(>|t|) value is the probability of observing a t-statistic at least as extreme as the one reported if the true coefficient were zero. This value is also known as the significance or p-value of the coefficient.
In this model,
- the reported p-value of \(3.3134579\times 10^{-9}\) comes from the first row of anova(ex4_lm), which is the sequential F-test for “PropMD” (the first term in the formula), not for “TotExp”. It is far below 0.05, so “PropMD” (proportion of the population who are MDs) is a statistically significant predictor.
- the p-value for the intercept is \(6.2071874\times 10^{-145}\), also far below 0.05, so the intercept is significantly different from zero.
Exercise 5
Forecast LifeExp when \(PropMD = .03\) and \(TotExp = 14\). Does this forecast seem realistic? Why or why not?
Based on the results from Exercise 4, the below model
\(LifeExp = b0 + b1 * PropMD + b2 * TotExp + b3 * PropMD * TotExp\)
can be re-written as
\(LifeExp = 62.77 + 1497 * PropMD + 0.00007233 * TotExp - 0.006026 * PropMD * TotExp\)
Using the above equation we can calculate Life Expectancy as:
- If \(PropMD = .03\) and \(TotExp = 14\), then \(LifeExp = 62.77 + 1497 * .03 + 0.00007233 * 14 - 0.006026 * .03 * 14 = 107.68\)
The life expectancy forecast of 107.68 does not seem realistic because
- the actual life expectancy values in the original source data range only from 40 to 83.
- if TotExp (sum of personal and government expenditures) is only 14, it does not make sense that such a small expenditure would raise life expectancy above that of every country in the data.
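The forecast and the range check above can be reproduced with a few lines of arithmetic. The sketch below uses the rounded coefficients from the equation above and the 40 to 83 range quoted in the text; the variable names are illustrative. Breaking the forecast into its terms also shows which part of the model drives the unrealistic value.

```r
# Coefficients as rounded in the equation above
b0 <- 62.77
b1 <- 1497
b2 <- 0.00007233
b3 <- -0.006026

# Inputs given in the exercise
prop_md <- 0.03
tot_exp <- 14

# Contribution of each term to the forecast
contributions <- c(
  intercept   = b0,
  PropMD      = b1 * prop_md,
  TotExp      = b2 * tot_exp,
  interaction = b3 * prop_md * tot_exp
)
forecast <- sum(contributions)

# Observed LifeExp range quoted in the text
observed_range <- c(40, 83)
outside_range  <- forecast < observed_range[1] || forecast > observed_range[2]

print(round(contributions, 4))
print(round(forecast, 2))
print(outside_range)
```

Nearly all of the increase over the intercept comes from the PropMD term (about 44.9 years), while the TotExp and interaction terms are negligible at these inputs.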
