This assignment must be submitted before February 23th before 10pm on Moodle.
This exercise is based on the gapminder dataset from the package of the same name.
Jennifer Bryan (2017). gapminder: Data from Gapminder. R package version 0.3.0. https://CRAN.R-project.org/package=gapminder
This dataset includes the life expectancy (lifeExp), population (pop) and GDP per capita (gdpPercap) for 142 countries and 12 years (every 5 years between 1952 and 2007).
library(gapminder)
str(gapminder)
## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
... + facet_wrap(~year).# Load library ggplot2
library(ggplot2)
# To plot the life expectancy as a function of log(GDP)
ggplot(gapminder, aes(x = log(gdpPercap), y = lifeExp)) +
geom_point(aes(color = continent), alpha = 0.5) +
theme_bw() +
labs(x = "Log of GDP per Capita", y = "Life Expectancy", title = "Life Expectancy vs. GDP per Capita over Years") +
facet_wrap(~year) +
scale_color_brewer(palette = "Set1")
What general trends do you observe? Are there extreme values that could strongly influence a regression model? If so, try to identify these data in the table based on the position of the points in the graph.
Based on the plot, we can say that there is a positive correlation between the logarithm GDP per capita and life expectancy. As the GDP per capita increases, the life expectancy tend to increases as well. This trend is consistance across all the years.
Over time, from 1952 to 2007, there is a noticeable improvement in life expectancy for most countries, regardless of their GDP per capita. This improvement is particularly noticeable for countries with lower GDP per capita at the beginning of the time series.
Extreme values
There are a few points that stand out due to their high GDP per capita but not correspondingly high life expectancy. These outliers could influence a regression model by skewing the relationship between GDP and life expectancy.
Similarly, there are countries with relatively high life expectancy despite lower GDP per capita, especially noticeable in later years.
View(gapminder)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Create a ggplot object without varying point sizes
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, text = paste("Country:", country, "<br>Year:", year, "<br>GDP per Capita:", gdpPercap, "<br>Life Expectancy:", lifeExp))) +
geom_point(aes(color = continent), alpha = 0.7) + # Fixed size points
scale_x_log10() +
facet_wrap(~year) +
labs(title = "Life Expectancy vs. GDP per Capita",
x = "GDP per Capita (log scale)",
y = "Life Expectancy",
color = "Continent") +
theme_minimal()
# Convert the ggplot object to a plotly object
p_interactive <- ggplotly(p, tooltip = "text")
# Print the interactive plot
p_interactive
Some of the extreme value are
The country Kuwait (Asia) showed an extreme value, high GDP but not correspondingly high life expectancy from the year 1952 to 1982. Simularly, the country Angola (Africa) showed an extreme in all the years.
Rawanda (Africa) is an extreme value in the year 1992.
Myanmar showed exteme value, low GDP but high life expectancy for the years 1992 and 1997.
lm) to determine the
effect of GDP per capita, year and their interaction on life expectancy.
To help interpret the coefficients, perform the following
transformations on the predictors:Take the logarithm of gdpPercap and standardize it with
the function scale. Reminder:
scale(x) subtracts each value of x from its
mean and divides by its standard deviation, so the resulting variable
has a mean of 0 and a standard deviation of 1; it represents the number
of standard deviations above or below the mean.
Replace year with the number of years since 1952.
Interpret the meaning of each of the coefficients in the model and then refer to the diagnostic graphs. Are the assumptions of the linear model met?
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Prepare the data with the required transformations
gapminder_transformed <- gapminder %>%
mutate(log_gdpPercap = log(gdpPercap), # Log-transform gdpPercap
scaled_log_gdpPercap = scale(log_gdpPercap), # Standardize the log-transformed gdpPercap
years_since_1952 = year - 1952) # Replace year with years since 1952
head(gapminder_transformed)
## # A tibble: 6 × 9
## country continent year lifeExp pop gdpPercap log_gdpPercap
## <fct> <fct> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779. 6.66
## 2 Afghanistan Asia 1957 30.3 9240934 821. 6.71
## 3 Afghanistan Asia 1962 32.0 10267083 853. 6.75
## 4 Afghanistan Asia 1967 34.0 11537966 836. 6.73
## 5 Afghanistan Asia 1972 36.1 13079460 740. 6.61
## 6 Afghanistan Asia 1977 38.4 14880372 786. 6.67
## # ℹ 2 more variables: scaled_log_gdpPercap <dbl[,1]>, years_since_1952 <dbl>
# Perform the linear regression
lm_model <- lm(lifeExp ~ scaled_log_gdpPercap * years_since_1952, data = gapminder_transformed)
# Summary of the model
summary(lm_model)
##
## Call:
## lm(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952,
## data = gapminder_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.340 -3.496 0.802 4.557 18.172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.311695 0.324413 167.415 < 2e-16 ***
## scaled_log_gdpPercap 10.694128 0.340945 31.366 < 2e-16 ***
## years_since_1952 0.192828 0.009923 19.433 < 2e-16 ***
## scaled_log_gdpPercap:years_since_1952 -0.034776 0.009774 -3.558 0.000384 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.854 on 1700 degrees of freedom
## Multiple R-squared: 0.719, Adjusted R-squared: 0.7185
## F-statistic: 1450 on 3 and 1700 DF, p-value: < 2.2e-16
Intercept (54.311695):. This is the estimated life expectancy when the standardized log GDP per capita is at its mean (which is zero after standardization) and the year is 1952 (since years_since_1952 would be 0). Essentially, this is the starting point of life expectancy for the average country in the dataset at the beginning of the period studied.
scaled_log_gdpPercap (10.694128):
This coefficient represents the estimated change in life expectancy
associated with a one unit increase in the log-transformed (and then
standardized) GDP per capita, while keeping the year constant (at 1952).
In other words, countries with a higher GDP per capita are expected to
have their life expectancy increased by approximately 10.69 years,
assuming the year is 1952.
years_since_1952 (0.192828):
This coefficient indicates the average annual increase in life
expectancy since 1952, assuming that the GDP per capita remains at its
mean value. In practical terms, this suggests that each year, life
expectancy has increased by about 0.19 years, independent of GDP per
capita changes.
scaled_log_gdpPercap:years_since_1952 (-0.034776): This interaction term suggests how the effect of GDP per capita on life expectancy has changed over time. Specifically, the negative coefficient indicates that the positive impact of GDP per capita on life expectancy has decreased each year since 1952. The life expectancy has diminished slightly every year by about 0.035 years per unit increase in log-transformed GDP per capita.
# Set up the layout for 4 plots on one page
# par(mfrow = c(2, 2))
# Plot the diagnostic graphs
plot(lm_model)
# Reset the plotting layout to default
# par(mfrow = c(1, 1))
Based on the diagnostic plots provided for the linear model:
1. Residuals vs Fitted Plot: There appears to be a slight curve (non-linear pattern), suggesting that the relationship between the predictors and the response may not be perfectly linear, which violates the linearity assumption.
2. Normal Q-Q Plot: Q-Q plot shows deviations from the line in the tails, particularly in the lower left and upper right, suggesting that the residuals might not be normally distributed, which violates the normality assumption.
3. Scale-Location Plot (or Spread-Location Plot): The homoscedasticity assumption is likely violated, indicated by the funnel shape in both the Residuals vs Fitted and the Scale-Location plots.
4. Residuals vs Leverage Plot: There are a couple of points far to the right, indicating potential high-leverage points. However, they do not appear to exceed the common threshold for Cook’s distance, they may not impact the model.
These diagnostics suggest that the linear model may not fully meet the necessary assumptions, which could affect the reliability of the regression coefficients and standard error estimates.
lmrob from the robustbase package) and median
regression (function rq from the quantreg package,
choosing only the median quantile). Explain how the estimates and
standard errors of the coefficients differ between the three
methods.Note: Use the showAlgo = FALSE option when
applying the summary function to the output of
lmrob, to simplify the summary.
# Install necessary packages
# install.packages("robustbase")
# install.packages("quantreg")
library(robustbase)
library(quantreg)
## Loading required package: SparseM
##
## Attaching package: 'SparseM'
## The following object is masked from 'package:base':
##
## backsolve
# Robust Regression using Tukey's Biweight
robust_model <- lmrob(lifeExp ~ scaled_log_gdpPercap * years_since_1952, data = gapminder_transformed)
# Summarize the robust regression results
summary(robust_model, showAlgo = FALSE)
##
## Call:
## lmrob(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952, data = gapminder_transformed)
## \--> method = "MM"
## Residuals:
## Min 1Q Median 3Q Max
## -34.838 -3.912 0.133 3.658 18.625
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.562627 0.352056 157.823 <2e-16 ***
## scaled_log_gdpPercap 12.590304 0.311367 40.436 <2e-16 ***
## years_since_1952 0.178728 0.010456 17.094 <2e-16 ***
## scaled_log_gdpPercap:years_since_1952 -0.075843 0.008496 -8.927 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Robust residual standard error: 5.467
## Multiple R-squared: 0.797, Adjusted R-squared: 0.7967
## Convergence in 9 IRWLS iterations
##
## Robustness weights:
## 8 observations c(37,40,167,853,854,855,1293,1464)
## are outliers with |weight| = 0 ( < 5.9e-05);
## 147 weights are ~= 1. The remaining 1549 ones are summarized as
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0004047 0.8444000 0.9451000 0.8735000 0.9859000 0.9990000
## Algorithmic parameters:
## tuning.chi bb tuning.psi refine.tol
## 1.548e+00 5.000e-01 4.685e+00 1.000e-07
## rel.tol scale.tol solve.tol zero.tol
## 1.000e-07 1.000e-10 1.000e-07 1.000e-10
## eps.outlier eps.x warn.limit.reject warn.limit.meanrw
## 5.869e-05 2.134e-10 5.000e-01 5.000e-01
## nResample max.it best.r.s k.fast.s k.max
## 500 50 2 1 200
## maxit.scale trace.lev mts compute.rd fast.s.large.n
## 200 0 1000 0 2000
## psi subsampling cov
## "bisquare" "nonsingular" ".vcov.avar1"
## compute.outlier.stats
## "SM"
## seed : int(0)
# Median Regression (Quantile Regression at the median)
median_model <- rq(lifeExp ~ scaled_log_gdpPercap * years_since_1952, data = gapminder_transformed, tau = 0.5)
# Summarize the median regression results
summary(median_model)
## Warning in summary.rq(median_model): 3 non-positive fis
##
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952,
## tau = 0.5, data = gapminder_transformed)
##
## tau: [1] 0.5
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 55.57793 0.37879 146.72594 0.00000
## scaled_log_gdpPercap 12.51016 0.36768 34.02426 0.00000
## years_since_1952 0.18736 0.01142 16.41102 0.00000
## scaled_log_gdpPercap:years_since_1952 -0.07982 0.00869 -9.18818 0.00000
| Coefficient | Linear Model | Robust Model (lmrob) | Median Model (rq) |
|---|---|---|---|
| Intercept | 54.31 | 55.56 | 55.58 |
| scaled_log_gdpPercap | 10.69 | 12.59 | 12.51 |
| years_since_1952 | 0.19 | 0.18 | 0.19 |
| Interaction: scaled_log_gdpPercap:years_since_1952 | -0.0348 | -0.0758 | -0.0798 |
Observations:
Intercept: The estimates from both robust and median regression are slightly higher than the standard linear regression. This indicates that after adjusting for outliers and leveraging more robust techniques, the baseline life expectancy (starting point in 1952) is estimated to be slightly higher.
scaled_log_gdpPercap (GDP per capita): Both robust and median regressions yield higher estimates compared to the standard model. This suggests that the positive effect of GDP per capita on life expectancy is more pronounced when using these methods, which may be reducing the influence of outliers that were pulling the estimate down in the standard regression.
years_since_1952 (Time effect): The time effect is slightly less in the robust model compared to the standard regression but is almost the same in the median regression. This suggests that the annual increase in life expectancy over time is slightly less pronounced when outliers are mitigated in the robust model but remains consistent in the median model.
Interaction : Both the robust and median regressions show a more negative interaction between GDP per capita and time than the standard regression. This indicates a stronger diminishing effect of GDP per capita on life expectancy over time when the models account for outliers and employ a more robust or median-based approach.
ggplot you can use the geom_smooth
function with method = "lm" for linear regression and
method = "lmrob" for robust regression. For median
regression you can use geom_quantile as seen in the
notes.library(quantreg)
# Base plot
p <- ggplot(gapminder_transformed, aes(x = scaled_log_gdpPercap, y = lifeExp)) +
geom_point(aes(color = continent), alpha = 0.5) +
theme_minimal() +
labs(title = "Life Expectancy vs. Standardized Log GDP per Capita",
x = "Standardized Log of GDP per Capita",
y = "Life Expectancy")
# Adding regression lines
p +
geom_smooth(method = "lm", formula = y ~ x, color = "blue") + # Linear regression
geom_smooth(method = "lmrob", formula = y ~ x, color = "red") + # Robust regression
geom_quantile(formula = y ~ x, quantiles = 0.5, color = "green") # Median regression
Yes, it would be useful to model different quantiles of life expectancy based on the predictors.
Justification:
- scatter plot indicates that the variability of life expectancy changes across different levels of GDP per capita (e.g., variance is much higher at lower levels of GDP than at higher levels).
- there are visible outliers or heteroscedasticity in the data. - Also, witness the skewed distribution of life expectancy for different levels of GDP per capita or years. - slope of the relationship between life expectancy and GDP per capita changes across the distribution of life expectancy.
(0.1, 0.25, 0.5, 0.75, 0.9).
Use the plot function on the quantile regression summary
and describe how the effect of the predictors varies between
quantiles.# Install and load the quantreg package
# install.packages("quantreg")
library(quantreg)
# Fit the quantile regression model for quantiles
qr_models <- rq(lifeExp ~ scaled_log_gdpPercap * years_since_1952, data = gapminder_transformed, tau = c(0.1, 0.25, 0.5, 0.75, 0.9))
summary(qr_models)
## Warning in summary.rq(xi, U = U, ...): 12 non-positive fis
## Warning in summary.rq(xi, U = U, ...): 6 non-positive fis
## Warning in summary.rq(xi, U = U, ...): 3 non-positive fis
## Warning in summary.rq(xi, U = U, ...): 3 non-positive fis
## Warning in summary.rq(xi, U = U, ...): 18 non-positive fis
##
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952,
## tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = gapminder_transformed)
##
## tau: [1] 0.1
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 43.20169 0.70142 61.59156 0.00000
## scaled_log_gdpPercap 7.35068 0.47938 15.33368 0.00000
## years_since_1952 0.26120 0.02149 12.15622 0.00000
## scaled_log_gdpPercap:years_since_1952 0.06933 0.01564 4.43201 0.00001
##
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952,
## tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = gapminder_transformed)
##
## tau: [1] 0.25
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 50.48124 0.70831 71.26986 0.00000
## scaled_log_gdpPercap 11.97199 0.47754 25.07014 0.00000
## years_since_1952 0.20507 0.01806 11.35429 0.00000
## scaled_log_gdpPercap:years_since_1952 -0.03613 0.00967 -3.73735 0.00019
##
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952,
## tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = gapminder_transformed)
##
## tau: [1] 0.5
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 55.57793 0.37879 146.72594 0.00000
## scaled_log_gdpPercap 12.51016 0.36768 34.02426 0.00000
## years_since_1952 0.18736 0.01142 16.41102 0.00000
## scaled_log_gdpPercap:years_since_1952 -0.07982 0.00869 -9.18818 0.00000
##
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952,
## tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = gapminder_transformed)
##
## tau: [1] 0.75
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 59.97344 0.31817 188.49586 0.00000
## scaled_log_gdpPercap 12.14981 0.29628 41.00809 0.00000
## years_since_1952 0.16231 0.01008 16.10339 0.00000
## scaled_log_gdpPercap:years_since_1952 -0.10032 0.00744 -13.48697 0.00000
##
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952,
## tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = gapminder_transformed)
##
## tau: [1] 0.9
##
## Coefficients:
## Value Std. Error t value Pr(>|t|)
## (Intercept) 63.25230 0.38655 163.63235 0.00000
## scaled_log_gdpPercap 9.69923 0.33726 28.75926 0.00000
## years_since_1952 0.15789 0.01351 11.68672 0.00000
## scaled_log_gdpPercap:years_since_1952 -0.08189 0.00866 -9.45363 0.00000
plot(summary(qr_models))
## Warning in summary.rq(xi, U = U, ...): 12 non-positive fis
## Warning in summary.rq(xi, U = U, ...): 6 non-positive fis
## Warning in summary.rq(xi, U = U, ...): 3 non-positive fis
## Warning in summary.rq(xi, U = U, ...): 3 non-positive fis
## Warning in summary.rq(xi, U = U, ...): 18 non-positive fis
Intercept: The intercept, representing the estimated life expectancy at the reference level of predictors (scaled log GDP per capita normalized and years since 1952 set to zero), increases across quantiles. This indicates that in countries at higher quantiles (i.e., with higher life expectancy), the base life expectancy (without the influence of GDP or time) is higher. The slope of the intercept across quantiles suggests that the gap in base life expectancy between lower and higher quantile countries has been widening over time.
scaled_log_gdpPercap: The effect of the standardized log GDP per capita appears to decrease as we move from lower to higher quantiles. This suggests that while GDP per capita has a strong positive effect on life expectancy in countries at the lower end of the life expectancy distribution, its impact diminishes at the higher end. The decreasing trend across quantiles indicates that wealth increases are more beneficial to life expectancy in poor countries than in rich ones.
years_since_1952: The coefficient for years since 1952 seems to decrease across quantiles, indicating that the overall improvement in life expectancy over time is slightly less pronounced in countries at higher life expectancy levels. However, the change is relatively subtle, suggesting that time has been a fairly consistent positive factor across all quantiles, although slightly less so at the higher end of the life expectancy.
Interaction: scaled_log_gdpPercap:years_since_1952: The interaction between standardized log GDP per capita and years since 1952 shows variation across quantiles. Initially positive at lower quantiles, it turns negative as we move to higher quantiles. This could indicate that the enhancing effect of GDP per capita on life expectancy growth over time has been stronger in countries with initially lower life expectancy. In contrast, in countries with higher initial life expectancy, the benefit of increasing GDP per capita on further life expectancy gains has reduced over time.
library(ggplot2)
# Adding quantile regression lines for specified quantiles
quantiles <- c(0.1, 0.25, 0.5, 0.75, 0.9)
p + geom_quantile(quantiles = quantiles, formula = y ~ x, linetype = "solid", color = "red") +
theme(legend.position = "right")
While this dataset is useful for illustrating the concepts of robust regression and quantile regression, it should be noted that this type of statistical analysis comparing variables measured at the national level has several limitations:
It cannot be assumed that the associations detected apply at a smaller scale (e.g., the relationship between life expectancy and income when comparing national averages is not necessarily the same as the relationship between life expectancy and income at the level of individuals living in each country).
Averages calculated in different countries are not independent observations, because environmental, social and economic conditions are correlated between nearby countries.
There are many factors that differentiate countries, so it is difficult to interpret an association as a causal link.
Many articles, particularly in the social sciences, have been published on the methods to be used to make this type of cross-country comparisons.