Robust regression

This assignment must be submitted before February 23th before 10pm on Moodle.

Data

This exercise is based on the gapminder dataset from the package of the same name.

Jennifer Bryan (2017). gapminder: Data from Gapminder. R package version 0.3.0. https://CRAN.R-project.org/package=gapminder

This dataset includes the life expectancy (lifeExp), population (pop) and GDP per capita (gdpPercap) for 142 countries and 12 years (every 5 years between 1952 and 2007).

library(gapminder)
str(gapminder)

## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

1. Effect of GDP and time on life expectancy

First, visualize the life expectancy as a function of GDP per capita and year. It is suggested to represent the logarithm of gdpPercap and to separate the different years, for example with facets in ggplot2: ... + facet_wrap(~year).

# Load library ggplot2
library(ggplot2)

# To plot the life expectancy as a function of log(GDP)
ggplot(gapminder, aes(x = log(gdpPercap), y = lifeExp)) + 
  geom_point(aes(color = continent), alpha = 0.5) + 
  theme_bw() +
  labs(x = "Log of GDP per Capita", y = "Life Expectancy", title = "Life Expectancy vs. GDP per Capita over Years") +
  facet_wrap(~year) +
  scale_color_brewer(palette = "Set1")

What general trends do you observe? Are there extreme values that could strongly influence a regression model? If so, try to identify these data in the table based on the position of the points in the graph.

Based on the plot, we can say that there is a positive correlation between the logarithm GDP per capita and life expectancy. As the GDP per capita increases, the life expectancy tend to increases as well. This trend is consistance across all the years.

Over time, from 1952 to 2007, there is a noticeable improvement in life expectancy for most countries, regardless of their GDP per capita. This improvement is particularly noticeable for countries with lower GDP per capita at the beginning of the time series.

Extreme values
There are a few points that stand out due to their high GDP per capita but not correspondingly high life expectancy. These outliers could influence a regression model by skewing the relationship between GDP and life expectancy.

Similarly, there are countries with relatively high life expectancy despite lower GDP per capita, especially noticeable in later years.

View(gapminder)

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

# Create a ggplot object without varying point sizes
p <- ggplot(gapminder, aes(x = gdpPercap, y = lifeExp, text = paste("Country:", country, "<br>Year:", year, "<br>GDP per Capita:", gdpPercap, "<br>Life Expectancy:", lifeExp))) +
  geom_point(aes(color = continent), alpha = 0.7) +  # Fixed size points
  scale_x_log10() + 
  facet_wrap(~year) +
  labs(title = "Life Expectancy vs. GDP per Capita",
       x = "GDP per Capita (log scale)",
       y = "Life Expectancy",
       color = "Continent") +
  theme_minimal()

# Convert the ggplot object to a plotly object
p_interactive <- ggplotly(p, tooltip = "text")

# Print the interactive plot
p_interactive

Some of the extreme value are
The country Kuwait (Asia) showed an extreme value, high GDP but not correspondingly high life expectancy from the year 1952 to 1982. Simularly, the country Angola (Africa) showed an extreme in all the years.
Rawanda (Africa) is an extreme value in the year 1992.
Myanmar showed exteme value, low GDP but high life expectancy for the years 1992 and 1997.

Perform a linear regression (lm) to determine the effect of GDP per capita, year and their interaction on life expectancy. To help interpret the coefficients, perform the following transformations on the predictors:

Take the logarithm of gdpPercap and standardize it with the function scale. Reminder: scale(x) subtracts each value of x from its mean and divides by its standard deviation, so the resulting variable has a mean of 0 and a standard deviation of 1; it represents the number of standard deviations above or below the mean.
Replace year with the number of years since 1952.

Interpret the meaning of each of the coefficients in the model and then refer to the diagnostic graphs. Are the assumptions of the linear model met?

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# Prepare the data with the required transformations
gapminder_transformed <- gapminder %>%
  mutate(log_gdpPercap = log(gdpPercap),  # Log-transform gdpPercap
         scaled_log_gdpPercap = scale(log_gdpPercap),  # Standardize the log-transformed gdpPercap
         years_since_1952 = year - 1952)  # Replace year with years since 1952

head(gapminder_transformed)

## # A tibble: 6 × 9
##   country     continent  year lifeExp      pop gdpPercap log_gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>         <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.          6.66
## 2 Afghanistan Asia       1957    30.3  9240934      821.          6.71
## 3 Afghanistan Asia       1962    32.0 10267083      853.          6.75
## 4 Afghanistan Asia       1967    34.0 11537966      836.          6.73
## 5 Afghanistan Asia       1972    36.1 13079460      740.          6.61
## 6 Afghanistan Asia       1977    38.4 14880372      786.          6.67
## # ℹ 2 more variables: scaled_log_gdpPercap <dbl[,1]>, years_since_1952 <dbl>

# Perform the linear regression
lm_model <- lm(lifeExp ~ scaled_log_gdpPercap * years_since_1952, data = gapminder_transformed)

# Summary of the model
summary(lm_model)

## 
## Call:
## lm(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952, 
##     data = gapminder_transformed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -28.340  -3.496   0.802   4.557  18.172 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           54.311695   0.324413 167.415  < 2e-16 ***
## scaled_log_gdpPercap                  10.694128   0.340945  31.366  < 2e-16 ***
## years_since_1952                       0.192828   0.009923  19.433  < 2e-16 ***
## scaled_log_gdpPercap:years_since_1952 -0.034776   0.009774  -3.558 0.000384 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.854 on 1700 degrees of freedom
## Multiple R-squared:  0.719,  Adjusted R-squared:  0.7185 
## F-statistic:  1450 on 3 and 1700 DF,  p-value: < 2.2e-16

Intercept (54.311695):. This is the estimated life expectancy when the standardized log GDP per capita is at its mean (which is zero after standardization) and the year is 1952 (since years_since_1952 would be 0). Essentially, this is the starting point of life expectancy for the average country in the dataset at the beginning of the period studied.

scaled_log_gdpPercap (10.694128):
This coefficient represents the estimated change in life expectancy associated with a one unit increase in the log-transformed (and then standardized) GDP per capita, while keeping the year constant (at 1952). In other words, countries with a higher GDP per capita are expected to have their life expectancy increased by approximately 10.69 years, assuming the year is 1952.

years_since_1952 (0.192828):
This coefficient indicates the average annual increase in life expectancy since 1952, assuming that the GDP per capita remains at its mean value. In practical terms, this suggests that each year, life expectancy has increased by about 0.19 years, independent of GDP per capita changes.

scaled_log_gdpPercap:years_since_1952 (-0.034776): This interaction term suggests how the effect of GDP per capita on life expectancy has changed over time. Specifically, the negative coefficient indicates that the positive impact of GDP per capita on life expectancy has decreased each year since 1952. The life expectancy has diminished slightly every year by about 0.035 years per unit increase in log-transformed GDP per capita.

# Set up the layout for 4 plots on one page
# par(mfrow = c(2, 2))

# Plot the diagnostic graphs
plot(lm_model)

# Reset the plotting layout to default
# par(mfrow = c(1, 1))

Based on the diagnostic plots provided for the linear model:
1. Residuals vs Fitted Plot: There appears to be a slight curve (non-linear pattern), suggesting that the relationship between the predictors and the response may not be perfectly linear, which violates the linearity assumption.
2. Normal Q-Q Plot: Q-Q plot shows deviations from the line in the tails, particularly in the lower left and upper right, suggesting that the residuals might not be normally distributed, which violates the normality assumption.
3. Scale-Location Plot (or Spread-Location Plot): The homoscedasticity assumption is likely violated, indicated by the funnel shape in both the Residuals vs Fitted and the Scale-Location plots.
4. Residuals vs Leverage Plot: There are a couple of points far to the right, indicating potential high-leverage points. However, they do not appear to exceed the common threshold for Cook’s distance, they may not impact the model.
These diagnostics suggest that the linear model may not fully meet the necessary assumptions, which could affect the reliability of the regression coefficients and standard error estimates.

Compare the result of the model in (b) with two more robust alternatives: robust regression based on Tukey’s biweight (function lmrob from the robustbase package) and median regression (function rq from the quantreg package, choosing only the median quantile). Explain how the estimates and standard errors of the coefficients differ between the three methods.

Note: Use the showAlgo = FALSE option when applying the summary function to the output of lmrob, to simplify the summary.

# Install necessary packages
# install.packages("robustbase")
# install.packages("quantreg")

library(robustbase)
library(quantreg)

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

# Robust Regression using Tukey's Biweight
robust_model <- lmrob(lifeExp ~ scaled_log_gdpPercap * years_since_1952, data = gapminder_transformed)

# Summarize the robust regression results 
summary(robust_model, showAlgo = FALSE)

## 
## Call:
## lmrob(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952, data = gapminder_transformed)
##  \--> method = "MM"
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -34.838  -3.912   0.133   3.658  18.625 
## 
## Coefficients:
##                                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                           55.562627   0.352056 157.823   <2e-16 ***
## scaled_log_gdpPercap                  12.590304   0.311367  40.436   <2e-16 ***
## years_since_1952                       0.178728   0.010456  17.094   <2e-16 ***
## scaled_log_gdpPercap:years_since_1952 -0.075843   0.008496  -8.927   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Robust residual standard error: 5.467 
## Multiple R-squared:  0.797,  Adjusted R-squared:  0.7967 
## Convergence in 9 IRWLS iterations
## 
## Robustness weights: 
##  8 observations c(37,40,167,853,854,855,1293,1464)
##   are outliers with |weight| = 0 ( < 5.9e-05); 
##  147 weights are ~= 1. The remaining 1549 ones are summarized as
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0004047 0.8444000 0.9451000 0.8735000 0.9859000 0.9990000 
## Algorithmic parameters: 
##        tuning.chi                bb        tuning.psi        refine.tol 
##         1.548e+00         5.000e-01         4.685e+00         1.000e-07 
##           rel.tol         scale.tol         solve.tol          zero.tol 
##         1.000e-07         1.000e-10         1.000e-07         1.000e-10 
##       eps.outlier             eps.x warn.limit.reject warn.limit.meanrw 
##         5.869e-05         2.134e-10         5.000e-01         5.000e-01 
##      nResample         max.it       best.r.s       k.fast.s          k.max 
##            500             50              2              1            200 
##    maxit.scale      trace.lev            mts     compute.rd fast.s.large.n 
##            200              0           1000              0           2000 
##                   psi           subsampling                   cov 
##            "bisquare"         "nonsingular"         ".vcov.avar1" 
## compute.outlier.stats 
##                  "SM" 
## seed : int(0)

# Median Regression (Quantile Regression at the median)
median_model <- rq(lifeExp ~ scaled_log_gdpPercap * years_since_1952, data = gapminder_transformed, tau = 0.5)

# Summarize the median regression results
summary(median_model)

## Warning in summary.rq(median_model): 3 non-positive fis

## 
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952, 
##     tau = 0.5, data = gapminder_transformed)
## 
## tau: [1] 0.5
## 
## Coefficients:
##                                       Value     Std. Error t value   Pr(>|t|) 
## (Intercept)                            55.57793   0.37879  146.72594   0.00000
## scaled_log_gdpPercap                   12.51016   0.36768   34.02426   0.00000
## years_since_1952                        0.18736   0.01142   16.41102   0.00000
## scaled_log_gdpPercap:years_since_1952  -0.07982   0.00869   -9.18818   0.00000

Coefficient Estimates

Coefficient	Linear Model	Robust Model (lmrob)	Median Model (rq)
Intercept	54.31	55.56	55.58
scaled_log_gdpPercap	10.69	12.59	12.51
years_since_1952	0.19	0.18	0.19
Interaction: scaled_log_gdpPercap:years_since_1952	-0.0348	-0.0758	-0.0798

Observations:
Intercept: The estimates from both robust and median regression are slightly higher than the standard linear regression. This indicates that after adjusting for outliers and leveraging more robust techniques, the baseline life expectancy (starting point in 1952) is estimated to be slightly higher.

scaled_log_gdpPercap (GDP per capita): Both robust and median regressions yield higher estimates compared to the standard model. This suggests that the positive effect of GDP per capita on life expectancy is more pronounced when using these methods, which may be reducing the influence of outliers that were pulling the estimate down in the standard regression.

years_since_1952 (Time effect): The time effect is slightly less in the robust model compared to the standard regression but is almost the same in the median regression. This suggests that the annual increase in life expectancy over time is slightly less pronounced when outliers are mitigated in the robust model but remains consistent in the median model.

Interaction : Both the robust and median regressions show a more negative interaction between GDP per capita and time than the standard regression. This indicates a stronger diminishing effect of GDP per capita on life expectancy over time when the models account for outliers and employ a more robust or median-based approach.

Superimpose the regression lines of the three models on the graph in (a). With ggplot you can use the geom_smooth function with method = "lm" for linear regression and method = "lmrob" for robust regression. For median regression you can use geom_quantile as seen in the notes.

library(quantreg)

# Base plot
p <- ggplot(gapminder_transformed, aes(x = scaled_log_gdpPercap, y = lifeExp)) +
  geom_point(aes(color = continent), alpha = 0.5) +
  theme_minimal() +
  labs(title = "Life Expectancy vs. Standardized Log GDP per Capita",
       x = "Standardized Log of GDP per Capita", 
       y = "Life Expectancy")

# Adding regression lines
p + 
  geom_smooth(method = "lm", formula = y ~ x, color = "blue") +  # Linear regression
  geom_smooth(method = "lmrob", formula = y ~ x, color = "red") +  # Robust regression
  geom_quantile(formula = y ~ x, quantiles = 0.5, color = "green")  # Median regression

2. Variation of effects by quantile

Based on your observation of the data in 1(a), would it be useful to model different quantiles of life expectancy based on the predictors? Justify your answer.

Yes, it would be useful to model different quantiles of life expectancy based on the predictors.

Justification:
- scatter plot indicates that the variability of life expectancy changes across different levels of GDP per capita (e.g., variance is much higher at lower levels of GDP than at higher levels).
- there are visible outliers or heteroscedasticity in the data. - Also, witness the skewed distribution of life expectancy for different levels of GDP per capita or years. - slope of the relationship between life expectancy and GDP per capita changes across the distribution of life expectancy.

Perform a quantile regression with the same predictors as in 1(b), with the following quantiles: (0.1, 0.25, 0.5, 0.75, 0.9). Use the plot function on the quantile regression summary and describe how the effect of the predictors varies between quantiles.

# Install and load the quantreg package 
# install.packages("quantreg")
library(quantreg)

# Fit the quantile regression model for  quantiles
qr_models <- rq(lifeExp ~ scaled_log_gdpPercap * years_since_1952, data = gapminder_transformed, tau = c(0.1, 0.25, 0.5, 0.75, 0.9))

summary(qr_models)

## Warning in summary.rq(xi, U = U, ...): 12 non-positive fis

## Warning in summary.rq(xi, U = U, ...): 6 non-positive fis

## Warning in summary.rq(xi, U = U, ...): 3 non-positive fis

## Warning in summary.rq(xi, U = U, ...): 3 non-positive fis

## Warning in summary.rq(xi, U = U, ...): 18 non-positive fis

## 
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952, 
##     tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = gapminder_transformed)
## 
## tau: [1] 0.1
## 
## Coefficients:
##                                       Value    Std. Error t value  Pr(>|t|)
## (Intercept)                           43.20169  0.70142   61.59156  0.00000
## scaled_log_gdpPercap                   7.35068  0.47938   15.33368  0.00000
## years_since_1952                       0.26120  0.02149   12.15622  0.00000
## scaled_log_gdpPercap:years_since_1952  0.06933  0.01564    4.43201  0.00001
## 
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952, 
##     tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = gapminder_transformed)
## 
## tau: [1] 0.25
## 
## Coefficients:
##                                       Value    Std. Error t value  Pr(>|t|)
## (Intercept)                           50.48124  0.70831   71.26986  0.00000
## scaled_log_gdpPercap                  11.97199  0.47754   25.07014  0.00000
## years_since_1952                       0.20507  0.01806   11.35429  0.00000
## scaled_log_gdpPercap:years_since_1952 -0.03613  0.00967   -3.73735  0.00019
## 
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952, 
##     tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = gapminder_transformed)
## 
## tau: [1] 0.5
## 
## Coefficients:
##                                       Value     Std. Error t value   Pr(>|t|) 
## (Intercept)                            55.57793   0.37879  146.72594   0.00000
## scaled_log_gdpPercap                   12.51016   0.36768   34.02426   0.00000
## years_since_1952                        0.18736   0.01142   16.41102   0.00000
## scaled_log_gdpPercap:years_since_1952  -0.07982   0.00869   -9.18818   0.00000
## 
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952, 
##     tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = gapminder_transformed)
## 
## tau: [1] 0.75
## 
## Coefficients:
##                                       Value     Std. Error t value   Pr(>|t|) 
## (Intercept)                            59.97344   0.31817  188.49586   0.00000
## scaled_log_gdpPercap                   12.14981   0.29628   41.00809   0.00000
## years_since_1952                        0.16231   0.01008   16.10339   0.00000
## scaled_log_gdpPercap:years_since_1952  -0.10032   0.00744  -13.48697   0.00000
## 
## Call: rq(formula = lifeExp ~ scaled_log_gdpPercap * years_since_1952, 
##     tau = c(0.1, 0.25, 0.5, 0.75, 0.9), data = gapminder_transformed)
## 
## tau: [1] 0.9
## 
## Coefficients:
##                                       Value     Std. Error t value   Pr(>|t|) 
## (Intercept)                            63.25230   0.38655  163.63235   0.00000
## scaled_log_gdpPercap                    9.69923   0.33726   28.75926   0.00000
## years_since_1952                        0.15789   0.01351   11.68672   0.00000
## scaled_log_gdpPercap:years_since_1952  -0.08189   0.00866   -9.45363   0.00000

plot(summary(qr_models))

## Warning in summary.rq(xi, U = U, ...): 12 non-positive fis

## Warning in summary.rq(xi, U = U, ...): 6 non-positive fis

## Warning in summary.rq(xi, U = U, ...): 3 non-positive fis

## Warning in summary.rq(xi, U = U, ...): 3 non-positive fis

## Warning in summary.rq(xi, U = U, ...): 18 non-positive fis

Intercept: The intercept, representing the estimated life expectancy at the reference level of predictors (scaled log GDP per capita normalized and years since 1952 set to zero), increases across quantiles. This indicates that in countries at higher quantiles (i.e., with higher life expectancy), the base life expectancy (without the influence of GDP or time) is higher. The slope of the intercept across quantiles suggests that the gap in base life expectancy between lower and higher quantile countries has been widening over time.

scaled_log_gdpPercap: The effect of the standardized log GDP per capita appears to decrease as we move from lower to higher quantiles. This suggests that while GDP per capita has a strong positive effect on life expectancy in countries at the lower end of the life expectancy distribution, its impact diminishes at the higher end. The decreasing trend across quantiles indicates that wealth increases are more beneficial to life expectancy in poor countries than in rich ones.

years_since_1952: The coefficient for years since 1952 seems to decrease across quantiles, indicating that the overall improvement in life expectancy over time is slightly less pronounced in countries at higher life expectancy levels. However, the change is relatively subtle, suggesting that time has been a fairly consistent positive factor across all quantiles, although slightly less so at the higher end of the life expectancy.

Interaction: scaled_log_gdpPercap:years_since_1952: The interaction between standardized log GDP per capita and years since 1952 shows variation across quantiles. Initially positive at lower quantiles, it turns negative as we move to higher quantiles. This could indicate that the enhancing effect of GDP per capita on life expectancy growth over time has been stronger in countries with initially lower life expectancy. In contrast, in countries with higher initial life expectancy, the benefit of increasing GDP per capita on further life expectancy gains has reduced over time.

Superimpose the quantile regression lines on the graph of the data. Do the trends for each quantile appear to be affected by extreme values?

library(ggplot2)

# Adding quantile regression lines for specified quantiles
quantiles <- c(0.1, 0.25, 0.5, 0.75, 0.9)

p + geom_quantile(quantiles = quantiles, formula = y ~ x, linetype = "solid", color = "red") +
  theme(legend.position = "right")

Note on international comparisons

While this dataset is useful for illustrating the concepts of robust regression and quantile regression, it should be noted that this type of statistical analysis comparing variables measured at the national level has several limitations:

It cannot be assumed that the associations detected apply at a smaller scale (e.g., the relationship between life expectancy and income when comparing national averages is not necessarily the same as the relationship between life expectancy and income at the level of individuals living in each country).
Averages calculated in different countries are not independent observations, because environmental, social and economic conditions are correlated between nearby countries.
There are many factors that differentiate countries, so it is difficult to interpret an association as a causal link.

Many articles, particularly in the social sciences, have been published on the methods to be used to make this type of cross-country comparisons.