Linear Regression

An animated chart of a 3-way relationship

Life expectancy is the average number of years between birth and death for people in a country. The longer this span, scientists think, the better people live. What determines the life expectancy of nations is one of the most important questions in modern science. There is even a TED talk on the subject: https://www.ted.com/talks/hans_rosling_let_my_dataset_change_your_mindset/transcript?language=en

Below is the animated scatter plot from that TED talk. By the end of this discussion, you too will be able to make a graph like this in no time.

# required libraries: run install.packages() for each one before you can load it
library(gapminder)
library(ggplot2)
library(ggrepel)
library(gganimate)
library(gifski)
library(transformr)

wdata = gapminder

# Make a ggplot; transition_time(year) below draws the animation frame by frame over the years
p=ggplot(wdata, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_x_log10() +
  geom_text(aes(label=country), size = 2)+
  scale_color_brewer(palette = 'Set1') +
  theme_bw() +
  # gganimate specific bits:
  labs(title = 'Life Expectancy vs. GDP per capita \n Year: {frame_time}', 
       x = '(log10) GDP per capita', y = 'Life expectancy') +
  transition_time(year) +
  ease_aes('linear')

animate(p, nframes = 50)

# Save as gif (anim_save() writes the most recently rendered animation):
# anim_save("animated_world.gif")

Observations:

In this graph, each bubble represents a country. In a given year, the horizontal position of a bubble (how far to the right) represents the country’s GDP per capita. Its vertical position (height) represents the country’s life expectancy.
Overall, how did GDP per capita and life expectancy change over time? Both GDP per capita and life expectancy have been increasing over time (up and to the right).
How would you describe the relationship between GDP per capita and life expectancy? Countries with low GDP per capita have lower life expectancy, and countries with higher GDP per capita have higher life expectancy.
How, if at all, does the relationship between GDP per capita and life expectancy change over time? The relationship between GDP per capita and life expectancy remains positive over time.

A wrinkle in time

The animated picture is cool, but it also contains too many details for us to focus on the important ones.

For a snapshot of the data, we will keep only the years 2007 (the last year in the data) and 1982:

wdata8007 = wdata[wdata$year==1982 | wdata$year==2007,]

# Same plot on the two-year subset; transition_time(year) animates between the two years
p=ggplot(wdata8007, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_x_log10() +
  scale_color_viridis_d() +
  theme_bw() +
  # gganimate specific bits:
  labs(title = 'Year: {frame_time}', x = '(log) GDP per capita', y = 'life expectancy') +
  transition_time(year) +
  ease_aes('linear')
animate(p, nframes = 50, fps= 5)

From now on, we’ll work only with a static version of this graph:

ggplot(wdata8007, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_x_log10() +
  scale_color_viridis_d() +
  theme_bw()

Exercise: Set the color of dots to distinguish them by year, rather than continent.

Remember to use the factor() command.
Try to change the colors by adding the subcommand scale_colour_brewer(palette = "Set1")

ggplot(wdata8007, aes(gdpPercap, lifeExp, size = pop, color = factor(year))) +
  geom_point() +
  scale_x_log10() +
  scale_color_brewer(palette= "Set1") +
  theme_bw()

Exercise: Now, what do you notice about the tick marks on the x-axis? Their spacing comes from the subcommand scale_x_log10(). Rerun the code above without this command and see what happens:

ggplot(wdata8007, aes(gdpPercap, lifeExp, size = pop, color = continent)) +
  geom_point() +
  scale_color_viridis_d() +
  theme_bw()

#wdata8007 is the data set

Observation:

In levels (dollars), higher income predicts higher life expectancy, but at a decreasing rate. This means that when we compare two very poor countries, a small difference in income is associated with a large difference in life expectancy. In contrast, as countries get richer (further to the right of the graph), a large difference in income is associated with only a small difference in life expectancy.

In proportions, higher income predicts higher life expectancy at a relatively constant rate. This means that when we compare two countries, a given ratio between their incomes (say, one is twice as rich as the other) is associated with roughly the same difference in life expectancy, and this holds fairly well across income levels.

As a result of this observation, we conclude that it is simpler and equally accurate to study the relationship between life expectancy and the log of GDP per capita.
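To see why proportions and logs go together, here is a minimal base-R sketch. The income figures are made up for illustration and are not taken from the gapminder data. It compares a doubling of income at a low level with a doubling at a high level, first in dollars and then on the log10 scale.

```r
# Hypothetical incomes (illustrative only, not from the gapminder data)
poor <- c(1000, 2000)    # a doubling at a low income level
rich <- c(20000, 40000)  # a doubling at a high income level

# In levels (dollars), the two gaps are very different
level_gap_poor <- diff(poor)
level_gap_rich <- diff(rich)

# On the log10 scale, both doublings are the same distance apart
log_gap_poor <- diff(log10(poor))
log_gap_rich <- diff(log10(rich))

print(c(level_gap_poor, level_gap_rich))
print(c(log_gap_poor, log_gap_rich))
```

Equal ratios become equal distances on the log scale, which is why a straight line can describe the relationship after taking logs.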

For the rest of the analysis, we will work with the log base 10 of GDP per capita:

wdata8007$logY = log10(wdata8007$gdpPercap)
#create variable logY by taking the log base 10 of GDP per capita

ggplot(wdata8007, aes(logY, lifeExp, color = factor(year))) +
  geom_point() +
  scale_color_brewer(palette="Dark2") +
  theme_bw()

The log unit looks odd at first, so let’s get used to it:

A one-unit increase in log base 10 of GDP per capita is a 10x increase in GDP per capita.

When logY increases from 2.5 to 3.5, GDP per capita increases by 10x.

When logY increases from 3.5 to 4.5, GDP per capita increases by 10x.
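A quick way to check these statements is to convert the logY values back into dollars. The sketch below uses only base R; the variable names are chosen here for illustration.

```r
# Convert log10 values back to GDP per capita in dollars
logY_values <- c(2.5, 3.5, 4.5)
gdp <- 10^logY_values

# Ratio of each value to the previous one: one step per unit of logY
ratios <- gdp[-1] / gdp[-length(gdp)]

print(round(gdp))
print(ratios)
```

Each one-unit step in logY multiplies GDP per capita by the same factor, no matter where on the scale the step starts.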

We now add to our scatter plot a regression line for the relationship between log GDP per capita and life expectancy. To keep things simple, we start by ignoring the difference between years:

ggplot(wdata8007, aes(x=logY, y=lifeExp)) +
  geom_point() +
  geom_text(aes(label=country), size = 2, color = 'brown')+
  geom_smooth(method = "lm") +
  scale_color_brewer(palette="Dark2") +
  theme_bw()

The regression line is the best estimate of a linear relationship (straight line) between the two variables.

What do we mean by the “best estimate” of a relationship between two variables? A little equation will make this much clearer.

The blue line in the graph can equivalently be written as a regression equation, whose coefficients we estimate with lm():

lm( lifeExp~logY, data= wdata8007)

## 
## Call:
## lm(formula = lifeExp ~ logY, data = wdata8007)
## 
## Coefficients:
## (Intercept)         logY  
##      0.9969      17.2274

As an equation, the regression line reads: lifeExp = 1 + 17.23 * logY

In plain language, we say that, according to the linear regression, a tenfold increase in income per capita (logY increases by 1) predicts a 17.23-year increase in life expectancy.

The linear regression line is the best linear estimate for the relationship between the 2 variables in the sense that:

Mathematically, any linear relationship between logY and lifeExp can be written as lifeExp = A + B * logY, where A and B are numbers.
Among all possible choices of A and B, the coefficients estimated by the linear regression (A = 1 and B = 17.23) minimize the total squared error when we predict lifeExp from the values of logY in the available data.
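Written out, the quantity being minimized is the sum of squared prediction errors. The index i (one observation, that is, one country in one year) and n (the number of observations) are notation introduced here for illustration:

```latex
\min_{A,\,B} \; \sum_{i=1}^{n} \bigl( \text{lifeExp}_i - (A + B \cdot \text{logY}_i) \bigr)^2
```

Each term in the sum is the squared gap between a country's actual lifeExp and the value the line predicts for it.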

Examples of prediction errors

In the graph above, Honduras is right in the middle of the graph, with logY = 3.5. Our linear regression predicts a lifeExp of about 61 years (1 + 17.23 * 3.5 ≈ 61.3). Its actual lifeExp in the data is 62 years, so this is a pretty good prediction.
Costa Rica has logY = 4. Using linear regression, we’d predict that its lifeExp would be 70 years. Its actual lifeExp is 79 years, which is far higher than predicted.
At the opposite end of the spectrum from Costa Rica is South Africa, which has a comparable logY. Using linear regression, we’d predict that South Africa’s lifeExp would also be about 70 years. Its actual lifeExp is 50 years, which is far lower than predicted.
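The arithmetic behind these examples can be sketched in base R using the coefficients printed by lm() earlier. The helper name predict_lifeExp is hypothetical, the logY and actual values are the rounded figures quoted in the examples, and South Africa's logY is assumed to be 4 because the text gives it the same prediction as Costa Rica.

```r
# Coefficients copied from the lm() output above
intercept <- 0.9969
slope <- 17.2274

# Hypothetical helper (not part of any package): predicted lifeExp for a given logY
predict_lifeExp <- function(logY) intercept + slope * logY

# Rounded values quoted in the examples; South Africa's logY = 4 is an assumption
examples <- data.frame(
  country = c("Honduras", "Costa Rica", "South Africa"),
  logY    = c(3.5, 4, 4),
  actual  = c(62, 79, 50)
)

examples$predicted <- predict_lifeExp(examples$logY)
# Prediction error: actual minus predicted
examples$error <- examples$actual - examples$predicted

print(examples)
```

A positive error means the country lives longer than its income predicts, and a negative error means the opposite. Because the inputs are rounded, the results only approximate what you would read off the graph.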

Exercise: Regress lifeExp on logY for the subset of the data wdata8007 for which year = 2007 and report the formula of the linear estimate.

lm( lifeExp~logY, data= wdata8007[wdata8007$year==2007,])

## 
## Call:
## lm(formula = lifeExp ~ logY, data = wdata8007[wdata8007$year == 
##     2007, ])
## 
## Coefficients:
## (Intercept)         logY  
##        4.95        16.59

Exercise: Regress lifeExp on logY for the subset of the data wdata8007 for which year = 1982 and report the formula of the linear estimate.

lm( lifeExp~logY, data= wdata8007[wdata8007$year==1982,])

## 
## Call:
## lm(formula = lifeExp ~ logY, data = wdata8007[wdata8007$year == 
##     1982, ])
## 
## Coefficients:
## (Intercept)         logY  
##     -0.6505      17.2546

Your regression results, when mapped onto the scatter plot, look like this:

ggplot(wdata8007, aes(logY, lifeExp, color=factor(year))) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_brewer(palette="Dark2") +
  theme_bw()

Multiple variable regression

We saw that year helps predict lifeExp reliably even after taking logY into account. Wouldn’t it be nice to include them both in the same regression? Yes, it would. And that’s why regressions are so popular with all sorts of users.

The exact same logic (and process) of regression applies when we try to predict lifeExp based on logY AND year. Unfortunately, with more than one explanatory variable, graphs are no longer much help. We’ll really have to trust our ability to read equations.

lm( lifeExp~year+logY, data= wdata8007)

## 
## Call:
## lm(formula = lifeExp ~ year + logY, data = wdata8007)
## 
## Coefficients:
## (Intercept)         year         logY  
##   -248.7193       0.1258      16.8835

# summary() gives more details that will be useful later
summary(lm( lifeExp~year+logY, data= wdata8007))

## 
## Call:
## lm(formula = lifeExp ~ year + logY, data = wdata8007)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -25.921  -2.877   1.037   4.258  14.468 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -248.71930   61.45872  -4.047 6.71e-05 ***
## year           0.12584    0.03094   4.067 6.20e-05 ***
## logY          16.88354    0.68872  24.514  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.469 on 281 degrees of freedom
## Multiple R-squared:  0.6988, Adjusted R-squared:  0.6966 
## F-statistic: 325.9 on 2 and 281 DF,  p-value: < 2.2e-16

As an equation, the regression above reads: lifeExp = -248.72 + 0.1258 * year + _____ * logY

In common language, we say that, for a given year, a ________ in income per capita predicts a ________ in life expectancy.

We also say that, for a given logY, as each year goes by, we predict a ________ in life expectancy.
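To check your answers, you can plug numbers into the fitted equation. The sketch below uses only base R and the coefficients printed in the summary() output above. The helper name predict_lifeExp2 and the choice of logY = 4 are illustrative assumptions.

```r
# Coefficients copied from the summary() output above
b0     <- -248.7193  # intercept
b_year <- 0.1258     # coefficient on year
b_logY <- 16.8835    # coefficient on logY

# Hypothetical helper: predicted lifeExp for a given year and logY
predict_lifeExp2 <- function(year, logY) b0 + b_year * year + b_logY * logY

# Same income (logY = 4) in the two years kept in wdata8007
pred_1982 <- predict_lifeExp2(1982, 4)
pred_2007 <- predict_lifeExp2(2007, 4)

# Same year (2007), income ten times higher (logY goes from 3 to 4)
gain_10x <- predict_lifeExp2(2007, 4) - predict_lifeExp2(2007, 3)

print(c(pred_1982, pred_2007, pred_2007 - pred_1982))
print(gain_10x)
```

Holding one variable fixed and changing the other isolates what each coefficient contributes to the prediction.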

Summary

We have learned how to:

Regress one variable on another (or others): we regressed life expectancy on (log) GDP per capita.
Explain what the regression formula does: the regression formula is the best estimate of a linear relationship between life expectancy and (log) GDP per capita.
Apply regression results to predict a variable based on others: the regression lets us predict life expectancy from GDP per capita.

The fault in our star(s)

Do you see those *s at the end of those regression estimates? I needed them for my regression results to be useful. What do they mean and why do I need them? That’s for our next discussion.

Some practice with animated graphs

This exercise looks at the development of income per capita in North American countries. Use gganimate to create an animated graph that traces those developments over time.

First, extract the data for North America:

namdata = wdata[wdata$country=="Canada" | 
                  wdata$country=="Mexico" | 
                  wdata$country=="United States",]

Now, draw a line chart for gdpPercap over time of the 3 countries. The following tutorial may help: https://www.datanovia.com/en/blog/gganimate-how-to-create-plots-with-beautiful-animation-in-r/

Finally, add the code below to your graph to animate it over time. Note that we now use the command transition_reveal() rather than transition_time():

  # gganimate specific bits:
  labs(title = 'Income per-capita growth over time', y = 'GDP per capita', x = 'years') +
  transition_reveal(year) +
  ease_aes('linear')

Linear Regression - A Visual Introduction

Averi Munro

4/5/2020
