Introduction

This set of exercises is to develop your skill in applying numerical and graphical tools in examining relationships between two variables. Each exercise presents two variables. You need to classify each variable as to type. Is it quantitative or categorical? You need to consider the possible causal relationships between the two. Then you need to apply both numerical and graphical tools to examine the level of association between the two.

Each exercise will require you to make data available to your R session. You may need to install and load a package. You may need to upload an existing RData workspace. The exercise will give you the required code to get the data and, in many cases, modify it. You do not need to understand the data-acquisition code. You just need to run it.

To modify the data in preparation for your exercises, I have frequently used tools from the “Tidyverse”. You should not feel that you would have to mimic or explain what I’ve done. You do need to understand the result. As you work through these exercises, you should have another tab in your browser open to RStudio cloud with a script window ready to accept code copied from the exercises. You should use R commands such as str(), summary(), and head() to get a feeling for the dataset you’re working with.

It is important that you do the exercise before looking at the solution and listening to the instructor’s explanation in the accompanying video.

The video is at https://www.youtube.com/watch?v=O5saaJspC80&feature=youtu.be

Right click on the link and open it in a new tab.

Exercise 1

This exercise is based on the births dataframe from the openintro package. The two variables are weight (the baby’s birthweight) and weeks (The length of gestation).

The following code will make the package openinto available to your session. If it is not already installed on your machine, the code will install it.

package = "openintro"
if (!require(package, character.only=T, quietly=T)) {
    install.packages(package)
    library(package, character.only=T)
}

Examine the relationship graphically and numerically.

Do this on your own before you move on to the next slide. If you’re following this from the video, you need to pause the video now. Use the two vertical bars in the lower left corner to pause. Resume the video and go on to the next slide only after you’ve done the exercise.

Solution

Both variables are quantitative. There is certainly a causal relationship with weeks as the explanatory variable and weight as the response. The appropriate tools are the scatterplot, the correlation coefficient and the linear model.

Here are the results.

plot(births$weeks,births$weight)

cor(births$weeks,births$weight)

## [1] 0.6836602

bwmodel = lm(births$weight~births$weeks)
summary(bwmodel)

## 
## Call:
## lm(formula = births$weight ~ births$weeks)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2900 -0.7542  0.0851  0.6576  2.6576 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -7.31198    1.26305  -5.789 4.08e-08 ***
## births$weeks  0.37248    0.03268  11.396  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.096 on 148 degrees of freedom
## Multiple R-squared:  0.4674, Adjusted R-squared:  0.4638 
## F-statistic: 129.9 on 1 and 148 DF,  p-value: < 2.2e-16

Key Observations

The scatterplot shows that as weeks increases, weight also increases. There is a clear relationship.
The correlation coefficient, .68, confirms the impression given by the scatterplot. This is also confirmed by the R-Squared for the model, .4674.
The residual standard error, 1.096, gives us the typical error we would be making if we used the model to predict the weight of a baby based on the length of gestation.
The slope of the linear model, .37, tells us that an extra week of gestation typically adds .37 pounds to the baby’s weight.

Exercise 2

Let’s stay with the births data and examine the relationship between the mother’s smoking and the occurrence of a premature birth. The variable names are premature and smoke.

Solution

Both of these variables are cateogrical. The appropriate tools are a table produced by CrossTable from the gmodels package and a mosaicplot from base R.

First we need to make sure we have gmodels installed and loaded.

package = "gmodels"
if (!require(package, character.only=T, quietly=T)) {
    install.packages(package)
    library(package, character.only=T)
}

First, get the cross-tabulation.

CrossTable(births$smoke,births$premature)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  150 
## 
##  
##              | births$premature 
## births$smoke | full term |    premie | Row Total | 
## -------------|-----------|-----------|-----------|
##    nonsmoker |        87 |        13 |       100 | 
##              |     0.012 |     0.071 |           | 
##              |     0.870 |     0.130 |     0.667 | 
##              |     0.674 |     0.619 |           | 
##              |     0.580 |     0.087 |           | 
## -------------|-----------|-----------|-----------|
##       smoker |        42 |         8 |        50 | 
##              |     0.023 |     0.143 |           | 
##              |     0.840 |     0.160 |     0.333 | 
##              |     0.326 |     0.381 |           | 
##              |     0.280 |     0.053 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       129 |        21 |       150 | 
##              |     0.860 |     0.140 |           | 
## -------------|-----------|-----------|-----------|
## 
##

Next, get the mosaicplot.

mosaicplot(table(births$smoke,births$premature))

Key Observations

There are relatively few smokers. Only 33% of the mothers are smokers.
The mothers who smoke are more likely to have a premature birth, 16% versus 13%.
The numbers are not large, so we will want to revisit this problem when we do statistical inference. Possibly the difference we have observed with these 150 births is just random flucttuation.

Exercise 3

In some cases the father’s age is missing. This is probably an indication that the father had a weak relationship with the mother. I’ll create a new variable to capture this, Father has values “Present” and “Absent”. As usual, you don’t have to be able to do this on your own, just run the code. Copy the code to your open script window in RStudio if you want to look at it.

births$Father = ifelse(is.na(births$fAge),"Absent","Present")
table(births$Father)

## 
##  Absent Present 
##      31     119

Does the status of the father have any observed association with the weight of the baby?

Solution

Father is a categorical variable and weight is quantitative. The appropriate tools are tapply() with summary(), and a side-by-side boxplot.

Here is the tapply().

tapply(births$weight,births$Father,summary)

## $Absent
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.690   5.785   7.310   6.745   8.160   9.060 
## 
## $Present
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.630   6.530   7.310   7.124   7.940  10.130

Here is the boxplot().

boxplot(births$weight~births$Father)

Key Observation

There does not seem to be a strong association between the status of the father and the weight of the baby. We can revisit this data when we work on statistical inference.

The Gapminder Data

This is a large, complex dataset. I will extract a suitable subset or two for our exercises.

package = "gapminder"
if (!require(package, character.only=T, quietly=T)) {
    install.packages(package)
    library(package, character.only=T)
}

str(gapminder)

## tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
##  $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

summary(gapminder)

##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
##

Let’s get a subset of the data containing only the countries in the Americas, Europe, and Asia in 2007.

First, I want to make the tidyverse available to facilitate the data manipulation.

package = "tidyverse"
if (!require(package, character.only=T, quietly=T)) {
    install.packages(package)
    library(package, character.only=T)
}

The tools I’m using are from the Tidyverse. You don’t need to master these tools unless you feel inspired to do so. It’s not hard to read and follow them, but you don’t need to.

The goal is to create a smaller dataframe containing only observations from the Americas, Europe, and Asia in 2007.

AmEuAs07 = gapminder %>% 
    filter(continent %in% c( "Americas","Europe","Asia") &
               year == 2007) %>% 
    mutate(country = as.character(country),
           continent = as.character(continent))
str(AmEuAs07)

## tibble [88 × 6] (S3: tbl_df/tbl/data.frame)
##  $ country  : chr [1:88] "Afghanistan" "Albania" "Argentina" "Austria" ...
##  $ continent: chr [1:88] "Asia" "Europe" "Americas" "Europe" ...
##  $ year     : int [1:88] 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 ...
##  $ lifeExp  : num [1:88] 43.8 76.4 75.3 79.8 75.6 ...
##  $ pop      : int [1:88] 31889923 3600523 40301927 8199783 708573 150448339 10392226 9119152 4552198 190010647 ...
##  $ gdpPercap: num [1:88] 975 5937 12779 36126 29796 ...

summary(AmEuAs07)

##    country           continent              year         lifeExp     
##  Length:88          Length:88          Min.   :2007   Min.   :43.83  
##  Class :character   Class :character   1st Qu.:2007   1st Qu.:71.77  
##  Mode  :character   Mode  :character   Median :2007   Median :74.60  
##                                        Mean   :2007   Mean   :73.91  
##                                        3rd Qu.:2007   3rd Qu.:78.65  
##                                        Max.   :2007   Max.   :82.60  
##       pop              gdpPercap    
##  Min.   :3.019e+05   Min.   :  944  
##  1st Qu.:5.395e+06   1st Qu.: 4507  
##  Median :1.106e+07   Median :10745  
##  Mean   :6.019e+07   Mean   :16345  
##  3rd Qu.:4.034e+07   3rd Qu.:28607  
##  Max.   :1.319e+09   Max.   :49357

Exercise 4

Examine the relationship between continent and life expectancy.

Solution

The variable continent is categorical and the variable life expectancy is quantitative. So the appropriate tools are the side-by-side boxplot and the tapply with summary.

boxplot(AmEuAs07$lifeExp~AmEuAs07$continent)

tapply(AmEuAs07$lifeExp,AmEuAs07$continent,summary)

## $Americas
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   60.92   71.75   72.90   73.61   76.38   80.65 
## 
## $Asia
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   43.83   65.48   72.40   70.73   75.64   82.60 
## 
## $Europe
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   71.78   75.03   78.61   77.65   79.81   81.76

Key Observations

Europe has the best life expectancies overall.
Countries in the Americas have moderately good life expectanties.
Asian countries have both the best and the worst.

Exercise 5

Using the data from the previous exercise, what is the relationship between per capita gross domestic product and life expectancy. One might assume that higher income contries would have longer life expectancy.

Solution

Both of these variables are quantitative. The appropriate tools are the scatterplot, the correlation coefficient, and the linear model.

inc_life_model = lm(AmEuAs07$lifeExp~AmEuAs07$gdpPercap)
summary(inc_life_model)

## 
## Call:
## lm(formula = AmEuAs07$lifeExp ~ AmEuAs07$gdpPercap)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -24.964  -1.449   0.820   2.580   7.105 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.847e+01  7.471e-01  91.648  < 2e-16 ***
## AmEuAs07$gdpPercap 3.327e-04  3.517e-05   9.459 5.71e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.476 on 86 degrees of freedom
## Multiple R-squared:  0.5099, Adjusted R-squared:  0.5042 
## F-statistic: 89.48 on 1 and 86 DF,  p-value: 5.709e-15

cor(AmEuAs07$gdpPercap,AmEuAs07$lifeExp)

## [1] 0.714086

plot(AmEuAs07$gdpPercap,AmEuAs07$lifeExp)

Key Observations:

There is a reasonable positive relationship in the expected direction as evidenced by the following.

The scatterplot shows that as per capita gdp increases, life expectancy also generally increases.
The positive correlation coefficient of .77 confirms this.
The linear mode fits reasonably well. The coefficient of gdp per capita in the model says that an increase of $10,000 in per capita GDP tends to increase life expectancy by about 3.3 years.

To satisfy my curiosity and your own about the results, the next slide uses some advanced R methods to allow a closer examination of individual countries. You are not expected to master these methods for this course.

Looking Deeper

First I need the tidyverse and plotly packages. The package ggplot2 from the tidyverse creates the basic graph. The plotly package provides the command ggplotly(), which makes the graph interactive.

package = "tidyverse"
if (!require(package, character.only=T, quietly=T)) {
    install.packages(package)
    library(package, character.only=T)
}

package = "plotly"
if (!require(package, character.only=T, quietly=T)) {
    install.packages(package)
    library(package, character.only=T)
}

plot_inc_life_model = AmEuAs07 %>% 
    ggplot(aes(x = gdpPercap,
               y = lifeExp,
               color = continent,
               group = country,
               size = pop)
           ) +
    geom_point() +
    ggtitle("Relationship Between Income and Life Expectancy - 2010")
    
ggplotly(plot_inc_life_model)

You can hover over a point in the graph to identify the country and the values of our variables.

Exercise 6

This exercise is based on the burger dataframe from the openintro package. The two variables are best_burger_place and gender. The data is from a survey of 500 individuals done in 2010. See the openintro package for a reference. There are two variables.

best_burger_place is the person’s favorite burger place.
gender is self explanatory.

What is the relationship between these two variables.

Solution

Both variables are categorical. The appropriate tools are CrossTable from the gmodels package and mosaicplot from base R.

First make sure we have the gmodels package.

package = "gmodels"
if (!require(package, character.only=T, quietly=T)) {
    install.packages(package)
    library(package, character.only=T)
}

Now run the analysis commands.

CrossTable(burger$gender,burger$best_burger_place)

## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  500 
## 
##  
##               | burger$best_burger_place 
## burger$gender |         Fat Burger |  Five Guys Burgers |    In-N-Out Burger |           Not Sure |              Other | Tommy's Hamburgers |       Umami Burger |          Row Total | 
## --------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
##        Female |                 12 |                  6 |                181 |                  5 |                 20 |                 27 |                  1 |                252 | 
##               |              0.075 |              0.038 |              0.382 |              1.828 |              0.437 |              0.002 |              1.355 |                    | 
##               |              0.048 |              0.024 |              0.718 |              0.020 |              0.079 |              0.107 |              0.004 |              0.504 | 
##               |              0.545 |              0.545 |              0.528 |              0.278 |              0.435 |              0.500 |              0.167 |                    | 
##               |              0.024 |              0.012 |              0.362 |              0.010 |              0.040 |              0.054 |              0.002 |                    | 
## --------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
##          Male |                 10 |                  5 |                162 |                 13 |                 26 |                 27 |                  5 |                248 | 
##               |              0.076 |              0.038 |              0.388 |              1.857 |              0.444 |              0.002 |              1.377 |                    | 
##               |              0.040 |              0.020 |              0.653 |              0.052 |              0.105 |              0.109 |              0.020 |              0.496 | 
##               |              0.455 |              0.455 |              0.472 |              0.722 |              0.565 |              0.500 |              0.833 |                    | 
##               |              0.020 |              0.010 |              0.324 |              0.026 |              0.052 |              0.054 |              0.010 |                    | 
## --------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
##  Column Total |                 22 |                 11 |                343 |                 18 |                 46 |                 54 |                  6 |                500 | 
##               |              0.044 |              0.022 |              0.686 |              0.036 |              0.092 |              0.108 |              0.012 |                    | 
## --------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|--------------------|
## 
##

mosaicplot(table(burger$best_burger_place,burger$gender),
           main = "Gender by Preferred Burger")

Key Observations:

In-N-Out Burger is the preferred burger place for both genders with 68.6& of the 500 people naming it.

In-N-Out Burger is favored by 71.8% of females and 65.3% of females.

Exercises on Relationships

Introduction

Exercise 1

Solution

Key Observations

Exercise 2

Solution

Exercise 3

Solution

Key Observation

The Gapminder Data

Exercise 4

Solution

Exercise 5

Solution

Looking Deeper

Exercise 6

Solution