DataM: In-class Exercise 0420: Grammar of graphics

library(dplyr)
library(ggplot2)

In-class exercise 1.

Find out what each code chunk (indicated by '##') in the following R script does and provide comments.

Chunk 1

Set up the plot frame for the data set women without drawing any data points (type='n').
Add the first data point of women to the frame.

plot(women, type='n')
points(women[1,])

Chunk 2

Use xyplot{lattice} to draw a scatter plot of weight against height, two variables in women, but show only the first data point: subset=row.names(women)==1 keeps the first row and type='p' draws it as a point.

lattice::xyplot(weight ~ height, 
  data=women,
  subset=row.names(women)==1, type='p')

Chunk 3

Load the package ggplot2.
Use ggplot{ggplot2} to draw a scatter plot of weight against height, two variables in women, using only the first data point (women[1,]).

library(ggplot2)
ggplot(data=women[1,], aes(height, weight)) +
  geom_point()
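Because data=women[1,] passes a single row to ggplot, the axes are scaled to that one point only. A minimal sketch, assuming the goal is a frame that spans the whole data set as in Chunk 1, is to map the full data and restrict only the point layer (p_first is an illustrative name, not part of the original script):

```r
library(ggplot2)

# Scales are trained on the full data through an invisible geom_blank() layer,
# while geom_point() receives only the first row.
p_first <- ggplot(data = women, aes(x = height, y = weight)) +
  geom_blank() +
  geom_point(data = women[1, ])
p_first
```

The layer-level data argument overrides the plot-level data for that layer only, so other layers still see all 15 rows.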

In-class exercise 2.

The data set is concerned with grade 8 pupils (age about 11 years) in elementary schools in the Netherlands. After deleting pupils with missing values, the number of pupils is 2,287 and the number of schools is 131. Class size ranges from 4 to 35. The response variables are the scores on a language test and on an arithmetic test. The research interest is in how the two test scores depend on the pupil’s intelligence (verbal IQ) and on the number of pupils in a school class.

The class size is categorized into small, medium, and large with roughly equal numbers of observations in each category. The verbal IQ is categorized into low, middle, and high with roughly equal numbers of observations in each category. Reproduce the plot below.

Source: Snijders, T. & Bosker, R. (2002). Multilevel Analysis.

Column 1: School ID
Column 2: Pupil ID
Column 3: Verbal IQ score
Column 4: The number of pupils in a class
Column 5: Language test score
Column 6: Arithmetic test score

Target output

[Solution and Answer]

Load in the data set and check its structure

dta2 <- read.table('../data/data_inclass0420_2.txt', header = TRUE)
head(dta2)

str(dta2)

'data.frame':   2287 obs. of  6 variables:
 $ school: int  1 1 1 1 1 1 1 1 1 1 ...
 $ pupil : int  17001 17002 17003 17004 17005 17006 17007 17008 17009 17010 ...
 $ IQV   : num  15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
 $ size  : int  29 29 29 29 29 29 29 29 29 29 ...
 $ lang  : int  46 45 33 46 20 30 30 57 36 36 ...
 $ arith : int  24 19 24 26 9 13 13 30 23 22 ...

Create label variable for plotting with facet setting

Group the data by the class size (group_siZe) quantile and verbal IQ score (group_IQV) quantile.
Combine two variables group_siZe and group_IQV into group.

dta2 <- dta2 %>% 
  mutate(group_siZe = cut(size, include.lowest = TRUE,
                     breaks = quantile(size, c(0, 1/3, 2/3, 1)),
                     labels = c('Small', 'Medium', 'Large')),
         group_IQV = cut(IQV, include.lowest = TRUE,
                     breaks = quantile(IQV, c(0, 1/3, 2/3, 1)),
                     labels = c('Low', 'Middle', 'High'))) %>%
  mutate(group = paste(as.character(group_siZe), 
                       as.character(group_IQV), sep=', ') %>% 
           factor(.,  levels = c('Small, Low', 'Small, Middle', 'Small, High',
                                 'Medium, Low', 'Medium, Middle', 'Medium, High',
                                 'Large, Low', 'Large, Middle', 'Large, High')))
                      # adjust levels order
table(dta2$group) # check


    Small, Low  Small, Middle    Small, High    Medium, Low Medium, Middle 
           320            262            241            253            257 
  Medium, High     Large, Low  Large, Middle    Large, High 
           253            252            239            210
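The grouping step relies on cut() with quantile-based break points. A minimal sketch on a hypothetical toy vector (x, brks and g are illustrative names, not the course data) shows why include.lowest = TRUE is needed:

```r
# Hypothetical toy vector, not the course data
x <- c(4, 7, 9, 12, 15, 18, 21, 25, 30, 35)

# Tertile cut points: minimum, 1/3 quantile, 2/3 quantile, maximum
brks <- quantile(x, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE keeps the minimum in the first bin
g <- cut(x, breaks = brks, include.lowest = TRUE,
         labels = c('Small', 'Medium', 'Large'))
table(g)

# Without include.lowest the first interval is open on the left,
# so the minimum of x falls outside every bin and becomes NA
g_open <- cut(x, breaks = brks, labels = c('Small', 'Medium', 'Large'))
sum(is.na(g_open))
```

With real data that contain many ties, the three groups are only roughly equal in size, as the table above shows.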

Plot with setting facet

Plot the main frame and assign variables to the axes.
Name the axes.
Add data points to the plots.
Add a fitted line, using a linear model (method = 'lm') with the given formula.
Show the plots in panels (facet setting).

ggplot(data = dta2, 
       mapping = aes(x = lang, y = arith)) +              #1
  labs(x = 'Language score', y = 'Arithmetic score') +    #2
  geom_point(shape = 23, fill = 'black') +                #3
  geom_smooth(formula = y ~ x, method = 'lm', lwd = .5) + #4
  facet_wrap(. ~ group)                                   #5

In-class exercise 3.

Use the USPersonalExpenditure{datasets} for this problem. This data set consists of United States personal expenditures (in billions of dollars) in the categories food and tobacco, household operation, medical and health, personal care, and private education for the years 1940, 1945, 1950, 1955 and 1960. Plot the US personal expenditure data in the style of the third plot of the “Time Use” case study on the course web page. You might want to transform the dollar amounts to log base 10 units first.

[Solution and Answer]

Load in the data set and check its structure

data(USPersonalExpenditure)
head(USPersonalExpenditure)

                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64

class(USPersonalExpenditure)

[1] "matrix"

df_Expen <- USPersonalExpenditure %>% as.data.frame()
str(df_Expen)

'data.frame':   5 obs. of  5 variables:
 $ 1940: num  22.2 10.5 3.53 1.04 0.341
 $ 1945: num  44.5 15.5 5.76 1.98 0.974
 $ 1950: num  59.6 29 9.71 2.45 1.8
 $ 1955: num  73.2 36.5 14 3.4 2.6
 $ 1960: num  86.8 46.2 21.1 5.4 3.64

Transform the data set

Stack the data set
Rename the columns
Put category back in the data set. Transform all expenditure-related variables to log base 10. Create a new variable expen_log_M, the mean of $\log_{10}(\text{expenditure})$ within each year, and a new variable expen_log_diff, the difference between $\log_{10}(\text{expenditure})$ and that yearly mean.

dta3 <- df_Expen %>% stack()                          #1
colnames(dta3) <- c('expenditure', 'year')            #2
dta3 <- dta3 %>%                                      #3
  mutate(Category = rep(df_Expen %>% rownames(), times = 5),
         # stack() lists all five categories of one year before the next year,
         # so each yearly mean has to be repeated five times in a row
         expen_log_M = rep(df_Expen %>% log10() %>% colMeans(), each = 5),
         expen_log_diff = log10(expenditure) - expen_log_M)
summary(dta3)

  expenditure       year     Category          expen_log_M    
 Min.   : 0.341   1940:5   Length:25          Min.   :0.4930  
 1st Qu.: 2.600   1945:5   Class :character   1st Qu.:0.7769  
 Median : 9.710   1950:5   Mode  :character   Median :0.9739  
 Mean   :20.069   1955:5                      Mean   :0.9184  
 3rd Qu.:29.000   1960:5                      3rd Qu.:1.1039  
 Max.   :86.800                               Max.   :1.2442  
 expen_log_diff    
 Min.   :-0.96027  
 1st Qu.:-0.57240  
 Median : 0.04225  
 Mean   : 0.00000  
 3rd Qu.: 0.48854  
 Max.   : 0.87149
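As a cross-check on the year-wise centring described above, the same quantity can be computed directly on the matrix. This is a minimal base-R sketch; log_expen, centered and dta3_check are illustrative names, not part of the original solution:

```r
# Log10-transform the built-in matrix (rows = categories, columns = years)
log_expen <- log10(USPersonalExpenditure)

# Subtract each year's (column's) mean from every entry in that column
centered <- sweep(log_expen, MARGIN = 2, STATS = colMeans(log_expen))

# Sanity check: within every year the centred values average to about zero
round(colMeans(centered), 10)

# Long format: one row per category-year pair
dta3_check <- as.data.frame(as.table(centered),
                            responseName = 'expen_log_diff')
names(dta3_check)[1:2] <- c('Category', 'year')
head(dta3_check)
```

If the yearly means are aligned correctly, the mean of expen_log_diff is zero within each year, not just over all 25 rows.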

Plot

Use qplot{ggplot2} to draw a dot plot with expen_log_diff as x and Category as y. Facet by year along the x-axis so that the change in expenditure of each category across years is easy to see.
Add segments to turn the dot plots into lollipop plots.
Add a vertical line as a central baseline for each plot.
Set the x-axis range to (-2, 2).
Name the x-axis.

qplot(x = expen_log_diff, y = Category, data = dta3, facets = . ~ year) + #1
  geom_segment(aes(xend = 0, yend = Category)) +     #2
  geom_vline(xintercept = 0, color = 'grey55') +     #3
  scale_x_continuous(limits = c(-2, 2)) +            #4
  xlab('Expenditure [log10(billion)]')               #5
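qplot() has been deprecated in recent ggplot2 releases, so the same figure can also be written with ggplot(). The sketch below is one possible translation, not the original solution; it rebuilds a long data frame from the built-in matrix so that it runs on its own, and plot_df and p_lollipop are illustrative names:

```r
library(ggplot2)

# Rebuild a long data frame so this block runs on its own
log_expen <- log10(USPersonalExpenditure)
plot_df <- as.data.frame(
  as.table(sweep(log_expen, 2, colMeans(log_expen))),
  responseName = 'expen_log_diff')
names(plot_df)[1:2] <- c('Category', 'year')

p_lollipop <- ggplot(plot_df, aes(x = expen_log_diff, y = Category)) +
  geom_point() +                                    # dots
  geom_segment(aes(xend = 0, yend = Category)) +    # sticks back to zero
  geom_vline(xintercept = 0, color = 'grey55') +    # central baseline
  facet_grid(. ~ year) +                            # one panel per year
  scale_x_continuous(limits = c(-2, 2)) +
  xlab('Expenditure [log10(billion)]')
p_lollipop
```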

In-class exercise 4.

A sample of 158 children with autism spectrum disorder was recruited. Social development was assessed using the Vineland Adaptive Behavior Interview survey form, a parent-reported measure of socialization. It is a combined score that included assessment of interpersonal relationships, play/leisure time activities, and coping skills. Initial language development was assessed using the Sequenced Inventory of Communication Development (SICD) scale. These assessments were repeated on these children when they were 3, 5, 9, and 13 years of age.

Source: West, B.T., Welch, K.B., & Galecki, A.T. (2002). Linear Mixed Models: Practical Guide Using Statistical Software. p. 220-271.

Data: autism{WWGbook}

Column 1: Age (in years)
Column 2: Vineland Socialization Age Equivalent score
Column 3: Sequenced Inventory of Communication Development Expressive Group (1 = Low, 2 = Medium, 3 = High)
Column 4: Child ID

Target output 1

Target output 2

Replicate the two plots above using ggplot2.

[Solution and Answer]

Load the data set

data(autism, package = 'WWGbook') 
head(autism)

str(autism)

'data.frame':   612 obs. of  4 variables:
 $ age    : int  2 3 5 9 13 2 3 5 9 13 ...
 $ vsae   : int  6 7 18 25 27 17 18 12 18 24 ...
 $ sicdegp: int  3 3 3 3 3 3 3 3 3 3 ...
 $ childid: int  1 1 1 1 1 3 3 3 3 3 ...

summary(autism)

      age              vsae           sicdegp         childid      
 Min.   : 2.000   Min.   :  1.00   Min.   :1.000   Min.   :  1.00  
 1st Qu.: 2.000   1st Qu.: 10.00   1st Qu.:1.000   1st Qu.: 48.75  
 Median : 4.000   Median : 14.00   Median :2.000   Median :107.50  
 Mean   : 5.771   Mean   : 26.41   Mean   :1.956   Mean   :105.38  
 3rd Qu.: 9.000   3rd Qu.: 27.00   3rd Qu.:3.000   3rd Qu.:158.00  
 Max.   :13.000   Max.   :198.00   Max.   :3.000   Max.   :212.00  
                  NA's   :2

Create some variables for plot 1

Group: group data into three groups by sicdegp. This variable contains group labels. (for plot 1)
age_diff: The difference of age and its mean. (for plot 1)
age_2: age minus 2.

dta4 <- autism %>% mutate(Group = cut(sicdegp, breaks = 0:3, 
                                      labels = c('L', 'M', 'H')),
                          age_diff = age - mean(age),
                          age_2 = age - 2)
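Since sicdegp holds the integer codes 1, 2 and 3, cut() with breaks 0:3 simply maps each code to one label. A minimal sketch on hypothetical codes (the vector below stands in for autism$sicdegp) compares it with the more direct factor() call:

```r
# Hypothetical codes standing in for autism$sicdegp
sicdegp <- c(3, 1, 2, 2, 1, 3)

# cut(): the codes 1, 2, 3 fall in (0,1], (1,2], (2,3]
g_cut <- cut(sicdegp, breaks = 0:3, labels = c('L', 'M', 'H'))

# factor(): map each code to its label directly
g_fac <- factor(sicdegp, levels = 1:3, labels = c('L', 'M', 'H'))

# Both approaches assign the same label to every observation
all(as.character(g_cut) == as.character(g_fac))
```

Either form keeps the level order L, M, H, which determines the panel order in facet_grid().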

Plot 1

Draw a plot of age_diff as x and vsae as y for each group.
Specify the range and break points of the x-axis.
Set the transparency of the points with ‘alpha’.
Assign the method (linear model here) to fit the given formula.
Add polylines for each subject.
Specify the theme and adjust the layout: set the text size of the axis text and the axis titles.
Name the axes.

ggplot(data = dta4,  mapping = aes(x = age_diff, y = vsae)) + #1
  facet_grid(. ~ Group) +                                     #1
  scale_x_continuous(limits = c(-4, 7.5),                     #2
                     breaks = seq(-2.5, 5, by=2.5)) +
  geom_point(alpha = 0.6) +                                  #3
  geom_smooth(method = 'lm', formula = y ~ x) +              #4
  geom_line(aes(group = childid), alpha = 0.3) +             #5              
  theme_bw() +                                               #6
  theme(axis.text = element_text(size = 12),                 #6
        axis.title = element_text(size = 14, face = 'bold')) +
  labs(x = 'Age (in years, centered)', y = 'VSAE score')     #7

Plot 2

The ratio of height and width of the original plot is 12.5 cm : 14.4 cm.

Group data by Group and age_2. Compute the mean and standard error for each group. Pass the summarized data set to ggplot.
Assign age_2 as x and vsae_MEAN as y. Use a different point shape for the vsae mean of each group. Offset the points of each group horizontally with position_dodge(). Show points in the legend.
Add polylines for each group with the same dodge. Set different line types for different groups. Show lines in the legend.
Add error bars for each group with the same dodge, setting their line size and width.
Name the axes.
Specify the black-and-white theme.
Detailed adjustment for the layout:

Change the grid line layout.
Specify the text size of the axes and that of the legend title.
Specify the location and the text size of the legend.

# 1
dta4 %>% group_by(Group, age_2) %>%
  summarize(vsae_MEAN = mean(vsae, na.rm = TRUE), 
            # divide by the number of non-missing scores, not by n()
            vsae_SE = sd(vsae, na.rm = TRUE) / sqrt(sum(!is.na(vsae)))) %>% ggplot() +

# 2  
  aes(x = age_2, y = vsae_MEAN, group = Group, shape = Group) + #
  geom_point(position = position_dodge(width = .3),
             size=rel(2), show.legend = TRUE) +
  scale_shape_manual(values = c(1, 2, 16)) + 
    # make shapes same with the original plot

# 3
  geom_line(position = position_dodge(width = .3), 
            aes(linetype = Group),
            show.legend = TRUE) +
# 4
  geom_errorbar(aes(ymax = vsae_MEAN + vsae_SE,
                    ymin = vsae_MEAN - vsae_SE),
                size=.3, width=.2, position = position_dodge(width = .3)) +

# 5
  xlab('Age (in year - 2)') + ylab('VSAE score') +

# 6
  theme_bw() +

# 7
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_line(size=0.75),
        axis.text = element_text(size = 12),
        legend.position = c(.1, .85),
        legend.key = element_rect(color = "black"),
        legend.key.size = unit(.69, 'cm'),
        legend.title = element_text(size = 14),
        legend.box.background = element_rect(color = 'black'))
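The summary of autism above reports two missing vsae values, so the row count from n() and the number of observed scores can differ within a group. A minimal sketch on hypothetical toy data (toy and toy_sum are illustrative names) shows the difference when computing a standard error:

```r
library(dplyr)

# Hypothetical toy data with one missing score
toy <- data.frame(Group = c('L', 'L', 'L', 'L', 'H', 'H', 'H'),
                  vsae  = c(10, 14, NA, 18, 20, 25, 30))

toy_sum <- toy %>%
  group_by(Group) %>%
  summarize(vsae_MEAN = mean(vsae, na.rm = TRUE),
            n_rows    = n(),                 # counts every row, NA included
            n_obs     = sum(!is.na(vsae)),   # counts observed scores only
            vsae_SE   = sd(vsae, na.rm = TRUE) / sqrt(n_obs))
toy_sum
```

Dividing by sqrt(n_rows) instead would make the standard error of group L slightly too small, because the missing score would be counted as an observation.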

In-class exercise 5.

Use the diabetes dataset to generate a plot similar to the one below and interpret the plot.

Target output 1

Load the data set and check its structure.

Diabetes <- read.table('../data/diabetes.txt', sep = ',', header = TRUE)
head(Diabetes)

str(Diabetes)

'data.frame':   8706 obs. of  9 variables:
 $ SEQN    : int  51624 51626 51627 51628 51629 51630 51632 51633 51634 51635 ...
 $ RIAGENDR: int  1 1 1 2 1 2 1 1 1 1 ...
 $ RIDRETH1: int  3 4 4 4 1 3 2 3 1 3 ...
 $ DIQ010  : int  2 2 2 1 2 2 2 2 2 1 ...
 $ BMXBMI  : num  32.2 22 18.2 42.4 32.6 ...
 $ gender  : Factor w/ 2 levels "Females","Males": 2 2 2 1 2 1 2 2 2 2 ...
 $ race    : Factor w/ 3 levels "Black","Hispanic",..: 3 1 1 1 2 3 2 3 2 3 ...
 $ diabetes: Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 2 ...
 $ BMI     : Factor w/ 2 levels "Normal weight",..: 2 1 1 2 2 2 1 2 1 2 ...

Select useful variables from the data set for plotting using tools in `ggalluvial`

The ratio of height and width of the original plot is 9 cm : 13 cm.

Select the needed variables from the data set Diabetes and count the individuals in each combination of levels of these variables.
Turn the variables into factors and rearrange their level order to match the order in the original plot.
Use ggplot to plot. Set the frequency as y and race, gender and diabetes as the three axes.
Draw the alluvia (flows) across the axes with geom_alluvium, filled by BMI. Specify the fill colors.
Set the x-axis scale to a discrete one and assign the x labels. Use the expand argument to control the padding at both ends of the axis. Add the strata and label them.
Rename the y-axis. Give the plot a title and a subtitle.
Specify the theme (theme_minimal removes the background color and keeps the grid lines). Detailed adjustment for layout: put the legend at the bottom of the plot.

library(ggalluvial) # Load in the package

# 1
Diabetes %>% dplyr::select(race, gender, diabetes, BMI) %>%
  xtabs(data = ., ~ race + gender + diabetes + BMI) %>% data.frame() %>%
  
# 2
  mutate(race = factor(race, levels = c('Hispanic', 'White', 'Black')),
         gender = factor(gender, levels = c('Males', 'Females')),
         diabetes = factor(diabetes, levels = c('Yes', 'No'))) %>%

# 3
  ggplot(., aes(y = Freq, axis1 = race, axis2 = gender, axis3 = diabetes)) +

# 4
  geom_alluvium(aes(fill = BMI)) +
  scale_fill_manual(values=c('gray40','tan1')) + 

# 5
  scale_x_discrete(limits=c('race', 'gender', 'diabetes'), expand=c(.1, .05)) + 
  geom_stratum() + geom_text(stat = 'stratum', infer.label=TRUE) + 

# 6  
  ylab('No. individuals') +
  ggtitle('Diabetes in overall population in US 2009-2010',
          subtitle = 'stratified by race, gender and diabetes mellitus') +

# 7
  theme_minimal() + 
  theme(legend.position = 'bottom')
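The first two steps above (counting level combinations, then reordering factor levels) can be tried on a small example before running the full pipeline. This is a minimal sketch on hypothetical toy records, not the diabetes data; toy and counts are illustrative names:

```r
# Hypothetical toy records, not the diabetes data
toy <- data.frame(gender   = c('Males', 'Males', 'Females', 'Females', 'Males'),
                  diabetes = c('No', 'No', 'Yes', 'No', 'Yes'))

# Cross-tabulate, then flatten: one row per level combination plus Freq
counts <- data.frame(xtabs(~ gender + diabetes, data = toy))
counts

# Reorder the levels in place; assigning to a new column name
# would leave the plotted variable in its default alphabetical order
counts$diabetes <- factor(counts$diabetes, levels = c('Yes', 'No'))
levels(counts$diabetes)
```

The Freq column is what the alluvial plot uses as y, and the level order of each factor decides how the strata are stacked on its axis.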

In-class exercise 6.

Find out what each code chunk (indicated by '##') in the following R script does and provide comments.

Chunk 1

Load the package ggplot2 and open its help page.

library(ggplot2)
?ggplot2

Chunk 2

Install the package gapminder and load it in.

install.packages("gapminder")
library(gapminder)

Chunk 3

Load in the data set gapminder{gapminder} and check its structure.

data(gapminder)
str(gapminder)

tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
 $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ year     : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
 $ lifeExp  : num [1:1704] 28.8 30.3 32 34 36.1 ...
 $ pop      : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
 $ gdpPercap: num [1:1704] 779 821 853 836 740 ...

Chunk 4

Duplicate gapminder and name the new data frame gap.

gap <- gapminder

Chunk 5

Use ggplot{ggplot2} to plot the main frame of gap$lifeExp without putting any data points or lines or any shapes.

ggplot(data = gap, aes(x = lifeExp))

Chunk 6

Use ggplot{ggplot2} to plot the histogram of gap$lifeExp.

ggplot(data = gap, aes(x = lifeExp)) + 
    geom_histogram()

Chunk 7

Use ggplot{ggplot2} to plot the histogram of gap$lifeExp with 10 bins, blue fill and black outlines. Give the histogram a title and name the axes. Specify the theme of the plot.

ggplot(data = gap, aes(x = lifeExp)) + 
  geom_histogram(fill = "blue", color = "black", bins = 10) + 
  ggtitle("Life expectancy for the gap dataset") + 
  xlab("Life expectancy (years)") + 
  ylab("Frequency") + 
  theme_classic()

Chunk 8

Use ggplot{ggplot2} to draw box plots of lifeExp by continent, filled with a different color for each continent. Give the plot a title and name the axes. Specify the theme of the plot.

ggplot(data = gap, aes(x = continent, y = lifeExp, fill = continent)) + 
  geom_boxplot() + 
  ggtitle("Boxplots for lifeExp by continent") + 
  xlab("Continent") + 
  ylab("Life expectancy (years)") +
  theme_minimal() # +       # line A

  # guides(fill = FALSE)    # line B

What happens if you uncomment `guides(fill = FALSE)` and the plus sign in lines A and B above?

ggplot(data = gap, aes(x = continent, y = lifeExp, fill = continent)) + 
  geom_boxplot() + 
  ggtitle("Boxplots for lifeExp by continent") + 
  xlab("Continent") + 
  ylab("Life expectancy (years)") +
  theme_minimal() +       # line A
  guides(fill = FALSE)    # line B

Finding

The legend that maps fill colors to continents is no longer displayed.
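A minimal sketch, assuming a recent ggplot2 release in which guides(fill = FALSE) triggers a deprecation warning and the string 'none' is the suggested replacement. The built-in iris data stands in for gap so that the block runs without gapminder; p_box is an illustrative name:

```r
library(ggplot2)

# Built-in iris data stands in for gap
p_box <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  theme_minimal() +
  guides(fill = 'none')   # hides the fill legend
p_box
```

Dropping the legend loses no information here, because the x-axis labels already name each group.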

Chunk 9

Use ggplot{ggplot2} to draw a scatter plot of lifeExp (x) and gdpPercap (y), two variables in gap. Give the data points a different color and a different shape for each continent.
Set the size and the transparency of the data points.
Specify the theme of the plot.
Give the plot a title.
Name the axes.
Detailed setting of the theme.

ggplot(data = gap, aes(x = lifeExp, y = gdpPercap, color = continent, shape = continent)) + #1
    geom_point(size = 5, alpha = 0.5) + #2
    theme_classic() +                   #3
    ggtitle("Scatterplot of life expectancy by gdpPercap") + #4
    xlab("Life expectancy (years)") +   #5
    ylab("gdpPercap (USD)") +           #5
    theme(legend.position = "top",      #6-1
          plot.title = element_text(hjust = 0.5, size = 20), #6-2
          legend.title = element_text(size = 10),            #6-3
          legend.text = element_text(size = 5),              #6-3
          axis.text.x = element_text(angle = 45, hjust = 1)) #6-4

In the ggplot code above, what are the arguments inside of our second “theme” call doing?

6-1: Put the legend at the top of the plot.
6-2: Set the size and horizontal justification of the plot title (hjust = 0.5 centers it).
6-3: Set the text size of the legend title and of the legend labels.
6-4: Set the angle and horizontal justification of the x-axis tick labels.
