DataM: In-class Exercise 0420: Grammar of graphics
In-class exercise 1.
Find out what each code chunk (indicated by '##') in the following R script does and provide comments.
Chunk 1
- Plot the main frame of the dataset women without data points.
- Add the first data point of women to the frame.
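The original script is not reproduced here, so the following base-R sketch is only an assumed reconstruction of what such a chunk might look like. type = 'n' sets up the axes from the full data without drawing anything, and points() then adds only the first row.

```r
# Assumed reconstruction of Chunk 1 (the original script is not shown)
first_obs <- women[1, ]                      # first row of the built-in women data
plot(weight ~ height, data = women,
     type = 'n')                             # draw the frame only, no points
points(first_obs$height, first_obs$weight)   # add the first observation
```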
Chunk 2
Use xyplot{lattice} to draw the scatter plot frame of height and weight, two variables in women, but add only the first data point of women to the frame (type='p', subset=row.names(women)==1).
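A possible form of this chunk, assuming the formula weight ~ height (the original call is not shown). The subset argument keeps only the rows for which the condition is TRUE, here row 1.

```r
library(lattice)   # ships with standard R installations

# Assumed form of Chunk 2: subset keeps only the first row of women
p_lattice <- xyplot(weight ~ height, data = women,
                    subset = row.names(women) == 1, type = 'p')
print(p_lattice)   # lattice plots are drawn when printed
```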
Chunk 3
- Load the package ggplot2.
- Use ggplot{ggplot2} to draw the scatter plot frame of height and weight, two variables in women, but add only the first data point of women to the frame.
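A ggplot2 counterpart might look like the sketch below. Passing only the first row of women as the data is an assumption about how the original chunk restricted the plot to one point.

```r
library(ggplot2)

# Assumed form of Chunk 3: use only the first row of women as the data
p_gg <- ggplot(data = women[1, ], mapping = aes(x = height, y = weight)) +
  geom_point()
print(p_gg)
```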
In-class exercise 2.
The data set is concerned with grade 8 pupils (age about 11 years) in elementary schools in the Netherlands. After deleting pupils with missing values, the number of pupils is 2,287 and the number of schools is 131. Class size ranges from 4 to 35. The response variables are the scores on a language test and on an arithmetic test. The research interest is in how the two test scores depend on the pupil’s intelligence (verbal IQ) and on the number of pupils in a school class.
The class size is categorized into small, medium, and large with roughly equal number of observations in each category. The verbal IQ is categorized into low, middle and high with roughly equal number of observations in each category. Reproduce the plot below.
Source: Snijders, T. & Bosker, R. (2002). Multilevel Analysis.
- Column 1: School ID
- Column 2: Pupil ID
- Column 3: Verbal IQ score
- Column 4: The number of pupils in a class
- Column 5: Language test score
- Column 6: Arithmetic test score
Target output
[Solution and Answer]
Load in the data set and check its structure
'data.frame': 2287 obs. of 6 variables:
$ school: int 1 1 1 1 1 1 1 1 1 1 ...
$ pupil : int 17001 17002 17003 17004 17005 17006 17007 17008 17009 17010 ...
$ IQV : num 15 14.5 9.5 11 8 9.5 9.5 13 9.5 11 ...
$ size : int 29 29 29 29 29 29 29 29 29 29 ...
$ lang : int 46 45 33 46 20 30 30 57 36 36 ...
$ arith : int 24 19 24 26 9 13 13 30 23 22 ...
Create label variable for plotting with facet setting
- Group the data by class size quantile (group_siZe) and verbal IQ score quantile (group_IQV).
- Combine the two variables group_siZe and group_IQV into group.
dta2 <- dta2 %>%
  mutate(group_siZe = cut(size, include.lowest = TRUE,
                          breaks = quantile(size, c(0, 1/3, 2/3, 1)),
                          labels = c('Small', 'Medium', 'Large')),
         group_IQV = cut(IQV, include.lowest = TRUE,
                         breaks = quantile(IQV, c(0, 1/3, 2/3, 1)),
                         labels = c('Low', 'Middle', 'High'))) %>%
  mutate(group = paste(as.character(group_siZe),
                       as.character(group_IQV), sep=', ') %>%
           # adjust levels order
           factor(., levels = c('Small, Low', 'Small, Middle', 'Small, High',
                                'Medium, Low', 'Medium, Middle', 'Medium, High',
                                'Large, Low', 'Large, Middle', 'Large, High')))
table(dta2$group) # check
    Small, Low  Small, Middle    Small, High    Medium, Low Medium, Middle
           320            262            241            253            257
  Medium, High     Large, Low  Large, Middle    Large, High
           253            252            239            210
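How cut() and quantile() work together can be seen on a small example. The class sizes below are made up for illustration; the point is that the tercile breaks split the values into three equally sized groups, and that include.lowest = TRUE is needed to keep the minimum.

```r
# Hypothetical class sizes, for illustration only
size <- c(4, 10, 15, 20, 22, 25, 28, 30, 35)
breaks <- quantile(size, c(0, 1/3, 2/3, 1))   # minimum, two tercile cut points, maximum

grp <- cut(size, breaks = breaks, include.lowest = TRUE,
           labels = c('Small', 'Medium', 'Large'))
table(grp)

# Without include.lowest = TRUE the intervals are open on the left,
# so the smallest value falls outside the first interval and becomes NA
grp_open <- cut(size, breaks = breaks, labels = c('Small', 'Medium', 'Large'))
sum(is.na(grp_open))
```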
Plot with setting facet
- Plot the main frame and map the variables to the axes.
- Label the axes.
- Add the data points to the plot.
- Fit a linear model (method = 'lm') with the given formula and draw the fitted line.
- Show the plots in panels, one per group (facet setting).
ggplot(data = dta2,
       mapping = aes(x = lang, y = arith)) +  #1
  labs(x = 'Language score', y = 'Arithmetic score') +  #2
  geom_point(shape = 23, fill = 'black') +  #3
  geom_smooth(formula = y ~ x, method = 'lm', lwd = .5) +  #4
  facet_wrap(. ~ group)  #5
In-class exercise 3.
Use USPersonalExpenditure{datasets} for this problem. This data set consists of United States personal expenditures (in billions of dollars) in five categories (food and tobacco, household operation, medical and health, personal care, and private education) for the years 1940, 1945, 1950, 1955 and 1960. Plot the US personal expenditure data in the style of the third plot in the “Time Use” case study on the course web page. You might want to transform the dollar amounts to log base 10 units first.
[Solution and Answer]
Load in the data set and check its structure
                      1940   1945  1950 1955  1960
Food and Tobacco    22.200 44.500 59.60 73.2 86.80
Household Operation 10.500 15.500 29.00 36.5 46.20
Medical and Health   3.530  5.760  9.71 14.0 21.10
Personal Care        1.040  1.980  2.45  3.4  5.40
Private Education    0.341  0.974  1.80  2.6  3.64
[1] "matrix"
'data.frame': 5 obs. of 5 variables:
$ 1940: num 22.2 10.5 3.53 1.04 0.341
$ 1945: num 44.5 15.5 5.76 1.98 0.974
$ 1950: num 59.6 29 9.71 2.45 1.8
$ 1955: num 73.2 36.5 14 3.4 2.6
$ 1960: num 86.8 46.2 21.1 5.4 3.64
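The output above shows a matrix followed by a data frame with the years as column names. One way to get from one to the other is sketched below; the name df_Expen is the one used in the next step, while the conversion call itself is an assumption.

```r
# USPersonalExpenditure ships with R's datasets package and is a matrix
is.matrix(USPersonalExpenditure)

# as.data.frame() keeps the year column names ("1940", ...) unchanged
df_Expen <- as.data.frame(USPersonalExpenditure)
str(df_Expen)
```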
Transform the data set
- Stack the data set.
- Rename the columns.
- Put Category back in the data set. Transform the expenditure-related variables to log base 10. Create a new variable expen_log_diff, which is the difference between \(log_{10}(expenditure)\) and the mean \(log_{10}(expenditure)\) of each year.
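dta3 <- df_Expen %>% stack()  #1
colnames(dta3) <- c('expenditure', 'year')  #2
dta3 <- dta3 %>%  #3
  mutate(Category = rep(df_Expen %>% rownames(), 5),
         # stack() lists the five rows of 1940 first, then those of 1945, and so on,
         # so each yearly mean has to be repeated five times in a row (each = 5)
         expen_log_M = rep(df_Expen %>% log10() %>% colMeans(), each = 5),
         expen_log_diff = log10(expenditure) - expen_log_M)
summary(dta3)
  expenditure       year     Category          expen_log_M
 Min.   : 0.341   1940:5   Length:25          Min.   :0.4930
 1st Qu.: 2.600   1945:5   Class :character   1st Qu.:0.7769
 Median : 9.710   1950:5   Mode  :character   Median :0.9739
 Mean   :20.069   1955:5                      Mean   :0.9184
 3rd Qu.:29.000   1960:5                      3rd Qu.:1.1039
 Max.   :86.800                               Max.   :1.2442
 expen_log_diff
 Min.   :-1.71143
 1st Qu.:-0.71471
 Median : 0.01336
 Mean   : 0.00000
 3rd Qu.: 0.78543
 Max.   : 1.44550
Note that the expen_log_diff summary above was printed with rep(..., 5), which cycles the five yearly means across the rows; with each = 5 the centered values stay between about -1 and 0.9.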
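A quick base-R cross-check of the per-year centering, written as a sketch with its own variable names (long, log_exp) so that it does not depend on the objects above. ave() returns each year's mean aligned row by row, so the centered values should average to zero within every year.

```r
# Rebuild the stacked data from the built-in matrix
expen <- as.data.frame(USPersonalExpenditure)
long <- stack(expen)
names(long) <- c('expenditure', 'year')
long$Category <- rep(rownames(expen), times = ncol(expen))

# ave() computes the mean within each year and repeats it for that year's rows
long$log_exp <- log10(long$expenditure)
long$expen_log_M <- ave(long$log_exp, long$year)
long$expen_log_diff <- long$log_exp - long$expen_log_M

# Each yearly mean of the centered values should be zero up to rounding error
tapply(long$expen_log_diff, long$year, mean)
```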
Plot
- Use qplot{ggplot2} to draw a dot plot with expen_log_diff as x and Category as y. Facet by year along the x-axis so that the change in expenditure of each category across years is easy to see.
- Add segments to turn the dot plots into lollipop plots.
- Add a vertical line as the central baseline of each plot.
- Set the x-axis range to (-2, 2).
- Label the x-axis.
qplot(x = expen_log_diff, y = Category, data = dta3, facets = . ~ year) +  #1
  geom_segment(aes(xend = 0, yend = Category)) +  #2
  geom_vline(xintercept = 0, color = 'grey55') +  #3
  scale_x_continuous(limits = c(-2, 2)) +  #4
  xlab('Expenditure [log10(billion)]')  #5
In-class exercise 4.
A sample of 158 children with autism spectrum disorder was recruited. Social development was assessed using the Vineland Adaptive Behavior Interview survey form, a parent-reported measure of socialization. It is a combined score that includes assessment of interpersonal relationships, play/leisure time activities, and coping skills. Initial language development was assessed using the Sequenced Inventory of Communication Development (SICD) scale. These assessments were repeated on these children when they were 3, 5, 9, and 13 years of age.
Source: West, B.T., Welch, K.B., & Galecki, A.T. (2002). Linear Mixed Models: Practical Guide Using Statistical Software. p. 220-271.
Data: autism{WWGbook}
- Column 1: Age (in years)
- Column 2: Vineland Socialization Age Equivalent score
- Column 3: Sequenced Inventory of Communication Development Expressive Group (1 = Low, 2 = Medium, 3 = High)
- Column 4: Child ID
Target output 1
Target output 2
Replicate the two plots above using ggplot2.
[Solution and Answer]
Load the data set
'data.frame': 612 obs. of 4 variables:
$ age : int 2 3 5 9 13 2 3 5 9 13 ...
$ vsae : int 6 7 18 25 27 17 18 12 18 24 ...
$ sicdegp: int 3 3 3 3 3 3 3 3 3 3 ...
$ childid: int 1 1 1 1 1 3 3 3 3 3 ...
      age              vsae           sicdegp         childid
 Min.   : 2.000   Min.   :  1.00   Min.   :1.000   Min.   :  1.00
 1st Qu.: 2.000   1st Qu.: 10.00   1st Qu.:1.000   1st Qu.: 48.75
 Median : 4.000   Median : 14.00   Median :2.000   Median :107.50
 Mean   : 5.771   Mean   : 26.41   Mean   :1.956   Mean   :105.38
 3rd Qu.: 9.000   3rd Qu.: 27.00   3rd Qu.:3.000   3rd Qu.:158.00
 Max.   :13.000   Max.   :198.00   Max.   :3.000   Max.   :212.00
                  NA's   :2
Create some variables for plot 1
- Group: group the data into three groups by sicdegp. This variable contains the group labels. (for plot 1)
- age_diff: the difference between age and its mean. (for plot 1)
- age_2: age minus 2.
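The three variables can be sketched in base R as follows. The toy data frame stands in for autism so that the example runs without the WWGbook package; its values are made up, and the group labels are an assumption based on the coding 1 = Low, 2 = Medium, 3 = High.

```r
# Toy rows with the same columns as autism; the values are made up
toy <- data.frame(age     = c(2, 3, 5, 9, 13, 2, 3, 5, 9, 13),
                  vsae    = c(6, 7, 18, 25, 27, 4, 5, NA, 11, 14),
                  sicdegp = rep(c(3, 1), each = 5),
                  childid = rep(c(1, 2), each = 5))

dta4_toy <- transform(toy,
  # labels assumed from the coding 1 = Low, 2 = Medium, 3 = High
  Group    = factor(sicdegp, levels = 1:3, labels = c('Low', 'Medium', 'High')),
  age_diff = age - mean(age),   # age centered at its overall mean
  age_2    = age - 2)           # age measured from 2, the youngest age in the data
head(dta4_toy)
```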
Plot 1
- Draw a plot of age_diff as x and vsae as y for each group.
- Specify the range and break points of the x-axis.
- Set the transparency of the points with ‘alpha’.
- Fit a linear model (method = 'lm') with the given formula.
- Add a polyline for each subject.
- Specify the theme and adjust the layout: set the text size of the axis text and the axis titles.
- Label the axes.
ggplot(data = dta4, mapping = aes(x = age_diff, y = vsae)) +  #1
  facet_grid(. ~ Group) +  #1
  scale_x_continuous(limits = c(-4, 7.5),  #2
                     breaks = seq(-2.5, 5, by=2.5)) +
  geom_point(alpha = 0.6) +  #3
  geom_smooth(method = 'lm', formula = y ~ x) +  #4
  geom_line(aes(group = childid), alpha = 0.3) +  #5
  theme_bw() +  #6
  theme(axis.text = element_text(size = 12),  #6
        axis.title = element_text(size = 14, face = 'bold')) +
  labs(x = 'Age (in years, centered)', y = 'VSAE score')  #7
Plot 2
The ratio of height and width of the original plot is 12.5 cm : 14.4 cm.
- Group the data by Group and age_2. Compute the mean and standard error for each group. Pass the summarized data set to ggplot to plot.
- Assign age_2 as x and vsae_MEAN as y. Use a different point shape for the vsae mean of each group. Place the points with position_dodge() for each group. Show the points in the legend.
- Add polylines at the same dodged positions for each group. Use a different line type for each group. Show the lines in the legend.
- Add error bars at the same dodged positions, with the size and width set explicitly.
- Label the axes.
- Specify the black-and-white theme.
- Detailed adjustment of the layout:
  - Change the grid line layout.
  - Specify the text size of the axes and that of the legend title.
  - Specify the location and the text size of the legend.
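Because vsae contains missing values, the count that goes into the standard error deserves a check. The sketch below uses a hypothetical helper se_mean and made-up scores to compare a denominator that counts only non-missing values with one that counts every row.

```r
# Standard error of the mean that ignores missing values in both sd and n
se_mean <- function(x) {
  x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

# Hypothetical scores with one missing value
scores <- c(1, 2, 3, NA)
se_mean(scores)                                   # denominator uses n = 3
sd(scores, na.rm = TRUE) / sqrt(length(scores))   # denominator uses n = 4, too small
```

Counting every row, as n() does inside summarize(), makes the standard error slightly too small whenever a group has missing scores.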
# 1
dta4 %>% group_by(Group, age_2) %>%
  summarize(vsae_MEAN = mean(vsae, na.rm = TRUE),
            # divide by the number of non-missing scores; n() would also count NAs
            vsae_SE = sd(vsae, na.rm = TRUE) / sqrt(sum(!is.na(vsae)))) %>% ggplot() +
  # 2
  aes(x = age_2, y = vsae_MEAN, group = Group, shape = Group) +
  geom_point(position = position_dodge(width = .3),
             size=rel(2), show.legend = TRUE) +
  scale_shape_manual(values = c(1, 2, 16)) +
  # use the same shapes as the original plot
  # 3
  geom_line(position = position_dodge(width = .3),
            aes(linetype = Group),
            show.legend = TRUE) +
  # 4
  geom_errorbar(aes(ymax = vsae_MEAN + vsae_SE,
                    ymin = vsae_MEAN - vsae_SE),
                size=.3, width=.2, position = position_dodge(width = .3)) +
  # 5
  xlab('Age (in year - 2)') + ylab('VSAE score') +
  # 6
  theme_bw() +
  # 7
  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_line(size=0.75),
        axis.text = element_text(size = 12),
        legend.position = c(.1, .85),
        legend.key = element_rect(color = "black"),
        legend.key.size = unit(.69, 'cm'),
        legend.title = element_text(size = 14),
        legend.box.background = element_rect(color = 'black'))
In-class exercise 5.
Use the diabetes dataset to generate a plot similar to the one below and interpret the plot.
Target output 1
Load the data set and check its structure.
'data.frame': 8706 obs. of 9 variables:
$ SEQN : int 51624 51626 51627 51628 51629 51630 51632 51633 51634 51635 ...
$ RIAGENDR: int 1 1 1 2 1 2 1 1 1 1 ...
$ RIDRETH1: int 3 4 4 4 1 3 2 3 1 3 ...
$ DIQ010 : int 2 2 2 1 2 2 2 2 2 1 ...
$ BMXBMI : num 32.2 22 18.2 42.4 32.6 ...
$ gender : Factor w/ 2 levels "Females","Males": 2 2 2 1 2 1 2 2 2 2 ...
$ race : Factor w/ 3 levels "Black","Hispanic",..: 3 1 1 1 2 3 2 3 2 3 ...
$ diabetes: Factor w/ 2 levels "No","Yes": 1 1 1 2 1 1 1 1 1 2 ...
$ BMI : Factor w/ 2 levels "Normal weight",..: 2 1 1 2 2 2 1 2 1 2 ...
Select useful variables from the data set for plotting using tools in ggalluvial
The ratio of height and width of the original plot is 9 cm : 13 cm.
- Select the useful variables from the data set Diabetes and count the observations for each combination of levels of these variables.
- Turn the variables into factors and rearrange their level order to match the order in the original plot.
- Use ggplot to plot. Set the frequency as y and the other variables as the three axes.
- Compute the positions of the lodes of the alluvial plot, that is, the intersections of the alluvia with the strata, and draw the alluvia. Use BMI as the group index. Specify the colors.
- Make the x-axis scale discrete. Assign the x labels. Use the expand argument to expand the coverage of the plot on the grid. Add the strata with their labels.
- Rename the y-axis. Give the plot a title and a subtitle.
- Specify the theme (remove the background color and keep the grid lines). Detailed adjustment of the layout: put the legend at the bottom of the plot.
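The first two steps, counting every combination of levels and then reordering the factor levels, can be sketched in base R. The records below are made up and only three of the variables are used, so this illustrates the shape of the frequency table rather than the real Diabetes data.

```r
# Made-up records; the real Diabetes data has many more rows and a BMI column
toy <- data.frame(race     = c('White', 'White', 'Black', 'Hispanic', 'White'),
                  gender   = c('Males', 'Females', 'Males', 'Males', 'Males'),
                  diabetes = c('No', 'Yes', 'No', 'No', 'No'))

# One row per combination of levels, with the count in the Freq column
freq <- as.data.frame(xtabs(~ race + gender + diabetes, data = toy))

# Reorder the levels to control the stacking order within an axis
freq$race <- factor(freq$race, levels = c('Hispanic', 'White', 'Black'))
freq
```

Combinations that never occur still get a row with Freq equal to 0, which is why the table has 3 x 2 x 2 = 12 rows for only five records.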
library(ggalluvial) # Load the package
# 1
Diabetes %>% dplyr::select(race, gender, diabetes, BMI) %>%
  xtabs(data = ., ~ race + gender + diabetes + BMI) %>% data.frame() %>%
  # 2
  mutate(race = factor(race, levels = c('Hispanic', 'White', 'Black')),
         gender = factor(gender, levels = c('Males', 'Females')),
         # assign back to diabetes so that axis3 below uses the reordered levels
         diabetes = factor(diabetes, levels = c('Yes', 'No'))) %>%
  # 3
  ggplot(., aes(y = Freq, axis1 = race, axis2 = gender, axis3 = diabetes)) +
  # 4
  geom_alluvium(aes(fill = BMI)) +
  scale_fill_manual(values=c('gray40','tan1')) +
  # 5
  scale_x_discrete(limits=c('race', 'gender', 'diabetes'), expand=c(.1, .05)) +
  geom_stratum() + geom_text(stat = 'stratum', infer.label=TRUE) +
  # 6
  ylab('No. individuals') +
  ggtitle('Diabetes in overall population in US 2009-2010',
          subtitle = 'stratified by race, gender and diabetes mellitus') +
  # 7
  theme_minimal() +
  theme(legend.position = 'bottom')
In-class exercise 6.
Find out what each code chunk (indicated by '##') in the following R script does and provide comments.
Chunk 2
Install the package gapminder and load it in.
Chunk 3
Load in the data set gapminder{gapminder} and check its structure.
tibble [1,704 × 6] (S3: tbl_df/tbl/data.frame)
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
$ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
Chunk 5
Use ggplot{ggplot2} to plot the main frame of gap$lifeExp without putting any data points or lines or any shapes.
Chunk 6
Use ggplot{ggplot2} to plot the histogram of gap$lifeExp.
Chunk 7
Use ggplot{ggplot2} to plot the histogram of gap$lifeExp. Give the histogram a title and label the axes. Specify the theme of the plot.
ggplot(data = gap, aes(x = lifeExp)) +
  geom_histogram(fill = "blue", color = "black", bins = 10) +
  ggtitle("Life expectancy for the gap dataset") +
  xlab("Life expectancy (years)") +
  ylab("Frequency") +
  theme_classic()
Chunk 8
Use ggplot{ggplot2} to draw a box plot of lifeExp for each continent. Give the plot a title and label the axes. Specify the theme of the plot.
ggplot(data = gap, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  ggtitle("Boxplots for lifeExp by continent") +
  xlab("Continent") +
  ylab("Life expectancy (years)") +
  theme_minimal() # + # line A
# guides(fill = FALSE) # line B
What happens if you un-hashtag guides(fill = FALSE) and the plus sign in lines 68 and 69 above?
- Try
ggplot(data = gap, aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot() +
  ggtitle("Boxplots for lifeExp by continent") +
  xlab("Continent") +
  ylab("Life expectancy (years)") +
  theme_minimal() + # line A
  guides(fill = FALSE) # line B
- Finding
The legend that maps the fill colors to the continents is no longer displayed.
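Recent versions of ggplot2 warn that guides(fill = FALSE) is deprecated and suggest the string 'none' instead. A small sketch of that spelling, using the built-in iris data only because it ships with R (it is not part of this exercise):

```r
library(ggplot2)

# iris stands in for gap here; any data with a fill aesthetic behaves the same way
p_no_legend <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
  geom_boxplot() +
  guides(fill = 'none')   # same effect as fill = FALSE, without the warning
print(p_no_legend)
```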
Chunk 9
- Use ggplot{ggplot2} to draw a scatter plot of lifeExp and gdpPercap, two variables in gap. Give the data points a different color and a different shape for each continent.
- Set the size and the transparency of the data points.
- Specify the theme of the plot.
- Give the plot a title.
- Label the axes.
- Detailed setting of the theme.
ggplot(data = gap, aes(x = lifeExp, y = gdpPercap, color = continent, shape = continent)) +  #1
  geom_point(size = 5, alpha = 0.5) +  #2
  theme_classic() +  #3
  ggtitle("Scatterplot of life expectancy by gdpPercap") +  #4
  xlab("Life expectancy (years)") +  #5
  ylab("gdpPercap (USD)") +  #5
  theme(legend.position = "top",  #6-1
        plot.title = element_text(hjust = 0.5, size = 20),  #6-2
        legend.title = element_text(size = 10),  #6-3
        legend.text = element_text(size = 5),  #6-3
        axis.text.x = element_text(angle = 45, hjust = 1))  #6-4
In the ggplot code above, what are the arguments inside of our second “theme” argument doing?
- 6-1: Put the legend at the top position.
- 6-2: Set the plot title size and horizontal justification.
- 6-3: Set the size of the title and texts in the legend.
- 6-4: Set the angle and horizontal justification of texts in x-axis.