library(ggplot2)
## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.
library(gridExtra)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#library(XQuartz)
#install.packages('XQuartz')
# read the tab-separated income data (TRUE spelled out, since T can be reassigned)
incomeTab <- read.table('income_exmpl.dat', header = TRUE, sep = "\t")


# change order of factor levels
incomeTab$occ <- factor(incomeTab$occ, levels = c('low', 'med.', 'high'))
incomeTab$edu <- factor(incomeTab$edu, levels = c('low', 'med.', 'high'))
incomeTab$sex <- factor(incomeTab$sex, levels = c('m', 'f'), labels = c('male', 'female'))

1 Group Homework

  • You will work with your group to complete this assignment.

  • Submit your group’s shared .Rmd AND “knitted” .html files

    • Your “knitted .html” submission must be created from your “group .Rmd” but be created on your own computer

    • Confirm this with the following comment included in your submission text box: “Honor Pledge: I have recreated my group submission using the tools I have installed on my own computer”

    • Name the files with a group name and YOUR name for your submission

  • Please use ggplot2 for this assignment.

Each group member must be able to submit this assignment as created from their own computer. If only some members of the group submit the required files, those group members must additionally provide a supplemental explanation along with their submission as to why other students in their group have not completed this assignment.

2 Part 1

  • Use the occupational experience variable (“oexp”) of the income_example dataset and plot

    • a histogram,
    • a kernel density estimate,
    • a boxplot of “oexp”,
    • a set of boxplots showing the distribution of “oexp” by sex crossed with occupational status (“occ”).
  • You can either produce four separate but small plots, or you can use gridExtra::grid.arrange(), ggpubr::ggarrange(), cowplot::plot_grid(), or patchwork to create a plotting region consisting of four subplots.

  • That is, you should create a ggplot version of Part 1 in Assignment 4. There is no need to describe the distributions of occupational experience in words. But make sure that you draw four plots and add x, y-labels and titles using the function labs(x=..., y=..., title=...).
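As one hedged illustration of the patchwork option mentioned above, the sketch below arranges four plots in a 2 x 2 grid. It uses the built-in mtcars data rather than the assignment data, and the object names p1 to p4 and combined are placeholders introduced only for this example.

```r
library(ggplot2)
library(patchwork)

# Placeholder plots built from the built-in mtcars data (not the assignment data)
p1 <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10) +
  labs(x = "Miles per gallon", y = "Count", title = "Histogram")
p2 <- ggplot(mtcars, aes(x = mpg)) +
  geom_density() +
  labs(x = "Miles per gallon", y = "Density", title = "Kernel density")
p3 <- ggplot(mtcars, aes(x = mpg)) +
  geom_boxplot() +
  labs(x = "Miles per gallon", title = "Boxplot")
p4 <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot() +
  labs(x = "Cylinders", y = "Miles per gallon", title = "Boxplots by group")

# patchwork overloads `+` (side by side) and `/` (stacked) for ggplot objects
combined <- (p1 + p2) / (p3 + p4)
combined
```

The same four objects could also be passed to gridExtra::grid.arrange(p1, p2, p3, p4, nrow = 2); which package to use is largely a matter of taste.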

# geom_histogram() puts the number of observations per bin on the y-axis
hist <- ggplot(incomeTab, aes(x=oexp)) + geom_histogram(color="darkblue", fill="lightblue",bins=20)+labs(x='Occupational Experience',y='Count',title='Histogram of Occupational Experience')

kdens <- ggplot(incomeTab)+geom_density(aes(x=oexp)) +labs(x='Occupational Experience',y='Density',title='Kernel Density of \nOccupational Experience')

# aes(x=oexp) is inherited from ggplot(), so it is not repeated in geom_boxplot()
boxplot <- ggplot(incomeTab, aes(x=oexp)) + geom_boxplot()+labs(x='Occupational Experience',title='Boxplot of \nOccupational Experience')+theme(axis.text.y=element_blank(),axis.ticks.y=element_blank())

boxplots <- ggplot(incomeTab, aes(x=oexp, y=sex,color=sex))  +geom_boxplot()+facet_wrap(~ occ) + labs(x='Occupational Experience', y= 'Sex',title='Occ. Experience by Sex and Occupational Status')+ coord_flip()


grid.arrange(hist, kdens, boxplot, boxplots, nrow=2)

3 Part 2

  • Use the SCS Data set you downloaded from Collab for Assignment 4, and investigate the relationship between the mathematics achievement score (“mathpre”) and the math anxiety score (“mars”) by plotting the data.
  1. Produce a scatterplot between “mathpre” and “mars”. You might consider using function geom_jitter() or the argument alpha from package ggplot2 for avoiding overlying points.
library(foreign)
scs = data.frame(read.spss("SCS_QE.sav"))
## re-encoding from CP1252
## Warning in read.spss("SCS_QE.sav"): Undeclared level(s) 0 added in variable:
## married
mathpre <- scs$mathpre
mars <- scs$mars
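# geom_jitter() already draws the points, so geom_point() is dropped to avoid plotting every observation twice
# mathpre is the achievement score (x-axis) and mars is the anxiety score (y-axis)
ggplot(scs, aes(x=mathpre, y=mars))+geom_jitter(alpha=0.5)+labs(x='Math Achievement Score', y='Math Anxiety Score')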
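
To illustrate the hint about geom_jitter() and alpha, here is a minimal sketch on simulated integer scores, where many observations share exactly the same coordinates. The data frame sim and its columns score_a and score_b are hypothetical and are not part of the SCS data set.

```r
library(ggplot2)

# Simulated integer-valued scores (hypothetical data, not the SCS data set)
set.seed(1)
sim <- data.frame(
  score_a = sample(1:10, 500, replace = TRUE),
  score_b = sample(1:10, 500, replace = TRUE)
)

# geom_jitter() draws each point once at a randomly shifted position;
# alpha makes any remaining overlap visible as darker spots
p_jitter <- ggplot(sim, aes(x = score_a, y = score_b)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.4) +
  labs(x = "Score A", y = "Score B", title = "Jittered, semi-transparent points")
p_jitter
```

The width and height values are arbitrary choices for this sketch; smaller values keep points closer to their true positions.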

  2. Draw a conditioning plot for female and male students (variable “male”). Add + facet_wrap() or + facet_grid() in your first argument to create a conditioning plot.
# one panel per level of male; geom_jitter() alone draws the points (no duplicate geom_point() layer)
ggplot(scs, aes(x=mathpre, y=mars))+geom_jitter(alpha=0.5)+facet_wrap(~male)+labs(x='Math Achievement Score', y='Math Anxiety Score')

4 Part 3

  • Use the UC-Berkeley Admissions dataset, which is named “UCBAdmissions” and is included in base R. It shows the number of students – male and female – who were admitted or rejected from the six largest departments at UC-Berkeley. The dataset takes the form of a three-dimensional array.

  • I provide code for creating aggregated data and grouped data. If you like, you can use your own code to construct the aggregated and grouped data. You can also use the rejection rate instead of the admission rate to draw the plots. If you would like to use the rejection rate, please use %>% filter(Admit == "Rejected") instead of %>% filter(Admit == "Admitted").

  • dplyr is a grammar of data manipulation. For more information about dplyr,

data(UCBAdmissions) # load data

library(broom) # load package broom
dat <- tidy(UCBAdmissions)
# load package dplyr
library(dplyr)

# create aggregated data
dat_agg <- dat %>% 
  group_by(Admit, Gender) %>% 
  summarize(n = sum(n)) %>%
  ungroup() %>% 
  group_by(Gender) %>% 
  mutate(Prop = n/sum(n)) %>% 
  filter(Admit == "Admitted")

knitr::kable(dat_agg)
|Admit    |Gender |    n|      Prop|
|:--------|:------|----:|---------:|
|Admitted |Female |  557| 0.3035422|
|Admitted |Male   | 1198| 0.4451877|
# create grouped data
dat_dept <- dat %>% 
  group_by(Gender, Dept) %>% 
  mutate(Prop = n/sum(n)) %>% 
  filter(Admit == "Admitted")

knitr::kable(dat_dept)
|Admit    |Gender |Dept |   n|      Prop|
|:--------|:------|:----|---:|---------:|
|Admitted |Male   |A    | 512| 0.6206061|
|Admitted |Female |A    |  89| 0.8240741|
|Admitted |Male   |B    | 353| 0.6303571|
|Admitted |Female |B    |  17| 0.6800000|
|Admitted |Male   |C    | 120| 0.3692308|
|Admitted |Female |C    | 202| 0.3406408|
|Admitted |Male   |D    | 138| 0.3309353|
|Admitted |Female |D    | 131| 0.3493333|
|Admitted |Male   |E    |  53| 0.2774869|
|Admitted |Female |E    |  94| 0.2391858|
|Admitted |Male   |F    |  22| 0.0589812|
|Admitted |Female |F    |  24| 0.0703812|
  • Draw plots to provide evidence of Simpson’s Paradox.
#Data by just Male and Female
ggplot(dat_agg) + geom_bar(aes(x=Gender, y=Prop, fill=Gender), stat='identity') + labs(x='Gender', y='Admission Rate', title='UC Berkeley Admission Rate by Gender')

#Data by Department and Gender
ggplot(dat_dept) + geom_bar(aes(x=Gender, y=Prop, fill=Gender), stat='identity') + facet_wrap(~Dept) + labs(x='Gender', y='Admission Rate', title='UC Berkeley Admission Rate by Department')

  • Describe in words the relation between the admission rate and gender.

Plot 1, based on gender alone, shows that females had a lower overall admission rate than males (about 30% versus 45%). However, when the data are separated by department, females have the higher admission rate in 4 of the 6 departments (A, B, D, and F), while males have the higher rate only in C and E. This illustrates Simpson’s paradox: the conclusion drawn from the aggregated data reverses in most departments once the data are broken down by department.
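
One way to see where the reversal comes from is to look at where each gender applied. The following is a minimal sketch using only base R and the built-in UCBAdmissions array; the object names applicants, app_share, and dept_rate are introduced here for illustration and are not part of the provided code.

```r
data(UCBAdmissions)

# Applicants per gender and department (summing over the Admit dimension)
applicants <- margin.table(UCBAdmissions, c(2, 3))

# Share of each gender's applications going to each department (rows sum to 1)
app_share <- round(prop.table(applicants, 1), 2)

# Overall admission rate of each department, ignoring gender
dept_rate <- round(prop.table(margin.table(UCBAdmissions, c(1, 3)), 2)["Admitted", ], 2)

app_share
dept_rate
```

In this output, roughly half of the male applications go to departments A and B, which admit most of their applicants, while female applications are concentrated in departments with lower admission rates. That difference in application patterns is a plausible reading of why the aggregated rate favors males even though most departments do not.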

  • One of the group members will present the R code and plots for Part 3 in class on Feb. 14 (Tue). If you are the presenter, please bring your laptop so that you can share your screen on Zoom for the presentation.