library(ggplot2)
## Warning in register(): Can't find generic `scale_type` in package ggplot2 to
## register S3 method.
library(gridExtra)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
#library(XQuartz)
#install.packages('XQuartz')
incomeTab <- read.table('income_exmpl.dat', header = TRUE, sep = "\t")
# change order of factor levels
incomeTab$occ <- factor(incomeTab$occ, levels = c('low', 'med.', 'high'))
incomeTab$edu <- factor(incomeTab$edu, levels = c('low', 'med.', 'high'))
incomeTab$sex <- factor(incomeTab$sex, levels = c('m', 'f'), labels = c('male', 'female'))
You will work with your group to complete this assignment.
Submit your group’s shared .Rmd AND “knitted” .html files.
Your “knitted” .html submission must be generated from your group’s .Rmd, but on your own computer.
Confirm this with the following comment included in your submission text box: “Honor Pledge: I have recreated my group submission using the tools I have installed on my own computer”
Name the files with a group name and YOUR name for your submission
Please use ggplot2 for this assignment.
Each group member must be able to submit this assignment as created from their own computer. If only some members of the group submit the required files, those group members must additionally provide a supplemental explanation along with their submission as to why other students in their group have not completed this assignment.
Use the occupational experience variable (“oexp”) of the income_example dataset and plot a histogram, a kernel density estimate, a boxplot, and boxplots by sex and occupational status.
You can either produce four separate but small plots, or you can use gridExtra::grid.arrange(), ggpubr::ggarrange(), cowplot::plot_grid(), or patchwork to create a plotting region consisting of four subplots.
That is, you should create a ggplot version of Part 1 in Assignment 4. There is no need to describe the distributions of occupational experience in words. But make sure that you draw four plots and add x, y-labels and titles using the function labs(x=..., y=..., title=...).
# histogram of occupational experience (y-axis shows bin counts)
hist <- ggplot(incomeTab, aes(x = oexp)) +
  geom_histogram(color = "darkblue", fill = "lightblue", bins = 20) +
  labs(x = 'Occupational Experience', y = 'Count', title = 'Histogram of Occupational Experience')
# kernel density estimate
kdens <- ggplot(incomeTab, aes(x = oexp)) +
  geom_density() +
  labs(x = 'Occupational Experience', y = 'Density', title = 'Kernel Density of \nOccupational Experience')
# single horizontal boxplot; the y-axis carries no information, so hide it
boxplot <- ggplot(incomeTab, aes(x = oexp)) +
  geom_boxplot() +
  labs(x = 'Occupational Experience', title = 'Boxplot of \nOccupational Experience') +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank())
# boxplots by sex, one panel per occupational status
boxplots <- ggplot(incomeTab, aes(x = oexp, y = sex, color = sex)) +
  geom_boxplot() +
  facet_wrap(~ occ) +
  labs(x = 'Occupational Experience', y = 'Sex', title = 'Occ. Experience by Sex and Occupational Status') +
  coord_flip()
grid.arrange(hist, kdens, boxplot, boxplots, nrow=2)
Use geom_jitter() or the alpha argument from package ggplot2 to avoid overlapping points.
library(foreign)
scs = data.frame(read.spss("SCS_QE.sav"))
## re-encoding from CP1252
## Warning in read.spss("SCS_QE.sav"): Undeclared level(s) 0 added in variable:
## married
mathpre <- scs$mathpre
mars <- scs$mars
ggplot(scs, aes(x = mathpre, y = mars)) + geom_jitter() + labs(x = 'Math Achievement Score', y = 'Math Anxiety Score')
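The effect of jitter and transparency on overlapping points is easiest to see on data with many ties. The following is a minimal sketch on simulated integer scores; the data frame `toy`, its columns, and the `width`, `height`, and `alpha` values are illustrative assumptions, not part of the assignment data.

```r
library(ggplot2)

# Hypothetical integer-valued scores, simulated only for illustration
set.seed(1)
toy <- data.frame(
  x = sample(1:10, 300, replace = TRUE),
  y = sample(1:10, 300, replace = TRUE)
)

# Jitter moves tied points slightly apart; alpha makes any remaining
# overlap show up as darker areas
p_jitter <- ggplot(toy, aes(x = x, y = y)) +
  geom_jitter(width = 0.2, height = 0.2, alpha = 0.4) +
  labs(x = 'x (simulated)', y = 'y (simulated)', title = 'Jitter plus transparency')

p_jitter
```

Using geom_jitter() on its own, rather than together with geom_point(), avoids drawing every observation twice.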
Use + facet_wrap() or + facet_grid() with your first plot to create a conditioning plot.
ggplot(scs, aes(x = mathpre, y = mars)) + geom_jitter() + facet_wrap(~male) + labs(x = 'Math Achievement Score', y = 'Math Anxiety Score')
Use the UC-Berkeley Admissions dataset, named “UCBAdmissions”, which is included in base R. It gives the number of students, male and female, who were admitted to or rejected from the six largest departments at UC-Berkeley. The dataset takes the form of a three-dimensional array.
I provide code for creating the aggregated and grouped data. If you like, you can write your own code to construct them. You can also use the rejection rate instead of the admission rate to draw the plots; in that case, use %>% filter(Admit == "Rejected") instead of %>% filter(Admit == "Admitted").
dplyr is a grammar of data manipulation. For more information about dplyr,
data(UCBAdmissions) # load data
library(broom) # load package broom
dat <- tidy(UCBAdmissions)
# load package dplyr
library(dplyr)
# create aggregated data
dat_agg <- dat %>%
  group_by(Admit, Gender) %>%
  summarize(n = sum(n)) %>%
  ungroup() %>%
  group_by(Gender) %>%
  mutate(Prop = n/sum(n)) %>%
  filter(Admit == "Admitted")
knitr::kable(dat_agg)
| Admit | Gender | n | Prop |
|---|---|---|---|
| Admitted | Female | 557 | 0.3035422 |
| Admitted | Male | 1198 | 0.4451877 |
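As a cross-check on the aggregated table, the same rates can be computed from the array with base R alone. This is a minimal sketch, not the assignment's provided code; the object names `by_gender` and `rates` are illustrative.

```r
# UCBAdmissions ships with R's datasets package; dimensions are Admit x Gender x Dept
data(UCBAdmissions)

# Sum over departments, leaving an Admit x Gender table of counts
by_gender <- margin.table(UCBAdmissions, margin = c(1, 2))

# Column-wise proportions: each gender's admitted and rejected shares sum to 1
rates <- prop.table(by_gender, margin = 2)

round(rates["Admitted", ], 4)  # admission rates by gender
round(rates["Rejected", ], 4)  # rejection rates by gender
```

The "Rejected" row is the rejection-rate alternative mentioned in the instructions, so either rate can be read off the same table.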
# create grouped data
dat_dept <- dat %>%
  group_by(Gender, Dept) %>%
  mutate(Prop = n/sum(n)) %>%
  filter(Admit == "Admitted")
knitr::kable(dat_dept)
| Admit | Gender | Dept | n | Prop |
|---|---|---|---|---|
| Admitted | Male | A | 512 | 0.6206061 |
| Admitted | Female | A | 89 | 0.8240741 |
| Admitted | Male | B | 353 | 0.6303571 |
| Admitted | Female | B | 17 | 0.6800000 |
| Admitted | Male | C | 120 | 0.3692308 |
| Admitted | Female | C | 202 | 0.3406408 |
| Admitted | Male | D | 138 | 0.3309353 |
| Admitted | Female | D | 131 | 0.3493333 |
| Admitted | Male | E | 53 | 0.2774869 |
| Admitted | Female | E | 94 | 0.2391858 |
| Admitted | Male | F | 22 | 0.0589812 |
| Admitted | Female | F | 24 | 0.0703812 |
#Data by just Male and Female
ggplot(dat_agg) + geom_bar(aes(x=Gender, y=Prop, fill=Gender), stat='identity') + labs(x='Gender', y='Admission Rate', title='UC Berkeley Admission Rate by Gender')
#Data by Department and Gender
ggplot(dat_dept) + geom_bar(aes(x=Gender, y=Prop, fill=Gender), stat='identity') + facet_wrap(~Dept) + labs(x='Gender', y='Admission Rate', title='UC Berkeley Admission Rate by Department')
Plot 1, based on gender alone, shows that females had a lower overall admission rate than males (about 30% versus 45%). However, when the data are separated by department, females have the higher admission rate in 4 of the 6 departments (A, B, D, and F). This illustrates Simpson’s paradox: depending on how the data are aggregated, opposite conclusions can be reached.
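The department-by-department comparison can also be checked directly from the array rather than read off the plot. The sketch below uses only base R; `dept_rates` and `female_higher` are illustrative names.

```r
data(UCBAdmissions)

# Admission rate for each gender within each department (Gender x Dept matrix)
dept_rates <- prop.table(UCBAdmissions, margin = c(2, 3))["Admitted", , ]

# TRUE for departments where the female rate exceeds the male rate
female_higher <- dept_rates["Female", ] > dept_rates["Male", ]

female_higher       # per-department comparison
sum(female_higher)  # number of departments favoring female applicants
```

Comparing this count with the aggregated rates by gender makes the reversal between the two plots explicit.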