You have already made some plots in Sessions 1 and 2
- Histograms
July 2018
In this session you extend these plotting routines using ggplot2. The ggplot2 package is part of the tidyverse, a collection of packages that support data science. Lex and I are big fans!
ggplot2 package: Key Points
ggplot2 package: ggplot in action
par(mar = c(4, 8, 3, 3))
# plot 1
boxplot(census$Owned ~ census$social.class,
        horizontal = TRUE, outline = FALSE, lwd = 0.5, las = 2,
        col = c("#D7191C", "#FFFFBF", "#2B83BA"),
        xlab = expression(paste("Property Ownership")))
ggplot(data = census, aes(x="", y=Owned, fill = social.class)) + geom_boxplot()
ggplot(data = census, aes(x="", y=Owned, fill = social.class)) + geom_boxplot() + scale_fill_manual("Social Class", values = c("#D7191C","#FFFFBF", "#2B83BA"))
ggplot(data = census, aes(x="", y=Owned, fill = social.class)) + geom_boxplot() + scale_fill_manual("Social Class", values = c("#D7191C","#FFFFBF", "#2B83BA")) + coord_flip()
ggplot(data = census, aes(x="", y=Owned, fill = social.class)) + geom_boxplot() + scale_fill_manual("Social Class", values = c("#D7191C","#FFFFBF", "#2B83BA")) + coord_flip() + ylab("Property Ownership") + xlab("")
ggplot(data = census, aes(x="", y=Owned, fill = social.class)) + geom_boxplot() + scale_fill_manual("Social Class", values = c("#D7191C","#FFFFBF", "#2B83BA")) + coord_flip() + ylab("Property Ownership") + xlab("") + theme_bw()
ggplot(data = census, aes(x="", y=Owned, fill = social.class)) + geom_boxplot(width = 0.5) + scale_fill_manual("Social Class", values = c("#D7191C","#FFFFBF", "#2B83BA")) + coord_flip() + ylab("Property Ownership") + xlab("") + theme_bw()
ggplot(data = census, aes(x="", y=Owned, fill = social.class)) + geom_boxplot(width = 0.5, position=position_dodge(1)) + scale_fill_manual("Social Class", values = c("#D7191C","#FFFFBF", "#2B83BA")) + coord_flip() + ylab("Property Ownership") + xlab("") + theme_bw()
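Because each + adds one layer or setting, the build-up above can also be done by storing the plot in an object and adding to it step by step. The following is a minimal sketch of that idea. The census data is not bundled with any package, so it assumes the mpg data that ships with ggplot2 as a stand-in, with drv playing the role of social.class and hwy the role of Owned; the axis and legend labels are illustrative.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the census data (an assumption)
p <- ggplot(data = mpg, aes(x = "", y = hwy, fill = drv)) +
  geom_boxplot(width = 0.5, position = position_dodge(1))

# Add the remaining pieces to the stored plot, one + at a time
p <- p +
  scale_fill_manual("Drive train", values = c("#D7191C", "#FFFFBF", "#2B83BA")) +
  coord_flip() +
  ylab("Highway miles per gallon") +
  xlab("") +
  theme_bw()

print(p)
```

Storing the intermediate plot makes it easy to print it after each addition and see what the new piece changed.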
Using ggplot2 tools, this practical will show you how to use visualisation (and some data transformations) to explore your data in a systematic way, a task known as exploratory data analysis, or EDA for short. EDA is an iterative cycle that involves:
- Generating questions about your data.
- Searching for answers by visualising, transforming, and modelling your data.
- Using what you learn to refine your questions and/or generate new questions.
“There are no routine statistical questions, only questionable statistical routines.” — Sir David Cox
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
Your goal during EDA is to develop an understanding of your data.
There is no rule about which questions you should ask to guide your research.
However, two types of questions will always be useful for making discoveries within your data.
You can loosely word these questions as:
What type of variation occurs within my variables?
What type of covariation occurs between my variables?
Look for Variation: the tendency of the values of a variable to change from measurement to measurement
Visualising distributions for both categorical and continuous variables.
ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
ggplot(data = diamonds) + geom_histogram(mapping = aes(x = carat), binwidth = 0.5)
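A bar chart or histogram only draws counts, so the same information can be computed by hand. As a rough sketch, assuming dplyr is loaded alongside ggplot2, count() gives the bar heights for cut, and cut_width() (a ggplot2 helper) should reproduce the 0.5-carat bins. The object names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Heights of the bars in the geom_bar() plot: one row per cut
cut_counts <- diamonds %>% count(cut)

# Heights of the histogram bars: carat binned into intervals of width 0.5
carat_counts <- diamonds %>% count(cut_width(carat, 0.5))

cut_counts
carat_counts
```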
Look for Typical Values and Outliers
smaller <- diamonds %>% filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) + geom_histogram(binwidth = 0.01)
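Outliers can be hard to see in a histogram because their bars are so short. One way to find them, sketched here on the y dimension of diamonds (a variable chosen for illustration, not one used above), is to zoom in on the count axis with coord_cartesian() and then pull out the suspicious rows with filter(). The thresholds 3 and 20 are assumptions picked by eye.

```r
library(ggplot2)
library(dplyr)

# Zoom in on small counts so that rare values become visible
p <- ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
print(p)

# Pull out the rows with unusually small or large y values
unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)
unusual
```

coord_cartesian() only changes the visible window; the bins are still computed from all of the data.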
Look for Variation and Covariation
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) + geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
ggplot(data = mpg) + geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))
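In the boxplot above, reorder() sorts the levels of class by the median of hwy, so the boxes appear in increasing order. A small sketch to inspect that ordering directly (the object names are illustrative):

```r
library(ggplot2)  # provides the mpg data

# Reorder the vehicle classes by their median highway mileage
ordered_class <- reorder(mpg$class, mpg$hwy, FUN = median)

# Levels now run from the lowest to the highest median
levels(ordered_class)

# Median hwy per class, listed in the new level order
class_medians <- tapply(mpg$hwy, ordered_class, median)
class_medians
```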
ggplot(data = diamonds) + geom_count(mapping = aes(x = cut, y = color))
ggplot(data = diamonds) + geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)
ggplot(data = smaller) + geom_hex(mapping = aes(x = carat, y = price))
ggplot(data = smaller, mapping = aes(x = carat, y = price)) + geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))
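For two categorical variables, an alternative to geom_count() is to compute the counts yourself and draw them as a heat map. This is a sketch that assumes dplyr is available; geom_tile() is swapped in here for illustration and is not one of the plots above.

```r
library(ggplot2)
library(dplyr)

# One row per combination of color and cut, with its count in column n
cut_color_counts <- diamonds %>% count(color, cut)

# Draw each combination as a tile shaded by its count
p <- ggplot(data = cut_color_counts, mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))
print(p)
```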
%>%
df %>% summarise(avg_MedIbc = mean(MedInc))
## # A tibble: 1 x 1
##   avg_MedIbc
##        <dbl>
## 1       37.1
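The pipe passes the object on its left as the first argument of the function on its right, so x %>% f(y) is another way of writing f(x, y). A quick check of that equivalence, using the built-in mtcars data as a stand-in because df is not defined here:

```r
library(dplyr)

# Piped form: mtcars becomes the first argument of summarise()
piped <- mtcars %>% summarise(avg_mpg = mean(mpg))

# Equivalent nested form without the pipe
nested <- summarise(mtcars, avg_mpg = mean(mpg))

identical(piped, nested)
```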
- dplyr verbs (dplyr is also part of the tidyverse)
- group_by function:
tb %>% group_by(IncClass) %>% summarise(avg_MedIbc = mean(MedInc))
## # A tibble: 3 x 2
##   IncClass avg_MedIbc
##   <chr>         <dbl>
## 1 Average       34533
## 2 Poor          27410
## 3 Rich          52047
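group_by() followed by summarise() splits the data into groups, applies the summary to each group, and combines the results into one table. A minimal sketch of the same pattern on the built-in mtcars data (a stand-in, since tb is not defined here), with a base R tapply() call for comparison:

```r
library(dplyr)

# Split by number of cylinders, then average mpg within each group
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))
by_cyl

# The same group means computed with base R
base_means <- tapply(mtcars$mpg, mtcars$cyl, mean)
base_means
```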
Verbs in dplyr
verbs | Description |
---|---|
select() | select columns |
filter() | filter rows |
arrange() | re-order or arrange rows |
mutate() | create new columns |
summarise() | summarise values |
group_by() | allows for group operations in the split-apply-combine concept |
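The verbs in the table are designed to be chained with the pipe. The sketch below strings all six together on the built-in mtcars data; the dataset, the derived column mpg_per_wt, and the object names are assumptions made for illustration.

```r
library(dplyr)

big_engines <- mtcars %>%
  select(mpg, cyl, wt) %>%           # keep three columns
  filter(cyl > 4) %>%                # keep rows with more than 4 cylinders
  mutate(mpg_per_wt = mpg / wt) %>%  # create a new column
  arrange(desc(mpg_per_wt))          # sort rows, highest ratio first

# Group the result and summarise each group
cyl_summary <- big_engines %>%
  group_by(cyl) %>%
  summarise(n_cars = n(), avg_ratio = mean(mpg_per_wt))

big_engines
cyl_summary
```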
%>% example
tb %>%
  ungroup() %>%
  mutate(black_high = ifelse(PctBlack > median(PctBlack), "High % Black Area", "Low % Black Area")) %>%
  ggplot(aes(y = PctBach, x = IncClass, fill = IncClass)) +
  facet_wrap(~black_high) +
  coord_flip() +
  geom_boxplot() +
  labs(y = "Percentage with Bachelors", x = "Income class") +
  scale_fill_manual(name = "Income Class", values = c("orange", "palegoldenrod", "firebrick3"))
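Note the change of operator in a pipeline like this: data steps are joined with %>%, but once the data reaches ggplot() the layers are joined with +. A runnable sketch of the same pattern, assuming the mpg data from ggplot2 as a stand-in for tb, with an illustrative engine column split at the median displacement:

```r
library(ggplot2)
library(dplyr)

p <- mpg %>%
  # data step: label each car by whether its engine is above the median size
  mutate(engine = ifelse(displ > median(displ), "Large engine", "Small engine")) %>%
  # plotting step: from here on, pieces are added with + rather than %>%
  ggplot(aes(x = drv, y = hwy, fill = drv)) +
  geom_boxplot() +
  facet_wrap(~engine) +
  coord_flip() +
  labs(x = "Drive train", y = "Highway miles per gallon")

print(p)
```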