library(tidyverse)
data(mpg)
ggplot(mpg, aes(x = displ, y = hwy))+
geom_point()
wrapping - facet_wrap() gives each class its own panel and wraps the panels across several rows
ggplot(mpg, aes(x = displ, y = hwy))+
geom_point()+
facet_wrap(.~class)
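A small sketch of how the wrapped layout can be controlled, assuming you want a fixed number of rows. `nrow` (and `ncol`) are existing `facet_wrap()` arguments; the object name `p_wrap` is made up for this example.

```r
library(ggplot2)  # also provides the mpg dataset

# Same scatterplot as above, but ask facet_wrap() for exactly two rows of panels
p_wrap <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

p_wrap
```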
gridding by columns - facet_grid(.~class) puts each class in its own column of a single row, which may read better than wrapping
ggplot(mpg, aes(x = displ, y = hwy))+
geom_point()+
facet_grid(.~class)
gridding by rows - facet_grid(class~.) stacks one row per class
ggplot(mpg, aes(x = displ, y = hwy))+
geom_point()+
facet_grid(class~.)
faceting with two variables - empty facets mean there’s no data there
ggplot(mpg, aes(x = displ, y = hwy))+
geom_point()+
facet_grid(drv~cyl)
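One way to confirm which panels are empty is to count the rows for every drv/cyl combination. This is only a sketch: `combos` and `empty_panels` are names invented here, and `complete()` from tidyr is used to add the combinations that never occur, with a count of 0.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the mpg dataset

# Count cars for each drv/cyl pair, then add the pairs that never occur (n = 0)
combos <- mpg %>%
  count(drv, cyl) %>%
  complete(drv, cyl, fill = list(n = 0L))

combos

# Pairs with n == 0 correspond to the empty facets in facet_grid(drv~cyl)
empty_panels <- filter(combos, n == 0)
empty_panels
```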
gray band is the 95% confidence interval around the smoothed line (computed from the standard error)
ggplot(mpg, aes(x = displ, y = hwy))+
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
removing the gray band (se = FALSE):
ggplot(mpg, aes(x = displ, y = hwy))+
geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
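The message above appears because ggplot2 picked the smoothing method and formula itself. A hedged sketch: naming both explicitly should stop the message, and `level` (an existing `geom_smooth()` argument, default 0.95) changes how wide the gray band is. The value 0.99 and the name `p_smooth` are just for illustration.

```r
library(ggplot2)

# Spell out the method and formula instead of relying on the defaults,
# and draw a 99% confidence band instead of the default 95% one
p_smooth <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "loess", formula = y ~ x, level = 0.99)

p_smooth
```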
multiple geometries:
ggplot(mpg, aes(x = displ, y = hwy))+
geom_point()+
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
specifying linear regression:
ggplot(mpg, aes(x = displ, y = hwy))+
geom_point()+
geom_smooth(method = "lm",se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
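To see the numbers behind a straight line like this, the same model can be fit directly with `lm()`. A minimal sketch, assuming the default formula y ~ x from the message above, which here means hwy ~ displ; `fit` is just a name chosen for the example.

```r
library(ggplot2)  # provides the mpg dataset

# Fit highway mileage as a linear function of engine displacement
fit <- lm(hwy ~ displ, data = mpg)

# Intercept and slope of the fitted line
coef(fit)
```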
grouping with color - 3 different lines
ggplot(mpg, aes(x = displ, y = hwy, color = drv))+
geom_point()+
geom_smooth(method = "lm",se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
coloring only the points - color = drv sits inside geom_point(), so geom_smooth() fits just 1 line
ggplot(mpg, aes(x = displ, y = hwy))+
geom_point(aes(color = drv))+
geom_smooth(method = "lm",se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
grouping without color - 3 lines
ggplot(mpg, aes(x = displ, y = hwy, group = drv))+
geom_point()+
geom_smooth(method = "lm",se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
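Grouping by drv means a separate line is fit to each drive type. A rough base-R sketch of the same idea: split the data by drv and fit one `lm()` per piece. The names `by_drv` and `slopes` are made up here.

```r
library(ggplot2)  # provides the mpg dataset

# One data frame per drive type (4, f, r)
by_drv <- split(mpg, mpg$drv)

# Fit hwy ~ displ within each group and keep only the slope
slopes <- sapply(by_drv, function(d) coef(lm(hwy ~ displ, data = d))[["displ"]])

slopes
```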
making a simple boxplot
ggplot(mpg, aes(y = hwy))+
geom_boxplot()
making colorful side-by-side boxplots - good for comparing distributions across different groups
ggplot(mpg, aes(y = hwy, fill = drv))+
geom_boxplot()
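The boxes summarize each group's quartiles, so the same comparison can be checked as a table. A sketch only: `box_stats` is a name invented for this example, and the whiskers and outlier points that the plot also shows are not reproduced here.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Quartiles of highway mileage for each drive type
box_stats <- mpg %>%
  group_by(drv) %>%
  summarise(
    q1     = quantile(hwy, 0.25),
    median = median(hwy),
    q3     = quantile(hwy, 0.75),
    iqr    = IQR(hwy)
  )

box_stats
```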
loading a new dataset & making a simple bar graph
data("diamonds")
ggplot(diamonds, aes(x=cut))+
geom_bar()
adding color - cut is mapped twice, to both x and fill
ggplot(diamonds, aes(x = cut, fill = cut))+
geom_bar()
stacked bar graph - bad for comparison
ggplot(diamonds, aes(x=cut, fill=color))+
geom_bar()
side-by-side (dodged) mini bars - better for comparison
ggplot(diamonds, aes(x=cut, fill=color))+
geom_bar(position = "dodge")
stacked but with proportion instead of count - can be useful!
ggplot(diamonds, aes(x=cut, fill=color))+
geom_bar(position = "fill")
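The proportions behind the position = "fill" plot can also be computed by hand, which helps when you want the actual numbers. A minimal sketch: `props` is a name made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Count diamonds per cut/color pair, then convert to a share within each cut
props <- diamonds %>%
  count(cut, color) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

props
```

Within each cut the prop values add up to 1, which is why every bar in the filled plot has the same height.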
EDA (exploratory data analysis) is an iterative cycle. You must:
Generate questions about your data
Search for answers by visualizing, transforming, and modeling your data
Use what you learn to refine your questions and/or generate new questions
Questions to ask yourself:
What type of variation occurs within my variables?
Which values are the most common? Why?
Which values are rare? Why? Does this match your expectations?
Can you see any unusual patterns? What might explain them?
What type of covariation occurs between my variables?
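A quick, hedged starting point for these questions, using the two datasets from earlier: a sorted count shows which values are common or rare (variation), and a correlation gives a first look at how two numeric variables move together (covariation). The names `cut_counts` and `displ_hwy_cor` are made up here.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds and mpg datasets

# Variation: which cut values are most common, which are rare?
cut_counts <- diamonds %>%
  count(cut, sort = TRUE)

cut_counts

# Covariation: how do engine size and highway mileage move together?
displ_hwy_cor <- cor(mpg$displ, mpg$hwy)

displ_hwy_cor
```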
Two main tips:
1. Write down your expectations/preconceived notions. This gives you a starting point.
2. Show the data. Don’t over-process it: start with the rawest data possible and then refine it.
Note what surprises you, otherwise you may forget how you got to where you did. USE R MARKDOWNS.