When running R code from the editor, the keyboard shortcut Cmd/Ctrl + Enter is very useful: it sends the current R expression to the Console. When running R code in an R Notebook, the keyboard shortcut Cmd/Ctrl + Shift + Enter is useful to run the current chunk of R code.
Here is the link to the RStudio Keyboard Shortcuts.
Continuing with the flights data.
library(tidyverse)
library(nycflights13)
flights
not_cancelled <- flights %>%
  filter(!is.na(dep_delay), !is.na(arr_delay))
not_cancelled %>%
  group_by(year, month, day) %>%
  summarise(mean = mean(dep_delay))
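The filter on missing delays matters because mean() returns NA as soon as one value in a group is missing. A minimal sketch of the two usual ways to handle this, assuming a small made-up tibble (`toy` and its values are hypothetical, not taken from flights):

```r
library(dplyr)

# Hypothetical toy data: one missing delay in group "a"
toy <- tibble(
  day = c("a", "a", "b", "b"),
  dep_delay = c(10, NA, 4, 6)
)

# Without any NA handling, group "a" summarises to NA
with_na <- toy %>%
  group_by(day) %>%
  summarise(mean = mean(dep_delay))

# Option 1: drop the incomplete rows first, as done for not_cancelled
filtered <- toy %>%
  filter(!is.na(dep_delay)) %>%
  group_by(day) %>%
  summarise(mean = mean(dep_delay))

# Option 2: keep the rows and let mean() skip the missing values
skipped <- toy %>%
  group_by(day) %>%
  summarise(mean = mean(dep_delay, na.rm = TRUE))
```

Both options give the same means here. Filtering first has the advantage that every later summary works on the same set of rows.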
Chapter 7 is about Exploratory Data Analysis (EDA).
EDA is an iterative cycle. You:
- Generate questions about your data.
- Search for answers by visualising, transforming, and modelling your data.
- Use what you learn to refine your questions and/or generate new questions.
A good quote that starts the Chapter:
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise.” — John Tukey
Your goal during EDA is to develop an understanding of your data. EDA is fundamentally a creative process.
Exploring variation in the data using visualization.
Categorical variables.
In this Chapter the diamonds dataset is explored.
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

diamonds %>%
  count()
diamonds %>%
  count(cut)
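count(cut) is shorthand for grouping by cut and counting the rows in each group, which is also the height of each bar in the bar chart. A small sketch to check that equivalence (the names `by_count` and `by_summarise` are made up for illustration):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data

by_count <- diamonds %>%
  count(cut)

by_summarise <- diamonds %>%
  group_by(cut) %>%
  summarise(n = n())

# Both give one row per cut with the same counts
identical(by_count$n, by_summarise$n)
```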
Continuous variables.
ggplot(data = diamonds) +
  geom_histogram(mapping = aes(x = carat), binwidth = 0.5)

Count a continuous variable within intervals of equal length.
diamonds %>%
  count(cut_width(carat, 0.5))
smaller <- diamonds %>%
  filter(carat < 3)
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.1)

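ggplot(data = smaller, mapping = aes(x = carat, colour = cut)) +
  geom_freqpoly(binwidth = 0.1)

Clusters in the data: with a narrow binwidth, diamonds pile up at and just above round carat values, which suggests that carat weights are rounded up.
ggplot(data = smaller, mapping = aes(x = carat)) +
  geom_histogram(binwidth = 0.01)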

Old Faithful data, another example of clusters.
ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_histogram(binwidth = 0.25)

Outliers:
ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5) +
  coord_cartesian(ylim = c(0, 50))
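coord_cartesian() is used here because it only zooms the view. ggplot2 also has xlim() and ylim(), which set scale limits instead and throw away data outside those limits. A sketch contrasting the two (the object names `base`, `zoomed`, and `limited` are made up for illustration):

```r
library(ggplot2)

base <- ggplot(diamonds) +
  geom_histogram(mapping = aes(x = y), binwidth = 0.5)

# Zoom: every bar is still computed; tall bars are simply cut off by the view
zoomed <- base + coord_cartesian(ylim = c(0, 50))

# Scale limits: bars taller than 50 fall outside the scale and are dropped,
# and ggplot2 warns about removed rows when this plot is printed
limited <- base + ylim(0, 50)
```

Printing `zoomed` and `limited` side by side should show the difference: the common bars vanish from `limited`, while `zoomed` keeps them and makes the rare bars visible.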

unusual <- diamonds %>%
  filter(y < 3 | y > 20) %>%
  select(price, x, y, z) %>%
  arrange(y)
unusual
Replacing unusual values with missing values.
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))
diamonds2
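Replacing values with NA keeps every row, so the other measurements of those diamonds stay available. As a stricter alternative to base ifelse(), dplyr's if_else() checks that both branches have compatible types. A minimal sketch on made-up data (the `toy` tibble and its values are hypothetical):

```r
library(dplyr)

# Hypothetical measurements with two implausible values (0 and 58.9)
toy <- tibble(y = c(0, 4.2, 5.7, 58.9))

# NA_real_ is the missing value of type double, matching the type of y
cleaned <- toy %>%
  mutate(y = if_else(y < 3 | y > 20, NA_real_, y))

# Count how many values were set to missing
sum(is.na(cleaned$y))
```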
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point()

Suppress the warning about removed missing values by setting na.rm = TRUE.
ggplot(data = diamonds2, mapping = aes(x = x, y = y)) +
  geom_point(na.rm = TRUE)

Categorical and Continuous variables.
ggplot(data = diamonds, mapping = aes(x = price)) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)

ggplot(diamonds) +
  geom_bar(mapping = aes(x = cut))

Using density instead of count, so that the area under each frequency polygon is one.
# after_stat(density) replaces the older ..density.. notation (ggplot2 >= 3.3.0)
ggplot(data = diamonds, mapping = aes(x = price, y = after_stat(density))) +
  geom_freqpoly(mapping = aes(colour = cut), binwidth = 500)
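Density here is the count in each bin divided by the group total and by the bin width. The following sketch computes it by hand; the names `density_by_cut` and `areas` are made up, and the bins are anchored with `boundary = 0` as an assumption, so they may not line up exactly with the bins that geom_freqpoly() chooses:

```r
library(dplyr)
library(ggplot2)

binwidth <- 500

# Count diamonds per cut and price bin, then standardise within each cut
density_by_cut <- diamonds %>%
  count(cut, bin = cut_width(price, binwidth, boundary = 0)) %>%
  group_by(cut) %>%
  mutate(density = n / sum(n) / binwidth) %>%
  ungroup()

# Within each cut, density times bin width sums to one
areas <- density_by_cut %>%
  group_by(cut) %>%
  summarise(area = sum(density * binwidth))
areas
```

Because every cut is scaled to the same total area, the shapes of the price distributions can be compared even though the cuts differ a lot in how many diamonds they contain.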

ggplot(data = diamonds, mapping = aes(x = cut, y = price)) +
  geom_boxplot()

ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy))

ggplot(data = mpg) +
  geom_boxplot(mapping = aes(x = reorder(class, hwy, FUN = median), y = hwy)) +
  coord_flip()
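reorder() changes the order of the factor levels, which is what ggplot2 uses to order the boxplots along the axis. A small base R sketch on made-up values (`car_class` and `hwy` below are hypothetical, not the mpg data):

```r
# Hypothetical toy data: three classes with different typical values
car_class <- c("a", "a", "b", "b", "c", "c")
hwy <- c(30, 32, 10, 12, 20, 22)

# Default factor levels are alphabetical
levels(factor(car_class))
# "a" "b" "c"

# reorder() sorts the levels by the median of hwy within each class
ordered_class <- reorder(car_class, hwy, FUN = median)
levels(ordered_class)
# "b" "c" "a"
```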

Two categorical variables. Covariation.
ggplot(data = diamonds) +
  geom_count(mapping = aes(x = cut, y = color))

diamonds %>%
  count(color, cut)
diamonds %>%
  count(color, cut) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = n))

Two continuous variables.
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price), alpha = 1 / 100)

ggplot(data = smaller) +
  geom_bin2d(mapping = aes(x = carat, y = price))

# install.packages("hexbin")
library(hexbin)
ggplot(data = smaller) +
  geom_hex(mapping = aes(x = carat, y = price))

ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_width(carat, 0.1)))

ggplot(data = smaller, mapping = aes(x = carat, y = price)) +
  geom_boxplot(mapping = aes(group = cut_number(carat, 20)))
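The two boxplot versions differ only in how carat is binned: cut_width() makes bins of equal width, while cut_number() makes bins holding roughly equal numbers of points. A sketch on a made-up, right-skewed vector (the values in `x` and the `boundary = 0` anchor are assumptions chosen for illustration):

```r
library(ggplot2)

# Hypothetical right-skewed values, loosely like carat
x <- c(0.2, 0.3, 0.3, 0.4, 0.5, 0.7, 1.0, 1.5, 2.0, 2.9)

# Equal-width bins: the widths match, the counts do not
by_width <- cut_width(x, width = 1, boundary = 0)
table(by_width)

# Equal-count bins: the counts match, the widths do not
by_number <- cut_number(x, n = 2)
table(by_number)
```

With skewed data, equal-width bins leave the upper boxplots based on very few points, which is the reason to consider cut_number() instead.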

ggplot(data = diamonds) +
  geom_point(mapping = aes(x = x, y = y)) +
  coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))

Back to Old Faithful. Seeing relationships between variables and using those relationships to build models.
ggplot(data = faithful) +
  geom_point(mapping = aes(x = eruptions, y = waiting))

library(modelr)
mod <- lm(log(price) ~ log(carat), data = diamonds)
diamonds2 <- diamonds %>%
  add_residuals(mod) %>%
  mutate(resid = exp(resid))
ggplot(data = diamonds2) +
  geom_point(mapping = aes(x = carat, y = resid))

ggplot(data = diamonds2) +
  geom_boxplot(mapping = aes(x = cut, y = resid))
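Because the model is fitted on the log scale, exponentiating a residual turns it into a ratio: the observed price divided by the price predicted from carat alone. The sketch below checks this with base R's fitted() and residuals() rather than modelr::add_residuals(); the names `log_resid` and `ratio` are made up for illustration:

```r
library(ggplot2)  # only for the diamonds data

mod <- lm(log(price) ~ log(carat), data = diamonds)

# Residuals on the log scale: observed log(price) minus fitted log(price)
log_resid <- log(diamonds$price) - fitted(mod)

# Back-transformed: observed price divided by the price predicted from carat
ratio <- diamonds$price / exp(fitted(mod))

all.equal(unname(exp(log_resid)), unname(ratio))
```

A ratio above 1 means the diamond costs more than its carat alone would predict, which is why the boxplot of resid by cut can show the effect of cut once carat is accounted for.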

Same code, more concise.
ggplot(data = faithful, mapping = aes(x = eruptions)) +
  geom_freqpoly(binwidth = 0.25)

ggplot(faithful, aes(eruptions)) +
  geom_freqpoly(binwidth = 0.25)

Turn the end of a pipeline of data transformation into a plot. The value of the pipe.
diamonds %>%
  count(cut, clarity) %>%
  ggplot(aes(clarity, cut, fill = n)) +
  geom_tile()

Chapter 8 is a short chapter that introduces getwd(), setwd(), and Projects. Try out the Files pane on the right.
getwd()
[1] "/home/esuess/classes/2017-2018/Stat6864/Presentations/Chapter7"
ggplot(diamonds, aes(carat, price)) +
  geom_hex()
ggsave("diamonds.pdf")
Saving 7.29 x 4.5 in image

write_csv(diamonds, "diamonds.csv")
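ggsave() and write_csv() write relative to the working directory unless given a full path, and a CSV file does not store R-specific column types. A sketch of both points, assuming a temporary directory as the output location and geom_point() in place of geom_hex() so the hexbin package is not needed (`out_dir`, `pdf_path`, `csv_path`, and `back` are made-up names):

```r
library(ggplot2)
library(readr)

# tempdir() is used only so the sketch does not clutter a project;
# inside a Project, a path relative to the project directory is more typical
out_dir <- tempdir()
pdf_path <- file.path(out_dir, "diamonds.pdf")
csv_path <- file.path(out_dir, "diamonds.csv")

p <- ggplot(diamonds, aes(carat, price)) +
  geom_point(alpha = 1 / 100)

# Passing the plot and size explicitly avoids relying on the last plot shown
ggsave(pdf_path, plot = p, width = 7, height = 4.5)

write_csv(diamonds, csv_path)
back <- read_csv(csv_path, show_col_types = FALSE)

# The ordered factor cut comes back as a plain character column
class(diamonds$cut)
class(back$cut)
```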
Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Ctrl+Alt+I.
When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Ctrl+Shift+K to preview the HTML file).
The preview shows you a rendered HTML copy of the contents of the editor. Consequently, unlike Knit, Preview does not run any R code chunks. Instead, the output of the chunk when it was last run in the editor is displayed.
