library(tidyverse)
## -- Attaching packages --------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.2.0 v purrr 0.3.2
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 0.8.3 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## Warning: package 'dplyr' was built under R version 3.6.1
## -- Conflicts ------------------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(viridis)
## Warning: package 'viridis' was built under R version 3.6.1
## Loading required package: viridisLite
#2.2.1 Exercises
Datasets in ggplot2:
diamonds: Prices of 50,000 round cut diamonds
economics, economics_long: US economic time series
faithfuld: 2d density estimate of Old Faithful data
midwest: Midwest demographics
mpg: Fuel economy data from 1999 and 2008 for 38 popular models of car
msleep: An updated and expanded version of the mammals sleep dataset
presidential: Terms of 11 presidents from Eisenhower to Obama
seals: Vector field of seal movements
txhousing: Housing sales in TX
luv_colours: colors() in Luv space
Exercise 3
mpg_fuel_economy <- mpg %>%
  mutate(
    # Metric: litres per 100 km (1 mile = 1.60934 km, 1 US gallon = 3.78541 L)
    fuel_econ_cty_metric = 100 * 3.78541 / (cty * 1.60934),
    fuel_econ_hwy_metric = 100 * 3.78541 / (hwy * 1.60934),
    # US: gallons per 100 miles
    fuel_econ_hwy_us = 100 / hwy,
    fuel_econ_cty_us = 100 / cty
  )
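A quick sanity check on the metric conversion, written as a small helper. The function name is my own (hypothetical, not from the book). The idea is to go from miles per gallon to km per litre first, then invert: a car that does more miles per gallon should need fewer litres per 100 km.

```r
# Hypothetical helper (not from the book): miles per US gallon -> litres per 100 km.
mpg_to_l_per_100km <- function(miles_per_gallon) {
  km_per_mile <- 1.60934
  litres_per_gallon <- 3.78541
  # Distance gets bigger in km, fuel volume gets bigger in litres
  km_per_litre <- miles_per_gallon * km_per_mile / litres_per_gallon
  100 / km_per_litre
}

# Higher mpg should give a lower litres-per-100-km figure
converted <- mpg_to_l_per_100km(c(10, 20, 30))
converted
```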
Exercise 4
tally_mpg <- mpg %>% group_by(manufacturer) %>% tally()
tally_mpg %>% arrange(-n)
## # A tibble: 15 x 2
## manufacturer n
## <chr> <int>
## 1 dodge 37
## 2 toyota 34
## 3 volkswagen 27
## 4 ford 25
## 5 chevrolet 19
## 6 audi 18
## 7 hyundai 14
## 8 subaru 14
## 9 nissan 13
## 10 honda 9
## 11 jeep 8
## 12 pontiac 5
## 13 land rover 4
## 14 mercury 4
## 15 lincoln 3
I don’t understand the drive train comment. I fell down an internet rabbit hole and am still confused about what the question wants me to remove from the observations.
mpg <- mpg
reduced_mpg <- mpg %>% select(manufacturer, model, year)
tally_reduced_mpg <- reduced_mpg %>% group_by(manufacturer) %>% unique() %>% tally()
tally_reduced_mpg %>% arrange(-n)
## # A tibble: 15 x 2
## manufacturer n
## <chr> <int>
## 1 toyota 12
## 2 chevrolet 8
## 3 dodge 8
## 4 ford 8
## 5 volkswagen 8
## 6 audi 6
## 7 nissan 6
## 8 hyundai 4
## 9 subaru 4
## 10 honda 2
## 11 jeep 2
## 12 land rover 2
## 13 lincoln 2
## 14 mercury 2
## 15 pontiac 2
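One possible reading of the drive train question, as a rough sketch: some model names carry the drive train as a trailing word (for example "a4 quattro" or "pathfinder 4wd"), so stripping that word and counting distinct base models per manufacturer might be what is being asked. The suffix pattern below is my own assumption about how the drive train shows up in the names, and model_base is a made-up column name.

```r
library(dplyr)
library(stringr)

# Assumption: the drive train is always a trailing token such as "4wd" or "quattro".
drive_train_suffix <- " (2wd|4wd|awd|quattro)$"

models_per_manufacturer <- ggplot2::mpg %>%
  # Drop the trailing drive train token, e.g. "a4 quattro" -> "a4"
  mutate(model_base = str_remove(model, drive_train_suffix)) %>%
  # Keep one row per manufacturer/base model, ignoring year and trim
  distinct(manufacturer, model_base) %>%
  count(manufacturer, sort = TRUE)

models_per_manufacturer
```

Unlike the tally above, this drops year from the grouping, so each model is counted once rather than once per model year.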
#2.3.1 Exercises
ggplot(mpg, aes(cty, hwy)) +
  geom_point()
ggplot(mpg, aes(model, manufacturer)) + geom_point()
ggplot(mpg, aes(manufacturer, model)) + geom_point()
ggplot(mpg, aes(manufacturer)) + geom_bar()
ggplot(mpg, aes(cty, hwy)) + geom_point()
ggplot(diamonds, aes(carat, price)) + geom_point()
ggplot(economics, aes(date, unemploy)) + geom_line()
ggplot(mpg, aes(cty)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Exercise 2.4.1
1. Map color, shape, and size aesthetics to:
ggplot(diamonds, aes(carat, price, color = color, size = depth)) + geom_point()
#ggplot(diamonds, aes(carat, price, color = clarity, size = depth, shape = cut)) + geom_point(alpha = 0.2)
#Using shapes for an ordinal variable is not advised
# alpha is a fixed setting here, so it belongs in geom_point() rather than aes()
ggplot(diamonds, aes(carat, price, color = clarity, size = depth)) + geom_point(alpha = 0.2)
ggplot(diamonds, aes(carat, price)) + geom_point(alpha = 0.2)
ggplot(mpg_fuel_economy, aes(drv, fuel_econ_cty_us)) +
geom_boxplot()
b) Drive train to engine size and class. Wow, this is more interesting than I thought. I would like to figure out how to make the jitter a little tighter. I would also like to order class in my data so that it runs from likely small to likely big, e.g. 2seater to subcompact to compact … to suv.
ggplot(mpg_fuel_economy, aes(x = class, y = fuel_econ_cty_us, size = displ, color = drv)) + geom_jitter(alpha = 0.3)
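On tightening the jitter: geom_jitter() takes width and height arguments that set how far points may move in each direction. A minimal sketch, using cty directly instead of my derived column so the block runs on its own; the value 0.15 is just a guess at what looks tidy.

```r
library(ggplot2)

# width sets the horizontal spread (in x-axis units);
# height = 0 leaves the y values exactly where the data puts them.
p_tight <- ggplot(mpg, aes(x = class, y = cty, color = drv)) +
  geom_jitter(width = 0.15, height = 0, alpha = 0.3)

p_tight
```

Setting height = 0 seems like the safer default when the y axis is the measurement I care about, since only the categorical axis gets nudged.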
#2.5.1 Exercises
ggplot(mpg, aes(displ, cty)) +
geom_point() +
facet_wrap(~hwy)
b) Since there are only 4 values for the non-continuous variable cyl, faceting makes comparisons easier, but I think color on a single graph might work even better.
Facet by cyl
ggplot(mpg, aes(displ, cty)) +
geom_point() +
facet_wrap(~cyl)
ggplot(mpg_fuel_economy, aes(fuel_econ_cty_us, displ)) +
geom_point()
ggplot(mpg_fuel_economy, aes(fuel_econ_cty_us, displ)) +
geom_point() +
facet_wrap(~cyl)
ggplot(mpg_fuel_economy, aes(fuel_econ_cty_us, displ, color = factor(cyl))) +
geom_point() +
scale_color_viridis(discrete = TRUE) + theme_bw()
The arguments that control the number of rows and columns of facets in facet_wrap() are nrow and ncol.
The scales argument keeps the scales fixed by default and lets you free one axis with free_x or free_y (or both with free) if desired. If different groups had very different x axes I might use that. But without an obvious example to test right now I would be scared to use it, because I think it would be easy for a reader not to notice that the scale changes between facets.
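To make the facet_wrap() arguments concrete, here is a small sketch on the same displ/cty plot. The choice of one row versus two columns is arbitrary and only there to show the effect.

```r
library(ggplot2)

base_plot <- ggplot(mpg, aes(displ, cty)) +
  geom_point()

# All four cyl panels in a single row, shared axes (the default, scales = "fixed")
p_fixed <- base_plot + facet_wrap(~cyl, nrow = 1)

# Two columns, and each panel gets its own y axis range
p_free_y <- base_plot + facet_wrap(~cyl, ncol = 2, scales = "free_y")

p_fixed
p_free_y
```

Comparing the two side by side shows the risk mentioned above: with free_y the panels look similar at a glance even though the axis ranges differ.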
#2.6.6 Exercises
ggplot(mpg, aes(cty, hwy)) +
geom_point()
ggplot(mpg, aes(cty, hwy)) +
geom_jitter()
ggplot(mpg, aes(cty, hwy)) +
geom_jitter() +
geom_smooth(method = "lm")
Okay quick side detour: https://rdrr.io/r/stats/lm.html
library(moderndive)
## Warning: package 'moderndive' was built under R version 3.6.1
score_model <- lm(hwy ~ cty, data = mpg)
get_regression_table(score_model)
## # A tibble: 2 x 7
## term estimate std_error statistic p_value lower_ci upper_ci
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 intercept 0.892 0.469 1.90 0.058 -0.032 1.82
## 2 cty 1.34 0.027 49.6 0 1.28 1.39
get_regression_summaries(score_model)
## # A tibble: 1 x 8
## r_squared adj_r_squared mse rmse sigma statistic p_value df
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.914 0.913 3.04 1.74 1.75 2459. 0 2
get_regression_table(score_model, print = TRUE)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 0.892 | 0.469 | 1.902 | 0.058 | -0.032 | 1.816 |
| cty | 1.337 | 0.027 | 49.585 | 0.000 | 1.284 | 1.391 |
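To connect the table back to the plot: the intercept and cty estimates are the same numbers that define the straight line geom_smooth(method = "lm") draws. A sketch using base R's coef() to pull them out and geom_abline() to draw the line by hand; the variable name model_coefs is my own.

```r
library(ggplot2)

score_model <- lm(hwy ~ cty, data = mpg)

# Named numeric vector with entries "(Intercept)" and "cty";
# these should match the estimate column in the regression table
model_coefs <- coef(score_model)
model_coefs

# Draw the fitted line manually from the two coefficients
p_line <- ggplot(mpg, aes(cty, hwy)) +
  geom_jitter() +
  geom_abline(
    intercept = model_coefs[["(Intercept)"]],
    slope = model_coefs[["cty"]],
    color = "blue"
  )

p_line
```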
ggplot(mpg, aes(class, hwy)) +
geom_boxplot()
Ok, so how to reorder? I started with http://www.cookbook-r.com/Manipulating_data/Changing_the_order_of_levels_of_a_factor/ but finally figured it out using https://rstudio-pubs-static.s3.amazonaws.com/7433_4537ea5073dc4162950abb715f513469.html
mpg <- mpg
# A plain character vector is enough here; its order is the level order used below
new_levels <- c("2seater", "subcompact", "compact", "midsize", "minivan", "pickup", "suv")
mpg$class <- ordered(mpg$class, levels = new_levels)
ggplot(mpg, aes(class, hwy)) +
geom_boxplot()
ggplot(diamonds, aes(carat)) +
geom_histogram(binwidth = .1)
ggplot(diamonds, aes(carat)) +
geom_histogram(binwidth = .01)
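# price is in dollars, so bins need to be hundreds of dollars wide;
# freqpoly draws lines, so cut is mapped to color rather than fill
ggplot(diamonds, aes(price, color = cut)) +
  geom_freqpoly(binwidth = 500)
ggplot(diamonds, aes(price)) +
  geom_freqpoly(binwidth = 500) +
  facet_wrap(~cut, nrow = 1)
ggplot(diamonds, aes(price)) +
  geom_freqpoly(binwidth = 500) +
  facet_wrap(~cut, ncol = 1)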
5. I feel geom_violin is a bit complicated to understand quickly if you are not familiar with that type of plot. geom_freqpoly is ok, but I worry it can be read as a line graph rather than as binned data. geom_histogram is fairly straightforward, and I just love how varying the bin width on the diamonds data from 1 to 0.1 tells a very different story. I think faceting offers a lot of options, but it pulls the graphs apart in ways that are difficult to understand at times. With geom_histogram, fill can be confusing because it isn’t clear whether the bars are stacked or cover each other up.
ggplot(diamonds, aes(carat)) +
geom_histogram(binwidth = 1)
ggplot(diamonds, aes(carat)) +
geom_histogram(binwidth = 0.1)
ggplot(diamonds, aes(carat)) +
geom_histogram(binwidth = .01)
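On the stacked-or-overlapping question: geom_histogram() uses position = "stack" by default, so filled groups sit on top of each other. Setting position = "identity" draws every group from zero, so they overlap. A small sketch to compare the two; the alpha value is an arbitrary choice to keep the overlapping bars visible.

```r
library(ggplot2)

# Default: bars for each cut are stacked, so total bar height = all diamonds in the bin
p_stacked <- ggplot(diamonds, aes(carat, fill = cut)) +
  geom_histogram(binwidth = 0.1)

# position = "identity": every cut starts at zero and the bars overlap
p_overlaid <- ggplot(diamonds, aes(carat, fill = cut)) +
  geom_histogram(binwidth = 0.1, position = "identity", alpha = 0.4)

p_stacked
p_overlaid
```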
“There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights).”
and
"stat_count() understands the following aesthetics (required aesthetics are in bold):
x
group
weight
y"
Hmmm. Going to reflect on this a bit. Not sure I completely understand.
# geom_bar is designed to make it easy to create bar charts that show
# counts (or sums of weights)
g <- ggplot(mpg, aes(class))
# Number of cars in each class:
g + geom_bar()
# Total engine displacement of each class
g + geom_bar(aes(weight = displ))
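One way I can make sense of the weight aesthetic: geom_bar() with a weight adds up that column within each group instead of counting rows, which should be the same as summing first and handing the totals to geom_col(). A sketch to check that; displ_by_class and total_displ are names I made up.

```r
library(ggplot2)
library(dplyr)

# geom_bar does the summing itself: bar height = sum of displ within each class
p_weighted <- ggplot(mpg, aes(class)) +
  geom_bar(aes(weight = displ))

# Same totals computed up front, then drawn as-is with geom_col
displ_by_class <- mpg %>%
  group_by(class) %>%
  summarise(total_displ = sum(displ))

p_col <- ggplot(displ_by_class, aes(class, total_displ)) +
  geom_col()

p_weighted
p_col
```

So geom_bar() is for raw rows that still need counting or summing, and geom_col() is for heights that are already in the data.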
model and manufacturer. The graphs are not very interesting. I am a bit stumped about how to use only the plots from this chapter to show the relationship between these two categorical variables. The large number of manufacturers and models also throws a curveball in displaying them.
ggplot(mpg, aes(model, manufacturer)) + geom_jitter()
ggplot(mpg, aes(manufacturer, model)) + geom_jitter()
ggplot(mpg, aes(manufacturer, fill = model)) + geom_bar() + facet_wrap(~year)
ggplot(mpg, aes(model)) + geom_bar() + facet_wrap(~manufacturer)
ggplot(mpg, aes(manufacturer)) + geom_bar() + facet_wrap(~model)
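A possible alternative for two categorical variables, going slightly beyond the geoms I have used so far (so treat it as a side experiment): geom_count() places one point per combination and sizes it by the number of rows there, which avoids the random scatter of geom_jitter(). The axis text rotation is only there to keep the manufacturer names readable.

```r
library(ggplot2)

# One point per manufacturer/model combination, sized by how many rows it has
p_count <- ggplot(mpg, aes(manufacturer, model)) +
  geom_count() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_count
```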
trans and class
Here I think there are more interesting stories to be told from the data, but the trans variable still seems very “busy”, so it is hard to tease out the meaning.
ggplot(mpg, aes(trans, class)) + geom_jitter()
ggplot(mpg, aes(class, trans)) + geom_jitter()
ggplot(mpg, aes(class, fill = trans)) + geom_bar() + facet_wrap(~manufacturer)
ggplot(mpg, aes(trans, fill = class)) + geom_bar() + facet_wrap(~manufacturer)
ggplot(mpg, aes(class, fill = manufacturer)) + geom_bar() + facet_wrap(~trans)
ggplot(mpg, aes(manufacturer, fill = trans)) + geom_bar() + facet_wrap(~class)
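One idea for making trans less busy, as a sketch: the values look like "auto(l5)" or "manual(m6)", so keeping only the leading letters collapses them to two groups. The column name trans_type is my own, and the pattern assumes every value starts with lowercase letters before the bracket.

```r
library(ggplot2)
library(dplyr)
library(stringr)

mpg_simple <- mpg %>%
  # Keep just the leading word: "auto(l5)" -> "auto", "manual(m6)" -> "manual"
  mutate(trans_type = str_extract(trans, "^[a-z]+"))

p_trans <- ggplot(mpg_simple, aes(class, fill = trans_type)) +
  geom_bar()

p_trans
```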
cyl and trans
ggplot(mpg, aes(cyl)) + geom_bar() + facet_wrap(~trans)
ggplot(mpg, aes(trans)) + geom_bar() + facet_wrap(~cyl)
# cyl is numeric, so wrap it in factor() to get one fill color per cylinder count
ggplot(mpg, aes(trans, fill = factor(cyl))) + geom_bar() + facet_wrap(~manufacturer)
#My own exercises
1. Experiment with qplot
qplot(class, trans, data = mpg)
qplot(manufacturer, model, data = mpg)
qplot(class, drv, data = mpg)
###Notes to self after finishing chapter 2
Lots of learning. The biggest help is actually just playing around and listening to the questions I have after the exercises in the book. I think it helps solidify things in my memory when they come from my own internal voice that asks, “Sooo how would I do….?”
Learned a lot about cars.
Learned a bit about R Markdown. I need to come up with my own style of writing in these, so I think I will try to find some style guides and also pay more attention when I see it being used by someone else online. A big issue is that I get distracted and want to look things up, and then suddenly I am 5 tangents removed from trying to figure out how to visualize two categorical variables together.
Biggest home run for me though? Just doing it. I have a big issue with “Perfect is the enemy of good.” It is easy for me not to finish something because I think it won’t look perfect or nicely polished. I reflected on how I might feel in 5 years when I look back at my early coding. If it looks terrible, isn’t that a great sign? It means I have progressed a lot!
I would also love to do a TidyTuesday in a group to apply things I am learning from this book (and other sources).