Notes from ggplot2 book

Chapter 2

#2.2.1 Exercises

Datasets in ggplot2:

diamonds: Prices of 50,000 round cut diamonds

economics, economics_long: US economic time series

faithfuld: 2d density estimate of Old Faithful data

midwest: Midwest demographics

mpg: Fuel economy data from 1999 and 2008 for 38 popular models of car

msleep: An updated and expanded version of the mammals sleep dataset

presidential: Terms of 11 presidents from Eisenhower to Obama

seals: Vector field of seal movements

txhousing: Housing sales in TX

luv_colours: colors() in Luv space

Exercise 3

library(ggplot2)
library(dplyr)

# L/100 km = 100 * 3.78541 / (mpg * 1.60934); gallons per 100 miles = 100 / mpg
mpg_fuel_economy <- mpg %>%
  mutate(fuel_econ_cty_metric = 100 * 3.78541 / (cty * 1.60934),
         fuel_econ_hwy_metric = 100 * 3.78541 / (hwy * 1.60934),
         fuel_econ_cty_us = 100 / cty,
         fuel_econ_hwy_us = 100 / hwy)
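
A small helper makes the direction of the conversion easier to check. This is only a sketch: the function name mpg_to_l_per_100km is my own, and the constants are the usual km-per-mile and litres-per-US-gallon factors. The point is that litres go in the numerator, so a higher mpg has to give a lower L/100 km.

```r
# Hypothetical helper (not from the book): miles per US gallon -> litres per 100 km.
mpg_to_l_per_100km <- function(miles_per_gallon) {
  km_per_mile <- 1.60934
  litres_per_gallon <- 3.78541
  100 * litres_per_gallon / (miles_per_gallon * km_per_mile)
}

# Higher mpg should give lower consumption.
mpg_to_l_per_100km(c(10, 20, 40))
```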

Exercise 4

tally_mpg <- mpg %>% group_by(manufacturer) %>% tally()

tally_mpg %>% arrange(-n)

## # A tibble: 15 x 2
##    manufacturer     n
##    <chr>        <int>
##  1 dodge           37
##  2 toyota          34
##  3 volkswagen      27
##  4 ford            25
##  5 chevrolet       19
##  6 audi            18
##  7 hyundai         14
##  8 subaru          14
##  9 nissan          13
## 10 honda            9
## 11 jeep             8
## 12 pontiac          5
## 13 land rover       4
## 14 mercury          4
## 15 lincoln          3

I don’t understand the drive train comment…fell down internet rabbit hole. Still confused what the question wants me to remove about the observations.

mpg <- mpg
reduced_mpg <- mpg %>% select(manufacturer, model, year)

tally_reduced_mpg <- reduced_mpg %>% group_by(manufacturer) %>% unique() %>% tally()

tally_reduced_mpg %>% arrange(-n)

## # A tibble: 15 x 2
##    manufacturer     n
##    <chr>        <int>
##  1 toyota          12
##  2 chevrolet        8
##  3 dodge            8
##  4 ford             8
##  5 volkswagen       8
##  6 audi             6
##  7 nissan           6
##  8 hyundai          4
##  9 subaru           4
## 10 honda            2
## 11 jeep             2
## 12 land rover       2
## 13 lincoln          2
## 14 mercury          2
## 15 pontiac          2
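
One possible reading of the drive train part of Exercise 4: some model names carry a trailing tag such as "4wd" or "quattro", and the question asks what happens when that tag is stripped before counting. The sketch below assumes that reading. The helper name strip_drive_train and the list of tags are my own guesses from looking at the model column.

```r
library(ggplot2)
library(dplyr)

# Assumed reading of the exercise: drop a trailing drive-train tag from the model name.
strip_drive_train <- function(model) {
  sub(" (quattro|2wd|4wd|awd)$", "", model)
}

# Count rows per model after removing the tag, largest first.
mpg %>%
  mutate(model_base = strip_drive_train(model)) %>%
  count(model_base, sort = TRUE)
```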

#2.3.1 Exercises

This plot doesn’t tell me anything I don’t know from just one of the variables cty or hwy. They have a straightforward relationship.

(ggplot(mpg, aes(cty, hwy))+
   geom_point())

Not particularly useful, because there isn’t an easy way to read anything off the display. Model and manufacturer are both categorical variables (factors without order), so a scatter plot of them isn’t very interesting. Flipping the x and y axes is a bit more interesting, and a bar plot might be better still.

ggplot(mpg, aes(model, manufacturer)) + geom_point()

ggplot(mpg, aes(manufacturer, model)) + geom_point()

ggplot(mpg, aes(manufacturer)) + geom_bar()

Predictions:

3.1 – I already saw this above. cty and hwy will be very correlated.

3.2 – I imagine that as carat increases, price will also increase.

3.3 – I think this will show me unemployment over time, and there will be spikes during recessions.

3.4 – Not sure how the histogram will look. There won’t be much close to zero, because even an efficient car uses a lot of fuel.

AFTER: I didn’t have the variables straight in my head. I was thinking fuel consumption, not miles per gallon. So a few vehicles are quite low, and there is a gap on the right tail with some outliers, probably subcompact cars or electric vehicles.

ggplot(mpg, aes(cty, hwy)) + geom_point()

ggplot(diamonds, aes(carat, price)) + geom_point()

ggplot(economics, aes(date, unemploy)) + geom_line()

ggplot(mpg, aes(cty)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

#2.4.1 Exercises

1. Map color, shape, and size aesthetics to:

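Continuous values: color and size work, but a continuous variable can’t be mapped to shape.

# color in diamonds is a categorical grade (D to J), so table stands in as the continuous variable
ggplot(diamonds, aes(carat, price, color = table, size = depth)) + geom_point()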

categorical values

#ggplot(diamonds, aes(carat, price, color = clarity, size = depth, shape = cut)) + geom_point(alpha = 0.2)
#Using shapes for an ordinal variable is not advised

# alpha is a fixed setting, not a mapping, so it goes outside aes()
ggplot(diamonds, aes(carat, price, color = clarity, size = depth)) + geom_point(alpha = 0.2)

Continuous variable to shape: can’t, because there are too many distinct values for each to get its own shape.

ggplot(diamonds, aes(carat, price)) + geom_point(alpha = 0.2)

Drive train to fuel economy. Again with drive train!!!?!? OK, figured out what a drive train is and that it corresponds to the drv variable.

ggplot(mpg_fuel_economy, aes(drv, fuel_econ_cty_us)) +
  geom_boxplot()

b) Drive train to engine size and class. Wow, this is more interesting than I thought. I would like to figure out how to make the jitter a little tighter. I would also like to order class in my data so that it goes from likely small to likely big, e.g. 2seater to subcompact to compact … to suv.

ggplot(mpg_fuel_economy, aes(x = class, y = fuel_econ_cty_us, size = displ, color = drv)) + geom_jitter(alpha = 0.3)
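
A possible way to get both wishes, as a sketch: geom_jitter() takes width and height arguments, and wrapping class in factor() with explicit levels fixes the axis order. The level order and the 0.15 width below are my own guesses, and 100 / cty is written inline so the example doesn’t depend on the earlier data frame.

```r
library(ggplot2)

# Assumed ordering from likely smallest to likely largest vehicle class.
class_order <- c("2seater", "subcompact", "compact", "midsize", "minivan", "pickup", "suv")

p_jitter <- ggplot(mpg, aes(x = factor(class, levels = class_order), y = 100 / cty,
                            size = displ, color = drv)) +
  # width controls horizontal spread; height = 0 leaves the y values untouched
  geom_jitter(alpha = 0.3, width = 0.15, height = 0) +
  labs(x = "class", y = "gallons per 100 miles (city)")
p_jitter
```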

#2.5.1 Exercises

The issue with faceting by a continuous variable like hwy is that there are many different values of hwy, so the graph is busy and doesn’t really help me understand the differences between vehicles with different hwy. I am also uncertain whether in-between values of hwy (e.g. 17.3) would just be dropped or squeezed into a neighboring plot.

Facet by a continuous variable like hwy

ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  facet_wrap(~hwy)
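
As far as I can tell, facet_wrap() gives every distinct value its own panel, which is why the plot above gets so busy. One way around that is to bin the variable first. ggplot2 ships cut_number(), cut_width(), and cut_interval() for this. The sketch below uses cut_number() with 4 bins, which is an arbitrary choice on my part.

```r
library(ggplot2)

# cut_number() splits hwy into 4 bins with roughly equal counts (4 is an arbitrary choice).
p_binned <- ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  facet_wrap(~ cut_number(hwy, 4))
p_binned

# Number of panels the unbinned version produces: one per distinct hwy value.
length(unique(mpg$hwy))
```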

b) Since there are only 4 values for a non-continuous variable like cyl, it makes comparisons easier, but I think color on a single graph might work even better.

Facet by cyl

ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  facet_wrap(~cyl)

Faceting to explore fuel economy, engine size, and cylinders. By faceting on cylinder I can see the differences much more clearly. Let’s try color though for the first one. Okay, that’s a long enough diversion. I would like to figure out how to select colors. I don’t tend to like yellow in graphs as I find it hard to see, and some of the colors chosen automatically here are difficult to distinguish.

ggplot(mpg_fuel_economy, aes(fuel_econ_cty_us, displ)) +
  geom_point()

ggplot(mpg_fuel_economy, aes(fuel_econ_cty_us, displ)) +
  geom_point() +
  facet_wrap(~cyl)

library(viridis)  # provides scale_color_viridis()

ggplot(mpg_fuel_economy, aes(fuel_econ_cty_us, displ, color = factor(cyl))) +
  geom_point() +
  scale_color_viridis(discrete = TRUE) + theme_bw()
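
On selecting colors by hand: scale_color_manual() accepts a named vector that maps each level to a color. A minimal sketch follows. The four hex codes are just ones I picked to avoid yellow, and 100 / cty is computed inline so the example stands alone.

```r
library(ggplot2)

# Hypothetical palette: four hand-picked hex colors, one per cyl value in mpg (4, 5, 6, 8).
cyl_colors <- c("4" = "#1b9e77", "5" = "#d95f02", "6" = "#7570b3", "8" = "#e7298a")

p_manual <- ggplot(mpg, aes(100 / cty, displ, color = factor(cyl))) +
  geom_point() +
  scale_color_manual(values = cyl_colors, name = "cyl") +
  theme_bw()
p_manual
```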

The arguments that control the number of rows and columns of facets in facet_wrap() are nrow and ncol.
The scales argument is fixed by default; setting it to free_x or free_y lets one axis vary between panels (and free frees both). If different groups had very different x ranges I might use that. But without an obvious example to test right now I would be scared to use it, because I think it would be easy for a reader to miss that the scales differ between facets.
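
A small sketch of both arguments together, reusing the cyl facets from above. The choice of nrow = 2 and free_y is only for illustration.

```r
library(ggplot2)

p_free <- ggplot(mpg, aes(displ, cty)) +
  geom_point() +
  # two rows of panels; each panel gets its own y axis, x stays shared
  facet_wrap(~cyl, nrow = 2, scales = "free_y")
p_free
```

Comparing this against the fixed-scale version side by side is probably the quickest way to see how easy the axis change is to miss.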

#2.6.6 Exercises

What’s the problem with the graph below? You again, old friend. Logically I expect that cty and hwy values are highly correlated, so the graph itself isn’t super interesting. Is there a geom discussed recently that would work better? I think data is being hidden due to overlapping. Trying jitter, which reveals a lot more data points, but doesn’t tell me anything more interesting about the relationship between cty and hwy than was evident from the first graph. For fun, I’m going to add a linear model to it. I am sure there is a way for me to get summary statistics from a linear model.

ggplot(mpg, aes(cty, hwy)) +
  geom_point()

ggplot(mpg, aes(cty, hwy)) +
  geom_jitter()

ggplot(mpg, aes(cty, hwy)) +
  geom_jitter() +
  geom_smooth(method = "lm")

Okay quick side detour: https://rdrr.io/r/stats/lm.html

library(moderndive)

## Warning: package 'moderndive' was built under R version 3.6.1

score_model <- lm(hwy ~ cty, data = mpg)

get_regression_table(score_model)

## # A tibble: 2 x 7
##   term      estimate std_error statistic p_value lower_ci upper_ci
##   <chr>        <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl>
## 1 intercept    0.892     0.469      1.90   0.058   -0.032     1.82
## 2 cty          1.34      0.027     49.6    0        1.28      1.39

get_regression_summaries(score_model)

## # A tibble: 1 x 8
##   r_squared adj_r_squared   mse  rmse sigma statistic p_value    df
##       <dbl>         <dbl> <dbl> <dbl> <dbl>     <dbl>   <dbl> <dbl>
## 1     0.914         0.913  3.04  1.74  1.75     2459.       0     2

get_regression_table(score_model, print = TRUE)

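|term      | estimate| std_error| statistic| p_value| lower_ci| upper_ci|
|:---------|--------:|---------:|---------:|-------:|--------:|--------:|
|intercept |    0.892|     0.469|     1.902|   0.058|   -0.032|    1.816|
|cty       |    1.337|     0.027|    49.585|   0.000|    1.284|    1.391|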
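
The same numbers can also be pulled from the fitted model with base R, without moderndive. This is a sketch. The name fit is mine, and the model is the same hwy ~ cty regression as above.

```r
library(ggplot2)  # only for the mpg data

fit <- lm(hwy ~ cty, data = mpg)

coef(fit)                    # intercept and slope
confint(fit)                 # 95% confidence intervals
summary(fit)$r.squared       # proportion of variance explained
summary(fit)$coefficients    # estimates, standard errors, t statistics, p values
```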

Changing factor ordering

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot()

Ok so how to reorder. http://www.cookbook-r.com/Manipulating_data/Changing_the_order_of_levels_of_a_factor/ but actually finally used this to figure it out https://rstudio-pubs-static.s3.amazonaws.com/7433_4537ea5073dc4162950abb715f513469.html.

# Levels listed from (likely) smallest to largest vehicle class
new_levels <- c("2seater", "subcompact", "compact", "midsize", "minivan", "pickup", "suv")

mpg$class <- ordered(mpg$class, levels = new_levels)

ggplot(mpg, aes(class, hwy)) + 
  geom_boxplot()

Now read about reorder as a tear runs down my face.
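
For the record, a sketch of what reorder() does here: it sorts the levels of class by a summary of hwy, so no level vector has to be typed out by hand. Using the median as the summary is my choice, since the boxplot already draws medians.

```r
library(ggplot2)

# reorder() sorts the levels of class by a summary of hwy (the median here).
p_reordered <- ggplot(mpg, aes(reorder(class, hwy, FUN = median), hwy)) +
  geom_boxplot() +
  labs(x = "class")
p_reordered
```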

Question isn’t super clear, but I will assume histogram since it asked about bin width. Wow, very small bin widths are much more interesting. With the 0.01 bin width you can see sharp peaks at several different places, often right by an integer value.

ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = .1)

ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = .01)

Exploring price distribution in the diamonds dataset and by cut.

# colour (not fill) separates the lines in a freqpoly; price is in dollars, so use a wide bin
ggplot(diamonds, aes(price, colour = cut)) +
  geom_freqpoly(binwidth = 500)

ggplot(diamonds, aes(price)) +
  geom_freqpoly(binwidth = 500) + 
  facet_wrap(~cut, nrow = 1)

ggplot(diamonds, aes(price)) +
  geom_freqpoly(binwidth = 500) + 
  facet_wrap(~cut, ncol = 1)

5. I feel geom_violin is a bit complicated to understand quickly if you are not familiar with that type of plot. geom_freqpoly is OK, but I worry it can be read as a line graph without realizing it is binning the data. geom_histogram is fairly straightforward, and I just love how varying the bin width on the diamonds data from 1 to 0.1 tells a very different story. I think faceting offers a lot of options with graphs, but pulls them apart in ways that are difficult to understand at times. With geom_histogram, fill can be confusing because it isn’t clear whether the bars are stacked or cover each other up.

ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = 1)

ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = 0.1)

ggplot(diamonds, aes(carat)) +
  geom_histogram(binwidth = .01)
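
On the stacked-or-overlapping question for fill: geom_histogram() stacks by default (position = "stack"), and position = "identity" makes the bars overlap instead. A sketch comparing the two follows. The binwidth and alpha values are arbitrary picks of mine.

```r
library(ggplot2)

# Default: bars for each cut are stacked on top of each other.
p_stack <- ggplot(diamonds, aes(carat, fill = cut)) +
  geom_histogram(binwidth = 0.1)

# position = "identity" draws every cut from zero, so bars overlap; alpha keeps them visible.
p_overlap <- ggplot(diamonds, aes(carat, fill = cut)) +
  geom_histogram(binwidth = 0.1, position = "identity", alpha = 0.4)

p_stack
p_overlap
```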

Read geom_bar documentation; weight means

“There are two types of bar charts: geom_bar() and geom_col(). geom_bar() makes the height of the bar proportional to the number of cases in each group (or if the weight aesthetic is supplied, the sum of the weights).”

and

“stat_count() understands the following aesthetics (required aesthetics are in bold):

group

weight”

Hmmm. Going to reflect on this a bit. Not sure I completely understand.

# geom_bar is designed to make it easy to create bar charts that show
# counts (or sums of weights)
g <- ggplot(mpg, aes(class))
# Number of cars in each class:
g + geom_bar()

# Total engine displacement of each class
g + geom_bar(aes(weight = displ))
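
One way to convince myself what weight does: compute the per-class sums by hand and draw them with geom_col(). If the weight aesthetic really is "sum of the weights", the two charts should show the same bar heights. This is a sketch. The names displ_by_class and total_displ are mine.

```r
library(ggplot2)
library(dplyr)

# Sum displ within each class by hand.
displ_by_class <- mpg %>%
  group_by(class) %>%
  summarise(total_displ = sum(displ))

# geom_col() draws the precomputed totals; the bars should match
# ggplot(mpg, aes(class)) + geom_bar(aes(weight = displ)).
p_col <- ggplot(displ_by_class, aes(class, total_displ)) +
  geom_col()
p_col
```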

How to visualize a 2d categorical distribution:

model and manufacturer. Graphs are not very interesting. I am a bit stumped about how to use the plots from just this chapter to show the relationship between these two categorical variables. There are also so many manufacturers and models that displaying them is a curveball in itself.

ggplot(mpg, aes(model, manufacturer)) + geom_jitter()

ggplot(mpg, aes(manufacturer, model)) + geom_jitter()

ggplot(mpg, aes(manufacturer, fill = model)) + geom_bar() + facet_wrap(~year)

ggplot(mpg, aes(model)) + geom_bar() + facet_wrap(~manufacturer)

ggplot(mpg, aes(manufacturer)) + geom_bar() + facet_wrap(~model)

trans and class

Here I think there are more interesting stories to be told from the data, but the trans variable still seems very “busy”, so it is hard to tease out the meaning.

ggplot(mpg, aes(trans, class)) + geom_jitter()

ggplot(mpg, aes(class, trans)) + geom_jitter()

ggplot(mpg, aes(class, fill = trans)) + geom_bar() + facet_wrap(~manufacturer)

ggplot(mpg, aes(trans, fill = class)) + geom_bar() + facet_wrap(~manufacturer)

ggplot(mpg, aes(class, fill = manufacturer)) + geom_bar() + facet_wrap(~trans)

ggplot(mpg, aes(manufacturer, fill = trans)) + geom_bar() + facet_wrap(~class)

cyl and trans

ggplot(mpg, aes(cyl)) + geom_bar() + facet_wrap(~trans)

ggplot(mpg, aes(trans)) + geom_bar() + facet_wrap(~cyl)

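# cyl is an integer; factor() makes fill treat it as discrete so the bars split by cylinder
ggplot(mpg, aes(trans, fill = factor(cyl))) + geom_bar() + facet_wrap(~manufacturer)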
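
Two geoms from outside this chapter seem better suited to a pair of categorical variables, so treat this as a sketch rather than the intended answer: geom_count() sizes a point by how many rows share each combination, and geom_tile() on pre-counted data gives a heatmap. class and trans are used as the example pair.

```r
library(ggplot2)
library(dplyr)

# geom_count() sizes each point by the number of rows at that (x, y) combination.
p_count <- ggplot(mpg, aes(class, trans)) +
  geom_count()

# Alternative: count the pairs first, then shade one tile per combination.
pair_counts <- count(mpg, class, trans)
p_tile <- ggplot(pair_counts, aes(class, trans, fill = n)) +
  geom_tile()

p_count
p_tile
```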

#My own exercises

1. Experiment with qplot

qplot(class, trans, data = mpg)

qplot(manufacturer, model, data = mpg)

qplot(class, drv, data = mpg)

###Notes to self after finishing chapter 2

Lots of learning. Biggest help is actually just playing around and listening to my own questions that I have after the exercises in the book. I think it helps solidify things in my memory when they come from my own internal voice that asks, “Sooo how would I do….?”

Learned a lot about cars.

Learned a bit about Rmarkdown. I need to come up with my own style of writing in these, so I think I will try to find some style guides and also pay more attention when I see it being used by someone else online. A big issue is that I get distracted and want to look things up, and then suddenly I am 5 tangents removed from trying to figure out how to visualize two categorical variables together.

Biggest homerun for me though? Just doing it. I have a big issue with “Perfect is the enemy of good.” It is easy for me not to finish something because I think it won’t be perfect looking or nicely polished. I reflected on how I might feel in 5 years when I look back at my early coding. If it looks terrible, isn’t that a great sign because it means I have progressed a lot!

I would also love to do a TidyTuesday in a group to apply things I am learning from this book (and other sources).


Mara Alexeev

8/6/2019
