Welcome

Ch1 Introduction

The data science project workflow

Prerequisites

  • R
  • RStudio
  • R packages

Install the tidyverse package
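A minimal sketch of the install step, assuming an internet connection and a configured CRAN mirror. The requireNamespace() check is an added convenience so the install is skipped when the package is already present:

```r
# Install the tidyverse once per machine; skipped if it is already installed.
if (!requireNamespace("tidyverse", quietly = TRUE)) {
  install.packages("tidyverse")
}
```

After installing, the package still has to be loaded with library(tidyverse) in every new session.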

Running R code

1+2
## [1] 3

Getting help

  • Google
  • Stack Overflow

Ch2 Introduction to Data Exploration

Ch3 Data Visualization

Set up

library(tidyverse)

data

mpg
## # A tibble: 234 × 11
##    manufacturer model      displ  year   cyl trans drv     cty   hwy fl    class
##    <chr>        <chr>      <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
##  1 audi         a4           1.8  1999     4 auto… f        18    29 p     comp…
##  2 audi         a4           1.8  1999     4 manu… f        21    29 p     comp…
##  3 audi         a4           2    2008     4 manu… f        20    31 p     comp…
##  4 audi         a4           2    2008     4 auto… f        21    30 p     comp…
##  5 audi         a4           2.8  1999     6 auto… f        16    26 p     comp…
##  6 audi         a4           2.8  1999     6 manu… f        18    26 p     comp…
##  7 audi         a4           3.1  2008     6 auto… f        18    27 p     comp…
##  8 audi         a4 quattro   1.8  1999     4 manu… 4        18    26 p     comp…
##  9 audi         a4 quattro   1.8  1999     4 auto… 4        16    25 p     comp…
## 10 audi         a4 quattro   2    2008     4 manu… 4        20    28 p     comp…
## # ℹ 224 more rows
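The type tags under each column name (<chr>, <dbl>, <int>) hint at which variables are categorical and which are numeric. A minimal sketch of listing them programmatically, assuming only ggplot2 (which ships the mpg data) is installed; col_types and categorical_cols are illustrative names:

```r
library(ggplot2)  # provides the mpg dataset

# First class of every column in mpg, as a named character vector.
col_types <- vapply(mpg, function(col) class(col)[1], character(1))
print(col_types)

# Character columns are the natural candidates for categorical variables.
categorical_cols <- names(col_types)[col_types == "character"]
print(categorical_cols)
```

Integer columns such as year and cyl take only a few distinct values, so whether to treat them as continuous or categorical is a judgment call.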

aesthetics

  • x
  • y
  • color
  • size
  • alpha
  • shape
ggplot ( data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) 

ggplot ( data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color ="blue"))
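The second plot above does not come out blue. For comparison, here is a minimal sketch of the difference between mapping a color and setting one, assuming only ggplot2 is loaded (p_mapped and p_set are illustrative names). Inside aes() the string is treated as data; outside aes() it is a fixed property of the layer.

```r
library(ggplot2)

# Inside aes(): "blue" becomes a one-level variable, colored by the default scale.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(): the color is set by hand for every point.
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# layer_data() returns the values ggplot2 actually draws with.
unique(layer_data(p_set)$colour)
```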

1. What’s gone wrong with this code? Why are the points not blue? The points aren’t blue because color = "blue" sits inside aes(), so ggplot2 treats "blue" as a categorical variable with a single value and gives it the first color of the default discrete scale. To make the points blue, set color = "blue" outside aes(), as an argument to geom_point().

2. Which variables in mpg are categorical? Which variables are continuous? The categorical variables in mpg are manufacturer, model, trans, drv, fl, and class. The continuous variables are displ, year, cyl, cty, and hwy.

3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?

ggplot (mpg, aes(x = displ, y = hwy, color = cty))+
  geom_point()

ggplot (mpg, aes(x = displ, y = hwy, size = cty)) +
  geom_point()
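There is no shape version of the plots above because ggplot2 refuses to map a continuous variable to shape. A small sketch that captures the failure instead of stopping the script; build_result is an illustrative name, and the exact wording of the message may differ between ggplot2 versions:

```r
library(ggplot2)

# cty is numeric, so it cannot be mapped to the discrete shape scale.
p_shape <- ggplot(mpg, aes(x = displ, y = hwy, shape = cty)) +
  geom_point()

# The error is raised when the plot is built, so build it inside tryCatch().
build_result <- tryCatch(
  {
    ggplot_build(p_shape)
    "built without error"
  },
  error = function(e) conditionMessage(e)
)
print(build_result)
```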

For question 3, mapping a continuous variable to shape gives an error, because shapes have no natural order that a continuous scale could follow. For color, a continuous variable uses a continuous gradient of blues, while a categorical variable gets a separate, distinct color for each category. For size, a continuous variable makes the points larger as the value increases.

4. What happens if you map the same variable to multiple aesthetics?

ggplot (mpg, aes(x = displ, y = hwy, color = hwy, size = displ))+
  geom_point()

For question 4, ggplot2 draws the plot even though the mapping is redundant. In the code above, hwy is mapped to both y and color, and displ to both x and size, so the extra aesthetics repeat information the axes already show. The result looks cluttered, so this is usually best avoided.

5. What does the stroke aesthetic do? What shapes does it work with? The stroke aesthetic changes the width of the border for shapes 21 to 25. These are the filled shapes, in which the color and size of the border can differ from those of the filled interior.

ggplot (mtcars, aes(wt, mpg))+
  geom_point(shape = 21, color = "black", fill = "white", size = 5, stroke = 5)

6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y. ggplot2 behaves as if a temporary variable had been added to the data, with values equal to the result of the expression. Here displ < 5 is a logical variable, so the points are colored by whether it is TRUE or FALSE.

ggplot (mpg, aes(x = displ, y = hwy, color = displ < 5))+
  geom_point()
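The "temporary variable" idea can be checked directly. A minimal sketch, assuming only ggplot2 is loaded: it adds the logical column by hand (small_engine is a made-up name) and compares the colors ggplot2 assigns in both versions.

```r
library(ggplot2)

# Add the logical column explicitly to a copy of the data.
mpg_flagged <- mpg
mpg_flagged$small_engine <- mpg_flagged$displ < 5

p_expression <- ggplot(mpg, aes(x = displ, y = hwy, color = displ < 5)) +
  geom_point()
p_column <- ggplot(mpg_flagged, aes(x = displ, y = hwy, color = small_engine)) +
  geom_point()

# Compare the per-point colors of the two plots.
identical(layer_data(p_expression)$colour, layer_data(p_column)$colour)
```

The only visible difference between the two plots should be the legend title, which comes from the expression or the column name.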

common problems

  • Sometimes you’ll run the code and nothing happens.
  • Putting the + in the wrong place.

How to get help

  • ? function name
  • Select the function name and press F1
  • Read the error message
  • Google the error message

facets

1. What happens if you facet on a continuous variable? The continuous variable is converted to a categorical one, and the plot gets a separate facet for each distinct value, as the graph below shows.

ggplot (mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(. ~ cty)
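A small sketch to confirm that faceting on cty produces one panel per distinct value, assuming only ggplot2 is loaded. The variable names are illustrative, and the panel count is read from the built plot's layout table:

```r
library(ggplot2)

p_facets <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_grid(. ~ cty)

# Number of distinct cty values in the data.
n_values <- length(unique(mpg$cty))

# Number of panels ggplot2 lays out for the faceted plot.
n_panels <- nrow(ggplot_build(p_facets)$layout$layout)

c(values = n_values, panels = n_panels)
```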

ggplot ( data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_wrap(~class, nrow = 2)

ggplot ( data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class)) +
  facet_grid(drv ~ cyl)

  2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot? The empty cells (facets) are combinations of drv and cyl that have no observations. They are the same positions that have no points in the scatterplot of drv against cyl below.
ggplot ( data = mpg) +
  geom_point(mapping = aes(x = drv, y = cyl))
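The empty combinations can also be listed without a plot. A minimal sketch using a base R cross-tabulation, assuming only ggplot2 (for the mpg data) is loaded; combo_counts and empty_combos are illustrative names:

```r
library(ggplot2)

# Cross-tabulate drv against cyl; a zero marks an empty facet.
combo_counts <- table(drv = mpg$drv, cyl = mpg$cyl)
print(combo_counts)

# Combinations with no observations.
empty_combos <- subset(as.data.frame(combo_counts), Freq == 0)
print(empty_combos)
```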

  3. What plots does the following code make? What does . do? The . symbol leaves that dimension out when faceting. For example, drv ~ . facets by drv in rows (along the y-axis), while . ~ cyl facets by cyl in columns (along the x-axis).
ggplot ( data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))+
  facet_grid(drv ~.)

ggplot ( data = mpg) +
  geom_point(mapping =aes(x = displ, y = hwy))+
  facet_grid(. ~ cyl)

4. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

The advantage of encoding class with facets instead of color is the ability to encode more distinct categories. The disadvantage is that comparing values of observations between categories becomes harder, because the observations sit in separate panels rather than on the same plot.

ggplot ( data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

  5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments? nrow determines the number of rows to use when laying out the facets, and ncol determines the number of columns. They are necessary because facet_wrap() facets on only one variable. These arguments are unnecessary for facet_grid() because the number of unique values of the variables specified in the function determines the number of rows and columns.
  6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why? There is more space for columns when the plot is laid out horizontally, so the variable with more levels fits better there.

geometric objects

different visual object to represent data

ggplot ( data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot ( data = mpg) +
  geom_smooth(mapping = aes(x = displ, y = hwy))

not every aesthetic works with every geom

two geoms in the same graph!

ggplot ( data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()

local vs. global mappings

This makes it possible to display different aesthetics in different layers.

specify different data for each layer

3.6 Exercises

1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

  • line chart: geom_line()
  • boxplot: geom_boxplot()
  • histogram: geom_histogram()
  • area chart: geom_area()

  2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + 
  geom_point() + 
  geom_smooth(se = FALSE)

  3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter? show.legend = FALSE hides the legend for that layer. Removing it brings the legend back, which shrinks the plotting area to make room for it. It was probably used earlier so that the plot kept the same size as the plots shown next to it.

  4. What does the se argument to geom_smooth() do? It controls whether a confidence band is drawn around the smoothed line. The band is shown by default, and se = FALSE removes it.

  5. Will these two graphs look different? Why/why not? The graphs will look the same, because geom_point() and geom_smooth() use the same data and mappings in both. In the first plot the layers inherit them from ggplot(), so they don’t need to be specified again.

ggplot(data = mpg, mapping = aes(x = displ, y = hwy))+
  geom_point()+
  geom_smooth()

ggplot()+
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy))+
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
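One way to check that the two versions are equivalent is to compare the data ggplot2 computes for each layer. A minimal sketch, assuming only ggplot2 is loaded; p_global and p_local are illustrative names, and suppressMessages() only hides the note about the default smoothing method:

```r
library(ggplot2)

p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data of the point layer (1) and the smooth layer (2).
same_points <- isTRUE(all.equal(layer_data(p_global, 1), layer_data(p_local, 1)))
same_smooth <- isTRUE(all.equal(
  suppressMessages(layer_data(p_global, 2)),
  suppressMessages(layer_data(p_local, 2))
))
c(points = same_points, smooth = same_smooth)
```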

  6. Recreate the R code necessary to generate the following graphs.
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(mapping = aes(group = drv), se = FALSE) +
  geom_point()

ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = drv)) +
  geom_smooth(aes(linetype = drv), se = FALSE)

ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(size = 4, color = "white") +
  geom_point(aes(colour = drv))

statistical transformation

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))
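The bar heights in this chart are not a column of diamonds. geom_bar() counts the rows in each cut category before drawing. A minimal sketch that compares those computed counts with a manual tally, assuming only ggplot2 is loaded (bar_heights and manual_counts are illustrative names):

```r
library(ggplot2)

p_bar <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Bar heights computed by the stat behind geom_bar().
bar_heights <- layer_data(p_bar)$count

# The same counts computed directly from the data.
manual_counts <- as.vector(table(diamonds$cut))

all(bar_heights == manual_counts)
```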

position adjustments

adjustments for bar charts

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity),position = "dodge")

adjustments for scatterplots

coordinate systems

switch x and y
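A minimal sketch of switching the axes with coord_flip(). The boxplot of hwy by class is only an illustrative choice and is not taken from these notes:

```r
library(ggplot2)

p_upright <- ggplot(data = mpg, mapping = aes(x = class, y = hwy)) +
  geom_boxplot()

# coord_flip() swaps the axes, so the class labels run down the side.
p_flipped <- p_upright + coord_flip()
p_flipped
```

Flipping is mostly useful for horizontal boxplots and for long category labels that would overlap along the x-axis.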

set the aspect ratio correctly for maps

Polar coordinates reveal an interesting connection between a bar chart and a Coxcomb chart.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = cut))+
  coord_polar()

3.8 exercise questions

1. What is the problem with this plot? How could you improve it? The problem is overplotting: many observations share the same cty and hwy values, so points are drawn on top of each other. Using a jitter position adjustment would spread the points out and reduce the overplotting.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point()
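A sketch of the jittered version of the plot above, assuming only ggplot2 is loaded. The width and height values are illustrative rather than tuned, and the last line counts how many rows repeat an earlier (cty, hwy) pair, which is what causes the overplotting:

```r
library(ggplot2)

# width and height are given in data units; 0.3 is an arbitrary choice.
p_jitter <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_jitter(width = 0.3, height = 0.3)
p_jitter

# Rows whose (cty, hwy) pair already appeared in an earlier row.
sum(duplicated(mpg[, c("cty", "hwy")]))
```

Because the jitter is random, the exact point positions change each time the plot is drawn.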

2. What parameters to geom_jitter() control the amount of jittering? width controls the amount of horizontal displacement, and height controls the amount of vertical displacement.

  3. Compare and contrast geom_jitter() with geom_count(). geom_jitter() adds random variation to the locations of the points. geom_count() sizes the points relative to the number of observations, so combinations of (x, y) values with more observations are drawn larger than those with fewer.

  4. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.

The default position for geom_boxplot() is “dodge2”, which is a shortcut for position_dodge2. This position adjustment does not change the vertical position of a geom but moves the geom horizontally to avoid overlapping other geoms.

ggplot(data = mpg, aes(x = drv, y = hwy, colour = class)) +
  geom_boxplot()

the layered grammar of graphics

The grammar of graphics is based on the insight that you can uniquely describe any plot as a combination of:

  • a dataset,
  • a geom,
  • a set of mappings,
  • a stat,
  • a position adjustment,
  • a coordinate system, and
  • a faceting scheme.
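The seven components can be spelled out in a single call. A minimal sketch, assuming only ggplot2 is loaded; the particular choices (diamonds, a dodged bar chart, faceting by color) are illustrative, and most of them could be left out because they have defaults:

```r
library(ggplot2)

p_layers <- ggplot(data = diamonds) +                  # dataset
  geom_bar(                                            # geom
    mapping = aes(x = cut, fill = clarity),            # set of mappings
    stat = "count",                                    # stat
    position = "dodge"                                 # position adjustment
  ) +
  coord_cartesian() +                                  # coordinate system
  facet_wrap(~ color)                                  # faceting scheme
p_layers
```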

3.9.1 Exercises

  2. What does labs() do? Read the documentation. The labs() function adds axis titles, plot titles, and a caption to the plot.

  3. What’s the difference between coord_quickmap() and coord_map()? The coord_map() function uses map projections to project the three-dimensional Earth onto a two-dimensional plane, while the coord_quickmap() function uses an approximate but faster map projection.

  4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?

p <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline()
p + coord_fixed()
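For the labs() question above, a minimal sketch on the same cty and hwy data, assuming only ggplot2 is loaded. The label wording is made up for illustration:

```r
library(ggplot2)

p_labeled <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() +
  labs(
    title = "Highway vs. city fuel economy",
    x = "City miles per gallon",
    y = "Highway miles per gallon",
    caption = "Data: ggplot2::mpg"
  )
p_labeled
```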