Chapter 1

Prerequisites:

Install the tidyverse once with install.packages(), then load it with the library() function.

install.packages("tidyverse")
Error in install.packages : Updating loaded packages
library(tidyverse)

This one line loads the core tidyverse packages.

Question: Are cars with big engines more fuel efficient than cars with small engines?

Load the mpg dataset

library(tidyverse)
package ‘tidyverse’ was built under R version 3.3.3
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ------------------------------------------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats
mpg

Notes:
displ is the car’s engine size in liters.
hwy is the car’s fuel efficiency on the highway in miles per gallon (mpg).

Create the graph using ggplot

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy))

ggplot(data=mpg) creates an empty plot using the mpg dataset.
geom_point() adds a layer of points, giving a scatterplot with displ on the x axis and hwy on the y axis.

The template is: ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
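
As a minimal sketch, the template can be filled in with the mpg data used above. The object name p is only an illustrative choice, and cty (city mileage) is swapped in for hwy to show a second instance of the same pattern.

```r
library(ggplot2)  # provides ggplot(), geom_point(), and the mpg dataset

# <DATA> = mpg, <GEOM_FUNCTION> = geom_point, <MAPPINGS> = x and y
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = cty))

p  # printing the object draws the plot
```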

Exercises:

ggplot(data=mpg)

Answer: We see an empty plot.

How many rows are in mtcars? How many columns?

dim(mtcars)
[1] 32 11

Answer: 32 rows, and 11 columns

summary(mtcars)
      mpg             cyl             disp             hp             drat             wt             qsec      
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695   Median :3.325   Median :17.71  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90  
       vs               am              gear            carb      
 Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
 Median :0.0000   Median :0.0000   Median :4.000   Median :2.000  
 Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812  
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
 Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000  

The summary shows eleven fields.

To find out what the drv variable describes, use the help function:

?mpg

This opens the help page, which describes drv as:

drv: f = front-wheel drive, r = rear wheel drive, 4 = 4wd

Exercise: Make a scatterplot of hwy versus cyl

ggplot(data=mpg) + geom_point(mapping = aes(x=cyl, y=hwy))

What happens when we make a scatterplot of class versus drv?

ggplot(data=mpg) + geom_point(mapping = aes(x=class, y=drv))

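Why is this plot not useful?
Answer: Both class and drv are categorical, so the points fall on a small grid and overlap. The plot shows which combinations exist but not how many cars share each one, so no pattern is visible.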

Looking at the first scatterplot

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy))

You won’t easily be able to identify the outliers. Why don’t we apply colors to this plot?

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color=class))

The colors quickly show that the outlying points are 2seaters, which are more fuel efficient than other cars with similarly large engines.
Other options are size, alpha (transparency), and shape.

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, size=class))

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy,alpha=class))

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, shape=class))

Note that with shape, the SUV points are gone! By default ggplot2 uses only six shapes at a time, so the seventh class (suv) is not plotted.
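
If all seven classes need a shape, one option is to supply the shapes by hand. The sketch below assumes scale_shape_manual() with shape codes 0 to 6, which is an arbitrary choice; any seven distinct codes would do.

```r
library(ggplot2)

# mpg has seven classes, one more than the default shape palette covers
n_classes <- length(unique(mpg$class))

# Supply one shape code per class so that no group is dropped
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:(n_classes - 1))
```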

You can also manually set properties

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy), color="red")

Note: to set an aesthetic manually, put it outside aes().

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy), shape=24)

Why is the graph below not blue?

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color="blue"))
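
One way to see why: inside aes(), the string is treated as a data value (a variable with a single level), not as a color, so ggplot2 assigns it a color from its default scale. The sketch below uses layer_data(), which returns the data a layer actually draws; the names p_mapped and p_set are illustrative.

```r
library(ggplot2)

# "blue" inside aes(): mapped like a variable with a single value
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# "blue" outside aes(): set directly as the point color
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

unique(layer_data(p_mapped)$colour)  # a default scale color, not "blue"
unique(layer_data(p_set)$colour)     # "blue"
```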

What happens when we map the same variable to different aesthetics? (size, shape, color)

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color=class, shape=class, size=class))

?geom_point

What does the stroke aesthetic do?
Answer: The stroke aesthetic modifies the width of the border of a point.

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, shape=class,stroke=1))

What happens when you map an aesthetic to something other than a variable name, like aes(color = displ < 5)?

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color= displ<5))
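
The expression displ < 5 is evaluated for every row and yields TRUE or FALSE, so the color legend has two levels. A small sketch of the same computation outside the plot (the name under_5 is illustrative):

```r
library(ggplot2)  # for the mpg dataset

# The logical vector that the color aesthetic is mapped to
under_5 <- mpg$displ < 5

# How many cars fall on each side of the 5-liter cutoff
table(under_5)
```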

How to make small multiples (called facets in ggplot2)

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color= displ <5)) +
  facet_wrap(~ class, nrow=2)

To facet on a combination of two variables, add facet_grid() to your plot call.

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color= displ <5)) +
  facet_grid(drv ~ cyl)

Use . in place of a variable to avoid faceting on the rows or the columns, like this:

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color= displ <5)) +
  facet_grid(. ~ cyl)

What happens if we facet on a continuous variable?

First, find which variables are continuous.

summary(mpg)
 manufacturer          model               displ            year           cyl           trans          
 Length:234         Length:234         Min.   :1.600   Min.   :1999   Min.   :4.000   Length:234        
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999   1st Qu.:4.000   Class :character  
 Mode  :character   Mode  :character   Median :3.300   Median :2004   Median :6.000   Mode  :character  
                                       Mean   :3.472   Mean   :2004   Mean   :5.889                     
                                       3rd Qu.:4.600   3rd Qu.:2008   3rd Qu.:8.000                     
                                       Max.   :7.000   Max.   :2008   Max.   :8.000                     
     drv                 cty             hwy             fl               class          
 Length:234         Min.   : 9.00   Min.   :12.00   Length:234         Length:234        
 Class :character   1st Qu.:14.00   1st Qu.:18.00   Class :character   Class :character  
 Mode  :character   Median :17.00   Median :24.00   Mode  :character   Mode  :character  
                    Mean   :16.86   Mean   :23.44                                        
                    3rd Qu.:19.00   3rd Qu.:27.00                                        
                    Max.   :35.00   Max.   :44.00                                        

Then facet on cty, a numeric variable:

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color= displ <5)) +
  facet_grid(drv ~ cty)
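
Faceting on a numeric variable creates one panel per distinct value, which quickly becomes unreadable. A possible workaround, sketched here, is to bin the variable first. cut_width() is a ggplot2 helper; the bin width of 5 and the names mpg_binned and cty_bin are arbitrary assumptions.

```r
library(ggplot2)

# One facet column is created per distinct cty value
length(unique(mpg$cty))

# Bin cty into intervals of width 5 and facet on the bins instead
mpg_binned <- mpg
mpg_binned$cty_bin <- cut_width(mpg_binned$cty, width = 5)

ggplot(data = mpg_binned) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cty_bin)
```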

A geom is the geometrical object that a plot uses to represent data.

Using geom_point

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy))

Using geom_smooth with no arguments

ggplot(data=mpg) + geom_smooth(mapping = aes(x=displ, y=hwy))

Using geom_smooth with linetype argument

ggplot(data=mpg) + geom_smooth(mapping = aes(x=displ, y=hwy, linetype =drv))

The geom_smooth() call above separates the cars into three lines based on their drv value, which describes a car’s drivetrain.
We can overlay geom_point on geom_smooth

ggplot(data=mpg) + geom_smooth(mapping = aes(x=displ, y=hwy, linetype =drv)) +
  geom_point(mapping = aes(x=displ, y=hwy, color=drv))

To avoid having to set the mappings in each geom, you can place a set of mappings in the main ggplot() call.
They then become ‘global’ and apply to every layer.

ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + geom_point() + geom_smooth()

If you place mappings inside a geom function, they are treated as local to that layer, where they extend or override the global mappings.

ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + 
  geom_point(mapping=aes(color=class)) + 
  geom_smooth()

Using the same technique, we can specify different data for each layer.

ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + 
  geom_point(mapping=aes(color=class)) + 
  geom_smooth(data = filter(mpg, class =='subcompact'),se=FALSE)

Exercises:

Which geoms would you use to draw a line chart, a boxplot, a histogram, and an area chart?

ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + 
  geom_line()

ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + 
  geom_boxplot()

ggplot(data=mpg, mapping=aes(x=displ)) + 
  geom_histogram()

ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + 
  geom_area()

Exercise:

ggplot(data=mpg, mapping=aes(x=displ,y=hwy, color=drv)) + 
  geom_point() +
  geom_smooth(se =FALSE)

Q: What does show.legend = FALSE do? What happens when it is removed?

ggplot(data=mpg, mapping=aes(x=displ,y=hwy, color=drv)) + 
  geom_point() +
  geom_smooth(se =FALSE,show.legend = FALSE)

Answer: The lines from geom_smooth() no longer show up in the drv legend. Removing the argument brings them back.

Recreating the lines in the exercises

ggplot(data=mpg, mapping=aes(x=displ,y=hwy))+
  geom_point(size=4)+
  geom_smooth(se =FALSE)

ggplot(data=mpg, mapping=aes(x=displ,y=hwy))+
  geom_point(size=4)+
  geom_line(mapping = aes(linetype =drv))

To learn about ggplot2 extensions: http://www.ggplot2-exts.org/
Cheat sheet: https://www.rstudio.com/resources/cheatsheets/

Bar Charts

ggplot(data=diamonds) + 
 geom_bar(mapping = aes(x=cut))

stat_count() can be used interchangeably with geom_bar(), since geom_bar() uses stat_count() by default to transform the data into counts.

ggplot(data=diamonds) + 
 stat_count(mapping = aes(x=cut))

This works because every geom has a default stat, and every stat has a default geom.
Three reasons to use a stat explicitly:
1. you want to override the default stat.
2. you want to override the default mapping from transformed variables to aesthetics.
3. you want to draw attention to the statistical transformation in the code.
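
To see the default stat at work, the counts that geom_bar() draws can be compared with counts computed by hand. This is only a sketch: layer_data() returns a layer's data after the statistical transformation, and the variable names are illustrative.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Counts computed by stat_count() inside the plot
plot_counts <- layer_data(p)$count

# The same counts computed directly from the data
manual_counts <- as.vector(table(diamonds$cut))

all(plot_counts == manual_counts)
```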

# Case 1 demo: override the default stat with stat="identity"
demo <- tribble(
  ~a, ~b,
  "Bar_1", 20,
  "Bar_2", 30,
  "Bar_3", 40)
ggplot(data = demo)+
  geom_bar(
    mapping=aes(x=a, y=b), stat="identity"
  )

Case 2 demo: mapping a transformed variable (the proportion) to y instead of the default count

ggplot(data = diamonds)+
  geom_bar(
    mapping=aes(x=cut, y=..prop.., group=1)
  )

Case 3 demo

ggplot(data = diamonds)+
  stat_summary(
    mapping=aes(x=cut, y=depth),
    fun.ymin=min,
    fun.ymax=max,
    fun.y=median
  )
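
Note that in newer ggplot2 releases (3.3.0 and later) the arguments fun.ymin, fun.ymax, and fun.y are deprecated in favor of fun.min, fun.max, and fun. A sketch of the same plot with the newer names, assuming such a version is installed (the name p_summary is illustrative):

```r
library(ggplot2)

p_summary <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,   # lower end of each range
    fun.max = max,   # upper end of each range
    fun = median     # point drawn at the median
  )

p_summary
```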

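How is geom_col() different from geom_bar()?
Answer: geom_bar() uses stat_count() by default, so it needs only an x variable and sets the height of each bar to the number of rows in that group. geom_col() uses stat_identity(), so it needs both x and y and sets the height of each bar from the y values in the data. When a group has several rows, their y values are stacked, so the bar shows their sum.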

ggplot(data = diamonds)+
  geom_col(
    mapping=aes(x=cut,y=carat)
  )
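
A sketch of the difference: counting the rows first and passing the result to geom_col() gives the same bar heights that geom_bar() computes on its own. The data frame name cut_counts is illustrative.

```r
library(ggplot2)

# Count rows per cut by hand; as.data.frame() yields columns cut and Freq
cut_counts <- as.data.frame(table(cut = diamonds$cut))

# geom_col() draws the precomputed values as they are
ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = Freq))

# geom_bar() computes the same counts itself
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))
```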

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut,fill=color, y=..prop..))

Coloring bar charts using the color aesthetic

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut,color=cut))

Coloring bar charts using the fill aesthetic

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut,fill=color))

What happens when we fill using another variable, like clarity? The bars are stacked automatically.

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut,fill=clarity))

If we don’t want a stacked bar chart, we can set the position argument.
With position=“identity” the bars overlap, so the alpha option adds transparency to make the overlaps visible.

ggplot(data=diamonds,mapping=aes(x=cut,fill=clarity)) +
  geom_bar(alpha=1/5, position="identity")

Using fill=NA to make the bars completely transparent

ggplot(data=diamonds,mapping=aes(x=cut,color=clarity)) +
  geom_bar(fill=NA, position="identity")

To make it easier to compare proportions across groups, use position=“fill”

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut,fill=clarity), position="fill")

Use position=“dodge” to place the bars beside each other

ggplot(data=diamonds) +
  geom_bar(mapping=aes(x=cut,fill=clarity), position="dodge")

Using Jitter to show all the data in a scatter plot

ggplot(data=mpg) +
  geom_point( mapping=aes(x=displ, y=hwy), position="jitter")

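Many cars share the same rounded displ and hwy values, so their points are drawn on top of each other. position=“jitter” adds a small amount of random noise to each point to spread them out. Adding randomness may seem a strange way to improve a plot, but while it makes the graph less accurate at small scales, it makes the graph more revealing at large scales.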
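
The amount of jitter can also be controlled. The sketch below first checks how many points overlap and then uses position_jitter() with explicit width, height, and seed values. The specific numbers are arbitrary assumptions, and the seed argument assumes a reasonably recent ggplot2.

```r
library(ggplot2)

# Distinct (displ, hwy) pairs versus total rows: the gap is overplotting
n_distinct_points <- nrow(unique(mpg[, c("displ", "hwy")]))
c(distinct = n_distinct_points, total = nrow(mpg))

# Jitter with a fixed seed so the plot looks the same on every run
ggplot(data = mpg) +
  geom_point(
    mapping = aes(x = displ, y = hwy),
    position = position_jitter(width = 0.1, height = 0.3, seed = 1)
  )
```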

Coordinate System

The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to determine the location of each point. Some other coordinate systems:

coord_flip() switches the x and y axes. Useful when you want horizontal box plots.
coord_quickmap() sets the aspect ratio correctly for maps. Important for plotting spatial data.
coord_polar() uses polar coordinates. It reveals an interesting connection between a bar chart and a coxcomb chart.

# example of coord_flip
ggplot(data =mpg, mapping=aes(x=class, y=hwy))+
  geom_boxplot()

ggplot(data =mpg, mapping=aes(x=class, y=hwy))+
  geom_boxplot()+
  coord_flip()

# example of coord_quickmap() (map_data() requires the maps package to be installed)
library(ggplot2)
ph <- map_data("nz")
ggplot(ph, aes(long,lat, group=group))+
  geom_polygon(fill="white", color="black")

ggplot(ph, aes(long,lat, group=group))+
  geom_polygon(fill="white", color="black")+
  coord_quickmap()

# example of polar coordinates
bar <- ggplot(data=diamonds)+
  geom_bar(mapping =aes(x=cut,fill=cut),show.legend=FALSE, width =1) +
  theme(aspect.ratio=1)+
  labs(x=NULL, y=NULL)
bar + coord_flip()

bar + coord_polar()

Layered grammar of graphics

The seven-parameter template:

ggplot(data = <DATA>) +
  <GEOM_FUNCTION>(
    mapping = aes(<MAPPINGS>),
    stat = <STAT>,
    position = <POSITION>
  ) +
  <COORDINATE_FUNCTION> +
  <FACET_FUNCTION>
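
As a sketch, here is one way to fill in all seven parameters with the diamonds data. The particular choices (the count stat, the fill position, coord_flip(), and faceting by color) and the name p_full are only illustrative.

```r
library(ggplot2)

p_full <- ggplot(data = diamonds) +           # data
  geom_bar(                                   # geom function
    mapping = aes(x = cut, fill = clarity),   # mappings
    stat = "count",                           # stat
    position = "fill"                         # position
  ) +
  coord_flip() +                              # coordinate function
  facet_wrap(~ color)                         # facet function

p_full
```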

