Chapter 1
Prerequisites:
Install the tidyverse once with install.packages(), then load it with the library() function.
install.packages("tidyverse")
Error in install.packages : Updating loaded packages
library(tidyverse)
This one line loads the core tidyverse packages that are needed.
Question: Are cars with big engines more fuel efficient than cars with small engines?
Load the mpg dataset
library(tidyverse)
package ‘tidyverse’ was built under R version 3.3.3
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ------------------------------------------------------------------------------------
filter(): dplyr, stats
lag(): dplyr, stats
mpg
Notes: displ is the car’s engine size in liters; hwy is the car’s fuel efficiency on the highway in miles per gallon (mpg).
Create the graph using ggplot
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy))

ggplot(data=mpg) creates an empty plot using the mpg dataset; geom_point() adds a layer of points, with displ on the x axis and hwy on the y axis.
The template is: ggplot(data = <DATA>) + <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))
Exercises:
ggplot(data=mpg)

Answer: we see an empty plot.
How many rows are in mtcars? How many columns?
dim(mtcars)
[1] 32 11
Answer: 32 rows, and 11 columns
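The same numbers can also be read off one at a time with nrow() and ncol(). A minimal sketch, using only base R and the built-in mtcars data (the variable names n_rows and n_cols are just illustrative):

```r
# mtcars ships with base R (datasets package), so no library() call is needed.
n_rows <- nrow(mtcars)   # number of observations (cars)
n_cols <- ncol(mtcars)   # number of variables

cat("rows:", n_rows, "columns:", n_cols, "\n")
```

This should agree with the dim() output above: 32 rows and 11 columns.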
summary(mtcars)
      mpg             cyl             disp             hp             drat             wt             qsec
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760   Min.   :1.513   Min.   :14.50
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89
 Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695   Median :3.325   Median :17.71
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597   Mean   :3.217   Mean   :17.85
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930   Max.   :5.424   Max.   :22.90
       vs               am              gear            carb
 Min.   :0.0000   Min.   :0.0000   Min.   :3.000   Min.   :1.000
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
 Median :0.0000   Median :0.0000   Median :4.000   Median :2.000
 Mean   :0.4375   Mean   :0.4062   Mean   :3.688   Mean   :2.812
 3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :1.0000   Max.   :1.0000   Max.   :5.000   Max.   :8.000
The summary shows eleven fields. To find out what the drv variable in mpg describes, use the help function:
?mpg
This opens the help page, which shows:
drv: f = front-wheel drive, r = rear wheel drive, 4 = 4wd
Exercise: Make a scatterplot of hwy versus cyl.
ggplot(data=mpg) + geom_point(mapping = aes(x=cyl, y=hwy))

What happens when we make a scatterplot of class versus drv?
ggplot(data=mpg) + geom_point(mapping = aes(x=class, y=drv))

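Why is this plot not useful? Both class and drv are categorical, so the points pile up on a small grid of combinations. The plot shows which combinations occur but not how many cars fall in each, so no pattern is visible.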
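One way to make the overlap visible is to count the cars in each class/drv combination. The sketch below is not part of the original exercise; it assumes ggplot2 is installed and uses table() and geom_count(), with combo_counts and p_count as illustrative names:

```r
library(ggplot2)

# Count how many cars share each class/drv combination.
combo_counts <- table(mpg$class, mpg$drv)
print(combo_counts)

# geom_count() sizes each point by the number of overlapping observations,
# which makes the overplotting visible.
p_count <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
print(p_count)
```

Every car in mpg lands in exactly one cell of the table, so the cell counts add up to the number of rows in the data.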
Looking at the first scatterplot
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy))

You won't easily be able to identify the outliers. Why don't we apply colors to this plot?
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color=class))

Coloring by class quickly shows that many of the outlying points are 2seaters, which are more fuel efficient than other cars with similarly large engines. Other aesthetics you can map are size, alpha (transparency), and shape.
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, size=class))

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy,alpha=class))

ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, shape=class))

Note that for shape, suv is gone! By default ggplot2 uses only six shapes at a time, so the seventh class is left unplotted.
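If all seven classes are needed, one option is to supply the shapes by hand with scale_shape_manual(). This is a sketch rather than the book's approach; the shape codes 0:6 are an arbitrary illustrative choice:

```r
library(ggplot2)

# mpg$class has seven levels, one more than the default shape palette covers.
n_classes <- length(unique(mpg$class))

# Supply one shape code per class so that no group is dropped.
# The codes 0:6 are an arbitrary illustrative choice.
p_shapes <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class)) +
  scale_shape_manual(values = 0:6)
print(p_shapes)
```

Seven shapes are hard to tell apart, so color is usually the easier aesthetic to read for this many groups.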
You can also manually set properties
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy), color="red")

Note: to set an aesthetic manually, put it outside aes(), as an argument of the geom function.
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy), shape=24)

Why is the graph below not blue?
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color="blue"))
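The points are not blue because color="blue" sits inside aes(), where it is treated as data rather than as a setting. The sketch below builds both versions and inspects the colors ggplot2 actually draws; the object names are illustrative, and ggplot_build() is used only to look at the computed layer data:

```r
library(ggplot2)

# Inside aes(): "blue" is treated as a data value (a one-level variable),
# so ggplot2 picks a color from its default palette and adds a legend.
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

# Outside aes(): "blue" is a fixed property of the layer, so points are blue.
p_fixed <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Inspect the colors that each plot actually draws.
mapped_colours <- unique(ggplot_build(p_mapped)$data[[1]]$colour)
fixed_colours  <- unique(ggplot_build(p_fixed)$data[[1]]$colour)
print(mapped_colours)
print(fixed_colours)
```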

What happens when we map the same variable to different aesthetics? (size, shape, color)
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color=class, shape=class, size=class))

?geom_point
What does the stroke aesthetic do? Use the stroke aesthetic to modify the width of a point's border.
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, shape=class,stroke=1))

What happens when you map an aesthetic to something other than a variable name, like aes(color = displ < 5)?
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color= displ<5))
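The expression inside aes() is evaluated against the data, which gives one TRUE or FALSE per car, and ggplot2 then colors the points by that logical value. A small sketch of the intermediate result (the name small_engine is illustrative):

```r
library(ggplot2)

# The expression is computed once per row and yields TRUE or FALSE.
small_engine <- mpg$displ < 5

# How many cars fall on each side of the 5-liter threshold.
table(small_engine)
```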

To make small multiples (called facets in ggplot2), use facet_wrap():
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color= displ <5)) +
facet_wrap(~ class, nrow=2)

To facet on a combination of two variables, add facet_grid() to your plot call.
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color= displ <5)) +
facet_grid(drv ~ cyl)

Use . in place of a variable to avoid faceting in the rows or the columns dimension, like this:
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color= displ <5)) +
facet_grid(. ~ cyl)

What happens if we facet on a continuous variable?
First find which variables are continuous.
summary(mpg)
 manufacturer          model               displ           year            cyl            trans
 Length:234         Length:234         Min.   :1.600   Min.   :1999   Min.   :4.000   Length:234
 Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999   1st Qu.:4.000   Class :character
 Mode  :character   Mode  :character   Median :3.300   Median :2004   Median :6.000   Mode  :character
                                       Mean   :3.472   Mean   :2004   Mean   :5.889
                                       3rd Qu.:4.600   3rd Qu.:2008   3rd Qu.:8.000
                                       Max.   :7.000   Max.   :2008   Max.   :8.000
     drv                 cty             hwy              fl               class
 Length:234         Min.   : 9.00   Min.   :12.00   Length:234         Length:234
 Class :character   1st Qu.:14.00   1st Qu.:18.00   Class :character   Class :character
 Mode  :character   Median :17.00   Median :24.00   Mode  :character   Mode  :character
                    Mean   :16.86   Mean   :23.44
                    3rd Qu.:19.00   3rd Qu.:27.00
                    Max.   :35.00   Max.   :44.00
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy, color= displ <5)) +
facet_grid(drv ~ cty)
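Faceting directly on cty creates one column per distinct value, so many panels are nearly empty. One possible workaround, sketched here as an assumption rather than as the book's answer, is to bin the continuous variable first with ggplot2's cut_number(); the three bins and the column name cty_bin are illustrative choices:

```r
library(ggplot2)

# Bin city mileage into three groups with roughly equal numbers of cars.
# The choice of three bins and the name cty_bin are illustrative assumptions.
mpg_binned <- mpg
mpg_binned$cty_bin <- cut_number(mpg_binned$cty, n = 3)

# Facet on the binned variable instead of the raw continuous one.
p_binned <- ggplot(data = mpg_binned) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cty_bin)
print(p_binned)
```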

A geom is the geometrical object that a plot uses to represent data. Using geom_point:
ggplot(data=mpg) + geom_point(mapping = aes(x=displ, y=hwy))

Using geom_smooth with no arguments
ggplot(data=mpg) + geom_smooth(mapping = aes(x=displ, y=hwy))
Using geom_smooth with linetype argument
ggplot(data=mpg) + geom_smooth(mapping = aes(x=displ, y=hwy, linetype =drv))

The above geom_smooth separates the cars into three lines based on their drv value, which describes a car’s drivetrain. We can overlay geom_point on geom_smooth:
ggplot(data=mpg) + geom_smooth(mapping = aes(x=displ, y=hwy, linetype =drv)) +
geom_point(mapping = aes(x=displ, y=hwy, color=drv))

To avoid having to set the mappings in each geom, you can place a set of mappings in the main ggplot() call. They then become ‘global’ mappings that apply to every geom.
ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) + geom_point() + geom_smooth()

If you place mappings inside a geom function, they are treated as local to that layer and extend or override the global mappings.
ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) +
geom_point(mapping=aes(color=class)) +
geom_smooth()

Using the same technique, we can specify different data for each layer.
ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) +
geom_point(mapping=aes(color=class)) +
geom_smooth(data = filter(mpg, class =='subcompact'),se=FALSE)

Exercises:
How to draw a line chart, a boxplot, a histogram, an area chart?
ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) +
geom_line()

# a boxplot needs a categorical x (or a group), so class is used here instead of displ
ggplot(data=mpg, mapping=aes(x=class,y=hwy)) +
geom_boxplot()

ggplot(data=mpg, mapping=aes(x=displ)) +
geom_histogram()

ggplot(data=mpg, mapping=aes(x=displ,y=hwy)) +
geom_area()

Exercise:
ggplot(data=mpg, mapping=aes(x=displ,y=hwy, color=drv)) +
geom_point() +
geom_smooth(se =FALSE)

Q: What does show.legend = FALSE do? What happens when it is removed?
ggplot(data=mpg, mapping=aes(x=displ,y=hwy, color=drv)) +
geom_point() +
geom_smooth(se =FALSE,show.legend = FALSE)

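With show.legend = FALSE, the smooth lines are left out of the drv legend, which then shows only the points. When the argument is removed, the legend shows both the points and the lines.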
Recreating the lines in the exercises
ggplot(data=mpg, mapping=aes(x=displ,y=hwy))+
geom_point(size=4)+
geom_smooth(se =FALSE)

ggplot(data=mpg, mapping=aes(x=displ,y=hwy))+
geom_point(size=4)+
geom_line(mapping = aes(linetype =drv))

To learn about ggplot2 extensions: http://www.ggplot2-exts.org/
Cheat sheet: https://www.rstudio.com/resources/cheatsheets/
Bar Charts
ggplot(data=diamonds) +
geom_bar(mapping = aes(x=cut))

stat_count() can be used interchangeably with geom_bar(), since geom_bar() uses stat_count() to transform the data into counts.
ggplot(data=diamonds) +
stat_count(mapping = aes(x=cut))
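To check that the two calls really compute the same thing, the layer data can be pulled out with ggplot_build() and compared with counts taken directly from the data. A sketch, with illustrative object names:

```r
library(ggplot2)

p_bar  <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
p_stat <- ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))

# ggplot_build() exposes the computed data behind each layer.
bar_counts  <- ggplot_build(p_bar)$data[[1]]$count
stat_counts <- ggplot_build(p_stat)$data[[1]]$count

# Compare with counts computed directly from the data.
manual_counts <- as.vector(table(diamonds$cut))
print(data.frame(cut = levels(diamonds$cut), bar_counts, stat_counts, manual_counts))
```

The three count columns should be identical, one row per cut.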

This works because every geom has a default stat, and every stat has a default geom. Three reasons to use a stat explicitly:
1. You want to override the default stat.
2. You want to override the default mapping from transformed variables to aesthetics.
3. You want to draw attention to the statistical transformation in the code.
demo <- tribble(
~a, ~b,
"Bar_1",20,
"Bar_2",30,
"Bar_3", 40)
ggplot(data = demo)+
geom_bar(
mapping=aes(x=a, y=b), stat="identity"
)

Reason 2 demo: mapping a different transformed variable to an aesthetic (here, proportions instead of counts)
ggplot(data = diamonds)+
geom_bar(
mapping=aes(x=cut, y=..prop.., group=1)
)
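In newer ggplot2 releases the same plot can be written with after_stat(prop) in place of ..prop.. notation. This sketch assumes ggplot2 3.3.0 or later, where after_stat() is available; the notes above were written against an older version:

```r
library(ggplot2)

# after_stat(prop) refers to the proportion computed by stat_count();
# group = 1 makes the proportions relative to the whole data set.
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))
print(p_prop)

# The bar heights should sum to 1.
prop_total <- sum(ggplot_build(p_prop)$data[[1]]$prop)
```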

Reason 3 demo: drawing attention to the statistical transformation
ggplot(data = diamonds)+
stat_summary(
mapping=aes(x=cut, y=depth),
fun.ymin=min,
fun.ymax=max,
fun.y=median
)
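In ggplot2 3.3.0 and later the arguments fun.ymin, fun.ymax and fun.y are deprecated in favor of fun.min, fun.max and fun. A sketch of the same summary with the newer names, assuming such a version is installed:

```r
library(ggplot2)

p_summary <- ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,    # lower end of each range
    fun.max = max,    # upper end of each range
    fun = median      # point drawn at the median
  )
print(p_summary)

# One summary row per cut, with columns ymin, y and ymax.
summary_data <- ggplot_build(p_summary)$data[[1]]
```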

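How is geom_col() different from geom_bar()? Answer: geom_bar() needs only an x variable and uses stat_count(), so the height of each bar is the number of rows in that group. geom_col() needs both x and y and uses stat_identity(), so the bar heights are the y values in the data (stacked, and therefore summed, when several rows share an x value).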
ggplot(data = diamonds)+
geom_col(
mapping=aes(x=cut,y=carat)
)
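geom_col() is most useful when the heights have already been computed. As a sketch, the counts per cut can be tabulated first and then drawn as they are; the data frame cut_counts and its column n are illustrative names:

```r
library(ggplot2)

# Precompute the number of diamonds per cut; the column name n is our own choice.
cut_counts <- as.data.frame(table(cut = diamonds$cut), responseName = "n")

# geom_col() draws the heights exactly as given in the data.
p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))
print(p_col)
```

The result should look the same as geom_bar(mapping = aes(x = cut)) on the raw diamonds data, because the counting has simply been moved out of the plot.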

ggplot(data=diamonds) +
geom_bar(mapping=aes(x=cut,fill=color, y=..prop..))

Position adjustments for bar charts: coloring with the color aesthetic (bar outlines)
ggplot(data=diamonds) +
geom_bar(mapping=aes(x=cut,color=cut))

Position adjustments for bar charts: coloring with the fill aesthetic (bars are stacked when fill is mapped to a second variable)
ggplot(data=diamonds) +
geom_bar(mapping=aes(x=cut,fill=color))

What happens when we fill using another variable like clarity
ggplot(data=diamonds) +
geom_bar(mapping=aes(x=cut,fill=clarity))

If we don’t want a stacked bar chart, use position="identity". The bars then overlap, so the alpha option adds transparency to make the overlaps visible.
ggplot(data=diamonds,mapping=aes(x=cut,fill=clarity)) +
geom_bar(alpha=1/5, position="identity")

Using fill=NA to make the bars completely transparent, so that only the outlines are drawn
ggplot(data=diamonds,mapping=aes(x=cut,color=clarity)) +
geom_bar(fill=NA, position="identity")

To make it easier to compare proportions across groups, use position="fill"
ggplot(data=diamonds) +
geom_bar(mapping=aes(x=cut,fill=clarity), position="fill")

Use position="dodge" to place the bars beside each other
ggplot(data=diamonds) +
geom_bar(mapping=aes(x=cut,fill=clarity), position="dodge")

Using Jitter to show all the data in a scatter plot
ggplot(data=mpg) +
geom_point( mapping=aes(x=displ, y=hwy), position="jitter")

Adding randomness may seem a strange way to improve a plot, but while it makes the graph less accurate at small scales, it makes the graph more revealing at large scales.
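ggplot2 also provides geom_jitter() as a shorthand for geom_point(position="jitter"), with width and height arguments to control the amount of noise. A sketch; the values 0.1 and 0.5 and the seed are illustrative choices, not recommendations:

```r
library(ggplot2)

set.seed(1)  # make the random jitter reproducible

# width and height limit how far each point may move along x and y.
p_jitter <- ggplot(data = mpg) +
  geom_jitter(mapping = aes(x = displ, y = hwy), width = 0.1, height = 0.5)
print(p_jitter)
```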
Coordinate System
The default coordinate system is the Cartesian coordinate system, where the x and y positions act independently to find the location of each point. Some other coordinate systems:
coord_flip() switches the x and y axes. Useful when you want horizontal box plots.
coord_quickmap() sets the aspect ratio correctly for maps. Important for plotting spatial data.
coord_polar() uses polar coordinates. It reveals an interesting connection between a bar chart and a Coxcomb chart.
# example of coord_flip
ggplot(data =mpg, mapping=aes(x=class, y=hwy))+
geom_boxplot()

ggplot(data =mpg, mapping=aes(x=class, y=hwy))+
geom_boxplot()+
coord_flip()

# example of coord_quickmap()
# map_data() needs the maps package to be installed
library(ggplot2)
ph <- map_data("nz")
ggplot(ph, aes(long,lat, group=group))+
geom_polygon(fill="white", color="black")

ggplot(ph, aes(long,lat, group=group))+
geom_polygon(fill="white", color="black")+
coord_quickmap()

# example of polar coordinates
bar <- ggplot(data=diamonds)+
geom_bar(mapping =aes(x=cut,fill=cut),show.legend=FALSE, width =1) +
theme(aspect.ratio=1)+
labs(x=NULL, y=NULL)
bar + coord_flip()

bar + coord_polar()

Layered grammar of graphics
The seven-parameter template:
ggplot(data=datafile) +
geom_function(mapping=aes(mappingsettings), stat=statsettings, position=positionsetting) +
coordinate_function +
facet_function
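As an illustration, here is one way to fill in all seven parameters at once using the diamonds data. The particular choices (a dodged bar chart, flipped coordinates, facets by color) are assumptions made for the example, not a recommended design:

```r
library(ggplot2)

p_full <- ggplot(data = diamonds) +            # data
  geom_bar(                                    # geom function
    mapping = aes(x = cut, fill = clarity),    # mappings
    stat = "count",                            # stat
    position = "dodge"                         # position adjustment
  ) +
  coord_flip() +                               # coordinate function
  facet_wrap(~ color)                          # facet function
print(p_full)
```

In practice most of these parameters can be left out, because ggplot2 supplies defaults for everything except the data, the mappings and the geom function.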
