Introduction to R Graphics with ggplot2:

http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html

Let’s look at housing prices.

library(ggplot2)
housing <- read.csv("dataSets/landdata-states.csv")

Comparison of histograms:


Old way…kind of lame:


hist(housing$Home.Value)


ggplot way…better?:


ggplot(housing, aes(x = Home.Value)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Colored Scatter Plot example:

ggplot(subset(housing, State %in% c("MA", "TX")),
       aes(x=Date,
           y=Home.Value,
           color=State))+
  geom_point()

Aesthetic Mapping

In ggplot land, an aesthetic means “something you can see”. Examples include position (i.e., on the x and y axes), color (“outside” color), fill (“inside” color), shape (of points), linetype, and size.
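
As a rough illustration of the list above, the sketch below maps several aesthetics in one plot. It uses the built-in mtcars data as a stand-in (an assumption, since the housing file does not ship with R), and the choice of columns is arbitrary.

```r
library(ggplot2)

# mtcars ships with R and stands in for the housing data here (assumption).
cars <- transform(mtcars, cyl = factor(cyl), am = factor(am))

p_aes <- ggplot(cars,
                aes(x = wt, y = mpg,   # position on the x and y axes
                    color = cyl,       # "outside" color
                    shape = am,        # shape of points
                    size = hp)) +      # size of points
  geom_point()
p_aes
```

Each mapped aesthetic gets its own legend, which is a quick way to check what is mapped to what.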

Geometric Objects

Geometric objects are the actual marks we put on a plot. A plot must have at least one geom; there is no upper limit. You can add a geom to a plot using the + operator.

geom_point example:
hp2001Q1 <- subset(housing, Date == 2001.25) 
ggplot(hp2001Q1,
       aes(y = Structure.Cost, x = log(Land.Value))) +
  geom_point()

Adding a prediction line

First, fit a linear regression model with lm() and compute the fitted values with predict():

hp2001Q1$pred.SC <- predict(lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1))

Then add the new variable to the plot as a prediction line:

# base chart
p1 <- ggplot(hp2001Q1, aes(x = log(Land.Value), y = Structure.Cost))

# with prediction line

p1 + geom_point(aes(color = Home.Value)) +
  geom_line(aes(y = pred.SC))

Adding Smoothers

What is a Smoother? https://www.stat.berkeley.edu/~s133/Smooth-a.html

There are various smoothing methods/formulas. The graph below uses loess, which geom_smooth() picks by default when the data has fewer than 1,000 observations, as the single-quarter subset here does.

p1 +
  geom_point(aes(color = Home.Value)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess'
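
The method does not have to be left to the default. A minimal sketch, assuming the built-in mtcars data as a stand-in: with method = "lm", the smoother draws the same straight line that predict() gave in the prediction-line section. The names fit, p_smooth, smooth_line and max_gap are only for illustration.

```r
library(ggplot2)

# mtcars is a stand-in data set (assumption); wt and mpg are its columns.
fit <- lm(mpg ~ wt, data = mtcars)

p_smooth <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # Explicit method and formula: a straight-line fit, no confidence band.
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
p_smooth

# Compare the smoother's line with predict() on the same model.
smooth_line <- layer_data(p_smooth, 2)
max_gap <- max(abs(smooth_line$y -
                     predict(fit, newdata = data.frame(wt = smooth_line$x))))
max_gap   # effectively zero
```

Spelling out method and formula also silences the "using method = ..." message.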

Sidenote…you can add text labels to points with geom_text:
p1 + 
  geom_text(aes(label=State), size = 3)

Aesthetic Mapping vs. Assignment

Confusing…need to research

https://www.r-bloggers.com/ggplot2-mapping-vs-setting/

First, we need to understand that any aesthetic in ggplot2 (such as colour, size, shape, etc.) can be used in two distinct ways in your plots:

Option 1 – you can use the aesthetic to reflect some properties of your data. For example, clarity of the diamonds. This is called MAPPING an aesthetic.

Option 2 - you can choose a certain value for an aesthetic. For example, make the colour blue for ALL points or make the shape a square for ALL points. This is called SETTING an aesthetic and the keyword here is ALL.

When mapping you can convey more insights, whereas when setting you get more control of how your chart looks.

# For example:
#   geom_point(aes(size = 2),   # incorrect! 2 is not a variable
#              color = "red")   # this is fine -- all points red
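
A common pitfall with this distinction is putting a constant inside aes(). The sketch below (again on the stand-in mtcars data, an assumption) compares the colors ggplot2 ends up drawing in each case; the object names are made up for the example.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = wt, y = mpg))   # mtcars is a stand-in (assumption)

# Setting: the constant goes outside aes(), so every point is blue.
p_set <- base + geom_point(color = "blue")

# Pitfall: inside aes(), "blue" is treated as data (a one-level category),
# so the points get the default palette color and a legend appears.
p_mapped_constant <- base + geom_point(aes(color = "blue"))

set_colors    <- unique(layer_data(p_set)$colour)
mapped_colors <- unique(layer_data(p_mapped_constant)$colour)
set_colors      # the literal color that was set
mapped_colors   # a palette color chosen by the scale, not "blue"
```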

Practice example

dat <- read.csv("dataSets/EconomistData.csv")

Mapping vs. Setting:

ex1<-ggplot(dat, aes(x=CPI, y = HDI))

#Mapping
ex1+geom_point(aes(color = HDI.Rank))

#Setting
ex1+geom_point(color = "blue")

Statistical Transformations

Some plot types (such as scatterplots) do not require transformations–each point is plotted at x and y coordinates equal to the original value. Other plots, such as boxplots, histograms, prediction lines etc. require statistical transformations.

Each geom has a default statistic, but these can be changed. For example, the default statistic for geom_histogram is stat_bin (geom_bar uses stat_count in ggplot2 2.0.0 and later).

Arguments to stat_ functions can be passed through geom_ functions. This can be slightly annoying because in order to change it you have to first determine which stat the geom uses, then determine the arguments to that stat.
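
One possible way to do the first step (finding which stat a geom uses) is to inspect the layer object that a geom_ call returns. This is a sketch that peeks at the layer's stat field, which is closer to ggplot2 internals than to its documented interface, so treat it as an assumption that may change between versions. The help page of the geom lists the default stat as well.

```r
library(ggplot2)

# A geom_*() call returns a layer object that records its default stat.
hist_stat <- class(geom_histogram()$stat)[1]
bar_stat  <- class(geom_bar()$stat)[1]
hist_stat   # "StatBin"   -> see ?stat_bin for binwidth, bins, ...
bar_stat    # "StatCount" -> see ?stat_count
```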

# binwidth example:
p2 <- ggplot(housing, aes(x = Home.Value))
p2 + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# with a manually entered binwidth
p2 + geom_histogram(stat = "bin", binwidth=4000)

Sometimes the default statistical transformation is not what you need. A common case is data that is already summarized, where the default stat would try to summarize it a second time. In this case, we can add stat = "identity" to the geom function.

#summarizing the data now:
housing.sum <- aggregate(housing["Home.Value"], housing["State"], FUN=mean)
rbind(head(housing.sum), tail(housing.sum))
##    State Home.Value
## 1     AK  147385.14
## 2     AL   92545.22
## 3     AR   82076.84
## 4     AZ  140755.59
## 5     CA  282808.08
## 6     CO  158175.99
## 46    VA  155391.44
## 47    VT  132394.60
## 48    WA  178522.58
## 49    WI  108359.45
## 50    WV   77161.71
## 51    WY  122897.25
# the data is already summarized, so the plot should not summarize it again
ggplot(housing.sum, aes(x=State, y=Home.Value)) + 
  geom_bar(stat="identity")
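
For pre-summarized data, ggplot2 2.2.0 and later also offer geom_col(), which plots the y values as given. A minimal sketch on stand-in data (mtcars, an assumption) showing that both spellings produce the same bar heights:

```r
library(ggplot2)

# Pre-summarized stand-in data (assumption): mean mpg per cylinder count.
mpg_mean <- aggregate(mtcars["mpg"], mtcars["cyl"], FUN = mean)

p_identity <- ggplot(mpg_mean, aes(x = factor(cyl), y = mpg)) +
  geom_bar(stat = "identity")
p_col <- ggplot(mpg_mean, aes(x = factor(cyl), y = mpg)) +
  geom_col()   # shorthand that plots the y values as given

bar_heights <- layer_data(p_identity)$y
col_heights <- layer_data(p_col)$y
all.equal(bar_heights, col_heights)
```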

Scaling

Scales control how the values of a variable mapped via aes() are turned into visual values. Ex. color = HDI.Rank maps rank to color, and a scale decides which colors are used (say, shades of red).

Scales are modified with a series of functions using a scale_ naming scheme. Try typing scale_ and pressing Tab to see a list of scale modification functions.

Common scale arguments:

  • name: the first argument gives the axis or legend title
  • limits: the minimum and maximum of the scale
  • breaks: the points along the scale where labels should appear
  • labels: the labels that appear at each break
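
The arguments listed above can be sketched in use as follows. The data (mtcars), the titles, and the specific limits and breaks are all assumptions picked for illustration; only the argument names come from the list.

```r
library(ggplot2)

p_scale <- ggplot(mtcars, aes(x = wt, y = mpg, color = hp)) +
  geom_point() +
  scale_x_continuous(name = "Weight (1000 lbs)",   # axis title
                     limits = c(1, 6),             # min and max of the scale
                     breaks = c(2, 4, 6)) +        # where tick labels appear
  scale_color_gradient(name = "Horsepower",        # legend title
                       breaks = c(100, 200, 300),
                       labels = c("100 hp", "200 hp", "300 hp"),
                       low = "pink", high = "red") # still mapped, but in reds
p_scale
```

The color is still mapped to a variable here; the scale only changes how that mapping is drawn.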

ex2 <- ggplot(housing,
              aes(x = State,
                  y = Home.Price.Index)) +
  theme(legend.position = "top",
        axis.text = element_text(size = 6))

ex2 + geom_point(aes(color = Date),
                 alpha = 0.5,
                 size = 1.5,
                 position = position_jitter(width = 0.25, height = 0))

Why use Jitter?

Jitter adds a small amount of random noise to point positions, typically along a discrete axis, to avoid overplotting. It helps reveal points that would otherwise be plotted on top of one another. Just be careful not to add too much noise.
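
To see how much noise is actually added, here is a small sketch on the mpg data that ships with ggplot2 (a stand-in, and an assumption). With width = 0.25 and height = 0, points move at most 0.25 sideways and not at all vertically; the seed argument is optional and only makes the noise repeatable.

```r
library(ggplot2)

# mpg ships with ggplot2 and stands in for the housing data (assumption).
p_jitter <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_point(alpha = 0.5,
             # seed makes the random noise reproducible between runs
             position = position_jitter(width = 0.25, height = 0, seed = 1))
p_jitter

# Inspect the jittered positions that ggplot2 computed.
jittered <- layer_data(p_jitter)
x_shift  <- abs(as.numeric(jittered$x) - round(as.numeric(jittered$x)))
max(x_shift)                   # never more than the width of 0.25
all(jittered$y == mpg$hwy)     # y values are untouched when height = 0
```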