BarChart

# Load libraries
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.5.2

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.5.2

## -- Attaching packages -------------------------------- tidyverse 1.2.1 --

## v tibble  2.0.1     v purrr   0.2.5
## v tidyr   0.8.2     v dplyr   0.7.8
## v readr   1.3.1     v stringr 1.3.1
## v tibble  2.0.1     v forcats 0.3.0

## Warning: package 'tibble' was built under R version 3.5.2

## Warning: package 'tidyr' was built under R version 3.5.2

## Warning: package 'readr' was built under R version 3.5.2

## Warning: package 'purrr' was built under R version 3.5.2

## Warning: package 'dplyr' was built under R version 3.5.2

## Warning: package 'stringr' was built under R version 3.5.2

## Warning: package 'forcats' was built under R version 3.5.2

## -- Conflicts ----------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

# The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut.
# The diamonds dataset ships with ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond.
# The chart shows that more diamonds are available with high-quality cuts than with low-quality cuts.

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut))
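
# A minimal sketch of what geom_bar() computes behind the scenes: stat_count() tallies the rows in each cut.
# The same tally can be reproduced with dplyr::count(); the name cut_counts is only illustrative.
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

cut_counts <- count(diamonds, cut)  # one row per cut, with the row count in column n
cut_counts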

# You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():
ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))

# This works because every geom has a default stat, and every stat has a default geom. This means that you can typically use geoms without worrying about the underlying statistical transformation.
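
# One way to check these defaults, as a sketch: a layer object stores both its geom and its stat as ggproto objects, so their classes can be inspected directly.
library(ggplot2)

class(geom_bar()$stat)[1]    # the default stat of geom_bar()
class(stat_count()$geom)[1]  # the default geom of stat_count()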

# There are three reasons you might need to use a stat explicitly:

# 1
# You might want to override the default stat. In the code below, I change the stat of geom_bar() from count (the default) to identity. This lets me map the height of the bars to the raw values of a y variable.
# Unfortunately, when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or to the previous bar chart, where the height of the bar is generated by counting rows.

demo <- tribble(
  ~cut,         ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good",  12082,
  "Premium",    13791,
  "Ideal",      21551
)

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

# 2
# You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportions, rather than counts:

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))
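
# As a rough check on the plot above, the same proportions can be computed by hand with dplyr (cut_props is an illustrative name).
# An aside that is not in the original: ggplot2 3.3.0 and later also accept after_stat(prop) in place of ..prop..
library(ggplot2)
library(dplyr)

cut_props <- diamonds %>%
  count(cut) %>%              # rows per cut
  mutate(prop = n / sum(n))   # share of all diamonds in each cut
cut_props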

# 3
# You might want to draw greater attention to the statistical transformation in your code.
# For example, you might use stat_summary(), which summarises the y values for each unique x value, to draw attention to the summary that you’re computing.
# (From ggplot2 3.3.0 onward, fun.ymin, fun.ymax, and fun.y are deprecated in favour of fun.min, fun.max, and fun.)
ggplot(data = diamonds) + 
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )
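
# A sketch of the numbers stat_summary() draws here: the minimum, median, and maximum depth within each cut, computed with dplyr (depth_summary is an illustrative name).
library(ggplot2)
library(dplyr)

depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    ymin = min(depth),     # lower end of each range
    y    = median(depth),  # the point
    ymax = max(depth)      # upper end of each range
  )
depth_summary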

## EXERCISE

# What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?
# The default geom is geom_pointrange(). Rewritten, the plot becomes:
ggplot(data = diamonds) +
  geom_pointrange(mapping = aes(x = cut, y = depth),
                  stat = "summary",
                  fun.ymin = min,
                  fun.ymax = max,
                  fun.y = median)

# What does geom_col() do? How is it different from geom_bar()?
# geom_bar() uses the stat_count() statistical transformation, so bar heights are row counts.
# geom_col() uses stat_identity(), so it expects a y variable that already holds the bar heights.
# geom_bar(stat = "identity") and geom_col() are equivalent.
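
# A minimal sketch of that equivalence, using counts precomputed with dplyr::count() (cut_counts and the plot names are illustrative):
library(ggplot2)
library(dplyr)

cut_counts <- count(diamonds, cut)  # bar heights are already in column n

p_col <- ggplot(data = cut_counts) +
  geom_col(mapping = aes(x = cut, y = n))
p_bar <- ggplot(data = cut_counts) +
  geom_bar(mapping = aes(x = cut, y = n), stat = "identity")

# Both layers should draw bars of the same height
all.equal(layer_data(p_col)$y, layer_data(p_bar)$y)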

# What variables does stat_smooth() compute? What parameters control its behaviour?
# stat_smooth() calculates four variables:

# y - predicted value
# ymin - lower pointwise confidence interval around the mean
# ymax - upper pointwise confidence interval around the mean
# se - standard error

# See ?stat_smooth for details on the specific parameters. Most importantly, method controls the smoothing method, se determines whether a confidence interval is plotted, and level sets the confidence level of that interval.
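
# A sketch of those parameters in use. The carat-versus-price example and the 0.9 level are illustrative choices, not part of the exercise.
library(ggplot2)

p_smooth <- ggplot(data = diamonds, mapping = aes(x = carat, y = price)) +
  stat_smooth(
    method = "lm",    # fit a straight line
    formula = y ~ x,
    se = TRUE,        # draw the confidence band
    level = 0.9       # 90% interval instead of the default 95%
  )
p_smooth

# layer_data() returns the variables the stat computed
names(layer_data(p_smooth))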

# In our proportion bar chart, we need to set group = 1. Why? In other words, what is the problem with these two graphs?

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

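# If we fail to set group = 1, each level of cut is treated as its own group, so the proportions are calculated within each group rather than across the complete dataset. In the first graph every bar therefore has height 1; in the second, every coloured segment has height 1. Setting group = 1 fixes the first graph:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = stat(prop), group = 1))

# For the second graph, group = 1 would not show the fill colours, because fill cannot vary within a single group. Instead, normalise the counts by the overall total:
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = stat(count / sum(count))))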
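
# A quick way to see what group = 1 changes, as a sketch: layer_data() exposes the values the stat computed, so the prop column can be inspected with and without it (the plot names are illustrative).
library(ggplot2)

p_no_group <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop..))
p_grouped <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = ..prop.., group = 1))

layer_data(p_no_group)$prop  # each cut is its own group
layer_data(p_grouped)$prop   # shares of the whole dataset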

Anubhav Gupta

7 February 2019