First Bar Chart with the Diamonds Dataset

Author

A Warsaw

Access Library Package - Tidyverse

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Load the Prebuilt Dataset, Diamonds, and View it in the Global Environment

head(diamonds) # shows the first few lines of the dataset
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
data(diamonds) # places the dataset in the global environment

Statistical transformations from R for Data Science

Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The bar graph shows that more diamonds are available with high quality cuts than with low quality cuts.

First Bar Plot

ggplot(data = diamonds) +   # creates a blank plotting canvas; ggplot2 implements the "Grammar of Graphics"
  geom_bar(aes(x = cut))    # draws a bar graph with cut on the x-axis; aes() is short for "aesthetics"
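
To see the counting that geom_bar() performs behind the scenes, the same totals can be computed directly. This is a minimal sketch that assumes dplyr's count() is an acceptable stand-in for the stat; the object name cut_counts is only illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides count()

# Count the number of diamonds in each cut category
cut_counts <- count(diamonds, cut)
cut_counts
```

The n column should match the bar heights in the chart above.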

How Do Bar Charts Work with 2 Variables?

Bar graphs are easy when you have a single categorical variable that assigns each observation to one of several levels. For example, “cut” has the levels Fair, Good, Very Good, Premium, and Ideal, and each observation falls into exactly one of them. But what if you have a table of aggregated data instead, with x = cut and y = frequency? Here is a tibble that holds such a table, followed by a bar graph created from it.

A Tibble/Tribble (think of this like a dataframe)

We will create a frequency table of the cut types that mimics the counting geom_bar() does internally.

demo <- tribble(
  ~cut,        ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good", 12082,
  "Premium",   13791,
  "Ideal",     21551
)

Demo Tibble Bar Plot Looks Just Like Our Other Bar Graphs

ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")

(Don’t worry that you haven’t seen tibble() or tribble() before. You might be able to guess at their meaning from the context, and you’ll learn exactly what they do soon!)
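
Two details may be worth knowing here. First, geom_col() is ggplot2's shorthand for geom_bar(stat = "identity"). Second, because cut in the demo table is plain text rather than an ordered factor, ggplot2 arranges the bars alphabetically. The sketch below is one possible way to keep the quality order; the names cut_levels, demo_ordered, and p_demo are hypothetical and not part of the original tutorial.

```r
library(ggplot2)
library(tibble)

# Desired left-to-right order of the bars (lowest to highest quality)
cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Same frequencies as the demo table, but with cut stored as a factor
demo_ordered <- tibble(
  cut  = factor(cut_levels, levels = cut_levels),
  freq = c(1610, 4906, 12082, 13791, 21551)
)

# geom_col() plots the y values as given, like geom_bar(stat = "identity")
p_demo <- ggplot(data = demo_ordered) +
  geom_col(mapping = aes(x = cut, y = freq))
p_demo
```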

Creating Proportional Bars

You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:

Proportional Bar Graph (Relative Frequencies)

You need group = 1 when plotting proportions. Without it, each bar is treated as its own group, so every proportion equals 1 (try omitting it and see).

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1)) +  # y = after_stat(prop) plots proportions instead of counts; the older stat(prop) is deprecated
  labs(x = "Diamond Cut", y = "Proportion",
       title = "Proportional Bar Graph of Diamond Cuts")  # labs "labels" can label the axis and the chart
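
The proportions in this chart can also be worked out by hand, which may help show what group = 1 does: all cuts share one denominator, the total number of diamonds. This is a minimal sketch using dplyr; cut_props is an illustrative name.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Count diamonds per cut, then divide each count by the overall total
cut_props <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))
cut_props
```

Because every count is divided by the same total, the prop column sums to 1.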

Some Details About Proportion Plot

To find the variables computed by the stat, look for the help section titled “computed variables”.
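
Besides the help page, the computed variables can be inspected on a built plot. The sketch below assumes ggplot2's layer_data() helper is available in your installed version; p_prop and computed are illustrative names.

```r
library(ggplot2)

# Rebuild the proportion plot and keep it in an object
p_prop <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# layer_data() returns the data frame the first layer actually draws,
# including the variables computed by the stat (count and prop here)
computed <- layer_data(p_prop, 1)
computed[, c("x", "count", "prop")]
```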

This is a Different Type of Plot That Shows a Line with Min, Max, and Median Values

You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing.

Line Plot

This is a different way of visualizing the center and spread of depth for each cut: the point marks the median, and the line runs from the minimum to the maximum.

ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min,         # lower end of each line: the minimum depth
    fun.max = max,         # upper end of each line: the maximum depth
    fun = median           # position of each point: median() is base R's median function
  )
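
To make the summary explicit, the same three numbers can be computed first and then plotted. This is a sketch of one roughly equivalent approach, not necessarily what stat_summary() does internally; depth_summary, p_range, and the column names are illustrative.

```r
library(ggplot2)
library(dplyr)

# Compute the minimum, median, and maximum depth for each cut
depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    depth_min    = min(depth),
    depth_median = median(depth),
    depth_max    = max(depth)
  )

# Draw a point at the median with a line from the minimum to the maximum
p_range <- ggplot(data = depth_summary) +
  geom_pointrange(
    mapping = aes(x = cut, y = depth_median,
                  ymin = depth_min, ymax = depth_max)
  )
p_range
```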

Fill vs Color

There’s one more piece of magic associated with bar charts. You can color a bar chart using either the color aesthetic or, more usefully, fill. Notice that “fill =” colors the inside of each bar, whereas “color =” draws a colored outline around it. Alpha sets the level of transparency: alpha = 0 is fully transparent (invisible) and alpha = 1 is fully opaque.

Bar Plot with Alpha Transparency

ggplot(data = diamonds, aes(x = cut, fill = cut)) +  # to color bars by a variable, map fill inside aes(); here fill uses the same variable as the x-axis
  geom_bar(alpha = 0.5) +
  labs(x = "Diamond Cut", y = "Frequency",
       title = "Frequency Bar Graph of Diamond Cuts")
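
For contrast, fill and color can also be set to fixed values outside aes(), which applies one color to every bar instead of mapping color to a variable. This is a small sketch; the color names and the object name p_fixed are arbitrary choices for illustration.

```r
library(ggplot2)

# fill and color are set as constants here, so they go outside aes()
p_fixed <- ggplot(data = diamonds, aes(x = cut)) +
  geom_bar(fill = "steelblue", color = "black", alpha = 0.5)
p_fixed
```

Here fill paints the inside of each bar and color draws the outline, matching the distinction described above.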

Try Stacking Bar Graphs with Position = “stack”

ggplot(data = diamonds) +
  geom_bar(aes(x = cut, fill = clarity), position = "stack") + # fill = clarity colors each bar segment by clarity; position = "stack" (the default for geom_bar) stacks the segments on top of each other
  labs(x = "Diamond Cut", y = "Frequency",
       title = "Stacked Bar Graph of Diamond Cuts by Clarity")

Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity.

Note Position = “dodge” or “identity” or “fill”

The identity position adjustment is more useful for 2d geoms, like points, where it is the default. Position = “fill” works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

Using Position = “fill” the bars fill the vertical space proportionally

ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") + 
  labs(x = "Diamond Cut", y = "Proportion",
       title = "Proportional Bar Graph of Diamond Cuts by Clarity")
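
The heights drawn with position = "fill" correspond to the share of each clarity level within a cut. The sketch below computes those shares by hand with dplyr as a cross-check; clarity_share and share are illustrative names.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Count each cut/clarity combination, then divide by the total for that cut
clarity_share <- diamonds %>%
  count(cut, clarity) %>%
  group_by(cut) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
clarity_share
```

Within each cut the shares add up to 1, which is why every stacked bar reaches the same height.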

Position = “dodge” will get side-by-side bars

ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 0.7, position = "dodge") +
  labs(x = "Diamond Cut", y = "Frequency",
       title = "Side-by-Side Bar Graph of Diamond Cuts by Clarity")

Change the Angle of the X-Axis Labels

When x-axis labels are too long, they may overlap. You can change the text angle with axis.text.x = element_text(angle = 45) inside theme().

ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 0.7, position = "dodge") +
  labs(x = "Diamond Cut", y = "Frequency",
       title = "Side-by-Side Bar Graph of Diamond Cuts by Clarity") +
  theme(axis.text.x = element_text(angle = 45))  # You can change the angle number, which will adjust the text
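
Rotated labels are centered on their tick marks by default, so they can run up into the plot area. A common adjustment, not part of the original example, is to add hjust = 1 so that the end of each label lines up with its tick. A minimal sketch, with p_angled as an illustrative name:

```r
library(ggplot2)

# hjust = 1 right-aligns the rotated text against the tick mark
p_angled <- ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 0.7, position = "dodge") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p_angled
```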

Finally, Make the X-Axis Labels Fit in a Narrow Width

Here is another option for dealing with long x-axis labels. You can use str_wrap() from stringr (loaded with the tidyverse) to break labels onto multiple lines.

ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 0.7, position = "dodge") +
  labs(x = "Diamond Cuts", y = "Frequency",
       title = "Side-by-Side Bar Graph of Diamond Cuts by Clarity") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 5))

Notice that “Very Good” now appears on two lines instead of one.
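
To see what str_wrap() does to the labels on its own, it can be called directly on a character vector. This is a small sketch; cut_labels and wrapped are illustrative names.

```r
library(stringr)

cut_labels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Wrap each label to a target width of 5 characters;
# line breaks are inserted between words as "\n"
wrapped <- str_wrap(cut_labels, width = 5)
wrapped
```

Only labels with more than one word can be split, since str_wrap() breaks lines between words.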

Bonus “Bar-Like” Graph - Polar Point

bar <- ggplot(data = diamonds) +
  geom_bar(aes(x = clarity, fill = clarity),
           show.legend = FALSE, width = 1) +  # show.legend = FALSE hides the legend (change it to TRUE to show it); width = 1 makes the bars touch, and larger values are best avoided
  theme(aspect.ratio = 1)   # changing the ratio stretches or squeezes the graph; for this style of graph it is best to keep the ratio at 1

bar + coord_polar()  # coord_polar() redraws the bars in polar coordinates, which creates this pie-chart-esque graph
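
Because the plot is stored in an object, other coordinate systems can be applied to it in the same way. As one hedged example, coord_flip() swaps the axes to give horizontal bars; the names bar_base and bar_flipped are illustrative.

```r
library(ggplot2)

# Same base plot as above, stored in an object for reuse
bar_base <- ggplot(data = diamonds) +
  geom_bar(aes(x = clarity, fill = clarity),
           show.legend = FALSE, width = 1) +
  theme(aspect.ratio = 1)

# coord_flip() swaps the x and y axes, giving horizontal bars
bar_flipped <- bar_base + coord_flip()
bar_flipped
```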