── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Load the Prebuilt Dataset, Diamonds, and View it in the Global Environment
head(diamonds) # shows the first few lines of the dataset
# A tibble: 6 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
data(diamonds) # places the dataset in the global environment
Statistical transformations from R for Data Science
Bar charts seem simple, but they are interesting because they reveal something subtle about plots. Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The bar graph shows that more diamonds are available with high quality cuts than with low quality cuts.
First Bar Plot
ggplot(data = diamonds) + # creates a blank plotting area to build on (the "Grammar of Graphics")
  geom_bar(aes(x = cut)) # draws a bar graph with cut on the x-axis; aes is short for "aesthetics"
How Do Bar Charts Work with 2 Variables?
Bar graphs are easy when you have a single categorical variable that assigns each observation to one of several levels. For example, “cut” has the levels Fair, Good, Very Good, Premium, and Ideal, and each observation falls into exactly one of them. But what if you have a table of aggregated data, with x = cut and y = frequency? Here is a tibble that holds such a table, along with how you can create a bar graph from it.
A Tibble/Tribble (think of this like a dataframe)
We will create a frequency table of the types of cuts that mimics the calculation geom_bar() performs behind the scenes.
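The code that builds the table is not shown above. A minimal sketch, assuming the table is named demo (the name the plot call uses) and holds the per-cut counts used in R for Data Science, could look like this. The counts are typed in by hand here, and the last line compares them with what the data itself gives.

```r
library(tibble)
library(ggplot2)  # provides the diamonds dataset

# Assumed definition of demo: a hand-entered frequency table.
# tribble() lets you lay out a tibble row by row.
demo <- tribble(
  ~cut,        ~freq,
  "Fair",       1610,
  "Good",       4906,
  "Very Good", 12082,
  "Premium",   13791,
  "Ideal",     21551
)

# Counts computed directly from the data, for comparison with the table above.
table(diamonds$cut)
```

Because cut is stored as plain text in demo, the bars are drawn in alphabetical order rather than in quality order.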
Demo Tibble Bar Plot Looks Just Like Our Other Bar Graphs
ggplot(data = demo) +
  geom_bar(mapping = aes(x = cut, y = freq), stat = "identity")
(Don’t worry that you haven’t seen tibble() or tribble() before. You might be able to guess at their meaning from the context, and you’ll learn exactly what they do soon!)
Creating Proportional Bars
You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:
Proportional Bar Graph (Relative Frequencies)
You need group = 1 when plotting proportions. Without it, each bar is treated as its own group, so every proportion equals 1 (try omitting it and see).
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1)) + # after_stat(prop) plots proportions; use it instead of stat(prop), which is deprecated
  labs(x = "Diamond Cut", y = "Proportion", title = "Proportional Bar Graph of Diamond Cuts") # labs() ("labels") sets the axis labels and the chart title
Some Details About Proportion Plot
To find the variables computed by the stat, look for the section titled “Computed variables” on its help page. For geom_bar(), that is ?stat_count, which lists count and prop.
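Besides reading the help page, you can inspect the computed variables directly. The sketch below uses ggplot2's layer_data() to pull out the data frame that the stat produced for the proportion plot. The object names p and computed are just for illustration.

```r
library(ggplot2)

p <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

# layer_data() returns the data the stat computed for the first layer.
computed <- layer_data(p, 1)

# count and prop are the variables computed by stat_count().
computed[, c("x", "count", "prop")]
```

With group = 1 the proportions are taken over the whole dataset, so the prop column sums to 1.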
This is a Different Type of Plot That Shows a Line with Min, Max, and Median Values
You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarizes the y values for each unique x value, to draw attention to the summary that you’re computing.
Line Plot
This is a different way of visualizing the center and spread of depth for each cut
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.min = min, # function for the lower end of each line: the minimum depth
    fun.max = max, # function for the upper end of each line: the maximum depth
    fun = median # function for the point on each line: median() is a built-in R function
  )
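To see the numbers behind the lines, you can compute the same three summaries yourself. This is a sketch using dplyr, not the code stat_summary() runs internally. The name depth_summary and the column names are chosen for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# One row per cut, with the same three summaries the plot draws.
depth_summary <- diamonds %>%
  group_by(cut) %>%
  summarise(
    min_depth    = min(depth),
    median_depth = median(depth),
    max_depth    = max(depth)
  )

depth_summary
```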
Fill vs Color
There’s one more piece of magic associated with bar charts. You can color a bar chart using either the color aesthetic or, more usefully, fill. Notice that “fill =” fills the inside of the bar, whereas “color =” draws a colored outline around it. Alpha sets the level of transparency: alpha = 0 is fully transparent (invisible) and alpha = 1 is fully opaque.
Bar Plot with Alpha Transparency
ggplot(data = diamonds, aes(x = cut, fill = cut)) + # map fill inside aes(), here within ggplot(); using the same variable as the x-axis gives each cut its own color
  geom_bar(alpha = 0.5) +
  labs(x = "Diamond Cut", y = "Frequency", title = "Frequency Bar Graph of Diamond Cuts")
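For comparison, here is a sketch of the same chart using color instead of fill. Only the outline of each bar changes, and the inside keeps the default fill. The object name outlined is just for illustration.

```r
library(ggplot2)

# color = cut outlines each bar; the inside of every bar stays the default color.
outlined <- ggplot(data = diamonds, aes(x = cut, color = cut)) +
  geom_bar()

outlined
```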
Try Stacking Bar Graphs with Position = “stack”
ggplot(data = diamonds) +
  geom_bar(aes(x = cut, fill = clarity), position = "stack") + # fill = clarity colors each bar by the clarity variable; position = "stack" stacks the clarity groups on top of each other
  labs(x = "Diamond Cut", y = "Frequency", title = "Stacked Bar Graph of Diamond Cuts by Clarity")
Note what happens if you map the fill aesthetic to another variable, like clarity: the bars are automatically stacked. Each colored rectangle represents a combination of cut and clarity.
Note Position = “dodge” or “identity” or “fill”
The identity position adjustment is more useful for 2d geoms, like points, where it is the default. Position = “fill” works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
Using Position = “fill” the bars fill the vertical space proportionally
ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill") +
  labs(x = "Diamond Cut", y = "Proportion", title = "Proportional Bar Graph of Diamond Cuts by Clarity")
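The heights that position = "fill" draws can be reproduced by hand: count each cut and clarity combination, then divide by the total for that cut. This is a sketch with dplyr to show the arithmetic, not ggplot2's internal code. The name clarity_share is chosen for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Share of each clarity level within each cut.
clarity_share <- diamonds %>%
  count(cut, clarity) %>%
  group_by(cut) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

clarity_share
```

Within each cut the prop values add up to 1, which is why every stacked bar reaches the same height.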
Position = “dodge” will get side-by-side bars
ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 0.7, position = "dodge") +
  labs(x = "Diamond Cut", y = "Frequency", title = "Side-by-Side Bar Graph of Diamond Cuts by Clarity")
Change the Angle of the X-Axis Labels
When x-axis labels are too long, they may overlap. You can change the text angle by setting axis.text.x = element_text(angle = 45) inside theme().
ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 0.7, position = "dodge") +
  labs(x = "Diamond Cut", y = "Frequency", title = "Side-by-Side Bar Graph of Diamond Cuts by Clarity") +
  theme(axis.text.x = element_text(angle = 45)) # change the angle value to rotate the labels more or less
Finally, Make the X-Axis Labels Fit in a Narrow Width
Here is another option for dealing with long x-axis labels. You can pass str_wrap() from stringr to the labels argument of scale_x_discrete() to wrap each label onto multiple lines.
ggplot(data = diamonds, aes(x = cut, fill = clarity)) +
  geom_bar(alpha = 0.7, position = "dodge") +
  labs(x = "Diamond Cuts", y = "Frequency", title = "Side-by-Side Bar Graph of Diamond Cuts by Clarity") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 5))
Notice “Very Good” will fit on two lines instead of one line
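To see what the wrapping does on its own, you can call str_wrap() on the labels directly. This is a small sketch, with the vector names cut_labels and wrapped chosen for illustration.

```r
library(stringr)

cut_labels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# str_wrap() inserts a line break between words once a line would exceed the width.
# It never splits a single word, so "Premium" stays on one line.
wrapped <- str_wrap(cut_labels, width = 5)
wrapped
```

Only “Very Good” contains a space, so it is the only label that gets a line break.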
Bonus “Bar-Like” Graph - Polar Point
bar <- ggplot(data = diamonds) +
  geom_bar(aes(x = clarity, fill = clarity), show.legend = FALSE, width = 1) + # show.legend = FALSE hides the legend (change it to TRUE to see it); width sets the bar width, avoid making it too large
  theme(aspect.ratio = 1) # changing the aspect ratio stretches the panel and makes the graph thinner; best to keep it at 1 for this style of graph
bar + coord_polar() # coord_polar() is the function that creates this pie-chart-esque graph