Box plot with jittered points using the “fastfood” dataset from the openintro package

The dataset contains nutrition information for 515 fast food items, broken down by restaurant.

library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(ggplot2)
# Box plots of calories per restaurant with the individual menu items jittered on top.
ggplot(fastfood, aes(x = calories, y = restaurant, color = restaurant)) +
  geom_boxplot(alpha = 0) +  # transparent fill so the points stay visible
  # width = 0 jitters only along the restaurant axis, leaving calorie values exact
  geom_jitter(alpha = 0.4, width = 0, height = 0.3) +
  # reverse the legend so its entries run in the same top-to-bottom order as the y-axis
  guides(color = guide_legend(reverse = TRUE)) +
  labs(title = "Calories in Fast Food Restaurant Items", x = "Calories", y = "Restaurant",
       color = "Restaurant") +
  theme(plot.title = element_text(size = 22),
        axis.text.y = element_text(face = "bold", size = 18, angle = 10),
        axis.text.x = element_text(face = "bold", size = 18),
        axis.title.y = element_text(size = 20, face = "bold"),
        axis.title.x = element_text(size = 20, face = "bold"),
        legend.key.size = unit(1.1, "cm"),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15),
        panel.border = element_rect(fill = "transparent",
                                    color = "black",
                                    linewidth = 2))

Violin plots with points on top

#install.packages("ggbeeswarm")
library(ggbeeswarm)

# Violin plots of calories per restaurant with the individual menu items jittered on top.
ggplot(fastfood, aes(x = calories, y = restaurant, color = restaurant)) +
  geom_violin() +
  # width = 0 jitters only along the restaurant axis, leaving calorie values exact
  geom_jitter(alpha = 0.3, width = 0) +
  # reverse the legend so its entries run in the same top-to-bottom order as the y-axis
  guides(color = guide_legend(reverse = TRUE)) +
  labs(title = "Calories in Fast Food Restaurant Items", x = "Calories", y = "Restaurant",
       color = "Restaurant") +
  theme(plot.title = element_text(size = 22),
        axis.text.y = element_text(face = "bold", size = 18, angle = 10),
        axis.text.x = element_text(face = "bold", size = 18),
        axis.title.y = element_text(size = 20, face = "bold"),
        axis.title.x = element_text(size = 20, face = "bold"),
        legend.key.size = unit(1.1, "cm"),
        legend.text = element_text(size = 15),
        legend.title = element_text(size = 15),
        panel.background = element_rect(fill = "seashell"),
        panel.grid = element_line(color = "gray90", linetype = 2, linewidth = 0.8),
        panel.border = element_rect(fill = "transparent",
                                    color = "black",
                                    linewidth = 2))

Analysis

The original side-by-side boxplots clearly show the median number of calories for each restaurant’s items in the dataset. Sonic, Arby’s, and Burger King appear to have the three highest medians. Of these three, Arby’s seems to have the least variability in the calories of its food options. One advantage of boxplots is that they readily convey the variation in calories among each restaurant’s options. Overall, McDonald’s seems to have noticeably more variation than the other restaurants: it has the most outliers and a considerable spread in the middle 50% of its calorie distribution. It also has the item with the most calories in the dataset, at almost 2,500 calories, although its median is lower than those of the three previously mentioned restaurants. Subway’s middle 50% shows a noticeably wider spread than any other restaurant’s, suggesting that the center of the calorie distribution for Subway’s menu items has more variation than the central portion of the other restaurants’ distributions.
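The medians and middle-50% spreads read off the boxplots can be checked numerically. The following is a minimal sketch in base R, assuming the `calories` and `restaurant` columns of `fastfood` used in the plots; the name `calorie_summary` is introduced here only for illustration.

```r
library(openintro)  # provides the fastfood data frame

# Split the calorie values into one vector per restaurant.
by_restaurant <- split(fastfood$calories, fastfood$restaurant)

# Median, interquartile range (width of the box), and maximum for each restaurant.
# na.rm = TRUE is a precaution in case any calorie value is missing.
calorie_summary <- data.frame(
  restaurant = names(by_restaurant),
  median     = sapply(by_restaurant, median, na.rm = TRUE),
  iqr        = sapply(by_restaurant, IQR, na.rm = TRUE),
  max        = sapply(by_restaurant, max, na.rm = TRUE),
  row.names  = NULL
)

# Show the restaurants with the highest medians first.
calorie_summary[order(-calorie_summary$median), ]
```

Sorting by `iqr` instead of `median` would rank the restaurants by the spread of their middle 50%.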

Adding the points to the boxplots provides information that was previously missing, including that Chick-Fil-A has fewer data points than the other restaurants. Because some boxplots have more overlaid points than others, the restaurants are clearly not evenly represented in the data, which is a reason for caution when comparing their distributions. While the points are not a precise measure of the accuracy of the data, the viewer can still gain some insight. For example, Taco Bell has a fairly dense distribution of calories and more data points than some other restaurants, so we may have a better sense of its distribution than of, say, Chick-Fil-A’s.
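The uneven representation suggested by the overlaid points can also be confirmed with a simple count. This is a small sketch, assuming the `restaurant` column used in the plots; `item_counts` is an illustrative name.

```r
library(openintro)  # provides the fastfood data frame

# Number of menu items recorded for each restaurant, smallest first.
item_counts <- sort(table(fastfood$restaurant))
item_counts
```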

While the violin plots without the boxplots no longer show the medians, they add value by outlining the density of the points in the dataset. The width of each violin shows where the calorie values of a restaurant’s menu items are more or less concentrated. For example, Dairy Queen’s distribution appears roughly symmetric about the median, with fewer data points extending above and below the center. For Burger King, we get a clearer picture that the mean is likely higher than the median and that the data are skewed slightly right: the violin is wider at lower calorie values and then thins out to the right, just like McDonald’s. Overall, it is easier to judge skewness from the violin plot than from the box plot.
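Two of the points above can be explored with a short sketch: comparing each restaurant’s mean and median (a mean above the median is consistent with right skew), and putting the medians back on the violins. This is one possible approach rather than part of the original plots; the names `center` and `violin_with_median` are illustrative, and marking medians with `stat_summary()` is a choice made here.

```r
library(openintro)  # provides the fastfood data frame
library(ggplot2)

# Mean and median calories per restaurant.
by_restaurant <- split(fastfood$calories, fastfood$restaurant)
center <- data.frame(
  restaurant = names(by_restaurant),
  mean       = sapply(by_restaurant, mean),
  median     = sapply(by_restaurant, median),
  row.names  = NULL
)
# A positive difference means the mean lies above the median.
center$mean_minus_median <- center$mean - center$median
center

# Violin plot with a black point marking each restaurant's median.
# orientation = "y" tells stat_summary() to summarise calories (x) within each restaurant (y).
violin_with_median <- ggplot(fastfood, aes(x = calories, y = restaurant, color = restaurant)) +
  geom_violin() +
  stat_summary(fun = median, geom = "point", color = "black", size = 3,
               orientation = "y") +
  labs(title = "Calories in Fast Food Restaurant Items", x = "Calories", y = "Restaurant",
       color = "Restaurant")
violin_with_median
```

The difference between mean and median is only a rough indicator of skew, so it is best read alongside the violin shapes rather than in place of them.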