First, we read the Starbucks CSV file from the TidyTuesday repository and store it in starbucksData:
library(ggplot2)  # plotting functions used throughout
library(ggpubr)   # provides ggqqplot()
starbucks_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv"
starbucksData <- read.csv(starbucks_url)
head(starbucksData, 4)
##                 product_name   size milk whip serv_size_m_l calories
## 1 brewed coffee - dark roast  short    0    0           236        3
## 2 brewed coffee - dark roast   tall    0    0           354        4
## 3 brewed coffee - dark roast grande    0    0           473        5
## 4 brewed coffee - dark roast  venti    0    0           591        5
##   total_fat_g saturated_fat_g trans_fat_g cholesterol_mg sodium_mg
## 1         0.1               0           0              0         5
## 2         0.1               0           0              0        10
## 3         0.1               0           0              0        10
## 4         0.1               0           0              0        10
##   total_carbs_g fiber_g sugar_g caffeine_mg
## 1             0       0       0         130
## 2             0       0       0         193
## 3             0       0       0         260
## 4             0       0       0         340
I have selected the calories column (variable) from the table to check whether the data in this sample are normally distributed.
I will try several options to see which of them help me assess the normality of this variable without conducting a formal statistical test.
First, the summary function gives us the five-number summary of the calories variable (minimum, first quartile, median, third quartile and maximum) plus the mean. It gives a basic idea of the center and spread of the data.
summary(starbucksData$calories)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     0.0   130.0   220.0   228.4   320.0   640.0
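One rough way to read these numbers is sketched below. It assumes the printed summary values are simply copied in by hand, and the variable names are illustrative only. In a symmetric distribution the mean and median roughly coincide, and the two quartiles sit at about the same distance from the median.

```r
# Values copied from the summary() output above (not recomputed from the data).
q1  <- 130.0
med <- 220.0
avg <- 228.4
q3  <- 320.0

# Mean minus median: a positive gap hints at a longer right tail.
mean_median_gap <- avg - med

# Distances from the median to each quartile; similar values suggest symmetry.
lower_spread <- med - q1
upper_spread <- q3 - med

cat("mean - median:", mean_median_gap, "\n")
cat("median - Q1:  ", lower_spread, "\n")
cat("Q3 - median:  ", upper_spread, "\n")
```

Here the gap is 8.4 calories and the two spreads are 90 and 100, so the middle of the distribution looks only mildly right-skewed.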
If we overlay a boxplot on jittered points using ggplot2, we get some idea of the distribution of the data, but the plot is not visually clear enough to help us judge the normality of the underlying data.
ggplot(data = starbucksData,
       mapping = aes(x = calories, y = 0)) +
  geom_jitter() +  # raw points, spread vertically to reduce overlap
  geom_boxplot(colour = "#004b8c", fill = "#98c5ed", alpha = .6)
Now, a histogram gives us a much better visual understanding of the data and its distribution. I have added the mean (solid line) and median (dashed line) along with a frequency polygon (red line) to help me further assess the normality of the calories variable.
As we can see in the following graph, the calories sample does not look like a perfect bell curve, but that alone does not mean the variable is not normal.
ggplot(data = starbucksData, mapping = aes(x = calories)) +
  geom_histogram(binwidth = 70, color = "black", fill = "#b5e2f5") +
  # solid line: mean; dashed line: median
  geom_vline(mapping = aes(xintercept = mean(calories)), colour = "#1d8a01") +
  geom_vline(mapping = aes(xintercept = median(calories)),
             linetype = "dashed", colour = "#1d8a01") +
  # colour is set outside aes() so the polygon is drawn in literal red
  geom_freqpoly(colour = "red", binwidth = 70)
Because it is hard to judge normality from a histogram alone, I also use a QQ plot to help me evaluate the normality assumption visually.
Based on the QQ plot below, because the majority of points are close to the reference line and within the confidence bands, the normality assumption can be considered met.
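To make concrete what a QQ plot compares, here is a minimal base-R sketch. It uses simulated data (the vector x and its parameters are made up for illustration, not taken from the Starbucks file) and pairs each sorted observation with the standard normal quantile expected at the same rank. A QQ plot function such as ggqqplot draws pairs of this kind together with a reference line.

```r
set.seed(42)                            # reproducible simulated sample
x <- rnorm(200, mean = 230, sd = 130)   # hypothetical data, roughly on the calories scale

n <- length(x)
sample_q <- sort(x)                     # ordered observations (sample quantiles)
theory_q <- qnorm(ppoints(n))           # standard normal quantiles at matching ranks

# For normal data the pairs fall close to a straight line, so their
# correlation is close to 1.
qq_cor <- cor(theory_q, sample_q)
print(qq_cor)

# Base-R version of the plot: the points plus a line through the quartiles.
qqnorm(x)
qqline(x)
```

Points that bend away from the line at either end indicate tails that are heavier or lighter than a normal distribution would produce.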
ggqqplot(starbucksData, x = "calories", color = "#00AFBB")
We can extend this finding with simple statistics by computing the proportion of data points that fall within one standard deviation of the mean. The result (about 65.7%) is quite close to the 68% predicted by the empirical rule.
lower_bound <- mean(starbucksData$calories) - sd(starbucksData$calories)
upper_bound <- mean(starbucksData$calories) + sd(starbucksData$calories)
index <- starbucksData$calories > lower_bound &
  starbucksData$calories < upper_bound
sum(index) / nrow(starbucksData)
## [1] 0.657367
If we run the same check for the 95% part of the empirical rule, counting the data points within two standard deviations of the mean, the result is again close (96.9%).
lower_bound <- mean(starbucksData$calories) - 2 * sd(starbucksData$calories)
upper_bound <- mean(starbucksData$calories) + 2 * sd(starbucksData$calories)
index <- starbucksData$calories > lower_bound &
  starbucksData$calories < upper_bound
sum(index) / nrow(starbucksData)
## [1] 0.9694856
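The two checks above repeat the same steps, so they could be wrapped in a small helper and compared against the proportions a normal distribution predicts. This is only a sketch: prop_within_sd and normal_within_sd are hypothetical names, and the demonstration uses simulated values rather than the Starbucks data.

```r
# Proportion of values lying strictly within k standard deviations of the mean.
prop_within_sd <- function(x, k = 1) {
  centre <- mean(x)
  spread <- sd(x)
  mean(x > centre - k * spread & x < centre + k * spread)
}

# Proportion a normal distribution places within k standard deviations.
normal_within_sd <- function(k) {
  pnorm(k) - pnorm(-k)
}

set.seed(1)
x <- rnorm(1000)   # simulated sample, for illustration only

for (k in 1:3) {
  cat("k =", k,
      " observed:", round(prop_within_sd(x, k), 3),
      " normal theory:", round(normal_within_sd(k), 3), "\n")
}
```

With the real data, prop_within_sd(starbucksData$calories, 1) and prop_within_sd(starbucksData$calories, 2) should reproduce the two proportions computed above, provided the column has no missing values.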
Based on the graphs and simple calculations, I can conclude that the calories variable in the Starbucks data set appears to be approximately normally distributed.