Fetch Starbucks data set

First, we read the Starbucks CSV file from the “TidyTuesday” repository and store it in starbucksData:

starbucks_url <- "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-12-21/starbucks.csv"
starbucksData <- read.csv( starbucks_url )
head(starbucksData, 4)

##                 product_name   size milk whip serv_size_m_l calories
## 1 brewed coffee - dark roast  short    0    0           236        3
## 2 brewed coffee - dark roast   tall    0    0           354        4
## 3 brewed coffee - dark roast grande    0    0           473        5
## 4 brewed coffee - dark roast  venti    0    0           591        5
##   total_fat_g saturated_fat_g trans_fat_g cholesterol_mg sodium_mg
## 1         0.1               0           0              0         5
## 2         0.1               0           0              0        10
## 3         0.1               0           0              0        10
## 4         0.1               0           0              0        10
##   total_carbs_g fiber_g sugar_g caffeine_mg
## 1             0       0       0         130
## 2             0       0       0         193
## 3             0       0       0         260
## 4             0       0       0         340

I have selected the “calories” column (variable) from the table to check whether the data in this sample are normally distributed.
I will try several approaches to see which ones help assess the normality of this variable without conducting a formal statistical test.

1. Using Summary function

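First, the summary function gives the five-number summary of the calories variable (minimum, first quartile, median, third quartile, maximum) together with the mean. It gives a basic idea of where the centre of the data lies and how the quartiles are spread around it.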

summary(starbucksData$calories)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     0.0   130.0   220.0   228.4   320.0   640.0
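As a rough numeric complement, the quartiles printed above can be used to gauge symmetry. The sketch below hard-codes the values from the summary output and computes the mean-median gap and Bowley's quartile skewness. The variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary() output above
q1  <- 130.0
med <- 220.0
avg <- 228.4
q3  <- 320.0

# Gap between mean and median: close to zero for a symmetric distribution
mean_median_gap <- avg - med

# Bowley (quartile) skewness: 0 when the median sits midway between the
# quartiles, positive when the upper half of the box is wider than the lower
bowley_skew <- ((q3 - med) - (med - q1)) / (q3 - q1)

mean_median_gap
bowley_skew
```

Both values are positive but small relative to the interquartile range, which hints at a mild right skew rather than a strong departure from symmetry.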

2. Using Boxplot function

If we use a boxplot with jittered points from ggplot2, we get some idea of the distribution of the data, but it is not visually clear enough to judge the normality of the underlying data.

library(ggplot2)  # provides ggplot(), geom_jitter() and geom_boxplot()

ggplot(data = starbucksData,
       mapping = aes(x = calories, y = 0)) +
  geom_jitter() +
  geom_boxplot(colour = "#004b8c", fill = "#98c5ed", alpha = .6)
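To make the boxplot easier to read, the whisker limits can be worked out by hand with the usual 1.5 × IQR rule. This is a minimal sketch that reuses the quartiles, minimum and maximum from the summary output and assumes the boxplot is built from the same quartiles; the variable names are illustrative.

```r
# Values copied from the summary() output (assumed to match the boxplot's quartiles)
q1      <- 130.0
q3      <- 320.0
min_val <- 0.0
max_val <- 640.0

# Interquartile range and the 1.5 * IQR fences
iqr         <- q3 - q1
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

c(lower = lower_fence, upper = upper_fence)

# Points beyond a fence would be drawn as individual outliers
has_high_outlier <- max_val > upper_fence
has_low_outlier  <- min_val < lower_fence
```

With these numbers the upper fence is 605, so the maximum of 640 falls outside it, while the lower fence is negative and no calorie value can fall below it.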

3. Using Histogram function

The histogram gives a much better visual understanding of the data and its distribution. I have added the mean (solid line) and median (dashed line), along with a frequency polygon (red line), to help assess the normality of the calories variable.

As we can see in the following graph, the calories sample does not form a perfect bell curve, but that alone does not mean the variable is not normal.

ggplot(data = starbucksData, mapping = aes(x = calories)) +
  geom_histogram(binwidth = 70, colour = "black", fill = "#b5e2f5") +
  # Mean (solid) and median (dashed), each drawn as a single reference line
  geom_vline(xintercept = mean(starbucksData$calories), colour = "#1d8a01") +
  geom_vline(xintercept = median(starbucksData$calories),
             linetype = "dashed", colour = "#1d8a01") +
  # Colour is set outside aes() so the polygon is drawn in literal red
  geom_freqpoly(colour = "red", binwidth = 70)
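One way to make the comparison with a bell curve more direct is to overlay a normal density that has the sample's own mean and standard deviation. The sketch below uses base R graphics and a simulated stand-in sample with arbitrary parameters so that it runs without the download; replacing x with starbucksData$calories is the intended use.

```r
# Stand-in sample with arbitrary parameters (an assumption);
# replace with starbucksData$calories to reproduce the idea on the real data
set.seed(1)
x <- rnorm(500, mean = 228, sd = 100)

m <- mean(x)
s <- sd(x)

# Density-scaled histogram so that bars and curve share the same y-axis
h <- hist(x, breaks = 20, freq = FALSE, col = "#b5e2f5",
          main = "Histogram with fitted normal density", xlab = "x")

# Normal curve using the sample mean and standard deviation
curve(dnorm(x, mean = m, sd = s), add = TRUE, col = "red", lwd = 2)
```

If the bars follow the red curve closely, the histogram supports approximate normality; systematic gaps on one side point to skew.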

4. Using QQ-Plot function

As it seems hard to judge normality from the histogram alone, I use a QQ-plot in addition to it to further evaluate the normality assumption visually.

Based on the QQ-plot below, because the majority of points lie close to the reference line and within the confidence band, the normality assumption can be considered met.

library(ggpubr)  # provides ggqqplot()
ggqqplot(starbucksData, x = "calories", color = "#00AFBB")
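To show what a normal QQ-plot compares, here is a minimal base R sketch that builds one by hand: the sorted sample is plotted against the corresponding standard normal quantiles. It uses a simulated stand-in sample (an assumption) rather than the Starbucks data, and the correlation it reports is only an informal summary of how straight the points are, not a formal test.

```r
# Stand-in sample (an assumption); replace with starbucksData$calories
set.seed(1)
x <- rnorm(500)

n <- length(x)
sample_q <- sort(x)            # ordered sample values
theory_q <- qnorm(ppoints(n))  # matching standard normal quantiles

# Points close to a straight line suggest approximate normality
plot(theory_q, sample_q,
     xlab = "Theoretical quantiles", ylab = "Sample quantiles")
qqline(x, col = "#00AFBB")     # reference line through the quartiles

# Informal straightness measure: correlation between the two quantile sets
qq_cor <- cor(theory_q, sample_q)
qq_cor
```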

5. Using Empirical rule to check normality

We can extend this finding with simple statistics by computing the proportion of data points within one standard deviation of the mean. The resulting ratio (65.7%) is quite close to the 68% predicted by the empirical rule.

lower_bound <- mean(starbucksData$calories)-sd(starbucksData$calories)
upper_bound <- mean(starbucksData$calories)+sd(starbucksData$calories)
index <- starbucksData$calories > lower_bound &
  starbucksData$calories < upper_bound
sum(index)/nrow(starbucksData)

## [1] 0.657367

If we do the same for the 95% part of the empirical rule by checking the data points within two standard deviations of the mean, the result is again a close estimate (96.9%).

lower_bound <- mean(starbucksData$calories)-2*sd(starbucksData$calories)
upper_bound <- mean(starbucksData$calories)+2*sd(starbucksData$calories)
index <- starbucksData$calories > lower_bound &
  starbucksData$calories < upper_bound
sum(index)/nrow(starbucksData)

## [1] 0.9694856
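The two checks above repeat the same steps, so they can be wrapped in a small helper and compared with the exact shares a normal distribution would give. This is a sketch: share_within is a hypothetical helper name, and the sample is a simulated stand-in so the block runs on its own.

```r
# Hypothetical helper: share of values strictly within k standard
# deviations of the mean (same strict inequalities as the code above)
share_within <- function(x, k) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  mean(abs(x - m) < k * s, na.rm = TRUE)
}

# Exact shares for a normal distribution at 1, 2 and 3 standard deviations
expected <- 2 * pnorm(1:3) - 1

# Stand-in sample (an assumption); replace with starbucksData$calories
set.seed(1)
x <- rnorm(1000)
observed <- sapply(1:3, function(k) share_within(x, k))

round(rbind(expected, observed), 3)
```

Called as share_within(starbucksData$calories, 1) and share_within(starbucksData$calories, 2), the helper should reproduce the two ratios computed above, provided the column has no missing values.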

Conclusion

Based on the graphs and simple calculations, I conclude that the “calories” variable in the Starbucks data set appears to be approximately normally distributed.

Data Analytics: Test 4

Manish Rawat - 658489554