The purpose of this assignment is to create a programming sample “vignette” that demonstrates how to use one or more capabilities of a TidyVerse package with a dataset from fivethirtyeight.com or Kaggle.
For this assignment, I selected the coffee shop chain recipes and prices dataset from Kaggle. It contains the prices of various drinks sold by a coffee shop chain in Europe. The dataset is available on Kaggle at: https://www.kaggle.com/datasets/deryae0/coffee-shop-chain-recipes-and-prices
The TidyVerse is a collection of packages designed for data manipulation, visualization, and analysis in R.
# Load R packages
library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.3.2
library(openintro)
library(ggplot2)
readr
readr is a fast and efficient package for reading rectangular data files (like CSV or TSV) into R data frames. It provides functions to read data quickly while preserving data types and handling various types of messy data.
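As a minimal sketch of the readr workflow described above, the example below parses a small inline CSV with read_csv(). The three sample rows and the names sample_csv and prices_sample are hypothetical and only mimic the shape of the coffee data. Wrapping the string in I() to mark it as literal data assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical sample rows (not taken from the Kaggle file)
sample_csv <- "drink_type,price_small,price_medium,price_large
latte,4.0,4.7,5.5
mocha,4.9,5.6,6.3
house blend,,,
"

# I() marks the string as literal data rather than a file path
prices_sample <- read_csv(
  I(sample_csv),
  col_types = cols(
    drink_type = col_character(),
    .default = col_double()
  )
)

# Empty fields are read as NA, and the price columns are parsed as doubles
print(prices_sample)
```

Declaring col_types up front is optional, but it makes the expected column types explicit instead of relying on readr's guessing.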
I loaded the raw data from my GitHub repository using read.csv(), the base R counterpart of readr's read_csv().
I then explored the data by checking for missing and duplicate values, and printed summary statistics.
# Import the dataset
url <- "https://raw.githubusercontent.com/pujaroy280/SPRING2024TIDYVERSE/main/prices.csv"
df_coffee_prize <- read.csv(url)
# Display the first few rows of the dataset
head(df_coffee_prize)
## drink_type price_small price_medium price_large simple
## 1 frappucino 5.0 5.7 6.4 NA
## 2 iced chocolate 5.0 5.7 6.4 NA
## 3 iced chai latte 5.0 5.7 6.4 NA
## 4 milkshake (syrup) 4.6 5.3 6.0 NA
## 5 milkshake (natural fruit extract) 4.6 5.3 6.0 NA
## 6 cold brew latte 4.5 5.2 5.9 NA
## double
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
# Summary statistics
summary(df_coffee_prize)
## drink_type price_small price_medium price_large
## Length:31 Min. :2.900 Min. :3.400 Min. :3.800
## Class :character 1st Qu.:3.925 1st Qu.:4.625 1st Qu.:5.325
## Mode :character Median :4.600 Median :5.300 Median :6.000
## Mean :4.369 Mean :5.054 Mean :5.746
## 3rd Qu.:4.900 3rd Qu.:5.600 3rd Qu.:6.300
## Max. :5.400 Max. :6.100 Max. :6.800
## NA's :5 NA's :5 NA's :5
## simple double
## Min. :1.80 Min. :2.50
## 1st Qu.:1.80 1st Qu.:2.50
## Median :1.90 Median :2.65
## Mean :2.02 Mean :2.75
## 3rd Qu.:2.10 3rd Qu.:2.90
## Max. :2.50 Max. :3.20
## NA's :26 NA's :27
# Check for duplicates
num_duplicates <- sum(duplicated(df_coffee_prize))
duplicates <- df_coffee_prize[duplicated(df_coffee_prize), ]
print(duplicates)
## [1] drink_type price_small price_medium price_large simple
## [6] double
## <0 rows> (or 0-length row.names)
# Print out the column names
print(colnames(df_coffee_prize))
## [1] "drink_type" "price_small" "price_medium" "price_large" "simple"
## [6] "double"
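The missing-value check can also be made explicit by counting NA values per column. The sketch below uses base R only, on a hypothetical toy data frame named toy rather than the real dataset.

```r
# Hypothetical toy data frame with the same kind of gaps as the coffee data
toy <- data.frame(
  drink_type  = c("latte", "mocha", "house blend"),
  price_small = c(4.0, 4.9, NA),
  simple      = c(NA, NA, 1.8)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of fully duplicated rows
print(sum(duplicated(toy)))
```

Applied to df_coffee_prize, the same colSums(is.na(...)) call should report the NA counts that summary() lists at the bottom of each column.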
dplyr
dplyr is a package for data manipulation that provides a set of functions optimized for common data manipulation tasks such as filtering, selecting, mutating, summarizing, and arranging data. It emphasizes a “grammar of data manipulation” approach for easy and intuitive data wrangling.
I removed the last two columns (simple and double) using the select() function because they are almost entirely NA (26 and 27 of 31 values missing). Then, I removed rows with NA values in any of the remaining columns.
# Remove the last two columns
clean_df_coffee_prize <- df_coffee_prize %>%
  select(-c(simple, double))
# Remove rows with NA values in any column
clean_df_coffee_prize <- clean_df_coffee_prize %>%
  na.omit()
print(clean_df_coffee_prize)
## drink_type price_small price_medium
## 1 frappucino 5.0 5.7
## 2 iced chocolate 5.0 5.7
## 3 iced chai latte 5.0 5.7
## 4 milkshake (syrup) 4.6 5.3
## 5 milkshake (natural fruit extract) 4.6 5.3
## 6 cold brew latte 4.5 5.2
## 7 iced coco matcha 5.0 5.7
## 8 company ice tea passion fruit and citrus 3.5 4.2
## 9 company ice tea raspberry and elderflower 3.5 4.2
## 10 company ice tea peach 3.5 4.2
## 11 special company drink 3.8 4.5
## 12 smoothie - Kiwi Banana Mango Apple 4.9 5.6
## 13 smoothie - Strawberry Blueberry Blackcurrant Mango 4.9 5.6
## 14 smoothie - Passion fruit Guava Pineapple Aloe vera 4.9 5.6
## 15 orange juice 3.9 4.6
## 16 cappucino 4.1 4.8
## 17 latte 4.0 4.7
## 18 special company brand latte 5.4 6.1
## 19 mocha 4.9 5.6
## 20 chai latte 4.7 5.4
## 21 coco matcha 5.0 5.7
## 22 hot chocolate 4.0 4.7
## 23 hot white chocolate 4.2 4.9
## 24 viennese coffee 4.7 5.4
## 30 filter coffee 2.9 3.6
## 31 tea/herbal teas 3.1 3.4
## price_large
## 1 6.4
## 2 6.4
## 3 6.4
## 4 6.0
## 5 6.0
## 6 5.9
## 7 6.4
## 8 4.9
## 9 4.9
## 10 4.9
## 11 5.2
## 12 6.3
## 13 6.3
## 14 6.3
## 15 5.3
## 16 5.5
## 17 5.5
## 18 6.8
## 19 6.3
## 20 6.1
## 21 6.4
## 22 5.4
## 23 5.6
## 24 6.1
## 30 4.3
## 31 3.8
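The same cleaning step can be written with TidyVerse helpers only. This is a sketch under assumptions: the toy tibble is hypothetical, and tidyr's drop_na() is swapped in for the base R na.omit() used above. It is an alternative, not the method this vignette ran.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one row that has no sized prices
toy <- tibble(
  drink_type  = c("latte", "mocha", "house blend"),
  price_small = c(4.0, 4.9, NA),
  price_large = c(5.5, 6.3, NA),
  simple      = c(NA, NA, 1.8)
)

# Keep the name and every column starting with "price_", then drop incomplete rows
clean_toy <- toy %>%
  select(drink_type, starts_with("price_")) %>%
  drop_na()

print(clean_toy)
```

Selecting by prefix avoids naming the unwanted columns one by one, which helps if more non-price columns are added later.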
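In the code chunk below, I grouped the data by drink type with group_by() and computed the mean price of each size with summarise(). Because each drink type appears only once in the cleaned data (26 rows, 26 groups), the averages equal the listed prices.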
In R, the group_by() function is part of the dplyr package, which is a core component of the TidyVerse ecosystem. This function is used for grouping data based on one or more variables, allowing for the application of functions to each group separately.
# Compute the average price of each size for every drink type
avg_prices <- clean_df_coffee_prize %>%
  group_by(drink_type) %>%
  summarise(
    avg_price_small = mean(price_small, na.rm = TRUE),
    avg_price_medium = mean(price_medium, na.rm = TRUE),
    avg_price_large = mean(price_large, na.rm = TRUE)
  )
print(avg_prices)
## # A tibble: 26 × 4
## drink_type avg_price_small avg_price_medium avg_price_large
## <chr> <dbl> <dbl> <dbl>
## 1 cappucino 4.1 4.8 5.5
## 2 chai latte 4.7 5.4 6.1
## 3 coco matcha 5 5.7 6.4
## 4 cold brew latte 4.5 5.2 5.9
## 5 company ice tea passion fru… 3.5 4.2 4.9
## 6 company ice tea peach 3.5 4.2 4.9
## 7 company ice tea raspberry a… 3.5 4.2 4.9
## 8 filter coffee 2.9 3.6 4.3
## 9 frappucino 5 5.7 6.4
## 10 hot chocolate 4 4.7 5.4
## # ℹ 16 more rows
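group_by() has more visible effect when a group holds several rows. The sketch below reshapes two rows (latte and mocha, with prices copied from the output above) into long format with tidyr's pivot_longer() and then averages by size. The names toy and size_avg are hypothetical.

```r
library(dplyr)
library(tidyr)

# Two rows copied from the cleaned data shown above
toy <- tibble(
  drink_type   = c("latte", "mocha"),
  price_small  = c(4.0, 4.9),
  price_medium = c(4.7, 5.6),
  price_large  = c(5.5, 6.3)
)

# One row per drink and size, then one average per size
size_avg <- toy %>%
  pivot_longer(
    cols = starts_with("price_"),
    names_to = "size",
    names_prefix = "price_",
    values_to = "price"
  ) %>%
  group_by(size) %>%
  summarise(avg_price = mean(price), .groups = "drop")

print(size_avg)
```

Here each size group contains two drinks, so the mean is a real aggregate rather than a copy of a single value.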
ggplot2
To visualize the average price of each drink type by size, I used ggplot2, a powerful and flexible package for creating static graphics by layering geoms and aesthetic mappings.
The code below draws one heatmap-style tile plot per size with geom_tile(): each drink gets a tile positioned at its average price on the x-axis and shaded by that price.
# Heatmap of small-size prices, with drink type on the y-axis
ggplot(avg_prices, aes(x = avg_price_small, y = drink_type, fill = avg_price_small)) +
  geom_tile(color = "darkgrey") +
  labs(title = "Average Small Coffee Prices by Drink Type", x = "Average Price (Small)", y = "Drink Type", fill = "Average Price") +
  scale_fill_gradient(low = "darkgrey", high = "brown")
# Heatmap of medium-size prices, with drink type on the y-axis
ggplot(avg_prices, aes(x = avg_price_medium, y = drink_type, fill = avg_price_medium)) +
  geom_tile(color = "darkgrey") +
  labs(title = "Average Medium Coffee Prices by Drink Type", x = "Average Price (Medium)", y = "Drink Type", fill = "Average Price") +
  scale_fill_gradient(low = "darkgrey", high = "brown")
# Heatmap of large-size prices, with drink type on the y-axis
ggplot(avg_prices, aes(x = avg_price_large, y = drink_type, fill = avg_price_large)) +
  geom_tile(color = "darkgrey") +
  labs(title = "Average Large Coffee Prices by Drink Type", x = "Average Price (Large)", y = "Drink Type", fill = "Average Price") +
  scale_fill_gradient(low = "darkgrey", high = "brown")
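An alternative layout puts all three sizes into a single heatmap, with size on the x-axis and drink type on the y-axis. This is a sketch, not the plot produced above: it uses three rows copied from the avg_prices output, and the names toy_avg, long_avg, and heatmap_plot are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Three rows copied from the averages shown earlier
toy_avg <- tibble(
  drink_type       = c("cappucino", "chai latte", "filter coffee"),
  avg_price_small  = c(4.1, 4.7, 2.9),
  avg_price_medium = c(4.8, 5.4, 3.6),
  avg_price_large  = c(5.5, 6.1, 4.3)
)

# Long format: one row per drink and size, with sizes in a sensible order
long_avg <- toy_avg %>%
  pivot_longer(
    cols = starts_with("avg_price_"),
    names_to = "size",
    names_prefix = "avg_price_",
    values_to = "avg_price"
  ) %>%
  mutate(size = factor(size, levels = c("small", "medium", "large")))

# One tile per drink and size, shaded by price
heatmap_plot <- ggplot(long_avg, aes(x = size, y = drink_type, fill = avg_price)) +
  geom_tile(color = "darkgrey") +
  labs(title = "Average Prices by Drink Type and Size", x = "Size", y = "Drink Type", fill = "Average Price") +
  scale_fill_gradient(low = "darkgrey", high = "brown")
```

Calling print(heatmap_plot) renders the figure. With a discrete x-axis the tiles form a regular grid, so color alone carries the price.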
# Create an additional column with the overall average cost of each drink, then get summary statistics on the dataset
avg_prices$overall_avg <- (avg_prices$avg_price_small + avg_prices$avg_price_medium + avg_prices$avg_price_large) / 3
summary(avg_prices)
## drink_type avg_price_small avg_price_medium avg_price_large
## Length:26 Min. :2.900 Min. :3.400 Min. :3.800
## Class :character 1st Qu.:3.925 1st Qu.:4.625 1st Qu.:5.325
## Mode :character Median :4.600 Median :5.300 Median :6.000
## Mean :4.369 Mean :5.054 Mean :5.746
## 3rd Qu.:4.900 3rd Qu.:5.600 3rd Qu.:6.300
## Max. :5.400 Max. :6.100 Max. :6.800
## overall_avg
## Min. :3.433
## 1st Qu.:4.625
## Median :5.300
## Mean :5.056
## 3rd Qu.:5.600
## Max. :6.100
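The overall average column can also be created inside a dplyr pipeline with mutate(). The sketch below assumes a two-row tibble (latte and mocha, values copied from the cleaned data); toy_avg is a hypothetical name.

```r
library(dplyr)

# Two rows with values copied from the cleaned data
toy_avg <- tibble(
  drink_type       = c("latte", "mocha"),
  avg_price_small  = c(4.0, 4.9),
  avg_price_medium = c(4.7, 5.6),
  avg_price_large  = c(5.5, 6.3)
)

# mutate() adds the new column without repeating the data frame name
toy_avg <- toy_avg %>%
  mutate(overall_avg = (avg_price_small + avg_price_medium + avg_price_large) / 3)

print(toy_avg)
```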
# Graph the frequency of the overall average prices
ggplot(data = avg_prices, aes(x = overall_avg)) +
  geom_bar(fill = "blue") +
  labs(title = "Average Coffee Prices", x = "Average Price", y = "Frequency")
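Because overall_avg is continuous, a histogram is a common alternative to geom_bar(), which counts each distinct value separately. The sketch below is an assumption-laden variant, not the plot above: the seven values in toy_avg are hypothetical and the 0.5 bin width is an arbitrary choice.

```r
library(ggplot2)

# Hypothetical overall averages, for illustration only
toy_avg <- data.frame(overall_avg = c(3.6, 4.2, 4.7, 5.3, 5.6, 5.7, 6.1))

# Bin the prices into 0.5-wide intervals instead of counting exact values
hist_plot <- ggplot(toy_avg, aes(x = overall_avg)) +
  geom_histogram(binwidth = 0.5, fill = "blue", color = "white") +
  labs(title = "Average Coffee Prices", x = "Average Price", y = "Frequency")
```

Calling print(hist_plot) renders the figure. A narrower or wider binwidth changes how much detail the distribution shows.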
# Find the lowest small-size price and the highest large-size price
min(avg_prices$avg_price_small)
## [1] 2.9
max(avg_prices$avg_price_large)
## [1] 6.8
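min() and max() return only the prices. To see which drinks those prices belong to, dplyr's slice_min() and slice_max() return the matching rows. This sketch assumes dplyr 1.0 or later and uses three rows copied from the data above; toy_avg, cheapest_small, and priciest_large are hypothetical names.

```r
library(dplyr)

# Three rows copied from the averages shown earlier
toy_avg <- tibble(
  drink_type      = c("latte", "mocha", "filter coffee"),
  avg_price_small = c(4.0, 4.9, 2.9),
  avg_price_large = c(5.5, 6.3, 4.3)
)

# Row with the lowest small-size price
cheapest_small <- toy_avg %>% slice_min(avg_price_small, n = 1)

# Row with the highest large-size price
priciest_large <- toy_avg %>% slice_max(avg_price_large, n = 1)

print(cheapest_small)
print(priciest_large)
```

By default both functions keep ties, so more than one row can come back when several drinks share the extreme price.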
# Add a column that designates whether a drink is cheaper or more expensive than the mean overall average cost
mean(avg_prices$overall_avg)
## [1] 5.05641
avg_prices$Pricepoint <- ifelse(avg_prices$overall_avg > mean(avg_prices$overall_avg), "more expensive", "cheaper")
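A dplyr version of this labeling step might look like the sketch below, which marks drinks above the mean as "more expensive" and the rest as "cheaper", then counts each label. The three overall averages are derived from rows in the cleaned data; toy_avg and label_counts are hypothetical names.

```r
library(dplyr)

# Overall averages for three drinks from the cleaned data
toy_avg <- tibble(
  drink_type  = c("filter coffee", "latte", "mocha"),
  overall_avg = c(3.6, 14.2 / 3, 5.6)
)

# Compare each drink with the mean instead of a hard-coded number
toy_avg <- toy_avg %>%
  mutate(Pricepoint = if_else(overall_avg > mean(overall_avg), "more expensive", "cheaper"))

# How many drinks fall on each side of the mean
label_counts <- toy_avg %>% count(Pricepoint)

print(toy_avg)
print(label_counts)
```

Computing the threshold with mean(overall_avg) keeps the labels correct if the data changes, which a typed-in constant would not.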
…