Objective

The purpose of this assignment is to create an Example by using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, to create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with the selected dataset.

For this assignment, I selected the coffee shop chain recipes and prices from Kaggle. This dataset contains prices of various coffee drinks from a coffee shop chain in Europe. This is the link to the dataset in Kaggle: https://www.kaggle.com/datasets/deryae0/coffee-shop-chain-recipes-and-prices

CREATE Vignette

The TidyVerse is a collection of packages designed for data manipulation, visualization, and analysis in R.

# Load R packages
library(tidyverse)
## Warning: package 'tidyr' was built under R version 4.3.2
library(openintro)
library(ggplot2)

readr

readr is a fast and efficient package for reading rectangular data files (like CSV or TSV) into R data frames. It provides functions to read data quickly while preserving data types and handling various types of messy data.
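As a minimal sketch of the readr approach, the snippet below parses a small inline CSV with read_csv() and declares the column types explicitly. The three rows are copied from the data preview in this vignette; the names sample_csv and sample_prices are illustrative only and are not part of the original analysis.

```r
library(readr)

# Three rows copied from the dataset preview. I() marks the string as
# literal CSV text rather than a file path (requires readr >= 2.0).
sample_csv <- I("drink_type,price_small,price_medium,price_large
frappucino,5.0,5.7,6.4
cold brew latte,4.5,5.2,5.9
filter coffee,2.9,3.6,4.3
")

# Declare column types up front instead of relying on type guessing.
sample_prices <- read_csv(
  sample_csv,
  col_types = cols(
    drink_type = col_character(),
    .default = col_double()
  )
)

sample_prices
```

Unlike base R's read.csv(), read_csv() returns a tibble, so printing shows the column types alongside the values.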

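I loaded the raw data from my GitHub repository using base R's read.csv() function. readr's read_csv() is the tidyverse equivalent and returns a tibble instead of a plain data frame.

I explored the data by checking for missing and duplicate values, and I printed summary statistics for each column.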

# Import the dataset
url <- "https://raw.githubusercontent.com/pujaroy280/SPRING2024TIDYVERSE/main/prices.csv"
df_coffee_prize <- read.csv(url)
# Display the first few rows of the dataset
head(df_coffee_prize)
##                          drink_type price_small price_medium price_large simple
## 1                        frappucino         5.0          5.7         6.4     NA
## 2                    iced chocolate         5.0          5.7         6.4     NA
## 3                   iced chai latte         5.0          5.7         6.4     NA
## 4                 milkshake (syrup)         4.6          5.3         6.0     NA
## 5 milkshake (natural fruit extract)         4.6          5.3         6.0     NA
## 6                   cold brew latte         4.5          5.2         5.9     NA
##   double
## 1     NA
## 2     NA
## 3     NA
## 4     NA
## 5     NA
## 6     NA
# Summary statistics
summary(df_coffee_prize)
##   drink_type         price_small     price_medium    price_large   
##  Length:31          Min.   :2.900   Min.   :3.400   Min.   :3.800  
##  Class :character   1st Qu.:3.925   1st Qu.:4.625   1st Qu.:5.325  
##  Mode  :character   Median :4.600   Median :5.300   Median :6.000  
##                     Mean   :4.369   Mean   :5.054   Mean   :5.746  
##                     3rd Qu.:4.900   3rd Qu.:5.600   3rd Qu.:6.300  
##                     Max.   :5.400   Max.   :6.100   Max.   :6.800  
##                     NA's   :5       NA's   :5       NA's   :5      
##      simple         double    
##  Min.   :1.80   Min.   :2.50  
##  1st Qu.:1.80   1st Qu.:2.50  
##  Median :1.90   Median :2.65  
##  Mean   :2.02   Mean   :2.75  
##  3rd Qu.:2.10   3rd Qu.:2.90  
##  Max.   :2.50   Max.   :3.20  
##  NA's   :26     NA's   :27
# Check for duplicates
num_duplicates <- sum(duplicated(df_coffee_prize))
duplicates <- df_coffee_prize[duplicated(df_coffee_prize), ]
print(duplicates)
## [1] drink_type   price_small  price_medium price_large  simple      
## [6] double      
## <0 rows> (or 0-length row.names)
# Print out the column names
print(colnames(df_coffee_prize))
## [1] "drink_type"   "price_small"  "price_medium" "price_large"  "simple"      
## [6] "double"

dplyr

dplyr is a package for data manipulation that provides a set of functions optimized for common data manipulation tasks such as filtering, selecting, mutating, summarizing, and arranging data. It emphasizes a “grammar of data manipulation” approach for easy and intuitive data wrangling.

I removed the last two columns with the select() function because they are almost entirely missing (26 and 27 NA values out of 31 rows, per the summary above). Then I removed the rows with NA values in any of the remaining columns.

# Remove the last two columns
clean_df_coffee_prize <- df_coffee_prize %>%
  select(-c(simple, double))

# Remove rows with NA values in any column
clean_df_coffee_prize <- clean_df_coffee_prize %>%
  na.omit()

print(clean_df_coffee_prize)
##                                            drink_type price_small price_medium
## 1                                          frappucino         5.0          5.7
## 2                                      iced chocolate         5.0          5.7
## 3                                     iced chai latte         5.0          5.7
## 4                                   milkshake (syrup)         4.6          5.3
## 5                   milkshake (natural fruit extract)         4.6          5.3
## 6                                     cold brew latte         4.5          5.2
## 7                                    iced coco matcha         5.0          5.7
## 8            company ice tea passion fruit and citrus         3.5          4.2
## 9           company ice tea raspberry and elderflower         3.5          4.2
## 10                              company ice tea peach         3.5          4.2
## 11                              special company drink         3.8          4.5
## 12                 smoothie - Kiwi Banana Mango Apple         4.9          5.6
## 13 smoothie - Strawberry Blueberry Blackcurrant Mango         4.9          5.6
## 14 smoothie - Passion fruit Guava Pineapple Aloe vera         4.9          5.6
## 15                                       orange juice         3.9          4.6
## 16                                          cappucino         4.1          4.8
## 17                                              latte         4.0          4.7
## 18                        special company brand latte         5.4          6.1
## 19                                              mocha         4.9          5.6
## 20                                         chai latte         4.7          5.4
## 21                                        coco matcha         5.0          5.7
## 22                                      hot chocolate         4.0          4.7
## 23                                hot white chocolate         4.2          4.9
## 24                                    viennese coffee         4.7          5.4
## 30                                      filter coffee         2.9          3.6
## 31                                    tea/herbal teas         3.1          3.4
##    price_large
## 1          6.4
## 2          6.4
## 3          6.4
## 4          6.0
## 5          6.0
## 6          5.9
## 7          6.4
## 8          4.9
## 9          4.9
## 10         4.9
## 11         5.2
## 12         6.3
## 13         6.3
## 14         6.3
## 15         5.3
## 16         5.5
## 17         5.5
## 18         6.8
## 19         6.3
## 20         6.1
## 21         6.4
## 22         5.4
## 23         5.6
## 24         6.1
## 30         4.3
## 31         3.8
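The same cleaning step can also be written with tidyr's drop_na(), the tidyverse counterpart of na.omit(). The sketch below is an illustration on a toy data frame, not the original code: the "espresso" row and the name toy are hypothetical, chosen only to show one row being dropped.

```r
library(dplyr)
library(tidyr)

# Toy data mimicking the structure of the raw data.
# The "espresso" row is hypothetical and exists only to contain NAs.
toy <- tibble(
  drink_type  = c("frappucino", "espresso", "filter coffee"),
  price_small = c(5.0, NA, 2.9),
  price_large = c(6.4, NA, 4.3),
  simple      = c(NA, 1.8, NA)
)

toy_clean <- toy %>%
  select(-simple) %>%  # drop the sparse column first
  drop_na()            # then keep only rows with no missing values

toy_clean
```

Dropping the sparse column before calling drop_na() matters: if the order were reversed, every row with an NA in simple would be removed as well.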

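In the code chunk below, I computed the average price of each drink type at each size by combining group_by() with summarise(). Because each drink type appears only once in the cleaned data (26 rows, 26 groups), each average equals the drink's listed price.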

In R, the group_by() function is part of the dplyr package, which is a core component of the TidyVerse ecosystem. This function is used for grouping data based on one or more variables, allowing for the application of functions to each group separately.

# Compute average prices for each combination of drink type and size
avg_prices <- clean_df_coffee_prize %>%
  group_by(drink_type) %>%
  summarise(
    avg_price_small = mean(price_small, na.rm = TRUE),
    avg_price_medium = mean(price_medium, na.rm = TRUE),
    avg_price_large = mean(price_large, na.rm = TRUE)
  )
print(avg_prices)
## # A tibble: 26 × 4
##    drink_type                   avg_price_small avg_price_medium avg_price_large
##    <chr>                                  <dbl>            <dbl>           <dbl>
##  1 cappucino                                4.1              4.8             5.5
##  2 chai latte                               4.7              5.4             6.1
##  3 coco matcha                              5                5.7             6.4
##  4 cold brew latte                          4.5              5.2             5.9
##  5 company ice tea passion fru…             3.5              4.2             4.9
##  6 company ice tea peach                    3.5              4.2             4.9
##  7 company ice tea raspberry a…             3.5              4.2             4.9
##  8 filter coffee                            2.9              3.6             4.3
##  9 frappucino                               5                5.7             6.4
## 10 hot chocolate                            4                4.7             5.4
## # ℹ 16 more rows
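Grouping becomes more informative when a group contains several rows. As a hedged sketch, the snippet below reshapes four rows copied from the cleaned data into long format with tidyr's pivot_longer() and then averages by size instead of by drink type. The names prices and avg_by_size are illustrative and not part of the original analysis.

```r
library(dplyr)
library(tidyr)

# Four rows copied from the cleaned data shown above.
prices <- tibble(
  drink_type   = c("frappucino", "cold brew latte", "latte", "filter coffee"),
  price_small  = c(5.0, 4.5, 4.0, 2.9),
  price_medium = c(5.7, 5.2, 4.7, 3.6),
  price_large  = c(6.4, 5.9, 5.5, 4.3)
)

avg_by_size <- prices %>%
  # One row per drink/size pair; "price_small" becomes size = "small".
  pivot_longer(
    cols = starts_with("price_"),
    names_to = "size",
    names_prefix = "price_",
    values_to = "price"
  ) %>%
  group_by(size) %>%
  summarise(avg_price = mean(price), n_drinks = n())

avg_by_size
```

Each group now holds four prices, so summarise() collapses them into one average per size.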

ggplot

To visualize the average price of each drink type at each size, I used ggplot2, a powerful and flexible package for creating static graphics by layering geoms and aesthetic mappings.

The ggplot2 code below generates one heatmap-style tile plot per size, with the fill color encoding the average price.

# Heatmap of average small prices (price on the x-axis, drink type on the y-axis)
ggplot(avg_prices, aes(x = avg_price_small, y = drink_type, fill = avg_price_small)) +
  geom_tile(color = "darkgrey") +
  labs(title = "Average Small Coffee Prices by Drink Type", x = "Average Price (Small)", y = "Drink Type", fill = "Average Price") +
  scale_fill_gradient(low = "darkgrey", high = "brown")

# Heatmap of average medium prices (price on the x-axis, drink type on the y-axis)
ggplot(avg_prices, aes(x = avg_price_medium, y = drink_type, fill = avg_price_medium)) +
  geom_tile(color = "darkgrey") +
  labs(title = "Average Medium Coffee Prices by Drink Type", x = "Average Price (Medium)", y = "Drink Type", fill = "Average Price") +
  scale_fill_gradient(low = "darkgrey", high = "brown")

# Heatmap of average large prices (price on the x-axis, drink type on the y-axis)
ggplot(avg_prices, aes(x = avg_price_large, y = drink_type, fill = avg_price_large)) +
  geom_tile(color = "darkgrey") +
  labs(title = "Average Large Coffee Prices by Drink Type", x = "Average Price (Large)", y = "Drink Type", fill = "Average Price") +
  scale_fill_gradient(low = "darkgrey", high = "brown")
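The three plots above place a continuous price on the x-axis, so each shows one tile per drink. A more conventional heatmap puts a categorical variable on both axes. The sketch below is one possible alternative, not the original method: it reshapes four rows copied from the cleaned data and draws a single drink-by-size heatmap. The names prices, prices_long, and heatmap_plot are illustrative.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Four rows copied from the cleaned data shown earlier.
prices <- tibble(
  drink_type   = c("frappucino", "cold brew latte", "latte", "filter coffee"),
  price_small  = c(5.0, 4.5, 4.0, 2.9),
  price_medium = c(5.7, 5.2, 4.7, 3.6),
  price_large  = c(6.4, 5.9, 5.5, 4.3)
)

prices_long <- prices %>%
  pivot_longer(
    cols = starts_with("price_"),
    names_to = "size",
    names_prefix = "price_",
    values_to = "price"
  ) %>%
  # Order sizes logically instead of alphabetically on the axis.
  mutate(size = factor(size, levels = c("small", "medium", "large")))

heatmap_plot <- ggplot(prices_long, aes(x = size, y = drink_type, fill = price)) +
  geom_tile(color = "darkgrey") +
  scale_fill_gradient(low = "darkgrey", high = "brown") +
  labs(title = "Coffee Prices by Drink Type and Size",
       x = "Size", y = "Drink Type", fill = "Price")

heatmap_plot
```

With all sizes in one plot, the color scale is shared, so prices can be compared across sizes as well as across drinks.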

# Creating an additional column that contains the overall average cost of the drink and getting summary statistics on the dataset 

avg_prices$overall_avg<-((avg_prices$avg_price_small + avg_prices$avg_price_medium + avg_prices$avg_price_large) / 3)

summary(avg_prices)
##   drink_type        avg_price_small avg_price_medium avg_price_large
##  Length:26          Min.   :2.900   Min.   :3.400    Min.   :3.800  
##  Class :character   1st Qu.:3.925   1st Qu.:4.625    1st Qu.:5.325  
##  Mode  :character   Median :4.600   Median :5.300    Median :6.000  
##                     Mean   :4.369   Mean   :5.054    Mean   :5.746  
##                     3rd Qu.:4.900   3rd Qu.:5.600    3rd Qu.:6.300  
##                     Max.   :5.400   Max.   :6.100    Max.   :6.800  
##   overall_avg   
##  Min.   :3.433  
##  1st Qu.:4.625  
##  Median :5.300  
##  Mean   :5.056  
##  3rd Qu.:5.600  
##  Max.   :6.100
#Graphing the frequency of the average prices

ggplot(data = avg_prices, aes(x= overall_avg))+
  geom_bar(fill="blue") +
  labs(title = "Average Coffee Prices", x = "Average Price", y = "Frequency") 

# Find the cheapest small price and the most expensive large price

min(avg_prices$avg_price_small)
## [1] 2.9
max(avg_prices$avg_price_large)
## [1] 6.8
# Add a column that designates whether a drink is cheaper or more expensive than the mean overall average cost
mean(avg_prices$overall_avg)
## [1] 5.05641
avg_prices$Pricepoint <- ifelse(avg_prices$overall_avg > mean(avg_prices$overall_avg), "more expensive", "cheaper")
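A dplyr-style way to express this labelling step is sketched below, using mutate() and if_else() and comparing against the computed mean rather than a typed-in constant. The three rows are copied from the cleaned data; the names prices and labelled are illustrative, and the mean here is the mean of these three rows only, not of the full dataset.

```r
library(dplyr)

# Three rows copied from the cleaned data shown earlier.
prices <- tibble(
  drink_type   = c("frappucino", "latte", "filter coffee"),
  price_small  = c(5.0, 4.0, 2.9),
  price_medium = c(5.7, 4.7, 3.6),
  price_large  = c(6.4, 5.5, 4.3)
)

labelled <- prices %>%
  mutate(
    # Average of the three sizes for each drink.
    overall_avg = (price_small + price_medium + price_large) / 3,
    # Drinks above the mean overall average are the more expensive ones.
    Pricepoint = if_else(overall_avg > mean(overall_avg),
                         "more expensive", "cheaper")
  )

labelled
```

Computing the threshold inside mutate() keeps the labels correct if rows are added or removed later.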