Data 607 - Tidyverse CREATE

In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.

GitHub repository: https://github.com/pkowalchuk/SPRING2024TIDYVERSE

FiveThirtyEight.com datasets

Kaggle datasets

Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Later, you’ll be asked to extend an existing vignette. Using one of your classmate’s examples (as created above), you’ll then extend his or her example with additional annotated code. (15 points)

You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.

After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.

You should complete your submission on the schedule stated in the course syllabus.

Tidyverse Packages

We will be using the following package: ggplot2

ggplot2 is a system for declaratively creating graphics. You provide the data and state how your variables should be mapped to visual properties, and ggplot2 builds the visual.
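As a minimal sketch of that idea, the chunk below uses `mpg`, a sample data set that ships with ggplot2 (not the ramen data used later in this vignette). The column choices and axis labels are only illustrative.

```r
library(ggplot2)

# Data + aesthetic mapping + geometry: the three pieces of a basic plot.
# `mpg` ships with ggplot2; displ and hwy are two of its numeric columns.
p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon")

p  # printing the object draws the plot
```

Swapping `geom_point()` for another geom changes the type of chart while the data and mapping stay the same.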

Load libraries

# load in the library
library(plyr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange()   masks plyr::arrange()
## ✖ purrr::compact()   masks plyr::compact()
## ✖ dplyr::count()     masks plyr::count()
## ✖ dplyr::desc()      masks plyr::desc()
## ✖ dplyr::failwith()  masks plyr::failwith()
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::id()        masks plyr::id()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ dplyr::mutate()    masks plyr::mutate()
## ✖ dplyr::rename()    masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(forcats)

The Data

We’ll be looking at a data set of ramen reviews collected from a product review website. It encompasses a variety of attributes for each review, such as product name, brand, style, and country of origin. The rating is based on a 5-point scale.

https://www.kaggle.com/datasets/residentmario/ramen-ratings

Reading CSV

# read the ramen ratings CSV from GitHub
ramen <- read_csv("https://raw.githubusercontent.com/nk014914/Data-607/main/ramen-ratings.csv")

## Rows: 2580 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Brand, Variety, Style, Country, Stars, Top Ten
## dbl (1): Review #
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#let's view the dataset

glimpse(ramen)

## Rows: 2,580
## Columns: 7
## $ `Review #` <dbl> 2580, 2579, 2578, 2577, 2576, 2575, 2574, 2573, 2572, 2571,…
## $ Brand      <chr> "New Touch", "Just Way", "Nissin", "Wei Lih", "Ching's Secr…
## $ Variety    <chr> "T's Restaurant Tantanmen", "Noodles Spicy Hot Sesame Spicy…
## $ Style      <chr> "Cup", "Pack", "Cup", "Pack", "Pack", "Pack", "Cup", "Tray"…
## $ Country    <chr> "Japan", "Taiwan", "USA", "Taiwan", "India", "South Korea",…
## $ Stars      <chr> "3.75", "1", "2.25", "2.75", "3.75", "4.75", "4", "3.75", "…
## $ `Top Ten`  <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…

head(ramen)

## # A tibble: 6 × 7
##   `Review #` Brand          Variety                Style Country Stars `Top Ten`
##        <dbl> <chr>          <chr>                  <chr> <chr>   <chr> <chr>    
## 1       2580 New Touch      T's Restaurant Tantan… Cup   Japan   3.75  <NA>     
## 2       2579 Just Way       Noodles Spicy Hot Ses… Pack  Taiwan  1     <NA>     
## 3       2578 Nissin         Cup Noodles Chicken V… Cup   USA     2.25  <NA>     
## 4       2577 Wei Lih        GGE Ramen Snack Tomat… Pack  Taiwan  2.75  <NA>     
## 5       2576 Ching's Secret Singapore Curry        Pack  India   3.75  <NA>     
## 6       2575 Samyang Foods  Kimchi song Song Ramen Pack  South … 4.75  <NA>
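The output above shows that `Stars` was read as a character column. One possible way to get a numeric column at read time is to pass `col_types` and `na` to `read_csv()`. The sketch below uses a few made-up rows rather than the real file, and it assumes the non-numeric entries are a placeholder such as "Unrated", which is a guess about the data and not something shown above.

```r
library(readr)

# Made-up rows that mimic some ramen columns; "Unrated" is an assumed placeholder.
csv_text <- "Brand,Country,Stars
Brand A,Japan,3.75
Brand B,Taiwan,5
Brand C,USA,Unrated"

# I() tells readr (>= 2.0) to treat the string as literal data, not a file path.
toy <- read_csv(
  I(csv_text),
  col_types = cols(Stars = col_double()),  # parse Stars as numeric
  na = c("", "NA", "Unrated")              # treat the placeholder as missing
)

toy
```

Columns not named in `cols()` are still guessed, so only `Stars` is affected.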

Creating the Vignette

ggplot2 can be a really powerful tool for making discoveries in your data. By visualizing the data, we can answer questions that would be hard to answer just by viewing the raw rows.

To showcase ggplot2, we will take this data set (ramen) and use the package's functions to create a few visuals that answer some questions.

1. Which country had the most ramen reviews?

#We will use ggplot to create a visual of the number of reviews by country
#geom_bar produces a bar chart; fct_infreq and fct_rev (from forcats) order the bars by count

ggplot(data = ramen, aes(x = fct_rev(fct_infreq(Country)))) +
  geom_bar(stat = "count", fill = "Steelblue", width = 0.8) + 
  labs(title= "Ramen Reviews by Country",
       x = "Country",
       y = "# of Reviews") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 1)) +
  theme(axis.text.y = element_text(size = 7)) +
  coord_flip()
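The bar order in the chart comes from the two forcats helpers wrapped around `Country`. Here is a small sketch of what they do to factor levels, using a hypothetical vector of country labels rather than the ramen data:

```r
library(forcats)

# Hypothetical country labels, not taken from the ramen data.
country <- c("USA", "Japan", "Japan", "Taiwan", "Japan", "USA")

by_freq <- fct_infreq(country)  # levels ordered from most to least frequent
levels(by_freq)

reversed <- fct_rev(by_freq)    # same levels, order flipped
levels(reversed)
```

After `coord_flip()`, the first level is drawn at the bottom of the axis, so reversing the frequency order places the most frequent category at the top of the chart.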

From our visual we can see that Japan has the most ramen reviews, with the US and South Korea coming in 2nd and 3rd, respectively. This was to be expected since ramen is a Japanese dish. Now, let’s use ggplot to help us answer another question.

2. Out of all the Japanese brand ramen, what were the top rated brands?

#First we subset for just Japanese ramen
ramen_jp <- subset(ramen, Country == "Japan")

#Next, let's put them in descending order by Stars
#(Stars is stored as text, so this is a text sort rather than a numeric one)
ramen_jp <- ramen_jp %>%
  arrange(desc(Stars))

#Relabel "5.00" and "5.0" as "5" in the full ramen data
#(ramen_jp, created above, is not changed by this step)
ramen$Stars <- revalue(ramen$Stars, c("5.00"="5") )
ramen$Stars <- revalue(ramen$Stars, c("5.0"="5") )

#Since there are many 5-star ratings, let's subset for only the rows labeled "5"
ramen_jp_top <- subset(ramen_jp, Stars == "5")
ramen_jp_top

## # A tibble: 67 × 7
##    `Review #` Brand     Variety                    Style Country Stars `Top Ten`
##         <dbl> <chr>     <chr>                      <chr> <chr>   <chr> <chr>    
##  1       2522 Takamori  Hearty Japanese Style Cur… Pack  Japan   5     <NA>     
##  2       2441 MyKuali   Penang Hokkien Prawn Flav… Box   Japan   5     <NA>     
##  3       2419 MyKuali   Penang Red Tom Yum Goong   Box   Japan   5     <NA>     
##  4       2415 Nissin    Kitsune Udon Donbei (West) Bowl  Japan   5     <NA>     
##  5       2323 Ogasawara Kirin Giraffe Shio Ramen   Pack  Japan   5     <NA>     
##  6       2316 Nissin    Cup Noodle Spicy Curry Ch… Cup   Japan   5     <NA>     
##  7       2302 Nissin    Yakisoba                   Tray  Japan   5     <NA>     
##  8       2283 Nissin    Raoh Pork Bone Soy Soup N… Pack  Japan   5     <NA>     
##  9       2278 Nissin    Raoh Tantanmen             Pack  Japan   5     <NA>     
## 10       2255 Daikoku   Hiroshima Flavor Yakisoba  Tray  Japan   5     <NA>     
## # ℹ 57 more rows
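Because `Stars` is stored as text, sorting and matching on it depend on how each rating happens to be written. One alternative is to convert the column to numeric before sorting and filtering. The sketch below uses made-up rows, and the "Unrated" entry is an assumed placeholder rather than a value confirmed from the data.

```r
library(dplyr)

# Made-up ratings stored as text, including an assumed "Unrated" placeholder.
toy <- tibble(
  Brand = c("Brand A", "Brand B", "Brand C", "Brand D"),
  Stars = c("5", "5.00", "3.75", "Unrated")
)

toy_num <- toy %>%
  mutate(Stars = suppressWarnings(as.numeric(Stars))) %>%  # "Unrated" becomes NA
  arrange(desc(Stars))                                      # numeric sort; NA goes last

# A numeric comparison matches "5" and "5.00" alike and drops NA rows.
top <- filter(toy_num, Stars == 5)
top
```

With this approach there is no need to relabel the different spellings of a 5-star rating, since they all convert to the same number.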

#Now let's plot to see which Japanese brands had the most 5 star reviews
ggplot(data = ramen_jp_top, aes(x = fct_rev(fct_infreq(Brand)))) +
  geom_bar(stat = "count", fill = "Steelblue", width = 0.8) + 
  labs(title= "Top Japanese Ramen Brands",
       x = "Brand",
       y = "# of 5 Star Reviews") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 1)) +
  theme(axis.text.y = element_text(size = 7)) +
  coord_flip()

Nissin seems to be the winner by a long shot, with more than 8x the number of 5-star reviews compared to the second top brand, Sapporo Ichiban.
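A comparison like this can also be checked numerically by counting rows per brand with `dplyr::count()`. The sketch below runs on invented rows, so the brand names and counts are hypothetical and do not reproduce the figures above.

```r
library(dplyr)

# Hypothetical five-star rows; the brand names and counts are invented.
toy_top <- tibble(
  Brand = c("Brand A", "Brand A", "Brand A", "Brand B", "Brand C", "Brand A")
)

# One row per brand, sorted from most to fewest reviews.
brand_counts <- toy_top %>%
  count(Brand, sort = TRUE, name = "five_star_reviews")

brand_counts

# Ratio of the leading brand to the runner-up
brand_counts$five_star_reviews[1] / brand_counts$five_star_reviews[2]
```

Running the same `count()` call on `ramen_jp_top` would give the exact numbers behind the bar chart.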

Conclusion

ggplot2, which is a package in the Tidyverse, is a powerful and versatile tool for building visualizations in R. Its flexible grammar of graphics and wide variety of visualization options make it very customizable for producing high-quality plots and charts.