In this assignment, you’ll practice collaborating around a code project with GitHub. You could consider our collective work as building out a book of examples on how to use TidyVerse functions.
GitHub repository: https://github.com/pkowalchuk/SPRING2024TIDYVERSE
FiveThirtyEight.com datasets
Kaggle datasets
Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
Later, you’ll be asked to extend an existing vignette. Using one of your classmates’ examples (as created above), you’ll then extend that example with additional annotated code. (15 points)
You should clone the provided repository. Once you have code to submit, you should make a pull request on the shared repository. You should also update the README.md file with your example.
After you’ve created your vignette, please submit your GitHub handle name in the submission link provided below. This will let your instructor know that your work is ready to be peer-graded.
You should complete your submission on the schedule stated in the course syllabus.
We will be using the following package: ggplot2
ggplot2 is a system for declaratively creating graphics. Essentially, you provide the data, state which variables map to which aesthetics (axes, colour, size) and which geometry to draw, and it builds the visual for you.
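As a minimal sketch of that idea, the block below builds a plot from the three pieces just described: a data frame, an aesthetic mapping, and a geometry layer. It uses `mpg`, a small example dataset that ships with ggplot2, rather than the ramen data, so the choice of columns here is purely illustrative.

```r
library(ggplot2)

# mpg is an example dataset bundled with ggplot2 (fuel economy of 234 cars)
p <- ggplot(data = mpg, aes(x = displ, y = hwy)) +  # data + aesthetic mapping
  geom_point(colour = "steelblue") +                # geometry: one point per row
  labs(title = "Engine size vs. highway mileage",
       x = "Engine displacement (litres)",
       y = "Highway miles per gallon")

p  # printing the object draws the plot
```

Swapping `geom_point()` for another geometry changes how the same mapping is drawn, which is why the later examples only need a different `geom_*()` call to produce bar charts.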
# load the libraries (plyr first, so that dplyr's versions of shared functions take precedence)
library(plyr)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::arrange() masks plyr::arrange()
## ✖ purrr::compact() masks plyr::compact()
## ✖ dplyr::count() masks plyr::count()
## ✖ dplyr::desc() masks plyr::desc()
## ✖ dplyr::failwith() masks plyr::failwith()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::id() masks plyr::id()
## ✖ dplyr::lag() masks stats::lag()
## ✖ dplyr::mutate() masks plyr::mutate()
## ✖ dplyr::rename() masks plyr::rename()
## ✖ dplyr::summarise() masks plyr::summarise()
## ✖ dplyr::summarize() masks plyr::summarize()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(forcats)
We’ll be looking at a data set of ramen reviews collected from a product review website. It encompasses a variety of attributes for each review, such as product name, brand, style, and country of origin. The rating is based on a 5-point scale.
https://www.kaggle.com/datasets/residentmario/ramen-ratings
# read the ramen ratings CSV (a single file containing all reviews)
ramen <- read_csv("https://raw.githubusercontent.com/nk014914/Data-607/main/ramen-ratings.csv")
## Rows: 2580 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Brand, Variety, Style, Country, Stars, Top Ten
## dbl (1): Review #
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#let's view the dataset
glimpse(ramen)
## Rows: 2,580
## Columns: 7
## $ `Review #` <dbl> 2580, 2579, 2578, 2577, 2576, 2575, 2574, 2573, 2572, 2571,…
## $ Brand <chr> "New Touch", "Just Way", "Nissin", "Wei Lih", "Ching's Secr…
## $ Variety <chr> "T's Restaurant Tantanmen", "Noodles Spicy Hot Sesame Spicy…
## $ Style <chr> "Cup", "Pack", "Cup", "Pack", "Pack", "Pack", "Cup", "Tray"…
## $ Country <chr> "Japan", "Taiwan", "USA", "Taiwan", "India", "South Korea",…
## $ Stars <chr> "3.75", "1", "2.25", "2.75", "3.75", "4.75", "4", "3.75", "…
## $ `Top Ten` <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
head(ramen)
## # A tibble: 6 × 7
## `Review #` Brand Variety Style Country Stars `Top Ten`
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2580 New Touch T's Restaurant Tantan… Cup Japan 3.75 <NA>
## 2 2579 Just Way Noodles Spicy Hot Ses… Pack Taiwan 1 <NA>
## 3 2578 Nissin Cup Noodles Chicken V… Cup USA 2.25 <NA>
## 4 2577 Wei Lih GGE Ramen Snack Tomat… Pack Taiwan 2.75 <NA>
## 5 2576 Ching's Secret Singapore Curry Pack India 3.75 <NA>
## 6 2575 Samyang Foods Kimchi song Song Ramen Pack South … 4.75 <NA>
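The column specification and `glimpse()` output above show `Stars` stored as character rather than numeric, which usually means some entries are not plain numbers or are formatted inconsistently. The sketch below shows one way to get a numeric rating column. It runs on a small made-up tibble, not the real file: the `"5.0"` and `"5.00"` spellings mirror the ones handled later in this vignette, while the `"Unrated"` entry and the `Stars_num` column name are assumptions for illustration.

```r
library(tibble)
library(dplyr)

# Made-up rows that mimic mixed formats in a text rating column
toy <- tibble(
  Brand = c("A", "B", "C", "D"),
  Stars = c("5", "5.0", "5.00", "Unrated")  # "Unrated" is a hypothetical entry
)

toy_num <- toy %>%
  # as.numeric() turns "5", "5.0" and "5.00" into 5 and non-numeric text into NA;
  # suppressWarnings() hides the "NAs introduced by coercion" warning
  mutate(Stars_num = suppressWarnings(as.numeric(Stars)))

toy_num

# Number of 5-star rows, ignoring entries that could not be parsed
sum(toy_num$Stars_num == 5, na.rm = TRUE)
```

With a numeric column, sorting and filtering behave numerically, so differently formatted 5-star ratings no longer need to be relabeled by hand.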
ggplot2 can be a really powerful tool for making discoveries in your data. By visualizing the data, we can answer questions that would be hard to answer by viewing the raw data alone.
To showcase ggplot2, we will take this data set (ramen) and use ggplot() to create a few visuals that answer some questions.
1. Which country had the most ramen reviews?
#We will use ggplot to create a visual of the number of reviews by country
#geom_bar produces a bar chart; fct_infreq orders countries by review count
ggplot(data = ramen, aes(x = fct_rev(fct_infreq(Country)))) +
  geom_bar(stat = "count", fill = "Steelblue", width = 0.8) +
  labs(title = "Ramen Reviews by Country",
       x = "Country",
       y = "# of Reviews") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 1)) +
  theme(axis.text.y = element_text(size = 7)) +
  coord_flip()
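The bar order in the chart above comes from the two forcats calls wrapped around `Country`. As a small sketch on a made-up character vector (the values and the `country` and `by_freq` names are illustrative only), `fct_infreq()` orders levels from most to least frequent and `fct_rev()` reverses that order. Because `coord_flip()` draws the first level at the bottom, reversing puts the most frequent category at the top.

```r
library(forcats)

# Made-up values: Japan appears 3 times, USA twice, South Korea once
country <- c("Japan", "USA", "Japan", "South Korea", "Japan", "USA")

by_freq <- fct_infreq(country)  # levels ordered by descending frequency
levels(by_freq)
# "Japan" "USA" "South Korea"

levels(fct_rev(by_freq))        # same levels, order reversed
# "South Korea" "USA" "Japan"

table(by_freq)                  # the counts that geom_bar(stat = "count") would draw
```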
From our visual we can see that Japan has the most ramen reviews, with the US and South Korea coming in 2nd and 3rd, respectively. This was to be expected since ramen is a Japanese dish. Now, let’s use ggplot to help us answer another question.
2. Which Japanese brands had the most 5-star reviews?
#Stars is stored as text, so first make sure all 5 star ratings are labeled as "5"
#(this has to happen before subsetting, otherwise the subset keeps the old labels)
ramen$Stars <- revalue(ramen$Stars, c("5.00" = "5", "5.0" = "5"))
#Next we subset for just Japanese Ramen
ramen_jp <- subset(ramen, Country == "Japan")
#Put them in order by ratings (Stars is character, so the sort is alphabetical)
ramen_jp <- ramen_jp %>%
  arrange(desc(Stars))
#Since there are many 5 star ratings, let's subset for only 5 star ratings
ramen_jp_top <- subset(ramen_jp, Stars == "5")
ramen_jp_top
## # A tibble: 67 × 7
## `Review #` Brand Variety Style Country Stars `Top Ten`
## <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2522 Takamori Hearty Japanese Style Cur… Pack Japan 5 <NA>
## 2 2441 MyKuali Penang Hokkien Prawn Flav… Box Japan 5 <NA>
## 3 2419 MyKuali Penang Red Tom Yum Goong Box Japan 5 <NA>
## 4 2415 Nissin Kitsune Udon Donbei (West) Bowl Japan 5 <NA>
## 5 2323 Ogasawara Kirin Giraffe Shio Ramen Pack Japan 5 <NA>
## 6 2316 Nissin Cup Noodle Spicy Curry Ch… Cup Japan 5 <NA>
## 7 2302 Nissin Yakisoba Tray Japan 5 <NA>
## 8 2283 Nissin Raoh Pork Bone Soy Soup N… Pack Japan 5 <NA>
## 9 2278 Nissin Raoh Tantanmen Pack Japan 5 <NA>
## 10 2255 Daikoku Hiroshima Flavor Yakisoba Tray Japan 5 <NA>
## # ℹ 57 more rows
#Now let's plot to see which Japanese brands had the most 5 star reviews
ggplot(data = ramen_jp_top, aes(x = fct_rev(fct_infreq(Brand)))) +
  geom_bar(stat = "count", fill = "Steelblue", width = 0.8) +
  labs(title = "Top Japanese Ramen Brands",
       x = "Brand",
       y = "# of 5 Star Reviews") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 0, vjust = 0.5, hjust = 1)) +
  theme(axis.text.y = element_text(size = 7)) +
  coord_flip()
Nissin seems to be the winner by a long shot, with more than eight times as many 5-star reviews as the second-ranked brand, Sapporo Ichiban.
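A comparison like this can also be read off a table instead of estimated from bar lengths. The sketch below counts rows per brand and computes the ratio between the top two. It uses a made-up tibble with placeholder brand names, so the numbers say nothing about the real data; `five_star`, `brand_counts`, `n_reviews`, and `ratio` are illustrative names. `dplyr::count()` is written with its namespace because plyr also exports a `count()` function.

```r
library(tibble)
library(dplyr)

# Made-up 5-star reviews: Brand A has 4, Brand B has 2, Brand C has 1
five_star <- tibble(
  Brand = c("Brand A", "Brand A", "Brand A", "Brand A",
            "Brand B", "Brand B", "Brand C")
)

brand_counts <- five_star %>%
  dplyr::count(Brand, sort = TRUE, name = "n_reviews")  # one row per brand, largest first

brand_counts

# How many times more reviews the top brand has than the runner-up
ratio <- brand_counts$n_reviews[1] / brand_counts$n_reviews[2]
ratio
# 2
```

Running the same two steps on `ramen_jp_top` would give the exact counts behind the chart.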
ggplot2, which is a package in the Tidyverse, is a powerful and versatile tool for building visualizations in R. Its flexible grammar of graphics and wide variety of visualization options make it very customizable for producing high-quality plots and charts.