Getting Started

Instructions

Choose any three of the “wide” datasets identified in the Week 6 Discussion items. For each of the three chosen datasets:

  • Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
  • Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
  • Perform the analysis requested in the discussion item.
  • Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.



Load Libraries

We are coding in the tidyverse.

# Load packages --------------------------------------
library(tidyverse)
library(magrittr)



Superhero TV shows

Deepa identified a Kaggle dataset about superhero TV for analysis, asking:

  • What is the highest-rated TV show of each year?
  • What is the highest-rated TV show from each category in the data set?



Data

The original data can be found at the following link. We’ve saved a copy to our github for reading into this document.

Anoop Kumar Raut (2022) www.kaggle.com/anoopkumarraut/superhero-tv-shows/data

# Load data --------------------------------------
df1 <- read.csv("https://raw.githubusercontent.com/pkofy/DATA607/main/Project2/Dataset_Superhero-TV-Shows.csv", stringsAsFactors = FALSE)



Tidy and Transform

For NAN handling we’re going to remove any records with missing values in any of the columns: show_title, imdb_rating, release_year and genre. We’ll look at unique genres to see if we can combine any. And we’ll create a new column with just the release year.

# Show reader the data
glimpse(df1)
## Rows: 750
## Columns: 8
## $ show_title         <chr> "Peacemaker", "The Legend of Vox Machina", "Daredev…
## $ imdb_rating        <chr> "8.5", "8.6", "8.6", "8.7", "7.2", "7.6", "8.3", "8…
## $ release_year       <chr> "2022- ", "2022- ", "2015-2018", "2019- ", "2019- "…
## $ runtime            <chr> "40", "30", "54", "60", "50", "45", "23", "53", "43…
## $ genre              <chr> "Action, Adventure, Comedy", "Animation, Action, Ad…
## $ parental_guideline <chr> "TV-MA", "TV-MA", "TV-MA", "TV-MA", "TV-G", "TV-MA"…
## $ imdb_votes         <chr> "60,116", "13,128", "4,10,433", "3,47,831", "13,375…
## $ synopsis           <chr> "Picking up where The Suicide Squad (2021) left off…



Simplify the genre

There are 128 unique combinations of genre represented in the data. We’ll take the first genre before any comma to create a simplified genre column. For 34 there is no comma and so they are now blank. Ideally we would copy the value from the genre column to the simple_genre column if missing.

# There are 128 unique genre combinations
nrow(count(df1, genre))
## [1] 128
# Extract the first word from genre
df1$simple_genre <- str_extract(df1$genre, "[A-Za-z]+")
# The new simplified genres
count(df1, simple_genre)
##    simple_genre   n
## 1        Action 210
## 2     Adventure  25
## 3     Animation 462
## 4        Comedy  20
## 5         Crime   3
## 6   Documentary   1
## 7         Drama   9
## 8        Family   3
## 9          Game   1
## 10       Horror   1
## 11          Sci   9
## 12        Short   4
## 13         Talk   1
## 14      Western   1



Isolate release year

The current release year is a range with the terminal year blank if still in production. We’ll isolate just the first year of release for use in the analysis.

# Create a new column with just the release year
df1$firstyear <- str_extract(df1$release_year, "^[0-9]{4}")
# Remove records with NA in firstyear (21 records where release_year is TBA)
df1 <- filter(df1, firstyear != "NA")
# Show firstyear in order to reader
t(table(df1$firstyear[order(df1$firstyear)]))
##       
##        1948 1949 1950 1952 1953 1954 1955 1957 1958 1959 1960 1961 1962 1963
##   [1,]    1    2    2    1    1    2    1    1    2    1    1    2    2    1
##       
##        1964 1965 1966 1967 1968 1969 1971 1972 1973 1974 1975 1976 1977 1978
##   [1,]    2    4   16   10    4    1    4    7    4    3    3   11    7    6
##       
##        1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
##   [1,]    6    3    7    3   10    9   13   12   14    8    6    8    9    9
##       
##        1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
##   [1,]   13   16   15   18   10   12   12   20   18   18   19   14   20   15
##       
##        2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
##   [1,]   15   14   12   16   18   21   14   16   28   18   20   18   19   10
##       
##        2021 2022
##   [1,]   32   19



Inspecting duplicate show titles

There are many records with duplicate show titles. One duplicate “Astro Boy” is two different TV shows airing in 1963-1966 and 2003-2004 respectively. Another duplicate “Robocop” aired as an animation in 1988 and as a live-action in 1994. It looks like each record is a unique TV series.

head(df1$show_title[duplicated(df1$show_title)], n = 10)
##  [1] "Young Justice"                "Harley Quinn"                
##  [3] "Hit-Monkey"                   "Star Trek: Prodigy"          
##  [5] "Ms. Marvel"                   "Misfits"                     
##  [7] "The Legend of Korra"          "Agent Carter"                
##  [9] "The Flash"                    "Teenage Mutant Ninja Turtles"



Remove empty IMDB ratings

There are 15 records with missing IMDB ratings to remove.

# Show reader there are 15 empty IMDB ratings
t(table(df1$imdb_rating[order(df1$imdb_rating)]))
##       
##            2 2.1 2.5 2.8 3.1 3.3 3.4 3.8 4.1 4.3 4.4 4.5 4.6 4.7 4.9  5 5.1 5.2
##   [1,] 15  1   1   1   1   2   1   2   1   2   1   3   2   1   2   2  1   5   3
##       
##        5.3 5.4 5.5 5.6 5.7 5.8 5.9  6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9  7
##   [1,]   3   3   8   5   6  16   7 11  17  10  19  30  32  25  27  37  31 34
##       
##        7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9  8 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.9
##   [1,]  34  22  32  32  25  27  18  30  25 17  16  16  16  13  18   7   9   1
##       
##         9 9.1 9.3
##   [1,]  1   1   1
# Filter out the records with empty IMDB ratings
df1 <- filter(df1, imdb_rating != "")



Strip or reorder columns

Here we remove unnecessary columns.

# Select only the columns we need for analysis
df1 <- select(df1, show_title, imdb_rating, simple_genre, firstyear)



Analysis

Here we address the two provided questions. Ideally we would provide dynamic tables which would allow the user to restrict years or genres.



Question 1

What is the highest-rated TV show of each year?

We show below the highest-rated TV show of each year in a table. Only the most recent ten years display outside of RMarkdown.

# Take the highest rated shows in a given year
ShowOfTheYear <- df1 %>%
 group_by(firstyear) %>%
 slice_max(imdb_rating, n = 1)

# Sort shows by first year of release descending
arrange(ShowOfTheYear, desc(firstyear))
## # A tibble: 84 × 4
## # Groups:   firstyear [72]
##    show_title                imdb_rating simple_genre firstyear
##    <chr>                     <chr>       <chr>        <chr>    
##  1 The Legend of Vox Machina 8.6         Animation    2022     
##  2 Invincible                8.7         Animation    2021     
##  3 Mashin Sentai Kiramager   8           Action       2020     
##  4 The Boys                  8.7         Action       2019     
##  5 Cinema Club               8.3         Talk         2018     
##  6 The Punisher              8.5         Action       2017     
##  7 Kamen Rider Build         8.5         Action       2017     
##  8 Mob Psycho 100            8.6         Animation    2016     
##  9 One Punch Man             8.7         Animation    2015     
## 10 Star Wars: Rebels         8           Animation    2014     
## # … with 74 more rows



Question 2

What is the highest rated TV show from each category in the data set?

We show below the highest rated TV show by each genre.

# Take the highest rated shows in a given genre
ShowOfTheGenre <- df1 %>%
 group_by(simple_genre) %>%
 slice_max(imdb_rating, n = 1)

# Display shows by genre
ShowOfTheGenre
## # A tibble: 16 × 4
## # Groups:   simple_genre [14]
##    show_title                   imdb_rating simple_genre firstyear
##    <chr>                        <chr>       <chr>        <chr>    
##  1 The Boys                     8.7         Action       2019     
##  2 Journey to the West          8.7         Action       1986     
##  3 The Aquabats! Super Show!    8.5         Adventure    2012     
##  4 Avatar: The Last Airbender   9.3         Animation    2005     
##  5 El Chapulín Colorado         8.5         Comedy       1973     
##  6 Constantine                  7.5         Crime        2014     
##  7 Stingray                     7.5         Crime        1985     
##  8 Marvel Studios: Assembled    8           Documentary  2021     
##  9 Dick Tracy                   7.6         Drama        1950     
## 10 The Electric Company         8.1         Family       1971     
## 11 Who Wants to Be a Superhero? 6.1         Game         2006     
## 12 Werewolf                     8           Horror       1987     
## 13 Captain Midnight             8           Sci          1954     
## 14 Super Power Beat Down        8.3         Short        2012     
## 15 Cinema Club                  8.3         Talk         2018     
## 16 The Lone Ranger              7.7         Western      1949






COVID deaths and cases

Moiya found a dataset from the CDC that has all of the COVID deaths and cases in the United States. She provided the following analysis requests:

  • Find the state with the highest and lowest deaths.
  • Compare the death rate both before and after the vaccine was released.



Data

The original data can be found at the following link. We retrieved a copy at 3:09PM EST, March 13th, 2022, and saved it to our github for reading into this document.

CDC Case Task Force (Updated daily) data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o

# Load data --------------------------------------
df2 <- read.csv("https://raw.githubusercontent.com/pkofy/DATA607/main/Project2/United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv", stringsAsFactors = FALSE)



Tidy and Transform

To simplify the project we are going to look at reported data for two days: January 19th in 2021 and 2022. As a possible expansion we could capture data for the month of January.

For our purposes, we’re going to look at submission date, state, total cases, total deaths, new cases and new deaths.

# Show reader the data
glimpse(df2)
## Rows: 46,800
## Columns: 15
## $ submission_date <chr> "12/01/2021", "08/17/2020", "07/20/2021", "05/13/2020"…
## $ state           <chr> "ND", "MD", "MD", "VT", "NC", "WA", "AL", "DE", "WI", …
## $ tot_cases       <int> 163565, 100715, 464491, 855, 706315, 346788, 695879, 1…
## $ conf_cases      <int> 135705, NA, NA, NA, 632991, NA, 532931, 117969, 22932,…
## $ prob_cases      <int> 27860, NA, NA, NA, 73324, NA, 162948, 10284, 2548, 225…
## $ new_case        <int> 589, 503, 155, 2, 7113, 674, 5166, 450, 185, 0, 180, 1…
## $ pnew_case       <int> 220, 0, 0, 0, 1040, 89, 1551, 36, 11, 0, 0, 2, 79, 208…
## $ tot_death       <int> 1907, 3765, 9822, 52, 10066, 5107, 13302, 1920, 700, 1…
## $ conf_death      <int> NA, 3616, 9604, NA, 9048, NA, 10425, 1756, 694, 6958, …
## $ prob_death      <int> NA, 149, 218, NA, 1018, NA, 2877, 164, 6, 3435, NA, 0,…
## $ new_death       <int> 9, 3, 3, 0, 113, 7, 57, 0, 2, 0, 0, 19, 31, 1, 0, 27, …
## $ pnew_death      <int> 0, 0, 1, 0, 22, 0, 19, 0, 0, 0, 0, 0, 0, 0, NA, 2, 8, …
## $ created_at      <chr> "12/02/2021 02:35:20 PM", "08/19/2020 12:00:00 AM", "0…
## $ consent_cases   <chr> "Agree", "N/A", "N/A", "Not agree", "Agree", "N/A", "A…
## $ consent_deaths  <chr> "Not agree", "Agree", "Agree", "Not agree", "Agree", "…



Isolate comparison date

Here we save dataframes with the values submitted on January 19th for 2021 and 2022.

# Save extracts of data on specific dates
Jan192021 <- filter(df2, submission_date =="01/19/2021")
Jan192022 <- filter(df2, submission_date =="01/19/2022")

# Append new data frames to each other
Jan19 <- rbind(Jan192021, Jan192022)



Inspect States

There are 10 additional states reported: DC, NYC, five territories and three freely associated states. From the CDC’s website of Island Affairs:

  • Territories are sub-national administrative divisions overseen by the US government.
  • Freely associated states are independent nations that have signed a comprehensive agreement with the US called a “Compact of Free Association,” which governs diplomatic, economic, and military relations with the US.
  • City/District
    • DC - District of Columbia
    • NYC - New York City
  • Territories
    • AS - American Samoa
    • MP - Commonwealth of the Northern Mariana Islands
    • GU - Guam
    • PR - Puerto Rico
    • VI - US Virgin Islands
  • Freely associated states
    • FSM - Federated States of Micronesia
    • PW - Republic of Palau
    • RMI - Republic of the Marshall Islands
t(table(count(Jan19, state)))
##    state
## n   AK AL AR AS AZ CA CO CT DC DE FL FSM GA GU HI IA ID IL IN KS KY LA MA MD ME
##   2  1  1  1  1  1  1  1  1  1  1  1   1  1  1  1  1  1  1  1  1  1  1  1  1  1
##    state
## n   MI MN MO MP MS MT NC ND NE NH NJ NM NV NY NYC OH OK OR PA PR PW RI RMI SC
##   2  1  1  1  1  1  1  1  1  1  1  1  1  1  1   1  1  1  1  1  1  1  1   1  1
##    state
## n   SD TN TX UT VA VI VT WA WI WV WY
##   2  1  1  1  1  1  1  1  1  1  1  1



Strip or reorder columns

Here we remove unnecessary columns.

# Select only the columns we need for analysis
Jan192021 <- select(Jan192021, submission_date, state, tot_cases, new_case, tot_death, new_death)
Jan192022 <- select(Jan192022, submission_date, state, tot_cases, new_case, tot_death, new_death)



Mutate columns

Here we create a new column with deaths as a percentage of total cases.

# Create column for deaths as a percent of total cases
Jan192021 <- Jan192021 %>% mutate(dpc = tot_death / tot_cases)
Jan192022 <- Jan192022 %>% mutate(dpc = tot_death / tot_cases)



Analysis

Here we address the two provided analysis requests; However, instead of finding the state with the highest and lowest deaths we will find the states with the highest and lowest deaths as a percentage of cases.



Analysis Request 1

Find the states with the highest and lowest deaths as a percentage of cases.

We have printed tables of the highest and lowest five states by deaths as a percent of cases for our chosen date, January 19th, in both 2021 and 2022. While we are interested in the islands, we selectively increased the number of results to show the first five of the 50 states, NYC, DC or Puerto Rico.


Deaths as a Percent of Cases on January 19, 2021:
Highest Percent Lowest Percent
New York City 4.77% Utah 0.46%
New Jersey 3.23% Alaska 0.60%
Massachusetts 2.93% Nebraska 1.00%
Connecticut 2.88% Idaho 1.04%
Washington DC 2.51% Wisconsin 1.05%
Deaths as a Percent of Cases on January 19, 2022:
Highest Percent Lowest Percent
New York City 1.72% Utah 0.50%
Mississippi 1.61% Vermont 0.56%
Alabama 1.56% Alaska 0.56%
Pennsylvania 1.55% Hawaii 0.62%
Arizona 1.53% Puerto Rico 0.83%
# Highest five states with deaths as a percent of cases on 1/19/2021
High5_2021 <-head(arrange(Jan192021, desc(dpc)), n = 7)
High5_2021
##   submission_date state tot_cases new_case tot_death new_death        dpc
## 1      01/19/2021   NYC    549352     3945     26212        57 0.04771440
## 2      01/19/2021    NJ    635702     4628     20512        54 0.03226669
## 3      01/19/2021    MA    476526     3085     13956        65 0.02928696
## 4      01/19/2021    CT    232219     2094      6682        12 0.02877456
## 5      01/19/2021    DC     34259      226       861         4 0.02513208
## 6      01/19/2021    PA    785302     5631     19467        77 0.02478919
## 7      01/19/2021    MS    259733     1967      6384        34 0.02457909
# Lowest five states with deaths as a percent of cases on 1/19/2021
Low5_2021 <-head(arrange(Jan192021, dpc), n = 8)
Low5_2021
##   submission_date state tot_cases new_case tot_death new_death         dpc
## 1      01/19/2021   RMI         4        0         0         0 0.000000000
## 2      01/19/2021    AS         3        0         0         0 0.000000000
## 3      01/19/2021   FSM         1        0         0         0 0.000000000
## 4      01/19/2021    UT    326221     1302      1507         7 0.004619568
## 5      01/19/2021    AK     50797      127       303         1 0.005964919
## 6      01/19/2021    NE    183891     1106      1842         0 0.010016803
## 7      01/19/2021    ID    156778     1224      1637        30 0.010441516
## 8      01/19/2021    WI    571268     1933      5973        47 0.010455688
# Highest five states with deaths as a percent of cases on 1/19/2022
High5_2022 <-head(arrange(Jan192022, desc(dpc)), n = 6)
High5_2022
##   submission_date state tot_cases new_case tot_death new_death        dpc
## 1      01/19/2022   NYC   2161429    12688     37235       106 0.01722703
## 2      01/19/2022    MS    665544     8460     10690        48 0.01606205
## 3      01/19/2022    AL   1108175    16151     17241        36 0.01555801
## 4      01/19/2022    PA   2506132    18207     38767       112 0.01546886
## 5      01/19/2022    AZ   1666191    20497     25416        21 0.01525395
## 6      01/19/2022    MI   2082983    19574     31275        60 0.01501452
# Lowest five states with deaths as a percent of cases on 1/19/2022
Low5_2022 <-head(arrange(Jan192022, dpc), n = 11)
Low5_2022
##    submission_date state tot_cases new_case tot_death new_death         dpc
## 1       01/19/2022   FSM         7        0         0         0 0.000000000
## 2       01/19/2022    AS        18        0         0         0 0.000000000
## 3       01/19/2022    PW       192       40         0         0 0.000000000
## 4       01/19/2022   RMI         7        0         0         0 0.000000000
## 5       01/19/2022    UT    802780    12564      3997        18 0.004978948
## 6       01/19/2022    MP      4118       51        21         0 0.005099563
## 7       01/19/2022    VT     85117        0       474         0 0.005568805
## 8       01/19/2022    AK    182285     9956      1018         1 0.005584661
## 9       01/19/2022    HI    180337     3898      1115         0 0.006182869
## 10      01/19/2022    VI     14028      127        94         0 0.006700884
## 11      01/19/2022    PR    427575     4354      3563         0 0.008333041



Analysis Request 2

Compare the death rate both before and after the vaccine was released.

By inspection of the tables above we can see that the death rate as a percent of cases declined after vaccines were released. When we look at it analytically we see death as a percent of cases on January 19th as follows:

2021: 1.73%

2022: 1.25%

# Calculate deaths as a percent of cases for 2021
dpc2021 <- sum(Jan192021$tot_death) / sum(Jan192021$tot_cases)
dpc2021
## [1] 0.01726364
# Calculate deaths as a percent of cases for 2022
dpc2022 <- sum(Jan192022$tot_death) / sum(Jan192022$tot_cases)
dpc2022
## [1] 0.0124984



Special note

Since we’re using one day’s data, we’re using Total deaths and Total cases as an approximation.

If we were doing a deeper analysis we would look at new deaths and new cases over equal periods of time, say a month or six-months.

By using Total deaths and Total cases as an approximation, we are under representing the difference between 2021 and 2022’s deaths as a percent of cases, because the second date contains deaths and cases from the first date as well.






Ramen Quality Dataset

Benson found a Kaggle dataset called ‘The Ramen Rater’ that is a collection of ramen product reviews. He provided the following analysis requests:

  • Analyze the favorite flavor
  • Analyze the best brand
  • Analyze the best ramen style



Data

The data we are using can be found at the following Kaggle link. We saved a copy to our github for reading into this document.

Alexsey Bilogur (2018) kaggle.com/residentmario/ramen-ratings

This dataset was originally from the BIG LIST on The Ramen Rater, a “product review website for the hardcore ramen enthusiast”.

Since the discussion topic suggesting this dataset was titled “Wine Quality Dataset” can we call the ramen enthusiasts, “rammeliers”?

# Load data --------------------------------------
df3 <- read.csv("https://raw.githubusercontent.com/pkofy/DATA607/main/Project2/ramen-ratings.csv", stringsAsFactors = FALSE)



Tidy and Transform

To answer the requests (best flavor, brand, style) it looks like we need the stars, style and brand columns. It looks like there is no flavor column unless you can extract it from the Variety name. Instead we’ll likely look into which country produces the best ramen. One difficulty we may run into is how to handle multiple reviews for the same variety. Or are there any ramen with the same variety name?

# Show reader the data
glimpse(df3)
## Rows: 2,580
## Columns: 7
## $ Review.. <int> 2580, 2579, 2578, 2577, 2576, 2575, 2574, 2573, 2572, 2571, 2…
## $ Brand    <chr> "New Touch", "Just Way", "Nissin", "Wei Lih", "Ching's Secret…
## $ Variety  <chr> "T's Restaurant Tantanmen ", "Noodles Spicy Hot Sesame Spicy …
## $ Style    <chr> "Cup", "Pack", "Cup", "Pack", "Pack", "Pack", "Cup", "Tray", …
## $ Country  <chr> "Japan", "Taiwan", "USA", "Taiwan", "India", "South Korea", "…
## $ Stars    <chr> "3.75", "1", "2.25", "2.75", "3.75", "4.75", "4", "3.75", "0.…
## $ Top.Ten  <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "…



Stars/Rating Check

Here we make sure the ratings are standardized.

All of the ratings are from 0.00 to 5.00. Three are unrated which we’ll have to remove. The ratings are strings which means 5, 5.0 and 5.00 are recorded separately so we’ll have to turn them into numbers.

# Inspect ratings
t(table(df3$Stars[order(df3$Stars)]))
##       
##          0 0.1 0.25 0.5 0.75 0.9   1 1.1 1.25 1.5 1.75 1.8   2 2.1 2.125 2.25
##   [1,]  26   1   11  14    1   1  26   2   10  37   27   1  68   1     1   21
##       
##        2.3 2.5 2.75 2.8 2.85 2.9   3 3.0 3.00 3.1 3.125 3.2 3.25 3.3 3.4 3.5
##   [1,]   2  67   85   2    1   2 173   2    1   2     1   1  170   1   1 326
##       
##        3.50 3.6 3.65 3.7 3.75 3.8   4 4.0 4.00 4.125 4.25 4.3 4.5 4.50 4.75   5
##   [1,]    9   1    1   1  350   3 384   3    6     2  143   4 132    3   64 369
##       
##        5.0 5.00 Unrated
##   [1,]  10    7       3
# Remove records without a rating
df3 <- filter(df3, Stars != "Unrated")
# Turn the Stars from strings to numbers
df3$Stars <- as.numeric(df3$Stars)



Variety names

There are 2577 rows and 2410 distinct variety names so we expect 167 duplicates. We are going to leave the records with duplicate variety names in. They could be the same variety but in multiple styles (cup, bowl, pack) or multiple brands selling the same variety. My guess now is that the website The Ramen Rater aggregates reviews for the same product so we don’t know how many reviews went into each review we have on record and each record is a unique product.

# The number of records minus the number of records with distinct variety name
nrow(df3) - nrow(distinct(df3, Variety))
## [1] 167
# The most common varieties are "Chicken" and "Beef"
df3 %>% 
  group_by(Variety) %>% 
  filter(n()>6)
## # A tibble: 14 × 7
## # Groups:   Variety [2]
##    Review.. Brand       Variety Style Country     Stars Top.Ten
##       <int> <chr>       <chr>   <chr> <chr>       <dbl> <chr>  
##  1      935 Maruchan    Beef    Cup   USA          3.5  ""     
##  2      814 Indomie     Beef    Pack  Indonesia    3.5  ""     
##  3      800 Mr. Noodles Chicken Pack  Canada       3    ""     
##  4      786 Mr. Noodles Beef    Pack  Canada       3    ""     
##  5      770 Kailo       Chicken Pack  Hong Kong    3.5  ""     
##  6      672 Mug Shot    Chicken Pack  UK           3    ""     
##  7      541 Yum Yum     Chicken Pack  Thailand     3.5  ""     
##  8      531 Yum Yum     Beef    Pack  Thailand     3.25 ""     
##  9      508 Samyang     Beef    Cup   South Korea  3.75 ""     
## 10      469 Gefen       Chicken Pack  USA          3.5  ""     
## 11      444 Mr. Udon    Chicken Pack  South Korea  3.25 ""     
## 12      438 Samyang     Beef    Bowl  South Korea  3.75 ""     
## 13      120 Samyang     Beef    Pack  South Korea  4    ""     
## 14       53 Indomie     Chicken Cup   Indonesia    2    ""



Analysis

Here we address the three provided analysis requests; However, instead of looking at the favorite flavor we’re going to look at which country has the highest average rating.



Analysis Request 1

Analyze the best brand.

More than 20 brands have a 5.00 average rating for their products. The first 20 are listed below:

# Calculate the average rating by brand
branddata <- df3 %>% 
  group_by(Brand) %>% 
  summarise(meanstars = mean(Stars)) 
branddata <- arrange(branddata, desc(meanstars))
head(branddata, n=20)
## # A tibble: 20 × 2
##    Brand                meanstars
##    <chr>                    <dbl>
##  1 ChoripDong                   5
##  2 Daddy                        5
##  3 Daifuku                      5
##  4 Foodmon                      5
##  5 Higashi                      5
##  6 Jackpot Teriyaki             5
##  7 Kiki Noodle                  5
##  8 Kimura                       5
##  9 Komforte Chockolates         5
## 10 MyOri                        5
## 11 Nyor Nyar                    5
## 12 ORee Garden                  5
## 13 Patanjali                    5
## 14 Peyang                       5
## 15 Plats Du Chef                5
## 16 Prima                        5
## 17 Prima Taste                  5
## 18 Seven & I                    5
## 19 Song Hak                     5
## 20 Takamori                     5



Analysis Request 2

Analyze the best ramen style.

In descending order of quality we have the highest, “Bar”, followed by “Box”, “Pack”, “Bowl”, “Tray” and “Can” and “Cup” are tied for last.

I looked on Amazon at Ramen types and saw Boxes, Packs, Bowls and Cups, but no Bars, Trays or Cans. So make sure your ramen comes in a box! If they put that much money into the packaging then maybe they put more money into the ingredients. Or maybe ramen IS like wine and it tastes better when it’s expensive!

# Calculate the average rating by style
styledata <- df3 %>% 
  group_by(Style) %>% 
  summarise(meanstars = mean(Stars)) 
styledata <- arrange(styledata, desc(meanstars))
head(styledata, n=20)
## # A tibble: 8 × 2
##   Style  meanstars
##   <chr>      <dbl>
## 1 "Bar"       5   
## 2 "Box"       4.29
## 3 "Pack"      3.70
## 4 "Bowl"      3.67
## 5 "Tray"      3.55
## 6 "Can"       3.5 
## 7 "Cup"       3.50
## 8 ""          3.38



Analysis Request 3

Which country makes the best ramen on average?

The country data is so intriguing I have to show the top 20 and the bottom 20.

Number 1 is Brazil at 4.35!

Number 2 is Sarawak at 4.33. (Sarawak is the part of Malaysia on Borneo island.)

Number 6 is Indonesia at 4.07 - makes sense, they have Goreng Rendang.

Second to last at Number 37 is Canada at 2.24!

Dubai and Finland are tied in the middle with an average rating of 3.58.

# Calculate the average rating by country, descending
countrydata <- df3 %>% 
  group_by(Country) %>% 
  summarise(meanstars = mean(Stars)) 
countrydata_desc <- arrange(countrydata, desc(meanstars))
head(countrydata_desc, n=20)
## # A tibble: 20 × 2
##    Country       meanstars
##    <chr>             <dbl>
##  1 Brazil             4.35
##  2 Sarawak            4.33
##  3 Cambodia           4.2 
##  4 Malaysia           4.15
##  5 Singapore          4.13
##  6 Indonesia          4.07
##  7 Japan              3.98
##  8 Myanmar            3.95
##  9 Fiji               3.88
## 10 Hong Kong          3.80
## 11 South Korea        3.79
## 12 United States      3.75
## 13 Mexico             3.73
## 14 Bangladesh         3.71
## 15 Taiwan             3.67
## 16 Germany            3.64
## 17 Poland             3.62
## 18 Hungary            3.61
## 19 Dubai              3.58
## 20 Finland            3.58
# Calculate the average rating by country, descending
countrydata <- df3 %>% 
  group_by(Country) %>% 
  summarise(meanstars = mean(Stars)) 
countrydata_ascend <- arrange(countrydata, meanstars)
head(countrydata_ascend, n=20)
## # A tibble: 20 × 2
##    Country     meanstars
##    <chr>           <dbl>
##  1 Nigeria          1.5 
##  2 Canada           2.24
##  3 Netherlands      2.48
##  4 UK               3.00
##  5 Pakistan         3   
##  6 Australia        3.14
##  7 Vietnam          3.19
##  8 Sweden           3.25
##  9 Colombia         3.29
## 10 Philippines      3.33
## 11 Thailand         3.38
## 12 India            3.40
## 13 China            3.42
## 14 USA              3.46
## 15 Estonia          3.5 
## 16 Ghana            3.5 
## 17 Nepal            3.55
## 18 Holland          3.56
## 19 Dubai            3.58
## 20 Finland          3.58



Conclusion

This was a good exercise. It helps me see the similarity in approaching data and yet how each dataset needs unique tricks and consideration to honor the requests.

As a beginner I wrote these backwards: Analysis, Transformations, Explanations. With more practice I can see how it becomes easier to anticipate the transformations for the analysis and write the explanations as you go from beginning to end.