Choose any three of the “wide” datasets identified in the Week 6 Discussion items. For each of the three chosen datasets:
- Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations as described below.
- Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data. [Most of your grade will be based on this step!]
- Perform the analysis requested in the discussion item.
- Your code should be in an R Markdown file, posted to rpubs.com, and should include narrative descriptions of your data cleanup work, analysis, and conclusions.
We are coding in the tidyverse.
# Load packages --------------------------------------
library(tidyverse)
library(magrittr)
Deepa identified a Kaggle dataset about superhero TV shows for analysis, asking for the highest-rated show of each year and of each genre.
The original data can be found at the following link. We’ve saved a copy to our github for reading into this document.
Anoop Kumar Raut (2022) www.kaggle.com/anoopkumarraut/superhero-tv-shows/data
# Load data --------------------------------------
<- read.csv("https://raw.githubusercontent.com/pkofy/DATA607/main/Project2/Dataset_Superhero-TV-Shows.csv", stringsAsFactors = FALSE) df1
For NA handling we’re going to remove any records with missing values in any of these columns: show_title, imdb_rating, release_year, and genre. We’ll look at unique genres to see if we can combine any, and we’ll create a new column with just the release year.
# Show reader the data
glimpse(df1)
## Rows: 750
## Columns: 8
## $ show_title <chr> "Peacemaker", "The Legend of Vox Machina", "Daredev…
## $ imdb_rating <chr> "8.5", "8.6", "8.6", "8.7", "7.2", "7.6", "8.3", "8…
## $ release_year <chr> "2022- ", "2022- ", "2015-2018", "2019- ", "2019- "…
## $ runtime <chr> "40", "30", "54", "60", "50", "45", "23", "53", "43…
## $ genre <chr> "Action, Adventure, Comedy", "Animation, Action, Ad…
## $ parental_guideline <chr> "TV-MA", "TV-MA", "TV-MA", "TV-MA", "TV-G", "TV-MA"…
## $ imdb_votes <chr> "60,116", "13,128", "4,10,433", "3,47,831", "13,375…
## $ synopsis <chr> "Picking up where The Suicide Squad (2021) left off…
There are 128 unique combinations of genre represented in the data. We’ll take the first word of the genre string to create a simplified genre column. One side effect is that hyphenated genres are truncated, so “Sci-Fi” becomes “Sci”; ideally we would capture the full first genre before any comma instead (see the sketch below).
# There are 128 unique genre combinations
nrow(count(df1, genre))
## [1] 128
# Extract the first word from genre
df1$simple_genre <- str_extract(df1$genre, "[A-Za-z]+")
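That alternative would look like the following sketch (not evaluated here), which takes everything before the first comma; simple_genre_alt is a hypothetical column name:
# Sketch (not evaluated): keep the full first genre, so "Sci-Fi" stays whole
df1$simple_genre_alt <- str_extract(df1$genre, "^[^,]+")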
# The new simplified genres
count(df1, simple_genre)
## simple_genre n
## 1 Action 210
## 2 Adventure 25
## 3 Animation 462
## 4 Comedy 20
## 5 Crime 3
## 6 Documentary 1
## 7 Drama 9
## 8 Family 3
## 9 Game 1
## 10 Horror 1
## 11 Sci 9
## 12 Short 4
## 13 Talk 1
## 14 Western 1
The current release year is a range with the terminal year blank if still in production. We’ll isolate just the first year of release for use in the analysis.
# Create a new column with just the release year
df1$firstyear <- str_extract(df1$release_year, "^[0-9]{4}")
# Remove records with NA in firstyear (21 records where release_year is TBA)
df1 <- filter(df1, !is.na(firstyear))
# Show firstyear in order to reader
t(table(df1$firstyear[order(df1$firstyear)]))
##
## 1948 1949 1950 1952 1953 1954 1955 1957 1958 1959 1960 1961 1962 1963
## [1,] 1 2 2 1 1 2 1 1 2 1 1 2 2 1
##
## 1964 1965 1966 1967 1968 1969 1971 1972 1973 1974 1975 1976 1977 1978
## [1,] 2 4 16 10 4 1 4 7 4 3 3 11 7 6
##
## 1979 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992
## [1,] 6 3 7 3 10 9 13 12 14 8 6 8 9 9
##
## 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
## [1,] 13 16 15 18 10 12 12 20 18 18 19 14 20 15
##
## 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
## [1,] 15 14 12 16 18 21 14 16 28 18 20 18 19 10
##
## 2021 2022
## [1,] 32 19
There are many records with duplicate show titles. One duplicate, “Astro Boy”, covers two different TV shows airing in 1963-1966 and 2003-2004 respectively. Another, “Robocop”, aired as an animation in 1988 and as a live-action series in 1994. It looks like each record is a unique TV series.
head(df1$show_title[duplicated(df1$show_title)], n = 10)
## [1] "Young Justice" "Harley Quinn"
## [3] "Hit-Monkey" "Star Trek: Prodigy"
## [5] "Ms. Marvel" "Misfits"
## [7] "The Legend of Korra" "Agent Carter"
## [9] "The Flash" "Teenage Mutant Ninja Turtles"
There are 15 records with missing IMDB ratings to remove.
# Show reader there are 15 empty IMDB ratings
t(table(df1$imdb_rating[order(df1$imdb_rating)]))
##
## 2 2.1 2.5 2.8 3.1 3.3 3.4 3.8 4.1 4.3 4.4 4.5 4.6 4.7 4.9 5 5.1 5.2
## [1,] 15 1 1 1 1 2 1 2 1 2 1 3 2 1 2 2 1 5 3
##
## 5.3 5.4 5.5 5.6 5.7 5.8 5.9 6 6.1 6.2 6.3 6.4 6.5 6.6 6.7 6.8 6.9 7
## [1,] 3 3 8 5 6 16 7 11 17 10 19 30 32 25 27 37 31 34
##
## 7.1 7.2 7.3 7.4 7.5 7.6 7.7 7.8 7.9 8 8.1 8.2 8.3 8.4 8.5 8.6 8.7 8.9
## [1,] 34 22 32 32 25 27 18 30 25 17 16 16 16 13 18 7 9 1
##
## 9 9.1 9.3
## [1,] 1 1 1
# Filter out the records with empty IMDB ratings
df1 <- filter(df1, imdb_rating != "")
Here we remove unnecessary columns.
# Select only the columns we need for analysis
df1 <- select(df1, show_title, imdb_rating, simple_genre, firstyear)
Here we address the two provided questions. Ideally we would provide dynamic tables which would allow the user to restrict years or genres.
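A minimal sketch of such a dynamic table, assuming the DT package (which this document does not load), might look like this (not evaluated here):
# Sketch (not evaluated): per-column filters let the reader restrict years or genres
library(DT)
datatable(df1, filter = "top", options = list(pageLength = 10))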
What is the highest-rated TV show of each year?
We show below the highest-rated TV show of each year in a table. Only the ten most recent years display outside of R Markdown.
# Take the highest rated shows in a given year
ShowOfTheYear <- df1 %>%
  group_by(firstyear) %>%
  slice_max(imdb_rating, n = 1)
# Sort shows by first year of release descending
arrange(ShowOfTheYear, desc(firstyear))
## # A tibble: 84 × 4
## # Groups: firstyear [72]
## show_title imdb_rating simple_genre firstyear
## <chr> <chr> <chr> <chr>
## 1 The Legend of Vox Machina 8.6 Animation 2022
## 2 Invincible 8.7 Animation 2021
## 3 Mashin Sentai Kiramager 8 Action 2020
## 4 The Boys 8.7 Action 2019
## 5 Cinema Club 8.3 Talk 2018
## 6 The Punisher 8.5 Action 2017
## 7 Kamen Rider Build 8.5 Action 2017
## 8 Mob Psycho 100 8.6 Animation 2016
## 9 One Punch Man 8.7 Animation 2015
## 10 Star Wars: Rebels 8 Animation 2014
## # … with 74 more rows
What is the highest-rated TV show from each category in the data set?
We show below the highest-rated TV show for each genre.
# Take the highest rated shows in a given genre
ShowOfTheGenre <- df1 %>%
  group_by(simple_genre) %>%
  slice_max(imdb_rating, n = 1)
# Display shows by genre
ShowOfTheGenre
## # A tibble: 16 × 4
## # Groups: simple_genre [14]
## show_title imdb_rating simple_genre firstyear
## <chr> <chr> <chr> <chr>
## 1 The Boys 8.7 Action 2019
## 2 Journey to the West 8.7 Action 1986
## 3 The Aquabats! Super Show! 8.5 Adventure 2012
## 4 Avatar: The Last Airbender 9.3 Animation 2005
## 5 El Chapulín Colorado 8.5 Comedy 1973
## 6 Constantine 7.5 Crime 2014
## 7 Stingray 7.5 Crime 1985
## 8 Marvel Studios: Assembled 8 Documentary 2021
## 9 Dick Tracy 7.6 Drama 1950
## 10 The Electric Company 8.1 Family 1971
## 11 Who Wants to Be a Superhero? 6.1 Game 2006
## 12 Werewolf 8 Horror 1987
## 13 Captain Midnight 8 Sci 1954
## 14 Super Power Beat Down 8.3 Short 2012
## 15 Cinema Club 8.3 Talk 2018
## 16 The Lone Ranger 7.7 Western 1949
Moiya found a dataset from the CDC that has all of the COVID deaths and cases in the United States. She provided the following analysis requests:
The original data can be found at the following link. We retrieved a copy at 3:09PM EST, March 13th, 2022, and saved it to our github for reading into this document.
CDC Case Task Force (Updated daily) data.cdc.gov/Case-Surveillance/United-States-COVID-19-Cases-and-Deaths-by-State-o
# Load data --------------------------------------
<- read.csv("https://raw.githubusercontent.com/pkofy/DATA607/main/Project2/United_States_COVID-19_Cases_and_Deaths_by_State_over_Time.csv", stringsAsFactors = FALSE) df2
To simplify the project we are going to look at reported data for two days: January 19th in 2021 and 2022. As a possible expansion we could capture data for the month of January.
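That expansion might look like the following sketch (not evaluated here), keying on the MM/DD/YYYY format of submission_date; Jan2021 and Jan2022 are hypothetical names:
# Sketch (not evaluated): capture all of January for each year
Jan2021 <- filter(df2, str_starts(submission_date, "01/"), str_ends(submission_date, "/2021"))
Jan2022 <- filter(df2, str_starts(submission_date, "01/"), str_ends(submission_date, "/2022"))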
For our purposes, we’re going to look at submission date, state, total cases, total deaths, new cases and new deaths.
# Show reader the data
glimpse(df2)
## Rows: 46,800
## Columns: 15
## $ submission_date <chr> "12/01/2021", "08/17/2020", "07/20/2021", "05/13/2020"…
## $ state <chr> "ND", "MD", "MD", "VT", "NC", "WA", "AL", "DE", "WI", …
## $ tot_cases <int> 163565, 100715, 464491, 855, 706315, 346788, 695879, 1…
## $ conf_cases <int> 135705, NA, NA, NA, 632991, NA, 532931, 117969, 22932,…
## $ prob_cases <int> 27860, NA, NA, NA, 73324, NA, 162948, 10284, 2548, 225…
## $ new_case <int> 589, 503, 155, 2, 7113, 674, 5166, 450, 185, 0, 180, 1…
## $ pnew_case <int> 220, 0, 0, 0, 1040, 89, 1551, 36, 11, 0, 0, 2, 79, 208…
## $ tot_death <int> 1907, 3765, 9822, 52, 10066, 5107, 13302, 1920, 700, 1…
## $ conf_death <int> NA, 3616, 9604, NA, 9048, NA, 10425, 1756, 694, 6958, …
## $ prob_death <int> NA, 149, 218, NA, 1018, NA, 2877, 164, 6, 3435, NA, 0,…
## $ new_death <int> 9, 3, 3, 0, 113, 7, 57, 0, 2, 0, 0, 19, 31, 1, 0, 27, …
## $ pnew_death <int> 0, 0, 1, 0, 22, 0, 19, 0, 0, 0, 0, 0, 0, 0, NA, 2, 8, …
## $ created_at <chr> "12/02/2021 02:35:20 PM", "08/19/2020 12:00:00 AM", "0…
## $ consent_cases <chr> "Agree", "N/A", "N/A", "Not agree", "Agree", "N/A", "A…
## $ consent_deaths <chr> "Not agree", "Agree", "Agree", "Not agree", "Agree", "…
Here we save dataframes with the values submitted on January 19th for 2021 and 2022.
# Save extracts of data on specific dates
Jan192021 <- filter(df2, submission_date == "01/19/2021")
Jan192022 <- filter(df2, submission_date == "01/19/2022")

# Append new data frames to each other
Jan19 <- rbind(Jan192021, Jan192022)
There are 10 additional jurisdictions reported beyond the 50 states: DC, NYC, five territories, and three freely associated states. From the CDC’s Island Affairs page:
- Territories are sub-national administrative divisions overseen by the US government.
- Freely associated states are independent nations that have signed a comprehensive agreement with the US called a “Compact of Free Association,” which governs diplomatic, economic, and military relations with the US.
t(table(count(Jan19, state)))
## state
## n AK AL AR AS AZ CA CO CT DC DE FL FSM GA GU HI IA ID IL IN KS KY LA MA MD ME
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## state
## n MI MN MO MP MS MT NC ND NE NH NJ NM NV NY NYC OH OK OR PA PR PW RI RMI SC
## 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## state
## n SD TN TX UT VA VI VT WA WI WV WY
## 2 1 1 1 1 1 1 1 1 1 1 1
Here we remove unnecessary columns.
# Select only the columns we need for analysis
Jan192021 <- select(Jan192021, submission_date, state, tot_cases, new_case, tot_death, new_death)
Jan192022 <- select(Jan192022, submission_date, state, tot_cases, new_case, tot_death, new_death)
Here we create a new column with deaths as a percentage of total cases.
# Create column for deaths as a percent of total cases
Jan192021 <- Jan192021 %>% mutate(dpc = tot_death / tot_cases)
Jan192022 <- Jan192022 %>% mutate(dpc = tot_death / tot_cases)
Here we address the two provided analysis requests; however, instead of finding the states with the highest and lowest deaths, we will find the states with the highest and lowest deaths as a percentage of cases.
Find the states with the highest and lowest deaths as a percentage of cases.
We have printed tables of the five highest and five lowest states by deaths as a percent of cases for our chosen date, January 19th, in both 2021 and 2022. While we are interested in the islands, we selectively increased the number of rows returned so that the first five entries from among the 50 states, NYC, DC, and Puerto Rico are visible.
January 19, 2021:

Highest | Percent | Lowest | Percent |
---|---|---|---|
New York City | 4.77% | Utah | 0.46% |
New Jersey | 3.23% | Alaska | 0.60% |
Massachusetts | 2.93% | Nebraska | 1.00% |
Connecticut | 2.88% | Idaho | 1.04% |
Washington DC | 2.51% | Wisconsin | 1.05% |
January 19, 2022:

Highest | Percent | Lowest | Percent |
---|---|---|---|
New York City | 1.72% | Utah | 0.50% |
Mississippi | 1.61% | Vermont | 0.56% |
Alabama | 1.56% | Alaska | 0.56% |
Pennsylvania | 1.55% | Hawaii | 0.62% |
Arizona | 1.53% | Puerto Rico | 0.83% |
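The same result could be reached by excluding the island jurisdictions up front; here is a sketch (not evaluated), where islands is a hypothetical vector of the island codes seen in the state table above:
# Sketch (not evaluated): drop island jurisdictions, then take exactly five rows
islands <- c("AS", "FSM", "GU", "MP", "PW", "RMI", "VI")
head(arrange(filter(Jan192021, !state %in% islands), dpc), n = 5)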
# Highest five states with deaths as a percent of cases on 1/19/2021
High5_2021 <- head(arrange(Jan192021, desc(dpc)), n = 7)
High5_2021
## submission_date state tot_cases new_case tot_death new_death dpc
## 1 01/19/2021 NYC 549352 3945 26212 57 0.04771440
## 2 01/19/2021 NJ 635702 4628 20512 54 0.03226669
## 3 01/19/2021 MA 476526 3085 13956 65 0.02928696
## 4 01/19/2021 CT 232219 2094 6682 12 0.02877456
## 5 01/19/2021 DC 34259 226 861 4 0.02513208
## 6 01/19/2021 PA 785302 5631 19467 77 0.02478919
## 7 01/19/2021 MS 259733 1967 6384 34 0.02457909
# Lowest five states with deaths as a percent of cases on 1/19/2021
Low5_2021 <- head(arrange(Jan192021, dpc), n = 8)
Low5_2021
## submission_date state tot_cases new_case tot_death new_death dpc
## 1 01/19/2021 RMI 4 0 0 0 0.000000000
## 2 01/19/2021 AS 3 0 0 0 0.000000000
## 3 01/19/2021 FSM 1 0 0 0 0.000000000
## 4 01/19/2021 UT 326221 1302 1507 7 0.004619568
## 5 01/19/2021 AK 50797 127 303 1 0.005964919
## 6 01/19/2021 NE 183891 1106 1842 0 0.010016803
## 7 01/19/2021 ID 156778 1224 1637 30 0.010441516
## 8 01/19/2021 WI 571268 1933 5973 47 0.010455688
# Highest five states with deaths as a percent of cases on 1/19/2022
High5_2022 <- head(arrange(Jan192022, desc(dpc)), n = 6)
High5_2022
## submission_date state tot_cases new_case tot_death new_death dpc
## 1 01/19/2022 NYC 2161429 12688 37235 106 0.01722703
## 2 01/19/2022 MS 665544 8460 10690 48 0.01606205
## 3 01/19/2022 AL 1108175 16151 17241 36 0.01555801
## 4 01/19/2022 PA 2506132 18207 38767 112 0.01546886
## 5 01/19/2022 AZ 1666191 20497 25416 21 0.01525395
## 6 01/19/2022 MI 2082983 19574 31275 60 0.01501452
# Lowest five states with deaths as a percent of cases on 1/19/2022
Low5_2022 <- head(arrange(Jan192022, dpc), n = 11)
Low5_2022
## submission_date state tot_cases new_case tot_death new_death dpc
## 1 01/19/2022 FSM 7 0 0 0 0.000000000
## 2 01/19/2022 AS 18 0 0 0 0.000000000
## 3 01/19/2022 PW 192 40 0 0 0.000000000
## 4 01/19/2022 RMI 7 0 0 0 0.000000000
## 5 01/19/2022 UT 802780 12564 3997 18 0.004978948
## 6 01/19/2022 MP 4118 51 21 0 0.005099563
## 7 01/19/2022 VT 85117 0 474 0 0.005568805
## 8 01/19/2022 AK 182285 9956 1018 1 0.005584661
## 9 01/19/2022 HI 180337 3898 1115 0 0.006182869
## 10 01/19/2022 VI 14028 127 94 0 0.006700884
## 11 01/19/2022 PR 427575 4354 3563 0 0.008333041
Compare the death rate both before and after the vaccine was released.
By inspection of the tables above we can see that the death rate as a percent of cases declined after vaccines were released. Looking at it analytically, deaths as a percent of cases on January 19th were as follows:
2021: 1.73%
2022: 1.25%
# Calculate deaths as a percent of cases for 2021
dpc2021 <- sum(Jan192021$tot_death) / sum(Jan192021$tot_cases)
dpc2021
## [1] 0.01726364
# Calculate deaths as a percent of cases for 2022
dpc2022 <- sum(Jan192022$tot_death) / sum(Jan192022$tot_cases)
dpc2022
## [1] 0.0124984
Since we’re using one day’s data, we’re using total deaths and total cases as an approximation.
If we were doing a deeper analysis we would look at new deaths and new cases over equal periods of time, say a month or six months.
By using cumulative totals we are understating the difference between 2021 and 2022, because the 2022 totals also include the deaths and cases already counted by January 2021.
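Using the hypothetical Jan2021 and Jan2022 extracts sketched earlier, that deeper comparison might look like this (not evaluated here; still crude, since deaths lag cases):
# Sketch (not evaluated): deaths as a percent of cases using only January's new counts
sum(Jan2021$new_death, na.rm = TRUE) / sum(Jan2021$new_case, na.rm = TRUE)
sum(Jan2022$new_death, na.rm = TRUE) / sum(Jan2022$new_case, na.rm = TRUE)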
Benson found a Kaggle dataset called ‘The Ramen Rater’ that is a collection of ramen product reviews. He provided the following analysis requests:
The data we are using can be found at the following Kaggle link. We saved a copy to our github for reading into this document.
Alexsey Bilogur (2018) kaggle.com/residentmario/ramen-ratings
This dataset was originally from the BIG LIST on The Ramen Rater, a “product review website for the hardcore ramen enthusiast”.
Since the discussion topic suggesting this dataset was titled “Wine Quality Dataset,” can we call ramen enthusiasts “rammeliers”?
# Load data --------------------------------------
<- read.csv("https://raw.githubusercontent.com/pkofy/DATA607/main/Project2/ramen-ratings.csv", stringsAsFactors = FALSE) df3
To answer the requests (best flavor, brand, style) we need the Stars, Style, and Brand columns. There is no flavor column, unless a flavor could be extracted from the Variety name, so instead we’ll look into which country produces the best ramen. One difficulty we may run into is how to handle multiple reviews for the same variety. Or are there any ramen with the same variety name?
# Show reader the data
glimpse(df3)
## Rows: 2,580
## Columns: 7
## $ Review.. <int> 2580, 2579, 2578, 2577, 2576, 2575, 2574, 2573, 2572, 2571, 2…
## $ Brand <chr> "New Touch", "Just Way", "Nissin", "Wei Lih", "Ching's Secret…
## $ Variety <chr> "T's Restaurant Tantanmen ", "Noodles Spicy Hot Sesame Spicy …
## $ Style <chr> "Cup", "Pack", "Cup", "Pack", "Pack", "Pack", "Cup", "Tray", …
## $ Country <chr> "Japan", "Taiwan", "USA", "Taiwan", "India", "South Korea", "…
## $ Stars <chr> "3.75", "1", "2.25", "2.75", "3.75", "4.75", "4", "3.75", "0.…
## $ Top.Ten <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "…
Here we make sure the ratings are standardized.
All of the ratings are from 0.00 to 5.00. Three are unrated, which we’ll have to remove. The ratings are strings, which means 5, 5.0, and 5.00 are recorded separately, so we’ll have to turn them into numbers.
# Inspect ratings
t(table(df3$Stars[order(df3$Stars)]))
##
## 0 0.1 0.25 0.5 0.75 0.9 1 1.1 1.25 1.5 1.75 1.8 2 2.1 2.125 2.25
## [1,] 26 1 11 14 1 1 26 2 10 37 27 1 68 1 1 21
##
## 2.3 2.5 2.75 2.8 2.85 2.9 3 3.0 3.00 3.1 3.125 3.2 3.25 3.3 3.4 3.5
## [1,] 2 67 85 2 1 2 173 2 1 2 1 1 170 1 1 326
##
## 3.50 3.6 3.65 3.7 3.75 3.8 4 4.0 4.00 4.125 4.25 4.3 4.5 4.50 4.75 5
## [1,] 9 1 1 1 350 3 384 3 6 2 143 4 132 3 64 369
##
## 5.0 5.00 Unrated
## [1,] 10 7 3
# Remove records without a rating
df3 <- filter(df3, Stars != "Unrated")
# Turn the Stars from strings to numbers
df3$Stars <- as.numeric(df3$Stars)
There are 2577 rows and 2410 distinct variety names, so we expect 167 duplicates. We are going to leave the records with duplicate variety names in: they could be the same variety in multiple styles (cup, bowl, pack), or multiple brands selling the same variety. Our guess is that The Ramen Rater aggregates reviews for the same product, so we don’t know how many reviews went into each rating on record, and each record is a unique product (see the sketch after the table below).
# The number of records minus the number of records with distinct variety name
nrow(df3) - nrow(distinct(df3, Variety))
## [1] 167
# The most common varieties are "Chicken" and "Beef"
df3 %>%
  group_by(Variety) %>%
  filter(n() > 6)
## # A tibble: 14 × 7
## # Groups: Variety [2]
## Review.. Brand Variety Style Country Stars Top.Ten
## <int> <chr> <chr> <chr> <chr> <dbl> <chr>
## 1 935 Maruchan Beef Cup USA 3.5 ""
## 2 814 Indomie Beef Pack Indonesia 3.5 ""
## 3 800 Mr. Noodles Chicken Pack Canada 3 ""
## 4 786 Mr. Noodles Beef Pack Canada 3 ""
## 5 770 Kailo Chicken Pack Hong Kong 3.5 ""
## 6 672 Mug Shot Chicken Pack UK 3 ""
## 7 541 Yum Yum Chicken Pack Thailand 3.5 ""
## 8 531 Yum Yum Beef Pack Thailand 3.25 ""
## 9 508 Samyang Beef Cup South Korea 3.75 ""
## 10 469 Gefen Chicken Pack USA 3.5 ""
## 11 444 Mr. Udon Chicken Pack South Korea 3.25 ""
## 12 438 Samyang Beef Bowl South Korea 3.75 ""
## 13 120 Samyang Beef Pack South Korea 4 ""
## 14 53 Indomie Chicken Cup Indonesia 2 ""
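Here is the promised sketch (not evaluated here) for testing the unique-product guess: if the count below is 0, no Brand + Variety + Style combination repeats, and each record is a distinct product.
# Sketch (not evaluated): 0 means every Brand + Variety + Style combination is unique
nrow(df3) - nrow(distinct(df3, Brand, Variety, Style))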
Here we address the three provided analysis requests; however, instead of looking at the favorite flavor, we’re going to look at which country has the highest average rating.
Analyze the best brand.
More than 20 brands have a 5.00 average rating for their products. The first 20 are listed below:
# Calculate the average rating by brand
branddata <- df3 %>%
  group_by(Brand) %>%
  summarise(meanstars = mean(Stars))
branddata <- arrange(branddata, desc(meanstars))
head(branddata, n = 20)
## # A tibble: 20 × 2
## Brand meanstars
## <chr> <dbl>
## 1 ChoripDong 5
## 2 Daddy 5
## 3 Daifuku 5
## 4 Foodmon 5
## 5 Higashi 5
## 6 Jackpot Teriyaki 5
## 7 Kiki Noodle 5
## 8 Kimura 5
## 9 Komforte Chockolates 5
## 10 MyOri 5
## 11 Nyor Nyar 5
## 12 ORee Garden 5
## 13 Patanjali 5
## 14 Peyang 5
## 15 Plats Du Chef 5
## 16 Prima 5
## 17 Prima Taste 5
## 18 Seven & I 5
## 19 Song Hak 5
## 20 Takamori 5
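To pin down the “more than 20” figure, a one-line count would do (a sketch, not evaluated here):
# Sketch (not evaluated): how many brands average exactly 5.00
sum(branddata$meanstars == 5)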
Analyze the best ramen style.
In descending order of average rating we have “Bar” highest, followed by “Box”, “Pack”, “Bowl”, and “Tray”, with “Can” and “Cup” effectively tied for last. (A handful of records with a blank Style string sit below both, at 3.38.)
I looked on Amazon at ramen types and saw boxes, packs, bowls, and cups, but no bars, trays, or cans. So make sure your ramen comes in a box! If they put that much money into the packaging then maybe they put more money into the ingredients. Or maybe ramen IS like wine and it tastes better when it’s expensive!
# Calculate the average rating by style
styledata <- df3 %>%
  group_by(Style) %>%
  summarise(meanstars = mean(Stars))
styledata <- arrange(styledata, desc(meanstars))
head(styledata, n = 20)
## # A tibble: 8 × 2
## Style meanstars
## <chr> <dbl>
## 1 "Bar" 5
## 2 "Box" 4.29
## 3 "Pack" 3.70
## 4 "Bowl" 3.67
## 5 "Tray" 3.55
## 6 "Can" 3.5
## 7 "Cup" 3.50
## 8 "" 3.38
Which country makes the best ramen on average?
The country data is so intriguing I have to show the top 20 and the bottom 20.
Number 1 is Brazil at 4.35!
Number 2 is Sarawak at 4.33. (Sarawak is the part of Malaysia on Borneo island.)
Number 6 is Indonesia at 4.07 - makes sense, they have Goreng Rendang.
Second to last at Number 37 is Canada at 2.24!
Dubai and Finland are tied in the middle with an average rating of 3.58.
# Calculate the average rating by country, descending
countrydata <- df3 %>%
  group_by(Country) %>%
  summarise(meanstars = mean(Stars))
countrydata_desc <- arrange(countrydata, desc(meanstars))
head(countrydata_desc, n = 20)
## # A tibble: 20 × 2
## Country meanstars
## <chr> <dbl>
## 1 Brazil 4.35
## 2 Sarawak 4.33
## 3 Cambodia 4.2
## 4 Malaysia 4.15
## 5 Singapore 4.13
## 6 Indonesia 4.07
## 7 Japan 3.98
## 8 Myanmar 3.95
## 9 Fiji 3.88
## 10 Hong Kong 3.80
## 11 South Korea 3.79
## 12 United States 3.75
## 13 Mexico 3.73
## 14 Bangladesh 3.71
## 15 Taiwan 3.67
## 16 Germany 3.64
## 17 Poland 3.62
## 18 Hungary 3.61
## 19 Dubai 3.58
## 20 Finland 3.58
# Calculate the average rating by country, ascending
countrydata <- df3 %>%
  group_by(Country) %>%
  summarise(meanstars = mean(Stars))
countrydata_ascend <- arrange(countrydata, meanstars)
head(countrydata_ascend, n = 20)
## # A tibble: 20 × 2
## Country meanstars
## <chr> <dbl>
## 1 Nigeria 1.5
## 2 Canada 2.24
## 3 Netherlands 2.48
## 4 UK 3.00
## 5 Pakistan 3
## 6 Australia 3.14
## 7 Vietnam 3.19
## 8 Sweden 3.25
## 9 Colombia 3.29
## 10 Philippines 3.33
## 11 Thailand 3.38
## 12 India 3.40
## 13 China 3.42
## 14 USA 3.46
## 15 Estonia 3.5
## 16 Ghana 3.5
## 17 Nepal 3.55
## 18 Holland 3.56
## 19 Dubai 3.58
## 20 Finland 3.58
This was a good exercise. It helps me see the similarity in approaching each dataset, and yet how each one needs its own tricks and consideration to honor the requests.
As a beginner I wrote these backwards: analysis, then transformations, then explanations. With more practice I can see how it becomes easier to anticipate the transformations an analysis will need and to write the explanations as you go, from beginning to end.