Loading the libraries

library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(openxlsx)   # read.xlsx() to pull the Excel file from the URL
library(dplyr)
library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(varhandle)  # check.numeric() for screening out non-numeric ratings

First look at the data set

url.data <- "https://www.theramenrater.com/wp-content/uploads/2021/09/The-Big-List-All-reviews-up-to-3950.xlsx"
raw <- read.xlsx(url.data, sheet = 'Reviewed')
raw
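
A quick optional sketch: glimpse() from dplyr prints one line per column with its type, which is handy before deciding what needs cleaning.

# Compact structural overview: one line per column with its type
glimpse(raw)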

Looking at the data

At a high level, this data set looks fairly tame. It doesn’t appear to have much irregularity at first glance, but I do want to dig a bit more into it. Let’s first grab all the unique values of the star rating to see if anything needs to be transformed.

sort(unique(raw$Stars))
##  [1] "0"                  "0.1"                "0.25"              
##  [4] "0.5"                "0.75"               "0.9"               
##  [7] "1"                  "1.1000000000000001" "1.25"              
## [10] "1.5"                "1.75"               "1.8"               
## [13] "2"                  "2.1"                "2.125"             
## [16] "2.25"               "2.2999999999999998" "2.5"               
## [19] "2.75"               "2.8"                "2.85"              
## [22] "2.9"                "3"                  "3.1"               
## [25] "3.125"              "3.2"                "3.25"              
## [28] "3.4"                "3.5"                "3.5/2.5"           
## [31] "3.6"                "3.65"               "3.7"               
## [34] "3.75"               "4"                  "4.125"             
## [37] "4.25"               "4.25/5"             "4.5"               
## [40] "4.5/5"              "4.75"               "42829"             
## [43] "42859"              "42860"              "5"                 
## [46] "5/2.5"              "NR"                 "NS"                
## [49] "Unrated"

Taking a quick peek over this, I believe anything with a slash (such as "3.5/2.5") needs to be rationalized, anything marked NR, NS, or Unrated needs to be removed, and I need to find the source of the 42xxx values.

Let’s first track down the 42xxx values.

raw[raw$Stars == 42859,]  # inspect the rows carrying one of the odd 42xxx values

So this is a rather unique encoding issue. The reviewer in this case separates the broth and noodle ratings. Due to this split, the underlying value from Excel is formatted as a date, which becomes a 42xxx serial number when it is converted to a numeric type. To correct this, we will simply exclude the rows affected, as they are not scored on the same rating system. In the future we could adjust the encoding to account for the varied scoring system.
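
As an aside, a minimal sketch (not part of the cleaning) that supports the date theory: Excel stores dates as day counts from an 1899-12-30 origin, so the 42xxx serials map back onto calendar dates.

# Excel date serials count days from 1899-12-30 under R's conversion convention
as.Date(42859, origin = "1899-12-30")
## [1] "2017-05-04"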

data <- raw
# Keep only rows whose Stars value parses as a number
# (drops NR, NS, Unrated and the slash-separated scores)
data <- data[(check.numeric(data$Stars)), ]
data
data$Stars <- parse_number(data$Stars)
# Drop the date-encoded 42xxx serials, which survive the numeric check
clean_data <- data[data$Stars <= 5,]
clean_data

Verifying Clean Data

Now we will follow a similar procedure to verify that our data is clean, with no irregular values and nothing above the expected range:

sort(unique(clean_data$Stars))
##  [1] 0.000 0.100 0.250 0.500 0.750 0.900 1.000 1.100 1.250 1.500 1.750 1.800
## [13] 2.000 2.100 2.125 2.250 2.300 2.500 2.750 2.800 2.850 2.900 3.000 3.100
## [25] 3.125 3.200 3.250 3.400 3.500 3.600 3.650 3.700 3.750 4.000 4.125 4.250
## [37] 4.500 4.750 5.000
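
For a programmatic check on top of the eyeball one, a minimal sketch using a base R assertion:

# Error out if any rating falls outside the 0-5 scale
stopifnot(all(clean_data$Stars >= 0 & clean_data$Stars <= 5))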

At this point we can see that the data is reasonably clean, with no values outside of the standard 0-5 scale. Now we will move on to generating the analytics.

Ramen Packet Count by Country

For this first metric we will count the ramen reviews by country. We will group by country, create a new column holding the size of each group, and then save the result.

# Count the number of reviews per country
best_Brand_Country <- clean_data %>%
   group_by(Country) %>%
   summarise(count = n())
write.csv(best_Brand_Country, "Ramen_By_Country.csv", row.names = FALSE)
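
As an optional sanity check (not part of the saved output), sorting the table shows which countries dominate the reviews:

# Peek at the most-reviewed countries
best_Brand_Country %>%
   arrange(desc(count)) %>%
   head()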

Finding the best brand

For the next stage we will try to discover which brand is objectively the best. To do this, we first need a combined scoring metric, since more than one brand will most likely have at least one perfect 5 rating. Alongside each brand's average rating, we will therefore also track how many ramens it has produced, biasing the results toward manufacturers that have made the most highly rated ramens.

# Average rating and number of ramens per brand
best_Brand <- clean_data %>%
   group_by(Brand) %>%
   summarise(Stars = mean(Stars), Count = n())
# Sort by average rating, breaking ties in favor of brands with more ramens
best_Brand <- best_Brand[order(-best_Brand$Stars, -best_Brand$Count),]
best_Brand
write.csv(best_Brand, "Best_Ramen_Weighted_Count.csv", row.names = FALSE)

Here we have applied a similar grouping and sorting approach as above, breaking ties by the review count, and then saved the data.
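
To lean harder on the "most 5-point ramens" idea, here is a minimal sketch that counts only the perfect scores per brand; the object name perfect_fives is my own:

# Number of perfect-5 ramens per brand, most first
perfect_fives <- clean_data %>%
   filter(Stars == 5) %>%
   count(Brand, sort = TRUE)
perfect_fives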

The End

I have to admit, when I saw a ramen data set I was super excited, as it’s always all kinds of fun to try new kinds of ramen as a snack! I really enjoy making my own noodles, and I completely understand the scoring metrics that separate the noodles and the broth.

All in all I had a lot of fun fixing and exploring this data set.

The data set can be found here:

https://www.theramenrater.com/resources-2/the-list/