library(tidyverse)
library(dplyr)
library(lemon)
library(rmarkdown)

library(stringr)
library(lubridate)
library(ggpubr)
library(ggplot2)
knit_print.data.frame <- lemon_print

Intro to wide data set collection

For project 3, I selected three data sets from our discussion boards. I will play with problems given by Benson, Tora, and Jay.

Benson’s Exercise| Ramen shops

For Benson’s data, it is based on the reviews on a international ramens on the market. His discussion posts ask to analyze the favorite favor, best brand, ramen style.

ramen<-read.csv("ramen-ratings.csv",header = TRUE)
head(ramen)
Review.. Brand Variety Style Country Stars Top.Ten
2580 New Touch T’s Restaurant Tantanmen Cup Japan 3.75
2579 Just Way Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles Pack Taiwan 1
2578 Nissin Cup Noodles Chicken Vegetable Cup USA 2.25
2577 Wei Lih GGE Ramen Snack Tomato Flavor Pack Taiwan 2.75
2576 Ching’s Secret Singapore Curry Pack India 3.75
2575 Samyang Foods Kimchi song Song Ramen Pack South Korea 4.75

Favorite favor

For Flavor, we will based this on the top ten ranking system in the data set. We need to tidy this data, as the top tens have text inside with the numeral ranking.

On further inspection, I noticed the rankings are all from different years and not all the top tens are in the data set! I will cut down on the repetitiveness of the exercise and view 2016’s top ten; however, these steps will lead to the same answer.

top_r<-ramen%>%filter(grepl("[0-9]",ramen$Top.Ten))
tw16<-top_r%>%filter(grepl("(2016)",top_r$Top.Ten))

tw16$Top.Ten<-str_sub(tw16$Top.Ten,7,8)
tw16$Top.Ten <- as.numeric(tw16$Top.Ten)
tw16%>%arrange(Top.Ten)
Review.. Brand Variety Style Country Stars Top.Ten
1947 Prima Taste Singapore Laksa Wholegrain La Mian Pack Singapore 5 1
1907 Prima Taste Singapore Curry Wholegrain La Mian Pack Singapore 5 5
1689 Wugudaochang Tomato Beef Brisket Flavor Purple Potato Noodle Pack China 5 7
1925 Prima Juzz’s Mee Creamy Chicken Flavour Pack Singapore 5 8
1828 Tseng Noodles Scallion With Sichuan Pepper Flavor Pack Taiwan 5 9
1964 MAMA Instant Noodles Coconut Milk Flavour Pack Myanmar 5 10

Best Brand

For best brand, I will group all the listed brands together and calculate their average score. We can compare the top scored brands with those on the top ten list to see if these raved brands made the scoring board.

Stars is char vector and has some nulls. Let’s clean these records! After our cleaning, I can now sort the brands on the list

ramen<-na.omit(ramen)
ramen<-ramen%>%filter(!grepl("Unrated",Stars))
ramen$Stars <- as.numeric(ramen$Stars)
Brand_rank<-ramen%>%group_by(Brand)%>%summarise(avg=mean(Stars))
head(Brand_rank%>%arrange(desc(avg)))
Brand avg
ChoripDong 5
Daddy 5
Daifuku 5
Foodmon 5
Higashi 5
Jackpot Teriyaki 5
paged_table(left_join(top_r,Brand_rank,by="Brand"))
Review.. Brand Variety Style Country Stars Top.Ten avg
1964 MAMA Instant Noodles Coconut Milk Flavour Pack Myanmar 5 2016 #10 3.898148
1947 Prima Taste Singapore Laksa Wholegrain La Mian Pack Singapore 5 2016 #1 5.000000
1925 Prima Juzz’s Mee Creamy Chicken Flavour Pack Singapore 5 2016 #8 5.000000
1907 Prima Taste Singapore Curry Wholegrain La Mian Pack Singapore 5 2016 #5 5.000000
1828 Tseng Noodles Scallion With Sichuan Pepper Flavor Pack Taiwan 5 2016 #9 4.821429
1689 Wugudaochang Tomato Beef Brisket Flavor Purple Potato Noodle Pack China 5 2016 #7 4.125000
1638 A-Sha Dry Noodle Veggie Noodle Tomato Noodle With Vine Ripened Tomato Sauce Pack Taiwan 5 2015 #10 4.067308
1617 MyKuali Penang Hokkien Prawn Noodle (New Improved Taste) Pack Malaysia 5 2015 #7 4.947917
1585 CarJEN Nyonya Curry Laksa Pack Malaysia 5 2015 #4 4.928571
1521 Maruchan Gotsumori Sauce Yakisoba Tray Japan 5 2015 #9 3.554276
1508 Mamee Chef Gold Recipe Mi Kari Seribu Rasa Pack Malaysia 5 2015 #6 3.939655
1500 MyKuali Penang Red Tom Yum Goong Noodle Pack Malaysia 5 2015 #1 4.947917
1471 Mama Instant Noodles Shrimp Creamy Tom Yum Flavour Jumbo Pack Pack Thailand 5 2013 #10 3.628873
1466 Mama Oriental Style Instant Noodles Green Curry Flavour Jumbo Pack Pack Thailand 5 2015 #8 3.628873
1350 Mamee Chef Curry Laksa Flavour Pack Malaysia 5 2014 #7 3.939655
1330 Sapporo Ichiban Otafuku Okonomi Sauce Yakisoba Tray Japan 5 2014 #4 3.830000
1308 Nongshim Soon Veggie Noodle Soup Pack South Korea 5 2014 #9 4.000000
1302 Mama Instant Noodles Yentafo Tom Yum Mohfai Flavour Pack Thailand 5 2014 #10 3.628873
1271 Prima Taste Singapore Chilli Crab La Mian Pack Singapore 5 2014 #8 5.000000
1198 Samyang Foods Maesaengyitangmyun Baked Noodle Pack South Korea 5 2014 #5 4.068627
1183 Paldo Cheese Noodle Pack South Korea 5 2014 #6 4.018939
1087 MyKuali Penang White Curry Noodle Pack Malaysia 5 2014 #1 4.947917
992 Prima Taste Singapore Laksa La Mian Pack Singapore 5 2013 #1 5.000000
991 Prima Taste Singapore Curry La Mian Pack Singapore 5 2013 #2 5.000000
934 Nongshim Jinjja Jinjja Flamin’ Hot & Nutty Pack USA 5 2013 #4 4.000000
823 Paldo Kokomen Spicy Chicken Pack South Korea 5 2013 #9 4.018939
715 Indomie Mi Goreng Rendang (Import) Pack Indonesia 5 2013 #3 4.070755
608 Koka Spicy Black Pepper Pack Singapore 5 2012 #10 3.750000
578 Nongshim Shin Ramyun Black Pack South Korea 4.75 2012 #7 4.000000
434 Mi Sedaap Kari Spesial Pack Indonesia 4.5 2012 #5 4.145833
392 Nissin Yakisoba Noodles Karashi Tray Japan 5 2012 #3 3.918570
391 Myojo Hyoubanya No Chukasoba Oriental Pack Japan 4.25 2012 #6 3.801587
285 Doll Artificial Chicken Pack Hong Kong 4.5 2012 #9 3.296875
105 Indomie Special Fried Curly Noodle Pack Indonesia 5 2012 #1 4.070755
47 Indomie Mi Goreng Jumbo Barbecue Chicken Pack Indonesia 5 2012 #2 4.070755
31 Myojo Ippeichan Yakisoba Tray Japan 4 2013 #6 3.801587
13 Sapporo Ichiban Chow Mein Pack Japan 5 2012 #4 3.830000

Ramen style

I want to group the data set by the different types of ramen style and see which type has the greatest average in Stars.

R_style<-ramen%>%group_by(Style)%>%summarise(avg=mean(Stars))
paged_table(R_style%>%arrange(desc(avg)))
Style avg
Bar 5.000000
Box 4.291667
Pack 3.700458
Bowl 3.670686
Tray 3.545139
Can 3.500000
Cup 3.498500
3.375000

Tora’s exercise| School Quality Reports

Tora’s data set is based on the school’s student learning process compared to the student’s performance. Tora’s discussion asked for the analysis: average student attendance (in-person) to teacher’s number of years of experience, movement of student with disabilities to less restrictive environments vs Race of student, Economic needs index vs Number of Remote Learning Days.

school<-read.csv("2020_2021_School_Quality_Reports_.csv",header = TRUE)

Avg student attendance s teacher’s # of YOE

I want to see the school name, school type , enrollment, teacher of more than three years and the metric value of student attendance in person.

The data is missing values noted by the “No Data” tag. For a cleaner view, let’s apply a filter. If a school’s teachers experience is above 90%, the student’s attendance leans above 90% as well. There are outliers where the teachers’ experience is below 40% with high student attendance, but we need more data points and samples to confirm if there is a correlation. It does seems promising the teacher’s YOE affects students for in person learning.

year_att<-school%>%select(c('school_name','school_type','enrollment','Teach_3_more_exp','val_attendance_inperson_k3_all'))
year_att<-year_att%>%filter(!grepl("No Data",Teach_3_more_exp))
year_att<-year_att%>%filter(!grepl("No Data",val_attendance_inperson_k3_all))

Mvmt of student with disabilities to less restrictive environments vs Race of student

For the second analysis, we want to see if there’s a correlation between the race of a student and their move towards less restrictive environments.

In my analysis, There is a stronger pattern of Hispanic student population in these high percentages in less restrictive environments compared to other races. For example, there are more black disabled students moved to less restrictive places when the majority of the black student population less than ~30%.

dis_race<-school%>%select(c('school_name','ethnicity_asian_pct','ethnicity_black_pct','ethnicity_hispanic_pct','ethnicity_amerindian_pct','ethnicity_pacific_pct','ethnicity_white_pct','val_lre_all'))

dis_race<-dis_race%>%filter(!grepl("No Data",val_lre_all))
paged_table(dis_race%>%arrange(desc(val_lre_all)))
his_dis<-ggplot(dis_race,aes(ethnicity_hispanic_pct,val_lre_all))+geom_point()+labs(x="% of hispanic student pop.",y="% less restrictive")
blk_dis<-ggplot(dis_race,aes(ethnicity_black_pct,val_lre_all))+geom_point()+labs(x="% of black student pop.",y="% less restrictive")
ggarrange(his_dis,blk_dis,ncol=2)

Economic needs index vs Number of Remote Learning Days

For context, the economic needs index is the average of students who family’s income is at or below the poverty level. It ranks at a scale 0.1-1.0; where 1.0 means the student qualifies for economical support1.

There is not a large standard deviation between the average attendance and the remote attendance across on the school. For the economic needs compared to remote attendance, there is a slight downwards trend of average attendance in remote learning. There could be multiple reasons, as children may substitute for child care or work through school hours in remote settings.

eni_remote<-school%>%select(c('school_name','eni_pct','n_attendance_k3_all','n_attendance_remote_k3_all','val_attendance_remote_k3_all'))

r_att<-ggplot(eni_remote,aes(n_attendance_k3_all,n_attendance_remote_k3_all))+geom_point()+labs(x="Avg s. attendance",y="Avg s. remote attendence")
 
r_enc<-ggplot(eni_remote,aes(eni_pct,val_attendance_remote_k3_all))+geom_point()+labs(x="Econmic needs index", y="Avg s. remote attendance")
 
ggarrange(r_att,r_enc,nrow=2)

Jay’s Exercise| Covid Cases

The data set is data from the CDC about Covid cases over state over time. Jay’s discussion post asked for the following analysis: Compare the seasons to confirmed and probable cases and deaths, compare the states to confirmed/probable cases and death, and review time gap between created_at and submission dates.

cvid_cases<-read.csv("COVID-19_Cases_and_Deaths.csv",header = TRUE)

Comparing probable vs confirmed cases/deaths by seasons

For this exercise, I do not want rows with NA in these columns. I created a duplicate versions with NAs omitted. After NAs are cleared, I need the submission date column to be recognized as date data in order to use the month function.

For confirmed cases to probable cases, the winter months Dec-Feb were the most infectious season. There appears more probable cases than confirmed; the confirmed cases increased each month leading to January.

cvid_cases$submission_date<-as.Date(cvid_cases$submission_date,format='%m/%d/%Y')

cc<-cvid_cases

cc<-cc%>%mutate(m=month(submission_date,label=TRUE))


cc<-cc%>%drop_na()

paged_table(cc%>%group_by(m)%>%summarise(con_count=max(conf_cases),prob_count=max(prob_cases)))

For confirmed to probable deaths, they mirror the same response. The winter months were the most deadly.

paged_table(cc%>%group_by(m)%>%summarise(con_count=max(conf_death),prob_count=max(prob_death)))

Comparing probable vs confirmed cases/deaths by state

For confirmed cases to probable cases, California leads first with 4,640,489 confirmed cases! For probable cases, Ohio leads first with 597,447 cases.

paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_cases),prob_count=max(prob_cases))%>%arrange(desc(con_count)))
paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_cases),prob_count=max(prob_cases))%>%arrange(desc(prob_count)))

For confirmed deaths, it goes to California again with 71,408 deaths. Tennesse counts for the largest probable deaths from Covid with 6,355 deaths.

paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_death),prob_count=max(prob_death))%>%arrange(desc(con_count)))
paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_death),prob_count=max(prob_death))%>%arrange(desc(prob_count)))

Compare the time gap between created_at and submission_date by state

For our final problem, we look at the % difference between created_at and submission_date time by state. For this problem, let us transform the created_at class into the date class. Then, I will use the interval function to find the difference in the submission date to the created by date and calculated the duration.

Lets review the duration summary. The average time difference between the two dates is 1 day.

cvid_cases$created_at<-as.Date(cvid_cases$created_at,format='%m/%d/%Y')
cc_date<-cvid_cases%>%select(c('submission_date','created_at'))
cc_date<-cc_date%>%mutate(dif=as.duration(interval(submission_date,created_at)))

paged_table(cc_date)
summary(cc_date$dif)
##                              Min.                           1st Qu. 
##             "-172800s (~-2 days)"                "86400s (~1 days)" 
##                            Median                              Mean 
##                "86400s (~1 days)" "919727.813712807s (~1.52 weeks)" 
##                           3rd Qu.                              Max. 
##               "172800s (~2 days)"         "62899200s (~1.99 years)"

  1. https://data.cccnewyork.org/data/bar/1371/student-economic-need-index#1371/a/1/1622/127↩︎