Project

library(tidyverse)
library(dplyr)
library(lemon)
library(rmarkdown)

library(stringr)
library(lubridate)
library(ggpubr)
library(ggplot2)
knit_print.data.frame <- lemon_print

Intro to wide data set collection

For project 3, I selected three data sets from our discussion boards. I will play with problems given by Benson, Tora, and Jay.

Benson’s Exercise| Ramen shops

For Benson’s data, it is based on the reviews on a international ramens on the market. His discussion posts ask to analyze the favorite favor, best brand, ramen style.

ramen<-read.csv("ramen-ratings.csv",header = TRUE)
head(ramen)

Review..	Brand	Variety	Style	Country	Stars
2580	New Touch	T’s Restaurant Tantanmen	Cup	Japan	3.75
2579	Just Way	Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles	Pack	Taiwan	1
2578	Nissin	Cup Noodles Chicken Vegetable	Cup	USA	2.25
2577	Wei Lih	GGE Ramen Snack Tomato Flavor	Pack	Taiwan	2.75
2576	Ching’s Secret	Singapore Curry	Pack	India	3.75
2575	Samyang Foods	Kimchi song Song Ramen	Pack	South Korea	4.75

Favorite favor

For Flavor, we will based this on the top ten ranking system in the data set. We need to tidy this data, as the top tens have text inside with the numeral ranking.

On further inspection, I noticed the rankings are all from different years and not all the top tens are in the data set! I will cut down on the repetitiveness of the exercise and view 2016’s top ten; however, these steps will lead to the same answer.

top_r<-ramen%>%filter(grepl("[0-9]",ramen$Top.Ten))
tw16<-top_r%>%filter(grepl("(2016)",top_r$Top.Ten))

tw16$Top.Ten<-str_sub(tw16$Top.Ten,7,8)
tw16$Top.Ten <- as.numeric(tw16$Top.Ten)
tw16%>%arrange(Top.Ten)

Review..	Brand	Variety	Style	Country	Stars	Top.Ten
1947	Prima Taste	Singapore Laksa Wholegrain La Mian	Pack	Singapore	5	1
1907	Prima Taste	Singapore Curry Wholegrain La Mian	Pack	Singapore	5	5
1689	Wugudaochang	Tomato Beef Brisket Flavor Purple Potato Noodle	Pack	China	5	7
1925	Prima	Juzz’s Mee Creamy Chicken Flavour	Pack	Singapore	5	8
1828	Tseng Noodles	Scallion With Sichuan Pepper Flavor	Pack	Taiwan	5	9
1964	MAMA	Instant Noodles Coconut Milk Flavour	Pack	Myanmar	5	10

Best Brand

For best brand, I will group all the listed brands together and calculate their average score. We can compare the top scored brands with those on the top ten list to see if these raved brands made the scoring board.

Stars is char vector and has some nulls. Let’s clean these records! After our cleaning, I can now sort the brands on the list

ramen<-na.omit(ramen)
ramen<-ramen%>%filter(!grepl("Unrated",Stars))
ramen$Stars <- as.numeric(ramen$Stars)
Brand_rank<-ramen%>%group_by(Brand)%>%summarise(avg=mean(Stars))
head(Brand_rank%>%arrange(desc(avg)))

Brand	avg
ChoripDong	5
Daddy	5
Daifuku	5
Foodmon	5
Higashi	5
Jackpot Teriyaki	5

paged_table(left_join(top_r,Brand_rank,by="Brand"))

Review..	Brand	Variety	Style	Country	Stars	Top.Ten	avg
1964	MAMA	Instant Noodles Coconut Milk Flavour	Pack	Myanmar	5	2016 #10	3.898148
1947	Prima Taste	Singapore Laksa Wholegrain La Mian	Pack	Singapore	5	2016 #1	5.000000
1925	Prima	Juzz’s Mee Creamy Chicken Flavour	Pack	Singapore	5	2016 #8	5.000000
1907	Prima Taste	Singapore Curry Wholegrain La Mian	Pack	Singapore	5	2016 #5	5.000000
1828	Tseng Noodles	Scallion With Sichuan Pepper Flavor	Pack	Taiwan	5	2016 #9	4.821429
1689	Wugudaochang	Tomato Beef Brisket Flavor Purple Potato Noodle	Pack	China	5	2016 #7	4.125000
1638	A-Sha Dry Noodle	Veggie Noodle Tomato Noodle With Vine Ripened Tomato Sauce	Pack	Taiwan	5	2015 #10	4.067308
1617	MyKuali	Penang Hokkien Prawn Noodle (New Improved Taste)	Pack	Malaysia	5	2015 #7	4.947917
1585	CarJEN	Nyonya Curry Laksa	Pack	Malaysia	5	2015 #4	4.928571
1521	Maruchan	Gotsumori Sauce Yakisoba	Tray	Japan	5	2015 #9	3.554276
1508	Mamee	Chef Gold Recipe Mi Kari Seribu Rasa	Pack	Malaysia	5	2015 #6	3.939655
1500	MyKuali	Penang Red Tom Yum Goong Noodle	Pack	Malaysia	5	2015 #1	4.947917
1471	Mama	Instant Noodles Shrimp Creamy Tom Yum Flavour Jumbo Pack	Pack	Thailand	5	2013 #10	3.628873
1466	Mama	Oriental Style Instant Noodles Green Curry Flavour Jumbo Pack	Pack	Thailand	5	2015 #8	3.628873
1350	Mamee	Chef Curry Laksa Flavour	Pack	Malaysia	5	2014 #7	3.939655
1330	Sapporo Ichiban	Otafuku Okonomi Sauce Yakisoba	Tray	Japan	5	2014 #4	3.830000
1308	Nongshim	Soon Veggie Noodle Soup	Pack	South Korea	5	2014 #9	4.000000
1302	Mama	Instant Noodles Yentafo Tom Yum Mohfai Flavour	Pack	Thailand	5	2014 #10	3.628873
1271	Prima Taste	Singapore Chilli Crab La Mian	Pack	Singapore	5	2014 #8	5.000000
1198	Samyang Foods	Maesaengyitangmyun Baked Noodle	Pack	South Korea	5	2014 #5	4.068627
1183	Paldo	Cheese Noodle	Pack	South Korea	5	2014 #6	4.018939
1087	MyKuali	Penang White Curry Noodle	Pack	Malaysia	5	2014 #1	4.947917
992	Prima Taste	Singapore Laksa La Mian	Pack	Singapore	5	2013 #1	5.000000
991	Prima Taste	Singapore Curry La Mian	Pack	Singapore	5	2013 #2	5.000000
934	Nongshim	Jinjja Jinjja Flamin’ Hot & Nutty	Pack	USA	5	2013 #4	4.000000
823	Paldo	Kokomen Spicy Chicken	Pack	South Korea	5	2013 #9	4.018939
715	Indomie	Mi Goreng Rendang (Import)	Pack	Indonesia	5	2013 #3	4.070755
608	Koka	Spicy Black Pepper	Pack	Singapore	5	2012 #10	3.750000
578	Nongshim	Shin Ramyun Black	Pack	South Korea	4.75	2012 #7	4.000000
434	Mi Sedaap	Kari Spesial	Pack	Indonesia	4.5	2012 #5	4.145833
392	Nissin	Yakisoba Noodles Karashi	Tray	Japan	5	2012 #3	3.918570
391	Myojo	Hyoubanya No Chukasoba Oriental	Pack	Japan	4.25	2012 #6	3.801587
285	Doll	Artificial Chicken	Pack	Hong Kong	4.5	2012 #9	3.296875
105	Indomie	Special Fried Curly Noodle	Pack	Indonesia	5	2012 #1	4.070755
47	Indomie	Mi Goreng Jumbo Barbecue Chicken	Pack	Indonesia	5	2012 #2	4.070755
31	Myojo	Ippeichan Yakisoba	Tray	Japan	4	2013 #6	3.801587
13	Sapporo Ichiban	Chow Mein	Pack	Japan	5	2012 #4	3.830000

Ramen style

I want to group the data set by the different types of ramen style and see which type has the greatest average in Stars.

R_style<-ramen%>%group_by(Style)%>%summarise(avg=mean(Stars))
paged_table(R_style%>%arrange(desc(avg)))

Style	avg
Bar	5.000000
Box	4.291667
Pack	3.700458
Bowl	3.670686
Tray	3.545139
Can	3.500000
Cup	3.498500
	3.375000

Tora’s exercise| School Quality Reports

Tora’s data set is based on the school’s student learning process compared to the student’s performance. Tora’s discussion asked for the analysis: average student attendance (in-person) to teacher’s number of years of experience, movement of student with disabilities to less restrictive environments vs Race of student, Economic needs index vs Number of Remote Learning Days.

school<-read.csv("2020_2021_School_Quality_Reports_.csv",header = TRUE)

Avg student attendance s teacher’s # of YOE

I want to see the school name, school type , enrollment, teacher of more than three years and the metric value of student attendance in person.

The data is missing values noted by the “No Data” tag. For a cleaner view, let’s apply a filter. If a school’s teachers experience is above 90%, the student’s attendance leans above 90% as well. There are outliers where the teachers’ experience is below 40% with high student attendance, but we need more data points and samples to confirm if there is a correlation. It does seems promising the teacher’s YOE affects students for in person learning.

year_att<-school%>%select(c('school_name','school_type','enrollment','Teach_3_more_exp','val_attendance_inperson_k3_all'))
year_att<-year_att%>%filter(!grepl("No Data",Teach_3_more_exp))
year_att<-year_att%>%filter(!grepl("No Data",val_attendance_inperson_k3_all))

Mvmt of student with disabilities to less restrictive environments vs Race of student

For the second analysis, we want to see if there’s a correlation between the race of a student and their move towards less restrictive environments.

In my analysis, There is a stronger pattern of Hispanic student population in these high percentages in less restrictive environments compared to other races. For example, there are more black disabled students moved to less restrictive places when the majority of the black student population less than ~30%.

dis_race<-school%>%select(c('school_name','ethnicity_asian_pct','ethnicity_black_pct','ethnicity_hispanic_pct','ethnicity_amerindian_pct','ethnicity_pacific_pct','ethnicity_white_pct','val_lre_all'))

dis_race<-dis_race%>%filter(!grepl("No Data",val_lre_all))
paged_table(dis_race%>%arrange(desc(val_lre_all)))

his_dis<-ggplot(dis_race,aes(ethnicity_hispanic_pct,val_lre_all))+geom_point()+labs(x="% of hispanic student pop.",y="% less restrictive")
blk_dis<-ggplot(dis_race,aes(ethnicity_black_pct,val_lre_all))+geom_point()+labs(x="% of black student pop.",y="% less restrictive")
ggarrange(his_dis,blk_dis,ncol=2)

Economic needs index vs Number of Remote Learning Days

For context, the economic needs index is the average of students who family’s income is at or below the poverty level. It ranks at a scale 0.1-1.0; where 1.0 means the student qualifies for economical support¹.

There is not a large standard deviation between the average attendance and the remote attendance across on the school. For the economic needs compared to remote attendance, there is a slight downwards trend of average attendance in remote learning. There could be multiple reasons, as children may substitute for child care or work through school hours in remote settings.

eni_remote<-school%>%select(c('school_name','eni_pct','n_attendance_k3_all','n_attendance_remote_k3_all','val_attendance_remote_k3_all'))

r_att<-ggplot(eni_remote,aes(n_attendance_k3_all,n_attendance_remote_k3_all))+geom_point()+labs(x="Avg s. attendance",y="Avg s. remote attendence")
 
r_enc<-ggplot(eni_remote,aes(eni_pct,val_attendance_remote_k3_all))+geom_point()+labs(x="Econmic needs index", y="Avg s. remote attendance")
 
ggarrange(r_att,r_enc,nrow=2)

Jay’s Exercise| Covid Cases

The data set is data from the CDC about Covid cases over state over time. Jay’s discussion post asked for the following analysis: Compare the seasons to confirmed and probable cases and deaths, compare the states to confirmed/probable cases and death, and review time gap between created_at and submission dates.

cvid_cases<-read.csv("COVID-19_Cases_and_Deaths.csv",header = TRUE)

Comparing probable vs confirmed cases/deaths by seasons

For this exercise, I do not want rows with NA in these columns. I created a duplicate versions with NAs omitted. After NAs are cleared, I need the submission date column to be recognized as date data in order to use the month function.

For confirmed cases to probable cases, the winter months Dec-Feb were the most infectious season. There appears more probable cases than confirmed; the confirmed cases increased each month leading to January.

cvid_cases$submission_date<-as.Date(cvid_cases$submission_date,format='%m/%d/%Y')

cc<-cvid_cases

cc<-cc%>%mutate(m=month(submission_date,label=TRUE))


cc<-cc%>%drop_na()

paged_table(cc%>%group_by(m)%>%summarise(con_count=max(conf_cases),prob_count=max(prob_cases)))

For confirmed to probable deaths, they mirror the same response. The winter months were the most deadly.

paged_table(cc%>%group_by(m)%>%summarise(con_count=max(conf_death),prob_count=max(prob_death)))

Comparing probable vs confirmed cases/deaths by state

For confirmed cases to probable cases, California leads first with 4,640,489 confirmed cases! For probable cases, Ohio leads first with 597,447 cases.

paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_cases),prob_count=max(prob_cases))%>%arrange(desc(con_count)))

paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_cases),prob_count=max(prob_cases))%>%arrange(desc(prob_count)))

For confirmed deaths, it goes to California again with 71,408 deaths. Tennesse counts for the largest probable deaths from Covid with 6,355 deaths.

paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_death),prob_count=max(prob_death))%>%arrange(desc(con_count)))

paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_death),prob_count=max(prob_death))%>%arrange(desc(prob_count)))

Compare the time gap between created_at and submission_date by state

For our final problem, we look at the % difference between created_at and submission_date time by state. For this problem, let us transform the created_at class into the date class. Then, I will use the interval function to find the difference in the submission date to the created by date and calculated the duration.

Lets review the duration summary. The average time difference between the two dates is 1 day.

cvid_cases$created_at<-as.Date(cvid_cases$created_at,format='%m/%d/%Y')
cc_date<-cvid_cases%>%select(c('submission_date','created_at'))
cc_date<-cc_date%>%mutate(dif=as.duration(interval(submission_date,created_at)))

paged_table(cc_date)

summary(cc_date$dif)

##                              Min.                           1st Qu. 
##             "-172800s (~-2 days)"                "86400s (~1 days)" 
##                            Median                              Mean 
##                "86400s (~1 days)" "919727.813712807s (~1.52 weeks)" 
##                           3rd Qu.                              Max. 
##               "172800s (~2 days)"         "62899200s (~1.99 years)"

https://data.cccnewyork.org/data/bar/1371/student-economic-need-index#1371/a/1/1622/127 ↩︎

Project_2

Vyanna Hill

2022-03-06