library(tidyverse)
library(dplyr)
library(lemon)
library(rmarkdown)
library(stringr)
library(lubridate)
library(ggpubr)
library(ggplot2)
knit_print.data.frame <- lemon_print
For project 3, I selected three data sets from our discussion boards. I will play with problems given by Benson, Tora, and Jay.
For Benson’s data, it is based on the reviews on a international ramens on the market. His discussion posts ask to analyze the favorite favor, best brand, ramen style.
ramen<-read.csv("ramen-ratings.csv",header = TRUE)
head(ramen)
| Review.. | Brand | Variety | Style | Country | Stars | Top.Ten |
|---|---|---|---|---|---|---|
| 2580 | New Touch | T’s Restaurant Tantanmen | Cup | Japan | 3.75 | |
| 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan-miao Noodles | Pack | Taiwan | 1 | |
| 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | |
| 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | |
| 2576 | Ching’s Secret | Singapore Curry | Pack | India | 3.75 | |
| 2575 | Samyang Foods | Kimchi song Song Ramen | Pack | South Korea | 4.75 |
For Flavor, we will based this on the top ten ranking system in the data set. We need to tidy this data, as the top tens have text inside with the numeral ranking.
On further inspection, I noticed the rankings are all from different years and not all the top tens are in the data set! I will cut down on the repetitiveness of the exercise and view 2016’s top ten; however, these steps will lead to the same answer.
top_r<-ramen%>%filter(grepl("[0-9]",ramen$Top.Ten))
tw16<-top_r%>%filter(grepl("(2016)",top_r$Top.Ten))
tw16$Top.Ten<-str_sub(tw16$Top.Ten,7,8)
tw16$Top.Ten <- as.numeric(tw16$Top.Ten)
tw16%>%arrange(Top.Ten)
| Review.. | Brand | Variety | Style | Country | Stars | Top.Ten |
|---|---|---|---|---|---|---|
| 1947 | Prima Taste | Singapore Laksa Wholegrain La Mian | Pack | Singapore | 5 | 1 |
| 1907 | Prima Taste | Singapore Curry Wholegrain La Mian | Pack | Singapore | 5 | 5 |
| 1689 | Wugudaochang | Tomato Beef Brisket Flavor Purple Potato Noodle | Pack | China | 5 | 7 |
| 1925 | Prima | Juzz’s Mee Creamy Chicken Flavour | Pack | Singapore | 5 | 8 |
| 1828 | Tseng Noodles | Scallion With Sichuan Pepper Flavor | Pack | Taiwan | 5 | 9 |
| 1964 | MAMA | Instant Noodles Coconut Milk Flavour | Pack | Myanmar | 5 | 10 |
For best brand, I will group all the listed brands together and calculate their average score. We can compare the top scored brands with those on the top ten list to see if these raved brands made the scoring board.
Stars is char vector and has some nulls. Let’s clean these records! After our cleaning, I can now sort the brands on the list
ramen<-na.omit(ramen)
ramen<-ramen%>%filter(!grepl("Unrated",Stars))
ramen$Stars <- as.numeric(ramen$Stars)
Brand_rank<-ramen%>%group_by(Brand)%>%summarise(avg=mean(Stars))
head(Brand_rank%>%arrange(desc(avg)))
| Brand | avg |
|---|---|
| ChoripDong | 5 |
| Daddy | 5 |
| Daifuku | 5 |
| Foodmon | 5 |
| Higashi | 5 |
| Jackpot Teriyaki | 5 |
paged_table(left_join(top_r,Brand_rank,by="Brand"))
| Review.. | Brand | Variety | Style | Country | Stars | Top.Ten | avg |
|---|---|---|---|---|---|---|---|
| 1964 | MAMA | Instant Noodles Coconut Milk Flavour | Pack | Myanmar | 5 | 2016 #10 | 3.898148 |
| 1947 | Prima Taste | Singapore Laksa Wholegrain La Mian | Pack | Singapore | 5 | 2016 #1 | 5.000000 |
| 1925 | Prima | Juzz’s Mee Creamy Chicken Flavour | Pack | Singapore | 5 | 2016 #8 | 5.000000 |
| 1907 | Prima Taste | Singapore Curry Wholegrain La Mian | Pack | Singapore | 5 | 2016 #5 | 5.000000 |
| 1828 | Tseng Noodles | Scallion With Sichuan Pepper Flavor | Pack | Taiwan | 5 | 2016 #9 | 4.821429 |
| 1689 | Wugudaochang | Tomato Beef Brisket Flavor Purple Potato Noodle | Pack | China | 5 | 2016 #7 | 4.125000 |
| 1638 | A-Sha Dry Noodle | Veggie Noodle Tomato Noodle With Vine Ripened Tomato Sauce | Pack | Taiwan | 5 | 2015 #10 | 4.067308 |
| 1617 | MyKuali | Penang Hokkien Prawn Noodle (New Improved Taste) | Pack | Malaysia | 5 | 2015 #7 | 4.947917 |
| 1585 | CarJEN | Nyonya Curry Laksa | Pack | Malaysia | 5 | 2015 #4 | 4.928571 |
| 1521 | Maruchan | Gotsumori Sauce Yakisoba | Tray | Japan | 5 | 2015 #9 | 3.554276 |
| 1508 | Mamee | Chef Gold Recipe Mi Kari Seribu Rasa | Pack | Malaysia | 5 | 2015 #6 | 3.939655 |
| 1500 | MyKuali | Penang Red Tom Yum Goong Noodle | Pack | Malaysia | 5 | 2015 #1 | 4.947917 |
| 1471 | Mama | Instant Noodles Shrimp Creamy Tom Yum Flavour Jumbo Pack | Pack | Thailand | 5 | 2013 #10 | 3.628873 |
| 1466 | Mama | Oriental Style Instant Noodles Green Curry Flavour Jumbo Pack | Pack | Thailand | 5 | 2015 #8 | 3.628873 |
| 1350 | Mamee | Chef Curry Laksa Flavour | Pack | Malaysia | 5 | 2014 #7 | 3.939655 |
| 1330 | Sapporo Ichiban | Otafuku Okonomi Sauce Yakisoba | Tray | Japan | 5 | 2014 #4 | 3.830000 |
| 1308 | Nongshim | Soon Veggie Noodle Soup | Pack | South Korea | 5 | 2014 #9 | 4.000000 |
| 1302 | Mama | Instant Noodles Yentafo Tom Yum Mohfai Flavour | Pack | Thailand | 5 | 2014 #10 | 3.628873 |
| 1271 | Prima Taste | Singapore Chilli Crab La Mian | Pack | Singapore | 5 | 2014 #8 | 5.000000 |
| 1198 | Samyang Foods | Maesaengyitangmyun Baked Noodle | Pack | South Korea | 5 | 2014 #5 | 4.068627 |
| 1183 | Paldo | Cheese Noodle | Pack | South Korea | 5 | 2014 #6 | 4.018939 |
| 1087 | MyKuali | Penang White Curry Noodle | Pack | Malaysia | 5 | 2014 #1 | 4.947917 |
| 992 | Prima Taste | Singapore Laksa La Mian | Pack | Singapore | 5 | 2013 #1 | 5.000000 |
| 991 | Prima Taste | Singapore Curry La Mian | Pack | Singapore | 5 | 2013 #2 | 5.000000 |
| 934 | Nongshim | Jinjja Jinjja Flamin’ Hot & Nutty | Pack | USA | 5 | 2013 #4 | 4.000000 |
| 823 | Paldo | Kokomen Spicy Chicken | Pack | South Korea | 5 | 2013 #9 | 4.018939 |
| 715 | Indomie | Mi Goreng Rendang (Import) | Pack | Indonesia | 5 | 2013 #3 | 4.070755 |
| 608 | Koka | Spicy Black Pepper | Pack | Singapore | 5 | 2012 #10 | 3.750000 |
| 578 | Nongshim | Shin Ramyun Black | Pack | South Korea | 4.75 | 2012 #7 | 4.000000 |
| 434 | Mi Sedaap | Kari Spesial | Pack | Indonesia | 4.5 | 2012 #5 | 4.145833 |
| 392 | Nissin | Yakisoba Noodles Karashi | Tray | Japan | 5 | 2012 #3 | 3.918570 |
| 391 | Myojo | Hyoubanya No Chukasoba Oriental | Pack | Japan | 4.25 | 2012 #6 | 3.801587 |
| 285 | Doll | Artificial Chicken | Pack | Hong Kong | 4.5 | 2012 #9 | 3.296875 |
| 105 | Indomie | Special Fried Curly Noodle | Pack | Indonesia | 5 | 2012 #1 | 4.070755 |
| 47 | Indomie | Mi Goreng Jumbo Barbecue Chicken | Pack | Indonesia | 5 | 2012 #2 | 4.070755 |
| 31 | Myojo | Ippeichan Yakisoba | Tray | Japan | 4 | 2013 #6 | 3.801587 |
| 13 | Sapporo Ichiban | Chow Mein | Pack | Japan | 5 | 2012 #4 | 3.830000 |
I want to group the data set by the different types of ramen style and see which type has the greatest average in Stars.
R_style<-ramen%>%group_by(Style)%>%summarise(avg=mean(Stars))
paged_table(R_style%>%arrange(desc(avg)))
| Style | avg |
|---|---|
| Bar | 5.000000 |
| Box | 4.291667 |
| Pack | 3.700458 |
| Bowl | 3.670686 |
| Tray | 3.545139 |
| Can | 3.500000 |
| Cup | 3.498500 |
| 3.375000 |
Tora’s data set is based on the school’s student learning process compared to the student’s performance. Tora’s discussion asked for the analysis: average student attendance (in-person) to teacher’s number of years of experience, movement of student with disabilities to less restrictive environments vs Race of student, Economic needs index vs Number of Remote Learning Days.
school<-read.csv("2020_2021_School_Quality_Reports_.csv",header = TRUE)
I want to see the school name, school type , enrollment, teacher of more than three years and the metric value of student attendance in person.
The data is missing values noted by the “No Data” tag. For a cleaner view, let’s apply a filter. If a school’s teachers experience is above 90%, the student’s attendance leans above 90% as well. There are outliers where the teachers’ experience is below 40% with high student attendance, but we need more data points and samples to confirm if there is a correlation. It does seems promising the teacher’s YOE affects students for in person learning.
year_att<-school%>%select(c('school_name','school_type','enrollment','Teach_3_more_exp','val_attendance_inperson_k3_all'))
year_att<-year_att%>%filter(!grepl("No Data",Teach_3_more_exp))
year_att<-year_att%>%filter(!grepl("No Data",val_attendance_inperson_k3_all))
For the second analysis, we want to see if there’s a correlation between the race of a student and their move towards less restrictive environments.
In my analysis, There is a stronger pattern of Hispanic student population in these high percentages in less restrictive environments compared to other races. For example, there are more black disabled students moved to less restrictive places when the majority of the black student population less than ~30%.
dis_race<-school%>%select(c('school_name','ethnicity_asian_pct','ethnicity_black_pct','ethnicity_hispanic_pct','ethnicity_amerindian_pct','ethnicity_pacific_pct','ethnicity_white_pct','val_lre_all'))
dis_race<-dis_race%>%filter(!grepl("No Data",val_lre_all))
paged_table(dis_race%>%arrange(desc(val_lre_all)))
his_dis<-ggplot(dis_race,aes(ethnicity_hispanic_pct,val_lre_all))+geom_point()+labs(x="% of hispanic student pop.",y="% less restrictive")
blk_dis<-ggplot(dis_race,aes(ethnicity_black_pct,val_lre_all))+geom_point()+labs(x="% of black student pop.",y="% less restrictive")
ggarrange(his_dis,blk_dis,ncol=2)
For context, the economic needs index is the average of students who family’s income is at or below the poverty level. It ranks at a scale 0.1-1.0; where 1.0 means the student qualifies for economical support1.
There is not a large standard deviation between the average attendance and the remote attendance across on the school. For the economic needs compared to remote attendance, there is a slight downwards trend of average attendance in remote learning. There could be multiple reasons, as children may substitute for child care or work through school hours in remote settings.
eni_remote<-school%>%select(c('school_name','eni_pct','n_attendance_k3_all','n_attendance_remote_k3_all','val_attendance_remote_k3_all'))
r_att<-ggplot(eni_remote,aes(n_attendance_k3_all,n_attendance_remote_k3_all))+geom_point()+labs(x="Avg s. attendance",y="Avg s. remote attendence")
r_enc<-ggplot(eni_remote,aes(eni_pct,val_attendance_remote_k3_all))+geom_point()+labs(x="Econmic needs index", y="Avg s. remote attendance")
ggarrange(r_att,r_enc,nrow=2)
The data set is data from the CDC about Covid cases over state over time. Jay’s discussion post asked for the following analysis: Compare the seasons to confirmed and probable cases and deaths, compare the states to confirmed/probable cases and death, and review time gap between created_at and submission dates.
cvid_cases<-read.csv("COVID-19_Cases_and_Deaths.csv",header = TRUE)
For this exercise, I do not want rows with NA in these columns. I created a duplicate versions with NAs omitted. After NAs are cleared, I need the submission date column to be recognized as date data in order to use the month function.
For confirmed cases to probable cases, the winter months Dec-Feb were the most infectious season. There appears more probable cases than confirmed; the confirmed cases increased each month leading to January.
cvid_cases$submission_date<-as.Date(cvid_cases$submission_date,format='%m/%d/%Y')
cc<-cvid_cases
cc<-cc%>%mutate(m=month(submission_date,label=TRUE))
cc<-cc%>%drop_na()
paged_table(cc%>%group_by(m)%>%summarise(con_count=max(conf_cases),prob_count=max(prob_cases)))
For confirmed to probable deaths, they mirror the same response. The winter months were the most deadly.
paged_table(cc%>%group_by(m)%>%summarise(con_count=max(conf_death),prob_count=max(prob_death)))
For confirmed cases to probable cases, California leads first with 4,640,489 confirmed cases! For probable cases, Ohio leads first with 597,447 cases.
paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_cases),prob_count=max(prob_cases))%>%arrange(desc(con_count)))
paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_cases),prob_count=max(prob_cases))%>%arrange(desc(prob_count)))
For confirmed deaths, it goes to California again with 71,408 deaths. Tennesse counts for the largest probable deaths from Covid with 6,355 deaths.
paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_death),prob_count=max(prob_death))%>%arrange(desc(con_count)))
paged_table(cc%>%group_by(state)%>%summarise(con_count=max(conf_death),prob_count=max(prob_death))%>%arrange(desc(prob_count)))
For our final problem, we look at the % difference between created_at and submission_date time by state. For this problem, let us transform the created_at class into the date class. Then, I will use the interval function to find the difference in the submission date to the created by date and calculated the duration.
Lets review the duration summary. The average time difference between the two dates is 1 day.
cvid_cases$created_at<-as.Date(cvid_cases$created_at,format='%m/%d/%Y')
cc_date<-cvid_cases%>%select(c('submission_date','created_at'))
cc_date<-cc_date%>%mutate(dif=as.duration(interval(submission_date,created_at)))
paged_table(cc_date)
summary(cc_date$dif)
## Min. 1st Qu.
## "-172800s (~-2 days)" "86400s (~1 days)"
## Median Mean
## "86400s (~1 days)" "919727.813712807s (~1.52 weeks)"
## 3rd Qu. Max.
## "172800s (~2 days)" "62899200s (~1.99 years)"