Tidy Data

Part 1: San Francisco Mobile Food Truck Permits

Are there really taco trucks on every corner?

After reading in the data and converting empty strings to NA, I remove the permits that are expired or suspended so that only active permitted trucks and carts remain. I then split the food items into separate rows in order to analyze what foods are being served. I also excluded "cold truck" and "hot truck" from the food items listed, as they aren’t actual food types.

library(tidyverse)

foodtruck_data <- read_csv("Mobile_Food_Facility_Permit.csv", col_names = TRUE, na= c("", "NA"), trim_ws = TRUE)

foodtruck_long <- 
    foodtruck_data %>%
    select(Applicant, FacilityType, Status, FoodItems) %>%
    filter(Status != "EXPIRED") %>%
    filter(Status != "SUSPEND") %>%
    drop_na() %>%
    separate_rows(FoodItems ,sep=":|;|\\.") %>%
    mutate(FoodItems = tolower(trimws(FoodItems))) %>%
    filter(FoodItems != "cold truck") %>%
    filter(FoodItems != "hot truck") %>%
    filter(FoodItems != "")
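To make the splitting step easier to follow, here is a minimal sketch of the same idea on a toy table. The two records and the name toy_permits are made up for illustration and are not taken from the permit file; only the separator pattern and the cleanup steps mirror the pipeline above.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for two permit records (not real data)
toy_permits <- tibble(
  Applicant = c("Truck A", "Cart B"),
  FoodItems = c("Cold Truck: Tacos; Burritos. Chips", "Hot dogs: Candy")
)

toy_long <- toy_permits %>%
  # one row per food item, splitting on colons, semicolons, or periods
  separate_rows(FoodItems, sep = ":|;|\\.") %>%
  # normalize case and whitespace so identical items match
  mutate(FoodItems = tolower(trimws(FoodItems))) %>%
  # drop the truck-type labels and any empty leftovers
  filter(!FoodItems %in% c("cold truck", "hot truck", ""))

toy_long
```

Each applicant now appears once per food item, which is what allows count() to tally items later on.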

How many food trucks serve tacos in San Francisco?

foodtruck_long %>% filter(FacilityType == "Truck") %>% filter( grepl("taco", FoodItems) ) %>% count()

## # A tibble: 1 x 1
##       n
##   <int>
## 1    80
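One caveat: count() tallies matching rows, so an applicant that lists both "tacos" and "fish tacos", or that holds several permits, is counted more than once. The sketch below uses a made-up toy table (toy_long here is hypothetical, not the real data) to show how the row count and the number of distinct applicants can differ.

```r
library(dplyr)

# Hypothetical long-format data: one row per applicant and food item
toy_long <- tibble(
  Applicant    = c("Truck A", "Truck A", "Truck A", "Truck B", "Cart C"),
  FacilityType = c("Truck", "Truck", "Truck", "Truck", "Push Cart"),
  FoodItems    = c("tacos", "fish tacos", "burritos", "tacos", "tacos")
)

taco_trucks <- toy_long %>%
  filter(FacilityType == "Truck", grepl("taco", FoodItems))

n_rows       <- nrow(taco_trucks)                  # rows mentioning tacos
n_applicants <- n_distinct(taco_trucks$Applicant)  # distinct applicants
```

In this toy example n_rows is 3 while n_applicants is 2, because Truck A matches twice.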

foodtruck_long %>% 
    count(FoodItems) %>%
    top_n(10, n) %>%
    arrange(desc(n))

## # A tibble: 10 x 2
##    FoodItems               n
##    <chr>               <int>
##  1 sandwiches            151
##  2 candy                 148
##  3 snacks                117
##  4 burritos              109
##  5 hot dogs              106
##  6 chips                 105
##  7 water                  82
##  8 coffee                 80
##  9 pre-packaged snacks    78
## 10 tacos                  76

Only 80 truck entries list tacos, so taco trucks can’t possibly be on every corner of San Francisco. Sandwiches, candy, snacks and even burritos are all listed more often than tacos!

Part 2: New York City Restaurant Inspections

Which cuisine gets the most critical violations proportionally?

Begin by reading in the data and selecting the columns needed to look at cuisine and critical violations. I then rename the columns to remove spaces, remove cuisine types such as "Not Listed/Not Applicable", and spread the dataset to show the proportion of critical violations by cuisine type, arranged by percentage.

NYC_Restaurant_Inspection_Results <- read_csv("DOHMH_New_York_City_Restaurant_Inspection_Results.csv")

NYC_Restaurant_Health <- 
    NYC_Restaurant_Inspection_Results %>%
    select("DBA", "BORO", "CUISINE DESCRIPTION", "CRITICAL FLAG")

colnames(NYC_Restaurant_Health) <- c("DBA", "BORO", "CUISINE", "CRITICALITY")

NYC_Restaurant_Health_Clean <-
    NYC_Restaurant_Health %>%
    filter(CRITICALITY != "Not Applicable") %>%
    filter(CUISINE != "Not Listed/Not Applicable") %>%
    count(CUISINE, CRITICALITY) %>%
    mutate(CRITICALITY = ifelse(CRITICALITY=="Critical","Critical","NotCritical")) %>%
    spread(CRITICALITY, n) %>%
    mutate(PERCENTCRITICAL = Critical / (Critical + `NotCritical`)) %>%
    arrange(desc(PERCENTCRITICAL))
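As a small illustration of the count-then-widen step, here is a sketch on invented data. The cuisines and counts are made up. It uses pivot_wider(), the newer tidyr equivalent of spread(), with a fill value of 0, on the assumption that a cuisine with no violations in one category should get a proportion rather than NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical inspection rows (not real data)
toy_inspections <- tibble(
  CUISINE     = c("Thai", "Thai", "Thai", "Bagels", "Bagels", "Donuts"),
  CRITICALITY = c("Critical", "Critical", "Not Critical",
                  "Critical", "Not Critical", "Not Critical")
)

toy_rates <- toy_inspections %>%
  # collapse the flag to names that are valid column names
  mutate(CRITICALITY = ifelse(CRITICALITY == "Critical", "Critical", "NotCritical")) %>%
  count(CUISINE, CRITICALITY) %>%
  # one column per flag; missing combinations become 0 instead of NA
  pivot_wider(names_from = CRITICALITY, values_from = n,
              values_fill = list(n = 0L)) %>%
  mutate(PERCENTCRITICAL = Critical / (Critical + NotCritical)) %>%
  arrange(desc(PERCENTCRITICAL))

toy_rates
```

Without the fill value, "Donuts" would have NA in the Critical column and therefore NA for PERCENTCRITICAL.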

What are the top 10 cuisines with critical violations?

Select the top 10 cuisines with the highest critical percentages for violations.

NYC_Restaurant_Health_Clean %>% top_n(10, PERCENTCRITICAL)

## # A tibble: 10 x 4
##    CUISINE                       Critical NotCritical PERCENTCRITICAL
##    <chr>                            <int>       <int>           <dbl>
##  1 Creole/Cajun                        72          38           0.655
##  2 Bangladeshi                        624         361           0.634
##  3 Californian                         32          19           0.627
##  4 Creole                             340         213           0.615
##  5 Vietnamese/Cambodian/Malaysia      992         642           0.607
##  6 Armenian                           223         148           0.601
##  7 English                            129          86           0.6  
##  8 Chinese/Cuban                      165         112           0.596
##  9 Filipino                           397         271           0.594
## 10 Chinese/Japanese                   506         346           0.594

I think I would skip the Creole or Creole/Cajun and Bangladeshi cuisine restaurants!

And what exactly is Chinese/Cuban food?

What are the bottom 10 cuisines with critical violations?

Select the 10 cuisines which correspond to the lowest critical percentages for violations. Because top_n() keeps ties, 11 rows are returned: Fruits/Vegetables and Soups are tied at 0.5.

NYC_Restaurant_Health_Clean %>% top_n(-10, PERCENTCRITICAL)

## # A tibble: 11 x 4
##    CUISINE                         Critical NotCritical PERCENTCRITICAL
##    <chr>                              <int>       <int>           <dbl>
##  1 Fruits/Vegetables                     27          27           0.5  
##  2 Soups                                 21          21           0.5  
##  3 Soups & Sandwiches                   263         266           0.497
##  4 Hamburgers                          2286        2339           0.494
##  5 Afghan                                95         101           0.485
##  6 Salads                               387         417           0.481
##  7 Ice Cream, Gelato, Yogurt, Ices     1400        1512           0.481
##  8 Pancakes/Waffles                      94         112           0.456
##  9 Chilean                                7          10           0.412
## 10 Nuts/Confectionary                     8          13           0.381
## 11 Basque                                 1           4           0.2
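If exactly ten rows were wanted, one option (assuming dplyr 1.0.0 or later) is slice_min() with with_ties = FALSE. The sketch below compares the two behaviours on a made-up four-row table; the names and values are purely illustrative.

```r
library(dplyr)

# Hypothetical rates with a tie at 0.5 (not real data)
toy_rates <- tibble(
  CUISINE         = c("A", "B", "C", "D"),
  PERCENTCRITICAL = c(0.2, 0.5, 0.5, 0.6)
)

# top_n() keeps every row tied with the cutoff value
with_ties <- toy_rates %>% top_n(-2, PERCENTCRITICAL)

# slice_min() can be told to return exactly n rows
exactly_two <- toy_rates %>%
  slice_min(PERCENTCRITICAL, n = 2, with_ties = FALSE)
```

Here with_ties has three rows and exactly_two has two. Which tied row is dropped depends on row order, so keeping the ties, as the report does, is arguably the more honest choice.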

Ice cream looks pretty safe though.

Part 3: 30 Years of Simpsons

What are the most and least watched Simpsons Episodes?

What does viewership look like over the seasons?

Reading in the dataset and plotting viewership by episode across the whole series, with each point colored by the episode's position within its season.

simpsons_data <- read_csv("simpsons_episodes.csv")

ggplot(data = simpsons_data) +
    aes(x = number_in_series, y = us_viewers_in_millions, color = number_in_season) +
    geom_line() +
    scale_color_distiller(palette = "RdYlGn")

Viewers seem to tune in towards the end of each season, and viewership has declined significantly over the course of 30 years. However, there does seem to be a pickup in viewership between the 10th and 15th seasons.
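One way to check the season-level trend more directly would be to average viewership per season. This is only a sketch on invented numbers: the column names match the dataset, but the values are made up, and na.rm = TRUE is included on the assumption that some episodes might lack a viewer count.

```r
library(dplyr)

# Hypothetical episode data (not real viewership figures)
toy_episodes <- tibble(
  season                 = c(1, 1, 2, 2),
  us_viewers_in_millions = c(30, 32, 20, NA)
)

season_viewers <- toy_episodes %>%
  group_by(season) %>%
  # mean viewers per season, ignoring missing values
  summarise(mean_viewers = mean(us_viewers_in_millions, na.rm = TRUE))

season_viewers
```

The resulting table has one row per season and could be passed to ggplot() with geom_line() for a smoother view of the decline.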

What are the top 5 most viewed Simpsons Episodes?

Selecting the top 5 titles viewed and arranging in descending order by viewership.

simpsons_data %>% top_n(5, us_viewers_in_millions) %>% 
    select(title, us_viewers_in_millions, season, number_in_season) %>%
    arrange(desc(us_viewers_in_millions))

## # A tibble: 5 x 4
##   title                 us_viewers_in_millions season number_in_season
##   <chr>                                  <dbl>  <int>            <int>
## 1 "Bart Gets an \"F\""                    33.6      2                1
## 2 Life on the Fast Lane                   33.5      1                9
## 3 The Crepes of Wrath                     31.2      1               11
## 4 Krusty Gets Busted                      30.4      1               12
## 5 Homer's Night Out                       30.3      1               10

What are the bottom 5 least viewed Simpsons Episodes?

Selecting the bottom 5 titles viewed and arranging in descending order by viewership.

simpsons_data %>% top_n(-5, us_viewers_in_millions) %>% 
    select(title, us_viewers_in_millions, season, number_in_season) %>%
    arrange(desc(us_viewers_in_millions))

## # A tibble: 5 x 4
##   title                       us_viewers_in_milli… season number_in_season
##   <chr>                                      <dbl>  <int>            <int>
## 1 My Fare Lady                                2.67     26               14
## 2 How Lisa Got Her Marge Back                 2.55     27               18
## 3 Orange Is the New Yellow                    2.54     27               22
## 4 To Courier with Love                        2.52     27               20
## 5 The Burns Cage                              2.32     27               17

Data 607 Project 2

Stephanie Roark

10/7/2018