Project 3

1. Objective

Answer the question, “Which are the most valued data science skills?” Expand knowledge of data collecting, tidying, transformations, and exploratory data analysis in R.

2. Group Members

Daria Dubovskaia
Wilson Chau
Seung Min Song
Ted Kim

3. Collaboration Tools

We have chosen the most common collaboration tools to work on the project.
For the communication, the team will use a messaging program, Slack, https://slack.com/.
Project management will be organized via platform for software development and version control, GitHub, https://github.com/blacksmilez/DATA607/tree/main/Project3. Project progress will be managed by GitHub Board, https://github.com/users/blacksmilez/projects/2/views/1.
As well as a code sharing will also be organized via GitHub since team members can change the code and commit changes to the general file.
Since the assignment is including the work with databases, we have decided to use Microsoft Azure Database for MySQL Database, https://azure.microsoft.com. As a result, all teammates will be able to work together on the same database in real time and access the latest version of our database.
Project documentation and coding will be done by each team member via an integrated development environment for R, RStudio
Rpubs will be used for easy web publishing from R to present the work to the Professor, https://rpubs.com/seungm1nsong/956997.
The team chose Slack and GitHub as the main collaboration tools because all four members are comfortable using those two tools. Using unfamiliar tools will reduce efficiency and slow the project progress. Moreover, each member has a different background and the project period is relatively short. Therefore, we chose to use familiar tools, GitHub and Slack.

4. Data Source

DOHMH New York City Restaurant Inspection Results.
(DOHMH_New_York_City_Restaurant_Inspection_Results.csv 102MB)
The data will be downloaded as CSV file and loaded in R using read.csv() function. The source of the dataset: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
The dataset contains 239,000 observations of 27 variables.
The dataset contains every sustained or not yet adjudicated violation citation from every full or special program inspection conducted up to three years prior to the most recent inspection for restaurants and college cafeterias in an active status on the RECORD DATE (date of the data pull). When an inspection results in more than one violation, values for associated fields are repeated for each additional violation record. Establishments are uniquely identified by their CAMIS (record ID) number. Keep in mind that thousands of restaurants start business and go out of business every year; only restaurants in an active status are included in the dataset.
Records are also included for each restaurant that has applied for a permit but has not yet been inspected and for inspections resulting in no violations. Establishments with inspection date of 1/1/1900 are new establishments that have not yet received an inspection. Restaurants that received no violations are represented by a single row and coded as having no violations using the ACTION field.
Because this dataset is compiled from several large administrative data systems, it contains some illogical values that could be a result of data entry or transfer errors. Data may also be missing.
This dataset and the information on the Health Department’s Restaurant Grading website come from the same data source. The Health Department’s Restaurant Grading website is available via the link below: http://www1.nyc.gov/site/doh/services/restaurant-grades.page
The description of the columns is provided at the link below: https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j

Entity-Relationship diagram
A physical data model for the normalized database, and an Entity-Relationship (ER) diagram is available via team’s GitHub repository. On our ER diagram we have a few approaches to doing some analytical deliverable where we can inspect our dataset. One option is going through inspection type with and without the influence of inspection_code. This can give us some correlation where inspection code may have different types of inspection results. Another investigation road we can focus on would be the building and how it can influences some of the inspection results. These types of focuses can give us some insights on how New York Restaurant inspections are being graded and the types of influences that will affect the grading of the inspections for these NYC restaurants.

5. Load data, Azure database

5.1 Load csv file

We are going to read data from csv on NYC Open Data, it’s contained approx. 239,537 rows. However, sometimes there were connection issues.
The header of the csv file contains white spaces between characters. Therefore, when reading through read.csv(), the white space is changed to “.(dot)”. Replace “.(dot)” with “_(underscore)” for use in the database.

data <- read.csv(file = 'https://data.cityofnewyork.us/api/views/43nn-pn8j/rows.csv?accessType=DOWNLOAD&bom=true&format=true', header = TRUE)
names(data) <- str_replace_all(tolower(names(data)), '\\.', '_')
data <- data %>%
        mutate_at(c('inspection_date'), function(x) ifelse(x == '1900-01-01T00:00:00.000', '1970-01-01T00:00:01.000', x))
# Confirm the number of rows of data read.
nrow(data)

## [1] 237469

We will change the name of the first column as it contains unwanted characters.

sapply(data,class)

##       i_<U+00BF>camis                   dba                  boro 
##             "integer"           "character"           "character" 
##              building                street               zipcode 
##           "character"           "character"           "character" 
##                 phone   cuisine_description       inspection_date 
##           "character"           "character"           "character" 
##                action        violation_code violation_description 
##           "character"           "character"           "character" 
##         critical_flag                 score                 grade 
##           "character"             "integer"           "character" 
##            grade_date           record_date       inspection_type 
##           "character"           "character"           "character" 
##              latitude             longitude       community_board 
##             "numeric"             "numeric"             "integer" 
##      council_district          census_tract                   bin 
##             "integer"             "integer"             "integer" 
##                   bbl                   nta        location_point 
##             "numeric"           "character"             "logical"

colnames(data)[1] <- "camis"
colnames(data)

##  [1] "camis"                 "dba"                   "boro"                 
##  [4] "building"              "street"                "zipcode"              
##  [7] "phone"                 "cuisine_description"   "inspection_date"      
## [10] "action"                "violation_code"        "violation_description"
## [13] "critical_flag"         "score"                 "grade"                
## [16] "grade_date"            "record_date"           "inspection_type"      
## [19] "latitude"              "longitude"             "community_board"      
## [22] "council_district"      "census_tract"          "bin"                  
## [25] "bbl"                   "nta"                   "location_point"

5.2 Creating normalized tables for relational database.

We will start with creating a data frame to store the boro_code and boro_name. Since there is a code set by New York City, it is made manually without making it from the data read.

df_boro <- data.frame(boro_code = c('1', '2', '3', '4', '5'),
                      boro_name = c('Manhattan', 'Bronx', 'Brooklyn', 'Queens', 'Staten Island'))
df_boro

Next, create a data frame to store the grade_code and grade. Since there is a code set by New York City, it is made manually without making it from the data read.

# 
df_grade_code <- data.frame(grade_code = c('N', 'A', 'B', 'C', 'Z', 'P'),
                                 grade = c('Not Yet Graded', 'Grade A', 'Grade B', 'Grade C', 'Grade Pending', 
                                           'Grade Pending issued on re-opening following an initial inspection that resulted in a closure'))
df_grade_code

And after, create a data frame to store the critical_flag_code and critical_flag. Since there is a code set by New York City, it is made manually without making it from the data read.

df_critical_flag_code <- data.frame(critical_flag_code = c('1', '2', '3'),
                                    critical_flag = c('Critical', 'Not Critical', 'Not Applicable'))
df_critical_flag_code

We finished the work with typical codes, now we will created table to substitute long words and description with code names. The data frame below will store the cuisine_code and cuisine_description. Among the read data, data whose value in the cuisine_description column is not null or non-blank is selected. After that, make a sequence number using rowid_to_column() and add it to the cuisine_code column. The column cuisine_code may not have a duplicate value.

cuisine_descriptions = unique(data[!is.na(data$cuisine_description) & str_remove_all(data$cuisine_description, '\\s') != '',]$cuisine_description) 
df_cuisine_code <- data.frame(cuisine_description = sort(cuisine_descriptions))
df_cuisine_code <- tibble::rowid_to_column(df_cuisine_code, 'cuisine_code') %>%
                      mutate_at(c('cuisine_code'), function(x) str_sub(paste0('000', x), -3, -1))
head(df_cuisine_code, n=6)

The next data frame contains the violation_code and violation_code_5. Among the read data, data whose value in the violation_code column is not null or non-blank is selected. Update the name of the violation_code column to violation_code_5. After that, make a sequence number using rowid_to_column() and add it to the violation_code column again. The column violation_code may not have a duplicate value.

df_violation_code <- unique(data %>%
                              select(c(violation_code)) %>%
                              filter(!is.na(violation_code) & str_remove_all(violation_code, '\\s') != ''))
names(df_violation_code) <- paste0(names(df_violation_code), '_5')
df_violation_code <- tibble::rowid_to_column(df_violation_code, 'violation_code') %>%
                      mutate_at(c('violation_code'), function(x) str_sub(paste0('000', x), -3, -1))
df_violation_code

Instead of long description of each violation, we will create unique code for each violation and use it in the database and dataframe to store the violation_description_code and violation_description. Among the read data, data whose value in the violation_description column is not null is selected. After that, make a sequence number using rowid_to_column() and add it to the violation_description_code column. The column violation_descriptions may not have a duplicate value.

violation_descriptions = unique(data[!is.na(data$violation_description),]$violation_description) 
df_violation_description_code <- data.frame(violation_description = sort(violation_descriptions)) %>%
                                  tibble::rowid_to_column(., 'violation_description_code') %>%
                                  mutate_at(c('violation_description_code'), function(x) str_sub(paste0('000', x), -3, -1))
df_violation_description_code

Based on the results of inspection, some actions were taken at the end of each inspection. We will create a data frame to store the action_code and action description. Among the read data, data whose value in the action column is not null is selected. After that, make a sequence number using rowid_to_column() and add it to the action_code column. The column action may not have a duplicate value.

df_action_code <- unique(data %>%
                              select(c(action)) %>%
                              filter(!is.na(action))) %>%
                  tibble::rowid_to_column(., 'action_code') %>%
                  mutate_at(c('action_code'), function(x) str_sub(paste0('00', x), -2, -1))
df_action_code

Next data frame will contain type of each inspection and will store the inspection_code and inspection_type Among the read data, data whose value in the action column is not null or non-blank is selected. After that, make a sequence number using rowid_to_column() and add it to the inspection_code column. The column inspection_type may not have a duplicate value.

df_inspection_code <- unique(data %>%
                              select(c(inspection_type)) %>%
                              filter(!is.na(inspection_type) & str_remove_all(inspection_type, '\\s') != '')) %>%
                      tibble::rowid_to_column(., 'inspection_code') %>%
                      mutate_at(c('inspection_code'), function(x) str_sub(paste0('00', x), -2, -1))
head(df_inspection_code, n=3)

After, create a data frame to store the camis, inspection_date and inspection_code. Among the read data, data whose value in the inspection_type column is not null is selected. Get inspection_code from data frame df_inspection_code using left_join. After that, inspection_type is excluded from the data frame.

df_inspection_type <- data %>%
                        select(c(camis, inspection_date, inspection_type)) %>%
                        filter(!is.na(inspection_type)) %>%
                        left_join(., df_inspection_code, by = 'inspection_type') %>%
                        select(-c(inspection_type))
head(df_inspection_type, n=3)

The violation data frame contains the most necessary information about each violation detected during each inspection. It will store the camis, inspection_date, violation_code, violation_description_code and critical_flag_code. Among the read data, data whose value in the violation_code column is not null is selected. Fields related to other tables take key values from each data frame.

df_violation <- data %>%
                  select(c(camis, inspection_date, violation_code, violation_description, critical_flag)) %>%
                  filter(!is.na(violation_code)) %>%
                  left_join(., df_violation_code, by = c('violation_code' = 'violation_code_5')) %>%
                  left_join(., df_violation_description_code, by = 'violation_description') %>%
                  left_join(., df_critical_flag_code, by = 'critical_flag') %>%
                  select(-c(violation_code, violation_description, critical_flag))
names(df_violation) <- str_remove_all(names(df_violation), '\\.y')
df_violation

inspection_result data frame contains information about each inspection and consists of columns camis, inspection_date, action_code, score, grade_code, grade_date, record_date and cuisine_code. Fields related to other tables take key values from each data frame.

df_inspection_result <- data %>%
                          select(c(camis, inspection_date, action, score, grade,
                                   grade_date, record_date, cuisine_description)) %>%
                          left_join(., df_action_code, by = 'action') %>%
                          left_join(., df_cuisine_code, by = 'cuisine_description') %>%
                          select(-c(action, cuisine_description))
names(df_inspection_result) <- str_replace_all(names(df_inspection_result), '(?<=grade$)', '_code')
head(df_inspection_result, n=3)

The building data frame contain info about each building that was used for restaurants in the city. Bin should be a key value, but some data is missing. Therefore, a dataset with bin and datasets without bin are separated and processed, and then each dataset is merged. In addition, datasets without bin are divided into each boro to create temporary bin(boro_code+‘9’+‘00001’~). Also, since the building table is related to the dba table, it is separated into two data frames after working at once. Need to trim a little more

exclude_bin <- c('1000000', '2000000', '3000000', '4000000', '5000000')
df_w_bin <- unique(data %>%
                     select(c(bin, camis, dba, phone,
                              boro, building, street, zipcode, latitude, longitude,
                              community_board, council_district, census_tract, bbl, nta)) %>%
                     filter(!is.na(bin) & !(bin %in% exclude_bin)) %>%
                     mutate_at(c('latitude', 'longitude'), as.numeric))
df_w_bin$bin <- as.character(df_w_bin$bin)

# create temporary bin for Manhattan
df_wo_bin_1 <- unique(data %>%
                         filter(is.na(bin) & boro == 'Manhattan') %>%
                         select(c(camis, dba, phone,
                                  boro, building, street, zipcode, latitude, longitude,
                                  community_board, council_district, census_tract, bbl, nta)) %>%
                         mutate_at(c('latitude', 'longitude'), as.numeric)) %>%
                         tibble::rowid_to_column(., 'bin')
df_wo_bin_1$bin <- as.character(df_wo_bin_1$bin + 1900000)

# create temporary bin for Bronx
df_wo_bin_2 <- unique(data %>%
                         filter(is.na(bin) & boro == 'Bronx') %>%
                         select(c(camis, dba, phone,
                                  boro, building, street, zipcode, latitude, longitude,
                                  community_board, council_district, census_tract, bbl, nta)) %>%
                         mutate_at(c('latitude', 'longitude'), as.numeric)) %>%
                         tibble::rowid_to_column(., 'bin')
df_wo_bin_2$bin <- as.character(df_wo_bin_2$bin + 2900000)

# create temporary bin for Brooklyn
df_wo_bin_3 <- unique(data %>%
                         filter(is.na(bin) & boro == 'Brooklyn') %>%
                         select(c(camis, dba, phone,
                                  boro, building, street, zipcode, latitude, longitude,
                                  community_board, council_district, census_tract, bbl, nta)) %>%
                         mutate_at(c('latitude', 'longitude'), as.numeric)) %>%
                         tibble::rowid_to_column(., 'bin')
df_wo_bin_3$bin <- as.character(df_wo_bin_3$bin + 3900000)

# create temporary bin for Queens
df_wo_bin_4 <- unique(data %>%
                         filter(is.na(bin) & boro == 'Queens') %>%
                         select(c(camis, dba, phone,
                                  boro, building, street, zipcode, latitude, longitude,
                                  community_board, council_district, census_tract, bbl, nta)) %>%
                         mutate_at(c('latitude', 'longitude'), as.numeric)) %>%
                         tibble::rowid_to_column(., 'bin')
df_wo_bin_4$bin <- as.character(df_wo_bin_4$bin + 4900000)

# create temporary bin for Staten Island
df_wo_bin_5 <- unique(data %>%
                         filter(is.na(bin) & boro == 'Staten Island') %>%
                         select(c(camis, dba, phone,
                                  boro, building, street, zipcode, latitude, longitude,
                                  community_board, council_district, census_tract, bbl, nta)) %>%
                         mutate_at(c('latitude', 'longitude'), as.numeric)) %>%
                         tibble::rowid_to_column(., 'bin')
df_wo_bin_5$bin <- as.character(df_wo_bin_5$bin + 5900000)

# create temporary bin for others
df_wo_bin_9 <- unique(data %>%
                         filter(is.na(bin)) %>%
                         select(c(camis, dba, phone,
                                  boro, building, street, zipcode, latitude, longitude,
                                  community_board, council_district, census_tract, bbl, nta)) %>%
                         mutate_at(c('latitude', 'longitude'), as.numeric)) %>%
                         tibble::rowid_to_column(., 'bin')
df_wo_bin_9$bin <- as.character(df_wo_bin_9$bin + 9000000)
df_bin <- union(df_w_bin, df_wo_bin_1) %>%
             union(., df_wo_bin_2) %>%
             union(., df_wo_bin_3) %>%
             union(., df_wo_bin_4) %>%
             union(., df_wo_bin_5) %>%
             union(., df_wo_bin_9)
df_building <- unique(df_bin %>%
                        left_join(., df_boro, by = c('boro' = 'boro_name')) %>%
                        select(-c(camis, dba, phone, boro)))
head(df_building, n=3)

Dba data frame will contain info about name of the restaurant and address.

df_dba <- unique(df_bin %>%
                  select(-c(boro, building, street, zipcode, latitude, longitude,
                            community_board, council_district, census_tract, bbl, nta)))
head(df_dba, n=3)

5.3 Load tables to database

The group has used Azure database to store the tables, so all the group members could access it any time. As an alternative, you can store the tables on the local machine. (We had to comment the code for Azure j=for knitting purpose, otherwise it takes forever to knit the document)

#dbname <- 'data607'  # SECRET!!
#host <- 'sps-data-proj.mysql.database.azure.com'  # SECRET!!
#port <- 3306  # SECRET!!
#user <- 'data607user'  # SECRET!!
#password <- 'msds1234!'  # SECRET!!
#conn <- dbConnect(RMySQL::MySQL(),
                  #dbname=dbname,
                  #host=host,
                  #port=port,
                  #user=user,
                  #password=password)

#For local machine
db = dbConnect(MySQL(), user='root', password = '336261', dbname='delays', host='localhost')

We have created necessary sql tables to store the info from R data frames.

#tables <- c('boro', 'critical_flag_code', 'violation_description_code', 'violation_code',
#            'grade_code', 'action_code', 'inspection_code', 'cuisine_code', 'building',
#            'dba', 'inspection_result', 'violation', 'inspection_type')
#lapply(tables, function(x) dbExecute(conn, read_file(paste0('https://raw.githubusercontent.com/blacksmilez/',
#                                                           'DATA607/main/Project3/SQL/CreateTables/',
#                                                           x,
#                                                           '.sql'))))

Using dbWriteTable() function, we will send our R data frames to SQL tables.

#table_names <- c('tbl_boro', 'tbl_critical_flag_code', 'tbl_violation_description_code', 'tbl_violation_code',
#                 'tbl_grade_code', 'tbl_action_code', 'tbl_inspection_code', 'tbl_cuisine_code', 'tbl_building',
#                 'tbl_dba', 'tbl_inspection_result', 'tbl_inspection_type', 'tbl_violation')
#df <- list(df_boro, df_critical_flag_code, df_violation_description_code, df_violation_code,
#           df_grade_code, df_action_code, df_inspection_code, df_cuisine_code, df_building,
#           df_dba, df_inspection_result, df_inspection_type, df_violation)

#lapply(seq_along(table_names), function(i) dbWriteTable(conn, table_names[[i]], df[[i]], row.names=FALSE, append=TRUE))

6. Data preparation for analysis

6.1 Load data from the database.

Before starting the analysis, we need to load the data from the database and make it more convenient for analysis.
The list of tables in our remote database.

dbListTables(db)

##  [1] "airlines_status_cities"         "arrival"                       
##  [3] "tbl_action_code"                "tbl_boro"                      
##  [5] "tbl_building"                   "tbl_critical_flag_code"        
##  [7] "tbl_cuisine_code"               "tbl_dba"                       
##  [9] "tbl_grade_code"                 "tbl_inspection_code"           
## [11] "tbl_inspection_result"          "tbl_inspection_type"           
## [13] "tbl_violation"                  "tbl_violation_code"            
## [15] "tbl_violation_description_code"

Each table we will load to R data frame and disconnect from the database.

tbl_boro <- dbReadTable(db, 'tbl_boro', row.names = NULL)
tbl_critical_flag_code <- dbReadTable(db, 'tbl_critical_flag_code', row.names = NULL)
tbl_cuisine_code <- dbReadTable(db, 'tbl_cuisine_code', row.names = NULL)
tbl_grade_code <- dbReadTable(db, 'tbl_grade_code', row.names = NULL)
tbl_inspection_code <- dbReadTable(db, 'tbl_inspection_code', row.names = NULL)
tbl_inspection_type <- dbReadTable(db, 'tbl_inspection_type', row.names = NULL)
tbl_violation_code <- dbReadTable(db, 'tbl_violation_code', row.names = NULL)
tbl_violation_description_code <- dbReadTable(db, 'tbl_violation_description_code', row.names = NULL)
tbl_action_code <- dbReadTable(db, 'tbl_action_code', row.names = NULL)


tbl_building <- dbReadTable(db, 'tbl_building', row.names = NULL)
tbl_dba <- dbReadTable(db, 'tbl_dba', row.names = NULL)
tbl_inspection_result <- dbReadTable(db, 'tbl_inspection_result', row.names = NULL)
tbl_violation <- dbReadTable(db, 'tbl_violation', row.names = NULL)

#dbDisconnect(conn)
dbDisconnect(db)

## [1] TRUE

6.2 Combining tables

To make the work easy, we will combine the tables that store all the info about the inspections
First, we should unite tables that contain dba and building info (address, borough code, etc). The resulting data frame will provide info about every building in the city that contains restaurants.

restaurants <- full_join(x = tbl_dba, y = tbl_building, by = "bin")
head(restaurants,n=3)

The below tables will unite info about every inspection and violation.

inspection <- cbind(tbl_inspection_type,tbl_inspection_result, tbl_violation)
inspection <- inspection [, !duplicated(colnames(inspection))]
head(inspection)

Finally, we will combine two above table. As a result, we have info about every restaraunt (name, address, type of cusine, inspection date, violation type, etc). There are 26 columns of information.

final <- left_join(x = restaurants, y = inspection, by = "camis")
head(final)

6.2 Replacing codes

As the next step, we will substitute codes in some columns with words as it will be useful for explanatory analysis.
Instead of the borough code (1,2,3, etc) and action code (01, 02, etc)we will use the borough names (Manhattan, Bronx, etc) and action descriptions, the description of each borough code is stored in the tables tbl_boro, action is in the action_code table.

tbl_boro

tbl_action_code

final <- final %>%
  mutate(boro_code = recode(boro_code, '1' = 'Manhattan', '2' = 'Bronx', '3' = 'Brooklyn', '4' = 'Queens', '5' = 'Staten Island'),
         action_code = recode(action_code, '01' = ' No violations were recorded at the time of this inspection', '02' = 'Violations were cited in the following area(s)', '03' = 'Missing', '04' = 'Establishment re-opened by DOHMH', '05' = 'Establishment Closed by DOHMH. Violations were cited in the following area(s) and those requiring immediate action were addressed', '06' = 'Establishment re-closed by DOHMH'))

Grade_code will stay the same as it is easier to read: N = Not Yet Graded, A = Grade A, B = Grade B, C = Grade C, Z = Grade Pending, P= Grade Pending issued on re-opening following an initial inspection that resulted in a closure and all NA values will be substituted with “N” for “Not Yet Graded”.
The missing values in boro_code column will be replaces with “Missing”.
For some inspections, the columns inspection_code, cuisine_code, violation_code doesn’t contain information, we will substitute it with 0 values and distinguish them during the analysis.

final$grade_code[final$grade_code == ""] <- "N"
final["boro_code"][is.na(final["boro_code"])] <- "Missing"
final["cuisine_code"][is.na(final["cuisine_code"])] <- "000"
final["inspection_code"][is.na(final["inspection_code"])] <- "00"
final["violation_code"][is.na(final["violation_code"])] <- "000"

7. Explanatory analysis.

The first question that came to our mind was “Which borough has the most number of inspections? And is the most common grade of inspections for each borough?”.
Using ggplot function, the graph with the answer was built. It happened that most of the grades awarded after each inspection were “A”. Manhattan is a leader in the number of inspections. It may be because it has the most number of the restaurants.
Z and P grades stand for “Grade Pending” and “Grade Pending issued on re-opening following an initial inspection that resulted in a closure” respectfully. “N” grade means “Not yet graded”.

ggplot(data=final, aes(x=reorder(boro_code, desc(boro_code)))) + 
    geom_bar(aes(fill=grade_code), position='dodge') + 
    labs(title='Grades in each Borough', x='Borough', y='# of Inspections') + 
    theme_bw()

Besides ggplot, there is a great library to use to plot graphs “plot_ly”. This library helped us to answer our next question “Based on the previous plot, does Manhattan have really much more restaurants comparing to other borough?” as difference looks enormous.
It seems that Manhattan has around 3,000 more restaurants than Brooklyn but almost double amount of inspections.

final %>%
  dplyr::group_by(camis) %>%
  dplyr::summarise(n=dplyr::n(), boro_code = unique(boro_code)) %>%
  dplyr::ungroup() %>%
  plot_ly(x = ~boro_code, color='red') %>%
  add_histogram() %>%
  layout(title = "# of Restaurants in each borough",
         xaxis = list(title="Borough"),
         yaxis = list(title="#of restaurants"),
         bargap = 0.3)

It was interesting to know which cuisine is the most common for the city.
“003” code stands for “American” cuisine, “019” for Chinese, “022” for Coffee/Tea, “066” for Pizza, “047 for Italian”, “055” for Mexican and “048” for
Japanese. We took the codes from the table “tbl_cuisine_code”.

food <- final %>%
  dplyr::group_by(camis) %>%
  dplyr::summarise(cuisine_code = unique(cuisine_code)) %>%
  dplyr::ungroup()

food<- as.data.frame(dplyr::count(food, food$cuisine_code))
food<- dplyr::rename(food, 
                         "cuisine_code" = "food$cuisine_code", 
                         "n" = "n")
food <- food[order(food$n, decreasing = TRUE),]

food[1:10,] %>%
  plot_ly(x = ~cuisine_code, y = ~n, type = "bar") %>%
  layout(title = "The most common cuisines",
         xaxis = list(categoryorder = "array", 
         categoryarray = ~cuisine_code, title="cuisine code"),
         yaxis = list(title="# of restaurants"),
         bargap = 0.5)

tbl_cuisine_code %>% filter(grepl('003', cuisine_code))

tbl_cuisine_code %>% filter(grepl('019', cuisine_code))

tbl_cuisine_code %>% filter(grepl('022', cuisine_code))

tbl_cuisine_code %>% filter(grepl('066', cuisine_code))

tbl_cuisine_code %>% filter(grepl('047', cuisine_code))

tbl_cuisine_code %>% filter(grepl('055', cuisine_code))

Also, we were curious about the most common violations around the city. The most common is “Non-food contact surface improperly constructed. Unacceptable material used. Non-food contact surface or equipment improperly maintained and/or not properly sealed, raised, spaced or movable to allow accessibility for cleaning on all sides, etc”.
Second is “Facility not vermin proof. Harborage or conditions conducive to attracting vermin to the premises and/or allowing vermin to exist.”.
Third is “Food contact surface not properly washed, rinsed and sanitized after each use and following any activity when contamination may have occurred.”.

type <- final %>%
  dplyr::group_by(violation_description_code) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  dplyr::ungroup()

type$n <- (type$n * 100)/sum(type$n)
type <- type[order(type$n, decreasing = TRUE),]
type <- type[1:10,]

plot_ly(x = type$violation_description_code, y = type$n, type = "bar") %>%
  layout(title = "Top 10 violation types",
  xaxis = list(title="Violation code", categoryorder = "array", 
         categoryarray = type$n),
  yaxis = list(title="Percentage", ticksuffix="%"))

8. Conclusion

The goal of the project is to build upon a relational database with the set of normalized tabled, together with R it allowed us to read and import data into data frames to perform the analysis. The analysis showed us the main idea about the inspection in New York city. Manhattan has a little more restaurants than Brooklyn. Manhattan restaurants are subjects to inspections because they pay more money for each inspection or the quality is always a question for inspectors? It was a surprise to see that Mexican cuisine is not so popular as Italian. Finally, the most common violations made us worried about the next eating out but at the same time with these awful violations the most common grade for the inspection is A. How could it be?
We can inspect the data further as it contains a lot of valuable information that can be useful for restaurant business.
To answer the main question of the project “Which are the most valued data science skills?”, we would use the quote It is literally true that you can succeed best and quickest by helping others to succeed”.
There were a lot of new information and tasks during the project, a lot we didn’t know. But there was always at least one person in the team who would know the answer. The ability to ask questions, to discuss it within the group, to brainstorm was a game changer. In case, you work alone - ask question using billions of forums around the internet.
At some point, if we stack and didn’t know how to transform our idea in the R code, the next skill came to save us - read the documentation as it contains all the answers.
One of the main skills was to be able to properly check the data after each load, save, transformation since usually you won’t get the expected result from the first attempt.As a result, there is the next important skill - learn based on your own mistakes. Because while you are trying to find how to fix the issue, you will learn a lot of new things.
Also, we have to mention that it is impossible to perform the analysis if you don’t try to see the connection between data or if you can’t ask proper questions for analysis. If you don’t have a proper questions for analysis, you won’t be able to perform analysis.
To sum up, we learned that each budding data scientist should be able to communicate effectively with the team, ask questions if needed, be able to read the documentation to find answers, check result of each part of the code, use mistakes as an effective way to learn new material, understand data and think carefully about the research question to perform the data analysis.