Introduction

The primary goal of this data exercise is to explore the data provided in the Las Vegas open data portal on restaurant grading and to fit an MVP prediction model, with NEXT_INSPECTION_GRADE_C_OR_BELOW as the response.

Data Source: https://opendataportal-lasvegas.opendata.arcgis.com/datasets/restaurant-inspections-open-data I downloaded the data from the open data portal and stored it as a CSV in my repository.

Challenges:

The preliminary challenge is that a data description is not available; in the absence of such a dictionary, we need to make some assumptions about the data. The second challenge is that the data is dirty: many columns have abnormal values, which we need to identify from descriptive statistics and distributions.

We need to validate, with common sense and business knowledge, which abnormal values to treat or remove and which to retain.

Packages Installed

The R packages used for data analysis in this project are:

• data.table - To import data
• tidyverse - For tidy data, visualisation, and transformation
• tibble - To create tibbles
• plotrix - For 3D exploded pie charts
• tidyr - For tidy data
• dplyr - For data analysis
• DT - To display data sets
• ggplot2 - For data visualisation
• magrittr - For the pipe operator
• knitr - To display tables
• ROCR - For creating ROC curves
• scales - For ggplot2-style scales for specific types of data
• pander - To produce simple tables from summary() output
• plotly - For creating interactive charts
• visdat - For visualising missing data
• caret - For ML tasks
• recipes - For feature engineering tasks

Loading libraries

options(warn=-1)
# install.packages("tidyverse")
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.1
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
## Warning: package 'readr' was built under R version 4.1.1
# Helper packages
library(dplyr)    # for data manipulation
library(ggplot2)  # for awesome graphics
## Warning: package 'ggplot2' was built under R version 4.1.2
library(visdat)   # for additional visualizations
## Warning: package 'visdat' was built under R version 4.1.2
# Feature engineering packages
library(caret)    # for various ML tasks
## Warning: package 'caret' was built under R version 4.1.2
## Loading required package: lattice
library(recipes)  # for feature engineering tasks
## Warning: package 'recipes' was built under R version 4.1.2
## 
## Attaching package: 'recipes'
## The following object is masked from 'package:stats':
## 
##     step
# Helper packages
library(dplyr)      # for data wrangling
library(ggplot2)    # for awesome graphics
library(rsample)    # for creating validation splits
## Warning: package 'rsample' was built under R version 4.1.2
library(recipes)    # for feature engineering

# Modeling packages
library(caret)       # for fitting KNN models

# Helper packages
library(dplyr)       # for data wrangling
library(ggplot2)     # for awesome plotting

# Modeling packages
library(rpart)       # direct engine for decision tree application
library(caret)       # meta engine for decision tree application

# Model interpretability packages
library(rpart.plot)  # for plotting decision trees
## Warning: package 'rpart.plot' was built under R version 4.1.2
library(vip)         # for feature importance
## Warning: package 'vip' was built under R version 4.1.2
## 
## Attaching package: 'vip'
## The following object is masked from 'package:utils':
## 
##     vi
library(pdp)         # for feature effects
## Warning: package 'pdp' was built under R version 4.1.2
library(caTools)
## Warning: package 'caTools' was built under R version 4.1.2
library(caret)
library(ROCR)
## Warning: package 'ROCR' was built under R version 4.1.2
library(ranger)
## Warning: package 'ranger' was built under R version 4.1.2
set.seed(42)

Data Pre-processing

Data Source

Import the data

The data has 15,673 observations on 34 features.

df=read.csv("C:\\Users\\shree\\OneDrive\\Desktop\\LastVegasRestaurantData.csv")

Data At a glance

glimpse(df)
## Rows: 15,673
## Columns: 34
## $ RESTAURANT_SERIAL_NUMBER         <chr> "DA0630765", "DAJI61TX5", "DAO2OUCAK"~
## $ RESTAURANT_PERMIT_NUMBER         <chr> "PR0023771", "PR0110047", "PR0110047"~
## $ RESTAURANT_NAME                  <chr> "#1 Hawaiian Barbecue", "#1 HAWAIIAN ~
## $ RESTAURANT_LOCATION              <chr> "#1 Hawaiian Barbecue", "CHINA DRAGON~
## $ RESTAURANT_CATEGORY              <chr> "Restaurant", "Restaurant", "Restaura~
## $ ADDRESS                          <chr> "5870 Losee Rd", "5905 S EASTERN Ave ~
## $ CITY                             <chr> "North Las Vegas", "Las Vegas", "Las ~
## $ STATE                            <chr> "Nevada", "Nevada", "Nevada", "Nevada~
## $ ZIP                              <chr> "89081-6593", "89119", "89119", "8911~
## $ CURRENT_DEMERITS                 <dbl> 0, 0, 3, 3, 8, 3, 9, 9, 3, 3, 3, 6, 6~
## $ CURRENT_GRADE                    <chr> "A", "A", "A", "A", "A", "A", "A", "A~
## $ EMPLOYEE_COUNT                   <int> 10, 14, 31, 22, 5, 3, 4, 13, 20, 25, ~
## $ MEDIAN_EMPLOYEE_AGE              <dbl> 29.37515, 27.57530, 33.98700, 26.5918~
## $ MEDIAN_EMPLOYEE_TENURE           <dbl> 1.435599, 1.436292, 2.041849, 4.86729~
## $ INSPECTION_TIME                  <chr> "4/19/2010 13:45", "8/18/2017 12:30",~
## $ Inspection_hr                    <chr> "13.00", "12.00", "11.00", "13.00", "~
## $ last_Inspection                  <chr> "4305.00", "1627.00", "1965.00", "270~
## $ INSPECTION_TYPE                  <chr> "Routine Inspection", "Routine Inspec~
## $ INSPECTION_DEMERITS              <chr> "19", "20", "27", "20", NA, "5", "10"~
## $ VIOLATIONS_RAW                   <chr> "14,26,27,36,64,114", "2,092,112,122,~
## $ RECORD_UPDATED                   <chr> "2/21/2013 22:26", "8/18/2017 16:03",~
## $ Record_hr                        <chr> "22.00", "16.00", "13.00", "9.00", "2~
## $ last_record                      <chr> "3266", "1627", "1965", "2694", "3266~
## $ LAT_LONG_RAW                     <chr> "(36.2666713, 115.1166619)", "(36.082~
## $ FIRST_VIOLATION                  <int> 14, 209, 202, 208, 10, 214, 206, 211,~
## $ SECOND_VIOLATION                 <int> 26, 211, 208, 211, 14, 232, 214, 212,~
## $ THIRD_VIOLATION                  <int> 27, 212, 209, 212, 30, 233, 229, 214,~
## $ FIRST_VIOLATION_TYPE             <chr> "Major", "Critical", "Critical", "Cri~
## $ SECOND_VIOLATION_TYPE            <chr> "Major", "Major", "Critical", "Major"~
## $ THIRD_VIOLATION_TYPE             <chr> NA, "Major", "Critical", "Major", "No~
## $ NUMBER_OF_VIOLATIONS             <chr> "6", "8", "7", "7", "6", "3", "4", "7~
## $ NEXT_INSPECTION_GRADE_C_OR_BELOW <chr> "0", "0", "0", "0", "0", "0", "0", "0~
## $ LAT_LONG_RAW.1                   <dbl> 36.26667, 36.08201, 36.08201, 36.0820~
## $ Long.Raw                         <dbl> 115.1167, 115.1196, 115.1196, 115.119~

Data Dictionary

A description of the Las Vegas data variables is not available. In its absence, I have explained the variables as best I can, based on my understanding of the inspection form.

data.type <- lapply(df, class)
Description <- c("SERIAL NUMBER OF THE RESTAURANT-ID COLUMN",
                 "PERMIT NUMBER OF THE RESTAURANT- ID COLUMN",
                 "NAME OF THE RESTAURANT",
                 "LOCATION OF THE RESTAURANT",
                 "WHICH TYPE OF RESTAURANT",
                 "ADDRESS OF THE RESTAURANT",
                 "CITY",
                 "STATE",
                 "ZIP",
                 "CURRENT DEMERITS OF THE INSPECTION",
                 "CURRENT GRADE",
                 "NUMBER OF EMPLOYEES",
                 "MEDIAN AGE OF THE EMPLOYEES",
                 "TIME OF INSPECTION",
                 "HOUR OF INSPECTION",
                 "DAYS PASSED SINCE LAST INSPECTION",
                 "TYPE OF INSPECTION",
                 "DEMERITS OF CURRENT INSPECTION",
                 "NO OF SAFETY REGULATIONS VIOLATED",
                 "LAST TIME OF RECORD",
                 "HOUR WHEN IT WAS RECORDED",
                 "DAYS PASSED SINCE LAST RECORD",
                 "LATITUDE,LONGITUDE",
                 "FIRST HEALTH CODE VIOLATED",
                 "SECOND HEALTH CODE VIOLATED",
                 "THIRD HEALTH CODE VIOLATED",
                 "FIRST VIOLATION TYPE",
                 "SECOND VIOLATION TYPE",
                 "NUMBER OF VIOLATIONS",
                 "NEXT INSPECTION GRADE C OR BELOW",
                 "LATITUDE",
                 "LONGITUDE"
              
)
Variable <- colnames(df)
data.description <- as.data.frame(cbind(Variable, Description))
options(warn=-1)
data.description
##                                   Description
## 1   SERIAL NUMBER OF THE RESTAURANT-ID COLUMN
## 2  PERMIT NUMBER OF THE RESTAURANT- ID COLUMN
## 3                      NAME OF THE RESTAURANT
## 4                  LOCATION OF THE RESTAURANT
## 5                    WHICH TYPE OF RESTAURANT
## 6                   ADDRESS OF THE RESTAURANT
## 7                                        CITY
## 8                                       STATE
## 9                                         ZIP
## 10         CURRENT DEMERITS OF THE INSPECTION
## 11                              CURRENT GRADE
## 12                        NUMBER OF EMPLOYEES
## 13                MEDIAN AGE OF THE EMPLOYEES
## 14                         TIME OF INSPECTION
## 15                         HOUR OF INSPECTION
## 16          DAYS PASSED SINCE LAST INSPECTION
## 17                         TYPE OF INSPECTION
## 18             DEMERITS OF CURRENT INSPECTION
## 19          NO OF SAFETY REGULATIONS VIOLATED
## 20                        LAST TIME OF RECORD
## 21                  HOUR WHEN IT WAS RECORDED
## 22              DAYS PASSED SINCE LAST RECORD
## 23                         LATITUDE,LONGITUDE
## 24                 FIRST HEALTH CODE VIOLATED
## 25                SECOND HEALTH CODE VIOLATED
## 26                 THIRD HEALTH CODE VIOLATED
## 27                       FIRST VIOLATION TYPE
## 28                      SECOND VIOLATION TYPE
## 29                       NUMBER OF VIOLATIONS
## 30           NEXT INSPECTION GRADE C OR BELOW
## 31                                   LATITUDE
## 32                                  LONGITUDE

Feature Classification

We break the feature space into: identifiers, numerical, categorical (nominal), ordinal, and the target variable.

• Nominal categories: Restaurant Category
• Ordinal categories: Current Grade, Inspection Type, First, Second and Third Violation Type
• Numeric variables: Employee Count, Median Employee Age, Median Employee Tenure, Inspection Demerits, First, Second and Third Violation, Number of Violations
• Target: Grade C or Below in the Next Inspection
• Time variables: Record Updated Time, Inspection Time

identifier_feature = c('RESTAURANT_SERIAL_NUMBER')
continuous_features = c('MEDIAN_EMPLOYEE_AGE', 'MEDIAN_EMPLOYEE_TENURE', 'LAT_LONG_RAW.1', 'Long.Raw')
nominal_features = c('RESTAURANT_CATEGORY', 'CITY', 'STATE', 'INSPECTION_TYPE')
ordinal_features = c('CURRENT_GRADE', 'FIRST_VIOLATION_TYPE', 'SECOND_VIOLATION_TYPE', 'THIRD_VIOLATION_TYPE')
numeric_features = c('EMPLOYEE_COUNT', 'CURRENT_DEMERITS', 'INSPECTION_DEMERITS', 'NUMBER_OF_VIOLATIONS')
target = c('NEXT_INSPECTION_GRADE_C_OR_BELOW')

Exploratory Data Analysis

We have used the automated EDA package ‘DataExplorer’ to look at all the variables, their univariate distributions and relationships with each other, to identify data anomalies, and to gauge variable importance based on some statistical tests.
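
A minimal sketch of the automated EDA calls, assuming the DataExplorer package is installed (the generated figures are not embedded in this document):

library(DataExplorer)

plot_intro(df)                                              # high-level overview of the data
plot_bar(df)                                                # distributions of categorical variables
plot_boxplot(df, by = "NEXT_INSPECTION_GRADE_C_OR_BELOW")   # numeric features vs. the target
create_report(df, y = "NEXT_INSPECTION_GRADE_C_OR_BELOW")   # full automated EDA report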

Data Exploration

• Box plots of all numeric features against the categorical dependent variable (bivariate comparison)
• Information value contained in the various categorical variables
• Distribution of the target variable
• Distribution of the categorical variables

Feature Engineering:

Why is it important? Data preprocessing and engineering techniques generally refer to the addition, deletion, or transformation of data. The feature engineering step is essential here, as the data at hand is dirty. I spent substantial time understanding the data before I went ahead with the modeling.

Feature Extraction: To reiterate, we have created some additional features (a sketch of how they could be derived follows this list):
• Split the latitude/longitude column into two separate columns
• Hour of Inspection - hour of the day when the last inspection was made
• Hour of Record - hour of the day when the last record was made
• Days passed since last inspection - difference between today and the last inspection day
• Days passed since last record - difference between today and the last record day
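
A minimal sketch of how these features could be derived from the raw columns, assuming lubridate for date parsing and the raw formats shown in the glimpse above (the engineered columns already exist in the downloaded CSV, so this is illustrative only):

library(dplyr)
library(tidyr)
library(lubridate)

df_engineered <- df %>%
  mutate(
    inspection_ts   = mdy_hm(INSPECTION_TIME),                          # "4/19/2010 13:45" -> timestamp
    Inspection_hr   = hour(inspection_ts),                              # hour of the inspection
    last_Inspection = as.numeric(Sys.Date() - as.Date(inspection_ts)),  # days since last inspection
    record_ts       = mdy_hm(RECORD_UPDATED),
    Record_hr       = hour(record_ts),                                  # hour of the record
    last_record     = as.numeric(Sys.Date() - as.Date(record_ts))       # days since last record
  ) %>%
  separate(LAT_LONG_RAW, into = c("Lat", "Long"), sep = ",", remove = FALSE) %>%
  mutate(Lat  = as.numeric(gsub("[() ]", "", Lat)),                     # strip "(", ")" and spaces
         Long = as.numeric(gsub("[() ]", "", Long)))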

Feature Cleaning:
• Only consider NEXT_INSPECTION_GRADE_C_OR_BELOW values of 0 or 1
• Drop columns that we will not be analysing: ID columns such as RESTAURANT_SERIAL_NUMBER, RESTAURANT_PERMIT_NUMBER, RESTAURANT_NAME
• Drop location columns such as ADDRESS, CITY, STATE, ZIP, since we already have the latitude/longitude information
• CURRENT_GRADE should be in ["A", "B", "C", "X", "O", "N"]
• INSPECTION_TYPE should be in ["Routine Inspection", "Re-inspection"]
• Treat unnaturally high and negative EMPLOYEE_COUNT values
• Keep only rows where both latitude and longitude contain valid values

df<-df%>%filter(NEXT_INSPECTION_GRADE_C_OR_BELOW %in% c(0,1))
data=df
df<-df%>%filter(STATE=="Nevada")
df<-df%>%select(-c('RESTAURANT_SERIAL_NUMBER','RESTAURANT_PERMIT_NUMBER','RESTAURANT_NAME','RESTAURANT_LOCATION','ADDRESS','CITY','STATE','ZIP','VIOLATIONS_RAW','LAT_LONG_RAW'))
df<-df%>%filter(CURRENT_GRADE %in% c("A","B","C","X","O","N"))
df<-df%>%filter(INSPECTION_TYPE %in% c("Re-inspection", "Routine Inspection"))
df<-df%>%filter(FIRST_VIOLATION_TYPE %in% c("Major","Non-Major","Critical"))
df<-df%>%filter(SECOND_VIOLATION_TYPE %in% c("Major","Non-Major","Critical"))
df<-df%>%filter(THIRD_VIOLATION_TYPE %in% c("Major","Non-Major","Critical"))
df<-df %>%filter(EMPLOYEE_COUNT<=100)
df<-df %>%filter(CURRENT_DEMERITS<=100)
df<-df %>%filter(Long.Raw>0)

Dealing With Missingness:

Why do we need to deal with missingness? Our data clearly suffers from missing values. Since the proportion of missing values is small, we might be tempted to simply omit them. However, because our data set is imbalanced, with the target variable equal to 1 only about 16% of the time, we should study the pattern of missingness before deciding.
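
As a quick check of the imbalance mentioned above, the class proportions can be inspected directly (a sketch; the exact proportion depends on the filtering applied so far):

# Proportion of 0s and 1s in the target variable
prop.table(table(df$NEXT_INSPECTION_GRADE_C_OR_BELOW))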

sum(is.na(df))
## [1] 1378
colSums(is.na(df))
##              RESTAURANT_CATEGORY                 CURRENT_DEMERITS 
##                              113                                0 
##                    CURRENT_GRADE                   EMPLOYEE_COUNT 
##                                0                                0 
##              MEDIAN_EMPLOYEE_AGE           MEDIAN_EMPLOYEE_TENURE 
##                               31                              269 
##                  INSPECTION_TIME                    Inspection_hr 
##                              158                                0 
##                  last_Inspection                  INSPECTION_TYPE 
##                                0                                0 
##              INSPECTION_DEMERITS                   RECORD_UPDATED 
##                              228                              109 
##                        Record_hr                      last_record 
##                                0                                0 
##                  FIRST_VIOLATION                 SECOND_VIOLATION 
##                              189                               75 
##                  THIRD_VIOLATION             FIRST_VIOLATION_TYPE 
##                               56                                0 
##            SECOND_VIOLATION_TYPE             THIRD_VIOLATION_TYPE 
##                                0                                0 
##             NUMBER_OF_VIOLATIONS NEXT_INSPECTION_GRADE_C_OR_BELOW 
##                              150                                0 
##                   LAT_LONG_RAW.1                         Long.Raw 
##                                0                                0

Data can be missing for many different reasons; however, these reasons are usually lumped into two categories: informative missingness and missingness at random.

Informative missingness implies a structural cause for the missing value that can provide insight in its own right, whether due to deficiencies in how the data were collected or abnormalities in the observational environment. Missingness at random implies that missing values occur independently of the data collection process.

Missingness Charts - The Heat Map:

df %>%
  is.na() %>%
  reshape2::melt() %>%
  ggplot(aes(Var2, Var1, fill=value)) + 
    geom_raster() + 
    coord_flip() +
    scale_y_continuous(NULL, expand = c(0, 0)) +
    scale_fill_grey(name = "", 
                    labels = c("Present", 
                               "Missing")) +
    xlab("Observation") +
    theme(axis.text.y  = element_text(size = 4))

The Cluster Chart:

vis_miss(df, cluster = TRUE)

Conclusion: Upon careful inspection of both charts, we fail to see any pattern in the missingness of the data. We can go ahead with omitting the NAs.

df<-na.omit(df)

Modeling Exercise

Feature Engineering

Feature Lumping: Our data is primarily dominated by categorical variables.

Most models require that the predictors take numeric form. There are exceptions; for example, tree-based models naturally handle numeric or categorical features. However, even tree-based models can benefit from preprocessing categorical features.

Sometimes features contain levels that have very few observations. In our example, the restaurant category has several levels, but some of them contain very few observations. A sensible way to deal with this is to lump (club) the rare levels together into a single "other" level.

Variables we lumped: Restaurant Category, Inspection Hour, Record Hour.

count(df, RESTAURANT_CATEGORY) %>% arrange(n)
##            RESTAURANT_CATEGORY    n
## 1      Self-Service Food Truck    1
## 2               Farmers Market    2
## 3                 Main Kitchen    8
## 4                     Barbeque   12
## 5           Childcare Kitchens   12
## 6              Banquet Support   13
## 7    Elementary School Kitchen   14
## 8                 Portable Bar   14
## 9       Grocery Store Sampling   19
## 10                  Confection   23
## 11  Institutional Food Service   24
## 12                 Concessions   26
## 13              Produce Market   27
## 14              Vegetable Prep   35
## 15                Garde Manger   38
## 16                Bakery Sales   39
## 17              Kitchen Bakery   51
## 18             Banquet Kitchen   53
## 19                     Caterer   57
## 20 Food Trucks / Mobile Vendor   81
## 21        Meat/Poultry/Seafood  107
## 22                      Pantry  126
## 23               Portable Unit  159
## 24                      Buffet  180
## 25             Special Kitchen  938
## 26                   Snack Bar 1053
## 27                Bar / Tavern 1926
## 28                  Restaurant 7495
# Lump levels for TOP features
lumping <- recipe(NEXT_INSPECTION_GRADE_C_OR_BELOW ~ ., data = df) %>%
  step_other(RESTAURANT_CATEGORY, threshold = 0.01, 
             other = "other") 

# Apply this blue print 
df <- prep(lumping, training = df) %>%
  bake(df)



###INSPECTION HOUR CLUBBING###
# Lump levels for TOP features
lumping1 <- recipe(NEXT_INSPECTION_GRADE_C_OR_BELOW ~ ., data = df) %>%
  step_other(Inspection_hr, threshold = 0.01, 
             other = "other") 

# Apply this blue print 
df <- prep(lumping1, training = df) %>%
  bake(df)



###RECORD HOUR CLUBBING###
# Lump levels for TOP features
lumping2 <- recipe(NEXT_INSPECTION_GRADE_C_OR_BELOW ~ ., data = df) %>%
  step_other(Record_hr, threshold = 0.01, 
             other = "other") 

# Apply this blue print 
df <- prep(lumping2, training = df) %>%
  bake(df)

Dummy Encoding for Nominal Variables

# One-hot encode all nominal features
recipe(NEXT_INSPECTION_GRADE_C_OR_BELOW ~ ., data = df) %>%
  step_dummy(all_nominal(), one_hot = TRUE)
## Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         23
## 
## Operations:
## 
## Dummy variables from all_nominal()
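
The recipe above is only printed, not applied; the tree-based models used below handle factor predictors directly. If the one-hot columns were actually needed, the blueprint would be prepped and baked like the lumping blueprints above (a sketch; dummy_bp and df_dummy are illustrative names, and the outcome is excluded from the encoding):

# Sketch only: materialise the one-hot columns if a numeric-only design matrix is required
dummy_bp <- recipe(NEXT_INSPECTION_GRADE_C_OR_BELOW ~ ., data = df) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE)

df_dummy <- prep(dummy_bp, training = df) %>%
  bake(df)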

Label Encoding for Ordinal Features:

# Label encode the ordinal features with a single blueprint
df<-recipe(NEXT_INSPECTION_GRADE_C_OR_BELOW ~ ., data = df) %>%
  step_integer(CURRENT_GRADE, INSPECTION_TYPE,
               FIRST_VIOLATION_TYPE, SECOND_VIOLATION_TYPE, THIRD_VIOLATION_TYPE) %>%
  prep(df) %>%
  bake(df)

Fitting the Classification Model

Decision Tree Model for Classification: Tree-based models are a class of nonparametric algorithms that work by partitioning the feature space into a number of smaller (non-overlapping) regions with similar response values, using a set of splitting rules.

Decision trees also work well on a category-heavy feature space thanks to their divide-and-conquer approach, and they produce results and diagrams that are simple to understand.

data1<-df%>%select(-c('INSPECTION_TIME','RECORD_UPDATED'))
ames_dt1 <- rpart(
  formula = NEXT_INSPECTION_GRADE_C_OR_BELOW ~ .,
  data    = data1,
  method  = "class")   # classification tree, since the target is a binary factor
#ames_dt1
rpart.plot(ames_dt1)

vip(ames_dt1, num_features = 40, bar = FALSE)

Conclusion from the Decision Tree: From our decision tree model, we observe that Last Inspection (number of days passed since the last inspection), Last Record (number of days passed since the last record), Record Hour, Inspection Demerits, and Number of Violations emerge as the most important variables.

Random Forest

We fit an out-of-the-box algorithm that usually has high predictive power. We have used an ensemble, tree-based random forest model for two reasons: RF works well when the data is dominated by categorical features, and RF with default parameters already gives good predictive power and can be used to identify the most important variables. As our concern is to build an MVP, identifying the important variables matters most to us. RF provides several statistics and ranks the variables using entropy, the Gini index, or permutations.

# number of features
n_features <- length(setdiff(names(data1),"NEXT_INSPECTION_GRADE_C_OR_BELOW"))
library(ranger)
# train a default random forest model
df_rf1 <- ranger(
  NEXT_INSPECTION_GRADE_C_OR_BELOW ~ ., 
  data = data1,
  probability = TRUE,
  mtry = floor(n_features / 3),
  respect.unordered.factors = "order",
  seed = 123
)

# OOB prediction error (the Brier score for a probability forest); take the square root
(default_rmse <- sqrt(df_rf1$prediction.error))
## [1] 0.331889

The lower the OOB error, the better. This out-of-the-box error is already reasonable, which suggests we do not need to tune the hyperparameters very much.
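
If we did want to tune, a small grid search over mtry and min.node.size scored by the OOB error could look like the sketch below (illustrative values; this is not the tuning actually performed for this report):

# Candidate hyperparameter combinations
hyper_grid <- expand.grid(
  mtry          = floor(n_features * c(.15, .25, .333, .4)),
  min.node.size = c(1, 5, 10),
  oob_error     = NA
)

for (i in seq_len(nrow(hyper_grid))) {
  fit <- ranger(
    NEXT_INSPECTION_GRADE_C_OR_BELOW ~ .,
    data = data1,
    probability = TRUE,
    num.trees = 500,
    mtry = hyper_grid$mtry[i],
    min.node.size = hyper_grid$min.node.size[i],
    respect.unordered.factors = "order",
    seed = 123
  )
  hyper_grid$oob_error[i] <- sqrt(fit$prediction.error)   # OOB error for this combination
}

head(hyper_grid[order(hyper_grid$oob_error), ])            # best settings first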

Random Forest Output: Unlike other packages, we have used "ranger" for our RF model because it provides the probabilities of both 0 and 1. We did this as it is more robust and gives the end user more flexibility in choosing the classification threshold.

pred<-predict(df_rf1,data=data1)
pred1<-as.data.frame(pred$predictions)
head(pred1,5)
##           0         1
## 1 0.8487056 0.1512944
## 2 0.8801841 0.1198159
## 3 0.8189071 0.1810929
## 4 0.8001690 0.1998310
## 5 0.8732675 0.1267325

In Sample Prediction

Using the RF with default hyperparameters

pred<-predict(df_rf1,data=data1)
pred1<-as.data.frame(pred$predictions)
colnames(pred1)<-c('Chance0','Chance1')
pred2<-cbind(pred1,ifelse(pred1$Chance1>= 0.5, 1, 0))
colnames(pred2)=c('Chance0','Chance1','Predict')

sample_predicted_observed<-cbind(data1,pred2)
##write.csv(sample_predicted_observed,"C:\\Users\\shree\\OneDrive\\Desktop\\sample_predicted_observed.csv")

# Convert the predicted probability of class 1 into a hard class at a 0.5 threshold
pred_class <- ifelse(pred1$Chance1 > 0.5, 1, 0)

# Convert to a factor with the same levels as the observed target: p_class
p_class <- factor(pred_class, levels = levels(data1[["NEXT_INSPECTION_GRADE_C_OR_BELOW"]]))


# Create confusion matrix
confusionMatrix(p_class, data1[["NEXT_INSPECTION_GRADE_C_OR_BELOW"]])
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10532   294
##          1     1  1706
##                                          
##                Accuracy : 0.9765         
##                  95% CI : (0.9737, 0.979)
##     No Information Rate : 0.8404         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9067         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
##                                          
##             Sensitivity : 0.9999         
##             Specificity : 0.8530         
##          Pos Pred Value : 0.9728         
##          Neg Pred Value : 0.9994         
##              Prevalence : 0.8404         
##          Detection Rate : 0.8403         
##    Detection Prevalence : 0.8638         
##       Balanced Accuracy : 0.9265         
##                                          
##        'Positive' Class : 0              
## 

ROC Curve

# Build the ROC curve from the predicted probabilities (not the hard 0/1 classes)
pred <- prediction(pred2$Chance1, data1$NEXT_INSPECTION_GRADE_C_OR_BELOW)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, col=rainbow(10))
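
The corresponding AUC can be read from the same ROCR prediction object (a short sketch):

# Area under the ROC curve
auc <- performance(pred, measure = "auc")
auc@y.values[[1]]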

RF Models with Impurity and Permutation based Variable Importance

# re-run model with impurity-based variable importance

rf_impurity <- ranger(
  formula =NEXT_INSPECTION_GRADE_C_OR_BELOW ~ ., 
  data = data1, 
  num.trees = 2000,
  mtry = 5,
  min.node.size = 10,
  sample.fraction = .50,
  replace = TRUE,
  importance = "impurity",
  respect.unordered.factors = "order",
  verbose = FALSE,
  seed  = 123
)


# re-run model with permutation-based variable importance
rf_permutation <- ranger(
  formula = NEXT_INSPECTION_GRADE_C_OR_BELOW ~  ., 
  data = data1, 
  num.trees = 2000,
  mtry = 5,
  min.node.size = 10,
  sample.fraction = .50,
  replace = TRUE,
  importance = "permutation",
  respect.unordered.factors = "order",
  verbose = FALSE,
  seed  = 123
)

Variable Importance Plot

p1 <- vip::vip(rf_impurity, num_features = 14, bar = FALSE)
p2 <- vip::vip(rf_permutation, num_features = 14, bar = FALSE)

gridExtra::grid.arrange(p1, p2, nrow = 1)

Interpretation:

From the RF model, we observe an out-of-sample error of ~14.2% after 5-fold cross-validation. In-sample confusion matrix output:

           Reference
Prediction     0     1
         0 10532   294
         1     1  1706

               Accuracy : 0.9765

Please note that only about 16% of the observations have the target variable equal to 1, so the RF provides a meaningful lift in prediction accuracy over the no-information rate.
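
The cross-validated error quoted above was computed separately; a sketch of how such an estimate could be produced with caret (not the exact code behind the ~14.2% figure) is:

# 5-fold cross-validation of a ranger forest via caret
cv_ctrl <- trainControl(method = "cv", number = 5)

rf_cv <- train(
  NEXT_INSPECTION_GRADE_C_OR_BELOW ~ .,
  data       = data1,
  method     = "ranger",
  trControl  = cv_ctrl,
  tuneLength = 3
)

rf_cv$results   # CV accuracy; out-of-sample error = 1 - Accuracy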

Random forests provide a very powerful out-of-the-box algorithm that often has great predictive accuracy. They come with all the benefits of decision trees and bagging but greatly reduce instability and between-tree correlation.

And due to the added split variable selection attribute, random forests are also faster than bagging as they have a smaller feature search space at each tree split.

The main issue RF faces is when the data size is huge. In our problem the data size is fairly small, which keeps the model computationally efficient.

Working With the Test Data

Test Data Pre-Processing

test1<-read.csv("C:\\Users\\shree\\OneDrive\\Desktop\\test_data.csv")
test1<-na.omit(test1)

test1<-test1%>%filter(STATE=="Nevada")
test1<-test1%>%select(-c('RESTAURANT_SERIAL_NUMBER','RESTAURANT_PERMIT_NUMBER','RESTAURANT_NAME','RESTAURANT_LOCATION','ADDRESS','CITY','STATE','ZIP','VIOLATIONS_RAW','LAT_LONG_RAW'))
test1<-test1%>%filter(CURRENT_GRADE %in% c("A","B","C","X","O","N"))
test1<-test1%>%filter(INSPECTION_TYPE %in% c("Re-inspection", "Routine Inspection"))
test1<-test1%>%filter(FIRST_VIOLATION_TYPE %in% c("Major","Non-Major","Critical"))
test1<-test1%>%filter(SECOND_VIOLATION_TYPE %in% c("Major","Non-Major","Critical"))
test1<-test1%>%filter(THIRD_VIOLATION_TYPE %in% c("Major","Non-Major","Critical"))
test1<-test1 %>%filter(EMPLOYEE_COUNT<=100)
test1<-test1 %>%filter(CURRENT_DEMERITS<=100)
test1<-test1 %>%filter(Long.Raw>0)



# Label encode the ordinal features with a single blueprint
# (Long.Raw is used as a placeholder outcome, matching the original chunks,
#  because the target column is not needed for this step)
test1<-recipe(Long.Raw ~ ., data = test1) %>%
  step_integer(CURRENT_GRADE, INSPECTION_TYPE,
               FIRST_VIOLATION_TYPE, SECOND_VIOLATION_TYPE, THIRD_VIOLATION_TYPE) %>%
  prep(test1) %>%
  bake(test1)

test1<-test1%>%select(-c('INSPECTION_TIME','RECORD_UPDATED'))
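
One caveat with the label encoding above is that it is prepped on the test data itself, so the integer codes are only guaranteed to match the training data if both sets contain the same levels. A more robust pattern is to prep the encoding blueprint once on the training data and bake both sets with it; a sketch (not run here), assuming df_pre_encode is the cleaned training data before its own label-encoding step:

# Sketch only: reuse one encoding blueprint for train and test
encode_bp <- recipe(Long.Raw ~ ., data = df_pre_encode) %>%
  step_integer(CURRENT_GRADE, INSPECTION_TYPE,
               FIRST_VIOLATION_TYPE, SECOND_VIOLATION_TYPE, THIRD_VIOLATION_TYPE) %>%
  prep(training = df_pre_encode)

df_enc    <- bake(encode_bp, new_data = df_pre_encode)   # training encodings
test1_enc <- bake(encode_bp, new_data = test1)           # identical integer codes on the test set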

Data Manipulation in the Test Data

#write.csv(test1,"C:\\Users\\shree\\OneDrive\\Desktop\\testfinal.csv")
test1<-na.omit(test1)
#colSums(is.na(test1))

  #RESTAURANT_CATEGORY<-as.factor(RESTAURANT_CATEGORY),
  #CURRENT_DEMERITS<-as.numeric(CURRENT_DEMERITS),
  #CURRENT_GRADE<-as.numeric(CURRENT_GRADE),
test1$Inspection_hr<-as.factor(test1$Inspection_hr)
test1$last_Inspection<-as.factor(test1$last_Inspection)
  #INSPECTION_TYPE <-as.numeric(INSPECTION_TYPE),
test1$INSPECTION_DEMERITS<-as.factor(test1$INSPECTION_DEMERITS)
test1$Record_hr<-as.factor(test1$Record_hr)
test1$last_record<-as.factor(test1$last_record)
test1$NUMBER_OF_VIOLATIONS<-as.factor(test1$NUMBER_OF_VIOLATIONS)

pred_test<-predict(df_rf1,data=test1)
pred_test1<-as.data.frame(pred_test$predictions)
colnames(pred_test1)<-c('Chance0','Chance1')
final_df<-cbind(test1,pred_test1)
#write.csv(final_df,"C:\\Users\\shree\\OneDrive\\Desktop\\final_#df.csv")
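
If hard classes are needed for the test set, the probabilities can be thresholded (a sketch; 0.5 is shown, but the threshold can be tuned given the imbalanced target):

# Convert predicted probabilities to a 0/1 class at a 0.5 cut-off
final_df$Predict <- ifelse(final_df$Chance1 >= 0.5, 1, 0)
table(final_df$Predict)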

Suggestions

About the Model Proposed: These suggestions are based on a random forest model with default parameters. There are different ways of approaching this problem, and a different approach might change the explanatory interpretations we obtained. So we should take these suggestions with a grain of salt unless they are echoed across all the modeling approaches.

About Multicollinearity: The data might contain some multicollinearity, and RF has the known issue of showing all of the collinear variables as important in the variable importance plot. In our case, the variables "Days Passed since Last Inspection" and "Days Passed since Last Record" are highly correlated, and both appear as important variables in the VIP. That is acceptable here, as the goal is prediction, and multicollinearity does not hurt predictive power.
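
A quick way to verify this correlation (a sketch; the two "days since" columns are stored as text/factors, so they are coerced to numeric first):

# Correlation between days since last inspection and days since last record
cor(as.numeric(as.character(data1$last_Inspection)),
    as.numeric(as.character(data1$last_record)),
    use = "complete.obs")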

Further Scope: I tried to obtain additional variables from open repositories, such as customer ratings from Yelp, but because the customer feedback data covers recent years while the data at hand is fairly old, it was difficult to map the two. I believe additional features such as customer feedback, the age of the restaurant, the worth of the restaurant, and the type of locality (high-income, office, school area, etc.) could further improve model performance.