Opening a restaurant is no easy feat. Many different variables can determine whether such an enterprise eventually proves to be a success or a failure. Let's start by loading the packages we will need, then set the seed and the plotting theme.
library(pacman)
p_load(readr, dplyr, ggplot2, ggthemr, Boruta, lubridate, randomForest)
set.seed(2445)
ggthemr('pale')
Next we load up the dataframes.
train <- read_csv("data/train.csv")
test <- read_csv("data/test.csv")
sample_sub <- read_csv("data/sampleSubmission.csv")
# create a copy for EDA
df <- train
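Before plotting anything, a quick look at the shape and column types of the data helps orient the analysis (glimpse() comes from dplyr, which we loaded above):
# quick overview of the training data
dim(train)
glimpse(train)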
Next we can perform some exploratory data analysis. Let's see how the restaurants are distributed across cities.
par(mfrow=c(1,2))
barplot(sort(table(df$`City Group`)), main = "City Groups")
barplot(sort(table(df$Type)), main = "Restaurant Type")
par(mfrow=c(1,1))
par(las = 2)
barplot(sort(table(df$City)), main = "Restaurant Numbers", cex.names = 0.5)
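As an aside, the base-graphics barplots above do not pick up the ggthemr theme we set earlier; an equivalent ggplot2 version of the first chart would. A minimal sketch:
# ggplot2 version of the city-group counts, which respects the ggthemr('pale') theme
ggplot(df, aes(x = `City Group`)) +
  geom_bar() +
  ggtitle("City Groups")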
Here we can see that most of the restaurants are in big cities, and most are of type FC (food court) or IL (inline). Since we have the city names available, we can create a map that visualises how the revenue is spread geographically.
p_load(leaflet, ggmap)
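# note: recent versions of ggmap (>= 2.7) require a registered Google API key
# before geocode() will work; the key below is just a placeholder
register_google(key = "YOUR_API_KEY")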
cities <- df$City
coordinates <- geocode(df$City)
city_coordinates <- data.frame(cities, coordinates)
leaflet() %>%
  addTiles() %>%
  addCircleMarkers(lng = city_coordinates$lon, lat = city_coordinates$lat,
                   color = "#FF5281", radius = df$revenue / 250000)
On the map you can see that the larger revenues are concentrated in the bigger cities (consistent with the plots above), which makes sense. Finally, let's use the random forest algorithm to inspect variable importance and make our predictions. First we create a function that transforms the data so it is suitable for further analysis. Then we subset our feature set from the train dataframe and apply some transformations to normalize it.
# function to transform the features
add_features <- function(data) {
  # recode the city group as a factor
  data$CityGroup <- as.factor(data[["City Group"]])
  # parse the opening date (month/day/year format)
  data$OpenDate <- mdy(data[["Open Date"]])
  # approximate age of the restaurant in years, measured from 1900
  data$YearsSince1900 <- as.numeric(data$OpenDate - mdy("01/01/1900"),
                                    units = "days") / 365
  return(data)
}
# some more subsets and transformations
# drop the id, date, city and target columns; keep the engineered features
features <- c(names(train)[c(-1, -2, -3, -4, -5, -43)], "CityGroup",
              "YearsSince1900")
train <- add_features(train)
test <- add_features(test)
# scale revenue to millions, and log-transform it for a better-behaved target
train$Revenue <- train$revenue / 1e6
train$logRevenue <- log(train$revenue)
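A quick summary of the engineered columns is a cheap way to confirm the transformations behaved as expected:
# sanity check on the engineered columns
summary(train[, c("CityGroup", "YearsSince1900", "logRevenue")])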
As a final step before the actual modeling, we can use the Boruta algorithm to help us choose the most relevant features.
# feature selection
boruta <- Boruta(train[, features], train$logRevenue, doTrace = 2)
important_features <- features[boruta$finalDecision != "Rejected"]
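It is worth printing and plotting the Boruta object to see which attributes were confirmed, tentative, or rejected:
# inspect the Boruta decisions; the plot shows per-attribute importance distributions
print(boruta)
plot(boruta, cex.axis = 0.5, las = 2, xlab = "")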
We then use those important features to fit a random forest model and make predictions on the test set.
# modeling and prediction
rf <- randomForest(train[, important_features], train$logRevenue,
                   importance = TRUE)
# predictions are on the log scale, so exponentiate back to revenue
prediction <- exp(predict(rf, test[, important_features]))
# sanity checking
print(prediction[1:5])
## 1 2 3 4 5
## 3887654 3887654 3887654 3422101 3887654
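Since we loaded sampleSubmission.csv at the start, we can also write the predictions out in the competition's expected format; a minimal sketch, assuming the sample file has Id and Prediction columns:
# fill the sample submission with our predictions and write it to disk
submission <- sample_sub
submission$Prediction <- prediction
write_csv(submission, "submission.csv")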
Let's get some more information about our model.
print(rf)
##
## Call:
## randomForest(x = train[, important_features], y = train$logRevenue, importance = TRUE)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 0.2279772
## % Var explained: 0.62
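The explained variance is modest. One cheap improvement to try is tuning mtry against the out-of-bag error with randomForest's tuneRF helper; the parameter values below are illustrative, not tuned:
# search for a better mtry around the default using OOB error
tuned <- tuneRF(train[, important_features], train$logRevenue,
                ntreeTry = 500, stepFactor = 1.5, improve = 0.01)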
Finally, we can plot the features with the largest importance, which can help guide future data collection.
# feature importance plot
imp <- importance(rf, type = 1)
featureImportance <- data.frame(Feature = row.names(imp),
                                Importance = imp[, 1])
ggplot(featureImportance, aes(x = reorder(Feature, Importance),
                              y = Importance)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  xlab("") +
  ylab("Importance") +
  ggtitle("Random Forest Feature Importance\n")
In this project we used the available variables to predict restaurant revenue and identified the features that matter most, which can also inform how data is collected in the future.