1. INTRODUCTION

TFI is the company behind some of the world???s most well-known brands like Burger King, Sbarro, Popeyes, Usta Donerci, and Arby???s. They employ around 20,000 people in Europe and Asia and make considerable investments on new restaurant sites.The objective is to decide when and where to open a new restaurant site which is considerable profitable and this process is largely subjective.

The problem and the solution :-

It takes huge investments to open up new restaurant sites. Also when a wrong location for a restaurant brand is chosen, the site closes within 18 months and operating losses are incurred.

Finding a mathematical model to increase the effectiveness of investments in new restaurant sites would allow TFI to invest more in other important business areas, like sustainability, innovation, and training for new employees.

2. DATA FIELDS

. Id : Restaurant id. . Open Date : opening date for a restaurant . City : City that the restaurant is in. Note that there are unicode in the names. . City Group : Type of the city. Big cities, or Other. . Type : Type of the restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile . P1, P2 ??? P37 : There are three categories of these obfuscated data. Demographic data are gathered from third party providers with GIS systems. These include population in any given area, age and gender distribution, development scales. Real estate data mainly relate to the m2 of the location, front facade of the location, car park availability. Commercial data mainly include the existence of points of interest including schools, banks, other QSR operators. . Revenue : The revenue column indicates a (transformed) revenue of the restaurant in a given year and is the target of predictive analysis. Please note that the values are transformed so they don???t mean real dollar values.

3. DATA EXPLORATION

setwd("~/Downloads/Prof Sameer Mathur/Restaurant data")
train<-read.csv("train.csv")
test<-read.csv("test.csv")
str(train)
## 'data.frame':    137 obs. of  43 variables:
##  $ Id        : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Open.Date : Factor w/ 134 levels "01/03/2014","01/07/2000",..: 60 17 21 11 42 16 94 53 78 114 ...
##  $ City      : Factor w/ 34 levels "Adana","Afyonkarahisar",..: 17 4 11 31 15 4 17 17 2 12 ...
##  $ City.Group: Factor w/ 2 levels "Big Cities","Other": 1 1 2 2 2 1 1 1 2 2 ...
##  $ Type      : Factor w/ 3 levels "DT","FC","IL": 3 2 3 3 3 2 3 3 3 3 ...
##  $ P1        : int  4 4 2 6 3 6 2 4 1 6 ...
##  $ P2        : num  5 5 4 4.5 4 6 3 5 1 4.5 ...
##  $ P3        : num  4 4 2 6 3 4.5 4 4 4 6 ...
##  $ P4        : num  4 4 5 6 4 7.5 4 5 4 7.5 ...
##  $ P5        : int  2 1 2 4 2 8 1 2 1 6 ...
##  $ P6        : int  2 2 3 4 2 10 5 3 2 4 ...
##  $ P7        : int  5 5 5 10 5 10 5 5 1 10 ...
##  $ P8        : int  4 5 5 8 5 8 5 4 5 10 ...
##  $ P9        : int  5 5 5 10 5 8 5 4 5 10 ...
##  $ P10       : int  5 5 5 10 5 8 5 4 5 10 ...
##  $ P11       : int  3 1 2 8 2 10 2 4 1 2 ...
##  $ P12       : int  5 5 5 10 5 8 5 3 5 10 ...
##  $ P13       : num  5 5 5 7.5 5 6 5 4 5 7.5 ...
##  $ P14       : int  1 0 0 6 2 0 3 0 1 0 ...
##  $ P15       : int  2 0 0 4 1 0 4 0 1 0 ...
##  $ P16       : int  2 0 0 9 2 0 4 0 2 0 ...
##  $ P17       : int  2 0 0 3 1 0 3 0 1 0 ...
##  $ P18       : int  4 0 0 12 4 0 4 0 4 0 ...
##  $ P19       : int  5 3 1 20 2 5 2 3 1 25 ...
##  $ P20       : int  4 2 1 12 2 6 4 5 1 3 ...
##  $ P21       : int  1 1 1 6 1 3 1 2 1 3 ...
##  $ P22       : int  3 3 1 1 2 1 2 4 1 1 ...
##  $ P23       : int  3 2 1 10 1 5 1 2 1 10 ...
##  $ P24       : int  1 0 0 2 2 0 5 0 4 0 ...
##  $ P25       : int  1 0 0 2 3 0 4 0 4 0 ...
##  $ P26       : num  1 0 0 2.5 3 0 4 0 4 0 ...
##  $ P27       : num  4 0 0 2.5 5 0 5 0 2 0 ...
##  $ P28       : num  2 3 1 2.5 1 7.5 1 3 2 5 ...
##  $ P29       : num  3 3 3 7.5 3 5 3 2 3 2.5 ...
##  $ P30       : int  5 0 0 25 5 0 4 0 4 0 ...
##  $ P31       : int  3 0 0 12 1 0 5 0 5 0 ...
##  $ P32       : int  4 0 0 10 3 0 2 0 5 0 ...
##  $ P33       : int  5 0 0 6 2 0 2 0 3 0 ...
##  $ P34       : int  5 0 0 18 3 0 3 0 4 0 ...
##  $ P35       : int  4 0 0 12 4 0 5 0 5 0 ...
##  $ P36       : int  3 0 0 12 3 0 4 0 4 0 ...
##  $ P37       : int  4 0 0 6 3 0 4 0 5 0 ...
##  $ revenue   : num  5653753 6923131 2055379 2675511 4316715 ...
library(psych)
summary(train)
##        Id           Open.Date         City         City.Group Type   
##  Min.   :  0   01/07/2000:  2   Istanbul:50   Big Cities:78   DT: 1  
##  1st Qu.: 34   02/02/2012:  2   Ankara  :19   Other     :59   FC:76  
##  Median : 68   02/23/2010:  2   Izmir   : 9                   IL:60  
##  Mean   : 68   01/03/2014:  1   Bursa   : 5                          
##  3rd Qu.:102   01/09/2010:  1   Samsun  : 5                          
##  Max.   :136   01/17/2009:  1   Antalya : 4                          
##                (Other)   :128   (Other) :45                          
##        P1               P2              P3              P4       
##  Min.   : 1.000   Min.   :1.000   Min.   :0.000   Min.   :3.000  
##  1st Qu.: 2.000   1st Qu.:4.000   1st Qu.:4.000   1st Qu.:4.000  
##  Median : 3.000   Median :5.000   Median :4.000   Median :4.000  
##  Mean   : 4.015   Mean   :4.409   Mean   :4.318   Mean   :4.372  
##  3rd Qu.: 4.000   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
##  Max.   :12.000   Max.   :7.500   Max.   :7.500   Max.   :7.500  
##                                                                  
##        P5              P6               P7               P8        
##  Min.   :1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:1.000   1st Qu.: 2.000   1st Qu.: 5.000   1st Qu.: 4.000  
##  Median :2.000   Median : 3.000   Median : 5.000   Median : 5.000  
##  Mean   :2.007   Mean   : 3.358   Mean   : 5.423   Mean   : 5.153  
##  3rd Qu.:2.000   3rd Qu.: 4.000   3rd Qu.: 5.000   3rd Qu.: 5.000  
##  Max.   :8.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##                                                                    
##        P9              P10              P11              P12        
##  Min.   : 4.000   Min.   : 4.000   Min.   : 1.000   Min.   : 2.000  
##  1st Qu.: 4.000   1st Qu.: 5.000   1st Qu.: 2.000   1st Qu.: 4.000  
##  Median : 5.000   Median : 5.000   Median : 3.000   Median : 5.000  
##  Mean   : 5.445   Mean   : 5.489   Mean   : 3.263   Mean   : 5.299  
##  3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.000   3rd Qu.: 5.000  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##                                                                     
##       P13            P14              P15              P16        
##  Min.   :3.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:5.00   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median :5.00   Median : 0.000   Median : 0.000   Median : 0.000  
##  Mean   :5.08   Mean   : 1.416   Mean   : 1.387   Mean   : 1.942  
##  3rd Qu.:5.00   3rd Qu.: 2.000   3rd Qu.: 2.000   3rd Qu.: 3.000  
##  Max.   :7.50   Max.   :15.000   Max.   :10.000   Max.   :15.000  
##                                                                   
##       P17              P18              P19              P20        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 2.000   1st Qu.: 2.000  
##  Median : 0.000   Median : 0.000   Median : 3.000   Median : 4.000  
##  Mean   : 1.036   Mean   : 1.942   Mean   : 4.905   Mean   : 4.547  
##  3rd Qu.: 1.000   3rd Qu.: 4.000   3rd Qu.: 5.000   3rd Qu.: 5.000  
##  Max.   :15.000   Max.   :12.000   Max.   :25.000   Max.   :15.000  
##                                                                     
##       P21             P22             P23              P24        
##  Min.   : 1.00   Min.   :1.000   Min.   : 1.000   Min.   : 0.000  
##  1st Qu.: 1.00   1st Qu.:1.000   1st Qu.: 1.000   1st Qu.: 0.000  
##  Median : 1.00   Median :2.000   Median : 2.000   Median : 0.000  
##  Mean   : 2.27   Mean   :2.226   Mean   : 3.423   Mean   : 1.372  
##  3rd Qu.: 3.00   3rd Qu.:3.000   3rd Qu.: 5.000   3rd Qu.: 2.000  
##  Max.   :15.00   Max.   :5.000   Max.   :25.000   Max.   :10.000  
##                                                                   
##       P25              P26              P27              P28        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 1.000  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 2.000  
##  Median : 0.000   Median : 0.000   Median : 0.000   Median : 2.500  
##  Mean   : 1.212   Mean   : 1.471   Mean   : 1.146   Mean   : 3.223  
##  3rd Qu.: 2.000   3rd Qu.: 2.500   3rd Qu.: 2.000   3rd Qu.: 4.000  
##  Max.   :10.000   Max.   :12.500   Max.   :12.500   Max.   :12.500  
##                                                                     
##       P29             P30             P31              P32        
##  Min.   :0.000   Min.   : 0.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:2.500   1st Qu.: 0.00   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median :3.000   Median : 0.00   Median : 0.000   Median : 0.000  
##  Mean   :3.135   Mean   : 2.73   Mean   : 1.942   Mean   : 2.526  
##  3rd Qu.:3.000   3rd Qu.: 4.00   3rd Qu.: 3.000   3rd Qu.: 3.000  
##  Max.   :7.500   Max.   :25.00   Max.   :15.000   Max.   :25.000  
##                                                                   
##       P33             P34              P35              P36        
##  Min.   :0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.000   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median :0.000   Median : 0.000   Median : 0.000   Median : 0.000  
##  Mean   :1.139   Mean   : 2.489   Mean   : 2.029   Mean   : 2.212  
##  3rd Qu.:2.000   3rd Qu.: 3.000   3rd Qu.: 4.000   3rd Qu.: 3.000  
##  Max.   :6.000   Max.   :24.000   Max.   :15.000   Max.   :20.000  
##                                                                    
##       P37           revenue        
##  Min.   :0.000   Min.   : 1149870  
##  1st Qu.:0.000   1st Qu.: 2999068  
##  Median :0.000   Median : 3939804  
##  Mean   :1.117   Mean   : 4453533  
##  3rd Qu.:2.000   3rd Qu.: 5166635  
##  Max.   :8.000   Max.   :19696939  
## 
attach(train)
dim(train)
## [1] 137  43
sum(!complete.cases(train))
## [1] 0

Obeservations:-

The train dataset has ony 137 observations which is pretty small when it comes to building a prediction model so there is no need to split it further into training and validation datasets.

There are no NA values in the train dataset.

In order to check the revenue pattern in the dataset we need to plot histograms. Also the variables P1 to P37 need to be examined for “0” values.

4. DATA VISUALISATION

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(reshape)
qplot(x = revenue, y = City, data = train, color = City.Group, size = Type)
## Warning: Using size for a discrete variable is not advised.

train_plot <- melt(train[,-c(1:5)])
## Using  as id variables
ggplot(train_plot,aes(x = value)) + 
  facet_wrap(~variable,scales = "free_x") + 
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

hist(train$revenue)

ggplot(train, aes(x=Type, y=revenue)) +
        geom_point(shape=1) +
        geom_smooth()
## `geom_smooth()` using method = 'loess'

plot(revenue~., data = train)

Obeservations:-

We can see that there are only three big cities in this dataset namelly Istanbul, Izmir and Ankara, and the rest fall under the “other” category. Also the food courts in Istanbul seem to have the maximum revenue as shown in the scatterplot.

The histogram also has some interesting insights. The variables P1 to P37 have a lot of 0 values as shown from P14 to P18 and then P24 onwards upto P37. The revenue also is right skewed.

Food court seems to be the most popular type of restaurant followed by Inline and then just the one drive through.

Tabular arrangement:-

library(plyr)
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:reshape':
## 
##     rename, round_any
x <- train[, c(3, 43)]
y <- aggregate(x$revenue, list(City=x$City), mean)
colnames(y)[2] <- "Average Revenue"
z <- count(x, 'City')
x <- merge(y, z, by = "City")
x<-x[,-3]
x

4. Correlation:-

library(corrgram)
## 
## Attaching package: 'corrgram'
## The following object is masked from 'package:plyr':
## 
##     baseball
corrgram(train[,-c(1,2,3)], order=TRUE, lower.panel=panel.shade,upper.panel=panel.cor,text.panel=panel.txt,main="TFI revenue prediction")

Observations:-

The predictor variables P1 to P37 are definitely postitively correlated. Some of the predictor variables positively affect the revenue whereas some affect it negetively as shown in the corrgram.

5. MODELLING

Since we want to predict the revenue that is to be generated in the upcoming restaurant sites, we will build a linear model based on city group, restaurant type and some significant TFI indicators.

Our hypothesis that a certain variable does not affect the revenue can be tested by the one sample t-test.

t <- apply(train[,-c(1:5)], 2, t.test) 
fit <- lm(revenue ~ City.Group + Type + P1 + P2 + P7 , data = train)
summary(fit)
## 
## Call:
## lm(formula = revenue ~ City.Group + Type + P1 + P2 + P7, data = train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -4204859 -1515338  -260522   961954 14055457 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)  
## (Intercept)      2349141    2605036   0.902   0.3688  
## City.GroupOther  -894084     493568  -1.811   0.0724 .
## TypeFC           1366421    2571129   0.531   0.5960  
## TypeIL            744976    2582728   0.288   0.7735  
## P1               -143377     169313  -0.847   0.3987  
## P2                566516     288910   1.961   0.0520 .
## P7                -95306     187350  -0.509   0.6118  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2497000 on 130 degrees of freedom
## Multiple R-squared:  0.1016, Adjusted R-squared:  0.06016 
## F-statistic: 2.451 on 6 and 130 DF,  p-value: 0.02808

CONCLUSION

Based on the t-tests we can say that the all the predictor values have some effect on the revenue.

The model that is built on the above factors shows that the p value is less than 0.05 which implies that it has a lowere probability of failing.

This model can now be used on the test dataset to predict the revenue and a decision tree model would be the recommended procedure.