1 Part One: Research Problem and Sub-questions
2 Part Two: Demographic Analysis of the ‘Big Families’ Segment
3 Part Three: Logistic Regression on Selected Attributes
- 3.1 Purpose of Applying Regression
- 3.2 Methodology of Logistic Regression
4 Part Four: Conclusion and Recommendations
- 4.1 Findings
- 4.2 Future Improvements

Overview
This report is the final output of our analysis of a customer survey data set that contains 28,811 observations and 150 variables. In Part One, we define our main research question and related sub-questions. In Part Two, we perform a demographic analysis of the ‘Big Families’ group to explore this segment in more depth. In Part Three, we perform a regression analysis of selected attributes that matter to customers when visiting a fast food restaurant. Finally, in Part Four, we provide conclusions and recommendations for future improvement.

# Loading required packages
library(rms)
library(ggplot2)
library(tidyr)
library(dplyr)
library(ggpubr)
library(knitr)
library(rmarkdown)

# Loading the data set 
food <- read.csv("FastFood.csv")

1 Part One: Research Problem and Sub-questions

Main research question: What marketing strategy should restaurants implement to attract ‘big families’ with lower household income?

1.1 Graph 1 - Basic Analysis of Spending and Visits

Question 1: Which target audience should we focus on?

To answer Question 1, we first converted the variables q2, q3 and q1514 into factors and ordered their levels. We then converted the observations of q2 and q3 into numbers and created a table called table_g1 with 7 observations and 3 variables, grouped by q1514 and summarized by the means of q2 and q3, with missing values excluded. Next, we used the ggplot function to build a positioning map called G1 - Visits per Month x Dollars Spent per Visit. The data for this graph is table_g1, the x axis is q2 (Visits per Month), the y axis is q3 (Dollars Spent per Visit) and the color is q1514 (People in Household). Based on this graph, we chose households of four, five and six people as our target audience, because this group spends more money per visit but does not visit fast food restaurants frequently. We call this group Big Families.
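
As an illustration of the factor-to-number conversion described above, a minimal sketch on a toy vector follows. The level labels are hypothetical placeholders, not the survey's actual answer options.

```r
# Toy answers; these labels are hypothetical, not the survey's real levels
visits <- c("Rarely", "Often", "Sometimes", "Often", NA)

# Declare the intended order explicitly so the numeric codes follow it
visits_f <- factor(visits,
                   levels = c("Rarely", "Sometimes", "Often"),
                   ordered = TRUE)

# as.numeric() on a factor returns the position of each level (1, 2, 3, ...)
visits_n <- as.numeric(visits_f)
visits_n

# Means over these codes must skip NA, as the report does with na.rm = TRUE
mean(visits_n, na.rm = TRUE)
```

Because the resulting numbers are level positions rather than real counts or dollar amounts, means computed from them are best read as positions on the answer scale.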

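table_g1 <- food %>%
  group_by(q1514) %>%
  # Mean of q2 and q3 per household size, ignoring missing values
  # (funs() is deprecated in dplyr, so mean and na.rm are passed directly)
  summarise_at(vars(q2, q3), mean, na.rm = TRUE)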

table_g1

G1 <- ggplot(table_g1, aes(x = q2, y = q3, colour = q1514)) + theme_bw() + geom_point(size = 4) +
  geom_hline(yintercept = 4, color = "red") + geom_vline(xintercept = 5.5, color = "red") +
  labs(title = "G1 - Visits per Month x Dollars Spent per Visit", x = "Visits per Month", y = "Dollars Spent per Visit",
       colour = "People in Household")
G1

1.2 Graph 2 - Big Families Spending on Fast Food

Question 2: How much money does this group spend per visit?

To answer Question 2, we first filtered the variable q1514 to keep the levels Four people, Five people and Six people, and excluded the missing values of the variable q8. We named the resulting data set, with 6459 observations and 150 variables, table_g2. We then used the ggplot function to build a bar graph called G2 - Big Families: Money Spending. The data for this graph is table_g2, the x axis is q3 (Dollars Spent per Visit) and the y axis is the percentage of respondents at each level of q3. Because we converted the levels of q3 into numbers when answering Question 1, levels 6 and 7 correspond to the respondents who spend more than USD 10.01 per visit. Therefore, to increase the profits of fast food restaurants, we decided to focus on the Big Families that spend more than USD 10.01 per visit.

# Keep households of four, five or six people
big_families <- food %>%
  filter(q1514 %in% c("Four people", "Five people", "Six people"))

# Drop respondents whose q8 is "0"
table_g2 <- filter(big_families, q8 != "0")

G2 <- ggplot(data = table_g2, aes(x = q3)) +
  # after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
  geom_bar(aes(y = after_stat(count / sum(count))),
           position = "dodge", fill = "#FF6666") +
  labs(title="G2 - Big Families: Money Spending",x="Dollars Spent per Visit", y = "Percentage of Respondents")+
  scale_x_discrete(limits=c("1", "2", "3", "4", "5", "6", "7"))+
  theme_bw()
G2
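
The bar heights in G2 are each level's count divided by the total count. A minimal sketch of the same calculation as a table, using made-up spending codes rather than survey data:

```r
# Toy spending codes (1-7); hypothetical values, not survey data
q3_toy <- c(6, 7, 6, 3, 5, 6, 7, 2)

# Share of respondents at each level: the same count / sum(count) the bars show
share <- prop.table(table(q3_toy))
round(share, 3)

# Combined share of levels 6 and 7 (the "more than USD 10.01" group)
high_share <- sum(share[c("6", "7")])
high_share
```

A table like this makes it easy to quote an exact percentage for the high-spending levels instead of reading it off the chart.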

1.3 Graph 3 - Big Families’ Monthly Visits to Fast Food Restaurants

Question 3: How often do Big Families who spend more than USD 10.00 per visit go to fast food restaurants?

To answer Question 3, we first filtered the variable q3 to keep the levels greater than or equal to six, again excluding the missing values of the variable q8, and named the resulting data set, with 2971 observations and 150 variables, table_g3. We then used the ggplot function to build a bar graph called G3 - Big Families: Visits per Month. The data for this graph is table_g3, the x axis is q2 (Visits per Month) and the y axis is the percentage of respondents at each level of q2. After analyzing this graph, we presumed that household income influences the number of visits per month, which led to our next sub-question.

# Big Families with a non-"0" q8 and spending level 6 or above
table_g3 <- filter(big_families, q8 != "0", q3 >= 6)

G3 <- ggplot(data = table_g3, aes(x = q2)) +
  geom_bar(aes(y = after_stat(count / sum(count))),
           position = "dodge", fill = "#FF6666") +
  labs(title="G3 - Big Families: Visits per Month",x="Visits per Month", y = "Percentage of Respondents")+
  scale_x_discrete(limits=c("1", "2", "3", "4", "5", "6", "7"))+
  theme_bw()
G3

1.4 Graph 4 - Basic Analysis of Visits and Household Income

Question 4: Does household income influence the frequency of visits for Big Families?

To answer Question 4, we first converted the variable q1516 into a factor and ordered its levels. We then converted these observations into numbers and created a table called table_g4 with 7 observations and 2 variables, grouped by q2 and summarized by the mean of q1516, with missing values excluded. Next, we used the ggplot function to build a positioning map called G4 - Big Families: Visits per Month x Household Income. The data for this chart is table_g4, the x axis is q2 (Visits per Month) and the y axis is q1516 (Household Income). After analyzing this graph, we concluded that household income influences the number of visits per month. We therefore presumed that price is important for people with lower household income.

table_g4 <- table_g3 %>%
  group_by(q2) %>%
  # Mean household income code per visit level, ignoring missing values
  summarise_at(vars(q1516), mean, na.rm = TRUE)

G4 <- ggplot(table_g4, aes(x=q2, y=q1516)) + theme_bw() + geom_point(size=4, color ="#FF6666") +
  geom_hline(yintercept = 10, color="black") + geom_vline(xintercept = 3.5, color="black") +
  labs(title="G4 - Big Families: Visits per Month x Household Income",x="Visits per Month", y = "Household Income")
G4

1.5 Graph 5 - Big Families Opinions on Price Importance

Question 5: Is price important for Big Families that visit less than 2 times per month?

To answer Question 5, we first filtered the variable q2 to keep the levels less than or equal to two and named the resulting data set, with 528 observations and 150 variables, table_g4.1. We then noticed that the remaining observations had only two levels in the variable q73, Agree and Strongly Disagree, so we converted the variable into a factor and renamed the levels to Yes and No. Yes means that price is important for the respondent and No means that it is not. Next, we used the ggplot function to build a bar graph called G5 - Big Families: Price. The data for this chart is table_g4.1, the x axis is q73 (Price Importance) and the y axis is the percentage of respondents at each level of q73. After analyzing this graph, we concluded that price is important for some respondents and not so important for others. We therefore continued our analysis with these two groups.

table_g4.1 <- table_g3 %>%
  filter(q2 <= 2)

table_g4.1$q73 <- factor(table_g4.1$q73)

# Recode q73 to No/Yes. As written, "Agree" becomes No and
# "Strongly Disagree" becomes Yes.
levels(table_g4.1$q73) <- list(No = "Agree",
                               Yes = "Strongly Disagree")

G5 <- ggplot(data = table_g4.1, aes(x = q73)) +
  geom_bar(aes(y = after_stat(count / sum(count))),
           position = "dodge", fill = "#FF6666") +
  labs(title="G5 - Big Families: Price",x="Price Importance", y = "Percentage of Respondents")+
  theme_bw()
G5
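
The relabelling step relies on the list form of `levels<-`, where it is easy to swap the two sides by accident. A minimal sketch on a toy factor follows. The direction used here (Agree becomes Yes) follows the prose and is an assumption about how q73 is worded, so it should be checked against the questionnaire.

```r
# Toy factor with the two answer labels named in the text
q73_toy <- factor(c("Agree", "Strongly Disagree", "Agree"))

# In levels(x) <- list(new = "old"), the list NAMES become the new labels
# and the list VALUES are the existing labels they replace
recoded <- q73_toy
levels(recoded) <- list(Yes = "Agree", No = "Strongly Disagree")

# Cross-check old labels against new labels
table(original = q73_toy, recoded = recoded)
```

The cross-tabulation at the end is a cheap way to confirm that every original answer landed under the intended new label.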

2 Part Two: Demographic Analysis of the ‘Big Families’ Segment

In this part, we explore the demographic characteristics of our target group and how each of them relates to price importance for members of this group.

2.1 Graph 6 - Occupation and Price Importance

To find out the occupation of each group, we first copied the data set to table_g5 and then used the ggplot function to build a bar graph called G6 - Big Families: Occupation. The data for this chart is table_g5, the x axis is q1517 (Occupation), the y axis is the percentage of respondents at each level of q1517, and the legend is q73 (Price Importance). After analyzing this graph, we can conclude that more than 60% of the respondents who consider price an important factor are homemakers, whereas less than 40% of the respondents who do not consider price an important factor work in sales.

table_g5 <- table_g4.1

G6 <- ggplot(data = table_g5, aes(x = q1517)) +
  geom_bar(aes(y = after_stat(count / sum(count)), fill = q73),
           position = "dodge") +
  labs(title="G6 - Big Families: Occupation",x="Occupation", y = "Percentage of Respondents",
       fill = "Price Importance")+
  theme_bw()
G6
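
The bars in G6 show each cell as a share of all respondents in the table, while statements such as "more than 60% of the respondents who consider price important" refer to shares within one price-importance group. A minimal sketch of the difference, using made-up occupations and answers rather than survey results:

```r
# Toy data; occupations and answers are hypothetical, not survey results
toy <- data.frame(
  q73   = c("Yes", "Yes", "Yes", "No", "No"),
  q1517 = c("Homemaker", "Homemaker", "Sales", "Sales", "Sales")
)

counts <- table(toy$q73, toy$q1517)

# Share of ALL respondents in each cell (what count / sum(count) plots)
overall <- prop.table(counts)

# Share WITHIN each price-importance group (each row sums to 1)
within_group <- prop.table(counts, margin = 1)
round(within_group, 2)
```

In this toy example, homemakers are 40% of all respondents but about 67% of the Yes group, so the two readings can differ noticeably.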

2.2 Graph 7 - Marital Status and Price Importance

To find out the marital status of each group, we used the ggplot function to build a bar graph called G7 - Big Families: Marital Status. The data for this graph is table_g5, the x axis is q1515 (Marital Status), the y axis is the percentage of respondents at each level of q1515, and the legend is q73 (Price Importance). After analyzing this graph, we can conclude that more than 60% of the respondents who consider price an important factor are married, whereas less than 40% of the respondents who do not consider price an important factor are widowed.

G7 <- ggplot(data = table_g5, aes(x = q1515)) +
  geom_bar(aes(y = after_stat(count / sum(count)), fill = q73),
           position = "dodge") +
  labs(title="G7 - Big Families: Marital Status",x="Marital Status", y = "Percentage of Respondents",
       fill = "Price Importance")+
  theme_bw()
G7

2.3 Graph 8 - Gender and Price Importance

In order to find out the gender of each group, we used the ggplot function to build a bar graph called G8 - Big Families: Gender. The data used to build this graph is table_g5, the x axis is q1520 (Gender), the y axis is the percentage of respondents of each factor of q1520, and the legend is q73 (Price Importance). After analyzing this graph, we can conclude that more than 60% of the respondents that consider price as an important factor are females, whereas less than 40% of the respondents that do not consider price as an important factor are males.

G8 <- ggplot(data = table_g5, aes(x = q1520)) +
  geom_bar(aes(y = after_stat(count / sum(count)), fill = q73),
           position = "dodge") +
  labs(title="G8 - Big Families: Gender",x="Gender", y = "Percentage of Respondents",
       fill = "Price Importance")+
  theme_bw()
G8

2.4 Graph 9 - Race Background and Price Importance

In order to find out the race background of each group, we used the ggplot function to build a bar graph called G9 - Big Families: Race Background. The data of this chart is table_g5, the x axis is q1519 (Race Background), the y axis is the percentage of respondents of each factor of q1519, and the legend is q73 (Price Importance). After analyzing this graph, we can conclude that the race background of both groups is white.

G9 <- ggplot(data = table_g5, aes(x = q1519)) +
  geom_bar(aes(y = after_stat(count / sum(count)), fill = q73),
           position = "dodge") +
  labs(title="G9 - Big Families: Race Background",x="Race Background", y = "Percentage of Respondents",
       fill = "Price Importance")+
  theme_bw()
G9

3 Part Three: Logistic Regression on Selected Attributes

3.1 Purpose of Applying Regression

We apply logistic regression to understand how certain attributes that matter to customers (the independent variables), such as speed of service, convenient location, consistent food quality, menu variety, friendly employees-staff and large size portions, relate to the importance of food price (the dependent variable).

3.2 Methodology of Logistic Regression

These independent variables have seven levels, so we ordered the levels and converted them to numbers. We then selected from table_g5 the variable q73 (Price Importance) and the variables q5b (Speed of Service) to q5g (Large Size Portions), and named this new data set, with 528 observations and 8 variables, G6.1_df.

However, when we inspected the structure of this new data set, we found that the respondents’ answers covered only two or three levels per variable. We therefore created a binary table called G6.2_df, in which 1 corresponds to levels above four and 0 corresponds to levels less than or equal to four.
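
A minimal sketch of this binarization on toy ratings (the values are hypothetical, not survey data):

```r
# Toy importance ratings on the 1-7 scale; hypothetical values
ratings <- data.frame(q5d = c(2, 4, 5, 7), q5g = c(1, 6, 4, 3))

# The comparison yields TRUE/FALSE; multiplying by 1 turns it into 1/0
binary <- 1 * (ratings > 4)
binary
```

Note that a rating of exactly 4 falls into the 0 group, matching the "less than or equal to four" rule.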

Before using the lrm function to fit and interpret the logistic regression model, we inspected the structure of the data set G6.2_df and decided to exclude the variables q5e (Menu Variety) and q5f (Friendly Employees-Staff), because each of them takes a single constant value.
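
Constant columns can also be detected programmatically rather than by reading cross-tabulations. A minimal sketch on a toy table, where the values are hypothetical and only the column names mirror the report:

```r
# Toy binary table; values are hypothetical
toy <- data.frame(q5d = c(0, 1, 1, 0), q5e = c(0, 0, 0, 0), q5f = c(0, 0, 0, 0))

# A column with a single distinct value carries no information for a regression
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))
is_constant

# Keep only the columns that vary
usable <- toy[, !is_constant, drop = FALSE]
names(usable)
```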

For a good model, we must check the assumptions of linearity and additivity and measure the predictive accuracy. The two assumptions are checked with the likelihood ratio test and the Wald test, while predictive accuracy is estimated with Nagelkerke’s R², the Brier score, the ROC curve (C index) and Somers’ D_xy. Based on the values generated in RStudio, the variables Speed of Service and Convenient Location have Nagelkerke’s R² = 1 and C = 1. Values this extreme make us think the analysis might not be accurate, because an exact prediction should not be possible. For this reason we excluded these two variables; the statistics of the remaining variables (Consistent Food Quality and Large Size Portions) are more meaningful to the analysis. We therefore have only two variables in our final analysis.

# Ordering the levels and converting them into numbers
importance_levels <- c("Least Important", "Very Unimportant", "Not Important",
                       "Little important", "Important", "Very Important",
                       "Most Important")

# Apply the same ordered-factor conversion to q5a through q5g
for (v in c("q5a", "q5b", "q5c", "q5d", "q5e", "q5f", "q5g")) {
  table_g5[[v]] <- as.numeric(factor(table_g5[[v]],
                                     levels = importance_levels,
                                     ordered = TRUE))
}

# Selecting the variables and converting them into binary operators

G6.1_df <- table_g5 %>% select(q73, q5b:q5g)

G6.2_df <- data.frame(q73 = G6.1_df[, 1], 1*(G6.1_df[, -1] > 4))

levels(G6.2_df$q73) <- c(0,1)

# Analysing the variable q5e
table(G6.2_df$q73, G6.2_df$q5e)

##    
##       0
##   0 176
##   1 352

# Analysing the variable q5f
table(G6.2_df$q73, G6.2_df$q5f)

##    
##       0
##   0 176
##   1 352

# Logistic Regression with the variable q5b
lrm(q73 ~ q5b, data=G6.2_df)

## Logistic Regression Model
##  
##  lrm(formula = q73 ~ q5b, data = G6.2_df)
##  
##                         Model Likelihood     Discrimination     Rank Discrim.    
##                            Ratio Test            Indexes           Indexes       
##  Obs            528    LR chi2     672.15    R2        1.000    C       1.000    
##   0             176    d.f.             1    g        11.052    Dxy     1.000    
##   1             352    Pr(> chi2) <0.0001    gr    63074.154    gamma   1.000    
##  max |deriv| 0.0008                          gp        0.445    tau-a   0.445    
##                                              Brier     0.000                     
##  
##            Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept  12.3619 25.7681  0.48  0.6314  
##  q5b       -24.8201 46.1104 -0.54  0.5904  
##

# Logistic Regression with the variable q5c
lrm(q73 ~ q5c, data=G6.2_df)

## Logistic Regression Model
##  
##  lrm(formula = q73 ~ q5c, data = G6.2_df)
##  
##                         Model Likelihood     Discrimination     Rank Discrim.    
##                            Ratio Test            Indexes           Indexes       
##  Obs            528    LR chi2     672.15    R2        1.000    C       1.000    
##   0             176    d.f.             1    g        11.052    Dxy     1.000    
##   1             352    Pr(> chi2) <0.0001    gr    63074.154    gamma   1.000    
##  max |deriv| 0.0008                          gp        0.445    tau-a   0.445    
##                                              Brier     0.000                     
##  
##            Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept  12.3619 25.7681  0.48  0.6314  
##  q5c       -24.8201 46.1104 -0.54  0.5904  
##

# Logistic Regression with the variable q5d
lrm_d <- lrm(q73 ~ q5d, data = G6.2_df)
lrm_d

## Logistic Regression Model
##  
##  lrm(formula = q73 ~ q5d, data = G6.2_df)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           528    LR chi2     184.18    R2       0.409    C       0.750    
##   0            176    d.f.             1    g        5.059    Dxy     0.500    
##   1            352    Pr(> chi2) <0.0001    gr     157.482    gamma   1.000    
##  max |deriv| 0.002                          gp       0.223    tau-a   0.223    
##                                             Brier    0.167                     
##  
##            Coef     S.E.    Wald Z Pr(>|Z|)
##  Intercept  11.3619 22.1030  0.51  0.6072  
##  q5d       -11.3619 22.1032 -0.51  0.6072  
##

# Logistic Regression with the variable q5g
lrm_g <- lrm(q73 ~ q5g, data = G6.2_df)
lrm_g

## Logistic Regression Model
##  
##  lrm(formula = q73 ~ q5g, data = G6.2_df)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           528    LR chi2     184.18    R2       0.409    C       0.750    
##   0            176    d.f.             1    g        5.059    Dxy     0.500    
##   1            352    Pr(> chi2) <0.0001    gr     157.482    gamma   1.000    
##  max |deriv| 0.002                          gp       0.223    tau-a   0.223    
##                                             Brier    0.167                     
##  
##            Coef    S.E.    Wald Z Pr(>|Z|)
##  Intercept  0.0000  0.1066 0.00   1.0000  
##  q5g       11.3619 22.1032 0.51   0.6072  
##

# Intercept for q5d
intercept_d <- lrm_d$coefficients[1]
exp(intercept_d)/(1+exp(intercept_d))

## Intercept 
## 0.9999884

# Intercept for q5g
intercept_g <- lrm_g$coefficients[1]
exp(intercept_g)/(1+exp(intercept_g))

## Intercept 
##       0.5
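
The expression exp(b)/(1+exp(b)) used above is the inverse logit, which base R provides as `plogis()`. A minimal sketch using the intercepts printed in the lrm output; the variable names are ours:

```r
# Intercepts as printed in the lrm output for q5d and q5g
b_d <- 11.3619
b_g <- 0

# plogis() is the inverse logit, equivalent to exp(b) / (1 + exp(b))
p_d <- plogis(b_d)
p_g <- plogis(b_g)

c(q5d = p_d, q5g = p_g)
```

Each value is the model's predicted probability of q73 = 1 when the predictor equals 0.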

# Likelihood Ratio Test for q5d 
glm(q73 ~ q5d, data = G6.2_df, family = binomial)

## 
## Call:  glm(formula = q73 ~ q5d, family = binomial, data = G6.2_df)
## 
## Coefficients:
## (Intercept)          q5d  
##       19.57       -19.57  
## 
## Degrees of Freedom: 527 Total (i.e. Null);  526 Residual
## Null Deviance:       672.2 
## Residual Deviance: 488   AIC: 492

# Likelihood Ratio Test for q5g 
glm(q73 ~ q5g, data = G6.2_df, family = binomial)

## 
## Call:  glm(formula = q73 ~ q5g, family = binomial, data = G6.2_df)
## 
## Coefficients:
## (Intercept)          q5g  
##   1.242e-18    1.957e+01  
## 
## Degrees of Freedom: 527 Total (i.e. Null);  526 Residual
## Null Deviance:       672.2 
## Residual Deviance: 488   AIC: 492
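
The glm output does not print the likelihood ratio statistic itself, but it follows from the two deviances. A minimal sketch of the arithmetic, using the rounded values printed above:

```r
# Deviances as printed in the glm output
null_deviance     <- 672.2
residual_deviance <- 488

# Likelihood ratio statistic: drop in deviance from adding one predictor
lr_chi2 <- null_deviance - residual_deviance
lr_chi2

# Upper-tail chi-squared probability with 1 degree of freedom
p_value <- pchisq(lr_chi2, df = 1, lower.tail = FALSE)
p_value
```

The difference of about 184.2 agrees, up to rounding, with the LR chi2 of 184.18 reported by lrm for both q5d and q5g.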

Among the respondents who said Yes to price importance, 50% agreed that they care about consistent food quality and 50% disagreed. With this 50:50 split, we cannot tell whether this attribute matters to this category of respondents. Meanwhile, all of the respondents who said No to price importance agreed that consistent food quality is important.

Among the respondents who said Yes to price importance, 50% agreed that they care about large size portions and 50% disagreed. This is again a 50:50 split, so we cannot tell whether this attribute matters to this category of respondents. However, all of the respondents who said No to price importance disagreed that large size portions are important.

# Cross-Tabulation Tabs for q5d >> Y = 0 (No), Y = 1 (Yes) Vs.  X = 0 (Disagree), X = 1 (Agree)
table(G6.2_df$q73, G6.2_df$q5d)

##    
##       0   1
##   0   0 176
##   1 176 176

# Cross-Tabulation Tabs for q5g >> Y = 0 (No), Y = 1 (Yes) Vs.  X = 0 (Disagree), X = 1 (Agree)
table(G6.2_df$q73, G6.2_df$q5g)

##    
##       0   1
##   0 176   0
##   1 176 176
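
As a by-hand check of the discrimination indexes reported for q5d, the data can be rebuilt from the q5d cross-tabulation counts above. This is a sketch under the assumption that the fitted probability for each level of q5d is approximately the observed share of q73 = 1 at that level; it is not how rms computes the indexes internally.

```r
# Rebuild the q5d data from the cross-tabulation counts
y <- c(rep(0, 176), rep(1, 176), rep(1, 176))  # q73
x <- c(rep(1, 176), rep(0, 176), rep(1, 176))  # q5d

# With one binary predictor, the fitted probability is approximately
# the share of y = 1 within each level of x
p <- ave(y, x)

# Brier score: mean squared gap between fitted probability and outcome
brier <- mean((p - y)^2)

# C index: chance that a random y = 1 case gets a higher fitted
# probability than a random y = 0 case, counting ties as one half
cmp <- outer(p[y == 1], p[y == 0], "-")
c_index <- mean((cmp > 0) + 0.5 * (cmp == 0))

# Somers' Dxy is a rescaling of C
dxy <- 2 * (c_index - 0.5)

round(c(Brier = brier, C = c_index, Dxy = dxy), 3)
```

The results (Brier of about 0.167, C = 0.75, Dxy = 0.5) agree with the lrm output for q5d, which suggests the reported indexes follow directly from the 176/176/176 cell counts.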

Taken together, the respondents who answered No to price importance care about consistent food quality and do not care about large size portions. For the respondents who agreed that price is important, we cannot tell which attribute they prefer, because the number of respondents who agreed and disagreed is the same for each attribute.

4 Part Four: Conclusion and Recommendations

4.1 Findings

Based on our previous analysis and graphs, we found the following insights:

- There is a significantly large group of families with 4-6 members who tend to spend more dollars per visit, yet they do not visit fast food restaurants very often (5 times a month). We recommend that fast food restaurants’ marketing and advertising campaigns focus more on this group.
- On the other hand, there is a group of larger families (8 or more members) who visit fast food restaurants more often (7 times a month) but tend to spend much less ($6 per visit). Restaurants should seek a deeper understanding of the buying habits of this group.
- Most ‘Big Families’ (4-6 members) spend more than $10.01 per visit, which makes them a priority group for any fast food restaurant.
- Although a large proportion of Big Families visit fast food restaurants 7 times a month, the rest of the group makes fewer visits. It is worth investigating how to increase the visits of those remaining families.
- Within the Big Families group, there is a sub-group with high household income that makes fewer monthly visits to fast food restaurants. Further research is needed to understand this sub-group and move it to the upper right corner of Graph 4 (high income, more visits).
- Fast food restaurants should be careful with their pricing strategies because most Big Family members think price is important. Discounts and promotions may attract this group even more.
- More specifically, those who think price is important are mainly families with married white females.

4.2 Future Improvements

- In the future, we could further improve our analysis by focusing on one fast food restaurant to improve its marketing performance and sales figures, instead of providing a general strategy for all fast food restaurants.
- To better visualize our data, we would invest more time in using visualization types that fit each type of analysis, instead of relying on bar graphs and positioning maps.

Analysing Fast Food Restaurants Customers’ Survey

Lucas Pereira Szupszynski, Masar Ali, Boralda Sheme

June 1, 2018