In this notebook, we analyze online food ordering data from a publicly available Kaggle dataset. Our objective is to explore the factors recorded in the dataset and understand how they relate to online ordering behavior. Through this analysis, we aim to uncover insights into online food consumption and the relationships between demographic, socioeconomic, and behavioral variables.
library(tidyverse)
── Attaching core tidyverse packages ─────────────────────────────────────────────────────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(stringr)
library(gridExtra)
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
Let’s take a look at the dataset we are going to work with.
onlinefood_data <- read_csv("/Users/jennnguyen/Documents/Datasets/onlinefoods.csv")
Rows: 388 Columns: 12
── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Gender, Marital Status, Occupation, Monthly Income, Educational Qualifications, Output, Feedback
dbl (5): Age, Family size, Latitude, Longitude, Pin code
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(onlinefood_data)
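Now that we have a clear understanding of the dataset and its variables, the next step is to clean the data. Before diving into analysis, we need to address several key tasks: verifying the column data types, trimming stray whitespace from the text columns, checking for missing values, and checking for duplicate rows.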
By completing these tasks, we’ll ensure that our dataset is properly cleaned and ready for analysis.
sapply(onlinefood_data, class)
Age Gender Marital Status Occupation
"numeric" "character" "character" "character"
Monthly Income Educational Qualifications Family size Latitude
"character" "character" "numeric" "numeric"
Longitude Pin code Output Feedback
"numeric" "numeric" "character" "character"
All data types appear correct. However, I intend to create a new column for Monthly Income to quantify it for potential future use. To do this, I’ll check to see how the incomes are categorized and assign them numerical ranks.
print(unique(onlinefood_data$`Monthly Income`))
[1] "No Income" "Below Rs.10000" "More than 50000" "10001 to 25000" "25001 to 50000"
There are five ranges by the looks of it. Now, I will assign each range a number 0-4:
adjust_income <- function(income_category) { # function to assign the ranks to each category
if (income_category == "No Income") {
return(0)
} else if (income_category == "Below Rs.10000") {
return(1)
} else if (income_category == "10001 to 25000") {
return(2)
} else if (income_category == "25001 to 50000") {
return(3)
} else if (income_category == "More than 50000") {
return(4)
} else {
return(NA) # handle unknown categories
}
}
onlinefood_data <- onlinefood_data %>% # create new column
mutate(`Adjusted Monthly Income` = sapply(`Monthly Income`, adjust_income))
head(onlinefood_data$`Adjusted Monthly Income`)
No Income Below Rs.10000 Below Rs.10000 No Income Below Rs.10000 More than 50000
0 1 1 0 1 4
This quantified Monthly Income column provides an ordered, numeric representation of the income categories for potential future analyses.
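The same ranking can be expressed without an if/else chain. The following is a minimal sketch, assuming the five category labels printed by `unique()` earlier; `income_levels`, `income`, and `income_rank` are illustrative names and the input values are toy data, not rows from the dataset.

```r
# Category labels in ascending order, as printed by unique() earlier
income_levels <- c("No Income", "Below Rs.10000", "10001 to 25000",
                   "25001 to 50000", "More than 50000")

# Toy input standing in for onlinefood_data$`Monthly Income` (hypothetical values)
income <- c("No Income", "Below Rs.10000", "More than 50000", "Unknown")

# match() gives the 1-based position of each value in income_levels,
# or NA for labels that are not listed; subtracting 1 yields ranks 0-4
income_rank <- match(income, income_levels) - 1L
print(income_rank)
```

Because `match()` is vectorized, it can be used directly inside `mutate()` in place of the `sapply()` call, and it returns an unnamed vector.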
onlinefood_data$Gender <- str_squish(onlinefood_data$Gender)
onlinefood_data$`Marital Status` <- str_squish(onlinefood_data$`Marital Status`)
onlinefood_data$`Monthly Income` <- str_squish(onlinefood_data$`Monthly Income`) # squish the income column itself, not Marital Status
onlinefood_data$`Educational Qualifications` <- str_squish(onlinefood_data$`Educational Qualifications`)
onlinefood_data$Feedback <- str_squish(onlinefood_data$Feedback)
colSums(is.na(onlinefood_data)) # count number of missing values in each column
Age Gender Marital Status Occupation
0 0 0 0
Monthly Income Educational Qualifications Family size Latitude
0 0 0 0
Longitude Pin code Output Feedback
0 0 0 0
Adjusted Monthly Income
0
onlinefood_data %>% distinct() # show the distinct rows; the result is not assigned, so onlinefood_data keeps all 388 rows
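The whitespace and duplicate checks can also be written more compactly. The following is a minimal sketch on a toy data frame, assuming that every character column should be squished; `toy`, `toy_clean`, `n_duplicates`, and `toy_unique` are illustrative names and the values are hypothetical.

```r
library(dplyr)
library(stringr)

# Toy data frame with stray whitespace and repeated rows (hypothetical values)
toy <- data.frame(
  Gender = c(" Male", "Female ", "Male", "Male"),
  Occupation = c("Student", "Employee", "Student", "Student"),
  Age = c(22, 25, 22, 22)
)

# Squish every character column in one step instead of one line per column
toy_clean <- toy %>%
  mutate(across(where(is.character), str_squish))

# Count rows that exactly repeat an earlier row
n_duplicates <- sum(duplicated(toy_clean))
print(n_duplicates)

# distinct() returns a new data frame, so the result must be assigned to keep it
toy_unique <- toy_clean %>% distinct()
print(nrow(toy_unique))
```

Counting duplicates before dropping them helps in deciding whether identical rows are genuine repeats or simply different respondents who gave the same answers.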
Now that the dataset has been cleaned, it’s ready for analysis. In this section, we’ll explore the data through a series of visualizations to uncover relationships between different factors and online food ordering behavior.
ggplot(data = onlinefood_data, aes(x = `Marital Status`, fill = Gender)) +
labs(y = "Count", title = "Marital Status and Gender") + # counts are on the y axis
geom_bar(color = "black")
In the dataset, males make up slightly over half of the respondents across the marital status categories. Among those categories, single respondents have the highest count.
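A claim like this can be checked numerically rather than read off the bars. The following is a minimal sketch on toy data, assuming the same column names as the dataset; the rows and the name `gender_share` are hypothetical.

```r
library(dplyr)

# Toy rows standing in for onlinefood_data (hypothetical values)
toy <- data.frame(
  Gender = c("Male", "Male", "Female", "Male", "Female", "Male"),
  `Marital Status` = c("Single", "Single", "Single", "Married", "Married", "Married"),
  check.names = FALSE
)

# Count each gender within each marital status, then convert to within-group shares
gender_share <- toy %>%
  count(`Marital Status`, Gender) %>%
  group_by(`Marital Status`) %>%
  mutate(share = n / sum(n)) %>%
  ungroup()
print(gender_share)
```

Running the same pipeline on `onlinefood_data` would give the exact share of each gender within every marital status category.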
occupation_plot <- ggplot(data = onlinefood_data, aes(x = Occupation, fill = Occupation)) +
geom_bar(stat = "count", colour = "black") +
geom_text(stat = "count", aes(label = after_stat(count)), color = "black", vjust = 0.5, position = position_stack(vjust = 0.5)) + # after_stat() replaces the deprecated stat()
theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
labs(x = "Occupation", y = "Count", title = "Occupation Count") +
guides(fill = "none") # "none" replaces FALSE, which is deprecated as of ggplot2 3.3.4
monthlyincome_plot <- ggplot(data = onlinefood_data, aes(x = `Monthly Income`, fill = `Monthly Income`)) +
geom_bar(stat = "count", color = "black") +
geom_text(stat = "count", aes(label = after_stat(count)), color = "black", vjust = 0.5, position = position_stack(vjust = 0.5)) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
labs(x = "Monthly Income", y = "Count", title = "Monthly Income Count") +
guides(fill = "none")
grid.arrange(occupation_plot, monthlyincome_plot, ncol = 2)
The most common occupation appears to be student, and the most common income category is "No Income". This is consistent with many students not having a stable income while pursuing their studies.
ggplot(data = onlinefood_data, aes(x = `Educational Qualifications`, fill = `Educational Qualifications`)) +
geom_bar(stat = "count", color = "black") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
labs(x = "Educational Qualifications", y = "Count", title = "Educational Qualifications Count") +
guides(fill = "none")
As shown, only two people in this dataset neither have an education nor are pursuing one; everyone else holds or is working toward a qualification.
pincode_count <- table(onlinefood_data$`Pin code`)
print(pincode_count)
560001 560002 560003 560004 560005 560006 560007 560008 560009 560010 560011 560012 560013 560014 560015 560016 560017 560018
8 6 7 5 6 5 4 6 36 14 7 4 2 4 7 3 6 7
560019 560020 560021 560022 560023 560024 560025 560026 560027 560028 560029 560030 560032 560033 560034 560036 560038 560041
2 5 6 3 5 8 3 1 6 2 8 5 2 1 11 9 16 6
560042 560043 560045 560046 560047 560048 560049 560050 560051 560055 560056 560059 560060 560061 560062 560064 560066 560067
5 8 2 4 1 4 2 1 1 2 2 1 2 3 2 2 8 4
560068 560070 560072 560073 560075 560076 560078 560079 560080 560085 560086 560091 560092 560093 560095 560096 560097 560098
4 4 1 1 9 12 5 1 8 6 2 2 5 4 11 7 2 1
560100 560102 560103 560104 560109
4 2 5 1 1
This table displays the frequency of each pin code, representing the number of times food orders have been made from that customer location. We will examine key statistics such as the minimum, maximum, and average number of occurrences per pin code.
pincode_summary <- data.frame(
Minimum = min(pincode_count),
Maximum = max(pincode_count),
Average = mean(pincode_count)
)
print(pincode_summary)
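To make the summary concrete, here is a minimal sketch of the same calculation on toy pin codes. The values and the names `pin_codes`, `pin_counts`, `counts`, `toy_summary`, and `top_pin` are hypothetical and only illustrate how the three statistics are derived from a frequency table.

```r
# Toy pin codes standing in for onlinefood_data$`Pin code` (hypothetical values)
pin_codes <- c(560001, 560001, 560009, 560009, 560009, 560010)

# Frequency of each distinct pin code
pin_counts <- table(pin_codes)

# as.vector() drops the table class so the summary holds plain numbers
counts <- as.vector(pin_counts)
toy_summary <- data.frame(
  Minimum = min(counts),
  Maximum = max(counts),
  Average = mean(counts)
)
print(toy_summary)

# The average count equals total rows divided by the number of distinct pin codes
stopifnot(isTRUE(all.equal(mean(counts),
                           length(pin_codes) / length(unique(pin_codes)))))

# Pin code with the highest count
top_pin <- names(pin_counts)[which.max(pin_counts)]
print(top_pin)
```

Since the average is just the row count divided by the number of distinct pin codes, a single very frequent pin code shows up in the maximum rather than in the average.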
ggplot(data = onlinefood_data, aes(x = `Pin code`)) +
geom_bar(stat = "count", fill = "darkseagreen3", color = "black") +
labs(x = "Pin Code", y = "Count", title = "Pin Code Count")
ggplot(data = onlinefood_data, aes(x = Output, fill = Feedback)) +
geom_bar(color = "black") +
labs(y = "Count", title = "Output vs Feedback")
The majority of responses indicate satisfaction with the timely arrival of their food, representing more than half of the total.
However, approximately one-third of the responses are negative, suggesting either non-delivery of the food or complications with the delivery process. The positive “no” responses could imply order cancellations or satisfactory resolutions to delivery issues.
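The proportions mentioned above can be read from a cross-tabulation instead of estimated from the bars. The following is a minimal sketch on toy data; the rows, the "Yes"/"No" and "Positive"/"Negative" labels, and the names `cross_tab` and `overall_share` are assumptions for illustration.

```r
# Toy rows standing in for the Output and Feedback columns (hypothetical values)
toy <- data.frame(
  Output = c("Yes", "Yes", "Yes", "No", "No", "Yes"),
  Feedback = c("Positive", "Positive", "Negative", "Negative", "Positive", "Positive")
)

# Two-way frequency table: rows are Output, columns are Feedback
cross_tab <- table(toy$Output, toy$Feedback, dnn = c("Output", "Feedback"))
print(cross_tab)

# Share of all responses in each Output/Feedback combination
overall_share <- prop.table(cross_tab)
print(round(overall_share, 2))
```

Passing `margin = 1` to `prop.table()` would instead give the feedback split within each Output value.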
ggplot(data = onlinefood_data, aes(x = Latitude, y = Longitude, color = Age)) +
geom_point() +
labs(title = "Customer's Location v Age") +
scale_color_gradient(name = "Age", low = "blue", high = "red")
ggplot(data = onlinefood_data, aes(x = Latitude, y = Longitude, color = Occupation)) +
geom_point() +
labs(title = "Customer's Location v Occupation")
plot1 <- ggplot(data = onlinefood_data, aes(x = Gender, fill = Feedback)) +
geom_bar(position = "dodge", color = "black") +
labs(y = "Count", title = "Feedback Based on Gender")
plot2 <- ggplot(data = onlinefood_data, aes(x = Gender, fill = Output)) +
geom_bar(position = "dodge", color = "black") +
labs(y = "Count", title = "Output Based on Gender")
grid.arrange(plot1, plot2, ncol = 2)
print(summary(onlinefood_data$Age))
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.00 23.00 24.00 24.63 26.00 33.00
ggplot(data = onlinefood_data, aes(x = Age)) +
geom_histogram(binwidth = 1, fill = "skyblue", colour = "black") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5, color = "black") +
labs(x = "Age", y = "Frequency", title = "Age Distribution")
The majority of individuals who order food fall within the age range of 21 to 26, with a median age of 24 and a mean of about 24.6. This demographic likely consists largely of university students: undergraduates, graduates, and PhD students.
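The share of respondents inside an age range can be computed directly. The following is a minimal sketch with toy ages; the vector `ages` and the name `share_21_26` are hypothetical.

```r
# Toy ages standing in for onlinefood_data$Age (hypothetical values)
ages <- c(19, 21, 22, 23, 23, 24, 25, 26, 28, 31)

# The mean of a logical vector is the proportion of TRUE values,
# here the share of ages between 21 and 26 inclusive
share_21_26 <- mean(ages >= 21 & ages <= 26)
print(share_21_26)
```

Applied to `onlinefood_data$Age`, the same expression would quantify how large the "majority" in the 21 to 26 range actually is.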
ggplot(data = onlinefood_data, aes(x = `Monthly Income`, y = Age, fill = `Monthly Income`)) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
guides(fill = "none")
ggplot(data = onlinefood_data, aes(x = `Marital Status`, y = Age, fill = `Marital Status`)) +
geom_boxplot(position = "identity", width = 0.1, colour = "black") +
guides(fill = "none")
ggplot(data = onlinefood_data, aes(x = `Occupation`, y = Age, fill = `Occupation`)) +
geom_boxplot(position = "identity", width = 0.1, colour = "black") +
theme(axis.text.x = element_text(angle = 45, vjust = 0.5)) +
guides(fill = "none")
summary(onlinefood_data$`Family size`)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 3.000 3.281 4.000 6.000
ggplot(data = onlinefood_data, aes(x = `Family size`)) +
geom_histogram(binwidth = 1, fill = "pink", color = "black") +
geom_text(stat = "count", aes(label = after_stat(count)), vjust = -0.5) +
labs(x = "Family Size", y = "Frequency", title = "Family Size Distribution")
Most respondents who order food have a family size of 2 to 4, with a median of 3 and a mean of about 3.3. This likely corresponds to couples and to parents with one or two children.
ggplot(data = onlinefood_data, aes(x = `Family size`, fill = Feedback)) +
geom_bar(position = "dodge", color = "black") +
labs(y = "Count", title = "Feedback Based on Family Size")
ggplot(data = onlinefood_data, aes(x = `Family size`, fill = Output)) +
geom_bar(position = "dodge", color = "black") +
labs(y = "Count", title = "Output Based on Family Size")
ggplot(data = onlinefood_data, aes(x = `Family size`, fill = `Monthly Income`)) +
geom_bar(position = "dodge", color = "black") +
labs(y = "Count", title = "Monthly Income Based on Family Size")