Setup

# Load the necessary packages required to reproduce the report. 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data generation

For this assignment, I have assumed the role of a data analyst at McDonald's, a dominant player in the fast-food service industry. I have created two realistic synthetic McDonald's datasets, 'store_data_df' and 'customer_data_df', to meet the requirements of this assessment. The first dataset, 'store_data_df', holds store details together with customer product order details. The second dataset, 'customer_data_df', holds customer personal details along with order frequency and a satisfaction score.

# Data generation, provide your R codes here.
set.seed(469) # setting seed for reproducibility

n_rows <- 200
store_location <- c("Tullamarine", "CBD", "Southbank", "Flinders", "Docklands", "SouthernCross", "Brunswick", "Coburg", "Richmond", "SouthYarra", "Swanston Street", "Bundoora", "Northcote", "Boxhill", "Clayton", "Collins Street", "North Melbourne", "Werribee")
products <- c("McChicken Burger", "BigMac Burger", "HamBurger", "CheeseBurger", "QuarterPounder Burger", "Egg & Sau Muffin", "Bac & Sau Muffin", "Coffee Frappe", "Chocolate Frappe", "Vanilla Sundae", "Chocolate Sundae", "Apple Pie", "Hashbrown", "Banana Bread", "Chai Latte", "Iced Latte", "Mocha", "Hot Chocolate", "Long Black", "Flat White")
purchased_method <- c("Drive Through", "In Store", "Mobile Order")

# Generating synthetic dataset 1
store_data_df <- data.frame(
  store_id = 1:n_rows,
  store_name = sample(store_location, n_rows, replace = TRUE),
  customer_id = 1001:1200,
  products_ordered = sample(products, n_rows, replace = TRUE),
  purchase_total_amount = round(runif(n_rows, min = 3, max = 10), digits = 2),
  promotion_used = sample(c("Yes", "No"), n_rows, replace = TRUE),
  satisfaction_rating = round(runif(n_rows, min = 1, max = 10), digits = 1)
)

# Generating synthetic dataset 2
customer_data_df <- data.frame(
  customer_id = 1001:1200,
  age = round(rnorm(n_rows, mean = 35, sd = 10)),
  gender = sample(c("Male", "Female"), n_rows, replace = TRUE),
  purchased_via = sample(purchased_method, n_rows, replace = TRUE),
  order_frequency = round(runif(n_rows, min = 1, max = 10), digits = 0)
)

# Introducing the missing values to the datasets (5% of total rows)
missing_value <- sample(1:n_rows, size = round(n_rows * 0.05))
store_data_df$products_ordered[missing_value] <- NA
customer_data_df$order_frequency[missing_value] <- NA

# Generating the correlated random data
correlated_data <- rnorm(n_rows, mean(store_data_df$satisfaction_rating), sd(store_data_df$satisfaction_rating))
customer_data_df <- customer_data_df %>% mutate(
  # Ensures the correlated data lies within a reasonable range
  correlated_satisfaction = pmax(1, pmin(10, correlated_data)))

# Displaying the starting few rows of store_data_df dataset
head(store_data_df)
# Displaying the starting few rows of customer_data_df dataset
head(customer_data_df)

Dataset 1 details: I created this dataset (i.e., data frame), named 'store_data_df', using the data.frame() function. It contains 7 variables and 200 observations (including missing values), meeting all the requirements specified for this assessment task. A quick structural check is sketched after the list. The variables are as follows:

  1. store_id - Unique identifier of the store, ranging from 1 to 200.

  2. store_name - A character variable containing the McDonald's store location. There are 18 different store locations in total.

  3. customer_id - A numeric ID that uniquely identifies each customer, ranging from 1001 to 1200. This is the common variable between both datasets.

  4. products_ordered - A character variable containing the McDonald's products ordered by customers. There are 20 products in total. I inserted 5% missing values into this column to meet the assessment requirement.

  5. purchase_total_amount - The total purchase amount paid by the customer, ranging from a minimum of 3 to a maximum of 10. I rounded it to 2 decimal places using round().

  6. promotion_used - A character variable with two values, "Yes" and "No", indicating whether the customer used a promotion code for the purchase.

  7. satisfaction_rating - The customer satisfaction score for the service offered by the store. Values range from 1 to 10, where 1 is low satisfaction and 10 is high satisfaction. I rounded it to 1 decimal place using round().
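
The following sketch is an optional sanity check (not part of the required output) that compares the structure and value ranges of 'store_data_df' against the descriptions above.

# Optional sanity check of store_data_df against the variable descriptions
str(store_data_df)                           # expect 7 variables and 200 observations
range(store_data_df$purchase_total_amount)   # expect values between 3 and 10
range(store_data_df$satisfaction_rating)     # expect values between 1 and 10
sum(is.na(store_data_df$products_ordered))   # expect 10 missing values (5% of 200 rows)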

Dataset 2 details: I created this dataset (i.e., data frame), named 'customer_data_df', using the data.frame() function. It contains 6 variables and 200 observations (including correlated random data and missing values), meeting all the requirements specified for this assessment task. The variables are as follows:

  1. customer_id - A numeric ID that uniquely identifies each customer, ranging from 1001 to 1200. This is the common variable between both datasets.

  2. age - The customer's age, generated from a normal distribution with a mean of 35 and a standard deviation of 10.

  3. gender - The customer's gender, with two values, "Male" and "Female".

  4. purchased_via - The method by which the customer placed their order, with three values: "Drive Through", "In Store" and "Mobile Order".

  5. order_frequency - How frequently the customer purchases products, with values ranging from a minimum of 1 to a maximum of 10. I inserted 5% missing values into this column to meet the assessment requirement.

  6. correlated_satisfaction - I created correlated random data to simulate this new variable within the 'customer_data_df' data frame. By using the mean and standard deviation of the existing "satisfaction_rating" variable from the 'store_data_df' data frame, the simulated data reflects similar distribution characteristics. The generated values were limited to a realistic range of 1 to 10 with the pmax() and pmin() functions, preserving the plausibility of the simulated satisfaction ratings. Adding this related customer satisfaction variable improves the usefulness of the synthetic dataset for analysis and modelling; a quick distribution check is sketched after this list.
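
The sketch below is an optional check (not required by the assessment) that compares the mean and standard deviation of the simulated 'correlated_satisfaction' variable with those of the original 'satisfaction_rating' it was based on; because the values are clamped to the 1-10 range, the two sets of statistics are expected to be close rather than identical.

# Optional check: the simulated variable should roughly match the source distribution
mean(store_data_df$satisfaction_rating)
sd(store_data_df$satisfaction_rating)
mean(customer_data_df$correlated_satisfaction)
sd(customer_data_df$correlated_satisfaction)
range(customer_data_df$correlated_satisfaction)   # should stay within 1 and 10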

Merging data sets

# Merge your synthetic data sets, provide R codes here.

# Converting appropriate variables to suitable data types before merging
store_data_df$promotion_used <- as.factor(store_data_df$promotion_used)
customer_data_df$purchased_via <- as.factor(customer_data_df$purchased_via)

# Labeling and ordering the factor variables
customer_data_df$purchased_via <- factor(customer_data_df$purchased_via, levels = c("Drive Through", "In Store", "Mobile Order"), ordered = TRUE)

# Merging the datasets
merged_data <- merge(store_data_df, customer_data_df, by = "customer_id", all = TRUE)
head(merged_data)

Before merging, I converted the categorical variables 'promotion_used' and 'purchased_via' to factors, and gave 'purchased_via' explicitly ordered levels ("Drive Through", "In Store", "Mobile Order"). The two data frames were then merged on the common key 'customer_id' using merge() with all = TRUE, which performs a full outer join so that no record from either dataset is dropped. Because every customer_id appears exactly once in each data frame, the merged data contains 200 rows and 12 variables. For comparison, an equivalent dplyr formulation is sketched below.
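
This is an alternative sketch for comparison only, not the method used for the report: the same full outer join expressed with dplyr's full_join().

# Alternative formulation of the merge using dplyr (comparison only)
merged_data_dplyr <- full_join(store_data_df, customer_data_df, by = "customer_id")
identical(dim(merged_data_dplyr), dim(merged_data))   # both should be 200 rows by 12 columns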

Checking structure of combined data

# Check structure of combined data and perform all necessary data type conversions, provide R codes here.

# Performing all required necessary data type conversions
merged_data$gender <- as.factor(merged_data$gender)
merged_data$age <- as.integer(merged_data$age)

# Displaying the structure of the merged data
str(merged_data)
## 'data.frame':    200 obs. of  12 variables:
##  $ customer_id            : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ store_id               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ store_name             : chr  "North Melbourne" "North Melbourne" "Flinders" "Richmond" ...
##  $ products_ordered       : chr  "Vanilla Sundae" "Apple Pie" "Mocha" "Banana Bread" ...
##  $ purchase_total_amount  : num  4.55 5.36 9.79 5.83 6.51 4.1 7.98 8.97 7.53 9.74 ...
##  $ promotion_used         : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 2 1 1 2 2 ...
##  $ satisfaction_rating    : num  6.5 4 8.3 1.9 3.1 4.3 2.6 2.6 6.8 2.2 ...
##  $ age                    : int  35 35 25 26 43 36 33 2 25 47 ...
##  $ gender                 : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 1 2 1 1 ...
##  $ purchased_via          : Ord.factor w/ 3 levels "Drive Through"<..: 1 3 1 3 3 2 3 2 1 3 ...
##  $ order_frequency        : num  7 9 3 7 1 10 5 1 5 7 ...
##  $ correlated_satisfaction: num  3.41 8.5 4.66 7.04 7.92 ...

The str() output confirms that the merged data frame contains 200 observations of 12 variables. As the remaining data type conversions, 'gender' was converted to a factor with two levels ("Female", "Male") and 'age' to an integer. 'promotion_used' is a factor with two levels and 'purchased_via' is an ordered factor with three levels, as set before merging, while the identifier and amount columns remain integer or numeric. A compact cross-check of the column classes is sketched below.
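
As an optional cross-check of the conversions (not part of the required output), the sketch below lists the class of every column; glimpse(), re-exported by dplyr, gives a compact alternative to str().

# Optional cross-check: class of each column after conversion
# (purchased_via reports two classes, "ordered" and "factor")
sapply(merged_data, class)
# Compact alternative view of the same structure
glimpse(merged_data)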

Generate summary statistics

# Generate summary statistics, provide R codes here.

# Grouping by 'store_name' variable and generating summary statistics for 'purchase_total_amount' variable
summary_stats <- merged_data %>%
  group_by(store_name) %>%
  summarise(
    mean_purchase = mean(purchase_total_amount, na.rm = TRUE),
    median_purchase = median(purchase_total_amount, na.rm = TRUE),
    first_quartile = quantile(purchase_total_amount, 0.25, na.rm = TRUE),
    third_quartile = quantile(purchase_total_amount, 0.75, na.rm = TRUE),
    sd_purchase = sd(purchase_total_amount, na.rm = TRUE)
  )

# Printing the summary statistics
print(summary_stats)
## # A tibble: 18 × 6
##    store_name      mean_purchase median_purchase first_quartile third_quartile
##    <chr>                   <dbl>           <dbl>          <dbl>          <dbl>
##  1 Boxhill                  4.99            4.54           3.82           5.87
##  2 Brunswick                6.42            6.67           4.08           8.22
##  3 Bundoora                 7.53            8.64           5.54           9.24
##  4 CBD                      5.99            6.12           3.90           7.52
##  5 Clayton                  6.47            6.44           4.50           7.84
##  6 Coburg                   7.53            8.13           6.66           8.76
##  7 Collins Street           6.24            5.68           4.91           7.22
##  8 Docklands                5.87            5.67           4.44           7.29
##  9 Flinders                 7.33            6.79           6.24           8.70
## 10 North Melbourne          6.37            6.61           4.84           7.5 
## 11 Northcote                5.83            5.55           4.35           6.85
## 12 Richmond                 6.10            5.83           5.07           6.63
## 13 SouthYarra               5.70            5.54           4.16           6.88
## 14 Southbank                5.14            5.04           3.71           6.21
## 15 SouthernCross            5.80            5.08           4.47           6.96
## 16 Swanston Street          5.59            5.31           3.86           6.90
## 17 Tullamarine              5.71            6.00           4.96           6.56
## 18 Werribee                 5.64            5.57           4.64           6.97
## # ℹ 1 more variable: sd_purchase <dbl>

The merged data were grouped by 'store_name' and, for each of the 18 store locations, the mean, median, first quartile, third quartile and standard deviation of 'purchase_total_amount' were calculated with summarise(). The na.rm = TRUE argument ensures that any missing amounts would be ignored. The resulting tibble shows, for example, that Bundoora and Coburg have the highest mean purchase amounts (about 7.53), while Boxhill has the lowest (about 4.99). A ranked view of the same table is sketched below.
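
If a ranked view were needed, the same summary table could be ordered by mean purchase amount, as sketched below (an optional extension, not part of the required output).

# Optional extension: rank the stores by mean purchase amount
summary_stats %>%
  arrange(desc(mean_purchase)) %>%
  print(n = 5)   # the five stores with the highest average spend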

Scanning data

# Scan variables for missing values, provide R codes here.

# Scanning all the variables for missing values
missing_values <- sapply(merged_data, function(x) sum(is.na(x)))
# Print the number of missing values for each variable
print(missing_values)
##             customer_id                store_id              store_name 
##                       0                       0                       0 
##        products_ordered   purchase_total_amount          promotion_used 
##                      10                       0                       0 
##     satisfaction_rating                     age                  gender 
##                       0                       0                       0 
##           purchased_via         order_frequency correlated_satisfaction 
##                       0                      10                       0
print("----------------------------------------------------------------")
## [1] "----------------------------------------------------------------"
# For numeric variables, replacing missing values with the mean
merged_data$order_frequency[is.na(merged_data$order_frequency)] <- mean(merged_data$order_frequency, na.rm = TRUE)

# For categorical variables, replacing missing values with the statistical mode (most frequent value);
# note that base R's mode() returns the storage mode, so a frequency table is used instead
merged_data$products_ordered[is.na(merged_data$products_ordered)] <-
  names(which.max(table(merged_data$products_ordered)))

# Re-scanning all variables for missing values to ensure they have been handled
missing_values_after <- sapply(merged_data, function(x) sum(is.na(x)))

# Print the number of missing values for each variable after handling
print(missing_values_after)
##             customer_id                store_id              store_name 
##                       0                       0                       0 
##        products_ordered   purchase_total_amount          promotion_used 
##                       0                       0                       0 
##     satisfaction_rating                     age                  gender 
##                       0                       0                       0 
##           purchased_via         order_frequency correlated_satisfaction 
##                       0                       0                       0

Scanning the merged data with sapply() and is.na() shows 10 missing values (5% of 200) in each of 'products_ordered' and 'order_frequency', matching the missing values introduced during data generation; all other variables are complete. The missing numeric 'order_frequency' values were imputed with the variable's mean, and the missing 'products_ordered' values were imputed with the most frequent product (the statistical mode, computed from a frequency table because base R's mode() returns the storage mode rather than the most common value). Re-scanning confirms that no missing values remain in any variable. A small reusable helper for the mode calculation is sketched below.
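
For readability, the statistical-mode calculation used above could be wrapped in a small helper function; get_mode() below is an illustrative name introduced here, not an existing base R function.

# Illustrative helper: returns the most frequent non-missing value of a vector
# (get_mode is a hypothetical name, not a base R function)
get_mode <- function(x) {
  counts <- table(x, useNA = "no")
  names(counts)[which.max(counts)]
}
get_mode(merged_data$products_ordered)   # most frequently ordered product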
