Data Wrangling (Data Preprocessing)

Setup

# Load the necessary packages required to reproduce the report. 
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Data generation

For this assignment, I have assumed myself in the role of data analyst at McDonalds, a dominant key player in the fast food service industry. I have created two realistic McDonalds synthetic datasets called ‘store_data_df’ and ‘customer_data_df’ to meet the requirement of this assessment. First dataset ‘store_data_df’ holds the data of the store details with customer product order details. Second dataset ‘customer_data_df’ holds the data of customer personal details with order frequency and satisfaction score.

# Data generation, provide your R codes here.
set.seed(469) # setting seed for reproducibility

n_rows <- 200
store_location <- c("Tullamarine", "CBD", "Southbank", "Flinders", "Docklands", "SouthernCross", "Brunswick", "Coburg", "Richmond", "SouthYarra", "Swanston Stree", "Bundoora", "Northcote", "Boxhill", "Clayton", "Collins Street", "North Melbourne", "Werribee")
products <- c("McChicken Burger", "BigMac Burger", "HamBurger", "CheesBurger", "QuarterPounder Burger", "Egg & Sau Muffin", "Bac & Sau Muffin", "Coffee Frappe", "Chocolate Frappe", "Vanilla Sundae", "Chocolate Sundae", "Apple Pie", "Hasbrown", "Banana Bread", "Chai Latte", "Iced Latte", "Mocha", "Hot Chocolate", "Long Black", "Flat White")
purchased_method <- c("Drive Through", "In Store", "Mobile Order")

# Generating synthetic dataset 1
store_data_df <- data.frame(
  store_id = 1:n_rows,
  store_name = sample(store_location, n_rows, replace = TRUE),
  customer_id = 1001:1200,
  products_ordered = sample(products, n_rows, replace = TRUE),
  purchase_total_amount = round(runif(n_rows, min = 3, max = 10), digits = 2),
  promotion_used = sample(c("Yes", "No"), n_rows, replace = TRUE),
  satisfaction_rating = round(runif(n_rows, min = 1, max = 10), digits = 1)
)

# Generating synthetic dataset 2
customer_data_df <- data.frame(
  customer_id = 1001:1200,
  age = round(rnorm(n_rows, mean = 35, sd = 10)),
  gender = sample(c("Male", "Female"), n_rows, replace = TRUE),
  purchased_via = sample(purchased_method, n_rows, replace = TRUE),
  order_frequency = round(runif(n_rows, min = 1, max = 10), digits = 0)
)

# Introducing the missing values to the datasets (5% of total rows)
missing_value <- sample(1:n_rows, size = round(n_rows * 0.05))
store_data_df$products_ordered[missing_value] <- NA
customer_data_df$order_frequency[missing_value] <- NA

# Generating the correlated random data
correlated_data <- rnorm(n_rows, mean(store_data_df$satisfaction_rating), sd(store_data_df$satisfaction_rating))
customer_data_df <- customer_data_df %>% mutate(
  # Ensures the correlated data lies within a reasonable range
  correlated_satisfaction = pmax(1, pmin(10, correlated_data)))

# Displaying the starting few rows of store_data_df dataset
head(store_data_df)

# Displaying the starting few rows of customer_data_df dataset
head(customer_data_df)

Provide explanations here.

I have began with this task by setting the seed value to 469 which helps for reproducibility.
rnorm() function is used to generate random data following a normal distribution.

Dataset 1 details: I have created this dataset (i.e dataframe) named ‘store_data_df’ by using the function called data.frame() which contains 7 variables and 200 observations (includes missing values) meeting all the requirements specified for this assessment task. Each variable details as follows:

store_id - Unique identifier of the store ranging from 1 to 200.
store_name - This is a character variable contains the McDonalds store location. It has a total of 18 different store locations.
customer_id - Numeric id which uniquely identifies each customer ranging from 1001 to 1200. This is a common variable between both datasets.
products_ordered - This is a character variable contains the McDonalds various products ordered by the customer. It has a total of 20 products. I have inserted 5% missing value to this column data to meet with the assessment requirement.
purchase_total_amount - It is a total purchase amount of a product by the customer. Product price ranges from value minimum 3 to maximum 10. I have rounded it of by using round() to 2 decimal points.
promotion_used - It’s a character data which has 2 values “Yes” and “No”. This variable gives information about whether customer used promotion code to purchase the product or not.
satisfaction_rating - It’s a customer satisfaction/feedback score given to the service offered by store. Values ranges from 1 to 10 where 1 being low satisfaction and 10 being high satisfaction. I have rounded it of by using round() to 1 decimal points.

Dataset 2 details: I have created this dataset (i.e dataframe) named ‘customer_data_df’ by using the function called data.frame() which contains 6 variables and 200 observations (includes correlated random data and missing values) meeting all the requirements specified for this assessment task. Each variable details as follows:

customer_id - Numeric id which uniquely identifies each customer ranging from 1001 to 1200. This is a common variable between both datasets.
age - It’s a age of a customer.
gender - It’s a gender of a customer which has two values “Male” & “Female”.
purchased_via - This column gives information about methods by which customer purchased their order. It has 3 ways i.e via “Drive Through”, “In Store” and “Mobile Order”.
order_frequency - This column represents how frequent customer purchased the products. It has values ranging from minimum 1 to maximum 10. I have inserted 5% missing value to this column data to meet with the assessment requirement.
correlated_satisfaction - I created correlated random data to simulate this new variable within the ‘customer_data_df’ dataframe. By using the mean and standard deviation from the existing “satisfaction_rating” data in the ‘store_data_df’ dataframe, we ensured that our simulated data reflects similar distribution characteristics. The generated values were limited by the “pmax” and “pmin” functions to a realistic range of 1 to 10, preserving the authenticity of our simulated satisfaction ratings. Through the addition of associated customer satisfaction data, this procedure improves the applicability of our synthetic dataset for analysis and modeling.

Merging data sets

# Merge your synthetic data sets, provide R codes here.

# Converting appropriate variables to suitable data types before merging
store_data_df$promotion_used <- as.factor(store_data_df$promotion_used)
customer_data_df$purchased_via <- as.factor(customer_data_df$purchased_via)

# Labeling and ordering the factor variables
customer_data_df$purchased_via <- factor(customer_data_df$purchased_via, levels = c("Drive Through", "In Store", "Mobile Order"), ordered = TRUE)

# Merging the datasets
merged_data <- merge(store_data_df, customer_data_df, by = "customer_id", all = TRUE)
head(merged_data)

Provide explanations here.

To merge my datasets, I carried out the necessary pre-processing of the data. Using the as.factor() method, I first transformed categorical values into factors, such as “promotion_used” in the store data and “purchased_via” in the customer data. This conversion enables to handle these variables as categorical factors instead of just character or numeric vectors.
I also labeled and orderred the factor levels for the “purchased_via” variable. We turned the variable into a factor and organized the levels in a useful hierarchical order by using the factor() method with the provided levels (“Drive Through,” “In Store,” and “Mobile Order”).
Lastly, I used the merge() function to combine the preprocessed datasets based on the common variable “customer_id”. The information from both datasets is combined during this merging process to create a single, cohesive dataset known as “merged_data”. All in all, these preprocessing procedures guarantee the accuracy and integrity of our combined dataset and provide a strong basis for further data exploration and analysis.

Checking structure of combined data

# Check structure of combined data and perform all necessary data type conversions, provide R codes here.

# Performing all required necessary data type conversions
merged_data$gender <- as.factor(merged_data$gender)
merged_data$age <- as.integer(merged_data$age)

# Displaying the structure of the merged data
str(merged_data)

## 'data.frame':    200 obs. of  12 variables:
##  $ customer_id            : int  1001 1002 1003 1004 1005 1006 1007 1008 1009 1010 ...
##  $ store_id               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ store_name             : chr  "North Melbourne" "North Melbourne" "Flinders" "Richmond" ...
##  $ products_ordered       : chr  "Vanilla Sundae" "Apple Pie" "Mocha" "Banana Bread" ...
##  $ purchase_total_amount  : num  4.55 5.36 9.79 5.83 6.51 4.1 7.98 8.97 7.53 9.74 ...
##  $ promotion_used         : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 2 1 1 2 2 ...
##  $ satisfaction_rating    : num  6.5 4 8.3 1.9 3.1 4.3 2.6 2.6 6.8 2.2 ...
##  $ age                    : int  35 35 25 26 43 36 33 2 25 47 ...
##  $ gender                 : Factor w/ 2 levels "Female","Male": 2 2 1 1 2 1 1 2 1 1 ...
##  $ purchased_via          : Ord.factor w/ 3 levels "Drive Through"<..: 1 3 1 3 3 2 3 2 1 3 ...
##  $ order_frequency        : num  7 9 3 7 1 10 5 1 5 7 ...
##  $ correlated_satisfaction: num  3.41 8.5 4.66 7.04 7.92 ...

Provide explanations here.

In this step, atfirst, I did all required necessary variables data type conversions. I used the as.factor() function to turn the “gender” variable into a factor. By converting the gender variable, you can be sure that R will treat it as categorical data. I converted the “age” variable into integer format using the as.integer() function. By doing so, we ensure that the age variable is represented as whole numbers.
After performing these necessary data type conversions, I used str() function to display the structure of the merged dataframe. It provides a schema/structure of the data frame with displaying data types of each column and the initial few values of the data frame.

Generate summary statistics

# Generate summary statistics, provide R codes here.

# Grouping by 'store_name' variable and generating summary statistics for 'purchase_total_amount' variable
summary_stats <- merged_data %>%
  group_by(store_name) %>%
  summarise(
    mean_purchase = mean(purchase_total_amount, na.rm = TRUE),
    median_purchase = median(purchase_total_amount, na.rm = TRUE),
    first_quartile = quantile(purchase_total_amount, 0.25, na.rm = TRUE),
    third_quartile = quantile(purchase_total_amount, 0.75, na.rm = TRUE),
    sd_purchase = sd(purchase_total_amount, na.rm = TRUE)
  )

# Printing the summary statistics
print(summary_stats)

## # A tibble: 18 × 6
##    store_name      mean_purchase median_purchase first_quartile third_quartile
##    <chr>                   <dbl>           <dbl>          <dbl>          <dbl>
##  1 Boxhill                  4.99            4.54           3.82           5.87
##  2 Brunswick                6.42            6.67           4.08           8.22
##  3 Bundoora                 7.53            8.64           5.54           9.24
##  4 CBD                      5.99            6.12           3.90           7.52
##  5 Clayton                  6.47            6.44           4.50           7.84
##  6 Coburg                   7.53            8.13           6.66           8.76
##  7 Collins Street           6.24            5.68           4.91           7.22
##  8 Docklands                5.87            5.67           4.44           7.29
##  9 Flinders                 7.33            6.79           6.24           8.70
## 10 North Melbourne          6.37            6.61           4.84           7.5 
## 11 Northcote                5.83            5.55           4.35           6.85
## 12 Richmond                 6.10            5.83           5.07           6.63
## 13 SouthYarra               5.70            5.54           4.16           6.88
## 14 Southbank                5.14            5.04           3.71           6.21
## 15 SouthernCross            5.80            5.08           4.47           6.96
## 16 Swanston Stree           5.59            5.31           3.86           6.90
## 17 Tullamarine              5.71            6.00           4.96           6.56
## 18 Werribee                 5.64            5.57           4.64           6.97
## # ℹ 1 more variable: sd_purchase <dbl>

Provide explanations here.

I calculated summary statistics for the “purchase_total_amount” data in this section, which were categorized by the “store_name” field. I used the %>% operator to chain operations, making the code more concise and readable. The group_by() function grouped the dataset by the “store_name” variable, while the summarise() function computed various summary statistics for the “purchase_total_amount” variable within each group.
The summary statistics calculated include the mean, median, first quartile, third quartile, and standard deviation of the purchase amounts. These statistics provide insights into the distribution and central tendencies of purchase amounts across different stores. Lastly, I used the print() function to print the summary statistics that were produced to the console.

Scanning data

# Scan variables for missing values, provide R codes here.

# Scaning all the variables which has missing values
missing_values <- sapply(merged_data, function(x) sum(is.na(x)))
# Print the number of missing values for each variable
print(missing_values)

##             customer_id                store_id              store_name 
##                       0                       0                       0 
##        products_ordered   purchase_total_amount          promotion_used 
##                      10                       0                       0 
##     satisfaction_rating                     age                  gender 
##                       0                       0                       0 
##           purchased_via         order_frequency correlated_satisfaction 
##                       0                      10                       0

print("----------------------------------------------------------------")

## [1] "----------------------------------------------------------------"

# For numeric variables, replacing missing values with the mean
merged_data$order_frequency[is.na(merged_data$order_frequency)] <- mean(merged_data$order_frequency, na.rm = TRUE)

# For categorical variables, replacing missing values with the mode
merged_data$products_ordered[is.na(merged_data$products_ordered)] <- mode(merged_data$products_ordered)

# Re-scaning all variables for missing values to ensure they are handled
missing_values_after <- sapply(merged_data, function(x) sum(is.na(x)))

# Print the number of missing values for each variable after handling
print(missing_values_after)

##             customer_id                store_id              store_name 
##                       0                       0                       0 
##        products_ordered   purchase_total_amount          promotion_used 
##                       0                       0                       0 
##     satisfaction_rating                     age                  gender 
##                       0                       0                       0 
##           purchased_via         order_frequency correlated_satisfaction 
##                       0                       0                       0

Provide explanations here.

I have scanned our combined dataset for missing values in this part. The R code counted the number of missing values for each variable in the dataset by applying a function over all variables using the sapply() function. We identified the missing values and then put procedures in place to deal with them. The mean() function was utilized to substitute missing values for numerical variables with their corresponding variable’s mean value. Similar to this, the mode() function was used to replace missing values for categorical variables with the mode, or most frequent value, for that particular variable. The distribution of the categorical variable is preserved by substituting the mode for any missing categorical values.
I handled missing values and then rescanned every variable to make sure all issues had been resolved. After the imputation process, the summary that results gives an overview of how many missing values there are for each variable, ensuring that missing data are properly managed.

Link to presentation

# Add the link to your presentation here.

https://rmit-arc.instructuremedia.com/embed/97e8ebe9-c888-42c8-b52f-349812fc3640

References

[1] RMIT (2024) Module 3: Understand: Understanding Data and Data Structures, Data Wrangling (Preprocessing). Accessed on: (22 Apr - 30 Apr)/2024. Link: http://rare-phoenix-161610.appspot.com/secured/Module_03.html

[2] RMIT (2024) Module 4: Tidy and Manipulate: Tidy Data Principles and Manipulating Data, Data Wrangling (Preprocessing). Accessed on: (22 Apr - 30 Apr)/2024. Link: http://rare-phoenix-161610.appspot.com/secured/Module_04.html

[3] RMIT (2024) Module 5: Scan: Missing Values Data Wrangling (Preprocessing). Accessed on: (22 Apr - 30 Apr)/2024. Link: http://rare-phoenix-161610.appspot.com/secured/Module_05.html

[4] QuillBot Website, used for paraphrasing my explanations. Accessed on: (29 Apr - 30 Apr)/2024. URL: https://quillbot.com

Data Wrangling (Data Preprocessing)

Mid-term assessment

Chandangowda Maruvanahalli Shivaramu, s4063920

30-Apr-2024

Setup

Data generation

Merging data sets

Checking structure of combined data

Generate summary statistics

Scanning data

Link to presentation

References