Data Wrangling (Data Preprocessing)

Setup

# Loading the required library
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(dplyr) : R package, this package provides a consistent unified set of tools and functions for data manipulation.

Data generation

# Set seeding
set.seed(150)
# Defining  the number of rows in each data set
n_rows <- 100

# Data set 1 : df1

df1 <- data.frame(
  Prod_ID = sample(1:100),
  Menu_Item = sample(c('Harvest Bowl', 'Double Cheese & Bacon', 'Avo Parmesan Crunch with Protein', 'Paella'), n_rows, replace = TRUE),
  Price = round(runif(n_rows, 10, 50), 2),
  Calories = sample(200:1000, n_rows, replace = TRUE),
  Protein_g = sample(10:50, n_rows, replace = TRUE),
  Fat_g = sample(5:30, n_rows, replace = TRUE),
  Location = sample(c('Melbourne', 'Brisbane', 'Perth', 'Sydney'), n_rows, replace = TRUE)
)
# Calculate 5% of the number of rows
n_rows <- nrow(df1)
n_NA <- round(n_rows * 0.05)
# Randomly select 5% of the rows
rows_to_NA <- sample(1:n_rows, n_NA)
# Set the 'Calories' value to NA for the selected rows
df1$Calories[rows_to_NA] <- NA
# Print the dataframe
head(df1)

# Data set 2 : df2

# Create the data frame
df2 <- data.frame(
  Prod_ID = sample(1:100, n_rows, replace = TRUE),
  Menu_Item = sample(c('Harvest Bowl', 'Double Cheese & Bacon', 'Avo Parmesan Crunch with Protein', 'Paella'), n_rows, replace = TRUE),
  Rating = round(runif(n_rows, 1, 5), 1),
  Review_Count = rpois(n_rows, 10),
  Is_Spicy = sample(c(0, 1), n_rows, replace = TRUE),
  Contains_Garlic = sample(c(0, 1), n_rows, replace = TRUE),
  Sales = sample(100:1000, n_rows, replace = TRUE)
)

# Set the desired correlation between Rating and Review_Count
desired_correlation <- 0.5

# Introduce correlated random data for Rating and Review_Count
df2$Review_Count <- desired_correlation * df2$Rating + rnorm(n_rows, mean = 0, sd = sqrt(1 - desired_correlation^2))

# Introduce a new column correlated with Rating
df2$New_Co_relation <- 0.7 * df2$Rating + rnorm(n_rows, mean = 0, sd = sqrt(1 - 0.7^2))

# Calculate the correlation between Rating and New_Co_relation
cor_rating_new <- cor(df2$Rating, df2$New_Co_relation)

# Print the correlation coefficient between Rating and New_Co_relation
print(cor_rating_new)

## [1] 0.6312243

# Calculate the correlation between Rating and Review_Count
cor_rating_review <- cor(df2$Rating, df2$Review_Count)

# Print the correlation coefficient between Rating and Review_Count
print(cor_rating_review)

## [1] 0.5065912

# Calculate 5% of the number of rows
n_rows <- nrow(df2)
n_NA <- round(n_rows * 0.05)
# Randomly select 5% of the rows
rows_to_NA <- sample(1:n_rows, n_NA)
# Set the 'SALES' value to NA for the selected rows
df2$Sales[rows_to_NA] <- NA
# Print the dataframe
head(df2)

Nando’s, a popular restaurant chain known for its flame-grilled peri-peri chicken.

As a data analyst at Nando’s, a well-known restaurant chain recognized for its delicious flame-grilled peri-peri chicken, Here’s my approach would be:

I began by setting a seed for the random number generator to ensure consistent results.
Then, I initialized the number of rows to 100.
After that, I created a data frame called df1 with columns like Product ID, Menu Item, Price, Calories, Protein, Fat, and Location, populating them with random or predefined values.
I adjusted the number of rows to match the actual count in df1.
To introduce missing Calorie values in df1, I calculated 5% of the total rows and randomly selected that many row indices.
I then set the Calorie values to missing (NA) for those selected rows.
I displayed the first few rows of df1 to verify the changes.
Next, I created another data frame, df2, with similar columns to df1 but different random or predefined values.
This time, I introduced missing values for the Sales column following the same process as with Calories in df1.
I displayed the first few rows of df2 to confirm the changes.
Moving on to df2, I set the desired correlation between the ‘Rating’ and ‘Review_Count’ columns to 0.5.
To achieve this, I created a new ‘Review_Count’ column by combining the ‘Rating’ column with a random normal variable, ensuring the desired correlation.
I then created another column, ‘New_Co_relation,’ with values correlated to ‘Rating’ at 0.7, using a similar approach as for ‘Review_Count.’
Finally, I calculated and printed the actual correlation coefficients between ‘Rating’ and ‘New_Co_relation,’ as well as between ‘Rating’ and ‘Review_Count,’ to verify the results.

Merging data sets

# Merging the data sets on the common variable 'Prod_ID'

library(dplyr)
merged_df <- df2 %>% inner_join(df1, by = "Prod_ID")
merged_df <- merge(df2, df1, by = "Prod_ID")

# Convert a numeric variable to factor
merged_df$Is_Spicy <- as.factor(merged_df$Is_Spicy)
# Converting a character variable to factor
merged_df$Location <- as.factor(merged_df$Location)
# Labeling the levels of the factor variable 'Is_Spicy'
levels(merged_df$Is_Spicy) <- c("Not Spicy", "Spicy")
# Ordering the levels of the factor variable 'Location'
mrg <- merged_df$Location <- factor(merged_df$Location, levels = c('Melbourne', 'Brisbane', 'Perth', 'Sydney'))
head(merged_df)

I first load the dplyr package, which is helpful for manipulating data.
Then, I combine two data frames, df2 and df1, based on their ‘Prod_ID’ columns, creating a new merged data frame named merged_df.
Alternatively, I can achieve the same merge using the merge function.
Following that, I convert the ‘Is_Spicy’ column in merged_df into a categorical variable (factor), assuming it originally represented spiciness levels with numbers.
Similarly, I convert the ‘Location’ column in merged_df into a categorical variable, assuming it was initially stored as text.
To improve clarity, I change the levels of the ‘Is_Spicy’ variable to “Not Spicy” and “Spicy,” assuming 0 represents “Not Spicy” and 1 represents “Spicy.”
I rearrange the levels of the ‘Location’ variable to match a specific order: ‘Melbourne’, ‘Brisbane’, ‘Perth’, ‘Sydney.’
Lastly, I display the first few rows of merged_df to inspect the changes using the head function.

Checking structure of combined data

# Checking the structure of the combined data set
str(merged_df)

## 'data.frame':    100 obs. of  14 variables:
##  $ Prod_ID        : int  2 3 3 5 8 8 10 12 12 13 ...
##  $ Menu_Item.x    : chr  "Double Cheese & Bacon" "Paella" "Harvest Bowl" "Double Cheese & Bacon" ...
##  $ Rating         : num  1.1 2.9 1.3 3.3 3.6 2 1.7 1.5 2.1 1.2 ...
##  $ Review_Count   : num  0.484 1.7799 -1.0355 0.0295 0.7719 ...
##  $ Is_Spicy       : Factor w/ 2 levels "Not Spicy","Spicy": 2 2 2 2 2 2 1 2 2 1 ...
##  $ Contains_Garlic: num  1 0 1 0 1 1 0 1 0 1 ...
##  $ Sales          : int  940 518 567 615 405 660 738 696 721 135 ...
##  $ New_Co_relation: num  1.09 2.45 1.83 1.88 3.31 ...
##  $ Menu_Item.y    : chr  "Double Cheese & Bacon" "Paella" "Paella" "Double Cheese & Bacon" ...
##  $ Price          : num  41.8 22.6 22.6 45 32 ...
##  $ Calories       : int  574 979 979 345 753 753 571 294 294 NA ...
##  $ Protein_g      : int  43 18 18 32 43 43 46 26 26 47 ...
##  $ Fat_g          : int  18 7 7 30 11 11 11 18 18 13 ...
##  $ Location       : Factor w/ 4 levels "Melbourne","Brisbane",..: 4 1 1 1 4 4 1 3 3 4 ...

# Converting a character variable to factor
merged_df$Location <- as.factor(merged_df$Location)
# Converting a numeric variable to factor
merged_df$Is_Spicy <- as.factor(merged_df$Is_Spicy)

I used str(merged_df) to examine the merged_df data frame’s structure, which includes data types and sample values. This helped me understand how the data is organized.
To make working with categorical data easier, I converted the ‘Location’ column in merged_df from text to categories (factors) using merged_df$Location <- as.factor(merged_df$Location).
Similarly, I converted the ‘Is_Spicy’ column in merged_df from numeric values to categories using merged_df$Is_Spicy <- as.factor(merged_df$Is_Spicy). This change allows me to represent categories like “spicy” or “not spicy.”

The above steps are ensuring that categorical variables were correctly represented in the merged_df data frame.

Generate summary statistics

# Loading the required library
library(dplyr)

# Step 4: Generating summary statistics
summary_stats <- merged_df %>%
  group_by(Location) %>%
  summarise(
    Sales_Mean = mean(Sales, na.rm = TRUE),
    Sales_Median = median(Sales, na.rm = TRUE),
    Sales_1st_Quartile = quantile(Sales, 0.25, na.rm = TRUE),
    Sales_3rd_Quartile = quantile(Sales, 0.75, na.rm = TRUE),
    Sales_SD = sd(Sales, na.rm = TRUE)
  )
print(summary_stats)

## # A tibble: 4 × 6
##   Location  Sales_Mean Sales_Median Sales_1st_Quartile Sales_3rd_Quartile
##   <fct>          <dbl>        <dbl>              <dbl>              <dbl>
## 1 Melbourne       586.         567                462.               717 
## 2 Brisbane        577.         512                380                797 
## 3 Perth           560.         696                293                809 
## 4 Sydney          505.         492.               317.               668.
## # ℹ 1 more variable: Sales_SD <dbl>

First, I load the dplyr package using library(dplyr) to make data manipulation easier.
Then, I create a new dataframe called summary_stats by working with the merged_df dataframe.
I group the data by the ‘Location’ variable using group_by(Location) %>%.
Next, I calculate summary statistics for each group using summarise().
For each location, I calculate and store the mean ‘Sales’ value in Sales_Mean, excluding any missing values.
I also calculate and store the median ‘Sales’ value for each location in Sales_Median, excluding missing values.
Additionally, I calculate and store the 25th percentile (1st quartile) of ‘Sales’ for each location in Sales_1st_Quartile, excluding missing values.
Similarly, I calculate and store the 75th percentile (3rd quartile) of ‘Sales’ for each location in Sales_3rd_Quartile, excluding missing values.
Finally, I calculate and store the standard deviation of ‘Sales’ for each location in Sales_SD, excluding missing values.
To review the results, I print the summary_stats data frame, which displays the calculated summary statistics for each location.

These steps will allow one to understand how ‘Sales’ are distributed across different locations in the data set.

Scanning data

# Find missing values in the dataset
missing_values_before <- colSums(is.na(merged_df))
# Print the variables with missing values before
missing_vars_before <- names(missing_values_before)[missing_values_before > 0]
print("Missing values before handling:")

## [1] "Missing values before handling:"

print(missing_vars_before)

## [1] "Sales"    "Calories"

print(missing_values_before)

##         Prod_ID     Menu_Item.x          Rating    Review_Count        Is_Spicy 
##               0               0               0               0               0 
## Contains_Garlic           Sales New_Co_relation     Menu_Item.y           Price 
##               0               5               0               0               0 
##        Calories       Protein_g           Fat_g        Location 
##               7               0               0               0

# Handle missing values in 'Sales' column
merged_df$Sales[is.na(merged_df$Sales)] <- mean(merged_df$Sales, na.rm = TRUE)
# Handle missing values in 'Calories' column
merged_df$Calories[is.na(merged_df$Calories)] <- mean(merged_df$Calories, na.rm = TRUE)
# Find missing values in the dataset after handling
missing_values_after <- colSums(is.na(merged_df))
# Print the variables with missing values after
missing_vars_after <- names(missing_values_after)[missing_values_after > 0]
print("Missing values after handling:")

## [1] "Missing values after handling:"

print(missing_vars_after)

## character(0)

print(missing_values_after)

##         Prod_ID     Menu_Item.x          Rating    Review_Count        Is_Spicy 
##               0               0               0               0               0 
## Contains_Garlic           Sales New_Co_relation     Menu_Item.y           Price 
##               0               0               0               0               0 
##        Calories       Protein_g           Fat_g        Location 
##               0               0               0               0

# Print the updated data frame
print("Updated data frame:")

## [1] "Updated data frame:"

head(merged_df)

I calculated the total number of missing values in each column of my merged_df data frame and stored these counts in missing_values_before.
Then, I identified the columns with missing values (where the count was greater than 0) and saved their names in missing_vars_before.
To provide clarity, I printed a message indicating that the following output would display columns with missing values before any changes.
I displayed the names of columns with missing values before any modifications to get a clear picture.
Next, I displayed the count of missing values for each column before any modifications to understand the extent of missing data.
For data consistency, I replaced missing values in the ‘Sales’ column with the mean of the non-missing values in that column.
Similarly, I replaced missing values in the ‘Calories’ column with the mean of the non-missing values in that column.
After handling missing values, I recalculated the total number of missing values in each column of merged_df and stored these counts in missing_values_after.
I then identified the columns with missing values after handling (where the count was greater than 0) and saved their names in missing_vars_after.
To reflect the changes, I printed a message indicating that the following output would display columns with missing values after handling them.
I displayed the names of columns with missing values after handling them to see the impact of the changes made.
Additionally, I displayed the count of missing values for each column after handling them to ensure that missing values were appropriately addressed.
Finally, to showcase the updated data frame, I printed a message indicating that the following output would display the updated data frame.
I displayed the first few rows of the updated merged_df data frame to visually inspect the changes made to the missing values.

Above steps explains in detail how I identified and handled missing values in my merged_df data frame.

Link to presentation

# Add the link to your presentation here

Presentation link : https://rmit-arc.instructuremedia.com/embed/c775e396-26c7-4cb7-a8c4-4603bfafb253

REFERENCES :

RMIT. (2024). Module 3: Understand: Understanding Data and Data Structures, Data Wrangling (preprocessing). Accessed: 22 April 2024. Available: https://rare-phoenix-161610.appspot.com/secured/Module_03.html
RMIT. (2024). Module 4: Tidy and Manipulate: Tidy Data Principles and Manipulating Data, Data Wrangling (Preprocessing). Accessed: 23 April 2024. Available: https://rare-phoenix-161610.appspot.com/secured/Module_04.html#The_dplyr_packag
RMIT. (2024). Module 5: Scan - Missing Values, Data Wrangling (Preprocessing). Accessed: 24 April 2024. Available: https://rare-phoenix-161610.appspot.com/secured/Module_05.html#Excluding_Missing_Data
Quillbot Paraphrasing Tool: Used this for documentation. Accessed: 29th April 2024. Available: https://quillbot.com/
Grammarly Paraphrasing Tool: Used this for documentation. Accessed: 29th April 2024. Available: https://www.grammarly.com/paraphrasing