# Loading the required library
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(dplyr) : R package, this package provides a consistent unified set of tools and functions for data manipulation.
# Set seeding
set.seed(150)
# Defining the number of rows in each data set
n_rows <- 100
# Data set 1 : df1
df1 <- data.frame(
Prod_ID = sample(1:100),
Menu_Item = sample(c('Harvest Bowl', 'Double Cheese & Bacon', 'Avo Parmesan Crunch with Protein', 'Paella'), n_rows, replace = TRUE),
Price = round(runif(n_rows, 10, 50), 2),
Calories = sample(200:1000, n_rows, replace = TRUE),
Protein_g = sample(10:50, n_rows, replace = TRUE),
Fat_g = sample(5:30, n_rows, replace = TRUE),
Location = sample(c('Melbourne', 'Brisbane', 'Perth', 'Sydney'), n_rows, replace = TRUE)
)
# Calculate 5% of the number of rows
n_rows <- nrow(df1)
n_NA <- round(n_rows * 0.05)
# Randomly select 5% of the rows
rows_to_NA <- sample(1:n_rows, n_NA)
# Set the 'Calories' value to NA for the selected rows
df1$Calories[rows_to_NA] <- NA
# Print the dataframe
head(df1)
# Data set 2 : df2
# Create the data frame
df2 <- data.frame(
Prod_ID = sample(1:100, n_rows, replace = TRUE),
Menu_Item = sample(c('Harvest Bowl', 'Double Cheese & Bacon', 'Avo Parmesan Crunch with Protein', 'Paella'), n_rows, replace = TRUE),
Rating = round(runif(n_rows, 1, 5), 1),
Review_Count = rpois(n_rows, 10),
Is_Spicy = sample(c(0, 1), n_rows, replace = TRUE),
Contains_Garlic = sample(c(0, 1), n_rows, replace = TRUE),
Sales = sample(100:1000, n_rows, replace = TRUE)
)
# Set the desired correlation between Rating and Review_Count
desired_correlation <- 0.5
# Introduce correlated random data for Rating and Review_Count
df2$Review_Count <- desired_correlation * df2$Rating + rnorm(n_rows, mean = 0, sd = sqrt(1 - desired_correlation^2))
# Introduce a new column correlated with Rating
df2$New_Co_relation <- 0.7 * df2$Rating + rnorm(n_rows, mean = 0, sd = sqrt(1 - 0.7^2))
# Calculate the correlation between Rating and New_Co_relation
cor_rating_new <- cor(df2$Rating, df2$New_Co_relation)
# Print the correlation coefficient between Rating and New_Co_relation
print(cor_rating_new)
## [1] 0.6312243
# Calculate the correlation between Rating and Review_Count
cor_rating_review <- cor(df2$Rating, df2$Review_Count)
# Print the correlation coefficient between Rating and Review_Count
print(cor_rating_review)
## [1] 0.5065912
# Calculate 5% of the number of rows
n_rows <- nrow(df2)
n_NA <- round(n_rows * 0.05)
# Randomly select 5% of the rows
rows_to_NA <- sample(1:n_rows, n_NA)
# Set the 'SALES' value to NA for the selected rows
df2$Sales[rows_to_NA] <- NA
# Print the dataframe
head(df2)
Nando’s, a popular restaurant chain known for its flame-grilled peri-peri chicken.
As a data analyst at Nando’s, a well-known restaurant chain recognized for its delicious flame-grilled peri-peri chicken, Here’s my approach would be:
I began by setting a seed for the random number generator to ensure consistent results.
Then, I initialized the number of rows to 100.
After that, I created a data frame called df1 with columns like Product ID, Menu Item, Price, Calories, Protein, Fat, and Location, populating them with random or predefined values.
I adjusted the number of rows to match the actual count in df1.
To introduce missing Calorie values in df1, I calculated 5% of the total rows and randomly selected that many row indices.
I then set the Calorie values to missing (NA) for those selected rows.
I displayed the first few rows of df1 to verify the changes.
Next, I created another data frame, df2, with similar columns to df1 but different random or predefined values.
This time, I introduced missing values for the Sales column following the same process as with Calories in df1.
I displayed the first few rows of df2 to confirm the changes.
Moving on to df2, I set the desired correlation between the ‘Rating’ and ‘Review_Count’ columns to 0.5.
To achieve this, I created a new ‘Review_Count’ column by combining the ‘Rating’ column with a random normal variable, ensuring the desired correlation.
I then created another column, ‘New_Co_relation,’ with values correlated to ‘Rating’ at 0.7, using a similar approach as for ‘Review_Count.’
Finally, I calculated and printed the actual correlation coefficients between ‘Rating’ and ‘New_Co_relation,’ as well as between ‘Rating’ and ‘Review_Count,’ to verify the results.
# Merging the data sets on the common variable 'Prod_ID'
library(dplyr)
merged_df <- df2 %>% inner_join(df1, by = "Prod_ID")
merged_df <- merge(df2, df1, by = "Prod_ID")
# Convert a numeric variable to factor
merged_df$Is_Spicy <- as.factor(merged_df$Is_Spicy)
# Converting a character variable to factor
merged_df$Location <- as.factor(merged_df$Location)
# Labeling the levels of the factor variable 'Is_Spicy'
levels(merged_df$Is_Spicy) <- c("Not Spicy", "Spicy")
# Ordering the levels of the factor variable 'Location'
mrg <- merged_df$Location <- factor(merged_df$Location, levels = c('Melbourne', 'Brisbane', 'Perth', 'Sydney'))
head(merged_df)
I first load the dplyr package, which is helpful for manipulating data.
Then, I combine two data frames, df2 and df1, based on their ‘Prod_ID’ columns, creating a new merged data frame named merged_df.
Alternatively, I can achieve the same merge using the merge function.
Following that, I convert the ‘Is_Spicy’ column in merged_df into a categorical variable (factor), assuming it originally represented spiciness levels with numbers.
Similarly, I convert the ‘Location’ column in merged_df into a categorical variable, assuming it was initially stored as text.
To improve clarity, I change the levels of the ‘Is_Spicy’ variable to “Not Spicy” and “Spicy,” assuming 0 represents “Not Spicy” and 1 represents “Spicy.”
I rearrange the levels of the ‘Location’ variable to match a specific order: ‘Melbourne’, ‘Brisbane’, ‘Perth’, ‘Sydney.’
Lastly, I display the first few rows of merged_df to inspect the changes using the head function.
# Checking the structure of the combined data set
str(merged_df)
## 'data.frame': 100 obs. of 14 variables:
## $ Prod_ID : int 2 3 3 5 8 8 10 12 12 13 ...
## $ Menu_Item.x : chr "Double Cheese & Bacon" "Paella" "Harvest Bowl" "Double Cheese & Bacon" ...
## $ Rating : num 1.1 2.9 1.3 3.3 3.6 2 1.7 1.5 2.1 1.2 ...
## $ Review_Count : num 0.484 1.7799 -1.0355 0.0295 0.7719 ...
## $ Is_Spicy : Factor w/ 2 levels "Not Spicy","Spicy": 2 2 2 2 2 2 1 2 2 1 ...
## $ Contains_Garlic: num 1 0 1 0 1 1 0 1 0 1 ...
## $ Sales : int 940 518 567 615 405 660 738 696 721 135 ...
## $ New_Co_relation: num 1.09 2.45 1.83 1.88 3.31 ...
## $ Menu_Item.y : chr "Double Cheese & Bacon" "Paella" "Paella" "Double Cheese & Bacon" ...
## $ Price : num 41.8 22.6 22.6 45 32 ...
## $ Calories : int 574 979 979 345 753 753 571 294 294 NA ...
## $ Protein_g : int 43 18 18 32 43 43 46 26 26 47 ...
## $ Fat_g : int 18 7 7 30 11 11 11 18 18 13 ...
## $ Location : Factor w/ 4 levels "Melbourne","Brisbane",..: 4 1 1 1 4 4 1 3 3 4 ...
# Converting a character variable to factor
merged_df$Location <- as.factor(merged_df$Location)
# Converting a numeric variable to factor
merged_df$Is_Spicy <- as.factor(merged_df$Is_Spicy)
I used str(merged_df) to examine the
merged_df data frame’s structure, which includes data types
and sample values. This helped me understand how the data is
organized.
To make working with categorical data easier, I converted the
‘Location’ column in merged_df from text to categories
(factors) using
merged_df$Location <- as.factor(merged_df$Location).
Similarly, I converted the ‘Is_Spicy’ column in
merged_df from numeric values to categories using
merged_df$Is_Spicy <- as.factor(merged_df$Is_Spicy).
This change allows me to represent categories like “spicy” or “not
spicy.”
The above steps are ensuring that categorical variables were
correctly represented in the merged_df data
frame.
# Loading the required library
library(dplyr)
# Step 4: Generating summary statistics
summary_stats <- merged_df %>%
group_by(Location) %>%
summarise(
Sales_Mean = mean(Sales, na.rm = TRUE),
Sales_Median = median(Sales, na.rm = TRUE),
Sales_1st_Quartile = quantile(Sales, 0.25, na.rm = TRUE),
Sales_3rd_Quartile = quantile(Sales, 0.75, na.rm = TRUE),
Sales_SD = sd(Sales, na.rm = TRUE)
)
print(summary_stats)
## # A tibble: 4 × 6
## Location Sales_Mean Sales_Median Sales_1st_Quartile Sales_3rd_Quartile
## <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Melbourne 586. 567 462. 717
## 2 Brisbane 577. 512 380 797
## 3 Perth 560. 696 293 809
## 4 Sydney 505. 492. 317. 668.
## # ℹ 1 more variable: Sales_SD <dbl>
First, I load the dplyr package using library(dplyr)
to make data manipulation easier.
Then, I create a new dataframe called summary_stats
by working with the merged_df dataframe.
I group the data by the ‘Location’ variable using
group_by(Location) %>%.
Next, I calculate summary statistics for each group using
summarise().
For each location, I calculate and store the mean ‘Sales’ value
in Sales_Mean, excluding any missing values.
I also calculate and store the median ‘Sales’ value for each
location in Sales_Median, excluding missing
values.
Additionally, I calculate and store the 25th percentile (1st
quartile) of ‘Sales’ for each location in
Sales_1st_Quartile, excluding missing values.
Similarly, I calculate and store the 75th percentile (3rd
quartile) of ‘Sales’ for each location in
Sales_3rd_Quartile, excluding missing values.
Finally, I calculate and store the standard deviation of ‘Sales’
for each location in Sales_SD, excluding missing
values.
To review the results, I print the summary_stats
data frame, which displays the calculated summary statistics for each
location.
These steps will allow one to understand how ‘Sales’ are distributed across different locations in the data set.
# Find missing values in the dataset
missing_values_before <- colSums(is.na(merged_df))
# Print the variables with missing values before
missing_vars_before <- names(missing_values_before)[missing_values_before > 0]
print("Missing values before handling:")
## [1] "Missing values before handling:"
print(missing_vars_before)
## [1] "Sales" "Calories"
print(missing_values_before)
## Prod_ID Menu_Item.x Rating Review_Count Is_Spicy
## 0 0 0 0 0
## Contains_Garlic Sales New_Co_relation Menu_Item.y Price
## 0 5 0 0 0
## Calories Protein_g Fat_g Location
## 7 0 0 0
# Handle missing values in 'Sales' column
merged_df$Sales[is.na(merged_df$Sales)] <- mean(merged_df$Sales, na.rm = TRUE)
# Handle missing values in 'Calories' column
merged_df$Calories[is.na(merged_df$Calories)] <- mean(merged_df$Calories, na.rm = TRUE)
# Find missing values in the dataset after handling
missing_values_after <- colSums(is.na(merged_df))
# Print the variables with missing values after
missing_vars_after <- names(missing_values_after)[missing_values_after > 0]
print("Missing values after handling:")
## [1] "Missing values after handling:"
print(missing_vars_after)
## character(0)
print(missing_values_after)
## Prod_ID Menu_Item.x Rating Review_Count Is_Spicy
## 0 0 0 0 0
## Contains_Garlic Sales New_Co_relation Menu_Item.y Price
## 0 0 0 0 0
## Calories Protein_g Fat_g Location
## 0 0 0 0
# Print the updated data frame
print("Updated data frame:")
## [1] "Updated data frame:"
head(merged_df)
I calculated the total number of missing values in each column of
my merged_df data frame and stored these counts in
missing_values_before.
Then, I identified the columns with missing values (where the
count was greater than 0) and saved their names in
missing_vars_before.
To provide clarity, I printed a message indicating that the following output would display columns with missing values before any changes.
I displayed the names of columns with missing values before any modifications to get a clear picture.
Next, I displayed the count of missing values for each column before any modifications to understand the extent of missing data.
For data consistency, I replaced missing values in the ‘Sales’ column with the mean of the non-missing values in that column.
Similarly, I replaced missing values in the ‘Calories’ column with the mean of the non-missing values in that column.
After handling missing values, I recalculated the total number of
missing values in each column of merged_df and stored these
counts in missing_values_after.
I then identified the columns with missing values after handling
(where the count was greater than 0) and saved their names in
missing_vars_after.
To reflect the changes, I printed a message indicating that the following output would display columns with missing values after handling them.
I displayed the names of columns with missing values after handling them to see the impact of the changes made.
Additionally, I displayed the count of missing values for each column after handling them to ensure that missing values were appropriately addressed.
Finally, to showcase the updated data frame, I printed a message indicating that the following output would display the updated data frame.
I displayed the first few rows of the updated
merged_df data frame to visually inspect the changes made
to the missing values.
Above steps explains in detail how I identified and handled
missing values in my merged_df data frame.
# Add the link to your presentation here
Presentation link : https://rmit-arc.instructuremedia.com/embed/c775e396-26c7-4cb7-a8c4-4603bfafb253
REFERENCES :
RMIT. (2024). Module 3: Understand: Understanding Data and Data Structures, Data Wrangling (preprocessing). Accessed: 22 April 2024. Available: https://rare-phoenix-161610.appspot.com/secured/Module_03.html
RMIT. (2024). Module 4: Tidy and Manipulate: Tidy Data Principles and Manipulating Data, Data Wrangling (Preprocessing). Accessed: 23 April 2024. Available: https://rare-phoenix-161610.appspot.com/secured/Module_04.html#The_dplyr_packag
RMIT. (2024). Module 5: Scan - Missing Values, Data Wrangling (Preprocessing). Accessed: 24 April 2024. Available: https://rare-phoenix-161610.appspot.com/secured/Module_05.html#Excluding_Missing_Data
Quillbot Paraphrasing Tool: Used this for documentation. Accessed: 29th April 2024. Available: https://quillbot.com/
Grammarly Paraphrasing Tool: Used this for documentation. Accessed: 29th April 2024. Available: https://www.grammarly.com/paraphrasing