library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(tidyr)
library(reshape2)
Attaching package: 'reshape2'
The following object is masked from 'package:tidyr':
smiths
# Set the working directory
setwd("C:\\Users\\USER\\OneDrive\\Desktop\\Monash\\ETW2001")
# Import the CSV file
bigbasket_data <- read.csv("bigbasket.csv", header = TRUE, stringsAsFactors = FALSE)
head(bigbasket_data)
Handling missing values correctly ensures complete and accurate data for analysis, leading to more reliable business insights. Missing values can produce incomplete or biased analyses, which in turn can misinform business decisions: for example, analyzing customer purchase patterns with missing data could lead to incorrect conclusions about customer behavior and, in turn, to poor marketing or inventory decisions. In this dataset, the missing values are removed so that the analysis is based only on complete and meaningful records, avoiding the misinterpretation and distortion that gaps can introduce and that may mask important variations. This maintains the integrity of the dataset and allows analysts to focus on meaningful information rather than trying to account for gaps or inconsistencies.
# View data summary
print("Summary before removing missing values: ")
[1] "Summary before removing missing values: "
summary(bigbasket_data)
index product category sub_category
Min. : 1 Length:27561 Length:27561 Length:27561
1st Qu.: 6891 Class :character Class :character Class :character
Median :13781 Mode :character Mode :character Mode :character
Mean :13781
3rd Qu.:20671
Max. :27555
brand sale_price market_price type
Length:27561 Min. : 2.45 Min. : 3 Length:27561
Class :character 1st Qu.: 95.00 1st Qu.: 100 Class :character
Mode :character Median : 190.00 Median : 220 Mode :character
Mean : 322.49 Mean : 382
3rd Qu.: 359.00 3rd Qu.: 425
Max. :12500.00 Max. :12500
rating
Min. :1.000
1st Qu.:3.700
Median :4.100
Mean :3.943
3rd Qu.:4.300
Max. :5.000
NA's :8626
# Determine the number of missing values in each column of the dataset.
colSums(is.na(bigbasket_data))
index product category sub_category brand sale_price
0 0 0 0 0 0
market_price type rating
0 0 8626
# Remove rows with any missing values in the rating column.
bigbasket_data_clean <- bigbasket_data[!is.na(bigbasket_data$rating), ]
# Check the number of missing values after the data is cleaned.
colSums(is.na(bigbasket_data_clean))
index product category sub_category brand sale_price
0 0 0 0 0 0
market_price type rating
0 0 0
# Check data summary
print("Summary after removing missing values: ")
[1] "Summary after removing missing values: "
summary(bigbasket_data_clean)
index product category sub_category
Min. : 1 Length:18935 Length:18935 Length:18935
1st Qu.: 6868 Class :character Class :character Class :character
Median :13775 Mode :character Mode :character Mode :character
Mean :13798
3rd Qu.:20679
Max. :27555
brand sale_price market_price type
Length:18935 Min. : 3.0 Min. : 3.0 Length:18935
Class :character 1st Qu.: 85.0 1st Qu.: 96.0 Class :character
Mode :character Median : 165.0 Median : 196.0 Mode :character
Mean : 267.7 Mean : 322.1
3rd Qu.: 306.0 3rd Qu.: 370.0
Max. :6660.0 Max. :7400.0
rating
Min. :1.000
1st Qu.:3.700
Median :4.100
Mean :3.943
3rd Qu.:4.300
Max. :5.000
# View the first few rows of the cleaned dataset
head(bigbasket_data_clean)
The initial summary shows that the rating column has missing values, which can skew the analysis, since missing data can lead to inaccurate conclusions. The output of colSums(is.na(bigbasket_data)) reveals which columns have missing values and how many; this count is essential for understanding the extent of the missing-data problem in the dataset. After removing the missing values, the number of rows in the dataset has decreased, because rows with missing rating values were dropped, and the rating column now contains no missing values. The dataset is therefore cleaner and more reliable for analysis. head(bigbasket_data_clean) displays the first few rows of the cleaned dataset, confirming that no rows with missing rating values remain.
2. Data redundancy
Resolving data redundancy prevents skewed results by ensuring the analysis is based on unique records, making the insights more trustworthy. Duplicate records can distort results by over-representing certain observations. In this dataset, redundancy might occur in product listings where the same item is entered multiple times; this inflates the number of products and their ratings and distorts the analysis of discount effectiveness, leading to incorrect conclusions about which products to discount further. Duplicate rows are therefore identified and removed to improve data quality and enhance efficiency.
# Remove duplicate rows
bigbasket_data_clean_2 <- bigbasket_data_clean %>%
  distinct()
# Verify that duplicates have been removed
print(paste("Number of rows before removing duplicates: ", nrow(bigbasket_data_clean)))
[1] "Number of rows before removing duplicates: 18935"
print(paste("Number of rows after removing duplicates: ", nrow(bigbasket_data_clean_2)))
[1] "Number of rows after removing duplicates: 18929"
# Check if there are any remaining duplicate rows
any(duplicated(bigbasket_data_clean_2))
[1] FALSE
# View the first few rows of the cleaned dataset
head(bigbasket_data_clean_2)
The output shows a reduction in rows: the count after removing duplicates is lower than before, indicating that duplicate rows were found and removed. This is a positive outcome, as it confirms that redundancy in the data has been addressed. The fact that any(duplicated(bigbasket_data_clean_2)) returns FALSE also confirms that no duplicate rows remain. The dataset is now ready for analysis without the risk of redundancy affecting the results.
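Before dropping duplicates, it can also help to inspect which records were affected. A small sketch (an addition, not part of the original submission) using base R's duplicated():

# Show every row that participates in a duplicate pair, not just the later copies
dup_rows <- bigbasket_data_clean[duplicated(bigbasket_data_clean) |
                                   duplicated(bigbasket_data_clean, fromLast = TRUE), ]
head(dup_rows)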
3. Removing outliers in prices
Removing outliers provides a more accurate representation of the data and avoids misleading conclusions that could affect business strategy. Outliers can skew statistical analyses and reduce the accuracy of models; by identifying and handling them appropriately, we ensure that the analysis reflects the typical trends in the data, leading to more reliable business insights. In this data, outliers are detected and removed using the Interquartile Range (IQR) method: for instance, sale prices and market prices that are far higher than the rest of the data are removed from the dataset.
# Calculate the IQR for sale_price and market_price
IQR_sale_price <- IQR(bigbasket_data_clean_2$sale_price, na.rm = TRUE)
IQR_market_price <- IQR(bigbasket_data_clean_2$market_price, na.rm = TRUE)
# Calculate the lower and upper bounds for outliers
lower_bound_sale_price <- quantile(bigbasket_data_clean_2$sale_price, 0.25, na.rm = TRUE) - (1.5 * IQR_sale_price)
upper_bound_sale_price <- quantile(bigbasket_data_clean_2$sale_price, 0.75, na.rm = TRUE) + (1.5 * IQR_sale_price)
lower_bound_market_price <- quantile(bigbasket_data_clean_2$market_price, 0.25, na.rm = TRUE) - 1.5 * IQR_market_price
upper_bound_market_price <- quantile(bigbasket_data_clean_2$market_price, 0.75, na.rm = TRUE) + 1.5 * IQR_market_price
# Remove outliers; both price columns are already numeric, so they can be compared directly
bigbasket_data_no_outliers <- bigbasket_data_clean_2[
  bigbasket_data_clean_2$sale_price >= lower_bound_sale_price &
    bigbasket_data_clean_2$sale_price <= upper_bound_sale_price &
    bigbasket_data_clean_2$market_price >= lower_bound_market_price &
    bigbasket_data_clean_2$market_price <= upper_bound_market_price, ]
# View the first few rows of the data with no outliers
head(bigbasket_data_no_outliers)
# Create boxplots for sale_price and market_price before removing outliers
boxplot(bigbasket_data_clean_2$sale_price, main = "Sale Price with Outliers",
        ylab = "Sale Price", col = "lightblue")
boxplot(bigbasket_data_clean_2$market_price, main = "Market Price with Outliers",
        ylab = "Market Price", col = "darkblue")
# Create boxplots for sale_price and market_price after removing outliers
boxplot(bigbasket_data_no_outliers$sale_price, main = "Sale Price without Outliers",
        ylab = "Sale Price", col = "lightblue", range = 0,
        ylim = c(min(bigbasket_data_no_outliers$sale_price), upper_bound_sale_price))
boxplot(bigbasket_data_no_outliers$market_price, main = "Market Price without Outliers",
        ylab = "Market Price", col = "darkblue", range = 0,
        ylim = c(min(bigbasket_data_no_outliers$market_price), upper_bound_market_price))
Justification of the Code:
Outliers can significantly distort statistical analyses by inflating measures such as the mean and variance, leading to incorrect conclusions. Applying the IQR-based method makes the dataset more representative of typical data points and ensures that only values within a reasonable range are retained, enhancing the accuracy of subsequent analyses. Boxplots visualize the spread and central tendency of the data; drawing them before and after outlier removal provides a visual check that extreme values have been filtered out and that the distribution is more regular. The chosen colour schemes (lightblue and darkblue) help distinguish the sale-price and market-price distributions. Comparing the boxplots before and after removal shows a clear reduction in the spread of the data, confirming that the dataset is now consistent and reliable and ready for further analysis.
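As a quick sanity check (an addition, not part of the original output), the number of rows dropped by the IQR filter can be reported directly:

# Count how many rows the IQR filter removed
rows_removed <- nrow(bigbasket_data_clean_2) - nrow(bigbasket_data_no_outliers)
print(paste("Rows removed as outliers:", rows_removed))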
4. Creating a new column
Creating a new column from existing data supports better segmentation and categorization, enabling more detailed analysis and better-targeted strategies and decisions. This allows for more granular analysis and can provide deeper insights into different segments of the data. In this case, a “discount_percentage” column, showing the discount as a percentage of the market price for each product, is created to help analyze how discount levels relate to sales and customer preferences.
# Create the discount_percentage column
new_bigbasket_data <- bigbasket_data_no_outliers %>%
  mutate(discount_percentage = ((market_price - sale_price) / market_price) * 100)
# View the first few rows to check the new column
head(new_bigbasket_data)
The “discount_percentage” column is crucial for analyzing pricing strategies, customer behavior, and marketing effectiveness. By quantifying the discount as a percentage of the market price, discounts can be compared across products regardless of their absolute prices. This makes it possible to identify which products are heavily discounted and might attract more customers, and to understand the overall discounting strategy of the business.
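For instance, a short summary of discounts by category makes such comparisons immediate (a sketch using the columns created above; the exact output depends on the cleaned data):

# Average and maximum discount for each category, highest first
new_bigbasket_data %>%
  group_by(category) %>%
  summarise(avg_discount = mean(discount_percentage),
            max_discount = max(discount_percentage)) %>%
  arrange(desc(avg_discount))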
Question 3: Price variance analysis
1. Identify the top 100 products with the highest discount rate.
# Filter the top 100 products with the highest discount percentage
top_100_discounted_products <- new_bigbasket_data %>%
  arrange(desc(discount_percentage)) %>%
  head(100)
# Count products per category among the top 100
top_category <- top_100_discounted_products %>%
  group_by(category) %>%
  summarise(count = n()) %>%
  arrange(desc(count))
# Display the category counts, highest first
print(top_category)
2. Which category does the majority of the top 100 products fall under?
The majority of the top 100 products fall under Kitchen, Garden & Pets.
3. Why do you think this category needs the highest discount rate compared to others?
This category likely needs the highest discount rate because it faces strong competitive pressure: many of its products are similar and serve the same purpose, so discounts attract price-sensitive customers and help outcompete other retailers. Discounts also speed up inventory turnover before products become unsellable. Seasonal products in this category, such as gardening tools, are relevant only at certain times of the year, so high discounts help clear inventory before the season ends. Additionally, if the seller aims to establish a stronger presence in this category, aggressive discounting can drive market penetration by attracting new customers.
Question 4
As a business analyst, which product will you correlate with the category you found in (3) to increase the sales of the supermarket? Identify that and Explain why.
To boost sales, it is important to choose products that complement the items in the “Kitchen, Garden & Pets” category. As a business analyst, I would correlate food and household products with this category. Items in the kitchen segment, such as blenders, cookware, or storage containers, are directly tied to food preparation and storage; correlating them with food products creates a natural pairing that meets customers’ immediate needs. For example, a customer buying a blender is likely also interested in purchasing fruits, vegetables, or smoothie ingredients, which increases the value of their purchase. Likewise, household products such as cleaning supplies, detergents, or storage solutions are essential for maintaining a well-organized and clean home, and they naturally complement kitchen appliances or garden tools that may require cleaning or organization. Food and household products are therefore the natural pairing for the “Kitchen, Garden & Pets” category.
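To check this pairing against the data, a minimal sketch along the following lines could compare the focal category with a proposed complement. The category labels are taken from the sample rows shown earlier and may not cover every relevant category in the full dataset:

# Compare rating and discount profiles of the focal category and a complementary one
new_bigbasket_data %>%
  filter(category %in% c("Kitchen, Garden & Pets", "Cleaning & Household")) %>%
  group_by(category) %>%
  summarise(avg_rating = mean(rating),
            avg_discount = mean(discount_percentage),
            n_products = n())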
Question 5
Use the dataset produced in question 2. Write a conditional statement using the “rating” variable to classify the products into three categories: i) high, ii) average, and iii) poor and store the information in a new column in the same dataset.
# Initialize the new column
new_bigbasket_data$rating_category <- NA
# Categorize data based on rating
for (i in 1:nrow(new_bigbasket_data)) {
  if (new_bigbasket_data$rating[i] >= 4.5) {
    new_bigbasket_data$rating_category[i] <- "high"
  } else if (new_bigbasket_data$rating[i] >= 3.5 && new_bigbasket_data$rating[i] < 4.5) {
    new_bigbasket_data$rating_category[i] <- "average"
  } else {
    new_bigbasket_data$rating_category[i] <- "poor"
  }
}
# View the first few rows of the updated dataset
head(new_bigbasket_data)
index product
1 1 Garlic Oil - Vegetarian Capsule 500 mg
2 2 Water Bottle - Orange
3 3 Brass Angle Deep - Plain, No.2
4 4 Cereal Flip Lid Container/Storage Jar - Assorted Colour
5 5 Creme Soft Soap - For Hands & Body
6 6 Germ - Removal Multipurpose Wipes
category sub_category brand sale_price
1 Beauty & Hygiene Hair Care Sri Sri Ayurveda 220
2 Kitchen, Garden & Pets Storage & Accessories Mastercook 180
3 Cleaning & Household Pooja Needs Trm 119
4 Cleaning & Household Bins & Bathroom Ware Nakoda 149
5 Beauty & Hygiene Bath & Hand Wash Nivea 162
6 Cleaning & Household All Purpose Cleaners Nature Protect 169
market_price type rating discount_percentage
1 220 Hair Oil & Serum 4.1 0.00000
2 180 Water & Fridge Bottles 2.3 0.00000
3 250 Lamp & Lamp Oil 3.4 52.40000
4 176 Laundry, Storage Baskets 3.7 15.34091
5 162 Bathing Bars & Soaps 4.4 0.00000
6 199 Disinfectant Spray & Cleaners 3.3 15.07538
rating_category
1 average
2 poor
3 poor
4 average
5 average
6 poor
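As a design note (an addition, not part of the original submission), the same classification can be written without an explicit loop. A minimal vectorized sketch using dplyr's case_when(), which evaluates conditions top-down and so reproduces the loop's logic:

# Vectorized equivalent of the loop above; conditions are checked in order
new_bigbasket_data <- new_bigbasket_data %>%
  mutate(rating_category = case_when(
    rating >= 4.5 ~ "high",
    rating >= 3.5 ~ "average",
    TRUE          ~ "poor"
  ))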
Section B
Question 1: List down three business strategies used by ZARA to be the number one fashion store worldwide.
ZARA employs several effective business strategies to maintain its status as a leading fashion retailer:
ZARA uses a fast-fashion model, quickly designing, producing, and delivering new clothing collections. This allows the brand to respond rapidly to changing fashion trends and customer preferences, keeps the offerings fresh, and encourages frequent visits from customers.
ZARA produces limited quantities of each design to create a sense of exclusivity. This scarcity drives demand and prompts customers to purchase items quickly, fearing they might miss out.
ZARA also maintains strong supply chain management to ensure efficiency and speed. The company closely monitors inventory and employs a just-in-time production approach, allowing for quick restocking and reducing excess inventory costs.
Question 2: Assume that you are hired as a business analyst in ZARA to analyze their E-Commerce data. Your first task is to create a data frame and perform basic analysis using conditional statements.
Create a data frame with 5 columns, including productID (create integers from 1 – 20 by yourself), product name, category, subcategory, and price using the information from the website. You must create 20 observations.
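The construction of the ZARA_products data frame is not shown in this document, although the code below relies on it. The following sketch rebuilds it with the required structure; the product names and prices are illustrative placeholders rather than the original figures from the ZARA website, chosen so that Blazers carry the highest price and Trousers the lowest, consistent with the results reported below.

# Rebuild the data frame; names and prices are illustrative placeholders
ZARA_products <- data.frame(
  ProductID   = 1:20,
  ProductName = paste("Product", 1:20),
  Category    = rep(c("Women", "Men"), each = 10),
  Subcategory = rep(c("Blazers", "Dresses", "T-Shirts", "Trousers", "Jackets"), times = 4),
  Price       = c(459, 259, 99, 79, 299,
                  399, 229, 109, 89, 319,
                  429, 249, 119, 85, 289,
                  449, 239, 129, 95, 309)
)
head(ZARA_products)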
Write a conditional statement using the price of the products in your data frame as an independent variable and subcategory as your target variable. Identify the subcategories at the highest and lowest prices using the conditional statement.
# Initialize variables to store the highest and lowest prices and their corresponding subcategories
highest_price <- -Inf
lowest_price <- Inf
highest_price_subcategory <- NA
lowest_price_subcategory <- NA
# Loop through each row in the dataset
for (i in 1:nrow(ZARA_products)) {
  # Check for the highest price
  if (ZARA_products$Price[i] > highest_price) {
    highest_price <- ZARA_products$Price[i]
    highest_price_subcategory <- ZARA_products$Subcategory[i]
  }
  # Check for the lowest price
  if (ZARA_products$Price[i] < lowest_price) {
    lowest_price <- ZARA_products$Price[i]
    lowest_price_subcategory <- ZARA_products$Subcategory[i]
  }
}
# Print results
print(paste("The subcategory with the highest price is:", highest_price_subcategory))
[1] "The subcategory with the highest price is: Blazers"
print(paste("The subcategory with the lowest price is:", lowest_price_subcategory))
[1] "The subcategory with the lowest price is: Trousers"
Write a conditional statement to identify the products in Q1 and Q3 of the price range and explain the output.
# Calculate Q1 and Q3
Q1 <- quantile(ZARA_products$Price, 0.25, na.rm = TRUE)
Q3 <- quantile(ZARA_products$Price, 0.75, na.rm = TRUE)
# Identify products in Q1
products_Q1 <- ZARA_products %>%
  filter(Price <= Q1)
# Identify products in Q3
products_Q3 <- ZARA_products %>%
  filter(Price >= Q3)
# View results
print(paste("Products in the first quartile (Q1) with a price less than or equal to", Q1, ":"))
[1] "Products in the first quartile (Q1) with a price less than or equal to 186.5 :"
This code categorizes products into different price ranges, allowing analysis of the price distribution. Products at or below Q1 are the least expensive (bottom 25%), while products at or above Q3 are the most expensive (top 25%). The products in Q1 are generally more affordable and could be aimed at customers looking for budget-friendly options, while the products in Q3 are more expensive and might be premium items offering higher quality, better materials, or exclusive designs. Any products not captured by either filter fall between Q1 and Q3.
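To make that middle group explicit, the bands could also be stored as a column. A sketch (an addition, not part of the original submission) using dplyr's case_when():

# Label each product with its price band, including the mid-range group
ZARA_products <- ZARA_products %>%
  mutate(price_band = case_when(
    Price <= Q1 ~ "Q1 (budget)",
    Price >= Q3 ~ "Q3 (premium)",
    TRUE        ~ "mid-range"
  ))
table(ZARA_products$price_band)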