Step1:Read the CSV file
uncleaned_data <- read.csv("/Users/mac/Downloads/Nike_Sales_Uncleaned.csv")
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M NA NA
## 2 2001 Women Soccer Tiempo Legend M 3 4957.93
## 3 2002 Women Soccer Premier III M 4 NA
## 4 2003 Kids Lifestyle Blazer Mid L NA 9673.57
## 5 2004 Kids Running React Infinity XL NA NA
## 6 2005 Women Training Flex Trainer M 1 7363.96
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.47 0 3/9/2024 Online bengaluru -770.45
## 2 NA 0 7/9/2024 Retail Hyd -112.53
## 3 NA 0 Retail Mumbai 3337.34
## 4 NA 0 4/10/2024 Online Pune 3376.85
## 5 NA 0 9/12/2024 Retail Delhi 187.89
## 6 NA 0 Retail Delhi 1415.98
Information about the dataset
Over 2,500 transaction records containing:
Multiple product lines (Running, Basketball, Lifestyle, Training, Soccer)
Gender-specific sales (Men, Women, Kids)
Sales from both Retail Stores and Online Channels
Common data issues like:
Null values
Typos in regions
Wrong data types
Negative values in numeric columns
Inconsistent date formats (e.g., 2023/07/21, 21-07-2023, etc.)
Discounts > 100%
Step2:Check for the structure, summary and number of missing values
str(uncleaned_data)
## 'data.frame': 2500 obs. of 13 variables:
## $ Order_ID : int 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
## $ Gender_Category : chr "Kids" "Women" "Women" "Kids" ...
## $ Product_Line : chr "Training" "Soccer" "Soccer" "Lifestyle" ...
## $ Product_Name : chr "SuperRep Go" "Tiempo Legend" "Premier III" "Blazer Mid" ...
## $ Size : chr "M" "M" "M" "L" ...
## $ Units_Sold : int NA 3 4 NA NA 1 NA 1 -1 4 ...
## $ MRP : num NA 4958 NA 9674 NA ...
## $ Discount_Applied: num 0.47 NA NA NA NA NA NA 0.32 NA NA ...
## $ Revenue : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Order_Date : chr "3/9/2024" "7/9/2024" "" "4/10/2024" ...
## $ Sales_Channel : chr "Online" "Retail" "Retail" "Online" ...
## $ Region : chr "bengaluru" "Hyd" "Mumbai" "Pune" ...
## $ Profit : num -770 -113 3337 3377 188 ...
summary(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name
## Min. :2000 Length:2500 Length:2500 Length:2500
## 1st Qu.:2535 Class :character Class :character Class :character
## Median :3192 Mode :character Mode :character Mode :character
## Mean :3194
## 3rd Qu.:3841
## Max. :4499
##
## Size Units_Sold MRP Discount_Applied
## Length:2500 Min. :-1.000 Min. :2007 Min. :0.0000
## Class :character 1st Qu.: 0.000 1st Qu.:4039 1st Qu.:0.3200
## Mode :character Median : 2.000 Median :6110 Median :0.6150
## Mean : 1.482 Mean :6040 Mean :0.6310
## 3rd Qu.: 3.000 3rd Qu.:8022 3rd Qu.:0.9625
## Max. : 4.000 Max. :9996 Max. :1.2500
## NA's :1235 NA's :1254 NA's :1668
## Revenue Order_Date Sales_Channel Region
## Min. :-7561.6 Length:2500 Length:2500 Length:2500
## 1st Qu.: 0.0 Class :character Class :character Class :character
## Median : 0.0 Mode :character Mode :character Mode :character
## Mean : 274.9
## 3rd Qu.: 0.0
## Max. :37169.3
##
## Profit
## Min. :-1199.45
## 1st Qu.: 93.53
## Median : 1371.38
## Mean : 1376.01
## 3rd Qu.: 2660.64
## Max. : 3999.21
##
colSums(is.na(uncleaned_data))
## Order_ID Gender_Category Product_Line Product_Name
## 0 0 0 0
## Size Units_Sold MRP Discount_Applied
## 0 1235 1254 1668
## Revenue Order_Date Sales_Channel Region
## 0 0 0 0
## Profit
## 0
Step3:Replace the blank cells with NA
uncleaned_data[uncleaned_data==""] <- NA
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M NA NA
## 2 2001 Women Soccer Tiempo Legend M 3 4957.93
## 3 2002 Women Soccer Premier III M 4 NA
## 4 2003 Kids Lifestyle Blazer Mid L NA 9673.57
## 5 2004 Kids Running React Infinity XL NA NA
## 6 2005 Women Training Flex Trainer M 1 7363.96
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.47 0 3/9/2024 Online bengaluru -770.45
## 2 NA 0 7/9/2024 Retail Hyd -112.53
## 3 NA 0 <NA> Retail Mumbai 3337.34
## 4 NA 0 4/10/2024 Online Pune 3376.85
## 5 NA 0 9/12/2024 Retail Delhi 187.89
## 6 NA 0 <NA> Retail Delhi 1415.98
Step4:Clean the size column
uncleaned_data$Size<- as.roman(uncleaned_data$Size)
head(uncleaned_data$Size)
## [1] M M M L XL M
sum(is.na(uncleaned_data$Size))
## [1] 510
# we have 510 missing values
uncleaned_data$Size[is.na(uncleaned_data$Size)] <- "M"
head(uncleaned_data$Size)
## [1] M M M L XL M
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M NA NA
## 2 2001 Women Soccer Tiempo Legend M 3 4957.93
## 3 2002 Women Soccer Premier III M 4 NA
## 4 2003 Kids Lifestyle Blazer Mid L NA 9673.57
## 5 2004 Kids Running React Infinity XL NA NA
## 6 2005 Women Training Flex Trainer M 1 7363.96
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.47 0 3/9/2024 Online bengaluru -770.45
## 2 NA 0 7/9/2024 Retail Hyd -112.53
## 3 NA 0 <NA> Retail Mumbai 3337.34
## 4 NA 0 4/10/2024 Online Pune 3376.85
## 5 NA 0 9/12/2024 Retail Delhi 187.89
## 6 NA 0 <NA> Retail Delhi 1415.98
sum(is.na(uncleaned_data$Size))
## [1] 0
#we no longer have missing values in the column "Size"
Step5: Adjust the Units_Sold column
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
max(uncleaned_data$Units_Sold,na.rm=TRUE)
## [1] 4
min(uncleaned_data$Units_Sold,na.rm=TRUE)
## [1] -1
range(uncleaned_data$Units_Sold,na.rm=TRUE)
## [1] -1 4
# we need to normalize the observation "Units_Sold"
uncleaned_data$Units_Sold[uncleaned_data$Units_Sold== "0"|uncleaned_data$Units_Sold=="-1"] <- 1
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M NA NA
## 2 2001 Women Soccer Tiempo Legend M 3 4957.93
## 3 2002 Women Soccer Premier III M 4 NA
## 4 2003 Kids Lifestyle Blazer Mid L NA 9673.57
## 5 2004 Kids Running React Infinity XL NA NA
## 6 2005 Women Training Flex Trainer M 1 7363.96
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.47 0 3/9/2024 Online bengaluru -770.45
## 2 NA 0 7/9/2024 Retail Hyd -112.53
## 3 NA 0 <NA> Retail Mumbai 3337.34
## 4 NA 0 4/10/2024 Online Pune 3376.85
## 5 NA 0 9/12/2024 Retail Delhi 187.89
## 6 NA 0 <NA> Retail Delhi 1415.98
mean <- round(mean(uncleaned_data$Units_Sold, na.rm = TRUE),1)
mean
## [1] 2
#Replace the NA's with value
sum(is.na(uncleaned_data$Units_Sold))
## [1] 1235
uncleaned_data$Units_Sold[is.na(uncleaned_data$Units_Sold)] <- mean #replacing the NA's with mean
head(uncleaned_data$Units_Sold)
## [1] 2 3 4 2 2 1
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M 2 NA
## 2 2001 Women Soccer Tiempo Legend M 3 4957.93
## 3 2002 Women Soccer Premier III M 4 NA
## 4 2003 Kids Lifestyle Blazer Mid L 2 9673.57
## 5 2004 Kids Running React Infinity XL 2 NA
## 6 2005 Women Training Flex Trainer M 1 7363.96
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.47 0 3/9/2024 Online bengaluru -770.45
## 2 NA 0 7/9/2024 Retail Hyd -112.53
## 3 NA 0 <NA> Retail Mumbai 3337.34
## 4 NA 0 4/10/2024 Online Pune 3376.85
## 5 NA 0 9/12/2024 Retail Delhi 187.89
## 6 NA 0 <NA> Retail Delhi 1415.98
#we no longer have missing values
Step6: Clean the MRP column
sum(is.na(uncleaned_data$MRP))
## [1] 1254
uncleaned_data$MRP[is.na(uncleaned_data$MRP)] <- mean(uncleaned_data$MRP,na.rm=TRUE)
uncleaned_data$MRP <- round(uncleaned_data$MRP,digits=0)
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M 2 6040
## 2 2001 Women Soccer Tiempo Legend M 3 4958
## 3 2002 Women Soccer Premier III M 4 6040
## 4 2003 Kids Lifestyle Blazer Mid L 2 9674
## 5 2004 Kids Running React Infinity XL 2 6040
## 6 2005 Women Training Flex Trainer M 1 7364
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.47 0 3/9/2024 Online bengaluru -770.45
## 2 NA 0 7/9/2024 Retail Hyd -112.53
## 3 NA 0 <NA> Retail Mumbai 3337.34
## 4 NA 0 4/10/2024 Online Pune 3376.85
## 5 NA 0 9/12/2024 Retail Delhi 187.89
## 6 NA 0 <NA> Retail Delhi 1415.98
Step7:Clean the discount_applied
uncleaned_data$Discount_Applied[is.na(uncleaned_data$Discount_Applied)] <- median(uncleaned_data$Discount_Applied,na.rm=TRUE)
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M 2 6040
## 2 2001 Women Soccer Tiempo Legend M 3 4958
## 3 2002 Women Soccer Premier III M 4 6040
## 4 2003 Kids Lifestyle Blazer Mid L 2 9674
## 5 2004 Kids Running React Infinity XL 2 6040
## 6 2005 Women Training Flex Trainer M 1 7364
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.470 0 3/9/2024 Online bengaluru -770.45
## 2 0.615 0 7/9/2024 Retail Hyd -112.53
## 3 0.615 0 <NA> Retail Mumbai 3337.34
## 4 0.615 0 4/10/2024 Online Pune 3376.85
## 5 0.615 0 9/12/2024 Retail Delhi 187.89
## 6 0.615 0 <NA> Retail Delhi 1415.98
#some discounts are over 100percent
uncleaned_data$Discount_Applied[which(uncleaned_data$Discount_Applied>="1.00")] <- 0.04
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M 2 6040
## 2 2001 Women Soccer Tiempo Legend M 3 4958
## 3 2002 Women Soccer Premier III M 4 6040
## 4 2003 Kids Lifestyle Blazer Mid L 2 9674
## 5 2004 Kids Running React Infinity XL 2 6040
## 6 2005 Women Training Flex Trainer M 1 7364
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.470 0 3/9/2024 Online bengaluru -770.45
## 2 0.615 0 7/9/2024 Retail Hyd -112.53
## 3 0.615 0 <NA> Retail Mumbai 3337.34
## 4 0.615 0 4/10/2024 Online Pune 3376.85
## 5 0.615 0 9/12/2024 Retail Delhi 187.89
## 6 0.615 0 <NA> Retail Delhi 1415.98
which(uncleaned_data$Discount_Applied>="1.0")
## integer(0)
#we no longer haVe discounts greater than 100%
Step8:Clean and adjust the revenue column
uncleaned_data$Revenue <- (uncleaned_data$Units_Sold*uncleaned_data$MRP *(1-uncleaned_data$Discount_Applied)) # formula for calculating Revenue
uncleaned_data$Revenue <- round(uncleaned_data$Revenue,digits=0)
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M 2 6040
## 2 2001 Women Soccer Tiempo Legend M 3 4958
## 3 2002 Women Soccer Premier III M 4 6040
## 4 2003 Kids Lifestyle Blazer Mid L 2 9674
## 5 2004 Kids Running React Infinity XL 2 6040
## 6 2005 Women Training Flex Trainer M 1 7364
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.470 6402 3/9/2024 Online bengaluru -770.45
## 2 0.615 5726 7/9/2024 Retail Hyd -112.53
## 3 0.615 9302 <NA> Retail Mumbai 3337.34
## 4 0.615 7449 4/10/2024 Online Pune 3376.85
## 5 0.615 4651 9/12/2024 Retail Delhi 187.89
## 6 0.615 2835 <NA> Retail Delhi 1415.98
#we now have the correct revenue.
Step9:Clean and adjust the date format and column
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
uncleaned_data$Order_Date <- parse_date_time(uncleaned_data$Order_Date,
orders = c("ymd", "dmy", "Y-m-d", "B d Y"),
tz = "UTC")
## Warning: 740 failed to parse.
uncleaned_data$Order_Date[is.na(uncleaned_data$Order_Date)] <-as.Date("2024-10-11")
uncleaned_data$Order_Date <- as.Date(uncleaned_data$Order_Date, format="%Y-%m-%d")
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M 2 6040
## 2 2001 Women Soccer Tiempo Legend M 3 4958
## 3 2002 Women Soccer Premier III M 4 6040
## 4 2003 Kids Lifestyle Blazer Mid L 2 9674
## 5 2004 Kids Running React Infinity XL 2 6040
## 6 2005 Women Training Flex Trainer M 1 7364
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.470 6402 2024-09-03 Online bengaluru -770.45
## 2 0.615 5726 2024-09-07 Retail Hyd -112.53
## 3 0.615 9302 2024-10-11 Retail Mumbai 3337.34
## 4 0.615 7449 2024-10-04 Online Pune 3376.85
## 5 0.615 4651 2024-12-09 Retail Delhi 187.89
## 6 0.615 2835 2024-10-11 Retail Delhi 1415.98
Step10:Clean the Region column
library(stringr)
uncleaned_data<-uncleaned_data %>%
mutate(
Region = str_trim(str_to_title(Region))
)
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M 2 6040
## 2 2001 Women Soccer Tiempo Legend M 3 4958
## 3 2002 Women Soccer Premier III M 4 6040
## 4 2003 Kids Lifestyle Blazer Mid L 2 9674
## 5 2004 Kids Running React Infinity XL 2 6040
## 6 2005 Women Training Flex Trainer M 1 7364
## Discount_Applied Revenue Order_Date Sales_Channel Region Profit
## 1 0.470 6402 2024-09-03 Online Bengaluru -770.45
## 2 0.615 5726 2024-09-07 Retail Hyd -112.53
## 3 0.615 9302 2024-10-11 Retail Mumbai 3337.34
## 4 0.615 7449 2024-10-04 Online Pune 3376.85
## 5 0.615 4651 2024-12-09 Retail Delhi 187.89
## 6 0.615 2835 2024-10-11 Retail Delhi 1415.98
# The Region column has been well arranged
Step11:Clean the profit(the profits are unrealistic)
This data is unrealistic, there is no price per unit, I can’t clean or adjust the profits made.
#let’s drop the selling_price column
uncleaned_data <- uncleaned_data[,-13]
head(uncleaned_data)
## Order_ID Gender_Category Product_Line Product_Name Size Units_Sold MRP
## 1 2000 Kids Training SuperRep Go M 2 6040
## 2 2001 Women Soccer Tiempo Legend M 3 4958
## 3 2002 Women Soccer Premier III M 4 6040
## 4 2003 Kids Lifestyle Blazer Mid L 2 9674
## 5 2004 Kids Running React Infinity XL 2 6040
## 6 2005 Women Training Flex Trainer M 1 7364
## Discount_Applied Revenue Order_Date Sales_Channel Region
## 1 0.470 6402 2024-09-03 Online Bengaluru
## 2 0.615 5726 2024-09-07 Retail Hyd
## 3 0.615 9302 2024-10-11 Retail Mumbai
## 4 0.615 7449 2024-10-04 Online Pune
## 5 0.615 4651 2024-12-09 Retail Delhi
## 6 0.615 2835 2024-10-11 Retail Delhi
Lets’s save the cleaned dataset
data<-write.csv(uncleaned_data,"Nike sales cleaned dataset.csv",row.names=FALSE)
## Warning in .numeric2roman(x): NAs introduced by coercion
data
## NULL
Now, we have a cleaned dataset ready to be used for analysis.