Cleaning Nike Sales Messy Data

Step1:Read the CSV file

uncleaned_data <- read.csv("/Users/mac/Downloads/Nike_Sales_Uncleaned.csv")
head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold     MRP
## 1     2000            Kids     Training    SuperRep Go    M         NA      NA
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4957.93
## 3     2002           Women       Soccer    Premier III    M          4      NA
## 4     2003            Kids    Lifestyle     Blazer Mid    L         NA 9673.57
## 5     2004            Kids      Running React Infinity   XL         NA      NA
## 6     2005           Women     Training   Flex Trainer    M          1 7363.96
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1             0.47       0   3/9/2024        Online bengaluru -770.45
## 2               NA       0   7/9/2024        Retail       Hyd -112.53
## 3               NA       0                   Retail    Mumbai 3337.34
## 4               NA       0  4/10/2024        Online      Pune 3376.85
## 5               NA       0  9/12/2024        Retail     Delhi  187.89
## 6               NA       0                   Retail     Delhi 1415.98

Information about the dataset

Over 2,500 transaction records containing:

Multiple product lines (Running, Basketball, Lifestyle, Training, Soccer)

Gender-specific sales (Men, Women, Kids)

Sales from both Retail Stores and Online Channels

Common data issues like:

Null values

Typos in regions

Wrong data types

Negative values in numeric columns

Inconsistent date formats (e.g., 2023/07/21, 21-07-2023, etc.)

Discounts > 100%

Step2:Check for the structure, summary and number of missing values

str(uncleaned_data)

## 'data.frame':    2500 obs. of  13 variables:
##  $ Order_ID        : int  2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
##  $ Gender_Category : chr  "Kids" "Women" "Women" "Kids" ...
##  $ Product_Line    : chr  "Training" "Soccer" "Soccer" "Lifestyle" ...
##  $ Product_Name    : chr  "SuperRep Go" "Tiempo Legend" "Premier III" "Blazer Mid" ...
##  $ Size            : chr  "M" "M" "M" "L" ...
##  $ Units_Sold      : int  NA 3 4 NA NA 1 NA 1 -1 4 ...
##  $ MRP             : num  NA 4958 NA 9674 NA ...
##  $ Discount_Applied: num  0.47 NA NA NA NA NA NA 0.32 NA NA ...
##  $ Revenue         : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Order_Date      : chr  "3/9/2024" "7/9/2024" "" "4/10/2024" ...
##  $ Sales_Channel   : chr  "Online" "Retail" "Retail" "Online" ...
##  $ Region          : chr  "bengaluru" "Hyd" "Mumbai" "Pune" ...
##  $ Profit          : num  -770 -113 3337 3377 188 ...

summary(uncleaned_data)

##     Order_ID    Gender_Category    Product_Line       Product_Name      
##  Min.   :2000   Length:2500        Length:2500        Length:2500       
##  1st Qu.:2535   Class :character   Class :character   Class :character  
##  Median :3192   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :3194                                                           
##  3rd Qu.:3841                                                           
##  Max.   :4499                                                           
##                                                                         
##      Size             Units_Sold          MRP       Discount_Applied
##  Length:2500        Min.   :-1.000   Min.   :2007   Min.   :0.0000  
##  Class :character   1st Qu.: 0.000   1st Qu.:4039   1st Qu.:0.3200  
##  Mode  :character   Median : 2.000   Median :6110   Median :0.6150  
##                     Mean   : 1.482   Mean   :6040   Mean   :0.6310  
##                     3rd Qu.: 3.000   3rd Qu.:8022   3rd Qu.:0.9625  
##                     Max.   : 4.000   Max.   :9996   Max.   :1.2500  
##                     NA's   :1235     NA's   :1254   NA's   :1668    
##     Revenue         Order_Date        Sales_Channel         Region         
##  Min.   :-7561.6   Length:2500        Length:2500        Length:2500       
##  1st Qu.:    0.0   Class :character   Class :character   Class :character  
##  Median :    0.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :  274.9                                                           
##  3rd Qu.:    0.0                                                           
##  Max.   :37169.3                                                           
##                                                                            
##      Profit        
##  Min.   :-1199.45  
##  1st Qu.:   93.53  
##  Median : 1371.38  
##  Mean   : 1376.01  
##  3rd Qu.: 2660.64  
##  Max.   : 3999.21  
##

colSums(is.na(uncleaned_data))

##         Order_ID  Gender_Category     Product_Line     Product_Name 
##                0                0                0                0 
##             Size       Units_Sold              MRP Discount_Applied 
##                0             1235             1254             1668 
##          Revenue       Order_Date    Sales_Channel           Region 
##                0                0                0                0 
##           Profit 
##                0

Step3:Replace the blank cells with NA

uncleaned_data[uncleaned_data==""] <- NA
head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold     MRP
## 1     2000            Kids     Training    SuperRep Go    M         NA      NA
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4957.93
## 3     2002           Women       Soccer    Premier III    M          4      NA
## 4     2003            Kids    Lifestyle     Blazer Mid    L         NA 9673.57
## 5     2004            Kids      Running React Infinity   XL         NA      NA
## 6     2005           Women     Training   Flex Trainer    M          1 7363.96
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1             0.47       0   3/9/2024        Online bengaluru -770.45
## 2               NA       0   7/9/2024        Retail       Hyd -112.53
## 3               NA       0       <NA>        Retail    Mumbai 3337.34
## 4               NA       0  4/10/2024        Online      Pune 3376.85
## 5               NA       0  9/12/2024        Retail     Delhi  187.89
## 6               NA       0       <NA>        Retail     Delhi 1415.98

Step4:Clean the size column

uncleaned_data$Size<- as.roman(uncleaned_data$Size)
head(uncleaned_data$Size)

## [1] M  M  M  L  XL M

sum(is.na(uncleaned_data$Size))

## [1] 510

# we have 510 missing values 
uncleaned_data$Size[is.na(uncleaned_data$Size)] <- "M"
head(uncleaned_data$Size)

## [1] M  M  M  L  XL M

head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold     MRP
## 1     2000            Kids     Training    SuperRep Go    M         NA      NA
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4957.93
## 3     2002           Women       Soccer    Premier III    M          4      NA
## 4     2003            Kids    Lifestyle     Blazer Mid    L         NA 9673.57
## 5     2004            Kids      Running React Infinity   XL         NA      NA
## 6     2005           Women     Training   Flex Trainer    M          1 7363.96
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1             0.47       0   3/9/2024        Online bengaluru -770.45
## 2               NA       0   7/9/2024        Retail       Hyd -112.53
## 3               NA       0       <NA>        Retail    Mumbai 3337.34
## 4               NA       0  4/10/2024        Online      Pune 3376.85
## 5               NA       0  9/12/2024        Retail     Delhi  187.89
## 6               NA       0       <NA>        Retail     Delhi 1415.98

sum(is.na(uncleaned_data$Size))

## [1] 0

#we no longer have missing values in the column "Size"

Step5: Adjust the Units_Sold column

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

max(uncleaned_data$Units_Sold,na.rm=TRUE)

## [1] 4

min(uncleaned_data$Units_Sold,na.rm=TRUE)

## [1] -1

range(uncleaned_data$Units_Sold,na.rm=TRUE)

## [1] -1  4

# we need to normalize the observation "Units_Sold"
uncleaned_data$Units_Sold[uncleaned_data$Units_Sold== "0"|uncleaned_data$Units_Sold=="-1"] <- 1
head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold     MRP
## 1     2000            Kids     Training    SuperRep Go    M         NA      NA
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4957.93
## 3     2002           Women       Soccer    Premier III    M          4      NA
## 4     2003            Kids    Lifestyle     Blazer Mid    L         NA 9673.57
## 5     2004            Kids      Running React Infinity   XL         NA      NA
## 6     2005           Women     Training   Flex Trainer    M          1 7363.96
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1             0.47       0   3/9/2024        Online bengaluru -770.45
## 2               NA       0   7/9/2024        Retail       Hyd -112.53
## 3               NA       0       <NA>        Retail    Mumbai 3337.34
## 4               NA       0  4/10/2024        Online      Pune 3376.85
## 5               NA       0  9/12/2024        Retail     Delhi  187.89
## 6               NA       0       <NA>        Retail     Delhi 1415.98

mean <- round(mean(uncleaned_data$Units_Sold, na.rm = TRUE),1)
mean

## [1] 2

#Replace the NA's with value
sum(is.na(uncleaned_data$Units_Sold))

## [1] 1235

uncleaned_data$Units_Sold[is.na(uncleaned_data$Units_Sold)] <- mean #replacing the NA's with mean
head(uncleaned_data$Units_Sold)

## [1] 2 3 4 2 2 1

head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold     MRP
## 1     2000            Kids     Training    SuperRep Go    M          2      NA
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4957.93
## 3     2002           Women       Soccer    Premier III    M          4      NA
## 4     2003            Kids    Lifestyle     Blazer Mid    L          2 9673.57
## 5     2004            Kids      Running React Infinity   XL          2      NA
## 6     2005           Women     Training   Flex Trainer    M          1 7363.96
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1             0.47       0   3/9/2024        Online bengaluru -770.45
## 2               NA       0   7/9/2024        Retail       Hyd -112.53
## 3               NA       0       <NA>        Retail    Mumbai 3337.34
## 4               NA       0  4/10/2024        Online      Pune 3376.85
## 5               NA       0  9/12/2024        Retail     Delhi  187.89
## 6               NA       0       <NA>        Retail     Delhi 1415.98

#we no longer have missing values

Step6: Clean the MRP column

sum(is.na(uncleaned_data$MRP))

## [1] 1254

uncleaned_data$MRP[is.na(uncleaned_data$MRP)] <- mean(uncleaned_data$MRP,na.rm=TRUE)
uncleaned_data$MRP <- round(uncleaned_data$MRP,digits=0)
head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold  MRP
## 1     2000            Kids     Training    SuperRep Go    M          2 6040
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4958
## 3     2002           Women       Soccer    Premier III    M          4 6040
## 4     2003            Kids    Lifestyle     Blazer Mid    L          2 9674
## 5     2004            Kids      Running React Infinity   XL          2 6040
## 6     2005           Women     Training   Flex Trainer    M          1 7364
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1             0.47       0   3/9/2024        Online bengaluru -770.45
## 2               NA       0   7/9/2024        Retail       Hyd -112.53
## 3               NA       0       <NA>        Retail    Mumbai 3337.34
## 4               NA       0  4/10/2024        Online      Pune 3376.85
## 5               NA       0  9/12/2024        Retail     Delhi  187.89
## 6               NA       0       <NA>        Retail     Delhi 1415.98

Step7:Clean the discount_applied

uncleaned_data$Discount_Applied[is.na(uncleaned_data$Discount_Applied)] <- median(uncleaned_data$Discount_Applied,na.rm=TRUE)
head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold  MRP
## 1     2000            Kids     Training    SuperRep Go    M          2 6040
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4958
## 3     2002           Women       Soccer    Premier III    M          4 6040
## 4     2003            Kids    Lifestyle     Blazer Mid    L          2 9674
## 5     2004            Kids      Running React Infinity   XL          2 6040
## 6     2005           Women     Training   Flex Trainer    M          1 7364
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1            0.470       0   3/9/2024        Online bengaluru -770.45
## 2            0.615       0   7/9/2024        Retail       Hyd -112.53
## 3            0.615       0       <NA>        Retail    Mumbai 3337.34
## 4            0.615       0  4/10/2024        Online      Pune 3376.85
## 5            0.615       0  9/12/2024        Retail     Delhi  187.89
## 6            0.615       0       <NA>        Retail     Delhi 1415.98

#some discounts are over 100percent 
uncleaned_data$Discount_Applied[which(uncleaned_data$Discount_Applied>="1.00")] <- 0.04
head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold  MRP
## 1     2000            Kids     Training    SuperRep Go    M          2 6040
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4958
## 3     2002           Women       Soccer    Premier III    M          4 6040
## 4     2003            Kids    Lifestyle     Blazer Mid    L          2 9674
## 5     2004            Kids      Running React Infinity   XL          2 6040
## 6     2005           Women     Training   Flex Trainer    M          1 7364
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1            0.470       0   3/9/2024        Online bengaluru -770.45
## 2            0.615       0   7/9/2024        Retail       Hyd -112.53
## 3            0.615       0       <NA>        Retail    Mumbai 3337.34
## 4            0.615       0  4/10/2024        Online      Pune 3376.85
## 5            0.615       0  9/12/2024        Retail     Delhi  187.89
## 6            0.615       0       <NA>        Retail     Delhi 1415.98

which(uncleaned_data$Discount_Applied>="1.0")

## integer(0)

#we no longer haVe discounts greater than 100%

Step8:Clean and adjust the revenue column

uncleaned_data$Revenue <- (uncleaned_data$Units_Sold*uncleaned_data$MRP *(1-uncleaned_data$Discount_Applied)) # formula for calculating Revenue
uncleaned_data$Revenue <- round(uncleaned_data$Revenue,digits=0)
head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold  MRP
## 1     2000            Kids     Training    SuperRep Go    M          2 6040
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4958
## 3     2002           Women       Soccer    Premier III    M          4 6040
## 4     2003            Kids    Lifestyle     Blazer Mid    L          2 9674
## 5     2004            Kids      Running React Infinity   XL          2 6040
## 6     2005           Women     Training   Flex Trainer    M          1 7364
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1            0.470    6402   3/9/2024        Online bengaluru -770.45
## 2            0.615    5726   7/9/2024        Retail       Hyd -112.53
## 3            0.615    9302       <NA>        Retail    Mumbai 3337.34
## 4            0.615    7449  4/10/2024        Online      Pune 3376.85
## 5            0.615    4651  9/12/2024        Retail     Delhi  187.89
## 6            0.615    2835       <NA>        Retail     Delhi 1415.98

#we now have the correct revenue.

Step9:Clean and adjust the date format and column

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

uncleaned_data$Order_Date <- parse_date_time(uncleaned_data$Order_Date,
                                  orders = c("ymd", "dmy", "Y-m-d", "B d Y"),
                                  tz = "UTC")

## Warning: 740 failed to parse.

uncleaned_data$Order_Date[is.na(uncleaned_data$Order_Date)] <-as.Date("2024-10-11")
uncleaned_data$Order_Date <- as.Date(uncleaned_data$Order_Date, format="%Y-%m-%d")
head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold  MRP
## 1     2000            Kids     Training    SuperRep Go    M          2 6040
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4958
## 3     2002           Women       Soccer    Premier III    M          4 6040
## 4     2003            Kids    Lifestyle     Blazer Mid    L          2 9674
## 5     2004            Kids      Running React Infinity   XL          2 6040
## 6     2005           Women     Training   Flex Trainer    M          1 7364
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1            0.470    6402 2024-09-03        Online bengaluru -770.45
## 2            0.615    5726 2024-09-07        Retail       Hyd -112.53
## 3            0.615    9302 2024-10-11        Retail    Mumbai 3337.34
## 4            0.615    7449 2024-10-04        Online      Pune 3376.85
## 5            0.615    4651 2024-12-09        Retail     Delhi  187.89
## 6            0.615    2835 2024-10-11        Retail     Delhi 1415.98

Step10:Clean the Region column

library(stringr)
uncleaned_data<-uncleaned_data %>%
  mutate(
    Region = str_trim(str_to_title(Region))
  )
head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold  MRP
## 1     2000            Kids     Training    SuperRep Go    M          2 6040
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4958
## 3     2002           Women       Soccer    Premier III    M          4 6040
## 4     2003            Kids    Lifestyle     Blazer Mid    L          2 9674
## 5     2004            Kids      Running React Infinity   XL          2 6040
## 6     2005           Women     Training   Flex Trainer    M          1 7364
##   Discount_Applied Revenue Order_Date Sales_Channel    Region  Profit
## 1            0.470    6402 2024-09-03        Online Bengaluru -770.45
## 2            0.615    5726 2024-09-07        Retail       Hyd -112.53
## 3            0.615    9302 2024-10-11        Retail    Mumbai 3337.34
## 4            0.615    7449 2024-10-04        Online      Pune 3376.85
## 5            0.615    4651 2024-12-09        Retail     Delhi  187.89
## 6            0.615    2835 2024-10-11        Retail     Delhi 1415.98

# The Region column has been well arranged

Step11:Clean the profit(the profits are unrealistic)

This data is unrealistic, there is no price per unit, I can’t clean or adjust the profits made.

#let’s drop the selling_price column

uncleaned_data <- uncleaned_data[,-13]
head(uncleaned_data)

##   Order_ID Gender_Category Product_Line   Product_Name Size Units_Sold  MRP
## 1     2000            Kids     Training    SuperRep Go    M          2 6040
## 2     2001           Women       Soccer  Tiempo Legend    M          3 4958
## 3     2002           Women       Soccer    Premier III    M          4 6040
## 4     2003            Kids    Lifestyle     Blazer Mid    L          2 9674
## 5     2004            Kids      Running React Infinity   XL          2 6040
## 6     2005           Women     Training   Flex Trainer    M          1 7364
##   Discount_Applied Revenue Order_Date Sales_Channel    Region
## 1            0.470    6402 2024-09-03        Online Bengaluru
## 2            0.615    5726 2024-09-07        Retail       Hyd
## 3            0.615    9302 2024-10-11        Retail    Mumbai
## 4            0.615    7449 2024-10-04        Online      Pune
## 5            0.615    4651 2024-12-09        Retail     Delhi
## 6            0.615    2835 2024-10-11        Retail     Delhi

Lets’s save the cleaned dataset

data<-write.csv(uncleaned_data,"Nike sales cleaned dataset.csv",row.names=FALSE)

## Warning in .numeric2roman(x): NAs introduced by coercion

data

## NULL

Now, we have a cleaned dataset ready to be used for analysis.

Cleaning Nike Sales Messy Data

Abiola Toluwani

2025-10-23