The goal of this work is to apply data cleaning and preprocessing techniques in R, then analyze and visualize the data, and finally draw conclusions from the results of that analysis.
Data Cleaning:
library(tidyverse) # loads readr (read_csv), dplyr, ggplot2, etc.
df <- read_csv("sales_data.csv")
-- Column specification ------------------------------------------------------
cols(
  `Invoice ID` = col_character(),
  Branch = col_character(),
  City = col_character(),
  `Customer type` = col_character(),
  Gender = col_character(),
  `Product line` = col_character(),
  `Unit price` = col_double(),
  Quantity = col_double(),
  `Tax 5%` = col_double(),
  Total = col_double(),
  Date = col_character(),
  Time = col_time(format = ""),
  Payment = col_character(),
  cogs = col_double(),
  `gross margin percentage` = col_double(),
  `gross income` = col_double(),
  Rating = col_double()
)
# Get to know the data
head(df)
df[order(df$Date), ] # view the rows ordered by Date
dim(df) # To check number of rows and columns
[1] 1000 17
summary(df) # description of the data
Invoice ID Branch City Customer type Gender Product line Unit price Quantity
Length:1000 Length:1000 Length:1000 Length:1000 Length:1000 Length:1000 Min. :10.08 Min. : 1.00
Class :character Class :character Class :character Class :character Class :character Class :character 1st Qu.:32.88 1st Qu.: 3.00
Mode :character Mode :character Mode :character Mode :character Mode :character Mode :character Median :55.23 Median : 5.00
Mean :55.67 Mean : 5.51
3rd Qu.:77.94 3rd Qu.: 8.00
Max. :99.96 Max. :10.00
Tax 5% Total Date Time Payment cogs gross margin percentage gross income
Min. : 0.5085 Min. : 10.68 Length:1000 Length:1000 Length:1000 Min. : 10.17 Min. :4.762 Min. : 0.5085
1st Qu.: 5.8948 1st Qu.: 124.42 Class :character Class1:hms Class :character 1st Qu.:118.50 1st Qu.:4.762 1st Qu.: 5.9249
Median :12.0960 Median : 253.85 Mode :character Class2:difftime Mode :character Median :241.76 Median :4.762 Median :12.0880
Mean :15.3855 Mean : 322.97 Mode :numeric Mean :307.59 Mean :4.762 Mean :15.3794
3rd Qu.:22.4605 3rd Qu.: 471.35 3rd Qu.:448.90 3rd Qu.:4.762 3rd Qu.:22.4453
Max. :49.6500 Max. :1042.65 Max. :993.00 Max. :4.762 Max. :49.6500
NA's :1
Rating
Min. : 4.000
1st Qu.: 5.500
Median : 7.000
Mean : 6.973
3rd Qu.: 8.500
Max. :10.000
Here, we need to make sure that the Date column is converted to a date-time type:
# Converting date to Datetime
library(lubridate)
df$Date <- parse_date_time(x = df$Date,
orders = c("d m y", "d B Y", "m/d/y"),
locale = "eng")
View the data again to verify the Date data type:
df[order(df$Date), ]
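A more direct verification of the new data type (a quick check, assuming the parse above succeeded):
class(df$Date) # should now be "POSIXct" "POSIXt"
range(df$Date, na.rm = TRUE) # earliest and latest transaction dates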
Checking the null values:
===========================
summary(is.na(df)) # To check null values
Invoice ID Branch City Customer type Gender Product line Unit price Quantity Tax 5%
Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:1000 FALSE:1000 FALSE:998 FALSE:1000 FALSE:998 FALSE:1000 FALSE:1000 FALSE:1000 FALSE:999
TRUE :2 TRUE :2 TRUE :1
Total Date Time Payment cogs gross margin percentage gross income Rating
Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical Mode :logical
FALSE:1000 FALSE:999 FALSE:1000 FALSE:1000 FALSE:1000 FALSE:1000 FALSE:1000 FALSE:1000
TRUE :1
This means that we have:
- 2 missing values in the City column,
- 2 missing values in the Gender column,
- 1 missing value in the Tax 5% column,
- 1 missing value in the Date column.
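As a quick cross-check, base R can count the missing values per column directly:
colSums(is.na(df)) # number of NA values in each column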
We can also use the naniar library to visualize the missing values per attribute:
#install.packages("naniar")
library(naniar)
gg_miss_var(df)
Fixing the null values:
===========================
Handling the missing Tax 5% value by imputing the column mean:
library(tidyr)
df$`Tax 5%` <- replace_na(df$`Tax 5%`,mean(df$`Tax 5%`, na.rm = TRUE) )
summary(is.na(df$`Tax 5%`))
Mode FALSE
logical 1000
=> The missing input in the taxes column has been resolved.
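If several numeric columns needed the same treatment, the pattern generalizes with dplyr's across(); this is only a sketch, since Tax 5% is the only numeric column with a missing value in this dataset:
library(dplyr)
# Mean-impute every numeric column that contains NAs
df_imputed <- df %>%
  mutate(across(where(is.numeric), ~ replace_na(.x, mean(.x, na.rm = TRUE))))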
Let's catch up with a summary overview of the remaining missing values:
miss_var_summary(df)
We still have a few missing values left to handle:
One option for handling the missing Date is the complete() function from the tidyr library, which can insert a row for every date in the observed range; a sketch follows.
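A minimal sketch of that approach, assuming df$Date has already been parsed to a date-time as above:
library(tidyr)
# Expand df so that every calendar day between the first and last sale appears;
# days that were missing become explicit NA rows that can then be imputed or dropped.
df_full <- df %>%
  complete(Date = seq(min(Date, na.rm = TRUE), max(Date, na.rm = TRUE), by = "day"))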
Here, though, we simply remove all the remaining rows that contain null values:
x <- df[complete.cases(df), ]
str(x)
tibble [996 x 17] (S3: tbl_df/tbl/data.frame)
$ Invoice ID : chr [1:996] "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
$ Branch : chr [1:996] "A" "C" "A" "A" ...
$ City : chr [1:996] "Yangon" "Naypyitaw" "Yangon" "Yangon" ...
$ Customer type : chr [1:996] "Member" "Normal" "Normal" "Member" ...
$ Gender : chr [1:996] "Female" "Female" "Male" "Male" ...
$ Product line : chr [1:996] "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
$ Unit price : num [1:996] 74.7 15.3 46.3 58.2 86.3 ...
$ Quantity : num [1:996] 7 5 7 8 7 7 6 10 2 3 ...
$ Tax 5% : num [1:996] 26.14 3.82 16.22 23.29 30.21 ...
$ Total : num [1:996] 549 80.2 340.5 489 634.4 ...
$ Date : POSIXct[1:996], format: "2019-01-05" "2019-03-08" "2019-03-03" "2019-01-27" ...
$ Time : 'hms' num [1:996] 13:08:00 10:29:00 13:23:00 20:33:00 ...
..- attr(*, "units")= chr "secs"
$ Payment : chr [1:996] "Ewallet" "Cash" "Credit card" "Ewallet" ...
$ cogs : num [1:996] 522.8 76.4 324.3 465.8 604.2 ...
$ gross margin percentage: num [1:996] 4.76 4.76 4.76 4.76 4.76 ...
$ gross income : num [1:996] 26.14 3.82 16.22 23.29 30.21 ...
$ Rating : num [1:996] 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
df <- na.omit(df)
Last check on whether any null values still exist:
miss_var_summary(df)
df
gg_miss_var(df)
Now there are clearly no missing values left, and we can dive deeper into the exploratory data analysis.
Data Analysis and Exploration:
_________________________________
We’ll start by looking for the most correlated variables using the polychoric correlation:
# selecting numerical columns
library(magrittr) # needs to be run every time you start R and want to use %>%
library(dplyr) # alternatively, this also loads %>%
my_data <- df %>% select(7, 8, 9, 10, 14, 15, 16, 17) # Unit price, Quantity, Tax 5%, Total, cogs, gross margin percentage, gross income, Rating
# Correlations with significance levels
# polychoric correlation
library(polycor)
polychor(my_data)
NaNs produced
[1] -0.004516063
head(my_data)
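Note that polychor() is really meant for pairs of ordinal variables, which explains the NaN warnings on these continuous columns; a plain Pearson correlation matrix is a simpler fit here (a minimal sketch):
# Pairwise Pearson correlations between all numeric columns
round(cor(my_data, use = "complete.obs"), 2)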
Now we’ll draw a heatmap of the sales data to look for the best-correlated variables. For this, the following libraries may need to be installed:
- library(tidyverse)
- library(hrbrthemes)
- library(viridis)
- library(plotly)
- library(d3heatmap)
#install.packages("tidyverse")
#install.packages("hrbrthemes")
#install.packages("viridis")
#install.packages("plotly")
#install.packages("d3heatmap")
#data.matrix(my_data, rownames.force = NA)
# round to two decimal places:
round_df <- function(x, digits) {
# round all numeric variables
# x: data frame
# digits: number of digits to round
numeric_columns <- sapply(x, mode) == 'numeric'
x[numeric_columns] <- round(x[numeric_columns], digits)
x
}
cleaned_df <- round_df(my_data, 2)
cleaned_df
# The sales dataset:
Sales_data <- as.matrix(cleaned_df)
Plotting a heatmap for the cleaned data we have worked on:
#########################################################
### A) Installing and loading required packages
#########################################################
if (!require("gplots")) {
install.packages("gplots", dependencies = TRUE)
library(gplots)
}
if (!require("RColorBrewer")) {
install.packages("RColorBrewer", dependencies = TRUE)
library(RColorBrewer)
}
#########################################################
### B) Computing the correlation matrix to plot
#########################################################
# The heatmap below is meant to show correlations (its color breaks run from -1 to 1),
# so we plot the Pearson correlation matrix of the numeric columns:
cor_mat <- cor(Sales_data) # 8 x 8 correlation matrix
mat_data <- round(cor_mat, 2) # rounded so the cell labels stay readable
#########################################################
### C) Customizing and plotting the heat map
#########################################################
# create our own color palette from red to green
my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
# (optional) defines the color breaks manually for a "skewed" color transition
col_breaks = c(seq(-1,0,length=100), # for red
seq(0.01,0.8,length=100), # for yellow
seq(0.81,1,length=100)) # for green
# creates a 5 x 5 inch image
png("./image/reda.png", # create PNG for the heat map
width = 5*300, # 5 x 300 pixels
height = 5*300,
res = 300, # 300 pixels per inch
pointsize = 8) # smaller font size
heatmap.2(mat_data,
cellnote = mat_data, # label each cell with its correlation value
main = "Correlation", # heat map title
notecol="black", # change font color of cell labels to black
density.info="none", # turns off density plot inside color legend
trace="none", # turns off trace lines inside the heat map
margins =c(12,9), # widens margins around plot
col=my_palette, # use the color palette defined earlier
breaks=col_breaks, # enable color transition at specified limits
dendrogram="row", # only draw a row dendrogram
Colv="NA") # turn off column clustering
dev.off() # close the PNG device
null device
1
heatmap.2(mat_data)
The best-correlated variables are Tax 5%, Total, gross income, and cogs (cost of goods sold), with pairwise correlations of 1. This is expected: Tax 5% is 5% of cogs, gross income equals the tax amount, and Total is cogs plus tax, so each is a linear function of the others.
All of the above also have a good correlation of 0.71 with Quantity.
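A quick sanity check on two of these relationships (the 0.71 figure is the one read off the heatmap above):
round(cor(df$cogs, df$Total), 3) # ~1, since Total is cogs plus the 5% tax
round(cor(df$Quantity, df$cogs), 3) # ~0.71, as seen in the heatmap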
x <- df$`Tax 5%`
y <- df$`gross income`
plot(x, y, col = "red", main="Tax 5% as a function of the Gross Income",
xlab="Tax 5%", ylab="Gross Income")
x <- df$Quantity
y <- df$cogs
plot(x, y, col = "Dark Green", main="Quantity Vs Cost of Goods Sale",
xlab="Quantity", ylab="Cost of Goods Sale")
x <- df$`Unit price`
y <- df$`gross income`
plot(x, y, col = "Blue", main="Unit Price Vs Gross Income",
xlab="Unit Price", ylab="Gross Income")
Rating <- df$Rating
hist(Rating, col = 'Gold')
Let’s find the mean rating now:
Rating <- df$Rating
avg <- mean(Rating)
hist(Rating, col = 'Gold')
abline(v= avg, col='Blue', lwd =2)
Plotting histograms for the whole dataset:
library(Hmisc)
hist.data.frame(cleaned_df)
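If Hmisc is not available, a tidyverse alternative is to reshape to long format and facet; a sketch:
library(tidyr)
library(ggplot2)
# One histogram panel per numeric variable
cleaned_df %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "gold") +
  facet_wrap(~ variable, scales = "free")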
Analysis of Branch, City and Product Type:
_________________________________________
#Counting the branches
counts <- df %>% count(Branch)
counts
# Import ggplot2 package
library(ggplot2)
# Build the classification graph
ggplot(data = df, aes(x = Branch)) +
geom_bar(aes(fill = Branch)) +
labs(title = "Branch Analysis",
subtitle = "Which branch is the busiest?.",
x = "Branches",
y = "Count")
Now let’s check the payment methods:
# Build the payment method graph
ggplot(data = df, aes(x = Payment)) +
geom_bar(aes(fill = Payment)) +
labs(title = "Payment Method Analysis",
subtitle = "Which Payment method is most used?.",
x = "Payment methods",
y = "Count")
Showcasing results:
# Print results
summary <- count(df, Payment)
summary$percentage <- summary$n / sum(summary$n) * 100
summary
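Since this count-plus-percentage pattern recurs for several columns below, a small helper keeps it tidy (a sketch; pct_table is our own name, not a library function):
library(dplyr)
# Count a column and attach each level's percentage share
pct_table <- function(data, col) {
  data %>%
    count({{ col }}) %>%
    mutate(percentage = n / sum(n) * 100)
}
pct_table(df, Payment) # same result as the summary above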
Now checking the busiest city:
# Build the cities graph
ggplot(data = df, aes(x = City)) +
geom_bar(aes(fill = City)) +
labs(title = "Geo Analysis",
subtitle = "Which city is most busy?.",
x = "City",
y = "Count")
The percentage breakdown per city is the following:
# Print results
summary <- count(df, City)
summary$percentage <- summary$n / sum(summary$n) * 100
summary
Gender:
# Build the gender graph
ggplot(data = df, aes(x = Gender)) +
geom_bar(aes(fill = Gender)) +
labs(title = "Gender Analysis",
subtitle = "Gender Count",
x = "Gender",
y = "Count")
Showcasing results per gender:
# Print results
summary <- count(df, Gender)
summary$percentage <- summary$n / sum(summary$n) * 100
summary
Visualizing a gender-based comparison related to Product Type:
# Build Gender based comparison related to Product Type graph
#plot(df$`Unit price`, df$`Product line`, pch = 16, col = c('cornflowerblue', 'springgreen')[df$Gender])
library(ggplot2)
# Scatter plot of unit price per product line, colored by gender
ggplot() + geom_point(data = df, aes(x = `Product line`, y = `Unit price`, color = Gender))
Using a stacked bar chart (count plot) to better showcase the result per gender:
ggplot(df, aes(y=`Product line`)) +
geom_bar(aes(fill = Gender)) +
labs(title = "Gender Analysis per Product Type",
x = "Count",
y = "Product Type")
Showcasing counts per product line:
# Print results
summary <- count(df, `Product line`)
summary$percentage <- summary$n / sum(summary$n) * 100
summary
In Health and beauty, male customers clearly outnumber female ones, whereas in Fashion accessories, Food and beverages, and Sports and travel, female customers are more numerous; in the remaining product lines there is no significant difference.
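To put numbers behind this observation, we can compute the gender share within each product line (a minimal sketch):
# Percentage of each gender within every product line
df %>%
  count(`Product line`, Gender) %>%
  group_by(`Product line`) %>%
  mutate(percentage = n / sum(n) * 100)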
Visualizing a city-based comparison related to Product Type
Now let’s focus on which cities lead each product line:
# Build city-based comparison related to Product Type graph
library(ggplot2)
# Scatter plot of unit price per product line, colored by city
ggplot() + geom_point(data = df, aes(x = `Product line`, y = `Unit price`, color = City))
Let’s re-adjust the graph using a bar plot:
ggplot(df, aes(y=`Product line`)) +
geom_bar(aes(fill = City)) +
labs(title = "Gender Analysis per Product Type",
x = "Count",
y = "Product Type")
Conclusion:
- Mandalay leads Sports and travel and also Health and beauty,
- Naypyitaw leads Food and beverages and also Fashion accessories,
- Yangon leads Home and lifestyle and also Electronic accessories.
Visualizing the most used payment method for Product Type, Branch and City
Now let’s focus more on the product types that are mostly purchased with specific payment methods:
ggplot(df, aes(y=`Product line`)) +
geom_bar(aes(fill = Payment)) +
labs(title = "payment method for Product Type",
x = "Count",
y = "Product Type")
And here we check which branches mostly promote a specific type of payment:
ggplot(df, aes(y=Branch)) +
geom_bar(aes(fill = Payment)) +
labs(title = "payment method for Branch Type",
x = "Count",
y = "Branch")
And these are the cities in which a specific type of payment is mostly used:
ggplot(df, aes(y=City)) +
geom_bar(aes(fill = Payment)) +
labs(title = "payment method for City Type",
x = "Count",
y = "City")
Finding which branch has better sales for a particular product type
__________________________________________________________________
ggplot(df, aes(y=`Product line`)) +
geom_bar(aes(fill = Branch)) +
labs(title = "Braches having better sales for Product Type",
x = "Count",
y = "Product Type")
Box Plot for Rating and Quantity
__________________________________
couleur <- c("gold", "darkgreen", "red", "pink", "yellow", "brown", "orange", "gray", "green", "skyblue") # one color per Quantity level; "skyblue" replaces "silver", which is not a valid R color name
boxplot(Rating ~ Quantity, data = df, main = "Box Plot for Quantity and Rating", notch = TRUE,
        col = couleur, xlab = "Quantity", ylab = "Rating")
Now let’s implement a word cloud to showcase the most sold product lines:
#install.packages("wordcloud")
#library(wordcloud)
#install.packages("wordcloud2")
#library(wordcloud2)
#install.packages("tm")
library(tm)
#Create a vector containing only the text
text <- df$`Product line`
# Create a corpus
docs <- Corpus(VectorSource(text))
dtm <- TermDocumentMatrix(docs)
matrix <- as.matrix(dtm)
words <- sort(rowSums(matrix),decreasing=TRUE)
df <- data.frame(word = names(words),freq=words)
wordcloud2(data=df, size=1.6, color='random-dark')
set.seed(1234) # for reproducibility
wordcloud(words = word_freqs$word, freq = word_freqs$freq, min.freq = 1,
max.words = 200, random.order = FALSE, rot.per = 0.35,
colors = brewer.pal(8, "Dark2"))