Dimensionality Reduction
Carrefour Kenya is currently undertaking a project that will inform the marketing department on the most relevant marketing strategies, meaning those that result in the highest sales (total price including tax). This project analyses the dataset provided by Carrefour and draws insights on how to achieve the highest sales.
Identifying the principal components that contribute most to the behaviour of sales.
The provided data is from Carrefour Kenya's database and shows transactions recorded over a certain period. It reflects the behaviour of sales at Carrefour, and we'll use it to identify principal components in the transactions.
The provided data is relevant for this kind of study since it reflects Carrefour's sales.
#install.packages('data.table')
#install.packages('tidyverse')
#install.packages("dplyr")
#install.packages("modelr")
#install.packages("broom")
#install.packages("caret")
#install.packages("rpart")
#install.packages("ggplot2")
#install.packages("Amelia")
library(modelr)
library(broom)
##
## Attaching package: 'broom'
## The following object is masked from 'package:modelr':
##
## bootstrap
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(rpart)
library(ggplot2)
library(Amelia)
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.8.0, built: 2021-05-26)
## ## Copyright (C) 2005-2022 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v purrr 0.3.4
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x data.table::between() masks dplyr::between()
## x broom::bootstrap() masks modelr::bootstrap()
## x dplyr::filter() masks stats::filter()
## x data.table::first() masks dplyr::first()
## x dplyr::lag() masks stats::lag()
## x data.table::last() masks dplyr::last()
## x purrr::lift() masks caret::lift()
## x purrr::transpose() masks data.table::transpose()
carrefour <- fread('http://bit.ly/CarreFourDataset')
carrefour
## Invoice ID Branch Customer type Gender Product line Unit price
## 1: 750-67-8428 A Member Female Health and beauty 74.69
## 2: 226-31-3081 C Normal Female Electronic accessories 15.28
## 3: 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4: 123-19-1176 A Member Male Health and beauty 58.22
## 5: 373-73-7910 A Normal Male Sports and travel 86.31
## ---
## 996: 233-67-5758 C Normal Male Health and beauty 40.35
## 997: 303-96-2227 B Normal Female Home and lifestyle 97.38
## 998: 727-02-1313 A Member Male Food and beverages 31.84
## 999: 347-56-2442 A Normal Male Home and lifestyle 65.82
## 1000: 849-09-3807 A Member Female Fashion accessories 88.34
## Quantity Tax Date Time Payment cogs
## 1: 7 26.1415 1/5/2019 13:08 Ewallet 522.83
## 2: 5 3.8200 3/8/2019 10:29 Cash 76.40
## 3: 7 16.2155 3/3/2019 13:23 Credit card 324.31
## 4: 8 23.2880 1/27/2019 20:33 Ewallet 465.76
## 5: 7 30.2085 2/8/2019 10:37 Ewallet 604.17
## ---
## 996: 1 2.0175 1/29/2019 13:46 Ewallet 40.35
## 997: 10 48.6900 3/2/2019 17:16 Ewallet 973.80
## 998: 1 1.5920 2/9/2019 13:22 Cash 31.84
## 999: 1 3.2910 2/22/2019 15:33 Cash 65.82
## 1000: 7 30.9190 2/18/2019 13:28 Cash 618.38
## gross margin percentage gross income Rating Total
## 1: 4.761905 26.1415 9.1 548.9715
## 2: 4.761905 3.8200 9.6 80.2200
## 3: 4.761905 16.2155 7.4 340.5255
## 4: 4.761905 23.2880 8.4 489.0480
## 5: 4.761905 30.2085 5.3 634.3785
## ---
## 996: 4.761905 2.0175 6.2 42.3675
## 997: 4.761905 48.6900 4.4 1022.4900
## 998: 4.761905 1.5920 7.7 33.4320
## 999: 4.761905 3.2910 4.1 69.1110
## 1000: 4.761905 30.9190 6.6 649.2990
head(carrefour, 6)
## Invoice ID Branch Customer type Gender Product line Unit price
## 1: 750-67-8428 A Member Female Health and beauty 74.69
## 2: 226-31-3081 C Normal Female Electronic accessories 15.28
## 3: 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4: 123-19-1176 A Member Male Health and beauty 58.22
## 5: 373-73-7910 A Normal Male Sports and travel 86.31
## 6: 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs gross margin percentage
## 1: 7 26.1415 1/5/2019 13:08 Ewallet 522.83 4.761905
## 2: 5 3.8200 3/8/2019 10:29 Cash 76.40 4.761905
## 3: 7 16.2155 3/3/2019 13:23 Credit card 324.31 4.761905
## 4: 8 23.2880 1/27/2019 20:33 Ewallet 465.76 4.761905
## 5: 7 30.2085 2/8/2019 10:37 Ewallet 604.17 4.761905
## 6: 7 29.8865 3/25/2019 18:30 Ewallet 597.73 4.761905
## gross income Rating Total
## 1: 26.1415 9.1 548.9715
## 2: 3.8200 9.6 80.2200
## 3: 16.2155 7.4 340.5255
## 4: 23.2880 8.4 489.0480
## 5: 30.2085 5.3 634.3785
## 6: 29.8865 4.1 627.6165
tail(carrefour, 6)
## Invoice ID Branch Customer type Gender Product line Unit price
## 1: 652-49-6720 C Member Female Electronic accessories 60.95
## 2: 233-67-5758 C Normal Male Health and beauty 40.35
## 3: 303-96-2227 B Normal Female Home and lifestyle 97.38
## 4: 727-02-1313 A Member Male Food and beverages 31.84
## 5: 347-56-2442 A Normal Male Home and lifestyle 65.82
## 6: 849-09-3807 A Member Female Fashion accessories 88.34
## Quantity Tax Date Time Payment cogs gross margin percentage
## 1: 1 3.0475 2/18/2019 11:40 Ewallet 60.95 4.761905
## 2: 1 2.0175 1/29/2019 13:46 Ewallet 40.35 4.761905
## 3: 10 48.6900 3/2/2019 17:16 Ewallet 973.80 4.761905
## 4: 1 1.5920 2/9/2019 13:22 Cash 31.84 4.761905
## 5: 1 3.2910 2/22/2019 15:33 Cash 65.82 4.761905
## 6: 7 30.9190 2/18/2019 13:28 Cash 618.38 4.761905
## gross income Rating Total
## 1: 3.0475 5.9 63.9975
## 2: 2.0175 6.2 42.3675
## 3: 48.6900 4.4 1022.4900
## 4: 1.5920 7.7 33.4320
## 5: 3.2910 4.1 69.1110
## 6: 30.9190 6.6 649.2990
str(carrefour)
## Classes 'data.table' and 'data.frame': 1000 obs. of 16 variables:
## $ Invoice ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "A" "C" "A" "A" ...
## $ Customer type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross margin percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
## - attr(*, ".internal.selfref")=<externalptr>
summary(carrefour)
## Invoice ID Branch Customer type Gender
## Length:1000 Length:1000 Length:1000 Length:1000
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## Product line Unit price Quantity Tax
## Length:1000 Min. :10.08 Min. : 1.00 Min. : 0.5085
## Class :character 1st Qu.:32.88 1st Qu.: 3.00 1st Qu.: 5.9249
## Mode :character Median :55.23 Median : 5.00 Median :12.0880
## Mean :55.67 Mean : 5.51 Mean :15.3794
## 3rd Qu.:77.94 3rd Qu.: 8.00 3rd Qu.:22.4453
## Max. :99.96 Max. :10.00 Max. :49.6500
## Date Time Payment cogs
## Length:1000 Length:1000 Length:1000 Min. : 10.17
## Class :character Class :character Class :character 1st Qu.:118.50
## Mode :character Mode :character Mode :character Median :241.76
## Mean :307.59
## 3rd Qu.:448.90
## Max. :993.00
## gross margin percentage gross income Rating Total
## Min. :4.762 Min. : 0.5085 Min. : 4.000 Min. : 10.68
## 1st Qu.:4.762 1st Qu.: 5.9249 1st Qu.: 5.500 1st Qu.: 124.42
## Median :4.762 Median :12.0880 Median : 7.000 Median : 253.85
## Mean :4.762 Mean :15.3794 Mean : 6.973 Mean : 322.97
## 3rd Qu.:4.762 3rd Qu.:22.4453 3rd Qu.: 8.500 3rd Qu.: 471.35
## Max. :4.762 Max. :49.6500 Max. :10.000 Max. :1042.65
dim(carrefour)
## [1] 1000 16
Completeness is achieved by checking for missing values and imputing any that are found, so that correct predictions can be made.
is.null(carrefour)
## [1] FALSE
total_null <- sum(is.na(carrefour))
total_null
## [1] 0
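A single total does not show where missing values sit, so a per-column count can be a useful companion to the check above. The following is a minimal sketch on a small made-up data frame; `toy` and its values are assumptions for illustration, not the Carrefour data.

```r
# Hypothetical toy data frame with two missing entries (not the Carrefour data)
toy <- data.frame(
  Quantity = c(7, NA, 5),
  Rating   = c(9.1, 7.4, NA),
  Branch   = c("A", "C", "B")
)

# Number of missing values in each column
missing_per_column <- colSums(is.na(toy))
missing_per_column

# Rows that contain no missing values at all
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)
```

The same two calls should work unchanged on a data.table such as `carrefour`, since it is also a data.frame.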
Consistency is achieved by removing all duplicated rows.
duplicated_rows <- carrefour[duplicated(carrefour), ]
duplicated_rows
## Empty data.table (0 rows and 16 cols): Invoice ID,Branch,Customer type,Gender,Product line,Unit price...
duplicated(carrefour)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [349] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [361] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [385] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [409] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [421] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [433] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [445] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [457] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [469] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [481] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [493] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [505] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [517] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [529] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [541] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [553] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [565] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [577] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [589] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [601] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [613] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [625] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [637] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [649] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [661] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [673] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [685] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [697] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [709] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [721] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [733] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [745] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [757] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [769] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [781] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [793] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [805] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [817] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [829] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [841] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [853] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [865] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [877] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [889] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [901] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [913] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [925] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [937] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [949] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [961] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [973] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [985] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [997] FALSE FALSE FALSE FALSE
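Printing the full logical vector is hard to read for 1000 rows. A more compact check, sketched here on a made-up data frame (`toy` is an assumption, not the Carrefour data), is to summarise the result instead:

```r
# Hypothetical toy data with one repeated row (not the Carrefour data)
toy <- data.frame(
  Invoice_ID = c("001", "002", "002", "003"),
  Total      = c(10.5, 20.0, 20.0, 7.25)
)

# Index of the first duplicated row, or 0 when there is none
first_dup <- anyDuplicated(toy)
first_dup

# Count of duplicated rows, and a copy with the duplicates dropped
n_dup   <- sum(duplicated(toy))
deduped <- unique(toy)
n_dup
nrow(deduped)
```

For a dataset with no duplicates, as found above, `anyDuplicated()` returns 0 and `unique()` leaves the row count unchanged.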
Relevance is achieved by ensuring that all the features provided for the analysis are relevant to the objective, which is the case for every feature in this dataset.
Accuracy is achieved by checking that all entries are correct.
We can visualize any outliers in the dataset using boxplots.
colnames(carrefour)
## [1] "Invoice ID" "Branch"
## [3] "Customer type" "Gender"
## [5] "Product line" "Unit price"
## [7] "Quantity" "Tax"
## [9] "Date" "Time"
## [11] "Payment" "cogs"
## [13] "gross margin percentage" "gross income"
## [15] "Rating" "Total"
# Rename the columns whose names contain spaces
names(carrefour)[names(carrefour) == "Invoice ID"] <- "Invoice_ID"
names(carrefour)[names(carrefour) == "Customer type"] <- "Customer_type"
names(carrefour)[names(carrefour) == "Product line"] <- "Product_line"
names(carrefour)[names(carrefour) == "gross margin percentage"] <- "gross_margin_percentage"
names(carrefour)[names(carrefour) == "Unit price"] <- "Unit_price"
names(carrefour)[names(carrefour) == "gross income"] <- "gross_income"
colnames(carrefour)
## [1] "Invoice_ID" "Branch"
## [3] "Customer_type" "Gender"
## [5] "Product_line" "Unit_price"
## [7] "Quantity" "Tax"
## [9] "Date" "Time"
## [11] "Payment" "cogs"
## [13] "gross_margin_percentage" "gross_income"
## [15] "Rating" "Total"
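The six renames above all follow the same rule, replacing a space with an underscore, so they could also be written as one pattern-based step. This is a sketch on a made-up data frame (`toy` is hypothetical), not the code used to produce the output above:

```r
# Hypothetical toy data frame with spaces in some column names
toy <- data.frame(
  "Invoice ID" = c("001", "002"),
  "Unit price" = c(74.69, 15.28),
  Quantity     = c(7, 5),
  check.names  = FALSE  # keep the spaces so there is something to fix
)

# Replace every space in every column name with an underscore
names(toy) <- gsub(" ", "_", names(toy), fixed = TRUE)
names(toy)
```

Columns without spaces, such as `Quantity`, pass through unchanged, so the step is safe to apply to all names at once.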
str(carrefour)
## Classes 'data.table' and 'data.frame': 1000 obs. of 16 variables:
## $ Invoice_ID : chr "750-67-8428" "226-31-3081" "631-41-3108" "123-19-1176" ...
## $ Branch : chr "A" "C" "A" "A" ...
## $ Customer_type : chr "Member" "Normal" "Normal" "Member" ...
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Product_line : chr "Health and beauty" "Electronic accessories" "Home and lifestyle" "Health and beauty" ...
## $ Unit_price : num 74.7 15.3 46.3 58.2 86.3 ...
## $ Quantity : int 7 5 7 8 7 7 6 10 2 3 ...
## $ Tax : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Date : chr "1/5/2019" "3/8/2019" "3/3/2019" "1/27/2019" ...
## $ Time : chr "13:08" "10:29" "13:23" "20:33" ...
## $ Payment : chr "Ewallet" "Cash" "Credit card" "Ewallet" ...
## $ cogs : num 522.8 76.4 324.3 465.8 604.2 ...
## $ gross_margin_percentage: num 4.76 4.76 4.76 4.76 4.76 ...
## $ gross_income : num 26.14 3.82 16.22 23.29 30.21 ...
## $ Rating : num 9.1 9.6 7.4 8.4 5.3 4.1 5.8 8 7.2 5.9 ...
## $ Total : num 549 80.2 340.5 489 634.4 ...
## - attr(*, ".internal.selfref")=<externalptr>
a <- carrefour$Unit_price
boxplot(a)
b. Quantity
quantity <- carrefour$Quantity
boxplot(quantity)
cogs <- carrefour$cogs
boxplot(cogs)
b <- carrefour$gross_margin_percentage
boxplot(b)
gross_income <- carrefour$gross_income
boxplot(gross_income)
rating <- carrefour$Rating
boxplot(rating)
total <- carrefour$Total
boxplot(total)
To list the outlier values and count them:
Gross Income
a <- carrefour$gross_income
boxplot.stats(a)$out
## [1] 47.790 49.490 49.650 47.720 48.605 49.260 48.750 48.685 48.690
There are 9 outlier entries.
Total
a <- carrefour$Total
boxplot.stats(a)$out
## [1] 1003.590 1039.290 1042.650 1002.120 1020.705 1034.460 1023.750 1022.385
## [9] 1022.490
There are 9 outlier entries.
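Rather than inspecting one column at a time, the same `boxplot.stats()` rule can be applied to every numeric column in a single pass. A minimal sketch, assuming a made-up data frame `toy` and a hypothetical helper `count_outliers()`:

```r
# Hypothetical toy data (not the Carrefour data); 100 is an obvious outlier
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 100),
  y = c(10, 11, 12, 13, 14, 15)
)

# Count the points boxplot.stats() flags as outliers in one numeric vector
count_outliers <- function(v) length(boxplot.stats(v)$out)

# Apply the helper to every numeric column
numeric_cols   <- toy[sapply(toy, is.numeric)]
outlier_counts <- sapply(numeric_cols, count_outliers)
outlier_counts
```

`boxplot.stats()` uses its default `coef = 1.5`, so a point counts as an outlier when it lies more than 1.5 times the box length beyond the hinges, the same rule the boxplots above draw.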
mean(carrefour$Unit_price, trim = 0, na.rm=FALSE)
## [1] 55.67213
median(carrefour$Unit_price,na.rm=FALSE)
## [1] 55.23
range(carrefour$Unit_price,na.rm=FALSE, finite=FALSE)
## [1] 10.08 99.96
quantile(carrefour$Unit_price, probs=seq(0, 1,0.25), na.rm=FALSE, names=TRUE, type=7)
## 0% 25% 50% 75% 100%
## 10.080 32.875 55.230 77.935 99.960
var(carrefour$Unit_price)
## [1] 701.9653
sd(carrefour$Unit_price,na.rm=FALSE)
## [1] 26.49463
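Mode
# Return the most frequent value in v (the first one found if there is a tie)
getmode <- function(v){
  uniqv <- unique(v)                            # distinct values
  uniqv[which.max(tabulate(match(v, uniqv)))]   # value with the highest count
}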
unit_price_mode <- getmode(carrefour$Unit_price)
unit_price_mode
## [1] 83.77
Visualizing Unit Price.
d<-hist(carrefour$Unit_price, breaks=10, col="red", xlab="Unit price",main="Unit price")
plot(d)
The most frequent unit prices fall in the ranges 20-30, 70-80 and 90-100.
head(carrefour)
## Invoice_ID Branch Customer_type Gender Product_line Unit_price
## 1: 750-67-8428 A Member Female Health and beauty 74.69
## 2: 226-31-3081 C Normal Female Electronic accessories 15.28
## 3: 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4: 123-19-1176 A Member Male Health and beauty 58.22
## 5: 373-73-7910 A Normal Male Sports and travel 86.31
## 6: 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs gross_margin_percentage
## 1: 7 26.1415 1/5/2019 13:08 Ewallet 522.83 4.761905
## 2: 5 3.8200 3/8/2019 10:29 Cash 76.40 4.761905
## 3: 7 16.2155 3/3/2019 13:23 Credit card 324.31 4.761905
## 4: 8 23.2880 1/27/2019 20:33 Ewallet 465.76 4.761905
## 5: 7 30.2085 2/8/2019 10:37 Ewallet 604.17 4.761905
## 6: 7 29.8865 3/25/2019 18:30 Ewallet 597.73 4.761905
## gross_income Rating Total
## 1: 26.1415 9.1 548.9715
## 2: 3.8200 9.6 80.2200
## 3: 16.2155 7.4 340.5255
## 4: 23.2880 8.4 489.0480
## 5: 30.2085 5.3 634.3785
## 6: 29.8865 4.1 627.6165
hist(carrefour$Quantity, breaks=12, col="skyblue",xlab="Quantity", main='Quantity of Products')
Most product quantities bought are 1 or 2.
d <- density(carrefour$Tax)
plot(d, xlab="Tax")
cogs <- hist(carrefour$cogs, xlab="cogs")
plot(cogs)
Transactions are most frequent at the lowest cogs values, and the frequency falls as the value of cogs increases.
head(carrefour)
## Invoice_ID Branch Customer_type Gender Product_line Unit_price
## 1: 750-67-8428 A Member Female Health and beauty 74.69
## 2: 226-31-3081 C Normal Female Electronic accessories 15.28
## 3: 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4: 123-19-1176 A Member Male Health and beauty 58.22
## 5: 373-73-7910 A Normal Male Sports and travel 86.31
## 6: 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs gross_margin_percentage
## 1: 7 26.1415 1/5/2019 13:08 Ewallet 522.83 4.761905
## 2: 5 3.8200 3/8/2019 10:29 Cash 76.40 4.761905
## 3: 7 16.2155 3/3/2019 13:23 Credit card 324.31 4.761905
## 4: 8 23.2880 1/27/2019 20:33 Ewallet 465.76 4.761905
## 5: 7 30.2085 2/8/2019 10:37 Ewallet 604.17 4.761905
## 6: 7 29.8865 3/25/2019 18:30 Ewallet 597.73 4.761905
## gross_income Rating Total
## 1: 26.1415 9.1 548.9715
## 2: 3.8200 9.6 80.2200
## 3: 16.2155 7.4 340.5255
## 4: 23.2880 8.4 489.0480
## 5: 30.2085 5.3 634.3785
## 6: 29.8865 4.1 627.6165
Covariance
Covariance is a statistical measure of the degree to which two variables vary together.
carrefour_cov <- carrefour[,c(6,7,8,12,14,16)]
cov(carrefour_cov)
## Unit_price Quantity Tax cogs gross_income
## Unit_price 701.9653313 0.8347785 196.66834 3933.3668 196.66834
## Quantity 0.8347785 8.5464464 24.14957 482.9914 24.14957
## Tax 196.6683401 24.1495704 137.09659 2741.9319 137.09659
## cogs 3933.3668019 482.9914076 2741.93188 54838.6377 2741.93188
## gross_income 196.6683401 24.1495704 137.09659 2741.9319 137.09659
## Total 4130.0351420 507.1409780 2879.02848 57580.5695 2879.02848
## Total
## Unit_price 4130.035
## Quantity 507.141
## Tax 2879.028
## cogs 57580.570
## gross_income 2879.028
## Total 60459.598
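Each entry of the matrix above is a sample covariance: the sum of the products of the two variables' deviations from their means, divided by n - 1. The sketch below works that out by hand for two short made-up vectors (`x` and `y` are assumptions, not Carrefour columns) and compares the result with `cov()`:

```r
# Two short hypothetical vectors (not taken from the Carrefour data)
x <- c(2, 4, 6, 8)
y <- c(1, 3, 2, 6)

# Sample covariance: sum of products of deviations, divided by n - 1
n <- length(x)
cov_by_hand <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cov_by_hand

# Check the hand calculation against the built-in function
all.equal(cov_by_hand, cov(x, y))
```

A positive value means the two variables tend to rise together. The size depends on the units of both variables, which is why the correlation matrix is easier to compare across pairs.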
carrefour.cor <- cor(carrefour_cov, method=c('spearman'))
Visualizing the correlation matrix.
#install.packages('corrplot')
library(corrplot)
## corrplot 0.92 loaded
corrplot(carrefour.cor)
cogs, gross income, Tax and Total are highly correlated with each other.
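The highly correlated pairs can also be listed from the correlation matrix directly instead of being read off the plot. This is a sketch under stated assumptions: `toy` is made-up data and the 0.9 cutoff is chosen only for illustration.

```r
# Hypothetical toy data: b is an exact multiple of a, c is unrelated noise
set.seed(1)
toy <- data.frame(a = rnorm(50))
toy$b <- 20 * toy$a
toy$c <- rnorm(50)

cor_mat <- cor(toy, method = "spearman")

# Keep each pair once (upper triangle) and pick those above the cutoff
cutoff <- 0.9  # assumed threshold, for illustration only
high   <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  rho  = cor_mat[high]
)
high_pairs
```

Restricting the search to the upper triangle drops the diagonal and avoids reporting each pair twice.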
1. Dimensionality Reduction.
carrefour_1 <- carrefour
head(carrefour_1)
## Invoice_ID Branch Customer_type Gender Product_line Unit_price
## 1: 750-67-8428 A Member Female Health and beauty 74.69
## 2: 226-31-3081 C Normal Female Electronic accessories 15.28
## 3: 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4: 123-19-1176 A Member Male Health and beauty 58.22
## 5: 373-73-7910 A Normal Male Sports and travel 86.31
## 6: 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs gross_margin_percentage
## 1: 7 26.1415 1/5/2019 13:08 Ewallet 522.83 4.761905
## 2: 5 3.8200 3/8/2019 10:29 Cash 76.40 4.761905
## 3: 7 16.2155 3/3/2019 13:23 Credit card 324.31 4.761905
## 4: 8 23.2880 1/27/2019 20:33 Ewallet 465.76 4.761905
## 5: 7 30.2085 2/8/2019 10:37 Ewallet 604.17 4.761905
## 6: 7 29.8865 3/25/2019 18:30 Ewallet 597.73 4.761905
## gross_income Rating Total
## 1: 26.1415 9.1 548.9715
## 2: 3.8200 9.6 80.2200
## 3: 16.2155 7.4 340.5255
## 4: 23.2880 8.4 489.0480
## 5: 30.2085 5.3 634.3785
## 6: 29.8865 4.1 627.6165
carrefour.pca <- prcomp(carrefour_1[,c(6,7,8,12,14,15,16)], center =TRUE, scale. = TRUE)
summary(carrefour.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.2185 1.0002 0.9939 0.30001 2.981e-16 1.493e-16
## Proportion of Variance 0.7031 0.1429 0.1411 0.01286 0.000e+00 0.000e+00
## Cumulative Proportion 0.7031 0.8460 0.9871 1.00000 1.000e+00 1.000e+00
## PC7
## Standard deviation 9.831e-17
## Proportion of Variance 0.000e+00
## Cumulative Proportion 1.000e+00
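We obtain 7 principal components, each explaining a share of the total variation in the dataset. PC1 explains about 70% of the variance, and PC2 and PC3 about 14% each, so the first three components together account for roughly 99%. PC5 to PC7 have essentially zero variance because Tax, cogs, gross_income and Total are exact multiples of one another (the covariance matrix above shows identical rows for Tax and gross_income).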
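The "Proportion of Variance" row in the summary comes from the component standard deviations: each one is squared and divided by the sum of all the squares. The sketch below reproduces that calculation on made-up data (`toy` is an assumption, not the Carrefour data), where one column is an exact multiple of another so that one component carries no variance:

```r
# Hypothetical toy data: x2 is an exact multiple of x1, x3 is independent
set.seed(42)
toy <- data.frame(x1 = rnorm(100), x3 = rnorm(100))
toy$x2 <- 20 * toy$x1

toy_pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# The variance of each component is its squared standard deviation
pc_var   <- toy_pca$sdev^2
prop_var <- pc_var / sum(pc_var)  # "Proportion of Variance" row
cum_var  <- cumsum(prop_var)      # "Cumulative Proportion" row

round(prop_var, 4)
round(cum_var, 4)
```

Because `x2` duplicates the information in `x1`, the last component's share is numerically zero, which is the same pattern seen for PC5 to PC7 above.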
str(carrefour.pca)
## List of 5
## $ sdev : num [1:7] 2.22 1.00 9.94e-01 3.00e-01 2.98e-16 ...
## $ rotation: num [1:7, 1:7] -0.292 -0.325 -0.45 -0.45 -0.45 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:7] "Unit_price" "Quantity" "Tax" "cogs" ...
## .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
## $ center : Named num [1:7] 55.67 5.51 15.38 307.59 15.38 ...
## ..- attr(*, "names")= chr [1:7] "Unit_price" "Quantity" "Tax" "cogs" ...
## $ scale : Named num [1:7] 26.49 2.92 11.71 234.18 11.71 ...
## ..- attr(*, "names")= chr [1:7] "Unit_price" "Quantity" "Tax" "cogs" ...
## $ x : num [1:1000, 1:7] -2.005 2.306 -0.186 -1.504 -2.8 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : NULL
## .. ..$ : chr [1:7] "PC1" "PC2" "PC3" "PC4" ...
## - attr(*, "class")= chr "prcomp"
Plotting PCA.
#install.packages("devtools")
#install_github("vqv/ggbiplot")
library(devtools)
## Loading required package: usethis
library(ggbiplot)
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
##
## compact
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Loading required package: scales
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
## Loading required package: grid
ggbiplot(carrefour.pca)