This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
df =read.csv("http://bit.ly/CarreFourDataset")
head(df)
## Invoice.ID Branch Customer.type Gender Product.line Unit.price
## 1 750-67-8428 A Member Female Health and beauty 74.69
## 2 226-31-3081 C Normal Female Electronic accessories 15.28
## 3 631-41-3108 A Normal Male Home and lifestyle 46.33
## 4 123-19-1176 A Member Male Health and beauty 58.22
## 5 373-73-7910 A Normal Male Sports and travel 86.31
## 6 699-14-3026 C Normal Male Electronic accessories 85.39
## Quantity Tax Date Time Payment cogs gross.margin.percentage
## 1 7 26.1415 1/5/2019 13:08 Ewallet 522.83 4.761905
## 2 5 3.8200 3/8/2019 10:29 Cash 76.40 4.761905
## 3 7 16.2155 3/3/2019 13:23 Credit card 324.31 4.761905
## 4 8 23.2880 1/27/2019 20:33 Ewallet 465.76 4.761905
## 5 7 30.2085 2/8/2019 10:37 Ewallet 604.17 4.761905
## 6 7 29.8865 3/25/2019 18:30 Ewallet 597.73 4.761905
## gross.income Rating Total
## 1 26.1415 9.1 548.9715
## 2 3.8200 9.6 80.2200
## 3 16.2155 7.4 340.5255
## 4 23.2880 8.4 489.0480
## 5 30.2085 5.3 634.3785
## 6 29.8865 4.1 627.6165
##Data Cleaning
# Checking null values
colSums(is.na(df))
## Invoice.ID Branch Customer.type
## 0 0 0
## Gender Product.line Unit.price
## 0 0 0
## Quantity Tax Date
## 0 0 0
## Time Payment cogs
## 0 0 0
## gross.margin.percentage gross.income Rating
## 0 0 0
## Total
## 0
#There is no null values in our dataset
sum(duplicated(df))
## [1] 0
#there are no duplicates in our data
numcols= df[c(6:8,12,14:16)]
head(numcols)
## Unit.price Quantity Tax cogs gross.income Rating Total
## 1 74.69 7 26.1415 522.83 26.1415 9.1 548.9715
## 2 15.28 5 3.8200 76.40 3.8200 9.6 80.2200
## 3 46.33 7 16.2155 324.31 16.2155 7.4 340.5255
## 4 58.22 8 23.2880 465.76 23.2880 8.4 489.0480
## 5 86.31 7 30.2085 604.17 30.2085 5.3 634.3785
## 6 85.39 7 29.8865 597.73 29.8865 4.1 627.6165
# Apply PCA using prcomp function
#The prcomp function takes in the data as input, and it is highly recommended to set the argument scale=TRUE. This standardize the input data so that it has zero mean and variance one before doing PCA
# Need to scale / Normalize as
# PCA depends on distance measure
my_pca <- prcomp(numcols, scale = TRUE,
center = TRUE, retx = T)
names(my_pca)
## [1] "sdev" "rotation" "center" "scale" "x"
#it has 5 variables
pc=princomp(numcols, cor = TRUE)
names(pc)
## [1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"
# bar plot
plot(pc)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(devtools)
## Loading required package: usethis
library(ggbiplot)
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Loading required package: scales
## Loading required package: grid
ggbiplot(my_pca)
#From the graph we will see that the variables unit price, gross income and quantity contribute mostle to pc2 whereas rating contributes to pc 1
#Adding more detail to the plot, we provide arguments rownames as labels.
ggbiplot(my_pca, labels=rownames(numcols), obs.scale = 1, var.scale = 1)
##Conclusion
Pc1 contributes to majority of the variance in our data whereas Pc2 and Pc3 are almost equal.
The variables unit price, gross income and quantity contribute mostle to pc2 whereas rating contributes to pc 1