getwd()
shill <- read.csv("C:\\Users\\cynar\\Desktop\\school\\Semester 1\\unsupervised learning\\Shill Bidding Dataset\\Shill Bidding Dataset\\Shill Bidding Dataset.csv", header = TRUE)
The data used in this project was downloaded from the [UCI data repository]. The data set has one feature with character values (the third column), which is removed below because it cannot be used in PCA.
Principal component analysis (PCA) is a technique for reducing the dimensionality of a data set, increasing interpretability while minimizing information loss.
Reducing the number of input variables for a predictive model is referred to as dimensionality reduction. PCA is a projection-based method: it transforms the data by projecting it onto a set of orthogonal axes.
shill <- shill[,-3]
pc.shill <- princomp(shill,cor = TRUE)
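As a rough illustration of what princomp(..., cor = TRUE) computes, the sketch below reproduces its main quantities with an eigendecomposition of the correlation matrix. It uses R's built-in USArrests data as a stand-in (an assumption, since the shill bidding file is read from a local path), and the variable names are illustrative only.

```r
# Stand-in data: the built-in USArrests set (an assumption for illustration,
# not the shill bidding data).
X <- USArrests

# PCA on the correlation matrix, in the same way as princomp(shill, cor = TRUE).
pc <- princomp(X, cor = TRUE)

# The same quantities from an eigendecomposition of the correlation matrix.
eig <- eigen(cor(X))

# Component standard deviations are the square roots of the eigenvalues.
sdev_match <- all.equal(unname(pc$sdev), sqrt(eig$values))

# The loadings (eigenvectors) are orthonormal: t(V) %*% V is the identity.
V <- unclass(pc$loadings)
orthonormal <- all.equal(crossprod(V), diag(ncol(X)), check.attributes = FALSE)

# Scores are the standardized data projected onto the loadings, so the
# components are uncorrelated with each other.
score_cor <- cor(pc$scores)

print(sdev_match)
print(orthonormal)
print(round(score_cor, 3))
```

The orthonormal loadings are the "orthogonal axes" mentioned above, and the uncorrelated scores are the data expressed on those axes.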
These are the components that we can access from the pc.shill object.
names(pc.shill)
## [1] "sdev" "loadings" "center" "scale" "n.obs" "scores" "call"
To get a reduced dataset we need to choose the right components. The summary shows how much of the variance each component explains.
summary(pc.shill)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 1.7434048 1.5893746 1.1353522 1.0135259 0.99664789
## Proportion of Variance 0.2532884 0.2105093 0.1074187 0.0856029 0.08277558
## Cumulative Proportion 0.2532884 0.4637976 0.5712164 0.6568193 0.73959485
## Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
## Standard deviation 0.98350869 0.90545315 0.75364468 0.58377079 0.53721240
## Proportion of Variance 0.08060745 0.06832045 0.04733169 0.02839903 0.02404976
## Cumulative Proportion 0.82020230 0.88852275 0.93585444 0.96425347 0.98830323
## Comp.11 Comp.12
## Standard deviation 0.308423292 0.212688301
## Proportion of Variance 0.007927077 0.003769693
## Cumulative Proportion 0.996230307 1.000000000
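Note: the loadings (eigenvectors) are scaled so that the sum of squares of each column equals 1. The eigenvalues are the squared standard deviations of the components.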
eigenvectors <- pc.shill$loadings
eigenvalues <- pc.shill$sdev^2
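To make the link between the eigenvalues and the summary table explicit, here is a small sketch that recomputes the proportion of variance from the standard deviations. The sdev vector is copied by hand from the summary(pc.shill) output above; the names prop_var and cum_var are illustrative.

```r
# Component standard deviations copied from the summary(pc.shill) output.
sdev <- c(1.7434048, 1.5893746, 1.1353522, 1.0135259, 0.99664789, 0.98350869,
          0.90545315, 0.75364468, 0.58377079, 0.53721240, 0.308423292,
          0.212688301)

# Eigenvalues are the squared standard deviations.
eigenvalues <- sdev^2

# With a correlation-matrix PCA the eigenvalues sum to the number of
# variables (12 here), up to rounding in the copied values.
total <- sum(eigenvalues)

# Proportion of variance per component and its running total.
prop_var <- eigenvalues / total
cum_var <- cumsum(prop_var)

print(round(total, 3))
print(round(prop_var[1:4], 4))
print(round(cum_var[1:4], 4))
```

These values should agree with the "Proportion of Variance" and "Cumulative Proportion" rows of the summary, up to rounding.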
The call below returns the matrix of correlations between the original variables and the component scores. The closer the absolute value of a correlation is to one, the more strongly that variable is associated with the component.
round(cor(shill[,1:8],pc.shill$scores),3)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## Record_ID 0.055 0.055 0.023 0.537 0.811 0.197 0.015 0.083
## Auction_ID 0.072 0.110 -0.020 -0.578 0.513 -0.618 -0.018 0.028
## Bidder_Tendency -0.383 0.308 0.250 0.105 -0.010 -0.074 -0.806 -0.162
## Bidding_Ratio -0.869 0.087 -0.115 -0.036 0.016 -0.002 -0.006 0.138
## Successive_Outbidding -0.739 0.508 0.197 0.021 0.020 -0.012 0.262 -0.186
## Last_Bidding 0.239 0.783 -0.546 -0.013 -0.014 0.071 -0.047 -0.014
## Auction_Bids 0.433 0.599 0.505 0.001 -0.010 -0.057 0.042 0.057
## Starting_Price_Average 0.371 0.588 0.440 0.083 -0.165 -0.068 0.043 0.420
## Comp.9 Comp.10 Comp.11 Comp.12
## Record_ID 0.015 0.005 0.002 0.001
## Auction_ID 0.036 0.028 0.000 0.000
## Bidder_Tendency 0.022 0.051 0.001 -0.001
## Bidding_Ratio 0.178 -0.413 0.033 0.018
## Successive_Outbidding 0.008 0.059 -0.224 -0.013
## Last_Bidding -0.002 -0.035 0.016 -0.149
## Auction_Bids -0.373 -0.225 0.013 0.009
## Starting_Price_Average 0.312 0.093 -0.001 -0.002
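For a PCA on the correlation matrix, these correlations are simply the loadings multiplied by the component standard deviations. The sketch below checks that identity on R's built-in USArrests data, used here as a stand-in (an assumption); the object names are illustrative.

```r
# Stand-in data (an assumption for illustration, not the shill bidding data).
X <- USArrests
pc <- princomp(X, cor = TRUE)

# Correlations between the original variables and the component scores.
cor_scores <- cor(X, pc$scores)

# Loadings with each column multiplied by that component's standard deviation.
scaled_loadings <- sweep(unclass(pc$loadings), 2, pc$sdev, "*")

# The two matrices agree up to numerical tolerance.
same <- all.equal(cor_scores, scaled_loadings, check.attributes = FALSE)

# Across all components, each variable's squared correlations sum to 1.
row_ss <- rowSums(cor_scores^2)

print(same)
print(round(row_ss, 6))
```

This is why a variable with a correlation near one on a component is almost fully described by that component alone.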
The scree plot helps establish how many components to keep using the elbow method. The dashed red line marks a variance (eigenvalue) of one; components with variance above that line are retained, which here means the first four.
screeplot(pc.shill, type = "l", main = "Scree plot for Shill Bidding")
abline(h = 1, col = "red", lty = 2)
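The eigenvalue-greater-than-one rule can also be checked numerically rather than read off the plot. This is a minimal sketch, assuming the standard deviations reported by summary(pc.shill); only the first six are copied, since the remaining ones are smaller still.

```r
# First six component standard deviations, copied from summary(pc.shill).
# The remaining six are all below 1, so they cannot pass the threshold.
sdev <- c(1.7434048, 1.5893746, 1.1353522, 1.0135259, 0.99664789, 0.98350869)

# Variance of each component.
eigenvalues <- sdev^2

# Components whose variance exceeds 1, i.e. that explain more than a single
# standardized variable would on its own.
keep <- which(eigenvalues > 1)
n_keep <- length(keep)

print(round(eigenvalues, 3))
print(n_keep)
```

The fourth component passes only narrowly and the fifth misses narrowly, so the cutoff is best read together with the elbow in the plot.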
Below is a plot of the scores on the first component (C1) against the scores on the second component (C2).
plot(pc.shill$scores[,1:2],type = 'n',xlab='C1',ylab = 'C2')
points(pc.shill$scores[,1:2],cex = 0.5)
Principal components analysis:
library(psych)
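The rotate parameter is set to "none" here; "varimax" is an alternative. With four components the cumulative variance explained is 66%, and the fit based upon off-diagonal values is 0.89.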
PCA2 <- principal(shill,nfactors =4, rotate = "none")
PCA2
## Principal Components Analysis
## Call: principal(r = shill, nfactors = 4, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 PC4 h2 u2 com
## Record_ID -0.05 0.06 0.02 -0.54 0.30 0.705 1.0
## Auction_ID -0.07 0.11 -0.02 0.58 0.35 0.648 1.1
## Bidder_Tendency 0.38 0.31 0.25 -0.11 0.32 0.685 2.9
## Bidding_Ratio 0.87 0.09 -0.11 0.04 0.78 0.223 1.1
## Successive_Outbidding 0.74 0.51 0.20 -0.02 0.84 0.157 1.9
## Last_Bidding -0.24 0.78 -0.55 0.01 0.97 0.031 2.0
## Auction_Bids -0.43 0.60 0.51 0.00 0.80 0.198 2.8
## Starting_Price_Average -0.37 0.59 0.44 -0.08 0.68 0.316 2.7
## Early_Bidding -0.30 0.76 -0.54 0.01 0.97 0.033 2.2
## Winning_Ratio 0.76 -0.03 -0.24 0.05 0.63 0.367 1.2
## Auction_Duration -0.03 0.07 0.20 0.62 0.43 0.573 1.2
## Class 0.73 0.50 0.19 -0.01 0.82 0.182 1.9
##
## PC1 PC2 PC3 PC4
## SS loadings 3.04 2.53 1.29 1.03
## Proportion Var 0.25 0.21 0.11 0.09
## Cumulative Var 0.25 0.46 0.57 0.66
## Proportion Explained 0.39 0.32 0.16 0.13
## Cumulative Proportion 0.39 0.71 0.87 1.00
##
## Mean item complexity = 1.8
## Test of the hypothesis that 4 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.09
## with the empirical chi square 6517.27 with prob < 0
##
## Fit based upon off diagonal values = 0.89
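The h2 and u2 columns in the output above are the communality and uniqueness of each variable: h2 is the sum of squared loadings over the retained components and u2 is 1 - h2. The sketch below computes them by hand and shows that a varimax rotation leaves them unchanged. It uses the built-in USArrests data with two components as a stand-in, and stats::varimax in place of the rotate argument of principal(); both choices are assumptions made for illustration.

```r
# Stand-in data and number of retained components (assumptions).
X <- USArrests
k <- 2
pc <- princomp(X, cor = TRUE)

# Standardized loadings: eigenvectors scaled by the component standard
# deviations, restricted to the first k components.
L <- sweep(unclass(pc$loadings)[, 1:k], 2, pc$sdev[1:k], "*")

# Communality (h2) and uniqueness (u2) of each variable.
h2 <- rowSums(L^2)
u2 <- 1 - h2

# Varimax rotation of the retained loadings (stats::varimax).
rot <- varimax(L)
L_rot <- unclass(rot$loadings)

# The rotation is orthogonal, so the communalities do not change.
h2_rot <- rowSums(L_rot^2)

print(round(cbind(h2, u2), 3))
print(round(L_rot, 3))
```

Rotation only redistributes the explained variance among the retained components to make the loadings easier to interpret; the total variance explained stays the same.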