Factor analysis is a statistical method that describes the variability among observed, correlated variables in terms of a potentially smaller number of unobserved variables called factors (or dimensions). Marketing datasets often contain many variables, and therefore many dimensions, so it helps to examine the few dimensions that matter most. In particular, marketers might be interested in customer service, product attributes, and service attributes, among others. Our goal is to identify the underlying patterns among the variables in a dataset. The following methods are commonly used:
• Principal component analysis (PCA), a form of feature extraction: it finds the primary dimensions that capture maximal variance.
• Exploratory factor analysis (EFA) and confirmatory factor analysis (CFA): they capture variance with a small number of interrelated factors while keeping the underlying dimensions of the dataset interpretable.
These methods are also related to perceptual mapping and multidimensional scaling.
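To make "dimensions that capture maximal variance" concrete, here is a minimal sketch on toy data. The two variables `x` and `y` and the object names `toy`, `toy.pc`, and `toy.var` are assumptions made up for illustration; they are not part of the brand ratings data used later.

```r
# Toy data (hypothetical, not the brand ratings): two correlated variables.
set.seed(42)
x <- rnorm(200)
y <- 0.8 * x + rnorm(200, sd = 0.4)
toy <- data.frame(x = x, y = y)

# PCA on standardized columns; scale. = TRUE standardizes inside prcomp().
toy.pc <- prcomp(toy, scale. = TRUE)

# Share of the total variance captured by each component.
toy.var <- toy.pc$sdev^2 / sum(toy.pc$sdev^2)
print(round(toy.var, 3))
```

Because the two variables are strongly correlated, the first component should carry most of the variance and the second only a small remainder.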
For this analysis, we will use a dataset offered by MOHAMMAD SHADAN. The data are hosted online and can be read directly into R.
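# stringsAsFactors = TRUE keeps brand as a factor (the default changed to FALSE in R 4.0.0)
brand.ratings <- read.csv("http://goo.gl/IQl8nc", stringsAsFactors = TRUE)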
# write.csv(brand.ratings, "brand_ratings.csv")
head(brand.ratings)
## perform leader latest fun serious bargain value trendy rebuy brand
## 1 2 4 8 8 2 9 7 4 6 a
## 2 1 1 4 7 1 1 1 2 2 a
## 3 2 3 5 9 2 9 5 1 6 a
## 4 1 6 10 8 3 4 5 2 1 a
## 5 1 1 5 8 1 9 9 1 1 a
## 6 2 8 9 5 3 8 7 1 2 a
# We summarize the variables in two steps: str() and then summary()
str(brand.ratings)
## 'data.frame': 1000 obs. of 10 variables:
## $ perform: int 2 1 2 1 1 2 1 2 2 3 ...
## $ leader : int 4 1 3 6 1 8 1 1 1 1 ...
## $ latest : int 8 4 5 10 5 9 5 7 8 9 ...
## $ fun : int 8 7 9 8 8 5 7 5 10 8 ...
## $ serious: int 2 1 2 3 1 3 1 2 1 1 ...
## $ bargain: int 9 1 9 4 9 8 5 8 7 3 ...
## $ value : int 7 1 5 5 9 7 1 7 7 3 ...
## $ trendy : int 4 2 1 2 1 1 1 7 5 4 ...
## $ rebuy : int 6 2 6 1 1 2 1 1 1 1 ...
## $ brand : Factor w/ 10 levels "a","b","c","d",..: 1 1 1 1 1 1 1 1 1 1 ...
summary(brand.ratings)
## perform leader latest fun
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.000
## 1st Qu.: 1.000 1st Qu.: 2.000 1st Qu.: 4.000 1st Qu.: 4.000
## Median : 4.000 Median : 4.000 Median : 7.000 Median : 6.000
## Mean : 4.488 Mean : 4.417 Mean : 6.195 Mean : 6.068
## 3rd Qu.: 7.000 3rd Qu.: 6.000 3rd Qu.: 9.000 3rd Qu.: 8.000
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.000
##
## serious bargain value trendy
## Min. : 1.000 Min. : 1.000 Min. : 1.000 Min. : 1.00
## 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 2.000 1st Qu.: 3.00
## Median : 4.000 Median : 4.000 Median : 4.000 Median : 5.00
## Mean : 4.323 Mean : 4.259 Mean : 4.337 Mean : 5.22
## 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.: 6.000 3rd Qu.: 7.00
## Max. :10.000 Max. :10.000 Max. :10.000 Max. :10.00
##
## rebuy brand
## Min. : 1.000 a :100
## 1st Qu.: 1.000 b :100
## Median : 3.000 c :100
## Mean : 3.727 d :100
## 3rd Qu.: 5.000 e :100
## Max. :10.000 f :100
## (Other):400
# What does the last column look like? What is the last variable?
# Rescaling (standardizing) the data
# [, 1:9] excludes the 10th column because it is a categorical variable;
# all rows of the dataset are kept for the analysis.
brand.sc <- brand.ratings
brand.sc[, 1:9] <- scale(brand.ratings[, 1:9])
summary(brand.sc)
## perform leader latest fun
## Min. :-1.0888 Min. :-1.3100 Min. :-1.6878 Min. :-1.84677
## 1st Qu.:-1.0888 1st Qu.:-0.9266 1st Qu.:-0.7131 1st Qu.:-0.75358
## Median :-0.1523 Median :-0.1599 Median : 0.2615 Median :-0.02478
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.7842 3rd Qu.: 0.6069 3rd Qu.: 0.9113 3rd Qu.: 0.70402
## Max. : 1.7206 Max. : 2.1404 Max. : 1.2362 Max. : 1.43281
##
## serious bargain value trendy
## Min. :-1.1961 Min. :-1.22196 Min. :-1.3912 Min. :-1.53897
## 1st Qu.:-0.8362 1st Qu.:-0.84701 1st Qu.:-0.9743 1st Qu.:-0.80960
## Median :-0.1163 Median :-0.09711 Median :-0.1405 Median :-0.08023
## Mean : 0.0000 Mean : 0.00000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.6036 3rd Qu.: 0.65279 3rd Qu.: 0.6933 3rd Qu.: 0.64914
## Max. : 2.0434 Max. : 2.15258 Max. : 2.3610 Max. : 1.74319
##
## rebuy brand
## Min. :-1.0717 a :100
## 1st Qu.:-1.0717 b :100
## Median :-0.2857 c :100
## Mean : 0.0000 d :100
## 3rd Qu.: 0.5003 e :100
## Max. : 2.4652 f :100
## (Other):400
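As a quick check on what scale() does to each column, the sketch below standardizes a small vector by hand and compares the result with scale(). The vector `ratings` is hypothetical illustration data, not a column of brand.ratings.

```r
# Hypothetical ratings on a 1-10 scale (illustration only).
ratings <- c(2, 1, 2, 1, 1, 2, 8, 9, 10, 4)

# Manual z-score: subtract the mean, divide by the sample standard deviation.
z.manual <- (ratings - mean(ratings)) / sd(ratings)

# scale() returns a one-column matrix; as.numeric() drops the attributes.
z.scale <- as.numeric(scale(ratings))

# The two versions agree, and the result has mean 0 and standard deviation 1.
print(all.equal(z.manual, z.scale))
print(c(mean = mean(z.scale), sd = sd(z.scale)))
```

This is why every Mean row in the summary of brand.sc is 0: each rating is now measured in standard deviations from its own column mean.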
What would we expect the components to look like for this data? We will use the function prcomp() to perform PCA.
First, the ratings share variance because many of them are correlated with one another.
Second, we should expect a few components to capture most of the variance in the dataset.
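Both expectations can be illustrated on simulated data. The sketch below assumes two hypothetical latent drivers (`f1`, `f2`) behind six made-up ratings; none of these names or numbers come from the brand ratings data. It also shows that, for standardized data, the component variances from prcomp() match the eigenvalues of the correlation matrix.

```r
set.seed(1)
n <- 500

# Two hypothetical latent drivers behind six simulated ratings.
f1 <- rnorm(n)
f2 <- rnorm(n)
sim <- data.frame(
  a1 = f1 + rnorm(n, sd = 0.5),
  a2 = f1 + rnorm(n, sd = 0.5),
  a3 = f1 + rnorm(n, sd = 0.5),
  b1 = f2 + rnorm(n, sd = 0.5),
  b2 = f2 + rnorm(n, sd = 0.5),
  b3 = f2 + rnorm(n, sd = 0.5)
)

# Shared variance shows up as blocks of high correlations.
sim.cor <- cor(sim)
print(round(sim.cor, 2))

# Eigenvalues of the correlation matrix equal the PCA component variances.
sim.eig <- eigen(sim.cor)$values
sim.pc <- prcomp(sim, scale. = TRUE)
print(all.equal(sim.eig, sim.pc$sdev^2))

# Cumulative share of variance: the first two components dominate.
cum.share <- cumsum(sim.eig) / sum(sim.eig)
print(round(cum.share, 3))
```

On real ratings the blocks will be less clean, but running cor() on the nine standardized columns before PCA gives the same kind of preview.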
#Using prcomp() to perform PCA
brand.pc <- prcomp(brand.sc[, 1:9])
summary(brand.pc)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.726 1.4479 1.0389 0.8528 0.79846 0.73133 0.62458
## Proportion of Variance 0.331 0.2329 0.1199 0.0808 0.07084 0.05943 0.04334
## Cumulative Proportion 0.331 0.5640 0.6839 0.7647 0.83554 0.89497 0.93831
## PC8 PC9
## Standard deviation 0.55861 0.49310
## Proportion of Variance 0.03467 0.02702
## Cumulative Proportion 0.97298 1.00000
plot(brand.pc, type="l")
biplot(brand.pc)
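Calling plot() with type="l" draws a scree plot, which shows the variance of each principal component in order. biplot() projects the observations and the variable loadings onto the first two components. Please zoom in for a larger version of each figure if you cannot make out the details. There are 9 components because we have 9 variables.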
How much variance does the first component explain? How about the second component?
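One way to answer this is to recompute the proportions from the standard deviations printed by summary(brand.pc). The sketch below copies those nine values from the output above by hand; the object names `pc.sdev`, `pc.var`, `pc.share`, and `pc.cum` are made up for illustration.

```r
# Standard deviations of PC1..PC9, copied from the summary(brand.pc) output.
pc.sdev <- c(1.726, 1.4479, 1.0389, 0.8528, 0.79846,
             0.73133, 0.62458, 0.55861, 0.49310)

# Variance of each component is the squared standard deviation.
pc.var <- pc.sdev^2

# With 9 standardized variables, the total variance is about 9.
print(sum(pc.var))

# Proportion and cumulative proportion of variance per component.
pc.share <- pc.var / sum(pc.var)
names(pc.share) <- paste0("PC", 1:9)
pc.cum <- cumsum(pc.share)

print(round(pc.share, 3))
print(round(pc.cum, 3))
```

The first two entries of `pc.share` come out at roughly 0.331 and 0.233, matching the Proportion of Variance row, so the first two components together account for about 56% of the variance.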
What does the plot tell you? See the following conclusions section.
Rencher, A. C. (2002). Methods of Multivariate Analysis (2nd ed.). John Wiley & Sons, Inc.
http://web.stanford.edu/class/psych253/tutorials/FactorAnalysis.html