Introduction

In this paper, various dimension-reduction methods are applied to a dataset containing average ratings of different types of places from Google reviews. The data contains 5456 instances with 25 attributes and was downloaded from the UCI Machine Learning Repository.
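
A minimal loading sketch is given below; the file name google_review_ratings.csv and the read.csv() options are assumptions, since the download step itself is not shown in this paper.

google_review_ratings <- read.csv("google_review_ratings.csv", header = TRUE) # assumed file name
str(google_review_ratings) # inspect the structure of the loaded data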

Installation of packages

Below are the packages used in this paper, together with the code that installs and loads them.

requiredPackages = c("smacof", "labdsv", "vegan", "MASS", "ape", "ggfortify", "pca3d", "pls", "ClusterR", "FactoMineR", "factoextra", "psych") # list of required packages - add new packages here and rerun the code
for(i in requiredPackages) {if(!require(i, character.only = TRUE)) install.packages(i)} # install any missing packages
for(i in requiredPackages) {library(i, character.only = TRUE)} # load all packages

Classical multidimensional scaling

After choosing a group of variables we can perform MDS. In this example a set containing 8 variables, hence 8 dimensions, is scaled down to two so that the data can be visualized on a 2D plot. The computation may take a while, but since the data in R's environment can be saved, we only need to calculate it once.
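
The construction of set_01 is sketched below; the exact selection of eight rating categories (Category.10 through Category.17) is an assumption based on the category labels appearing in the output and plots that follow.

set_01 <- google_review_ratings[, paste0("Category.", 10:17)] # assumed selection of 8 rating columns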

dist.reg<-dist(set_01) # as a main input we need distance between units
mds1<-cmdscale(dist.reg, k=2) #k - the maximum dimension of the space
summary(mds1)    
##        V1                V2         
##  Min.   :-5.9719   Min.   :-4.6967  
##  1st Qu.:-1.1430   1st Qu.:-1.1935  
##  Median : 0.5851   Median : 0.2934  
##  Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 1.6461   3rd Qu.: 1.1456  
##  Max.   : 3.0848   Max.   : 4.0153
plot(mds1)

The plot we get may be a little cluttered, but it can provide essential information and serve as a basis for further analysis. Of course, the dimension reduction we performed means we lose some depth in the data we are analyzing, but we would not be able to plot the data otherwise. Even now, knowing that our data consists of Google Reviews users' ratings of different places such as pubs or restaurants, we can see a cluster of similar points forming on the right side of the plot.

It is also possible to plot relations between variables rather than units, as seen below. This plot is much less cluttered and provides some useful information. We can see that categories 12 to 15 are somewhat close to each other (average ratings of local services, burger/pizza shops, hotels, and juice bars), while categories 16 and 17 (art galleries and dance clubs) are far away from them. This can be explained either by the preferences of users or by the general standard of service of such places in the cities the reviews come from.

dist.reg2<-dist(t(set_01)) # transposition – analysis for variables not units
mds2<-cmdscale(dist.reg2, k=2) #k - the maximum dimension of the space
summary(mds2)
##        V1               V2        
##  Min.   :-98.10   Min.   :-75.87  
##  1st Qu.:-19.62   1st Qu.:-32.62  
##  Median : 11.56   Median : 14.30  
##  Mean   :  0.00   Mean   :  0.00  
##  3rd Qu.: 27.41   3rd Qu.: 37.83  
##  Max.   : 80.03   Max.   : 52.97
plot(mds2) 
text(mds2, rownames(mds2), cex=0.8)

When dealing with data we often come across outliers which make our analysis harder. We can easily visualize the lines we choose as borders, as shown below, and with a few lines of code we can then remove the points that fall outside those borders (a sketch follows the plot). Considering the nature of our data, however, we can argue that the points shown below are not in fact outliers, because the relative spread of points in the plot is fairly even, with no observations clearly standing apart.

plot(mds1) # simple graphics
abline(v=-5.5, h=-4) # vertical and horizontal border lines
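
A minimal sketch of actually removing the points outside the chosen borders is given below; the thresholds simply mirror the abline() call above.

keep <- mds1[, 1] > -5.5 & mds1[, 2] > -4 # points inside both borders
mds1.trimmed <- mds1[keep, ] # drop the flagged points
plot(mds1.trimmed) # re-plot without them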

Next we would like to perform a Mantel test on our data before applying stress majorization via SMACOF (Scaling by MAjorizing a COmplicated Function). Here our dataset must not contain NA or NaN values; if it does, the matrices listed below will not be computed. Below is a simple solution that replaces missing values with zero, making the calculations possible.

fix_na <- function(x){ # replace NA values with zero
    x[is.na(x)] <- 0
    x
}
set_01 <- fix_na(set_01)
set_01f <- rapply(set_01, f=function(x) ifelse(is.nan(x), 0, x), how="replace") # replace any remaining NaN values

After making sure our dataset is ready we can now run the code shown below to obtain our results.

sim<-cor(set_01f)  # similarity matrix
dis<-dist(set_01f) # dissimilarity matrix

sim[1:5, 1:5] # first few obs
##              Category.10 Category.11 Category.12 Category.13  Category.14
## Category.10  1.000000000   0.4598103   0.1288225  0.06622179 -0.001802804
## Category.11  0.459810277   1.0000000   0.4076424  0.33926580  0.165328602
## Category.12  0.128822476   0.4076424   1.0000000  0.47044119  0.353753274
## Category.13  0.066221790   0.3392658   0.4704412  1.00000000  0.512251769
## Category.14 -0.001802804   0.1653286   0.3537533  0.51225177  1.000000000
as.matrix(dis)[1:5, 1:5] # first few obs
##      1        2    3        4    5
## 1 0.00 0.010000 0.00 3.000000 0.00
## 2 0.01 0.000000 0.01 3.000017 0.01
## 3 0.00 0.010000 0.00 3.000000 0.00
## 4 3.00 3.000017 3.00 0.000000 3.00
## 5 0.00 0.010000 0.00 3.000000 0.00
dis2<-sim2diss(dis, method=1, to.dist = TRUE)
as.matrix(dis2)[1:5, 1:5] # first few obs
##       1         2     3         4     5
## 1  0.00  0.990000  1.00 -2.000000  1.00
## 2  0.99  0.000000  0.99 -2.000017  0.99
## 3  1.00  0.990000  0.00 -2.000000  1.00
## 4 -2.00 -2.000017 -2.00  0.000000 -2.00
## 5  1.00  0.990000  1.00 -2.000000  0.00
# Mantel test – for similarity of matrices
# H0 - no association between the compared matrices, i.e. they are unrelated
# H1 - a non-random pattern, interpreted as a similarity of the matrices

library(ape)
# to check if dis matrix and converted sim to dis are the same
mantel.test(as.matrix(dis), as.matrix(dis2))
## $z.stat
## [1] -269904658655
## 
## $p
## [1] 0.001
## 
## $alternative
## [1] "two.sided"

The p-value obtained from the Mantel test is 0.001, so we reject the null hypothesis of no association and can proceed. Stress majorization is an optimization strategy for MDS: it minimizes a cost (loss) function that measures the differences between distances in the full-dimensional space and in our reduced-dimensional space. It can also be used to draw graphs.

smacof_1<-cbind(set_01f, google_review_ratings$Category.12) # append Category.12 to the cleaned set
sim<-cor(smacof_1) # similarity (correlation) matrix
dis2<-sim2diss(sim, method=1, to.dist=TRUE) # convert similarities to dissimilarities
fit.data<-mds(dis2, ndim=2) # SMACOF stress majorization in two dimensions

plot(fit.data)
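
To check how well the two-dimensional configuration reproduces the dissimilarities, we can inspect the stress value stored in the fitted smacof object (lower is better).

fit.data$stress # stress-1 value of the two-dimensional solution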

Principal Component Analysis

As shown before, when we performed MDS we were projecting our multidimensional data onto a 2D space while attempting to preserve their distances. PCA performs a similar projection, but instead tries to preserve the directions with the most variance.
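
The set used in this part, set_02f, is prepared in the same spirit as set_01; the exact selection below (categories 10 through 14, with missing values replaced as before) is an assumption based on the variable names appearing in the output.

set_02f <- fix_na(google_review_ratings[, paste0("Category.", 10:14)]) # assumed selection of 5 rating columns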

set_02f.pca2<-PCA(set_02f, scale.unit = TRUE) # PCA from FactoMineR:: - plots include the correlation circle

As we can see, the individuals factor map is hard to read, so the variables factor map of the PCA is very handy. In this case we can say that all chosen categories are positively correlated, and hence are drawn on the same side of the plot.

fviz_pca_var(set_02f.pca2, col.var="black")

The scree plot drawn below is a helpful tool for determining the number of principal components to keep. In our case almost 70% of the variance can be explained with only two dimensions.

plot(set_02f.pca2)

fviz_eig(set_02f.pca2)
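
The exact percentages behind the scree plot can be read from the eigenvalue table provided by factoextra; the first two components should account for the roughly 70% of variance mentioned above.

get_eigenvalue(set_02f.pca2) # eigenvalues and percentage of variance per component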

set_02f.pca3<-principal(set_02f, nfactors=3, rotate="varimax") # PCA with 3 components and varimax rotation from psych::
set_02f.pca3
## Principal Components Analysis
## Call: principal(r = set_02f, nfactors = 3, rotate = "varimax")
## Standardized loadings (pattern matrix) based upon correlation matrix
##              RC3  RC2   RC1   h2   u2 com
## Category.10 0.01 0.94 -0.03 0.89 0.11 1.0
## Category.11 0.08 0.67  0.55 0.76 0.24 2.0
## Category.12 0.24 0.06  0.89 0.85 0.15 1.1
## Category.13 0.70 0.07  0.47 0.72 0.28 1.8
## Category.14 0.94 0.01  0.08 0.88 0.12 1.0
## 
##                        RC3  RC2  RC1
## SS loadings           1.43 1.35 1.33
## Proportion Var        0.29 0.27 0.27
## Cumulative Var        0.29 0.56 0.82
## Proportion Explained  0.35 0.33 0.32
## Cumulative Proportion 0.35 0.68 1.00
## 
## Mean item complexity =  1.4
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.1 
##  with the empirical chi square  1177.35  with prob <  NA 
## 
## Fit based upon off diagonal values = 0.91

Here, by extracting the item complexity from our saved PCA results, we can see how many components each category loads on (values close to one mean a variable loads mainly on a single component). We can see that categories 11 and 13 have noticeably greater complexity than the rest.

set_02f.pca3$complexity
## Category.10 Category.11 Category.12 Category.13 Category.14 
##    1.002044    1.961996    1.146989    1.775119    1.014435

Another important quantity we can extract is the uniqueness: the portion of a variable's variance that is not shared with the other variables. The lower the uniqueness, the more of the variance is captured by the retained components, so low uniquenesses are favorable, since we can reduce the number of dimensions without losing much of the variance in our data.

set_02f.pca3$uniqueness
## Category.10 Category.11 Category.12 Category.13 Category.14 
##   0.1070638   0.2381581   0.1465390   0.2846921   0.1162819
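
As a cross-check, the uniqueness of each variable equals one minus its communality (the h2 column in the loadings table above).

1 - set_02f.pca3$communality # should reproduce the uniquenesses shown above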

We can then visualize our PCA results. Doing so shows how our data differs: there is a large and dense group of points on the negative side of our chosen principal components, and generally most of the data lies below zero on the Dim2 axis. By going back to the raw data we can then see what exactly makes those points stand out from the rest.

fviz_pca_ind(set_02f.pca2, col.ind="cos2", geom = "point", gradient.cols = c("white", "#2E9FDF", "#FC4E07" ))

We can use our PCA results for clustering, a very useful and quick way to obtain meaningful results from our data. Below is a quick clustering of that data, showing how the average reviews can be grouped.

set_02f.cs<-center_scale(set_02f) # use ClusterR:: package for center_scale()
set_02f.pca2<-princomp(set_02f.cs)$scores[, 1:2] # stats:: package
km<-KMeans_rcpp(set_02f.pca2, clusters=2, num_init=5, max_iters = 100) # from ClusterR::
plot_2d(set_02f.pca2, km$clusters, km$centroids)
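
The choice of two clusters above is arbitrary; as an optional extra, not part of the original analysis, ClusterR can also plot how the within-cluster sum of squares changes with the number of clusters, which helps justify the choice.

Optimal_Clusters_KMeans(set_02f.pca2, max_clusters=10, criterion="WCSSE", plot_clusters=TRUE) # elbow-style diagnostic from ClusterR::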

We can also perform hierarchical clustering on our data. Shown below is a cluster dendrogram, which is a simple way of determining the desired number of clusters in multidimensional data (a cut of the dendrogram into clusters is sketched after the plots). At the same time we obtain a density plot of the distances, showing that the middle of the distance distribution contains the largest portion of our data. This information can be quickly used to characterize how users of Google Reviews behave and provide data useful, for example, in targeted advertising.

set_02f.s<-scale(set_02f) # standardised data (obs-mean)/sd
#set_02f.n<-normalize(set_02f) # normalized data (obs-min)/(max-min) from BBmisc::
head(set_02f)
##   Category.10 Category.11 Category.12 Category.13 Category.14
## 1        2.64          93        1.69        1.70        1.72
## 2        2.65          93        1.69        1.70        1.72
## 3        2.64          93        1.69        1.70        1.72
## 4        2.64          96        1.69        1.70        1.72
## 5        2.64          93        1.69        1.70        1.72
## 6        2.65          94        1.69        1.69        1.72
head(set_02f.s)
##      Category.10 Category.11 Category.12 Category.13 Category.14
## [1,]  -0.1473842  -0.6198836  -0.3105129  -0.3025230  -0.2986399
## [2,]  -0.1397370  -0.6198836  -0.3105129  -0.3025230  -0.2986399
## [3,]  -0.1473842  -0.6198836  -0.3105129  -0.3025230  -0.2986399
## [4,]  -0.1473842  -0.5884523  -0.3105129  -0.3025230  -0.2986399
## [5,]  -0.1473842  -0.6198836  -0.3105129  -0.3025230  -0.2986399
## [6,]  -0.1397370  -0.6094065  -0.3105129  -0.3096327  -0.2986399
dm<-dist(set_02f.s) # distances between all observations
hc<-hclust(dm, method="complete") # hierarchical clustering with complete linkage
plot(hc)    # basic dendrogram - too cluttered to read with this many observations

plot(density(dm))
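
Once a cluster count has been chosen from the dendrogram, the tree can be cut into groups; the choice of two clusters here simply mirrors the k-means example above.

groups<-cutree(hc, k=2) # assign each observation to one of two clusters
table(groups) # cluster sizes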