Project 2

library(psych)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(dendextend)
## 
## ---------------------
## Welcome to dendextend version 1.13.4
## Type citation('dendextend') for how to cite the package.
## 
## Type browseVignettes(package = 'dendextend') for the package vignette.
## The github page is: https://github.com/talgalili/dendextend/
## 
## Suggestions and bug-reports can be submitted at: https://github.com/talgalili/dendextend/issues
## Or contact: <tal.galili@gmail.com>
## 
##  To suppress this message use:  suppressPackageStartupMessages(library(dendextend))
## ---------------------
## 
## Attaching package: 'dendextend'
## The following object is masked from 'package:stats':
## 
##     cutree
library(dunn.test)
library(cluster) 
library(Rtsne) 

getwd()
## [1] "/Users/simapikunova/Downloads"
setwd("/Users/simapikunova/Downloads")
df1 = read.csv("df.csv")

Data description

For this work, I decided to focus on the professionalism of coders and selected four variables that reflect different aspects of it. First, ImpSyn - the self-assessed competence level, i.e., how individuals evaluate themselves. Second, WorkWeekHrs - the number of working hours per week, which shows how much of an individual's life is devoted to coding. Third, CodeRevHrs - how much time the coder devotes to code review (this reflects not only applied value, but also development in the field). And fourth, ConvertedComp - the aggregated value of all compensation.

For cluster analysis we need data with no missing values, so I cleaned the dataset.

df2 = df1 %>% dplyr::select(ConvertedComp, WorkWeekHrs, CodeRevHrs, ImpSyn) %>% na.omit()
head(df2)
##    ConvertedComp WorkWeekHrs CodeRevHrs                 ImpSyn
## 2          13293          70        4.0      Far above average
## 4          21996          40        0.5                Average
## 5          41244         140        1.0 A little above average
## 6           8400          35        5.0 A little above average
## 9         110000          40        5.0      Far above average
## 11         66750          45        1.0 A little above average
describe(df2)
##               vars    n      mean        sd median  trimmed      mad min
## ConvertedComp    1 1942 164790.48 351066.02  68745 77262.79 61163.18   0
## WorkWeekHrs      2 1942     41.87     18.68     40    41.36     4.45   1
## CodeRevHrs       3 1942      4.81      6.04      4     3.92     2.97   0
## ImpSyn*          4 1942      2.34      1.35      2     2.27     1.48   1
##                   max   range  skew kurtosis      se
## ConvertedComp 2000000 2000000  4.00    16.23 7966.44
## WorkWeekHrs       425     424 11.06   198.24    0.42
## CodeRevHrs         99      99  8.47   114.35    0.14
## ImpSyn*             5       4  0.24    -1.60    0.03
summary(df2)
##  ConvertedComp      WorkWeekHrs       CodeRevHrs    
##  Min.   :      0   Min.   :  1.00   Min.   : 0.000  
##  1st Qu.:  31406   1st Qu.: 40.00   1st Qu.: 2.000  
##  Median :  68745   Median : 40.00   Median : 4.000  
##  Mean   : 164790   Mean   : 41.87   Mean   : 4.815  
##  3rd Qu.: 120000   3rd Qu.: 45.00   3rd Qu.: 5.000  
##  Max.   :2000000   Max.   :425.00   Max.   :99.000  
##                     ImpSyn   
##  A little above average:903  
##  A little below average: 94  
##  Average               :375  
##  Far above average     :528  
##  Far below average     : 42  
## 

The ImpSyn variable uses a five-point scale that can be converted to a numeric one.

df2$ImpSyn = str_replace(df2$ImpSyn, "Far below average", "1")
df2$ImpSyn = str_replace(df2$ImpSyn, "A little below average", "2")
df2$ImpSyn = str_replace(df2$ImpSyn, "Average", "3")
df2$ImpSyn = str_replace(df2$ImpSyn, "A little above average", "4")
df2$ImpSyn = str_replace(df2$ImpSyn, "Far above average", "5")
df2$ImpSyn = as.numeric(df2$ImpSyn)
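
A hypothetical, equivalent way to do this recoding is to turn the labels into an ordered factor and take the integer codes; it is shown here only for reference (df2$ImpSyn is already numeric at this point, so the assignment is left commented out):

impsyn_levels <- c("Far below average", "A little below average", "Average",
                   "A little above average", "Far above average")
# df2$ImpSyn <- as.numeric(factor(df2$ImpSyn, levels = impsyn_levels))  # integer codes 1-5 match the mapping above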

All variables in the data are numeric now. Let's look at the distributions.

library(dplyr)
library(ggplot2)

df2 %>%
  ggplot(aes(x = ConvertedComp)) +
  geom_histogram(fill = 'lightblue2', color = 'black') +
  labs(x="Values", y="Frequency") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here we see a long tail on the right, so I decided to get rid of the outliers and drop the values > 500000.

df2 = df2 %>% filter(ConvertedComp < 500000)
df2 %>%
  ggplot(aes(x = WorkWeekHrs)) +
  geom_histogram(fill = 'lightblue2', color = 'black') +
  labs(x="Values", y="Frequency") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Here we see a long tail on the right, so I decided to get rid of the outliers and drop the values > 100.

df2 = df2 %>% filter(WorkWeekHrs < 100)
df2 %>%
  ggplot(aes(x = CodeRevHrs)) +
  geom_histogram(fill = 'lightblue2', color = 'black') +
  labs(x="Values", y="Frequency") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

df2 %>%
  ggplot(aes(x = ImpSyn)) +
  geom_histogram(fill = 'lightblue2', color = 'black') +
  labs(x="Values", y="Frequency") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Since the variables are on very different scales, I standardized them.

df2$ConvertedComp = scale(df2$ConvertedComp)
df2$WorkWeekHrs = scale(df2$WorkWeekHrs)
df2$CodeRevHrs = scale(df2$CodeRevHrs)
df2$ImpSyn = scale(df2$ImpSyn)
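
The four calls above could also be collapsed into one step; a small sketch using mutate_all() (scale() returns one-column matrices, so as.numeric() keeps the columns as plain vectors). Running it after the block above would be numerically a no-op, since the columns are already standardized:

df2 <- df2 %>% mutate_all(~ as.numeric(scale(.)))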

Let’s look at the data now.

describe(df2)
##               vars    n mean sd median trimmed  mad   min   max range
## ConvertedComp    1 1792    0  1  -0.20   -0.15 0.79 -1.13  5.93  7.06
## WorkWeekHrs      2 1792    0  1  -0.06    0.05 0.39 -3.82  4.77  8.59
## CodeRevHrs       3 1792    0  1  -0.13   -0.15 0.56 -0.89 17.93 18.82
## ImpSyn           4 1792    0  1   0.09    0.10 1.61 -3.17  1.17  4.34
##                skew kurtosis   se
## ConvertedComp  2.08     6.83 0.02
## WorkWeekHrs   -0.43     3.82 0.02
## CodeRevHrs     8.02   120.96 0.02
## ImpSyn        -0.86     0.75 0.02

The final dataset has 1792 observations of 4 numeric variables, each with mean 0 and standard deviation 1.

Cluster analysis

K-means

Number of clusters

First, how many clusters do we need? Let's look at fviz_nbclust() with the "wss" (within-cluster sum of squares) method.

library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(cluster)
fviz_nbclust(df2, kmeans, method = "wss")

The elbow is not very pronounced, but we can see a change in slope at 2 and 3 clusters.

Now - “silhouette” method.

fviz_nbclust(df2, kmeans, method = "silhouette")

Here the optimal value is 3.

Computing k-means clusters

Here I group the data into three clusters (centers = 3), following the result of the previous step.

set.seed(20) 
dfCluster <- kmeans(df2, 3, nstart = 20)
print(dfCluster)
## K-means clustering with 3 clusters of sizes 447, 317, 1028
## 
## Cluster means:
##   ConvertedComp WorkWeekHrs  CodeRevHrs     ImpSyn
## 1    -0.3392283  -0.1608265  0.05966583 -1.3846361
## 2     1.5623734   0.6539469  0.01562908  0.4134569
## 3    -0.3342775  -0.1317235 -0.03076366  0.4745783
## 
## Clustering vector:
##    [1] 3 1 3 3 3 3 3 3 3 1 3 1 3 3 3 3 3 3 3 3 3 1 2 3 3 3 2 2 1 1 3 2 3 3
##   [35] 2 3 1 1 1 2 3 1 3 2 3 3 2 1 3 3 3 1 3 3 3 1 1 1 3 3 3 3 2 3 3 3 3 3
##   [69] 3 2 3 3 3 1 3 1 1 3 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 1 3 3 1 2 1 1 3 3
##  [103] 1 1 3 3 3 3 3 3 3 3 1 2 3 3 1 1 3 1 3 3 3 3 1 2 1 2 1 3 3 3 3 3 3 2
##  [137] 3 2 1 3 3 3 2 3 3 3 2 1 2 3 1 3 1 3 2 3 1 3 3 1 2 1 2 1 1 3 3 1 2 2
##  [171] 3 2 1 3 3 1 1 1 3 3 3 2 1 3 1 2 3 3 2 3 2 2 3 3 3 3 3 3 3 3 3 3 1 3
##  [205] 2 1 1 2 3 3 3 3 2 3 3 3 1 1 1 3 1 3 3 1 3 3 2 1 3 1 2 1 3 3 3 1 3 1
##  [239] 1 3 3 3 3 1 3 1 2 3 3 3 1 3 2 3 1 2 1 2 1 3 1 2 1 3 1 2 3 3 1 1 3 3
##  [273] 3 3 1 3 3 1 3 3 3 3 3 1 3 3 1 1 3 3 3 2 2 3 1 3 2 1 2 2 3 3 1 3 1 3
##  [307] 3 3 2 3 3 2 2 2 3 3 3 1 3 2 3 1 3 3 3 2 3 3 3 3 3 1 2 3 3 3 3 3 3 3
##  [341] 2 1 2 2 3 3 3 3 3 3 3 3 1 3 3 2 2 2 3 2 3 1 3 3 3 3 3 1 3 3 1 3 3 3
##  [375] 2 1 1 1 3 3 2 1 3 1 3 3 2 3 3 1 1 2 3 3 3 3 3 3 2 1 2 1 1 3 1 2 1 3
##  [409] 3 3 3 2 2 3 1 1 1 2 3 2 3 3 3 3 1 3 2 3 3 3 1 1 1 2 3 3 1 2 3 3 3 1
##  [443] 3 3 1 1 1 3 1 3 3 1 3 1 3 2 3 1 3 3 3 3 1 1 3 3 1 1 3 3 1 2 3 1 1 3
##  [477] 1 2 3 1 1 3 3 1 3 3 2 2 3 1 1 1 3 3 3 2 3 1 3 1 1 1 3 1 1 3 1 3 3 3
##  [511] 3 3 3 1 3 3 1 2 3 2 3 3 2 1 3 2 1 1 1 3 3 1 1 3 3 3 1 2 3 3 3 3 3 3
##  [545] 3 3 1 3 1 1 1 3 3 3 3 3 3 1 1 1 3 3 3 2 3 3 1 1 1 3 1 3 2 3 2 3 3 1
##  [579] 2 3 2 3 3 2 2 1 3 3 3 1 3 2 3 3 3 1 3 3 3 2 1 1 3 1 1 3 3 3 3 3 3 2
##  [613] 3 3 3 3 1 3 2 3 3 3 1 2 2 3 2 1 3 1 3 3 3 3 3 1 2 3 2 3 1 2 3 1 1 1
##  [647] 3 3 1 3 3 3 3 3 3 2 1 3 2 3 3 3 3 3 3 2 3 3 3 3 3 2 3 2 1 3 1 3 3 2
##  [681] 3 3 1 3 3 3 2 1 1 2 3 2 2 1 3 3 3 3 3 3 3 1 1 2 3 1 2 3 3 1 3 2 1 3
##  [715] 3 1 1 1 3 3 3 2 3 1 3 2 2 3 3 1 1 1 3 2 1 3 2 3 3 3 3 1 3 3 3 2 3 2
##  [749] 3 2 2 2 1 3 1 3 1 2 3 3 1 2 3 3 2 3 1 3 3 3 3 3 3 3 3 2 3 1 2 3 3 2
##  [783] 2 3 3 1 1 3 1 2 3 3 3 3 1 3 1 3 3 3 3 3 3 2 3 2 1 1 3 1 3 3 1 1 2 3
##  [817] 3 2 3 3 2 2 3 1 1 1 3 1 2 3 1 3 1 3 1 3 3 3 3 3 3 1 3 2 3 3 3 3 1 1
##  [851] 3 3 3 1 2 3 3 3 1 1 2 2 2 3 3 1 3 3 2 3 3 3 1 1 3 1 2 3 3 1 3 1 3 3
##  [885] 3 2 3 2 3 3 2 3 2 1 2 1 3 2 3 3 3 3 3 3 3 1 3 2 3 3 3 2 2 3 3 3 3 3
##  [919] 1 3 1 2 1 2 1 3 1 3 3 3 3 2 3 1 3 2 1 3 1 1 2 1 3 3 1 3 3 3 3 1 3 3
##  [953] 1 3 2 2 2 1 3 2 3 3 3 2 2 1 3 1 1 2 3 3 2 3 3 2 1 3 3 1 1 3 3 3 1 3
##  [987] 3 2 3 3 1 3 3 2 3 1 3 1 2 1 3 1 3 1 3 3 1 3 3 1 2 1 3 3 3 3 2 3 2 3
## [1021] 3 3 2 3 3 3 2 2 3 1 3 1 3 1 3 2 2 1 1 2 3 3 1 2 3 2 3 3 1 3 1 3 3 1
## [1055] 1 1 3 3 3 3 3 3 3 3 1 2 3 3 1 1 3 2 3 3 3 1 1 3 3 3 3 1 2 3 2 3 1 3
## [1089] 3 1 1 2 1 1 2 3 3 3 3 3 3 2 1 1 1 3 3 3 1 3 3 3 3 3 2 1 2 1 2 3 2 2
## [1123] 3 3 2 3 3 1 3 1 3 1 2 3 3 1 3 3 2 3 2 3 2 3 3 2 1 3 3 1 3 3 3 3 2 3
## [1157] 3 1 3 3 3 2 2 3 3 1 3 3 1 3 3 3 1 3 3 1 3 1 1 1 3 1 3 1 3 3 3 3 2 3
## [1191] 2 3 2 1 1 1 1 1 3 3 2 3 3 2 1 1 3 3 2 3 3 1 3 3 3 3 3 3 2 3 2 1 3 1
## [1225] 3 3 2 1 3 3 3 2 1 2 3 1 3 1 3 3 3 2 3 1 3 3 2 3 3 3 1 3 3 3 3 3 2 3
## [1259] 2 3 1 3 3 3 3 3 3 1 3 3 3 3 2 1 1 3 2 2 1 3 1 3 1 2 3 3 3 2 3 3 3 3
## [1293] 1 3 1 2 2 3 3 3 3 3 3 3 2 3 2 3 2 1 3 3 1 3 3 2 3 1 1 3 1 1 1 3 1 3
## [1327] 2 3 3 3 2 2 3 2 1 3 1 1 3 3 2 1 1 1 3 3 1 3 3 1 3 3 2 3 3 3 1 3 3 2
## [1361] 3 3 3 3 3 1 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 3 3 2 3 3 2 1 1 1 2 2 2 3
## [1395] 1 3 3 3 1 3 2 1 3 2 3 2 1 2 1 3 3 3 3 2 1 3 2 3 3 3 3 3 3 3 3 3 2 1
## [1429] 3 2 3 3 3 1 3 1 3 1 3 1 3 3 2 1 2 3 3 3 3 3 3 3 1 1 3 2 3 3 2 3 3 1
## [1463] 2 3 2 1 1 1 2 3 3 3 3 3 3 3 3 1 2 3 3 3 3 2 1 3 1 3 3 3 1 3 1 1 3 2
## [1497] 3 2 3 1 1 2 2 2 3 1 3 3 3 2 3 1 2 3 1 3 1 3 3 3 3 3 3 3 3 3 3 3 3 1
## [1531] 3 3 3 3 3 3 3 3 2 3 3 3 2 3 2 1 3 3 3 3 1 2 3 3 1 3 1 3 3 1 3 2 3 1
## [1565] 1 1 3 2 3 1 1 1 3 3 3 3 2 3 3 3 3 2 3 2 3 1 3 3 3 3 3 3 2 3 2 2 1 3
## [1599] 2 1 1 1 1 2 2 1 1 2 2 3 3 2 1 3 3 3 3 1 3 3 3 1 3 3 1 3 2 3 2 2 3 3
## [1633] 3 1 3 3 2 2 2 3 1 3 1 1 3 2 3 2 3 3 2 3 1 3 2 2 3 3 3 2 3 1 3 3 3 3
## [1667] 3 3 3 1 1 3 3 3 3 1 3 2 1 2 3 3 3 3 3 3 3 1 1 2 3 3 1 3 3 3 3 3 1 3
## [1701] 3 3 3 2 2 1 1 3 2 3 3 3 1 3 1 3 3 3 3 3 3 1 3 3 3 3 3 1 3 3 2 3 1 1
## [1735] 2 3 2 1 1 3 2 3 3 3 3 3 3 3 3 3 1 3 3 3 3 2 3 3 2 1 1 3 2 2 1 3 3 3
## [1769] 3 1 3 2 3 1 2 3 3 3 3 1 2 3 3 3 2 1 2 3 3 3 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 1871.739 1082.432 1959.398
##  (between_SS / total_SS =  31.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
fviz_cluster(dfCluster, data = df2,
  ellipse.type = "convex",
  palette = "jco",
  repel = TRUE)

Since this plot was knitted incorrectly, I will add a clear screenshot of the clusters.

set.seed(20) 
dfCluster2 <- kmeans(df2, 2, nstart = 20)

fviz_cluster(dfCluster2, data = df2,
  ellipse.type = "convex",
  palette = "jco",
  repel = TRUE)

The plot with 2 clusters shows overlap around the grey cluster, so I decided to use the first option and divide the observations into 3 clusters.

Different metrics
dfCluster$centers
##   ConvertedComp WorkWeekHrs  CodeRevHrs     ImpSyn
## 1    -0.3392283  -0.1608265  0.05966583 -1.3846361
## 2     1.5623734   0.6539469  0.01562908  0.4134569
## 3    -0.3342775  -0.1317235 -0.03076366  0.4745783
dfCluster$size
## [1]  447  317 1028
dfCluster$betweenss
## [1] 2250.431
dfCluster$withinss
## [1] 1871.739 1082.432 1959.398
dfCluster$totss
## [1] 7164

Here we can see differences in all metrics between clusters:

1st cluster: 447 observations with negative means for ConvertedComp, WorkWeekHrs, and ImpSyn and a positive mean for CodeRevHrs (the largest of the three clusters).

2nd cluster: 317 observations with all means positive.

3rd cluster: 1028 observations with negative means for ConvertedComp, WorkWeekHrs, and CodeRevHrs and a positive mean for ImpSyn (the largest of the three clusters).
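
For reference, the 31.4% figure reported by print(dfCluster) is just the ratio of the two components above; a quick check:

dfCluster$betweenss / dfCluster$totss  # ~0.314, the share of the total sum of squares captured between clusters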

Hierarchical clustering

To begin with, I built dendrograms for three main linkage methods - complete, average, and Ward's. Since all variables are metric, I used Euclidean distance.

par(mfrow=c(2,3))
plot(hclust(dist(df2, method="euclidean"), method="complete"), cex = 0.7, hang = -1, labels = F)
plot(hclust(dist(df2, method="euclidean"), method="average"), cex = 0.7, hang = -1, labels = F)
plot(hclust(dist(df2, method="euclidean"), method="ward.D"),cex = 0.7, hang = -1, labels = F)

Next - let’s check how many clusters we should use with this method.

wss_table1 <- fviz_nbclust(df2, hcut, method = "wss") 

wss_table1 +
    geom_vline(xintercept = 3, linetype = 2)+
  labs(subtitle = "Elbow method")

The optimal number of clusters is 3.

set.seed(20) 
d <- dist(df2, method = 'euclidean') # calculate a distance matrix on selected variables
hfit_ward <- agnes(d, method = "ward") # Ward's method: merge the pair of clusters that gives the smallest increase in total within-cluster variance
hfit_average <- agnes(d, method = "average") # average linkage: average distance between observations in the two clusters
hfit_complete <- agnes(d, method = "complete") # complete linkage: maximal distance between observations in the two clusters

Now - cutting the trees at the level of 3 clusters.

table(cutree(hfit_ward, 3)) 
## 
##    1    2    3 
## 1046  457  289
table(cutree(hfit_average, 3)) 
## 
##    1    2    3 
## 1787    3    2
table(cutree(hfit_complete, 3)) 
## 
##    1    2    3 
## 1783    7    2

Two of the three methods distributed the observations very unevenly, assigning almost all of them to a single cluster. The exception is the Ward method, which assigned 1046 observations to the first cluster, 457 to the second, and 289 to the third. That is why I continued working with the Ward method.

A few more visualizations of the cut:

hc = hclust(d, method = "ward.D")
plot(hc,labels=FALSE,hang=-1) 
rect.hclust(hc, k=3, border="red")

dend <- as.dendrogram(hfit_ward)
dend_col <- color_branches(dend, k = 3)
plot(dend_col, leaflab = "none")

The following is the silhouette visualization.

ward3 = cutree(hfit_ward, 3)
sil <- silhouette(ward3, d)
fviz_silhouette(sil,palette= "jco",ggtheme = theme_classic())
##   cluster size ave.sil.width
## 1       1 1046          0.24
## 2       2  457          0.18
## 3       3  289          0.16

As seen in the graph above, the average silhouette width is about 0.21, and every cluster contains some observations with negative scores - observations that probably do not sit in the right cluster. The overall value is low, and the graph shows that the clustering is rather inaccurate.
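
For reference, the overall average silhouette width quoted above can be computed directly from the silhouette object; it should come out at roughly 0.21:

mean(sil[, "sil_width"])  # overall average silhouette width
summary(sil)$avg.width    # the same value via the summary method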

Let’s visualize clusters.

h_avg_cut <- hcut(d, k = 3, hc_method = "ward.D")
fviz_cluster(h_avg_cut, na.omit(df2), ellipse.type = "convex")

Here again we see a similar picture - three clusters, one of which has a pair of outliers.

Now let’s compare k-means and hierarchical clustering.

Dunn index

#library(dunn.test)

#set.seed(20)

#dunn_km <- dunn.test::dunn(clusters = dfCluster$cluster, Data = df2)


#dunn_ward <- dunn.test::dunn(clusters = ward3, Data = df2)

#dunn_km
#dunn_ward

(I don't know why this code doesn't work when I knit the document, but if I just run the chunk, dunn_km is equal to 0.001310377 and dunn_ward is equal to 0.006774352. To be able to knit, I commented out all the lines with #.)

The principle of the index is that the higher the value, the better the clustering. Both values are very low, but the Ward method shows a slightly better result.
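
One likely reason the chunk above fails to knit is that dunn() with a clusters/Data signature is exported by the clValid package, not by dunn.test (which only exports dunn.test()). A minimal sketch under that assumption, reproducing the Dunn index for both solutions:

library(clValid)
dunn_km <- clValid::dunn(clusters = dfCluster$cluster, Data = df2)  # Dunn index for the k-means solution
dunn_ward <- clValid::dunn(clusters = ward3, Data = df2)            # Dunn index for the Ward solution
dunn_km
dunn_ward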

Evaluation of clustering

The following step is looking at Ward’s clusters.

df2$cluster = factor(ward3)
df2 %>% group_by(cluster) %>% 
  summarise_at(.vars = vars(ConvertedComp, WorkWeekHrs, CodeRevHrs, ImpSyn), .funs = funs(mean))
## Warning: funs() is soft deprecated as of dplyr 0.8.0
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once per session.
## # A tibble: 3 x 5
##   cluster ConvertedComp WorkWeekHrs CodeRevHrs ImpSyn
##   <fct>           <dbl>       <dbl>      <dbl>  <dbl>
## 1 1              -0.287       0.138      0.190  0.340
## 2 2              -0.344      -0.621     -0.316 -1.06 
## 3 3               1.58        0.483     -0.186  0.437

The table shows that the average values in the groups differ greatly.

Let's check this with boxplots and a statistical test.

ggplot(data = df2, aes(x = "", y = ConvertedComp)) + geom_boxplot() + facet_grid(~cluster)

ggplot(data = df2, aes(x = "", y = WorkWeekHrs)) + geom_boxplot() + facet_grid(~cluster)

ggplot(data = df2, aes(x = "", y = CodeRevHrs)) + geom_boxplot() + facet_grid(~cluster)

ggplot(data = df2, aes(x = "", y = ImpSyn)) + geom_boxplot() + facet_grid(~cluster)

kruskal.test(df2$ConvertedComp ~ df2$cluster)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df2$ConvertedComp by df2$cluster
## Kruskal-Wallis chi-squared = 651.83, df = 2, p-value < 2.2e-16
kruskal.test(df2$WorkWeekHrs ~ df2$cluster)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df2$WorkWeekHrs by df2$cluster
## Kruskal-Wallis chi-squared = 182.46, df = 2, p-value < 2.2e-16
kruskal.test(df2$CodeRevHrs ~ df2$cluster)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df2$CodeRevHrs by df2$cluster
## Kruskal-Wallis chi-squared = 112.47, df = 2, p-value < 2.2e-16
kruskal.test(df2$ImpSyn ~ df2$cluster)
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df2$ImpSyn by df2$cluster
## Kruskal-Wallis chi-squared = 672.33, df = 2, p-value < 2.2e-16

=> Thus, these tests demonstrate that the groups really differ - the distributions of all four variables differ statistically significantly across the clusters.
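
Since the Kruskal-Wallis test only says that at least one group differs, a pairwise post-hoc comparison could show which clusters differ on each variable. A minimal sketch using the dunn.test package loaded at the top; the Bonferroni adjustment is my own choice for illustration:

dunn.test(as.numeric(df2$ConvertedComp), df2$cluster, method = "bonferroni")  # pairwise comparisons between clusters
dunn.test(as.numeric(df2$ImpSyn), df2$cluster, method = "bonferroni")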

Let's check this with a multinomial logistic regression.

library(nnet)
test <- multinom(cluster ~ ., data = df2)
## # weights:  18 (10 variable)
## initial  value 1968.713221 
## iter  10 value 468.825626
## iter  20 value 452.469319
## final  value 452.469077 
## converged
multi = summary(test)
print(multi)
## Call:
## multinom(formula = cluster ~ ., data = df2)
## 
## Coefficients:
##   (Intercept) ConvertedComp WorkWeekHrs CodeRevHrs     ImpSyn
## 2   -3.726683     0.3577529  -2.9643880  -4.066387 -4.7439609
## 3   -4.375754     4.9725397   0.5851368  -2.527132  0.3555903
## 
## Std. Errors:
##   (Intercept) ConvertedComp WorkWeekHrs CodeRevHrs    ImpSyn
## 2   0.2538936     0.2143581   0.2081343  0.3220571 0.2810146
## 3   0.3073622     0.3235609   0.1752788  0.2982952 0.1926338
## 
## Residual Deviance: 904.9382 
## AIC: 924.9382
z <- multi$coefficients/multi$standard.errors  # Wald z statistics
p <- (1 - pnorm(abs(z), 0, 1))*2               # two-tailed p-values
summodel <- rbind(multi$coefficients[1,],multi$standard.errors[1,],z[1,],p[1,])  # first equation: cluster 2 vs. baseline cluster 1
rownames(summodel) <- c("Coefficient","Std. Errors","z stat","p value")
knitr::kable(summodel)
              (Intercept) ConvertedComp WorkWeekHrs  CodeRevHrs      ImpSyn
Coefficient    -3.7266826     0.3577529  -2.9643880  -4.0663869  -4.7439609
Std. Errors     0.2538936     0.2143581   0.2081343   0.3220571   0.2810146
z stat        -14.6781275     1.6689493 -14.2426722 -12.6262907 -16.8815481
p value         0.0000000     0.0951274   0.0000000   0.0000000   0.0000000

Multinomial logistic regression shows that, for the first equation (cluster 2 vs. the baseline cluster 1), all predictors are statistically significant at the 0.1 level, and all predictors except ConvertedComp are statistically significant at the 0.05 level.
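
The table above covers only the first equation; a sketch of the same Wald-test table for the second equation (cluster 3 vs. the baseline cluster 1), reusing the z and p objects already computed:

summodel2 <- rbind(multi$coefficients[2,], multi$standard.errors[2,], z[2,], p[2,])
rownames(summodel2) <- c("Coefficient","Std. Errors","z stat","p value")
knitr::kable(summodel2)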

Clusters

df2 %>% group_by(cluster) %>% 
  summarise_at(.vars = vars(ConvertedComp, WorkWeekHrs, CodeRevHrs, ImpSyn), .funs = funs(mean))
## # A tibble: 3 x 5
##   cluster ConvertedComp WorkWeekHrs CodeRevHrs ImpSyn
##   <fct>           <dbl>       <dbl>      <dbl>  <dbl>
## 1 1              -0.287       0.138      0.190  0.340
## 2 2              -0.344      -0.621     -0.316 -1.06 
## 3 3               1.58        0.483     -0.186  0.437

So, here we have 3 clusters with sizes 1046, 457, and 289:

1st cluster: 1046 observations with positive mean values of WorkWeekHrs, CodeRevHrs, and ImpSyn. These are people who work and rate their competence above average, spend above-average time reviewing code, and have lower-than-average earnings. I think these are people who are passionate about working with R but have not yet found a way to make money from it. I have named this group "enthusiasts".

2nd cluster: 457 observations with all mean values negative. These are people who work significantly less than average, rate their competence low, review code the least, and earn less than the others. I have called this cluster "beginners".

3rd cluster: 289 observations with all mean values positive except CodeRevHrs. These are people who work, earn, and rate their competence higher than everyone else, but spend below-average time on code review. I have named this cluster "professionals".

df2$cluster = str_replace(df2$cluster, "1", "enthusiasts")
df2$cluster = str_replace(df2$cluster, "2", "beginners")
df2$cluster = str_replace(df2$cluster, "3", "professionals")
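
A quick check (my own addition, not part of the original output) that the renaming worked and that the group sizes still match the Ward cut of 1046, 457, and 289:

table(df2$cluster)  # expected: beginners = 457, enthusiasts = 1046, professionals = 289
ggplot(df2, aes(x = cluster)) +
  geom_bar(fill = 'lightblue2', color = 'black') +
  labs(x = "Cluster", y = "Count") +
  theme_bw()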

Conclusion

Thus, the analysis has resulted in 3 clusters of users: enthusiasts, beginners, and professionals (perhaps the labels do not capture the characteristics of the clusters perfectly, but I tried to find the most appropriate descriptions). Although the cluster validity checks showed that the solution is not especially strong, the groups are statistically different and the selected variables are good indicators for classification.

  • I also tried the following methods for validating the clustering within and between the methods themselves, but the code didn't work :( (a possible fix is sketched after the chunk below).
#wards_hc_clusters <- cutree(tree=hc, k=3)
#plot(silhouette(wards_hc_clusters, d), main="")
#library(clValid)
#valid = clValid(df2, clMethods = c("hierarchical","kmeans"), validation = "internal")
#valid = clValid(df2, clMethods = "hierarchical", validation = "internal")
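
A possible fix, sketched under my assumptions about clValid's interface: clValid() expects a numeric matrix (df2 now also contains the character cluster column), requires the nClust argument, and by default limits the data to maxitems = 600 rows, so all three need to be handled:

library(clValid)
df_num <- as.matrix(df2[, c("ConvertedComp", "WorkWeekHrs", "CodeRevHrs", "ImpSyn")])  # numeric columns only
valid <- clValid(df_num, nClust = 2:4,
                 clMethods = c("hierarchical", "kmeans"),
                 validation = "internal",
                 maxitems = nrow(df_num))  # raise the default row limit
summary(valid)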