Clusters

Variables

Continuous variables: 1) Age 2) YearsCode - Years Since Learning to Code 3) YearsCodePro - Years Since Learning to Code Professionally

Categorical variables: 1) WorkLoc - Where Do Developers Want to Work 2) OpenSourcer - Contributing to Open Source

I chose these variables because they produce better clusters than Employment, ImpSyn, Edlevel, CareerSat and WorkRemote.

Also, YearsCode, YearsCodePro and Age are important because they can influence which languages people prefer.

Missing data

#install.packages('Amelia')
#install.packages("mlbench")
library(Amelia)

## Warning: package 'Amelia' was built under R version 3.6.3

## Loading required package: Rcpp

## ## 
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2020 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##

library(mlbench)

## Warning: package 'mlbench' was built under R version 3.6.3

missmap(data, col=c("blue", "red"), legend = T, main = "Missing values vs observed")

As the missingness map shows, the dataset contains NA values (only about 9%), so we can delete the incomplete rows.
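The deletion step itself is not shown here. A minimal sketch of how it could look, using a small made-up data frame named `toy` (an assumption, not the survey data): it computes the share of missing cells and drops incomplete rows with `na.omit()`. The `na.action` attribute in the `str()` output suggests `na.omit()` was used, but that is an inference.

```r
# Toy data frame standing in for the survey data (hypothetical values).
toy <- data.frame(
  Age       = c(28, NA, 25, 30),
  YearsCode = c(5, 10, NA, 3),
  WorkLoc   = factor(c("Home", "Office", "Office", NA))
)

# Share of missing cells across the whole data frame.
na_share <- mean(is.na(toy))

# Keep only the rows that have no missing values.
toy_complete <- na.omit(toy)

print(na_share)
print(nrow(toy_complete))
```

The share of incomplete rows can be larger than the share of missing cells, so it is worth comparing `nrow()` before and after the deletion.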

str(data)

## 'data.frame':    3451 obs. of  6 variables:
##  $ LanguageWorkedWith: Factor w/ 2557 levels "Assembly;Bash/Shell/PowerShell;C#;F#;HTML/CSS;Java;JavaScript;PHP;Python;R;SQL;VBA;Other(s):",..: 2492 533 2189 1674 1555 2189 853 1732 2439 2545 ...
##  $ Age               : num  28 38 25 30 25 34 24 28 40 58 ...
##  $ YearsCode         : Factor w/ 52 levels "1","10","11",..: 5 49 47 3 48 8 48 2 48 37 ...
##  $ YearsCodePro      : Factor w/ 46 levels "1","10","11",..: 23 34 23 43 41 3 12 42 42 13 ...
##  $ WorkLoc           : Factor w/ 3 levels "Home","Office",..: 1 2 2 2 2 2 1 2 1 1 ...
##  $ OpenSourcer       : Factor w/ 4 levels "Less than once a month but more than once per year",..: 3 3 1 2 3 3 1 3 3 3 ...
##  - attr(*, "na.action")= 'omit' Named int  2 3 4 7 8 10 13 17 20 23 ...
##   ..- attr(*, "names")= chr  "2" "3" "4" "7" ...

data$YearsCode <- as.numeric(data$YearsCode)
data$YearsCodePro <- as.numeric(data$YearsCodePro)

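Convert the variables YearsCode and YearsCodePro to numeric. Note that as.numeric() applied to a factor returns the integer level codes, not the original labels, so the values summarized below (for example a maximum of 52 for a factor with 52 levels) are level codes rather than years; as.numeric(as.character(x)) would recover the labels as numbers.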
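To illustrate how the two ways of converting a factor differ, here is a small sketch on a made-up factor named `years` (hypothetical values, not taken from the survey).

```r
# A factor whose labels are numbers stored as text (hypothetical values).
years <- factor(c("5", "10", "23", "10"))

# as.numeric() on a factor returns the internal level codes.
# Levels are sorted as text: "10", "23", "5".
codes <- as.numeric(years)

# Converting through character first recovers the labels as numbers.
values <- as.numeric(as.character(years))

print(codes)
print(values)
```

With the second approach, any label that is not a plain number becomes NA with a warning, so such answers would need to be recoded first.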

Descriptive statistics

library(psych)
summary(data)

##                                               LanguageWorkedWith
##  Python;R;SQL                                          : 119    
##  R                                                     : 100    
##  Bash/Shell/PowerShell;Python;R;SQL                    :  83    
##  Python;R                                              :  82    
##  R;SQL                                                 :  55    
##  Bash/Shell/PowerShell;HTML/CSS;JavaScript;Python;R;SQL:  43    
##  (Other)                                               :2969    
##       Age          YearsCode      YearsCodePro  
##  Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:25.00   1st Qu.: 7.00   1st Qu.:10.00  
##  Median :30.00   Median :23.00   Median :23.00  
##  Mean   :31.84   Mean   :25.45   Mean   :23.14  
##  3rd Qu.:36.00   3rd Qu.:45.00   3rd Qu.:40.00  
##  Max.   :99.00   Max.   :52.00   Max.   :46.00  
##                                                 
##                                            WorkLoc    
##  Home                                          :1067  
##  Office                                        :2033  
##  Other place, such as a coworking space or cafe: 351  
##                                                       
##                                                       
##                                                       
##                                                       
##                                              OpenSourcer  
##  Less than once a month but more than once per year: 900  
##  Less than once per year                           : 983  
##  Never                                             :1107  
##  Once a month or more often                        : 461  
##                                                           
##                                                           
##

describe(data)

##                     vars    n    mean     sd median trimmed    mad min
## LanguageWorkedWith*    1 3451 1601.88 728.93   1576 1647.20 968.14   1
## Age                    2 3451   31.84   9.09     30   30.62   7.41   1
## YearsCode              3 3451   25.45  18.09     23   25.39  28.17   1
## YearsCodePro           4 3451   23.14  15.55     23   23.21  25.20   1
## WorkLoc*               5 3451    1.79   0.61      2    1.74   0.00   1
## OpenSourcer*           6 3451    2.33   1.00      2    2.28   1.48   1
##                      max range  skew kurtosis    se
## LanguageWorkedWith* 2557  2556 -0.32    -0.95 12.41
## Age                   99    98  1.53     4.16  0.15
## YearsCode             52    51  0.09    -1.61  0.31
## YearsCodePro          46    45  0.02    -1.53  0.26
## WorkLoc*               3     2  0.14    -0.49  0.01
## OpenSourcer*           4     3  0.10    -1.11  0.02

So, we have 6 variables and 3451 observations. There are 3 continuous variables (Age, YearsCode, YearsCodePro) and 3 categorical ones (LanguageWorkedWith, WorkLoc, OpenSourcer).

As we can see, the mean age is about 32 years (kurtosis = 4.16, skew = 1.53), the mean YearsCode is about 25 (kurtosis = -1.61, skew = 0.09) and the mean YearsCodePro is about 23 (kurtosis = -1.53, skew = 0.02).

hist(data$Age)

As we can see, the distribution of age is not normal and is positively skewed.

hist(data$YearsCode)

The same holds for YearsCode: it is not normal. Its skew is close to zero (0.09), but the negative kurtosis (-1.61) indicates a distribution much flatter than a normal one.

hist(data$YearsCodePro)

As for YearsCodePro, the distribution is also not normal.
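The skew and kurtosis values quoted above can be checked by hand. The sketch below defines simple moment-based versions; the function names `skewness` and `excess_kurtosis` are made up for illustration, and the results may differ slightly from `psych::describe()`, which applies its own sample correction.

```r
# Moment-based skewness: third central moment over variance^1.5.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: fourth central moment over variance^2, minus 3
# (so that a normal distribution scores about 0).
excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

set.seed(1)
right_skewed <- rexp(1000)  # long right tail, like Age
flat         <- runif(1000) # flat, like the YearsCode level codes

print(skewness(right_skewed))   # positive
print(excess_kurtosis(flat))    # negative
```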

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

ggplot(data, aes(x = `WorkLoc`)) +
        geom_bar() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

The most preferred work location is the office, and the least preferred is another place such as a coworking space or cafe.

ggplot(data, aes(x = `OpenSourcer`)) +
        geom_bar() +
        theme(axis.text.x = element_text(angle = 45, hjust = 1))

"Never" is the most popular answer about contributing to Open Source; the least popular is "Once a month or more often".

Distance metric

As we have mixed data types, we should use the Gower distance. We also need to take into account that the distribution of our variables is not normal, so we apply a logarithmic transformation through the logratio type of daisy().

library(cluster)
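# Drop column 1 (LanguageWorkedWith); logratio = 2 log-transforms the
# second remaining column (YearsCode) before distances are computed
gower_dist <- daisy(data[, -1],
                    metric = "gower",
                    type = list(logratio = 2))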

summary(gower_dist)

## 5952975 dissimilarities, summarized :
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3033  0.4221  0.4149  0.5326  0.9755 
## Metric :  mixed ;  Types = I, I, I, N, N 
## Number of objects : 3451

There are 5952975 pairwise dissimilarities (3451 × 3450 / 2), computed from 3 interval and 2 nominal variables; as a full matrix this is 3451 × 3451.
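To make the Gower distance concrete, here is a small sketch on a made-up data frame `toy` (hypothetical values). For a numeric variable the contribution is the absolute difference divided by the variable's range; for a nominal variable it is 0 if the categories match and 1 otherwise; the distance is the average of these contributions. No log transformation is applied in this sketch.

```r
library(cluster)

# Toy data: one numeric and one nominal variable (hypothetical values).
toy <- data.frame(
  Age     = c(20, 30, 40),
  WorkLoc = factor(c("Home", "Office", "Home"))
)

d <- as.matrix(daisy(toy, metric = "gower"))

# Manual Gower distance between rows 1 and 2.
age_part  <- abs(20 - 30) / (40 - 20)  # range-scaled numeric difference
loc_part  <- 1                         # "Home" vs "Office": mismatch
manual_12 <- (age_part + loc_part) / 2

print(d[1, 2])
print(manual_12)

# Number of distinct pairs among 3451 observations.
print(choose(3451, 2))
```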

The most similar and dissimilar pairs in the data

As a sanity check, look at the most similar and the most dissimilar pairs of observations.

The most similar pair of observations:

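# Full symmetric matrix form of the dissimilarities, used for indexing below
gower_mat <- as.matrix(gower_dist)

data[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]), 
        arr.ind = TRUE)[1, ], ]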

##                                LanguageWorkedWith  Age YearsCode
## 910 Bash/Shell/PowerShell;C++;JavaScript;Python;R 24.0        23
## 759                       JavaScript;Python;R;SQL 24.5        23
##     YearsCodePro WorkLoc             OpenSourcer
## 910           12  Office Less than once per year
## 759           12  Office Less than once per year

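In our case rows 910 and 759 are the most similar: they have identical YearsCode, YearsCodePro, WorkLoc and OpenSourcer values and differ by only half a year in age, so this pair makes sense.

The most dissimilar pair of observations: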

data[
  which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
        arr.ind = TRUE)[1, ], ]

##                                                                                                                                                                                    LanguageWorkedWith
## 4394                                                                                                                            Bash/Shell/PowerShell;C;C++;HTML/CSS;Java;JavaScript;PHP;Python;R;SQL
## 4385 Assembly;Bash/Shell/PowerShell;C;C++;C#;Clojure;Dart;Elixir;Erlang;F#;Go;HTML/CSS;Java;JavaScript;Kotlin;Objective-C;PHP;Python;R;Ruby;Rust;Scala;SQL;Swift;TypeScript;VBA;WebAssembly;Other(s):
##      Age YearsCode YearsCodePro WorkLoc                OpenSourcer
## 4394  98         1            1  Office                      Never
## 4385  18        52           46    Home Once a month or more often

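Rows 4394 and 4385 are highly dissimilar: they differ on every variable (age 98 vs 18, YearsCode 1 vs 52, YearsCodePro 1 vs 46, Office vs Home, Never vs Once a month or more often), so this pair also makes sense. Note that the expression above excludes the overall maximum, so strictly speaking it returns the pair with the second-largest distance.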
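An alternative way to locate both pairs is to blank out the diagonal and then take the plain minimum and maximum. This is only a sketch on a made-up 3 × 3 distance matrix `m` (hypothetical values), not the code used for the output above.

```r
# Toy symmetric distance matrix (hypothetical values).
m <- matrix(c(0.0, 0.2, 0.9,
              0.2, 0.0, 0.5,
              0.9, 0.5, 0.0), nrow = 3, byrow = TRUE)

# Ignore the zero self-distances on the diagonal.
off_diag <- m
diag(off_diag) <- NA

# Row/column indices of the closest and the farthest pair.
closest  <- which(off_diag == min(off_diag, na.rm = TRUE), arr.ind = TRUE)[1, ]
farthest <- which(off_diag == max(off_diag, na.rm = TRUE), arr.ind = TRUE)[1, ]

print(closest)
print(farthest)
```

With the diagonal set to NA, the maximum needs no exclusion step, so the farthest pair is the true maximum rather than the second-largest value.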

Choosing the clustering algorithm

We will use partitioning around medoids (PAM) for handling a custom distance matrix.

PAM is an iterative algorithm. The 'medoid' of a cluster is the observation in that cluster with the lowest average distance to all the other members of the cluster. It works well with n < 10,000 observations per group.
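The medoid definition can be sketched directly: within one cluster, pick the observation whose average distance to the others is smallest. The vector `x` below is made up for illustration and treated as a single cluster.

```r
# Five one-dimensional observations forming one cluster (hypothetical values).
x <- c(0, 1, 2, 3, 10)

# Pairwise distances as a full matrix.
d <- as.matrix(dist(x))

# Average distance from each observation to the other four.
avg_dist <- rowSums(d) / (length(x) - 1)

# The medoid is the observation with the smallest average distance.
medoid_index <- which.min(avg_dist)

print(avg_dist)
print(x[medoid_index])
```

Unlike a mean, the medoid is always an actual observation, which is why PAM can work from a distance matrix alone.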

Look at the silhouette width metric. This is an internal validation metric, an aggregated measure of how similar an observation is to its own cluster, compared to its closest neighboring cluster. The metric can range from -1 to 1, where higher values are better.
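For a single observation, the silhouette width is (b - a) / max(a, b), where a is its average distance to the other members of its own cluster and b is its average distance to the members of the nearest other cluster. A small sketch on made-up points compares a manual calculation with `cluster::silhouette()`.

```r
library(cluster)

# Four points in two obvious groups (hypothetical values).
x        <- c(0, 1, 10, 11)
clusters <- c(1L, 1L, 2L, 2L)

sil <- silhouette(clusters, dist(x))

# Manual silhouette width for the first point (x = 0).
a <- 1                 # distance to the other member of its cluster
b <- mean(c(10, 11))   # average distance to the neighboring cluster
manual <- (b - a) / max(a, b)

print(sil[1, "sil_width"])
print(manual)

# Average silhouette width over all points.
print(mean(sil[, "sil_width"]))
```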

Calculate silhouette width for several solutions using PAM:

Plot the silhouette width (larger value is better):
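The chunk that fills `sil_width` is not shown here. A minimal sketch of such a loop, run on made-up data so that it is self-contained: `toy`, `toy_dist` and `fit` are hypothetical names. In this report the distance object would be `gower_dist`, and since `pam_fit` is used afterwards for the 10-cluster solution, the fitted object was presumably stored under that name.

```r
library(cluster)

# Toy mixed-type data with two clear groups (hypothetical values).
set.seed(42)
toy <- data.frame(
  Age     = c(rnorm(20, mean = 25, sd = 2), rnorm(20, mean = 50, sd = 2)),
  WorkLoc = factor(rep(c("Home", "Office"), each = 20))
)
toy_dist <- daisy(toy, metric = "gower")

# Silhouette width is undefined for one cluster, so position 1 stays NA.
sil_width <- c(NA)
for (k in 2:10) {
  fit <- pam(toy_dist, diss = TRUE, k = k)
  sil_width[k] <- fit$silinfo$avg.width
}

# Number of clusters with the largest average silhouette width.
best_k <- which.max(sil_width)
print(sil_width)
print(best_k)
```

Keeping an NA in the first position makes the vector line up with `1:10` on the x-axis of the plot.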

plot(1:10, sil_width,
     xlab = "Number of clusters", xaxt='n',
     ylab = "Silhouette Width",
     ylim = c(0,1))
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)

Conclusion: select the 10-cluster solution, as it has the highest silhouette width.

Interpret the solution by running a summary on each cluster:

data$cluster <- as.factor(pam_fit$clustering)
describeBy(data, data$cluster)

## 
##  Descriptive statistics by group 
## group: 1
##                     vars   n    mean     sd median trimmed     mad min
## LanguageWorkedWith*    1 337 1676.48 732.74   1668 1731.40 1058.58  27
## Age                    2 337   32.96  10.25     30   31.67    7.41   1
## YearsCode              3 337   25.78  17.53     24   25.86   26.69   1
## YearsCodePro           4 337   22.85  15.61     23   22.86   23.72   1
## WorkLoc*               5 337    1.00   0.00      1    1.00    0.00   1
## OpenSourcer*           6 337    3.00   0.00      3    3.00    0.00   3
## cluster*               7 337    1.00   0.00      1    1.00    0.00   1
##                      max range  skew kurtosis    se
## LanguageWorkedWith* 2557  2530 -0.41    -0.95 39.91
## Age                   84    83  1.34     2.76  0.56
## YearsCode             52    51  0.02    -1.54  0.95
## YearsCodePro          46    45  0.02    -1.49  0.85
## WorkLoc*               1     0   NaN      NaN  0.00
## OpenSourcer*           3     0   NaN      NaN  0.00
## cluster*               1     0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 2
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 658 1753.08 684.89   1683 1818.50 938.49   2 2557
## Age                    2 658   30.43   8.61     28   29.12   5.93   1   98
## YearsCode              3 658   28.14  18.20     34   28.76  20.76   1   51
## YearsCodePro           4 658   23.45  15.70     23   23.55  23.72   1   46
## WorkLoc*               5 658    2.00   0.00      2    2.00   0.00   2    2
## OpenSourcer*           6 658    3.00   0.00      3    3.00   0.00   3    3
## cluster*               7 658    2.00   0.00      2    2.00   0.00   2    2
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2555 -0.54    -0.62 26.70
## Age                    97  1.98     7.49  0.34
## YearsCode              50 -0.21    -1.59  0.71
## YearsCodePro           45  0.02    -1.50  0.61
## WorkLoc*                0   NaN      NaN  0.00
## OpenSourcer*            0   NaN      NaN  0.00
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 3
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 269 1586.95 715.70   1570 1634.31 922.18   1 2555
## Age                    2 269   29.55   7.41     28   28.43   4.45  18   66
## YearsCode              3 269   27.80  20.45     34   28.27  22.24   1   52
## YearsCodePro           4 269   37.15   7.20     40   37.93   4.45  23   45
## WorkLoc*               5 269    2.07   0.26      2    2.00   0.00   2    3
## OpenSourcer*           6 269    1.00   0.00      1    1.00   0.00   1    1
## cluster*               7 269    3.00   0.00      3    3.00   0.00   3    3
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2554 -0.34    -0.80 43.64
## Age                    48  2.34     7.33  0.45
## YearsCode              51 -0.16    -1.84  1.25
## YearsCodePro           22 -1.02    -0.31  0.44
## WorkLoc*                1  3.33     9.14  0.02
## OpenSourcer*            0   NaN      NaN  0.00
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 4
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 611 1605.27 701.09   1576 1646.39 891.04  12 2557
## Age                    2 611   31.72   8.45     29   30.44   5.93  20   64
## YearsCode              3 611   24.86  18.67     23   24.63  28.17   1   51
## YearsCodePro           4 611   25.36  15.42     24   25.96  23.72   1   45
## WorkLoc*               5 611    2.03   0.18      2    2.00   0.00   2    3
## OpenSourcer*           6 611    2.00   0.00      2    2.00   0.00   2    2
## cluster*               7 611    4.00   0.00      4    4.00   0.00   4    4
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2545 -0.31    -0.89 28.36
## Age                    44  1.45     2.02  0.34
## YearsCode              50  0.14    -1.67  0.76
## YearsCodePro           44 -0.23    -1.48  0.62
## WorkLoc*                1  5.24    25.49  0.01
## OpenSourcer*            0   NaN      NaN  0.00
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 5
##                     vars   n    mean     sd median trimmed     mad min
## LanguageWorkedWith*    1 307 1432.18 759.00   1476 1451.96 1005.20   8
## Age                    2 307   33.12  10.20     31   32.04    8.90  16
## YearsCode              3 307   24.89  17.47     23   24.66   23.72   1
## YearsCodePro           4 307   21.09  14.81     17   20.75   17.79   1
## WorkLoc*               5 307    1.15   0.53      1    1.00    0.00   1
## OpenSourcer*           6 307    1.00   0.00      1    1.00    0.00   1
## cluster*               7 307    5.00   0.00      5    5.00    0.00   5
##                      max range  skew kurtosis    se
## LanguageWorkedWith* 2552  2544 -0.07    -1.17 43.32
## Age                   99    83  1.50     5.04  0.58
## YearsCode             51    50  0.14    -1.51  1.00
## YearsCodePro          45    44  0.27    -1.38  0.85
## WorkLoc*               3     2  3.21     8.35  0.03
## OpenSourcer*           1     0   NaN      NaN  0.00
## cluster*               5     0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 6
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 339 1631.88 720.45   1581 1677.08 993.34  18 2552
## Age                    2 339   34.04  10.01     32   32.90   8.90  17   79
## YearsCode              3 339   23.73  17.01     23   23.24  23.72   1   52
## YearsCodePro           4 339   21.13  15.53     17   20.72  20.76   1   46
## WorkLoc*               5 339    1.29   0.71      1    1.12   0.00   1    3
## OpenSourcer*           6 339    2.00   0.00      2    2.00   0.00   2    2
## cluster*               7 339    6.00   0.00      6    6.00   0.00   6    6
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2534 -0.37    -0.96 39.13
## Age                    62  1.09     1.16  0.54
## YearsCode              51  0.28    -1.42  0.92
## YearsCodePro           45  0.25    -1.49  0.84
## WorkLoc*                2  1.98     1.92  0.04
## OpenSourcer*            0   NaN      NaN  0.00
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 7
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 267 1398.99 754.33   1391 1410.48 886.59  27 2552
## Age                    2 267   31.69   8.30     30   30.82   7.41  18   63
## YearsCode              3 267   23.81  17.39     19   23.33  22.24   1   52
## YearsCodePro           4 267   22.46  16.04     23   22.38  25.20   1   45
## WorkLoc*               5 267    2.04   0.21      2    2.00   0.00   2    3
## OpenSourcer*           6 267    4.00   0.00      4    4.00   0.00   4    4
## cluster*               7 267    7.00   0.00      7    7.00   0.00   7    7
##                     range skew kurtosis    se
## LanguageWorkedWith*  2525 0.00    -1.12 46.16
## Age                    45 1.08     1.26  0.51
## YearsCode              51 0.31    -1.46  1.06
## YearsCodePro           44 0.10    -1.65  0.98
## WorkLoc*                1 4.37    17.15  0.01
## OpenSourcer*            0  NaN      NaN  0.00
## cluster*                0  NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 8
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 301 1531.11 729.92   1504 1558.78 935.52  24 2552
## Age                    2 301   32.92   8.53     32   32.29   8.90   2   59
## YearsCode              3 301   20.82  14.92     16   19.67  13.34   1   50
## YearsCodePro           4 301    8.70   6.46      8    7.98   7.41   1   24
## WorkLoc*               5 301    2.07   0.26      2    2.00   0.00   2    3
## OpenSourcer*           6 301    1.00   0.00      1    1.00   0.00   1    1
## cluster*               7 301    8.00   0.00      8    8.00   0.00   8    8
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2528 -0.14    -1.12 42.07
## Age                    57  0.45     0.29  0.49
## YearsCode              49  0.62    -0.91  0.86
## YearsCodePro           23  0.63    -0.39  0.37
## WorkLoc*                1  3.26     8.68  0.02
## OpenSourcer*            0   NaN      NaN  0.00
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 9
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 179 1760.12 689.49   1778 1835.21 831.74  13 2557
## Age                    2 179   30.37   9.72     27   28.82   5.93  12   99
## YearsCode              3 179   33.13  17.71     45   34.83   5.93   1   52
## YearsCodePro           4 179   25.17  13.57     23   25.66  16.31   1   46
## WorkLoc*               5 179    3.00   0.00      3    3.00   0.00   3    3
## OpenSourcer*           6 179    2.62   0.79      3    2.69   0.00   1    4
## cluster*               7 179    9.00   0.00      9    9.00   0.00   9    9
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2544 -0.70    -0.36 51.53
## Age                    87  2.74    13.31  0.73
## YearsCode              51 -0.68    -1.20  1.32
## YearsCodePro           45 -0.16    -1.03  1.01
## WorkLoc*                0   NaN      NaN  0.00
## OpenSourcer*            3 -0.81     0.00  0.06
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 10
##                     vars   n    mean     sd median trimmed     mad min
## LanguageWorkedWith*    1 183 1418.15 779.65   1388 1437.00 1031.89  28
## Age                    2 183   32.32   8.82     31   31.42    7.41  16
## YearsCode              3 183   20.35  17.96     13   19.06   16.31   1
## YearsCodePro           4 183   24.47  16.04     24   24.90   23.72   1
## WorkLoc*               5 183    1.28   0.70      1    1.11    0.00   1
## OpenSourcer*           6 183    4.00   0.00      4    4.00    0.00   4
## cluster*               7 183   10.00   0.00     10   10.00    0.00  10
##                      max range  skew kurtosis    se
## LanguageWorkedWith* 2552  2524 -0.02    -1.17 57.63
## Age                   71    55  1.26     2.61  0.65
## YearsCode             52    51  0.56    -1.32  1.33
## YearsCodePro          46    45 -0.18    -1.62  1.19
## WorkLoc*               3     2  2.03     2.15  0.05
## OpenSourcer*           4     0   NaN      NaN  0.00
## cluster*              10     0   NaN      NaN  0.00

data[pam_fit$medoids, ]

##                                                          LanguageWorkedWith
## 2653                 Bash/Shell/PowerShell;HTML/CSS;JavaScript;Python;R;SQL
## 2429                                  Bash/Shell/PowerShell;C#;Python;R;SQL
## 4762                                         Bash/Shell/PowerShell;Python;R
## 112                                                                       R
## 919                                                            Python;R;SQL
## 3518                                                                      R
## 1802                       HTML/CSS;Java;JavaScript;Python;R;SQL;TypeScript
## 499  Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;R;SQL;TypeScript
## 2586                                                              R;SQL;VBA
## 1921 Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;R;SQL;TypeScript
##      Age YearsCode YearsCodePro
## 2653  30        23           23
## 2429  29        34           23
## 4762  28        45           40
## 112   27        23           23
## 919   31        23           12
## 3518  30        23           12
## 1802  28        23           23
## 499   34        17            8
## 2586  28        47           23
## 1921  32         7           23
##                                             WorkLoc
## 2653                                           Home
## 2429                                         Office
## 4762                                         Office
## 112                                          Office
## 919                                            Home
## 3518                                           Home
## 1802                                         Office
## 499                                          Office
## 2586 Other place, such as a coworking space or cafe
## 1921                                           Home
##                                             OpenSourcer cluster
## 2653                                              Never       1
## 2429                                              Never       2
## 4762 Less than once a month but more than once per year       3
## 112                             Less than once per year       4
## 919  Less than once a month but more than once per year       5
## 3518                            Less than once per year       6
## 1802                         Once a month or more often       7
## 499  Less than once a month but more than once per year       8
## 2586                                              Never       9
## 1921                         Once a month or more often      10

Cluster 1 has been coding for 23 years (23 professionally), works at home, and never contributes to Open Source.

Cluster 2 has been coding for 34 years (23 professionally), works at the office, and never contributes to Open Source.

Cluster 3 has been coding for 45 years (40 professionally), works at the office, and contributes to Open Source less than once a month but more than once per year.

Cluster 4 has been coding for 23 years (23 professionally), works at the office, and contributes to Open Source less than once per year.

Cluster 5 has been coding for 23 years (12 professionally), works at home, and contributes to Open Source less than once a month but more than once per year.

Cluster 6 has been coding for 23 years (12 professionally), works at home, and contributes to Open Source less than once per year.

Cluster 7 has been coding for 23 years (23 professionally), works at the office, and contributes to Open Source once a month or more often.

Cluster 8 has been coding for 17 years (8 professionally), works at the office, and contributes to Open Source less than once a month but more than once per year.

Cluster 9 has been coding for 47 years (23 professionally), works at another place, such as a coworking space or cafe, and never contributes to Open Source.

Cluster 10 has been coding for 7 years (23 professionally), is 32 years old, works at home, and contributes to Open Source once a month or more often.
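The descriptions above are read off the medoids. A complementary view, sketched here on made-up data, is a compact profile per cluster: the median of a numeric variable and the most frequent level of a categorical one. The names `toy`, `modal_level` and `profile` are hypothetical.

```r
# Toy clustered data (hypothetical values).
toy <- data.frame(
  Age     = c(25, 27, 40, 42, 44),
  WorkLoc = factor(c("Home", "Home", "Office", "Office", "Home")),
  cluster = factor(c(1, 1, 2, 2, 2))
)

# Most frequent level of a factor (the first one in case of ties).
modal_level <- function(f) names(which.max(table(f)))

median_age <- tapply(toy$Age, toy$cluster, median)
modal_loc  <- tapply(toy$WorkLoc, toy$cluster, modal_level)

# One row per cluster.
profile <- data.frame(
  cluster    = names(median_age),
  median_age = as.vector(median_age),
  modal_loc  = as.vector(modal_loc)
)
print(profile)
```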

Visualizations

To visualize many variables in a lower-dimensional space, use t-distributed stochastic neighbor embedding, or t-SNE. It tries to preserve local structure so as to make clusters visible in 2D or 3D.

#install.packages("Rtsne")
library("Rtsne")

## Warning: package 'Rtsne' was built under R version 3.6.3

set.seed(42)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)

library(dplyr)  # provides %>% and mutate()

tsne_data <- tsne_obj$Y %>% 
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit$clustering),
         name = data$LanguageWorkedWith)

ggplot(aes(x = X, y = Y), data = tsne_data) +
  geom_point(aes(color = cluster))

As we can see, the clusters are well separated in the t-SNE plot.

Mix approaches

heatmap(gower_mat, symm = T,
        distfun = function(x) as.dist(x))

There are 10 clusters.

pam_fit2 <- pam(gower_dist, diss = TRUE, k = 10)

pam_results2 <- data %>%
  dplyr::select(-LanguageWorkedWith) %>%
  mutate(cluster = pam_fit2$clustering) %>%
  group_by(cluster) %>%
  do(the_summary = summary(.))

data[pam_fit2$medoids, ]

##                                                          LanguageWorkedWith
## 2653                 Bash/Shell/PowerShell;HTML/CSS;JavaScript;Python;R;SQL
## 2429                                  Bash/Shell/PowerShell;C#;Python;R;SQL
## 4762                                         Bash/Shell/PowerShell;Python;R
## 112                                                                       R
## 919                                                            Python;R;SQL
## 3518                                                                      R
## 1802                       HTML/CSS;Java;JavaScript;Python;R;SQL;TypeScript
## 499  Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;R;SQL;TypeScript
## 2586                                                              R;SQL;VBA
## 1921 Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;R;SQL;TypeScript
##      Age YearsCode YearsCodePro
## 2653  30        23           23
## 2429  29        34           23
## 4762  28        45           40
## 112   27        23           23
## 919   31        23           12
## 3518  30        23           12
## 1802  28        23           23
## 499   34        17            8
## 2586  28        47           23
## 1921  32         7           23
##                                             WorkLoc
## 2653                                           Home
## 2429                                         Office
## 4762                                         Office
## 112                                          Office
## 919                                            Home
## 3518                                           Home
## 1802                                         Office
## 499                                          Office
## 2586 Other place, such as a coworking space or cafe
## 1921                                           Home
##                                             OpenSourcer cluster
## 2653                                              Never       1
## 2429                                              Never       2
## 4762 Less than once a month but more than once per year       3
## 112                             Less than once per year       4
## 919  Less than once a month but more than once per year       5
## 3518                            Less than once per year       6
## 1802                         Once a month or more often       7
## 499  Less than once a month but more than once per year       8
## 2586                                              Never       9
## 1921                         Once a month or more often      10

tsne_obj2 <- Rtsne(gower_dist, is_distance = TRUE)

tsne_data2 <- tsne_obj2$Y %>% 
  data.frame() %>%
  setNames(c("X", "Y")) %>%
  mutate(cluster = factor(pam_fit2$clustering),
         name = data$LanguageWorkedWith)

ggplot(aes(x = X, y = Y), data = tsne_data2) +
  geom_point(aes(color = cluster))

As we can see there are still 10 clusters.

Thus, a 10-cluster solution is satisfactory.

describeBy(data, tsne_data$cluster)

## 
##  Descriptive statistics by group 
## group: 1
##                     vars   n    mean     sd median trimmed     mad min
## LanguageWorkedWith*    1 337 1676.48 732.74   1668 1731.40 1058.58  27
## Age                    2 337   32.96  10.25     30   31.67    7.41   1
## YearsCode              3 337   25.78  17.53     24   25.86   26.69   1
## YearsCodePro           4 337   22.85  15.61     23   22.86   23.72   1
## WorkLoc*               5 337    1.00   0.00      1    1.00    0.00   1
## OpenSourcer*           6 337    3.00   0.00      3    3.00    0.00   3
## cluster*               7 337    1.00   0.00      1    1.00    0.00   1
##                      max range  skew kurtosis    se
## LanguageWorkedWith* 2557  2530 -0.41    -0.95 39.91
## Age                   84    83  1.34     2.76  0.56
## YearsCode             52    51  0.02    -1.54  0.95
## YearsCodePro          46    45  0.02    -1.49  0.85
## WorkLoc*               1     0   NaN      NaN  0.00
## OpenSourcer*           3     0   NaN      NaN  0.00
## cluster*               1     0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 2
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 658 1753.08 684.89   1683 1818.50 938.49   2 2557
## Age                    2 658   30.43   8.61     28   29.12   5.93   1   98
## YearsCode              3 658   28.14  18.20     34   28.76  20.76   1   51
## YearsCodePro           4 658   23.45  15.70     23   23.55  23.72   1   46
## WorkLoc*               5 658    2.00   0.00      2    2.00   0.00   2    2
## OpenSourcer*           6 658    3.00   0.00      3    3.00   0.00   3    3
## cluster*               7 658    2.00   0.00      2    2.00   0.00   2    2
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2555 -0.54    -0.62 26.70
## Age                    97  1.98     7.49  0.34
## YearsCode              50 -0.21    -1.59  0.71
## YearsCodePro           45  0.02    -1.50  0.61
## WorkLoc*                0   NaN      NaN  0.00
## OpenSourcer*            0   NaN      NaN  0.00
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 3
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 269 1586.95 715.70   1570 1634.31 922.18   1 2555
## Age                    2 269   29.55   7.41     28   28.43   4.45  18   66
## YearsCode              3 269   27.80  20.45     34   28.27  22.24   1   52
## YearsCodePro           4 269   37.15   7.20     40   37.93   4.45  23   45
## WorkLoc*               5 269    2.07   0.26      2    2.00   0.00   2    3
## OpenSourcer*           6 269    1.00   0.00      1    1.00   0.00   1    1
## cluster*               7 269    3.00   0.00      3    3.00   0.00   3    3
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2554 -0.34    -0.80 43.64
## Age                    48  2.34     7.33  0.45
## YearsCode              51 -0.16    -1.84  1.25
## YearsCodePro           22 -1.02    -0.31  0.44
## WorkLoc*                1  3.33     9.14  0.02
## OpenSourcer*            0   NaN      NaN  0.00
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 4
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 611 1605.27 701.09   1576 1646.39 891.04  12 2557
## Age                    2 611   31.72   8.45     29   30.44   5.93  20   64
## YearsCode              3 611   24.86  18.67     23   24.63  28.17   1   51
## YearsCodePro           4 611   25.36  15.42     24   25.96  23.72   1   45
## WorkLoc*               5 611    2.03   0.18      2    2.00   0.00   2    3
## OpenSourcer*           6 611    2.00   0.00      2    2.00   0.00   2    2
## cluster*               7 611    4.00   0.00      4    4.00   0.00   4    4
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2545 -0.31    -0.89 28.36
## Age                    44  1.45     2.02  0.34
## YearsCode              50  0.14    -1.67  0.76
## YearsCodePro           44 -0.23    -1.48  0.62
## WorkLoc*                1  5.24    25.49  0.01
## OpenSourcer*            0   NaN      NaN  0.00
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 5
##                     vars   n    mean     sd median trimmed     mad min
## LanguageWorkedWith*    1 307 1432.18 759.00   1476 1451.96 1005.20   8
## Age                    2 307   33.12  10.20     31   32.04    8.90  16
## YearsCode              3 307   24.89  17.47     23   24.66   23.72   1
## YearsCodePro           4 307   21.09  14.81     17   20.75   17.79   1
## WorkLoc*               5 307    1.15   0.53      1    1.00    0.00   1
## OpenSourcer*           6 307    1.00   0.00      1    1.00    0.00   1
## cluster*               7 307    5.00   0.00      5    5.00    0.00   5
##                      max range  skew kurtosis    se
## LanguageWorkedWith* 2552  2544 -0.07    -1.17 43.32
## Age                   99    83  1.50     5.04  0.58
## YearsCode             51    50  0.14    -1.51  1.00
## YearsCodePro          45    44  0.27    -1.38  0.85
## WorkLoc*               3     2  3.21     8.35  0.03
## OpenSourcer*           1     0   NaN      NaN  0.00
## cluster*               5     0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 6
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 339 1631.88 720.45   1581 1677.08 993.34  18 2552
## Age                    2 339   34.04  10.01     32   32.90   8.90  17   79
## YearsCode              3 339   23.73  17.01     23   23.24  23.72   1   52
## YearsCodePro           4 339   21.13  15.53     17   20.72  20.76   1   46
## WorkLoc*               5 339    1.29   0.71      1    1.12   0.00   1    3
## OpenSourcer*           6 339    2.00   0.00      2    2.00   0.00   2    2
## cluster*               7 339    6.00   0.00      6    6.00   0.00   6    6
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2534 -0.37    -0.96 39.13
## Age                    62  1.09     1.16  0.54
## YearsCode              51  0.28    -1.42  0.92
## YearsCodePro           45  0.25    -1.49  0.84
## WorkLoc*                2  1.98     1.92  0.04
## OpenSourcer*            0   NaN      NaN  0.00
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 7
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 267 1398.99 754.33   1391 1410.48 886.59  27 2552
## Age                    2 267   31.69   8.30     30   30.82   7.41  18   63
## YearsCode              3 267   23.81  17.39     19   23.33  22.24   1   52
## YearsCodePro           4 267   22.46  16.04     23   22.38  25.20   1   45
## WorkLoc*               5 267    2.04   0.21      2    2.00   0.00   2    3
## OpenSourcer*           6 267    4.00   0.00      4    4.00   0.00   4    4
## cluster*               7 267    7.00   0.00      7    7.00   0.00   7    7
##                     range skew kurtosis    se
## LanguageWorkedWith*  2525 0.00    -1.12 46.16
## Age                    45 1.08     1.26  0.51
## YearsCode              51 0.31    -1.46  1.06
## YearsCodePro           44 0.10    -1.65  0.98
## WorkLoc*                1 4.37    17.15  0.01
## OpenSourcer*            0  NaN      NaN  0.00
## cluster*                0  NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 8
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 301 1531.11 729.92   1504 1558.78 935.52  24 2552
## Age                    2 301   32.92   8.53     32   32.29   8.90   2   59
## YearsCode              3 301   20.82  14.92     16   19.67  13.34   1   50
## YearsCodePro           4 301    8.70   6.46      8    7.98   7.41   1   24
## WorkLoc*               5 301    2.07   0.26      2    2.00   0.00   2    3
## OpenSourcer*           6 301    1.00   0.00      1    1.00   0.00   1    1
## cluster*               7 301    8.00   0.00      8    8.00   0.00   8    8
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2528 -0.14    -1.12 42.07
## Age                    57  0.45     0.29  0.49
## YearsCode              49  0.62    -0.91  0.86
## YearsCodePro           23  0.63    -0.39  0.37
## WorkLoc*                1  3.26     8.68  0.02
## OpenSourcer*            0   NaN      NaN  0.00
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 9
##                     vars   n    mean     sd median trimmed    mad min  max
## LanguageWorkedWith*    1 179 1760.12 689.49   1778 1835.21 831.74  13 2557
## Age                    2 179   30.37   9.72     27   28.82   5.93  12   99
## YearsCode              3 179   33.13  17.71     45   34.83   5.93   1   52
## YearsCodePro           4 179   25.17  13.57     23   25.66  16.31   1   46
## WorkLoc*               5 179    3.00   0.00      3    3.00   0.00   3    3
## OpenSourcer*           6 179    2.62   0.79      3    2.69   0.00   1    4
## cluster*               7 179    9.00   0.00      9    9.00   0.00   9    9
##                     range  skew kurtosis    se
## LanguageWorkedWith*  2544 -0.70    -0.36 51.53
## Age                    87  2.74    13.31  0.73
## YearsCode              51 -0.68    -1.20  1.32
## YearsCodePro           45 -0.16    -1.03  1.01
## WorkLoc*                0   NaN      NaN  0.00
## OpenSourcer*            3 -0.81     0.00  0.06
## cluster*                0   NaN      NaN  0.00
## -------------------------------------------------------- 
## group: 10
##                     vars   n    mean     sd median trimmed     mad min
## LanguageWorkedWith*    1 183 1418.15 779.65   1388 1437.00 1031.89  28
## Age                    2 183   32.32   8.82     31   31.42    7.41  16
## YearsCode              3 183   20.35  17.96     13   19.06   16.31   1
## YearsCodePro           4 183   24.47  16.04     24   24.90   23.72   1
## WorkLoc*               5 183    1.28   0.70      1    1.11    0.00   1
## OpenSourcer*           6 183    4.00   0.00      4    4.00    0.00   4
## cluster*               7 183   10.00   0.00     10   10.00    0.00  10
##                      max range  skew kurtosis    se
## LanguageWorkedWith* 2552  2524 -0.02    -1.17 57.63
## Age                   71    55  1.26     2.61  0.65
## YearsCode             52    51  0.56    -1.32  1.33
## YearsCodePro          46    45 -0.18    -1.62  1.19
## WorkLoc*               3     2  2.03     2.15  0.05
## OpenSourcer*           4     0   NaN      NaN  0.00
## cluster*              10     0   NaN      NaN  0.00
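A note on reading the YearsCode and YearsCodePro rows: their maxima (52 and 46) equal the number of factor levels reported by str(data), so these columns may contain factor level codes rather than years. The following is a minimal sketch of that pitfall, assuming the columns were converted by calling as.numeric() directly on a factor. The toy vector `years` and the choice to map "Less than 1 year" to 0.5 are illustrative assumptions, not part of the original analysis.

```r
# Toy factor that mimics YearsCode: levels are sorted as text, not as numbers
years <- factor(c("13", "8", "6", "11", "Less than 1 year"))

# as.numeric() on a factor returns the internal level codes, not the labels
codes <- as.numeric(years)

# Convert through character first to recover the recorded values;
# the text answer is mapped to 0.5 purely as an illustrative choice
labels <- as.character(years)
labels[labels == "Less than 1 year"] <- "0.5"
years_num <- as.numeric(labels)

print(codes)      # level codes, which depend on the text sort order
print(years_num)  # the values as they were recorded
```

If the real columns were converted this way, medians such as 34 or 45 would be positions in the sorted level list and would need to be recomputed from `years_num`-style values before being read as years.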

Summary:

To conclude, there are 10 clusters among 3463 programmers.

A typical programmer in cluster 1 has a YearsCode of 23 (YearsCodePro 23), works at home, and never contributes to Open Source.

A typical programmer in cluster 2 has a median YearsCode of 34 (YearsCodePro 23), works at an office, and never contributes to Open Source.

A typical programmer in cluster 3 has a median YearsCode of 34 (YearsCodePro 40), works at an office, and contributes to Open Source less than once a month but more than once per year.

A typical programmer in cluster 4 has a median YearsCode of 23 (YearsCodePro 24), works at an office, and contributes to Open Source less than once per year.

A typical programmer in cluster 5 has a median YearsCode of 23 (YearsCodePro 17), works at home, and contributes to Open Source less than once a month but more than once per year.

A typical programmer in cluster 6 has a median YearsCode of 23 (YearsCodePro 17), works at home, and contributes to Open Source less than once per year.

A typical programmer in cluster 7 has a median YearsCode of 19 (YearsCodePro 23), works at an office, and contributes to Open Source once a month or more often.

A typical programmer in cluster 8 has a median YearsCode of 16 (YearsCodePro 8), works at an office, and contributes to Open Source less than once a month but more than once per year.

A typical programmer in cluster 9 has the most coding experience (median YearsCode of 45), works at another place, such as a coworking space or cafe, and never contributes to Open Source.

A typical programmer in cluster 10 has the least coding experience (median YearsCode of 13), works at home, and contributes to Open Source once a month or more often.
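The profiles above describe each cluster by one typical value per variable. A minimal sketch of how such a profile table could be built, assuming a data frame with a `cluster` column: the toy data and the helper `mode_level` are hypothetical, and the sketch uses the most frequent category for WorkLoc and OpenSourcer instead of the median of level codes shown in the console output.

```r
# Toy data standing in for the clustered survey responses (values are made up)
toy <- data.frame(
  cluster     = c(1, 1, 1, 2, 2, 2),
  Age         = c(25, 30, 41, 22, 28, 35),
  WorkLoc     = c("Home", "Home", "Office", "Office", "Office", "Home"),
  OpenSourcer = c("Never", "Never", "Never",
                  "Less than once per year", "Never", "Less than once per year"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: most frequent value of a categorical variable
mode_level <- function(x) names(which.max(table(x)))

# One row per cluster: median for numeric columns, mode for categorical ones
profile <- data.frame(
  cluster     = sort(unique(toy$cluster)),
  Age         = as.numeric(tapply(toy$Age, toy$cluster, median)),
  WorkLoc     = as.character(tapply(toy$WorkLoc, toy$cluster, mode_level)),
  OpenSourcer = as.character(tapply(toy$OpenSourcer, toy$cluster, mode_level)),
  stringsAsFactors = FALSE
)
print(profile)
```

Reporting the mode keeps the category label itself in the table, so no lookup from level code to label is needed when writing the summary.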


Lkhasaranova Yumzhana

March 14, 2020