Variables
Continuous variables: 1) Age 2) YearsCode - Years Since Learning to Code 3) YearsCodePro - Years of Coding Professionally
Categorical variables: 1) WorkLoc - Where Do Developers Want to Work 2) OpenSourcer - Contributing to Open Source
I chose these variables because they produce better-separated clusters than Employment, ImpSyn, EdLevel, CareerSat and WorkRemote.
YearsCode, YearsCodePro and Age are also important because they can influence which language people prefer.
Missing data
#install.packages('Amelia')
#install.packages("mlbench")
library(Amelia)
## Warning: package 'Amelia' was built under R version 3.6.3
## Loading required package: Rcpp
## ##
## ## Amelia II: Multiple Imputation
## ## (Version 1.7.6, built: 2019-11-24)
## ## Copyright (C) 2005-2020 James Honaker, Gary King and Matthew Blackwell
## ## Refer to http://gking.harvard.edu/amelia/ for more information
## ##
library(mlbench)
## Warning: package 'mlbench' was built under R version 3.6.3
missmap(data, col=c("blue", "red"), legend = T, main = "Missing values vs observed")
As we can see, the dataset contains missing values (only 9%), so we can delete the incomplete rows.
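The deletion step itself is not shown; the na.action attribute in the str() output below suggests na.omit() was used. A minimal sketch of checking the share of missing values and dropping incomplete rows, on a made-up toy data frame (the name toy and its values are assumptions, not the survey data):

```r
# Toy data frame standing in for the survey data (values are made up).
toy <- data.frame(
  Age     = c(28, NA, 25, 30, NA),
  WorkLoc = factor(c("Home", "Office", NA, "Office", "Home"))
)

# Share of missing cells overall and per column.
share_missing  <- mean(is.na(toy))
missing_by_col <- colMeans(is.na(toy))

# Listwise deletion: keep only rows without any missing value.
toy_complete <- na.omit(toy)
nrow(toy_complete)
```

Listwise deletion is simple, but it discards every row with at least one NA, so it is worth checking how many rows remain afterwards.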
str(data)
## 'data.frame': 3451 obs. of 6 variables:
## $ LanguageWorkedWith: Factor w/ 2557 levels "Assembly;Bash/Shell/PowerShell;C#;F#;HTML/CSS;Java;JavaScript;PHP;Python;R;SQL;VBA;Other(s):",..: 2492 533 2189 1674 1555 2189 853 1732 2439 2545 ...
## $ Age : num 28 38 25 30 25 34 24 28 40 58 ...
## $ YearsCode : Factor w/ 52 levels "1","10","11",..: 5 49 47 3 48 8 48 2 48 37 ...
## $ YearsCodePro : Factor w/ 46 levels "1","10","11",..: 23 34 23 43 41 3 12 42 42 13 ...
## $ WorkLoc : Factor w/ 3 levels "Home","Office",..: 1 2 2 2 2 2 1 2 1 1 ...
## $ OpenSourcer : Factor w/ 4 levels "Less than once a month but more than once per year",..: 3 3 1 2 3 3 1 3 3 3 ...
## - attr(*, "na.action")= 'omit' Named int 2 3 4 7 8 10 13 17 20 23 ...
## ..- attr(*, "names")= chr "2" "3" "4" "7" ...
data$YearsCode <- as.numeric(data$YearsCode)
data$YearsCodePro <- as.numeric(data$YearsCodePro)
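Change the type of the variables “YearsCodePro” and “YearsCode” to numeric. Note that as.numeric() applied directly to a factor returns the integer level codes rather than the original values, which is why the maxima reported below (52 and 46) equal the numbers of factor levels; as.numeric(as.character(x)) would recover the actual year values.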
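The difference between the two conversions can be seen on a small made-up factor (the name years and its values are illustrative assumptions):

```r
# Toy factor mimicking a numeric column that was read in as a factor.
years <- factor(c("5", "10", "23", "5"))

# Levels are sorted as strings: "10", "23", "5".
levels(years)

# Direct conversion returns the internal level codes, not the values.
codes <- as.numeric(years)

# Converting through character first recovers the original numbers.
values <- as.numeric(as.character(years))
```

Here codes holds positions in the level list, while values holds the numbers that were originally entered.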
Descriptive statistics
library(psych)
summary(data)
## LanguageWorkedWith
## Python;R;SQL : 119
## R : 100
## Bash/Shell/PowerShell;Python;R;SQL : 83
## Python;R : 82
## R;SQL : 55
## Bash/Shell/PowerShell;HTML/CSS;JavaScript;Python;R;SQL: 43
## (Other) :2969
## Age YearsCode YearsCodePro
## Min. : 1.00 Min. : 1.00 Min. : 1.00
## 1st Qu.:25.00 1st Qu.: 7.00 1st Qu.:10.00
## Median :30.00 Median :23.00 Median :23.00
## Mean :31.84 Mean :25.45 Mean :23.14
## 3rd Qu.:36.00 3rd Qu.:45.00 3rd Qu.:40.00
## Max. :99.00 Max. :52.00 Max. :46.00
##
## WorkLoc
## Home :1067
## Office :2033
## Other place, such as a coworking space or cafe: 351
##
##
##
##
## OpenSourcer
## Less than once a month but more than once per year: 900
## Less than once per year : 983
## Never :1107
## Once a month or more often : 461
##
##
##
describe(data)
## vars n mean sd median trimmed mad min
## LanguageWorkedWith* 1 3451 1601.88 728.93 1576 1647.20 968.14 1
## Age 2 3451 31.84 9.09 30 30.62 7.41 1
## YearsCode 3 3451 25.45 18.09 23 25.39 28.17 1
## YearsCodePro 4 3451 23.14 15.55 23 23.21 25.20 1
## WorkLoc* 5 3451 1.79 0.61 2 1.74 0.00 1
## OpenSourcer* 6 3451 2.33 1.00 2 2.28 1.48 1
## max range skew kurtosis se
## LanguageWorkedWith* 2557 2556 -0.32 -0.95 12.41
## Age 99 98 1.53 4.16 0.15
## YearsCode 52 51 0.09 -1.61 0.31
## YearsCodePro 46 45 0.02 -1.53 0.26
## WorkLoc* 3 2 0.14 -0.49 0.01
## OpenSourcer* 4 3 0.10 -1.11 0.02
So, we have 6 variables and 3451 observations. There are 3 continuous variables (Age, YearsCodePro, YearsCode) and 3 categorical ones (LanguageWorkedWith, WorkLoc, OpenSourcer).
As we can see, mean Age is about 32 years (kurtosis = 4.16, skew = 1.53), mean YearsCode is about 25 (kurtosis = -1.61, skew = 0.09) and mean YearsCodePro is about 23 (kurtosis = -1.53, skew = 0.02).
hist(data$Age)
As we can see, the distribution of Age is not normal: it is positively skewed.
hist(data$YearsCode)
The same holds for YearsCode: it is not normal. Its skew is close to zero (0.09), but the negative kurtosis (-1.61) indicates a flat distribution.
hist(data$YearsCodePro)
As for YearsCodePro, the distribution is also not normal.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(data, aes(x = `WorkLoc`)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The most preferred work location is the office (2033 respondents); the least preferred is another place, such as a coworking space or cafe (351).
ggplot(data, aes(x = `OpenSourcer`)) +
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
"Never" is the most common answer about contributing to Open Source (1107 respondents); the least common is "Once a month or more often" (461).
Distance metric
As we have mixed data types, we should use the Gower distance. Because the distributions of our variables are not normal, we also apply a log-ratio transformation: type = list(logratio = 2) applies it to the second column of data[ , -1], which is YearsCode.
library(cluster)
gower_dist <- daisy(data [ , -1],
metric = "gower",
type = list(logratio = 2))
summary(gower_dist)
## 5952975 dissimilarities, summarized :
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.3033 0.4221 0.4149 0.5326 0.9755
## Metric : mixed ; Types = I, I, I, N, N
## Number of objects : 3451
There are 5952975 pairwise dissimilarities (3451 x 3450 / 2), computed from 3 interval and 2 nominal variables; as a full matrix this is 3451 x 3451.
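To make the Gower distance concrete, here is a minimal sketch on a three-row toy data frame (toy, n_pairs and the values are assumptions for illustration). With the default settings of daisy(), a numeric variable contributes the absolute difference divided by its range, a nominal variable contributes 0 or 1, and the contributions are averaged:

```r
library(cluster)

# Toy mixed-type data: one numeric and one nominal variable.
toy <- data.frame(
  Age     = c(20, 30, 40),
  WorkLoc = factor(c("Home", "Office", "Home"))
)
toy_mat <- as.matrix(daisy(toy, metric = "gower"))

# The same distance by hand for rows 1 and 2.
age_part  <- abs(20 - 30) / (40 - 20)    # range-scaled difference: 0.5
loc_part  <- 1                           # Home vs Office: mismatch
manual_12 <- (age_part + loc_part) / 2   # 0.75

# Number of distinct pairs among n observations.
n_pairs <- function(n) n * (n - 1) / 2
n_pairs(3451)  # 5952975
```

The hand calculation should agree with toy_mat[1, 2], and n_pairs(3451) reproduces the count reported by summary(gower_dist).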
The most similar and dissimilar pairs in the data
As a sanity check, look at the most similar and the most dissimilar pairs in the data, using the dissimilarities in matrix form (gower_mat).
The most similar pair of observations:
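# gower_mat is not defined in the code shown; assumed to be the full matrix form
gower_mat <- as.matrix(gower_dist)
data[
  which(gower_mat == min(gower_mat[gower_mat != min(gower_mat)]),
        arr.ind = TRUE)[1, ], ]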
## LanguageWorkedWith Age YearsCode
## 910 Bash/Shell/PowerShell;C++;JavaScript;Python;R 24.0 23
## 759 JavaScript;Python;R;SQL 24.5 23
## YearsCodePro WorkLoc OpenSourcer
## 910 12 Office Less than once per year
## 759 12 Office Less than once per year
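Observations 910 and 759 are the most similar pair that is not identical: they share YearsCode, YearsCodePro, WorkLoc and OpenSourcer and differ by half a year in Age (LanguageWorkedWith is excluded from the distance). Because the expression drops all zero distances, exact duplicates are skipped. This pair makes sense.
The most dissimilar pair of observations: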
data[
which(gower_mat == max(gower_mat[gower_mat != max(gower_mat)]),
arr.ind = TRUE)[1, ], ]
## LanguageWorkedWith
## 4394 Bash/Shell/PowerShell;C;C++;HTML/CSS;Java;JavaScript;PHP;Python;R;SQL
## 4385 Assembly;Bash/Shell/PowerShell;C;C++;C#;Clojure;Dart;Elixir;Erlang;F#;Go;HTML/CSS;Java;JavaScript;Kotlin;Objective-C;PHP;Python;R;Ruby;Rust;Scala;SQL;Swift;TypeScript;VBA;WebAssembly;Other(s):
## Age YearsCode YearsCodePro WorkLoc OpenSourcer
## 4394 98 1 1 Office Never
## 4385 18 52 46 Home Once a month or more often
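Observations 4394 and 4385 differ on every variable (Age 98 vs 18, YearsCode 1 vs 52, YearsCodePro 1 vs 46, Office vs Home, Never vs Once a month or more often), so a large dissimilarity makes sense. Note that the expression excludes the overall maximum, so strictly this is the pair with the second-largest distinct dissimilarity.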
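An alternative way to locate the extreme pairs is to look only at the upper triangle of the matrix, which avoids excluding the true maximum. This is a sketch on a made-up 3 x 3 matrix (m, upper, closest and farthest are illustrative names); unlike the expressions above, it would also report zero-distance duplicates as the closest pair:

```r
# Toy symmetric dissimilarity matrix (values are made up).
m <- matrix(c(0.0, 0.2, 0.9,
              0.2, 0.0, 0.5,
              0.9, 0.5, 0.0), nrow = 3, byrow = TRUE)

# Keep each pair once: mask the diagonal and the lower triangle.
upper <- m
upper[!upper.tri(upper)] <- NA

# Row and column index of the closest and the farthest pair.
closest  <- which(upper == min(upper, na.rm = TRUE), arr.ind = TRUE)[1, ]
farthest <- which(upper == max(upper, na.rm = TRUE), arr.ind = TRUE)[1, ]
```

The resulting index pairs can be used to subset the data frame in the same way as above, for example data[closest, ].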
Choosing the clustering algorithm
We will use partitioning around medoids (PAM) for handling a custom distance matrix.
PAM is an iterative algorithm. A ‘medoid’ is the observation in a cluster that has the lowest average distance to the other members of that cluster; PAM assigns each observation to its closest medoid and swaps medoids for as long as this lowers the total distance. It works well with n < 10,000 observations per group.
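The idea of a medoid can be illustrated for a single group of three observations. This is only a sketch of the definition, not the full PAM algorithm; the matrix d and its values are made up:

```r
# Toy dissimilarity matrix for one cluster of three observations.
d <- matrix(c(0, 1, 4,
              1, 0, 2,
              4, 2, 0), nrow = 3, byrow = TRUE)

# Total distance from each observation to all others: 5, 3, 6.
total_dist <- rowSums(d)

# The medoid is the observation with the smallest total (or average) distance.
medoid_index <- which.min(total_dist)
```

Unlike a centroid, the medoid is always an actual observation, which is why it can be computed from a dissimilarity matrix alone.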
Look at the silhouette width metric. This is an internal validation metric, an aggregated measure of how similar an observation is to its own cluster, compared to its closest neighboring cluster. The metric can range from -1 to 1, where higher values are better.
Calculate the silhouette width for several solutions using PAM:
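The code for this step is not shown, although sil_width and pam_fit are used below. A minimal sketch of how such a loop could look, run here on made-up toy data rather than on gower_dist (toy, ks, sil_width_toy and best_k are illustrative names, and the range of k is an assumption):

```r
library(cluster)

# Toy mixed-type data with two obvious groups (values are simulated).
set.seed(1)
toy <- data.frame(
  Age     = c(rnorm(20, 25, 2), rnorm(20, 50, 2)),
  WorkLoc = factor(rep(c("Home", "Office"), each = 20))
)
toy_dist <- daisy(toy, metric = "gower")

# Average silhouette width for k = 2..6; position 1 stays NA,
# because the silhouette is undefined for a single cluster.
ks <- 2:6
sil_width_toy <- rep(NA_real_, max(ks))
for (k in ks) {
  fit <- pam(toy_dist, diss = TRUE, k = k)
  sil_width_toy[k] <- fit$silinfo$avg.width
}

# Number of clusters with the largest average silhouette width.
best_k <- which.max(sil_width_toy)
```

On the survey data the same loop would run over gower_dist, and the chosen solution would then be refitted once, for example pam(gower_dist, diss = TRUE, k = 10).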
Plot the silhouette width (larger value is better):
plot(1:10, sil_width,
xlab = "Number of clusters", xaxt='n',
ylab = "Silhouette Width",
ylim = c(0,1))
axis(1, at = seq(2, 10, by = 1), las=2)
lines(1:10, sil_width)
Conclusion: select the 10-cluster solution, as it has the highest silhouette width among the solutions tried.
Interpret the solution: by running summary on each cluster
data$cluster <- as.factor(pam_fit$clustering)
describeBy(data, data$cluster)
##
## Descriptive statistics by group
## group: 1
## vars n mean sd median trimmed mad min
## LanguageWorkedWith* 1 337 1676.48 732.74 1668 1731.40 1058.58 27
## Age 2 337 32.96 10.25 30 31.67 7.41 1
## YearsCode 3 337 25.78 17.53 24 25.86 26.69 1
## YearsCodePro 4 337 22.85 15.61 23 22.86 23.72 1
## WorkLoc* 5 337 1.00 0.00 1 1.00 0.00 1
## OpenSourcer* 6 337 3.00 0.00 3 3.00 0.00 3
## cluster* 7 337 1.00 0.00 1 1.00 0.00 1
## max range skew kurtosis se
## LanguageWorkedWith* 2557 2530 -0.41 -0.95 39.91
## Age 84 83 1.34 2.76 0.56
## YearsCode 52 51 0.02 -1.54 0.95
## YearsCodePro 46 45 0.02 -1.49 0.85
## WorkLoc* 1 0 NaN NaN 0.00
## OpenSourcer* 3 0 NaN NaN 0.00
## cluster* 1 0 NaN NaN 0.00
## --------------------------------------------------------
## group: 2
## vars n mean sd median trimmed mad min max
## LanguageWorkedWith* 1 658 1753.08 684.89 1683 1818.50 938.49 2 2557
## Age 2 658 30.43 8.61 28 29.12 5.93 1 98
## YearsCode 3 658 28.14 18.20 34 28.76 20.76 1 51
## YearsCodePro 4 658 23.45 15.70 23 23.55 23.72 1 46
## WorkLoc* 5 658 2.00 0.00 2 2.00 0.00 2 2
## OpenSourcer* 6 658 3.00 0.00 3 3.00 0.00 3 3
## cluster* 7 658 2.00 0.00 2 2.00 0.00 2 2
## range skew kurtosis se
## LanguageWorkedWith* 2555 -0.54 -0.62 26.70
## Age 97 1.98 7.49 0.34
## YearsCode 50 -0.21 -1.59 0.71
## YearsCodePro 45 0.02 -1.50 0.61
## WorkLoc* 0 NaN NaN 0.00
## OpenSourcer* 0 NaN NaN 0.00
## cluster* 0 NaN NaN 0.00
## --------------------------------------------------------
## group: 3
## vars n mean sd median trimmed mad min max
## LanguageWorkedWith* 1 269 1586.95 715.70 1570 1634.31 922.18 1 2555
## Age 2 269 29.55 7.41 28 28.43 4.45 18 66
## YearsCode 3 269 27.80 20.45 34 28.27 22.24 1 52
## YearsCodePro 4 269 37.15 7.20 40 37.93 4.45 23 45
## WorkLoc* 5 269 2.07 0.26 2 2.00 0.00 2 3
## OpenSourcer* 6 269 1.00 0.00 1 1.00 0.00 1 1
## cluster* 7 269 3.00 0.00 3 3.00 0.00 3 3
## range skew kurtosis se
## LanguageWorkedWith* 2554 -0.34 -0.80 43.64
## Age 48 2.34 7.33 0.45
## YearsCode 51 -0.16 -1.84 1.25
## YearsCodePro 22 -1.02 -0.31 0.44
## WorkLoc* 1 3.33 9.14 0.02
## OpenSourcer* 0 NaN NaN 0.00
## cluster* 0 NaN NaN 0.00
## --------------------------------------------------------
## group: 4
## vars n mean sd median trimmed mad min max
## LanguageWorkedWith* 1 611 1605.27 701.09 1576 1646.39 891.04 12 2557
## Age 2 611 31.72 8.45 29 30.44 5.93 20 64
## YearsCode 3 611 24.86 18.67 23 24.63 28.17 1 51
## YearsCodePro 4 611 25.36 15.42 24 25.96 23.72 1 45
## WorkLoc* 5 611 2.03 0.18 2 2.00 0.00 2 3
## OpenSourcer* 6 611 2.00 0.00 2 2.00 0.00 2 2
## cluster* 7 611 4.00 0.00 4 4.00 0.00 4 4
## range skew kurtosis se
## LanguageWorkedWith* 2545 -0.31 -0.89 28.36
## Age 44 1.45 2.02 0.34
## YearsCode 50 0.14 -1.67 0.76
## YearsCodePro 44 -0.23 -1.48 0.62
## WorkLoc* 1 5.24 25.49 0.01
## OpenSourcer* 0 NaN NaN 0.00
## cluster* 0 NaN NaN 0.00
## --------------------------------------------------------
## group: 5
## vars n mean sd median trimmed mad min
## LanguageWorkedWith* 1 307 1432.18 759.00 1476 1451.96 1005.20 8
## Age 2 307 33.12 10.20 31 32.04 8.90 16
## YearsCode 3 307 24.89 17.47 23 24.66 23.72 1
## YearsCodePro 4 307 21.09 14.81 17 20.75 17.79 1
## WorkLoc* 5 307 1.15 0.53 1 1.00 0.00 1
## OpenSourcer* 6 307 1.00 0.00 1 1.00 0.00 1
## cluster* 7 307 5.00 0.00 5 5.00 0.00 5
## max range skew kurtosis se
## LanguageWorkedWith* 2552 2544 -0.07 -1.17 43.32
## Age 99 83 1.50 5.04 0.58
## YearsCode 51 50 0.14 -1.51 1.00
## YearsCodePro 45 44 0.27 -1.38 0.85
## WorkLoc* 3 2 3.21 8.35 0.03
## OpenSourcer* 1 0 NaN NaN 0.00
## cluster* 5 0 NaN NaN 0.00
## --------------------------------------------------------
## group: 6
## vars n mean sd median trimmed mad min max
## LanguageWorkedWith* 1 339 1631.88 720.45 1581 1677.08 993.34 18 2552
## Age 2 339 34.04 10.01 32 32.90 8.90 17 79
## YearsCode 3 339 23.73 17.01 23 23.24 23.72 1 52
## YearsCodePro 4 339 21.13 15.53 17 20.72 20.76 1 46
## WorkLoc* 5 339 1.29 0.71 1 1.12 0.00 1 3
## OpenSourcer* 6 339 2.00 0.00 2 2.00 0.00 2 2
## cluster* 7 339 6.00 0.00 6 6.00 0.00 6 6
## range skew kurtosis se
## LanguageWorkedWith* 2534 -0.37 -0.96 39.13
## Age 62 1.09 1.16 0.54
## YearsCode 51 0.28 -1.42 0.92
## YearsCodePro 45 0.25 -1.49 0.84
## WorkLoc* 2 1.98 1.92 0.04
## OpenSourcer* 0 NaN NaN 0.00
## cluster* 0 NaN NaN 0.00
## --------------------------------------------------------
## group: 7
## vars n mean sd median trimmed mad min max
## LanguageWorkedWith* 1 267 1398.99 754.33 1391 1410.48 886.59 27 2552
## Age 2 267 31.69 8.30 30 30.82 7.41 18 63
## YearsCode 3 267 23.81 17.39 19 23.33 22.24 1 52
## YearsCodePro 4 267 22.46 16.04 23 22.38 25.20 1 45
## WorkLoc* 5 267 2.04 0.21 2 2.00 0.00 2 3
## OpenSourcer* 6 267 4.00 0.00 4 4.00 0.00 4 4
## cluster* 7 267 7.00 0.00 7 7.00 0.00 7 7
## range skew kurtosis se
## LanguageWorkedWith* 2525 0.00 -1.12 46.16
## Age 45 1.08 1.26 0.51
## YearsCode 51 0.31 -1.46 1.06
## YearsCodePro 44 0.10 -1.65 0.98
## WorkLoc* 1 4.37 17.15 0.01
## OpenSourcer* 0 NaN NaN 0.00
## cluster* 0 NaN NaN 0.00
## --------------------------------------------------------
## group: 8
## vars n mean sd median trimmed mad min max
## LanguageWorkedWith* 1 301 1531.11 729.92 1504 1558.78 935.52 24 2552
## Age 2 301 32.92 8.53 32 32.29 8.90 2 59
## YearsCode 3 301 20.82 14.92 16 19.67 13.34 1 50
## YearsCodePro 4 301 8.70 6.46 8 7.98 7.41 1 24
## WorkLoc* 5 301 2.07 0.26 2 2.00 0.00 2 3
## OpenSourcer* 6 301 1.00 0.00 1 1.00 0.00 1 1
## cluster* 7 301 8.00 0.00 8 8.00 0.00 8 8
## range skew kurtosis se
## LanguageWorkedWith* 2528 -0.14 -1.12 42.07
## Age 57 0.45 0.29 0.49
## YearsCode 49 0.62 -0.91 0.86
## YearsCodePro 23 0.63 -0.39 0.37
## WorkLoc* 1 3.26 8.68 0.02
## OpenSourcer* 0 NaN NaN 0.00
## cluster* 0 NaN NaN 0.00
## --------------------------------------------------------
## group: 9
## vars n mean sd median trimmed mad min max
## LanguageWorkedWith* 1 179 1760.12 689.49 1778 1835.21 831.74 13 2557
## Age 2 179 30.37 9.72 27 28.82 5.93 12 99
## YearsCode 3 179 33.13 17.71 45 34.83 5.93 1 52
## YearsCodePro 4 179 25.17 13.57 23 25.66 16.31 1 46
## WorkLoc* 5 179 3.00 0.00 3 3.00 0.00 3 3
## OpenSourcer* 6 179 2.62 0.79 3 2.69 0.00 1 4
## cluster* 7 179 9.00 0.00 9 9.00 0.00 9 9
## range skew kurtosis se
## LanguageWorkedWith* 2544 -0.70 -0.36 51.53
## Age 87 2.74 13.31 0.73
## YearsCode 51 -0.68 -1.20 1.32
## YearsCodePro 45 -0.16 -1.03 1.01
## WorkLoc* 0 NaN NaN 0.00
## OpenSourcer* 3 -0.81 0.00 0.06
## cluster* 0 NaN NaN 0.00
## --------------------------------------------------------
## group: 10
## vars n mean sd median trimmed mad min
## LanguageWorkedWith* 1 183 1418.15 779.65 1388 1437.00 1031.89 28
## Age 2 183 32.32 8.82 31 31.42 7.41 16
## YearsCode 3 183 20.35 17.96 13 19.06 16.31 1
## YearsCodePro 4 183 24.47 16.04 24 24.90 23.72 1
## WorkLoc* 5 183 1.28 0.70 1 1.11 0.00 1
## OpenSourcer* 6 183 4.00 0.00 4 4.00 0.00 4
## cluster* 7 183 10.00 0.00 10 10.00 0.00 10
## max range skew kurtosis se
## LanguageWorkedWith* 2552 2524 -0.02 -1.17 57.63
## Age 71 55 1.26 2.61 0.65
## YearsCode 52 51 0.56 -1.32 1.33
## YearsCodePro 46 45 -0.18 -1.62 1.19
## WorkLoc* 3 2 2.03 2.15 0.05
## OpenSourcer* 4 0 NaN NaN 0.00
## cluster* 10 0 NaN NaN 0.00
data[pam_fit$medoids, ]
## LanguageWorkedWith
## 2653 Bash/Shell/PowerShell;HTML/CSS;JavaScript;Python;R;SQL
## 2429 Bash/Shell/PowerShell;C#;Python;R;SQL
## 4762 Bash/Shell/PowerShell;Python;R
## 112 R
## 919 Python;R;SQL
## 3518 R
## 1802 HTML/CSS;Java;JavaScript;Python;R;SQL;TypeScript
## 499 Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;R;SQL;TypeScript
## 2586 R;SQL;VBA
## 1921 Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;R;SQL;TypeScript
## Age YearsCode YearsCodePro
## 2653 30 23 23
## 2429 29 34 23
## 4762 28 45 40
## 112 27 23 23
## 919 31 23 12
## 3518 30 23 12
## 1802 28 23 23
## 499 34 17 8
## 2586 28 47 23
## 1921 32 7 23
## WorkLoc
## 2653 Home
## 2429 Office
## 4762 Office
## 112 Office
## 919 Home
## 3518 Home
## 1802 Office
## 499 Office
## 2586 Other place, such as a coworking space or cafe
## 1921 Home
## OpenSourcer cluster
## 2653 Never 1
## 2429 Never 2
## 4762 Less than once a month but more than once per year 3
## 112 Less than once per year 4
## 919 Less than once a month but more than once per year 5
## 3518 Less than once per year 6
## 1802 Once a month or more often 7
## 499 Less than once a month but more than once per year 8
## 2586 Never 9
## 1921 Once a month or more often 10
Each cluster can be described by its medoid:
Cluster 1: YearsCode 23 (YearsCodePro 23), working at home, never contributes to Open Source.
Cluster 2: YearsCode 34 (YearsCodePro 23), working at the office, never contributes to Open Source.
Cluster 3: YearsCode 45 (YearsCodePro 40), working at the office, contributes to Open Source less than once a month but more than once per year.
Cluster 4: YearsCode 23 (YearsCodePro 23), working at the office, contributes to Open Source less than once per year.
Cluster 5: YearsCode 23 (YearsCodePro 12), working at home, contributes to Open Source less than once a month but more than once per year.
Cluster 6: YearsCode 23 (YearsCodePro 12), working at home, contributes to Open Source less than once per year.
Cluster 7: YearsCode 23 (YearsCodePro 23), working at the office, contributes to Open Source once a month or more often.
Cluster 8: YearsCode 17 (YearsCodePro 8), working at the office, contributes to Open Source less than once a month but more than once per year.
Cluster 9: YearsCode 47 (YearsCodePro 23), working at another place, such as a coworking space or cafe, never contributes to Open Source.
Cluster 10: YearsCode 7 (YearsCodePro 23), 32 years old, working at home, contributes to Open Source once a month or more often.
Visualizations
To visualize many variables in a lower-dimensional space, use t-distributed stochastic neighbor embedding, or t-SNE. It tries to preserve local structure so as to make clusters visible in 2D or 3D.
#install.packages("Rtsne")
library("Rtsne")
## Warning: package 'Rtsne' was built under R version 3.6.3
library(dplyr)  # provides %>% and mutate(), which are used below
set.seed(42)
tsne_obj <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data <- tsne_obj$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit$clustering),
name = data$LanguageWorkedWith)
ggplot(aes(x = X, y = Y), data = tsne_data) +
geom_point(aes(color = cluster))
As we can see, the clusters are clearly defined.
Mix approaches
heatmap(gower_mat, symm = T,
distfun = function(x) as.dist(x))
There are 10 clusters.
pam_fit2 <- pam(gower_dist, diss = TRUE, k = 10)
pam_results2 <- data %>%
dplyr::select(-LanguageWorkedWith) %>%
mutate(cluster = pam_fit2$clustering) %>%
group_by(cluster) %>%
do(the_summary = summary(.))
data[pam_fit2$medoids, ]
The output is identical to the medoid table shown earlier for pam_fit (the same ten medoids with the same cluster labels).
tsne_obj2 <- Rtsne(gower_dist, is_distance = TRUE)
tsne_data2 <- tsne_obj2$Y %>%
data.frame() %>%
setNames(c("X", "Y")) %>%
mutate(cluster = factor(pam_fit2$clustering),
name = data$LanguageWorkedWith)
ggplot(aes(x = X, y = Y), data = tsne_data2) +
geom_point(aes(color = cluster))
As we can see, there are still 10 clusters: pam_fit2 uses the same distance matrix and k = 10, and it returns the same medoids as pam_fit.
Thus, a 10-cluster solution is satisfactory.
describeBy(data, tsne_data$cluster)
The output is identical to the describeBy() table shown earlier under "Interpret the solution", because tsne_data$cluster holds the same pam_fit cluster labels as data$cluster.
Summary:
To conclude, there are 10 clusters among 3451 programmers. Judging by the medoids:
A programmer from cluster 1 has YearsCode 23 (YearsCodePro 23), works at home, and never contributes to Open Source.
A programmer from cluster 2 has YearsCode 34 (YearsCodePro 23), works at the office, and never contributes to Open Source.
A programmer from cluster 3 has YearsCode 45 (YearsCodePro 40), works at the office, and contributes to Open Source less than once a month but more than once per year.
A programmer from cluster 4 has YearsCode 23 (YearsCodePro 23), works at the office, and contributes to Open Source less than once per year.
A programmer from cluster 5 has YearsCode 23 (YearsCodePro 12), works at home, and contributes to Open Source less than once a month but more than once per year.
A programmer from cluster 6 has YearsCode 23 (YearsCodePro 12), works at home, and contributes to Open Source less than once per year.
A programmer from cluster 7 has YearsCode 23 (YearsCodePro 23), works at the office, and contributes to Open Source once a month or more often.
A programmer from cluster 8 has YearsCode 17 (YearsCodePro 8), works at the office, and contributes to Open Source less than once a month but more than once per year.
A programmer from cluster 9 has the highest YearsCode value among the medoids (47), works at another place, such as a coworking space or cafe, and never contributes to Open Source.
A programmer from cluster 10 has the lowest YearsCode value among the medoids (7), works at home, and contributes to Open Source once a month or more often.