Cluster Analysis

Introduction

The purposes of this work are to examine possible cluster solutions and describe the differences between the resulting user clusters R. In our study, we will rely on the filtered data from Stackoverflow Developer Survey 2019 - dataforproject2. For this work were taken a four different variables, some of them had recorded in numeric variable. First of all, for this work was taken one numeric variable - YearsCode - How many years have you coded professionally (as a part of your work)? Also, were used three categorical variables, which have several levels of answers:

Which of the following best describes your current employment status? (Employment)

Employed full-time
Employed part-time
Independent contractor, freelancer, or self-employed Not employed, but looking for work
Not employed, and not looking for work
Retired
I prefer not to say

How satisfied are you with your current job? (If you work multiple jobs, answer for the one you spend the most hours on.) (JobSat)

Very dissatisfied
Slightly dissatisfied
Neither satisfied nor dissatisfied
Slightly satisfied
Very satisfied

Where would you prefer to work?(WorkLoc)

Home
Office
Other place, such as a coworking space or cafe

I decided to use these variables, because i suppose, that it seems to me that they can differently relate to the purpose of the work, namely to users who have experience with R. Nowadays, programmer profession is popular. So, it is why i decided to find out how many programmers do not have a job and whether this is due to coding experience. The assumption that programmers with less extensive work experience is more in demand is coming to the fore. Moreover, it is interesting to find out how much they are satisfied with their work and where they organize it, in the office or elsewhere.

An important feature of the data is that it has 14 percent of the missing values.
The average of coding experience is 12.46 years, the max value is 40 years.
Data has a huge number of respondets, which employed full-time

dataforproject2 <- read_csv("dataforproject2.csv")

head(dataforproject2)

## # A tibble: 6 x 86
##      X1 Respondent MainBranch Hobbyist OpenSourcer OpenSource Employment Country
##   <dbl>      <dbl> <chr>      <chr>    <chr>       <chr>      <chr>      <chr>  
## 1     1          6 I am not … Yes      Never       The quali… Employed … Canada 
## 2     2         10 I am a de… Yes      Once a mon… OSS is, o… Employed … India  
## 3     3         12 I am a st… No       Never       OSS is, o… Employed … Canada 
## 4     4         18 I am not … Yes      Less than … The quali… Employed … Russia…
## 5     5         20 I am not … No       Never       OSS is, o… Employed … Lithua…
## 6     6         33 I am a de… No       Less than … OSS is, o… Employed … Argent…
## # … with 78 more variables: Student <chr>, EdLevel <chr>, UndergradMajor <chr>,
## #   EduOther <chr>, OrgSize <chr>, DevType <chr>, YearsCode <chr>,
## #   Age1stCode <chr>, YearsCodePro <chr>, CareerSat <chr>, JobSat <chr>,
## #   MgrIdiot <chr>, MgrMoney <chr>, MgrWant <chr>, JobSeek <chr>,
## #   LastHireDate <chr>, LastInt <chr>, FizzBuzz <chr>, JobFactors <chr>,
## #   ResumeUpdate <chr>, CurrencySymbol <chr>, CurrencyDesc <chr>,
## #   CompTotal <dbl>, CompFreq <chr>, ConvertedComp <dbl>, WorkWeekHrs <dbl>,
## #   WorkPlan <chr>, WorkChallenge <chr>, WorkRemote <chr>, WorkLoc <chr>,
## #   ImpSyn <chr>, CodeRev <chr>, CodeRevHrs <dbl>, UnitTests <chr>,
## #   PurchaseHow <chr>, PurchaseWhat <chr>, LanguageWorkedWith <chr>,
## #   LanguageDesireNextYear <chr>, DatabaseWorkedWith <chr>,
## #   DatabaseDesireNextYear <chr>, PlatformWorkedWith <chr>,
## #   PlatformDesireNextYear <chr>, WebFrameWorkedWith <chr>,
## #   WebFrameDesireNextYear <chr>, MiscTechWorkedWith <chr>,
## #   MiscTechDesireNextYear <chr>, DevEnviron <chr>, OpSys <chr>,
## #   Containers <chr>, BlockchainOrg <chr>, BlockchainIs <chr>,
## #   BetterLife <chr>, ITperson <chr>, OffOn <chr>, SocialMedia <chr>,
## #   Extraversion <chr>, ScreenName <chr>, SOVisit1st <chr>, SOVisitFreq <chr>,
## #   SOVisitTo <chr>, SOFindAnswer <chr>, SOTimeSaved <chr>,
## #   SOHowMuchTime <chr>, SOAccount <chr>, SOPartFreq <chr>, SOJobs <chr>,
## #   EntTeams <chr>, SOComm <chr>, WelcomeChange <chr>, SONewContent <chr>,
## #   Age <dbl>, Gender <chr>, Trans <chr>, Sexuality <chr>, Ethnicity <chr>,
## #   Dependents <chr>, SurveyLength <chr>, SurveyEase <chr>

#Missing Data

missmap(dataforproject2, col=c("blue", "red"), legend = T, main = "Missing values vs observed")

Based on this graph, we observe the percentage of missing information. Thus, our general dataset has 14% of the missing values. Further in the work we will remove the missing values.

surv_new <- na.omit(dataforproject2)

surv_new = surv_new %>% dplyr::select(Employment, YearsCode, JobSat, WorkLoc)
head(surv_new)

## # A tibble: 6 x 4
##   Employment         YearsCode JobSat                WorkLoc
##   <chr>              <chr>     <chr>                 <chr>  
## 1 Employed full-time 8         Slightly dissatisfied Office 
## 2 Employed full-time 5         Very satisfied        Office 
## 3 Employed full-time 5         Slightly dissatisfied Office 
## 4 Employed full-time 18        Slightly satisfied    Home   
## 5 Employed full-time 8         Slightly satisfied    Office 
## 6 Employed full-time 12        Very satisfied        Home

summary(surv_new)

##   Employment         YearsCode            JobSat            WorkLoc         
##  Length:198         Length:198         Length:198         Length:198        
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character

Recoding of variables

surv_new$WorkLoc <- revalue(surv_new$WorkLoc, c("Home"=2))
surv_new$WorkLoc <- revalue(surv_new$WorkLoc, c("Office"=1))
surv_new$WorkLoc <- revalue(surv_new$WorkLoc, c("Other place, such as a coworking space or cafe"=3))
head(surv_new$WorkLoc)

## [1] "1" "1" "1" "2" "1" "2"

surv_new$JobSat <- revalue(surv_new$JobSat, c("Slightly satisfied"=4))
surv_new$JobSat <- revalue(surv_new$JobSat, c("Very satisfied"=5))
surv_new$JobSat <- revalue(surv_new$JobSat, c("Slightly dissatisfied"=2))
surv_new$JobSat <- revalue(surv_new$JobSat, c("Very dissatisfied"=1))
surv_new$JobSat <- revalue(surv_new$JobSat, c("Neither satisfied nor dissatisfied"=3))
head(surv_new$JobSat)

## [1] "2" "5" "2" "4" "4" "5"

surv_new$Employment <- revalue(surv_new$Employment, c("Employed full-time"=2))
surv_new$Employment <- revalue(surv_new$Employment, c("Employed part-time"=1))
surv_new$Employment <- revalue(surv_new$Employment, c("Independent contractor, freelancer, or self-employed"=3))
head(surv_new)

## # A tibble: 6 x 4
##   Employment YearsCode JobSat WorkLoc
##   <chr>      <chr>     <chr>  <chr>  
## 1 2          8         2      1      
## 2 2          5         5      1      
## 3 2          5         2      1      
## 4 2          18        4      2      
## 5 2          8         4      1      
## 6 2          12        5      2

as.numeric(surv_new$Employment)

##   [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 2 2 2 2
##  [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 1 2 2 1 2

surv_new$YearsCode = as.numeric(surv_new$YearsCode)
surv_new$WorkLoc = as.numeric(surv_new$WorkLoc)
surv_new$JobSat = as.numeric(surv_new$JobSat)
surv_new$Employment  = as.numeric(surv_new$Employment)

Vizualization variables

hist(surv_new$YearsCode,col="aquamarine",main="Histogram of YearsCode")

Correlation check

cor(surv_new)

##             Employment   YearsCode      JobSat     WorkLoc
## Employment  1.00000000  0.13970414 -0.10880558 -0.08526286
## YearsCode   0.13970414  1.00000000 -0.07990838  0.09471518
## JobSat     -0.10880558 -0.07990838  1.00000000 -0.19349981
## WorkLoc    -0.08526286  0.09471518 -0.19349981  1.00000000

Data Preparation

Before cluster analysis, we should prepare the variables; for this, the na.omit function has already been used to filter out the missing values. Next, we standardize our dataset.

scaled_surv <- scale(surv_new)

Clusters

set.seed(12)
fviz_nbclust(scaled_surv, kmeans, method = "wss", linecolor = "deepskyblue4")+ theme_classic()

According to this graph optimal number of clusters (k) is 7, because it is where the within-sum-of-squares decreases dramatically.

#Silhouette method + partitioning function pam

set.seed(12)
fviz_nbclust(scaled_surv, pam, method = "silhouette", linecolor = "deepskyblue4") + theme_classic()

This graph shows optimal number of clusters - 10

# Silhouette method + partitioning function kmeans
set.seed(12)
fviz_nbclust(scaled_surv, kmeans, method = "silhouette", linecolor = "deepskyblue4") + theme_classic()

Suggested number of clusters is 7

fviz_nbclust(scaled_surv, kmeans, method = "gap", linecolor = "deepskyblue4")

According to this graph suggested number of clusters is also 7. To sum up all findings, the optimal k is 7 or 10.

K-means clustering

Plot with 7 clusters

set.seed(12)
clusters7<- kmeans(scaled_surv, 7)
clusters7$cluster<-as.numeric(clusters7$cluster)
fviz_cluster(clusters7, data = scaled_surv)

Plot with 10 clusters

set.seed(12)
clusters10<- kmeans(scaled_surv, 10)
clusters10$cluster<-as.numeric(clusters10$cluster)
fviz_cluster(clusters10, data = scaled_surv)

Plot without variable “Emloyment”

clusters2 <- kmeans(scaled_surv, 7)
clusters2$cluster<-as.numeric(clusters2$cluster)
fviz_cluster(clusters2, data = scaled_surv[ ,-1])

It is better cluster plot than others, so i will use ## Visualization of K-means method

fviz_cluster(clusters2, data = scaled_surv, ellipse.type = "convex", palette = "heat", ggtheme = theme_minimal())

clusplot(scaled_surv, clusters2$cluster, main = '2D representation of the Cluster solution',color=TRUE, shade=TRUE, 
         labels=7, lines=0)

This plot shows that we can easily differentiate one cluster from other clusters. Moreover, all others .clusters are mixed. Therefore, we can classify not every observation correctly.

Hierarchical clustering

Distance - Euclidean

It will be rational to use the Euclidean method because the date was standardized and all variables are numerical

dist_mat <- dist(scaled_surv, method = 'euclidean')
hclust_ward <- hclust(dist_mat, method = 'ward.D2',)
plot(hclust_ward, hang = -1, cex = 0.5,)

Here, the dendrogram results from a Euclidean distances’s matrix and average clustering. Cutting the dendrogram at the height of 14.8 means that the averagefor Euclidean distances between any two observations will not exceed 14.8

dend <- as.dendrogram(hclust_ward)
dend_1 <- color_branches(dend, k = 6)
plot(dend_1)

dend_2 <- color_branches(dend, k = 7)
plot(dend_2)

Which method is better?

As a result of applying the “K-means clustering” and “Hierarchical clustering” methods, it was found that both methods are suitable for use. But the K-means method gives us more possibilities to interpret the plots.

Clusters description

describeBy(scaled_surv, clusters2$cluster)

## 
##  Descriptive statistics by group 
## INDICES: 1
##            vars  n  mean   sd median trimmed  mad   min   max range skew
## Employment    1 36  0.22 0.00   0.22    0.22 0.00  0.22  0.22  0.00  NaN
## YearsCode     2 36 -0.46 0.39  -0.54   -0.48 0.45 -1.15  0.43  1.59 0.32
## JobSat        3 36  1.02 0.00   1.02    1.02 0.00  1.02  1.02  0.00  NaN
## WorkLoc       4 36 -0.76 0.00  -0.76   -0.76 0.00 -0.76 -0.76  0.00  NaN
##            kurtosis   se
## Employment      NaN 0.00
## YearsCode     -0.79 0.07
## JobSat          NaN 0.00
## WorkLoc         NaN 0.00
## ------------------------------------------------------------ 
## INDICES: 2
##            vars  n  mean   sd median trimmed  mad   min  max range  skew
## Employment    1 39  0.22 0.00   0.22    0.22 0.00  0.22 0.22  0.00   NaN
## YearsCode     2 39 -0.48 0.45  -0.54   -0.49 0.36 -1.40 0.55  1.95  0.28
## JobSat        3 39  0.15 0.72   0.23    0.21 1.17 -1.34 1.02  2.36 -0.62
## WorkLoc       4 39  1.25 0.72   0.72    1.21 0.00  0.72 2.20  1.48  0.57
##            kurtosis   se
## Employment      NaN 0.00
## YearsCode     -0.29 0.07
## JobSat        -0.38 0.12
## WorkLoc       -1.72 0.12
## ------------------------------------------------------------ 
## INDICES: 3
##            vars n  mean   sd median trimmed  mad   min   max range skew
## Employment    1 9 -4.57 0.00  -4.57   -4.57 0.00 -4.57 -4.57  0.00  NaN
## YearsCode     2 9 -0.64 0.37  -0.79   -0.64 0.36 -1.03 -0.06  0.98 0.41
## JobSat        3 9  0.50 0.39   0.23    0.50 0.00  0.23  1.02  0.79 0.59
## WorkLoc       4 9  0.39 1.44  -0.76    0.39 0.00 -0.76  2.20  2.97 0.38
##            kurtosis   se
## Employment      NaN 0.00
## YearsCode     -1.65 0.12
## JobSat        -1.81 0.13
## WorkLoc       -1.94 0.48
## ------------------------------------------------------------ 
## INDICES: 4
##            vars  n  mean   sd median trimmed  mad   min   max range  skew
## Employment    1 31  0.22 0.00   0.22    0.22 0.00  0.22  0.22  0.00   NaN
## YearsCode     2 31 -0.29 0.68  -0.42   -0.34 0.72 -1.15  1.65  2.80  0.75
## JobSat        3 31 -1.62 0.38  -1.34   -1.59 0.00 -2.13 -1.34  0.79 -0.58
## WorkLoc       4 31 -0.09 0.84  -0.76   -0.17 0.00 -0.76  2.20  2.97  0.71
##            kurtosis   se
## Employment      NaN 0.00
## YearsCode      0.10 0.12
## JobSat        -1.72 0.07
## WorkLoc       -0.65 0.15
## ------------------------------------------------------------ 
## INDICES: 5
##            vars  n  mean   sd median trimmed  mad   min  max range  skew
## Employment    1 26  0.22 0.00   0.22    0.22 0.00  0.22 0.22  0.00   NaN
## YearsCode     2 26  1.66 0.78   1.53    1.61 0.90  0.68 3.24  2.56  0.47
## JobSat        3 26 -0.28 1.13   0.23   -0.23 1.17 -2.13 1.02  3.15 -0.44
## WorkLoc       4 26  0.83 0.40   0.72    0.72 0.00  0.72 2.20  1.48  2.99
##            kurtosis   se
## Employment      NaN 0.00
## YearsCode     -1.06 0.15
## JobSat        -1.35 0.22
## WorkLoc        7.25 0.08
## ------------------------------------------------------------ 
## INDICES: 6
##            vars  n  mean   sd median trimmed  mad   min   max range  skew
## Employment    1 15  0.22 0.00   0.22    0.22 0.00  0.22  0.22  0.00   NaN
## YearsCode     2 15  1.56 0.78   1.41    1.48 0.72  0.80  3.36  2.56  0.91
## JobSat        3 15  0.60 0.59   1.02    0.66 0.00 -0.55  1.02  1.58 -0.87
## WorkLoc       4 15 -0.76 0.00  -0.76   -0.76 0.00 -0.76 -0.76  0.00   NaN
##            kurtosis   se
## Employment      NaN 0.00
## YearsCode     -0.41 0.20
## JobSat        -0.78 0.15
## WorkLoc         NaN 0.00
## ------------------------------------------------------------ 
## INDICES: 7
##            vars  n  mean   sd median trimmed  mad   min   max range  skew
## Employment    1 42  0.22 0.00   0.22    0.22 0.00  0.22  0.22  0.00   NaN
## YearsCode     2 42 -0.40 0.43  -0.36   -0.39 0.45 -1.28  0.43  1.71 -0.06
## JobSat        3 42  0.03 0.35   0.23    0.07 0.00 -0.55  0.23  0.79 -1.04
## WorkLoc       4 42 -0.76 0.00  -0.76   -0.76 0.00 -0.76 -0.76  0.00   NaN
##            kurtosis   se
## Employment      NaN 0.00
## YearsCode     -0.64 0.07
## JobSat        -0.93 0.05
## WorkLoc         NaN 0.00

Cluster 1

General description: this cluster has 36 observations.

Cluster 2

General description: this cluster has 39 observations.

Cluster 3

General description: this cluster the smallest cluster and has only 9 observations.

Cluster 4

General description: this cluster has 31 observations.

Cluster 5

General description: this cluster has 26 observations.

Cluster 6

General description: this cluster is also wery small and has 15 observations.

Cluster 7

General description: this cluster is the biggest one - 42 observations

Validation

I use PCA, because my variables are numeric

surv.pca <- prcomp(scaled_surv, center = TRUE,scale. = TRUE)
summary(surv.pca)

## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.1320 1.0512 0.9497 0.8437
## Proportion of Variance 0.3203 0.2762 0.2255 0.1779
## Cumulative Proportion  0.3203 0.5966 0.8220 1.0000

The table shows that two components (PC1 and PC2) describe 59% of variance, so it is not bad:)

Conclusion

The aim of this work was to find out how many programmers do not have a job and whether this is due to coding experience. The assumption that programmers with less extensive work experience is more in demand is coming to the fore. It was interesting to do cluster anakysis but i suppose that my variables was not pretty good, because of fact that clusters were mixed.