The purposes of this work are to examine possible cluster solutions and describe the differences between the resulting user clusters R. In our study, we will rely on the filtered data from Stackoverflow Developer Survey 2019 - dataforproject2. For this work were taken a four different variables, some of them had recorded in numeric variable. First of all, for this work was taken one numeric variable - YearsCode - How many years have you coded professionally (as a part of your work)? Also, were used three categorical variables, which have several levels of answers:
I decided to use these variables, because i suppose, that it seems to me that they can differently relate to the purpose of the work, namely to users who have experience with R. Nowadays, programmer profession is popular. So, it is why i decided to find out how many programmers do not have a job and whether this is due to coding experience. The assumption that programmers with less extensive work experience is more in demand is coming to the fore. Moreover, it is interesting to find out how much they are satisfied with their work and where they organize it, in the office or elsewhere.
dataforproject2 <- read_csv("dataforproject2.csv")
head(dataforproject2)
## # A tibble: 6 x 86
## X1 Respondent MainBranch Hobbyist OpenSourcer OpenSource Employment Country
## <dbl> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 1 6 I am not … Yes Never The quali… Employed … Canada
## 2 2 10 I am a de… Yes Once a mon… OSS is, o… Employed … India
## 3 3 12 I am a st… No Never OSS is, o… Employed … Canada
## 4 4 18 I am not … Yes Less than … The quali… Employed … Russia…
## 5 5 20 I am not … No Never OSS is, o… Employed … Lithua…
## 6 6 33 I am a de… No Less than … OSS is, o… Employed … Argent…
## # … with 78 more variables: Student <chr>, EdLevel <chr>, UndergradMajor <chr>,
## # EduOther <chr>, OrgSize <chr>, DevType <chr>, YearsCode <chr>,
## # Age1stCode <chr>, YearsCodePro <chr>, CareerSat <chr>, JobSat <chr>,
## # MgrIdiot <chr>, MgrMoney <chr>, MgrWant <chr>, JobSeek <chr>,
## # LastHireDate <chr>, LastInt <chr>, FizzBuzz <chr>, JobFactors <chr>,
## # ResumeUpdate <chr>, CurrencySymbol <chr>, CurrencyDesc <chr>,
## # CompTotal <dbl>, CompFreq <chr>, ConvertedComp <dbl>, WorkWeekHrs <dbl>,
## # WorkPlan <chr>, WorkChallenge <chr>, WorkRemote <chr>, WorkLoc <chr>,
## # ImpSyn <chr>, CodeRev <chr>, CodeRevHrs <dbl>, UnitTests <chr>,
## # PurchaseHow <chr>, PurchaseWhat <chr>, LanguageWorkedWith <chr>,
## # LanguageDesireNextYear <chr>, DatabaseWorkedWith <chr>,
## # DatabaseDesireNextYear <chr>, PlatformWorkedWith <chr>,
## # PlatformDesireNextYear <chr>, WebFrameWorkedWith <chr>,
## # WebFrameDesireNextYear <chr>, MiscTechWorkedWith <chr>,
## # MiscTechDesireNextYear <chr>, DevEnviron <chr>, OpSys <chr>,
## # Containers <chr>, BlockchainOrg <chr>, BlockchainIs <chr>,
## # BetterLife <chr>, ITperson <chr>, OffOn <chr>, SocialMedia <chr>,
## # Extraversion <chr>, ScreenName <chr>, SOVisit1st <chr>, SOVisitFreq <chr>,
## # SOVisitTo <chr>, SOFindAnswer <chr>, SOTimeSaved <chr>,
## # SOHowMuchTime <chr>, SOAccount <chr>, SOPartFreq <chr>, SOJobs <chr>,
## # EntTeams <chr>, SOComm <chr>, WelcomeChange <chr>, SONewContent <chr>,
## # Age <dbl>, Gender <chr>, Trans <chr>, Sexuality <chr>, Ethnicity <chr>,
## # Dependents <chr>, SurveyLength <chr>, SurveyEase <chr>
#Missing Data
missmap(dataforproject2, col=c("blue", "red"), legend = T, main = "Missing values vs observed")
Based on this graph, we observe the percentage of missing information. Thus, our general dataset has 14% of the missing values. Further in the work we will remove the missing values.
surv_new <- na.omit(dataforproject2)
surv_new = surv_new %>% dplyr::select(Employment, YearsCode, JobSat, WorkLoc)
head(surv_new)
## # A tibble: 6 x 4
## Employment YearsCode JobSat WorkLoc
## <chr> <chr> <chr> <chr>
## 1 Employed full-time 8 Slightly dissatisfied Office
## 2 Employed full-time 5 Very satisfied Office
## 3 Employed full-time 5 Slightly dissatisfied Office
## 4 Employed full-time 18 Slightly satisfied Home
## 5 Employed full-time 8 Slightly satisfied Office
## 6 Employed full-time 12 Very satisfied Home
summary(surv_new)
## Employment YearsCode JobSat WorkLoc
## Length:198 Length:198 Length:198 Length:198
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
surv_new$WorkLoc <- revalue(surv_new$WorkLoc, c("Home"=2))
surv_new$WorkLoc <- revalue(surv_new$WorkLoc, c("Office"=1))
surv_new$WorkLoc <- revalue(surv_new$WorkLoc, c("Other place, such as a coworking space or cafe"=3))
head(surv_new$WorkLoc)
## [1] "1" "1" "1" "2" "1" "2"
surv_new$JobSat <- revalue(surv_new$JobSat, c("Slightly satisfied"=4))
surv_new$JobSat <- revalue(surv_new$JobSat, c("Very satisfied"=5))
surv_new$JobSat <- revalue(surv_new$JobSat, c("Slightly dissatisfied"=2))
surv_new$JobSat <- revalue(surv_new$JobSat, c("Very dissatisfied"=1))
surv_new$JobSat <- revalue(surv_new$JobSat, c("Neither satisfied nor dissatisfied"=3))
head(surv_new$JobSat)
## [1] "2" "5" "2" "4" "4" "5"
surv_new$Employment <- revalue(surv_new$Employment, c("Employed full-time"=2))
surv_new$Employment <- revalue(surv_new$Employment, c("Employed part-time"=1))
surv_new$Employment <- revalue(surv_new$Employment, c("Independent contractor, freelancer, or self-employed"=3))
head(surv_new)
## # A tibble: 6 x 4
## Employment YearsCode JobSat WorkLoc
## <chr> <chr> <chr> <chr>
## 1 2 8 2 1
## 2 2 5 5 1
## 3 2 5 2 1
## 4 2 18 4 2
## 5 2 8 4 1
## 6 2 12 5 2
as.numeric(surv_new$Employment)
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 2 2 2 2
## [75] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2
## [112] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [149] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [186] 2 2 2 2 2 2 2 2 1 2 2 1 2
surv_new$YearsCode = as.numeric(surv_new$YearsCode)
surv_new$WorkLoc = as.numeric(surv_new$WorkLoc)
surv_new$JobSat = as.numeric(surv_new$JobSat)
surv_new$Employment = as.numeric(surv_new$Employment)
hist(surv_new$YearsCode,col="aquamarine",main="Histogram of YearsCode")
cor(surv_new)
## Employment YearsCode JobSat WorkLoc
## Employment 1.00000000 0.13970414 -0.10880558 -0.08526286
## YearsCode 0.13970414 1.00000000 -0.07990838 0.09471518
## JobSat -0.10880558 -0.07990838 1.00000000 -0.19349981
## WorkLoc -0.08526286 0.09471518 -0.19349981 1.00000000
Before cluster analysis, we should prepare the variables; for this, the na.omit function has already been used to filter out the missing values. Next, we standardize our dataset.
scaled_surv <- scale(surv_new)
set.seed(12)
fviz_nbclust(scaled_surv, kmeans, method = "wss", linecolor = "deepskyblue4")+ theme_classic()
According to this graph optimal number of clusters (k) is 7, because it is where the within-sum-of-squares decreases dramatically.
#Silhouette method + partitioning function pam
set.seed(12)
fviz_nbclust(scaled_surv, pam, method = "silhouette", linecolor = "deepskyblue4") + theme_classic()
This graph shows optimal number of clusters - 10
# Silhouette method + partitioning function kmeans
set.seed(12)
fviz_nbclust(scaled_surv, kmeans, method = "silhouette", linecolor = "deepskyblue4") + theme_classic()
Suggested number of clusters is 7
fviz_nbclust(scaled_surv, kmeans, method = "gap", linecolor = "deepskyblue4")
According to this graph suggested number of clusters is also 7. To sum up all findings, the optimal k is 7 or 10.
set.seed(12)
clusters7<- kmeans(scaled_surv, 7)
clusters7$cluster<-as.numeric(clusters7$cluster)
fviz_cluster(clusters7, data = scaled_surv)
set.seed(12)
clusters10<- kmeans(scaled_surv, 10)
clusters10$cluster<-as.numeric(clusters10$cluster)
fviz_cluster(clusters10, data = scaled_surv)
clusters2 <- kmeans(scaled_surv, 7)
clusters2$cluster<-as.numeric(clusters2$cluster)
fviz_cluster(clusters2, data = scaled_surv[ ,-1])
It is better cluster plot than others, so i will use ## Visualization of K-means method
fviz_cluster(clusters2, data = scaled_surv, ellipse.type = "convex", palette = "heat", ggtheme = theme_minimal())
clusplot(scaled_surv, clusters2$cluster, main = '2D representation of the Cluster solution',color=TRUE, shade=TRUE,
labels=7, lines=0)
This plot shows that we can easily differentiate one cluster from other clusters. Moreover, all others .clusters are mixed. Therefore, we can classify not every observation correctly.
It will be rational to use the Euclidean method because the date was standardized and all variables are numerical
dist_mat <- dist(scaled_surv, method = 'euclidean')
hclust_ward <- hclust(dist_mat, method = 'ward.D2',)
plot(hclust_ward, hang = -1, cex = 0.5,)
Here, the dendrogram results from a Euclidean distances’s matrix and average clustering. Cutting the dendrogram at the height of 14.8 means that the averagefor Euclidean distances between any two observations will not exceed 14.8
dend <- as.dendrogram(hclust_ward)
dend_1 <- color_branches(dend, k = 6)
plot(dend_1)
dend_2 <- color_branches(dend, k = 7)
plot(dend_2)
As a result of applying the “K-means clustering” and “Hierarchical clustering” methods, it was found that both methods are suitable for use. But the K-means method gives us more possibilities to interpret the plots.
describeBy(scaled_surv, clusters2$cluster)
##
## Descriptive statistics by group
## INDICES: 1
## vars n mean sd median trimmed mad min max range skew
## Employment 1 36 0.22 0.00 0.22 0.22 0.00 0.22 0.22 0.00 NaN
## YearsCode 2 36 -0.46 0.39 -0.54 -0.48 0.45 -1.15 0.43 1.59 0.32
## JobSat 3 36 1.02 0.00 1.02 1.02 0.00 1.02 1.02 0.00 NaN
## WorkLoc 4 36 -0.76 0.00 -0.76 -0.76 0.00 -0.76 -0.76 0.00 NaN
## kurtosis se
## Employment NaN 0.00
## YearsCode -0.79 0.07
## JobSat NaN 0.00
## WorkLoc NaN 0.00
## ------------------------------------------------------------
## INDICES: 2
## vars n mean sd median trimmed mad min max range skew
## Employment 1 39 0.22 0.00 0.22 0.22 0.00 0.22 0.22 0.00 NaN
## YearsCode 2 39 -0.48 0.45 -0.54 -0.49 0.36 -1.40 0.55 1.95 0.28
## JobSat 3 39 0.15 0.72 0.23 0.21 1.17 -1.34 1.02 2.36 -0.62
## WorkLoc 4 39 1.25 0.72 0.72 1.21 0.00 0.72 2.20 1.48 0.57
## kurtosis se
## Employment NaN 0.00
## YearsCode -0.29 0.07
## JobSat -0.38 0.12
## WorkLoc -1.72 0.12
## ------------------------------------------------------------
## INDICES: 3
## vars n mean sd median trimmed mad min max range skew
## Employment 1 9 -4.57 0.00 -4.57 -4.57 0.00 -4.57 -4.57 0.00 NaN
## YearsCode 2 9 -0.64 0.37 -0.79 -0.64 0.36 -1.03 -0.06 0.98 0.41
## JobSat 3 9 0.50 0.39 0.23 0.50 0.00 0.23 1.02 0.79 0.59
## WorkLoc 4 9 0.39 1.44 -0.76 0.39 0.00 -0.76 2.20 2.97 0.38
## kurtosis se
## Employment NaN 0.00
## YearsCode -1.65 0.12
## JobSat -1.81 0.13
## WorkLoc -1.94 0.48
## ------------------------------------------------------------
## INDICES: 4
## vars n mean sd median trimmed mad min max range skew
## Employment 1 31 0.22 0.00 0.22 0.22 0.00 0.22 0.22 0.00 NaN
## YearsCode 2 31 -0.29 0.68 -0.42 -0.34 0.72 -1.15 1.65 2.80 0.75
## JobSat 3 31 -1.62 0.38 -1.34 -1.59 0.00 -2.13 -1.34 0.79 -0.58
## WorkLoc 4 31 -0.09 0.84 -0.76 -0.17 0.00 -0.76 2.20 2.97 0.71
## kurtosis se
## Employment NaN 0.00
## YearsCode 0.10 0.12
## JobSat -1.72 0.07
## WorkLoc -0.65 0.15
## ------------------------------------------------------------
## INDICES: 5
## vars n mean sd median trimmed mad min max range skew
## Employment 1 26 0.22 0.00 0.22 0.22 0.00 0.22 0.22 0.00 NaN
## YearsCode 2 26 1.66 0.78 1.53 1.61 0.90 0.68 3.24 2.56 0.47
## JobSat 3 26 -0.28 1.13 0.23 -0.23 1.17 -2.13 1.02 3.15 -0.44
## WorkLoc 4 26 0.83 0.40 0.72 0.72 0.00 0.72 2.20 1.48 2.99
## kurtosis se
## Employment NaN 0.00
## YearsCode -1.06 0.15
## JobSat -1.35 0.22
## WorkLoc 7.25 0.08
## ------------------------------------------------------------
## INDICES: 6
## vars n mean sd median trimmed mad min max range skew
## Employment 1 15 0.22 0.00 0.22 0.22 0.00 0.22 0.22 0.00 NaN
## YearsCode 2 15 1.56 0.78 1.41 1.48 0.72 0.80 3.36 2.56 0.91
## JobSat 3 15 0.60 0.59 1.02 0.66 0.00 -0.55 1.02 1.58 -0.87
## WorkLoc 4 15 -0.76 0.00 -0.76 -0.76 0.00 -0.76 -0.76 0.00 NaN
## kurtosis se
## Employment NaN 0.00
## YearsCode -0.41 0.20
## JobSat -0.78 0.15
## WorkLoc NaN 0.00
## ------------------------------------------------------------
## INDICES: 7
## vars n mean sd median trimmed mad min max range skew
## Employment 1 42 0.22 0.00 0.22 0.22 0.00 0.22 0.22 0.00 NaN
## YearsCode 2 42 -0.40 0.43 -0.36 -0.39 0.45 -1.28 0.43 1.71 -0.06
## JobSat 3 42 0.03 0.35 0.23 0.07 0.00 -0.55 0.23 0.79 -1.04
## WorkLoc 4 42 -0.76 0.00 -0.76 -0.76 0.00 -0.76 -0.76 0.00 NaN
## kurtosis se
## Employment NaN 0.00
## YearsCode -0.64 0.07
## JobSat -0.93 0.05
## WorkLoc NaN 0.00
General description: this cluster has 36 observations.
General description: this cluster has 39 observations.
General description: this cluster the smallest cluster and has only 9 observations.
General description: this cluster has 31 observations.
General description: this cluster has 26 observations.
General description: this cluster is also wery small and has 15 observations.
General description: this cluster is the biggest one - 42 observations
I use PCA, because my variables are numeric
surv.pca <- prcomp(scaled_surv, center = TRUE,scale. = TRUE)
summary(surv.pca)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.1320 1.0512 0.9497 0.8437
## Proportion of Variance 0.3203 0.2762 0.2255 0.1779
## Cumulative Proportion 0.3203 0.5966 0.8220 1.0000
The table shows that two components (PC1 and PC2) describe 59% of variance, so it is not bad:)
The aim of this work was to find out how many programmers do not have a job and whether this is due to coding experience. The assumption that programmers with less extensive work experience is more in demand is coming to the fore. It was interesting to do cluster anakysis but i suppose that my variables was not pretty good, because of fact that clusters were mixed.