1 Cluster Analysis

1.1 Important Characteristics of Our Data - 0.6 points

library(readr)
library(ggplot2)
library(dplyr)
library(rvest)
library(stringr)
library(tidytext)
library(tidyr)
library(ggwordcloud)
library(psych)
library(knitr)
library(kableExtra)
library(factoextra)
library(cluster)

dataforproject2 <- read_csv("/srv/store/students/vvsuschevskiy/datanal/4year/clusters/dataforproject2.csv")

## Warning: Missing column names filled in: 'X1' [1]

head(dataforproject2[,-1:-2]) %>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

MainBranch	Hobbyist	OpenSourcer	OpenSource	Employment	Country	Student	EdLevel	UndergradMajor	EduOther	OrgSize	DevType	YearsCode	Age1stCode	YearsCodePro	CareerSat	JobSat	MgrIdiot	MgrMoney	MgrWant	JobSeek	LastHireDate	LastInt	FizzBuzz	JobFactors	ResumeUpdate	CurrencySymbol	CurrencyDesc	CompTotal	CompFreq	ConvertedComp	WorkWeekHrs	WorkPlan	WorkChallenge	WorkRemote	WorkLoc	ImpSyn	CodeRev	CodeRevHrs	UnitTests	PurchaseHow	PurchaseWhat	LanguageWorkedWith	LanguageDesireNextYear	DatabaseWorkedWith	DatabaseDesireNextYear	PlatformWorkedWith	PlatformDesireNextYear	WebFrameWorkedWith	WebFrameDesireNextYear	MiscTechWorkedWith	MiscTechDesireNextYear	DevEnviron	OpSys	Containers	BlockchainOrg	BlockchainIs	BetterLife	ITperson	OffOn	SocialMedia	Extraversion	ScreenName	SOVisit1st	SOVisitFreq	SOVisitTo	SOFindAnswer	SOTimeSaved	SOHowMuchTime	SOAccount	SOPartFreq	SOJobs	EntTeams	SOComm	WelcomeChange	SONewContent	Age	Gender	Trans	Sexuality	Ethnicity	Dependents	SurveyLength	SurveyEase
I am not primarily a developer, but I write code sometimes as part of my work	Yes	Never	The quality of OSS and closed source software is about the same	Employed full-time	Canada	No	Bachelor’s degree (BA, BS, B.Eng., etc.)	Mathematics or statistics	Taken an online course in programming or software development (e.g. a MOOC);Received on-the-job training in software development;Taught yourself a new language, framework, or tool without taking a formal course	NA	Data or business analyst;Data scientist or machine learning specialist;Database administrator;Engineer, data	13	15	3	Very satisfied	Slightly satisfied	Very confident	No	Yes	I am not interested in new job opportunities	1-2 years ago	Write any code;Complete a take-home project;Interview with people in senior / management roles	No	Financial performance or funding status of the company or organization;Opportunities for professional development;How widely used or impactful my work output would be	I heard about a job opportunity (from a recruiter, online job posting, etc.)	
CAD	Canadian dollar	40000	Monthly	366420	15	There’s no schedule or spec; I work on what seems most important or urgent	NA	A few days each month	Home	A little above average	No	NA	Yes, it’s not part of our process but the developers do it on their own	Not sure	I have little or no influence	Java;R;SQL	Python;Scala;SQL	MongoDB;PostgreSQL	PostgreSQL	Android;Google Cloud Platform;Linux;Windows	Android;Google Cloud Platform;Linux;Windows	NA	NA	Hadoop	Hadoop;Pandas;TensorFlow;Unity 3D	Android Studio;Eclipse;PyCharm;RStudio;Visual Studio Code	Windows	I do not use containers	Not at all	NA	No	Yes	No	YouTube	In real life (in person)	Login	2011	A few times per month or weekly	Find answers to specific questions	Less than once per week	Stack Overflow was slightly faster	60+ minutes	Yes	I have never participated in Q&A on Stack Overflow	No, I knew that Stack Overflow had a job board but have never used or visited it	No, and I don’t know what those are	No, not really	Just as welcome now as I felt last year	Tech articles written by other developers;Industry news about technologies you’re interested in;Tech meetups or events in your area;Courses on technologies you’re interested in	28	Man	No	Straight / Heterosexual	East Asian	No	Too long	Neither easy nor difficult
I am a developer by profession	Yes	Once a month or more often	OSS is, on average, of HIGHER quality than proprietary / closed source software	Employed full-time	India	No	Master’s degree (MA, MS, M.Eng., MBA, etc.)	NA	NA	10,000 or more employees	Data or business analyst;Data scientist or machine learning specialist;Database administrator;Developer, back-end;Developer, desktop or enterprise applications;Developer, front-end;Developer, full-stack;Developer, game or graphics;Educator	12	20	10	Slightly dissatisfied	Slightly dissatisfied	Somewhat confident	Yes	Yes	I’m not actively looking, but I am open to new opportunities	3-4 years ago	NA	No	Languages, frameworks, and other technologies I’d be working with;Remote work options;Flex time or a flexible schedule	NA	INR	Indian rupee	950000	Yearly	13293	70	There’s no schedule or spec; I work on what seems most important or urgent	NA	A few days each month	Home	Far above average	Yes, because I see value in code review	4.0	Yes, it’s part of our process	NA	NA	C#;Go;JavaScript;Python;R;SQL	C#;Go;JavaScript;Kotlin;Python;R;SQL	Elasticsearch;MongoDB;Microsoft SQL Server;MySQL;SQLite	Elasticsearch;MongoDB;Microsoft SQL Server	Linux;Windows	Android;Linux;Raspberry Pi;Windows	Angular/Angular.js;ASP.NET;Django;Express;Flask;jQuery	Angular/Angular.js;ASP.NET;Django;Express;Flask;jQuery	.NET;Node.js;Pandas;Torch/PyTorch	.NET;Node.js;TensorFlow;Torch/PyTorch	Android Studio;Eclipse;IPython / Jupyter;Notepad++;RStudio;Vim;Visual Studio;Visual Studio Code	Windows	NA	Not at all	Useful for immutable record keeping outside of currency	No	Yes	Yes	YouTube	Neither	Screen Name	NA	Multiple times per day	Find answers to specific questions;Get a sense of belonging to the developer community;Meet other people with similar skills or interests	3-5 times per week	They were about the same	NA	Yes	A few times per month or weekly	Yes	No, and I don’t know what those are	Yes, somewhat	Somewhat less welcome now than last year	Tech articles written by other 
developers;Tech meetups or events in your area	NA	NA	NA	NA	NA	Yes	Too long	Difficult
I am a student who is learning to code	No	Never	OSS is, on average, of HIGHER quality than proprietary / closed source software	Employed part-time	Canada	Yes, full-time	Some college/university study without earning a degree	Mathematics or statistics	Taken an online course in programming or software development (e.g. a MOOC);Taught yourself a new language, framework, or tool without taking a formal course	NA	Data or business analyst;Data scientist or machine learning specialist;Engineer, data;Student	5	16	NA	NA	NA	NA	NA	NA	I am not interested in new job opportunities	Less than a year ago	NA	NA	Financial performance or funding status of the company or organization;Office environment or company culture;Opportunities for professional development	My job status changed (promotion, new job, etc.)	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	NA	Bash/Shell/PowerShell;HTML/CSS;Java;Python;R;SQL	Bash/Shell/PowerShell;C++;Go;Python;R;Scala;SQL	MySQL;PostgreSQL;SQLite	Elasticsearch;MongoDB;MySQL;PostgreSQL	AWS;Docker;Google Cloud Platform;Linux;MacOS;Slack;Windows	AWS;Linux;MacOS;Slack	NA	NA	Ansible;Chef;Hadoop;Pandas;TensorFlow	Ansible;Apache Spark;Chef;Hadoop;Pandas;TensorFlow;Torch/PyTorch	IPython / Jupyter;PyCharm;RStudio;Sublime Text;Vim	MacOS	Testing;Production	NA	NA	Yes	Yes	Yes	Reddit	In real life (in person)	Username	2014	A few times per month or weekly	Find answers to specific questions;Learn how to do things I didn’t necessarily look for	1-2 times per week	The other resource was slightly faster	11-30 minutes	Not sure / can’t remember	NA	No, I knew that Stack Overflow had a job board but have never used or visited it	Yes	Yes, somewhat	Just as welcome now as I felt last year	Courses on technologies you’re interested in	21	Woman	No	Straight / Heterosexual	Black or of African descent	No	Appropriate in length	Easy
I am not primarily a developer, but I write code sometimes as part of my work	Yes	Less than once a month but more than once per year	The quality of OSS and closed source software is about the same	Employed full-time	Russian Federation	No	Master’s degree (MA, MS, M.Eng., MBA, etc.)	Computer science, computer engineering, or software engineering	Taken an online course in programming or software development (e.g. a MOOC);Taught yourself a new language, framework, or tool without taking a formal course;Contributed to open source software	1,000 to 4,999 employees	Data or business analyst	10	18	3	Slightly satisfied	Very satisfied	Very confident	Yes	Yes	I’m not actively looking, but I am open to new opportunities	3-4 years ago	Complete a take-home project;Interview with people in peer roles;Interview with people in senior / management roles	No	Financial performance or funding status of the company or organization;Opportunities for professional development;How widely used or impactful my work output would be	My job status changed (promotion, new job, etc.)	
RUB	Russian ruble	120000	Monthly	21996	40	There’s no schedule or spec; I work on what seems most important or urgent	Distracting work environment;Non-work commitments (parenting, school work, hobbies, etc.);Not enough people for the workload	Less than once per month / Never	Office	Average	Yes, because I see value in code review	0.5	Yes, it’s not part of our process but the developers do it on their own	Not sure	I have some influence	Python;R	Python;R	MongoDB	MongoDB	NA	NA	NA	NA	NA	NA	PyCharm;RStudio	Linux-based	Production	NA	A passing fad	Yes	SIGH	Yes	VK ВКонта́кте	In real life (in person)	Login	I don’t remember	Multiple times per day	Find answers to specific questions	More than 10 times per week	Stack Overflow was slightly faster	0-10 minutes	Yes	I have never participated in Q&A on Stack Overflow	No, I knew that Stack Overflow had a job board but have never used or visited it	No, and I don’t know what those are	No, not really	Just as welcome now as I felt last year	NA	NA	Man	No	Straight / Heterosexual	White or of European descent	Yes	Appropriate in length	Neither easy nor difficult
I am not primarily a developer, but I write code sometimes as part of my work	No	Never	OSS is, on average, of HIGHER quality than proprietary / closed source software	Employed full-time	Lithuania	No	Master’s degree (MA, MS, M.Eng., MBA, etc.)	Information systems, information technology, or system administration	Taken an online course in programming or software development (e.g. a MOOC);Taken a part-time in-person course in programming or software development;Taught yourself a new language, framework, or tool without taking a formal course	1,000 to 4,999 employees	Database administrator;Designer;Developer, back-end;Developer, embedded applications or devices;Developer, front-end;Developer, full-stack;Developer, mobile;System administrator	8	17	4	Very satisfied	Slightly dissatisfied	Very confident	No	I am already a manager	I’m not actively looking, but I am open to new opportunities	More than 4 years ago	Interview with people in peer roles;Interview with people in senior / management roles	No	Remote work options;How widely used or impactful my work output would be;Flex time or a flexible schedule	My job status changed (promotion, new job, etc.)	
EUR	European Euro	3000	Monthly	41244	140	There’s no schedule or spec; I work on what seems most important or urgent	Lack of support from management;Non-work commitments (parenting, school work, hobbies, etc.);Not enough people for the workload	More than half, but not all, the time	Office	A little above average	Yes, because I see value in code review	1.0	No, but I think we should	Developers typically have the most influence on purchasing new technology	I have a great deal of influence	Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;Python;R;SQL	Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;Python;R;SQL	Elasticsearch;MariaDB;MongoDB;Microsoft SQL Server	Elasticsearch;MariaDB;MongoDB;Microsoft SQL Server	Android;Docker;Windows;WordPress	Android;Docker;Windows	Angular/Angular.js;ASP.NET;jQuery	Angular/Angular.js;ASP.NET;jQuery	.NET;Pandas	.NET;Pandas;Unity 3D;Xamarin	Android Studio;Visual Studio;Visual Studio Code	Windows	Outside of work, for personal projects	Not at all	Useful for immutable record keeping outside of currency	Yes	Also Yes	Yes	Facebook	In real life (in person)	Username	2010	A few times per month or weekly	Find answers to specific questions;Learn how to do things I didn’t necessarily look for	3-5 times per week	Stack Overflow was much faster	11-30 minutes	Yes	I have never participated in Q&A on Stack Overflow	No, I didn’t know that Stack Overflow had a job board	No, and I don’t know what those are	Neutral	Not applicable - I did not use Stack Overflow last year	Tech articles written by other developers	38	Man	No	Straight / Heterosexual	White or of European descent	Yes	Appropriate in length	Easy
I am a developer by profession	No	Less than once a month but more than once per year	OSS is, on average, of HIGHER quality than proprietary / closed source software	Employed full-time	Argentina	Yes, full-time	Master’s degree (MA, MS, M.Eng., MBA, etc.)	A natural science (ex. biology, chemistry, physics)	Taken an online course in programming or software development (e.g. a MOOC);Taken a part-time in-person course in programming or software development;Taught yourself a new language, framework, or tool without taking a formal course;Contributed to open source software	10,000 or more employees	Academic researcher;Data scientist or machine learning specialist;Scientist;Student	6	16	3	Very satisfied	Very satisfied	Somewhat confident	No	Not sure	I’m not actively looking, but I am open to new opportunities	1-2 years ago	NA	No	Specific department or team I’d be working on;Office environment or company culture;Flex time or a flexible schedule	My job status changed (promotion, new job, etc.)	USD	United States dollar	700	Monthly	8400	35	There is a schedule and/or spec (made by me or by a colleague), and I follow it very closely	Inadequate access to necessary tools;Meetings;Toxic work environment	Less than once per month / Never	Office	A little above average	Yes, because I see value in code review	5.0	No, but I think we should	Not sure	I have little or no influence	C++;Python;R	R	NA	NA	NA	NA	NA	NA	NA	NA	RStudio	Linux-based	I do not use containers	Not at all	NA	Yes	Yes	What?	
WhatsApp	In real life (in person)	Username	2014	Daily or almost daily	Find answers to specific questions;Learn how to do things I didn’t necessarily look for;Contribute to a library of information;Pass the time / relax	3-5 times per week	Stack Overflow was much faster	60+ minutes	Yes	A few times per week	Yes	No, and I don’t know what those are	Yes, somewhat	Just as welcome now as I felt last year	Tech articles written by other developers;Tech meetups or events in your area;Courses on technologies you’re interested in	25	Man	No	Straight / Heterosexual	Hispanic or Latino/Latina	No	Appropriate in length	Neither easy nor difficult

dataforproject2 %>% 
  filter(str_detect(LanguageWorkedWith, "R")) %>%
  mutate(LearnR = ifelse(str_detect(LanguageDesireNextYear, "R"), "WantR", "Nope")) -> Use_R
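
One caveat with the filter above: the plain pattern "R" matches any language name that contains a capital R, such as Ruby or Rust. A minimal sketch with made-up answer strings (the vector `langs` is hypothetical, not taken from the data), using base `grepl()` in place of `str_detect()`, compares it with a stricter pattern that matches R only as a whole `;`-separated entry:

```r
# Hypothetical answers in the same ";"-separated format as LanguageWorkedWith
langs <- c("Python;R;SQL", "Ruby;Rust", "R", "C++;JavaScript", NA)

# Plain "R" also matches "Ruby" and "Rust"
loose <- grepl("R", langs)

# R as a whole entry: at the start or after ";", and at the end or before ";"
strict <- grepl("(^|;)R(;|$)", langs)

data.frame(langs, loose, strict)
```

The same regular expression can be passed to `str_detect()`; whether the stricter match changes the sample depends on how the file was pre-filtered.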

Use_R[sapply(Use_R, is.character)] <- lapply(Use_R[sapply(Use_R, is.character)], as.factor)

#describeBy(Use_R, group = Use_R$LearnR)

names_text = names(Use_R[,c(3:7, 9:11, 13, 27, 35,  37:40)])

b = 1
for( i in names(Use_R[,c(3:7, 9:11, 13, 27, 35,  37:40) ] )){
  print(
    Use_R %>% 
    ggplot(aes_string(x = i))+
    geom_bar(aes(fill = LearnR),color = "black", stat="count", position = "dodge")+
      ggtitle(names_text[b])+
      geom_text(aes( label =  paste0(round((..count..)/sum(..count..)*100), "%"),
                   y=  (..count..)/sum(..count..)), stat= "count", vjust = -.5)+
      theme_minimal()+
      scale_x_discrete(guide = guide_axis(n.dodge = 3))
  )
  b = b + 1
}

R users are mostly developers by profession, or at least write code as part of their work. They also usually code as a hobby, and there is no big difference between people who code in R and people who want to learn R. However, that might be a problem with the data, because it was pre-filtered.

1.2 Justification of Variables (you may want to use not all the variables) - 1 point

To justify the variables, let's take a look at what is happening with R right now and collect data from UseR 2019. Unfortunately, I found only keynotes, not proceedings, but I am not going to worry about that too much.

first_page <- read_html("https://user2019.r-project.org/program/")

first_page %>% 
  html_nodes(".speaker-bio , ul") %>% 
  html_text() %>% 
  as_tibble() -> wikipedia_text

## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.
## This warning is displayed once per session.

head(wikipedia_text)%>%
  kable() %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

value
Home Program Program overview Talk schedule Important dates Keynotes Tutorials Datathon Information for presenters Social Program Side events Posters Registration Registration Abstract submission Scholarships Venue Travel Toulouse Gala dinner Around Toulouse About Local organization committee Scientific committee Past events FAQ Carbon footprint Legal information Code of Conduct Contact
Program overview Talk schedule Important dates Keynotes Tutorials Datathon Information for presenters Social Program Side events Posters
Registration Abstract submission Scholarships
Travel Toulouse Gala dinner Around Toulouse
Local organization committee Scientific committee Past events FAQ Carbon footprint Legal information
Joe Cheng

stop_words = get_stopwords("en")

stop_words = rbind(stop_words, c("packages"))

draw_wc = function(text){
  text %>%
  unnest_tokens(bigram, value, token = "skip_ngrams", n = 2, k = 5) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word) %>%
  filter(!word1 %in% c(1:20)) %>%
  filter(!word2 %in% c(1:20)) %>% 
  count(word1, word2, sort = TRUE) %>% na.omit() %>% 
  mutate(n = n, word = paste(word1, word2, sep = " ")) %>% 
  select(word, n) %>% 
  filter(n > 1) %>% 
  ggplot(aes(label = word, size = n)) +
  geom_text_wordcloud(rm_outside = T) +
  scale_size_area(max_size = 20) +
  scale_color_manual(values = c("red", "skyblue", "black"))+
  theme_minimal()
}

draw_wc(wikipedia_text)

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2954 rows [1,
## 8, 15, 22, 29, 36, 43, 50, 57, 64, 71, 78, 85, 92, 99, 106, 113, 120, 127,
## 134, ...].

In 2019 R was mostly about data science, science, statistics, and creating applications, with a little bit about AI and education. So let's try to find different kinds of R users.

Thus, there are different reasons to code in R, and I will try to capture these reasons with the following variables:

OpenSourcer – because people who work in industry usually cannot contribute
Employment – work is important for any R user, and part-time workers might be more interesting
Student – just to control for students
YearsCode – because R is not so old
JobFactors – because that is the answer to my question
WorkPlan – cool programmers do not have a schedule

1.3 the distance metric matches variable types - 1 point (if this is incorrect, interpretation will fail)

Use_R %>% 
  select(Respondent, OpenSourcer, Employment, Student, YearsCode, JobFactors, WorkPlan) %>% na.omit() -> df_cluster

1.4 k-means

df_cluster_num <- mutate_all(df_cluster, function(x) as.numeric(x))


fviz_nbclust(df_cluster_num[,-1], kmeans, method = "wss")

fviz_nbclust(df_cluster_num[,-1], kmeans, method = "silhouette")

fviz_nbclust(df_cluster_num[,-1], kmeans, method = "gap")

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 185350)

## Warning: did not converge in 10 iterations

Both wss and silhouette say that 2 clusters is the optimal number for this data. However, we should take into consideration that converting factors into numeric values should be illegal: the resulting numbers are arbitrary level codes, not real quantities.
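
To see why the conversion is a problem, consider a small sketch with a made-up factor (the values are illustrative, not taken from the survey). `as.numeric()` returns the internal level codes, which follow alphabetical order by default, so k-means would treat the categories as points on a line:

```r
# Hypothetical employment answers
employment <- factor(c("Employed full-time", "Student", "Employed part-time"))

# Level codes follow the alphabetical order of the labels
codes <- as.numeric(employment)
codes

# Euclidean distances between the codes: "full-time" vs "Student" comes out
# twice as far apart as "full-time" vs "part-time", which is arbitrary
dist(codes)
```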

clusters <- kmeans(df_cluster_num[,-1],
                      2, # how many groups to locate
                      nstart = 20 # R will try 20 different random starting assignments 
                      # and then select the one with the lowest within cluster variation.
)


fviz_cluster(clusters, data = df_cluster_num[,-1],
  ellipse.type = "convex",
  palette = "jco",
  repel = TRUE)

As we can see, our data cannot be plotted properly, because the first two dimensions do not explain enough of the variance.

So we need to create a proper distance matrix.

I will use Gower distance, since I have mostly factors.
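
As a rough illustration of what Gower distance does with mixed types, here is a toy example (the data frame `toy` is made up). For a numeric column the contribution is the absolute difference divided by the column range; for a factor it is 0 when the values match and 1 otherwise; the distance is the average over columns:

```r
library(cluster)

# Hypothetical mixed-type data: one numeric and one categorical variable
toy <- data.frame(
  years   = c(0, 5, 10),
  student = factor(c("No", "No", "Yes"))
)

d <- daisy(toy, metric = "gower")
m <- as.matrix(d)
round(m, 2)
# rows 1 and 2: (|0 - 5| / 10 + 0) / 2 = 0.25
# rows 1 and 3: (|0 - 10| / 10 + 1) / 2 = 1
# rows 2 and 3: (|5 - 10| / 10 + 1) / 2 = 0.75
```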

Use_R %>% 
  select(Respondent, OpenSourcer, Employment, Student, YearsCode, JobFactors, WorkPlan) %>% mutate_if(is.factor, addNA) -> df_cluster

df_cluster$YearsCode = as.numeric(as.character(df_cluster$YearsCode))

## Warning: NAs introduced by coercion

# to perform different types of hierarchical clustering
# package functions used: daisy(), diana(), clusplot()
gower.dist <- daisy(df_cluster[ ,-1], metric = c("gower"))
#class(gower.dist) 
## dissimilarity , dist

1.5 DIVISIVE

divisive.clust <- diana(as.matrix(gower.dist),
                  diss = TRUE, keep.diss = FALSE)

plot(divisive.clust)

This works too poorly; the plot could not be computed.

1.6 AGGLOMERATIVE

aggl.clust.c <- hclust(gower.dist, method = "complete")

plot(aggl.clust.c,
     main = "Agglomerative, complete linkages", hang = -1, cex = 0.6, ylim =c(0.5,1))

1.7 Visualization

 #install.packages("ape")
library("ape")
# Default plot
plot(as.phylo(aggl.clust.c), cex = 0.6, label.offset = 0.5)

plot(as.phylo(aggl.clust.c), type = "unrooted", cex = 0.6,
     no.margin = TRUE, show.tip.label = F)

It looks like we have around 12 clusters, which is not good, but not really bad either. Let's cut the tree and take a look at them.

clus4 = cutree(aggl.clust.c, 12)

plot(as.phylo(aggl.clust.c), type = "fan", 
     label.offset = 1, cex = 0.3, show.tip.label = F)

Let's put these clusters back into the data.

Use_R$cluster = clus4
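
Assigning the labels back like this relies on `cutree()` returning one label per observation in the original row order. A small sketch on made-up one-dimensional data (not the survey) illustrates this:

```r
# Hypothetical data: three well-separated groups on a line
x <- c(1, 1.2, 0.9, 5, 5.1, 9.8, 10, 10.3)

tree <- hclust(dist(x), method = "complete")

# Cut the dendrogram into 3 clusters; groups[i] is the cluster of x[i]
groups <- cutree(tree, k = 3)
data.frame(x, groups)
table(groups)
```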

And draw all the graphs.

names_text = names(Use_R[,c(3:7, 9:11, 13, 27, 35,  37:40)])

b = 1
for( i in names(Use_R[,c(3:7, 9:11, 13, 27, 35,  37:40) ] )){
  print(
    Use_R %>% 
    ggplot(aes_string(x = i))+
    geom_bar(aes(fill = as.factor(cluster)), color = "black", stat="count", position = "dodge")+
      ggtitle(names_text[b])+
      geom_text(aes( label =  paste0(round((..count..)/sum(..count..)*100), "%"),
                   y=  (..count..)/sum(..count..)), stat= "count", vjust = -.5)+
      theme_minimal()+
      scale_x_discrete(guide = guide_axis(n.dodge = 3))
  )
  b = b + 1
}

OK, I could not really understand why they are different; we need a computer to make this decision for me. All credits to https://towardsdatascience.com/hierarchical-clustering-on-categorical-data-in-r-a27e578f2995

# Cluster stats comes out as list while it is more convenient to look at it as a table
# This code below will produce a dataframe with observations in columns and variables in row
# Not quite tidy data, which will require a tweak for plotting, but I prefer this view as an output here as I find it more comprehensive 
library(fpc)
cstats.table <- function(dist, tree, k) {
clust.assess <- c("cluster.number","n","within.cluster.ss","average.within","average.between",
                  "wb.ratio","dunn2","avg.silwidth")
clust.size <- c("cluster.size")
stats.names <- c()
row.clust <- c()
output.stats <- matrix(ncol = k, nrow = length(clust.assess))
cluster.sizes <- matrix(ncol = k, nrow = k)
for(i in c(1:k)){
  row.clust[i] <- paste("Cluster-", i, " size")
}
for(i in c(2:k)){
  stats.names[i] <- paste("Test", i-1)
  
  for(j in seq_along(clust.assess)){
    output.stats[j, i] <- unlist(cluster.stats(d = dist, clustering = cutree(tree, k = i))[clust.assess])[j]
    
  }
  
  for(d in 1:k) {
    cluster.sizes[d, i] <- unlist(cluster.stats(d = dist, clustering = cutree(tree, k = i))[clust.size])[d]
    dim(cluster.sizes[d, i]) <- c(length(cluster.sizes[i]), 1)
    cluster.sizes[d, i]
    
  }
}
output.stats.df <- data.frame(output.stats)
cluster.sizes <- data.frame(cluster.sizes)
cluster.sizes[is.na(cluster.sizes)] <- 0
rows.all <- c(clust.assess, row.clust)
# rownames(output.stats.df) <- clust.assess
output <- rbind(output.stats.df, cluster.sizes)[ ,-1]
colnames(output) <- stats.names[2:k]
rownames(output) <- rows.all
is.num <- sapply(output, is.numeric)
output[is.num] <- lapply(output[is.num], round, 2)
output
}
# I am capping the maximum number of clusters at 7
# I want to choose a reasonable number, based on which I will be able to see basic differences between respondent groups as a result
stats.df.divisive <- cstats.table(gower.dist, divisive.clust, 7)
stats.df.divisive

##                    Test 1  Test 2  Test 3  Test 4  Test 5  Test 6
## cluster.number       2.00    3.00    4.00    5.00    6.00    7.00
## n                 5048.00 5048.00 5048.00 5048.00 5048.00 5048.00
## within.cluster.ss  780.28  732.45  643.05  628.96  626.63  592.96
## average.within       0.53    0.51    0.48    0.48    0.48    0.46
## average.between      0.71    0.70    0.71    0.71    0.71    0.71
## wb.ratio             0.75    0.73    0.68    0.67    0.67    0.66
## dunn2                1.13    0.98    1.08    1.09    1.10    1.16
## avg.silwidth         0.25    0.18    0.21    0.21    0.21    0.22
## Cluster- 1  size  2934.00 2934.00 2934.00 2934.00 2934.00 2934.00
## Cluster- 2  size  2114.00 1679.00 1342.00 1299.00 1293.00 1094.00
## Cluster- 3  size     0.00  435.00  435.00  435.00  435.00  435.00
## Cluster- 4  size     0.00    0.00  337.00  337.00  337.00  337.00
## Cluster- 5  size     0.00    0.00    0.00   43.00   43.00  199.00
## Cluster- 6  size     0.00    0.00    0.00    0.00    6.00   43.00
## Cluster- 7  size     0.00    0.00    0.00    0.00    0.00    6.00

stats.df.aggl <- cstats.table(gower.dist, aggl.clust.c, 7)
stats.df.aggl

##                    Test 1  Test 2  Test 3  Test 4  Test 5  Test 6
## cluster.number       2.00    3.00    4.00    5.00    6.00    7.00
## n                 5048.00 5048.00 5048.00 5048.00 5048.00 5048.00
## within.cluster.ss 1001.66  981.15  924.83  920.76  897.20  864.59
## average.within       0.60    0.59    0.57    0.57    0.56    0.55
## average.between      0.67    0.69    0.64    0.64    0.65    0.65
## wb.ratio             0.90    0.87    0.90    0.89    0.87    0.85
## dunn2                1.11    1.11    1.01    0.98    0.98    0.98
## avg.silwidth         0.10    0.07   -0.01   -0.01   -0.02   -0.02
## Cluster- 1  size  4913.00 4785.00 4259.00 4234.00 4124.00 3826.00
## Cluster- 2  size   135.00  128.00  526.00   25.00  110.00  298.00
## Cluster- 3  size     0.00  135.00  128.00  526.00   25.00  110.00
## Cluster- 4  size     0.00    0.00  135.00  128.00  526.00   25.00
## Cluster- 5  size     0.00    0.00    0.00  135.00  128.00  526.00
## Cluster- 6  size     0.00    0.00    0.00    0.00  135.00  128.00
## Cluster- 7  size     0.00    0.00    0.00    0.00    0.00  135.00

And plot them.

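# Transpose so that each row is one value of k, then tag the method.
# Adding "method" as a row before transposing would coerce every column
# to character and make the plot axes discrete.
df_divi <- as.data.frame(t(stats.df.divisive))
df_divi$method <- "divi"
df_aggl <- as.data.frame(t(stats.df.aggl))
df_aggl$method <- "aggl"
df_elbow <- rbind(df_divi, df_aggl)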

ggplot(data = df_elbow, aes(x=cluster.number, y=within.cluster.ss, group = method)) + 
  geom_point()+
  geom_line(aes(color = method))+
  ggtitle("clustering") +
  labs(x = "Num.of clusters", y = "Within clusters sum of squares (SS)") +
  theme(plot.title = element_text(hjust = 0.5))

According to our picture there is no elbow (probably because it happens later, but I could not describe more than 7 clusters).

ggplot(data = df_elbow, aes(x=cluster.number, y=avg.silwidth, group = method)) + 
  geom_point()+
  geom_line(aes(color = method))+
  ggtitle(" clustering") +
  labs(x = "Num.of clusters", y = "Average silhouette width") +
  theme(plot.title = element_text(hjust = 0.5))

We could see it here, but I could not run that test script one more time; my PC would explode. 7 is fine, just fine. The divisive clustering is better on all metrics.
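
For reference, the average silhouette width of a single tree can also be computed directly with `silhouette()` from the cluster package, without rebuilding the full `cluster.stats()` table. A sketch on made-up data (the names `toy`, `d` and `tree` are illustrative, not the objects used above):

```r
library(cluster)

# Hypothetical mixed-type data with two obvious groups
toy <- data.frame(
  years   = c(1, 2, 3, 20, 21, 22),
  student = factor(c("Yes", "Yes", "Yes", "No", "No", "No"))
)
d <- daisy(toy, metric = "gower")
tree <- hclust(d, method = "complete")

# Average silhouette width for k = 2..4; values lie in [-1, 1], higher is better
avg_sil <- sapply(2:4, function(k) {
  sil <- silhouette(cutree(tree, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- 2:4
avg_sil
```

On the real data, `d` would be `gower.dist` and `tree` one of the fitted trees, assuming the distance matrix fits in memory.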

clus7 = cutree(divisive.clust, 7)

Use_R$cluster = clus7 

df_cluster$cluster = clus7

names_text = names(df_cluster[,c(-1,-6,-5)])

b = 1
for( i in names(df_cluster[,c(-1,-6,-5)] )){
  print(
    Use_R %>% 
    ggplot(aes_string(x = i))+
    geom_bar(aes(fill = as.factor(cluster)), color = "black", stat="count", position = "dodge")+
      ggtitle(names_text[b])+
      geom_text(aes( label =  paste0(round((..count..)/sum(..count..)*100), "%"),
                   y=  (..count..)/sum(..count..)), stat= "count", vjust = -.5)+
      theme_minimal()+
      scale_x_discrete(guide = guide_axis(n.dodge = 3))
  )
  b = b + 1
}

df_cluster %>% 
ggplot(aes(YearsCode, fill = as.factor(cluster)))+
  geom_histogram()+
  theme_minimal()+
  facet_grid(. ~ cluster)

## Warning: Removed 112 rows containing non-finite values (stat_bin).

1.8 Cluster names

cluster 1: mostly developers, or people who develop; they have enough time for hobbies and are definitely not students. They certainly work, mostly from the office
cluster 2: students, or people who have just graduated
clusters 3-5: office workers, but development is not their primary job
clusters 6-7: mostly noise, less than 1% of the data

df_cluster$JobFactors = as.factor(df_cluster$JobFactors)

Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

df_cluster %>% group_by(cluster) %>% summarise(mode = Mode(JobFactors))

## # A tibble: 7 x 2
##   cluster mode                                                                  
##     <int> <fct>                                                                 
## 1       1 Languages, frameworks, and other technologies I'd be working with;Off…
## 2       2 <NA>                                                                  
## 3       3 Office environment or company culture;Opportunities for professional …
## 4       4 Office environment or company culture;Opportunities for professional …
## 5       5 Office environment or company culture;Opportunities for professional …
## 6       6 Languages, frameworks, and other technologies I'd be working with;Off…
## 7       7 Industry that I'd be working in;Financial performance or funding stat…