1 Cluster Analysis
1.1 Important Characteristics of Our Data - 0.6 points
library(readr)
library(ggplot2)
library(dplyr)
library(rvest)
library(stringr)
library(tidytext)
library(tidyr)
library(ggwordcloud)
library(psych)
library(knitr)
library(kableExtra)
library(factoextra)
library(cluster)
dataforproject2 <- read_csv("/srv/store/students/vvsuschevskiy/datanal/4year/clusters/dataforproject2.csv")
## Warning: Missing column names filled in: 'X1' [1]
head(dataforproject2[,-1:-2]) %>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
MainBranch | Hobbyist | OpenSourcer | OpenSource | Employment | Country | Student | EdLevel | UndergradMajor | EduOther | OrgSize | DevType | YearsCode | Age1stCode | YearsCodePro | CareerSat | JobSat | MgrIdiot | MgrMoney | MgrWant | JobSeek | LastHireDate | LastInt | FizzBuzz | JobFactors | ResumeUpdate | CurrencySymbol | CurrencyDesc | CompTotal | CompFreq | ConvertedComp | WorkWeekHrs | WorkPlan | WorkChallenge | WorkRemote | WorkLoc | ImpSyn | CodeRev | CodeRevHrs | UnitTests | PurchaseHow | PurchaseWhat | LanguageWorkedWith | LanguageDesireNextYear | DatabaseWorkedWith | DatabaseDesireNextYear | PlatformWorkedWith | PlatformDesireNextYear | WebFrameWorkedWith | WebFrameDesireNextYear | MiscTechWorkedWith | MiscTechDesireNextYear | DevEnviron | OpSys | Containers | BlockchainOrg | BlockchainIs | BetterLife | ITperson | OffOn | SocialMedia | Extraversion | ScreenName | SOVisit1st | SOVisitFreq | SOVisitTo | SOFindAnswer | SOTimeSaved | SOHowMuchTime | SOAccount | SOPartFreq | SOJobs | EntTeams | SOComm | WelcomeChange | SONewContent | Age | Gender | Trans | Sexuality | Ethnicity | Dependents | SurveyLength | SurveyEase |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
I am not primarily a developer, but I write code sometimes as part of my work | Yes | Never | The quality of OSS and closed source software is about the same | Employed full-time | Canada | No | Bachelor’s degree (BA, BS, B.Eng., etc.) | Mathematics or statistics | Taken an online course in programming or software development (e.g. a MOOC);Received on-the-job training in software development;Taught yourself a new language, framework, or tool without taking a formal course | NA | Data or business analyst;Data scientist or machine learning specialist;Database administrator;Engineer, data | 13 | 15 | 3 | Very satisfied | Slightly satisfied | Very confident | No | Yes | I am not interested in new job opportunities | 1-2 years ago | Write any code;Complete a take-home project;Interview with people in senior / management roles | No | Financial performance or funding status of the company or organization;Opportunities for professional development;How widely used or impactful my work output would be | I heard about a job opportunity (from a recruiter, online job posting, etc.) 
| CAD | Canadian dollar | 40000 | Monthly | 366420 | 15 | There’s no schedule or spec; I work on what seems most important or urgent | NA | A few days each month | Home | A little above average | No | NA | Yes, it’s not part of our process but the developers do it on their own | Not sure | I have little or no influence | Java;R;SQL | Python;Scala;SQL | MongoDB;PostgreSQL | PostgreSQL | Android;Google Cloud Platform;Linux;Windows | Android;Google Cloud Platform;Linux;Windows | NA | NA | Hadoop | Hadoop;Pandas;TensorFlow;Unity 3D | Android Studio;Eclipse;PyCharm;RStudio;Visual Studio Code | Windows | I do not use containers | Not at all | NA | No | Yes | No | YouTube | In real life (in person) | Login | 2011 | A few times per month or weekly | Find answers to specific questions | Less than once per week | Stack Overflow was slightly faster | 60+ minutes | Yes | I have never participated in Q&A on Stack Overflow | No, I knew that Stack Overflow had a job board but have never used or visited it | No, and I don’t know what those are | No, not really | Just as welcome now as I felt last year | Tech articles written by other developers;Industry news about technologies you’re interested in;Tech meetups or events in your area;Courses on technologies you’re interested in | 28 | Man | No | Straight / Heterosexual | East Asian | No | Too long | Neither easy nor difficult |
I am a developer by profession | Yes | Once a month or more often | OSS is, on average, of HIGHER quality than proprietary / closed source software | Employed full-time | India | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | NA | NA | 10,000 or more employees | Data or business analyst;Data scientist or machine learning specialist;Database administrator;Developer, back-end;Developer, desktop or enterprise applications;Developer, front-end;Developer, full-stack;Developer, game or graphics;Educator | 12 | 20 | 10 | Slightly dissatisfied | Slightly dissatisfied | Somewhat confident | Yes | Yes | I’m not actively looking, but I am open to new opportunities | 3-4 years ago | NA | No | Languages, frameworks, and other technologies I’d be working with;Remote work options;Flex time or a flexible schedule | NA | INR | Indian rupee | 950000 | Yearly | 13293 | 70 | There’s no schedule or spec; I work on what seems most important or urgent | NA | A few days each month | Home | Far above average | Yes, because I see value in code review | 4.0 | Yes, it’s part of our process | NA | NA | C#;Go;JavaScript;Python;R;SQL | C#;Go;JavaScript;Kotlin;Python;R;SQL | Elasticsearch;MongoDB;Microsoft SQL Server;MySQL;SQLite | Elasticsearch;MongoDB;Microsoft SQL Server | Linux;Windows | Android;Linux;Raspberry Pi;Windows | Angular/Angular.js;ASP.NET;Django;Express;Flask;jQuery | Angular/Angular.js;ASP.NET;Django;Express;Flask;jQuery | .NET;Node.js;Pandas;Torch/PyTorch | .NET;Node.js;TensorFlow;Torch/PyTorch | Android Studio;Eclipse;IPython / Jupyter;Notepad++;RStudio;Vim;Visual Studio;Visual Studio Code | Windows | NA | Not at all | Useful for immutable record keeping outside of currency | No | Yes | Yes | YouTube | Neither | Screen Name | NA | Multiple times per day | Find answers to specific questions;Get a sense of belonging to the developer community;Meet other people with similar skills or interests | 3-5 times per week | They were about the same | NA | Yes | A few times per month 
or weekly | Yes | No, and I don’t know what those are | Yes, somewhat | Somewhat less welcome now than last year | Tech articles written by other developers;Tech meetups or events in your area | NA | NA | NA | NA | NA | Yes | Too long | Difficult |
I am a student who is learning to code | No | Never | OSS is, on average, of HIGHER quality than proprietary / closed source software | Employed part-time | Canada | Yes, full-time | Some college/university study without earning a degree | Mathematics or statistics | Taken an online course in programming or software development (e.g. a MOOC);Taught yourself a new language, framework, or tool without taking a formal course | NA | Data or business analyst;Data scientist or machine learning specialist;Engineer, data;Student | 5 | 16 | NA | NA | NA | NA | NA | NA | I am not interested in new job opportunities | Less than a year ago | NA | NA | Financial performance or funding status of the company or organization;Office environment or company culture;Opportunities for professional development | My job status changed (promotion, new job, etc.) | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | Bash/Shell/PowerShell;HTML/CSS;Java;Python;R;SQL | Bash/Shell/PowerShell;C++;Go;Python;R;Scala;SQL | MySQL;PostgreSQL;SQLite | Elasticsearch;MongoDB;MySQL;PostgreSQL | AWS;Docker;Google Cloud Platform;Linux;MacOS;Slack;Windows | AWS;Linux;MacOS;Slack | NA | NA | Ansible;Chef;Hadoop;Pandas;TensorFlow | Ansible;Apache Spark;Chef;Hadoop;Pandas;TensorFlow;Torch/PyTorch | IPython / Jupyter;PyCharm;RStudio;Sublime Text;Vim | MacOS | Testing;Production | NA | NA | Yes | Yes | Yes | In real life (in person) | Username | 2014 | A few times per month or weekly | Find answers to specific questions;Learn how to do things I didn’t necessarily look for | 1-2 times per week | The other resource was slightly faster | 11-30 minutes | Not sure / can’t remember | NA | No, I knew that Stack Overflow had a job board but have never used or visited it | Yes | Yes, somewhat | Just as welcome now as I felt last year | Courses on technologies you’re interested in | 21 | Woman | No | Straight / Heterosexual | Black or of African descent | No | Appropriate in length | Easy | |
I am not primarily a developer, but I write code sometimes as part of my work | Yes | Less than once a month but more than once per year | The quality of OSS and closed source software is about the same | Employed full-time | Russian Federation | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Computer science, computer engineering, or software engineering | Taken an online course in programming or software development (e.g. a MOOC);Taught yourself a new language, framework, or tool without taking a formal course;Contributed to open source software | 1,000 to 4,999 employees | Data or business analyst | 10 | 18 | 3 | Slightly satisfied | Very satisfied | Very confident | Yes | Yes | I’m not actively looking, but I am open to new opportunities | 3-4 years ago | Complete a take-home project;Interview with people in peer roles;Interview with people in senior / management roles | No | Financial performance or funding status of the company or organization;Opportunities for professional development;How widely used or impactful my work output would be | My job status changed (promotion, new job, etc.) 
| RUB | Russian ruble | 120000 | Monthly | 21996 | 40 | There’s no schedule or spec; I work on what seems most important or urgent | Distracting work environment;Non-work commitments (parenting, school work, hobbies, etc.);Not enough people for the workload | Less than once per month / Never | Office | Average | Yes, because I see value in code review | 0.5 | Yes, it’s not part of our process but the developers do it on their own | Not sure | I have some influence | Python;R | Python;R | MongoDB | MongoDB | NA | NA | NA | NA | NA | NA | PyCharm;RStudio | Linux-based | Production | NA | A passing fad | Yes | SIGH | Yes | VK ВКонта́кте | In real life (in person) | Login | I don’t remember | Multiple times per day | Find answers to specific questions | More than 10 times per week | Stack Overflow was slightly faster | 0-10 minutes | Yes | I have never participated in Q&A on Stack Overflow | No, I knew that Stack Overflow had a job board but have never used or visited it | No, and I don’t know what those are | No, not really | Just as welcome now as I felt last year | NA | NA | Man | No | Straight / Heterosexual | White or of European descent | Yes | Appropriate in length | Neither easy nor difficult |
I am not primarily a developer, but I write code sometimes as part of my work | No | Never | OSS is, on average, of HIGHER quality than proprietary / closed source software | Employed full-time | Lithuania | No | Master’s degree (MA, MS, M.Eng., MBA, etc.) | Information systems, information technology, or system administration | Taken an online course in programming or software development (e.g. a MOOC);Taken a part-time in-person course in programming or software development;Taught yourself a new language, framework, or tool without taking a formal course | 1,000 to 4,999 employees | Database administrator;Designer;Developer, back-end;Developer, embedded applications or devices;Developer, front-end;Developer, full-stack;Developer, mobile;System administrator | 8 | 17 | 4 | Very satisfied | Slightly dissatisfied | Very confident | No | I am already a manager | I’m not actively looking, but I am open to new opportunities | More than 4 years ago | Interview with people in peer roles;Interview with people in senior / management roles | No | Remote work options;How widely used or impactful my work output would be;Flex time or a flexible schedule | My job status changed (promotion, new job, etc.) 
| EUR | European Euro | 3000 | Monthly | 41244 | 140 | There’s no schedule or spec; I work on what seems most important or urgent | Lack of support from management;Non-work commitments (parenting, school work, hobbies, etc.);Not enough people for the workload | More than half, but not all, the time | Office | A little above average | Yes, because I see value in code review | 1.0 | No, but I think we should | Developers typically have the most influence on purchasing new technology | I have a great deal of influence | Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;Python;R;SQL | Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;Python;R;SQL | Elasticsearch;MariaDB;MongoDB;Microsoft SQL Server | Elasticsearch;MariaDB;MongoDB;Microsoft SQL Server | Android;Docker;Windows;WordPress | Android;Docker;Windows | Angular/Angular.js;ASP.NET;jQuery | Angular/Angular.js;ASP.NET;jQuery | .NET;Pandas | .NET;Pandas;Unity 3D;Xamarin | Android Studio;Visual Studio;Visual Studio Code | Windows | Outside of work, for personal projects | Not at all | Useful for immutable record keeping outside of currency | Yes | Also Yes | Yes | In real life (in person) | Username | 2010 | A few times per month or weekly | Find answers to specific questions;Learn how to do things I didn’t necessarily look for | 3-5 times per week | Stack Overflow was much faster | 11-30 minutes | Yes | I have never participated in Q&A on Stack Overflow | No, I didn’t know that Stack Overflow had a job board | No, and I don’t know what those are | Neutral | Not applicable - I did not use Stack Overflow last year | Tech articles written by other developers | 38 | Man | No | Straight / Heterosexual | White or of European descent | Yes | Appropriate in length | Easy | |
I am a developer by profession | No | Less than once a month but more than once per year | OSS is, on average, of HIGHER quality than proprietary / closed source software | Employed full-time | Argentina | Yes, full-time | Master’s degree (MA, MS, M.Eng., MBA, etc.) | A natural science (ex. biology, chemistry, physics) | Taken an online course in programming or software development (e.g. a MOOC);Taken a part-time in-person course in programming or software development;Taught yourself a new language, framework, or tool without taking a formal course;Contributed to open source software | 10,000 or more employees | Academic researcher;Data scientist or machine learning specialist;Scientist;Student | 6 | 16 | 3 | Very satisfied | Very satisfied | Somewhat confident | No | Not sure | I’m not actively looking, but I am open to new opportunities | 1-2 years ago | NA | No | Specific department or team I’d be working on;Office environment or company culture;Flex time or a flexible schedule | My job status changed (promotion, new job, etc.) | USD | United States dollar | 700 | Monthly | 8400 | 35 | There is a schedule and/or spec (made by me or by a colleague), and I follow it very closely | Inadequate access to necessary tools;Meetings;Toxic work environment | Less than once per month / Never | Office | A little above average | Yes, because I see value in code review | 5.0 | No, but I think we should | Not sure | I have little or no influence | C++;Python;R | R | NA | NA | NA | NA | NA | NA | NA | NA | RStudio | Linux-based | I do not use containers | Not at all | NA | Yes | Yes | What? 
| In real life (in person) | Username | 2014 | Daily or almost daily | Find answers to specific questions;Learn how to do things I didn’t necessarily look for;Contribute to a library of information;Pass the time / relax | 3-5 times per week | Stack Overflow was much faster | 60+ minutes | Yes | A few times per week | Yes | No, and I don’t know what those are | Yes, somewhat | Just as welcome now as I felt last year | Tech articles written by other developers;Tech meetups or events in your area;Courses on technologies you’re interested in | 25 | Man | No | Straight / Heterosexual | Hispanic or Latino/Latina | No | Appropriate in length | Neither easy nor difficult |
dataforproject2 %>%
filter(str_detect(LanguageWorkedWith, "R")) %>%
mutate(LearnR = ifelse(str_detect(LanguageDesireNextYear, "R"), "WantR", "Nope")) -> Use_R
Use_R[sapply(Use_R, is.character)] <- lapply(Use_R[sapply(Use_R, is.character)], as.factor)
#describeBy(Use_R, group = Use_R$LearnR)
names_text = names(Use_R[,c(3:7, 9:11, 13, 27, 35, 37:40)])
b = 1
for( i in names(Use_R[,c(3:7, 9:11, 13, 27, 35, 37:40) ] )){
print(
Use_R %>%
ggplot(aes_string(x = i))+
geom_bar(aes(fill = LearnR),color = "black", stat="count", position = "dodge")+
ggtitle(names_text[b])+
geom_text(aes( label = paste0(round((..count..)/sum(..count..)*100), "%"),
y= (..count..)/sum(..count..)), stat= "count", vjust = -.5)+
theme_minimal()+
scale_x_discrete(guide = guide_axis(n.dodge = 3))
)
b = b + 1
}
R users are mostly developers by profession, or at least write code as part of their work. They also usually code as a hobby, and there is no big difference between people who code in R and people who want to learn R. However, that might be a problem with the data, because it was pre-filtered.
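One caveat about the filter above: the pattern "R" matches any capital R, so it would also keep answers such as Ruby or Rust if they occur in the data. A minimal sketch of a stricter match, using base `grepl()` on a made-up vector (the sample strings and the names `langs`, `loose` and `strict` are illustrative, not taken from the data set):

```r
# Hypothetical semicolon-separated answers, in the format of LanguageWorkedWith
langs <- c("Python;R;SQL", "Ruby;SQL", "C++;Rust", "R", "Java;R")

# Loose match: any capital "R" anywhere in the string
loose <- grepl("R", langs)

# Strict match: "R" only as a whole item between semicolons or string ends
strict <- grepl("(^|;)R(;|$)", langs)

data.frame(langs, loose, strict)
```

The same regular expression should also work as the pattern in `str_detect()`, since it uses only basic anchors and alternation.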
1.2 Justification of Variables (you may want to use not all the variables) - 1 point
To justify the choice of variables, let's take a look at what is happening with R right now and collect data from UseR 2019. Unfortunately, I found only the keynotes, not the proceedings, but I am not going to worry about that too much.
first_page <- read_html("https://user2019.r-project.org/program/")
first_page %>%
html_nodes(".speaker-bio , ul") %>%
html_text() %>%
as_tibble() -> wikipedia_text
## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.
## This warning is displayed once per session.
head(wikipedia_text)%>%
kable() %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))
value |
---|
Home |
Program overview |
Registration |
Travel |
Local organization committee |
Joe Cheng |
stop_words = get_stopwords("en")
stop_words = rbind(stop_words, c("packages"))
draw_wc = function(text){
text %>%
unnest_tokens(bigram, value, token = "skip_ngrams", n = 2, k = 5) %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word1 %in% c(1:20)) %>%
filter(!word2 %in% c(1:20)) %>%
count(word1, word2, sort = TRUE) %>% na.omit() %>%
mutate(n = n, word = paste(word1, word2, sep = " ")) %>%
select(word, n) %>%
filter(n > 1) %>%
ggplot(aes(label = word, size = n)) +
geom_text_wordcloud(rm_outside = T) +
scale_size_area(max_size = 20) +
scale_color_manual(values = c("red", "skyblue", "black"))+
theme_minimal()
}
draw_wc(wikipedia_text)
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2954 rows [1,
## 8, 15, 22, 29, 36, 43, 50, 57, 64, 71, 78, 85, 92, 99, 106, 113, 120, 127,
## 134, ...].
In 2019, R was mostly about data science, science, statistics, and building applications, with a little bit about AI and education. So let's try to find different kinds of R users.
Thus, there are different reasons to code in R, and I will try to capture these reasons with the following variables:
- OpenSourcer – because people who work in industry usually cannot contribute
- Employment – work is important for any R user, and part-time workers might be more interesting
- Student – just to control for students
- YearsCode – because R is not that old
- JobFactors – because that is the answer to my question
- WorkPlan – cool programmers do not have a schedule
1.3 The Distance Metric Matches Variable Types - 1 point (if this is incorrect, interpretation will fail)
1.4 k-means
df_cluster_num <- mutate_all(df_cluster, function(x) as.numeric(x))
fviz_nbclust(df_cluster_num[,-1], kmeans, method = "wss")
## Warning: Quick-TRANSfer stage steps exceeded maximum (= 185350)
## Warning: did not converge in 10 iterations
WSS and silhouette both say that 2 clusters is the optimal number for this data. However, we should keep in mind that converting factors into numeric codes is not a legitimate operation here: the codes impose an order and a spacing that the categories do not have.
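To see why numeric codes are a problem for k-means, here is a small sketch with three made-up answer categories (the values and the names `answers`, `codes` and `code_dist` are illustrative only):

```r
# Three unordered categories, as in a survey answer
answers <- factor(c("Home", "Office", "Other"))

# as.numeric() returns the internal level codes (levels are sorted alphabetically)
codes <- as.numeric(answers)

# Euclidean distance on the codes treats "Home" vs "Other" as twice as far
# apart as "Home" vs "Office", although the categories have no order
code_dist <- as.matrix(dist(codes))
code_dist
```

Any distance that k-means minimises on such codes depends on the arbitrary ordering of the levels, which is the motivation for switching to a dissimilarity designed for mixed data.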
clusters <- kmeans(df_cluster_num[,-1],
2, # how many groups to locate
nstart = 20 # R will try 20 different random starting assignments
# and then select the one with the lowest within cluster variation.
)
fviz_cluster(clusters, data = df_cluster_num[,-1],
ellipse.type = "convex",
palette = "jco",
repel = TRUE)
As we can see, our data cannot be plotted properly, because the two plotted dimensions explain too little of the variance.
So we need to create a proper distance matrix.
I will use Gower distance, since most of my variables are factors.
Use_R %>%
select(Respondent, OpenSourcer, Employment, Student, YearsCode, JobFactors, WorkPlan) %>% mutate_if(is.factor, addNA) -> df_cluster
df_cluster$YearsCode = as.numeric(as.character(df_cluster$YearsCode))
## Warning: NAs introduced by coercion
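The object `gower.dist` used in the following sections is not created in the code shown. A minimal sketch of how such a matrix can be built, assuming `daisy()` from the cluster package was used on the selected columns without the ID column; the toy data frame below only stands in for `df_cluster`:

```r
library(cluster)

# Toy stand-in for df_cluster: one ID, two factors, one numeric column
toy <- data.frame(
  Respondent = 1:4,
  Student    = factor(c("No", "No", "Yes", "Yes")),
  Employment = factor(c("Full", "Part", "Full", "Part")),
  YearsCode  = c(10, 12, 3, NA)
)

# Gower dissimilarity handles mixed factor/numeric columns and missing values;
# the ID column is dropped so that it does not influence the distances
toy.gower <- daisy(toy[, -1], metric = "gower")

# In the report the call would presumably look like this (assumption):
# gower.dist <- daisy(df_cluster[, -1], metric = "gower")
summary(toy.gower)
```

Each Gower dissimilarity is an average over the variables available for a pair: 0 or 1 for a factor mismatch, and the absolute difference divided by the variable's range for numeric columns.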
1.5 DIVISIVE
divisive.clust <- diana(as.matrix(gower.dist),
diss = TRUE, keep.diss = FALSE)
plot(divisive.clust)
This works too poorly: the plot could not be computed.
1.6 AGGLOMERATIVE
aggl.clust.c <- hclust(gower.dist, method = "complete")
plot(aggl.clust.c,
main = "Agglomerative, complete linkages", hang = -1, cex = 0.6, ylim =c(0.5,1))
1.7 Visualization
#install.packages("ape")
library("ape")
# Default plot
plot(as.phylo(aggl.clust.c), cex = 0.6, label.offset = 0.5)
It looks like we have around 12 clusters, which is not good but not really bad either. Let's cut the tree and try to look at them.
clus4 = cutree(aggl.clust.c, 12)
plot(as.phylo(aggl.clust.c), type = "fan",
label.offset = 1, cex = 0.3, show.tip.label = F)
Let's put these clusters back into the data
and draw all the graphs.
Use_R$cluster = clus4
names_text = names(Use_R[,c(3:7, 9:11, 13, 27, 35, 37:40)])
b = 1
for( i in names(Use_R[,c(3:7, 9:11, 13, 27, 35, 37:40) ] )){
print(
Use_R %>%
ggplot(aes_string(x = i))+
geom_bar(aes(fill = as.factor(cluster)), color = "black", stat="count", position = "dodge")+
ggtitle(names_text[b])+
geom_text(aes( label = paste0(round((..count..)/sum(..count..)*100), "%"),
y= (..count..)/sum(..count..)), stat= "count", vjust = -.5)+
theme_minimal()+
scale_x_discrete(guide = guide_axis(n.dodge = 3))
)
b = b + 1
}
OK, I could not really understand why they are different, so we need the computer to make this decision for me. All credit to https://towardsdatascience.com/hierarchical-clustering-on-categorical-data-in-r-a27e578f2995
# Cluster stats comes out as list while it is more convenient to look at it as a table
# This code below will produce a dataframe with observations in columns and variables in row
# Not quite tidy data, which will require a tweak for plotting, but I prefer this view as an output here as I find it more comprehensive
library(fpc)
cstats.table <- function(dist, tree, k) {
clust.assess <- c("cluster.number","n","within.cluster.ss","average.within","average.between",
"wb.ratio","dunn2","avg.silwidth")
clust.size <- c("cluster.size")
stats.names <- c()
row.clust <- c()
output.stats <- matrix(ncol = k, nrow = length(clust.assess))
cluster.sizes <- matrix(ncol = k, nrow = k)
for(i in c(1:k)){
row.clust[i] <- paste("Cluster-", i, " size")
}
for(i in c(2:k)){
stats.names[i] <- paste("Test", i-1)
for(j in seq_along(clust.assess)){
output.stats[j, i] <- unlist(cluster.stats(d = dist, clustering = cutree(tree, k = i))[clust.assess])[j]
}
for(d in 1:k) {
cluster.sizes[d, i] <- unlist(cluster.stats(d = dist, clustering = cutree(tree, k = i))[clust.size])[d]
dim(cluster.sizes[d, i]) <- c(length(cluster.sizes[i]), 1)
cluster.sizes[d, i]
}
}
output.stats.df <- data.frame(output.stats)
cluster.sizes <- data.frame(cluster.sizes)
cluster.sizes[is.na(cluster.sizes)] <- 0
rows.all <- c(clust.assess, row.clust)
# rownames(output.stats.df) <- clust.assess
output <- rbind(output.stats.df, cluster.sizes)[ ,-1]
colnames(output) <- stats.names[2:k]
rownames(output) <- rows.all
is.num <- sapply(output, is.numeric)
output[is.num] <- lapply(output[is.num], round, 2)
output
}
# I am capping the maximum number of clusters at 7
# I want to choose a reasonable number, based on which I will be able to see basic differences between customer groups as a result
stats.df.divisive <- cstats.table(gower.dist, divisive.clust, 7)
stats.df.divisive
## Test 1 Test 2 Test 3 Test 4 Test 5 Test 6
## cluster.number 2.00 3.00 4.00 5.00 6.00 7.00
## n 5048.00 5048.00 5048.00 5048.00 5048.00 5048.00
## within.cluster.ss 780.28 732.45 643.05 628.96 626.63 592.96
## average.within 0.53 0.51 0.48 0.48 0.48 0.46
## average.between 0.71 0.70 0.71 0.71 0.71 0.71
## wb.ratio 0.75 0.73 0.68 0.67 0.67 0.66
## dunn2 1.13 0.98 1.08 1.09 1.10 1.16
## avg.silwidth 0.25 0.18 0.21 0.21 0.21 0.22
## Cluster- 1 size 2934.00 2934.00 2934.00 2934.00 2934.00 2934.00
## Cluster- 2 size 2114.00 1679.00 1342.00 1299.00 1293.00 1094.00
## Cluster- 3 size 0.00 435.00 435.00 435.00 435.00 435.00
## Cluster- 4 size 0.00 0.00 337.00 337.00 337.00 337.00
## Cluster- 5 size 0.00 0.00 0.00 43.00 43.00 199.00
## Cluster- 6 size 0.00 0.00 0.00 0.00 6.00 43.00
## Cluster- 7 size 0.00 0.00 0.00 0.00 0.00 6.00
stats.df.aggl <- cstats.table(gower.dist, aggl.clust.c, 7)
stats.df.aggl
## Test 1 Test 2 Test 3 Test 4 Test 5 Test 6
## cluster.number 2.00 3.00 4.00 5.00 6.00 7.00
## n 5048.00 5048.00 5048.00 5048.00 5048.00 5048.00
## within.cluster.ss 1001.66 981.15 924.83 920.76 897.20 864.59
## average.within 0.60 0.59 0.57 0.57 0.56 0.55
## average.between 0.67 0.69 0.64 0.64 0.65 0.65
## wb.ratio 0.90 0.87 0.90 0.89 0.87 0.85
## dunn2 1.11 1.11 1.01 0.98 0.98 0.98
## avg.silwidth 0.10 0.07 -0.01 -0.01 -0.02 -0.02
## Cluster- 1 size 4913.00 4785.00 4259.00 4234.00 4124.00 3826.00
## Cluster- 2 size 135.00 128.00 526.00 25.00 110.00 298.00
## Cluster- 3 size 0.00 135.00 128.00 526.00 25.00 110.00
## Cluster- 4 size 0.00 0.00 135.00 128.00 526.00 25.00
## Cluster- 5 size 0.00 0.00 0.00 135.00 128.00 526.00
## Cluster- 6 size 0.00 0.00 0.00 0.00 135.00 128.00
## Cluster- 7 size 0.00 0.00 0.00 0.00 0.00 135.00
and plot them
stats.df.divisive["method",] = "divi"
stats.df.aggl["method",] = "aggl"
df_elbow = as.data.frame(rbind(t(stats.df.divisive),t(stats.df.aggl)))
ggplot(data = df_elbow, aes(x=cluster.number, y=within.cluster.ss, group = method)) +
geom_point()+
geom_line(aes(color = method))+
ggtitle("clustering") +
labs(x = "Num.of clusters", y = "Within clusters sum of squares (SS)") +
theme(plot.title = element_text(hjust = 0.5))
According to our picture there is no elbow (probably because it happens later, but I could not describe more than 7 clusters).
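A side note on the elbow data: adding a "method" row to the statistics tables turns every column into character, so after transposing, the axes are treated as discrete. A sketch of an alternative that keeps the columns numeric, using the within-cluster SS values printed above and assuming the first table is the divisive result and the second the agglomerative one (the names `divi`, `aggl` and `df_elbow_num` are illustrative):

```r
# Values copied from the two printed tables (k = 2..7)
divi <- data.frame(
  cluster.number    = 2:7,
  within.cluster.ss = c(780.28, 732.45, 643.05, 628.96, 626.63, 592.96),
  method            = "divi"
)
aggl <- data.frame(
  cluster.number    = 2:7,
  within.cluster.ss = c(1001.66, 981.15, 924.83, 920.76, 897.20, 864.59),
  method            = "aggl"
)

# One long table: numeric axes stay numeric, method is a separate column
df_elbow_num <- rbind(divi, aggl)
str(df_elbow_num)
```

A data frame in this shape can be passed to the same `ggplot()` call, with `cluster.number` and `within.cluster.ss` plotted on continuous scales.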
ggplot(data = df_elbow, aes(x=cluster.number, y=avg.silwidth, group = method)) +
geom_point()+
geom_line(aes(color = method))+
ggtitle(" clustering") +
labs(x = "Num.of clusters", y = "Average silhouette width") +
theme(plot.title = element_text(hjust = 0.5))
We can see it here, but I could not run that test script one more time, or my PC would explode. 7 clusters is fine, just fine. Divisive clustering is better on all metrics.
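For reference, the average silhouette width for a given cut can also be computed directly with `silhouette()` from the cluster package, without rebuilding the whole statistics table. A minimal sketch on simulated toy data (the data, the seed and the helper `sil_for_k` are assumptions for illustration, not part of the original analysis):

```r
library(cluster)

set.seed(1)
# Toy mixed data: one factor and one numeric column, two well-separated groups
toy <- data.frame(
  g = factor(rep(c("a", "b"), each = 10)),
  x = c(rnorm(10, 0), rnorm(10, 5))
)

d    <- daisy(toy, metric = "gower")
tree <- hclust(d, method = "complete")

# Average silhouette width for a cut of the tree into k clusters
sil_for_k <- function(k) {
  sil <- silhouette(cutree(tree, k = k), d)
  mean(sil[, "sil_width"])
}

sil_widths <- sapply(2:4, sil_for_k)
sil_widths
```

Values close to 1 indicate compact, well-separated clusters, while values near 0 or below indicate overlapping ones, which is how the two methods are compared above.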
clus7 = cutree(divisive.clust, 7)
Use_R$cluster = clus7
df_cluster$cluster = clus7
names_text = names(df_cluster[,c(-1,-6,-5)])
b = 1
for( i in names(df_cluster[,c(-1,-6,-5)] )){
print(
Use_R %>%
ggplot(aes_string(x = i))+
geom_bar(aes(fill = as.factor(cluster)), color = "black", stat="count", position = "dodge")+
ggtitle(names_text[b])+
geom_text(aes( label = paste0(round((..count..)/sum(..count..)*100), "%"),
y= (..count..)/sum(..count..)), stat= "count", vjust = -.5)+
theme_minimal()+
scale_x_discrete(guide = guide_axis(n.dodge = 3))
)
b = b + 1
}
df_cluster %>%
ggplot(aes(YearsCode, fill = as.factor(cluster)))+
geom_histogram()+
theme_minimal()+
facet_grid(. ~ cluster)
## Warning: Removed 112 rows containing non-finite values (stat_bin).
1.8 Cluster names
cluster 1: mostly developers, or people who develop as part of their work; they have enough time for hobbies and are definitely not students. They certainly work, mostly from an office.
cluster 2: students, or people who have just graduated
clusters 3-5: office workers, but development is not their primary job
clusters 6-7: mostly noise, less than 1% of the data
df_cluster$JobFactors = as.factor(df_cluster$JobFactors)
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
df_cluster %>% group_by(cluster) %>% summarise(mode = Mode(JobFactors))
## # A tibble: 7 x 2
## cluster mode
## <int> <fct>
## 1 1 Languages, frameworks, and other technologies I'd be working with;Off…
## 2 2 <NA>
## 3 3 Office environment or company culture;Opportunities for professional …
## 4 4 Office environment or company culture;Opportunities for professional …
## 5 5 Office environment or company culture;Opportunities for professional …
## 6 6 Languages, frameworks, and other technologies I'd be working with;Off…
## 7 7 Industry that I'd be working in;Financial performance or funding stat…
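The mode for cluster 2 is NA because missing answers are counted like any other value. A sketch of a variant that ignores missing values (`Mode_no_na` is a hypothetical helper, not part of the original code); it converts to character first, because `is.na()` does not flag entries of a factor whose NA was turned into a level by `addNA()`:

```r
# Most frequent non-missing value; returns NA if everything is missing
Mode_no_na <- function(x) {
  x <- as.character(x)        # an NA level becomes a real NA again
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA_character_)
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Made-up answers: NA is the most common entry but is skipped
x <- c(NA, NA, NA, "Remote work options", "Remote work options", "Office environment")
Mode_no_na(x)
```

It could be used in the same `summarise()` call in place of `Mode`, which would show the most common actual answer in each cluster.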