1 Cluster Analysis

1.1 Important Characteristics of Our Data - 0.6 points

## Warning: Missing column names filled in: 'X1' [1]
MainBranch Hobbyist OpenSourcer OpenSource Employment Country Student EdLevel UndergradMajor EduOther OrgSize DevType YearsCode Age1stCode YearsCodePro CareerSat JobSat MgrIdiot MgrMoney MgrWant JobSeek LastHireDate LastInt FizzBuzz JobFactors ResumeUpdate CurrencySymbol CurrencyDesc CompTotal CompFreq ConvertedComp WorkWeekHrs WorkPlan WorkChallenge WorkRemote WorkLoc ImpSyn CodeRev CodeRevHrs UnitTests PurchaseHow PurchaseWhat LanguageWorkedWith LanguageDesireNextYear DatabaseWorkedWith DatabaseDesireNextYear PlatformWorkedWith PlatformDesireNextYear WebFrameWorkedWith WebFrameDesireNextYear MiscTechWorkedWith MiscTechDesireNextYear DevEnviron OpSys Containers BlockchainOrg BlockchainIs BetterLife ITperson OffOn SocialMedia Extraversion ScreenName SOVisit1st SOVisitFreq SOVisitTo SOFindAnswer SOTimeSaved SOHowMuchTime SOAccount SOPartFreq SOJobs EntTeams SOComm WelcomeChange SONewContent Age Gender Trans Sexuality Ethnicity Dependents SurveyLength SurveyEase
I am not primarily a developer, but I write code sometimes as part of my work Yes Never The quality of OSS and closed source software is about the same Employed full-time Canada No Bachelor’s degree (BA, BS, B.Eng., etc.) Mathematics or statistics Taken an online course in programming or software development (e.g. a MOOC);Received on-the-job training in software development;Taught yourself a new language, framework, or tool without taking a formal course NA Data or business analyst;Data scientist or machine learning specialist;Database administrator;Engineer, data 13 15 3 Very satisfied Slightly satisfied Very confident No Yes I am not interested in new job opportunities 1-2 years ago Write any code;Complete a take-home project;Interview with people in senior / management roles No Financial performance or funding status of the company or organization;Opportunities for professional development;How widely used or impactful my work output would be I heard about a job opportunity (from a recruiter, online job posting, etc.) 
CAD Canadian dollar 40000 Monthly 366420 15 There’s no schedule or spec; I work on what seems most important or urgent NA A few days each month Home A little above average No NA Yes, it’s not part of our process but the developers do it on their own Not sure I have little or no influence Java;R;SQL Python;Scala;SQL MongoDB;PostgreSQL PostgreSQL Android;Google Cloud Platform;Linux;Windows Android;Google Cloud Platform;Linux;Windows NA NA Hadoop Hadoop;Pandas;TensorFlow;Unity 3D Android Studio;Eclipse;PyCharm;RStudio;Visual Studio Code Windows I do not use containers Not at all NA No Yes No YouTube In real life (in person) Login 2011 A few times per month or weekly Find answers to specific questions Less than once per week Stack Overflow was slightly faster 60+ minutes Yes I have never participated in Q&A on Stack Overflow No, I knew that Stack Overflow had a job board but have never used or visited it No, and I don’t know what those are No, not really Just as welcome now as I felt last year Tech articles written by other developers;Industry news about technologies you’re interested in;Tech meetups or events in your area;Courses on technologies you’re interested in 28 Man No Straight / Heterosexual East Asian No Too long Neither easy nor difficult
I am a developer by profession Yes Once a month or more often OSS is, on average, of HIGHER quality than proprietary / closed source software Employed full-time India No Master’s degree (MA, MS, M.Eng., MBA, etc.) NA NA 10,000 or more employees Data or business analyst;Data scientist or machine learning specialist;Database administrator;Developer, back-end;Developer, desktop or enterprise applications;Developer, front-end;Developer, full-stack;Developer, game or graphics;Educator 12 20 10 Slightly dissatisfied Slightly dissatisfied Somewhat confident Yes Yes I’m not actively looking, but I am open to new opportunities 3-4 years ago NA No Languages, frameworks, and other technologies I’d be working with;Remote work options;Flex time or a flexible schedule NA INR Indian rupee 950000 Yearly 13293 70 There’s no schedule or spec; I work on what seems most important or urgent NA A few days each month Home Far above average Yes, because I see value in code review 4.0 Yes, it’s part of our process NA NA C#;Go;JavaScript;Python;R;SQL C#;Go;JavaScript;Kotlin;Python;R;SQL Elasticsearch;MongoDB;Microsoft SQL Server;MySQL;SQLite Elasticsearch;MongoDB;Microsoft SQL Server Linux;Windows Android;Linux;Raspberry Pi;Windows Angular/Angular.js;ASP.NET;Django;Express;Flask;jQuery Angular/Angular.js;ASP.NET;Django;Express;Flask;jQuery .NET;Node.js;Pandas;Torch/PyTorch .NET;Node.js;TensorFlow;Torch/PyTorch Android Studio;Eclipse;IPython / Jupyter;Notepad++;RStudio;Vim;Visual Studio;Visual Studio Code Windows NA Not at all Useful for immutable record keeping outside of currency No Yes Yes YouTube Neither Screen Name NA Multiple times per day Find answers to specific questions;Get a sense of belonging to the developer community;Meet other people with similar skills or interests 3-5 times per week They were about the same NA Yes A few times per month or weekly Yes No, and I don’t know what those are Yes, somewhat Somewhat less welcome now than last year Tech articles written by other 
developers;Tech meetups or events in your area NA NA NA NA NA Yes Too long Difficult
I am a student who is learning to code No Never OSS is, on average, of HIGHER quality than proprietary / closed source software Employed part-time Canada Yes, full-time Some college/university study without earning a degree Mathematics or statistics Taken an online course in programming or software development (e.g. a MOOC);Taught yourself a new language, framework, or tool without taking a formal course NA Data or business analyst;Data scientist or machine learning specialist;Engineer, data;Student 5 16 NA NA NA NA NA NA I am not interested in new job opportunities Less than a year ago NA NA Financial performance or funding status of the company or organization;Office environment or company culture;Opportunities for professional development My job status changed (promotion, new job, etc.) NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA Bash/Shell/PowerShell;HTML/CSS;Java;Python;R;SQL Bash/Shell/PowerShell;C++;Go;Python;R;Scala;SQL MySQL;PostgreSQL;SQLite Elasticsearch;MongoDB;MySQL;PostgreSQL AWS;Docker;Google Cloud Platform;Linux;MacOS;Slack;Windows AWS;Linux;MacOS;Slack NA NA Ansible;Chef;Hadoop;Pandas;TensorFlow Ansible;Apache Spark;Chef;Hadoop;Pandas;TensorFlow;Torch/PyTorch IPython / Jupyter;PyCharm;RStudio;Sublime Text;Vim MacOS Testing;Production NA NA Yes Yes Yes Reddit In real life (in person) Username 2014 A few times per month or weekly Find answers to specific questions;Learn how to do things I didn’t necessarily look for 1-2 times per week The other resource was slightly faster 11-30 minutes Not sure / can’t remember NA No, I knew that Stack Overflow had a job board but have never used or visited it Yes Yes, somewhat Just as welcome now as I felt last year Courses on technologies you’re interested in 21 Woman No Straight / Heterosexual Black or of African descent No Appropriate in length Easy
I am not primarily a developer, but I write code sometimes as part of my work Yes Less than once a month but more than once per year The quality of OSS and closed source software is about the same Employed full-time Russian Federation No Master’s degree (MA, MS, M.Eng., MBA, etc.) Computer science, computer engineering, or software engineering Taken an online course in programming or software development (e.g. a MOOC);Taught yourself a new language, framework, or tool without taking a formal course;Contributed to open source software 1,000 to 4,999 employees Data or business analyst 10 18 3 Slightly satisfied Very satisfied Very confident Yes Yes I’m not actively looking, but I am open to new opportunities 3-4 years ago Complete a take-home project;Interview with people in peer roles;Interview with people in senior / management roles No Financial performance or funding status of the company or organization;Opportunities for professional development;How widely used or impactful my work output would be My job status changed (promotion, new job, etc.) 
RUB Russian ruble 120000 Monthly 21996 40 There’s no schedule or spec; I work on what seems most important or urgent Distracting work environment;Non-work commitments (parenting, school work, hobbies, etc.);Not enough people for the workload Less than once per month / Never Office Average Yes, because I see value in code review 0.5 Yes, it’s not part of our process but the developers do it on their own Not sure I have some influence Python;R Python;R MongoDB MongoDB NA NA NA NA NA NA PyCharm;RStudio Linux-based Production NA A passing fad Yes SIGH Yes VK ВКонта́кте In real life (in person) Login I don’t remember Multiple times per day Find answers to specific questions More than 10 times per week Stack Overflow was slightly faster 0-10 minutes Yes I have never participated in Q&A on Stack Overflow No, I knew that Stack Overflow had a job board but have never used or visited it No, and I don’t know what those are No, not really Just as welcome now as I felt last year NA NA Man No Straight / Heterosexual White or of European descent Yes Appropriate in length Neither easy nor difficult
I am not primarily a developer, but I write code sometimes as part of my work No Never OSS is, on average, of HIGHER quality than proprietary / closed source software Employed full-time Lithuania No Master’s degree (MA, MS, M.Eng., MBA, etc.) Information systems, information technology, or system administration Taken an online course in programming or software development (e.g. a MOOC);Taken a part-time in-person course in programming or software development;Taught yourself a new language, framework, or tool without taking a formal course 1,000 to 4,999 employees Database administrator;Designer;Developer, back-end;Developer, embedded applications or devices;Developer, front-end;Developer, full-stack;Developer, mobile;System administrator 8 17 4 Very satisfied Slightly dissatisfied Very confident No I am already a manager I’m not actively looking, but I am open to new opportunities More than 4 years ago Interview with people in peer roles;Interview with people in senior / management roles No Remote work options;How widely used or impactful my work output would be;Flex time or a flexible schedule My job status changed (promotion, new job, etc.) 
EUR European Euro 3000 Monthly 41244 140 There’s no schedule or spec; I work on what seems most important or urgent Lack of support from management;Non-work commitments (parenting, school work, hobbies, etc.);Not enough people for the workload More than half, but not all, the time Office A little above average Yes, because I see value in code review 1.0 No, but I think we should Developers typically have the most influence on purchasing new technology I have a great deal of influence Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;Python;R;SQL Bash/Shell/PowerShell;C#;HTML/CSS;Java;JavaScript;PHP;Python;R;SQL Elasticsearch;MariaDB;MongoDB;Microsoft SQL Server Elasticsearch;MariaDB;MongoDB;Microsoft SQL Server Android;Docker;Windows;WordPress Android;Docker;Windows Angular/Angular.js;ASP.NET;jQuery Angular/Angular.js;ASP.NET;jQuery .NET;Pandas .NET;Pandas;Unity 3D;Xamarin Android Studio;Visual Studio;Visual Studio Code Windows Outside of work, for personal projects Not at all Useful for immutable record keeping outside of currency Yes Also Yes Yes Facebook In real life (in person) Username 2010 A few times per month or weekly Find answers to specific questions;Learn how to do things I didn’t necessarily look for 3-5 times per week Stack Overflow was much faster 11-30 minutes Yes I have never participated in Q&A on Stack Overflow No, I didn’t know that Stack Overflow had a job board No, and I don’t know what those are Neutral Not applicable - I did not use Stack Overflow last year Tech articles written by other developers 38 Man No Straight / Heterosexual White or of European descent Yes Appropriate in length Easy
I am a developer by profession No Less than once a month but more than once per year OSS is, on average, of HIGHER quality than proprietary / closed source software Employed full-time Argentina Yes, full-time Master’s degree (MA, MS, M.Eng., MBA, etc.) A natural science (ex. biology, chemistry, physics) Taken an online course in programming or software development (e.g. a MOOC);Taken a part-time in-person course in programming or software development;Taught yourself a new language, framework, or tool without taking a formal course;Contributed to open source software 10,000 or more employees Academic researcher;Data scientist or machine learning specialist;Scientist;Student 6 16 3 Very satisfied Very satisfied Somewhat confident No Not sure I’m not actively looking, but I am open to new opportunities 1-2 years ago NA No Specific department or team I’d be working on;Office environment or company culture;Flex time or a flexible schedule My job status changed (promotion, new job, etc.) USD United States dollar 700 Monthly 8400 35 There is a schedule and/or spec (made by me or by a colleague), and I follow it very closely Inadequate access to necessary tools;Meetings;Toxic work environment Less than once per month / Never Office A little above average Yes, because I see value in code review 5.0 No, but I think we should Not sure I have little or no influence C++;Python;R R NA NA NA NA NA NA NA NA RStudio Linux-based I do not use containers Not at all NA Yes Yes What? 
WhatsApp In real life (in person) Username 2014 Daily or almost daily Find answers to specific questions;Learn how to do things I didn’t necessarily look for;Contribute to a library of information;Pass the time / relax 3-5 times per week Stack Overflow was much faster 60+ minutes Yes A few times per week Yes No, and I don’t know what those are Yes, somewhat Just as welcome now as I felt last year Tech articles written by other developers;Tech meetups or events in your area;Courses on technologies you’re interested in 25 Man No Straight / Heterosexual Hispanic or Latino/Latina No Appropriate in length Neither easy nor difficult

R users are mostly developers by profession, or at least write code as part of their work. They also usually code as a hobby, and there is no big difference between people who already code in R and people who want to learn it. However, that might be an artifact of the data, because it was pre-filtered.

1.2 Justification of Variables (you may want to use not all the variables) - 1 point

To justify the choice of variables, let's look at what is happening with R right now and collect data from useR! 2019. Unfortunately, I found only the keynotes, not the proceedings, but I do not think this matters much.

## Warning: Calling `as_tibble()` on a vector is discouraged, because the behavior is likely to change in the future. Use `tibble::enframe(name = NULL)` instead.
## This warning is displayed once per session.
value

Joe Cheng
## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 2954 rows [1,
## 8, 15, 22, 29, 36, 43, 50, 57, 64, 71, 78, 85, 92, 99, 106, 113, 120, 127,
## 134, ...].

In 2019, R was mostly about data science, science, statistics, and building applications, with a little about AI and education. So let's try to find the different kinds of R users.

Since there are different reasons to code in R, I will try to capture these reasons with the following variables:

  1. OpenSourcer – because people who work in industry usually cannot contribute to open source
  2. Employment – work matters for any R user, and part-time workers might be especially interesting
  3. Student – just to control for students
  4. YearsCode – because R is not that old
  5. JobFactors – because it answers my question
  6. WorkPlan – cool programmers do not have a schedule

1.3 the distance metric matches variable types - 1 point (if this is incorrect, interpretation will fail)

1.4 k-means

## Warning: Quick-TRANSfer stage steps exceeded maximum (= 185350)
## Warning: did not converge in 10 iterations

WSS and silhouette both suggest that 2 clusters is the optimal number for this data. However, k-means needs numeric input, and converting factors to numbers is not legitimate here, so this result should not be trusted.
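The two criteria mentioned above can be sketched as follows. This is a minimal illustration on made-up numeric data, not the code used for the survey; the object names (`x`, `ks`, `wss`, `avg_sil`) are assumptions. `iter.max` is raised from the `kmeans()` default of 10, which is the limit the convergence warning above refers to.

```r
library(cluster)  # silhouette()

set.seed(42)
# Toy numeric data with two well-separated groups (illustrative only)
x <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2)
)

ks <- 2:6
d <- dist(x)

# Total within-cluster sum of squares for each k (the "elbow" criterion)
wss <- sapply(ks, function(k) {
  kmeans(x, centers = k, nstart = 10, iter.max = 50)$tot.withinss
})

# Average silhouette width for each k
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(x, centers = k, nstart = 10, iter.max = 50)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})

# The k with the widest average silhouette
best_k <- ks[which.max(avg_sil)]
data.frame(k = ks, wss = round(wss, 2), avg_sil = round(avg_sil, 3))
```

Both criteria rely on Euclidean geometry, which is why they are only meaningful when the columns are truly numeric.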

As we can see, our data cannot be plotted properly, because the plotted dimensions do not capture enough of the variance.

So we need to create a proper distance matrix.

I will use the Gower distance, since most of my variables are factors.

## Warning: NAs introduced by coercion
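A Gower distance matrix for mixed factor and numeric columns can be computed with `daisy()` from the `cluster` package. The sketch below uses a made-up four-row data frame whose column names follow the variables chosen above; the values and the object names `toy` and `gower_toy` are assumptions, and the original code is not shown in this report.

```r
library(cluster)  # daisy()

# Hypothetical mini survey: the values are invented for illustration
toy <- data.frame(
  OpenSourcer = factor(c("Never", "Never", "Once a month or more often", "Never")),
  Employment  = factor(c("Employed full-time", "Employed full-time",
                         "Employed part-time", "Employed full-time")),
  Student     = factor(c("No", "No", "Yes, full-time", "No")),
  YearsCode   = c(13, 12, 5, 10)
)

# Factors contribute 0 (match) or 1 (mismatch); numeric columns contribute
# the absolute difference divided by the column range
gower_toy <- daisy(toy, metric = "gower")
m <- as.matrix(gower_toy)
round(m, 2)
```

Rows 1 and 2 differ only slightly in `YearsCode`, so their distance is close to 0, while rows 1 and 3 differ in every column and sit at the maximum distance of 1.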

1.7 Visualization

It looks like we have around 12 clusters, which is not good but not really bad either. Let's cut the tree and look at them.

Let's put these clusters back into the data

and draw all the graphs.
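Cutting a divisive tree and attaching the labels to the data can be sketched like this. The data frame is made up, `k = 2` is chosen only to keep the toy example readable, and `as.hclust()` is used here as a cautious way to hand the `diana()` result to `cutree()`; none of this is claimed to be the original code.

```r
library(cluster)  # daisy(), diana()

# Hypothetical data: two obvious groups (workers vs. students)
toy <- data.frame(
  Employment = factor(c("Full-time", "Full-time", "Part-time",
                        "Part-time", "Full-time", "Part-time")),
  Student    = factor(c("No", "No", "Yes", "Yes", "No", "Yes")),
  YearsCode  = c(13, 12, 5, 4, 10, 6)
)

gower_toy <- daisy(toy, metric = "gower")
div_tree  <- diana(gower_toy)  # divisive hierarchical clustering

# Cut the tree into k groups and store the label next to each respondent
toy$cluster <- cutree(as.hclust(div_tree), k = 2)
table(toy$cluster, toy$Student)
```

With the cluster label stored as a column, every per-cluster plot or summary becomes an ordinary grouped operation.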

OK, I could not really understand why they are different, so we need the computer to make this decision for me. All credit goes to https://towardsdatascience.com/hierarchical-clustering-on-categorical-data-in-r-a27e578f2995

# cluster.stats() returns a list, but a table is more convenient to read.
# The function below builds a data frame with one column per tested number of
# clusters and one row per statistic. It is not quite tidy and needs a tweak
# before plotting, but I find this view more comprehensive as an output.
library(fpc)
cstats.table <- function(dist, tree, k) {
  clust.assess <- c("cluster.number", "n", "within.cluster.ss", "average.within",
                    "average.between", "wb.ratio", "dunn2", "avg.silwidth")
  clust.size <- c("cluster.size")
  stats.names <- c()
  row.clust <- c()
  output.stats <- matrix(ncol = k, nrow = length(clust.assess))
  cluster.sizes <- matrix(ncol = k, nrow = k)
  for (i in 1:k) {
    row.clust[i] <- paste("Cluster-", i, " size")
  }
  for (i in 2:k) {
    stats.names[i] <- paste("Test", i - 1)
    # Compute the statistics once per cut instead of once per table cell
    stats <- cluster.stats(d = dist, clustering = cutree(tree, k = i))
    output.stats[, i] <- unlist(stats[clust.assess])[seq_along(clust.assess)]
    # Sizes of clusters beyond i are NA here and are set to 0 below
    cluster.sizes[, i] <- unlist(stats[clust.size])[1:k]
  }
  output.stats.df <- data.frame(output.stats)
  cluster.sizes <- data.frame(cluster.sizes)
  cluster.sizes[is.na(cluster.sizes)] <- 0
  rows.all <- c(clust.assess, row.clust)
  # Column 1 (k = 1) is never filled, so drop it
  output <- rbind(output.stats.df, cluster.sizes)[, -1]
  colnames(output) <- stats.names[2:k]
  rownames(output) <- rows.all
  is.num <- sapply(output, is.numeric)
  output[is.num] <- lapply(output[is.num], round, 2)
  output
}
# I cap the maximum number of clusters at 7.
# I want a reasonable number that still lets me see the basic differences between the groups.
stats.df.divisive <- cstats.table(gower.dist, divisive.clust, 7)
stats.df.divisive
##                    Test 1  Test 2  Test 3  Test 4  Test 5  Test 6
## cluster.number       2.00    3.00    4.00    5.00    6.00    7.00
## n                 5048.00 5048.00 5048.00 5048.00 5048.00 5048.00
## within.cluster.ss  780.28  732.45  643.05  628.96  626.63  592.96
## average.within       0.53    0.51    0.48    0.48    0.48    0.46
## average.between      0.71    0.70    0.71    0.71    0.71    0.71
## wb.ratio             0.75    0.73    0.68    0.67    0.67    0.66
## dunn2                1.13    0.98    1.08    1.09    1.10    1.16
## avg.silwidth         0.25    0.18    0.21    0.21    0.21    0.22
## Cluster- 1  size  2934.00 2934.00 2934.00 2934.00 2934.00 2934.00
## Cluster- 2  size  2114.00 1679.00 1342.00 1299.00 1293.00 1094.00
## Cluster- 3  size     0.00  435.00  435.00  435.00  435.00  435.00
## Cluster- 4  size     0.00    0.00  337.00  337.00  337.00  337.00
## Cluster- 5  size     0.00    0.00    0.00   43.00   43.00  199.00
## Cluster- 6  size     0.00    0.00    0.00    0.00    6.00   43.00
## Cluster- 7  size     0.00    0.00    0.00    0.00    0.00    6.00
##                    Test 1  Test 2  Test 3  Test 4  Test 5  Test 6
## cluster.number       2.00    3.00    4.00    5.00    6.00    7.00
## n                 5048.00 5048.00 5048.00 5048.00 5048.00 5048.00
## within.cluster.ss 1001.66  981.15  924.83  920.76  897.20  864.59
## average.within       0.60    0.59    0.57    0.57    0.56    0.55
## average.between      0.67    0.69    0.64    0.64    0.65    0.65
## wb.ratio             0.90    0.87    0.90    0.89    0.87    0.85
## dunn2                1.11    1.11    1.01    0.98    0.98    0.98
## avg.silwidth         0.10    0.07   -0.01   -0.01   -0.02   -0.02
## Cluster- 1  size  4913.00 4785.00 4259.00 4234.00 4124.00 3826.00
## Cluster- 2  size   135.00  128.00  526.00   25.00  110.00  298.00
## Cluster- 3  size     0.00  135.00  128.00  526.00   25.00  110.00
## Cluster- 4  size     0.00    0.00  135.00  128.00  526.00   25.00
## Cluster- 5  size     0.00    0.00    0.00  135.00  128.00  526.00
## Cluster- 6  size     0.00    0.00    0.00    0.00  135.00  128.00
## Cluster- 7  size     0.00    0.00    0.00    0.00    0.00  135.00

and plot them

According to the plot there is no elbow (probably because it happens later, but I could not describe more than 7 clusters).
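One way to look for an elbow without the picture is to check how much the within-cluster sum of squares falls with each extra cluster. The sketch below copies the `within.cluster.ss` row from the first table above; the variable names `k`, `wss`, and `gain` are assumptions made for this illustration.

```r
# within.cluster.ss values copied from the first table above (k = 2..7)
k   <- 2:7
wss <- c(780.28, 732.45, 643.05, 628.96, 626.63, 592.96)

# Decrease in within-cluster SS gained by adding one more cluster
gain <- -diff(wss)
data.frame(from_k = k[-length(k)], to_k = k[-1], gain = round(gain, 2))

# Elbow plot: a clear elbow would show as a sharp bend followed by a flat tail
plot(k, wss, type = "b",
     xlab = "Number of clusters", ylab = "Within-cluster SS")
```

The gains do not shrink steadily from one step to the next, so the curve has no single sharp bend followed by a flat tail within this range.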

We could see it here, but for JK, I could not run that test script one more time; my PC would explode. 7 is fine, just fine. Divisive clustering is better on almost all metrics.

## Warning: Removed 112 rows containing non-finite values (stat_bin).

1.8 Cluster names

  1. Cluster 1 is mostly developers, or people who write code at work. They have enough time for hobbies and are definitely not students. They are certainly employed, mostly working from an office.

  2. Cluster 2 is students or recent graduates.

  3-5. Clusters 3-5 are office workers, but development is not their primary job.

  6-7. Clusters 6-7 are mostly noise: less than 1% of the data.

## # A tibble: 7 x 2
##   cluster mode                                                                  
##     <int> <fct>                                                                 
## 1       1 Languages, frameworks, and other technologies I'd be working with;Off…
## 2       2 <NA>                                                                  
## 3       3 Office environment or company culture;Opportunities for professional …
## 4       4 Office environment or company culture;Opportunities for professional …
## 5       5 Office environment or company culture;Opportunities for professional …
## 6       6 Languages, frameworks, and other technologies I'd be working with;Off…
## 7       7 Industry that I'd be working in;Financial performance or funding stat…
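A per-cluster mode table like the one above can be produced roughly as follows. The original code is not shown (the tibble output suggests dplyr was used), so this is a base R sketch under that caveat; `stat_mode` is a hypothetical helper and the data are made up. Missing values are counted as a category here, which is one way a cluster can end up with `NA` as its mode.

```r
# Hypothetical helper: most frequent value of a vector, counting NA as a value
stat_mode <- function(x) {
  counts <- table(x, useNA = "ifany")
  names(counts)[which.max(counts)]
}

# Made-up data: cluster label plus one categorical answer per respondent
toy <- data.frame(
  cluster = c(1, 1, 1, 2, 2, 2),
  JobFactors = c("Remote work options", "Remote work options",
                 "Office environment or company culture",
                 NA, NA, "Remote work options"),
  stringsAsFactors = FALSE
)

# Apply the helper within each cluster
modes <- tapply(toy$JobFactors, toy$cluster, stat_mode)
result <- data.frame(cluster = as.integer(names(modes)), mode = unname(modes))
result
```

Switching `useNA` to `"no"` would instead report the most common answer among those who responded.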