Cluster analysis

Can Mechanical Turk workers be grouped into meaningful clusters by age and years of computer programming experience?

Data cleaning

Remove NAs, invalid ages and years of experience, and entries for which the difference between worker age and years of experience is smaller than 5 (assuming that the minimum age for a person to start programming is 5 years old).

##  years_of_programming_experience      age    
##  Min.   : 0.200                  Min.   :16  
##  1st Qu.: 2.000                  1st Qu.:24  
##  Median : 3.000                  Median :28  
##  Mean   : 5.239                  Mean   :30  
##  3rd Qu.: 6.000                  3rd Qu.:33  
##  Max.   :35.000                  Max.   :71
## number of workers: 2000
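
The cleaning rule described above can be sketched roughly as follows. This is a minimal Python illustration, not the code behind the summary shown here: the records are hypothetical, and the names `is_valid` and `MIN_START_AGE`, as well as the exact definition of an "invalid" value (non-positive age, negative experience), are assumptions.

```python
from typing import Optional

MIN_START_AGE = 5  # from the text: nobody is assumed to start programming before age 5


def is_valid(age: Optional[float], years: Optional[float]) -> bool:
    """Return True when a worker record passes the cleaning rules."""
    if age is None or years is None:   # drop missing values (NA)
        return False
    if age <= 0 or years < 0:          # assumption: what counts as an invalid value
        return False
    # drop records implying the worker started programming before MIN_START_AGE
    return age - years >= MIN_START_AGE


# Hypothetical records: (age, years_of_programming_experience)
workers = [(28, 3.0), (None, 2.0), (20, 18.0), (35, -1.0), (40, 35.0), (16, 0.2)]
clean = [(a, y) for a, y in workers if is_valid(a, y)]
print(clean)  # [(28, 3.0), (40, 35.0), (16, 0.2)]
```

Note that a difference of exactly 5 is kept, since only differences smaller than 5 are removed.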

Plotting and regression curve

## `geom_smooth()` using method = 'gam'

Many people seem to have reported their experience in multiples of 5 years.
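
One rough way to check this impression numerically is to measure what share of the reported values are exact multiples of 5. The Python sketch below uses hypothetical values and a hypothetical helper name, `share_multiples_of`. If whole-year answers were spread evenly, only about one in five would be a multiple of 5, so a much larger share would support the rounding hypothesis.

```python
def share_multiples_of(values, step=5):
    """Fraction of non-zero values that are exact multiples of `step`."""
    nonzero = [v for v in values if v > 0]
    if not nonzero:
        return 0.0
    hits = sum(1 for v in nonzero if v % step == 0)
    return hits / len(nonzero)


# Hypothetical reported years of experience
years = [1, 2, 3, 5, 5, 10, 10, 15, 4, 20]
print(share_multiples_of(years))  # 0.6
```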

Clustering

What happens if we divide the workers into 5 clusters, which is the number of professions?
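
The post does not say which clustering algorithm was used. As an illustration only, the Python sketch below assumes plain k-means (Lloyd's algorithm) on (age, years of experience) pairs. The function name `kmeans`, its parameters, and the toy data are all hypothetical.

```python
import random


def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on 2-D points; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # assignment step: each point goes to its nearest centroid
        labels = [
            min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        # update step: each centroid moves to the mean of its members
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append((sum(m[0] for m in members) / len(members),
                                      sum(m[1] for m in members) / len(members)))
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:              # converged
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical (age, years_of_programming_experience) pairs
points = [(22, 1), (24, 2), (26, 3), (25, 2), (45, 10), (50, 20), (55, 25), (48, 15)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

Changing `k` is all it takes to compare different numbers of clusters on the same data.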

## `geom_smooth()` using method = 'loess'

We can see a lot of overlap between clusters, so let’s try fewer clusters.

## `geom_smooth()` using method = 'loess'

## `geom_smooth()` using method = 'gam'

## `geom_smooth()` using method = 'gam'

Only with two clusters do the circles not overlap. Interpreting that, we would have two large groups of workers: people above 35 years old, with a wide spread of programming experience, and people below 35, concentrated between 1 and 15 years of programming experience.
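
The overlap judged visually above could also be checked numerically. The Python sketch below assumes each cluster is drawn as a circle centred on its centroid and reaching its farthest member, which may differ from how the plots were actually drawn. The helper names and toy clusters are hypothetical.

```python
import math
from itertools import combinations


def enclosing_circle(members):
    """Centroid-centred circle reaching the farthest member (not the minimal circle)."""
    cx = sum(p[0] for p in members) / len(members)
    cy = sum(p[1] for p in members) / len(members)
    radius = max(math.hypot(p[0] - cx, p[1] - cy) for p in members)
    return (cx, cy), radius


def overlapping_pairs(clusters):
    """Return index pairs (i, j) of clusters whose circles intersect."""
    circles = [enclosing_circle(m) for m in clusters]
    pairs = []
    for i, j in combinations(range(len(circles)), 2):
        (c1, r1), (c2, r2) = circles[i], circles[j]
        # two circles intersect when their centres are closer than the sum of radii
        if math.hypot(c1[0] - c2[0], c1[1] - c2[1]) < r1 + r2:
            pairs.append((i, j))
    return pairs


# Hypothetical clusters of (age, years_of_programming_experience) pairs
young = [(22, 1), (24, 2), (26, 3), (25, 2)]
middle = [(30, 5), (46, 14), (38, 9)]
older = [(45, 10), (50, 20), (55, 25), (48, 15)]

print(overlapping_pairs([young, older]))          # []
print(overlapping_pairs([young, middle, older]))  # [(1, 2)]
```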

Future analysis

  • What if we cluster only workers with up to 10 years of experience? The remaining data could either be filtered out or placed in a single bin for 10+ years.
  • Do these clusters map to professions? In other words, can we predict the worker profession based on age and years of programming experience?
  • Would the number of meaningful clusters differ much if I applied SVM or decision tree methods?
  • Are the worker clusters statistically significantly distinct in terms of average accuracy?
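
For the last question, one possible approach (an assumption, not something done in this analysis) is a two-sided permutation test on the difference in mean accuracy between two clusters. In the Python sketch below, the function name `permutation_test` and the accuracy values are hypothetical.

```python
import random


def permutation_test(group_a, group_b, n_permutations=10000, seed=0):
    """Two-sided permutation test for a difference in means; returns a p-value."""
    rng = random.Random(seed)
    observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign accuracies to the two clusters
        diff = abs(sum(pooled[:n_a]) / n_a
                   - sum(pooled[n_a:]) / (len(pooled) - n_a))
        if diff >= observed:
            extreme += 1
    return (extreme + 1) / (n_permutations + 1)  # add-one correction avoids p = 0


# Hypothetical per-worker accuracies in two clusters
cluster_a = [0.90, 0.85, 0.88, 0.92, 0.87, 0.91]
cluster_b = [0.60, 0.65, 0.58, 0.62, 0.66, 0.61]
p_value = permutation_test(cluster_a, cluster_b)
print(p_value)
```

A small p-value would suggest that the gap in average accuracy is unlikely to come from a random split of the workers. With more than two clusters, the pairwise tests would need a multiple-comparison correction.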

Thanks for reading! Is there anything I could improve? Please leave a comment - Christian.

“We are trying to prove ourselves wrong as quickly as possible, because only in that way can we find progress.” Richard P. Feynman