First, subset the data to the random-sample portion using the POPULATION variable:

rand_data <- filter(data, POPULATION=="github")

Complete versus partially complete surveys:

status_data <- summarise(group_by(rand_data, STATUS), n()); status_data
## # A tibble: 2 x 2
##     STATUS `n()`
##     <fctr> <int>
## 1 Complete  3302
## 2  Partial  2193
ggplot(status_data, aes(STATUS,`n()`)) + 
  geom_col(aes(fill=STATUS), width=.5) +
  labs(x = 'Status of Survey',
       y = 'Number of Surveys')

We can see that nearly 40% of the survey responses are incomplete; however, our sample of complete cases is still large and usable. We can also use some of these insights to judge whether the data are missing at random (MAR) or missing not at random (MNAR). I rule out missing completely at random (MCAR) since this data was published and monitored carefully by those who conducted the survey.
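As a quick check of the "nearly 40%" figure, the shares can be recomputed from the two counts printed in the status table. This is a minimal sketch; `status_counts` and `status_pct` are names introduced here for illustration.

```r
# Counts copied from the status table above.
status_counts <- c(Complete = 3302, Partial = 2193)

# Share of each status as a percentage of all random-sample responses.
status_pct <- round(100 * prop.table(status_counts), 1)
status_pct
# Complete 60.1, Partial 39.9
```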

Summary of gender specification on surveys:

gender_data <- summarise(group_by(rand_data, GENDER), n())
gender_names <- as.vector(gender_data$GENDER); gender_data
## # A tibble: 5 x 2
##                 GENDER `n()`
##                 <fctr> <int>
## 1                       2216
## 2                  Man  2982
## 3 Non-binary  or Other    36
## 4    Prefer not to say   154
## 5                Woman   107
ggplot(gender_data, aes(GENDER, `n()`)) +
  geom_col(aes(fill=GENDER), width=.5) +
  labs(y = 'Number of Respondents') +
  scale_x_discrete(name='Gender Identity Specified',
                   breaks=gender_names,
                   labels=c('Missing', 'Male', 'Non-Binary/\nOther', 'Prefer not\n to say', 'Female'))+
  scale_fill_discrete(name='Gender', 
                    breaks=gender_names,
                    labels = c('Missing', 'Male', 'Non-Binary/Other', 'Prefer not to say', 'Female'))

Men seem to dominate the pool of respondents in this case. Also note the large number of missing and “prefer not to say” observations. I infer that the missing data is due to sensitivity to this question and can be classified as MAR.
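One way to probe this inference is to check whether blank answers concentrate in partially completed surveys; the 2,216 blank gender values are close to the 2,193 partial responses, so the two may overlap heavily. The sketch below is only an illustration: `missing_by` is a hypothetical helper, and the `toy` rows are invented to show the call, not taken from the survey.

```r
# Cross-tabulate whether `column` is blank (or NA) against a grouping variable.
missing_by <- function(df, column, by) {
  is_missing <- is.na(df[[column]]) | df[[column]] == ""
  table(df[[by]],
        factor(is_missing, levels = c(FALSE, TRUE),
               labels = c("answered", "missing")))
}

# Toy rows, invented purely to demonstrate the helper; NOT survey data.
toy <- data.frame(
  STATUS = c("Complete", "Complete", "Complete", "Partial", "Partial", "Partial"),
  GENDER = c("Man", "Woman", "", "", "", "Man"),
  stringsAsFactors = FALSE
)

tab <- missing_by(toy, "GENDER", "STATUS")
prop.table(tab, margin = 1)  # share answered vs. missing within each status
```

On the real data the call would be `missing_by(rand_data, "GENDER", "STATUS")`. If nearly all blanks fall under Partial, the missingness may have more to do with where respondents stopped than with the question itself.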

Summary of Employment Status:

employ_data <- summarise(group_by(rand_data, EMPLOYMENT.STATUS), n())
employ_names <- as.vector(employ_data$EMPLOYMENT.STATUS); employ_data
## # A tibble: 7 x 2
##                                             EMPLOYMENT.STATUS `n()`
##                                                        <fctr> <int>
## 1                                                               412
## 2                                          Employed full time  3259
## 3                                          Employed part time   320
## 4                                           Full time student   995
## 5                                     Other - please describe   164
## 6 Retired or permanently not working (e.g. due to disability)    62
## 7                                     Temporarily not working   283
ggplot(employ_data, aes(EMPLOYMENT.STATUS, `n()`)) +
  geom_col(aes(fill=EMPLOYMENT.STATUS), width=.5) +
  labs(y = 'Number of Respondents') +
  scale_x_discrete(name='Type of Employment Specified',
                   breaks=employ_names, 
                   labels=c('Missing', 'Full Time', 'Part Time', 'F/T Student', 'Other\n(describe)', 'Retired', 'Unemployed')) +
  scale_fill_discrete(name='Employment', 
                      breaks=employ_names,
                      labels=c('Missing', 'Full Time', 'Part Time', 'F/T Student', 'Other (describe)', 'Retired', 'Unemployed'))

The majority of respondents are employed full time, and many others are full-time students. I find this unsurprising, since contributing heavily to open source projects is usually either a responsibility of those working at a company or a task students take on to get their changes published.
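The plots above relabel categories by position, which relies on the label vectors staying in the same order as the factor levels. A possible alternative, sketched here with the counts from the employment table, is to pass a named vector so each label is matched to its level by name. The names `employ_counts` and `employ_labels` are introduced for illustration, and recoding the blank level to "Missing" first is my own assumption about how to handle it.

```r
library(ggplot2)

# Counts copied from the employment table above.
employ_counts <- data.frame(
  status = c("", "Employed full time", "Employed part time",
             "Full time student", "Other - please describe",
             "Retired or permanently not working (e.g. due to disability)",
             "Temporarily not working"),
  n = c(412, 3259, 320, 995, 164, 62, 283),
  stringsAsFactors = FALSE
)

# Give the blank level an explicit name so it can serve as a lookup key.
employ_counts$status[employ_counts$status == ""] <- "Missing"

# Named vector: original level -> display label (labels as used in the plot above).
employ_labels <- c(
  "Missing" = "Missing",
  "Employed full time" = "Full Time",
  "Employed part time" = "Part Time",
  "Full time student" = "F/T Student",
  "Other - please describe" = "Other (describe)",
  "Retired or permanently not working (e.g. due to disability)" = "Retired",
  "Temporarily not working" = "Unemployed"
)

p <- ggplot(employ_counts, aes(status, n)) +
  geom_col(aes(fill = status), width = .5) +
  labs(y = "Number of Respondents") +
  scale_x_discrete(name = "Type of Employment Specified", labels = employ_labels) +
  scale_fill_discrete(name = "Employment", labels = employ_labels)
p
```

With named labels, adding or reordering a level should not silently attach the wrong label to a bar.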

Summary of Education Information:

edu_data <- summarise(group_by(rand_data, FORMAL.EDUCATION), n())
edu_names <- as.vector(edu_data$FORMAL.EDUCATION); edu_data
## # A tibble: 8 x 2
##                                               FORMAL.EDUCATION `n()`
##                                                         <fctr> <int>
## 1                                                               2238
## 2                                            Bachelor's degree  1201
## 3 Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.)   154
## 4                            Less than secondary (high) school   119
## 5                                              Master's degree   730
## 6               Secondary (high) school graduate or equivalent   356
## 7                                      Some college, no degree   585
## 8                   Vocational/trade program or apprenticeship   112
ggplot(edu_data, aes(FORMAL.EDUCATION, `n()`)) +
  geom_col(aes(fill=FORMAL.EDUCATION), width=.5) +
  labs(y = 'Number of Respondents') +
  theme(
    axis.text.x = element_text(size = 6),
    legend.text = element_text(size = 6)) +
  scale_x_discrete(name='Highest Education Specified',
                   breaks=edu_names, 
                   labels=c('Missing', 'Bachelor\'s', 'Ph.D or\nHigher', 'Less Than\nSecondary\nEducation', 'Master\'s', 'High School\n Diploma', 'Some\nCollege', 'Vocational/\nTrade School/\nCertification')) +
  scale_fill_discrete(name='Education', 
                      breaks=edu_names,
                      labels=c('Missing', 'Bachelor\'s', 'Ph.D or Higher', 'Less Than\nSecondary Education', 'Master\'s', 'High School Diploma', 'Some College', 'Vocational/\nTrade School/\nCertification'))

A Bachelor’s level education seems to be the standard for those in the software development world. However, many students and Master’s level graduates are also busy contributing to code. These results also fall in line with my personal experience in development; many of my colleagues are college grads or currently enrolled.
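To put rough numbers on this, the shares can be computed from the education table, here restricted to respondents who answered the question. Excluding the 2,238 blank responses is my own assumption, and the names `edu_counts`, `edu_pct` and `degree_share` are introduced for illustration.

```r
# Counts copied from the education table above (blank responses excluded;
# the doctorate label is shortened here for readability).
edu_counts <- c(
  "Bachelor's degree" = 1201,
  "Doctorate (Ph.D.) or other advanced degree" = 154,
  "Less than secondary (high) school" = 119,
  "Master's degree" = 730,
  "Secondary (high) school graduate or equivalent" = 356,
  "Some college, no degree" = 585,
  "Vocational/trade program or apprenticeship" = 112
)

# Percentage of each level among those who answered, largest first.
edu_pct <- round(100 * edu_counts / sum(edu_counts), 1)
sort(edu_pct, decreasing = TRUE)

# Combined share holding at least a Bachelor's degree.
degree_levels <- c("Bachelor's degree", "Master's degree",
                   "Doctorate (Ph.D.) or other advanced degree")
degree_share <- round(100 * sum(edu_counts[degree_levels]) / sum(edu_counts), 1)
degree_share
```

Among those who answered, Bachelor's degrees are the single largest group, and degree holders together make up a little under two thirds.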

Summary of Contribution Data:

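# Keep only the contribution-type columns, then attach gender under an explicit name
contrib_data_subset <- rand_data[, grepl('CONTRIBUTOR.TYPE', colnames(rand_data), fixed = TRUE)]
contrib_data <- cbind(contrib_data_subset, GENDER = rand_data$GENDER)
# Drop respondents with no gender recorded, then mark remaining blanks as NA
contrib_data_missing <- contrib_data[contrib_data$GENDER != "",]
contrib_data_missing[contrib_data_missing == ""] <- NA
# Keep only respondents who answered every contribution question
contrib_data_missing <- contrib_data_missing[complete.cases(contrib_data_missing),]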
theme_update(axis.text.x = element_text(size = 6),
             legend.text = element_text(size = 6))
p1 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.CONTRIBUTE.CODE)) + 
  geom_bar(aes(fill=GENDER)) + 
  xlab("Code") + ylab("Count") 
p2 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS)) + 
  geom_bar(aes(fill=GENDER)) + 
  xlab("Documentation") + ylab("Count")
p3 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.COMMUNITY.ADMIN)) + 
  geom_bar(aes(fill=GENDER)) + 
  xlab("Administrative") + ylab("Count")
p4 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.FEATURE.REQUESTS)) + 
  geom_bar(aes(fill=GENDER)) + 
  xlab("Feature Requests") + ylab("Count")
p5 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.FILE.BUGS)) + 
  geom_bar(aes(fill=GENDER)) + 
  xlab("Bug Reports") + ylab("Count")
p6 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE)) + 
  geom_bar(aes(fill=GENDER)) + 
  xlab("Maintains Projects") + ylab("Count")
grid.arrange(p1,p2,p3,p4,p5,p6, ncol=2, top="Frequency of Contribution Type by Gender")

After analyzing the contribution types with gender as a factor, a few key insights stand out:

* Women seem to contribute most frequently to code, project maintenance and bug reports.
* Women seem to be less active in administrative actions and documentation.
* Administrative actions (e.g. mailing lists, event organization) have the highest frequency of “never” across all genders.
* Code, bug reports and maintenance contributions have the highest number of “frequently” responses.
* Overall, bug reports and code contributions are the most common contribution types across genders.
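Because men make up most of the sample, raw counts make it hard to compare how often each gender contributes. One option is to look at response shares within each gender instead. The sketch below uses invented `toy` rows purely to show the calculation; they are not survey data, and only the “Frequently” and “Never” responses mentioned above are used.

```r
# Toy rows, invented purely to demonstrate the calculation; NOT survey data.
toy <- data.frame(
  GENDER = c("Man", "Man", "Man", "Man", "Woman", "Woman"),
  CONTRIBUTOR.TYPE.CONTRIBUTE.CODE = c("Frequently", "Frequently", "Frequently",
                                       "Never", "Frequently", "Never"),
  stringsAsFactors = FALSE
)

# Rows are genders, columns are responses.
counts <- table(toy$GENDER, toy$CONTRIBUTOR.TYPE.CONTRIBUTE.CODE)

# margin = 1 normalises each row, so every gender's shares sum to 1.
shares <- prop.table(counts, margin = 1)
round(shares, 2)
```

On the real data, the same `table()` and `prop.table()` calls could be applied to `contrib_data_missing` one contribution column at a time.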

Conclusion

After reading through the methodology behind The Open Source Survey 2017 and running some basic analysis of the data set, I believe there are great insights to be had. The survey’s website offers many more data visualizations that summarize observations about the respondents more concisely. Even so, in my brief analysis I feel I was able to draw some useful inferences about the population: men seem to dominate the field, at least an undergraduate college education seems to be an industry requirement for many full-time positions, and contributions to open source generally take the form of bug reports and general code changes. As for the survey design, randomly sampling users who contribute to active open source repositories yielded a great sample of developers, although nearly 40% of the respondents submitted only partially completed surveys. Despite this, over 3,000 complete cases were available to analyze. With more time and experience handling non-numeric data, I would love to dig deeper into the survey responses on negative feedback directed at developers, whether it comes from other developers or from employers.