rand_data <- filter(data, POPULATION=="github")
status_data <- summarise(group_by(rand_data, STATUS), n()); status_data
## # A tibble: 2 x 2
##   STATUS   `n()`
##   <fctr>   <int>
## 1 Complete  3302
## 2 Partial   2193
ggplot(status_data, aes(STATUS, `n()`)) +
  geom_col(aes(fill = STATUS), width = .5) +
  labs(x = 'Status of Survey',
       y = 'Number of Surveys')
We can see that nearly 40% of the survey responses (2,193 of 5,495) are incomplete; however, our sample of complete cases is still large and usable. We can also use some of these insights to judge whether the data are missing at random (MAR) or missing not at random (MNAR). I rule out missing completely at random (MCAR) on the grounds that the survey was published and monitored carefully by those who conducted it.
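The "nearly 40%" figure can be checked directly from the two counts printed above. This is a minimal sketch in base R that re-enters those counts by hand; the names `status_counts` and `status_share` are mine and do not appear in the original analysis.

```r
# Counts copied from the status_data output above
status_counts <- c(Complete = 3302, Partial = 2193)

# Share of each status among all GitHub-sampled responses
status_share <- status_counts / sum(status_counts)

# Express as percentages, rounded to one decimal place
round(100 * status_share, 1)
# Complete  Partial
#     60.1     39.9
```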
gender_data <- summarise(group_by(rand_data, GENDER), n())
gender_names <- as.vector(gender_data$GENDER); gender_data
## # A tibble: 5 x 2
##   GENDER              `n()`
##   <fctr>              <int>
## 1                      2216
## 2 Man                  2982
## 3 Non-binary or Other    36
## 4 Prefer not to say     154
## 5 Woman                 107
ggplot(gender_data, aes(GENDER, `n()`)) +
  geom_col(aes(fill = GENDER), width = .5) +
  labs(y = 'Number of Respondents') +
  scale_x_discrete(name = 'Gender Identity Specified',
                   breaks = gender_names,
                   labels = c('Missing', 'Male', 'Non-Binary/\nOther', 'Prefer not\nto say', 'Female')) +
  scale_fill_discrete(name = 'Gender',
                      breaks = gender_names,
                      labels = c('Missing', 'Male', 'Non-Binary/Other', 'Prefer not to say', 'Female'))
Men make up the large majority of respondents who answered this question (2,982 of 3,279). Note also the large number of missing and “prefer not to say” observations. I infer that the missing data stem from the sensitivity of this question and can be treated as MAR.
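One way to probe the MAR assumption is to check whether a blank GENDER depends on an observed variable such as STATUS. The 2,216 missing gender values are close to the 2,193 partial surveys, which would be consistent with many gaps coming from respondents who stopped early, although the counts alone cannot confirm that. The sketch below uses a small made-up data frame (`toy`) in place of `rand_data`, so its numbers are illustrative only.

```r
# Hypothetical stand-in for rand_data: blank strings mark missing answers
toy <- data.frame(
  STATUS = c("Complete", "Complete", "Complete", "Partial", "Partial", "Partial"),
  GENDER = c("Man", "Woman", "", "", "", "Man"),
  stringsAsFactors = FALSE
)

# Flag rows where GENDER was left blank
toy$gender_missing <- toy$GENDER == ""

# Cross-tabulate missingness against survey status
miss_table <- table(status = toy$STATUS, gender_missing = toy$gender_missing)
miss_table

# Share of missing GENDER within each status (rows sum to 1)
miss_rate <- prop.table(miss_table, margin = 1)[, "TRUE"]
round(miss_rate, 2)
```

If the missing rate on the real data differs sharply between complete and partial responses, missingness is tied to an observed variable, which is what MAR describes. If non-response instead depends on the respondent's gender itself, that would point toward MNAR.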
employ_data <- summarise(group_by(rand_data, EMPLOYMENT.STATUS), n())
employ_names <- as.vector(employ_data$EMPLOYMENT.STATUS); employ_data
## # A tibble: 7 x 2
##   EMPLOYMENT.STATUS                                           `n()`
##   <fctr>                                                      <int>
## 1                                                               412
## 2 Employed full time                                           3259
## 3 Employed part time                                            320
## 4 Full time student                                             995
## 5 Other - please describe                                       164
## 6 Retired or permanently not working (e.g. due to disability)    62
## 7 Temporarily not working                                       283
ggplot(employ_data, aes(EMPLOYMENT.STATUS, `n()`)) +
  geom_col(aes(fill = EMPLOYMENT.STATUS), width = .5) +
  labs(y = 'Number of Respondents') +
  scale_x_discrete(name = 'Type of Employment Specified',
                   breaks = employ_names,
                   labels = c('Missing', 'Full Time', 'Part Time', 'F/T Student', 'Other\n(describe)', 'Retired', 'Unemployed')) +
  scale_fill_discrete(name = 'Employment',
                      breaks = employ_names,
                      labels = c('Missing', 'Full Time', 'Part Time', 'F/T Student', 'Other (describe)', 'Retired', 'Unemployed'))
The majority of respondents are employed full time (3,259 of 5,495), and full-time students form the next largest group (995). I find this unsurprising, since contributing heavily to open source projects is usually either part of the job for those working at a company or a task students take on to get their changes published.
edu_data <- summarise(group_by(rand_data, FORMAL.EDUCATION), n())
edu_names <- as.vector(edu_data$FORMAL.EDUCATION); edu_data
## # A tibble: 8 x 2
##   FORMAL.EDUCATION                                             `n()`
##   <fctr>                                                       <int>
## 1                                                               2238
## 2 Bachelor's degree                                             1201
## 3 Doctorate (Ph.D.) or other advanced degree (e.g. M.D., J.D.)   154
## 4 Less than secondary (high) school                              119
## 5 Master's degree                                                730
## 6 Secondary (high) school graduate or equivalent                 356
## 7 Some college, no degree                                        585
## 8 Vocational/trade program or apprenticeship                     112
ggplot(edu_data, aes(FORMAL.EDUCATION, `n()`)) +
  geom_col(aes(fill = FORMAL.EDUCATION), width = .5) +
  labs(y = 'Number of Respondents') +
  theme(
    axis.text.x = element_text(size = 6),
    legend.text = element_text(size = 6)) +
  scale_x_discrete(name = 'Highest Education Specified',
                   breaks = edu_names,
                   labels = c('Missing', 'Bachelor\'s', 'Ph.D or\nHigher', 'Less Than\nSecondary\nEducation', 'Master\'s', 'High School\nDiploma', 'Some\nCollege', 'Vocational/\nTrade School/\nCertification')) +
  scale_fill_discrete(name = 'Education',
                      breaks = edu_names,
                      labels = c('Missing', 'Bachelor\'s', 'Ph.D or Higher', 'Less Than\nSecondary Education', 'Master\'s', 'High School Diploma', 'Some College', 'Vocational/\nTrade School/\nCertification'))
A bachelor’s-level education seems to be the standard in the software development world: it is the most common answer among those who responded to this question (1,201). However, many students and master’s-level graduates are also busy contributing code. These results fall in line with my personal experience in development; many of my colleagues are college graduates or currently enrolled.
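Since 2,238 respondents left this question blank, the education mix is easier to read as a share of those who answered. A minimal sketch in base R, re-entering the counts from the output above with shortened labels of my own choosing:

```r
# Counts copied from the edu_data output above; labels shortened for readability
edu_counts <- c(
  "Missing" = 2238,
  "Bachelor's" = 1201,
  "Doctorate" = 154,
  "Less than secondary" = 119,
  "Master's" = 730,
  "Secondary" = 356,
  "Some college" = 585,
  "Vocational" = 112
)

# Drop the blank level and compute shares among respondents who answered
answered <- edu_counts[names(edu_counts) != "Missing"]
edu_share <- sort(answered / sum(answered), decreasing = TRUE)

# Percentages, largest group first
round(100 * edu_share, 1)
```

On these counts, a bachelor’s degree accounts for roughly 37% of the 3,257 answers and a master’s degree for roughly 22%.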
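# Keep only the CONTRIBUTOR.TYPE.* columns and attach GENDER by name
contrib_data_subset <- rand_data[, grepl('CONTRIBUTOR.TYPE', colnames(rand_data), fixed = TRUE)]
contrib_data <- cbind(contrib_data_subset, GENDER = rand_data$GENDER)
# Drop rows with no gender, then treat the remaining blank answers as NA
contrib_data_missing <- contrib_data[contrib_data$GENDER != "", ]
contrib_data_missing[contrib_data_missing == ""] <- NA
# Keep only respondents who answered every contribution question
contrib_data_missing <- contrib_data_missing[complete.cases(contrib_data_missing), ]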
theme_update(axis.text.x = element_text(size = 6),
             legend.text = element_text(size = 6))
p1 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.CONTRIBUTE.CODE)) +
  geom_bar(aes(fill = GENDER)) +
  xlab("Code") + ylab("Count")
p2 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.CONTRIBUTE.DOCS)) +
  geom_bar(aes(fill = GENDER)) +
  xlab("Documentation") + ylab("Count")
p3 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.COMMUNITY.ADMIN)) +
  geom_bar(aes(fill = GENDER)) +
  xlab("Administrative") + ylab("Count")
p4 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.FEATURE.REQUESTS)) +
  geom_bar(aes(fill = GENDER)) +
  xlab("Feature Requests") + ylab("Count")
# FILE.BUGS records how often respondents file bug reports
p5 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.FILE.BUGS)) +
  geom_bar(aes(fill = GENDER)) +
  xlab("Bug Reports") + ylab("Count")
p6 <- ggplot(contrib_data_missing, aes(CONTRIBUTOR.TYPE.PROJECT.MAINTENANCE)) +
  geom_bar(aes(fill = GENDER)) +
  xlab("Maintains Projects") + ylab("Count")
grid.arrange(p1, p2, p3, p4, p5, p6, ncol = 2, top = "Frequency of Contribution Type by Gender")
After analyzing the contribution types with gender as a factor, some key insights emerge:
* Women seem to contribute most frequently to code, project maintenance and bug reports (the FILE.BUGS item).
* Women seem to be less active in administrative actions and documentation.
* Administrative actions (e.g. mailing lists and event organization) have the highest frequency of “never” across all genders.
* Code, bug reports and maintenance contributions have the highest number of “frequently” responses.
* Overall, bug reports and code contributions are the most active contribution types across genders.
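Because men far outnumber every other group, raw counts make it hard to compare how often each gender contributes. One alternative is to compute proportions within each gender. The sketch below does this for a single contribution type; the `toy` data frame and its response levels are invented for illustration and stand in for `contrib_data_missing`.

```r
# Hypothetical stand-in for contrib_data_missing (values are made up)
toy <- data.frame(
  GENDER = c("Man", "Man", "Man", "Man", "Woman", "Woman"),
  CONTRIBUTOR.TYPE.CONTRIBUTE.CODE = c("Frequently", "Frequently", "Never",
                                       "Occasionally", "Frequently", "Never"),
  stringsAsFactors = FALSE
)

# Counts of each response level by gender
code_counts <- table(gender = toy$GENDER,
                     response = toy$CONTRIBUTOR.TYPE.CONTRIBUTE.CODE)

# Convert to proportions within each gender (each row sums to 1)
code_share <- prop.table(code_counts, margin = 1)
round(code_share, 2)
```

With ggplot2, a comparable view can be drawn by mapping GENDER to the x axis and the response to `fill`, then passing `position = "fill"` to `geom_bar()` so that each bar is rescaled to proportions.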
After reading through the methodology behind the Open Source Survey 2017 and running some basic analysis of the data set, I believe there are valuable insights to be had. The survey’s website offers many more data visualizations that summarize the respondents more concisely. Even so, my brief analysis supports a few inferences about the population: men dominate the field, at least an undergraduate college education seems to be the norm (and possibly an industry expectation for many full-time positions), and contributions to open source generally take the form of bug reports and general code changes. As for the survey design, randomly sampling users who contribute to active open source repositories yielded a large sample of developers, although nearly 40% of the respondents submitted only partially completed surveys. Despite this, over 3,000 complete cases were available to analyze. With more time and experience handling non-numeric data, I would love to dig deeper into the survey responses about negative feedback that developers receive, whether from other developers or from employers.