You should open the file in a browser other than Firefox (sic!) - Microsoft Edge is fine.
In order to R Notebook HTML file, you have to execute this Rmd file in its directory/folder - not in the root! of repository setwd("....where Rmd is located...") && rmarkdown::render("Edge.Rmd").
The source code for this R Notebook can be found at GitHub: https://github.com/dmpe/r
Description set out in https://docs.google.com/document/d/1fXQBLdWydISskOKhoq8gl5unuwsv7VA3pkKY4IWFS6o/edit.
In short, trying to confirm or rebut following:
Hypothesis 1: “A woman’s tendency to participate actively in the conversation correlates positively with the number of females in the discussion.”
Hypothesis 2: “Higher status participants are more verbose than are lower status participants.”
When finished, report my results using this survey: https://esmt.az1.qualtrics.com/jfe/form/SV_eaOyF0q39J6CVKZ
Type: 1 for annual question, 2 for conversation
Edge does an annual question every year; some examples are “what scientific idea is ready for retirement?” and “What will change everything?” People then write in with their answers. So all of the text is written and asynchronous
What Edge refers to as a conversation can actually be multiple things. Some of these are written essays by a single person, some are transcripts of a speech, and some are transcripts of a conversation (either between two or more guests or an interview).
Limited_Information is described as: equals 1 if we could only find limited information about the person (e.g. they commented in 2013 but we only have their job title from 2012), 0 otherwise
Role: Either author (=1) or commentator (=2)
Female Contributions: the number of times a woman speaks in a specific conversation, it does not always equal the number of unique women in a conversation (see below)
Female Participation: simply female contributors/(number of total contributions); the percentage of comments that are made by a woman
Unique Contributors: Unique Male Contributors + Unique Female Contributors
Unique Male Contributors: the number of unique male contributors
Unique Female Contributors: the number of unique female contributors
Unique Female Participation: the percentage of unique female participants; Unique Female Contributors divided by Unique Contributors
Academic: 1 = the person is in academia, 0 = they are not
Other columns are described in the corresponding PDF file.
You can control the default appearance with options:
options(tibble.print_max = n, tibble.print_min = m): if there are more than n rows, print only the first m rows. Use options(tibble.print_max = Inf) to always show all rows.
options(tibble.width = Inf) will print all columns, regardless of the width of the screen.
From vignette:
#options(tibble.print_max = Inf, tibble.width = Inf)
set.seed(1234)# data wrangling
library(tidyverse)
library(reshape2)
library(scales)
library(tidyr)
# imputation & missingness + other
#library(mice)
#library(mi)
#library(Amelia)
library(VIM)
library(moments)
library(devtools)
# markdown Rmd + table styling
library(kableExtra)
# for plotting - arranging/positioning
library(grid)
library(gridExtra)
# plotting
library(GGally)
library(ggthemes)
library(plotly)
library(visdat)
library(corrplot)edgeDS <- read_csv("edge1.1EditedFixed.csv", na = c("NA", "", " ", "#N/A"), trim_ws = T)vis_dat(edgeDS[,1:70], warn_large_data = F)# print(problems(edgeDS))
# I can either delete or fix row no. 1817 because it is broken mess. I fixed it in csv.
# edgeDS <- edgeDS[-c(1817),]
# Replace 'Not Available' & 'Pending' with NA
edgeDS$PhD_Year[edgeDS$PhD_Year == "Not Available"] <- NA
edgeDS$PhD_Year[edgeDS$PhD_Year == "Pending"] <- NA
# Aha...
edgeDS$Academic[edgeDS$Id == "david_c_geary"] <- 1
#Gender Issues
edgeDS$Male[edgeDS$Id == "benedict_carey"] <- 1
edgeDS$Female[edgeDS$Id == "benedict_carey"] <- 0Number of rows that have some missing values:
missing_data <- edgeDS[!complete.cases(edgeDS),]
nrow(missing_data) [1] 7689
List variables with most missing values: 41 have some missing values
# delete those columns which contain...
mis <- edgeDS[, -grep("dummy|Unique|Contributions|Numerals", colnames(edgeDS))]
arrange(aggr(mis, col=mdc(1:2), numbers=TRUE, sortVars=TRUE, labels=names(mis),
cex.axis = .4, gap=3, prop = F, plot = F)$missings, desc(Count))[1:43, ]And rename variable Academic from 0/1 to more descriptive Non_Academicians/Academicians
cols <- c("Type", "Live", "Role", "TwoAuthors", "Limited_Information",
"Female", "Male", "Academic", "Job_Title_S", "Job_Title_S_num",
"Department_S" ,"Department_S_num" ,"Discipline", "HavePhD",
"AuthorAndCommenter", "PhD_Institution_SR_Bin", "Workplace_SR_Bin",
"SR_Ranking_Dif", "PhD_Institution_US_IR_Bin", "Workplace_US_IR_Bin",
"USA_I_Ranking_Dif", "PhD_Institution_US_Bin", "Workplace_US_Bin",
"USA_Ranking_Dif", "AcademicHierarchyStrict")
edgeDS[cols] <- lapply(edgeDS[cols], factor)Transform two variables from factor to character to give them more meaning (similar to above).
edgeDS$Academic <- recode(edgeDS$Academic,
`1` = "Academicians",
`0` = "Non_Academicians",
.default = "Unknown")
edgeDS$Gender <- recode(edgeDS$Female,
`1` = "Female",
`0` = "Male",
.default = "Unknown")Drop now female & male columns and reorder data frame.
edgeDS <- edgeDS[, !(names(edgeDS) %in% c("Female", "Male"))]
edgeDS <- edgeDS[, c(1,2,3,4,5,185,6:184)]Potentially, transform year (do not evaluate though)
numToChar <- function(x) { ux = unique(x); formatC(ux)[match(x, ux)] }
edgeDS$Year <- year(as.Date(numToChar(edgeDS$Year), "%Y"))What is the structure of dataset?
str(edgeDS[,5:15])Classes 'tbl_df', 'tbl' and 'data.frame': 7975 obs. of 11 variables:
$ ThreadId : num 1 1 1 1 1 1 1 1 1 1 ...
$ Gender : Factor w/ 2 levels "Male","Female": 1 1 2 1 2 1 2 1 2 2 ...
$ Male_Contributions : num 141 141 141 141 141 141 141 141 141 141 ...
$ Female_Contributions : num 36 36 36 36 36 36 36 36 36 36 ...
$ FemaleParticipation : num 0.203 0.203 0.203 0.203 0.203 ...
$ NumberofAuthorContributions: num 0 0 0 0 0 0 0 0 0 0 ...
$ DebateSize : num 177 177 177 177 177 177 177 177 177 177 ...
$ Live : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ UniqueContributors : num 177 177 177 177 177 177 177 177 177 177 ...
$ UniqueMaleContributors : num 141 141 141 141 141 141 141 141 141 141 ...
$ UniqueFemaleContributors : num 36 36 36 36 36 36 36 36 36 36 ...
vis_dat(edgeDS[,1:70], warn_large_data = F)Do not evaluate code below because the output would be too large.
head(edgeDS)
summary(edgeDS) # print summaries of each variableBut first take a broader look on the dataset. tail()/head() will report (i.e. print) only last/first 6 values.
Who has contributed the most in all discussions?
tail(sort(table(edgeDS$Id)))
richard_h_thaler EDGE steven_pinker
196 203 213
napoleon_chagnon sendhil_mullainathan john_brockman
214 326 504
How many unique participants are there?
length(unique(edgeDS$Id)) [1] 728
And overall number of unique values per each variable (printed only first and last 6)?
| Year | 19 |
| Title | 522 |
| Link | 523 |
| Type | 2 |
| ThreadId | 522 |
| Gender | 3 |
| Quote | 608 |
| Apostro | 698 |
| Parenth | 262 |
| OtherP | 558 |
| AllPct | 1899 |
| Number.Characters | 3814 |
And hey, look, that’s interesting. While there are just 522 unique titles, there are 523 unique links. Using count function, I show number of observations (i.e. links) per title.
And indeed, there are two same titles for different EDGE’s conversations. Preface from 2011 and Preface from 2008.
count(edgeDS, Title, Link)[304:305, ]count_numberOfChars_small <- arrange(count(edgeDS, Number.Characters)[, c(2,1)], desc(n))[1:20,]
count_numberOfChars <- arrange(count(edgeDS, Number.Characters)[, c(2,1)], desc(n))Moreover, out of almost 7975 observations, I can see that 30.7% of comments have a lenght of 1.
Another 9.02% of comments have a character lenght of 2.
kable(head(count_numberOfChars_small, 10), "html",
col.names = c("Number of comments...", "that have char lenght of...")) %>%
kable_styling(bootstrap_options = c("hover", "condensed"), full_width = F)| Number of comments… | that have char lenght of… |
|---|---|
| 34 | 32 |
| 32 | 25 |
| 32 | 33 |
| 31 | 26 |
| 31 | 34 |
| 30 | 13 |
| 30 | 31 |
| 30 | 36 |
| 30 | 39 |
| 29 | 20 |
The most common length of a comment is 32 which occurs in 34 comments.
A parallel coordinate plot shows how many times can a specific number of characters be observed. Here are only 20 vertices - “connections”. I let the user figure out what is nicer and more understandable - I like the first one more.
p1 <- ggparcoord(data = count_numberOfChars_small, columns = c(1, 2), scale = "globalminmax") +
scale_y_continuous(breaks = seq(0,60, by=2),
sec.axis = sec_axis(~., name = "Number of characters",
breaks = seq(0,60, by=5))) +
labs(y = "Number of occurrences", x = "",
title = "TOP 20 connections - from # of comments to length of comments") +
scale_fill_ptol() +
theme_minimal() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
geom_segment(mapping = aes(x = 1, y = n, xend = 2, yend= Number.Characters),
inherit.aes = F,
data = count_numberOfChars_small,
arrow=arrow(length=unit(0.2,"cm")))
p2 <- ggparcoord(data = count_numberOfChars_small, columns = c(2, 1), scale = "globalminmax") +
scale_y_continuous(breaks = seq(0,60, by=2),
sec.axis = sec_axis(~., name = "Number of occurrences",
breaks = seq(0,60, by=5))) +
labs(y = "Number of characters", x = "",
title = "TOP 20 connections - from length of comments to # of comments") +
scale_fill_ptol() +
theme_minimal() +
theme(axis.title.x=element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank()) +
geom_segment(mapping = aes(x = 1, y = Number.Characters, xend = 2, yend=n ),
inherit.aes = F,
data = count_numberOfChars_small,
arrow=arrow(length=unit(0.1,"cm")))
grid.arrange(p1, p2, ncol=2)Source 1 / Source 2 / GGPlot2 - geom_segment
Namely, what is the relationship between number of all debaters and a group of academicians over the years? Did it mean that if you participated as an academician in the conversations, the debate size was also increasing?
Let’s find it out :) First, however, prepare the dataset.
edgeDS_academiciansInDebates <- edgeDS %>%
filter(Type == 2) %>% # just "conversations" (not annual questions) written by either 1 or 2 authors
select(Year, ThreadId, DebateSize, Academic) %>%
group_by_all() %>%
tally %>%
spread(Academic, n, fill = 0)
colnames(edgeDS_academiciansInDebates)[4] <- "NumberOf_Non_Academicians"
colnames(edgeDS_academiciansInDebates)[5] <- "NumberOf_Academicians"
colnames(edgeDS_academiciansInDebates)[6] <- "Unknown"| Year | ThreadId | DebateSize | NumberOf_Non_Academicians | NumberOf_Academicians | Unknown |
|---|---|---|---|---|---|
| 1996 | 19 | 7 | 4 | 3 | 0 |
| 1996 | 20 | 6 | 1 | 5 | 0 |
| 1996 | 21 | 13 | 4 | 9 | 0 |
Source for tally & spread code
How does debate size relate to the number of academicians? I make a simple linear regression to find out.
lm_aca_deb <- lm(DebateSize ~ NumberOf_Academicians, data = edgeDS_academiciansInDebates)
summary(lm_aca_deb)
Call:
lm(formula = DebateSize ~ NumberOf_Academicians, data = edgeDS_academiciansInDebates)
Residuals:
Min 1Q Median 3Q Max
-100.767 -4.014 -2.791 -2.014 94.408
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.79110 0.56725 6.683 6.21e-11 ***
NumberOf_Academicians 1.22242 0.02292 53.324 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 12.37 on 503 degrees of freedom
Multiple R-squared: 0.8497, Adjusted R-squared: 0.8494
F-statistic: 2843 on 1 and 503 DF, p-value: < 2.2e-16
The mean debate size will grow by 1.2224209 as the number of academicians increases by 1. If there are zero academicians, the debate size predicts to be 3.7911047. (Here is nice article on how to interpret LM)
Now, summarize number of academicians by year (from all conversations).
year_aca_sum <- edgeDS_academiciansInDebates[,c(1,3,5)] %>%
group_by(Year) %>%
summarise(DebateSize = sum(DebateSize),
NumberOf_Academicians_all = sum(NumberOf_Academicians),
Proportion = NumberOf_Academicians_all/DebateSize)| Year | DebateSize | NumberOf_Academicians_all | Proportion |
|---|---|---|---|
| 1996 | 194 | 123 | 0.6340206 |
| 1997 | 416 | 101 | 0.2427885 |
| 1998 | 250 | 51 | 0.2040000 |
Let’s now look on how different groups contribute in debates.
First, I start with (non-)academicians and hence need the right data.
year_aca_byGroup <- edgeDS_academiciansInDebates[,c(1,3,4,5,6)] %>%
group_by(Year) %>%
summarise(DebateSize = sum(DebateSize),
Academicians = sum(NumberOf_Academicians),
Non_Academicians = sum(NumberOf_Non_Academicians),
Unknown = sum(Unknown))| Year | DebateSize | Academicians | Non_Academicians | Unknown |
|---|---|---|---|---|
| 1996 | 194 | 123 | 71 | 0 |
| 1997 | 416 | 101 | 315 | 0 |
| 1998 | 250 | 51 | 183 | 16 |
| 1999 | 396 | 94 | 276 | 26 |
| 2000 | 250 | 66 | 118 | 66 |
| 2001 | 187 | 68 | 69 | 50 |
Clearly, academicians prevail in majority of debates as the number of outsiders (non-aca + unknown) declined significantly over last few years.
Also, the number of participants in debates does not grow much - the summary statistics are: Min. : 52.0 , 1st Qu.: 107.5 , Median : 228.0 , Mean : 296.2 , 3rd Qu.: 389.5 , Max. :1256.0 . The mean size of conversations is just shy of 300.
year_aca_byGroup.long <- melt(year_aca_byGroup, id.vars = c(1,2))
k1 <- ggplot(data = year_aca_byGroup.long, aes(x = Year, y = value, fill = variable)) +
geom_col(position = position_stack(reverse = TRUE)) +
geom_line(aes(x = Year, y=DebateSize), data=year_aca_byGroup.long) +
scale_y_continuous(breaks = seq(0, 1300, by = 100)) +
ggtitle("Number and type of debaters per year") +
labs(y = "Number of Participants") +
theme_minimal() + scale_fill_ptol() +
theme(legend.position="none")
year_aca_byGroup.long$variable <- factor(year_aca_byGroup.long$variable,
levels = c("Unknown","Non_Academicians","Academicians" ,
ordered = T))
k2 <- ggplot(year_aca_byGroup.long, aes(x = Year, y = value, fill = variable)) +
geom_bar(stat="identity", position="fill") +
scale_y_continuous(breaks = seq(0,1,by=0.1), labels = percent) +
labs(y = "Percent", fill = "Type", title = "Debaters' proportionality per year by background") +
theme_minimal() + # scale_fill_ptol() +
scale_fill_manual(
values=c("Academicians"="#4477AA", "Non_Academicians"="#DDCC77", "Unknown"="#CC6677")) +
scale_colour_manual(
values=c("Academicians"="#4477AA", "Non_Academicians"="#DDCC77", "Unknown"="#CC6677"),
labels=c("Academicians", "Non_Academicians", "Unknown")) +
labs(fill = "Group")
grid.arrange(k1, k2, ncol=2)Source #1 unused / Source #2 position fill / Source #3 for Percent bar plot / Source #4 for position_stack reverse / Source #5 for deleting first legend
The same now with gender.
edgeDS_genderInDebates <- edgeDS %>%
filter(Type == 2) %>% # just "conversations" (not annual questions) written by either 1 or 2 authors
select(Year, ThreadId, DebateSize, Gender) %>%
group_by_all() %>%
tally %>%
spread(Gender, n, fill = 0) %>%
ungroup() %>%
select(Year, Male, Female, `<NA>`) %>%
group_by(Year) %>%
summarise(Male = sum(Male), Female = sum(Female), Unknown = sum(`<NA>`))
edgeDS_genderInDebates.long <- melt(edgeDS_genderInDebates, id.vars = c(1)) #2,3
edgeDS_genderInDebates.long$variable <- factor(edgeDS_genderInDebates.long$variable,
levels = c("Unknown","Female","Male" , ordered = T))
ggplot(edgeDS_genderInDebates.long, aes(x = Year, y = value, fill = variable, label = value)) +
geom_bar(stat="identity", position="fill") +
geom_text(aes(label = value),
size = 4, position = position_fill(vjust = 0.5)) +
scale_y_continuous(breaks = seq(0,1,by=0.05), labels = percent) +
labs(y = "Percent", fill = "Type") +
ggtitle ("Debaters' proportionality per year by Gender",
subtitle = "Number present number of debaters per gender") +
theme_minimal() +
scale_fill_manual(
values=c("Male"="#4477AA", "Female"="#DDCC77", "Unknown"="#CC6677")) +
scale_colour_manual(
values=c("Male"="#4477AA", "Female"="#DDCC77", "Unknown"="#CC6677"),
labels=c("Male", "Female", "Unknown")) +
labs(fill = "Gender")In recent years, after 2010, number of women has increased somewhat.
Hypothesis 1: “A woman’s tendency to participate actively in the conversation correlates positively with the number of females in the discussion.”
I start however first with a broad look on the data. Hence, I create data that can be used for later (prints only 5 columns).
edgeDS_conversations <- edgeDS %>%
filter(Type == 2) %>%
select(-Link, -Title, -Order, -Text,-Limited_Information, -starts_with("dummy")) %>%
select(-(62:144))| Year | Type | ThreadId | Gender | Male_Contributions |
|---|---|---|---|---|
| 1996 | 2 | 19 | Male | 7 |
| 1996 | 2 | 19 | Male | 7 |
| 1996 | 2 | 19 | Male | 7 |
| 1996 | 2 | 19 | Male | 7 |
| 1996 | 2 | 19 | Male | 7 |
| 1996 | 2 | 19 | Male | 7 |
What is number of contributions (~ # of comments) and unique people who contribute by gender per year.
edgeDS_conversations_q1 <- edgeDS_conversations %>%
select(Year, ThreadId, Male_Contributions, Female_Contributions, UniqueMaleContributors, UniqueFemaleContributors) %>%
group_by_all() %>%
distinct(.keep_all = TRUE) %>%
group_by(Year) %>%
summarise(Total_Male_Contributions = sum(Male_Contributions),
Total_Female_Contributions = sum(Female_Contributions),
Total_Contributions = Total_Male_Contributions + Total_Female_Contributions,
Total_UniqueMaleContributors = sum(UniqueMaleContributors),
Total_UniqueFemaleContributors = sum(UniqueFemaleContributors),
Total_Unique_Contributions = Total_UniqueMaleContributors + Total_UniqueFemaleContributors)| Year | Total_Male_Contributions | Total_Female_Contributions | Total_Contributions | Total_UniqueMaleContributors | Total_UniqueFemaleContributors | Total_Unique_Contributions |
|---|---|---|---|---|---|---|
| 1996 | 188 | 6 | 194 | 187 | 6 | 193 |
| 1997 | 410 | 6 | 416 | 74 | 6 | 80 |
| 1998 | 219 | 15 | 234 | 28 | 6 | 34 |
| 1999 | 327 | 43 | 370 | 49 | 7 | 56 |
| 2000 | 158 | 26 | 184 | 71 | 10 | 81 |
| 2001 | 136 | 1 | 137 | 86 | 4 | 90 |
Explanation for hike in 2008: The big spike would be attributed to the financial crises where I would assume that a lot of people were searching for job at home and had a lot of time to invest into Edge conversations.
Both, the unique number of contributors (left plot) and total number of contributions in debates is growing for both genders - albeit just a little.
r1 <- ggplot(edgeDS_conversations_q1, aes(x= Year)) +
geom_point(aes(y = Total_UniqueMaleContributors, colour = "Male Contr.")) +
geom_line(aes(y = Total_UniqueMaleContributors, colour = "Male Contr.")) +
stat_smooth(aes(y = Total_UniqueMaleContributors, colour = "Male Contr."),
method = "lm", col = "#1695A3") +
geom_point(aes(y = Total_UniqueFemaleContributors, colour = "Female Contr.")) +
geom_line(aes(y = Total_UniqueFemaleContributors, colour = "Female Contr.")) +
stat_smooth(aes(y = Total_UniqueFemaleContributors, colour = "Female Contr."),
method = "lm", col = "#450003") +
ylab("# of Unique people contributing by Gender") + labs(color = "Gender") +
ggtitle("Unique Contributors over years") +
theme_minimal() + scale_fill_ptol() + theme(legend.position="none") +
scale_x_continuous(breaks = pretty(edgeDS_conversations_q1$Year, n = 19)) +
scale_y_continuous(breaks = seq(0, 200, by = 25), limits = c(0, 200))
r2 <- ggplot(edgeDS_conversations_q1, aes(x= Year)) +
geom_point(aes(y = Total_Male_Contributions, colour = "Male Contr.")) +
geom_line(aes(y = Total_Male_Contributions, colour = "Male Contr.")) +
stat_smooth(aes(y = Total_Male_Contributions, colour = "Male Contr."),
method = "lm", col = "#1695A3") +
geom_point(aes(y = Total_Female_Contributions, colour = "Female Contr.")) +
geom_line(aes(y = Total_Female_Contributions, colour = "Female Contr.")) +
stat_smooth(aes(y = Total_Female_Contributions, colour = "Female Contr."),
method = "lm", col = "#450003") +
ylab("Total Contributions by Gender") + labs(color = "Gender") +
ggtitle("Total Contributions by Men are volatile",
subtitle = "While those by Women stay pretty much the same") +
theme_minimal() + scale_fill_ptol() +
scale_x_continuous(breaks = pretty(edgeDS_conversations_q1$Year, n = 19)) +
scale_y_continuous(breaks = seq(0, 1250, by = 100))
grid.arrange(r1, r2, ncol= 2)We can now look on how the participation of women has increased over the years - considering that there might be differences between those in Academia with PhD or where the author and commenter are the same person.
The plot below shows e.g. that there are no women who are academicians and don’t have their PhD. Every women is either academician with PhD. or is a non-academician with either PhD or without one.
women_part_by_year <- edgeDS_conversations %>%
select(Year, Gender, Female_Contributions, Live, Academic,
TwoAuthors, HavePhD, AuthorAndCommenter) %>%
filter(Gender == "Female") %>%
group_by_all() %>%
tally %>%
spread(Gender, n, fill = 0)
women_part_by_year$Live <- ifelse(women_part_by_year$Live == 0, "Written TP", "Transcribed TP")
women_part_by_year$TwoAuthors <- ifelse(women_part_by_year$TwoAuthors == 1, "2 authors", "1 author")
women_part_by_year$HavePhD <- ifelse(women_part_by_year$HavePhD == 1, "Have PHD", "Don't have PHD")
women_part_by_year$AuthorAndCommenter <- ifelse(women_part_by_year$AuthorAndCommenter == 1,
"Both author and commentator", "Otherwise")
colnames(women_part_by_year)[8] <- "#_of_Females"
men_part_by_year <- edgeDS_conversations %>%
select(Year, Gender, Male_Contributions, Live, Academic,
TwoAuthors, HavePhD, AuthorAndCommenter) %>%
filter(Gender == "Male") %>%
group_by_all() %>%
tally %>%
spread(Gender, n, fill = 0)
men_part_by_year$Live <- ifelse(men_part_by_year$Live == 0, "Written TP", "Transcribed TP")
men_part_by_year$TwoAuthors <- ifelse(men_part_by_year$TwoAuthors == 1, "2 authors", "1 author")
men_part_by_year$HavePhD <- ifelse(men_part_by_year$HavePhD == 1, "Have PHD", "Don't have PHD")
men_part_by_year$AuthorAndCommenter <- ifelse(men_part_by_year$AuthorAndCommenter == 1,
"Both author and commentator", "Otherwise")
colnames(men_part_by_year)[8] <- "#_of_Males"
d1 <- ggplot(women_part_by_year, aes(x=Year, y = Female_Contributions)) +
geom_point() + facet_grid(Academic ~ HavePhD) +
theme_minimal() + scale_fill_ptol() +
ggtitle("Female contributions and impact of being academician with/out PhD.",
subtitle = "There are no women in the academia without PhD.")
d2 <- ggplot(men_part_by_year, aes(x=Year, y = Male_Contributions)) +
geom_point() + facet_grid(Academic ~ HavePhD) +
theme_minimal() + scale_fill_ptol() +
ggtitle("Male contributions and impact of being academician with/out PhD.",
subtitle = "Here, there are actually men who don't have PhD. but are in academia")
grid.arrange(d1,d2,ncol=2)d3 <- ggplot(women_part_by_year, aes(x=Year, y = Female_Contributions)) +
geom_point() + facet_grid(AuthorAndCommenter ~ TwoAuthors) +
theme_minimal() + scale_fill_ptol() +
ggtitle("Female contributions to the articles where the author and commenter is",
subtitle = "same person and where comments can be written by two people")
d4 <- ggplot(men_part_by_year, aes(x=Year, y = Male_Contributions)) +
geom_point() + facet_grid(AuthorAndCommenter ~ TwoAuthors) +
theme_minimal() + scale_fill_ptol() +
ggtitle("Male contributions to the articles where the author and commenter is",
subtitle = "same person and where comments can be written by up 2 people")
grid.arrange(d3,d4,ncol=2)Skewness function confirms that all distributions are highly skewed. E.g. contributions of women to the transcribed text pieces (top-left plot) is visibly skewed to the left (being ‘+’) - i.e. the curve is right-leaning.
On the other hand, the bottom-left distribution is more normally distributed than others having the lowest skewness of just 1.4.
We can also conduct skewness test of D’Agostino. This is also confirmed because of the low p-value, i.e. the null-hypothesis that [our distributions] are normally distributed can be rejected, which is the case here
skew_fem_tran <- skewness(women_part_by_year$Female_Contributions[women_part_by_year$Live == "Transcribed TP"])
agostino.test(women_part_by_year$Female_Contributions[women_part_by_year$Live == "Transcribed TP"])
D'Agostino skewness test
data: women_part_by_year$Female_Contributions[women_part_by_year$Live == "Transcribed TP"]
skew = 2.1243, z = 4.3596, p-value = 1.303e-05
alternative hypothesis: data have a skewness
skew_fem_writ <- skewness(women_part_by_year$Female_Contributions[women_part_by_year$Live == "Written TP"])
agostino.test(women_part_by_year$Female_Contributions[women_part_by_year$Live == "Written TP"])
D'Agostino skewness test
data: women_part_by_year$Female_Contributions[women_part_by_year$Live == "Written TP"]
skew = 1.4393, z = 3.8910, p-value = 9.983e-05
alternative hypothesis: data have a skewness
skew_m_tran <- skewness(men_part_by_year$Male_Contributions[men_part_by_year$Live == "Transcribed TP"])
skew_m_writ <- skewness(men_part_by_year$Male_Contributions[men_part_by_year$Live == "Written TP"])
df_annotationsForFacets <- data.frame(Live = c("Transcribed TP", "Written TP", "Transcribed TP", "Written TP"),
value = c(skew_fem_tran, skew_fem_writ, skew_m_tran, skew_m_writ),
x = 2005, y= c(30, 30, 300, 300))Source #1 / Source #2 / Source #3
d5 <- ggplot(women_part_by_year, aes(x=Year, y = Female_Contributions)) +
geom_point() + facet_grid(Live ~ . ) +
theme_minimal() + scale_fill_ptol() +
ggtitle("Female contributions to the Transcribed or Written Text Pieces",
subtitle = "Somewhat skew to the right - right-leaning curve") +
geom_text(data = df_annotationsForFacets[1:2,],
aes(x=x, y=y, label = paste("Skewness = ", round(value,2))),
inherit.aes = FALSE)
d6 <- ggplot(men_part_by_year, aes(x=Year, y = Male_Contributions)) +
geom_point() + facet_grid(Live ~ . ) +
theme_minimal() + scale_fill_ptol() +
ggtitle("Male contributions to the Transcribed or Written Text Pieces",
subtitle = "Relatively equally spread") +
geom_text(data = df_annotationsForFacets[3:4,],
aes(x=x, y=y, label = paste("Skewness = ", round(value,2))),
inherit.aes = FALSE)
grid.arrange(d5,d6,ncol=2)Again prepare the data for later use.
PeoplePerThreadID <- edgeDS_conversations %>%
select(Year, ThreadId, Male_Contributions,
Female_Contributions, FemaleParticipation,
DebateSize, UniqueContributors, UniqueMaleContributors,
UniqueFemaleContributors, UniqueFemaleParticipation) %>%
distinct(.keep_all = TRUE) %>%
select(Year, FemaleParticipation) %>%
group_by(Year) %>%
mutate(AvgFemaleParticipationPerYear = mean(FemaleParticipation),
MedianFemaleParticipationPerYear = median(FemaleParticipation),
SumFemaleParticipationPerYear = sum(FemaleParticipation))
ratio <- edgeDS_conversations_q1$Total_Female_Contributions/edgeDS_conversations_q1$Total_ContributionsIndependently of how one calculates the (average) female participation, the trend is still growing.
q1 <- ggplot(PeoplePerThreadID, aes(x = Year, y = AvgFemaleParticipationPerYear)) +
geom_point() +
ggtitle("Female Participation per Year is slightly increasing",
subtitle = "Calculation based on averaging 'Female Participation' (from dataset)") +
geom_line() +
ylab("Avarage Female Participation") +
geom_smooth(method = "lm", se = T, show.legend = TRUE) +
scale_y_continuous(breaks = seq(0, 1, by = 0.1), limits = c(0, 1)) +
scale_x_continuous(breaks = pretty(PeoplePerThreadID$Year, n = 19)) +
theme_minimal() + scale_fill_ptol()
q2 <- ggplot(edgeDS_conversations_q1, aes(x=Year)) + geom_point(aes(y=ratio)) + geom_line(aes(y=ratio)) +
ggtitle("Here it is also growing",
subtitle = "Calculation based on total fe(male) contributions per year") +
theme_minimal() + scale_fill_ptol() + ylab("Total_Female_Contributions/Total_Contributions") +
scale_x_continuous(breaks = pretty(edgeDS_conversations_q1$Year, n = 19)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.1), limits = c(0, 1))
grid.arrange(q1,q2, ncol=2)For reminder:
Hypothesis 1: “A woman’s tendency to participate actively in the conversation correlates positively with the number of (I guess unique) females in the discussion.”
cor_data <- edgeDS %>%
filter(Role == 1, Gender == "Female", Type == 2) %>%
select(-Link, -Id_num, -ThreadId, -Year, -Title,-Role,-Type,
-Order, -Text,-Limited_Information,-Gender, -starts_with("dummy"))
all_factors_names <- as.data.frame(sapply(cor_data, is.factor))
all_factors_names$name <- rownames(all_factors_names)
all_factors_names <- all_factors_names[all_factors_names$`sapply(cor_data, is.factor)` == TRUE,]
cor_data <- cor_data[ , !(names(cor_data) %in% all_factors_names$name)]
all_char_names <- as.data.frame(sapply(cor_data, is.character))
all_char_names$name <- rownames(all_char_names)
all_char_names <- all_char_names[all_char_names$`sapply(cor_data, is.character)` == TRUE,]
cor_data <- cor_data[ , !(names(cor_data) %in% all_char_names$name)]
cor_data <- cor_data[ , -c(26:106)]
M <- cor(as.data.frame(cor_data), use = "pairwise.complete.obs")Idea: If an author is a woman, then there is potentially a higher engagement with other women as well (those could be more interesting to contribute).
Result: Yet correlation - at least for the case of Female Contributions - doesn’t confirm this.
cor(cor_data$Female_Contributions, cor_data$UniqueFemaleContributors)[1] -0.1779973
This is opposite with Female Participation where there is a strong correlation between the normal Female Participation ratio and the unique one.
cor(cor_data$FemaleParticipation, cor_data$UniqueFemaleParticipation)[1] 0.7275177
Plot correlation matrix.
corrplot(M)ggcorr(M, nbreaks = 6, palette = "RdGy", label = TRUE, label_size = 3,
label_color = "white", hjust = 0.75, size = 2.5, angle = -5,layout.exp = 5)Hypothesis 2: “Higher status participants are more verbose than are lower status participants.”
My Definition of ‘Status’: I define “higher status” as being solely dependent on the person’s background, workplace (i.e. where (s)he comes from), his/her job (and if academicians, then also the department, discipline, academic rank and citations) and whether (s)he has a PhD and from which institution, etc. I use data from the dataset, exclusively.
edgeDS_verbosity <- edgeDS %>%
select(Year, Academic, Limited_Information, H_Index, i10_Index, Role, Gender,
HavePhD, DebateSize, Job_Title_S, Department_S, Discipline,
PhD_Institution_SR_Bin, AcademicHierarchyStrict, Number.Characters) %>%
group_by_all()Group of academia <> Number of characters
ggplotly(
ggplot(edgeDS_verbosity, aes(Academic, Number.Characters)) + geom_boxplot() +
scale_y_continuous(breaks = seq(0,4000, by = 200), limits = NA) +
coord_cartesian(ylim =c(0,4000)) +
xlab("Group") + ylab("Number of characters") +
ggtitle("Number of characters by group (both genders)") +
theme_minimal() + scale_fill_ptol()
) %>% layout(xaxis = list(ticktext = c("Non_Academicians", "Academicians", "NA")))Hopefully one day plotly R will fix subtitle issues
Avarage number of characters by academicians and later non-academicians:
edgeDS_verbosity_aca_y <- edgeDS_verbosity %>% filter(Academic == "Academicians")
mean(edgeDS_verbosity_aca_y$Number.Characters) [1] 3333.498
edgeDS_verbosity_aca_n <- edgeDS_verbosity %>% filter(Academic == "Non_Academicians")
mean(edgeDS_verbosity_aca_n$Number.Characters)[1] 2495.364
Does the number of characters in the comments depend anyhow on person’s job title? Well, only slightly.
ggplotly(
ggplot(edgeDS_verbosity, aes(Job_Title_S, Number.Characters)) + geom_boxplot() +
scale_y_continuous(breaks = seq(0,20000, by = 1000), limits = c(0,190000)) +
xlab("Job Title") + ylab("Number of characters") +
ggtitle("When graduate students write, then it is always quite a lot",
subtitle = "maybe there are some in NA or Other bins") +
theme_minimal() + scale_fill_ptol() + coord_flip(ylim =c(0,6000))
) %>% layout(margin = list(l = 200))Also, of those academic people, departments/fields they come from and work in - does it have any impact on how much they write?
ggplotly(
ggplot(edgeDS_verbosity, aes(Department_S, Number.Characters)) + geom_boxplot() +
coord_flip(ylim =c(0,45000)) +
scale_y_continuous(breaks = seq(0,45000, by = 5000), limits = c(0,190000)) +
xlab("Academic Department") + ylab("Number of characters") +
theme_minimal() + scale_fill_ptol()
)Discipline also seems to have an impact. The difference between “Professions” and “Other” (“NA”) is also quite significant.
ggplotly(
ggplot(edgeDS_verbosity, aes(Discipline, Number.Characters)) +
geom_boxplot() + #coord_flip() +
xlab("Academic Discipline") + ylab("Number of characters") +
scale_y_continuous(breaks = seq(0, 100000, by = 1000), limits = NA) +
coord_cartesian(ylim =c(0,10000)) +
stat_summary(fun.y=mean, fun.args = list(na.rm = F),
na.rm = T, geom="point", size=2.5, color="green") +
ggtitle("Box Plot of number of characters by academic discipline",
subtitle = "Green Point shows an average/mean") +
theme_minimal() + scale_fill_ptol()
)
edgeDS_verbosity_disc_diff <- edgeDS_verbosity %>%
dplyr::ungroup() %>%
dplyr::select(Discipline, Number.Characters) %>%
dplyr::group_by(Discipline) %>%
dplyr::summarize(min_Number.Characters = min(Number.Characters),
#minDisciplineName = Discipline[which.min(Number.Characters)],
max_Number.Characters = max(Number.Characters),
#maxDisciplineName = Discipline[which.max(Number.Characters)],
mean_Number.Characters = round(mean(Number.Characters, na.rm = T),2)
) %>%
dplyr::mutate(difference_max_min = max_Number.Characters-min_Number.Characters)| Discipline | min_Number.Characters | max_Number.Characters | mean_Number.Characters | difference_max_min |
|---|---|---|---|---|
| Formal Sciences | 13 | 28219 | 3003.34 | 28206 |
| Humanities | 4 | 56874 | 3251.17 | 56870 |
| Natural Sciences | 4 | 84762 | 4194.41 | 84758 |
| Professions | 11 | 87374 | 3972.31 | 87363 |
| Social Sciences | 3 | 118762 | 2891.57 | 118759 |
| NA | 3 | 91341 | 2335.37 | 91338 |
The Shanghai Ranking of their PhD Institution vs. number of characters they write. Like
1 = university was ranked between 1 and 50
2 = university was ranked between 51 and 100
etc.
ggplotly(
ggplot(edgeDS_verbosity, aes(PhD_Institution_SR_Bin, Number.Characters)) +
geom_boxplot() +
coord_cartesian(ylim =c(0,7000)) +
ggtitle("Shanghai Rankings of their PhD Institution")+
scale_y_continuous(breaks = seq(0,10000, by = 500), limits = c(0,190000)) +
theme_minimal() + scale_fill_ptol()
) %>% layout(xaxis = list(ticktext = c("1-50", "51-100",
"101-150", "151-200",
"201-300", "301-400", "401-510", "NA"), showgrid = F))By AcademicHierarchyStrict:
ggplotly(
ggplot(edgeDS_verbosity, aes(AcademicHierarchyStrict, Number.Characters)) +
geom_boxplot() +
coord_cartesian(ylim =c(0,6000)) +
scale_y_continuous(breaks = seq(0,10000, by = 500), limits = c(0,190000)) +
theme_minimal() + scale_fill_ptol()
) %>% layout(xaxis = list(title = "Academic Hierarchy",
ticktext = c("Graduate Student", "Postdoctoral",
"Assistant Prof.", "Associate Prof.",
"Professor", "Chaired Prof.", "NA"), showgrid = F))More verbose when they have PhD? Yes!
ggplotly(
ggplot(edgeDS_verbosity, aes(HavePhD, Number.Characters)) + geom_boxplot() +
coord_cartesian(ylim =c(0,4000)) +
scale_y_continuous(breaks = seq(0,10000, by = 250), limits = c(0,190000)) +
theme_minimal() + scale_fill_ptol()
) %>% layout(xaxis = list(title = "Are they Doctor of Philosophy?",
ticktext = c("Dont have PhD", "Have PhD", "NA")))How much people with/out PhD write on average?
edgeDS_verbosity_hphd_y <- edgeDS_verbosity %>% filter(HavePhD == 1)
mean(edgeDS_verbosity_hphd_y$Number.Characters)[1] 3238.645
edgeDS_verbosity_hphd_n <- edgeDS_verbosity %>% filter(HavePhD == 0)
mean(edgeDS_verbosity_hphd_n$Number.Characters)[1] 2304.422
By Gender
ggplotly(
ggplot(edgeDS_verbosity, aes(Gender, Number.Characters)) + geom_boxplot() +
coord_cartesian(ylim =c(0,5000)) +
scale_y_continuous(breaks = seq(0,10000, by = 500), limits = c(0,190000)) +
theme_minimal() + scale_fill_ptol()
) %>% layout(xaxis = list(title = "Gender", ticktext = c("Male", "Female", "NA")))Role ?
ggplotly(
ggplot(edgeDS_verbosity, aes(Role, Number.Characters)) + geom_boxplot() +
coord_cartesian(ylim =c(0,5000)) +
scale_y_continuous(breaks = seq(0,10000, by = 500), limits = c(0,190000))
) %>% layout(xaxis = list(title = "Who were debaters?", ticktext = c("Author", "Commentator")))The most comments are written when the debate size is below 200 people.
ggplot(edgeDS_verbosity, aes(DebateSize, Number.Characters)) +
coord_cartesian(ylim =c(0,7000)) +
scale_y_continuous(breaks = seq(0,10000, by = 500), limits = c(0,190000)) +
geom_boxplot(aes(group = cut_interval(DebateSize, 10)))Create bins for h-index and i10-index (citations)
edgeDS_verbosity$hindex_bin <- cut(edgeDS_verbosity$H_Index, breaks=seq(0,150, by = 15), labels=c("0-15","15-30","30-45", "45-60", "60-75", "75-90", "90-105", "105-120", "120-135", "135-150"))
ggplotly(
ggplot(edgeDS_verbosity, aes(hindex_bin, Number.Characters)) +
geom_boxplot() +
coord_cartesian(ylim =c(0,7000)) +
scale_y_continuous(breaks = seq(0,10000, by = 500), limits = c(0,190000))
)edgeDS_verbosity$i10_bin <- cut(edgeDS_verbosity$i10_Index, breaks=seq(0,600, by = 40))
ggplotly(
ggplot(edgeDS_verbosity, aes(i10_bin, Number.Characters)) + geom_boxplot() +
geom_boxplot() +
coord_cartesian(ylim =c(0,13000)) +
scale_y_continuous(breaks = seq(0,15000, by = 1000), limits = c(0,190000))
)rm(edgeDS_verbosity_aca_n, edgeDS_verbosity_aca_y, df_annotationsForFacets, edgeDS_verbosity_hphd_y,
edgeDS_verbosity_disc_diff, edgeDS_verbosity_hphd_n, edgeDS_verbosity)What I call “High Flyers” are somebody who is in Academia with PhD (from TOP 100 universities according to Shanghai Ranking and US News and World Reported - union) and where their university workplace was also ranked in TOP 100 (union).
high_flyers <- edgeDS %>%
filter(Academic == "Academicians" & HavePhD == 1) %>%
filter(PhD_Institution_SR_Bin %in% c(1,2) | PhD_Institution_US_IR_Bin %in% c(1,2)) %>%
filter(Workplace_SR_Bin %in% c(1,2) | Workplace_US_IR_Bin %in% c(1,2)) %>%
select(Year, Number.Characters) %>%
group_by(Year) %>%
summarise(Number.Characters = sum(Number.Characters)) %>%
mutate(type = "high_flyers")
high_flyers_oppo <- edgeDS %>%
filter(!(Academic == "Academicians" & HavePhD == 1)) %>%
filter(!(PhD_Institution_SR_Bin %in% c(1,2) | PhD_Institution_US_IR_Bin %in% c(1,2))) %>%
filter(!(Workplace_SR_Bin %in% c(1,2) | Workplace_US_IR_Bin %in% c(1,2))) %>%
write_csv("~/documents/R-github/Crowdsourcing_Data_Analysis_2_EDGE_org/high_flyers_oppo.csv") %>%
select(Year, Number.Characters) %>%
group_by(Year) %>%
summarise(Number.Characters = sum(Number.Characters)) %>%
mutate(type = "high_flyers_oppo")
df <- rbind(high_flyers,high_flyers_oppo )After data are prepared, plot them now.
ggplot(df, aes(x=Year, y = Number.Characters, fill = as.factor(type))) +
geom_bar(position="dodge", stat="identity") + # geom_line(aes(color = as.factor(type))) +
theme_minimal() + scale_fill_ptol() +
labs(fill = "Type", color = "Type") +
scale_x_continuous(breaks = pretty(unique(df$Year), n = 19)) +
scale_y_continuous(breaks = seq(0,1000000, by = 50000), labels = comma) +
ggtitle("Cummulative sum of characters per year and group",
subtitle = "Except for a few years where it is almost equal or even higher,
high flyers comment much more than non-high-flyers.")sessionInfo()R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grid stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] rsconnect_0.8.4 bindrcpp_0.2 corrplot_0.77
[4] visdat_0.1.0 plotly_4.7.1 ggthemes_3.4.0
[7] GGally_1.3.2 gridExtra_2.2.1 kableExtra_0.3.0
[10] devtools_1.13.3 moments_0.14 VIM_4.7.0
[13] data.table_1.10.4 colorspace_1.3-2 scales_0.4.1.9002
[16] reshape2_1.4.2 forcats_0.2.0 stringr_1.2.0
[19] dplyr_0.7.2.9000 purrr_0.2.3 readr_1.1.1.9000
[22] tidyr_0.6.3 tibble_1.3.3 ggplot2_2.2.1.9000
[25] tidyverse_1.1.1.9000 rmarkdown_1.6.0.9001 knitr_1.16
loaded via a namespace (and not attached):
[1] nlme_3.1-131 bitops_1.0-6 pbkrtest_0.4-7
[4] lubridate_1.6.0 RColorBrewer_1.1-2 httr_1.2.1
[7] rprojroot_1.2 tools_3.4.1 backports_1.1.0
[10] R6_2.2.2 lazyeval_0.2.0 mgcv_1.8-18
[13] nnet_7.3-12 withr_2.0.0 sp_1.2-5
[16] mnormt_1.5-5 compiler_3.4.1 rvest_0.3.2
[19] quantreg_5.33 SparseM_1.77 xml2_1.1.1
[22] labeling_0.3 DEoptimR_1.0-8 lmtest_0.9-35
[25] psych_1.7.5 robustbase_0.92-7 digest_0.6.12
[28] foreign_0.8-69 minqa_1.2.4 base64enc_0.1-3
[31] pkgconfig_2.0.1 htmltools_0.3.6 boxes_0.0.0.9000
[34] lme4_1.1-13 highr_0.6 htmlwidgets_0.9
[37] rlang_0.1.1.9000 readxl_1.0.0 rstudioapi_0.6
[40] shiny_1.0.3.9002 bindr_0.1 zoo_1.8-0
[43] jsonlite_1.5 crosstalk_1.0.0 car_2.1-5
[46] RCurl_1.95-4.8 magrittr_1.5 Matrix_1.2-10
[49] Rcpp_0.12.12 munsell_0.4.3 stringi_1.1.5
[52] yaml_2.1.14 RJSONIO_1.3-0 MASS_7.3-47
[55] plyr_1.8.4 parallel_3.4.1 crayon_1.3.2.9000
[58] lattice_0.20-35 haven_1.1.0 splines_3.4.1
[61] hms_0.3 boot_1.3-20 codetools_0.2-15
[64] clisymbols_1.2.0 glue_1.1.1 evaluate_0.10.1
[67] laeken_0.4.6 modelr_0.1.1 vcd_1.4-3
[70] httpuv_1.3.5 nloptr_1.0.4 MatrixModels_0.4-1
[73] cellranger_1.1.0 gtable_0.2.0 reshape_0.8.6
[76] assertthat_0.2.0 mime_0.5 xtable_1.8-2
[79] broom_0.4.2 e1071_1.6-8 class_7.3-14
[82] viridisLite_0.2.0 memoise_1.1.0