Data Analysis: Gender + Twitter

1 Data

The data used in the present paper come from a cross-sectional study that examined questions regarding gender identity and social media usage. The study draws on a larger dataset collected during 2023 using the Qualtrics research panel service. Recruitment was facilitated through Qualtrics via email, and the survey was available exclusively to Qualtrics panel participants, who completed it on the Qualtrics survey platform.

Through the questionnaire, we gathered (1) the gender traits of the respondents using four items (gender identity, gender expression, biological sex, and sexual attraction), (2) a set of variables on the usage of X, covering the time spent in the application and the topics for which it is used, and (3) the algorithm’s performance in classifying the gender of the users. This novel dataset provides relevant data on not only the accuracy of the gender assignment algorithm used by X (formerly Twitter) but also its relationship with subjective and objective traits such as gender perception, biological sex, and sexual orientation. It is important to stress that these data had to be gathered through a traditional survey rather than through data mining techniques (e.g., web scraping) that, although they could provide larger datasets, cannot capture the subjective aspects of gender that are key to our study.

The survey was distributed to a global audience and received \(N=1642\) responses.

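# Packages used in the import and cleaning steps (read_csv, filter, drop_na, distinct)
library(tidyverse)

# CSV importation
DF <- read_csv("first_sample.csv")

# Naming columns correctly: the first data row holds the variable names
colnames(DF) <- DF[1,]
# Drop the first two rows, which hold header information rather than responses
DF <- DF[-1,]
DF <- DF[-1,]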

# Print total number of respondents before cleaning
print(paste("Total number of respondents before cleaning:", nrow(DF)))

## [1] "Total number of respondents before cleaning: 1642"

For data quality and rigor, we removed participants who had not completed the survey, those who did not provide their participant identification number, and those with duplicate entries on the identification variable (PROLIFIC_PID), so that each participant contributed only one response. This left us with \(N=1553\) responses.

# Filtering the data where 'Finished' column is "True"
DF <- DF %>% filter(Finished == "True")

# Dropping rows with NA in 'PROLIFIC_PID'
DF <- DF %>% drop_na(PROLIFIC_PID)

# Remove duplicate entries, keeping the first row for each PROLIFIC_PID
DF <- DF %>% distinct(PROLIFIC_PID, .keep_all = TRUE)

# Print total number of respondents after cleaning
print(paste("Total number of respondents after cleaning:", nrow(DF)))

## [1] "Total number of respondents after cleaning: 1553"
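
The three cleaning rules can be sketched on a toy data frame in base R. The column names Finished and PROLIFIC_PID are those used above; the rows are invented purely for illustration, and clean_responses is a hypothetical helper name rather than part of the original pipeline.

```r
# Toy data (invented for illustration; not the survey data)
toy <- data.frame(
  Finished     = c("True", "True", "False", "True", "True"),
  PROLIFIC_PID = c("a1", "a1", "b2", NA, "c3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper applying the three rules in the order used above
clean_responses <- function(d) {
  d <- d[d$Finished == "True", , drop = FALSE]         # completed surveys only
  d <- d[!is.na(d$PROLIFIC_PID), , drop = FALSE]       # identifier present
  d <- d[!duplicated(d$PROLIFIC_PID), , drop = FALSE]  # first response per participant
  d
}

cleaned <- clean_responses(toy)
print(nrow(cleaned))
```

Of the five toy rows, one is unfinished, one lacks an identifier, and one repeats an identifier, so two rows remain.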

1.1 Demographics

Demographic variables (e.g., age, nationality, working status) were extracted from data gathered and provided by Qualtrics.

The mean age of participants was 30.8 years, with a standard deviation of 10.0. The median was 28 years, with Q1 = 24 and Q3 = 35. The distribution is right-skewed (skewness of 1.5), reflecting an overrepresentation of Millennials and Gen Z relative to Gen X and Baby Boomers.

# Mean
print(paste("Age - Mean:", round(mean(na.omit(as.numeric(df$age))), 2)))

## [1] "Age - Mean: 30.78"

# SD
print(paste("Age - SD:", round(sd(na.omit(as.numeric(df$age))), 2)))

## [1] "Age - SD: 9.99"

# Median
print(paste("Age - Median:", round(median(na.omit(as.numeric(df$age))), 2)))

## [1] "Age - Median: 28"

# Quantiles
print(quantile(na.omit(as.numeric(df$age))))

##   0%  25%  50%  75% 100% 
##   18   24   28   35   79

# Skewness
print(paste("Age - Skewness:", round(skewness(na.omit(as.numeric(df$age))), 2)))

## [1] "Age - Skewness: 1.5"

# Histogram 
hist(na.omit(as.numeric(df$age)), breaks=70, main = "Age", xlab = "Age")
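
The package behind the skewness() call above is not shown, and it is not a base R function. As a point of reference, here is a minimal base R sketch of the moment-based sample skewness, \(m_3 / m_2^{3/2}\). The helper name moment_skewness and the example vectors are illustrative assumptions; packaged implementations may apply small-sample corrections and return slightly different values.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
moment_skewness <- function(x) {
  x <- x[!is.na(x)]          # drop missing values, as na.omit() does above
  d <- x - mean(x)           # deviations from the mean
  mean(d^3) / mean(d^2)^1.5
}

# Invented example vectors
symmetric  <- c(20, 25, 30, 35, 40)      # mirror-image around 30
right_tail <- c(18, 20, 22, 24, 26, 60)  # one large value on the right

skew_symmetric  <- moment_skewness(symmetric)
skew_right_tail <- moment_skewness(right_tail)
print(c(symmetric = skew_symmetric, right_tail = skew_right_tail))
```

A symmetric vector yields zero, while a long right tail yields a positive value, which is the pattern reported for age.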

The sample covers several employment categories. The largest group is employed full-time (46%), followed by part-time workers (16%), those who are unemployed and seeking a job (14%), and those in the Other category (8%). Smaller groups include those who are not in paid work, such as homemakers, retirees, or disabled individuals (3%), and those who are due to start a new job within the next month (1%). The share of records with no usable value (DATA_EXPIRED, 12%) is notable.

a <- round(prop.table(table(df$employment_status)), 2)*100
employment_freq <- as.data.frame(a)
print(employment_freq)

##                                                       Var1 Freq
## 1                                          CONSENT_REVOKED    0
## 2                                             DATA_EXPIRED   12
## 3             Due to start a new job within the next month    1
## 4                                                Full-Time   46
## 5 Not in paid work (e.g. homemaker', 'retired or disabled)    3
## 6                                                    Other    8
## 7                                                Part-Time   16
## 8                             Unemployed (and job seeking)   14

Our respondents are spread across more than fifty countries, led by South Africa (376), the UK (255), and Portugal (170). The composition of the sample by continent is highly uneven. Europe dominates, accounting for 61%, followed by Africa with 27%. North America contributes 5%, while Asia, despite its size and diversity, represents only 4%. Oceania and South America have minimal representation, at 1% each, and the continent is unknown for a further 1%.

nationality_freq <- as.data.frame(table(df$nationality))
nationality_freq <- nationality_freq[order(-nationality_freq$Freq), ]
top_3_nationalities <- head(nationality_freq, 3)
print(top_3_nationalities)

##              Var1 Freq
## 55   South Africa  376
## 65 United Kingdom  255
## 51       Portugal  170

a <- round(prop.table(table(df$continent)), 2)*100
print(as.data.frame(a))

##            Var1 Freq
## 1        Africa   27
## 2          Asia    4
## 3        Europe   61
## 4 North America    5
## 5       Oceania    1
## 6 South America    1
## 7       Unknown    1

The sample’s ethnic composition is as follows: White (62%), Black (24%), Mixed (6%), Asian (4%), and Other (3%).

a <- round(prop.table(table(df$ethnicity)), 2)*100
print(as.data.frame(a))

##              Var1 Freq
## 1           Asian    4
## 2           Black   24
## 3 CONSENT_REVOKED    0
## 4    DATA_EXPIRED    0
## 5           Mixed    6
## 6           Other    3
## 7           White   62

The average response time was 317 seconds.

hist(as.numeric(df$duration), breaks=100, main = "Response Time", xlab = "Duration in Seconds")

mean(as.numeric(df$duration))

## [1] 316.9794

1.2 Gender traits

We differentiate between the biological sex, gender identity, gender expression, and sexual orientation of the respondents. The variable “biological sex” refers to personal anatomy and physical attributes, such as external sex organs, sex chromosomes, and internal reproductive structures. Its distribution is as follows: female individuals comprise 50.16%, male individuals represent 49.65%, and intersex individuals account for 0.19%.

prop.table(table(df$bio_sex))

## 
##                                                             Female 
##                                                        0.501609788 
## Intersex (please also tick this if you have an intersex variation) 
##                                                        0.001931745 
##                                                               Male 
##                                                        0.496458467

We also gathered the gender identity of the respondents. By gender identity, we understand an individual’s sense of being male, female, or another gender, which is separate from biological sex. The gender identity distribution for the sample is as follows: Male participants make up 48.16%, while Female participants account for 47.39%, showing nearly equal representation between these two categories. A smaller proportion identify as non-binary (3.03%), followed by other (0.45%) and no gender (0.32%). Transgender individuals (including those with a transgender history) represent 0.64% of the dataset.

prop.table(table(df$gender_identity))

## 
##                                                                Female 
##                                                           0.473921442 
##                                                                  Male 
##                                                           0.481648422 
##                                                             No gender 
##                                                           0.003219575 
##                                                            Non-binary 
##                                                           0.030264005 
##                                                                 Other 
##                                                           0.004507405 
## Transgender (please also tick this if you have a transgender history) 
##                                                           0.006439150

Gender expression was also considered. Gender expression can be defined as how we show our gender to the world around us. Masculine expressions are the most prevalent, representing 47.84% of the sample, followed closely by Feminine expressions at 44.95%. Non-Binary expressions account for 5.92%, while Other expressions are reported by 1.29% of participants.

prop.table(table(df$gender_expression))

## 
##   Feminine  Masculine Non-Binary      Other 
## 0.44945267 0.47842885 0.05924018 0.01287830

Finally, respondents were asked about their sexual orientation. The majority, 67.87%, identified as Straight, followed by 17.58% identifying as Bisexual. Gay individuals constituted 7.47% of the sample. Smaller proportions identified as Asexual (2.25%), Questioning (2.25%), and Other (2.58%).

prop.table(table(df$sex_orientation))

## 
##     Asexual    Bisexual         Gay       Other Questioning    Straight 
##  0.02253703  0.17578880  0.07469414  0.02575660  0.02253703  0.67868641

1.3 Patterns of X’s usage

As indicated, our survey gathered information about the respondents’ patterns of X usage. Participants were asked (1) why they primarily use X, with motivations such as entertainment, information gathering, or networking; (2) how frequently they use the platform, including how often they log in, tweet, and engage with content such as promotional tweets, retweets, and likes; (3) which topics they tweet about most and least frequently; and (4) how often they use specific features, such as the search function, and whether they assigned a gender to their profile during registration.

The questions related to the time a user spends using X were registered using an ordinal scale variable with seven items (several times a day, once a day, several times a week, once a week, several times a month, once a month, and less frequently).

df_pca <- data.frame(
  
  U1 = ifelse(df$twitter_use_time == "Less frequently", 0,
              ifelse(df$twitter_use_time == "Once a month", 1, 
                     ifelse(df$twitter_use_time == "Several times a month", 2, 
                            ifelse(df$twitter_use_time == "Once a week", 3, 
                                   ifelse(df$twitter_use_time == "Several times a week", 4, 
                                          ifelse(df$twitter_use_time == "Once a day", 5, 6)))))),
  
  
  U22 = ifelse(df$twitter_tweet_time == "Less frequently", 0,
              ifelse(df$twitter_tweet_time == "Once a month", 1, 
                     ifelse(df$twitter_tweet_time == "Several times a month", 2, 
                            ifelse(df$twitter_tweet_time == "Once a week", 3, 
                                   ifelse(df$twitter_tweet_time == "Several times a week", 4, 
                                          ifelse(df$twitter_tweet_time == "Once a day", 5, 6)))))),
  

  
  U23 = ifelse(df$twitter_retweet_time == "Less frequently", 0,
               ifelse(df$twitter_retweet_time == "Once a month", 1, 
                      ifelse(df$twitter_retweet_time == "Several times a month", 2, 
                             ifelse(df$twitter_retweet_time == "Once a week", 3, 
                                    ifelse(df$twitter_retweet_time == "Several times a week", 4, 
                                           ifelse(df$twitter_retweet_time == "Once a day", 5, 6)))))),
  
  U24 = ifelse(df$twitter_ad_read_time == "Less frequently" | is.na(df$twitter_ad_read_time), 0,
               ifelse(df$twitter_ad_read_time == "Once a month", 1, 
                      ifelse(df$twitter_ad_read_time == "Several times a month", 2, 
                             ifelse(df$twitter_ad_read_time == "Once a week", 3, 
                                    ifelse(df$twitter_ad_read_time == "Several times a week", 4, 
                                           ifelse(df$twitter_ad_read_time == "Once a day", 5, 6)))))),
  
  U25 = ifelse(df$twitter_like_time == "Less frequently" | is.na(df$twitter_like_time), 0,
               ifelse(df$twitter_like_time == "Once a month", 1, 
                      ifelse(df$twitter_like_time == "Several times a month", 2, 
                             ifelse(df$twitter_like_time == "Once a week", 3, 
                                    ifelse(df$twitter_like_time == "Several times a week", 4, 
                                           ifelse(df$twitter_like_time == "Once a day", 5, 6)))))),
  
  U26 = ifelse(df$twitter_serch_time == "Less frequently", 0,
               ifelse(df$twitter_serch_time == "Once a month", 1, 
                      ifelse(df$twitter_serch_time == "Several times a month", 2, 
                             ifelse(df$twitter_serch_time == "Once a week", 3, 
                                    ifelse(df$twitter_serch_time == "Several times a week", 4, 
                                           ifelse(df$twitter_serch_time == "Once a day", 5, 6))))))
  
  
)
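
The nested ifelse() calls above all apply the same seven-point mapping, so the recoding can be sketched more compactly with a lookup over the ordered labels. The helper encode_frequency and its na_as_zero argument are hypothetical names introduced here. One difference should be kept in mind: the nested version assigns 6 to any label other than the six it names, whereas this sketch returns NA for labels that are not on the scale.

```r
# Scale labels from least to most frequent, as listed in the text
freq_levels <- c("Less frequently", "Once a month", "Several times a month",
                 "Once a week", "Several times a week", "Once a day",
                 "Several times a day")

# Hypothetical helper: map labels to 0..6; optionally code missing answers as 0
encode_frequency <- function(x, na_as_zero = FALSE) {
  code <- match(x, freq_levels) - 1L
  if (na_as_zero) code[is.na(x)] <- 0L
  code
}

# Invented example answers
codes_keep_na <- encode_frequency(c("Less frequently", "Once a week",
                                    "Several times a day", NA))
codes_na_zero <- encode_frequency(c("Once a day", NA), na_as_zero = TRUE)

print(codes_keep_na)
print(codes_na_zero)
```

With such a helper, items coded like U1 would call encode_frequency() on the relevant column, and items coded like U24 and U25, which treat missing answers as "Less frequently", would add na_as_zero = TRUE.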

We created a single composite variable through a one-factor factor analysis. This approach allowed us to summarize the common pattern across all items related to X usage frequency: how often participants use X, tweet, retweet, read promotional tweets, like others’ tweets, and use the search function. The resulting variable should be understood as a metric of the time users spend on X, with higher values corresponding to respondents who spend more time using the application. Once normalized, the variable shows a balanced distribution with values ranging from -2.33 to 2.66. The mean is 0.001 and the median is 0.00, indicating a symmetrical distribution centered around zero.

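# One-factor model over the six frequency items (fa() as provided by the psych package)
time_fa <- fa(df_pca, nfactors = 1)
a <- time_fa$scores
# Normalize the factor scores (orderNorm() as provided by the bestNormalize package)
a <- orderNorm(a)
summary(a$x.t)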

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -2.327073 -0.673983  0.000000  0.001166  0.674996  2.663772

hist(a$x.t, breaks=100, main = "X Usage Index", xlab = "Index Value")
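
The normalization step is assumed here to behave like a rank-based normal-scores transform: each value is replaced by the standard normal quantile of its rank. The sketch below implements that idea in base R on simulated data. It is not the package's code, and normal_scores is a hypothetical helper name.

```r
# Hypothetical helper: rank-based normal scores
normal_scores <- function(x) {
  n <- sum(!is.na(x))                                       # number of observed values
  r <- rank(x, na.last = "keep", ties.method = "average")   # ranks; NA stays NA
  qnorm((r - 0.5) / n)                                      # map ranks to normal quantiles
}

set.seed(1)
raw <- rexp(500)          # simulated, strongly right-skewed scores
z   <- normal_scores(raw)

print(summary(z))
```

Because the transform depends only on ranks, it preserves the ordering of respondents while forcing the shape toward a standard normal. Tied scores share a rank, which may be why the mean in the output above is close to, but not exactly, zero.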

Regarding the topics most followed and tweeted, we classified them into broader thematic categories. Specifically, we grouped these topics into three main categories: (1) actuality (i.e., news and politics), (2) job-related (including job opportunities or career advancement), and (3) social, encompassing personal interactions, entertainment, hobbies, and other social content.

encoded_data <- model.matrix(~ twitter_use_topic - 1, data = df)
encoded_data <- as.data.frame(encoded_data)

df$twitter_tweet_topic_most <- ifelse(is.na(df$twitter_tweet_topic_most), "NaN", df$twitter_tweet_topic_most)
encoded_data_1 <- model.matrix(~ twitter_tweet_topic_most - 1, data = df)
encoded_data_1 <- as.data.frame(encoded_data_1)

df$twitter_tweet_topic_least <- ifelse(is.na(df$twitter_tweet_topic_least), "NaN", df$twitter_tweet_topic_least)
encoded_data_2 <- model.matrix(~ twitter_tweet_topic_least - 1, data = df)
encoded_data_2 <- as.data.frame(encoded_data_2)

df_use <- data.frame(
  
  U2 = encoded_data$`twitter_use_topicCommunication with friends and family`,
  U3 = encoded_data$twitter_use_topicEntertainment,
  U4 = encoded_data$twitter_use_topicInformation,
  U5 = encoded_data$`twitter_use_topicPassing time`,
  U6 = encoded_data$`twitter_use_topicProfessional advancement`,
  U7 = encoded_data$`twitter_use_topicSelf-expression`)

df_tweet <- data.frame(

  U8 = encoded_data_1$twitter_tweet_topic_mostFun,
  U9 = encoded_data_1$`twitter_tweet_topic_mostJob-related`,
  U10 = encoded_data_1$`twitter_tweet_topic_mostNews and info`,
  U11 = encoded_data_1$twitter_tweet_topic_mostOther,
  U12 = encoded_data_1$twitter_tweet_topic_mostPolitics,
  U13 = encoded_data_1$twitter_tweet_topic_mostSexuality,
  U14 = encoded_data_1$twitter_tweet_topic_mostSocial
  
)

topic_use <- ifelse(df_use$U4 == 1, "Actuality (i.e, News, politics)", ifelse(df_use$U6 == 1, "Job-related", "Social"))
topic_tweet <- ifelse(df_tweet$U12 == 1 | df_tweet$U10 == 1, "Actuality (i.e, News, politics)", ifelse(df_tweet$U9 == 1, "Job-related", "Social"))
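
The same grouping rule can be sketched directly on the topic labels, without the intermediate dummy columns. The helper group_topic is a hypothetical name, the example vectors are invented, and the first category label is shortened to "Actuality" for brevity. As in the rule above, every label that is neither actuality nor job-related, including the "NaN" placeholder, falls into the Social group.

```r
# Hypothetical helper: collapse topic labels into three groups
group_topic <- function(topic, actuality, job) {
  ifelse(topic %in% actuality, "Actuality",
         ifelse(topic %in% job, "Job-related", "Social"))
}

# Invented example answers using labels that appear in the survey
use_example   <- c("Information", "Entertainment",
                   "Professional advancement", "Passing time")
tweet_example <- c("Politics", "News and info", "Job-related", "Fun", "NaN")

use_groups <- group_topic(use_example,
                          actuality = "Information",
                          job = "Professional advancement")
tweet_groups <- group_topic(tweet_example,
                            actuality = c("Politics", "News and info"),
                            job = "Job-related")

print(use_groups)
print(tweet_groups)
```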

The majority of use is for social purposes (59.82%), followed by topics related to news and politics (37.73%), with job-related use being minimal (2.45%). Similarly, tweets are predominantly social (68.45%), while actuality-focused tweets account for 27.69%, and job-related tweets represent only 3.86%.

prop.table(table(topic_use))

## topic_use
## Actuality (i.e, News, politics)                     Job-related 
##                      0.37733419                      0.02446877 
##                          Social 
##                      0.59819704

prop.table(table(topic_tweet))

## topic_tweet
## Actuality (i.e, News, politics)                     Job-related 
##                       0.2768835                       0.0386349 
##                          Social 
##                       0.6844816

1.4 Algorithm performance

To capture the performance of the algorithm, participants were asked to log in to their X account, check the gender assigned in their profile, and report whether, in their view, it was right or wrong. 94% of respondents reported that their gender was correctly assigned, and 6% reported that it was not. Respondents were also asked whether they had indicated a gender during their initial registration on the social network. In this regard, 37.8% did not remember whether they had indicated their gender during sign-up, 10.3% reported that they had not, and 51.9% reported that they had. However, 2.5% of the respondents who stated that they provided their gender during the initial registration reported a wrong gender assignment, and this proportion rises to 8.18% among users who did not remember. Thus, X may change the gender shown in a profile even when the user has provided it. For this reason, we used the entire dataset regardless of whether respondents had provided their gender during the registration process. In the non-targeted survey, 5.4% of the respondents declared that the gender in their profile was incorrect; this percentage rises to 10.5% in the targeted sample.

print(prop.table(table(as.factor(df$twitter_gender_correct))))

## 
##         No        Yes 
## 0.06310367 0.93689633

print(prop.table(table(as.factor(df$twitter_gender_assignation))))

## 
## I don't remember               No              Yes 
##        0.3779781        0.1030264        0.5189955

This leads us to suspect that X can assign a gender to a profile even when the user specifies one. For this reason, we decided to run our models on the entire dataset rather than only on the respondents who reported not providing their gender during sign-up.
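
The conditional shares quoted in this section (wrong assignments among those who did, did not, or could not remember providing a gender) correspond to a row-wise cross-tabulation. The code that produced them is not shown, so the following is only a sketch of one way to compute such shares, using invented answers and the variable names from the output above.

```r
# Invented answers for illustration only
toy <- data.frame(
  twitter_gender_assignation = c("Yes", "Yes", "Yes", "Yes", "No", "No",
                                 "I don't remember", "I don't remember"),
  twitter_gender_correct     = c("Yes", "Yes", "Yes", "No", "Yes", "Yes",
                                 "Yes", "No")
)

# Rows: gender provided at registration; columns: profile gender correct
tab <- table(toy$twitter_gender_assignation, toy$twitter_gender_correct)

# margin = 1 makes each row sum to 1, giving shares within each registration group
row_shares <- prop.table(tab, margin = 1)
print(round(row_shares * 100, 1))
```

The "No" column of row_shares then gives the share of wrong assignments within each registration group.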

2 Model

We want to compare the performance of the gender assignment algorithm used by X across the aforementioned gender traits of users. To this end, we use logistic regressions of the following form:

\[P(DV_{i} = 1) = \frac{e^{\beta_0 + \sum_{j} \beta_{j} IV_{j,i} + \sum_{n} \gamma_{n} C_{n,i}}}{1 + e^{\beta_0 + \sum_{j} \beta_{j} IV_{j,i} + \sum_{n} \gamma_{n} C_{n,i}}} \]

Where \(DV_{i}\) represents the effectiveness of the classification algorithm for respondent \(i\), \(IV_{j,i}\) represents the traits of interest with coefficients \(\beta_{j}\), and \(C_{n,i}\) represents a vector of demographic variables with coefficients \(\gamma_{n}\), introduced to control the effect of \(IV_{j,i}\) on \(DV_{i}\). Thus, our dependent variable (\(DV_{i}\)) takes the value of 1 if the gender that appears in the participant’s Twitter account is correct and 0 otherwise. A set of four independent variables (\(IV_{j,i}\)) is used. Each captures one dimension of the respondent’s gender (biological sex, gender identity, gender expression, and sexual orientation).
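
Assuming the standard logistic specification, the same model can be written on the log-odds scale, where it is linear in the predictors. Here \(\beta_{j}\) denotes the coefficients on the traits of interest and \(\gamma_{n}\) those on the controls (a notational choice made for this restatement):

\[\log\frac{P(DV_{i} = 1)}{1 - P(DV_{i} = 1)} = \beta_0 + \sum_{j} \beta_{j} IV_{j,i} + \sum_{n} \gamma_{n} C_{n,i}\]

Under this form, a one-unit change in \(IV_{j,i}\), holding the other terms fixed, multiplies the odds of a correct assignment by \(e^{\beta_{j}}\), which is why exponentiated coefficients are commonly read as odds ratios.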

Y <- ifelse(df$twitter_gender_correct == "Yes", TRUE, FALSE)

glm_df <- data.frame(
  Y = Y,
  X_1 = ifelse(df$sex_orientation != 'Straight', 'LGTBIQ+', 'No LGTBIQ+'),
  X_2 = ifelse(df$gender_expression != 'Masculine' & df$gender_expression != 'Feminine', 'LGTBIQ+', 'No LGTBIQ+'),
  X_3 = ifelse(df$gender_identity != 'Male' & df$gender_identity != 'Female', 'LGTBIQ+', 'No LGTBIQ+'),
  X_4 = df$bio_sex,
  C_1 = df$continent,
  C_2 = as.numeric(df$age),
  C_3 = df$ethnicity,
  T_1 = a$x.t,
  U_1 = df$twitter_use_topic,
  U_2 = df$twitter_tweet_topic_most,
  U_3 = df$twitter_tweet_topic_least
)

The dependent variable represents whether the gender assigned by X aligns with the respondent’s self-reported gender. Thus, the variable takes the value 1 when the gender assignment is correct and 0 otherwise. It serves as the primary measure of the algorithm’s performance.

Four key predictors are included in the model, focusing on aspects of gender and sexual identity. In this sense, sexual orientation is represented by a binary variable distinguishing individuals who identify as part of the LGTBIQ+ community from those who identify as straight. Similarly, gender expression is also coded in a binary form, classifying individuals as LGTBIQ+ if their expression is neither traditionally masculine nor feminine. The third predictor focuses on gender identity, categorizing respondents as LGTBIQ+ if they identify outside the binary male/female framework. To complement these variables, a final predictor that records biological sex is included, providing a definite measure of physical attributes such as chromosomes and anatomy.
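
In the coding above, the binary predictors are character strings. R orders factor levels alphabetically by default, so "LGTBIQ+" becomes the reference category and coefficients are reported for "No LGTBIQ+". If the opposite contrast is preferred, the levels can be set explicitly. The following is a minimal sketch on invented values; the object names are hypothetical.

```r
# Invented example answers
orientation <- c("Straight", "Bisexual", "Gay", "Straight")
label <- ifelse(orientation != "Straight", "LGTBIQ+", "No LGTBIQ+")

# Default ordering: levels sorted alphabetically, first level is the reference
x_default <- factor(label)

# Explicit ordering: make "No LGTBIQ+" the reference category
x_releveled <- factor(label, levels = c("No LGTBIQ+", "LGTBIQ+"))

print(levels(x_default))
print(levels(x_releveled))
```

Changing the reference level flips the sign of the corresponding coefficient but does not change the fitted probabilities.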

To ensure the model’s robustness, we included several demographic controls: the respondent’s continent of residence, to account for regional cultural and societal differences; age as a continuous variable, to capture generational effects; and ethnicity, to address potential cultural or demographic variations in gender-related experiences.

Behavioral variables related to X usage further enrich the model by capturing how individuals interact with the platform. We include a measure of the time spent using X, derived through factor analysis, which aggregates multiple dimensions of engagement such as login frequency, tweeting, and content interactions. An additional control identifies the primary topic driving an individual’s use of X, while two further variables record the topics respondents tweet about most and least frequently, respectively.

3 Results

glmFit <- glm(Y ~ ., data = glm_df, family = binomial(link = "logit"))

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(glmFit)

## 
## Call:
## glm(formula = Y ~ ., family = binomial(link = "logit"), data = glm_df)
## 
## Coefficients:
##                                                                         Estimate
## (Intercept)                                                            6.619e-01
## X_1No LGTBIQ+                                                         -1.113e-01
## X_2No LGTBIQ+                                                          3.184e-02
## X_3No LGTBIQ+                                                          3.186e+00
## X_4Intersex (please also tick this if you have an intersex variation)  1.355e+00
## X_4Male                                                                3.921e-01
## C_1Asia                                                                1.314e+00
## C_1Europe                                                              8.803e-01
## C_1North America                                                       4.894e-01
## C_1Oceania                                                             1.693e+01
## C_1South America                                                       1.598e+01
## C_1Unknown                                                             1.624e+01
## C_2                                                                   -2.136e-04
## C_3Black                                                               8.047e-01
## C_3DATA_EXPIRED                                                       -2.296e+00
## C_3Mixed                                                              -8.343e-01
## C_3Other                                                              -1.643e+00
## C_3White                                                              -4.446e-01
## T_1                                                                   -4.621e-02
## U_1Entertainment                                                      -8.129e-01
## U_1Information                                                        -9.626e-01
## U_1Passing time                                                       -9.499e-01
## U_1Professional advancement                                           -1.741e+00
## U_1Self-expression                                                    -9.990e-01
## U_2Job-related                                                        -5.062e-02
## U_2NaN                                                                 1.435e+01
## U_2News and info                                                       2.429e-01
## U_2Other                                                              -1.084e+00
## U_2Politics                                                            3.875e-01
## U_2Sexuality                                                          -1.901e+00
## U_2Social                                                              4.016e-01
## U_3Job-related                                                        -5.323e-01
## U_3NaN                                                                 1.358e+01
## U_3News and info                                                      -8.446e-01
## U_3Other                                                               2.166e-01
## U_3Politics                                                           -3.561e-01
## U_3Sexuality                                                          -7.768e-02
## U_3Social                                                             -7.956e-02
##                                                                       Std. Error
## (Intercept)                                                            1.394e+00
## X_1No LGTBIQ+                                                          3.061e-01
## X_2No LGTBIQ+                                                          4.070e-01
## X_3No LGTBIQ+                                                          3.684e-01
## X_4Intersex (please also tick this if you have an intersex variation)  1.552e+00
## X_4Male                                                                3.929e-01
## C_1Asia                                                                8.906e-01
## C_1Europe                                                              5.012e-01
## C_1North America                                                       5.945e-01
## C_1Oceania                                                             8.450e+02
## C_1South America                                                       7.914e+02
## C_1Unknown                                                             2.197e+03
## C_2                                                                    1.310e-02
## C_3Black                                                               8.929e-01
## C_3DATA_EXPIRED                                                        1.486e+00
## C_3Mixed                                                               8.845e-01
## C_3Other                                                               9.480e-01
## C_3White                                                               8.570e-01
## T_1                                                                    1.293e-01
## U_1Entertainment                                                       1.028e+00
## U_1Information                                                         1.051e+00
## U_1Passing time                                                        1.035e+00
## U_1Professional advancement                                            1.233e+00
## U_1Self-expression                                                     1.153e+00
## U_2Job-related                                                         7.046e-01
## U_2NaN                                                                 1.856e+03
## U_2News and info                                                       3.984e-01
## U_2Other                                                               4.037e-01
## U_2Politics                                                            5.116e-01
## U_2Sexuality                                                           7.444e-01
## U_2Social                                                              3.459e-01
## U_3Job-related                                                         5.260e-01
## U_3NaN                                                                 2.137e+03
## U_3News and info                                                       6.336e-01
## U_3Other                                                               8.329e-01
## U_3Politics                                                            5.384e-01
## U_3Sexuality                                                           5.593e-01
## U_3Social                                                              6.921e-01
##                                                                       z value
## (Intercept)                                                             0.475
## X_1No LGTBIQ+                                                          -0.364
## X_2No LGTBIQ+                                                           0.078
## X_3No LGTBIQ+                                                           8.648
## X_4Intersex (please also tick this if you have an intersex variation)   0.873
## X_4Male                                                                 0.998
## C_1Asia                                                                 1.476
## C_1Europe                                                               1.756
## C_1North America                                                        0.823
## C_1Oceania                                                              0.020
## C_1South America                                                        0.020
## C_1Unknown                                                              0.007
## C_2                                                                    -0.016
## C_3Black                                                                0.901
## C_3DATA_EXPIRED                                                        -1.545
## C_3Mixed                                                               -0.943
## C_3Other                                                               -1.733
## C_3White                                                               -0.519
## T_1                                                                    -0.357
## U_1Entertainment                                                       -0.790
## U_1Information                                                         -0.916
## U_1Passing time                                                        -0.917
## U_1Professional advancement                                            -1.412
## U_1Self-expression                                                     -0.866
## U_2Job-related                                                         -0.072
## U_2NaN                                                                  0.008
## U_2News and info                                                        0.610
## U_2Other                                                               -2.686
## U_2Politics                                                             0.758
## U_2Sexuality                                                           -2.553
## U_2Social                                                               1.161
## U_3Job-related                                                         -1.012
## U_3NaN                                                                  0.006
## U_3News and info                                                       -1.333
## U_3Other                                                                0.260
## U_3Politics                                                            -0.661
## U_3Sexuality                                                           -0.139
## U_3Social                                                              -0.115
##                                                                       Pr(>|z|)
## (Intercept)                                                            0.63494
## X_1No LGTBIQ+                                                          0.71620
## X_2No LGTBIQ+                                                          0.93765
## X_3No LGTBIQ+                                                          < 2e-16
## X_4Intersex (please also tick this if you have an intersex variation)  0.38258
## X_4Male                                                                0.31825
## C_1Asia                                                                0.13996
## C_1Europe                                                              0.07902
## C_1North America                                                       0.41042
## C_1Oceania                                                             0.98401
## C_1South America                                                       0.98389
## C_1Unknown                                                             0.99410
## C_2                                                                    0.98699
## C_3Black                                                               0.36746
## C_3DATA_EXPIRED                                                        0.12226
## C_3Mixed                                                               0.34559
## C_3Other                                                               0.08310
## C_3White                                                               0.60389
## T_1                                                                    0.72078
## U_1Entertainment                                                       0.42932
## U_1Information                                                         0.35977
## U_1Passing time                                                        0.35890
## U_1Professional advancement                                            0.15807
## U_1Self-expression                                                     0.38637
## U_2Job-related                                                         0.94273
## U_2NaN                                                                 0.99383
## U_2News and info                                                       0.54209
## U_2Other                                                               0.00724
## U_2Politics                                                            0.44873
## U_2Sexuality                                                           0.01067
## U_2Social                                                              0.24564
## U_3Job-related                                                         0.31154
## U_3NaN                                                                 0.99493
## U_3News and info                                                       0.18250
## U_3Other                                                               0.79486
## U_3Politics                                                            0.50833
## U_3Sexuality                                                           0.88953
## U_3Social                                                              0.90848
##                                                                          
## (Intercept)                                                              
## X_1No LGTBIQ+                                                            
## X_2No LGTBIQ+                                                            
## X_3No LGTBIQ+                                                         ***
## X_4Intersex (please also tick this if you have an intersex variation)    
## X_4Male                                                                  
## C_1Asia                                                                  
## C_1Europe                                                             .  
## C_1North America                                                         
## C_1Oceania                                                               
## C_1South America                                                         
## C_1Unknown                                                               
## C_2                                                                      
## C_3Black                                                                 
## C_3DATA_EXPIRED                                                          
## C_3Mixed                                                                 
## C_3Other                                                              .  
## C_3White                                                                 
## T_1                                                                      
## U_1Entertainment                                                         
## U_1Information                                                           
## U_1Passing time                                                          
## U_1Professional advancement                                              
## U_1Self-expression                                                       
## U_2Job-related                                                           
## U_2NaN                                                                   
## U_2News and info                                                         
## U_2Other                                                              ** 
## U_2Politics                                                              
## U_2Sexuality                                                          *  
## U_2Social                                                                
## U_3Job-related                                                           
## U_3NaN                                                                   
## U_3News and info                                                         
## U_3Other                                                                 
## U_3Politics                                                              
## U_3Sexuality                                                             
## U_3Social                                                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 724.65  on 1543  degrees of freedom
## Residual deviance: 564.00  on 1506  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: 640
## 
## Number of Fisher Scoring iterations: 16

One of the most striking findings is the significant role of gender identity (X_3) in predicting correct classification. Individuals who identify within the binary categories of male or female are much more likely to have their gender correctly classified by the algorithm, as indicated by the large positive log-odds coefficient (3.186, z = 8.648, p < 2e-16). This points to a bias in the algorithm that favors traditional binary identities and potentially marginalizes non-binary or gender-diverse users. Such a finding raises important questions about the inclusivity of the algorithm and its ability to account for a wider range of gender identities.
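Because the coefficient is on the log-odds scale, exponentiating it gives an odds ratio. The following is a minimal sketch that uses only the two figures reported for X_3 in the summary output (estimate 3.186, z value 8.648). The variable names are illustrative, and the standard error is backed out as estimate / z rather than read from the fitted model object, so the interval is approximate.

```r
# Figures copied from the printed summary for "X_3No LGTBIQ+"
beta_x3 <- 3.186   # log-odds coefficient
z_x3    <- 8.648   # Wald z value

# Assumption: recover the standard error from the rounded estimate and z value
se_x3 <- beta_x3 / z_x3

# Odds ratio and approximate 95% Wald interval on the odds-ratio scale
odds_ratio <- exp(beta_x3)
ci_95 <- exp(beta_x3 + c(-1, 1) * qnorm(0.975) * se_x3)

round(c(odds_ratio = odds_ratio, lower = ci_95[1], upper = ci_95[2]), 1)
```

Under these assumptions the odds of a correct classification are roughly 24 times higher for this group than for the reference category, holding the other predictors fixed.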

The analysis also highlights significant patterns related to users’ tweet topics. Specifically, individuals whose most frequent tweet topics fall into the categories of “Other” or “Sexuality” (U_2) are less likely to have their gender accurately classified. Both categories have negative coefficients that are significant at the 1% level (“Other”: z = -2.686, p = 0.007) and the 5% level (“Sexuality”: z = -2.553, p = 0.011), respectively. This result may indicate limitations in the algorithm’s ability to connect non-standard or nuanced content themes with gender-related traits. It may reflect gaps in the training data or assumptions built into the model that disproportionately affect users discussing these topics.
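The size of these two effects can be recovered from the printed summary, since each Wald z value is the estimate divided by its standard error. The sketch below rebuilds the coefficients, odds ratios, and two-sided p-values for the two U_2 levels from the standard errors and z values shown in the output. The data frame `u2` and its column names are illustrative, and the results carry the rounding of the printed figures.

```r
# Standard errors and z values copied from the summary output
u2 <- data.frame(
  term = c("U_2Other", "U_2Sexuality"),
  se   = c(0.4037, 0.7444),
  z    = c(-2.686, -2.553)
)

# estimate = z * SE, so the coefficient can be recovered from the two columns
u2$estimate   <- u2$z * u2$se
u2$odds_ratio <- exp(u2$estimate)

# Two-sided Wald p-value, which should match the Pr(>|z|) column up to rounding
u2$p_value <- 2 * pnorm(-abs(u2$z))

print(u2, digits = 3)
```

Both odds ratios fall well below 1, which matches the reading that users in these topic categories have lower odds of a correct classification than the U_2 reference category.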

Interestingly, other variables in the model, including sexual orientation (X_1), gender expression (X_2), and biological sex (X_4), as well as demographic controls like age (C_2) and ethnicity (C_3), did not show associations with classification accuracy that are significant at the 5% level. This lack of significance suggests that these characteristics are either less influential or less effectively captured in the algorithm’s decision-making process. Similarly, variables related to the time users spend on X (T_1) and the broader categories of topics they engage with (U_1) did not emerge as meaningful predictors, indicating that general usage patterns may not play a central role in determining classification outcomes. Among the demographic controls, two terms are marginally significant at the 10% level: Europe within the continent of residence (C_1; p = 0.079), with slightly better classification outcomes, and the “Other” category of C_3 (p = 0.083), with a negative z value. Neither result is strong enough to draw definitive conclusions. Several levels, notably U_2NaN and U_3NaN, have standard errors in the thousands and p-values close to 1, and the fit needed 16 Fisher scoring iterations. This pattern typically signals very sparse categories or quasi-complete separation, so those coefficients should not be interpreted. Overall, the model reduces the deviance from 724.65 for the null model (1543 degrees of freedom) to 564.00 (1506 degrees of freedom), a drop of 160.65 for 37 additional parameters, with an AIC of 640. The residual deviance that remains suggests there is room for further improvement in capturing the full complexity of factors at play.
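The overall fit can be summarized from the deviance figures alone. The sketch below computes a likelihood-ratio test against the null model, McFadden's pseudo-R², and a consistency check on the reported AIC. It assumes the response is ungrouped binary data, so that the deviance equals -2 times the log-likelihood. All variable names are illustrative.

```r
# Figures copied from the summary output
null_dev  <- 724.65; null_df  <- 1543
resid_dev <- 564.00; resid_df <- 1506

# Likelihood-ratio test of the full model against the intercept-only model
lr_stat <- null_dev - resid_dev   # reduction in deviance
lr_df   <- null_df - resid_df     # number of added parameters
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# McFadden's pseudo-R^2 (assumes deviance = -2 * log-likelihood)
mcfadden_r2 <- 1 - resid_dev / null_dev

# AIC = residual deviance + 2 * (number of coefficients, including the intercept)
aic_check <- resid_dev + 2 * (lr_df + 1)

c(lr_stat = lr_stat, lr_df = lr_df, mcfadden_r2 = round(mcfadden_r2, 3),
  aic_check = aic_check)
```

The recomputed AIC matches the reported value of 640, which supports the ungrouped-binary assumption. The pseudo-R² of about 0.22 indicates that the predictors account for a meaningful part of the null deviance while leaving most of it unexplained.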

library(rpart.plot)

## Loading required package: rpart

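# Fit a classification tree for correct classification (Y) on all other columns of glm_df
tree <- rpart(as.factor(Y) ~ ., method = "class",
              control = rpart.control(minsplit = 30, cp = 0.001),
              data = glm_df)

# Prune at the complexity parameter with the lowest cross-validated error
best_cp <- tree$cptable[which.min(tree$cptable[, "xerror"]), "CP"]
pfit <- prune(tree, cp = best_cp)

# Plot the unpruned tree; rpart.plot(pfit) would show the pruned version instead
rpart.plot(tree)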
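To see what the pruning step does in isolation, the following self-contained sketch repeats the same fit-then-prune pattern on the `kyphosis` data set that ships with rpart. That data set is only a stand-in for `glm_df`, which is not reproduced here, and the helper `n_leaves` and the `minsplit` value are hypothetical choices for illustration.

```r
library(rpart)

# Cross-validated error (xerror) depends on random fold assignment
set.seed(1)

# Stand-in data: kyphosis is bundled with rpart and is NOT the survey data
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis, method = "class",
             control = rpart.control(minsplit = 10, cp = 0.001))

# Pick the complexity parameter with the lowest cross-validated error
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

# Hypothetical helper: count terminal nodes of an rpart object
n_leaves <- function(x) sum(x$frame$var == "<leaf>")

# Compare tree sizes before and after pruning
c(full = n_leaves(fit), pruned = n_leaves(pruned))

# Inspect the complexity table the cp value was chosen from
printcp(fit)
```

Pruning can only remove splits, so the pruned tree never has more leaves than the full one. Comparing the two counts shows how much of the initial tree the cross-validation supports.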