R Project Report

Author

M

Published

March 24, 2025

R Project: RainbowR Dataset: EU-LGBTI-Survey

The provisional goal of this project is to analyse and visualize how LGBTI people in the EU experience daily life and discrimination, and to visualize possible differences between the experiences of trans* and intersex* people and those of the LGB population in the EU as a whole. For this purpose, datasets from the second FRA (Fundamental Rights Agency) EU LGBTI survey, which was conducted in 2019, will be used. The 2019 dataset can be viewed and downloaded through the GitHub page of rainbowR, a community that aims to connect, support and promote LGBTQIA+ people who use R, as well as through the FRA website.

Elaborations on the dataset and questions

The FRA dataset only gives percentages, not absolute numbers, for each question. In total, 139,799 persons aged 15 years or older who describe themselves as lesbian, gay, bisexual, trans or intersex (LGBTI) completed the online EU-LGBTI II Survey in all EU Member States and the candidate countries of North Macedonia and Serbia. Information on the number of respondents per country could not be found.

For the purposes of this project, i [1] will at times be combining trans* and inter* persons into one statistical category and lesbian, gay and bisexual persons into another. This is mainly for the sake of being able to work with the dataset as it is, and should not be viewed as a statement of opinion that these groups and identities can generally be grouped together, which would (in parts) ignore and erase important differences and cross-identity connections.

Another limitation, not just of this project but of the dataset as a whole, should be noted. The dataset includes some important factors that intersect significantly with sexual and gender identity, and without which these identities cannot be well understood, such as age, education, employment and residence. Many other highly important factors, such as race, (dis)ability and neurodivergence, have not been included.

As stated before, the survey the dataset is based on was sent to and filled out by 139,799 persons in various European countries. Its aim was to gather information about a wide range of issues and topics regarding people within various LGBTI communities in those countries. For this project, i will be focusing on one specific question posed as part of the survey:

Do you avoid certain places or locations for fear of being assaulted, threatened or harassed due to being LGBTI?

Alongside the specific answers to the question, a large array of other factors/aspects of the persons answering were requested and collected. These include:

  • sexual and gender identity (with a separate sheet for trans* identities)
  • age
  • education
  • employment
  • partnered status
  • country of residence
  • type of residence (city, suburbs, town, etc.)
  • openness about being LGBTI [2]

For this project, i will limit my analysis to sexual and gender identity, openness, age, two specific countries (Germany and Italy) and the European average. Most of these limitations are arbitrary and made mainly to keep the project within a manageable scope. i’ve chosen Germany and Italy because they are the two countries in the dataset in which i hold citizenship and, more importantly, to which i feel most connected.

Necessary packages

library(tidyverse) 
library(readxl)
library(viridis)

The most commonly used package in this project will be tidyverse. readxl will be used once to import data and viridis will be used later to aid with visualization.

Import

The files in the GitHub repository are in .xlsx format, but before importing them i’ve converted them to .csv for practical reasons (more specifically, changing the column types later on was a royal pain with read_excel and much easier with read_delim ;) ).

Q2_open <- read_delim("LGBTI-Survey-Data/FRA_EU-T1Q2_open.csv", delim = ";", show_col_types = FALSE) 
print(Q2_open)
# A tibble: 3,153 × 6
   CountryCode question_label            target_group openness answer percentage
   <chr>       <chr>                     <chr>        <chr>    <chr>       <dbl>
 1 Austria     Avoid certain places for… Lesbian wom… Very op… Always          2
 2 Austria     Avoid certain places for… Lesbian wom… Very op… Often          11
 3 Austria     Avoid certain places for… Lesbian wom… Very op… Rarely         42
 4 Austria     Avoid certain places for… Lesbian wom… Very op… Never          45
 5 Austria     Avoid certain places for… Lesbian wom… Very op… Dont …          0
 6 Belgium     Avoid certain places for… Lesbian wom… Very op… Always          6
 7 Belgium     Avoid certain places for… Lesbian wom… Very op… Often          26
 8 Belgium     Avoid certain places for… Lesbian wom… Very op… Rarely         45
 9 Belgium     Avoid certain places for… Lesbian wom… Very op… Never          24
10 Belgium     Avoid certain places for… Lesbian wom… Very op… Dont …          0
# ℹ 3,143 more rows
Q2_age = read_delim("LGBTI-Survey-Data/FRA_EU-T1Q2_age.csv", delim = ";", show_col_types = FALSE)
glimpse(Q2_age)
Rows: 3,396
Columns: 6
$ CountryCode    <chr> "Austria", "Austria", "Austria", "Austria", "Austria", …
$ question_label <chr> "Avoid certain places for fear of being assaulted, thre…
$ target_group   <chr> "Lesbian women", "Lesbian women", "Lesbian women", "Les…
$ age            <chr> "15-17", "15-17", "15-17", "15-17", "15-17", "15-17", "…
$ answer         <chr> "Always", "Often", "Rarely", "Never", "Dont know", "Alw…
$ percentage     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 10, 29, 31, 30,…
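
Because the dataset only contains percentages, one quick plausibility check is whether the answer shares within one group add up to roughly 100. The following is a minimal sketch using only the ten rows printed above for Austria and Belgium (lesbian women, very open), not the full table. The names toy and pct_check are made up for this illustration.

```r
library(dplyr)

# Ten rows copied from the printed tibble above (Austria and Belgium,
# lesbian women, "Very open"); this is not the full dataset.
toy <- tibble(
  CountryCode = rep(c("Austria", "Belgium"), each = 5),
  answer      = rep(c("Always", "Often", "Rarely", "Never", "Dont know"), times = 2),
  percentage  = c(2, 11, 42, 45, 0, 6, 26, 45, 24, 0)
)

# Sum the answer shares per country; values near 100 are plausible
pct_check <- toy |>
  group_by(CountryCode) |>
  summarise(total = sum(percentage), .groups = "drop")

pct_check
```

Austria sums to 100 and Belgium to 101, which suggests that the published percentages are rounded, so totals slightly off 100 are to be expected.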

Transformation

The datasets available on the GitHub page of rainbowR have already been cleaned to an extent, as can be read in their documentation. In the following i will continue working with said datasets (with a few modifications of my own), but for the sake of demonstration i will clean one example dataset from the FRA website:

# In this table, there are multiple things to be cleaned. On import, i'll skip the first two rows and the last eleven, since they contain additional information that we won't need. i'll also standardize NA values by defining ":" in the "percentage" column, which in the original dataset stands for "not available due to small sample size", as a missing value.
FRA_EU_T1Q1_trans_raw <- read_excel("LGBTI-Survey-Data/2025-03-20-xlsx_exportSurveyQuestion-LGBTI-DAGGage-01--15-17-DEXavoid_hands-EN-05--Trans-people.xlsx", skip = 2, n_max = 151, na = ":")

# with the table imported, i'll continue with cleaning the dataset 

FRA_EU_T1Q1_trans_raw_cleaned <- FRA_EU_T1Q1_trans_raw |> 
  select(!2) |> # removing the "question_code" column (position 2)
  select(!7) |> # removing the "Notes" column (position 7 once "question_code" is gone); missing values have already been standardized, and i'm choosing to treat small sample sizes the same as non-small sample sizes for this project
  rename(age = subset, region = CountryCode, question = question_label) # simplifying names

glimpse(FRA_EU_T1Q1_trans_raw_cleaned)
Rows: 151
Columns: 6
$ region       <chr> "Austria", "Austria", "Austria", "Austria", "Austria", "B…
$ question     <chr> "Avoid holding hands in public with same-sex partner for …
$ target_group <chr> "Trans people", "Trans people", "Trans people", "Trans pe…
$ age          <chr> "15-17", "15-17", "15-17", "15-17", "15-17", "15-17", "15…
$ answer       <chr> "Always", "Often", "Rarely", "Never", "Dont know", "Alway…
$ percentage   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
# this dataset is now very similar to the rainbowR ones, with the small differences of names that i will also apply to the pre-cleaned rainbowR tables, as well as differences in column types, which i will also change moving forward.
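
The remaining type difference (percentage is still a character column here) could be resolved with a plain numeric conversion. The sketch below is only an illustration under stated assumptions: raw_text is a hypothetical three-row extract, not FRA data, and toy_fra is a made-up name. It shows how na = ":" turns the FRA placeholder into NA on import and how as.numeric() then converts the column.

```r
library(readr)

# Hypothetical three-row extract; ":" stands in for a suppressed value.
raw_text <- "region;answer;percentage\nAustria;Always;:\nAustria;Often;12\nBelgium;Always;30\n"

# Read everything as character to mimic the <chr> column shown above,
# treating ":" as a missing value
toy_fra <- read_delim(
  I(raw_text),
  delim = ";",
  na = ":",
  col_types = cols(.default = col_character())
)

# Convert the percentage column to numeric; NA stays NA
toy_fra$percentage <- as.numeric(toy_fra$percentage)

str(toy_fra$percentage)
```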

Transformation with rainbowR datasets:

Necessary steps:

  1. adjust column data types

  2. select Germany, Italy and EU-28 values

  3. aggregate LGB and TI groups
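
For step 1, it may help to see why explicit factor levels matter: without them, R orders factor levels alphabetically, which would scramble the answer scale on a plot axis. A small base R sketch (the vector names are illustrative only, the answer labels are the ones used in the dataset):

```r
# The five answer labels, in arbitrary order
answers <- c("Never", "Always", "Rarely", "Often", "Dont know")

# Without explicit levels, factor() sorts the levels alphabetically
default_levels <- levels(factor(answers))

# With explicit levels, the order of the survey scale is preserved
survey_levels  <- c("Always", "Often", "Rarely", "Never", "Dont know")
ordered_levels <- levels(factor(answers, levels = survey_levels))

default_levels
ordered_levels
```

ggplot2 uses the level order for discrete axes and facets, which is why the columns are read in as factors with a fixed order.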

# 1
# on original import, most variables were imported as characters, which is incorrect for those we need as factors, such as "answer", "openness" and "age". So i'll import the datasets again, but this time with the correct data types for those columns. (Technically other columns like "CountryCode" and "target_group" could also be considered to be factors, but since their type as character doesn't interfere with later steps, i've left them as is)

Q2_open_m <- read_delim("LGBTI-Survey-Data/FRA_EU-T1Q2_open.csv", delim = ";",
    col_types = cols(
      openness = col_factor(levels = c("Very open", "Fairly open", "Rarely open", "Never open")),
      answer = col_factor(levels = c("Always", "Often", "Rarely", "Never", "Dont know"))
    ))

unique(Q2_open$openness) #Looking for unique values of openness to put in cols() without having to scroll through the entire table
[1] "Very open"   "Fairly open" "Rarely open" "Never open" 
unique(Q2_open$answer)
[1] "Always"    "Often"     "Rarely"    "Never"     "Dont know"
glimpse(Q2_open_m)
Rows: 3,153
Columns: 6
$ CountryCode    <chr> "Austria", "Austria", "Austria", "Austria", "Austria", …
$ question_label <chr> "Avoid certain places for fear of being assaulted, thre…
$ target_group   <chr> "Lesbian women", "Lesbian women", "Lesbian women", "Les…
$ openness       <fct> Very open, Very open, Very open, Very open, Very open, …
$ answer         <fct> Always, Often, Rarely, Never, Dont know, Always, Often,…
$ percentage     <dbl> 2, 11, 42, 45, 0, 6, 26, 45, 24, 0, 3, 31, 44, 22, 0, N…
Q2_age_m = read_delim("LGBTI-Survey-Data/FRA_EU-T1Q2_age.csv", delim = ";",
    col_types = cols(
      age = col_factor(levels = c("15-17", "18-24", "25-39", "40-54", "55+")),
      answer = col_factor(levels = c("Always", "Often", "Rarely", "Never", "Dont know"))
    ))

unique(Q2_age$age)
[1] "15-17" "18-24" "25-39" "40-54" "55+"  
glimpse(Q2_age_m)
Rows: 3,396
Columns: 6
$ CountryCode    <chr> "Austria", "Austria", "Austria", "Austria", "Austria", …
$ question_label <chr> "Avoid certain places for fear of being assaulted, thre…
$ target_group   <chr> "Lesbian women", "Lesbian women", "Lesbian women", "Les…
$ age            <fct> 15-17, 15-17, 15-17, 15-17, 15-17, 15-17, 15-17, 15-17,…
$ answer         <fct> Always, Often, Rarely, Never, Dont know, Always, Often,…
$ percentage     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 10, 29, 31, 30,…
# 2
# i'll also rename the columns like i did before and since i'm only looking at data for Germany, Italy, and the EU as a whole, i'll filter those values from the dataset

Q2_open_m_eugi <- Q2_open_m |>
  rename(region = CountryCode, question = question_label) |> 
 filter(region %in% c("Germany", "Italy", "EU-28"))

glimpse(Q2_open_m_eugi)
Rows: 340
Columns: 6
$ region       <chr> "Germany", "Germany", "Germany", "Germany", "Germany", "I…
$ question     <chr> "Avoid certain places for fear of being assaulted, threat…
$ target_group <chr> "Lesbian women", "Lesbian women", "Lesbian women", "Lesbi…
$ openness     <fct> Very open, Very open, Very open, Very open, Very open, Ve…
$ answer       <fct> Always, Often, Rarely, Never, Dont know, Always, Often, R…
$ percentage   <dbl> 3, 12, 50, 35, 0, 3, 22, 46, 29, 0, 5, 18, 46, 31, 0, 6, …
Q2_age_m_eugi <- Q2_age_m |>
  rename(region = CountryCode, question = question_label) |> 
 filter(region %in% c("Germany", "Italy", "EU-28"))

glimpse(Q2_age_m_eugi)
Rows: 403
Columns: 6
$ region       <chr> "Germany", "Germany", "Germany", "Germany", "Germany", "I…
$ question     <chr> "Avoid certain places for fear of being assaulted, threat…
$ target_group <chr> "Lesbian women", "Lesbian women", "Lesbian women", "Lesbi…
$ age          <fct> 15-17, 15-17, 15-17, 15-17, 15-17, 15-17, 15-17, 15-17, 1…
$ answer       <fct> Always, Often, Rarely, Never, Dont know, Always, Often, R…
$ percentage   <dbl> 4, 13, 37, 44, 1, 5, 22, 35, 39, 0, 9, 25, 36, 29, 0, 9, …
# 3 
# to be able to later compare the answers of the LGB and TI groups, i'll aggregate their answers and later calculate the average answer for each aggregate

#creating a new variable "LGBTI" that will give the two groups, LGB and TI, corresponding values
Q2_open_m_eugi_agg <- Q2_open_m_eugi |> 
  mutate(LGBTI = case_match(target_group,
            c("Trans people", "Intersex people") ~ "TI",
            c("Gay men", "Lesbian women", "Bisexual women", "Bisexual men") ~ "LGB"))

glimpse(Q2_open_m_eugi_agg)
Rows: 340
Columns: 7
$ region       <chr> "Germany", "Germany", "Germany", "Germany", "Germany", "I…
$ question     <chr> "Avoid certain places for fear of being assaulted, threat…
$ target_group <chr> "Lesbian women", "Lesbian women", "Lesbian women", "Lesbi…
$ openness     <fct> Very open, Very open, Very open, Very open, Very open, Ve…
$ answer       <fct> Always, Often, Rarely, Never, Dont know, Always, Often, R…
$ percentage   <dbl> 3, 12, 50, 35, 0, 3, 22, 46, 29, 0, 5, 18, 46, 31, 0, 6, …
$ LGBTI        <chr> "LGB", "LGB", "LGB", "LGB", "LGB", "LGB", "LGB", "LGB", "…
Q2_age_m_eugi_agg <- Q2_age_m_eugi |> 
  mutate(LGBTI = case_match(target_group,
            c("Trans people", "Intersex people") ~ "TI",
            c("Gay men", "Lesbian women", "Bisexual women", "Bisexual men") ~ "LGB"))
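
One thing worth checking after this step: case_match() returns NA for any value that matches none of the listed labels, so a misspelled or unexpected target_group value would silently drop out of both aggregates. The following is a minimal sketch on made-up rows; "Unlisted label" is a hypothetical value that does not occur in the dataset, and toy, toy_agg and n_unmatched are illustrative names.

```r
library(dplyr)

# Made-up rows; "Unlisted label" is hypothetical and only there to
# show what happens to a value that matches neither group.
toy <- tibble(
  target_group = c("Gay men", "Trans people", "Intersex people",
                   "Lesbian women", "Unlisted label")
)

toy_agg <- toy |>
  mutate(LGBTI = case_match(target_group,
            c("Trans people", "Intersex people") ~ "TI",
            c("Gay men", "Lesbian women", "Bisexual women", "Bisexual men") ~ "LGB"))

# Number of rows that fell into neither aggregate
n_unmatched <- sum(is.na(toy_agg$LGBTI))
n_unmatched
```

Running the same count on the real tables should return 0 if every target group has been assigned.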

Visualization & visualization-connected transformation

To be able to visualize differences between the specific target groups and their additional aspects, i’ll use faceted bar plots in two ways: the first time applied to all target groups separately and the second time using the aggregate groups of “LGB” and “TI”.

#while rendering the final html, i noticed that,
#in contrast to my R session, the first two bar 
#plots in particular were not very legible due 
#to their small render size and many facets. 
#So in the following i will shorten the values 
#of the "answer" and "target_group" variables 
#to one or two letters. i still recommend opening 
#the created images in a new tab to see the full extent. 

# All groups separately, openness subset

#mutate to change names and correct facet order
Q2_open_m_eugi_agg_ord = Q2_open_m_eugi_agg |> 
  mutate(t_g_short = case_match(target_group, "Lesbian women" ~ "Lw", "Bisexual women" ~ "Bw", "Gay men" ~ "Gm", "Bisexual men" ~ "Bm", "Trans people" ~ "Tp", "Intersex people" ~ "Ip")) |> 
  mutate(t_g_short_ord = factor(t_g_short, levels = c("Lw","Bw", "Gm", "Bm", "Tp", "Ip"))) |>
   mutate(answer_short = case_match(answer, "Always" ~ "A", "Often" ~ "O", "Rarely" ~ "R", "Never" ~ "N", "Dont know" ~ "Dn")) |> 
  mutate(answer_short_ord = factor(answer_short, levels = c("A", "O", "R", "N", "Dn")))

Q2_open_m_eugi_agg_ord |> 
  ggplot(mapping = aes(x = answer_short_ord, 
                        y = percentage, 
                        fill = region)) +
  geom_col(width = 0.5, position = position_dodge(1)) + #bar plot with the values of the fill variable (region) positioned next to one another (dodge)
  facet_grid(t_g_short_ord ~ openness, axes = "margins") + #faceting the plot with the "openness" and previously re-organized "t_g_short_ord" variables (referenced by column name, so they are always taken from the plotted data)
  geom_vline(xintercept = c(1.5, 2.5, 3.5, 4.5)) + # adding vertical lines between the axis-elements of the "answer" variable for better legibility 
  ylim(0, 100) + #limiting the y-axis to 100
  scale_fill_viridis(option = "viridis", discrete = TRUE) + 
  theme_dark() + # combining viridis theme for columns with dark theme for grid for better legibility
  labs(title = "Do you avoid certain places or locations for fear of being assaulted,\nthreatened or harassed due to being LGBTI?",
       subtitle = "Year: 2019",
       x = "Answer",
       y = "Percentage",
       fill = "Region",
       caption = "Source: EU Fundamental Rights Agency, LGBTI Survey II") # adding labels

# if one wants to run the original 
#visualization with the full names, simply
#replace the x-axis and facet-grid variables 
#with "answer" and "target_group" and run 
#it in a separate R session. This applies to all 
#other visualizations.



# All groups separately, age subset. See above comments for explanations.

#mutate to change names and correct facet order
Q2_age_m_eugi_agg_ord = Q2_age_m_eugi_agg |> 
  mutate(t_g_short = case_match(target_group, "Lesbian women" ~ "Lw", "Bisexual women" ~ "Bw", "Gay men" ~ "Gm", "Bisexual men" ~ "Bm", "Trans people" ~ "Tp", "Intersex people" ~ "Ip")) |> 
  mutate(t_g_short_ord = factor(t_g_short, levels = c("Lw","Bw", "Gm", "Bm", "Tp", "Ip"))) |>
   mutate(answer_short = case_match(answer, "Always" ~ "A", "Often" ~ "O", "Rarely" ~ "R", "Never" ~ "N", "Dont know" ~ "Dn")) |> 
  mutate(answer_short_ord = factor(answer_short, levels = c("A", "O", "R", "N", "Dn")))

Q2_age_m_eugi_agg_ord |>  
  ggplot(mapping = aes(x = answer_short_ord, 
                        y = percentage, 
                        fill = region)) +
  geom_col(width = 0.5, position = position_dodge(1)) +
  facet_grid(t_g_short_ord ~ age, axes = "margins") +  
  geom_vline(xintercept = c(1.5, 2.5, 3.5, 4.5)) + 
  ylim(0, 100) + 
  scale_fill_viridis(option = "viridis", discrete = TRUE) + 
  theme_dark() + 
  labs(title = "Do you avoid certain places or locations for fear of being assaulted,\nthreatened or harassed due to being LGBTI?",
       subtitle = "Year: 2019",
       x = "Answer",
       y = "Percentage",
       fill = "Region",
       caption = "Source: EU Fundamental Rights Agency, LGBTI Survey II") 

#Missing data points, such as the lack of data for some "Dont know" answers, are prominent in the plots. This may at first seem like a shortcoming, but some absences, such as the lack of data for intersex people aged 55+ in Germany or Italy, give valuable indications of broader societal-historical factors influencing the data. 

# now moving on to visualizing the LGB and TI subsets:

# LGB - TI, age subset


Q2_age_LGBTI_comp = Q2_age_m_eugi_agg_ord |> 
  group_by(region,age,answer,LGBTI) |> # grouping by the relevant variables
  summarise(mpa = round(mean(percentage)), .groups = "drop") # calculating the rounded mean percentage for each group and returning an ungrouped tibble

#mutate to change names 
Q2_age_LGBTI_comp_short = Q2_age_LGBTI_comp |> 
   mutate(answer_short = case_match(answer, "Always" ~ "A", "Often" ~ "O", "Rarely" ~ "R", "Never" ~ "N", "Dont know" ~ "Dn")) |> 
  mutate(answer_short_ord = factor(answer_short, levels = c("A", "O", "R", "N", "Dn")))


# see previous All groups separately, openness subset for comments
Q2_age_LGBTI_comp_short |> 
  ggplot(mapping = aes(x = answer_short_ord,
                       y= mpa,
                       fill = region)) +
  geom_col(width = 0.5, position = position_dodge(1)) + 
  facet_grid(LGBTI ~ age, axes = "all_x") + # showing internal x-axes for better legibility; columns are referenced by name so they come from the plotted data
  geom_vline(xintercept = c(1.5, 2.5, 3.5, 4.5)) + 
  ylim(0, 100) + 
  scale_fill_viridis(option = "viridis", discrete = TRUE) + 
  theme_dark() + 
  labs(title = "Do you avoid certain places or locations for fear of being assaulted,\nthreatened or harassed due to being LGBTI?",
       subtitle = "Year: 2019",
       x = "Answer",
       y = "Average percentage",
       fill = "Region",
       caption = "Source: EU Fundamental Rights Agency, LGBTI Survey II") 

# LGB - TI, openness subset
Q2_open_LGBTI_comp = Q2_open_m_eugi_agg_ord |> 
  group_by(region,openness,answer,LGBTI) |> # grouping by the relevant variables
  summarise(mpa = round(mean(percentage)), .groups = "drop") # calculating the rounded mean percentage for each group and returning an ungrouped tibble

#mutate to change names 
Q2_open_LGBTI_comp_short = Q2_open_LGBTI_comp |> 
   mutate(answer_short = case_match(answer, "Always" ~ "A", "Often" ~ "O", "Rarely" ~ "R", "Never" ~ "N", "Dont know" ~ "Dn")) |> 
  mutate(answer_short_ord = factor(answer_short, levels = c("A", "O", "R", "N", "Dn")))

Q2_open_LGBTI_comp_short |> 
  ggplot(mapping = aes(x = answer_short_ord,
                       y= mpa,
                       fill = region)) +
  geom_col(width = 0.5, position = position_dodge(1)) + 
  facet_grid(LGBTI ~ openness, axes = "all_x") + 
  geom_vline(xintercept = c(1.5, 2.5, 3.5, 4.5)) + 
  ylim(0, 100) + 
  scale_fill_viridis(option = "viridis", discrete = TRUE) + 
  theme_dark() + 
  labs(title = "Do you avoid certain places or locations for fear of being assaulted,\nthreatened or harassed due to being LGBTI?",
       subtitle = "Year: 2019",
       x = "Answer",
       y = "Average percentage",
       fill = "Region",
       caption = "Source: EU Fundamental Rights Agency, LGBTI Survey II") 
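
Two properties of the mpa values used in the LGB and TI plots are worth keeping in mind. First, mean() returns NA as soon as one of the subgroup percentages is missing. Second, the mean is unweighted, so it treats each subgroup as equally large. The base R sketch below illustrates both points with hypothetical numbers; none of the percentages or group sizes are taken from the survey, and the dataset itself does not provide respondent counts.

```r
# Hypothetical percentages for one answer category; not survey values.
trans_pct    <- 20
intersex_pct <- NA   # e.g. not available due to a small sample

# mean() propagates NA unless na.rm = TRUE is set
mean_with_na    <- mean(c(trans_pct, intersex_pct))
mean_without_na <- mean(c(trans_pct, intersex_pct), na.rm = TRUE)

# Hypothetical subgroup percentages and sizes, to contrast an
# unweighted mean with one weighted by the number of respondents
pct <- c(10, 30)
n   <- c(900, 100)

unweighted <- mean(pct)                  # each subgroup counts equally
weighted   <- weighted.mean(pct, w = n)  # larger subgroup counts more

c(mean_with_na, mean_without_na, unweighted, weighted)
```

Since respondent counts per subgroup are not available, the unweighted mean is the only option here, and the aggregated bars are best read as averages over subgroups rather than over persons.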

Conclusion

In conclusion, a short (and not exhaustive) list of correlations that can be observed through the data and visualizations:

  • On average, across openness levels and age groups, TI persons seem to avoid certain places more often than LGB persons, although the differences are not large. Germany and Italy tend to show lower levels of avoidance than the European average. Germany tends to show lower levels of avoidance than Italy, although here, too, the differences are small.

  • Never open LGBTI persons seem to have higher rates of avoidance than persons at other openness levels, on average across regions.

  • Younger LGBTI persons seem to be more split than other age groups between those with high rates of avoidance and those with low rates, on average across regions.

  • Across all target groups, intersex* people aged 55 and older have the highest rate of avoidance in the EU-28, with 25 percent, followed by never open gay men in the EU-28 with 24 percent. The lowest rates of avoidance are found among intersex* people aged 18-24 in Italy, with 62 percent, followed by never open bisexual men in Germany with 55 percent.

As stated before, these are all correlations and mainly descriptive statements; the causes of particular trends in the data cannot be inferred from these observations.

Footnotes

  1. i prefer using the non-capitalized ‘i’ instead of ‘I’ to denote myself for two main reasons: 1. It communicates less significance/reality to the concept of the ‘I’ 2. The form of the ‘i’ in that it isn’t one straight line without disruption is more accurate to how i think-feel about my self and any self. The ‘I’ is not a perfectly straight line with no pauses, disruptions or fragmentation, the ‘i’ is made through fragmentation, disruptions, spaces for connection, pauses and contamination. The ‘i’ gets closer to embodying this reality through its form.

  2. see previous note on intersecting factors and limitations