Data

R Markdown for Andraya Gallant, Cam Sanders, Alivia Frost, and Jasmine Brown

library("tidyverse")

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library("gt")
library("gapminder")
library("srvyr")

## 
## Attaching package: 'srvyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter

library("srvyrexploR")
library("fst")
library("ggridges")
library("plotly")

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

library("DT")
library("gridExtra")

## 
## Attaching package: 'gridExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

library("patchwork")
library("reshape2")

## 
## Attaching package: 'reshape2'
## 
## The following object is masked from 'package:tidyr':
## 
##     smiths

library("finalfit")

# Load complete ESS data (490,555 observations)
ess <- read_fst("All-ESS-Data.fst")

table(ess$hinctnta)

## 
##     1     2     3     4     5     6     7     8     9    10    77    88    99 
## 27483 31409 31414 31134 30205 28657 28297 26624 22420 22099 41851 23623  2987

table(ess$alcfreq)

## 
##    1    2    3    4    5    6    7   77   88   99 
## 2648 6040 7046 5164 3573 6027 9348  221  114    4

ess_clean <- ess %>%
  mutate(
    income = case_when(
      hinctnta == 1 ~ "1st decile",
      hinctnta == 2 ~ "2nd decile",
      hinctnta == 3 ~ "3rd decile",
      hinctnta == 4 ~ "4th decile",
      hinctnta == 5 ~ "5th decile",
      hinctnta == 6 ~ "6th decile",
      hinctnta == 7 ~ "7th decile",
      hinctnta == 8 ~ "8th decile",
      hinctnta == 9 ~ "9th decile",
      hinctnta == 10 ~ "10th decile",
      TRUE ~ NA_character_ #assigning values to the levels of responses
    ), income = factor(income, levels = c("1st decile", "2nd decile", "3rd decile", "4th decile", "5th decile", "6th decile", "7th decile", "8th decile", "9th decile", "10th decile"))) %>%
  mutate(
    alcohol = case_when(
      alcfreq == 1 ~ "Everyday",
      alcfreq == 2 ~ "Several times a week",
      alcfreq == 3 ~ "Once a week",
      alcfreq == 4 ~ "2-3 times a month",
      alcfreq == 5 ~ "Once a month",
      alcfreq == 6 ~ "Less than once a month",
      alcfreq == 7 ~ "Never",
      TRUE ~ NA_character_ # adding values to the levels of responses
    ), alcohol = factor(alcohol, levels = c("Never", "Less than once a month", "Once a month", "2-3 times a month", "Once a week", "Several times a week", "Everyday"))) %>%
  select(cntry, income, alcohol)


ess_clean <- ess_clean %>%
  count(
    income, alcohol, cntry) %>%
  filter(
    !is.na(income), !is.na(alcohol), !is.na(cntry)) #filtering and counting income, alcohol and country.

Our group decided to focus on four European countries that are geographically far apart. We did this to attempt to keep our sample representative of Europe. If we were to pick four neighouring countries, their cultures may be similar or their people may be related. By picking four far apart countries we can try to eliminate some similarities. We chose Slovenia, Finland, Ireland, and Portugal. They are filtered for in that order below.

ess_countries_pct <- ess_clean %>%
  filter(cntry == "SI" | cntry == "FI" | cntry == "IE" | cntry == "PT") %>%
  group_by(cntry) %>% #grouping by country so that the percentages are calculated on the individual countries
  mutate(
    prop = n/sum(n),
    pct = round(100 * prop, 2)) %>%
  mutate(pct = round(pct, digits = 3)) %>%
   mutate(
    alc_consumption = case_when( #recoding the values! By combing certain values, the data will become more concise and easier to interpret.
      alcohol %in% c("Never") ~ "Never", 
      alcohol %in% c("Less than once a month", "Once a month", "2-3 times a month") ~ "Monthly",
      alcohol %in% c("Once a week", "Several times a week") ~ "Weekly",
      alcohol %in% c("Everyday") ~ "Everyday",
      TRUE ~ NA_character_
    ), 
    alc_consumption = factor(alc_consumption, levels = c("Everyday", "Weekly", "Monthly", "Never"))) %>%
   mutate(
    income_class = case_when( #more recoding
      income %in% c("1st decile") ~ "Low Class", 
      income %in% c("2nd decile", "3rd decile", "4th decile", "5th decile") ~ "Lower Middle Class",
      income %in% c("6th decile", "7th decile", "8th decile", "9th decile") ~ "Upper Middle Class",
      income %in% c("10th decile") ~ "High Class",
      TRUE ~ NA_character_
    ),
    income_class = factor(income_class, levels = c("Low Class", "Lower Middle Class", "Upper Middle Class", "High Class")))

ess_plot <- ess_countries_pct %>%
  select(income_class, n, pct, cntry, alc_consumption) %>%
   group_by(cntry, income_class, alc_consumption) %>%
  summarise(across(starts_with("pct"), ~sum(., na.rm = TRUE)))

## `summarise()` has grouped output by 'cntry', 'income_class'. You can override
## using the `.groups` argument.

ess_plot

## # A tibble: 64 × 4
## # Groups:   cntry, income_class [16]
##    cntry income_class       alc_consumption   pct
##    <chr> <fct>              <fct>           <dbl>
##  1 FI    Low Class          Everyday         0.1 
##  2 FI    Low Class          Weekly           1.65
##  3 FI    Low Class          Monthly          4.89
##  4 FI    Low Class          Never            1.91
##  5 FI    Lower Middle Class Everyday         0.56
##  6 FI    Lower Middle Class Weekly           9.89
##  7 FI    Lower Middle Class Monthly         20.8 
##  8 FI    Lower Middle Class Never            6.7 
##  9 FI    Upper Middle Class Everyday         0.77
## 10 FI    Upper Middle Class Weekly          18.9 
## # ℹ 54 more rows

ess_income <- ess_countries_pct %>%
  select(income_class, n, pct, cntry) %>%
   group_by(cntry, income_class) %>%
  summarise(across(starts_with("pct"), ~sum(., na.rm = TRUE))) %>%
  pivot_wider(names_from = income_class, values_from = pct) #pivot wider to make the income class values as column headers, so that the rows are by country.

## `summarise()` has grouped output by 'cntry'. You can override using the
## `.groups` argument.

Standard_DeviationI <- apply(ess_income[2:4], 1, sd, na.rm = TRUE) #Standard deviation of income across columns 2 to 4

ess_income <- cbind(ess_income, Standard_DeviationI) #cbind to stick the standard deviation on the end of the tibble

## New names:
## • `` -> `...6`

colnames(ess_income) <- c("Country", "Low Class", "Lower Middle Class", "Upper Middle Class", "High Class", "Standard Deviation")

ess_income

## # A tibble: 4 × 6
## # Groups:   Country [4]
##   Country `Low Class` `Lower Middle Class` `Upper Middle Class` `High Class`
##   <chr>         <dbl>                <dbl>                <dbl>        <dbl>
## 1 FI             8.55                 37.9                 45.7         7.77
## 2 IE            19.7                  52.9                 24.4         2.98
## 3 PT            11.8                  50.9                 33.6         3.65
## 4 SI            11.4                  51.8                 32.2         4.54
## # ℹ 1 more variable: `Standard Deviation` <dbl>

ess_alcohol <- ess_countries_pct %>%
  select(n, pct, cntry, alc_consumption) %>%
  group_by(cntry, alc_consumption) %>% 
  summarise(across(starts_with("pct"), ~sum(., na.rm = TRUE))) %>%
  pivot_wider(names_from = alc_consumption, values_from = pct) #alcohol consumption as column headers, rows by country.

## `summarise()` has grouped output by 'cntry'. You can override using the
## `.groups` argument.

Standard_DeviationA <- apply(ess_alcohol[2:4], 1, sd, na.rm = TRUE) #Standard deviation of income across columns 2 to 5

ess_alcohol <- cbind(ess_alcohol, Standard_DeviationI) #cbind to stick standard deviation on the end.

## New names:
## • `` -> `...6`

colnames(ess_alcohol) <- c("Country", "Everyday", "Weekly", "Monthly", "Never", "Standard Deviation") #renaming the columns.

ess_alcohol

## # A tibble: 4 × 6
## # Groups:   Country [4]
##   Country Everyday Weekly Monthly Never `Standard Deviation`
##   <chr>      <dbl>  <dbl>   <dbl> <dbl>                <dbl>
## 1 FI          1.58   33.9    52.0  12.5                 19.6
## 2 IE          2.73   40.8    30.3  26.2                 18.0
## 3 PT         21.5    22.5    26.5  29.5                 19.6
## 4 SI          7.07   32.0    41.0  19.9                 20.2

GT table for income. It is concise, and clear with labelling.

ess_income %>%
  gt(rowname_col = "row", groupname_col = "group") %>%
  tab_header(md("*Income Summary Statistics for 4 European Countries*"),
             subtitle = md("**Finland, Ireland, Portugal, and Slovenia**")) %>% 
  tab_options(
    table.border.top.width = 4, #making the border lines thicker
    table.border.bottom.width = 4,
    column_labels.border.bottom.width = 2,
    heading.title.font.size = px(18),
    heading.subtitle.font.size = px(12),
    source_notes.font.size = px(10),
    data_row.padding = px(16) #added extra space between the rows so that it was not cramped.
  ) %>%
  tab_source_note(
    md("**Standard deviation is calculated from income level percentages for each country*")) %>%
  fmt_number(
    `Standard Deviation`, decimals = 3) %>%
  cols_label(
    `Low Class` = md("*Low*"),
    `Lower Middle Class` = md("*Lower Middle*"),
    `Upper Middle Class` = md("*Upper Middle*"),
    `High Class` = md("*High*"),
    Country = md("**Country**"),
    `Standard Deviation` = md("**Standard Deviation**")) %>% #relabelled the column headers
  tab_spanner( #added a column header over the class columns. 
    label = md("**Class (%)**"),
    columns = c(`Low Class`, `Lower Middle Class`, `Upper Middle Class`, `High Class`)) %>%
  text_case_match(
    "FI" ~ "Finland", "IE" ~ "Ireland", "PT" ~ "Portugal", "SI" ~ "Slovenia") %>% #changed the row names to match the actual country names
  cols_align(
    align = "center") %>% #centered the percentages
  tab_style(
    cell_borders(sides = c("top", "bottom"), #more added styling 
                 style = "solid",
                 color = "grey",
                 weight = px(2)),
    locations = list(cells_title(),
                     cells_source_notes(),
                     cells_column_labels()))

Country	Class (%)				Standard Deviation
Income Summary Statistics for 4 European Countries
Finland, Ireland, Portugal, and Slovenia
Country	Low	Lower Middle	Upper Middle	High	Standard Deviation
Finland	8.55	37.93	45.73	7.77	19.606
Ireland	19.73	52.93	24.38	2.98	17.977
Portugal	11.80	50.94	33.59	3.65	19.612
Slovenia	11.42	51.83	32.23	4.54	20.208
*Standard deviation is calculated from income level percentages for each country

Here is the GT table for alcohol! I followed all of the same steps as the gt table above. I wanted to keep the presentation identical.

ess_alcohol %>%
  gt(rowname_col = "row", groupname_col = "group") %>%
  tab_header(md("*Alcohol Consumption Summary Statistics for 4 European Countries*"), subtitle = md("**Finland, Ireland, Portugal, and Slovenia**")) %>%
  tab_source_note(md("**Standard deviation is calculated from alcohol consumption levels for each country*")) %>% #adding a source note on how the standard deviation was calculated.
   tab_options(
    table.border.top.width = 4,
    table.border.bottom.width = 4,
    column_labels.border.bottom.width = 2,
    heading.title.font.size = px(18),
    source_notes.font.size = px(10),
    data_row.padding = px(16)) %>%
  fmt_number(`Standard Deviation`, decimals = 3) %>% #keeping only three decimal places for standard deviation. 
  cols_label(Country = md("**Country**"),
             `Standard Deviation` = md("**Standard Deviation**"),
             Everyday = md("*Everyday*"),
             Weekly = md("*Weekly*"),
             Monthly = md("*Monthly*"),
             Never = md("*Never*")) %>%
  tab_spanner(
    label = md("**Alcohol Consumption (%)**"),
    columns = c("Everyday", "Weekly", "Monthly", "Never")) %>%
  text_case_match(
    "FI" ~ "Finland", "IE" ~ "Ireland", "PT" ~ "Portugal", "SI" ~ "Slovenia") %>%
  cols_align(
    align = "center") %>%
  tab_style(
    cell_borders(sides = c("top", "bottom"),
                 style = "solid",
                 color = "gray",
                 weight = px(2)),
    locations = list(cells_title(),
                     cells_source_notes(),
                     cells_column_labels()))

Country	Alcohol Consumption (%)				Standard Deviation
Alcohol Consumption Summary Statistics for 4 European Countries
Finland, Ireland, Portugal, and Slovenia
Country	Everyday	Weekly	Monthly	Never	Standard Deviation
Finland	1.58	33.90	51.97	12.53	19.606
Ireland	2.73	40.75	30.32	26.22	17.977
Portugal	21.54	22.46	26.48	29.50	19.612
Slovenia	7.07	32.03	41.02	19.90	20.208
*Standard deviation is calculated from alcohol consumption levels for each country

labeller = labeller(cntry = 
    c("FI" = "Finland",
      "IE" = "Ireland",
      "PT" = "Portugal",
      "SI" = "Slovenia"))

Here we have the visualization. I am so pleased with how this turned out. I was able to facet the ess_countries data by country, keeping all four plots together in one chunk of code. Here we can see the relationship between alcohol consumption and income class in a clearer way. Notably, Ireland’s plot is right skewed. Also, Portugal has high percentages of everyday drinkers across the income classes. The stacked bar plot was definitely the way to go for this. I also added in a hover feature using the plotly package. If you hover your mouse over any part of the bars in the plots, you will see the percentage values displayed.

countries_plot <- ggplot(data = ess_plot %>%
           filter(!is.na(alc_consumption), !is.na(income_class), !is.na(pct)), mapping = aes(x = income_class, y = pct, fill = alc_consumption)) +
    geom_bar(position = "stack", stat = "identity",alpha = 0.9, color = "white") +
labs(title = "Relationship between Income and Alcohol in 4 European Countries", x = "Income", y = "Percentage") +  theme_minimal() + 
  scale_fill_viridis_d(option = "D") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~cntry, labeller = labeller) 

ggplotly(countries_plot)

Data set evaluation: Having done a brief exploration of the variables of income and alcohol, I would definitely say that the alcohol one is of better quality. Professor Parker explained to me in class on Tuesday the 11th that respondents self declare their income decile, and that it is corrected afterwards if they are wrong, causing some to fall in the 9.5 range. While I did not see that yet, I am sure I will run into it, and I am not super excited. I think maybe a survey sent out during tax season would help minimize this, because most people would have their tax information in front of them or near by. The fact that income is self declared interferes with the reliability of the information. We could assume the same however for alcohol. People are self declaring how much they drink, how do you know for sure unless you are keeping track of everytime you drink? This data could be accurate, but it could also represent the respondents best or worst views of themselves, or biases. This kind of ties into the data limitations. We can’t really assume that the data is 100% accurate, so we have to account for that in our analysis of the variables.

In regards to sample restrictions, I do think it was a good idea to pick four far apart countries to try and avoid cultural similarities. However, I don’t know how the sample was determined for the survey, so is it possible that we are missing responses from certain groups, or from certain parts of the countries? Of course it’s possible. Who is to say that the sample is truly representative of the larger populations? For example, if interest was gathered through online means, that could limit the responses to only people who have the internet or social medias.

There is a lot of data that I was not able to use. Initially, there was hundreds of thousands of income respondents, and tens of thousands of alcohol respondents. I ended up with 5910. I noticed that if one respondent answered one survey but not the other, I could not use the information. So it makes me wonder how different these plots would look had everyone answered both surveys. Since I am combining variables where there is only a total of 5910 respondents across four countries, the results will not be representative. It is too small of a sample to make any generalization. However, for the purposes of this course, I do think it is still worth looking into. Just looking at the plot above, we can see patterns forming, for example the difference of the “everyday” value between Portugal and Ireland.

Next Steps I am looking forward to continuing the analysis of these variables. As professor Parker suggested, I am planning on adding socioeconomic variables into the research. Namely age, gender, race, and education. I am unsure if I will be adding all four new variables, but time will tell. By expanding to these socioeconomic variables, we will see bigger patterns of alcohol consumption. Who are these people? What sociological influences are affecting their drinking habits? This can be supported through literature, and would definitely be a worthy exploration of data. I cannot speak for the next steps for the literature analysis, or regarding questions. This is explained in our main proposal document.

RPubs link https://rpubs.com/akgallant16743/1270052

Data

2025-02-05