Project 2 Religions Dataset

Author

Theresa Benny

Approach

The third dataset presents survey data showing the distribution of respondents from different religious affiliations across income brackets. Each row represents a religious group, while each column represents a household income category. The values in the table represent the number of respondents within each religion who fall into a particular income bracket.

This dataset is currently structured as a cross-tabulated wide table, where income ranges are stored as column headers and the counts of respondents appear as cell values. While this format is useful for presentation, it is not well suited for analysis because income categories are embedded in the column structure rather than represented as a variable.

The goal of this transformation is to convert the dataset into a tidy structure where each observation represents the number of respondents for a specific religion and income bracket combination.

To preserve the original dataset, I will first recreate the table exactly as shown and store it as a wide-format CSV file. This raw dataset will be committed to my GitHub repository before any transformations are performed.

The dataset contains:

  • Religious affiliation categories

  • Multiple income brackets represented as columns

  • Counts of respondents for each religion–income combination

  • A column representing respondents who declined to report income

The structure is considered wide because income categories appear as separate columns instead of being represented as values within a single variable.

The dataset will first be imported into R and inspected to understand its structure, including column names, row counts, and data types. This step will help confirm that the dataset has been accurately reconstructed from the original table.
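
The inspection step can be sketched as a small set of checks. This is an illustration, not part of the original workflow: `check_wide_table()` is a hypothetical helper, and the toy table reuses only three rows and three columns from the source table.

```r
library(tibble)

# Toy wide table: three rows and three count columns copied from the source table.
religion_wide <- tribble(
  ~Religion,  ~`<$10k`, ~`$10-20k`, ~`Don't know/refused`,
  "Agnostic",       27,         34,                    96,
  "Atheist",        12,         27,                    76,
  "Buddhist",       27,         21,                    54
)

# Hypothetical helper: TRUE only when the table has the expected shape and content.
check_wide_table <- function(df, n_rows, n_cols, id_col = "Religion") {
  count_cols <- setdiff(names(df), id_col)
  counts <- unlist(df[count_cols])
  all(
    nrow(df) == n_rows,                                   # expected number of religions
    ncol(df) == n_cols,                                   # expected number of columns
    anyDuplicated(df[[id_col]]) == 0,                     # one row per religion
    all(vapply(df[count_cols], is.numeric, logical(1))),  # counts parsed as numbers
    !anyNA(counts),                                       # no missing cells
    all(counts >= 0)                                      # counts cannot be negative
  )
}

table_ok <- check_wide_table(religion_wide, n_rows = 3, n_cols = 4)
```

For the full table, the same call would use `n_rows = 14` and `n_cols = 11`, the dimensions reported when the CSV is read.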

Column names will be standardized to ensure consistent formatting. For example, special characters such as dollar signs and hyphens may be simplified so that column names follow a consistent naming convention. This improves readability and makes variables easier to reference in later analysis.
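
One possible way to standardize the names is sketched below. The helper `clean_income_name()` is hypothetical, and its naming convention (`under_`, `over_`, underscores as separators) is an assumption chosen for illustration; the code later in this document only lowercases the names.

```r
library(stringr)

# Hypothetical helper: simplify labels such as "$10-20k" or "Don't know/refused".
clean_income_name <- function(x) {
  x <- str_to_lower(x)
  x <- str_replace(x, "^<\\$?", "under_")       # "<$10k"  -> "under_10k"
  x <- str_replace(x, "^>\\$?", "over_")        # ">$150k" -> "over_150k"
  x <- str_remove_all(x, "[\\$']")              # drop dollar signs and apostrophes
  x <- str_replace_all(x, "[^a-z0-9]+", "_")    # hyphens, spaces, slashes -> "_"
  str_remove_all(x, "^_+|_+$")                  # trim stray underscores
}

cleaned <- clean_income_name(
  c("Religion", "<$10k", "$10-20k", ">$150k", "Don't know/refused")
)
```

A function like this could be passed to `rename_with()`. Names that begin with a digit, such as `10_20k`, would still need backticks when used as column names, which matters less once the brackets become values in a single column.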

The primary structural issue in this dataset is that income brackets are stored as separate columns. These columns will be reshaped into a tidy format where:

  • One column represents the income bracket

  • One column represents the number of respondents

After this transformation, each row will represent a single combination of religion and income bracket.

The resulting dataset will include variables such as:

  • religion

  • income_bracket

  • respondent_count

This structure follows tidy data principles and allows easier grouping, filtering, and comparison across income categories.

The column labeled “Don’t know/refused” represents respondents who did not provide income information. This category will be retained as a separate income group so that missing responses remain documented rather than being removed from the dataset.

Once the dataset has been transformed into tidy format, I will analyze how income distribution varies across religious groups.

Potential analyses include:

  • Comparing income distribution across religious affiliations

  • Identifying which religions have higher proportions of respondents in higher income brackets

  • Calculating the relative share of respondents in each income category

  • Visualizing income distributions across religions using bar charts or stacked charts

These analyses will help illustrate how income levels differ across religious groups and allow clearer comparisons between categories.

Several challenges may arise when working with this dataset:

  1. Income categories are currently embedded within column names and must be converted into a single variable.

  2. Column names include special characters such as dollar signs and ranges that may require standardization.

  3. The “Don’t know/refused” category must be handled carefully so that it is not mistakenly removed during analysis.

  4. Because the data represents counts rather than percentages, additional calculations may be needed to compare distributions across groups.

Addressing these issues will ensure the dataset is properly structured for analysis and visualization.
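
Challenges 3 and 4 can be illustrated together. The sketch below converts counts to within-religion shares in two ways, one that keeps the "don't know/refused" category in the denominator and one that excludes it. The column names `reported`, `share_of_all`, and `share_of_reported` are assumptions introduced for this example; the counts are a subset of the source table.

```r
library(dplyr)
library(tibble)

# Toy tidy data: two groups, three categories each, copied from the source table.
toy_tidy <- tribble(
  ~religion,  ~income_bracket,      ~respondent_count,
  "Agnostic", "<$10k",                27,
  "Agnostic", "$10-20k",              34,
  "Agnostic", "don't know/refused",   96,
  "Catholic", "<$10k",               418,
  "Catholic", "$10-20k",             617,
  "Catholic", "don't know/refused", 1163
)

toy_shares <- toy_tidy %>%
  group_by(religion) %>%
  mutate(
    # Flag rows that carry an actual income bracket.
    reported = income_bracket != "don't know/refused",
    # Share of every respondent in the group, refusals included.
    share_of_all = respondent_count / sum(respondent_count),
    # Share among respondents who reported an income; NA for the refusal row.
    share_of_reported = if_else(
      reported,
      respondent_count / sum(respondent_count[reported]),
      NA_real_
    )
  ) %>%
  ungroup()
```

Keeping both columns leaves the refusal category documented while still allowing comparisons between groups of very different sizes.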

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.1     ✔ tibble    3.3.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
religion_raw <- read_csv("data.csv")
Rows: 14 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): Religion
dbl (10): <$10k, $10-20k, $20-30k, $30-40k, $40-50k, $50-75k, $75-100k, $100...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
glimpse(religion_raw)
Rows: 14
Columns: 11
$ Religion             <chr> "Agnostic", "Atheist", "Buddhist", "Catholic", "E…
$ `<$10k`              <dbl> 27, 12, 27, 418, 575, 9, 228, 20, 19, 289, 34, 23…
$ `$10-20k`            <dbl> 34, 27, 21, 617, 869, 7, 244, 27, 19, 495, 42, 23…
$ `$20-30k`            <dbl> 60, 37, 30, 732, 1064, 9, 236, 24, 25, 619, 37, 1…
$ `$30-40k`            <dbl> 81, 52, 34, 670, 982, 11, 238, 24, 25, 655, 48, 1…
$ `$40-50k`            <dbl> 76, 39, 33, 638, 881, 13, 197, 21, 30, 651, 51, 1…
$ `$50-75k`            <dbl> 137, 81, 58, 1116, 1486, 34, 212, 30, 73, 1107, 1…
$ `$75-100k`           <dbl> 102, 76, 62, 949, 949, 47, 156, 15, 59, 939, 87, …
$ `$100-150k`          <dbl> 109, 59, 39, 792, 723, 48, 156, 11, 87, 792, 96, …
$ `>$150k`             <dbl> 84, 74, 53, 792, 414, 54, 78, 6, 151, 753, 64, 41…
$ `Don't know/refused` <dbl> 96, 76, 54, 1163, 1529, 37, 339, 37, 87, 1096, 10…
head(religion_raw)
# A tibble: 6 × 11
  Religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
1 Agnostic       27        34        60        81        76       137        102
2 Atheist        12        27        37        52        39        81         76
3 Buddhist       27        21        30        34        33        58         62
4 Catholic      418       617       732       670       638      1116        949
5 Evangeli…     575       869      1064       982       881      1486        949
6 Hindu           9         7         9        11        13        34         47
# ℹ 3 more variables: `$100-150k` <dbl>, `>$150k` <dbl>,
#   `Don't know/refused` <dbl>
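# Tidy the dataset. Start by lowercasing every column name, so that
# "Religion" becomes "religion" and "Don't know/refused" becomes "don't know/refused".

religion_clean <- religion_raw %>%
  rename_with(tolower)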

The income brackets are currently stored as separate columns. To make the dataset tidy, these columns will be reshaped into a single income_bracket variable and a respondent_count variable.

religion_tidy <- religion_clean %>%
  pivot_longer(
    cols = -religion,
    names_to = "income_bracket",
    values_to = "respondent_count"
  )

glimpse(religion_tidy)
Rows: 140
Columns: 3
$ religion         <chr> "Agnostic", "Agnostic", "Agnostic", "Agnostic", "Agno…
$ income_bracket   <chr> "<$10k", "$10-20k", "$20-30k", "$30-40k", "$40-50k", …
$ respondent_count <dbl> 27, 34, 60, 81, 76, 137, 102, 109, 84, 96, 12, 27, 37…
head(religion_tidy)
# A tibble: 6 × 3
  religion income_bracket respondent_count
  <chr>    <chr>                     <dbl>
1 Agnostic <$10k                        27
2 Agnostic $10-20k                      34
3 Agnostic $20-30k                      60
4 Agnostic $30-40k                      81
5 Agnostic $40-50k                      76
6 Agnostic $50-75k                     137
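
A reshape like this can be sanity-checked by confirming that no rows or counts were lost. The sketch below is an illustration on a toy subset of the source table (three religions, three brackets); the object names `toy_wide`, `toy_long`, `rows_ok`, and `totals_ok` are assumptions, not part of the original analysis.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy wide table copied from the first rows and columns of the source table.
toy_wide <- tribble(
  ~religion,  ~`<$10k`, ~`$10-20k`, ~`$20-30k`,
  "Agnostic",       27,         34,         60,
  "Atheist",        12,         27,         37,
  "Buddhist",       27,         21,         30
)

toy_long <- toy_wide %>%
  pivot_longer(
    cols = -religion,
    names_to = "income_bracket",
    values_to = "respondent_count"
  )

# The long table should have one row per religion and income column.
rows_ok <- nrow(toy_long) == nrow(toy_wide) * (ncol(toy_wide) - 1)

# Per-religion totals should match the row sums of the wide table.
wide_totals <- rowSums(select(toy_wide, -religion))
long_totals <- toy_long %>%
  group_by(religion) %>%
  summarise(total = sum(respondent_count)) %>%
  arrange(match(religion, toy_wide$religion)) %>%
  pull(total)
totals_ok <- all(wide_totals == long_totals)
```

On the full dataset the same logic gives 14 × 10 = 140 rows, which matches the `glimpse()` output above.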

Now that this is tidy and in long format, let’s begin our analysis.

religion_summary <- religion_tidy %>%
  group_by(religion) %>%
  summarise(total_respondents = sum(respondent_count))

religion_summary
# A tibble: 14 × 2
   religion                      total_respondents
   <chr>                                     <dbl>
 1 Agnostic                                    806
 2 Atheist                                     533
 3 Buddhist                                    411
 4 Catholic                                   7887
 5 Evangelical                                9472
 6 Hindu                                       269
 7 Historically Black Protestant              2084
 8 Jehovah's Witness                           215
 9 Jewish                                      575
10 Mainline Protestant                        7396
11 Mormon                                      674
12 Muslim                                      279
13 Orthodox                                    285
14 Unaffiliated                               4674

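Next, let’s visualize the income distribution. The income brackets are first converted to a factor with explicit levels so that they plot in ascending order rather than alphabetically, and the chart is limited to the five largest groups to keep it readable.

# The levels must match the lowercased column names exactly;
# any value without a matching level would become NA.
religion_tidy <- religion_tidy %>%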
  mutate(
    income_bracket = factor(
      income_bracket,
      levels = c(
        "<$10k",
        "$10-20k",
        "$20-30k",
        "$30-40k",
        "$40-50k",
        "$50-75k",
        "$75-100k",
        "$100-150k",
        ">$150k",
        "don't know/refused"
      )
    )
  )



# Keep the five groups with the most respondents overall.
top_religions <- religion_tidy %>%
  group_by(religion) %>%
  summarise(total_respondents = sum(respondent_count, na.rm = TRUE)) %>%
  slice_max(order_by = total_respondents, n = 5)

religion_plot_data <- religion_tidy %>%
  filter(religion %in% top_religions$religion)

ggplot(religion_plot_data, aes(x = income_bracket, y = respondent_count, fill = religion)) +
  geom_col(position = "dodge") +
  labs(
    title = "Income Distribution for the Five Largest Religious Groups",
    x = "Income Bracket",
    y = "Number of Respondents",
    fill = "Religion"
  ) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

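This visualization shows how respondents from the five largest religious groups are distributed across income brackets. Among the reported brackets, the Catholic, Evangelical, and Mainline Protestant groups have their highest counts at $50–75k. That bracket is wider than the $10k-wide brackets below it, so part of the peak reflects bracket width rather than a true concentration. The “Don’t know/refused” category is also large: it is the single largest category for both Catholic and Evangelical respondents. Evangelical and Catholic respondents appear most frequently across the income categories, which is expected because they are the two largest groups (9,472 and 7,887 respondents), while Historically Black Protestant respondents appear less frequently because that group is much smaller (2,084 respondents). Because the chart plots raw counts, it mostly reflects group size; comparing income distributions between groups requires converting the counts to within-group shares.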
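
One way to compare distributions rather than group sizes is a stacked chart scaled to 100% per group. The sketch below is a hedged illustration on a small subset of the source counts (two religions, three brackets); `toy_tidy` and `share_plot` are names introduced here, and the choice of `position = "fill"` is one option among several.

```r
library(dplyr)
library(ggplot2)
library(tibble)

# Toy subset of the source table: two religions, three income brackets.
toy_tidy <- tribble(
  ~religion,     ~income_bracket, ~respondent_count,
  "Catholic",    "<$10k",           418,
  "Catholic",    "$50-75k",        1116,
  "Catholic",    ">$150k",          792,
  "Evangelical", "<$10k",           575,
  "Evangelical", "$50-75k",        1486,
  "Evangelical", ">$150k",          414
) %>%
  # Fix the bracket order so the stack runs from low to high income.
  mutate(
    income_bracket = factor(income_bracket, levels = c("<$10k", "$50-75k", ">$150k"))
  )

# position = "fill" rescales each bar to sum to 1, so bars show shares, not counts.
share_plot <- ggplot(
  toy_tidy,
  aes(x = religion, y = respondent_count, fill = income_bracket)
) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::label_percent()) +
  labs(x = "Religion", y = "Share of respondents", fill = "Income bracket")
```

Calling `print(share_plot)` draws the chart. Every bar has the same height, so differences in group size no longer dominate the picture.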