Project 2

Load the libraries to use:

library(dplyr)
library(tidyr)
library(tidyverse)
library(magrittr)

Dataset 1

I chose to use my own dataset for this assignment.

Let’s simulate the data:

set.seed(123)

# Create sample data
df <- data.frame(
  id = 1:5,
  gender = sample(c("Male", "Female"), 5, replace = TRUE),
  age = sample(18:65, 5, replace = TRUE),
  weight = sample(120:220, 5, replace = TRUE),
  height = sample(150:200, 5, replace = TRUE),
  systolic_1 = sample(100:140, 5, replace = TRUE),
  diastolic_1 = sample(60:90, 5, replace = TRUE),
  systolic_2 = sample(100:140, 5, replace = TRUE),
  diastolic_2 = sample(60:90, 5, replace = TRUE),
  smoking_status = sample(c("Never smoked", "Current smoker", "Former smoker"), 5, replace = TRUE)
)

# Make some of the data missing
df[sample(1:5, 3), 5:8] <- NA
df[sample(1:5, 2), 10] <- NA

# write the data to a csv file:
write.csv(
  x = df,
  file = "blood_pressure_and_demographics.csv",
  row.names = FALSE
)

Read the data into R:

blood_pressure_and_demographics <- read.csv(
  file = "blood_pressure_and_demographics.csv"
)

# look at the first 10 rows:
blood_pressure_and_demographics

##   id gender age weight height systolic_1 diastolic_1 systolic_2 diastolic_2
## 1  1   Male  59    209     NA         NA          NA         NA          68
## 2  2   Male  60    210     NA         NA          NA         NA          69
## 3  3   Male  54    188    178        108          70        131          82
## 4  4 Female  31    210    184        118          66        106          86
## 5  5   Male  42    176     NA         NA          NA         NA          87
##   smoking_status
## 1   Never smoked
## 2  Former smoker
## 3   Never smoked
## 4           <NA>
## 5           <NA>

Time to tidy and transform the data.

First, reshape the dataset into long format:

df_long <- blood_pressure_and_demographics %>%
  pivot_longer(cols = c(systolic_1, diastolic_1, systolic_2, diastolic_2),
               names_to = c(".value", "visit_number"),
               names_sep = "_") %>%
  mutate(visit_number = as.numeric(visit_number)) %>%
  arrange(id, visit_number)

df_long

## # A tibble: 10 × 9
##       id gender   age weight height smoking_status visit_number systolic
##    <int> <chr>  <int>  <int>  <int> <chr>                 <dbl>    <int>
##  1     1 Male      59    209     NA Never smoked              1       NA
##  2     1 Male      59    209     NA Never smoked              2       NA
##  3     2 Male      60    210     NA Former smoker             1       NA
##  4     2 Male      60    210     NA Former smoker             2       NA
##  5     3 Male      54    188    178 Never smoked              1      108
##  6     3 Male      54    188    178 Never smoked              2      131
##  7     4 Female    31    210    184 <NA>                      1      118
##  8     4 Female    31    210    184 <NA>                      2      106
##  9     5 Male      42    176     NA <NA>                      1       NA
## 10     5 Male      42    176     NA <NA>                      2       NA
## # ℹ 1 more variable: diastolic <int>

Calculate the mean systolic and diastolic blood pressure for each individual and visit.

To do that, group the dataframe by id and visit_number and summarize each group via mean.

Finally drop the groups.

bp_means <- df_long %>%
  group_by(id, visit_number) %>%
  summarize(
    mean_systolic = mean(systolic),
    mean_diastolic = mean(diastolic),
    n_obs = n(),
    .groups = "drop"
  )

bp_means

## # A tibble: 10 × 5
##       id visit_number mean_systolic mean_diastolic n_obs
##    <int>        <dbl>         <dbl>          <dbl> <int>
##  1     1            1            NA             NA     1
##  2     1            2            NA             68     1
##  3     2            1            NA             NA     1
##  4     2            2            NA             69     1
##  5     3            1           108             70     1
##  6     3            2           131             82     1
##  7     4            1           118             66     1
##  8     4            2           106             86     1
##  9     5            1            NA             NA     1
## 10     5            2            NA             87     1

Join the blood pressure means back to the original dataset by id and visit_number:

df_clean <- df_long %>%
  select(-c(systolic, diastolic)) %>%
  left_join(bp_means, by = c("id", "visit_number"))

df_clean

## # A tibble: 10 × 10
##       id gender   age weight height smoking_status visit_number mean_systolic
##    <int> <chr>  <int>  <int>  <int> <chr>                 <dbl>         <dbl>
##  1     1 Male      59    209     NA Never smoked              1            NA
##  2     1 Male      59    209     NA Never smoked              2            NA
##  3     2 Male      60    210     NA Former smoker             1            NA
##  4     2 Male      60    210     NA Former smoker             2            NA
##  5     3 Male      54    188    178 Never smoked              1           108
##  6     3 Male      54    188    178 Never smoked              2           131
##  7     4 Female    31    210    184 <NA>                      1           118
##  8     4 Female    31    210    184 <NA>                      2           106
##  9     5 Male      42    176     NA <NA>                      1            NA
## 10     5 Male      42    176     NA <NA>                      2            NA
## # ℹ 2 more variables: mean_diastolic <dbl>, n_obs <int>

The final output is a cleaned and transformed dataset that is ready for downstream analysis.I chose to use my own dataset for this assignment.

Let’s simulate the data:

set.seed(123)

# Create sample data
df <- data.frame(
  id = 1:5,
  gender = sample(c("Male", "Female"), 5, replace = TRUE),
  age = sample(18:65, 5, replace = TRUE),
  weight = sample(120:220, 5, replace = TRUE),
  height = sample(150:200, 5, replace = TRUE),
  systolic_1 = sample(100:140, 5, replace = TRUE),
  diastolic_1 = sample(60:90, 5, replace = TRUE),
  systolic_2 = sample(100:140, 5, replace = TRUE),
  diastolic_2 = sample(60:90, 5, replace = TRUE),
  smoking_status = sample(c("Never smoked", "Current smoker", "Former smoker"), 5, replace = TRUE)
)

# Make some of the data missing
df[sample(1:5, 3), 5:8] <- NA
df[sample(1:5, 2), 10] <- NA

# write the data to a csv file:
write.csv(
  x = df,
  file = "blood_pressure_and_demographics.csv",
  row.names = FALSE
)

Read the data into R:

blood_pressure_and_demographics <- read.csv(
  file = "blood_pressure_and_demographics.csv"
)

# look at the first 10 rows:
blood_pressure_and_demographics

##   id gender age weight height systolic_1 diastolic_1 systolic_2 diastolic_2
## 1  1   Male  59    209     NA         NA          NA         NA          68
## 2  2   Male  60    210     NA         NA          NA         NA          69
## 3  3   Male  54    188    178        108          70        131          82
## 4  4 Female  31    210    184        118          66        106          86
## 5  5   Male  42    176     NA         NA          NA         NA          87
##   smoking_status
## 1   Never smoked
## 2  Former smoker
## 3   Never smoked
## 4           <NA>
## 5           <NA>

Time to tidy and transform the data.

First, reshape the dataset into long format:

df_long <- blood_pressure_and_demographics %>%
  pivot_longer(cols = c(systolic_1, diastolic_1, systolic_2, diastolic_2),
               names_to = c(".value", "visit_number"),
               names_sep = "_") %>%
  mutate(visit_number = as.numeric(visit_number)) %>%
  arrange(id, visit_number)

df_long

## # A tibble: 10 × 9
##       id gender   age weight height smoking_status visit_number systolic
##    <int> <chr>  <int>  <int>  <int> <chr>                 <dbl>    <int>
##  1     1 Male      59    209     NA Never smoked              1       NA
##  2     1 Male      59    209     NA Never smoked              2       NA
##  3     2 Male      60    210     NA Former smoker             1       NA
##  4     2 Male      60    210     NA Former smoker             2       NA
##  5     3 Male      54    188    178 Never smoked              1      108
##  6     3 Male      54    188    178 Never smoked              2      131
##  7     4 Female    31    210    184 <NA>                      1      118
##  8     4 Female    31    210    184 <NA>                      2      106
##  9     5 Male      42    176     NA <NA>                      1       NA
## 10     5 Male      42    176     NA <NA>                      2       NA
## # ℹ 1 more variable: diastolic <int>

Calculate the mean systolic and diastolic blood pressure for each individual and visit.

To do that, group the dataframe by id and visit_number and summarize each group via mean.

Finally drop the groups.

bp_means <- df_long %>%
  group_by(id, visit_number) %>%
  summarize(
    mean_systolic = mean(systolic),
    mean_diastolic = mean(diastolic),
    n_obs = n(),
    .groups = "drop"
  )

bp_means

## # A tibble: 10 × 5
##       id visit_number mean_systolic mean_diastolic n_obs
##    <int>        <dbl>         <dbl>          <dbl> <int>
##  1     1            1            NA             NA     1
##  2     1            2            NA             68     1
##  3     2            1            NA             NA     1
##  4     2            2            NA             69     1
##  5     3            1           108             70     1
##  6     3            2           131             82     1
##  7     4            1           118             66     1
##  8     4            2           106             86     1
##  9     5            1            NA             NA     1
## 10     5            2            NA             87     1

Join the blood pressure means back to the original dataset by id and visit_number:

df_clean <- df_long %>%
  select(-c(systolic, diastolic)) %>%
  left_join(bp_means, by = c("id", "visit_number"))

df_clean

## # A tibble: 10 × 10
##       id gender   age weight height smoking_status visit_number mean_systolic
##    <int> <chr>  <int>  <int>  <int> <chr>                 <dbl>         <dbl>
##  1     1 Male      59    209     NA Never smoked              1            NA
##  2     1 Male      59    209     NA Never smoked              2            NA
##  3     2 Male      60    210     NA Former smoker             1            NA
##  4     2 Male      60    210     NA Former smoker             2            NA
##  5     3 Male      54    188    178 Never smoked              1           108
##  6     3 Male      54    188    178 Never smoked              2           131
##  7     4 Female    31    210    184 <NA>                      1           118
##  8     4 Female    31    210    184 <NA>                      2           106
##  9     5 Male      42    176     NA <NA>                      1            NA
## 10     5 Male      42    176     NA <NA>                      2            NA
## # ℹ 2 more variables: mean_diastolic <dbl>, n_obs <int>

The final output is a cleaned and transformed dataset that is ready for downstream analysis.

Dataset 2

Read the data into R:

dfSchool <- read.csv(file = "school_diversity.csv")

str(dfSchool)

## 'data.frame':    27944 obs. of  16 variables:
##  $ X           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ LEAID       : int  100002 100005 100005 100006 100006 100007 100007 100008 100011 100012 ...
##  $ LEA_NAME    : chr  "alabama youth services" "albertville city" "albertville city" "marshall county" ...
##  $ ST          : chr  "AL" "AL" "AL" "AL" ...
##  $ d_Locale_Txt: chr  NA "town-distant" "town-distant" "rural-distant" ...
##  $ SCHOOL_YEAR : chr  "1994-1995" "1994-1995" "2016-2017" "1994-1995" ...
##  $ AIAN        : num  0 0 0.294 0.104 0.492 ...
##  $ Asian       : num  0.589 0.321 0.551 0.134 0.299 ...
##  $ Black       : num  71.709 1.283 3.194 0.373 1.073 ...
##  $ Hispanic    : num  0.196 4.522 46.741 0.909 21.294 ...
##  $ White       : num  27.5 93.9 46.8 98.5 75.8 ...
##  $ Multi       : num  NA NA 2.44 NA 1.04 ...
##  $ Total       : int  509 3118 5447 6707 5687 7671 13938 10440 1973 2389 ...
##  $ diverse     : chr  "Diverse" "Extremely undiverse" "Diverse" "Extremely undiverse" ...
##  $ variance    : num  NA NA 0.0116 NA NA ...
##  $ int_group   : chr  NA NA "Highly integrated" NA ...

Cleaning and transforming data

# we are interested in getting the counts of the students' race
dfSchool2 <- dfSchool %>%
  mutate_all(~replace(., is.na(.), 0)) %>%
  filter(Total > 100) %>%
  mutate(Asian_Pop = Asian / 100 * Total,
         Black_Pop = Black / 100 * Total,
         Hispanic_Pop = Hispanic / 100 * Total,
         White_Pop = White / 100 * Total,
         Multi_Pop = Multi / 100 * Total,
         ) %>%
  group_by(LEA_NAME, ST) %>%
  summarize(Asian = mean(Asian_Pop),
            Black = mean(Black_Pop),
            Hispanic = mean(Hispanic_Pop),
            White = mean(White_Pop),
            Multi = mean(Multi_Pop)
            )

head(dfSchool2)

## # A tibble: 6 × 7
## # Groups:   LEA_NAME [6]
##   LEA_NAME                        ST    Asian   Black Hispanic White Multi
##   <chr>                           <chr> <dbl>   <dbl>    <dbl> <dbl> <dbl>
## 1 a e r o  spec educ coop         IL     0       8        8     120   0   
## 2 a-c central cusd 262            IL     0       2.50     6.00  489   4.00
## 3 a.c.g.c. public school district MN     4.00    1.00    66     721  12.0 
## 4 abbeville 60                    SC     7    1355.      25.5  1974  32.5 
## 5 abbotsford school district      WI     2       7.00   181.    504   2.5 
## 6 abbott isd                      TX     0       4       25.0   248.  3.50

Comparing race ration in schools

dfSchool4 <- dfSchool %>%
  mutate_all(~replace(., is.na(.), 0)) %>%
  filter(Total > 100) %>%
  mutate(Asian = Asian / 100,
         Black = Black / 100,
         Hispanic = Hispanic / 100,
         White = White / 100,
         Multi = Multi / 100,
         ) %>%
  group_by(ST) %>%
  summarize(Asian = mean(Asian),
            Black = mean(Black),
            Hispanic = mean(Hispanic),
            White = mean(White),
            Multi = mean(Multi)) %>%
  pivot_longer(!ST, names_to = 'Ethnicity', values_to = 'Ratio')

ggplot(data = dfSchool4, aes(x = Ethnicity, y = Ratio, fill = Ethnicity)) +
  geom_boxplot() +
  labs(title = 'Ethnicity Ratio Distribution in Schools')

While the other racial ratios are more equal, we can observe that the White student ratio has the largest overall spread by far.

DataSet 3

Read the data into R:

religion <- read.csv(file = "income_and_religion.csv")

religion

##                   religion X..10k X.10.20k X.20.30k X.30.40k X.40.50k X.50.75k
## 1                 Agnostic     27       34       60       81       76      137
## 2                  Atheist     12       27       37       52       35       70
## 3                 Buddhist     27       21       30       34       33       58
## 4                 Catholic    418      617      732      670      638     1116
## 5       Don’t know/refused     15       14       15       11       10       35
## 6         Evangelical Prot    575      869     1064      982      881     1486
## 7                    Hindu      1        9        7        9       11       34
## 8  Historically Black Prot    228      244      236      238      197      223
## 9        Jehovah's Witness     20       27       24       24       21       30
## 10                  Jewish     19       19       25       25       30       95
##     X
## 1  NA
## 2  NA
## 3  NA
## 4  NA
## 5  NA
## 6  NA
## 7  NA
## 8  NA
## 9  NA
## 10 NA

Tidying the Data:

To tidy the data, I removed unnecessary variables, changed the names, and then changed the wide data set into a longer data set by grouping the income levels as the income variable and assigned the values to frequency. Then I changed the income into a factor and ordered the levels.

religion <- religion %>%
  select(-X) %>%
  set_colnames(c("religion", "<$10k","$10-20k", "$20-30k", "$30-40k", "$40-50k", "$50-75k")) %>%
  pivot_longer(2:7, names_to = "income", values_to = "frequency")

# factoring the income variable
religion$income <- religion$income %>%
  as.factor() %>%
  relevel("<$10k")

head(religion)

## # A tibble: 6 × 3
##   religion income  frequency
##   <chr>    <fct>       <int>
## 1 Agnostic <$10k          27
## 2 Agnostic $10-20k        34
## 3 Agnostic $20-30k        60
## 4 Agnostic $30-40k        81
## 5 Agnostic $40-50k        76
## 6 Agnostic $50-75k       137

Calculations and Graphs:

Based on their religion, I was able to determine the percentage of persons in each economic bracket. After that, I created a bar graph to display the percentages of revenue for each religious organization. I also discovered the average for every income bracket, which might be biased because some had higher frequencies than others, and we’re not sure if it makes sense for the overall population.For each faith, I also discovered the largest wealth category. Finally, I made an effort to identify the income bracket that corresponds to each faith group’s median value. I determined the cumulative sum of the income levels and divided the total number of members of each religious group by two to find the median. Next, I found the income group that contains the median value by comparing it to its cumulative sum.

# finding the proportion of each income for each religion
religion <- religion %>%
  group_by(religion) %>%
  mutate(count = sum(frequency)) %>%
  group_by(religion, income) %>%
  mutate(proportion = frequency / count,
         label = round(proportion * 100, 2))

# bar graph of the proportions
ggplot(religion) + 
  geom_bar(aes(x = religion, y = proportion, fill = income), stat = "identity", 
           position = position_fill(reverse = TRUE)) +
  geom_text(aes(x = religion, y = proportion, label = label, group = income),
            position = position_stack(vjust = .5, reverse = TRUE), size = 3) +
  coord_flip() +
  ggtitle("Distribution of Income Levels per Religions")

# finding the average of each income group
religion %>%
  group_by(income) %>%
  summarise(average = mean(frequency)) %>%
  arrange(desc(average))

## # A tibble: 6 × 2
##   income  average
##   <fct>     <dbl>
## 1 $50-75k    328.
## 2 $20-30k    223 
## 3 $30-40k    213.
## 4 $40-50k    193.
## 5 $10-20k    188.
## 6 <$10k      134.

# finding the biggest income group for each religion
religion %>%
  group_by(religion) %>%
  select(-label, - count) %>%
  mutate(max = max(proportion)) %>%
  filter(proportion == max) %>%
  select(-max) %>%
  arrange(desc(proportion))

## # A tibble: 10 × 4
## # Groups:   religion [10]
##    religion                income  frequency proportion
##    <chr>                   <fct>       <int>      <dbl>
##  1 Hindu                   $50-75k        34      0.479
##  2 Jewish                  $50-75k        95      0.446
##  3 Don’t know/refused      $50-75k        35      0.35 
##  4 Agnostic                $50-75k       137      0.330
##  5 Atheist                 $50-75k        70      0.300
##  6 Buddhist                $50-75k        58      0.286
##  7 Catholic                $50-75k      1116      0.266
##  8 Evangelical Prot        $50-75k      1486      0.254
##  9 Jehovah's Witness       $50-75k        30      0.205
## 10 Historically Black Prot $10-20k       244      0.179

# finding the median value
religion %>%
  group_by(religion) %>%
  select(-label, - proportion) %>%
  mutate(median = count / 2,
         sum = cumsum(frequency)) %>%
  select(-count, -frequency) %>%
  filter(sum >= median) %>%
  mutate(minsum = min(sum)) %>%
  filter(minsum == sum) %>%
  select(-minsum) %>%
  arrange(desc(income))

## # A tibble: 10 × 4
## # Groups:   religion [10]
##    religion                income  median   sum
##    <chr>                   <fct>    <dbl> <int>
##  1 Agnostic                $40-50k  208.    278
##  2 Hindu                   $40-50k   35.5    37
##  3 Jewish                  $40-50k  106.    118
##  4 Atheist                 $30-40k  116.    128
##  5 Buddhist                $30-40k  102.    112
##  6 Catholic                $30-40k 2096.   2437
##  7 Don’t know/refused      $30-40k   50      55
##  8 Evangelical Prot        $30-40k 2928.   3490
##  9 Jehovah's Witness       $30-40k   73      95
## 10 Historically Black Prot $20-30k  683     708

Conclusion

The largest income category for all faith groups, according to the graph and chart, is between $50 and $75k, with the exception of historically Black Protestants, who are primarily in the $10–20k range. Interestingly, the percentage of Hindu and Jewish individuals earning between $50,000 and $75,000 is over 40%, whereas the percentage for Evangelical Protestant, Catholic, and Buddhist groupings is only under 25%. Additionally, it is clear that the group with the highest combined income is $50–75k, followed by $20–30k.

Given that income is not regularly distributed, the median should be examined rather than the average. For every religion group, I was also able to determine the median income group. In order to accomplish this, I had to first get the median frequency value for each group before calculating the cumulative total of the frequencies for each income bracket. When the median value exceeded the cumulative sum, it indicated that the median value would belong to that particular income category. With the exception of Agnostic, Hindu, and Jewish people, who fell into the $40–50k income range for their median group, and Historically Black Protestants, who fell into the $20–30k income range for their median group, the majority of the religious group fell into the $30–45k income range.

Project 2

Mohammed Rahman

2024-03-01

Dataset 1

Dataset 2

DataSet 3