Load the libraries to use:
library(dplyr)
library(tidyr)
library(tidyverse)
library(magrittr)
I chose to use my own dataset for this assignment.
Let’s simulate the data:
set.seed(123)
# Create sample data
df <- data.frame(
id = 1:5,
gender = sample(c("Male", "Female"), 5, replace = TRUE),
age = sample(18:65, 5, replace = TRUE),
weight = sample(120:220, 5, replace = TRUE),
height = sample(150:200, 5, replace = TRUE),
systolic_1 = sample(100:140, 5, replace = TRUE),
diastolic_1 = sample(60:90, 5, replace = TRUE),
systolic_2 = sample(100:140, 5, replace = TRUE),
diastolic_2 = sample(60:90, 5, replace = TRUE),
smoking_status = sample(c("Never smoked", "Current smoker", "Former smoker"), 5, replace = TRUE)
)
# Make some of the data missing
df[sample(1:5, 3), 5:8] <- NA
df[sample(1:5, 2), 10] <- NA
# write the data to a csv file:
write.csv(
x = df,
file = "blood_pressure_and_demographics.csv",
row.names = FALSE
)
Read the data into R:
blood_pressure_and_demographics <- read.csv(
file = "blood_pressure_and_demographics.csv"
)
# look at the first 10 rows:
blood_pressure_and_demographics
## id gender age weight height systolic_1 diastolic_1 systolic_2 diastolic_2
## 1 1 Male 59 209 NA NA NA NA 68
## 2 2 Male 60 210 NA NA NA NA 69
## 3 3 Male 54 188 178 108 70 131 82
## 4 4 Female 31 210 184 118 66 106 86
## 5 5 Male 42 176 NA NA NA NA 87
## smoking_status
## 1 Never smoked
## 2 Former smoker
## 3 Never smoked
## 4 <NA>
## 5 <NA>
Time to tidy and transform the data.
First, reshape the dataset into long format:
df_long <- blood_pressure_and_demographics %>%
pivot_longer(cols = c(systolic_1, diastolic_1, systolic_2, diastolic_2),
names_to = c(".value", "visit_number"),
names_sep = "_") %>%
mutate(visit_number = as.numeric(visit_number)) %>%
arrange(id, visit_number)
df_long
## # A tibble: 10 × 9
## id gender age weight height smoking_status visit_number systolic
## <int> <chr> <int> <int> <int> <chr> <dbl> <int>
## 1 1 Male 59 209 NA Never smoked 1 NA
## 2 1 Male 59 209 NA Never smoked 2 NA
## 3 2 Male 60 210 NA Former smoker 1 NA
## 4 2 Male 60 210 NA Former smoker 2 NA
## 5 3 Male 54 188 178 Never smoked 1 108
## 6 3 Male 54 188 178 Never smoked 2 131
## 7 4 Female 31 210 184 <NA> 1 118
## 8 4 Female 31 210 184 <NA> 2 106
## 9 5 Male 42 176 NA <NA> 1 NA
## 10 5 Male 42 176 NA <NA> 2 NA
## # ℹ 1 more variable: diastolic <int>
Calculate the mean systolic and diastolic blood pressure for each individual and visit.
To do that, group the dataframe by id and
visit_number and summarize each group via mean.
Finally drop the groups.
bp_means <- df_long %>%
group_by(id, visit_number) %>%
summarize(
mean_systolic = mean(systolic),
mean_diastolic = mean(diastolic),
n_obs = n(),
.groups = "drop"
)
bp_means
## # A tibble: 10 × 5
## id visit_number mean_systolic mean_diastolic n_obs
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 1 NA NA 1
## 2 1 2 NA 68 1
## 3 2 1 NA NA 1
## 4 2 2 NA 69 1
## 5 3 1 108 70 1
## 6 3 2 131 82 1
## 7 4 1 118 66 1
## 8 4 2 106 86 1
## 9 5 1 NA NA 1
## 10 5 2 NA 87 1
Join the blood pressure means back to the original dataset by
id and visit_number:
df_clean <- df_long %>%
select(-c(systolic, diastolic)) %>%
left_join(bp_means, by = c("id", "visit_number"))
df_clean
## # A tibble: 10 × 10
## id gender age weight height smoking_status visit_number mean_systolic
## <int> <chr> <int> <int> <int> <chr> <dbl> <dbl>
## 1 1 Male 59 209 NA Never smoked 1 NA
## 2 1 Male 59 209 NA Never smoked 2 NA
## 3 2 Male 60 210 NA Former smoker 1 NA
## 4 2 Male 60 210 NA Former smoker 2 NA
## 5 3 Male 54 188 178 Never smoked 1 108
## 6 3 Male 54 188 178 Never smoked 2 131
## 7 4 Female 31 210 184 <NA> 1 118
## 8 4 Female 31 210 184 <NA> 2 106
## 9 5 Male 42 176 NA <NA> 1 NA
## 10 5 Male 42 176 NA <NA> 2 NA
## # ℹ 2 more variables: mean_diastolic <dbl>, n_obs <int>
The final output is a cleaned and transformed dataset that is ready for downstream analysis.I chose to use my own dataset for this assignment.
Let’s simulate the data:
set.seed(123)
# Create sample data
df <- data.frame(
id = 1:5,
gender = sample(c("Male", "Female"), 5, replace = TRUE),
age = sample(18:65, 5, replace = TRUE),
weight = sample(120:220, 5, replace = TRUE),
height = sample(150:200, 5, replace = TRUE),
systolic_1 = sample(100:140, 5, replace = TRUE),
diastolic_1 = sample(60:90, 5, replace = TRUE),
systolic_2 = sample(100:140, 5, replace = TRUE),
diastolic_2 = sample(60:90, 5, replace = TRUE),
smoking_status = sample(c("Never smoked", "Current smoker", "Former smoker"), 5, replace = TRUE)
)
# Make some of the data missing
df[sample(1:5, 3), 5:8] <- NA
df[sample(1:5, 2), 10] <- NA
# write the data to a csv file:
write.csv(
x = df,
file = "blood_pressure_and_demographics.csv",
row.names = FALSE
)
Read the data into R:
blood_pressure_and_demographics <- read.csv(
file = "blood_pressure_and_demographics.csv"
)
# look at the first 10 rows:
blood_pressure_and_demographics
## id gender age weight height systolic_1 diastolic_1 systolic_2 diastolic_2
## 1 1 Male 59 209 NA NA NA NA 68
## 2 2 Male 60 210 NA NA NA NA 69
## 3 3 Male 54 188 178 108 70 131 82
## 4 4 Female 31 210 184 118 66 106 86
## 5 5 Male 42 176 NA NA NA NA 87
## smoking_status
## 1 Never smoked
## 2 Former smoker
## 3 Never smoked
## 4 <NA>
## 5 <NA>
Time to tidy and transform the data.
First, reshape the dataset into long format:
df_long <- blood_pressure_and_demographics %>%
pivot_longer(cols = c(systolic_1, diastolic_1, systolic_2, diastolic_2),
names_to = c(".value", "visit_number"),
names_sep = "_") %>%
mutate(visit_number = as.numeric(visit_number)) %>%
arrange(id, visit_number)
df_long
## # A tibble: 10 × 9
## id gender age weight height smoking_status visit_number systolic
## <int> <chr> <int> <int> <int> <chr> <dbl> <int>
## 1 1 Male 59 209 NA Never smoked 1 NA
## 2 1 Male 59 209 NA Never smoked 2 NA
## 3 2 Male 60 210 NA Former smoker 1 NA
## 4 2 Male 60 210 NA Former smoker 2 NA
## 5 3 Male 54 188 178 Never smoked 1 108
## 6 3 Male 54 188 178 Never smoked 2 131
## 7 4 Female 31 210 184 <NA> 1 118
## 8 4 Female 31 210 184 <NA> 2 106
## 9 5 Male 42 176 NA <NA> 1 NA
## 10 5 Male 42 176 NA <NA> 2 NA
## # ℹ 1 more variable: diastolic <int>
Calculate the mean systolic and diastolic blood pressure for each individual and visit.
To do that, group the dataframe by id and
visit_number and summarize each group via mean.
Finally drop the groups.
bp_means <- df_long %>%
group_by(id, visit_number) %>%
summarize(
mean_systolic = mean(systolic),
mean_diastolic = mean(diastolic),
n_obs = n(),
.groups = "drop"
)
bp_means
## # A tibble: 10 × 5
## id visit_number mean_systolic mean_diastolic n_obs
## <int> <dbl> <dbl> <dbl> <int>
## 1 1 1 NA NA 1
## 2 1 2 NA 68 1
## 3 2 1 NA NA 1
## 4 2 2 NA 69 1
## 5 3 1 108 70 1
## 6 3 2 131 82 1
## 7 4 1 118 66 1
## 8 4 2 106 86 1
## 9 5 1 NA NA 1
## 10 5 2 NA 87 1
Join the blood pressure means back to the original dataset by
id and visit_number:
df_clean <- df_long %>%
select(-c(systolic, diastolic)) %>%
left_join(bp_means, by = c("id", "visit_number"))
df_clean
## # A tibble: 10 × 10
## id gender age weight height smoking_status visit_number mean_systolic
## <int> <chr> <int> <int> <int> <chr> <dbl> <dbl>
## 1 1 Male 59 209 NA Never smoked 1 NA
## 2 1 Male 59 209 NA Never smoked 2 NA
## 3 2 Male 60 210 NA Former smoker 1 NA
## 4 2 Male 60 210 NA Former smoker 2 NA
## 5 3 Male 54 188 178 Never smoked 1 108
## 6 3 Male 54 188 178 Never smoked 2 131
## 7 4 Female 31 210 184 <NA> 1 118
## 8 4 Female 31 210 184 <NA> 2 106
## 9 5 Male 42 176 NA <NA> 1 NA
## 10 5 Male 42 176 NA <NA> 2 NA
## # ℹ 2 more variables: mean_diastolic <dbl>, n_obs <int>
The final output is a cleaned and transformed dataset that is ready for downstream analysis.
Read the data into R:
dfSchool <- read.csv(file = "school_diversity.csv")
str(dfSchool)
## 'data.frame': 27944 obs. of 16 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ LEAID : int 100002 100005 100005 100006 100006 100007 100007 100008 100011 100012 ...
## $ LEA_NAME : chr "alabama youth services" "albertville city" "albertville city" "marshall county" ...
## $ ST : chr "AL" "AL" "AL" "AL" ...
## $ d_Locale_Txt: chr NA "town-distant" "town-distant" "rural-distant" ...
## $ SCHOOL_YEAR : chr "1994-1995" "1994-1995" "2016-2017" "1994-1995" ...
## $ AIAN : num 0 0 0.294 0.104 0.492 ...
## $ Asian : num 0.589 0.321 0.551 0.134 0.299 ...
## $ Black : num 71.709 1.283 3.194 0.373 1.073 ...
## $ Hispanic : num 0.196 4.522 46.741 0.909 21.294 ...
## $ White : num 27.5 93.9 46.8 98.5 75.8 ...
## $ Multi : num NA NA 2.44 NA 1.04 ...
## $ Total : int 509 3118 5447 6707 5687 7671 13938 10440 1973 2389 ...
## $ diverse : chr "Diverse" "Extremely undiverse" "Diverse" "Extremely undiverse" ...
## $ variance : num NA NA 0.0116 NA NA ...
## $ int_group : chr NA NA "Highly integrated" NA ...
Cleaning and transforming data
# we are interested in getting the counts of the students' race
dfSchool2 <- dfSchool %>%
mutate_all(~replace(., is.na(.), 0)) %>%
filter(Total > 100) %>%
mutate(Asian_Pop = Asian / 100 * Total,
Black_Pop = Black / 100 * Total,
Hispanic_Pop = Hispanic / 100 * Total,
White_Pop = White / 100 * Total,
Multi_Pop = Multi / 100 * Total,
) %>%
group_by(LEA_NAME, ST) %>%
summarize(Asian = mean(Asian_Pop),
Black = mean(Black_Pop),
Hispanic = mean(Hispanic_Pop),
White = mean(White_Pop),
Multi = mean(Multi_Pop)
)
head(dfSchool2)
## # A tibble: 6 × 7
## # Groups: LEA_NAME [6]
## LEA_NAME ST Asian Black Hispanic White Multi
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 a e r o spec educ coop IL 0 8 8 120 0
## 2 a-c central cusd 262 IL 0 2.50 6.00 489 4.00
## 3 a.c.g.c. public school district MN 4.00 1.00 66 721 12.0
## 4 abbeville 60 SC 7 1355. 25.5 1974 32.5
## 5 abbotsford school district WI 2 7.00 181. 504 2.5
## 6 abbott isd TX 0 4 25.0 248. 3.50
Comparing race ration in schools
dfSchool4 <- dfSchool %>%
mutate_all(~replace(., is.na(.), 0)) %>%
filter(Total > 100) %>%
mutate(Asian = Asian / 100,
Black = Black / 100,
Hispanic = Hispanic / 100,
White = White / 100,
Multi = Multi / 100,
) %>%
group_by(ST) %>%
summarize(Asian = mean(Asian),
Black = mean(Black),
Hispanic = mean(Hispanic),
White = mean(White),
Multi = mean(Multi)) %>%
pivot_longer(!ST, names_to = 'Ethnicity', values_to = 'Ratio')
ggplot(data = dfSchool4, aes(x = Ethnicity, y = Ratio, fill = Ethnicity)) +
geom_boxplot() +
labs(title = 'Ethnicity Ratio Distribution in Schools')
While the other racial ratios are more equal, we can observe that the White student ratio has the largest overall spread by far.
Read the data into R:
religion <- read.csv(file = "income_and_religion.csv")
religion
## religion X..10k X.10.20k X.20.30k X.30.40k X.40.50k X.50.75k
## 1 Agnostic 27 34 60 81 76 137
## 2 Atheist 12 27 37 52 35 70
## 3 Buddhist 27 21 30 34 33 58
## 4 Catholic 418 617 732 670 638 1116
## 5 Don’t know/refused 15 14 15 11 10 35
## 6 Evangelical Prot 575 869 1064 982 881 1486
## 7 Hindu 1 9 7 9 11 34
## 8 Historically Black Prot 228 244 236 238 197 223
## 9 Jehovah's Witness 20 27 24 24 21 30
## 10 Jewish 19 19 25 25 30 95
## X
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## 7 NA
## 8 NA
## 9 NA
## 10 NA
Tidying the Data:
To tidy the data, I removed unnecessary variables, changed the names,
and then changed the wide data set into a longer data set by grouping
the income levels as the income variable and assigned the
values to frequency. Then I changed the income
into a factor and ordered the levels.
religion <- religion %>%
select(-X) %>%
set_colnames(c("religion", "<$10k","$10-20k", "$20-30k", "$30-40k", "$40-50k", "$50-75k")) %>%
pivot_longer(2:7, names_to = "income", values_to = "frequency")
# factoring the income variable
religion$income <- religion$income %>%
as.factor() %>%
relevel("<$10k")
head(religion)
## # A tibble: 6 × 3
## religion income frequency
## <chr> <fct> <int>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
Calculations and Graphs:
Based on their religion, I was able to determine the percentage of persons in each economic bracket. After that, I created a bar graph to display the percentages of revenue for each religious organization. I also discovered the average for every income bracket, which might be biased because some had higher frequencies than others, and we’re not sure if it makes sense for the overall population.For each faith, I also discovered the largest wealth category. Finally, I made an effort to identify the income bracket that corresponds to each faith group’s median value. I determined the cumulative sum of the income levels and divided the total number of members of each religious group by two to find the median. Next, I found the income group that contains the median value by comparing it to its cumulative sum.
# finding the proportion of each income for each religion
religion <- religion %>%
group_by(religion) %>%
mutate(count = sum(frequency)) %>%
group_by(religion, income) %>%
mutate(proportion = frequency / count,
label = round(proportion * 100, 2))
# bar graph of the proportions
ggplot(religion) +
geom_bar(aes(x = religion, y = proportion, fill = income), stat = "identity",
position = position_fill(reverse = TRUE)) +
geom_text(aes(x = religion, y = proportion, label = label, group = income),
position = position_stack(vjust = .5, reverse = TRUE), size = 3) +
coord_flip() +
ggtitle("Distribution of Income Levels per Religions")
# finding the average of each income group
religion %>%
group_by(income) %>%
summarise(average = mean(frequency)) %>%
arrange(desc(average))
## # A tibble: 6 × 2
## income average
## <fct> <dbl>
## 1 $50-75k 328.
## 2 $20-30k 223
## 3 $30-40k 213.
## 4 $40-50k 193.
## 5 $10-20k 188.
## 6 <$10k 134.
# finding the biggest income group for each religion
religion %>%
group_by(religion) %>%
select(-label, - count) %>%
mutate(max = max(proportion)) %>%
filter(proportion == max) %>%
select(-max) %>%
arrange(desc(proportion))
## # A tibble: 10 × 4
## # Groups: religion [10]
## religion income frequency proportion
## <chr> <fct> <int> <dbl>
## 1 Hindu $50-75k 34 0.479
## 2 Jewish $50-75k 95 0.446
## 3 Don’t know/refused $50-75k 35 0.35
## 4 Agnostic $50-75k 137 0.330
## 5 Atheist $50-75k 70 0.300
## 6 Buddhist $50-75k 58 0.286
## 7 Catholic $50-75k 1116 0.266
## 8 Evangelical Prot $50-75k 1486 0.254
## 9 Jehovah's Witness $50-75k 30 0.205
## 10 Historically Black Prot $10-20k 244 0.179
# finding the median value
religion %>%
group_by(religion) %>%
select(-label, - proportion) %>%
mutate(median = count / 2,
sum = cumsum(frequency)) %>%
select(-count, -frequency) %>%
filter(sum >= median) %>%
mutate(minsum = min(sum)) %>%
filter(minsum == sum) %>%
select(-minsum) %>%
arrange(desc(income))
## # A tibble: 10 × 4
## # Groups: religion [10]
## religion income median sum
## <chr> <fct> <dbl> <int>
## 1 Agnostic $40-50k 208. 278
## 2 Hindu $40-50k 35.5 37
## 3 Jewish $40-50k 106. 118
## 4 Atheist $30-40k 116. 128
## 5 Buddhist $30-40k 102. 112
## 6 Catholic $30-40k 2096. 2437
## 7 Don’t know/refused $30-40k 50 55
## 8 Evangelical Prot $30-40k 2928. 3490
## 9 Jehovah's Witness $30-40k 73 95
## 10 Historically Black Prot $20-30k 683 708
Conclusion
The largest income category for all faith groups, according to the graph and chart, is between $50 and $75k, with the exception of historically Black Protestants, who are primarily in the $10–20k range. Interestingly, the percentage of Hindu and Jewish individuals earning between $50,000 and $75,000 is over 40%, whereas the percentage for Evangelical Protestant, Catholic, and Buddhist groupings is only under 25%. Additionally, it is clear that the group with the highest combined income is $50–75k, followed by $20–30k.
Given that income is not regularly distributed, the median should be examined rather than the average. For every religion group, I was also able to determine the median income group. In order to accomplish this, I had to first get the median frequency value for each group before calculating the cumulative total of the frequencies for each income bracket. When the median value exceeded the cumulative sum, it indicated that the median value would belong to that particular income category. With the exception of Agnostic, Hindu, and Jewish people, who fell into the $40–50k income range for their median group, and Historically Black Protestants, who fell into the $20–30k income range for their median group, the majority of the religious group fell into the $30–45k income range.