An analysis that could be performed is once the data is ‘tidy’ we can see the columns as ‘religion’, ‘income’ then ‘frequency’, with each row being an entry(unique) to that person or entry point. We can then grab this data and analyze mathematically amounts, averages, etc"
This data can also be found in the tidyr package under the name relig_income which includes more income ranges and respondents who did not want to answer information about their icnome.
religion <- read.csv("https://raw.githubusercontent.com/okhaimova/DATA-607/master/Project2/income%20and%20religion.csv")
To tidy the data, I removed unnecessary variables, changed the names, and then changed the wide data set into a longer data set by grouping the income levels as the income variable and assigned the values to frequency. Then I changed the income into a factor and ordered the levels.
religion <- religion %>%
select(-X) %>%
set_colnames(c("religion", "<$10k","$10-20k", "$20-30k", "$30-40k", "$40-50k", "$50-75k")) %>%
pivot_longer(2:7, names_to = "income", values_to = "frequency")
# factoring the income variable
religion$income <- religion$income %>%
as.factor() %>%
relevel("<$10k")
head(religion)
## # A tibble: 6 x 3
## religion income frequency
## <chr> <fct> <int>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
I found the proportion of people that are in each income group based on their religion. I then made a bar graph to show the proportions of income levels for each religious group. I also found the average for each income group which may be skewed as certain groups had larger frequencies and we do not know if it is proportionate to the population.I also found the largest income group for each religion. Lastly, I tried to find the income group for each religious group which holds the median value. To determine the median, I divided the total amount of people for each religious group by 2 and then found the cumulative sum of the income levels. Next, I found the income group that contains the median value by comparing it to its cumulative sum.
# finding the proportion of each income for each religion
religion <- religion %>%
group_by(religion) %>%
mutate(count = sum(frequency)) %>%
group_by(religion, income) %>%
mutate(proportion = frequency / count,
label = round(proportion * 100, 2))
# bar graph of the proportions
ggplot(religion) +
geom_bar(aes(x = religion, y = proportion, fill = income), stat = "identity",
position = position_fill(reverse = TRUE)) +
geom_text(aes(x = religion, y = proportion, label = label, group = income),
position = position_stack(vjust = .5, reverse = TRUE), size = 3) +
coord_flip() +
ggtitle("Distribution of Income Levels per Religions")
# finding the average of each income group
religion %>%
group_by(income) %>%
summarise(average = mean(frequency)) %>%
arrange(desc(average))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 2
## income average
## <fct> <dbl>
## 1 $50-75k 328.
## 2 $20-30k 223
## 3 $30-40k 213.
## 4 $40-50k 193.
## 5 $10-20k 188.
## 6 <$10k 134.
# finding the biggest income group for each religion
religion %>%
group_by(religion) %>%
select(-label, - count) %>%
mutate(max = max(proportion)) %>%
filter(proportion == max) %>%
select(-max) %>%
arrange(desc(proportion))
## # A tibble: 10 x 4
## # Groups: religion [10]
## religion income frequency proportion
## <chr> <fct> <int> <dbl>
## 1 Hindu $50-75k 34 0.479
## 2 Jewish $50-75k 95 0.446
## 3 Don’t know/refused $50-75k 35 0.35
## 4 Agnostic $50-75k 137 0.330
## 5 Atheist $50-75k 70 0.300
## 6 Buddhist $50-75k 58 0.286
## 7 Catholic $50-75k 1116 0.266
## 8 Evangelical Prot $50-75k 1486 0.254
## 9 Jehovah's Witness $50-75k 30 0.205
## 10 Historically Black Prot $10-20k 244 0.179
# finding the median value
religion %>%
group_by(religion) %>%
select(-label, - proportion) %>%
mutate(median = count / 2,
sum = cumsum(frequency)) %>%
select(-count, -frequency) %>%
filter(sum >= median) %>%
mutate(minsum = min(sum)) %>%
filter(minsum == sum) %>%
select(-minsum) %>%
arrange(desc(income))
## # A tibble: 10 x 4
## # Groups: religion [10]
## religion income median sum
## <chr> <fct> <dbl> <int>
## 1 Agnostic $40-50k 208. 278
## 2 Hindu $40-50k 35.5 37
## 3 Jewish $40-50k 106. 118
## 4 Atheist $30-40k 116. 128
## 5 Buddhist $30-40k 102. 112
## 6 Catholic $30-40k 2096. 2437
## 7 Don’t know/refused $30-40k 50 55
## 8 Evangelical Prot $30-40k 2928. 3490
## 9 Jehovah's Witness $30-40k 73 95
## 10 Historically Black Prot $20-30k 683 708
Based on the graph and chart, the largest income group for every religion group is $50-75k except for Historically Black Protestant who mostly fall in the $10-20k group. It is also interesting to note that the proportion of those who fall in the $50-75k income group for Hindu and Jewish is over 40% while for Evangelical Protestant, Catholic, and Buddhist groups, that proprtion is only around a quarter. It is also evident collectively, the largest income group is $50-75k followed by $20-30k.
Since income is not normally distributed, it would be better to look at the median instead of the average. I was also able to find the median income group for each religious group. To do so, I had to find what the median frequency value for each group and then found the cumulative sum of the frequency for each income group. Once the median value was greater than the cumulative sum, it meant that the median value would fall into that income group. Majority of the religious group fell into the $30-45k income for its median value, except for Agnostic, Hindu, and Jewish who fell into the $40-50k income for its median group and Historically Black Protestants who fell into the $20-30k income for its median group.
Find if there is a correlation between students performance and parental level of education.
exams <- read.csv("https://raw.githubusercontent.com/okhaimova/DATA-607/master/Project2/StudentsPerformance.csv")
First, I removed some variables as we are only interested in the students performance and parental level of education. Then I renamed the columns and made the data set longer by assigning the the subject variables to subject and the values to score. Then I changed the parental_edu variable into a factor variable and ordered the levels accordingly.
exams <- exams %>%
select(parental.level.of.education, math.score, reading.score, writing.score) %>%
set_colnames(c("parental_edu", "math","reading", "writing")) %>%
pivot_longer(2:4, names_to = "subject", values_to = "score")
# factoring the parents' levels of education
exams$parental_edu <- exams$parental_edu %>%
as.factor() %>%
factor(c("some high school", "high school", "some college",
"associate's degree", "bachelor's degree", "master's degree"))
head(exams)
## # A tibble: 6 x 3
## parental_edu subject score
## <fct> <chr> <int>
## 1 bachelor's degree math 72
## 2 bachelor's degree reading 72
## 3 bachelor's degree writing 74
## 4 some college math 69
## 5 some college reading 90
## 6 some college writing 88
# Overall average score comparison for each parental level
exams %>%
group_by(parental_edu) %>%
summarise(average = mean(score))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 2
## parental_edu average
## <fct> <dbl>
## 1 some high school 65.1
## 2 high school 63.1
## 3 some college 68.5
## 4 associate's degree 69.6
## 5 bachelor's degree 71.9
## 6 master's degree 73.6
# Average score comparison for each parental level and subject
exams %>%
group_by(parental_edu, subject) %>%
summarise(average = mean(score)) %>%
pivot_wider(names_from = subject, values_from = average)
## `summarise()` regrouping output by 'parental_edu' (override with `.groups` argument)
## # A tibble: 6 x 4
## # Groups: parental_edu [6]
## parental_edu math reading writing
## <fct> <dbl> <dbl> <dbl>
## 1 some high school 63.5 66.9 64.9
## 2 high school 62.1 64.7 62.4
## 3 some college 67.1 69.5 68.8
## 4 associate's degree 67.9 70.9 69.9
## 5 bachelor's degree 69.4 73 73.4
## 6 master's degree 69.7 75.4 75.7
# Standard Deviation for each parental level and subject
exams %>%
group_by(parental_edu, subject) %>%
summarise(sd = sd(score)) %>%
pivot_wider(names_from = subject, values_from = sd)
## `summarise()` regrouping output by 'parental_edu' (override with `.groups` argument)
## # A tibble: 6 x 4
## # Groups: parental_edu [6]
## parental_edu math reading writing
## <fct> <dbl> <dbl> <dbl>
## 1 some high school 15.9 15.5 15.7
## 2 high school 14.5 14.1 14.1
## 3 some college 14.3 14.1 15.0
## 4 associate's degree 15.1 13.9 14.3
## 5 bachelor's degree 14.9 14.3 14.7
## 6 master's degree 15.2 13.8 13.7
#Scatterplot
ggplot(exams, aes(x = parental_edu, y = score)) +
geom_point(aes(color = subject)) +
coord_flip()
# histogram of each distribution
ggplot(exams) +
geom_histogram(aes(x = score, fill = subject)) +
facet_grid(parental_edu ~ subject) +
ggtitle("Distributions of Scores")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# box plots for each distribution
ggplot(exams, aes(x = parental_edu, y = score, color = subject)) +
geom_boxplot() +
facet_wrap(~subject) +
coord_flip() +
ggtitle("Distributions of Scores")
When looking at the average score across all three tests, we can see there is some correlation between the parental level of education and scores. As the level of education increases, the average overall test score increase as well except between the levels of some high school and high school.
The same can be seen when we look at the average of scores separately for each subject. There is the same correlation that as level of education increases, so does the average test score except between some high school and high school.
When looking at the scatterplot, we can see that the range of scores seem to decrease as the parental level of education increases. This shows that there is a correlation between scores and parental level of education. It also can be seen that for students whose parents have some high school education, the scores vary the greatest as they have the highest standard deviation and students whose parents have completed a master's degree, their standard deviations are the smallest for reading and writing, but not for math, meaning they have a larger variation for math scores.
The distributions can be further seen for each level of parental education and subject in the histograms and box plots. It can be said, generally, that there is a correlation between the two. As the parental level of education rises, the range of scores decrease and the average score increase, with a small discrepancy between some high school and high school.
An example of analysis that can be done is the change in ridership per borough from one year to the next.
mta <- read.csv("https://raw.githubusercontent.com/okhaimova/DATA-607/master/Project2/MTA%20ridership.csv")
First, I removed some variables and only kept the names of the boroughs and the ridership for each year. Afterwards, I grouped the years into the year variable and the values into ridership. Then I factored the variables and made the ridership variable numeric.
mta <- mta %>%
select(2:8) %>%
set_colnames(c("borough", "2013", "2014", "2015", "2016", "2017", "2018")) %>%
pivot_longer(2:7, names_to = "year", values_to = "ridership")
# factoring the years and boroughs and making ridership numeric
mta$year <- as.factor(mta$year)
mta$borough <- as.factor(mta$borough)
mta$ridership <- as.numeric(gsub(",", "", mta$ridership))
head(mta)
## # A tibble: 6 x 3
## borough year ridership
## <fct> <fct> <dbl>
## 1 Bronx 2013 2961575
## 2 Bronx 2014 2941958
## 3 Bronx 2015 3045205
## 4 Bronx 2016 3038777
## 5 Bronx 2017 2785331
## 6 Bronx 2018 2562443
# Overall average ridership by year
annual <- mta %>%
group_by(year) %>%
summarise(average = mean(ridership))
## `summarise()` ungrouping output (override with `.groups` argument)
annual
## # A tibble: 6 x 2
## year average
## <fct> <dbl>
## 1 2013 4075399.
## 2 2014 4179655.
## 3 2015 4205208.
## 4 2016 4186768.
## 5 2017 4063695.
## 6 2018 3947377.
ggplot(annual, aes(x=year, y=average, group=1)) +
geom_line()+
geom_point() +
ylab("Average Annual Ridership") +
ggtitle("Average Annual Ridership Over Time")
# Average ridership each borough
boroughannual <- mta %>%
group_by(borough, year) %>%
summarise(average = mean(ridership))
## `summarise()` regrouping output by 'borough' (override with `.groups` argument)
boroughannual %>%
pivot_wider(names_from = year, values_from = average)
## # A tibble: 5 x 7
## # Groups: borough [5]
## borough `2013` `2014` `2015` `2016` `2017` `2018`
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Bronx 2044246. 2089933. 2086337. 2099966. 2048967. 1920210.
## 2 Brooklyn 2371925. 2437738. 2468462. 2449301. 2416386. 2358161.
## 3 Manhattan 8106404. 8321578. 8358968. 8307783. 8024126. 7835550.
## 4 Queens 3090748. 3155941. 3176318. 3181482. 3056963. 2945640.
## 5 Staten Island 4628850 4662582 4701436. 4723261 4613752 4387587
ggplot(boroughannual, aes(x=year, y=average, group = borough, color = borough)) +
geom_line()+
geom_point() +
ylab("Average Annual Ridership") +
ggtitle("Average Annual Ridership Over Time by Borough")
Overall, it can be seen that the average annual ridership over time increased and then decreased after 2015. It is also evident that ridership for each borough increased and then decreased.