Shovan Biswas, the classmate who posted this dataset, suggested tidying it into three variables: religion, income, and frequency. The sample size reflects the distribution of religions across the United States. Beyond tidying the data, further analysis is possible, such as:
(1) For each income level, identify which religions have the highest and lowest percentage of households.
(2) Figure out which income level has the highest variation.
Load the original data file from GitHub.
library(tidyverse)  # provides dplyr, tidyr, stringr and ggplot2, all used below
IncomeReligious <- read.csv("https://raw.githubusercontent.com/SieSiongWong/DATA-607/master/Income%20Distirbution%20by%20Religious%20Group.csv", header=TRUE, sep=",")
Review the dataset.
IncomeReligious
## Religious.tradition Less.than..30.000 X.30.000..49.999
## 1 Buddhist 36% 18%
## 2 Catholic 36% 19%
## 3 Evangelical Protestant 35% 22%
## 4 Hindu 17% 13%
## 5 Historically Black Protestant 53% 22%
## 6 Jehovah's Witness 48% 25%
## 7 Jewish 16% 15%
## 8 Mainline Protestant 29% 20%
## 9 Mormon 27% 20%
## 10 Muslim 34% 17%
## 11 Orthodox Christian 18% 17%
## 12 Unaffiliated (religious "nones") 33% 20%
## X.50.000..99.999 X.100.000.or.more Sample.Size
## 1 32% 13% 233
## 2 26% 19% 6,137
## 3 28% 14% 7,462
## 4 34% 36% 172
## 5 17% 8% 1,704
## 6 22% 4% 208
## 7 24% 44% 708
## 8 28% 23% 5,208
## 9 33% 20% 594
## 10 29% 20% 205
## 11 36% 29% 155
## 12 26% 21% 6,790
str(IncomeReligious)
## 'data.frame': 12 obs. of 6 variables:
## $ Religious.tradition: Factor w/ 12 levels "Buddhist","Catholic",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Less.than..30.000 : Factor w/ 11 levels "16%","17%","18%",..: 9 9 8 2 11 10 1 5 4 7 ...
## $ X.30.000..49.999 : Factor w/ 8 levels "13%","15%","17%",..: 4 5 7 1 7 8 2 6 6 3 ...
## $ X.50.000..99.999 : Factor w/ 10 levels "17%","22%","24%",..: 7 4 5 9 1 2 3 5 8 6 ...
## $ X.100.000.or.more : Factor w/ 11 levels "13%","14%","19%",..: 1 3 2 8 11 9 10 6 4 4 ...
## $ Sample.Size : Factor w/ 12 levels "1,704","155",..: 6 9 11 3 1 5 12 7 8 4 ...
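The str() output above shows every column stored as a factor, which is what read.csv produced by default before R 4.0. Here is a small sketch of why that matters when the percentages are later converted to numbers. The vector pct is a made-up three-value example, not part of the original analysis:

```r
# A factor stores integer codes for its sorted levels, not the printed text.
pct <- factor(c("36%", "17%", "53%"))

# Converting the factor directly returns the level codes.
codes <- as.numeric(pct)

# Converting via character and stripping the "%" returns the actual percentages.
values <- as.numeric(sub("%", "", as.character(pct)))
```

This is why the cleaning step below extracts the digits from the text before calling as.numeric.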
Clean the data.
## Remove the sample size column, which is not used in this analysis.
IncomeReligious2 <- subset(IncomeReligious, select=-Sample.Size)
## Rename the columns 1:5.
IncomeReligious2 <- IncomeReligious2 %>% rename("Religion"="Religious.tradition", "Less than $30k"="Less.than..30.000", "$30k-$49,999"="X.30.000..49.999", "$50k-$99,999"="X.50.000..99.999", "Over $100k"="X.100.000.or.more")
Reshape the clean data.
## Convert the dataset to long form by turning columns 2 to 5 into rows.
IncomeReligious2 <- IncomeReligious2 %>% gather(Income, value,2:5)
## Rename the column name value to Frequency.
IncomeReligious2 <- rename(IncomeReligious2, "Frequency"="value")
## Convert the frequency column from percentage strings (e.g. "36%") into numeric values.
IncomeReligious2 <- IncomeReligious2 %>% transform(Frequency=as.numeric(unlist(str_extract(IncomeReligious2$Frequency,"[[:digit:]]{1,}"))))
## Sort the rows by religion.
IncomeReligious2 <- IncomeReligious2 %>% arrange(Religion)
IncomeReligious2
## Religion Income Frequency
## 1 Buddhist Less than $30k 36
## 2 Buddhist $30k-$49,999 18
## 3 Buddhist $50k-$99,999 32
## 4 Buddhist Over $100k 13
## 5 Catholic Less than $30k 36
## 6 Catholic $30k-$49,999 19
## 7 Catholic $50k-$99,999 26
## 8 Catholic Over $100k 19
## 9 Evangelical Protestant Less than $30k 35
## 10 Evangelical Protestant $30k-$49,999 22
## 11 Evangelical Protestant $50k-$99,999 28
## 12 Evangelical Protestant Over $100k 14
## 13 Hindu Less than $30k 17
## 14 Hindu $30k-$49,999 13
## 15 Hindu $50k-$99,999 34
## 16 Hindu Over $100k 36
## 17 Historically Black Protestant Less than $30k 53
## 18 Historically Black Protestant $30k-$49,999 22
## 19 Historically Black Protestant $50k-$99,999 17
## 20 Historically Black Protestant Over $100k 8
## 21 Jehovah's Witness Less than $30k 48
## 22 Jehovah's Witness $30k-$49,999 25
## 23 Jehovah's Witness $50k-$99,999 22
## 24 Jehovah's Witness Over $100k 4
## 25 Jewish Less than $30k 16
## 26 Jewish $30k-$49,999 15
## 27 Jewish $50k-$99,999 24
## 28 Jewish Over $100k 44
## 29 Mainline Protestant Less than $30k 29
## 30 Mainline Protestant $30k-$49,999 20
## 31 Mainline Protestant $50k-$99,999 28
## 32 Mainline Protestant Over $100k 23
## 33 Mormon Less than $30k 27
## 34 Mormon $30k-$49,999 20
## 35 Mormon $50k-$99,999 33
## 36 Mormon Over $100k 20
## 37 Muslim Less than $30k 34
## 38 Muslim $30k-$49,999 17
## 39 Muslim $50k-$99,999 29
## 40 Muslim Over $100k 20
## 41 Orthodox Christian Less than $30k 18
## 42 Orthodox Christian $30k-$49,999 17
## 43 Orthodox Christian $50k-$99,999 36
## 44 Orthodox Christian Over $100k 29
## 45 Unaffiliated (religious "nones") Less than $30k 33
## 46 Unaffiliated (religious "nones") $30k-$49,999 20
## 47 Unaffiliated (religious "nones") $50k-$99,999 26
## 48 Unaffiliated (religious "nones") Over $100k 21
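The reshaping steps above use gather(), which tidyr has since superseded with pivot_longer(). Below is a minimal sketch of the same wide-to-long conversion, assuming dplyr and tidyr 1.0 or later are installed. It uses a two-row excerpt of the data typed in by hand, and the object names wide and long are illustrative rather than part of the original analysis:

```r
library(dplyr)
library(tidyr)

# Two rows copied from the renamed data above, kept as character percentages.
wide <- data.frame(
  Religion = c("Buddhist", "Catholic"),
  `Less than $30k` = c("36%", "36%"),
  `$30k-$49,999`   = c("18%", "19%"),
  `$50k-$99,999`   = c("32%", "26%"),
  `Over $100k`     = c("13%", "19%"),
  check.names = FALSE,       # keep the "$" and spaces in the column names
  stringsAsFactors = FALSE
)

long <- wide %>%
  # Turn every column except Religion into Income/Frequency pairs.
  pivot_longer(cols = -Religion, names_to = "Income", values_to = "Frequency") %>%
  # Strip the percent sign and convert the remaining text to a number.
  mutate(Frequency = as.numeric(sub("%", "", Frequency))) %>%
  arrange(Religion)
```

Naming the output columns inside pivot_longer() removes the need for the separate rename step.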
Analyze the clean data.
## Figure 1: stacked bar plot of the % of adults at each household income level, by religion.
IncomeReligious2$Income <- factor(IncomeReligious2$Income, levels=c("Less than $30k", "$30k-$49,999", "$50k-$99,999", "Over $100k"))
ggplot(IncomeReligious2, aes(x = Religion, y = Frequency, fill = Income)) + geom_bar(stat="identity", position = position_stack(reverse = FALSE)) + xlab("Religion") + ylab("% of Adults Household Income") + scale_fill_brewer(palette = "Set2") + coord_flip() + theme(legend.position = "top") + geom_text(aes(label=Frequency), position = position_stack(vjust = .5), size = 3)

## Figure 2: box plot of the % distribution across religions at each income level.
## Note: reorder() takes the argument FUN (upper case); a lower-case fun is silently ignored and the default mean is used.
ggplot(IncomeReligious2, aes(x=reorder(factor(Income), Frequency, FUN=median),y=Frequency,fill=factor(Income))) + geom_boxplot() + labs(title="% Income Distribution by Religions") + ylab("% Income Distribution") + theme(legend.position = "none", axis.title.x = element_blank(), axis.text.x=element_text(angle=45)) + theme(plot.title = element_text(hjust=0.5)) + theme(axis.text.x = element_text(margin = margin(t = 25, r = 20, b = 0, l = 0)))

Conclusions:
From Figure 1, we can see that at the income level “Less than $30k”, the Historically Black Protestant group has the highest percentage, 53%, and the Jewish group has the lowest, 16%. For the income level “$30k-$49,999”, the Jehovah’s Witness group has the highest percentage, 25%, and the Hindu group has the lowest, 13%. For the income level “$50k-$99,999”, the Orthodox Christian group has the highest percentage, 36%, and the Historically Black Protestant group has the lowest, 17%. For the income level “Over $100k”, the Jewish group has the highest percentage, 44%, and the Jehovah’s Witness group has the lowest, 4%.
From Figure 2, we can see that the income levels over $100k and below $30k show the widest variation in percentage across religions (4% to 44% and 16% to 53%, respectively). As Figure 1 shows, this is mainly because at the income level over $100k, the Jewish group has a much higher percentage than the others. Likewise, at the income level below $30k, the Historically Black Protestant and Jehovah’s Witness groups have much higher percentages than the others.
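The two questions posed at the start can also be checked numerically rather than read off the figures. Below is a minimal sketch, assuming dplyr and tidyr are installed. The percentages are retyped from the table printed earlier, and the names income and by_income are illustrative:

```r
library(dplyr)
library(tidyr)

# Percentages transcribed from the dataset printed earlier in this document.
income <- data.frame(
  Religion = c("Buddhist", "Catholic", "Evangelical Protestant", "Hindu",
               "Historically Black Protestant", "Jehovah's Witness", "Jewish",
               "Mainline Protestant", "Mormon", "Muslim", "Orthodox Christian",
               "Unaffiliated (religious \"nones\")"),
  `Less than $30k` = c(36, 36, 35, 17, 53, 48, 16, 29, 27, 34, 18, 33),
  `$30k-$49,999`   = c(18, 19, 22, 13, 22, 25, 15, 20, 20, 17, 17, 20),
  `$50k-$99,999`   = c(32, 26, 28, 34, 17, 22, 24, 28, 33, 29, 36, 26),
  `Over $100k`     = c(13, 19, 14, 36, 8, 4, 44, 23, 20, 20, 29, 21),
  check.names = FALSE
) %>%
  pivot_longer(-Religion, names_to = "Income", values_to = "Frequency")

# One row per income level: the extreme religions and simple spread measures.
by_income <- income %>%
  group_by(Income) %>%
  summarise(
    Highest = Religion[which.max(Frequency)],
    Lowest  = Religion[which.min(Frequency)],
    Range   = max(Frequency) - min(Frequency),
    SD      = sd(Frequency)
  )
```

Printing by_income gives one row per income level. A larger Range or SD indicates more variation across religions at that level.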