The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to choose any three of the “wide” datasets identified in the Week [5] Discussion items. You may use your own dataset. For each of the three chosen datasets: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data [Most of your grade will be based on this step!]. Perform the analysis requested in the discussion item.
library(tidyr)
library(dplyr)
library(ggplot2)
Contributor: Jose Zuniga
Source: Tips for Simplifying Crosstab Query Statements, Rob Gravelle, Database Journal, 2010.
Analysis: Compare monthly citizenship applications for the given regions.
Month | REGION 1 | REGION 2 | REGION 3 | REGION 4 | REGION 5 | TOTAL |
---|---|---|---|---|---|---|
April | 13 | 33 | 76 | 2 | 47 | 171 |
May | 17 | 55 | 209 | 1 | 143 | 425 |
June | 8 | 63 | 221 | 1 | 127 | 420 |
July | 13 | 104 | 240 | 6 | 123 | 486 |
August | 18 | 121 | 274 | 9 | 111 | 533 |
September | 25 | 160 | 239 | 2 | 88 | 514 |
October | 9 | 88 | 295 | 2 | 127 | 521 |
November | 2 | 86 | 292 | 2 | 120 | 502 |
December | 1 | 128 | 232 | 6 | 155 | 522 |
TOTAL | 106 | 838 | 2078 | 31 | 1041 | 4094 |
csv1 <- rbind(c("Month", "REGION 1", "REGION 2",
"REGION 3", "REGION 4", "REGION 5", "TOTAL"),
c("April", 13, 33, 76, 2, 47, 171),
c("May", 17, 55, 209, 1, 143, 425),
c("June", 8, 63, 221, 1, 127, 420),
c("July", 13, 104, 240, 6, 123, 486),
c("August", 18, 121, 274, 9, 111, 533),
c("September", 25, 160, 239, 2, 88, 514),
c("October", 9, 88, 295, 2, 127, 521),
c("November", 2, 86, 292, 2, 120, 502),
c("December", 1, 128, 232, 6, 155, 522),
c("TOTAL", 106, 838, 2078, 31, 1041, 4094))
write.table(csv1, file = "DATA_607_Project2_1.csv", sep = ",", col.names=F, row.names=F)
wide_data1 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
"master/DATA%20607/DATA_607_Project2_1.csv"),
stringsAsFactors = F)
long_data1 <- wide_data1 %>%
gather(Region, Amount, 2:6) %>%
mutate(Contribution = Amount / TOTAL,
Month = match(Month, month.name),
Region = substr(Region, 8, 8)) %>%
select(-TOTAL) %>%
filter(Month != "NA"); head(long_data1)
## Month Region Amount Contribution
## 1 4 1 13 0.07602339
## 2 5 1 17 0.04000000
## 3 6 1 8 0.01904762
## 4 7 1 13 0.02674897
## 5 8 1 18 0.03377111
## 6 9 1 25 0.04863813
long_data1 %>%
group_by(Region) %>%
summarise(Region_Sum = sum(Amount))
## # A tibble: 5 × 2
## Region Region_Sum
## <chr> <int>
## 1 1 106
## 2 2 838
## 3 3 2078
## 4 4 31
## 5 5 1041
long_data1 %>%
group_by(Month) %>%
summarise(Month_Sum = sum(Amount))
## # A tibble: 9 × 2
## Month Month_Sum
## <int> <int>
## 1 4 171
## 2 5 425
## 3 6 420
## 4 7 486
## 5 8 533
## 6 9 514
## 7 10 521
## 8 11 502
## 9 12 522
ggplot(long_data1, aes(x=Region, y=Amount, fill=Region)) +
geom_bar(width = 1, stat = "identity") +
labs(title = "Applications by Region", y = "Applications")
long_data1 %>% select(-4) %>% spread(Month, Amount)
## Region 4 5 6 7 8 9 10 11 12
## 1 1 13 17 8 13 18 25 9 2 1
## 2 2 33 55 63 104 121 160 88 86 128
## 3 3 76 209 221 240 274 239 295 292 232
## 4 4 2 1 1 6 9 2 2 2 6
## 5 5 47 143 127 123 111 88 127 120 155
ggplot(long_data1, aes(x=as.character(Month), y=Amount, fill=Region)) +
geom_bar(width = 0.9, stat = "identity") +
ggtitle("Monthly Applications by Region") +
xlab("Month") + ylab("Applications")
Visual analysis presents the citizenship application data in a format that is even easier to digest than the original table. The monthly contribution of Region 4 which is represented by the color blue is almost unnoticeable. We see quite the opposite with the contribution of Region 3 which shows its monthly dominance in green. The other benefit of this final visualization is that the distribution of the aggregate number of applications per month can be seen in the heights of the bars.
Contributor: Yifei Li
Source: Reshape, Ramnath Vaidyanathan, Introduction to R, 2013.
Analysis: The correlation between religious groups and income distribution.
religion | <$10k | $10-20k | $20-30k | $30-40k | $40-50k | $50-75k | $75-100k | $100-150k | >150k |
---|---|---|---|---|---|---|---|---|---|
Agnostic | 27 | 34 | 60 | 81 | 76 | 137 | 122 | 109 | 84 |
Atheist | 12 | 27 | 37 | 52 | 35 | 70 | 73 | 59 | 74 |
Buddhist | 27 | 21 | 30 | 34 | 33 | 58 | 62 | 39 | 53 |
Catholic | 418 | 617 | 732 | 670 | 638 | 1116 | 949 | 792 | 633 |
csv2 <- rbind(c("religion", "<$10k", "$10-20k", "$20-30k", "$30-40k",
"$50-75k", "$40-50k", "$75-100k", "$100-150k", ">150k"),
c("Agnostic", 27, 34, 60, 81, 76, 137, 122, 109, 84),
c("Atheist", 12, 27, 37, 52, 35, 70, 73, 59, 74),
c("Buddhist", 27, 21, 30, 34, 33, 58, 62, 39, 53),
c("Catholic",418, 617, 732, 670, 638, 1116, 949, 792, 633))
write.table(csv2, file = "DATA_607_Project2_2.csv", sep = ",", col.names=F, row.names=F)
wide_data2 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
"master/DATA%20607/DATA_607_Project2_2.csv"),
stringsAsFactors = F)
long_data2 <- wide_data2 %>%
mutate(Religion = religion) %>%
gather(Income, Amount, 2:10) %>%
mutate(Income = gsub("\\.", "-", Income),
Income = gsub("X--", "\\<\\$", Income),
Income = gsub("X-", "\\$", Income),
Income = sub("\\$150k", "\\>\\$150k", Income)) %>%
group_by(Religion) %>%
mutate(Income_Level = seq_along(Income), # For sorting
Total = sum(Amount),
Frequency = Amount / Total) %>%
group_by(Income_Level) %>%
mutate(Freq_Sum = cumsum(Frequency)) %>%
select(2,5,3,4,7,8) %>% ungroup(); head(long_data2)
## # A tibble: 6 × 6
## Religion Income_Level Income Amount Frequency Freq_Sum
## <chr> <int> <chr> <int> <dbl> <dbl>
## 1 Agnostic 1 <$10k 27 0.03698630 0.03698630
## 2 Atheist 1 <$10k 12 0.02733485 0.06432115
## 3 Buddhist 1 <$10k 27 0.07563025 0.13995141
## 4 Catholic 1 <$10k 418 0.06367098 0.20362239
## 5 Agnostic 2 $10-20k 34 0.04657534 0.04657534
## 6 Atheist 2 $10-20k 27 0.06150342 0.10807876
long_data2 %>%
group_by(Religion) %>%
summarise(n = sum(Amount))
## # A tibble: 4 × 2
## Religion n
## <chr> <int>
## 1 Agnostic 730
## 2 Atheist 439
## 3 Buddhist 357
## 4 Catholic 6565
long_data2 %>%
group_by(Income_Level, Income) %>%
summarise(Income_Sum = sum(Amount))
## Source: local data frame [9 x 3]
## Groups: Income_Level [?]
##
## Income_Level Income Income_Sum
## <int> <chr> <int>
## 1 1 <$10k 484
## 2 2 $10-20k 699
## 3 3 $20-30k 859
## 4 4 $30-40k 837
## 5 5 $50-75k 782
## 6 6 $40-50k 1381
## 7 7 $75-100k 1206
## 8 8 $100-150k 999
## 9 9 >$150k 844
ggplot(long_data2, aes(x=paste(Income_Level, "\n", Income), y=Frequency, fill=Religion)) +
geom_bar(width = 0.9, stat = "identity") +
geom_text(aes(label = round(Frequency,2), y=Freq_Sum-Frequency/2), size = 3) +
ggtitle("Income Level by Religion") +
xlab("Income Levels") + ylab("Cumulative Sum of Member Frequency")
chi_table2 <- long_data2 %>%
select(1,3,4) %>%
spread(Income, Amount) %>%
select(2:10)
chisq.test(chi_table2)
##
## Pearson's Chi-squared test
##
## data: chi_table2
## X-squared = 94.837, df = 24, p-value = 2.249e-10
Assuming the data were gathered through simple random sampling, set the null hypothesis to independence between Religion and Income, and the alternate hypothesis to a relationship between Religion and Income. Failure to reject the null hypothesis will be assessed on a \(p\)-value of \(0.01\) such that \(H_{ 0 }: p\le 0.05; H_{ A }: p>0.05\). Given two categorical variables from a single population, the Chi-Square test for independence is applied to determine whether there is a significant association between the two variables.i The test results indicate the existence of a \(p\)-value \(=2.2487166\times 10^{-10}<0.01\) indicating that the probability of observing a sample statistic as extreme as the test statistic is extremely low; therefore the null hypothesis is rejected.ii The data supports the conclusion that there is a relationship between Religion and Income.
Contributor: Marco Siqueira Campos
Source: Religious Landscape Study, Hadley Wicham, Pew Research Center, 2016.
Analysis: Income by religion.
Religious tradition | Less than $30,000 | $30,000- $49,999 | $50,000- $99,999 | $100,000 or more | Sample Size |
---|---|---|---|---|---|
Buddhist | 36% | 18% | 32% | 13% | 233 |
Catholic | 36% | 19% | 26% | 19% | 6,137 |
Evangelical Protestant | 35% | 22% | 28% | 14% | 7,462 |
Hindu | 17% | 13% | 34% | 36% | 172 |
Historically Black Protestant | 53% | 22% | 17% | 8% | 1,704 |
Jehovah’s Witness | 48% | 25% | 22% | 4% | 208 |
Jewish | 16% | 15% | 24% | 44% | 708 |
Mainline Protestant | 29% | 20% | 28% | 23% | 5,208 |
Mormon | 27% | 20% | 33% | 20% | 594 |
Muslim | 34% | 17% | 29% | 20% | 205 |
Orthodox Christian | 18% | 17% | 36% | 29% | 155 |
Unaffiliated (religious “nones”) | 33% | 20% | 26% | 21% | 6,790 |
csv3 <- rbind(c("Religious tradition", "Less than $30,000", "$30,000-$49,999",
"$50,000-$99,999", "$100,000 or more", "Sample Size"),
c("Buddhist", '36%', '18%', '32%', '13%', '233'),
c("Catholic", '36%', '19%', '26%', '19%', '6,137'),
c("Evangelical Protestant", '35%', '22%', '28%', '14%', '7,462'),
c("Hindu", '17%', '13%', '34%', '36%', '172'),
c("Historically Black Protestant", '53%', '22%', '17%', '8%', '1,704'),
c("Jehovah's Witness", '48%', '25%', '22%', '4%', '208'),
c("Jewish", '16%', '15%', '24%', '44%', '708'),
c("Mainline Protestant", '29%', '20%', '28%', '23%', '5,208'),
c("Mormon", '27%', '20%', '33%', '20%', '594'),
c("Muslim", '34%', '17%', '29%', '20%', '205'),
c("Orthodox Christian", '18%', '17%', '36%', '29%', '155'),
c("Unaffiliated (religious 'nones')", '33%', '20%', '26%', '21%', '6,790'))
write.table(csv3, file = "DATA_607_Project2_3.csv", sep = ",", col.names=F, row.names=F)
wide_data3 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
"master/DATA%20607/DATA_607_Project2_3.csv"),
stringsAsFactors = F)
long_data3 <- wide_data3 %>%
unite(temp, Religious.tradition, Sample.Size) %>%
gather(Income, Frequency, 2:5) %>%
separate(temp, into=c("Religion", "Amount"), sep = "_", convert=T) %>%
mutate(Frequency = as.numeric(gsub("%", "", Frequency)) / 100,
Amount = round(as.integer(sub(",", "", Amount)) * Frequency, 0),
Income = gsub("L.+n\\.{2}", "\\<\\$", Income),
Income = gsub("\\.o.+e", "\\+", Income),
Income = gsub("\\.{2}", "-\\$", Income),
Income = gsub("X\\.", "\\$", Income),
Income = gsub("\\.", ",", Income)) %>%
group_by(Religion) %>%
mutate(Income_Level = seq_along(Income)) %>% # To sort ggplot x-axis
group_by(Income) %>%
mutate(Freq_Sum = cumsum(Frequency)) %>%
ungroup() %>% select(1,5,3,2,4,6); head(long_data3)
## # A tibble: 6 × 6
## Religion Income_Level Income Amount Frequency
## <chr> <int> <chr> <dbl> <dbl>
## 1 Buddhist 1 <$30,000 84 0.36
## 2 Catholic 1 <$30,000 2209 0.36
## 3 Evangelical Protestant 1 <$30,000 2612 0.35
## 4 Hindu 1 <$30,000 29 0.17
## 5 Historically Black Protestant 1 <$30,000 903 0.53
## 6 Jehovah's Witness 1 <$30,000 100 0.48
## # ... with 1 more variables: Freq_Sum <dbl>
Since Data given in terms of frequency, there are some unavoidable discrepancies due to rounding.
long_data3 %>%
group_by(Religion) %>%
summarise(n = sum(Amount))
## # A tibble: 12 × 2
## Religion n
## <chr> <dbl>
## 1 Buddhist 231
## 2 Catholic 6137
## 3 Evangelical Protestant 7388
## 4 Hindu 171
## 5 Historically Black Protestant 1704
## 6 Jehovah's Witness 206
## 7 Jewish 701
## 8 Mainline Protestant 5208
## 9 Mormon 594
## 10 Muslim 205
## 11 Orthodox Christian 155
## 12 Unaffiliated (religious 'nones') 6790
long_data3 %>%
group_by(Income_Level, Income) %>%
summarise(Income_Sum = sum(Amount))
## Source: local data frame [4 x 3]
## Groups: Income_Level [?]
##
## Income_Level Income Income_Sum
## <int> <chr> <dbl>
## 1 1 <$30,000 10059
## 2 2 $30,000-$49,999 5985
## 3 3 $50,000-$99,999 7858
## 4 4 $100,000+ 5588
ggplot(long_data3, aes(x=paste(Income_Level, "\n", Income), y=Frequency, fill=Religion)) +
geom_bar(width = 0.9, stat = "identity") +
geom_text(aes(label = round(Frequency,2), y=Freq_Sum-Frequency/2), size = 3) +
ggtitle("Income Level by Religion") +
xlab("Income Levels") + ylab("Cumulative Sum of Member Frequency")
chi_table3 <- long_data3 %>%
select(1,3,4) %>%
spread(Income, Amount) %>%
select(2:5)
chisq.test(chi_table3)
##
## Pearson's Chi-squared test
##
## data: chi_table3
## X-squared = 1046.7, df = 33, p-value < 2.2e-16
Assuming the data were gathered through simple random sampling, set the null hypothesis to independence between Religion and Income, and the alternate hypothesis to a relationship between Religion and Income. Failure to reject the null hypothesis will be assessed on a \(p\)-value of \(0.01\) such that \(H_{ 0 }: p\le 0.05; H_{ A }: p>0.05\). Given two categorical variables from a single population, the Chi-Square test for independence is applied to determine whether there is a significant association between the two variables.i The test results indicate the existence of a \(p\)-value \(=1.444359\times 10^{-198}<0.01\) indicating that the probability of observing a sample statistic as extreme as the test statistic is extremely low; therefore the null hypothesis is rejected.ii The data supports the conclusion that there is a relationship between Religion and Income.