Project Instructions

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to choose any three of the “wide” datasets identified in the Week [5] Discussion items. You may use your own dataset. For each of the three chosen datasets: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data [Most of your grade will be based on this step!]. Perform the analysis requested in the discussion item.

library(tidyr)
library(dplyr)
library(ggplot2)

(1) Data Set 1

Contributor: Jose Zuniga

Source: Tips for Simplifying Crosstab Query Statements, Rob Gravelle, Database Journal, 2010.

Analysis: Compare monthly citizenship applications for the given regions.

Month REGION 1 REGION 2 REGION 3 REGION 4 REGION 5 TOTAL
April 13 33 76 2 47 171
May 17 55 209 1 143 425
June 8 63 221 1 127 420
July 13 104 240 6 123 486
August 18 121 274 9 111 533
September 25 160 239 2 88 514
October 9 88 295 2 127 521
November 2 86 292 2 120 502
December 1 128 232 6 155 522
TOTAL 106 838 2078 31 1041 4094

(1.1) Create .CSV file with data and “wide” structure of information in dataset.

csv1 <- rbind(c("Month", "REGION 1", "REGION 2", 
                "REGION 3", "REGION 4", "REGION 5", "TOTAL"),
              c("April", 13, 33, 76, 2, 47, 171),
              c("May", 17, 55, 209, 1, 143, 425),
              c("June", 8, 63, 221, 1, 127, 420),
              c("July", 13, 104, 240, 6, 123, 486),
              c("August", 18, 121, 274, 9, 111, 533),
              c("September", 25, 160, 239, 2, 88, 514),
              c("October", 9, 88, 295, 2, 127, 521),
              c("November", 2, 86, 292, 2, 120, 502),
              c("December", 1, 128, 232, 6, 155, 522),
              c("TOTAL", 106, 838, 2078, 31, 1041, 4094))

write.table(csv1, file = "DATA_607_Project2_1.csv", sep = ",", col.names=F, row.names=F)

(1.2) Read the information from the .CSV file into R.

wide_data1 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
                             "master/DATA%20607/DATA_607_Project2_1.csv"), 
                             stringsAsFactors = F)

(1.3) Use tidyr and dplyr to tidy and transform the data.

long_data1 <- wide_data1 %>% 
  gather(Region, Amount, 2:6) %>% 
  mutate(Contribution = Amount / TOTAL, 
         Month = match(Month, month.name),
         Region = substr(Region, 8, 8)) %>% 
  select(-TOTAL) %>% 
  filter(Month != "NA"); head(long_data1)
##   Month Region Amount Contribution
## 1     4      1     13   0.07602339
## 2     5      1     17   0.04000000
## 3     6      1      8   0.01904762
## 4     7      1     13   0.02674897
## 5     8      1     18   0.03377111
## 6     9      1     25   0.04863813

(1.4) Perform the analysis requested by the contributor.

long_data1 %>%
  group_by(Region) %>%
  summarise(Region_Sum = sum(Amount))
## # A tibble: 5 × 2
##   Region Region_Sum
##    <chr>      <int>
## 1      1        106
## 2      2        838
## 3      3       2078
## 4      4         31
## 5      5       1041
long_data1 %>%
  group_by(Month) %>%
  summarise(Month_Sum = sum(Amount))
## # A tibble: 9 × 2
##   Month Month_Sum
##   <int>     <int>
## 1     4       171
## 2     5       425
## 3     6       420
## 4     7       486
## 5     8       533
## 6     9       514
## 7    10       521
## 8    11       502
## 9    12       522
ggplot(long_data1, aes(x=Region, y=Amount, fill=Region)) + 
  geom_bar(width = 1, stat = "identity") + 
  labs(title = "Applications by Region", y = "Applications")

long_data1 %>% select(-4) %>% spread(Month, Amount)
##   Region  4   5   6   7   8   9  10  11  12
## 1      1 13  17   8  13  18  25   9   2   1
## 2      2 33  55  63 104 121 160  88  86 128
## 3      3 76 209 221 240 274 239 295 292 232
## 4      4  2   1   1   6   9   2   2   2   6
## 5      5 47 143 127 123 111  88 127 120 155
ggplot(long_data1, aes(x=as.character(Month), y=Amount, fill=Region)) + 
  geom_bar(width = 0.9, stat = "identity") +
  ggtitle("Monthly Applications by Region")  + 
  xlab("Month") + ylab("Applications")

(1.5) Conclusions.

Visual analysis presents the citizenship application data in a format that is even easier to digest than the original table. The monthly contribution of Region 4 which is represented by the color blue is almost unnoticeable. We see quite the opposite with the contribution of Region 3 which shows its monthly dominance in green. The other benefit of this final visualization is that the distribution of the aggregate number of applications per month can be seen in the heights of the bars.


(2) Data Set 2

Contributor: Yifei Li

Source: Reshape, Ramnath Vaidyanathan, Introduction to R, 2013.

Analysis: The correlation between religious groups and income distribution.

religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k $100-150k >150k
Agnostic 27 34 60 81 76 137 122 109 84
Atheist 12 27 37 52 35 70 73 59 74
Buddhist 27 21 30 34 33 58 62 39 53
Catholic 418 617 732 670 638 1116 949 792 633

(2.1) Create .CSV file with data and “wide” structure of information in dataset.

csv2 <- rbind(c("religion", "<$10k", "$10-20k", "$20-30k", "$30-40k", 
                "$50-75k", "$40-50k", "$75-100k", "$100-150k", ">150k"),
              c("Agnostic", 27, 34, 60, 81, 76, 137, 122, 109, 84),
              c("Atheist", 12, 27, 37, 52, 35, 70, 73, 59, 74),
              c("Buddhist", 27, 21, 30, 34, 33, 58, 62, 39, 53),
              c("Catholic",418, 617, 732, 670, 638, 1116, 949, 792, 633))

write.table(csv2, file = "DATA_607_Project2_2.csv", sep = ",", col.names=F, row.names=F)

(2.2) Read the information from the .CSV file into R.

wide_data2 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
                             "master/DATA%20607/DATA_607_Project2_2.csv"), 
                             stringsAsFactors = F)

(2.3) Use tidyr and dplyr to tidy and transform the data.

long_data2 <- wide_data2 %>% 
  mutate(Religion = religion) %>% 
  gather(Income, Amount, 2:10) %>% 
  mutate(Income = gsub("\\.", "-", Income),
         Income = gsub("X--", "\\<\\$", Income),
         Income = gsub("X-", "\\$", Income),
         Income = sub("\\$150k", "\\>\\$150k", Income)) %>%
  group_by(Religion) %>% 
  mutate(Income_Level = seq_along(Income), # For sorting
         Total = sum(Amount),
         Frequency = Amount / Total) %>% 
  group_by(Income_Level) %>% 
  mutate(Freq_Sum = cumsum(Frequency)) %>% 
  select(2,5,3,4,7,8) %>% ungroup(); head(long_data2)
## # A tibble: 6 × 6
##   Religion Income_Level  Income Amount  Frequency   Freq_Sum
##      <chr>        <int>   <chr>  <int>      <dbl>      <dbl>
## 1 Agnostic            1   <$10k     27 0.03698630 0.03698630
## 2  Atheist            1   <$10k     12 0.02733485 0.06432115
## 3 Buddhist            1   <$10k     27 0.07563025 0.13995141
## 4 Catholic            1   <$10k    418 0.06367098 0.20362239
## 5 Agnostic            2 $10-20k     34 0.04657534 0.04657534
## 6  Atheist            2 $10-20k     27 0.06150342 0.10807876

(2.4) Perform the analysis requested by the contributor.

long_data2 %>%
  group_by(Religion) %>%
  summarise(n = sum(Amount))
## # A tibble: 4 × 2
##   Religion     n
##      <chr> <int>
## 1 Agnostic   730
## 2  Atheist   439
## 3 Buddhist   357
## 4 Catholic  6565
long_data2 %>%
  group_by(Income_Level, Income) %>%
  summarise(Income_Sum = sum(Amount))
## Source: local data frame [9 x 3]
## Groups: Income_Level [?]
## 
##   Income_Level    Income Income_Sum
##          <int>     <chr>      <int>
## 1            1     <$10k        484
## 2            2   $10-20k        699
## 3            3   $20-30k        859
## 4            4   $30-40k        837
## 5            5   $50-75k        782
## 6            6   $40-50k       1381
## 7            7  $75-100k       1206
## 8            8 $100-150k        999
## 9            9    >$150k        844
ggplot(long_data2, aes(x=paste(Income_Level, "\n", Income), y=Frequency, fill=Religion)) + 
  geom_bar(width = 0.9, stat = "identity") +
  geom_text(aes(label = round(Frequency,2), y=Freq_Sum-Frequency/2), size = 3) +
  ggtitle("Income Level by Religion")  + 
  xlab("Income Levels") + ylab("Cumulative Sum of Member Frequency")

chi_table2 <- long_data2 %>%
  select(1,3,4) %>%
  spread(Income, Amount) %>%
  select(2:10)
chisq.test(chi_table2)
## 
##  Pearson's Chi-squared test
## 
## data:  chi_table2
## X-squared = 94.837, df = 24, p-value = 2.249e-10

(2.5) Conclusions.

Assuming the data were gathered through simple random sampling, set the null hypothesis to independence between Religion and Income, and the alternate hypothesis to a relationship between Religion and Income. Failure to reject the null hypothesis will be assessed on a \(p\)-value of \(0.01\) such that \(H_{ 0 }: p\le 0.05; H_{ A }: p>0.05\). Given two categorical variables from a single population, the Chi-Square test for independence is applied to determine whether there is a significant association between the two variables.i The test results indicate the existence of a \(p\)-value \(=2.2487166\times 10^{-10}<0.01\) indicating that the probability of observing a sample statistic as extreme as the test statistic is extremely low; therefore the null hypothesis is rejected.ii The data supports the conclusion that there is a relationship between Religion and Income.


(3) Data Set 3

Contributor: Marco Siqueira Campos

Source: Religious Landscape Study, Hadley Wicham, Pew Research Center, 2016.

Analysis: Income by religion.

Religious tradition Less than $30,000 $30,000- $49,999 $50,000- $99,999 $100,000 or more Sample Size
Buddhist 36% 18% 32% 13% 233
Catholic 36% 19% 26% 19% 6,137
Evangelical Protestant 35% 22% 28% 14% 7,462
Hindu 17% 13% 34% 36% 172
Historically Black Protestant 53% 22% 17% 8% 1,704
Jehovah’s Witness 48% 25% 22% 4% 208
Jewish 16% 15% 24% 44% 708
Mainline Protestant 29% 20% 28% 23% 5,208
Mormon 27% 20% 33% 20% 594
Muslim 34% 17% 29% 20% 205
Orthodox Christian 18% 17% 36% 29% 155
Unaffiliated (religious “nones”) 33% 20% 26% 21% 6,790

(3.1) Create .CSV file with data and “wide” structure of information in dataset.

csv3 <- rbind(c("Religious tradition", "Less than $30,000", "$30,000-$49,999", 
                "$50,000-$99,999", "$100,000 or more", "Sample Size"),
             c("Buddhist", '36%', '18%', '32%', '13%', '233'),
             c("Catholic", '36%', '19%', '26%', '19%', '6,137'),
             c("Evangelical Protestant", '35%', '22%', '28%', '14%', '7,462'),
             c("Hindu", '17%', '13%', '34%', '36%', '172'),
             c("Historically Black Protestant", '53%', '22%', '17%', '8%', '1,704'),
             c("Jehovah's Witness", '48%', '25%', '22%', '4%', '208'),
             c("Jewish", '16%', '15%', '24%', '44%', '708'),
             c("Mainline Protestant", '29%', '20%', '28%', '23%', '5,208'),
             c("Mormon", '27%', '20%', '33%', '20%', '594'),
             c("Muslim", '34%', '17%', '29%', '20%', '205'),
             c("Orthodox Christian", '18%', '17%', '36%', '29%', '155'),
             c("Unaffiliated (religious 'nones')", '33%', '20%', '26%', '21%', '6,790'))

write.table(csv3, file = "DATA_607_Project2_3.csv", sep = ",", col.names=F, row.names=F)

(3.2) Read the information from the .CSV file into R.

wide_data3 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
                              "master/DATA%20607/DATA_607_Project2_3.csv"), 
                              stringsAsFactors = F)

(3.3) Use tidyr and dplyr to tidy and transform the data.

long_data3 <- wide_data3 %>% 
  unite(temp, Religious.tradition, Sample.Size) %>%
  gather(Income, Frequency, 2:5) %>%
  separate(temp, into=c("Religion", "Amount"), sep = "_", convert=T) %>% 
  mutate(Frequency = as.numeric(gsub("%", "", Frequency)) / 100,
         Amount = round(as.integer(sub(",", "", Amount)) * Frequency, 0),
         Income = gsub("L.+n\\.{2}", "\\<\\$", Income),
         Income = gsub("\\.o.+e", "\\+", Income),
         Income = gsub("\\.{2}", "-\\$", Income),
         Income = gsub("X\\.", "\\$", Income),
         Income = gsub("\\.", ",", Income)) %>% 
  group_by(Religion) %>% 
  mutate(Income_Level = seq_along(Income)) %>%  # To sort ggplot x-axis
  group_by(Income) %>% 
  mutate(Freq_Sum = cumsum(Frequency)) %>% 
  ungroup() %>% select(1,5,3,2,4,6); head(long_data3)
## # A tibble: 6 × 6
##                        Religion Income_Level   Income Amount Frequency
##                           <chr>        <int>    <chr>  <dbl>     <dbl>
## 1                      Buddhist            1 <$30,000     84      0.36
## 2                      Catholic            1 <$30,000   2209      0.36
## 3        Evangelical Protestant            1 <$30,000   2612      0.35
## 4                         Hindu            1 <$30,000     29      0.17
## 5 Historically Black Protestant            1 <$30,000    903      0.53
## 6             Jehovah's Witness            1 <$30,000    100      0.48
## # ... with 1 more variables: Freq_Sum <dbl>

(3.4) Perform the analysis requested by the contributor.

Since Data given in terms of frequency, there are some unavoidable discrepancies due to rounding.

long_data3 %>%
  group_by(Religion) %>%
  summarise(n = sum(Amount))
## # A tibble: 12 × 2
##                            Religion     n
##                               <chr> <dbl>
## 1                          Buddhist   231
## 2                          Catholic  6137
## 3            Evangelical Protestant  7388
## 4                             Hindu   171
## 5     Historically Black Protestant  1704
## 6                 Jehovah's Witness   206
## 7                            Jewish   701
## 8               Mainline Protestant  5208
## 9                            Mormon   594
## 10                           Muslim   205
## 11               Orthodox Christian   155
## 12 Unaffiliated (religious 'nones')  6790
long_data3 %>%
  group_by(Income_Level, Income) %>%
  summarise(Income_Sum = sum(Amount))
## Source: local data frame [4 x 3]
## Groups: Income_Level [?]
## 
##   Income_Level          Income Income_Sum
##          <int>           <chr>      <dbl>
## 1            1        <$30,000      10059
## 2            2 $30,000-$49,999       5985
## 3            3 $50,000-$99,999       7858
## 4            4       $100,000+       5588
ggplot(long_data3, aes(x=paste(Income_Level, "\n", Income), y=Frequency, fill=Religion)) + 
  geom_bar(width = 0.9, stat = "identity") +
  geom_text(aes(label = round(Frequency,2), y=Freq_Sum-Frequency/2), size = 3) +
  ggtitle("Income Level by Religion")  + 
  xlab("Income Levels") + ylab("Cumulative Sum of Member Frequency")

chi_table3 <- long_data3 %>%
  select(1,3,4) %>%
  spread(Income, Amount) %>%
  select(2:5)
chisq.test(chi_table3)
## 
##  Pearson's Chi-squared test
## 
## data:  chi_table3
## X-squared = 1046.7, df = 33, p-value < 2.2e-16

(3.5) Conclusions.

Assuming the data were gathered through simple random sampling, set the null hypothesis to independence between Religion and Income, and the alternate hypothesis to a relationship between Religion and Income. Failure to reject the null hypothesis will be assessed on a \(p\)-value of \(0.01\) such that \(H_{ 0 }: p\le 0.05; H_{ A }: p>0.05\). Given two categorical variables from a single population, the Chi-Square test for independence is applied to determine whether there is a significant association between the two variables.i The test results indicate the existence of a \(p\)-value \(=1.444359\times 10^{-198}<0.01\) indicating that the probability of observing a sample statistic as extreme as the test statistic is extremely low; therefore the null hypothesis is rejected.ii The data supports the conclusion that there is a relationship between Religion and Income.