Project Instructions

The goal of this assignment is to give you practice in preparing different datasets for downstream analysis work. Your task is to choose any three of the “wide” datasets identified in the Week [5] Discussion items. You may use your own dataset. For each of the three chosen datasets: Create a .CSV file (or optionally, a MySQL database!) that includes all of the information included in the dataset. You’re encouraged to use a “wide” structure similar to how the information appears in the discussion item, so that you can practice tidying and transformations. Read the information from your .CSV file into R, and use tidyr and dplyr as needed to tidy and transform your data [Most of your grade will be based on this step!]. Perform the analysis requested in the discussion item.

library(tidyr)
library(dplyr)
library(ggplot2)

(1) Data Set 1

Contributor: Jose Zuniga

Source: Tips for Simplifying Crosstab Query Statements, Rob Gravelle, Database Journal, 2010.

Analysis: Compare monthly citizenship applications for the given regions.

Month	REGION 1	REGION 2	REGION 3	REGION 4	REGION 5	TOTAL
April	13	33	76	2	47	171
May	17	55	209	1	143	425
June	8	63	221	1	127	420
July	13	104	240	6	123	486
August	18	121	274	9	111	533
September	25	160	239	2	88	514
October	9	88	295	2	127	521
November	2	86	292	2	120	502
December	1	128	232	6	155	522
TOTAL	106	838	2078	31	1041	4094

(1.1) Create .CSV file with data and “wide” structure of information in dataset.

csv1 <- rbind(c("Month", "REGION 1", "REGION 2", 
                "REGION 3", "REGION 4", "REGION 5", "TOTAL"),
              c("April", 13, 33, 76, 2, 47, 171),
              c("May", 17, 55, 209, 1, 143, 425),
              c("June", 8, 63, 221, 1, 127, 420),
              c("July", 13, 104, 240, 6, 123, 486),
              c("August", 18, 121, 274, 9, 111, 533),
              c("September", 25, 160, 239, 2, 88, 514),
              c("October", 9, 88, 295, 2, 127, 521),
              c("November", 2, 86, 292, 2, 120, 502),
              c("December", 1, 128, 232, 6, 155, 522),
              c("TOTAL", 106, 838, 2078, 31, 1041, 4094))

write.table(csv1, file = "DATA_607_Project2_1.csv", sep = ",", col.names=F, row.names=F)

(1.2) Read the information from the .CSV file into R.

wide_data1 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
                             "master/DATA%20607/DATA_607_Project2_1.csv"), 
                             stringsAsFactors = F)

(1.3) Use tidyr and dplyr to tidy and transform the data.

long_data1 <- wide_data1 %>% 
  gather(Region, Amount, 2:6) %>% 
  mutate(Contribution = Amount / TOTAL, 
         Month = match(Month, month.name),
         Region = substr(Region, 8, 8)) %>% 
  select(-TOTAL) %>% 
  filter(Month != "NA"); head(long_data1)

##   Month Region Amount Contribution
## 1     4      1     13   0.07602339
## 2     5      1     17   0.04000000
## 3     6      1      8   0.01904762
## 4     7      1     13   0.02674897
## 5     8      1     18   0.03377111
## 6     9      1     25   0.04863813

(1.4) Perform the analysis requested by the contributor.

long_data1 %>%
  group_by(Region) %>%
  summarise(Region_Sum = sum(Amount))

## # A tibble: 5 × 2
##   Region Region_Sum
##    <chr>      <int>
## 1      1        106
## 2      2        838
## 3      3       2078
## 4      4         31
## 5      5       1041

long_data1 %>%
  group_by(Month) %>%
  summarise(Month_Sum = sum(Amount))

## # A tibble: 9 × 2
##   Month Month_Sum
##   <int>     <int>
## 1     4       171
## 2     5       425
## 3     6       420
## 4     7       486
## 5     8       533
## 6     9       514
## 7    10       521
## 8    11       502
## 9    12       522

ggplot(long_data1, aes(x=Region, y=Amount, fill=Region)) + 
  geom_bar(width = 1, stat = "identity") + 
  labs(title = "Applications by Region", y = "Applications")

long_data1 %>% select(-4) %>% spread(Month, Amount)

##   Region  4   5   6   7   8   9  10  11  12
## 1      1 13  17   8  13  18  25   9   2   1
## 2      2 33  55  63 104 121 160  88  86 128
## 3      3 76 209 221 240 274 239 295 292 232
## 4      4  2   1   1   6   9   2   2   2   6
## 5      5 47 143 127 123 111  88 127 120 155

ggplot(long_data1, aes(x=as.character(Month), y=Amount, fill=Region)) + 
  geom_bar(width = 0.9, stat = "identity") +
  ggtitle("Monthly Applications by Region")  + 
  xlab("Month") + ylab("Applications")

(1.5) Conclusions.

Visual analysis presents the citizenship application data in a format that is even easier to digest than the original table. The monthly contribution of Region 4 which is represented by the color blue is almost unnoticeable. We see quite the opposite with the contribution of Region 3 which shows its monthly dominance in green. The other benefit of this final visualization is that the distribution of the aggregate number of applications per month can be seen in the heights of the bars.

(2) Data Set 2

Contributor: Yifei Li

Source: Reshape, Ramnath Vaidyanathan, Introduction to R, 2013.

Analysis: The correlation between religious groups and income distribution.

religion	<$10k	$10-20k	$20-30k	$30-40k	$40-50k	$50-75k	$75-100k	$100-150k	>150k
Agnostic	27	34	60	81	76	137	122	109	84
Atheist	12	27	37	52	35	70	73	59	74
Buddhist	27	21	30	34	33	58	62	39	53
Catholic	418	617	732	670	638	1116	949	792	633

(2.1) Create .CSV file with data and “wide” structure of information in dataset.

csv2 <- rbind(c("religion", "<$10k", "$10-20k", "$20-30k", "$30-40k", 
                "$50-75k", "$40-50k", "$75-100k", "$100-150k", ">150k"),
              c("Agnostic", 27, 34, 60, 81, 76, 137, 122, 109, 84),
              c("Atheist", 12, 27, 37, 52, 35, 70, 73, 59, 74),
              c("Buddhist", 27, 21, 30, 34, 33, 58, 62, 39, 53),
              c("Catholic",418, 617, 732, 670, 638, 1116, 949, 792, 633))

write.table(csv2, file = "DATA_607_Project2_2.csv", sep = ",", col.names=F, row.names=F)

(2.2) Read the information from the .CSV file into R.

wide_data2 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
                             "master/DATA%20607/DATA_607_Project2_2.csv"), 
                             stringsAsFactors = F)

(2.3) Use tidyr and dplyr to tidy and transform the data.

long_data2 <- wide_data2 %>% 
  mutate(Religion = religion) %>% 
  gather(Income, Amount, 2:10) %>% 
  mutate(Income = gsub("\\.", "-", Income),
         Income = gsub("X--", "\\<\\$", Income),
         Income = gsub("X-", "\\$", Income),
         Income = sub("\\$150k", "\\>\\$150k", Income)) %>%
  group_by(Religion) %>% 
  mutate(Income_Level = seq_along(Income), # For sorting
         Total = sum(Amount),
         Frequency = Amount / Total) %>% 
  group_by(Income_Level) %>% 
  mutate(Freq_Sum = cumsum(Frequency)) %>% 
  select(2,5,3,4,7,8) %>% ungroup(); head(long_data2)

## # A tibble: 6 × 6
##   Religion Income_Level  Income Amount  Frequency   Freq_Sum
##      <chr>        <int>   <chr>  <int>      <dbl>      <dbl>
## 1 Agnostic            1   <$10k     27 0.03698630 0.03698630
## 2  Atheist            1   <$10k     12 0.02733485 0.06432115
## 3 Buddhist            1   <$10k     27 0.07563025 0.13995141
## 4 Catholic            1   <$10k    418 0.06367098 0.20362239
## 5 Agnostic            2 $10-20k     34 0.04657534 0.04657534
## 6  Atheist            2 $10-20k     27 0.06150342 0.10807876

(2.4) Perform the analysis requested by the contributor.

long_data2 %>%
  group_by(Religion) %>%
  summarise(n = sum(Amount))

## # A tibble: 4 × 2
##   Religion     n
##      <chr> <int>
## 1 Agnostic   730
## 2  Atheist   439
## 3 Buddhist   357
## 4 Catholic  6565

long_data2 %>%
  group_by(Income_Level, Income) %>%
  summarise(Income_Sum = sum(Amount))

## Source: local data frame [9 x 3]
## Groups: Income_Level [?]
## 
##   Income_Level    Income Income_Sum
##          <int>     <chr>      <int>
## 1            1     <$10k        484
## 2            2   $10-20k        699
## 3            3   $20-30k        859
## 4            4   $30-40k        837
## 5            5   $50-75k        782
## 6            6   $40-50k       1381
## 7            7  $75-100k       1206
## 8            8 $100-150k        999
## 9            9    >$150k        844

ggplot(long_data2, aes(x=paste(Income_Level, "\n", Income), y=Frequency, fill=Religion)) + 
  geom_bar(width = 0.9, stat = "identity") +
  geom_text(aes(label = round(Frequency,2), y=Freq_Sum-Frequency/2), size = 3) +
  ggtitle("Income Level by Religion")  + 
  xlab("Income Levels") + ylab("Cumulative Sum of Member Frequency")

chi_table2 <- long_data2 %>%
  select(1,3,4) %>%
  spread(Income, Amount) %>%
  select(2:10)
chisq.test(chi_table2)

## 
##  Pearson's Chi-squared test
## 
## data:  chi_table2
## X-squared = 94.837, df = 24, p-value = 2.249e-10

(2.5) Conclusions.

Assuming the data were gathered through simple random sampling, set the null hypothesis to independence between Religion and Income, and the alternate hypothesis to a relationship between Religion and Income. Failure to reject the null hypothesis will be assessed on a $p$-value of $0.01$ such that $H_{ 0 }: p\le 0.05; H_{ A }: p>0.05$. Given two categorical variables from a single population, the Chi-Square test for independence is applied to determine whether there is a significant association between the two variables.ⁱ The test results indicate the existence of a $p$-value $=2.2487166\times 10^{-10}<0.01$ indicating that the probability of observing a sample statistic as extreme as the test statistic is extremely low; therefore the null hypothesis is rejected.ⁱⁱ The data supports the conclusion that there is a relationship between Religion and Income.

(3) Data Set 3

Contributor: Marco Siqueira Campos

Source: Religious Landscape Study, Hadley Wicham, Pew Research Center, 2016.

Analysis: Income by religion.

Religious tradition	Less than $30,000	$30,000- $49,999	$50,000- $99,999	$100,000 or more	Sample Size
Buddhist	36%	18%	32%	13%	233
Catholic	36%	19%	26%	19%	6,137
Evangelical Protestant	35%	22%	28%	14%	7,462
Hindu	17%	13%	34%	36%	172
Historically Black Protestant	53%	22%	17%	8%	1,704
Jehovah’s Witness	48%	25%	22%	4%	208
Jewish	16%	15%	24%	44%	708
Mainline Protestant	29%	20%	28%	23%	5,208
Mormon	27%	20%	33%	20%	594
Muslim	34%	17%	29%	20%	205
Orthodox Christian	18%	17%	36%	29%	155
Unaffiliated (religious “nones”)	33%	20%	26%	21%	6,790

(3.1) Create .CSV file with data and “wide” structure of information in dataset.

csv3 <- rbind(c("Religious tradition", "Less than $30,000", "$30,000-$49,999", 
                "$50,000-$99,999", "$100,000 or more", "Sample Size"),
             c("Buddhist", '36%', '18%', '32%', '13%', '233'),
             c("Catholic", '36%', '19%', '26%', '19%', '6,137'),
             c("Evangelical Protestant", '35%', '22%', '28%', '14%', '7,462'),
             c("Hindu", '17%', '13%', '34%', '36%', '172'),
             c("Historically Black Protestant", '53%', '22%', '17%', '8%', '1,704'),
             c("Jehovah's Witness", '48%', '25%', '22%', '4%', '208'),
             c("Jewish", '16%', '15%', '24%', '44%', '708'),
             c("Mainline Protestant", '29%', '20%', '28%', '23%', '5,208'),
             c("Mormon", '27%', '20%', '33%', '20%', '594'),
             c("Muslim", '34%', '17%', '29%', '20%', '205'),
             c("Orthodox Christian", '18%', '17%', '36%', '29%', '155'),
             c("Unaffiliated (religious 'nones')", '33%', '20%', '26%', '21%', '6,790'))

write.table(csv3, file = "DATA_607_Project2_3.csv", sep = ",", col.names=F, row.names=F)

(3.2) Read the information from the .CSV file into R.

wide_data3 <- read.csv(paste0("https://raw.githubusercontent.com/jzuniga123/SPS/",
                              "master/DATA%20607/DATA_607_Project2_3.csv"), 
                              stringsAsFactors = F)

(3.3) Use tidyr and dplyr to tidy and transform the data.

long_data3 <- wide_data3 %>% 
  unite(temp, Religious.tradition, Sample.Size) %>%
  gather(Income, Frequency, 2:5) %>%
  separate(temp, into=c("Religion", "Amount"), sep = "_", convert=T) %>% 
  mutate(Frequency = as.numeric(gsub("%", "", Frequency)) / 100,
         Amount = round(as.integer(sub(",", "", Amount)) * Frequency, 0),
         Income = gsub("L.+n\\.{2}", "\\<\\$", Income),
         Income = gsub("\\.o.+e", "\\+", Income),
         Income = gsub("\\.{2}", "-\\$", Income),
         Income = gsub("X\\.", "\\$", Income),
         Income = gsub("\\.", ",", Income)) %>% 
  group_by(Religion) %>% 
  mutate(Income_Level = seq_along(Income)) %>%  # To sort ggplot x-axis
  group_by(Income) %>% 
  mutate(Freq_Sum = cumsum(Frequency)) %>% 
  ungroup() %>% select(1,5,3,2,4,6); head(long_data3)

## # A tibble: 6 × 6
##                        Religion Income_Level   Income Amount Frequency
##                           <chr>        <int>    <chr>  <dbl>     <dbl>
## 1                      Buddhist            1 <$30,000     84      0.36
## 2                      Catholic            1 <$30,000   2209      0.36
## 3        Evangelical Protestant            1 <$30,000   2612      0.35
## 4                         Hindu            1 <$30,000     29      0.17
## 5 Historically Black Protestant            1 <$30,000    903      0.53
## 6             Jehovah's Witness            1 <$30,000    100      0.48
## # ... with 1 more variables: Freq_Sum <dbl>

(3.4) Perform the analysis requested by the contributor.

Since Data given in terms of frequency, there are some unavoidable discrepancies due to rounding.

long_data3 %>%
  group_by(Religion) %>%
  summarise(n = sum(Amount))

## # A tibble: 12 × 2
##                            Religion     n
##                               <chr> <dbl>
## 1                          Buddhist   231
## 2                          Catholic  6137
## 3            Evangelical Protestant  7388
## 4                             Hindu   171
## 5     Historically Black Protestant  1704
## 6                 Jehovah's Witness   206
## 7                            Jewish   701
## 8               Mainline Protestant  5208
## 9                            Mormon   594
## 10                           Muslim   205
## 11               Orthodox Christian   155
## 12 Unaffiliated (religious 'nones')  6790

long_data3 %>%
  group_by(Income_Level, Income) %>%
  summarise(Income_Sum = sum(Amount))

## Source: local data frame [4 x 3]
## Groups: Income_Level [?]
## 
##   Income_Level          Income Income_Sum
##          <int>           <chr>      <dbl>
## 1            1        <$30,000      10059
## 2            2 $30,000-$49,999       5985
## 3            3 $50,000-$99,999       7858
## 4            4       $100,000+       5588

ggplot(long_data3, aes(x=paste(Income_Level, "\n", Income), y=Frequency, fill=Religion)) + 
  geom_bar(width = 0.9, stat = "identity") +
  geom_text(aes(label = round(Frequency,2), y=Freq_Sum-Frequency/2), size = 3) +
  ggtitle("Income Level by Religion")  + 
  xlab("Income Levels") + ylab("Cumulative Sum of Member Frequency")

chi_table3 <- long_data3 %>%
  select(1,3,4) %>%
  spread(Income, Amount) %>%
  select(2:5)
chisq.test(chi_table3)

## 
##  Pearson's Chi-squared test
## 
## data:  chi_table3
## X-squared = 1046.7, df = 33, p-value < 2.2e-16

(3.5) Conclusions.

Assuming the data were gathered through simple random sampling, set the null hypothesis to independence between Religion and Income, and the alternate hypothesis to a relationship between Religion and Income. Failure to reject the null hypothesis will be assessed on a $p$-value of $0.01$ such that $H_{ 0 }: p\le 0.05; H_{ A }: p>0.05$. Given two categorical variables from a single population, the Chi-Square test for independence is applied to determine whether there is a significant association between the two variables.ⁱ The test results indicate the existence of a $p$-value $=1.444359\times 10^{-198}<0.01$ indicating that the probability of observing a sample statistic as extreme as the test statistic is extremely low; therefore the null hypothesis is rejected.ⁱⁱ The data supports the conclusion that there is a relationship between Religion and Income.

References

ⁱ http://www.ats.ucla.edu/stat/mult_pkg/whatstat/default.htm

ⁱⁱ http://stattrek.com/chi-square-test/independence.aspx?Tutorial=AP

Project - Preparing Datasets for Analysis

Jose Zuniga