In a recent mythical poll in Scotland, voters were asked if they preferred Cullen skink over Partan bree. Here are the results of the poll, with results broken down by city and by age group.

  1. Write down 3 questions that you might want to answer based on this data.
  2. Create an R data frame with 2 observations to store this data in its current “messy” state. Use whatever method you want to re-create and/or load the data.
  3. Use the functionality in the tidyr package to convert the data frame to be “tidy data.”
  4. Use the functionality in the dplyr package to answer the questions that you asked in step 1.
  5. Having gone through the process, would you ask different questions and/or change the way that you structured your data frame?

3 Questions: 1. Percentage of responses by group 2. Total Yes/No responses with percentage of total 3. Percent responses for each location

Read data

require(tidyr)
## Loading required package: tidyr
## Warning: package 'tidyr' was built under R version 3.1.3
require(dplyr)
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.1.3
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
data <- read.csv("c:/Users/Public/data.csv", sep = ",", header = F, skip = 2, stringsAsFactor = F) #read csv, skip top two rows
colnames(data) <- c("Response", "Edinburgh.1624", "Edinburgh.25", "Glasgow.1624", "Glasgow.25") #set colnames
print(data)
##   Response Edinburgh.1624 Edinburgh.25 Glasgow.1624 Glasgow.25
## 1      Yes          80100       143000        99400     150400
## 2       No          35900       214800        43000     207000

Tidy data

data2 <- data %>% gather(Location.Age, Frequency, 2:5, convert = F)
data3 <- data2 %>% separate(Location.Age, into = c("Location", "Age"))
data3[data3=="1624"] <- "16-24"
data3[data3=="25"] <- "25+"
data4 <- data3[c(2, 3, 1, 4)]
print(data4)
##    Location   Age Response Frequency
## 1 Edinburgh 16-24      Yes     80100
## 2 Edinburgh 16-24       No     35900
## 3 Edinburgh   25+      Yes    143000
## 4 Edinburgh   25+       No    214800
## 5   Glasgow 16-24      Yes     99400
## 6   Glasgow 16-24       No     43000
## 7   Glasgow   25+      Yes    150400
## 8   Glasgow   25+       No    207000

Answer Questions

  1. Percentage of responses by group
data4 %>%
  mutate("% Total Response" = round((Frequency / sum(Frequency)*100), 2))
##    Location   Age Response Frequency % Total Response
## 1 Edinburgh 16-24      Yes     80100             8.23
## 2 Edinburgh 16-24       No     35900             3.69
## 3 Edinburgh   25+      Yes    143000            14.69
## 4 Edinburgh   25+       No    214800            22.06
## 5   Glasgow 16-24      Yes     99400            10.21
## 6   Glasgow 16-24       No     43000             4.42
## 7   Glasgow   25+      Yes    150400            15.45
## 8   Glasgow   25+       No    207000            21.26
  1. Total Yes/No responses with percentage of total
data4 %>%
  group_by(Response) %>%
  summarise(Frequency = sum(Frequency)) %>%
  mutate("% Total Response" = round((Frequency / sum(Frequency)*100), 2))
## Source: local data frame [2 x 3]
## 
##   Response Frequency % Total Response
## 1       No    500700            51.43
## 2      Yes    472900            48.57
  1. Percent responses for each location
data4 %>%
  group_by(Location, Response) %>%
  summarise(Frequency = sum(Frequency)) %>%
  mutate("% Total Response" = round((Frequency / sum(Frequency)*100), 2))
## Source: local data frame [4 x 4]
## Groups: Location
## 
##    Location Response Frequency % Total Response
## 1 Edinburgh       No    250700            52.91
## 2 Edinburgh      Yes    223100            47.09
## 3   Glasgow       No    250000            50.02
## 4   Glasgow      Yes    249800            49.98

Additional Question

Calculate Yes/No responses for each age group (easier after tidying data)

data4 %>%
  group_by(Age, Response) %>%
  summarise(Frequency = sum(Frequency)) %>%
  mutate("% Total Response" = round((Frequency / sum(Frequency)*100), 2))
## Source: local data frame [4 x 4]
## Groups: Age
## 
##     Age Response Frequency % Total Response
## 1 16-24       No     78900            30.53
## 2 16-24      Yes    179500            69.47
## 3   25+       No    421800            58.98
## 4   25+      Yes    293400            41.02