Questions and answers

Question 1 Write down 3 questions that you might want to answer based on this data.

Answer I would like to know the following information from the given data:

Get the percentage of votes (yes/no) by city, irrespective of the age groups. We would like to see how each city’s population has voted.
Get the percentage of yes/no, by age group, irrespective of the city.
Get the percentage of yes/no by age group and by city.

NOTE A “Yes” vote means that the population prefers Cullen skink, and a “No” vote means that the population prefers Partan bree.

Question 2 Create an R data frame with 2 observations to store this data in its current “messy” state. Use whatever method you want to re-create and/or load the data.

Answer

Edinburgh.Yes <- c(80100,143000)
Edinburgh.No <- c(35900,214800)
Glasgow.Yes <- c(99400,150400)
Glasgow.No <- c(43000,207000)

polls <- data.frame(Edinburgh.Yes, Edinburgh.No, Glasgow.Yes, Glasgow.No)
rownames(polls) <- c('16-24','25+')

polls

##       Edinburgh.Yes Edinburgh.No Glasgow.Yes Glasgow.No
## 16-24         80100        35900       99400      43000
## 25+          143000       214800      150400     207000

Figure 1: Data Frame
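
Since the question allows any method of re-creating or loading the data, the same messy layout could also be read from inline text. The following is only a sketch: the names `raw` and `polls_alt` are hypothetical and not part of the original answer, and the `Age_Group` header is used only to label the column that becomes the row names.

```r
# Hypothetical alternative: read the same counts from an inline text block.
raw <- "Age_Group Edinburgh.Yes Edinburgh.No Glasgow.Yes Glasgow.No
16-24 80100 35900 99400 43000
25+ 143000 214800 150400 207000"

# row.names = 1 turns the first column into row names,
# reproducing the messy layout shown in Figure 1.
polls_alt <- read.table(text = raw, header = TRUE, row.names = 1)
polls_alt
```

This keeps the age groups as row names, so the tidying steps that follow would apply to `polls_alt` unchanged.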

Question 3: Use the functionality in the tidyr package to convert the data frame to be “tidy data.”

Answer The data frame shown in Figure 1 is a messy data set, for the following reasons:
1. The column names Edinburgh.Yes, Edinburgh.No, Glasgow.Yes and Glasgow.No are not variables. Each one encodes the values of two variables (the city and the opinion).
2. The row names “16-24” and “25+” are also not variables. They are values of a variable (the age group).

Let us first fix the use of values as row names (the 2nd issue listed above). To the “polls” data set, let us add a new variable called “Age_Group” holding the two values 16-24 and 25+, and delete the row names.

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

polls <- mutate(polls,Age_Group=rownames(polls))
rownames(polls) <- NULL

polls

##   Edinburgh.Yes Edinburgh.No Glasgow.Yes Glasgow.No Age_Group
## 1         80100        35900       99400      43000     16-24
## 2        143000       214800      150400     207000       25+

The above display of the polls data frame shows that there are no row names, and that a new variable called Age_Group has been added with the values 16-24 and 25+. But the polls data frame is still not tidy, since it has four columns (Edinburgh.Yes, Edinburgh.No, Glasgow.Yes, Glasgow.No) which are not variables. Let us convert these columns to observations (melting the data set):

polls <- gather(polls,City,Votes,-Age_Group)
polls

##   Age_Group          City  Votes
## 1     16-24 Edinburgh.Yes  80100
## 2       25+ Edinburgh.Yes 143000
## 3     16-24  Edinburgh.No  35900
## 4       25+  Edinburgh.No 214800
## 5     16-24   Glasgow.Yes  99400
## 6       25+   Glasgow.Yes 150400
## 7     16-24    Glasgow.No  43000
## 8       25+    Glasgow.No 207000

Figure 2: Data Frame, after converting the column names into observations

The above display of the polls data frame is still not tidy, since the city name is concatenated with the opinion (Yes/No). We have to separate the data under the City column into 2 columns: City and Opinion. The code is shown below:

  polls <- separate(polls,City,into=c("City","Opinion"), sep = "\\.")
  polls

##   Age_Group      City Opinion  Votes
## 1     16-24 Edinburgh     Yes  80100
## 2       25+ Edinburgh     Yes 143000
## 3     16-24 Edinburgh      No  35900
## 4       25+ Edinburgh      No 214800
## 5     16-24   Glasgow     Yes  99400
## 6       25+   Glasgow     Yes 150400
## 7     16-24   Glasgow      No  43000
## 8       25+   Glasgow      No 207000
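
The gather() and separate() steps above can also be expressed as a single reshaping call. The sketch below assumes a newer tidyr (version 1.0.0 or later) that provides pivot_longer(). It is not what the original answer used, and the names `polls_wide` and `polls_long` are hypothetical. Note that pivot_longer() returns a tibble and orders the rows by age group rather than by column, so the row order differs from the display above.

```r
library(tidyr)

# Messy data after the row names were moved into Age_Group.
polls_wide <- data.frame(
  Age_Group     = c("16-24", "25+"),
  Edinburgh.Yes = c(80100, 143000),
  Edinburgh.No  = c(35900, 214800),
  Glasgow.Yes   = c(99400, 150400),
  Glasgow.No    = c(43000, 207000)
)

# Melt every column except Age_Group and split each column name
# at the dot into City and Opinion in one step.
polls_long <- pivot_longer(
  polls_wide,
  cols      = -Age_Group,
  names_to  = c("City", "Opinion"),
  names_sep = "\\.",
  values_to = "Votes"
)
polls_long
```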

Question 4: Use the functionality in the dplyr package to answer the questions that you asked in step 1.

Answer Get the percentage of votes (yes/no) by city, irrespective of the age groups. We would like to see how each city’s population has voted.

s_Edinburgh <- sum(polls[polls$City=="Edinburgh",]$Votes)
s_Glasgow <- sum(polls[polls$City=="Glasgow",]$Votes)

polls_by_city <- summarise(group_by(polls,City, Opinion),Votes=(sum(Votes)))

polls_by_city$Votes[polls_by_city$City=="Edinburgh"] <- (polls_by_city$Votes[polls_by_city$City=="Edinburgh"] / s_Edinburgh)*100
polls_by_city$Votes[polls_by_city$City=="Glasgow"] <- (polls_by_city$Votes[polls_by_city$City=="Glasgow"] / s_Glasgow)*100

polls_by_city

## Source: local data frame [4 x 3]
## Groups: City
## 
##        City Opinion    Votes
## 1 Edinburgh      No 52.91262
## 2 Edinburgh     Yes 47.08738
## 3   Glasgow      No 50.02001
## 4   Glasgow     Yes 49.97999

# Draw the two pie charts side by side
par(mfrow=c(1,2))

# Rows of polls_by_city are ordered No, Yes within each city
slices <- polls_by_city$Votes[polls_by_city$City=="Edinburgh"]
lbls <- c("No", "Yes")
lbls <- paste(lbls, round(slices))  # add percents to labels
lbls <- paste(lbls,"%",sep="")      # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
    main="Pie Chart of votes in Edinburgh")

slices <- polls_by_city$Votes[polls_by_city$City=="Glasgow"]
lbls <- c("No", "Yes")
lbls <- paste(lbls, round(slices))  # add percents to labels
lbls <- paste(lbls,"%",sep="")      # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
    main="Pie Chart of votes in Glasgow")

Conclusion-1: Approximately 53% of Edinburgh’s population voted “No”, while 47% voted “Yes”.

In Glasgow, approximately 50% of the population voted “Yes” and 50% voted “No”.

In Edinburgh, the majority of the people (53%) prefer Partan bree, since “No” is the majority choice. In Glasgow, the votes are divided almost equally (about 50% each) between Partan bree and Cullen skink.
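
The per-city percentages could also be computed without subsetting each city by hand, by grouping on City and dividing by the group total. This is a minimal sketch rather than the method used above: the tidy `polls` data frame is rebuilt so the block runs on its own, and `city_pct` and `Percent` are hypothetical names.

```r
library(dplyr)

# Tidy data, rebuilt here so the example is self-contained.
polls <- data.frame(
  Age_Group = rep(c("16-24", "25+"), times = 4),
  City      = rep(c("Edinburgh", "Glasgow"), each = 4),
  Opinion   = rep(rep(c("Yes", "No"), each = 2), times = 2),
  Votes     = c(80100, 143000, 35900, 214800, 99400, 150400, 43000, 207000)
)

city_pct <- polls %>%
  group_by(City, Opinion) %>%
  summarise(Votes = sum(Votes)) %>%           # total votes per city and opinion
  group_by(City) %>%
  mutate(Percent = 100 * Votes / sum(Votes)) %>%  # share within each city
  ungroup()

city_pct
```

Keeping the counts in Votes and the shares in a separate Percent column avoids overwriting the raw data, and the same code works for any number of cities.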

Get the percentage of yes/no, by age group, irrespective of the city.

s_16_24_age <- sum(polls[polls$Age_Group=="16-24",]$Votes)
s_25_age <- sum(polls[polls$Age_Group=="25+",]$Votes)

polls_by_age_group <- summarise(group_by(polls,Age_Group, Opinion),Votes=(sum(Votes)))

polls_by_age_group$Votes[polls_by_age_group$Age_Group == "16-24"] <- (polls_by_age_group$Votes[polls_by_age_group$Age_Group == "16-24"] / s_16_24_age)*100
polls_by_age_group$Votes[polls_by_age_group$Age_Group == "25+"] <- (polls_by_age_group$Votes[polls_by_age_group$Age_Group == "25+"] / s_25_age)*100
  
polls_by_age_group

## Source: local data frame [4 x 3]
## Groups: Age_Group
## 
##   Age_Group Opinion    Votes
## 1     16-24      No 30.53406
## 2     16-24     Yes 69.46594
## 3       25+      No 58.97651
## 4       25+     Yes 41.02349

# Draw the two pie charts side by side
par(mfrow=c(1,2))

# Rows of polls_by_age_group are ordered No, Yes within each age group
slices <- polls_by_age_group$Votes[polls_by_age_group$Age_Group == "16-24"]
lbls <- c("No", "Yes")
lbls <- paste(lbls, round(slices))  # add percents to labels
lbls <- paste(lbls,"%",sep="")      # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
    main="16-24 age group support %")

slices <- polls_by_age_group$Votes[polls_by_age_group$Age_Group == "25+"]
lbls <- c("No", "Yes")
lbls <- paste(lbls, round(slices))  # add percents to labels
lbls <- paste(lbls,"%",sep="")      # add % to labels
pie(slices,labels = lbls, col=rainbow(length(lbls)),
    main="25+ age group support %")

Conclusion-2: Approximately 41% of the 25+ age group voted “Yes”, compared with approximately 69.5% of the 16-24 age group. Hence the majority of the 25+ age group voted “No”, while the majority of the 16-24 age group voted “Yes”.

In other words, 59% of the 25+ age group prefer Partan bree, while the majority (69.5%) of the 16-24 age group prefer Cullen skink. The voting pattern therefore differs by age group.
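
As a cross-check on the age-group percentages, the same figures can be obtained with base R alone, using a contingency table of summed votes and row-wise proportions. This is a sketch under the assumption that the tidy data has the columns shown earlier. The names `age_counts` and `age_pct` are hypothetical.

```r
# Tidy data, rebuilt here so the example is self-contained.
polls <- data.frame(
  Age_Group = rep(c("16-24", "25+"), times = 4),
  City      = rep(c("Edinburgh", "Glasgow"), each = 4),
  Opinion   = rep(rep(c("Yes", "No"), each = 2), times = 2),
  Votes     = c(80100, 143000, 35900, 214800, 99400, 150400, 43000, 207000)
)

# Sum the votes for each Age_Group x Opinion combination (cities pooled).
age_counts <- xtabs(Votes ~ Age_Group + Opinion, data = polls)

# margin = 1 divides each row by its row total, so each age group sums to 100.
age_pct <- prop.table(age_counts, margin = 1) * 100
round(age_pct, 1)
```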

Get the percentage of yes/no by age group and by city.

poll_percents <- polls
poll_percents$Votes[poll_percents$City == "Edinburgh"] <- (poll_percents$Votes[poll_percents$City == "Edinburgh"] / s_Edinburgh) * 100
poll_percents$Votes[poll_percents$City == "Glasgow"] <- (poll_percents$Votes[poll_percents$City == "Glasgow"] / s_Glasgow) * 100
poll_percents

##   Age_Group      City Opinion     Votes
## 1     16-24 Edinburgh     Yes 16.905867
## 2       25+ Edinburgh     Yes 30.181511
## 3     16-24 Edinburgh      No  7.577037
## 4       25+ Edinburgh      No 45.335585
## 5     16-24   Glasgow     Yes 19.887955
## 6       25+   Glasgow     Yes 30.092037
## 7     16-24   Glasgow      No  8.603441
## 8       25+   Glasgow      No 41.416567

NOTE Percentages might not add up to 100, due to rounding of decimals.

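Conclusion-3: The percentages above are shares of each city’s total votes. In both cities, “Yes” voters in the 25+ age group (those who support Cullen skink) account for approximately 30% of all votes cast, while “No” voters in the same age group account for a larger share (about 45% in Edinburgh and 41% in Glasgow), so the majority of the 25+ age group prefers Partan bree. Within the 25+ age group alone, this corresponds to “Yes” support of roughly 40% in Edinburgh and 42% in Glasgow. The distribution of votes is also approximately the same in both cities, in each age group.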
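
The table above expresses each cell as a share of its city’s total votes. If the share of “Yes” and “No” within each city and age group is wanted instead, the denominator can be changed by grouping on both variables. The following is a sketch of that alternative, not part of the original analysis. The names `within_group` and `Percent` are hypothetical.

```r
library(dplyr)

# Tidy data, rebuilt here so the example is self-contained.
polls <- data.frame(
  Age_Group = rep(c("16-24", "25+"), times = 4),
  City      = rep(c("Edinburgh", "Glasgow"), each = 4),
  Opinion   = rep(rep(c("Yes", "No"), each = 2), times = 2),
  Votes     = c(80100, 143000, 35900, 214800, 99400, 150400, 43000, 207000)
)

# Percent is the share of each opinion within one city and one age group,
# so the Yes and No rows of every (City, Age_Group) pair sum to 100.
within_group <- polls %>%
  group_by(City, Age_Group) %>%
  mutate(Percent = 100 * Votes / sum(Votes)) %>%
  ungroup()

within_group
```

Which denominator is appropriate depends on the question: city totals show how much each group contributes to the city’s result, while group totals show how each group itself is split.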

Question 5: Having gone through the process, would you ask different questions and/or change the way that you structured your data frame?

Answer I would like to convert the poll_percents data frame to the following display, using the spread() function of tidyr. Such a display will help us compare the vote percentages side by side.

library(tidyr)
poll_percents$City_Opinion <- paste(poll_percents$City,".",poll_percents$Opinion,sep="")
poll_percents$City <- NULL
poll_percents$Opinion <- NULL
spread(poll_percents,City_Opinion, Votes)

##   Age_Group Edinburgh.No Edinburgh.Yes Glasgow.No Glasgow.Yes
## 1     16-24     7.577037      16.90587   8.603441    19.88796
## 2       25+    45.335585      30.18151  41.416567    30.09204

From the above display, “Yes” voters in the 16-24 age group account for approximately 20% of all votes in Glasgow, compared with approximately 17% in Edinburgh. For the 25+ age group, “Yes” voters make up approximately the same share (30%) of the votes in both cities. This display makes it easy to compare the percentages between the cities.
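
The paste-and-delete steps used to build the City_Opinion column can also be written with tidyr’s unite(), which combines the two columns and drops the originals in one call. This is a sketch only: the percentages are recomputed with ave() so the block runs on its own, and `united` and `wide` are hypothetical names.

```r
library(tidyr)

# Tidy data, rebuilt here so the example is self-contained.
polls <- data.frame(
  Age_Group = rep(c("16-24", "25+"), times = 4),
  City      = rep(c("Edinburgh", "Glasgow"), each = 4),
  Opinion   = rep(rep(c("Yes", "No"), each = 2), times = 2),
  Votes     = c(80100, 143000, 35900, 214800, 99400, 150400, 43000, 207000)
)

# Express each count as a percentage of its city's total votes.
poll_percents <- polls
poll_percents$Votes <- 100 * polls$Votes / ave(polls$Votes, polls$City, FUN = sum)

# Merge City and Opinion into one key column, then spread it into columns.
united <- unite(poll_percents, City_Opinion, City, Opinion, sep = ".")
wide <- spread(united, City_Opinion, Votes)
wide
```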

Week_7_Assignment

Sekhar Mekala

Wednesday, March 18, 2015
