```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
library(lattice)
library(MASS)
library(dplyr)
library(gridExtra)
library(RColorBrewer)
library(tidyr)
library(data.table)
library(knitr)
```

We will primarily be using the Udacity-provided 2016 Federal Election Campaign
contribution data set. It reports each contributor's name, address, and
employer, along with how much they gave and to which candidate. We will augment
this data with zip code population figures obtained from the internet.

We are interested in levels of enthusiasm, so we will divide the number of
donors in each zip code by that zip code's population to create a measure of
enthusiasm, which we will call Participation.

```{r echo=FALSE, Load_the_Data}
# Load the Data
```

Here are the links to the two data sets used.

```{r echo=FALSE, Univariate_Plots}
dim(NY2016)
# 649460 18
```

```{r}
dim(Zip_Pop2)
```

Before doing any univariate plotting, we have some data wrangling to do.

First of all, let's set the scipen option so that we do not get any scientific
notation in our plots.

```{r}
options(scipen = 1000000)
```

NOTE: Before uploading the files, I changed the name of the column containing
the zip codes to Zip in both files.

NY2016 is a big file (105 MB), so we will take a random sample of 10,000 rows
to speed things along.

```{r}
NY_Samp <- NY2016[sample(nrow(NY2016), 10000), ]
```
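
Because `sample()` draws different rows on every run, the figures reported later can shift from one knit to the next. The sketch below shows one way to make the draw repeatable. It uses a small toy data frame in place of NY2016, and the seed value is an arbitrary assumption, not something taken from the original analysis.

```r
# Toy stand-in for NY2016 (hypothetical data, not the real file).
toy <- data.frame(id = 1:100, amt = seq(5, 500, by = 5))

# Fixing the seed before sampling makes the random draw repeatable.
set.seed(2016)  # arbitrary seed value (an assumption)
samp_a <- toy[sample(nrow(toy), 10), ]

set.seed(2016)
samp_b <- toy[sample(nrow(toy), 10), ]

# The two samples contain exactly the same rows.
identical(samp_a, samp_b)
```

Calling `set.seed()` once, just before the sampling chunk, would be enough in the report itself.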

The zip codes in NY2016 are 9 digits long, so we will shorten them to the first
5 digits.

```{r}
# substr() positions start at 1 in R
NY_Samp$Zip <- substr(NY_Samp$Zip, 1, 5)
```

The Zip column is character, so we will convert it to numeric.

```{r}
NY_Samp$Zip <- as.numeric(as.character(NY_Samp$Zip))
```
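
To make the two zip code steps concrete, here is a small sketch on made-up values. The strings in `zip_raw` are hypothetical examples in the style of the raw column, not rows from the data set.

```r
# Hypothetical ZIP+4 style strings (assumed format of the raw Zip column).
zip_raw <- c("100011234", "112015678", "10598")

# Keep the first five characters, then convert to numeric.
zip5 <- substr(zip_raw, 1, 5)
zip_num <- as.numeric(zip5)

zip5     # "10001" "11201" "10598"
zip_num  # 10001 11201 10598

# Caveat: a zip code with a leading zero loses it when converted to numeric.
as.numeric("00501")  # 501
```

If leading zeros matter for the join, keeping Zip as character in both files is an alternative worth considering.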
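
We will also convert the contributor's name, the contributor's occupation, and
the candidate's name to factors.

```{r}
# Convert on NY_Samp; NYS is only created further down
NY_Samp$contbr_nm <- as.factor(NY_Samp$contbr_nm)
NY_Samp$cand_nm <- as.factor(NY_Samp$cand_nm)
```

```{r}
NY_Samp$contbr_occupation <- as.factor(NY_Samp$contbr_occupation)
```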

Let's add the population data by zip code to the government election data.

```{r}
NY_Samp.with.Population <- NY_Samp %>%
  left_join(Zip_Pop2, by = "Zip")
```
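
A left join keeps every donor row, even when the population file has no entry for that zip code. The sketch below illustrates this on toy data; the data frames `donors` and `zip_pop` and all their values are made up for illustration.

```r
library(dplyr)

# Hypothetical donor rows and zip code populations.
donors <- data.frame(Zip = c(10001, 10001, 11201, 99999),
                     contb_receipt_amt = c(50, 25, 100, 10))
zip_pop <- data.frame(Zip = c(10001, 11201),
                      Population = c(21000, 51000))

# Every donor row survives; unmatched zips get NA for Population.
joined <- left_join(donors, zip_pop, by = "Zip")

nrow(joined)                   # 4
sum(is.na(joined$Population))  # 1
```

Counting the NA values after the join is a quick way to see how many rows will have no Participation value later on.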
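
Now let's add a new variable called freq that is the number of campaign donors
per zip code.

```{r}
# freq  = number of sampled rows in each Zip
# value = sum of the numeric Zip values in each group
freq <- NY_Samp.with.Population %>%
  group_by(Zip) %>%
  summarise(value = sum(Zip), freq = n())
```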

Now we add the new variable freq to the data set to which we previously added
Population.

```{r}
freq_NY_Samp <- NY_Samp.with.Population %>%
  left_join(freq, by = "Zip")
```
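
The count-then-join pattern above can be checked on toy data. The sketch below also shows `dplyr::add_count()`, which attaches the per-group count in a single step; this is offered as an alternative, not as the method used in this report. The `donors` data frame is hypothetical.

```r
library(dplyr)

# Hypothetical donor rows.
donors <- data.frame(Zip = c(10001, 10001, 11201),
                     contbr_nm = c("A", "B", "C"))

# Two-step version: count rows per Zip, then join the counts back.
freq <- donors %>%
  group_by(Zip) %>%
  summarise(freq = n())
two_step <- left_join(donors, freq, by = "Zip")

# One-step version: add_count() appends the per-Zip count directly.
one_step <- add_count(donors, Zip, name = "freq")

two_step$freq                         # 2 2 1
all(two_step$freq == one_step$freq)   # TRUE
```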

Next we will create another variable called Participation by dividing the
number of donors per zip code by that zip code's population and multiplying by
10,000, which gives donors per 10,000 residents.

```{r}
freq_NY_Samp2 <- mutate(freq_NY_Samp,
                        Participation = (freq / Population) * 10000)
write.csv(freq_NY_Samp, "freq_NY_Samp.csv")
```
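
As a worked example of the Participation formula, the sketch below computes donors per 10,000 residents for three made-up zip codes and ranks them. The values in `zips` are hypothetical and chosen only to keep the arithmetic easy to follow.

```r
library(dplyr)

# Hypothetical per-zip summary: donor count and population.
zips <- data.frame(Zip = c(10001, 11201, 10598),
                   freq = c(12, 3, 5),
                   Population = c(20000, 60000, 25000))

# Participation = donors per 10,000 residents, highest first.
ranked <- zips %>%
  mutate(Participation = freq / Population * 10000) %>%
  arrange(desc(Participation))

ranked  # Participation is roughly 6, 2 and 0.5
```

In the full data frame each zip code appears once per donor row, so a ranking like this would need the rows reduced to one per zip code first, for example with `distinct()`.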

By this point we have taken our original data set, which includes among other
measurements the donors' names and the candidate they gave to, and added the
population of each zip code. Then we divided the number of donors per zip code
by the population of that zip code to create a new variable called
Participation.

```{r}
dim(freq_NY_Samp2)
```

Let's drop all irrelevant columns so we can plot pairs.

```{r}
NYS <- freq_NY_Samp2[, c("cand_nm", "contbr_nm", "contbr_city", "Zip",
                         "contbr_occupation", "contb_receipt_amt",
                         "Population", "freq", "value", "Participation")]
```

```{r}
dim(NYS)
# 10000 10
```

```{r}
str(NYS)
```

So after all of that, which zip codes had the highest participation rates?

```{r}
# Bar chart of the number of sampled rows in each Zip
Participation.Rate <- ggplot(data = NYS, aes(x = Zip)) +
  geom_bar() +
  ggtitle('Participation Rate')
Participation.Rate
```

Some zip codes are so overrepresented that they obscure the others. Let us use
a log scale on the y axis to balance things out for the eye.

```{r}
Zip.Count <- ggplot(data = NYS, aes(x = Zip)) +
  geom_bar(stat = "count") +
  ggtitle("Number of contributors per Zip")
Zip.Count
```

```{r}
Part.Zip <- ggplot(data = NYS, aes(x = Zip, y = Participation)) +
  geom_bar(stat = "identity") +
  ggtitle("Participation Rate by Zip")
Part.Zip
```

```{r}
range(NYS$Participation)
```
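
The log-scale idea mentioned above can be sketched with `scale_y_log10()` from ggplot2. The example uses a toy data frame with deliberately lopsided counts, and it treats Zip as a factor so each zip code gets its own bar; both choices are assumptions made for the illustration, not part of the original plots.

```r
library(ggplot2)

# Hypothetical donor rows with very uneven counts per zip code.
toy <- data.frame(
  Zip = factor(c(rep("10001", 500), rep("11201", 40), rep("10598", 3)))
)

# A log10 y axis keeps the small zip codes visible next to the large one.
p <- ggplot(toy, aes(x = Zip)) +
  geom_bar() +
  scale_y_log10() +
  ggtitle("Number of contributors per Zip (log scale)")
p
```

On a log axis the bars start at 1 rather than 0, so a zip code with a single donor would show a bar of zero height.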