TITLE by Joe Foley

```{r echo=FALSE, message=FALSE, warning=FALSE, packages}
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
library(lattice)
library(MASS)
library(dplyr)
library(gridExtra)
library(RColorBrewer)
library(tidyr)
library(data.table)
library(knitr)
```



We will primarily be using the Udacity-provided 2016 Federal Election Campaign contribution data set. The set reports each contributor's name, address, and employer, along with how much they gave and to which candidate. We will augment this data with zip code population figures obtained from the internet.

We are interested in levels of enthusiasm, so we will divide the number of donors per zip code by the population of that zip code to create a measure of enthusiasm, which we will call Participation.

# Load the Data

Here are the links to the two data sets used.

NYS2016: "https://drive.google.com/open?id=1ImsK3wdIeU1J#WVyc8-rwjgnTRgLl9dO27-5e6CBWD1A"

Zip_Pop: "https://drive.google.com/file/d/0B-ygjCols7cAU1#RjNHhsM3FNcDA/view?usp=sharing"
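The loading step itself is not shown in this report. A minimal sketch, assuming both files were downloaded from the links above and saved locally as CSV under the hypothetical names `NYS2016.csv` and `Zip_Pop.csv` (neither the file names nor the CSV format is confirmed here; `read_if_present` is an illustrative helper):

```r
# Hypothetical helper: read a CSV if it exists, otherwise report the
# missing file and return NULL.
read_if_present <- function(path) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(NULL)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Assumed local file names; adjust to wherever the downloads were saved.
NY2016   <- read_if_present("NYS2016.csv")
Zip_Pop2 <- read_if_present("Zip_Pop.csv")
```

Returning NULL for a missing file makes it easy to check up front that both inputs are available before any wrangling starts.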

Univariate Plots Section


```{r echo=FALSE, Univariate_Plots}
dim(NY2016)
# 649460     18
```

```{r}
dim(Zip_Pop2)
```

Before doing any univariate plotting, we have some data wrangling to do.

First of all, let's set the scipen option so that we do not get any scientific notation in our plots.

```{r}
options(scipen = 1000000)
```

NOTE: Before uploading the files, I changed the name of the column containing the zip codes to Zip in both files.

NY2016 is a big file (105 MB), so we will take a sample of 10,000 rows to speed things along.

```{r}
NY_Samp <- NY2016[sample(nrow(NY2016), 10000), ]
```
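Because `sample()` draws different rows on every run, the numbers in the report can change each time it is knit. A minimal sketch of a reproducible draw, using a toy data frame in place of NY2016 and an arbitrary seed (the helper `sample_rows` and the seed value are illustrative assumptions):

```r
# Toy stand-in for NY2016; values are made up.
toy <- data.frame(id = 1:100, amt = seq(5, 500, length.out = 100))

# Draw n rows without replacement after fixing the random seed.
sample_rows <- function(df, n, seed = 2016) {
  set.seed(seed)
  df[sample(nrow(df), n), ]
}

first  <- sample_rows(toy, 10)
second <- sample_rows(toy, 10)
identical(first, second)  # TRUE: same seed, same rows
```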

The zip codes in NY2016 are 9 digits long, so we will shorten them to 5 digits.

```{r}
# Keep the first five characters of each zip code.
NY_Samp$Zip <- substr(NY_Samp$Zip, 1, 5)
```

The Zip column is character, so we will convert it to numeric.

```{r}
NY_Samp$Zip <- as.numeric(as.character(NY_Samp$Zip))
```
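One side effect worth knowing: converting zip codes to numeric drops any leading zeros. A small sketch with made-up zip strings walks through the truncation and conversion steps, and shows how `sprintf()` could restore the 5-digit form for display if that is ever needed:

```r
# Made-up 9-digit zip strings, for illustration only.
zip9 <- c("100011234", "112015678", "005011111")

zip5     <- substr(zip9, 1, 5)                    # "10001" "11201" "00501"
zip_num  <- as.numeric(zip5)                      # 10001 11201 501
zip_back <- sprintf("%05d", as.integer(zip_num))  # "10001" "11201" "00501"
```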

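We will also convert the contributor's name, the contributor's occupation, and the candidate's name to factors.

```{r}
# Convert on NY_Samp: NYS is not created until later in the report.
NY_Samp$contbr_nm <- as.factor(NY_Samp$contbr_nm)
NY_Samp$cand_nm <- as.factor(NY_Samp$cand_nm)
NY_Samp$contbr_occupation <- as.factor(NY_Samp$contbr_occupation)
```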

Let's add population data by zip to the government election data.

```{r}
NY_Samp.with.Population <- NY_Samp %>%
  left_join(Zip_Pop2, by = "Zip")
```
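A left join keeps every contribution row, so any zip missing from the population table ends up with an NA population. A minimal sketch of how that could be checked, using made-up stand-ins for the two tables (the names `donations` and `population` are illustrative):

```r
library(dplyr)

# Toy stand-ins for NY_Samp and Zip_Pop2; all values are made up.
donations  <- data.frame(Zip = c(10001, 10001, 11201, 99999),
                         contb_receipt_amt = c(25, 50, 100, 10))
population <- data.frame(Zip = c(10001, 11201),
                         Population = c(20000, 50000))

joined <- left_join(donations, population, by = "Zip")

# Rows whose zip had no match in the population table get NA.
n_unmatched <- sum(is.na(joined$Population))

# anti_join lists the contribution rows that found no population match.
unmatched <- anti_join(donations, population, by = "Zip")
```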

Now let's add a new variable called freq, which is the number of campaign donors per zip.

```{r}
# freq counts the sampled rows (contributions) in each zip code.
freq <- NY_Samp.with.Population %>%
  group_by(Zip) %>%
  summarise(value = sum(Zip), freq = n())
```
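Note that `n()` counts rows, so a donor who gave several times is counted several times. A sketch on made-up data contrasting that row count with `n_distinct()` on the contributor name (treating the name as a donor identifier is an assumption, since two people can share a name):

```r
library(dplyr)

# Made-up contributions: one donor in zip 10001 gave twice.
toy <- data.frame(
  Zip       = c(10001, 10001, 10001, 11201),
  contbr_nm = c("DOE, JANE", "DOE, JANE", "ROE, RICHARD", "POE, ALEX")
)

per_zip <- toy %>%
  group_by(Zip) %>%
  summarise(freq   = n(),                    # contributions (rows)
            donors = n_distinct(contbr_nm))  # unique contributor names
```

For zip 10001 this gives a freq of 3 but only 2 distinct donors, which is the kind of gap to keep in mind when reading freq as a donor count.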

Now we add the new variable freq to the data set we previously added Population to.

```{r}
freq_NY_Samp <- NY_Samp.with.Population %>%
  left_join(freq, by = "Zip")
```

Next we will create another variable called Participation by dividing the number of donors per zip code by that zip code's population. The result is multiplied by 10,000, so it reads as donors per 10,000 residents.

```{r}
# Inside mutate() the columns can be referenced by name directly.
freq_NY_Samp2 <- mutate(freq_NY_Samp,
                        Participation = (freq / Population) * 10000)
write.csv(freq_NY_Samp, "freq_NY_Samp.csv")
```
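As a worked example with made-up numbers: a zip code with 12 sampled donors and 30,000 residents gets a Participation of 12 / 30000 * 10000 = 4 donors per 10,000 residents. A sketch of the same calculation on toy data, with a guard that is not part of the original code and that leaves zips with a missing or zero population as NA:

```r
library(dplyr)

# Toy data; the last two rows have no usable population figure.
toy <- data.frame(Zip        = c(10001, 11201, 99999, 12345),
                  freq       = c(12, 5, 3, 2),
                  Population = c(30000, 50000, NA, 0))

toy <- toy %>%
  mutate(Participation = ifelse(is.na(Population) | Population == 0,
                                NA_real_,
                                freq / Population * 10000))
```

Without the guard, a zero population would produce `Inf`, which would then dominate any plot or `range()` call on Participation.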

By this point we have taken our original data set, which includes, among other measurements, donors' names and which candidate they gave to, and added the population of each zip code. Then we divided the number of donors per zip code by the population of that zip code to create a new variable called Participation.

```{r}
dim(freq_NY_Samp2)
```

Let's drop all irrelevant columns so we can plot pairs.

```{r}
NYS <- freq_NY_Samp2[, c("cand_nm", "contbr_nm", "contbr_city", "Zip",
                         "contbr_occupation", "contb_receipt_amt",
                         "Population", "freq", "value", "Participation")]
```

```{r}
dim(NYS)
# 10000    10
```

```{r}
str(NYS)
```

So after all of that, which zips had the highest participation rates?

```{r}
# Bar chart of the number of sampled contributions per zip code.
Participation.Rate <- ggplot(NYS, aes(x = Zip)) +
  geom_bar() +
  ggtitle('Participation Rate')
Participation.Rate
```
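A bar chart over many zips can be hard to read for this particular question. One alternative, sketched here on made-up data with the same column names as NYS, is to keep one row per zip and sort by Participation:

```r
library(dplyr)

# Toy data: zip 10001 appears twice, as it would with two contributions.
toy <- data.frame(Zip           = c(10001, 10001, 11201, 10583),
                  Participation = c(4, 4, 1, 9))

top_zips <- toy %>%
  distinct(Zip, Participation) %>%   # one row per zip
  arrange(desc(Participation)) %>%   # highest participation first
  head(3)
```

The `head(3)` cutoff is arbitrary; any number of top zips could be shown.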
 
 
 
 
 
Some zips are so overrepresented that they obscure the others. Let us use a log scale on the y-axis to balance things out for the eye.
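A minimal sketch of a log-scaled y-axis, using made-up counts for three zips rather than the real NYS data (the zip labels and counts are illustrative assumptions):

```r
library(ggplot2)

# Made-up data: one zip dominates, one is moderately sized, one is tiny.
toy <- data.frame(
  Zip = factor(c(rep("10001", 500), rep("11201", 40), rep("10583", 3)))
)

log_plot <- ggplot(toy, aes(x = Zip)) +
  geom_bar() +
  scale_y_log10() +   # compress the tall bars so small ones stay visible
  ggtitle("Number of contributors per Zip (log scale)")
log_plot
```

One caveat with this approach: on a log axis the bars are anchored at 1 rather than 0, so a zip with a single contribution would show no visible bar.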

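```{r}
Zip.Count <- ggplot(data = NYS, aes(x = Zip)) +
  geom_bar(stat = "count") +
  ggtitle("Number of contributors per Zip")
Zip.Count
```

```{r}
# Every row of NYS repeats its zip's Participation, so keep one row per zip
# to stop the bars from stacking the same value once per contribution.
Part.Zip <- ggplot(data = distinct(NYS, Zip, Participation),
                   aes(x = Zip, y = Participation)) +
  geom_bar(stat = "identity") +
  ggtitle("Participation Rate by Zip")
Part.Zip
```

```{r}
range(NYS$Participation)
```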


Univariate Analysis


What is the structure of your dataset?

What is/are the main feature(s) of interest in your dataset?

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Did you create any new variables from existing variables in the dataset?

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Bivariate Plots Section


Bivariate Analysis


Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

What was the strongest relationship you found?

Multivariate Plots Section


Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Were there any interesting or surprising interactions between features?

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Final Plots and Summary


Plot One

Description One

Plot Two

Description Two

Plot Three

Description Three

Reflection
