Data Preparation
library(knitr)
sports <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/nfl-fandom/NFL_fandom_data-google_trends.csv",skip=1,header=TRUE, stringsAsFactors = FALSE )
## Data cleanup: remove the % symbol, convert columns to numeric
## as.data.frame() keeps this step in base R (no extra packages needed)
sports2 <- as.data.frame(lapply(sports, gsub, pattern = "%", replacement = ""), stringsAsFactors = FALSE)
sports_names <- colnames(sports2)[-1]
sports2[sports_names] <- lapply(sports2[sports_names], as.numeric)
sports_names
## get the mean of each column (sapply returns a named numeric vector, like col_sd)
team_means <- sapply(sports2[sports_names], mean, na.rm = TRUE)
## get sd of categories
col_sd <- sapply(sports2[sports_names], sd, na.rm = TRUE)
col_sd
Research Question
- How does Google search interest in the seven major sports leagues correlate with Trump’s 2016 vote percentage?
Cases
- Each case is a designated market area (DMA); towns are grouped by the DMA they belong to
- 207 different cases
Data Collection
- “Google Trends data was derived from comparing 5-year search traffic for the 7 sports leagues we analyzed:” (from the GitHub info)
Response
What is the response variable, and what type is it (numerical/categorical)?
- Discrete numerical: Trump 2016 vote percentage
Explanatory
What is the explanatory variable, and what type is it (numerical/categorical)?
- Discrete numerical: percentage of major sports searches that go to each league
- Create categorical groups within this data based on \(\sigma\) (higher, average, and lower sport-specific search share) to test whether Trump’s average 2016 vote differs between the groups
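One way to form the higher/average/lower groups is to standardize a sport's search share and cut the z-scores at fixed \(\sigma\) thresholds. The sketch below is only illustrative: it uses simulated values rather than the real data, and the ±1\(\sigma\) cut points, the column names `nfl` and `trump`, and the choice of a one-way ANOVA are assumptions, not choices stated in this proposal.

```r
set.seed(606)  # reproducible simulated data (not the real data set)
n <- 207
toy <- data.frame(
  nfl   = rnorm(n, mean = 39, sd = 6.4),    # simulated NFL search share (%)
  trump = rnorm(n, mean = 54.5, sd = 12.3)  # simulated Trump vote share (%)
)

# Standardize the search share: z = (x - mean) / sd
z <- (toy$nfl - mean(toy$nfl)) / sd(toy$nfl)

# Assumed cut points at -1 and +1 sigma give three ordered groups
toy$nfl_group <- cut(z,
                     breaks = c(-Inf, -1, 1, Inf),
                     labels = c("lower", "average", "higher"))

# Group sizes, then mean and SD of the response within each group
table(toy$nfl_group)
tapply(toy$trump, toy$nfl_group, mean)
tapply(toy$trump, toy$nfl_group, sd)

# One-way ANOVA: does the mean Trump vote differ across the three groups?
fit <- aov(trump ~ nfl_group, data = toy)
summary(fit)
```

With the real data, `toy$nfl` and `toy$trump` would be replaced by the corresponding columns of `sports2`, and the same steps repeated for each league.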
Relevant summary statistics
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
paste("the means of column",sports_names,"is ",team_means," The sd of each column is ",col_sd)
[1] "the means of column NFL is 39.0966183574879 The sd of each column is 6.43264868397972"
[2] "the means of column NBA is 22.8019323671498 The sd of each column is 5.5436832009321"
[3] "the means of column MLB is 13.5942028985507 The sd of each column is 3.99022655723329"
[4] "the means of column NHL is 5.09178743961353 The sd of each column is 3.6117768793827"
[5] "the means of column NASCAR is 5.3719806763285 The sd of each column is 2.3006515412352"
[6] "the means of column CBB is 4.75845410628019 The sd of each column is 3.7867440642748"
[7] "the means of column CFB is 9.28019323671498 The sd of each column is 5.21840878561084"
[8] "the means of column Trump.2016.Vote. is 54.5292270531401 The sd of each column is 12.2978147234567"
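The same statistics may be easier to read as a table that also reports the sample size for each column. A minimal sketch, using a small made-up data frame (`toy`) in place of `sports2`; the helper name `summarise_cols` is hypothetical.

```r
# Made-up values standing in for sports2 (illustration only)
toy <- data.frame(
  NFL = c(40, 35, 42, NA, 38),
  NBA = c(22, 25, 20, 24, 23)
)

# Hypothetical helper: mean, SD and non-missing count for each column
summarise_cols <- function(df) {
  data.frame(
    column = names(df),
    mean   = unname(sapply(df, mean, na.rm = TRUE)),
    sd     = unname(sapply(df, sd, na.rm = TRUE)),
    n      = unname(sapply(df, function(x) sum(!is.na(x))))
  )
}

stats <- summarise_cols(toy)
stats
```

Passing `stats` to `knitr::kable()` would render it as a formatted table in the knitted report.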
Generalizability
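- The unweighted average Trump vote percentage across DMAs is 54.5, although Trump lost the national popular vote, so the sample does not mirror the national electorate. Each DMA counts equally regardless of population, so conclusions apply to DMAs rather than to individual voters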
- This will be shown statistically
- As an observational study, this cannot establish causation between sports searches and Trump’s 2016 vote percentage
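The gap between the DMA average and the national result can be checked with a one-sample t-test computed from the summary statistics reported earlier. This is a sketch under stated assumptions: the national popular-vote share `mu0` of about 46.1% is not part of the data set and is supplied here as an outside figure, and the test treats every DMA as one equally weighted observation.

```r
# Summary statistics reported for Trump.2016.Vote.
x_bar <- 54.5292270531401  # mean across DMAs
s     <- 12.2978147234567  # standard deviation across DMAs
n     <- 207               # number of DMAs

# Assumption: national popular-vote share of roughly 46.1% (outside figure)
mu0 <- 46.1

# One-sample t statistic and two-sided p-value
se      <- s / sqrt(n)
t_stat  <- (x_bar - mu0) / se
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

t_stat
p_value
```

A small p-value here would only say that the unweighted DMA mean differs from the national share; it does not by itself show how the DMAs were selected.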
Several ideas for Approach
- Use linear regression to build a predictive model
- Look at each sport individually. Categorize each sport into groups based on distance from its mean in units of \(\sigma\), aiming for roughly equal categories, e.g. (< -1.5\(\sigma\), -1\(\sigma\), -0.5\(\sigma\), 0.5\(\sigma\), 1\(\sigma\), > 1.5\(\sigma\)). I will use these categories to test whether Trump’s 2016 vote percentage differs significantly between the groups.
- My explanatory variables are not independent of each other because they add up to 100% for each DMA. Any analysis of how multiple explanatory variables affect Trump’s vote percentage will have to account for this
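Because the shares sum to 100% in each DMA, any one share is a linear function of the others, so a regression that includes all of them plus an intercept is rank deficient. A common workaround is to leave one league out as the baseline. The sketch below illustrates this on simulated shares for three made-up leagues (`a`, `b`, `c`); the data, the names, and the choice of baseline are all assumptions for illustration.

```r
set.seed(2016)  # simulated data only, not the real search shares
n <- 207

# Simulate three positive scores per DMA and rescale so each row sums to 100
raw <- matrix(rexp(n * 3), ncol = 3)
shares <- as.data.frame(100 * raw / rowSums(raw))
names(shares) <- c("a", "b", "c")

# Simulated response that depends on share a plus noise
shares$vote <- 40 + 0.3 * shares$a + rnorm(n, sd = 5)

# All three shares plus an intercept are perfectly collinear,
# so lm() reports NA for one coefficient
full <- lm(vote ~ a + b + c, data = shares)
coef(full)

# Dropping one share (c is the baseline) gives a full-rank model;
# the remaining coefficients are read relative to the omitted share
reduced <- lm(vote ~ a + b, data = shares)
summary(reduced)
```

Which league to omit is a modeling choice; the fitted values are the same whichever one is dropped, but the interpretation of each coefficient changes.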