Albert Y. Kim
Friday 2015/02/20
Question 4 on page 76 from Chapter 4 of Data Analysis Using Regression and Multilevel/Hierarchical Models. The codebook can be found here.
Using the OkCupid data, fit what you think is a good predictive model for gender and interpret all results.
What do I mean by “good predictions”? Recall that the dataset from class is a sample of profiles. In fact, I gave you 10% of all profiles. I will measure how “bad” your predictions are by the misclassification rate: the proportion of people misclassified (in either direction) when your model is applied to the other 90% of profiles. A good model keeps this proportion low.
Think about what happens when your prediction mechanism is overly optimized for your particular dataset. What will happen when it tries to predict other datasets?
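The comparison described above can be sketched in R. This is only a minimal illustration on simulated data, not the OkCupid profiles: the variable names `height` and `is_female`, the simulated relationship between them, and the 0.5 cutoff are all assumptions made for the example. It fits a logistic regression on a 10% sample and then computes the misclassification rate both on that sample and on the held-out 90%.

```r
# Simulated stand-in for the profiles data; variable names are hypothetical.
set.seed(76)
n <- 5000
profiles <- data.frame(height = rnorm(n, mean = 68, sd = 4))
# Simulate a binary outcome whose probability decreases with height
profiles$is_female <- rbinom(n, size = 1,
                             prob = plogis(-0.6 * (profiles$height - 67)))

# Keep 10% as the sample to fit on; hold out the other 90%
train_rows <- sample(n, size = n / 10)
train <- profiles[train_rows, ]
test <- profiles[-train_rows, ]

# Logistic regression fit only to the 10% sample
fit <- glm(is_female ~ height, family = binomial, data = train)

# Proportion misclassified (both kinds of error) at a 0.5 cutoff
misclass_rate <- function(model, data) {
  p_hat <- predict(model, newdata = data, type = "response")
  mean((p_hat > 0.5) != (data$is_female == 1))
}

in_sample_error <- misclass_rate(fit, train)
out_of_sample_error <- misclass_rate(fit, test)
c(in_sample = in_sample_error, out_of_sample = out_of_sample_error)
```

With a model this simple the two error rates should be close. As more predictors and interactions are tuned to the 10% sample, the in-sample error tends to keep falling while the out-of-sample error stops improving or gets worse.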
The data consist of cancer counts from the National Cancer Institute's “Surveillance, Epidemiology, and End Results Program” (SEER) database.
This is the unprocessed data file as provided by SEER (please do not share). It has a .txt extension, but it is actually a CSV file:
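library(dplyr)
# The file has a .txt extension, but its contents are comma-separated.
# as_tibble() replaces the deprecated tbl_df() and gives a compact printout.
SEER <- read.csv("Space Time Surveillance Counts 11_05_09.txt", header = TRUE) %>%
  as_tibble()
SEER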
Let
Write in the tone you would use to share your findings with state public health officials:
Do this for breast and lung cancer separately.
Go to Social Explorer while at Reed or using a proxy.
States, counties, census tracts, and census blocks can be identified using “Federal Information Processing Standard” (FIPS) codes.
These codes are crucial for matching up different data sets. For example, “Nez Perce County, Idaho” might be coded as “Nez Perce”, “Nez_Perce”, “NezPerce”, etc., whereas its FIPS code is always 16069.
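To illustrate why FIPS codes make matching easier, here is a small sketch using base R's `merge()`. The two tables, their column names, and every count and population figure are invented for the example; only the FIPS codes themselves are real. The codes are stored as character strings so that a leading zero (as in 01001) is not dropped.

```r
# Two made-up tables that spell the same counties differently.
# All counts and population figures below are invented for illustration.
cancer_counts <- data.frame(
  county = c("Nez Perce", "Multnomah", "Autauga"),
  fips = c("16069", "41051", "01001"),
  cases = c(12, 85, 7),
  stringsAsFactors = FALSE
)
census <- data.frame(
  county_name = c("Nez_Perce", "MULTNOMAH", "Autauga County"),
  fips = c("16069", "41051", "01001"),
  population = c(40000, 750000, 55000),
  stringsAsFactors = FALSE
)

# Matching on names finds nothing because the spellings disagree
by_name <- merge(cancer_counts, census, by.x = "county", by.y = "county_name")
nrow(by_name)

# Matching on FIPS codes pairs up every county
by_fips <- merge(cancer_counts, census, by = "fips")
by_fips
```

The same match can be written with dplyr as `inner_join(cancer_counts, census, by = "fips")`.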