Exercises: 1,3 (Pg. 227); 2 (Pg. 232); 1,2 (Pg. 235);
Submission: Submit an electronic document on Sakai. It must be an HTML file generated in RStudio. All assigned problems are taken from the textbook R for Data Science. You do not need R code to answer every question. If you answer without using R code, delete the code chunk. If the question requires R code, make sure you display the R code. If the question requires a figure, make sure you display the figure. Many of the questions can be answered with a written response, but still require R code and/or figures to support and explain the answer.
gss_cat %>% ggplot(aes(rincome)) + geom_bar()
The default bar chart of rincome (reported income) is hard to read because the many long level labels overlap along the x-axis.
gss_cat %>% mutate(rincome = fct_relevel(rincome, c("Not applicable", "Refused", "Don't know", "No answer"))) %>% ggplot(aes(rincome)) + geom_bar() + coord_flip()
I used fct_relevel() to group the non-response levels ("Not applicable", "Refused", "Don't know", "No answer") together at the start of the factor, and coord_flip() to flip the axes so the labels no longer overlap. This reorganization made the bar chart both easier to understand and more visually appealing.
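As a small illustration of what fct_relevel() does in the answer above, the sketch below applies it to a toy factor. The vector `income` and its values are hypothetical stand-ins for rincome, not data from gss_cat.

```r
library(forcats)

# Hypothetical toy factor standing in for rincome (not taken from gss_cat)
income <- factor(
  c("Refused", "$1000 to 2999", "Not applicable", "Lt $1000", "Refused"),
  levels = c("Refused", "$1000 to 2999", "Lt $1000", "Not applicable")
)

# Move the non-response levels to the front of the level order.
# Only the order of the levels changes; the observations stay the same.
income_releveled <- fct_relevel(income, "Not applicable", "Refused")

levels(income_releveled)
```

The remaining levels keep their original relative order, which is why the income brackets stay sorted in the flipped bar chart.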
gss_cat_denom <- gss_cat %>%
  filter(!denom %in% c("No answer", "Other", "Don't know", "Not applicable", "No denomination"))
gss_cat_denom
## # A tibble: 7,025 x 9
##     year marital      age race  rincome    partyid      relig  denom     tvhours
##    <int> <fct>      <int> <fct> <fct>      <fct>        <fct>  <fct>       <int>
##  1  2000 Never mar…    26 White $8000 to … Ind,near rep Prote… Southern…      12
##  2  2000 Divorced      48 White $8000 to … Not str rep… Prote… Baptist-…      NA
##  3  2000 Married       25 White $20000 - … Strong demo… Prote… Southern…      NA
##  4  2000 Divorced      44 White $7000 to … Ind,near dem Prote… Lutheran…      NA
##  5  2000 Married       47 White $25000 or… Strong repu… Prote… Southern…       3
##  6  2000 Married       52 White $25000 or… Strong demo… Prote… Southern…       1
##  7  2000 Married       51 White $25000 or… Strong repu… Prote… United m…      NA
##  8  2000 Married       40 Black $25000 or… Strong demo… Prote… Baptist-…       7
##  9  2000 Married       45 Black Not appli… Independent  Prote… United m…      NA
## 10  2000 Married       49 White Refused    Strong repu… Prote… United m…       2
## # … with 7,015 more rows
The table above keeps only the respondents who reported a specific denomination. The filter drops every row where denom is "No answer", "Other", "Don't know", "Not applicable", or "No denomination".
gss_cat_denom %>% ggplot(aes(relig)) + geom_bar() + coord_flip()
The graph above uses the filtered dataset, which excludes respondents without a specific denomination, and plots the count for each remaining religion. Protestant is the only religion that appears in the graph, so denom applies only to Protestants.
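The same conclusion can be checked without a plot. The sketch below counts the religions that remain after the filter; the object name `relig_with_denom` is introduced here for illustration only.

```r
library(dplyr)
library(forcats)  # provides the gss_cat dataset

# Count the religions left once non-specific denominations are filtered out
relig_with_denom <- gss_cat %>%
  filter(!denom %in% c("No answer", "Other", "Don't know",
                       "Not applicable", "No denomination")) %>%
  count(relig)

relig_with_denom
```

If denom applies to a single religion, this table should contain one row, matching the single bar in the graph.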
gss_cat %>% ggplot(aes(tvhours)) + geom_bar()
## Warning: Removed 10146 rows containing non-finite values (stat_count).
The mean is not a good summary because the distribution of tvhours is right-skewed, with a few very large values that pull the mean upward. The median is a better summary of tvhours.
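To illustrate why the median is more robust here, the sketch below compares the two summaries on a small made-up sample. The vector `tv` is hypothetical and is not drawn from gss_cat.

```r
# Hypothetical right-skewed sample of daily TV hours (not taken from gss_cat)
tv <- c(0, 1, 1, 2, 2, 2, 3, 3, 4, 24)

mean(tv)    # 4.2: pulled upward by the single 24-hour value
median(tv)  # 2: does not depend on how extreme the largest value is
```

For the real data, mean(gss_cat$tvhours, na.rm = TRUE) and median(gss_cat$tvhours, na.rm = TRUE) give the corresponding comparison; na.rm = TRUE is needed because tvhours contains missing values.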
gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    Republican = c("Strong republican", "Not str republican"),
    Independent = c("Ind,near rep", "Independent", "Ind,near dem"),
    Democrat = c("Not str democrat", "Strong democrat"),
    Other = c("No answer", "Don't know", "Other party")
  )) %>%
  count(partyid, year) %>%
  ggplot(aes(x = year, y = n, color = partyid)) +
  geom_point() +
  geom_line()
Over time, the number of Republicans gradually increased, peaked around 2006, and then slowly declined. The number of Democrats followed the same pattern. The number of Independents slowly decreased at first, also peaked around 2006, and then gradually increased until a gradual decrease after 2012. In every survey year from 2000 to 2014, Independents were the largest group of respondents and Republicans were the smallest of the three. Democrats and Republicans followed similar trends, while Independents moved in the opposite direction apart from the shared 2006 peak and the decrease after 2012. Because the plot shows raw counts of survey respondents rather than proportions, the shared 2006 peak may partly reflect a larger sample that year.
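Since the number of respondents differs from year to year, one way to compare the groups on an equal footing is to plot each party's share of that year's respondents instead of the raw count. The sketch below is one possible approach; the names `party_share` and `prop` are introduced here for illustration.

```r
library(dplyr)
library(forcats)  # provides gss_cat and fct_collapse()
library(ggplot2)

party_share <- gss_cat %>%
  mutate(partyid = fct_collapse(partyid,
    Republican = c("Strong republican", "Not str republican"),
    Independent = c("Ind,near rep", "Independent", "Ind,near dem"),
    Democrat = c("Not str democrat", "Strong democrat"),
    Other = c("No answer", "Don't know", "Other party")
  )) %>%
  count(year, partyid) %>%
  group_by(year) %>%
  mutate(prop = n / sum(n)) %>%  # share of that year's respondents
  ungroup()

ggplot(party_share, aes(x = year, y = prop, color = partyid)) +
  geom_point() +
  geom_line()
```

Within each year the shares sum to 1, so a change in a line reflects a change in the mix of respondents rather than a change in sample size.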
gss_cat %>%
  mutate(rincome = fct_collapse(rincome,
    Other = c("Not applicable", "Refused", "Don't know", "No answer"),
    "Less than $1k to $9999" = c("Lt $1000", "$1000 to 2999", "$3000 to 3999", "$4000 to 4999", "$5000 to 5999", "$6000 to 6999", "$7000 to 7999", "$8000 to 9999"),
    # These level names use " - " rather than " to " in gss_cat
    "$10k to $19999" = c("$10000 - 14999", "$15000 - 19999"),
    "$20k or more" = c("$20000 - 24999", "$25000 or more")
  )) %>%
  count(rincome)
## # A tibble: 4 x 2
##   rincome                    n
##   <fct>                  <int>
## 1 Other                   8468
## 2 $20k or more            8646
## 3 $10k to $19999          2216
## 4 Less than $1k to $9999  2153
I collapsed the income levels into three brackets of roughly $10,000 each (under $10,000, $10,000 to $19,999, and $20,000 or more) and grouped all non-responses as Other. The level names must match exactly: the original call wrote "$10000 to $14999", "$15000 to 19999", and "$20000 to $24999", which are not levels of rincome, so fct_collapse() warned about unknown levels and left those three brackets uncollapsed.
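fct_collapse() only warns when it is given a level name that does not exist, so it can help to check the exact spelling of the levels before collapsing. The sketch below shows one way to do that; `missing_levels()` is a hypothetical helper written for this example, not a forcats function.

```r
library(forcats)  # provides gss_cat

# Print the exact level names; some brackets use " - " and others use " to "
levels(gss_cat$rincome)

# Hypothetical helper: return the requested names that are not actual levels
missing_levels <- function(f, wanted) {
  setdiff(wanted, levels(f))
}

# The first name is misspelled, the second is a real level
missing_levels(gss_cat$rincome, c("$10000 to $14999", "$10000 - 14999"))
```

An empty result from the helper means every requested name matches an existing level.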