This assignment practices good techniques for showing the distribution of the values of a categorical variable. To complete it, select any categorical variable from any dataframe you have encountered in this class or elsewhere.
Load the libraries you need and the data, if necessary. Run str() and summary() on the dataframe. Describe the source and point out your selected variable.
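I am using the gss_sm dataset from the socviz package. It holds responses from the 2016 General Social Survey: 2,867 respondents and 32 variables covering demographics and attitudes.
I will be working with the siblings variable, a factor with 7 levels that records each respondent's number of brothers and sisters.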
# Place your code here.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.3.0 v purrr 0.3.4
## v tibble 3.0.0 v dplyr 0.8.5
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts --------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(socviz)
library(waffle)
a4data <- gss_sm
str(a4data)
## tibble [2,867 x 32] (S3: tbl_df/tbl/data.frame)
## $ year : num [1:2867] 2016 2016 2016 2016 2016 ...
## ..- attr(*, "label")= chr "GSS YEAR FOR THIS RESPONDENT "
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ id : num [1:2867] 1 2 3 4 5 6 7 8 9 10 ...
## ..- attr(*, "label")= chr "RESPONDNT ID NUMBER "
## ..- attr(*, "format.stata")= chr "%8.0g"
## $ ballot : 'labelled' num [1:2867] 1 2 3 1 3 2 1 3 1 3 ...
## ..- attr(*, "label")= chr "BALLOT USED FOR INTERVIEW"
## ..- attr(*, "format.stata")= chr "%8.0g"
## ..- attr(*, "labels")= Named num [1:5] 0 1 2 3 4
## .. ..- attr(*, "names")= chr [1:5] "iap" "BALLOT A" "BALLOT B" "BALLOT C" ...
## $ age : num [1:2867] 47 61 72 43 55 53 50 23 45 71 ...
## ..- attr(*, "label")= chr "AGE OF RESPONDENT"
## ..- attr(*, "format.stata")= chr "%8.0g"
## ..- attr(*, "labels")= Named num [1:3] 89 98 99
## .. ..- attr(*, "names")= chr [1:3] "89 OR OLDER" "dk" "na"
## $ childs : num [1:2867] 3 0 2 4 2 2 2 3 3 4 ...
## ..- attr(*, "label")= chr "NUMBER OF CHILDREN"
## ..- attr(*, "format.stata")= chr "%8.0g"
## ..- attr(*, "labels")= Named num [1:2] 8 9
## .. ..- attr(*, "names")= chr [1:2] "EIGHT OR MORE" "DK NA"
## $ sibs : 'labelled' num [1:2867] 2 3 3 3 2 2 2 6 5 1 ...
## ..- attr(*, "label")= chr "NUMBER OF BROTHERS AND SISTERS"
## ..- attr(*, "format.stata")= chr "%8.0g"
## ..- attr(*, "labels")= Named num [1:3] -1 98 99
## .. ..- attr(*, "names")= chr [1:3] "iap" "dk" "na"
## $ degree : Factor w/ 5 levels "Lt High School",..: 4 2 4 2 5 3 2 2 2 3 ...
## $ race : Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 1 1 3 2 1 ...
## $ sex : Factor w/ 2 levels "Male","Female": 1 1 1 2 2 2 1 2 1 1 ...
## $ region : Factor w/ 9 levels "New England",..: 1 1 1 1 1 1 1 2 2 2 ...
## $ income16 : Factor w/ 26 levels "under $1 000",..: 26 19 21 26 26 20 26 16 20 20 ...
## $ relig : Factor w/ 13 levels "Protestant","Catholic",..: 4 4 2 2 4 4 4 2 1 4 ...
## $ marital : Factor w/ 5 levels "Married","Widowed",..: 1 5 1 1 1 1 1 1 1 3 ...
## $ padeg : Factor w/ 5 levels "Lt High School",..: 5 1 2 NA 4 NA 2 1 1 2 ...
## $ madeg : Factor w/ 5 levels "Lt High School",..: 2 2 1 2 2 2 2 1 1 2 ...
## $ partyid : Factor w/ 8 levels "Strong Democrat",..: 4 3 6 6 2 2 6 3 1 7 ...
## $ polviews : Factor w/ 7 levels "Extremely Liberal",..: 4 2 6 4 3 3 3 5 NA 6 ...
## $ happy : Factor w/ 3 levels "Very Happy","Pretty Happy",..: 2 2 1 2 1 1 2 1 2 2 ...
## $ partners : Factor w/ 9 levels "No Partners",..: NA 2 2 NA 2 2 NA 2 NA 4 ...
## $ grass : Factor w/ 2 levels "Legal","Not Legal": NA 1 2 NA 1 1 NA 2 NA 2 ...
## $ zodiac : Factor w/ 12 levels "Aries","Taurus",..: 11 8 12 4 8 8 10 11 NA 9 ...
## $ pres12 : 'labelled' num [1:2867] 3 1 2 2 1 1 NA NA NA 2 ...
## ..- attr(*, "label")= chr "VOTE OBAMA OR ROMNEY"
## ..- attr(*, "format.stata")= chr "%8.0g"
## ..- attr(*, "labels")= Named num [1:7] 0 1 2 3 4 8 9
## .. ..- attr(*, "names")= chr [1:7] "iap" "Obama" "Romney" "Other candidate (SPECIFY)" ...
## $ wtssall : num [1:2867] 0.957 0.478 0.957 1.914 1.435 ...
## ..- attr(*, "label")= chr "WEIGHT VARIABLE"
## ..- attr(*, "format.stata")= chr "%12.0g"
## ..- attr(*, "labels")= Named num -1
## .. ..- attr(*, "names")= chr "iap"
## $ income_rc : Factor w/ 16 levels "Gt $0","Gt $10000",..: 16 9 11 16 16 10 16 6 10 10 ...
## $ agegrp : Factor w/ 5 levels "Age 18-35","Age 35-45",..: 3 4 5 2 3 3 3 1 2 5 ...
## $ ageq : Factor w/ 4 levels "Age 18-34","Age 34-49",..: 2 3 4 2 3 3 3 1 2 4 ...
## $ siblings : Factor w/ 7 levels "0","1","2","3",..: 3 4 4 4 3 3 3 7 6 2 ...
## $ kids : Factor w/ 5 levels "0","1","2","3",..: 4 1 3 5 3 3 3 4 4 5 ...
## $ religion : Factor w/ 5 levels "Protestant","Catholic",..: 4 4 2 2 4 4 4 2 1 4 ...
## $ bigregion : Factor w/ 4 levels "Northeast","Midwest",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ partners_rc: Factor w/ 5 levels "0","1","2","3",..: NA 2 2 NA 2 2 NA 2 NA 4 ...
## $ obama : num [1:2867] 0 1 0 0 1 1 NA NA NA 0 ...
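The instructions also ask for summary(). A minimal sketch of summarising only the selected variable, assuming the socviz package is installed, might look like the following. Output is omitted here because it was not part of the original run.

```r
library(socviz)

# Level labels of the selected categorical variable
sib_levels <- levels(gss_sm$siblings)
sib_levels

# For a factor, summary() returns the count of each level (and of NA, if any)
sib_summary <- summary(gss_sm$siblings)
sib_summary
```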
Do a default bar plot of your variable.
# Place your code here.
ggplot(a4data, aes(x=siblings)) + geom_bar()
Use dplyr verbs to create a smaller dataframe that includes the counts of the categorical variable values in a variable count. Reorder the values of the variable according to count. Use geom_col() and put the categorical values on the vertical axis.
# Place your code here.
a4data <- a4data %>%
  filter(!is.na(siblings)) %>%
  group_by(siblings) %>%
  summarise(count = n())
ggplot(a4data, aes(x = reorder(siblings, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = 'Siblings', y = 'Count', title = "Distribution of Siblings")
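The same steps can also be sketched with count() and forcats::fct_reorder(), mapping the factor straight to the vertical axis instead of flipping coordinates. This is an alternative under assumptions, not the approach used above: it assumes ggplot2 3.3.0 or later (the version loaded earlier) so that geom_col() accepts a categorical y, and sib_counts is a hypothetical name chosen to leave a4data untouched.

```r
library(dplyr)
library(ggplot2)
library(forcats)
library(socviz)

# One row per level, with the number of respondents in a column named count
sib_counts <- gss_sm %>%
  filter(!is.na(siblings)) %>%
  count(siblings, name = "count") %>%
  # Reorder the factor levels by ascending count
  mutate(siblings = fct_reorder(siblings, count))

# Categorical values on the vertical axis, no coord_flip() needed
sib_plot <- ggplot(sib_counts, aes(x = count, y = siblings)) +
  geom_col() +
  labs(x = "Count", y = "Siblings", title = "Distribution of Siblings")
sib_plot
```

Storing the reordered factor in the data frame keeps the ordering available to any later plot that uses sib_counts.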
Create a pie chart to show the proportions of the categorical variable with its various values.
ggplot(a4data, aes(x=1, y=count, fill=siblings)) +
  geom_col() +
  coord_polar(theta = 'y') +
  theme_void() +
  labs(title="Distribution of Siblings")
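Because a pie chart is meant to show proportions, it can help to compute them explicitly and print them on the slices. The sketch below is one way to do that, assuming socviz is installed. The names sib_props, prop, and label are hypothetical, and the counts are rebuilt from gss_sm so the block runs on its own.

```r
library(dplyr)
library(ggplot2)
library(socviz)

sib_props <- gss_sm %>%
  filter(!is.na(siblings)) %>%
  count(siblings, name = "count") %>%
  mutate(
    prop  = count / sum(count),              # share of respondents per level
    label = paste0(round(prop * 100), "%")   # rounded percentage for display
  )

sib_pie <- ggplot(sib_props, aes(x = 1, y = count, fill = siblings)) +
  geom_col(width = 1) +
  # Centre each label within its stacked segment
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Distribution of Siblings", fill = "Siblings")
sib_pie
```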
As an alternative to the pie chart, create a waffle plot.
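# Convert counts to whole-number percentages (one waffle square per percent).
# Rounded values are not guaranteed to sum to exactly 100.
a4data <- a4data %>%
  mutate(percent = round(count/sum(count)*100))
# Build a named numeric vector: values are percentages, names are the levels
a4waffle <- as.numeric(a4data$percent)
names(a4waffle) <- a4data$siblings
waffle(a4waffle, title='Distribution of Siblings')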
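Rounding each percentage independently can leave the waffle with 99 or 101 squares instead of 100. One possible fix is largest-remainder rounding, sketched below in base R. The helper round_preserve_total() is hypothetical (it is not part of the waffle package) and the input counts are made-up illustration values, not results from gss_sm.

```r
# Hypothetical helper: scale x to sum to `total`, then round so that the
# result is whole numbers that still sum exactly to `total`.
round_preserve_total <- function(x, total = 100) {
  raw <- x / sum(x) * total          # exact (fractional) shares
  floored <- floor(raw)              # round everything down first
  leftover <- total - sum(floored)   # units still to hand out
  # Give one extra unit to the entries with the largest fractional parts
  idx <- order(raw - floored, decreasing = TRUE)[seq_len(leftover)]
  floored[idx] <- floored[idx] + 1
  floored
}

# Made-up counts for illustration only
toy_counts <- c(a = 1, b = 1, c = 1)

naive  <- round(toy_counts / sum(toy_counts) * 100)
fixed  <- round_preserve_total(toy_counts)

sum(naive)  # 99: independent rounding loses a square
sum(fixed)  # 100
```

The resulting vector keeps the names of its input, so it could be passed to waffle() in place of the independently rounded percentages.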