This assignment is practice in good techniques for showing the distribution of the values of a categorical variable. To do the assignment, select any categorical variable from any dataframe you have encountered in this class or elsewhere.

Problem 1

Load the libraries you need and the data, if necessary. Run str() and summary() on the dataframe. Describe the source and point out your selected variable.

I am using the gss_sm dataset from the socviz package. It is a subset of the 2016 General Social Survey with 2,867 respondents and 32 variables covering demographics and attitudes.

I will be working with the siblings variable, a factor with 7 levels that records each respondent's number of siblings.

# Place your code here.
library(tidyverse)
library(socviz)
library(waffle)

a4data <- gss_sm

str(a4data)
## tibble [2,867 x 32] (S3: tbl_df/tbl/data.frame)
##  $ year       : num [1:2867] 2016 2016 2016 2016 2016 ...
##   ..- attr(*, "label")= chr "GSS YEAR FOR THIS RESPONDENT                       "
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ id         : num [1:2867] 1 2 3 4 5 6 7 8 9 10 ...
##   ..- attr(*, "label")= chr "RESPONDNT ID NUMBER                                "
##   ..- attr(*, "format.stata")= chr "%8.0g"
##  $ ballot     : 'labelled' num [1:2867] 1 2 3 1 3 2 1 3 1 3 ...
##   ..- attr(*, "label")= chr "BALLOT USED FOR INTERVIEW"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##   ..- attr(*, "labels")= Named num [1:5] 0 1 2 3 4
##   .. ..- attr(*, "names")= chr [1:5] "iap" "BALLOT A" "BALLOT B" "BALLOT C" ...
##  $ age        : num [1:2867] 47 61 72 43 55 53 50 23 45 71 ...
##   ..- attr(*, "label")= chr "AGE OF RESPONDENT"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##   ..- attr(*, "labels")= Named num [1:3] 89 98 99
##   .. ..- attr(*, "names")= chr [1:3] "89 OR OLDER" "dk" "na"
##  $ childs     : num [1:2867] 3 0 2 4 2 2 2 3 3 4 ...
##   ..- attr(*, "label")= chr "NUMBER OF CHILDREN"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##   ..- attr(*, "labels")= Named num [1:2] 8 9
##   .. ..- attr(*, "names")= chr [1:2] "EIGHT OR MORE" "DK NA"
##  $ sibs       : 'labelled' num [1:2867] 2 3 3 3 2 2 2 6 5 1 ...
##   ..- attr(*, "label")= chr "NUMBER OF BROTHERS AND SISTERS"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##   ..- attr(*, "labels")= Named num [1:3] -1 98 99
##   .. ..- attr(*, "names")= chr [1:3] "iap" "dk" "na"
##  $ degree     : Factor w/ 5 levels "Lt High School",..: 4 2 4 2 5 3 2 2 2 3 ...
##  $ race       : Factor w/ 3 levels "White","Black",..: 1 1 1 1 1 1 1 3 2 1 ...
##  $ sex        : Factor w/ 2 levels "Male","Female": 1 1 1 2 2 2 1 2 1 1 ...
##  $ region     : Factor w/ 9 levels "New England",..: 1 1 1 1 1 1 1 2 2 2 ...
##  $ income16   : Factor w/ 26 levels "under $1 000",..: 26 19 21 26 26 20 26 16 20 20 ...
##  $ relig      : Factor w/ 13 levels "Protestant","Catholic",..: 4 4 2 2 4 4 4 2 1 4 ...
##  $ marital    : Factor w/ 5 levels "Married","Widowed",..: 1 5 1 1 1 1 1 1 1 3 ...
##  $ padeg      : Factor w/ 5 levels "Lt High School",..: 5 1 2 NA 4 NA 2 1 1 2 ...
##  $ madeg      : Factor w/ 5 levels "Lt High School",..: 2 2 1 2 2 2 2 1 1 2 ...
##  $ partyid    : Factor w/ 8 levels "Strong Democrat",..: 4 3 6 6 2 2 6 3 1 7 ...
##  $ polviews   : Factor w/ 7 levels "Extremely Liberal",..: 4 2 6 4 3 3 3 5 NA 6 ...
##  $ happy      : Factor w/ 3 levels "Very Happy","Pretty Happy",..: 2 2 1 2 1 1 2 1 2 2 ...
##  $ partners   : Factor w/ 9 levels "No Partners",..: NA 2 2 NA 2 2 NA 2 NA 4 ...
##  $ grass      : Factor w/ 2 levels "Legal","Not Legal": NA 1 2 NA 1 1 NA 2 NA 2 ...
##  $ zodiac     : Factor w/ 12 levels "Aries","Taurus",..: 11 8 12 4 8 8 10 11 NA 9 ...
##  $ pres12     : 'labelled' num [1:2867] 3 1 2 2 1 1 NA NA NA 2 ...
##   ..- attr(*, "label")= chr "VOTE OBAMA OR ROMNEY"
##   ..- attr(*, "format.stata")= chr "%8.0g"
##   ..- attr(*, "labels")= Named num [1:7] 0 1 2 3 4 8 9
##   .. ..- attr(*, "names")= chr [1:7] "iap" "Obama" "Romney" "Other candidate (SPECIFY)" ...
##  $ wtssall    : num [1:2867] 0.957 0.478 0.957 1.914 1.435 ...
##   ..- attr(*, "label")= chr "WEIGHT VARIABLE"
##   ..- attr(*, "format.stata")= chr "%12.0g"
##   ..- attr(*, "labels")= Named num -1
##   .. ..- attr(*, "names")= chr "iap"
##  $ income_rc  : Factor w/ 16 levels "Gt $0","Gt $10000",..: 16 9 11 16 16 10 16 6 10 10 ...
##  $ agegrp     : Factor w/ 5 levels "Age 18-35","Age 35-45",..: 3 4 5 2 3 3 3 1 2 5 ...
##  $ ageq       : Factor w/ 4 levels "Age 18-34","Age 34-49",..: 2 3 4 2 3 3 3 1 2 4 ...
##  $ siblings   : Factor w/ 7 levels "0","1","2","3",..: 3 4 4 4 3 3 3 7 6 2 ...
##  $ kids       : Factor w/ 5 levels "0","1","2","3",..: 4 1 3 5 3 3 3 4 4 5 ...
##  $ religion   : Factor w/ 5 levels "Protestant","Catholic",..: 4 4 2 2 4 4 4 2 1 4 ...
##  $ bigregion  : Factor w/ 4 levels "Northeast","Midwest",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ partners_rc: Factor w/ 5 levels "0","1","2","3",..: NA 2 2 NA 2 2 NA 2 NA 4 ...
##  $ obama      : num [1:2867] 0 1 0 0 1 1 NA NA NA 0 ...
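
The str() output above is long, so it can help to look at the selected variable on its own. The following is a minimal sketch, assuming the socviz package is installed; it is one possible way to inspect the variable, not part of the original submission.

```r
library(socviz)

# Work on a copy of the dataset, as in the code above.
a4data <- gss_sm

# The category labels of the selected variable.
levels(a4data$siblings)

# Calling summary() on a factor returns the count per level,
# with missing values (if any) reported separately.
summary(a4data$siblings)
```

Because siblings is a factor, summary() here gives the same counts that the bar plots in the later problems display.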

Problem 2

Do a default bar plot of your variable.

# Place your code here.

ggplot(a4data, aes(x=siblings)) + geom_bar()

Problem 3

Use dplyr verbs to create a smaller dataframe that includes the counts of the categorical variable values in a variable count. Reorder the values of the variable according to count. Use geom_col() and put the categorical values on the vertical axis.

# Place your code here.
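# Drop missing values, then count the rows in each siblings category.
# Note that this overwrites a4data with the small summary table.
a4data <- a4data %>%
  filter(!is.na(siblings)) %>%
  group_by(siblings) %>%
  summarise(count = n())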

ggplot(a4data, aes(x = reorder(siblings, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Siblings", y = "Count", title = "Distribution of Siblings")
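
The same counting and reordering can be written more compactly. The sketch below is an alternative under stated assumptions: it starts again from gss_sm, uses dplyr's count() in place of group_by() plus summarise(), and uses forcats' fct_reorder() to order the factor levels by count. The name sib_counts is made up for illustration, and mapping the categories straight to y without coord_flip() assumes ggplot2 3.3.0 or later.

```r
library(dplyr)
library(forcats)
library(ggplot2)
library(socviz)

# sib_counts is a hypothetical name; storing the summary in its own object
# leaves the full dataframe untouched for later steps.
sib_counts <- gss_sm %>%
  filter(!is.na(siblings)) %>%
  count(siblings, name = "count") %>%              # one row per category
  mutate(siblings = fct_reorder(siblings, count))  # levels ordered by count

# Categories on the vertical axis, counts on the horizontal axis.
ggplot(sib_counts, aes(x = count, y = siblings)) +
  geom_col() +
  labs(x = "Count", y = "Siblings", title = "Distribution of Siblings")
```

Reordering the factor inside the dataframe, rather than inside aes(), keeps the axis label clean and makes the ordering available to any later plot.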

Problem 4

Create a pie chart to show the proportions of the categorical variable with its various values.

ggplot(a4data, aes(x=1, y=count, fill=siblings)) +
  geom_col() +
  coord_polar(theta = 'y') + 
  theme_void() + 
  labs(title="Distribution of Siblings")
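
A pie chart without labels leaves the reader to judge proportions from angles alone. As a hedged sketch, percentage labels could be added with geom_text() and position_stack(); the names pie_data, share, and label are hypothetical, and the counts are recomputed from gss_sm so the example stands on its own.

```r
library(dplyr)
library(ggplot2)
library(socviz)

# Recompute the counts and add each category's share of the total.
pie_data <- gss_sm %>%
  filter(!is.na(siblings)) %>%
  count(siblings, name = "count") %>%
  mutate(share = count / sum(count),
         label = paste0(round(100 * share), "%"))

ggplot(pie_data, aes(x = 1, y = count, fill = siblings)) +
  geom_col() +
  # Centre each label within its stacked segment before the polar transform.
  geom_text(aes(label = label), position = position_stack(vjust = 0.5)) +
  coord_polar(theta = "y") +
  theme_void() +
  labs(title = "Distribution of Siblings", fill = "Siblings")
```

Labels on small slices may overlap, which is one reason the bar chart from Problem 3 is usually easier to read.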

Problem 5

As an alternative to the pie chart, create a waffle plot.

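# Convert counts to whole-number percentages; after rounding, the total may not be exactly 100.
a4data <- a4data %>%
  mutate(percent = round(count / sum(count) * 100))

# waffle() takes a named numeric vector: one name per category.
a4waffle <- as.numeric(a4data$percent)
names(a4waffle) <- as.character(a4data$siblings)

waffle(a4waffle, title = "Distribution of Siblings")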
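
Rounding each percentage separately can leave the waffle with 99 or 101 squares instead of 100. One way around this is the largest remainder method: round every share down, then hand the leftover squares to the categories with the largest fractional parts. The helper round_to_total() below is a hypothetical sketch in base R, and the toy counts are made up for illustration; they are not the GSS values.

```r
# Hypothetical helper: whole-number parts that sum exactly to `total`.
round_to_total <- function(counts, total = 100) {
  exact <- counts / sum(counts) * total   # unrounded shares
  parts <- floor(exact)                   # round everything down first
  leftover <- total - sum(parts)          # squares still to hand out
  if (leftover > 0) {
    # Indices of the largest fractional parts, one per leftover square.
    bump <- order(exact - parts, decreasing = TRUE)[seq_len(leftover)]
    parts[bump] <- parts[bump] + 1
  }
  stats::setNames(parts, names(counts))
}

# Toy counts: three equal categories, where plain round() totals 99.
toy <- c(a = 1, b = 1, c = 1)
sum(round(toy / sum(toy) * 100))
sum(round_to_total(toy))
```

The named vector returned by the helper has the same shape as a4waffle, so it could be passed to waffle() in its place.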