Day One - 10/19/2022

Installing Packages

If you need to install the necessary packages, this is the code that you would use. I have commented it out with the # sign so that my machine won’t try to install these packages again. Remove the # signs if you need to run it.

# install.packages("remotes")
# library(remotes)
# remotes::install_github("ryanburge/socsci")

Loading Packages

You need to load the socsci package every time you start fresh in RStudio. You do that by using the code below.

library(socsci)

Loading Data

We need to load our data into our machine. We will do that using the following command:

cces20 <- read_csv("https://www.dropbox.com/s/wuixmc67ae786wp/small_ces2020.csv?dl=1")

We are calling our data “cces20” and we are reading it in from a Dropbox folder.

The codebook for this data can be found here: https://www.dropbox.com/s/s01kefaks4u6g1f/small_ces2020%20Codebook.docx?dl=0

Counting Things

The building block of data analysis is just counting stuff.

cces20 %>% 
  ct(gender)

## # A tibble: 2 x 3
##   gender     n   pct
## *  <dbl> <int> <dbl>
## 1      1 25791 0.423
## 2      2 35209 0.577

What does 1 or 2 represent here? You need to look at the codebook. 1 is Male; 2 is Female.
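
If you are curious what ct is doing, here is a rough equivalent written in plain dplyr on a tiny made-up table. It assumes, based on the output above, that ct returns a count plus a share rounded to three decimals. The toy data and its values are invented for illustration and are not from the survey.

```r
library(dplyr)

# Toy data, invented for illustration (1 = Male, 2 = Female)
toy <- tibble(gender = c(1, 2, 2, 1, 2))

toy_counts <- toy %>%
  count(gender) %>%                     # one row per value, with a count n
  mutate(pct = round(n / sum(n), 3))    # share of all rows, rounded

toy_counts
```

With five rows, two coded 1 and three coded 2, the shares come out to 0.4 and 0.6.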

Let’s try another example:

cces20 %>% 
  ct(sexuality)

## # A tibble: 7 x 3
##   sexuality     n   pct
## *     <dbl> <int> <dbl>
## 1         1 52383 0.859
## 2         2   971 0.016
## 3         3  1837 0.03 
## 4         4  3267 0.054
## 5         5   861 0.014
## 6         6  1623 0.027
## 7        NA    58 0.001

Notice how there are a few people in the NA row? NA means the response is missing. We want to exclude those.

cces20 %>% 
  ct(sexuality, show_na = FALSE)

## # A tibble: 6 x 3
##   sexuality     n   pct
## *     <dbl> <int> <dbl>
## 1         1 52383 0.86 
## 2         2   971 0.016
## 3         3  1837 0.03 
## 4         4  3267 0.054
## 5         5   861 0.014
## 6         6  1623 0.027

Now, they are not included in our count of the sexuality variable.
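
Another way to get a similar result is to drop the missing rows yourself before counting. This is a minimal sketch in plain dplyr on invented toy data. It is meant to illustrate the idea, not to show how ct handles show_na internally.

```r
library(dplyr)

# Toy data, invented for illustration; one response is missing
toy <- tibble(sexuality = c(1, 1, 2, NA, 1))

toy_no_na <- toy %>%
  filter(!is.na(sexuality)) %>%         # keep only rows with a real answer
  count(sexuality) %>%
  mutate(pct = round(n / sum(n), 3))    # shares now use 4 rows, not 5

toy_no_na
```

Because the missing row is removed first, the percentages are calculated out of the four people who answered.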

Making a new variable

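Let’s say we want an age variable, but there’s not one in the dataset. Let’s make one using the mutate command and then just count it. The dataset has each person’s birth year (birthyr), so we subtract that from 2020, the year of the survey.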

cces20 %>% 
  mutate(age = 2020 - birthyr) %>% 
  ct(age)

## # A tibble: 78 x 3
##      age     n   pct
##  * <dbl> <int> <dbl>
##  1    18   584 0.01 
##  2    19   618 0.01 
##  3    20  1144 0.019
##  4    21   916 0.015
##  5    22   878 0.014
##  6    23   879 0.014
##  7    24   940 0.015
##  8    25  1124 0.018
##  9    26  1056 0.017
## 10    27  1168 0.019
## # ... with 68 more rows
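
To see exactly what mutate is doing, it can help to try it on a few rows where you can check the math by hand. The birth years below are invented for illustration. Note that this gives an approximate age, since it ignores the month someone was born.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(birthyr = c(2002, 1990, 1955))

toy_age <- toy %>%
  mutate(age = 2020 - birthyr)   # adds a new column; birthyr is kept

toy_age
```

The new age column holds 18, 30, and 65, and the original birthyr column is still there.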

Filtering

R has a lot of cool functions. One of the most helpful is called filter. It will filter the data based on whatever parameters we have set up.

Here is the racial breakdown of women in the dataset. It’s important that you use == (2 equal signs), which tests whether two things are equal. A single = does not do that.

cces20 %>% 
  filter(gender == 2) %>% 
  ct(race)

## # A tibble: 8 x 3
##    race     n   pct
## * <dbl> <int> <dbl>
## 1     1 24772 0.704
## 2     2  4641 0.132
## 3     3  3165 0.09 
## 4     4  1065 0.03 
## 5     5   259 0.007
## 6     6   813 0.023
## 7     7   445 0.013
## 8     8    49 0.001

If you want to look at the gender breakdown of non-white people in the data, you would use != (not equal to) like this:

cces20 %>% 
  filter(race != 1) %>% 
  ct(gender)

## # A tibble: 2 x 3
##   gender     n   pct
## *  <dbl> <int> <dbl>
## 1      1  6435 0.381
## 2      2 10437 0.619
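
One thing to keep in mind with != is that filter drops any row where the comparison comes out as NA. The sketch below uses invented toy data with one missing race value to show that the missing row does not count as "not 1".

```r
library(dplyr)

# Toy data, invented for illustration; one race value is missing
toy <- tibble(race = c(1, 2, NA, 3))

toy_not_1 <- toy %>%
  filter(race != 1)   # NA != 1 is NA, so that row is dropped too

nrow(toy_not_1)
```

Only the rows coded 2 and 3 survive the filter, so two rows are left.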

There’s also a way to do OR in a filter. Let’s say you wanted the gender breakdown of people who are white OR Hispanic. That looks like this:

cces20 %>% 
  filter(race == 1 | race == 3) %>% 
  ct(gender)

## # A tibble: 2 x 3
##   gender     n   pct
## *  <dbl> <int> <dbl>
## 1      1 21371 0.433
## 2      2 27937 0.567

The vertical line | is the OR operator in this case. A row is kept if either condition is true.
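
When you are matching several values of the same variable, %in% is a shorter way to write the same thing. This is a small sketch on invented toy data comparing the two approaches.

```r
library(dplyr)

# Toy data, invented for illustration
toy <- tibble(
  race   = c(1, 2, 3, 1, 4),
  gender = c(1, 2, 2, 2, 1)
)

with_or <- toy %>%
  filter(race == 1 | race == 3)    # spelled out with OR

with_in <- toy %>%
  filter(race %in% c(1, 3))        # same rows, less typing

nrow(with_or)
nrow(with_in)
```

Both versions keep the same three rows. The %in% form is easier to read once the list of values gets longer.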

Making a Simple Graph

Let’s say you wanted to visualize the distribution of the age variable in the data.

First you would need to calculate that distribution and save it as a new object. We will do that here and save it as a new dataset called age_graph.

age_graph <- cces20 %>%
  mutate(age = 2020 - birthyr) %>% 
  ct(age)

Now, we will visualize that using the ggplot command like so:

age_graph %>% 
  ggplot(., aes(x = age, y = pct)) +
  geom_col()

A More Complicated Bar Graph

Let’s say you needed to visualize the age distribution of men and women in the dataset. You would do that in three steps.

Step 1 is to create two datasets. One is the age distribution of men. The second is the age distribution of women. In each one we add a gender column with a label so we can tell them apart later.

Like so:

agem <- cces20 %>% 
  filter(gender == 1) %>% 
  mutate(age = 2020 - birthyr) %>% 
  ct(age) %>% 
  mutate(gender = "Men")


agef <- cces20 %>% 
  filter(gender == 2) %>% 
  mutate(age = 2020 - birthyr) %>% 
  ct(age) %>% 
  mutate(gender = "Women")

Notice how we have created two datasets called agem and agef? Step 2 is to mash them together into a new dataset that we will call age_all.

We do that using the bind_rows command like so:

age_all <- bind_rows(agem, agef)

age_all

## # A tibble: 155 x 4
##      age     n   pct gender
##    <dbl> <int> <dbl> <chr> 
##  1    18   223 0.009 Men   
##  2    19   231 0.009 Men   
##  3    20   475 0.018 Men   
##  4    21   348 0.013 Men   
##  5    22   284 0.011 Men   
##  6    23   279 0.011 Men   
##  7    24   331 0.013 Men   
##  8    25   385 0.015 Men   
##  9    26   363 0.014 Men   
## 10    27   353 0.014 Men   
## # ... with 145 more rows
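
If bind_rows is new to you, here is a small sketch on two invented tables. It stacks the second table underneath the first, matching columns by name.

```r
library(dplyr)

# Two toy tables, invented for illustration
men   <- tibble(age = c(18, 19), n = c(2L, 3L), gender = "Men")
women <- tibble(age = c(18, 19), n = c(4L, 1L), gender = "Women")

both <- bind_rows(men, women)   # rows of men first, then rows of women

both
```

The result has four rows and the same three columns, with the gender column telling you which table each row came from.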

Step 3 is to graph them. We will use a new option in ggplot called facet_wrap, which will create a separate graph for each value of the gender variable. Like so:

age_all %>% 
  ggplot(., aes(x = age, y = pct)) +
  geom_col(color= "black") +
  facet_wrap(~ gender)
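
As an alternative to building two datasets and binding them, the same kind of table can be made in one pipeline by counting within groups. This is a sketch in plain dplyr on invented toy data. It is a different route than the one used above, and the column names simply mirror the ones in this handout.

```r
library(dplyr)
library(ggplot2)

# Toy data, invented for illustration (1 = Male, 2 = Female)
toy <- tibble(
  gender  = c(1, 1, 1, 2, 2, 2, 2),
  birthyr = c(2000, 2000, 1990, 2000, 1990, 1990, 1980)
)

toy_age <- toy %>%
  mutate(age = 2020 - birthyr,
         gender = if_else(gender == 1, "Men", "Women")) %>%
  count(gender, age) %>%                  # count each age within each gender
  group_by(gender) %>%
  mutate(pct = round(n / sum(n), 3)) %>%  # shares are within gender
  ungroup()

toy_age %>%
  ggplot(aes(x = age, y = pct)) +
  geom_col(color = "black") +
  facet_wrap(~ gender)
```

The group_by step matters here. Without it, the percentages would be out of everyone in the data rather than out of men and women separately.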