library(tidyverse)
library(janitor)
library(car)

cces <- read_csv("https://raw.githubusercontent.com/ryanburge/cces/master/CCES%20for%20Methods/small_cces.csv")

1 Introduction

Let’s start by taking a look at our dataset using the pipe (%>%) and the glimpse command. In RStudio you can get a pipe easily by hitting CTRL + SHIFT + M.

cces %>%  glimpse()
## Observations: 64,600
## Variables: 33
## $ X1         <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ X1_1       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ id         <int> 222168628, 273691199, 284214415, 287557695, 2903876...
## $ state      <int> 33, 22, 29, 1, 8, 1, 48, 42, 13, 42, 15, 48, 12, 48...
## $ birthyr    <int> 1969, 1994, 1964, 1988, 1982, 1963, 1962, 1991, 196...
## $ gender     <int> 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, ...
## $ educ       <int> 2, 2, 2, 2, 5, 2, 2, 1, 2, 2, 5, 3, 3, 4, 3, 3, 3, ...
## $ race       <int> 1, 1, 2, 2, 1, 6, 1, 1, 1, 1, 4, 1, 6, 3, 6, 7, 1, ...
## $ marital    <int> 1, 5, 5, 5, 1, 4, 2, 2, 1, 1, 1, 1, 1, 6, 5, 5, 5, ...
## $ natecon    <int> 3, 4, 5, 4, 2, 4, 3, 5, 4, 5, 2, 6, 3, 3, 3, 3, 6, ...
## $ mymoney    <int> 2, 3, 2, 4, 2, 4, 3, 5, 4, 3, 2, 1, 3, 3, 2, 5, 4, ...
## $ econfuture <int> 6, 5, 4, 5, 6, 4, 3, 6, 6, 5, 3, 6, 6, 2, 2, 5, 6, ...
## $ police     <int> 2, 3, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 3, 2, 1, 4, 2, ...
## $ background <int> 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, ...
## $ registry   <int> 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 8, 1, 1, 2, 1, 2, 1, ...
## $ assaultban <int> 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 2, ...
## $ conceal    <int> 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, ...
## $ pathway    <int> 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, ...
## $ border     <int> 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, ...
## $ dreamer    <int> 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 1, 1, ...
## $ deport     <int> 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, ...
## $ prochoice  <int> 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1, ...
## $ prolife    <int> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, ...
## $ gaym       <int> 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ employ     <int> 7, 7, 6, 6, 2, 6, 1, 4, 3, 1, 5, 1, 7, 2, 1, 4, 4, ...
## $ pid7       <int> 5, 4, 1, 4, 2, 2, 6, 4, 7, 7, 2, 4, 4, 1, 4, 1, 2, ...
## $ attend     <int> 6, 8, 3, 4, 6, 2, 2, 5, 4, 5, 5, 5, 4, 6, 6, 6, 6, ...
## $ religion   <int> 11, 98, 2, 11, 10, 1, 1, 11, 11, 2, 1, 1, 10, 11, 9...
## $ vote16     <int> 1, 1, NA, NA, 2, 99, 1, NA, 1, 1, 2, 5, 99, NA, 5, ...
## $ ideo5      <int> 3, 3, 5, 4, 2, 6, 5, 3, 4, 3, 3, 4, 6, 1, 3, 1, 2, ...
## $ union      <int> 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, ...
## $ income     <int> 97, 6, 4, 1, 7, 1, 3, 1, 4, 7, 10, 5, 2, 2, 3, 1, 6...
## $ sexuality  <int> 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 6, 1, 1, 1, 1, ...

Glimpse tells you a lot of stuff. First it tells you how many observations you have. Here you have A LOT: 64,600. And how many variables do you have? 33 variables. You can also see the name of each variable, its type (<int> means integer), and the first fifteen or so values in each column.

All this won’t make any sense unless you have the codebook. That can be found here

Take a look at the race row. Look in your codebook. The first person in our dataset is white, the second is white, and the third is black. Do the same for gender.

2 The Tabyl Function

One of the most helpful functions for getting a simple sense of how many people of each race are in the dataset is the tabyl command from the janitor package. Again, this uses cces and then the pipe.

cces %>%  tabyl(race)
##   race     n     percent
## 1    1 46289 0.716547988
## 2    2  7926 0.122693498
## 3    3  5238 0.081083591
## 4    4  2278 0.035263158
## 5    5   522 0.008080495
## 6    6  1452 0.022476780
## 7    7   760 0.011764706
## 8    8   135 0.002089783

So, what percentage of the sample is white? Black? Asian? It gives you the actual count, but it also gives you the percentage. How about we visualize that?

cces %>%  ggplot(.,aes(x=race)) + geom_bar()

All that does is put the race variable into a bar chart, with one bar for the count of each value. (A histogram looks similar, but it bins a continuous variable instead of counting categories.) Sometimes visualizing data can help you understand it better.

Let’s try to visualize the party identification variable. It’s called pid7.

cces %>%  ggplot(.,aes(x=pid7)) + geom_bar()

Notice something odd here? Why does the x axis (which is the one going left and right) stretch so far, with a little bar way over on the right side? We can find out why if we just use the tabyl command.

cces %>%  tabyl(pid7)
##    pid7     n      percent
## 1     1 16251 0.2515634675
## 2     2  8618 0.1334055728
## 3     3  6270 0.0970588235
## 4     4 10493 0.1624303406
## 5     5  5554 0.0859752322
## 6     6  6814 0.1054798762
## 7     7  8479 0.1312538700
## 8     8  2067 0.0319969040
## 9    98    34 0.0005263158
## 10   99    20 0.0003095975

Now it makes sense, right? There are some weird values in there. Do you see them? There are people who are coded 98 and 99. Those are people who don’t know or didn’t respond. We don’t want to plot them, because they make the plot look ugly. So, we can fix that.

3 Filter Command

R has a lot of cool functions. One of the most helpful is called filter. It will filter the data based on whatever parameters we have set up. So, how do we use filter to get rid of those 98 and 99 values? It’s actually pretty simple.

If you just hit the up arrow in the console, it will bring up a prior command. Hit up until you get back to the ggplot line. And then just make a small addition of the filter command.

cces %>%  filter(pid7 < 9) %>% ggplot(.,aes(x=pid7)) + geom_bar()

Do you see what just happened there? We told R to look at the cces data, then to keep only the rows where pid7 is less than 9. And guess what? 98 and 99 are not less than 9. Therefore R drops them and will not plot them in your bar graph.
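If you want to convince yourself of what filter keeps and what it drops, here is a minimal sketch on a tiny made-up data frame (the name toy and its values are just for illustration, they are not from the CCES file). It also shows a second way to write the same filter with %in%, which names the unwanted codes directly.

```r
library(dplyr)

# Toy stand-in for the pid7 column; the values are made up for illustration.
toy <- data.frame(pid7 = c(1, 4, 7, 8, 98, 99))

# filter() keeps the rows where the condition is TRUE.
kept <- toy %>% filter(pid7 < 9)
kept$pid7

# Naming the unwanted codes explicitly gives the same rows here.
kept_explicit <- toy %>% filter(!pid7 %in% c(98, 99))
kept_explicit$pid7
```

Notice that 8 (“not sure”) survives both versions. If you wanted to drop that too, pid7 < 8 would do it.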

Let’s try to use filter in another way. How about we just look at the party identification of men. Take a look at your codebook. What is the variable for male called? It’s called gender, right? And which value is male? Yes, it’s value 1. So, let’s use that info to plot.

cces %>%  filter(gender ==1) %>% ggplot(.,aes(x=pid7)) + geom_bar()

It’s important that you use == (2 equal signs) when you are testing whether two things are equal. But now you have a plot of just males and their party identification. How about adding some colors? And some better labels?

cces %>%  filter(gender ==1) %>%  filter(pid7 < 9) %>% ggplot(.,aes(x=pid7)) + geom_bar(fill = "darkorchid3", color = "black")  + labs(x= "Party Identification", y ="Number of Respondents", title ="Party ID of Males")
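A quick aside on the double equals sign used in these filters. The == operator asks a TRUE/FALSE question about every value, and filter keeps the rows that come back TRUE. Here is a small sketch with a made-up gender vector (not taken from the dataset):

```r
# A made-up gender vector: 1 = male, 2 = female, as in the codebook.
gender <- c(1, 2, 2, 1)

# == tests every element and returns TRUE or FALSE for each one.
is_male <- gender == 1
is_male

# TRUE counts as 1 when summed, so this counts the males.
sum(is_male)
```

A single = is used for naming arguments, not for comparing values, so filter(gender = 1) should stop with an error instead of filtering.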

3.1 Other Ways to Filter

Here are a bunch of other ways to filter.

Say you want to look at every racial group except white, and at the frequency of their responses to the party identification question. The != operator means “not equal to.”

cces %>% filter(race != 1) %>% tabyl(pid7)
##    pid7    n      percent
## 1     1 6635 0.3623504997
## 2     2 3428 0.1872098738
## 3     3 1799 0.0982469554
## 4     4 2917 0.1593031511
## 5     5  850 0.0464201846
## 6     6  956 0.0522090547
## 7     7  990 0.0540658621
## 8     8  714 0.0389929551
## 9    98   15 0.0008191797
## 10   99    7 0.0003822839

But let’s say you just wanted to look at black OR Hispanic respondents. You would use the | symbol, which is just above the enter key. You have to hit shift.

cces %>% filter(race == 2 | race == 3) %>% tabyl(pid7)
##    pid7    n      percent
## 1     1 5646 0.4288969918
## 2     2 2616 0.1987237922
## 3     3 1187 0.0901701610
## 4     4 1670 0.1268611364
## 5     5  414 0.0314494075
## 6     6  551 0.0418565785
## 7     7  575 0.0436797326
## 8     8  486 0.0369188696
## 9    98   12 0.0009115770
## 10   99    7 0.0005317533

Let’s say you wanted to look at the party identification of black males. For AND, use the & symbol.

cces %>% filter(race == 2 & gender ==1) %>% tabyl(pid7)
##    pid7    n      percent
## 1     1 1311 0.4700609537
## 2     2  524 0.1878809609
## 3     3  321 0.1150950161
## 4     4  323 0.1158121190
## 5     5   66 0.0236643958
## 6     6   68 0.0243814987
## 7     7   95 0.0340623880
## 8     8   78 0.0279670133
## 9    98    2 0.0007171029
## 10   99    1 0.0003585515
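Two shortcuts are worth knowing once the | and & versions make sense. This is a sketch on a small made-up data frame (toy is a hypothetical name, and the values are invented): %in% stands in for a chain of ORs, and separating conditions with a comma inside filter works like &.

```r
library(dplyr)

# Made-up rows for illustration only.
toy <- data.frame(race = c(1, 2, 3, 2, 4), gender = c(1, 1, 2, 2, 1))

# Same rows as filter(race == 2 | race == 3).
either <- toy %>% filter(race %in% c(2, 3))

# Same rows as filter(race == 2 & gender == 1).
both <- toy %>% filter(race == 2, gender == 1)

nrow(either)
nrow(both)
```

The %in% form pays off when the list of values you want gets longer than two.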

4 Mutate Command

Let’s say you want to create a new variable that is the age of the respondent. Check out the codebook. There’s a variable called birthyr, let’s check that out.

cces %>%  tabyl(birthyr)
##    birthyr    n      percent
## 1     1917    1 1.547988e-05
## 2     1918    1 1.547988e-05
## 3     1921    3 4.643963e-05
## 4     1922    3 4.643963e-05
## 5     1923   11 1.702786e-04
## 6     1924   13 2.012384e-04
## 7     1925   23 3.560372e-04
## 8     1926   17 2.631579e-04
## 9     1927   32 4.953560e-04
## 10    1928   51 7.894737e-04
## 11    1929   48 7.430341e-04
## 12    1930   96 1.486068e-03
## 13    1931  118 1.826625e-03
## 14    1932  119 1.842105e-03
## 15    1933  148 2.291022e-03
## 16    1934  201 3.111455e-03
## 17    1935  250 3.869969e-03
## 18    1936  297 4.597523e-03
## 19    1937  333 5.154799e-03
## 20    1938  377 5.835913e-03
## 21    1939  436 6.749226e-03
## 22    1940  443 6.857585e-03
## 23    1941  511 7.910217e-03
## 24    1942  609 9.427245e-03
## 25    1943  624 9.659443e-03
## 26    1944  599 9.272446e-03
## 27    1945  639 9.891641e-03
## 28    1946  812 1.256966e-02
## 29    1947  939 1.453560e-02
## 30    1948  955 1.478328e-02
## 31    1949  902 1.396285e-02
## 32    1950 1008 1.560372e-02
## 33    1951  993 1.537152e-02
## 34    1952 1216 1.882353e-02
## 35    1953 1372 2.123839e-02
## 36    1954 1292 2.000000e-02
## 37    1955 1414 2.188854e-02
## 38    1956 1439 2.227554e-02
## 39    1957 1484 2.297214e-02
## 40    1958 1443 2.233746e-02
## 41    1959 1345 2.082043e-02
## 42    1960 1509 2.335913e-02
## 43    1961 1335 2.066563e-02
## 44    1962 1285 1.989164e-02
## 45    1963 1258 1.947368e-02
## 46    1964 1194 1.848297e-02
## 47    1965 1195 1.849845e-02
## 48    1966 1046 1.619195e-02
## 49    1967 1011 1.565015e-02
## 50    1968 1001 1.549536e-02
## 51    1969  964 1.492260e-02
## 52    1970 1042 1.613003e-02
## 53    1971  934 1.445820e-02
## 54    1972  999 1.546440e-02
## 55    1973  835 1.292570e-02
## 56    1974  977 1.512384e-02
## 57    1975  941 1.456656e-02
## 58    1976  890 1.377709e-02
## 59    1977  922 1.427245e-02
## 60    1978 1003 1.552632e-02
## 61    1979  960 1.486068e-02
## 62    1980 1258 1.947368e-02
## 63    1981 1387 2.147059e-02
## 64    1982 1368 2.117647e-02
## 65    1983 1297 2.007740e-02
## 66    1984 1372 2.123839e-02
## 67    1985 1347 2.085139e-02
## 68    1986 1359 2.103715e-02
## 69    1987 1285 1.989164e-02
## 70    1988 1217 1.883901e-02
## 71    1989 1126 1.743034e-02
## 72    1990 1303 2.017028e-02
## 73    1991 1002 1.551084e-02
## 74    1992  913 1.413313e-02
## 75    1993  784 1.213622e-02
## 76    1994  788 1.219814e-02
## 77    1995  750 1.160991e-02
## 78    1996  682 1.055728e-02
## 79    1997  776 1.201238e-02
## 80    1998  668 1.034056e-02

So, what do you see? You see the year that each person was born. But how do we convert that to age? Like this:

cces <- cces %>% mutate(age = 2017 - birthyr)

So, that will give you a new variable called age. That new variable has to be saved, however, and that is why this line begins with “cces <-”. It overwrites the original dataset with a new one that contains the age variable we created. Let’s visualize that.
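To see why the “cces <-” part matters, here is a minimal sketch using a three-row data frame built from the first three birth years shown in the glimpse output (the name toy is hypothetical). Without the assignment, mutate only prints the result and the new column is gone afterwards.

```r
library(dplyr)

# First three birth years from the glimpse output above.
toy <- data.frame(birthyr = c(1969, 1994, 1964))

# This prints a version with age, but toy itself is unchanged.
toy %>% mutate(age = 2017 - birthyr)
has_age_before <- "age" %in% names(toy)

# Assigning the result back is what actually saves the column.
toy <- toy %>% mutate(age = 2017 - birthyr)
has_age_after <- "age" %in% names(toy)

toy$age
```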

cces %>% ggplot(.,aes(x=age)) + geom_bar()

You can see that there are a lot of people in their 60s and 70s, and fewer in their 40s. Let’s do something here. Let’s visualize the age distribution of both males and females.

cces %>% ggplot(.,aes(x=age)) + geom_bar() + facet_grid(.~gender)

Can you figure out what just happened? It is still plotting the distribution of age. But there are two plots (1 and 2). What is 1? What is 2? Well, look at the new addition I made to the command. See that facet_grid command? I told R to plot the age variable separately for each value of gender. We can see in our codebook that 1 is male and 2 is female. That’s what the plot is showing us.
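If you would rather have the panels say Male and Female instead of 1 and 2, one option is to make a labeled column first and facet on that. This is only a sketch on made-up data; toy and gender_label are hypothetical names, and it assumes gender only takes the values 1 and 2 as the codebook describes.

```r
library(dplyr)
library(ggplot2)

# Made-up ages and genders for illustration only.
toy <- data.frame(age = c(48, 23, 53, 29, 35, 54),
                  gender = c(2, 2, 1, 2, 1, 1))

# Turn the numeric code into a readable label (assumes only 1 and 2 occur).
toy <- toy %>% mutate(gender_label = if_else(gender == 1, "Male", "Female"))

# Facet on the label so each panel gets a meaningful title.
p <- toy %>%
  ggplot(aes(x = age)) +
  geom_bar() +
  facet_grid(. ~ gender_label)
p
```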

5 Crosstab Command

One of the most basic, yet important, types of analysis we do as social scientists is called a crosstab. That is basically a two-way frequency table. If you wanted to take a look at union membership, that’s easy to do with the tabyl command. Right?

cces %>% tabyl(union)
##   union     n     percent
## 1     1  4804 0.074365325
## 2     2 11496 0.177956656
## 3     3 48162 0.745541796
## 4     8   138 0.002136223

But that just tells us what union membership looks like in the entire sample. Let’s say that I’m interested in knowing how union membership varies across racial groups. Well, we could use the filter command, right? So if you wanted to see what percentage of white people were in a union, you could do that like this:

cces %>% filter(race ==1) %>% tabyl(union)
##   union     n     percent
## 1     1  3359 0.072565836
## 2     2  8709 0.188144052
## 3     3 34139 0.737518633
## 4     8    82 0.001771479

But here’s the problem: if I wanted to see that relationship for each of the eight racial groups, that’s eight lines of code. How about one line of code?

cces %>%  crosstab(race, union)
##   race    1    2     3  8
## 1    1 3359 8709 34139 82
## 2    2  640 1328  5938 20
## 3    3  439  640  4138 21
## 4    4  158  169  1941 10
## 5    5   44  126   352  0
## 6    6  100  264  1085  3
## 7    7   51  239   468  2
## 8    8   13   21   101  0

What you have going down the table is each of the racial groups. 1 is white, 2 is black, etc. Then across the columns are the three response options to the union membership question, plus a stray code 8 (check the codebook for what it means). So how many Hispanics have never been part of a union? 4138.

5.1 Adding Percentages

But, what percentage is that? You can add a little extra code and get that easily.

cces %>%  crosstab(race, union) %>% adorn_crosstab(denom = "row")
##   race           1            2             3         8
## 1    1 7.3% (3359) 18.8% (8709) 73.8% (34139) 0.2% (82)
## 2    2 8.1%  (640) 16.8% (1328) 74.9%  (5938) 0.3% (20)
## 3    3 8.4%  (439) 12.2%  (640) 79.0%  (4138) 0.4% (21)
## 4    4 6.9%  (158)  7.4%  (169) 85.2%  (1941) 0.4% (10)
## 5    5 8.4%   (44) 24.1%  (126) 67.4%   (352) 0.0%  (0)
## 6    6 6.9%  (100) 18.2%  (264) 74.7%  (1085) 0.2%  (3)
## 7    7 6.7%   (51) 31.4%  (239) 61.6%   (468) 0.3%  (2)
## 8    8 9.6%   (13) 15.6%   (21) 74.8%   (101) 0.0%  (0)

So, what percentage of Hispanics have never been in a union? 79.0%. Which racial group has the lowest union membership? Asians. Over 85% have never been in a union.
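A note in case R tells you it cannot find crosstab: as far as I know, newer releases of the janitor package dropped crosstab() and adorn_crosstab() in favor of passing two variables to tabyl() and then chaining the adorn_ helpers. Here is a sketch of that newer style on a tiny made-up data frame (toy is a hypothetical name and the values are invented, so no output from the real data is shown).

```r
library(dplyr)
library(janitor)

# Made-up race and union codes for illustration only.
toy <- data.frame(race  = c(1, 1, 1, 2, 2, 3),
                  union = c(1, 3, 3, 2, 3, 3))

# Two variables inside tabyl() give a two-way table of counts.
counts <- toy %>% tabyl(race, union)
counts

# Row percentages with the counts in parentheses,
# similar in spirit to adorn_crosstab(denom = "row").
counts %>%
  adorn_percentages("row") %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()
```

On the real data you would swap toy for cces and leave the rest alone.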

6 Recode Command

A lot of what social scientists do is create new variables by recoding old variables. Let’s start by creating a dichotomous variable out of the gender variable. We want this new variable to be called male, and males will have a value of 1, while everyone else will be coded as zero.

cces <- cces %>% mutate(male = recode(gender, "1=1; else=0"))
cces %>% tabyl(male)
##   male     n   percent
## 1    0 35069 0.5428638
## 2    1 29531 0.4571362

Do you see what happened? Now males are 1 and everyone else is zero.
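One thing to watch for: both car and dplyr have a function called recode, and they take different arguments. The string style used here ("1=1; else=0") is the car version, which works because car was loaded after tidyverse at the top. If you ever hit a confusing error, writing car::recode removes the ambiguity. For a simple yes/no variable you can also skip recode entirely. Here is a sketch of that alternative with dplyr’s if_else on made-up data (toy is a hypothetical name):

```r
library(dplyr)

# Made-up gender codes: 1 = male, 2 = female.
toy <- data.frame(gender = c(2, 2, 1, 2, 1))

# if_else(test, value if TRUE, value if FALSE)
toy <- toy %>% mutate(male = if_else(gender == 1, 1, 0))

toy$male
```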

Let’s do something a bit more difficult. Look in your codebook for a variable called econfuture. Notice how the lower values like 1 indicate that the next year will be a lot better than the previous year, and high values say that next year will be a lot worse? Doesn’t that seem backwards? Shouldn’t high values mean greater things and low values mean less? And what about the 6 value, which means “unsure,” and 8, which means “skipped”? We need to clean all of that up.

cces <- cces %>% mutate(econ2 = recode(econfuture, "1=5; 2=4; 3=3; 4=2; 5=1; else=99"))

So, we just reverse coded everything. You see how 1 has now become a 5 and so on? But why did we do “else=99”? The answer is this: we wanted the weird responses to get a weird number, so that when we do further analysis we remember to filter them out instead of plotting them. So, let’s filter and visualize.

cces %>% filter(econ2 < 10) %>% ggplot(.,aes(x=econ2)) + geom_bar()

How about we visualize that by race?

cces %>% filter(econ2 < 10) %>% ggplot(.,aes(x=econ2)) + geom_bar() + facet_grid(.~race)

We can see now that each race has its own little bar chart. However, what makes this hard to really interpret is that there are so many more white people in the sample than other races that it distorts the plot. We can do this a different way to see which race is the most optimistic about the future. Scroll down to the Group By and Summarise section to see.
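The 99 trick works, but you have to remember to filter every time. A different approach, which is not the one used in this tutorial, is to turn the weird responses into NA (R’s missing value) so that functions like mean can skip them with na.rm = TRUE. Here is a sketch with dplyr’s case_when on made-up data (toy is a hypothetical name). It relies on the fact that reversing a 1 to 5 scale is the same as computing 6 minus the value.

```r
library(dplyr)

# Made-up responses, including the "unsure" (6) and "skipped" (8) codes.
toy <- data.frame(econfuture = c(1, 2, 3, 4, 5, 6, 8))

# Values 1 through 5 are reversed; anything unmatched becomes NA.
toy <- toy %>%
  mutate(econ2 = case_when(econfuture %in% 1:5 ~ 6 - econfuture))

toy$econ2

# na.rm = TRUE tells mean() to ignore the missing values.
avg <- mean(toy$econ2, na.rm = TRUE)
avg
```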

6.1 Other Ways to Recode

Let’s say we wanted to do a range of values. For example the “pid7” variable is coded where 1 is Strong Democrat, 2 is Not very Strong Democrat, 3 is Lean Democrat, 4 is Independent, 5 is Lean Republican, 6 is Not very strong Republican, and 7 is Strong Republican. Then, 8 is not sure. And there are some 98 and 99. We want to collapse that into 3 categories: Democrat, Independent, and Republican.

cces <- cces %>% mutate(pid3 = recode(pid7, "1:3=1; 4=2; 5:7=3; else=99"))
cces %>% tabyl(pid3)
##   pid3     n    percent
## 1    1 31139 0.48202786
## 2    2 10493 0.16243034
## 3    3 20847 0.32270898
## 4   99  2121 0.03283282

Now we have a pid3 variable, but an outside reader wouldn’t know what 1 is vs 2 vs 3. We can change that. The key here is to use single quotes (' ') to surround the text:

cces <- cces %>% mutate(pid3 = recode(pid3, "1= 'Democrat'; 2 = 'Independent'; 3 = 'Republican'"))
cces %>% tabyl(pid3)
##          pid3     n    percent
## 1          99  2121 0.03283282
## 2    Democrat 31139 0.48202786
## 3 Independent 10493 0.16243034
## 4  Republican 20847 0.32270898
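The two recode steps above can also be done in one pass. This sketch swaps in dplyr’s case_when, which is a different tool from the recode command used in this tutorial, and runs it on made-up values (toy is a hypothetical name). Anything that does not match a rule, like 8, 98 and 99, comes out as NA rather than 99.

```r
library(dplyr)

# Made-up pid7 values for illustration only.
toy <- data.frame(pid7 = c(1, 3, 4, 6, 8, 98))

# Each line reads: condition ~ label to assign.
toy <- toy %>%
  mutate(pid3 = case_when(
    pid7 %in% 1:3 ~ "Democrat",
    pid7 == 4     ~ "Independent",
    pid7 %in% 5:7 ~ "Republican"
  ))

toy$pid3
```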

7 Arrange Command

A really simple thing to do is just figure out how to arrange a column from least to most or most to least. Let’s say that we wanted to arrange this CCES data based on who is the youngest.

(I need a quick command called select to show you only the X1 and age columns.)

cces %>% arrange(age) %>% select(X1, age)
## # A tibble: 64,600 x 2
##       X1   age
##    <int> <dbl>
##  1  1031    19
##  2  1964    19
##  3  2258    19
##  4  2917    19
##  5  4927    19
##  6  5280    19
##  7  6638    19
##  8  6651    19
##  9  7947    19
## 10  9136    19
## # ... with 64,590 more rows

So, person number 1031 is 19 years old, along with a bunch of other people.

How about the other way? And finding the oldest person? Just add a negative sign.

cces %>% arrange(-age) %>% select(X1, age)
## # A tibble: 64,600 x 2
##       X1   age
##    <int> <dbl>
##  1 43462   100
##  2 59736    99
##  3  3748    96
##  4  3940    96
##  5 60869    96
##  6 22957    95
##  7 42504    95
##  8 54522    95
##  9   560    94
## 10  1701    94
## # ... with 64,590 more rows

Person #43462 is the oldest at 100 years old.
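The negative sign only works on numbers. dplyr also has a helper called desc that does the same job and works on text columns too. Here is a sketch on made-up data (toy is a hypothetical name), with head(1) added to pull out just the top row:

```r
library(dplyr)

# Made-up ids and ages for illustration only.
toy <- data.frame(X1 = 1:5, age = c(48, 23, 53, 29, 35))

# desc() sorts from largest to smallest; head(1) keeps the first row.
oldest <- toy %>% arrange(desc(age)) %>% head(1)
oldest
```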

8 Group By and Summarise

We are going to add two new commands here and they make magic happen. So, we are trying to see how each race feels about their economic future. Here’s how we do that.

cces %>% group_by(race) %>% summarise(avg = mean(econ2))
## # A tibble: 8 x 2
##    race      avg
##   <int>    <dbl>
## 1     1 18.85066
## 2     2 16.33560
## 3     3 14.80031
## 4     4 14.23354
## 5     5 15.48659
## 6     6 22.85606
## 7     7 25.75263
## 8     8 21.01481

Okay, this is not right though. You want to guess why? It’s because we have some 99’s that we coded in there. We need to get rid of those.

cces %>% filter(econ2 < 10) %>% group_by(race) %>% summarise(avg = mean(econ2))
## # A tibble: 8 x 2
##    race      avg
##   <int>    <dbl>
## 1     1 2.927467
## 2     2 3.266803
## 3     3 3.101326
## 4     4 3.074516
## 5     5 2.766004
## 6     6 2.860000
## 7     7 2.521664
## 8     8 3.290909

Which racial group is most optimistic about their future? It’s the one with the highest “avg” score. In this case that’s the group coded 8, with the group coded 2 (black) right behind it. Which is the lowest? It’s the group coded 7. Look up codes 7 and 8 in your codebook to put names on them. Let’s try one more. Let’s do education.

cces %>% filter(econ2 < 10) %>% group_by(educ) %>% summarise(avg = mean(econ2))
## # A tibble: 6 x 2
##    educ      avg
##   <int>    <dbl>
## 1     1 2.735443
## 2     2 2.814411
## 3     3 2.937572
## 4     4 2.922579
## 5     5 3.130910
## 6     6 3.217549

What do you see here? It’s actually an interesting pattern. As one goes from lower levels of education to higher levels, the overall optimism generally goes up (with a tiny dip between levels 3 and 4). That makes intuitive sense, right? The more education you have, the more you think that your future is going to be better.

How do we visualize that? Well, it gets a little tricky. First, you need to realize that you have just created a new dataset. It has two columns. One is educ and the other is avg. So let’s save that new dataset. Here’s how.

plot <- cces %>% filter(econ2 < 10) %>% group_by(educ) %>% summarise(avg = mean(econ2))

Now we have a new dataset called plot that we can use to actually make a visualization. Here’s the structure for that:

plot %>% ggplot(.,aes(x=educ, y=avg)) + geom_col()
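Two small extras you might want here. First, adding n = n() inside summarise tells you how many respondents sit behind each average, which matters when some groups are tiny. Second, labs works on this plot just like it did on the earlier ones. This is a sketch on made-up data; toy and educ_summary are hypothetical names, and the label text is only a suggestion.

```r
library(dplyr)
library(ggplot2)

# Made-up education levels and recoded optimism scores (99 = weird response).
toy <- data.frame(educ  = c(1, 1, 2, 2, 2),
                  econ2 = c(2, 4, 3, 99, 5))

# One row per education level: the average score and the group size.
educ_summary <- toy %>%
  filter(econ2 < 10) %>%
  group_by(educ) %>%
  summarise(avg = mean(econ2), n = n())
educ_summary

# geom_col() draws bars whose heights come straight from the avg column.
p <- educ_summary %>%
  ggplot(aes(x = educ, y = avg)) +
  geom_col() +
  labs(x = "Education Level", y = "Average Optimism Score",
       title = "Economic Optimism by Education")
p
```

The summary is saved as educ_summary rather than plot here only because plot is also the name of a built-in R function. Either name works.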

9 Conclusion

These are just a handful of the things that R can do. Its capabilities are nearly endless. Later in the semester I will be giving you two assignments that will require the use of statistical software to complete. You will use the commands I taught you here. Let me know if you have questions or comments about this tutorial.