library(building)
library(car)
library(janitor)
cces <- read_csv("https://raw.githubusercontent.com/ryanburge/cces/master/CCES%20for%20Methods/small_cces.csv")
Let’s start by taking a look at our dataset using the pipe (%>%) and the glimpse command. You can type a pipe quickly in RStudio by hitting CTRL + SHIFT + M.
cces %>% glimpse()
## Observations: 64,600
## Variables: 33
## $ X1 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ X1_1 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ id <int> 222168628, 273691199, 284214415, 287557695, 2903876...
## $ state <int> 33, 22, 29, 1, 8, 1, 48, 42, 13, 42, 15, 48, 12, 48...
## $ birthyr <int> 1969, 1994, 1964, 1988, 1982, 1963, 1962, 1991, 196...
## $ gender <int> 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, ...
## $ educ <int> 2, 2, 2, 2, 5, 2, 2, 1, 2, 2, 5, 3, 3, 4, 3, 3, 3, ...
## $ race <int> 1, 1, 2, 2, 1, 6, 1, 1, 1, 1, 4, 1, 6, 3, 6, 7, 1, ...
## $ marital <int> 1, 5, 5, 5, 1, 4, 2, 2, 1, 1, 1, 1, 1, 6, 5, 5, 5, ...
## $ natecon <int> 3, 4, 5, 4, 2, 4, 3, 5, 4, 5, 2, 6, 3, 3, 3, 3, 6, ...
## $ mymoney <int> 2, 3, 2, 4, 2, 4, 3, 5, 4, 3, 2, 1, 3, 3, 2, 5, 4, ...
## $ econfuture <int> 6, 5, 4, 5, 6, 4, 3, 6, 6, 5, 3, 6, 6, 2, 2, 5, 6, ...
## $ police <int> 2, 3, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 3, 2, 1, 4, 2, ...
## $ background <int> 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, ...
## $ registry <int> 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 8, 1, 1, 2, 1, 2, 1, ...
## $ assaultban <int> 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 2, ...
## $ conceal <int> 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, ...
## $ pathway <int> 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, ...
## $ border <int> 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, ...
## $ dreamer <int> 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 1, 1, ...
## $ deport <int> 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, ...
## $ prochoice <int> 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1, ...
## $ prolife <int> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, ...
## $ gaym <int> 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ employ <int> 7, 7, 6, 6, 2, 6, 1, 4, 3, 1, 5, 1, 7, 2, 1, 4, 4, ...
## $ pid7 <int> 5, 4, 1, 4, 2, 2, 6, 4, 7, 7, 2, 4, 4, 1, 4, 1, 2, ...
## $ attend <int> 6, 8, 3, 4, 6, 2, 2, 5, 4, 5, 5, 5, 4, 6, 6, 6, 6, ...
## $ religion <int> 11, 98, 2, 11, 10, 1, 1, 11, 11, 2, 1, 1, 10, 11, 9...
## $ vote16 <int> 1, 1, NA, NA, 2, 99, 1, NA, 1, 1, 2, 5, 99, NA, 5, ...
## $ ideo5 <int> 3, 3, 5, 4, 2, 6, 5, 3, 4, 3, 3, 4, 6, 1, 3, 1, 2, ...
## $ union <int> 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, ...
## $ income <int> 97, 6, 4, 1, 7, 1, 3, 1, 4, 7, 10, 5, 2, 2, 3, 1, 6...
## $ sexuality <int> 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 6, 1, 1, 1, 1, ...
Glimpse tells you a lot. First, it tells you how many observations you have. Here you have A LOT: 64,600. And how many variables do you have? 33. You can also see the name of each variable, its type, and the first several values in each column.
All this won’t make any sense unless you have the codebook. That can be found here
Take a look at the race row. Look in your codebook. The first person in our dataset is white, the second is white, and the third is black. Do the same for gender.
One of the most helpful ways to get a simple sense of how many people of each race are in the dataset is the ct command. Again, this uses cces and then the pipe.
cces %>% ct(race)
## # A tibble: 8 x 3
## race n pct
## <int> <int> <dbl>
## 1 1 46289 0.717
## 2 2 7926 0.123
## 3 3 5238 0.081
## 4 4 2278 0.035
## 5 5 522 0.008
## 6 6 1452 0.022
## 7 7 760 0.012
## 8 8 135 0.002
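If ct() is not available on your machine, a table like the one above can be approximated with plain dplyr. This is a minimal sketch on made-up toy data, and it assumes ct() does roughly this (count each value, then divide by the total and round); the toy tibble and the name race_ct are only for illustration.

```r
library(dplyr)

# Toy data standing in for cces; these values are made up for illustration.
toy <- tibble(race = c(1, 1, 1, 2, 2, 3, 1, 1, 4, 2))

# Count each category, then turn the counts into shares of the total.
race_ct <- toy %>%
  count(race) %>%
  mutate(pct = round(n / sum(n), 3))

race_ct
```

The result has one row per race code, with n as the count and pct as the share, which is the same shape of table that gets passed to ggplot in the next step.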
So, what percentage of the sample is white? Black? Asian? It gives you the actual count, but it also gives you the percentage. How about we visualize that? Let’s start by saving our count results, then making a graph.
count1 <- cces %>%
ct(race)
count1 %>% ggplot(.,aes(x=race, y =pct)) + geom_col()
All that does is put the race variable into a bar chart. A histogram looks similar, but it bins a continuous variable rather than counting categories. Sometimes visualizing data can help you understand it better.
Let’s try to visualize the party identification variable. It’s called pid7.
count2 <- cces %>%
ct(pid7)
count2 %>% ggplot(.,aes(x=pid7, y = pct)) + geom_col()
You notice something odd here? Why does the x axis (which is the one going left and right) have a little bar way over on the right side? Well, we can find out why if we just use the ct command.
cces %>% ct(pid7)
## # A tibble: 10 x 3
## pid7 n pct
## <int> <int> <dbl>
## 1 1 16251 0.252
## 2 2 8618 0.133
## 3 3 6270 0.097
## 4 4 10493 0.162
## 5 5 5554 0.086
## 6 6 6814 0.105
## 7 7 8479 0.131
## 8 8 2067 0.032
## 9 98 34 0.001
## 10 99 20 0
Now it makes sense, right? There are some weird values in there. Do you see them? There are people who are coded 98 and 99. Those values are people who don’t know or didn’t respond. We don’t want to plot them, because they make the plot look ugly. So, we can fix that.
R has a lot of cool functions. One of the most helpful is called filter. It will filter the data based on whatever parameters we have set up. So, how do we use filter to get rid of those 98 and 99 values? It’s actually pretty simple.
If you just hit the up arrow, it will bring up a prior command. Hit up until you get back to the last ct() line. And then just make a small addition: the filter command.
count2 <- cces %>%
filter(pid7 < 9) %>%
ct(pid7)
count2 %>% ggplot(., aes(x=pid7, y = pct)) + geom_col()
Do you see what just happened there? We told R to look at the cces data. Then to keep only the rows where pid7 is less than 9. And guess what? 98 and 99 are not less than 9. Therefore R drops them and they will not show up in your bar graph.
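The key idea is that the condition inside filter() describes the rows to keep, not the rows to drop. Here is a small sketch on made-up toy data showing two ways of writing the same filter; the toy values and the names kept_lt and kept_in are only for illustration.

```r
library(dplyr)

# Toy party ID values, including the 98 and 99 "don't know / no answer" codes.
toy <- tibble(pid7 = c(1, 4, 7, 8, 98, 99, 2))

# Keep rows where pid7 is below 9.
kept_lt <- toy %>% filter(pid7 < 9)

# Same result, written as "keep rows whose pid7 is NOT 98 or 99".
kept_in <- toy %>% filter(!pid7 %in% c(98, 99))

# Both approaches keep the same five rows.
identical(kept_lt$pid7, kept_in$pid7)
```

The second form is a little longer, but it names the unwanted codes explicitly, which can be easier to read later.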
Let’s try to use filter in another way. How about we just look at the party identification of men. Take a look at your codebook. What is the variable for male called? It’s called gender, right? And which value is male? Yes, it’s value 1. So, let’s use that info to plot.
count3 <- cces %>%
filter(gender ==1) %>%
ct(pid7)
count3 %>%
filter(pid7 < 9) %>%
ggplot(.,aes(x=pid7, y = pct)) + geom_col()
It’s important that you do == (2 equal signs). But now you have a plot of just males and their party identification. How about adding some colors? And some better labels?
count3 %>%
filter(pid7 < 9) %>%
ggplot(.,aes(x=pid7, y = pct)) +
geom_col(fill = "darkorchid3", color = "black") +
labs(x= "Party Identification", y ="Share of Respondents", title ="Party ID of Males")
Here are a few more ways to filter.
Say you want to look at every racial group except white and see the frequency of responses to the party identification question. Use != which means “not equal to.”
cces %>% filter(race != 1) %>% ct(pid7)
## # A tibble: 10 x 3
## pid7 n pct
## <int> <int> <dbl>
## 1 1 6635 0.362
## 2 2 3428 0.187
## 3 3 1799 0.098
## 4 4 2917 0.159
## 5 5 850 0.046
## 6 6 956 0.052
## 7 7 990 0.054
## 8 8 714 0.039
## 9 98 15 0.001
## 10 99 7 0
But let’s say you just wanted to look at black OR hispanic. You would use the |, which is just above the enter key. You have to hit shift.
cces %>% filter(race == 2 | race == 3) %>% ct(pid7)
## # A tibble: 10 x 3
## pid7 n pct
## <int> <int> <dbl>
## 1 1 5646 0.429
## 2 2 2616 0.199
## 3 3 1187 0.09
## 4 4 1670 0.127
## 5 5 414 0.031
## 6 6 551 0.042
## 7 7 575 0.044
## 8 8 486 0.037
## 9 98 12 0.001
## 10 99 7 0.001
Let’s say you wanted to look at the party identification of black males. That’s the & symbol.
cces %>% filter(race == 2 & gender ==1) %>% ct(pid7)
## # A tibble: 10 x 3
## pid7 n pct
## <int> <int> <dbl>
## 1 1 1311 0.47
## 2 2 524 0.188
## 3 3 321 0.115
## 4 4 323 0.116
## 5 5 66 0.024
## 6 6 68 0.024
## 7 7 95 0.034
## 8 8 78 0.028
## 9 98 2 0.001
## 10 99 1 0
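To see the three operators side by side, here is a small sketch on made-up toy data. The race and gender codes follow the ones used above (race 2 is black, race 3 is Hispanic, gender 1 is male); the toy rows themselves are invented, and %in% is shown as an optional shorthand for a chain of | conditions.

```r
library(dplyr)

# Toy data: eight made-up respondents.
toy <- tibble(
  race   = c(1, 2, 3, 2, 4, 1, 2, 3),
  gender = c(1, 1, 2, 2, 1, 2, 1, 1)
)

# != keeps everyone whose race is NOT 1.
not_white <- toy %>% filter(race != 1)

# | keeps rows that match either condition.
black_or_hisp <- toy %>% filter(race == 2 | race == 3)

# %in% is shorthand for several | comparisons on the same variable.
same_with_in <- toy %>% filter(race %in% c(2, 3))

# & keeps rows that match both conditions.
black_men <- toy %>% filter(race == 2 & gender == 1)

nrow(not_white)
nrow(black_or_hisp)
nrow(black_men)
```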
Let’s say you want to create a new variable that is the age of the respondent. Check out the codebook. There’s a variable called birthyr, let’s check that out.
cces %>% ct(birthyr)
## # A tibble: 80 x 3
## birthyr n pct
## <int> <int> <dbl>
## 1 1917 1 0
## 2 1918 1 0
## 3 1921 3 0
## 4 1922 3 0
## 5 1923 11 0
## 6 1924 13 0
## 7 1925 23 0
## 8 1926 17 0
## 9 1927 32 0
## 10 1928 51 0.001
## # ... with 70 more rows
So, what do you see? You see the year that each person was born. But how do we convert that to age? Like this:
cces <- cces %>% mutate(age = 2016 - birthyr)
So, that will give you a new variable called age. The new variable has to be saved, however, and that is why the line begins with “cces <-”: it overwrites the original dataset with a new one that contains the age variable we created. Let’s visualize that.
age_count <- cces %>% ct(age)
age_count %>% ggplot(.,aes(x=age, y = pct)) + geom_col()
You can see that there are a lot of people in their 60s and 70s, and fewer in their 40s. Let’s take this one step further and visualize the age distribution of both males and females.
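A minimal sketch of one way to do that, using made-up toy data rather than cces. It computes age, counts each age within each gender, and converts the counts to within-gender shares so the two panels are comparable. Gender 1 is male as in the codebook; treating 2 as female is an assumption here, as are the names age_by_gender and p.

```r
library(dplyr)
library(ggplot2)

# Toy data: six made-up respondents.
toy <- tibble(
  birthyr = c(1950, 1950, 1980, 1990, 1990, 1990),
  gender  = c(1, 2, 1, 2, 2, 1)
)

age_by_gender <- toy %>%
  mutate(
    age    = 2016 - birthyr,
    gender = if_else(gender == 1, "Male", "Female")  # assumes 2 = female
  ) %>%
  count(gender, age) %>%        # one row per gender/age combination
  group_by(gender) %>%
  mutate(pct = n / sum(n)) %>%  # share within each gender
  ungroup()

# One panel per gender, sharing the same axes.
p <- ggplot(age_by_gender, aes(x = age, y = pct)) +
  geom_col() +
  facet_wrap(~ gender)
p
```

With the real data, the same pipeline would start from cces instead of toy.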
One of the most basic, yet important, types of analysis we do as social scientists is called a crosstab. That is basically a two-way frequency table. If you wanted to take a look at union membership on its own, that’s easy to do with the ct command, right?
cces %>% ct(union)
## # A tibble: 4 x 3
## union n pct
## <int> <int> <dbl>
## 1 1 4804 0.074
## 2 2 11496 0.178
## 3 3 48162 0.746
## 4 8 138 0.002
That just tells us what union membership looks like in the entire sample. But let’s say that I’m interested in knowing how union membership varies by racial group. Well, we could use the filter command, right? So if you wanted to see what percentage of white people were in a union you could do that like this:
cces %>% filter(race ==1) %>% ct(union)
## # A tibble: 4 x 3
## union n pct
## <int> <int> <dbl>
## 1 1 3359 0.073
## 2 2 8709 0.188
## 3 3 34139 0.738
## 4 8 82 0.002
But here’s the problem: if I wanted to see that relationship for each of the eight racial groups, that’s eight lines of code. How about one line of code?
cces %>% tabyl(race, union)
## race 1 2 3 8
## 1 3359 8709 34139 82
## 2 640 1328 5938 20
## 3 439 640 4138 21
## 4 158 169 1941 10
## 5 44 126 352 0
## 6 100 264 1085 3
## 7 51 239 468 2
## 8 13 21 101 0
What you have going down the table is each of the racial groups. 1 is white, 2 is black, etc. Then across the columns are the response options to the union membership question. So how many Hispanics have never been part of a union? 4138.
But, what percentage is that? You can add a little extra code and get that easily.
cces %>% crosstab(race, union) %>% adorn_crosstab(denom = "row")
## race 1 2 3 8
## 1 1 7.3% (3359) 18.8% (8709) 73.8% (34139) 0.2% (82)
## 2 2 8.1% (640) 16.8% (1328) 74.9% (5938) 0.3% (20)
## 3 3 8.4% (439) 12.2% (640) 79.0% (4138) 0.4% (21)
## 4 4 6.9% (158) 7.4% (169) 85.2% (1941) 0.4% (10)
## 5 5 8.4% (44) 24.1% (126) 67.4% (352) 0.0% (0)
## 6 6 6.9% (100) 18.2% (264) 74.7% (1085) 0.2% (3)
## 7 7 6.7% (51) 31.4% (239) 61.6% (468) 0.3% (2)
## 8 8 9.6% (13) 15.6% (21) 74.8% (101) 0.0% (0)
So, what percentage of Hispanics have never been in a union? 79.0%. Which racial group has the lowest union membership? Asians. Over 85% have never been in a union.
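Newer releases of the janitor package build this kind of table from tabyl() plus the adorn_ helper functions. The sketch below uses made-up toy data and shows that chain as an alternative way of getting row percentages with the counts in parentheses; the toy values and the names row_pct and formatted are only for illustration.

```r
library(dplyr)
library(janitor)

# Toy data: ten made-up respondents.
toy <- tibble(
  race  = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
  union = c(1, 3, 3, 3, 2, 3, 1, 1, 3, 3)
)

# Two-way table of counts, converted to proportions within each row (race).
row_pct <- toy %>%
  tabyl(race, union) %>%
  adorn_percentages("row")

# Format as percentages and append the raw counts in parentheses.
formatted <- row_pct %>%
  adorn_pct_formatting(digits = 1) %>%
  adorn_ns()

formatted
```

Because the percentages are computed within each row, every row adds up to 100%, which is what lets you compare racial groups of very different sizes.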
A lot of what social scientists do is create new variables by recoding old variables. Let’s start by creating a dichotomous variable out of the gender variable. We want this new variable to be called male, and males will have a value of 1, while everyone else will be coded as zero.
cces <- cces %>% mutate(male = car::recode(gender, "1=1; else=0"))
cces %>% tabyl(male)
## male n percent
## 0 35069 0.5428638
## 1 29531 0.4571362
Do you see what happened? Now males are 1 and everyone else is zero.
Let’s do something a bit more difficult. Look in your codebook for a variable called econfuture. You notice how the lower values like 1 indicate that the next year will be a lot better than the previous year, and high values say that next year will be a lot worse? Doesn’t that seem backwards? Shouldn’t high values mean greater things and low values mean less? And what about the 6 value, which means “unsure,” and 8, which means “skipped”? We need to clean all of that up.
cces <- cces %>% mutate(econ2 = car::recode(econfuture, "1=5; 2=4; 3=3; 4=2; 5=1; else=99"))
So, we just reverse coded everything. You see how 1 has now become a 5 and so on? But why did we do “else=99”? The answer is this: we gave those weird responses an obviously weird number so that, when we do further analysis, we remember to filter them out before plotting or averaging. So, let’s filter and visualize.
econ_ct <- cces %>% filter(econ2 < 10) %>% ct(econ2)
econ_ct %>%
ggplot(.,aes(x=econ2, y =pct)) + geom_col()
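The same reverse coding can also be sketched with dplyr’s case_when(), as an alternative to the recode string. This version is an assumption-labeled variation rather than the approach used in this tutorial: it flips the 1 to 5 scale with 6 minus the old value and turns everything else into NA instead of 99, so functions like mean() can skip those rows with na.rm = TRUE. The toy data is made up.

```r
library(dplyr)

# Toy data covering every econfuture code mentioned above (1-5, 6, 8).
toy <- tibble(econfuture = c(1, 2, 3, 4, 5, 6, 8))

toy <- toy %>%
  mutate(econ2 = case_when(
    econfuture %in% 1:5 ~ 6 - econfuture,  # 1 -> 5, 2 -> 4, ..., 5 -> 1
    TRUE                ~ NA_real_         # unsure / skipped become missing
  ))

toy

# Missing values are skipped instead of being filtered out by hand.
mean(toy$econ2, na.rm = TRUE)
```

The trade-off is that NA values are silently dropped by many functions, while a 99 stays visible until you filter it out.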
How about we visualize that by race?
cces %>% filter(econ2 < 10) %>% ggplot(.,aes(x=econ2)) + geom_bar() + facet_grid(.~race)
We can see now that each race has its own little bar chart. However, what makes this hard to really interpret is that there are so many more white people in the sample than other races that it distorts the plot. We can do this a different way to see which race is the most optimistic about the future.
A really simple but useful thing to do is arrange a column from least to most or most to least. Let’s say that we wanted to arrange this CCES data based on who is the youngest.
(I need to just do a quick command called select to show you only the X1 and age columns.)
cces %>% arrange(age) %>% select(X1, age)
## # A tibble: 64,600 x 2
## X1 age
## <int> <dbl>
## 1 1031 18
## 2 1964 18
## 3 2258 18
## 4 2917 18
## 5 4927 18
## 6 5280 18
## 7 6638 18
## 8 6651 18
## 9 7947 18
## 10 9136 18
## # ... with 64,590 more rows
So, person number 1031 is 18 years old. Along with a bunch of other people.
How about the other way? And finding the oldest person? Just add a negative sign.
cces %>% arrange(-age) %>% select(X1, age)
## # A tibble: 64,600 x 2
## X1 age
## <int> <dbl>
## 1 43462 99
## 2 59736 98
## 3 3748 95
## 4 3940 95
## 5 60869 95
## 6 22957 94
## 7 42504 94
## 8 54522 94
## 9 560 93
## 10 1701 93
## # ... with 64,590 more rows
Person #43462 is the oldest at 99 years old.
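Here is the same sorting idea on made-up toy data. It also shows desc(), which is dplyr’s way of writing “descending” and works like the negative sign for numbers, and slice_head(), which keeps just the first rows after sorting. The toy values and the names youngest and oldest are only for illustration.

```r
library(dplyr)

# Toy data: five made-up respondents.
toy <- tibble(id = 1:5, age = c(40, 18, 99, 25, 18))

# Sort ascending and keep the first row: one of the youngest people.
youngest <- toy %>% arrange(age) %>% slice_head(n = 1)

# Sort descending and keep the first row: the oldest person.
oldest <- toy %>% arrange(desc(age)) %>% slice_head(n = 1)

youngest
oldest
```

When several people tie, as the two 18-year-olds do here, arrange() keeps them in their original order, so the first one listed is simply the first one in the data.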
We are going to add two new commands here and they make magic happen. So, we are trying to see how each race feels about their economic future. Here’s how we do that.
cces %>% group_by(race) %>% mean_ci(econ2)
## # A tibble: 8 x 8
## race mean sd n level se lower upper
## <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 18.9 35.7 46289 0.05 0.166 18.5 19.2
## 2 2 16.3 32.9 7926 0.05 0.369 15.6 17.1
## 3 3 14.8 31.4 5238 0.05 0.434 13.9 15.7
## 4 4 14.2 30.8 2278 0.05 0.645 13.0 15.5
## 5 5 15.5 32.6 522 0.05 1.43 12.7 18.3
## 6 6 22.9 39.0 1452 0.05 1.02 20.8 24.9
## 7 7 25.8 41.3 760 0.05 1.50 22.8 28.7
## 8 8 21.0 37.3 135 0.05 3.21 14.7 27.4
Okay, this is not right though. You want to guess why? It’s because we have some 99’s that we coded in there. We need to get rid of those.
cces %>% filter(econ2 < 10) %>% group_by(race) %>% mean_ci(econ2)
## # A tibble: 8 x 8
## race mean sd n level se lower upper
## <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 2.93 1.04 38617 0.05 0.00527 2.92 2.94
## 2 2 3.27 1.07 6844 0.05 0.0129 3.24 3.29
## 3 3 3.10 1.08 4599 0.05 0.0159 3.07 3.13
## 4 4 3.07 0.968 2013 0.05 0.0216 3.03 3.12
## 5 5 2.77 1.15 453 0.05 0.0541 2.66 2.87
## 6 6 2.86 1.10 1150 0.05 0.0324 2.80 2.92
## 7 7 2.52 1.11 577 0.05 0.0464 2.43 2.61
## 8 8 3.29 1.03 110 0.05 0.0978 3.10 3.48
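Which racial group is most optimistic about their future? It’s the one with the highest mean score. In this case that’s group 8 (Middle Eastern) at 3.29, just ahead of black respondents at 3.27, although with only 110 people in group 8 its confidence interval is much wider. Which is the lowest? It’s those who marked the “other” race box (group 7) at 2.52. Let’s try one more. Let’s do education.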
cces %>% filter(econ2 < 10) %>% group_by(educ) %>% mean_ci(econ2)
## # A tibble: 6 x 8
## educ mean sd n level se lower upper
## <int> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 1 2.74 1.12 1580 0.05 0.0283 2.68 2.79
## 2 2 2.81 1.05 13406 0.05 0.00908 2.80 2.83
## 3 3 2.94 1.04 12991 0.05 0.00916 2.92 2.96
## 4 4 2.92 1.05 6019 0.05 0.0135 2.90 2.95
## 5 5 3.13 1.03 12902 0.05 0.00909 3.11 3.15
## 6 6 3.22 1.01 7465 0.05 0.0117 3.19 3.24
What do you see here? It’s actually an interesting pattern. As one goes from lower levels of education to higher levels of education, overall optimism generally goes up (with a tiny dip between levels 3 and 4). That makes intuitive sense, right? The more education you have, the more you think that your future is going to be better.
How do we visualize that? Well, it gets a little tricky. First, you need to realize that you have just created a new dataset. It has one column called educ and another called mean, along with the standard deviation, the count, and the confidence interval. So let’s save that new dataset. Here’s how.
plot <- cces %>% filter(econ2 < 10) %>% group_by(educ) %>% mean_ci(econ2)
Now we have a new dataset called plot that we can use to actually make a visualization. Here’s the structure for that:
plot %>% ggplot(.,aes(x=educ, y=mean)) + geom_col()
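The saved table also carries lower and upper columns, which can be drawn as error bars. The sketch below uses made-up toy data and rebuilds those columns by hand with group_by() and summarise(). It assumes a 95% interval based on the t distribution; whether mean_ci() computes its interval in exactly this way is an assumption, and the name summary_tbl is only for illustration.

```r
library(dplyr)
library(ggplot2)

# Toy data: seven made-up respondents in two education groups.
toy <- tibble(
  educ  = c(1, 1, 1, 2, 2, 2, 2),
  econ2 = c(2, 3, 4, 3, 4, 4, 5)
)

summary_tbl <- toy %>%
  group_by(educ) %>%
  summarise(
    mean = mean(econ2),
    sd   = sd(econ2),
    n    = n(),
    .groups = "drop"
  ) %>%
  mutate(
    se    = sd / sqrt(n),                        # standard error of the mean
    lower = mean - qt(0.975, df = n - 1) * se,   # assumed 95% t interval
    upper = mean + qt(0.975, df = n - 1) * se
  )

# Bars for the means, with the interval drawn on top of each bar.
summary_tbl %>%
  ggplot(aes(x = educ, y = mean)) +
  geom_col() +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2)
```

Error bars that overlap a lot are a hint that the difference between two groups may just be noise.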
Let’s say that you want to compare two things in one bar chart. You can take the pieces that we’ve learned and do that.
The question is how the racial composition of Ohio compares with that of Nebraska. We need to start by creating a dataset of the racial count in Ohio, then do the same for Nebraska.
ohio <- cces %>%
filter(state == 39) %>%
ct(race) %>%
mutate(group = "Ohio")
neb <- cces %>%
filter(state == 31) %>%
ct(race) %>%
mutate(group = "Nebraska")
neb
## # A tibble: 7 x 4
## race n pct group
## <int> <int> <dbl> <chr>
## 1 1 317 0.857 Nebraska
## 2 2 21 0.057 Nebraska
## 3 3 9 0.024 Nebraska
## 4 4 7 0.019 Nebraska
## 5 5 5 0.014 Nebraska
## 6 6 5 0.014 Nebraska
## 7 7 6 0.016 Nebraska
You see what we did? That creates a dataset for Ohio and one for Nebraska. The key thing that we added is the group variable, which will help identify our data in just a minute. We need to put those two datasets together. Here’s how we do it.
both <- bind_rows(ohio, neb)
We just bound the rows together to make a new dataset called both. Now we can plot.
both %>%
ggplot(., aes(x=race,y=pct, fill = group)) +
geom_col(position = "dodge") +
labs(x = "Race", y = "Percent", title = "Racial Composition")
Let’s just do some small things to make that graph look better.
both %>%
ggplot(., aes(x=race,y=pct, fill = group)) +
geom_col(position = "dodge") +
labs(x = "Race", y = "Percent", title = "Racial Composition") +
theme_minimal() +
theme(legend.position = "bottom") +
theme(legend.title=element_blank())
We just did a few things: changed the theme, moved the legend to the bottom, and removed the legend title. Let’s do some more to make this presentable: move the legend to a better spot, change the font, add a better color scheme, add x axis variable labels, and add percents to the top of each bar.
library(ggsci) #You might need to install this one.
library(showtext)
font_add_google("Abel", "font")
showtext_auto()
both %>%
ggplot(., aes(x=race,y=pct, fill = group)) +
geom_col(position = "dodge", color = "black") +
labs(x = "Race", y = "Percent", title = "Racial Composition") +
theme_minimal() +
theme(legend.title=element_blank()) +
scale_fill_npg() +
scale_y_continuous(labels = scales::percent) +
theme(text=element_text(size=44, family="font")) +
theme(legend.position = c(0.8, .8)) +
scale_x_continuous(breaks = c(1,2,3,4,5,6,7,8), labels = c("White", "Black", "Hispanic", "Asian", "Native American", "Mixed", "Other", "Middle Eastern")) +
geom_text(aes(y = pct + .025, label = paste0(pct*100, '%')), position = position_dodge(width = .9), size = 10, family = "font")
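The two-state table used in these plots was built from two separate datasets and then bound together. As a sketch of an alternative, the same kind of table can be produced in a single pipeline by filtering to both states at once and computing the shares within each state. The toy data is made up; the state codes 39 (Ohio) and 31 (Nebraska) are the ones used above, and both_alt is a hypothetical name.

```r
library(dplyr)

# Toy data: eight made-up respondents from three states.
toy <- tibble(
  state = c(39, 39, 39, 39, 31, 31, 31, 12),
  race  = c(1, 1, 2, 3, 1, 1, 2, 1)
)

both_alt <- toy %>%
  filter(state %in% c(31, 39)) %>%                              # keep the two states
  mutate(group = if_else(state == 39, "Ohio", "Nebraska")) %>%  # readable labels
  count(group, race) %>%                                        # counts per state and race
  group_by(group) %>%
  mutate(pct = round(n / sum(n), 3)) %>%                        # share within each state
  ungroup()

both_alt
```

The bind_rows approach is easier to follow step by step; this version is handy once you want to compare more than two groups.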
These are just a handful of the things that R can do. Its capabilities are nearly endless. Later in the semester I will be giving you two assignments that will require the use of statistical software to complete. You will use the commands I taught you here. Let me know if you have questions or comments about this tutorial.