library(tidyverse)
library(janitor)
library(car)
cces <- read_csv("https://raw.githubusercontent.com/ryanburge/cces/master/CCES%20for%20Methods/small_cces.csv")
Let’s start by taking a look at our dataset using the pipe (%>%) and the glimpse command. You can type a pipe easily in RStudio by hitting CTRL + SHIFT + M.
cces %>% glimpse()
## Observations: 64,600
## Variables: 33
## $ X1 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ X1_1 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, ...
## $ id <int> 222168628, 273691199, 284214415, 287557695, 2903876...
## $ state <int> 33, 22, 29, 1, 8, 1, 48, 42, 13, 42, 15, 48, 12, 48...
## $ birthyr <int> 1969, 1994, 1964, 1988, 1982, 1963, 1962, 1991, 196...
## $ gender <int> 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 2, 1, 1, ...
## $ educ <int> 2, 2, 2, 2, 5, 2, 2, 1, 2, 2, 5, 3, 3, 4, 3, 3, 3, ...
## $ race <int> 1, 1, 2, 2, 1, 6, 1, 1, 1, 1, 4, 1, 6, 3, 6, 7, 1, ...
## $ marital <int> 1, 5, 5, 5, 1, 4, 2, 2, 1, 1, 1, 1, 1, 6, 5, 5, 5, ...
## $ natecon <int> 3, 4, 5, 4, 2, 4, 3, 5, 4, 5, 2, 6, 3, 3, 3, 3, 6, ...
## $ mymoney <int> 2, 3, 2, 4, 2, 4, 3, 5, 4, 3, 2, 1, 3, 3, 2, 5, 4, ...
## $ econfuture <int> 6, 5, 4, 5, 6, 4, 3, 6, 6, 5, 3, 6, 6, 2, 2, 5, 6, ...
## $ police <int> 2, 3, 2, 2, 2, 3, 2, 2, 2, 1, 2, 1, 3, 2, 1, 4, 2, ...
## $ background <int> 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 1, 1, 1, ...
## $ registry <int> 2, 1, 2, 2, 2, 1, 1, 2, 1, 1, 8, 1, 1, 2, 1, 2, 1, ...
## $ assaultban <int> 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 2, ...
## $ conceal <int> 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 2, ...
## $ pathway <int> 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 2, 1, 1, ...
## $ border <int> 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, ...
## $ dreamer <int> 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 1, 2, 1, 1, ...
## $ deport <int> 1, 1, 2, 2, 2, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, ...
## $ prochoice <int> 1, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 1, 2, 1, 1, ...
## $ prolife <int> 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, ...
## $ gaym <int> 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ employ <int> 7, 7, 6, 6, 2, 6, 1, 4, 3, 1, 5, 1, 7, 2, 1, 4, 4, ...
## $ pid7 <int> 5, 4, 1, 4, 2, 2, 6, 4, 7, 7, 2, 4, 4, 1, 4, 1, 2, ...
## $ attend <int> 6, 8, 3, 4, 6, 2, 2, 5, 4, 5, 5, 5, 4, 6, 6, 6, 6, ...
## $ religion <int> 11, 98, 2, 11, 10, 1, 1, 11, 11, 2, 1, 1, 10, 11, 9...
## $ vote16 <int> 1, 1, NA, NA, 2, 99, 1, NA, 1, 1, 2, 5, 99, NA, 5, ...
## $ ideo5 <int> 3, 3, 5, 4, 2, 6, 5, 3, 4, 3, 3, 4, 6, 1, 3, 1, 2, ...
## $ union <int> 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, ...
## $ income <int> 97, 6, 4, 1, 7, 1, 3, 1, 4, 7, 10, 5, 2, 2, 3, 1, 6...
## $ sexuality <int> 1, 1, 1, 1, 1, 1, 1, 9, 1, 1, 1, 1, 6, 1, 1, 1, 1, ...
Glimpse tells you a lot. First it tells you how many observations you have. Here you have A LOT: 64,600. And how many variables do you have? 33 variables. You can also see the name of each variable, its type (<int> means integer), and the first several values of each column.
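If you only want the counts that glimpse reports, you can ask for them directly. This is a minimal sketch on a made-up three-row dataset (the name toy and its values are hypothetical, not taken from the CCES file):

```r
library(dplyr)

# Hypothetical toy data standing in for the real survey file
toy <- tibble(
  id     = c(101, 102, 103),
  race   = c(1, 1, 2),
  gender = c(2, 2, 2)
)

n_obs     <- nrow(toy)   # number of observations (rows)
n_vars    <- ncol(toy)   # number of variables (columns)
var_names <- names(toy)  # the variable names, in order

glimpse(toy)             # the same overview we used on cces
```

Running nrow(cces) and ncol(cces) on the real data should match the first two lines of the glimpse output.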
All this won’t make any sense unless you have the codebook. That can be found here
Take a look at the race row in the glimpse output. Look in your codebook. The first person in our dataset is white, the second is white, and the third is black. Do the same for gender.
One of the most helpful functions for getting a simple sense of how many respondents of each race are in the dataset is the tabyl command from the janitor package. Again, this uses cces and then the pipe.
cces %>% tabyl(race)
## race n percent
## 1 1 46289 0.716547988
## 2 2 7926 0.122693498
## 3 3 5238 0.081083591
## 4 4 2278 0.035263158
## 5 5 522 0.008080495
## 6 6 1452 0.022476780
## 7 7 760 0.011764706
## 8 8 135 0.002089783
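To see what tabyl is doing under the hood, here is a rough equivalent built from two dplyr commands. It is only a sketch on made-up data (toy and its values are hypothetical), not how janitor actually implements it:

```r
library(dplyr)

# Hypothetical toy data: eight respondents with a race code each
toy <- tibble(race = c(1, 1, 1, 2, 2, 3, 1, 4))

race_counts <- toy %>%
  count(race) %>%                # one row per race code, with the count in column n
  mutate(percent = n / sum(n))   # each count divided by the total number of rows

race_counts
```

The percent column is a proportion, so 0.5 means 50 percent, just like in the tabyl output.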
So, what percentage of the sample is white? Black? Asian? It gives you the actual count, but it also gives you the percentage. How about we visualize that?
cces %>% ggplot(.,aes(x=race)) + geom_bar()
All that does is put the race variable into a bar chart, with one bar per category whose height is the number of respondents. (The analogous plot for a continuous variable is called a histogram.) Sometimes visualizing data can help you understand it better.
Let’s try to visualize the party identification variable. It’s called pid7.
cces %>% ggplot(.,aes(x=pid7)) + geom_bar()
You notice something odd here? Why does the x axis (the one going left and right) stretch so far, with a little bar way over on the right side? Well, we can find out why if we just use the tabyl command.
cces %>% tabyl(pid7)
## pid7 n percent
## 1 1 16251 0.2515634675
## 2 2 8618 0.1334055728
## 3 3 6270 0.0970588235
## 4 4 10493 0.1624303406
## 5 5 5554 0.0859752322
## 6 6 6814 0.1054798762
## 7 7 8479 0.1312538700
## 8 8 2067 0.0319969040
## 9 98 34 0.0005263158
## 10 99 20 0.0003095975
Now it makes sense, right? There are some weird values in there. Do you see them? There are people who are coded 98 and 99. Those values are people who don’t know or didn’t respond. We don’t want to plot them, because they make the plot look ugly. So, we can fix that.
R has a lot of cool functions. One of the most helpful is called filter. It will filter the data based on whatever parameters we have set up. So, how do we use filter to get rid of those 98 and 99 values? It’s actually pretty simple.
If you just hit the up arrow in the console, it will bring up a prior command. Hit up until you get back to the ggplot line. And then just make a small addition: the filter command.
cces %>% filter(pid7 < 9) %>% ggplot(.,aes(x=pid7)) + geom_bar()
Do you see what just happened there? We told R to look at the cces data, then to keep only the rows where pid7 is less than 9. And guess what? 98 and 99 are not less than 9. Therefore R will not plot them in your bar graph.
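It can help to see filter work on a tiny example first. The sketch below uses a made-up column of pid7 values (toy is hypothetical). Notice that pid7 < 9 still keeps the 8s; if you also wanted to drop those, one option is %in%:

```r
library(dplyr)

# Hypothetical toy data with the kinds of codes that show up in pid7
toy <- tibble(pid7 = c(1, 4, 7, 8, 98, 99))

# Keep rows where the condition is TRUE: 98 and 99 are dropped, 8 stays
kept_lt9 <- toy %>% filter(pid7 < 9)

# Keep only the seven substantive categories, dropping 8, 98 and 99
kept_1to7 <- toy %>% filter(pid7 %in% 1:7)

kept_lt9
kept_1to7
```

Either way, filter never changes cces itself unless you save the result with <-.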
Let’s try to use filter in another way. How about we just look at the party identification of men. Take a look at your codebook. What is the variable for male called? It’s called gender, right? And which value is male? Yes, it’s value 1. So, let’s use that info to plot.
cces %>% filter(gender ==1) %>% ggplot(.,aes(x=pid7)) + geom_bar()
It’s important that you use == (2 equal signs) when you are testing whether two things are equal. Now you have a plot of just males and their party identification. How about adding some colors? And some better labels?
cces %>%
  filter(gender == 1) %>%
  filter(pid7 < 9) %>%
  ggplot(., aes(x = pid7)) +
  geom_bar(fill = "darkorchid3", color = "black") +
  labs(x = "Party Identification", y = "Number of Respondents", title = "Party ID of Males")
Here are a few more ways to filter.
Suppose you want to look at all racial groups except white, and the frequency of their responses to the party identification question. Use != (not equal to):
cces %>% filter(race != 1) %>% tabyl(pid7)
## pid7 n percent
## 1 1 6635 0.3623504997
## 2 2 3428 0.1872098738
## 3 3 1799 0.0982469554
## 4 4 2917 0.1593031511
## 5 5 850 0.0464201846
## 6 6 956 0.0522090547
## 7 7 990 0.0540658621
## 8 8 714 0.0389929551
## 9 98 15 0.0008191797
## 10 99 7 0.0003822839
But let’s say you just wanted to look at black OR Hispanic respondents. You would use the | symbol, which is just above the Enter key on most US keyboards. You have to hold Shift.
cces %>% filter(race == 2 | race == 3) %>% tabyl(pid7)
## pid7 n percent
## 1 1 5646 0.4288969918
## 2 2 2616 0.1987237922
## 3 3 1187 0.0901701610
## 4 4 1670 0.1268611364
## 5 5 414 0.0314494075
## 6 6 551 0.0418565785
## 7 7 575 0.0436797326
## 8 8 486 0.0369188696
## 9 98 12 0.0009115770
## 10 99 7 0.0005317533
Let’s say you wanted to look at the party identification of black males. That’s the & symbol, which means AND.
cces %>% filter(race == 2 & gender ==1) %>% tabyl(pid7)
## pid7 n percent
## 1 1 1311 0.4700609537
## 2 2 524 0.1878809609
## 3 3 321 0.1150950161
## 4 4 323 0.1158121190
## 5 5 66 0.0236643958
## 6 6 68 0.0243814987
## 7 7 95 0.0340623880
## 8 8 78 0.0279670133
## 9 98 2 0.0007171029
## 10 99 1 0.0003585515
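Here are the three filter patterns from above side by side on a made-up dataset, so you can count the rows by hand and check that each one does what you expect. The data and object names are hypothetical; the %in% line is an alternative way to write the OR condition:

```r
library(dplyr)

# Hypothetical toy data: six respondents
toy <- tibble(
  race   = c(1, 2, 3, 2, 1, 4),
  gender = c(1, 1, 2, 2, 2, 1)
)

not_white        <- toy %>% filter(race != 1)               # everyone except race 1
black_or_hisp    <- toy %>% filter(race == 2 | race == 3)   # race 2 OR race 3
black_or_hisp_in <- toy %>% filter(race %in% c(2, 3))       # same rows, shorter to type
black_men        <- toy %>% filter(race == 2 & gender == 1) # race 2 AND gender 1

nrow(not_white)
nrow(black_or_hisp)
nrow(black_men)
```

The %in% version is handy once the list of values you want gets longer than two.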
Let’s say you want to create a new variable that is the age of the respondent. Check out the codebook. There’s a variable called birthyr, let’s check that out.
cces %>% tabyl(birthyr)
## birthyr n percent
## 1 1917 1 1.547988e-05
## 2 1918 1 1.547988e-05
## 3 1921 3 4.643963e-05
## 4 1922 3 4.643963e-05
## 5 1923 11 1.702786e-04
## 6 1924 13 2.012384e-04
## 7 1925 23 3.560372e-04
## 8 1926 17 2.631579e-04
## 9 1927 32 4.953560e-04
## 10 1928 51 7.894737e-04
## 11 1929 48 7.430341e-04
## 12 1930 96 1.486068e-03
## 13 1931 118 1.826625e-03
## 14 1932 119 1.842105e-03
## 15 1933 148 2.291022e-03
## 16 1934 201 3.111455e-03
## 17 1935 250 3.869969e-03
## 18 1936 297 4.597523e-03
## 19 1937 333 5.154799e-03
## 20 1938 377 5.835913e-03
## 21 1939 436 6.749226e-03
## 22 1940 443 6.857585e-03
## 23 1941 511 7.910217e-03
## 24 1942 609 9.427245e-03
## 25 1943 624 9.659443e-03
## 26 1944 599 9.272446e-03
## 27 1945 639 9.891641e-03
## 28 1946 812 1.256966e-02
## 29 1947 939 1.453560e-02
## 30 1948 955 1.478328e-02
## 31 1949 902 1.396285e-02
## 32 1950 1008 1.560372e-02
## 33 1951 993 1.537152e-02
## 34 1952 1216 1.882353e-02
## 35 1953 1372 2.123839e-02
## 36 1954 1292 2.000000e-02
## 37 1955 1414 2.188854e-02
## 38 1956 1439 2.227554e-02
## 39 1957 1484 2.297214e-02
## 40 1958 1443 2.233746e-02
## 41 1959 1345 2.082043e-02
## 42 1960 1509 2.335913e-02
## 43 1961 1335 2.066563e-02
## 44 1962 1285 1.989164e-02
## 45 1963 1258 1.947368e-02
## 46 1964 1194 1.848297e-02
## 47 1965 1195 1.849845e-02
## 48 1966 1046 1.619195e-02
## 49 1967 1011 1.565015e-02
## 50 1968 1001 1.549536e-02
## 51 1969 964 1.492260e-02
## 52 1970 1042 1.613003e-02
## 53 1971 934 1.445820e-02
## 54 1972 999 1.546440e-02
## 55 1973 835 1.292570e-02
## 56 1974 977 1.512384e-02
## 57 1975 941 1.456656e-02
## 58 1976 890 1.377709e-02
## 59 1977 922 1.427245e-02
## 60 1978 1003 1.552632e-02
## 61 1979 960 1.486068e-02
## 62 1980 1258 1.947368e-02
## 63 1981 1387 2.147059e-02
## 64 1982 1368 2.117647e-02
## 65 1983 1297 2.007740e-02
## 66 1984 1372 2.123839e-02
## 67 1985 1347 2.085139e-02
## 68 1986 1359 2.103715e-02
## 69 1987 1285 1.989164e-02
## 70 1988 1217 1.883901e-02
## 71 1989 1126 1.743034e-02
## 72 1990 1303 2.017028e-02
## 73 1991 1002 1.551084e-02
## 74 1992 913 1.413313e-02
## 75 1993 784 1.213622e-02
## 76 1994 788 1.219814e-02
## 77 1995 750 1.160991e-02
## 78 1996 682 1.055728e-02
## 79 1997 776 1.201238e-02
## 80 1998 668 1.034056e-02
So, what do you see? You see the year that each person was born. But how do we convert that to age? Like this:
cces <- cces %>% mutate(age = 2017 - birthyr)
So, that will give you a new variable called age. That new variable has to be saved, however, and that is why this begins with “cces <-”. That will overwrite the original dataset with a new one that contains the variable we created called age. Let’s visualize that.
cces %>% ggplot(.,aes(x=age)) + geom_bar()
You can see that there are a lot of people in their late 50s and early 60s, and fewer in their 40s. Let’s do something here. Let’s visualize the age distribution of both males and females.
cces %>% ggplot(.,aes(x=age)) + geom_bar() + facet_grid(.~gender)
Can you figure out what just happened? It is still mapping the distribution of age. But there are two plots (1 and 2). What is 1? What is 2? Well, look at the new addition I made to the command. See that facet_grid command? I told R to plot the age variable separately for each value of gender. We can see in our codebook that 1 is male and 2 is female. That’s what the plot is showing us.
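If you would rather have the panels say Male and Female instead of 1 and 2, one option is to make a labelled copy of gender before plotting. This is a sketch on made-up data; toy and gender_label are hypothetical names, and the labels follow the codebook values mentioned above:

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: birth year and gender code for six respondents
toy <- tibble(
  birthyr = c(1950, 1950, 1980, 1985, 1990, 1990),
  gender  = c(1, 2, 1, 2, 2, 2)
) %>%
  mutate(
    age = 2017 - birthyr,   # same age calculation as above
    gender_label = factor(gender, levels = c(1, 2), labels = c("Male", "Female"))
  )

# Same plot as before, but faceted on the labelled variable
p <- ggplot(toy, aes(x = age)) +
  geom_bar() +
  facet_grid(. ~ gender_label)
p
```

The original gender column is left alone, so any filter you already wrote with gender == 1 keeps working.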
One of the most basic, yet important types of analysis we do as social scientists is called a crosstab. That is basically a two way frequency table. If you wanted to take a look at union membership, that’s easy to do with the tabyl command. Right?
cces %>% tabyl(union)
## union n percent
## 1 1 4804 0.074365325
## 2 2 11496 0.177956656
## 3 3 48162 0.745541796
## 4 8 138 0.002136223
But that just tells us what union membership looks like in the entire sample. Let’s say that I’m interested in knowing how union membership varies by racial group. Well, we could use the filter command, right? So if you wanted to see what percentage of white people were in a union you could do that like this:
cces %>% filter(race ==1) %>% tabyl(union)
## union n percent
## 1 1 3359 0.072565836
## 2 2 8709 0.188144052
## 3 3 34139 0.737518633
## 4 8 82 0.001771479
But here’s the problem: if I wanted to see that relationship for each of the eight racial groups, that’s eight lines of code. How about one line of code?
cces %>% crosstab(race, union)
## race 1 2 3 8
## 1 1 3359 8709 34139 82
## 2 2 640 1328 5938 20
## 3 3 439 640 4138 21
## 4 4 158 169 1941 10
## 5 5 44 126 352 0
## 6 6 100 264 1085 3
## 7 7 51 239 468 2
## 8 8 13 21 101 0
What you have going down the table is each of the racial groups: 1 is white, 2 is black, etc. Then across the columns are each of the three response options to the union membership question, plus a column for code 8. So how many Hispanics have never been part of a union? 4138.
But what percentage is that? You can add a little extra code and get that easily.
cces %>% crosstab(race, union) %>% adorn_crosstab(denom = "row")
## race 1 2 3 8
## 1 1 7.3% (3359) 18.8% (8709) 73.8% (34139) 0.2% (82)
## 2 2 8.1% (640) 16.8% (1328) 74.9% (5938) 0.3% (20)
## 3 3 8.4% (439) 12.2% (640) 79.0% (4138) 0.4% (21)
## 4 4 6.9% (158) 7.4% (169) 85.2% (1941) 0.4% (10)
## 5 5 8.4% (44) 24.1% (126) 67.4% (352) 0.0% (0)
## 6 6 6.9% (100) 18.2% (264) 74.7% (1085) 0.2% (3)
## 7 7 6.7% (51) 31.4% (239) 61.6% (468) 0.3% (2)
## 8 8 9.6% (13) 15.6% (21) 74.8% (101) 0.0% (0)
So, what percentage of Hispanics have never been in a union? 79.0%. Which racial group has the lowest union membership? Asians. Over 85% have never been in a union.
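If crosstab or adorn_crosstab gives you an error, you are likely on a newer version of janitor, where (as far as I know) the same job is done by giving tabyl two variables and then chaining the adorn_ helpers. Here is a sketch on made-up data (toy is hypothetical):

```r
library(dplyr)
library(janitor)

# Hypothetical toy data: race and union codes for eight respondents
toy <- tibble(
  race  = c(1, 1, 1, 1, 2, 2, 3, 3),
  union = c(1, 3, 3, 3, 2, 3, 3, 3)
)

# Two-way table of counts: race down the rows, union across the columns
counts <- toy %>% tabyl(race, union)

# Row percentages with the raw counts in parentheses
row_pct <- toy %>%
  tabyl(race, union) %>%
  adorn_percentages("row") %>%           # divide each cell by its row total
  adorn_pct_formatting(digits = 1) %>%   # show as percentages with one decimal
  adorn_ns()                             # append the counts

counts
row_pct
```

The layout is the same as the crosstab output above, so you read it the same way.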
A lot of what social scientists do is create new variables by recoding old variables. Let’s start by creating a dichotomous variable out of the gender variable. We want this new variable to be called male, and males will have a value of 1, while everyone else will be coded as zero. (The recode syntax used here is the one from the car package; because car is loaded after the tidyverse, its recode is the one R uses.)
cces <- cces %>% mutate(male = recode(gender, "1=1; else=0"))
cces %>% tabyl(male)
## male n percent
## 1 0 35069 0.5428638
## 2 1 29531 0.4571362
Do you see what happened? Now males are 1 and everyone else is zero.
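For a simple yes/no variable like this one, another way to get the same result is if_else from dplyr, which reads as "if the condition is true use the first value, otherwise the second." This is just an alternative sketch on made-up data (toy is hypothetical), not a replacement for recode:

```r
library(dplyr)

# Hypothetical toy data: gender codes for five respondents
toy <- tibble(gender = c(1, 2, 2, 1, 2))

# male is 1 when gender equals 1, and 0 for everyone else
toy <- toy %>% mutate(male = if_else(gender == 1, 1, 0))

toy
```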
Let’s do something a bit more difficult. Look in your codebook for a variable called econfuture. Notice how the lower values like 1 indicate that the next year will be a lot better than the previous year, while high values say that next year will be a lot worse? Doesn’t that seem backwards? Shouldn’t high values mean more optimism and low values mean less? And what about the value 6, which means “unsure,” and 8, which means “skipped”? We need to clean all of that up.
cces <- cces %>% mutate(econ2 = recode(econfuture, "1=5; 2=4; 3=3; 4=2; 5=1; else=99"))
So, we just reverse coded everything. You see how 1 has now become a 5 and so on? But why did we do “else=99”? The answer is this: we wanted those weird responses to get a weird number, so that in any further analysis we can easily spot them and filter them out before plotting. So, let’s filter and visualize.
cces %>% filter(econ2 < 10) %>% ggplot(.,aes(x=econ2)) + geom_bar()
How about we visualize that by race?
cces %>% filter(econ2 < 10) %>% ggplot(.,aes(x=econ2)) + geom_bar() + facet_grid(.~race)
We can see now that each race has its own little bar chart. However, what makes this hard to interpret is that there are so many more white people in the sample than other races that it distorts the plot. We can do this a different way to see which race is the most optimistic about the future. Scroll down to the Group By and Summarise section to see.
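A different convention, which is not the one this tutorial uses, is to recode the unwanted responses to NA (R's missing value) instead of 99. Then you don't have to remember to filter; you tell functions like mean to skip missing values. A sketch on made-up data (toy is hypothetical), calling recode with the car:: prefix so it is clear which package it comes from:

```r
library(dplyr)

# Hypothetical toy data: one respondent for each econfuture code
toy <- tibble(econfuture = c(1, 2, 3, 4, 5, 6, 8))

# Reverse code 1 through 5 and turn everything else into a missing value
toy <- toy %>%
  mutate(econ2 = car::recode(econfuture, "1=5; 2=4; 3=3; 4=2; 5=1; else=NA"))

# na.rm = TRUE tells mean() to ignore the missing values
avg_all <- mean(toy$econ2, na.rm = TRUE)

toy
avg_all
```

The trade-off is that a 99 is easy to see in a table, while an NA quietly makes mean return NA if you forget na.rm = TRUE.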
Let’s say we wanted to do a range of values. For example the “pid7” variable is coded where 1 is Strong Democrat, 2 is Not very Strong Democrat, 3 is Lean Democrat, 4 is Independent, 5 is Lean Republican, 6 is Not very strong Republican, and 7 is Strong Republican. Then, 8 is not sure. And there are some 98 and 99. We want to collapse that into 3 categories: Democrat, Independent, and Republican.
cces <- cces %>% mutate(pid3 = recode(pid7, "1:3=1; 4=2; 5:7=3; else=99"))
cces %>% tabyl(pid3)
## pid3 n percent
## 1 1 31139 0.48202786
## 2 2 10493 0.16243034
## 3 3 20847 0.32270898
## 4 99 2121 0.03283282
Now we have a pid3 variable, but an outside reader wouldn’t know what 1 is vs 2 vs 3. We can change that. The key here is to use single quotes (' ') to surround the text:
cces <- cces %>% mutate(pid3 = recode(pid3, "1= 'Democrat'; 2 = 'Independent'; 3 = 'Republican'"))
cces %>% tabyl(pid3)
## pid3 n percent
## 1 99 2121 0.03283282
## 2 Democrat 31139 0.48202786
## 3 Independent 10493 0.16243034
## 4 Republican 20847 0.32270898
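Another common way to attach labels in R is to turn the numbers into a factor. This sketch uses made-up data (toy and pid3_label are hypothetical names) and is an alternative to the recode approach above, not what the tutorial itself does:

```r
library(dplyr)

# Hypothetical toy data: pid3 codes, including one 99
toy <- tibble(pid3 = c(1, 2, 3, 99, 1))

# levels lists the codes we care about; labels gives each one a name.
# Any code not listed in levels (here 99) becomes NA.
toy <- toy %>%
  mutate(pid3_label = factor(pid3,
                             levels = c(1, 2, 3),
                             labels = c("Democrat", "Independent", "Republican")))

toy
```

A side benefit is that factors keep the order you gave them, so plots and tables list Democrat, Independent, Republican rather than sorting alphabetically.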
A really simple thing to do is just figure out how to arrange a column from least to most or most to least. Let’s say that we wanted to arrange this CCES data based on who is the youngest.
(I need a quick command called select to show you only the X1 and age columns.)
cces %>% arrange(age) %>% select(X1, age)
## # A tibble: 64,600 x 2
## X1 age
## <int> <dbl>
## 1 1031 19
## 2 1964 19
## 3 2258 19
## 4 2917 19
## 5 4927 19
## 6 5280 19
## 7 6638 19
## 8 6651 19
## 9 7947 19
## 10 9136 19
## # ... with 64,590 more rows
So, person number 1031 is 19 years old, along with a bunch of other people.
How about the other way? And finding the oldest person? Just add a negative sign.
cces %>% arrange(-age) %>% select(X1, age)
## # A tibble: 64,600 x 2
## X1 age
## <int> <dbl>
## 1 43462 100
## 2 59736 99
## 3 3748 96
## 4 3940 96
## 5 60869 96
## 6 22957 95
## 7 42504 95
## 8 54522 95
## 9 560 94
## 10 1701 94
## # ... with 64,590 more rows
Person #43462 is the oldest at 100 years old.
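The negative sign trick works for numbers. dplyr also has a helper called desc that does the same thing and works on text columns too. A small sketch on made-up data (toy is hypothetical):

```r
library(dplyr)

# Hypothetical toy data: five respondents with an id and an age
toy <- tibble(X1 = 1:5, age = c(45, 19, 100, 62, 19))

# Youngest first (ascending is the default)
youngest_first <- toy %>% arrange(age) %>% select(X1, age)

# Oldest first, using desc() instead of the minus sign
oldest_first <- toy %>% arrange(desc(age)) %>% select(X1, age)

youngest_first
oldest_first
```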
We are going to add two new commands here and they make magic happen. So, we are trying to see how each race feels about their economic future. Here’s how we do that.
cces %>% group_by(race) %>% summarise(avg = mean(econ2))
## # A tibble: 8 x 2
## race avg
## <int> <dbl>
## 1 1 18.85066
## 2 2 16.33560
## 3 3 14.80031
## 4 4 14.23354
## 5 5 15.48659
## 6 6 22.85606
## 7 7 25.75263
## 8 8 21.01481
Okay, this is not right though. You want to guess why? It’s because we have some 99’s that we coded in there. We need to get rid of those.
cces %>% filter(econ2 < 10) %>% group_by(race) %>% summarise(avg = mean(econ2))
## # A tibble: 8 x 2
## race avg
## <int> <dbl>
## 1 1 2.927467
## 2 2 3.266803
## 3 3 3.101326
## 4 4 3.074516
## 5 5 2.766004
## 6 6 2.860000
## 7 7 2.521664
## 8 8 3.290909
Which racial group is most optimistic about their future? It’s the one with the highest “avg” score. And in this case that’s those who marked the “other” race box. Which is the lowest? It’s those of mixed race. Let’s try one more. Let’s do education.
cces %>% filter(econ2 < 10) %>% group_by(educ) %>% summarise(avg = mean(econ2))
## # A tibble: 6 x 2
## educ avg
## <int> <dbl>
## 1 1 2.735443
## 2 2 2.814411
## 3 3 2.937572
## 4 4 2.922579
## 5 5 3.130910
## 6 6 3.217549
What do you see here? It’s actually an interesting pattern. As one goes from lower levels of education to higher levels of education the overall optimism goes up. That makes intuitive sense, right? The more education you have, the more you think that your future is going to be better.
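To convince yourself of what group_by and summarise are doing, and of how much a single 99 can throw off an average, try it on a dataset small enough to check by hand. This is a sketch with made-up values (toy is hypothetical); the n = n() part is an extra that counts how many respondents went into each average:

```r
library(dplyr)

# Hypothetical toy data: two education levels, one stray 99 in econ2
toy <- tibble(
  educ  = c(1, 1, 1, 2, 2, 2),
  econ2 = c(2, 3, 99, 4, 5, 3)
)

# Without filtering, the 99 is averaged in with the real answers
with_99 <- toy %>%
  group_by(educ) %>%
  summarise(avg = mean(econ2))

# With filtering, only the real answers are averaged
clean <- toy %>%
  filter(econ2 < 10) %>%
  group_by(educ) %>%
  summarise(avg = mean(econ2), n = n())

with_99
clean
```

For educ 1 the clean average is (2 + 3) / 2 = 2.5, while the unfiltered one is pulled far above 5, which is impossible on a 1 to 5 scale and is the same warning sign we saw in the race table.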
How do we visualize that? Well, it gets a little tricky. First, you need to realize that you have just created a new dataset. It has two columns. One is educ and the other is avg. So let’s save that new dataset. Here’s how.
plot <- cces %>% filter(econ2 < 10) %>% group_by(educ) %>% summarise(avg = mean(econ2))
Now we have a new dataset called plot that we can use to actually make a visualization. Here’s the structure for that:
plot %>% ggplot(.,aes(x=educ, y=avg)) + geom_col()
These are just a handful of things that R can do. Its capabilities are nearly endless. Later in the semester I will be giving you two assignments that will require the use of statistical software to complete. You will use the commands I taught you here. Let me know if you have questions or comments about this tutorial.