Read in the GSS data as a dataframe called “x”.
setwd("~/Dropbox/Data General/GSS") #Set your working directory to whatever folder holds GSS.csv
options(scipen = 999) #Turn off scientific notation
x <- read.csv("GSS.csv", stringsAsFactors = TRUE) #Read text columns as factors (no longer the default as of R 4.0.0)
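Whether text columns arrive as factors depends on the R version: from R 4.0.0 onward, read.csv() keeps them as character vectors unless stringsAsFactors = TRUE is set. Here is a minimal sketch for checking this, using a small made-up CSV written to a temporary file (the column names and values are illustrative, not the real GSS data):

```r
# Toy data standing in for GSS.csv (hypothetical values, not the real survey)
toy <- data.frame(race = c("white", "black", "other", "white"),
                  age  = c(34, 51, 27, 45))
path <- tempfile(fileext = ".csv")
write.csv(toy, path, row.names = FALSE)

# Ask read.csv() to convert text columns to factors explicitly
toy_in <- read.csv(path, stringsAsFactors = TRUE)

str(toy_in)               # shows the type of each column
is.factor(toy_in$race)    # TRUE when the column was read as a factor
levels(toy_in$race)       # the categories R found in the column
```

If is.factor() returns FALSE on your own data, factor() or as.factor() can convert the column after the fact.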
We use the summary() function to see how many units (in this case, survey respondents) fall in each level of a categorical variable. R calls categorical variables “factors” and their categories “levels.”
summary(x$race)
## black other white
##  7625  2589 44873
summary(x$rincome)
##  $1000 to 2999 $10000 - 14999 $15000 - 19999 $20000 - 24999 $25000 or more
##           1714           4472           3489           3345          11529
##  $3000 to 3999  $4000 to 4999  $5000 to 5999  $6000 to 6999  $7000 to 7999
##           1111            957            987            902            903
##  $8000 to 9999       lt $1000           NA's
##           1625           1189          22864
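Note the NA's entry in the rincome output: summary() counts missing values for a factor, whereas table() drops them unless told otherwise. A small sketch with a made-up factor (the values are hypothetical) shows the difference:

```r
# Hypothetical income responses with two missing values
income <- factor(c("low", "high", NA, "low", "mid", NA, "high", "low"))

summary(income)                   # counts per level, plus an NA's entry
table(income)                     # NA values are dropped by default
table(income, useNA = "ifany")    # include NA as its own category when present
```

This matters for the tables that follow, since they are built with table() and so leave out respondents with a missing value on either variable.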
To begin considering the relationship between two categorical variables, we can generate a simple contingency table. A contingency table shows how many respondents fall into each combination of categories from two variables, that is, the frequency of each category of one variable contingent on the category of the other.
Consider the relationship between race and income.
table(x$rincome, x$race)
##
##                  black other white
##   $1000 to 2999    259    67  1388
##   $10000 - 14999   674   216  3582
##   $15000 - 19999   502   195  2792
##   $20000 - 24999   431   172  2742
##   $25000 or more  1195   687  9647
##   $3000 to 3999    190    49   872
##   $4000 to 4999    151    46   760
##   $5000 to 5999    159    48   780
##   $6000 to 6999    135    40   727
##   $7000 to 7999    136    44   723
##   $8000 to 9999    235    79  1311
##   lt $1000         198    54   937
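Row and column totals often help when reading a contingency table. As a sketch on made-up data (the toy data frame and its values are assumptions for illustration), addmargins() appends a Sum row and a Sum column to a table:

```r
# Hypothetical respondents; not drawn from the GSS
toy <- data.frame(
  income = factor(c("low", "low", "high", "high", "high", "low")),
  race   = factor(c("black", "white", "white", "other", "white", "black"))
)

tab <- table(toy$income, toy$race)   # rows = income, columns = race
tab
addmargins(tab)                      # adds a "Sum" row and a "Sum" column
```

The first argument to table() becomes the rows and the second becomes the columns, the same layout as table(x$rincome, x$race).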
To work with the table further, save it as an object called “table1”.
table1 <- table(x$rincome, x$race)
Convert the table's cells to proportions rather than simple frequencies.
prop.table(table1)
##
##                     black    other    white
##   $1000 to 2999  0.008038 0.002079 0.043075
##   $10000 - 14999 0.020917 0.006703 0.111163
##   $15000 - 19999 0.015579 0.006052 0.086646
##   $20000 - 24999 0.013376 0.005338 0.085094
##   $25000 or more 0.037085 0.021320 0.299382
##   $3000 to 3999  0.005896 0.001521 0.027061
##   $4000 to 4999  0.004686 0.001428 0.023586
##   $5000 to 5999  0.004934 0.001490 0.024206
##   $6000 to 6999  0.004190 0.001241 0.022562
##   $7000 to 7999  0.004221 0.001365 0.022437
##   $8000 to 9999  0.007293 0.002452 0.040685
##   lt $1000       0.006145 0.001676 0.029079
Or you can nest the functions and skip the intermediate object.
prop.table(table(x$rincome, x$race))
##
##                     black    other    white
##   $1000 to 2999  0.008038 0.002079 0.043075
##   $10000 - 14999 0.020917 0.006703 0.111163
##   $15000 - 19999 0.015579 0.006052 0.086646
##   $20000 - 24999 0.013376 0.005338 0.085094
##   $25000 or more 0.037085 0.021320 0.299382
##   $3000 to 3999  0.005896 0.001521 0.027061
##   $4000 to 4999  0.004686 0.001428 0.023586
##   $5000 to 5999  0.004934 0.001490 0.024206
##   $6000 to 6999  0.004190 0.001241 0.022562
##   $7000 to 7999  0.004221 0.001365 0.022437
##   $8000 to 9999  0.007293 0.002452 0.040685
##   lt $1000       0.006145 0.001676 0.029079
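With no second argument, prop.table() divides every cell by the grand total, so the cells of the whole table sum to 1. A quick check on a small made-up table (the counts and group names are hypothetical):

```r
# Hypothetical 2 x 2 table of counts (grand total = 100)
counts <- matrix(c(10, 30, 20, 40), nrow = 2,
                 dimnames = list(c("low", "high"), c("groupA", "groupB")))
tab <- as.table(counts)

p <- prop.table(tab)           # each cell divided by the grand total
p
sum(p)                         # the cell proportions add up to 1
all.equal(p, tab / sum(tab))   # same result as dividing by hand
```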
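But these proportions are not very useful analytically: each cell is a share of the entire sample, so it mixes the size of each racial group with that group's income distribution. What are we actually interested in? The option “1” gives us row proportions…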
prop.table(table1, 1)
##
##                    black   other   white
##   $1000 to 2999  0.15111 0.03909 0.80980
##   $10000 - 14999 0.15072 0.04830 0.80098
##   $15000 - 19999 0.14388 0.05589 0.80023
##   $20000 - 24999 0.12885 0.05142 0.81973
##   $25000 or more 0.10365 0.05959 0.83676
##   $3000 to 3999  0.17102 0.04410 0.78488
##   $4000 to 4999  0.15778 0.04807 0.79415
##   $5000 to 5999  0.16109 0.04863 0.79027
##   $6000 to 6999  0.14967 0.04435 0.80599
##   $7000 to 7999  0.15061 0.04873 0.80066
##   $8000 to 9999  0.14462 0.04862 0.80677
##   lt $1000       0.16653 0.04542 0.78806
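The second argument of prop.table() is named margin, and 1 means each cell is divided by its own row total. A sketch on a made-up table (hypothetical counts and group names) confirming that every row then sums to 1:

```r
# Hypothetical counts: rows are income groups, columns are two respondent groups
tab <- as.table(matrix(c(10, 30, 20, 40), nrow = 2,
                       dimnames = list(c("low", "high"), c("groupA", "groupB"))))

row_p <- prop.table(tab, margin = 1)   # divide each cell by its row total
row_p
rowSums(row_p)                         # every row sums to 1
```

Read across a row: of the 30 hypothetical "low" respondents, one third are in groupA and two thirds are in groupB.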
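Again, not very useful or interesting: each row shows the racial makeup of one income bracket, which is not the comparison we are after. The option 2 gives us column proportions, that is, the distribution of income within each racial group.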
prop.table(table1, 2)
##
##                    black   other   white
##   $1000 to 2999  0.06073 0.03948 0.05285
##   $10000 - 14999 0.15803 0.12728 0.13640
##   $15000 - 19999 0.11770 0.11491 0.10632
##   $20000 - 24999 0.10106 0.10136 0.10441
##   $25000 or more 0.28019 0.40483 0.36735
##   $3000 to 3999  0.04455 0.02887 0.03321
##   $4000 to 4999  0.03540 0.02711 0.02894
##   $5000 to 5999  0.03728 0.02829 0.02970
##   $6000 to 6999  0.03165 0.02357 0.02768
##   $7000 to 7999  0.03189 0.02593 0.02753
##   $8000 to 9999  0.05510 0.04655 0.04992
##   lt $1000       0.04642 0.03182 0.03568
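Column proportions are often easier to read as percentages. One way to do that, sketched here on a made-up table (hypothetical counts and group names), is to multiply by 100 and round:

```r
# Hypothetical counts: rows are income groups, columns are two respondent groups
tab <- as.table(matrix(c(10, 30, 20, 40), nrow = 2,
                       dimnames = list(c("low", "high"), c("groupA", "groupB"))))

col_p <- prop.table(tab, margin = 2)   # divide each cell by its column total
colSums(col_p)                         # every column sums to 1
round(100 * col_p, 1)                  # percentages within each column
```

Read down a column: within the hypothetical groupA, 25% are "low" and 75% are "high", which is the kind of within-group comparison the column proportions make possible.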
Now that's interesting, I'd say! But let's visualize it. First we need to save it as an object.
table2 <- prop.table(table(x$rincome, x$race), 2)
Now we can make a classic bar chart.
barplot(table2)
Not bad, but we have a lot of cleaning up to do, not least because the income categories are not in order! First we'll re-order the factor levels, then rerun the tables, and then modify the graph.
In the barplot() call at the end of this step, main= sets the overall title of the graph; beside= declares whether the different groups get separate bars placed beside each other (the default is to stack them); legend= (short for legend.text=) says the legend should use the row names from table2; args.legend= positions the legend; and cex.names= scales the size of the labels under the bars.
x$rincome <- ordered(x$rincome, levels = c("lt $1000", "$1000 to 2999", "$3000 to 3999",
"$4000 to 4999", "$5000 to 5999", "$6000 to 6999", "$7000 to 7999", "$8000 to 9999",
"$10000 - 14999", "$15000 - 19999", "$20000 - 24999", "$25000 or more"))
table2 <- prop.table(table(x$rincome, x$race), 2)
prop.table(table(x$rincome, x$race))
##
##                     black    other    white
##   lt $1000       0.006145 0.001676 0.029079
##   $1000 to 2999  0.008038 0.002079 0.043075
##   $3000 to 3999  0.005896 0.001521 0.027061
##   $4000 to 4999  0.004686 0.001428 0.023586
##   $5000 to 5999  0.004934 0.001490 0.024206
##   $6000 to 6999  0.004190 0.001241 0.022562
##   $7000 to 7999  0.004221 0.001365 0.022437
##   $8000 to 9999  0.007293 0.002452 0.040685
##   $10000 - 14999 0.020917 0.006703 0.111163
##   $15000 - 19999 0.015579 0.006052 0.086646
##   $20000 - 24999 0.013376 0.005338 0.085094
##   $25000 or more 0.037085 0.021320 0.299382
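The effect of re-ordering levels is easiest to see on a tiny example. In this sketch (the vector and its labels are made up), factor() sorts the levels as text, while ordered() with an explicit levels= argument imposes the intended order, which table() then follows:

```r
# Hypothetical income brackets stored as text
inc <- c("$1000 to 2999", "lt $1000", "$10000 - 14999", "lt $1000")

levels(factor(inc))   # default: sorted as text, typically not in income order

inc_ord <- ordered(inc, levels = c("lt $1000", "$1000 to 2999", "$10000 - 14999"))
levels(inc_ord)       # now in ascending income order
table(inc_ord)        # tables (and plots built from them) follow the level order
```

Any value not listed in levels= becomes NA, so the labels must match the data exactly, including spaces and dollar signs.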
barplot(table2, main = "Income Level by Race", beside = TRUE, legend = rownames(table2),
args.legend = list(x = "topleft"), cex.names = 1.5)
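To see what beside= changes, here is a sketch on a small made-up matrix of column proportions (the values and group names are hypothetical). barplot() returns the bar midpoints, and their shape reflects the layout: one value per column when stacked, one per cell when grouped.

```r
# Hypothetical column proportions: each column sums to 1
p <- matrix(c(0.25, 0.75, 0.40, 0.60), nrow = 2,
            dimnames = list(c("low", "high"), c("groupA", "groupB")))

# Default: one stacked bar per column
mids_stacked <- barplot(p, main = "Stacked (default)")

# beside = TRUE: one bar per cell, grouped by column, with a legend from the row names
mids_grouped <- barplot(p, main = "Grouped", beside = TRUE,
                        legend.text = rownames(p),
                        args.legend = list(x = "topleft"))

length(mids_stacked)   # one midpoint per column
dim(mids_grouped)      # one midpoint per cell (rows x columns)
```

The grouped layout makes it easier to compare the same income bracket across groups, which is why the chart above uses beside = TRUE.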