Analyze and Visualize Categorical Variables

Read in the GSS data as a dataframe called “x”.

setwd("~/Dropbox/Data General/GSS")  #Set your working directory to whatever folder holds GSS.csv
options(scipen = 999)  #Turn off scientific notation
x <- read.csv("GSS.csv", stringsAsFactors = TRUE)  #Read text columns as factors (no longer the default since R 4.0.0)

We use the summary() function to take a simple look at how many units (in this case, survey respondents) are in each level of a categorical variable. R calls categorical variables “factors”, and the categories “levels.”

summary(x$race)
## black other white 
##  7625  2589 44873
summary(x$rincome)
##  $1000 to 2999 $10000 - 14999 $15000 - 19999 $20000 - 24999 $25000 or more 
##           1714           4472           3489           3345          11529 
##  $3000 to 3999  $4000 to 4999  $5000 to 5999  $6000 to 6999  $7000 to 7999 
##           1111            957            987            902            903 
##  $8000 to 9999       lt $1000           NA's 
##           1625           1189          22864
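To see what summary() is counting, here is a minimal sketch on a small made-up factor. The vector toy_race and its values are invented for illustration and are not GSS data. It shows that levels() lists the categories, and that summary() reports the same counts as table() plus a count of missing values.

```r
# Hypothetical toy data, not drawn from the GSS
toy_race <- factor(c("white", "black", "white", "other", NA, "white"))

levels(toy_race)                  # the categories of the factor
summary(toy_race)                 # counts per level, plus a count of NA's
table(toy_race)                   # counts per level; NA is dropped by default
table(toy_race, useNA = "ifany")  # include NA in the counts when present
```

The "NA's" entry is why summary(x$rincome) above reports 22864 respondents who fall into no income level.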

To begin considering the relationship between two categorical variables, we can generate a simple contingency table. A contingency table shows how many respondents fall into each combination of categories from two variables, that is, the frequency of each category of one variable contingent on the category of the other.

Consider the relationship between race and income.

table(x$rincome, x$race)
##                 
##                  black other white
##   $1000 to 2999    259    67  1388
##   $10000 - 14999   674   216  3582
##   $15000 - 19999   502   195  2792
##   $20000 - 24999   431   172  2742
##   $25000 or more  1195   687  9647
##   $3000 to 3999    190    49   872
##   $4000 to 4999    151    46   760
##   $5000 to 5999    159    48   780
##   $6000 to 6999    135    40   727
##   $7000 to 7999    136    44   723
##   $8000 to 9999    235    79  1311
##   lt $1000         198    54   937
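The mechanics are easier to follow on a table small enough to count by hand. The sketch below uses an invented data frame called toy (not GSS data) with one missing income value, and adds row and column totals with addmargins().

```r
# Hypothetical toy data, not drawn from the GSS
toy <- data.frame(
  income = factor(c("low", "low", "high", "high", "high", "low", NA)),
  race   = factor(c("black", "white", "white", "white", "black", "white", "white"))
)

toy_table <- table(toy$income, toy$race)  # rows = income, columns = race
toy_table
addmargins(toy_table)  # append row and column totals, labelled "Sum"

nrow(toy)       # 7 respondents in the data
sum(toy_table)  # only 6 are counted: the row with a missing income is dropped
```

By default table() drops any respondent with a missing value on either variable. That is why the cells of the GSS table above add up to 32223 rather than to all 55087 respondents: the 22864 respondents with a missing rincome are excluded.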

To do things with the table, save the table as an object called “table1”.

table1 <- table(x$rincome, x$race)

Make the table's cells proportions rather than simple frequencies.

prop.table(table1)
##                 
##                     black    other    white
##   $1000 to 2999  0.008038 0.002079 0.043075
##   $10000 - 14999 0.020917 0.006703 0.111163
##   $15000 - 19999 0.015579 0.006052 0.086646
##   $20000 - 24999 0.013376 0.005338 0.085094
##   $25000 or more 0.037085 0.021320 0.299382
##   $3000 to 3999  0.005896 0.001521 0.027061
##   $4000 to 4999  0.004686 0.001428 0.023586
##   $5000 to 5999  0.004934 0.001490 0.024206
##   $6000 to 6999  0.004190 0.001241 0.022562
##   $7000 to 7999  0.004221 0.001365 0.022437
##   $8000 to 9999  0.007293 0.002452 0.040685
##   lt $1000       0.006145 0.001676 0.029079

Or you can nest functions.

prop.table(table(x$rincome, x$race))
##                 
##                     black    other    white
##   $1000 to 2999  0.008038 0.002079 0.043075
##   $10000 - 14999 0.020917 0.006703 0.111163
##   $15000 - 19999 0.015579 0.006052 0.086646
##   $20000 - 24999 0.013376 0.005338 0.085094
##   $25000 or more 0.037085 0.021320 0.299382
##   $3000 to 3999  0.005896 0.001521 0.027061
##   $4000 to 4999  0.004686 0.001428 0.023586
##   $5000 to 5999  0.004934 0.001490 0.024206
##   $6000 to 6999  0.004190 0.001241 0.022562
##   $7000 to 7999  0.004221 0.001365 0.022437
##   $8000 to 9999  0.007293 0.002452 0.040685
##   lt $1000       0.006145 0.001676 0.029079
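As a quick check on what prop.table() does when it gets no second argument, the sketch below uses invented counts (the object counts is hypothetical, not GSS data). Each cell is divided by the grand total, so all the cells together sum to 1.

```r
# Hypothetical counts, not GSS data
counts <- as.table(matrix(
  c(10, 30, 20, 40), nrow = 2,
  dimnames = list(income = c("high", "low"), race = c("black", "white"))
))

cell_props <- prop.table(counts)  # each cell divided by the grand total (100)
cell_props
sum(cell_props)                   # the whole table sums to 1

# The same result computed by hand
all.equal(cell_props, counts / sum(counts))
```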

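But these proportions are not very useful analytically: each cell is a share of all respondents in the table, so the values mostly reflect how large each group is. What are we actually interested in? The second argument to prop.table() sets the margin. The option 1 gives us row proportions, so each row sums to 1.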

prop.table(table1, 1)
##                 
##                    black   other   white
##   $1000 to 2999  0.15111 0.03909 0.80980
##   $10000 - 14999 0.15072 0.04830 0.80098
##   $15000 - 19999 0.14388 0.05589 0.80023
##   $20000 - 24999 0.12885 0.05142 0.81973
##   $25000 or more 0.10365 0.05959 0.83676
##   $3000 to 3999  0.17102 0.04410 0.78488
##   $4000 to 4999  0.15778 0.04807 0.79415
##   $5000 to 5999  0.16109 0.04863 0.79027
##   $6000 to 6999  0.14967 0.04435 0.80599
##   $7000 to 7999  0.15061 0.04873 0.80066
##   $8000 to 9999  0.14462 0.04862 0.80677
##   lt $1000       0.16653 0.04542 0.78806

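Again, not very useful or interesting: each row shows the racial composition of one income bracket, which mostly mirrors the racial composition of the sample. The option 2 gives us column proportions, so each column sums to 1 and shows how the respondents of one race are distributed across the income brackets.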

prop.table(table1, 2)
##                 
##                    black   other   white
##   $1000 to 2999  0.06073 0.03948 0.05285
##   $10000 - 14999 0.15803 0.12728 0.13640
##   $15000 - 19999 0.11770 0.11491 0.10632
##   $20000 - 24999 0.10106 0.10136 0.10441
##   $25000 or more 0.28019 0.40483 0.36735
##   $3000 to 3999  0.04455 0.02887 0.03321
##   $4000 to 4999  0.03540 0.02711 0.02894
##   $5000 to 5999  0.03728 0.02829 0.02970
##   $6000 to 6999  0.03165 0.02357 0.02768
##   $7000 to 7999  0.03189 0.02593 0.02753
##   $8000 to 9999  0.05510 0.04655 0.04992
##   lt $1000       0.04642 0.03182 0.03568

Now that's interesting, I'd say! But let's visualize it. First we need to save it as an object.

table2 <- prop.table(table(x$rincome, x$race), 2)
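Before plotting, it can help to confirm which margin was used. The sketch below is a small check on invented counts (the object counts is hypothetical, not GSS data): with margin 1 every row sums to 1, and with margin 2 every column sums to 1.

```r
# Hypothetical counts, not GSS data
counts <- as.table(matrix(
  c(10, 30, 20, 40), nrow = 2,
  dimnames = list(income = c("high", "low"), race = c("black", "white"))
))

row_props <- prop.table(counts, 1)  # share of each race within an income level
col_props <- prop.table(counts, 2)  # share of each income level within a race

rowSums(row_props)   # every row sums to 1
colSums(col_props)   # every column sums to 1
round(col_props, 3)
```

The same check, colSums(table2), should return 1 for each race in the GSS table.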

Now we can make a classic bar chart.

barplot(table2)

[Figure: stacked bar chart produced by barplot(table2), one bar per race]

Not bad, but we have a lot of cleaning up to do, not least because the income categories are not in order! First we'll re-order the factor levels, then rerun the tables, and then modify the graph. In the barplot() call, main= sets the overall title of the graph; beside= controls whether the different groups get separate bars placed beside each other (the default is to stack them); legend= (short for legend.text=, which R resolves by partial argument matching) takes the row names of table2; args.legend= positions the legend, here in the top left; and cex.names= scales the size of the axis labels.

x$rincome <- ordered(x$rincome, levels = c("lt $1000", "$1000 to 2999", "$3000 to 3999", 
    "$4000 to 4999", "$5000 to 5999", "$6000 to 6999", "$7000 to 7999", "$8000 to 9999", 
    "$10000 - 14999", "$15000 - 19999", "$20000 - 24999", "$25000 or more"))
table2 <- prop.table(table(x$rincome, x$race), 2)
prop.table(table(x$rincome, x$race))
##                 
##                     black    other    white
##   lt $1000       0.006145 0.001676 0.029079
##   $1000 to 2999  0.008038 0.002079 0.043075
##   $3000 to 3999  0.005896 0.001521 0.027061
##   $4000 to 4999  0.004686 0.001428 0.023586
##   $5000 to 5999  0.004934 0.001490 0.024206
##   $6000 to 6999  0.004190 0.001241 0.022562
##   $7000 to 7999  0.004221 0.001365 0.022437
##   $8000 to 9999  0.007293 0.002452 0.040685
##   $10000 - 14999 0.020917 0.006703 0.111163
##   $15000 - 19999 0.015579 0.006052 0.086646
##   $20000 - 24999 0.013376 0.005338 0.085094
##   $25000 or more 0.037085 0.021320 0.299382
barplot(table2, main = "Income Level by Race", beside = TRUE, legend = rownames(table2), 
    args.legend = list(x = "topleft"), cex.names = 1.5)

[Figure: grouped bar chart titled "Income Level by Race", with the income categories in ascending order]
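The whole sequence (order the levels, build column proportions, draw grouped bars) can be tried on a small invented data set first. Everything in this sketch is hypothetical: the data frame toy, the income labels "low", "mid" and "high", and the group labels "a" and "b" are made up and are not GSS categories.

```r
# Hypothetical toy data, not drawn from the GSS
toy <- data.frame(
  income = c("mid", "low", "high", "low", "mid", "high", "high", "low"),
  race   = c("a", "a", "a", "b", "b", "b", "b", "a")
)

# Without explicit levels the categories would be sorted as text,
# so state the intended order.
toy$income <- ordered(toy$income, levels = c("low", "mid", "high"))
levels(toy$income)             # "low" "mid" "high"
toy$income[1] < toy$income[3]  # TRUE: "mid" ranks below "high"

# Column proportions: the income distribution within each group
toy_props <- prop.table(table(toy$income, toy$race), 2)

# Grouped bars: one cluster per group, bars in level order
bar_mids <- barplot(toy_props, beside = TRUE, main = "Toy example",
                    legend.text = rownames(toy_props),
                    args.legend = list(x = "topleft"))
```

Because the table is built after the levels are ordered, the bars and the legend both follow the stated order rather than the alphabetical one.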