Descriptive Statistics for one Categorical Variable
We’ll look at the mtcars dataset, which is included with the base distribution of R as a dataframe. First we’ll run summary(mtcars), one of a few standard commands for examining a new dataframe when we know nothing but its name.
      mpg             cyl             disp             hp
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5
 Median :19.20   Median :6.000   Median :196.3   Median :123.0
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0
      drat             wt             qsec             vs
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000
       am              gear            carb
 Min.   :0.0000   Min.   :3.000   Min.   :1.000
 1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000
 Median :0.0000   Median :4.000   Median :2.000
 Mean   :0.4062   Mean   :3.688   Mean   :2.812
 3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000
 Max.   :1.0000   Max.   :5.000   Max.   :8.000
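The table above has the layout that summary() prints for a dataframe. As a minimal sketch, these are base R commands one might run on an unfamiliar dataframe. Which commands were run in the original lab besides the one that produced this table is an assumption.

```r
# Inspect a dataframe we know only by name (base R only)
dims <- dim(mtcars)      # number of rows, then number of columns
print(dims)
str(mtcars)              # type and first few values of each column
print(head(mtcars))      # first six rows
print(summary(mtcars))   # per-column summary, as in the table above
```

str() is often the quickest way to see which columns R currently treats as numeric and which as factors.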
Note that some of the variables here are stored as numbers but are categorical in nature. One example is ‘am,’ which tells us whether the car has an automatic (am = 0) or manual (am = 1) transmission. For R to treat such a variable as categorical, we must convert it to a factor with as.factor().
mtcars$TranType <- as.factor(mtcars$am)
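The conversion above keeps the codes 0 and 1 as the category names. A possible alternative, sketched here, attaches readable labels during the conversion. The column name TranLabel is hypothetical and is not used anywhere else in this lab.

```r
# Alternative: attach readable labels while converting am to a factor.
# TranLabel is a hypothetical column name used only for this illustration.
mtcars$TranLabel <- factor(mtcars$am,
                           levels = c(0, 1),
                           labels = c("automatic", "manual"))
print(levels(mtcars$TranLabel))      # the two category names
print(is.factor(mtcars$TranLabel))   # TRUE when R treats the column as categorical
```

With labels attached, tables and plots show "automatic" and "manual" instead of 0 and 1.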
Now we can run the standard commands to explore a categorical variable.
Simple Counts
# Get simple counts of each categorical value
table(mtcars$TranType)
0 1
19 13
Note that since TranType is a column of the dataframe mtcars, we must refer to it as mtcars$TranType.
Proportions
table(mtcars$TranType)/length(mtcars$TranType)
0 1
0.59375 0.40625
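The division above can also be written with base R's prop.table(), which divides each count in a table by the table's total. A minimal sketch follows. The variable names tran_type and props are illustrative, and the factor is recreated so the block runs on its own.

```r
# Proportions via prop.table(): each count divided by the total count
tran_type <- as.factor(mtcars$am)   # recreated here so the block is self-contained
props <- prop.table(table(tran_type))
print(props)
print(round(100 * props, 1))        # the same values as percentages
```

Both approaches give the same numbers. prop.table() avoids having to name the variable twice.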
Create a barplot of the values
barplot(table(mtcars$TranType))
Note that barplot needs the counts to use as bar heights, so we must run it on a table, not on the raw data.
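A slightly fuller version of the plot, as a sketch: it adds a title and axis labels using standard barplot() arguments. The label text and the names tran_type, counts, and bar_mids are illustrative choices, not part of the original lab.

```r
# Bar chart of transmission counts with descriptive labels
tran_type <- as.factor(mtcars$am)   # recreated here so the block is self-contained
counts <- table(tran_type)          # barplot() uses these counts as bar heights
bar_mids <- barplot(counts,
                    names.arg = c("automatic (0)", "manual (1)"),
                    xlab = "Transmission type",
                    ylab = "Number of cars",
                    main = "Transmission types in mtcars")
```

barplot() returns the x-positions of the bar midpoints (stored here in bar_mids), which can be handy for adding text above the bars later.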
Use the dataframe cdc loaded in Lab 1 for the following exercises.
Load the Data
load("cdc.Rdata")
Exercise
Show the counts of the values of the categorical variable genhlth.
Solution
table(cdc$genhlth)
excellent very good      good      fair      poor
     4657      6972      5675      2019       677
Exercise
Show the proportions of the values of the categorical variable genhlth.
Solution
table(cdc$genhlth)/nrow(cdc)
excellent very good      good      fair      poor
  0.23285   0.34860   0.28375   0.10095   0.03385
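As a check on this solution, the proportions can be recomputed by hand from the counts printed in the previous exercise. The sketch below types those counts in directly, so it runs without the cdc.Rdata file. The names genhlth_counts, total, and genhlth_props are illustrative.

```r
# Rebuild the genhlth counts printed earlier so the arithmetic can be checked
# without loading cdc.Rdata
genhlth_counts <- c(excellent = 4657, "very good" = 6972, good = 5675,
                    fair = 2019, poor = 677)
total <- sum(genhlth_counts)              # total number of responses
genhlth_props <- genhlth_counts / total   # each count divided by the total
print(genhlth_props)
print(sum(genhlth_props))                 # proportions add up to 1
```

The total of the five counts should equal nrow(cdc), which is why dividing by nrow(cdc) in the solution gives the same proportions.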