One Categorical Variable

Harold Nelson

4/3/2020

Descriptive Statistics for one Categorical Variable

We’ll look at the mtcars dataset, which is included with the base distribution of R as a dataframe. First we’ll run a few standard commands to examine a new dataframe when we know nothing but the name of the dataframe.

str(mtcars)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...
summary(mtcars)
##       mpg             cyl             disp             hp       
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0  
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
##       drat             wt             qsec             vs        
##  Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
##  1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
##  Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
##  Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
##  3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
##  Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
##        am              gear            carb      
##  Min.   :0.0000   Min.   :3.000   Min.   :1.000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:2.000  
##  Median :0.0000   Median :4.000   Median :2.000  
##  Mean   :0.4062   Mean   :3.688   Mean   :2.812  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:4.000  
##  Max.   :1.0000   Max.   :5.000   Max.   :8.000
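
The output above shows every column as numeric, so a further check can help spot columns that behave like categories. The following is a minimal sketch, not part of the original lab: it counts the distinct values in each column. The name n_distinct is chosen here only for illustration.

```r
# Count the distinct values in each column of mtcars.
n_distinct <- sapply(mtcars, function(x) length(unique(x)))

# Columns with only a handful of distinct values are candidates for factors.
sort(n_distinct)
```

A small count does not prove that a variable is categorical, but it is a reasonable hint about which columns to look at more closely.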

Note that some of the numerical variables here are categorical in nature. One example is ‘am,’ which tells us whether the car has an automatic (am = 0) or manual (am = 1) transmission. To create a variable that R will treat as categorical, we convert it to a factor with the factor() command.

mtcars$am = factor(mtcars$am)
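
As an optional variation, factor() also accepts levels and labels arguments, so the codes 0 and 1 can be replaced with readable names. This sketch works on a copy, here called mt (a name assumed for illustration), so the rest of the walkthrough is unaffected.

```r
# Work on a copy so the built-in mtcars is left untouched.
mt <- mtcars

# Map the codes 0/1 to readable labels (0 = automatic, 1 = manual).
mt$am <- factor(mt$am, levels = c(0, 1), labels = c("automatic", "manual"))

levels(mt$am)
table(mt$am)
```

Readable labels carry through to tables and plots, which saves the reader from having to remember what 0 and 1 stand for.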

Now we can run the standard commands to explore a categorical variable.

Simple Counts

# Get simple counts of each categorical value
table(mtcars$am)
## 
##  0  1 
## 19 13

Note that since am is a column of the dataframe mtcars, we must refer to it as mtcars$am.

Proportions

table(mtcars$am)/length(mtcars$am)
## 
##       0       1 
## 0.59375 0.40625
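
Base R also provides prop.table(), which divides a table by its own total. The sketch below is an alternative to the division above, not a replacement for it. The names am_counts and am_props are assumed for illustration.

```r
# Counts of each transmission code.
am_counts <- table(mtcars$am)

# prop.table() divides each count by the table total.
am_props <- prop.table(am_counts)

# The same proportions expressed as percentages with one decimal place.
round(100 * am_props, 1)
```

The two approaches agree here because am has no missing values. If a variable contains NA, table() leaves those entries out by default while length() still counts them, so the results can differ.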

Create a barplot of the values

barplot(table(mtcars$am))

Note that we must run the barplot command on a table, not the raw data.
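
A plot is easier to read with a title and axis labels. The sketch below builds the table from a labelled factor and passes the standard main, xlab and ylab arguments to barplot(). The label text is chosen here for illustration.

```r
# Build the counts from a labelled version of am (0 = automatic, 1 = manual).
am_counts <- table(factor(mtcars$am, levels = c(0, 1),
                          labels = c("automatic", "manual")))

# Draw the bar plot with a title and axis labels.
barplot(am_counts,
        main = "Transmission type in mtcars",
        xlab = "Transmission",
        ylab = "Number of cars")
```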

Exercises

Use the dataframe loaded in Lab 1 for the following exercises. First load the cdc.Rdata file, which contains the dataframe. Then run the command str() and identify the categorical variables.

Answer

load("/cloud/project/cdc.Rdata")
str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : num  0 0 1 1 0 1 1 0 0 1 ...
##  $ hlthplan: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ smoke100: num  0 1 1 0 0 0 0 0 1 0 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...

Note that two variables, genhlth and gender, are recognizable as categorical because they are identified as factors.

There are also three other variables which are categorical in nature but coded numerically.

  1. The variable “exerany” indicates whether or not the person gets any exercise.

  2. The variable “smoke100” identifies smokers.

  3. The variable “hlthplan” indicates whether the person is covered by a health insurance plan.

In all three cases, the numerical value 0 means false/no, while the numerical value 1 means true/yes.

Convert these three variables to factors and re-run the str() command to verify success.

Be sure to work the problem before you advance to the next slide.

Answer

cdc$smoke100 = factor(cdc$smoke100)
cdc$exerany = factor(cdc$exerany)
cdc$hlthplan = factor(cdc$hlthplan)

str(cdc)
## 'data.frame':    20000 obs. of  9 variables:
##  $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
##  $ exerany : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 2 1 1 2 ...
##  $ hlthplan: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ smoke100: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 2 1 ...
##  $ height  : num  70 64 60 66 61 64 71 67 65 70 ...
##  $ weight  : int  175 125 105 132 150 114 194 170 150 180 ...
##  $ wtdesire: int  175 115 105 124 130 114 185 160 130 170 ...
##  $ age     : int  77 33 49 42 55 55 31 45 27 44 ...
##  $ gender  : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
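
When several columns need the same conversion, it can be done in one step with lapply(). The sketch below uses a small made-up dataframe called toy as a stand-in for cdc, so that it runs without the data file. Its rows are invented for illustration, and only the column names match the real data.

```r
# Hypothetical stand-in for cdc: a few made-up rows with matching column names.
toy <- data.frame(exerany  = c(0, 1, 1, 0),
                  hlthplan = c(1, 1, 0, 1),
                  smoke100 = c(0, 1, 1, 0),
                  height   = c(70, 64, 60, 66))

# The columns coded 0/1 that should become factors.
binary_cols <- c("exerany", "hlthplan", "smoke100")

# Convert every listed column, labelling 0 as "no" and 1 as "yes".
toy[binary_cols] <- lapply(toy[binary_cols], factor,
                           levels = c(0, 1), labels = c("no", "yes"))

str(toy)
```

The same pattern should apply to cdc by replacing toy with cdc. The height column is left numeric because it is not listed in binary_cols.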

Exercise

Use your exploratory tools to examine the categorical variable genhlth. Then answer the following questions.

  1. What is the most common value of genhlth?
  2. What is the least common value of genhlth?
  3. What fraction of the population considers their health excellent?
  4. What fraction of the population considers their health poor?

Be sure to work the problem before you advance to the next slide.

Answer

table(cdc$genhlth)
## 
## excellent very good      good      fair      poor 
##      4657      6972      5675      2019       677
table(cdc$genhlth)/length(cdc$genhlth)
## 
## excellent very good      good      fair      poor 
##   0.23285   0.34860   0.28375   0.10095   0.03385
barplot(table(cdc$genhlth))

  1. What is the most common value of genhlth?
    Ans: Very good

  2. What is the least common value of genhlth?
    Ans: Poor

  3. What fraction of the population considers their health excellent?
    Ans: 23%

  4. What fraction of the population considers their health poor?
    Ans: 3%
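
The answers above can also be read off programmatically. This sketch re-enters the counts from the table(cdc$genhlth) output as a named vector, so it runs without the data file. The name genhlth_counts is assumed for illustration.

```r
# Counts copied from the table(cdc$genhlth) output shown above.
genhlth_counts <- c(excellent = 4657, "very good" = 6972, good = 5675,
                    fair = 2019, poor = 677)

# which.max() and which.min() return the position of the extreme count;
# names() turns that position into the category label.
names(which.max(genhlth_counts))   # most common value
names(which.min(genhlth_counts))   # least common value

# Each category as a percentage of all respondents.
round(100 * genhlth_counts / sum(genhlth_counts), 1)
```

With the data loaded, the same calls should work directly on table(cdc$genhlth).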

Exercise

Use your tools to examine smoke100. What can you say?

Be sure to work the problem before you advance to the next slide.

Answer

table(cdc$smoke100)
## 
##     0     1 
## 10559  9441
table(cdc$smoke100)/length(cdc$smoke100)
## 
##       0       1 
## 0.52795 0.47205
barplot(table(cdc$smoke100))

What can you say? A slight majority of the population, about 53%, does not smoke.