Harold Nelson
4/3/2020
We’ll look at the mtcars dataset, which is included with the base distribution of R as a dataframe. First we’ll run a few standard commands for examining a new dataframe when we know nothing about it but its name.
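The output that follows has the shape produced by str() and summary(), so a minimal sketch of this first look, assuming those are the two commands used, would be:

```r
# mtcars ships with R, so no loading step is needed.
# str() lists each variable with its type and its first few values.
str(mtcars)

# summary() reports the minimum, quartiles, median, mean, and maximum
# of each numeric variable.
summary(mtcars)
```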
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs
## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000
## Median :3.695 Median :3.325 Median :17.71 Median :0.0000
## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000
## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000
## am gear carb
## Min. :0.0000 Min. :3.000 Min. :1.000
## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000
## Median :0.0000 Median :4.000 Median :2.000
## Mean :0.4062 Mean :3.688 Mean :2.812
## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :1.0000 Max. :5.000 Max. :8.000
Note that some of the numerical variables here are categorical in nature. One example is ‘am,’ which tells us whether the car has an automatic (am = 0) or manual (am = 1) transmission. For R to treat such a variable as categorical, we need to convert it to a factor with the factor() command.
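A minimal sketch of that conversion, assuming the command meant here is factor() (the same function used for the cdc variables later in these slides):

```r
# Overwrite the numeric 0/1 column with a factor version.
# This changes a working copy of mtcars; the built-in dataset is untouched.
mtcars$am <- factor(mtcars$am)

# Confirm that am is now a factor with levels "0" and "1".
str(mtcars$am)
```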
Now we can run the standard commands to explore a categorical variable.
##
## 0 1
## 19 13
Note that since am is a column of the dataframe mtcars, we must refer to it as mtcars$am.
Note that we must run the barplot command on a table, not the raw data.
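The two notes above can be sketched as follows. The name am_counts is only illustrative, and the sketch assumes the counts shown earlier came from table():

```r
# Count the cars in each transmission category (0 = automatic, 1 = manual).
am_counts <- table(mtcars$am)
am_counts

# barplot() draws one bar per entry of the table, so it needs the counts.
barplot(am_counts)
```

Passing the raw numeric 0/1 column to barplot() instead would draw one bar per car rather than one bar per category.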
Use the dataframe loaded in Lab 1 for the following exercises. First load the cdc.Rdata file, which contains the dataframe. Then run the command str() and identify the categorical variables.
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : num 0 0 1 1 0 1 1 0 0 1 ...
## $ hlthplan: num 1 1 1 1 1 1 1 1 1 1 ...
## $ smoke100: num 0 1 1 0 0 0 0 0 1 0 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
Note that two categorical variables, genhlth and gender, are recognizable as such because str() identifies them as factors.
Three other variables are categorical in nature but coded numerically.
The variable “exerany” indicates whether or not the person gets any exercise.
The variable “smoke100” identifies smokers.
The variable “hlthplan” indicates whether the person is covered by a health insurance plan.
In all three cases, the numerical value 0 means false/no, while the numerical value 1 means true/yes.
Convert these three variables to factors and re-run the str() command to verify success.
Be sure to work the problem before you advance to the next slide.
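# Convert the three 0/1 indicator variables to factors.
cdc$smoke100 <- factor(cdc$smoke100)
cdc$exerany <- factor(cdc$exerany)
cdc$hlthplan <- factor(cdc$hlthplan)
# Re-run str() to confirm the conversion.
str(cdc)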
## 'data.frame': 20000 obs. of 9 variables:
## $ genhlth : Factor w/ 5 levels "excellent","very good",..: 3 3 3 3 2 2 2 2 3 3 ...
## $ exerany : Factor w/ 2 levels "0","1": 1 1 2 2 1 2 2 1 1 2 ...
## $ hlthplan: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ smoke100: Factor w/ 2 levels "0","1": 1 2 2 1 1 1 1 1 2 1 ...
## $ height : num 70 64 60 66 61 64 71 67 65 70 ...
## $ weight : int 175 125 105 132 150 114 194 170 150 180 ...
## $ wtdesire: int 175 115 105 124 130 114 185 160 130 170 ...
## $ age : int 77 33 49 42 55 55 31 45 27 44 ...
## $ gender : Factor w/ 2 levels "m","f": 1 2 2 2 2 2 1 1 2 1 ...
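As an optional variation, not part of the original solution, the levels can be given readable labels at conversion time. The sketch below uses a small made-up vector, x, in place of a cdc column, and the labels "no" and "yes" are an assumption based on the coding described earlier (0 = no, 1 = yes):

```r
# Hypothetical 0/1 indicator standing in for a column such as cdc$exerany.
x <- c(0, 0, 1, 1, 0, 1)

# levels lists the codes found in the data; labels gives the names to display.
x_f <- factor(x, levels = c(0, 1), labels = c("no", "yes"))

# The table now shows "no" and "yes" instead of 0 and 1.
table(x_f)
```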
Use your exploratory tools to examine the categorical variable genhlth. Then answer the following questions.
Be sure to work the problem before you advance to the next slide.
##
## excellent very good good fair poor
## 4657 6972 5675 2019 677
##
## excellent very good good fair poor
## 0.23285 0.34860 0.28375 0.10095 0.03385
What is the most common value of genhlth?
Ans: Very good
What is the least common value of genhlth?
Ans: Poor
What fraction of the population considers their health excellent?
Ans: 23%
What fraction of the population considers their health poor?
Ans: 3%
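The answers above can be checked directly from the printed counts. The sketch below re-enters those counts by hand so that it runs without the cdc dataframe; with the data loaded, table(cdc$genhlth) would supply them instead. The names counts and props are illustrative.

```r
# Counts copied from the table output above.
counts <- c(excellent = 4657, "very good" = 6972, good = 5675,
            fair = 2019, poor = 677)

# Dividing by the total turns counts into proportions.
props <- counts / sum(counts)
round(props, 5)

# Most and least common categories.
names(which.max(counts))
names(which.min(counts))
```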
Use your tools to examine smoke100. What can you say?
Be sure to work the problem before you advance to the next slide.
##
## 0 1
## 10559 9441
##
## 0 1
## 0.52795 0.47205
What can you say?
Ans: A slight majority of the population, about 53%, does not smoke.
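One way to reproduce both tables, assuming the tools in question are table() and prop.table(): the sketch rebuilds a stand-in for smoke100 from the printed counts, so it runs without the cdc dataframe. The names smoke and smoke_counts are hypothetical.

```r
# Rebuild a 0/1 factor with the same counts as the output above.
smoke <- factor(rep(c(0, 1), times = c(10559, 9441)))

# Counts per category.
smoke_counts <- table(smoke)
smoke_counts

# prop.table() divides each count by the total.
prop.table(smoke_counts)
```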