Harold Nelson
9/10/2020
Let’s create a simple dataframe using data.frame.
Name = c("Tom","Dick","Harry","Mary","Sally","Susan")
Gender = c("m","m","m","f","f","f")
People = data.frame(Name,Gender)
People
## Name Gender
## 1 Tom m
## 2 Dick m
## 3 Harry m
## 4 Mary f
## 5 Sally f
## 6 Susan f
## 'data.frame': 6 obs. of 2 variables:
## $ Name : chr "Tom" "Dick" "Harry" "Mary" ...
## $ Gender: chr "m" "m" "m" "f" ...
How can we turn People$Gender into a factor? I’ll create the factor version as a separate variable.
## Name Gender GenderF
## 1 Tom m m
## 2 Dick m m
## 3 Harry m m
## 4 Mary f f
## 5 Sally f f
## 6 Susan f f
## 'data.frame': 6 obs. of 3 variables:
## $ Name : chr "Tom" "Dick" "Harry" "Mary" ...
## $ Gender : chr "m" "m" "m" "f" ...
## $ GenderF: Factor w/ 2 levels "f","m": 2 2 2 1 1 1
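The call that created GenderF is not shown with the output above. A minimal sketch that should reproduce it rebuilds People so that it runs on its own. It assumes the strings are meant to stay as character columns, which is made explicit here with stringsAsFactors = FALSE.

```r
# Rebuild the example data so this sketch is self-contained
Name = c("Tom", "Dick", "Harry", "Mary", "Sally", "Susan")
Gender = c("m", "m", "m", "f", "f", "f")
People = data.frame(Name, Gender, stringsAsFactors = FALSE)

# factor() uses the sorted unique values as levels: "f" is 1, "m" is 2
People$GenderF = factor(People$Gender)
str(People)
```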
Gender and GenderF look the same, but they differ in some respects. GenderF is somewhat ambiguous: in the dataframe it is really stored as numeric codes, together with a separate table of levels that maps each code to a character string. Watch what happens when we create numeric versions of these two variables using as.numeric().
## Warning: NAs introduced by coercion
We see a warning. Let’s look at our vectors. First check GenderNum: every element is NA, because “m” and “f” cannot be converted to numbers.
Now look at GenderFNum.
## [1] 2 2 2 1 1 1
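The as.numeric() calls behind these results are not shown. Here is a self-contained sketch, assuming the results were stored in vectors named GenderNum and GenderFNum as the text suggests:

```r
Gender = c("m", "m", "m", "f", "f", "f")
GenderF = factor(Gender)

# "m" and "f" are not numbers, so every element becomes NA (with a warning)
GenderNum = as.numeric(Gender)
GenderNum
# [1] NA NA NA NA NA NA

# A factor returns its underlying integer codes instead
GenderFNum = as.numeric(GenderF)
GenderFNum
# [1] 2 2 2 1 1 1
```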
Can we treat GenderF as a numeric variable?
## Warning in Ops.factor(People$GenderF, 1): '+' not meaningful for factors
## [1] NA NA NA NA NA NA
No, we can’t do this.
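The warning names Ops.factor(People$GenderF, 1), so the attempted operation was adding 1 to the factor. A standalone sketch of the same attempt on a plain vector:

```r
GenderF = factor(c("m", "m", "m", "f", "f", "f"))

# Arithmetic is not defined for factors: R warns and returns all NA
Result = GenderF + 1
Result
# [1] NA NA NA NA NA NA
```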
What about using different character strings to represent our factor? We can select the values as we create the factor with the labels argument of the factor function.
## Name Gender GenderF
## 1 Tom m Male
## 2 Dick m Male
## 3 Harry m Male
## 4 Mary f Female
## 5 Sally f Female
## 6 Susan f Female
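The factor() call for this step is not shown. A sketch that should give the same labels, using a standalone Gender vector:

```r
Gender = c("m", "m", "m", "f", "f", "f")

# labels are matched to the sorted levels: "f" becomes "Female", "m" becomes "Male"
GenderF = factor(Gender, labels = c("Female", "Male"))
data.frame(Gender, GenderF)
```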
Note that the sorted values of the Gender vector are used to set the numeric values in the factor. We can override this using the levels argument of the factor function.
## Name Gender GenderF
## 1 Tom m m
## 2 Dick m m
## 3 Harry m m
## 4 Mary f f
## 5 Sally f f
## 6 Susan f f
## Factor w/ 2 levels "m","f": 1 1 1 2 2 2
It looks the same, but str() reveals that the numeric value 1 now means “m”, not “f”.
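A sketch of the levels override described above, on a standalone vector (the original call is not shown):

```r
Gender = c("m", "m", "m", "f", "f", "f")

# levels fixes the code order explicitly: "m" is 1, "f" is 2
GenderF = factor(Gender, levels = c("m", "f"))
str(GenderF)
```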
This is important if we use the levels() function to assign new string values after the factor has been created. Suppose we forget and think in terms of the natural sorted order of values.
## Name Gender GenderF
## 1 Tom m Woman
## 2 Dick m Woman
## 3 Harry m Woman
## 4 Mary f Man
## 5 Sally f Man
## 6 Susan f Man
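The mislabeled result above is what an assignment like the following would produce. This is a sketch on a standalone vector, assuming the factor was built with levels = c("m", "f") as in the preceding step:

```r
Gender = c("m", "m", "m", "f", "f", "f")
GenderF = factor(Gender, levels = c("m", "f"))

# Mistake: the new names are listed in alphabetical order of the old values
# ("f", "m"), but the factor's actual level order is "m", "f"
levels(GenderF) = c("Woman", "Man")
data.frame(Gender, GenderF)
```

Printing levels(GenderF) before assigning new names is a simple guard against this mix-up.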
Here’s an example with the county data. First load the data.
## Classes 'tbl_df', 'tbl' and 'data.frame': 3142 obs. of 15 variables:
## $ name : Factor w/ 1877 levels "Abbeville County",..: 83 90 101 150 165 226 236 249 297 319 ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ pop2000 : num 43671 140415 29038 20826 51024 ...
## $ pop2010 : num 54571 182265 27457 22915 57322 ...
## $ pop2017 : int 55504 212628 25270 22668 58013 10309 19825 114728 33713 25857 ...
## $ pop_change : num 1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
## $ poverty : num 13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
## $ homeownership : num 77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
## $ multi_unit : num 7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
## $ unemployment_rate: num 3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
## $ metro : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 2 1 1 ...
## $ median_edu : Factor w/ 4 levels "below_hs","hs_diploma",..: 3 3 2 2 2 2 2 3 2 2 ...
## $ per_capita_income: num 27842 27780 17892 20572 21367 ...
## $ median_hh_income : int 55317 52562 33368 43404 47412 29655 36326 43686 37342 40041 ...
## $ smoking_ban : Factor w/ 3 levels "none","partial",..: 1 1 2 1 1 1 NA NA 1 1 ...
We see that the variable state is a factor.
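The loading code is not shown. The column names match the county dataset shipped with the openintro package, so the sketch below assumes that source; this is a guess, not something the text confirms. The guard simply skips the step if the package is not installed.

```r
# Assumption: county comes from the openintro package
if (requireNamespace("openintro", quietly = TRUE)) {
  county = openintro::county
  str(county)
}
```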
Let’s get a count of counties in each state using the table() function.
##
## Alabama Alaska Arizona
## 67 29 15
## Arkansas California Colorado
## 75 58 64
## Connecticut Delaware District of Columbia
## 8 3 1
## Florida Georgia Hawaii
## 67 159 5
## Idaho Illinois Indiana
## 44 102 92
## Iowa Kansas Kentucky
## 99 105 120
## Louisiana Maine Maryland
## 64 16 24
## Massachusetts Michigan Minnesota
## 14 83 87
## Mississippi Missouri Montana
## 82 115 56
## Nebraska Nevada New Hampshire
## 93 17 10
## New Jersey New Mexico New York
## 21 33 62
## North Carolina North Dakota Ohio
## 100 53 88
## Oklahoma Oregon Pennsylvania
## 77 36 67
## Rhode Island South Carolina South Dakota
## 5 46 66
## Tennessee Texas Utah
## 95 254 29
## Vermont Virginia Washington
## 14 133 39
## West Virginia Wisconsin Wyoming
## 55 72 23
Let’s create a smaller dataframe, pnw.
## Classes 'tbl_df', 'tbl' and 'data.frame': 119 obs. of 15 variables:
## $ name : Factor w/ 1877 levels "Abbeville County",..: 4 6 98 120 134 155 159 167 172 173 ...
## $ state : Factor w/ 51 levels "Alabama","Alaska",..: 13 13 13 13 13 13 13 13 13 13 ...
## $ pop2000 : num 300904 3476 75565 6411 9171 ...
## $ pop2010 : num 392365 3976 82839 5986 9285 ...
## $ pop2017 : int 456849 4147 85269 6028 9184 45927 22024 7290 43560 114595 ...
## $ pop_change : num 9.8 7.35 2.23 1.43 1.74 1.11 3.61 8.45 7.5 6.74 ...
## $ poverty : num 11.8 13.8 17.6 14.5 15.8 13.1 14.8 11.3 14 11.9 ...
## $ homeownership : num 69.6 80.4 71.2 80.9 74.2 79.9 68.3 76.8 74.5 74.1 ...
## $ multi_unit : num 18 3.1 19.5 7.5 5.8 9.6 29.9 1.3 10.1 17.4 ...
## $ unemployment_rate: num 2.78 5.66 3.01 3.25 5.53 2.95 2.53 4.78 4.49 2.67 ...
## $ metro : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 2 ...
## $ median_edu : Factor w/ 4 levels "below_hs","hs_diploma",..: 3 3 3 3 2 3 3 3 3 3 ...
## $ per_capita_income: num 31150 24528 24043 24464 21863 ...
## $ median_hh_income : int 60151 42727 47390 50603 43472 51307 58835 49964 45607 54150 ...
## $ smoking_ban : Factor w/ 3 levels "none","partial",..: 2 NA 2 2 2 2 2 2 2 2 ...
Note that the state factor in the smaller dataframe has all of the original levels. Look at what happens when we try to get a table of the state variable from pnw.
##
## Alabama Alaska Arizona
## 0 0 0
## Arkansas California Colorado
## 0 0 0
## Connecticut Delaware District of Columbia
## 0 0 0
## Florida Georgia Hawaii
## 0 0 0
## Idaho Illinois Indiana
## 44 0 0
## Iowa Kansas Kentucky
## 0 0 0
## Louisiana Maine Maryland
## 0 0 0
## Massachusetts Michigan Minnesota
## 0 0 0
## Mississippi Missouri Montana
## 0 0 0
## Nebraska Nevada New Hampshire
## 0 0 0
## New Jersey New Mexico New York
## 0 0 0
## North Carolina North Dakota Ohio
## 0 0 0
## Oklahoma Oregon Pennsylvania
## 0 36 0
## Rhode Island South Carolina South Dakota
## 0 0 0
## Tennessee Texas Utah
## 0 0 0
## Vermont Virginia Washington
## 0 0 39
## West Virginia Wisconsin Wyoming
## 0 0 0
##
## Idaho Oregon Washington
## 44 36 39
Using as.character() followed by factor() drops the extraneous levels.
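The subsetting and level-dropping steps are not shown. The sketch below reproduces the behavior on a small made-up factor rather than the county data. The toy dataframe and its values are hypothetical, and base subsetting stands in for whatever method created pnw.

```r
# Hypothetical stand-in for the county data
toy = data.frame(state = factor(c("Alabama", "Idaho", "Idaho",
                                  "Oregon", "Washington", "Washington")))

# Subsetting rows keeps every original level, including "Alabama"
pnw = toy[toy$state %in% c("Washington", "Oregon", "Idaho"), , drop = FALSE]
table(pnw$state)   # Alabama still appears, with a count of 0

# Round-tripping through character rebuilds the levels from the data
pnw$state = factor(as.character(pnw$state))
table(pnw$state)   # only Idaho, Oregon, Washington remain
```

Base R's droplevels() has the same effect in a single call.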
Note that a factor’s levels have a specific order. By default, the levels are sorted alphabetically. We can override this with the levels argument of factor(), and the resulting order is whatever we specify.
# Rebuild the factor with the levels in the order we want
pnw$state = factor(as.character(pnw$state),
                   levels = c("Washington",
                              "Oregon",
                              "Idaho"))
table(pnw$state)
##
## Washington Oregon Idaho
## 39 36 44
And if you want to use postal codes instead of names, use the labels argument.
# labels are paired with levels by position: Washington -> WA, and so on
pnw$state = factor(as.character(pnw$state),
                   levels = c("Washington",
                              "Oregon",
                              "Idaho"),
                   labels = c("WA", "OR", "ID"))
table(pnw$state)
##
## WA OR ID
## 39 36 44
What happens if we convert the factor variable back to a string variable?
##
## ID OR WA
## 44 36 39
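The code for this step is not shown; the output suggests something like table(as.character(pnw$state)). A standalone sketch with a made-up vector shows why the order changes:

```r
# Hypothetical factor with a custom level order
state = factor(c("WA", "OR", "ID", "ID"), levels = c("WA", "OR", "ID"))

# As a factor, table() follows the level order
names(table(state))
# [1] "WA" "OR" "ID"

# As character, the custom order is lost and table() sorts alphabetically
names(table(as.character(state)))
# [1] "ID" "OR" "WA"
```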
The cdc dataset has many variables that are coded numerically but are really categorical. Essentially, 0 means “No” and 1 means “Yes”.
The variable exerany is one of these. Create a new factor variable exeranyf with meaningful values. Create a table of these two variables to verify your work.
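One possible approach, sketched on a made-up vector rather than the cdc data (the values below are hypothetical):

```r
# Hypothetical 0/1 responses standing in for cdc$exerany
exerany = c(1, 0, 1, 1, 0)

# Map 0 to "No" and 1 to "Yes"; levels and labels are paired by position
exeranyf = factor(exerany, levels = c(0, 1), labels = c("No", "Yes"))

# Cross-tabulate: all counts should fall in the 0/No and 1/Yes cells
CheckTable = table(exerany, exeranyf)
CheckTable
```

With the real data, the same pattern would apply to cdc$exerany, storing the result as cdc$exeranyf.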