Factors

Harold Nelson

9/10/2020

Creating a Factor

Let’s create a simple dataframe using data.frame().

Name = c("Tom","Dick","Harry","Mary","Sally","Susan")
Gender = c("m","m","m","f","f","f")
People = data.frame(Name,Gender)
People
##    Name Gender
## 1   Tom      m
## 2  Dick      m
## 3 Harry      m
## 4  Mary      f
## 5 Sally      f
## 6 Susan      f
str(People)
## 'data.frame':    6 obs. of  2 variables:
##  $ Name  : chr  "Tom" "Dick" "Harry" "Mary" ...
##  $ Gender: chr  "m" "m" "m" "f" ...

How can we turn People$Gender into a factor? I’ll create the factor version as a separate variable.

People$GenderF = factor(People$Gender)
People
##    Name Gender GenderF
## 1   Tom      m       m
## 2  Dick      m       m
## 3 Harry      m       m
## 4  Mary      f       f
## 5 Sally      f       f
## 6 Susan      f       f
str(People)
## 'data.frame':    6 obs. of  3 variables:
##  $ Name   : chr  "Tom" "Dick" "Harry" "Mary" ...
##  $ Gender : chr  "m" "m" "m" "f" ...
##  $ GenderF: Factor w/ 2 levels "f","m": 2 2 2 1 1 1

Gender and GenderF look the same, but they differ in an important way. GenderF is really stored as a vector of integer codes in the dataframe, together with a separate table of levels that maps each code to a character string. Watch what happens when we create numeric versions of these two variables using as.numeric().

GenderNum = as.numeric(People$Gender)
## Warning: NAs introduced by coercion
GenderFNum = as.numeric(People$GenderF)

The first conversion produces a warning, because the strings “m” and “f” cannot be read as numbers. Let’s look at our vectors. First check GenderNum.

GenderNum
## [1] NA NA NA NA NA NA

Now look at GenderFNum.

GenderFNum
## [1] 2 2 2 1 1 1
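
The two pieces of a factor can be inspected directly. The following is a minimal sketch using as.integer() and levels() on a standalone copy of the Gender vector; the names gender, gender_f, codes, lev and rebuilt are introduced here for illustration only.

```r
# Standalone copy of the Gender column (illustrative names)
gender = c("m","m","m","f","f","f")
gender_f = factor(gender)

codes = as.integer(gender_f)  # the integer codes stored in the factor
lev = levels(gender_f)        # the table that maps codes to strings

# Indexing the levels by the codes recovers the original strings
rebuilt = lev[codes]

print(codes)
print(lev)
print(identical(rebuilt, gender))
```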

Can we treat GenderF as a numeric variable?

People$GenderF + 1
## Warning in Ops.factor(People$GenderF, 1): '+' not meaningful for factors
## [1] NA NA NA NA NA NA

No, we can’t do this. Arithmetic on a factor returns NA with a warning, even though the underlying codes are integers.
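
A related pitfall arises when the levels themselves look like numbers: as.numeric() still returns the codes, not the values printed on screen. The sketch below uses a made-up factor, ages_f (hypothetical, not part of the People data), and shows one common way to recover the printed values by converting to character first.

```r
# Hypothetical factor whose levels look like numbers
ages_f = factor(c("30","10","20","10"))

# The codes are positions in the sorted levels "10", "20", "30"
codes = as.numeric(ages_f)

# Converting to character first recovers the numbers as printed
values = as.numeric(as.character(ages_f))

print(codes)
print(values)
```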

Levels and Labels

What about using different character strings to represent our factor? We can select the values as we create the factor with the labels argument of the factor function.

People$GenderF = factor(People$Gender,labels = c("Female","Male"))
People
##    Name Gender GenderF
## 1   Tom      m    Male
## 2  Dick      m    Male
## 3 Harry      m    Male
## 4  Mary      f  Female
## 5 Sally      f  Female
## 6 Susan      f  Female

Note that the levels default to the sorted unique values of the Gender vector, so “f” gets the numeric value 1 and “m” gets 2, and the labels are applied in that order. We can override this using the levels argument of the factor function.

People$GenderF = factor(People$Gender,levels=c("m","f"))
People
##    Name Gender GenderF
## 1   Tom      m       m
## 2  Dick      m       m
## 3 Harry      m       m
## 4  Mary      f       f
## 5 Sally      f       f
## 6 Susan      f       f
str(People$GenderF)
##  Factor w/ 2 levels "m","f": 1 1 1 2 2 2

It looks the same, but str() reveals that the numeric value 1 now means “m”, not “f”.

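This is important if we use the levels() function to assign new string values after the factor has been created, because the new strings are matched to the existing levels by position. Suppose we forget that the order is now “m”, “f” and think in terms of the natural sorted order of values.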

levels(People$GenderF) = c("Woman","Man")
People
##    Name Gender GenderF
## 1   Tom      m   Woman
## 2  Dick      m   Woman
## 3 Harry      m   Woman
## 4  Mary      f     Man
## 5 Sally      f     Man
## 6 Susan      f     Man
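
In the output above, the men are labeled “Woman” and the women “Man”, because levels() assigned the new strings by position. One way to reduce that risk, sketched below on a standalone vector, is to give levels() a named list in which each name is the new label and each value is the existing level it replaces, so the match is made by name. The variable gender_f is illustrative.

```r
# Standalone factor with the non-default level order used above
gender_f = factor(c("m","m","m","f","f","f"), levels = c("m","f"))

# Each list name is the new label; each value is the old level it replaces
levels(gender_f) = list(Man = "m", Woman = "f")

print(as.character(gender_f))
```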

Factors = Pain??

Here’s an example with the county data. First load the data.

load("county.rda")
str(county)
## Classes 'tbl_df', 'tbl' and 'data.frame':    3142 obs. of  15 variables:
##  $ name             : Factor w/ 1877 levels "Abbeville County",..: 83 90 101 150 165 226 236 249 297 319 ...
##  $ state            : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ pop2000          : num  43671 140415 29038 20826 51024 ...
##  $ pop2010          : num  54571 182265 27457 22915 57322 ...
##  $ pop2017          : int  55504 212628 25270 22668 58013 10309 19825 114728 33713 25857 ...
##  $ pop_change       : num  1.48 9.19 -6.22 0.73 0.68 -2.28 -2.69 -1.51 -1.2 -0.6 ...
##  $ poverty          : num  13.7 11.8 27.2 15.2 15.6 28.5 24.4 18.6 18.8 16.1 ...
##  $ homeownership    : num  77.5 76.7 68 82.9 82 76.9 69 70.7 71.4 77.5 ...
##  $ multi_unit       : num  7.2 22.6 11.1 6.6 3.7 9.9 13.7 14.3 8.7 4.3 ...
##  $ unemployment_rate: num  3.86 3.99 5.9 4.39 4.02 4.93 5.49 4.93 4.08 4.05 ...
##  $ metro            : Factor w/ 2 levels "no","yes": 2 2 1 2 2 1 1 2 1 1 ...
##  $ median_edu       : Factor w/ 4 levels "below_hs","hs_diploma",..: 3 3 2 2 2 2 2 3 2 2 ...
##  $ per_capita_income: num  27842 27780 17892 20572 21367 ...
##  $ median_hh_income : int  55317 52562 33368 43404 47412 29655 36326 43686 37342 40041 ...
##  $ smoking_ban      : Factor w/ 3 levels "none","partial",..: 1 1 2 1 1 1 NA NA 1 1 ...

We see that the variable state is a factor.

Let’s get a count of counties in each state using the table() function.

table(county$state)
## 
##              Alabama               Alaska              Arizona 
##                   67                   29                   15 
##             Arkansas           California             Colorado 
##                   75                   58                   64 
##          Connecticut             Delaware District of Columbia 
##                    8                    3                    1 
##              Florida              Georgia               Hawaii 
##                   67                  159                    5 
##                Idaho             Illinois              Indiana 
##                   44                  102                   92 
##                 Iowa               Kansas             Kentucky 
##                   99                  105                  120 
##            Louisiana                Maine             Maryland 
##                   64                   16                   24 
##        Massachusetts             Michigan            Minnesota 
##                   14                   83                   87 
##          Mississippi             Missouri              Montana 
##                   82                  115                   56 
##             Nebraska               Nevada        New Hampshire 
##                   93                   17                   10 
##           New Jersey           New Mexico             New York 
##                   21                   33                   62 
##       North Carolina         North Dakota                 Ohio 
##                  100                   53                   88 
##             Oklahoma               Oregon         Pennsylvania 
##                   77                   36                   67 
##         Rhode Island       South Carolina         South Dakota 
##                    5                   46                   66 
##            Tennessee                Texas                 Utah 
##                   95                  254                   29 
##              Vermont             Virginia           Washington 
##                   14                  133                   39 
##        West Virginia            Wisconsin              Wyoming 
##                   55                   72                   23

Let’s create a smaller dataframe, pnw.

pnw = county[county$state %in% c("Washington", "Oregon", "Idaho"),]
str(pnw)
## Classes 'tbl_df', 'tbl' and 'data.frame':    119 obs. of  15 variables:
##  $ name             : Factor w/ 1877 levels "Abbeville County",..: 4 6 98 120 134 155 159 167 172 173 ...
##  $ state            : Factor w/ 51 levels "Alabama","Alaska",..: 13 13 13 13 13 13 13 13 13 13 ...
##  $ pop2000          : num  300904 3476 75565 6411 9171 ...
##  $ pop2010          : num  392365 3976 82839 5986 9285 ...
##  $ pop2017          : int  456849 4147 85269 6028 9184 45927 22024 7290 43560 114595 ...
##  $ pop_change       : num  9.8 7.35 2.23 1.43 1.74 1.11 3.61 8.45 7.5 6.74 ...
##  $ poverty          : num  11.8 13.8 17.6 14.5 15.8 13.1 14.8 11.3 14 11.9 ...
##  $ homeownership    : num  69.6 80.4 71.2 80.9 74.2 79.9 68.3 76.8 74.5 74.1 ...
##  $ multi_unit       : num  18 3.1 19.5 7.5 5.8 9.6 29.9 1.3 10.1 17.4 ...
##  $ unemployment_rate: num  2.78 5.66 3.01 3.25 5.53 2.95 2.53 4.78 4.49 2.67 ...
##  $ metro            : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 2 ...
##  $ median_edu       : Factor w/ 4 levels "below_hs","hs_diploma",..: 3 3 3 3 2 3 3 3 3 3 ...
##  $ per_capita_income: num  31150 24528 24043 24464 21863 ...
##  $ median_hh_income : int  60151 42727 47390 50603 43472 51307 58835 49964 45607 54150 ...
##  $ smoking_ban      : Factor w/ 3 levels "none","partial",..: 2 NA 2 2 2 2 2 2 2 2 ...

Note that the state factor in the smaller dataframe has all of the original levels. Look at what happens when we try to get a table of the state variable from pnw.

table(pnw$state)
## 
##              Alabama               Alaska              Arizona 
##                    0                    0                    0 
##             Arkansas           California             Colorado 
##                    0                    0                    0 
##          Connecticut             Delaware District of Columbia 
##                    0                    0                    0 
##              Florida              Georgia               Hawaii 
##                    0                    0                    0 
##                Idaho             Illinois              Indiana 
##                   44                    0                    0 
##                 Iowa               Kansas             Kentucky 
##                    0                    0                    0 
##            Louisiana                Maine             Maryland 
##                    0                    0                    0 
##        Massachusetts             Michigan            Minnesota 
##                    0                    0                    0 
##          Mississippi             Missouri              Montana 
##                    0                    0                    0 
##             Nebraska               Nevada        New Hampshire 
##                    0                    0                    0 
##           New Jersey           New Mexico             New York 
##                    0                    0                    0 
##       North Carolina         North Dakota                 Ohio 
##                    0                    0                    0 
##             Oklahoma               Oregon         Pennsylvania 
##                    0                   36                    0 
##         Rhode Island       South Carolina         South Dakota 
##                    0                    0                    0 
##            Tennessee                Texas                 Utah 
##                    0                    0                    0 
##              Vermont             Virginia           Washington 
##                    0                    0                   39 
##        West Virginia            Wisconsin              Wyoming 
##                    0                    0                    0

Solve the problem

pnw$state = factor(as.character(pnw$state))
table(pnw$state)
## 
##      Idaho     Oregon Washington 
##         44         36         39

Using as.character() followed by factor() drops the extraneous levels.
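
Base R also provides droplevels(), which removes unused levels directly. The sketch below uses a small made-up factor in place of the county data, so the names state and sub are illustrative.

```r
# Made-up stand-in for county$state
state = factor(c("Idaho","Oregon","Texas","Washington","Idaho"))

# Subsetting keeps every original level, including the unused "Texas"
sub = state[state %in% c("Washington","Oregon","Idaho")]
print(levels(sub))

# droplevels() keeps only the levels that still occur
sub = droplevels(sub)
print(levels(sub))
print(table(sub))
```

droplevels() also accepts a whole dataframe, in which case it drops unused levels from every factor column.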

Note that the levels of a factor have a specific order, which is alphabetical by default. We can override this with the levels argument of factor(), so the resulting order is whatever we want.

pnw$state = factor(as.character(pnw$state),
                   levels = c("Washington",
                              "Oregon",
                              "Idaho"))
table(pnw$state)
## 
## Washington     Oregon      Idaho 
##         39         36         44

And if you want to use postal codes instead of names, use the labels argument.

pnw$state = factor(as.character(pnw$state),
                   levels = c("Washington",
                              "Oregon",
                              "Idaho"),
                   labels = c("WA","OR","ID"))
table(pnw$state)
## 
## WA OR ID 
## 39 36 44

What happens if we convert the factor variable back to a character variable? The custom level order is lost, so table() sorts the values alphabetically.

pnw$state = as.character(pnw$state)
table(pnw$state)
## 
## ID OR WA 
## 44 36 39
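
The alphabetical order returns because a character vector carries no level information. The same difference shows up in sort(), as this small sketch with made-up values suggests; state_chr and state_f are illustrative names.

```r
# Made-up values; state_f uses the custom level order from above
state_chr = c("ID","WA","OR","WA")
state_f = factor(state_chr, levels = c("WA","OR","ID"))

# A factor sorts by level order; a character vector sorts alphabetically
sorted_f = as.character(sort(state_f))
sorted_chr = sort(state_chr)

print(sorted_f)
print(sorted_chr)
```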

Exercise

The cdc dataset has many variables that are coded numerically but are really categorical. Essentially, 0 means “No” and 1 means “Yes”.

The variable exerany is one of these. Create a new factor variable exeranyf with meaningful values. Create a table of these two variables to verify your work.

Solution

load("cdc.Rdata")
cdc$exeranyf = factor(cdc$exerany,labels = c("No","Yes"))
table(cdc$exerany,cdc$exeranyf)
##    
##        No   Yes
##   0  5086     0
##   1     0 14914
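
The solution relies on 0 and 1 both being present, since the labels are applied to the sorted unique values. A slightly more defensive variant, sketched here on a made-up vector and not on the cdc data itself, names the levels explicitly so the mapping does not depend on which values happen to occur.

```r
# Made-up 0/1 vector standing in for cdc$exerany
exerany = c(1, 0, 1, 1, 0)
exeranyf = factor(exerany, levels = c(0, 1), labels = c("No","Yes"))

# Cross-tabulate the two versions to verify the mapping
check = table(exerany, exeranyf)
print(check)

# The mapping is still correct when one code is absent from the data
all_ones = factor(c(1, 1, 1), levels = c(0, 1), labels = c("No","Yes"))
print(table(all_ones))
```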