October 27, 2016

Categorical/qualitative data

  • We're looking at variation in \(x\), but sometimes it's unreasonable to think of \(x_i\) as being greater than or less than \(x_j\).
  • Examples:
    • We could look at the number of X chromosomes, but it would be more reasonable to just treat sex as a categorical variable.
    • The birthplace of an individual
    • The industry of a firm
    • The region a country is in.

Dummy variables

  • For binary variables (things that either are or aren't), we can create dummy variables.
    • e.g. a car's transmission is either manual or automatic. mtcars$am takes a value of 1 for cars with a manual transmission and 0 for cars with an automatic.
  • For categorical variables, we can create a dummy variable for each category.
    • e.g. if we want to know about the relationship between GDP growth and world region, we could create dummy variables for Western Europe, Sub-Saharan Africa, etc.
      • As a control variable, we might only look at a small subset of the possible regional dummies. But we can use an F test to check whether that's a reasonable move (sketched below).
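
A quick sketch of that F test, assuming a hypothetical data frame growth_df with columns gdp_growth, investment, and a factor region:

growth_df$africa <- ifelse(growth_df$region == "Sub-Saharan Africa", 1, 0) # keep just one dummy
fit_subset <- lm(gdp_growth ~ investment + africa, growth_df)
fit_full   <- lm(gdp_growth ~ investment + region, growth_df) # dummies for every region
anova(fit_subset, fit_full) # F test: is dropping the other regional dummies reasonable?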

Dummy variables

fit0 <- lm(Sepal.Length ~ ., iris) # regress Sepal.Length on everything else
summary(fit0)
Call:
lm(formula = Sepal.Length ~ ., data = iris)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.79424 -0.21874  0.00899  0.20255  0.73103 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        2.17127    0.27979   7.760 1.43e-12 ***
Sepal.Width        0.49589    0.08607   5.761 4.87e-08 ***
Petal.Length       0.82924    0.06853  12.101  < 2e-16 ***
Petal.Width       -0.31516    0.15120  -2.084  0.03889 *  
Speciesversicolor -0.72356    0.24017  -3.013  0.00306 ** 
Speciesvirginica  -1.02350    0.33373  -3.067  0.00258 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3068 on 144 degrees of freedom
Multiple R-squared:  0.8673,    Adjusted R-squared:  0.8627 
F-statistic: 188.3 on 5 and 144 DF,  p-value: < 2.2e-16
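
Note that setosa gets no dummy of its own: it's the omitted baseline, so the Species coefficients are differences from setosa. You can inspect the dummy coding R built from the factor:

contrasts(iris$Species)     # setosa is the omitted baseline category
head(model.matrix(fit0), 3) # design matrix: intercept plus one column per dummy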

Qualitative data types in R

  • Data types can be checked using is().
is(iris$Species) ; is(iris$Sepal.Length)
[1] "factor"              "integer"             "oldClass"           
[4] "numeric"             "vector"              "data.frameRowLabels"
[1] "numeric" "vector" 
  • "Numeric" data is (usually) not, qualitative.
    • If it should be qualitative (e.g. zip codes) you can change it with as.character() (or as.factor())
  • "Character" data is text
  • "Factor" data is explicitly qualitative

Dummy variables in R

  • You don't (necessarily) need to create a new dummy variable for each category by hand.
    • If you wanted to, you could do something like this: df$africa <- ifelse(df$continent == "africa", 1, 0)
  • If you put character or factor data into a regression, R will automatically create a dummy variable for each level (omitting one as the baseline, unless you drop the intercept).
    • Just be careful that your data is formatted correctly… capitalization matters (remember: computers are stupid and need your guidance… common sense is your area of comparative advantage).
"ALLCAPS" == "allcaps"
[1] FALSE
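
One common fix is to standardize case before comparing:

tolower("ALLCAPS") == "allcaps"
[1] TRUE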

Dummy variables in R, example

Sometimes you'll see something like the result below. What went wrong?

Call: lm(formula = mpg ~ hp + wt + disp, data = mtcars.wrong)
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  26.83764    4.17034   6.435 0.000201 ***
hp62         -9.51024    4.33672  -2.193 0.059645 .  
hp65          2.82410    2.09106   1.351 0.213795    
hp66         -1.77900    2.14614  -0.829 0.431180    
hp91         -5.30248    2.24247  -2.365 0.045631 *  
hp93         -9.16462    2.61653  -3.503 0.008049 ** 
hp95        -11.07347    4.28895  -2.582 0.032522 *  
hp97        -10.71666    2.82050  -3.800 0.005240 ** 
hp105       -15.56664    4.32215  -3.602 0.006966 ** 
hp109       -11.68873    3.53552  -3.306 0.010761 *  
hp110       -11.38519    3.02898  -3.759 0.005554 ** 
hp113         0.53411    2.04212   0.262 0.800278    
hp123       -15.84432    4.69465  -3.375 0.009714 ** 
hp150       -17.26634    3.60149  -4.794 0.001366 ** 
hp175       -13.18269    3.32742  -3.962 0.004166 ** 
hp180       -17.83923    4.78625  -3.727 0.005811 ** 
hp205       -25.13005    6.95310  -3.614 0.006840 ** 
hp215       -25.77152    7.51069  -3.431 0.008938 ** 
hp230       -21.50560    7.46695  -2.880 0.020508 *  
hp245       -18.89200    3.85578  -4.900 0.001194 ** 
hp264       -15.44246    3.01868  -5.116 0.000912 ***
hp335       -18.00364    4.00655  -4.494 0.002019 ** 
wt            2.80500    2.96326   0.947 0.371556    
disp         -0.01278    0.01345  -0.950 0.369803    

Residual standard error: 1.392 on 8 degrees of freedom
Multiple R-squared:  0.9862,    Adjusted R-squared:  0.9467 
F-statistic: 24.93 on 23 and 8 DF,  p-value: < 1e-04
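
What likely went wrong: hp was stored as character (or factor) data, so R created a separate dummy for every distinct horsepower value. A sketch of how mtcars.wrong could have ended up that way, and the fix (a guess at the construction, not the actual code that produced it):

mtcars.wrong <- mtcars
mtcars.wrong$hp <- as.character(mtcars.wrong$hp) # hp accidentally stored as text
mtcars.wrong$hp <- as.numeric(mtcars.wrong$hp)   # fix: convert back to a number
summary(lm(mpg ~ hp + wt + disp, data = mtcars.wrong)) # now hp gets a single slope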

Dummy variables in R, example

Let's see if we can treat number of cylinders as a categorical variable.

fit1 <- lm(qsec ~ as.factor(cyl) + hp - 1, mtcars)
print(summary(fit1), concise = TRUE)
Call: lm(formula = qsec ~ as.factor(cyl) + hp - 1, data = mtcars)
                 Estimate Std. Error t value Pr(>|t|)    
as.factor(cyl)4 20.757765   0.663148  31.302   <1e-04 ***
as.factor(cyl)6 20.375156   0.930557  21.896   <1e-04 ***
as.factor(cyl)8 20.874818   1.391386  15.003   <1e-04 ***
hp              -0.019610   0.006435  -3.047    0.005 ** 

Residual standard error: 1.314 on 28 degrees of freedom
Multiple R-squared:  0.9953,    Adjusted R-squared:  0.9946 
F-statistic:  1483 on 4 and 28 DF,  p-value: < 1e-04

Holding hp constant, it looks like 6-cylinder cars are fastest (they have the lowest predicted quarter-mile time, qsec).
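
The - 1 in the formula drops the intercept, so each cylinder count gets its own dummy rather than being measured relative to a baseline. Keeping the intercept gives the same fit, just reparameterized as differences from the 4-cylinder group:

fit2 <- lm(qsec ~ as.factor(cyl) + hp, mtcars) # with intercept: 4 cylinders is the baseline
coef(fit2)                            # the cyl 6 and cyl 8 coefficients are differences from cyl 4
all.equal(fitted(fit1), fitted(fit2)) # identical fitted values, different parameterization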