Regression with Categorical Variables

Interpreting a regression coefficient for a categorical independent variable is slightly different from interpreting one for a numerical variable.

Let me demonstrate, and we’ll start with a numerical variable. I’ll use the data for the final paper in this example, just to make things simple.

I’ll regress TRUST on AGE (a numerical variable) to begin.

dat <- read.csv("https://raw.githubusercontent.com/ejvanholm/DataProjects/master/FinalPaperData.csv")

summary(lm(TRUST~AGE, data=dat))

## 
## Call:
## lm(formula = TRUST ~ AGE, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4923 -0.3359 -0.2837  0.6334  0.7561 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.188635   0.048038   3.927 9.23e-05 ***
## AGE         0.003068   0.001045   2.934  0.00342 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4658 on 944 degrees of freedom
##   (469 observations deleted due to missingness)
## Multiple R-squared:  0.009039,   Adjusted R-squared:  0.007989 
## F-statistic: 8.611 on 1 and 944 DF,  p-value: 0.003423

For each one-unit increase in AGE we see a .003 increase in TRUST, and that change is significant (p = 0.003). A one-unit change here is 1 year. A person can be 0 years old, or 1 year old, or 2 years old - the numbers in the variable AGE go up by 1 unit at a time to the maximum in our data. When we say “For each one-unit increase in AGE we see a .003 increase” we’re comparing a person to someone who is 1 year younger. So a 27 year old would be .003 more likely to say they trust people than a 26 year old. Regression coefficients are all about comparison in that way, and with numerical variables we’re always comparing people with 1 more unit to people with 1 less.
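If you want to see that comparison in action, here is a minimal sketch. The data frame `toy` is simulated for illustration (it is not the survey data above), so only the logic carries over, not the numbers.

```r
# Simulated data, made up for illustration only (not the survey data above)
set.seed(1)
toy <- data.frame(AGE = sample(18:80, 200, replace = TRUE))
toy$TRUST <- rbinom(200, 1, 0.2 + 0.003 * toy$AGE)

fit <- lm(TRUST ~ AGE, data = toy)

# Predicted TRUST for a 26 year old and a 27 year old
pred <- predict(fit, newdata = data.frame(AGE = c(26, 27)))

# The gap between the two predictions is the AGE coefficient
gap   <- unname(pred[2] - pred[1])
slope <- unname(coef(fit)["AGE"])
all.equal(gap, slope)
```

The same gap would show up for any two ages that are 1 year apart, because a linear regression fits a single slope for the whole range of AGE.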

Categorical variables work a little differently. Let’s start with SEX, which has two values in our data: male and female.

summary(lm(TRUST~SEX, data=dat))

## 
## Call:
## lm(formula = TRUST ~ SEX, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3397 -0.3397 -0.3051  0.6603  0.6949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.30508    0.02152  14.176   <2e-16 ***
## SEXMale      0.03458    0.03040   1.137    0.256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4676 on 944 degrees of freedom
##   (469 observations deleted due to missingness)
## Multiple R-squared:  0.001368,   Adjusted R-squared:  0.0003103 
## F-statistic: 1.293 on 1 and 944 DF,  p-value: 0.2557

The variable’s name shows as SEXMale. What that means is the coefficient on that line (.03458) is the difference for males in the data, relative to females. What the SEX variable is comparing is men to women. Being a male, as opposed to a female, is associated with being .035 more likely to say people can be trusted, and that difference is not significant (p = 0.256). In that example female is the omitted category. We’re comparing men to women, and the coefficient is the change from a female to a male. If we changed the variable so that we compared females to males the coefficient would be the same size, only negative. Women are less likely to say people can be trusted than men, because men are more likely to say that. We always have a comparison, but it doesn’t make sense to think about increasing gender/sex by one unit. There’s nothing to increase, it’s just a change.
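One way to convince yourself of what the omitted category does is to compare the regression output to simple group means. This is a minimal sketch on made-up data (`toy` is hypothetical, not the survey data): the intercept should match the mean for the omitted group, and the coefficient should match the gap between the two group means.

```r
# Made-up data for illustration only (not the survey data above)
toy <- data.frame(
  SEX   = c("Female", "Female", "Female", "Female",
            "Male", "Male", "Male", "Male"),
  TRUST = c(0, 0, 1, 0, 1, 0, 1, 0)
)

fit   <- lm(TRUST ~ SEX, data = toy)
means <- tapply(toy$TRUST, toy$SEX, mean)  # female mean is 0.25, male mean is 0.5

# Intercept = mean of the omitted category (Female)
intercept <- unname(coef(fit)["(Intercept)"])
all.equal(intercept, unname(means["Female"]))

# SEXMale = male mean minus female mean
b_male <- unname(coef(fit)["SEXMale"])
all.equal(b_male, unname(means["Male"] - means["Female"]))
```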

I’m going to create a new variable for FEMALE, which will take the value of 1 if the observation is for a female. That means we’re now comparing females to males, rather than males to females as above.

dat$FEMALE <- ifelse(dat$SEX=="Female", 1, 0)
summary(lm(TRUST~FEMALE, data=dat))

## 
## Call:
## lm(formula = TRUST ~ FEMALE, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3397 -0.3397 -0.3051  0.6603  0.6949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.33966    0.02148  15.816   <2e-16 ***
## FEMALE      -0.03458    0.03040  -1.137    0.256    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4676 on 944 degrees of freedom
##   (469 observations deleted due to missingness)
## Multiple R-squared:  0.001368,   Adjusted R-squared:  0.0003103 
## F-statistic: 1.293 on 1 and 944 DF,  p-value: 0.2557

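Being a female, as opposed to a male, is associated with being .035 less likely to say people can be trusted, and that difference is not significant (p = 0.256). The coefficient changes its sign, but the standard error, the p-value and the model fit stay the same. Only the intercept changes, because it is now the value for males rather than females.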
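Here is a small sketch of that sign flip on made-up data (`toy` is hypothetical, not the survey data). It fits the model both ways and compares the coefficients and p-values.

```r
# Made-up data for illustration only (not the survey data above)
toy <- data.frame(
  SEX   = rep(c("Female", "Male"), each = 5),
  TRUST = c(0, 0, 1, 0, 0, 1, 0, 1, 0, 0)
)
toy$FEMALE <- ifelse(toy$SEX == "Female", 1, 0)

fit_sex    <- lm(TRUST ~ SEX, data = toy)     # Female is the omitted category
fit_female <- lm(TRUST ~ FEMALE, data = toy)  # Male (FEMALE == 0) is the omitted category

b_male   <- unname(coef(fit_sex)["SEXMale"])
b_female <- unname(coef(fit_female)["FEMALE"])

p_male   <- summary(fit_sex)$coefficients["SEXMale", "Pr(>|t|)"]
p_female <- summary(fit_female)$coefficients["FEMALE", "Pr(>|t|)"]

all.equal(b_male, -b_female)  # same size, opposite sign
all.equal(p_male, p_female)   # same p-value
```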

Numerical variables make comparisons based on one-unit increases. Categorical variables make comparisons against the omitted category. In the case of SEX that was females. There’s always a value in the data that won’t show up in the regression for that reason.

Let’s do a categorical variable with 3 values then, like marriage. Let’s look at the variable first to see what values are present.

table(dat$MARITAL)

## 
##       Married Never Married   Was Married 
##           631           440           344

Okay so there’s Married, Never Married, and Was Married. Let’s put it in a regression now.

summary(lm(TRUST~MARITAL, data=dat))

## 
## Call:
## lm(formula = TRUST ~ MARITAL, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3972 -0.3972 -0.2351  0.6028  0.7649 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.39716    0.02250  17.654  < 2e-16 ***
## MARITALNever Married -0.16206    0.03486  -4.649  3.8e-06 ***
## MARITALWas Married   -0.09852    0.03840  -2.565   0.0105 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4627 on 943 degrees of freedom
##   (469 observations deleted due to missingness)
## Multiple R-squared:  0.02318,    Adjusted R-squared:  0.02111 
## F-statistic: 11.19 on 2 and 943 DF,  p-value: 1.575e-05

First question, what is the omitted category here? We see “Never Married” and “Was Married” which means that Married has been omitted. We’re comparing people that were Never Married to people that are Married, and separately “Was Married” to Married. So what do we learn?

People that are married are the most likely to say people can be trusted, because both of the other categories have negative coefficients relative to Married.

Being never married, as compared to being married, is associated with being .162 less likely to say people can be trusted, and that change is significant.

Having been married before (Was Married), as compared to being married, is associated with being .099 less likely to say people can be trusted, and that change is significant.
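To make the comparison concrete, the coefficients reported in the output above can be turned back into a value for each group. This is just arithmetic on the printed estimates, treating the intercept as the value for the omitted category (Married); the object names are mine.

```r
# Estimates copied from the regression output above
b <- c(intercept = 0.39716, never_married = -0.16206, was_married = -0.09852)

# Intercept = omitted category; other groups = intercept + their coefficient
group_means <- c(
  "Married"       = b[["intercept"]],
  "Never Married" = b[["intercept"]] + b[["never_married"]],
  "Was Married"   = b[["intercept"]] + b[["was_married"]]
)

round(group_means, 3)
```

Notice that the output compares each group to Married only. It does not test Never Married against Was Married directly; for that you would need to change the omitted category and rerun the model.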

Hopefully that helps this to make sense.

And for anyone wondering how R decides what to omit, its default rule is to omit whatever comes up first alphabetically. So females are omitted because f comes before m, and married is dropped because m comes before n and w.
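If you want a different omitted category, one option is base R's `relevel()`, which moves a level you pick to the front. A minimal sketch on a made-up vector (the values mirror the MARITAL categories but are not the survey data):

```r
# Made-up vector for illustration only
marital <- factor(c("Married", "Never Married", "Was Married", "Married"))

# Levels are sorted alphabetically; the first one is the omitted category
default_levels <- levels(marital)
default_levels

# Move "Never Married" to the front so it becomes the omitted category
marital2 <- relevel(marital, ref = "Never Married")
new_levels <- levels(marital2)
new_levels
```

Using the releveled variable in `lm()` would then report coefficients for Married and Was Married, each compared to Never Married.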

van Holm

11/27/2020