A couple more things on regression

Olesya Volchenko & Anna Shirokanova

5/15/2023

Our plan for today

Standardized coefficients

When we use variables "as is" in linear regression, we have two problems:

Why can't coefficients be compared?

Numeric variables can be measured on VERY different scales.

Standardized coefficients

also known as “beta coefficients”

Betas are calculated by subtracting the mean from the variable and dividing by its standard deviation. This results in standardized variables having a mean of zero and a standard deviation of 1.

Despite the name, it isn’t actually the coefficients that get standardized, but the variables.

More here: https://www.statisticshowto.com/standardized-beta-coefficient/

Standardized coefficients: graphically

var <- rnorm(1000, 20, 30)
hist(var)
abline(v = mean(var), col = "red", lwd = 3)

Centering:

hist(var - mean(var))
abline(v = 0, col = "red", lwd = 3)

Scaling:

hist(var/sd(var))
abline(v = mean(var/sd(var)), col = "red", lwd = 3)

Scaling and centering together:

hist((var-mean(var)) / sd(var))
abline(v = mean((var-mean(var)) / sd(var)), col = "red", lwd = 3)

There is a convenient scale() function in R:

scale(var, center = TRUE, scale = FALSE) # centering only
scale(var, center = FALSE, scale = TRUE) # scaling only
scale(var, center = TRUE, scale = TRUE) # centering and scaling
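
A quick illustrative check that scale() with both options reproduces the manual (x - mean(x)) / sd(x) computation (scale() returns a one-column matrix, hence the as.numeric()):

# should return TRUE: scale() matches manual centering and scaling
all.equal(as.numeric(scale(var, center = TRUE, scale = TRUE)),
          (var - mean(var)) / sd(var))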

Why might it be important to scale? An example

Let’s run a model on a generated dataset.
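
The slides do not show how this dataset was generated; a hypothetical sketch that would produce predictors on very different scales (only the names x1, x2 and y are taken from the model below, everything else is assumed) could be:

# hypothetical data-generating code, for illustration only
set.seed(1)
df <- data.frame(x1 = runif(100, 75, 545),  # wide-range predictor
                 x2 = runif(100, 1, 5.3))   # narrow-range predictor
df$y <- 20 + 3 * df$x1 + 2.5 * df$x2 + rnorm(100, 0, 10)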

library(sjPlot)  # for tab_model()
m1 <- lm(y ~ x1 + x2, data = df)
tab_model(m1)
                   y
Predictors         Estimates   CI               p
(Intercept)        21.22       11.34 – 31.11    <0.001
x1                 3.00        2.98 – 3.03      <0.001
x2                 2.47        0.29 – 4.64      0.027
Observations       100
R2 / R2 adjusted   0.999 / 0.999

So the coefficients for x1 and x2 are almost equal. But can we tell that the magnitude of the effects is the same?

range(df$x1)
## [1]  78.53001 540.16178
range(df$x2)
## [1] 1.085641 5.307978

Let's take a look at the beta coefficients:

library(lm.beta)
lm.beta(m1)
## 
## Call:
## lm(formula = y ~ x1 + x2, data = df)
## 
## Standardized Coefficients::
## (Intercept)          x1          x2 
##          NA 0.999237696 0.008750954
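
The same betas can be recovered by hand from the unstandardized slopes: each slope is multiplied by the ratio of the predictor's SD to the outcome's SD (a minimal check using m1 and df from above):

# beta_j = b_j * sd(x_j) / sd(y)
coef(m1)["x1"] * sd(df$x1) / sd(df$y)
coef(m1)["x2"] * sd(df$x2) / sd(df$y)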

You can also print out beta coefficients with the tab_model() function:

tab_model(m1, show.std = TRUE)
                   y
Predictors         Estimates   std. Beta   CI               standardized CI   p
(Intercept)        21.22       0.00        11.34 – 31.11    -0.01 – 0.01      <0.001
x1                 3.00        1.00        2.98 – 3.03      0.99 – 1.01       <0.001
x2                 2.47        0.01        0.29 – 4.64      0.00 – 0.02       0.027
Observations       100
R2 / R2 adjusted   0.999 / 0.999

Why might it be important to center predictors? An example
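
The simulated data behind this example are not shown either; a hypothetical sketch producing a crossover interaction of roughly this shape (only the names x, cond and y are taken from the model below, everything else is assumed) could be:

# hypothetical data-generating code, for illustration only
set.seed(2)
df <- data.frame(x = runif(200, 1, 5.5),
                 cond = factor(rep(c("a", "b"), each = 100)))
df$y <- 1 + 3 * df$x + ifelse(df$cond == "b", 5 - 6 * df$x, 0) + rnorm(200)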

m1 <- lm(y ~ x * cond, data = df)
summary(m1)
## 
## Call:
## lm(formula = y ~ x * cond, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91411 -0.55945 -0.05044  0.64814  2.64157 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.9639     0.3683   2.617  0.00955 ** 
## x             3.0212     0.1138  26.538  < 2e-16 ***
## condb         5.2669     0.4959  10.620  < 2e-16 ***
## x:condb      -6.0817     0.1561 -38.968  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.017 on 196 degrees of freedom
## Multiple R-squared:  0.9806, Adjusted R-squared:  0.9803 
## F-statistic:  3310 on 3 and 196 DF,  p-value: < 2.2e-16

Before centering

plot(df$x, df$y, xlim = c(0, 6), pch = 16, col = df$cond)
abline(a = 0.9639, b = 3.0212)                                  # regression line for cond = a
abline(a = 0.9639 + 5.2669, b = 3.0212 - 6.0817, col = "red")   # regression line for cond = b
abline(v = 0)   # the cond coefficient is the gap between the lines at x = 0

After centering

plot(df$x, df$y, xlim = c(0, 6), pch = 16, col = df$cond)
abline(a = 0.9639, b = 3.0212)                                  # regression line for cond = a
abline(a = 0.9639 + 5.2669, b = 3.0212 - 6.0817, col = "red")   # regression line for cond = b
abline(v = mean(df$x))   # after centering, the cond coefficient is the gap at the mean of x

df$x_c <- scale(df$x, center = TRUE, scale = FALSE)
m1 <- lm(y ~ x_c * cond, data = df)
summary(m1)
## 
## Call:
## lm(formula = y ~ x_c * cond, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.91411 -0.55945 -0.05044  0.64814  2.64157 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.1347     0.1021   99.28   <2e-16 ***
## x_c           3.0212     0.1138   26.54   <2e-16 ***
## condb       -13.1942     0.1443  -91.41   <2e-16 ***
## x_c:condb    -6.0817     0.1561  -38.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.017 on 196 degrees of freedom
## Multiple R-squared:  0.9806, Adjusted R-squared:  0.9803 
## F-statistic:  3310 on 3 and 196 DF,  p-value: < 2.2e-16

To summarize

Standardizing puts predictors on a common SD scale, so their coefficients can be compared; centering moves the zero point to the mean, so the intercept (and, in interaction models, the lower-order coefficients) is evaluated at the mean of the predictor rather than at an often meaningless zero.

Any questions?

Variable transformation

Several common types of variable transformation are used in linear regression modelling of survey data:

You want to have a meaningful intercept in the model

To make your intercept interpretable: center!

data$age_cent <- data$age - mean(data$age)
# same as
data$age_cent <- scale(data$age, center = TRUE, scale = FALSE)

Now, 0 is the average value, not the absence of a feature: "When a respondent is 1 year above the average age, Y changes by [coefficient], holding everything else constant."

You need to compare several predictors by strength

To make coefficients of the continuous predictors comparable, standardize (=scale) the variables:

data$V1_scaled <- scale(data$V1)
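
Note that scale() returns a one-column matrix with attributes; wrapping it in as.numeric() keeps the new variable as a plain vector (an optional precaution):

data$V1_scaled <- as.numeric(scale(data$V1))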

*If you scale both the predictors and the outcome, all units become standard deviations; with a single predictor, the standardized slope equals Pearson's correlation coefficient.

Now, V1_scaled changes in standard deviations of V1: "When V1 is 1 standard deviation above the mean, Y changes by [coefficient], holding everything else constant" (in SD units of Y if Y was scaled as well).
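
A small illustrative check (simulated data, the names are placeholders): with both variables scaled, the slope of a one-predictor regression equals Pearson's r:

x <- rnorm(100)
y <- 2 * x + rnorm(100)
coef(lm(scale(y) ~ scale(x)))[2]   # standardized slope...
cor(x, y)                          # ...equals the correlation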

You need to log-transform a variable

To make a very large scale like GDP comparable (and the relationship linear):

data$gdp_log <- log(data$gdp)   # natural log

*To avoid errors when values equal zero, sometimes log(data$gdp + 1) is used

Now, gdp_log is the natural log of gdp: "When GDP is 1 unit higher on the log scale, the predicted Y changes by [coefficient], holding everything else constant."

You need to revert the scale

To make a scale run in the 'right' direction, invert it. Example: "How happy are you? 0 - Very happy, 10 - Not at all"

Goal: to have a scale from not at all (0) to very happy (10).

data$hap <- 10 - data$V1
# or, in a more general form:
data$hap <- max(data$V1) - data$V1

*The exact formula depends on the original scale and desired outcome, e.g. 

max(data$var + 1) - data$var

Now, data$hap’s regression coefficient can be interpreted like this: “When the respondent’s happiness is 1 point higher, Y changes by [coefficient], holding everything else constant.”

You need to create an additive index

Let's say we want to use trust as a predictor in our model. However, in the ESS there are several measures of trust. Which one to choose? You can use all of them by creating an additive index.

data$trust_index <- data$trust1 + data$trust2 + data$trust3

# or
data$trust_index <- (data$trust1 + data$trust2 + data$trust3) / 3

# or
data$trust_index <- rowMeans(data.frame(data$trust1, data$trust2, data$trust3))
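
If respondents skipped some of the items, rowMeans() with na.rm = TRUE averages over the items that are present (an illustrative variant):

data$trust_index <- rowMeans(data[, c("trust1", "trust2", "trust3")], na.rm = TRUE)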

You can change the baseline of a categorical predictor

m1 <- lm(y ~ cond, data = df)
summary(m1)
## 
## Call:
## lm(formula = y ~ cond, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9689 -0.7170 -0.0520  0.7134  3.8494 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.039134   0.107544  -0.364    0.716    
## condb       -0.005385   0.152090  -0.035    0.972    
## condc        1.839554   0.152090  12.095   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 297 degrees of freedom
## Multiple R-squared:  0.3971, Adjusted R-squared:  0.3931 
## F-statistic: 97.82 on 2 and 297 DF,  p-value: < 2.2e-16

df$cond <- relevel(df$cond, ref = "c")
m1 <- lm(y ~ cond, data = df)
summary(m1)
## 
## Call:
## lm(formula = y ~ cond, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9689 -0.7170 -0.0520  0.7134  3.8494 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.8004     0.1075   16.74   <2e-16 ***
## conda        -1.8396     0.1521  -12.10   <2e-16 ***
## condb        -1.8449     0.1521  -12.13   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 297 degrees of freedom
## Multiple R-squared:  0.3971, Adjusted R-squared:  0.3931 
## F-statistic: 97.82 on 2 and 297 DF,  p-value: < 2.2e-16
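
An equivalent way (a small sketch) is to set the order of the factor levels directly; the first level becomes the reference category:

df$cond <- factor(df$cond, levels = c("c", "a", "b"))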

Steps in linear regression modeling

An extra task

Make a cheat sheet on linear regression (including interactions). You can find some examples of cheat sheets here.

It is not obligatory, but you can gain some extra points for in-class participation.

Upload your cheat sheets to the "cheat sheets" discussion in Canvas before June 1.

Total recap of everything in 10 ideas

1

Everything that takes up different values is a variable.

The world can be described through variables.

2

Each variable belongs to a class by measurement type (nominal/ordinal/interval, categorical/continuous) which defines its legit methods of analysis.

3

Research hypotheses can focus on one variable (univariate distribution), on the relations between two variables (bivariate distribution, correlation and association measures), or on many variables (multivariate distribution and regression).

4

R handles data management and statistical tests better than any other language to date.

The main variable types used in R include character, factor, and numeric.

5

When looking at a single variable, note its central tendency measures (mode, median, and the mean).

When looking at its distribution, note the standard deviation, skew, and kurtosis.

Continuous variables have all of this; ordinal ones have the median and the mode; nominal ones, the mode only.

6

Relationships between two variables are formally tested with association measures (e.g. chi-square).

Relationships between two continuous variables also have magnitude and direction (positive or negative).

Pearson's correlation coefficient is good for two continuous variables. There are adapted correlation coefficients for any pair of variable types.

7

Any relationships within data can be described with a model.

Data = Model + Error

In modelling, we want to generalise results by estimating the parameters in the whole population based on the sample in hand.

Linear models are the simplest type of models where a continuous variable is to be predicted / explained.

8

In a linear regression, a continuous outcome is modelled. Predictors can be of any type. Regression coefficients show the change in the outcome associated with either a 1-unit increase in the predictor, or with belonging to the category as compared to the reference category.

If predictors are not independent, their relationship can be a moderation. To test for it, interaction effects are introduced to the model.

9

When there are two or more groups that differ by a continuous variable, their mean values can be compared directly.

Use a t-test for independent samples or the F-test in one-way ANOVA to compare means. If the F-test shows that the means are not all the same, use a post-hoc test to learn which groups have different mean values.

These are parametric tests that assume the normal distribution of variables. If this assumption does not hold, use non-parametric equivalents.

10

Interpretation of a statistical test should take you back to the level of hypotheses and theory, in a feedback loop.

A p-value is the probability of observing differences as large as, or larger than, those in the sample, given that there is no true difference in the population. P-values serve for a rough yes/no answer; effect size can quantify the difference.

When analysing your results, look at the graphs and use confidence intervals in addition to reporting p-values. Ground your conclusions on the overall picture, not a p-value alone.

About your final exam

Some final questions?