Introduction

Regression models usually involve variables measured in different units. For example, age in years and weight in kilograms may be two predictors of a response such as sleeping time in hours. This makes it difficult to compare effect sizes and to interpret the intercept (constant) of the model. Dimensionless regression may help to overcome such difficulties. It is achieved by simple mathematical transformations of the variables that yield standardized regression coefficients. This note illustrates the approach with an example that also indicates an additional advantage of such standardized coefficients.

Illustrative Data Set

Let us consider Birth Weight Data that has the following variables.

  1. id: identity number

  2. matage: maternal age (years)

  3. ht: hypertension (1=yes, 0=no)

  4. gestwks: gestational age (weeks)

  5. sex: sex of the baby

  6. bweight: birth weight (g)

  7. matagegp: maternal age in four groups (<30, 30-34, 35-39, 40+)

  8. gestcat: gestational age (gestwks) in two groups (<37, >=37)

One possible model studies birth weight (in grams) as a function of gestational age (in weeks).

Model - Original Scales

Let us first fit a model with gestwks as predictor \((X)\) and bweight as response variable \((Y)\). The resultant model will be

\[E[Y|X]=\beta_0+\beta_1 X\] where \(X\) and \(Y\) are in their original units (not standardized). We fit the model (estimating the \(\beta\)'s) using lm() in R:

fit1 = lm(bweight ~ gestwks, data = bw_data)

This model will have the following summary


Call:
lm(formula = bweight ~ gestwks, data = bw_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1810.15  -284.69    -6.97   283.06  1248.31 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -4865.245    290.081  -16.77   <2e-16 ***
gestwks       206.641      7.485   27.61   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 441.2 on 639 degrees of freedom
Multiple R-squared:  0.544, Adjusted R-squared:  0.5433 
F-statistic: 762.3 on 1 and 639 DF,  p-value: < 2.2e-16

A graphical representation of this model gives a clearer picture of the relationship.

We note a strong positive correlation (\(r \approx 0.74\), the square root of the reported \(R^2 = 0.544\)) and a nearly linear relationship between the variables. The resultant estimated equation is \[\hat{E}[\textrm{bweight} \mid \textrm{gestwks}] = -4865.2 + 206.6 \times \textrm{gestwks}\]

Interpretation

  1. Coefficient of “gestwks”: a one-week increase in gestwks is associated with an increase of 206.6 g in the mean “bweight”; the plus sign indicates an increase.

  2. Constant / Intercept: at gestwks = 0, that is, at the very start of gestation, the estimated mean bweight is -4865.2 g.

The intercept estimate is a meaningless value for the response variable bweight, since a birth weight cannot be negative.

This is one of the difficulties of using variables in their raw form (original units of measurement). Here, \(X=0\), the initial week of gestation, may at least be meaningful. In other cases, however, a predictor may never take the value zero, or zero may not be a plausible value for it. Such difficulties can be handled by standardizing the variables.
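To see why the intercept is an extrapolation, the fitted line can be evaluated by hand. The following minimal sketch plugs the coefficients reported in the summary above into the fitted equation. The helper name pred_bweight is hypothetical and used only for illustration.

```r
# Coefficients as reported in the model summary above
b0 = -4865.245   # intercept (g)
b1 = 206.641     # slope (g per week of gestation)

# Hypothetical helper: fitted mean birth weight (g) at a given gestational age (weeks)
pred_bweight = function(gestwks) b0 + b1 * gestwks

pred_bweight(0)    # returns the intercept itself, a negative and impossible birth weight
pred_bweight(38)   # about 2987.1 g, a sensible value at a typical gestational age
```

The same line gives a sensible answer near the bulk of the data and an absurd one at zero, which is why the raw intercept should not be read as a birth weight.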

Standardization or Scaling

There are several ways to carry out this transformation of variables. In this note we confine ourselves to standardizing continuous (metric / numeric) predictors. The following steps outline the process.

  1. Centering - choose a plausible reference value of the predictor, say its average (mean)

  2. Subtract the mean from each value of the predictor

  3. Fit the model with the transformed variable

Centering changes only the intercept; the slope, its standard error, the residuals, and the fit statistics keep their values and meaning.
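The effect of these steps can be checked on simulated data. The sketch below is only an illustration: the data are generated at random and are not the birth weight data, and the names x, y, x_cen, fit_raw and fit_cen are assumptions introduced here.

```r
# Simulated stand-in data (hypothetical; NOT the birth weight data set)
set.seed(1)
n = 200
x = rnorm(n, mean = 38, sd = 2)             # hypothetical predictor
y = -4800 + 200 * x + rnorm(n, sd = 400)    # hypothetical response

# Steps 1 and 2: center the predictor at its mean
x_cen = x - mean(x)

# Step 3: fit the model with the raw and with the centered predictor
fit_raw = lm(y ~ x)
fit_cen = lm(y ~ x_cen)

# The slope is the same in both fits
all.equal(unname(coef(fit_raw)[2]), unname(coef(fit_cen)[2]))

# The intercept of the centered fit equals the mean of the response
all.equal(unname(coef(fit_cen)[1]), mean(y))

# Residuals, and hence R-squared and the residual standard error, are unchanged
all.equal(unname(residuals(fit_raw)), unname(residuals(fit_cen)))
```

Each all.equal() call compares two quantities up to numerical tolerance, so the comparison does not depend on how the values are printed.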

In R, this can be done with scale(), where center = TRUE subtracts the mean of the predictor from each of its values and scale = TRUE divides the result by the standard deviation (SD) of the predictor. Symbolically,

\[Z=\frac{X-\bar X}{S_X}\] where \(Z\) is the standardized predictor and \(\bar X, S_X\) are the mean and standard deviation of \(X\).

In the following code we only center \(X\) and do not divide by its SD (scale = FALSE).

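# Store the centered gestational age; as.numeric() drops the matrix attributes returned by scale()
bw_data$gestwks_CEN = as.numeric(scale(bw_data$gestwks, center = TRUE, scale = FALSE))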

A sample of the transformed values is shown below, where X is gestwks and X_CEN is the centered value.

SNO  X      X_CEN
1    37.74  -0.95
2    39.15   0.46
3    35.72  -2.97
4    39.29   0.60
5    38.38  -0.31
Model - Centered Predictor

Underlying model is

\[E[Y|X]=\beta_0+\beta_1 (X-\bar{X})\] where \(\bar{X}\) is the mean of \(X\).

The estimates of the \(\beta\)'s are presented below.


Call:
lm(formula = bweight ~ gestwks_CEN, data = bw_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1810.15  -284.69    -6.97   283.06  1248.31 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3129.137     17.425  179.58   <2e-16 ***
gestwks_CEN  206.641      7.485   27.61   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 441.2 on 639 degrees of freedom
Multiple R-squared:  0.544, Adjusted R-squared:  0.5433 
F-statistic: 762.3 on 1 and 639 DF,  p-value: < 2.2e-16

The resultant estimated equation is \[\hat{E}[\textrm{bweight} \mid \textrm{gestwks}] = 3129.1 + 206.6 \times (\textrm{gestwks}-\textrm{mean}(\textrm{gestwks}))\]
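The model on the original scale and the model with the centered predictor describe the same line. As a brief sketch, write \(\beta_0^{*}\) for the intercept of the centered model (a label introduced here only to distinguish it from \(\beta_0\) of the original model). Adding and subtracting \(\beta_1 \bar X\) gives

\[\beta_0+\beta_1 X=(\beta_0+\beta_1 \bar X)+\beta_1 (X-\bar X), \qquad \textrm{so} \qquad \beta_0^{*}=\beta_0+\beta_1 \bar X\]

The slope \(\beta_1\) is therefore unchanged and only the intercept shifts. Solving for the mean with the reported estimates gives \(\bar X=(3129.137+4865.245)/206.641 \approx 38.69\) weeks, close to the average gestational age used in the interpretation.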

Interpretation

  1. Coefficient of “gestwks”: a one-week increase in gestwks is associated with an increase of 206.6 g in the mean “bweight”, exactly as in the model on the original scale; the plus sign indicates an increase.

  2. Constant / Intercept: the estimated mean bweight is 3129.1 g at the average gestational age (\(\bar X \approx 38.69\) weeks).

Both issues, the implausibility of assuming zero for \(X\) and the inappropriate value for \(Y\), are avoided by the centering transformation.

Final Note

This note illustrates how standardized coefficients help in regression modeling. One method of transformation has been discussed with the help of a data set that exhibits the difficulty of interpreting raw coefficients. Other transformation methods and models without an intercept are not covered here.