Objectives

  1. To understand the need for building relational models between a response and a set of predictors, conceptualized from the contextual mapping

  2. To know the difference between Linear and Non-Linear models

  3. To interpret the estimated weights in a model

  4. To understand the meaning of a Confidence Interval (CI) for concluding the statistical significance of a predictor

Points to Ponder

  1. To build a relational mapping from a contextual mapping

  2. It is possible to construct different forms of relational mapping

  3. LM - A straight-line (Linear) mapping is the easiest way to build mappings / models

  4. However, it is possible to go beyond such modeling to construct Non-Linear (Curvilinear) models

  5. Symbolically, such mappings / models are represented as \(Y=f(X)\) where \(Y\) is the response variable and \(X\) is a set of predictors identified through a contextual mapping from a study / data. \(f\) is the relational mapping / model function

  6. If we assume \(k\) is the number of predictors, then the Linear form of \(f\) is \[Y=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_kX_k\] where the \(\beta\)'s are weights / coefficients / betas that are to be estimated

  7. In real-life situations, this mapping includes a component of error / noise in the data; the actual model is \(Y=f(X)+\textrm{Error}(\epsilon)\). Since errors are quite random (unpredictable), we estimate, rather than try to determine, the \(\beta\)s (see the short simulation sketch after this list)

  8. Methods to build an LM (and other models) are highly dependent on the nature of the response variable. Additionally, some models come with a few assumptions that must hold when constructing them
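As a small illustration of points 6 and 7, the following simulation sketch (hypothetical data, not from the course example) generates a response from a known linear form plus random error and lets lm() estimate the weights

set.seed(1)                      # for reproducibility
n  <- 100
X1 <- rnorm(n); X2 <- rnorm(n)   # two predictors
eps <- rnorm(n, sd = 2)          # random error / noise component
Y  <- 5 + 2*X1 - 3*X2 + eps      # true weights: 5, 2, -3
round(coef(lm(Y ~ X1 + X2)), 4)  # estimates are close to, but not exactly, the true weights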

Fitting LM - R code

Once the response variable and predictors are decided from a contextual mapping, an LM can be constructed using the R function lm()

The major and mandatory input for this function is a formula that connects the response variable and the predictors.

\[Y \sim X_1 + X_2 + \cdots + X_k\] The data set is supplied using the argument data=

Sample code

lm(Y ~ X_1 + X_2 + ... + X_k, data = dataset_name)

Example

Let us consider the Credit data set from ISLR, one of the packages in R

The underlying model is fitted using

ex=ISLR::Credit

fit1=lm(Balance ~ Age + Income + Limit,data=ex)
fit1$coefficients

summary() provides the estimated values of the \(\beta\)s and other necessary values associated with the predictors, including the constant (intercept) \(\beta_0\)
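For instance, the full summary of fit1 can be obtained as below (output not shown here; it contains the estimates along with standard errors, t-values and p-values)

summary(fit1)   # estimates, standard errors, t-values, p-values and R-squared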

Estimated weights (up to four decimals) are extracted from the output after executing the code below

ex=ISLR::Credit

fit1=lm(Balance ~ Age + Income + Limit,data=ex)
round(fit1$coefficients,4)
(Intercept)         Age      Income       Limit 
  -342.1970     -0.8018     -7.5628      0.2637 

Interpretation of Weights \(\beta\)

Numeric predictors

If the fitted LM has a numeric predictor \(X\), then the corresponding weight \(\beta\) is interpreted as the change in the average value of the response variable \(Y\) when the predictor \(X\) is increased by one unit, provided the other predictors are held at a fixed (constant) value; this constant may be zero or any appropriate value
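This interpretation can be checked numerically with predict(). A small sketch using fit1 from the example above (the Age and Limit values are arbitrary, chosen only for illustration): two hypothetical persons differ by exactly one unit of Income, and the difference in their predicted Balance equals the estimated weight of Income

nd <- data.frame(Age = c(40, 40), Income = c(50, 51), Limit = c(5000, 5000))
diff(predict(fit1, newdata = nd))   # equals the estimated weight of Income (-7.5628)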

Categorical predictors

The following are the steps to interpret factor predictors

  1. One of the levels will be considered the Reference or Base level

    • this is software specific; however, options are usually provided to change / fix the base level (see the sketch after this list)

    • In R, the first level (in ascending order) is considered as the base level

  2. Weights \((\beta)\) are the changes in the mean response \(Y\) when the levels of the factor \(X\) are compared with the base level

  3. Other predictors are kept constant; other factor predictors are kept at the same level
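As noted in the first step, the base level can be changed in R using relevel(); a minimal sketch on a copy of the Credit data (choosing Caucasian is only illustrative) is

ex2 <- ex                                                  # work on a copy so later examples are unchanged
levels(ex2$Ethnicity)                                      # the first level listed is the current base level
ex2$Ethnicity <- relevel(ex2$Ethnicity, ref = "Caucasian") # make Caucasian the base level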

Intercept or Constant

The constant in the model provides the average response when all numeric predictors are at zero (or at the fixed values chosen for them, e.g., their means after centering) and all factor predictors are at their base levels

Example Continued

Let us start with a model that has no predictors from the data set but only a constant term; the corresponding code and output are

fit1=lm(Balance~1, data=ex)
fit1$coefficients
(Intercept) 
    520.015 

This is precisely the mean of Balance, the response variable in this model, as verified below

d1=as.data.frame(mean(ex$Balance))
colnames(d1)=""
rownames(d1)="Average Balance"
d1
Average Balance 520.015

Single Predictor

If we add a numeric predictor, then the estimated weights will be

fit2=lm(Balance ~ Income ,data=ex)
d2=data.frame(round(fit2$coefficients,4))
d2 <- cbind(rownames(d2), data.frame(d2, row.names=NULL))
colnames(d2) = c("Predictors", "Estimated Weights")
d2
Predictors Estimated Weights
(Intercept) 246.5148
Income 6.0484

Now the intercept (246.5148) refers to the average Balance when Income is zero

Estimated weight corresponding to the numerical predictor (Income) is the change in the average Balance when Income is increased by one unit. Hence average Balance increases by 6.0484 units when Income is increased by one unit

However, zero may NOT be a reasonable value for the predictor Income. In such cases (which hold for most numeric predictors) it is better to mean-center the predictor using scale() in R

Modified model

First let us center the numeric predictor Income and store it as Income_1 in the data set using the following code

ex$Income_1=scale(ex$Income, center=T,scale=F)

Then the model fit (with the predictor Income_1) is

ex$Income_1=scale(ex$Income, center=T,scale=F)
fit2A=lm(Balance ~ Income_1 ,data=ex)
d2A=data.frame(round(fit2A$coefficients,4))
d2A <- cbind(rownames(d2A), data.frame(d2A, row.names=NULL))
colnames(d2A) = c("Predictors", "Estimated Weights")
d2A
Predictors Estimated Weights
(Intercept) 520.0150
Income_1 6.0484

  1. It can be easily noted that the estimated weight corresponding to the predictor Income does not change.

  2. Estimated constant (520.0150) now refers to the average response Balance when Income is equal to its mean value (a quick check follows this list)
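A quick check of these two points: the intercept of fit2A equals the intercept of fit2 plus the Income weight times the mean Income (centering only shifts the intercept), while the weight itself is unchanged

coef(fit2)["(Intercept)"] + coef(fit2)["Income"] * mean(ex$Income)   # 520.015, the intercept of fit2A
coef(fit2A)["Income_1"]                                              # 6.0484, same as the weight of Income in fit2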

Next let us build a model with one factor predictor, say Ethnicity, which has three levels; the base level is African American

fit3=lm(Balance ~ Ethnicity ,data=ex)
d3=data.frame(round(fit3$coefficients,4))
d3 <- cbind(rownames(d3), data.frame(d3, row.names=NULL))
colnames(d3) = c("Predictors", "Estimated Weights")
d3
Predictors Estimated Weights
(Intercept) 531.0000
EthnicityAsian -18.6863
EthnicityCaucasian -12.5025

The interpretation of the weights can be well understood by taking the average of the response Balance for each level of Ethnicity. The following code yields the group-wise average of Balance

library(dplyr)
grp_mean=data.frame(ex %>% group_by(Ethnicity) %>% summarise(Mean_Balance=mean(Balance)))
grp_mean
Ethnicity Mean_Balance
African American 531.0000
Asian 512.3137
Caucasian 518.4975

From these two numerical summaries, we can observe that

  1. the intercept of model fit3 (531) is the average of Balance when Ethnicity is African American

  2. the weight corresponding to EthnicityAsian (-18.6863) refers to the difference in the average of Balance for the Asian group compared to African American (from the second summary, the mean of Balance for the Asian group is 512.3137, which is 531 - 18.6863)

  3. In a similar way, the average Balance for the Caucasian group is 518.4975; the weight in fit3 is -12.5025, which is the difference between the average Balance of the Caucasian and African American groups (518.4975 - 531), as verified in the sketch below
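These observations can be verified in one step by subtracting the base-level mean from each group mean; the result reproduces the weights of fit3 (with zero for the base level)

round(grp_mean$Mean_Balance - grp_mean$Mean_Balance[1], 4)   #  0.0000 -18.6863 -12.5025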

More Predictors

Let us continue building a model with three numerical predictors (after centering)

ex=ex %>% mutate(Age_1=scale(Age, center=T,scale=F),
                 Limit_1=scale(Limit, center=T,scale=F),
                 Income_1=scale(Income, center=T,scale=F))
fit4=lm(Balance ~ Age_1+Income_1+Limit_1 ,data=ex)
d4=data.frame(round(fit4$coefficients,4))
rownames(d4)=c("Constant", "Age", "Income", "Limit")
d4 <- cbind(rownames(d4), data.frame(d4, row.names=NULL))
colnames(d4) = c("Predictors", "Estimated Weights")
d4
Predictors Estimated Weights
Constant 520.0150
Age -0.8018
Income -7.5628
Limit 0.2637

The intercept 520.015 refers to the average Balance for a person whose Age is 55.6675, Income is 45.2189 and Limit is 4735.6. The other weights are the increase (for Limit) or decrease (for Age and Income) in the average Balance when that predictor is increased by one unit while the other predictors are at their average values
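The stated mean values can be confirmed directly from the data

round(colMeans(ex[, c("Age", "Income", "Limit")]), 4)   # 55.6675  45.2189  4735.6000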

When we build a model with factor predictors, additional care must be taken to interpret the constant term in the model. For example, let us include the factor predictor Student in fit4

fit5=lm(Balance ~ Age_1+Income_1+Limit_1+Student ,data=ex)
d5=data.frame(round(fit5$coefficients,4))
rownames(d5)=c("Constant", "Age", "Income", "Limit", "Student Yes")
d5 <- cbind(rownames(d5), data.frame(d5, row.names=NULL))
colnames(d5) = c("Predictors", "Estimated Weights")
d5
Predictors Estimated Weights
Constant 477.4207
Age -0.5291
Income -7.8347
Limit 0.2671
Student Yes 425.9432

The weight (425.9432) corresponding to Student refers to the difference in the average Balance between the two levels of Student (No / Yes). On average, Balance is expected to increase by 425.9432 units when Student is Yes compared to No (the base level), while the other numeric predictors are kept at their respective means

Also, the effective intercept (average Balance) for Student = Yes is 477.4207 + 425.9432 = 903.3639 when the other predictors Age, Income and Limit are kept at their respective means.

Accordingly, the average Balance is 477.4207 (the constant) for a person who is not a student (the base level of Student), whose Age is 55.6675, Income is 45.2189 and Limit is 4735.6. If he / she is a student then the average Balance is 903.3639
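The two averages quoted above can be recovered directly from the estimated weights of fit5

b <- coef(fit5)
b["(Intercept)"]                     # 477.4207, average Balance for a non-student at mean Age, Income and Limit
b["(Intercept)"] + b["StudentYes"]   # 903.3639, average Balance for a student at the same means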

Significance of Predictors

Once a linear model is built and the interpretation of the weights is appropriately done, the next step is to check the significance of the predictors. This is carried out using the estimated weights.

Meaning of Significance

A predictor is considered statistically significant when we can conclude that its weight is different from zero; here this is judged through the Confidence Interval (CI) of each estimated weight, where a CI that does not cover zero indicates significance

More details about CI

We shall use fit5 to find the CIs at the 5% level of significance. The 95% CIs are

CI_fit5=data.frame(confint(fit5))
CI_fit5=round(CI_fit5,4)
rownames(CI_fit5)=c("Constant", "Age", "Income", "Limit", "Student Yes")
CI_fit5<- cbind(rownames(CI_fit5), data.frame(CI_fit5, row.names=NULL))
colnames(CI_fit5) = c("Predictors", "Estimated LL", "Estimated UL")
CI_fit5
Predictors Estimated LL Estimated UL
Constant 466.6270 488.2143
Age -1.1344 0.0761
Income -8.3177 -7.3517
Limit 0.2598 0.2744
Student Yes 391.7655 460.1209

Accordingly, except for the predictor Age, all other predictors are statistically significant (the CI for Age covers zero; the rest do not)
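The same conclusion can be reached programmatically by checking which intervals exclude zero; a small sketch is

ci <- confint(fit5)         # 95% CI for each weight
ci[, 1] > 0 | ci[, 2] < 0   # TRUE when the interval does not cover zero (FALSE only for Age_1)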

Final Remarks

The following steps may be followed to build an LM, interpret the weights and find their statistical significance (an end-to-end sketch follows the list).

  1. Contextual mapping - Data context - Data domain

  2. Fitting model - use software

    • Center / scale the numeric predictors

    • Estimate the weights / betas / coefficients

  3. Interpret the weights

  4. Estimate the limits of the Confidence Interval

  5. Interpret CI
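A minimal end-to-end sketch of these steps on the same Credit data (the choice of predictors here is only illustrative) is

ex <- ISLR::Credit                                              # 1. data identified from the context
ex$Income_1 <- scale(ex$Income, center = TRUE, scale = FALSE)   # 2. center the numeric predictor
fit <- lm(Balance ~ Income_1 + Student, data = ex)              #    fit the LM and estimate the weights
round(coef(fit), 4)                                             # 3. interpret the weights
round(confint(fit), 4)                                          # 4. limits of the 95% CI
# 5. a predictor is significant when its CI does not cover zero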