To understand the need for building relational models between response and a set of predictors, conceptualized from the contextual mapping.
To know the difference between Linear and Non-Linear models
To interpret the estimated weights in a model
To understand the meaning of a Confidence Interval (CI) for concluding statistical significance of a predictor
To build a relational mapping from a contextual mapping
It is possible to construct different forms of relational mapping
LM - A straight-line (Linear) mapping is the easiest way to build mappings / models
However, it is possible to go beyond such modeling to construct Non-Linear (Curvilinear) models
Symbolically, such mappings / models are represented as \(Y=f(X)\), where \(Y\) is the response variable and \(X\) is a set of predictors identified through a contextual mapping from a study / data. \(f\) is the relational mapping / model function
If \(k\) is the number of predictors, then the Linear form of \(f\) is \[Y=\beta_0+\beta_1X_1+\beta_2X_2+\cdots+\beta_kX_k\] where the \(\beta\)s are the weights / coefficients / betas that are to be estimated
In real-world situations, this mapping includes a component of error / noise in the data; the actual model is \(Y=f(X)+\textrm{Error}\,(\epsilon)\). Because the errors are random (unpredictable), we estimate the \(\beta\)s rather than try to determine them exactly
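A minimal sketch of what "estimating" the weights means, using simulated data. The true weights, the sample size and the noise level are illustrative assumptions, not values from the Credit data; the point is only that lm() returns estimates close to, but not exactly equal to, the weights that generated the data.

```r
# Simulate data from a known linear model Y = beta_0 + beta_1*X1 + beta_2*X2 + error.
# All numbers below are assumptions chosen for illustration.
set.seed(1)
n  <- 1000
X1 <- rnorm(n)
X2 <- rnorm(n)
beta_true <- c(2, 1.5, -0.5)             # beta_0, beta_1, beta_2
eps <- rnorm(n, mean = 0, sd = 1)        # random, unpredictable error
Y   <- beta_true[1] + beta_true[2] * X1 + beta_true[3] * X2 + eps

# Estimate the weights from the noisy data
sim_fit  <- lm(Y ~ X1 + X2)
beta_hat <- unname(coef(sim_fit))

# Compare true and estimated weights side by side
print(round(cbind(true = beta_true, estimated = beta_hat), 3))
```

Re-running with a different seed gives slightly different estimates, which is why the weights are reported together with a measure of uncertainty.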
Methods to build an LM (and other models) depend heavily on the nature of the response variable. Additionally, some models come with a few assumptions that must hold when they are constructed
Once the response variable and predictors are decided from a contextual mapping, an LM can be constructed using the R function lm()
The major and mandatory input for this function is a formula that connects the predictors and the response variable.
\[Y \sim X_1 + X_2 + \cdots + X_k\] The data set is supplied using the argument data=
Sample code
lm(Y ~ X_1 + X_2 +……..+X_k, data = data set name)
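When the list of predictors is long, the formula can also be assembled from a vector of names. The sketch below uses reformulate() from base R on the built-in mtcars data; the data set and the chosen columns are stand-ins (assumptions) so that the example runs without extra packages.

```r
# Build the model formula from predictor names instead of typing it out.
# mtcars and the columns wt, hp, mpg are stand-ins for illustration only.
predictors    <- c("wt", "hp")
model_formula <- reformulate(predictors, response = "mpg")   # mpg ~ wt + hp

fit_typed <- lm(mpg ~ wt + hp, data = mtcars)   # formula written by hand
fit_built <- lm(model_formula, data = mtcars)   # formula built from names

print(model_formula)
print(all.equal(coef(fit_typed), coef(fit_built)))
```

Both calls fit the same model; the second form is convenient when the predictor set comes out of an earlier selection step.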
Let us consider the Credit data set from ISLR, one of the packages in R
Response: Balance
Predictors: Age, Income, Limit
Underlying model is fitted using
ex=ISLR::Credit
fit1=lm(Balance ~ Age + Income + Limit,data=ex)
fit1$coefficients
summary() provides the estimated values of the \(\beta\)s, including the constant (intercept) \(\beta_0\), and other values associated with the predictors
The estimated weights (up to four decimals), extracted after executing the above code, are
ex=ISLR::Credit
fit1=lm(Balance ~ Age + Income + Limit,data=ex)
round(fit1$coefficients,4)
(Intercept)         Age      Income       Limit
  -342.1970     -0.8018     -7.5628      0.2637
If the fitted LM has a numeric predictor \(X\), then the corresponding weight \(\beta\) is interpreted as the change in the average value of the response variable \(Y\) when the predictor \(X\) is increased by one unit, provided the other predictors are held at the same (constant) value; that constant may be zero or any other appropriate value
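This interpretation can be checked numerically: predict the response for two cases that differ by exactly one unit in one predictor and are identical otherwise. The sketch uses the built-in mtcars data as a stand-in; the values wt = 3, wt = 4 and hp = 110 are arbitrary assumptions.

```r
# Two hypothetical cases: same hp, wt differs by exactly one unit
fit_num   <- lm(mpg ~ wt + hp, data = mtcars)
base_case <- data.frame(wt = 3, hp = 110)
plus_one  <- data.frame(wt = 4, hp = 110)

# Difference in predicted average response between the two cases
diff_pred <- unname(predict(fit_num, plus_one) - predict(fit_num, base_case))

# Estimated weight of wt
beta_wt <- unname(coef(fit_num)["wt"])

print(c(difference = diff_pred, weight = beta_wt))
```

The two printed numbers agree, and they would agree for any fixed value of hp, which is what "other predictors held constant" means.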
Following are the steps to interpret factor predictors
One of the levels is treated as the Reference or Base level
The choice is software specific; however, options are usually provided to change / fix the base level
In R, the first level (in ascending order) is taken as the base level
Weights \((\beta)\) are the changes in the mean response \(Y\) when each level of the factor \(X\) is compared with the base level
Other predictors are kept constant; other factor predictors are kept at the same level
The constant in the model gives the average response when every factor predictor is at its base level and the numeric predictors are held at zero (or at their means, if centered)
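The steps above can be sketched with the built-in iris data, used here as a stand-in for a data set with a factor predictor. relevel() is the base R function for changing the base level; choosing "virginica" as the new base is an arbitrary assumption.

```r
# Default base level: the first level in ascending order
print(levels(iris$Species))

fit_base    <- lm(Sepal.Length ~ Species, data = iris)
group_means <- tapply(iris$Sepal.Length, iris$Species, mean)

# Intercept = mean of the base level; other weights = differences from it
print(coef(fit_base))
print(group_means)

# Change the base level on a copy of the data and refit
iris2 <- iris
iris2$Species <- relevel(iris2$Species, ref = "virginica")
fit_relevel   <- lm(Sepal.Length ~ Species, data = iris2)
print(coef(fit_relevel))
```

After releveling, the intercept becomes the mean of the new base level and the remaining weights are re-expressed as differences from it; the fitted group means themselves do not change.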
Let us start with a model that uses none of the predictors in the data set but has a constant term; the corresponding code and output are
fit1=lm(Balance~1, data=ex)
fit1$coefficients
(Intercept)
    520.015
This is simply the mean of Balance, the response variable in this model
d1=as.data.frame(mean(ex$Balance))
colnames(d1)=""
rownames(d1)="Average Balance"
d1

| Average Balance | 520.015 |
If we add a numerical predictor then the summary will be
fit2=lm(Balance ~ Income ,data=ex)
d2=data.frame(round(fit2$coefficients,4))
d2 <- cbind(rownames(d2), data.frame(d2, row.names=NULL))
colnames(d2) = c("Predictors", "Estimated Weights")
d2

| Predictors | Estimated Weights |
|---|---|
| (Intercept) | 246.5148 |
| Income | 6.0484 |
Now the intercept (246.5148) refers to the average Balance when Income is zero
Estimated weight corresponding to the numerical predictor (Income) is the change in the average Balance when Income is increased by one unit. Hence average Balance increases by 6.0484 units when Income is increased by one unit
However, zero may NOT be a reasonable value for the predictor Income. In such cases (which cover most, if not all, numeric predictors) it is better to mean-center the predictor using scale() in R
Modified model
First let us center the numeric predictor Income and store as Income_1 in the data set using the following code
ex$Income_1=scale(ex$Income, center=T,scale=F)
Then the model fit (with the predictor Income_1) is
ex$Income_1=scale(ex$Income, center=T,scale=F)
fit2A=lm(Balance ~ Income_1 ,data=ex)
d2A=data.frame(round(fit2A$coefficients,4))
d2A <- cbind(rownames(d2A), data.frame(d2A, row.names=NULL))
colnames(d2A) = c("Predictors", "Estimated Weights")
d2A

| Predictors | Estimated Weights |
|---|---|
| (Intercept) | 520.0150 |
| Income_1 | 6.0484 |
Note that the estimated weight corresponding to the predictor Income does not change after centering.
The estimated constant (520.0150) now refers to the average response Balance when Income is equal to its mean value
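The effect of centering can be verified on any data set. The sketch below uses the built-in mtcars data as a stand-in and centers by subtracting the mean directly, which gives the same values as scale() with center=TRUE and scale=FALSE; the names dat and wt_c are hypothetical.

```r
# Centering a numeric predictor: the slope is unchanged, the intercept becomes mean(Y)
dat      <- mtcars
dat$wt_c <- dat$wt - mean(dat$wt)        # centered predictor

fit_raw <- lm(mpg ~ wt,   data = dat)    # intercept = average mpg at wt = 0
fit_ctr <- lm(mpg ~ wt_c, data = dat)    # intercept = average mpg at mean wt

print(round(coef(fit_raw), 4))
print(round(coef(fit_ctr), 4))
print(mean(dat$mpg))
```

The reason is that a least-squares line with an intercept always passes through the point of means \((\bar{X}, \bar{Y})\), so after centering the intercept equals \(\bar{Y}\).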
Next let us build a model with one factor predictor, say with Ethnicity which has three levels and the base level is African American
fit3=lm(Balance ~ Ethnicity ,data=ex)
d3=data.frame(round(fit3$coefficients,4))
d3 <- cbind(rownames(d3), data.frame(d3, row.names=NULL))
colnames(d3) = c("Predictors", "Estimated Weights")
d3

| Predictors | Estimated Weights |
|---|---|
| (Intercept) | 531.0000 |
| EthnicityAsian | -18.6863 |
| EthnicityCaucasian | -12.5025 |
Interpretation of the weights can be well understood by taking the average of the response Balance for each level of Ethnicity. Following code yields the group wise average of Balance
library(dplyr)
grp_mean=data.frame(ex %>% group_by(Ethnicity) %>% summarise(Mean_Balance=mean(Balance)))
grp_mean

| Ethnicity | Mean_Balance |
|---|---|
| African American | 531.0000 |
| Asian | 512.3137 |
| Caucasian | 518.4975 |
From these two numerical summaries, we can observe that
the intercept of model fit3 (531) is the average of Balance when Ethnicity is African American
the weight corresponding to EthnicityAsian (-18.6863) is the difference in average Balance for the Asian group compared to the African American group (from the second summary, the mean of Balance for the Asian group is 512.3137, which is 531-18.6863)
In a similar way, the average Balance for the Caucasian group is 518.4975; the weight in fit3 is -12.5025, which is the difference in average Balance between the Caucasian and African American groups (518.4975-531)
Let us continue building a model with three numerical predictors (after centering)
ex=ex %>% mutate(Age_1=scale(Age, center=T,scale=F),
Limit_1=scale(Limit, center=T,scale=F),
Income_1=scale(ex$Income, center=T,scale=F))
fit4=lm(Balance ~ Age_1+Income_1+Limit_1 ,data=ex)
d4=data.frame(round(fit4$coefficients,4))
rownames(d4)=c("Constant", "Age", "Income", "Limit")
d4 <- cbind(rownames(d4), data.frame(d4, row.names=NULL))
colnames(d4) = c("Predictors", "Estimated Weights")
d4

| Predictors | Estimated Weights |
|---|---|
| Constant | 520.0150 |
| Age | -0.8018 |
| Income | -7.5628 |
| Limit | 0.2637 |
The intercept (520.015) is the average Balance for a person whose Age is 55.6675, Income is 45.2189 and Limit is 4735.6, that is, a person at the mean of every predictor. Each of the other weights is the increase (for Limit) or decrease (for Age and Income) in average Balance when that predictor is increased by one unit while the other predictors are held constant, for example at their average values
When we build a model with factor predictors, additional care must be taken to interpret the constant term in the model. For example, let us include a factor predictor Student in fit4
fit5=lm(Balance ~ Age_1+Income_1+Limit_1+Student ,data=ex)
d5=data.frame(round(fit5$coefficients,4))
rownames(d5)=c("Constant", "Age", "Income", "Limit", "Student Yes")
d5 <- cbind(rownames(d5), data.frame(d5, row.names=NULL))
colnames(d5) = c("Predictors", "Estimated Weights")
d5

| Predictors | Estimated Weights |
|---|---|
| Constant | 477.4207 |
| Age | -0.5291 |
| Income | -7.8347 |
| Limit | 0.2671 |
| Student Yes | 425.9432 |
The weight (425.9432) corresponding to Student is the difference in average Balance between the two levels of Student (No / Yes). On average, Balance is expected to be 425.9432 units higher when Student is Yes than when it is No (the base level), with the numeric predictors held constant
Also, the intercept for Student=Yes is 477.4207+425.9432 = 903.3639 when the other predictors Age, Income and Limit are kept at their respective means.
Accordingly, the average Balance is 477.4207 (constant) for a person who is not a student (base level of Student), whose Age is 55.6675, Income is 45.2189 and Limit is 4735.6. If he / she is a student, then the average Balance is 903.3639
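The same arithmetic can be reproduced with predict(). The sketch below uses the built-in mtcars data as a stand-in: the 0/1 column am is recoded into a hypothetical two-level factor with labels No / Yes, and wt is centered, so the setup mirrors one centered numeric predictor plus one factor predictor.

```r
# Stand-in data: one centered numeric predictor and one two-level factor
dat      <- mtcars
dat$am   <- factor(dat$am, labels = c("No", "Yes"))   # "No" is the base level
dat$wt_c <- dat$wt - mean(dat$wt)

fit_fac <- lm(mpg ~ wt_c + am, data = dat)
b       <- coef(fit_fac)

# Predict the average response at the mean of wt for each level of the factor
new_cases <- data.frame(
  wt_c = 0,
  am   = factor(c("No", "Yes"), levels = levels(dat$am))
)
pred <- unname(predict(fit_fac, newdata = new_cases))

print(c(No = pred[1], Yes = pred[2]))
print(c(constant = unname(b["(Intercept)"]),
        constant_plus_weight = unname(b["(Intercept)"] + b["amYes"])))
```

The prediction for the base level equals the constant, and the prediction for the other level equals the constant plus the factor weight.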
Once a linear model is built and the weights are appropriately interpreted, the next step is to check the Significance of the predictors. This is carried out using the estimated weights.
Meaning of Significance
Statistical: We expect that the effect of a predictor is not merely due to chance, but that the predictor has a real effect in explaining the variation in the response
Practical: On the other hand, from the context point of view, one (an expert) may assess that a predictor has an impact on the relational mapping, irrespective of statistical significance
We will assess statistical significance using a Confidence Interval (CI) with a level of significance specified before conducting the experiment, say 5%, which acts as a kind of margin of error
More details about CI
A CI covers a set of plausible values for the parameter of the population of interest
Its form is a real interval (L, U)
L: Lower Limit (LL), U: Upper Limit (UL)
There are uncountably many values between L and U
If a CI covers zero (LL is negative, UL is positive), the corresponding predictor is not statistically significant
If a CI does not cover zero, the corresponding predictor is statistically significant. In this case the signs of LL and UL are the same (both negative or both positive)
In R we can use the function confint(fitted model) to estimate the limits of a CI
We shall use fit5 for finding CI with 5% level of significance. Then 95% CI is
CI_fit5=data.frame(confint(fit5))
CI_fit5=round(CI_fit5,4)
rownames(CI_fit5)=c("Constant", "Age", "Income", "Limit", "Student Yes")
CI_fit5<- cbind(rownames(CI_fit5), data.frame(CI_fit5, row.names=NULL))
colnames(CI_fit5) = c("Predictors", "Estimated LL", "Estimated UL")
CI_fit5

| Predictors | Estimated LL | Estimated UL |
|---|---|---|
| Constant | 466.6270 | 488.2143 |
| Age | -1.1344 | 0.0761 |
| Income | -8.3177 | -7.3517 |
| Limit | 0.2598 | 0.2744 |
| Student Yes | 391.7655 | 460.1209 |
Accordingly, all predictors except Age are statistically significant (the CI for Age covers zero; the rest do not)
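The check "does the CI cover zero?" can be automated. The sketch below uses simulated data (an assumption, not the Credit data) with one predictor that truly affects the response and one that does not, and also rebuilds the limits by hand as estimate ± t-quantile × standard error, which is how confint() computes them for a linear model.

```r
# Simulated data: x_signal has a true weight of 2, x_noise has a true weight of 0
set.seed(42)
n        <- 500
x_signal <- rnorm(n)
x_noise  <- rnorm(n)
y        <- 1 + 2 * x_signal + rnorm(n)

ci_fit <- lm(y ~ x_signal + x_noise)
ci     <- confint(ci_fit, level = 0.95)     # 95% CI, i.e. 5% level of significance

# Same limits by hand: estimate +/- t quantile * standard error
est       <- coef(ci_fit)
se        <- sqrt(diag(vcov(ci_fit)))
t_crit    <- qt(0.975, df = df.residual(ci_fit))
ci_manual <- cbind(est - t_crit * se, est + t_crit * se)

# A predictor is flagged significant when its CI does not cover zero
covers_zero <- ci[, 1] < 0 & ci[, 2] > 0
print(data.frame(LL = round(ci[, 1], 4),
                 UL = round(ci[, 2], 4),
                 significant = !covers_zero))
```

Because x_noise has no real effect, its interval will cover zero in most simulated samples, though not in every one; a 5% level of significance allows for exactly that kind of occasional false signal.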
The following steps may be followed to build an LM, interpret its weights and assess their (statistical) significance.
Contextual mapping - Data context - Data domain
Fitting model - use software
Scaling the numeric predictors
Estimate the weights / betas / coefficients
Interpret the weights
Estimate the limits of Confidence Interval
Interpret CI