Previously on STAT 412:
In any experiment, there are two main variables:
The independent variable: the variable that an experimenter changes or controls so that they can observe the effects on the dependent variable.
The dependent variable: the variable being measured in an experiment that is “dependent” on the independent variable.
Researchers are often interested in understanding how changes in the independent variable affect the dependent variable. However, sometimes there is a third variable that is not accounted for that can affect the relationship between the two variables under study.
Such a variable is known as a confounding variable: it can distort the results of a study and make it appear that a cause-and-effect relationship exists between two variables when it actually does not.
For example, suppose a researcher collects data on ice cream sales and shark attacks and finds that the two variables are highly correlated. Does this mean that increased ice cream sales cause more shark attacks? That’s unlikely. The more likely cause is the confounding variable temperature. When it is warmer outside, more people buy ice cream and more people go in the ocean.
Additional confounding examples:
Smoking and Lung Cancer: In a study investigating the link between smoking and lung cancer, age can be a confounding variable, since age is associated both with smoking history and with the risk of developing lung cancer.
Education and Income: In a study examining the relationship between education level and income, occupation and years of experience could be confounding variables, because certain jobs pay more regardless of education.
Requirements for Confounding:
To be a confounder, a variable must (1) be associated with the independent variable, (2) affect or be associated with the dependent variable, and (3) not lie on the causal pathway between the independent and dependent variables.
Possible Problems of Confounding:
Confounding can bias the estimated association in either direction: it can create a spurious association where none exists, exaggerate a real association, or mask a true effect.
Remedies for Confounding:
Strategies to reduce confounding are:
randomization (aim is random distribution of confounders between study groups)
restriction (restrict entry to study of individuals with confounding factors - risks bias in itself)
matching (of individuals or groups, aim for equal distribution of confounders)
stratification (confounders are distributed evenly within each stratum)
adjustment (usually distorted by choice of standard)
multivariate analysis (only works if you can identify and measure the confounders; see the sketch below)
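To make the last strategy concrete, here is a minimal simulated sketch (all numbers are invented) of the ice cream/shark attack example above: once the confounder is included in a multivariate model, the spurious association disappears.
# Simulated illustration (hypothetical data): temperature drives both
# ice cream sales and shark attacks, creating a spurious correlation.
set.seed(412)
n = 200
temperature = rnorm(n, mean = 25, sd = 5)          # the confounder
ice_cream = 10 + 2*temperature + rnorm(n, sd = 5)  # depends on temperature only
attacks = -5 + 0.5*temperature + rnorm(n, sd = 2)  # depends on temperature only
summary(lm(attacks ~ ice_cream))                # naive model: ice_cream looks significant
summary(lm(attacks ~ ice_cream + temperature))  # adjusted model: the ice_cream effect vanishes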
Let’s continue with the application.
Apply pre-processing steps:
library(readr)
books = read_delim("amazon-books.txt", delim = "\t",
                   escape_double = FALSE, trim_ws = TRUE)
## Rows: 325 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (5): Title, Author, Hard/ Paper, Publisher, ISBN-10
## dbl (8): List Price, Amazon Price, NumPages, Pub year, Height, Width, Thick,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
books = books[, c(3,5,6,10,11,12,13)] # keep List Price, Hard/ Paper, and the physical dimensions; drop Title, Author, Publisher, ISBN-10, Amazon Price, and Pub year
head(books)
| List Price | Hard/ Paper | NumPages | Height | Width | Thick | Weight (oz) |
|---|---|---|---|---|---|---|
| 12.95 | P | 304 | 7.8 | 5.5 | 0.8 | 11.2 |
| 15.00 | P | 273 | 8.4 | 5.5 | 0.7 | 7.2 |
| 1.50 | P | 96 | 8.3 | 5.2 | 0.3 | 4.0 |
| 15.99 | P | 672 | 8.8 | 6.0 | 1.6 | 28.8 |
| 30.50 | P | 720 | 8.0 | 5.2 | 1.4 | 22.4 |
| 28.95 | H | 460 | 8.9 | 6.3 | 1.7 | 32.0 |
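The chunk that produced the next few outputs is not echoed. Presumably it checked the dimensions and class of the data and converted the tibble to a plain data frame; the as.data.frame() step is inferred from the 'data.frame' shown by str() and the column-wise assignments below.
dim(books)   # 325 rows, 7 columns remain
class(books) # read_delim() returns a tibble
books = as.data.frame(books) # convert to a plain data frame (inferred step)
str(books)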
## [1] 325 7
## [1] "tbl_df" "tbl" "data.frame"
## 'data.frame': 325 obs. of 7 variables:
## $ List Price : num 12.9 15 1.5 16 30.5 ...
## $ Hard/ Paper: chr "P" "P" "P" "P" ...
## $ NumPages : num 304 273 96 672 720 460 336 405 NA 304 ...
## $ Height : num 7.8 8.4 8.3 8.8 8 8.9 7.8 8.2 9.6 9.6 ...
## $ Width : num 5.5 5.5 5.2 6 5.2 6.3 5.3 5.3 6.5 6.4 ...
## $ Thick : num 0.8 0.7 0.3 1.6 1.4 1.7 1.2 0.8 2.1 1.1 ...
## $ Weight (oz): num 11.2 7.2 4 28.8 22.4 32 15.5 11.2 NA 19.2 ...
colnames(books)[c(1,2,3,7)] = c("Price","Cover","Pages","Weight") # shorter, syntactic column names
books[,2] = as.factor(books[,2]) # Cover ("H"/"P") as a factor
str(books)
## 'data.frame': 325 obs. of 7 variables:
## $ Price : num 12.9 15 1.5 16 30.5 ...
## $ Cover : Factor w/ 2 levels "H","P": 2 2 2 2 2 1 1 2 1 1 ...
## $ Pages : num 304 273 96 672 720 460 336 405 NA 304 ...
## $ Height: num 7.8 8.4 8.3 8.8 8 8.9 7.8 8.2 9.6 9.6 ...
## $ Width : num 5.5 5.5 5.2 6 5.2 6.3 5.3 5.3 6.5 6.4 ...
## $ Thick : num 0.8 0.7 0.3 1.6 1.4 1.7 1.2 0.8 2.1 1.1 ...
## $ Weight: num 11.2 7.2 4 28.8 22.4 32 15.5 11.2 NA 19.2 ...
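The data contain missing values (note the NA entries in Pages and Weight above). The two outputs below presumably come from counting and then removing them, along these lines; the 310 residual degrees of freedom in the regressions below imply 312 complete rows remain.
sum(is.na(books))      # 22 missing cells in total
books = na.omit(books) # drop the incomplete rows
sum(is.na(books))      # 0 after removal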
## [1] 22
## [1] 0
Implementation of Simple Linear Regression
Now, I would like to build a simple linear regression model with certain covariates, where the dependent variable is the price of a book.
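The five summaries below come from simple lm() fits whose code is not echoed. Judging from the later references to fit4 and fit4.4, the calls were along these lines; the names fit1 through fit3 are my guesses.
fit1 = lm(Price ~ Height, data = books)           # name assumed
fit2 = lm(Price ~ Weight, data = books)           # name assumed
fit3 = lm(Price ~ Pages, data = books)            # name assumed
fit4 = lm(Price ~ Width, data = books)            # name used later in the text
fit4.4 = lm(Price ~ Width + Height, data = books) # name used later in the text
summary(fit1) # ...and likewise for the other fits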
##
## Call:
## lm(formula = Price ~ Height, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -28.009 -3.918 -1.582 1.424 104.567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -39.6362 6.2959 -6.296 1.05e-09 ***
## Height 7.0773 0.7698 9.193 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.76 on 310 degrees of freedom
## Multiple R-squared: 0.2142, Adjusted R-squared: 0.2117
## F-statistic: 84.52 on 1 and 310 DF, p-value: < 2.2e-16
Model significance: OK.
Covariate significance: OK.
Adj R-squared: 0.21
##
## Call:
## lm(formula = Price ~ Weight, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.604 -4.765 -1.310 1.189 114.129
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.0661 1.4730 5.476 9.01e-08 ***
## Weight 0.7926 0.1047 7.571 4.29e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.19 on 310 degrees of freedom
## Multiple R-squared: 0.156, Adjusted R-squared: 0.1533
## F-statistic: 57.32 on 1 and 310 DF, p-value: 4.287e-13
Model significance: OK.
Covariate significance: OK.
Adj R-squared: 0.15
##
## Call:
## lm(formula = Price ~ Pages, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.788 -4.894 -2.621 0.179 124.341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.44876 1.71755 7.830 7.79e-14 ***
## Pages 0.01350 0.00468 2.885 0.00419 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.09 on 310 degrees of freedom
## Multiple R-squared: 0.02615, Adjusted R-squared: 0.02301
## F-statistic: 8.323 on 1 and 310 DF, p-value: 0.004188
Model significance: OK.
Covariate significance: OK.
Adj R-squared: 0.023
##
## Call:
## lm(formula = Price ~ Width, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.616 -2.894 -0.364 1.525 102.471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -23.1055 4.4245 -5.222 3.25e-07 ***
## Width 7.3883 0.7878 9.378 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.71 on 310 degrees of freedom
## Multiple R-squared: 0.221, Adjusted R-squared: 0.2185
## F-statistic: 87.95 on 1 and 310 DF, p-value: < 2.2e-16
Model significance: OK.
Covariate significance: OK.
Adj R-squared: 0.22
##
## Call:
## lm(formula = Price ~ Width + Height, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.836 -3.059 -0.579 1.905 97.170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -48.0528 6.1573 -7.804 9.33e-14 ***
## Width 5.0312 0.8631 5.829 1.40e-08 ***
## Height 4.6771 0.8398 5.569 5.56e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.18 on 309 degrees of freedom
## Multiple R-squared: 0.2921, Adjusted R-squared: 0.2875
## F-statistic: 63.74 on 2 and 309 DF, p-value: < 2.2e-16
Model significance: OK.
Covariate significance: OK.
Adj R-squared: 0.29
Let’s compare fit4 and fit4.4. The difference between these two models is the additional variable Height. The coefficient of Width is 7.39 in the first model, while it is 5.03 in the second. If the percentage change in the coefficient exceeds 10%, we can say that the additional variable Height is a confounding variable. In regression analysis, you can include confounding variables as covariates in the model. This helps to isolate the effect of the independent variable of interest while holding the confounders constant.
Let’s calculate the percentage change:
Percentage_Change = (fit4$coefficients[2] - fit4.4$coefficients[2])/fit4$coefficients[2]*100
Percentage_Change
## Width
## 31.90374
Since the percentage change is 31.90%, which is greater than 10%, the association between book price and width is confounded by height. Also, adding Height to the model increases the R-squared from 0.22 to 0.29, so the new variable explains approximately 7 additional percentage points of the variance in book price.
Such a change could also arise from a collinearity problem, as we saw in the last recitation. Let us check whether collinearity is the reason here by calculating the variance inflation factor (VIF).
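The output below presumably comes from car::vif(); the package startup message confirms that library(car) was called.
library(car) # prints the startup message shown below
vif(fit4.4)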
## Loading required package: carData
## Width Height
## 1.316541 1.316541
Both VIF values are small (about 1.3), so there is no collinearity problem; the change in the Width coefficient is due to confounding rather than collinearity.
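The next two summaries presumably come from fits named fit5 and fit5.5, the names used further below:
fit5 = lm(Price ~ Thick, data = books)            # name used below
fit5.5 = lm(Price ~ Thick + Weight, data = books) # name used below
summary(fit5) # the summary of fit5.5 follows further below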
##
## Call:
## lm(formula = Price ~ Thick, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.398 -4.184 -2.562 0.213 123.566
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.420 2.023 7.129 7.14e-12 ***
## Thick 3.928 2.110 1.861 0.0637 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.19 on 310 degrees of freedom
## Multiple R-squared: 0.01105, Adjusted R-squared: 0.007858
## F-statistic: 3.463 on 1 and 310 DF, p-value: 0.06369
Model significance: Not OK.
Covariate significance: Not OK at the 5% significance level.
Adj R-squared: 0.008
##
## Call:
## lm(formula = Price ~ Thick + Weight, data = books)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.195 -4.468 -1.118 1.588 105.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.0477 1.8220 7.161 5.88e-12 ***
## Thick -11.4235 2.5865 -4.417 1.39e-05 ***
## Weight 1.2104 0.1389 8.714 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.84 on 309 degrees of freedom
## Multiple R-squared: 0.2062, Adjusted R-squared: 0.201
## F-statistic: 40.12 on 2 and 309 DF, p-value: 3.232e-16
Model significance: OK.
Covariate significance: OK.
Adj R-squared: 0.20
Let’s examine the confounding variable effect:
Percentage_Change = (fit5$coefficients[2] - fit5.5$coefficients[2])/fit5$coefficients[2]*100
Percentage_Change
## Thick
## 390.8552
The change is far greater than 10%; the coefficient of Thick even flips sign, from 3.93 to -11.42. Therefore, we can say that Weight is a confounding variable in the relationship between price and thickness.
Let’s check whether collinearity exists or not.
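Assuming the same object names as above, the VIF values below come from:
vif(fit5.5)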
## Thick Weight
## 1.865117 1.865117
No collinearity problem.
In regression, when the influence of an independent variable on the dependent variable varies with the values of other independent variables, we say that there is an interaction effect.
For instance, the probability that a student drops out of school can depend on the number of years of education completed so far. At the same time, the effect of the number of years of education on the probability of dropout can change significantly based on the gender of the student. In such cases, we say that there is an interaction effect of these two variables (years of education completed and gender) on the chances of dropout.
Data Dictionary (inferred from the variable names and the analysis below; exact units are not stated in the source):
Beds: number of beds in the nursing facility
AllPatientDays: total patient-days
PatientRevenue: revenue from patients
NurseSalaries: nurse salaries (the response variable)
FacilitiesExpend: facilities expenditure
Rural: location indicator (1 = rural, 0 = non-rural)
library(readr)
Nursing = read_delim("Nursing2.csv", delim = ";",
                     escape_double = FALSE, trim_ws = TRUE)
## Rows: 52 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ";"
## dbl (6): Beds, AllPatientDays, PatientRevenue, NurseSalaries, FacilitiesExpe...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Preprocessing steps:
## spc_tbl_ [52 × 6] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Beds : num [1:52] 244 59 120 120 120 65 120 90 96 120 ...
## $ AllPatientDays : num [1:52] 385 203 392 419 363 234 372 305 169 188 ...
## $ PatientRevenue : num [1:52] 23521 9160 21900 22354 17421 ...
## $ NurseSalaries : num [1:52] 5230 2459 6304 6590 5362 ...
## $ FacilitiesExpend: num [1:52] 5334 493 6115 6346 6225 ...
## $ Rural : num [1:52] 0 1 0 0 0 1 1 1 0 1 ...
## - attr(*, "spec")=
## .. cols(
## .. Beds = col_double(),
## .. AllPatientDays = col_double(),
## .. PatientRevenue = col_double(),
## .. NurseSalaries = col_double(),
## .. FacilitiesExpend = col_double(),
## .. Rural = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
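The model summaries below report a Rural1 coefficient, which means Rural was treated as a factor. The conversion chunk is not echoed; presumably it was simply:
Nursing$Rural = as.factor(Nursing$Rural) # 0/1 indicator as a factor (level "1" = rural)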
Build the multiple linear regression:
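The fitting chunk is not echoed; here is a minimal sketch consistent with the output below, where the name m1 is my assumption (only m3.3.3 is named explicitly later):
m1 = lm(NurseSalaries ~ ., data = Nursing) # name assumed; regress salaries on all other variables
summary(m1)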
##
## Call:
## lm(formula = NurseSalaries ~ ., data = Nursing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4029.5 -615.4 -77.7 853.7 2253.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.749e+03 6.052e+02 4.542 4.01e-05 ***
## Beds -7.046e+00 8.580e+00 -0.821 0.41576
## AllPatientDays 1.068e+00 3.586e+00 0.298 0.76710
## PatientRevenue 1.079e-01 6.618e-02 1.630 0.10984
## FacilitiesExpend 2.318e-01 1.048e-01 2.212 0.03199 *
## Rural1 -1.180e+03 3.810e+02 -3.098 0.00332 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1234 on 46 degrees of freedom
## Multiple R-squared: 0.5012, Adjusted R-squared: 0.447
## F-statistic: 9.244 on 5 and 46 DF, p-value: 3.824e-06
Model significance: OK.
Covariate significance: Not OK for all covariates.
Adj R-squared: 0.45
Let’s check the multicollinearity.
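Assuming the model object is named m1 as above, the VIF values below come from:
vif(m1) # car::vif(), loaded earlier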
## Beds AllPatientDays PatientRevenue FacilitiesExpend
## 4.114498 6.290267 7.131940 1.397288
## Rural
## 1.121779
These values seem reasonable: the model does not suffer from a multicollinearity problem, since all VIFs are smaller than 10.
What should be the next step?
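Beds has the largest p-value (0.416), so it is the natural first variable to drop. A sketch of the non-echoed call, with the name m2 assumed:
m2 = lm(NurseSalaries ~ . - Beds, data = Nursing) # name assumed
summary(m2)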
##
## Call:
## lm(formula = NurseSalaries ~ . - Beds, data = Nursing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4144.1 -702.7 -67.1 830.2 2304.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.610e+03 5.791e+02 4.507 4.35e-05 ***
## AllPatientDays 1.746e-01 3.405e+00 0.051 0.95933
## PatientRevenue 9.095e-02 6.266e-02 1.452 0.15327
## FacilitiesExpend 2.088e-01 1.006e-01 2.075 0.04352 *
## Rural1 -1.122e+03 3.729e+02 -3.007 0.00422 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1230 on 47 degrees of freedom
## Multiple R-squared: 0.4939, Adjusted R-squared: 0.4508
## F-statistic: 11.47 on 4 and 47 DF, p-value: 1.414e-06
We can exclude the variable Beds. However, some variables are still statistically insignificant, so let's continue dropping them.
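AllPatientDays now has the largest p-value (0.959), so drop it next; a sketch with an assumed name:
m2.2 = lm(NurseSalaries ~ . - Beds - AllPatientDays, data = Nursing) # name assumed
summary(m2.2)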
##
## Call:
## lm(formula = NurseSalaries ~ . - Beds - AllPatientDays, data = Nursing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4120.3 -708.2 -71.7 833.6 2306.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.622e+03 5.266e+02 4.978 8.68e-06 ***
## PatientRevenue 9.382e-02 2.799e-02 3.352 0.00157 **
## FacilitiesExpend 2.076e-01 9.704e-02 2.140 0.03749 *
## Rural1 -1.122e+03 3.690e+02 -3.041 0.00382 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1217 on 48 degrees of freedom
## Multiple R-squared: 0.4939, Adjusted R-squared: 0.4622
## F-statistic: 15.61 on 3 and 48 DF, p-value: 3.216e-07
Model significance: OK.
Covariate significance: OK.
Adj R-squared: 0.46
Residual standard error: 1217
After excluding Beds and AllPatientDays, all remaining variables are statistically significant.
What if we include interactions in the model instead of excluding the insignificant variables?
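A sketch of the interaction model whose summary follows, with an assumed name:
m3 = lm(NurseSalaries ~ . + Rural*AllPatientDays, data = Nursing) # name assumed
summary(m3)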
##
## Call:
## lm(formula = NurseSalaries ~ . + Rural * AllPatientDays, data = Nursing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2758.90 -668.30 -30.97 761.48 2428.20
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -589.55183 937.17138 -0.629 0.53248
## Beds -6.63375 7.31950 -0.906 0.36960
## AllPatientDays 14.38588 4.36943 3.292 0.00194 **
## PatientRevenue 0.05610 0.05774 0.972 0.33643
## FacilitiesExpend 0.19223 0.08988 2.139 0.03791 *
## Rural1 2959.22115 1022.75883 2.893 0.00586 **
## AllPatientDays:Rural1 -13.56981 3.17896 -4.269 0.00010 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1053 on 45 degrees of freedom
## Multiple R-squared: 0.645, Adjusted R-squared: 0.5976
## F-statistic: 13.62 on 6 and 45 DF, p-value: 9.548e-09
Model significance: OK.
Covariate significance: Not OK for all covariates (Beds and PatientRevenue are insignificant).
Adj R-squared: 0.60
Add one more interaction:
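A sketch of the corresponding call, again with an assumed name:
m3.3 = lm(NurseSalaries ~ . + Rural*AllPatientDays + Beds*AllPatientDays, data = Nursing) # name assumed
summary(m3.3)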
##
## Call:
## lm(formula = NurseSalaries ~ . + Rural * AllPatientDays + Beds *
## AllPatientDays, data = Nursing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2308.2 -583.3 -106.2 565.1 1778.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.073e+03 9.398e+02 -3.269 0.00210 **
## Beds 2.210e+01 8.631e+00 2.561 0.01395 *
## AllPatientDays 2.316e+01 4.073e+00 5.687 9.70e-07 ***
## PatientRevenue 4.524e-02 4.780e-02 0.946 0.34912
## FacilitiesExpend 1.054e-01 7.661e-02 1.376 0.17582
## Rural1 2.734e+03 8.471e+02 3.227 0.00236 **
## AllPatientDays:Rural1 -1.178e+01 2.657e+00 -4.432 6.12e-05 ***
## Beds:AllPatientDays -8.010e-02 1.715e-02 -4.670 2.84e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 870.5 on 44 degrees of freedom
## Multiple R-squared: 0.7626, Adjusted R-squared: 0.7248
## F-statistic: 20.19 on 7 and 44 DF, p-value: 7.9e-12
Model significance: OK.
Covariate significance: Not OK for all covariates (PatientRevenue and FacilitiesExpend are insignificant).
Adj R-squared: 0.73
Now, exclude the insignificant variables.
m3.3.3 = lm(NurseSalaries ~ . + Rural*AllPatientDays + Beds*AllPatientDays
            - PatientRevenue - FacilitiesExpend, data = Nursing)
summary(m3.3.3)
##
## Call:
## lm(formula = NurseSalaries ~ . + Rural * AllPatientDays + Beds *
## AllPatientDays - PatientRevenue - FacilitiesExpend, data = Nursing)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2317.4 -475.2 -112.8 415.6 1944.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.572e+03 9.095e+02 -3.927 0.000286 ***
## Beds 3.004e+01 7.653e+00 3.925 0.000288 ***
## AllPatientDays 2.633e+01 3.301e+00 7.977 3.20e-10 ***
## Rural1 3.073e+03 8.377e+02 3.669 0.000631 ***
## AllPatientDays:Rural1 -1.276e+01 2.629e+00 -4.854 1.43e-05 ***
## Beds:AllPatientDays -8.813e-02 1.682e-02 -5.241 3.90e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 884.9 on 46 degrees of freedom
## Multiple R-squared: 0.7435, Adjusted R-squared: 0.7157
## F-statistic: 26.67 on 5 and 46 DF, p-value: 1.504e-12
Model significance: OK.
Covariate significance: OK.
Adj R-squared: 0.72
Residual standard error: ~885
Hence, the interaction terms improve the performance of the model: relative to the model without interactions, the adjusted R-squared rises from 0.46 to 0.72 and the residual standard error drops from 1217 to about 885.