As good Social/Data Scientists, we must always look at the data to identify potential problems. Here are some of the main things that you should be on the lookout for:
#install.packages("olsrr")
library(olsrr)
#install.packages("rlang")
library(rlang)
#install.packages("Hmisc")
library(Hmisc)
library(car)
Create A New Dataset With No Missing Observations
good<-read.csv("good.csv")
goodR<-na.omit(good[,c("CHID","readss97","AGE97","faminc97","HOME97","bthwht")])
attach(goodR)
Run the following regression
lm1<-lm(readss97~AGE97 + faminc97 + bthwht, data=goodR)
summary(lm1)
##
## Call:
## lm(formula = readss97 ~ AGE97 + faminc97 + bthwht, data = goodR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -58.923 -9.416 -0.372 9.376 63.690
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.356e+01 9.844e-01 95.043 < 2e-16 ***
## AGE97 7.155e-01 1.200e-01 5.960 2.98e-09 ***
## faminc97 8.727e-05 6.299e-06 13.856 < 2e-16 ***
## bthwht -3.502e+00 7.294e-01 -4.801 1.70e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.06 on 1960 degrees of freedom
## Multiple R-squared: 0.1174, Adjusted R-squared: 0.1161
## F-statistic: 86.94 on 3 and 1960 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm1)
Based on the previous diagnostics in Lab 6, it looks like there may be problems with the variables faminc97 and AGE97. These problems can largely be fixed by re-specifying our model: log-transform faminc97, center AGE97 (i.e., subtract the mean value of AGE97 from each observation so that its new mean is 0), and include the squared term of the centered age variable.
hist(faminc97)
goodR$faminc97R<-ifelse(goodR$faminc97<=1,1,goodR$faminc97)
# Or, goodR$faminc97R<-ifelse(goodR$faminc97<=1,1,
# ifelse(goodR$faminc97 > 1,goodR$faminc97,NA)) log 0 is undefined
goodR$loginc <-log(goodR$faminc97R)
goodR$ageC<-goodR$AGE97-8
goodR$ageC2<- goodR$ageC^2
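As a side note, instead of hard-coding 8 (approximately the mean age) you could center AGE97 on its sample mean directly. This is only a sketch; the variable names ageC_m and ageC_m2 are illustrative and are not used in the models below:
goodR$ageC_m  <- goodR$AGE97 - mean(goodR$AGE97, na.rm = TRUE)  # center on the observed mean
goodR$ageC_m2 <- goodR$ageC_m^2                                 # squared term of the centered age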
lm2<-lm(readss97~loginc + ageC + ageC2 + bthwht+
HOME97,data=goodR)
summary(lm2)
##
## Call:
## lm(formula = readss97 ~ loginc + ageC + ageC2 + bthwht + HOME97,
## data = goodR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.824 -9.389 -0.377 9.518 55.985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 55.13936 3.15466 17.479 < 2e-16 ***
## loginc 1.69500 0.30738 5.514 3.97e-08 ***
## ageC 0.37552 0.12910 2.909 0.00367 **
## ageC2 -0.03267 0.04638 -0.704 0.48134
## bthwht -2.53381 0.74463 -3.403 0.00068 ***
## HOME97 1.52543 0.12229 12.474 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.67 on 1958 degrees of freedom
## Multiple R-squared: 0.1636, Adjusted R-squared: 0.1615
## F-statistic: 76.6 on 5 and 1958 DF, p-value: < 2.2e-16
We examine the residual plots to check for outliers, heteroscedasticity, non-normality, and so on. Compare the residual plot from lm1, in which faminc97 is not transformed, with the residual plot from lm2.
plot(lm1,1) # `,1` shows only the 1st graph: the Residuals vs Fitted plot
plot(lm2,1)
Graph of Model Residuals Against loginc
resid2<-lm2$residuals
plot(resid2~loginc, data=goodR)
New model after making the variable transformations (lm2 shows that ageC2 is not needed)
lm3<-lm(readss97~loginc + ageC + bthwht+
HOME97,data=goodR)
summary(lm3)
##
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = goodR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.469 -9.516 -0.324 9.517 55.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.8712 3.1312 17.524 < 2e-16 ***
## loginc 1.6871 0.3071 5.493 4.47e-08 ***
## ageC 0.4100 0.1194 3.433 0.000609 ***
## bthwht -2.6817 0.7143 -3.754 0.000179 ***
## HOME97 1.5321 0.1219 12.568 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.66 on 1959 degrees of freedom
## Multiple R-squared: 0.1634, Adjusted R-squared: 0.1617
## F-statistic: 95.65 on 4 and 1959 DF, p-value: < 2.2e-16
plot(lm3,1)
library(car)
vif(lm3)
## loginc ageC bthwht HOME97
## 1.260362 1.115411 1.085926 1.312712
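As a sanity check, any single VIF can be reproduced by hand from an auxiliary regression, since VIF_j = 1/(1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. A minimal sketch for loginc:
aux <- lm(loginc ~ ageC + bthwht + HOME97, data = goodR)  # regress loginc on the other IVs
1 / (1 - summary(aux)$r.squared)                          # should match vif(lm3)["loginc"]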
Key reference material: Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley. The table below lists common "rules of thumb" used to identify potential outliers (k = number of IVs; N = sample size). NOTE: these "rules" are not hard-and-fast rules, but guidelines only! ALWAYS produce a plot, histogram, or sample quantiles of your outlier measures.
| Measure | Cut-off Value |
|---|---|
| abs(standardized resid) | > 2 (or 3) |
| leverage | >(2k+2)/N |
| abs(DFFITS) | > 2*sqrt((k+1)/N) |
| abs(Dfbetas) | > 2/sqrt(N) |
| Cook’s D | > 4/N |
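For convenience, these cutoffs can be computed directly from the fitted lm3 object. A short sketch (here k = 4 predictors and N = 1964):
k <- length(coef(lm3)) - 1   # number of IVs (excluding the intercept)
N <- nobs(lm3)               # sample size
round(c(std_resid = 2,
        leverage  = (2*k + 2) / N,
        dffits    = 2 * sqrt((k + 1) / N),
        dfbetas   = 2 / sqrt(N),
        cooks_d   = 4 / N), 4)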
Create a new data frame where we can store all of our outlier information
outliers <- goodR[,c("CHID","readss97","loginc","ageC","HOME97","bthwht")] # CHID = Children ID
outliers$r <- rstudent(lm3)
outliers$lev <- hatvalues(lm3)
outliers$cd <- cooks.distance(lm3)
outliers$dffit<- dffits(lm3)
# In this step we are creating a separate data frame with the dfbetas so that we can properly label the columns
dfbetaR<-dfbetas(lm3)
dfbetaR <- as.data.frame(dfbetaR)
dfbetas(lm3)
# To check the order of the variable names use the colnames() function
colnames(dfbetaR)
## [1] "(Intercept)" "loginc" "ageC" "bthwht" "HOME97"
# Next assign new variable names and add the dfb appendage
colnames(dfbetaR)<-c("int_dfb","income_dfb",
"ageC_dfb","bthwht_dfb","home_dfb")
# find out the number of observations in `lm3`
nobs(lm3)
## [1] 1964
Now we want to combine the two data frames using the cbind() function
outliers<-cbind(outliers,dfbetaR)
# use the head() function to make sure the new columns have been added to your outliers data frame
head(outliers)
Discrepancy is the difference between the predicted and observed value, measured here by studentized residuals. The basic idea of studentized residuals is to delete the observations one at a time, each time refitting the regression model on the remaining n-1 observations. We then compare each observed response value to its fitted value from the model with the ith observation deleted; this produces deleted residuals. Standardizing the deleted residuals produces studentized residuals.
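To make the leave-one-out idea concrete, here is a minimal sketch of a deleted residual for a single observation (observation 1 is purely illustrative):
i <- 1                                                 # illustrative observation
fit_minus_i <- update(lm3, data = goodR[-i, ])         # refit without observation i
pred_i <- predict(fit_minus_i, newdata = goodR[i, ])   # predict the held-out case
goodR$readss97[i] - pred_i                             # deleted residual
rstudent(lm3)[i]                                       # studentized version computed by R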
Our critical standardized residual value is abs(standardized resid) > 2 (or 3). Which cutoff seems more plausible? Let’s plot to find out.
r <- rstudent(lm3) # create a named numeric vector
plot(outliers$r,
xlab="Index",
ylab="studentized residuals",
pch=19)
abline(h = 3, col="red") # add cutoff line
abline(h = -3, col="red") # add cutoff line
text(x=1:length(r), y=r, labels=ifelse(r > 3,names(r),""), col="red", pos = 4) # add labels # pos = 4 to position at the right
text(x=1:length(r), y=r, labels=ifelse(r<(-3.0),names(r),""), col="red", pos = 4) # add labels # pos = 4 to position at the right
# Or, use this code for interactive point identification to avoid overlapping labels
identify(1:length(r),r,row.names(outliers))
## integer(0)
# abline(lm(r~CHID, data=outliers), col="red") # fit a regression line (y~x)
According to the plot, observations with studentized residuals greater than +/-3 might be the real outliers in our dataset. Let’s subset our outliers data frame by those observations.
rstudent1<-subset(outliers, abs(r)>3)
Use the View() function to examine observations with standardized residuals greater than +/-3.
View(rstudent1)
What do these observations have in common?
Quite a few observations have low readss97 values, and many pair high income with low reading scores or low income with high reading scores.
Leverage is a measure of how far the independent variable values of an observation are from those of the other observations. As measured by hat values, it is the standardized distance of an observation from the mean of the predictors.
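Hat values are the diagonal elements of the hat matrix H = X(X'X)^{-1}X'. A short sketch verifies this from the model matrix of lm3:
X <- model.matrix(lm3)
h_manual <- rowSums((X %*% solve(crossprod(X))) * X)   # diag(H) without forming the full matrix
all.equal(unname(h_manual), unname(hatvalues(lm3)))    # should be TRUE (up to numerical tolerance)
mean(hatvalues(lm3))                                   # average leverage = (k + 1)/N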
Our official critical leverage value is leverage > (2k+2)/N, i.e., (2*4+2)/1964 ≈ .005. Is this value too strict? Let’s plot to find out!
plot(outliers$lev,
xlab="Index",
ylab="leverage",
pch=19)
abline(lm(lev~CHID, data=outliers), col="red") # Add a fitted line
Residuals vs Leverage plot
plot(lm3, which = 5)
According to the plot, maybe observations with leverage values greater than .03 are the real outliers. Let’s subset our outliers data frame by those observations.
leverage1<-subset(outliers,lev > .03)
# Use the view function to examine observations with leverage values greater than .03
View(leverage1)
What do these observations have in common? Many observations have a loginc value less than 1, a high reading score, and high parenting practices scores. A pattern is emerging: in the previous step, we saw that some observations with standardized residuals greater than +/-3 also had very low loginc values but higher reading scores and high parenting practices scores.
Cook’s distance was introduced by the American statistician R. Dennis Cook in 1977. It is used to identify influential data points by considering both the residual and the leverage, i.e., it takes into account both the x and y values of the observation.
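Cook's D can be written in terms of pieces we already have, which makes the "residual plus leverage" idea explicit. A quick check (p is the number of estimated coefficients, including the intercept):
p <- length(coef(lm3))
D_manual <- rstandard(lm3)^2 / p * hatvalues(lm3) / (1 - hatvalues(lm3))
all.equal(unname(D_manual), unname(cooks.distance(lm3)))   # should be TRUE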
Our critical Cook’s Distance value is Cook’s D > 4/N, i.e., 4/1964 (0.00203666).
plot(cooks.distance(lm3), pch="*", cex=2, main="Influential Obs by Cook's distance Using Official Threshold (4/1964)")
abline(h = 4/1964, col="red") # add cutoff line
Alternatively, use functions from the olsrr library:
ols_plot_cooksd_bar(lm3) # threshold = 4/n = 4/1964 = 0.002
ols_plot_cooksd_chart(lm3)
Maybe our critical Cook’s D value of 4/1964 is too strict. Let’s take a closer look at the Cook’s D values that are large according to this critical value.
First, create a new data frame containing observations with Cook’s D values above 4/1964.
large_cd<-subset(outliers, cd > (4/1964))
View(large_cd)
Next, produce a histogram and print the sample quantiles of our Cook’s D outlier measure to the console.
# install.packages("Hmisc")
# library(Hmisc)
describe(large_cd$cd)
## large_cd$cd
## n missing distinct Info Mean Gmd .05 .10
## 87 0 87 1 0.007076 0.008124 0.002100 0.002186
## .25 .50 .75 .90 .95
## 0.002465 0.003157 0.005046 0.011376 0.017548
##
## lowest : 0.002050058 0.002059938 0.002072141 0.002075410 0.002096715
## highest: 0.018274246 0.023348340 0.063262844 0.077436290 0.104389886
hist(large_cd$cd)
quantile(large_cd$cd, probs = seq(0, 1, 0.05)) # from 0 to 1, by 0.05
## 0% 5% 10% 15% 20% 25%
## 0.002050058 0.002099812 0.002186473 0.002266736 0.002379564 0.002465379
## 30% 35% 40% 45% 50% 55%
## 0.002534835 0.002585449 0.002928289 0.003031257 0.003156610 0.003283638
## 60% 65% 70% 75% 80% 85%
## 0.003404639 0.003746563 0.004223186 0.005046340 0.005655253 0.005933571
## 90% 95% 100%
## 0.011375837 0.017547768 0.104389886
Notice that there is a big jump between the 75th and 90th percentiles: the 90th percentile value (0.011375837) is much larger than the 75th percentile value (about 0.00504). This is a good indicator that observations at or above the 90th percentile are the real outliers.
Let’s take a closer look at these observations
large_cd2<-subset(outliers, cd > 0.011375837)
# Use the View() function to examine observations in this object
View(large_cd2)
Now, focusing just on the variables included in the regression model, what do you notice? Again, we find that quite a few observations have very low loginc values but high reading scores and high parenting practices scores.
cd <- cooks.distance(lm3)
plot(cooks.distance(lm3), pch="*", cex=2, main="Influential Obs by Cooks distance")
abline(h = 0.011375837, col="red") # add cutoff line
text(x=1:length(cd), y=cd, labels=ifelse(cd>0.011375837,names(cd),""), col="red", pos = 4) # add labels # pos = 4 to position at the right
# Or alternatively,
identify(1:length(cd),cooks.distance(lm3),row.names(outliers)) # interactive labeling, LOVE it! :)
## integer(0)
Let’s fit a model without outliers (i.e., dropping observations with Cook’s D greater than 0.011375837). How do these results compare to the results obtained with the lm3 model?
lm4<-lm(readss97~loginc + ageC + bthwht + HOME97,
data=subset(outliers, cd < 0.011375837))
# Do the model estimates or R-Squared change?
summary(lm3)
##
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = goodR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.469 -9.516 -0.324 9.517 55.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.8712 3.1312 17.524 < 2e-16 ***
## loginc 1.6871 0.3071 5.493 4.47e-08 ***
## ageC 0.4100 0.1194 3.433 0.000609 ***
## bthwht -2.6817 0.7143 -3.754 0.000179 ***
## HOME97 1.5321 0.1219 12.568 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.66 on 1959 degrees of freedom
## Multiple R-squared: 0.1634, Adjusted R-squared: 0.1617
## F-statistic: 95.65 on 4 and 1959 DF, p-value: < 2.2e-16
summary(lm4)
##
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = subset(outliers,
## cd < 0.011375837))
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.817 -9.370 -0.429 9.339 56.389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.1573 3.3941 13.305 < 2e-16 ***
## loginc 2.8991 0.3511 8.258 2.7e-16 ***
## ageC 0.4076 0.1177 3.463 0.000546 ***
## bthwht -2.5110 0.7042 -3.566 0.000371 ***
## HOME97 1.3760 0.1233 11.163 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.41 on 1950 degrees of freedom
## Multiple R-squared: 0.1831, Adjusted R-squared: 0.1814
## F-statistic: 109.3 on 4 and 1950 DF, p-value: < 2.2e-16
The influencePlot() function in the car package creates a “bubble” plot of studentized residuals versus hat values, with the area of each circle proportional to that observation’s Cook’s distance.
# library(car)
influencePlot(lm3, id.method="noteworthy", # "identify": for interactive labeling
main="Influence Plot",
sub="Circle size is proportial to Cook's Distance")## StudRes Hat CookD
## 345 -3.798934 0.001779784 0.005111247
## 715 2.342583 0.054617275 0.063262844
## 1132 1.326663 0.049370524 0.018274246
## 1192 3.815063 0.003395120 0.009848521
## 2759 3.059742 0.039876923 0.077436290
## 2760 3.578872 0.039381913 0.104389886
We can also use DFBETAS to find potential outliers. DFBETA measures the difference in each parameter estimate with and without the influential point. Large values of DFBETAS indicate observations that are influential in estimating a given parameter.
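The definition can be checked directly with a leave-one-out refit. The sketch below drops observation 1 (purely illustrative) and compares the coefficient changes to dfbetas():
i <- 1
fit_minus_i <- update(lm3, data = goodR[-i, ])
coef(lm3) - coef(fit_minus_i)   # raw DFBETA: change in each coefficient when obs i is dropped
dfbetas(lm3)[i, ]               # same changes, scaled by their deleted-case standard errors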
Our critical DFBETAS cutoff value is abs(dfbetas) > 2/sqrt(N), here 2/sqrt(1964) (0.04512937). But maybe this value is too strict. To get a better idea of which values of each variable might be outliers, we need to plot the dfbetas for each variable (excluding the intercept).
Use par() with the mfrow argument to print all four dfbeta plots in one window (2 rows, 2 columns)
par(mfrow=c(2,2))
plot(outliers$income_dfb)
plot(outliers$ageC_dfb)
plot(outliers$bthwht_dfb)
plot(outliers$home_dfb)
One variable whose dfbetas should come as no surprise is income. Based on the plot, a reasonable cutoff for the income dfbeta is about abs(.2). Let’s take a closer look at these observations.
inc<-subset(outliers, abs(income_dfb) > .2)
View(inc)
Focusing specifically on the loginc values, what do these observations have in common? Again, we find that quite a few of these observations have loginc values less than 1.
DFFITS measures the difference in the fitted value for observation i with and without that observation (an identity-check sketch follows below). We then create a scatterplot of the DFFITS values computed from our linear model.
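The identity sketch referenced above: DFFITS can be reproduced from the studentized residuals and hat values we already computed.
d_manual <- rstudent(lm3) * sqrt(hatvalues(lm3) / (1 - hatvalues(lm3)))
all.equal(unname(d_manual), unname(dffits(lm3)))   # should be TRUE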
plot(dffits(lm3),
ylab = "Standardized dfFits", xlab = "Index",
main = paste("Standardized DfFits, \n critical value = 2*sqrt(k+1/n) = +/-",
round(0.1,3))) # (2*sqrt(4+1/1964) = 0.1)
# Mark the critical value with horizontal lines
abline(h = 0.09025, col = "red", lty = 2)
abline(h = -0.09025, col = "red", lty = 2)
Let’s take a closer look at these observations.
large_dffits<-subset(outliers, dffit > 0.1)
View(large_dffits)
Next, produce a histogram and print the sample quantiles of our DFFITS outlier measure to the console.
# library(Hmisc)
describe(large_dffits$dffit)
## large_dffits$dffit
## n missing distinct Info Mean Gmd .05 .10
## 51 0 51 1 0.1739 0.1055 0.1008 0.1016
## .25 .50 .75 .90 .95
## 0.1081 0.1246 0.1696 0.2816 0.4524
##
## lowest : 0.1002484 0.1006708 0.1006789 0.1009103 0.1012990
## highest: 0.3023354 0.3418022 0.5630620 0.6235657 0.7246348
hist(large_dffits$dffit)
quantile(large_dffits$dffit, probs = seq(0, 1, 0.05)) # from 0 to 1, by 0.05
## 0% 5% 10% 15% 20% 25% 30%
## 0.1002484 0.1007946 0.1015823 0.1037258 0.1060657 0.1080966 0.1113601
## 35% 40% 45% 50% 55% 60% 65%
## 0.1127046 0.1149095 0.1222516 0.1245861 0.1279384 0.1311345 0.1402279
## 70% 75% 80% 85% 90% 95% 100%
## 0.1591148 0.1695848 0.1812054 0.2424442 0.2815864 0.4524321 0.7246348
large_dffits2<-subset(outliers, dffit > 0.2424442) # e.g., cutoff set at the 85th percentile
View(large_dffits2)
df <- dffits(lm3)
plot(dffits(lm3), pch="*", cex=2, main="Influential Obs by DFFIT")
abline(h = 0.2424442, col="red")
abline(h = -0.2424442, col="red")
text(x=1:length(df), y=df, labels=ifelse(df>0.2424442, names(df),""), col="red", pos = 4)
text(x=1:length(df), y=df, labels=ifelse(df<(-0.2424442), names(df),""), col="red", pos = 4)
# Or, use this code for interactive point identification
identify(1:length(df),dffits(lm3),row.names(outliers))
## integer(0)
One common theme that emerged from the analysis is that observations with very low incomes are a problem. (To determine if there are other problematic variable values or observations, make sure to perform a thorough outlier analysis.)
How do we deal with outliers? There are TWO approaches.
In this example, we might want to omit from our analysis observations with a Cook’s D value greater than 0.011375837. First, create a new data frame (here called outliers2) by subsetting your outliers data frame to include only those observations with a Cook’s D value LESS than our new critical Cook’s D value (0.011375837).
outliers2<-subset(outliers, cd < 0.011375837)
# Now you can compare the results produced from our lm3 model with the results produced with our new dataset.
lm5<-lm(readss97~loginc + ageC + HOME97+ bthwht, data=outliers2)
summary(lm3)
##
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = goodR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.469 -9.516 -0.324 9.517 55.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.8712 3.1312 17.524 < 2e-16 ***
## loginc 1.6871 0.3071 5.493 4.47e-08 ***
## ageC 0.4100 0.1194 3.433 0.000609 ***
## bthwht -2.6817 0.7143 -3.754 0.000179 ***
## HOME97 1.5321 0.1219 12.568 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.66 on 1959 degrees of freedom
## Multiple R-squared: 0.1634, Adjusted R-squared: 0.1617
## F-statistic: 95.65 on 4 and 1959 DF, p-value: < 2.2e-16
summary(lm5)
##
## Call:
## lm(formula = readss97 ~ loginc + ageC + HOME97 + bthwht, data = outliers2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.817 -9.370 -0.429 9.339 56.389
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.1573 3.3941 13.305 < 2e-16 ***
## loginc 2.8991 0.3511 8.258 2.7e-16 ***
## ageC 0.4076 0.1177 3.463 0.000546 ***
## HOME97 1.3760 0.1233 11.163 < 2e-16 ***
## bthwht -2.5110 0.7042 -3.566 0.000371 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.41 on 1950 degrees of freedom
## Multiple R-squared: 0.1831, Adjusted R-squared: 0.1814
## F-statistic: 109.3 on 4 and 1950 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm3)
plot(lm5)
In this example, most of the observations that were flagged as outliers had very low loginc values, so we might consider setting loginc values less than or equal to 1 to NA (i.e., missing).
First, create a new data frame (here, called outliers3) that is an exact copy of our original outliers data frame
outliers3<-outliers
# Next, set loginc values less than or equal to 1 to NA
outliers3$loginc[outliers3$loginc <= 1]<-NA
# you could also use:
# outliers3$loginc<-ifelse(outliers3$loginc <= 1, NA, outliers3$loginc)
NOTE: you can use the same format to set other VARIABLE VALUES to missing. Once you have finished setting specific variable values to missing, compare the lm3 estimates with the estimates produced with the outliers3 data frame. Are there differences? Has the R-Squared value changed?
lm6<-lm(readss97~loginc + ageC + HOME97+ bthwht, data=outliers3)
summary(lm3)
##
## Call:
## lm(formula = readss97 ~ loginc + ageC + bthwht + HOME97, data = goodR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.469 -9.516 -0.324 9.517 55.657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.8712 3.1312 17.524 < 2e-16 ***
## loginc 1.6871 0.3071 5.493 4.47e-08 ***
## ageC 0.4100 0.1194 3.433 0.000609 ***
## bthwht -2.6817 0.7143 -3.754 0.000179 ***
## HOME97 1.5321 0.1219 12.568 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.66 on 1959 degrees of freedom
## Multiple R-squared: 0.1634, Adjusted R-squared: 0.1617
## F-statistic: 95.65 on 4 and 1959 DF, p-value: < 2.2e-16
summary(lm6)
##
## Call:
## lm(formula = readss97 ~ loginc + ageC + HOME97 + bthwht, data = outliers3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -55.885 -9.325 -0.396 9.395 56.577
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.8643 3.6322 11.801 < 2e-16 ***
## loginc 3.1751 0.3897 8.148 6.53e-16 ***
## ageC 0.4005 0.1184 3.382 0.000734 ***
## HOME97 1.3463 0.1260 10.684 < 2e-16 ***
## bthwht -2.5328 0.7095 -3.570 0.000366 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.52 on 1950 degrees of freedom
## (9 observations deleted due to missingness)
## Multiple R-squared: 0.1818, Adjusted R-squared: 0.1802
## F-statistic: 108.3 on 4 and 1950 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm3)
plot(lm6)
Import bloodpress.txt (information for 20 patients) into RStudio. You will use the following variables: 1) BP (blood pressure); 2) BSA (body surface area); 3) weight.
Create and present a correlation matrix of all three variables. What is the correlation between BSA and weight, and what does it suggest about a regression in which both are independent variables?
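A minimal import-and-correlate sketch, assuming bloodpress.txt is a delimited text file with a header row and columns named BP, BSA, and Weight (adjust the read.table() arguments and column names to match your copy of the file):
bp <- read.table("bloodpress.txt", header = TRUE)   # add sep = "\t" etc. if needed
round(cor(bp[, c("BP", "BSA", "Weight")]), 3)       # correlation matrix of the three variables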
Regress blood pressure on BSA and weight and interpret the coefficients and p-values.
Run a VIF on the model you just ran.
Regress BP on BSA only and interpret the coefficients. What do the results suggest about whether there was a multicollinearity problem in your previous regression?