R Markdown

Data File: LungCapacityDataSet

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
LungCapData2 <- read.csv(file.choose() ,stringsAsFactors = TRUE)
# does not need to include dataframe name before $ sign in every command
attach(LungCapData2)
summary(LungCapData2)
##     LungCap            Age            Height      Smoke        Gender   
##  Min.   : 0.507   Min.   : 3.00   Min.   :45.30   no :648   female:358  
##  1st Qu.: 6.150   1st Qu.: 9.00   1st Qu.:59.90   yes: 77   male  :367  
##  Median : 8.000   Median :13.00   Median :65.40                         
##  Mean   : 7.863   Mean   :12.33   Mean   :64.84                         
##  3rd Qu.: 9.800   3rd Qu.:15.00   3rd Qu.:70.30                         
##  Max.   :14.675   Max.   :19.00   Max.   :81.80                         
##  Caesarean
##  no :561  
##  yes:164  
##           
##           
##           
## 
str(LungCapData2)
## 'data.frame':    725 obs. of  6 variables:
##  $ LungCap  : num  6.47 10.12 9.55 11.12 4.8 ...
##  $ Age      : int  6 18 16 14 5 11 8 11 15 11 ...
##  $ Height   : num  62.1 74.7 69.7 71 56.9 58.7 63.3 70.4 70.5 59.2 ...
##  $ Smoke    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Gender   : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 2 2 2 2 ...
##  $ Caesarean: Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...

Boxplot of Lung Capacity male V/S females

ggplot(LungCapData2, aes(x = Gender, y = LungCap, fill = factor(Gender))) + geom_boxplot()

Boxplot of Lung Capacity Smokers V/S non smokers

ggplot(LungCapData2, aes(x = Smoke, y = LungCap ,fill = factor(Smoke))) + geom_boxplot()

Boxplot of Lung Capacity Caesarean V/S Natural

ggplot(LungCapData2, aes(x = Caesarean, y = LungCap, fill=factor(Caesarean))) + geom_boxplot()

Scatterplot b/w Lung Capacity and Height

plot(Height, LungCap, main="Polynomial Regression", las=1)

Scatterplot b/w Lung Capacity and Age

plot(Age, LungCap, main="Polynomial Regression", las=1)

Correlation DV v/s IV should be strong

numvar <- data.frame(Height,LungCap,Age)
cor(numvar)
##            Height   LungCap       Age
## Height  1.0000000 0.9121873 0.8357368
## LungCap 0.9121873 1.0000000 0.8196749
## Age     0.8357368 0.8196749 1.0000000

Correlation IV v/s IV should be very small recommended less than 0.8
If it is high then known as multicollinearity
if it is approxmately equal to 1 meaning that singularity exist
both multicolinearity and singularity are not good signs for reg
any of the IV should not be included in model

1. LM: Predicting LungCap using Height

model1 <- lm(LungCap ~ Height)
summary(model1)
## 
## Call:
## lm(formula = LungCap ~ Height)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3619 -0.7014 -0.0032  0.7787  3.2938 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -13.996829   0.367451  -38.09   <2e-16 ***
## Height        0.337157   0.005633   59.86   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.092 on 723 degrees of freedom
## Multiple R-squared:  0.8321, Adjusted R-squared:  0.8319 
## F-statistic:  3583 on 1 and 723 DF,  p-value: < 2.2e-16

str(model1) when we create a model the model name contains several
parameters like coefficients, residuals, fitted.values

Graph observed/estimated values and regression line on it

plot(Height, LungCap, main="Linear Regression", las=1)
points(Height, model1$fitted.values, col= "red")

# and add the line to the plot...make it thick and red...
abline(model1, lwd=3, col="blue")

95% Confidence Interval for regression coefficients

confint(model1, conf.level=0.95)
##                  2.5 %      97.5 %
## (Intercept) -14.718228 -13.2754300
## Height        0.326098   0.3482151

2. LM: Predicting LungCap using Age

model2 <- lm(LungCap ~ Age)
summary(model2)
## 
## Call:
## lm(formula = LungCap ~ Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.7799 -1.0203 -0.0005  0.9789  4.2650 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.14686    0.18353   6.249 7.06e-10 ***
## Age          0.54485    0.01416  38.476  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.526 on 723 degrees of freedom
## Multiple R-squared:  0.6719, Adjusted R-squared:  0.6714 
## F-statistic:  1480 on 1 and 723 DF,  p-value: < 2.2e-16
plot(Age, LungCap, main="Linear Regression", las=1)
abline(model2, lwd=3, col="red")

3. LM: Predicting LungCap using Age & Height

model3 <- lm(LungCap ~ Height + Age)
summary(model3)
## 
## Call:
## lm(formula = LungCap ~ Height + Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4080 -0.7097 -0.0078  0.7167  3.1679 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -11.747065   0.476899 -24.632  < 2e-16 ***
## Height        0.278432   0.009926  28.051  < 2e-16 ***
## Age           0.126368   0.017851   7.079 3.45e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 722 degrees of freedom
## Multiple R-squared:  0.843,  Adjusted R-squared:  0.8425 
## F-statistic:  1938 on 2 and 722 DF,  p-value: < 2.2e-16
# if we plot the model then it will generate 4 graphs 
# 1. residual plot v/s fitted values (bounce randomly around the 0 line)
# 2. q-q plot check the normality of residual (should be linear)
# 3. Fitted v/s sqrt(standarized residuals)
# 4. Residuals v/s Leverage (Cook's Distances)
plot(model3)

Comarison of model 1 and model 3 using anova()

# comparing whether model 3 is better than model 1 or not
# H0: Model 1 and Model3 are same
# H1: Model3 is better than model 1
comp13<-anova(model1, model3)
cat(" pvalue= ",  as.character(comp13[2,6]))   
##  pvalue=  3.44549638659208e-12
# As p-value is 3.445496e-12 very smaller than 0.05 than model3 is better than model1

Comarison of model 2 and model 3 using anova()

# comparing whether model 3 is better than model 1 or not
# H0: Model2 and Model3 are same
# H1: Model3 is better than model2

comp23<-anova(model2, model3)
cat(" pvalue= ",  as.character(comp23[2,6]))  
##  pvalue=  1.13517345806132e-117
# If p-value is 1.135173e-117 which is very smaller than 0.05 than model3 is better than model2

Converting Scaled Height into Categorical Factor 6 Levels

cut command split a numeric variable to factor variable height is a numeric variable and we want to make intervals A<=50, B=50-55, C=55-60, D=60-65, E=65-70, F=70+. By default it is (50,55], (55, 60] etc.

CatHeight <- cut(Height, breaks=c(0,50,55,60,65,70,100))

#We can also use labels with this command
CatHeight <- cut(Height, breaks=c(0,50,55,60,65,70,100), labels = c("A","B","C","D","E","F"))

#We can also use break= instead of cutpoints e.g split into four equal groups
#CatHeight <- cut(Height, breaks=4, labels c("A","B","C","D"))

# We can find the levals of a categorical variable using levels command
levels(CatHeight)
## [1] "A" "B" "C" "D" "E" "F"

Creating 5 Dummy Indicator variables Xb, Xc, Xd, Xe, Xf

CatHeight hs six categories or levels a, b, c, d, e, f which require five dummy variables or indicator variables

LungCapData2$Xb <- ifelse(CatHeight == 'B', 1, 0)
LungCapData2$Xc <- ifelse(CatHeight == 'C', 1, 0)
LungCapData2$Xd <- ifelse(CatHeight == 'D', 1, 0)
LungCapData2$Xe <- ifelse(CatHeight == 'E', 1, 0)
LungCapData2$Xf <- ifelse(CatHeight == 'F', 1, 0)

4. LM: Regression using CatHeight (No Interaction)

attach command must be used every time after adding new variables to df

attach(LungCapData2)
## The following objects are masked from LungCapData2 (pos = 3):
## 
##     Age, Caesarean, Gender, Height, LungCap, Smoke
str(LungCapData2)
## 'data.frame':    725 obs. of  11 variables:
##  $ LungCap  : num  6.47 10.12 9.55 11.12 4.8 ...
##  $ Age      : int  6 18 16 14 5 11 8 11 15 11 ...
##  $ Height   : num  62.1 74.7 69.7 71 56.9 58.7 63.3 70.4 70.5 59.2 ...
##  $ Smoke    : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ Gender   : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 2 2 2 2 ...
##  $ Caesarean: Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ Xb       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Xc       : num  0 0 0 0 1 1 0 0 0 1 ...
##  $ Xd       : num  1 0 0 0 0 0 1 0 0 0 ...
##  $ Xe       : num  0 0 1 0 0 0 0 0 0 0 ...
##  $ Xf       : num  0 1 0 1 0 0 0 1 1 0 ...
model4 <- lm(LungCap ~ Age + CatHeight)
summary(model4)
## 
## Call:
## lm(formula = LungCap ~ Age + CatHeight)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8719 -0.7751  0.0281  0.7521  3.4160 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.97553    0.29394   3.319  0.00095 ***
## Age          0.20110    0.01859  10.816  < 2e-16 ***
## CatHeightB   1.48361    0.31780   4.668 3.62e-06 ***
## CatHeightC   2.68562    0.29818   9.007  < 2e-16 ***
## CatHeightD   3.93857    0.30623  12.862  < 2e-16 ***
## CatHeightE   5.00703    0.32105  15.596  < 2e-16 ***
## CatHeightF   6.53873    0.34635  18.879  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.159 on 718 degrees of freedom
## Multiple R-squared:  0.812,  Adjusted R-squared:  0.8104 
## F-statistic: 516.8 on 6 and 718 DF,  p-value: < 2.2e-16

Fitted regression line for category A

µy|x = 0.98 +0.20 Age + 1.48 Xb +2.96 Xc + 3.94 Xd +5.01 Xe +6.54 Xf

For height category A the base category all indicator variables are zero.

Hence the model for height category A is given below

µy|x = 0.98 +0.20Age+1.48(Xb =0)+2.96(Xc=0)+3.94 (Xd=0)+5.01(Xe=0)+6.54(Xf=0)

µy|x = 0.98 +0.20 Age

Fitted regression line for category B

µy|x = 0.98 +0.20 Age + 1.48 Xb +2.96 Xc + 3.94 Xd +5.01 Xe +6.54 Xf

For height category b the variable Xb will be 1 and all other are zero.

Hence the model for height category B is given below

µy|x = 0.98 +0.20Age+1.48(Xb =1)+2.96(Xc=0)+3.94 (Xd=0)+5.01(Xe=0)+6.54(Xf=0)

µy|x = 0.98 +0.20 Age + 1.48

µy|x = 2.46 + 0.20 Age

plot(Age[CatHeight== "A"], LungCap[CatHeight=="A"], col=2, ylim=c(0,15), xlim=c(0,20), xlab="Age", ylab="LungCap", main="LungCap vs. Age, CatHeight")
points(Age[CatHeight== "B"], LungCap[CatHeight=="B"], col=3)
points(Age[CatHeight== "C"], LungCap[CatHeight=="C"], col=4)
points(Age[CatHeight== "D"], LungCap[CatHeight=="D"], col=5)
points(Age[CatHeight== "E"], LungCap[CatHeight=="E"], col=6)
points(Age[CatHeight== "F"], LungCap[CatHeight=="F"], col=7)
legend(0,15.5,legend = c("A","B","C","D","E","F"), col=2:7,pch=1, cex=0.8)
# Now plot the regression lines on this plot using abline() command
# a= y intercept, b= slope
# for HeightCat ="A" slope =  0.20 and intercept = 0.98
abline(a=0.98, b=0.20, col=2, lwd=3)
abline(a=2.46, b=0.20, col=3, lwd=3)
abline(a=3.67, b=0.20, col=4, lwd=3)
abline(a=4.92, b=0.20, col=5, lwd=3)
abline(a=5.9, b=0.20, col=6, lwd=3)
abline(a=7.52, b=0.20, col=7, lwd=3)

5. LM: Predicting LungCap using Age and Smoke Without INTERACTION

First we observe without interaction

Plot the data, using different colours for smoke(red)/non-smoke(blue)

plot(Age[Smoke=="no"], LungCap[Smoke=="no"], col="blue", ylim=c(0,15), 
     xlim=c(0,20), xlab="Age", ylab="LungCap", 
     main="LungCap vs. Age,Smoke")
# Now, add in the points for the Smokers, in Solid Red Circles
points(Age[Smoke=="yes"], LungCap[Smoke=="yes"], col="red", pch=16)
# And, add in a legend
legend(1,15,legend=c("NonSmoker", "Smoker"), col=c("blue", "red"), 
       pch=c(1,16), bty="n")

# linear regression using Age and Smoke without interaction
model5<- lm(LungCap ~ Age  + Smoke)
summary(model5)
## 
## Call:
## lm(formula = LungCap ~ Age + Smoke)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8559 -1.0289 -0.0363  1.0083  4.1995 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.08572    0.18299   5.933 4.61e-09 ***
## Age          0.55540    0.01438  38.628  < 2e-16 ***
## Smokeyes    -0.64859    0.18676  -3.473 0.000546 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.514 on 722 degrees of freedom
## Multiple R-squared:  0.6773, Adjusted R-squared:  0.6764 
## F-statistic: 757.5 on 2 and 722 DF,  p-value: < 2.2e-16
plot(Age[Smoke=="no"], LungCap[Smoke=="no"], col="blue", ylim=c(0,15), 
     xlim=c(0,20), xlab="Age", ylab="LungCap", 
     main="LungCap vs. Age,Smoke")
# Now, add in the points for the Smokers, in Solid Red Circles
points(Age[Smoke=="yes"], LungCap[Smoke=="yes"], col="red", pch=16)
abline(a=1.08572, b=0.5554, col="blue", lwd=3)
# And now, add in the line for Smokers, in Red
abline(a=0.43713, b=.5554, col="red", lwd=3)

Model for non smokers is simple Put Smoke=0

LungCap = 1.08572 + 0.5554 Age - 0.64859 Smoke

LungCap = 1.08572 + 0.5554 Age

Now the model for Smokers = 1

LungCap = 1.08572 + 0.5554 Age - 0.64859 (Smoke=1)

LungCap = 1.08572 + 0.5554 Age - 0.64859

LungCap = 0.43713 + 0.5554 Age

These two parallel lines means the rate of change

in lungcapacity remains same for smokers and non smokers

Is this Correct?

We have to check that

6. LM: Predicting LungCap using Age and Smoke With INTERACTION

model6 <- lm(LungCap ~ Age*Smoke)
summary(model6)
## 
## Call:
## lm(formula = LungCap ~ Age * Smoke)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8586 -1.0174 -0.0251  1.0004  4.1996 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.05157    0.18706   5.622  2.7e-08 ***
## Age           0.55823    0.01473  37.885  < 2e-16 ***
## Smokeyes      0.22601    1.00755   0.224    0.823    
## Age:Smokeyes -0.05970    0.06759  -0.883    0.377    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.515 on 721 degrees of freedom
## Multiple R-squared:  0.6776, Adjusted R-squared:  0.6763 
## F-statistic: 505.1 on 3 and 721 DF,  p-value: < 2.2e-16
plot(Age[Smoke=="no"], LungCap[Smoke=="no"], col="blue", ylim=c(0,15), 
     xlim=c(0,20), xlab="Age", ylab="LungCap", 
     main="LungCap vs. Age,Smoke")
# Now, add in the points for the Smokers, in Solid Red Circles
points(Age[Smoke=="yes"], LungCap[Smoke=="yes"], col="red", pch=16)
abline(a=1.05157, b=0.55823, col="blue", lwd=3)
# And now, add in the line for Smokers, in Red
abline(a=1.27759, b=0.49853, col="red", lwd=3)

Compute the model for smoker and non smokers separetely

LungCap = 1.05157 + 0.55823 Age + 0.22602 Smoke - 0.0597 Age x Smoke

Model for non smokers (Put Smoke=0)

LungCap = 1.05157 + 0.55823 Age + 0.22602 (Smoke=0) - 0.0597 * Age * (Smoke=0)
LungCap = 1.05157 + 0.55823 Age

Now the model for Smokers = 1

LungCap = 1.05157 + 0.55823 Age + 0.22602 (Smoke=1) - 0.0597 * Age * (Smoke=1)

LungCap = 1.05157 + 0.55823 Age + 0.22602 - 0.0597 * Age

LungCap = 1.27759 + 0.49853 Age

Conclusion

**Q1: Does the interaction make sense conceptually or we can say that should the smoking affect the lungs a 19 year old differently than a 15 years old?

Ans: No

Q2: Is the Interaction term Statistically Significant?
Ans: The p-value of interaction effect is 0.377 means Interaction is statistically insignificant **

7. Non linear Regression Predicting LungCap using Height

Quadratic model

Two different ways to create a quadratic model

method2: model7 <- lm(LungCap ~ Height + I(Height^2))

method3: model8 <- lm(LungCap ~ Height + HeightSquare)

model7 <- lm(LungCap ~ Height + I(Height^2))
summary(model7)
## 
## Call:
## lm(formula = LungCap ~ Height + I(Height^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4210 -0.6812 -0.0175  0.7805  3.3711 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.875e+01  2.656e+00  -7.059 3.96e-12 ***
## Height       4.876e-01  8.344e-02   5.843 7.75e-09 ***
## I(Height^2) -1.175e-03  6.502e-04  -1.807   0.0712 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09 on 722 degrees of freedom
## Multiple R-squared:  0.8328, Adjusted R-squared:  0.8324 
## F-statistic:  1799 on 2 and 722 DF,  p-value: < 2.2e-16
# or, create Height^2, and then include this in model
HeightSquare <- Height^2
model8 <- lm(LungCap ~ Height + HeightSquare)
summary(model8)
## 
## Call:
## lm(formula = LungCap ~ Height + HeightSquare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4210 -0.6812 -0.0175  0.7805  3.3711 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.875e+01  2.656e+00  -7.059 3.96e-12 ***
## Height        4.876e-01  8.344e-02   5.843 7.75e-09 ***
## HeightSquare -1.175e-03  6.502e-04  -1.807   0.0712 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.09 on 722 degrees of freedom
## Multiple R-squared:  0.8328, Adjusted R-squared:  0.8324 
## F-statistic:  1799 on 2 and 722 DF,  p-value: < 2.2e-16

observe the the output of model 7 and 8 are exactly same

Notice that the Height^2 is not significant therefore model1 is better than model 8

Adjusted R^2 of model1 is 0.8319

Adjusted R^2 of model8 is 0.8324

Adjusted R^2 slightly increased by including Height^2

Finally model 1 is better than model 8