Data File: LungCapacityDataSet
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.2
LungCapData2 <- read.csv(file.choose() ,stringsAsFactors = TRUE)
# does not need to include dataframe name before $ sign in every command
attach(LungCapData2)
summary(LungCapData2)
## LungCap Age Height Smoke Gender
## Min. : 0.507 Min. : 3.00 Min. :45.30 no :648 female:358
## 1st Qu.: 6.150 1st Qu.: 9.00 1st Qu.:59.90 yes: 77 male :367
## Median : 8.000 Median :13.00 Median :65.40
## Mean : 7.863 Mean :12.33 Mean :64.84
## 3rd Qu.: 9.800 3rd Qu.:15.00 3rd Qu.:70.30
## Max. :14.675 Max. :19.00 Max. :81.80
## Caesarean
## no :561
## yes:164
##
##
##
##
str(LungCapData2)
## 'data.frame': 725 obs. of 6 variables:
## $ LungCap : num 6.47 10.12 9.55 11.12 4.8 ...
## $ Age : int 6 18 16 14 5 11 8 11 15 11 ...
## $ Height : num 62.1 74.7 69.7 71 56.9 58.7 63.3 70.4 70.5 59.2 ...
## $ Smoke : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 2 2 2 2 ...
## $ Caesarean: Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
ggplot(LungCapData2, aes(x = Gender, y = LungCap, fill = factor(Gender))) + geom_boxplot()
ggplot(LungCapData2, aes(x = Smoke, y = LungCap ,fill = factor(Smoke))) + geom_boxplot()
ggplot(LungCapData2, aes(x = Caesarean, y = LungCap, fill=factor(Caesarean))) + geom_boxplot()
plot(Height, LungCap, main="Polynomial Regression", las=1)
plot(Age, LungCap, main="Polynomial Regression", las=1)
numvar <- data.frame(Height,LungCap,Age)
cor(numvar)
## Height LungCap Age
## Height 1.0000000 0.9121873 0.8357368
## LungCap 0.9121873 1.0000000 0.8196749
## Age 0.8357368 0.8196749 1.0000000
Correlation IV v/s IV should be very small recommended less
than 0.8
If it is high then known as multicollinearity
if it is approxmately equal to 1 meaning that singularity exist
both multicolinearity and singularity are not good signs for reg
any of the IV should not be included in model
model1 <- lm(LungCap ~ Height)
summary(model1)
##
## Call:
## lm(formula = LungCap ~ Height)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.3619 -0.7014 -0.0032 0.7787 3.2938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -13.996829 0.367451 -38.09 <2e-16 ***
## Height 0.337157 0.005633 59.86 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.092 on 723 degrees of freedom
## Multiple R-squared: 0.8321, Adjusted R-squared: 0.8319
## F-statistic: 3583 on 1 and 723 DF, p-value: < 2.2e-16
str(model1) when we create a model the model name contains
several
parameters like coefficients, residuals, fitted.values
plot(Height, LungCap, main="Linear Regression", las=1)
points(Height, model1$fitted.values, col= "red")
# and add the line to the plot...make it thick and red...
abline(model1, lwd=3, col="blue")
confint(model1, conf.level=0.95)
## 2.5 % 97.5 %
## (Intercept) -14.718228 -13.2754300
## Height 0.326098 0.3482151
model2 <- lm(LungCap ~ Age)
summary(model2)
##
## Call:
## lm(formula = LungCap ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7799 -1.0203 -0.0005 0.9789 4.2650
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.14686 0.18353 6.249 7.06e-10 ***
## Age 0.54485 0.01416 38.476 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.526 on 723 degrees of freedom
## Multiple R-squared: 0.6719, Adjusted R-squared: 0.6714
## F-statistic: 1480 on 1 and 723 DF, p-value: < 2.2e-16
plot(Age, LungCap, main="Linear Regression", las=1)
abline(model2, lwd=3, col="red")
model3 <- lm(LungCap ~ Height + Age)
summary(model3)
##
## Call:
## lm(formula = LungCap ~ Height + Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4080 -0.7097 -0.0078 0.7167 3.1679
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -11.747065 0.476899 -24.632 < 2e-16 ***
## Height 0.278432 0.009926 28.051 < 2e-16 ***
## Age 0.126368 0.017851 7.079 3.45e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 722 degrees of freedom
## Multiple R-squared: 0.843, Adjusted R-squared: 0.8425
## F-statistic: 1938 on 2 and 722 DF, p-value: < 2.2e-16
# if we plot the model then it will generate 4 graphs
# 1. residual plot v/s fitted values (bounce randomly around the 0 line)
# 2. q-q plot check the normality of residual (should be linear)
# 3. Fitted v/s sqrt(standarized residuals)
# 4. Residuals v/s Leverage (Cook's Distances)
plot(model3)
# comparing whether model 3 is better than model 1 or not
# H0: Model 1 and Model3 are same
# H1: Model3 is better than model 1
comp13<-anova(model1, model3)
cat(" pvalue= ", as.character(comp13[2,6]))
## pvalue= 3.44549638659208e-12
# As p-value is 3.445496e-12 very smaller than 0.05 than model3 is better than model1
# comparing whether model 3 is better than model 1 or not
# H0: Model2 and Model3 are same
# H1: Model3 is better than model2
comp23<-anova(model2, model3)
cat(" pvalue= ", as.character(comp23[2,6]))
## pvalue= 1.13517345806132e-117
# If p-value is 1.135173e-117 which is very smaller than 0.05 than model3 is better than model2
cut command split a numeric variable to factor variable height is a numeric variable and we want to make intervals A<=50, B=50-55, C=55-60, D=60-65, E=65-70, F=70+. By default it is (50,55], (55, 60] etc.
CatHeight <- cut(Height, breaks=c(0,50,55,60,65,70,100))
#We can also use labels with this command
CatHeight <- cut(Height, breaks=c(0,50,55,60,65,70,100), labels = c("A","B","C","D","E","F"))
#We can also use break= instead of cutpoints e.g split into four equal groups
#CatHeight <- cut(Height, breaks=4, labels c("A","B","C","D"))
# We can find the levals of a categorical variable using levels command
levels(CatHeight)
## [1] "A" "B" "C" "D" "E" "F"
CatHeight hs six categories or levels a, b, c, d, e, f which require five dummy variables or indicator variables
LungCapData2$Xb <- ifelse(CatHeight == 'B', 1, 0)
LungCapData2$Xc <- ifelse(CatHeight == 'C', 1, 0)
LungCapData2$Xd <- ifelse(CatHeight == 'D', 1, 0)
LungCapData2$Xe <- ifelse(CatHeight == 'E', 1, 0)
LungCapData2$Xf <- ifelse(CatHeight == 'F', 1, 0)
attach command must be used every time after adding new variables to df
attach(LungCapData2)
## The following objects are masked from LungCapData2 (pos = 3):
##
## Age, Caesarean, Gender, Height, LungCap, Smoke
str(LungCapData2)
## 'data.frame': 725 obs. of 11 variables:
## $ LungCap : num 6.47 10.12 9.55 11.12 4.8 ...
## $ Age : int 6 18 16 14 5 11 8 11 15 11 ...
## $ Height : num 62.1 74.7 69.7 71 56.9 58.7 63.3 70.4 70.5 59.2 ...
## $ Smoke : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
## $ Gender : Factor w/ 2 levels "female","male": 2 1 1 2 2 1 2 2 2 2 ...
## $ Caesarean: Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
## $ Xb : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Xc : num 0 0 0 0 1 1 0 0 0 1 ...
## $ Xd : num 1 0 0 0 0 0 1 0 0 0 ...
## $ Xe : num 0 0 1 0 0 0 0 0 0 0 ...
## $ Xf : num 0 1 0 1 0 0 0 1 1 0 ...
model4 <- lm(LungCap ~ Age + CatHeight)
summary(model4)
##
## Call:
## lm(formula = LungCap ~ Age + CatHeight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.8719 -0.7751 0.0281 0.7521 3.4160
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.97553 0.29394 3.319 0.00095 ***
## Age 0.20110 0.01859 10.816 < 2e-16 ***
## CatHeightB 1.48361 0.31780 4.668 3.62e-06 ***
## CatHeightC 2.68562 0.29818 9.007 < 2e-16 ***
## CatHeightD 3.93857 0.30623 12.862 < 2e-16 ***
## CatHeightE 5.00703 0.32105 15.596 < 2e-16 ***
## CatHeightF 6.53873 0.34635 18.879 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.159 on 718 degrees of freedom
## Multiple R-squared: 0.812, Adjusted R-squared: 0.8104
## F-statistic: 516.8 on 6 and 718 DF, p-value: < 2.2e-16
plot(Age[CatHeight== "A"], LungCap[CatHeight=="A"], col=2, ylim=c(0,15), xlim=c(0,20), xlab="Age", ylab="LungCap", main="LungCap vs. Age, CatHeight")
points(Age[CatHeight== "B"], LungCap[CatHeight=="B"], col=3)
points(Age[CatHeight== "C"], LungCap[CatHeight=="C"], col=4)
points(Age[CatHeight== "D"], LungCap[CatHeight=="D"], col=5)
points(Age[CatHeight== "E"], LungCap[CatHeight=="E"], col=6)
points(Age[CatHeight== "F"], LungCap[CatHeight=="F"], col=7)
legend(0,15.5,legend = c("A","B","C","D","E","F"), col=2:7,pch=1, cex=0.8)
# Now plot the regression lines on this plot using abline() command
# a= y intercept, b= slope
# for HeightCat ="A" slope = 0.20 and intercept = 0.98
abline(a=0.98, b=0.20, col=2, lwd=3)
abline(a=2.46, b=0.20, col=3, lwd=3)
abline(a=3.67, b=0.20, col=4, lwd=3)
abline(a=4.92, b=0.20, col=5, lwd=3)
abline(a=5.9, b=0.20, col=6, lwd=3)
abline(a=7.52, b=0.20, col=7, lwd=3)
plot(Age[Smoke=="no"], LungCap[Smoke=="no"], col="blue", ylim=c(0,15),
xlim=c(0,20), xlab="Age", ylab="LungCap",
main="LungCap vs. Age,Smoke")
# Now, add in the points for the Smokers, in Solid Red Circles
points(Age[Smoke=="yes"], LungCap[Smoke=="yes"], col="red", pch=16)
# And, add in a legend
legend(1,15,legend=c("NonSmoker", "Smoker"), col=c("blue", "red"),
pch=c(1,16), bty="n")
# linear regression using Age and Smoke without interaction
model5<- lm(LungCap ~ Age + Smoke)
summary(model5)
##
## Call:
## lm(formula = LungCap ~ Age + Smoke)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8559 -1.0289 -0.0363 1.0083 4.1995
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.08572 0.18299 5.933 4.61e-09 ***
## Age 0.55540 0.01438 38.628 < 2e-16 ***
## Smokeyes -0.64859 0.18676 -3.473 0.000546 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.514 on 722 degrees of freedom
## Multiple R-squared: 0.6773, Adjusted R-squared: 0.6764
## F-statistic: 757.5 on 2 and 722 DF, p-value: < 2.2e-16
plot(Age[Smoke=="no"], LungCap[Smoke=="no"], col="blue", ylim=c(0,15),
xlim=c(0,20), xlab="Age", ylab="LungCap",
main="LungCap vs. Age,Smoke")
# Now, add in the points for the Smokers, in Solid Red Circles
points(Age[Smoke=="yes"], LungCap[Smoke=="yes"], col="red", pch=16)
abline(a=1.08572, b=0.5554, col="blue", lwd=3)
# And now, add in the line for Smokers, in Red
abline(a=0.43713, b=.5554, col="red", lwd=3)
model6 <- lm(LungCap ~ Age*Smoke)
summary(model6)
##
## Call:
## lm(formula = LungCap ~ Age * Smoke)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8586 -1.0174 -0.0251 1.0004 4.1996
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.05157 0.18706 5.622 2.7e-08 ***
## Age 0.55823 0.01473 37.885 < 2e-16 ***
## Smokeyes 0.22601 1.00755 0.224 0.823
## Age:Smokeyes -0.05970 0.06759 -0.883 0.377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.515 on 721 degrees of freedom
## Multiple R-squared: 0.6776, Adjusted R-squared: 0.6763
## F-statistic: 505.1 on 3 and 721 DF, p-value: < 2.2e-16
plot(Age[Smoke=="no"], LungCap[Smoke=="no"], col="blue", ylim=c(0,15),
xlim=c(0,20), xlab="Age", ylab="LungCap",
main="LungCap vs. Age,Smoke")
# Now, add in the points for the Smokers, in Solid Red Circles
points(Age[Smoke=="yes"], LungCap[Smoke=="yes"], col="red", pch=16)
abline(a=1.05157, b=0.55823, col="blue", lwd=3)
# And now, add in the line for Smokers, in Red
abline(a=1.27759, b=0.49853, col="red", lwd=3)
LungCap = 1.05157 + 0.55823 Age + 0.22602 Smoke - 0.0597 Age x Smoke
LungCap = 1.05157 + 0.55823 Age + 0.22602 (Smoke=0) - 0.0597 * Age *
(Smoke=0)
LungCap = 1.05157 + 0.55823 Age
LungCap = 1.05157 + 0.55823 Age + 0.22602 (Smoke=1) - 0.0597 * Age * (Smoke=1)
LungCap = 1.27759 + 0.49853 Age
**Q1: Does the interaction make sense conceptually or we can say that should the smoking affect the lungs a 19 year old differently than a 15 years old?
Ans: No
Q2: Is the Interaction term Statistically Significant?
Ans: The p-value of interaction effect is 0.377 means Interaction is
statistically insignificant **
model7 <- lm(LungCap ~ Height + I(Height^2))
summary(model7)
##
## Call:
## lm(formula = LungCap ~ Height + I(Height^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4210 -0.6812 -0.0175 0.7805 3.3711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.875e+01 2.656e+00 -7.059 3.96e-12 ***
## Height 4.876e-01 8.344e-02 5.843 7.75e-09 ***
## I(Height^2) -1.175e-03 6.502e-04 -1.807 0.0712 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 722 degrees of freedom
## Multiple R-squared: 0.8328, Adjusted R-squared: 0.8324
## F-statistic: 1799 on 2 and 722 DF, p-value: < 2.2e-16
# or, create Height^2, and then include this in model
HeightSquare <- Height^2
model8 <- lm(LungCap ~ Height + HeightSquare)
summary(model8)
##
## Call:
## lm(formula = LungCap ~ Height + HeightSquare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4210 -0.6812 -0.0175 0.7805 3.3711
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.875e+01 2.656e+00 -7.059 3.96e-12 ***
## Height 4.876e-01 8.344e-02 5.843 7.75e-09 ***
## HeightSquare -1.175e-03 6.502e-04 -1.807 0.0712 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.09 on 722 degrees of freedom
## Multiple R-squared: 0.8328, Adjusted R-squared: 0.8324
## F-statistic: 1799 on 2 and 722 DF, p-value: < 2.2e-16