Loading LungCap data

LungCap Age Height Smoke Gender Caesarean
6.475 6 62.1 no male no
10.125 18 74.7 yes female no
9.550 16 69.7 no female yes
11.125 14 71.0 no male no
4.800 5 56.9 no male no
6.225 11 58.7 no female no

Questions

1. What are the most factors affecting the LungCap?

2. Does the Gender have a huge effect on the Cap volume?

First we will arrange LungCap as normal value between 4-6 L

European Respiratory Journal

Preparing data

hist(data$LungCap,xlab="LungCap",ylab="Count",
     main="Variation of LungCap")

data_na_counts <- data %>% 
  summarize(across(everything(),
                   \(x)mean(is.na(x))))
gt(data_na_counts)
LungCap Age Height Smoke Gender Caesarean
0 0 0 0 0 0

No NA’s in all variables.

min(data$LungCap)
## [1] 0.507
max(data$LungCap)
## [1] 14.675

Our recorded data varies between 0.5 and 14.6 L

data <- data %>%
  mutate(disease = (LungCap<4|LungCap>6))
data %>% 
  ggplot(aes(x=Gender,
             y=LungCap))+
  geom_boxplot()

As observed the variance of both genders is approximately the same with slightly higher by males

Visualizing Smoke with LungCap

data %>% 
  ggplot(aes(x=Smoke,
             y=LungCap,
             color=Gender))+
  geom_boxplot()

For smokers median of LungCap volume is somewhat higher, especially for men.

Checking the relation between Smoking and prob of disease!

tab <- table(data$disease, data$Smoke)
chi <- chisq.test(tab, correct=T)
print(chi)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab
## X-squared = 8.2019, df = 1, p-value = 0.004185

P-value not very significant.

cramersV(tab)
## [1] 0.1063624

Very small effect size

T_test to ensure the small difference by smoke and cap

t.test(data$LungCap ~ data$Smoke)
## 
##  Welch Two Sample t-test
## 
## data:  data$LungCap by data$Smoke
## t = -3.6498, df = 117.72, p-value = 0.0003927
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  -1.3501778 -0.4003548
## sample estimates:
##  mean in group no mean in group yes 
##          7.770188          8.645455

Small p-value but not too much indicating the small effect of Smoke on LungCap

Is there an association between Caesarean and LungCap??

data %>% 
  ggplot(aes(x=Caesarean,
             y=LungCap))+
  geom_boxplot()

No measurable effect

Transforming Age into categorical

range(data$Age)
## [1]  3 19
Agegroups <- cut(data$Age, breaks = c(0,9,14,19),
                 labels = c("Children","Youth","Adults"))
data <- data %>% 
  mutate(Agegroups)

Relation of Agegroups and Lung Capacity

attach(data)
boxplot(LungCap ~ Agegroups)

attach(data)
ANOVA <- aov(LungCap ~ Agegroups)
summary(ANOVA)
##              Df Sum Sq Mean Sq F value Pr(>F)    
## Agegroups     2   2978    1489   499.4 <2e-16 ***
## Residuals   722   2153       3                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Very Small p-value indicating significance of Agegroups on the LungCap

attach(data)
cor(Height, LungCap)
## [1] 0.9121873

Very high association

Plotting Height with disease probability

data %>% 
  ggplot(aes(x=Height,
             fill=disease))+
  geom_histogram(width=15)

Checking Height plot with LungCap volume

data %>% 
  ggplot(aes(x=Height,
             y=LungCap))+
  geom_point()+
  geom_smooth(method="lm",se=FALSE)

There exists a linear relation between height and Lung Capacity

Building the Model–

model1 <- glm(disease ~ Height,    
             data=data,
             family="binomial")
summary(model1) 
## 
## Call:
## glm(formula = disease ~ Height, family = "binomial", data = data)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -9.8241     1.1757  -8.356   <2e-16 ***
## Height        0.1886     0.0197   9.577   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 585.38  on 724  degrees of freedom
## Residual deviance: 460.05  on 723  degrees of freedom
## AIC: 464.05
## 
## Number of Fisher Scoring iterations: 6

Statistically Significant P-value for Height predictor on the probability of the disease(<4L|>6L)

model2 <- glm(disease ~ Height + Agegroups,    
             data=data,
             family="binomial")
summary(model2) 
## 
## Call:
## glm(formula = disease ~ Height + Agegroups, family = "binomial", 
##     data = data)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)     -5.94674    1.53924  -3.863 0.000112 ***
## Height           0.11557    0.02744   4.211 2.54e-05 ***
## AgegroupsYouth   0.67395    0.31509   2.139 0.032446 *  
## AgegroupsAdults  3.29332    1.08071   3.047 0.002309 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 585.38  on 724  degrees of freedom
## Residual deviance: 441.61  on 721  degrees of freedom
## AIC: 449.61
## 
## Number of Fisher Scoring iterations: 8

All significant P-values for the dummy variables in Age group, with lower AIC in this model.

Model 2 is more fitting

Checking if there interaction between Height and Age!

model3 <- glm(disease ~ Height*Agegroups,    
             data=data,
             family="binomial")
summary(model3)
## 
## Call:
## glm(formula = disease ~ Height * Agegroups, family = "binomial", 
##     data = data)
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              0.249573   1.785202   0.140    0.889    
## Height                   0.005147   0.031504   0.163    0.870    
## AgegroupsYouth         -23.326498   4.637864  -5.030 4.92e-07 ***
## AgegroupsAdults        -25.418762  17.719268  -1.435    0.151    
## Height:AgegroupsYouth    0.399024   0.077070   5.177 2.25e-07 ***
## Height:AgegroupsAdults   0.450337   0.278429   1.617    0.106    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 585.38  on 724  degrees of freedom
## Residual deviance: 403.89  on 719  degrees of freedom
## AIC: 415.89
## 
## Number of Fisher Scoring iterations: 9

Lower AIC but with some high p-values so prefer not to complex the model

For deep classification

Split data

library(tidymodels)
 set.seed(2)
 split <- initial_split(data,
                        prop=.80,
                        strata=disease)
 data_train <- training(split) 
 data_test <- testing(split)

Build the training model

model <- glm(disease ~ Height + Agegroups,
              data=data_train,
              family="binomial")
summary(model)
## 
## Call:
## glm(formula = disease ~ Height + Agegroups, family = "binomial", 
##     data = data_train)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)     -4.65158    1.72777  -2.692  0.00710 **
## Height           0.09130    0.03072   2.972  0.00296 **
## AgegroupsYouth   1.05971    0.36366   2.914  0.00357 **
## AgegroupsAdults  3.48839    1.10192   3.166  0.00155 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.08  on 578  degrees of freedom
## Residual deviance: 346.82  on 575  degrees of freedom
## AIC: 354.82
## 
## Number of Fisher Scoring iterations: 7

Significant p-values for both predictors.

evaluate the model on testing set

 data_test <- data_test %>% 
   mutate(disease_prob = predict(model,
                                data_test,
                                type= "response"),
          disease_pred = ifelse(disease_prob > .5, 1, 0))
t <- table(data_test$disease,
            data_test$disease_pred) 
accuracy <- sum(diag(t))/sum(t)
print(accuracy)
## [1] 0.8287671

The accuracy of our model is 82%

Then We can say based on our Analysis on this dataset, LungCap and probability of abnormal volume is affected by Height and Age of people