HW6

R Markdown

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   3.5.1     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(naniar)
library(readr)
library(ggplot2)
library(broom)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(tidyr)
library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select

Loading the survey dataset from the MASS package:

data("survey")
head(survey)

##      Sex Wr.Hnd NW.Hnd W.Hnd    Fold Pulse    Clap Exer Smoke Height      M.I
## 1 Female   18.5   18.0 Right  R on L    92    Left Some Never 173.00   Metric
## 2   Male   19.5   20.5  Left  R on L   104    Left None Regul 177.80 Imperial
## 3   Male   18.0   13.3 Right  L on R    87 Neither None Occas     NA     <NA>
## 4   Male   18.8   18.9 Right  R on L    NA Neither None Never 160.00   Metric
## 5   Male   20.0   20.0 Right Neither    35   Right Some Never 165.00   Metric
## 6 Female   18.0   17.7 Right  L on R    64   Right Some Never 172.72 Imperial
##      Age
## 1 18.250
## 2 17.583
## 3 16.917
## 4 20.333
## 5 23.667
## 6 21.000

summary(survey)

##      Sex          Wr.Hnd          NW.Hnd        W.Hnd          Fold    
##  Female:118   Min.   :13.00   Min.   :12.50   Left : 18   L on R : 99  
##  Male  :118   1st Qu.:17.50   1st Qu.:17.50   Right:218   Neither: 18  
##  NA's  :  1   Median :18.50   Median :18.50   NA's :  1   R on L :120  
##               Mean   :18.67   Mean   :18.58                            
##               3rd Qu.:19.80   3rd Qu.:19.73                            
##               Max.   :23.20   Max.   :23.50                            
##               NA's   :1       NA's   :1                                
##      Pulse             Clap       Exer       Smoke         Height     
##  Min.   : 35.00   Left   : 39   Freq:115   Heavy: 11   Min.   :150.0  
##  1st Qu.: 66.00   Neither: 50   None: 24   Never:189   1st Qu.:165.0  
##  Median : 72.50   Right  :147   Some: 98   Occas: 19   Median :171.0  
##  Mean   : 74.15   NA's   :  1              Regul: 17   Mean   :172.4  
##  3rd Qu.: 80.00                            NA's :  1   3rd Qu.:180.0  
##  Max.   :104.00                                        Max.   :200.0  
##  NA's   :45                                            NA's   :28     
##        M.I           Age       
##  Imperial: 68   Min.   :16.75  
##  Metric  :141   1st Qu.:17.67  
##  NA's    : 28   Median :18.58  
##                 Mean   :20.37  
##                 3rd Qu.:20.17  
##                 Max.   :73.00  
##

pct_complete(survey)

## [1] 96.23769

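The mean pulse rate is 74.15 beats per minute, close to the median of 72.5, so the distribution is not strongly skewed by outliers. About 96% of all cells in the dataset are complete, but the missing values are concentrated in a few variables: Pulse has 45 missing values, and Height and M.I have 28 each. I decided to clean the data by removing every row that contains a missing value (listwise deletion). This leaves 168 of the original 237 observations, which is still a reasonably large sample.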

survey.data <- na.omit(survey)
head(survey.data)

##      Sex Wr.Hnd NW.Hnd W.Hnd    Fold Pulse  Clap Exer Smoke Height      M.I
## 1 Female   18.5   18.0 Right  R on L    92  Left Some Never 173.00   Metric
## 2   Male   19.5   20.5  Left  R on L   104  Left None Regul 177.80 Imperial
## 5   Male   20.0   20.0 Right Neither    35 Right Some Never 165.00   Metric
## 6 Female   18.0   17.7 Right  L on R    64 Right Some Never 172.72 Imperial
## 7   Male   17.7   17.7 Right  L on R    83 Right Freq Never 182.88 Imperial
## 8 Female   17.0   17.3 Right  R on L    74 Right Freq Never 157.00   Metric
##      Age
## 1 18.250
## 2 17.583
## 5 23.667
## 6 21.000
## 7 18.833
## 8 35.833
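To make the cost of listwise deletion visible, the sketch below counts missing values per column and compares row counts before and after `na.omit()`. The second half is an assumption rather than the approach used in this report: it drops rows only when a variable used in the later model is missing, and `model_vars` and `survey_model` are hypothetical names introduced for illustration.

```r
# Self-contained sketch: how many rows does listwise deletion drop?
data("survey", package = "MASS")

# Number of missing values in each column
colSums(is.na(survey))

# Row counts before and after removing every row with any missing value
nrow(survey)
nrow(na.omit(survey))

# Alternative (an assumption, not the approach used above): drop a row only
# when one of the variables used in the later model is missing
model_vars <- c("Pulse", "Age", "Sex", "Exer", "Smoke")
survey_model <- na.omit(survey[, model_vars])
nrow(survey_model)
```

Restricting the deletion to the modelled variables keeps rows that are missing only Height or M.I, which are not used in the tests or the regression.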

To check whether pulse rate differs significantly by sex, smoking status, or exercise habits, we run a separate test for each factor:

f_pulse <- survey.data %>% filter(Sex == "Female") %>% pull(Pulse)
m_pulse <- survey.data %>% filter(Sex == "Male") %>% pull(Pulse)

sex_result <- t.test(f_pulse, m_pulse)
tidy(sex_result)

## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1     1.48      74.8      73.3     0.828   0.409      164.    -2.04      4.99
## # ℹ 2 more variables: method <chr>, alternative <chr>

There was no statistically significant difference in mean pulse rate between females (74.8) and males (73.3) (Welch two-sample t-test, p = 0.41).

smoke_result <- aov(Pulse~Smoke, data=survey.data)
tidy(smoke_result)

## # A tibble: 2 × 6
##   term         df  sumsq meansq statistic p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
## 1 Smoke         3   288.   96.0     0.718   0.543
## 2 Residuals   164 21942.  134.     NA      NA

The one-way ANOVA shows no statistically significant difference in pulse rate across the smoking categories (F = 0.72, p = 0.54).

exer_result <- aov(Pulse~Exer, data=survey.data)
tidy(exer_result)

## # A tibble: 2 × 6
##   term         df  sumsq meansq statistic p.value
##   <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
## 1 Exer          2  1153.   577.      4.51  0.0123
## 2 Residuals   165 21077.   128.     NA    NA

A p-value of 0.012 (F = 4.51) indicates a statistically significant difference in mean pulse rate among people with different exercise habits in our dataset. The ANOVA alone does not show which groups differ from each other.
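A significant one-way ANOVA is commonly followed by a post-hoc comparison to see which pairs of groups differ. The sketch below is not part of the original analysis; it uses `TukeyHSD()` from base R's stats package on the same `aov` fit, and `exer_posthoc` is a name introduced here for illustration.

```r
# Self-contained sketch: pairwise comparisons after the Exer ANOVA
data("survey", package = "MASS")
survey.data <- na.omit(survey)

exer_result <- aov(Pulse ~ Exer, data = survey.data)

# Tukey's honest significant differences for all pairwise Exer comparisons
exer_posthoc <- TukeyHSD(exer_result, which = "Exer", conf.level = 0.95)
exer_posthoc

# Group means, for reading the direction of each difference
aggregate(Pulse ~ Exer, data = survey.data, FUN = mean)
```

Each row of the output is one pair of exercise groups, with the difference in means, a confidence interval, and a p-value adjusted for the three comparisons.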

my_model <- glm(Pulse~Age+Sex+Exer+Smoke, family = gaussian, data=survey.data)
summary(my_model)

## 
## Call:
## glm(formula = Pulse ~ Age + Sex + Exer + Smoke, family = gaussian, 
##     data = survey.data)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  81.7045     5.5271  14.783  < 2e-16 ***
## Age          -0.2282     0.1447  -1.577  0.11688    
## SexMale      -1.0043     1.7912  -0.561  0.57580    
## ExerNone      4.8843     3.2666   1.495  0.13683    
## ExerSome      5.0685     1.8604   2.724  0.00716 ** 
## SmokeNever   -5.5154     4.4013  -1.253  0.21199    
## SmokeOccas   -6.3903     5.3200  -1.201  0.23145    
## SmokeRegul   -1.3400     5.2701  -0.254  0.79961    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 127.4306)
## 
##     Null deviance: 22230  on 167  degrees of freedom
## Residual deviance: 20389  on 160  degrees of freedom
## AIC: 1301
## 
## Number of Fisher Scoring iterations: 2

Intercept = 81.7045. This is the predicted pulse for the reference profile: Age = 0, Sex = Female, Exer = Freq, and Smoke = Heavy. Because all respondents are at least 16 years old, Age = 0 lies outside the data, so the intercept is a baseline for the other coefficients rather than a meaningful prediction. Its p-value (< 2e-16) only shows that it differs from zero.

Age = -0.2282. Each additional year of age is associated with a pulse about 0.23 beats per minute lower on average, holding the other variables constant. The p-value (0.11688) indicates this effect is not statistically significant.

SexMale = -1.0043. Males have a pulse rate approximately 1 beat per minute lower than females. The p-value (0.57580) suggests this difference is not statistically significant.

ExerNone = 4.8843. Individuals who do not exercise have a pulse rate about 4.88 beats per minute higher than those in the Freq (frequent exercise) category, which is the reference level. The p-value (0.13683) indicates this effect is not statistically significant.

ExerSome = 5.0685. Individuals who exercise some have a pulse rate about 5.07 beats per minute higher than those in the Freq category. The p-value (0.00716) shows this effect is statistically significant.

Smoke. None of the smoking coefficients, each of which compares a group with the Heavy reference level, was statistically significant. The model therefore provides no evidence that smoking status is associated with pulse rate, which is weaker than showing that smoking has no effect.
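The reference category for each factor is simply its first level, which can be checked directly. The sketch below also shows, as a hypothetical variation not used in this report, how `relevel()` changes the smoking baseline to never-smokers; `survey.releveled` and `releveled_model` are names introduced for illustration.

```r
# Self-contained sketch: inspecting and changing factor reference levels
data("survey", package = "MASS")
survey.data <- na.omit(survey)

# The first level of each factor is the reference category in the model
levels(survey.data$Sex)    # "Female" "Male"
levels(survey.data$Exer)   # "Freq" "None" "Some"
levels(survey.data$Smoke)  # "Heavy" "Never" "Occas" "Regul"

# Hypothetical re-parameterisation: make never-smokers the baseline
survey.releveled <- survey.data
survey.releveled$Smoke <- relevel(survey.releveled$Smoke, ref = "Never")

releveled_model <- glm(Pulse ~ Age + Sex + Exer + Smoke,
                       family = gaussian, data = survey.releveled)

# Smoke coefficients are now comparisons with "Never"
names(coef(releveled_model))
```

Changing the baseline does not change the fitted values or the overall fit. It only changes which comparisons the individual coefficients report.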

Plotting residuals:

plot(my_model, which=1)

plot(my_model, which=2)

# Collect fitted values and residuals in one data frame for plotting
# (this also avoids masking the residuals() function with a variable)
model_diag <- data.frame(fitted = fitted(my_model),
                         resid = residuals(my_model))

ggplot(data = model_diag, aes(x = fitted, y = resid)) +
  geom_point(color = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()

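As the plots show, the residuals are clustered rather than scattered randomly around zero. Some grouping may simply reflect that most predictors are categorical, so the fitted values concentrate around a limited set of values. However, if the spread of the residuals differs across the fitted values, the constant-variance assumption of the linear model is violated.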
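A visual impression of unequal spread can be backed up with a simple numeric check. The sketch below is one possible approach, not the method used in this report: it compares the residual standard deviation across exercise groups and applies `bartlett.test()` from base R. The names `resid_sd` and `variance_test` are introduced for illustration.

```r
# Self-contained sketch: is the residual spread similar across Exer groups?
data("survey", package = "MASS")
survey.data <- na.omit(survey)

my_model <- glm(Pulse ~ Age + Sex + Exer + Smoke,
                family = gaussian, data = survey.data)

# Residual standard deviation within each exercise group
resid_sd <- tapply(residuals(my_model), survey.data$Exer, sd)
resid_sd

# Bartlett's test of equal residual variance across exercise groups
# (sensitive to non-normality, so treat the p-value as a rough guide)
variance_test <- bartlett.test(residuals(my_model), survey.data$Exer)
variance_test$p.value
```

Similar standard deviations and a large p-value would suggest that the clustering in the plot comes from the categorical predictors rather than from unequal variance.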

aic_value <- AIC(my_model)
aic_value

## [1] 1300.959

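The AIC is 1300.96. AIC has no absolute scale: it is only meaningful when comparing models fitted to the same data, so this value alone does not show whether the model fits well. A more direct sign of weak fit is the deviance. The residual deviance (20389) is only about 8% lower than the null deviance (22230), so the predictors explain little of the variation in pulse rate.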
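Because AIC is a relative measure, it is most useful when placed next to the AIC of competing models fitted to the same rows. The sketch below compares the full model with an intercept-only model and an exercise-only model. Both comparison models (`null_model`, `exer_model`) are illustrative choices that do not appear in the original analysis.

```r
# Self-contained sketch: putting the AIC in context
data("survey", package = "MASS")
survey.data <- na.omit(survey)

my_model <- glm(Pulse ~ Age + Sex + Exer + Smoke,
                family = gaussian, data = survey.data)

# Intercept-only baseline and a reduced model (both illustrative choices)
null_model <- glm(Pulse ~ 1, family = gaussian, data = survey.data)
exer_model <- glm(Pulse ~ Exer, family = gaussian, data = survey.data)

# Lower AIC is better, but only relative to models fitted to the same rows
AIC(null_model, exer_model, my_model)

# Proportion of deviance explained (equals R-squared for a gaussian model)
dev_explained <- 1 - my_model$deviance / my_model$null.deviance
dev_explained
```

If the full model's AIC is not clearly lower than that of the simpler models, the extra predictors are not earning their added complexity.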

The analysis explored the relationship between pulse rate and four predictors: age, sex, exercise level, and smoking status. Exercise level was the only predictor with a significant effect. Individuals who exercise some had a significantly higher pulse rate than frequent exercisers (p = 0.007). No significant association was observed for age, sex, or smoking status, so this dataset provides no evidence that these factors affect pulse rate.

Despite these findings, the analysis has several limitations. The residual plot suggested potential heteroscedasticity, indicating that the model may not fully meet the assumptions of linear regression. Confounding variables not included in the model, such as overall health status or other lifestyle factors, may have influenced the results. To improve the analysis, we could explore interaction terms, try alternative model specifications, or add predictors to better capture the factors affecting pulse rate.


2024-12-03
