library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 3.5.1 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(naniar)
library(readr)
library(ggplot2)
library(broom)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(tidyr)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
Loading the survey dataset from the MASS package:
data("survey")
head(survey)
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 1 Female 18.5 18.0 Right R on L 92 Left Some Never 173.00 Metric
## 2 Male 19.5 20.5 Left R on L 104 Left None Regul 177.80 Imperial
## 3 Male 18.0 13.3 Right L on R 87 Neither None Occas NA <NA>
## 4 Male 18.8 18.9 Right R on L NA Neither None Never 160.00 Metric
## 5 Male 20.0 20.0 Right Neither 35 Right Some Never 165.00 Metric
## 6 Female 18.0 17.7 Right L on R 64 Right Some Never 172.72 Imperial
## Age
## 1 18.250
## 2 17.583
## 3 16.917
## 4 20.333
## 5 23.667
## 6 21.000
summary(survey)
## Sex Wr.Hnd NW.Hnd W.Hnd Fold
## Female:118 Min. :13.00 Min. :12.50 Left : 18 L on R : 99
## Male :118 1st Qu.:17.50 1st Qu.:17.50 Right:218 Neither: 18
## NA's : 1 Median :18.50 Median :18.50 NA's : 1 R on L :120
## Mean :18.67 Mean :18.58
## 3rd Qu.:19.80 3rd Qu.:19.73
## Max. :23.20 Max. :23.50
## NA's :1 NA's :1
## Pulse Clap Exer Smoke Height
## Min. : 35.00 Left : 39 Freq:115 Heavy: 11 Min. :150.0
## 1st Qu.: 66.00 Neither: 50 None: 24 Never:189 1st Qu.:165.0
## Median : 72.50 Right :147 Some: 98 Occas: 19 Median :171.0
## Mean : 74.15 NA's : 1 Regul: 17 Mean :172.4
## 3rd Qu.: 80.00 NA's : 1 3rd Qu.:180.0
## Max. :104.00 Max. :200.0
## NA's :45 NA's :28
## M.I Age
## Imperial: 68 Min. :16.75
## Metric :141 1st Qu.:17.67
## NA's : 28 Median :18.58
## Mean :20.37
## 3rd Qu.:20.17
## Max. :73.00
##
pct_complete(survey)
## [1] 96.23769
The mean pulse rate in the dataset is 74.15 beats per minute, close to the median of 72.5, so the mean is not strongly affected by outliers. About 96% of the cells in the dataset are complete, and pulse rate has the most missing values (45). Because the remaining sample is still reasonably large, I cleaned the data by removing every row that has a missing value in any column, which leaves 168 of the original 237 rows.
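The missing-data summary above can be broken down further with base R. The following is a minimal sketch: the names `na_per_column`, `analysis_vars`, and `survey_subset` are introduced here for illustration only, and the last step shows a hypothetical alternative (dropping rows only when a variable used in the later tests and model is missing) that is not what the analysis in this document does.

```r
library(MASS)   # provides the survey data set
data("survey")

# Missing values per column. pct_complete() reports the share of complete
# cells, which hides how the gaps are spread across variables.
na_per_column <- colSums(is.na(survey))
na_per_column

# Rows without any missing value; these are the rows na.omit() keeps.
n_total    <- nrow(survey)
n_complete <- sum(complete.cases(survey))
c(total = n_total, complete = n_complete, dropped = n_total - n_complete)

# Hypothetical alternative (not used in this analysis): keep a row whenever
# the variables needed for the tests and the model are all present.
analysis_vars <- c("Pulse", "Age", "Sex", "Exer", "Smoke")
survey_subset <- survey[complete.cases(survey[, analysis_vars]), analysis_vars]
nrow(survey_subset)
```

Restricting the completeness check to the modelled variables would retain more rows, at the cost of a sample that differs from the one analysed here.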
survey.data <- na.omit(survey)
head(survey.data)
## Sex Wr.Hnd NW.Hnd W.Hnd Fold Pulse Clap Exer Smoke Height M.I
## 1 Female 18.5 18.0 Right R on L 92 Left Some Never 173.00 Metric
## 2 Male 19.5 20.5 Left R on L 104 Left None Regul 177.80 Imperial
## 5 Male 20.0 20.0 Right Neither 35 Right Some Never 165.00 Metric
## 6 Female 18.0 17.7 Right L on R 64 Right Some Never 172.72 Imperial
## 7 Male 17.7 17.7 Right L on R 83 Right Freq Never 182.88 Imperial
## 8 Female 17.0 17.3 Right R on L 74 Right Freq Never 157.00 Metric
## Age
## 1 18.250
## 2 17.583
## 5 23.667
## 6 21.000
## 7 18.833
## 8 35.833
To see whether pulse rate differs significantly by sex, smoking status, or exercise habits, we conduct several tests:
f_pulse <- survey.data %>% filter(Sex == "Female") %>% pull(Pulse)
m_pulse <- survey.data %>% filter(Sex == "Male") %>% pull(Pulse)
sex_result <- t.test(f_pulse, m_pulse)
tidy(sex_result)
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.48 74.8 73.3 0.828 0.409 164. -2.04 4.99
## # ℹ 2 more variables: method <chr>, alternative <chr>
There was no statistically significant difference in mean pulse rate between females and males (Welch two-sample t-test, p = 0.41).
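As a cross-check, the same comparison can be written with the formula interface of `t.test()`, which avoids extracting the two vectors by hand. This is a sketch only; `welch_formula` and `welch_vectors` are names chosen for this example. By default `t.test()` does not assume equal variances, so both calls run the Welch version of the test.

```r
library(MASS)
data("survey")
survey.data <- na.omit(survey)

# Formula interface: Pulse split by the two levels of Sex (Female, Male)
welch_formula <- t.test(Pulse ~ Sex, data = survey.data)

# Vector interface, equivalent to the call used in the text
f_pulse <- survey.data$Pulse[survey.data$Sex == "Female"]
m_pulse <- survey.data$Pulse[survey.data$Sex == "Male"]
welch_vectors <- t.test(f_pulse, m_pulse)

# Both calls should report the same test statistic and p-value
c(formula = welch_formula$p.value, vectors = welch_vectors$p.value)
```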
smoke_result <- aov(Pulse~Smoke, data=survey.data)
tidy(smoke_result)
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Smoke 3 288. 96.0 0.718 0.543
## 2 Residuals 164 21942. 134. NA NA
The one-way ANOVA shows no statistically significant difference in mean pulse rate across the smoking categories (p = 0.54).
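Because the smoking groups differ a lot in size, it can help to look at the group sizes and spreads next to the ANOVA result. The sketch below uses base R only; `smoke_summary` and `smoke_kw` are illustrative names, and the Kruskal-Wallis test is offered as an optional rank-based check, not as part of the original analysis.

```r
library(MASS)
data("survey")
survey.data <- na.omit(survey)

# Size, mean, and standard deviation of Pulse within each smoking category
smoke_summary <- data.frame(
  n    = as.vector(table(survey.data$Smoke)),
  mean = as.vector(tapply(survey.data$Pulse, survey.data$Smoke, mean)),
  sd   = as.vector(tapply(survey.data$Pulse, survey.data$Smoke, sd)),
  row.names = levels(survey.data$Smoke)
)
smoke_summary

# Rank-based alternative to one-way ANOVA, which does not rely on normality
smoke_kw <- kruskal.test(Pulse ~ Smoke, data = survey.data)
smoke_kw
```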
exer_result <- aov(Pulse~Exer, data=survey.data)
tidy(exer_result)
## # A tibble: 2 × 6
## term df sumsq meansq statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Exer 2 1153. 577. 4.51 0.0123
## 2 Residuals 165 21077. 128. NA NA
The p-value of 0.012 indicates a statistically significant difference in mean pulse rate among people with different exercise habits in our dataset. The ANOVA alone does not show which groups differ from each other.
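A significant one-way ANOVA says only that at least one group mean differs. One common follow-up is Tukey's honest significant difference procedure, sketched here with `TukeyHSD()` from the stats package; `exer_fit` and `exer_tukey` are names introduced for this example.

```r
library(MASS)
data("survey")
survey.data <- na.omit(survey)

# Refit the one-way ANOVA of Pulse on exercise frequency
exer_fit <- aov(Pulse ~ Exer, data = survey.data)

# Pairwise differences between exercise groups with family-wise
# 95% confidence intervals and adjusted p-values
exer_tukey <- TukeyHSD(exer_fit, "Exer")
exer_tukey
```

Each row of the result compares two exercise levels; an interval that excludes zero points to the pair responsible for the overall effect.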
my_model <- glm(Pulse~Age+Sex+Exer+Smoke, family = gaussian, data=survey.data)
summary(my_model)
##
## Call:
## glm(formula = Pulse ~ Age + Sex + Exer + Smoke, family = gaussian,
## data = survey.data)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.7045 5.5271 14.783 < 2e-16 ***
## Age -0.2282 0.1447 -1.577 0.11688
## SexMale -1.0043 1.7912 -0.561 0.57580
## ExerNone 4.8843 3.2666 1.495 0.13683
## ExerSome 5.0685 1.8604 2.724 0.00716 **
## SmokeNever -5.5154 4.4013 -1.253 0.21199
## SmokeOccas -6.3903 5.3200 -1.201 0.23145
## SmokeRegul -1.3400 5.2701 -0.254 0.79961
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 127.4306)
##
## Null deviance: 22230 on 167 degrees of freedom
## Residual deviance: 20389 on 160 degrees of freedom
## AIC: 1301
##
## Number of Fisher Scoring iterations: 2
Intercept: estimate = 81.7045. This is the predicted pulse when Age is 0 and every categorical predictor is at its reference level (female, Freq exercise, Heavy smoker). Its p-value (< 2e-16) shows only that the intercept differs from zero.
Age: estimate = -0.2282. For each additional year of age, pulse decreases by 0.2282 beats per minute on average, holding the other variables constant. However, the p-value (0.11688) indicates this effect is not statistically significant.
SexMale: estimate = -1.0043. Males have a pulse rate approximately 1 beat per minute lower than females. The p-value (0.57580) suggests this difference is not statistically significant.
ExerNone: estimate = 4.8843. Individuals who do not exercise have a pulse rate about 4.88 beats per minute higher than those in the Freq (frequent exercise) reference category. The p-value (0.13683) indicates this effect is not statistically significant.
ExerSome: estimate = 5.0685. Individuals who exercise some have a pulse rate about 5.07 beats per minute higher than those in the Freq reference category. The p-value (0.00716) shows this effect is statistically significant.
Smoke: none of the estimates for the smoking groups, each compared with the Heavy reference category, was statistically significant, so the model provides no evidence that smoking status is associated with pulse rate.
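Each coefficient of a factor is a contrast against that factor's first level, so it is worth confirming the reference levels before interpreting the table. The sketch below prints them and then refits the model with a different baseline for `Exer`; `survey.releveled` and `refit` are names made up for this example, and choosing `"None"` as the new baseline is only an illustration.

```r
library(MASS)
data("survey")
survey.data <- na.omit(survey)

# The first level of each factor is the reference category in the model
levels(survey.data$Sex)
levels(survey.data$Exer)
levels(survey.data$Smoke)

my_model <- glm(Pulse ~ Age + Sex + Exer + Smoke,
                family = gaussian, data = survey.data)

# Illustration: make "None" the baseline, so the Exer coefficients compare
# frequent and some exercise against no exercise
survey.releveled <- survey.data
survey.releveled$Exer <- relevel(survey.releveled$Exer, ref = "None")
refit <- glm(Pulse ~ Age + Sex + Exer + Smoke,
             family = gaussian, data = survey.releveled)
coef(summary(refit))
```

Releveling changes which contrasts are reported but not the fit itself: the residual deviance and the fitted values are the same in both models.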
Plotting residuals:
plot(my_model, which=1)
plot(my_model, which=2)
# Named model_residuals so the variable does not shadow the residuals() function
model_residuals <- residuals(my_model)
fitted_values <- fitted(my_model)
ggplot(data = survey.data, aes(x = fitted_values, y = model_residuals)) +
  geom_point(color = "blue") +
  geom_hline(yintercept = 0, linetype = "dashed", color = "red") +
  labs(title = "Residuals vs Fitted Values",
       x = "Fitted Values",
       y = "Residuals") +
  theme_minimal()
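As the plots show, the residuals form clusters rather than scattering
randomly around zero. Some grouping of the fitted values is expected
because three of the four predictors are categorical, but the pattern
may also indicate that the assumptions of the linear model are violated.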
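The visual impression from the residual plots can be complemented with simple numeric checks. This is a rough sketch using only the stats package: the Shapiro-Wilk test addresses normality of the residuals, and Bartlett's test compares residual variance across exercise groups. The names `model_resid`, `resid_sd_by_exer`, `normality_check`, and `variance_check` are introduced for this example, and grouping by `Exer` is one choice among several.

```r
library(MASS)
data("survey")
survey.data <- na.omit(survey)

my_model <- glm(Pulse ~ Age + Sex + Exer + Smoke,
                family = gaussian, data = survey.data)
model_resid <- residuals(my_model)

# Residual spread within each exercise group; very different values
# would hint at non-constant variance
resid_sd_by_exer <- tapply(model_resid, survey.data$Exer, sd)
resid_sd_by_exer

# Shapiro-Wilk test of the normality assumption for the residuals
normality_check <- shapiro.test(model_resid)
normality_check

# Bartlett's test of equal residual variance across exercise groups
variance_check <- bartlett.test(model_resid, survey.data$Exer)
variance_check
```

Small p-values in either test would support the concern raised by the plots. Bartlett's test is itself sensitive to non-normality, so the two results are best read together.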
aic_value <- AIC(my_model)
aic_value
## [1] 1300.959
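The AIC of this model is about 1301. AIC has no absolute scale, so this number cannot tell us on its own whether the model fits well. It becomes informative only when compared with the AIC of alternative models fitted to the same data, where a lower value indicates a better trade-off between fit and complexity.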
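To put the AIC in context, it can be compared across candidate models fitted to the same rows. The sketch below contrasts an intercept-only model, an exercise-only model, and the full model; `null_model`, `exer_model`, `full_model`, `aic_table`, and `term_tests` are names chosen for this example, and the set of candidates is an assumption rather than part of the original analysis.

```r
library(MASS)
data("survey")
survey.data <- na.omit(survey)

# Three nested candidates, all fitted to the same complete-case data
null_model <- glm(Pulse ~ 1, family = gaussian, data = survey.data)
exer_model <- glm(Pulse ~ Exer, family = gaussian, data = survey.data)
full_model <- glm(Pulse ~ Age + Sex + Exer + Smoke,
                  family = gaussian, data = survey.data)

# Lower AIC is better; only differences between rows are meaningful
aic_table <- AIC(null_model, exer_model, full_model)
aic_table

# Effect of dropping each term from the full model: one F test per term,
# which tests all levels of a factor jointly, plus the resulting AIC
term_tests <- drop1(full_model, test = "F")
term_tests
```

The `drop1()` table is also a convenient way to test a factor such as `Smoke` as a whole, instead of reading its individual contrasts against the reference level one by one.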
The analysis explored the relationship between pulse rate and four predictors: age, sex, exercise level, and smoking status. Exercise level had a significant effect on pulse rate: individuals who exercise some had a significantly higher pulse rate than frequent exercisers (p = 0.007). No significant association was observed for age, sex, or smoking status, so this dataset provides no evidence that these factors affect pulse rate. This is not the same as showing that they have no effect.
There are several limitations to the analysis. The residual plots showed clustering and possible heteroscedasticity, indicating that the model may not fully meet the assumptions of linear regression. Confounding variables not included in the model, such as overall health status or other lifestyle factors, may have influenced the results. Removing every row with a missing value also reduced the sample from 237 to 168 observations. To improve on these findings, we could explore interaction terms, alternative model specifications, or additional predictors to better capture the factors affecting pulse rate.
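As one example of the extensions mentioned above, an interaction term can be added with `update()` and compared against the main-effects model. This is a sketch under assumptions: the choice of the `Sex:Exer` interaction is purely illustrative, and `main_fit` and `inter_fit` are names introduced here.

```r
library(MASS)
data("survey")
survey.data <- na.omit(survey)

# Main-effects model, as fitted in the analysis
main_fit <- glm(Pulse ~ Age + Sex + Exer + Smoke,
                family = gaussian, data = survey.data)

# Illustrative extension: let the exercise effect differ by sex
inter_fit <- update(main_fit, . ~ . + Sex:Exer)

# F test for the added interaction terms, and AIC for both models
anova(main_fit, inter_fit, test = "F")
AIC(main_fit, inter_fit)
```

If neither the F test nor the AIC favours the larger model, the simpler main-effects specification is the more defensible choice.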