DASC 531 Assignment 3 Human Body Data Platelet Analysis

Load Dataset

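library(ggplot2); library(ggformula); library(dplyr); library(effectsize); library(car)  # packages used by the code below
hbody <- read.csv("~/Desktop/Desktop - Morgan’s MacBook Pro (2)/R data sets/hbody.csv")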

Data Preparation

colnames(hbody)[2] <- 'GENDER'
hbody$GENDER <- factor(hbody$GENDER, levels = c(1, 0), labels = c('M', 'F'))
levels(hbody$GENDER)

## [1] "M" "F"

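# Bin BMI into (15,30] = low, (30,45] = med, (45,60] = high; BMI_CAT is a standalone vector, not a column of hbody
BMI_CAT <- cut(hbody$BMI, breaks = c(15, 30, 45, 60), labels = c('low', 'med', 'high'))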
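To make the category boundaries explicit, here is a small base-R check of how `cut()` bins values with these breaks. The BMI values are made up for illustration and chosen to sit on or near the breaks; by default the intervals are open on the left and closed on the right.

```r
# Illustrative BMI values (hypothetical, chosen to sit on or near the breaks)
bmi_example <- c(15, 15.9, 30, 30.1, 45, 59)

bmi_cat_example <- cut(bmi_example,
                       breaks = c(15, 30, 45, 60),
                       labels = c("low", "med", "high"))

# Default is right = TRUE: (15,30] -> low, (30,45] -> med, (45,60] -> high.
# A value equal to the lowest break (15) falls outside every interval and becomes NA.
data.frame(bmi = bmi_example, category = bmi_cat_example)
```

A BMI of exactly 30 is therefore "low" and a BMI of exactly 45 is "med". Passing `include.lowest = TRUE` would be needed if a value of exactly 15 had to be kept.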

Overview of Data

str(hbody)

## 'data.frame':    300 obs. of  15 variables:
##  $ AGE      : int  43 57 38 80 34 77 29 69 44 35 ...
##  $ GENDER   : Factor w/ 2 levels "M","F": 2 1 2 1 1 1 1 2 2 1 ...
##  $ PULSE    : int  80 84 94 74 50 60 52 58 66 62 ...
##  $ SYSTOLIC : int  100 112 134 126 114 134 118 138 114 124 ...
##  $ DIASTOLIC: int  70 70 94 64 68 60 64 80 66 70 ...
##  $ HDL      : int  73 35 36 37 50 55 53 40 45 62 ...
##  $ LDL      : int  68 116 223 83 104 75 128 140 136 110 ...
##  $ WHITE    : num  8.7 4.9 6.9 7.5 6.1 5.7 4.1 8.1 8 5.6 ...
##  $ RED      : num  4.8 4.73 4.47 4.32 4.95 3.95 4.68 4.6 4.09 5.47 ...
##  $ PLATE    : int  319 187 297 170 140 192 191 286 263 193 ...
##  $ WEIGHT   : num  98.6 96.9 108.2 73.1 83.1 ...
##  $ HEIGHT   : num  172 186 154 160 179 ...
##  $ WAIST    : num  120.4 107.8 120.3 97.2 95.1 ...
##  $ ARM.CIRC : num  40.7 37 44.3 30.3 34 31.4 27.4 34.2 32.5 40 ...
##  $ BMI      : num  33.3 28 45.4 28.4 25.9 31.1 20.1 32.7 25.8 36.5 ...

summary(hbody)

##       AGE        GENDER      PULSE           SYSTOLIC     DIASTOLIC     
##  Min.   :18.00   M:153   Min.   : 36.00   Min.   : 88   Min.   : 40.00  
##  1st Qu.:31.00   F:147   1st Qu.: 64.00   1st Qu.:112   1st Qu.: 64.00  
##  Median :46.00           Median : 72.00   Median :121   Median : 70.00  
##  Mean   :47.04           Mean   : 71.77   Mean   :123   Mean   : 70.75  
##  3rd Qu.:62.00           3rd Qu.: 80.00   3rd Qu.:132   3rd Qu.: 78.00  
##  Max.   :80.00           Max.   :104.00   Max.   :186   Max.   :102.00  
##       HDL              LDL            WHITE             RED       
##  Min.   : 26.00   Min.   : 39.0   Min.   : 2.700   Min.   :3.390  
##  1st Qu.: 43.00   1st Qu.: 85.0   1st Qu.: 5.200   1st Qu.:4.197  
##  Median : 52.00   Median :113.0   Median : 6.200   Median :4.490  
##  Mean   : 53.66   Mean   :113.7   Mean   : 6.542   Mean   :4.538  
##  3rd Qu.: 62.00   3rd Qu.:137.2   3rd Qu.: 7.825   3rd Qu.:4.883  
##  Max.   :138.00   Max.   :251.0   Max.   :14.300   Max.   :6.340  
##      PLATE           WEIGHT           HEIGHT          WAIST       
##  Min.   : 75.0   Min.   : 39.00   Min.   :134.5   Min.   : 64.40  
##  1st Qu.:198.0   1st Qu.: 67.08   1st Qu.:161.6   1st Qu.: 87.88  
##  Median :232.0   Median : 80.50   Median :168.3   Median : 96.95  
##  Mean   :239.4   Mean   : 81.66   Mean   :168.0   Mean   : 99.18  
##  3rd Qu.:263.5   3rd Qu.: 92.80   3rd Qu.:174.6   3rd Qu.:109.10  
##  Max.   :646.0   Max.   :150.40   Max.   :193.3   Max.   :170.50  
##     ARM.CIRC          BMI       
##  Min.   :20.50   Min.   :15.90  
##  1st Qu.:29.48   1st Qu.:24.50  
##  Median :33.05   Median :28.00  
##  Mean   :33.08   Mean   :28.91  
##  3rd Qu.:36.33   3rd Qu.:31.98  
##  Max.   :46.60   Max.   :59.00

Histogram

gf_dhistogram(~ PLATE, data = hbody, bins = 30, color = "black", fill = "lightsteelblue3", alpha = 0.7) %>%
  gf_density(~ PLATE, data = hbody, color = "red", linewidth = 1) %>%
  gf_labs(
    title = "Distribution of Platelet Counts",
    x = "Platelet Counts",
    y = "Density"
  )

Histogram Version 2

The same plot built directly with ggplot2. The density curve only lines up with the histogram when the bars are on the density scale as well: gf_histogram() plots counts, so the ggformula version needs gf_dhistogram(), and the ggplot2 version maps y to the density statistic.

ggplot(hbody, aes(x = PLATE)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, color = "black", fill = "lightsteelblue3") +
  geom_density(color = "red", linewidth = 1) +
  ggtitle("Distribution of Platelet Counts") +
  theme_minimal()

QQ plot

gf_qq(~ PLATE, data = hbody, color = "black", size = 2) %>%
  gf_qqline(~ PLATE, data = hbody, color = "red", linewidth = 1, linetype = "solid") %>%
  gf_labs(
    title = "Human Body Platelet Data",
    x = "Theoretical Quantiles",
    y = "Sample Quantiles of Platelets"
  )

Shapiro-Wilk test

shapiro_test <- shapiro.test(hbody$PLATE)
shapiro_test

## 
##  Shapiro-Wilk normality test
## 
## data:  hbody$PLATE
## W = 0.89367, p-value = 1.2e-13

Calculate Mean and Standard Deviation by Gender

platelet_summary <- hbody %>%
  group_by(GENDER) %>%
  summarise(mean_platelets = mean(PLATE, na.rm = TRUE),
            sd_platelets = sd(PLATE, na.rm = TRUE))
platelet_summary

## # A tibble: 2 × 3
##   GENDER mean_platelets sd_platelets
##   <fct>           <dbl>        <dbl>
## 1 M                224.         59.5
## 2 F                255.         65.4

Boxplot of Platelets by Gender

ggplot(hbody, aes(x = GENDER, y = PLATE, fill = GENDER)) +
  geom_boxplot() +
  labs(title = "Boxplot of Platelet Counts by Gender",
       x = "Gender",
       y = "Platelet Count") +
  theme_minimal() +
  scale_fill_manual(values = c("lightskyblue", "hotpink"))

Hypothesis Test: Male Platelets vs Female Platelets

# Separate male and female platelets
male_platelets <- hbody %>% filter(GENDER == "M") %>% pull(PLATE)
female_platelets <- hbody %>% filter(GENDER == "F") %>% pull(PLATE)

# Conduct one-tailed, pooled-variance t-test (H1: male mean < female mean)
t_test_result <- t.test(male_platelets, female_platelets, alternative = "less", var.equal = TRUE)
t_test_result

## 
##  Two Sample t-test
## 
## data:  male_platelets and female_platelets
## t = -4.2722, df = 298, p-value = 1.303e-05
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
##       -Inf -18.91313
## sample estimates:
## mean of x mean of y 
##  224.2745  255.0884

# Calculate Cohen's d
cohens_d_result <- cohens_d(male_platelets, female_platelets)
cohens_d_result

## Cohen's d |         95% CI
## --------------------------
## -0.49     | [-0.72, -0.26]
## 
## - Estimated using pooled SD.
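As a cross-check, the t statistic, the one-sided p-value, the upper confidence bound, and Cohen's d can be approximately reproduced from the summary statistics alone. This is a sketch, assuming the group sizes from `summary(hbody)` (153 M, 147 F), the means from the t-test output, and the standard deviations as rounded in the tibble, so the results agree with the reported values only to about two decimal places. The variable names are illustrative.

```r
# Summary statistics copied from the output above (SDs are rounded to one decimal)
n_m <- 153; mean_m <- 224.2745; sd_m <- 59.5
n_f <- 147; mean_f <- 255.0884; sd_f <- 65.4

df_pooled <- n_m + n_f - 2
# Pooled standard deviation, as used when var.equal = TRUE
sd_pooled <- sqrt(((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) / df_pooled)

mean_diff <- mean_m - mean_f
se_diff <- sd_pooled * sqrt(1 / n_m + 1 / n_f)

t_stat <- mean_diff / se_diff                          # reported: -4.2722
p_less <- pt(t_stat, df_pooled)                        # one-sided, alternative = "less"
ci_upper <- mean_diff + qt(0.95, df_pooled) * se_diff  # reported: -18.91313
d_pooled <- mean_diff / sd_pooled                      # reported: -0.49

round(c(t = t_stat, ci_upper = ci_upper, d = d_pooled), 2)
p_less
```

Because Cohen's d here is the mean difference divided by the pooled standard deviation, a value near -0.49 says the male mean sits about half a pooled standard deviation below the female mean.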

Scatter Plot: Platelets vs Red Blood Cells by BMI Category

ggplot(hbody, aes(x = RED, y = PLATE, color = BMI_CAT)) +
  geom_point(alpha = 0.7, size = 3) +
  scale_color_manual(values = c("grey60", "firebrick1", "dodgerblue2")) +
  labs(title = "Scatter Plot of Platelets vs. Red Blood Cells by BMI Category",
       x = "Red Blood Cells",
       y = "Platelet Count") +
  theme_minimal()

ANOVA: Comparing Platelet Counts Across BMI Categories

# Run ANOVA
anova_result <- aov(PLATE ~ BMI_CAT, data = hbody)
summary(anova_result)

##              Df  Sum Sq Mean Sq F value Pr(>F)  
## BMI_CAT       2   27105   13553   3.337 0.0369 *
## Residuals   297 1206321    4062                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
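The F statistic and p-value follow directly from the sums of squares in the table, and the same numbers give an effect size. The sketch below recomputes them in base R. Eta squared is not part of the original output; it is added here as a supplementary measure, computed as the between-group share of the total sum of squares.

```r
# Values copied from the ANOVA table above
ss_between <- 27105;   df_between <- 2
ss_within  <- 1206321; df_within  <- 297

ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

f_stat <- ms_between / ms_within                                  # reported: 3.337
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)  # reported: 0.0369

# Eta squared: proportion of total variation in PLATE associated with BMI_CAT
eta_sq <- ss_between / (ss_between + ss_within)

round(c(F = f_stat, p = p_value, eta_sq = eta_sq), 4)
```

An eta squared of about 0.02 means BMI category accounts for roughly 2% of the variation in platelet counts, so the statistically significant result corresponds to a small effect.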

# Check for homogeneity of variances using Levene's test
levene_test <- leveneTest(PLATE ~ BMI_CAT, data = hbody)
levene_test

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  0.7021 0.4964
##       297
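For readers wondering what `center = median` means in this output: with that setting, Levene's test is a one-way ANOVA on the absolute deviations of each observation from its group median. A minimal base-R sketch on simulated data (not the hbody data; the group labels, sizes, and distribution are made up) shows the idea.

```r
set.seed(531)  # arbitrary seed so the simulated example is reproducible

# Simulated stand-in data: three groups of 100 with equal spread
sim <- data.frame(
  group = factor(rep(c("low", "med", "high"), each = 100)),
  value = rnorm(300, mean = 240, sd = 60)
)

# Absolute deviation of each value from its own group's median
sim$abs_dev <- abs(sim$value - ave(sim$value, sim$group, FUN = median))

# One-way ANOVA on those deviations: a large F would indicate unequal spread
levene_by_hand <- anova(lm(abs_dev ~ group, data = sim))
levene_by_hand
```

The degrees of freedom (2 and 297) match the layout of the Levene table above because the simulated data use three groups and 300 observations; the F value and p-value will differ since the data are not the same.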

# Post-hoc analysis with Tukey HSD if ANOVA is significant
tukey_result <- TukeyHSD(anova_result)
tukey_result

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = PLATE ~ BMI_CAT, data = hbody)
## 
## $BMI_CAT
##               diff        lwr       upr     p adj
## med-low   2.394995 -16.025465  20.81545 0.9496203
## high-low 53.429947   4.704796 102.15510 0.0276543
## high-med 51.034951   1.311474 100.75843 0.0427441

Conclusion

The platelet counts do not appear to be normally distributed; the distribution is positively (right) skewed. The summary statistics already hint at this, since the mean (239.4) exceeds the median (232.0) and the maximum (646) lies far above the third quartile (263.5). The histograms and the QQ plot show the same pattern, and the Shapiro-Wilk test agrees: W = 0.894 with p = 1.2e-13, well below 0.05, so we reject the null hypothesis that the data come from a normal distribution.

Split by gender, the mean platelet count is lower for men (224.3, SD 59.5) than for women (255.1, SD 65.4), and the box plot shows the same ordering for the medians. This is consistent with the claim that men tend to have lower platelet counts than women.

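A one-tailed, pooled-variance two-sample t-test supports this claim: t = -4.27 on 298 degrees of freedom, p = 1.3e-05, so we reject the null hypothesis of equal means in favor of a lower mean for males. Cohen's d is -0.49 (95% CI -0.72 to -0.26), a difference of about half a pooled standard deviation.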

The ANOVA gives F(2, 297) = 3.34 with p = 0.0369, so at the 0.05 level we reject the null hypothesis that the mean platelet count is the same across all BMI categories. Because the data are observational, this shows an association between BMI category and platelet count, not a causal effect. Levene's test (p = 0.4964) gives no evidence against equal variances, so that ANOVA assumption looks reasonable. The Tukey HSD comparisons show a significantly higher mean platelet count in the high BMI category than in the low category (difference 53.4, adjusted p = 0.028) and the medium category (difference 51.0, adjusted p = 0.043), with no significant difference between low and medium (difference 2.4, adjusted p = 0.95). In this sample, individuals in the high BMI category (BMI above 45) tend to have higher platelet counts than those in the low and medium categories.