Smart Pill - Jason Laucel

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# ggplot2 and dplyr are already attached by library(tidyverse) above
setwd("/Users/jasonlaucel/Data 110 Folder")

The dataset for my project is about a “Smart Pill” and its ability to record real-time information on gastric emptying, bowel transit time, and total intestinal transit time. The data come from a group of ill trauma patients and a group of healthy patients. The goal of the dataset is to compare bowel transit across patients and their demographics. The variables I plan to use are Weight, Gender, Age, Group, and Small Bowel Mean pH (SB.Mean.pH). I chose these variables to see whether weight, gender, and age, within the healthy group, are correlated with Small Bowel Mean pH. This is what my data visualization will explore. The source for the dataset is: https://www.causeweb.org/tshs/smart-pill/

# Load the dataset
data <- read.csv("/Users/jasonlaucel/Data 110 Folder/SmartPill3.csv")

str(data)

'data.frame':   95 obs. of  22 variables:
 $ Group                  : int  0 0 0 0 0 0 0 0 1 1 ...
 $ Gender                 : int  1 1 1 1 0 1 1 0 1 0 ...
 $ Race                   : int  NA NA NA NA NA NA NA NA 1 1 ...
 $ Height                 : num  183 180 180 175 152 ...
 $ Weight                 : num  102.1 102.1 68 69.9 44.9 ...
 $ Age                    : int  25 39 44 53 57 43 38 23 21 24 ...
 $ GE.Time                : num  74.3 73.3 4.3 NA 13.9 23.3 7.5 5.6 2.73 5.02 ...
 $ SB.Time                : num  8.4 13.8 6.7 NA 5.1 8.7 3.7 3.4 5.12 3.3 ...
 $ C.Time                 : num  NA NA NA NA NA ...
 $ WG.Time                : num  816 168 240 216 120 ...
 $ S.Contractions         : int  NA NA NA NA NA NA NA NA 145 114 ...
 $ S.Sum.of.Amplitudes    : num  NA NA NA NA NA ...
 $ S.Mean.Peak.Amplitude  : num  NA NA NA NA NA ...
 $ S.Mean.pH              : num  NA NA NA NA NA NA NA NA 2.07 2.28 ...
 $ SB.Contractions        : int  NA NA NA NA NA NA NA NA 298 782 ...
 $ SB.Sum.of.Amplitudes   : num  NA NA NA NA NA ...
 $ SB.Mean.Peak.Amplitude : num  NA NA NA NA NA ...
 $ SB.Mean.pH             : num  NA NA NA NA NA NA NA NA 7.26 7.21 ...
 $ Colon.Contractions     : int  NA NA NA NA NA NA NA NA 507 50 ...
 $ Colon.Sum.of.Amplitudes: num  NA NA NA NA NA ...
 $ C.Mean.Peak.Amplitude  : num  NA NA NA NA NA ...
 $ C.Mean.pH              : num  NA NA NA NA NA NA NA NA 7.58 7.21 ...

head(data)

  Group Gender Race Height    Weight Age GE.Time SB.Time C.Time WG.Time
1     0      1   NA 182.88 102.05820  25    74.3     8.4     NA     816
2     0      1   NA 180.34 102.05820  39    73.3    13.8     NA     168
3     0      1   NA 180.34  68.03880  44     4.3     6.7     NA     240
4     0      1   NA 175.26  69.85317  53      NA      NA     NA     216
5     0      0   NA 152.40  44.90561  57    13.9     5.1     NA     120
6     0      1   NA 185.42  94.80073  43    23.3     8.7     NA     384
  S.Contractions S.Sum.of.Amplitudes S.Mean.Peak.Amplitude S.Mean.pH
1             NA                  NA                    NA        NA
2             NA                  NA                    NA        NA
3             NA                  NA                    NA        NA
4             NA                  NA                    NA        NA
5             NA                  NA                    NA        NA
6             NA                  NA                    NA        NA
  SB.Contractions SB.Sum.of.Amplitudes SB.Mean.Peak.Amplitude SB.Mean.pH
1              NA                   NA                     NA         NA
2              NA                   NA                     NA         NA
3              NA                   NA                     NA         NA
4              NA                   NA                     NA         NA
5              NA                   NA                     NA         NA
6              NA                   NA                     NA         NA
  Colon.Contractions Colon.Sum.of.Amplitudes C.Mean.Peak.Amplitude C.Mean.pH
1                 NA                      NA                    NA        NA
2                 NA                      NA                    NA        NA
3                 NA                      NA                    NA        NA
4                 NA                      NA                    NA        NA
5                 NA                      NA                    NA        NA
6                 NA                      NA                    NA        NA

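# Drop every row that has an NA in any of the 22 columns.
# Group 0 rows are missing SB.Mean.pH (see head(data) above), so they are removed here.
data <- na.omit(data)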
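Because `na.omit()` drops a row when any column is missing, a narrower filter may keep more patients for the regression. The following is a minimal sketch of that idea, not the approach used in this project. The `toy` data frame and its values are made up for illustration, and `model_cols` and `toy_model` are hypothetical names.

```r
# Hypothetical toy data standing in for a few Smart Pill columns
toy <- data.frame(
  Group      = c(0, 0, 1, 1, 1),
  Race       = c(NA, NA, 1, NA, 2),
  Weight     = c(102.1, 68.0, 70.5, 81.2, NA),
  Age        = c(25, 44, 21, 24, 30),
  Gender     = c(1, 1, 1, 0, 0),
  SB.Mean.pH = c(NA, NA, 7.26, 7.21, 7.40)
)

# Columns the regression actually uses
model_cols <- c("SB.Mean.pH", "Weight", "Age", "Gender")

# Keep rows with no NA in those columns only
toy_model <- toy[complete.cases(toy[, model_cols]), ]

nrow(na.omit(toy))  # rows complete in every column
nrow(toy_model)     # rows complete in the model columns
```

With its default settings, `lm()` also drops rows that have missing values in the formula's variables, so a targeted filter like this mainly matters for the plots and for knowing how many patients remain.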

# Linear regression analysis
# Make a linear model for SB Mean pH using weight, age, and gender as predictors
model <- lm(`SB.Mean.pH` ~ Weight + Age + Gender, data = data)

# Display summary of the regression model
summary(model)


Call:
lm(formula = SB.Mean.pH ~ Weight + Age + Gender, data = data)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.16517 -0.16456  0.07069  0.22328  1.66224 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  7.310523   0.299028  24.448   <2e-16 ***
Weight      -0.004123   0.003884  -1.062    0.292    
Age          0.002625   0.004888   0.537    0.593    
Gender      -0.194384   0.121809  -1.596    0.115    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4897 on 74 degrees of freedom
Multiple R-squared:  0.07642,   Adjusted R-squared:  0.03898 
F-statistic: 2.041 on 3 and 74 DF,  p-value: 0.1155
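The numbers in a summary like the one above can also be pulled out of the model object for tables or further checks. This is a small sketch under stated assumptions: it fits a stand-in model on R's built-in `mtcars` data rather than the Smart Pill data, and `fit`, `coef_table`, `ci`, and `p_values` are illustrative names.

```r
# Stand-in model on the built-in mtcars data (not the Smart Pill data)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Coefficient table as a matrix: estimate, std. error, t value, p-value
coef_table <- coef(summary(fit))

# 95% confidence intervals for each coefficient
ci <- confint(fit, level = 0.95)

# Names of predictors whose p-value is below 0.05 (intercept excluded)
p_values <- coef_table[-1, "Pr(>|t|)"]
names(p_values)[p_values < 0.05]

# Share of variance explained by the model
summary(fit)$r.squared
```

The same calls should work on `model` from this project, since they rely only on the standard `lm` object.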

# Group 1 scatterplot
colors_group1 <- c("navy")
scatterplot_group1 <- ggplot(data %>% filter(Group == 1), aes(x = Weight, y = `SB.Mean.pH`, color = factor(Group))) +
  geom_point() +
  labs(
    title = "Weight and Small Bowel Mean pH for Group 1",
    x = "Weight (kg)",
    y = "Small Bowel Mean pH",
    color = "Group",
    caption = "Data source: Smart Pill Dataset: TSHS"
  ) +
  theme_minimal() +
  scale_color_manual(values = colors_group1) +
  theme(legend.position = "bottom")

# Gender scatterplot: redraw the Group 1 points, colored by Gender
scatterplot_both <- scatterplot_group1 +
  geom_point(data = data %>% filter(Group == 1), aes(color = factor(Gender))) +
  scale_color_manual(values = c("navy", "magenta")) +
  labs(color = "Gender")

Scale for colour is already present.
Adding another scale for colour, which will replace the existing scale.

# Display scatterplot
scatterplot_both
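The "Scale for colour is already present" message appears because a second `scale_color_manual()` is added on top of the first, and the points are drawn twice (once per `geom_point()` layer). One possible alternative is to map color to Gender from the start so that a single layer and a single scale are enough. The sketch below uses a made-up `toy` data frame in place of the Group 1 rows. The legend keeps the raw codes 0 and 1 because the code above does not say which code is which gender.

```r
library(ggplot2)

# Hypothetical toy data standing in for the Group 1 rows
toy <- data.frame(
  Weight     = c(60, 72, 85, 90, 66, 78),
  SB.Mean.pH = c(7.3, 7.1, 7.0, 6.9, 7.4, 7.2),
  Gender     = c(0, 1, 1, 1, 0, 0)
)

# One layer, one color scale: no replacement message, no overplotted points
scatterplot_gender <- ggplot(toy, aes(x = Weight, y = SB.Mean.pH, color = factor(Gender))) +
  geom_point() +
  scale_color_manual(values = c("0" = "navy", "1" = "magenta")) +
  labs(
    title = "Weight and Small Bowel Mean pH for Group 1",
    x = "Weight (kg)",
    y = "Small Bowel Mean pH",
    color = "Gender"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

scatterplot_gender
```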

# Create scatterplot with age on the x-axis
scatterplot_age <- ggplot(data %>% filter(Group == 1), aes(x = Age, y = `SB.Mean.pH`, color = factor(Group))) +
  geom_point() +
  labs(
    title = "Age and Small Bowel Mean pH",
    x = "Age",
    y = "Small Bowel Mean pH",
    color = "Group",
    caption = "Data source: Smart Pill Dataset: TSHS"
  ) +
  theme_minimal() +
  scale_color_manual(values = c("cyan")) +
  theme(legend.position = "bottom")

# Display scatterplot
scatterplot_age

In the analysis above, I created a scatterplot of Weight against SB Mean pH and a second scatterplot of Age against SB Mean pH. Both are restricted to Group 1, the healthy patients in the dataset. Unfortunately, Group 0, the ill patients, has no SB Mean pH data. The weight scatterplot also distinguishes between male and female patients, and a similar pattern appears in both. The regression calls for a cautious reading: none of Weight, Age, or Gender is a statistically significant predictor of SB Mean pH at the 0.05 level, and the model explains under 8% of the variance (R-squared = 0.076). To clean the data, I used na.omit to remove rows with NA values that could cause issues when compiling the document. I also had to rename some variables from their original names in the CSV/Excel file. I wish I could have included Group 0 in this visualization, but the variable I chose was not available for the ill patients. Perhaps I will revisit this project in the future and use bowel transit time instead of mean pH, since it was available for the Group 0 ill patients.
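As a rough sketch of how that follow-up might start, the snippet below compares a transit-time column across both groups, dropping only the rows that are missing that one column. The `toy` data frame and its values are made up for illustration and are not taken from the Smart Pill file. `toy_sb` and `group_means` are hypothetical names.

```r
# Hypothetical toy data; values are made up for illustration
toy <- data.frame(
  Group   = c(0, 0, 0, 1, 1, 1),
  SB.Time = c(8.4, 13.8, NA, 5.1, 3.3, 4.2)
)

# Drop rows missing SB.Time only, so both groups stay in the comparison
toy_sb <- toy[!is.na(toy$SB.Time), ]

# Mean small bowel transit time per group
group_means <- tapply(toy_sb$SB.Time, toy_sb$Group, mean)
group_means
```

Filtering on a single column matters here because applying `na.omit()` to the whole data frame would remove Group 0 again before any comparison could be made.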