Pymaceuticals Inc., a fictional burgeoning pharmaceutical company based out of San Diego, CA, specializes in drug-based, anti-cancer pharmaceuticals. They have provided the data to test the efficacy of potential drug treatments for squamous cell carcinoma. In this study, 249 mice identified with Squamous cell carcinoma (SCC) tumor growth, kind of skin cancer, were treated through a variety of drug regimens. Over the course of 45 days, tumor development was observed and measured. The objective is to analyze the data to show how four treatments (Capomulin, Infubinol, Ketapril, and Placebo) compare.
Question 1: Is Capomulin more or less effective in reducing the tumor size than Infubinol, or Ketapril drugs categories?
Question 2: Is there a correlation between the age, weight and the tumor size growth for each drug category?
Approach for answering the research question will be:
1- Calculate the mean percent change in tumor volume for the four drug categories.
2- Perform the Hypothesis test to find out whether the difference exists between the mean tumor size for all four drug categories.
3- Perform linear regression to study the correlation between various variables by calculating the correlation coefficient.
4- Finally analyze the results to find out if Capomulin more or less effective in reducing the tumor size of sample mice than Infubinol, or Ketapril drugs categories.
Answer: The metadata_df contain 249 unique mouse id and so are the number of cases that treated with variety of drug regimen. The results_df dataset holds the tumor growth measurements observed for each Mouse ID and carries 1,893 rows results. There are 10 different drug treatments. The total sample size of mice for four treatments (Capomulin, Infubinol, Ketapril, and Placebo) is 100 and the sample size of mice by drug treatments is 25 each.
Answer: Data is collected by the fictitious pharmaceutical company who was testing the efficacy of potential drug treatments for squamous cell carcinoma. I import the data into Rmd file from GitHub.
Answer: This is an experimental study. A group of 249 mice were monitored after administration of a variety of drug regimens over a 45-day treatment period. The impact of Capomulin on tumor growth, metastasis and survival rates were monitored, along with Infubinol, Ketapril, and Placebo.
Answer: The citation and data collection links are as follows.
In my search for the experimental datasets, I found the mouse metadata and the Study results on the GitHub link provided below:
https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Mouse_metadata.csv
https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Study_results.csv
Upon further research in finding the original source of the dataset, I found that these datasets are provided by Pymaceuticals Inc., a fictional burgeoning pharmaceutical company based out of San Diego, CA, specializes in drug-based, anti-cancer pharmaceuticals. Below is the link for the original source of the datasets.
https://c-l-nguyen.github.io/web-design-challenge/index.html
Answer: The response variable is the size of tumor, “Tumor.Volume.mm3.” and it holds a numerical data.
What is the explanatory variable, and what type is it (numerical/categorical)?
Answer: The explanatory variable is the “Drug.Regimen” and it holds a categorical data and “Timepoint” which holds numerical data. The ‘Timepoint’ unit is ‘days’.
Weight(g) Age_months(months) Timepoint(days) - numerical explanatory variable Drug.Regimen - categorical explanatory variable Tumor.Volume(mm3) - response variable Metastatic.Sites(satges of the spread of tumor)
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R; hence a code chunk is provided below. Insert more code chunks as needed.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.2
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyr)
library(stringr)
library(psych)
## Warning: package 'psych' was built under R version 4.1.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(ggplot2)
metadata_df <- read.delim("https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Mouse_metadata.csv", header=T, sep=",")
head(metadata_df)
## Mouse.ID Drug.Regimen Sex Age_months Weight..g.
## 1 k403 Ramicane Male 21 16
## 2 s185 Capomulin Female 3 17
## 3 x401 Capomulin Female 16 15
## 4 m601 Capomulin Male 22 17
## 5 g791 Ramicane Male 11 16
## 6 s508 Ramicane Male 1 17
df <- metadata_df %>%
group_by(Drug.Regimen)
head(df)
## # A tibble: 6 x 5
## # Groups: Drug.Regimen [2]
## Mouse.ID Drug.Regimen Sex Age_months Weight..g.
## <chr> <chr> <chr> <int> <int>
## 1 k403 Ramicane Male 21 16
## 2 s185 Capomulin Female 3 17
## 3 x401 Capomulin Female 16 15
## 4 m601 Capomulin Male 22 17
## 5 g791 Ramicane Male 11 16
## 6 s508 Ramicane Male 1 17
results_df <- read.delim("https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Study_results.csv", header=T, sep=",")
head(results_df)
## Mouse.ID Timepoint Tumor.Volume..mm3. Metastatic.Sites
## 1 b128 0 45 0
## 2 f932 0 45 0
## 3 g107 0 45 0
## 4 a457 0 45 0
## 5 c819 0 45 0
## 6 h246 0 45 0
summary(metadata_df)
## Mouse.ID Drug.Regimen Sex Age_months
## Length:249 Length:249 Length:249 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 6.00
## Mode :character Mode :character Mode :character Median :13.00
## Mean :12.73
## 3rd Qu.:19.00
## Max. :24.00
## Weight..g.
## Min. :15.00
## 1st Qu.:25.00
## Median :27.00
## Mean :26.12
## 3rd Qu.:29.00
## Max. :30.00
summary(results_df)
## Mouse.ID Timepoint Tumor.Volume..mm3. Metastatic.Sites
## Length:1893 Min. : 0.00 Min. :22.05 Min. :0.000
## Class :character 1st Qu.: 5.00 1st Qu.:45.00 1st Qu.:0.000
## Mode :character Median :20.00 Median :48.95 Median :1.000
## Mean :19.57 Mean :50.45 Mean :1.022
## 3rd Qu.:30.00 3rd Qu.:56.29 3rd Qu.:2.000
## Max. :45.00 Max. :78.57 Max. :4.000
nrow(metadata_df)
## [1] 249
nrow(results_df)
## [1] 1893
drug_count <- unique(metadata_df$Drug.Regimen)
drug_count
## [1] "Ramicane" "Capomulin" "Infubinol" "Placebo" "Ceftamin" "Stelasyn"
## [7] "Zoniferol" "Ketapril" "Propriva" "Naftisol"
length(drug_count)
## [1] 10
capomulin_df <- filter(metadata_df, Drug.Regimen=="Capomulin")
head(capomulin_df)
## Mouse.ID Drug.Regimen Sex Age_months Weight..g.
## 1 s185 Capomulin Female 3 17
## 2 x401 Capomulin Female 16 15
## 3 m601 Capomulin Male 22 17
## 4 f966 Capomulin Male 16 17
## 5 u364 Capomulin Male 18 17
## 6 y793 Capomulin Male 17 17
nrow(capomulin_df)
## [1] 25
infubinol_df <- filter(metadata_df, Drug.Regimen=="Infubinol")
nrow(infubinol_df)
## [1] 25
ketapril_df <- filter(metadata_df, Drug.Regimen=="Ketapril")
nrow(ketapril_df)
## [1] 25
placebo_df <- filter(metadata_df, Drug.Regimen=="Placebo")
nrow(placebo_df)
## [1] 25
merge_df <- merge(x = metadata_df, y = results_df, all = TRUE)
head(merge_df)
## Mouse.ID Drug.Regimen Sex Age_months Weight..g. Timepoint
## 1 a203 Infubinol Female 20 23 20
## 2 a203 Infubinol Female 20 23 25
## 3 a203 Infubinol Female 20 23 15
## 4 a203 Infubinol Female 20 23 10
## 5 a203 Infubinol Female 20 23 35
## 6 a203 Infubinol Female 20 23 0
## Tumor.Volume..mm3. Metastatic.Sites
## 1 55.17334 1
## 2 56.79321 1
## 3 52.77787 1
## 4 51.85244 1
## 5 61.93165 2
## 6 45.00000 0
glimpse(merge_df)
## Rows: 1,893
## Columns: 8
## $ Mouse.ID <chr> "a203", "a203", "a203", "a203", "a203", "a203", "a2~
## $ Drug.Regimen <chr> "Infubinol", "Infubinol", "Infubinol", "Infubinol",~
## $ Sex <chr> "Female", "Female", "Female", "Female", "Female", "~
## $ Age_months <int> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,~
## $ Weight..g. <int> 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 25, 25, 25,~
## $ Timepoint <int> 20, 25, 15, 10, 35, 0, 30, 5, 45, 40, 5, 40, 35, 45~
## $ Tumor.Volume..mm3. <dbl> 55.17334, 56.79321, 52.77787, 51.85244, 61.93165, 4~
## $ Metastatic.Sites <int> 1, 1, 1, 1, 2, 0, 1, 0, 2, 2, 0, 1, 1, 1, 1, 1, 1, ~
merge_df <- merge_df %>% drop_na()
head(merge_df)
## Mouse.ID Drug.Regimen Sex Age_months Weight..g. Timepoint
## 1 a203 Infubinol Female 20 23 20
## 2 a203 Infubinol Female 20 23 25
## 3 a203 Infubinol Female 20 23 15
## 4 a203 Infubinol Female 20 23 10
## 5 a203 Infubinol Female 20 23 35
## 6 a203 Infubinol Female 20 23 0
## Tumor.Volume..mm3. Metastatic.Sites
## 1 55.17334 1
## 2 56.79321 1
## 3 52.77787 1
## 4 51.85244 1
## 5 61.93165 2
## 6 45.00000 0
colnames(merge_df)[1] <- c("Mouse_Id")
colnames(merge_df)[2] <- c("Drug_Regimen")
colnames(merge_df)[5] <- c("Weight_g")
colnames(merge_df)[7] <- c("Tumor_Volume_mm3")
colnames(merge_df)[8] <- c("Metastatic_Sites")
head(merge_df)
## Mouse_Id Drug_Regimen Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1 a203 Infubinol Female 20 23 20 55.17334
## 2 a203 Infubinol Female 20 23 25 56.79321
## 3 a203 Infubinol Female 20 23 15 52.77787
## 4 a203 Infubinol Female 20 23 10 51.85244
## 5 a203 Infubinol Female 20 23 35 61.93165
## 6 a203 Infubinol Female 20 23 0 45.00000
## Metastatic_Sites
## 1 1
## 2 1
## 3 1
## 4 1
## 5 2
## 6 0
merge_df %>% group_by(Mouse_Id, Timepoint)
## # A tibble: 1,893 x 8
## # Groups: Mouse_Id, Timepoint [1,888]
## Mouse_Id Drug_Regimen Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## <chr> <chr> <chr> <int> <int> <int> <dbl>
## 1 a203 Infubinol Female 20 23 20 55.2
## 2 a203 Infubinol Female 20 23 25 56.8
## 3 a203 Infubinol Female 20 23 15 52.8
## 4 a203 Infubinol Female 20 23 10 51.9
## 5 a203 Infubinol Female 20 23 35 61.9
## 6 a203 Infubinol Female 20 23 0 45
## 7 a203 Infubinol Female 20 23 30 59.5
## 8 a203 Infubinol Female 20 23 5 48.5
## 9 a203 Infubinol Female 20 23 45 68.0
## 10 a203 Infubinol Female 20 23 40 63.6
## # ... with 1,883 more rows, and 1 more variable: Metastatic_Sites <int>
head(merge_df)
## Mouse_Id Drug_Regimen Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1 a203 Infubinol Female 20 23 20 55.17334
## 2 a203 Infubinol Female 20 23 25 56.79321
## 3 a203 Infubinol Female 20 23 15 52.77787
## 4 a203 Infubinol Female 20 23 10 51.85244
## 5 a203 Infubinol Female 20 23 35 61.93165
## 6 a203 Infubinol Female 20 23 0 45.00000
## Metastatic_Sites
## 1 1
## 2 1
## 3 1
## 4 1
## 5 2
## 6 0
merge_df <- merge_df[!duplicated(merge_df), ]
head(merge_df)
## Mouse_Id Drug_Regimen Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1 a203 Infubinol Female 20 23 20 55.17334
## 2 a203 Infubinol Female 20 23 25 56.79321
## 3 a203 Infubinol Female 20 23 15 52.77787
## 4 a203 Infubinol Female 20 23 10 51.85244
## 5 a203 Infubinol Female 20 23 35 61.93165
## 6 a203 Infubinol Female 20 23 0 45.00000
## Metastatic_Sites
## 1 1
## 2 1
## 3 1
## 4 1
## 5 2
## 6 0
capomulin_df <- filter(merge_df, Drug_Regimen == "Capomulin")
infubinol_df <- filter(merge_df, Drug_Regimen == "Infubinol")
ketapril_df <- filter(merge_df, Drug_Regimen == "Ketapril")
placebo_df <- filter(merge_df, Drug_Regimen == "Placebo")
head(capomulin_df)
## Mouse_Id Drug_Regimen Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1 b128 Capomulin Female 9 22 5 45.65133
## 2 b128 Capomulin Female 9 22 25 43.26214
## 3 b128 Capomulin Female 9 22 35 37.96764
## 4 b128 Capomulin Female 9 22 10 43.27085
## 5 b128 Capomulin Female 9 22 0 45.00000
## 6 b128 Capomulin Female 9 22 40 38.37973
## Metastatic_Sites
## 1 0
## 2 1
## 3 1
## 4 0
## 5 0
## 6 2
# Filtering the max timepoint values
capo_df <- select(capomulin_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3, Age_months, Weight_g) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
head(capo_df)
## # A tibble: 6 x 6
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
## <chr> <chr> <int> <dbl> <int> <int>
## 1 b128 Capomulin 45 39.0 9 22
## 2 b742 Capomulin 45 38.9 7 21
## 3 f966 Capomulin 20 30.5 16 17
## 4 g288 Capomulin 45 37.1 3 19
## 5 g316 Capomulin 45 40.2 22 22
## 6 i557 Capomulin 45 47.7 1 24
# Calculating the percent change in Tumor volume
capo_df <- capo_df %>%
mutate(percent_change_Tumor_Volume = ((Tumor_Volume_mm3-45)/45))
head(capo_df)
## # A tibble: 6 x 7
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
## <chr> <chr> <int> <dbl> <int> <int>
## 1 b128 Capomulin 45 39.0 9 22
## 2 b742 Capomulin 45 38.9 7 21
## 3 f966 Capomulin 20 30.5 16 17
## 4 g288 Capomulin 45 37.1 3 19
## 5 g316 Capomulin 45 40.2 22 22
## 6 i557 Capomulin 45 47.7 1 24
## # ... with 1 more variable: percent_change_Tumor_Volume <dbl>
# Filtering the max timepoint values
infu_df <- select(infubinol_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3, Age_months, Weight_g) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
# Calculating the percent change in Tumor volume
infu_df <- infu_df %>%
mutate(percent_change_Tumor_Volume = ((Tumor_Volume_mm3-45)/45))
head(infu_df)
## # A tibble: 6 x 7
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
## <chr> <chr> <int> <dbl> <int> <int>
## 1 a203 Infubinol 45 68.0 20 23
## 2 a251 Infubinol 45 65.5 21 25
## 3 a577 Infubinol 30 57.0 6 25
## 4 a685 Infubinol 45 66.1 8 30
## 5 c139 Infubinol 45 72.2 11 28
## 6 c326 Infubinol 5 36.3 18 25
## # ... with 1 more variable: percent_change_Tumor_Volume <dbl>
# Filtering the max timepoint values
keta_df <- select(ketapril_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3, Age_months, Weight_g) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
# Calculating the percent change in Tumor volume
keta_df <- keta_df %>%
mutate(percent_change_Tumor_Volume = ((Tumor_Volume_mm3-45)/45))
head(keta_df)
## # A tibble: 6 x 7
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
## <chr> <chr> <int> <dbl> <int> <int>
## 1 a457 Ketapril 10 49.8 11 30
## 2 c580 Ketapril 30 58.0 22 25
## 3 c819 Ketapril 40 62.2 21 25
## 4 c832 Ketapril 45 65.4 18 29
## 5 d474 Ketapril 40 60.2 18 27
## 6 f278 Ketapril 5 48.2 12 30
## # ... with 1 more variable: percent_change_Tumor_Volume <dbl>
plac_df <- select(placebo_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3, Age_months, Weight_g) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
# Calculating the percent change in Tumor volume
plac_df <- plac_df %>%
mutate(percent_change_Tumor_Volume = ((Tumor_Volume_mm3-45)/45))
head(plac_df)
## # A tibble: 6 x 7
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
## <chr> <chr> <int> <dbl> <int> <int>
## 1 a262 Placebo 45 70.7 17 29
## 2 a897 Placebo 45 72.3 7 28
## 3 c282 Placebo 45 65.8 12 27
## 4 c757 Placebo 45 69.0 9 27
## 5 c766 Placebo 45 69.8 13 26
## 6 e227 Placebo 45 73.2 1 30
## # ... with 1 more variable: percent_change_Tumor_Volume <dbl>
# put all data frames into list
df_list <- list(capo_df, infu_df, keta_df, plac_df)
#merge all data frames together
new_df <- Reduce(function(x, y) merge(x, y, all=TRUE), df_list)
head(new_df)
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
## 1 a203 Infubinol 45 67.97342 20 23
## 2 a251 Infubinol 45 65.52574 21 25
## 3 a262 Placebo 45 70.71762 17 29
## 4 a457 Ketapril 10 49.78342 11 30
## 5 a577 Infubinol 30 57.03186 6 25
## 6 a685 Infubinol 45 66.08307 8 30
## percent_change_Tumor_Volume
## 1 0.5105204
## 2 0.4561276
## 3 0.5715027
## 4 0.1062982
## 5 0.2673747
## 6 0.4685126
#new_df$percent_change_Tumor_Volume = round(new_df$percent_change_Tumor_Volume, digit=2)
head(new_df)
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
## 1 a203 Infubinol 45 67.97342 20 23
## 2 a251 Infubinol 45 65.52574 21 25
## 3 a262 Placebo 45 70.71762 17 29
## 4 a457 Ketapril 10 49.78342 11 30
## 5 a577 Infubinol 30 57.03186 6 25
## 6 a685 Infubinol 45 66.08307 8 30
## percent_change_Tumor_Volume
## 1 0.5105204
## 2 0.4561276
## 3 0.5715027
## 4 0.1062982
## 5 0.2673747
## 6 0.4685126
test_df <- select(new_df, Drug_Regimen, percent_change_Tumor_Volume) %>%
group_by(Drug_Regimen) %>%
summarise(percent_change_Tumor_Volume = mean(percent_change_Tumor_Volume, na.rm=TRUE))
head(test_df)
## # A tibble: 4 x 2
## Drug_Regimen percent_change_Tumor_Volume
## <chr> <dbl>
## 1 Capomulin -0.185
## 2 Infubinol 0.293
## 3 Ketapril 0.396
## 4 Placebo 0.345
# plot mean salaries
ggplot(test_df,
aes(x = percent_change_Tumor_Volume,
y = Drug_Regimen,
fill=Drug_Regimen)) +
geom_bar(stat = "identity")
Compute some summary statistics (count, mean and sd) of the variable weight organized by groups:
# Summarize
new_df %>%
group_by(Drug_Regimen) %>%
summarise(
n = n(),
max = max(percent_change_Tumor_Volume),
min = min(percent_change_Tumor_Volume),
mean = mean(percent_change_Tumor_Volume),
median = median(percent_change_Tumor_Volume),
sd = sd(percent_change_Tumor_Volume),
quan_25th = quantile(percent_change_Tumor_Volume, 0.25),
quan_75th = quantile(percent_change_Tumor_Volume, 0.75)
)
## # A tibble: 4 x 9
## Drug_Regimen n max min mean median sd quan_25th quan_75th
## <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Capomulin 25 0.0597 -0.481 -0.185 -0.153 0.127 -0.281 -0.108
## 2 Infubinol 25 0.605 -0.193 0.293 0.337 0.191 0.201 0.456
## 3 Ketapril 25 0.746 0 0.396 0.433 0.221 0.260 0.553
## 4 Placebo 25 0.627 0 0.345 0.378 0.197 0.177 0.514
library(tidyverse)
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.1.3
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.3
# Color by groups
ggsummarystats(
new_df, x = "Drug_Regimen", y = "percent_change_Tumor_Volume",
ggfunc = ggboxplot, add = "jitter",
color = "Drug_Regimen", palette = "npg"
)
Create a box plot of percent_change_Tumor_Volume by Drug_Regimen:
boxplot(new_df$percent_change_Tumor_Volume~new_df$Drug_Regimen,
main="Boxplot comparing percent_change_Tumor_Volume",
xlab = "Drug_Regimen", ylab = "percent_change_Tumor_Volume",
col= rainbow(4))
The drug categories represented on the x_axis and percent change in tumor size on the y_axis.
Infubinol drug category has one datapoint indicating the outlier.
Infubinol and Ketapril drug category shows that the median and mean are closely alligned, which shows data is normally distributed, and we can ignore the outlier in Infubinol drug category.
Ketapril drug category has more variance of data points than the Infubinol drug category.
However, the mean < median in the Capomulin drug category, making it left skewed. Also, the percent change is tumor size is below zero, showing that the Capomulin drug is effective in reducing the size of tumor.
Since Capomulin drug category appears to be different than Infubinol and Ketapril drug category, which is an indication that we will find the difference between means of Capomulin drug and the other two categories.
#=============================================================================================================================
The ANOVA test makes the following assumptions about the data:
Independence of the observations. Each subject should belong to only one group. There is no relationship between the observations in each group. Having repeated measures for the same participants is not allowed. No significant outliers in any cell of the design Normality. the data for each design cell should be approximately normally distributed. Homogeneity of variances. The variance of the outcome variable should be equal in every cell of the design.
new_df %>%
group_by(Drug_Regimen) %>%
identify_outliers(percent_change_Tumor_Volume)
## # A tibble: 1 x 9
## Drug_Regimen Mouse_Id Timepoint Tumor_Volume_mm3 Age_months Weight_g
## <chr> <chr> <int> <dbl> <int> <int>
## 1 Infubinol c326 5 36.3 18 25
## # ... with 3 more variables: percent_change_Tumor_Volume <dbl>,
## # is.outlier <lgl>, is.extreme <lgl>
In this situation there is one outliers, this can be due to data entry errors, measurement errors or unusual values and the mean is not effected.
Check normality assumption by analyzing the model residuals. QQ plot and Shapiro-Wilk test of normality are used.
# Build the linear model
model <- lm(percent_change_Tumor_Volume ~ Drug_Regimen, data = new_df)
# Create a QQ plot of residuals
ggqqplot(residuals(model),col="blue")
# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))
## # A tibble: 1 x 3
## variable statistic p.value
## <chr> <dbl> <dbl>
## 1 residuals(model) 0.976 0.0678
In the QQ plot, as all the points fall approximately along the reference line, we can assume normality. This conclusion is supported by the Shapiro-Wilk test. The p-value is not significant (p = 0.6), so we can assume normality.
Computing Shapiro-Wilk test for each group level. If the data is normally distributed, the p-value should be greater than 0.05.
new_df %>%
group_by(Drug_Regimen) %>%
shapiro_test(percent_change_Tumor_Volume)
## # A tibble: 4 x 4
## Drug_Regimen variable statistic p
## <chr> <chr> <dbl> <dbl>
## 1 Capomulin percent_change_Tumor_Volume 0.968 0.599
## 2 Infubinol percent_change_Tumor_Volume 0.960 0.414
## 3 Ketapril percent_change_Tumor_Volume 0.945 0.189
## 4 Placebo percent_change_Tumor_Volume 0.936 0.121
The score were normally distributed (p > 0.05) for each group, as assessed by Shapiro-Wilk’s test of normality.
QQ plot draws the correlation between a given data and the normal distribution.
ggqqplot(new_df, "percent_change_Tumor_Volume", facet.by = "Drug_Regimen", col="red")
All the points fall approximately along the reference line, for each cell. So we can assume normality of the data.
The residuals versus fits plot can be used to check the homogeneity of variances.
plot(model, 1)
In the plot above, there is no evident relationships between residuals and fitted values (the mean of each group), which is good. So, we can assume the homogeneity of variances.
#result_aov <- new_df %>% anova_test(percent_change_Tumor_Volume ~ Drug_Regimen)
result_aov <-anova(lm(percent_change_Tumor_Volume ~ Drug_Regimen, new_df))
result_aov
## Analysis of Variance Table
##
## Response: percent_change_Tumor_Volume
## Df Sum Sq Mean Sq F value Pr(>F)
## Drug_Regimen 3 5.3903 1.79676 51.182 < 2.2e-16 ***
## Residuals 96 3.3701 0.03511
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
table <- aov(percent_change_Tumor_Volume ~ Drug_Regimen, data=new_df)
summary(table)
## Df Sum Sq Mean Sq F value Pr(>F)
## Drug_Regimen 3 5.39 1.7968 51.18 <2e-16 ***
## Residuals 96 3.37 0.0351
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Degree of freedom: 3, since we have 4 categories, so 4-1 =3 degree of freedom Sum of Squares: 53104 Mean Sum of Squares: 17701 F-Statistics: 69.737, F(3, 96) = 69.74 p-Value: 2.2e-16, this tells weather we found significance difference or not. The p-value less than alpha=0.05 shows that we found the difference in at least two of the means in the data sample. And that is what we suspected from the boxplot analysis.
The F-value in an ANOVA is calculated as: variation between sample means / variation within the samples. The higher the F-value in an ANOVA, the higher the variation between sample means relative to the variation within the samples. The higher the F-value, the lower the corresponding p-value.
Since the p-value is below a threshold (α = .05), we can reject the null hypothesis of the ANOVA and conclude that there is a statistically significant difference between group means.
Therefore, we can reject the null hypothesis that there is no difference between the means in the data sample.
But we are not sure about which category mean is different, so we perform Post-hoc tests to determine that.
A significant one-way ANOVA is generally followed up by Tukey post-hoc tests to perform multiple pairwise comparisons between groups. Key R function: tukey_hsd() [rstatix].
TukeyHSD(table)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = percent_change_Tumor_Volume ~ Drug_Regimen, data = new_df)
##
## $Drug_Regimen
## diff lwr upr p adj
## Infubinol-Capomulin 0.47801507 0.33945452 0.61657561 0.0000000
## Ketapril-Capomulin 0.58085831 0.44229776 0.71941885 0.0000000
## Placebo-Capomulin 0.52979659 0.39123605 0.66835714 0.0000000
## Ketapril-Infubinol 0.10284324 -0.03571731 0.24140378 0.2181696
## Placebo-Infubinol 0.05178152 -0.08677902 0.19034207 0.7627868
## Placebo-Ketapril -0.05106171 -0.18962226 0.08749883 0.7704117
when p-adj < p-value, there is a significant difference between the two categories mean.
We are 95% confident that the true mean lies between the lwr and upr bounds of confidence Interval.
The p-Value for Infubinol-Capomulin, Ketapril-Capomulin, Placebo-Capomulin is less than alpha=0.05 that shows there is significant difference between the means of these categories.
The p-Value for Infubinol-Capomulin, Ketapril-Capomulin, Placebo-Capomulin is less than alpha=0.05 that shows there is significant difference between the means of these categories.
Thus, Capomulin more effective in reducing the tumor size than Infubinol, or Ketapril drugs categories.
The visualization below shows the difference in mean between the two categories in the middle of each bar.
Both end limits on each bar shows the upper and lower limits of confidence intervals.
plot(TukeyHSD(table), col= rainbow(4))
###=====================================================================================
head(capo_df)
## # A tibble: 6 x 7
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
## <chr> <chr> <int> <dbl> <int> <int>
## 1 b128 Capomulin 45 39.0 9 22
## 2 b742 Capomulin 45 38.9 7 21
## 3 f966 Capomulin 20 30.5 16 17
## 4 g288 Capomulin 45 37.1 3 19
## 5 g316 Capomulin 45 40.2 22 22
## 6 i557 Capomulin 45 47.7 1 24
## # ... with 1 more variable: percent_change_Tumor_Volume <dbl>
#fit regression model
model <- lm(Tumor_Volume_mm3~Weight_g+Age_months, data=capo_df)
#view model summary
summary(model)
##
## Call:
## lm(formula = Tumor_Volume_mm3 ~ Weight_g + Age_months, data = capo_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2433 -0.7903 0.1667 1.5253 4.8761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.48591 4.32578 0.112 0.912
## Weight_g 1.76129 0.20298 8.677 1.49e-08 ***
## Age_months 0.05303 0.07401 0.716 0.481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.838 on 22 degrees of freedom
## Multiple R-squared: 0.7739, Adjusted R-squared: 0.7533
## F-statistic: 37.65 on 2 and 22 DF, p-value: 7.898e-08
#By looking at the coefficients, it seems as though weight is accounting for most of the variance (B = 1.76, p=1.49e-08, p<.05). Age is not a significant predictor of tumor size (B = .053, p = .481).
Min 1Q Median 3Q Max
-7.2433 -0.7903 0.1667 1.5253 4.8761
This section displays a summary of the distribution of residuals from the regression model. A residual is the difference between the observed value and the predicted value from the regression model.
The minimum residual was -7.2433, the median residual was 0.1667 and the max residual was 4.8761.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.48591 4.32578 0.112 0.912
Weight_g 1.76129 0.20298 8.677 1.49e-08 *** Age_months 0.05303 0.07401 0.716 0.481
— Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1
This section displays the estimated coefficients of the regression model. We can use these coefficients to form the following estimated regression equation:
Estimate: The estimated coefficient. This tells us about the average increase in the response variable associated with a one unit increase in the predictor variable, assuming all other predictor variables are held constant.
Std. Error: This is the standard error of the coefficient. This is a measure of the uncertainty in our estimate of the coefficient.
t value: This is the t-statistic for the predictor variable, calculated as (Estimate) / (Std. Error).
Residual standard error: This tells us the average distance that the observed values fall from the regression line. The smaller the value, the better the regression model can fit the data.
Multiple R-Squared: This is known as the coefficient of determination. It tells us the proportion of the variance in the response variable that can be explained by the predictor variables. This value ranges from 0 to 1. The closer it is to 1, the better the predictor variables are able to predict the value of the response variable.
Adjusted R-squared: This is a modified version of R-squared that has been adjusted for the number of predictors in the model. It is always lower than the R-squared.
F-statistic: This indicates whether the regression model provides a better fit to the data than a model that contains no independent variables. In essence, it tests if the regression model is useful.
p-value: This is the p-value that corresponds to the F-statistic. If this value is less than some significance level (e.g., 0.05), then the regression model fits the data better than a model with no predictors.
When building regression models, we hope that this p-value is less than some significance level because it indicates that the predictor variables are useful for predicting the value of the response variable.
If we used an alpha level of α = .05 to determine which predictors were significant in this regression model, we’d say that weight has (p=1.49e-08) < is (α = .05) and is statistically significant predictors.
while age (p=0.481) > is (α = .05) and is not statistically significant predictors.
wt_Model <- lm(Tumor_Volume_mm3~Weight_g, data=new_df)
summary(wt_Model)
##
## Call:
## lm(formula = Tumor_Volume_mm3 ~ Weight_g, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.0966 -7.1010 -0.7038 8.7571 20.1531
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.6591 6.2930 -0.899 0.371
## Weight_g 2.3252 0.2402 9.680 6e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.62 on 98 degrees of freedom
## Multiple R-squared: 0.4888, Adjusted R-squared: 0.4836
## F-statistic: 93.7 on 1 and 98 DF, p-value: 5.997e-16
If we used an alpha level of α = .05 to determine which predictors were significant in this regression model, we’d say that weight has (p-value: 6e-16) < is (α = .05) and is statistically significant predictors.
ggscatter(new_df, x = "Tumor_Volume_mm3", y = "Weight_g", col="red",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Tumor Size mm3", ylab = "Weight_g")
## `geom_smooth()` using formula 'y ~ x'
age_Model <- lm(Tumor_Volume_mm3~Age_months, data=new_df)
summary(age_Model)
##
## Call:
## lm(formula = Tumor_Volume_mm3 ~ Age_months, data = new_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.032 -9.571 1.621 11.525 23.876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.331 2.912 18.656 <2e-16 ***
## Age_months 0.015 0.185 0.081 0.936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.45 on 98 degrees of freedom
## Multiple R-squared: 6.709e-05, Adjusted R-squared: -0.01014
## F-statistic: 0.006575 on 1 and 98 DF, p-value: 0.9355
If we used an alpha level of α = .05 to determine which predictors were significant in this regression model, we’d say that age has (p-value: 0.936) > is (α = .05) and is not statistically significant predictors.
ggscatter(new_df, x = "Tumor_Volume_mm3", y = "Age_months", col="blue",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Tumor Size mm3", ylab = "Age")
## `geom_smooth()` using formula 'y ~ x'
### Explanation of One Way ANOVA Variance Test:
ANOVA (or AOV) is short for ANalysis Of VAriance. While it is commonly used for categorical data, because ANOVA is a type of linear model it can be modified to include continuous data.
ANOVAs are typically referred to as 1-way and 2-way, which is just a way of saying how many groups are being examined in the model. (ANOVAs can be 3-way or more, but these are less common or end up with different names.) A group is the categorical variable that you are evaluating.
ANOVA test determines the difference in mean between two or more independent groups.
ANOVA checks the impact of one or more group by comparing the means of different samples. We can use ANOVA to prove/disprove if all the medication treatments were equally effective or not.
ANOVA analyzes the variance in the data to look for differences. It does this by considering two sources of variance, the between-group variance and the within-group variance.
The between-group variation is calculated by comparing the mean of each group with the overall mean of the data—so individual data points don’t matter quite as much as just comparing group means.
The within-group variation is the variation of each observation from its group mean.
For both types of variance, a sums of squares (SS) is the numerical metric used to quantify them and this metric simply sums the distances of each point to the mean.
The ratio of these SS (between SS divided by within SS) results in an F-statistic, which is the test statistic for ANOVA.
The F-statistic is then combined with the degrees of freedom (df) to arrive at a p-value, which, we are looking for.
Another way to think of it is that small p-values come from large F-statistics, and large F-statistics suggest that the between group variance is much larger than the within-group variance.
So, when the differences in group means is larger and yet the groups are not that variable, we tend to have significant factors in our ANOVAs.
A type 1 error is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis. This means that your report that your findings are significant when in fact they have occurred by chance.
The probability of making a type I error is represented by your alpha level (α), which is the p-value below which you reject the null hypothesis. A p-value of 0.05 indicates that we are willing to accept a 5% chance that we are wrong when we reject the null hypothesis.
We can reduce your risk of committing a type I error by using a lower value for p. For example, a p-value of 0.01 would mean there is a 1% chance of committing a Type I error.
However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a type II error).
Why is this analysis important?
ANOVA test determines the difference in mean between two or more independent groups.
It provides the overall test of equality of group means.
It can control the overall type I error rate (i.e., false positive finding)
Limitations of the analysis?
One-way ANOVA can only be used when investigating a single factor and a single dependent variable. When comparing the means of three or more groups, it can tell us if at least one pair of means is significantly different, but it can’t tell us which pair.
ANOVA assumes that the data is normally distributed. The ANOVA also assumes homogeneity of variance, which means that the variance among the groups should be approximately equal. ANOVA also assumes that the observations are independent of each other.