library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.7
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## Warning: package 'ggplot2' was built under R version 4.1.2
## Warning: package 'stringr' was built under R version 4.1.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyr)
library(stringr)
library(psych)
## Warning: package 'psych' was built under R version 4.1.2
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(ggplot2)
metadata_df <- read.delim("https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Mouse_metadata.csv", header = TRUE, sep = ",")
head(metadata_df)
## Mouse.ID Drug.Regimen Sex Age_months Weight..g.
## 1 k403 Ramicane Male 21 16
## 2 s185 Capomulin Female 3 17
## 3 x401 Capomulin Female 16 15
## 4 m601 Capomulin Male 22 17
## 5 g791 Ramicane Male 11 16
## 6 s508 Ramicane Male 1 17
df <- metadata_df %>%
group_by(Drug.Regimen)
head(df)
## # A tibble: 6 x 5
## # Groups: Drug.Regimen [2]
## Mouse.ID Drug.Regimen Sex Age_months Weight..g.
## <chr> <chr> <chr> <int> <int>
## 1 k403 Ramicane Male 21 16
## 2 s185 Capomulin Female 3 17
## 3 x401 Capomulin Female 16 15
## 4 m601 Capomulin Male 22 17
## 5 g791 Ramicane Male 11 16
## 6 s508 Ramicane Male 1 17
results_df <- read.delim("https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Study_results.csv", header = TRUE, sep = ",")
head(results_df)
## Mouse.ID Timepoint Tumor.Volume..mm3. Metastatic.Sites
## 1 b128 0 45 0
## 2 f932 0 45 0
## 3 g107 0 45 0
## 4 a457 0 45 0
## 5 c819 0 45 0
## 6 h246 0 45 0
Introduction: Pymaceuticals Inc., a fictional, burgeoning pharmaceutical company based in San Diego, CA, specializes in drug-based anti-cancer pharmaceuticals. The company has provided data to test the efficacy of potential drug treatments for squamous cell carcinoma (SCC), a kind of skin cancer. In this study, 249 mice identified with SCC tumor growth were treated with a variety of drug regimens. Over the course of 45 days, tumor development was observed and measured. The objective is to analyze the data to show how four treatments (Capomulin, Infubinol, Ketapril, and Placebo) compare.
Question 1: Is Capomulin more effective than the three other drugs in the dataset?
Question 2: Is there a correlation between age, weight, and the effectiveness of Capomulin?
Null Hypothesis: There is no difference between the effectiveness of the four drug regimens.
Alternate Hypothesis: Capomulin is more effective than the other three drug regimens at treating SCC tumor growth.
The approach for answering the research questions will be:
1- Perform linear regression to study the relationship between the variables and calculate the correlation coefficient.
2- Compare the four populations against each other.
3- Perform hypothesis testing.
Answer: metadata_df contains 249 unique mouse IDs, which is also the number of cases treated with the various drug regimens. The results_df dataset holds the tumor growth measurements observed for each Mouse ID and has 1,893 rows. There are 10 different drug treatments. The total sample size for the four treatments of interest (Capomulin, Infubinol, Ketapril, and Placebo) is 100 mice, with 25 mice per treatment.
Answer: The data were collected by the fictitious pharmaceutical company, which was testing the efficacy of potential drug treatments for squamous cell carcinoma. I import the data into my .Rmd file from GitHub.
Answer: This is an experimental study. A group of 249 mice was monitored after administration of a variety of drug regimens over a 45-day treatment period. The impact of Capomulin on tumor growth, metastasis, and survival rates was monitored, along with that of Infubinol, Ketapril, and Placebo.
Answer: The citation and data collection links are as follows.
In my search for the experimental datasets, I found Mouse_metadata and Study_results at the GitHub links provided below:
https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Mouse_metadata.csv
https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Study_results.csv
While searching for the original source of the datasets, I found that they are provided by Pymaceuticals Inc., a fictional pharmaceutical company based in San Diego, CA, that specializes in drug-based anti-cancer pharmaceuticals. Below is the link to the original source of the datasets.
https://c-l-nguyen.github.io/web-design-challenge/index.html
Answer: The response variable is the tumor size, “Tumor.Volume..mm3.”, and it is numerical.
What is the explanatory variable, and what type is it (numerical/categorical)?
Answer: The explanatory variables are “Drug.Regimen”, which is categorical, and “Timepoint”, which is numerical and measured in days.
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
summary(metadata_df)
## Mouse.ID Drug.Regimen Sex Age_months
## Length:249 Length:249 Length:249 Min. : 1.00
## Class :character Class :character Class :character 1st Qu.: 6.00
## Mode :character Mode :character Mode :character Median :13.00
## Mean :12.73
## 3rd Qu.:19.00
## Max. :24.00
## Weight..g.
## Min. :15.00
## 1st Qu.:25.00
## Median :27.00
## Mean :26.12
## 3rd Qu.:29.00
## Max. :30.00
summary(results_df)
## Mouse.ID Timepoint Tumor.Volume..mm3. Metastatic.Sites
## Length:1893 Min. : 0.00 Min. :22.05 Min. :0.000
## Class :character 1st Qu.: 5.00 1st Qu.:45.00 1st Qu.:0.000
## Mode :character Median :20.00 Median :48.95 Median :1.000
## Mean :19.57 Mean :50.45 Mean :1.022
## 3rd Qu.:30.00 3rd Qu.:56.29 3rd Qu.:2.000
## Max. :45.00 Max. :78.57 Max. :4.000
nrow(metadata_df)
## [1] 249
nrow(results_df)
## [1] 1893
drug_count <- unique(metadata_df$Drug.Regimen)
drug_count
## [1] "Ramicane" "Capomulin" "Infubinol" "Placebo" "Ceftamin" "Stelasyn"
## [7] "Zoniferol" "Ketapril" "Propriva" "Naftisol"
length(drug_count)
## [1] 10
capomulin_df <- filter(metadata_df, Drug.Regimen=="Capomulin")
head(capomulin_df)
## Mouse.ID Drug.Regimen Sex Age_months Weight..g.
## 1 s185 Capomulin Female 3 17
## 2 x401 Capomulin Female 16 15
## 3 m601 Capomulin Male 22 17
## 4 f966 Capomulin Male 16 17
## 5 u364 Capomulin Male 18 17
## 6 y793 Capomulin Male 17 17
nrow(capomulin_df)
## [1] 25
infubinol_df <- filter(metadata_df, Drug.Regimen=="Infubinol")
nrow(infubinol_df)
## [1] 25
ketapril_df <- filter(metadata_df, Drug.Regimen=="Ketapril")
nrow(ketapril_df)
## [1] 25
placebo_df <- filter(metadata_df, Drug.Regimen=="Placebo")
nrow(placebo_df)
## [1] 25
merge_df <- merge(x = metadata_df, y = results_df, all = TRUE)
head(merge_df)
## Mouse.ID Drug.Regimen Sex Age_months Weight..g. Timepoint
## 1 a203 Infubinol Female 20 23 20
## 2 a203 Infubinol Female 20 23 25
## 3 a203 Infubinol Female 20 23 15
## 4 a203 Infubinol Female 20 23 10
## 5 a203 Infubinol Female 20 23 35
## 6 a203 Infubinol Female 20 23 0
## Tumor.Volume..mm3. Metastatic.Sites
## 1 55.17334 1
## 2 56.79321 1
## 3 52.77787 1
## 4 51.85244 1
## 5 61.93165 2
## 6 45.00000 0
glimpse(merge_df)
## Rows: 1,893
## Columns: 8
## $ Mouse.ID <chr> "a203", "a203", "a203", "a203", "a203", "a203", "a2~
## $ Drug.Regimen <chr> "Infubinol", "Infubinol", "Infubinol", "Infubinol",~
## $ Sex <chr> "Female", "Female", "Female", "Female", "Female", "~
## $ Age_months <int> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,~
## $ Weight..g. <int> 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 25, 25, 25,~
## $ Timepoint <int> 20, 25, 15, 10, 35, 0, 30, 5, 45, 40, 5, 40, 35, 45~
## $ Tumor.Volume..mm3. <dbl> 55.17334, 56.79321, 52.77787, 51.85244, 61.93165, 4~
## $ Metastatic.Sites <int> 1, 1, 1, 1, 2, 0, 1, 0, 2, 2, 0, 1, 1, 1, 1, 1, 1, ~
merge_df <- merge_df %>% drop_na()
head(merge_df)
## Mouse.ID Drug.Regimen Sex Age_months Weight..g. Timepoint
## 1 a203 Infubinol Female 20 23 20
## 2 a203 Infubinol Female 20 23 25
## 3 a203 Infubinol Female 20 23 15
## 4 a203 Infubinol Female 20 23 10
## 5 a203 Infubinol Female 20 23 35
## 6 a203 Infubinol Female 20 23 0
## Tumor.Volume..mm3. Metastatic.Sites
## 1 55.17334 1
## 2 56.79321 1
## 3 52.77787 1
## 4 51.85244 1
## 5 61.93165 2
## 6 45.00000 0
colnames(merge_df)[1] <- c("Mouse_Id")
colnames(merge_df)[2] <- c("Drug_Regimen")
colnames(merge_df)[5] <- c("Weight_g")
colnames(merge_df)[7] <- c("Tumor_Volume_mm3")
colnames(merge_df)[8] <- c("Metastatic_Sites")
head(merge_df)
## Mouse_Id Drug_Regimen Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1 a203 Infubinol Female 20 23 20 55.17334
## 2 a203 Infubinol Female 20 23 25 56.79321
## 3 a203 Infubinol Female 20 23 15 52.77787
## 4 a203 Infubinol Female 20 23 10 51.85244
## 5 a203 Infubinol Female 20 23 35 61.93165
## 6 a203 Infubinol Female 20 23 0 45.00000
## Metastatic_Sites
## 1 1
## 2 1
## 3 1
## 4 1
## 5 2
## 6 0
merge_df %>% group_by(Mouse_Id, Timepoint)
## # A tibble: 1,893 x 8
## # Groups: Mouse_Id, Timepoint [1,888]
## Mouse_Id Drug_Regimen Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## <chr> <chr> <chr> <int> <int> <int> <dbl>
## 1 a203 Infubinol Female 20 23 20 55.2
## 2 a203 Infubinol Female 20 23 25 56.8
## 3 a203 Infubinol Female 20 23 15 52.8
## 4 a203 Infubinol Female 20 23 10 51.9
## 5 a203 Infubinol Female 20 23 35 61.9
## 6 a203 Infubinol Female 20 23 0 45
## 7 a203 Infubinol Female 20 23 30 59.5
## 8 a203 Infubinol Female 20 23 5 48.5
## 9 a203 Infubinol Female 20 23 45 68.0
## 10 a203 Infubinol Female 20 23 40 63.6
## # ... with 1,883 more rows, and 1 more variable: Metastatic_Sites <int>
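The grouped output above reports 1,888 groups for 1,893 rows, which suggests that a few (Mouse_Id, Timepoint) pairs occur more than once. A minimal sketch of how such pairs could be listed, using a small made-up data frame (toy_df and its values are hypothetical, not taken from the study data):

```r
# Hypothetical toy data: mouse "m1" has two rows at Timepoint 5.
toy_df <- data.frame(
  Mouse_Id = c("m1", "m1", "m1", "m2", "m2"),
  Timepoint = c(0, 5, 5, 0, 5),
  Tumor_Volume_mm3 = c(45.0, 46.1, 47.3, 45.0, 44.2)
)

# Flag every row whose (Mouse_Id, Timepoint) key appears more than once.
key <- toy_df[, c("Mouse_Id", "Timepoint")]
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)

dup_rows <- toy_df[is_dup, ]
dup_rows
```

Note that duplicated() applied to a whole data frame only flags rows that are identical in every column. Restricting it to the key columns, as sketched here, also catches rows where the same mouse and timepoint carry different measurements.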
head(merge_df)
## Mouse_Id Drug_Regimen Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1 a203 Infubinol Female 20 23 20 55.17334
## 2 a203 Infubinol Female 20 23 25 56.79321
## 3 a203 Infubinol Female 20 23 15 52.77787
## 4 a203 Infubinol Female 20 23 10 51.85244
## 5 a203 Infubinol Female 20 23 35 61.93165
## 6 a203 Infubinol Female 20 23 0 45.00000
## Metastatic_Sites
## 1 1
## 2 1
## 3 1
## 4 1
## 5 2
## 6 0
df1 <- select(merge_df, Drug_Regimen, Tumor_Volume_mm3, Age_months, Weight_g)
head(df1)
## Drug_Regimen Tumor_Volume_mm3 Age_months Weight_g
## 1 Infubinol 55.17334 20 23
## 2 Infubinol 56.79321 20 23
## 3 Infubinol 52.77787 20 23
## 4 Infubinol 51.85244 20 23
## 5 Infubinol 61.93165 20 23
## 6 Infubinol 45.00000 20 23
df1 <- group_by(df1, Drug_Regimen)
head(df1)
## # A tibble: 6 x 4
## # Groups: Drug_Regimen [1]
## Drug_Regimen Tumor_Volume_mm3 Age_months Weight_g
## <chr> <dbl> <int> <int>
## 1 Infubinol 55.2 20 23
## 2 Infubinol 56.8 20 23
## 3 Infubinol 52.8 20 23
## 4 Infubinol 51.9 20 23
## 5 Infubinol 61.9 20 23
## 6 Infubinol 45 20 23
stats_df <- df1 %>% summarise(
  Tumor_Volume_mean = mean(Tumor_Volume_mm3),
  Tumor_Volume_median = median(Tumor_Volume_mm3),
  Tumor_Volume_sd = sd(Tumor_Volume_mm3),
  # standard error of the mean: sd / sqrt(number of rows in the group)
  Tumor_Volume_se = sd(Tumor_Volume_mm3) / sqrt(n()))
head(stats_df)
## # A tibble: 6 x 5
## Drug_Regimen Tumor_Volume_me~ Tumor_Volume_me~ Tumor_Volume_sd Tumor_Volume_se
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Capomulin 40.7 41.6 4.99 0.329
## 2 Ceftamin 52.6 51.8 6.27 0.470
## 3 Infubinol 52.9 51.8 6.57 0.492
## 4 Ketapril 55.2 53.7 8.28 0.604
## 5 Naftisol 54.3 52.5 8.13 0.596
## 6 Placebo 54.0 52.3 7.82 0.581
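The Tumor_Volume_se column is the standard error of the mean: the standard deviation divided by the square root of the number of observations. As a rough check for Capomulin, taking s ≈ 4.99 from the table above and assuming n = 230 rows (the Capomulin row count reported later in this analysis):

```latex
\mathrm{SE} = \frac{s}{\sqrt{n}} \approx \frac{4.99}{\sqrt{230}} \approx 0.329
```

Here n counts repeated measurements on the same mice rather than independent animals, so this standard error is likely to be optimistic.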
library(ggplot2)
# plot mean tumor volume by drug regimen
ggplot(stats_df,
aes(x = Drug_Regimen,
y = Tumor_Volume_mean)) +
geom_bar(stat = "identity", fill = "cornflowerblue")
ggplot(merge_df,
aes(x = Drug_Regimen,
y = Tumor_Volume_mm3)) +
geom_boxplot() +
labs(title = "Tumor volume distribution by Drug_Regimen")
count_df <- df1 %>% count(Drug_Regimen)
count_df
## # A tibble: 10 x 2
## # Groups: Drug_Regimen [10]
## Drug_Regimen n
## <chr> <int>
## 1 Capomulin 230
## 2 Ceftamin 178
## 3 Infubinol 178
## 4 Ketapril 188
## 5 Naftisol 186
## 6 Placebo 181
## 7 Propriva 161
## 8 Ramicane 228
## 9 Stelasyn 181
## 10 Zoniferol 182
# Use the counts computed in count_df instead of retyping them by hand
barplot(count_df$n,
names.arg = count_df$Drug_Regimen,
ylim = c(0, 250),
col = c("beige", "orange", "lightgreen", "lightblue", "yellow", "blue", "green", "pink", "purple", "red"),
ylab = "Number of measurements per drug regimen")
merge_df <- merge_df[!duplicated(merge_df), ]
head(merge_df)
## Mouse_Id Drug_Regimen Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1 a203 Infubinol Female 20 23 20 55.17334
## 2 a203 Infubinol Female 20 23 25 56.79321
## 3 a203 Infubinol Female 20 23 15 52.77787
## 4 a203 Infubinol Female 20 23 10 51.85244
## 5 a203 Infubinol Female 20 23 35 61.93165
## 6 a203 Infubinol Female 20 23 0 45.00000
## Metastatic_Sites
## 1 1
## 2 1
## 3 1
## 4 1
## 5 2
## 6 0
capomulin_df <- filter(merge_df, Drug_Regimen == "Capomulin")
infubinol_df <- filter(merge_df, Drug_Regimen == "Infubinol")
ketapril_df <- filter(merge_df, Drug_Regimen == "Ketapril")
placebo_df <- filter(merge_df, Drug_Regimen == "Placebo")
head(capomulin_df)
## Mouse_Id Drug_Regimen Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1 b128 Capomulin Female 9 22 5 45.65133
## 2 b128 Capomulin Female 9 22 25 43.26214
## 3 b128 Capomulin Female 9 22 35 37.96764
## 4 b128 Capomulin Female 9 22 10 43.27085
## 5 b128 Capomulin Female 9 22 0 45.00000
## 6 b128 Capomulin Female 9 22 40 38.37973
## Metastatic_Sites
## 1 0
## 2 1
## 3 1
## 4 0
## 5 0
## 6 2
capo_df1 <- select(capomulin_df, Mouse_Id, Timepoint, Tumor_Volume_mm3) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
head(capo_df1)
## # A tibble: 6 x 3
## # Groups: Mouse_Id [6]
## Mouse_Id Timepoint Tumor_Volume_mm3
## <chr> <int> <dbl>
## 1 b128 45 39.0
## 2 b742 45 38.9
## 3 f966 20 30.5
## 4 g288 45 37.1
## 5 g316 45 40.2
## 6 i557 45 47.7
capo_df2 <- select(capomulin_df, Mouse_Id, Weight_g) %>%
group_by(Mouse_Id) %>%
summarise(Average_weight = mean(Weight_g, na.rm=TRUE))
head(capo_df2)
## # A tibble: 6 x 2
## Mouse_Id Average_weight
## <chr> <dbl>
## 1 b128 22
## 2 b742 21
## 3 f966 17
## 4 g288 19
## 5 g316 22
## 6 i557 24
capo_df <- capo_df1 %>% inner_join(capo_df2, by = "Mouse_Id")
head(capo_df)
## # A tibble: 6 x 4
## # Groups: Mouse_Id [6]
## Mouse_Id Timepoint Tumor_Volume_mm3 Average_weight
## <chr> <int> <dbl> <dbl>
## 1 b128 45 39.0 22
## 2 b742 45 38.9 21
## 3 f966 20 30.5 17
## 4 g288 45 37.1 19
## 5 g316 45 40.2 22
## 6 i557 45 47.7 24
capo_df3 <- select(capomulin_df, Mouse_Id, Age_months) %>%
group_by(Mouse_Id) %>%
summarise(Average_age = mean(Age_months, na.rm=TRUE))
head(capo_df3)
## # A tibble: 6 x 2
## Mouse_Id Average_age
## <chr> <dbl>
## 1 b128 9
## 2 b742 7
## 3 f966 16
## 4 g288 3
## 5 g316 22
## 6 i557 1
capo_df <- capo_df %>% inner_join(capo_df3, by = "Mouse_Id")
head(capo_df)
## # A tibble: 6 x 5
## # Groups: Mouse_Id [6]
## Mouse_Id Timepoint Tumor_Volume_mm3 Average_weight Average_age
## <chr> <int> <dbl> <dbl> <dbl>
## 1 b128 45 39.0 22 9
## 2 b742 45 38.9 21 7
## 3 f966 20 30.5 17 16
## 4 g288 45 37.1 19 3
## 5 g316 45 40.2 22 22
## 6 i557 45 47.7 24 1
capo_df$Tumor_Volume_mm3 %>%
summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 23.34 32.38 38.13 36.67 40.16 47.69
capo_df$Tumor_Volume_mm3 %>% sd()
## [1] 5.715188
# Scatter plot of final tumor volume against age
plot(capo_df$Average_age, capo_df$Tumor_Volume_mm3, pch = 19, col = "blue")
# Regression line
abline(lm(capo_df$Tumor_Volume_mm3 ~ capo_df$Average_age), col = "red", lwd = 3)
# Pearson correlation, written in the top margin so it is not clipped by the plot region
mtext(paste("Correlation:", round(cor(capo_df$Average_age, capo_df$Tumor_Volume_mm3), 2)), side = 3)
# Scatter plot of final tumor volume against weight
plot(capo_df$Average_weight, capo_df$Tumor_Volume_mm3, pch = 19, col = "blue")
# Regression line
abline(lm(capo_df$Tumor_Volume_mm3 ~ capo_df$Average_weight), col = "red", lwd = 3)
# Pearson correlation, written in the top margin so it is not clipped by the plot region
mtext(paste("Correlation:", round(cor(capo_df$Average_weight, capo_df$Tumor_Volume_mm3), 2)), side = 3)
pairs(capo_df[,2:5], pch = 19, col = "blue")
infu_df1 <- select(infubinol_df, Mouse_Id, Timepoint, Tumor_Volume_mm3) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
### Find the average weight by mice_id in Infubinol_df
infu_df2 <- select(infubinol_df, Mouse_Id, Weight_g) %>%
group_by(Mouse_Id) %>%
summarise(Average_weight = mean(Weight_g, na.rm=TRUE))
### Joining the two df's for adding average weight
infu_df <- infu_df1 %>% inner_join(infu_df2, by = "Mouse_Id")
### Find the average age by Mouse_Id in infubinol_df
infu_df3 <- select(infubinol_df, Mouse_Id, Age_months) %>%
group_by(Mouse_Id) %>%
summarise(Average_age = mean(Age_months, na.rm=TRUE))
### Joining the two df's for adding average age
infu_df <- infu_df %>% inner_join(infu_df3, by = "Mouse_Id")
head(infu_df)
## # A tibble: 6 x 5
## # Groups: Mouse_Id [6]
## Mouse_Id Timepoint Tumor_Volume_mm3 Average_weight Average_age
## <chr> <int> <dbl> <dbl> <dbl>
## 1 a203 45 68.0 23 20
## 2 a251 45 65.5 25 21
## 3 a577 30 57.0 25 6
## 4 a685 45 66.1 30 8
## 5 c139 45 72.2 28 11
## 6 c326 5 36.3 25 18
infu_df$Tumor_Volume_mm3 %>%
summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36.32 54.05 60.17 58.18 65.53 72.23
infu_df$Tumor_Volume_mm3 %>% sd()
## [1] 8.602957
# Scatter plot of final tumor volume against age
plot(infu_df$Average_age, infu_df$Tumor_Volume_mm3, pch = 19, col = "green")
# Regression line
abline(lm(infu_df$Tumor_Volume_mm3 ~ infu_df$Average_age), col = "red", lwd = 3)
# Pearson correlation, written in the top margin so it is not clipped by the plot region
mtext(paste("Correlation:", round(cor(infu_df$Average_age, infu_df$Tumor_Volume_mm3), 2)), side = 3)
# Scatter plot of final tumor volume against weight
plot(infu_df$Average_weight, infu_df$Tumor_Volume_mm3, pch = 19, col = "green")
# Regression line
abline(lm(infu_df$Tumor_Volume_mm3 ~ infu_df$Average_weight), col = "red", lwd = 3)
# Pearson correlation, written in the top margin so it is not clipped by the plot region
mtext(paste("Correlation:", round(cor(infu_df$Average_weight, infu_df$Tumor_Volume_mm3), 2)), side = 3)
pairs(infu_df[,2:5], pch = 19, col = "green")
keta_df1 <- select(ketapril_df, Mouse_Id, Timepoint, Tumor_Volume_mm3) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
### Find the average weight by Mouse_Id in ketapril_df
keta_df2 <- select(ketapril_df, Mouse_Id, Weight_g) %>%
group_by(Mouse_Id) %>%
summarise(Average_weight = mean(Weight_g, na.rm=TRUE))
### Joining the two df's for adding average weight
keta_df <- keta_df1 %>% inner_join(keta_df2, by = "Mouse_Id")
### Find the average age by Mouse_Id in ketapril_df
keta_df3 <- select(ketapril_df, Mouse_Id, Age_months) %>%
group_by(Mouse_Id) %>%
summarise(Average_age = mean(Age_months, na.rm=TRUE))
### Joining the two df's for adding average age
keta_df <- keta_df %>% inner_join(keta_df3, by = "Mouse_Id")
head(keta_df)
## # A tibble: 6 x 5
## # Groups: Mouse_Id [6]
## Mouse_Id Timepoint Tumor_Volume_mm3 Average_weight Average_age
## <chr> <int> <dbl> <dbl> <dbl>
## 1 a457 10 49.8 30 11
## 2 c580 30 58.0 25 22
## 3 c819 40 62.2 25 21
## 4 c832 45 65.4 29 18
## 5 d474 40 60.2 27 18
## 6 f278 5 48.2 30 12
keta_df$Tumor_Volume_mm3 %>%
summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 45.00 56.72 64.49 62.81 69.87 78.57
keta_df$Tumor_Volume_mm3 %>% sd()
## [1] 9.94592
# Scatter plot of final tumor volume against age
plot(keta_df$Average_age, keta_df$Tumor_Volume_mm3, pch = 19, col = "purple")
# Regression line
abline(lm(keta_df$Tumor_Volume_mm3 ~ keta_df$Average_age), col = "red", lwd = 3)
# Pearson correlation, written in the top margin so it is not clipped by the plot region
mtext(paste("Correlation:", round(cor(keta_df$Average_age, keta_df$Tumor_Volume_mm3), 2)), side = 3)
# Scatter plot of final tumor volume against weight
plot(keta_df$Average_weight, keta_df$Tumor_Volume_mm3, pch = 19, col = "purple")
# Regression line
abline(lm(keta_df$Tumor_Volume_mm3 ~ keta_df$Average_weight), col = "red", lwd = 3)
# Pearson correlation, written in the top margin so it is not clipped by the plot region
mtext(paste("Correlation:", round(cor(keta_df$Average_weight, keta_df$Tumor_Volume_mm3), 2)), side = 3)
pairs(keta_df[,2:5], pch = 19, col = "purple")
plac_df1 <- select(placebo_df, Mouse_Id, Timepoint, Tumor_Volume_mm3) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
### Find the average weight by Mouse_Id in placebo_df
plac_df2 <- select(placebo_df, Mouse_Id, Weight_g) %>%
group_by(Mouse_Id) %>%
summarise(Average_weight = mean(Weight_g, na.rm=TRUE))
### Joining the two df's for adding average weight
plac_df <- plac_df1 %>% inner_join(plac_df2, by = "Mouse_Id")
### Find the average age by Mouse_Id in placebo_df
plac_df3 <- select(placebo_df, Mouse_Id, Age_months) %>%
group_by(Mouse_Id) %>%
summarise(Average_age = mean(Age_months, na.rm=TRUE))
### Joining the two df's for adding average age
plac_df <- plac_df %>% inner_join(plac_df3, by = "Mouse_Id")
head(plac_df)
## # A tibble: 6 x 5
## # Groups: Mouse_Id [6]
## Mouse_Id Timepoint Tumor_Volume_mm3 Average_weight Average_age
## <chr> <int> <dbl> <dbl> <dbl>
## 1 a262 45 70.7 29 17
## 2 a897 45 72.3 28 7
## 3 c282 45 65.8 27 12
## 4 c757 45 69.0 27 9
## 5 c766 45 69.8 26 13
## 6 e227 45 73.2 30 1
plac_df$Tumor_Volume_mm3 %>%
summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 45.00 52.94 62.03 60.51 68.13 73.21
plac_df$Tumor_Volume_mm3 %>% sd()
## [1] 8.874672
# Scatter plot of final tumor volume against age
plot(plac_df$Average_age, plac_df$Tumor_Volume_mm3, pch = 19, col = "lightblue")
# Regression line
abline(lm(plac_df$Tumor_Volume_mm3 ~ plac_df$Average_age), col = "red", lwd = 3)
# Pearson correlation, written in the top margin so it is not clipped by the plot region
mtext(paste("Correlation:", round(cor(plac_df$Average_age, plac_df$Tumor_Volume_mm3), 2)), side = 3)
# Scatter plot of final tumor volume against weight
plot(plac_df$Average_weight, plac_df$Tumor_Volume_mm3, pch = 19, col = "lightblue")
# Regression line
abline(lm(plac_df$Tumor_Volume_mm3 ~ plac_df$Average_weight), col = "red", lwd = 3)
# Pearson correlation, written in the top margin so it is not clipped by the plot region
mtext(paste("Correlation:", round(cor(plac_df$Average_weight, plac_df$Tumor_Volume_mm3), 2)), side = 3)
pairs(plac_df[,2:5], pch = 19, col = "lightblue")
From the plots above, there appears to be a correlation between weight and tumor size for the Capomulin drug regimen; this is checked below by calculating the correlation coefficient.
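The four per-regimen blocks above repeat the same steps: keep each mouse's last timepoint, then attach its weight and age. One possible way to express this once is sketched below in base R. The helper name final_by_mouse and the toy data are hypothetical, and the sketch assumes weight and age are constant per mouse, so it keeps the original Weight_g and Age_months columns rather than computing averages.

```r
# Sketch of a reusable helper (final_by_mouse is a hypothetical name).
# For a given regimen it keeps, per mouse, the row at that mouse's last timepoint.
final_by_mouse <- function(df, regimen) {
  sub <- df[df$Drug_Regimen == regimen, ]
  # ave() returns, for every row, the max Timepoint within that row's mouse.
  last_tp <- ave(sub$Timepoint, sub$Mouse_Id, FUN = max)
  out <- sub[sub$Timepoint == last_tp,
             c("Mouse_Id", "Timepoint", "Tumor_Volume_mm3", "Weight_g", "Age_months")]
  rownames(out) <- NULL
  out
}

# Hypothetical toy data standing in for merge_df.
toy_df <- data.frame(
  Mouse_Id = c("a1", "a1", "a1", "b2", "b2", "c3"),
  Drug_Regimen = c("Capomulin", "Capomulin", "Capomulin",
                   "Capomulin", "Capomulin", "Placebo"),
  Timepoint = c(0, 5, 10, 0, 5, 0),
  Tumor_Volume_mm3 = c(45, 44, 42, 45, 46, 45),
  Weight_g = c(20, 20, 20, 22, 22, 27),
  Age_months = c(9, 9, 9, 7, 7, 12)
)

toy_final <- final_by_mouse(toy_df, "Capomulin")
toy_final
```

With a helper of this shape, each regimen would need a single call instead of five separate statements, which also avoids copy-paste slips in comments and variable names.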
library(tidyverse)
library(ggpubr)
## Warning: package 'ggpubr' was built under R version 4.1.3
library(rstatix)
## Warning: package 'rstatix' was built under R version 4.1.3
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
library(tidyverse)
library(infer)
## Warning: package 'infer' was built under R version 4.1.2
##
## Attaching package: 'infer'
## The following objects are masked from 'package:rstatix':
##
## chisq_test, prop_test, t_test
library(moonBook)
## Warning: package 'moonBook' was built under R version 4.1.3
library(webr)
## Warning: package 'webr' was built under R version 4.1.3
The Pearson correlation coefficient (r) measures the linear dependence between two variables (x and y). It is also known as a parametric correlation test because it depends on the distribution of the data, and it should be used only when x and y are approximately normally distributed. The plot of y = f(x) is called the linear regression curve.
cor() computes the correlation coefficient. cor.test() tests for association/correlation between paired samples and returns both the correlation coefficient and the significance level (p-value) of the correlation.
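To make the link between cor() and cor.test() concrete, the following sketch computes Pearson's r from its definition and then the t statistic that the Pearson test is based on, t = r * sqrt(n - 2) / sqrt(1 - r^2). The vectors x and y are hypothetical toy values, not the study data.

```r
# Hypothetical toy vectors (not the study data).
x <- c(17, 19, 21, 22, 24, 25)              # e.g. weights in g
y <- c(30.5, 37.1, 38.9, 39.0, 47.7, 46.5)  # e.g. tumor volumes in mm3

# Pearson r from its definition: co-variation scaled by both spreads.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# t statistic for testing r = 0, with n - 2 degrees of freedom.
n <- length(x)
t_manual <- r_manual * sqrt(n - 2) / sqrt(1 - r_manual^2)

# Compare with the built-in functions.
ct <- cor.test(x, y, method = "pearson")
c(r_manual = r_manual, r_builtin = cor(x, y),
  t_manual = t_manual, t_builtin = unname(ct$statistic))
```

The manual and built-in values should agree up to floating-point error, which is a quick way to confirm what the test output is reporting.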
cor(capo_df$Tumor_Volume_mm3, capo_df$Average_weight, method = "pearson")
## [1] 0.876706
cor.test(capo_df$Tumor_Volume_mm3, capo_df$Average_weight, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: capo_df$Tumor_Volume_mm3 and capo_df$Average_weight
## t = 8.7408, df = 23, p-value = 9.084e-09
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7368195 0.9446109
## sample estimates:
## cor
## 0.876706
### Visualizing the Data:
ggscatter(capo_df, x = "Tumor_Volume_mm3", y = "Average_weight",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Tumor Size mm3", ylab = "Average Weight")
## `geom_smooth()` using formula 'y ~ x'
R is a measure of any linear trend between two variables. The value of R ranges between −1 and 1.
From the plot above, R = 0.88 indicates a strong positive linear relationship.
lrModel <- lm(Tumor_Volume_mm3~Average_weight, data=capo_df)
summary(lrModel)
##
## Call:
## lm(formula = Tumor_Volume_mm3 ~ Average_weight, data = capo_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8625 -1.3779 0.5453 1.3294 4.2266
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.4481 4.0683 0.356 0.725
## Average_weight 1.7505 0.2003 8.741 9.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.808 on 23 degrees of freedom
## Multiple R-squared: 0.7686, Adjusted R-squared: 0.7586
## F-statistic: 76.4 on 1 and 23 DF, p-value: 9.084e-09
This shows that 76.86% of the variation in tumor size can be explained by Average_weight for the Capomulin drug regimen.
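For a simple linear regression with a single predictor, R-squared equals the square of the Pearson correlation, so this figure can be cross-checked against the correlation reported earlier (r = 0.876706):

```latex
R^2 = r^2 = (0.876706)^2 \approx 0.7686
```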
A correlation test is performed to evaluate the association between two or more variables.
The variables tested here are keta_df$Average_age and keta_df$Tumor_Volume_mm3.
cor(keta_df$Tumor_Volume_mm3, keta_df$Average_age, method = "pearson")
## [1] 0.2763875
cor.test(keta_df$Tumor_Volume_mm3, keta_df$Average_age, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: keta_df$Tumor_Volume_mm3 and keta_df$Average_age
## t = 1.3792, df = 23, p-value = 0.1811
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1333012 0.6054028
## sample estimates:
## cor
## 0.2763875
ggscatter(keta_df, x = "Tumor_Volume_mm3", y = "Average_age",
add = "reg.line", conf.int = TRUE,
cor.coef = TRUE, cor.method = "pearson",
xlab = "Tumor Size mm3", ylab = "Average Age")
## `geom_smooth()` using formula 'y ~ x'
R is a measure of any linear trend between two variables. The value of R ranges between −1 and 1.
From the plot above, R = 0.28 indicates a weak linear relationship, and with p = 0.18 it is not statistically significant at the 5% level.
lrModel <- lm(Tumor_Volume_mm3~Average_age, data=keta_df)
summary(lrModel)
##
## Call:
## lm(formula = Tumor_Volume_mm3 ~ Average_age, data = keta_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.925 -7.852 1.605 5.248 15.343
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 56.5532 4.9363 11.457 5.54e-11 ***
## Average_age 0.4248 0.3080 1.379 0.181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.764 on 23 degrees of freedom
## Multiple R-squared: 0.07639, Adjusted R-squared: 0.03623
## F-statistic: 1.902 on 1 and 23 DF, p-value: 0.1811
This shows that only 7.64% of the variation in tumor size can be explained by Average_age for the Ketapril drug regimen.
#fit regression model
model <- lm(Tumor_Volume_mm3~Average_weight+Average_age, data=capo_df)
#view model summary
summary(model)
##
## Call:
## lm(formula = Tumor_Volume_mm3 ~ Average_weight + Average_age,
## data = capo_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.2433 -0.7903 0.1667 1.5253 4.8761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.48591 4.32578 0.112 0.912
## Average_weight 1.76129 0.20298 8.677 1.49e-08 ***
## Average_age 0.05303 0.07401 0.716 0.481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.838 on 22 degrees of freedom
## Multiple R-squared: 0.7739, Adjusted R-squared: 0.7533
## F-statistic: 37.65 on 2 and 22 DF, p-value: 7.898e-08
The coefficient of determination (commonly denoted R2) is the proportion of the variance in the response variable that can be explained by the explanatory variables in a regression model.
The R-squared of the model (shown near the very bottom of the output) turns out to be 0.7739.
This means that 77.39% of the variation in tumor size can be explained by the weight and age of the mice.
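The adjusted R-squared in the same output penalizes the extra predictor. As a worked check, assuming n = 25 mice and p = 2 predictors (consistent with the 22 residual degrees of freedom in the output above):

```latex
R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}
= 1 - (1 - 0.7739)\,\frac{24}{22} \approx 0.7533
```

Adding age raises R-squared only from 0.7686 to 0.7739 while the adjusted value drops from 0.7586 to 0.7533, which suggests that age adds little beyond weight.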
capo_df1 <- select(capomulin_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
head(capo_df1)
## # A tibble: 6 x 4
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3
## <chr> <chr> <int> <dbl>
## 1 b128 Capomulin 45 39.0
## 2 b742 Capomulin 45 38.9
## 3 f966 Capomulin 20 30.5
## 4 g288 Capomulin 45 37.1
## 5 g316 Capomulin 45 40.2
## 6 i557 Capomulin 45 47.7
capo_df1 <- select(capomulin_df, Mouse_Id, Drug_Regimen, Tumor_Volume_mm3)
head(capo_df1)
## Mouse_Id Drug_Regimen Tumor_Volume_mm3
## 1 b128 Capomulin 45.65133
## 2 b128 Capomulin 43.26214
## 3 b128 Capomulin 37.96764
## 4 b128 Capomulin 43.27085
## 5 b128 Capomulin 45.00000
## 6 b128 Capomulin 38.37973
plac_df1 <- select(placebo_df, Mouse_Id, Timepoint, Drug_Regimen, Tumor_Volume_mm3) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
head(plac_df1)
## # A tibble: 6 x 4
## # Groups: Mouse_Id [6]
## Mouse_Id Timepoint Drug_Regimen Tumor_Volume_mm3
## <chr> <int> <chr> <dbl>
## 1 a262 45 Placebo 70.7
## 2 a897 45 Placebo 72.3
## 3 c282 45 Placebo 65.8
## 4 c757 45 Placebo 69.0
## 5 c766 45 Placebo 69.8
## 6 e227 45 Placebo 73.2
plac_df1 <- select(placebo_df, Mouse_Id, Drug_Regimen, Tumor_Volume_mm3) %>%
group_by(Mouse_Id)
head(plac_df1)
## # A tibble: 6 x 3
## # Groups: Mouse_Id [1]
## Mouse_Id Drug_Regimen Tumor_Volume_mm3
## <chr> <chr> <dbl>
## 1 a262 Placebo 69.6
## 2 a262 Placebo 45
## 3 a262 Placebo 53.8
## 4 a262 Placebo 60.1
## 5 a262 Placebo 65.0
## 6 a262 Placebo 57.0
plac_df1 <- select(placebo_df, Drug_Regimen, Tumor_Volume_mm3)
head(plac_df1)
## Drug_Regimen Tumor_Volume_mm3
## 1 Placebo 69.59273
## 2 Placebo 45.00000
## 3 Placebo 53.82797
## 4 Placebo 60.13186
## 5 Placebo 64.95809
## 6 Placebo 57.01331
df1 <- capo_df1 %>% full_join(plac_df1)
## Joining, by = c("Drug_Regimen", "Tumor_Volume_mm3")
df1 <- select(df1, Drug_Regimen, Tumor_Volume_mm3)
head(df1)
## Drug_Regimen Tumor_Volume_mm3
## 1 Capomulin 45.65133
## 2 Capomulin 43.26214
## 3 Capomulin 37.96764
## 4 Capomulin 43.27085
## 5 Capomulin 45.00000
## 6 Capomulin 38.37973
df1 %>%
group_by(Drug_Regimen) %>%
get_summary_stats(Tumor_Volume_mm3, type = "mean_sd")
## # A tibble: 2 x 5
## Drug_Regimen variable n mean sd
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Capomulin Tumor_Volume_mm3 230 40.7 5.00
## 2 Placebo Tumor_Volume_mm3 181 54.0 7.82
For the Capomulin mice at their final timepoint: sample size (n) = 25, sample mean tumor size (xbar) = 36.67 mm3, standard deviation = 5.72 mm3.
Null hypothesis: There is no difference between the effectiveness of the four drug regimens. In other words, the difference in mean tumor size between the Capomulin drug regimen and Placebo is zero.
Null Hypothesis H0: mu = 54.034 (the Placebo mean).
Alternate Hypothesis HA: mu != 54.034.
This is a two-tailed test.
obs_diff <- df1 %>%
specify(Tumor_Volume_mm3 ~ Drug_Regimen) %>%
calculate(stat = "diff in means", order = c("Capomulin", "Placebo"))
obs_diff
## Response: Tumor_Volume_mm3 (numeric)
## Explanatory: Drug_Regimen (factor)
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 -13.4
null_dist <- df1 %>%
specify(Tumor_Volume_mm3 ~ Drug_Regimen) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Capomulin", "Placebo"))
head(null_dist)
## Response: Tumor_Volume_mm3 (numeric)
## Explanatory: Drug_Regimen (factor)
## Null Hypothesis: independence
## # A tibble: 6 x 2
## replicate stat
## <int> <dbl>
## 1 1 -0.436
## 2 2 -0.377
## 3 3 0.129
## 4 4 -1.09
## 5 5 0.798
## 6 6 0.0411
ggplot(data = null_dist, aes(x = stat, fill = "color")) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
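A permutation null distribution like the one plotted above can be turned into a two-sided p-value by counting how often a shuffled difference in means is at least as extreme as the observed one. The following is a minimal base-R sketch of that idea, not the infer pipeline itself. The two vectors are synthetic values drawn with rnorm() purely for illustration (they are not the study data), and the names treated, control and perm_diff are hypothetical.

```r
set.seed(42)  # make the shuffles reproducible

# Synthetic stand-ins for two treatment groups (NOT the study data)
treated <- rnorm(30, mean = 41, sd = 5)
control <- rnorm(30, mean = 54, sd = 8)

obs_stat <- mean(treated) - mean(control)  # observed difference in means

pooled <- c(treated, control)
n_treated <- length(treated)

# Shuffle the group labels once and recompute the difference in means
perm_diff <- function() {
  idx <- sample(length(pooled), n_treated)
  mean(pooled[idx]) - mean(pooled[-idx])
}

null_stats <- replicate(1000, perm_diff())

# Two-sided p-value: share of shuffles at least as extreme as the observed value
p_perm <- mean(abs(null_stats) >= abs(obs_stat))
p_perm
```

With only 1000 shuffles, a result of 0 means the p-value is below roughly 1/1000, not that it is exactly zero.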
x=var.test(Tumor_Volume_mm3 ~ Drug_Regimen, data = df1)
x
##
## F test to compare two variances
##
## data: Tumor_Volume_mm3 by Drug_Regimen
## F = 0.40786, num df = 229, denom df = 180, p-value = 1.857e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.3084916 0.5367226
## sample estimates:
## ratio of variances
## 0.4078559
plot(x)
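The F statistic reported above is the ratio of the two sample variances. As a rough check, it can be recomputed from the rounded standard deviations in the summary table (5.00 and 7.82), so the result only approximately matches the var.test() output. The variable names below are illustrative.

```r
# Rounded summary statistics from the table above
sd_capo <- 5.00; n_capo <- 230
sd_plac <- 7.82; n_plac <- 181

f_stat <- sd_capo^2 / sd_plac^2   # ratio of sample variances
df_num <- n_capo - 1              # numerator degrees of freedom
df_den <- n_plac - 1              # denominator degrees of freedom

# Two-sided p-value: double the smaller tail probability of the F distribution
p_lower <- pf(f_stat, df_num, df_den)
p_two_sided <- 2 * min(p_lower, 1 - p_lower)

c(F = f_stat, p = p_two_sided)
```

A ratio well below 1 together with a very small p-value suggests unequal variances, which is one reason to prefer a test that does not assume equal variances for this comparison.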
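Here we translate the test onto the t distribution by calculating the t-statistic.
For one sample compared against a fixed value $\mu$: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$. For the two independent samples compared here, the Welch form is $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$.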
The two-sample t-test is also known as the independent t-test. The independent samples t-test comes in two different forms:
the standard Student’s t-test, which assumes that the variances of the two groups are equal;
the Welch t-test, which is less restrictive than the original Student’s test because it does not assume equal variances in the two groups, and which results in fractional degrees of freedom.
By default, R's t.test() computes the Welch t-test (var.equal = FALSE).
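As a rough check on the description above, the Welch statistic and its fractional degrees of freedom can be computed by hand from the group summary statistics reported in this section. This is only a sketch: the standard deviations are the rounded values from the summary table, so the results agree with t.test() only approximately, and the variable names are illustrative.

```r
# Group summary statistics reported in this section (sds are rounded)
m1 <- 40.67574; s1 <- 5.00; n1 <- 230   # Capomulin
m2 <- 54.03358; s2 <- 7.82; n2 <- 181   # Placebo

v1 <- s1^2 / n1   # squared standard error of the Capomulin mean
v2 <- s2^2 / n2   # squared standard error of the Placebo mean

# Welch t-statistic: difference in means over the unpooled standard error
t_welch <- (m1 - m2) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation, which yields fractional degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

c(t = t_welch, df = df_welch)
```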
t_result <- t.test(Tumor_Volume_mm3 ~ Drug_Regimen, data = df1)
t_result
##
## Welch Two Sample t-test
##
## data: Tumor_Volume_mm3 by Drug_Regimen
## t = -19.993, df = 290.56, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Capomulin and group Placebo is not equal to 0
## 95 percent confidence interval:
## -14.67285 -12.04283
## sample estimates:
## mean in group Capomulin mean in group Placebo
## 40.67574 54.03358
plot(t_result)
t is the t-test statistic value (t = -19.993), df is the degrees of freedom (df = 290.56), p-value is the p-value of the t-test (p-value < 2.2e-16), conf.int is the 95% confidence interval for the difference in means (conf.int = [-14.67285, -12.04283]), and sample estimates are the group means (40.67574 for Capomulin, 54.03358 for Placebo).
The t-statistic is t = -19.993.
A Capomulin sample mean far above the Placebo sample mean would correspond to a t-statistic far above 0.
Here the Capomulin sample mean (40.67574) is far below the Placebo sample mean (54.03358), which corresponds to a t-statistic (-19.993) far below 0.
Next we identify the rejection region for the hypothesis test.
This is done by first specifying the value of alpha, which in the context of hypothesis testing is known as the significance level.
Typically alpha = 0.05 or 0.01, corresponding to 95% or 99% confidence respectively.
The cutoff values for the t-statistic, denoted t cutoff, are the values of the t distribution with the test's degrees of freedom (df = 290.56 for this Welch test; n - 1 in the one-sample case) that cut off alpha/2 probability to the right and alpha/2 probability to the left.
This is a two-tailed test, with one rejection region on the right and one rejection region on the left.
Hence, the total rejection probability alpha is divided equally between the two rejection regions.
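The cutoff values described above can be obtained from the t quantile function qt(). The sketch below plugs in the degrees of freedom and t-statistic reported by t.test() for Capomulin versus Placebo; alpha = 0.05 is an assumed significance level and the variable names are illustrative.

```r
alpha <- 0.05        # assumed significance level
df_welch <- 290.56   # degrees of freedom reported by t.test()
t_obs <- -19.993     # t-statistic reported by t.test()

# Upper cutoff leaves alpha/2 in the right tail; the lower cutoff is its negative
t_cutoff <- qt(1 - alpha / 2, df = df_welch)

# The observed statistic lies in a rejection region if it is beyond either cutoff
reject_null <- abs(t_obs) > t_cutoff

c(lower = -t_cutoff, upper = t_cutoff)
reject_null
```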
Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence. The p-value ranges between 0 and 1. It can be interpreted in the following way:
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject it. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
The p-value of the test is less than 2.2e-16, which is far below the significance level alpha = 0.05. We therefore reject the null hypothesis in favor of the alternate hypothesis and conclude that the average tumor volume under Capomulin is significantly different from the average tumor volume under placebo (p-value < 2.2e-16).
keta_df1 <- select(ketapril_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
head(keta_df1)
## # A tibble: 6 x 4
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3
## <chr> <chr> <int> <dbl>
## 1 a457 Ketapril 10 49.8
## 2 c580 Ketapril 30 58.0
## 3 c819 Ketapril 40 62.2
## 4 c832 Ketapril 45 65.4
## 5 d474 Ketapril 40 60.2
## 6 f278 Ketapril 5 48.2
keta_df1 <- select(ketapril_df, Mouse_Id, Drug_Regimen, Tumor_Volume_mm3)
head(keta_df1)
## Mouse_Id Drug_Regimen Tumor_Volume_mm3
## 1 a457 Ketapril 45.00000
## 2 a457 Ketapril 49.78342
## 3 a457 Ketapril 47.46289
## 4 c580 Ketapril 55.57532
## 5 c580 Ketapril 47.57017
## 6 c580 Ketapril 46.62845
keta_df1 <- select(ketapril_df, Mouse_Id, Timepoint, Drug_Regimen, Tumor_Volume_mm3) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
head(keta_df1)
## # A tibble: 6 x 4
## # Groups: Mouse_Id [6]
## Mouse_Id Timepoint Drug_Regimen Tumor_Volume_mm3
## <chr> <int> <chr> <dbl>
## 1 a457 10 Ketapril 49.8
## 2 c580 30 Ketapril 58.0
## 3 c819 40 Ketapril 62.2
## 4 c832 45 Ketapril 65.4
## 5 d474 40 Ketapril 60.2
## 6 f278 5 Ketapril 48.2
plac_df1 <- select(placebo_df, Mouse_Id, Drug_Regimen, Tumor_Volume_mm3) %>%
group_by(Mouse_Id)
head(plac_df1)
## # A tibble: 6 x 3
## # Groups: Mouse_Id [1]
## Mouse_Id Drug_Regimen Tumor_Volume_mm3
## <chr> <chr> <dbl>
## 1 a262 Placebo 69.6
## 2 a262 Placebo 45
## 3 a262 Placebo 53.8
## 4 a262 Placebo 60.1
## 5 a262 Placebo 65.0
## 6 a262 Placebo 57.0
plac_df1 <- select(placebo_df, Drug_Regimen, Tumor_Volume_mm3)
head(plac_df1)
## Drug_Regimen Tumor_Volume_mm3
## 1 Placebo 69.59273
## 2 Placebo 45.00000
## 3 Placebo 53.82797
## 4 Placebo 60.13186
## 5 Placebo 64.95809
## 6 Placebo 57.01331
df2 <- keta_df1 %>% full_join(plac_df1)
## Joining, by = c("Drug_Regimen", "Tumor_Volume_mm3")
df2 <- select(df2, Drug_Regimen, Tumor_Volume_mm3)
## Adding missing grouping variables: `Mouse_Id`
head(df2)
## # A tibble: 6 x 3
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Tumor_Volume_mm3
## <chr> <chr> <dbl>
## 1 a457 Ketapril 49.8
## 2 c580 Ketapril 58.0
## 3 c819 Ketapril 62.2
## 4 c832 Ketapril 65.4
## 5 d474 Ketapril 60.2
## 6 f278 Ketapril 48.2
df2 %>%
group_by(Drug_Regimen) %>%
get_summary_stats(Tumor_Volume_mm3, type = "mean_sd")
## # A tibble: 2 x 5
## Drug_Regimen variable n mean sd
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Ketapril Tumor_Volume_mm3 25 62.8 9.95
## 2 Placebo Tumor_Volume_mm3 181 54.0 7.82
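Summary statistics from the table above: Ketapril sample size (n) = 25, sample mean tumor volume (xbar) = 62.8 mm3, standard deviation = 9.95 mm3; Placebo n = 181, xbar = 54.0 mm3, standard deviation = 7.82 mm3.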
Null hypothesis: There is no difference in effectiveness between the Ketapril regimen and the placebo. In other words, the difference in mean tumor volume between the Ketapril group and the Placebo group is zero.
Null Hypothesis H0: mu_Ketapril = mu_Placebo (Placebo sample mean = 54.034) Alternate Hypothesis H1: mu_Ketapril != mu_Placebo
This is a two-tailed test.
obs_diff <- df2 %>%
specify(Tumor_Volume_mm3 ~ Drug_Regimen) %>%
calculate(stat = "diff in means", order = c("Ketapril", "Placebo"))
obs_diff
## Response: Tumor_Volume_mm3 (numeric)
## Explanatory: Drug_Regimen (factor)
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 8.77
null_dist <- df2 %>%
specify(Tumor_Volume_mm3 ~ Drug_Regimen) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Ketapril", "Placebo"))
head(null_dist)
## Response: Tumor_Volume_mm3 (numeric)
## Explanatory: Drug_Regimen (factor)
## Null Hypothesis: independence
## # A tibble: 6 x 2
## replicate stat
## <int> <dbl>
## 1 1 -4.21
## 2 2 4.74
## 3 3 -0.281
## 4 4 -0.0543
## 5 5 -2.83
## 6 6 -3.30
ggplot(data = null_dist, aes(x = stat, fill = "color")) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
x=var.test(Tumor_Volume_mm3 ~ Drug_Regimen, data = df2)
x
##
## F test to compare two variances
##
## data: Tumor_Volume_mm3 by Drug_Regimen
## F = 1.6172, num df = 24, denom df = 180, p-value = 0.0829
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.9404268 3.2111951
## sample estimates:
## ratio of variances
## 1.617205
plot(x)
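Here we translate the test onto the t distribution by calculating the t-statistic.
For one sample compared against a fixed value $\mu$: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$. For the two independent samples compared here, the Welch form is $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$.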
The two-sample t-test is also known as the independent t-test. The independent samples t-test comes in two different forms:
the standard Student’s t-test, which assumes that the variances of the two groups are equal;
the Welch t-test, which is less restrictive than the original Student’s test because it does not assume equal variances in the two groups, and which results in fractional degrees of freedom.
By default, R's t.test() computes the Welch t-test (var.equal = FALSE).
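For comparison with the Welch form, the equal-variance (Student) statistic can be sketched from the same summary statistics. This is not what the t.test() call in this report computes; it corresponds to calling t.test() with var.equal = TRUE. The standard deviations are the rounded values from the summary table, so the result is approximate, and the variable names are illustrative.

```r
# Group summary statistics reported in this section (sds are rounded)
m1 <- 62.80619; s1 <- 9.95; n1 <- 25    # Ketapril
m2 <- 54.03358; s2 <- 7.82; n2 <- 181   # Placebo

# Pooled variance weights each group's variance by its degrees of freedom
df_pooled <- n1 + n2 - 2
var_pooled <- ((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / df_pooled

# Standard error of the difference in means under the equal-variance assumption
se_pooled <- sqrt(var_pooled * (1 / n1 + 1 / n2))
t_student <- (m1 - m2) / se_pooled

c(t = t_student, df = df_pooled)
```

Here the pooled test has whole-number degrees of freedom (n1 + n2 - 2), in contrast to the fractional degrees of freedom of the Welch test.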
t_result <- t.test(Tumor_Volume_mm3 ~ Drug_Regimen, data = df2)
t_result
##
## Welch Two Sample t-test
##
## data: Tumor_Volume_mm3 by Drug_Regimen
## t = 4.2331, df = 28.247, p-value = 0.000221
## alternative hypothesis: true difference in means between group Ketapril and group Placebo is not equal to 0
## 95 percent confidence interval:
## 4.529186 13.016035
## sample estimates:
## mean in group Ketapril mean in group Placebo
## 62.80619 54.03358
plot(t_result)
t is the t-test statistic value (t = 4.2331), df is the degrees of freedom (df = 28.247), p-value is the p-value of the t-test (p-value = 0.000221), conf.int is the 95% confidence interval for the difference in means (conf.int = [4.529186, 13.016035]), and sample estimates are the group means (62.80619 for Ketapril, 54.03358 for Placebo).
The t-statistic is t = 4.2331.
Here the Ketapril sample mean (62.80619) is well above the Placebo sample mean (54.03358), which corresponds to a t-statistic (4.2331) well above 0.
A Ketapril sample mean far below the Placebo sample mean would instead correspond to a t-statistic far below 0.
Next we identify the rejection region for the hypothesis test.
This is done by first specifying the value of alpha, which in the context of hypothesis testing is known as the significance level.
Typically alpha = 0.05 or 0.01, corresponding to 95% or 99% confidence respectively.
The cutoff values for the t-statistic, denoted t cutoff, are the values of the t distribution with the test's degrees of freedom (df = 28.247 for this Welch test; n - 1 in the one-sample case) that cut off alpha/2 probability to the right and alpha/2 probability to the left.
This is a two-tailed test, with one rejection region on the right and one rejection region on the left.
Hence, the total rejection probability alpha is divided equally between the two rejection regions.
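The same cutoff value also determines the 95% confidence interval that t.test() reports for the difference in means. As a sketch, the interval can be rebuilt from the reported means, t-statistic and degrees of freedom for Ketapril versus Placebo; the standard error is backed out from t = difference / standard error, and the variable names are illustrative.

```r
alpha <- 0.05                       # assumed significance level
diff_means <- 62.80619 - 54.03358   # Ketapril mean minus Placebo mean
t_obs <- 4.2331                     # t-statistic reported by t.test()
df_welch <- 28.247                  # degrees of freedom reported by t.test()

# Standard error implied by t = diff / se
se_diff <- diff_means / t_obs

# Two-sided cutoff for the chosen alpha
t_cutoff <- qt(1 - alpha / 2, df = df_welch)

# Confidence interval: difference plus or minus cutoff times standard error
ci <- diff_means + c(-1, 1) * t_cutoff * se_diff
ci
```

Because the interval excludes 0, it leads to the same decision as comparing the t-statistic with the cutoff.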
Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence. The p-value ranges between 0 and 1. It can be interpreted in the following way:
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject it. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
The p-value of the test is 0.000221, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis in favor of the alternate hypothesis and conclude that the average tumor volume under Ketapril is significantly different from the average tumor volume under placebo (p-value = 0.000221).
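The two-sided p-value follows directly from the t-statistic and the degrees of freedom through the t distribution function pt(). A minimal sketch using the values reported for Ketapril versus Placebo (variable names are illustrative):

```r
t_obs <- 4.2331      # t-statistic reported by t.test()
df_welch <- 28.247   # degrees of freedom reported by t.test()

# Two-sided p-value: probability in both tails beyond |t|
p_value <- 2 * pt(-abs(t_obs), df = df_welch)
p_value
```

Using -abs(t_obs) takes the lower tail regardless of the sign of the statistic, and doubling it accounts for the two rejection regions.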
infu_df1 <- select(infubinol_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
head(infu_df1)
## # A tibble: 6 x 4
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3
## <chr> <chr> <int> <dbl>
## 1 a203 Infubinol 45 68.0
## 2 a251 Infubinol 45 65.5
## 3 a577 Infubinol 30 57.0
## 4 a685 Infubinol 45 66.1
## 5 c139 Infubinol 45 72.2
## 6 c326 Infubinol 5 36.3
infu_df1 <- select(infubinol_df, Mouse_Id, Drug_Regimen, Tumor_Volume_mm3)
head(infu_df1)
## Mouse_Id Drug_Regimen Tumor_Volume_mm3
## 1 a203 Infubinol 55.17334
## 2 a203 Infubinol 56.79321
## 3 a203 Infubinol 52.77787
## 4 a203 Infubinol 51.85244
## 5 a203 Infubinol 61.93165
## 6 a203 Infubinol 45.00000
infu_df1 <- select(infubinol_df, Mouse_Id, Timepoint, Drug_Regimen, Tumor_Volume_mm3) %>%
group_by(Mouse_Id) %>%
filter(Timepoint == max(Timepoint, na.rm=TRUE))
head(infu_df1)
## # A tibble: 6 x 4
## # Groups: Mouse_Id [6]
## Mouse_Id Timepoint Drug_Regimen Tumor_Volume_mm3
## <chr> <int> <chr> <dbl>
## 1 a203 45 Infubinol 68.0
## 2 a251 45 Infubinol 65.5
## 3 a577 30 Infubinol 57.0
## 4 a685 45 Infubinol 66.1
## 5 c139 45 Infubinol 72.2
## 6 c326 5 Infubinol 36.3
plac_df1 <- select(placebo_df, Mouse_Id, Drug_Regimen, Tumor_Volume_mm3) %>%
group_by(Mouse_Id)
head(plac_df1)
## # A tibble: 6 x 3
## # Groups: Mouse_Id [1]
## Mouse_Id Drug_Regimen Tumor_Volume_mm3
## <chr> <chr> <dbl>
## 1 a262 Placebo 69.6
## 2 a262 Placebo 45
## 3 a262 Placebo 53.8
## 4 a262 Placebo 60.1
## 5 a262 Placebo 65.0
## 6 a262 Placebo 57.0
plac_df1 <- select(placebo_df, Drug_Regimen, Tumor_Volume_mm3)
head(plac_df1)
## Drug_Regimen Tumor_Volume_mm3
## 1 Placebo 69.59273
## 2 Placebo 45.00000
## 3 Placebo 53.82797
## 4 Placebo 60.13186
## 5 Placebo 64.95809
## 6 Placebo 57.01331
df3 <- infu_df1 %>% full_join(plac_df1)
## Joining, by = c("Drug_Regimen", "Tumor_Volume_mm3")
df3 <- select(df3, Drug_Regimen, Tumor_Volume_mm3)
## Adding missing grouping variables: `Mouse_Id`
head(df3)
## # A tibble: 6 x 3
## # Groups: Mouse_Id [6]
## Mouse_Id Drug_Regimen Tumor_Volume_mm3
## <chr> <chr> <dbl>
## 1 a203 Infubinol 68.0
## 2 a251 Infubinol 65.5
## 3 a577 Infubinol 57.0
## 4 a685 Infubinol 66.1
## 5 c139 Infubinol 72.2
## 6 c326 Infubinol 36.3
df3 %>%
group_by(Drug_Regimen) %>%
get_summary_stats(Tumor_Volume_mm3, type = "mean_sd")
## # A tibble: 2 x 5
## Drug_Regimen variable n mean sd
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Infubinol Tumor_Volume_mm3 25 58.2 8.60
## 2 Placebo Tumor_Volume_mm3 181 54.0 7.82
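Summary statistics from the table above: Infubinol sample size (n) = 25, sample mean tumor volume (xbar) = 58.2 mm3, standard deviation = 8.60 mm3; Placebo n = 181, xbar = 54.0 mm3, standard deviation = 7.82 mm3.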
Null hypothesis: There is no difference in effectiveness between the Infubinol regimen and the placebo. In other words, the difference in mean tumor volume between the Infubinol group and the Placebo group is zero.
Null Hypothesis H0: mu_Infubinol = mu_Placebo (Placebo sample mean = 54.034) Alternate Hypothesis H1: mu_Infubinol != mu_Placebo
This is a two-tailed test.
obs_diff <- df3 %>%
specify(Tumor_Volume_mm3 ~ Drug_Regimen) %>%
calculate(stat = "diff in means", order = c("Infubinol", "Placebo"))
obs_diff
## Response: Tumor_Volume_mm3 (numeric)
## Explanatory: Drug_Regimen (factor)
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 4.14
null_dist <- df3 %>%
specify(Tumor_Volume_mm3 ~ Drug_Regimen) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("Infubinol", "Placebo"))
head(null_dist)
## Response: Tumor_Volume_mm3 (numeric)
## Explanatory: Drug_Regimen (factor)
## Null Hypothesis: independence
## # A tibble: 6 x 2
## replicate stat
## <int> <dbl>
## 1 1 -0.708
## 2 2 -0.621
## 3 3 1.59
## 4 4 -1.29
## 5 5 -0.464
## 6 6 1.54
ggplot(data = null_dist, aes(x = stat, fill = "color")) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
x=var.test(Tumor_Volume_mm3 ~ Drug_Regimen, data = df3)
x
##
## F test to compare two variances
##
## data: Tumor_Volume_mm3 by Drug_Regimen
## F = 1.21, num df = 24, denom df = 180, p-value = 0.4768
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.7036077 2.4025492
## sample estimates:
## ratio of variances
## 1.209959
plot(x)
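Here we translate the test onto the t distribution by calculating the t-statistic.
For one sample compared against a fixed value $\mu$: $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$. For the two independent samples compared here, the Welch form is $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2 / n_1 + s_2^2 / n_2}}$.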
The two-sample t-test is also known as the independent t-test. The independent samples t-test comes in two different forms:
the standard Student’s t-test, which assumes that the variances of the two groups are equal;
the Welch t-test, which is less restrictive than the original Student’s test because it does not assume equal variances in the two groups, and which results in fractional degrees of freedom.
By default, R's t.test() computes the Welch t-test (var.equal = FALSE).
t_result <- t.test(Tumor_Volume_mm3 ~ Drug_Regimen, data = df3)
t_result
##
## Welch Two Sample t-test
##
## data: Tumor_Volume_mm3 by Drug_Regimen
## t = 2.2821, df = 29.74, p-value = 0.02981
## alternative hypothesis: true difference in means between group Infubinol and group Placebo is not equal to 0
## 95 percent confidence interval:
## 0.4342457 7.8550843
## sample estimates:
## mean in group Infubinol mean in group Placebo
## 58.17825 54.03358
plot(t_result)
t is the t-test statistic value (t = 2.2821), df is the degrees of freedom (df = 29.74), p-value is the p-value of the t-test (p-value = 0.02981), conf.int is the 95% confidence interval for the difference in means (conf.int = [0.4342457, 7.8550843]), and sample estimates are the group means (58.17825 for Infubinol, 54.03358 for Placebo).
The t-statistic is t = 2.2821.
Here the Infubinol sample mean (58.17825) is above the Placebo sample mean (54.03358), which corresponds to a positive t-statistic (2.2821).
An Infubinol sample mean below the Placebo sample mean would instead correspond to a negative t-statistic.
Next we identify the rejection region for the hypothesis test.
This is done by first specifying the value of alpha, which in the context of hypothesis testing is known as the significance level.
Typically alpha = 0.05 or 0.01, corresponding to 95% or 99% confidence respectively.
The cutoff values for the t-statistic, denoted t cutoff, are the values of the t distribution with the test's degrees of freedom (df = 29.74 for this Welch test; n - 1 in the one-sample case) that cut off alpha/2 probability to the right and alpha/2 probability to the left.
This is a two-tailed test, with one rejection region on the right and one rejection region on the left.
Hence, the total rejection probability alpha is divided equally between the two rejection regions.
Hypothesis testing ultimately uses a p-value to weigh the strength of the evidence. The p-value ranges between 0 and 1. It can be interpreted in the following way:
A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject it. A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject it.
The p-value of the test is 0.02981, which is less than the significance level alpha = 0.05. We therefore reject the null hypothesis in favor of the alternate hypothesis and conclude that the average tumor volume under Infubinol is significantly different from the average tumor volume under placebo (p-value = 0.02981). Note that this p-value is greater than 0.01, so the null hypothesis would not be rejected at the stricter alpha = 0.01 level.
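Since both alpha = 0.05 and alpha = 0.01 are mentioned as typical significance levels, it can help to tabulate the decision for each comparison at both levels. The sketch below uses the p-values reported by the three t.test() calls in this report; for Capomulin only the bound "< 2.2e-16" is reported, so that bound is used as a stand-in. The object names are illustrative.

```r
# p-values reported by t.test(); Capomulin is only known to be below 2.2e-16
p_values <- c(Capomulin = 2.2e-16, Ketapril = 0.000221, Infubinol = 0.02981)
alphas <- c(alpha_0.05 = 0.05, alpha_0.01 = 0.01)

# TRUE where the null hypothesis is rejected (p-value below alpha)
reject <- outer(p_values, alphas, FUN = "<")
reject
```

The Infubinol comparison is the only one whose decision changes between the two significance levels.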