Final Project Data 606.Rmd

Overview:

Pymaceuticals Inc., a fictional burgeoning pharmaceutical company based out of San Diego, CA, specializes in drug-based, anti-cancer pharmaceuticals. They have provided the data to test the efficacy of potential drug treatments for squamous cell carcinoma. In this study, 249 mice identified with Squamous cell carcinoma (SCC) tumor growth, kind of skin cancer, were treated through a variety of drug regimens. Over the course of 45 days, tumor development was observed and measured. The objective is to analyze the data to show how four treatments (Capomulin, Infubinol, Ketapril, and Placebo) compare.

Research question:

Question 1: Is Capomulin more or less effective in reducing the tumor size than Infubinol, or Ketapril drugs categories?

Question 2: Is there a correlation between the age, weight and the tumor size growth for each drug category?

Hypothesis Test

Null Hypothesis: There is no difference between the mean percent change in tumor volume for the four drug categories.

Alternate Hypothesis: There is a difference between the mean percent change in tumor volume for the four drug categories.

Approach for answering the research question will be:

1- Calculate the mean percent change in tumor volume for the four drug categories.

2- Perform the Hypothesis test to find out whether the difference exists between the mean tumor size for all four drug categories.

3- Perform linear regression to study the correlation between various variables by calculating the correlation coefficient.

4- Finally analyze the results to find out if Capomulin more or less effective in reducing the tumor size of sample mice than Infubinol, or Ketapril drugs categories.

Cases:

What are the cases? How many different drug treatments are there? How many total samples size as well as the sample size by drug treatments are there?

Answer: The metadata_df contain 249 unique mouse id and so are the number of cases that treated with variety of drug regimen. The results_df dataset holds the tumor growth measurements observed for each Mouse ID and carries 1,893 rows results. There are 10 different drug treatments. The total sample size of mice for four treatments (Capomulin, Infubinol, Ketapril, and Placebo) is 100 and the sample size of mice by drug treatments is 25 each.

Data collection:

Describe the method of data collection.

Answer: Data is collected by the fictitious pharmaceutical company who was testing the efficacy of potential drug treatments for squamous cell carcinoma. I import the data into Rmd file from GitHub.

Type of study:

What type of study is this (observational/experiment)?

Answer: This is an experimental study. A group of 249 mice were monitored after administration of a variety of drug regimens over a 45-day treatment period. The impact of Capomulin on tumor growth, metastasis and survival rates were monitored, along with Infubinol, Ketapril, and Placebo.

Data Source:

If you collected the data, state self-collected. If not, provide a citation/link.

Answer: The citation and data collection links are as follows.

In my search for the experimental datasets, I found the mouse metadata and the Study results on the GitHub link provided below:

https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Mouse_metadata.csv

https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Study_results.csv

Upon further research in finding the original source of the dataset, I found that these datasets are provided by Pymaceuticals Inc., a fictional burgeoning pharmaceutical company based out of San Diego, CA, specializes in drug-based, anti-cancer pharmaceuticals. Below is the link for the original source of the datasets.

https://c-l-nguyen.github.io/web-design-challenge/index.html

Response

What is the response variable, and what type is it (numerical/categorical)?

Answer: The response variable is the size of tumor, “Tumor.Volume.mm3.” and it holds a numerical data.

Explanatory

What is the explanatory variable, and what type is it (numerical/categorical)?

Answer: The explanatory variable is the “Drug.Regimen” and it holds a categorical data and “Timepoint” which holds numerical data. The ‘Timepoint’ unit is ‘days’.

Units of Variables:

Weight(g) Age_months(months) Timepoint(days) - numerical explanatory variable Drug.Regimen - categorical explanatory variable Tumor.Volume(mm3) - response variable Metastatic.Sites(satges of the spread of tumor)

The initial Tumor Volume for starting point zero is 45 mm3 for all mice.

Over the course of 45 days, tumor development was observed and measured.

Relevant summary statistics: (Tables and Charts)

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R; hence a code chunk is provided below. Insert more code chunks as needed.

Libraries Imported

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 4.1.3

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.2     v dplyr   1.0.7
## v tidyr   1.1.3     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.1

## Warning: package 'ggplot2' was built under R version 4.1.2

## Warning: package 'stringr' was built under R version 4.1.2

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(plotly)

## Warning: package 'plotly' was built under R version 4.1.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(tidyr)
library(stringr)
library(psych)

## Warning: package 'psych' was built under R version 4.1.2

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

library(ggplot2)

Data Preparation and Exploratory Data Analysis:

load dataset1

metadata_df <- read.delim("https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Mouse_metadata.csv", header=T, sep=",")
head(metadata_df)

##   Mouse.ID Drug.Regimen    Sex Age_months Weight..g.
## 1     k403     Ramicane   Male         21         16
## 2     s185    Capomulin Female          3         17
## 3     x401    Capomulin Female         16         15
## 4     m601    Capomulin   Male         22         17
## 5     g791     Ramicane   Male         11         16
## 6     s508     Ramicane   Male          1         17

Grouping by Drug.Regimen

df <- metadata_df %>%
  group_by(Drug.Regimen) 

head(df)

## # A tibble: 6 x 5
## # Groups:   Drug.Regimen [2]
##   Mouse.ID Drug.Regimen Sex    Age_months Weight..g.
##   <chr>    <chr>        <chr>       <int>      <int>
## 1 k403     Ramicane     Male           21         16
## 2 s185     Capomulin    Female          3         17
## 3 x401     Capomulin    Female         16         15
## 4 m601     Capomulin    Male           22         17
## 5 g791     Ramicane     Male           11         16
## 6 s508     Ramicane     Male            1         17

Load dataset2

results_df <- read.delim("https://raw.githubusercontent.com/rfpoulos/pymaceuticals/master/data/Study_results.csv", header=T, sep=",")
head(results_df)

##   Mouse.ID Timepoint Tumor.Volume..mm3. Metastatic.Sites
## 1     b128         0                 45                0
## 2     f932         0                 45                0
## 3     g107         0                 45                0
## 4     a457         0                 45                0
## 5     c819         0                 45                0
## 6     h246         0                 45                0

summary(metadata_df)

##    Mouse.ID         Drug.Regimen           Sex              Age_months   
##  Length:249         Length:249         Length:249         Min.   : 1.00  
##  Class :character   Class :character   Class :character   1st Qu.: 6.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :13.00  
##                                                           Mean   :12.73  
##                                                           3rd Qu.:19.00  
##                                                           Max.   :24.00  
##    Weight..g.   
##  Min.   :15.00  
##  1st Qu.:25.00  
##  Median :27.00  
##  Mean   :26.12  
##  3rd Qu.:29.00  
##  Max.   :30.00

Summary Statistic

summary(results_df)

##    Mouse.ID           Timepoint     Tumor.Volume..mm3. Metastatic.Sites
##  Length:1893        Min.   : 0.00   Min.   :22.05      Min.   :0.000   
##  Class :character   1st Qu.: 5.00   1st Qu.:45.00      1st Qu.:0.000   
##  Mode  :character   Median :20.00   Median :48.95      Median :1.000   
##                     Mean   :19.57   Mean   :50.45      Mean   :1.022   
##                     3rd Qu.:30.00   3rd Qu.:56.29      3rd Qu.:2.000   
##                     Max.   :45.00   Max.   :78.57      Max.   :4.000

Sample Sizes for metadata_df

nrow(metadata_df)

## [1] 249

Sample Sizes for results_df

nrow(results_df)

## [1] 1893

How many drug treatments are there?

drug_count <- unique(metadata_df$Drug.Regimen)

drug_count

##  [1] "Ramicane"  "Capomulin" "Infubinol" "Placebo"   "Ceftamin"  "Stelasyn" 
##  [7] "Zoniferol" "Ketapril"  "Propriva"  "Naftisol"

length(drug_count)

## [1] 10

Sample sizes of mouse_id by drug treatment

capomulin_df <- filter(metadata_df, Drug.Regimen=="Capomulin")

head(capomulin_df)

##   Mouse.ID Drug.Regimen    Sex Age_months Weight..g.
## 1     s185    Capomulin Female          3         17
## 2     x401    Capomulin Female         16         15
## 3     m601    Capomulin   Male         22         17
## 4     f966    Capomulin   Male         16         17
## 5     u364    Capomulin   Male         18         17
## 6     y793    Capomulin   Male         17         17

nrow(capomulin_df)

## [1] 25

infubinol_df <- filter(metadata_df, Drug.Regimen=="Infubinol")

nrow(infubinol_df)

## [1] 25

ketapril_df <- filter(metadata_df, Drug.Regimen=="Ketapril")

nrow(ketapril_df)

## [1] 25

placebo_df <- filter(metadata_df, Drug.Regimen=="Placebo")

nrow(placebo_df)

## [1] 25

Performing full outer join, for metadata and result dataframe, so that no data is lost

merge_df <- merge(x = metadata_df, y = results_df, all = TRUE)

head(merge_df)

##   Mouse.ID Drug.Regimen    Sex Age_months Weight..g. Timepoint
## 1     a203    Infubinol Female         20         23        20
## 2     a203    Infubinol Female         20         23        25
## 3     a203    Infubinol Female         20         23        15
## 4     a203    Infubinol Female         20         23        10
## 5     a203    Infubinol Female         20         23        35
## 6     a203    Infubinol Female         20         23         0
##   Tumor.Volume..mm3. Metastatic.Sites
## 1           55.17334                1
## 2           56.79321                1
## 3           52.77787                1
## 4           51.85244                1
## 5           61.93165                2
## 6           45.00000                0

glimpse(merge_df)

## Rows: 1,893
## Columns: 8
## $ Mouse.ID           <chr> "a203", "a203", "a203", "a203", "a203", "a203", "a2~
## $ Drug.Regimen       <chr> "Infubinol", "Infubinol", "Infubinol", "Infubinol",~
## $ Sex                <chr> "Female", "Female", "Female", "Female", "Female", "~
## $ Age_months         <int> 20, 20, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21,~
## $ Weight..g.         <int> 23, 23, 23, 23, 23, 23, 23, 23, 23, 23, 25, 25, 25,~
## $ Timepoint          <int> 20, 25, 15, 10, 35, 0, 30, 5, 45, 40, 5, 40, 35, 45~
## $ Tumor.Volume..mm3. <dbl> 55.17334, 56.79321, 52.77787, 51.85244, 61.93165, 4~
## $ Metastatic.Sites   <int> 1, 1, 1, 1, 2, 0, 1, 0, 2, 2, 0, 1, 1, 1, 1, 1, 1, ~

Dropping the NA rows

merge_df <- merge_df %>% drop_na()

head(merge_df)

##   Mouse.ID Drug.Regimen    Sex Age_months Weight..g. Timepoint
## 1     a203    Infubinol Female         20         23        20
## 2     a203    Infubinol Female         20         23        25
## 3     a203    Infubinol Female         20         23        15
## 4     a203    Infubinol Female         20         23        10
## 5     a203    Infubinol Female         20         23        35
## 6     a203    Infubinol Female         20         23         0
##   Tumor.Volume..mm3. Metastatic.Sites
## 1           55.17334                1
## 2           56.79321                1
## 3           52.77787                1
## 4           51.85244                1
## 5           61.93165                2
## 6           45.00000                0

Rename colnames of some columns

Assigning new names to the columns of the merged data frame

colnames(merge_df)[1] <- c("Mouse_Id")
colnames(merge_df)[2] <- c("Drug_Regimen")
colnames(merge_df)[5] <- c("Weight_g")
colnames(merge_df)[7] <- c("Tumor_Volume_mm3")
colnames(merge_df)[8] <- c("Metastatic_Sites")

head(merge_df)

##   Mouse_Id Drug_Regimen    Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1     a203    Infubinol Female         20       23        20         55.17334
## 2     a203    Infubinol Female         20       23        25         56.79321
## 3     a203    Infubinol Female         20       23        15         52.77787
## 4     a203    Infubinol Female         20       23        10         51.85244
## 5     a203    Infubinol Female         20       23        35         61.93165
## 6     a203    Infubinol Female         20       23         0         45.00000
##   Metastatic_Sites
## 1                1
## 2                1
## 3                1
## 4                1
## 5                2
## 6                0

merge_df %>% group_by(Mouse_Id, Timepoint)

## # A tibble: 1,893 x 8
## # Groups:   Mouse_Id, Timepoint [1,888]
##    Mouse_Id Drug_Regimen Sex    Age_months Weight_g Timepoint Tumor_Volume_mm3
##    <chr>    <chr>        <chr>       <int>    <int>     <int>            <dbl>
##  1 a203     Infubinol    Female         20       23        20             55.2
##  2 a203     Infubinol    Female         20       23        25             56.8
##  3 a203     Infubinol    Female         20       23        15             52.8
##  4 a203     Infubinol    Female         20       23        10             51.9
##  5 a203     Infubinol    Female         20       23        35             61.9
##  6 a203     Infubinol    Female         20       23         0             45  
##  7 a203     Infubinol    Female         20       23        30             59.5
##  8 a203     Infubinol    Female         20       23         5             48.5
##  9 a203     Infubinol    Female         20       23        45             68.0
## 10 a203     Infubinol    Female         20       23        40             63.6
## # ... with 1,883 more rows, and 1 more variable: Metastatic_Sites <int>

head(merge_df)

##   Mouse_Id Drug_Regimen    Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1     a203    Infubinol Female         20       23        20         55.17334
## 2     a203    Infubinol Female         20       23        25         56.79321
## 3     a203    Infubinol Female         20       23        15         52.77787
## 4     a203    Infubinol Female         20       23        10         51.85244
## 5     a203    Infubinol Female         20       23        35         61.93165
## 6     a203    Infubinol Female         20       23         0         45.00000
##   Metastatic_Sites
## 1                1
## 2                1
## 3                1
## 4                1
## 5                2
## 6                0

Remove duplicate rows across entire data frame

merge_df <- merge_df[!duplicated(merge_df), ]

head(merge_df)

##   Mouse_Id Drug_Regimen    Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1     a203    Infubinol Female         20       23        20         55.17334
## 2     a203    Infubinol Female         20       23        25         56.79321
## 3     a203    Infubinol Female         20       23        15         52.77787
## 4     a203    Infubinol Female         20       23        10         51.85244
## 5     a203    Infubinol Female         20       23        35         61.93165
## 6     a203    Infubinol Female         20       23         0         45.00000
##   Metastatic_Sites
## 1                1
## 2                1
## 3                1
## 4                1
## 5                2
## 6                0

filter by Capomulin, Infubinol, Ketapril, and Placebo

capomulin_df <- filter(merge_df, Drug_Regimen == "Capomulin")
infubinol_df <- filter(merge_df, Drug_Regimen == "Infubinol")
ketapril_df <- filter(merge_df, Drug_Regimen == "Ketapril")
placebo_df <- filter(merge_df, Drug_Regimen == "Placebo")


head(capomulin_df)

##   Mouse_Id Drug_Regimen    Sex Age_months Weight_g Timepoint Tumor_Volume_mm3
## 1     b128    Capomulin Female          9       22         5         45.65133
## 2     b128    Capomulin Female          9       22        25         43.26214
## 3     b128    Capomulin Female          9       22        35         37.96764
## 4     b128    Capomulin Female          9       22        10         43.27085
## 5     b128    Capomulin Female          9       22         0         45.00000
## 6     b128    Capomulin Female          9       22        40         38.37973
##   Metastatic_Sites
## 1                0
## 2                1
## 3                1
## 4                0
## 5                0
## 6                2

The initial Tumor Volume for starting point zero is 45 mm3 for all mice

capomulin_df

# Filtering the max timepoint values

capo_df <- select(capomulin_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3, Age_months, Weight_g) %>%
   group_by(Mouse_Id) %>%
  filter(Timepoint == max(Timepoint, na.rm=TRUE))


head(capo_df)

## # A tibble: 6 x 6
## # Groups:   Mouse_Id [6]
##   Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
##   <chr>    <chr>            <int>            <dbl>      <int>    <int>
## 1 b128     Capomulin           45             39.0          9       22
## 2 b742     Capomulin           45             38.9          7       21
## 3 f966     Capomulin           20             30.5         16       17
## 4 g288     Capomulin           45             37.1          3       19
## 5 g316     Capomulin           45             40.2         22       22
## 6 i557     Capomulin           45             47.7          1       24

# Calculating the percent change in Tumor volume
capo_df <- capo_df %>% 
  mutate(percent_change_Tumor_Volume = ((Tumor_Volume_mm3-45)/45))
head(capo_df)

## # A tibble: 6 x 7
## # Groups:   Mouse_Id [6]
##   Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
##   <chr>    <chr>            <int>            <dbl>      <int>    <int>
## 1 b128     Capomulin           45             39.0          9       22
## 2 b742     Capomulin           45             38.9          7       21
## 3 f966     Capomulin           20             30.5         16       17
## 4 g288     Capomulin           45             37.1          3       19
## 5 g316     Capomulin           45             40.2         22       22
## 6 i557     Capomulin           45             47.7          1       24
## # ... with 1 more variable: percent_change_Tumor_Volume <dbl>

The initial Tumor Volume for starting point zero is 45 mm3 for all mice

Infubinol_df

# Filtering the max timepoint values
infu_df <- select(infubinol_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3, Age_months, Weight_g) %>%
   group_by(Mouse_Id) %>%
  filter(Timepoint == max(Timepoint, na.rm=TRUE))

# Calculating the percent change in Tumor volume
infu_df <- infu_df %>% 
  mutate(percent_change_Tumor_Volume = ((Tumor_Volume_mm3-45)/45))

head(infu_df)

## # A tibble: 6 x 7
## # Groups:   Mouse_Id [6]
##   Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
##   <chr>    <chr>            <int>            <dbl>      <int>    <int>
## 1 a203     Infubinol           45             68.0         20       23
## 2 a251     Infubinol           45             65.5         21       25
## 3 a577     Infubinol           30             57.0          6       25
## 4 a685     Infubinol           45             66.1          8       30
## 5 c139     Infubinol           45             72.2         11       28
## 6 c326     Infubinol            5             36.3         18       25
## # ... with 1 more variable: percent_change_Tumor_Volume <dbl>

The initial Tumor Volume for starting point zero is 45 mm3 for all mice

Ketapril_df

# Filtering the max timepoint values
keta_df <- select(ketapril_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3, Age_months, Weight_g) %>%
   group_by(Mouse_Id) %>%
  filter(Timepoint == max(Timepoint, na.rm=TRUE))

# Calculating the percent change in Tumor volume
keta_df <- keta_df %>% 
  mutate(percent_change_Tumor_Volume = ((Tumor_Volume_mm3-45)/45))

head(keta_df)

## # A tibble: 6 x 7
## # Groups:   Mouse_Id [6]
##   Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
##   <chr>    <chr>            <int>            <dbl>      <int>    <int>
## 1 a457     Ketapril            10             49.8         11       30
## 2 c580     Ketapril            30             58.0         22       25
## 3 c819     Ketapril            40             62.2         21       25
## 4 c832     Ketapril            45             65.4         18       29
## 5 d474     Ketapril            40             60.2         18       27
## 6 f278     Ketapril             5             48.2         12       30
## # ... with 1 more variable: percent_change_Tumor_Volume <dbl>

The initial Tumor Volume for starting point zero is 45 mm3 for all mice

Placebo_df

plac_df <- select(placebo_df, Mouse_Id, Drug_Regimen, Timepoint, Tumor_Volume_mm3, Age_months, Weight_g) %>%
   group_by(Mouse_Id) %>%
  filter(Timepoint == max(Timepoint, na.rm=TRUE))

# Calculating the percent change in Tumor volume
plac_df <- plac_df %>% 
  mutate(percent_change_Tumor_Volume = ((Tumor_Volume_mm3-45)/45))

head(plac_df)

## # A tibble: 6 x 7
## # Groups:   Mouse_Id [6]
##   Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
##   <chr>    <chr>            <int>            <dbl>      <int>    <int>
## 1 a262     Placebo             45             70.7         17       29
## 2 a897     Placebo             45             72.3          7       28
## 3 c282     Placebo             45             65.8         12       27
## 4 c757     Placebo             45             69.0          9       27
## 5 c766     Placebo             45             69.8         13       26
## 6 e227     Placebo             45             73.2          1       30
## # ... with 1 more variable: percent_change_Tumor_Volume <dbl>

Merging the grouped dataframes into new_df

# put all data frames into list
df_list <- list(capo_df, infu_df, keta_df, plac_df)      

#merge all data frames together
new_df <- Reduce(function(x, y) merge(x, y, all=TRUE), df_list)  
head(new_df)

##   Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
## 1     a203    Infubinol        45         67.97342         20       23
## 2     a251    Infubinol        45         65.52574         21       25
## 3     a262      Placebo        45         70.71762         17       29
## 4     a457     Ketapril        10         49.78342         11       30
## 5     a577    Infubinol        30         57.03186          6       25
## 6     a685    Infubinol        45         66.08307          8       30
##   percent_change_Tumor_Volume
## 1                   0.5105204
## 2                   0.4561276
## 3                   0.5715027
## 4                   0.1062982
## 5                   0.2673747
## 6                   0.4685126

#new_df$percent_change_Tumor_Volume = round(new_df$percent_change_Tumor_Volume, digit=2)
head(new_df)

##   Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
## 1     a203    Infubinol        45         67.97342         20       23
## 2     a251    Infubinol        45         65.52574         21       25
## 3     a262      Placebo        45         70.71762         17       29
## 4     a457     Ketapril        10         49.78342         11       30
## 5     a577    Infubinol        30         57.03186          6       25
## 6     a685    Infubinol        45         66.08307          8       30
##   percent_change_Tumor_Volume
## 1                   0.5105204
## 2                   0.4561276
## 3                   0.5715027
## 4                   0.1062982
## 5                   0.2673747
## 6                   0.4685126

Comparing the percent change in tumor volumn size for each drug category

test_df <- select(new_df, Drug_Regimen, percent_change_Tumor_Volume) %>%
  group_by(Drug_Regimen) %>%
 summarise(percent_change_Tumor_Volume = mean(percent_change_Tumor_Volume, na.rm=TRUE))

head(test_df)

## # A tibble: 4 x 2
##   Drug_Regimen percent_change_Tumor_Volume
##   <chr>                              <dbl>
## 1 Capomulin                         -0.185
## 2 Infubinol                          0.293
## 3 Ketapril                           0.396
## 4 Placebo                            0.345

# plot mean salaries
ggplot(test_df, 
       aes(x = percent_change_Tumor_Volume, 
           y = Drug_Regimen, 
           fill=Drug_Regimen)) +
  geom_bar(stat = "identity")

Summary statistics

Compute some summary statistics (count, mean and sd) of the variable weight organized by groups:

# Summarize
new_df %>%
group_by(Drug_Regimen) %>%
  summarise(
    n = n(),
    max = max(percent_change_Tumor_Volume),
    min = min(percent_change_Tumor_Volume),
    mean = mean(percent_change_Tumor_Volume),
    median = median(percent_change_Tumor_Volume),
    sd = sd(percent_change_Tumor_Volume),
    quan_25th = quantile(percent_change_Tumor_Volume, 0.25),
    quan_75th = quantile(percent_change_Tumor_Volume, 0.75)
  )

## # A tibble: 4 x 9
##   Drug_Regimen     n    max    min   mean median    sd quan_25th quan_75th
##   <chr>        <int>  <dbl>  <dbl>  <dbl>  <dbl> <dbl>     <dbl>     <dbl>
## 1 Capomulin       25 0.0597 -0.481 -0.185 -0.153 0.127    -0.281    -0.108
## 2 Infubinol       25 0.605  -0.193  0.293  0.337 0.191     0.201     0.456
## 3 Ketapril        25 0.746   0      0.396  0.433 0.221     0.260     0.553
## 4 Placebo         25 0.627   0      0.345  0.378 0.197     0.177     0.514

Visualization

library(tidyverse)
library(rstatix)

## Warning: package 'rstatix' was built under R version 4.1.3

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.1.3

# Color by groups
ggsummarystats(
  new_df, x = "Drug_Regimen", y = "percent_change_Tumor_Volume", 
  ggfunc = ggboxplot, add = "jitter",
  color = "Drug_Regimen", palette = "npg"
  )

Create a box plot of percent_change_Tumor_Volume by Drug_Regimen:

boxplot(new_df$percent_change_Tumor_Volume~new_df$Drug_Regimen, 
        main="Boxplot comparing percent_change_Tumor_Volume", 
         xlab = "Drug_Regimen", ylab = "percent_change_Tumor_Volume",
        col= rainbow(4))

Analysis for the box plots:

The drug categories represented on the x_axis and percent change in tumor size on the y_axis.

Infubinol drug category has one datapoint indicating the outlier.

Infubinol and Ketapril drug category shows that the median and mean are closely alligned, which shows data is normally distributed, and we can ignore the outlier in Infubinol drug category.

Ketapril drug category has more variance of data points than the Infubinol drug category.

However, the mean < median in the Capomulin drug category, making it left skewed. Also, the percent change is tumor size is below zero, showing that the Capomulin drug is effective in reducing the size of tumor.

Since Capomulin drug category appears to be different than Infubinol and Ketapril drug category, which is an indication that we will find the difference between means of Capomulin drug and the other two categories.

#=============================================================================================================================

To compare the means of three or more samples, One Way Variance (ANOVA) is the most appropriate test to use.

Assumptions

The ANOVA test makes the following assumptions about the data:

Independence of the observations. Each subject should belong to only one group. There is no relationship between the observations in each group. Having repeated measures for the same participants is not allowed. No significant outliers in any cell of the design Normality. the data for each design cell should be approximately normally distributed. Homogeneity of variances. The variance of the outcome variable should be equal in every cell of the design.

Check assumptions

Outliers

new_df %>% 
  group_by(Drug_Regimen) %>%
  identify_outliers(percent_change_Tumor_Volume)

## # A tibble: 1 x 9
##   Drug_Regimen Mouse_Id Timepoint Tumor_Volume_mm3 Age_months Weight_g
##   <chr>        <chr>        <int>            <dbl>      <int>    <int>
## 1 Infubinol    c326             5             36.3         18       25
## # ... with 3 more variables: percent_change_Tumor_Volume <dbl>,
## #   is.outlier <lgl>, is.extreme <lgl>

In this situation there is one outliers, this can be due to data entry errors, measurement errors or unusual values and the mean is not effected.

Normality assumption

Check normality assumption by analyzing the model residuals. QQ plot and Shapiro-Wilk test of normality are used.

QQ plot draws the correlation between a given data and the normal distribution.

# Build the linear model
model  <- lm(percent_change_Tumor_Volume ~ Drug_Regimen, data = new_df)
# Create a QQ plot of residuals
ggqqplot(residuals(model),col="blue")

# Compute Shapiro-Wilk test of normality
shapiro_test(residuals(model))

## # A tibble: 1 x 3
##   variable         statistic p.value
##   <chr>                <dbl>   <dbl>
## 1 residuals(model)     0.976  0.0678

Conclusion:

In the QQ plot, as all the points fall approximately along the reference line, we can assume normality. This conclusion is supported by the Shapiro-Wilk test. The p-value is not significant (p = 0.6), so we can assume normality.

Check normality assumption by groups

Computing Shapiro-Wilk test for each group level. If the data is normally distributed, the p-value should be greater than 0.05.

new_df %>%
  group_by(Drug_Regimen) %>%
  shapiro_test(percent_change_Tumor_Volume)

## # A tibble: 4 x 4
##   Drug_Regimen variable                    statistic     p
##   <chr>        <chr>                           <dbl> <dbl>
## 1 Capomulin    percent_change_Tumor_Volume     0.968 0.599
## 2 Infubinol    percent_change_Tumor_Volume     0.960 0.414
## 3 Ketapril     percent_change_Tumor_Volume     0.945 0.189
## 4 Placebo      percent_change_Tumor_Volume     0.936 0.121

The score were normally distributed (p > 0.05) for each group, as assessed by Shapiro-Wilk’s test of normality.

Create QQ plots for each group level:

QQ plot draws the correlation between a given data and the normal distribution.

ggqqplot(new_df, "percent_change_Tumor_Volume", facet.by = "Drug_Regimen", col="red")

Conclusion:

All the points fall approximately along the reference line, for each cell. So we can assume normality of the data.

Homogeneity of variance assumption

The residuals versus fits plot can be used to check the homogeneity of variances.

plot(model, 1)

Conclusion:

In the plot above, there is no evident relationships between residuals and fitted values (the mean of each group), which is good. So, we can assume the homogeneity of variances.

Step 1: Setup Null Hypothesis and Alternate Hypothesis

H0 = mu0 = mu01 = mu02 = mu03 (i.e., There is no difference between the mean percent change in tumor volume for the four drug categories.)

H1 = Not all means are equal

Alpha = 0.05

Computation for ANOVA Test

#result_aov <- new_df %>% anova_test(percent_change_Tumor_Volume ~ Drug_Regimen)
result_aov <-anova(lm(percent_change_Tumor_Volume ~ Drug_Regimen, new_df))

result_aov

## Analysis of Variance Table
## 
## Response: percent_change_Tumor_Volume
##              Df Sum Sq Mean Sq F value    Pr(>F)    
## Drug_Regimen  3 5.3903 1.79676  51.182 < 2.2e-16 ***
## Residuals    96 3.3701 0.03511                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Another similar Anova test using aov function:

table <- aov(percent_change_Tumor_Volume ~ Drug_Regimen, data=new_df)

summary(table)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Drug_Regimen  3   5.39  1.7968   51.18 <2e-16 ***
## Residuals    96   3.37  0.0351                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Report:

Degree of freedom: 3, since we have 4 categories, so 4-1 =3 degree of freedom Sum of Squares: 53104 Mean Sum of Squares: 17701 F-Statistics: 69.737, F(3, 96) = 69.74 p-Value: 2.2e-16, this tells weather we found significance difference or not. The p-value less than alpha=0.05 shows that we found the difference in at least two of the means in the data sample. And that is what we suspected from the boxplot analysis.

Conclusion:

The F-value in an ANOVA is calculated as: variation between sample means / variation within the samples. The higher the F-value in an ANOVA, the higher the variation between sample means relative to the variation within the samples. The higher the F-value, the lower the corresponding p-value.

Since the p-value is below a threshold (α = .05), we can reject the null hypothesis of the ANOVA and conclude that there is a statistically significant difference between group means.

Therefore, we can reject the null hypothesis that there is no difference between the means in the data sample.

But we are not sure about which category mean is different, so we perform Post-hoc tests to determine that.

Post-hoc test:

A significant one-way ANOVA is generally followed up by Tukey post-hoc tests to perform multiple pairwise comparisons between groups. Key R function: tukey_hsd() [rstatix].

TukeyHSD(table)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = percent_change_Tumor_Volume ~ Drug_Regimen, data = new_df)
## 
## $Drug_Regimen
##                            diff         lwr        upr     p adj
## Infubinol-Capomulin  0.47801507  0.33945452 0.61657561 0.0000000
## Ketapril-Capomulin   0.58085831  0.44229776 0.71941885 0.0000000
## Placebo-Capomulin    0.52979659  0.39123605 0.66835714 0.0000000
## Ketapril-Infubinol   0.10284324 -0.03571731 0.24140378 0.2181696
## Placebo-Infubinol    0.05178152 -0.08677902 0.19034207 0.7627868
## Placebo-Ketapril    -0.05106171 -0.18962226 0.08749883 0.7704117

Report TukeyHSD Results:

when p-adj < p-value, there is a significant difference between the two categories mean.

We are 95% confident that the true mean lies between the lwr and upr bounds of confidence Interval.

The p-Value for Infubinol-Capomulin, Ketapril-Capomulin, Placebo-Capomulin is less than alpha=0.05 that shows there is significant difference between the means of these categories.

Thus, Capomulin more effective in reducing the tumor size than Infubinol, or Ketapril drugs categories.

The visualization below shows the difference in mean between the two categories in the middle of each bar.

Both end limits on each bar shows the upper and lower limits of confidence intervals.

plot(TukeyHSD(table), col= rainbow(4))

###=====================================================================================

correlation between the age, weight and the tumor size growth for each drug category?

head(capo_df)

## # A tibble: 6 x 7
## # Groups:   Mouse_Id [6]
##   Mouse_Id Drug_Regimen Timepoint Tumor_Volume_mm3 Age_months Weight_g
##   <chr>    <chr>            <int>            <dbl>      <int>    <int>
## 1 b128     Capomulin           45             39.0          9       22
## 2 b742     Capomulin           45             38.9          7       21
## 3 f966     Capomulin           20             30.5         16       17
## 4 g288     Capomulin           45             37.1          3       19
## 5 g316     Capomulin           45             40.2         22       22
## 6 i557     Capomulin           45             47.7          1       24
## # ... with 1 more variable: percent_change_Tumor_Volume <dbl>

Applying Regression Model for Analyzing Correlation between Age_months, weight_g and Tumor_Volume_mm3

#fit regression model
model <- lm(Tumor_Volume_mm3~Weight_g+Age_months, data=capo_df)

#view model summary
summary(model)

## 
## Call:
## lm(formula = Tumor_Volume_mm3 ~ Weight_g + Age_months, data = capo_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.2433 -0.7903  0.1667  1.5253  4.8761 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.48591    4.32578   0.112    0.912    
## Weight_g     1.76129    0.20298   8.677 1.49e-08 ***
## Age_months   0.05303    0.07401   0.716    0.481    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.838 on 22 degrees of freedom
## Multiple R-squared:  0.7739, Adjusted R-squared:  0.7533 
## F-statistic: 37.65 on 2 and 22 DF,  p-value: 7.898e-08

#By looking at the coefficients, it seems as though weight is accounting for most of the variance (B = 1.76, p=1.49e-08, p<.05). Age is not a significant predictor of tumor size (B = .053, p = .481).

Residuals:

Min      1Q  Median      3Q     Max

-7.2433 -0.7903 0.1667 1.5253 4.8761

This section displays a summary of the distribution of residuals from the regression model. A residual is the difference between the observed value and the predicted value from the regression model.

The minimum residual was -7.2433, the median residual was 0.1667 and the max residual was 4.8761.

Coefficients:

        Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.48591 4.32578 0.112 0.912
Weight_g 1.76129 0.20298 8.677 1.49e-08 *** Age_months 0.05303 0.07401 0.716 0.481
— Signif. codes: 0 ‘’ 0.001 ‘’ 0.01 ‘’ 0.05 ‘.’ 0.1 ‘ ’ 1

This section displays the estimated coefficients of the regression model. We can use these coefficients to form the following estimated regression equation:

Tumor_Volume_mm3 = 0.48591 + 1.76129Weight_g + 0.05303Age_months

Estimate: The estimated coefficient. This tells us about the average increase in the response variable associated with a one unit increase in the predictor variable, assuming all other predictor variables are held constant.

Std. Error: This is the standard error of the coefficient. This is a measure of the uncertainty in our estimate of the coefficient.

t value: This is the t-statistic for the predictor variable, calculated as (Estimate) / (Std. Error).

Residual standard error: This tells us the average distance that the observed values fall from the regression line. The smaller the value, the better the regression model can fit the data.

Multiple R-Squared: This is known as the coefficient of determination. It tells us the proportion of the variance in the response variable that can be explained by the predictor variables. This value ranges from 0 to 1. The closer it is to 1, the better the predictor variables are able to predict the value of the response variable.

Adjusted R-squared: This is a modified version of R-squared that has been adjusted for the number of predictors in the model. It is always lower than the R-squared.

F-statistic: This indicates whether the regression model provides a better fit to the data than a model that contains no independent variables. In essence, it tests if the regression model is useful.

p-value: This is the p-value that corresponds to the F-statistic. If this value is less than some significance level (e.g., 0.05), then the regression model fits the data better than a model with no predictors.

When building regression models, we hope that this p-value is less than some significance level because it indicates that the predictor variables are useful for predicting the value of the response variable.

If we used an alpha level of α = .05 to determine which predictors were significant in this regression model, we’d say that weight has (p=1.49e-08) < is (α = .05) and is statistically significant predictors.

while age (p=0.481) > is (α = .05) and is not statistically significant predictors.

Applying Regression Model for Analyzing Correlation between Tumor_Volume_mm3 and weight seperately.

wt_Model <- lm(Tumor_Volume_mm3~Weight_g, data=new_df)
summary(wt_Model)

## 
## Call:
## lm(formula = Tumor_Volume_mm3 ~ Weight_g, data = new_df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -19.0966  -7.1010  -0.7038   8.7571  20.1531 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -5.6591     6.2930  -0.899    0.371    
## Weight_g      2.3252     0.2402   9.680    6e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.62 on 98 degrees of freedom
## Multiple R-squared:  0.4888, Adjusted R-squared:  0.4836 
## F-statistic:  93.7 on 1 and 98 DF,  p-value: 5.997e-16

Conclusion:

If we used an alpha level of α = .05 to determine which predictors were significant in this regression model, we’d say that weight has (p-value: 6e-16) < is (α = .05) and is statistically significant predictors.

Visualizing the Data:

ggscatter(new_df, x = "Tumor_Volume_mm3", y = "Weight_g", col="red",
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Tumor Size mm3", ylab = "Weight_g")

## `geom_smooth()` using formula 'y ~ x'

Applying Regression Model for Analyzing Correlation between Tumor_Volume_mm3 and Age_months

age_Model <- lm(Tumor_Volume_mm3~Age_months, data=new_df)
summary(age_Model)

## 
## Call:
## lm(formula = Tumor_Volume_mm3 ~ Age_months, data = new_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.032  -9.571   1.621  11.525  23.876 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   54.331      2.912  18.656   <2e-16 ***
## Age_months     0.015      0.185   0.081    0.936    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.45 on 98 degrees of freedom
## Multiple R-squared:  6.709e-05,  Adjusted R-squared:  -0.01014 
## F-statistic: 0.006575 on 1 and 98 DF,  p-value: 0.9355

Conclusion:

If we used an alpha level of α = .05 to determine which predictors were significant in this regression model, we’d say that age has (p-value: 0.936) > is (α = .05) and is not statistically significant predictors.

Visualizing the Data:

ggscatter(new_df, x = "Tumor_Volume_mm3", y = "Age_months", col="blue",
          add = "reg.line", conf.int = TRUE, 
          cor.coef = TRUE, cor.method = "pearson",
          xlab = "Tumor Size mm3", ylab = "Age")

## `geom_smooth()` using formula 'y ~ x'

### Explanation of One Way ANOVA Variance Test:

ANOVA (or AOV) is short for ANalysis Of VAriance. While it is commonly used for categorical data, because ANOVA is a type of linear model it can be modified to include continuous data.

ANOVAs are typically referred to as 1-way and 2-way, which is just a way of saying how many groups are being examined in the model. (ANOVAs can be 3-way or more, but these are less common or end up with different names.) A group is the categorical variable that you are evaluating.

ANOVA test determines the difference in mean between two or more independent groups.

ANOVA checks the impact of one or more group by comparing the means of different samples. We can use ANOVA to prove/disprove if all the medication treatments were equally effective or not.

ANOVA analyzes the variance in the data to look for differences. It does this by considering two sources of variance, the between-group variance and the within-group variance.

The between-group variation is calculated by comparing the mean of each group with the overall mean of the data—so individual data points don’t matter quite as much as just comparing group means.

The within-group variation is the variation of each observation from its group mean.

For both types of variance, a sums of squares (SS) is the numerical metric used to quantify them and this metric simply sums the distances of each point to the mean.

The ratio of these SS (between SS divided by within SS) results in an F-statistic, which is the test statistic for ANOVA.

The F-statistic is then combined with the degrees of freedom (df) to arrive at a p-value, which, we are looking for.

Another way to think of it is that small p-values come from large F-statistics, and large F-statistics suggest that the between group variance is much larger than the within-group variance.

So, when the differences in group means is larger and yet the groups are not that variable, we tend to have significant factors in our ANOVAs.

A type 1 error is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis. This means that your report that your findings are significant when in fact they have occurred by chance.

The probability of making a type I error is represented by your alpha level (α), which is the p-value below which you reject the null hypothesis. A p-value of 0.05 indicates that we are willing to accept a 5% chance that we are wrong when we reject the null hypothesis.

We can reduce your risk of committing a type I error by using a lower value for p. For example, a p-value of 0.01 would mean there is a 1% chance of committing a Type I error.

However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a type II error).

Conclusion:

Why is this analysis important?

ANOVA test determines the difference in mean between two or more independent groups.

It provides the overall test of equality of group means.

It can control the overall type I error rate (i.e., false positive finding)

Limitations of the analysis?

One-way ANOVA can only be used when investigating a single factor and a single dependent variable. When comparing the means of three or more groups, it can tell us if at least one pair of means is significantly different, but it can’t tell us which pair.

ANOVA assumes that the data is normally distributed. The ANOVA also assumes homogeneity of variance, which means that the variance among the groups should be approximately equal. ANOVA also assumes that the observations are independent of each other.

Final Project Data 606.Rmd

Mubashira Qari

April 30, 2022

Overview:

Research question:

Hypothesis Test

Null Hypothesis: There is no difference between the mean percent change in tumor volume for the four drug categories.

Alternate Hypothesis: There is a difference between the mean percent change in tumor volume for the four drug categories.

Cases:

What are the cases? How many different drug treatments are there? How many total samples size as well as the sample size by drug treatments are there?

Data collection:

Describe the method of data collection.

Type of study:

What type of study is this (observational/experiment)?

Data Source:

If you collected the data, state self-collected. If not, provide a citation/link.

Response

What is the response variable, and what type is it (numerical/categorical)?

Explanatory

Units of Variables:

The initial Tumor Volume for starting point zero is 45 mm3 for all mice.

Over the course of 45 days, tumor development was observed and measured.

Relevant summary statistics: (Tables and Charts)

Libraries Imported

Data Preparation and Exploratory Data Analysis:

load dataset1

Grouping by Drug.Regimen

Load dataset2

Summary Statistic

Sample Sizes for metadata_df

Sample Sizes for results_df

How many drug treatments are there?

Sample sizes of mouse_id by drug treatment

Performing full outer join, for metadata and result dataframe, so that no data is lost

Dropping the NA rows

Rename colnames of some columns

Assigning new names to the columns of the merged data frame

Remove duplicate rows across entire data frame

filter by Capomulin, Infubinol, Ketapril, and Placebo

The initial Tumor Volume for starting point zero is 45 mm3 for all mice

capomulin_df

The initial Tumor Volume for starting point zero is 45 mm3 for all mice

Infubinol_df

The initial Tumor Volume for starting point zero is 45 mm3 for all mice

Ketapril_df

The initial Tumor Volume for starting point zero is 45 mm3 for all mice

Placebo_df

Merging the grouped dataframes into new_df

Comparing the percent change in tumor volumn size for each drug category

Summary statistics

Visualization

Analysis for the box plots:

To compare the means of three or more samples, One Way Variance (ANOVA) is the most appropriate test to use.

Assumptions

Check assumptions

Outliers

Normality assumption

QQ plot draws the correlation between a given data and the normal distribution.

Conclusion:

Check normality assumption by groups

Create QQ plots for each group level:

Conclusion:

Homogeneity of variance assumption

Conclusion:

Step 1: Setup Null Hypothesis and Alternate Hypothesis

H0 = mu0 = mu01 = mu02 = mu03 (i.e., There is no difference between the mean percent change in tumor volume for the four drug categories.)

H1 = Not all means are equal

Alpha = 0.05

Computation for ANOVA Test

Another similar Anova test using aov function:

Report:

Conclusion:

Post-hoc test:

Report TukeyHSD Results:

correlation between the age, weight and the tumor size growth for each drug category?

Applying Regression Model for Analyzing Correlation between Age_months, weight_g and Tumor_Volume_mm3

Residuals:

Coefficients:

Tumor_Volume_mm3 = 0.48591 + 1.76129Weight_g + 0.05303Age_months

Applying Regression Model for Analyzing Correlation between Tumor_Volume_mm3 and weight seperately.