For this project I picked the PlantGrowth dataset, which contains the results of an experiment on plant growth. The source file is https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/PlantGrowth.csv
The experiment compares yields (measured as the dried weight of plants) obtained under a control condition and two different treatment conditions.
In this analysis I would like to find out which condition gives the largest plant yield on average, and whether the control condition can outperform either treatment condition.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2
Load the file into a data frame for in-depth analysis with R
data_url <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/PlantGrowth.csv"
PlantGrowth_df <- read.table(file=data_url, header = TRUE, sep =",",
stringsAsFactors = FALSE)
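If the URL is unreachable, the same data can be rebuilt offline. The sketch below assumes the copy of PlantGrowth that ships with R's built-in datasets package holds the same 30 rows as the CSV; the name builtin_df and the added X column are my own, chosen to mirror the layout read from the file.

```r
# Assumption: the built-in PlantGrowth (datasets package) matches the CSV rows.
# It has no index column, so one is added to mirror the X column from the file.
builtin_df <- data.frame(
  X      = seq_len(nrow(PlantGrowth)),
  weight = PlantGrowth$weight,
  group  = as.character(PlantGrowth$group),  # factor -> character, as with stringsAsFactors = FALSE
  stringsAsFactors = FALSE
)

# Inspect the structure: 30 rows, columns X, weight and group
str(builtin_df)
```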
Preview the data frame to ensure proper loading of data
head(PlantGrowth_df,10)
## X weight group
## 1 1 4.17 ctrl
## 2 2 5.58 ctrl
## 3 3 5.18 ctrl
## 4 4 6.11 ctrl
## 5 5 4.50 ctrl
## 6 6 4.61 ctrl
## 7 7 5.17 ctrl
## 8 8 4.53 ctrl
## 9 9 5.33 ctrl
## 10 10 5.14 ctrl
Rename the columns: X to Seq_No, weight to Weight(lbs), and group to Group_Name
names(PlantGrowth_df) <- c("Seq_No", "Weight(lbs)","Group_Name")
# Display the new column names
names(PlantGrowth_df)
## [1] "Seq_No" "Weight(lbs)" "Group_Name"
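Because Weight(lbs) contains parentheses, it is not a syntactic R name and has to be wrapped in backticks whenever it is used in code. A small sketch of what that means in practice, using a hypothetical two-row data frame (demo_df is not part of the analysis):

```r
# Hypothetical two-row data frame, used only to illustrate the naming issue
demo_df <- data.frame(weight = c(4.17, 5.58))
names(demo_df) <- "Weight(lbs)"

# Backticks are required because the name contains parentheses
demo_df$`Weight(lbs)`

# Double-bracket indexing with a string works as well
demo_df[["Weight(lbs)"]]

# make.names() shows what a syntactic version of the name would look like
make.names("Weight(lbs)")
```

A name such as Weight_lbs would avoid the backticks altogether; the analysis keeps Weight(lbs) for readability in plot labels.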
Find distinct/unique values of the Group_Name
unique(PlantGrowth_df$Group_Name)
## [1] "ctrl" "trt1" "trt2"
Rename the different group codes to something more detailed
lookup_Groupname <- c(ctrl="control condition", trt1 ="treatment condition_1",
trt2="treatment condition_2")
# Indexing a named vector by name returns the matching values; a name may be repeated
lookup_Groupname[c("ctrl","trt1","trt1")]
## ctrl trt1 trt1
## "control condition" "treatment condition_1" "treatment condition_1"
#Create a character vector from the column values of the dataframe
lookup_Groupname[PlantGrowth_df$Group_Name]
## ctrl ctrl ctrl
## "control condition" "control condition" "control condition"
## ctrl ctrl ctrl
## "control condition" "control condition" "control condition"
## ctrl ctrl ctrl
## "control condition" "control condition" "control condition"
## ctrl trt1 trt1
## "control condition" "treatment condition_1" "treatment condition_1"
## trt1 trt1 trt1
## "treatment condition_1" "treatment condition_1" "treatment condition_1"
## trt1 trt1 trt1
## "treatment condition_1" "treatment condition_1" "treatment condition_1"
## trt1 trt1 trt2
## "treatment condition_1" "treatment condition_1" "treatment condition_2"
## trt2 trt2 trt2
## "treatment condition_2" "treatment condition_2" "treatment condition_2"
## trt2 trt2 trt2
## "treatment condition_2" "treatment condition_2" "treatment condition_2"
## trt2 trt2 trt2
## "treatment condition_2" "treatment condition_2" "treatment condition_2"
#prior_pG_Groupname <-PlantGrowth_df[,3]
# Replace the Group_Name column with the looked-up labels, dropping the vector names
PlantGrowth_df[,3] <- unname(lookup_Groupname[PlantGrowth_df$Group_Name])
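The named-vector lookup above is one way to recode the groups. As an alternative sketch (not what this analysis uses), factor() with explicit levels and labels does the same mapping and also fixes the order in which groups appear in plots. The codes vector below is a hypothetical sample, not the real column.

```r
# Hypothetical sample of group codes, standing in for the real column
codes <- c("ctrl", "trt1", "trt2", "trt1")

# Map each code to its descriptive label; levels and labels are matched by position
labelled <- factor(
  codes,
  levels = c("ctrl", "trt1", "trt2"),
  labels = c("control condition", "treatment condition_1", "treatment condition_2")
)

# Convert back to plain character if a character column is preferred
as.character(labelled)

# Any code not listed in levels becomes NA, which makes typos easy to spot
anyNA(labelled)
```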
Let us first take a quick look at the distribution of plant yield weights (lbs) across all conditions
# binwidth is set explicitly; the default of 30 bins is too fine for 30 observations
ggplot(data = PlantGrowth_df) + geom_histogram(aes(x = `Weight(lbs)`), binwidth = 0.25)
Create a subset of PlantGrowth_df for the control group so that it can be analysed further
PlantGrowth_df_ctrl_sub <- subset(PlantGrowth_df,Group_Name=="control condition")
Get summary statistics (mean, median, max, min) for the control condition
summary(PlantGrowth_df_ctrl_sub)
## Seq_No Weight(lbs) Group_Name
## Min. : 1.00 Min. :4.170 Length:10
## 1st Qu.: 3.25 1st Qu.:4.550 Class :character
## Median : 5.50 Median :5.155 Mode :character
## Mean : 5.50 Mean :5.032
## 3rd Qu.: 7.75 3rd Qu.:5.293
## Max. :10.00 Max. :6.110
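The summary above also reports statistics for Seq_No and Group_Name, which carry no information about yield. A sketch that summarises only the weights, for every group at once, is shown here. It assumes the built-in PlantGrowth data frame (columns weight and group) holds the same values as the CSV; by_group is my own name.

```r
# Assumption: the built-in PlantGrowth data matches the CSV used in this analysis
# tapply() applies summary() to the weights of each group separately
by_group <- tapply(PlantGrowth$weight, PlantGrowth$group, summary)

# Five-number summary plus mean for the control group only
by_group[["ctrl"]]

# The same for both treatment groups
by_group[["trt1"]]
by_group[["trt2"]]
```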
Display this subset in a box plot
boxplot(PlantGrowth_df_ctrl_sub$`Weight(lbs)`)
This shows that the control plant weights are fairly tightly clustered: the middle half lie between about 4.55 and 5.29 lbs, and the full range is 4.17 to 6.11 lbs.
Compare the box plots of all the different conditions
boxplot(`Weight(lbs)`~ Group_Name, data = PlantGrowth_df )
We notice that treatment condition 2 gives higher yields in general, with much less spread.
The control condition has the widest interquartile range (the tallest box), while treatment condition 1 covers the widest overall range.
Study the summary statistics of the entire data frame
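The spread of each group can also be read off as numbers rather than judged from the boxes. The following is a sketch under the assumption that the built-in PlantGrowth data (columns weight and group) matches the CSV; spread is my own name, and standard deviation, interquartile range and overall range are simply three common choices of spread measure.

```r
# Assumption: the built-in PlantGrowth data matches the CSV used in this analysis
# For each group compute the standard deviation, interquartile range and overall range
spread <- sapply(split(PlantGrowth$weight, PlantGrowth$group), function(w) {
  c(sd = sd(w), iqr = IQR(w), range = diff(range(w)))
})

# Rows are the spread measures, columns are the groups
round(spread, 3)
```

Different measures can rank the groups differently, so it helps to say which one a statement about "variance" refers to.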
summary(PlantGrowth_df)
## Seq_No Weight(lbs) Group_Name
## Min. : 1.00 Min. :3.590 Length:30
## 1st Qu.: 8.25 1st Qu.:4.550 Class :character
## Median :15.50 Median :5.155 Mode :character
## Mean :15.50 Mean :5.073
## 3rd Qu.:22.75 3rd Qu.:5.530
## Max. :30.00 Max. :6.310
We notice that the minimum is 3.59, the mean is 5.073 and the maximum is 6.310
Next, we need to find out whether either treatment condition performed worse than the control condition
p1 <- ggplot(PlantGrowth_df, aes(x = Group_Name, y = `Weight(lbs)`))
# Print plot with default points
p1 + geom_point(color="red") #set one color for all points
We can see that treatment condition 1 performed worse than both the control condition and treatment condition 2
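The comparison read from the point plot can be backed with the group means themselves. A minimal sketch, assuming the built-in PlantGrowth data (columns weight and group) matches the CSV; group_means is my own name.

```r
# Assumption: the built-in PlantGrowth data matches the CSV used in this analysis
# Mean weight per group
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
round(group_means, 3)

# Difference of each group's mean from the control mean;
# a negative value means the group yielded less than the control on average
round(group_means - group_means[["ctrl"]], 3)
```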
Let's look at the summary statistics for treatment condition 1, followed by a violin plot of the three conditions with the group means marked as points.
summary(subset(PlantGrowth_df,Group_Name=="treatment condition_1"))
## Seq_No Weight(lbs) Group_Name
## Min. :11.00 Min. :3.590 Length:10
## 1st Qu.:13.25 1st Qu.:4.207 Class :character
## Median :15.50 Median :4.550 Mode :character
## Mean :15.50 Mean :4.661
## 3rd Qu.:17.75 3rd Qu.:4.870
## Max. :20.00 Max. :6.030
# Note: in ggplot2 3.3.0 and later, fun.y is deprecated in favour of fun (i.e. fun = mean)
ggplot(PlantGrowth_df, aes(x = Group_Name, y = `Weight(lbs)`)) + geom_violin() + stat_summary(fun.y = mean, geom = "point", shape = 23, size = 2)
We notice that treatment condition 1 is responsible for the minimum plant yield.
Treatment condition 2 gives both the single highest yield and the highest yield on average.
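With only ten plants per group, it is worth checking whether the differences in average yield are larger than chance alone would produce. This analysis did not run such a test; the sketch below adds a one-way ANOVA followed by Tukey's HSD as one common way to do it, assuming the built-in PlantGrowth data (columns weight and group) matches the CSV. The names fit and tukey are my own.

```r
# Assumption: the built-in PlantGrowth data matches the CSV used in this analysis
# One-way ANOVA: does mean weight differ between the three groups?
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)

# Tukey's HSD: which pairs of groups differ, with p-values adjusted for multiple comparisons
tukey <- TukeyHSD(fit)
tukey$group
```

Each row of the Tukey table gives the difference in means for one pair of groups, a confidence interval, and an adjusted p-value, which indicates how well that difference can be distinguished from noise at this sample size.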
Conclusion: In this experiment, treatment condition 2 gave the highest plant yield (in lbs), both on average and for the single best plant, while treatment condition 1 yielded less than the control on average. On this evidence we recommend treatment condition 2 for future cultivation of this plant.