For this project I picked the PlantGrowth dataset, which contains the results from an experiment on plant growth. The source file is https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/PlantGrowth.csv

The dataset holds the results from an experiment comparing yields (as measured by dried weight of plants) obtained under a control and two different treatment conditions.

In this analysis I would like to find out which treatment condition gives the largest plant yield on average, and whether the control condition can outperform either treatment condition.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.3.2

Load the file into a data frame for in-depth analysis with R

data_url <- "https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/datasets/PlantGrowth.csv"

PlantGrowth_df <- read.table(file=data_url, header = TRUE, sep =",",
                          stringsAsFactors = FALSE)

Preview the data frame to ensure proper loading of data

head(PlantGrowth_df,10)
##     X weight group
## 1   1   4.17  ctrl
## 2   2   5.58  ctrl
## 3   3   5.18  ctrl
## 4   4   6.11  ctrl
## 5   5   4.50  ctrl
## 6   6   4.61  ctrl
## 7   7   5.17  ctrl
## 8   8   4.53  ctrl
## 9   9   5.33  ctrl
## 10 10   5.14  ctrl

Rename the columns: weight to Weight(lbs), X to Seq_No, and group to Group_Name

names(PlantGrowth_df) <- c("Seq_No", "Weight(lbs)","Group_Name")

#display new column names
names(PlantGrowth_df)
## [1] "Seq_No"      "Weight(lbs)" "Group_Name"

Find distinct/unique values of the Group_Name

unique(PlantGrowth_df$Group_Name)
## [1] "ctrl" "trt1" "trt2"

Rename the different group codes to something more detailed

lookup_Groupname <- c(ctrl="control condition", trt1 ="treatment condition_1",
                      trt2="treatment condition_2")
#Indexing the named vector by name returns the matching values; a name can be repeated
lookup_Groupname[c("ctrl","trt1","trt1")]
##                    ctrl                    trt1                    trt1 
##     "control condition" "treatment condition_1" "treatment condition_1"
#Create a character vector from the column values of the dataframe
lookup_Groupname[PlantGrowth_df$Group_Name]
##                    ctrl                    ctrl                    ctrl 
##     "control condition"     "control condition"     "control condition" 
##                    ctrl                    ctrl                    ctrl 
##     "control condition"     "control condition"     "control condition" 
##                    ctrl                    ctrl                    ctrl 
##     "control condition"     "control condition"     "control condition" 
##                    ctrl                    trt1                    trt1 
##     "control condition" "treatment condition_1" "treatment condition_1" 
##                    trt1                    trt1                    trt1 
## "treatment condition_1" "treatment condition_1" "treatment condition_1" 
##                    trt1                    trt1                    trt1 
## "treatment condition_1" "treatment condition_1" "treatment condition_1" 
##                    trt1                    trt1                    trt2 
## "treatment condition_1" "treatment condition_1" "treatment condition_2" 
##                    trt2                    trt2                    trt2 
## "treatment condition_2" "treatment condition_2" "treatment condition_2" 
##                    trt2                    trt2                    trt2 
## "treatment condition_2" "treatment condition_2" "treatment condition_2" 
##                    trt2                    trt2                    trt2 
## "treatment condition_2" "treatment condition_2" "treatment condition_2"
#prior_pG_Groupname <-PlantGrowth_df[,3]
#Replace the Group_Name column in the plant DF with the character vector created but with no names
PlantGrowth_df[,3] <- unname(lookup_Groupname[PlantGrowth_df$Group_Name])
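The same recoding can also be sketched with a factor, which keeps the three conditions in a fixed order for plotting. This is an alternative, not the approach used above; the small `groups` vector is made-up illustration data standing in for the original group codes.

```r
# Hypothetical stand-in for the original group codes
groups <- c("ctrl", "trt1", "trt2", "trt1")

# Map each code to its descriptive label; levels and labels pair up by position
group_labels <- factor(groups,
                       levels = c("ctrl", "trt1", "trt2"),
                       labels = c("control condition",
                                  "treatment condition_1",
                                  "treatment condition_2"))

# Convert back to plain character strings if a factor is not wanted
as.character(group_labels)
```

A factor also makes it harder for a misspelled code to slip through, since any value not listed in `levels` becomes `NA`.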

Let us first take a quick look at the distribution of weight in lbs of plant yields across all conditions

ggplot(data = PlantGrowth_df) +geom_histogram(aes(x=`Weight(lbs)`))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Create a subset of PlantGrowth_df for the control group so more analysis can be done

PlantGrowth_df_ctrl_sub <- subset(PlantGrowth_df,Group_Name=="control condition")

Get summary statistics of the control condition: mean, median, max, min

summary(PlantGrowth_df_ctrl_sub)
##      Seq_No       Weight(lbs)     Group_Name       
##  Min.   : 1.00   Min.   :4.170   Length:10         
##  1st Qu.: 3.25   1st Qu.:4.550   Class :character  
##  Median : 5.50   Median :5.155   Mode  :character  
##  Mean   : 5.50   Mean   :5.032                     
##  3rd Qu.: 7.75   3rd Qu.:5.293                     
##  Max.   :10.00   Max.   :6.110

Display this subset in a box plot

boxplot(PlantGrowth_df_ctrl_sub$`Weight(lbs)`)

This shows that the control weights are fairly tightly grouped: the middle half lies between about 4.55 and 5.29 lbs, and the full range is 4.17 to 6.11 lbs.
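The spread of the control group can be checked numerically as well. A minimal sketch, assuming the `PlantGrowth` data set that ships with R's `datasets` package holds the same values as the CSV loaded above (so the original column names `weight` and `group` are used):

```r
# Built-in copy of the data; assumed identical to the CSV used in this report
data(PlantGrowth)

ctrl_weight <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]

range(ctrl_weight)                    # smallest and largest control weight
quantile(ctrl_weight, c(0.25, 0.75))  # first and third quartiles, as in summary()
sd(ctrl_weight)                       # standard deviation
```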

Compare the box plots of all the different conditions

boxplot(`Weight(lbs)`~ Group_Name,  data = PlantGrowth_df )

We notice that treatment condition 2 gives higher yields in general, with less variance.

The control condition has the widest interquartile range (4.55 to 5.29 lbs), while treatment condition 1 has the widest overall range (3.59 to 6.03 lbs).
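One way to put numbers on the spread of each group is to compute the standard deviation and interquartile range per condition. This sketch again assumes the built-in `PlantGrowth` data set matches the CSV, and uses its original column names rather than the renamed ones.

```r
# Built-in copy of the data; assumed identical to the CSV used in this report
data(PlantGrowth)

# Standard deviation of weight within each group
group_sd <- tapply(PlantGrowth$weight, PlantGrowth$group, sd)

# Interquartile range (height of the box) within each group
group_iqr <- tapply(PlantGrowth$weight, PlantGrowth$group, IQR)

round(group_sd, 3)
round(group_iqr, 3)
```

The two measures need not agree on which group is "most variable", because the standard deviation is sensitive to extreme values while the interquartile range ignores them.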

Study summary statistics of the entire data frame

summary(PlantGrowth_df)
##      Seq_No       Weight(lbs)     Group_Name       
##  Min.   : 1.00   Min.   :3.590   Length:30         
##  1st Qu.: 8.25   1st Qu.:4.550   Class :character  
##  Median :15.50   Median :5.155   Mode  :character  
##  Mean   :15.50   Mean   :5.073                     
##  3rd Qu.:22.75   3rd Qu.:5.530                     
##  Max.   :30.00   Max.   :6.310

We notice that the minimum is 3.59, the mean is 5.073, and the maximum is 6.310

We need to figure out which treatment condition performed worse than the control condition

p1 <- ggplot(PlantGrowth_df, aes(x = Group_Name, y = `Weight(lbs)`))

# Print plot with default points


p1 + geom_point(color="red")            #set one color for all points

We can see that treatment condition 1 performed worse than both the control condition and treatment condition 2

Let's look at the summary statistics for treatment condition 1 and at a violin plot of the 3 conditions with the mean of each marked as a point.

summary(subset(PlantGrowth_df,Group_Name=="treatment condition_1"))
##      Seq_No       Weight(lbs)     Group_Name       
##  Min.   :11.00   Min.   :3.590   Length:10         
##  1st Qu.:13.25   1st Qu.:4.207   Class :character  
##  Median :15.50   Median :4.550   Mode  :character  
##  Mean   :15.50   Mean   :4.661                     
##  3rd Qu.:17.75   3rd Qu.:4.870                     
##  Max.   :20.00   Max.   :6.030
# fun replaces fun.y, which is deprecated as of ggplot2 3.3.0
ggplot(PlantGrowth_df, aes(x=Group_Name,y=`Weight(lbs)`)) + geom_violin() + stat_summary(fun=mean, geom="point", shape=23, size=2)

We notice that treatment condition 1 is responsible for the minimum plant yield. We also notice that treatment condition 2 gives both the highest single yield and the highest yield on average.
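The per-group averages behind this comparison can be computed directly. A minimal sketch, assuming the built-in `PlantGrowth` data set matches the CSV used here (original column names `weight` and `group`):

```r
# Built-in copy of the data; assumed identical to the CSV used in this report
data(PlantGrowth)

# Mean weight for each condition
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
group_means

# Name of the condition with the highest mean weight
names(group_means)[which.max(group_means)]
```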

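Conclusion: Of the two treatment conditions, treatment condition 2 gives the highest plant yield in lbs, both at the maximum and on average, while treatment condition 1 yields less than the control condition on average. Based on this sample of 10 plants per group, treatment condition 2 looks like the most promising method for growing this plant.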
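The analysis above is descriptive. One possible follow-up, not part of the original analysis, is a one-way ANOVA to check whether the differences between group means are larger than chance would explain. The sketch below assumes the built-in `PlantGrowth` data set matches the CSV and that the usual ANOVA assumptions (roughly normal residuals, similar group variances) are acceptable for these data.

```r
# Built-in copy of the data; assumed identical to the CSV used in this report
data(PlantGrowth)

# One-way ANOVA: does mean weight differ between the three groups?
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)

# Pairwise comparisons between groups with Tukey's adjustment
TukeyHSD(fit)
```

The `summary()` table reports an overall F test, and `TukeyHSD()` shows which pairs of conditions differ, which is the more direct check on the conclusion drawn here.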