Chapter 2 Activity

Load Libraries

library(tidyverse)
library(skimr)
library(tibble)
library(psych)
library(patchwork)

Review Problem

Gibberellic acid (GA) is thought to elongate the stems of plants. Researchers conducted an experiment to investigate the effect of GA on a mutant strain of the genus Brassica called ros. They applied GA to 17 plants and applied water to 15 control plants. After 14 days they measured the growth of each of the 32 plants. In this experiment, (1) the researchers were trying to establish whether GA affects the growth rate of ros; (2) the response variable is 14-day growth of ros, which is numeric; (3) the predictor variable is group membership (GA group or control group) and is categorical; and (4) the two groups are independent of one another.

Create the Data

control <- c(3,2,34,12,6,118,14,107,30,9,3,3,49,4,6)

GA <- c(71,87,117,80,112,66,128,153,131,45,38,137,57,163,47,108,35)

Descriptive Statistics of Data

summary(control)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    3.50    9.00   26.67   32.00  118.00

summary(GA)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   35.00   57.00   87.00   92.65  128.00  163.00

a. Describe the growth of ros plants (mm) after 14 days.

After 14 days, the control plants grew 26.67 mm on average, far less than the GA plants, which averaged 92.65 mm. The medians tell the same story (9 mm versus 87 mm). The GA group also has the larger maximum (163 mm versus 118 mm) and the larger minimum (35 mm versus 2 mm). Descriptively, plants treated with gibberellic acid grew more than plants given only water, although these summaries alone are not a formal test.
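The same comparison can be computed in one step by stacking both groups into a long data frame. This is a minimal base-R sketch; the object names `growth` and `group_summary` are illustrative and do not appear in the original activity.

```r
# Data from the activity (14-day growth in mm)
control <- c(3,2,34,12,6,118,14,107,30,9,3,3,49,4,6)
GA <- c(71,87,117,80,112,66,128,153,131,45,38,137,57,163,47,108,35)

# Stack into long format: one row per plant (hypothetical helper object)
growth <- data.frame(
  group = rep(c("control", "GA"), times = c(length(control), length(GA))),
  mm    = c(control, GA)
)

# Mean, median, minimum and maximum per group
group_summary <- aggregate(
  mm ~ group, data = growth,
  FUN = function(x) c(mean = mean(x), median = median(x), min = min(x), max = max(x))
)
group_summary

# Difference in mean growth (GA minus control), in mm
mean_diff <- mean(GA) - mean(control)
mean_diff
```

Keeping the data in long format also makes it easier to pass both groups to a single plotting or testing call later.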

b. How to examine the distribution of the growth of ros plants (mm) after 14 days?

describe(control)

##    vars  n  mean    sd median trimmed mad min max range skew kurtosis   se
## X1    1 15 26.67 37.52      9   21.54 8.9   2 118   116 1.51     0.82 9.69

describe(GA)

##    vars  n  mean    sd median trimmed  mad min max range skew kurtosis    se
## X1    1 17 92.65 41.67     87    91.8 59.3  35 163   128 0.12    -1.48 10.11
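The `se` column in the output above is the standard error of the mean, which is the standard deviation divided by the square root of the sample size. The sketch below recomputes it in base R as a check; `se_mean` is a hypothetical helper name.

```r
control <- c(3,2,34,12,6,118,14,107,30,9,3,3,49,4,6)
GA <- c(71,87,117,80,112,66,128,153,131,45,38,137,57,163,47,108,35)

# Standard error of the mean: sd / sqrt(n) (hypothetical helper)
se_mean <- function(x) sd(x) / sqrt(length(x))

se_control <- se_mean(control)
se_GA <- se_mean(GA)

# Rounded to two decimals for comparison with the `se` column above
round(c(control = se_control, GA = se_GA), 2)
```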

Get the mode of data

getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

getmode(control)

## [1] 3

getmode(GA)

## [1] 71
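Note that `getmode()` returns only the first value that reaches the highest frequency. A small sketch, assuming a hypothetical helper `all_modes()`, shows how to check whether a reported mode is meaningful by looking at every tied value and at the highest count.

```r
control <- c(3,2,34,12,6,118,14,107,30,9,3,3,49,4,6)
GA <- c(71,87,117,80,112,66,128,153,131,45,38,137,57,163,47,108,35)

# Return every value that shares the highest frequency (hypothetical helper)
all_modes <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[counts == max(counts)])
}

all_modes(control)   # a single mode: the value 3 occurs three times
max(table(GA))       # highest count in GA is 1, so no value repeats
length(all_modes(GA))  # every one of the 17 GA values is tied as a "mode"
```

Because no GA value repeats, the 71 reported above is simply the first element of the vector rather than a real peak in the data.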

Create a Histogram with Density

#creating a histogram  with density of the growth of ros plants (mm) after 14 days

control_dist <- ggplot(mapping = aes(control))+
  geom_histogram(aes(y = after_stat(density)), fill = "blue", color = "blue", alpha = .7, binwidth = 10)+
  geom_density( fill = "skyblue", alpha = 0.8, color = "skyblue", adjust = .8)+
  geom_vline(aes(xintercept = mean(control), color = "mean"), lty = 2, lwd = 1)+
  geom_vline(aes(xintercept = median(control), color = "median"), lty = 2, lwd = 1)+
  geom_vline(aes(xintercept = getmode(control), color = "mode"), lty = 2, lwd = 1)+
  scale_color_manual(name = "Line Type",
                     breaks = c("mean", "median", "mode"), 
                     values = c("mean" = "blue", "median" = "red", "mode" = "yellow"))+
  theme_bw()+
  labs(title = "Control Distribution",
       subtitle = "Skewed to the right",
       x = "growth of ros plants (mm)")+
  theme( plot.title = element_text(size = 10,
                              face = "bold",
                              hjust = 0.5),
         legend.position = "bottom")

GA_dist<- ggplot(mapping = aes(GA))+
  geom_histogram(aes(y = after_stat(density)), fill = "#0c6124", color = "#0c6124", alpha = .5, binwidth = 10)+
  geom_density( fill = "#2fed33", alpha = 0.5, color = "#2fed33", adjust = .7)+
  geom_vline(aes(xintercept = mean(GA), color = "mean"), lty = 2, lwd = 1, show.legend = FALSE)+
  geom_vline(aes(xintercept = median(GA), color = "median"), lty = 2, lwd = 1, show.legend = FALSE)+
  geom_vline(aes(xintercept = getmode(GA), color = "mode"), lty = 2, lwd = 1, show.legend = FALSE)+
  scale_color_manual(breaks = c("mean", "median", "mode"), 
                     values = c("mean" = "blue", "median" = "red", "mode" = "yellow"))+
  theme_bw()+
  labs(title = "Gibberellic acid (GA) Distribution",
       subtitle = "Bimodal distribution",
       x = "growth of ros plants (mm)")+
  theme( plot.title = element_text(size = 10,
                              face = "bold",
                              hjust = 0.5))


(control_dist | GA_dist)+
  plot_annotation(
    title = "Growth of ros plants (mm) after 14 days Distribution",
    theme = theme(plot.title = element_text(size = 15,
                                            color = "blue"))
  )

The 14-day growth of the control plants has a positively skewed (right-skewed) distribution: most plants grew very little and a few grew a lot, so the mean (26.67 mm) is pulled above both the median (9 mm) and the mode (3 mm).

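The 14-day growth of the GA plants looks bimodal: the histogram has two peaks, one among the lower growth values and one among the higher ones. With only 17 plants this shape should be read cautiously. Every GA value occurs exactly once, so the "mode" of 71 drawn on the plot is simply the first value in the vector, not a meaningful peak.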

Another way of viewing the distribution of the data is using the boxplot

#boxplot of control

control_bxp <- ggplot(mapping = aes(control))+
  geom_boxplot(fill = "skyblue", color = "blue")+
  theme_bw() +
  labs(title = ("Control Boxplot"))+
  theme( plot.title = element_text(size = 15L,
                              face = "bold",
                              hjust = 0.5))
GA_bxp <- ggplot(mapping =  aes(GA))+
  geom_boxplot(fill = "orange", color = "orange", alpha = 0.5)+
  theme_bw()+
  labs(title = ("Gibberellic acid(GA) Boxplot"))+
  theme( plot.title = element_text(size = 15L,
                              face = "bold",
                              hjust = 0.5)) 

(control_bxp | GA_bxp)+
  plot_annotation(
    title = "Growth of ros plants (mm) after 14 days Boxplot",
    theme = theme(plot.title = element_text(size = 15,
                                            color = "blue"))
  )

The box-and-whisker plot for the control group shows that the values spread much further to the right of the median line than to the left, so the control data are skewed to the right. It also shows two outliers (107 mm and 118 mm). These extreme values inflate the standard deviation (37.52), which is larger than the mean (26.67), so the control measurements are widely dispersed.
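The points drawn as outliers follow the usual 1.5 × IQR rule, which can be reproduced by hand. This is a minimal base-R sketch; the names `fence_lo`, `fence_hi` and `outliers` are illustrative.

```r
control <- c(3,2,34,12,6,118,14,107,30,9,3,3,49,4,6)

# Quartiles and interquartile range
q <- quantile(control, c(0.25, 0.75))
iqr <- IQR(control)

# Fences: values beyond these are drawn as individual points
fence_lo <- unname(q[1] - 1.5 * iqr)
fence_hi <- unname(q[2] + 1.5 * iqr)

# Observations outside the fences
outliers <- control[control < fence_lo | control > fence_hi]
outliers
```

Only values above the upper fence are flagged here, which matches the right skew seen in the histogram.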

The box-and-whisker plot for the GA group looks roughly symmetric, as a normal distribution would, but the histogram shown previously suggests the data are actually bimodal. A boxplot cannot reveal this, because it displays only the median, the quartiles and the extremes. For a bimodal distribution the mean and median are also poor summaries, since they may fall between the two peaks.

Histograms and boxplots each have pros and cons. A histogram shows the shape of the distribution clearly, but it does not show clearly whether outliers are present. A boxplot, by contrast, hides the shape of the distribution but makes outliers obvious.

Problem 1

1. A sample of 15 patients was randomly split into two groups as part of a double-blind experiment to compare two pain relievers. The 7 patients in the first group were given Demerol and reported the following numbers of hours of pain relief: 2, 6, 4, 13, 5, 8, 4. The 8 patients in the second group were given an experimental drug and reported the following numbers of hours of pain relief: 0, 8, 1, 4, 2, 2, 1, 3. How might these data be analyzed?

Create a data set

# first_group has only 7 patients; pad with NA so both columns have 8 rows
first_group <- c(2,6,4,13,5,8,4,NA)
second_group <- c(0,8,1,4,2,2,1,3)

patient <- data.frame(first_group,second_group)

How might these data be analyzed?

Consider a T-Test: descriptive statistics first

describe(patient)

##              vars n mean   sd median trimmed  mad min max range skew kurtosis
## first_group     1 7 6.00 3.61      5    6.00 1.48   2  13    11 0.82    -0.71
## second_group    2 8 2.62 2.50      2    2.62 1.48   0   8     8 1.04    -0.14
##                se
## first_group  1.36
## second_group 0.89

Check Assumptions

The observations are independent of each other.
The variances of the two groups are approximately equal (checked below).
The data come from random sampling.
The data are normally distributed (checked below).

Homogeneity of Variances

ratio <- var(first_group, na.rm = T) / var(second_group)
ratio

## [1] 2.074074

Since this ratio is less than 4, we could assume that the variances between the two groups are approximately equal.
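The ratio rule of thumb can be complemented by a formal F test of equal variances with `var.test()` from base R. This is only a sketch: the F test is itself sensitive to non-normality, so with samples this small it should be read as a rough guide. The name `f_test` is illustrative.

```r
# Hours of pain relief (NA padding omitted here)
first_group <- c(2,6,4,13,5,8,4)
second_group <- c(0,8,1,4,2,2,1,3)

# Ratio of sample variances, as computed above
ratio <- var(first_group) / var(second_group)

# F test of H0: the two population variances are equal
f_test <- var.test(first_group, second_group)

f_test$estimate   # the same variance ratio
f_test$p.value    # a large p-value gives no evidence against equal variances
```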

Check whether the data are normally distributed

fg_density <- patient %>% 
  drop_na(first_group) %>% 
  ggplot(aes(first_group))+
  geom_density(fill = "orange", color = "orange", alpha = .5 )+
  geom_vline(xintercept = 6, color = "orange", lty = 2, lwd =1)+
  # annotate() draws the label once instead of once per row of data
  annotate("text", x = 8.5, y = 0.14, label = "Mean of hours of relief\n6.00 hours", size = 3)+
  theme_bw()+
  labs(
    title = "First Group Distribution",
    subtitle = "skew = 0.82\nThe distribution is only roughly normal:\nthe skew is moderately positive",
    x = "Hours of Relief"
  )

sg_density <- patient %>% 
  ggplot(aes(second_group))+
  geom_density(fill = "blue", color = "blue", alpha = .5 )+
  geom_vline(xintercept = mean(second_group), color = "blue", lty = 2, lwd =1)+
  # annotate() draws the label once instead of once per row of data
  annotate("text", x = 4.5, y = 0.20, label = "Mean of hours of relief\n2.62 hours", size = 3)+
  theme_bw()+
  labs(
    title = "Second Group Distribution",
    subtitle = "skew = 1.04\nThe distribution is not normal\nsince the skew value is well above 0",
    x = "Hours of Relief"
  )
  
(fg_density | sg_density) +
  plot_annotation(
    title = "Distribution of Hours of Relief in Two Groups",
    theme = theme(plot.title = element_text(size = 15,
                                            color = "blue"))
  )
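As a numeric complement to the density plots, the Shapiro-Wilk test (`shapiro.test()` in base R) can be applied to each group. This is a sketch only: with 7 and 8 observations the test has little power, so a large p-value does not establish normality. The names `sw_first` and `sw_second` are illustrative.

```r
# Hours of pain relief (NA padding omitted here)
first_group <- c(2,6,4,13,5,8,4)
second_group <- c(0,8,1,4,2,2,1,3)

# H0 for each test: the sample comes from a normal distribution
sw_first <- shapiro.test(first_group)
sw_second <- shapiro.test(second_group)

# Collect the W statistics and p-values in one small table
data.frame(
  group   = c("first_group", "second_group"),
  W       = c(sw_first$statistic, sw_second$statistic),
  p_value = c(sw_first$p.value, sw_second$p.value)
)
```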

Compare the two groups to see whether their typical hours of relief differ significantly.

Since the normality assumption is not met, we perform the Mann-Whitney U test (Wilcoxon rank-sum test) instead of a t-test.

my_test <- wilcox.test(first_group, second_group, exact = FALSE)
my_test

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  first_group and second_group
## W = 46.5, p-value = 0.03556
## alternative hypothesis: true location shift is not equal to 0

Conclusion

Hypothesis testing: Ho: the two groups have the same distribution of hours of relief (location shift equal to 0). Ha: the two distributions differ (location shift not equal to 0).

Since the p-value (0.03556) is less than 0.05, we reject the null hypothesis. We conclude that the two groups do not have the same distribution of hours of relief: patients given Demerol reported longer relief (median 5 hours) than patients given the experimental drug (median 2 hours).
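The decision rule can be written out explicitly so that the comparison of the p-value with the 0.05 significance level is unambiguous. This is a minimal sketch; `alpha` and `decision` are illustrative names.

```r
# Hours of pain relief (NA padding omitted here)
first_group <- c(2,6,4,13,5,8,4)
second_group <- c(0,8,1,4,2,2,1,3)

# Same test as above: ties are present, so the normal approximation is used
my_test <- wilcox.test(first_group, second_group, exact = FALSE)

# Reject H0 when the p-value falls below the significance level
alpha <- 0.05
decision <- if (my_test$p.value < alpha) "reject H0" else "fail to reject H0"
decision

# Group medians show the direction of the difference
c(first = median(first_group), second = median(second_group))
```

A p-value below `alpha` means the observed difference would be unlikely if both groups shared one distribution, so the null hypothesis is rejected.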

Problem 2

A researcher was interested in the relationship between forearm length and height. He measured the forearm lengths and heights of a sample of 16 women and obtained the following data. How might these data be (i) visualized and (ii) analyzed?

height <- c(163,161,151,163,166,168,170,163,175,178,163,161,173,160,158,170)
length <- c(25.5,26,25,25,27.2,26,26,26,26,27,24.5,26,28,24.5,25,26)

forearm <- data.frame(height, length)

Plot the data

forearm %>% 
  ggplot(aes(height, length, color = height))+
  scale_color_viridis_c()+
  geom_point(size = 3)+
  geom_smooth(method = "lm", se = FALSE)+
  theme_bw()

## `geom_smooth()` using formula 'y ~ x'

Analyze

The graph above shows a positive linear trend: forearm length tends to increase with height. A few points lie far from the fitted line and may be outliers.

Get the correlation coefficient

cor(forearm$length, forearm$height)

## [1] 0.656379

The r value is 0.66, which indicates a moderately strong positive linear relationship between height and forearm length: as height increases, forearm length tends to increase as well.
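To go one step beyond the coefficient itself, `cor.test()` tests whether the correlation differs from zero, and `lm()` gives the slope of the fitted line shown in the plot. This is a sketch under the usual assumptions of Pearson correlation and simple linear regression; `ct` and `fit` are illustrative names.

```r
height <- c(163,161,151,163,166,168,170,163,175,178,163,161,173,160,158,170)
length <- c(25.5,26,25,25,27.2,26,26,26,26,27,24.5,26,28,24.5,25,26)
forearm <- data.frame(height, length)

# Pearson correlation with a test of H0: true correlation is 0
ct <- cor.test(forearm$length, forearm$height)
ct$estimate   # the same r as cor() above
ct$p.value

# Simple linear regression of forearm length on height
fit <- lm(length ~ height, data = forearm)
coef(fit)     # intercept and slope (change in forearm length per unit of height)
```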