MIDTERM 1 PART 2

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
data("diamonds")
?diamonds
head(diamonds)
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

Question 1: Make a new data set that has the average depth and price of the diamonds in the data set.

diamondsQ1 <- diamonds%>%
  mutate(avgDepth = mean(depth), avePrice = mean(price))
head(diamondsQ1)
## # A tibble: 6 x 12
##   carat cut   color clarity depth table price     x     y     z avgDepth
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
## 1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98  2.43     61.7
## 2 0.21  Prem… E     SI1      59.8    61   326  3.89  3.84  2.31     61.7
## 3 0.23  Good  E     VS1      56.9    65   327  4.05  4.07  2.31     61.7
## 4 0.290 Prem… I     VS2      62.4    58   334  4.2   4.23  2.63     61.7
## 5 0.31  Good  J     SI2      63.3    58   335  4.34  4.35  2.75     61.7
## 6 0.24  Very… J     VVS2     62.8    57   336  3.94  3.96  2.48     61.7
## # … with 1 more variable: avePrice <dbl>
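
Note that mutate() keeps every row of diamonds and repeats the same two averages on each one. If the question is read as asking for a data set that contains only the averages, summarise() may be closer to the intent. The following is a minimal sketch under that assumption; the name diamondsAvg and the column name avgPrice are hypothetical and are not used elsewhere in this document.

```r
library(dplyr)
data("diamonds", package = "ggplot2")

# Collapse the whole data set to a single row holding the two averages
diamondsAvg <- diamonds %>%
  summarise(avgDepth = mean(depth), avgPrice = mean(price))

diamondsAvg
```

The result has one row and two columns, so the averages can be read directly instead of being repeated down the full table.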

Question 2: Add a new column to the data set that records each diamond’s price per carat.

diamondsQ2 <- diamondsQ1%>%
  mutate(pricePerCarat = price/carat)
head(diamondsQ2)
## # A tibble: 6 x 13
##   carat cut   color clarity depth table price     x     y     z avgDepth
##   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>    <dbl>
## 1 0.23  Ideal E     SI2      61.5    55   326  3.95  3.98  2.43     61.7
## 2 0.21  Prem… E     SI1      59.8    61   326  3.89  3.84  2.31     61.7
## 3 0.23  Good  E     VS1      56.9    65   327  4.05  4.07  2.31     61.7
## 4 0.290 Prem… I     VS2      62.4    58   334  4.2   4.23  2.63     61.7
## 5 0.31  Good  J     SI2      63.3    58   335  4.34  4.35  2.75     61.7
## 6 0.24  Very… J     VVS2     62.8    57   336  3.94  3.96  2.48     61.7
## # … with 2 more variables: avePrice <dbl>, pricePerCarat <dbl>

Question 3/4: Create a new data set that groups diamonds by their cut and displays the average price of each group.

diamondsQ3<- diamonds
diamondsQ3%>%
  group_by(cut)%>%
  summarise(meanPrice=mean(price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
##   cut       meanPrice
##   <ord>         <dbl>
## 1 Fair          4359.
## 2 Good          3929.
## 3 Very Good     3982.
## 4 Premium       4584.
## 5 Ideal         3458.

Question 5: Create a new data set that groups diamonds by color and displays the average depth and average table for each group.

diamondsQ5 <- diamonds
diamondsQ5%>%
  group_by(color)%>%
  summarise(meanDepth=mean(depth),
            meanTable=mean(table))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 3
##   color meanDepth meanTable
##   <ord>     <dbl>     <dbl>
## 1 D          61.7      57.4
## 2 E          61.7      57.5
## 3 F          61.7      57.4
## 4 G          61.8      57.3
## 5 H          61.8      57.5
## 6 I          61.8      57.6
## 7 J          61.9      57.8

Question 6: Which color diamonds seem to be largest on average (in terms of carats)? Diamonds of the worst color grade, J, are the largest on average, with a mean of about 1.16 carats.

diamondsQ6 <- diamonds
diamondsQ6%>%
  group_by(color)%>%
  summarise(meanCarat=mean(carat))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
##   color meanCarat
##   <ord>     <dbl>
## 1 D         0.658
## 2 E         0.658
## 3 F         0.737
## 4 G         0.771
## 5 H         0.912
## 6 I         1.03 
## 7 J         1.16
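
A mean alone does not say how many diamonds stand behind it. As a possible extension, the sketch below reports the size of each color group next to its mean carat and sorts the groups from largest to smallest mean. The name caratByColor is hypothetical.

```r
library(dplyr)
data("diamonds", package = "ggplot2")

# Group size and mean carat per color, largest mean first
caratByColor <- diamonds %>%
  group_by(color) %>%
  summarise(n = n(), meanCarat = mean(carat), .groups = "drop") %>%
  arrange(desc(meanCarat))

caratByColor
```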

Question 7: What color of diamonds occurs the most frequently among diamonds with ideal cuts? The color “G” is the most common among ideal cut diamonds.

diamondsQ7 <- diamonds
diamondsQ7%>%
  filter(cut == "Ideal")%>%
  group_by(color)%>%
  summarise(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
##   color     n
##   <ord> <int>
## 1 D      2834
## 2 E      3903
## 3 F      3826
## 4 G      4884
## 5 H      3115
## 6 I      2093
## 7 J       896
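
The same tally can be written more compactly with count(), which combines group_by() and summarise(n = n()). With sort = TRUE the most frequent color appears in the first row, so the answer does not have to be picked out by eye. This is a sketch of an alternative, not a replacement for the code above; the name idealColors is hypothetical.

```r
library(dplyr)
data("diamonds", package = "ggplot2")

# Count colors among Ideal-cut diamonds, most frequent first
idealColors <- diamonds %>%
  filter(cut == "Ideal") %>%
  count(color, sort = TRUE)

idealColors
```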

Question 8: Which clarity of diamonds has the largest average table per carat? The VVS1 clarity has the largest average table per carat (about 141), just ahead of IF (about 140).

diamondsQ8 <- diamonds
diamondsQ8%>%
  mutate(tablePerCarat = table/carat)%>%
  group_by(clarity)%>%
  summarise(mean(tablePerCarat))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
##   clarity `mean(tablePerCarat)`
##   <ord>                   <dbl>
## 1 I1                       56.3
## 2 SI2                      69.1
## 3 SI1                      89.6
## 4 VS2                     103. 
## 5 VS1                     107. 
## 6 VVS2                    127. 
## 7 VVS1                    141. 
## 8 IF                      140.
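
Because VVS1 and IF are close, it can help to let R pick the maximum. The sketch below names the summary column and keeps only the top row with slice_max(), which assumes dplyr 1.0.0 or later. The names topClarity and meanTablePerCarat are hypothetical.

```r
library(dplyr)
data("diamonds", package = "ggplot2")

# Average table-per-carat by clarity, keeping only the largest group mean
topClarity <- diamonds %>%
  mutate(tablePerCarat = table / carat) %>%
  group_by(clarity) %>%
  summarise(meanTablePerCarat = mean(tablePerCarat), .groups = "drop") %>%
  slice_max(meanTablePerCarat, n = 1)

topClarity
```

Naming the column inside summarise() also avoids the backtick-quoted default name `mean(tablePerCarat)` in the output.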

Question 9: What is the average price per carat of diamonds that cost more than $10000? The average price per carat of diamonds that cost more than $10000 is $8044 per carat.

diamondsQ9<-diamonds
diamondsQ9%>%
  filter(price > 10000)%>%
  summarise(mean(price/carat))
## # A tibble: 1 x 1
##   `mean(price/carat)`
##                 <dbl>
## 1               8044.

Question 10: Of the diamonds that cost more than $10000, what is the most common clarity? The SI2 clarity is the most common for diamonds that cost more than $10000.

diamondsQ10<-diamonds
diamondsQ10%>%
  filter(price >10000)%>%
  group_by(clarity)%>%
  summarise(n=n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
##   clarity     n
##   <ord>   <int>
## 1 I1         30
## 2 SI2      1239
## 3 SI1      1184
## 4 VS2      1155
## 5 VS1       747
## 6 VVS2      452
## 7 VVS1      247
## 8 IF        168

PART 3

# Load in the data 
data("ToothGrowth")
# Learn about the data 
?ToothGrowth
# Structure of the dataset 
str(ToothGrowth)
## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# Look at the data 
View(ToothGrowth)
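
Besides str() and View(), a quick tabulation can show which values dose actually takes and how the animals are split across supplement and dose. This may help when deciding whether a numeric variable is better described as continuous or discrete. A minimal sketch using base R only; the names doseLevels and groupSizes are hypothetical.

```r
data("ToothGrowth")

# Distinct dose values that occur in the data
doseLevels <- sort(unique(ToothGrowth$dose))
doseLevels

# Number of guinea pigs in each supplement-by-dose combination
groupSizes <- table(ToothGrowth$supp, ToothGrowth$dose)
groupSizes
```
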
  1. What do the rows of this dataset represent? Each row represents one observation: a single guinea pig’s tooth length, together with the supplement type and dose it received.

  2. What do the columns of this dataset represent? Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal. The first column, len, is the length of the guinea pig’s tooth. It is numerical and continuous. The second column, supp, is the supplement type the guinea pig was given. It is categorical and not ordinal. The third column, dose, is how much supplement the guinea pig was given. It is numerical and discrete.

  3. What are the response and explanatory variables in this study? The response variable is the tooth length. The explanatory variables are the supplement type and the dose.

  4. Create a hypothesis about supplement treatment and dosage levels, without first looking at the data. The guinea pigs that got higher doses of ascorbic acid will have longer teeth.

  5. Use ggplot to create a side-by-side boxplot, which illustrates the distribution of each supplement treatment and allows for both visual comparison across and within treatments.

ggplot(ToothGrowth, aes(supp, len, fill = supp))+
  geom_boxplot()+
  theme_bw()

  6. Now add facets to your data to compare across dosage as well.

ggplot(ToothGrowth, aes(supp, len, fill = supp))+
  geom_boxplot()+
  facet_wrap(~dose)+
  theme_bw()

  7. Describe any possible trends in these data. Explain in the context of this study. It appears that the OJ supplement is more effective at growing the guinea pigs’ teeth. It also looks like the higher the dose, the longer the guinea pigs’ teeth get. At the highest dose (2 mg/day), the type of supplement does not make a large difference.

  8. Did you see anything surprising? Does your hypothesis appear to be supported? Note that this is not a formal hypothesis test but rather an exploration. My initial assumption that VC would be better at growing teeth did not hold up, although the higher doses did result in longer teeth.
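
The visual impressions from the boxplots can be checked against group summaries. The sketch below computes the mean and median tooth length for each dose and supplement combination. It is an exploratory summary only, not a formal test, and the names toothSummary, meanLen and medianLen are hypothetical.

```r
library(dplyr)
data("ToothGrowth")

# Mean and median tooth length for every dose-by-supplement group
toothSummary <- ToothGrowth %>%
  group_by(dose, supp) %>%
  summarise(meanLen = mean(len), medianLen = median(len), .groups = "drop")

toothSummary
```

Comparing the OJ and VC rows within each dose shows how large the gap between supplements is at that dose, and comparing rows across doses shows how tooth length changes as the dose increases.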