R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:

library(tidyverse)
data("diamonds")

str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Part 2: Tidyverse and Data Wrangling Skills

Question 1

avgdiamonds <- diamonds %>%
  select(depth, price) %>%
  summarise(avgdepth = mean(depth), avgprice = mean(price))
avgdiamonds
## # A tibble: 1 x 2
##   avgdepth avgprice
##      <dbl>    <dbl>
## 1     61.7    3933.

Question 2

diamonds %>%
  mutate(ppc = price/carat)
## # A tibble: 53,940 x 11
##    carat cut       color clarity depth table price     x     y     z   ppc
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 1417.
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 1552.
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31 1422.
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 1152.
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75 1081.
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 1400 
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 1400 
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 1296.
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 1532.
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39 1470.
## # … with 53,930 more rows

Question 3

avgpercut <- diamonds %>%
  group_by(cut) %>%
  summarise(avgprice = mean(price))
avgpercut$avgprice
## [1] 4358.758 3928.864 3981.760 4584.258 3457.542

Question 5

diamondbycolor <- diamonds %>%
  group_by(color) %>%
  summarise(avgdepth = mean(depth), avgtable = mean(table))
diamondbycolor$avgdepth
## [1] 61.69813 61.66209 61.69458 61.75711 61.83685 61.84639 61.88722
diamondbycolor$avgtable
## [1] 57.40459 57.49120 57.43354 57.28863 57.51781 57.57728 57.81239

Extra credit

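newcolumns <- diamonds %>%
  group_by(color) %>%
  mutate(avgdepth = mean(depth), avgtable = mean(table)) %>%
  ungroup()  # drop the grouping so later steps see a plain tibble
# View(newcolumns)

# newcolumns already holds every diamonds column plus the two averages.
# To build the same table with a join, attach the per-color summary from
# Question 5 using color as the key.
totalcolumns <- left_join(diamonds, diamondbycolor, by = "color")
# View(totalcolumns)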
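
When group-level averages are attached with a join, it is worth confirming that the join neither adds nor drops rows. The following is a minimal sketch of such a check, assuming the per-color summary is joined on color alone; the names `color_avgs` and `joined` are illustrative and do not appear in the original answer.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# One row per color with its average depth and table (illustrative name).
color_avgs <- diamonds %>%
  group_by(color) %>%
  summarise(avgdepth = mean(depth), avgtable = mean(table))

# Each diamond matches exactly one summary row, so this join is many-to-one.
joined <- left_join(diamonds, color_avgs, by = "color")

# The row count should be unchanged and every row should have an average.
stopifnot(nrow(joined) == nrow(diamonds))
stopifnot(!anyNA(joined$avgdepth), !anyNA(joined$avgtable))
```

Naming the key with `by = "color"` keeps the join from silently matching on every shared column, which matters when the left table contains repeated rows.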

Question 6

largest <- diamonds %>%
  group_by(color) %>%
  summarise(avgprice = mean(price))
# Color J diamonds have the highest average price, about $5,324.
largest
## # A tibble: 7 x 2
##   color avgprice
##   <ord>    <dbl>
## 1 D        3170.
## 2 E        3077.
## 3 F        3725.
## 4 G        3999.
## 5 H        4487.
## 6 I        5092.
## 7 J        5324.

Question 7

idealdiamonds <- diamonds %>%
  filter(cut == "Ideal") %>%
  group_by(color)
# Color G is the most frequent color in Ideal cut with 4884 diamonds in this category.  
summary(idealdiamonds$color)
##    D    E    F    G    H    I    J 
## 2834 3903 3826 4884 3115 2093  896

Question 8

tablepercarats <- diamonds %>%
  select(clarity, table, carat) %>%
  mutate(tpc = table/carat) %>%
  group_by(clarity) %>%
  summarise(avgtpc = mean(tpc))
# VVS1 has the largest average table-to-carat ratio, about 141.
tablepercarats
## # A tibble: 8 x 2
##   clarity avgtpc
##   <ord>    <dbl>
## 1 I1        56.3
## 2 SI2       69.1
## 3 SI1       89.6
## 4 VS2      103. 
## 5 VS1      107. 
## 6 VVS2     127. 
## 7 VVS1     141. 
## 8 IF       140.

Question 9

ppc <- diamonds %>%
  filter(price > 10000) %>% 
  mutate(ppc = price/carat) %>%
  summarise(avg = mean(ppc))
# The average price per carat of diamonds over $10,000 is $8,044.
ppc
## # A tibble: 1 x 1
##     avg
##   <dbl>
## 1 8044.

Question 10

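commonclarity <- diamonds %>%
  filter(price > 10000) %>%
  # Keep only clarity; summarise() returning many rows per group is
  # deprecated in dplyr >= 1.1.0, and summary() below tallies each level.
  select(clarity)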
# Diamonds over $10,000 are most commonly SI2 clarity.
summary(commonclarity)
##     clarity    
##  SI2    :1239  
##  SI1    :1184  
##  VS2    :1155  
##  VS1    : 747  
##  VVS2   : 452  
##  VVS1   : 247  
##  (Other): 198
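
Because `summary()` lumps the least frequent levels into "(Other)", a per-level tally can be easier to read. A minimal sketch using `dplyr::count()`, assuming the same price threshold as above; the name `clarity_counts` is illustrative and not part of the original answer.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Count diamonds priced above $10,000 in each clarity grade,
# most frequent grade first.
clarity_counts <- diamonds %>%
  filter(price > 10000) %>%
  count(clarity, sort = TRUE)

clarity_counts
```

The first row of `clarity_counts` is the most common clarity grade, and every grade gets its own row rather than being folded into a catch-all bucket.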

Part 3: Data Visualization

# Exploring the dataset

data("ToothGrowth")
#?ToothGrowth
# View(ToothGrowth)

Question 1

Each row represents one observation: the tooth length measured for a single guinea pig.

Question 2

Each column represents one variable: tooth length (len, continuous numeric), supplement type (supp, categorical), and dose (dose, recorded as a number but with only three levels, so effectively ordinal).

Question 3

The response variable is tooth length (len), and the explanatory variables are supplement type and dose.

Question 4

I predict that higher dosage will correlate with more tooth growth, and that the VC supplement will be more effective than OJ.

Question 5

ggplot(ToothGrowth, aes(supp, len, fill = supp)) +
  geom_boxplot()

Question 6

ggplot(ToothGrowth, aes(supp, len, fill = supp)) +
  geom_boxplot() +
  facet_wrap(~dose)

Question 7

In general, OJ results in greater tooth length than VC, most clearly at the two lower doses, and length increases with dose for both supplements.

Question 8

Part of my hypothesis is supported: tooth length increases with dose. The prediction that VC would be more effective than OJ is not supported.
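
The reading of the boxplots in Questions 7 and 8 can be cross-checked numerically. The sketch below computes the mean tooth length for each combination of dose and supplement; it assumes the built-in ToothGrowth data, and the names `tooth_means` and `mean_len` are illustrative rather than part of the original answers.

```r
library(dplyr)

data("ToothGrowth")

# Mean tooth length and group size for every dose/supplement combination.
tooth_means <- ToothGrowth %>%
  group_by(dose, supp) %>%
  summarise(mean_len = mean(len), n = n(), .groups = "drop")

tooth_means
```

Comparing the two `mean_len` values within each dose shows which supplement comes out ahead at that dose, and reading down the rows for one supplement shows how length changes as the dose rises.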