Midterm MATH239

Part II

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   1.0.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(ggplot2)
library(dplyr)
summary(diamonds)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
##

Question 1

Make a new data set that has the average depth and price of the diamonds in the data set.

## # A tibble: 1 x 2
## # Groups:   mean_depth, mean_price [1]
##   mean_depth mean_price
##        <dbl>      <dbl>
## 1       61.7      3933.
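
The code for this question is not shown above. A minimal sketch that should give the same two averages, assuming a plain summarize() on the whole data set (the name diamond_means is mine, and an ungrouped call like this would not print the "Groups" line seen in the output):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Assumed reconstruction: overall averages with no grouping step
diamond_means <- diamonds %>%
  summarize(mean_depth = mean(depth, na.rm = TRUE),
            mean_price = mean(price, na.rm = TRUE))

diamond_means
```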

Question 2

Add a new column to the data set that records each diamond’s price per carat.

diamonds %>%
  mutate(ppcarat = price/carat)

## # A tibble: 53,940 x 11
##    carat cut       color clarity depth table price     x     y     z ppcarat
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>   <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43   1417.
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31   1552.
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31   1422.
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63   1152.
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75   1081.
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48   1400 
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47   1400 
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53   1296.
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49   1532.
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39   1470.
## # … with 53,930 more rows
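
The pipeline above prints the new column but does not store it. If the column is needed later, one option is to assign the result to a new object (diamonds_ppc is a hypothetical name, not part of the original answer):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Keep the result so ppcarat is available in later steps
diamonds_ppc <- diamonds %>%
  mutate(ppcarat = price / carat)

names(diamonds_ppc)
```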

Questions 3 and 4

Create a new data set that groups diamonds by their cut and displays the average price of each group.

diamonds %>%
  group_by(cut) %>%
  dplyr::summarize(ave_price = mean(price, na.rm=TRUE))

## # A tibble: 5 x 2
##   cut       ave_price
## * <ord>         <dbl>
## 1 Fair          4359.
## 2 Good          3929.
## 3 Very Good     3982.
## 4 Premium       4584.
## 5 Ideal         3458.

Question 5

Create a new data set that groups diamonds by color and displays the average depth and average table for each group.

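diamonds %>%
  group_by(color) %>%
  summarize(ave_depth = mean(depth, na.rm=TRUE),
            aver_table = mean(table, na.rm=TRUE))

Note: the ave_depth column printed below came from a run that averaged price instead of depth, so its values are mean prices. The aver_table column is correct.

## # A tibble: 7 x 3
##   color ave_depth aver_table
## * <ord>     <dbl>      <dbl>
## 1 D         3170.       57.4
## 2 E         3077.       57.5
## 3 F         3725.       57.4
## 4 G         3999.       57.3
## 5 H         4487.       57.5
## 6 I         5092.       57.6
## 7 J         5324.       57.8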

Question 6

Which color diamonds seem to be largest on average (in terms of carats)?

diamonds %>%
  group_by(color) %>%
  dplyr::summarize(ave_size = mean(carat, na.rm=TRUE))

## # A tibble: 7 x 2
##   color ave_size
## * <ord>    <dbl>
## 1 D        0.658
## 2 E        0.658
## 3 F        0.737
## 4 G        0.771
## 5 H        0.912
## 6 I        1.03 
## 7 J        1.16

From the table above, color J diamonds are the largest on average (about 1.16 carats).
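
Instead of reading the largest value off the table, the top row can be pulled out directly. A small sketch using slice_max() from dplyr (available in dplyr 1.0.0 and later; largest_color is a name I made up):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Average carat per color, then keep only the color with the largest average
largest_color <- diamonds %>%
  group_by(color) %>%
  summarize(ave_size = mean(carat, na.rm = TRUE)) %>%
  slice_max(ave_size, n = 1)

largest_color
```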

Question 7

What color of diamonds occurs the most frequently among diamonds with ideal cuts?

diamonds %>%
  group_by(color) %>%
  tally()

## # A tibble: 7 x 2
##   color     n
## * <ord> <int>
## 1 D      6775
## 2 E      9797
## 3 F      9542
## 4 G     11292
## 5 H      8304
## 6 I      5422
## 7 J      2808

The table above counts colors across all cuts rather than only Ideal ones; in it, G is the most common color.
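
The question asks only about diamonds with an Ideal cut, so a filter step is needed before counting. A sketch of how that might look, assuming count() with sort = TRUE in place of group_by() and tally() (ideal_colors is a hypothetical name):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Restrict to Ideal cuts first, then count colors from most to least common
ideal_colors <- diamonds %>%
  filter(cut == "Ideal") %>%
  count(color, sort = TRUE)

ideal_colors
```

The first row of ideal_colors is then the most frequent color among Ideal cuts.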

Question 8

Which clarity of diamonds has the largest average table per carat?

diamonds %>%
  group_by(clarity) %>%
  mutate(tablePerCarat = table/carat) %>%
  summarise(ave = mean(tablePerCarat))

## # A tibble: 8 x 2
##   clarity   ave
## * <ord>   <dbl>
## 1 I1       56.3
## 2 SI2      69.1
## 3 SI1      89.6
## 4 VS2     103. 
## 5 VS1     107. 
## 6 VVS2    127. 
## 7 VVS1    141. 
## 8 IF      140.

From the table above, VVS1 has the largest average table per carat, narrowly ahead of IF.

Question 9

What is the average price per carat of diamonds that cost more than $10000?

diamonds %>%
  filter(price > 10000) %>%
  mutate(ppcarat = price/carat) %>%
  dplyr::summarize(ave_ppcarat = mean(ppcarat, na.rm=TRUE))

## # A tibble: 1 x 1
##   ave_ppcarat
##         <dbl>
## 1       8044.

Question 10

Of the diamonds that cost more than $10000, what is the most common clarity?

diamonds %>%
  filter(price > 10000) %>%
  group_by(clarity) %>%
  tally()

## # A tibble: 8 x 2
##   clarity     n
## * <ord>   <int>
## 1 I1         30
## 2 SI2      1239
## 3 SI1      1184
## 4 VS2      1155
## 5 VS1       747
## 6 VVS2      452
## 7 VVS1      247
## 8 IF        168

From the table above, SI2 is the most common clarity.
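
The same counts can be sorted so the most common clarity appears first. A short sketch using count() with sort = TRUE (expensive_clarity is a name I chose for illustration):

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Count clarities among diamonds over $10000, largest count first
expensive_clarity <- diamonds %>%
  filter(price > 10000) %>%
  count(clarity, sort = TRUE)

expensive_clarity
```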

Part III

#data("ToothGrowth")
# Learn about the data
#?ToothGrowth
# Structure of the dataset
#str(ToothGrowth)
# Look at the data
#View(ToothGrowth)

Question 1

What do the rows of this dataset represent?

Each row represents one of the 60 guinea pigs in the study: the length of its odontoblasts (cells responsible for tooth growth), together with the supplement and dose it received.

Question 2

What do the columns of this dataset represent? Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

There are three variables: length (len), supplement (supp), and dose. Length is numerical and continuous. Supplement is categorical and nominal (OJ or VC). Dose is numerical and discrete, since it takes only a few fixed values.
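
These variable types can be checked against how R stores each column. A quick sketch using base R only (the object names column_classes, dose_levels, and supp_levels are mine):

```r
data("ToothGrowth")

# Storage class of each column
column_classes <- sapply(ToothGrowth, class)
column_classes

# Distinct dose values and supplement levels
dose_levels <- sort(unique(ToothGrowth$dose))
supp_levels <- levels(ToothGrowth$supp)
dose_levels
supp_levels
```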

Question 3

What are the response and explanatory variables in this study?

The response is the length of the guinea pigs' teeth. The explanatory variables are the type of supplement and the dose.

Question 4

Create a hypothesis about supplement treatment and dosage levels, without first looking at the data.

I predict that there is a positive relationship between length and dosage level and that VC will be more effective than orange juice.

Question 5

Use ggplot to create a side-by-side boxplot, which illustrates the distribution of each supplement treatment and allows for both visual comparison across and within treatments. (Feel free to also use color!)

ggplot(ToothGrowth, aes(x=supp, y=len, fill=supp)) +
       geom_boxplot() +
       ggtitle("Distribution of Tooth Growth and Supplement") +
       ylab("Tooth Length") +
       xlab("Supplement")

Question 6

Now add facets to your data to compare across dosage as well.

ggplot(ToothGrowth, aes(x=supp, y=len, fill=supp)) +
       geom_boxplot() +
       ggtitle("Distribution of Tooth Growth and Supplement") +
       ylab("Tooth Length") +
       xlab("Supplement") +
       facet_grid(~dose)

Question 7

Describe any possible trends in these data. Explain in the context of this study.

There is a positive, roughly linear relationship between tooth length and dosage. The spread of the OJ group narrows as the dosage increases. VC at 2 mg/day results in a narrower distribution. As the spread increases, more outliers appear for both supplements.
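
The trends read off the boxplots can be checked numerically. A sketch that summarizes center and spread for each supplement and dose combination (tooth_summary is a hypothetical name; the .groups argument assumes dplyr 1.0.0 or later):

```r
library(dplyr)
data("ToothGrowth")

# Median and IQR of tooth length for every supplement/dose combination
tooth_summary <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarize(n = n(),
            median_len = median(len),
            iqr_len = IQR(len),
            .groups = "drop")

tooth_summary
```

Comparing median_len across doses shows the direction of the trend, and iqr_len shows how the spread changes within each supplement.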

Question 8

Did you see anything surprising? Does your hypothesis appear to be supported? Note that this is not a formal hypothesis test but rather an exploration.

My hypothesis about dosage was correct, but I was surprised that at almost every dosage, orange juice outperformed pure VC.

Olivia Chu
