library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 1.0.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(ggplot2)
library(dplyr)
summary(diamonds)
##      carat               cut        color        clarity          depth
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00
##                                     J: 2808   (Other): 2531
##      table           price             x                y
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900
##
##        z
##  Min.   : 0.000
##  1st Qu.: 2.910
##  Median : 3.530
##  Mean   : 3.539
##  3rd Qu.: 4.040
##  Max.   :31.800
##
Make a new data set that has the average depth and price of the diamonds in the data set.
## # A tibble: 1 x 2
## # Groups: mean_depth, mean_price [1]
## mean_depth mean_price
## <dbl> <dbl>
## 1 61.7 3933.
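The code that produced this table is not shown. A minimal sketch that yields the same two averages, assuming a plain summarize() call on the whole data set. The "Groups" line in the output above suggests the original call was written differently, so this may not reproduce the printout exactly. The object name depth_price_means is hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# One-row summary holding the overall mean depth and mean price.
depth_price_means <- diamonds %>%
  summarize(mean_depth = mean(depth, na.rm = TRUE),
            mean_price = mean(price, na.rm = TRUE))

depth_price_means
```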
Add a new column to the data set that records each diamond’s price per carat.
diamonds %>%
mutate(ppcarat = price/carat)
## # A tibble: 53,940 x 11
## carat cut color clarity depth table price x y z ppcarat
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1417.
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 1552.
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1422.
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63 1152.
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1081.
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 1400
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 1400
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 1296.
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1532.
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 1470.
## # … with 53,930 more rows
Create a new data set that groups diamonds by their cut and displays the average price of each group.
diamonds %>%
group_by(cut) %>%
dplyr::summarize(ave_price = mean(price, na.rm=TRUE))
## # A tibble: 5 x 2
## cut ave_price
## * <ord> <dbl>
## 1 Fair 4359.
## 2 Good 3929.
## 3 Very Good 3982.
## 4 Premium 4584.
## 5 Ideal 3458.
Create a new data set that groups diamonds by color and displays the average depth and average table for each group.
diamonds %>%
group_by(color) %>%
summarize(ave_depth = mean(depth, na.rm=TRUE), # average depth, not price; the ave_depth values printed below are mean prices
aver_table = mean(table, na.rm=TRUE))
## # A tibble: 7 x 3
## color ave_depth aver_table
## * <ord> <dbl> <dbl>
## 1 D 3170. 57.4
## 2 E 3077. 57.5
## 3 F 3725. 57.4
## 4 G 3999. 57.3
## 5 H 4487. 57.5
## 6 I 5092. 57.6
## 7 J 5324. 57.8
Which color diamonds seem to be largest on average (in terms of carats)?
diamonds %>%
group_by(color) %>%
dplyr::summarize(ave_size = mean(carat, na.rm=TRUE))
## # A tibble: 7 x 2
## color ave_size
## * <ord> <dbl>
## 1 D 0.658
## 2 E 0.658
## 3 F 0.737
## 4 G 0.771
## 5 H 0.912
## 6 I 1.03
## 7 J 1.16
From the table above, the J color seems to have the largest diamonds on average.
What color of diamonds occurs the most frequently among diamonds with ideal cuts?
diamonds %>%
filter(cut == "Ideal") %>% # keep only Ideal cuts; the counts printed below cover all cuts
group_by(color) %>%
tally()
## # A tibble: 7 x 2
## color n
## * <ord> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
G is the most common color.
Which clarity of diamonds has the largest average table per carat?
diamonds %>%
group_by(clarity) %>%
mutate(tablePerCarat = table/carat) %>%
summarise(ave = mean(tablePerCarat))
## # A tibble: 8 x 2
## clarity ave
## * <ord> <dbl>
## 1 I1 56.3
## 2 SI2 69.1
## 3 SI1 89.6
## 4 VS2 103.
## 5 VS1 107.
## 6 VVS2 127.
## 7 VVS1 141.
## 8 IF 140.
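The largest average can also be pulled out in code rather than read off the table. A small sketch, assuming dplyr 1.0.0 or later for slice_max(); the object names table_per_carat and top_clarity are hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Average table-per-carat ratio within each clarity grade.
table_per_carat <- diamonds %>%
  mutate(tablePerCarat = table / carat) %>%
  group_by(clarity) %>%
  summarise(ave = mean(tablePerCarat, na.rm = TRUE))

# Keep only the row with the largest average.
top_clarity <- table_per_carat %>%
  slice_max(ave, n = 1)

top_clarity
```

Going by the table above, this should return the VVS1 row, with IF a close second.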
What is the average price per carat of diamonds that cost more than $10000?
diamonds %>%
filter(price > 10000) %>%
mutate(ppcarat = price/carat) %>%
dplyr::summarize(ave_ppcarat = mean(ppcarat, na.rm=TRUE))
## # A tibble: 1 x 1
## ave_ppcarat
## <dbl>
## 1 8044.
Of the diamonds that cost more than $10000, what is the most common clarity?
diamonds %>%
filter(price > 10000) %>%
group_by(clarity) %>%
tally()
## # A tibble: 8 x 2
## clarity n
## * <ord> <int>
## 1 I1 30
## 2 SI2 1239
## 3 SI1 1184
## 4 VS2 1155
## 5 VS1 747
## 6 VVS2 452
## 7 VVS1 247
## 8 IF 168
From the table above, SI2 is the most common clarity.
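As a possible shortcut, count() with sort = TRUE combines the grouping and tallying steps and puts the most common clarity in the first row. The object name clarity_counts is hypothetical.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Count diamonds over $10000 by clarity, largest group first.
clarity_counts <- diamonds %>%
  filter(price > 10000) %>%
  count(clarity, sort = TRUE)

clarity_counts
```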
#data("ToothGrowth")
# Learn about the data
#?ToothGrowth
# Structure of the dataset
#str(ToothGrowth)
# Look at the data
#View(ToothGrowth)
What do the rows of this dataset represent?
Each of the 60 rows represents one guinea pig and records the length of its odontoblasts (cells responsible for tooth growth), along with the supplement type and dose it received.
What do the columns of this dataset represent? Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.
There are three variables: length (len), supplement (supp), and dose. Length is numerical and continuous. Supplement is categorical and nominal. Dose is numerical and discrete.
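The variable types described above can be checked against how R stores the data. A short sketch using only base R; the object names var_classes and dose_levels are hypothetical.

```r
# ToothGrowth ships with base R's datasets package, so no library() call is needed.
str(ToothGrowth)

# Storage class of each column.
var_classes <- sapply(ToothGrowth, class)
var_classes

# dose is stored as a number but takes only a few distinct values.
dose_levels <- sort(unique(ToothGrowth$dose))
dose_levels

# Number of animals in each supplement/dose combination.
table(ToothGrowth$supp, ToothGrowth$dose)
```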
What are the response and explanatory variables in this study?
The response is the tooth length of the guinea pigs. The explanatory variables are the type of supplement and the dose.
Create a hypothesis about supplement treatment and dosage levels, without first looking at the data.
I predict that there is a positive relationship between length and dosage level and that VC will be more effective than orange juice.
Use ggplot to create a side-by-side boxplot, which illustrates the distribution of each supplement treatment and allows for both visual comparison across and within treatments. (Feel free to also use color!)
ggplot(ToothGrowth, aes(x=supp, y=len, fill=supp)) +
geom_boxplot() +
ggtitle("Distribution of Tooth Growth and Supplement") +
ylab("Tooth Length") +
xlab("Supplement")
Now add facets to your data to compare across dosage as well.
ggplot(ToothGrowth, aes(x=supp, y=len, fill=supp)) +
geom_boxplot() +
ggtitle("Distribution of Tooth Growth and Supplement") +
ylab("Tooth Length") +
xlab("Supplement") +
facet_grid(~dose)
Describe any possible trends in these data. Explain in the context of this study.
There is a positive, roughly linear relationship between tooth length and dosage. The spread for the OJ supplement narrows as dosage increases. VC at 2 mg/day results in a narrower distribution. As dosage increases, there is an increase in outliers for both supplements.
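The visual impressions above can be checked against numbers. A minimal sketch that computes the median and interquartile range of tooth length for each supplement and dose, assuming dplyr 1.0.0 or later for the .groups argument; the object name growth_summary is hypothetical.

```r
library(dplyr)

# Center (median) and spread (IQR) of tooth length per supplement/dose group.
growth_summary <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(median_len = median(len),
            iqr_len = IQR(len),
            .groups = "drop")

growth_summary
```

The median and IQR are the quantities a boxplot draws, so this table lines up directly with the faceted plot.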
Did you see anything surprising? Does your hypothesis appear to be supported? Note that this is not a formal hypothesis test but rather an exploration.
My hypothesis about dosage was correct, but I was surprised that at almost every dosage, orange juice outperformed pure VC.