MIDTERM 1 PART 2
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
data("diamonds")
?diamonds
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
Question 1: Make a new data set that has the average depth and price of the diamonds in the data set.
diamondsQ1 <- diamonds %>%
  mutate(avgDepth = mean(depth), avePrice = mean(price))
head(diamondsQ1)
## # A tibble: 6 x 12
## carat cut color clarity depth table price x y z avgDepth
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 61.7
## 2 0.21 Prem… E SI1 59.8 61 326 3.89 3.84 2.31 61.7
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 61.7
## 4 0.290 Prem… I VS2 62.4 58 334 4.2 4.23 2.63 61.7
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 61.7
## 6 0.24 Very… J VVS2 62.8 57 336 3.94 3.96 2.48 61.7
## # … with 1 more variable: avePrice <dbl>
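If the question is read as asking for a stand-alone data set that holds only the two averages, `summarise()` is one way to get there. The sketch below shows that alternative reading, not the submitted answer; the name `diamondsQ1Summary` is made up for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Collapse the whole data set to a single row of averages
diamondsQ1Summary <- diamonds %>%
  summarise(avgDepth = mean(depth), avgPrice = mean(price))

diamondsQ1Summary
```

Unlike `mutate()`, which repeats the same average on every row, this returns one row and two columns.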
Question 2: Add a new column to the data set that records each diamond’s price per carat.
diamondsQ2 <- diamondsQ1 %>%
  mutate(pricePerCarat = price / carat)
head(diamondsQ2)
## # A tibble: 6 x 13
## carat cut color clarity depth table price x y z avgDepth
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 61.7
## 2 0.21 Prem… E SI1 59.8 61 326 3.89 3.84 2.31 61.7
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 61.7
## 4 0.290 Prem… I VS2 62.4 58 334 4.2 4.23 2.63 61.7
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 61.7
## 6 0.24 Very… J VVS2 62.8 57 336 3.94 3.96 2.48 61.7
## # … with 2 more variables: avePrice <dbl>, pricePerCarat <dbl>
Question 3/4: Create a new data set that groups diamonds by their cut and displays the average price of each group
diamondsQ3 <- diamonds
diamondsQ3 %>%
  group_by(cut) %>%
  summarise(meanPrice = mean(price))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 5 x 2
## cut meanPrice
## <ord> <dbl>
## 1 Fair 4359.
## 2 Good 3929.
## 3 Very Good 3982.
## 4 Premium 4584.
## 5 Ideal 3458.
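The pipeline above prints the grouped averages but does not store them. A minimal sketch of keeping the result as its own data set, sorted from most to least expensive, might look like this; `diamondsQ3Summary` is a hypothetical name, and `.groups = "drop"` is used only to silence the ungrouping message.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Store the per-cut averages and sort them so the priciest cut comes first
diamondsQ3Summary <- diamonds %>%
  group_by(cut) %>%
  summarise(meanPrice = mean(price), .groups = "drop") %>%
  arrange(desc(meanPrice))

diamondsQ3Summary
```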
Question 5: Create a new data set that groups diamonds by color and displays the average depth and average table for each group.
diamondsQ5 <- diamonds
diamondsQ5 %>%
  group_by(color) %>%
  summarise(meanDepth = mean(depth),
            meanTable = mean(table))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 3
## color meanDepth meanTable
## <ord> <dbl> <dbl>
## 1 D 61.7 57.4
## 2 E 61.7 57.5
## 3 F 61.7 57.4
## 4 G 61.8 57.3
## 5 H 61.8 57.5
## 6 I 61.8 57.6
## 7 J 61.9 57.8
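When the same summary function is applied to several columns, `across()` (available in dplyr 1.0.0 and later) can avoid repeating `mean()`. This is only a sketch of an equivalent approach; the column naming pattern `mean_{.col}` and the object name `diamondsQ5Summary` are illustrative choices.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Apply mean() to depth and table in one step, per color
diamondsQ5Summary <- diamonds %>%
  group_by(color) %>%
  summarise(across(c(depth, table), mean, .names = "mean_{.col}"),
            .groups = "drop")

diamondsQ5Summary
```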
Question 6: Which color diamonds seem to be largest on average (in terms of carats)? Diamonds of the worst color grade, J, are the largest on average, with a mean of 1.16 carats.
diamondsQ6 <- diamonds
diamondsQ6 %>%
  group_by(color) %>%
  summarise(meanCarat = mean(carat))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
## color meanCarat
## <ord> <dbl>
## 1 D 0.658
## 2 E 0.658
## 3 F 0.737
## 4 G 0.771
## 5 H 0.912
## 6 I 1.03
## 7 J 1.16
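Rather than reading the largest value off the table, the top group can be pulled out directly. A small sketch using `slice_max()` (dplyr 1.0.0 or later), with `largestColor` as a made-up name:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Keep only the color with the highest mean carat
largestColor <- diamonds %>%
  group_by(color) %>%
  summarise(meanCarat = mean(carat), .groups = "drop") %>%
  slice_max(meanCarat, n = 1)

largestColor
```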
Question 7: What color of diamonds occurs the most frequently among diamonds with ideal cuts? The color “G” is the most common among ideal cut diamonds.
diamondsQ7 <- diamonds
diamondsQ7 %>%
  filter(cut == "Ideal") %>%
  group_by(color) %>%
  summarise(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 7 x 2
## color n
## <ord> <int>
## 1 D 2834
## 2 E 3903
## 3 F 3826
## 4 G 4884
## 5 H 3115
## 6 I 2093
## 7 J 896
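The `group_by()` plus `summarise(n = n())` pattern can be shortened with `count()`, and `sort = TRUE` puts the most frequent color first. The following is a sketch of that shorthand; `idealColorCounts` is a hypothetical name.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Count ideal-cut diamonds per color, most frequent first
idealColorCounts <- diamonds %>%
  filter(cut == "Ideal") %>%
  count(color, sort = TRUE)

idealColorCounts
```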
Question 8: Which clarity of diamonds has the largest average table per carat? The VVS1 clarity has the largest average table per carat, at about 141, narrowly ahead of IF at about 140.
diamondsQ8 <- diamonds
diamondsQ8 %>%
  mutate(tablePerCarat = table / carat) %>%
  group_by(clarity) %>%
  summarise(mean(tablePerCarat))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
## clarity `mean(tablePerCarat)`
## <ord> <dbl>
## 1 I1 56.3
## 2 SI2 69.1
## 3 SI1 89.6
## 4 VS2 103.
## 5 VS1 107.
## 6 VVS2 127.
## 7 VVS1 141.
## 8 IF 140.
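Because the summary column above is unnamed, it prints as `mean(tablePerCarat)` and is awkward to refer to later. One possible variant names the column and sorts by it so the answer sits in the first row; `meanTablePerCarat` and `tablePerCaratByClarity` are illustrative names.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Name the summary column, then sort so the largest ratio is on top
tablePerCaratByClarity <- diamonds %>%
  mutate(tablePerCarat = table / carat) %>%
  group_by(clarity) %>%
  summarise(meanTablePerCarat = mean(tablePerCarat), .groups = "drop") %>%
  arrange(desc(meanTablePerCarat))

tablePerCaratByClarity
```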
Question 9: What is the average price per carat of diamonds that cost more than $10000? The average price per carat of diamonds that cost more than $10000 is $8044 per carat.
diamondsQ9 <- diamonds
diamondsQ9 %>%
  filter(price > 10000) %>%
  summarise(mean(price / carat))
## # A tibble: 1 x 1
## `mean(price/carat)`
## <dbl>
## 1 8044.
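To reuse the figure in later text or calculations, it can be extracted as a plain number instead of a one-cell tibble. A minimal sketch, assuming the same filter as above; the name `meanPricePerCarat` is made up for illustration.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Average price per carat among diamonds priced above $10000, as a scalar
meanPricePerCarat <- diamonds %>%
  filter(price > 10000) %>%
  summarise(meanPricePerCarat = mean(price / carat)) %>%
  pull(meanPricePerCarat)

round(meanPricePerCarat)
```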
Question 10: Of the diamonds that cost more than $10000, what is the most common clarity? The SI2 clarity is the most common for diamonds that cost more than $10000.
diamondsQ10 <- diamonds
diamondsQ10 %>%
  filter(price > 10000) %>%
  group_by(clarity) %>%
  summarise(n = n())
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 8 x 2
## clarity n
## <ord> <int>
## 1 I1 30
## 2 SI2 1239
## 3 SI1 1184
## 4 VS2 1155
## 5 VS1 747
## 6 VVS2 452
## 7 VVS1 247
## 8 IF 168
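The counts for SI2, SI1, and VS2 are close, so it may help to view each clarity as a share of all diamonds above $10000. The sketch below adds that share; `expensiveClarity` and `share` are hypothetical names.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Count by clarity among expensive diamonds and add each group's share
expensiveClarity <- diamonds %>%
  filter(price > 10000) %>%
  count(clarity, sort = TRUE) %>%
  mutate(share = n / sum(n))

expensiveClarity
```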
PART 3
# Load in the data
data("ToothGrowth")
# Learn about the data
?ToothGrowth
# Structure of the dataset
str(ToothGrowth)
## 'data.frame': 60 obs. of 3 variables:
## $ len : num 4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: num 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
# Look at the data
View(ToothGrowth)
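What do the rows of this dataset represent? Each row represents one observation: the tooth length measured for a single guinea pig, along with the supplement type and dose it received.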
What do the columns of this dataset represent? Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal. The first column, len, is the length of the guinea pig's tooth. It is numerical and continuous. The second column, supp, is the supplement type the guinea pig was given (OJ or VC). It is categorical and not ordinal. The third column, dose, is how much supplement the guinea pig was given. It is numerical and discrete.
What are the response and explanatory variables in this study? The response variable is the tooth length. The explanatory variables are the supplement type and the dose.
Create a hypothesis about supplement treatment and dosage levels, without first looking at the data. The guinea pigs that got higher doses of ascorbic acid will have longer teeth.
Use ggplot to create a side-by-side boxplot, which illustrates the distribution of each supplement treatment and allows for both visual comparison across and within treatments.
ggplot(ToothGrowth, aes(supp, len, fill = supp)) +
  geom_boxplot() +
  theme_bw()
ggplot(ToothGrowth, aes(supp, len, fill = supp)) +
  geom_boxplot() +
  facet_wrap(~dose) +
  theme_bw()
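Another way to show both comparisons in a single panel is to put dose on the x-axis as a factor and let the fill color separate the supplements. This is a sketch of an alternative layout, not a replacement for the plots above; the axis labels are illustrative, and the object name `p` is arbitrary.

```r
library(ggplot2)

# Dose is converted to a factor so each level gets its own pair of boxes
p <- ggplot(ToothGrowth, aes(x = factor(dose), y = len, fill = supp)) +
  geom_boxplot() +
  labs(x = "Dose (mg/day)", y = "Tooth length", fill = "Supplement") +
  theme_bw()

p
```

With this layout, boxes for OJ and VC sit side by side within each dose, and the dose trend reads left to right.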
7) Describe any possible trends in these data. Explain in the context of this study. It appears that the OJ supplement is more effective at growing the guinea pigs' teeth. It also looks like the higher the dose, the longer the guinea pigs' teeth get. At the highest dose (2 mg/day), the type of supplement does not make a large difference.
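The trends read off the boxplots can be checked against group means. A minimal sketch, assuming mean tooth length is a reasonable summary for each supplement and dose combination; `toothMeans` is a made-up name.

```r
library(dplyr)

# Mean tooth length and group size for every supplement/dose combination
toothMeans <- ToothGrowth %>%
  group_by(supp, dose) %>%
  summarise(meanLen = mean(len), n = n(), .groups = "drop")

toothMeans
```

Comparing the OJ and VC rows at each dose shows whether the gap between supplements narrows as the dose increases.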