Skills Check Solutions

PART 1: Data Wrangling

library(tidyverse)

1) Make a new data set that has the average depth and price of the diamonds in the data set.

avgPandD = diamonds %>% summarise(mean(price), mean(depth))
avgPandD

## # A tibble: 1 x 2
##   `mean(price)` `mean(depth)`
##           <dbl>         <dbl>
## 1         3933.          61.7
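The summary columns above take their names from the expressions that produced them (for example `mean(price)`), which need backticks whenever they are referenced later. A minimal sketch of the same summary with explicit names; the names `avg_named`, `avg_price`, and `avg_depth` are illustrative choices, not part of the original solution:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Same one-row summary as above, with explicit column names
avg_named <- diamonds %>%
  summarise(avg_price = mean(price), avg_depth = mean(depth))
avg_named
```

Named columns can then be used directly, e.g. `avg_named$avg_price`.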

2) Add a new column to the data set that records each diamond’s price per carat.

diamonds_ppc = diamonds %>% mutate(price_per_c = price/carat)
diamonds_ppc

## # A tibble: 53,940 x 11
##    carat cut       color clarity depth table price     x     y     z price_per_c
##    <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>       <dbl>
##  1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43       1417.
##  2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31       1552.
##  3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31       1422.
##  4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63       1152.
##  5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75       1081.
##  6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48       1400 
##  7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47       1400 
##  8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53       1296.
##  9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49       1532.
## 10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39       1470.
## # … with 53,930 more rows

3) Create a new data set that groups diamonds by their cut and displays the average price of each group.

cp = diamonds %>% group_by(cut) %>% summarise(mean(price))

## `summarise()` ungrouping output (override with `.groups` argument)

cp

## # A tibble: 5 x 2
##   cut       `mean(price)`
##   <ord>             <dbl>
## 1 Fair              4359.
## 2 Good              3929.
## 3 Very Good         3982.
## 4 Premium           4584.
## 5 Ideal             3458.

4) Create a new data set that groups diamonds by color and displays the average depth and average table for each group.

cd = diamonds %>% group_by(color) %>% summarise(mean(depth), mean(table))

## `summarise()` ungrouping output (override with `.groups` argument)

cd

## # A tibble: 7 x 3
##   color `mean(depth)` `mean(table)`
##   <ord>         <dbl>         <dbl>
## 1 D              61.7          57.4
## 2 E              61.7          57.5
## 3 F              61.7          57.4
## 4 G              61.8          57.3
## 5 H              61.8          57.5
## 6 I              61.8          57.6
## 7 J              61.9          57.8

Extra-Credit) Add two columns to the diamonds data set. The first column should display the average depth of diamonds in the diamond’s color group. The second column should display the average table of diamonds in the diamond’s color group.

df1 = diamonds %>% group_by(color) %>% summarise(mean(depth))

## `summarise()` ungrouping output (override with `.groups` argument)

df2 = diamonds %>% group_by(color) %>% summarise(mean(table))

## `summarise()` ungrouping output (override with `.groups` argument)

main = left_join(diamonds, df1)

## Joining, by = "color"

main

## # A tibble: 53,940 x 11
##    carat cut     color clarity depth table price     x     y     z `mean(depth)`
##    <dbl> <ord>   <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>         <dbl>
##  1 0.23  Ideal   E     SI2      61.5    55   326  3.95  3.98  2.43          61.7
##  2 0.21  Premium E     SI1      59.8    61   326  3.89  3.84  2.31          61.7
##  3 0.23  Good    E     VS1      56.9    65   327  4.05  4.07  2.31          61.7
##  4 0.290 Premium I     VS2      62.4    58   334  4.2   4.23  2.63          61.8
##  5 0.31  Good    J     SI2      63.3    58   335  4.34  4.35  2.75          61.9
##  6 0.24  Very G… J     VVS2     62.8    57   336  3.94  3.96  2.48          61.9
##  7 0.24  Very G… I     VVS1     62.3    57   336  3.95  3.98  2.47          61.8
##  8 0.26  Very G… H     SI1      61.9    55   337  4.07  4.11  2.53          61.8
##  9 0.22  Fair    E     VS2      65.1    61   337  3.87  3.78  2.49          61.7
## 10 0.23  Very G… H     VS1      59.4    61   338  4     4.05  2.39          61.8
## # … with 53,930 more rows
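Note that the join above attaches only df1, so main contains the average depth column but not the average table column; df2 is never used. One possible way to add both columns in a single step is a grouped `mutate()`, which keeps every row and repeats each group's average on its rows. This is a sketch rather than the only correct answer, and the names `diamonds_color_avgs`, `avg_depth_color`, and `avg_table_color` are illustrative:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Grouped mutate: each diamond gets the averages of its own color group
diamonds_color_avgs <- diamonds %>%
  group_by(color) %>%
  mutate(avg_depth_color = mean(depth),
         avg_table_color = mean(table)) %>%
  ungroup()  # drop the grouping so later operations act on all rows
diamonds_color_avgs
```

The join-based approach would give the same two columns if df2 were joined to main as well.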

5) Which color diamonds seem to be largest on average (in terms of carats)?

size_v_col = diamonds %>% group_by(color) %>% summarise(mean(carat))

## `summarise()` ungrouping output (override with `.groups` argument)

size_v_col

## # A tibble: 7 x 2
##   color `mean(carat)`
##   <ord>         <dbl>
## 1 D             0.658
## 2 E             0.658
## 3 F             0.737
## 4 G             0.771
## 5 H             0.912
## 6 I             1.03 
## 7 J             1.16

Color J seems to be the largest on average.
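Rather than reading the largest value off the table by eye, the top group can be pulled out in code. A small sketch, assuming dplyr 1.0.0 or later (where `slice_max()` is available); `largest_color` and `avg_carat` are illustrative names:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Keep only the color group with the highest average carat
largest_color <- diamonds %>%
  group_by(color) %>%
  summarise(avg_carat = mean(carat)) %>%
  slice_max(avg_carat, n = 1)
largest_color
```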

6) What color of diamonds occurs the most frequently among diamonds with ideal cuts?

cut_color = diamonds %>% filter(cut == "Ideal") %>% count(color)
cut_color

## # A tibble: 7 x 2
##   color     n
##   <ord> <int>
## 1 D      2834
## 2 E      3903
## 3 F      3826
## 4 G      4884
## 5 H      3115
## 6 I      2093
## 7 J       896

## You can also do this with group_by() and summarise()
cut_color2 = diamonds %>% 
  filter(cut == "Ideal")%>% 
  group_by(color)%>%
  summarise(n=n())

## `summarise()` ungrouping output (override with `.groups` argument)

cut_color2

## # A tibble: 7 x 2
##   color     n
##   <ord> <int>
## 1 D      2834
## 2 E      3903
## 3 F      3826
## 4 G      4884
## 5 H      3115
## 6 I      2093
## 7 J       896

Among diamonds with an Ideal cut, color G occurs most frequently.
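For "most frequent" questions like this one, `count()` can also sort the result so the answer is the first row. A brief sketch; `ideal_colors` is an illustrative name:

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# sort = TRUE orders the counts from largest to smallest
ideal_colors <- diamonds %>%
  filter(cut == "Ideal") %>%
  count(color, sort = TRUE)
ideal_colors
```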

7) Which clarity of diamonds has the largest average table per carat?

tpc = diamonds %>% group_by(clarity) %>% summarise(mean(table/carat))

## `summarise()` ungrouping output (override with `.groups` argument)

tpc

## # A tibble: 8 x 2
##   clarity `mean(table/carat)`
##   <ord>                 <dbl>
## 1 I1                     56.3
## 2 SI2                    69.1
## 3 SI1                    89.6
## 4 VS2                   103. 
## 5 VS1                   107. 
## 6 VVS2                  127. 
## 7 VVS1                  141. 
## 8 IF                    140.

VVS1 has the highest average table per carat.

8) What is the average price per carat of diamonds that cost more than $10000?

avg_ppc = diamonds %>% filter(price > 10000) %>% summarise(mean(price/carat))
avg_ppc

## # A tibble: 1 x 1
##   `mean(price/carat)`
##                 <dbl>
## 1               8044.

9) Of the diamonds that cost more than $10000, what is the most common clarity?

comclar = diamonds %>% filter(price > 10000) %>% count(clarity)
comclar

## # A tibble: 8 x 2
##   clarity     n
##   <ord>   <int>
## 1 I1         30
## 2 SI2      1239
## 3 SI1      1184
## 4 VS2      1155
## 5 VS1       747
## 6 VVS2      452
## 7 VVS1      247
## 8 IF        168

## You can also do this with group_by() and summarise()
comclar2 = diamonds %>% 
  filter(price > 10000) %>% 
  group_by(clarity)%>%
  summarise(n=n())

## `summarise()` ungrouping output (override with `.groups` argument)

comclar2

## # A tibble: 8 x 2
##   clarity     n
##   <ord>   <int>
## 1 I1         30
## 2 SI2      1239
## 3 SI1      1184
## 4 VS2      1155
## 5 VS1       747
## 6 VVS2      452
## 7 VVS1      247
## 8 IF        168

The most common clarity for diamonds over $10000 is SI2.

PART 2: Data Exploration and Visualization

data("ToothGrowth")
?ToothGrowth
str(ToothGrowth)

## 'data.frame':    60 obs. of  3 variables:
##  $ len : num  4.2 11.5 7.3 5.8 6.4 10 11.2 11.2 5.2 7 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: num  0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...

1) What do the rows of this dataset represent?

Each row represents an individual guinea pig.

2) What do the columns of this dataset represent? Indicate whether each variable in the study is numerical or categorical. If numerical, identify as continuous or discrete. If categorical, indicate if the variable is ordinal.

The first column, len, represents tooth length. This variable is numerical and continuous. The second, supp, represents supplement type (OJ or VC). This is categorical but not ordinal. The final column, dose, represents the dose. This is numerical and discrete, since only a few fixed dose levels were administered.
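These classifications can be checked against the data itself. The sketch below uses only base R; `dose_levels` and `design` are illustrative names. It lists each column's class, the distinct dose values, and how many animals fall in each supplement-by-dose cell:

```r
data("ToothGrowth")

# Storage class of each column: numeric, factor, numeric
sapply(ToothGrowth, class)

# dose takes only a handful of distinct values, which is why it is treated as discrete
dose_levels <- sort(unique(ToothGrowth$dose))
dose_levels

# Number of guinea pigs for each supplement and dose combination
design <- table(ToothGrowth$supp, ToothGrowth$dose)
design
```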

3) What are the response and explanatory variables in this study?

The response variable in this study is tooth length. The explanatory variables are supplement type and dose.

4) Create a hypothesis about supplement treatment and dosage levels, without first looking at the data.

Vitamin C supplementation does not impact the length of odontoblasts in Guinea Pigs.

5) Use ggplot to create a side-by-side boxplot, which illustrates the distribution of each supplement treatment and allows for both visual comparison across and within treatments.

ggplot(ToothGrowth, aes(supp, len, fill = supp)) + geom_boxplot()

6) Now add facets to your plot to compare across dosage as well.

ggplot(ToothGrowth, aes(supp, len, fill = supp)) + geom_boxplot() + facet_wrap(~dose)

7) Describe any possible trends in these data. Explain in the context of this study.

The box plots from question 6 show that tooth length increased with dose for both supplement types. Additionally, at doses of 0.5 and 1 mg/day, guinea pigs that received OJ had noticeably longer teeth than those that received VC. At 2 mg/day, the medians are about even.
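The visual impression from the box plots can be backed up with the group medians. A minimal base R sketch (`med_len` is an illustrative name) that computes the median tooth length for each supplement and dose combination:

```r
data("ToothGrowth")

# Median tooth length for every supplement-by-dose group (6 rows)
med_len <- aggregate(len ~ supp + dose, data = ToothGrowth, FUN = median)
med_len
```

Comparing the OJ and VC rows within each dose gives a numeric counterpart to the center lines of the faceted box plots.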

8) Did you see anything surprising? Does your hypothesis appear to be supported? Note that this is not a formal hypothesis test but rather an exploration.

Initially, it looks like vitamin C supplementation could affect odontoblast length in guinea pigs, which would contradict the hypothesis from question 4. Additionally, orange juice appears to be a more effective way of delivering the vitamin. The delivery method seems to matter more at lower doses, potentially because ascorbic acid is a less efficient method of delivery.