library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(BMA)
## Loading required package: survival
## Loading required package: leaps
## Loading required package: robustbase
## 
## Attaching package: 'robustbase'
## The following object is masked from 'package:survival':
## 
##     heart
## Loading required package: inline
## Loading required package: rrcov
## Scalable Robust Estimators with High Breakdown Point (version 1.7-1)
library(fda)
## Loading required package: splines
## Loading required package: fds
## Loading required package: rainbow
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: pcaPP
## Loading required package: RCurl
## Loading required package: deSolve
## 
## Attaching package: 'fda'
## The following object is masked from 'package:graphics':
## 
##     matplot
library(Ecfun)
## 
## Attaching package: 'Ecfun'
## The following object is masked from 'package:base':
## 
##     sign
library(Ecdat)
## 
## Attaching package: 'Ecdat'
## The following object is masked from 'package:MASS':
## 
##     SP500
## The following object is masked from 'package:datasets':
## 
##     Orange
library(tidyverse)
## ── Attaching packages
## ───────────────────────────────────────
## tidyverse 1.3.2 ──
## ✔ ggplot2 3.3.6     ✔ purrr   0.3.4
## ✔ tibble  3.1.8     ✔ stringr 1.4.0
## ✔ tidyr   1.2.0     ✔ forcats 0.5.1
## ✔ readr   2.1.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ tidyr::complete() masks RCurl::complete()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ MASS::select()    masks dplyr::select()

HW 2

7.3.4

  1. The histograms show that x, y, and z are all right-skewed and multimodal. Most values of x and y are larger than those of z, which leads me to believe that z represents the depth of the diamond. By elimination, x and y would represent its length and width.
ggplot(diamonds) +
  geom_histogram(mapping=aes(x=x), binwidth = 0.01)

ggplot(diamonds) +
  geom_histogram(mapping=aes(x=y), binwidth = 0.01)

ggplot(diamonds) +
  geom_histogram(mapping=aes(x=z), binwidth = 0.01)
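
The visual comparison of x, y, and z can be backed up with a few numbers. This is only a rough sketch: the names `dim_summary` and `share_z_smallest` are made up for illustration, and the dataset documentation (`?diamonds`) is the authoritative place to confirm what each column measures.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Mean and median of each dimension column.
dim_summary <- diamonds %>%
  summarise(across(c(x, y, z), list(mean = mean, median = median)))

# Share of diamonds whose z is smaller than both x and y.
share_z_smallest <- mean(diamonds$z < pmin(diamonds$x, diamonds$y))

dim_summary
share_z_smallest
```

If z is the smallest dimension for nearly every diamond, that is consistent with reading z as depth.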

  1. There are 23 diamonds of 0.99 carat compared to 1,558 of 1.00 carat. Such a large gap between two nearly identical weights leads me to believe that a 1.00-carat diamond is much more expensive than a 0.99-carat one, and therefore yields a greater marginal profit for the seller.
diamonds %>%
  filter(carat >= 0.95, carat <= 1.1) %>%
  count(carat) %>%
  print(n = Inf)
## # A tibble: 16 × 2
##    carat     n
##    <dbl> <int>
##  1  0.95    65
##  2  0.96   103
##  3  0.97    59
##  4  0.98    31
##  5  0.99    23
##  6  1     1558
##  7  1.01  2242
##  8  1.02   883
##  9  1.03   523
## 10  1.04   475
## 11  1.05   361
## 12  1.06   373
## 13  1.07   342
## 14  1.08   246
## 15  1.09   287
## 16  1.1    278
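
The counts alone don't show whether 1.00-carat diamonds actually sell for more, so one way to probe that guess is to compare prices for the two weights directly. This is a minimal sketch, and `price_by_carat` is a name introduced here for illustration. It does not control for cut, color, or clarity, so any price difference is only suggestive.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Carat is recorded to two decimals, so this range keeps only 0.99 and 1.00.
price_by_carat <- diamonds %>%
  filter(carat >= 0.99, carat <= 1) %>%
  group_by(carat) %>%
  summarise(
    n = n(),
    median_price = median(price),
    mean_price = mean(price)
  )

price_by_carat
```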

7.4.1

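  1. In a bar chart, NA is treated as its own category and is plotted as a separate bar. In a histogram, rows with NA are dropped with a warning, because a missing value has no position on the continuous axis and cannot be assigned to a bin.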
"Console states that it removes 36,326 rows containing non-finite values."
## [1] "Console states that it removes 36,326 rows containing non-finite values."
diamonds2 <- diamonds %>%
  mutate(y = ifelse(y < 1 | y > 5, NA, y))

ggplot(diamonds2, aes(x = y)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 36326 rows containing non-finite values (stat_bin).
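
To see both behaviours side by side, the sketch below reuses the same thresholds and adds a hypothetical character column, `cut_chr`, that is missing wherever y is missing. The names `cut_chr`, `n_missing`, `hist_plot`, and `bar_plot` are assumptions made for this example only.

```r
library(dplyr)
library(ggplot2)

diamonds2 <- diamonds %>%
  mutate(
    y = ifelse(y < 1 | y > 5, NA, y),
    # Hypothetical categorical column that is NA wherever y is NA.
    cut_chr = if_else(is.na(y), NA_character_, as.character(cut))
  )

# Number of rows the histogram has to drop.
n_missing <- sum(is.na(diamonds2$y))
n_missing

# Histogram: na.rm = TRUE drops the missing rows without the warning.
hist_plot <- ggplot(diamonds2, aes(x = y)) +
  geom_histogram(bins = 30, na.rm = TRUE)

# Bar chart: the missing values are kept and drawn as their own NA bar.
bar_plot <- ggplot(diamonds2, aes(x = cut_chr)) +
  geom_bar()
```

Printing `hist_plot` and `bar_plot` should show the difference: the histogram silently loses the missing rows, while the bar chart gains an extra bar.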

7.5.1.1

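nycflights13::flights %>%
  mutate(
    # A flight with no recorded departure time is treated as cancelled.
    cancelled = is.na(dep_time),
    # sched_dep_time is stored as HHMM; convert it to decimal hours.
    sched_min = sched_dep_time %% 100,
    sched_hour = sched_dep_time %/% 100,
    sched_dep_time = sched_hour + sched_min / 60
  ) %>%
  ggplot() +
  geom_boxplot(mapping = aes(y = sched_dep_time, x = cancelled))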
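
As a numeric companion to the boxplot, the two groups can also be summarised in a small table. This is a sketch under the same assumption as the plot, namely that a missing `dep_time` means the flight was cancelled. The names `sched_dep_hours` and `sched_summary` are introduced here for illustration.

```r
library(dplyr)

sched_summary <- nycflights13::flights %>%
  mutate(
    cancelled = is.na(dep_time),
    # Convert the HHMM integer to decimal hours.
    sched_dep_hours = sched_dep_time %/% 100 + (sched_dep_time %% 100) / 60
  ) %>%
  group_by(cancelled) %>%
  summarise(
    n = n(),
    median_sched = median(sched_dep_hours),
    iqr_sched = IQR(sched_dep_hours)
  )

sched_summary
```

Reporting `n` alongside the medians is a reminder that the cancelled group is much smaller than the non-cancelled group, which is why the comparison uses boxplots rather than raw counts.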

7.5.2.1

"cut within color:"
## [1] "cut within color:"
diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(proportion = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = proportion))

"color within cut:"
## [1] "color within cut:"
diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(proportion = n / sum(n)) %>%
  ggplot(mapping = aes(x = color, y = cut)) +
  geom_tile(mapping = aes(fill = proportion))
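
A quick sanity check on the two rescalings: if the grouping is right, the proportions should sum to 1 within each color in the first version and within each cut in the second. The sketch below checks that. The names `within_color` and `within_cut` are made up for this example.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Cut within color: totals are taken per color.
within_color <- diamonds %>%
  count(color, cut) %>%
  group_by(color) %>%
  mutate(proportion = n / sum(n)) %>%
  summarise(total = sum(proportion))

# Color within cut: totals are taken per cut.
within_cut <- diamonds %>%
  count(color, cut) %>%
  group_by(cut) %>%
  mutate(proportion = n / sum(n)) %>%
  summarise(total = sum(proportion))

within_color
within_cut
```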

7.5.3.1

ggplot(diamonds, aes(x = cut_number(price, 10), y = carat)) +
  geom_boxplot() +
  coord_flip() +
  xlab("Price")

  1. The price distribution of large diamonds is more variable than that of small diamonds. I don’t know anything about buying or selling diamonds, so I’m not surprised. I would guess that large diamonds vary more in other attributes, such as cut, clarity, and color, which could lead to a wide range of prices at a given size.

  2. The given scatterplot shows a strong linear relationship between x and y. Because every point is drawn individually, we can identify outliers that we wouldn’t have been able to see with a binned plot. For instance, the point with the largest x value has no other points around it, but it follows the general linear trend of the data and therefore isn’t an outlier.
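
The first point above, that prices spread out more for large diamonds, can be checked with a rough summary of price by size group. This is a sketch only: the choice of five equal-count groups and the names `carat_group` and `price_spread` are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)  # provides diamonds and cut_number()

price_spread <- diamonds %>%
  # Five groups with roughly the same number of diamonds in each.
  mutate(carat_group = cut_number(carat, 5)) %>%
  group_by(carat_group) %>%
  summarise(
    n = n(),
    median_price = median(price),
    iqr_price = IQR(price),
    sd_price = sd(price)
  )

price_spread
```

Comparing `iqr_price` and `sd_price` between the first and last rows gives a numeric version of the visual impression.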

"given code:"
## [1] "given code:"
ggplot(data = diamonds) +
 geom_point(mapping = aes(x = x, y = y)) +
 coord_cartesian(xlim = c(4, 11), ylim = c(4, 11))
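
To make the contrast with a binned plot concrete, the sketch below lists the points that sit far from the x ≈ y trend and builds a binned boxplot of the same data for comparison. The 1 mm cutoff and the names `off_trend` and `binned_plot` are assumptions chosen for illustration, not part of the exercise.

```r
library(dplyr)
library(ggplot2)  # provides diamonds and cut_width()

# Points more than 1 mm (an arbitrary cutoff) away from the line y = x.
off_trend <- diamonds %>%
  filter(abs(x - y) > 1) %>%
  arrange(desc(abs(x - y)))

off_trend

# Binned version: each box summarises y within a 1 mm wide slice of x.
binned_plot <- ggplot(diamonds, aes(x = cut_width(x, 1), y = y)) +
  geom_boxplot() +
  coord_cartesian(ylim = c(4, 11))
```

In the binned plot a point is only judged against the spread of y inside its own bin, so the joint relationship between x and y is harder to read than in the scatterplot.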