7.3.4 -Q1 Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.

Interestingly, the appearance of each distribution changes with the number of breaks in the histogram. With 50 breaks, each distribution looks roughly Gaussian (normal), with some minor skew. All three variables are continuous measurements rather than integers. The normal Q-Q plot for x suggests something close to a normal distribution, while y and z do not appear to be normal, though that may be due to outliers in the data set (the summary output shows minimum values of 0 and maximums of 58.9 for y and 31.8 for z).

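The information about which of x, y, and z is the length, width, or depth would have to be supplied to me directly, because I can’t tell with certainty from the data alone which is which. The one hint in the data is that z is consistently the smallest of the three (median 3.53 versus roughly 5.7 for x and y), which would be consistent with z being the depth.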
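One way to probe which column is which, offered as a sketch rather than a definitive answer: the ggplot2 documentation for diamonds describes depth as a percentage computed as z / mean(x, y), so the recorded depth column can be compared with a value recomputed from the three dimensions. The zero filter and the variable names below are assumptions made for illustration.

```r
library(ggplot2)  # provides the diamonds data set

# Keep rows where all three dimensions are positive
# (assumption: zeros are missing measurements, not real sizes)
d <- subset(diamonds, x > 0 & y > 0 & z > 0)

# Recompute the depth percentage as 100 * z / mean(x, y)
depth_from_xyz <- 100 * 2 * d$z / (d$x + d$y)

# Median absolute gap between the recomputed and the recorded depth
median(abs(depth_from_xyz - d$depth))

# Share of diamonds where z is smaller than both x and y
mean(d$z < pmin(d$x, d$y))
```

If the gap is small and z is almost always the smallest dimension, that supports reading z as the depth and x and y as the length and width.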

library(tidyverse)
library(data.table)
library(dummies)
library(dplyr)

summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 
hist(diamonds$x, breaks = 50, main = "x 50 Breaks")

qqnorm(diamonds$x)

hist(diamonds$y, breaks = 50, main = "y 50 Breaks")

qqnorm(diamonds$y)

hist(diamonds$z, breaks = 50, main = "z 50 Breaks")

qqnorm(diamonds$z)
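The suspicion that outliers distort the Q-Q plots for y and z can be checked with a rough sketch like the following. The cutoffs (zero, and anything above 20 mm) are assumptions chosen by eye from the summary output, not established rules.

```r
library(ggplot2)  # provides the diamonds data set

# Rows with impossible zeros or implausibly large dimensions (cutoffs are assumptions)
suspect <- subset(diamonds, x == 0 | y == 0 | z == 0 | y > 20 | z > 20)
nrow(suspect)

# Drop those rows and redraw the Q-Q plot for y with a reference line
clean <- subset(diamonds, x > 0 & y > 0 & z > 0 & y <= 20 & z <= 20)
qqnorm(clean$y, main = "Normal Q-Q plot of y, suspect rows removed")
qqline(clean$y)
```

Adding qqline() makes departures from normality easier to judge than the bare qqnorm() plot.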

7.3.4 -Q2

Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)

Without customizing the bin widths, the counts in our “Price Auto Breaks” chart appear to fall steadily as price rises. However, once we use 50 breaks and see the data at a finer grain, the shape looks more like a right-skewed, Poisson-type distribution (although price is not a count variable): the lowest price bin appears to hold fewer diamonds than the bins just above it, and the counts then tail off toward the high end. One possible reading is that there is little demand for the very cheapest, lowest-quality diamonds no matter what the price, but a histogram only shows how many diamonds in this data set sit at each price, so it cannot establish demand elasticity on its own.

hist(diamonds$price, main = "Price Auto Breaks")

hist(diamonds$price, breaks = 50, main = "Price 50 Breaks")
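Since the hint asks for a wide range of bin widths, a sketch like the one below loops over several widths and then zooms in on the low end of the price range. The specific widths and the 2,500 cutoff are arbitrary choices for illustration.

```r
library(ggplot2)  # provides the diamonds data set

price <- diamonds$price

# Redraw the histogram at several bin widths (values chosen arbitrarily)
for (bw in c(2000, 500, 100, 20)) {
  breaks <- seq(0, max(price) + bw, by = bw)  # breaks span the full price range
  hist(price, breaks = breaks,
       main = paste("Price, bin width", bw), xlab = "price")
}

# Zoom in on cheaper diamonds with narrow bins to look for gaps or spikes
low <- price[price < 2500]
h_low <- hist(low, breaks = seq(0, 2500, by = 20),
              main = "Price under 2,500, bin width 20", xlab = "price")
```

Narrow bins on a restricted range are where unusual features, such as empty stretches or spikes, tend to become visible.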

7.4.1 - Q1

What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?

In a histogram, missing values are omitted (ggplot2 drops them and warns about the removed rows). In a bar chart, NA is counted and drawn as its own bar, just like any other category. The reason is that a histogram represents a continuous variable that must be numeric, and a missing value has no position on a numeric axis, so it cannot be assigned to a bin. Bar charts, as mentioned earlier, are built for categorical values: they count each instance of every category, and NA is treated as one more category unless it is specifically removed.
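The difference can be seen on a small made-up data frame. This is a minimal sketch: the toy data and object names are hypothetical, and layer_data() is used only to inspect what each plot layer computed.

```r
library(ggplot2)

# Toy data with one missing value in each column (made up for illustration)
toy <- data.frame(
  value = c(1.2, 2.5, NA, 3.1, 2.2),
  group = c("a", "b", NA, "a", "b")
)

p_hist <- ggplot(toy, aes(x = value)) + geom_histogram(binwidth = 1)
p_bar  <- ggplot(toy, aes(x = group)) + geom_bar()

# The histogram layer drops the NA (ggplot2 warns about the removed row)
hist_data <- suppressWarnings(layer_data(p_hist))
sum(hist_data$count)

# The bar layer keeps NA; printing p_bar shows it as a separate bar
bar_data <- layer_data(p_bar)
bar_data[, c("x", "count")]
```

Five rows go in, but the histogram counts only the four non-missing values.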

7.5.1.1 -Q2

What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?

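Based on the linear regression below, carat has the most influence on price: it has by far the largest t value (231.5) and a 0.92 correlation with price. Its coefficient means that each additional carat is associated with an $11,257 increase in price, or about $113 for every 0.01 carat, holding the other variables constant. However, the correlation between carat and cut is inverse: there is a 0.09 correlation between a Fair cut and carat size, but a -0.16 correlation between an Ideal cut and carat size. This means that as carat size increases, the price increases but the cut quality tends to degrade (quantity vs. quality), so lower-quality cuts can be more expensive on average simply because those diamonds tend to be larger.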
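To make the coefficient's units concrete and to look at the carat and cut relationship without dummy variables, here is a small sketch. The coefficient is copied by hand from the model summary in this answer; the medians by cut are an alternative view chosen for illustration, not part of the original analysis.

```r
library(ggplot2)  # provides the diamonds data set

# Carat coefficient from the regression summary: dollars per one full carat
carat_coef <- 11256.978

# Implied change in price for a 0.01 carat increase
per_hundredth <- carat_coef * 0.01
per_hundredth

# Typical size and price within each cut grade
med_carat <- tapply(diamonds$carat, diamonds$cut, median)
med_price <- tapply(diamonds$price, diamonds$cut, median)
rbind(med_carat, med_price)
```

If the lower cut grades show larger median carat, that is the same pattern the correlations point to: size, not cut, is driving the higher prices.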

# Start from the diamonds data and convert it to a data.table
mydiamonds <- diamonds
mydiamonds <- setDT(mydiamonds)

# Convert the ordered factors to unordered factors before dummy coding
mydiamonds$cut <- factor(mydiamonds$cut, ordered = FALSE)
mydiamonds$color <- factor(mydiamonds$color, ordered = FALSE)
mydiamonds$clarity <- factor(mydiamonds$clarity, ordered = FALSE)

# Create one 0/1 column per factor level, named like cut.Fair
mydummydiamonds <- dummy.data.frame(mydiamonds, sep = ".")

# Regress price on every other column
lmdiamonds <- lm(price ~ ., data = mydummydiamonds)
summary(lmdiamonds)
## 
## Call:
## lm(formula = price ~ ., data = mydummydiamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21376.0   -592.4   -183.5    376.4  10694.2 
## 
## Coefficients: (3 not defined because of singularities)
##                  Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)      5993.093    390.770   15.337  < 2e-16 ***
## carat           11256.978     48.628  231.494  < 2e-16 ***
## cut.Fair         -832.912     33.407  -24.932  < 2e-16 ***
## cut.Good         -253.160     20.247  -12.504  < 2e-16 ***
## `cut.Very Good`  -106.129     14.228   -7.459 8.82e-14 ***
## cut.Premium       -70.768     14.590   -4.850 1.24e-06 ***
## cut.Ideal              NA         NA       NA       NA    
## color.D          2369.398     26.131   90.674  < 2e-16 ***
## color.E          2160.280     24.922   86.683  < 2e-16 ***
## color.F          2096.544     24.813   84.492  < 2e-16 ***
## color.G          1887.359     24.313   77.628  < 2e-16 ***
## color.H          1389.131     24.891   55.809  < 2e-16 ***
## color.I           903.154     26.337   34.292  < 2e-16 ***
## color.J                NA         NA       NA       NA    
## clarity.I1      -5345.102     51.024 -104.757  < 2e-16 ***
## clarity.SI2     -2642.516     30.523  -86.574  < 2e-16 ***
## clarity.SI1     -1679.630     29.371  -57.186  < 2e-16 ***
## clarity.VS2     -1077.879     29.150  -36.977  < 2e-16 ***
## clarity.VS1      -766.704     29.847  -25.688  < 2e-16 ***
## clarity.VVS2     -394.288     31.240  -12.621  < 2e-16 ***
## clarity.VVS1     -337.343     32.674  -10.324  < 2e-16 ***
## clarity.IF             NA         NA       NA       NA    
## depth             -63.806      4.535  -14.071  < 2e-16 ***
## table             -26.474      2.912   -9.092  < 2e-16 ***
## x               -1008.261     32.898  -30.648  < 2e-16 ***
## y                   9.609     19.333    0.497    0.619    
## z                 -50.119     33.486   -1.497    0.134    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1130 on 53916 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9198 
## F-statistic: 2.688e+04 on 23 and 53916 DF,  p-value: < 2.2e-16
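The three NA coefficients appear because a full set of dummy columns for each factor is collinear with the intercept. As a hedged alternative sketch (not the approach used above), lm() can be given the unordered factors directly; it then drops one reference level per factor on its own, and the carat coefficient should be unchanged because the fitted model is the same.

```r
library(ggplot2)  # provides the diamonds data set

d <- as.data.frame(diamonds)

# Treat the ordered factors as plain categories, as in the original code
for (col in c("cut", "color", "clarity")) {
  d[[col]] <- factor(d[[col]], ordered = FALSE)
}

# lm() builds the dummy columns itself and omits one reference level per factor
fit <- lm(price ~ ., data = d)
coef(fit)["carat"]
```

The cut, color, and clarity coefficients will differ from the table above because they are measured against different reference levels, but no coefficient comes back as NA.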
cor(mydummydiamonds)
##                      carat     cut.Fair      cut.Good cut.Very Good
## carat          1.000000000  0.091843685  0.0341964753   0.009568034
## cut.Fair       0.091843685  1.000000000 -0.0554820732  -0.094236197
## cut.Good       0.034196475 -0.055482073  1.0000000000  -0.169939873
## cut.Very Good  0.009568034 -0.094236197 -0.1699398730   1.000000000
## cut.Premium    0.116244855 -0.102801176 -0.1853854406  -0.314876821
## cut.Ideal     -0.163660333 -0.143077886 -0.2580180296  -0.438243136
## color.D       -0.112056603 -0.012893366  0.0089092841  -0.000607790
## color.E       -0.139214865 -0.019334474  0.0070127892   0.023710592
## color.F       -0.060052467  0.007763181  0.0069490843   0.003110404
## color.G       -0.029038057 -0.006170695 -0.0247286509  -0.025170628
## color.H        0.102464659  0.016646031 -0.0095171730  -0.004436857
## color.I        0.161493717  0.004769664  0.0061867573  -0.001548598
## color.J        0.180054472  0.017256725  0.0149774833   0.009815956
## clarity.I1     0.120983286  0.175852469  0.0158439265  -0.031316975
## clarity.SI2    0.267483210  0.055505777  0.0419704088   0.004805856
## clarity.SI1    0.062668829  0.004586516  0.0559381037   0.032547004
## clarity.VS2   -0.038904149 -0.027265713 -0.0210630056  -0.016411925
## clarity.VS1   -0.063093856 -0.022452892 -0.0171160562  -0.006848869
## clarity.VVS2  -0.137023771 -0.030702591 -0.0386266916   0.015284305
## clarity.VVS1  -0.167571254 -0.039920206 -0.0375642065  -0.005251499
## clarity.IF    -0.114448682 -0.027022441 -0.0330456677  -0.033003418
## depth          0.028224314  0.280657311  0.1361138208   0.025827615
## table          0.181617547  0.125331585  0.1751741968   0.119971034
## price          0.921591301  0.018728220 -0.0003120195   0.006593488
## x              0.975094227  0.080643583  0.0303489705   0.004568574
## y              0.951722199  0.068821579  0.0321866174   0.016699044
## z              0.953387381  0.110367389  0.0451693346   0.016039079
##                cut.Premium    cut.Ideal      color.D      color.E
## carat          0.116244855 -0.163660333 -0.112056603 -0.139214865
## cut.Fair      -0.102801176 -0.143077886 -0.012893366 -0.019334474
## cut.Good      -0.185385441 -0.258018030  0.008909284  0.007012789
## cut.Very Good -0.314876821 -0.438243136 -0.000607790  0.023710592
## cut.Premium    1.000000000 -0.478074365 -0.016566131 -0.018499622
## cut.Ideal     -0.478074365  1.000000000  0.014520993 -0.001105383
## color.D       -0.016566131  0.014520993  1.000000000 -0.178550209
## color.E       -0.018499622 -0.001105383 -0.178550209  1.000000000
## color.F       -0.012098457  0.001351245 -0.175704439 -0.218400503
## color.G        0.003858758  0.034649146 -0.195020754 -0.242410670
## color.H        0.027895105 -0.021263982 -0.161671927 -0.200958100
## color.I        0.005899415 -0.009225169 -0.126698958 -0.157486722
## color.J        0.017231075 -0.038489916 -0.088817162 -0.110399675
## clarity.I1     0.005676004 -0.048794696 -0.024545048 -0.013461879
## clarity.SI2    0.067623298 -0.108241651  0.032016976  0.005513515
## clarity.SI1    0.023274510 -0.082865181  0.057714915  0.005952527
## clarity.VS2    0.022611012  0.015669209  0.021007215  0.027953717
## clarity.VS1   -0.011865966  0.034246697 -0.050133884 -0.027238160
## clarity.VVS2  -0.061949459  0.075507673 -0.015974682  0.011682835
## clarity.VVS1  -0.053851941  0.088354200 -0.046090962 -0.001501777
## clarity.IF    -0.054013023  0.104986213 -0.047418307 -0.044863351
## depth         -0.198305643 -0.022777723 -0.013566273 -0.028712725
## table          0.338071896 -0.549598773 -0.008920770  0.007172049
## price          0.095705972 -0.097175385 -0.072472544 -0.101089368
## x              0.126820057 -0.162673706 -0.106126820 -0.134213165
## y              0.107943149 -0.153158073 -0.103998733 -0.130135896
## z              0.090019394 -0.158688152 -0.105215511 -0.132209808
##                    color.F      color.G      color.H      color.I
## carat         -0.060052467 -0.029038057  0.102464659  0.161493717
## cut.Fair       0.007763181 -0.006170695  0.016646031  0.004769664
## cut.Good       0.006949084 -0.024728651 -0.009517173  0.006186757
## cut.Very Good  0.003110404 -0.025170628 -0.004436857 -0.001548598
## cut.Premium   -0.012098457  0.003858758  0.027895105  0.005899415
## cut.Ideal      0.001351245  0.034649146 -0.021263982 -0.009225169
## color.D       -0.175704439 -0.195020754 -0.161671927 -0.126698958
## color.E       -0.218400503 -0.242410670 -0.200958100 -0.157486722
## color.F        1.000000000 -0.238547079 -0.197755189 -0.154976666
## color.G       -0.238547079  1.000000000 -0.219495684 -0.172014244
## color.H       -0.197755189 -0.219495684  1.000000000 -0.142599563
## color.I       -0.154976666 -0.172014244 -0.142599563  1.000000000
## color.J       -0.108640102 -0.120583604 -0.099963636 -0.078339442
## clarity.I1     0.004974074 -0.002005850  0.021149736  0.009277645
## clarity.SI2   -0.002250883 -0.045650915  0.020162952 -0.001995871
## clarity.SI1   -0.020435501 -0.080737935  0.031613435  0.015933620
## clarity.VS2    0.003774546 -0.023828920 -0.029923039 -0.009293063
## clarity.VS1   -0.011037941  0.055601148 -0.012740006  0.024189245
## clarity.VVS2   0.013127817  0.059744240 -0.030271219 -0.030483373
## clarity.VVS1   0.016900707  0.042398281  0.004561280 -0.003041191
## clarity.IF     0.018538947  0.077917264  0.006719881 -0.012711469
## depth         -0.017740824  0.002767939  0.026037337  0.022629972
## table         -0.004906335 -0.038815278  0.011573905  0.017966959
## price         -0.024160863  0.008556126  0.059222867  0.097125229
## x             -0.048021190 -0.024593290  0.095895826  0.146522500
## y             -0.046707467 -0.024478948  0.093479773  0.142894679
## z             -0.048802523 -0.024581704  0.095041257  0.145276329
##                     color.J    clarity.I1   clarity.SI2  clarity.SI1
## carat          1.800545e-01  0.1209832860  2.674832e-01  0.062668829
## cut.Fair       1.725673e-02  0.1758524686  5.550578e-02  0.004586516
## cut.Good       1.497748e-02  0.0158439265  4.197041e-02  0.055938104
## cut.Very Good  9.815956e-03 -0.0313169747  4.805856e-03  0.032547004
## cut.Premium    1.723107e-02  0.0056760043  6.762330e-02  0.023274510
## cut.Ideal     -3.848992e-02 -0.0487946958 -1.082417e-01 -0.082865181
## color.D       -8.881716e-02 -0.0245450476  3.201698e-02  0.057714915
## color.E       -1.103997e-01 -0.0134618794  5.513515e-03  0.005952527
## color.F       -1.086401e-01  0.0049740743 -2.250883e-03 -0.020435501
## color.G       -1.205836e-01 -0.0020058499 -4.565092e-02 -0.080737935
## color.H       -9.996364e-02  0.0211497355  2.016295e-02  0.031613435
## color.I       -7.833944e-02  0.0092776446 -1.995871e-03  0.015933620
## color.J        1.000000e+00  0.0081915607  8.438184e-05  0.013609359
## clarity.I1     8.191561e-03  1.0000000000 -5.349738e-02 -0.066724172
## clarity.SI2    8.438184e-05 -0.0534973796  1.000000e+00 -0.256271886
## clarity.SI1    1.360936e-02 -0.0667241721 -2.562719e-01  1.000000000
## clarity.VS2    1.849612e-02 -0.0640019054 -2.458163e-01 -0.306592381
## clarity.VS1    2.715020e-02 -0.0498665431 -1.915257e-01 -0.238878860
## clarity.VVS2  -3.797059e-02 -0.0379971497 -1.459381e-01 -0.182020153
## clarity.VVS1  -3.860789e-02 -0.0318186561 -1.222080e-01 -0.152422924
## clarity.IF    -1.965420e-02 -0.0218653649 -8.397973e-02 -0.104743042
## depth          2.254271e-02  0.0811353782  7.202000e-03  0.040899286
## table          3.725306e-02  0.0447155371  9.534449e-02  0.051959065
## price          8.171036e-02 -0.0002553361  1.284203e-01  0.008956634
## x              1.646575e-01  0.1083605886  2.708270e-01  0.079241478
## y              1.607763e-01  0.1007356814  2.632520e-01  0.076097482
## z              1.642172e-01  0.1119132069  2.631922e-01  0.081004933
##                clarity.VS2  clarity.VS1 clarity.VVS2 clarity.VVS1
## carat         -0.038904149 -0.063093856  -0.13702377 -0.167571254
## cut.Fair      -0.027265713 -0.022452892  -0.03070259 -0.039920206
## cut.Good      -0.021063006 -0.017116056  -0.03862669 -0.037564207
## cut.Very Good -0.016411925 -0.006848869   0.01528430 -0.005251499
## cut.Premium    0.022611012 -0.011865966  -0.06194946 -0.053851941
## cut.Ideal      0.015669209  0.034246697   0.07550767  0.088354200
## color.D        0.021007215 -0.050133884  -0.01597468 -0.046090962
## color.E        0.027953717 -0.027238160   0.01168284 -0.001501777
## color.F        0.003774546 -0.011037941   0.01312782  0.016900707
## color.G       -0.023828920  0.055601148   0.05974424  0.042398281
## color.H       -0.029923039 -0.012740006  -0.03027122  0.004561280
## color.I       -0.009293063  0.024189245  -0.03048337 -0.003041191
## color.J        0.018496119  0.027150196  -0.03797059 -0.038607887
## clarity.I1    -0.064001905 -0.049866543  -0.03799715 -0.031818656
## clarity.SI2   -0.245816298 -0.191525689  -0.14593813 -0.122207991
## clarity.SI1   -0.306592381 -0.238878860  -0.18202015 -0.152422924
## clarity.VS2    1.000000000 -0.229132887  -0.17459395 -0.146204250
## clarity.VS1   -0.229132887  1.000000000  -0.13603340 -0.113913805
## clarity.VVS2  -0.174593948 -0.136033397   1.00000000 -0.086799678
## clarity.VVS1  -0.146204250 -0.113913805  -0.08679968  1.000000000
## clarity.IF    -0.100469651 -0.078280079  -0.05964760 -0.049948658
## depth         -0.009458949 -0.024168882  -0.01924313 -0.023477434
## table         -0.009655146 -0.026857522  -0.06227270 -0.069102784
## price         -0.001061688 -0.009886258  -0.05238083 -0.095266165
## x             -0.035507416 -0.059881798  -0.14715097 -0.185253404
## y             -0.035927764 -0.056489878  -0.14162399 -0.179271300
## z             -0.036313914 -0.058512562  -0.14474583 -0.182401151
##                 clarity.IF        depth        table         price
## carat         -0.114448682  0.028224314  0.181617547  0.9215913012
## cut.Fair      -0.027022441  0.280657311  0.125331585  0.0187282203
## cut.Good      -0.033045668  0.136113821  0.175174197 -0.0003120195
## cut.Very Good -0.033003418  0.025827615  0.119971034  0.0065934877
## cut.Premium   -0.054013023 -0.198305643  0.338071896  0.0957059722
## cut.Ideal      0.104986213 -0.022777723 -0.549598773 -0.0971753849
## color.D       -0.047418307 -0.013566273 -0.008920770 -0.0724725441
## color.E       -0.044863351 -0.028712725  0.007172049 -0.1010893683
## color.F        0.018538947 -0.017740824 -0.004906335 -0.0241608630
## color.G        0.077917264  0.002767939 -0.038815278  0.0085561259
## color.H        0.006719881  0.026037337  0.011573905  0.0592228674
## color.I       -0.012711469  0.022629972  0.017966959  0.0971252285
## color.J       -0.019654198  0.022542712  0.037253059  0.0817103594
## clarity.I1    -0.021865365  0.081135378  0.044715537 -0.0002553361
## clarity.SI2   -0.083979735  0.007202000  0.095344490  0.1284202937
## clarity.SI1   -0.104743042  0.040899286  0.051959065  0.0089566338
## clarity.VS2   -0.100469651 -0.009458949 -0.009655146 -0.0010616879
## clarity.VS1   -0.078280079 -0.024168882 -0.026857522 -0.0098862584
## clarity.VVS2  -0.059647605 -0.019243134 -0.062272695 -0.0523808313
## clarity.VVS1  -0.049948658 -0.023477434 -0.069102784 -0.0952661654
## clarity.IF     1.000000000 -0.030880817 -0.078765865 -0.0495960070
## depth         -0.030880817  1.000000000 -0.295778522 -0.0106474046
## table         -0.078765865 -0.295778522  1.000000000  0.1271339021
## price         -0.049596007 -0.010647405  0.127133902  1.0000000000
## x             -0.125976111 -0.025289247  0.195344281  0.8844351610
## y             -0.120799996 -0.029340671  0.183760147  0.8654208979
## z             -0.125247837  0.094923882  0.150928692  0.8612494439
##                          x           y           z
## carat          0.975094227  0.95172220  0.95338738
## cut.Fair       0.080643583  0.06882158  0.11036739
## cut.Good       0.030348970  0.03218662  0.04516933
## cut.Very Good  0.004568574  0.01669904  0.01603908
## cut.Premium    0.126820057  0.10794315  0.09001939
## cut.Ideal     -0.162673706 -0.15315807 -0.15868815
## color.D       -0.106126820 -0.10399873 -0.10521551
## color.E       -0.134213165 -0.13013590 -0.13220981
## color.F       -0.048021190 -0.04670747 -0.04880252
## color.G       -0.024593290 -0.02447895 -0.02458170
## color.H        0.095895826  0.09347977  0.09504126
## color.I        0.146522500  0.14289468  0.14527633
## color.J        0.164657523  0.16077626  0.16421717
## clarity.I1     0.108360589  0.10073568  0.11191321
## clarity.SI2    0.270826985  0.26325202  0.26319217
## clarity.SI1    0.079241478  0.07609748  0.08100493
## clarity.VS2   -0.035507416 -0.03592776 -0.03631391
## clarity.VS1   -0.059881798 -0.05648988 -0.05851256
## clarity.VVS2  -0.147150972 -0.14162399 -0.14474583
## clarity.VVS1  -0.185253404 -0.17927130 -0.18240115
## clarity.IF    -0.125976111 -0.12080000 -0.12524784
## depth         -0.025289247 -0.02934067  0.09492388
## table          0.195344281  0.18376015  0.15092869
## price          0.884435161  0.86542090  0.86124944
## x              1.000000000  0.97470148  0.97077180
## y              0.974701480  1.00000000  0.95200572
## z              0.970771799  0.95200572  1.00000000

7.5.3.1 -Q3 How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?

Returning to our summary stats for carat, we can use the end points of the interquartile range (the 1st and 3rd quartiles) to define “small” and “very large”. Anything at or below the 1st quartile (0.40 carat) would be “small”, and anything at or above the 3rd quartile (1.04 carat) would be “large” for this data set.

Based on our histogram and the normal Q-Q plot, there are more extreme values in the large-diamond data than a normal distribution would produce, which indicates that the distribution isn’t normal. That makes sense given the shape of our histogram, which is right-skewed and resembles a Poisson distribution. The phenomenon that I think is responsible for the shape of the curve is low demand for the “cheaper” large diamonds.

smalldiamonds <-dplyr::filter(diamonds, carat <= .4)
largediamonds <-dplyr::filter(diamonds, carat >= 1.04)

hist(smalldiamonds$price, breaks = 50, main = "Small Diamonds Price 50 Breaks")

qqnorm(smalldiamonds$price)

summary(smalldiamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   326.0   579.0   718.0   739.3   877.0  2366.0
cor(smalldiamonds$price, smalldiamonds$carat)
## [1] 0.5067542
hist(largediamonds$price, breaks = 50, main = "Large Diamonds Price 50 Breaks")

qqnorm(largediamonds$price)

summary(largediamonds$price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2037    5720    8258    9154   11970   18820
cor(largediamonds$price, largediamonds$carat)
## [1] 0.7572961
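The cutoffs above were typed in by hand from the summary output. A sketch that derives them from the data and puts the two price distributions side by side might look like this; the object names and the log-scale boxplot are illustrative choices, not part of the original analysis.

```r
library(ggplot2)  # provides the diamonds data set

# 1st and 3rd quartiles of carat, computed rather than hard-coded
cuts <- quantile(diamonds$carat, c(0.25, 0.75))

small_price <- diamonds$price[diamonds$carat <= cuts[1]]
large_price <- diamonds$price[diamonds$carat >= cuts[2]]

# Spread of price within each group
IQR(small_price)
IQR(large_price)

# Side-by-side comparison; the log scale keeps the small group readable
boxplot(list(small = small_price, large = large_price),
        log = "y", ylab = "price (log scale)")
```

Comparing the interquartile ranges shows how much more variable prices are among large diamonds than among small ones.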

10.5.1 - Q1 How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).

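When printed, a tibble starts with a header line beginning “# A tibble:”, shows only the first 10 rows by default, and lists each variable’s type beneath the column heading (for example chr, int, dbl, date). A regular data frame such as mtcars prints every row with no type information.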
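Besides looking at the printed output, the check can be done programmatically. A minimal sketch, using is_tibble() from the tibble package and the class vector:

```r
library(tibble)

# mtcars is a regular data frame
is_tibble(mtcars)
class(mtcars)

# Converting it gives a tibble, which carries extra classes
cars_tbl <- as_tibble(mtcars)
is_tibble(cars_tbl)
class(cars_tbl)
```

A tibble still inherits from data.frame, so is.data.frame() alone cannot tell the two apart.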

10.5.1 - Q3

If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?

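In this example it would be tbl[[var]], where tbl is the tibble (a placeholder name): double-bracket subsetting looks up the column whose name is stored in var. Writing tbl$var would not work, because $ looks for a column literally named “var”.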
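A short sketch of the difference, using mtcars converted to a tibble (the name cars_tbl is made up for illustration):

```r
library(tibble)

cars_tbl <- as_tibble(mtcars)
var <- "mpg"

# Double brackets use the string stored in var and return the column as a vector
mpg_values <- cars_tbl[[var]]
head(mpg_values)

# Single brackets also accept the string but return a one-column tibble
cars_tbl[var]

# $ looks for a column literally named "var"; a tibble warns and returns NULL
missing_col <- suppressWarnings(cars_tbl$var)
is.null(missing_col)
```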