7.3.4 - Q1 Explore the distribution of each of the x, y, and z variables in diamonds. What do you learn? Think about a diamond and how you might decide which dimension is the length, width, and depth.
Interestingly, the appearance of the distributions changes depending on the number of breaks in the histogram. With 50 breaks, each distribution appears close to a Gaussian (normal) distribution, with some minor skew. All three variables are continuous, not integer-valued. The normal Q-Q plot for x confirms something close to a normal distribution, while y and z appear not to be normal, but that might be due to outliers in the data set: the summary below shows a minimum of 0 for all three variables and maximums of 58.9 for y and 31.8 for z, far above their third quartiles.
The information about which of x, y, and z is the length, width, or depth would have to be supplied to me directly, because I can’t tell from the data alone which is which. The only hint in the summary is that z is consistently smaller than x and y (median 3.53 versus 5.70 and 5.71), which would be consistent with z being the depth.
library(tidyverse)
library(data.table)
library(dummies)
library(dplyr)
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
hist(diamonds$x, breaks = 50, main = "x 50 Breaks")
qqnorm(diamonds$x)
hist(diamonds$y, breaks = 50, main = "y 50 Breaks")
qqnorm(diamonds$y)
hist(diamonds$z, breaks = 50, main = "z 50 Breaks")
qqnorm(diamonds$z)
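The zeros and the very large maximums in the summary can be checked directly before reading too much into the histograms. The following is a minimal sketch, not part of the original analysis: the 20 mm cutoff and the names `dims`, `zero_counts`, `big_counts`, `keep`, and `clean` are assumptions chosen for illustration.

```r
library(ggplot2)  # provides the diamonds data set

# Keep only the three dimension columns as a plain data frame
dims <- as.data.frame(diamonds[, c("x", "y", "z")])

# How many measurements are exactly zero (not physically possible)?
zero_counts <- colSums(dims == 0)

# How many exceed 20 mm (an assumed, illustrative cutoff)?
big_counts <- colSums(dims > 20)

print(zero_counts)
print(big_counts)

# Drop rows with any suspicious dimension, then re-plot y with a narrow bin width
keep <- rowSums(dims == 0 | dims > 20) == 0
clean <- diamonds[keep, ]

ggplot(clean, aes(x = y)) +
  geom_histogram(binwidth = 0.1)
```

With the suspicious rows removed, the x axis is no longer stretched by a handful of extreme values, so the shape of the bulk of the data is easier to judge.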
7.3.4 - Q2
Explore the distribution of price. Do you discover anything unusual or surprising? (Hint: Carefully think about the binwidth and make sure you try a wide range of values.)
Without customizing the bin widths, the demand for diamonds seems to be perfectly elastic in our “Price Auto Breaks” chart. However, once we create 50 breaks and see the data at a finer granularity, we see more of a Poisson-type distribution (although the data are continuous). Essentially, the lowest-priced diamonds appear to have smaller demand than those at $5K, which suggests that at certain price points demand is inelastic (people don’t want low-quality diamonds no matter what the price). Note that a histogram shows how many diamonds in the data set fall in each price range, so reading it as demand is an interpretation rather than something the chart measures directly.
hist(diamonds$price, main = "Price Auto Breaks")
hist(diamonds$price, breaks = 50, main = "Price 50 Breaks")
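The hint in the question suggests trying a wide range of bin widths rather than a single break count. A minimal sketch of that, assuming illustrative bin widths of $1,000, $100, and $10 and an upper limit of $19,000 (just above the maximum price in the summary); the names `binwidths` and `h` are hypothetical:

```r
library(ggplot2)  # provides the diamonds data set

# Candidate bin widths in dollars (assumed values for illustration)
binwidths <- c(1000, 100, 10)

# Stack the three histograms in one plotting window
old_par <- par(mfrow = c(length(binwidths), 1))
for (bw in binwidths) {
  # Fixed-width bins from $0 to $19,000 cover the full price range
  hist(diamonds$price,
       breaks = seq(0, 19000, by = bw),
       main = paste("Price, bin width", bw),
       xlab = "price")
}
par(old_par)

# Same bins without plotting, to inspect the counts directly
h <- hist(diamonds$price, breaks = seq(0, 19000, by = 100), plot = FALSE)
head(h$counts)
```

Specifying the breaks as a sequence fixes the bin width exactly, whereas `breaks = 50` is only treated by `hist()` as a suggestion.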
7.4.1 - Q1
What happens to missing values in a histogram? What happens to missing values in a bar chart? Why is there a difference?
Missing values are omitted from a histogram, whereas in a bar chart they are counted and shown as their own “NA” category, just like any other categorical value. The reason is that histograms represent continuous variables, which must be numeric, so a missing value has no position on the axis and can be dropped automatically (ggplot2 does this with a warning). Bar charts, however, are built for categorical values: they count each instance of every category, including NA, unless it is specifically removed.
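The difference can be seen on a small made-up data set. This is a sketch under stated assumptions: the data frame `df` and its columns `num` and `grp` are invented for illustration and do not come from diamonds.

```r
library(ggplot2)

# Toy data with one missing value in each column
df <- data.frame(
  num = c(1, 2, 2, NA, 5),
  grp = c("a", "b", NA, "b", "a")
)

# Histogram: the row with a missing num is dropped (ggplot2 warns about it)
ggplot(df, aes(x = num)) +
  geom_histogram(binwidth = 1)

# Bar chart: the missing grp gets its own NA bar
ggplot(df, aes(x = grp)) +
  geom_bar()

# The same contrast in base R
hist_n <- sum(hist(df$num, plot = FALSE)$counts)  # missing value excluded
bar_counts <- table(df$grp, useNA = "ifany")      # NA counted as a category

hist_n
bar_counts
```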
7.5.1.1 - Q2
What variable in the diamonds dataset is most important for predicting the price of a diamond? How is that variable correlated with cut? Why does the combination of those two relationships lead to lower quality diamonds being more expensive?
Based on the linear regression below, carat has the most influence on price, with the steepest slope: holding the other variables fixed, each 1-carat increase is associated with an $11,257 increase in price (about $113 per 0.01 carat). However, the correlation between carat and cut quality is inverse: there is a 0.09 correlation between a Fair cut and carat, but a -0.16 correlation between an Ideal cut and carat. This means that as carat increases, the price increases but the cut quality tends to degrade (quantity vs. quality). Because carat dominates price, the lower-quality cuts, which tend to be heavier, can end up more expensive on average.
# Start from diamonds and convert it to a data.table
mydiamonds <- diamonds
mydiamonds <- setDT(mydiamonds)
# Convert the ordered factors to plain factors before dummy coding
mydiamonds$cut <- factor(mydiamonds$cut, ordered = FALSE)
mydiamonds$color <- factor(mydiamonds$color, ordered = FALSE)
mydiamonds$clarity <- factor(mydiamonds$clarity, ordered = FALSE)
# Expand each factor into indicator columns named like cut.Fair
mydummydiamonds <- dummy.data.frame(mydiamonds, sep = ".")
# Regress price on all remaining columns
lmdiamonds <- lm(price ~ ., data = mydummydiamonds)
summary(lmdiamonds)
##
## Call:
## lm(formula = price ~ ., data = mydummydiamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21376.0 -592.4 -183.5 376.4 10694.2
##
## Coefficients: (3 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5993.093 390.770 15.337 < 2e-16 ***
## carat 11256.978 48.628 231.494 < 2e-16 ***
## cut.Fair -832.912 33.407 -24.932 < 2e-16 ***
## cut.Good -253.160 20.247 -12.504 < 2e-16 ***
## `cut.Very Good` -106.129 14.228 -7.459 8.82e-14 ***
## cut.Premium -70.768 14.590 -4.850 1.24e-06 ***
## cut.Ideal NA NA NA NA
## color.D 2369.398 26.131 90.674 < 2e-16 ***
## color.E 2160.280 24.922 86.683 < 2e-16 ***
## color.F 2096.544 24.813 84.492 < 2e-16 ***
## color.G 1887.359 24.313 77.628 < 2e-16 ***
## color.H 1389.131 24.891 55.809 < 2e-16 ***
## color.I 903.154 26.337 34.292 < 2e-16 ***
## color.J NA NA NA NA
## clarity.I1 -5345.102 51.024 -104.757 < 2e-16 ***
## clarity.SI2 -2642.516 30.523 -86.574 < 2e-16 ***
## clarity.SI1 -1679.630 29.371 -57.186 < 2e-16 ***
## clarity.VS2 -1077.879 29.150 -36.977 < 2e-16 ***
## clarity.VS1 -766.704 29.847 -25.688 < 2e-16 ***
## clarity.VVS2 -394.288 31.240 -12.621 < 2e-16 ***
## clarity.VVS1 -337.343 32.674 -10.324 < 2e-16 ***
## clarity.IF NA NA NA NA
## depth -63.806 4.535 -14.071 < 2e-16 ***
## table -26.474 2.912 -9.092 < 2e-16 ***
## x -1008.261 32.898 -30.648 < 2e-16 ***
## y 9.609 19.333 0.497 0.619
## z -50.119 33.486 -1.497 0.134
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 53916 degrees of freedom
## Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
## F-statistic: 2.688e+04 on 23 and 53916 DF, p-value: < 2.2e-16
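A regression slope is the change in the response per one unit of the predictor, so the carat estimate above is a per-carat figure. The sketch below illustrates how the slope rescales when carat is expressed in hundredths of a carat. It uses a simplified carat-only model (an assumption made to keep the example short, so its estimate will not match the full model), and the names `fit_carat`, `fit_points`, `per_carat`, and `per_point` are hypothetical.

```r
library(ggplot2)  # provides the diamonds data set

# Simplified model with carat as the only predictor (illustrative assumption)
fit_carat <- lm(price ~ carat, data = diamonds)

# Same model with carat expressed in hundredths of a carat
fit_points <- lm(price ~ I(carat * 100), data = diamonds)

per_carat <- unname(coef(fit_carat)[2])   # dollars per 1 carat
per_point <- unname(coef(fit_points)[2])  # dollars per 0.01 carat

per_carat
per_point
per_carat / 100  # matches per_point
```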
cor(mydummydiamonds)
## carat cut.Fair cut.Good cut.Very Good
## carat 1.000000000 0.091843685 0.0341964753 0.009568034
## cut.Fair 0.091843685 1.000000000 -0.0554820732 -0.094236197
## cut.Good 0.034196475 -0.055482073 1.0000000000 -0.169939873
## cut.Very Good 0.009568034 -0.094236197 -0.1699398730 1.000000000
## cut.Premium 0.116244855 -0.102801176 -0.1853854406 -0.314876821
## cut.Ideal -0.163660333 -0.143077886 -0.2580180296 -0.438243136
## color.D -0.112056603 -0.012893366 0.0089092841 -0.000607790
## color.E -0.139214865 -0.019334474 0.0070127892 0.023710592
## color.F -0.060052467 0.007763181 0.0069490843 0.003110404
## color.G -0.029038057 -0.006170695 -0.0247286509 -0.025170628
## color.H 0.102464659 0.016646031 -0.0095171730 -0.004436857
## color.I 0.161493717 0.004769664 0.0061867573 -0.001548598
## color.J 0.180054472 0.017256725 0.0149774833 0.009815956
## clarity.I1 0.120983286 0.175852469 0.0158439265 -0.031316975
## clarity.SI2 0.267483210 0.055505777 0.0419704088 0.004805856
## clarity.SI1 0.062668829 0.004586516 0.0559381037 0.032547004
## clarity.VS2 -0.038904149 -0.027265713 -0.0210630056 -0.016411925
## clarity.VS1 -0.063093856 -0.022452892 -0.0171160562 -0.006848869
## clarity.VVS2 -0.137023771 -0.030702591 -0.0386266916 0.015284305
## clarity.VVS1 -0.167571254 -0.039920206 -0.0375642065 -0.005251499
## clarity.IF -0.114448682 -0.027022441 -0.0330456677 -0.033003418
## depth 0.028224314 0.280657311 0.1361138208 0.025827615
## table 0.181617547 0.125331585 0.1751741968 0.119971034
## price 0.921591301 0.018728220 -0.0003120195 0.006593488
## x 0.975094227 0.080643583 0.0303489705 0.004568574
## y 0.951722199 0.068821579 0.0321866174 0.016699044
## z 0.953387381 0.110367389 0.0451693346 0.016039079
## cut.Premium cut.Ideal color.D color.E
## carat 0.116244855 -0.163660333 -0.112056603 -0.139214865
## cut.Fair -0.102801176 -0.143077886 -0.012893366 -0.019334474
## cut.Good -0.185385441 -0.258018030 0.008909284 0.007012789
## cut.Very Good -0.314876821 -0.438243136 -0.000607790 0.023710592
## cut.Premium 1.000000000 -0.478074365 -0.016566131 -0.018499622
## cut.Ideal -0.478074365 1.000000000 0.014520993 -0.001105383
## color.D -0.016566131 0.014520993 1.000000000 -0.178550209
## color.E -0.018499622 -0.001105383 -0.178550209 1.000000000
## color.F -0.012098457 0.001351245 -0.175704439 -0.218400503
## color.G 0.003858758 0.034649146 -0.195020754 -0.242410670
## color.H 0.027895105 -0.021263982 -0.161671927 -0.200958100
## color.I 0.005899415 -0.009225169 -0.126698958 -0.157486722
## color.J 0.017231075 -0.038489916 -0.088817162 -0.110399675
## clarity.I1 0.005676004 -0.048794696 -0.024545048 -0.013461879
## clarity.SI2 0.067623298 -0.108241651 0.032016976 0.005513515
## clarity.SI1 0.023274510 -0.082865181 0.057714915 0.005952527
## clarity.VS2 0.022611012 0.015669209 0.021007215 0.027953717
## clarity.VS1 -0.011865966 0.034246697 -0.050133884 -0.027238160
## clarity.VVS2 -0.061949459 0.075507673 -0.015974682 0.011682835
## clarity.VVS1 -0.053851941 0.088354200 -0.046090962 -0.001501777
## clarity.IF -0.054013023 0.104986213 -0.047418307 -0.044863351
## depth -0.198305643 -0.022777723 -0.013566273 -0.028712725
## table 0.338071896 -0.549598773 -0.008920770 0.007172049
## price 0.095705972 -0.097175385 -0.072472544 -0.101089368
## x 0.126820057 -0.162673706 -0.106126820 -0.134213165
## y 0.107943149 -0.153158073 -0.103998733 -0.130135896
## z 0.090019394 -0.158688152 -0.105215511 -0.132209808
## color.F color.G color.H color.I
## carat -0.060052467 -0.029038057 0.102464659 0.161493717
## cut.Fair 0.007763181 -0.006170695 0.016646031 0.004769664
## cut.Good 0.006949084 -0.024728651 -0.009517173 0.006186757
## cut.Very Good 0.003110404 -0.025170628 -0.004436857 -0.001548598
## cut.Premium -0.012098457 0.003858758 0.027895105 0.005899415
## cut.Ideal 0.001351245 0.034649146 -0.021263982 -0.009225169
## color.D -0.175704439 -0.195020754 -0.161671927 -0.126698958
## color.E -0.218400503 -0.242410670 -0.200958100 -0.157486722
## color.F 1.000000000 -0.238547079 -0.197755189 -0.154976666
## color.G -0.238547079 1.000000000 -0.219495684 -0.172014244
## color.H -0.197755189 -0.219495684 1.000000000 -0.142599563
## color.I -0.154976666 -0.172014244 -0.142599563 1.000000000
## color.J -0.108640102 -0.120583604 -0.099963636 -0.078339442
## clarity.I1 0.004974074 -0.002005850 0.021149736 0.009277645
## clarity.SI2 -0.002250883 -0.045650915 0.020162952 -0.001995871
## clarity.SI1 -0.020435501 -0.080737935 0.031613435 0.015933620
## clarity.VS2 0.003774546 -0.023828920 -0.029923039 -0.009293063
## clarity.VS1 -0.011037941 0.055601148 -0.012740006 0.024189245
## clarity.VVS2 0.013127817 0.059744240 -0.030271219 -0.030483373
## clarity.VVS1 0.016900707 0.042398281 0.004561280 -0.003041191
## clarity.IF 0.018538947 0.077917264 0.006719881 -0.012711469
## depth -0.017740824 0.002767939 0.026037337 0.022629972
## table -0.004906335 -0.038815278 0.011573905 0.017966959
## price -0.024160863 0.008556126 0.059222867 0.097125229
## x -0.048021190 -0.024593290 0.095895826 0.146522500
## y -0.046707467 -0.024478948 0.093479773 0.142894679
## z -0.048802523 -0.024581704 0.095041257 0.145276329
## color.J clarity.I1 clarity.SI2 clarity.SI1
## carat 1.800545e-01 0.1209832860 2.674832e-01 0.062668829
## cut.Fair 1.725673e-02 0.1758524686 5.550578e-02 0.004586516
## cut.Good 1.497748e-02 0.0158439265 4.197041e-02 0.055938104
## cut.Very Good 9.815956e-03 -0.0313169747 4.805856e-03 0.032547004
## cut.Premium 1.723107e-02 0.0056760043 6.762330e-02 0.023274510
## cut.Ideal -3.848992e-02 -0.0487946958 -1.082417e-01 -0.082865181
## color.D -8.881716e-02 -0.0245450476 3.201698e-02 0.057714915
## color.E -1.103997e-01 -0.0134618794 5.513515e-03 0.005952527
## color.F -1.086401e-01 0.0049740743 -2.250883e-03 -0.020435501
## color.G -1.205836e-01 -0.0020058499 -4.565092e-02 -0.080737935
## color.H -9.996364e-02 0.0211497355 2.016295e-02 0.031613435
## color.I -7.833944e-02 0.0092776446 -1.995871e-03 0.015933620
## color.J 1.000000e+00 0.0081915607 8.438184e-05 0.013609359
## clarity.I1 8.191561e-03 1.0000000000 -5.349738e-02 -0.066724172
## clarity.SI2 8.438184e-05 -0.0534973796 1.000000e+00 -0.256271886
## clarity.SI1 1.360936e-02 -0.0667241721 -2.562719e-01 1.000000000
## clarity.VS2 1.849612e-02 -0.0640019054 -2.458163e-01 -0.306592381
## clarity.VS1 2.715020e-02 -0.0498665431 -1.915257e-01 -0.238878860
## clarity.VVS2 -3.797059e-02 -0.0379971497 -1.459381e-01 -0.182020153
## clarity.VVS1 -3.860789e-02 -0.0318186561 -1.222080e-01 -0.152422924
## clarity.IF -1.965420e-02 -0.0218653649 -8.397973e-02 -0.104743042
## depth 2.254271e-02 0.0811353782 7.202000e-03 0.040899286
## table 3.725306e-02 0.0447155371 9.534449e-02 0.051959065
## price 8.171036e-02 -0.0002553361 1.284203e-01 0.008956634
## x 1.646575e-01 0.1083605886 2.708270e-01 0.079241478
## y 1.607763e-01 0.1007356814 2.632520e-01 0.076097482
## z 1.642172e-01 0.1119132069 2.631922e-01 0.081004933
## clarity.VS2 clarity.VS1 clarity.VVS2 clarity.VVS1
## carat -0.038904149 -0.063093856 -0.13702377 -0.167571254
## cut.Fair -0.027265713 -0.022452892 -0.03070259 -0.039920206
## cut.Good -0.021063006 -0.017116056 -0.03862669 -0.037564207
## cut.Very Good -0.016411925 -0.006848869 0.01528430 -0.005251499
## cut.Premium 0.022611012 -0.011865966 -0.06194946 -0.053851941
## cut.Ideal 0.015669209 0.034246697 0.07550767 0.088354200
## color.D 0.021007215 -0.050133884 -0.01597468 -0.046090962
## color.E 0.027953717 -0.027238160 0.01168284 -0.001501777
## color.F 0.003774546 -0.011037941 0.01312782 0.016900707
## color.G -0.023828920 0.055601148 0.05974424 0.042398281
## color.H -0.029923039 -0.012740006 -0.03027122 0.004561280
## color.I -0.009293063 0.024189245 -0.03048337 -0.003041191
## color.J 0.018496119 0.027150196 -0.03797059 -0.038607887
## clarity.I1 -0.064001905 -0.049866543 -0.03799715 -0.031818656
## clarity.SI2 -0.245816298 -0.191525689 -0.14593813 -0.122207991
## clarity.SI1 -0.306592381 -0.238878860 -0.18202015 -0.152422924
## clarity.VS2 1.000000000 -0.229132887 -0.17459395 -0.146204250
## clarity.VS1 -0.229132887 1.000000000 -0.13603340 -0.113913805
## clarity.VVS2 -0.174593948 -0.136033397 1.00000000 -0.086799678
## clarity.VVS1 -0.146204250 -0.113913805 -0.08679968 1.000000000
## clarity.IF -0.100469651 -0.078280079 -0.05964760 -0.049948658
## depth -0.009458949 -0.024168882 -0.01924313 -0.023477434
## table -0.009655146 -0.026857522 -0.06227270 -0.069102784
## price -0.001061688 -0.009886258 -0.05238083 -0.095266165
## x -0.035507416 -0.059881798 -0.14715097 -0.185253404
## y -0.035927764 -0.056489878 -0.14162399 -0.179271300
## z -0.036313914 -0.058512562 -0.14474583 -0.182401151
## clarity.IF depth table price
## carat -0.114448682 0.028224314 0.181617547 0.9215913012
## cut.Fair -0.027022441 0.280657311 0.125331585 0.0187282203
## cut.Good -0.033045668 0.136113821 0.175174197 -0.0003120195
## cut.Very Good -0.033003418 0.025827615 0.119971034 0.0065934877
## cut.Premium -0.054013023 -0.198305643 0.338071896 0.0957059722
## cut.Ideal 0.104986213 -0.022777723 -0.549598773 -0.0971753849
## color.D -0.047418307 -0.013566273 -0.008920770 -0.0724725441
## color.E -0.044863351 -0.028712725 0.007172049 -0.1010893683
## color.F 0.018538947 -0.017740824 -0.004906335 -0.0241608630
## color.G 0.077917264 0.002767939 -0.038815278 0.0085561259
## color.H 0.006719881 0.026037337 0.011573905 0.0592228674
## color.I -0.012711469 0.022629972 0.017966959 0.0971252285
## color.J -0.019654198 0.022542712 0.037253059 0.0817103594
## clarity.I1 -0.021865365 0.081135378 0.044715537 -0.0002553361
## clarity.SI2 -0.083979735 0.007202000 0.095344490 0.1284202937
## clarity.SI1 -0.104743042 0.040899286 0.051959065 0.0089566338
## clarity.VS2 -0.100469651 -0.009458949 -0.009655146 -0.0010616879
## clarity.VS1 -0.078280079 -0.024168882 -0.026857522 -0.0098862584
## clarity.VVS2 -0.059647605 -0.019243134 -0.062272695 -0.0523808313
## clarity.VVS1 -0.049948658 -0.023477434 -0.069102784 -0.0952661654
## clarity.IF 1.000000000 -0.030880817 -0.078765865 -0.0495960070
## depth -0.030880817 1.000000000 -0.295778522 -0.0106474046
## table -0.078765865 -0.295778522 1.000000000 0.1271339021
## price -0.049596007 -0.010647405 0.127133902 1.0000000000
## x -0.125976111 -0.025289247 0.195344281 0.8844351610
## y -0.120799996 -0.029340671 0.183760147 0.8654208979
## z -0.125247837 0.094923882 0.150928692 0.8612494439
## x y z
## carat 0.975094227 0.95172220 0.95338738
## cut.Fair 0.080643583 0.06882158 0.11036739
## cut.Good 0.030348970 0.03218662 0.04516933
## cut.Very Good 0.004568574 0.01669904 0.01603908
## cut.Premium 0.126820057 0.10794315 0.09001939
## cut.Ideal -0.162673706 -0.15315807 -0.15868815
## color.D -0.106126820 -0.10399873 -0.10521551
## color.E -0.134213165 -0.13013590 -0.13220981
## color.F -0.048021190 -0.04670747 -0.04880252
## color.G -0.024593290 -0.02447895 -0.02458170
## color.H 0.095895826 0.09347977 0.09504126
## color.I 0.146522500 0.14289468 0.14527633
## color.J 0.164657523 0.16077626 0.16421717
## clarity.I1 0.108360589 0.10073568 0.11191321
## clarity.SI2 0.270826985 0.26325202 0.26319217
## clarity.SI1 0.079241478 0.07609748 0.08100493
## clarity.VS2 -0.035507416 -0.03592776 -0.03631391
## clarity.VS1 -0.059881798 -0.05648988 -0.05851256
## clarity.VVS2 -0.147150972 -0.14162399 -0.14474583
## clarity.VVS1 -0.185253404 -0.17927130 -0.18240115
## clarity.IF -0.125976111 -0.12080000 -0.12524784
## depth -0.025289247 -0.02934067 0.09492388
## table 0.195344281 0.18376015 0.15092869
## price 0.884435161 0.86542090 0.86124944
## x 1.000000000 0.97470148 0.97077180
## y 0.974701480 1.00000000 0.95200572
## z 0.970771799 0.95200572 1.00000000
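The relationship between cut and carat can also be read without dummy variables by comparing group averages. A minimal sketch, assuming only ggplot2 is loaded; the names `mean_carat`, `mean_price`, and `price_carat_cor` are hypothetical:

```r
library(ggplot2)  # provides the diamonds data set

# Average carat and average price within each cut level
mean_carat <- tapply(diamonds$carat, diamonds$cut, mean)
mean_price <- tapply(diamonds$price, diamonds$cut, mean)
round(cbind(mean_carat, mean_price), 2)

# Distribution of carat within each cut level
ggplot(diamonds, aes(x = cut, y = carat)) +
  geom_boxplot()

# Overall correlation between price and carat
price_carat_cor <- cor(diamonds$price, diamonds$carat)
price_carat_cor
```

If the lower-quality cuts have the larger average carat, that would account for their higher average price despite the poorer cut.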
7.5.3.1 - Q3 How does the price distribution of very large diamonds compare to small diamonds? Is it as you expect, or does it surprise you?
Returning to our summary statistics for carat, we can use the first and third quartiles (the end points of the IQR) to define “very large” and “small”. Anything at or below the 1st quartile (0.4) would be “small”, and anything at or above the 3rd quartile (1.04) would be “large” for this data set.
Based on the histogram and the normal Q-Q plot, there seem to be more extreme values in the large-diamond data set than expected, indicating that the distribution isn’t normal. That makes sense given the shape of our histogram, which appears to show a Poisson-like distribution. The phenomenon that I think is responsible for the shape of the curve is low demand for the “cheaper” large diamonds.
smalldiamonds <- dplyr::filter(diamonds, carat <= 0.4)
largediamonds <- dplyr::filter(diamonds, carat >= 1.04)
hist(smalldiamonds$price, breaks = 50, main = "Small Diamonds Price 50 Breaks")
qqnorm(smalldiamonds$price)
summary(smalldiamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 326.0 579.0 718.0 739.3 877.0 2366.0
cor(smalldiamonds$price, smalldiamonds$carat)
## [1] 0.5067542
hist(largediamonds$price, breaks = 50, main = "Large Diamonds Price 50 Breaks")
qqnorm(largediamonds$price)
summary(largediamonds$price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2037 5720 8258 9154 11970 18820
cor(largediamonds$price, largediamonds$carat)
## [1] 0.7572961
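The quartile cutoffs can be computed rather than typed in, and the two groups compared side by side. This is a sketch, not the original method: the names `cutoffs`, `sized`, and `price_by_size` are hypothetical, and because `quantile()` returns floating-point values the groups may differ marginally from those defined by the hardcoded 0.4 and 1.04.

```r
library(ggplot2)  # provides the diamonds data set
library(dplyr)

# First and third quartiles of carat
cutoffs <- quantile(diamonds$carat, probs = c(0.25, 0.75))

# Label each diamond and keep only the two tails
sized <- diamonds %>%
  mutate(size = case_when(
    carat <= cutoffs[[1]] ~ "small",
    carat >= cutoffs[[2]] ~ "large",
    TRUE ~ "middle"
  )) %>%
  filter(size != "middle")

# Compare the centre and spread of price in each group
price_by_size <- sized %>%
  group_by(size) %>%
  summarise(n = n(), median_price = median(price), iqr_price = IQR(price))
print(price_by_size)

# Side-by-side histograms, each panel with its own axis range
ggplot(sized, aes(x = price)) +
  geom_histogram(bins = 50) +
  facet_wrap(~ size, scales = "free")
```

Comparing the median and IQR makes the difference in spread explicit, which is harder to judge from two separate histograms on different scales.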
10.5.1 - Q1 How can you tell if an object is a tibble? (Hint: try printing mtcars, which is a regular data frame).
When printed, tibbles show only the first 10 rows by default, and they list the variable type beneath each column heading (for example chr, int, dbl, date). A regular data frame such as mtcars prints every row and shows no types.
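Besides looking at the printed output, the check can be done programmatically. A minimal sketch using `is_tibble()` from the tibble package and base R's `class()`:

```r
library(tibble)

is_tibble(mtcars)             # FALSE: a regular data frame
is_tibble(as_tibble(mtcars))  # TRUE: converted to a tibble

# The class vector tells the same story
class(mtcars)
class(as_tibble(mtcars))
```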
10.5.1 - Q3
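If you have the name of a variable stored in an object, e.g. var <- "mpg", how can you extract the reference variable from a tibble?
In this example it would be tbl[[var]], where tbl stands for the tibble: reference the tibble object and subset it with double brackets, which look up the column whose name is stored in var. Writing tbl$var would not work, because $ looks for a column literally named var.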
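A minimal sketch of extracting a column whose name is stored in a variable, assuming a tibble built from mtcars; the names `tbl`, `by_name`, `as_tbl`, and `missing_col` are hypothetical:

```r
library(tibble)

tbl <- as_tibble(mtcars)  # hypothetical example tibble
var <- "mpg"

# Double brackets use the value of var, returning the mpg column as a vector
by_name <- tbl[[var]]

# Single brackets return a one-column tibble instead
as_tbl <- tbl[var]

# $ looks for a column literally called "var", which does not exist,
# so it returns NULL (tibble also issues a warning, suppressed here)
missing_col <- suppressWarnings(tbl$var)

head(by_name)
```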