ANALYSIS OF DATASET = diamonds

Question 1: Export data from package: ‘ggplot2’

setwd('D:/Rml')
library('ggplot2')
write.csv(diamonds, 'diamonds.csv')
dat = head(diamonds)
dat
## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
names(dat)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

Question 2: Create scatter plot of price by carat

library('car')
## Loading required package: carData
attach(diamonds)
scatterplot(price~carat, pch=16, col='red', main='Scatter plot of price by carat')

Summary: Price and carat are positively correlated (as carat increases, price tends to increase as well).
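
To put a number on the relationship seen in the scatter plot, the Pearson correlation between the two variables can be computed. This is a minimal sketch, assuming the ggplot2 package is installed; the variable name r is only illustrative and is not part of the original analysis.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation between carat and price.
# A value close to +1 would support the "positive correlation" reading of the plot.
r <- cor(diamonds$carat, diamonds$price)
r
```

The columns are referenced as diamonds$carat and diamonds$price so the snippet does not depend on attach().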

Question 3: Create a multiple linear regression model of price on the remaining variables in the dataset (keeping only the statistically significant terms)

attach(diamonds)
## The following objects are masked from diamonds (pos = 3):
## 
##     carat, clarity, color, cut, depth, price, table, x, y, z
summary(lm(price~carat+cut+color+clarity+depth+table))
## 
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth + 
##     table)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16828.8   -678.7   -199.4    464.6  10341.2 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  -969.661    360.432   -2.690  0.00714 ** 
## carat        8895.194     12.079  736.390  < 2e-16 ***
## cut.L         615.613     22.985   26.784  < 2e-16 ***
## cut.Q        -326.638     18.390  -17.762  < 2e-16 ***
## cut.C         156.333     15.814    9.886  < 2e-16 ***
## cut^4         -15.975     12.648   -1.263  0.20657    
## color.L     -1908.010     17.718 -107.689  < 2e-16 ***
## color.Q      -626.087     16.112  -38.858  < 2e-16 ***
## color.C      -172.056     15.063  -11.423  < 2e-16 ***
## color^4        20.319     13.833    1.469  0.14187    
## color^5       -85.245     13.068   -6.523 6.95e-11 ***
## color^6       -50.085     11.881   -4.216 2.50e-05 ***
## clarity.L    4206.854     30.867  136.290  < 2e-16 ***
## clarity.Q   -1831.804     28.811  -63.580  < 2e-16 ***
## clarity.C     919.725     24.672   37.278  < 2e-16 ***
## clarity^4    -361.609     19.728  -18.330  < 2e-16 ***
## clarity^5     213.910     16.108   13.280  < 2e-16 ***
## clarity^6       2.986     14.030    0.213  0.83148    
## clarity^7     110.147     12.375    8.901  < 2e-16 ***
## depth         -21.024      4.079   -5.154 2.56e-07 ***
## table         -24.803      2.978   -8.329  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1156 on 53919 degrees of freedom
## Multiple R-squared:  0.9161, Adjusted R-squared:  0.916 
## F-statistic: 2.942e+04 on 20 and 53919 DF,  p-value: < 2.2e-16

CONCLUSION

Based on the coefficients table above, we can see:

  • R-squared is 0.916 (about 92%) => the model explains about 92% of the variation in price.

  • Terms that are statistically significant at the 5% level (Pr(>|t|) < 0.05) are: carat, cut.L, cut.Q, cut.C, color.L, color.Q, color.C, color^5, color^6, clarity.L, clarity.Q, clarity.C, clarity^4, clarity^5, clarity^7, depth & table. The terms cut^4, color^4 and clarity^6 are not significant (Pr(>|t|) > 0.05).

  • Using those significant terms, the multiple linear regression model is as follows:
    price = -969.66 + 8895.19*carat
            + 615.61*(cut.L) - 326.64*(cut.Q) + 156.33*(cut.C)
            - 1908.01*(color.L) - 626.09*(color.Q) - 172.06*(color.C) - 85.25*(color^5) - 50.09*(color^6)
            + 4206.85*(clarity.L) - 1831.80*(clarity.Q) + 919.73*(clarity.C) - 361.61*(clarity^4) + 213.91*(clarity^5) + 110.15*(clarity^7)
            - 21.02*depth - 24.80*table
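
Terms such as cut.L, cut.Q and cut.C are not raw variables: cut, color and clarity are ordered factors, and R codes them with polynomial contrasts (linear, quadratic, cubic, ...). The sketch below shows the contrast values behind the cut terms and uses predict() instead of plugging numbers into the equation by hand. It assumes ggplot2 is installed; the names fit, cut_contrasts and pred are illustrative only.

```r
library(ggplot2)  # provides the diamonds dataset

# Refit the same model as above, passing the data explicitly instead of using attach()
fit <- lm(price ~ carat + cut + color + clarity + depth + table, data = diamonds)

# Contrast matrix for the ordered factor cut:
# one row per level (Fair ... Ideal), one column per term (.L, .Q, .C, ^4)
cut_contrasts <- contrasts(diamonds$cut)
cut_contrasts

# Predicted price for the first diamond in the dataset;
# predict() applies the contrast coding automatically
pred <- predict(fit, newdata = diamonds[1, ])
pred
```

Note that predict() uses all fitted terms, including the non-significant ones, so its result can differ slightly from the reduced equation written above.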

Plot of dataframe = diamonds:

library('psych')
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
q = cbind(carat,cut,color,clarity,depth,table)
q = data.frame(q)
attach(q)
## The following objects are masked from diamonds (pos = 4):
## 
##     carat, clarity, color, cut, depth, table
## The following objects are masked from diamonds (pos = 5):
## 
##     carat, clarity, color, cut, depth, table
pairs.panels(q)

CONCLUSION

Based on the plot above, the variable pairs with the strongest pairwise correlations are:

  • cut & table = -0.43
  • carat & clarity = -0.35
  • depth & table = -0.3
  • carat & color = 0.29

=> these are the largest pairwise correlations among the predictors, which may indicate some multicollinearity between these pairs.

The variable pairs with the weakest pairwise correlations are:

  • cut & color = -0.02
  • color & clarity = 0.03
  • color & table = 0.03
  • color & depth = 0.05
  • carat & depth = 0.03

=> these correlation values are low, so these pairs contribute little to multicollinearity.
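
The pairwise correlations read from the plot can also be computed directly, and multicollinearity can be checked more formally with variance inflation factors. This is a sketch under assumptions: it relies on ggplot2 and car being installed, it codes the ordered factors as integers (which is what cbind() does in the code above), and the names num, cor_mat, fit and gvif are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(car)      # provides vif()

# Numeric copy of the predictors; ordered factors become their integer level codes
num <- data.frame(
  carat   = diamonds$carat,
  cut     = as.integer(diamonds$cut),
  color   = as.integer(diamonds$color),
  clarity = as.integer(diamonds$clarity),
  depth   = diamonds$depth,
  table   = diamonds$table
)

# Pearson correlation matrix, rounded to two decimals like the plot labels
cor_mat <- round(cor(num), 2)
cor_mat

# Generalized variance inflation factors for the regression model;
# values near 1 suggest little multicollinearity for that term
fit <- lm(price ~ carat + cut + color + clarity + depth + table, data = diamonds)
gvif <- vif(fit)
gvif
```

Because the model contains factor terms, vif() reports generalized VIFs, one row per term rather than one per coefficient.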

In addition, R-squared is about 92% (as shown above), so the multiple linear regression model explains most of the variation in price.