setwd('D:/Rml')
library('ggplot2')
write.csv(diamonds, 'diamonds.csv')
dat = head(diamonds)
dat
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
names(dat)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
library('car')
## Loading required package: carData
attach(diamonds)
z = ggplot(diamonds, aes(x=carat, y=price, color=clarity))
z + geom_point() + theme_bw() + ggtitle('Scatterplot of price by carat & clarity')
Summary: price and carat are positively correlated; as carat increases, price tends to increase as well.
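The visual impression from the scatterplot can be backed by a number. The following is a minimal sketch, assuming the `diamonds` data from ggplot2 as used above; the variable names `r_pearson` and `r_spearman` are introduced here only for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Pearson correlation: strength of the linear association between carat and price
r_pearson <- cor(diamonds$carat, diamonds$price)

# Spearman rank correlation: only assumes a monotonic relationship,
# so it is less sensitive to any curvature in the scatterplot
r_spearman <- cor(diamonds$carat, diamonds$price, method = "spearman")

print(c(pearson = r_pearson, spearman = r_spearman))
```

Values close to 1 would support the "positive correlation" reading of the plot.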
# diamonds is already attached above, so its columns can be referenced directly
summary(lm(price~carat+cut+color+clarity+depth+table))
##
## Call:
## lm(formula = price ~ carat + cut + color + clarity + depth +
## table)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16828.8 -678.7 -199.4 464.6 10341.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -969.661 360.432 -2.690 0.00714 **
## carat 8895.194 12.079 736.390 < 2e-16 ***
## cut.L 615.613 22.985 26.784 < 2e-16 ***
## cut.Q -326.638 18.390 -17.762 < 2e-16 ***
## cut.C 156.333 15.814 9.886 < 2e-16 ***
## cut^4 -15.975 12.648 -1.263 0.20657
## color.L -1908.010 17.718 -107.689 < 2e-16 ***
## color.Q -626.087 16.112 -38.858 < 2e-16 ***
## color.C -172.056 15.063 -11.423 < 2e-16 ***
## color^4 20.319 13.833 1.469 0.14187
## color^5 -85.245 13.068 -6.523 6.95e-11 ***
## color^6 -50.085 11.881 -4.216 2.50e-05 ***
## clarity.L 4206.854 30.867 136.290 < 2e-16 ***
## clarity.Q -1831.804 28.811 -63.580 < 2e-16 ***
## clarity.C 919.725 24.672 37.278 < 2e-16 ***
## clarity^4 -361.609 19.728 -18.330 < 2e-16 ***
## clarity^5 213.910 16.108 13.280 < 2e-16 ***
## clarity^6 2.986 14.030 0.213 0.83148
## clarity^7 110.147 12.375 8.901 < 2e-16 ***
## depth -21.024 4.079 -5.154 2.56e-07 ***
## table -24.803 2.978 -8.329 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1156 on 53919 degrees of freedom
## Multiple R-squared: 0.9161, Adjusted R-squared: 0.916
## F-statistic: 2.942e+04 on 20 and 53919 DF, p-value: < 2.2e-16
CONCLUSION
Based on the coefficients table above, we can see:
R-squared is 0.916 (about 92%) => the model explains about 92% of the variation in price in this dataset.
Terms that are statistically significant predictors of diamonds' price (Pr(>|t|) < 0.05) are: carat, cut.L, cut.Q, cut.C, color.L, color.Q, color.C, color^5, color^6, clarity.L, clarity.Q, clarity.C, clarity^4, clarity^5, clarity^7, depth & table. The terms cut^4, color^4 and clarity^6 are not significant at this level.
Using the significant terms, we have the multiple linear regression model as follows: price = -969.66 + 8895.19*carat + 615.61*(cut.L) - 326.64*(cut.Q) + 156.33*(cut.C) - 1908.01*(color.L) - 626.09*(color.Q) - 172.06*(color.C) - 85.25*(color^5) - 50.09*(color^6) + 4206.85*(clarity.L) - 1831.80*(clarity.Q) + 919.73*(clarity.C) - 361.61*(clarity^4) + 213.91*(clarity^5) + 110.15*(clarity^7) - 21.02*depth - 24.80*table
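The terms cut.L, color.Q, clarity^4 and so on are the polynomial contrast columns that R creates by default for ordered factors, so the equation cannot be evaluated by plugging in a label such as "Ideal" directly. The sketch below refits the same model with `data = diamonds`, pulls R-squared and the significant terms out of the summary, and lets `predict()` handle the contrast coding. The diamond in `new_diamond` is a hypothetical example, not a row from the report.

```r
library(ggplot2)  # provides the diamonds dataset

# Same model as above, but with data = diamonds instead of attach()
fit <- lm(price ~ carat + cut + color + clarity + depth + table, data = diamonds)
s <- summary(fit)

# Goodness of fit
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared
print(round(c(r2 = r2, adj_r2 = adj_r2), 4))

# Names of the terms whose p-value is below 0.05
coefs <- s$coefficients
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
print(significant)

# Hypothetical diamond (illustrative values only); the factors must use
# the same ordered levels as the training data
new_diamond <- data.frame(
  carat   = 0.5,
  cut     = factor("Ideal", levels = levels(diamonds$cut), ordered = TRUE),
  color   = factor("E", levels = levels(diamonds$color), ordered = TRUE),
  clarity = factor("VS1", levels = levels(diamonds$clarity), ordered = TRUE),
  depth   = 61.5,
  table   = 55
)

# predict() applies the contrast coding and all fitted coefficients
pred <- predict(fit, newdata = new_diamond)
print(pred)
```

Note that `predict()` uses every fitted coefficient, including the non-significant ones, so its result can differ slightly from the reduced equation written above.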
Pairs plot of the predictors in the diamonds data frame:
library('psych')
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
# cbind() converts the ordered factors (cut, color, clarity) to their integer level codes
q = data.frame(cbind(carat,cut,color,clarity,depth,table))
pairs.panels(q)
CONCLUSION
Based on the plot above, the pairs of variables with the highest pairwise correlation are:
=> these correlation values are rather high, which points to multicollinearity.
The pairs of variables with the lowest pairwise correlation are:
=> these correlation values are rather low, so they contribute little to multicollinearity.
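The pairwise correlations shown in the plot can also be listed numerically. This is a minimal sketch, assuming Pearson correlations and the same integer coding of the ordered factors that `cbind()` produces; `pairs_tbl` is a name introduced here for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Same six predictors as in the pairs plot
predictors <- as.data.frame(diamonds)[, c("carat", "cut", "color",
                                          "clarity", "depth", "table")]

# data.matrix() turns the ordered factors into integer level codes
num <- data.matrix(predictors)

# Pearson correlation matrix of the predictors
cm <- cor(num)
print(round(cm, 2))

# One row per variable pair, sorted by absolute correlation (highest first)
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs_tbl <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[idx]
)
pairs_tbl <- pairs_tbl[order(-abs(pairs_tbl$r)), ]
print(pairs_tbl)
```

The top rows of `pairs_tbl` are the pairs most likely to cause multicollinearity; the bottom rows are the least correlated pairs.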
Besides, R-squared = 0.916 (as shown above), so the multiple linear regression model explains about 92% of the variation in price.
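Pairwise correlations only show part of the multicollinearity picture. A complementary check is the variance inflation factor from the car package, which is already loaded in this report. The sketch below assumes the same model as above; for models with factor terms, `vif()` reports generalized VIFs (GVIF) per term.

```r
library(ggplot2)  # provides the diamonds dataset
library(car)      # provides vif()

fit <- lm(price ~ carat + cut + color + clarity + depth + table, data = diamonds)

# One row per model term; columns are GVIF, Df and GVIF^(1/(2*Df))
v <- vif(fit)
print(v)
```

A common rule of thumb treats values well above 5 to 10 (on the VIF scale) as a sign of problematic multicollinearity, although the exact cutoff is a judgment call.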