Introduction

To iterate the words of John Tukey , “The simple graph has brought more information to the data analyst’s mind than any other device”. This R Markdown document is an analysis output based on the data frame from ggplot. The data frame consists of 5390 rows and 10 variables. The variables are as follows:

price - price in US dollars ($326–$18,823).
carat - weight of the diamond (0.2–5.01).
cut - quality of the cut (Fair, Good, Very Good, Premium, Ideal).
color - diamond colour, from D (best) to J (worst).
clarity - a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)).
x - length in mm (0–10.74).
y - width in mm (0–58.9).
z - depth in mm (0–31.8).
depth - total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79).
table - width of top of diamond relative to widest point (43–95).

Objectives.

Observe the distribution of clarity.
Investigate the relationship between price and clarity.
Observe the distribution and relationship of cut and price.
Investigate the relationship between cut, colour and price.
Perform normality diagnosis on carat and price.
Compare the differences between price and carat with their logarithms.
Model Building by performing linear regression .

First and foremost, we load the packages that we will use throughout our analysis. It is good practice to have all the libraries under one chunk.Since I will be using tidyverse() package the most,I will load it the last. I love tidyverse because it combines general packages that used for data wrangling,visualization and modelling.For instance, in stead of loading dplyr(),tidyr() and ggplot2() , you just load tidyverse() and you are set.

load packages

Then we load our dataset diamonds

load data set

data("diamonds")

descriptive summary

dlookr::describe(diamonds,quantiles=c(.25,.50,.75)) %>% flextable()

described_variables	n	mean	sd	se_mean	IQR	skewness	kurtosis	p25	p50	p75
carat	53,940	0.7979397	0.4740112	0.002040954	0.64	1.11664592	1.2566353	0.40	0.70	1.04
depth	53,940	61.7494049	1.4326213	0.006168448	1.50	-0.08229403	5.7394146	61.00	61.80	62.50
table	53,940	57.4571839	2.2344906	0.009621063	3.00	0.79689585	2.8018569	56.00	57.00	59.00
price	53,940	3,932.7997219	3,989.4397381	17.177360816	4,374.25	1.61839528	2.1776958	950.00	2,401.00	5,324.25
x	53,940	5.7311572	1.1217607	0.004829974	1.83	0.37867634	-0.6181607	4.71	5.70	6.54
y	53,940	5.7345260	1.1421347	0.004917698	1.82	2.43416672	91.2145572	4.72	5.71	6.54
z	53,940	3.5387338	0.7056988	0.003038533	1.13	1.52242256	47.0866193	2.91	3.53	4.04

Some of the notable observations are as follows: * The median carat size is 0.7. * About 75% of diamonds have carat weights less than 1. * The median price for a diamonds 2401 dollars.

Data Exploration

ggplot(diamonds)+
  geom_bar(mapping = aes(x=clarity,color=clarity))+
  ggtitle("Count of diamonds present with respect to Clarity")

Most diamonds are available with SI1 and SV2 clarity while least available with clarity I1, which has the worst clarity.

ggplot(data = diamonds)+
  geom_boxplot(aes(x=reorder(clarity,price,median),
                   price,color=clarity))+
  scale_x_discrete("clarity")+
  coord_flip()+
  ggtitle("Distribution of price and clarity")

The box plots have been reordered from the lowest median to the highest and we can observe that diamonds with IF clarity has the lowest price , which concludes that diamonds with quality clarity are cheaper on average.

diamonds %>% 
  ggplot(aes(x=cut,fill = cut))+
  geom_bar()+
  ggtitle("Count of diamonds present with respect to Cut")

The bar chart above shows that more diamonds are available with high quality(Ideal) cuts than with low quality cuts.

diamonds %>%
  ggplot(aes(x=cut,y=price, color=cut)) +
  geom_boxplot()+
  ggtitle("Distribution of price and cut")

The box plot shows that the best quality diamonds are cheaper on average.

diamonds %>% ggplot(mapping=aes(x=cut,fill= color))+
  geom_bar(position = "dodge")+
  ggtitle("Count of diamonds present with respect to Cut and color")

Based on color,most diamonds have a color G or better i.e. G,F,E,D.

ggplot(data = diamonds)+
  geom_boxplot(aes(x=reorder(color,price,median),
                   price,color=color))+
  scale_x_discrete("color")+
  coord_flip()+
  ggtitle("Distribution of color and price")

Diamonds with better colors are much cheaper in average compared to diamonds with poor colors.

We have see the relationship between the quality of diamonds and their prices :low quality diamonds(poor cuts,poor clarity and bad colors) have higher prices and vice versa.

To observe the relationship between two continuous variables, scatter plots are best suited.In this case we are going to use price and carat,because low quality diamonds have higher prices because of carat.The weight of the diamond which is a very important factor for determining the price of the diamond.This makes carat a confounding variable.

diamonds %>% 
  ggplot(mapping = aes(x=carat,y=price))+
  geom_point(aes(color=cut))+
  ggtitle("Visualization of carat vs. Price on distribution of cut")

diamonds %>% 
  ggplot(mapping = aes(x=carat,y=price))+
  geom_point(aes(color=clarity))+
  ggtitle("Visualization of carat vs. Price on distribution of clarity")

diamonds %>% 
  ggplot(mapping = aes(x=carat,y=price))+
  geom_point(aes(color=color))+
  ggtitle("Visualization of carat vs. Price on distribution of color")

The three scatter plots are proof that lower quality diamonds tend to have a larger weight.

To visualize the correlation of each variables we use ggpairs() from GGally

DF = diamonds[,c("price","carat","x","y","z","depth","table")]
ggpairs(DF)

From the pairwise plot we can observe that: * There is a high correlation between price and cart. * Carat is highly correlated to x,y,z. which could be translated that the carat would definitely be affected by its dimensions. * Price with table and depths exhibit poor correlation.

The histogram shows that price and carat show better normality after log transformation.Hence we will log-transform price and carat in our model and keep on assessing our assumptions.

diamonds %>% 
  plot_normality(c(price,carat))

## Log-transform price and carat

diamonds <- diamonds %>% 
  mutate(log_carat = log(carat),
         log_price = log(price))

diamonds %>% 
  ggplot(mapping = aes(x=log_carat,y=log_price))+
  geom_point(aes(color=clarity))+
  ggtitle("Visualization of log(carat) vs.log(price) on distribution of clarity")

diamonds %>% 
  ggplot(mapping = aes(x=log_carat,y=log_price))+
  geom_point(aes(color=cut))+
  ggtitle("Visualization of log(carat) vs.log(price) on distribution of cut")

diamonds %>% 
  ggplot(mapping = aes(x=log_carat,y=log_price))+
  geom_point(aes(color=color))+
  ggtitle("Visualization of log(carat) vs.log(price) on distribution of color")

The log transformation has made the pattern linear in all three cut,clarity and color.

DF2 = diamonds[,c("price","log_price","carat","log_carat")]
ggpairs(DF2)

The correlation between log transformations of price and carat are higher(0.956) compared to price and carat(0.922)

Linear Regression

We use lm() to build the regression model and compare the model without log transformation and one with log transformation and perform feature selection using stepAIC() on our preferred model.

mod1 <- lm(price ~ carat+cut+clarity+depth+color+table+x+y+z,data = diamonds)
summary(mod1)

## 
## Call:
## lm(formula = price ~ carat + cut + clarity + depth + color + 
##     table + x + y + z, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -21376.0   -592.4   -183.5    376.4  10694.2 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  5753.762    396.630   14.507  < 2e-16 ***
## carat       11256.978     48.628  231.494  < 2e-16 ***
## cut.L         584.457     22.478   26.001  < 2e-16 ***
## cut.Q        -301.908     17.994  -16.778  < 2e-16 ***
## cut.C         148.035     15.483    9.561  < 2e-16 ***
## cut^4         -20.794     12.377   -1.680  0.09294 .  
## clarity.L    4097.431     30.259  135.414  < 2e-16 ***
## clarity.Q   -1925.004     28.227  -68.197  < 2e-16 ***
## clarity.C     982.205     24.152   40.668  < 2e-16 ***
## clarity^4    -364.918     19.285  -18.922  < 2e-16 ***
## clarity^5     233.563     15.752   14.828  < 2e-16 ***
## clarity^6       6.883     13.715    0.502  0.61575    
## clarity^7      90.640     12.103    7.489 7.06e-14 ***
## depth         -63.806      4.535  -14.071  < 2e-16 ***
## color.L     -1952.160     17.342 -112.570  < 2e-16 ***
## color.Q      -672.054     15.777  -42.597  < 2e-16 ***
## color.C      -165.283     14.725  -11.225  < 2e-16 ***
## color^4        38.195     13.527    2.824  0.00475 ** 
## color^5       -95.793     12.776   -7.498 6.59e-14 ***
## color^6       -48.466     11.614   -4.173 3.01e-05 ***
## table         -26.474      2.912   -9.092  < 2e-16 ***
## x           -1008.261     32.898  -30.648  < 2e-16 ***
## y               9.609     19.333    0.497  0.61918    
## z             -50.119     33.486   -1.497  0.13448    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1130 on 53916 degrees of freedom
## Multiple R-squared:  0.9198, Adjusted R-squared:  0.9198 
## F-statistic: 2.688e+04 on 23 and 53916 DF,  p-value: < 2.2e-16

From the p-values we can tell that the predictors (except clarity 6,cut4 ,y and z) and the intercept are highly significant because of their low p-values (<0.05).R-squared measures how well the model fits the data.In this case, 92% of variation can be explained by the predictors.

mod2 <- lm(log_price ~ log_carat+cut+clarity+color+depth+table+x+y+z,data = diamonds)
summary(mod2)

## 
## Call:
## lm(formula = log_price ~ log_carat + cut + clarity + color + 
##     depth + table + x + y + z, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05041 -0.08575 -0.00009  0.08301  1.93916 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  7.983e+00  5.894e-02  135.439  < 2e-16 ***
## log_carat    1.771e+00  7.870e-03  225.022  < 2e-16 ***
## cut.L        1.174e-01  2.657e-03   44.188  < 2e-16 ***
## cut.Q       -3.445e-02  2.126e-03  -16.204  < 2e-16 ***
## cut.C        1.485e-02  1.832e-03    8.104 5.44e-16 ***
## cut^4       -8.611e-04  1.463e-03   -0.588   0.5562    
## clarity.L    9.165e-01  3.578e-03  256.155  < 2e-16 ***
## clarity.Q   -2.459e-01  3.330e-03  -73.839  < 2e-16 ***
## clarity.C    1.339e-01  2.852e-03   46.947  < 2e-16 ***
## clarity^4   -6.619e-02  2.279e-03  -29.046  < 2e-16 ***
## clarity^5    2.733e-02  1.861e-03   14.681  < 2e-16 ***
## clarity^6   -1.721e-03  1.621e-03   -1.062   0.2882    
## clarity^7    3.295e-02  1.430e-03   23.044  < 2e-16 ***
## color.L     -4.429e-01  2.040e-03 -217.121  < 2e-16 ***
## color.Q     -9.705e-02  1.862e-03  -52.120  < 2e-16 ***
## color.C     -1.453e-02  1.740e-03   -8.353  < 2e-16 ***
## color^4      1.248e-02  1.599e-03    7.809 5.89e-15 ***
## color^5     -2.492e-03  1.510e-03   -1.650   0.0989 .  
## color^6      2.368e-03  1.372e-03    1.725   0.0845 .  
## depth        1.362e-03  5.536e-04    2.460   0.0139 *  
## table        7.028e-05  3.455e-04    0.203   0.8388    
## x            5.685e-02  4.979e-03   11.416  < 2e-16 ***
## y           -1.487e-03  2.286e-03   -0.651   0.5154    
## z            6.624e-03  3.960e-03    1.673   0.0943 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1335 on 53916 degrees of freedom
## Multiple R-squared:  0.9827, Adjusted R-squared:  0.9827 
## F-statistic: 1.33e+05 on 23 and 53916 DF,  p-value: < 2.2e-16

mod 2’s intercept and predictor variables(except cut4,clarity6,color 5 and 6,table,z and y)are highly significant. But model 2 has a higher R-squared(0.9827) compared to mod 1.This could mean that model 2 is a much better model compared to model 1. We will therefore carry model 2 for feature selection.

Checking performance of both models

The plot() returns a model performance plot which is usually 4 plots:

Plot 1(Linearity) - analyzes the linearity of the residuals versus the fitted values.The reference line ought to be horizontal.
Plot 2(Normality of residuals)- displays the qqplot ,the dots should fall along the line.
Plot 3(Homogeneity of Variance)- also measures the linearity of of the residuals vs the fitted values.
Plot 4(Influential Observations)- analyzes high leverage points which are points which deviate from the average The points should be within the contour lines.

plot(mod1)

plot(mod2)

Model 2 follows the proper guidelines of a better model using the 4 plots compared to model 1.

Feature Selection

stepAIC is one of the common methods of feature selection. It minimizes or increases the AIC value to come up with the best final set of features.Basically, stepAIC simplifies the model without impacting much on the performance.stepAIC also removes multicollinearity if it exists. Direction takes the following values:

“forward” -for forward selection.
“backward” - for backward selection.
“both” - for both forward and backward selection.

step <- stepAIC(mod2,direction = "backward")

## Start:  AIC=-217176.9
## log_price ~ log_carat + cut + clarity + color + depth + table + 
##     x + y + z
## 
##             Df Sum of Sq     RSS     AIC
## - table      1      0.00  961.48 -217179
## - y          1      0.01  961.48 -217178
## <none>                    961.48 -217177
## - z          1      0.05  961.53 -217176
## - depth      1      0.11  961.58 -217173
## - x          1      2.32  963.80 -217049
## - cut        4     39.94 1001.42 -214989
## - color      6    893.70 1855.17 -181736
## - log_carat  1    902.96 1864.44 -181457
## - clarity    7   1826.76 2788.24 -159761
## 
## Step:  AIC=-217178.9
## log_price ~ log_carat + cut + clarity + color + depth + x + y + 
##     z
## 
##             Df Sum of Sq     RSS     AIC
## - y          1      0.01  961.48 -217180
## <none>                    961.48 -217179
## - z          1      0.05  961.53 -217178
## - depth      1      0.12  961.59 -217174
## - x          1      2.33  963.80 -217051
## - cut        4     51.75 1013.23 -214359
## - color      6    893.94 1855.42 -181731
## - log_carat  1    911.99 1873.47 -181199
## - clarity    7   1827.70 2789.18 -159745
## 
## Step:  AIC=-217180.4
## log_price ~ log_carat + cut + clarity + color + depth + x + z
## 
##             Df Sum of Sq     RSS     AIC
## <none>                    961.48 -217180
## - z          1      0.05  961.53 -217180
## - depth      1      0.12  961.61 -217175
## - x          1      2.55  964.04 -217039
## - cut        4     51.75 1013.23 -214361
## - color      6    894.00 1855.49 -181731
## - log_carat  1    913.93 1875.42 -181145
## - clarity    7   1827.92 2789.40 -159743

In the final model, y has been eliminated from the model. Not much information would be lost if y would be removed from the model.We run our final model without the y variable.

mod3 <- lm(log_price ~ log_carat + cut + clarity + color + depth + x + z,data = diamonds)
summary(mod3)

## 
## Call:
## lm(formula = log_price ~ log_carat + cut + clarity + color + 
##     depth + x + z, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05048 -0.08576 -0.00012  0.08302  1.93917 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  7.9877249  0.0471572  169.385  < 2e-16 ***
## log_carat    1.7708019  0.0078220  226.388  < 2e-16 ***
## cut.L        0.1171658  0.0024747   47.345  < 2e-16 ***
## cut.Q       -0.0344190  0.0021222  -16.218  < 2e-16 ***
## cut.C        0.0147143  0.0017976    8.186 2.77e-16 ***
## cut^4       -0.0009377  0.0014439   -0.649  0.51606    
## clarity.L    0.9164171  0.0035762  256.257  < 2e-16 ***
## clarity.Q   -0.2459017  0.0033303  -73.838  < 2e-16 ***
## clarity.C    0.1338797  0.0028517   46.948  < 2e-16 ***
## clarity^4   -0.0661832  0.0022788  -29.044  < 2e-16 ***
## clarity^5    0.0273095  0.0018610   14.675  < 2e-16 ***
## clarity^6   -0.0017179  0.0016205   -1.060  0.28912    
## clarity^7    0.0329533  0.0014300   23.045  < 2e-16 ***
## color.L     -0.4429488  0.0020398 -217.158  < 2e-16 ***
## color.Q     -0.0970424  0.0018616  -52.129  < 2e-16 ***
## color.C     -0.0145330  0.0017400   -8.352  < 2e-16 ***
## color^4      0.0124795  0.0015983    7.808 5.92e-15 ***
## color^5     -0.0024927  0.0015098   -1.651  0.09874 .  
## color^6      0.0023658  0.0013724    1.724  0.08473 .  
## depth        0.0013514  0.0005122    2.639  0.00832 ** 
## x            0.0556377  0.0046496   11.966  < 2e-16 ***
## z            0.0062773  0.0039256    1.599  0.10981    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1335 on 53918 degrees of freedom
## Multiple R-squared:  0.9827, Adjusted R-squared:  0.9827 
## F-statistic: 1.457e+05 on 21 and 53918 DF,  p-value: < 2.2e-16

The code to this R markdown is available on my Github profile https://github.com/TracyWhitneyAkinyi

I hope you enjoyed ;)

Data Analysis using Diamonds Data set

Tracy Whitney Akinyi

2022-06-29