To iterate the words of John Tukey , “The simple graph has brought more information to the data analyst’s mind than any other device”. This R Markdown document is an analysis output based on the data frame from ggplot. The data frame consists of 5390 rows and 10 variables. The variables are as follows:
First and foremost, we load the packages that we will use throughout our analysis. It is good practice to have all the libraries under one chunk.Since I will be using tidyverse() package the most,I will load it the last. I love tidyverse because it combines general packages that used for data wrangling,visualization and modelling.For instance, in stead of loading dplyr(),tidyr() and ggplot2() , you just load tidyverse() and you are set.
Then we load our dataset diamonds
data("diamonds")
descriptive summary
dlookr::describe(diamonds,quantiles=c(.25,.50,.75)) %>% flextable()
described_variables | n | na | mean | sd | se_mean | IQR | skewness | kurtosis | p25 | p50 | p75 |
carat | 53,940 | 0 | 0.7979397 | 0.4740112 | 0.002040954 | 0.64 | 1.11664592 | 1.2566353 | 0.40 | 0.70 | 1.04 |
depth | 53,940 | 0 | 61.7494049 | 1.4326213 | 0.006168448 | 1.50 | -0.08229403 | 5.7394146 | 61.00 | 61.80 | 62.50 |
table | 53,940 | 0 | 57.4571839 | 2.2344906 | 0.009621063 | 3.00 | 0.79689585 | 2.8018569 | 56.00 | 57.00 | 59.00 |
price | 53,940 | 0 | 3,932.7997219 | 3,989.4397381 | 17.177360816 | 4,374.25 | 1.61839528 | 2.1776958 | 950.00 | 2,401.00 | 5,324.25 |
x | 53,940 | 0 | 5.7311572 | 1.1217607 | 0.004829974 | 1.83 | 0.37867634 | -0.6181607 | 4.71 | 5.70 | 6.54 |
y | 53,940 | 0 | 5.7345260 | 1.1421347 | 0.004917698 | 1.82 | 2.43416672 | 91.2145572 | 4.72 | 5.71 | 6.54 |
z | 53,940 | 0 | 3.5387338 | 0.7056988 | 0.003038533 | 1.13 | 1.52242256 | 47.0866193 | 2.91 | 3.53 | 4.04 |
Some of the notable observations are as follows: * The median carat size is 0.7. * About 75% of diamonds have carat weights less than 1. * The median price for a diamonds 2401 dollars.
ggplot(diamonds)+
geom_bar(mapping = aes(x=clarity,color=clarity))+
ggtitle("Count of diamonds present with respect to Clarity")
Most diamonds are available with SI1 and SV2 clarity while least available with clarity I1, which has the worst clarity.
ggplot(data = diamonds)+
geom_boxplot(aes(x=reorder(clarity,price,median),
price,color=clarity))+
scale_x_discrete("clarity")+
coord_flip()+
ggtitle("Distribution of price and clarity")
The box plots have been reordered from the lowest median to the highest and we can observe that diamonds with IF clarity has the lowest price , which concludes that diamonds with quality clarity are cheaper on average.
diamonds %>%
ggplot(aes(x=cut,fill = cut))+
geom_bar()+
ggtitle("Count of diamonds present with respect to Cut")
The bar chart above shows that more diamonds are available with high quality(Ideal) cuts than with low quality cuts.
diamonds %>%
ggplot(aes(x=cut,y=price, color=cut)) +
geom_boxplot()+
ggtitle("Distribution of price and cut")
The box plot shows that the best quality diamonds are cheaper on average.
diamonds %>% ggplot(mapping=aes(x=cut,fill= color))+
geom_bar(position = "dodge")+
ggtitle("Count of diamonds present with respect to Cut and color")
Based on color,most diamonds have a color G or better i.e. G,F,E,D.
ggplot(data = diamonds)+
geom_boxplot(aes(x=reorder(color,price,median),
price,color=color))+
scale_x_discrete("color")+
coord_flip()+
ggtitle("Distribution of color and price")
Diamonds with better colors are much cheaper in average compared to diamonds with poor colors.
We have see the relationship between the quality of diamonds and their prices :low quality diamonds(poor cuts,poor clarity and bad colors) have higher prices and vice versa.
To observe the relationship between two continuous variables, scatter plots are best suited.In this case we are going to use price and carat,because low quality diamonds have higher prices because of carat.The weight of the diamond which is a very important factor for determining the price of the diamond.This makes carat a confounding variable.
diamonds %>%
ggplot(mapping = aes(x=carat,y=price))+
geom_point(aes(color=cut))+
ggtitle("Visualization of carat vs. Price on distribution of cut")
diamonds %>%
ggplot(mapping = aes(x=carat,y=price))+
geom_point(aes(color=clarity))+
ggtitle("Visualization of carat vs. Price on distribution of clarity")
diamonds %>%
ggplot(mapping = aes(x=carat,y=price))+
geom_point(aes(color=color))+
ggtitle("Visualization of carat vs. Price on distribution of color")
The three scatter plots are proof that lower quality diamonds tend to have a larger weight.
To visualize the correlation of each variables we use ggpairs() from GGally
DF = diamonds[,c("price","carat","x","y","z","depth","table")]
ggpairs(DF)
From the pairwise plot we can observe that: * There is a high correlation between price and cart. * Carat is highly correlated to x,y,z. which could be translated that the carat would definitely be affected by its dimensions. * Price with table and depths exhibit poor correlation.
The histogram shows that price and carat show better normality after log transformation.Hence we will log-transform price and carat in our model and keep on assessing our assumptions.
diamonds %>%
plot_normality(c(price,carat))
## Log-transform price and carat
diamonds <- diamonds %>%
mutate(log_carat = log(carat),
log_price = log(price))
diamonds %>%
ggplot(mapping = aes(x=log_carat,y=log_price))+
geom_point(aes(color=clarity))+
ggtitle("Visualization of log(carat) vs.log(price) on distribution of clarity")
diamonds %>%
ggplot(mapping = aes(x=log_carat,y=log_price))+
geom_point(aes(color=cut))+
ggtitle("Visualization of log(carat) vs.log(price) on distribution of cut")
diamonds %>%
ggplot(mapping = aes(x=log_carat,y=log_price))+
geom_point(aes(color=color))+
ggtitle("Visualization of log(carat) vs.log(price) on distribution of color")
The log transformation has made the pattern linear in all three cut,clarity and color.
DF2 = diamonds[,c("price","log_price","carat","log_carat")]
ggpairs(DF2)
The correlation between log transformations of price and carat are higher(0.956) compared to price and carat(0.922)
We use lm() to build the regression model and compare the model without log transformation and one with log transformation and perform feature selection using stepAIC() on our preferred model.
mod1 <- lm(price ~ carat+cut+clarity+depth+color+table+x+y+z,data = diamonds)
summary(mod1)
##
## Call:
## lm(formula = price ~ carat + cut + clarity + depth + color +
## table + x + y + z, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21376.0 -592.4 -183.5 376.4 10694.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5753.762 396.630 14.507 < 2e-16 ***
## carat 11256.978 48.628 231.494 < 2e-16 ***
## cut.L 584.457 22.478 26.001 < 2e-16 ***
## cut.Q -301.908 17.994 -16.778 < 2e-16 ***
## cut.C 148.035 15.483 9.561 < 2e-16 ***
## cut^4 -20.794 12.377 -1.680 0.09294 .
## clarity.L 4097.431 30.259 135.414 < 2e-16 ***
## clarity.Q -1925.004 28.227 -68.197 < 2e-16 ***
## clarity.C 982.205 24.152 40.668 < 2e-16 ***
## clarity^4 -364.918 19.285 -18.922 < 2e-16 ***
## clarity^5 233.563 15.752 14.828 < 2e-16 ***
## clarity^6 6.883 13.715 0.502 0.61575
## clarity^7 90.640 12.103 7.489 7.06e-14 ***
## depth -63.806 4.535 -14.071 < 2e-16 ***
## color.L -1952.160 17.342 -112.570 < 2e-16 ***
## color.Q -672.054 15.777 -42.597 < 2e-16 ***
## color.C -165.283 14.725 -11.225 < 2e-16 ***
## color^4 38.195 13.527 2.824 0.00475 **
## color^5 -95.793 12.776 -7.498 6.59e-14 ***
## color^6 -48.466 11.614 -4.173 3.01e-05 ***
## table -26.474 2.912 -9.092 < 2e-16 ***
## x -1008.261 32.898 -30.648 < 2e-16 ***
## y 9.609 19.333 0.497 0.61918
## z -50.119 33.486 -1.497 0.13448
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1130 on 53916 degrees of freedom
## Multiple R-squared: 0.9198, Adjusted R-squared: 0.9198
## F-statistic: 2.688e+04 on 23 and 53916 DF, p-value: < 2.2e-16
From the p-values we can tell that the predictors (except clarity 6,cut4 ,y and z) and the intercept are highly significant because of their low p-values (<0.05).R-squared measures how well the model fits the data.In this case, 92% of variation can be explained by the predictors.
mod2 <- lm(log_price ~ log_carat+cut+clarity+color+depth+table+x+y+z,data = diamonds)
summary(mod2)
##
## Call:
## lm(formula = log_price ~ log_carat + cut + clarity + color +
## depth + table + x + y + z, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.05041 -0.08575 -0.00009 0.08301 1.93916
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.983e+00 5.894e-02 135.439 < 2e-16 ***
## log_carat 1.771e+00 7.870e-03 225.022 < 2e-16 ***
## cut.L 1.174e-01 2.657e-03 44.188 < 2e-16 ***
## cut.Q -3.445e-02 2.126e-03 -16.204 < 2e-16 ***
## cut.C 1.485e-02 1.832e-03 8.104 5.44e-16 ***
## cut^4 -8.611e-04 1.463e-03 -0.588 0.5562
## clarity.L 9.165e-01 3.578e-03 256.155 < 2e-16 ***
## clarity.Q -2.459e-01 3.330e-03 -73.839 < 2e-16 ***
## clarity.C 1.339e-01 2.852e-03 46.947 < 2e-16 ***
## clarity^4 -6.619e-02 2.279e-03 -29.046 < 2e-16 ***
## clarity^5 2.733e-02 1.861e-03 14.681 < 2e-16 ***
## clarity^6 -1.721e-03 1.621e-03 -1.062 0.2882
## clarity^7 3.295e-02 1.430e-03 23.044 < 2e-16 ***
## color.L -4.429e-01 2.040e-03 -217.121 < 2e-16 ***
## color.Q -9.705e-02 1.862e-03 -52.120 < 2e-16 ***
## color.C -1.453e-02 1.740e-03 -8.353 < 2e-16 ***
## color^4 1.248e-02 1.599e-03 7.809 5.89e-15 ***
## color^5 -2.492e-03 1.510e-03 -1.650 0.0989 .
## color^6 2.368e-03 1.372e-03 1.725 0.0845 .
## depth 1.362e-03 5.536e-04 2.460 0.0139 *
## table 7.028e-05 3.455e-04 0.203 0.8388
## x 5.685e-02 4.979e-03 11.416 < 2e-16 ***
## y -1.487e-03 2.286e-03 -0.651 0.5154
## z 6.624e-03 3.960e-03 1.673 0.0943 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1335 on 53916 degrees of freedom
## Multiple R-squared: 0.9827, Adjusted R-squared: 0.9827
## F-statistic: 1.33e+05 on 23 and 53916 DF, p-value: < 2.2e-16
mod 2’s intercept and predictor variables(except cut4,clarity6,color 5 and 6,table,z and y)are highly significant. But model 2 has a higher R-squared(0.9827) compared to mod 1.This could mean that model 2 is a much better model compared to model 1. We will therefore carry model 2 for feature selection.
The plot() returns a model performance plot which is usually 4 plots:
plot(mod1)
plot(mod2)
Model 2 follows the proper guidelines of a better model using the 4 plots compared to model 1.
stepAIC is one of the common methods of feature selection. It minimizes or increases the AIC value to come up with the best final set of features.Basically, stepAIC simplifies the model without impacting much on the performance.stepAIC also removes multicollinearity if it exists. Direction takes the following values:
step <- stepAIC(mod2,direction = "backward")
## Start: AIC=-217176.9
## log_price ~ log_carat + cut + clarity + color + depth + table +
## x + y + z
##
## Df Sum of Sq RSS AIC
## - table 1 0.00 961.48 -217179
## - y 1 0.01 961.48 -217178
## <none> 961.48 -217177
## - z 1 0.05 961.53 -217176
## - depth 1 0.11 961.58 -217173
## - x 1 2.32 963.80 -217049
## - cut 4 39.94 1001.42 -214989
## - color 6 893.70 1855.17 -181736
## - log_carat 1 902.96 1864.44 -181457
## - clarity 7 1826.76 2788.24 -159761
##
## Step: AIC=-217178.9
## log_price ~ log_carat + cut + clarity + color + depth + x + y +
## z
##
## Df Sum of Sq RSS AIC
## - y 1 0.01 961.48 -217180
## <none> 961.48 -217179
## - z 1 0.05 961.53 -217178
## - depth 1 0.12 961.59 -217174
## - x 1 2.33 963.80 -217051
## - cut 4 51.75 1013.23 -214359
## - color 6 893.94 1855.42 -181731
## - log_carat 1 911.99 1873.47 -181199
## - clarity 7 1827.70 2789.18 -159745
##
## Step: AIC=-217180.4
## log_price ~ log_carat + cut + clarity + color + depth + x + z
##
## Df Sum of Sq RSS AIC
## <none> 961.48 -217180
## - z 1 0.05 961.53 -217180
## - depth 1 0.12 961.61 -217175
## - x 1 2.55 964.04 -217039
## - cut 4 51.75 1013.23 -214361
## - color 6 894.00 1855.49 -181731
## - log_carat 1 913.93 1875.42 -181145
## - clarity 7 1827.92 2789.40 -159743
In the final model, y has been eliminated from the model. Not much information would be lost if y would be removed from the model.We run our final model without the y variable.
mod3 <- lm(log_price ~ log_carat + cut + clarity + color + depth + x + z,data = diamonds)
summary(mod3)
##
## Call:
## lm(formula = log_price ~ log_carat + cut + clarity + color +
## depth + x + z, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.05048 -0.08576 -0.00012 0.08302 1.93917
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.9877249 0.0471572 169.385 < 2e-16 ***
## log_carat 1.7708019 0.0078220 226.388 < 2e-16 ***
## cut.L 0.1171658 0.0024747 47.345 < 2e-16 ***
## cut.Q -0.0344190 0.0021222 -16.218 < 2e-16 ***
## cut.C 0.0147143 0.0017976 8.186 2.77e-16 ***
## cut^4 -0.0009377 0.0014439 -0.649 0.51606
## clarity.L 0.9164171 0.0035762 256.257 < 2e-16 ***
## clarity.Q -0.2459017 0.0033303 -73.838 < 2e-16 ***
## clarity.C 0.1338797 0.0028517 46.948 < 2e-16 ***
## clarity^4 -0.0661832 0.0022788 -29.044 < 2e-16 ***
## clarity^5 0.0273095 0.0018610 14.675 < 2e-16 ***
## clarity^6 -0.0017179 0.0016205 -1.060 0.28912
## clarity^7 0.0329533 0.0014300 23.045 < 2e-16 ***
## color.L -0.4429488 0.0020398 -217.158 < 2e-16 ***
## color.Q -0.0970424 0.0018616 -52.129 < 2e-16 ***
## color.C -0.0145330 0.0017400 -8.352 < 2e-16 ***
## color^4 0.0124795 0.0015983 7.808 5.92e-15 ***
## color^5 -0.0024927 0.0015098 -1.651 0.09874 .
## color^6 0.0023658 0.0013724 1.724 0.08473 .
## depth 0.0013514 0.0005122 2.639 0.00832 **
## x 0.0556377 0.0046496 11.966 < 2e-16 ***
## z 0.0062773 0.0039256 1.599 0.10981
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1335 on 53918 degrees of freedom
## Multiple R-squared: 0.9827, Adjusted R-squared: 0.9827
## F-statistic: 1.457e+05 on 21 and 53918 DF, p-value: < 2.2e-16
The code to this R markdown is available on my Github profile https://github.com/TracyWhitneyAkinyi
I hope you enjoyed ;)