Content of Dataset

Description

Today I will start my analysis with the diamonds dataset from the ggplot2 package. It contains the prices and other attributes of almost 54,000 diamonds.

The variables in the dataset are:

price - in US dollars ($326 to $18,823)
carat - weight of the diamond (0.2 to 5.01)
cut - quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color - diamond colour, from J (worst) to D (best)
clarity - a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
depth - total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43 to 79)
table - width of top of diamond relative to widest point (43 to 95)
x - length in mm (0 to 10.74)
y - width in mm (0 to 58.9)
z - depth in mm (0 to 31.8)
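
As a quick sanity check on the depth formula above, here is a minimal sketch that recomputes depth from x, y and z. It assumes that "percentage" means the ratio is multiplied by 100, and it skips the rows where a dimension is recorded as 0. The names d and depth_check are mine, not part of the dataset.

```r
library(ggplot2)  # provides the diamonds dataset

# Keep only rows with positive dimensions; some rows record 0 for x, y or z
d <- subset(diamonds, x > 0 & y > 0 & z > 0)

# Recompute depth as a percentage from the formula in the variable list
depth_check <- 100 * 2 * d$z / (d$x + d$y)

# Share of rows where the recomputed value is within 1 point of the recorded depth
mean(abs(depth_check - d$depth) < 1)
```

A share close to 1 would mean the recorded depth column agrees with the formula for almost every diamond.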

Sometimes it is hard to imagine (contrary to John Lennon’s thoughts) what exactly we are talking about just by looking at numbers, at least for me, so visualizations are always useful.

This data frame has 53,940 rows (yes, it is pretty huge) and 10 variables, as you can see here:

library(ggplot2)
data("diamonds")
dim(diamonds)

## [1] 53940    10

And the head of this dataset looks like this:

head(diamonds)

## # A tibble: 6 x 10
##   carat cut       color clarity depth table price     x     y     z
##   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
## 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
## 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
## 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
## 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
## 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48

The size and the variety of variables are among the reasons I chose this dataset. There are a lot of possibilities to explore with diamonds, and many of them are quite advanced, but I won’t go that deep; instead I will stick to beginner-level code.

Here’s a short summary of what the diamonds dataset contains:

summary(diamonds)

##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
##

Plotting DIAMONDS

Let’s start by plotting the price of a diamond against its carat weight - two great variables from which we can draw some assumptions. I am creating a scatterplot of price (y) vs. carat weight (x), and limiting the x-axis and y-axis to omit the top 1% of values:

library(ggplot2)
data("diamonds")

# Scatterplot of price against carat; the axis limits stop at the 99th percentile
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(fill = "#F79420", color = "black", shape = 21) +
  scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
  scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99))) +
  ggtitle("Diamonds: Price vs. Carat")

## Warning: Removed 926 rows containing missing values (geom_point).
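
The warning above can be checked by hand. This is a small sketch, assuming the removed rows are simply the points that lie beyond the axis limits; carat_cap, price_cap and outside are names I made up for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# The same 99th-percentile limits that are used for the axes
carat_cap <- quantile(diamonds$carat, 0.99)
price_cap <- quantile(diamonds$price, 0.99)

# TRUE for diamonds that lie beyond either limit
outside <- diamonds$carat > carat_cap | diamonds$price > price_cap

# Number of points that do not fit inside the plot area
sum(outside)

# Number of truly missing values in the two columns
sum(is.na(diamonds$carat) | is.na(diamonds$price))
```

If the first count is close to the number in the warning while the second is 0, the rows were dropped by the axis limits rather than by missing data.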

Information the plot gives

Looking at this scatterplot we can already form some kind of judgement. Firstly, the points are densest at low weights and low prices, which suggests that demand for small, cheap diamonds is large. Secondly, as the carat weight grows, the values, so to say, diffuse: the points spread out and become sparser. My guess is that this is because there aren’t many people in the world who can afford this kind of jewel. Also, diamonds in the carat range from 1.5 to 2.5 tend to be quite similar to each other; my assumption is that prices for larger diamonds don’t differ as much, because demand for them seems quite small.

Model

The plot itself seems quite understandable, although it is dense at some points. The warning about 926 removed rows does not mean that the data had missing values: R dropped the points that fall outside the axis limits, that is, the diamonds above the 99th percentile of carat or price that we asked to omit. Even if we cannot derive information about individual diamonds (values) from this plot, it gives an overall picture, and that is still good enough to see whether the relationship is linear or not. In this case the relationship is positive (as carat size increases, price also increases) but non-linear, because price climbs faster and faster as carat grows. Some smarter people would add that it looks exponential.

Still, let’s see what we can get out of this plot by adding a simple straight line:

# Same scatterplot, with a straight least-squares line added by method = "lm"
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(fill = "#F79420", color = "black", shape = 21) +
  stat_smooth(method = "lm") +
  scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
  scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99))) +
  ggtitle("Diamonds: Price vs. Carat")

## Warning: Removed 926 rows containing non-finite values (stat_smooth).

## Warning: Removed 926 rows containing missing values (geom_point).

## Warning: Removed 4 rows containing missing values (geom_smooth).

As we can see, this really isn’t a great example of linear regression, because the residuals (errors) are quite large for some values; to simplify, the distance from an individual value to the line is too long. For the model to be more precise, the line should snake across the plot a little bit; in other words, the model would be more useful if the line curved in certain parts.
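
One common way to let the line curve is to fit the regression on a log scale. The sketch below is my own assumption about what a better fit could look like, not something taken from the analysis above: it compares a straight-line fit with a fit of log(price) on log(carat). The names fit_linear and fit_loglog are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Straight-line fit on the original scale
fit_linear <- lm(price ~ carat, data = diamonds)

# Straight-line fit after taking logs of both variables;
# on the original scale this corresponds to a curved line
fit_loglog <- lm(log(price) ~ log(carat), data = diamonds)

# Share of variance explained by each fit (each on its own scale)
r2_linear <- summary(fit_linear)$r.squared
r2_loglog <- summary(fit_loglog)$r.squared
c(linear = r2_linear, loglog = r2_loglog)
```

The two R-squared values are measured on different scales, so treat the comparison as a rough hint rather than a formal test.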

Linear Regression model

As we saw in the previous example, just looking at the relationship between price and carat and putting a line through the middle wouldn’t do much good. So I would offer another option. As we know, linear regression describes the relationship between variables: one (or more) that is independent (x) and one that is dependent (y), so if the x value changes, the y value changes with it. The model is meant to help predict new values from those that are given: for each x it gives the mean y value we should expect, give or take some error. (It might seem confusing, but the line is chosen so that the sum of the squared errors is as small as possible.) With it you can find some regularities and useful precedents.

For now I will stop with the theory and show these words in action. In the plot below you can see the same relationship, carat and price, with one difference: the points are colored by cut, and a separate regression line is fitted for each cut. From this we get even better information, because we can see the expected price for each cut across the carat range. Even though the errors are still there, our understanding of this dataset is obviously deeper.
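
To make the "squared errors" idea concrete, here is a minimal sketch that fits the line with lm() and compares its sum of squared errors with that of a slightly different line. The helper sse() and the 10% steeper slope are my own illustrative choices.

```r
library(ggplot2)  # provides the diamonds dataset

# Fit price as a linear function of carat
fit <- lm(price ~ carat, data = diamonds)
b <- coef(fit)  # b[1] is the intercept, b[2] is the slope

# Sum of squared errors for a line with the given intercept and slope
sse <- function(intercept, slope) {
  predicted <- intercept + slope * diamonds$carat
  sum((diamonds$price - predicted)^2)
}

sse_fitted <- sse(b[1], b[2])         # the least-squares line
sse_steeper <- sse(b[1], b[2] * 1.1)  # same intercept, 10% steeper slope

# The fitted line should have the smaller sum of squared errors
sse_fitted < sse_steeper
```

Any other intercept or slope could be tried in place of the steeper line; the least-squares line is the one that no alternative straight line can beat on this measure.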

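predictor_var <- "carat"
color_var <- "cut"
# .data[[name]] selects a column by its name (aes_string() is deprecated)
ggplot(diamonds, aes(x = .data[[predictor_var]], y = price, color = .data[[color_var]])) +
  geom_point() + geom_smooth(method = "lm") +
  labs(x = predictor_var, color = color_var)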

From this we can learn that different perspectives might give us a more complete outlook on the data, and this is exactly the reason why linear regression models are still popular (Francis Galton would be proud): they are rather simple, and they can show tendencies for more than two variables (as in the case above).
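
The per-cut lines can also be read as numbers. This is a small sketch, assuming that a separate lm() fit within each cut is a fair stand-in for the colored lines in the plot; by_cut and coefs are names I introduced.

```r
library(ggplot2)  # provides the diamonds dataset

# Split the data by cut and fit price ~ carat separately within each group
by_cut <- split(diamonds, diamonds$cut)
coefs <- sapply(by_cut, function(d) coef(lm(price ~ carat, data = d)))

# One column per cut; rows hold the intercept and the slope for carat
round(coefs)
```

Comparing the slopes across the columns shows how much one extra carat adds to the expected price within each cut.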

Decision Tree

Now I will kill two birds with one stone, and in a minute you will find out how. A decision tree comes from a more inductive style than the logical and mathematical regression. A decision tree is like a classification model and a prediction model combined (there are the two birds), because what it predicts is a class.

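Code that went wrong when I first ran it. createDataPartition() and train() come from the caret package, which was never loaded in the first version:

library(ggplot2)
library(caret)  # provides createDataPartition() and train()
library(rpart)
library(rpart.plot)
data("diamonds")

# Keep 80% of the rows for training, stratified by cut
# [, 1] turns the one-column index matrix into a plain vector, which tibbles accept
inTrain <- createDataPartition(y = diamonds$cut, p = 0.80, list = FALSE)[, 1]
trainingData <- diamonds[inTrain, ]
testingData <- diamonds[-inTrain, ]

# Train a classification tree that predicts cut from all other variables
model <- train(cut ~ ., data = trainingData, method = "rpart")
model

rpart.plot(model$finalModel)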

This particular decision tree, which might not be showing right now, tells the percentage of diamonds in the different classes, which are divided by the table and depth variables. Classification trees like this help us get an overall view of our data and solve many problems by dividing the data into groups. Of course, decision trees often don’t reach the same level of predictive accuracy as other approaches, even the linear regression we discussed. They are also not very robust: a small change in the data can lead to a very different conclusion.
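
In case the tree above does not show, here is a minimal sketch that fits a classification tree with rpart directly, without train(), and checks it on held-out rows. The seed, the simple random 80/20 split (not stratified like createDataPartition()) and the names train_idx, train_set, test_set and tree are my assumptions.

```r
library(ggplot2)  # provides the diamonds dataset
library(rpart)

set.seed(42)  # arbitrary seed so the split can be repeated

# Simple random 80/20 split of the row numbers
n <- nrow(diamonds)
train_idx <- sample(n, size = round(0.8 * n))
train_set <- diamonds[train_idx, ]
test_set <- diamonds[-train_idx, ]

# Classification tree that predicts cut from all other variables
tree <- rpart(cut ~ ., data = train_set, method = "class")

# Predicted class for each held-out diamond
pred <- predict(tree, newdata = test_set, type = "class")

# Confusion table and share of correct predictions
table(predicted = as.character(pred), actual = as.character(test_set$cut))
accuracy <- mean(as.character(pred) == as.character(test_set$cut))
accuracy
```

Running it again with a different seed is an easy way to see how much the tree and its accuracy move when the training data changes a little.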

Conclusion

I thought a lot these days about prediction and classification models (yes, not about Christmas, but models) and my assumptions are quite blurry. The thing I am absolutely sure of is that the usefulness of a model depends on the dataset. For example, interestingly enough, R comes with a dataset called Titanic, which summarizes information about the passengers (economic status (class), sex, age and survival) of the sadly famous ship Titanic. This dataset would be great for classification models, because we can get valid information about each variable and the survival rate within it, but to my mind a regression model would not be as useful, because we would be predicting dead data. Of course, it is interesting to build a model of the relationship between age and survival rate - but will it bring any good? As long as we are not calculating the precise number of lifeboats, to my mind, that is not significant. But, to sum up, model usefulness depends on:

Data and the things we want to find out
Goal and aim of our analysis

That means that where one model might disappoint us, another can show us everything there is to know!
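
The Titanic example from the conclusion can be looked at in a few lines. This is only a sketch of the kind of per-group survival rate I mean, using the Titanic table from base R; by_class and rates are illustrative names.

```r
# Titanic is a 4-way table from base R: Class x Sex x Age x Survived
# Collapse it to counts by Class and Survived
by_class <- apply(Titanic, c(1, 4), sum)

# Turn the counts into row-wise proportions: survival rate within each class
rates <- prop.table(by_class, margin = 1)
round(rates, 2)
```

The same two steps work for Sex or Age by changing the first index in c(1, 4).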

THANK YOU FOR THIS COURSE!

Comparison of two different models in RStudio

Elina Loseva

20 December 2018