Today I will start my analysis with the diamonds dataset from the ggplot2 package, a dataset containing the prices and other attributes of almost 54,000 diamonds.
The variables in the dataset are carat, cut, color, clarity, depth, table, price, x, y and z.
Sometimes it is hard to imagine (contrary to John Lennon’s thoughts) what exactly we are talking about just by looking at numbers, at least for me, so visualizations are always useful.
This data frame has 53,940 rows (yes, it is pretty huge) and 10 variables, as you can see here:
library(ggplot2)
data("diamonds")
dim(diamonds)
## [1] 53940 10
And the head of this dataset looks like this:
head(diamonds)
## # A tibble: 6 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.290 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
The size and the variety of variables are among the reasons I chose exactly this dataset. There are a lot of possibilities to explore using diamonds, and most of them might be quite advanced, but I won’t go that deep; instead I will stick to beginner’s code.
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
Let’s start by plotting the price of a diamond against its carat weight, two great variables from which we can draw some assumptions. I am creating a scatterplot of price (y) vs. carat weight (x), and limiting the x-axis and y-axis to omit the top 1% of values:
library(ggplot2)
data("diamonds")
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(fill = "#F79420", color = "black", shape = 21) +
  scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
  scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99))) +
  ggtitle("Diamonds: Price vs. Carat")
## Warning: Removed 926 rows containing missing values (geom_point).
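The warning above is produced by the axis limits, not by gaps in the data. A minimal sketch of how to check that, assuming ggplot2 is installed; the names `q_carat`, `q_price` and `outside` are introduced here only for illustration:

```r
library(ggplot2)
data("diamonds")

# 99th percentiles, the same upper axis limits used in the plot above
q_carat <- quantile(diamonds$carat, 0.99)
q_price <- quantile(diamonds$price, 0.99)

# Rows beyond either limit: scale limits treat these as out of bounds
outside <- diamonds$carat > q_carat | diamonds$price > q_price
sum(outside)

# Number of genuinely missing values in the two plotted columns
sum(is.na(diamonds$carat) | is.na(diamonds$price))
```

The first count should line up with the number of rows reported in the warning. If you only want to zoom in without dropping any rows, `coord_cartesian(xlim = ..., ylim = ...)` is the usual alternative to setting limits on the scales.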
Looking at this scatterplot we can already make some judgements. Firstly, the points are densest at low prices and low carat weights, which suggests that demand is largest for small, cheap diamonds. Secondly, as the carat weight grows, the values, so to say, diffuse: prices spread out more widely. My guess is that this is because there aren’t as many people in the world who can afford this kind of jewel. Also, diamonds in the carat range from 1.5 to 2.5 seem to be priced fairly similarly to each other; my assumption is that prices for larger diamonds don’t differ as much because demand for them seems quite small.
The plot itself is quite understandable, although it is dense at some points. Note that the rows R removed are not really missing: they are the points that fall outside the axis limits we set, which is what the warning is reporting. Even if we cannot derive information about individual diamonds from this plot, it gives an overall picture that is good enough to see whether the relationship is linear or not. The relationship is positive (as carat size increases, price also increases), but it is not linear: price climbs faster and faster as carat grows, and some smarter people would add that it looks exponential.
Still, let’s see what we can get out of this plot by adding a simple straight line:
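One rough way to probe the "looks exponential" idea: if price grew exponentially with carat, then log(price) would be roughly a straight line in carat. The sketch below fits both versions. The names `fit_linear`, `fit_exp` and `p_log` are made up for this example, and the two R-squared values are on different response scales, so treat the comparison as a hint rather than a verdict.

```r
library(ggplot2)
data("diamonds")

# Straight-line model on the raw price scale
fit_linear <- lm(price ~ carat, data = diamonds)

# Exponential-style model: a straight line for log(price)
fit_exp <- lm(log(price) ~ carat, data = diamonds)

# R-squared of each fit (different response scales, so only a rough guide)
c(linear = summary(fit_linear)$r.squared,
  exponential = summary(fit_exp)$r.squared)

# Visual check: an exponential relationship looks straight on a log y-axis
p_log <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  scale_y_log10() +
  ggtitle("Diamonds: Price (log scale) vs. Carat")
p_log
```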
ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(fill = "#F79420", color = "black", shape = 21) +
  stat_smooth(method = "lm") +
  scale_x_continuous(limits = c(0, quantile(diamonds$carat, 0.99))) +
  scale_y_continuous(limits = c(0, quantile(diamonds$price, 0.99))) +
  ggtitle("Diamonds: Price vs. Carat")
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).
As we can see, this really isn’t a great example of linear regression, because the residuals (errors) are quite large for some values; to simplify, the distance from an individual value to the line is too long. For the model to be more precise, the line should snake across the points a little bit. In other words, the model would be more useful if the line curved in certain parts.
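To look at those residuals directly, here is a small sketch that fits the same straight line with `lm()` on the full dataset (no axis limits) and plots each residual against its fitted value. `fit`, `res` and `diamonds_res` are names invented for this example.

```r
library(ggplot2)
data("diamonds")

# Same straight-line model that stat_smooth(method = "lm") draws
fit <- lm(price ~ carat, data = diamonds)

# Residual = observed price minus the price predicted by the line
res <- residuals(fit)
summary(res)

# Residuals vs. fitted values: a curved or fanning pattern hints that
# a straight line is not capturing the relationship well
diamonds_res <- data.frame(fitted = fitted(fit), residual = res)
ggplot(diamonds_res, aes(x = fitted, y = residual)) +
  geom_point(alpha = 0.1) +
  geom_hline(yintercept = 0, colour = "red")
```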
As we saw in the previous example, just looking at the price and carat relationship and putting a line through the middle doesn’t do much good, so I would offer another option. As we know, linear regression describes the relationship between two kinds of variables: one (or more) that is independent (x) and one that is dependent (y), so if the x value changes, the y value changes with it. The model is meant to help predict new values from the ones that are given: for each x it gives the mean value of y we should expect, give or take some error. The line is chosen so that the sum of those errors, squared, is as small as possible (it might seem confusing, but you do need to square them), and from it you can find regularities and useful precedents.
For now I will stop with the almost-theory from my side and show these words in action. In the plot below you can see the same relationship, carat and price, but with the difference that the points are coloured by cut and a separate line is fitted for each cut. From this we get even better information, because we can see the fitted mean price for each cut across the carat range. Even though the errors are still there, the comprehension of this dataset is obviously deeper.
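The "square the errors" part can be made concrete. For a single predictor, the least-squares slope is cov(x, y) / var(x) and the intercept is mean(y) - slope * mean(x). The sketch below computes both by hand and compares them with `lm()`; the helper `sse()` and the other variable names are assumptions made just for this illustration.

```r
library(ggplot2)
data("diamonds")

x <- diamonds$carat
y <- diamonds$price

# Closed-form least-squares estimates for one predictor
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# The same line as fitted by lm()
fit <- lm(price ~ carat, data = diamonds)
c(intercept = intercept, slope = slope)
coef(fit)

# Sum of squared errors for any candidate line a + b * x
sse <- function(a, b) sum((y - (a + b * x))^2)

# The fitted line should give a smaller value than a nudged slope
sse(intercept, slope)
sse(intercept, slope * 1.1)
```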
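predictor_var <- "carat"
color_var <- "cut"
# aes_string() is deprecated in current ggplot2; .data[[name]] looks a column up by its name
ggplot(diamonds, aes(x = .data[[predictor_var]], y = price, color = .data[[color_var]])) +
  geom_point() + geom_smooth(method = "lm") +
  labs(x = predictor_var, color = color_var)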
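When colour is mapped to cut, `geom_smooth(method = "lm")` fits one straight line per cut. To see the numbers behind those lines, a rough equivalent is to fit a separate `lm()` for each cut. This is a sketch; `by_cut` and `coefs` are names invented here.

```r
library(ggplot2)
data("diamonds")

# One data frame per cut level
by_cut <- split(diamonds, diamonds$cut)

# Intercept and slope of price ~ carat within each cut
coefs <- sapply(by_cut, function(d) coef(lm(price ~ carat, data = d)))

# One row per cut, columns for intercept and slope
round(t(coefs), 1)
```

The slope column tells you how many dollars the fitted price rises per extra carat within each cut.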
From this we can learn that different perspectives might give us a more complete outlook on the data, and this is exactly the reason why linear regression models are still popular (Francis Galton would be proud): they are rather simple and can show tendencies for even more than two variables (as in the case above).
Now I will kill two birds with one stone, and in a minute you will find out how. A decision tree comes from a more inductive style than the logical and mathematical regression. A decision tree is like a classification model and a prediction model combined (there, two birds), because it predicts a class.
Code that went wrong:
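library(ggplot2)
library(caret)       # provides createDataPartition() and train()
library(rpart)
library(rpart.plot)
data("diamonds")
# Hold out 20% of the rows for testing, stratified by cut;
# [, 1] turns the one-column index matrix into a plain vector
inTrain <- createDataPartition(y = diamonds$cut, p = 0.80, list = FALSE)[, 1]
trainingData <- diamonds[inTrain, ]
testingData <- diamonds[-inTrain, ]
# Fit a classification tree for cut using all other variables
model <- train(cut ~ ., data = trainingData, method = "rpart")
model
rpart.plot(model$finalModel)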
The particular decision tree, which might not be showing right now, tells the exact percentage of diamonds in the different classes, which are divided by the table and depth variables. Classification trees like this help to see an overall view of our data and solve many problems by dividing observations into groups. Of course, decision trees often don’t have the same level of predictive accuracy as other approaches, even the linear regression we discussed, and they are less robust: a small change in the data can lead to a very different conclusion.
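To get a feel for how well such a tree predicts, here is a minimal sketch that swaps in `rpart()` directly instead of going through caret, and uses a simple random 80/20 split rather than a stratified one. The seed and the names `idx`, `train_set`, `test_set`, `fit`, `pred` and `accuracy` are all assumptions for this example, and your numbers will depend on the split.

```r
library(ggplot2)
library(rpart)
data("diamonds")

set.seed(42)  # arbitrary seed, only so the split is repeatable

# Simple random 80/20 split (not stratified by cut)
idx <- sample(nrow(diamonds), size = floor(0.8 * nrow(diamonds)))
train_set <- diamonds[idx, ]
test_set <- diamonds[-idx, ]

# Classification tree predicting cut from all other variables
fit <- rpart(cut ~ ., data = train_set, method = "class")

# Predicted class for each held-out diamond
pred <- predict(fit, newdata = test_set, type = "class")

# Confusion table and overall share of correct predictions
table(predicted = as.character(pred), actual = as.character(test_set$cut))
accuracy <- mean(as.character(pred) == as.character(test_set$cut))
accuracy
```

Rerunning with a different seed is a quick way to see how much the result moves when the data changes a little.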
I thought a lot these days about prediction and classification models (yes, not about Christmas, but models) and my assumptions are quite blurry. The thing I am absolutely sure of is that the usefulness of a model depends on the dataset. For example, interestingly enough, R ships with a dataset called Titanic, which summarizes information about the passengers (by economic status (class), sex, age and survival) on the sadly famous ship Titanic. This dataset would be great for classification models, because we can get valid information about each variable and the survival rate among its groups, but to my mind a regression model would not be as useful, because we would be predicting dead data. Of course, it is interesting to model the relationship between age and survival rate, but will it bring any good? As long as we are not calculating the precise number of lifeboats, to my mind that is not significant. But, to sum up, model usefulness depends on:
That means that where one model might disappoint us, another can show us everything there is to know!
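For anyone who wants to poke at the Titanic example mentioned above, the built-in `Titanic` object is a four-way table of counts (Class, Sex, Age, Survived). A short sketch of the survival rate by class follows; `by_class` and `survival_rate` are names made up for this example.

```r
data("Titanic")

# Collapse the 4-way table to Class x Survived counts
by_class <- apply(Titanic, c(1, 4), sum)
by_class

# Share of people in each class who survived
survival_rate <- prop.table(by_class, margin = 1)[, "Yes"]
round(survival_rate, 2)
```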