Getting started with R
This report consists of two data analyses: one of the cars dataset and one of the iris dataset. In the cars analysis, our main objective is to predict the braking distance of a car based on its speed, while in the iris analysis our main objective is to predict the petal length of the iris flower based on the petal width.
1. Study of the braking distance of cars depending on their speed:
In this part of the task, we are asked to predict the distance a car will travel when braking at a certain speed. To accomplish this, we are going to build a regression model based on the existing data. This dataset consists of 50 different cars (50 rows) with 3 different attributes (3 columns): the brand of the car, the braking distance and the speed.
1.1 Preprocessing:
First, we are going to preprocess and clean our data. We’ll also look for missing values and for outliers.
#We first import the packages that we need to do the task:
library(readr)
library(Metrics)
library(ggplot2)
library(markdown)
#We import our data and rename the columns:
cars <- read_csv("cars.csv")
str(cars)
Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 50 obs. of 3 variables:
$ name of car : chr "Ford" "Jeep" "Honda" "KIA" ...
$ speed of car : num 4 4 7 7 8 9 10 10 10 11 ...
$ distance of car: num 2 4 10 10 14 16 17 18 20 20 ...
- attr(*, "spec")=
.. cols(
.. `name of car` = col_character(),
.. `speed of car` = col_double(),
.. `distance of car` = col_double()
.. )
names(cars)<-c("name","speed","distance")
#We look for outliers in the attributes "distance" and "speed".
boxplot(cars[,c("distance","speed")], main = "Boxplot of distance and speed")
#There's one outlier in distance = 120. Let's remove this outlier.
outliers <- boxplot(cars$distance, plot = FALSE)$out
cars <- cars[-which(cars$distance %in% outliers),]
#Let's see the distribution of our variables.
hist(x = cars$distance, breaks = 50, main = "Histogram of Distance", xlab = "Distance")
#We check for missing values and summarise the cleaned data:
anyNA(cars)
[1] FALSE
summary(cars)
     name               speed         distance    
 Length:49          Min.   : 4.0   Min.   : 2.00  
 Class :character   1st Qu.:12.0   1st Qu.:26.00  
 Mode  :character   Median :15.0   Median :36.00  
                    Mean   :15.2   Mean   :41.41  
                    3rd Qu.:19.0   3rd Qu.:56.00  
                    Max.   :24.0   Max.   :93.00  
Fortunately, there were no missing values to treat. However, there is one outlier in the braking distance, corresponding to the “Dodge” car, which at a speed of 25 mph needed 120 feet to stop. We will remove this outlier to improve our model and increase its performance.
1.2 Modeling:
We will fit a linear regression on a training dataset and then make predictions on a test dataset. There is no feature selection, as we are asked to look at the relation between the braking distance of the car and its speed.
set.seed(100)
trainSize <- round(nrow(cars)*0.7)
testSize <- nrow(cars) - trainSize
training_indices <- sample(seq(nrow(cars)),size = trainSize)
trainSet <- cars[training_indices, ]
testSet <- cars[-training_indices, ]
#New variables
trainSet$speed <- trainSet$speed^2
testSet$speed <- testSet$speed^2
We will create the new variable Speed² to build our linear model. Basically, the theoretical braking distance can be found using Newton's laws and the equations of motion. We won't go into detail, but theoretically the braking distance of a car can be described as:
\[d = \frac{v^{2}}{2\mu g}\]
Where d is the braking distance, v is the speed, \(\mu\) is the coefficient of friction and g is the gravitational acceleration.
#We create our linear model:
LinearModel <- lm(distance ~ 0 + speed, trainSet)
summary(LinearModel)$coefficients
       Estimate   Std. Error  t value     Pr(>|t|)
speed 0.1592121  0.001561602 101.9544 7.859919e-43
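The fitted slope can be related back to the theoretical formula above. As a rough sketch (assuming speed in mph and distance in feet as labelled in the plots, g ≈ 32.174 ft/s², and ignoring driver reaction time; these assumptions are not part of the original analysis), the slope implies an effective friction coefficient:
#Minimal sketch (assumptions: speed in mph, distance in feet, no reaction time)
mph_to_fps <- 5280/3600                 #conversion factor from mph to ft/s
g <- 32.174                             #gravitational acceleration in ft/s^2
slope <- coef(LinearModel)[["speed"]]   #fitted slope of distance ~ speed^2
#From d = v^2/(2*mu*g) it follows that slope = mph_to_fps^2/(2*mu*g), so:
mu_effective <- mph_to_fps^2/(2*g*slope)
mu_effective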
#Here we plot our dataset and our training model:
ggplot(data = cars, aes(x = speed^2, y = distance, colour = name)) +
geom_abline(intercept = 0,
slope = LinearModel$coefficients[1], colour = "red") +
geom_jitter() +
ggtitle("Scatterplot of Speed² vs Distance") +
xlab("Speed² (mph²)") +
ylab("Distance (feet)")Our model underestimates those cars who are above the line, while overstimates those who are under the line. To sum up, we can see that most cars are overstimated by our model and this fact should be consider for future predictions on new datasets.
predictions <- predict(LinearModel, testSet)
##Evaluating errors
ae_cars <- ae(testSet$distance,predictions)
re_cars <- ae_cars/testSet$distance
rmse_cars <- rmse(testSet$distance,predictions)
mae_cars <- mae(testSet$distance,predictions)
cars_results <- as.data.frame(predictions)
cars_results$"Real distance" <- testSet$distance
cars_results$"Abs error" <- ae_cars
cars_results$"Rel error" <- re_cars
cars_results$"MAE" <- mae_cars
cars_results$"MRE" <- mean(cars_results$`Rel error`)
cars_results$"RMSE" <- rmse_cars
cars_results$speed <- testSet$speed
cars_results$name <- testSet$name
#Here we create a performance dataframe:
cars_performance <- as.data.frame(rmse_cars)
cars_performance$mae <- mae_cars
cars_performance$mre <- mean(cars_results$`Rel error`)
cars_performance$r <- summary(LinearModel)$r.squared
names(cars_performance) <- c("RMSE","MAE","MRE","r²")
cars_performance
      RMSE     MAE        MRE        r²
1 3.602141 2.97215 0.08460171 0.9968353
We can see that we have a good performance. Our model has an r² of 0.9968, which means that 99.68% of the variability in the data is explained by our model. The errors are really low: the mean relative error is about 8.5%, which is quite reasonable. Our RMSE is higher than our MAE, which means that our model makes bigger errors at high speeds. However, we are going to plot our errors against the real distance to get a better idea of the error distribution.
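Before looking at those plots, here is a toy illustration (not part of the original analysis) of why RMSE exceeding MAE points to a few larger errors: RMSE squares the errors before averaging, so it penalises large deviations more heavily.
#Two error vectors with the same MAE but different RMSE
errors_even   <- c(2, 2, 2, 2)   #all errors equal
errors_skewed <- c(0, 0, 0, 8)   #one large error
mean(abs(errors_even));   sqrt(mean(errors_even^2))    #MAE = 2, RMSE = 2
mean(abs(errors_skewed)); sqrt(mean(errors_skewed^2))  #MAE = 2, RMSE = 4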
ggplot(data = cars_results, aes(x = cars_results$`Real distance`, y = cars_results$`Rel error`, colour = name)) +
geom_jitter() +
ggtitle("Scatterplot of Relative error vs Real distance") +
xlab("Distance") +
ylab("Relative error")ggplot(data = cars_results, aes(x = cars_results$`Real distance`, y = cars_results$`Abs error`, colour = name)) +
geom_jitter() +
ggtitle("Scatterplot of Absolute error vs Real distance") +
xlab("Distance") +
ylab("Absolute error") When plotting the relative error against the real distance we can observe that our model has bigger errors in the shortest braking distances, that correspond to the cars “Ford” and "GM2. When plotting the absolute error versus the real distance we can observe a linear dependence, as it is expected. In general ,the bigger our braking distance, the bigger our absolute error.
2. Study of the petal length using the petal width:
The first step was to correct the mistakes in the code that was provided. After doing that, we are going to do a deeper analysis of the relationship between the petal length and the petal width.
2.1 Preprocessing:
The first steps are to preprocess and clean our data, as we did with the cars dataset.
#First, we import our data and rename the columns:
iris <- read_csv("iris.csv")
names(iris) <- c("#","sepal_length","sepal_width","petal_length",
"petal_width","species")
#Outliers
boxplot(iris$petal_length, main = "Petal Length Boxplot", ylab = "Petal Length")
#We can plot the histogram of this attribute to see its distribution.
hist(x = iris$petal_length, breaks = 50,
main = "Histogram of Petal Length", xlab = "Petal Length")[1] FALSE
# sepal_length sepal_width petal_length
Min. : 1.00 Min. :4.300 Min. :2.000 Min. :1.000
1st Qu.: 38.25 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
Median : 75.50 Median :5.800 Median :3.000 Median :4.350
Mean : 75.50 Mean :5.843 Mean :3.057 Mean :3.758
3rd Qu.:112.75 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
Max. :150.00 Max. :7.900 Max. :4.400 Max. :6.900
petal_width species
Min. :0.100 Length:150
1st Qu.:0.300 Class :character
Median :1.300 Mode :character
Mean :1.199
3rd Qu.:1.800
Max. :2.500
#Quick plot of Petal length vs Petal width
plot(x = iris$petal_width, y = iris$petal_length,
main = "Petal length vs petal width",
xlab = "Petal width",
ylab = "Petal length")We can observe that there’s a linear dependency between the petal length and the petal width. As it is expected, when the width increases, the length increases too.
2.2 Modeling:
We’ll also use a linear regression model for our predictions:
#Let's split the data (cross-validation)
set.seed(123)
iris_trainsize <- round(nrow(iris)*0.7)
iris_testsize <- nrow(iris) - iris_trainsize
iristrain_index <- sample(seq(nrow(iris)), size = iris_trainsize)
iris_trainset <- iris[iristrain_index, ]
iris_testset <- iris[-iristrain_index, ]
#Linear Regression Model
iris_linearmodel <- lm(formula = petal_length ~ petal_width, data = iris_trainset)
iris_linearmodel
Call:
lm(formula = petal_length ~ petal_width, data = iris_trainset)
Coefficients:
(Intercept) petal_width
1.027 2.276
ggplot(data = iris, aes(x = petal_width, y = petal_length)) +
geom_jitter(aes(colour = species)) +
geom_abline(intercept = iris_linearmodel$coefficients[1],
slope = iris_linearmodel$coefficients[2], colour = "red") +
ggtitle("Scatterplot of petal length vs petal width with the linear model") +
xlab("Petal width") +
ylab("Petal Length")Our model makes no distinctions between the species. This was expected, as we can observe that the setosa variety has the shortest petals, and the virginica has the longest ones.
#Predictions
iris_predictions <- predict(object = iris_linearmodel, newdata = iris_testset)
#Let's calculate and plot the errors
ae_iris <- ae(iris_testset$petal_length, iris_predictions)
re_iris <- ae_iris/iris_testset$petal_length
mre_iris <- mean(re_iris)
mae_iris <- mae(iris_testset$petal_length, iris_predictions)
rmse_iris <- rmse(iris_testset$petal_length, iris_predictions)
iris_results <- as.data.frame(iris_predictions)
iris_results$"Real petal length" <- iris_testset$petal_length
iris_results$"Absolute error" <- ae_iris
iris_results$"Relative error" <- re_iris
iris_results$"mae" <- mae_iris
iris_results$"Mean relative error" <- mre_iris
iris_results$"rmse" <- rmse_iris
iris_results$"petal width" <- iris_testset$petal_width
iris_results$species <- iris_testset$species
#Here we create a dataframe of the performance of our model:
iris_performance <- as.data.frame(x = rmse_iris)
iris_performance$MAE <- mae_iris
iris_performance$MRE <- mre_iris
iris_performance$"r²"<- summary(iris_linearmodel)$r.squared
iris_performance
  rmse_iris       MAE        MRE        r²
1 0.4583012 0.3333364 0.08464745 0.9270622
If we observe the performance of our model, we can see that we got a decent r² of 0.927, which means that our model can explain about 92.7% of the variability in the data. If we look at the errors, we can see that they are very low; however, this is expected, as the petal length and petal width values are themselves small (the largest are 6.9 and 2.5 respectively). Let's plot the errors against the real length to see the error distribution.
#We plot the relative errors vs the real petal length:
ggplot(data = iris_results, mapping = aes(x = iris_results$`Real petal length`, y = iris_results$`Relative error`, color = species)) +
geom_jitter(aes(x = iris_results$`Real petal length`,
y = iris_results$`Relative error`),) +
ggtitle("Scatterplot of the relative error vs real petal length ") +
xlab(label = "Real petal length") +
ylab("Relative error") #Plot of the absolute errors vs the real petal length:
ggplot(data = iris_results, mapping = aes(x = iris_results$`Real petal length`, y = iris_results$`Absolute error`, color = species) ) +
geom_jitter() +
ggtitle("Scatterplot of the absolute error vs real petal length ") +
xlab(label = "Real petal length") +
ylab("Absolute error")If we plot the relative error vs the real petal length we observe a decaying function, which is expected. We can also see that the biggest relative errors are for short petals, which means that our model has less error when predicting virginica and versicolor variety. Furthermore, if we plot the absolute error we can see a correlation between the absolute error and the real petal length, that’s to say, when the petal length increases so the absolute error does.
3. Conclusions
To sum up, this task was a great introduction to programming in R. We managed to carry out some data mining processes, such as data exploration and data cleansing. Furthermore, we also created two linear models and then explored their errors. We were also introduced to ggplot2 and R Markdown, which have been really useful for visualizing the plots and creating this report.
In my opinion, the linear models that we have built are pretty consistent, as they make small errors when predicting. However, the datasets were pretty limited. This is understandable, as the tasks that we were asked to accomplish were fairly simple and introductory. In upcoming tasks, we could consider larger datasets in order to find more relationships between the different variables and make more accurate predictions and better models.