Getting started with R

This report consists of two data analyses: one of the cars dataset and one of the iris dataset. In the cars analysis, our main objective is to predict the braking distance of a car from its speed; in the iris analysis, it is to predict the petal length of an iris flower from the width of that same petal.

1. Study of the braking distance of cars depending on their speed:

In this part of the task, we are asked to predict the distance a car will travel when braking from a given speed. To accomplish this, we are going to build a regression model based on the existing data. The dataset consists of 50 different cars (50 rows) with 3 attributes (3 columns): the brand of the car, the speed and the braking distance.

1.1 Preprocessing:

First, we are going to preprocess and clean our data. We’ll also look for missing values and for outliers.

Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame':    50 obs. of  3 variables:
 $ name of car    : chr  "Ford" "Jeep" "Honda" "KIA" ...
 $ speed of car   : num  4 4 7 7 8 9 10 10 10 11 ...
 $ distance of car: num  2 4 10 10 14 16 17 18 20 20 ...
 - attr(*, "spec")=
  .. cols(
  ..   `name of car` = col_character(),
  ..   `speed of car` = col_double(),
  ..   `distance of car` = col_double()
  .. )

[1] FALSE
     name               speed         distance    
 Length:49          Min.   : 4.0   Min.   : 2.00  
 Class :character   1st Qu.:12.0   1st Qu.:26.00  
 Mode  :character   Median :15.0   Median :36.00  
                    Mean   :15.2   Mean   :41.41  
                    3rd Qu.:19.0   3rd Qu.:56.00  
                    Max.   :24.0   Max.   :93.00  

Fortunately, there were no missing values, so nothing had to be imputed or dropped. However, there is one outlier in the braking distance, corresponding to the “Dodge” car, which needed 120 feet to stop from a speed of 25 mph. We remove this outlier so that it does not distort the fit, which leaves 49 cars.
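The checks described above can be sketched roughly as follows. The report's CSV file is not included here, so this sketch builds a small hypothetical data frame `cars_raw` with the same three column names; the values are illustrative only, and the boxplot rule used to flag the outlier is an assumption, not necessarily the criterion used in the report.

```r
# Hypothetical stand-in for the report's CSV (values are illustrative only)
cars_raw <- data.frame(
  `name of car`     = c("Ford", "Jeep", "Honda", "KIA", "Dodge"),
  `speed of car`    = c(4, 4, 7, 7, 25),
  `distance of car` = c(2, 4, 10, 10, 120),
  check.names = FALSE
)

# Shorter column names are easier to work with
names(cars_raw) <- c("name", "speed", "distance")

str(cars_raw)        # structure and column types
anyNA(cars_raw)      # TRUE if any value is missing
summary(cars_raw)    # summary statistics per column

# Assumed outlier rule: values beyond 1.5 * IQR from the hinges
outlier_values <- boxplot.stats(cars_raw$distance)$out
cars_clean <- cars_raw[!cars_raw$distance %in% outlier_values, ]
```

With these toy values, only the 120 ft observation falls outside the whiskers, so `cars_clean` keeps the other four rows.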

1.2 Modeling:

We fit a linear regression on a training dataset and then make predictions on a test dataset. There is no feature selection, as we are asked to look only at the relation between the braking distance of the car and its speed.
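A train/test split of this kind might look like the sketch below. It uses R's built-in `cars` data frame (columns `speed` and `dist`) as a stand-in for the report's own file, and the 70/30 ratio and the seed are assumptions, since the report does not state them.

```r
# Built-in `cars` stands in for the report's data (assumption)
set.seed(123)                                   # fixed seed for a reproducible split

n <- nrow(cars)
train_idx <- sample(n, size = round(0.7 * n))   # assumed 70 % of rows for training

cars_trainset <- cars[train_idx, ]              # rows used to fit the model
cars_testset  <- cars[-train_idx, ]             # held-out rows used for evaluation
```

Sampling row indices without replacement guarantees that no car appears in both sets.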

We create a new variable, speed², to use as the predictor in our linear model. The theoretical braking distance can be derived from Newton’s laws and the equations of motion. We won’t go into detail, but theoretically the braking distance of a car can be described as:

\[d = \frac{v^{2}}{2\mu g}\]

where d is the braking distance, v is the speed, \(\mu\) is the coefficient of friction and g is the gravitational acceleration.
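For reference, one short way to arrive at this formula is to equate the kinetic energy of the car with the work done by friction while stopping. This assumes a level road, a constant friction force \(\mu m g\) and a car of mass m (a symbol introduced here only for the derivation):

\[\frac{1}{2} m v^{2} = \mu m g \, d \quad\Rightarrow\quad d = \frac{v^{2}}{2\mu g}\]

The mass cancels, so under these assumptions the braking distance is proportional to v², which is why speed² is a natural predictor for a linear model.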

       Estimate  Std. Error  t value     Pr(>|t|)
speed 0.1592121 0.001561602 101.9544 7.859919e-43
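A fit of this form can be sketched as below. The single coefficient in the output above suggests a model without an intercept, but that is an inference rather than something the report states. The sketch uses the built-in `cars` data frame as a stand-in, and the names `cars_sq`, `speed_sq`, `fit` and `k` are illustrative.

```r
# Built-in `cars` stands in for the report's data (assumption)
cars_sq <- transform(cars, speed_sq = speed^2)   # new predictor: speed squared

# No-intercept model d = k * v^2, mirroring d = v^2 / (2 * mu * g)
fit <- lm(dist ~ 0 + speed_sq, data = cars_sq)

k <- coef(fit)[["speed_sq"]]   # estimated proportionality constant
summary(fit)$coefficients      # estimate, std. error, t value and p-value
```

Dropping the intercept encodes the physical expectation that a car at zero speed needs zero distance to stop.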

Our model underestimates the braking distance of the cars that lie above the fitted line and overestimates it for those below the line. Overall, most cars are overestimated by our model, and this should be taken into account for future predictions on new datasets.

      RMSE     MAE        MRE        r²
1 3.602141 2.97215 0.08460171 0.9968353

We can see that the model performs well. It has a squared correlation of 0.9968, which we read as the model explaining about 99.68% of the variation in braking distance. The errors are low: the mean relative error is about 8.5%. The RMSE is higher than the MAE, which suggests that some predictions have noticeably larger errors than the rest, possibly at high speeds. To get a better idea of the error distribution, we plot the errors against the real distance.
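The four metrics in the table can be computed along these lines. `regression_metrics` is a hypothetical helper written for this sketch, not a function from the report or from any package, and the input values are toy numbers chosen to make the arithmetic easy to check.

```r
# Hypothetical helper: error metrics for observed vs. predicted values
regression_metrics <- function(actual, predicted) {
  err <- predicted - actual
  data.frame(
    RMSE = sqrt(mean(err^2)),             # root mean squared error
    MAE  = mean(abs(err)),                # mean absolute error
    MRE  = mean(abs(err) / abs(actual)),  # mean relative error (actual must be non-zero)
    r2   = cor(actual, predicted)^2       # squared Pearson correlation
  )
}

# Toy values, not taken from the report
actual    <- c(10, 20, 40)
predicted <- c(12, 18, 44)
metrics <- regression_metrics(actual, predicted)
metrics
```

RMSE can never be smaller than MAE; the gap between them widens when a few errors are much larger than the rest, because squaring gives those errors more weight.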

When plotting the relative error against the real distance, we can observe that our model has its biggest errors for the shortest braking distances, which correspond to the cars “Ford” and “GM2”. When plotting the absolute error against the real distance, we can observe a roughly linear dependence, as expected: in general, the bigger the braking distance, the bigger the absolute error.

2. Study of the petal length using the petal’s width:

The first step was to correct the mistakes in the code that was provided. After that, we carry out a deeper analysis of the relationship between the petal length and the petal width.

2.1 Preprocessing:

The first steps are to preprocess and clean our data, as we did with the cars dataset.

[1] FALSE
       #           sepal_length    sepal_width     petal_length  
 Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000  
 1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600  
 Median : 75.50   Median :5.800   Median :3.000   Median :4.350  
 Mean   : 75.50   Mean   :5.843   Mean   :3.057   Mean   :3.758  
 3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100  
 Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900  
  petal_width      species         
 Min.   :0.100   Length:150        
 1st Qu.:0.300   Class :character  
 Median :1.300   Mode  :character  
 Mean   :1.199                     
 3rd Qu.:1.800                     
 Max.   :2.500                     

We can observe that there’s a linear dependency between the petal length and the petal width. As expected, when the width increases, the length increases too.
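One quick way to back up this visual impression with a number is the Pearson correlation between the two columns. The sketch below uses R's built-in `iris` data frame, whose columns are named `Petal.Length` and `Petal.Width` rather than the `petal_length` and `petal_width` of the report's file; treating the two datasets as equivalent is an assumption.

```r
# Built-in `iris` stands in for the report's CSV (assumption)
anyNA(iris)   # confirm there are no missing values before computing statistics

# Pearson correlation: values close to 1 indicate a strong positive linear relation
r <- cor(iris$Petal.Length, iris$Petal.Width)
r
```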

2.2 Modeling:

We’ll also use a linear regression model for our predictions:


Call:
lm(formula = petal_length ~ petal_width, data = iris_trainset)

Coefficients:
(Intercept)  petal_width  
      1.027        2.276  
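A call like the one above, together with predictions on the held-out rows, might be set up as follows. The sketch uses the built-in `iris` data frame (with its `Petal.Length` and `Petal.Width` column names); the 70/30 split and the seed are assumptions, so the coefficients will not match the output above exactly.

```r
# Built-in `iris` stands in for the report's CSV (assumption)
set.seed(42)                                    # fixed seed for a reproducible split
train_idx <- sample(nrow(iris), size = 105)     # assumed 70 % of the 150 rows

iris_trainset <- iris[train_idx, ]
iris_testset  <- iris[-train_idx, ]

# Simple linear regression of petal length on petal width
iris_fit <- lm(Petal.Length ~ Petal.Width, data = iris_trainset)
coef(iris_fit)                                  # intercept and slope

# Predicted petal lengths for the held-out flowers
iris_pred <- predict(iris_fit, newdata = iris_testset)
```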

Our model makes no distinction between the species, since species is not one of its predictors. Nevertheless, we can observe that the setosa variety has the shortest petals and the virginica has the longest ones.

  rmse_iris       MAE        MRE        r²
1 0.4583012 0.3333364 0.08464745 0.9270622

If we look at the performance of our model, we can see that we have a decent squared correlation of 0.927, which means that our model can explain about 93% of the variation in petal length. The errors are very low in absolute terms; however, this is expected, as the petal length and petal width are themselves small (their maxima are 6.9 and 2.5 respectively). Let’s plot the errors against the real length to see our error distribution.

If we plot the relative error against the real petal length, we observe a decaying function, which is expected. The biggest relative errors occur for short petals, which means that our model has less relative error when predicting the virginica and versicolor varieties. Furthermore, if we plot the absolute error, we can see a correlation between the absolute error and the real petal length; that is to say, when the petal length increases, so does the absolute error.
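The same comparison can be summarised numerically by averaging the errors per species. This is only a sketch: it uses the built-in `iris` data frame and, for brevity, fits and evaluates on all rows instead of a train/test split, so the numbers will differ from those in the report. The names `abs_err`, `rel_err` and `err_by_species` are illustrative.

```r
# Built-in `iris`, fitted on all rows for brevity (assumption)
fit <- lm(Petal.Length ~ Petal.Width, data = iris)

abs_err <- abs(iris$Petal.Length - fitted(fit))   # absolute error per flower
rel_err <- abs_err / iris$Petal.Length            # relative to the real length

# Mean relative and absolute error for each species
err_by_species <- aggregate(
  cbind(rel_err, abs_err) ~ Species,
  data = data.frame(Species = iris$Species, rel_err, abs_err),
  FUN  = mean
)
err_by_species
```

Dividing by the real length explains the decaying shape: the same absolute error weighs more on a short petal than on a long one.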

3. Conclusions

To sum up, this task was a great introduction to programming in R. We managed to carry out some data mining processes, such as data exploration and data cleansing. Furthermore, we created two linear models and explored their errors. We were also introduced to ggplot2 and RMarkdown, which have been really useful for visualizing the plots and creating this report.

In my opinion, the linear models that we have found are pretty consistent, as their prediction errors are small. However, the datasets were pretty limited. This is understandable, as the tasks that we were asked to accomplish were fairly simple and introductory. In upcoming tasks, we could consider larger datasets in order to find more relationships between the different variables and build better models with more accurate predictions.

Joël Ribera Zaragoza

July 4, 2019