This tutorial introduces R using an example data set of cars containing 50 observations and 3 variables: the brand of each car, the distance each car travels and the speed each car reaches. The goal is to fit a linear regression model that predicts the distance of a car from its speed.
In the first part of the tutorial we need to install the readr package, which is designed to read files such as .csv files. Some functions belong to specific packages, so more packages will be needed throughout this tutorial; each one is explained where it is used.
install.packages("readr")
install.packages("ggplot2")
install.packages("lawstat") # runs.test() function in modelling (run test)
install.packages("lmtest") # dwtest() function in modelling (durbin-watson test)
install.packages("car") # vif() function in modelling (9th assumption)
install.packages('DMwR') # regr.eval() function in prediction
install.packages("gridExtra") # grid.arrange() dunction in graphs with ggplot
The next step is to load each package with the library() function.
library(readr)
library(ggplot2)
library(DMwR)
library(lawstat)
library(lmtest)
library(car)
library(gridExtra)
Once the packages are loaded, we can import the data. In our case we are going to read the “cars.csv” file. Reading a data file creates a data frame, in which the data is organized into columns.
Cars <- read.csv("cars.csv")
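Since the readr package was installed above, the same file could also be read with its read_csv() function, which returns a tibble instead of a base data frame. This is shown here only as an optional alternative to the base-R read.csv() used in this tutorial:
# Cars <- read_csv("cars.csv") # optional readr alternative; returns a tibble rather than a base data frame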
Before starting to manipulate the data, it is highly recommended to get to know the data set we are handling. The following functions help to understand the data better and are also useful during preprocessing.
attributes(Cars) # Lists the attributes of the data set.
summary(Cars) #Prints the min, max, mean, median, and quartiles of each attribute.
str(Cars) #Displays the structure of your data set.
names(Cars) # Lists the column (attribute) names of the data set.
As we can see, the attribute names could be written in a simpler way. We can rename them with the next code line.
names(Cars) <- c("Brand", "Speed", "Distance") # the attribute name "name.of.car" is replaced by "Brand", and so on
The most important part of cleaning data is to know whether there are missing values in the data set. We can check for missing values with the summary() function mentioned before, or with the next code line.
is.na(Cars)
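Because is.na() returns a logical value for every single cell, it is often more convenient to count the missing values, for example per column. A minimal base-R sketch:
colSums(is.na(Cars)) # number of missing values in each column
sum(is.na(Cars)) # total number of missing values in the whole data set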
In this example there are no missing values, but if some existed, a very common treatment is to replace the NA values (missing data) with the mean, as follows.
Cars$Speed[is.na(Cars$Speed)] <- mean(Cars$Speed, na.rm = TRUE) # na.rm = TRUE tells mean() to remove NA values from the calculation
Apart from the functions mentioned before, there are other tools that help to understand the data by visualizing how it is distributed.
Plots can be built with the ggplot() function, which requires the ggplot2 package. To place plots side by side, the grid.arrange() function from the gridExtra package is used.
The first plot is a histogram, which is a graphical display of the distribution of continuous sample data.
hist_distance <- ggplot(Cars, aes(Distance)) + geom_histogram(color="black", binwidth = 5) # the first argument corresponds to the data set. aes = aesthetics, has x axis argument
hist_speed <- ggplot(Cars, aes(Speed)) + geom_histogram(color="black", binwidth=2) # color sets the color of the contour and binwidth the width of bars
grid.arrange(hist_distance, hist_speed, ncol = 2) # grid.arrange() places the plots side by side
Another commonly used plot is the scatter plot, which represents the values of two numeric variables and is used to observe relationships between them.
ggplot(Cars, aes(Speed , Distance)) + geom_point() # aes have x and y parameters in this graph type and geom_point() is for scatter plot
The last plot that is going to be studied is the boxplot. It is a standardized way of displaying the distribution of the data based on the 5-number summary (min, Q1, median, Q3 and max). It can reveal outliers, whether the data is symmetrical, how tightly the data is grouped, and whether and how the data is skewed.
par(mfrow=c(1, 2)) # set a 1-row, 2-column plot layout
boxplot(Cars$Speed, main="Speed") # box plot for 'speed'
boxplot(Cars$Distance, main="Distance") # box plot for 'distance'
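The five numbers summarized by the boxplot can also be printed directly with base R's fivenum() function:
fivenum(Cars$Distance) # minimum, lower hinge, median, upper hinge, maximum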
An outlier can be observed in the Distance variable. This leads to the next part, where conflicting data such as outliers are removed.
outlier <- boxplot(Cars$Distance, plot=FALSE)$out # Assign the outlier values into a vector
In this part, the data set has to be modified according to the analysis. As mentioned before, the previously observed outlier is going to be removed. The procedure is shown in the next code lines.
Cars[which(Cars$Distance %in% outlier),] # First, find the rows in which the outlier appears
Cars[-which(Cars$Distance %in% outlier),] # Print the data set without the outlier
Cars_dataprep <- Cars[-which(Cars$Distance %in% outlier),] # Create a new data set without the outlier
boxplot(Cars_dataprep$Distance)
From this point forward our data set is renamed “Cars_dataprep”.
Once the data is preprocessed, it is time to create training and testing sets for the linear regression model.
First the seed must be set. The seed is a number that you choose as the starting point for generating a sequence of random numbers. It also helps others reproduce your results.
set.seed(123)
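As a quick illustration (not part of the original workflow), resetting the seed before a call to sample() reproduces exactly the same "random" draw; the last line sets the seed again so the split below is unaffected:
set.seed(123); sample(1:10, 3) # first draw of three numbers
set.seed(123); sample(1:10, 3) # same seed, identical draw
set.seed(123) # reset the seed once more before continuing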
The next step is to split the data into two sets, a training set and a testing set. A common split is 70/30, which means that 70% of the data goes to the training set and 30% to the test set. In this example the 70/30 split is used, but another common split is 80/20. The next two lines set the size of each set.
trainSize <- round(nrow(Cars_dataprep)*0.7) # nrow() --> returns the number of rows in the data set
testSize <- nrow(Cars_dataprep)-trainSize
Once the sizes are set, the training and testing sets must be created. We also want the training rows to be selected in a randomized order, which helps produce a reliable model.
training_indices <- sample(seq_len(nrow(Cars_dataprep)), size = trainSize) # training_indices --> randomly selects the rows for the training set
testing_indices <- setdiff(seq_len(nrow(Cars_dataprep)), training_indices) # testing_indices --> the remaining rows form the testing set, so the two sets do not overlap
# seq_len() --> creates a sequence that starts at 1 and with steps of 1 finishes at the number value
trainSet <- Cars_dataprep[training_indices,]
testSet <- Cars_dataprep[testing_indices,]
Now it is time to run the data through the linear modelling algorithm. A linear regression model is helpful when trying to discover the relationship between two variables. These two variables represent the X and Y within the linear equation. The X variable is the predictor, also known as the independent variable because it does not depend on other attributes when making predictions. Y is the response, also known as the dependent variable because its value depends on the other variables.
In our cars example, we are trying to predict Distance, so it is the dependent variable, whereas Speed is the independent variable.
CarsRegressionModel <- lm(Distance ~ Speed, trainSet)
summary(CarsRegressionModel)
##
## Call:
## lm(formula = Distance ~ Speed, data = trainSet)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9424 -3.6657 -0.2949 1.8836 9.9512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -30.4590 2.5955 -11.73 3.94e-13 ***
## Speed 4.7295 0.1622 29.16 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.538 on 32 degrees of freedom
## Multiple R-squared: 0.9637, Adjusted R-squared: 0.9626
## F-statistic: 850.1 on 1 and 32 DF, p-value: < 2.2e-16
Building a linear regression model is only half of the work. In order to actually be usable in practice, the model should conform to the assumptions of linear regression.
The first assumption says that the regression model is linear in its parameters. This means that the equation must have the following form: Y = Intercept + Estimate * X. It can be checked with the next code line.
print(CarsRegressionModel) # y = -30.458 + 4.729 * X (x = Speed, y = Distance)
##
## Call:
## lm(formula = Distance ~ Speed, data = trainSet)
##
## Coefficients:
## (Intercept) Speed
## -30.459 4.729
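Linearity can also be inspected visually by overlaying the fitted regression line on the scatter plot. A minimal sketch using the ggplot2 package loaded earlier (an optional extra check):
ggplot(trainSet, aes(Speed, Distance)) + geom_point() + geom_smooth(method = "lm", se = FALSE) # the fitted line should follow the cloud of points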
The second assumption says that the mean of the residuals is zero. It can be checked like this.
mean(CarsRegressionModel$residuals)
## [1] 7.344513e-17
Since the mean of residuals is approximately zero, this assumption holds true for this model.
The third assumption checks the homoscedasticity of the residuals, i.e. equal variance. For that, four diagnostic plots are built with the next lines.
par(mfrow=c(2,2))
plot(CarsRegressionModel) # plot() on a fitted lm object produces the four diagnostic plots
In the first plot (top-left), as the fitted values along the X axis increase, the residuals decrease and then increase. This pattern is indicated by the red line, which should be approximately flat if the disturbances are homoscedastic. The plot on the bottom left also checks this, and is more convenient because the disturbance term on the Y axis is standardized.
In this case, a definite pattern is noticeable, so there is heteroscedasticity. Let's check this on a different model.
par(mfrow=c(2,2))
CarsRegressionModel_2 <- lm(Distance ~ Speed, data=Cars[1:20,]) # linear model fitted on the first 20 rows only
plot(CarsRegressionModel_2)
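Visual inspection can be complemented with a formal test of homoscedasticity, for example the Breusch-Pagan test provided by the bptest() function of the lmtest package loaded above (an optional extra check, not part of the original workflow):
bptest(CarsRegressionModel) # null hypothesis: the residuals have constant variance; a low p-value indicates heteroscedasticity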
The fourth assumption says that there must be no autocorrelation of the residuals. There are three different ways to check this; the first is the autocorrelation (acf) plot.
acf(CarsRegressionModel$residuals)
The X axis corresponds to the lags of the residuals, increasing in steps of 1. If the residuals are not autocorrelated, the correlation (Y axis) drops to a near-zero value below the dashed blue line (significance level) from the very first lag onwards. Here it is clearly observed that the residuals are not autocorrelated.
The second way is the runs test. For the runs.test() function, the “lawstat” package must be installed.
runs.test(CarsRegressionModel$residuals)
##
## Runs Test - Two sided
##
## data: CarsRegressionModel$residuals
## Standardized Runs Statistic = -0.34832, p-value = 0.7276
This gives a p-value of 0.7276, so we cannot reject the null hypothesis that the residuals are random. Therefore we can safely assume that the residuals are not autocorrelated.
The third way is the Durbin-Watson test. For the dwtest() function, the “lmtest” package must be installed.
dwtest(CarsRegressionModel)
##
## Durbin-Watson test
##
## data: CarsRegressionModel
## DW = 2.2498, p-value = 0.7797
## alternative hypothesis: true autocorrelation is greater than 0
With a high p value of 0.7797, we cannot reject the null hypothesis that true autocorrelation is zero. So the assumption that residuals should not be autocorrelated is satisfied by this model.
The fifth assumption says that the X variables and the residuals must be uncorrelated. To check this, a correlation test between the X variable and the residuals is performed.
cor.test(trainSet$Speed, CarsRegressionModel$residuals)
##
## Pearson's product-moment correlation
##
## data: trainSet$Speed and CarsRegressionModel$residuals
## t = -3.6173e-16, df = 32, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3381657 0.3381657
## sample estimates:
## cor
## -6.394557e-17
The p-value is high, so the null hypothesis that the true correlation is 0 cannot be rejected. The assumption holds true for this model.
The sixth assumption says that the number of observations must be greater than the number of X variables. This can be observed directly by looking at the data: in our case there are 34 observations (trainSet) and 3 variables.
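It can also be confirmed directly in code:
dim(trainSet) # number of observations (rows) and variables (columns)
nrow(trainSet) > ncol(trainSet) # should be TRUE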
The seventh assumption says that the variance of the X values must be positive. This means that the X values in a given sample must not all be the same. To check this:
var(trainSet$Speed)
## [1] 23.71569
The eighth assumption says that the model must be correctly specified: if the Y and X variables have an inverse relationship, the model equation should be specified accordingly:
Y = B1 + B2 * (1/X)
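Such a specification is not needed for this data set, but for illustration, an inverse relationship could be fitted in R with the I() wrapper inside the model formula (a hypothetical example):
lm(Distance ~ I(1/Speed), data = trainSet) # models Distance as a linear function of 1/Speed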
The ninth assumption says that there must be no perfect multicollinearity, meaning that there is no perfect linear relationship between the explanatory variables. It can be checked using the Variance Inflation Factor (VIF): a high VIF means that the information in that variable is already explained by the other X variables in the model, i.e. the variable is redundant.
For the vif() function used here, the “car” package must be installed.
CarsRegressionModel_2 <- lm(Speed ~ ., data=Cars)
vif(CarsRegressionModel_2)
## GVIF Df GVIF^(1/(2*Df))
## Brand 1.287729 22 1.005764
## Distance 1.287729 1 1.134781
Generally, the VIF for an X variable should be less than 4. Since both values are below this threshold, there is no perfect multicollinearity.
The last assumption says that the residuals should be normally distributed. To check it, the Normal Q-Q plot is observed: if the points lie exactly on the line, the distribution is perfectly normal.
par(mfrow=c(2,2))
plot(CarsRegressionModel)
The qqnorm() plot is shown in the top-right.
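Normality of the residuals can also be checked with a formal test, for example the Shapiro-Wilk test available in base R (an optional complement to the Q-Q plot):
shapiro.test(CarsRegressionModel$residuals) # a high p-value means normality cannot be rejected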
The last step is to predict the cars' distances from their speed. To do this, we'll use the predict() function.
CarsPrediction <- predict(CarsRegressionModel,testSet)
Now it is possible to compare the predictions with the actual values of the distance variable to check the correlation accuracy.
actuals_preds <- data.frame(cbind(actuals=testSet$Distance, predicteds=CarsPrediction)) # make actuals_predicteds dataframe.
correlation_accuracy <- cor(actuals_preds)
correlation_accuracy # 98.5%
## actuals predicteds
## actuals 1.0000000 0.9852094
## predicteds 0.9852094 1.0000000
Finally, other prediction metrics can be calculated, such as MAE, MSE, RMSE and MAPE. After installing the “DMwR” package, the regr.eval() function can be used.
regr.eval(actuals_preds$actuals, actuals_preds$predicteds)
## mae mse rmse mape
## 3.1376201 15.6068266 3.9505476 0.1012628
# MAE: mean absolute error
# MSE: mean squared error
# RMSE: root mean squared error
# MAPE: mean absolute percentage error
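If the DMwR package cannot be installed (it has been archived from CRAN), the same metrics can be computed by hand. A minimal sketch using base R:
errors <- actuals_preds$actuals - actuals_preds$predicteds
mean(abs(errors)) # MAE
mean(errors^2) # MSE
sqrt(mean(errors^2)) # RMSE
mean(abs(errors) / actuals_preds$actuals) # MAPE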