Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line, while logistic and nonlinear regression models use a curved line. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.
Simple linear regression is used to estimate the relationship between two quantitative variables.
Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. These assumptions are:
- Homogeneity of variance (homoscedasticity): the size of the error in the prediction does not change significantly across the values of the independent variable.
- Independence of observations: the observations in the dataset were collected using statistically valid sampling methods, and there are no hidden relationships among observations.
- Normality: the data follows a normal distribution.
Linear regression makes one additional assumption:
- The relationship between the independent and dependent variable is linear: the line of best fit through the data points is a straight line (rather than a curve).
The formula for a simple linear regression is:
y = β0 + β1x + ε
where y is the predicted value of the dependent variable, β0 is the intercept, β1 is the regression coefficient (the slope of the line), x is the independent variable, and ε is the error of the estimate.
# Loading necessary libraries
library(rio) # used for importing and exporting
library(e1071) # used for skewness calculation
# Loading the dataset
studentDf <- import("student_scores.csv") # Read data from a file in working directory
head(studentDf)
## Hours Scores
## 1 2.5 21
## 2 5.1 47
## 3 3.2 27
## 4 8.5 75
## 5 3.5 30
## 6 1.5 20
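Before visualising the data, it can help to confirm its structure and basic summary statistics; a quick sketch using base R (assuming studentDf has been imported as above):
# Inspect the structure and summary statistics of the dataset
str(studentDf)     # variable types and a preview of the values
summary(studentDf) # min, quartiles, median, mean and max of each column
dim(studentDf)     # number of rows and columns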
Scatter plots can help visualize linear relationships between the response and predictor variables.
# using the stats package
scatter.smooth(studentDf$Hours, studentDf$Scores, main="Scores ~ Time",
xlab = "Time(Hrs)", ylab="Score(%)")
The scatter plot, along with the smoothing line, suggests a positive linear relationship between the number of hours spent studying and the score in percentage.
Boxplots are good for detecting outliers.
Generally, an outlier is any data point that lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.
## Box plot
par(mfrow=c(1,2)) # set the parameter mfrow to create layout with exactly 1 row and 2 columns
boxplot(studentDf$Hours, main="Time(Hrs)")
boxplot(studentDf$Scores, main="Score(%)")
Here, no potential outliers are found (possibly because the dataset is small).
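The 1.5 * IQR rule can also be applied numerically; a minimal sketch for the Hours variable (the same check can be repeated for Scores):
# Compute the 1.5 * IQR outlier cutoffs for Hours
q <- quantile(studentDf$Hours, probs = c(0.25, 0.75)) # first and third quartiles
iqrHours <- IQR(studentDf$Hours) # interquartile range
lowerCutoff <- q[1] - 1.5 * iqrHours
upperCutoff <- q[2] + 1.5 * iqrHours
studentDf$Hours[studentDf$Hours < lowerCutoff | studentDf$Hours > upperCutoff] # values flagged as outliers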
You can use a density plot to check whether a variable is close to a normal distribution.
par(mfrow=c(1,2)) # set the parameter mfrow to create layout with exactly 1 row and 2 columns
# plotting density of hours with skewness below
plot(density(studentDf$Hours), main="Density of Time(Hrs)", ylab="Density",
sub=paste("Skewness:", round(skewness(studentDf$Hours),2)))
# filling the above plot with color
polygon(density(studentDf$Hours), col="red")
# plotting density of scores with skewness below
plot(density(studentDf$Scores), main="Density of Scores(%)", ylab="Density",
sub=paste("Skewness:", round(skewness(studentDf$Scores),2)))
# filling the above plot with color
polygon(density(studentDf$Scores), col="red")
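As a complement to the density plots, a formal normality check such as the Shapiro-Wilk test can be used; a minimal sketch (a p-value above 0.05 suggests no strong evidence against normality):
# Shapiro-Wilk normality test for both variables
shapiro.test(studentDf$Hours)
shapiro.test(studentDf$Scores)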
Correlation analysis studies the strength of the relationship between two continuous variables.
- It involves computing the correlation coefficient between the two variables.
- Correlation can take values between -1 and +1, where a positive value indicates an increasing trend and a negative value a decreasing trend.
cor(studentDf$Hours, studentDf$Scores) # calculate correlation between Time and Scores
## [1] 0.9761907
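You can also test whether this correlation is statistically significant; a minimal sketch using cor.test() from the stats package:
# Test whether the correlation differs significantly from zero
cor.test(studentDf$Hours, studentDf$Scores)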
# Create Training and Test data
set.seed(100) # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(studentDf), 0.8*nrow(studentDf)) # row indices for training data
trainDf <- studentDf[trainingRowIndex, ] # training data
testDf <- studentDf[-trainingRowIndex, ] # test data
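A quick sanity check on the split sizes (roughly 80% of the rows should land in the training set):
nrow(trainDf) # number of training rows (about 80%)
nrow(testDf)  # number of test rows (the remaining 20%)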
# Build and summarise the model on training data
mod <- lm(Hours ~ Scores, data=trainDf) # build the linear model (Hours regressed on Scores)
summary(mod) # model summary
##
## Call:
## lm(formula = Hours ~ Scores, data = trainDf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8112 -0.4367 -0.2106 0.4912 1.0939
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.082165 0.309151 -0.266 0.793
## Scores 0.099843 0.005659 17.643 8.29e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5656 on 18 degrees of freedom
## Multiple R-squared: 0.9453, Adjusted R-squared: 0.9423
## F-statistic: 311.3 on 1 and 18 DF, p-value: 8.29e-13
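The fitted coefficients can be extracted to write out the estimated regression equation, and base R's diagnostic plots help check the linearity, normality and constant-variance assumptions; a minimal sketch:
coef(mod)    # intercept and slope of the fitted line
confint(mod) # 95% confidence intervals for the coefficients
par(mfrow=c(2,2)) # 2 x 2 layout for the four diagnostic plots
plot(mod)    # residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage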
# Predict the test data
yPred <- predict(mod, testDf) # generate predictions for the test data
From the model summary, both the model p-value and the predictor's p-value are less than the significance level (0.05), so you have a statistically significant model. The R-squared and adjusted R-squared are also comparable to those of a model built on the full data.
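These quantities can also be pulled out of the summary object directly; a minimal sketch:
modSummary <- summary(mod)
modSummary$coefficients[, "Pr(>|t|)"] # p-values of the intercept and the predictor
modSummary$r.squared # R-squared
modSummary$adj.r.squared # adjusted R-squared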
AIC(mod) # Calculate Akaike information criterion
## [1] 37.85384
BIC(mod) # Calculate Bayesian information criterion
## [1] 40.84104
predDf <- data.frame(cbind(actuals=testDf$Scores, predicteds=yPred)) # data frame of actual and predicted values
head(predDf) # printing the first few rows
## actuals predicteds
## 1 21 2.014535
## 9 81 8.005104
## 11 85 8.404475
## 15 17 1.615163
## 25 86 8.504318
cor(predDf) # correlation between actuals and predicteds
## actuals predicteds
## actuals 1 1
## predicteds 1 1
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(predDf, 1, min) / apply(predDf, 1, max))
min_max_accuracy
## [1] 0.09750637
# Mean Absolute Percentage Error Calculation
mape <- mean(abs((predDf$predicteds - predDf$actuals))/predDf$actuals)
mape
## [1] 0.9024936
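Other common error measures, such as the root mean squared error (RMSE) and mean absolute error (MAE), can be computed from the same data frame; a minimal sketch:
# Root mean squared error and mean absolute error of the predictions
rmse <- sqrt(mean((predDf$predicteds - predDf$actuals)^2))
mae <- mean(abs(predDf$predicteds - predDf$actuals))
rmse
mae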
predict(mod, data.frame(Scores=c(9.25))) # prediction for a new value of the predictor (Scores = 9.25)
## 1
## 0.8413815
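Finally, the fitted line can be drawn over the training data using base R graphics; a minimal sketch (the axes match the formula used in the model above, with Scores as the predictor and Hours as the response):
# Plot the training data and overlay the fitted regression line
plot(trainDf$Scores, trainDf$Hours, xlab="Score(%)", ylab="Time(Hrs)",
     main="Fitted regression line")
abline(mod, col="blue", lwd=2) # line uses the intercept and slope from mod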