Regression models describe the relationship between variables by fitting a line to the observed data. Linear regression models use a straight line, while logistic and nonlinear regression models use a curved line. Regression allows you to estimate how a dependent variable changes as the independent variable(s) change.

 Simple linear regression is used to estimate the relationship between two quantitative variables.

Assumptions of simple linear regression

Simple linear regression is a parametric test, meaning that it makes certain assumptions about the data. These assumptions are:

- Homogeneity of variance (homoscedasticity): the size of the error in the prediction does not change significantly across the values of the independent variable.
- Independence of observations: the observations were collected using statistically valid sampling methods, and there are no hidden relationships among them.
- Normality: the data follows a normal distribution.

Linear regression makes one additional assumption:

- The relationship between the independent and dependent variable is linear: the line of best fit through the data points is a straight line.

A rough illustration of how these assumptions can be checked is shown after this list.
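
These checks are usually carried out on the residuals of a fitted model. As a sketch (not part of this analysis), the code below uses R's built-in cars dataset; the same calls apply to any lm fit.

# Illustrative assumption checks on a fitted lm object (built-in cars dataset)
fit <- lm(dist ~ speed, data = cars)
plot(fit, which = 1)          # residuals vs fitted: spread should be roughly constant (homoscedasticity)
plot(fit, which = 2)          # normal Q-Q plot: residuals should lie close to the line (normality)
shapiro.test(residuals(fit))  # formal test of residual normality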

Formula

The formula for a simple linear regression is:

\(y=\beta_{0} + \beta_{1}X + \varepsilon\)

where \(y\) is the predicted value of the dependent variable, \(\beta_{0}\) is the intercept, \(\beta_{1}\) is the regression coefficient (the expected change in \(y\) for a one-unit increase in \(X\)), \(X\) is the independent variable, and \(\varepsilon\) is the error of the estimate.

Loading the Data

# Loading necessary libraries
library(rio) # used for importing and exporting
library(e1071) # used for skewness calculation
# Loading the dataset
studentDf <- import("student_scores.csv") # Read data from a file in working directory 

head(studentDf)
##   Hours Scores
## 1   2.5     21
## 2   5.1     47
## 3   3.2     27
## 4   8.5     75
## 5   3.5     30
## 6   1.5     20
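
As a quick check of the formula above (a sketch, not part of the original analysis), the least-squares estimates \(\hat{\beta}_{1} = \mathrm{cov}(X, y)/\mathrm{var}(X)\) and \(\hat{\beta}_{0} = \bar{y} - \hat{\beta}_{1}\bar{X}\) can be computed directly from the data:

# Least-squares estimates for Scores as a function of Hours, computed by hand (equivalent to lm())
b1 <- cov(studentDf$Hours, studentDf$Scores) / var(studentDf$Hours)  # slope
b0 <- mean(studentDf$Scores) - b1 * mean(studentDf$Hours)            # intercept
c(intercept = b0, slope = b1)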

Visualizing the Data

Scatter Plot

 Scatter plots can help visualize linear relationships between the response and predictor variables.

# using the stats package 
scatter.smooth(studentDf$Hours, studentDf$Scores, main="Scores ~ Time", 
               xlab = "Time(Hrs)", ylab = "Score(%)")

The scatter plot, along with the smoothing line above, suggests a linear and positive relationship between the number of hours spent studying and the score in percentage.

Boxplot

 Boxplots are good for detecting outliers.

 Generally, an outlier is any data point that lies more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.

## Box plot
par(mfrow=c(1,2)) # set the parameter mfrow to create layout with exactly 1 row and 2 columns

boxplot(studentDf$Hours, main="Time(Hrs)")

boxplot(studentDf$Scores, main="Score(%)")

 Here, no potential outliers are found (perhaps because the dataset is small).
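
As a numeric cross-check (a sketch, not part of the original analysis), the 1.5 * IQR fences can also be computed directly; any value outside them would be flagged as a potential outlier:

# 1.5 * IQR fences for study hours
q <- quantile(studentDf$Hours, c(0.25, 0.75))
c(lower = q[[1]] - 1.5 * IQR(studentDf$Hours), upper = q[[2]] + 1.5 * IQR(studentDf$Hours))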

Density Plot

 You can use a density plot to check whether a variable is close to a normal distribution.

par(mfrow=c(1,2)) # set the parameter mfrow to create layout with exactly 1 row and 2 columns

# plotting density of hours with skewness below
plot(density(studentDf$Hours), main="Density of Time(Hrs)", ylab="Density", 
     sub=paste("Skewness:", round(skewness(studentDf$Hours),2)))
# filling the above plot with color
polygon(density(studentDf$Hours), col="red")

# plotting density of scores with skewness below
plot(density(studentDf$Scores), main="Density of Scores(%)", ylab="Density", 
     sub=paste("Skewness:", round(skewness(studentDf$Scores),2)))
# filling the above plot with color
polygon(density(studentDf$Scores), col="red")

Correlation

 Correlation analysis studies the strength of the relationship between two continuous variables.

- It involves computing the correlation coefficient between the two variables.
- Correlation can take values between -1 and +1, where a positive value indicates an increasing trend and a negative value a decreasing trend.

cor(studentDf$Hours, studentDf$Scores)  # calculate correlation between Time and Scores 
## [1] 0.9761907
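
A coefficient of about 0.98 indicates a strong positive linear association. To also obtain a significance test and a confidence interval for the correlation, base R's cor.test() can be used (a quick sketch):

cor.test(studentDf$Hours, studentDf$Scores)  # tests whether the correlation differs from zero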

Splitting the Data

# Create Training and Test data
set.seed(100)  # setting seed to reproduce results of random sampling
trainingRowIndex <- sample(1:nrow(studentDf), 0.8*nrow(studentDf))  # row indices for training data
trainDf <- studentDf[trainingRowIndex, ]  # training data
testDf  <- studentDf[-trainingRowIndex, ]   # test data

Model Building and Evaluation

# Build and summarise the model on training data
mod <- lm(Hours ~ Scores, data=trainDf) # build the model
summary(mod)  # model summary
## 
## Call:
## lm(formula = Hours ~ Scores, data = trainDf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8112 -0.4367 -0.2106  0.4912  1.0939 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.082165   0.309151  -0.266    0.793    
## Scores       0.099843   0.005659  17.643 8.29e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5656 on 18 degrees of freedom
## Multiple R-squared:  0.9453, Adjusted R-squared:  0.9423 
## F-statistic: 311.3 on 1 and 18 DF,  p-value: 8.29e-13
# Predict the test data
yPred <- predict(mod, testDf)  # predictions on the test data

From the model summary, the model's p-value and the predictor's p-value are both less than the significance level (0.05), so the model is statistically significant.

The R-squared (0.945) and adjusted R-squared (0.942) are both high, indicating that the model explains most of the variance in the training data.
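
These quantities can also be extracted from the summary object programmatically; a small sketch:

s <- summary(mod)
s$coefficients["Scores", "Pr(>|t|)"]  # p-value of the predictor
c(r.squared = s$r.squared, adj.r.squared = s$adj.r.squared)  # R-squared and adjusted R-squared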

Measuring goodness of fit

The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) summarise the trade-off between model fit and complexity; lower values indicate a better model, and they are mainly useful for comparing candidate models fit to the same data.

AIC(mod)  # Calculate Akaike information criterion
## [1] 37.85384
BIC(mod)  # Calculate Bayesian information criterion
## [1] 40.84104

Accuracy and Error Rates

A simple way to assess prediction accuracy is to compare the predictions against the actual values: the min-max accuracy (the row-wise mean of the smaller value divided by the larger) should be close to 1 for a good fit, while the mean absolute percentage error (MAPE) should be close to 0.

predDf <- data.frame(cbind(actuals=testDf$Scores, predicteds=yPred)) # creating a dataframe
head(predDf) # printing the first few rows
##    actuals predicteds
## 1       21   2.014535
## 9       81   8.005104
## 11      85   8.404475
## 15      17   1.615163
## 25      86   8.504318
cor(predDf) # correlation between actuals and predicted values
##            actuals predicteds
## actuals          1          1
## predicteds       1          1
# Min-Max Accuracy Calculation
min_max_accuracy <- mean(apply(predDf, 1, min) / apply(predDf, 1, max))  
min_max_accuracy
## [1] 0.09750637
# Mean Absolute Percentage Error Calculation
mape <- mean(abs((predDf$predicteds - predDf$actuals))/predDf$actuals)  
mape
## [1] 0.9024936

What is the predicted score if a student studies for 9.25 hrs/day?

predict(mod, data.frame(Scores=c(9.25)))
##         1 
## 0.8413815
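
Note that mod was fitted with Hours as the response, so the call above returns a predicted study time (about 0.84 hours) rather than a score; this is also why the min-max accuracy is low and the MAPE is high, since predicted hours were compared against actual scores. A minimal sketch of the regression fit in the intended direction, predicting Scores from Hours on the same training split:

# Refit with Scores as the response so the model answers the question directly
modScores <- lm(Scores ~ Hours, data = trainDf)
predict(modScores, data.frame(Hours = 9.25))  # predicted score for 9.25 hrs/day of study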