What is Supervised Learning?

Supervised learning is an important aspect of data science. It is the machine learning task of inferring a function from labelled training data. The training data consists of a set of training examples, where each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). With the help of historical data, random sampling is carried out, typically splitting the records 70/30. The machine learning algorithm is trained on the 70% portion; it is important to make sure this data is generalized and not overly specific. Once the system is trained, it produces a (statistical) model, meaning that a certain understanding of the data has been attained, along with some formulas; calculations based on these are the output of the modelling. The model then has to be evaluated to check how well it functions. The remaining 30% of the data contains both inputs and outputs, but when you give it to the model, the model takes only the independent variable and calculates its output. You then compare the model's predicted output against the actual value, and from this comparison the accuracy percentage is obtained.

What is Linear Regression?

Linear regression attempts to model the relationship between two variables by fitting a linear equation (a straight line) to the observed data. If you have a hunch that the data follows a straight-line trend, linear regression can give you quick and reasonably accurate results.
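
Concretely, simple linear regression fits a model of the form (written here with the variable names of the dataset analysed below)

Scores = β0 + β1 · Hours + ε

where β0 is the intercept, β1 is the slope, and ε is a random error term; fitting the model means estimating β0 and β1 from the observed data.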

Importing necessary libraries

library(ggplot2)     # static plots (scatter plots)
library(psych)       # describe() for detailed summary statistics
library(highcharter) # interactive charts (hcboxplot)
library(plotly)      # ggplotly() for interactive ggplot2 plots

1. Import and Overview of Data

# Read the file from the URL
link <- "http://bit.ly/w-data"
df <- read.csv(link)
attach(df)  # attach so columns can be referenced directly as Hours and Scores
head(df)
##   Hours Scores
## 1   2.5     21
## 2   5.1     47
## 3   3.2     27
## 4   8.5     75
## 5   3.5     30
## 6   1.5     20

2. Exploratory Data Analysis

dim(df)
## [1] 25  2

Summary of variables

str(df)
## 'data.frame':    25 obs. of  2 variables:
##  $ Hours : num  2.5 5.1 3.2 8.5 3.5 1.5 9.2 5.5 8.3 2.7 ...
##  $ Scores: int  21 47 27 75 30 20 88 60 81 25 ...
summary(df)
##      Hours           Scores     
##  Min.   :1.100   Min.   :17.00  
##  1st Qu.:2.700   1st Qu.:30.00  
##  Median :4.800   Median :47.00  
##  Mean   :5.012   Mean   :51.48  
##  3rd Qu.:7.400   3rd Qu.:75.00  
##  Max.   :9.200   Max.   :95.00

Interactive Plots

hcboxplot(Scores, color = "red")  # interactive boxplot of Scores
describe(df)
##        vars  n  mean    sd median trimmed   mad  min  max range skew kurtosis
## Hours     1 25  5.01  2.53    4.8    4.98  3.11  1.1  9.2   8.1 0.17    -1.42
## Scores    2 25 51.48 25.29   47.0   50.81 32.62 17.0 95.0  78.0 0.21    -1.53
##          se
## Hours  0.51
## Scores 5.06

Using the describe() function, we get extra information beyond summary(), such as the 10% trimmed mean, the median absolute deviation (mad), skewness, and kurtosis.

Number of missing values

colSums(is.na(df))
##  Hours Scores 
##      0      0

This shows that there are no missing values, so no data cleaning is needed.

3. Data Visualisation

Scatter Plots

p <- ggplot(df, aes(x = Hours, y = Scores)) + geom_point(color = "red", shape = 1) +
  labs(title = "Plot of Scores of students and \n No of Hours studied") +
  xlab("Hours Studied") + ylab("Percentage Score")
ggplotly(p)

Adding more customisation to the plot

q <- p + aes(size = Hours)  # map point size to Hours
ggplotly(q)

From this we can see that there is a strong positive linear association between scores and hours studied.

We want to know how strong this relationship is:

cor(Hours, Scores)
## [1] 0.9761907

The correlation is about 0.98, very close to 1, so we can safely apply linear regression to this dataset.
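
As an optional extra check (not part of the original analysis), cor.test() also tests whether this correlation is significantly different from zero and gives a confidence interval for it:

cor.test(Hours, Scores)  # Pearson correlation test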

Split the dataset into training (70%) and validation (30%) sets

set.seed(2561)  # for reproducibility
# randomly assign each row to group 1 (train, prob 0.7) or group 2 (test, prob 0.3)
ind <- sample(2, nrow(df), replace = TRUE, prob = c(0.7, 0.3))
traindf <- df[ind == 1, ]
testdf <- df[ind == 2, ]
head(traindf)
##   Hours Scores
## 3   3.2     27
## 4   8.5     75
## 5   3.5     30
## 6   1.5     20
## 7   9.2     88
## 9   8.3     81
head(testdf)
##    Hours Scores
## 1    2.5     21
## 2    5.1     47
## 8    5.5     60
## 13   4.5     41
## 20   7.4     69
## 21   2.7     30
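
Note that sample() here assigns each row to a group independently, so the split is only approximately 70/30 (here it yields 18 training rows and 7 test rows). If an exact 70/30 split is preferred, one alternative sketch is:

# sample exactly 70% of the row indices without replacement
idx <- sample(nrow(df), size = round(0.7 * nrow(df)))
# traindf <- df[idx, ]
# testdf <- df[-idx, ]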

Linear Regression Model

results <- lm(Scores ~ Hours, data = traindf)  # fit simple linear regression of Scores on Hours
summary(results)
## 
## Call:
## lm(formula = Scores ~ Hours, data = traindf)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.848  -4.738   1.889   4.245   6.919 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.6773     2.8784   1.278     0.22    
## Hours         9.6671     0.5034  19.205 1.79e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.774 on 16 degrees of freedom
## Multiple R-squared:  0.9584, Adjusted R-squared:  0.9558 
## F-statistic: 368.8 on 1 and 16 DF,  p-value: 1.786e-12
traindf$pred <- predict(results, traindf)  # predictions on the training data
head(traindf)
##   Hours Scores     pred
## 3   3.2     27 34.61211
## 4   8.5     75 85.84790
## 5   3.5     30 37.51225
## 6   1.5     20 18.17799
## 7   9.2     88 92.61489
## 9   8.3     81 83.91447

Performance of this model on test data

testdf$pred <- predict(results, testdf)  # predictions on unseen test data
testdf
##    Hours Scores     pred
## 1    2.5     21 27.84512
## 2    5.1     47 52.97966
## 8    5.5     60 56.84651
## 13   4.5     41 47.17938
## 20   7.4     69 75.21406
## 21   2.7     30 29.77855
## 24   6.9     76 70.38049
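
To quantify how closely these predictions match the actual scores, we can compute standard error metrics on the test set (a minimal sketch; these metrics were not part of the original analysis):

# root mean squared error and mean absolute error of test-set predictions
rmse <- sqrt(mean((testdf$Scores - testdf$pred)^2))
mae <- mean(abs(testdf$Scores - testdf$pred))
c(RMSE = rmse, MAE = mae)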

Using this model, we can see that it fits the test data very well. Now we fit the regression line through the previous scatterplot.

s <- p + geom_smooth(method = "lm")  # add fitted regression line with confidence band
ggplotly(s)
## `geom_smooth()` using formula 'y ~ x'

Here we can check how precisely our model parameters are estimated. For that, we compute 95% confidence intervals for the intercept and the slope coefficient.

beta <- summary(results)$coefficients[,1]     # point estimates
se.beta <- summary(results)$coefficients[,2]  # standard errors
t95 <- qt(0.975, results$df.residual)         # critical t value for 95% CI
ci.beta <- cbind(beta - t95*se.beta, beta + t95*se.beta)
ci.beta
##                  [,1]      [,2]
## (Intercept) -2.424662  9.779255
## Hours        8.600063 10.734197
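
Base R's confint() is a shortcut that should reproduce the same intervals:

confint(results, level = 0.95)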

Now, our model's intercept and slope are as follows:

intercept <- coef(summary(results))["(Intercept)", "Estimate"]
slope <- coef(summary(results))["Hours", "Estimate"]  # named "slope" to avoid masking the attached Hours column
intercept
## [1] 3.677297
slope
## [1] 9.66713

The slope's 95% confidence interval (8.60, 10.73) excludes zero, so the effect of hours studied on scores is statistically significant at the 5% level. (The intercept's interval includes zero, consistent with its p-value of 0.22 in the model summary.)

Now, let us predict the score if a student studies 9.25 hrs/day:

intercept + slope*9.25
## [1] 93.09825
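
Equivalently, predict() returns the same point estimate and can also attach a 95% prediction interval (a sketch; this interval output is not shown in the original):

predict(results, newdata = data.frame(Hours = 9.25), interval = "prediction")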

Finally, we can say that our model predicts a student's score from the number of hours studied quite reliably: it explains about 95.84% of the variance in scores (multiple R-squared = 0.9584).
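
That 95.84% figure is the multiple R-squared from the model summary above; it can also be extracted directly (a small convenience, not in the original code):

summary(results)$r.squared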

Comment: If a student studies 9.25 hrs/day, from the given dataset we can predict that the student is expected to score approximately 93.10%.

Thank You!!