This linear regression analysis was completed as part of an internship with The Sparks Foundation.

Setting up the working environment.

setwd("D:/Sparks Foundation/Task 1")
library(ggplot2)

Reading the file, checking for missing values, and previewing the first rows and the structure of the dataset.

data <- read.csv("./data.csv")
any(is.na(data))
## [1] FALSE
head(data)
##   Hours Scores
## 1   2.5     21
## 2   5.1     47
## 3   3.2     27
## 4   8.5     75
## 5   3.5     30
## 6   1.5     20
str(data)
## 'data.frame':    25 obs. of  2 variables:
##  $ Hours : num  2.5 5.1 3.2 8.5 3.5 1.5 9.2 5.5 8.3 2.7 ...
##  $ Scores: int  21 47 27 75 30 20 88 60 81 25 ...

Splitting the dataset into training (80%) and testing (20%) subsets.

set.seed(4466)
idx <- sample(nrow(data),nrow(data)*0.80)
train <- data[idx,]
test <- data[-idx,]
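
A quick check of the split sizes can confirm that the two subsets are disjoint and add up to the full dataset. The sketch below is only illustrative: it uses a stand-in row count of 25 (taken from the `str(data)` output) instead of the real file, and the names `n`, `train_rows`, and `test_rows` are assumptions introduced here.

```r
# Stand-in for nrow(data); the str() output above reports 25 observations
n <- 25

set.seed(4466)
idx <- sample(n, n * 0.80)              # row numbers for the training set

train_rows <- idx
test_rows  <- setdiff(seq_len(n), idx)  # same rows that data[-idx, ] keeps

length(train_rows)                       # size of the training set
length(test_rows)                        # size of the test set
length(intersect(train_rows, test_rows)) # overlap between the two sets
```

With 25 rows, an 80/20 split gives 20 training rows and 5 test rows, which agrees with the 18 residual degrees of freedom and the five test predictions reported further down.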

Plotting the full dataset with a fitted regression line.

ggplot(data) +
  aes(x=Hours,y=Scores) + 
  geom_point() + stat_smooth(method = "lm", col = "blue")
## `geom_smooth()` using formula 'y ~ x'

Fitting a simple linear regression of Scores on Hours using the training set, then summarizing the fitted model.

model <- lm(Scores~Hours,train)
summary(model)
## 
## Call:
## lm(formula = Scores ~ Hours, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8506 -4.4017 -0.4648  3.8555  8.8436 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.5645     2.9215   0.878    0.392    
## Hours         9.5631     0.5574  17.156 1.34e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.654 on 18 degrees of freedom
## Multiple R-squared:  0.9424, Adjusted R-squared:  0.9392 
## F-statistic: 294.3 on 1 and 18 DF,  p-value: 1.336e-12
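
The summary reports a slope of about 9.56 points per extra study hour, with a standard error of 0.5574 on 18 residual degrees of freedom. As a rough sketch of how much uncertainty that implies, the block below builds a 95% confidence interval for the slope by hand from those printed (rounded) values. The variable names are assumptions introduced here; calling `confint(model)` on the fitted object should give essentially the same interval.

```r
# Values copied from the printed summary(model) output (rounded)
slope    <- 9.5631   # estimate for Hours
se_slope <- 0.5574   # standard error for Hours
df_resid <- 18       # residual degrees of freedom

# Two-sided 95% critical value of the t distribution
t_crit <- qt(0.975, df_resid)

# Confidence interval: estimate plus/minus critical value times standard error
slope_ci <- c(lower = slope - t_crit * se_slope,
              upper = slope + t_crit * se_slope)
slope_ci
```

The interval lies well above zero, which is consistent with the very small p-value reported for Hours.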

Predicting scores for the test set and comparing the predicted values with the actual scores.

prd <- predict(model,newdata = test)
mde <- cbind(test$Scores,prd)
colnames(mde) <- c('Actual Scores','Predicted Scores')
mde <- as.data.frame(mde)
mde
##    Actual Scores Predicted Scores
## 9             81         81.93798
## 11            85         76.20014
## 15            17         13.08386
## 16            95         87.67583
## 22            54         48.46723

The table shows that the predicted scores track the actual scores reasonably well: the absolute errors on the five test observations range from about 1 to 9 points.
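
To put a single number on that comparison, one option is to compute the mean absolute error (MAE) and root mean squared error (RMSE) on the test set. The sketch below is not part of the original analysis: it hard-codes the five rows from the printed table so that it runs on its own, and the names `actual`, `predicted`, `mae`, and `rmse` are introduced here for illustration.

```r
# Values copied from the comparison table above
actual    <- c(81, 85, 17, 95, 54)
predicted <- c(81.93798, 76.20014, 13.08386, 87.67583, 48.46723)

errors <- actual - predicted

mae  <- mean(abs(errors))     # average size of the error, in score points
rmse <- sqrt(mean(errors^2))  # penalizes large errors more heavily

mae
rmse
```

With the real objects in the session, the same quantities could be computed as `mean(abs(test$Scores - prd))` and `sqrt(mean((test$Scores - prd)^2))`. With only five test rows, either figure is a rough indication rather than a stable estimate.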

Predicting the score for a student who studies 9.25 hours per day.

predict(model, data.frame(Hours = 9.25))
##       1 
## 91.0229

Therefore, the model predicts a score of about 91 for a student who studies 9.25 hours per day.
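
As a sanity check, the prediction can be reproduced by hand from the fitted line, Scores = intercept + slope × Hours. The sketch below uses the rounded coefficients printed in the model summary, so a small difference from the `predict()` output in the last decimals is expected; the variable names are assumptions introduced here.

```r
# Coefficients copied from the printed summary(model) output (rounded)
intercept <- 2.5645
slope     <- 9.5631

hours <- 9.25

# Point prediction from the fitted line
manual_prediction <- intercept + slope * hours
manual_prediction
```

For a range around this point estimate, `predict()` for `lm` objects also accepts `interval = "prediction"`, for example `predict(model, data.frame(Hours = 9.25), interval = "prediction")`.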