This linear regression task is part of an internship at The Sparks Foundation.
Setting up the working environment.
setwd("D:/Sparks Foundation/Task 1")
library(ggplot2)
Reading the file, checking for missing values, and previewing the dataset and its structure.
data <- read.csv("./data.csv")
any(is.na(data))
## [1] FALSE
head(data)
##   Hours Scores
## 1   2.5     21
## 2   5.1     47
## 3   3.2     27
## 4   8.5     75
## 5   3.5     30
## 6   1.5     20
str(data)
## 'data.frame': 25 obs. of 2 variables:
## $ Hours : num 2.5 5.1 3.2 8.5 3.5 1.5 9.2 5.5 8.3 2.7 ...
## $ Scores: int 21 47 27 75 30 20 88 60 81 25 ...
Splitting the dataset into training (80%) and testing (20%) subsets.
set.seed(4466)
idx <- sample(nrow(data),nrow(data)*0.80)
train <- data[idx,]
test <- data[-idx,]
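As a quick sanity check on the split, the sketch below repeats the same steps on a placeholder data frame and confirms that the two subsets partition the rows. The `demo` data frame and its values are made up for illustration (only its 25 x 2 shape mirrors the real dataset), and `floor()` is added here as an assumption to make the training size an explicit integer.

```r
# Placeholder data with the same shape as the study-hours dataset (values are made up).
demo <- data.frame(Hours  = seq(1, 9, length.out = 25),
                   Scores = seq(15, 95, length.out = 25))

set.seed(4466)
n_train <- floor(0.80 * nrow(demo))   # 20 of the 25 rows go to training
idx <- sample(nrow(demo), n_train)    # row indices drawn without replacement
train_demo <- demo[idx, ]
test_demo  <- demo[-idx, ]            # negative indexing keeps the remaining rows

# The subsets should partition the data: sizes add up and no row is shared.
nrow(train_demo)
nrow(test_demo)
length(intersect(rownames(train_demo), rownames(test_demo)))
```

With 25 rows this leaves 20 for training and 5 for testing, which matches the 18 residual degrees of freedom reported by the model fitted on the training subset (20 rows minus 2 estimated coefficients).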
Plotting Scores against Hours for the full dataset, with a fitted linear regression line.
ggplot(data) +
aes(x=Hours,y=Scores) +
geom_point() + stat_smooth(method = "lm", col = "blue")
## `geom_smooth()` using formula 'y ~ x'
Fitting a linear regression model of Scores on Hours using the training subset, and summarizing the fitted model.
model <- lm(Scores~Hours,train)
summary(model)
##
## Call:
## lm(formula = Scores ~ Hours, data = train)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.8506 -4.4017 -0.4648  3.8555  8.8436
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.5645     2.9215   0.878    0.392
## Hours         9.5631     0.5574  17.156 1.34e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.654 on 18 degrees of freedom
## Multiple R-squared: 0.9424, Adjusted R-squared: 0.9392
## F-statistic: 294.3 on 1 and 18 DF, p-value: 1.336e-12
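The Hours coefficient means that each additional hour of study is associated with roughly 9.56 more points, while the intercept is not significantly different from zero (p = 0.392). To show how some of the other numbers in the summary relate to each other, here is a minimal sketch that recomputes the t value for Hours and the adjusted R-squared from the rounded values printed above. The variable names are illustrative, and because the inputs are rounded the results agree with the summary only to about three decimal places.

```r
# Values copied (rounded) from the summary output above.
est_hours <- 9.5631   # Hours estimate
se_hours  <- 0.5574   # Hours standard error
r2        <- 0.9424   # multiple R-squared
n         <- 20       # training rows: 18 residual df + 2 estimated coefficients
p         <- 1        # number of predictors

# A t value is the estimate divided by its standard error.
t_hours <- est_hours / se_hours

# Adjusted R-squared penalizes R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

round(t_hours, 2)
round(adj_r2, 4)
```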
Predicting scores for the test subset and comparing them with the actual scores.
prd <- predict(model,newdata = test)
mde <- cbind(test$Scores,prd)
colnames(mde) <- c('Actual Scores','Predicted Scores')
mde <- as.data.frame(mde)
mde
##    Actual Scores Predicted Scores
## 9             81         81.93798
## 11            85         76.20014
## 15            17         13.08386
## 16            95         87.67583
## 22            54         48.46723
The table shows that the predictions track the actual scores reasonably well: the absolute errors range from about 0.9 to 8.8 points, and the model underpredicts four of the five test scores.
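The comparison can also be summarized with standard error metrics. The sketch below is not part of the original analysis: it copies the five actual and predicted values from the table and computes the mean absolute error (MAE) and root mean squared error (RMSE). The variable names are illustrative.

```r
# Actual and predicted test scores copied from the comparison table.
actual    <- c(81, 85, 17, 95, 54)
predicted <- c(81.93798, 76.20014, 13.08386, 87.67583, 48.46723)

errors <- actual - predicted     # positive values mean the model underpredicted
mae    <- mean(abs(errors))      # mean absolute error
rmse   <- sqrt(mean(errors^2))   # root mean squared error

round(errors, 2)
round(mae, 2)
round(rmse, 2)
```

On these five rows the MAE comes out at about 5.3 points and the RMSE at about 6.0, in the same range as the residual standard error of 5.654 reported for the training fit. With only five test observations, these figures are a rough indication rather than a stable estimate.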
Predicting the score of a student who studies 9.25 hours per day.
predict(model, data.frame(Hours = 9.25))
## 1
## 91.0229
Therefore, the model predicts a score of about 91 for a student who studies 9.25 hours per day.
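The same figure can be checked by hand from the fitted line, Scores = intercept + slope x Hours. This is a minimal sketch using the rounded coefficients printed in the model summary, so it matches the `predict()` output only to about three decimal places.

```r
# Rounded coefficients from the model summary.
intercept <- 2.5645   # (Intercept) estimate
slope     <- 9.5631   # Hours estimate
hours     <- 9.25     # study hours per day

# Evaluate the fitted line at the requested number of hours.
by_hand <- intercept + slope * hours
round(by_hand, 2)
```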