DATA SCIENCE AND BUSINESS ANALYTICS.
Task- OBJECTIVE: PREDICT THE PERCENTAGE SCORE OF A STUDENT BASED ON THE NUMBER OF STUDY HOURS, AND SHOW WHAT THE PREDICTED SCORE WILL BE IF A STUDENT STUDIES FOR 9.25 HRS/DAY.
Step-1 Reading the Given Data Set.
# Reading The CSV Data
student_data <- read.csv("https://raw.githubusercontent.com/AdiPersonalWorks/Random/master/student_scores%20-%20student_scores.csv", header = TRUE)
head(student_data)
summary(student_data)
     Hours           Scores     
 Min.   :1.100   Min.   :17.00  
 1st Qu.:2.700   1st Qu.:30.00  
 Median :4.800   Median :47.00  
 Mean   :5.012   Mean   :51.48  
 3rd Qu.:7.400   3rd Qu.:75.00  
 Max.   :9.200   Max.   :95.00  
Step-2 Plotting the Given Data set.
# Plotting the given data
plot(x = student_data$Hours, y = student_data$Scores, xlab = "Hours", ylab = "Scores", main = "Score v/s Hours", col = "blue")
Step-3 Running Simple Linear Regression on the data, as there is only one predictor (Hours) and one response (Scores).
# Linear regression
student_data_regression <- lm(formula = Scores~Hours, data = student_data)
plot(x = student_data$Hours, y = student_data$Scores, xlab = "Hours", ylab = "Scores", main = "Score v/s Hours", col = "blue")
abline(student_data_regression, col= "Red")
summary(student_data_regression)
Call:
lm(formula = Scores ~ Hours, data = student_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-10.578  -5.340   1.839   4.593   7.265 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.4837     2.5317   0.981    0.337    
Hours         9.7758     0.4529  21.583   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.603 on 23 degrees of freedom
Multiple R-squared:  0.9529, Adjusted R-squared:  0.9509
F-statistic: 465.8 on 1 and 23 DF,  p-value: < 2.2e-16
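To make the last lines of this summary easier to read, here is a minimal sketch of how R-squared and the residual standard error are derived from a fitted model. The data frame `toy` and every name in the block are hypothetical and are not the student_scores data; the point is only that R-squared is 1 minus the ratio of residual to total sum of squares, and that the residual degrees of freedom (23 in the output above) are the number of observations minus the two estimated coefficients.

```r
# Hypothetical toy data (NOT the student_scores file): hours studied and scores.
toy <- data.frame(
  Hours  = c(1, 2, 3, 4, 5, 6, 7, 8),
  Scores = c(15, 22, 35, 41, 55, 58, 74, 80)
)

fit <- lm(Scores ~ Hours, data = toy)

# R-squared by hand: 1 - (residual sum of squares / total sum of squares)
ss_res    <- sum(residuals(fit)^2)
ss_tot    <- sum((toy$Scores - mean(toy$Scores))^2)
r2_manual <- 1 - ss_res / ss_tot

# The same quantity as reported by summary()
r2_summary <- summary(fit)$r.squared

# Residual standard error: sqrt(RSS / residual degrees of freedom)
rse_manual  <- sqrt(ss_res / df.residual(fit))
rse_summary <- summary(fit)$sigma

print(c(manual = r2_manual, summary = r2_summary))
print(c(manual = rse_manual, summary = rse_summary))
```

The manual and summary() values should agree up to floating-point rounding.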
Step-4 Splitting Data into Test and Training Data
# Splitting Data Into Test and Training data
library(caTools) # package "caTools" is used for splitting the data
# sample.split() picks rows at random, so the split (and the results below) change between runs unless set.seed() is called first
split <- sample.split(Y = student_data$Scores, SplitRatio = 0.75)
training_set <- subset(student_data, split == TRUE)
test_set <- subset(student_data, split == FALSE)
training_set # checking the training data
test_set # checking the test data
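Because the split is random, the numbers in the later steps change from run to run. A minimal sketch of a repeatable 75/25 split is shown below. It swaps in base R's sample() for caTools' sample.split(), so it is not the same splitting procedure, and the data frame `toy`, the seed value 42, and all variable names are assumptions made for illustration.

```r
# Hypothetical toy data (NOT the student_scores file).
toy <- data.frame(
  Hours  = c(1, 2, 3, 4, 5, 6, 7, 8),
  Scores = c(15, 22, 35, 41, 55, 58, 74, 80)
)

set.seed(42)  # arbitrary seed, chosen only so the split is repeatable

n_train   <- floor(0.75 * nrow(toy))              # 75% of rows for training
train_idx <- sample(seq_len(nrow(toy)), n_train)  # random row indices

toy_train <- toy[train_idx, ]   # rows selected for training
toy_test  <- toy[-train_idx, ]  # all remaining rows form the test set

nrow(toy_train)
nrow(toy_test)
```

Calling set.seed() with the same value before the split gives the same training and test rows on every run.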
Step-5 Training the dataset
# training
result <- lm(formula = Scores~Hours, data = training_set)
summary(result)
Call:
lm(formula = Scores ~ Hours, data = training_set)

Residuals:
    Min      1Q  Median      3Q     Max 
-8.0607 -4.2990  0.3358  2.7462  9.4485 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   4.0205     2.7067   1.485    0.157    
Hours         9.2988     0.4954  18.769 2.54e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.547 on 16 degrees of freedom
Multiple R-squared:  0.9566, Adjusted R-squared:  0.9538
F-statistic: 352.3 on 1 and 16 DF,  p-value: 2.543e-12
result$coefficients
(Intercept)       Hours 
   4.020513    9.298847 
Step-6 Using the Test Data to predict the outcome
pred <- predict(result, test_set)
head(pred) # printing the predicted result
       1        8       13       16       19       22 
27.26763 55.16417 45.86533 86.78025 60.74348 48.65498 
head(test_set) # printing the head of test set to compare with predicted values
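Comparing predictions with the actual test scores is easier when both sit in one table. The sketch below shows one way to do that. The data frames `toy_train` and `toy_test`, the model `toy_fit`, and the column names `Actual`, `Predicted` and `Error` are hypothetical and stand in for training_set, test_set and result.

```r
# Hypothetical toy data (NOT the student_scores file), already split.
toy_train <- data.frame(
  Hours  = c(1, 2, 4, 5, 7, 8),
  Scores = c(15, 22, 41, 55, 74, 80)
)
toy_test <- data.frame(
  Hours  = c(3, 6),
  Scores = c(35, 58)
)

# Fit on the training rows only
toy_fit <- lm(Scores ~ Hours, data = toy_train)

# Put actual and predicted scores side by side for the test rows
comparison <- data.frame(
  Hours     = toy_test$Hours,
  Actual    = toy_test$Scores,
  Predicted = predict(toy_fit, newdata = toy_test)
)

# Signed error per row: positive means the model over-predicted
comparison$Error <- comparison$Predicted - comparison$Actual

comparison
```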
Step-7 Calculating the Mean Absolute Error
# actual scores of the test set
actual <- test_set$Scores
actual
# predicted scores for the same test rows (from Step-6)
pred
# Calculating Mean Absolute Error between predicted and actual scores
# (the value depends on the random split made in Step-4)
library(ie2misc)
mae(pred, actual)
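The Mean Absolute Error is the average of the absolute differences between predicted and actual scores on the test rows, so both inputs must be Scores for the same observations. As a minimal sketch, it can also be computed in base R without an extra package. The function name `mae_manual` and the example vectors are hypothetical.

```r
# Mean Absolute Error: mean of |predicted - observed|
mae_manual <- function(predicted, observed) {
  stopifnot(length(predicted) == length(observed))
  mean(abs(predicted - observed))
}

# Hypothetical example values (NOT taken from the student_scores data)
predicted <- c(10, 20, 30)
observed  <- c(12, 18, 33)

mae_manual(predicted, observed)  # (2 + 2 + 3) / 3
```

With the objects from the steps above, the equivalent call would be mae_manual(pred, test_set$Scores).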
Step-8 What will be the predicted score if a student studies for 9.25 hrs/day?
predicted_result <- predict(result, data.frame(Hours = 9.25))
predicted_result
1
90.03485
So, the predicted score of a student who studies for 9.25 hrs/day is 90.03485.
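The prediction can be checked by hand from the coefficients printed in Step-5, since a simple linear regression predicts Scores = intercept + slope * Hours. The variable names below are illustrative; the two coefficient values are copied from the result$coefficients output.

```r
# Coefficients copied from the result$coefficients output in Step-5
intercept <- 4.020513
slope     <- 9.298847

hours <- 9.25  # study hours per day from the task

# Simple linear regression prediction: intercept + slope * hours
predicted_score <- intercept + slope * hours

round(predicted_score, 5)  # 90.03485
```

This agrees with the value returned by predict() in Step-8.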