This is my final data science project for Course-Net Jakarta.
This report predicts student study hours from exam scores using simple linear regression.
The dataset used in this report is a Kaggle dataset, with the link:
Report Structure:
Data Extraction
Exploratory Data Analysis
Data Preparation/Preprocessing
Modelling
Evaluation
Recommendations
Read the student study hour dataset and inspect its structure
studentHour_df = read.csv("Student_Study_Hour_v2.csv")
str(studentHour_df)
## 'data.frame': 28 obs. of 2 variables:
## $ Hours : num 2.5 5.1 3.2 8.5 3.5 1.5 9.2 5.5 8.3 2.7 ...
## $ Scores: int 21 47 27 75 30 20 88 60 81 25 ...
The dataset contains 28 observations and 2 variables. The target variable is Hours, and Scores is the only predictor.
##      Hours           Scores
##  Min.   :1.100   Min.   :17.00
##  1st Qu.:2.675   1st Qu.:29.25
##  Median :4.650   Median :44.50
##  Mean   :4.832   Mean   :49.96
##  3rd Qu.:7.025   3rd Qu.:70.50
##  Max.   :9.200   Max.   :95.00
##            Hours    Scores
## Hours  1.0000000 0.9779497
## Scores 0.9779497 1.0000000
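The summary statistics and correlation matrix above look like the output of `summary()` and `cor()`, although the calls themselves are not shown in this report. A minimal sketch, assuming only the first ten rows printed by `str()` earlier (so the numbers will differ from the full 28-row results); the names `sample_df`, `summary_stats`, and `cor_matrix` are hypothetical:

```r
# First ten rows only, copied from the str() output above.
# The full CSV has 28 rows, so these results will not match the report exactly.
sample_df <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7),
  Scores = c(21L, 47L, 27L, 75L, 30L, 20L, 88L, 60L, 81L, 25L)
)

# Five-number summary plus mean for each column
summary_stats <- summary(sample_df)

# Pearson correlation matrix between Hours and Scores
cor_matrix <- cor(sample_df)

print(summary_stats)
print(cor_matrix)
```

A correlation close to 1, as in the full dataset (0.978), suggests a single straight line is a reasonable first model for these two variables.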
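# Assumption: studentHour_df_num holds the numeric columns of studentHour_df
# (its definition is not shown in this report) and m is its number of rows.
m <- nrow(studentHour_df_num)
set.seed(2022)
train_idx <- sample(m, floor(m * 0.7))  # 70% of the rows go to training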
train_idx[1:5]
## [1] 4 19 14 23 11
train_data <- studentHour_df_num[train_idx,]
test_data <- studentHour_df_num[-train_idx,]
dim(train_data)
## [1] 19 2
dim(test_data)
## [1] 9 2
slr <- lm(formula = Hours ~ Scores,
          data = train_data)
summary(slr)
##
## Call:
## lm(formula = Hours ~ Scores, data = train_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.8242 -0.4785 -0.0996  0.4665  1.1144
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.026709   0.283921  -0.094    0.926
## Scores       0.098830   0.005138  19.234 5.66e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5651 on 17 degrees of freedom
## Multiple R-squared: 0.9561, Adjusted R-squared: 0.9535
## F-statistic: 369.9 on 1 and 17 DF, p-value: 5.662e-13
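Reading the coefficients above, the fitted line is roughly Hours = -0.026709 + 0.098830 x Scores: each additional score point corresponds to about 0.1 more study hours, and the intercept is not significantly different from zero (p = 0.926). As a worked illustration, the sketch below hard-codes the two estimates from the summary output; the helper name `predict_hours` is hypothetical and not part of the original analysis:

```r
# Coefficients copied from the summary(slr) output above
intercept <- -0.026709
slope     <-  0.098830

# Hypothetical helper: predicted study hours for a given score
predict_hours <- function(score) {
  intercept + slope * score
}

# Predicted study hours for a score of 50
hours_at_50 <- predict_hours(50)
print(hours_at_50)
```

With the fitted model object available, `predict(slr, data.frame(Scores = 50))` should give the same value up to rounding of the printed coefficients.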
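# Actual study hours in the held-out test set
actual <- test_data$Hours
# Study hours predicted by the fitted model for the same rows
pred.slr <- predict(slr, newdata = test_data)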
Plot of Actual vs Predicted Values
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Correlation Between Actual and Predicted Values (Simple Linear Regression)
## [1] 0.9822168
Regression Metrics
## Method: Single Linear Regression Model
## SSE: 2.288
## MSE: 0.254
## RMSE: 0.504
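The reported values are consistent with MSE being SSE divided by the 9 test observations (2.288 / 9 is about 0.254) and RMSE being the square root of MSE. The code that produced them is not shown, so the following is only a sketch of how such metrics might be computed; `regression_metrics` is a hypothetical helper, and the toy vectors are illustrative values, not the report's data:

```r
# Hypothetical helper: error metrics for a set of predictions
regression_metrics <- function(actual, predicted) {
  residuals <- actual - predicted
  sse  <- sum(residuals^2)        # sum of squared errors
  mse  <- sse / length(actual)    # mean squared error
  rmse <- sqrt(mse)               # root mean squared error
  c(SSE = sse, MSE = mse, RMSE = rmse,
    Correlation = cor(actual, predicted))
}

# Illustrative toy values only (not taken from the dataset)
toy_actual    <- c(1, 2, 3, 4)
toy_predicted <- c(1.5, 2, 2.5, 4)

metrics <- regression_metrics(toy_actual, toy_predicted)
print(round(metrics, 3))
```

In the context of this report, the call would presumably be `regression_metrics(actual, pred.slr)`.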