Description

This is my final data science project for Course-Net Jakarta.

This report predicts student study hours from exam scores using linear regression.

The dataset used in this report comes from Kaggle and is available at:

https://www.kaggle.com/aditeloo/student-study-hour-v2

Report Structure:

  1. Data Extraction

  2. Exploratory Data Analysis

  3. Data Preparation/Preprocessing

  4. Modelling

  5. Evaluations

  6. Recommendations

1. Data Extraction

Read the student study hour dataset and inspect its structure

studentHour_df = read.csv("Student_Study_Hour_v2.csv")
str(studentHour_df)
## 'data.frame':    28 obs. of  2 variables:
##  $ Hours : num  2.5 5.1 3.2 8.5 3.5 1.5 9.2 5.5 8.3 2.7 ...
##  $ Scores: int  21 47 27 75 30 20 88 60 81 25 ...

The dataset contains 28 observations and 2 variables. The target variable is Hours, and Scores is the only predictor.

Statistical Summary

##      Hours           Scores     
##  Min.   :1.100   Min.   :17.00  
##  1st Qu.:2.675   1st Qu.:29.25  
##  Median :4.650   Median :44.50  
##  Mean   :4.832   Mean   :49.96  
##  3rd Qu.:7.025   3rd Qu.:70.50  
##  Max.   :9.200   Max.   :95.00
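The table above has the shape of output from base R's `summary()` applied to the data frame; the original call is not shown, so this is an assumption. A minimal sketch follows, using only the first ten rows printed by `str()` earlier as a stand-in for the full CSV. The name `sample_df` is hypothetical, and the numbers it prints will differ from the 28-row summary above.

```r
# First ten rows as printed by str() above; the full CSV has 28 rows.
# sample_df is a hypothetical stand-in, not the report's studentHour_df.
sample_df <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7),
  Scores = c(21L, 47L, 27L, 75L, 30L, 20L, 88L, 60L, 81L, 25L)
)

# Minimum, quartiles, median, mean and maximum for every column
sample_summary <- summary(sample_df)
print(sample_summary)
```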

2. Exploratory Data Analysis

2.1 Bivariate Data Analysis

2.2 Multivariate Data Analysis

##            Hours    Scores
## Hours  1.0000000 0.9779497
## Scores 0.9779497 1.0000000
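A correlation matrix like the one above can be obtained with base R's `cor()`, which computes Pearson correlation by default. The sketch below again uses only the ten rows visible in the `str()` output (the hypothetical `sample_df`), so its coefficient will not match 0.9779497 exactly.

```r
# Ten rows taken from the str() output; a stand-in for the full dataset.
sample_df <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5, 9.2, 5.5, 8.3, 2.7),
  Scores = c(21, 47, 27, 75, 30, 20, 88, 60, 81, 25)
)

# Pairwise Pearson correlation between all numeric columns
cor_matrix <- cor(sample_df)
print(round(cor_matrix, 4))
```

A value close to 1 indicates a strong positive linear relationship, which is what motivates fitting a linear model to these two variables.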

3. Training and Testing Data Division

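set.seed(2022)
m <- nrow(studentHour_df_num)           # number of rows (28); this definition is not shown in the original and is assumed
train_idx <- sample(m, floor(m * 0.7))  # draw 70% of the row indices (19) for training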
train_idx[1:5]
## [1]  4 19 14 23 11
train_data <- studentHour_df_num[train_idx,]
test_data <- studentHour_df_num[-train_idx,]

dim(train_data)
## [1] 19  2
dim(test_data)
## [1] 9 2
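The 19/9 split follows from the row count: 28 × 0.7 = 19.6, and the reported training size of 19 is consistent with that value being rounded down. The sketch below reproduces only the index arithmetic, assuming `m` is the number of rows (28) reported by `str()`; `test_idx` is a name introduced here for illustration. The sampled indices themselves can differ between R versions, so only the sizes are checked.

```r
set.seed(2022)
m <- 28                                      # assumed: number of observations from str()
train_idx <- sample(m, floor(m * 0.7))       # 19 row indices for training
test_idx  <- setdiff(seq_len(m), train_idx)  # the remaining 9 row indices

length(train_idx)
length(test_idx)

# The two sets should not overlap and together should cover every row
length(intersect(train_idx, test_idx)) == 0
```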

4. Modeling by Simple Linear Regression

slr <- lm(formula = Hours~Scores,
          data=train_data)
summary(slr)
## 
## Call:
## lm(formula = Hours ~ Scores, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8242 -0.4785 -0.0996  0.4665  1.1144 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.026709   0.283921  -0.094    0.926    
## Scores       0.098830   0.005138  19.234 5.66e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5651 on 17 degrees of freedom
## Multiple R-squared:  0.9561, Adjusted R-squared:  0.9535 
## F-statistic: 369.9 on 1 and 17 DF,  p-value: 5.662e-13
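Read as an equation, the fitted model is Hours ≈ -0.026709 + 0.098830 × Scores, so each additional score point corresponds to roughly 0.1 more study hours. The intercept is not significantly different from zero (p = 0.926). The sketch below applies the coefficients by hand; `predict_hours` is a hypothetical helper, not part of the original report.

```r
# Coefficients copied from the summary(slr) output above
intercept <- -0.026709
slope     <-  0.098830

# Hypothetical helper: predicted study hours for a given score
predict_hours <- function(score) {
  intercept + slope * score
}

predict_hours(50)             # about 4.91 hours
predict_hours(c(20, 50, 90))  # works on vectors of scores as well
```

With the fitted object available, `predict(slr, data.frame(Scores = 50))` should give the same value up to rounding of the printed coefficients.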

5. Evaluations

actual <- test_data$Hours

pred.slr <- predict(slr, test_data)

Plot of Actual vs. Predicted Values

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Correlation Between Actual and Predicted Values (Simple Linear Regression)

## [1] 0.9822168

Regression Metrics Measurement

## Method: Single Linear Regression Model 
##  SSE: 2.288 
##  MSE: 0.254 
##  RMSE: 0.504
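The code that produced these metrics is not shown, so the following is a sketch under the usual definitions: SSE is the sum of squared errors, MSE is SSE divided by the number of test rows, and RMSE is the square root of MSE. The function `regression_metrics` is a hypothetical helper. The reported figures are consistent with these definitions for the 9 test rows.

```r
# Hypothetical helper: SSE, MSE and RMSE from actual and predicted values
regression_metrics <- function(actual, predicted) {
  errors <- actual - predicted
  sse  <- sum(errors^2)         # sum of squared errors
  mse  <- sse / length(actual)  # mean squared error
  rmse <- sqrt(mse)             # root mean squared error
  c(SSE = sse, MSE = mse, RMSE = rmse)
}

# Consistency check against the reported values
reported_sse <- 2.288
n_test <- 9                      # rows in test_data, from dim(test_data)
reported_sse / n_test            # about 0.254, the reported MSE
sqrt(reported_sse / n_test)      # about 0.504, the reported RMSE
```

In the report's own variables this would be called as `regression_metrics(actual, pred.slr)`. An RMSE of about 0.5 means predictions on the test set are off by roughly half an hour on average.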