In this section we will see how the R language can be used to implement regression functions. We will start with simple linear regression involving two variables.
In this linear regression task, we predict the percentage of marks a student is expected to score based on the number of hours they studied. This is a simple linear regression task because it involves just two variables.
To get the current working directory:
getwd()
## [1] "C:/Users/Ali Raza/Desktop/Sparks Foundation/Task#1"
To set the working directory to the folder where the data set is saved:
setwd("C:/Users/Ali Raza/Desktop/Sparks Foundation/Task#1")
Read the data set provided for the task. It contains student study hours and scores.
s_scores <- read.csv("student_scores - student_scores.csv")
To view the first rows of the data set:
head(s_scores)
##   Hours Scores
## 1   2.5     21
## 2   5.1     47
## 3   3.2     27
## 4   8.5     75
## 5   3.5     30
## 6   1.5     20
To check the summary statistics of the data set:
summary(s_scores)
##      Hours           Scores
##  Min.   :1.100   Min.   :17.00
##  1st Qu.:2.700   1st Qu.:30.00
##  Median :4.800   Median :47.00
##  Mean   :5.012   Mean   :51.48
##  3rd Qu.:7.400   3rd Qu.:75.00
##  Max.   :9.200   Max.   :95.00
library(ggplot2)
library(tidyverse)
library(ggpubr)
The columns used for plotting are Hours and Scores.
Create a scatter plot of Scores against Hours:
## Warning: package 'ggpubr' was built under R version 4.1.1
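Create a scatter plot with a smoothed trend line for Hours and Scores (the message below shows that `geom_smooth()` used its default loess smoother here, not a straight regression line):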
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The graph above shows a linearly increasing relationship between the Hours and Scores variables. This is a good sign, because one important assumption of linear regression is that the relationship between the outcome and predictor variables is linear and additive.
So the graph shows a positive relationship.
The correlation coefficient between Hours and Scores:
cor(s_scores$Scores, s_scores$Hours)
## [1] 0.9761907
The correlation coefficient (r) measures the strength of the linear association between two variables x and y.
Its value ranges between -1 (perfect negative correlation: when x increases, y decreases) and +1 (perfect positive correlation: when x increases, y increases).
A value close to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < r < 0.2) suggests that much of the variation in the outcome variable (y) is not explained by the predictor (x). In such a case, we should probably look for better predictor variables.
In our task, the correlation coefficient (0.976) is large enough, so we can continue by building a linear model of y as a function of x.
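As a minimal sketch of what `cor()` computes, the Pearson coefficient can be reproduced from its definition. The data frame `s_head` is an assumption made for illustration: it holds only the six rows printed by `head(s_scores)`, so its correlation will differ somewhat from the full-data value of 0.9761907.

```r
# Six rows copied from the head(s_scores) output; NOT the full data set
s_head <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

x <- s_head$Hours
y <- s_head$Scores

# Pearson correlation from its definition:
# sum of cross-deviations divided by the product of the deviation norms
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in equivalent
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree up to floating-point error, which confirms that `cor()` uses the Pearson definition by default.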
Simple linear regression tries to find the best-fitting line to predict Scores on the basis of Hours.
The linear model equation can be written as follows: Scores = b0 + b1 * Hours, where b0 is the intercept and b1 is the slope.
The R function lm() can be used to determine the beta coefficients of the linear model:
linear_model <- lm(Scores~Hours, data = s_scores)
linear_model
##
## Call:
## lm(formula = Scores ~ Hours, data = s_scores)
##
## Coefficients:
## (Intercept)        Hours
##       2.484        9.776
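The result gives the intercept (b0 = 2.484) and the beta coefficient of Hours (b1 = 9.776): each additional hour of study is associated with an increase of about 9.78 points in the predicted score.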
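For simple linear regression the least-squares estimates have a closed form: b1 = cov(Hours, Scores) / var(Hours) and b0 = mean(Scores) - b1 * mean(Hours). The sketch below checks this against `lm()`. It assumes a small data frame `s_head` built from the six rows shown by `head(s_scores)`, so the coefficients it prints are not the full-data values 2.484 and 9.776.

```r
# Six rows copied from the head(s_scores) output; NOT the full data set
s_head <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

# Closed-form least-squares estimates for one predictor
b1 <- cov(s_head$Hours, s_head$Scores) / var(s_head$Hours)  # slope
b0 <- mean(s_head$Scores) - b1 * mean(s_head$Hours)         # intercept

# Same model fitted with lm()
fit <- lm(Scores ~ Hours, data = s_head)

print(c(b0 = b0, b1 = b1))
print(coef(fit))
```

The two printouts should match, since `lm()` solves the same least-squares problem.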
Display the scatter plot with the simple linear regression line:
## `geom_smooth()` using formula 'y ~ x'
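The chunk that drew this plot is not shown in the rendered output. A plausible reconstruction, assuming ggplot2 is installed, is a point layer plus `geom_smooth(method = "lm")`. The object name `p` and the six-row data frame `s_head` (copied from the `head(s_scores)` output) are assumptions for illustration, not the author's original code.

```r
library(ggplot2)

# Six rows copied from the head(s_scores) output; NOT the full data set
s_head <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

# Scatter plot with a straight least-squares line;
# method = "lm" replaces the default loess smoother
p <- ggplot(s_head, aes(x = Hours, y = Scores)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

p
```

Dropping the `geom_smooth()` layer leaves the plain scatter plot, and calling `geom_smooth()` with no arguments gives the loess curve reported in the earlier message.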
Display the statistical summary of the model:
summary(linear_model)
##
## Call:
## lm(formula = Scores ~ Hours, data = s_scores)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -10.578  -5.340   1.839   4.593   7.265
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.4837     2.5317   0.981    0.337
## Hours         9.7758     0.4529  21.583   <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.603 on 23 degrees of freedom
## Multiple R-squared: 0.9529, Adjusted R-squared: 0.9509
## F-statistic: 465.8 on 1 and 23 DF, p-value: < 2.2e-16
The summary output shows six components:

* Call. Shows the function call used to compute the regression model.
* Residuals. Provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
* Coefficients. Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked by stars.
* Residual standard error (RSE), R-squared (R2) and the F-statistic. These metrics are used to check how well the model fits the data.
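The components listed above can also be pulled out of the summary object directly. The sketch below, again using an assumed six-row data frame `s_head` taken from the `head(s_scores)` output, checks three facts: the residuals average to zero, the RSE is the square root of the residual sum of squares divided by the residual degrees of freedom, and in simple linear regression R-squared equals the squared correlation (on the full data, 0.9761907^2 is about 0.9529, matching the output above).

```r
# Six rows copied from the head(s_scores) output; NOT the full data set
s_head <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

fit <- lm(Scores ~ Hours, data = s_head)
s   <- summary(fit)

# Residuals of a least-squares fit with an intercept average to zero
res_mean <- mean(residuals(fit))

# Residual standard error computed by hand
rse_manual <- sqrt(sum(residuals(fit)^2) / df.residual(fit))

# R-squared equals the squared Pearson correlation for one predictor
r_squared_from_cor <- cor(s_head$Hours, s_head$Scores)^2

print(s$coefficients)   # estimates, standard errors, t values, p-values
print(c(rse = s$sigma, rse_manual = rse_manual))
print(c(r2 = s$r.squared, cor_squared = r_squared_from_cor))
```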
confint(linear_model)
##                 2.5 %    97.5 %
## (Intercept) -2.753470  7.720817
## Hours        8.838823 10.712784
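The intervals returned by `confint()` are 95% confidence intervals of the form estimate ± t * standard error, where t is the 97.5% quantile of the t distribution with the residual degrees of freedom. As a sketch, they can be reproduced from the rounded numbers printed by `summary(linear_model)`. The variable names are illustrative, and small differences in the last digits come from that rounding.

```r
# Values copied from the summary(linear_model) output
estimate  <- c("(Intercept)" = 2.4837, "Hours" = 9.7758)
std_error <- c("(Intercept)" = 2.5317, "Hours" = 0.4529)
df_resid  <- 23  # residual degrees of freedom

# Critical value for a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)

ci <- cbind(
  lower = estimate - t_crit * std_error,
  upper = estimate + t_crit * std_error
)
print(round(ci, 3))
```

The interval for the intercept includes zero, which is consistent with its non-significant p-value of 0.337, while the interval for Hours lies well above zero.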
To check the model's accuracy, I compute the prediction error rate: the residual standard error expressed as a percentage of the mean score.
sigma(linear_model)*100/mean(s_scores$Scores)
## [1] 10.88395
So the prediction error rate is about 10.88%.
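`sigma()` returns the residual standard error, so the same figure can be checked by hand from the rounded values reported earlier (RSE = 5.603 from the model summary, mean score = 51.48 from the data summary). This is only a sanity check under those rounded inputs; the variable names are illustrative.

```r
# Rounded values copied from the outputs above
rse        <- 5.603   # residual standard error from summary(linear_model)
mean_score <- 51.48   # mean of Scores from summary(s_scores)

# Residual standard error as a percentage of the mean outcome
error_rate <- rse * 100 / mean_score
print(round(error_rate, 2))
```

The result agrees with the 10.88395 computed from the unrounded model to about two decimal places.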
Predict the score for 9.25 hours of study, as the task requires.
predict(linear_model, data.frame(Hours = 9.25))
## 1
## 92.90985
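The prediction is just the fitted equation Scores = b0 + b1 * Hours evaluated at Hours = 9.25. A minimal sketch using the coefficients printed by `summary(linear_model)` follows; `predict_score` is a hypothetical helper written for illustration, not part of the original analysis.

```r
# Coefficients copied from the summary(linear_model) output
b0 <- 2.4837  # intercept
b1 <- 9.7758  # slope for Hours

# Hypothetical helper: evaluate the fitted line at a given number of hours
predict_score <- function(hours) {
  b0 + b1 * hours
}

print(predict_score(9.25))
```

This reproduces the value returned by `predict()`. Note that 9.25 hours lies just beyond the largest observed value of Hours (9.2), so the prediction is a slight extrapolation.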