
Linear Regression Using R

In this section we will see how the R language can be used to implement regression functions. We will start with simple linear regression involving two variables.

Simple Linear Regression

In this linear regression task, we predict the percentage of marks that a student is expected to score based on the number of hours they studied. This is a simple linear regression task because it involves just two variables.

Set the Working Directory for Reading the Data Set

To get the current working directory:

getwd()
## [1] "C:/Users/Ali Raza/Desktop/Sparks Foundation/Task#1"

To set the working directory to the folder where the data set is saved:

setwd("C:/Users/Ali Raza/Desktop/Sparks Foundation/Task#1")

Data Set

Read the data set provided for the task. It contains the hours studied and the scores obtained by students.

s_scores <- read.csv("student_scores - student_scores.csv")

Inspect the Data Set

View the First Rows of the Data Set

head(s_scores)
##   Hours Scores
## 1   2.5     21
## 2   5.1     47
## 3   3.2     27
## 4   8.5     75
## 5   3.5     30
## 6   1.5     20

Check the Summary Statistics of the Data Set

summary(s_scores)
##      Hours           Scores     
##  Min.   :1.100   Min.   :17.00  
##  1st Qu.:2.700   1st Qu.:30.00  
##  Median :4.800   Median :47.00  
##  Mean   :5.012   Mean   :51.48  
##  3rd Qu.:7.400   3rd Qu.:75.00  
##  Max.   :9.200   Max.   :95.00

Load the Required R Packages

library(ggplot2)
library(tidyverse)
library(ggpubr)

Visualization

The columns used for plotting are Hours and Scores.

Scatter plot

Create a scatter plot of Hours and Scores.

## Warning: package 'ggpubr' was built under R version 4.1.1
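
The code that produced the scatter plot is not included in the output above. A minimal sketch of how such a plot could be drawn with ggplot2 follows. It assumes the column names Hours and Scores and uses only the six rows printed by head() as stand-in data; the names sample_scores and scatter_plot are illustrative and do not come from the original analysis.

```r
library(ggplot2)

# Stand-in data: the six rows printed by head(s_scores), not the full data set
sample_scores <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

# Scatter plot of Scores against Hours
scatter_plot <- ggplot(sample_scores, aes(x = Hours, y = Scores)) +
  geom_point() +
  labs(title = "Hours studied vs. scores", x = "Hours", y = "Scores")

scatter_plot
```

With the full data set, replacing sample_scores with s_scores should give the corresponding plot for all observations.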

Scatter Plot with Regression Line

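Create a scatter plot of Hours and Scores with a fitted trend line. As the message below indicates, geom_smooth() was called with its default settings here, so the line is a loess smooth rather than a straight least-squares line.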

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The graph above shows a linearly increasing relationship between the Hours and Scores variables. This is a good sign, because one important assumption of linear regression is that the relationship between the outcome and predictor variables is linear and additive.

So the graph shows a positive relationship.

Correlation

Compute the correlation coefficient between the Hours and Scores of the students.

cor(s_scores$Scores, s_scores$Hours)
## [1] 0.9761907

The correlation coefficient measures the strength of the linear association between two variables x and y.

Its value ranges between -1 (perfect negative correlation: when x increases, y decreases) and +1 (perfect positive correlation: when x increases, y increases).

A value closer to 0 suggests a weak relationship between the variables. A low correlation coefficient r (-0.2 < r < 0.2) probably suggests that much of the variation of the outcome variable (y) is not explained by the predictor (x). In such a case, we should probably look for better predictor variables.

In our task, the correlation coefficient (about 0.98) is large enough, so we can continue by building a linear model of y as a function of x.
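
To make the definition concrete, the sketch below computes the Pearson correlation coefficient directly from its formula and compares it with cor(). It uses the six rows printed by head() as stand-in data, so the value it gives will differ from the 0.976 reported for the full data set; sample_scores, r_manual and r_builtin are illustrative names.

```r
# Stand-in data: the six rows printed by head(s_scores), not the full data set
sample_scores <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

x <- sample_scores$Hours
y <- sample_scores$Scores

# Pearson r: sum of cross-products of the deviations from the means,
# divided by the square root of the product of the sums of squared deviations
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

# Built-in computation (Pearson is the default method of cor())
r_builtin <- cor(x, y)

# The two values agree up to floating-point tolerance
all.equal(r_manual, r_builtin)
```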

Linear Regression Model

Simple linear regression tries to find the best-fitting line for predicting Scores on the basis of Hours.

The linear model equation can be written as follows: Scores = b0 + b1 * Hours, where b0 is the intercept and b1 is the slope (the beta coefficient of Hours).

The R function lm() can be used to determine the beta coefficients of the linear model:

linear_model <- lm(Scores~Hours, data = s_scores)
linear_model
## 
## Call:
## lm(formula = Scores ~ Hours, data = s_scores)
## 
## Coefficients:
## (Intercept)        Hours  
##       2.484        9.776

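The result gives the intercept (b0 = 2.484) and the beta coefficient of Hours (b1 = 9.776), so the fitted line is Scores = 2.484 + 9.776 * Hours. Each additional hour of study is associated with an increase of about 9.8 points in the predicted score.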
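
For a single predictor, the least-squares coefficients have a closed form: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept makes the line pass through the point of means. The sketch below checks this against lm() on the six rows printed by head(), used here as stand-in data; the names sample_scores, fit, b0 and b1 are illustrative, and the coefficients will differ from those of the full data set.

```r
# Stand-in data: the six rows printed by head(s_scores), not the full data set
sample_scores <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

x <- sample_scores$Hours
y <- sample_scores$Scores

# Closed-form least-squares estimates for one predictor
b1 <- cov(x, y) / var(x)        # slope
b0 <- mean(y) - b1 * mean(x)    # intercept

# Same model fitted with lm()
fit <- lm(Scores ~ Hours, data = sample_scores)

# The hand-computed values match the lm() coefficients
all.equal(unname(coef(fit)), c(b0, b1))
```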

Plotting

Display the scatter plot with the simple linear regression line.

## `geom_smooth()` using formula 'y ~ x'
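
The plotting code is not shown in the output above. One way such a plot could be produced, sketched here under the assumption that ggplot2 was used, is to pass method = "lm" to geom_smooth() so that a straight least-squares line is drawn instead of the default smoother. The data are the six rows printed by head(), and sample_scores and regression_plot are illustrative names.

```r
library(ggplot2)

# Stand-in data: the six rows printed by head(s_scores), not the full data set
sample_scores <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

# Scatter plot with a least-squares regression line and its confidence band
regression_plot <- ggplot(sample_scores, aes(x = Hours, y = Scores)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE) +
  labs(x = "Hours", y = "Scores")

regression_plot
```

Supplying formula = y ~ x explicitly avoids the informational message that geom_smooth() prints when it has to choose the formula itself.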

Summary of Model

The statistical summary of the model is displayed below.

summary(linear_model)
## 
## Call:
## lm(formula = Scores ~ Hours, data = s_scores)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.578  -5.340   1.839   4.593   7.265 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.4837     2.5317   0.981    0.337    
## Hours         9.7758     0.4529  21.583   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.603 on 23 degrees of freedom
## Multiple R-squared:  0.9529, Adjusted R-squared:  0.9509 
## F-statistic: 465.8 on 1 and 23 DF,  p-value: < 2.2e-16

The summary output shows six components:

* Call. Shows the function call used to compute the regression model.
* Residuals. Provides a quick view of the distribution of the residuals, which by definition have a mean of zero. Therefore, the median should not be far from zero, and the minimum and maximum should be roughly equal in absolute value.
* Coefficients. Shows the regression beta coefficients and their statistical significance. Predictor variables that are significantly associated with the outcome variable are marked with stars.
* Residual standard error (RSE), R-squared (R2) and the F-statistic. These metrics are used to check how well the model fits our data.
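
The components listed above can also be read programmatically from the object returned by summary(). The sketch below shows the relevant fields on a model fitted to the six rows printed by head(), which serve as stand-in data; sample_scores, fit and fit_summary are illustrative names, and the values will differ from those of the full data set.

```r
# Stand-in data: the six rows printed by head(s_scores), not the full data set
sample_scores <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

fit <- lm(Scores ~ Hours, data = sample_scores)
fit_summary <- summary(fit)

fit_summary$call            # the function call
fit_summary$residuals       # residuals of the fitted model
fit_summary$coefficients    # estimates, standard errors, t values, p-values
fit_summary$sigma           # residual standard error (RSE)
fit_summary$r.squared       # multiple R-squared
fit_summary$adj.r.squared   # adjusted R-squared
fit_summary$fstatistic      # F value with numerator and denominator df
```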

Coefficients significance

confint(linear_model)
##                 2.5 %    97.5 %
## (Intercept) -2.753470  7.720817
## Hours        8.838823 10.712784
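
The 95% interval for Hours (8.84 to 10.71) excludes zero, which agrees with the very small p-value in the summary, while the interval for the intercept includes zero. Each interval is the estimate plus or minus a t quantile times the standard error; for Hours this is roughly 9.7758 ± 2.069 * 0.4529. The sketch below reproduces that calculation and compares it with confint(), using the six rows printed by head() as stand-in data; sample_scores, fit and ci_manual are illustrative names.

```r
# Stand-in data: the six rows printed by head(s_scores), not the full data set
sample_scores <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)

fit <- lm(Scores ~ Hours, data = sample_scores)

coef_table <- summary(fit)$coefficients
estimates  <- coef_table[, "Estimate"]
std_errors <- coef_table[, "Std. Error"]

# Critical t value for a two-sided 95% interval on the residual degrees of freedom
t_crit <- qt(0.975, df = df.residual(fit))

ci_manual <- cbind(
  lower = estimates - t_crit * std_errors,
  upper = estimates + t_crit * std_errors
)

# The hand-computed bounds match confint()
all.equal(unname(ci_manual), unname(confint(fit)))
```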

Check the Model

To check the model's accuracy, I compute the prediction error rate: the residual standard error divided by the mean of Scores, expressed as a percentage.

sigma(linear_model)*100/mean(s_scores$Scores)
## [1] 10.88395

So the prediction error rate is about 10.88%.

Make Prediction

Predict the score for a student who studies 9.25 hours, as required by the task.

predict(linear_model, data.frame(Hours = 9.25))
##        1 
## 92.90985
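
The predicted value can be verified by plugging 9.25 into the fitted equation with the coefficients reported in the summary above. The sketch below does that, and then shows how predict() can also return a prediction interval. The second part uses the six rows printed by head() as stand-in data, so its numbers will differ from those of the full model; manual_prediction, sample_scores, fit and pred_interval are illustrative names.

```r
# Coefficients as reported in the model summary above
b0 <- 2.4837
b1 <- 9.7758

# Fitted equation evaluated at 9.25 hours: about 92.91, matching predict()
manual_prediction <- b0 + b1 * 9.25
manual_prediction

# Stand-in data: the six rows printed by head(s_scores), not the full data set
sample_scores <- data.frame(
  Hours  = c(2.5, 5.1, 3.2, 8.5, 3.5, 1.5),
  Scores = c(21, 47, 27, 75, 30, 20)
)
fit <- lm(Scores ~ Hours, data = sample_scores)

# Point prediction with a 95% prediction interval (columns fit, lwr, upr)
pred_interval <- predict(fit, newdata = data.frame(Hours = 9.25),
                         interval = "prediction", level = 0.95)
pred_interval
```

Note that 9.25 hours lies slightly above the largest value of Hours in the data (9.2 in the summary statistics), so the prediction is a mild extrapolation.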