I will be using the same data in this project as well.
eq <- read.csv("https://raw.githubusercontent.com/khatriprajwol/NEW-Data/main/database.csv")
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rvest)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5 ✓ purrr 0.3.4
## ✓ tibble 3.1.4 ✓ stringr 1.4.0
## ✓ tidyr 1.1.4 ✓ forcats 0.5.1
## ✓ readr 2.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag() masks stats::lag()
This project uses linear regression, a popular data mining method in R and a simple approach to supervised learning. Generally, regression is a statistical method to interpret data and determine whether it follows any trend. In fact, regression is used to determine the relationship between the dependent and independent variables of a data set. It helps us understand how the dependent variable changes when one of the independent variables is changed while the other independent variables are held constant. This helps in building a regression model and, further, in forecasting values with respect to a change in one of the independent variables. I chose to perform the regression in R rather than by traditional hand calculation. In this project I have performed a simple \(y \sim x\) regression, including plots and statistics. I have also performed a multiple regression, \(y \sim x_1 + x_2 + \cdots + x_n\), including plots and statistics. In addition, I have used a multiple regression including an interaction, \(y \sim x_1 : x_n\), with plots and statistics.
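As a quick reference, this is the formula syntax that R's \(lm()\) function uses for those three model types (a minimal sketch with placeholder variable names, not the actual models fitted below):
# Formula syntax for the three model types explored in this project:
lm(y ~ x)                # simple linear regression
lm(y ~ x1 + x2 + x3)     # multiple regression with several predictors
lm(y ~ x1 + x2 + x1:x2)  # multiple regression with an interaction term
# shorthand: y ~ x1 * x2 expands to x1 + x2 + x1:x2
With that notation in mind, let's pull out the columns we need from the data.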
eq <- data.frame(Latitude=eq$Latitude, Longitude=eq$Longitude, Type=eq$Type, Depth=eq$Depth, Magnitude=eq$Magnitude, Status=eq$Status)
head(eq)
## Latitude Longitude Type Depth Magnitude Status
## 1 19.246 145.616 Earthquake 131.6 6.0 Automatic
## 2 1.863 127.352 Earthquake 80.0 5.8 Automatic
## 3 -20.579 -173.972 Earthquake 20.0 6.2 Automatic
## 4 -59.076 -23.557 Earthquake 15.0 5.8 Automatic
## 5 11.938 126.427 Earthquake 15.0 5.8 Automatic
## 6 -13.405 166.629 Earthquake 35.0 6.7 Automatic
tail(eq)
## Latitude Longitude Type Depth Magnitude Status
## 23407 38.3754 -118.8977 Earthquake 10.80 5.6 Reviewed
## 23408 38.3917 -118.8941 Earthquake 12.30 5.6 Reviewed
## 23409 38.3777 -118.8957 Earthquake 8.80 5.5 Reviewed
## 23410 36.9179 140.4262 Earthquake 10.00 5.9 Reviewed
## 23411 -9.0283 118.6639 Earthquake 79.00 6.3 Reviewed
## 23412 37.3973 141.4103 Earthquake 11.94 5.5 Reviewed
summary(eq)
## Latitude Longitude Type Depth
## Min. :-77.080 Min. :-180.00 Length:23412 Min. : -1.10
## 1st Qu.:-18.653 1st Qu.: -76.35 Class :character 1st Qu.: 14.52
## Median : -3.568 Median : 103.98 Mode :character Median : 33.00
## Mean : 1.679 Mean : 39.64 Mean : 70.77
## 3rd Qu.: 26.191 3rd Qu.: 145.03 3rd Qu.: 54.00
## Max. : 86.005 Max. : 180.00 Max. :700.00
## Magnitude Status
## Min. :5.500 Length:23412
## 1st Qu.:5.600 Class :character
## Median :5.700 Mode :character
## Mean :5.883
## 3rd Qu.:6.000
## Max. :9.100
Because both of our variables, depth and magnitude, are quantitative, running the \(summary()\) function prints a numeric summary table in the console. This tells us the minimum, first quartile, median, mean, third quartile, and maximum values of the independent variable \((Magnitude)\) and the dependent variable \((Depth)\):
plot(Depth ~ Magnitude, data = eq)
Let’s find out whether the data follows any linearity. For linear regression, the relationship between the independent and dependent variables must be linear. We can test this visually with a scatter plot, checking whether the distribution of data points could be described with a straight line. The relationship does not look linear to me, which makes me doubt my data to begin with. Rather than linear, the trend looks sinusoidal, or at least non-linear.
cor(eq$Magnitude, eq$Depth, method = "pearson")
## [1] 0.02345731
Here I have used the Pearson method, which is based on Pearson’s product-moment correlation coefficient \(cor(x, y)\); the test statistic follows a \(t\) distribution with \(length(x)-2\) degrees of freedom when the samples come from independent normal distributions. I use this function to test the relationship between my variables and make sure they aren’t too highly correlated. When I run this code, the output is 0.0235. The correlation between depth and magnitude is small (0.0235 is only a 2.35% correlation), so we can include both parameters in our model.
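Since I suspect the trend may be non-linear, a rank-based correlation is a reasonable cross-check; \(cor()\) also accepts a Spearman method (this line is an extra check, not part of the original analysis):
cor(eq$Magnitude, eq$Depth, method = "spearman")  # rank-based alternative to Pearson
R’s \(cor.test()\) gives a fuller picture of the Pearson test: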
cor.test(eq$Depth, eq$Magnitude)
##
## Pearson's product-moment correlation
##
## data: eq$Depth and eq$Magnitude
## t = 3.59, df = 23410, p-value = 0.0003313
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01065099 0.03625594
## sample estimates:
## cor
## 0.02345731
Let’s also repeat this using the definition of correlation, calculating the coefficient directly from the formula below. \[ r = \frac{\sum_{i=1}^n(x_i-\overline x)(y_i-\overline y )}{(n-1)s_xs_y} \]
x <- eq$Magnitude
y <- eq$Depth
sx <- sd(x)    # sample standard deviation of x
sy <- sd(y)    # sample standard deviation of y
n <- length(x)
xbar <- mean(x)
ybar <- mean(y)
numerator <- sum((x - xbar) * (y - ybar))
r <- numerator / ((n - 1) * sx * sy)
r
## [1] 0.02345731
Here, instead of writing eq$Magnitude and eq$Depth several times, I have assigned them to \(x\) and \(y\) to make the code shorter and less confusing. We receive the same result using the formula, i.e. \(r = 0.0235\).
lm(y ~ x)
##
## Call:
## lm(formula = y ~ x)
##
## Coefficients:
## (Intercept) x
## 30.763 6.801
In R, the \(lm()\), or “linear model,” function can be used to create a simple regression model. The \(lm()\) function accepts a number of arguments. The formula argument follows a specific format: for simple linear regression, it is \(YVAR \sim XVAR\), where \(YVAR\) is the dependent, or predicted, variable and \(XVAR\) is the independent, or predictor, variable.
fit <- lm(y ~ x)  # regress depth (y) on magnitude (x)
slope <- coef(fit)[2]
intercept <- coef(fit)[1]
yhat <- function(xpredict) {
  return(slope * xpredict + intercept)
}
yhat(10)
## x
## 98.76901
The simple linear regression tries to find the best line to predict depth on the basis of magnitude. The fitted linear model can be written as \(\hat y = 30.763 + 6.801x\), which is exactly what the \(yhat\) function above computes. The R function \(lm()\) can be used to determine the beta coefficients of the linear model. When the magnitude \(x\) is 10, the predicted depth \(\hat y\) is 98.8.
xToPredict <- data.frame(x = c(10))
predict(fit, newdata = xToPredict)
## 1
## 98.76901
This is a second method using a different command. Here I only pass a data frame to the newdata argument, which is handy because it also has a nice feature for computing confidence intervals.
predict(fit, newdata = xToPredict, interval = 'confidence')
## fit lwr upr
## 1 98.76901 83.40065 114.1374
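As a side note, \(predict()\) can also produce the wider interval for an individual new observation rather than for the mean response; a minimal sketch using the same point:
predict(fit, newdata = xToPredict, interval = 'prediction')  # interval for a single new observation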
It looks like the fit is in the range, but it is not a perfect fit. The lower limit is 83.4, the upper limit is 114.1, and the fit is 98.8; the range gives us a 95% confidence interval. Lastly, we should perform a hypothesis test. Here the null hypothesis says there is no correlation between \(y\) and \(x\), while the alternative hypothesis says \(x\) and \(y\) are related to some degree. \[ H_0: \rho =0\\ H_A: \rho\neq 0 \] We have already computed our estimate of the test statistic \(r\). Then \(t\) is \[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \] Here the degrees of freedom will be \(df=n-2\)
t = r*sqrt(n-2)/sqrt(1-r^2)
t
## [1] 3.590033
As we can see, the \(t\) value is 3.59.
pt(t,n-2)
## [1] 0.9998343
At first I read this as a very high \(p\) value, close to 1 rather than 0. However, \(pt(t, n-2)\) returns the lower-tail probability of the \(t\) distribution, not the \(p\) value itself. For our two-sided alternative, the \(p\) value is \(2(1 - 0.9998343) \approx 0.00033\), which matches the \(cor.test()\) output above. Therefore I reject the null hypothesis: there is a statistically significant, though very weak, relationship between \(x\) and \(y\).
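A quick check of that two-sided \(p\) value directly in R:
2 * (1 - pt(t, n - 2))  # two-sided p-value; matches cor.test()'s 0.0003313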
If instead we want to test the slope of the line, we consider the following as the theoretical fit \[ \hat y = \beta_0+\beta_1x \]
The standard error for the slope is \[ SE_{\beta_1} = \sqrt{\frac{MSE}{\sum\left(x_i-\overline x\right)^2}} \] where \[ MSE = \frac{\sum_{i=1}^n\left(y_i-\hat y_i\right)^2}{n-2} \] \(MSE\) is referred to as the mean square error. It adds up all the squared errors and divides by \(n-2\) rather than \(n\), to account for the degrees of freedom.
yhat <- predict(fit)  # fitted values at the observed x's (this overwrites the yhat function above)
MSE <- sum((y - yhat)^2) / (n - 2)
MSE
## [1] 15035.85
The denominator in the \(SE\) computation above is sometimes called \(S_{xx}\), the sum of the squared deviations of \(x\) from its mean.
sxx <- sum((x - xbar)^2)
Then we can now compute the standard error for the slope as:
SEslope <- sqrt(MSE / sxx)
SEslope
## [1] 1.894289
If I wanted to look at the intercept, \(\beta_0\), its standard error is \[ SE_{\beta_0} = \sqrt{MSE\left(\frac1n+\frac{\overline x^2}{S_{xx}}\right)} \]
This is straightforward, as I have already computed all of these values.
SEintercept <- sqrt(MSE * (1/n + xbar^2 / sxx))
SEintercept
## [1] 11.17199
Both \(\beta_0\) and \(\beta_1\) are Student-\(t\) distributed with degrees of freedom \(df = n-2\).
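With the standard errors in hand, we can reproduce the \(t\) values that \(summary(fit)\) reports further below; a minimal sketch using the quantities already computed:
slope / SEslope                       # t value for the slope (about 3.59)
intercept / SEintercept               # t value for the intercept (about 2.75)
2 * (1 - pt(slope / SEslope, n - 2))  # two-sided p-value for the slope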
If I want to do the confidence interval for a prediction by hand, it is very similar. For an \(x_h\) that we want to predict, \[ SE_{x_h}=\sqrt{MSE\left(\frac1n+\frac{\left(x_h-\overline x\right)^2}{S_{xx}}\right)} \]
Again, this is Student-\(t\) distributed. I won’t work through the full computation by hand, since I would rather let R do it.
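Still, as a sanity check, here is a minimal sketch of the by-hand 95% interval at \(x_h = 10\), using the quantities computed above; it should reproduce the interval \(predict()\) gave earlier (roughly 83.4 to 114.1):
xh <- 10
SExh <- sqrt(MSE * (1/n + (xh - xbar)^2 / sxx))  # SE of the mean response at xh
(slope * xh + intercept) + c(-1, 1) * qt(0.975, n - 2) * SExh
Now let's have R summarize the whole fit for us: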
summary(fit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -80.50 -57.71 -37.51 -16.25 631.83
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.763 11.172 2.754 0.005899 **
## x 6.801 1.894 3.590 0.000331 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 122.6 on 23410 degrees of freedom
## Multiple R-squared: 0.0005502, Adjusted R-squared: 0.0005076
## F-statistic: 12.89 on 1 and 23410 DF, p-value: 0.0003313
I was thinking about this problem a few hours later and, lo and behold, there is a way to get R to tell us all of these values at once. Passing the fit to the \(summary()\) function, we see all of these quantities computed by R: the standard errors, and even the \(MSE\) (via the residual standard error, which is its square root), appear right in the summary above.
plot(x, y)
abline(fit)
Well, I was expecting that my data would not behave well, because I felt it was not the best data to start with. Looking at my graph, I am unsure what I should predict from it. There is no doubt the data has high variance. The pattern is not linear; instead it looks sinusoidal.
plot(fit)
These are the residual diagnostic plots produced by the code above.
Residuals are the unexplained \(variance\). They are not exactly the same as model error, but they are calculated from it, so seeing a bias in the residuals would also indicate a bias in the error.
In the \(Normal\) \(Q\)-\(Q\) plot in the top right, we can see that the real residuals from our model form an almost perfectly one-to-one line with the theoretical residuals from a perfect model. Based on these residuals, we can say that our model meets the assumption of normally distributed residuals.
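Note that \(plot(fit)\) draws its four diagnostic plots one at a time; to see them together in a grid (which puts the Q-Q plot in the top right), the plotting region can be split first:
par(mfrow = c(2, 2))  # 2 x 2 grid: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage
plot(fit)
par(mfrow = c(1, 1))  # restore the default single-plot layout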
The best-fitting multiple regression equation here is of the form \(\hat y=b_0+b_1x+b_2x^2+b_3x^3\). The graph of such an equation is called a cubic. Using the same data set, and letting \(x_1=x\), \(x_2=x^2\), and \(x_3=x^3\), I find the multiple regression equation for the cubic that best fits the data.
xsquare <- x^2
xcube <- x^3
quadraticfit <- lm(y ~ x + xsquare + xcube)  # despite the name, this is a cubic fit
summary(quadraticfit)
##
## Call:
## lm(formula = y ~ x + xsquare + xcube)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.74 -57.44 -37.82 -16.34 628.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3132.483 807.048 3.881 0.000104 ***
## x -1408.079 374.754 -3.757 0.000172 ***
## xsquare 212.993 57.635 3.696 0.000220 ***
## xcube -10.573 2.935 -3.602 0.000316 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 122.6 on 23408 degrees of freedom
## Multiple R-squared: 0.001301, Adjusted R-squared: 0.001173
## F-statistic: 10.16 on 3 and 23408 DF, p-value: 1.099e-06
Clearly, the summary provides all the information: we can see the \(intercept\), the coefficient on each term, and everything else. I really like this feature of R. One thing I would like to add is that the adjusted \(R^2\) value shows how well the model fits the data; a value close to 1 indicates a good fit. Without any doubt, ours is not even close to 1; rather, it is close to 0. Therefore, even though the coefficients are statistically significant, the model explains almost none of the variance, so this was not good data for making predictions.
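Incidentally, the same cubic could be fit without creating the helper columns, using \(I()\) or \(poly()\) inside the formula; an equivalent sketch:
lm(y ~ x + I(x^2) + I(x^3))     # I() protects ^ from formula interpretation
lm(y ~ poly(x, 3, raw = TRUE))  # raw (non-orthogonal) polynomial of degree 3
Next, I need the ranges of \(x\) and \(y\) to set up the plot: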
range(x)
## [1] 5.5 9.1
range(y)
## [1] -1.1 700.0
To be honest, I was confused about what ranges I should use for \(x\) and \(y\). Therefore, I used R’s built-in \(range()\) function to see what the ranges of \(x\) and \(y\) were. And it worked; at first the graph was messy, but it actually worked.
eqn <- function(x) {
  coef(quadraticfit)[1] +    # trailing '+' keeps R parsing this as a single expression
    coef(quadraticfit)[2] * x +
    coef(quadraticfit)[3] * x^2 +
    coef(quadraticfit)[4] * x^3
}
plot(x = 1, ylim = c(-1.1, 700), xlim = c(5.5, 9.1), xlab = "Magnitude", ylab = "Depth")
points(x, y)            # scatter of the observed data
curve(eqn, add = TRUE)  # overlay the fitted cubic
In conclusion, linear regression is a regression model that uses a straight line to describe the relationship between variables. It finds the line of best fit through the data by searching for the value of the regression coefficient(s) that minimizes the total squared error of the model.
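As a quick illustration of that minimization idea, here is a small sketch (the 0.5 nudge is arbitrary): any coefficients other than the fitted ones should give a larger residual sum of squares.
rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)
rss(intercept, slope)        # RSS at the fitted coefficients
rss(intercept, slope + 0.5)  # nudging the slope away increases the RSS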
There are two main types of linear regression: simple linear regression uses only one independent variable, while multiple linear regression uses two or more independent variables. Hence, in this project I was able to successfully apply the knowledge from the materials, test both simple linear regression and multiple linear regression, and see how they behave. This was harder than I thought it would be; however, it was fun and I learned a lot.