I will be using the same data in this project as well.

eq <- read.csv("https://raw.githubusercontent.com/khatriprajwol/NEW-Data/main/database.csv")
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rvest)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.5     ✓ purrr   0.3.4
## ✓ tibble  3.1.4     ✓ stringr 1.4.0
## ✓ tidyr   1.1.4     ✓ forcats 0.5.1
## ✓ readr   2.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()         masks stats::filter()
## x readr::guess_encoding() masks rvest::guess_encoding()
## x dplyr::lag()            masks stats::lag()

This project contains the linear regression which is popular data mining method in R and it is a simple approach to supervised learning. Generally, regression is a statistical method to interpret data and determine if how it behaves or follows any trend. In fact regression is used to determine the relationship between the dependent and independent variables of the data set. It helps to understand how dependent variables change when one of the independent variables is changed and other independent variable is kept constant. This helps in building a regression model and further, helps in forecasting the values with respect to a change in one of the independent variables. I have chose technology to preform regression over traditional method of calculation, this project I have preformed a simple \(y ∼ x\) regression Including plots and statistics. I have also preformed a multiple regression, \(y ∼ x1 + x2 + · · · + xn\) which include plots and statistics. In addition I have used a multiple regression including an interaction, \(y ∼ x1 : xn\) by including the plots and statistics.

eq <- data.frame(Latitude=eq$Latitude, Longitude=eq$Longitude, Type=eq$Type, Depth=eq$Depth, Magnitude=eq$Magnitude, Status=eq$Status)
head(eq)
##   Latitude Longitude       Type Depth Magnitude    Status
## 1   19.246   145.616 Earthquake 131.6       6.0 Automatic
## 2    1.863   127.352 Earthquake  80.0       5.8 Automatic
## 3  -20.579  -173.972 Earthquake  20.0       6.2 Automatic
## 4  -59.076   -23.557 Earthquake  15.0       5.8 Automatic
## 5   11.938   126.427 Earthquake  15.0       5.8 Automatic
## 6  -13.405   166.629 Earthquake  35.0       6.7 Automatic
tail(eq)
##       Latitude Longitude       Type Depth Magnitude   Status
## 23407  38.3754 -118.8977 Earthquake 10.80       5.6 Reviewed
## 23408  38.3917 -118.8941 Earthquake 12.30       5.6 Reviewed
## 23409  38.3777 -118.8957 Earthquake  8.80       5.5 Reviewed
## 23410  36.9179  140.4262 Earthquake 10.00       5.9 Reviewed
## 23411  -9.0283  118.6639 Earthquake 79.00       6.3 Reviewed
## 23412  37.3973  141.4103 Earthquake 11.94       5.5 Reviewed
summary(eq)
##     Latitude         Longitude           Type               Depth       
##  Min.   :-77.080   Min.   :-180.00   Length:23412       Min.   : -1.10  
##  1st Qu.:-18.653   1st Qu.: -76.35   Class :character   1st Qu.: 14.52  
##  Median : -3.568   Median : 103.98   Mode  :character   Median : 33.00  
##  Mean   :  1.679   Mean   :  39.64                      Mean   : 70.77  
##  3rd Qu.: 26.191   3rd Qu.: 145.03                      3rd Qu.: 54.00  
##  Max.   : 86.005   Max.   : 180.00                      Max.   :700.00  
##    Magnitude        Status         
##  Min.   :5.500   Length:23412      
##  1st Qu.:5.600   Class :character  
##  Median :5.700   Mode  :character  
##  Mean   :5.883                     
##  3rd Qu.:6.000                     
##  Max.   :9.100

Because both our variables i.e. depth and magnitude are quantitative, when we run \(summary()\) function we see a table in our console with a numeric or other type in the summary of the data. This tells us the minimum, median, mean, and maximum values of the independent variable \((Depth)\) and dependent variable \((Magnitude)\):

plot(Depth ~ Magnitude, data = eq)

Let’s find out if the data follows any linearity. Here, the relationship between the independent and dependent variable must be linear. We can test this visually with a scatter plot to see if the distribution of data points could be described with a straight line.The relationship does not look linear to me, so I am having doubt for my data to begin with. Instead of linear it follows the \(sinusoidal\) trend or perhaps non-linear trend.

cor( eq $Magnitude,eq $Depth, method= "pearson")
## [1] 0.02345731

Here, I have used the pearson method which is based on Pearson’s product moment correlation coefficient \(cor(x, y)\) and follows a t distribution with \(length(x)-2\) degrees of freedom to see if the samples follow independent normal distributions. function to test the relationship between my independent variables and make sure they aren’t too highly correlated. When I run this code, the output is 0.0235. The correlation between depth and magnitude is small (0.0235 is only a 2.35% correlation), so we can include both parameters in our model.

cor.test( eq $Depth,eq $Magnitude)
## 
##  Pearson's product-moment correlation
## 
## data:  eq$Depth and eq$Magnitude
## t = 3.59, df = 23410, p-value = 0.0003313
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01065099 0.03625594
## sample estimates:
##        cor 
## 0.02345731

Let’s also repeat this using the definition of correlation which actually uses the formula to calculate correlation coefficient.The formula to calculate is given below. \[ r = \frac{\sum_{i=1}^n(x_i-\overline x)(y_i-\overline y )}{(n-1)s_xs_y} \]

x <- eq $Magnitude
y<- eq $Depth
sx = sd(x)
sy = sd(y) 
n = length(x)
xbar = mean(x)
ybar = mean(y)
numerator = mean((x-xbar)*(y-ybar))*n
r = numerator/((n-1)*sx*sy)
r
## [1] 0.02345731

Here, instead of writing ‘\(Magnitude' and '\)Depth’ several times in a data i have assigned them to \(x\) and \(y\) to make it easier and less confusing. We have received the same result using the formula i.e. \(r = 0.235\).

lm(y ~ x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##      30.763        6.801

In R, the \(lm()\), or “linear model,” function can be used to create a simple regression model. The \(lm()\) function accepts a number of arguments.The formula argument follows a specific format. For simple linear regression, this is \(YVAR ~ XVAR\) where \(YVAR\) is the dependent, or predicted, variable and \(XVAR\) is the independent, or predictor, variable.

fit = lm(y~x)
slope = coef(fit)[2]
intercept = coef(fit)[1]
yhat <- function(xpredict){
  return( slope*xpredict+intercept)
}
yhat(10)
##        x 
## 98.76901

The simple linear regression tries to find the best line to predict sales on the basis of depth. The linear model equation can be written as follow: \(yhat <- function(xpredict){return( slope*xpredict+intercept)\) The R function \(lm()\) can be used to determine the beta coefficients of the linear model. When \(yhat\) is 10 then the value of \(x\) that is magnitude is 98.8.

xToPredict <-data.frame(x = c(10))
predict(fit, newdata = xToPredict)
##        1 
## 98.76901

This is a second method where we can use different command. Here, I have only passed a data frame to the new data option and it is handy sometimes because it has nice feature where I can do confidence intervals.

predict(fit, newdata = xToPredict, interval = 'confidence')
##        fit      lwr      upr
## 1 98.76901 83.40065 114.1374

It looks like the fit is in the range but it is not a perfect fit. The lower limit is 83.4 and upper limit is 114.1 and the fit is 98.8. The range gives us 95% confidence-interval. Lastly, we should perform a hypothesis test. Here the null hypothesis is that says there is no correlation between \(y\) and \(x\), where in opposite, alternative hypothesis says \(x\) and \(y\) is related to some level. \[ H_0: \rho =0\\ H_A: \rho\neq 0 \] We have already computed our estimate of the test statistic \(r\). Then \(t\) is \[ t = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} \] Here the degrees of freedom will be \(df=n-2\)

t = r*sqrt(n-2)/sqrt(1-r^2)
t
## [1] 3.590033

As we can see \(t\) value is 3.59.

pt(t,n-2)
## [1] 0.9998343

I was not expecting that however, from the data I have received the \(p\) value is really high and close to 1 not the 0. Therefore, i have failed to reject null hypothesis. So, I can safely say that there is no relation between \(x\) and \(y\) even though we do not want to accept null hypothesis.

If instead we want to test the slope of the line.vWe consider the following as the theoretical fit \[ \hat y = \beta_0+\beta_1x \]

The standard error for the slope is \[ SE_{\beta_1} = \sqrt{\frac{MSE}{\sum\left(x_i-\overline x\right)^2}} \] Where, \[ MSE = \frac{\sum_{i=1}^n\left(y_i-\hat y_i\right)^2}{n-2} \] \(MSE\) is referred to as the mean square error. It adds all the square errors and divides by the adjusted total \((n-2)\) because of the degrees of freedom.

yframe = data.frame('Values' = y)
yhat <- predict(fit,yframe)
MSE = sum((y-yhat)^2)/(n-2)
MSE
## [1] 15035.85

The denominator in the \(SE\) computation above is sometimes called the \(S_{xx}\) The sum of the squares of \(x\).

sxx = sum((x-xbar)^2)

Then we can now compute the standard error for the slope as:

SEslope = sqrt(MSE/sxx)
SEslope
## [1] 1.894289

If I wanted to look at the intercept, \(\beta_0\), it’s standard error is \[ SE_{\beta_0} = \sqrt{MSE\left(\frac1n+\frac{\overline x}{S_{xx}}\right)} \]

This will be straight forward as I have already computed all these values.

SEintercept = sqrt(MSE*(1/n+xbar^2/sxx))
SEintercept
## [1] 11.17199

Both \(\beta_0\) and \(\beta_1\) are \(t\)-student distributed with degrees of freedom \(df = n-2\).

If I want to do the confidence interval for a prediction by hand, it is very similar. For an \(x_h\) that we want to predict, \[ SE_{x_h}=\sqrt{MSE\left(\frac1n+\frac{\left(x_h-\overline x\right)^2}{S_{xx}}\right)} \]

Again this is \(t\)-student distributed. I won’t do this computation since I would rather let R do it.

summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -80.50 -57.71 -37.51 -16.25 631.83 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   30.763     11.172   2.754 0.005899 ** 
## x              6.801      1.894   3.590 0.000331 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 122.6 on 23410 degrees of freedom
## Multiple R-squared:  0.0005502,  Adjusted R-squared:  0.0005076 
## F-statistic: 12.89 on 1 and 23410 DF,  p-value: 0.0003313

I was thinking about this problem a few hours later and low and behold there is a way to get R to tell us all of these values. Passing the fit to the summary() function, we see all these outputs computed by R. Standard Errors and even the MSE are computed right in the above summary.

plot(x,y)
abline(fit)

Well, I was expecting that my data will not behave well because I felt like it was not the best data to start with. Looking at my graph I am confused what should I predict from it. No doubt it has high variance looking at the data. It is not a linear graph instead it looks like it is a sinusoidal.

plot(fit)

These are the residual plots produced by the code that are in the above diagram.

Residuals are the unexplained \(variance\). They are not exactly the same as model error, but they are calculated from it, so seeing a bias in the residuals would also indicate a bias in the error.

In the \(Normal Q-Qplot\) in the top right, we can see that the real residuals from our model form an almost perfectly one-to-one line with the theoretical residuals from a perfect model. Based on these residuals, we can say that our model meets the assumption of homoscedasticity.

The best-fitting multiple regression equation is of the form \(\hat y=b_0+b_1x+b_2x^2+b_3x^3\). The graph of such an equation is called a cubic. Using the data set given below, and letting \(x_1=x\), \(x_2=x^2\), and \(x_3=x^3\), finding the multiple regression equation for the cubic that best fits the given data.

xsquare = x^2
xcube = x^3
quadraticfit = lm(y~x+xsquare+xcube)
summary(quadraticfit)
## 
## Call:
## lm(formula = y ~ x + xsquare + xcube)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -90.74 -57.44 -37.82 -16.34 628.08 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3132.483    807.048   3.881 0.000104 ***
## x           -1408.079    374.754  -3.757 0.000172 ***
## xsquare       212.993     57.635   3.696 0.000220 ***
## xcube         -10.573      2.935  -3.602 0.000316 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 122.6 on 23408 degrees of freedom
## Multiple R-squared:  0.001301,   Adjusted R-squared:  0.001173 
## F-statistic: 10.16 on 3 and 23408 DF,  p-value: 1.099e-06

Clearly, it provides all the information where we can see \(intercept\), \(x\) , \(slope\) and everything. I really this feature of R. Here one thing I would like to add is that adjusted \(R-squared\) value shows how good is our data. If the value is close to 1 it is termed as a good data. With out any doubt it is not even close to 1 rather it is close to 0. Therefore, it was not the good data to make a better prediction to reject the null hypothesis.

range.default(x)
## [1] 5.5 9.1
range.default(y)
## [1]  -1.1 700.0

To be honest I was confused what range I should use in my \(x\) and \(y\). Therefore, I have used in build R default to see what my ranges were of \(x\) and \(y\). And it worked, at first the graph was messy but it actually worked.

eqn = function(x){
  coef(quadraticfit)[1]
  +
    coef(quadraticfit)[2]*x
  +
    coef(quadraticfit)[3]*x^2
  +
    coef(quadraticfit)[4]*x^3}
plot(x=1, ylim = c(-1.1,700),xlim = c(5.5,9.1), xlab = "Great",ylab = "Scott")
points(x,y)
curve(eqn, add = TRUE)

In conclusion, linear regression is a regression model that uses a straight line to describe the relationship between variables. It finds the line of best fit through your data by searching for the value of the regression coefficient(s) that minimizes the total error of the model.
There are two main types of linear regression: \(Simple linear regression\) uses only one independent variable \(Multiple linear regression\) uses two or more independent variables Hence, In this project I was able to successfully apply all the knowledge from the materials and able to test \(Simple linear regression\) and \(Multiple linear regression\) and how it behaves. This was harder than I thought it would be however, it was fun and learned a lot.