2023-02-11

library(readr)
library(ggplot2)
library(dplyr)
library(plotly)
library(caTools)
library(tidyr)
Fish <- read_csv("DAT 301 Directory Folder/Fish.csv")
data_long <- gather(Fish, "length_type", "length", Length1, Length2, Length3)

Overview of the Fish Data Set

  • The Fish data set contains 159 observations (rows) and 7 variables (columns).

  • The 7 variables recorded for each observation are “Species”, “Weight”, “Length1”, “Length2”, “Length3”, “Height”, and “Width”.

  • In this data set, weight is measured in grams and lengths are measured in centimeters.

  • The “Length1” variable refers to vertical length of the fish. The “Length2” column refers to the diagonal length of the fish. The “Length3” column refers to the cross length of the fish. The “Width” column refers to the diagonal width of the fish.

Relationships Between Fish Weight and Dimensions

  • Looking at the graph below, there is a clear upward trend in the data: as Length1 and Length2 increase, Weight increases.

What is Simple Linear Regression?

  • Simple linear regression involves an independent variable x and a dependent variable y.

  • In simple linear regression we draw a line of best fit through our x and y points. This line passes through the point of means \((\bar{x}, \bar{y})\) and minimizes the sum of squared vertical distances to the data points, which is what makes it the line of best fit.

  • Using this line we can predict values of the dependent variable y based on values of the independent variable x.
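
To make these ideas concrete, here is a minimal sketch in base R. The values of x and y are made up for illustration and are not taken from the Fish data set. It fits a line with lm(), checks that the line passes through the point of means, and predicts y for a new value of x.

```r
# Made-up illustrative data (an assumption, not from the Fish data set)
toy <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, 3.9, 6.2, 7.8, 10.1))

# Fit y as a linear function of x
toy_fit <- lm(y ~ x, data = toy)

# The fitted line evaluated at mean(x) should equal mean(y)
pred_at_mean <- predict(toy_fit, newdata = data.frame(x = mean(toy$x)))
passes_through_means <- isTRUE(all.equal(unname(pred_at_mean), mean(toy$y)))

# Predict the dependent variable y for a new value of x
new_pred <- predict(toy_fit, newdata = data.frame(x = 6))
```

Here passes_through_means records whether the line goes through the point of means, and new_pred holds the predicted y at x = 6.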

The Formulas for Linear Regression

  • The equation of a linear regression line is the same as any other line: \(y = mx+b\)

  • When doing linear regression with an arbitrary number of variables n, the equation becomes: \(y = \Theta_n x_n + \Theta_{n-1} x_{n-1} + \dots + \Theta_1 x_1 + \Theta_0\)
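
As a rough illustration of the multi-variable equation, the sketch below fits a model with two independent variables and evaluates the equation by hand for one row. The data frame toy2 and its values are made up for this example. The coefficients returned by coef() play the role of Θ_0 (the intercept), Θ_1, and Θ_2.

```r
# Made-up illustrative data with two independent variables (an assumption)
toy2 <- data.frame(x1 = c(1, 2, 3, 4, 5, 6),
                   x2 = c(2, 1, 4, 3, 6, 5),
                   y  = c(5.1, 5.9, 11.2, 11.8, 17.1, 17.9))

fit2 <- lm(y ~ x1 + x2, data = toy2)

# Named coefficient vector: "(Intercept)", "x1", "x2"
theta <- coef(fit2)

# Evaluate Theta_0 + Theta_1 * x1 + Theta_2 * x2 by hand for the first row
by_hand <- theta["(Intercept)"] +
  theta["x1"] * toy2$x1[1] +
  theta["x2"] * toy2$x2[1]

# The hand-computed value should match the model's fitted value for that row
matches_fitted <- isTRUE(all.equal(unname(by_hand), unname(fitted(fit2)[1])))
```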

Where do the Numbers Come From?

  • In the equation \(y = mx+b\):

    \(m = \frac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2}\)

    \(b = \frac{\sum y \sum x^2 - \sum x \sum xy}{n\sum x^2 - \left(\sum x\right)^2}\)

  • n is the number of observations or data points you have.

  • Generally knowing these formulas isn’t necessary to perform linear regression but it’s important to know where the process came from.
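
For readers who want to see the formulas in action, the following sketch computes the slope m and the intercept b directly from the sums and compares them with what lm() reports. The x and y values are made up for illustration.

```r
# Made-up illustrative data (an assumption, not from the Fish data set)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
n <- length(x)  # number of observations

# Shared denominator: n * sum(x^2) - (sum(x))^2
denominator <- n * sum(x^2) - sum(x)^2

# Slope and intercept from the closed-form formulas
m <- (n * sum(x * y) - sum(x) * sum(y)) / denominator
b <- (sum(y) * sum(x^2) - sum(x) * sum(x * y)) / denominator

# lm() returns the intercept first, then the slope
lm_coefs <- coef(lm(y ~ x))
formulas_agree <- isTRUE(all.equal(unname(lm_coefs), c(b, m)))
```

If the formulas are implemented correctly, formulas_agree is TRUE, meaning the hand-computed values match lm() up to floating-point tolerance.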

Linear Regression on the Fish Data Set in R

  • Doing linear regression in R is quite simple with the command lm().
  • Since there is a clear relationship between the dimensions and the weight of the fish in the Fish data set (as shown by the previous graph), we can perform linear regression with Weight as our dependent variable and the fish dimensions as our independent variables.
  • Here is the code to create such a linear regression model. Note that the formula Weight ~ . uses every other column in Fish as a predictor, including Species, Height, and Width, not only the three lengths:
   model <- lm(Weight ~ ., data = Fish)
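
The sketch below shows one way to inspect a model built with the dot formula. It uses a small made-up data frame, toy_fish, whose column names mimic the Fish data but whose values are invented for illustration. It lists which predictors the dot expanded to and reads off the coefficients and R-squared.

```r
# Made-up illustrative data; column names mimic Fish, values are invented
toy_fish <- data.frame(Weight  = c(100, 150, 210, 260, 330, 400),
                       Length1 = c(15, 18, 20, 23, 25, 28),
                       Height  = c(5.0, 5.5, 6.8, 7.1, 8.4, 9.0))

# "." stands for every column other than the response (Weight)
toy_model <- lm(Weight ~ ., data = toy_fish)

# Predictors that the dot expanded to
predictors <- attr(terms(toy_model), "term.labels")

# Estimated intercept and slopes
toy_coefs <- coef(toy_model)

# Proportion of variance in Weight explained by the model
r_squared <- summary(toy_model)$r.squared
```

On the real data, running summary(model) prints the same information in one table, along with standard errors and p-values.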

Plotting the Line of Best Fit

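  • We can also plot a line of best fit for our data with ggplot2. Using geom_smooth() with method = "lm" fits a simple linear regression of Weight on length and overlays it on the scatter plot. This line is fitted to the pooled length measurements in data_long, so it is separate from the multi-variable model created earlier.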

Plotting the Line of Best Fit (Continued)

ggplot(data = data_long, aes(x = length, y = Weight)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

Plotting the Residuals

  • Finally, we can plot the residuals of our model against its fitted values to judge how well the model fits the data.

Plotting the Residuals (Continued)

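# Collect the fitted values and residuals of the model in one data frame
residual_data <- data.frame(fitted = fitted(model), residual = residuals(model))
ggplot(data = residual_data, aes(x = fitted, y = residual)) +
  geom_point() +
  geom_hline(yintercept = 0)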

Plotting the Residuals (Continued)

  • Because the points of our residual graph are scattered around the zero line with no clear trend, the linear model appears to fit our data well.
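
To clarify what a residual plot is showing, here is a small sketch on made-up data. It confirms that each residual is the observed value minus the fitted value, and that the residuals of a least-squares fit with an intercept average to zero, which is why the horizontal line at zero is the natural reference.

```r
# Made-up illustrative data (an assumption, not from the Fish data set)
x <- c(1, 2, 3, 4, 5, 6)
y <- c(1.2, 1.9, 3.2, 3.8, 5.1, 6.0)

fit <- lm(y ~ x)

# Residuals as reported by R
res <- residuals(fit)

# Residuals computed by hand: observed minus fitted
manual_res <- y - fitted(fit)
same_residuals <- isTRUE(all.equal(unname(res), unname(manual_res)))

# With an intercept in the model, the residuals average to (numerically) zero
mean_is_zero <- abs(mean(res)) < 1e-10
```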