Abstract
This project applies simple linear regression (SLR) to real data using R. The goal is to determine whether there is a linear relationship between the temperature of water near coastal nuclear power plants and the size of the fish that inhabit these waters. Based on the results of our tests, we should be able to tell whether nuclear power plants near the water are affecting the size of the fish and what might be done to limit these effects in the future.

The data being used is from “OCEANTEMP.csv”. Environmental scientists theorized that varying ocean temperatures caused by a nearby coastal nuclear power plant would affect the growth of the fish that reside in these waters. To test this theory, one scientist simulated ocean environments with varying temperatures, mimicking the conditions that could exist near a coastal nuclear power plant, and distributed fish evenly among these environments to see what effects they would experience. Data was collected from each fish in each environment and recorded in OCEANTEMP. My goal is to determine whether there is a relationship between the recorded weight of the fish and the temperature of the environment, and whether that relationship is linear.
oc <- read.csv("OCEANTEMP.csv")
#oc <- oc[oc$TEMP >= 38,]
head(oc)
names(oc)
## [1] "TEMP" "WEIGHT"
table(oc$WEIGHT)
##
## 13 14 15 16 17 18 19 20 21 22 23 24 25 26 28
##  1  1  1  2  2  2  2  1  3  1  1  2  1  1  1
The variable WEIGHT is a quantitative variable that represents the weight, in ounces, of the fish in the simulated oceanic environments.
table(oc$TEMP)
##
## 15 38 42 46 50
##  1  5  6  6  4
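The variable TEMP is a quantitative variable that represents the temperature, measured in degrees Fahrenheit, of the simulated oceanic environments. The experiment used four environments with temperatures of 38, 42, 46, and 50. The table also shows a single observation recorded at 15, which does not match any of the four environments; it is kept in the analysis below.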
addmargins(table(oc))
##      WEIGHT
## TEMP  13 14 15 16 17 18 19 20 21 22 23 24 25 26 28 Sum
##   15   0  0  0  0  0  0  0  0  1  0  0  0  0  0  0   1
##   38   0  0  0  1  0  1  1  0  0  1  0  1  0  0  0   5
##   42   0  0  1  1  1  0  0  0  1  0  0  0  1  1  0   6
##   46   0  1  0  0  0  0  1  0  1  0  1  1  0  0  1   6
##   50   1  0  0  0  1  1  0  1  0  0  0  0  0  0  0   4
##   Sum  1  1  1  2  2  2  2  1  3  1  1  2  1  1  1  22
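For readers who do not have OCEANTEMP.csv, the cross-tabulation above contains enough information to rebuild the 22 observations. The following is a minimal sketch under that assumption: `oc_rebuilt` and `group_summary` are names introduced here for illustration, and the order of rows within each temperature cannot be recovered from the table, so weights are simply listed in ascending order.

```r
# Rebuild the 22 observations from the TEMP-by-WEIGHT table above.
# Assumption: row order within each temperature is unknown, so the
# weights are listed in ascending order within each TEMP value.
oc_rebuilt <- data.frame(
  TEMP = rep(c(15, 38, 42, 46, 50), times = c(1, 5, 6, 6, 4)),
  WEIGHT = c(21,
             16, 18, 19, 22, 24,
             15, 16, 17, 21, 25, 26,
             14, 19, 21, 23, 24, 28,
             13, 17, 18, 20)
)

# The margins should agree with the table printed above.
addmargins(table(oc_rebuilt))

# Number of fish and mean weight at each temperature.
group_summary <- aggregate(
  WEIGHT ~ TEMP, data = oc_rebuilt,
  FUN = function(w) c(n = length(w), mean = mean(w))
)
group_summary
```

Row order does not affect a least-squares fit, so a model fitted to `oc_rebuilt` should match one fitted to the original file.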
library(ggplot2)
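# Set the point colour outside aes(); inside aes() "red" is treated as a data value, not a colour
g <- ggplot(oc, aes(x = TEMP, y = WEIGHT)) + geom_point(colour = "red")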
g
library(s20x)
pairs20x(oc)
From the preliminary plots, there does not appear to be a clear linear relationship between the temperature of the environments and the weight of the fish; however, more analysis is needed to confirm this.
The data was collected by environmental scientists who believed there was a correlation between the temperature of the ocean and the size of fish. They placed fish in one of four similar oceanic environments, which differed only in temperature, and weighed the fish after six months.
Nuclear power plants can have many effects on the environments in which they are located, and some of these can be harmful. Coastal power plants, being near the water, have the potential to affect nearby aquatic life. To find out whether they do, environmental scientists performed the experiment described above to test the growth of fish in these environments.
The purpose of the analysis is to see whether a relationship exists between the temperature of the ocean near nuclear power plants and the size of the fish that occupy that environment.
Nuclear power has many advantages: it is reliable and cost effective, it is a very low-carbon energy source, and it helps close the energy gap. It also has disadvantages: there are many environmental concerns regarding its emissions and the radioactive waste it produces. My interest, as an engineer, is to analyze the effects that plants located so close to the ocean may have on ocean wildlife. Are they negatively affecting the animals that occupy those waters, or is there no real evidence to suggest that? Answering this question may lead to a better understanding of the ethics involved in deciding where to place these structures.
The problem I am trying to solve is whether there is a direct relationship between the size of a fish and the temperature of the water it lives in.
SLR is a predictive model fitted to a sample. It lets us create the equation of a line that estimates the values of our dependent variable (WEIGHT) from our independent variable (TEMP).
The model can be written as \(y_i = \beta_0 + \beta_1x_i + \epsilon_i\), where \(y_i\) is our dependent variable, \(x_i\) is our independent variable, and \(\epsilon_i\) is our error term. Because the error term is assumed to have a mean of zero, the expected value of the response is:
\[ \begin{aligned} E(y) = \mu_y &= E(\beta_0 + \beta_1 x_i + \epsilon_i)\\ &= E(\beta_0) + E(\beta_1 x_i) + E(\epsilon_i)\\ &= \beta_0 + \beta_1 x_i + 0 \\ &= \beta_0 + \beta_1 x_i \end{aligned} \]

## Summary of linear model
summary(lm(oc$WEIGHT~oc$TEMP))
##
## Call:
## lm(formula = oc$WEIGHT ~ oc$TEMP)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -6.3273 -2.7512 -0.7051  2.9988  8.3901
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.86014    5.19006   4.405 0.000273 ***
## oc$TEMP     -0.07066    0.12063  -0.586 0.564604
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.104 on 20 degrees of freedom
## Multiple R-squared: 0.01687, Adjusted R-squared: -0.03229
## F-statistic: 0.3431 on 1 and 20 DF, p-value: 0.5646
Our estimated intercept \(\hat\beta_0\) is 22.86014 and our estimated slope \(\hat\beta_1\) is -0.07066, which can be substituted into our equation as:
\[ \hat y_i = 22.86014 - 0.07066x_i \]
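The fitted slope is negative, so the model predicts that fish weight decreases by about 0.07 ounces for each one-degree increase in water temperature. However, the p-value for the slope (0.5646) is far above 0.05, so this downward trend is not statistically distinguishable from a slope of zero.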
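The estimates in the summary output can be reproduced by hand from the usual least-squares formulas, slope = Sxy / Sxx and intercept = mean(y) - slope × mean(x). The sketch below assumes the data frame can be rebuilt from the TEMP-by-WEIGHT cross-tabulation shown earlier in this report (row order within each temperature is unknown and does not matter for the fit). The names `Sxx`, `Sxy`, `b0`, `b1`, `se_b1`, `t_b1`, and `p_b1` are introduced here for illustration.

```r
# Assumption: data rebuilt from the cross-tabulation in this report;
# with the original file, oc <- read.csv("OCEANTEMP.csv") would be used instead.
oc <- data.frame(
  TEMP = rep(c(15, 38, 42, 46, 50), times = c(1, 5, 6, 6, 4)),
  WEIGHT = c(21, 16, 18, 19, 22, 24, 15, 16, 17, 21, 25, 26,
             14, 19, 21, 23, 24, 28, 13, 17, 18, 20)
)

x <- oc$TEMP
y <- oc$WEIGHT
n <- length(y)

# Sums of squares and cross-products about the means
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))

# Least-squares estimates of slope and intercept
b1 <- Sxy / Sxx
b0 <- mean(y) - b1 * mean(x)

# Residual standard error on n - 2 degrees of freedom
s <- sqrt(sum((y - b0 - b1 * x)^2) / (n - 2))

# Standard error, t statistic, and two-sided p-value for the slope
se_b1 <- s / sqrt(Sxx)
t_b1 <- b1 / se_b1
p_b1 <- 2 * pt(-abs(t_b1), df = n - 2)

round(c(b0 = b0, b1 = b1, se_b1 = se_b1, t = t_b1, p = p_b1), 5)
```

These values should agree with the Coefficients table printed by `summary()` above, up to rounding.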
For SLR to be a valid model for our data, four assumptions must hold:
The first assumption is that there must be a linear relationship between the independent and dependent variables. For my data, the independent variable is the environment temperature and the dependent variable is the fish weight.
with(oc, plot(WEIGHT~TEMP, bg="Red",pch=23, cex=1.2, main="Scatter Plot and Linear Model of WEIGHT v. TEMP"))
oc.lm <- with(oc, lm(WEIGHT~TEMP))
abline(oc.lm)
From the plot above with the fitted line added, there appears to be a slight downward trend in weight as the temperature increases. However, the points do not cluster tightly around the line, so more analysis is needed to test how well the model fits.
The second assumption for SLR is that the error terms in a linear model are independent.
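with(oc, plot(WEIGHT~TEMP, bg="black", pch=21, cex=1.2))
# Fitted weight for each observed temperature
yhat <- with(oc, predict(oc.lm, data.frame(TEMP)))
# Vertical segments from each point to the fitted line show the residuals
with(oc, segments(TEMP, WEIGHT, TEMP, yhat, col="red"))
abline(oc.lm)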
In the plot, the residuals extend both above and below the line of best fit. Below the line, the lowest point in each environment sits farther from the line as the temperature increases. A systematic pattern like this could indicate that the errors are not independent.
The third assumption for SLR is that the residuals have constant variance throughout the sample and are distributed about a mean of 0.
plot(oc.lm, which=1)
As we can see from the Residuals vs Fitted plot, the residuals appear to be roughly centered on zero. They are widely scattered, however, and do not show much of a trend along the fitted values.
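One way to put numbers on the visual impression from the Residuals vs Fitted plot is to compare the spread of the residuals within each temperature group. This is only a rough sketch, not a formal test; it assumes the data frame is rebuilt from the cross-tabulation shown earlier in this report, and `res`, `spread`, and `counts` are names introduced here.

```r
# Assumption: data rebuilt from the cross-tabulation in this report;
# with the original file, oc <- read.csv("OCEANTEMP.csv") would be used instead.
oc <- data.frame(
  TEMP = rep(c(15, 38, 42, 46, 50), times = c(1, 5, 6, 6, 4)),
  WEIGHT = c(21, 16, 18, 19, 22, 24, 15, 16, 17, 21, 25, 26,
             14, 19, 21, 23, 24, 28, 13, 17, 18, 20)
)
oc.lm <- lm(WEIGHT ~ TEMP, data = oc)
res <- resid(oc.lm)

# Least-squares residuals from a model with an intercept average to zero
mean(res)

# Standard deviation of the residuals at each temperature.
# TEMP = 15 has a single observation, so its standard deviation is NA.
spread <- tapply(res, oc$TEMP, sd)
counts <- tapply(res, oc$TEMP, length)
data.frame(TEMP = names(spread),
           n = as.vector(counts),
           sd_resid = round(as.vector(spread), 2))
```

Roughly similar standard deviations across the groups would be consistent with constant variance, though with only four to six fish per group the comparison is imprecise.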
The final assumption for an SLR model is that the residuals are normally distributed. To check this, we can use a Shapiro-Wilk normality test.
normcheck(oc.lm, shapiro.wilk = TRUE)
title("Normal Distribution")
From this test, the residuals of the linear model appear to be normally distributed. The Shapiro-Wilk test gave no evidence against a normal distribution of the errors.
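The Shapiro-Wilk statistic and p-value can also be obtained directly with base R's `shapiro.test()` applied to the model residuals, which makes the result easy to report in text. This sketch assumes the data frame is rebuilt from the cross-tabulation shown earlier in this report; `sw` is a name introduced here.

```r
# Assumption: data rebuilt from the cross-tabulation in this report;
# with the original file, oc <- read.csv("OCEANTEMP.csv") would be used instead.
oc <- data.frame(
  TEMP = rep(c(15, 38, 42, 46, 50), times = c(1, 5, 6, 6, 4)),
  WEIGHT = c(21, 16, 18, 19, 22, 24, 15, 16, 17, 21, 25, 26,
             14, 19, 21, 23, 24, 28, 13, 17, 18, 20)
)
oc.lm <- lm(WEIGHT ~ TEMP, data = oc)

# Shapiro-Wilk test on the residuals.
# Null hypothesis: the residuals come from a normal distribution.
sw <- shapiro.test(resid(oc.lm))
sw$statistic  # W statistic; values near 1 are consistent with normality
sw$p.value    # a small p-value (for example below 0.05) would suggest non-normality
```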
R-squared is the ratio of MSS (model sum of squares) to TSS (total sum of squares), or equivalently 1 minus the ratio of RSS (residual sum of squares) to TSS. The closer a model's R-squared is to 1, the better the model fits the data.
Our Residual sum of squares value:
hat = with(oc, predict(oc.lm, data.frame(TEMP)))
RSS = with(oc, sum((WEIGHT - hat)^2))
RSS
## [1] 336.8131
Our model sum of squares value:
MSS = with(oc, sum((hat - mean(WEIGHT))^2))
MSS
## [1] 5.777822
Our total sum of squares value:
TSS = with(oc, sum((WEIGHT - mean(WEIGHT))^2))
TSS
## [1] 342.5909
We can now calculate R-squared:
R2 = MSS/TSS
R2
## [1] 0.01686508
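As a consistency check on the three sums of squares, the model and residual parts should add up to the total, and both ways of computing R-squared should match the value reported by `summary()`. The sketch below assumes the data frame is rebuilt from the cross-tabulation shown earlier in this report; `fit` is a name introduced here.

```r
# Assumption: data rebuilt from the cross-tabulation in this report;
# with the original file, oc <- read.csv("OCEANTEMP.csv") would be used instead.
oc <- data.frame(
  TEMP = rep(c(15, 38, 42, 46, 50), times = c(1, 5, 6, 6, 4)),
  WEIGHT = c(21, 16, 18, 19, 22, 24, 15, 16, 17, 21, 25, 26,
             14, 19, 21, 23, 24, 28, 13, 17, 18, 20)
)
oc.lm <- lm(WEIGHT ~ TEMP, data = oc)
fit <- fitted(oc.lm)

RSS <- sum((oc$WEIGHT - fit)^2)                # residual sum of squares
MSS <- sum((fit - mean(oc$WEIGHT))^2)          # model sum of squares
TSS <- sum((oc$WEIGHT - mean(oc$WEIGHT))^2)    # total sum of squares

# The decomposition TSS = MSS + RSS should hold up to rounding error
all.equal(MSS + RSS, TSS)

# Three equivalent routes to R-squared
c(MSS / TSS, 1 - RSS / TSS, summary(oc.lm)$r.squared)
```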
Our R-squared value is very low (about 0.017), which indicates that the model explains less than 2% of the variation in fish weight and does not really fit the data. This makes sense because we see only a slight relationship: fish in the warmer environments sometimes weigh less than fish in the cooler ones, but not consistently enough to say that temperature is the cause.
The analysis provides no clear evidence of a direct relationship between the temperature of the ocean environment and the weight of the fish.
Although our analysis shows a very slight trend in which fish weigh slightly less in warmer environments, it is not enough to say that temperature is the cause. I therefore conclude that, based on this data, coastal nuclear power plants do not appear to have much effect, if any, on the weight of the wildlife that occupies the waters around them.
I think the main way to improve this experiment would be to increase the sample size, meaning a larger number of fish in each environment. This could give a more accurate picture of the relationship than the one we just concluded.
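To illustrate why more fish would help, the standard error of the slope is the residual standard error divided by the square root of Sxx, and Sxx grows in proportion to the number of fish when the same temperatures are reused. The sketch below rests on two loud assumptions: the whole design is replicated `k` times at the same temperatures, and the residual standard error stays at its current estimate. The data frame is rebuilt from the cross-tabulation shown earlier in this report, and `k` and `se_slope` are names introduced here.

```r
# Assumption: data rebuilt from the cross-tabulation in this report;
# with the original file, oc <- read.csv("OCEANTEMP.csv") would be used instead.
oc <- data.frame(
  TEMP = rep(c(15, 38, 42, 46, 50), times = c(1, 5, 6, 6, 4)),
  WEIGHT = c(21, 16, 18, 19, 22, 24, 15, 16, 17, 21, 25, 26,
             14, 19, 21, 23, 24, 28, 13, 17, 18, 20)
)
oc.lm <- lm(WEIGHT ~ TEMP, data = oc)

s <- summary(oc.lm)$sigma                 # residual standard error
Sxx <- sum((oc$TEMP - mean(oc$TEMP))^2)   # spread of the temperatures

# Hypothetical scenario: replicate the design k times at the same temperatures,
# assuming the residual standard error does not change.
k <- c(1, 2, 4, 8)
se_slope <- s / sqrt(k * Sxx)

data.frame(k = k, n = k * nrow(oc), se_slope = round(se_slope, 4))
```

Under these assumptions the standard error shrinks with the square root of the sample size, so four times as many fish would halve it.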
Fish weights data. Data Set Library. (n.d.). Retrieved April 27, 2023, from https://support.minitab.com/en-us/datasets/nonparametrics-data-sets/fish-weights-data/
“What Are the Advantages of Nuclear Energy?” EDF, www.edfenergy.com/energywise/what-are-advantages-nuclear-energy.