| year | days | seasonal_index |
|---|---|---|
| 1 | 91 | 16.7 |
| 2 | 105 | 17.1 |
| 3 | 106 | 18.2 |
| 4 | 108 | 18.1 |
| 5 | 88 | 17.2 |
| 6 | 91 | 18.2 |
| 7 | 58 | 16.0 |
| 8 | 82 | 17.2 |
| 9 | 81 | 18.0 |
| 10 | 65 | 17.2 |
| 11 | 61 | 16.9 |
| 12 | 48 | 17.1 |
| 13 | 61 | 18.2 |
| 14 | 43 | 17.3 |
| 15 | 33 | 17.5 |
| 16 | 36 | 16.6 |
The following is a scatter plot of the raw data.
In order to show model adequacy, we need to verify the following three assumptions: (1) the errors are normally distributed, (2) the error variance is constant, and (3) the observations are independent.
We fit the equation \(\hat y = {b_0} + {b_1}x\). The calculated equation is \(\hat y = -192.984 + 15.296x\). Using this fitted model, we will use graphical techniques to analyze the adequacy of our model.
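As a quick cross-check on the hand calculation, the same coefficients can be obtained directly from R's lm() (a minimal sketch using the days and seasonal_index vectors defined in the code listing at the end of this document):

```r
# Cross-check the hand-calculated coefficients against R's built-in fit
fit <- lm(days ~ seasonal_index)
coef(fit)  # (Intercept) ~ -192.984, slope ~ 15.296
```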
First, for the normality-of-errors condition, we calculate the residual for each observation as \({e_i} = {y_i} - ({b_0} + {b_1}{x_i})\), where \({y_i}\) is the observed value and \({b_0} + {b_1}{x_i}\) is the fitted value. We assume the errors are distributed \(\varepsilon \sim N(0,{\sigma ^2})\).
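A minimal sketch of this residual calculation in R, reusing the fit object from the sketch above:

```r
# Residual for each observation: e_i = y_i - (b0 + b1 * x_i)
e <- days - (coef(fit)[[1]] + coef(fit)[[2]] * seasonal_index)
all.equal(e, unname(residuals(fit)))  # should be TRUE
```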
As evidenced by the normal probability plot, the points are approximately linear, so we can safely conclude that the first assumption, that the errors are normally distributed, is satisfied.
For our second assumption we look for constant variance, evidenced by a random scatter of points when the residuals are plotted on the y-axis against the fitted values \({{\hat y}_i} = {b_0} + {b_1}{x_i}\) on the x-axis.
The plot above displays a random arrangement, so we can conclude that the variance is constant and our second requirement is satisfied.
Our third and final requirement for model adequacy is independence. This is evidenced by a random scatter of the residuals plotted against collection order (time); we should not see any periodic behaviour in the data.
The model does satisfy all three requirements for model adequacy.
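For reference, all three diagnostic plots can be reproduced from the fitted model in a few lines; this sketch mirrors the plotting chunks in the code listing at the end of the document:

```r
fit <- lm(days ~ seasonal_index)
res <- residuals(fit)
qqnorm(res); qqline(res)                            # 1. normality of errors
plot(fitted(fit), res, main = "Constant Variance")  # 2. residuals vs fitted values
plot(seq_along(res), res, main = "Independence")    # 3. residuals in time order
```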
The following is the statistical analysis for our data:
##
## Call:
## lm(formula = linear_obj)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.70 -21.54 2.12 18.56 36.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.984 163.503 -1.180 0.258
## x 15.296 9.421 1.624 0.127
##
## Residual standard error: 23.79 on 14 degrees of freedom
## Multiple R-squared: 0.1585, Adjusted R-squared: 0.09835
## F-statistic: 2.636 on 1 and 14 DF, p-value: 0.1267
##
## Call:
## lm(formula = linear_obj)
##
## Coefficients:
## (Intercept) x
## -193.0 15.3
## Analysis of Variance Table
##
## Response: y
## Df Sum Sq Mean Sq F value Pr(>F)
## x 1 1492.6 1492.6 2.6362 0.1267
## Residuals 14 7926.8 566.2
## (Intercept) x
## -1.180306 1.623650
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -192.98383 163.503283 -1.180306 0.2575450
## x 15.29637 9.420975 1.623650 0.1267446
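Each t value above is simply the estimate divided by its standard error, and its p-value follows from a t distribution with 14 degrees of freedom; a quick check:

```r
# t value = Estimate / Std. Error, from the coefficient table above
15.29637 / 9.420975              # 1.623650 (slope)
2 * pt(-abs(1.623650), df = 14)  # two-sided p-value ~ 0.1267
```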
The 95% confidence interval for the mean response is given in the first table below, followed by the 95% prediction interval for a new observation:
## fit lwr upr
## 1 51.75801 21.75791 81.75811
## 2 53.86126 26.35031 81.37221
## 3 55.96451 30.88192 81.04710
## 4 58.06776 35.33322 80.80230
## 5 60.17101 39.67663 80.66539
## 6 62.27426 43.87267 80.67585
## 7 64.37751 47.86522 80.88980
## 8 66.48076 51.57671 81.38481
## 9 68.58401 54.90761 82.26042
## 10 70.68726 57.74912 83.62540
## 11 72.79051 60.01612 85.56490
## 12 74.89376 61.68722 88.10031
## 13 76.99701 62.81679 91.17723
## 14 79.10026 63.50594 94.69458
## 15 81.20351 63.86209 98.54494
## 16 83.30676 63.97530 102.63822
## 17 85.41001 63.91295 106.90708
## fit lwr upr
## 1 51.75801 -7.441550 110.9576
## 2 53.86126 -4.116617 111.8391
## 3 55.96451 -0.901284 112.8303
## 4 58.06776 2.197902 113.9376
## 5 60.17101 5.174631 115.1674
## 6 62.27426 8.022988 116.5255
## 7 64.37751 10.737623 118.0174
## 8 66.48076 13.313923 119.6476
## 9 68.58401 15.748171 121.4199
## 10 70.68726 18.037689 123.3368
## 11 72.79051 20.180940 125.4001
## 12 74.89376 22.177590 127.6099
## 13 76.99701 24.028524 129.9655
## 14 79.10026 25.735810 132.4647
## 15 81.20351 27.302613 135.1044
## 16 83.30676 28.733076 137.8804
## 17 85.41001 30.032167 140.7879
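Any row of these tables can be reproduced by hand from the summary statistics. A sketch for the first confidence interval, assuming x is the seasonal-index vector from the code listing, with s = 23.79 and 14 degrees of freedom taken from the summary output above:

```r
# 95% CI for the mean response at x0: fit +/- t * s * sqrt(1/n + (x0-xbar)^2/Sxx)
s <- 23.79; n <- 16; x0 <- 16.0            # x0 = min(x), the first grid point
Sxx <- sum((x - mean(x))^2)                # ~ 6.379
se_fit <- s * sqrt(1/n + (x0 - mean(x))^2 / Sxx)
51.75801 + c(-1, 1) * qt(0.975, df = 14) * se_fit  # ~ (21.76, 81.76)
# the prediction interval widens this by adding 1 inside the square root:
# s * sqrt(1 + 1/n + (x0 - mean(x))^2 / Sxx)
```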
Our hypotheses are:
\[ \begin{array}{l} {H_0}:{\beta _1} = 0\\ {H_a}:{\beta _1} \ne 0 \end{array} \]
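At \(\alpha = 0.05\) with 14 degrees of freedom, the critical value for this two-sided test can be computed directly:

```r
qt(0.975, df = 14)  # ~ 2.145; observed |t| = 1.624 < 2.145, so we fail to reject H0
```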
Since the \({R^2}\) value is 0.1585, approximately 16% of the variability in days is explained by the fitted model. Also, with a p-value of 0.1267 at a significance level of \(\alpha = 0.05\), we fail to reject \({H_0}\): the evidence is not strong enough to conclude a linear relationship between days and seasonal index.
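Both figures can be read off the ANOVA table above: \({R^2}\) is the regression sum of squares divided by the total sum of squares, and the p-value comes from the F distribution:

```r
# R^2 = SS_regression / SS_total, from the ANOVA table
1492.6 / (1492.6 + 7926.8)         # ~ 0.1585
# p-value of the F test of H0: beta1 = 0
1 - pf(2.6362, df1 = 1, df2 = 14)  # ~ 0.1267
```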
It is a good idea to include the full source code listing at the end of every RMarkdown document:
#Derrell Dunn
#IE5344 Week 3 RMarkdown
#Input data y
y<-c(91,105,106,108,88,91,58,82,81,65,61,48,61,43,33,36)
#input data for x
x<-c(16.7,17.1,18.2,18.1,17.2,18.2,16.0,17.2,18.0,17.2,16.9,17.1,18.2,17.3,17.5,16.6)
#copy data to variables with the appropriate names; year is an index
#1-16 corresponding to 1976-1991
year<-seq_along(y)
days<-y
seasonal_index<-x
#let's print the data in a table for the user to view in the document
library(knitr)
library(kableExtra)
df <- data.frame( year,days,seasonal_index)
#display the table
knitr::kable(df, caption = "Ozone Trends in California South Coast Basin 1976-1991") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "center"
)
#For this section, for ease of mathematical calculation, I am using x and y
#as variable names instead of year, days, and seasonal_index
#get the length of the input vectors
number_data<-length(y)
#sum the y vector
y_sum<-sum(y)
#get the mean of both the y and the x vector
y_mean<-mean(y)
x_mean<-mean(x)
#sum the x vector
x_sum<-sum(x)
#calculate the summation of (y * x)
y_mult_x<-sum(y*x)
#calculate the sum of the squared x values
x_sqrd<-sum(x^2)
#square the sum of the x values
x_sumsq<-x_sum^2
#We finally calculate the B1 coefficient using page 14 - eqn 2.7 from text and lecture
b1_coef<-((y_mult_x) - ((y_sum * x_sum)/number_data))/(x_sqrd-((x_sumsq)/number_data))
#calculate B0 using page 14 - eqn 2.6 in the text
b0_coef<-(y_mean)-(b1_coef*x_mean)
#helper function: evaluate the fitted line b0 + b1*x at a given x
myfunction<-function(xvar) {
b0_coef + (b1_coef * xvar)
}
#not needed for this analysis
#expected_y<-curve(myfunction(x), from = 16, to= 19)
#curve(myfunction(x), add =TRUE, col="orange")
#fitted values and residuals for the observed data
expected_yi <-myfunction(x)
error_residuals_values<- y - expected_yi
#observation order index for the independence plot
data_num <-seq_along(y)
plot(expected_yi, error_residuals_values, main="Constant Variance", pch=7, col="brown",panel.last = grid(10,10))
plot(data_num, error_residuals_values, main="Independence", pch=16, col="gold",panel.last = grid(10,10))
#now verify the result using the R function lm; we store the result
#in an object "linear_obj" for readability and SW best practice
linear_obj<-lm(y~x)
#scatter plot of the raw data with the fitted regression line overlaid
plot(x, y, main="Raw Data", pch=16, panel.last = grid(10,10))
abline(linear_obj, col="blue")
#setup the NPP to test condition 1. Normal distrb. of errors
res <- residuals(linear_obj)
qqnorm(res, main = "Normal Q-Q Plot", xlab = "Theoretical Quantiles",
       ylab = "Sample Quantiles", pch = 7)
qqline(res, col = "red", lwd = 2)
#run the summary function to get our results
summary(linear_obj)
#print the fitted model
linear_obj
#do an ANOVA and get the t statistics
anova(linear_obj)
t_stats <- summary(linear_obj)$coefficients[, "t value"]
t_stats
#get a summary
summary(linear_obj)$coefficients
#grid of x values spanning the observed range for the interval estimates
newx<-seq(min(x),max(x),by=0.1375)
predict(linear_obj,data.frame(x=newx),interval="confidence") #confidence interval
predict(linear_obj,data.frame(x=newx),interval="prediction") #prediction interval