library(plyr)
library(binom)
library(car)
library(collapsibleTree)
library(dbplyr)
library(dplyr)
library(EnvStats)
library(ggformula)
library(ggplot2)
library(gmodels)
library(htmltools)
library(ISLR)
library(knitr)
library(lawstat)
library(markdown)
library(mosaic)
library(mdsr)
library(mosaicData)
library(nycflights13)
library(olsrr)
library(purrr)
library(plotly)
library(resampledata)
library(rmarkdown)
library(rpart)
library(rpart.plot)
library(rvest)
library(SDaA)
library(shiny)
library(stringi)
library(tibble)
library(tidyr)
library(tidyselect)
library(tinytex)
library(yaml)
library(shiny)
Attempt all problems below. You will be required to complete this
assignment in R Notebook or R Markdown. Convert your work to a .pdf file
and submit to Gradescope.ca.
Ensure you justify all computation and data visualizations with
accompanying code.
1. The Capital Asset Price Model (CAPM) is a
financial model that attempts to predict the rate of return on a
financial instrument, such as a common stock, in such a way that it is
linearly related to the rate of return on the overal market.
Specifically
\[
R_{StockA,i} = \beta_{0} + \beta_{1}R_{Market, i} + e_{i}
\]
(Note: To align the notation with a stock’s ``beta’’, \(\beta_{0} = A\) and \(\beta_{1} = B\))
You are to study the relationship between the two variables and
estimate the above model:
\(R_{SUNCOR, i}\) - rate of return
on Stock A for month \(i\), \(i = 1, 2, \cdots, 59\).
\(R_{Market, i}\) - market rate of
return for month \(i\), \(i = 1, 2, \cdots, 59\).
\(\beta_{1}\) represent’s the stocks
`beta’ value, or its systematic risk. It measure’s the stocks
volatility related to the market volatility. \(\beta_{0}\) represents the risk-free
interest rate.
The data appeaing in the file
contains the data on Suncor’s rate of return and the Toronto Composite
Index rate of return for 59 randomly selected months.
Therefore \(R_{SUNCOR, i}\)
represents the monthly rate of return for a common share of Suncor
stock; \(R_{TSE, i}\) represents the
monthly rate of return (increase or decrease) of the TSE Index for the
same month, month \(i\). The first
column in this data file contains the monthly rate of return on Suncor
stock; the second column contains the monthly rate of return on the TSE
index for the same month.
Read this data into R Studio and answer the questions posed
below.
capmdata.df = read.csv("https://raw.githubusercontent.com/Statman44/Data602/main/capm.csv")
head(capmdata.df, 3) #to get a sense of what the data look like
- Appropriately visualize these data. What can you infer from this
visualization? Provide a brief commentary.
ggplot(capmdata.df, aes(x = TSE.Index, y = Suncor)) + geom_point(col="blue", size = 2) + xlab("TSE Index") + ylab("Suncor") + ggtitle("Plot:TSE Index to Suncor")

The graph shows a positive relation between the variables TSE.Index
and Suncor
- Estimate the model above.
predict.suncorRate = lm(Suncor ~ TSE.Index, data = capmdata.df)
predict.suncorRate$coef
(Intercept) TSE.Index
0.01664794 0.53869099
\[
\widehat{Suncor}_{i} = 0.01664794 + (0.53869099*TSE.Index_{i})
\hspace{0.5in} \text{(Note: There is no}\:\: e_{i} \:\:\text{term on the
estimate of the model)}
\] (c) In the context of these data, interpret the meaning of
your estimates of the estimates \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\), in the context of
these data.
The rate of return of Suncor will increase by an average/mean of
0.53869099 per every unit the TSE Index increases
- Refer to your answer in (b) In a certain month, the rate of return
on the TSE Index was 4%. Predict the rate of return on Suncor stock for
the same month.
0.01664794 + (0.53869099*0.04)
[1] 0.03819558
predict(predict.suncorRate, data.frame(TSE.Index=0.04))
1
0.03819558
- Think about the conditions of this model in the context of
these data. Create the visualizations that inspect each of the two
conditions and provide commentary that addresses the validity (or
invalidity) of each.
We need to check for the 2 conditions: Normality of Residuals and
Homoscedasticity 1. Normalcy of Residuals
tseindex = capmdata.df$TSE.Index #read tse.index
suncorvalues = capmdata.df$Suncor #read Suncor rate
predicted.SuncorRate = predict.suncorRate$fitted.values #get fitted values from the predict.suncor
eisSuncor = predict.suncorRate$residuals #gget residual from predict.suncor
standardized.eis=rstandard(predict.suncorRate) #applies rstandard to predeict.suncor and place them in standardized.eis variable
diagnosticdf = data.frame(tseindex, suncorvalues, predicted.SuncorRate, eisSuncor, standardized.eis) #create a data frame with all variables above
head(diagnosticdf,4)
ggplot(diagnosticdf, aes(sample = standardized.eis)) + stat_qq(col='blue') + stat_qqline(col='red') + ggtitle("Normal Probability Plot of Standardized Residuals")

This graph proves the normalcy of residuals.
- Homoscedasticity
ggplot(diagnosticdf, aes(x = tseindex, y = standardized.eis)) + geom_point(size=2, col='blue', position="jitter") + xlab("TSE Index") + ylab("Standardized Residuals") + ggtitle("Scatterplot of Residuals vs predicted values") + geom_hline(yintercept=0, color="red", linetype="dashed")

- From these data, can you infer that the monthly rate of return of
Suncor stock can be expressed as a positive linear function of the
monthly rate of return of the TSE Index? State your statistical
hypotheses, compute (and report) both the test statistic and the \(P\)-value and provide your decision.
Hypotheses H0: B1<=0
HA:B1>0
summary(predict.suncorRate)
Call:
lm(formula = Suncor ~ TSE.Index, data = capmdata.df)
Residuals:
Min 1Q Median 3Q Max
-0.187377 -0.068498 -0.003155 0.072103 0.252016
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.01665 0.01177 1.414 0.1628
TSE.Index 0.53869 0.19178 2.809 0.0068 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.08995 on 57 degrees of freedom
Multiple R-squared: 0.1216, Adjusted R-squared: 0.1062
F-statistic: 7.89 on 1 and 57 DF, p-value: 0.006797
Tobs=(0.53869099-0)/0.19177963
Tobs
[1] 2.808906
Tobs=2.8089 with df=57 same as the result obtained with the
predict.suncorRate and a Pvalue = 0.0068/2 = 0.00034 (We are looking at
the probability of T > Tobs so we divide Pvalue by 2)
1-pt(2.8089, 57-2)
[1] 0.003434463
As per Pvalue = 0.00343, (less than 0.05) we reject H0. We can say
that the monthly rate of return of Suncor stock can be expressed as a
positive linear function of the monthly rate of return of the TSE
Index
- Compute a 95% confidence interval for \(\beta_{1}\), then interpret its meaning in
the context of these data.
confint(predict.suncorRate, conf.int=0.95)
2.5 % 97.5 %
(Intercept) -0.006928949 0.04022482
TSE.Index 0.154658904 0.92272309
the regression coefficient will have a vale between 0.1546 and 0.9227
in 95% of the cases.
- Compute a 95% confidence interval for the mean monthly rate of
return of Suncor stock when the TSE has a monthly rate of return of
3%.
predict(predict.suncorRate, newdata=data.frame(TSE.Index=0.03), interval='conf')
fit lwr upr
1 0.03280867 0.007660256 0.05795708
when the TSE.Index value is 0.03, Suncor monthly rate will be 95% of
the time between 0.00766 and 0.0579
- In a month of September, the TSE Index had a rate of return of
1.16%. With 95% confidence, compute the September rate of return for
Suncor stock.
predict(predict.suncorRate, newdata=data.frame(TSE.Index=0.0116), interval='conf')
fit lwr upr
1 0.02289675 -0.0006404417 0.04643395
We can say with 95% confidence that the rate of return of the Suncor
stock for that month of September was between -0.00064 and 0.04643
- Recall the Bootstrap Method. From these data, use the bootstrap
method to create a 95% confidence interval for mean monthly rate of
return of Suncor stock when the TSE has a monthly rate of return of 3%.
Compare your result to your result in part (h). Use 1000 iterations for
your bootstrap. Carefully consider how you would resample
bivariate data points \((x_{TSE, i},
y_{Suncor, i})\).
Nsims = 1000
TSEval = 0.03
boot_predict = numeric(1000)
pair_x_y = dim(capmdata.df)[1]
for(i in 1:Nsims)
{
index = sample(pair_x_y,replace=TRUE)
dsample = capmdata.df[index, ]
temp = lm(Suncor~TSE.Index, data = dsample)
boot_predict[i] = (coef(temp)[1]) + ((coef(temp)[2])*TSEval)
}
reg_bootstrap = data.frame(boot_predict)
head(reg_bootstrap,4)
ggplot(reg_bootstrap, aes(x = boot_predict)) + geom_histogram(col='red', fill='blue', binwidth=0.006) + xlab("Predicted Values of Suncor Rate when TSE.Index = 0.03") + ylab("Count") +
ggtitle("Distribution of Bootstrap Statistic: Predicted Suncor Rate")

qdata(~boot_predict, c(0.025,0.975), data=reg_bootstrap)
2.5% 97.5%
0.008810805 0.056672167
This confidence interval is close to the interval found in h.
2. Was Barry Bonds using Steroids? During the 2001
Season, Bary Bonds hit a major leauge record of 73 home runs.
The following bivariate data set gives the year and the number of
home runs divided by the number of at bats - attempts to hit the ball -
for each season. The number of homeruns is not used as later in his
career he was given intentional walks, which do not count as an at
bat.
In this exercise, you will build on your learning of model building
and attempt to predict the number of home runs Barry Bonds would
have hit in the 2001 season. These data are stored in the data
file.
Remove the data point that corresponds to the season ==
2001. After which, you are attempting to build a statistical
model of the following form: \[
HRAT_{i} = A + B*Year_{i} + e_{i} \hspace{0.5in} i = 1987, 1988, \cdots,
2000.
\] Carry out the appropriate statistical analysis to justify your
answer. Communicate your answer in the form of a few paragraphs. In
writing your response and statistical justification, ensure that any
conditions/assumptions of the model stated above are mentioned and
checked.
bbonds.df = read.csv("https://raw.githubusercontent.com/Statman44/Data602/main/bondsdata.csv")
head(bbonds.df)
tail(bbonds.df)
Create a new dataframe without 2001
bbonds1.df = bbonds.df %>% filter(season!=2001)
tail(bbonds1.df)
ggplot(bbonds1.df, aes(x = season, y = hrat)) + geom_point(col="blue", size = 2) + xlab("Season") + ylab("hrat") + ggtitle("Scatter plot for Barry Bonds' (hrat) Homeruns 1987-2000")

The scatterplot pattern does not show a cone like pattern, seems to
follow linearity
predict.homerun = lm(hrat~season, data=bbonds1.df)
predict.homerun$coefficients
(Intercept) season
-7.992499290 0.004044169
summary(predict.homerun)
Call:
lm(formula = hrat ~ season, data = bbonds1.df)
Residuals:
Min 1Q Median 3Q Max
-0.020722 -0.009931 0.001841 0.007701 0.023055
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.9924993 1.7566775 -4.550 0.000666 ***
season 0.0040442 0.0008812 4.589 0.000622 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.01329 on 12 degrees of freedom
Multiple R-squared: 0.6371, Adjusted R-squared: 0.6068
F-statistic: 21.06 on 1 and 12 DF, p-value: 0.0006222
Se = 0.01329 r^2 = 0.6371 t value = 4.589 b = 0.00404 a = -7.9925
CONDITIONS OF THE MODEL
actual.hrat = bbonds1.df$hrat
actual.season= bbonds1.df$season
eis.bbonds=predict.homerun$residuals
se.bbonds = summary(predict.homerun)$sigma
standardized.eis = rstandard(predict.homerun)
predicted.values.hrat = predict.homerun$fitted.values
diagnostic.df = data.frame(actual.season, actual.hrat, predicted.values.hrat,eis.bbonds, standardized.eis)
head(diagnostic.df)
- Normality
ggplot(diagnostic.df, aes(sample = standardized.eis)) + stat_qq(col='blue') + stat_qq_line(col='red') + ggtitle("Normal Probability Plot of Standardized Residuals")

The line fits through the pattern. This can be considered linear and
proves Normality
- Homoscedasticity
ggplot(diagnostic.df, aes(x = actual.season, y = standardized.eis)) + geom_point(size=2, col='blue', position="jitter") + xlab("Season") + ylab("Residuals") +
ggtitle("Season to Standardized Residuals") + geom_hline(yintercept=0, color="red", linetype="dashed")

The spread does not show any cone like patterns on the spread of
residuals. We can consider it homoscedastic
Hypotheses H0: beta1 = 0 HA: Beta1 != 0
F-TEST
summary(aov(predict.homerun))
Df Sum Sq Mean Sq F value Pr(>F)
season 1 0.003721 0.003721 21.06 0.000622 ***
Residuals 12 0.002120 0.000177
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(aov(predict.homerun))[[1]][["Pr(>F)"]][1]
[1] 0.0006222474
Pvalue is 0.000622 so we reject H0
PREDICTING HOMERUNS 2001 SEASON
nh = data.frame(season=2021)
predict(predict.homerun, newdata=nh, interval='prediction')
fit lwr upr
1 0.1807667 0.1200519 0.2414816
Extract Hrat for 2001
bbonds.df[15,2]
[1] 0.1534
use Hrat 2001 and number of homeruns to calculate atbat
atbat = 73/0.1534
atbat
[1] 475.8801
use Hrat 2001 and number of homeruns to calculate atbat
atbat = 73/0.1534
atbat
[1] 475.8801
Use fit value of the model and atbat to do the prediction
predict.2001 = atbat*0.1808
predict.2001
[1] 86.03911
predict.2001_upper = atbat * 0.1331382
predict.2001_lower = atbat * 0.0666284
predict.2001_upper
[1] 63.35781
predict.2001_lower
[1] 31.70713
---
title: "Data 602 - Assignment Four, Part II"
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
date: "Due October 15, 2023 @ 11:59pm"
---

```{r}
library(plyr)
library(binom)
library(car)
library(collapsibleTree)
library(dbplyr)
library(dplyr)
library(EnvStats)
library(ggformula)
library(ggplot2)
library(gmodels)
library(htmltools)
library(ISLR)
library(knitr)
library(lawstat)
library(markdown)
library(mosaic)
library(mdsr)
library(mosaicData)
library(nycflights13)
library(olsrr)
library(purrr)
library(plotly)
library(resampledata)
library(rmarkdown)
library(rpart)
library(rpart.plot)
library(rvest)
library(SDaA)
library(shiny)
library(stringi)
library(tibble)
library(tidyr)
library(tidyselect)
library(tinytex)
library(yaml)
library(shiny)
```

Attempt all problems below. You will be required to complete this assignment in R Notebook or R Markdown. Convert your work to a .pdf file and submit to [Gradescope.ca](https://www.gradescope.ca/). 

Ensure you justify all computation and data visualizations with accompanying code. 


**1.** The Capital Asset Price Model (CAPM) is a financial model that attempts to predict the 
rate of return on a financial instrument, such as a common stock, in such a way that it is
linearly related to the rate of return on the overal market. Specifically

$$
R_{StockA,i} = \beta_{0} + \beta_{1}R_{Market, i} + e_{i}   
$$

(Note: To align the notation with a stock's ``beta'', $\beta_{0} = A$ and $\beta_{1} = B$)


You are to study the relationship between the two variables and estimate the above model:

$R_{SUNCOR, i}$ - rate of return on Stock A for month $i$, $i = 1, 2, \cdots, 59$.

$R_{Market, i}$ - market rate of return for month $i$, $i = 1, 2, \cdots, 59$.


$\beta_{1}$ represent's the stocks `beta' value, or its *systematic risk*. It measure's the stocks volatility related
to the market volatility.  $\beta_{0}$ represents the risk-free interest rate.

The data appeaing in the [file](https://github.com/Statman44/Data602/blob/main/bondsdata.csv) contains the data on Suncor's rate of return and the Toronto Composite Index rate of return for 59 randomly selected months.  

Therefore $R_{SUNCOR, i}$ represents the monthly rate of return for a common share of Suncor stock; 
$R_{TSE, i}$ represents the monthly rate of return (increase or decrease) of the TSE Index for the same month, month $i$.
The first column in this data file contains the monthly rate of return on Suncor stock; the second column contains the monthly rate of return on the TSE index for the *same month*. 

Read this data into R Studio and answer the questions posed below. 
```{r}
capmdata.df = read.csv("https://raw.githubusercontent.com/Statman44/Data602/main/capm.csv")
head(capmdata.df, 3)  #to get a sense of what the data look like
```


(a) Appropriately visualize these data. What can you infer from this visualization? Provide a brief commentary.

```{r}
ggplot(capmdata.df, aes(x = TSE.Index, y = Suncor)) + geom_point(col="blue", size = 2) + xlab("TSE Index") + ylab("Suncor") + ggtitle("Plot:TSE Index to Suncor") 

```
The graph shows a positive relation between the variables TSE.Index and Suncor

(b) Estimate the model above.  
```{r}
predict.suncorRate = lm(Suncor ~ TSE.Index, data = capmdata.df)
predict.suncorRate$coef

```
$$
\widehat{Suncor}_{i} = 0.01664794 + (0.53869099*TSE.Index_{i}) \hspace{0.5in} \text{(Note: There is no}\:\: e_{i} \:\:\text{term on the estimate of the model)}
$$
(c) In the context of these data, interpret the meaning of your estimates of the estimates $\widehat{\beta}_{0}$ and $\widehat{\beta}_{1}$, in the context of these data.

The rate of return of Suncor will increase by an average/mean of 0.53869099 per every unit the TSE Index increases


(d) Refer to your answer in (b) In a certain month, the rate of return on the TSE Index was 4%. Predict the rate of return on Suncor stock for the same month. 
```{r}
0.01664794 + (0.53869099*0.04)
```
```{r}
predict(predict.suncorRate, data.frame(TSE.Index=0.04))
```



(e) Think about the conditions of this model *in the context* of these data. Create the visualizations that inspect each of the two conditions and provide commentary that addresses the validity (or invalidity) of each. 

We need to check for the 2 conditions: Normality of Residuals and Homoscedasticity
1. Normalcy of Residuals

```{r}
tseindex = capmdata.df$TSE.Index  #read tse.index
suncorvalues = capmdata.df$Suncor   #read Suncor rate
predicted.SuncorRate = predict.suncorRate$fitted.values #get fitted values from the predict.suncor 
eisSuncor = predict.suncorRate$residuals      #gget residual from predict.suncor
standardized.eis=rstandard(predict.suncorRate)  #applies rstandard to predeict.suncor and place them in standardized.eis variable
diagnosticdf = data.frame(tseindex, suncorvalues, predicted.SuncorRate, eisSuncor, standardized.eis) #create a data frame with all variables above
head(diagnosticdf,4)
```
```{r}
ggplot(diagnosticdf, aes(sample = standardized.eis)) +  stat_qq(col='blue') + stat_qqline(col='red') + ggtitle("Normal Probability Plot of Standardized Residuals")
```
This graph proves the normalcy of residuals. 

2. Homoscedasticity

```{r}
ggplot(diagnosticdf, aes(x = tseindex, y = standardized.eis)) +  geom_point(size=2, col='blue', position="jitter") + xlab("TSE Index") + ylab("Standardized Residuals") + ggtitle("Scatterplot of  Residuals vs predicted values") + geom_hline(yintercept=0, color="red", linetype="dashed")
```

(f) From these data, can you infer that the monthly rate of return of Suncor stock can be expressed as a positive linear function of the monthly rate of return of the TSE Index? State your statistical hypotheses, compute (and report) both the test statistic and the $P$-value and provide your decision. 

Hypotheses
        H0: B1<=0           
        HA:B1>0

```{r}
summary(predict.suncorRate)
```

```{r}
Tobs=(0.53869099-0)/0.19177963 
Tobs
```
Tobs=2.8089 with df=57 same as the result obtained with the predict.suncorRate  and a Pvalue = 0.0068/2 = 0.00034 (We are looking at the probability of T > Tobs so we divide Pvalue by 2)

```{r}
1-pt(2.8089, 57-2)
```

As per Pvalue = 0.00343, (less than 0.05) we reject H0. We can say that the monthly rate of return of Suncor stock can be expressed as a positive linear function of the monthly rate of return of the TSE Index

(g) Compute a 95% confidence interval for $\beta_{1}$, then interpret its meaning in the context of these data. 

```{r}
confint(predict.suncorRate, conf.int=0.95)
```
the regression coefficient will have a vale between 0.1546 and 0.9227 in 95% of the cases.


(h) Compute a 95% confidence interval for the mean monthly rate of return of Suncor stock when the TSE has a monthly rate of return of 3%. 

```{r}
predict(predict.suncorRate, newdata=data.frame(TSE.Index=0.03), interval='conf')
```
when the TSE.Index value is 0.03, Suncor monthly rate will be 95% of the time between 0.00766 and 0.0579


(i) In a month of September, the TSE Index had a rate of return of 1.16%. With 95% confidence, compute the September rate of return for Suncor stock. 

```{r}
predict(predict.suncorRate, newdata=data.frame(TSE.Index=0.0116), interval='conf')
```
We can say with 95% confidence that the rate of return of the Suncor stock for that month of September was between -0.00064 and 0.04643

(j) Recall the Bootstrap Method. From these data, use the bootstrap method to create a 95% confidence interval for mean monthly rate of return of Suncor stock when the TSE has a monthly rate of return of 3%. Compare your result to your result in part (h). Use 1000 iterations for your bootstrap. *Carefully* consider how you would resample bivariate data points $(x_{TSE, i}, y_{Suncor, i})$. 

```{r}
Nsims = 1000
TSEval = 0.03
boot_predict = numeric(1000)
pair_x_y = dim(capmdata.df)[1]

for(i in 1:Nsims)
{
  index = sample(pair_x_y,replace=TRUE)
  dsample = capmdata.df[index, ]
  temp = lm(Suncor~TSE.Index, data = dsample)
  boot_predict[i] = (coef(temp)[1]) + ((coef(temp)[2])*TSEval)
}

reg_bootstrap = data.frame(boot_predict)
head(reg_bootstrap,4)
```

```{r}
ggplot(reg_bootstrap, aes(x = boot_predict)) + geom_histogram(col='red', fill='blue', binwidth=0.006) + xlab("Predicted Values of Suncor Rate when TSE.Index = 0.03") + ylab("Count") + 
ggtitle("Distribution of Bootstrap Statistic: Predicted Suncor Rate")
```

```{r}
qdata(~boot_predict, c(0.025,0.975), data=reg_bootstrap)
```
This confidence interval is close to the interval found in h. 


**2.** Was Barry Bonds using Steroids? During the 2001 Season, Bary Bonds hit a major leauge record of 73 home runs.

The following bivariate data set gives the year and the number of home runs divided by the number of at bats - attempts to hit the ball - for each season. The number of homeruns is not used as later in his career he was given intentional walks, which do not count as an at bat. 

In this exercise, you will build on your learning of model building and attempt to **predict the number of home runs Barry Bonds would have hit in the 2001** season. These data are stored in the [data file](https://github.com/Statman44/Data602/blob/main/bondsdata.csv). 


<center>
![Bonds Just After He Signed His San Fransico Contract](BondsPre.jpg)

![Circa 2000](BondsPost.jpg)


Remove the data point that corresponds to the **season == 2001**. After which, you are attempting to build a statistical model of the following form:
$$
HRAT_{i} = A + B*Year_{i} + e_{i} \hspace{0.5in} i = 1987, 1988, \cdots, 2000.
$$
Carry out the appropriate statistical analysis to justify your answer. Communicate your answer in the form of a few paragraphs. In writing your response and statistical justification, ensure that any conditions/assumptions of the model stated above are mentioned and checked. 

```{r}
bbonds.df = read.csv("https://raw.githubusercontent.com/Statman44/Data602/main/bondsdata.csv")
head(bbonds.df)
```

```{r}
tail(bbonds.df)
```

Create a new dataframe without 2001

```{r}
bbonds1.df = bbonds.df %>% filter(season!=2001)
tail(bbonds1.df)
```


```{r}
ggplot(bbonds1.df, aes(x = season, y = hrat)) + geom_point(col="blue", size = 2) + xlab("Season") + ylab("hrat") + ggtitle("Scatter plot for Barry Bonds' (hrat) Homeruns 1987-2000")
```
The scatterplot pattern does not show a cone like pattern, seems to follow linearity

```{r}
predict.homerun = lm(hrat~season, data=bbonds1.df)
predict.homerun$coefficients

```

```{r}
summary(predict.homerun)
```

Se = 0.01329
r^2 = 0.6371
t value = 4.589
b = 0.00404
a = -7.9925


CONDITIONS OF THE MODEL

```{r}
actual.hrat = bbonds1.df$hrat
actual.season= bbonds1.df$season
eis.bbonds=predict.homerun$residuals
se.bbonds = summary(predict.homerun)$sigma
standardized.eis = rstandard(predict.homerun)
predicted.values.hrat = predict.homerun$fitted.values
diagnostic.df = data.frame(actual.season, actual.hrat, predicted.values.hrat,eis.bbonds, standardized.eis)
head(diagnostic.df)
```

1. Normality

```{r}
ggplot(diagnostic.df, aes(sample = standardized.eis)) +  stat_qq(col='blue') + stat_qq_line(col='red') + ggtitle("Normal Probability Plot of Standardized Residuals")
```
The line fits through the pattern. This can be considered linear and proves Normality


2. Homoscedasticity

```{r}
ggplot(diagnostic.df, aes(x = actual.season, y = standardized.eis)) + geom_point(size=2, col='blue', position="jitter") + xlab("Season") + ylab("Residuals") +
ggtitle("Season to Standardized Residuals") + geom_hline(yintercept=0, color="red", linetype="dashed")
```
The spread does not show any cone like patterns on the spread of residuals. We can consider it homoscedastic

Hypotheses
        H0: beta1 = 0
        HA: Beta1 != 0
        
F-TEST
```{r}
summary(aov(predict.homerun))
```

```{r}
summary(aov(predict.homerun))[[1]][["Pr(>F)"]][1]
```
Pvalue is 0.000622 so we reject H0


PREDICTING HOMERUNS 2001 SEASON


```{r}
nh = data.frame(season=2021)
predict(predict.homerun, newdata=nh, interval='prediction')
```
Extract Hrat for 2001

```{r}
bbonds.df[15,2]
```
use Hrat 2001 and number of homeruns to calculate atbat

```{r}
atbat = 73/0.1534
atbat
```
use Hrat 2001 and number of homeruns to calculate atbat

```{r}
atbat = 73/0.1534
atbat

```
Use fit value of the model and atbat to do the prediction
```{r}
predict.2001 = atbat*0.1808
predict.2001
```
```{r}
predict.2001_upper = atbat * 0.1331382
predict.2001_lower = atbat * 0.0666284
predict.2001_upper
```


```{r}
predict.2001_lower

```

