library(plyr)
library(binom)
library(car)
library(collapsibleTree)
library(dbplyr)
library(dplyr)
library(EnvStats)
library(ggformula)
library(ggplot2)
library(gmodels)
library(htmltools)
library(ISLR)
library(knitr)
library(lawstat)
library(markdown)
library(mosaic)
library(mdsr)
library(mosaicData)
library(nycflights13)
library(olsrr)
library(purrr)
library(plotly)
library(resampledata)
library(rmarkdown)
library(rpart)
library(rpart.plot)
library(rvest)
library(SDaA)
library(shiny)
library(stringi)
library(tibble)
library(tidyr)
library(tidyselect)
library(tinytex)
library(yaml)
library(shiny)

Attempt all problems below. You will be required to complete this assignment in R Notebook or R Markdown. Convert your work to a .pdf file and submit to Gradescope.ca.

Ensure you justify all computation and data visualizations with accompanying code.

1. The Capital Asset Price Model (CAPM) is a financial model that attempts to predict the rate of return on a financial instrument, such as a common stock, in such a way that it is linearly related to the rate of return on the overal market. Specifically

\[ R_{StockA,i} = \beta_{0} + \beta_{1}R_{Market, i} + e_{i} \]

(Note: To align the notation with a stock’s ``beta’’, \(\beta_{0} = A\) and \(\beta_{1} = B\))

You are to study the relationship between the two variables and estimate the above model:

\(R_{SUNCOR, i}\) - rate of return on Stock A for month \(i\), \(i = 1, 2, \cdots, 59\).

\(R_{Market, i}\) - market rate of return for month \(i\), \(i = 1, 2, \cdots, 59\).

\(\beta_{1}\) represent’s the stocks `beta’ value, or its systematic risk. It measure’s the stocks volatility related to the market volatility. \(\beta_{0}\) represents the risk-free interest rate.

The data appeaing in the file contains the data on Suncor’s rate of return and the Toronto Composite Index rate of return for 59 randomly selected months.

Therefore \(R_{SUNCOR, i}\) represents the monthly rate of return for a common share of Suncor stock; \(R_{TSE, i}\) represents the monthly rate of return (increase or decrease) of the TSE Index for the same month, month \(i\). The first column in this data file contains the monthly rate of return on Suncor stock; the second column contains the monthly rate of return on the TSE index for the same month.

Read this data into R Studio and answer the questions posed below.

capmdata.df = read.csv("https://raw.githubusercontent.com/Statman44/Data602/main/capm.csv")
head(capmdata.df, 3)  #to get a sense of what the data look like
  1. Appropriately visualize these data. What can you infer from this visualization? Provide a brief commentary.
ggplot(capmdata.df, aes(x = TSE.Index, y = Suncor)) + geom_point(col="blue", size = 2) + xlab("TSE Index") + ylab("Suncor") + ggtitle("Plot:TSE Index to Suncor") 

The graph shows a positive relation between the variables TSE.Index and Suncor

  1. Estimate the model above.
predict.suncorRate = lm(Suncor ~ TSE.Index, data = capmdata.df)
predict.suncorRate$coef
(Intercept)   TSE.Index 
 0.01664794  0.53869099 

\[ \widehat{Suncor}_{i} = 0.01664794 + (0.53869099*TSE.Index_{i}) \hspace{0.5in} \text{(Note: There is no}\:\: e_{i} \:\:\text{term on the estimate of the model)} \] (c) In the context of these data, interpret the meaning of your estimates of the estimates \(\widehat{\beta}_{0}\) and \(\widehat{\beta}_{1}\), in the context of these data.

The rate of return of Suncor will increase by an average/mean of 0.53869099 per every unit the TSE Index increases

  1. Refer to your answer in (b) In a certain month, the rate of return on the TSE Index was 4%. Predict the rate of return on Suncor stock for the same month.
0.01664794 + (0.53869099*0.04)
[1] 0.03819558
predict(predict.suncorRate, data.frame(TSE.Index=0.04))
         1 
0.03819558 
  1. Think about the conditions of this model in the context of these data. Create the visualizations that inspect each of the two conditions and provide commentary that addresses the validity (or invalidity) of each.

We need to check for the 2 conditions: Normality of Residuals and Homoscedasticity 1. Normalcy of Residuals

tseindex = capmdata.df$TSE.Index  #read tse.index
suncorvalues = capmdata.df$Suncor   #read Suncor rate
predicted.SuncorRate = predict.suncorRate$fitted.values #get fitted values from the predict.suncor 
eisSuncor = predict.suncorRate$residuals      #gget residual from predict.suncor
standardized.eis=rstandard(predict.suncorRate)  #applies rstandard to predeict.suncor and place them in standardized.eis variable
diagnosticdf = data.frame(tseindex, suncorvalues, predicted.SuncorRate, eisSuncor, standardized.eis) #create a data frame with all variables above
head(diagnosticdf,4)
ggplot(diagnosticdf, aes(sample = standardized.eis)) +  stat_qq(col='blue') + stat_qqline(col='red') + ggtitle("Normal Probability Plot of Standardized Residuals")

This graph proves the normalcy of residuals.

  1. Homoscedasticity
ggplot(diagnosticdf, aes(x = tseindex, y = standardized.eis)) +  geom_point(size=2, col='blue', position="jitter") + xlab("TSE Index") + ylab("Standardized Residuals") + ggtitle("Scatterplot of  Residuals vs predicted values") + geom_hline(yintercept=0, color="red", linetype="dashed")

  1. From these data, can you infer that the monthly rate of return of Suncor stock can be expressed as a positive linear function of the monthly rate of return of the TSE Index? State your statistical hypotheses, compute (and report) both the test statistic and the \(P\)-value and provide your decision.

Hypotheses H0: B1<=0
HA:B1>0

summary(predict.suncorRate)

Call:
lm(formula = Suncor ~ TSE.Index, data = capmdata.df)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.187377 -0.068498 -0.003155  0.072103  0.252016 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  0.01665    0.01177   1.414   0.1628   
TSE.Index    0.53869    0.19178   2.809   0.0068 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.08995 on 57 degrees of freedom
Multiple R-squared:  0.1216,    Adjusted R-squared:  0.1062 
F-statistic:  7.89 on 1 and 57 DF,  p-value: 0.006797
Tobs=(0.53869099-0)/0.19177963 
Tobs
[1] 2.808906

Tobs=2.8089 with df=57 same as the result obtained with the predict.suncorRate and a Pvalue = 0.0068/2 = 0.00034 (We are looking at the probability of T > Tobs so we divide Pvalue by 2)

1-pt(2.8089, 57-2)
[1] 0.003434463

As per Pvalue = 0.00343, (less than 0.05) we reject H0. We can say that the monthly rate of return of Suncor stock can be expressed as a positive linear function of the monthly rate of return of the TSE Index

  1. Compute a 95% confidence interval for \(\beta_{1}\), then interpret its meaning in the context of these data.
confint(predict.suncorRate, conf.int=0.95)
                   2.5 %     97.5 %
(Intercept) -0.006928949 0.04022482
TSE.Index    0.154658904 0.92272309

the regression coefficient will have a vale between 0.1546 and 0.9227 in 95% of the cases.

  1. Compute a 95% confidence interval for the mean monthly rate of return of Suncor stock when the TSE has a monthly rate of return of 3%.
predict(predict.suncorRate, newdata=data.frame(TSE.Index=0.03), interval='conf')
         fit         lwr        upr
1 0.03280867 0.007660256 0.05795708

when the TSE.Index value is 0.03, Suncor monthly rate will be 95% of the time between 0.00766 and 0.0579

  1. In a month of September, the TSE Index had a rate of return of 1.16%. With 95% confidence, compute the September rate of return for Suncor stock.
predict(predict.suncorRate, newdata=data.frame(TSE.Index=0.0116), interval='conf')
         fit           lwr        upr
1 0.02289675 -0.0006404417 0.04643395

We can say with 95% confidence that the rate of return of the Suncor stock for that month of September was between -0.00064 and 0.04643

  1. Recall the Bootstrap Method. From these data, use the bootstrap method to create a 95% confidence interval for mean monthly rate of return of Suncor stock when the TSE has a monthly rate of return of 3%. Compare your result to your result in part (h). Use 1000 iterations for your bootstrap. Carefully consider how you would resample bivariate data points \((x_{TSE, i}, y_{Suncor, i})\).
Nsims = 1000
TSEval = 0.03
boot_predict = numeric(1000)
pair_x_y = dim(capmdata.df)[1]

for(i in 1:Nsims)
{
  index = sample(pair_x_y,replace=TRUE)
  dsample = capmdata.df[index, ]
  temp = lm(Suncor~TSE.Index, data = dsample)
  boot_predict[i] = (coef(temp)[1]) + ((coef(temp)[2])*TSEval)
}

reg_bootstrap = data.frame(boot_predict)
head(reg_bootstrap,4)
ggplot(reg_bootstrap, aes(x = boot_predict)) + geom_histogram(col='red', fill='blue', binwidth=0.006) + xlab("Predicted Values of Suncor Rate when TSE.Index = 0.03") + ylab("Count") + 
ggtitle("Distribution of Bootstrap Statistic: Predicted Suncor Rate")

qdata(~boot_predict, c(0.025,0.975), data=reg_bootstrap)
       2.5%       97.5% 
0.008810805 0.056672167 

This confidence interval is close to the interval found in h.

2. Was Barry Bonds using Steroids? During the 2001 Season, Bary Bonds hit a major leauge record of 73 home runs.

The following bivariate data set gives the year and the number of home runs divided by the number of at bats - attempts to hit the ball - for each season. The number of homeruns is not used as later in his career he was given intentional walks, which do not count as an at bat.

In this exercise, you will build on your learning of model building and attempt to predict the number of home runs Barry Bonds would have hit in the 2001 season. These data are stored in the data file.

Bonds Just After He Signed His San Fransico Contract

Circa 2000

Remove the data point that corresponds to the season == 2001. After which, you are attempting to build a statistical model of the following form: \[ HRAT_{i} = A + B*Year_{i} + e_{i} \hspace{0.5in} i = 1987, 1988, \cdots, 2000. \] Carry out the appropriate statistical analysis to justify your answer. Communicate your answer in the form of a few paragraphs. In writing your response and statistical justification, ensure that any conditions/assumptions of the model stated above are mentioned and checked.

bbonds.df = read.csv("https://raw.githubusercontent.com/Statman44/Data602/main/bondsdata.csv")
head(bbonds.df)
tail(bbonds.df)

Create a new dataframe without 2001

bbonds1.df = bbonds.df %>% filter(season!=2001)
tail(bbonds1.df)
ggplot(bbonds1.df, aes(x = season, y = hrat)) + geom_point(col="blue", size = 2) + xlab("Season") + ylab("hrat") + ggtitle("Scatter plot for Barry Bonds' (hrat) Homeruns 1987-2000")

The scatterplot pattern does not show a cone like pattern, seems to follow linearity

predict.homerun = lm(hrat~season, data=bbonds1.df)
predict.homerun$coefficients
 (Intercept)       season 
-7.992499290  0.004044169 
summary(predict.homerun)

Call:
lm(formula = hrat ~ season, data = bbonds1.df)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.020722 -0.009931  0.001841  0.007701  0.023055 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -7.9924993  1.7566775  -4.550 0.000666 ***
season       0.0040442  0.0008812   4.589 0.000622 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.01329 on 12 degrees of freedom
Multiple R-squared:  0.6371,    Adjusted R-squared:  0.6068 
F-statistic: 21.06 on 1 and 12 DF,  p-value: 0.0006222

Se = 0.01329 r^2 = 0.6371 t value = 4.589 b = 0.00404 a = -7.9925

CONDITIONS OF THE MODEL

actual.hrat = bbonds1.df$hrat
actual.season= bbonds1.df$season
eis.bbonds=predict.homerun$residuals
se.bbonds = summary(predict.homerun)$sigma
standardized.eis = rstandard(predict.homerun)
predicted.values.hrat = predict.homerun$fitted.values
diagnostic.df = data.frame(actual.season, actual.hrat, predicted.values.hrat,eis.bbonds, standardized.eis)
head(diagnostic.df)
  1. Normality
ggplot(diagnostic.df, aes(sample = standardized.eis)) +  stat_qq(col='blue') + stat_qq_line(col='red') + ggtitle("Normal Probability Plot of Standardized Residuals")

The line fits through the pattern. This can be considered linear and proves Normality

  1. Homoscedasticity
ggplot(diagnostic.df, aes(x = actual.season, y = standardized.eis)) + geom_point(size=2, col='blue', position="jitter") + xlab("Season") + ylab("Residuals") +
ggtitle("Season to Standardized Residuals") + geom_hline(yintercept=0, color="red", linetype="dashed")

The spread does not show any cone like patterns on the spread of residuals. We can consider it homoscedastic

Hypotheses H0: beta1 = 0 HA: Beta1 != 0

F-TEST

summary(aov(predict.homerun))
            Df   Sum Sq  Mean Sq F value   Pr(>F)    
season       1 0.003721 0.003721   21.06 0.000622 ***
Residuals   12 0.002120 0.000177                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
summary(aov(predict.homerun))[[1]][["Pr(>F)"]][1]
[1] 0.0006222474

Pvalue is 0.000622 so we reject H0

PREDICTING HOMERUNS 2001 SEASON

nh = data.frame(season=2021)
predict(predict.homerun, newdata=nh, interval='prediction')
        fit       lwr       upr
1 0.1807667 0.1200519 0.2414816

Extract Hrat for 2001

bbonds.df[15,2]
[1] 0.1534

use Hrat 2001 and number of homeruns to calculate atbat

atbat = 73/0.1534
atbat
[1] 475.8801

use Hrat 2001 and number of homeruns to calculate atbat

atbat = 73/0.1534
atbat
[1] 475.8801

Use fit value of the model and atbat to do the prediction

predict.2001 = atbat*0.1808
predict.2001
[1] 86.03911
predict.2001_upper = atbat * 0.1331382
predict.2001_lower = atbat * 0.0666284
predict.2001_upper
[1] 63.35781
predict.2001_lower
[1] 31.70713
---
title: "Data 602 - Assignment Four, Part II"
output:
  html_notebook: default
  pdf_document: default
  html_document:
    df_print: paged
date: "Due October 15, 2023 @ 11:59pm"
---

```{r}
library(plyr)
library(binom)
library(car)
library(collapsibleTree)
library(dbplyr)
library(dplyr)
library(EnvStats)
library(ggformula)
library(ggplot2)
library(gmodels)
library(htmltools)
library(ISLR)
library(knitr)
library(lawstat)
library(markdown)
library(mosaic)
library(mdsr)
library(mosaicData)
library(nycflights13)
library(olsrr)
library(purrr)
library(plotly)
library(resampledata)
library(rmarkdown)
library(rpart)
library(rpart.plot)
library(rvest)
library(SDaA)
library(shiny)
library(stringi)
library(tibble)
library(tidyr)
library(tidyselect)
library(tinytex)
library(yaml)
library(shiny)
```

Attempt all problems below. You will be required to complete this assignment in R Notebook or R Markdown. Convert your work to a .pdf file and submit to [Gradescope.ca](https://www.gradescope.ca/). 

Ensure you justify all computation and data visualizations with accompanying code. 


**1.** The Capital Asset Price Model (CAPM) is a financial model that attempts to predict the 
rate of return on a financial instrument, such as a common stock, in such a way that it is
linearly related to the rate of return on the overal market. Specifically

$$
R_{StockA,i} = \beta_{0} + \beta_{1}R_{Market, i} + e_{i}   
$$

(Note: To align the notation with a stock's ``beta'', $\beta_{0} = A$ and $\beta_{1} = B$)


You are to study the relationship between the two variables and estimate the above model:

$R_{SUNCOR, i}$ - rate of return on Stock A for month $i$, $i = 1, 2, \cdots, 59$.

$R_{Market, i}$ - market rate of return for month $i$, $i = 1, 2, \cdots, 59$.


$\beta_{1}$ represent's the stocks `beta' value, or its *systematic risk*. It measure's the stocks volatility related
to the market volatility.  $\beta_{0}$ represents the risk-free interest rate.

The data appeaing in the [file](https://github.com/Statman44/Data602/blob/main/bondsdata.csv) contains the data on Suncor's rate of return and the Toronto Composite Index rate of return for 59 randomly selected months.  

Therefore $R_{SUNCOR, i}$ represents the monthly rate of return for a common share of Suncor stock; 
$R_{TSE, i}$ represents the monthly rate of return (increase or decrease) of the TSE Index for the same month, month $i$.
The first column in this data file contains the monthly rate of return on Suncor stock; the second column contains the monthly rate of return on the TSE index for the *same month*. 

Read this data into R Studio and answer the questions posed below. 
```{r}
capmdata.df = read.csv("https://raw.githubusercontent.com/Statman44/Data602/main/capm.csv")
head(capmdata.df, 3)  #to get a sense of what the data look like
```


(a) Appropriately visualize these data. What can you infer from this visualization? Provide a brief commentary.

```{r}
ggplot(capmdata.df, aes(x = TSE.Index, y = Suncor)) + geom_point(col="blue", size = 2) + xlab("TSE Index") + ylab("Suncor") + ggtitle("Plot:TSE Index to Suncor") 

```
The graph shows a positive relation between the variables TSE.Index and Suncor

(b) Estimate the model above.  
```{r}
predict.suncorRate = lm(Suncor ~ TSE.Index, data = capmdata.df)
predict.suncorRate$coef

```
$$
\widehat{Suncor}_{i} = 0.01664794 + (0.53869099*TSE.Index_{i}) \hspace{0.5in} \text{(Note: There is no}\:\: e_{i} \:\:\text{term on the estimate of the model)}
$$
(c) In the context of these data, interpret the meaning of your estimates of the estimates $\widehat{\beta}_{0}$ and $\widehat{\beta}_{1}$, in the context of these data.

The rate of return of Suncor will increase by an average/mean of 0.53869099 per every unit the TSE Index increases


(d) Refer to your answer in (b) In a certain month, the rate of return on the TSE Index was 4%. Predict the rate of return on Suncor stock for the same month. 
```{r}
0.01664794 + (0.53869099*0.04)
```
```{r}
predict(predict.suncorRate, data.frame(TSE.Index=0.04))
```



(e) Think about the conditions of this model *in the context* of these data. Create the visualizations that inspect each of the two conditions and provide commentary that addresses the validity (or invalidity) of each. 

We need to check for the 2 conditions: Normality of Residuals and Homoscedasticity
1. Normalcy of Residuals

```{r}
tseindex = capmdata.df$TSE.Index  #read tse.index
suncorvalues = capmdata.df$Suncor   #read Suncor rate
predicted.SuncorRate = predict.suncorRate$fitted.values #get fitted values from the predict.suncor 
eisSuncor = predict.suncorRate$residuals      #gget residual from predict.suncor
standardized.eis=rstandard(predict.suncorRate)  #applies rstandard to predeict.suncor and place them in standardized.eis variable
diagnosticdf = data.frame(tseindex, suncorvalues, predicted.SuncorRate, eisSuncor, standardized.eis) #create a data frame with all variables above
head(diagnosticdf,4)
```
```{r}
ggplot(diagnosticdf, aes(sample = standardized.eis)) +  stat_qq(col='blue') + stat_qqline(col='red') + ggtitle("Normal Probability Plot of Standardized Residuals")
```
This graph proves the normalcy of residuals. 

2. Homoscedasticity

```{r}
ggplot(diagnosticdf, aes(x = tseindex, y = standardized.eis)) +  geom_point(size=2, col='blue', position="jitter") + xlab("TSE Index") + ylab("Standardized Residuals") + ggtitle("Scatterplot of  Residuals vs predicted values") + geom_hline(yintercept=0, color="red", linetype="dashed")
```

(f) From these data, can you infer that the monthly rate of return of Suncor stock can be expressed as a positive linear function of the monthly rate of return of the TSE Index? State your statistical hypotheses, compute (and report) both the test statistic and the $P$-value and provide your decision. 

Hypotheses
        H0: B1<=0           
        HA:B1>0

```{r}
summary(predict.suncorRate)
```

```{r}
Tobs=(0.53869099-0)/0.19177963 
Tobs
```
Tobs=2.8089 with df=57 same as the result obtained with the predict.suncorRate  and a Pvalue = 0.0068/2 = 0.00034 (We are looking at the probability of T > Tobs so we divide Pvalue by 2)

```{r}
1-pt(2.8089, 57-2)
```

As per Pvalue = 0.00343, (less than 0.05) we reject H0. We can say that the monthly rate of return of Suncor stock can be expressed as a positive linear function of the monthly rate of return of the TSE Index

(g) Compute a 95% confidence interval for $\beta_{1}$, then interpret its meaning in the context of these data. 

```{r}
confint(predict.suncorRate, conf.int=0.95)
```
the regression coefficient will have a vale between 0.1546 and 0.9227 in 95% of the cases.


(h) Compute a 95% confidence interval for the mean monthly rate of return of Suncor stock when the TSE has a monthly rate of return of 3%. 

```{r}
predict(predict.suncorRate, newdata=data.frame(TSE.Index=0.03), interval='conf')
```
when the TSE.Index value is 0.03, Suncor monthly rate will be 95% of the time between 0.00766 and 0.0579


(i) In a month of September, the TSE Index had a rate of return of 1.16%. With 95% confidence, compute the September rate of return for Suncor stock. 

```{r}
predict(predict.suncorRate, newdata=data.frame(TSE.Index=0.0116), interval='conf')
```
We can say with 95% confidence that the rate of return of the Suncor stock for that month of September was between -0.00064 and 0.04643

(j) Recall the Bootstrap Method. From these data, use the bootstrap method to create a 95% confidence interval for mean monthly rate of return of Suncor stock when the TSE has a monthly rate of return of 3%. Compare your result to your result in part (h). Use 1000 iterations for your bootstrap. *Carefully* consider how you would resample bivariate data points $(x_{TSE, i}, y_{Suncor, i})$. 

```{r}
Nsims = 1000
TSEval = 0.03
boot_predict = numeric(1000)
pair_x_y = dim(capmdata.df)[1]

for(i in 1:Nsims)
{
  index = sample(pair_x_y,replace=TRUE)
  dsample = capmdata.df[index, ]
  temp = lm(Suncor~TSE.Index, data = dsample)
  boot_predict[i] = (coef(temp)[1]) + ((coef(temp)[2])*TSEval)
}

reg_bootstrap = data.frame(boot_predict)
head(reg_bootstrap,4)
```

```{r}
ggplot(reg_bootstrap, aes(x = boot_predict)) + geom_histogram(col='red', fill='blue', binwidth=0.006) + xlab("Predicted Values of Suncor Rate when TSE.Index = 0.03") + ylab("Count") + 
ggtitle("Distribution of Bootstrap Statistic: Predicted Suncor Rate")
```

```{r}
qdata(~boot_predict, c(0.025,0.975), data=reg_bootstrap)
```
This confidence interval is close to the interval found in h. 


**2.** Was Barry Bonds using Steroids? During the 2001 Season, Bary Bonds hit a major leauge record of 73 home runs.

The following bivariate data set gives the year and the number of home runs divided by the number of at bats - attempts to hit the ball - for each season. The number of homeruns is not used as later in his career he was given intentional walks, which do not count as an at bat. 

In this exercise, you will build on your learning of model building and attempt to **predict the number of home runs Barry Bonds would have hit in the 2001** season. These data are stored in the [data file](https://github.com/Statman44/Data602/blob/main/bondsdata.csv). 


<center>
![Bonds Just After He Signed His San Fransico Contract](BondsPre.jpg)

![Circa 2000](BondsPost.jpg)


Remove the data point that corresponds to the **season == 2001**. After which, you are attempting to build a statistical model of the following form:
$$
HRAT_{i} = A + B*Year_{i} + e_{i} \hspace{0.5in} i = 1987, 1988, \cdots, 2000.
$$
Carry out the appropriate statistical analysis to justify your answer. Communicate your answer in the form of a few paragraphs. In writing your response and statistical justification, ensure that any conditions/assumptions of the model stated above are mentioned and checked. 

```{r}
bbonds.df = read.csv("https://raw.githubusercontent.com/Statman44/Data602/main/bondsdata.csv")
head(bbonds.df)
```

```{r}
tail(bbonds.df)
```

Create a new dataframe without 2001

```{r}
bbonds1.df = bbonds.df %>% filter(season!=2001)
tail(bbonds1.df)
```


```{r}
ggplot(bbonds1.df, aes(x = season, y = hrat)) + geom_point(col="blue", size = 2) + xlab("Season") + ylab("hrat") + ggtitle("Scatter plot for Barry Bonds' (hrat) Homeruns 1987-2000")
```
The scatterplot pattern does not show a cone like pattern, seems to follow linearity

```{r}
predict.homerun = lm(hrat~season, data=bbonds1.df)
predict.homerun$coefficients

```

```{r}
summary(predict.homerun)
```

Se = 0.01329
r^2 = 0.6371
t value = 4.589
b = 0.00404
a = -7.9925


CONDITIONS OF THE MODEL

```{r}
actual.hrat = bbonds1.df$hrat
actual.season= bbonds1.df$season
eis.bbonds=predict.homerun$residuals
se.bbonds = summary(predict.homerun)$sigma
standardized.eis = rstandard(predict.homerun)
predicted.values.hrat = predict.homerun$fitted.values
diagnostic.df = data.frame(actual.season, actual.hrat, predicted.values.hrat,eis.bbonds, standardized.eis)
head(diagnostic.df)
```

1. Normality

```{r}
ggplot(diagnostic.df, aes(sample = standardized.eis)) +  stat_qq(col='blue') + stat_qq_line(col='red') + ggtitle("Normal Probability Plot of Standardized Residuals")
```
The line fits through the pattern. This can be considered linear and proves Normality


2. Homoscedasticity

```{r}
ggplot(diagnostic.df, aes(x = actual.season, y = standardized.eis)) + geom_point(size=2, col='blue', position="jitter") + xlab("Season") + ylab("Residuals") +
ggtitle("Season to Standardized Residuals") + geom_hline(yintercept=0, color="red", linetype="dashed")
```
The spread does not show any cone like patterns on the spread of residuals. We can consider it homoscedastic

Hypotheses
        H0: beta1 = 0
        HA: Beta1 != 0
        
F-TEST
```{r}
summary(aov(predict.homerun))
```

```{r}
summary(aov(predict.homerun))[[1]][["Pr(>F)"]][1]
```
Pvalue is 0.000622 so we reject H0


PREDICTING HOMERUNS 2001 SEASON


```{r}
nh = data.frame(season=2021)
predict(predict.homerun, newdata=nh, interval='prediction')
```
Extract Hrat for 2001

```{r}
bbonds.df[15,2]
```
use Hrat 2001 and number of homeruns to calculate atbat

```{r}
atbat = 73/0.1534
atbat
```
use Hrat 2001 and number of homeruns to calculate atbat

```{r}
atbat = 73/0.1534
atbat

```
Use fit value of the model and atbat to do the prediction
```{r}
predict.2001 = atbat*0.1808
predict.2001
```
```{r}
predict.2001_upper = atbat * 0.1331382
predict.2001_lower = atbat * 0.0666284
predict.2001_upper
```


```{r}
predict.2001_lower

```

