data <- read.csv('Football Statistics.csv')

Bootstrap

To keep this assignment simple, I have used the Football Statistics dataset, read in from the file Football Statistics.csv.

summary(data)
##    date_GMT           referee          total_goal_count
##  Length:380         Length:380         Min.   :0.000   
##  Class :character   Class :character   1st Qu.:2.000   
##  Mode  :character   Mode  :character   Median :3.000   
##                                        Mean   :2.821   
##                                        3rd Qu.:4.000   
##                                        Max.   :8.000   
##  total_goals_at_half_time stadium_name      
##  Min.   :0.000            Length:380        
##  1st Qu.:0.000            Class :character  
##  Median :1.000            Mode  :character  
##  Mean   :1.253                              
##  3rd Qu.:2.000                              
##  Max.   :6.000
  1. These statistics help me choose the top 5 stadiums for future league purposes and judge the importance of each stadium. Using bootstrapping, test the hypothesis that the mean total goal count is 2.821053. Explain in words how to create a bootstrap sample and how to create a bootstrap distribution for the mean. Make sure to state the hypothesis, give a confidence interval and a \(p\)-value, and state the conclusion for the mean in proper statistical terms.

Answer:

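Bootstrapping is one of the resampling methods, a broad category that covers any test or metric based on random sampling with replacement. The term “bootstrapping” refers to the process of assigning measures of precision, such as standard errors and confidence intervals, to sample estimates. A bootstrap sample is created by drawing as many observations as the original sample contains, at random and with replacement; repeating this many times and computing the statistic on each resample gives the bootstrap distribution. In this way the approach can estimate the sampling distribution of almost any statistic.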
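The steps for the mean can be sketched in R as follows. This is a minimal sketch, not a definitive analysis: the function name `bootstrap_mean_test`, the number of resamples `B`, the seed, the percentile interval, and the choice to shift the data so that the null hypothesis holds before computing the \(p\)-value are all my own assumptions. The hypotheses are \(H_0: \mu = \mu_0\) against \(H_1: \mu \neq \mu_0\).

```r
# Hypothetical helper: bootstrap confidence interval and two-sided test for a mean.
# x    : numeric vector of observations
# mu0  : hypothesized mean under H0
# B    : number of bootstrap resamples (assumed value)
# conf : confidence level for the percentile interval
bootstrap_mean_test <- function(x, mu0, B = 10000, conf = 0.95, seed = 1) {
  set.seed(seed)
  n <- length(x)

  # Bootstrap distribution of the mean: draw n values with replacement, B times
  boot_means <- replicate(B, mean(sample(x, size = n, replace = TRUE)))

  # Percentile confidence interval from the bootstrap distribution
  alpha <- 1 - conf
  ci <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)

  # For the p-value, shift the data so that its mean equals mu0 (H0 holds),
  # then resample from the shifted data
  x_null <- x - mean(x) + mu0
  null_means <- replicate(B, mean(sample(x_null, size = n, replace = TRUE)))

  # Two-sided p-value: share of null resamples at least as far from mu0
  # as the observed sample mean
  observed_diff <- abs(mean(x) - mu0)
  p_value <- mean(abs(null_means - mu0) >= observed_diff)

  list(estimate = mean(x), ci = ci, p_value = p_value)
}

# Usage with the data frame loaded earlier (assumes `data` exists):
# res <- bootstrap_mean_test(data$total_goal_count, mu0 = 2.821053)
# res$ci       # confidence interval for the mean
# res$p_value  # reject H0 at the 5% level only if this is below 0.05
```

If the hypothesized value lies inside the confidence interval and the \(p\)-value is above the chosen significance level, we fail to reject \(H_0\); otherwise we reject it.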

Cross Validation

The multiple linear regression below uses total goal count and referee from the data table to identify the top referee.

# Regress total goal count on referee; the response must not appear on the right-hand side
display <- lm(total_goal_count ~ referee, data = data)
summary(display)
## 
## Call:
## lm(formula = total_goal_count ~ referee, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.059 -1.000 -0.250  1.189  4.708 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             2.81481    0.30978   9.086   <2e-16 ***
## refereeAndy Madley      0.18519    1.17962   0.157   0.8753    
## refereeAnthony Taylor  -0.06481    0.42064  -0.154   0.8776    
## refereeChris Kavanagh  -0.10648    0.45158  -0.236   0.8137    
## refereeCraig Pawson    -0.39174    0.44229  -0.886   0.3764    
## refereeDavid Coote      0.63973    0.57578   1.111   0.2673    
## refereeGraham Scott     0.24401    0.49838   0.490   0.6247    
## refereeJonathan Moss    0.18519    0.43810   0.423   0.6728    
## refereeKevin Friend    -0.03704    0.43810  -0.085   0.9327    
## refereeLee Mason       -0.13060    0.48202  -0.271   0.7866    
## refereeLee Probert     -0.20370    0.48981  -0.416   0.6777    
## refereeMartin Atkinson -0.22861    0.43048  -0.531   0.5957    
## refereeMichael Oliver  -0.01481    0.42701  -0.035   0.9723    
## refereeMike Dean       -0.22861    0.43048  -0.531   0.5957    
## refereePaul Tierney     0.47685    0.45158   1.056   0.2917    
## refereeRoger East       0.98519    0.59588   1.653   0.0991 .  
## refereeSimon Hooper    -0.56481    0.64796  -0.872   0.3840    
## refereeStuart Attwell   0.23519    0.47489   0.495   0.6207    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.61 on 362 degrees of freedom
## Multiple R-squared:  0.03484,    Adjusted R-squared:  -0.01049 
## F-statistic: 0.7686 on 17 and 362 DF,  p-value: 0.7294
  1. Repeat this linear model using 10-fold cross-validation. Explain in words what you are doing. Examine one of the folds carefully, explaining the steps involved. Examine the \(R^2\) value and the residual standard error. Compare the values you get to those of the original model.

Answer:

  1. 10-fold cross-validation is used to estimate how well the model performs on new data.

  2. The value k = 10 is a conventional choice; a variety of methods can be used to choose k for a given dataset.

  3. There are many widely used cross-validation variants, such as stratified and repeated cross-validation; scikit-learn, for example, provides both.
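The procedure can be sketched in base R as follows. This is a minimal sketch under my own assumptions: the helper name `cv_lm`, the seed, and the random fold assignment are hypothetical and not part of the original analysis. The rows are split into 10 folds; for each fold the model is fitted on the other 9 folds and used to predict the held-out fold, and the out-of-fold predictions are then combined into a cross-validated \(R^2\) and root mean squared error.

```r
# Hypothetical helper: k-fold cross-validation for a linear model.
# formula : model formula, response on the left-hand side
# data    : data frame containing the variables in the formula
# k       : number of folds
cv_lm <- function(formula, data, k = 10, seed = 1) {
  set.seed(seed)
  response <- all.vars(formula)[1]

  # Randomly assign every row to one of k folds of (nearly) equal size
  fold_id <- sample(rep(seq_len(k), length.out = nrow(data)))
  pred <- numeric(nrow(data))

  for (i in seq_len(k)) {
    test_rows <- fold_id == i
    # Fit on the k - 1 training folds only
    fit <- lm(formula, data = data[!test_rows, , drop = FALSE])
    # Predict the held-out fold, which the model has not seen
    pred[test_rows] <- predict(fit, newdata = data[test_rows, , drop = FALSE])
  }

  # Out-of-fold error measures
  y <- data[[response]]
  rmse <- sqrt(mean((y - pred)^2))
  r2 <- 1 - sum((y - pred)^2) / sum((y - mean(y))^2)
  c(cv_rmse = rmse, cv_r2 = r2)
}

# Usage with the data frame loaded earlier (assumes `data` exists):
# cv_lm(total_goal_count ~ referee, data = data, k = 10)
# Compare with the original fit:
# summary(display)$sigma      # residual standard error
# summary(display)$r.squared  # multiple R-squared
```

One caveat: with a categorical predictor such as referee, a referee with very few matches may be missing from a training fold. In that case predict() stops with an error about new factor levels, so a different seed or stratified folds may be needed. The cross-validated \(R^2\) can also be negative when the model predicts worse than the overall mean.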