To keep this assignment simple, I use the Football Statistics dataset, loaded into R from a CSV file (it is not one of R's built-in datasets).
data <- read.csv('Football Statistics.csv')
summary(data)
## date_GMT referee total_goal_count
## Length:380 Length:380 Min. :0.000
## Class :character Class :character 1st Qu.:2.000
## Mode :character Mode :character Median :3.000
## Mean :2.821
## 3rd Qu.:4.000
## Max. :8.000
## total_goals_at_half_time stadium_name
## Min. :0.000 Length:380
## 1st Qu.:0.000 Class :character
## Median :1.000 Mode :character
## Mean :1.253
## 3rd Qu.:2.000
## Max. :6.000
Answer:
Bootstrapping belongs to the broader family of resampling methods, which covers any procedure that draws random samples with replacement from the observed data. The term refers to attaching measures of precision, such as standard errors or confidence intervals, to sample estimates. By resampling repeatedly, bootstrapping can approximate the sampling distribution of virtually any statistic.
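As a minimal base-R sketch of the idea, the snippet below bootstraps the mean of total_goal_count: it resamples the 380 matches with replacement many times and recomputes the mean on each resample. The 1,000 replicates and the seed are arbitrary choices.
# Resample total_goal_count with replacement and recompute the mean each time.
set.seed(123)  # arbitrary seed for reproducibility
boot_means <- replicate(1000, mean(sample(data$total_goal_count, replace = TRUE)))
# The spread of the bootstrap means quantifies the precision of the sample mean.
sd(boot_means)                          # bootstrap standard error
quantile(boot_means, c(0.025, 0.975))   # 95% percentile interval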
Next, I fit a linear regression that predicts total_goal_count from referee, to see which referees' matches tend to produce the most goals.
display <- lm(total_goal_count ~ referee, data = data)
summary(display)
##
## Call:
## lm(formula = total_goal_count ~ referee, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.059 -1.000 -0.250 1.189 4.708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.81481 0.30978 9.086 <2e-16 ***
## refereeAndy Madley 0.18519 1.17962 0.157 0.8753
## refereeAnthony Taylor -0.06481 0.42064 -0.154 0.8776
## refereeChris Kavanagh -0.10648 0.45158 -0.236 0.8137
## refereeCraig Pawson -0.39174 0.44229 -0.886 0.3764
## refereeDavid Coote 0.63973 0.57578 1.111 0.2673
## refereeGraham Scott 0.24401 0.49838 0.490 0.6247
## refereeJonathan Moss 0.18519 0.43810 0.423 0.6728
## refereeKevin Friend -0.03704 0.43810 -0.085 0.9327
## refereeLee Mason -0.13060 0.48202 -0.271 0.7866
## refereeLee Probert -0.20370 0.48981 -0.416 0.6777
## refereeMartin Atkinson -0.22861 0.43048 -0.531 0.5957
## refereeMichael Oliver -0.01481 0.42701 -0.035 0.9723
## refereeMike Dean -0.22861 0.43048 -0.531 0.5957
## refereePaul Tierney 0.47685 0.45158 1.056 0.2917
## refereeRoger East 0.98519 0.59588 1.653 0.0991 .
## refereeSimon Hooper -0.56481 0.64796 -0.872 0.3840
## refereeStuart Attwell 0.23519 0.47489 0.495 0.6207
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.61 on 362 degrees of freedom
## Multiple R-squared: 0.03484, Adjusted R-squared: -0.01049
## F-statistic: 0.7686 on 17 and 362 DF, p-value: 0.7294
Answer:
10-fold cross-validation estimates how well the model will perform on new data: the observations are split into 10 folds, and each fold in turn is held out for testing while the model is fit on the remaining nine.
The value k = 10 is a common default, but other values of k can be chosen to suit the size of the dataset.
Libraries such as scikit-learn (or caret in R) also offer widely used variants, such as stratified and repeated cross-validation.
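As a minimal sketch, the snippet below cross-validates the same referee model using the caret package (an assumed dependency, not loaded elsewhere in this assignment): trainControl defines the 10 folds and train refits the linear model on each training split.
library(caret)
# Describe the resampling scheme: 10 folds, each used once as a held-out test set.
ctrl <- trainControl(method = "cv", number = 10)
set.seed(123)  # arbitrary seed so the fold assignment is reproducible
cv_model <- train(total_goal_count ~ referee, data = data,
                  method = "lm", trControl = ctrl)
cv_model$results  # cross-validated RMSE, R-squared, and MAE
The cross-validated RMSE estimates the model's prediction error on matches it has not seen, which is a fairer assessment than the in-sample residual standard error reported above.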