After you load the data, record which variables are categorical and which are numeric.
CATEGORICAL: Date, HomeTeam, AwayTeam, full time result (FTR), halftime result (HTR), Referee
NUMERIC: full time home goals (FTHG), full time away goals (FTAG), halftime home goals (HTHG), halftime away goals (HTAG), number of shots taken by the home team (HS), number of shots taken by the away team (AS), number of shots on target by the home team (HST), number of shots on target by the away team (AST), number of fouls by the home team (HF), number of fouls by the away team (AF), number of corners taken by the home team (HC), number of corners taken by the away team (AC), number of yellow cards received by the home team (HY), number of yellow cards received by the away team (AY), number of red cards received by the home team (HR), number of red cards received by the away team (AR)
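One quick way to double-check a classification like the one above is to ask R which columns it stores as numeric. This is only a minimal sketch: the data frame `matches` is a made-up three-row stand-in (the real file and how it was read are not shown here), and the team labels are placeholders.

```r
# Toy stand-in for the match data (hypothetical values, not the real file)
matches <- data.frame(
  HomeTeam = c("TeamA", "TeamB", "TeamC"),
  FTR      = c("H", "D", "A"),
  FTHG     = c(2L, 1L, 0L),
  HS       = c(14L, 9L, 7L),
  stringsAsFactors = FALSE
)

# TRUE for columns stored as integer or double, FALSE otherwise
is_num <- vapply(matches, is.numeric, logical(1))

categorical_vars <- names(matches)[!is_num]
numeric_vars     <- names(matches)[is_num]

print(categorical_vars)
print(numeric_vars)
```

The storage type is only a starting point: a column such as Date can be read in as text even though it could also be treated as a date.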
2.) Let’s consider the effects of home team shots (HS), home team (HomeTeam), and home team fouls (HF) on home team goals (full time home goals). Build a fully interactive multiple linear regression model. Assess model fit and then model assumptions. How well does the model fit the data? Is the model valid?
The p-value of the model is less than 0.05; however, the adjusted R-squared is 0.2355, which is pretty low. The model does not even explain a quarter of the variability found in the data.
Assumptions:
1. Linearity: looks pretty good; the reference line is mostly flat and horizontal.
2. Normality: the normality of the residuals looks pretty good, not great, but I would still consider the assumption met.
3. Equal variance: the reference line is not flat, so this assumption does not appear to be met.
4. Independence: we know nothing about the experimental design; I am assuming independence so that I can proceed with the lab.
5. Collinearity: based on the VIFs, there is high collinearity in this data, so the assumption of no collinearity is not met.
How well does the model fit the data? Is the model valid? Overall, the model fits the data poorly. Furthermore, because multiple assumptions are violated (equal variance and no collinearity), this model is not really valid.
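The workflow described above (fit the fully interactive model, read off the model p-value and adjusted R-squared, then check the diagnostic plots and VIFs) can be sketched roughly as follows. The data here are simulated, so none of the numbers match the lab's results; `toy`, `socmod_full`, and the simulation parameters are all assumptions made for illustration. The VIF step uses `car::vif()` and is skipped if car is not installed.

```r
set.seed(1)

# Simulated stand-in for the match data (hypothetical, not the real dataset)
n <- 300
toy <- data.frame(
  HomeTeam = factor(sample(c("TeamA", "TeamB", "TeamC"), n, replace = TRUE)),
  HS = rpois(n, 12),   # home team shots
  HF = rpois(n, 11)    # home team fouls
)
toy$FTHG <- rpois(n, lambda = exp(-0.5 + 0.06 * toy$HS))  # home goals

# Fully interactive model: main effects plus all two- and three-way interactions
socmod_full <- lm(FTHG ~ HS * HomeTeam * HF, data = toy)

fit <- summary(socmod_full)
print(fit$adj.r.squared)  # adjusted R-squared

# p-value of the overall F-test
f <- fit$fstatistic
p_model <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)
print(p_model)

# Standard diagnostic plots: residuals vs fitted (linearity), Q-Q (normality),
# scale-location (equal variance), residuals vs leverage
op <- par(mfrow = c(2, 2))
plot(socmod_full)
par(op)

# Variance inflation factors (generalized VIFs for multi-level factors)
if (requireNamespace("car", quietly = TRUE)) {
  print(car::vif(socmod_full))
}
```

Interaction terms are built from the main effects, so large VIFs in a fully interactive model are common even when the underlying predictors are not strongly related.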
3.) Run through a top-down modeling approach to find the best fit model! Be sure to check assumptions after each change and compare performance. What model is the best fit?
Technically, the best fit model is socmod6, which has the additive effects of home team shots and home team on FTHG. However, the most complex model that still performs well is socmod4, which includes all terms (home team shots, home team, and home team fouls) in an additive model for their effect on full time home goals. It best meets all of our assumptions (linearity, normality, equal variance, independence, and no collinearity), and its fit is only a bit worse than that of socmod6.
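A top-down comparison of this kind can be sketched as below. The data are simulated, the names `m_full`, `m_additive`, and `m_reduced` are hypothetical (the last two only mirror the structure described for socmod4 and socmod6), and comparing by AIC and adjusted R-squared is an assumption, since the criterion used in the lab is not stated here.

```r
set.seed(2)

# Simulated stand-in for the match data (hypothetical, not the real dataset)
n <- 300
toy <- data.frame(
  HomeTeam = factor(sample(c("TeamA", "TeamB", "TeamC"), n, replace = TRUE)),
  HS = rpois(n, 12),
  HF = rpois(n, 11)
)
toy$FTHG <- rpois(n, lambda = exp(-0.5 + 0.06 * toy$HS))

# Start from the most complex model and simplify step by step
m_full     <- lm(FTHG ~ HS * HomeTeam * HF, data = toy)  # fully interactive
m_additive <- lm(FTHG ~ HS + HomeTeam + HF, data = toy)  # structure of socmod4
m_reduced  <- lm(FTHG ~ HS + HomeTeam, data = toy)       # structure of socmod6

# Side-by-side comparison: lower AIC is better, higher adjusted R-squared is better
cmp <- AIC(m_full, m_additive, m_reduced)
cmp$adj_r2 <- sapply(
  list(m_full, m_additive, m_reduced),
  function(m) summary(m)$adj.r.squared
)
print(cmp[order(cmp$AIC), ])

# The models are nested, so sequential F-tests show whether
# each group of dropped terms mattered
print(anova(m_reduced, m_additive, m_full))
```

After each simplification, the diagnostic plots and VIFs would be re-checked on the new model before moving on.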
4.) After identifying the best fit model, build the appropriate graph! See our multiple regression tutorial. Next, build a coef plot for the model. Using patchwork, show me a 2-panel figure with the coef plot and the graph for the model.
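One possible way to assemble such a two-panel figure is sketched here. It builds the coefficient plot by hand from `coef()` and `confint()` rather than with a dedicated coefficient-plot package, which may differ from what the tutorial uses. The data are simulated, and for simplicity the model has the shots-plus-team structure; a model that also includes fouls would need HF held at a fixed value to draw prediction lines. The plotting step runs only if ggplot2 and patchwork are installed.

```r
set.seed(3)

# Simulated stand-in for the match data (hypothetical, not the real dataset)
n <- 300
toy <- data.frame(
  HomeTeam = factor(sample(c("TeamA", "TeamB", "TeamC"), n, replace = TRUE)),
  HS = rpois(n, 12)
)
toy$FTHG <- rpois(n, lambda = exp(-0.5 + 0.06 * toy$HS))

m <- lm(FTHG ~ HS + HomeTeam, data = toy)
toy$pred <- predict(m)  # fitted values, used for the regression lines

# Coefficient table with 95% confidence intervals, intercept dropped
est <- coef(m)
ci  <- confint(m)
coef_df <- data.frame(
  term     = names(est),
  estimate = unname(est),
  lower    = ci[, 1],
  upper    = ci[, 2],
  row.names = NULL
)
coef_df <- coef_df[coef_df$term != "(Intercept)", ]
print(coef_df)

if (requireNamespace("ggplot2", quietly = TRUE) &&
    requireNamespace("patchwork", quietly = TRUE)) {
  library(ggplot2)
  library(patchwork)

  # Panel 1: coefficient plot (estimate with 95% CI, dashed line at zero)
  p_coef <- ggplot(coef_df, aes(x = estimate, y = term)) +
    geom_vline(xintercept = 0, linetype = "dashed") +
    geom_pointrange(aes(xmin = lower, xmax = upper)) +
    labs(x = "Estimate (95% CI)", y = NULL)

  # Panel 2: observed goals vs shots, with one fitted line per home team
  p_fit <- ggplot(toy, aes(x = HS, y = FTHG, colour = HomeTeam)) +
    geom_point(alpha = 0.4) +
    geom_line(aes(y = pred)) +
    labs(x = "Home team shots (HS)", y = "Full time home goals (FTHG)")

  # patchwork: `+` places the two panels side by side
  print(p_coef + p_fit)
}
```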
2.) Calculate means and 95% CIs of full time home goals and full time away goals (using bootstrapping). Plot the results and interpret the plot (is there a home advantage or not?)
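A minimal base R sketch of a percentile bootstrap for these two means follows. The goal counts are simulated with arbitrary rates, so the output says nothing about the real data; `boot_mean_ci` is a hypothetical helper written for this example, and 2000 resamples is an arbitrary choice.

```r
set.seed(42)

# Simulated stand-in for the goal columns (arbitrary rates, not the real data)
n <- 380
toy <- data.frame(FTHG = rpois(n, 1.5), FTAG = rpois(n, 1.2))

# Percentile bootstrap: resample with replacement, recompute the mean each
# time, and take the middle `level` share of the resampled means as the CI
boot_mean_ci <- function(x, n_boot = 2000, level = 0.95) {
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  alpha <- 1 - level
  ci <- quantile(boot_means, c(alpha / 2, 1 - alpha / 2), names = FALSE)
  c(mean = mean(x), lower = ci[1], upper = ci[2])
}

res <- rbind(
  FTHG = boot_mean_ci(toy$FTHG),
  FTAG = boot_mean_ci(toy$FTAG)
)
print(res)

# Means with 95% CI error bars
plot(1:2, res[, "mean"], xlim = c(0.5, 2.5), ylim = range(res),
     xaxt = "n", xlab = "", ylab = "Mean goals per match", pch = 19)
axis(1, at = 1:2, labels = rownames(res))
arrows(1:2, res[, "lower"], 1:2, res[, "upper"],
       angle = 90, code = 3, length = 0.05)
```

If the interval for home goals sits entirely above the interval for away goals, that points to a home advantage; overlapping intervals leave the question open.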