Please note that option knitr::opts_knit$set(root.dir = “YOUR
WORKING DIRECTORY”) is only required if your .Rmd file is in a
different folder than you raw data, like in my case. Open up the .Rmd
file and see your yourself below -
Knitr
Explanation
You can modify it, or you can simply create a new .Rmd file on your machine and copy over the relevant chunks of code.
This file is for showing students some code, and is not a definitive guide on how to present. You can and should experiment with the commands shown and see if you can improve the presentation.
I am merely attempting to give you a starting point. You can still
use commands in this file like psych::describe but you can
see that stargazer is much better. Similarly, instead of
plotting with plot, use ggplot.
ENSURE YOUR FINAL REPORT IS CLEAN, POLISHED AND PROFESSIONAL AND DOES NOT HAVE ANY UNNECESSARY OUTPUT.
First, empty the environment so that we can upload the clean data.
# Set working directory and path to data
setwd(cd)
# Clear the workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 525572 28.1 1166992 62.4 NA 669282 35.8
## Vcells 967734 7.4 8388608 64.0 32768 1840402 14.1
cat("\f") # Clear the console
dev.off # Clear the charts
## function (which = dev.cur())
## {
## if (which == 1)
## stop("cannot shut down device 1 (the null device)")
## .External(C_devoff, as.integer(which))
## dev.cur()
## }
## <bytecode: 0x7fcc117900e8>
## <environment: namespace:grDevices>
Now, we load the packages.
# Prepare needed libraries
packages <- c("psych","tidyverse")
for (i in 1:length(packages)) {
if (!packages[i] %in% rownames(installed.packages())) {
install.packages(packages[i]
, repos = "http://cran.rstudio.com/"
, dependencies = TRUE
)
}
library(packages[i], character.only = TRUE)
}
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%() masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
rm(packages)
There are many ways to confirm you have successfully imported the raw data. Like I try to conduct some summary stats on the data.
Note that it is fine initially to use commands like
describe or glimpse to confirm data import or
exploring the data. But reports are an iterative process. The final
report is should be a client friendly reports i.e. keep only the most
informative command output in your final submitted file and suppress the
output of other commands.
For instance, does the client really need to know what is your
working directory ? Or would that information be useful for them ? If it
is not important for the client to see the output of certain commands
but you need to run that chunk, so try include=FALSE in r
chunk options.
Try to move away from an assignment style of writing (print all command outputs, and have as many commands as possible) to an executive report style of writing (attention highlighted to key points without wasting time). For example, if there are 4 ways to create summary statistics for a dataset, in your first iteration of the assignment run just any one command that achieves the target. In the second iteration, try to see if you can run the other three commands for summary statistics too - this will be important for your learning. In the third and final iteration, just present your preferred command for each task and rewrite the assignment in an executive/business report tone.
df_train <- read.csv("moneyball-training-data.csv")
df_eval <- read.csv("moneyball-evaluation-data.csv")
describe(df_train)
## vars n mean sd median trimmed mad min max
## INDEX 1 2276 1268.46 736.35 1270.5 1268.57 952.57 1 2535
## TARGET_WINS 2 2276 80.79 15.75 82.0 81.31 14.83 0 146
## TEAM_BATTING_H 3 2276 1469.27 144.59 1454.0 1459.04 114.16 891 2554
## TEAM_BATTING_2B 4 2276 241.25 46.80 238.0 240.40 47.44 69 458
## TEAM_BATTING_3B 5 2276 55.25 27.94 47.0 52.18 23.72 0 223
## TEAM_BATTING_HR 6 2276 99.61 60.55 102.0 97.39 78.58 0 264
## TEAM_BATTING_BB 7 2276 501.56 122.67 512.0 512.18 94.89 0 878
## TEAM_BATTING_SO 8 2174 735.61 248.53 750.0 742.31 284.66 0 1399
## TEAM_BASERUN_SB 9 2145 124.76 87.79 101.0 110.81 60.79 0 697
## TEAM_BASERUN_CS 10 1504 52.80 22.96 49.0 50.36 17.79 0 201
## TEAM_BATTING_HBP 11 191 59.36 12.97 58.0 58.86 11.86 29 95
## TEAM_PITCHING_H 12 2276 1779.21 1406.84 1518.0 1555.90 174.95 1137 30132
## TEAM_PITCHING_HR 13 2276 105.70 61.30 107.0 103.16 74.13 0 343
## TEAM_PITCHING_BB 14 2276 553.01 166.36 536.5 542.62 98.59 0 3645
## TEAM_PITCHING_SO 15 2174 817.73 553.09 813.5 796.93 257.23 0 19278
## TEAM_FIELDING_E 16 2276 246.48 227.77 159.0 193.44 62.27 65 1898
## TEAM_FIELDING_DP 17 1990 146.39 26.23 149.0 147.58 23.72 52 228
## range skew kurtosis se
## INDEX 2534 0.00 -1.22 15.43
## TARGET_WINS 146 -0.40 1.03 0.33
## TEAM_BATTING_H 1663 1.57 7.28 3.03
## TEAM_BATTING_2B 389 0.22 0.01 0.98
## TEAM_BATTING_3B 223 1.11 1.50 0.59
## TEAM_BATTING_HR 264 0.19 -0.96 1.27
## TEAM_BATTING_BB 878 -1.03 2.18 2.57
## TEAM_BATTING_SO 1399 -0.30 -0.32 5.33
## TEAM_BASERUN_SB 697 1.97 5.49 1.90
## TEAM_BASERUN_CS 201 1.98 7.62 0.59
## TEAM_BATTING_HBP 66 0.32 -0.11 0.94
## TEAM_PITCHING_H 28995 10.33 141.84 29.49
## TEAM_PITCHING_HR 343 0.29 -0.60 1.28
## TEAM_PITCHING_BB 3645 6.74 96.97 3.49
## TEAM_PITCHING_SO 19278 22.17 671.19 11.86
## TEAM_FIELDING_E 1833 2.99 10.97 4.77
## TEAM_FIELDING_DP 176 -0.39 0.18 0.59
glimpse(df_eval)
## Rows: 259
## Columns: 16
## $ INDEX <int> 9, 10, 14, 47, 60, 63, 74, 83, 98, 120, 123, 135, 138…
## $ TEAM_BATTING_H <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259,…
## $ TEAM_BATTING_2B <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243…
## $ TEAM_BATTING_3B <int> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 55, 57, 2…
## $ TEAM_BATTING_HR <int> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 164, 186,…
## $ TEAM_BATTING_BB <int> 447, 516, 509, 486, 95, 215, 568, 356, 466, 452, 495,…
## $ TEAM_BATTING_SO <int> 1080, 929, 816, 914, 416, 377, 527, 609, 689, 584, 64…
## $ TEAM_BASERUN_SB <int> 62, 54, 59, 148, NA, NA, 365, 185, 150, 52, 64, 48, 3…
## $ TEAM_BASERUN_CS <int> 50, 39, 47, 57, NA, NA, NA, NA, NA, NA, NA, 28, 21, 8…
## $ TEAM_BATTING_HBP <int> NA, NA, NA, 42, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H <int> 1209, 1221, 1395, 1539, 3902, 2793, 1544, 1626, 1342,…
## $ TEAM_PITCHING_HR <int> 83, 88, 93, 159, 14, 20, 40, 39, 25, 62, 53, 173, 196…
## $ TEAM_PITCHING_BB <int> 447, 516, 509, 486, 257, 420, 613, 418, 497, 482, 521…
## $ TEAM_PITCHING_SO <int> 1080, 929, 816, 914, 1123, 736, 569, 715, 734, 622, 6…
## $ TEAM_FIELDING_E <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200…
## $ TEAM_FIELDING_DP <int> 156, 164, 153, 154, 130, 105, NA, 104, 132, 145, 183,…
?print # years "Total Years:"
print(x = 2006-1871+1,
digits = 1)
## [1] 136
136*162
## [1] 22032
My third command for summary stats will be stargazer and
I will only present that in my final report. This is still more in
assignment style.
I have three points to make -
Use ggplot2 package to create professional charts.
library(visdat)
?visdat
vis_dat(df_train)
df_no_missing <- df_train
df_no_missing$TEAM_BATTING_HBP = NULL
df_no_missing$TEAM_BATTING_CS = NULL
vis_miss(df_no_missing)
Plot or summarize the variable before and after the imputation to confirm the distribution has not
hist(df_no_missing$TEAM_BASERUN_CS )
df_no_missing$TEAM_BASERUN_CS <- ifelse(is.na(x = df_no_missing$TEAM_BASERUN_CS),
yes = mean(x = df_no_missing$TEAM_BASERUN_CS, na.rm = TRUE),
no = df_no_missing$TEAM_BASERUN_CS)
hist(df_no_missing$TEAM_BASERUN_CS )
# Create histogram in ggplot
ggplot(data = df_no_missing,
mapping = aes(x = TEAM_BASERUN_CS)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data = df_no_missing,
mapping = aes(x = TEAM_BASERUN_CS)) + geom_histogram(color = "black",
fill = "blue",
binwidth = 10) +
labs(title = "Histogram of X", x = "X", y = "Count")
df_no_missing <- na.omit(df_no_missing) # maybe drop a few more variables
Please play with the command below and make the correlation plot upper or lower triangular - else the chart below displays too much redundant information. It does not look very professional ( a little cluttered). At top firms, attention to detail is very important and you are required to create very polished charts.
library(ggcorrplot)
?ggcorrplot()
?cor
?cor_pmat # cor_pmat(): Compute a correlation matrix p-values.
mycorr<- cor(x = df_no_missing[, 1:ncol(df_no_missing )])
p.mat <- ggcorrplot::cor_pmat(x = df_no_missing[,1:ncol(df_no_missing)])
head(p.mat)
## INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## INDEX 0.000000000 3.085723e-01 2.870071e-01 1.012176e-01
## TARGET_WINS 0.308572334 0.000000e+00 1.013043e-54 2.490498e-20
## TEAM_BATTING_H 0.287007118 1.013043e-54 0.000000e+00 9.355168e-233
## TEAM_BATTING_2B 0.101217596 2.490498e-20 9.355168e-233 0.000000e+00
## TEAM_BATTING_3B 0.769316672 1.340586e-07 2.896946e-60 2.894398e-02
## TEAM_BATTING_HR 0.006420957 1.416443e-21 1.602541e-06 1.786146e-62
## TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## INDEX 7.693167e-01 6.420957e-03 2.259876e-02 7.106396e-07
## TARGET_WINS 1.340586e-07 1.416443e-21 6.788724e-40 1.082744e-02
## TEAM_BATTING_H 2.896946e-60 1.602541e-06 1.402162e-05 3.110665e-54
## TEAM_BATTING_2B 2.894398e-02 1.786146e-62 2.281359e-23 1.601577e-07
## TEAM_BATTING_3B 0.000000e+00 3.146352e-208 5.493361e-25 1.722999e-271
## TEAM_BATTING_HR 3.146352e-208 0.000000e+00 2.833152e-68 4.262615e-267
## TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
## INDEX 1.177395e-04 9.919907e-01 9.517913e-01
## TARGET_WINS 1.846788e-07 6.773671e-01 8.836481e-22
## TEAM_BATTING_H 1.497875e-02 6.096341e-01 7.485250e-277
## TEAM_BATTING_2B 9.931836e-07 9.663607e-06 6.214542e-71
## TEAM_BATTING_3B 8.653120e-30 2.117662e-35 5.645504e-73
## TEAM_BATTING_HR 1.327937e-41 9.273809e-63 2.113987e-05
## TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## INDEX 7.419790e-03 1.014059e-02 1.744343e-06
## TARGET_WINS 1.685830e-21 2.011167e-32 4.962297e-03
## TEAM_BATTING_H 4.769017e-08 3.435879e-08 7.730597e-49
## TEAM_BATTING_2B 1.151450e-63 2.211572e-17 5.592219e-07
## TEAM_BATTING_3B 2.009525e-190 1.798207e-07 2.195339e-225
## TEAM_BATTING_HR 0.000000e+00 2.039060e-29 1.157630e-220
## TEAM_FIELDING_E TEAM_FIELDING_DP
## INDEX 1.287249e-02 7.993341e-01
## TARGET_WINS 1.454931e-15 1.165960e-01
## TEAM_BATTING_H 2.969523e-04 1.173988e-11
## TEAM_BATTING_2B 9.295027e-35 1.901141e-14
## TEAM_BATTING_3B 2.596789e-244 7.211648e-21
## TEAM_BATTING_HR 4.191484e-271 2.530492e-36
myplot<-ggcorrplot(corr = mycorr, # correlation matrix to visualize
method = "square", # character, the visualization method of correlation matrix to be used. Allowed values are "square" (default), "circle"
type = "full", # character, "full" (default), "lower" or "upper" display
title = "Correlation Plot", # character, title of the graph
colors = c("red", "white","green"), # vector of 3 colors for low, mid and high correlation values.
lab = TRUE, # If TRUE, add corr coeff on the plot.
lab_size = 2, # labels. used when lab = TRUE.
p.mat = p.mat, # matrix of p-value. If NULL, arguments sig.level, insig, pch, pch.col, pch.cex is invalid. # Barring the no significant coefficient
insig = "pch", # character, specialized insignificant correlation coefficients, "pch" (default), "blank". If "blank", wipe away the corresponding glyphs; if "pch", add characters (see pch for details) on corresponding glyphs.
pch = 4, # add character on the glyphs of insignificant correlation coefficients (only valid when insig is "pch"). Default value is 4.
hc.order = TRUE, # If TRUE, correlation matrix will be hc.ordered using hclust function.
tl.cex = 8, # the size, the color and the string rotation of text label
tl.col = "black",
digits = 2
)
myplot
I went to Chat GPT and typed “r create a histogram of all variables in a dataframe”. You could have done so in Google alternatively.
I got the following code - and tried running it as it is to understand how it works.
library(ggplot2)
# create sample dataframe
df <- data.frame(
var1 = rnorm(100),
var2 = rnorm(100),
var3 = rnorm(100)
)
# melt the dataframe into long format
df_melted <- reshape2::melt(df)
## No id variables; using all as measure variables
# create histogram using ggplot2
ggplot(df_melted, aes(value)) +
geom_histogram() +
facet_wrap(~variable, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Then, I try to adapt it for my own code -
library(ggplot2)
# melt the dataframe into long format
df_melted <- reshape2::melt(df_no_missing)
## No id variables; using all as measure variables
# create histogram using ggplot2
ggplot(df_melted, aes(x = value)) +
geom_histogram() +
facet_wrap(~variable, scales = "free_x")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Use stargazer package to create professional summary tables. Can also consider using kabble, kabbleExtra.
Run summary statistics on the final, clean variables that you are
keeping in your regression. Play around with labelling variable with
covariate.labels option in stargazer().
library(stargazer)
##
## Please cite as:
## Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
## R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
?stargazer
stargazer(table = df_no_missing[,2:15],
type = "text",
title = "Summary Statistics" #,
# covariate.labels = c( "Target Wins",
# "Team Batting"
# )
)
##
## Summary Statistics
## ========================================================
## Statistic N Mean St. Dev. Min Max
## --------------------------------------------------------
## TARGET_WINS 1,835 80.990 13.158 38 120
## TEAM_BATTING_H 1,835 1,457.035 108.082 1,137 1,876
## TEAM_BATTING_2B 1,835 247.834 42.952 130 392
## TEAM_BATTING_3B 1,835 47.839 21.709 11 138
## TEAM_BATTING_HR 1,835 116.478 54.511 4 264
## TEAM_BATTING_BB 1,835 530.277 85.501 273 878
## TEAM_BATTING_SO 1,835 786.762 216.972 324 1,399
## TEAM_BASERUN_SB 1,835 100.801 52.755 18 367
## TEAM_BASERUN_CS 1,835 52.933 20.562 11.000 201.000
## TEAM_PITCHING_H 1,835 1,524.617 174.214 1,137 2,394
## TEAM_PITCHING_HR 1,835 120.711 56.412 4 343
## TEAM_PITCHING_BB 1,835 553.960 97.483 325 1,090
## TEAM_PITCHING_SO 1,835 817.993 222.223 345 1,781
## TEAM_FIELDING_E 1,835 159.925 57.789 65 515
## --------------------------------------------------------
The regression I will run is ( describe it ) -
\(wage_{i}= \beta_{0}+\gamma_1+ \epsilon_i\).
stepAIC may help you with backward/forward selection.
Box-Cox may be interesting to try too. Read up on it if you have time.
Present the models in one single table so that it is easy for the reader to compare coefficients.
# MODEL 1: keep all variables
model1 <- lm(data = df_no_missing,
TARGET_WINS ~ .)
# MODEL 2: drop variable randomly ...
model2 <- lm(data = df_no_missing,
TARGET_WINS ~ . -TEAM_BATTING_HR - TEAM_BATTING_BB )
# MODEL 3: Step AIC...Backward selection... drop variable systematically...
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
?stepAIC
model3 <- stepAIC(object = lm(data = df_no_missing, TARGET_WINS ~ . - INDEX),
direction = c("both")
)
## Start: AIC=8524.83
## TARGET_WINS ~ (INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB +
## TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP) -
## INDEX
##
## Df Sum of Sq RSS AIC
## - TEAM_PITCHING_HR 1 12.4 188001 8523.0
## - TEAM_BATTING_SO 1 101.7 188090 8523.8
## <none> 187989 8524.8
## - TEAM_BATTING_HR 1 242.0 188231 8525.2
## - TEAM_BATTING_H 1 270.0 188259 8525.5
## - TEAM_PITCHING_BB 1 347.9 188336 8526.2
## - TEAM_PITCHING_SO 1 432.7 188421 8527.1
## - TEAM_BASERUN_CS 1 662.6 188651 8529.3
## - TEAM_BATTING_BB 1 710.8 188699 8529.8
## - TEAM_PITCHING_H 1 1303.8 189292 8535.5
## - TEAM_BATTING_2B 1 3226.4 191215 8554.1
## - TEAM_FIELDING_DP 1 8436.8 196425 8603.4
## - TEAM_BATTING_3B 1 9717.7 197706 8615.3
## - TEAM_BASERUN_SB 1 15747.2 203736 8670.4
## - TEAM_FIELDING_E 1 29331.9 217320 8788.9
##
## Step: AIC=8522.95
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
## TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB +
## TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
## TEAM_FIELDING_E + TEAM_FIELDING_DP
##
## Df Sum of Sq RSS AIC
## <none> 188001 8523.0
## - TEAM_BATTING_SO 1 229.1 188230 8523.2
## - TEAM_BATTING_H 1 325.8 188327 8524.1
## + TEAM_PITCHING_HR 1 12.4 187989 8524.8
## - TEAM_PITCHING_BB 1 400.4 188401 8524.9
## - TEAM_BASERUN_CS 1 677.5 188679 8527.6
## - TEAM_BATTING_BB 1 795.1 188796 8528.7
## - TEAM_PITCHING_SO 1 825.2 188826 8529.0
## - TEAM_PITCHING_H 1 1488.2 189489 8535.4
## - TEAM_BATTING_2B 1 3223.9 191225 8552.2
## - TEAM_FIELDING_DP 1 8462.1 196463 8601.7
## - TEAM_BATTING_3B 1 9711.3 197712 8613.4
## - TEAM_BATTING_HR 1 11284.0 199285 8627.9
## - TEAM_BASERUN_SB 1 15856.4 203857 8669.5
## - TEAM_FIELDING_E 1 29396.6 217398 8787.5
stargazer(model1, model2, model3,
type = "text"
)
##
## =================================================================================================
## Dependent variable:
## -----------------------------------------------------------------------------
## TARGET_WINS
## (1) (2) (3)
## -------------------------------------------------------------------------------------------------
## INDEX -0.0002 -0.0002
## (0.0003) (0.0003)
##
## TEAM_BATTING_H -0.027 0.003 -0.028*
## (0.017) (0.011) (0.016)
##
## TEAM_BATTING_2B -0.050*** -0.051*** -0.050***
## (0.009) (0.009) (0.009)
##
## TEAM_BATTING_3B 0.185*** 0.181*** 0.184***
## (0.019) (0.019) (0.019)
##
## TEAM_BATTING_HR 0.123 0.097***
## (0.082) (0.009)
##
## TEAM_BATTING_BB 0.112*** 0.115***
## (0.042) (0.041)
##
## TEAM_BATTING_SO 0.022 0.060*** 0.026
## (0.022) (0.017) (0.018)
##
## TEAM_BASERUN_SB 0.077*** 0.078*** 0.077***
## (0.006) (0.006) (0.006)
##
## TEAM_BASERUN_CS -0.036** -0.040*** -0.036**
## (0.014) (0.014) (0.014)
##
## TEAM_PITCHING_H 0.054*** 0.026*** 0.055***
## (0.015) (0.009) (0.014)
##
## TEAM_PITCHING_HR -0.026 0.092***
## (0.078) (0.009)
##
## TEAM_PITCHING_BB -0.075* 0.031*** -0.077**
## (0.040) (0.003) (0.039)
##
## TEAM_PITCHING_SO -0.043** -0.079*** -0.047***
## (0.021) (0.016) (0.017)
##
## TEAM_FIELDING_E -0.121*** -0.117*** -0.120***
## (0.007) (0.007) (0.007)
##
## TEAM_FIELDING_DP -0.111*** -0.112*** -0.111***
## (0.012) (0.012) (0.012)
##
## Constant 60.797*** 59.964*** 60.856***
## (6.066) (6.076) (6.057)
##
## -------------------------------------------------------------------------------------------------
## Observations 1,835 1,835 1,835
## R2 0.408 0.404 0.408
## Adjusted R2 0.403 0.400 0.404
## Residual Std. Error 10.165 (df = 1819) 10.192 (df = 1821) 10.161 (df = 1821)
## F Statistic 83.597*** (df = 15; 1819) 95.054*** (df = 13; 1821) 96.522*** (df = 13; 1821)
## =================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
## The following object is masked from 'package:psych':
##
## logit
?vif()
data_subset <- df_no_missing[,-1]
names(data_subset)
## [1] "TARGET_WINS" "TEAM_BATTING_H" "TEAM_BATTING_2B" "TEAM_BATTING_3B"
## [5] "TEAM_BATTING_HR" "TEAM_BATTING_BB" "TEAM_BATTING_SO" "TEAM_BASERUN_SB"
## [9] "TEAM_BASERUN_CS" "TEAM_PITCHING_H" "TEAM_PITCHING_HR" "TEAM_PITCHING_BB"
## [13] "TEAM_PITCHING_SO" "TEAM_FIELDING_E" "TEAM_FIELDING_DP"
model1 <- lm(data = data_subset,
TARGET_WINS ~ .)
summary(model1)
##
## Call:
## lm(formula = TARGET_WINS ~ ., data = data_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.350 -7.089 0.103 6.918 29.580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 60.763342 6.064292 10.020 < 2e-16 ***
## TEAM_BATTING_H -0.026693 0.016510 -1.617 0.106107
## TEAM_BATTING_2B -0.049536 0.008863 -5.589 2.63e-08 ***
## TEAM_BATTING_3B 0.183918 0.018961 9.700 < 2e-16 ***
## TEAM_BATTING_HR 0.124744 0.081504 1.531 0.126061
## TEAM_BATTING_BB 0.111399 0.042465 2.623 0.008781 **
## TEAM_BATTING_SO 0.021847 0.022019 0.992 0.321233
## TEAM_BASERUN_SB 0.076395 0.006187 12.347 < 2e-16 ***
## TEAM_BASERUN_CS -0.036101 0.014253 -2.533 0.011399 *
## TEAM_PITCHING_H 0.053366 0.015021 3.553 0.000391 ***
## TEAM_PITCHING_HR -0.027021 0.077957 -0.347 0.728920
## TEAM_PITCHING_BB -0.074086 0.040371 -1.835 0.066647 .
## TEAM_PITCHING_SO -0.042878 0.020949 -2.047 0.040825 *
## TEAM_FIELDING_E -0.120625 0.007158 -16.852 < 2e-16 ***
## TEAM_FIELDING_DP -0.110900 0.012271 -9.038 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.16 on 1820 degrees of freedom
## Multiple R-squared: 0.408, Adjusted R-squared: 0.4034
## F-statistic: 89.59 on 14 and 1820 DF, p-value: < 2.2e-16
## INCORRECT COEFFICIENTS DUE TO MULTOCOLLINEARITY???
vif_values1 <- vif(model1) # multicollinearity
as.data.frame(vif(model1))
## vif(model1)
## TEAM_BATTING_H 56.539890
## TEAM_BATTING_2B 2.573319
## TEAM_BATTING_3B 3.008691
## TEAM_BATTING_HR 350.487237
## TEAM_BATTING_BB 234.073106
## TEAM_BATTING_SO 405.276770
## TEAM_BASERUN_SB 1.891731
## TEAM_BASERUN_CS 1.525089
## TEAM_PITCHING_H 121.586247
## TEAM_PITCHING_HR 343.396144
## TEAM_PITCHING_BB 274.994857
## TEAM_PITCHING_SO 384.810502
## TEAM_FIELDING_E 3.038294
## TEAM_FIELDING_DP 1.376169
summary(model1_vif <- lm(data = data_subset,
TEAM_BATTING_HR ~ . - TARGET_WINS)) # 351.042560
##
## Call:
## lm(formula = TEAM_BATTING_HR ~ . - TARGET_WINS, data = data_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.782 -0.874 -0.193 0.653 29.376
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2441869 1.7435860 -0.140 0.888637
## TEAM_BATTING_H -0.0515416 0.0045908 -11.227 < 2e-16 ***
## TEAM_BATTING_2B -0.0009689 0.0025483 -0.380 0.703816
## TEAM_BATTING_3B -0.0094590 0.0054473 -1.736 0.082650 .
## TEAM_BATTING_BB 0.1201058 0.0118807 10.109 < 2e-16 ***
## TEAM_BATTING_SO 0.1607601 0.0050881 31.595 < 2e-16 ***
## TEAM_BASERUN_SB 0.0030011 0.0017775 1.688 0.091515 .
## TEAM_BASERUN_CS -0.0144344 0.0040841 -3.534 0.000419 ***
## TEAM_PITCHING_H 0.0507239 0.0041519 12.217 < 2e-16 ***
## TEAM_PITCHING_HR 0.9503043 0.0025437 373.597 < 2e-16 ***
## TEAM_PITCHING_BB -0.1153016 0.0112884 -10.214 < 2e-16 ***
## TEAM_PITCHING_SO -0.1533735 0.0048334 -31.732 < 2e-16 ***
## TEAM_FIELDING_E 0.0044123 0.0020555 2.147 0.031958 *
## TEAM_FIELDING_DP -0.0040996 0.0035268 -1.162 0.245218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.922 on 1821 degrees of freedom
## Multiple R-squared: 0.9971, Adjusted R-squared: 0.9971
## F-statistic: 4.896e+04 on 13 and 1821 DF, p-value: < 2.2e-16
#1/(1-summary(model1_vif)$r.squared)
1/(1-0.9971)
## [1] 344.8276
1/( 1 - summary(model1_vif)$r.squared )
## [1] 350.4872
ANSWER THE FOLLOWING questions atleast -
Describe your model evolution. Why did you try a new model? Which model do you finally prefer?
What is coefficient interpretation of the model of your choice ??? Do it for a few key variables.
How is the model fit ? Residual Analysis ???
Try to convince the reader your chosen model does not violate Gauss Markov Assumptions.
?par
par(mfrow = c(2,2)) # 2 x 2 pictures on one plot
plot(model3)
How do you interpret these 4 charts ???
?predict
https://www.rostrum.blog/2018/10/13/sessioninfo/
sessionInfo()
## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] car_3.1-2 carData_3.0-5 MASS_7.3-60 stargazer_5.2.3
## [5] ggcorrplot_0.1.4 visdat_0.6.0 lubridate_1.9.2 forcats_1.0.0
## [9] stringr_1.5.0 dplyr_1.1.3 purrr_1.0.1 readr_2.1.4
## [13] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0
## [17] psych_2.3.6
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 lattice_0.20-45 digest_0.6.32 utf8_1.2.3
## [5] R6_2.5.1 plyr_1.8.8 evaluate_0.21 highr_0.10
## [9] pillar_1.9.0 rlang_1.1.1 rstudioapi_0.14 jquerylib_0.1.4
## [13] rmarkdown_2.21 labeling_0.4.2 munsell_0.5.0 compiler_4.2.1
## [17] xfun_0.39 pkgconfig_2.0.3 mnormt_2.1.1 htmltools_0.5.5
## [21] tidyselect_1.2.0 fansi_1.0.4 tzdb_0.3.0 withr_2.5.1
## [25] grid_4.2.1 nlme_3.1-161 jsonlite_1.8.7 gtable_0.3.3
## [29] lifecycle_1.0.3 magrittr_2.0.3 scales_1.2.1 cli_3.6.1
## [33] stringi_1.7.12 cachem_1.0.8 farver_2.1.1 reshape2_1.4.4
## [37] bslib_0.5.0 ellipsis_0.3.2 generics_0.1.3 vctrs_0.6.3
## [41] tools_4.2.1 glue_1.6.2 hms_1.1.2 abind_1.4-5
## [45] parallel_4.2.1 fastmap_1.1.1 yaml_2.3.7 timechange_0.2.0
## [49] colorspace_2.1-0 knitr_1.42 sass_0.4.6
# devtools::session_info()