Please note that option knitr::opts_knit$set(root.dir = “YOUR WORKING DIRECTORY”) is only required if your .Rmd file is in a different folder than you raw data, like in my case. Open up the .Rmd file and see your yourself below -
Knitr Explanation

You can modify it, or you can simply create a new .Rmd file on your machine and copy over the relevant chunks of code.

1 INSTRUCTIONS

This file is for showing students some code, and is not a definitive guide on how to present. You can and should experiment with the commands shown and see if you can improve the presentation.

I am merely attempting to give you a starting point. You can still use commands in this file like psych::describe but you can see that stargazer is much better. Similarly, instead of plotting with plot, use ggplot.

ENSURE YOUR FINAL REPORT IS CLEAN, POLISHED AND PROFESSIONAL AND DOES NOT HAVE ANY UNNECESSARY OUTPUT.

2 Setup

2.1 Empty variables and functions in the environment tab/window

First, empty the environment so that we can upload the clean data.

# Set working directory and path to data
  setwd(cd)

# Clear the workspace
  rm(list = ls()) # Clear environment
  gc()            # Clear unused memory

##          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 525572 28.1    1166992 62.4         NA   669282 35.8
## Vcells 967734  7.4    8388608 64.0      32768  1840402 14.1

  cat("\f")       # Clear the console

  dev.off         # Clear the charts

## function (which = dev.cur()) 
## {
##     if (which == 1) 
##         stop("cannot shut down device 1 (the null device)")
##     .External(C_devoff, as.integer(which))
##     dev.cur()
## }
## <bytecode: 0x7fcc117900e8>
## <environment: namespace:grDevices>

2.2 Load packages

Now, we load the packages.

# Prepare needed libraries
packages <- c("psych","tidyverse")

  for (i in 1:length(packages)) {
    if (!packages[i] %in% rownames(installed.packages())) {
      install.packages(packages[i]
                       , repos = "http://cran.rstudio.com/"
                       , dependencies = TRUE
                       )
    }
    library(packages[i], character.only = TRUE)
  }

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ ggplot2::%+%()   masks psych::%+%()
## ✖ ggplot2::alpha() masks psych::alpha()
## ✖ dplyr::filter()  masks stats::filter()
## ✖ dplyr::lag()     masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

rm(packages)

2.3 Load raw data

There are many ways to confirm you have successfully imported the raw data. Like I try to conduct some summary stats on the data.

Note that it is fine initially to use commands like describe or glimpse to confirm data import or exploring the data. But reports are an iterative process. The final report is should be a client friendly reports i.e. keep only the most informative command output in your final submitted file and suppress the output of other commands.

For instance, does the client really need to know what is your working directory ? Or would that information be useful for them ? If it is not important for the client to see the output of certain commands but you need to run that chunk, so try include=FALSE in r chunk options.

Try to move away from an assignment style of writing (print all command outputs, and have as many commands as possible) to an executive report style of writing (attention highlighted to key points without wasting time). For example, if there are 4 ways to create summary statistics for a dataset, in your first iteration of the assignment run just any one command that achieves the target. In the second iteration, try to see if you can run the other three commands for summary statistics too - this will be important for your learning. In the third and final iteration, just present your preferred command for each task and rewrite the assignment in an executive/business report tone.

df_train    <- read.csv("moneyball-training-data.csv")
df_eval     <- read.csv("moneyball-evaluation-data.csv")

describe(df_train)

##                  vars    n    mean      sd median trimmed    mad  min   max
## INDEX               1 2276 1268.46  736.35 1270.5 1268.57 952.57    1  2535
## TARGET_WINS         2 2276   80.79   15.75   82.0   81.31  14.83    0   146
## TEAM_BATTING_H      3 2276 1469.27  144.59 1454.0 1459.04 114.16  891  2554
## TEAM_BATTING_2B     4 2276  241.25   46.80  238.0  240.40  47.44   69   458
## TEAM_BATTING_3B     5 2276   55.25   27.94   47.0   52.18  23.72    0   223
## TEAM_BATTING_HR     6 2276   99.61   60.55  102.0   97.39  78.58    0   264
## TEAM_BATTING_BB     7 2276  501.56  122.67  512.0  512.18  94.89    0   878
## TEAM_BATTING_SO     8 2174  735.61  248.53  750.0  742.31 284.66    0  1399
## TEAM_BASERUN_SB     9 2145  124.76   87.79  101.0  110.81  60.79    0   697
## TEAM_BASERUN_CS    10 1504   52.80   22.96   49.0   50.36  17.79    0   201
## TEAM_BATTING_HBP   11  191   59.36   12.97   58.0   58.86  11.86   29    95
## TEAM_PITCHING_H    12 2276 1779.21 1406.84 1518.0 1555.90 174.95 1137 30132
## TEAM_PITCHING_HR   13 2276  105.70   61.30  107.0  103.16  74.13    0   343
## TEAM_PITCHING_BB   14 2276  553.01  166.36  536.5  542.62  98.59    0  3645
## TEAM_PITCHING_SO   15 2174  817.73  553.09  813.5  796.93 257.23    0 19278
## TEAM_FIELDING_E    16 2276  246.48  227.77  159.0  193.44  62.27   65  1898
## TEAM_FIELDING_DP   17 1990  146.39   26.23  149.0  147.58  23.72   52   228
##                  range  skew kurtosis    se
## INDEX             2534  0.00    -1.22 15.43
## TARGET_WINS        146 -0.40     1.03  0.33
## TEAM_BATTING_H    1663  1.57     7.28  3.03
## TEAM_BATTING_2B    389  0.22     0.01  0.98
## TEAM_BATTING_3B    223  1.11     1.50  0.59
## TEAM_BATTING_HR    264  0.19    -0.96  1.27
## TEAM_BATTING_BB    878 -1.03     2.18  2.57
## TEAM_BATTING_SO   1399 -0.30    -0.32  5.33
## TEAM_BASERUN_SB    697  1.97     5.49  1.90
## TEAM_BASERUN_CS    201  1.98     7.62  0.59
## TEAM_BATTING_HBP    66  0.32    -0.11  0.94
## TEAM_PITCHING_H  28995 10.33   141.84 29.49
## TEAM_PITCHING_HR   343  0.29    -0.60  1.28
## TEAM_PITCHING_BB  3645  6.74    96.97  3.49
## TEAM_PITCHING_SO 19278 22.17   671.19 11.86
## TEAM_FIELDING_E   1833  2.99    10.97  4.77
## TEAM_FIELDING_DP   176 -0.39     0.18  0.59

glimpse(df_eval)

## Rows: 259
## Columns: 16
## $ INDEX            <int> 9, 10, 14, 47, 60, 63, 74, 83, 98, 120, 123, 135, 138…
## $ TEAM_BATTING_H   <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259,…
## $ TEAM_BATTING_2B  <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243…
## $ TEAM_BATTING_3B  <int> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 55, 57, 2…
## $ TEAM_BATTING_HR  <int> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 164, 186,…
## $ TEAM_BATTING_BB  <int> 447, 516, 509, 486, 95, 215, 568, 356, 466, 452, 495,…
## $ TEAM_BATTING_SO  <int> 1080, 929, 816, 914, 416, 377, 527, 609, 689, 584, 64…
## $ TEAM_BASERUN_SB  <int> 62, 54, 59, 148, NA, NA, 365, 185, 150, 52, 64, 48, 3…
## $ TEAM_BASERUN_CS  <int> 50, 39, 47, 57, NA, NA, NA, NA, NA, NA, NA, 28, 21, 8…
## $ TEAM_BATTING_HBP <int> NA, NA, NA, 42, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TEAM_PITCHING_H  <int> 1209, 1221, 1395, 1539, 3902, 2793, 1544, 1626, 1342,…
## $ TEAM_PITCHING_HR <int> 83, 88, 93, 159, 14, 20, 40, 39, 25, 62, 53, 173, 196…
## $ TEAM_PITCHING_BB <int> 447, 516, 509, 486, 257, 420, 613, 418, 497, 482, 521…
## $ TEAM_PITCHING_SO <int> 1080, 929, 816, 914, 1123, 736, 569, 715, 734, 622, 6…
## $ TEAM_FIELDING_E  <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200…
## $ TEAM_FIELDING_DP <int> 156, 164, 153, 154, 130, 105, NA, 104, 132, 145, 183,…

?print # years "Total Years:"
print(x = 2006-1871+1,
      digits = 1)

## [1] 136

136*162

## [1] 22032

My third command for summary stats will be stargazer and I will only present that in my final report. This is still more in assignment style.

3 Data Analysis

I have three points to make -

This will bold.
This will italicize.

3.1 Data Visualization

Use ggplot2 package to create professional charts.

3.1.1 Missing Observations

library(visdat)
?visdat

vis_dat(df_train)

df_no_missing <- df_train
df_no_missing$TEAM_BATTING_HBP  = NULL
df_no_missing$TEAM_BATTING_CS   = NULL 
 vis_miss(df_no_missing)

3.1.2 Imputing Missing Observations

Plot or summarize the variable before and after the imputation to confirm the distribution has not

hist(df_no_missing$TEAM_BASERUN_CS )

df_no_missing$TEAM_BASERUN_CS <- ifelse(is.na(x = df_no_missing$TEAM_BASERUN_CS), 
                                        yes = mean(x = df_no_missing$TEAM_BASERUN_CS,                                                    na.rm = TRUE),
                                        no = df_no_missing$TEAM_BASERUN_CS)
hist(df_no_missing$TEAM_BASERUN_CS )

# Create histogram in ggplot
ggplot(data = df_no_missing, 
       mapping = aes(x = TEAM_BASERUN_CS)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = df_no_missing, 
       mapping = aes(x = TEAM_BASERUN_CS)) + geom_histogram(color = "black", 
                                                            fill = "blue", 
                                                            binwidth = 10) +
  labs(title = "Histogram of X", x = "X", y = "Count")

df_no_missing <- na.omit(df_no_missing) # maybe drop a few more variables

3.1.3 Correlation Plot

Please play with the command below and make the correlation plot upper or lower triangular - else the chart below displays too much redundant information. It does not look very professional ( a little cluttered). At top firms, attention to detail is very important and you are required to create very polished charts.

library(ggcorrplot)
?ggcorrplot()
?cor
  

?cor_pmat # cor_pmat(): Compute a correlation matrix p-values.
mycorr<- cor(x = df_no_missing[, 1:ncol(df_no_missing )])

p.mat <- ggcorrplot::cor_pmat(x = df_no_missing[,1:ncol(df_no_missing)])
head(p.mat)

##                       INDEX  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
## INDEX           0.000000000 3.085723e-01   2.870071e-01    1.012176e-01
## TARGET_WINS     0.308572334 0.000000e+00   1.013043e-54    2.490498e-20
## TEAM_BATTING_H  0.287007118 1.013043e-54   0.000000e+00   9.355168e-233
## TEAM_BATTING_2B 0.101217596 2.490498e-20  9.355168e-233    0.000000e+00
## TEAM_BATTING_3B 0.769316672 1.340586e-07   2.896946e-60    2.894398e-02
## TEAM_BATTING_HR 0.006420957 1.416443e-21   1.602541e-06    1.786146e-62
##                 TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
## INDEX              7.693167e-01    6.420957e-03    2.259876e-02    7.106396e-07
## TARGET_WINS        1.340586e-07    1.416443e-21    6.788724e-40    1.082744e-02
## TEAM_BATTING_H     2.896946e-60    1.602541e-06    1.402162e-05    3.110665e-54
## TEAM_BATTING_2B    2.894398e-02    1.786146e-62    2.281359e-23    1.601577e-07
## TEAM_BATTING_3B    0.000000e+00   3.146352e-208    5.493361e-25   1.722999e-271
## TEAM_BATTING_HR   3.146352e-208    0.000000e+00    2.833152e-68   4.262615e-267
##                 TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
## INDEX              1.177395e-04    9.919907e-01    9.517913e-01
## TARGET_WINS        1.846788e-07    6.773671e-01    8.836481e-22
## TEAM_BATTING_H     1.497875e-02    6.096341e-01   7.485250e-277
## TEAM_BATTING_2B    9.931836e-07    9.663607e-06    6.214542e-71
## TEAM_BATTING_3B    8.653120e-30    2.117662e-35    5.645504e-73
## TEAM_BATTING_HR    1.327937e-41    9.273809e-63    2.113987e-05
##                 TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
## INDEX               7.419790e-03     1.014059e-02     1.744343e-06
## TARGET_WINS         1.685830e-21     2.011167e-32     4.962297e-03
## TEAM_BATTING_H      4.769017e-08     3.435879e-08     7.730597e-49
## TEAM_BATTING_2B     1.151450e-63     2.211572e-17     5.592219e-07
## TEAM_BATTING_3B    2.009525e-190     1.798207e-07    2.195339e-225
## TEAM_BATTING_HR     0.000000e+00     2.039060e-29    1.157630e-220
##                 TEAM_FIELDING_E TEAM_FIELDING_DP
## INDEX              1.287249e-02     7.993341e-01
## TARGET_WINS        1.454931e-15     1.165960e-01
## TEAM_BATTING_H     2.969523e-04     1.173988e-11
## TEAM_BATTING_2B    9.295027e-35     1.901141e-14
## TEAM_BATTING_3B   2.596789e-244     7.211648e-21
## TEAM_BATTING_HR   4.191484e-271     2.530492e-36

  myplot<-ggcorrplot(corr     = mycorr,   # correlation matrix to visualize
                     method   = "square", # character, the visualization method of correlation matrix to be used. Allowed values are "square" (default), "circle"
                     type     = "full",  # character, "full" (default), "lower" or "upper" display
                     title    = "Correlation Plot",  # character, title of the graph
                     colors   = c("red", "white","green"), #    vector of 3 colors for low, mid and high correlation values.
                     lab      = TRUE,   # If TRUE, add corr coeff on the plot.
                     lab_size = 2,      # labels. used when lab = TRUE.
                     p.mat    = p.mat,  # matrix of p-value. If NULL, arguments sig.level, insig, pch, pch.col, pch.cex is invalid.  # Barring the no significant coefficient
                     insig    = "pch",  # character, specialized insignificant correlation coefficients, "pch" (default), "blank". If "blank", wipe away the corresponding glyphs; if "pch", add characters (see pch for details) on corresponding glyphs.
                     pch      = 4, # add character on the glyphs of insignificant correlation coefficients (only valid when insig is "pch"). Default value is 4.
                     hc.order = TRUE, # If TRUE, correlation matrix will be hc.ordered using hclust function.
                     tl.cex   = 8, # the size, the color and the string rotation of text label
                      tl.col   = "black", 
                     digits = 2
                     )

myplot

3.2 Graphs

I went to Chat GPT and typed “r create a histogram of all variables in a dataframe”. You could have done so in Google alternatively.

I got the following code - and tried running it as it is to understand how it works.

library(ggplot2)

# create sample dataframe
df <- data.frame(
  var1 = rnorm(100),
  var2 = rnorm(100),
  var3 = rnorm(100)
)

# melt the dataframe into long format
df_melted <- reshape2::melt(df)

## No id variables; using all as measure variables

# create histogram using ggplot2
ggplot(df_melted, aes(value)) + 
  geom_histogram() + 
  facet_wrap(~variable, scales = "free_x")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Then, I try to adapt it for my own code -

library(ggplot2)

# melt the dataframe into long format
df_melted <- reshape2::melt(df_no_missing)

## No id variables; using all as measure variables

# create histogram using ggplot2
ggplot(df_melted, aes(x = value)) + 
  geom_histogram() + 
  facet_wrap(~variable, scales = "free_x")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.3 Summary Statistics

Use stargazer package to create professional summary tables. Can also consider using kabble, kabbleExtra.

Run summary statistics on the final, clean variables that you are keeping in your regression. Play around with labelling variable with covariate.labels option in stargazer().

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

?stargazer

stargazer(table = df_no_missing[,2:15],
          type = "text",
          title = "Summary Statistics" #,
    #     covariate.labels = c( "Target Wins", 
    #                            "Team Batting"
    #                           )
          )

## 
## Summary Statistics
## ========================================================
## Statistic          N     Mean    St. Dev.  Min     Max  
## --------------------------------------------------------
## TARGET_WINS      1,835  80.990    13.158    38     120  
## TEAM_BATTING_H   1,835 1,457.035 108.082  1,137   1,876 
## TEAM_BATTING_2B  1,835  247.834   42.952   130     392  
## TEAM_BATTING_3B  1,835  47.839    21.709    11     138  
## TEAM_BATTING_HR  1,835  116.478   54.511    4      264  
## TEAM_BATTING_BB  1,835  530.277   85.501   273     878  
## TEAM_BATTING_SO  1,835  786.762  216.972   324    1,399 
## TEAM_BASERUN_SB  1,835  100.801   52.755    18     367  
## TEAM_BASERUN_CS  1,835  52.933    20.562  11.000 201.000
## TEAM_PITCHING_H  1,835 1,524.617 174.214  1,137   2,394 
## TEAM_PITCHING_HR 1,835  120.711   56.412    4      343  
## TEAM_PITCHING_BB 1,835  553.960   97.483   325    1,090 
## TEAM_PITCHING_SO 1,835  817.993  222.223   345    1,781 
## TEAM_FIELDING_E  1,835  159.925   57.789    65     515  
## --------------------------------------------------------

4 Regression

The regression I will run is ( describe it ) -

$wage_{i}= \beta_{0}+\gamma_1+ \epsilon_i$.

stepAIC may help you with backward/forward selection.

Box-Cox may be interesting to try too. Read up on it if you have time.

Present the models in one single table so that it is easy for the reader to compare coefficients.

# MODEL 1:  keep all variables 

model1 <- lm(data = df_no_missing, 
             TARGET_WINS ~ .)


# MODEL 2:  drop variable randomly ...

model2 <- lm(data = df_no_missing, 
             TARGET_WINS ~ . -TEAM_BATTING_HR - TEAM_BATTING_BB )

# MODEL 3:  Step AIC...Backward selection... drop variable systematically...
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

?stepAIC
model3 <- stepAIC(object = lm(data = df_no_missing, TARGET_WINS ~ . - INDEX), 
                  direction = c("both")
                  )

## Start:  AIC=8524.83
## TARGET_WINS ~ (INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
##     TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP) - 
##     INDEX
## 
##                    Df Sum of Sq    RSS    AIC
## - TEAM_PITCHING_HR  1      12.4 188001 8523.0
## - TEAM_BATTING_SO   1     101.7 188090 8523.8
## <none>                          187989 8524.8
## - TEAM_BATTING_HR   1     242.0 188231 8525.2
## - TEAM_BATTING_H    1     270.0 188259 8525.5
## - TEAM_PITCHING_BB  1     347.9 188336 8526.2
## - TEAM_PITCHING_SO  1     432.7 188421 8527.1
## - TEAM_BASERUN_CS   1     662.6 188651 8529.3
## - TEAM_BATTING_BB   1     710.8 188699 8529.8
## - TEAM_PITCHING_H   1    1303.8 189292 8535.5
## - TEAM_BATTING_2B   1    3226.4 191215 8554.1
## - TEAM_FIELDING_DP  1    8436.8 196425 8603.4
## - TEAM_BATTING_3B   1    9717.7 197706 8615.3
## - TEAM_BASERUN_SB   1   15747.2 203736 8670.4
## - TEAM_FIELDING_E   1   29331.9 217320 8788.9
## 
## Step:  AIC=8522.95
## TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
##     TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
##     TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
##     TEAM_FIELDING_E + TEAM_FIELDING_DP
## 
##                    Df Sum of Sq    RSS    AIC
## <none>                          188001 8523.0
## - TEAM_BATTING_SO   1     229.1 188230 8523.2
## - TEAM_BATTING_H    1     325.8 188327 8524.1
## + TEAM_PITCHING_HR  1      12.4 187989 8524.8
## - TEAM_PITCHING_BB  1     400.4 188401 8524.9
## - TEAM_BASERUN_CS   1     677.5 188679 8527.6
## - TEAM_BATTING_BB   1     795.1 188796 8528.7
## - TEAM_PITCHING_SO  1     825.2 188826 8529.0
## - TEAM_PITCHING_H   1    1488.2 189489 8535.4
## - TEAM_BATTING_2B   1    3223.9 191225 8552.2
## - TEAM_FIELDING_DP  1    8462.1 196463 8601.7
## - TEAM_BATTING_3B   1    9711.3 197712 8613.4
## - TEAM_BATTING_HR   1   11284.0 199285 8627.9
## - TEAM_BASERUN_SB   1   15856.4 203857 8669.5
## - TEAM_FIELDING_E   1   29396.6 217398 8787.5

stargazer(model1, model2, model3,
          type = "text"
          )

## 
## =================================================================================================
##                                                  Dependent variable:                             
##                     -----------------------------------------------------------------------------
##                                                      TARGET_WINS                                 
##                                (1)                       (2)                       (3)           
## -------------------------------------------------------------------------------------------------
## INDEX                        -0.0002                   -0.0002                                   
##                             (0.0003)                  (0.0003)                                   
##                                                                                                  
## TEAM_BATTING_H               -0.027                     0.003                    -0.028*         
##                              (0.017)                   (0.011)                   (0.016)         
##                                                                                                  
## TEAM_BATTING_2B             -0.050***                 -0.051***                 -0.050***        
##                              (0.009)                   (0.009)                   (0.009)         
##                                                                                                  
## TEAM_BATTING_3B             0.185***                  0.181***                  0.184***         
##                              (0.019)                   (0.019)                   (0.019)         
##                                                                                                  
## TEAM_BATTING_HR               0.123                                             0.097***         
##                              (0.082)                                             (0.009)         
##                                                                                                  
## TEAM_BATTING_BB             0.112***                                            0.115***         
##                              (0.042)                                             (0.041)         
##                                                                                                  
## TEAM_BATTING_SO               0.022                   0.060***                    0.026          
##                              (0.022)                   (0.017)                   (0.018)         
##                                                                                                  
## TEAM_BASERUN_SB             0.077***                  0.078***                  0.077***         
##                              (0.006)                   (0.006)                   (0.006)         
##                                                                                                  
## TEAM_BASERUN_CS             -0.036**                  -0.040***                 -0.036**         
##                              (0.014)                   (0.014)                   (0.014)         
##                                                                                                  
## TEAM_PITCHING_H             0.054***                  0.026***                  0.055***         
##                              (0.015)                   (0.009)                   (0.014)         
##                                                                                                  
## TEAM_PITCHING_HR             -0.026                   0.092***                                   
##                              (0.078)                   (0.009)                                   
##                                                                                                  
## TEAM_PITCHING_BB             -0.075*                  0.031***                  -0.077**         
##                              (0.040)                   (0.003)                   (0.039)         
##                                                                                                  
## TEAM_PITCHING_SO            -0.043**                  -0.079***                 -0.047***        
##                              (0.021)                   (0.016)                   (0.017)         
##                                                                                                  
## TEAM_FIELDING_E             -0.121***                 -0.117***                 -0.120***        
##                              (0.007)                   (0.007)                   (0.007)         
##                                                                                                  
## TEAM_FIELDING_DP            -0.111***                 -0.112***                 -0.111***        
##                              (0.012)                   (0.012)                   (0.012)         
##                                                                                                  
## Constant                    60.797***                 59.964***                 60.856***        
##                              (6.066)                   (6.076)                   (6.057)         
##                                                                                                  
## -------------------------------------------------------------------------------------------------
## Observations                  1,835                     1,835                     1,835          
## R2                            0.408                     0.404                     0.408          
## Adjusted R2                   0.403                     0.400                     0.404          
## Residual Std. Error    10.165 (df = 1819)        10.192 (df = 1821)        10.161 (df = 1821)    
## F Statistic         83.597*** (df = 15; 1819) 95.054*** (df = 13; 1821) 96.522*** (df = 13; 1821)
## =================================================================================================
## Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

4.1 Multicollinearity

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

## The following object is masked from 'package:psych':
## 
##     logit

?vif()


data_subset <- df_no_missing[,-1]

names(data_subset)

##  [1] "TARGET_WINS"      "TEAM_BATTING_H"   "TEAM_BATTING_2B"  "TEAM_BATTING_3B" 
##  [5] "TEAM_BATTING_HR"  "TEAM_BATTING_BB"  "TEAM_BATTING_SO"  "TEAM_BASERUN_SB" 
##  [9] "TEAM_BASERUN_CS"  "TEAM_PITCHING_H"  "TEAM_PITCHING_HR" "TEAM_PITCHING_BB"
## [13] "TEAM_PITCHING_SO" "TEAM_FIELDING_E"  "TEAM_FIELDING_DP"

model1 <- lm(data = data_subset, 
             TARGET_WINS ~ .)

summary(model1)

## 
## Call:
## lm(formula = TARGET_WINS ~ ., data = data_subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -32.350  -7.089   0.103   6.918  29.580 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      60.763342   6.064292  10.020  < 2e-16 ***
## TEAM_BATTING_H   -0.026693   0.016510  -1.617 0.106107    
## TEAM_BATTING_2B  -0.049536   0.008863  -5.589 2.63e-08 ***
## TEAM_BATTING_3B   0.183918   0.018961   9.700  < 2e-16 ***
## TEAM_BATTING_HR   0.124744   0.081504   1.531 0.126061    
## TEAM_BATTING_BB   0.111399   0.042465   2.623 0.008781 ** 
## TEAM_BATTING_SO   0.021847   0.022019   0.992 0.321233    
## TEAM_BASERUN_SB   0.076395   0.006187  12.347  < 2e-16 ***
## TEAM_BASERUN_CS  -0.036101   0.014253  -2.533 0.011399 *  
## TEAM_PITCHING_H   0.053366   0.015021   3.553 0.000391 ***
## TEAM_PITCHING_HR -0.027021   0.077957  -0.347 0.728920    
## TEAM_PITCHING_BB -0.074086   0.040371  -1.835 0.066647 .  
## TEAM_PITCHING_SO -0.042878   0.020949  -2.047 0.040825 *  
## TEAM_FIELDING_E  -0.120625   0.007158 -16.852  < 2e-16 ***
## TEAM_FIELDING_DP -0.110900   0.012271  -9.038  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.16 on 1820 degrees of freedom
## Multiple R-squared:  0.408,  Adjusted R-squared:  0.4034 
## F-statistic: 89.59 on 14 and 1820 DF,  p-value: < 2.2e-16

## INCORRECT COEFFICIENTS DUE TO MULTOCOLLINEARITY???
vif_values1 <- vif(model1) # multicollinearity
as.data.frame(vif(model1))

##                  vif(model1)
## TEAM_BATTING_H     56.539890
## TEAM_BATTING_2B     2.573319
## TEAM_BATTING_3B     3.008691
## TEAM_BATTING_HR   350.487237
## TEAM_BATTING_BB   234.073106
## TEAM_BATTING_SO   405.276770
## TEAM_BASERUN_SB     1.891731
## TEAM_BASERUN_CS     1.525089
## TEAM_PITCHING_H   121.586247
## TEAM_PITCHING_HR  343.396144
## TEAM_PITCHING_BB  274.994857
## TEAM_PITCHING_SO  384.810502
## TEAM_FIELDING_E     3.038294
## TEAM_FIELDING_DP    1.376169

summary(model1_vif <- lm(data = data_subset, 
             TEAM_BATTING_HR ~ . - TARGET_WINS)) # 351.042560

## 
## Call:
## lm(formula = TEAM_BATTING_HR ~ . - TARGET_WINS, data = data_subset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -42.782  -0.874  -0.193   0.653  29.376 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.2441869  1.7435860  -0.140 0.888637    
## TEAM_BATTING_H   -0.0515416  0.0045908 -11.227  < 2e-16 ***
## TEAM_BATTING_2B  -0.0009689  0.0025483  -0.380 0.703816    
## TEAM_BATTING_3B  -0.0094590  0.0054473  -1.736 0.082650 .  
## TEAM_BATTING_BB   0.1201058  0.0118807  10.109  < 2e-16 ***
## TEAM_BATTING_SO   0.1607601  0.0050881  31.595  < 2e-16 ***
## TEAM_BASERUN_SB   0.0030011  0.0017775   1.688 0.091515 .  
## TEAM_BASERUN_CS  -0.0144344  0.0040841  -3.534 0.000419 ***
## TEAM_PITCHING_H   0.0507239  0.0041519  12.217  < 2e-16 ***
## TEAM_PITCHING_HR  0.9503043  0.0025437 373.597  < 2e-16 ***
## TEAM_PITCHING_BB -0.1153016  0.0112884 -10.214  < 2e-16 ***
## TEAM_PITCHING_SO -0.1533735  0.0048334 -31.732  < 2e-16 ***
## TEAM_FIELDING_E   0.0044123  0.0020555   2.147 0.031958 *  
## TEAM_FIELDING_DP -0.0040996  0.0035268  -1.162 0.245218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.922 on 1821 degrees of freedom
## Multiple R-squared:  0.9971, Adjusted R-squared:  0.9971 
## F-statistic: 4.896e+04 on 13 and 1821 DF,  p-value: < 2.2e-16

                                   #1/(1-summary(model1_vif)$r.squared)
1/(1-0.9971)

## [1] 344.8276

1/( 1 - summary(model1_vif)$r.squared )

## [1] 350.4872

ANSWER THE FOLLOWING questions atleast -

Describe your model evolution. Why did you try a new model? Which model do you finally prefer?
What is coefficient interpretation of the model of your choice ??? Do it for a few key variables.
How is the model fit ? Residual Analysis ???

4.2 Residual Analysis on Final Model

Try to convince the reader your chosen model does not violate Gauss Markov Assumptions.

?par
par(mfrow = c(2,2)) # 2 x 2 pictures on one plot
plot(model3)

How do you interpret these 4 charts ???

4.3 Predict

?predict

5 Session Info

https://www.rostrum.blog/2018/10/13/sessioninfo/

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] car_3.1-2        carData_3.0-5    MASS_7.3-60      stargazer_5.2.3 
##  [5] ggcorrplot_0.1.4 visdat_0.6.0     lubridate_1.9.2  forcats_1.0.0   
##  [9] stringr_1.5.0    dplyr_1.1.3      purrr_1.0.1      readr_2.1.4     
## [13] tidyr_1.3.0      tibble_3.2.1     ggplot2_3.4.2    tidyverse_2.0.0 
## [17] psych_2.3.6     
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.10      lattice_0.20-45  digest_0.6.32    utf8_1.2.3      
##  [5] R6_2.5.1         plyr_1.8.8       evaluate_0.21    highr_0.10      
##  [9] pillar_1.9.0     rlang_1.1.1      rstudioapi_0.14  jquerylib_0.1.4 
## [13] rmarkdown_2.21   labeling_0.4.2   munsell_0.5.0    compiler_4.2.1  
## [17] xfun_0.39        pkgconfig_2.0.3  mnormt_2.1.1     htmltools_0.5.5 
## [21] tidyselect_1.2.0 fansi_1.0.4      tzdb_0.3.0       withr_2.5.1     
## [25] grid_4.2.1       nlme_3.1-161     jsonlite_1.8.7   gtable_0.3.3    
## [29] lifecycle_1.0.3  magrittr_2.0.3   scales_1.2.1     cli_3.6.1       
## [33] stringi_1.7.12   cachem_1.0.8     farver_2.1.1     reshape2_1.4.4  
## [37] bslib_0.5.0      ellipsis_0.3.2   generics_0.1.3   vctrs_0.6.3     
## [41] tools_4.2.1      glue_1.6.2       hms_1.1.2        abind_1.4-5     
## [45] parallel_4.2.1   fastmap_1.1.1    yaml_2.3.7       timechange_0.2.0
## [49] colorspace_2.1-0 knitr_1.42       sass_0.4.6

# devtools::session_info()

Homework 1

Arvind Sharma

2023-10-04