Homework 1

Please note that option knitr::opts_knit$set(root.dir = “YOUR WORKING DIRECTORY”) is only required if your .qmd file is in a different folder than you raw data, like in my case. Knitr Explanation

1 INSTRUCTIONS

This file is for showing students some code, and is not a definitive guide on how to present. You can and should experiment with the commands shown and see if you can improve the presentation.

I am merely attempting to give you a starting point. You can still use commands in this file like psych::describe but you can see that stargazer is much better. Similarly, instead of plotting with plot, use ggplot.

ENSURE YOUR FINAL REPORT IS CLEAN, POLISHED AND PROFESSIONAL AND DOES NOT HAVE ANY UNNECESSARY OUTPUT.

  • You will have to play more with this template and also the commands.

2 Setup

2.1 Empty variables and functions in the environment tab/window

First, empty the environment so that we can upload the clean data.

# Set working directory and path to data
  setwd(cd)

# Clear the workspace
  rm(list = ls()) # Clear environment
  gc()            # Clear unused memory
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  599569 32.1    1362770 72.8         NA   806282 43.1
Vcells 1108795  8.5    8388608 64.0      49152  1963198 15.0
  cat("\f")       # Clear the console
  dev.off         # Clear the charts
function (which = dev.cur()) 
{
    if (which == 1) 
        stop("cannot shut down device 1 (the null device)")
    .External(C_devoff, as.integer(which))
    dev.cur()
}
<bytecode: 0x12dc2eba0>
<environment: namespace:grDevices>

2.2 Load packages

Now, we load the packages.

# Prepare needed libraries
packages <- c("psych","tidyverse")

  for (i in 1:length(packages)) {
    if (!packages[i] %in% rownames(installed.packages())) {
      install.packages(packages[i]
                       , repos = "http://cran.rstudio.com/"
                       , dependencies = TRUE
                       )
    }
    library(packages[i], character.only = TRUE)
  }
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::%+%()   masks psych::%+%()
✖ ggplot2::alpha() masks psych::alpha()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
rm(packages)

2.3 Load raw data

There are many ways to confirm you have successfully imported the raw data. Like I try to conduct some summary stats on the data.

Note that it is fine initially to use commands like describe or glimpse to confirm data import or exploring the data. But reports are an iterative process. The final report is should be a client friendly reports i.e. keep only the most informative command output in your final submitted file and suppress the output of other commands.

For instance, does the client really need to know what is your working directory ? Or would that information be useful for them ? If it is not important for the client to see the output of certain commands but you need to run that chunk, so try include=FALSE in r chunk options.

Try to move away from an assignment style of writing (print all command outputs, and have as many commands as possible) to an executive report style of writing (attention highlighted to key points without wasting time). For example, if there are 4 ways to create summary statistics for a dataset, in your first iteration of the assignment run just any one command that achieves the target. In the second iteration, try to see if you can run the other three commands for summary statistics too - this will be important for your learning. In the third and final iteration, just present your preferred command for each task and rewrite the assignment in an executive/business report tone.

df_train    <- read.csv("moneyball-training-data.csv")
df_eval     <- read.csv("moneyball-evaluation-data.csv")

describe(df_train)
                 vars    n    mean      sd median trimmed    mad  min   max
INDEX               1 2276 1268.46  736.35 1270.5 1268.57 952.57    1  2535
TARGET_WINS         2 2276   80.79   15.75   82.0   81.31  14.83    0   146
TEAM_BATTING_H      3 2276 1469.27  144.59 1454.0 1459.04 114.16  891  2554
TEAM_BATTING_2B     4 2276  241.25   46.80  238.0  240.40  47.44   69   458
TEAM_BATTING_3B     5 2276   55.25   27.94   47.0   52.18  23.72    0   223
TEAM_BATTING_HR     6 2276   99.61   60.55  102.0   97.39  78.58    0   264
TEAM_BATTING_BB     7 2276  501.56  122.67  512.0  512.18  94.89    0   878
TEAM_BATTING_SO     8 2174  735.61  248.53  750.0  742.31 284.66    0  1399
TEAM_BASERUN_SB     9 2145  124.76   87.79  101.0  110.81  60.79    0   697
TEAM_BASERUN_CS    10 1504   52.80   22.96   49.0   50.36  17.79    0   201
TEAM_BATTING_HBP   11  191   59.36   12.97   58.0   58.86  11.86   29    95
TEAM_PITCHING_H    12 2276 1779.21 1406.84 1518.0 1555.90 174.95 1137 30132
TEAM_PITCHING_HR   13 2276  105.70   61.30  107.0  103.16  74.13    0   343
TEAM_PITCHING_BB   14 2276  553.01  166.36  536.5  542.62  98.59    0  3645
TEAM_PITCHING_SO   15 2174  817.73  553.09  813.5  796.93 257.23    0 19278
TEAM_FIELDING_E    16 2276  246.48  227.77  159.0  193.44  62.27   65  1898
TEAM_FIELDING_DP   17 1990  146.39   26.23  149.0  147.58  23.72   52   228
                 range  skew kurtosis    se
INDEX             2534  0.00    -1.22 15.43
TARGET_WINS        146 -0.40     1.03  0.33
TEAM_BATTING_H    1663  1.57     7.28  3.03
TEAM_BATTING_2B    389  0.22     0.01  0.98
TEAM_BATTING_3B    223  1.11     1.50  0.59
TEAM_BATTING_HR    264  0.19    -0.96  1.27
TEAM_BATTING_BB    878 -1.03     2.18  2.57
TEAM_BATTING_SO   1399 -0.30    -0.32  5.33
TEAM_BASERUN_SB    697  1.97     5.49  1.90
TEAM_BASERUN_CS    201  1.98     7.62  0.59
TEAM_BATTING_HBP    66  0.32    -0.11  0.94
TEAM_PITCHING_H  28995 10.33   141.84 29.49
TEAM_PITCHING_HR   343  0.29    -0.60  1.28
TEAM_PITCHING_BB  3645  6.74    96.97  3.49
TEAM_PITCHING_SO 19278 22.17   671.19 11.86
TEAM_FIELDING_E   1833  2.99    10.97  4.77
TEAM_FIELDING_DP   176 -0.39     0.18  0.59
glimpse(df_eval)
Rows: 259
Columns: 16
$ INDEX            <int> 9, 10, 14, 47, 60, 63, 74, 83, 98, 120, 123, 135, 138…
$ TEAM_BATTING_H   <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259,…
$ TEAM_BATTING_2B  <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243…
$ TEAM_BATTING_3B  <int> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 55, 57, 2…
$ TEAM_BATTING_HR  <int> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 164, 186,…
$ TEAM_BATTING_BB  <int> 447, 516, 509, 486, 95, 215, 568, 356, 466, 452, 495,…
$ TEAM_BATTING_SO  <int> 1080, 929, 816, 914, 416, 377, 527, 609, 689, 584, 64…
$ TEAM_BASERUN_SB  <int> 62, 54, 59, 148, NA, NA, 365, 185, 150, 52, 64, 48, 3…
$ TEAM_BASERUN_CS  <int> 50, 39, 47, 57, NA, NA, NA, NA, NA, NA, NA, 28, 21, 8…
$ TEAM_BATTING_HBP <int> NA, NA, NA, 42, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TEAM_PITCHING_H  <int> 1209, 1221, 1395, 1539, 3902, 2793, 1544, 1626, 1342,…
$ TEAM_PITCHING_HR <int> 83, 88, 93, 159, 14, 20, 40, 39, 25, 62, 53, 173, 196…
$ TEAM_PITCHING_BB <int> 447, 516, 509, 486, 257, 420, 613, 418, 497, 482, 521…
$ TEAM_PITCHING_SO <int> 1080, 929, 816, 914, 1123, 736, 569, 715, 734, 622, 6…
$ TEAM_FIELDING_E  <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200…
$ TEAM_FIELDING_DP <int> 156, 164, 153, 154, 130, 105, NA, 104, 132, 145, 183,…
?print # years "Total Years:"
print(x = 2006-1871+1,
      digits = 1)
[1] 136
136*162
[1] 22032

My third command for summary stats will be stargazer and I will only present that in my final report. This is still more in assignment style.

3 Data Analysis

I have three points to make -

  • This will bold.
  • This will italicize.

3.1 Data Visualization

Use ggplot2 package to create professional charts.

3.1.1 Missing Observations

library(visdat)
?visdat

vis_dat(df_train)

df_no_missing <- df_train
df_no_missing$TEAM_BATTING_HBP  = NULL
df_no_missing$TEAM_BATTING_CS   = NULL 
 vis_miss(df_no_missing)

3.1.2 Imputing Missing Observations

Plot or summarize the variable before and after the imputation to confirm the distribution has not

hist(df_no_missing$TEAM_BASERUN_CS )

df_no_missing$TEAM_BASERUN_CS <- ifelse(is.na(x = df_no_missing$TEAM_BASERUN_CS), 
                                        yes = mean(x = df_no_missing$TEAM_BASERUN_CS,                                                    na.rm = TRUE),
                                        no = df_no_missing$TEAM_BASERUN_CS)
hist(df_no_missing$TEAM_BASERUN_CS )

# Create histogram in ggplot
ggplot(data = df_no_missing, 
       mapping = aes(x = TEAM_BASERUN_CS)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = df_no_missing, 
       mapping = aes(x = TEAM_BASERUN_CS)) + geom_histogram(color = "black", 
                                                            fill = "blue", 
                                                            binwidth = 10) +
  labs(title = "Histogram of X", x = "X", y = "Count")

df_no_missing <- na.omit(df_no_missing) # maybe drop a few more variables

3.1.3 Correlation Plot

Please play with the command below and make the correlation plot upper or lower triangular - else the chart below displays too much redundant information. It does not look very professional ( a little cluttered). At top firms, attention to detail is very important and you are required to create very polished charts.

library(ggcorrplot)
?ggcorrplot()
?cor
  

?cor_pmat # cor_pmat(): Compute a correlation matrix p-values.
mycorr<- cor(x = df_no_missing[, 1:ncol(df_no_missing )])

p.mat <- ggcorrplot::cor_pmat(x = df_no_missing[,1:ncol(df_no_missing)])
head(p.mat)
                      INDEX  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
INDEX           0.000000000 3.085723e-01   2.870071e-01    1.012176e-01
TARGET_WINS     0.308572334 0.000000e+00   1.013043e-54    2.490498e-20
TEAM_BATTING_H  0.287007118 1.013043e-54   0.000000e+00   9.355168e-233
TEAM_BATTING_2B 0.101217596 2.490498e-20  9.355168e-233    0.000000e+00
TEAM_BATTING_3B 0.769316672 1.340586e-07   2.896946e-60    2.894398e-02
TEAM_BATTING_HR 0.006420957 1.416443e-21   1.602541e-06    1.786146e-62
                TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
INDEX              7.693167e-01    6.420957e-03    2.259876e-02    7.106396e-07
TARGET_WINS        1.340586e-07    1.416443e-21    6.788724e-40    1.082744e-02
TEAM_BATTING_H     2.896946e-60    1.602541e-06    1.402162e-05    3.110665e-54
TEAM_BATTING_2B    2.894398e-02    1.786146e-62    2.281359e-23    1.601577e-07
TEAM_BATTING_3B    0.000000e+00   3.146352e-208    5.493361e-25   1.722999e-271
TEAM_BATTING_HR   3.146352e-208    0.000000e+00    2.833152e-68   4.262615e-267
                TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
INDEX              1.177395e-04    9.919907e-01    9.517913e-01
TARGET_WINS        1.846788e-07    6.773671e-01    8.836481e-22
TEAM_BATTING_H     1.497875e-02    6.096341e-01   7.485250e-277
TEAM_BATTING_2B    9.931836e-07    9.663607e-06    6.214542e-71
TEAM_BATTING_3B    8.653120e-30    2.117662e-35    5.645504e-73
TEAM_BATTING_HR    1.327937e-41    9.273809e-63    2.113987e-05
                TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
INDEX               7.419790e-03     1.014059e-02     1.744343e-06
TARGET_WINS         1.685830e-21     2.011167e-32     4.962297e-03
TEAM_BATTING_H      4.769017e-08     3.435879e-08     7.730597e-49
TEAM_BATTING_2B     1.151450e-63     2.211572e-17     5.592219e-07
TEAM_BATTING_3B    2.009525e-190     1.798207e-07    2.195339e-225
TEAM_BATTING_HR     0.000000e+00     2.039060e-29    1.157630e-220
                TEAM_FIELDING_E TEAM_FIELDING_DP
INDEX              1.287249e-02     7.993341e-01
TARGET_WINS        1.454931e-15     1.165960e-01
TEAM_BATTING_H     2.969523e-04     1.173988e-11
TEAM_BATTING_2B    9.295027e-35     1.901141e-14
TEAM_BATTING_3B   2.596789e-244     7.211648e-21
TEAM_BATTING_HR   4.191484e-271     2.530492e-36
  myplot<-ggcorrplot(corr     = mycorr,   # correlation matrix to visualize
                     method   = "square", # character, the visualization method of correlation matrix to be used. Allowed values are "square" (default), "circle"
                     type     = "full",  # character, "full" (default), "lower" or "upper" display
                     title    = "Correlation Plot",  # character, title of the graph
                     colors   = c("red", "white","green"), #    vector of 3 colors for low, mid and high correlation values.
                     lab      = TRUE,   # If TRUE, add corr coeff on the plot.
                     lab_size = 2,      # labels. used when lab = TRUE.
                     p.mat    = p.mat,  # matrix of p-value. If NULL, arguments sig.level, insig, pch, pch.col, pch.cex is invalid.  # Barring the no significant coefficient
                     insig    = "pch",  # character, specialized insignificant correlation coefficients, "pch" (default), "blank". If "blank", wipe away the corresponding glyphs; if "pch", add characters (see pch for details) on corresponding glyphs.
                     pch      = 4, # add character on the glyphs of insignificant correlation coefficients (only valid when insig is "pch"). Default value is 4.
                     hc.order = TRUE, # If TRUE, correlation matrix will be hc.ordered using hclust function.
                     tl.cex   = 8, # the size, the color and the string rotation of text label
                      tl.col   = "black", 
                     digits = 2
                     )

myplot 

3.2 Graphs

I went to Chat GPT and typed “r create a histogram of all variables in a dataframe”. You could have done so in Google alternatively.

I got the following code - and tried running it as it is to understand how it works.

library(ggplot2)

# create sample dataframe
df <- data.frame(
  var1 = rnorm(100),
  var2 = rnorm(100),
  var3 = rnorm(100)
)

# melt the dataframe into long format
df_melted <- reshape2::melt(df)
No id variables; using all as measure variables
# create histogram using ggplot2
ggplot(df_melted, aes(value)) + 
  geom_histogram() + 
  facet_wrap(~variable, scales = "free_x")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Then, I try to adapt it for my own code -

library(ggplot2)

# melt the dataframe into long format
df_melted <- reshape2::melt(df_no_missing)
No id variables; using all as measure variables
# create histogram using ggplot2
ggplot(df_melted, aes(x = value)) + 
  geom_histogram() + 
  facet_wrap(~variable, scales = "free_x")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

3.3 Summary Statistics

Use stargazer package to create professional summary tables. Can also consider using kabble, kabbleExtra.

Run summary statistics on the final, clean variables that you are keeping in your regression. Play around with labelling variable with covariate.labels option in stargazer().

library(stargazer)

Please cite as: 
 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 
?stargazer

stargazer(table = df_no_missing[,2:15],
          type = "text",
          title = "Summary Statistics" #,
    #     covariate.labels = c( "Target Wins", 
    #                            "Team Batting"
    #                           )
          )

Summary Statistics
========================================================
Statistic          N     Mean    St. Dev.  Min     Max  
--------------------------------------------------------
TARGET_WINS      1,835  80.990    13.158    38     120  
TEAM_BATTING_H   1,835 1,457.035 108.082  1,137   1,876 
TEAM_BATTING_2B  1,835  247.834   42.952   130     392  
TEAM_BATTING_3B  1,835  47.839    21.709    11     138  
TEAM_BATTING_HR  1,835  116.478   54.511    4      264  
TEAM_BATTING_BB  1,835  530.277   85.501   273     878  
TEAM_BATTING_SO  1,835  786.762  216.972   324    1,399 
TEAM_BASERUN_SB  1,835  100.801   52.755    18     367  
TEAM_BASERUN_CS  1,835  52.933    20.562  11.000 201.000
TEAM_PITCHING_H  1,835 1,524.617 174.214  1,137   2,394 
TEAM_PITCHING_HR 1,835  120.711   56.412    4      343  
TEAM_PITCHING_BB 1,835  553.960   97.483   325    1,090 
TEAM_PITCHING_SO 1,835  817.993  222.223   345    1,781 
TEAM_FIELDING_E  1,835  159.925   57.789    65     515  
--------------------------------------------------------

4 Regression

The regression I will run is ( describe it ) -

\(wage_{i}= \beta_{0}+\gamma_1+ \epsilon_i\).

stepAIC may help you with backward/forward selection.

Box-Cox may be interesting to try too. Read up on it if you have time.

Present the models in one single table so that it is easy for the reader to compare coefficients.

# MODEL 1:  keep all variables 

model1 <- lm(data = df_no_missing, 
             TARGET_WINS ~ .)


# MODEL 2:  drop variable randomly ...

model2 <- lm(data = df_no_missing, 
             TARGET_WINS ~ . -TEAM_BATTING_HR - TEAM_BATTING_BB )

# MODEL 3:  Step AIC...Backward selection... drop variable systematically...
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
?stepAIC
model3 <- stepAIC(object = lm(data = df_no_missing, TARGET_WINS ~ .), 
                  direction = c("backward")
                  )
Start:  AIC=8526.62
TARGET_WINS ~ INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
    TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
    TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
- TEAM_PITCHING_HR  1      11.1 187978 8524.7
- INDEX             1      21.5 187989 8524.8
- TEAM_BATTING_SO   1     104.2 188071 8525.6
<none>                          187967 8526.6
- TEAM_BATTING_HR   1     235.9 188203 8526.9
- TEAM_BATTING_H    1     272.3 188239 8527.3
- TEAM_PITCHING_BB  1     353.0 188320 8528.1
- TEAM_PITCHING_SO  1     436.0 188403 8528.9
- TEAM_BASERUN_CS   1     670.4 188637 8531.2
- TEAM_BATTING_BB   1     716.8 188684 8531.6
- TEAM_PITCHING_H   1    1312.2 189279 8537.4
- TEAM_BATTING_2B   1    3237.3 191204 8556.0
- TEAM_FIELDING_DP  1    8408.7 196376 8604.9
- TEAM_BATTING_3B   1    9723.5 197691 8617.2
- TEAM_BASERUN_SB   1   15739.9 203707 8672.2
- TEAM_FIELDING_E   1   29328.6 217296 8790.7

Step:  AIC=8524.73
TARGET_WINS ~ INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
    TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
    TEAM_FIELDING_E + TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
- INDEX             1      22.7 188001 8523.0
<none>                          187978 8524.7
- TEAM_BATTING_SO   1     229.8 188208 8525.0
- TEAM_BATTING_H    1     326.6 188305 8525.9
- TEAM_PITCHING_BB  1     404.6 188383 8526.7
- TEAM_BASERUN_CS   1     685.1 188663 8529.4
- TEAM_BATTING_BB   1     799.6 188778 8530.5
- TEAM_PITCHING_SO  1     822.9 188801 8530.7
- TEAM_PITCHING_H   1    1494.3 189473 8537.3
- TEAM_BATTING_2B   1    3235.3 191214 8554.0
- TEAM_FIELDING_DP  1    8431.8 196410 8603.2
- TEAM_BATTING_3B   1    9719.2 197697 8615.2
- TEAM_BATTING_HR   1   11281.0 199259 8629.7
- TEAM_BASERUN_SB   1   15854.4 203833 8671.3
- TEAM_FIELDING_E   1   29390.8 217369 8789.3

Step:  AIC=8522.95
TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
    TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
    TEAM_FIELDING_E + TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
<none>                          188001 8523.0
- TEAM_BATTING_SO   1     229.1 188230 8523.2
- TEAM_BATTING_H    1     325.8 188327 8524.1
- TEAM_PITCHING_BB  1     400.4 188401 8524.9
- TEAM_BASERUN_CS   1     677.5 188679 8527.6
- TEAM_BATTING_BB   1     795.1 188796 8528.7
- TEAM_PITCHING_SO  1     825.2 188826 8529.0
- TEAM_PITCHING_H   1    1488.2 189489 8535.4
- TEAM_BATTING_2B   1    3223.9 191225 8552.2
- TEAM_FIELDING_DP  1    8462.1 196463 8601.7
- TEAM_BATTING_3B   1    9711.3 197712 8613.4
- TEAM_BATTING_HR   1   11284.0 199285 8627.9
- TEAM_BASERUN_SB   1   15856.4 203857 8669.5
- TEAM_FIELDING_E   1   29396.6 217398 8787.5
stargazer(model1, model2, model3,
          type = "text"
          )

=================================================================================================
                                                 Dependent variable:                             
                    -----------------------------------------------------------------------------
                                                     TARGET_WINS                                 
                               (1)                       (2)                       (3)           
-------------------------------------------------------------------------------------------------
INDEX                        -0.0002                   -0.0002                                   
                            (0.0003)                  (0.0003)                                   
                                                                                                 
TEAM_BATTING_H               -0.027                     0.003                    -0.028*         
                             (0.017)                   (0.011)                   (0.016)         
                                                                                                 
TEAM_BATTING_2B             -0.050***                 -0.051***                 -0.050***        
                             (0.009)                   (0.009)                   (0.009)         
                                                                                                 
TEAM_BATTING_3B             0.185***                  0.181***                  0.184***         
                             (0.019)                   (0.019)                   (0.019)         
                                                                                                 
TEAM_BATTING_HR               0.123                                             0.097***         
                             (0.082)                                             (0.009)         
                                                                                                 
TEAM_BATTING_BB             0.112***                                            0.115***         
                             (0.042)                                             (0.041)         
                                                                                                 
TEAM_BATTING_SO               0.022                   0.060***                    0.026          
                             (0.022)                   (0.017)                   (0.018)         
                                                                                                 
TEAM_BASERUN_SB             0.077***                  0.078***                  0.077***         
                             (0.006)                   (0.006)                   (0.006)         
                                                                                                 
TEAM_BASERUN_CS             -0.036**                  -0.040***                 -0.036**         
                             (0.014)                   (0.014)                   (0.014)         
                                                                                                 
TEAM_PITCHING_H             0.054***                  0.026***                  0.055***         
                             (0.015)                   (0.009)                   (0.014)         
                                                                                                 
TEAM_PITCHING_HR             -0.026                   0.092***                                   
                             (0.078)                   (0.009)                                   
                                                                                                 
TEAM_PITCHING_BB             -0.075*                  0.031***                  -0.077**         
                             (0.040)                   (0.003)                   (0.039)         
                                                                                                 
TEAM_PITCHING_SO            -0.043**                  -0.079***                 -0.047***        
                             (0.021)                   (0.016)                   (0.017)         
                                                                                                 
TEAM_FIELDING_E             -0.121***                 -0.117***                 -0.120***        
                             (0.007)                   (0.007)                   (0.007)         
                                                                                                 
TEAM_FIELDING_DP            -0.111***                 -0.112***                 -0.111***        
                             (0.012)                   (0.012)                   (0.012)         
                                                                                                 
Constant                    60.797***                 59.964***                 60.856***        
                             (6.066)                   (6.076)                   (6.057)         
                                                                                                 
-------------------------------------------------------------------------------------------------
Observations                  1,835                     1,835                     1,835          
R2                            0.408                     0.404                     0.408          
Adjusted R2                   0.403                     0.400                     0.404          
Residual Std. Error    10.165 (df = 1819)        10.192 (df = 1821)        10.161 (df = 1821)    
F Statistic         83.597*** (df = 15; 1819) 95.054*** (df = 13; 1821) 96.522*** (df = 13; 1821)
=================================================================================================
Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

If you really want to highlight your best model -

library(coefplot)
coefplot(model3, 
         intercept = FALSE
         )

4.1 Multicollinearity

library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
The following object is masked from 'package:psych':

    logit
?vif()

## INCORRECT COEFFICIENTS DUE TO MULTOCOLLINEARITY???
vif_values1 <- vif(model1) # multicollinearity
# vif_values1

as.data.frame(vif(model1))
                 vif(model1)
INDEX               1.048305
TEAM_BATTING_H     56.553286
TEAM_BATTING_2B     2.574846
TEAM_BATTING_3B     3.031626
TEAM_BATTING_HR   351.042560
TEAM_BATTING_BB   234.233457
TEAM_BATTING_SO   405.584117
TEAM_BASERUN_SB     1.903778
TEAM_BASERUN_CS     1.527066
TEAM_PITCHING_H   121.681790
TEAM_PITCHING_HR  343.938411
TEAM_PITCHING_BB  275.276299
TEAM_PITCHING_SO  384.929404
TEAM_FIELDING_E     3.047851
TEAM_FIELDING_DP    1.377116
summary(model1_vif <- lm(data = df_no_missing[,-2], 
             TEAM_BATTING_HR ~ .)) # 351.042560

Call:
lm(formula = TEAM_BATTING_HR ~ ., data = df_no_missing[, -2])

Residuals:
    Min      1Q  Median      3Q     Max 
-42.879  -0.874  -0.196   0.660  29.263 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -2.078e-01  1.743e+00  -0.119  0.90511    
INDEX            -1.621e-04  9.548e-05  -1.698  0.08965 .  
TEAM_BATTING_H   -5.158e-02  4.588e-03 -11.242  < 2e-16 ***
TEAM_BATTING_2B  -1.073e-03  2.548e-03  -0.421  0.67377    
TEAM_BATTING_3B  -8.637e-03  5.466e-03  -1.580  0.11425    
TEAM_BATTING_BB   1.205e-01  1.188e-02  10.143  < 2e-16 ***
TEAM_BATTING_SO   1.608e-01  5.086e-03  31.620  < 2e-16 ***
TEAM_BASERUN_SB   3.237e-03  1.782e-03   1.817  0.06946 .  
TEAM_BASERUN_CS  -1.466e-02  4.084e-03  -3.590  0.00034 ***
TEAM_PITCHING_H   5.085e-02  4.150e-03  12.251  < 2e-16 ***
TEAM_PITCHING_HR  9.503e-01  2.542e-03 373.792  < 2e-16 ***
TEAM_PITCHING_BB -1.157e-01  1.129e-02 -10.256  < 2e-16 ***
TEAM_PITCHING_SO -1.533e-01  4.831e-03 -31.735  < 2e-16 ***
TEAM_FIELDING_E   4.210e-03  2.058e-03   2.046  0.04094 *  
TEAM_FIELDING_DP -3.936e-03  3.526e-03  -1.116  0.26447    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.921 on 1820 degrees of freedom
Multiple R-squared:  0.9972,    Adjusted R-squared:  0.9971 
F-statistic: 4.551e+04 on 14 and 1820 DF,  p-value: < 2.2e-16
                                   #1/(1-summary(model1_vif)$r.squared)
1/(1-0.9972)
[1] 357.1429
1/( 1 - summary(model1_vif)$r.squared )
[1] 351.0426

ANSWER THE FOLLOWING questions atleast -

  • Describe your model evolution. Why did you try a new model? Which model do you finally prefer?

  • What is coefficient interpretation of the model of your choice ??? Do it for a few key variables.

  • How is the model fit ? Residual Analysis ???

4.2 Residual Analysis on Final Model

Try to convince the reader your chosen model does not violate Gauss Markov Assumptions.

?par
par(mfrow = c(2,2)) # 2 x 2 pictures on one plot
plot(model3)

How do you interpret these 4 charts ???

4.3 Predict

5 Session Info

https://www.rostrum.blog/2018/10/13/sessioninfo/

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: US/Pacific
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] car_3.1-2          carData_3.0-5      coefplot_1.2.8     MASS_7.3-60.2     
 [5] stargazer_5.2.3    ggcorrplot_0.1.4.1 visdat_0.6.0       lubridate_1.9.3   
 [9] forcats_1.0.0      stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2       
[13] readr_2.1.5        tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1     
[17] tidyverse_2.0.0    psych_2.4.6.26    

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    stringi_1.8.4     lattice_0.22-6   
 [5] hms_1.1.3         digest_0.6.36     magrittr_2.0.3    evaluate_0.24.0  
 [9] grid_4.4.1        timechange_0.3.0  useful_1.2.6.1    fastmap_1.2.0    
[13] plyr_1.8.9        jsonlite_1.8.8    fansi_1.0.6       scales_1.3.0     
[17] abind_1.4-5       mnormt_2.1.1      cli_3.6.3         rlang_1.1.4      
[21] munsell_0.5.1     withr_3.0.0       yaml_2.3.8        tools_4.4.1      
[25] parallel_4.4.1    reshape2_1.4.4    tzdb_0.4.0        colorspace_2.1-0 
[29] vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4   htmlwidgets_1.6.4
[33] pkgconfig_2.0.3   pillar_1.9.0      gtable_0.3.5      Rcpp_1.0.12      
[37] glue_1.7.0        xfun_0.45         tidyselect_1.2.1  rstudioapi_0.16.0
[41] knitr_1.47        farver_2.1.2      htmltools_0.5.8.1 nlme_3.1-164     
[45] rmarkdown_2.27    labeling_0.4.3    compiler_4.4.1   
# devtools::session_info()