Homework 1

1 INSTRUCTIONS

This file is for showing students some code, and is not a definitive guide on how to present. You can and should experiment with the commands shown and see if you can improve the presentation.

I am merely attempting to give you a starting point. You can still use commands in this file like psych::describe but you can see that stargazer is much better. Similarly, instead of plotting with plot, use ggplot.

ENSURE YOUR FINAL REPORT IS CLEAN, POLISHED AND PROFESSIONAL AND DOES NOT HAVE ANY UNNECESSARY OUTPUT.

You will have to play more with this template and the commands.

2 Setup

2.1 Empty variables and functions in the environment tab/window

First, empty the environment so that we can upload the clean data.

# Clear the environment
rm(list = ls()) 

# Clear unused memory
gc()
          used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells  597581 32.0    1357250 72.5         NA   700248 37.4
Vcells 1106170  8.5    8388608 64.0      49152  1963352 15.0
# Clear the console
cat("\014")  
# Clear all plots
while (!is.null(dev.list())) {
  dev.off()
}
  • rm(list = ls()): Removes all objects from the current workspace.

  • gc(): Calls the garbage collector to free up memory.

  • cat("\014"): Sends a form feed character to the console, which clears it. This works in RStudio and many other environments, but not in the R GUI.

  • while (!is.null(dev.list())) { dev.off() }: Closes all active graphical devices, clearing all plots.

2.2 Load packages

Now, we load the packages.

This script will ensure that each package in the packages list is installed (if not already installed) and loaded into your R session.

# Define the list of packages
packages <- c("tidyverse", 
              "stargazer",
              "knitr",
              "kableExtra",
              "visdat",
              "psych")  # Add your desired packages here

# Loop through the packages
for (package in packages) {
  if (!package %in% rownames(installed.packages())) {
    install.packages(package, repos = "http://cran.rstudio.com/", dependencies = TRUE)
  }
  library(package, character.only = TRUE)
}
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Please cite as: 


 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer 



Attaching package: 'kableExtra'


The following object is masked from 'package:dplyr':

    group_rows



Attaching package: 'psych'


The following objects are masked from 'package:ggplot2':

    %+%, alpha
remove(packages)
  • Using character.only = TRUE ensures that the library function treats the package variable as a string.

2.3 Load raw data

There are many ways to confirm you have successfully imported the raw data. Like I try to conduct some summary stats on the data.

Note that it is fine initially to use commands like describe or glimpse to confirm data import or exploring the data. But reports are an iterative process. The final report is should be a client friendly reports i.e. keep only the most informative command output in your final submitted file and suppress the output of other commands.

For instance, does the client really need to know what is your working directory ? Or would that information be useful for them ? If it is not important for the client to see the output of certain commands but you need to run that chunk, so try include=FALSE in r chunk options.

Try to move away from an assignment style of writing (print all command outputs, and have as many commands as possible) to an executive report style of writing (attention highlighted to key points without wasting time).

  • For example, if there are 4 ways to create summary statistics for a dataset, in your first iteration of the assignment run just any one command that achieves the target.

  • In the second iteration, try to see if you can run the other three commands for summary statistics too - this will be important for your learning.

  • In the third and final iteration, just present your preferred command for each task and rewrite the assignment in an executive/business report tone.

# Set working directory and path to data
getwd()
[1] "/Users/arvindsharma/Library/CloudStorage/Dropbox/WCAS/Econometrics/Fall_2024/share/R Markdown_HW1"
df_train    <- read.csv("../../../moneyball-training-data.csv")
df_eval     <- read.csv("../../../moneyball-evaluation-data.csv")

describe(df_train)
                 vars    n    mean      sd median trimmed    mad  min   max
INDEX               1 2276 1268.46  736.35 1270.5 1268.57 952.57    1  2535
TARGET_WINS         2 2276   80.79   15.75   82.0   81.31  14.83    0   146
TEAM_BATTING_H      3 2276 1469.27  144.59 1454.0 1459.04 114.16  891  2554
TEAM_BATTING_2B     4 2276  241.25   46.80  238.0  240.40  47.44   69   458
TEAM_BATTING_3B     5 2276   55.25   27.94   47.0   52.18  23.72    0   223
TEAM_BATTING_HR     6 2276   99.61   60.55  102.0   97.39  78.58    0   264
TEAM_BATTING_BB     7 2276  501.56  122.67  512.0  512.18  94.89    0   878
TEAM_BATTING_SO     8 2174  735.61  248.53  750.0  742.31 284.66    0  1399
TEAM_BASERUN_SB     9 2145  124.76   87.79  101.0  110.81  60.79    0   697
TEAM_BASERUN_CS    10 1504   52.80   22.96   49.0   50.36  17.79    0   201
TEAM_BATTING_HBP   11  191   59.36   12.97   58.0   58.86  11.86   29    95
TEAM_PITCHING_H    12 2276 1779.21 1406.84 1518.0 1555.90 174.95 1137 30132
TEAM_PITCHING_HR   13 2276  105.70   61.30  107.0  103.16  74.13    0   343
TEAM_PITCHING_BB   14 2276  553.01  166.36  536.5  542.62  98.59    0  3645
TEAM_PITCHING_SO   15 2174  817.73  553.09  813.5  796.93 257.23    0 19278
TEAM_FIELDING_E    16 2276  246.48  227.77  159.0  193.44  62.27   65  1898
TEAM_FIELDING_DP   17 1990  146.39   26.23  149.0  147.58  23.72   52   228
                 range  skew kurtosis    se
INDEX             2534  0.00    -1.22 15.43
TARGET_WINS        146 -0.40     1.03  0.33
TEAM_BATTING_H    1663  1.57     7.28  3.03
TEAM_BATTING_2B    389  0.22     0.01  0.98
TEAM_BATTING_3B    223  1.11     1.50  0.59
TEAM_BATTING_HR    264  0.19    -0.96  1.27
TEAM_BATTING_BB    878 -1.03     2.18  2.57
TEAM_BATTING_SO   1399 -0.30    -0.32  5.33
TEAM_BASERUN_SB    697  1.97     5.49  1.90
TEAM_BASERUN_CS    201  1.98     7.62  0.59
TEAM_BATTING_HBP    66  0.32    -0.11  0.94
TEAM_PITCHING_H  28995 10.33   141.84 29.49
TEAM_PITCHING_HR   343  0.29    -0.60  1.28
TEAM_PITCHING_BB  3645  6.74    96.97  3.49
TEAM_PITCHING_SO 19278 22.17   671.19 11.86
TEAM_FIELDING_E   1833  2.99    10.97  4.77
TEAM_FIELDING_DP   176 -0.39     0.18  0.59
glimpse(df_eval)
Rows: 259
Columns: 16
$ INDEX            <int> 9, 10, 14, 47, 60, 63, 74, 83, 98, 120, 123, 135, 138…
$ TEAM_BATTING_H   <int> 1209, 1221, 1395, 1539, 1445, 1431, 1430, 1385, 1259,…
$ TEAM_BATTING_2B  <int> 170, 151, 183, 309, 203, 236, 219, 158, 177, 212, 243…
$ TEAM_BATTING_3B  <int> 33, 29, 29, 29, 68, 53, 55, 42, 78, 42, 40, 55, 57, 2…
$ TEAM_BATTING_HR  <int> 83, 88, 93, 159, 5, 10, 37, 33, 23, 58, 50, 164, 186,…
$ TEAM_BATTING_BB  <int> 447, 516, 509, 486, 95, 215, 568, 356, 466, 452, 495,…
$ TEAM_BATTING_SO  <int> 1080, 929, 816, 914, 416, 377, 527, 609, 689, 584, 64…
$ TEAM_BASERUN_SB  <int> 62, 54, 59, 148, NA, NA, 365, 185, 150, 52, 64, 48, 3…
$ TEAM_BASERUN_CS  <int> 50, 39, 47, 57, NA, NA, NA, NA, NA, NA, NA, 28, 21, 8…
$ TEAM_BATTING_HBP <int> NA, NA, NA, 42, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ TEAM_PITCHING_H  <int> 1209, 1221, 1395, 1539, 3902, 2793, 1544, 1626, 1342,…
$ TEAM_PITCHING_HR <int> 83, 88, 93, 159, 14, 20, 40, 39, 25, 62, 53, 173, 196…
$ TEAM_PITCHING_BB <int> 447, 516, 509, 486, 257, 420, 613, 418, 497, 482, 521…
$ TEAM_PITCHING_SO <int> 1080, 929, 816, 914, 1123, 736, 569, 715, 734, 622, 6…
$ TEAM_FIELDING_E  <int> 140, 135, 156, 124, 616, 572, 490, 328, 226, 184, 200…
$ TEAM_FIELDING_DP <int> 156, 164, 153, 154, 130, 105, NA, 104, 132, 145, 183,…
?print # years "Total Years:"
print(x = 2006-1871+1,
      digits = 1)
[1] 136
136*162
[1] 22032

My preferred command for summary stats will be stargazer and I will only present that in my final report, as it can be used for presenting regressions results as well.This is much more in assignment style expectations.

  • kable is another package you should explore for creating professional tables.

3 Data Analysis

Make sure you are billeting, using bold and italicizing in your report for better organisation.

3.1 Data Visualization

First, lets try to visualize missing values.

3.1.1 Missing Observations

library(visdat)
?visdat

vis_dat(df_train)

df_no_missing <- df_train
df_no_missing$TEAM_BATTING_HBP  = NULL
df_no_missing$TEAM_BATTING_CS   = NULL 
 vis_miss(df_no_missing)

visdat only works for small data sets. We can write some code to find missing vales for each variable.

?summarise_all
?is.na
summarise_all(.tbl  = df_train,
              .funs = ~  sum(is.na(x = .))) 
  INDEX TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B TEAM_BATTING_3B
1     0           0              0               0               0
  TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO TEAM_BASERUN_SB
1               0               0             102             131
  TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_H TEAM_PITCHING_HR
1             772             2085               0                0
  TEAM_PITCHING_BB TEAM_PITCHING_SO TEAM_FIELDING_E TEAM_FIELDING_DP
1                0              102               0              286
?gather
summarise_all(.tbl  = df_train,
              .funs = ~  sum(is.na(x = .))) |> gather(key   = "Variable",
                                                      value = "MissingValues"
                                                      ) |> filter(MissingValues > 0 )
          Variable MissingValues
1  TEAM_BATTING_SO           102
2  TEAM_BASERUN_SB           131
3  TEAM_BASERUN_CS           772
4 TEAM_BATTING_HBP          2085
5 TEAM_PITCHING_SO           102
6 TEAM_FIELDING_DP           286

3.1.2 Imputing Missing Observations

Plot or summarize the variable before and after the imputation to confirm the distribution has not changed substantially.

hist(df_no_missing$TEAM_BASERUN_CS )

df_no_missing$TEAM_BASERUN_CS <- ifelse(is.na(x = df_no_missing$TEAM_BASERUN_CS), 
                                        yes = mean(x = df_no_missing$TEAM_BASERUN_CS,                                                    na.rm = TRUE),
                                        no = df_no_missing$TEAM_BASERUN_CS)
hist(df_no_missing$TEAM_BASERUN_CS )

# Create histogram in ggplot
ggplot(data = df_no_missing, 
       mapping = aes(x = TEAM_BASERUN_CS)) + geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data = df_no_missing, 
       mapping = aes(x = TEAM_BASERUN_CS)) + geom_histogram(color = "black", 
                                                            fill = "blue", 
                                                            binwidth = 10) +
  labs(title = "Histogram of X", x = "X", y = "Count")

df_no_missing <- na.omit(df_no_missing) # maybe drop a few more variables
  • Imputation blog that suggests simply replacing missing values with mean or median may not be a good idea. While it suggests methods like “knn” or “MissForest” imputation that you have not seen in the class, it might be a good idea to know the weaknesses of your method. No imputation method is perfect, and sometimes you may decide to drop all observations with any missing value, or drop a column with too many misisng values. I am looking for your justification/explanantion more than the method.
  • The imputation strategy for missing data largely depends on the type of missingness, which can be of three types:
    1. Missing completely at random (MCAR): Data is genuinely missing at random and has no relation to any observed or unobserved variables.
    2. Missing at random (MAR): The missingness of one feature can be explained by other observed features in the dataset.
    3. Missing not at random (MNAR): Missingness is either attributed to the missing value itself or the feature(s) that we didn’t collect data for.

Missing at random (MAR) appears relatively much more in practical situations than the other two. That is why imputing missing values with zero/mean isn’t optimal all the time, and we can do better.

3.1.3 Correlation Plot

Please play with the command below and make the correlation plot upper or lower triangular - else the chart below displays too much redundant information. It does not look very professional ( a little cluttered). At top firms, attention to detail is very important and you are required to create very polished charts.

library(ggcorrplot)
?ggcorrplot()
?cor
  

?cor_pmat # cor_pmat(): Compute a correlation matrix p-values.
mycorr<- cor(x = df_no_missing[, 1:ncol(df_no_missing )])

p.mat <- ggcorrplot::cor_pmat(x = df_no_missing[,1:ncol(df_no_missing)])
head(p.mat)
                      INDEX  TARGET_WINS TEAM_BATTING_H TEAM_BATTING_2B
INDEX           0.000000000 3.085723e-01   2.870071e-01    1.012176e-01
TARGET_WINS     0.308572334 0.000000e+00   1.013043e-54    2.490498e-20
TEAM_BATTING_H  0.287007118 1.013043e-54   0.000000e+00   9.355168e-233
TEAM_BATTING_2B 0.101217596 2.490498e-20  9.355168e-233    0.000000e+00
TEAM_BATTING_3B 0.769316672 1.340586e-07   2.896946e-60    2.894398e-02
TEAM_BATTING_HR 0.006420957 1.416443e-21   1.602541e-06    1.786146e-62
                TEAM_BATTING_3B TEAM_BATTING_HR TEAM_BATTING_BB TEAM_BATTING_SO
INDEX              7.693167e-01    6.420957e-03    2.259876e-02    7.106396e-07
TARGET_WINS        1.340586e-07    1.416443e-21    6.788724e-40    1.082744e-02
TEAM_BATTING_H     2.896946e-60    1.602541e-06    1.402162e-05    3.110665e-54
TEAM_BATTING_2B    2.894398e-02    1.786146e-62    2.281359e-23    1.601577e-07
TEAM_BATTING_3B    0.000000e+00   3.146352e-208    5.493361e-25   1.722999e-271
TEAM_BATTING_HR   3.146352e-208    0.000000e+00    2.833152e-68   4.262615e-267
                TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_PITCHING_H
INDEX              1.177395e-04    9.919907e-01    9.517913e-01
TARGET_WINS        1.846788e-07    6.773671e-01    8.836481e-22
TEAM_BATTING_H     1.497875e-02    6.096341e-01   7.485250e-277
TEAM_BATTING_2B    9.931836e-07    9.663607e-06    6.214542e-71
TEAM_BATTING_3B    8.653120e-30    2.117662e-35    5.645504e-73
TEAM_BATTING_HR    1.327937e-41    9.273809e-63    2.113987e-05
                TEAM_PITCHING_HR TEAM_PITCHING_BB TEAM_PITCHING_SO
INDEX               7.419790e-03     1.014059e-02     1.744343e-06
TARGET_WINS         1.685830e-21     2.011167e-32     4.962297e-03
TEAM_BATTING_H      4.769017e-08     3.435879e-08     7.730597e-49
TEAM_BATTING_2B     1.151450e-63     2.211572e-17     5.592219e-07
TEAM_BATTING_3B    2.009525e-190     1.798207e-07    2.195339e-225
TEAM_BATTING_HR     0.000000e+00     2.039060e-29    1.157630e-220
                TEAM_FIELDING_E TEAM_FIELDING_DP
INDEX              1.287249e-02     7.993341e-01
TARGET_WINS        1.454931e-15     1.165960e-01
TEAM_BATTING_H     2.969523e-04     1.173988e-11
TEAM_BATTING_2B    9.295027e-35     1.901141e-14
TEAM_BATTING_3B   2.596789e-244     7.211648e-21
TEAM_BATTING_HR   4.191484e-271     2.530492e-36
  myplot<-ggcorrplot(corr     = mycorr,   # correlation matrix to visualize
                     method   = "square", # character, the visualization method of correlation matrix to be used. Allowed values are "square" (default), "circle"
                     type     = "full",  # character, "full" (default), "lower" or "upper" display
                     title    = "Correlation Plot",  # character, title of the graph
                     colors   = c("red", "white","green"), #    vector of 3 colors for low, mid and high correlation values.
                     lab      = TRUE,   # If TRUE, add corr coeff on the plot.
                     lab_size = 2,      # labels. used when lab = TRUE.
                     p.mat    = p.mat,  # matrix of p-value. If NULL, arguments sig.level, insig, pch, pch.col, pch.cex is invalid.  # Barring the no significant coefficient
                     insig    = "pch",  # character, specialized insignificant correlation coefficients, "pch" (default), "blank". If "blank", wipe away the corresponding glyphs; if "pch", add characters (see pch for details) on corresponding glyphs.
                     pch      = 4, # add character on the glyphs of insignificant correlation coefficients (only valid when insig is "pch"). Default value is 4.
                     hc.order = TRUE, # If TRUE, correlation matrix will be hc.ordered using hclust function.
                     tl.cex   = 8, # the size, the color and the string rotation of text label
                      tl.col   = "black", 
                     digits = 2
                     )

myplot 

3.2 Graphs

Use ggplot2 package to create professional charts. I went to Chat GPT and typed “r - create a histogram of all variables in a dataframe”. You could have done so in Google alternatively.

I got the following code - and tried running it as it is to understand how it works.

library(ggplot2)

# create sample dataframe
df <- data.frame(
  var1 = rnorm(100),
  var2 = rnorm(100),
  var3 = rnorm(100)
)

# melt the dataframe into long format
df_melted <- reshape2::melt(df)
No id variables; using all as measure variables
# create histogram using ggplot2
ggplot(data = df_melted, 
       mapping = aes(value)) + 
  geom_histogram() + 
  facet_wrap(facets = ~ variable,
             scales = "free_x"
             )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

After I know the package exists and code chunk works, I try to adapt it for my own data -

library(ggplot2)

# melt the dataframe into long format
df_melted <- reshape2::melt(df_no_missing)
No id variables; using all as measure variables
# create histogram using ggplot2
ggplot(data = df_melted, 
       mapping = aes(x = value)) + 
  geom_histogram() + 
  facet_wrap(facets =  ~ variable, 
             scales = "free_x")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# now try to format the chart 
ggplot(data = df_melted, 
       mapping = aes(x = value)) + 
  geom_histogram() + 
  facet_wrap( facets = ~ variable,
              scales = "free_x") + theme(
    strip.text = element_text(size = 8),  # Adjust the size of facet titles
    axis.text.x = element_text(angle = 45, hjust = 1)  # Tilt x-axis labels
  )
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Most importantly, try to read the ggplot2 cheat sheet to understand why the command works.

3.2.1 Edward Tufte’s Principles for Effective Data Visualization

  • Follow appropriate and meaningful visualizations (e.g., Tufte’s principals).

Edward Tufte is a pioneer in the field of data visualization, emphasizing clarity, precision, and efficiency in the presentation of information. His principles aim to make data visualizations not only informative but also aesthetically clean and free from unnecessary elements.

For more on Edward Tufte’s visualization principles, check out Tufte’s official website or his classic book The Visual Display of Quantitative Information.

  • A summary of his principles below is worth a quick read.

    1. Maximize Data-Ink Ratio
      • Emphasize the data by reducing non-essential elements. The focus should be on maximizing the “ink” that represents data and removing chartjunk like excessive gridlines or decorations.
    2. Remove Chartjunk
      • Avoid unnecessary 3D effects, patterns, or other decorative elements that do not serve the purpose of communicating information clearly.
    3. Small Multiples
      • Use multiple smaller charts for comparison instead of a single, complex graph. This allows viewers to focus on each individual piece of data.
    4. Multifunctioning Graphical Elements
      • Elements such as labels or legends should perform multiple roles. For example, directly labeling a line within a plot removes the need for a separate legend.
    5. Content-Driven Design
      • The design should be driven by the data, not the aesthetic. Visuals should prioritize clarity and simplicity, guided by the data’s message.

3.3 Summary Statistics

Use stargazer package to create professional summary tables. Can also consider using kabble, kabbleExtra.

Run summary statistics on the final, clean variables that you are keeping in your regression. Play around with labeling variable with covariate.labels option in stargazer().

??kableExtra

library(stargazer)
?stargazer

stargazer(table = df_no_missing[,2:15],
          type = "text",
          title = "Summary Statistics" ,
          covariate.labels = c( "Target Wins", 
                                "Team Batting"
                                ), 
          notes = "n = 2000, except variable ...", 
          omit.summary.stat = c("n")
          )

Summary Statistics
==================================================
Statistic          Mean    St. Dev.  Min     Max  
--------------------------------------------------
Target Wins       80.990    13.158    38     120  
Team Batting     1,457.035 108.082  1,137   1,876 
TEAM_BATTING_2B   247.834   42.952   130     392  
TEAM_BATTING_3B   47.839    21.709    11     138  
TEAM_BATTING_HR   116.478   54.511    4      264  
TEAM_BATTING_BB   530.277   85.501   273     878  
TEAM_BATTING_SO   786.762  216.972   324    1,399 
TEAM_BASERUN_SB   100.801   52.755    18     367  
TEAM_BASERUN_CS   52.933    20.562  11.000 201.000
TEAM_PITCHING_H  1,524.617 174.214  1,137   2,394 
TEAM_PITCHING_HR  120.711   56.412    4      343  
TEAM_PITCHING_BB  553.960   97.483   325    1,090 
TEAM_PITCHING_SO  817.993  222.223   345    1,781 
TEAM_FIELDING_E   159.925   57.789    65     515  
--------------------------------------------------
n = 2000, except variable ...                     

Alternatively, you can use kable and kableExtra package to create a professional summary statistics.

4 Regression

The regression I will run is ( describe it ) -

\(wage_{i}= \beta_{0}+\gamma_1+ \epsilon_i\).

stepAIC may help you with backward/forward selection.

Box-Cox may be interesting to try too. Read up on it if you have time.

Present the models in one single table so that it is easy for the reader to compare coefficients.

# MODEL 1:  keep all variables 

model1 <- lm(data = df_no_missing, 
             TARGET_WINS ~ .)


# MODEL 2:  drop variable randomly ...

model2 <- lm(data = df_no_missing, 
             TARGET_WINS ~ . -TEAM_BATTING_HR - TEAM_BATTING_BB )

# MODEL 3:  Step AIC...Backward selection... drop variable systematically...
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
?stepAIC
model3 <- stepAIC(object = lm(data = df_no_missing, TARGET_WINS ~ .), 
                  direction = c("backward")
                  )
Start:  AIC=8526.62
TARGET_WINS ~ INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
    TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
    TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
- TEAM_PITCHING_HR  1      11.1 187978 8524.7
- INDEX             1      21.5 187989 8524.8
- TEAM_BATTING_SO   1     104.2 188071 8525.6
<none>                          187967 8526.6
- TEAM_BATTING_HR   1     235.9 188203 8526.9
- TEAM_BATTING_H    1     272.3 188239 8527.3
- TEAM_PITCHING_BB  1     353.0 188320 8528.1
- TEAM_PITCHING_SO  1     436.0 188403 8528.9
- TEAM_BASERUN_CS   1     670.4 188637 8531.2
- TEAM_BATTING_BB   1     716.8 188684 8531.6
- TEAM_PITCHING_H   1    1312.2 189279 8537.4
- TEAM_BATTING_2B   1    3237.3 191204 8556.0
- TEAM_FIELDING_DP  1    8408.7 196376 8604.9
- TEAM_BATTING_3B   1    9723.5 197691 8617.2
- TEAM_BASERUN_SB   1   15739.9 203707 8672.2
- TEAM_FIELDING_E   1   29328.6 217296 8790.7

Step:  AIC=8524.73
TARGET_WINS ~ INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
    TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
    TEAM_FIELDING_E + TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
- INDEX             1      22.7 188001 8523.0
<none>                          187978 8524.7
- TEAM_BATTING_SO   1     229.8 188208 8525.0
- TEAM_BATTING_H    1     326.6 188305 8525.9
- TEAM_PITCHING_BB  1     404.6 188383 8526.7
- TEAM_BASERUN_CS   1     685.1 188663 8529.4
- TEAM_BATTING_BB   1     799.6 188778 8530.5
- TEAM_PITCHING_SO  1     822.9 188801 8530.7
- TEAM_PITCHING_H   1    1494.3 189473 8537.3
- TEAM_BATTING_2B   1    3235.3 191214 8554.0
- TEAM_FIELDING_DP  1    8431.8 196410 8603.2
- TEAM_BATTING_3B   1    9719.2 197697 8615.2
- TEAM_BATTING_HR   1   11281.0 199259 8629.7
- TEAM_BASERUN_SB   1   15854.4 203833 8671.3
- TEAM_FIELDING_E   1   29390.8 217369 8789.3

Step:  AIC=8522.95
TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
    TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_BB + TEAM_PITCHING_SO + 
    TEAM_FIELDING_E + TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
<none>                          188001 8523.0
- TEAM_BATTING_SO   1     229.1 188230 8523.2
- TEAM_BATTING_H    1     325.8 188327 8524.1
- TEAM_PITCHING_BB  1     400.4 188401 8524.9
- TEAM_BASERUN_CS   1     677.5 188679 8527.6
- TEAM_BATTING_BB   1     795.1 188796 8528.7
- TEAM_PITCHING_SO  1     825.2 188826 8529.0
- TEAM_PITCHING_H   1    1488.2 189489 8535.4
- TEAM_BATTING_2B   1    3223.9 191225 8552.2
- TEAM_FIELDING_DP  1    8462.1 196463 8601.7
- TEAM_BATTING_3B   1    9711.3 197712 8613.4
- TEAM_BATTING_HR   1   11284.0 199285 8627.9
- TEAM_BASERUN_SB   1   15856.4 203857 8669.5
- TEAM_FIELDING_E   1   29396.6 217398 8787.5
stargazer(model1, model2, model3,
          type = "text"
          )

=================================================================================================
                                                 Dependent variable:                             
                    -----------------------------------------------------------------------------
                                                     TARGET_WINS                                 
                               (1)                       (2)                       (3)           
-------------------------------------------------------------------------------------------------
INDEX                        -0.0002                   -0.0002                                   
                            (0.0003)                  (0.0003)                                   
                                                                                                 
TEAM_BATTING_H               -0.027                     0.003                    -0.028*         
                             (0.017)                   (0.011)                   (0.016)         
                                                                                                 
TEAM_BATTING_2B             -0.050***                 -0.051***                 -0.050***        
                             (0.009)                   (0.009)                   (0.009)         
                                                                                                 
TEAM_BATTING_3B             0.185***                  0.181***                  0.184***         
                             (0.019)                   (0.019)                   (0.019)         
                                                                                                 
TEAM_BATTING_HR               0.123                                             0.097***         
                             (0.082)                                             (0.009)         
                                                                                                 
TEAM_BATTING_BB             0.112***                                            0.115***         
                             (0.042)                                             (0.041)         
                                                                                                 
TEAM_BATTING_SO               0.022                   0.060***                    0.026          
                             (0.022)                   (0.017)                   (0.018)         
                                                                                                 
TEAM_BASERUN_SB             0.077***                  0.078***                  0.077***         
                             (0.006)                   (0.006)                   (0.006)         
                                                                                                 
TEAM_BASERUN_CS             -0.036**                  -0.040***                 -0.036**         
                             (0.014)                   (0.014)                   (0.014)         
                                                                                                 
TEAM_PITCHING_H             0.054***                  0.026***                  0.055***         
                             (0.015)                   (0.009)                   (0.014)         
                                                                                                 
TEAM_PITCHING_HR             -0.026                   0.092***                                   
                             (0.078)                   (0.009)                                   
                                                                                                 
TEAM_PITCHING_BB             -0.075*                  0.031***                  -0.077**         
                             (0.040)                   (0.003)                   (0.039)         
                                                                                                 
TEAM_PITCHING_SO            -0.043**                  -0.079***                 -0.047***        
                             (0.021)                   (0.016)                   (0.017)         
                                                                                                 
TEAM_FIELDING_E             -0.121***                 -0.117***                 -0.120***        
                             (0.007)                   (0.007)                   (0.007)         
                                                                                                 
TEAM_FIELDING_DP            -0.111***                 -0.112***                 -0.111***        
                             (0.012)                   (0.012)                   (0.012)         
                                                                                                 
Constant                    60.797***                 59.964***                 60.856***        
                             (6.066)                   (6.076)                   (6.057)         
                                                                                                 
-------------------------------------------------------------------------------------------------
Observations                  1,835                     1,835                     1,835          
R2                            0.408                     0.404                     0.408          
Adjusted R2                   0.403                     0.400                     0.404          
Residual Std. Error    10.165 (df = 1819)        10.192 (df = 1821)        10.161 (df = 1821)    
F Statistic         83.597*** (df = 15; 1819) 95.054*** (df = 13; 1821) 96.522*** (df = 13; 1821)
=================================================================================================
Note:                                                                 *p<0.1; **p<0.05; ***p<0.01

If you really want to highlight your best model -

library(coefplot)
coefplot(model3, 
         intercept = FALSE
         )

4.1 Multicollinearity

library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:psych':

    logit
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
?vif()

## INCORRECT COEFFICIENTS DUE TO MULTOCOLLINEARITY???
vif_values1 <- vif(model1) # multicollinearity
# vif_values1

as.data.frame(vif(model1))
                 vif(model1)
INDEX               1.048305
TEAM_BATTING_H     56.553286
TEAM_BATTING_2B     2.574846
TEAM_BATTING_3B     3.031626
TEAM_BATTING_HR   351.042560
TEAM_BATTING_BB   234.233457
TEAM_BATTING_SO   405.584117
TEAM_BASERUN_SB     1.903778
TEAM_BASERUN_CS     1.527066
TEAM_PITCHING_H   121.681790
TEAM_PITCHING_HR  343.938411
TEAM_PITCHING_BB  275.276299
TEAM_PITCHING_SO  384.929404
TEAM_FIELDING_E     3.047851
TEAM_FIELDING_DP    1.377116
summary(model1_vif <- lm(data = df_no_missing[,-2], 
             TEAM_BATTING_HR ~ .)) # 351.042560

Call:
lm(formula = TEAM_BATTING_HR ~ ., data = df_no_missing[, -2])

Residuals:
    Min      1Q  Median      3Q     Max 
-42.879  -0.874  -0.196   0.660  29.263 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -2.078e-01  1.743e+00  -0.119  0.90511    
INDEX            -1.621e-04  9.548e-05  -1.698  0.08965 .  
TEAM_BATTING_H   -5.158e-02  4.588e-03 -11.242  < 2e-16 ***
TEAM_BATTING_2B  -1.073e-03  2.548e-03  -0.421  0.67377    
TEAM_BATTING_3B  -8.637e-03  5.466e-03  -1.580  0.11425    
TEAM_BATTING_BB   1.205e-01  1.188e-02  10.143  < 2e-16 ***
TEAM_BATTING_SO   1.608e-01  5.086e-03  31.620  < 2e-16 ***
TEAM_BASERUN_SB   3.237e-03  1.782e-03   1.817  0.06946 .  
TEAM_BASERUN_CS  -1.466e-02  4.084e-03  -3.590  0.00034 ***
TEAM_PITCHING_H   5.085e-02  4.150e-03  12.251  < 2e-16 ***
TEAM_PITCHING_HR  9.503e-01  2.542e-03 373.792  < 2e-16 ***
TEAM_PITCHING_BB -1.157e-01  1.129e-02 -10.256  < 2e-16 ***
TEAM_PITCHING_SO -1.533e-01  4.831e-03 -31.735  < 2e-16 ***
TEAM_FIELDING_E   4.210e-03  2.058e-03   2.046  0.04094 *  
TEAM_FIELDING_DP -3.936e-03  3.526e-03  -1.116  0.26447    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.921 on 1820 degrees of freedom
Multiple R-squared:  0.9972,    Adjusted R-squared:  0.9971 
F-statistic: 4.551e+04 on 14 and 1820 DF,  p-value: < 2.2e-16
                                   #1/(1-summary(model1_vif)$r.squared)
1/(1-0.9972)
[1] 357.1429
1/( 1 - summary(model1_vif)$r.squared )
[1] 351.0426

ANSWER THE FOLLOWING questions atleast -

  • Describe your model evolution. Why did you try a new model? Which model do you finally prefer?

  • What is coefficient interpretation of the model of your choice ??? Do it for a few key variables.

  • How is the model fit ? Residual Analysis ???

4.2 Residual Analysis on Final Model

Try to convince the reader your chosen model does not violate Gauss Markov Assumptions.

?par
par(mfrow = c(2,2)) # 2 x 2 pictures on one plot
plot(model3)

How do you interpret these 4 charts ???

4.3 Predict

5 Session Info

https://www.rostrum.blog/2018/10/13/sessioninfo/

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.6.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] car_3.1-2          carData_3.0-5      coefplot_1.2.8     MASS_7.3-61       
 [5] ggcorrplot_0.1.4.1 psych_2.4.6.26     visdat_0.6.0       kableExtra_1.4.0  
 [9] knitr_1.48         stargazer_5.2.3    lubridate_1.9.3    forcats_1.0.0     
[13] stringr_1.5.1      dplyr_1.1.4        purrr_1.0.2        readr_2.1.5       
[17] tidyr_1.3.1        tibble_3.2.1       ggplot2_3.5.1      tidyverse_2.0.0   

loaded via a namespace (and not attached):
 [1] gtable_0.3.5      xfun_0.46         htmlwidgets_1.6.4 lattice_0.22-6   
 [5] tzdb_0.4.0        vctrs_0.6.5       tools_4.4.1       generics_0.1.3   
 [9] parallel_4.4.1    fansi_1.0.6       pkgconfig_2.0.3   lifecycle_1.0.4  
[13] compiler_4.4.1    farver_2.1.2      munsell_0.5.1     mnormt_2.1.1     
[17] htmltools_0.5.8.1 yaml_2.3.10       pillar_1.9.0      abind_1.4-5      
[21] useful_1.2.6.1    nlme_3.1-165      tidyselect_1.2.1  digest_0.6.36    
[25] stringi_1.8.4     reshape2_1.4.4    labeling_0.4.3    fastmap_1.2.0    
[29] grid_4.4.1        colorspace_2.1-1  cli_3.6.3         magrittr_2.0.3   
[33] utf8_1.2.4        withr_3.0.1       scales_1.3.0      timechange_0.3.0 
[37] rmarkdown_2.27    hms_1.1.3         evaluate_0.24.0   viridisLite_0.4.2
[41] rlang_1.1.4       Rcpp_1.0.13       glue_1.7.0        xml2_1.3.6       
[45] svglite_2.1.3     rstudioapi_0.16.0 jsonlite_1.8.8    R6_2.5.1         
[49] plyr_1.8.9        systemfonts_1.1.0
# devtools::session_info()