Money Ball Project

Author

Caitlin O’Driscoll

Weekend HW 1

Importing the Money Ball Data

remove(list=ls()) # clears the environment

# importing the data
moneyball.training.data <- 
  read.csv("~/Desktop/BCE/R FILES/Group Project/moneyball-training-data.csv")

library(psych)

describe(moneyball.training.data)

                 vars    n    mean      sd median trimmed    mad  min   max
INDEX               1 2276 1268.46  736.35 1270.5 1268.57 952.57    1  2535
TARGET_WINS         2 2276   80.79   15.75   82.0   81.31  14.83    0   146
TEAM_BATTING_H      3 2276 1469.27  144.59 1454.0 1459.04 114.16  891  2554
TEAM_BATTING_2B     4 2276  241.25   46.80  238.0  240.40  47.44   69   458
TEAM_BATTING_3B     5 2276   55.25   27.94   47.0   52.18  23.72    0   223
TEAM_BATTING_HR     6 2276   99.61   60.55  102.0   97.39  78.58    0   264
TEAM_BATTING_BB     7 2276  501.56  122.67  512.0  512.18  94.89    0   878
TEAM_BATTING_SO     8 2174  735.61  248.53  750.0  742.31 284.66    0  1399
TEAM_BASERUN_SB     9 2145  124.76   87.79  101.0  110.81  60.79    0   697
TEAM_BASERUN_CS    10 1504   52.80   22.96   49.0   50.36  17.79    0   201
TEAM_BATTING_HBP   11  191   59.36   12.97   58.0   58.86  11.86   29    95
TEAM_PITCHING_H    12 2276 1779.21 1406.84 1518.0 1555.90 174.95 1137 30132
TEAM_PITCHING_HR   13 2276  105.70   61.30  107.0  103.16  74.13    0   343
TEAM_PITCHING_BB   14 2276  553.01  166.36  536.5  542.62  98.59    0  3645
TEAM_PITCHING_SO   15 2174  817.73  553.09  813.5  796.93 257.23    0 19278
TEAM_FIELDING_E    16 2276  246.48  227.77  159.0  193.44  62.27   65  1898
TEAM_FIELDING_DP   17 1990  146.39   26.23  149.0  147.58  23.72   52   228
                 range  skew kurtosis    se
INDEX             2534  0.00    -1.22 15.43
TARGET_WINS        146 -0.40     1.03  0.33
TEAM_BATTING_H    1663  1.57     7.28  3.03
TEAM_BATTING_2B    389  0.22     0.01  0.98
TEAM_BATTING_3B    223  1.11     1.50  0.59
TEAM_BATTING_HR    264  0.19    -0.96  1.27
TEAM_BATTING_BB    878 -1.03     2.18  2.57
TEAM_BATTING_SO   1399 -0.30    -0.32  5.33
TEAM_BASERUN_SB    697  1.97     5.49  1.90
TEAM_BASERUN_CS    201  1.98     7.62  0.59
TEAM_BATTING_HBP    66  0.32    -0.11  0.94
TEAM_PITCHING_H  28995 10.33   141.84 29.49
TEAM_PITCHING_HR   343  0.29    -0.60  1.28
TEAM_PITCHING_BB  3645  6.74    96.97  3.49
TEAM_PITCHING_SO 19278 22.17   671.19 11.86
TEAM_FIELDING_E   1833  2.99    10.97  4.77
TEAM_FIELDING_DP   176 -0.39     0.18  0.59

Cleaning Data

library(visdat)

vis_dat(moneyball.training.data)

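# work on a copy so the raw import stays untouched
money_clean <- moneyball.training.data

# drop hit-by-pitch: only 191 of the 2,276 rows have a value
money_clean$TEAM_BATTING_HBP <- NULL

# no column has this name (caught stealing is TEAM_BASERUN_CS), so this line
# has no effect and TEAM_BASERUN_CS stays in the data
money_clean$TEAM_BATTING_CS <- NULL

vis_miss(money_clean) # plots the share of missing values in each column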
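
Assigning `NULL` to a column that does not exist is silently ignored in R, so it can help to check the names before dropping anything. The following is a minimal sketch on a made-up data frame; `toy`, its columns, and the misspelled name are hypothetical and are not part of the Moneyball data.

```r
# Hypothetical toy data frame standing in for the real data
toy <- data.frame(wins = c(80, 90, 70),
                  hbp  = c(NA, 50, NA),
                  cs   = c(40, NA, 55))

# Columns we intend to drop; "caught_stealing" is deliberately not a real column
to_drop <- c("hbp", "caught_stealing")

# setdiff() reports any requested names that are not actual columns
not_found <- setdiff(to_drop, names(toy))
print(not_found)

# Keep every column whose name is not in to_drop
toy_clean <- toy[, !(names(toy) %in% to_drop), drop = FALSE]

# Count the remaining missing values per column
print(colSums(is.na(toy_clean)))
```

Checking `not_found` before dropping makes a mistyped column name visible instead of letting it pass unnoticed.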

Imputing Missing Observations

hist(money_clean$TEAM_BASERUN_CS)
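
The histogram is a useful first look before choosing how to impute: the describe output reports a skew of 1.98 for TEAM_BASERUN_CS, and for a right-skewed variable the median is a common choice because it is less sensitive to extreme values than the mean. The sketch below shows median imputation on a made-up vector; `cs` is hypothetical, and this is one possible approach rather than the method used in this project.

```r
# Hypothetical vector with missing values (not the real TEAM_BASERUN_CS column)
cs <- c(49, NA, 52, 47, NA, 60)

# Median of the observed values only
cs_median <- median(cs, na.rm = TRUE)

# Replace each NA with that median; observed values are left untouched
cs_imputed <- ifelse(is.na(cs), cs_median, cs)

print(cs_median)
print(cs_imputed)
```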

Summary Statistics Table

library(stargazer) #loads package


Please cite as:

 Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

 R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

library(ggplot2) #loads package


Attaching package: 'ggplot2'

The following objects are masked from 'package:psych':

    %+%, alpha

stargazer(money_clean, 
          type = "text", # prints the table as plain text in the console
          title = "Summary Statistics", # sets the table title
          digits = 2, # rounds values to two decimal places
          omit.summary.stat = "n", # leaves the N (count) column out of the table
          notes = "n = 2276") # adds a note at the bottom: the data set has 2,276 rows


Summary Statistics
===============================================
Statistic          Mean   St. Dev.  Min   Max  
-----------------------------------------------
INDEX            1,268.46  736.35    1   2,535 
TARGET_WINS       80.79    15.75     0    146  
TEAM_BATTING_H   1,469.27  144.59   891  2,554 
TEAM_BATTING_2B   241.25   46.80    69    458  
TEAM_BATTING_3B   55.25    27.94     0    223  
TEAM_BATTING_HR   99.61    60.55     0    264  
TEAM_BATTING_BB   501.56   122.67    0    878  
TEAM_BATTING_SO   735.61   248.53    0   1,399 
TEAM_BASERUN_SB   124.76   87.79     0    697  
TEAM_BASERUN_CS   52.80    22.96     0    201  
TEAM_PITCHING_H  1,779.21 1,406.84 1,137 30,132
TEAM_PITCHING_HR  105.70   61.30     0    343  
TEAM_PITCHING_BB  553.01   166.36    0   3,645 
TEAM_PITCHING_SO  817.73   553.09    0   19,278
TEAM_FIELDING_E   246.48   227.77   65   1,898 
TEAM_FIELDING_DP  146.39   26.23    52    228  
-----------------------------------------------
n = 2276

GG Plot of Number of Wins and Home Runs

?ggplot

ggplot(data = money_clean, # inputs clean data to plot
       mapping = 
         aes(x = TEAM_BATTING_HR, # home runs on the x axis
             y = TARGET_WINS)) + # wins on the y axis
  geom_point(colour = "blue") + # draws the points in blue
  ggtitle("Correlation of Home Runs and Number of Wins") # adds the title

GG Plot of Target wins and Strikeouts by batters

ggplot(data = money_clean, # inputs clean data
       mapping = 
         aes(x = TEAM_BATTING_SO, # strikeouts on the x axis
             y = TARGET_WINS)) + # wins on the y axis
  geom_point(colour = "red") + # draws the points in red
  ggtitle("Correlation of Strikeouts and Target Wins") # adds the title

Warning: Removed 102 rows containing missing values or values outside the scale range
(`geom_point()`).

Histogram of Home Runs

ggplot(data = money_clean, # inputs clean data
       mapping = aes(x = TEAM_BATTING_HR)) + # home runs by batters on the x axis
  geom_histogram() +
  ggtitle("Histogram of Home Runs")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Graphs

library(ggplot2)

df_melted <- reshape2::melt(money_clean)

No id variables; using all as measure variables

ggplot(df_melted, aes(x = value)) +
  geom_histogram() +
  facet_wrap(~variable, scales = "free_x")

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 1393 rows containing non-finite outside the scale range
(`stat_bin()`).

Moving Data to Excel

library(writexl) # loads package

write_xlsx(money_clean, "GroupProject.xlsx") # exports data to excel

5 Key Takeaways

1. The scatter plot of home runs against wins contains two notable outliers. One team hit a moderate number of home runs yet won very few games; another hit an average number of home runs yet recorded the most wins. So although the plot shows a weak positive correlation, some points fall well outside the trend.
2. The weak positive correlation between home runs and wins means that teams with fewer home runs tend to have fewer wins, and teams with more home runs tend to have more wins.
3. Plotting target wins against strikeouts by batters shows a weak negative correlation, which was the expected outcome. Here too, most points follow a single trend while a few are clear outliers.
4. The same pattern is visible when skimming the data in the pivot table: most teams with more strikeouts have fewer wins. This makes sense because in baseball every strikeout is an out that does not put the ball in play, which eliminates the chance of a hit, a walk, or any other play that could help the team.
5. The histogram of home runs is skewed slightly to the right and bimodal, which shows that the data is imperfect and has outliers. Teams more often hit 150-200 home runs than very few or very many. The slight right skew reflects how difficult it is to hit a home run, so the probability of a batter hitting a very large number is slim.
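
The words "weak positive" and "weak negative" can be backed by a number using `cor()`. The sketch below uses made-up vectors (`hr`, `so`, and `wins` are hypothetical stand-ins, not the project data) and shows how `use = "complete.obs"` handles the missing strikeout values that caused the plotting warning.

```r
# Hypothetical values standing in for home runs, strikeouts and wins
hr   <- c(20, 60, 100, 140, 180, 220)
so   <- c(900, NA, 700, 800, 600, 750)
wins <- c(70, 78, 75, 88, 84, 95)

# Pearson correlation; use = "complete.obs" drops rows where either value is NA
r_hr <- cor(hr, wins, use = "complete.obs")
r_so <- cor(so, wins, use = "complete.obs")

print(round(r_hr, 2))
print(round(r_so, 2))
```

On the real data the same two calls would take `money_clean$TEAM_BATTING_HR`, `money_clean$TEAM_BATTING_SO`, and `money_clean$TARGET_WINS` as inputs.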

Summary

Overall, the data analysis shows that while there are general trends in the relationships between home runs, strikeouts, and wins, there are also significant outliers that highlight the complexity of baseball performance. The weak correlations suggest that multiple factors contribute to a team’s success, beyond just home runs and strikeouts.

Day 7: Backward Selection - Weekend Project 2

1. Store clean data as `clean_df`

clean_df <- money_clean # stores the cleaned data under the name used below

2. Run kitchen sink model

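# small baseline model with two batting predictors
reg1 <- 
lm(data = clean_df, formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B)

# kitchen sink model: every other column (including INDEX) as a predictor;
# lm() drops rows with any missing value, so this model uses fewer rows
kitchen_sink <- 
lm(data = clean_df, formula = TARGET_WINS ~ . )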


library(stargazer)

stargazer(reg1, kitchen_sink,
          type = "text"
          )


=======================================================================
                                    Dependent variable:                
                    ---------------------------------------------------
                                        TARGET_WINS                    
                               (1)                       (2)           
-----------------------------------------------------------------------
INDEX                                                  0.0003          
                                                      (0.0004)         
                                                                       
TEAM_BATTING_H              0.036***                    0.016          
                             (0.003)                   (0.020)         
                                                                       
TEAM_BATTING_2B             0.035***                  -0.070***        
                             (0.008)                   (0.009)         
                                                                       
TEAM_BATTING_3B                                       0.160***         
                                                       (0.022)         
                                                                       
TEAM_BATTING_HR                                         0.075          
                                                       (0.085)         
                                                                       
TEAM_BATTING_BB                                         0.043          
                                                       (0.046)         
                                                                       
TEAM_BATTING_SO                                         0.018          
                                                       (0.023)         
                                                                       
TEAM_BASERUN_SB                                       0.035***         
                                                       (0.009)         
                                                                       
TEAM_BASERUN_CS                                       0.054***         
                                                       (0.018)         
                                                                       
TEAM_PITCHING_H                                         0.019          
                                                       (0.018)         
                                                                       
TEAM_PITCHING_HR                                        0.022          
                                                       (0.082)         
                                                                       
TEAM_PITCHING_BB                                       -0.003          
                                                       (0.045)         
                                                                       
TEAM_PITCHING_SO                                       -0.038*         
                                                       (0.022)         
                                                                       
TEAM_FIELDING_E                                       -0.156***        
                                                       (0.010)         
                                                                       
TEAM_FIELDING_DP                                      -0.113***        
                                                       (0.013)         
                                                                       
Constant                    19.477***                 57.843***        
                             (3.102)                   (6.645)         
                                                                       
-----------------------------------------------------------------------
Observations                  2,276                     1,486          
R2                            0.158                     0.439          
Adjusted R2                   0.158                     0.433          
Residual Std. Error    14.457 (df = 2273)         9.557 (df = 1470)    
F Statistic         213.858*** (df = 2; 2273) 76.630*** (df = 15; 1470)
=======================================================================
Note:                                       *p<0.1; **p<0.05; ***p<0.01

library(MASS)

best_model <-
stepAIC(object = kitchen_sink,
        direction = "backward"
        )

Start:  AIC=6724.68
TARGET_WINS ~ INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
    TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + 
    TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
- TEAM_PITCHING_BB  1       0.5 134278 6722.7
- TEAM_PITCHING_HR  1       6.6 134285 6722.8
- INDEX             1      45.3 134323 6723.2
- TEAM_BATTING_SO   1      52.0 134330 6723.3
- TEAM_BATTING_H    1      59.8 134338 6723.3
- TEAM_BATTING_HR   1      70.3 134348 6723.5
- TEAM_BATTING_BB   1      78.1 134356 6723.5
- TEAM_PITCHING_H   1      93.3 134371 6723.7
<none>                          134278 6724.7
- TEAM_PITCHING_SO  1     260.2 134538 6725.6
- TEAM_BASERUN_CS   1     780.9 135059 6731.3
- TEAM_BASERUN_SB   1    1430.9 135709 6738.4
- TEAM_BATTING_3B   1    4664.4 138942 6773.4
- TEAM_BATTING_2B   1    5147.7 139426 6778.6
- TEAM_FIELDING_DP  1    6780.8 141059 6795.9
- TEAM_FIELDING_E   1   22451.0 156729 6952.4

Step:  AIC=6722.69
TARGET_WINS ~ INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
    TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_SO + 
    TEAM_FIELDING_E + TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
- TEAM_PITCHING_HR  1       6.2 134285 6720.8
- INDEX             1      45.6 134324 6721.2
- TEAM_BATTING_SO   1      52.7 134331 6721.3
- TEAM_BATTING_HR   1      78.9 134357 6721.6
- TEAM_BATTING_H    1     149.9 134428 6722.3
<none>                          134278 6722.7
- TEAM_PITCHING_H   1     193.8 134472 6722.8
- TEAM_PITCHING_SO  1     262.0 134540 6723.6
- TEAM_BASERUN_CS   1     780.8 135059 6729.3
- TEAM_BASERUN_SB   1    1435.3 135714 6736.5
- TEAM_BATTING_3B   1    4668.0 138946 6771.5
- TEAM_BATTING_2B   1    5156.1 139434 6776.7
- TEAM_FIELDING_DP  1    6782.9 141061 6793.9
- TEAM_BATTING_BB   1   12608.7 146887 6854.1
- TEAM_FIELDING_E   1   22517.9 156796 6951.1

Step:  AIC=6720.76
TARGET_WINS ~ INDEX + TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
    TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
    TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
- INDEX             1      45.8 134330 6719.3
- TEAM_BATTING_SO   1      47.9 134332 6719.3
- TEAM_BATTING_H    1     147.4 134432 6720.4
<none>                          134285 6720.8
- TEAM_PITCHING_H   1     198.2 134483 6720.9
- TEAM_PITCHING_SO  1     293.6 134578 6722.0
- TEAM_BASERUN_CS   1     777.0 135062 6727.3
- TEAM_BASERUN_SB   1    1440.7 135725 6734.6
- TEAM_BATTING_3B   1    4669.4 138954 6769.5
- TEAM_BATTING_2B   1    5178.5 139463 6775.0
- TEAM_FIELDING_DP  1    6783.0 141067 6792.0
- TEAM_BATTING_HR   1    9801.3 144086 6823.4
- TEAM_BATTING_BB   1   12647.1 146932 6852.5
- TEAM_FIELDING_E   1   22551.2 156836 6949.4

Step:  AIC=6719.26
TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB + 
    TEAM_BASERUN_CS + TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + 
    TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
- TEAM_BATTING_SO   1      51.2 134382 6717.8
- TEAM_BATTING_H    1     144.7 134475 6718.9
<none>                          134330 6719.3
- TEAM_PITCHING_H   1     202.0 134532 6719.5
- TEAM_PITCHING_SO  1     298.0 134628 6720.6
- TEAM_BASERUN_CS   1     742.6 135073 6725.5
- TEAM_BASERUN_SB   1    1570.4 135901 6734.5
- TEAM_BATTING_3B   1    4842.6 139173 6769.9
- TEAM_BATTING_2B   1    5198.7 139529 6773.7
- TEAM_FIELDING_DP  1    6744.4 141075 6790.1
- TEAM_BATTING_HR   1    9780.8 144111 6821.7
- TEAM_BATTING_BB   1   12606.9 146937 6850.6
- TEAM_FIELDING_E   1   22525.1 156855 6947.6

Step:  AIC=6717.83
TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
    TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP

                   Df Sum of Sq    RSS    AIC
<none>                          134382 6717.8
- TEAM_BASERUN_CS   1     737.6 135119 6724.0
- TEAM_PITCHING_H   1    1355.1 135737 6730.7
- TEAM_BASERUN_SB   1    1575.6 135957 6733.2
- TEAM_BATTING_H    1    1740.1 136122 6734.9
- TEAM_BATTING_3B   1    4849.8 139231 6768.5
- TEAM_BATTING_2B   1    5148.1 139530 6771.7
- TEAM_FIELDING_DP  1    6779.2 141161 6789.0
- TEAM_PITCHING_SO  1    7395.1 141777 6795.4
- TEAM_BATTING_HR   1    9785.1 144167 6820.3
- TEAM_BATTING_BB   1   12619.7 147001 6849.2
- TEAM_FIELDING_E   1   22552.0 156934 6946.4
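
The AIC values in the step output can be reproduced by hand. For a linear model, the value reported is n * log(RSS / n) + 2 * p, where n is the number of observations, RSS is the residual sum of squares, and p is the number of estimated coefficients. With the starting model above (RSS of 134278, 1,486 observations, 16 coefficients) this gives roughly 6724.7, in line with the Start line. The sketch below checks the formula against `extractAIC()` on the built-in `mtcars` data, which is used only so the example runs anywhere.

```r
# Built-in mtcars data, used only so the example is self-contained
fit <- lm(mpg ~ wt + hp, data = mtcars)

n   <- nrow(mtcars)           # number of observations used in the fit
p   <- length(coef(fit))      # estimated coefficients, including the intercept
rss <- sum(residuals(fit)^2)  # residual sum of squares

# AIC on the scale used in the step output for linear models
aic_manual <- n * log(rss / n) + 2 * p

# extractAIC() returns c(equivalent df, AIC); compare the second element
aic_step <- extractAIC(fit)[2]

print(c(manual = aic_manual, extractAIC = aic_step))
```

Because the penalty is 2 per coefficient, dropping a variable lowers the AIC only when the RSS rises by a small enough amount, which is why the selection stops once every remaining removal would increase it.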

stargazer(reg1, kitchen_sink, best_model, type="text" )


==================================================================================================
                                                 Dependent variable:                              
                    ------------------------------------------------------------------------------
                                                     TARGET_WINS                                  
                               (1)                       (2)                       (3)            
--------------------------------------------------------------------------------------------------
INDEX                                                  0.0003                                     
                                                      (0.0004)                                    
                                                                                                  
TEAM_BATTING_H              0.036***                    0.016                    0.026***         
                             (0.003)                   (0.020)                   (0.006)          
                                                                                                  
TEAM_BATTING_2B             0.035***                  -0.070***                 -0.070***         
                             (0.008)                   (0.009)                   (0.009)          
                                                                                                  
TEAM_BATTING_3B                                       0.160***                   0.162***         
                                                       (0.022)                   (0.022)          
                                                                                                  
TEAM_BATTING_HR                                         0.075                    0.098***         
                                                       (0.085)                   (0.009)          
                                                                                                  
TEAM_BATTING_BB                                         0.043                    0.039***         
                                                       (0.046)                   (0.003)          
                                                                                                  
TEAM_BATTING_SO                                         0.018                                     
                                                       (0.023)                                    
                                                                                                  
TEAM_BASERUN_SB                                       0.035***                   0.036***         
                                                       (0.009)                   (0.009)          
                                                                                                  
TEAM_BASERUN_CS                                       0.054***                   0.052***         
                                                       (0.018)                   (0.018)          
                                                                                                  
TEAM_PITCHING_H                                         0.019                    0.009***         
                                                       (0.018)                   (0.002)          
                                                                                                  
TEAM_PITCHING_HR                                        0.022                                     
                                                       (0.082)                                    
                                                                                                  
TEAM_PITCHING_BB                                       -0.003                                     
                                                       (0.045)                                    
                                                                                                  
TEAM_PITCHING_SO                                       -0.038*                  -0.021***         
                                                       (0.022)                   (0.002)          
                                                                                                  
TEAM_FIELDING_E                                       -0.156***                 -0.156***         
                                                       (0.010)                   (0.010)          
                                                                                                  
TEAM_FIELDING_DP                                      -0.113***                 -0.113***         
                                                       (0.013)                   (0.013)          
                                                                                                  
Constant                    19.477***                 57.843***                 58.446***         
                             (3.102)                   (6.645)                   (6.589)          
                                                                                                  
--------------------------------------------------------------------------------------------------
Observations                  2,276                     1,486                     1,486           
R2                            0.158                     0.439                     0.438           
Adjusted R2                   0.158                     0.433                     0.434           
Residual Std. Error    14.457 (df = 2273)         9.557 (df = 1470)         9.548 (df = 1474)     
F Statistic         213.858*** (df = 2; 2273) 76.630*** (df = 15; 1470) 104.596*** (df = 11; 1474)
==================================================================================================
Note:                                                                  *p<0.1; **p<0.05; ***p<0.01

Which variables should you keep in your final regression?

The final regression should keep the variables retained by the backward selection:

TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B + 
    TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BASERUN_SB + TEAM_BASERUN_CS + 
    TEAM_PITCHING_H + TEAM_PITCHING_SO + TEAM_FIELDING_E + TEAM_FIELDING_DP

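These are the variables in the model with the lowest AIC (6717.83); removing any one of them would raise the AIC. They do not all have a positive effect on wins, however. In model (3), TEAM_BATTING_2B, TEAM_PITCHING_SO, TEAM_FIELDING_E, and TEAM_FIELDING_DP have negative coefficients, while the remaining variables have positive ones.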

Day 7 In class

remove(list = ls())

train <- read.csv("~/Downloads/train (1).csv")

colSums(is.na(train))

PassengerId    Survived      Pclass        Name         Sex         Age 
          0           0           0           0           0         177 
      SibSp       Parch      Ticket        Fare       Cabin    Embarked 
          0           0           0           0           0           0

train_clean <- na.omit(train)

**1. Run some preliminary correlations of Survived with some other variables.**

A correlation matrix expresses each pairwise linear relationship as a value in the interval [-1, 1]. Plotting the matrix as a correlation plot shows that passenger class has the strongest negative linear correlation with Fare. Passenger class also has a negative linear correlation with Survived. The number of parents or children aboard (Parch) and the number of siblings or spouses aboard (SibSp) have a positive linear correlation with each other. The other variables in the data show only weak correlations.

The scatter plot shows more survivors among passengers in the higher class, who also paid the higher fares. The plot also shows that the passengers who paid over 300 for their tickets survived, and that no one in second or third class paid over 100.

?cor
# install.packages("tidyverse")
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::%+%()   masks psych::%+%()
✖ ggplot2::alpha() masks psych::alpha()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()
✖ dplyr::select()  masks MASS::select()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

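# keep only the numeric columns
train_numeric <- select(.data = train, -Name, -Sex, -Ticket, -Cabin, -Embarked)

# Age has 177 missing values; use = "complete.obs" computes the correlations
# on the rows with no missing values instead of returning NA for Age
train_corr_matrix <- cor(train_numeric, use = "complete.obs")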
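
Missing values matter for `cor()`: by default, a column with any `NA` produces `NA` for its correlations with the other columns. The sketch below illustrates this on a made-up data frame (`toy` and its values are hypothetical, not the Titanic data) and compares the default with `use = "complete.obs"`.

```r
# Hypothetical numeric columns; age has one missing value
toy <- data.frame(age    = c(22, 38, NA, 35, 54),
                  fare   = c(7.25, 71.28, 7.92, 53.10, 51.86),
                  pclass = c(3, 1, 3, 1, 1))

# Default behaviour: age's correlations with the other columns are NA
m_default <- cor(toy)

# Listwise deletion: rows with any NA are dropped before computing
m_complete <- cor(toy, use = "complete.obs")

print(round(m_default, 2))
print(round(m_complete, 2))
```

An alternative is to compute the matrix on a data frame that has already had incomplete rows removed, such as one produced by `na.omit()`.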

# install.packages("corrplot")
library(corrplot)

corrplot 0.92 loaded

# install.packages("ggcorrplot")
library(ggcorrplot)

?ggcorrplot

ggcorrplot(corr = train_corr_matrix, 
           method = "square", 
           type = "lower",
           colors = c("deeppink", "white", "orange"), # low, middle, high
           lab = TRUE) # prints the correlation value in each square

# install.packages("RColorBrewer")
library(RColorBrewer)

?ggplot

ggplot(train_clean, aes(x = Fare, y = Survived, color = Pclass)) +
  geom_jitter(width = 2)

**2. Conduct descriptive statistics of the data set. Anything interesting you find?**

The mean passenger class is 2.24, so the average passenger sits between second and third class. From this mean we can guess that third class is the largest group, which the histogram confirms. That fits with the average fare of about 35 dollars, even though some passengers paid as much as 512.33. The average age on board was 29.7, so most passengers were fairly young, although the oldest was 80. The age histogram shows this as well: it is skewed right, reflecting the small number of older passengers.

library(stargazer) #loads package

library(ggplot2) #loads package


stargazer(train_clean, 
          type = "text", # prints the table as plain text in the console
          title = "Summary Statistics", # sets the table title
          digits = 2, # rounds values to two decimal places
          omit.summary.stat = "n", # leaves the N (count) column out of the table
          notes = "missing values = 177") # adds a note: 177 rows had a missing Age


Summary Statistics
=======================================
Statistic    Mean  St. Dev. Min   Max  
---------------------------------------
PassengerId 448.58  259.12   1    891  
Survived     0.41    0.49    0     1   
Pclass       2.24    0.84    1     3   
Age         29.70   14.53   0.42 80.00 
SibSp        0.51    0.93    0     5   
Parch        0.43    0.85    0     6   
Fare        34.69   52.92   0.00 512.33
---------------------------------------
missing values = 177

?hist

hist(train_clean$Pclass)

hist(train_clean$Age)

**3. Use set.seed(100) command, and create a subset of train dataset that has only 500 observations.**

library(dplyr)

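# sets the seed so the random sample below is reproducible
# (a leading ? would only open the help page and leave the seed unset)
set.seed(100)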

train_subset <- sample_n(tbl = train, 
                         size = 500)

test_subset <- dplyr::anti_join(x = train,
                                y = train_subset, 
                                by = "PassengerId")
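
The seed only makes the split reproducible if `set.seed()` is actually executed immediately before the sampling call. The sketch below demonstrates this with base R's `sample()` on a made-up ID vector; `ids` is a hypothetical stand-in for PassengerId, and `setdiff()` plays the role that `anti_join()` plays above.

```r
# Hypothetical IDs standing in for PassengerId
ids <- 1:891

set.seed(100)                    # fix the random number generator state
first_draw <- sample(ids, 500)

set.seed(100)                    # same seed, so the same 500 IDs are drawn
second_draw <- sample(ids, 500)

third_draw <- sample(ids, 500)   # no reseed, so this draw differs

# IDs that were not sampled form the hold-out set
holdout <- setdiff(ids, first_draw)

print(identical(first_draw, second_draw))
print(length(holdout))
```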

**4. Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your `n=500` sample.**

library(ggplot2)

?lm()

summary(lm( formula = Survived ~ as.factor(Pclass) + Sex + Age +I(Age^2), train_subset))


Call:
lm(formula = Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2), 
    data = train_subset)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.06480 -0.23932 -0.08141  0.22975  0.97841 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         1.073e+00  8.981e-02  11.953  < 2e-16 ***
as.factor(Pclass)2 -1.986e-01  5.673e-02  -3.501 0.000517 ***
as.factor(Pclass)3 -3.731e-01  5.155e-02  -7.238 2.45e-12 ***
Sexmale            -4.892e-01  4.201e-02 -11.644  < 2e-16 ***
Age                -4.340e-03  4.906e-03  -0.885 0.376910    
I(Age^2)            2.845e-06  7.126e-05   0.040 0.968169    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.386 on 389 degrees of freedom
  (105 observations deleted due to missingness)
Multiple R-squared:  0.3872,    Adjusted R-squared:  0.3793 
F-statistic: 49.16 on 5 and 389 DF,  p-value: < 2.2e-16

model1 <- lm(formula = Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2),
             data = train_subset)

**5. Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficients from your linear model to forecast/predict/estimate the survival variable given independent variable values/data.**

test_subset$prediction <- predict(object = model1, newdata = test_subset)

summary(test_subset$prediction)

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
-0.08071  0.12759  0.36627  0.41134  0.65515  1.00475       72

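# classify a passenger as a survivor when the predicted value exceeds 0.5
test_subset$predicted_Survived <- ifelse(test = test_subset$prediction > .5, yes = 1, no = 0)

# confusion table: rows are actual outcomes, columns are predicted outcomes
table(test_subset$Survived, test_subset$predicted_Survived)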
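
A confusion table can be summarised as an accuracy figure. Note that `table()` leaves out `NA` values by default, so the 72 test rows without a prediction would not appear in it. The sketch below uses made-up vectors (`actual` and `predicted_prob` are hypothetical, not the model's output) to show how to keep those rows visible and how to compute accuracy over the rows that did receive a prediction.

```r
# Hypothetical actual outcomes and predicted values (NA = no prediction)
actual         <- c(1, 0, 0, 1, 1, 0, 1, 0)
predicted_prob <- c(0.81, 0.12, 0.55, 0.40, NA, 0.30, 0.95, NA)

# Same 0.5 cut-off as above; NA stays NA
predicted <- ifelse(predicted_prob > 0.5, 1, 0)

# useNA = "ifany" keeps rows with no prediction visible in the table
conf_mat <- table(actual = actual, predicted = predicted, useNA = "ifany")
print(conf_mat)

# Accuracy over the rows that received a prediction
scored   <- !is.na(predicted)
accuracy <- mean(actual[scored] == predicted[scored])
print(accuracy)
```

Reporting how many rows were excluded alongside the accuracy keeps the figure honest, since it describes only the passengers with a recorded Age.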