# copies the training data and drops two columns before checking for missing values
money_clean <- moneyball.training.data
money_clean$TEAM_BATTING_HBP <- NULL
money_clean$TEAM_BATTING_CS <- NULL
vis_miss(money_clean) # visualizes the missing values in each column
Imputing Missing Observations
hist(money_clean$TEAM_BASERUN_CS)
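The report does not show which imputation method was applied after inspecting this histogram. As a minimal sketch of one common option, median imputation, the base R code below fills the gaps in a single column. The data frame `toy` and its column names are hypothetical stand-ins with made-up values, not the moneyball data.

```r
# Toy stand-in for a column with missing values (hypothetical numbers,
# not taken from the moneyball data).
toy <- data.frame(caught_stealing = c(40, 55, NA, 62, 48, NA, 120))

# Median imputation: replace each NA with the median of the observed values.
# The median is less sensitive to extreme values than the mean, which matters
# when the histogram is skewed.
observed_median <- median(toy$caught_stealing, na.rm = TRUE)
toy$caught_stealing_imputed <- ifelse(is.na(toy$caught_stealing),
                                      observed_median,
                                      toy$caught_stealing)

print(toy)
```

Keeping the imputed values in a separate column, as here, makes it easy to compare the histogram before and after imputation.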
Summary Statistics Table
library(stargazer) #loads package
Please cite as:
Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.
R package version 5.2.3. https://CRAN.R-project.org/package=stargazer
library(ggplot2) #loads package
Attaching package: 'ggplot2'
The following objects are masked from 'package:psych':
%+%, alpha
stargazer(money_clean,
          type = "text", # prints the table as plain text
          title = "Summary Statistics", # sets the table title
          digits = 2, # rounds values to two decimal places
          omit.summary.stat = "n", # leaves the observation count (N) column out of the table
          notes = "n = 2276") # adds a note at the bottom of the table giving the number of observations
?ggplot
ggplot(data = money_clean, # inputs clean data to plot
       mapping = aes(x = TEAM_BATTING_HR, # assigns home runs to the x axis
                     y = TARGET_WINS)) + # assigns wins to the y axis
  geom_point() +
  ggtitle("Correlation of Home Runs and Number of Wins") +
  geom_point(colour = "Blue")
# adds points, title, and colors the points blue
GG Plot of Target wins and Strikeouts by batters
ggplot(data = money_clean, # inputs clean data
       mapping = aes(x = TEAM_BATTING_SO, # assigns strikeouts to the x axis
                     y = TARGET_WINS)) + # assigns wins to the y axis
  geom_point() +
  ggtitle("Correlation of Strikeouts and Target Wins") +
  geom_point(colour = "Red")
Warning: Removed 102 rows containing missing values or values outside the scale range
(`geom_point()`).
Removed 102 rows containing missing values or values outside the scale range
(`geom_point()`).
# adds points, title, and colors the points red
Histogram of Home Runs
ggplot(data = money_clean, # inputs clean data
       mapping = aes(x = TEAM_BATTING_HR)) + # maps home runs by batters to the x axis
  geom_histogram() +
  ggtitle("Histogram of Home Runs")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1393 rows containing non-finite outside the scale range
(`stat_bin()`).
Moving Data to Excel
library(writexl) # loads package
write_xlsx(money_clean, "GroupProject.xlsx") # exports data to Excel
5 Key Takeaways
The first major outlier in our ggplot of Home Runs and Number of Wins is a team that hit a decent number of home runs yet won few games. Another key outlier is a team with an average number of home runs that nevertheless had the greatest number of wins. Although the data show a low positive correlation, some data points fall well outside that trend.
The low positive correlation in the Home Runs and Number of Wins plot shows that teams with fewer home runs tend to have fewer wins, and teams with more home runs tend to have more wins.
Similarly, plotting target wins against strikeouts by batters shows a low negative correlation, which was the expected outcome. Again, although the data mostly follow one trend, some data points are clear outliers.
This is also seen in the pivot table as you skim through the data. Most of the points follow the trend that more strikeouts go with fewer wins. Logically this makes sense because in baseball every strikeout is an out that doesn't put the ball in play. This eliminates the chance for a hit, a walk, or any other play that could help the team.
The histogram of Home Runs is skewed to the right and bimodal, which shows that the data are imperfect and contain outliers. Teams more often hit 150-200 home runs than very few or very many. The slight right skew reflects the reality of the game: it is difficult to hit a home run, so the probability of a batter hitting many is slim.
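The "low positive" and "low negative" descriptions above can be backed with a number by computing Pearson's correlation coefficient. The sketch below uses a hypothetical toy data frame (the column names `home_runs` and `wins` and all values are made up for illustration); `use = "complete.obs"` drops rows with missing values, much as `geom_point()` did when it removed rows from the plots.

```r
# Toy stand-in (hypothetical values), with one missing home-run count.
toy <- data.frame(
  home_runs = c(50, 80, 110, NA, 150, 190),
  wins      = c(70, 78, 75, 81, 90, 84)
)

# Pearson correlation, computed only on rows where both values are present.
r <- cor(toy$home_runs, toy$wins, use = "complete.obs")

# Values near 0 indicate a weak linear relationship; the sign gives the direction.
print(round(r, 2))
```

On the real data the same call would take the form `cor(money_clean$TEAM_BATTING_HR, money_clean$TARGET_WINS, use = "complete.obs")`, assuming those columns are numeric.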
Summary
Overall, the data analysis shows that while there are general trends in the relationships between home runs, strikeouts, and wins, there are also significant outliers that highlight the complexity of baseball performance. The weak correlations suggest that multiple factors contribute to a team’s success, beyond just home runs and strikeouts.
PassengerId    Survived      Pclass        Name         Sex         Age
          0           0           0           0           0         177
      SibSp       Parch      Ticket        Fare       Cabin    Embarked
          0           0           0           0           0           0
train_clean <- na.omit(train)
1. Run some preliminary correlations of Survived with some other variables.
Every entry of a correlation matrix lies in the interval [-1, 1]. Plotting the matrix as a correlation plot shows that Passenger class has the strongest negative linear correlation, which is with Fare. Passenger class also shows a second negative linear correlation. The number of parents or children and the number of siblings and spouses on board are positively correlated with each other. The other variables in the data are only weakly correlated.
The scatter plot shows that, by fare, more passengers in the higher classes survived. It also shows that the passengers who spent over 300 on their tickets survived, and that no one in second or third class spent over 100 on a ticket.
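The correlation matrix behind such a plot can be produced with base R's `cor()`. This is a minimal sketch on a hypothetical toy data frame (all values are made up; only the column names mirror the Titanic data), not the code used for the report's correlation plot.

```r
# Toy stand-in with numeric columns only (hypothetical values).
toy <- data.frame(
  Survived = c(1, 0, 1, 0, 0, 1, 0, 1),
  Pclass   = c(1, 3, 1, 3, 2, 1, 3, 2),
  Fare     = c(80, 8, 120, 7, 15, 60, 9, 30),
  SibSp    = c(1, 0, 0, 2, 0, 1, 0, 1)
)

# Pairwise Pearson correlations; every entry lies in [-1, 1]
# and the diagonal is 1 (each variable with itself).
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))

# Preliminary correlations of Survived with each of the other variables.
survived_cor <- cor_matrix["Survived", colnames(cor_matrix) != "Survived"]
print(round(survived_cor, 2))
```

`cor()` accepts only numeric columns, so text columns such as Name, Sex, or Ticket would have to be dropped or recoded before calling it on the real data.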
# install.packages("RColorBrewer")
library(RColorBrewer)
?ggplot
ggplot(train_clean, aes(x = Fare, y = Survived, color = Pclass)) +
  geom_jitter(width = 2)
2. Conduct descriptive statistics of the data set. Anything interesting you find?
The mean passenger class (2.24) lies between second and third class, which suggests that the largest group of passengers was in third class; the histogram confirms this. It also fits the fares: the average fare is about 35 dollars, although some passengers spent as much as 512.33. The average passenger age (29.70) was on the lower end, showing that most passengers were younger, although the oldest was 80 years old. The histogram of Age shows the same thing: the data are skewed right, reflecting the low number of older passengers.
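A mean alone cannot show which class is the largest or whether a variable is skewed, so two quick checks help: tabulating the categories, and comparing the mean with the median. The sketch below uses hypothetical toy vectors (`toy_pclass` and `toy_fare` are made-up values, not the Titanic data).

```r
# Hypothetical class labels and fares, for illustration only.
toy_pclass <- c(1, 1, 2, 2, 3, 3, 3, 3, 3, 3)
toy_fare   <- c(7, 8, 8, 10, 13, 15, 26, 30, 80, 512)

# Count and share of passengers per class.
class_counts <- table(toy_pclass)
class_share  <- prop.table(class_counts)
print(class_counts)
print(round(class_share, 2))

# A mean well above the median is a sign of right skew:
# a few very large fares pull the mean upward.
fare_mean   <- mean(toy_fare)
fare_median <- median(toy_fare)
print(c(mean = fare_mean, median = fare_median))
```

On the real data, `table(train_clean$Pclass)` and `median(train_clean$Fare)` would be the corresponding calls.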
library(stargazer) # loads package
library(ggplot2) # loads package
stargazer(train_clean,
          type = "text", # prints the table as plain text
          title = "Summary Statistics", # sets the table title
          digits = 2, # rounds values to two decimal places
          omit.summary.stat = "n", # leaves the observation count (N) column out of the table
          notes = "missing values = 177") # adds a note at the bottom of the table recording the 177 missing values
Summary Statistics
=======================================
Statistic    Mean  St. Dev. Min   Max  
---------------------------------------
PassengerId 448.58  259.12   1    891  
Survived     0.41    0.49    0     1   
Pclass       2.24    0.84    1     3   
Age         29.70   14.53   0.42 80.00 
SibSp        0.51    0.93    0     5   
Parch        0.43    0.85    0     6   
Fare        34.69   52.92   0.00 512.33
---------------------------------------
missing values = 177
?hist
hist(train_clean$Pclass)
hist(train_clean$Age)
3. Use set.seed(100) command, and create a subset of train dataset that has only 500 observations.
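The report does not show the code for this step. A minimal sketch of one way to do it is below, using `sample()` to draw 500 row indices without replacement. The data frame `train_stub` is a hypothetical stand-in with placeholder values, since the real `train` data is not reproduced here.

```r
# Hypothetical stand-in for the `train` data frame (placeholder values).
train_stub <- data.frame(
  PassengerId = 1:891,
  Survived    = rep(c(0, 1, 0), length.out = 891)
)

# Fix the random number generator so the same rows are drawn on every run.
set.seed(100)

# Draw 500 distinct row numbers, then keep only those rows.
sampled_rows      <- sample(nrow(train_stub), size = 500)
train_subset_stub <- train_stub[sampled_rows, ]

print(nrow(train_subset_stub))
```

The regression output in the next step reports 105 observations deleted due to missingness, which suggests the actual subset was drawn from data that still contained missing values rather than from `train_clean`; that reading is an inference from the output, not something the report states.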
4. Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your n=500 sample.
library(ggplot2)
?lm
summary(lm(formula = Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2), data = train_subset))
Call:
lm(formula = Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2),
data = train_subset)
Residuals:
Min 1Q Median 3Q Max
-1.06480 -0.23932 -0.08141 0.22975 0.97841
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.073e+00 8.981e-02 11.953 < 2e-16 ***
as.factor(Pclass)2 -1.986e-01 5.673e-02 -3.501 0.000517 ***
as.factor(Pclass)3 -3.731e-01 5.155e-02 -7.238 2.45e-12 ***
Sexmale -4.892e-01 4.201e-02 -11.644 < 2e-16 ***
Age -4.340e-03 4.906e-03 -0.885 0.376910
I(Age^2) 2.845e-06 7.126e-05 0.040 0.968169
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.386 on 389 degrees of freedom
(105 observations deleted due to missingness)
Multiple R-squared: 0.3872, Adjusted R-squared: 0.3793
F-statistic: 49.16 on 5 and 389 DF, p-value: < 2.2e-16
model1 <- lm(formula = Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2), data = train_subset)
5. Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficients from your linear model to forecast/predict/estimate the survival variable given independent variable values/data.
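A minimal sketch of this step follows. Because the real `train_subset` is not reproduced here, the code simulates a hypothetical toy data set (`toy`, with made-up survival probabilities), fits the same model formula, and converts the fitted values into a 0/1 prediction. The 0.5 cutoff is an assumption; the assignment text does not specify one.

```r
set.seed(100)
n <- 200

# Hypothetical toy data; only the column names mirror the Titanic data.
toy <- data.frame(
  Pclass = sample(1:3, n, replace = TRUE),
  Sex    = sample(c("female", "male"), n, replace = TRUE),
  Age    = round(runif(n, 1, 70))
)

# Made-up survival probabilities, used only to generate toy outcomes.
p <- 0.9 - 0.15 * (toy$Pclass - 1) - 0.45 * (toy$Sex == "male")
toy$Survived <- rbinom(n, 1, p)

# Same formula as the model in the report, fitted to the toy data.
toy_model <- lm(Survived ~ as.factor(Pclass) + Sex + Age + I(Age^2), data = toy)

# predict() returns fitted values on a continuous scale; with OLS on a 0/1
# outcome they can fall below 0 or above 1.
toy$fitted_value <- predict(toy_model, newdata = toy)

# Turn the fitted values into a binary prediction (assumed cutoff: 0.5).
toy$predicted_survived <- ifelse(toy$fitted_value > 0.5, 1, 0)

# Compare predictions with the actual outcomes.
print(table(actual = toy$Survived, predicted = toy$predicted_survived))
print(mean(toy$predicted_survived == toy$Survived))
```

With the report's objects, the corresponding calls would be `predict(model1, newdata = train_subset)` followed by the same `ifelse()` step; rows with a missing Age would receive `NA` predictions.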