library(Lahman)
library(tidyverse)
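# Open the full Teams table in the data viewer
view(Teams)
# Keep only the 2016 season
teams_2016 <- Teams %>% filter(yearID == 2016)
# Team identifiers plus the batting columns (runs through sacrifice flies)
teams_2016_batting <- teams_2016 %>% select(yearID:teamID, R:SF)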

1.

view(teams_2016)
# Sort teams from fewest to most wins
teams_2016 %>% arrange(W)
# Distribution of 2016 win totals
ggplot(data = teams_2016, aes(x = W)) +
    geom_histogram()

The histogram is roughly symmetric. However, the most common win total is 68 games, and the second most common is 86 games.

2.

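# Add win percentage (wins divided by games played)
teams1_2016 <- teams_2016 %>%
    mutate(win_pct = W / G)
# Look up the Cubs; the team code is a string, so it must be quoted
teams1_2016 %>%
    select(teamID, win_pct) %>%
    filter(teamID == "CHN")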
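
As a quick check on the `win_pct` column, the same division can be done by hand. This is a minimal sketch that assumes the Cubs' 2016 row in `Teams` shows 103 wins in 162 games; `pct_label` is a hypothetical helper, not part of the assignment.

```r
# Hypothetical helper: format a win fraction as a percentage string.
pct_label <- function(w, g, digits = 1) {
  paste0(round(100 * w / g, digits), "%")
}

# Assumed 2016 Cubs totals (W = 103, G = 162); verify against teams_2016.
cubs_pct <- 103 / 162
cubs_label <- pct_label(103, 162)
print(cubs_label)  # "63.6%"
```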

The Cubs' win percentage is about 63.6%.

3.

ggplot(data=teams1_2016, aes(x = lgID, y=X3B))+
    geom_boxplot()

The National League's median number of triples is 7 higher than the American League's. The American League's interquartile range runs from 20 to 31, while the National League's runs from 26 to 39. The National League's maximum reaches 51, while the American League's maximum reaches only 35.

4.

ggplot(data = teams1_2016, aes(x = lgID, y = X3B)) +
    geom_boxplot() +
    labs(x = "League ID",
         y = "Triples",
         title = "Triples by League")

5.

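teams2_2016 <- teams1_2016 %>%
  mutate(BA = H / AB,                 # batting average
         OBP = (H + BB) / (AB + BB),  # simplified on-base percentage (ignores HBP and SF)
         SLG = ((H - X2B - X3B - HR)*1 + X2B*2 + X3B*3 + HR*4) / AB)  # total bases per at-bat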
ggplot(teams2_2016, aes(x=BA, y=R))+
    geom_point()
ggplot(teams2_2016, aes(x=OBP, y=R))+
    geom_point()
ggplot(teams2_2016, aes(x=SLG, y=R))+
    geom_point() 
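
The SLG numerator in the `mutate()` call above is total bases. Because singles are `H - X2B - X3B - HR`, the expression collapses to `H + X2B + 2*X3B + 3*HR`. The sketch below checks that the two forms agree; the function names and the stat line are made up for illustration.

```r
# Total bases written out as in the assignment: singles*1 + doubles*2 + ...
slg_long <- function(H, X2B, X3B, HR, AB) {
  ((H - X2B - X3B - HR) * 1 + X2B * 2 + X3B * 3 + HR * 4) / AB
}

# Algebraically equivalent short form: every hit is worth one base,
# plus 1, 2, or 3 extra bases for doubles, triples, and home runs.
slg_short <- function(H, X2B, X3B, HR, AB) {
  (H + X2B + 2 * X3B + 3 * HR) / AB
}

# Made-up team totals, for illustration only.
long  <- slg_long(H = 1400, X2B = 280, X3B = 30, HR = 200, AB = 5500)
short <- slg_short(H = 1400, X2B = 280, X3B = 30, HR = 200, AB = 5500)
print(all.equal(long, short))  # TRUE
```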

Slugging percentage seems to have the strongest association with runs.

6.

The association appears to be linear, positive, and fairly strong (an r squared of about 0.75 corresponds to an r of about 0.87). The code below supports my findings. I would suggest that the coach use slugging percentage to predict runs because about 75% of the variation in runs can be explained by the linear association with slugging percentage.

teams2_2016%>%
    summarise(r_value=cor(R,SLG),
              r_sq_value=r_value^2)
teams2_2016%>%
    summarise(r_value=cor(R,BA),
              r_sq_value=r_value^2)
teams2_2016%>%
    summarise(r_value=cor(R,OBP),
              r_sq_value=r_value^2)
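
The three `summarise()` calls above repeat the same computation for each statistic. One way to compare them side by side is a small helper built on base `cor()`. This is only a sketch: `r_table` is a hypothetical name, and the demo data frame is made up rather than taken from Lahman.

```r
# Hypothetical helper: correlation (r) and r squared of each predictor with y.
r_table <- function(df, y, predictors) {
  r <- sapply(predictors, function(p) cor(df[[y]], df[[p]]))
  data.frame(stat = predictors, r_value = unname(r), r_sq_value = unname(r)^2)
}

# Made-up numbers for illustration; with the assignment's data this would be
# r_table(teams2_2016, "R", c("BA", "OBP", "SLG")).
demo <- data.frame(
  R   = c(650, 700, 720, 760, 800),
  BA  = c(0.245, 0.255, 0.250, 0.262, 0.266),
  SLG = c(0.390, 0.410, 0.420, 0.435, 0.450)
)
result <- r_table(demo, "R", c("BA", "SLG"))
print(result)

# An r squared of 0.75 corresponds to |r| of about 0.87.
print(round(sqrt(0.75), 2))  # 0.87
```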

7.

teams3_2016<-teams2_2016%>%
    mutate(win_greater85=W>=85)

teams3_2016%>%select(teamID,win_greater85)%>%filter(win_greater85==TRUE)
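
The flag above uses `>=`, so a team with exactly 85 wins is included. Because `TRUE` counts as 1, `sum()` on a logical vector gives the count directly. The sketch below uses a made-up vector of win totals to show how the two comparisons differ.

```r
# Made-up win totals for illustration (not the 2016 standings).
wins <- c(103, 95, 89, 86, 85, 85, 84, 78, 68, 59)

at_least_85 <- sum(wins >= 85)  # counts teams with 85 or more wins
over_85     <- sum(wins > 85)   # counts teams with strictly more than 85 wins

print(c(at_least_85 = at_least_85, over_85 = over_85))
```

With the assignment's data, `sum(teams3_2016$win_greater85)` should give the same count as the number of rows returned by the filter.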

13 teams won 85 or more games.

8.

ggplot(teams3_2016, aes(x = SB, y = HR)) +
    geom_point() +
    labs(title = "Stolen Bases vs. Home Runs") +
    geom_smooth()   # default smoother (loess for a sample this small)
teams3_2016 %>%
    summarize(r_value = cor(SB, HR),
              r_squared = r_value^2)
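
`cor()` measures only linear association, so a curved pattern like the one `geom_smooth()` draws can coexist with a small r. The sketch below uses artificial data (a perfect, symmetric parabola, not the 2016 teams) to illustrate the point.

```r
# Artificial data: a perfect parabola, symmetric around x = 0.
x <- -3:3
y <- x^2

r_linear <- cor(x, y)  # linear correlation of a symmetric parabola

# R squared of a quadratic fit, computed from the residuals.
fit <- lm(y ~ x + I(x^2))
r_sq_quadratic <- 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)

print(round(c(r_linear = r_linear, r_sq_quadratic = r_sq_quadratic), 3))
```

The real stolen-base data are noisy, so neither fit would be anywhere near this clean. The example only shows that a low r does not by itself rule out a curved relationship.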

Stolen bases appear to be negatively correlated with home runs. The smoothed curve, however, is roughly parabolic, implying that past about 75 stolen bases, home runs tend to rise again as stolen bases increase. Still, the r value is small and negative and the r squared value is small, implying that little of the variation in home runs can be attributed to stolen bases.

9.

ggplot(teams3_2016, aes(x=H))+
    geom_histogram()
ggplot(teams3_2016, aes(x=lgID,y=H))+
    geom_boxplot()
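
The numbers behind each plot can also be inspected directly: `hist(plot = FALSE)` returns the bin counts a histogram would draw, and `boxplot.stats()` returns the values a boxplot would draw. The hit totals below are made up for illustration, and the bin edges are an arbitrary choice.

```r
# Made-up season hit totals for illustration only.
hits <- c(1290, 1320, 1350, 1365, 1380, 1400, 1410, 1425, 1450, 1600)

# What a histogram draws: counts of teams per bin (bin edges every 50 hits).
bins <- hist(hits, breaks = seq(1250, 1600, by = 50), plot = FALSE)
print(bins$counts)

# What a boxplot draws: whisker ends, hinges, median, and flagged outliers.
box <- boxplot.stats(hits)
print(box$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(box$out)    # points beyond 1.5 * IQR from the hinges
```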

The histogram shows how many teams fall within each range of hits, while the boxplot shows the distribution of hits separately for the American and National Leagues. The boxplot, however, also shows the outliers and the interquartile range of hits in each league.

10.

view(Teams)
Teams %>%
    filter(HBP > 100, yearID >= 2000) %>%
    select(teamID, HBP) %>%
    arrange(HBP)

Only two teams have been hit by pitches more than 100 times in a season since 2000: Tampa Bay and Cleveland.

Part 2

1.

Voros McCracken’s primary finding is that hits allowed is not a meaningful statistic for evaluating pitchers. He also found little difference among pitchers in their ability to prevent hits on balls put into the field of play. He states that pitchers have little to no ability to determine whether a ball in play becomes a hit.

2.

Traditional baseball followers will see an issue with this because ERA, a primary measure of a pitcher’s ability, is determined by the number of earned runs they give up. On a surface level, it is clear that the more hits a pitcher gives up, the higher their ERA will be.

3.

To supplement his analysis, McCracken could track the inning in which pitches are hit. For example, are more pitches hit in later innings, when the pitcher’s arm gets tired? Do some pitchers give up fewer hits later in the game because they are generally in better shape? These factors cannot be changed in the short run, only in the long run.