Lab 7

download.file("http://www.openintro.org/stat/data/mlb11.RData", destfile = "mlb11.RData")
load("mlb11.RData")
summary(mlb11)

                   team         runs        at_bats          hits     
 Arizona Diamondbacks: 1   Min.   :556   Min.   :5417   Min.   :1263  
 Atlanta Braves      : 1   1st Qu.:629   1st Qu.:5448   1st Qu.:1348  
 Baltimore Orioles   : 1   Median :706   Median :5516   Median :1394  
 Boston Red Sox      : 1   Mean   :694   Mean   :5524   Mean   :1409  
 Chicago Cubs        : 1   3rd Qu.:734   3rd Qu.:5575   3rd Qu.:1441  
 Chicago White Sox   : 1   Max.   :875   Max.   :5710   Max.   :1600  
 (Other)             :24                                              
    homeruns      bat_avg        strikeouts    stolen_bases  
 Min.   : 91   Min.   :0.233   Min.   : 930   Min.   : 49.0  
 1st Qu.:118   1st Qu.:0.245   1st Qu.:1085   1st Qu.: 89.8  
 Median :154   Median :0.253   Median :1140   Median :107.0  
 Mean   :152   Mean   :0.255   Mean   :1150   Mean   :109.3  
 3rd Qu.:173   3rd Qu.:0.260   3rd Qu.:1248   3rd Qu.:130.8  
 Max.   :222   Max.   :0.283   Max.   :1323   Max.   :170.0  

      wins       new_onbase       new_slug        new_obs     
 Min.   : 56   Min.   :0.292   Min.   :0.348   Min.   :0.640  
 1st Qu.: 72   1st Qu.:0.311   1st Qu.:0.377   1st Qu.:0.692  
 Median : 80   Median :0.318   Median :0.399   Median :0.716  
 Mean   : 81   Mean   :0.321   Mean   :0.399   Mean   :0.719  
 3rd Qu.: 90   3rd Qu.:0.328   3rd Qu.:0.413   3rd Qu.:0.738  
 Max.   :102   Max.   :0.349   Max.   :0.461   Max.   :0.810

Exercise 1 What type of plot would you use to display the relationship between runs and one of the other numerical variables? Plot this relationship using the variable at_bats as the predictor. Does the relationship look linear? If you knew a team's at_bats, would you be comfortable using a linear model to predict the number of runs?

I would use Scatterplot to present two numerical variables simultaneously because it permits the relationship between the variables to be examined with ease. The relationship is positive but only moderately strong. I will not be very comfortable using a linear model to predict the number of runs.

plot(mlb11$runs ~ mlb11$at_bats, main = "Relationship")

plot of chunk unnamed-chunk-2

Exercise 2 Looking at your plot from the previous exercise, describe the relationship between these two variables. Make sure to discuss the form, direction, and strength of the relationship as well as any unusual observations.

We can tell the moderate strong relationship of these two variables by looking at the calculated correlation coefficient of 0.610627. If the relationship is strong and positive, the correlation will be near +1 and 0.610627 is not that close. I am also seeing a few positive outliers in the plot that could skew the line. Looks like most of the teams at around 5,600 or less at_bats.

cor(mlb11$runs, mlb11$at_bats)

correlation [1] 0.6106

Exercise 3 Using plot_ss, choose a line that does a good job of minimizing the sum of squares. Run the function several times. What was the smallest sum of squares that you got? How does it compare to your neighbors?

I ran the plot using plot_ss 5 times and the best result for the sum of squares i got was 127,559. I can compare the result with the R generated sum of squares which is not too terribly far apart.

Call: lm(formula = y ~ x, data = pts)

Coefficients: (Intercept) x
-2044.9172 0.4951 cor(mlb11$runs, mlb11$at_bats)

Sum of Squares: 127559.4

plot_ss(x = mlb11$at_bats, y = mlb11$runs, showSquares = TRUE)

plot of chunk unnamed-chunk-4

## Click two points to make a line.

## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##   -2789.243        0.631  
## 
## Sum of Squares:  123722

Exercise 4 Fit a new model that uses homeruns to predict runs. Using the estimates from the R output, write the equation of the regression line. What does the slope tell us in the context of the relationship between success of a team and its home runs?

y = b0 + b1x = 415.2389 + 1.8345*homeruns

In term of the relationship between success of a team and it home run, it seems that for every home run a team has the average number of total runs will also increase by 1.83. This is a positive relationship with a correlation coefficient of 0.7916, which is relatively strong.

cor(mlb11$runs, mlb11$homeruns)

[1] 0.7916

plot_ss(x = mlb11$homeruns, y = mlb11$runs, showSquares = TRUE)

plot of chunk unnamed-chunk-6

## Click two points to make a line.

## Call:
## lm(formula = y ~ x, data = pts)
## 
## Coefficients:
## (Intercept)            x  
##      415.24         1.83  
## 
## Sum of Squares:  73672

m2 <- lm(runs ~ homeruns, data = mlb11)
summary(m2)

correlation 
correlation Call:
correlation lm(formula = runs ~ homeruns, data = mlb11)
correlation 
correlation Residuals:
correlation    Min     1Q Median     3Q    Max 
correlation -91.61 -33.41   3.23  24.29 104.63 
correlation 
correlation Coefficients:
correlation             Estimate Std. Error t value Pr(>|t|)    
correlation (Intercept)  415.239     41.678    9.96  1.0e-10 ***
correlation homeruns       1.835      0.268    6.85  1.9e-07 ***
correlation ---
correlation Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
correlation 
correlation Residual standard error: 51.3 on 28 degrees of freedom
correlation Multiple R-squared:  0.627, Adjusted R-squared:  0.613 
correlation F-statistic:   47 on 1 and 28 DF,  p-value: 1.9e-07

Exercise 5 If a team manager saw the least squares regression line and not the actual data, how many runs would he or she predict for a team with 5,578 at-bats? Is this an overestimate or an underestimate, and by how much? In other words, what is the residual for this prediction?

Based on the formula for least squares regression line for the linear model below the estimated runs for a team with 5578 at_bats are 730.5. Looking at the actual observed data there is no team with 5578 at_bats, but Philadelphia Phillies has a at_bats of 5,579 with 713 runs. Using these two numbers we can see that the model overestimated the runs by 730.5 - 713 = 17.5.

b0 <- -2789.243
b1 <- 0.631
x <- 5578
Yhat <- b0 + b1*x
Yhat

Y hat [1] 730.5

mlb11[order(mlb11$runs,mlb11$at_bats),]

observed Y                     team runs at_bats hits homeruns bat_avg strikeouts
observed Y 30      Seattle Mariners  556    5421 1263      109   0.233       1280
observed Y 28  San Francisco Giants  570    5486 1327      121   0.242       1122
observed Y 29      San Diego Padres  593    5417 1284       91   0.237       1320
observed Y 23    Pittsburgh Pirates  610    5421 1325      107   0.244       1308
observed Y 10        Houston Astros  615    5598 1442       95   0.258       1164
observed Y 21       Minnesota Twins  619    5487 1357      103   0.247       1048
observed Y 27  Washington Nationals  624    5441 1319      154   0.242       1323
observed Y 22       Florida Marlins  625    5508 1358      149   0.247       1244
observed Y 26        Atlanta Braves  641    5528 1345      173   0.243       1260
observed Y 12   Los Angeles Dodgers  644    5436 1395      117   0.257       1087
observed Y 24     Oakland Athletics  645    5452 1330      114   0.244       1094
observed Y 17     Chicago White Sox  654    5502 1387      154   0.252        989
observed Y 13          Chicago Cubs  654    5549 1423      148   0.256       1202
observed Y 15    Los Angeles Angels  667    5513 1394      155   0.253       1086
observed Y 18     Cleveland Indians  704    5509 1380      154   0.250       1269
observed Y 25        Tampa Bay Rays  707    5436 1324      172   0.244       1193
observed Y 11     Baltimore Orioles  708    5585 1434      191   0.257       1120
observed Y 16 Philadelphia Phillies  713    5579 1409      153   0.253       1024
observed Y 6          New York Mets  718    5600 1477      108   0.264       1085
observed Y 8      Milwaukee Brewers  721    5447 1422      185   0.261       1083
observed Y 4     Kansas City Royals  730    5672 1560      129   0.275       1006
observed Y 19  Arizona Diamondbacks  731    5421 1357      172   0.250       1249
observed Y 9       Colorado Rockies  735    5544 1429      163   0.258       1201
observed Y 14       Cincinnati Reds  735    5612 1438      183   0.256       1250
observed Y 20     Toronto Blue Jays  743    5559 1384      186   0.249       1184
observed Y 5    St. Louis Cardinals  762    5532 1513      162   0.273        978
observed Y 3         Detroit Tigers  787    5563 1540      169   0.277       1143
observed Y 1          Texas Rangers  855    5659 1599      210   0.283        930
observed Y 7       New York Yankees  867    5518 1452      222   0.263       1138
observed Y 2         Boston Red Sox  875    5710 1600      203   0.280       1108
observed Y    stolen_bases wins new_onbase new_slug new_obs
observed Y 30          125   67      0.292    0.348   0.640
observed Y 28           85   86      0.303    0.368   0.671
observed Y 29          170   71      0.305    0.349   0.653
observed Y 23          108   72      0.309    0.368   0.676
observed Y 10          118   56      0.311    0.374   0.684
observed Y 21           92   63      0.306    0.360   0.666
observed Y 27          106   80      0.309    0.383   0.691
observed Y 22           95   72      0.318    0.388   0.706
observed Y 26           77   89      0.308    0.387   0.695
observed Y 12          126   82      0.322    0.375   0.697
observed Y 24          117   74      0.311    0.369   0.680
observed Y 17           81   79      0.319    0.388   0.706
observed Y 13           69   71      0.314    0.401   0.715
observed Y 15          135   86      0.313    0.402   0.714
observed Y 18           89   80      0.317    0.396   0.714
observed Y 25          155   91      0.322    0.402   0.724
observed Y 11           81   69      0.316    0.413   0.729
observed Y 16           96  102      0.323    0.395   0.717
observed Y 6           130   77      0.335    0.391   0.725
observed Y 8            94   96      0.325    0.425   0.750
observed Y 4           153   71      0.329    0.415   0.744
observed Y 19          133   94      0.322    0.413   0.736
observed Y 9           118   73      0.329    0.410   0.739
observed Y 14           97   79      0.326    0.408   0.734
observed Y 20          131   81      0.317    0.413   0.730
observed Y 5            57   90      0.341    0.425   0.766
observed Y 3            49   95      0.340    0.434   0.773
observed Y 1           143   96      0.340    0.460   0.800
observed Y 7           147   97      0.343    0.444   0.788
observed Y 2           102   90      0.349    0.461   0.810

m1 <- lm(runs ~ at_bats, data = mlb11)
plot(mlb11$runs ~ mlb11$at_bats, main = "Relationship")
abline(m1)

plot of chunk unnamed-chunk-10

summary(m1)


Call:
lm(formula = runs ~ at_bats, data = mlb11)

Residuals:
   Min     1Q Median     3Q    Max 
-125.6  -47.0  -16.6   54.4  176.9 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2789.243    853.696   -3.27  0.00287 ** 
at_bats         0.631      0.155    4.08  0.00034 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 66.5 on 28 degrees of freedom
Multiple R-squared:  0.373, Adjusted R-squared:  0.35 
F-statistic: 16.6 on 1 and 28 DF,  p-value: 0.000339

Exercise 6 Is there any apparent pattern in the residuals plot? What does this indicate about the linearity of the relationship between runs and at-bats?

The residuals show no obvious patterns and appear to be scattered randomly around the dashed line that represents 0. I would say that the relationship is linear.

plot(m1$residuals ~ mlb11$at_bats)
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0

plot of chunk unnamed-chunk-11

Exercise 7 Based on the histogram and the normal probability plot, does the nearly normal residuals condition appear to be met?

It looks nearly normal to me.

hist(m1$residuals)

plot of chunk unnamed-chunk-12

qqnorm(m1$residuals)
qqline(m1$residuals) # adds diagonal line to the normal prob plot

plot of chunk unnamed-chunk-12

Exercise 8 Based on the plot in (1), does the constant variability condition appear to be met?

The constant variability condition calls for the variability of points around the least squares line remains roughly constant. Based on the plots we have done it looks to me this condition has been met.

Data Analysis

###. 1. Choose another traditional variable from mlb11 that you think might be a good predictor of runs. Produce a scatterplot of the two variables and fit a linear model. At a glance, does there seem to be a linear relationship?

Since we already looked at the relationship between runs and homeruns and runs and at_bat I chose runs and bat_avg to see if it is a good predictor. F rom the plot and summary statistics below it looks to me that the two variables fit a liner model.

y = b0 + b1X = -642.8+5242.2*bat_avg

m3 <- lm(runs ~ bat_avg, data = mlb11)
plot(mlb11$runs ~ mlb11$bat_avg, main = "Relationship3")
abline(m3)

plot of chunk unnamed-chunk-13

summary(m3)


Call:
lm(formula = runs ~ bat_avg, data = mlb11)

Residuals:
   Min     1Q Median     3Q    Max 
 -94.7  -26.3   -5.5   28.5  131.1 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)     -643        183   -3.51   0.0015 ** 
bat_avg         5242        717    7.31  5.9e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 49.2 on 28 degrees of freedom
Multiple R-squared:  0.656, Adjusted R-squared:  0.644 
F-statistic: 53.4 on 1 and 28 DF,  p-value: 5.88e-08

2. How does this relationship compare to the relationship between runs and at_bats? Use the R2 values from the two model summaries to compare. Does your variable seem to predict runs better than at_bats? How can you tell?

R2 measure of how close the data are to least squares line. 0% indicates that the model explains none of the variability of the response data around its mean. 100% indicates that the model explains all the variability of the response data around its mean. comparing the R2 data for runs and at-bats and runs and bat_avg it seems that the latter predict runs better because the R2 for bat_avg is 0.6561 vs. 0.3729 forat_abts. This indicates that 65.61% of variability can be explained by the model.

Now that you can summarize the linear relationship between two variables, investigate the relationships between runs and each of the other five traditional variables. Which variable best predicts runs? Support your conclusion using the graphical and numerical methods we've discussed (for the sake of conciseness, only include output for the best variable, not all five).

after running summary statistics for all other traditional variables it turns out that the best variable to predict the runs is bat_avg. It has the highest r2 value.

m3 <- lm(runs ~ bat_avg, data = mlb11)
hist(m3$residuals)

plot of chunk unnamed-chunk-14

qqnorm(m3$residuals)
qqline(m3$residuals) # adds diagonal line to the normal prob plot

plot of chunk unnamed-chunk-14

4. Now examine the three newer variables. These are the statistics used by the author of Moneyball to predict a teams success. In general, are they more or less effective at predicting runs that the old variables? Explain using appropriate graphical and numerical evidence. Of all ten variables we've analyzed, which seems to be the best predictor of runs? Using the limited (or not so limited) information you know about these baseball statistics, does your result make sense?

If I don't know anything about baseball but only have the following summary statistics to predict which new variable is the most effective at predicting run I would pick new_obs. The R-squared for new_obs is at a high 93.5%!!

names(mlb11)

 [1] "team"         "runs"         "at_bats"      "hits"        
 [5] "homeruns"     "bat_avg"      "strikeouts"   "stolen_bases"
 [9] "wins"         "new_onbase"   "new_slug"     "new_obs"

mnew_obs <- lm(runs ~ new_obs, data = mlb11)
mnew_slug <- lm(runs ~ new_slug, data = mlb11)
mnew_onbase <- lm(runs ~ new_onbase, data = mlb11)
summary(mnew_obs)


Call:
lm(formula = runs ~ new_obs, data = mlb11)

Residuals:
   Min     1Q Median     3Q    Max 
-43.46 -13.69   1.16  13.94  41.16 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -686.6       68.9   -9.96    1e-10 ***
new_obs       1919.4       95.7   20.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.4 on 28 degrees of freedom
Multiple R-squared:  0.935, Adjusted R-squared:  0.933 
F-statistic:  402 on 1 and 28 DF,  p-value: <2e-16

summary(mnew_slug)


Call:
lm(formula = runs ~ new_slug, data = mlb11)

Residuals:
   Min     1Q Median     3Q    Max 
-45.41 -18.66  -0.91  16.29  52.29 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   -375.8       68.7   -5.47  7.7e-06 ***
new_slug      2681.3      171.8   15.60  2.4e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 27 on 28 degrees of freedom
Multiple R-squared:  0.897, Adjusted R-squared:  0.893 
F-statistic:  244 on 1 and 28 DF,  p-value: 2.42e-15

summary(mnew_onbase)


Call:
lm(formula = runs ~ new_onbase, data = mlb11)

Residuals:
   Min     1Q Median     3Q    Max 
-58.27 -18.33   3.25  19.52  69.00 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1118        144   -7.74  2.0e-08 ***
new_onbase      5654        450   12.55  5.1e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.6 on 28 degrees of freedom
Multiple R-squared:  0.849, Adjusted R-squared:  0.844 
F-statistic:  158 on 1 and 28 DF,  p-value: 5.12e-13

mnew_obs <- lm(runs ~ new_obs, data = mlb11)
hist(mnew_obs$residuals)

plot of chunk unnamed-chunk-16

qqnorm(mnew_obs$residuals)
qqline(mnew_obs$residuals) # adds diagonal line to the normal prob plot

plot of chunk unnamed-chunk-16

Problems

The scatterplot shows the relationship between socioeconomic status measured as the percentage of children in a neighborhood receiving reduced-fee lunches at school (lunch) and the percentage of bike riders in the neighborhood wearing helmets (helmet). The average percentage of children receiving reduced-fee lunches is 30.8% with a standard deviation of 26.7% and the average percentage of bike riders wearing helmets is 38.8% with a standard deviation of 16.9%.

Average % (mean) children receiving reduced-fee lunches = .308
SD - children receiving reduced-fee lunches = .267
Average % (mean) of bike rider wearing helmets = .388
SD - bike rider wearing helmets = .169

avglunch <- .308
avghelmet <- .388
sdlunch <-.267
sdhelmet <-.169
data <-data.frame(avglunch,avghelmet,sdlunch,sdhelmet)
data

  avglunch avghelmet sdlunch sdhelmet
1    0.308     0.388   0.267    0.169

(a) If the R2 for the least-squares regression line for these data is 72%, what is the correlation between lunch and helmet?

If the R2 for the least-squares regression line is 72% the correlation coefficient is about -.85. It indicates strong negative linear relationship between the explanatory and response variables. 72% of the variation in the average percentage of bike riders wearing helmets can be explained by the explanatory variable of average percent of children receiving reduced-fee lunches.

R2 <- .72
sqrt(R2)*-1

R [1] -0.8485

(b) Calculate the slope and intercept for the least-squares regression line for these data.

The slope and intercept for the least-squares regression line is y=0.5534-0.5371X

Sx <- .267
Sy <- .169
R<- sqrt(.72)*-1
beta1 = Sy/Sx*R
beta1

Slope [1] -0.5371

Meanx <- .308
Meany <- .388
beta0 = Meany - beta1 * Meanx
beta0

Intercept [1] 0.5534

The intercept is the expected mean value of Y when all X=0. In this context, the y intercept of the average bikers wearing helmets is 55.34 % when the average percent of children receiving reduced-fee lunches is at 0. For this question, the y-intercept is relevant, since there are children who do not receive any reduced-fee lunches.

(d) Interpret the slope of the least-squares regression line in the context of the application.

The slope is -.5370843. What this mean is that for every decrease of 1% in x there is a decrease of 0.5370842*.01 in y.

(e) What would the value of the residual be for a neighborhood where 40% of the children receive reduced-fee lunches and 40% of the bike riders wear helmets? Interpret the meaning of this residual in the context of the application.

We will first compute the predicted value of y (yhat). Then we will use the residual formula ex = y - yhat. Based on the computation the linear model estimated 33.86% bike riders wear helmets when 40% of the children receive reduced-fee lunches. The residual is the difference between observed and expected value, which is .06141 in this context.

x <- .40
y<-.4
beta0

[1] 0.5534

beta1

[1] -0.5371

yhat <-beta0+beta1*x
yhat

[1] 0.3386

ex <- y-yhat
ex

[1] 0.06141

2. Identify the outliers in the scatterplots shown below and determine what type of outliers they are. Explain your reasoning.

1. There is one outlier on the top left far from other points and it appears to pull the least squares line up on the left. I think this is a influential point that affects the slope of the line.
1. There is one outlier on the bottom left corner of the line, thought it appears only slightly influence the line.
1. There is one outlier vertically falls away from the cloud. It is not influencing the line at all.

3. The scatterplot and least squares summary below show the relationship between weight measured in kilograms and height measured in centimeters of 507 physically active individuals.

a) Describe the relationship between height and weight.

We can tell the strong positive relationship of these two variables by visually insert a least squares line among the data. There are very few positive outliers in the plot that may or may not skew the line.

(b) Write the equation of the regression line. Interpret the slope and intercept in context.

y = beta0 + beta1*X = -105.0113 + 1.0176*X

The y-intercept has a negative value of -105.0113, which is the value where it intercept with y when x is 0. In this context I do not think the y-intercept is usable because a person can not be at 0 centimeter height. The slope of the model is 1.0176, which means that for every centimeter the height goes up, the weight will go up by 1.0176.

© Do the data provide strong evidence that an increase in height is associated with an increase in weight? State the null and alternative hypotheses, report the p-value, and state your conclusion.

H0: beta1 = 0 - The true linear model has slope 0
HA: beta1 > 0 - The true linear model has a slope greater than 0

The sample statistics given show a p-value of 0.0000. The p-value is so small that we reject the null hypothesis. The data provides strong evidence that an increase in height is associated with an increase in weight.

(d) The correlation coefficient for height and weight is 0.72. Calculate R2 and interpret it in context.

The R2 is .5184 or 51.84% of the response variable weight can be explained by the explanatory variable Height.

R <- .72
R2 <- R^2
R2

R2 [1] 0.5184

** Lab 7**