This Markdown document is an exercise for the Regression Models 2 course.
The datasets used in this analysis can be downloaded at:
Piecewise
Major League Baseball

Piecewise

A company receives shipments of a certain part that is used as a component in several of its products. Shipment size varies according to production schedules. For handling and distribution to assembly plants, shipments of 250 thousand pieces or less are sent to warehouse A; larger shipments are sent to warehouse B, which has specialized equipment for handling large shipments. Below we have the scatter plot of \(y_i=\) Logistic cost (Thousands USD) against \(x_{1i}=\) Shipping size (Thousands pieces). There appears to be a positive linear relationship between these variables, which is plausible since logistic costs increase as shipment size increases.

Around \(x_{1i} = 250\), the positive linear relationship appears to weaken: logistic costs still increase with shipment size, but more slowly. Given this pattern, let’s specify a model that captures the change in slope.

Consider the following model: \[ y_i = \beta_0 + \beta_1x_{1i} + \beta_2(x_{1i} - 250)x_{2i} + \epsilon_i,\] where \(x_{1i}=\) Shipping size (Thousands pieces) and \(x_{2i} = I_{[x_{1i} > 250]}\).
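To make the role of \(\beta_2\) explicit, the mean response implied by this model can be written separately on each side of the knot at 250. This is only a restatement of the model above, with no additional assumptions:

\[
E[y_i] =
\begin{cases}
\beta_0 + \beta_1 x_{1i}, & x_{1i} \le 250,\\
(\beta_0 - 250\beta_2) + (\beta_1 + \beta_2)\,x_{1i}, & x_{1i} > 250.
\end{cases}
\]

Both pieces equal \(\beta_0 + 250\beta_1\) at \(x_{1i} = 250\), so the fitted line is continuous there, and \(\beta_2\) measures the change in slope for large shipments.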
Let’s fit the model above.
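A minimal sketch of how such a piecewise model can be fitted with `lm()` is shown here. The original dataset is not reproduced in this document, so the sketch uses simulated data; the data frame `shipments` and its columns `size` and `cost` are hypothetical names, and the simulation coefficients are illustrative only.

```r
# Simulated stand-in data (hypothetical names; not the course dataset).
set.seed(1)
n    <- 100
size <- runif(n, min = 50, max = 450)        # shipping size (thousand pieces)
x2   <- as.numeric(size > 250)               # indicator I[x1 > 250]
cost <- 3.2 + 0.0385 * size - 0.0248 * (size - 250) * x2 +
  rnorm(n, sd = 0.5)                         # logistic cost (thousand USD)
shipments <- data.frame(size = size, cost = cost)

# Piecewise linear model with a knot at 250: the I() term is zero
# up to the knot and equals (size - 250) beyond it.
fit <- lm(cost ~ size + I((size - 250) * (size > 250)), data = shipments)
coef(fit)

# Overlay the fitted broken line on the scatter plot.
grid <- data.frame(size = seq(50, 450, by = 5))
plot(cost ~ size, data = shipments)
lines(grid$size, predict(fit, newdata = grid))
```

The third coefficient returned by `coef(fit)` estimates \(\beta_2\), the change in slope after the knot.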

We obtain the following estimated regression function: \[\hat{y}_i = 3.2139 + 0.0385x_{1i} - 0.0248(x_{1i} - 250)x_{2i}\] Below we have the plot of the fitted model.

Below, we have the Q-Q plot and the Studentized residual plot. The Q-Q plot suggests that a heavy-tailed error distribution might be more adequate for these data. The plot of the Studentized residuals against the observation index shows no evident departure from the model assumptions.
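The two diagnostic plots described above can be produced along the following lines. This is a sketch on simulated data, not the code used for this report; the variables `x` and `y` are placeholders.

```r
# Simulated placeholder data and fit (not the course dataset).
set.seed(2)
x   <- runif(80, min = 50, max = 450)
y   <- 3 + 0.04 * x + rnorm(80)
fit <- lm(y ~ x)

# Externally Studentized residuals from the fitted model.
r_stud <- rstudent(fit)

# Normal Q-Q plot: points far from the reference line in the tails
# suggest a heavier-tailed error distribution.
qqnorm(r_stud, main = "Normal Q-Q plot of Studentized residuals")
qqline(r_stud)

# Studentized residuals against observation index.
plot(r_stud, xlab = "Index", ylab = "Studentized residual")
abline(h = c(-2, 0, 2), lty = 2)
```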

Questions

What is the predicted cost for a shipment with a size of 125? : approximately 8 thousand USD
What is the predicted cost for a shipment with a size of 250? : approximately 12.8 thousand USD
What is the predicted cost for a shipment with a size of 400? : approximately 14.9 thousand USD
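
These answers follow from substituting each shipment size into the estimated regression function. The calculation uses the rounded coefficients reported earlier, so the last digits are approximate. The indicator \(x_{2i}\) is 0 for sizes up to 250 and 1 beyond, and the results are in thousands of USD:

\[
\begin{aligned}
x_{1i} = 125:\quad \hat{y}_i &= 3.2139 + 0.0385(125) = 8.0264,\\
x_{1i} = 250:\quad \hat{y}_i &= 3.2139 + 0.0385(250) = 12.8389,\\
x_{1i} = 400:\quad \hat{y}_i &= 3.2139 + 0.0385(400) - 0.0248(400 - 250) = 14.8939.
\end{aligned}
\]

Beyond 250, each additional thousand pieces adds about \(0.0385 - 0.0248 = 0.0137\) thousand USD to the predicted cost.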

Major League Baseball

Now we have the salaries of 353 baseball players, along with further information such as games played, years in the major leagues, ethnicity, and the team’s salary budget.

Using the stepAIC() function, we obtained the following regression model for the average salary: \[y_i = \beta_0 + \beta_1x_{1i} + \beta_2x_{2i} + \beta_3x_{3i} + \epsilon_i,\] where \(x_{1i} =\) team’s salary budget, \(x_{2i} =\) career games played and \(x_{3i} =\) years in major league. However, the variables \(x_{2i}\) and \(x_{3i}\) are highly correlated, so let’s keep only one of them in the model.
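The selection and the correlation check can be sketched as follows. The data are simulated because the original dataset is not reproduced here. The names `salary`, `teamsl` and `games` come from the model output in this report, while `players`, `years` and `noise` are hypothetical. `stepAIC()` is provided by the MASS package.

```r
library(MASS)  # provides stepAIC()

# Simulated stand-in data (not the course dataset).
set.seed(3)
n      <- 353
years  <- sample(1:20, n, replace = TRUE)
games  <- round(years * 100 + runif(n, 0, 80))   # strongly tied to years
teamsl <- runif(n, 1e7, 6e7)
noise  <- rnorm(n)                               # irrelevant candidate predictor
salary <- 0.013 * teamsl + 1500 * games + rnorm(n, sd = 1e6)
players <- data.frame(salary, teamsl, games, years, noise)

# Stepwise selection by AIC, starting from the full model.
full     <- lm(salary ~ teamsl + games + years + noise, data = players)
selected <- stepAIC(full, direction = "both", trace = FALSE)
formula(selected)

# Check how strongly the two career-length variables are correlated.
cor(players$games, players$years)
```

When two predictors are this strongly correlated, their individual coefficients become unstable, which is one reason to keep only one of them.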

## 
## Call:
## lm(formula = salary ~ teamsl + games, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2971197  -599717  -256957   412251  4107928 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.357e+05  2.230e+05  -1.057  0.29113    
## teamsl       2.007e-02  6.988e-03   2.872  0.00433 ** 
## games        1.486e+03  1.132e+02  13.130  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1126000 on 350 degrees of freedom
## Multiple R-squared:  0.3638, Adjusted R-squared:  0.3602 
## F-statistic: 100.1 on 2 and 350 DF,  p-value: < 2.2e-16

As the output above shows, the model as a whole is significant, but in the individual tests the parameter \(\beta_0\) is not significant. Let’s fit the model again without the intercept.

## 
## Call:
## lm(formula = salary ~ teamsl + games - 1, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3052688  -584351  -288190   360331  4150460 
## 
## Coefficients:
##         Estimate Std. Error t value Pr(>|t|)    
## teamsl 1.337e-02  2.942e-03   4.544  7.6e-06 ***
## games  1.467e+03  1.117e+02  13.127  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1126000 on 351 degrees of freedom
## Multiple R-squared:  0.6671, Adjusted R-squared:  0.6652 
## F-statistic: 351.6 on 2 and 351 DF,  p-value: < 2.2e-16
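
The increase in R-squared from 0.3638 to 0.6671 between the two fits should be read with care. For a model without an intercept, `summary()` on an `lm` fit measures R-squared against the uncentered total sum of squares, so the two values are not directly comparable. The sketch below illustrates this on simulated data; all variable names and values are placeholders.

```r
# Simulated placeholder data with a clearly non-zero intercept.
set.seed(4)
n <- 200
x <- runif(n, 0, 10)
y <- 50 + 2 * x + rnorm(n, sd = 5)

with_int    <- lm(y ~ x)
without_int <- lm(y ~ x - 1)   # "- 1" removes the intercept

# Two ways of computing R-squared for the no-intercept fit.
rss           <- sum(resid(without_int)^2)
r2_uncentered <- 1 - rss / sum(y^2)               # total SS about zero
r2_centered   <- 1 - rss / sum((y - mean(y))^2)   # total SS about the mean

c(reported       = summary(without_int)$r.squared,
  uncentered     = r2_uncentered,
  centered       = r2_centered,
  with_intercept = summary(with_int)$r.squared)
```

The reported value matches the uncentered version, while the centered version is the one on the same scale as the R-squared of the model with an intercept.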

Based on the variables discussed above, we obtain the following estimated regression function:
\[\hat{y}_i = 0.0133x_{1i} + 1467x_{2i}\] So, for each additional career game played, the salary increases on average by about 1,467 USD, holding the team’s salary budget fixed. On the other hand, for each one-unit increase in the team’s salary budget, the salary increases on average by approximately 0.0133, holding games played fixed.
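
As a purely illustrative reading of this equation, consider a hypothetical player with 1,000 career games on a team whose salary budget is 30,000,000. These values are not taken from the dataset, and the example assumes that the budget is recorded in the same monetary unit as salary:

\[
\hat{y} = 0.0133(30{,}000{,}000) + 1467(1000) = 399{,}000 + 1{,}467{,}000 = 1{,}866{,}000.
\]

Under these assumed values, most of the predicted salary comes from the games term, which is consistent with its much larger t value in the output above.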

The Q-Q plot below suggests that the normality assumption is violated. In the Studentized residual plot, the assumption of equal variance seems reasonable, and about a dozen observations appear to be outliers.
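
One way to list candidate outliers is to flag observations whose Studentized residual is large in absolute value. The sketch below uses simulated data with heavy-tailed errors. The cutoff of 2 is a common rule of thumb and an assumption here, not necessarily the criterion used for this report.

```r
# Simulated placeholder data with heavy-tailed errors (not the course dataset).
set.seed(5)
n      <- 353
games  <- round(runif(n, 50, 2500))
salary <- 1500 * games + 1e6 * rt(n, df = 3)
fit    <- lm(salary ~ games)

# Flag observations with large Studentized residuals.
r_stud  <- rstudent(fit)
cutoff  <- 2                       # assumed rule-of-thumb threshold
flagged <- which(abs(r_stud) > cutoff)

length(flagged)                    # number of candidate outliers
head(sort(abs(r_stud), decreasing = TRUE))
```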