Data Preparation

After a cursory analysis of the data, a decision was made to aggregate seasons at the player level and to sum the performance totals of each player for the two seasons. Due to the large number of observations with very few games played, a decision was also made to filter out those players that had not played at least 75% of games over two season period (162 games x 2 = 324). This worked out to be approximately 240 games. This left us with 85 high quality observations, but one observation was an outlier that appeared to be skewing the data. It was the highest paid player in the data set with only average performance based on the data. It was removed, leaving 84 observations to analyze and model.

## 'data.frame':    1340 obs. of  19 variables:
##  $ playerID: chr  "aardsda01" "abadfe01" "abreuto01" "aceveal01" ...
##  $ G       : num  43 108 56 21 256 43 250 50 61 62 ...
##  $ AB      : num  0 0 142 0 886 140 823 0 132 115 ...
##  $ R       : num  0 0 21 0 104 10 101 0 15 13 ...
##  $ H       : num  0 0 37 0 220 27 236 0 25 27 ...
##  $ X2B     : num  0 0 12 0 45 5 48 0 4 7 ...
##  $ X3B     : num  0 0 3 0 6 1 5 0 0 0 ...
##  $ HR      : num  0 0 2 0 18 2 32 0 1 1 ...
##  $ RBI     : num  0 0 14 0 96 13 119 0 8 8 ...
##  $ SB      : num  0 0 0 0 10 0 3 0 5 1 ...
##  $ CS      : num  0 0 2 0 7 0 3 0 1 1 ...
##  $ BB      : num  0 0 6 0 69 9 49 0 13 6 ...
##  $ SO      : num  0 0 33 0 162 43 194 0 36 27 ...
##  $ IBB     : num  0 0 1 0 2 0 5 0 0 1 ...
##  $ HBP     : num  0 0 1 0 4 2 3 0 0 1 ...
##  $ SH      : num  0 0 1 0 7 1 0 0 1 3 ...
##  $ SF      : num  0 0 1 0 3 0 7 0 2 1 ...
##  $ GIDP    : num  0 0 6 0 16 4 18 0 2 3 ...
##  $ SALARY  : num  2700000 540000 490000 5000000 6000000 ...
## 'data.frame':    84 obs. of  19 variables:
##  $ playerID: chr  "ackledu01" "adamsma01" "altuvjo01" "alvarpe01" ...
##  $ G       : num  256 250 310 274 294 313 287 244 294 268 ...
##  $ AB      : num  886 823 1286 956 791 ...
##  $ R       : num  104 101 149 116 74 163 143 107 145 83 ...
##  $ H       : num  220 236 402 222 188 331 311 254 313 173 ...
##  $ X2B     : num  45 48 78 35 27 52 42 63 63 34 ...
##  $ X3B     : num  6 5 5 3 6 5 9 6 9 5 ...
##  $ HR      : num  18 32 12 54 10 6 9 28 13 16 ...
##  $ RBI     : num  96 119 111 156 72 108 80 113 122 68 ...
##  $ SB      : num  10 3 91 10 16 69 37 4 28 16 ...
##  $ CS      : num  7 3 22 3 3 23 20 1 16 15 ...
##  $ BB      : num  69 49 68 93 51 98 98 48 59 36 ...
##  $ SO      : num  162 194 138 299 126 193 89 130 121 227 ...
##  $ IBB     : num  2 5 12 13 6 1 1 2 5 0 ...
##  $ HBP     : num  4 3 7 6 3 7 17 5 8 8 ...
##  $ SH      : num  7 0 5 0 11 25 16 3 11 12 ...
##  $ SF      : num  3 7 13 4 6 13 4 7 13 2 ...
##  $ GIDP    : num  16 18 44 28 13 40 14 29 24 16 ...
##  $ SALARY  : num  6000000 1120000 2250000 930000 5500000 ...

There are some things we don’t know about our data. For instance, we might like to know how many years or seasons each player has been in the leauge because we know that players who are proven over several seasons will likely command a higher salary. Disambiguating that part of our data might allow for a more clear relationship to appear between performance and Salary. Moreover, we are not sure when this salary figure is being paid. Due to the nature of pro sports, players are often rewarded with contracts after a period of high performance. Because of the lagged nature of this relationship, identifying the performance attributable to the period in which salary was paid might be also be helpful in uncovering relationships between performance and salary.

Exploratory Analysis

To observe the interactions of covariates with each other, we first explore a variety of pair plots. Observations are grouped by whether or not they have hit more than 10 home runs. While there are no ground breaking observations here, one might be surprised by the negative correlation of triples with home runs. Perhaps hitting for power and hitting for extra bases are relatively mutually exclusive.

chart.Correlation(filtered_df[2:8], pch=21) 

chart.Correlation(filtered_df[9:14], pch=21)

chart.Correlation(filtered_df[15:18], pch=21)

Target Exploration: Salary

Now we look at the target, Salary, versus the variables in our data. In all of these charts, Salary is on the y-axis. Although we do not observe any outright pure positive linear relationships with the target, we do so slightly positive patterns in offensive high performance categories such as R and BB. Perhaps, we would expect a more positive relationship between Salary and HR, but it appears there are a cluster of observations where big time HR hitters are not yet commanding a high salary. Also of note is the kinked nature of the loess curve for SO - a moderate number of strike outs actually appear to have a positive relationship with Salary. This actually makes a bit of sense as heavy hitters also tend to have high strike out rates.

regVar <- c("G","AB", "R", "H", "X2B", "X3B", "HR", "RBI", "SB", "CS", "BB", "SO", "IBB", "HBP", "SH", "SF", "GIDP")
featurePlot(x = filtered_df[, regVar],
            y = filtered_df$SALARY,
            plot = "scatter",
            type = c("p", "smooth"),
            layout = c(3, 6), main="Univariate Plots vs Salary")

Finally, we look at correlation among the variables and specifically the correlation of covariates with the target. It appears that SB and CS have a slightly negative correlation, while G, AB, R, H, X2B, and RBI have a slightly positive correlation with the target.

Question 1 - Does Salary Reward Performance? (25 pts)

We have already started to examine the relationship of performance with salary, but to understand the quantitative impact of a specific performance category on a player’s salary we might want to do a linear regression.

Because we are not building a predictive model, but rather trying to understand our data, the decision was made not to use a training and a test set for this part of the analysis.

First, we attempt a regression with all of the variables, knowing that this is likely only a starting point in our evaluation and will not be a “good” model.

lm.ball <- lm(SALARY~., data = filtered_df[,2:19])
summary(lm.ball)
## 
## Call:
## lm(formula = SALARY ~ ., data = filtered_df[, 2:19])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9563206 -3587691  -991884  3098990 15209095 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept) -1351164   10741079  -0.126   0.9003  
## G              83025      62890   1.320   0.1913  
## AB            -39642      19097  -2.076   0.0418 *
## R             125612      66534   1.888   0.0634 .
## H              82387      55747   1.478   0.1442  
## X2B           -57744     101522  -0.569   0.5714  
## X3B            72736     226721   0.321   0.7494  
## HR            -46148     125009  -0.369   0.7132  
## RBI           -17117      61704  -0.277   0.7823  
## SB            -83958      52924  -1.586   0.1174  
## CS           -389510     213331  -1.826   0.0724 .
## BB            -54357      34787  -1.563   0.1229  
## SO             13179      16863   0.782   0.4373  
## IBB          -141683     149172  -0.950   0.3457  
## HBP            19875     120980   0.164   0.8700  
## SH             52528     197200   0.266   0.7908  
## SF            -94805     253476  -0.374   0.7096  
## GIDP          -14490      94097  -0.154   0.8781  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6141000 on 66 degrees of freedom
## Multiple R-squared:  0.2746, Adjusted R-squared:  0.08779 
## F-statistic:  1.47 on 17 and 66 DF,  p-value: 0.1342

The resulting R sq of .27, meaning that our model explains about 27% of the variance in the data, is not terrible. The p-value of 13.4% suggests that there is only a 13% chance that our results or more extreme results would occur if the null hypothesis were true. The null in this case is that there is no statistical relationship between our co-variates and our target.

And while our model doesn’t seem bad at first glance, we are looking for explanations and there are coefficients in our model that both do and do not make sense. Moreover, many of our co-variates are not statistically significant. Apparently, in this model players are rewarded for G played, but not for AB (which is highly significant at p = .042). This seems to be a bit counter-intuitive.

In order to evaluate if a better model exists, we can evaluate several stepwise approaches to the model. After observing the results of the various stepwise models, we can compare a new leaner model with the previous “kitchen sink” model. The backward approach gives us a model with 5 variables that has a lower overall R sq (19%), but also has variables that are more statistically significant. This is a better model to use.

step <- stepAIC(lm.ball, direction = "backward", trace = FALSE)
step1 <- stepAIC(lm.ball, direction = "forward", trace = FALSE)
step2 <- stepAIC(lm.ball, direction = "both", trace = FALSE)
htmlTable(anova(step, step1, step2))
Res.Df RSS Df Sum of Sq F Pr(>F)
1 78 2771167640698472
2 66 2489205343560266 12 281962297138206 0.623007112800931 0.815183428120628
3 78 2771167640698472 -12 -281962297138206 0.623007112800931 0.815183428120628
lm.ball2 <- lm(SALARY~G + AB + R + SB + BB, data = filtered_df[,2:19])
summary(lm.ball2)
## 
## Call:
## lm(formula = SALARY ~ G + AB + R + SB + BB, data = filtered_df[, 
##     2:19])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -9122339 -4055525 -1100077  3637245 14880708 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -5252836    9357515  -0.561  0.57617   
## G              81748      53965   1.515  0.13386   
## AB            -17660      11398  -1.549  0.12533   
## R             116584      40841   2.855  0.00552 **
## SB           -114314      34447  -3.319  0.00138 **
## BB            -76370      27250  -2.803  0.00639 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5961000 on 78 degrees of freedom
## Multiple R-squared:  0.1925, Adjusted R-squared:  0.1407 
## F-statistic: 3.718 on 5 and 78 DF,  p-value: 0.004526

What we can learn from this model is that for every additional G played, which requires a player to be healthy (a form of performance), his salary should increase $81,748. For every R, he is compensated with $116,584. Apparently more AB are not a contributor to a greater salary, but the amount is low enough to be perhaps ignored. It is notable that SB have a negative impact on Salary and this is consistent with our observations of correlation above. Perhaps SB are related to more injuries and thus this trait is not valued by GM’s. It’s hard to say, however, exactly why this might be.

While we can not conclusively say that general performance is rewarded with salary since not all positive performance variables appear to have positive coefficients or are statistically significant, the data does seem to indicate that some positive relationship does exist. Specifically, we can say there is a positive relationship with R, and this does make a fair bit of intuitive sense. Runs lead to wins, and we expect winning to be rewarded. Perhaps with some of the improvements in data quality identified in the Data Preparation section, we might be able to find more conclusive answers.

Question 2 - Does the data support the idea that there are players specialized in different kinds of play? (50 pts)

In order to identify groups, we can use k-means. Prior to using k-means, however, we should use PCA to create uncorrelated variables so that the resulting groups are statistically valid. Here, we compute Principal Components (centered and scaled), and review a visual representation of the results. By using the first four PCs, with eigenvalues over 1, we capture 70% of the variance. The plot helps to illustrate this point.

ball.pca <- prcomp(filtered_df[,2:18], center = TRUE, scale = TRUE)
summary(ball.pca)
## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6
## Standard deviation     2.3996 1.8292 1.3529 0.99989 0.97012 0.88485
## Proportion of Variance 0.3387 0.1968 0.1077 0.05881 0.05536 0.04606
## Cumulative Proportion  0.3387 0.5355 0.6432 0.70201 0.75737 0.80343
##                            PC7     PC8     PC9    PC10    PC11    PC12
## Standard deviation     0.87528 0.78848 0.64894 0.62049 0.60286 0.54247
## Proportion of Variance 0.04507 0.03657 0.02477 0.02265 0.02138 0.01731
## Cumulative Proportion  0.84850 0.88507 0.90984 0.93249 0.95387 0.97118
##                           PC13    PC14    PC15    PC16    PC17
## Standard deviation     0.45279 0.38038 0.23963 0.23397 0.16772
## Proportion of Variance 0.01206 0.00851 0.00338 0.00322 0.00165
## Cumulative Proportion  0.98324 0.99175 0.99513 0.99835 1.00000
plot(ball.pca, type = "l")

By exploring the biplot below, we can observe the similarities of the variables used in the PCA. Below are some interpretations.

A.) SH, SB, X3b, and CS are similar.

B.) AB, G, HBP, X2b, SF, R, GIDP are similar.

C.) RBI, BB, IBB, HR, and SO are similar.

princomp_ball <- predict(ball.pca, newdata = filtered_df[,2:19])
princomp_salry <- as.data.frame(cbind(princomp_ball[,1:10], SALARY =filtered_df$SALARY))

g <- ggbiplot(ball.pca, obs.scale = 2, var.scale = 2, ellipse = TRUE, circle = TRUE)
g+scale_color_discrete(name = '')

Now that we have investigated the results of the PCA, we can move onto the K-Means clustering. We search for 4 groups, with 1,000 iterations. Below, we run the K-Means algorithm and print the results.

kmean.ball <- kmeans(ball.pca$x[,1:4], 4, nstart = 25, iter.max = 1000)
kmean.ball
## K-means clustering with 4 clusters of sizes 20, 29, 9, 26
## 
## Cluster means:
##           PC1        PC2         PC3         PC4
## 1  0.07297104  2.2257201  0.62860493  0.29972126
## 2  2.47617327 -0.4808923 -0.17018232 -0.15184347
## 3 -4.06210917  0.5270141  0.09011183  0.03035923
## 4 -1.41190242 -1.3581405 -0.32491607 -0.07169990
## 
## Clustering vector:
##   8  13  30  34  35  42  45  48  63  76  88 110 111 133 147 162 177 181 
##   3   4   1   4   3   1   1   4   1   3   2   2   4   3   3   2   4   2 
## 202 208 211 214 216 219 225 232 240 248 250 257 274 322 323 325 327 333 
##   2   2   1   2   3   4   2   2   2   4   1   2   1   4   4   4   4   2 
## 350 352 354 356 377 397 398 405 408 414 429 431 437 440 443 490 491 494 
##   2   4   3   1   1   4   2   1   4   4   1   2   1   4   4   1   2   4 
## 517 536 545 549 552 564 624 639 642 667 682 688 699 707 714 723 733 745 
##   1   4   2   1   2   2   4   4   3   4   2   2   2   4   1   1   4   2 
## 746 753 765 776 778 813 827 853 854 862 864 886 
##   4   4   1   1   1   2   3   2   2   2   2   2 
## 
## Within cluster sum of squares by cluster:
## [1] 129.08586 165.13539  22.97795 124.67838
##  (between_SS / total_SS =  55.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

Now that we have these groups to add classification to our data set, let’s take another look at the pairs plots and this time class them by the new k-means groups (1-4). What we ultimately want is to be able to describe separate and distinct groups.

transparentTheme(trans = .6)
featurePlot(x = filtered_df[, regVar[1:5]],
            y = as.factor(kmean.ball$cluster),
            plot = "pairs",
            main="k-means pair plots",
            auto.key = list(columns = 4))

featurePlot(x = filtered_df[, regVar[6:11]],
            y = as.factor(kmean.ball$cluster),
            plot = "pairs",
            main="k-means pair plots",
            auto.key = list(columns = 4))

featurePlot(x = filtered_df[, regVar[12:17]],
            y = as.factor(kmean.ball$cluster),
            plot = "pairs",
            span = .5,
            main="k-means pair plots",
            auto.key = list(columns = 4))

The clusters appear to create relatively distinct clusters for most x,y pairs. Look for instance at the pair HR & SB. Group 1 seems to be decent HR hitters and either not base stealers or only moderately base stealers. Group 4 has very few stolen bases and very few HRs. However, Group 4 also seems to be likely to get walks or sacrifice hits. Below we look at some density plots to get a clearer picture of these groups in hopes of being able to better differentiate them.

featurePlot(x = filtered_df$H,
            y = as.factor(kmean.ball$cluster),
            plot = "density",
            main="H Density Plot",
            auto.key = list(columns = 4))

featurePlot(x = filtered_df$HR,
            y = as.factor(kmean.ball$cluster),
            plot = "density",
            main="HR Density Plot",
            auto.key = list(columns = 4))

featurePlot(x = filtered_df$SB,
            y = as.factor(kmean.ball$cluster),
            plot = "density",
            main="SB Density Plot",
            auto.key = list(columns = 4))

featurePlot(x = filtered_df$SH,
            y = as.factor(kmean.ball$cluster),
            plot = "density",
            main="SH Density Plot",
            auto.key = list(columns = 4))

featurePlot(x = filtered_df$SF,
            y = as.factor(kmean.ball$cluster),
            plot = "density",
            main="SF Density Plot",
            auto.key = list(columns = 4))

featurePlot(x = filtered_df$R,
            y = as.factor(kmean.ball$cluster),
            plot = "density",
            main="R Density Plot",
            auto.key = list(columns = 4))

featurePlot(x = filtered_df$RBI,
            y = as.factor(kmean.ball$cluster),
            plot = "density",
            main="RBI Density Plot",
            auto.key = list(columns = 4))

featurePlot(x = filtered_df$BB,
            y = as.factor(kmean.ball$cluster),
            plot = "density",
            main="BB Density Plot",
            auto.key = list(columns = 4))

From the above plots we can infer the following;

Group 1: Power hitters - hitting home runs, getting RBIs, hitting sacrafice flies but not scrafice hits

Group 2: Hitters for contact and speed - hits and home runs but not as much as group 1 or 4, get RBIs, and SBs

Group 3: Hitters for contact with an eye - Doesn’t hit for distance, least productive at scoring runs, tend to draw walks

Group 4: Heavy hitters - hit home runs but not as good at getting runs in.

Question 3 - Is Salary a good predictor of the probability of hitting 10 or more HRs in a season?

The way to identify the probability of hitting 10 or more HRs in a season is to create a logistic regression. In order to do that, it’s necessary to go back and create a data set appropriate for this analysis. We do not aggregate the data in this instance, but do filter out a minimum of 80 games, salaries over 0, and those observations with no HRs. We create a new variable that is “0” for <10 HRs and “1” for 10 or more.

We then go on to build train and test sets, using 70% for the train data. The results of the logistic regression are below.

logistic_data <- Baseball[,1:23] 
logistic_data <- logistic_data[logistic_data$G>80 & logistic_data$SALARY>0 & logistic_data$HR>0,]
logistic_data <- cbind(logistic_data, HITTER = as.numeric(logistic_data$HR>10))

smp_size <- floor(0.7*nrow(logistic_data))
set.seed(143)
train_ind <- sample(seq_len(nrow(logistic_data)), size = smp_size)
train <- logistic_data[train_ind,]
test <- logistic_data[-train_ind,]

glm.ball <- glm(formula=HITTER~SALARY, family = binomial(link = "logit"), data = logistic_data)
summary(glm.ball)
## 
## Call:
## glm(formula = HITTER ~ SALARY, family = binomial(link = "logit"), 
##     data = logistic_data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -1.226  -1.216   1.129   1.136   1.264  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  1.199e-01  1.354e-01   0.886    0.376
## SALARY      -1.148e-08  1.825e-08  -0.629    0.529
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 501.44  on 361  degrees of freedom
## Residual deviance: 501.04  on 360  degrees of freedom
##   (1 observation deleted due to missingness)
## AIC: 505.04
## 
## Number of Fisher Scoring iterations: 3

The results are not very good. The probabilities assigned to the test data show that most observations are assigned a likelihood of hitting 10 or more homeruns between .46 and .53 - pretty much a coin flip. Salary does not seem to be a very good predictor of hitting 10 or more HRs in a season.

probs <- predict(glm.ball, newdata = test, type = "response")
results <- data.frame(cbind(p = probs, HR = test$HITTER))
describe(results)
## results 
## 
##  2  Variables      109  Observations
## ---------------------------------------------------------------------------
## p 
##       n missing  unique    Info    Mean     .05     .10     .25     .50 
##     109       0      87       1  0.5169  0.4838  0.4912  0.5108  0.5244 
##     .75     .90     .95 
##  0.5285  0.5285  0.5285 
## 
## lowest : 0.4583 0.4649 0.4740 0.4783 0.4833
## highest: 0.5285 0.5285 0.5285 0.5285 0.5285 
## ---------------------------------------------------------------------------
## HR 
##       n missing  unique    Info     Sum    Mean 
##     109       0       2    0.74      49  0.4495 
## ---------------------------------------------------------------------------

Question 4 - Design a conjoint experiment to understand the importance of different player characteristics for coaches.

In order to create a conjoint experiment we might explore characteristics that GMs seem to value such as Tenure, Hit Style, Mindedness, and Passion.

The various responses for each category can be seen below.

#Create attribute list
attribute <- list(
    Tenure=c("Veteran", "Rookie", "Journeyman"),
    HitStyle=c("Hits for Power", "Hits for Contact"),
    Mindedness=c("Offensive Minded", "Defensive Minded"),
    Passion=c("For the game", "For teammates", "For winning")
)
attribute
## $Tenure
## [1] "Veteran"    "Rookie"     "Journeyman"
## 
## $HitStyle
## [1] "Hits for Power"   "Hits for Contact"
## 
## $Mindedness
## [1] "Offensive Minded" "Defensive Minded"
## 
## $Passion
## [1] "For the game"  "For teammates" "For winning"

Below is an example of the profiles that the GMs might be expected to rank.

#develop profiles
profiles <- expand.grid(attribute)
htmlTable(profiles)
Tenure HitStyle Mindedness Passion
1 Veteran Hits for Power Offensive Minded For the game
2 Rookie Hits for Power Offensive Minded For the game
3 Journeyman Hits for Power Offensive Minded For the game
4 Veteran Hits for Contact Offensive Minded For the game
5 Rookie Hits for Contact Offensive Minded For the game
6 Journeyman Hits for Contact Offensive Minded For the game
7 Veteran Hits for Power Defensive Minded For the game
8 Rookie Hits for Power Defensive Minded For the game
9 Journeyman Hits for Power Defensive Minded For the game
10 Veteran Hits for Contact Defensive Minded For the game
11 Rookie Hits for Contact Defensive Minded For the game
12 Journeyman Hits for Contact Defensive Minded For the game
13 Veteran Hits for Power Offensive Minded For teammates
14 Rookie Hits for Power Offensive Minded For teammates
15 Journeyman Hits for Power Offensive Minded For teammates
16 Veteran Hits for Contact Offensive Minded For teammates
17 Rookie Hits for Contact Offensive Minded For teammates
18 Journeyman Hits for Contact Offensive Minded For teammates
19 Veteran Hits for Power Defensive Minded For teammates
20 Rookie Hits for Power Defensive Minded For teammates
21 Journeyman Hits for Power Defensive Minded For teammates
22 Veteran Hits for Contact Defensive Minded For teammates
23 Rookie Hits for Contact Defensive Minded For teammates
24 Journeyman Hits for Contact Defensive Minded For teammates
25 Veteran Hits for Power Offensive Minded For winning
26 Rookie Hits for Power Offensive Minded For winning
27 Journeyman Hits for Power Offensive Minded For winning
28 Veteran Hits for Contact Offensive Minded For winning
29 Rookie Hits for Contact Offensive Minded For winning
30 Journeyman Hits for Contact Offensive Minded For winning
31 Veteran Hits for Power Defensive Minded For winning
32 Rookie Hits for Power Defensive Minded For winning
33 Journeyman Hits for Power Defensive Minded For winning
34 Veteran Hits for Contact Defensive Minded For winning
35 Rookie Hits for Contact Defensive Minded For winning
36 Journeyman Hits for Contact Defensive Minded For winning
design <- caFactorialDesign(data = profiles,type = "fractional", cards = 10)

In order to carry out this example in full, we can simulate observations of ranked profiles.

set.seed(45)

#synthesize respondent data matrix
matrix.new <- matrix(data = 0, nrow = 50, ncol = 10)
#iterate with sample data
for (i in 1:nrow(matrix.new)){
    
    x <- sample(1:10)
    matrix.new[i,] <- x
}

preferences <- matrix.new
attribute_levels <- as.vector(unlist(attribute))

The results of the analysis are below. Given that synthetic data was used, it is not surprising that the model does not appear to be statistically significant. This does not stop us, however, from extracting some methodological value from the interpretation of the results.

The ‘Conjoint’ package returns a normalized version of results, transforming the high and low scores appropriately. The results of the part worths seem to indicate that the inclusion of journeyman in the profile would improve the rank by approximately .5 rank pts. Also, hitting for power improves rank by .34. Offensive minded improves rank by .25. Passion for teammates improves rank by .5 spots. So the optimal profile is one of a Journeyman who hits for power, is offensive minded, and has a passion for his teammates.

#Run Conjoint function
conjoint.analysis <- Conjoint(preferences, design, attribute_levels)
## 
## Call:
## lm(formula = frml)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4,890 -2,362  0,130  2,247  5,247 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5,5017     0,1371  40,121   <2e-16 ***
## factor(x$Tenure)1      -0,2067     0,1803  -1,146    0,252    
## factor(x$Tenure)2      -0,1333     0,1886  -0,707    0,480    
## factor(x$HitStyle)1     0,1700     0,1353   1,257    0,209    
## factor(x$Mindedness)1   0,1250     0,1353   0,924    0,356    
## factor(x$Passion)1     -0,2467     0,1886  -1,308    0,192    
## factor(x$Passion)2      0,2800     0,1803   1,553    0,121    
## ---
## Signif. codes:  0 '***' 0,001 '**' 0,01 '*' 0,05 '.' 0,1 ' ' 1
## 
## Residual standard error: 2,869 on 493 degrees of freedom
## Multiple R-squared:  0,01602,    Adjusted R-squared:  0,004043 
## F-statistic: 1,338 on 6 and 493 DF,  p-value: 0,2386
## [1] "Part worths (utilities) of levels (model parameters for whole sample):"
##              levnms    utls
## 1         intercept  5,5017
## 2           Veteran -0,2067
## 3            Rookie -0,1333
## 4        Journeyman    0,34
## 5    Hits for Power    0,17
## 6  Hits for Contact   -0,17
## 7  Offensive Minded   0,125
## 8  Defensive Minded  -0,125
## 9      For the game -0,2467
## 10    For teammates    0,28
## 11      For winning -0,0333
## [1] "Average importance of factors (attributes):"
## [1] 30,74 17,94 18,36 32,96
## [1] Sum of average importance:  100
## [1] "Chart of average factors importance"
imps <- caImportance(preferences, design)

Overall, as indicated below, GMs appear to place importance on Tenure and Passion over HitStyle and Mindedness.

barplot(imps, main = "Average Importance by Factor", names.arg = c("Tenure"," HitStyle", "Mindness", "Passion"))

APPENDIX: Data Dictionary

– playerID Player ID code

– yearID Year

– stint player’s stint (order of appearances within a season)

– teamID Team

– lgID League

– G Games

– AB At Bats

– R Runs

– H Hits

– 2B Doubles

– 3B Triples

– HR Homeruns

– RBI Runs Batted In

– SB Stolen Bases

– CS Caught Stealing

– BB Base on Balls

– SO Strikeouts

– IBB Intentional walks

– HBP Hit by pitch

– SH Sacrifice hits

– SF Sacrifice flies

– GIDP Grounded into double plays