After a cursory analysis of the data, we decided to aggregate at the player level, summing each player's performance totals across the two seasons. Because a large number of observations had very few games played, we also filtered out players who had not played at least 75% of the games in the two-season period (162 games x 2 = 324), which works out to approximately 240 games. This left 85 high-quality observations, but one was an outlier that appeared to skew the data: the highest-paid player in the data set, with only average performance. It was removed, leaving 84 observations to analyze and model.
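A minimal sketch of this aggregation and filtering, assuming the unaggregated player-season rows live in the Baseball data frame used later for the logistic regression, and assuming dplyr; whether SALARY should be summed or averaged across seasons is itself an assumption here. The structures of the aggregated (pre-filter) and filtered data frames are shown below.

library(dplyr)
#sum each player's totals across the two seasons
aggregated_df <- Baseball %>%
  group_by(playerID) %>%
  summarise(across(c(G, AB, R, H, X2B, X3B, HR, RBI, SB, CS,
                     BB, SO, IBB, HBP, SH, SF, GIDP, SALARY), sum))
#keep players with at least 75% of games over two seasons (0.75 * 324 = 243, roughly 240)
filtered_df <- aggregated_df %>% filter(G >= 240)
#drop the single outlier: the highest-paid player, whose performance was only average
filtered_df <- filtered_df %>% filter(SALARY < max(SALARY))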
## 'data.frame': 1340 obs. of 19 variables:
## $ playerID: chr "aardsda01" "abadfe01" "abreuto01" "aceveal01" ...
## $ G : num 43 108 56 21 256 43 250 50 61 62 ...
## $ AB : num 0 0 142 0 886 140 823 0 132 115 ...
## $ R : num 0 0 21 0 104 10 101 0 15 13 ...
## $ H : num 0 0 37 0 220 27 236 0 25 27 ...
## $ X2B : num 0 0 12 0 45 5 48 0 4 7 ...
## $ X3B : num 0 0 3 0 6 1 5 0 0 0 ...
## $ HR : num 0 0 2 0 18 2 32 0 1 1 ...
## $ RBI : num 0 0 14 0 96 13 119 0 8 8 ...
## $ SB : num 0 0 0 0 10 0 3 0 5 1 ...
## $ CS : num 0 0 2 0 7 0 3 0 1 1 ...
## $ BB : num 0 0 6 0 69 9 49 0 13 6 ...
## $ SO : num 0 0 33 0 162 43 194 0 36 27 ...
## $ IBB : num 0 0 1 0 2 0 5 0 0 1 ...
## $ HBP : num 0 0 1 0 4 2 3 0 0 1 ...
## $ SH : num 0 0 1 0 7 1 0 0 1 3 ...
## $ SF : num 0 0 1 0 3 0 7 0 2 1 ...
## $ GIDP : num 0 0 6 0 16 4 18 0 2 3 ...
## $ SALARY : num 2700000 540000 490000 5000000 6000000 ...
## 'data.frame': 84 obs. of 19 variables:
## $ playerID: chr "ackledu01" "adamsma01" "altuvjo01" "alvarpe01" ...
## $ G : num 256 250 310 274 294 313 287 244 294 268 ...
## $ AB : num 886 823 1286 956 791 ...
## $ R : num 104 101 149 116 74 163 143 107 145 83 ...
## $ H : num 220 236 402 222 188 331 311 254 313 173 ...
## $ X2B : num 45 48 78 35 27 52 42 63 63 34 ...
## $ X3B : num 6 5 5 3 6 5 9 6 9 5 ...
## $ HR : num 18 32 12 54 10 6 9 28 13 16 ...
## $ RBI : num 96 119 111 156 72 108 80 113 122 68 ...
## $ SB : num 10 3 91 10 16 69 37 4 28 16 ...
## $ CS : num 7 3 22 3 3 23 20 1 16 15 ...
## $ BB : num 69 49 68 93 51 98 98 48 59 36 ...
## $ SO : num 162 194 138 299 126 193 89 130 121 227 ...
## $ IBB : num 2 5 12 13 6 1 1 2 5 0 ...
## $ HBP : num 4 3 7 6 3 7 17 5 8 8 ...
## $ SH : num 7 0 5 0 11 25 16 3 11 12 ...
## $ SF : num 3 7 13 4 6 13 4 7 13 2 ...
## $ GIDP : num 16 18 44 28 13 40 14 29 24 16 ...
## $ SALARY : num 6000000 1120000 2250000 930000 5500000 ...
There are some things we don’t know about our data. For instance, we might like to know how many years or seasons each player has been in the league, because players who have proven themselves over several seasons will likely command a higher salary. Disentangling that part of our data might allow a clearer relationship to appear between performance and salary. Moreover, we are not sure when this salary figure is being paid. Due to the nature of pro sports, players are often rewarded with contracts after a period of high performance. Because of this lag, identifying the performance attributable to the period in which the salary was paid might also help uncover relationships between performance and salary.
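As a sketch of how a tenure variable could be derived, assuming access to the full per-season batting history in a data frame (the hypothetical batting_full below, with one row per player-season as in the Lahman database) and using dplyr as in the earlier sketch:

#count distinct seasons per player as a rough measure of tenure
tenure_df <- batting_full %>%   #batting_full is a hypothetical per-season table
  group_by(playerID) %>%
  summarise(tenure = n_distinct(yearID))
#join tenure onto the modeling data so it can enter the salary analysis
filtered_df <- left_join(filtered_df, tenure_df, by = "playerID")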
To observe the interactions of covariates with each other, we first explore a variety of pair plots. Observations are grouped by whether or not the player hit more than 10 home runs. While there are no groundbreaking observations here, one might be surprised by the negative correlation of triples with home runs. Perhaps hitting for power and legging out extra bases like triples are relatively mutually exclusive skills.
chart.Correlation(filtered_df[2:8], pch=21)
chart.Correlation(filtered_df[9:14], pch=21)
chart.Correlation(filtered_df[15:18], pch=21)
Now we look at the target, Salary, versus the variables in our data. In all of these charts, Salary is on the y-axis. Although we do not observe any outright positive linear relationships with the target, we do see slightly positive patterns in high-performance offensive categories such as R and BB. We might expect a more positive relationship between Salary and HR, but there appears to be a cluster of observations where big-time HR hitters are not yet commanding a high salary. Also of note is the kinked loess curve for SO: a moderate number of strikeouts actually appears to have a positive relationship with Salary. This makes a bit of sense, as heavy hitters also tend to have high strikeout rates.
regVar <- c("G","AB", "R", "H", "X2B", "X3B", "HR", "RBI", "SB", "CS", "BB", "SO", "IBB", "HBP", "SH", "SF", "GIDP")
featurePlot(x = filtered_df[, regVar],
y = filtered_df$SALARY,
plot = "scatter",
type = c("p", "smooth"),
layout = c(3, 6), main="Univariate Plots vs Salary")
Finally, we look at correlations among the variables, and specifically the correlations of the covariates with the target. It appears that SB and CS have slightly negative correlations with the target, while G, AB, R, H, X2B, and RBI have slightly positive ones.
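This inspection can be reproduced directly; a minimal sketch in base R:

#correlation of every covariate with the target, sorted from most to least positive
cors <- cor(filtered_df[, 2:19])
sort(cors[, "SALARY"], decreasing = TRUE)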
We have already started to examine the relationship of performance with salary, but to understand the quantitative impact of a specific performance category on a player’s salary we might want to do a linear regression.
Because we are not building a predictive model, but rather trying to understand our data, the decision was made not to use a training and a test set for this part of the analysis.
First, we attempt a regression with all of the variables, knowing that this is likely only a starting point in our evaluation and will not be a “good” model.
lm.ball <- lm(SALARY~., data = filtered_df[,2:19])
summary(lm.ball)
##
## Call:
## lm(formula = SALARY ~ ., data = filtered_df[, 2:19])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9563206 -3587691 -991884 3098990 15209095
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1351164 10741079 -0.126 0.9003
## G 83025 62890 1.320 0.1913
## AB -39642 19097 -2.076 0.0418 *
## R 125612 66534 1.888 0.0634 .
## H 82387 55747 1.478 0.1442
## X2B -57744 101522 -0.569 0.5714
## X3B 72736 226721 0.321 0.7494
## HR -46148 125009 -0.369 0.7132
## RBI -17117 61704 -0.277 0.7823
## SB -83958 52924 -1.586 0.1174
## CS -389510 213331 -1.826 0.0724 .
## BB -54357 34787 -1.563 0.1229
## SO 13179 16863 0.782 0.4373
## IBB -141683 149172 -0.950 0.3457
## HBP 19875 120980 0.164 0.8700
## SH 52528 197200 0.266 0.7908
## SF -94805 253476 -0.374 0.7096
## GIDP -14490 94097 -0.154 0.8781
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6141000 on 66 degrees of freedom
## Multiple R-squared: 0.2746, Adjusted R-squared: 0.08779
## F-statistic: 1.47 on 17 and 66 DF, p-value: 0.1342
The resulting multiple R² of .27 means that our model explains about 27% of the variance in the data, though the adjusted R² of .09 is far less impressive once we account for the 17 predictors. The overall F-test p-value of 13.4% means there is a 13.4% chance of seeing results this extreme or more so if the null hypothesis were true; the null in this case is that there is no statistical relationship between our covariates and our target, and at conventional significance levels we cannot reject it.
And while our model doesn’t seem bad at first glance, we are looking for explanations, and our coefficients both do and do not make sense. Moreover, most of our covariates are not statistically significant. Apparently, in this model players are rewarded for G played (though not significantly so), yet penalized for AB, the one covariate significant at the 5% level (p = .042). This seems counter-intuitive.
To evaluate whether a better model exists, we can try several stepwise approaches. (Note that because we start from the full model, the forward search has nothing to add and simply returns the kitchen-sink fit, which is why two of the models compared in the ANOVA table below are identical.) The backward approach gives us a leaner model with 5 variables; it has a lower overall R² (19%), but its variables are more statistically significant. This is the better model to use, and below we compare it with the previous “kitchen sink” model.
step <- stepAIC(lm.ball, direction = "backward", trace = FALSE)
step1 <- stepAIC(lm.ball, direction = "forward", trace = FALSE)
step2 <- stepAIC(lm.ball, direction = "both", trace = FALSE)
htmlTable(anova(step, step1, step2))
| | Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|---|
| 1 | 78 | 2771167640698472 | | | | |
| 2 | 66 | 2489205343560266 | 12 | 281962297138206 | 0.623 | 0.8152 |
| 3 | 78 | 2771167640698472 | -12 | -281962297138206 | 0.623 | 0.8152 |
lm.ball2 <- lm(SALARY~G + AB + R + SB + BB, data = filtered_df[,2:19])
summary(lm.ball2)
##
## Call:
## lm(formula = SALARY ~ G + AB + R + SB + BB, data = filtered_df[,
## 2:19])
##
## Residuals:
## Min 1Q Median 3Q Max
## -9122339 -4055525 -1100077 3637245 14880708
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5252836 9357515 -0.561 0.57617
## G 81748 53965 1.515 0.13386
## AB -17660 11398 -1.549 0.12533
## R 116584 40841 2.855 0.00552 **
## SB -114314 34447 -3.319 0.00138 **
## BB -76370 27250 -2.803 0.00639 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5961000 on 78 degrees of freedom
## Multiple R-squared: 0.1925, Adjusted R-squared: 0.1407
## F-statistic: 3.718 on 5 and 78 DF, p-value: 0.004526
What we can learn from this model is that for each additional G played, which requires a player to be healthy (itself a form of performance), his salary is predicted to increase by $81,748, holding the other variables constant. For each R, he is compensated with an additional $116,584. Apparently more AB do not contribute to a greater salary, though the coefficient is small and insignificant enough to perhaps be ignored. It is notable that SB have a negative impact on Salary, which is consistent with our correlation observations above. Perhaps SB are associated with more injuries and this trait is therefore not valued by GMs; it is hard to say exactly why.
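To make the coefficient reading concrete, here is a small sketch of the implied salary difference for 10 additional runs over the two-season window, all else held fixed:

#marginal effect of 10 additional runs under lm.ball2
coef(lm.ball2)["R"] * 10
#116584 * 10 = 1,165,840 dollars, holding the other covariates constant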
While we cannot conclusively say that general performance is rewarded with salary, since not all positive performance variables have positive or statistically significant coefficients, the data does seem to indicate that some positive relationship exists. Specifically, we can say there is a positive relationship with R, and this makes a fair bit of intuitive sense: runs lead to wins, and we expect winning to be rewarded. With some of the data-quality improvements identified in the Data Preparation section, we might be able to find more conclusive answers.
In order to identify groups, we can use k-means. Prior to running k-means, however, we use PCA to create uncorrelated variables, so that the distance calculations underlying the clustering are not dominated by blocks of highly correlated features. Here we compute principal components (centered and scaled) and review a visual representation of the results. Using the first four PCs, whose eigenvalues are near or above 1, we capture 70% of the variance; the plot helps to illustrate this point.
ball.pca <- prcomp(filtered_df[,2:18], center = TRUE, scale = TRUE)
summary(ball.pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 2.3996 1.8292 1.3529 0.99989 0.97012 0.88485
## Proportion of Variance 0.3387 0.1968 0.1077 0.05881 0.05536 0.04606
## Cumulative Proportion 0.3387 0.5355 0.6432 0.70201 0.75737 0.80343
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 0.87528 0.78848 0.64894 0.62049 0.60286 0.54247
## Proportion of Variance 0.04507 0.03657 0.02477 0.02265 0.02138 0.01731
## Cumulative Proportion 0.84850 0.88507 0.90984 0.93249 0.95387 0.97118
## PC13 PC14 PC15 PC16 PC17
## Standard deviation 0.45279 0.38038 0.23963 0.23397 0.16772
## Proportion of Variance 0.01206 0.00851 0.00338 0.00322 0.00165
## Cumulative Proportion 0.98324 0.99175 0.99513 0.99835 1.00000
plot(ball.pca, type = "l")
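The decorrelation claim is easy to verify, since PC scores are orthogonal by construction; a quick sketch:

#correlations among the retained PC scores are (numerically) zero off the diagonal
round(cor(ball.pca$x[, 1:4]), 3)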
By exploring the biplot below, we can observe the similarities of the variables used in the PCA. Some interpretations follow (a check against the loadings appears after the list).
A.) SH, SB, X3B, and CS are similar.
B.) AB, G, HBP, X2B, SF, R, and GIDP are similar.
C.) RBI, BB, IBB, HR, and SO are similar.
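These groupings can be sanity-checked against the loadings: variables with similar signs and magnitudes on the leading components point in similar directions on the biplot. A brief sketch:

#loadings of the original variables on the first four components
round(ball.pca$rotation[, 1:4], 2)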
#project the data onto the PCs and attach SALARY for later reference
princomp_ball <- predict(ball.pca, newdata = filtered_df[,2:19])
princomp_salary <- as.data.frame(cbind(princomp_ball[,1:10], SALARY = filtered_df$SALARY))
g <- ggbiplot(ball.pca, obs.scale = 2, var.scale = 2, ellipse = TRUE, circle = TRUE)
g+scale_color_discrete(name = '')
Now that we have investigated the results of the PCA, we can move on to the k-means clustering. We search for 4 groups, using 25 random starts and allowing up to 1,000 iterations. Below, we run the k-means algorithm and print the results.
kmean.ball <- kmeans(ball.pca$x[,1:4], 4, nstart = 25, iter.max = 1000)
kmean.ball
## K-means clustering with 4 clusters of sizes 20, 29, 9, 26
##
## Cluster means:
## PC1 PC2 PC3 PC4
## 1 0.07297104 2.2257201 0.62860493 0.29972126
## 2 2.47617327 -0.4808923 -0.17018232 -0.15184347
## 3 -4.06210917 0.5270141 0.09011183 0.03035923
## 4 -1.41190242 -1.3581405 -0.32491607 -0.07169990
##
## Clustering vector:
## 8 13 30 34 35 42 45 48 63 76 88 110 111 133 147 162 177 181
## 3 4 1 4 3 1 1 4 1 3 2 2 4 3 3 2 4 2
## 202 208 211 214 216 219 225 232 240 248 250 257 274 322 323 325 327 333
## 2 2 1 2 3 4 2 2 2 4 1 2 1 4 4 4 4 2
## 350 352 354 356 377 397 398 405 408 414 429 431 437 440 443 490 491 494
## 2 4 3 1 1 4 2 1 4 4 1 2 1 4 4 1 2 4
## 517 536 545 549 552 564 624 639 642 667 682 688 699 707 714 723 733 745
## 1 4 2 1 2 2 4 4 3 4 2 2 2 4 1 1 4 2
## 746 753 765 776 778 813 827 853 854 862 864 886
## 4 4 1 1 1 2 3 2 2 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 129.08586 165.13539 22.97795 124.67838
## (between_SS / total_SS = 55.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Now that we have these groups to add a classification to our data set, let’s take another look at the pairs plots, this time colored by the new k-means groups (1-4). What we ultimately want is to be able to describe separate and distinct groups.
transparentTheme(trans = .6)
featurePlot(x = filtered_df[, regVar[1:5]],
y = as.factor(kmean.ball$cluster),
plot = "pairs",
main="k-means pair plots",
auto.key = list(columns = 4))
featurePlot(x = filtered_df[, regVar[6:11]],
y = as.factor(kmean.ball$cluster),
plot = "pairs",
main="k-means pair plots",
auto.key = list(columns = 4))
featurePlot(x = filtered_df[, regVar[12:17]],
y = as.factor(kmean.ball$cluster),
plot = "pairs",
span = .5,
main="k-means pair plots",
auto.key = list(columns = 4))
The clustering appears to produce relatively distinct groups for most x,y pairs. Look, for instance, at the pair HR and SB. Group 1 seems to contain decent HR hitters who are at most moderate base stealers. Group 4 has very few stolen bases and very few HRs, but seems likely to draw walks or record sacrifice hits. Below we look at some density plots to get a clearer picture of these groups, in hopes of better differentiating them.
#density plots of selected statistics, one per variable, colored by k-means cluster
for (v in c("H", "HR", "SB", "SH", "SF", "R", "RBI", "BB")) {
  print(featurePlot(x = filtered_df[[v]],
                    y = as.factor(kmean.ball$cluster),
                    plot = "density",
                    main = paste(v, "Density Plot"),
                    auto.key = list(columns = 4)))
}
From the above plots we can infer the following:
Group 1: Power hitters - hit home runs, get RBIs, and hit sacrifice flies but not sacrifice hits.
Group 2: Hitters for contact and speed - get hits and home runs, though not as many as groups 1 or 4, and collect RBIs and SBs.
Group 3: Hitters for contact with an eye - don’t hit for distance, are the least productive at scoring runs, and tend to draw walks.
Group 4: Heavy hitters - hit home runs but are not as good at driving runs in.
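One way to check these labels is to compare per-cluster averages of the raw statistics; a minimal sketch:

#mean of each raw statistic within each k-means cluster
aggregate(filtered_df[, regVar], by = list(cluster = kmean.ball$cluster), FUN = mean)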
One way to estimate the probability of hitting more than 10 HRs in a season is logistic regression. To do that, it is necessary to go back and create a data set appropriate for this analysis. We do not aggregate the data in this instance, but we do filter to players with more than 80 games, a salary greater than 0, and at least one HR. We then create a new variable that is “0” for 10 or fewer HRs and “1” for more than 10.
We then go on to build train and test sets, using 70% for the train data. The results of the logistic regression are below.
#keep the raw (unaggregated) player-season rows
logistic_data <- Baseball[,1:23]
#filter: more than 80 games, positive salary, at least one HR
logistic_data <- logistic_data[logistic_data$G>80 & logistic_data$SALARY>0 & logistic_data$HR>0,]
#binary target: 1 if more than 10 HRs, 0 otherwise
logistic_data <- cbind(logistic_data, HITTER = as.numeric(logistic_data$HR>10))
#70/30 train/test split
smp_size <- floor(0.7*nrow(logistic_data))
set.seed(143)
train_ind <- sample(seq_len(nrow(logistic_data)), size = smp_size)
train <- logistic_data[train_ind,]
test <- logistic_data[-train_ind,]
#note: the model below is fit on the full data set rather than the train rows;
#the predictions further down are made on the held-out test rows
glm.ball <- glm(formula=HITTER~SALARY, family = binomial(link = "logit"), data = logistic_data)
summary(glm.ball)
summary(glm.ball)
##
## Call:
## glm(formula = HITTER ~ SALARY, family = binomial(link = "logit"),
## data = logistic_data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.226 -1.216 1.129 1.136 1.264
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.199e-01 1.354e-01 0.886 0.376
## SALARY -1.148e-08 1.825e-08 -0.629 0.529
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 501.44 on 361 degrees of freedom
## Residual deviance: 501.04 on 360 degrees of freedom
## (1 observation deleted due to missingness)
## AIC: 505.04
##
## Number of Fisher Scoring iterations: 3
The results are not very good. The probabilities assigned to the test data show that most observations are given a likelihood of hitting more than 10 home runs of between .46 and .53 - pretty much a coin flip. Salary does not seem to be a very good predictor of hitting more than 10 HRs in a season.
probs <- predict(glm.ball, newdata = test, type = "response")
results <- data.frame(cbind(p = probs, HR = test$HITTER))
describe(results)
## results
##
## 2 Variables 109 Observations
## ---------------------------------------------------------------------------
## p
## n missing unique Info Mean .05 .10 .25 .50
## 109 0 87 1 0.5169 0.4838 0.4912 0.5108 0.5244
## .75 .90 .95
## 0.5285 0.5285 0.5285
##
## lowest : 0.4583 0.4649 0.4740 0.4783 0.4833
## highest: 0.5285 0.5285 0.5285 0.5285 0.5285
## ---------------------------------------------------------------------------
## HR
## n missing unique Info Sum Mean
## 109 0 2 0.74 49 0.4495
## ---------------------------------------------------------------------------
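To quantify the “coin flip” impression, a sketch of a confusion matrix and accuracy at a 0.5 cutoff on the test set:

#classify test observations at a 0.5 probability threshold
pred_class <- as.numeric(probs > 0.5)
table(predicted = pred_class, actual = test$HITTER)
#overall accuracy on the test set
mean(pred_class == test$HITTER)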
In order to create a conjoint experiment, we might explore characteristics that GMs seem to value, such as Tenure, Hit Style, Mindedness, and Passion.
The various responses for each category can be seen below.
#Create attribute list
attribute <- list(
Tenure=c("Veteran", "Rookie", "Journeyman"),
HitStyle=c("Hits for Power", "Hits for Contact"),
Mindedness=c("Offensive Minded", "Defensive Minded"),
Passion=c("For the game", "For teammates", "For winning")
)
attribute
## $Tenure
## [1] "Veteran" "Rookie" "Journeyman"
##
## $HitStyle
## [1] "Hits for Power" "Hits for Contact"
##
## $Mindedness
## [1] "Offensive Minded" "Defensive Minded"
##
## $Passion
## [1] "For the game" "For teammates" "For winning"
Below is an example of the profiles that the GMs might be expected to rank.
#develop profiles
profiles <- expand.grid(attribute)
htmlTable(profiles)
| | Tenure | HitStyle | Mindedness | Passion |
|---|---|---|---|---|
| 1 | Veteran | Hits for Power | Offensive Minded | For the game |
| 2 | Rookie | Hits for Power | Offensive Minded | For the game |
| 3 | Journeyman | Hits for Power | Offensive Minded | For the game |
| 4 | Veteran | Hits for Contact | Offensive Minded | For the game |
| 5 | Rookie | Hits for Contact | Offensive Minded | For the game |
| 6 | Journeyman | Hits for Contact | Offensive Minded | For the game |
| 7 | Veteran | Hits for Power | Defensive Minded | For the game |
| 8 | Rookie | Hits for Power | Defensive Minded | For the game |
| 9 | Journeyman | Hits for Power | Defensive Minded | For the game |
| 10 | Veteran | Hits for Contact | Defensive Minded | For the game |
| 11 | Rookie | Hits for Contact | Defensive Minded | For the game |
| 12 | Journeyman | Hits for Contact | Defensive Minded | For the game |
| 13 | Veteran | Hits for Power | Offensive Minded | For teammates |
| 14 | Rookie | Hits for Power | Offensive Minded | For teammates |
| 15 | Journeyman | Hits for Power | Offensive Minded | For teammates |
| 16 | Veteran | Hits for Contact | Offensive Minded | For teammates |
| 17 | Rookie | Hits for Contact | Offensive Minded | For teammates |
| 18 | Journeyman | Hits for Contact | Offensive Minded | For teammates |
| 19 | Veteran | Hits for Power | Defensive Minded | For teammates |
| 20 | Rookie | Hits for Power | Defensive Minded | For teammates |
| 21 | Journeyman | Hits for Power | Defensive Minded | For teammates |
| 22 | Veteran | Hits for Contact | Defensive Minded | For teammates |
| 23 | Rookie | Hits for Contact | Defensive Minded | For teammates |
| 24 | Journeyman | Hits for Contact | Defensive Minded | For teammates |
| 25 | Veteran | Hits for Power | Offensive Minded | For winning |
| 26 | Rookie | Hits for Power | Offensive Minded | For winning |
| 27 | Journeyman | Hits for Power | Offensive Minded | For winning |
| 28 | Veteran | Hits for Contact | Offensive Minded | For winning |
| 29 | Rookie | Hits for Contact | Offensive Minded | For winning |
| 30 | Journeyman | Hits for Contact | Offensive Minded | For winning |
| 31 | Veteran | Hits for Power | Defensive Minded | For winning |
| 32 | Rookie | Hits for Power | Defensive Minded | For winning |
| 33 | Journeyman | Hits for Power | Defensive Minded | For winning |
| 34 | Veteran | Hits for Contact | Defensive Minded | For winning |
| 35 | Rookie | Hits for Contact | Defensive Minded | For winning |
| 36 | Journeyman | Hits for Contact | Defensive Minded | For winning |
#select a fractional factorial design of 10 cards from the full factorial
design <- caFactorialDesign(data = profiles, type = "fractional", cards = 10)
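To see which 10 profiles the fractional design retained, the design object can be tabled just like the full factorial; a brief sketch:

#inspect the 10 cards selected by the fractional factorial design
htmlTable(design)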
In order to carry out this example in full, we can simulate observations of ranked profiles.
set.seed(45)
#synthesize respondent data matrix
matrix.new <- matrix(data = 0, nrow = 50, ncol = 10)
#iterate with sample data
for (i in 1:nrow(matrix.new)){
x <- sample(1:10)
matrix.new[i,] <- x
}
preferences <- matrix.new
attribute_levels <- as.vector(unlist(attribute))
The results of the analysis are below. Given that synthetic data was used, it is not surprising that the model does not appear to be statistically significant. This does not stop us, however, from extracting some methodological value from the interpretation of the results.
The ‘Conjoint’ package returns a normalized version of the results, transforming the high and low scores appropriately. Reading the part worths relative to the least-preferred level of each attribute: including Journeyman rather than Veteran improves a profile’s rank by approximately .55 points; Hits for Power beats Hits for Contact by .34; Offensive Minded beats Defensive Minded by .25; and Passion for teammates beats Passion for the game by roughly .53. So the optimal profile is one of a Journeyman who hits for power, is offensive minded, and has a passion for his teammates.
#Run Conjoint function
conjoint.analysis <- Conjoint(preferences, design, attribute_levels)
##
## Call:
## lm(formula = frml)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.890 -2.362 0.130 2.247 5.247
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5017 0.1371 40.121 <2e-16 ***
## factor(x$Tenure)1 -0.2067 0.1803 -1.146 0.252
## factor(x$Tenure)2 -0.1333 0.1886 -0.707 0.480
## factor(x$HitStyle)1 0.1700 0.1353 1.257 0.209
## factor(x$Mindedness)1 0.1250 0.1353 0.924 0.356
## factor(x$Passion)1 -0.2467 0.1886 -1.308 0.192
## factor(x$Passion)2 0.2800 0.1803 1.553 0.121
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.869 on 493 degrees of freedom
## Multiple R-squared: 0.01602, Adjusted R-squared: 0.004043
## F-statistic: 1.338 on 6 and 493 DF, p-value: 0.2386
## [1] "Part worths (utilities) of levels (model parameters for whole sample):"
## levnms utls
## 1 intercept 5.5017
## 2 Veteran -0.2067
## 3 Rookie -0.1333
## 4 Journeyman 0.34
## 5 Hits for Power 0.17
## 6 Hits for Contact -0.17
## 7 Offensive Minded 0.125
## 8 Defensive Minded -0.125
## 9 For the game -0.2467
## 10 For teammates 0.28
## 11 For winning -0.0333
## [1] "Average importance of factors (attributes):"
## [1] 30.74 17.94 18.36 32.96
## [1] Sum of average importance: 100
## [1] "Chart of average factors importance"
imps <- caImportance(preferences, design)
Overall, as indicated below, GMs appear to place more importance on Tenure and Passion than on HitStyle and Mindedness.
barplot(imps, main = "Average Importance by Factor", names.arg = c("Tenure", "HitStyle", "Mindedness", "Passion"))
– playerID Player ID code
– yearID Year
– stint player’s stint (order of appearances within a season)
– teamID Team
– lgID League
– G Games
– AB At Bats
– R Runs
– H Hits
– 2B Doubles (X2B in the data frame)
– 3B Triples (X3B in the data frame)
– HR Homeruns
– RBI Runs Batted In
– SB Stolen Bases
– CS Caught Stealing
– BB Base on Balls
– SO Strikeouts
– IBB Intentional walks
– HBP Hit by pitch
– SH Sacrifice hits
– SF Sacrifice flies
– GIDP Grounded into double plays