DATA 621 Homework #5

Introduction
Data Exploration & Preparation
Model Building & Evaluation
Model Selection

Introduction

We have been given a dataset with information on 12795 commercially available wines. The variables in the dataset are mostly related to the chemical properties of the wine being sold. A large wine manufacturer is studying the data in order to predict the number of wine cases ordered based upon the wine characteristics. If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. The objective is to explore, analyze and build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine.

Data Exploration & Preparation

As previously stated we have 12795 observations for developing the model. The evaluation dataset (which we will set aside for now) consists of 3335 observations. We will begin by taking a look at the modeling data:

summary(df) %>%
  kable() %>%
  kable_styling()

TARGET	FixedAcidity	VolatileAcidity	CitricAcid	ResidualSugar	Chlorides	FreeSulfurDioxide	TotalSulfurDioxide	Density	pH	Sulphates	Alcohol	LabelAppeal	AcidIndex	STARS
Min. :0.000	Min. :-18.100	Min. :-2.7900	Min. :-3.2400	Min. :-127.800	Min. :-1.1710	Min. :-555.00	Min. :-823.0	Min. :0.8881	Min. :0.480	Min. :-3.1300	Min. :-4.70	Min. :-2.000000	Min. : 4.000	Min. :1.000
1st Qu.:2.000	1st Qu.: 5.200	1st Qu.: 0.1300	1st Qu.: 0.0300	1st Qu.: -2.000	1st Qu.:-0.0310	1st Qu.: 0.00	1st Qu.: 27.0	1st Qu.:0.9877	1st Qu.:2.960	1st Qu.: 0.2800	1st Qu.: 9.00	1st Qu.:-1.000000	1st Qu.: 7.000	1st Qu.:1.000
Median :3.000	Median : 6.900	Median : 0.2800	Median : 0.3100	Median : 3.900	Median : 0.0460	Median : 30.00	Median : 123.0	Median :0.9945	Median :3.200	Median : 0.5000	Median :10.40	Median : 0.000000	Median : 8.000	Median :2.000
Mean :3.029	Mean : 7.076	Mean : 0.3241	Mean : 0.3084	Mean : 5.419	Mean : 0.0548	Mean : 30.85	Mean : 120.7	Mean :0.9942	Mean :3.208	Mean : 0.5271	Mean :10.49	Mean :-0.009066	Mean : 7.773	Mean :2.042
3rd Qu.:4.000	3rd Qu.: 9.500	3rd Qu.: 0.6400	3rd Qu.: 0.5800	3rd Qu.: 15.900	3rd Qu.: 0.1530	3rd Qu.: 70.00	3rd Qu.: 208.0	3rd Qu.:1.0005	3rd Qu.:3.470	3rd Qu.: 0.8600	3rd Qu.:12.40	3rd Qu.: 1.000000	3rd Qu.: 8.000	3rd Qu.:3.000
Max. :8.000	Max. : 34.400	Max. : 3.6800	Max. : 3.8600	Max. : 141.150	Max. : 1.3510	Max. : 623.00	Max. :1057.0	Max. :1.0992	Max. :6.130	Max. : 4.2400	Max. :26.50	Max. : 2.000000	Max. :17.000	Max. :4.000
NA	NA	NA	NA	NA’s :616	NA’s :638	NA’s :647	NA’s :682	NA	NA’s :395	NA’s :1210	NA’s :653	NA	NA	NA’s :3359

Two things jump out about this dataset. The first thing is that there are variables with negative values. These negative values don’t make sense in the context of chemical properties of wine. After all, how can you have negative chlorides?

Second, there are quite a few records with missing values. We will need to either impute values or drop the records with incomplete data. These two issues will need to be corrected in the data.

Correcting Invalid Values

We will begin by correcting the invalid values. We will assume the negative values were a data entry issue and a negative value was entered when a positive value was desired. If these negative values were changed into positive numbers, they would be within the range of plausible values based on the other data.

fix_negative_values <- function(df){
  df %>%
    mutate(FixedAcidity = abs(FixedAcidity),
           VolatileAcidity = abs(VolatileAcidity),
           CitricAcid = abs(CitricAcid),
           ResidualSugar = abs(ResidualSugar),
           Chlorides = abs(Chlorides),
           FreeSulfurDioxide = abs(FreeSulfurDioxide),
           TotalSulfurDioxide = abs(TotalSulfurDioxide),
           Sulphates = abs(Sulphates),
           Alcohol = abs(Alcohol))
}
df <- fix_negative_values(df)
hold_out_data <- fix_negative_values(hold_out_data)
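
The sign correction can also be written once for a list of columns rather than one line per variable. The following is a minimal base R sketch on made-up data; the helper name `fix_negative_values_base` and the toy values are assumptions for illustration, not part of the original analysis. Note that LabelAppeal is legitimately signed and is left alone.

```r
# Toy data standing in for the wine data (values are made up).
toy <- data.frame(
  Chlorides   = c(-0.031, 0.046, NA),
  Alcohol     = c(9.0, -4.7, 10.4),
  LabelAppeal = c(-1, 0, 2)  # legitimately signed, must not be flipped
)

# Hypothetical helper: take the absolute value of the listed columns only.
fix_negative_values_base <- function(df, cols) {
  df[cols] <- lapply(df[cols], abs)
  df
}

chem_cols <- c("Chlorides", "Alcohol")
fixed <- fix_negative_values_base(toy, chem_cols)
fixed
```

Missing values pass through unchanged, so this step can run before the imputation step.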

Missing Values

The instructions indicate that sometimes, the fact that a variable is missing is actually predictive of the target. Let’s see what we can learn about TARGET from the missing values.

temp <- df %>%
  select(TARGET, STARS, Alcohol, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, Sulphates) %>%
  gather("variable", "value", -TARGET) %>%
  mutate(na = ifelse(is.na(value), "Yes", "No")) %>%
  group_by(TARGET, na, variable) %>%
  tally() %>%
  ungroup()

 temp <- temp %>%
    group_by(variable, na) %>%
    summarise(total = sum(n)) %>%
    merge(temp) %>%
    mutate(share = n / total)
 
 ggplot(temp, aes(TARGET, share, fill = na)) +
   geom_bar(stat = "identity", position = "dodge") +
   scale_fill_brewer(palette = "Set1") +
   facet_wrap(~variable, ncol = 4) + 
   ylab("Percent of Group")

We learn that the observations that are missing STARS are much more likely to have a zero TARGET. So let’s look at how many stars the wines with a zero TARGET have when STARS is present:

df %>%
  filter(TARGET == 0) %>%
  select(STARS) %>%
  na.omit() %>%
  group_by(STARS) %>%
  tally() %>%
  kable() %>%
  kable_styling()

STARS	n
1	607
2	89

Based on this it looks like we should assign 1 for all the missing STARS.
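
The link between a missing STARS value and a zero TARGET can also be checked numerically instead of read off the plot. The following is a small sketch on made-up data (the `toy` data frame is an assumption, not the wine data) that compares the share of zero-TARGET rows with and without a STARS rating.

```r
# Toy data: TARGET counts and STARS ratings, some of them missing.
toy <- data.frame(
  TARGET = c(0, 0, 0, 3, 4, 0, 2, 5),
  STARS  = c(NA, NA, 1, 2, 3, NA, NA, 4)
)

# Share of rows with TARGET == 0, split by whether STARS is missing.
# The result is named "FALSE" (STARS present) and "TRUE" (STARS missing).
zero_share <- tapply(toy$TARGET == 0, is.na(toy$STARS), mean)
zero_share
```

A large gap between the two shares is the pattern that motivates keeping a separate indicator for imputed STARS values.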

There is little to no information encoded in the NAs for Alcohol, ResidualSugar, Chlorides, FreeSulfurDioxide, TotalSulfurDioxide, pH, and Sulphates. We will impute using the median value.

fix_missing_values <- function(df){
  df %>%
    mutate(
      ResidualSugar = ifelse(is.na(ResidualSugar), median(ResidualSugar, na.rm = T), ResidualSugar),
      Chlorides = ifelse(is.na(Chlorides), median(Chlorides, na.rm = T), Chlorides),
      FreeSulfurDioxide = ifelse(is.na(FreeSulfurDioxide), median(FreeSulfurDioxide, na.rm = T), FreeSulfurDioxide),
      TotalSulfurDioxide = ifelse(is.na(TotalSulfurDioxide), median(TotalSulfurDioxide, na.rm = T), TotalSulfurDioxide),
      pH = ifelse(is.na(pH), median(pH, na.rm = T), pH),
      Sulphates = ifelse(is.na(Sulphates), median(Sulphates, na.rm = T), Sulphates),
      Alcohol = ifelse(is.na(Alcohol), median(Alcohol, na.rm = T), Alcohol),
      STARS_imputed = ifelse(is.na(STARS), 1, 0),
      STARS = ifelse(is.na(STARS), 1, STARS)
    )
}
df <- fix_missing_values(df)
hold_out_data <- fix_missing_values(hold_out_data)

Training Test Split

Next we split the data into training and test sets using a standard 70-30 split:

set.seed(42)
train_index <- createDataPartition(df$TARGET, p = .7, list = FALSE, times = 1)
train <- df[train_index,]
test <- df[-train_index,]

Univariate Analysis

Response Variable

In order to model this as a Poisson process, the mean of the response should be equal to its variance:

train %>%
  summarise(mean = mean(TARGET), variance = var(TARGET)) %>%
  kable() %>%
  kable_styling()

mean	variance
3.028243	3.701446

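They are not the same: the variance is larger than the mean, which suggests overdispersion. When this requirement is not met, the \(\beta\) estimates are still correct but the standard errors will be wrong (too small when the data are overdispersed).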
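
To illustrate what the mean-variance mismatch does to a count model, here is a minimal sketch on simulated data (the simulation settings are assumptions, not the wine data). It fits the same model with the poisson and quasipoisson families and shows that the coefficients agree while the standard errors differ by the square root of the estimated dispersion.

```r
set.seed(1)
n <- 500
x <- rnorm(n)

# Simulated overdispersed counts (negative binomial, made-up parameters).
y <- rnbinom(n, mu = exp(0.5 + 0.3 * x), size = 2)

fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# The point estimates are the same under both families.
same_coef <- isTRUE(all.equal(coef(fit_pois), coef(fit_quasi)))

# The quasi-Poisson standard errors are the Poisson ones times sqrt(dispersion).
phi      <- summary(fit_quasi)$dispersion
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
same_se  <- isTRUE(all.equal(se_quasi, se_pois * sqrt(phi)))

c(same_coef = same_coef, same_se = same_se)

# Raw variance-to-mean ratio of TARGET reported in the table above.
3.701446 / 3.028243
```

The raw ratio (about 1.22) is only a rough guide, since the Poisson assumption concerns the variance conditional on the predictors rather than the marginal variance.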

Predictors

# `predictors` is assumed to be a character vector of predictor column names
dens <- lapply(predictors, FUN=function(var) {
  # Density of the training data with mean (blue), median (red) and quartiles (dashed)
  ggplot(train, aes_string(x = var)) + 
    geom_density(fill = "gray") +
    geom_vline(xintercept = mean(train[[var]]), color = "blue", size=1) +
    geom_vline(xintercept = median(train[[var]]), color = "red", size=1) +
    geom_vline(xintercept = quantile(train[[var]], 0.25), linetype = "dashed", size = 0.5) + 
    geom_vline(xintercept = quantile(train[[var]], 0.75), linetype = "dashed", size = 0.5)
  })
do.call(grid.arrange, args = c(dens, list(ncol = 3)))

Some of the predictors (the chemical composition variables) are right skewed.

Bivariate Analysis

Now let’s examine the correlations between the variables:

corrplot.mixed(cor(train))

Hmm. This suggests that the chemical makeup of the wine has little bearing on the number of cases sold. The label appeal, acid index and star rating have the strongest correlations with TARGET. This also suggests that it may be worth the effort to do a better job filling in the missing values in the STARS variable.

Model Building & Evaluation

Evaluation Approach

In order to assess the models we will return to the original intent of the exercise: If the wine manufacturer can predict the number of cases, then that manufacturer will be able to adjust their wine offering to maximize sales. We will evaluate the models based on the number of cases that are likely to be sold. We will assume that when the model overestimates the cases sold, the extra cases will never be sold and represent a loss. When the model underestimates there is also a loss, but it is unseen: it is potential sales that are not realized.

We will summarize the evaluation by first constructing a confusion-matrix-like summary comparing what our model predicts with what actually occurred. We will also summarize the total number of cases that would be sold, the net cases sold (after adjusting for the glut due to overestimating), and a final adjusted net which factors in “losses” from underestimation.

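evaluate_model <- function(model, test_df, yhat = NULL){
  # Scaffold so that every count from 0 to 8 appears in the confusion matrix
  temp <- data.frame(yhat=c(0:8), TARGET = c(0:8), n=c(0))
  
  # Use the supplied predictions if given; otherwise predict from the model
  if(!is.null(yhat)){
    test_df$yhat <- yhat
  } else {
    test_df$yhat <- round(predict.glm(model, newdata=test_df, type="response"), 0)
  }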
  
  test_df <- test_df %>%
    group_by(yhat, TARGET) %>%
    tally() %>%
    mutate(accuracy = ifelse(yhat > TARGET, "Over", ifelse(yhat < TARGET, "Under", "Accurate"))) %>%
    mutate(cases_sold = ifelse(yhat > TARGET, TARGET, yhat) * n,
           glut = ifelse(yhat > TARGET, yhat - TARGET, 0) * n,
           missed_opportunity = ifelse(yhat < TARGET, TARGET - yhat, 0) * n) %>%
    mutate(net_cases_sold = cases_sold - glut,
           adj_net_cases_sold = cases_sold - glut - missed_opportunity)
  
  results <- test_df %>%
    group_by(accuracy) %>%
    summarise(n = sum(n)) %>%
    spread(accuracy, n)
  
  accurate <- results$Accurate
  over <- results$Over
  under <- results$Under
  
  cases_sold <- sum(test_df$cases_sold)
  net_cases_sold <- sum(test_df$net_cases_sold)
  adj_net_cases_sold <- sum(test_df$adj_net_cases_sold)
  missed_opportunity <- sum(test_df$missed_opportunity)
  glut <- sum(test_df$glut)
  
  confusion_matrix <- test_df %>%
    bind_rows(temp) %>%
    group_by(yhat, TARGET) %>%
    summarise(n = sum(n)) %>%
    spread(TARGET, n, fill = 0)
  
  return(list("confusion_matrix" = confusion_matrix, "results" = results, "df" = test_df, "accurate" = accurate, "over" = over, "under" = under, "cases_sold" = cases_sold, "net_cases_sold" = net_cases_sold, "adj_net_cases_sold" = adj_net_cases_sold, "glut" = glut, "missed_opportunity" = missed_opportunity))
}
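
To make the bookkeeping in evaluate_model concrete, here is a small worked sketch on four made-up prediction/actual pairs. It uses the same definitions as the function (cases sold is the smaller of prediction and actual, glut is the overshoot, missed opportunity is the undershoot); the vectors are illustrative assumptions only.

```r
# Four hypothetical wines: predicted cases vs. cases actually ordered.
yhat   <- c(3, 3, 3, 0)
TARGET <- c(1, 3, 5, 2)

cases_sold         <- pmin(yhat, TARGET)      # can only sell what was both stocked and wanted
glut               <- pmax(yhat - TARGET, 0)  # stocked but never sold
missed_opportunity <- pmax(TARGET - yhat, 0)  # wanted but not stocked

net_cases_sold     <- cases_sold - glut
adj_net_cases_sold <- net_cases_sold - missed_opportunity

data.frame(yhat, TARGET, cases_sold, glut, missed_opportunity,
           net_cases_sold, adj_net_cases_sold)
```

The first wine is overestimated (one case sold, two wasted), the second is exact, and the last two are underestimated, so only the missed-opportunity column penalizes them.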

Chemical Property Model

We will begin by constructing a count regression model based on the chemical properties of the wine. We therefore exclude the label appeal and the star rating, as well as the acid index, since it is a weighted average of the acidity variables. Based on our bivariate analysis we would not expect this model to perform well. Since the mean is not equal to the variance we will be using the quasi-Poisson model.

train1 <- train %>%
  select(-LabelAppeal, -AcidIndex, -STARS, -STARS_imputed)
model1 <- glm(TARGET ~ ., family = quasipoisson, train1)
summary(model1)


Call:
glm(formula = TARGET ~ ., family = quasipoisson, data = train1)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7505  -0.7187   0.1491   0.7245   2.5330  

Coefficients:
                      Estimate  Std. Error t value         Pr(>|t|)    
(Intercept)         1.73286580  0.25394606   6.824 0.00000000000945 ***
FixedAcidity       -0.00587468  0.00135015  -4.351 0.00001369486133 ***
VolatileAcidity    -0.08484398  0.01249535  -6.790 0.00000000001192 ***
CitricAcid          0.00947082  0.01107239   0.855          0.39238    
ResidualSugar       0.00023887  0.00027490   0.869          0.38492    
Chlorides          -0.09293940  0.02997436  -3.101          0.00194 ** 
FreeSulfurDioxide   0.00013301  0.00006294   2.113          0.03459 *  
TotalSulfurDioxide  0.00010001  0.00004135   2.419          0.01559 *  
Density            -0.64377662  0.25159620  -2.559          0.01052 *  
pH                 -0.00219600  0.01001878  -0.219          0.82651    
Sulphates          -0.02528552  0.01096751  -2.305          0.02116 *  
Alcohol             0.01099055  0.00186578   5.891 0.00000000398672 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.216629)

    Null deviance: 15984  on 8957  degrees of freedom
Residual deviance: 15817  on 8946  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

model1_results <- evaluate_model(model1, test)

The following is a confusion-matrix-like view of the model, with the predicted cases going down the rows and the actual cases going across the columns.

model1_results$confusion_matrix %>% 
  kable() %>% 
  kable_styling()

yhat	0	1	2	3	4	5	6	7	8
0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0
2	17	2	6	14	15	3	2	1	0
3	792	74	314	751	914	581	216	45	8
4	11	0	4	18	24	20	4	1	0
5	0	0	0	0	0	0	0	0	0
6	0	0	0	0	0	0	0	0	0
7	0	0	0	0	0	0	0	0	0
8	0	0	0	0	0	0	0	0	0

The model is accurate only 781 out of 3837 times. It overestimates 1232 times and underestimates 1824 times. If decisions were based on this model, one would expect to sell 8589 cases. If we subtract the extra cases due to overestimation we would have 5645 net cases sold. If we also subtract the cases that could have been sold had the model not underestimated, we arrive at an adjusted net of 2604 cases.

Second Chemical Property Model

We will try one more chemical property model, but select only the variables that were most significant (p < 0.001) in the previous model (FixedAcidity, VolatileAcidity and Alcohol).

model2 <- glm(TARGET ~ FixedAcidity + VolatileAcidity + Alcohol, family = quasipoisson, train)
summary(model2)


Call:
glm(formula = TARGET ~ FixedAcidity + VolatileAcidity + Alcohol, 
    family = quasipoisson, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7172  -0.7017   0.1321   0.7044   2.5983  

Coefficients:
                 Estimate Std. Error t value             Pr(>|t|)    
(Intercept)      1.094133   0.024678  44.335 < 0.0000000000000002 ***
FixedAcidity    -0.005920   0.001349  -4.389     0.00001153301758 ***
VolatileAcidity -0.086542   0.012496  -6.925     0.00000000000464 ***
Alcohol          0.010932   0.001865   5.861     0.00000000475204 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 1.216481)

    Null deviance: 15984  on 8957  degrees of freedom
Residual deviance: 15859  on 8954  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

model2_results <- evaluate_model(model2, test)

We would expect to have 8574 cases sold using this model (with a net of 5643 and adjusted net of 2587). Looking at the confusion matrix we see this model is accurate only 784 out of 3837 times.

model2_results$confusion_matrix %>% 
  kable() %>% 
  kable_styling()

yhat	0	1	2	3	4	5	6	7	8
0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0
2	11	2	3	5	6	2	3	0	0
3	806	73	320	772	938	598	217	47	8
4	3	1	1	6	9	4	2	0	0
5	0	0	0	0	0	0	0	0	0
6	0	0	0	0	0	0	0	0	0
7	0	0	0	0	0	0	0	0	0
8	0	0	0	0	0	0	0	0	0

Neither chemical property model performs well, and the two differ only marginally (8589 vs. 8574 cases sold, 781 vs. 784 accurate predictions). Building a model based on the chemical properties does not seem like a good solution. Let’s turn our attention to the other variables with stronger correlations.

Perceived Quality Model

This model relies on the consumer perception of quality based on the label, and the star ratings of the experts.

model3 <- glm(TARGET ~ LabelAppeal + STARS, family = quasipoisson, train)
summary(model3)


Call:
glm(formula = TARGET ~ LabelAppeal + STARS, family = quasipoisson, 
    data = train)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-2.83925  -0.67249   0.08322   0.53669   2.77111  

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 0.426188   0.013822   30.83 <0.0000000000000002 ***
LabelAppeal 0.138170   0.006936   19.92 <0.0000000000000002 ***
STARS       0.345702   0.006095   56.72 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 0.9132717)

    Null deviance: 15984  on 8957  degrees of freedom
Residual deviance: 11780  on 8955  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 5

model3_results <- evaluate_model(model3, test)

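We would expect to have 9342 cases sold using this model (with a net of 7071 and adjusted net of 4783), and it is accurate 979 out of 3837 times. Here is a confusion matrix for this model: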

model3_results$confusion_matrix %>% 
  kable() %>% 
  kable_styling()

yhat	0	1	2	3	4	5	6	7	8
0	0	0	0	0	0	0	0	0	0
1	0	0	0	0	0	0	0	0	0
2	771	74	256	405	286	89	11	2	0
3	44	2	62	276	311	92	15	3	0
4	5	0	5	97	261	237	59	4	0
5	0	0	1	5	62	117	43	7	0
6	0	0	0	0	26	26	47	7	3
7	0	0	0	0	7	41	32	18	1
8	0	0	0	0	0	2	15	6	4
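
The absence of 0 and 1 predictions in the matrix above follows from the log link: the fitted mean is the exponential of the linear predictor, and with only LabelAppeal and STARS in the model its range is limited. The sketch below plugs the coefficients printed in the model summary into that formula for the extreme predictor values seen in the data summary (LabelAppeal from -2 to 2, STARS from 1 to 4); the helper name `predict_mean` is an assumption for illustration.

```r
# Coefficients copied from the model summary above.
b <- c(intercept = 0.426188, LabelAppeal = 0.138170, STARS = 0.345702)

# Fitted mean under a log link: exp(linear predictor).
predict_mean <- function(label_appeal, stars) {
  exp(b[["intercept"]] + b[["LabelAppeal"]] * label_appeal + b[["STARS"]] * stars)
}

lowest  <- predict_mean(label_appeal = -2, stars = 1)  # least appealing, lowest rated
highest <- predict_mean(label_appeal =  2, stars = 4)  # most appealing, highest rated

# Rounded the same way as in evaluate_model.
round(c(lowest = lowest, highest = highest))
```

Even the least favorable combination rounds to 2 cases, so this model cannot produce a prediction of 0 or 1.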

Perceived Quality Plus Model

We will begin with the preceding model but add in the acid index and the flag indicating whether STARS was imputed.

model4 <- glm(TARGET ~ LabelAppeal + STARS + AcidIndex + STARS_imputed, family = quasipoisson, train)
summary(model4)


Call:
glm(formula = TARGET ~ LabelAppeal + STARS + AcidIndex + STARS_imputed, 
    family = quasipoisson, data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.1907  -0.6846  -0.0020   0.4711   3.7802  

Coefficients:
               Estimate Std. Error t value            Pr(>|t|)    
(Intercept)    1.512852   0.042068   35.96 <0.0000000000000002 ***
LabelAppeal    0.162061   0.006881   23.55 <0.0000000000000002 ***
STARS          0.189468   0.006835   27.72 <0.0000000000000002 ***
AcidIndex     -0.084084   0.005015  -16.77 <0.0000000000000002 ***
STARS_imputed -0.825021   0.020442  -40.36 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 0.8847836)

    Null deviance: 15984.3  on 8957  degrees of freedom
Residual deviance:  9727.3  on 8953  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

model4_results <- evaluate_model(model4, test)

Let’s see how the predictions line up with the reality:

model4_results$confusion_matrix %>% 
  kable() %>% 
  kable_styling()

yhat	0	1	2	3	4	5	6	7	8
0	0	0	0	0	0	0	0	0	0
1	548	42	92	137	59	19	6	1	0
2	129	29	90	72	31	15	5	3	0
3	113	5	130	420	367	91	4	0	0
4	30	0	11	146	387	288	59	2	0
5	0	0	1	7	96	136	63	9	0
6	0	0	0	1	13	48	63	24	4
7	0	0	0	0	0	7	19	7	4
8	0	0	0	0	0	0	3	1	0

We would expect to have 9671 cases sold using this model (with a net of 7839 and adjusted net of 5880). Looking at the confusion matrix we see this model is accurate 1145 out of 3837 times.

There are no zero predictions in this model, which is not realistic given how many wines sold zero cases. We will try to address this in our last model.
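
One reason zero predictions are rare is that the prediction is the rounded fitted mean, which only becomes 0 when the mean falls below 0.5. The sketch below uses made-up fitted means and, as a simplifying assumption, a plain Poisson distribution (the quasi-Poisson fit itself does not define a full distribution) to compare the rounded prediction with the implied probability of a zero count.

```r
# Hypothetical fitted means for three wines.
mu <- c(0.4, 1.2, 3.0)

rounded <- round(mu)     # what evaluate_model would use as the prediction
p_zero  <- dpois(0, mu)  # Poisson probability of selling zero cases, exp(-mu)

data.frame(mu, rounded, p_zero)
```

A wine with a fitted mean of 1.2 is predicted to sell 1 case even though a Poisson model with that mean assigns roughly a 30% chance to zero sales.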

Adjusted Perceived Quality Plus Model

We will start with the predictions from the Perceived Quality Plus Model. We will then manually override the predictions for wines whose STARS value was imputed, setting them to zero.

temp <- test
temp$yhat <- round(predict.glm(model4, newdata=temp, type="response"), 0)
temp <- temp %>%
  mutate(yhat = ifelse(STARS_imputed == 1, 0, yhat))
model5_results <- evaluate_model(model4, test, temp$yhat)

model5_results$confusion_matrix %>% 
  kable() %>% 
  kable_styling()

yhat	0	1	2	3	4	5	6	7	8
0	617	42	92	137	68	29	10	4	0
1	1	0	0	1	0	0	0	0	0
2	59	29	90	71	22	5	1	0	0
3	113	5	130	420	367	91	4	0	0
4	30	0	11	146	387	288	59	2	0
5	0	0	1	7	96	136	63	9	0
6	0	0	0	1	13	48	63	24	4
7	0	0	0	0	0	7	19	7	4
8	0	0	0	0	0	0	3	1	0

We see that applying the adjustment for imputed stars increased the number of correct zero predictions. It also introduced some underestimation.

We would expect to have 9262 cases sold using this model (with a net of 8117 and adjusted net of 5749). Looking at the confusion matrix we see this model is accurate 1720 out of 3837 times.

Model Selection

The following is a summary of the five models:

Model	Number of Cases Sold	Number of Cases Overestimated to Sell	Missed Opportunities	Times Accurate
Chemical Property Model 1	8589	2944	3041	781
Chemical Property Model 2	8574	2931	3056	784
Perceived Quality Model	9342	2271	2288	979
Perceived Quality Plus Model	9671	1832	1959	1145
Adjusted Perceived Quality Plus Model	9262	1145	2368	1720
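
The net figures quoted for each model follow directly from this table. As a quick check, the sketch below re-enters the table values by hand (the short column names are assumptions) and recomputes net cases sold and the adjusted net.

```r
# Values copied from the summary table above.
summary_tbl <- data.frame(
  model  = c("Chemical Property Model 1", "Chemical Property Model 2",
             "Perceived Quality Model", "Perceived Quality Plus Model",
             "Adjusted Perceived Quality Plus Model"),
  sold   = c(8589, 8574, 9342, 9671, 9262),
  over   = c(2944, 2931, 2271, 1832, 1145),
  missed = c(3041, 3056, 2288, 1959, 2368)
)

# Net cases sold removes the glut; the adjusted net also removes missed sales.
summary_tbl$net     <- summary_tbl$sold - summary_tbl$over
summary_tbl$adj_net <- summary_tbl$net - summary_tbl$missed

summary_tbl[, c("model", "net", "adj_net")]
```

On net cases sold the adjusted model ranks first (8117), while on the adjusted net the Perceived Quality Plus Model is slightly ahead (5880 vs. 5749).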

The Perceived Quality Plus Model has the most cases sold of all the models. However, the Adjusted Perceived Quality Plus Model has the highest accuracy. One reason the Perceived Quality Plus Model predicted more cases sold is that we forced more zero predictions in the Adjusted Perceived Quality Plus Model.

If we subtract out the waste from overestimation, the Adjusted Perceived Quality Plus Model is the best model. This has real business consequences, as having extra cases of wine on hand that don’t sell becomes a real cost to a business. For these two reasons it will be the model of choice in producing estimates for the hold out dataset.

hold_out_data$TARGET <- round(predict.glm(model4, newdata=hold_out_data, type="response"), 0)
hold_out_data %>%
  mutate(TARGET = ifelse(STARS_imputed == 1, 0, TARGET)) %>%
  write.csv(., "./data/hw5-predictions.csv", row.names = F)

From Work by Critical Thinking Group 3