head(modelData)
##   total_spend est_Total_pop Total.Households Median_Income Distance
## 1   132332.15         37528            12663         85448 12.80761
## 2    77185.73         33200            10652         80757 15.20030
## 3    45292.68         34357            12033         70673 13.87752
## 4    90731.11         20499             6859        154331 12.87378
## 5     8479.08         26265             7972         59743 18.98324
## 6    27175.28         42013            15029         69872 17.04709
##     Males est_18.21 est_21.62     white AfricanAmerican     Asian    Other
## 1 1872647  116336.8   2109074 1853883.2        983233.6  202651.2 487864.0
## 2 1590280  106240.0   1809400  999320.0         33200.0 1972080.0 166000.0
## 3 1618215   79021.1   1820921  931074.7         54971.2 2123262.6 216449.1
## 4 1018800   61497.0   1020850 1285287.3          4099.8  617019.9  38948.1
## 5 1221323  120819.0   1460334 1449828.0        207493.5  417613.5 404481.0
## 6 2083845  151246.8   2436754 2819072.3        205863.7  575578.1 390720.9

The data we are using are grouped by zip code around LA. We would like to build a model that predicts total dollars spent on Clippers games from household and demographic data.

Diagnostics

The first plot above shows a linear model's fitted values against its residuals. The points should be scattered at random, but as you can see there is a downward pattern, which violates the linear model assumption that the residuals are unrelated to the fitted values.

The Cook's distance plot shows us there are no highly influential observations, which is good.
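
As a rough sketch, both diagnostic plots can be produced in base R as shown below. The data frame `sim` and its columns are simulated stand-ins, not the zip-code data, and the 4/n cutoff for Cook's distance is only a common rule of thumb.

```r
# Simulated stand-in for the zip-code data (hypothetical values, not the real data set)
set.seed(1)
n <- 200
sim <- data.frame(
  income   = rnorm(n, mean = 80000, sd = 20000),
  distance = runif(n, min = 5, max = 40)
)
# Right-skewed response, so a plain linear model is misspecified on purpose
sim$spend <- exp(9 + 0.00002 * sim$income - 0.08 * sim$distance + rnorm(n))

fit <- lm(spend ~ income + distance, data = sim)

# Residuals vs fitted values: look for trends or funnel shapes
plot(fit, which = 1)
# Cook's distance: look for bars that tower over the rest
plot(fit, which = 4)

# Numeric version of the Cook's distance check (4/n is a rule of thumb, not a hard limit)
cooks <- cooks.distance(fit)
influential <- which(cooks > 4 / n)
```

Any indices in `influential` are worth inspecting by hand before deciding whether to keep them.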

## summary statistics
## ------
## min:  0   max:  2269091 
## median:  41478.6 
## mean:  113314.4 
## estimated sd:  219713.1 
## estimated skewness:  5.166051 
## estimated kurtosis:  40.92266

The Cullen and Frey graph is telling us the data may follow a Gamma or Beta distribution, but it cannot be Beta, since a Beta distribution only takes values in the interval [0, 1] and our data range up to 2269091.
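
A Cullen and Frey graph places the sample at the point (squared skewness, kurtosis) and compares it with where known distributions fall. A normal distribution sits at (0, 3), and every Gamma distribution lies on the line kurtosis = 3 + 1.5 * skewness^2. With the values reported above, 3 + 1.5 * 5.166^2 is about 43.0, close to the observed kurtosis of 40.9. The sketch below computes the two coordinates with simple moment estimators on a simulated sample. The helper `moment_shape` is hypothetical, and its results can differ slightly from bias-corrected estimates.

```r
set.seed(2)
# Hypothetical right-skewed sample standing in for total_spend
x <- rgamma(300, shape = 0.5, rate = 1 / 200000)

# Plain moment estimators of skewness and kurtosis (no bias correction)
moment_shape <- function(v) {
  m  <- mean(v)
  m2 <- mean((v - m)^2)
  m3 <- mean((v - m)^3)
  m4 <- mean((v - m)^4)
  c(skewness = m3 / m2^1.5, kurtosis = m4 / m2^2)
}

shape_x <- moment_shape(x)
shape_x

# Kurtosis a Gamma distribution would have at this skewness
gamma_line <- 3 + 1.5 * shape_x[["skewness"]]^2
```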

The last plot is a density plot. As you can see, it is not bell-shaped, so the data are not normally distributed and we will have to work out which distribution they follow.

qqnorm(lmout$residuals)
qqline(lmout$residuals, col = "red")

z <- rnorm(100, mean = 0 , sd = 1)
qqnorm(z)
qqline(z, col = "red")

The first plot above is the Q-Q plot of our model's residuals, and the second is what a sample from a normal distribution with mean 0 and standard deviation 1 should look like. As you can see, the top plot is skewed, which again tells us the data are not normal.

After trying to fit a Gamma model I found about 10 values that are zero, which is not allowed in a Gamma distribution. So above you will see that the left plot is our data and the right plot is the logarithm of our data. This suggests the data have a lognormal distribution, because if X is lognormally distributed then Y = ln(X) is normally distributed (conversely, X = e^Y). And as you can see, the plot on the right is bell-shaped, which looks normal.
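
Here is a minimal sketch of the same check on simulated data, assuming a strictly positive sample `x` drawn from a lognormal distribution (hypothetical parameters). The Shapiro-Wilk test is my own addition as a numeric companion to the plots and was not part of the analysis above.

```r
set.seed(3)
# Hypothetical strictly positive spend values
x <- rlnorm(500, meanlog = 10, sdlog = 1.5)
y <- log(x)

# Side-by-side densities: raw scale on the left, log scale on the right
op <- par(mfrow = c(1, 2))
plot(density(x), main = "Raw")
plot(density(y), main = "Log")
par(op)

# Shapiro-Wilk normality test: a tiny p-value is evidence against normality
p_raw <- shapiro.test(x)$p.value
p_log <- shapiro.test(y)$p.value

# Zeros have no finite logarithm, which is why they need special handling
log_of_zero <- log(0)
```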

I will fit 3 models and see which one is best.

1: The first model uses a log(x + 1) transformation, which lets the zero values stay as they are.

2: The second model uses a Box-Cox transformation, which chooses the power for transforming our response variable by maximum likelihood estimation. The transformation needs a strictly positive response, so I will also replace the 0 values in the data with .00000001.

3: The last model will just be a log(x) transformation, with the zero values replaced by .00000001 as in the 2nd model.
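
To make the Box-Cox step concrete, here is a small sketch on simulated data. The variables `x` and `y` and the helper `box_cox` are hypothetical. The response is built so that its logarithm is linear in `x`, which means the estimated lambda should land near 0. Note that with `plotit = FALSE` the search runs over the supplied grid only, so the grid spacing limits the precision of `lambda_hat`.

```r
library(MASS)  # boxcox() lives in MASS, which ships with standard R installations

set.seed(4)
n <- 200
x <- runif(n, 1, 10)
# Hypothetical positive response whose logarithm is linear in x
y <- exp(1 + 0.3 * x + rnorm(n, sd = 0.5))

# Profile log-likelihood over a grid of lambda values (no plot drawn)
bc <- boxcox(y ~ x, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]

# Box-Cox transform: (y^lambda - 1) / lambda, with log(y) as the limit at lambda = 0
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

fit_bc <- lm(box_cox(y, lambda_hat) ~ x)
```

Fitting `y^lambda` directly, as done in this report, differs from the full Box-Cox form only by a shift and a rescaling of the response, so the fitted relationship is equivalent.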

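library(MASS)  # provides boxcox()

# Model 1: log(x + 1) response; glm() with its default gaussian family fits the same model as lm()
lmout <- glm(log(total_spend + 1) ~ ., data = modelData)
# Replace exact zeros with a tiny positive value so the Box-Cox and log transforms are defined
modelData$total_spend[modelData$total_spend == 0] <- .00000001
bc <- boxcox(total_spend ~ ., data = modelData)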

lambda <- bc$x[which.max(bc$y)]
lambda
## [1] 0.1818182
lmout1 <- lm(total_spend^(0.1818182) ~ ., data = modelData)

lmout2 <- lm(log(total_spend) ~ ., data = modelData)

AIC(lmout)
## [1] 1097.223
AIC(lmout1)
## [1] 1095.863
AIC(lmout2)
## [1] 1563.067

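It looks like model 2 is the best model, with the lowest AIC value. I used a lambda value of ~0.18, which is where the log-likelihood plot reaches its maximum. Keep in mind that the three models transform the response differently, so their AIC values are not on the same scale and this comparison is only a rough guide. The much larger AIC of model 3 also reflects the replaced zeros, which become extreme values of about -18.4 on the log scale.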
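
One way to put the three AIC values on a common footing is to add the Jacobian of each transformation, which turns the likelihood of the transformed response g(y) back into a likelihood for y itself: AIC_y = AIC_g - 2 * sum(log|g'(y)|). The sketch below shows this on simulated, strictly positive data. The variables and the fixed `lambda` are hypothetical, and this adjustment was not applied to the AIC values reported above.

```r
set.seed(5)
n <- 200
x <- runif(n, 1, 10)
# Hypothetical strictly positive response
y <- exp(1 + 0.3 * x + rnorm(n, sd = 0.5))
lambda <- 0.18  # illustrative power, close to the value found above

fit_log1p <- lm(log(y + 1) ~ x)
fit_power <- lm(y^lambda ~ x)
fit_log   <- lm(log(y) ~ x)

aic_raw <- c(log1p = AIC(fit_log1p), power = AIC(fit_power), log = AIC(fit_log))

# Derivatives: log(y + 1) gives 1 / (y + 1); y^lambda gives lambda * y^(lambda - 1); log(y) gives 1 / y
aic_adjusted <- c(
  log1p = AIC(fit_log1p) + 2 * sum(log(y + 1)),
  power = AIC(fit_power) - 2 * sum(log(lambda) + (lambda - 1) * log(y)),
  log   = AIC(fit_log)   + 2 * sum(log(y))
)

rbind(aic_raw, aic_adjusted)
```

After the adjustment all three values refer to the likelihood of `y` on its original scale, so the smallest one can be compared directly.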

## 
## Call:
## lm(formula = total_spend^(0.1818182) ~ ., data = modelData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8641 -0.9694 -0.0577  0.8809  5.3442 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.656e+00  4.347e-01  13.013  < 2e-16 ***
## est_Total_pop    -1.260e-04  1.652e-04  -0.763   0.4464    
## Total.Households  2.270e-04  6.932e-05   3.275   0.0012 ** 
## Median_Income     3.396e-05  4.991e-06   6.804 6.58e-11 ***
## Distance         -1.235e-01  1.154e-02 -10.707  < 2e-16 ***
## Males             2.458e-06  1.777e-06   1.383   0.1677    
## est_18.21         1.117e-06  1.162e-06   0.961   0.3374    
## est_21.62        -1.189e-06  9.875e-07  -1.204   0.2297    
## white             4.443e-07  1.323e-06   0.336   0.7374    
## AfricanAmerican   3.613e-07  1.386e-06   0.261   0.7945    
## Asian             2.024e-07  1.347e-06   0.150   0.8807    
## Other            -1.826e-07  1.323e-06  -0.138   0.8903    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.66 on 269 degrees of freedom
## Multiple R-squared:  0.4805, Adjusted R-squared:  0.4593 
## F-statistic: 22.62 on 11 and 269 DF,  p-value: < 2.2e-16