head(modelData)
## total_spend est_Total_pop Total.Households Median_Income Distance
## 1 132332.15 37528 12663 85448 12.80761
## 2 77185.73 33200 10652 80757 15.20030
## 3 45292.68 34357 12033 70673 13.87752
## 4 90731.11 20499 6859 154331 12.87378
## 5 8479.08 26265 7972 59743 18.98324
## 6 27175.28 42013 15029 69872 17.04709
## Males est_18.21 est_21.62 white AfricanAmerican Asian Other
## 1 1872647 116336.8 2109074 1853883.2 983233.6 202651.2 487864.0
## 2 1590280 106240.0 1809400 999320.0 33200.0 1972080.0 166000.0
## 3 1618215 79021.1 1820921 931074.7 54971.2 2123262.6 216449.1
## 4 1018800 61497.0 1020850 1285287.3 4099.8 617019.9 38948.1
## 5 1221323 120819.0 1460334 1449828.0 207493.5 417613.5 404481.0
## 6 2083845 151246.8 2436754 2819072.3 205863.7 575578.1 390720.9
The data we are using are grouped by ZIP code around LA. We would like to build a model that predicts the total dollars spent on Clippers games (total_spend) from household and demographic data.
Diagnostics

The first plot above shows a linear model's fitted values against its residuals. This should look like a random scatter of points around zero, but as you can see there is a downward pattern, which violates the linear model's assumptions about the errors.
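As a rough illustration of this diagnostic, the sketch below fits a linear model to a skewed, non-negative response and plots residuals against fitted values. The data frame `sim` and its coefficients are made up for the example and only borrow two column names from `modelData`; this is not the model fitted in this report.

```r
# Synthetic stand-in for modelData (hypothetical values, not the real data)
set.seed(1)
n <- 200
sim <- data.frame(
  Median_Income = runif(n, 40000, 160000),
  Distance      = runif(n, 5, 25)
)

# Right-skewed, non-negative response, loosely mimicking total_spend
sim$total_spend <- rlnorm(
  n,
  meanlog = 10 + 1e-5 * sim$Median_Income - 0.05 * sim$Distance,
  sdlog   = 1
)

fit <- lm(total_spend ~ Median_Income + Distance, data = sim)

# Residuals vs fitted: ideally a structureless cloud centred on zero
plot(fitted(fit), residuals(fit),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
```

With a skewed response like this one, the spread of the residuals typically changes with the fitted values, which is the kind of structure the plot is meant to expose.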

The Cook's distance plot shows no unduly influential observations, which is good.
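For reference, Cook's distance can also be inspected numerically. The sketch below uses simulated data with one deliberately planted influential point. The `4 / n` cutoff is a common rule of thumb and an assumption here, not a threshold used in this report.

```r
set.seed(2)
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

# Plant one influential point: extreme in x and far off the trend
x[n] <- 6
y[n] <- -10

fit <- lm(y ~ x)
cd  <- cooks.distance(fit)

# 4 / n is a rule-of-thumb cutoff, not a formal test
flagged <- which(cd > 4 / n)
print(flagged)

# Base R's Cook's distance plot for the same fit
plot(fit, which = 4)
```

If `flagged` is empty for a real model, that matches the reading of the plot described above.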

## summary statistics
## ------
## min: 0 max: 2269091
## median: 41478.6
## mean: 113314.4
## estimated sd: 219713.1
## estimated skewness: 5.166051
## estimated kurtosis: 40.92266
The Cullen and Frey graph suggests the data may follow a Gamma or Beta distribution, but it cannot be Beta, since a Beta distribution only takes values between 0 and 1.
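A Cullen and Frey graph places the sample by its squared skewness and its kurtosis, so the two statistics can be checked by hand. The sketch below is a minimal base R version using plain sample moments on a simulated Gamma sample. The helper `skew_kurt` is hypothetical, and the "estimated" values printed above (which look like `fitdistrplus::descdist` output, though that is an assumption) may use bias-corrected formulas and differ slightly.

```r
# Moment-based skewness and kurtosis from plain sample moments
skew_kurt <- function(x) {
  m  <- mean(x)
  m2 <- mean((x - m)^2)
  m3 <- mean((x - m)^3)
  m4 <- mean((x - m)^4)
  c(skewness = m3 / m2^1.5, kurtosis = m4 / m2^2)
}

set.seed(3)
g  <- rgamma(5000, shape = 0.5, scale = 2e5)  # hypothetical right-skewed sample
sk <- skew_kurt(g)
print(sk)

# Gamma distributions satisfy kurtosis = 3 + 1.5 * skewness^2,
# which is the Gamma line on a Cullen and Frey graph
print(c(observed   = unname(sk["kurtosis"]),
        gamma_line = unname(3 + 1.5 * sk["skewness"]^2)))
```

A normal distribution sits at skewness 0 and kurtosis 3, so a sample with skewness near 5 and kurtosis near 41 is far from it.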

The last plot is a density plot. As you can see, it is not bell-shaped, so the data are not normally distributed and we will have to identify which distribution they follow.
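# Normal Q-Q plot of the linear model's residuals
qqnorm(lmout$residuals)
qqline(lmout$residuals, col = "red")

# Reference: Q-Q plot of 100 draws from a standard normal
z <- rnorm(100, mean = 0, sd = 1)
qqnorm(z)
qqline(z, col = "red")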

The first plot above shows the residuals from our linear model, and the second shows what a sample from a normal distribution with mean 0 and standard deviation 1 looks like. The top plot is visibly skewed, which again tells us the data are not normal.

After trying to fit a Gamma model, I found about 10 values that are zero, which is not allowed because a Gamma distribution is defined only for strictly positive values. In the plots above, the left plot is our data and the right plot is the logarithm of our data. This suggests the data follow a log-normal distribution: if X is log-normally distributed, then Y = ln(X) is normally distributed (conversely, X = e^Y). As you can see, the plot on the right is bell-shaped, which looks normal.
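The relationship between a log-normal variable and its logarithm can be checked on simulated data. The sketch below draws hypothetical spend-like values with `rlnorm` (the parameters are made up) and compares the raw and log-scale densities side by side.

```r
set.seed(4)
x <- rlnorm(1000, meanlog = 10.5, sdlog = 1.5)  # hypothetical spend-like values

# Side-by-side densities: raw scale (left) and log scale (right)
op <- par(mfrow = c(1, 2))
plot(density(x), main = "x")
plot(density(log(x)), main = "log(x)")
par(op)

# On the log scale the mean and median should be close (roughly symmetric),
# while on the raw scale the mean is pulled up by the long right tail
print(c(mean_log = mean(log(x)), median_log = median(log(x))))
print(c(mean_raw = mean(x), median_raw = median(x)))
```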
I will fit 3 models and see which one is best.
1: For the first model I will use the log(x + 1) transformation, a popular way of dealing with zero values, since log(0) = -Inf.
2: For the second model I will use a Box-Cox transformation, which uses maximum likelihood to estimate the power parameter for transforming our response variable. Because Box-Cox requires a strictly positive response, I will also replace the 0 values in the data with .00000001 so we can make the transformation.
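The two transformations described so far can be sketched as follows. This is a minimal example on simulated data, not the models fitted in this report: the data frame `d`, the predictor `x`, and the helper `bc_transform` are all hypothetical, and `MASS::boxcox` is assumed to be the Box-Cox implementation in use.

```r
library(MASS)  # provides boxcox(); ships with standard R installations

set.seed(5)
n <- 150
d <- data.frame(x = runif(n, 0, 2))
d$y <- rlnorm(n, meanlog = 1 + d$x, sdlog = 0.5)
d$y[1:10] <- 0  # plant some exact zeros, as in the text

# Option 1: log(y + 1); log1p() is the numerically careful form
fit_log <- lm(log1p(y) ~ x, data = d)
pred_y  <- expm1(fitted(fit_log))  # back to the original scale

# Option 2: Box-Cox needs a strictly positive response
d$y_pos <- ifelse(d$y == 0, 1e-8, d$y)
bc <- boxcox(y_pos ~ x, data = d,
             lambda = seq(-2, 2, by = 0.01), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]  # lambda with the highest profile log-likelihood
print(lambda_hat)

# Box-Cox transform; reduces to log(y) when lambda is 0
bc_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
fit_bc <- lm(bc_transform(y_pos, lambda_hat) ~ x, data = d)
```

One thing to keep in mind with the second option: the substituted constant matters, because log(1e-8) is about -18.4, so the replaced zeros become extreme values on a log-like scale and can pull the estimated lambda.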