Zip Analysis

Diagnostics

The first plot above shows the fitted values of a linear model against its residuals. The points should be scattered randomly around zero, but as you can see there is a downward pattern, which violates the linear model assumption that the residuals have no systematic structure (linearity and constant variance).

The Cook's distance plot shows no unduly influential observations, which is good.
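
As a rough sketch of how these two diagnostics can be produced in base R, the example below uses simulated data. The variables `x`, `y` and `fit` are illustrative assumptions, not the Clippers data, and the 4/n cutoff for Cook's distance is only one common rule of thumb.

```r
# Simulated right-skewed response; stands in for the real data (assumption)
set.seed(1)
x <- runif(200, 0, 10)
y <- exp(1 + 0.3 * x + rnorm(200, sd = 0.5))
fit <- lm(y ~ x)

# Residuals vs fitted values: a well-specified model shows no trend or funnel
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Cook's distance: flag observations above the 4/n rule of thumb
cd <- cooks.distance(fit)
influential <- which(cd > 4 / length(cd))
plot(cd, type = "h", ylab = "Cook's distance")
abline(h = 4 / length(cd), col = "red", lty = 2)
```

Any index in `influential` is worth inspecting by hand before deciding whether to keep or drop it.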

## summary statistics
## ------
## min:  0   max:  2269091 
## median:  41478.6 
## mean:  113314.4 
## estimated sd:  219713.1 
## estimated skewness:  5.166051 
## estimated kurtosis:  40.92266

The Cullen and Frey graph suggests the data may follow a Gamma or a Beta distribution, but it cannot be Beta, since a Beta distribution only takes values in the interval [0, 1] and our data range from 0 to 2,269,091.

The last plot is a density plot. As you can see it is not bell-shaped, so the data are not normally distributed and we will have to identify which distribution fits them.
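
A Cullen and Frey graph places the data by its kurtosis and squared skewness. The sketch below computes simple moment-based versions of both in base R on simulated data. The helper `skew_kurt` and the vector `spend` are hypothetical names. I am assuming the summary above came from `descdist()` in fitdistrplus, whose default estimates are bias-corrected and so differ slightly from these.

```r
# Moment-based skewness and kurtosis (no bias correction)
skew_kurt <- function(x) {
  m <- mean(x)
  s2 <- mean((x - m)^2)                      # population variance
  c(skewness = mean((x - m)^3) / s2^1.5,
    kurtosis = mean((x - m)^4) / s2^2)
}

set.seed(1)
spend <- rlnorm(281, meanlog = 10.5, sdlog = 1.5)  # simulated stand-in (assumption)

sk_raw <- skew_kurt(spend)        # strongly right-skewed, heavy-tailed
sk_log <- skew_kurt(log(spend))   # a normal distribution sits at skewness 0, kurtosis 3
sk_raw
sk_log
```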

# Normal Q-Q plot of the model residuals
qqnorm(lmout$residuals)
qqline(lmout$residuals, col = "red")

# Reference: Q-Q plot of 100 draws from a standard normal
z <- rnorm(100, mean = 0, sd = 1)
qqnorm(z)
qqline(z, col = "red")

The first plot above is the Q-Q plot of our model residuals and the second is what a sample from a normal distribution with mean 0 and standard deviation 1 should look like. As you can see the top plot is skewed away from the reference line, which again tells us the residuals are not normally distributed.

After trying to fit a Gamma model I found about 10 values that are exactly zero, which a Gamma distribution does not allow because its support is strictly positive. So above you will see the left plot is our data and the right plot is the logarithm of our data. The plot on the right is bell-shaped and looks normal, which suggests the data are log-normally distributed: if X is log-normally distributed then Y = ln(X) is normally distributed (conversely, X = e^Y).
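
The check described above can be sketched on simulated data as follows. The vector `x` here is a made-up log-normal sample, not the spend data, and the Shapiro-Wilk test is my own addition as a numeric companion to the histograms.

```r
set.seed(1)
x <- rlnorm(500, meanlog = 10, sdlog = 1.2)  # simulated log-normal data (assumption)

n_zero <- sum(x == 0)   # count exact zeros; Gamma and log-normal fits require none

par(mfrow = c(1, 2))
hist(x, main = "Raw scale", xlab = "x")             # strongly right-skewed
hist(log(x), main = "Log scale", xlab = "log(x)")   # roughly bell-shaped
par(mfrow = c(1, 1))

# Shapiro-Wilk test: a small p-value is evidence against normality
p_raw <- shapiro.test(x)$p.value
p_log <- shapiro.test(log(x))$p.value
c(raw = p_raw, log = p_log)
```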

I will fit 3 models and compare them by AIC to see which one is best.

1: For the first model I will use the log(x + 1) transformation, which is a popular method when dealing with zero values since log(0) is negative infinity (-Inf in R).

2: For the second model I will use a Box-Cox transformation, which uses maximum likelihood estimation to choose the power (lambda) applied to the response variable. Because the transformation requires strictly positive values, I will also replace the 0 values in the data with .00000001.

3: The last model will just be a log(x) transformation with the zero values replaced by .00000001, like the 2nd model.
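
To make the difference between the zero-handling options concrete, here is a small sketch on a toy vector. The values in `y` and the names `eps` and `y_eps` are illustrative only.

```r
y <- c(0, 1, 10, 1000)   # toy values with one zero (assumption)

log_raw <- log(y)        # first element is -Inf, which lm() cannot use
log_p1  <- log1p(y)      # log(y + 1); maps 0 to 0

eps   <- 1e-8
y_eps <- ifelse(y == 0, eps, y)
log_eps <- log(y_eps)    # the zero becomes log(1e-8), about -18.4
```

Note how far the replaced zero lands from the other values on the log scale. This may be one reason a plain log transformation fits poorly when zeros are replaced by a very small constant.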

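# Model 1: log(x + 1) transformation
# (glm() with its default Gaussian family gives the same fit as lm())
lmout <- glm(log(total_spend + 1) ~ ., data = modelData)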

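library(MASS)  # provides boxcox()

# Replace exact zeros with a tiny positive constant so the power and log transforms are defined
modelData$total_spend[modelData$total_spend == 0] <- .00000001
bc <- boxcox(total_spend ~ ., data = modelData)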

lambda <- bc$x[which.max(bc$y)]
lambda

## [1] 0.1818182
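
The lambda search above can be sketched by hand: for each candidate lambda, transform the response, fit the linear model, and evaluate the profile log-likelihood. This is a minimal base R illustration on simulated data, not the internals of `boxcox()`. The function `boxcox_loglik` and all variables are hypothetical names.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- (2 + 0.5 * x + rnorm(n, sd = 0.3))^2   # sqrt(y) is linear in x, so lambda is near 0.5 (assumption)

# Profile log-likelihood of the Box-Cox parameter for the model y ~ x
boxcox_loglik <- function(lambda, y, x) {
  yt <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(resid(lm(yt ~ x))^2)
  # Gaussian log-likelihood term plus the Jacobian of the transformation
  -length(y) / 2 * log(rss / length(y)) + (lambda - 1) * sum(log(y))
}

lambdas <- seq(-2, 2, by = 0.01)
ll <- sapply(lambdas, boxcox_loglik, y = y, x = x)
lambda_hat <- lambdas[which.max(ll)]
lambda_hat
```

Because the maximum is taken over a grid, the estimate is only as fine as the grid spacing. The same applies to `bc$x` above.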

# Model 2: Box-Cox power transformation with the estimated lambda
lmout1 <- lm(total_spend^(0.1818182) ~ ., data = modelData)

# Model 3: plain log transformation (zeros already replaced above)
lmout2 <- lm(log(total_spend) ~ ., data = modelData)

AIC(lmout)

## [1] 1097.223

AIC(lmout1)

## [1] 1095.863

AIC(lmout2)

## [1] 1563.067

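It looks like model 2 (lmout1, the Box-Cox fit) is the best model with the lowest AIC value (1095.86), though only narrowly ahead of model 1 (1097.22) and far ahead of model 3 (1563.07). Strictly, AIC values computed on differently transformed responses are not directly comparable without a Jacobian adjustment, so this ranking should be read as a rough guide. I used a lambda value of ~0.18, which is where the log-likelihood plot peaks.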
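
One way to put AIC values from different response transformations on a common scale is to subtract twice the log-Jacobian of each transformation, which expresses every likelihood in terms of the original response. The sketch below shows this for a log and a power transformation on simulated data. The data, the value of `lambda`, and the object names are assumptions for illustration, and I have not applied this to the models above.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- exp(1 + 0.3 * x + rnorm(n, sd = 0.5))   # simulated positive response (assumption)
lambda <- 0.18                                # illustrative power (assumption)

fit_log <- lm(log(y) ~ x)
fit_pow <- lm(y^lambda ~ x)

# AIC on the original scale of y: subtract 2 * sum(log |dz/dy|),
# where z is the transformed response
aic_log <- AIC(fit_log) - 2 * sum(-log(y))                              # z = log(y), dz/dy = 1 / y
aic_pow <- AIC(fit_pow) - 2 * sum(log(lambda) + (lambda - 1) * log(y))  # z = y^lambda

c(log = aic_log, power = aic_pow)
```

After the adjustment, both numbers measure fit to the same variable `y`, so the smaller one can be preferred in the usual way.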

## 
## Call:
## lm(formula = total_spend^(0.1818182) ~ ., data = modelData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8641 -0.9694 -0.0577  0.8809  5.3442 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.656e+00  4.347e-01  13.013  < 2e-16 ***
## est_Total_pop    -1.260e-04  1.652e-04  -0.763   0.4464    
## Total.Households  2.270e-04  6.932e-05   3.275   0.0012 ** 
## Median_Income     3.396e-05  4.991e-06   6.804 6.58e-11 ***
## Distance         -1.235e-01  1.154e-02 -10.707  < 2e-16 ***
## Males             2.458e-06  1.777e-06   1.383   0.1677    
## est_18.21         1.117e-06  1.162e-06   0.961   0.3374    
## est_21.62        -1.189e-06  9.875e-07  -1.204   0.2297    
## white             4.443e-07  1.323e-06   0.336   0.7374    
## AfricanAmerican   3.613e-07  1.386e-06   0.261   0.7945    
## Asian             2.024e-07  1.347e-06   0.150   0.8807    
## Other            -1.826e-07  1.323e-06  -0.138   0.8903    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.66 on 269 degrees of freedom
## Multiple R-squared:  0.4805, Adjusted R-squared:  0.4593 
## F-statistic: 22.62 on 11 and 269 DF,  p-value: < 2.2e-16
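
The coefficients above are on the transformed scale, total_spend^0.1818182, so predictions have to be raised to the power 1/lambda to get back to dollars. Here is a minimal sketch of that back-transformation on simulated data. The data, `lambda`, and `newdata` are made up for illustration.

```r
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- (2 + 0.5 * x + rnorm(n, sd = 0.3))^5   # simulated data (assumption)
lambda <- 0.2                                # illustrative power (assumption)

fit <- lm(y^lambda ~ x)

newdata <- data.frame(x = c(2, 8))
pred_t <- predict(fit, newdata)   # predictions on the transformed scale
pred_y <- pred_t^(1 / lambda)     # back on the original scale
pred_y
```

A back-transformed prediction of this kind is closer to a typical (median) value than to the mean of the original response, which matters when the data are strongly skewed.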

Justin Cowden

March 5, 2019

The data we are using is grouped by zip codes around LA. We would like to create a model that predicts the total dollars spent on Clippers games from household and demographic data.
