Synopsis:

Disclosure:
This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau. I may use the Census Bureau name in order to identify the source of API content subject to these rules. I may not use the Census Bureau name, or the like to imply endorsement of any product, service, or entity, not-for-profit, commercial or otherwise.

1. Question:
Can we predict `MortgageIncome` (Mortgage Status by Selected Monthly Owner Costs as a Percentage of Household Income) using other socio-economic variables extracted from the Census?

2. Input Data:
This study collects, organizes, analyzes, and disseminates a significant amount of information.

3. Preprocessing:
Merge all the datasets into a single dataset.
Deal with NA values: replace them, if needed, using random-forest-based imputation (missForest).
Deal with near-zero variance and low frequency ratios.
Remove redundant, highly correlated predictors if necessary.

4. Exploratory Analysis:
We will map some predictors extracted from the Census data as geographical information (by zip code).

5. Build the models and compare:
Analyze the quantitative data using varied statistical techniques, including linear regression and ANOVA.
A Box-Cox variable transformation is performed, and the different models are analyzed using R^2, RSE, and ANOVA to determine the best-performing model.

6. Applicable insights to improve response:
Optimization of the output as a function of selected variables.

Input Data

## Calling the necessary libraries to run the program.

rm(list=ls())
libs <- c ("tm", "plyr", "class", "SnowballC", "dplyr", "caret", "corrplot", "utils",
           "beepr","RWeka","e1071","ggplot2", "datasets", "maps","plotly", "wordcloud",
           "missForest", "rvest", "knitr", "ggmap", "maptools", "rgdal","UScensus2010", "acs",
           "RJSONIO", "gridExtra", "car", "MASS")

lapply(libs, library, character.only=T)

A. Extraction of Austin Counties and Zip Codes

In order to filter our data to the few counties and cities of interest, we scrape from http://www.zip-codes.com/state/TX.asp a relational dataset that links counties, cities, zip codes, and states.

This dataset will be used to subset the data into the desired counties and to obtain the zip codes of the region of interest. In this case we focus our attention on Travis, Williamson, Hays, and Bastrop counties.

setwd("~/R/Austin_SocEconInd")
url <- read_html("http://www.zip-codes.com/state/TX.asp")

## Extract the data Zip codes in the Austin Area 
Zip_Region <- url %>%
        html_nodes("#tblZIP") %>%
        .[[1]] %>%
        html_table()

colnames(Zip_Region) = (Zip_Region[1,])

## Clean the Data Set
Zip_Region$`ZIP Code` <-  sapply(Zip_Region$`ZIP Code`, function(x)  gsub("ZIP Code ", "", x))
Zip_Region$`Area Codes` <- NULL
Zip_Region$`ZIP Code` <- as.numeric(Zip_Region$`ZIP Code`)

This dataset contains 2598 observations for the state of Texas. Subsetting to the Travis, Williamson, Hays, and Bastrop regions results in a dataset containing 122 observations.

## Reduce the information to the Great Austin Area ###
cnty_List=c("Travis",
            "Williamson",
            "Hays",
            "Bastrop")

Zip_Region_Austin=Zip_Region[Zip_Region$County %in% cnty_List,]

#print(paste0("From 2598 observables we ended with: ", (nrow(Zip_Region_Austin))))

B. Extraction of Zip-Code Shapes

Extract the shapes of the zip codes within the desired counties. This data will later be used to map the variables of interest by zip-code area.

The dataset is extracted from the TIGER products page (Geography section, U.S. Census Bureau).

## Extract Map of the great Austin Area 
x=get_googlemap(center="Austin",maptype=c("roadmap"))

## Extract Shape of Zip code areas
zipShp=readShapePoly("./raw_zipCode_Shape/tl_2010_48_zcta510/tl_2010_48_zcta510.shp")

zipShp2=fortify(zipShp,region="ZCTA5CE10")

## Reduce to zips in Austin area
zipShp3=zipShp2[zipShp2$id %in% Zip_Region_Austin$`ZIP Code` ,]
rm(zipShp2, zipShp, Zip_Region, url)

## Eliminate unnecessary variables
zipShp3$hole <- NULL
zipShp3$group <- NULL

C. Extraction of variables of interest from U.S. CENSUS

The datasets were extracted from the 2010 Census (SF1) and the 2014 ACS 5-year estimates using three different API calls.
In addition, we created new variables that combine the number of females and males of a given age group into a single variable.

#### Extract Data from CENSUS ####
## Extracting the data function

getCensusData=function(APIkey,state,fieldnm){
        resURL=paste("http://api.census.gov/data/2010/sf1?get=",fieldnm,
                     "&for=zip+code+tabulation+area:*&in=state:",state,
                     "&key=",APIkey,
                     sep="")
        dfJSON=fromJSON(resURL)
        dfJSON=dfJSON[2:length(dfJSON)]   ## drop the header row returned by the API
        dfJSON_Zip_Code    = sapply(dfJSON,function(x) x[7])
        dfJSON_state       = sapply(dfJSON,function(x) x[6])
        dfJSON_HouseFamily = sapply(dfJSON,function(x) x[5])
        dfJSON_HouseType   = sapply(dfJSON,function(x) x[4])
        dfJSON_HouseSize   = sapply(dfJSON,function(x) x[3])
        dfJSON_MedianAge   = sapply(dfJSON,function(x) x[2])
        dfJSON_Pop         = sapply(dfJSON,function(x) x[1])
        df=data.frame(dfJSON_Pop,as.numeric(dfJSON_MedianAge), dfJSON_HouseSize,
                      dfJSON_HouseType,dfJSON_HouseFamily, dfJSON_state,
                      dfJSON_Zip_Code)
        names(df) = c("Pop", "MedianAge", "HouseSize",
                      "Households", "HouseFamily", "State",
                      "ZipCode")
        return(df)
}        

## Population, median age and household fields from the 2010 Census for a state
fieldnm="P0010001,P0130001,P0170001,P0180001,P0280002"  
state = "48"                      ## State Code: TEXAS
API_KEY_CENSUS = "837f5736b2e8c510f4a2b7de5303406594d73465"

dfHouse=getCensusData(API_KEY_CENSUS,state,fieldnm); head(dfHouse)
dfHouse$Zip_2 <- dfHouse$ZipCode   ## duplicate the zip column for the later merges
names(dfHouse)=c("Pop", "MedianAge", "HouseSize","Households",
               "HouseFAmily", "State", "Zip" , "Zip_2")

getCensusData=function(APIkey,state,fieldnm){
        resURL=paste("http://api.census.gov/data/2014/acs5?get=",fieldnm,
                     "&for=zip+code+tabulation+area:*&key=",APIkey,
                     sep="")
        dfJSON=fromJSON(resURL)
        dfJSON=dfJSON[2:length(dfJSON)]   ## drop the header row returned by the API
        dfJSON_Zip_Code           = sapply(dfJSON,function(x) x[6])
        dfJSON_CostIncme          = sapply(dfJSON,function(x) x[5])
        dfJSON_HouseMortgageIncome = sapply(dfJSON,function(x) x[4])
        dfJSON_MedYearBuilt       = sapply(dfJSON,function(x) x[3])
        dfJSON_MedRealStTaxes     = sapply(dfJSON,function(x) x[2])
        dfJSON_MedIncome          = sapply(dfJSON,function(x) x[1])
        df=data.frame(as.numeric(as.character(dfJSON_MedIncome)),
                      as.numeric(as.character(dfJSON_MedRealStTaxes)),
                      as.numeric(as.character(dfJSON_MedYearBuilt)),
                      as.numeric(as.character(dfJSON_HouseMortgageIncome)), 
                      as.numeric(as.character(dfJSON_CostIncme)),
                      as.numeric(as.character(dfJSON_Zip_Code)))
        names(df)=c("MedIncome", "RealStTaxes", "YearBuilt",
                    "MortgageIncome", "Costincome",
                    "ZipCode")
        return(df)
}        


## Median income, real estate taxes, median year built, and owner costs as a
## percentage of household income (2014 ACS 5-year estimates)
fieldnm="B19013_001E,B25103_001E,B25035_001E,B25091_001E,B25092_001E"     
state = "48"                      ## State Code: TEXAS
API_KEY_CENSUS = "837f5736b2e8c510f4a2b7de5303406594d73465"

dfIncome=getCensusData(API_KEY_CENSUS,state,fieldnm); head(dfIncome)
dfIncome$Zip_2 <- dfIncome$ZipCode; head(dfIncome)
dfIncome$Zip_3 <- dfIncome$ZipCode; head(dfIncome)
names(dfIncome)=c("MedIncome", "RealStTaxes", "YearBuilt",
                  "MortgageIncome", "Costincome",
                  "Zip", "Zip_2", "Zip_3" )

head(dfIncome)

getCensusData=function(APIkey,state,fieldnm){
        resURL=paste("http://api.census.gov/data/2014/acs5?get=",fieldnm,
                     "&for=zip+code+tabulation+area:*&key=",APIkey,
                     sep="")
        dfJSON=fromJSON(resURL)
        dfJSON=dfJSON[2:length(dfJSON)]   ## drop the header row returned by the API
        dfJSON_Zip_Code= sapply(dfJSON,function(x) x[10])
        dfJSON_F15_17  = sapply(dfJSON,function(x) x[9])
        dfJSON_F10_14  = sapply(dfJSON,function(x) x[8])
        dfJSON_F5_9    = sapply(dfJSON,function(x) x[7])
        dfJSON_FUnder5 = sapply(dfJSON,function(x) x[6])
        dfJSON_M15_17  = sapply(dfJSON,function(x) x[5])
        dfJSON_M10_14  = sapply(dfJSON,function(x) x[4])
        dfJSON_M5_9    = sapply(dfJSON,function(x) x[3])
        dfJSON_MUnder5 = sapply(dfJSON,function(x) x[2])
        dfJSON_Tot     = sapply(dfJSON,function(x) x[1])
        df=data.frame(as.numeric(as.character(dfJSON_Tot)),
                      as.numeric(as.character(dfJSON_MUnder5)),
                      as.numeric(as.character(dfJSON_M5_9)),
                      as.numeric(as.character(dfJSON_M10_14)),
                      as.numeric(as.character(dfJSON_M15_17)),
                      as.numeric(as.character(dfJSON_FUnder5)),
                      as.numeric(as.character(dfJSON_F5_9)),
                      as.numeric(as.character(dfJSON_F10_14)),
                      as.numeric(as.character(dfJSON_F15_17)),
                      as.numeric(as.character(dfJSON_Zip_Code)))
        names(df)=c("Tot", "MUnder5", "M5_9",
                    "M10_14", "M15_17", "FUnder5", "F5_9", "F10_14", "F15_17",
                    "ZipCode")
        return(df)
}        


## Total population and population by sex and age group (2014 ACS 5-year estimates)
fieldnm="B01001_001E,B01001_003E,B01001_004E,B01001_005E,B01001_006E,B01001_027E,B01001_028E,B01001_029E,B01001_030E"

state = "48"                      ## State Code: TEXAS
API_KEY_CENSUS = "837f5736b2e8c510f4a2b7de5303406594d73465"

dfage=getCensusData(API_KEY_CENSUS,state,fieldnm); head(dfage)
dfage$Zip_2 <- dfage$ZipCode   ## duplicate the zip column for the later merge
names(dfage)=c("Tot", "MUnder5", "M5_9","M10_14", "M15_17", 
               "FUnder5", "F5_9", "F10_14", "F15_17",
               "Zip" , "Zip_2")

dfage$Under5 = dfage$MUnder5 + dfage$FUnder5
dfage$at5_9 <- dfage$M5_9 + dfage$F5_9
dfage$at10_14 <- dfage$M10_14 + dfage$F10_14
dfage$at15_17 <- dfage$M15_17 + dfage$F15_17

Now we have 4 datasets that we will merge into a single one. Here we show all the variables of the resulting dataset.

## Merge the shape data with the three Census datasets into a single set
ZipHouse=merge(zipShp3,dfHouse,by.x=c("id"),by.y=c("Zip"))
ZipHouseIncome = merge(ZipHouse, dfIncome, by="Zip_2")
mydata <- merge(ZipHouseIncome, dfage, by="Zip_2")
names(mydata)
##  [1] "Zip_2"          "id"             "long"           "lat"           
##  [5] "order"          "piece"          "Pop"            "MedianAge"     
##  [9] "HouseSize"      "Households"     "HouseFAmily"    "State"         
## [13] "MedIncome"      "RealStTaxes"    "YearBuilt"      "MortgageIncome"
## [17] "Costincome"     "Tot"            "MUnder5"        "M5_9"          
## [21] "M10_14"         "M15_17"         "FUnder5"        "F5_9"          
## [25] "F10_14"         "F15_17"         "Under5"         "at5_9"         
## [29] "at10_14"        "at15_17"

Preprocessing

A. ‘NA’ distribution analysis and correction

First, we are going to make a copy of our original dataset that will be used exclusively for graphical purposes.

## Create a new set that we will use for mapping
mydata_plot= mydata

Second, we will set all non-numeric variables as numeric.

### Set all variables as numeric
mydata$Zip_2        <- as.numeric(as.character(mydata$Zip_2))
mydata$id           <- as.numeric(mydata$id)
mydata$piece        <- as.numeric(as.character(mydata$piece))
mydata$Pop          <- as.numeric(as.character(mydata$Pop))
mydata$HouseSize    <- as.numeric(as.character(mydata$HouseSize))
mydata$Households   <- as.numeric(as.character(mydata$Households))
mydata$HouseFAmily  <- as.numeric(as.character(mydata$HouseFAmily))
mydata$State  <- as.numeric(as.character(mydata$State))

Third, we look only at the variables containing NAs and examine their distribution and frequency per variable.

## Find and locate NA variables
Find_NA = data.frame (apply(mydata, 2, function(x) sum(is.na(x)==T)))

## Express NA's as percentage of the same column's total 
NA_Variables <- Find_NA*100/nrow(mydata)
## Exclude variables not containing NA
NA_values<- subset(NA_Variables[1], (Find_NA/nrow(mydata)) != 0)
names(NA_values)[1] <- "Percentage_NA";
              MedIncome RealStTaxes YearBuilt Costincome
Percentage_NA  0.148392    0.193885  0.148392   0.193885

Since the percentage of NAs is so low for these variables, we will simply remove the affected observations rather than impute them.

## NA ratio (largest per-column proportion of NAs); this drives the choice below
Ratio_NA <- max(Find_NA)/nrow(mydata)

## Impute missing values (all missForest parameters left at their default values)
if (Ratio_NA <= 0.1){ ## threshold to be set to 0.005
        mydata <- na.omit(mydata)
        data <- mydata
} else {
        data_imp <- missForest(mydata)
        print(paste0( "Proportion of NA falsely classified: ", round(data_imp$OOBerror, digits = 3)))
        data <- data_imp$ximp
}

All NA observations have now been handled.

B. Near-zero variance variable analysis

Now that we are rid of the NAs, we look for variables with near-zero variance: those whose most common value is overwhelmingly frequent (high frequency ratio) or that take essentially a single value (zero variance).

## Near-zero-variance covariates
ZC <- nearZeroVar(data, saveMetrics = T)
      freqRatio percentUnique zeroVar  nzv
piece  79.07143      0.004341   FALSE TRUE
State   0.00000      0.001085    TRUE TRUE

The dataset contains two variables flagged for near-zero variance, one of which (State) has strictly zero variance. We eliminate the zero-variance variable here; piece survives this step and is removed later, during model fitting, because of its high p-value.

## Eliminate the zero-variance variable from the data set
nzv_T <- which(ZC$zeroVar == T) # locate the zero-variance variables
mydata1 <- data[,-c(nzv_T)]

C. Find Highly Correlated Variables

We are going to look for highly correlated variables and eliminate from the dataset every variable with a correlation index above 0.95.

correlMTX <- cor(mydata1) 
highlyCor <- findCorrelation(correlMTX, 0.95)  ## findCorrelation() is from caret

mydata_clean <- mydata1[,-highlyCor]
correlMTX_clean <- cor(mydata_clean)
mydata_clean <- data.frame(mydata_clean)

After eliminating all the highly correlated variables, the dataset is reduced to just 15 variables.

##  [1] "long"           "lat"            "piece"          "MedianAge"     
##  [5] "HouseSize"      "Households"     "MedIncome"      "RealStTaxes"   
##  [9] "YearBuilt"      "MortgageIncome" "Costincome"     "M15_17"        
## [13] "FUnder5"        "F5_9"           "Zip.y"

Looking at the remaining variables in the dataset, we can safely assume that latitude and longitude have no direct relationship with our response variable MortgageIncome, so we take them out of the dataset. As a result, our dataset now contains 13 variables.

mydata2 <- mydata_clean
mydata2$long <- NULL
mydata2$lat <- NULL

Exploratory Analysis

Let’s map some socioeconomic indexes from the Census in the Greater Austin Area. Our analysis uses a regression model, since our output takes continuous values, so it is not designed for classification; nevertheless, geographic classification plots of the Census socioeconomic indexes by zip code reveal interesting patterns.
To achieve this, we first create grouped variables that allow us to map the socio-economic indexes by zip area.

## Create groups for median age
mydata_plot$MedAgeLvl <- cut(mydata_plot$MedianAge, 
                              breaks = c(0,30,35,40,45,50,55,60,65,70,80),
                              labels = c("<30", "30-35", "35-40", "40-45", "45-50", "50-55",
                                  "55-60", "60-65", "65-70", "70-80"))
## Create Groups for Median Income
mydata_plot$MedIncomeLvl <- cut(mydata_plot$MedIncome, 
                                breaks = c(0,40000,60000,80000,100000,120000, 140000,160000,180000),
                                labels = c("<40k", "40-60k", "60-80k", "80-100k", "100-120k",
                                           "120-140k","140-160k", "160-180k" ))

## Create groups for median monthly owner costs as a percentage of household income
mydata_plot$CostIncomelvl <- cut(mydata_plot$Costincome, 
                                 breaks = c(0,5,10,15,20,25,30,35),
                                 labels = c("<5%", "5-10%", "10-15%","15-20%", "20-25%",
                                         "25-30%", "30-35%"))
## Create groups for median real estate taxes
mydata_plot$RealStTaxesLvl <- cut(mydata_plot$RealStTaxes, 
                                  breaks = c(0,2000,4000,6000,8000,10000,15000),
                                  labels = c("<2000", "2000-4000", "4000-6000", "6000-8000",
                                          "8000-10000", "10000-15000")) 

## Create groups for Median year structure built
mydata_plot$YearBuiltLvl <- cut(mydata_plot$YearBuilt, 
                                breaks = c(1940, 1965,1971,1977,1983,1989,1995,2001,2007,2013,2019),
                                labels = c("<1965","1965-1971", "1971-1977", "1977-1983",
                                        "1983-1989", "1989-1995", "1995-2001","2001-2007",
                                        "2007-2013", ">2013")) 

Second, we map these variables by zip code.

### Map Age Lvl
x=get_googlemap(center="Austin",maptype=c("roadmap"))
p2=ggmap(x)
p2=p2+geom_polygon(data=mydata_plot, aes(x=long, y=lat, group=Zip_2, fill=MedAgeLvl),
                   color="black",alpha=0.4)
p2=p2+scale_fill_manual(values=c("springgreen", "springgreen4",
                                 "orange", "orange4",
                                 "red","red4",
                                 "grey40"))
#p2 = p2 + geom_text(   data=mydata_plot, hjust=0.5, vjust=0, check_overlap = F, fontface="bold",
#                      aes(x=long, y=lat, label=MedAgeLvl),
#                      colour="red3", size=3, alpha = 0.9 ) 
p2 = p2 + labs(title = "Median Age by Zip (US CENSUS)")
p2 = p2 + theme(plot.title = element_text(size = rel(1.2)),
                legend.title =element_text(size = rel(1)))

### Map CostIncome
x=get_googlemap(center="Austin",maptype=c("roadmap"))
p3=ggmap(x)
p3=p3+geom_polygon(data=mydata_plot, aes(x=long, y=lat, group=Zip_2, fill=CostIncomelvl),
                   color="black",alpha=0.4)
p3=p3+scale_fill_manual(values=c("springgreen", "springgreen4",
                                 "orange", "orange4",
                                 "red","red4",
                                 "grey40"))
p3 = p3 + labs(title = "Cost-to-Income Level by Zip (US CENSUS)")
p3 = p3 + theme(plot.title = element_text(size = rel(1.2)),
                legend.title =element_text(size = rel(1.2)))


### Map Year Built
x=get_googlemap(center="Austin",maptype=c("roadmap"))
p4=ggmap(x)
p4=p4+geom_polygon(data=mydata_plot, aes(x=long, y=lat, group=Zip_2, fill=YearBuiltLvl ),
                   color="black",alpha=0.4)
p4=p4+scale_fill_manual(values=c("springgreen", "springgreen3",  "springgreen4",
                                 "orange", "orange3", "orange4",
                                 "red","red3", "red4",
                                 "grey40", "grey20", "grey0"))
#p4 = p4 + geom_text(   data=mydata_plot, hjust=0.5, vjust=0, check_overlap = F, fontface="bold",
#                      aes(x=long, y=lat, label=YearBuiltLvl),
#                      colour="red3", size=3, alpha = 0.9 )
p4 = p4 + labs(title = "Median Year Built by Zip (US CENSUS)")
p4 = p4 + theme(plot.title = element_text(size = rel(1)),
                legend.title =element_text(size = rel(1)))

### Map Real st Taxes
x=get_googlemap(center="Austin",maptype=c("roadmap"))
p5=ggmap(x)
p5=p5+geom_polygon(data=mydata_plot, aes(x=long, y=lat, group=Zip_2, fill= RealStTaxesLvl),
                   color="black",alpha=0.4)
p5=p5+scale_fill_manual(values=c("springgreen", "springgreen4",
                                 "orange", "orange4",
                                 "red","red4",
                                 "grey40", "grey20"))
p5 = p5 + labs(title = "Median Real Estate Taxes by Zip (US CENSUS)")
p5 = p5 + theme(plot.title = element_text(size = rel(1)),
                legend.title =element_text(size = rel(1)))

### Map Med income
x=get_googlemap(center="Austin",maptype=c("roadmap"))
p6=ggmap(x)
p6=p6+geom_polygon(data=mydata_plot, aes(x=long, y=lat, group=Zip_2, fill=MedIncomeLvl),
                   color="black",alpha=0.4)
p6=p6+scale_fill_manual(values=c("springgreen", "springgreen4",
                                 "orange", "orange4",
                                 "red","red4",
                                 "grey40", "grey20"))
p6 = p6 + labs(title = "Median Income by Zip (US CENSUS)")
p6 = p6 + theme(plot.title = element_text(size = rel(1)),
                legend.title =element_text(size = rel(1)))

Build the Model

Third, we will study the collinearity between predictors using the variance inflation factor (VIF).

A. Analysis of the linear regression using MortgageIncome as the outcome

We will perform a linear regression using lm.

## linear regression
model0 <- (lm(MortgageIncome ~. , data = mydata2))

We will identify the variables whose p-values are at or below 0.05; predictors above this threshold will be excluded.

Linear Regression p-values per variable

 piece MedianAge HouseSize Households MedIncome RealStTaxes YearBuilt Costincome M15_17 FUnder5 F5_9 Zip.y
0.8422         0         0          0         0           0         0     0.2261      0       0    0     0

The p-values of the predictors piece and Costincome are clearly above 0.05, indicating that these two variables should be excluded from our model.

## Elimination of variables
mydata2$piece <- NULL
mydata2$Costincome <- NULL

Now we can perform a linear regression analysis without these two variables. We will look at the RSE rather than the R^2 index, since R^2 never decreases as variables are added and is therefore heavily dependent on the number of variables.
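
For reference, the residual standard error reported by summary() is the standard quantity

\[RSE = \sqrt{\frac{RSS}{n - p - 1}},\]

where \(RSS\) is the residual sum of squares, \(n\) the number of observations, and \(p\) the number of predictors; unlike \(R^2\), it carries a penalty for adding predictors.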

model1 <- (lm(MortgageIncome ~. , data = mydata2))
## [1] "Residual Standard Error (RSE) Model_0: 1001.722       Residual Standard Error (RSE) Model_1: 1001.719"

The RSE (Residual Standard Error) remains essentially the same after eliminating the predictors piece and Costincome. This means that we can continue our study without these variables.

A major problem with multivariate regression is collinearity. If two or more highly correlated predictor variables are both entered into a regression model, the standard errors are inflated and the slope estimates become very unstable. We can assess collinearity by studying the variance inflation factor (VIF).
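
As a minimal sketch of what vif() computes (using MedIncome as an arbitrary example; any predictor works the same way), the VIF of a predictor is \(1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on all the other predictors:

## VIF of MedIncome by hand: regress it on the remaining predictors
r2_j <- summary(lm(MedIncome ~ . - MortgageIncome, data = mydata2))$r.squared
1 / (1 - r2_j)   ## should agree with the vif() entry for MedIncome shown below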

## VIF
VIF_model1 <- vif(lm(MortgageIncome ~. , data = mydata2))
VIF Value per variable

 Zip.y YearBuilt MedianAge HouseSize RealStTaxes MedIncome  M15_17 Households FUnder5    F5_9
1.2517    1.8908    3.4368    5.2744       5.755    5.8085 11.0599    11.3197 14.2099 18.3679

Using the following guidelines to interpret the VIF:

  • VIF = 1: not correlated
  • 1 < VIF < 5: moderately correlated
  • 5 < VIF ≤ 10: highly correlated

We will eliminate any predictor with a VIF larger than 10, as implemented below.
## Eliminate the highly correlated variables (VIF >= 10), keeping F5_9 (see below)
drop_vars <- setdiff(names(VIF_model1)[VIF_model1 >= 10], "F5_9")
mydata3 <- mydata2[, !(names(mydata2) %in% drop_vars)]
model2 <- (lm(MortgageIncome ~. , data = mydata3))

After eliminating the predictors with a VIF larger than 10 (M15_17, Households, and FUnder5), our final model contains 7 predictors.

Model_2 Final Linear Regression Summary

               Estimate Std. Error   t value Pr(>|t|)
(Intercept) -91678.1695  1320.6539  -69.4188        0
MedianAge       59.8408     1.1295   52.9814        0
HouseSize    -2613.7479    21.0663 -124.0722        0
MedIncome        0.0032     0.0005     6.7195       0
RealStTaxes     -0.0236     0.0052    -4.5163       0
YearBuilt       57.7077     0.7002    82.4125       0
F5_9             5.1164     0.0074   690.3247       0
Zip.y           -0.2242     0.0087   -25.8444       0

If you are asking yourself what happened to F5_9 with VIF = 18.3679, and why this predictor was not eliminated: the answer is that F5_9 is heavily correlated with M15_17 and FUnder5.
As shown below, once those two variables are eliminated, the VIF of F5_9 drops to 1.2145.

#print(paste0("R^2 Model_2 = ", round(summary(model2)$r.sq,digits = 3), "           Residual Standard Error (RSE) Model_2 = ", round(summary(model2)$sigma, digits = 3)))

## VIF
VIF_model2 <- vif(lm(MortgageIncome ~. , data = mydata3))
## All predictors now have low p-values and VIFs within normal parameters.
## We will move on to nonlinear models to see whether the fit improves.
Model_2 VIF Values

MedianAge HouseSize MedIncome RealStTaxes YearBuilt   F5_9  Zip.y
   2.7104    2.9173    5.4177      5.4011    1.8164 1.4483 1.2145

There are no more highly correlated predictors; RealStTaxes and MedIncome are kept because they sit at the low end of the “highly correlated” region. Having reached a stopping point in the variable-reduction process, we can now focus on finding the best model.

B. Heteroskedasticity correction

We start by looking at heteroskedasticity.

car::ncvTest(model2)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 14048.88    Df = 1     p = 0
lmtest::bptest(model2)
## 
##  studentized Breusch-Pagan test
## 
## data:  model2
## BP = 19031, df = 7, p-value < 2.2e-16

Both p-values are below the significance level of 0.05. Both tests produce statistics that are chi-squared distributed (14048.88 for the score test, 19031 for the studentized Breusch-Pagan test), and the null hypothesis is rejected for p-value < 0.05. In this case the null hypothesis is homoskedasticity, and it is rejected.

One of the main assumptions of linear model fitting is that the errors are normally distributed. However, this assumption doesn’t always hold, so we need to apply a data transformation when the data isn’t normally distributed.

Box-Cox is a way to transform a variable to meet the “normality” assumption.
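
For reference, the standard one-parameter Box-Cox family is

\[y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases}\]

and boxcox() picks the \(\lambda\) that maximizes the profile log-likelihood. In this report the simpler power form \(y^{\lambda}\) is used directly; for \(\lambda \neq 0\) it is an affine rescaling of \(y^{(\lambda)}\), so the fitted models are equivalent.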

We perform Box-Cox transformations of the predictor variables to make them approximately normal, and we apply these transformations in the linear model model4 below.

bc <- boxcox(model2, plotit = F)
lambda <- bc$x[which.max(bc$y)]

model3 <- (lm(I(MortgageIncome^lambda) ~ MedianAge + HouseSize + MedIncome + RealStTaxes + YearBuilt + F5_9 + Zip.y , data = mydata3))

## Transformation of the predictor variables using Box-Cox
## (the exponents hard-coded in model4 below correspond to the lambdas estimated here)
BoxPlot_MA  <- caret::BoxCoxTrans(mydata3$MedianAge)
BoxPlot_HS  <- caret::BoxCoxTrans(mydata3$HouseSize)
BoxPlot_MI  <- caret::BoxCoxTrans(mydata3$MedIncome)
BoxPlot_RST <- caret::BoxCoxTrans(mydata3$RealStTaxes)
BoxPlot_YB  <- caret::BoxCoxTrans(mydata3$YearBuilt)
BoxPlot_F   <- caret::BoxCoxTrans(mydata3$F5_9)
BoxPlot_Zip <- caret::BoxCoxTrans(mydata3$Zip.y)

model4 <-lm((MortgageIncome) ~ I(MedianAge^-0.1) + I(HouseSize^0.6) + I(MedIncome^0.2) + I(RealStTaxes^(-0.4)) + I(YearBuilt^(2)) + I(F5_9^0.3) + I(Zip.y^2) , data = mydata3)

Now we apply boxcox to transform the response variable MortgageIncome and obtain a homoskedastic distribution. The residual distribution of the non-transformed model (heteroskedastic) can then be compared with the residual distribution of the transformed model.

bc <- boxcox(model4, plotit = F)
lambda <- bc$x[which.max(bc$y)]

mydata3$MortgageIncomeLambda <- (mydata3$MortgageIncome)^lambda

model5<-lm(I(MortgageIncome^lambda) ~ I(MedianAge^-0.1) + I(HouseSize^0.6) + I(MedIncome^0.2) + I(RealStTaxes^(-0.4)) + I(YearBuilt^(2)) + I(F5_9^0.3) + I(Zip.y^2) , data = mydata3)

C. Model selection

Finally, we create nonlinear models under these transformations and use ANOVA, R^2, and RSE to select the most accurate model.

model6<-lm(I(MortgageIncome^lambda) ~ poly((I(MedianAge^-0.1)),2) + poly((I(HouseSize^0.6)),2) + poly((I(MedIncome^0.2)),2) + poly((I(RealStTaxes^(-0.4))),2) + poly((I(YearBuilt^(2))),2) + poly((I(F5_9^0.3)),2) + poly((I(Zip.y^2)),2) , data = mydata3)

model7<-lm(I(MortgageIncome^lambda) ~ poly((I(MedianAge^-0.1)),3) + poly((I(HouseSize^0.6)),3) + poly((I(MedIncome^0.2)),3) + poly((I(RealStTaxes^(-0.4))),3) + poly((I(YearBuilt^(2))),3) + poly((I(F5_9^0.3)),3) + poly((I(Zip.y^2)),3) , data = mydata3)

model8 <- lm(I(MortgageIncome^lambda) ~ poly((I(MedianAge^-0.1)),4) + poly((I(HouseSize^0.6)),4) + poly((I(MedIncome^0.2)),4) + poly((I(RealStTaxes^(-0.4))),4) + poly((I(YearBuilt^(2))),4) + poly((I(F5_9^0.3)),4) + poly((I(Zip.y^2)),4) , data = mydata3)

model9 <- lm(I(MortgageIncome^lambda) ~ poly((I(MedianAge^-0.1)),5) + poly((I(HouseSize^0.6)),5) + poly((I(MedIncome^0.2)),5) + poly((I(RealStTaxes^(-0.4))),5) + poly((I(YearBuilt^(2))),5) + poly((I(F5_9^0.3)),5) + poly((I(Zip.y^2)),5) , data = mydata3)

model10 <- lm(I(MortgageIncome^lambda) ~ poly((I(MedianAge^-0.1)),6) + poly((I(HouseSize^0.6)),6) + poly((I(MedIncome^0.2)),6) + poly((I(RealStTaxes^(-0.4))),6) + poly((I(YearBuilt^(2))),6) + poly((I(F5_9^0.3)),6) + poly((I(Zip.y^2)),6) , data = mydata3)

Now we have six different models:
* Model 5:
\[MortgageIncome^\lambda = MedianAge^{-0.1} + HouseSize^{0.6} + MedIncome^{0.2} + RealStTaxes^{-0.4} + YearBuilt^{2} + F5\_9^{0.3} + Zip.y^2\]
* Models 6, 7, 8, 9, and 10 replace each transformed predictor with an orthogonal polynomial of degree X: \[MortgageIncome^\lambda = \mathrm{poly}(MedianAge^{-0.1}, X) + \mathrm{poly}(HouseSize^{0.6}, X) + \dots + \mathrm{poly}(Zip.y^{2}, X)\] with X = 2, 3, 4, 5, and 6 for Models 6, 7, 8, 9, and 10 respectively.
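A note on poly(), in case the formula above is read too literally: poly(z, X) does not return the single term \(z^X\) but an orthogonal basis spanning all powers of z from 1 to X, so each model nests the previous one and the sequential ANOVA below is valid. A minimal illustration:

## poly(z, 3) yields three orthogonal columns spanning z, z^2 and z^3
z <- 1:10
head(poly(z, 3))
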
We now approach model selection by studying R^2, RSE, and the ANOVA comparisons.

r2 <- c(summary(model5)$r.sq, summary(model6)$r.sq, summary(model7)$r.sq, summary(model8)$r.sq, summary(model9)$r.sq, summary(model10)$r.sq)
RSE <- c(summary(model5)$sigma, summary(model6)$sigma, summary(model7)$sigma, summary(model8)$sigma, summary(model9)$sigma, summary(model10)$sigma)
anova_res <- anova(model5, model6, model7, model8, model9, model10)
Fit_res <- data.frame(r2, RSE, anova_res)
colnames(Fit_res) <- c("R^2", "RSE", "Anova Res.Df", "Anova RSS", "Anova Df", "Anova Sum of Sq", "Anova F", "Anova Pr(>F)")
row.names(Fit_res) <- c("Model5", "Model6", "Model7", "Model8", "Model9", "Model10")
Assessing the Accuracy Indicators of the Regression Models

           R^2    RSE Anova Res.Df Anova RSS Anova Df Anova Sum of Sq   Anova F Anova Pr(>F)
Model5  0.8964 1.0043        92136  92923.70       NA              NA        NA           NA
Model6  0.9460 0.7251        92129  48440.89        7       44482.810 41685.194            0
Model7  0.9512 0.6891        92122  43739.14        7        4701.751  4406.048            0
Model8  0.9662 0.5738        92115  30328.64        7       13410.494 12567.080            0
Model9  0.9776 0.4668        92108  20070.26        7       10258.384  9613.213            0
Model10 0.9844 0.3904        92101  14040.31        7        6029.948  5650.712            0

It can be seen that the best R^2 and RSE results are obtained with Model10. The ANOVA analysis also favors Model10: every increase in polynomial degree yields a statistically significant improvement (all ANOVA p-values are below 0.05). Below we show Residuals-versus-Fitted plots for three of the models.
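
For reference, each row of the ANOVA table compares a model with the previous one using the nested-model F statistic; when several models are passed to anova(), R uses the residual mean square of the largest model as the denominator:

\[F = \frac{(RSS_{i-1} - RSS_{i})/(df_{i-1} - df_{i})}{RSS_{10}/df_{10}}\]

For the Model6 row, for example, \(((92923.70 - 48440.89)/7)/(14040.31/92101) \approx 41685\), matching the table.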

Applicable insights to improve response

These variables are used as an example; it is understood that YearBuilt cannot, in reality, be changed at will.

Let’s assume that for this observation we have control over two predictors, HouseSize and YearBuilt. Keeping all other variables constant, what combination of HouseSize and YearBuilt values will result in the minimum Monthly Owner Costs as a Percentage of Household Income (MortgageIncome) response?

First we extract a single random observation (we do not include more than one sample because the analysis is limited to 5 pages).

### Prediction of mean response for cases like this...
#Pool <- mydata3[sample(1:nrow(mydata3), size=1, replace = F),]
#Pred_Pool <- predict(model9, Pool, interval="conf")^(1/lambda)

### Prediction of the response for this single case; interval="pred" yields a
### prediction interval for a new observation (wider than the "conf" interval
### for the mean response shown commented out above)
Pool <- mydata3[sample(1:nrow(mydata3), size=1, replace = F),]
Pred_Pool <- predict(model9, Pool, interval="pred")^(1/lambda)
Random Sample Extraction

      MedianAge HouseSize MedIncome RealStTaxes YearBuilt MortgageIncome F5_9 Zip.y MortgageIncomeLambda
67966      39.2      2.03     79315        8498      1978           7201  784 78731              14.3621

Now we look at the predicted MortgageIncome value. Do not forget that MortgageIncomeLambda is equivalent to MortgageIncome^lambda and must be back-transformed (raised to the power 1/lambda) to obtain MortgageIncome.

Expected MortgageIncome

           fit      lwr      upr
67966 6558.828 5232.512 8104.241

To restate the question: keeping all variables constant except HouseSize and YearBuilt, which combination of their values minimizes the MortgageIncome response?

First we create ranges around the variables of the selected observation.

### Create ranges around all the variables
### (only the HouseSize and YearBuilt ranges are used in the grid search below;
###  the remaining variables are held at the sampled observation's values)

MedAge_rep = rep(Pool$MedianAge[1],11)
MedAge_int = seq(from=((Pool$MedianAge[1])-2), to=((Pool$MedianAge[1])+2), length.out = 11); 
HouSiz_rep = rep(Pool$HouseSize[1],11)
HouSiz_int = seq(from=((Pool$HouseSize[1])-0.5), to=((Pool$HouseSize[1])+0.5),length.out = 11); 
MedInc_rep = rep(Pool$MedIncome[1],11)
MedInc_int = seq(from=((Pool$MedIncome[1])-1000), to=((Pool$MedIncome[1])+1000),length.out = 11)
ReaStTax_rep = rep(Pool$RealStTaxes[1],11)
ReaStTax_int = seq(from=((Pool$RealStTaxes[1])-1000), to=((Pool$RealStTaxes[1])+1000),length.out = 11)
YeaBui_rep = rep(Pool$YearBuilt[1],11)
YeaBui_int = seq(from=((Pool$YearBuilt[1])-10), to=((Pool$YearBuilt[1])+10), length.out = 11)
F5_9_rep = rep(Pool$F5_9, 11)
Zip.y_rep = rep(Pool$Zip.y, 11)

Second, we create a data frame with all possible combinations of HouseSize and YearBuilt and compute the predicted MortgageIncome for each combination. We then select the minimum MortgageIncome value and identify the HouseSize and YearBuilt values that produce it.

## Create a data frame from all combinations of the supplied vectors or factors.
Opt <- expand.grid( HouSiz_int, YeaBui_int)
colnames(Opt) <- c("HouSiz_int", "YeaBui_int")

Opt2 <- data.frame(cbind(rep(Pool$MedianAge,nrow(Opt)),
              Opt$HouSiz_int,
              rep(Pool$MedIncome,nrow(Opt)),
              rep(Pool$RealStTaxes,nrow(Opt)),
              Opt$YeaBui_int,
              rep(Pool$F5_9,nrow(Opt)),
              rep(Pool$Zip.y,nrow(Opt))))

colnames(Opt2) <- c("MedianAge", "HouSiz_int", "MedIncome", "RealStTaxes", "YeaBui_int", "F5_9", "Zip.y" )

Range_Predict <- (predict(model9, list(MedianAge=Opt2$MedianAge,
                             HouseSize=Opt2$HouSiz_int,
                             MedIncome=Opt2$MedIncome, 
                             RealStTaxes=Opt2$RealStTaxes,
                             YearBuilt=Opt2$YeaBui_int,
                             F5_9=Opt2$F5_9, 
                             Zip.y=Opt2$Zip.y)))^(1/lambda)
                             

Opt_Pool <- (predict(model9, list(MedianAge=Pool$MedianAge,
                                 HouseSize=Opt2$HouSiz_int[which.min(Range_Predict)], 
                                 MedIncome=Pool$MedIncome, 
                                 RealStTaxes=Pool$RealStTaxes, 
                                 YearBuilt=Opt2$YeaBui_int[which.min(Range_Predict)], 
                                 F5_9=Pool$F5_9, 
                                 Zip.y=Pool$Zip.y)))^(1/lambda)

Observable <- cbind(Pool$MedianAge, Opt2$HouSiz_int[which.min(Range_Predict)], Pool$MedIncome, Pool$RealStTaxes, Opt2$YeaBui_int[which.min(Range_Predict)],Opt_Pool, Pool$F5_9, Pool$Zip.y)
colnames(Observable) <- c("MedianAge", "HouseSize", "MedIncome","RealStTaxes","YearBuilt", "OptMortgageIncome", "F5_9" , "Zip.y")
Optimized Observation

MedianAge HouseSize MedIncome RealStTaxes YearBuilt OptMortgageIncome F5_9 Zip.y
     39.2      2.53     79315        8498      1988           4683.87  784 78731

It can be seen how the optimization finds the lowest MortgageIncome value by selecting the right HouseSize and YearBuilt values: moving HouseSize from 2.03 to 2.53 and YearBuilt from 1978 to 1988 lowers the predicted MortgageIncome from 6558.83 to 4683.87.