Disclosure:
This product uses the Census Bureau Data API but is not endorsed or certified by the Census Bureau. I may use the Census Bureau name in order to identify the source of API content subject to these rules. I may not use the Census Bureau name, or the like to imply endorsement of any product, service, or entity, not-for-profit, commercial or otherwise.
1. Question:
Can we predict `MortgageIncome` (Mortgage Status by Selected Monthly Owner Costs as a Percentage of Household Income) using other socio-economic variables extracted from the Census?
2. Input Data:
This study collects, organizes, analyzes, and disseminates significant amounts of information.
Variables were extracted from the Census Bureau Data API datasets CENSUS 2014 and CENSUS 2010.
State, city, county and ZIP relationships were extracted from Zip-codes.com (state TX) and CENSUS GIS.
3. Preprocessing:
Merge all the datasets into a single dataset.
Deal with NA values: replace NAs using missForest (random-forest-based imputation).
Deal with near-zero variance and low frequency-ratio predictors.
Remove, if necessary, redundant highly correlated predictors.
4. Exploratory Analysis:
We will map some predictors extracted from the Census data as geographical information (by ZIP code).
5. Build the models and compare:
Analyze quantitative data using varied statistical techniques, such as linear regression and ANOVA.
A Box-Cox variable transformation is performed, and the different models are compared using R^2, RSE and ANOVA to determine the best-performing model.
6. Applicable insights to improve response:
Optimization of the output as a function of selected variables.
## Calling the necessary libraries to run the program.
rm(list=ls())
libs <- c ("tm", "plyr", "class", "SnowballC", "dplyr", "caret", "corrplot", "utils",
"beepr","RWeka","e1071","ggplot2", "datasets", "maps","plotly", "wordcloud",
"missForest", "rvest", "knitr", "ggmap", "maptools", "rgdal","UScensus2010", "acs",
"RJSONIO", "gridExtra", "car", "MASS")
lapply(libs, library, character.only=T)
In order to filter our data to the few counties and cities of interest, we scrape from http://www.zip-codes.com/state/TX.asp a relational data set that links counties, cities, ZIP codes and states.
This dataset will be used to subset the data into the desired counties and to obtain the ZIP codes of the region of interest. In this case we focus our attention on Travis, Williamson, Hays and Bastrop counties.
setwd("~/R/Austin_SocEconInd")
url <- read_html("http://www.zip-codes.com/state/TX.asp")
## Extract the data Zip codes in the Austin Area
Zip_Region <- url %>%
html_nodes("#tblZIP") %>%
.[[1]] %>%
html_table()
colnames(Zip_Region) = (Zip_Region[1,])
Zip_Region <- Zip_Region[-1,]   ## drop the header row that was promoted to column names
## Clean the Data Set
Zip_Region$`ZIP Code` <- sapply(Zip_Region$`ZIP Code`, function(x) gsub("ZIP Code ", "", x))
Zip_Region$`Area Codes` <- NULL
Zip_Region$`ZIP Code` <- as.numeric(Zip_Region$`ZIP Code`)
This dataset contains 2598 observations for the state of Texas. Subsetting to the regions of Travis, Williamson, Hays and Bastrop results in a dataset containing 122 observations.
## Reduce the information to the Great Austin Area ###
cnty_List=c("Travis",
"Williamson",
"Hays",
"Bastrop")
Zip_Region_Austin=Zip_Region[Zip_Region$County %in% cnty_List,]
#print(paste0("From 2598 observables we ended with: ", (nrow(Zip_Region_Austin))))
We extract the shapes of the ZIP code areas within the desired counties. This data will later be used to map the variables of interest by ZIP code area.
The dataset is extracted from TIGER Products - Geography - U.S. Census Bureau.
## Extract Map of the great Austin Area
x=get_googlemap(center="Austin",maptype=c("roadmap"))
## Extract Shape of Zip code areas
zipShp=readShapePoly("./raw_zipCode_Shape/tl_2010_48_zcta510/tl_2010_48_zcta510.shp")
zipShp2=fortify(zipShp,region="ZCTA5CE10")
## Reduce to zips in Austin area
zipShp3=zipShp2[zipShp2$id %in% Zip_Region_Austin$`ZIP Code` ,]
rm(zipShp2, zipShp, Zip_Region, url)
## Eliminate unnecessary variables
zipShp3$hole <- NULL
zipShp3$group <- NULL
We extracted datasets from CENSUS 2010 and CENSUS 2014 using three different API calls.
In addition, we created some new variables that combine the number of females and males of a specific age group into a single variable.
#### Extract Data from CENSUS ####
## Extracting the data function
getCensusData=function(APIkey,state,fieldnm){
resURL=paste("http://api.census.gov/data/2010/sf1?get=",fieldnm,
"&for=zip+code+tabulation+area:*&in=state:",state,
"&key=",API_KEY_CENSUS,
sep="")
dfJSON=fromJSON(resURL)
dfJSON[[1]]
dfJSON=dfJSON[2:length(dfJSON)]
dfJSON_Zip_Code = sapply(dfJSON,function(x) x[7])
dfJSON_state = sapply(dfJSON,function(x) x[6])
dfJSON_HouseFamily = sapply(dfJSON,function(x) x[5])
dfJSON_HouseType = sapply(dfJSON,function(x) x[4])
dfJSON_HouseSize = sapply(dfJSON,function(x) x[3])
dfJSON_MedianAge = sapply(dfJSON,function(x) x[2])
dfJSON_Pop = sapply(dfJSON,function(x) x[1])
df=data.frame(dfJSON_Pop,as.numeric(dfJSON_MedianAge), dfJSON_HouseSize,
dfJSON_HouseType,dfJSON_HouseFamily, dfJSON_state,
dfJSON_Zip_Code)
names(df) = c("Pop", "MedianAge", "HouseSize",
"Households", "HouseFamily", "State",
"ZipCode")
return(df)
}
## Population, median age and household variables from the 2010 Census (SF1)
fieldnm="P0010001,P0130001,P0170001,P0180001,P0280002"
state = "48" ## State Code:TEXAS
## Census Api-Key = 837f5736b2e8c510f4a2b7de5303406594d73465
API_KEY_CENSUS = "837f5736b2e8c510f4a2b7de5303406594d73465"
dfHouse=getCensusData(API_KEY_CENSUS,state,fieldnm); head(dfHouse)
dfHouse$Zip_2 <- dfHouse$ZipCode; head(dfHouse)
dfHouse$Zip_3 <- dfHouse$ZipCode; head(dfHouse)
names(dfHouse)
names(dfHouse)=c("Pop", "MedianAge", "HouseSize","Households",
"HouseFAmily", "State", "Zip" , "Zip_2", "Zip_3")
getCensusData=function(APIkey,state,fieldnm){
resURL=paste("http://api.census.gov/data/2014/acs5?get=",fieldnm,
"&for=zip+code+tabulation+area:*&key=",API_KEY_CENSUS,
sep="")
dfJSON=fromJSON(resURL)
dfJSON[[1]]
dfJSON=dfJSON[2:length(dfJSON)]
dfJSON_Zip_Code = sapply(dfJSON,function(x) x[6])
dfJSON_CostIncme = sapply(dfJSON,function(x) x[5])
dfJSON_HouseMortgageIncome = sapply(dfJSON,function(x) x[4])
dfJSON_MedYearBuilt = sapply(dfJSON,function(x) x[3])
dfJSON_MedRealStTaxes = sapply(dfJSON,function(x) x[2])
dfJSON_MedIncome = sapply(dfJSON,function(x) x[1])
df=data.frame(as.numeric(as.character(dfJSON_MedIncome)),
as.numeric(as.character(dfJSON_MedRealStTaxes)),
as.numeric(as.character(dfJSON_MedYearBuilt)),
as.numeric(as.character(dfJSON_HouseMortgageIncome)),
as.numeric(as.character(dfJSON_CostIncme)),
as.numeric(as.character(dfJSON_Zip_Code)))
names(df)=c("MedIncome", "RealStTaxes", "YearBuilt",
"MortgageIncome", "Costincome",
"ZipCode")
return(df)
}
## Median income, real estate taxes, year built and owner-cost variables (2014 ACS)
fieldnm="B19013_001E,B25103_001E,B25035_001E,B25091_001E,B25092_001E"
state = "48" ## State Code:TEXAS
## Census Api-Key = 837f5736b2e8c510f4a2b7de5303406594d73465
API_KEY_CENSUS = "837f5736b2e8c510f4a2b7de5303406594d73465"
dfIncome=getCensusData(API_KEY_CENSUS,state,fieldnm); head(dfIncome)
dfIncome$Zip_2 <- dfIncome$ZipCode; head(dfIncome)
dfIncome$Zip_3 <- dfIncome$ZipCode; head(dfIncome)
names(dfIncome)=c("MedIncome", "RealStTaxes", "YearBuilt",
"MortgageIncome", "Costincome",
"Zip", "Zip_2", "Zip_3" )
head(dfIncome)
getCensusData=function(APIkey,state,fieldnm){
resURL=paste("http://api.census.gov/data/2014/acs5?get=",fieldnm,
"&for=zip+code+tabulation+area:*&key=",API_KEY_CENSUS,
sep="")
dfJSON=fromJSON(resURL)
dfJSON[[1]]
dfJSON=dfJSON[2:length(dfJSON)]
dfJSON_Zip_Code= sapply(dfJSON,function(x) x[10])
dfJSON_F15_17 = sapply(dfJSON,function(x) x[9])
dfJSON_F10_14 = sapply(dfJSON,function(x) x[8])
dfJSON_F5_9 = sapply(dfJSON,function(x) x[7])
dfJSON_FUnder5 = sapply(dfJSON,function(x) x[6])
dfJSON_M15_17 = sapply(dfJSON,function(x) x[5])
dfJSON_M10_14 = sapply(dfJSON,function(x) x[4])
dfJSON_M5_9 = sapply(dfJSON,function(x) x[3])
dfJSON_MUnder5 = sapply(dfJSON,function(x) x[2])
dfJSON_Tot = sapply(dfJSON,function(x) x[1])
df=data.frame(as.numeric(as.character(dfJSON_Tot)),
as.numeric(as.character(dfJSON_MUnder5)),
as.numeric(as.character(dfJSON_M5_9)),
as.numeric(as.character(dfJSON_M10_14)),
as.numeric(as.character(dfJSON_M15_17)),
as.numeric(as.character(dfJSON_FUnder5)),
as.numeric(as.character(dfJSON_F5_9)),
as.numeric(as.character(dfJSON_F10_14)),
as.numeric(as.character(dfJSON_F15_17)),
as.numeric(as.character(dfJSON_Zip_Code)))
names(df)=c("Tot", "MUnder5", "M5_9",
"M10_14", "M15_17", "FUnder5", "F5_9", "F10_14", "F15_17",
"ZipCode")
return(df)
}
## Population by sex and age group (under 18) from the 2014 ACS
fieldnm="B01001_001E,B01001_003E,B01001_004E,B01001_005E,B01001_006E,B01001_027E,B01001_028E,B01001_029E,B01001_030E"
state = "48" ## State Code:TEXAS
## Census Api-Key = 837f5736b2e8c510f4a2b7de5303406594d73465
API_KEY_CENSUS = "837f5736b2e8c510f4a2b7de5303406594d73465"
dfage=getCensusData(API_KEY_CENSUS,state,fieldnm); head(dfage)
dfage$Zip_2 <- dfage$ZipCode
dfage$Zip_3 <- dfage$ZipCode; head(dfage)
names(dfage)=c("Tot", "MUnder5", "M5_9","M10_14", "M15_17",
"FUnder5", "F5_9", "F10_14", "F15_17",
"Zip" , "Zip_2", "Zip_3")
dfage$Under5 = dfage$MUnder5 + dfage$FUnder5
dfage$at5_9 <- dfage$M5_9 + dfage$F5_9
dfage$at10_14 <- dfage$M10_14 + dfage$F10_14
dfage$at15_17 <- dfage$M15_17 + dfage$F15_17
Now we have four datasets that we will merge into a single one. Below we show all the variables of the resulting dataset.
## Merge the four datasets into a single set
ZipHouse=merge(zipShp3,dfHouse,by.x=c("id"),by.y=c("Zip"))
ZipHouseIncome = merge(ZipHouse, dfIncome, by="Zip_2")
mydata <- merge(ZipHouseIncome, dfage, by="Zip_2")
## [1] "Zip_2" "id" "long" "lat"
## [5] "order" "piece" "Pop" "MedianAge"
## [9] "HouseSize" "Households" "HouseFAmily" "State"
## [13] "MedIncome" "RealStTaxes" "YearBuilt" "MortgageIncome"
## [17] "Costincome" "Tot" "MUnder5" "M5_9"
## [21] "M10_14" "M15_17" "FUnder5" "F5_9"
## [25] "F10_14" "F15_17" "Under5" "at5_9"
## [29] "at10_14" "at15_17"
First, we are going to make a copy of our original dataset that will be used exclusively for graphical purposes.
## Create a new set that we will use for mapping
mydata_plot= mydata
Second, we will convert all non-numeric variables to numeric.
### Set all variables as numeric
mydata$Zip_2 <- as.numeric(as.character(mydata$Zip_2))
mydata$id <- as.numeric(mydata$id)
mydata$piece <- as.numeric(as.character(mydata$piece))
mydata$Pop <- as.numeric(as.character(mydata$Pop))
mydata$HouseSize <- as.numeric(as.character(mydata$HouseSize))
mydata$Households <- as.numeric(as.character(mydata$Households))
mydata$HouseFAmily <- as.numeric(as.character(mydata$HouseFAmily))
mydata$State <- as.numeric(as.character(mydata$State))
Third, we look only at the variables containing NAs and at their distribution and frequency per variable.
## Find and locate NA variables
Find_NA = data.frame (apply(mydata, 2, function(x) sum(is.na(x)==T)))
## Express NAs as a percentage of the column total
NA_Variables <- Find_NA*100/nrow(mydata)
## Exclude variables not containing NAs
NA_values<- subset(NA_Variables[1], (Find_NA/nrow(mydata)) != 0)
names(NA_values)[1] <- "Percentage_NA";
|  | MedIncome | RealStTaxes | YearBuilt | Costincome |
|---|---|---|---|---|
| Percentage_NA | 0.148392 | 0.193885 | 0.148392 | 0.193885 |
Since the percentage of NAs is so low for these variables, we will simply drop the affected observations.
## Impute missing values using all parameters as default values
Ratio_NA <- max(Find_NA)/nrow(mydata)   ## largest per-column NA fraction
if (Ratio_NA <= 0.1){ ## to be set to 0.005
mydata <- na.omit(mydata)
data <- mydata
} else {
data_imp <- missForest(mydata)
print(paste0( "Proportion of NA falsely classified: ", round(data_imp$OOBerror, digits = 3)))
data <- data_imp$ximp
}
All NA observations have been handled correctly.
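As a quick sanity check (a one-line sketch), we can confirm that no NAs remain in the working set:
## Verify that no NA values remain after omission/imputation
stopifnot(sum(is.na(data)) == 0)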
Now that we have gotten rid of the NAs, we look for variables with near-zero variance and for those with a zero frequency ratio.
## Identify near-zero-variance covariates
ZC <- nearZeroVar(data, saveMetrics = T)
|  | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| piece | 79.07143 | 0.004341 | FALSE | TRUE |
| State | 0.00000 | 0.001085 | TRUE | TRUE |
The data set contains two variables flagged with near-zero variance; one of them (State) has zero variance and a frequency ratio of zero. Here we eliminate the zero-variance variable from the data set (the other, piece, is revisited below based on its p-value).
## Eliminate the zero-variance variable from the data set.
nzv_T <- which(ZC$zeroVar == T) # locate the zero-variance variables
mydata1 <- data[,-c(nzv_T)]
Let’s map some socioeconomic indexes from the Census in the Greater Austin Area. Our regression model is not ideal for classification purposes, since the response takes continuous values; nevertheless, plotting the Census socioeconomic indexes geographically by ZIP code reveals interesting patterns.
To achieve this, we first create binned variables that allow us to map the socio-economic indexes by ZIP area.
## Create groups for average age
mydata_plot$MedAgeLvl <- cut(mydata_plot$MedianAge,
breaks = c(0,30,35,40,45,50,55,60,65,70,80),
labels = c("<30", "30-35", "35-40", "40-45", "45-50", "50-55",
"55-60", "60-65", "65-70", ">70"))
## Create Groups for Median Income
mydata_plot$MedIncomeLvl <- cut(mydata_plot$MedIncome,
breaks = c(0,40000,60000,80000,100000,120000, 140000,160000,180000),
labels = c("<40k", "40-60k", "60-80k", "80-100k", "100-120k",
"120-140k","140-160k", "160-180k" ))
## Create Groups for Group Median Monthly Owner Costs as a Percentage of Household Income
mydata_plot$CostIncomelvl <- cut(mydata_plot$Costincome,
breaks = c(0,5,10,15,20,25,30,35),
labels = c("<5%", "5-10%", "10-15%","15-20%", "20-25%",
"25-30%", "30-35%"))
## Create groups for median real estate taxes
mydata_plot$RealStTaxesLvl <- cut(mydata_plot$RealStTaxes,
breaks = c(0,2000,4000,6000,8000,10000,15000),
labels = c("<2000", "2000-4000", "4000-6000", "6000-8000",
"8000-10000", "10000-15000"))
## Create groups for Median year structure built
mydata_plot$YearBuiltLvl <- cut(mydata_plot$YearBuilt,
breaks = c(1940, 1965,1971,1977,1983,1989,1995,2001,2007,2013,2019),
labels = c("<1965","1965-1971", "1971-1977", "1977-1983",
"1983-1989", "1989-1995", "1995-2001","2001-2007",
"2007-2013", "2013-2019"))
Second, we map these variables by ZIP code.
### Map Age Lvl
x=get_googlemap(center="Austin",maptype=c("roadmap"))
p2=ggmap(x)
p2=p2+geom_polygon(data=mydata_plot, aes(x=long, y=lat, group=Zip_2, fill=MedAgeLvl),
color="black",alpha=0.4)
p2=p2+scale_fill_manual(values=c("springgreen", "springgreen4",
"orange", "orange4",
"red","red4",
"grey40"))
#p2 = p2 + geom_text( data=mydata_plot, hjust=0.5, vjust=0, check_overlap = F, fontface="bold",
# aes(x=long, y=lat, label=MedAgeLvl),
# colour="red3", size=3, alpha = 0.9 )
p2 = p2 + labs(title = "Median Age by Zip (US CENSUS)")
p2 = p2 + theme(plot.title = element_text(size = rel(1.2)),
legend.title =element_text(size = rel(1)))
### Map CostIncome
x=get_googlemap(center="Austin",maptype=c("roadmap"))
p3=ggmap(x)
p3=p3+geom_polygon(data=mydata_plot, aes(x=long, y=lat, group=Zip_2, fill=CostIncomelvl),
color="black",alpha=0.4)
p3=p3+scale_fill_manual(values=c("springgreen", "springgreen4",
"orange", "orange4",
"red","red4",
"grey40"))
p3 = p3 + labs(title = "Cost IncomeLvl by Zip (US CENSUS)")
p3 = p3 + theme(plot.title = element_text(size = rel(1.2)),
legend.title =element_text(size = rel(1.2)))
### Map Year Built
x=get_googlemap(center="Austin",maptype=c("roadmap"))
p4=ggmap(x)
p4=p4+geom_polygon(data=mydata_plot, aes(x=long, y=lat, group=Zip_2, fill=YearBuiltLvl ),
color="black",alpha=0.4)
p4=p4+scale_fill_manual(values=c("springgreen", "springgreen3", "springgreen4",
"orange", "orange3", "orange4",
"red","red3", "red4",
"grey40", "grey20", "grey0"))
#p4 = p4 + geom_text( data=mydata_plot, hjust=0.5, vjust=0, check_overlap = F, fontface="bold",
# aes(x=long, y=lat, label=YearBuiltLvl),
# colour="red3", size=3, alpha = 0.9 )
p4 = p4 + labs(title = "Median Year Built by Zip (US CENSUS)")
p4 = p4 + theme(plot.title = element_text(size = rel(1)),
legend.title =element_text(size = rel(1)))
### Map Real Estate Taxes
x=get_googlemap(center="Austin",maptype=c("roadmap"))
p5=ggmap(x)
p5=p5+geom_polygon(data=mydata_plot, aes(x=long, y=lat, group=Zip_2, fill= RealStTaxesLvl),
color="black",alpha=0.4)
p5=p5+scale_fill_manual(values=c("springgreen", "springgreen4",
"orange", "orange4",
"red","red4",
"grey40", "grey20"))
p5 = p5 + labs(title = "Median Real Estate Taxes by Zip (US CENSUS)")
p5 = p5 + theme(plot.title = element_text(size = rel(1)),
legend.title =element_text(size = rel(1)))
### Map Med income
x=get_googlemap(center="Austin",maptype=c("roadmap"))
p6=ggmap(x)
p6=p6+geom_polygon(data=mydata_plot, aes(x=long, y=lat, group=Zip_2, fill=MedIncomeLvl),
color="black",alpha=0.4)
p6=p6+scale_fill_manual(values=c("springgreen", "springgreen4",
"orange", "orange4",
"red","red4",
"grey40", "grey20"))
p6 = p6 + labs(title = "Median Income by Zip (US CENSUS)")
p6 = p6 + theme(plot.title = element_text(size = rel(1)),
legend.title =element_text(size = rel(1)))
Third, we will study the collinearity between predictors by means of the variance inflation factor (VIF).
We will perform a linear regression using lm.
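The step that creates mydata2 is missing from the notebook. Based on the predictors that appear in the p-value table and coefficient summaries below, a plausible reconstruction (the exact column selection is an assumption) is:
## Reconstruction (assumed): restrict mydata1 to the modeling columns
keep <- c("MortgageIncome", "piece", "MedianAge", "HouseSize", "Households",
          "MedIncome", "RealStTaxes", "YearBuilt", "Costincome",
          "M15_17", "FUnder5", "F5_9", "Zip.y")
mydata2 <- mydata1[, intersect(keep, names(mydata1))]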
## linear regression
model0 <- (lm(MortgageIncome ~. , data = mydata2))
We now inspect the p-value of each predictor; those above 0.05 are candidates for removal.
| piece | MedianAge | HouseSize | Households | MedIncome | RealStTaxes | YearBuilt | Costincome | M15_17 | FUnder5 | F5_9 | Zip.y |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.8422 | 0 | 0 | 0 | 0 | 0 | 0 | 0.2261 | 0 | 0 | 0 | 0 |
The P-values of the predictors piece and Costincome are clearly above 0.05. This indicates that these two variables should be excluded from our model.
## Elimination of variables
mydata2$piece <- NULL
mydata2$Costincome <- NULL
Now we can perform a linear regression analysis without these two variables. We will look at the RSE rather than the R^2 index, since R^2 depends heavily on the number of variables.
model1 <- (lm(MortgageIncome ~. , data = mydata2))
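The RSE comparison printed below can be reproduced from the two fitted models (a minimal sketch of the presumed print statement):
## Compare the residual standard errors of the two models
print(paste0("Residual Standard Error (RSE) Model_0: ", round(summary(model0)$sigma, digits = 3),
             " Residual Standard Error (RSE) Model_1: ", round(summary(model1)$sigma, digits = 3)))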
## [1] "Residual Standard Error (RSE) Model_0: 1001.722 Residual Standard Error (RSE) Model_1: 1001.719"
The RSE (Residual Standard Error) remains essentially the same after eliminating the predictors piece and Costincome, which means we can continue our study without these variables.
A major problem with multivariate regression is collinearity. If two or more highly correlated predictor variables are both entered into a regression model, the standard errors inflate and the slope estimates become very unstable. We can assess collinearity by studying the variance inflation factor (VIF).
## VIF
VIF_model1 <- vif(lm(MortgageIncome ~. , data = mydata2))
| Zip.y | YearBuilt | MedianAge | HouseSize | RealStTaxes | MedIncome | M15_17 | Households | FUnder5 | F5_9 |
|---|---|---|---|---|---|---|---|---|---|
| 1.2517 | 1.8908 | 3.4368 | 5.2744 | 5.755 | 5.8085 | 11.0599 | 11.3197 | 14.2099 | 18.3679 |
We use the common rule of thumb to interpret the VIF: a value of 1 indicates no correlation, values between 1 and 5 indicate moderate correlation, and values above 10 indicate problematic collinearity that warrants removal.
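As a quick sketch, the predictors exceeding this threshold can be listed directly from the named VIF vector:
## Flag the predictors whose VIF is 10 or more (candidates for removal)
names(VIF_model1)[VIF_model1 >= 10]
## Expected from the table above: Households, M15_17, FUnder5, F5_9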
## Eliminating the highly correlated variables
myvars <- names(mydata2) %in% c(names(mydata2[VIF_model1>=10])[-2]) ## the [-2] drops one flagged name so that F5_9 ends up retained (see below)
mydata3 <- mydata2[!myvars]
model2 <- (lm(MortgageIncome ~. , data = mydata3))
After eliminating three of the predictors with a VIF above 10 (M15_17, Households and FUnder5), our final model contains seven predictors.
|  | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | -91678.1695 | 1320.6539 | -69.4188 | 0 |
| MedianAge | 59.8408 | 1.1295 | 52.9814 | 0 |
| HouseSize | -2613.7479 | 21.0663 | -124.0722 | 0 |
| MedIncome | 0.0032 | 0.0005 | 6.7195 | 0 |
| RealStTaxes | -0.0236 | 0.0052 | -4.5163 | 0 |
| YearBuilt | 57.7077 | 0.7002 | 82.4125 | 0 |
| F5_9 | 5.1164 | 0.0074 | 690.3247 | 0 |
| Zip.y | -0.2242 | 0.0087 | -25.8444 | 0 |
If you are asking yourself what happened to F5_9, with its VIF of 18.3679, and why this predictor was not eliminated: the answer is that F5_9 is heavily correlated with M15_17 and FUnder5.
As can be seen below, once these two variables are eliminated the VIF of F5_9 drops to 1.2145.
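A correlation matrix over these three predictors (a quick sketch) makes the relationship explicit:
## Pairwise correlations among the collinear age-count predictors
round(cor(mydata2[, c("F5_9", "M15_17", "FUnder5")]), 3)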
#print(paste0("R^2 Model_2 = ", round(summary(model2)$r.sq,digits = 3), " Residual Standard Error (RSE) Model_2 = ", round(summary(model2)$sigma, digits = 3)))
## VIF
VIF_model2 <- vif(lm(MortgageIncome ~. , data = mydata3))
## All predictors now have low p-values, and the VIFs are within normal parameters
## We will move on to a non-linear analysis to see if the fit improves
| MedianAge | HouseSize | MedIncome | RealStTaxes | YearBuilt | F5_9 | Zip.y |
|---|---|---|---|---|---|---|
| 2.7104 | 2.9173 | 5.4177 | 5.4011 | 1.8164 | 1.4483 | 1.2145 |
There are no more highly correlated predictors; RealStTaxes and MedIncome are kept because their VIFs sit at the low end of the "highly correlated" range. Since we have reached a stopping point in the variable-reduction process, we can now focus on finding the best model.
We start by testing for heteroskedasticity.
car::ncvTest(model2)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 14048.88 Df = 1 p = 0
lmtest::bptest(model2)
##
## studentized Breusch-Pagan test
##
## data: model2
## BP = 19031, df = 7, p-value < 2.2e-16
Both p-values are below the 0.05 significance level. The Breusch-Pagan test produces a statistic that is chi-squared distributed; for our data the score statistic is 14048.88. The null hypothesis is homoskedasticity, and it is rejected when the p-value is below 0.05, which is the case here.
One of the main assumptions of linear model fitting is that the data are normally distributed. This assumption does not always hold, so we need to transform the data when it is not normally distributed.
Box-Cox is a way to transform the response variable to meet the "normality" assumption.
We perform a Box-Cox transformation of the predictor variables to make them approximately normal, and then apply these transformations in the linear model4.
bc <- boxcox(model2, plotit = F)
lambda <- bc$x[which.max(bc$y)]
model3 <- (lm(I(MortgageIncome^lambda) ~ MedianAge + HouseSize + MedIncome + RealStTaxes + YearBuilt + F5_9 + Zip.y , data = mydata3))
## Transformation of predictor variables using Box-Cox
BoxPlot_MA <-caret::BoxCoxTrans(mydata3$MedianAge)
BoxPlot_HS <- caret::BoxCoxTrans(mydata3$HouseSize)
BoxPlot_MI <- caret::BoxCoxTrans(mydata3$MedIncome)
BoxPlot_RST<- caret::BoxCoxTrans(mydata3$RealStTaxes)
BoxPlot_YB <- caret::BoxCoxTrans(mydata3$YearBuilt)
BoxPlot_F <- caret::BoxCoxTrans(mydata3$F5_9)
BoxPlot_Zip <- caret::BoxCoxTrans(mydata3$Zip.y)
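The exponents used in model4 below are presumably the rounded lambdas estimated by these transformations; a sketch of how to inspect them (caret stores the estimate in the $lambda element):
## Collect the estimated Box-Cox lambdas for each predictor
sapply(list(MedianAge = BoxPlot_MA, HouseSize = BoxPlot_HS, MedIncome = BoxPlot_MI,
            RealStTaxes = BoxPlot_RST, YearBuilt = BoxPlot_YB,
            F5_9 = BoxPlot_F, Zip = BoxPlot_Zip),
       function(x) x$lambda)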
model4 <-lm((MortgageIncome) ~ I(MedianAge^-0.1) + I(HouseSize^0.6) + I(MedIncome^0.2) + I(RealStTaxes^(-0.4)) + I(YearBuilt^(2)) + I(F5_9^0.3) + I(Zip.y^2) , data = mydata3)
Now we apply Box-Cox to transform the response variable MortgageIncome and obtain a homoskedastic distribution. In the next plot we can see the residual distribution of the non-transformed model (heteroskedastic) and the residual distribution of the transformed model.
bc <- boxcox(model4, plotit = F)
lambda <- bc$x[which.max(bc$y)]
mydata3$MortgageIncomeLambda <- (mydata3$MortgageIncome)^lambda
model5<-lm(I(MortgageIncome^lambda) ~ I(MedianAge^-0.1) + I(HouseSize^0.6) + I(MedIncome^0.2) + I(RealStTaxes^(-0.4)) + I(YearBuilt^(2)) + I(F5_9^0.3) + I(Zip.y^2) , data = mydata3)
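The residual comparison described above can be drawn with plot.lm (a sketch; which = 1 gives the Residuals vs Fitted panel):
## Residuals vs Fitted: untransformed model2 vs Box-Cox-transformed model5
par(mfrow = c(1, 2))
plot(model2, which = 1, main = "Model2 (untransformed)")
plot(model5, which = 1, main = "Model5 (Box-Cox)")
par(mfrow = c(1, 1))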
Finally, we create non-linear models under these transformations and use ANOVA, R^2 and RSE to select the most accurate model.
model6<-lm(I(MortgageIncome^lambda) ~ poly((I(MedianAge^-0.1)),2) + poly((I(HouseSize^0.6)),2) + poly((I(MedIncome^0.2)),2) + poly((I(RealStTaxes^(-0.4))),2) + poly((I(YearBuilt^(2))),2) + poly((I(F5_9^0.3)),2) + poly((I(Zip.y^2)),2) , data = mydata3)
model7<-lm(I(MortgageIncome^lambda) ~ poly((I(MedianAge^-0.1)),3) + poly((I(HouseSize^0.6)),3) + poly((I(MedIncome^0.2)),3) + poly((I(RealStTaxes^(-0.4))),3) + poly((I(YearBuilt^(2))),3) + poly((I(F5_9^0.3)),3) + poly((I(Zip.y^2)),3) , data = mydata3)
model8 <- lm(I(MortgageIncome^lambda) ~ poly((I(MedianAge^-0.1)),4) + poly((I(HouseSize^0.6)),4) + poly((I(MedIncome^0.2)),4) + poly((I(RealStTaxes^(-0.4))),4) + poly((I(YearBuilt^(2))),4) + poly((I(F5_9^0.3)),4) + poly((I(Zip.y^2)),4) , data = mydata3)
model9 <- lm(I(MortgageIncome^lambda) ~ poly((I(MedianAge^-0.1)),5) + poly((I(HouseSize^0.6)),5) + poly((I(MedIncome^0.2)),5) + poly((I(RealStTaxes^(-0.4))),5) + poly((I(YearBuilt^(2))),5) + poly((I(F5_9^0.3)),5) + poly((I(Zip.y^2)),5) , data = mydata3)
model10 <- lm(I(MortgageIncome^lambda) ~ poly((I(MedianAge^-0.1)),6) + poly((I(HouseSize^0.6)),6) + poly((I(MedIncome^0.2)),6) + poly((I(RealStTaxes^(-0.4))),6) + poly((I(YearBuilt^(2))),6) + poly((I(F5_9^0.3)),6) + poly((I(Zip.y^2)),6) , data = mydata3)
Now we have six different models:
* Model5
\[MortgageIncome^\lambda = MedianAge^{-0.1} + HouseSize^{0.6} + MedIncome^{0.2} + RealStTaxes^{-0.4} + YearBuilt^{2} + F5\_9^{0.3} + Zip.y^2\]
* Models 6, 7, 8, 9 and 10
\[MortgageIncome^\lambda = (MedianAge^{-0.1})^X + (HouseSize^{0.6})^X + (MedIncome^{0.2})^X + (RealStTaxes^{-0.4})^X + (YearBuilt^{2})^X + (F5\_9^{0.3})^X + (Zip.y^2)^X\]
* Models 6, 7, 8, 9 and 10 create polynomials of order X = 2, 3, 4, 5 and 6 respectively.
Now we approach model selection by studying R^2, RSE and the ANOVA results.
r2 <- c(summary(model5)$r.sq, summary(model6)$r.sq, summary(model7)$r.sq, summary(model8)$r.sq, summary(model9)$r.sq, summary(model10)$r.sq)
RSE <- c(summary(model5)$sigma, summary(model6)$sigma, summary(model7)$sigma, summary(model8)$sigma, summary(model9)$sigma, summary(model10)$sigma)
anova_res <- anova(model5, model6, model7, model8, model9, model10)
Fit_res <- data.frame(r2, RSE, anova_res)
colnames(Fit_res) <- c("R^2", "RSE", "Anova Res.Df", "Anova RSS", "Anova Df", "Anova Sum of Sq", "Anova F", "Anova Pr(>F)")
row.names(Fit_res) <- c("Model5", "Model6", "Model7", "Model8", "Model9", "Model10")
|  | R^2 | RSE | Anova Res.Df | Anova RSS | Anova Df | Anova Sum of Sq | Anova F | Anova Pr(>F) |
|---|---|---|---|---|---|---|---|---|
| Model5 | 0.8964 | 1.0043 | 92136 | 92923.70 | NA | NA | NA | NA |
| Model6 | 0.9460 | 0.7251 | 92129 | 48440.89 | 7 | 44482.810 | 41685.194 | 0 |
| Model7 | 0.9512 | 0.6891 | 92122 | 43739.14 | 7 | 4701.751 | 4406.048 | 0 |
| Model8 | 0.9662 | 0.5738 | 92115 | 30328.64 | 7 | 13410.494 | 12567.080 | 0 |
| Model9 | 0.9776 | 0.4668 | 92108 | 20070.26 | 7 | 10258.384 | 9613.213 | 0 |
| Model10 | 0.9844 | 0.3904 | 92101 | 14040.31 | 7 | 6029.948 | 5650.712 | 0 |
It can be seen that the best R^2 and RSE results are obtained with Model10, and the ANOVA analysis also points to Model10: all ANOVA p-values are below 0.05, so each increase in polynomial order yields a significant improvement in fit. Below we can see the Residual versus Fitted plots of three different models.
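A sketch of how such plots can be generated (the original does not say which three models were shown; here we pick Model5, Model7 and Model10 as representatives):
## Residuals vs Fitted for three of the candidate models
par(mfrow = c(1, 3))
plot(model5, which = 1, main = "Model5")
plot(model7, which = 1, main = "Model7")
plot(model10, which = 1, main = "Model10")
par(mfrow = c(1, 1))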
These variables are used only as an example; it is understood that YearBuilt cannot actually be changed at will.
Let's assume that for this observation we have control over two predictors, HouseSize and YearBuilt. If we keep all the other variables constant, what combination of HouseSize and YearBuilt values will result in the minimum Monthly Owner Costs as a Percentage of Household Income (MortgageIncome) response?
First we extract a single random observation (we do not include more than one sample because the analysis is limited to five pages).
### Prediction of mean response for cases like this...
#Pool <- mydata3[sample(1:nrow(mydata3), size=1, replace = F),]
#Pred_Pool <- predict(model9, Pool, interval="conf")^(1/lambda)
### Prediction of mean response for cases like this...
Pool <- mydata3[sample(1:nrow(mydata3), size=1, replace = F),]
Pred_Pool <- predict(model9, Pool, interval="pred")^(1/lambda)
|  | MedianAge | HouseSize | MedIncome | RealStTaxes | YearBuilt | MortgageIncome | F5_9 | Zip.y | MortgageIncomeLambda |
|---|---|---|---|---|---|---|---|---|---|
| 67966 | 39.2 | 2.03 | 79315 | 8498 | 1978 | 7201 | 784 | 78731 | 14.3621 |
Now we look at the predicted MortgageIncome value. Do not forget that MortgageIncomeLambda is equivalent to MortgageIncome^lambda and must be back-transformed to obtain MortgageIncome.
|  | fit | lwr | upr |
|---|---|---|---|
| 67966 | 6558.828 | 5232.512 | 8104.241 |
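As a worked check of the back-transformation: with the values printed above, lambda is approximately 0.3, since 14.3621^(1/0.3) ≈ 7201, the observation's original MortgageIncome.
## Back-transform the Box-Cox response to the original scale
Pool$MortgageIncomeLambda^(1/lambda)   ## approximately equal to Pool$MortgageIncome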
Returning to the question: if we keep all the variables constant except HouseSize and YearBuilt, what combination of their values results in the minimum MortgageIncome response?
First we create ranges around each variable.
### Create all the ranges around all variables.
MedAge_rep = rep(Pool$MedianAge[1],11)
MedAge_int = seq(from=((Pool$MedianAge[1])-2), to=((Pool$MedianAge[1])+2), length.out = 11);
HouSiz_rep = rep(Pool$HouseSize[1],11)
HouSiz_int = seq(from=((Pool$HouseSize[1])-0.5), to=((Pool$HouseSize[1])+0.5),length.out = 11);
MedInc_rep = rep(Pool$MedIncome[1],11)
MedInc_int = seq(from=((Pool$MedIncome[1])-1000), to=((Pool$MedIncome[1])+1000),length.out = 11)
ReaStTax_rep = rep(Pool$RealStTaxes[1],11)
ReaStTax_int = seq(from=((Pool$RealStTaxes[1])-1000), to=((Pool$RealStTaxes[1])+1000),length.out = 11)
YeaBui_rep = rep(Pool$YearBuilt[1],11)
YeaBui_int = seq(from=((Pool$YearBuilt[1])-10), to=((Pool$YearBuilt[1])+10), length.out = 11)
F5_9_rep = rep(Pool$F5_9, 11)
Zip.y_rep = rep(Pool$Zip.y, 11)
Second, we create a data frame with all possible combinations of HouseSize and YearBuilt and compute the corresponding MortgageIncome values. Then we select the minimum MortgageIncome value and identify the HouseSize and YearBuilt parameters that produce it.
## Create a data frame from all combinations of the supplied vectors or factors.
Opt <- expand.grid( HouSiz_int, YeaBui_int)
colnames(Opt) <- c("HouSiz_int", "YeaBui_int")
Opt2 <- data.frame(cbind(rep(Pool$MedianAge,nrow(Opt)),
Opt$HouSiz_int,
rep(Pool$MedIncome,nrow(Opt)),
rep(Pool$RealStTaxes,nrow(Opt)),
Opt$YeaBui_int,
rep(Pool$F5_9,nrow(Opt)),
rep(Pool$Zip.y,nrow(Opt))))
colnames(Opt2) <- c("MedianAge", "HouSiz_int", "MedIncome", "RealStTaxes", "YeaBui_int", "F5_9", "Zip.y" )
Range_Predict <- (predict(model9, list(MedianAge=Opt2$MedianAge,
HouseSize=Opt2$HouSiz_int,
MedIncome=Opt2$MedIncome,
RealStTaxes=Opt2$RealStTaxes,
YearBuilt=Opt2$YeaBui_int,
F5_9=Opt2$F5_9,
Zip.y=Opt2$Zip.y)))^(1/lambda)
Opt_Pool <- (predict(model9, list(MedianAge=Pool$MedianAge,
HouseSize=Opt2$HouSiz_int[which.min(Range_Predict)],
MedIncome=Pool$MedIncome,
RealStTaxes=Pool$RealStTaxes,
YearBuilt=Opt2$YeaBui_int[which.min(Range_Predict)],
F5_9=Pool$F5_9,
Zip.y=Pool$Zip.y)))^(1/lambda)
Observable <- cbind(Pool$MedianAge, Opt2$HouSiz_int[which.min(Range_Predict)], Pool$MedIncome, Pool$RealStTaxes, Opt2$YeaBui_int[which.min(Range_Predict)],Opt_Pool, Pool$F5_9, Pool$Zip.y)
colnames(Observable) <- c("MedianAge", "HouseSize", "MedIncome","RealStTaxes","YearBuilt", "OptMortgageIncome", "F5_9" , "Zip.y")
| MedianAge | HouseSize | MedIncome | RealStTaxes | YearBuilt | OptMortgageIncome | F5_9 | Zip.y |
|---|---|---|---|---|---|---|---|
| 39.2 | 2.53 | 79315 | 8498 | 1988 | 4683.87 | 784 | 78731 |
It can be seen how the optimization finds the lowest MortgageIncome value by selecting the right HouseSize and YearBuilt values.
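As a final sketch, the size of the improvement can be quantified from the fitted values above (roughly a 29% reduction, from about 6559 down to about 4684):
## Percent reduction in predicted MortgageIncome after the optimization
round(100 * (1 - Opt_Pool / Pred_Pool[1, "fit"]), 1)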