For this final project, we want to analyze the U.S household income distributions among states and counties. When we seek jobs, wages are one of the most concerned questions we would like to know. So, our focus is which states and counties are in top incomes locations, and what factors could be the correlative.
The data was downloaded from Kaggle and read in R. we are planning to group the data by states and plot average incomes. We will also plot mean incomes by counties within nationwide. Finally, we will try fit the date with linear regression or other statistics modelings.
We hope analysis of this dataset can help people have a sense of the income distributions among different states and counties, so they can choose their ideal locations for income accordingly. However, this analysis only takes locations in consideration, we do not analyze other factors that can affect incomes, like types of jobs, population density.
Packages required for reproduce this analysis are listed below
library(readr) # load .csv file
library(dplyr) # manipulate data
library(ggplot2) #visualize the data
library(DT) # output data in table
library(PerformanceAnalytics)
library(lattice) # charts and tables
library(corrplot) #correlation plot
library(Hmisc) # plotting
library(caret)# modeling
library(maps) # maps state and county incomes
library(ggrepel) #avoid label overlapping
library(plotly) #showing data and plot
We obtained this dataset from Kaggle.
The database contains 32,000 records on US Household Income Statistics & Geo Locations. The data was provided by the U.S. Census Reports for years from 2011-2015. This dataset has 19 variables and 32526 observations. There is only 1 missising value in Area_Code variable and it was recored as “M”.
This dataset is tidy and clean engough, so just moved forward and selected the interested variables for further analysis: State_Name, State_ab, City, Type, ALand, AWater, Lat, Lon, Mean, Median. The State_Name, State_ab, City, Type help to filter intersested locations and when we compare incomes in cities, we only want observations with Type of county. Lat and Lon help to build up a map. ALand and AWater are used for modelings. Mean and Median are the values we want to compare.
We will read in the file and clean it up with the following codes:
df <- read_csv("kaggle_income.csv")
df_clean <- select(df, State_Name, State_ab, City, Type, ALand, AWater, Lat, Lon, Mean, Median)
df_clean_County <- df %>%
select (County, Lat, Lon, Mean, Median) %>%
arrange(desc(Mean))
states <- map_data("state")
counties <- map_data("county")
We show the data in a table:
datatable(df_clean, caption = "Table 1: Tidy Dataset")
Here are variables metadata
| Variable | Information |
|---|---|
| State_Name | Type: Character Description: The state code reported by the U.S. Census Bureau for the specified geographic location. |
| State_ab | Type: Character Description: The abbreviated state name reported by the U.S. Census Bureau for the specified geographic location. |
| City | Type: Character Description: The city name reported by the U.S. Census Bureau for the specified geographic location. |
| Type | Type: Character Description: The place Type reported by the U.S. Census Bureau for the specified geographic location. |
| ALand | Type: Double Description: The Square area of land at the geographic or track location. |
| AWater | Type: Double Description: The Square area of water at the geographic or track location. |
| Lat | Type: Double Description: The latitude of the specified geographic location. |
| Lon | Type: Double Description: The longitude deviation of the household income for the specified geographic location. |
| Mean | Type: Double Description: The mean household income of the specified geographic location. |
| Median | Type: Double Description: The median household income of the specified geographic location. |
We first plotted histograms for all numerical variables
par(mai = c(1, 1.5, 1, 0))
par(mar = c(4,3, 5, 5))
hist.data.frame <- function(x, ..., colors=rainbow(ncol(x))) {
col <- 1
hist <- function(...) {
graphics::hist(..., col=colors[col])
col <<- col+1
}
f <- Hmisc:::hist.data.frame
environment(f) <- environment()
f(x,...)
}
hist(df[c(12,13,16:19)], cex = 2, font = 2, cex.axis = 1.2)
As positive-skewed distributions for variables of ALand, AWater and sum_w were identified, we therefore transformed the values of these variables using log scale and regenerated histograms for these three variables.
par(mfrow = c(2,2))
hist(log(df$ALand), col = "red")
hist(log(df$AWater), col = "purple")
hist(log(df$sum_w), col = "coral")
Then we plotted the boxplots for all numerical variables and found that all numerical variables have outliers
par(mfrow = c(2,4))
for (i in c(12,13,16:19)) {
n <- names(df)
boxplot(df[,i], prob = TRUE, col = "blue", xlab = paste("The range of", n[i]), main = paste("Boxplot of", n[i]))
}
Furthermore, we performed correlation analysis between the numerical variables. Specifically, we first plotted x-y plot for the Mean against log transformed-ALand grouped by Type; Then we drew a correlation plot for all numerical variables; In addition, we also drew a correlation chart for all numerical variables. From the correlation chart below, we can see that there is a strong positive relationships between Mean and Stdev, a strong positive relationship between ALand and AWater and a strong positive relationship between Mean and Median.
# xy plot of Mean with log(ALand)
par(mar = c(4.5,5.1,2,2))
xyplot(Mean ~ log(ALand), group = Type, data = df,
auto.key = list(space = "right"),
jitter.x = TRUE, jitter.y = TRUE, col = c("blue", "red"))
#correlation plot of all numerical variables
par(mar = c(4.5,5.1,2,2))
df_cor <- cor(df[c(12,13,16:19)])
corrplot(df_cor, type = "upper", order = "hclust",
tl.col = "black", tl.srt = 45)
#chart correlation plot of all numerical variables
chart.Correlation(df_cor, histogram = TRUE, pch = 19)
At last, we also mapped the Mean income (from low to high) by State and County respectively.
We plotted states on map, and colorred them with the mean of Income. The darker the color is on the state, the higher mean income it has.
#Transform the names of the states to lower cases
df$State_Name <- tolower(df$State_Name)
# use map_data function to gater states information
all_states <- map_data("state")
#Mean Income by States
test <- tapply(df$Mean, df$State_Name, mean)
test<- data.frame(test, stringsAsFactors = FALSE)
colnames(test) <- "Mean"
test$region <- rownames(test)
Total <- merge(all_states, test, by="region")
#draw a map
p <- ggplot()
p <- p + geom_polygon(data = Total, aes(x=long, y=lat, group = group, fill = Total$Mean),colour="white"
) + scale_fill_continuous(low = "thistle2", high = "darkred", guide="colorbar")
P1 <- p + theme_bw() + labs(fill = "Low to High Mean Income by State"
,title = "Mean Income By State", x="", y="")
p_statemap <- P1 + scale_y_continuous(breaks=c()) + scale_x_continuous(breaks=c()) + theme(panel.border = element_blank())
p_statemap
Next, we want to know which sates are the ones that have higher mean income, so we plotted a boxplot of mean income among states and ranked them from top to bottom as the state with highest mean income on the top. We also drew the top states with higher mean income the map.
#boxplot
g1 <- group_by(df_clean, State_Name)
g1$State_Name <- with(g1, reorder(x = State_Name, X=Mean))
ggplotly(
ggplot(g1, aes(x = State_Name, y = Mean))+
geom_boxplot()+
coord_flip()+
scale_y_continuous(name = "Mean Income") +
scale_x_discrete(name = "State")+
ggtitle("Boxplot of Mean Income of States"), height = 900)
The map we plotted to show top 10 states with highest mean income.
g2 <- group_by(df_clean, State_ab)
agg1 <- aggregate(g2[,c("Lat", "Lon", "Mean")], list(g2$State_Name), mean) %>%
arrange(desc(Mean))
ggplot(data = states) +
geom_polygon(aes(x = long, y = lat, group = group), fill = "grey90", color = "white") +
geom_point( data = agg1[1:10,], aes(x = Lon, y = Lat, color = Mean), size = 3, alpha = 0.5) +
scale_size( name = "Mean Income") +
scale_colour_gradientn(colours = rainbow(4), name = "Mean Income") +
geom_text_repel(data = agg1[1:10,],aes(x = Lon, y = Lat, label = Group.1)) +
coord_fixed(ratio = 1.5, xlim = NULL, ylim = NULL, expand = TRUE) +
guides(fill = FALSE) +
labs(x = "Longitute",y = "Latitude")+
ggtitle("Top 10 State Mean Income") +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
text = element_text(family = "Georgia"),
plot.title = element_text(size = 25, margin = margin(b = 10)))
From the above maps, we can see that the top 10 states that have the hightes mean income are District of columbia, Connecticut, New jersey, Maryland, Massachusetts, Virginia, California, Hawaii, Alaska, New york. It seems that the top 10 mean income states are almost all at coastal sites.
Next we plotted mean income among counties within nationwide.
g3 <- group_by(df_clean_County, County)
agg2_lonlat <- aggregate(g3[,c("Lat", "Lon", "Mean")], list(g3$County), mean) %>%
arrange(desc(Mean))
ggplot(data = counties) +
geom_polygon(aes(x = long, y = lat, group = group), fill = "grey90", color = "white") +
geom_point( data=agg2_lonlat[1:10,], aes(x = Lon, y = Lat, size = Mean, color = Mean))+
scale_size( name="Mean Income") +
scale_colour_gradientn(colours=rainbow(7), name = "Mean Income") +
geom_text_repel(data = agg2_lonlat[1:10,],aes(x = Lon, y = Lat, label = Group.1)) +
coord_fixed(1.5) +
guides(fill=FALSE) +
labs(x="Longitute", y="Latitude") +
ggtitle("Top 10 County Mean Income")+
theme_minimal() +
theme(panel.grid.minor = element_blank(),
text = element_text(family = "Georgia"),
plot.title = element_text(size = 25, margin = margin(b = 20)))
The top 10 counties that have the highest mean income are:Southeast Fairbanks Census Area, Loudoun County, Westchester County, Boulder County, Providence County, Anne Arundel County, Indian River County, Fort Bend County, Fairfax County, Marin County. It seems that most of them are at east part of United States.
First, we constructed a linear model with the response variable of Mean, predictor variables of ALand, AWater and Type.
#selected variables of Type, ALand, AWater and Mean
df1 <- df[c(8,12,13,16)]
#Standardization of the numerical variables of ALand, AWater and Mean
df1$sd.Mean <- (df1$Mean-mean(df1$Mean))/sd(df1$Mean);
df1$sd.ALand <- (df1$ALand-mean(df1$ALand))/sd(df1$ALand);
df1$sd.AWater <- (df1$AWater-mean(df1$AWater))/sd(df1$AWater);
#Split of the dataset 80% Training Data and 20% Testing Data
set.seed(2018)
subset <- sample(nrow(df1), nrow(df1) * 0.8)
df1_train <- df1[subset, ]
df1_test <- df1[-subset, ]
####Multivariate Linear Model####
lm_df1_train <- lm(sd.Mean ~ sd.AWater+sd.ALand+Type, df1_train)
#R-square of the model from training dataset: very small r-square
summary(lm_df1_train)$r.squared
## [1] 0.01008337
#Compare the MSEs from training and testing dataset
test_pred <- predict(lm_df1_train, df1_test)
mse_train <- summary(lm_df1_train)$sigma^2
mse_test <- mean((test_pred-df1_test$sd.Mean)^2)
# MESs are similar, suggesting no overfitting
mse_train
## [1] 0.9837431
mse_test
## [1] 1.019807
Besides, we also tried Random Forest, but Random Forest is too expensive and our laptops can not afford it. Moreover, based on the performance of linear model, as the predictors of ALand, AWater and Type can not explain the variation of the Mean income, it is therefore meaningless to continue trying other machine learning methods.
####Random Forest Model####
# rf_df1_train <- train(sd.Mean ~ sd.AWater + sd.ALand + Type, data = df1_train,
# method = "ranger")
From the analysis we did, we can see that top states with highest mean income are at east or east costal locations, however, not all top 10 counties are at east part, some of them are in the middle, like the Boulder County and Fort Bend County, but most of them are also located at east coastal part.
We also did modelings in hope to find any relations to mean income, and unfortunately, our models tell us that ALand(area of land), AWater(area of water) and Type are not significant to be predictors of mean income. If we want to build better models, we need to include more explaining data, like population number and/or ages, job numbers.