Final Project

1.Introduction

Problem Statement:

For this final project, we analyze the distribution of U.S. household income across states and cities. Wages are one of the main concerns for anyone looking for a job, so our focus is on identifying which states and cities rank at the top or bottom by income.


Implementation:

The data was downloaded from Kaggle and read into R. We plan to group the data by state and plot average incomes. We will also plot mean and median incomes by city, both within states and nationwide. Finally, we will try to fit the data with linear regression or other statistical models.
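The grouping step could look roughly like the sketch below. It uses a small invented data frame (`toy`) in place of the Kaggle file, and the column name `avg_income` is our own choice. Note that it takes an unweighted average of the location-level means, which is an assumption and not necessarily the true state-wide mean.

```r
library(dplyr)

# Toy stand-in for the Kaggle data; the values are invented for illustration.
toy <- data.frame(
  State_Name = c("Ohio", "Ohio", "Texas", "Texas", "Utah"),
  Mean       = c(50000, 60000, 70000, 90000, 65000)
)

# Average the location-level mean incomes within each state,
# then sort so the highest-income state comes first.
state_income <- toy %>%
  group_by(State_Name) %>%
  summarise(avg_income = mean(Mean)) %>%
  arrange(desc(avg_income))

print(state_income)
```

The same pipeline should apply to the real data once it is loaded, with `toy` replaced by the cleaned data frame.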

We hope that this analysis gives people a sense of how incomes are distributed across states and cities, so that they can choose a location with income in mind. However, the analysis only takes location into consideration. We do not analyze other factors that can affect income, such as type of job or population density.

2.Packages Required

The packages required to reproduce this analysis are listed below.

library(readr)   # load the .csv file
library(dplyr)   # manipulate the data
library(ggplot2) # visualize the data
library(DT)      # output the data in a table

3.Data Preparation

We obtained this dataset from Kaggle.

The database contains about 32,000 records on US Household Income Statistics & Geo Locations. The data was provided by the U.S. Census Reports for the years 2011-2015. This dataset has 19 variables and 32,526 observations. There is only 1 missing value, in the Area_Code variable, and it is recorded as “M”.
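A quick way to check claims like these is sketched below. It runs on a tiny invented data frame (`toy_raw`), not on the real file, so the counts it produces are illustrative only.

```r
# Toy stand-in for the raw data; the values are invented for illustration.
toy_raw <- data.frame(
  City      = c("Akron", "Austin", "Provo"),
  Area_Code = c("330", "M", "801"),
  stringsAsFactors = FALSE
)

# Number of observations (rows) and variables (columns).
dim(toy_raw)

# Count values recorded with the placeholder "M".
n_missing <- sum(toy_raw$Area_Code == "M")
n_missing

# Count true NA values in each column.
colSums(is.na(toy_raw))
```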

This dataset is already tidy and clean enough, so we only need to select the variables of interest for further analysis: State_Name, State_ab, City, Type, ALand, AWater, Lat, Lon, Mean, Median. State_Name, State_ab, City, and Type help to filter the locations of interest. When we compare incomes between cities, we only want observations whose Type is City. Lat and Lon help to build a map. ALand and AWater are used for modeling. Mean and Median are the values we want to compare.

We will read in the file and clean it up with the following code:

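# Read the raw Kaggle file
df <- read_csv("kaggle_income.csv")

# Keep only the variables of interest
df_clean <- select(df, State_Name, State_ab, City, Type, ALand, AWater, Lat, Lon, Mean, Median)

# Keep only observations whose Type is "City" for the city-level comparisons
df_clean_city <- filter(df_clean, Type == "City")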

We show the data in a table:

datatable(df_clean, caption = "Table 1: Tidy Dataset")

Here is the metadata for the selected variables:

Variable | Information
State_Name | Type: Character. Description: The state name reported by the U.S. Census Bureau for the specified geographic location.
State_ab | Type: Character. Description: The abbreviated state name reported by the U.S. Census Bureau for the specified geographic location.
City | Type: Character. Description: The city name reported by the U.S. Census Bureau for the specified geographic location.
Type | Type: Character. Description: The place type reported by the U.S. Census Bureau for the specified geographic location.
ALand | Type: Double. Description: The square area of land at the geographic or track location.
AWater | Type: Double. Description: The square area of water at the geographic or track location.
Lat | Type: Double. Description: The latitude of the specified geographic location.
Lon | Type: Double. Description: The longitude of the specified geographic location.
Mean | Type: Double. Description: The mean household income of the specified geographic location.
Median | Type: Double. Description: The median household income of the specified geographic location.

4.Proposed Exploratory Data Analysis

This dataset is already quite tidy and clean, so the only manipulation we plan is to filter the Type variable to City when we analyze information for cities.

We want to plot mean incomes by state in a ranked bar chart and on a geographic map. We also want to plot mean and median incomes by city in a bar chart and on a geographic map.
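One possible shape for the ranked bar chart is sketched here with ggplot2. The data frame `state_income` and its column `avg_income` are hypothetical names with invented values, standing in for the state-level summary we would compute from the real data.

```r
library(ggplot2)

# Hypothetical state-level averages; the values are invented for illustration.
state_income <- data.frame(
  State_ab   = c("OH", "TX", "UT"),
  avg_income = c(55000, 80000, 65000)
)

# reorder() sorts the bars by income so the ranking is visible at a glance;
# coord_flip() turns the chart sideways so state labels stay readable.
p <- ggplot(state_income,
            aes(x = reorder(State_ab, avg_income), y = avg_income)) +
  geom_col() +
  coord_flip() +
  labs(x = "State", y = "Average of mean household income")

# Call print(p) to draw the chart.
```

With all states included, the flipped layout should keep the chart legible even though there are many bars.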

However, we do not yet know how to plot information on a geographic map using latitude and longitude values. We also plan to fit the income values with linear regression on the square area of land and water, or with cluster analysis models, to figure out how incomes relate to the other variables.
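As a starting point for both ideas, the sketch below plots points by longitude and latitude and fits a linear model of Mean on ALand and AWater. It is only an illustration on an invented data frame (`toy`). A plain scatter plot with `coord_quickmap()` is an assumption about what a first version of the map could be, without state borders or a base map.

```r
library(ggplot2)

# Toy stand-in for the city-level data; all values are invented.
toy <- data.frame(
  Lat    = c(41.08, 30.27, 40.23, 39.96, 29.76, 40.76),
  Lon    = c(-81.52, -97.74, -111.66, -83.00, -95.37, -111.89),
  ALand  = c(1e6, 2e6, 3e6, 4e6, 5e6, 6e6),
  AWater = c(1e4, 5e4, 2e4, 8e4, 3e4, 6e4),
  Mean   = c(52000, 61000, 58000, 70000, 66000, 73000)
)

# Map-like scatter plot: one point per location, coloured by mean income.
# coord_quickmap() keeps an approximately correct aspect ratio for lon/lat.
income_map <- ggplot(toy, aes(x = Lon, y = Lat, colour = Mean)) +
  geom_point() +
  coord_quickmap() +
  labs(x = "Longitude", y = "Latitude", colour = "Mean income")

# Linear regression of mean income on land and water area.
fit <- lm(Mean ~ ALand + AWater, data = toy)

# Estimated intercept and slopes.
coef(fit)
```

Calling `print(income_map)` draws the plot, and `summary(fit)` would show standard errors and R-squared for the regression.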