Demographic data are statistical, socio-economic data such as population, race, income, education and employment, which describe specific geographic locations and are often tied to a point in time. For example, population data include characteristics such as area population, population growth or birth rate, ethnicity, density and distribution. Employment data include employment and unemployment rates, which can be broken down further by gender and ethnicity.
Demographic data is often gathered by census organizations, both government and private, which may use the data for research, marketing, and environmental and human development. Data such as population and employment and all their related data fields such as density, ethnicity and gender can be used by the government to plan for infrastructure development such as roads, hospitals and law enforcement.
Demographic data is also used by marketers and business entities for targeted advertising and product distribution. For example, in areas with a large Latino population, fast food restaurants often offer Mexican-themed foods rather than Mediterranean ones. Branches of famous American fast food companies in foreign countries often tailor their menus to local tastes. Decisions like these rely on demographic research using demographic data.
This study analyzes the 2015 US Demographic Data. The data set contains demographic data for each census tract within a county, which in turn lies within a state. Census tracts are defined by the Census Bureau and have a fairly consistent size; a typical tract has around 5000 residents. The data include the total population, the numbers of men and women, the origin of the residents, the median household income, the percentage employed in different sectors of industry, etc. Our regression model suggests that the median household income depends mainly on the population distribution by origin. Some of our comparison tests indicate a significant difference between the numbers of men and women in a tract.
This study analyzes the 2015 US Demographic Data to identify the factors that affect the median household income per tract. We model the median household income across tracts with a linear regression, which assumes that income depends mainly on the population distribution by origin, the percentage of people working in different sectors of industry, etc.
Hypothesis H1: Income depends mainly on the population distribution by origin and on the percentage of people working in different sectors of industry.
The data was collected from the website https://www.kaggle.com/muonneutrino/us-census-demographic-data.
The data set contains demographic data for each census tract within a county, which in turn lies within a state. Census tracts are defined by the Census Bureau and have a fairly consistent size; a typical tract has around 5000 residents. The data include the total population, the numbers of men and women, the origin of the residents, the median household income, the percentage employed in different sectors of industry, etc. Each column is listed below, with the column name followed by its meaning.
CensusTract: Census tract ID
State: State, DC, or Puerto Rico
County: County or county equivalent
TotalPop: Total population
Men: Number of men
Women: Number of women
Hispanic: percent of population that is Hispanic/Latino
White: percent of population that is white
Black: percent of population that is black
Native: percent of population that is Native American or Native Alaskan
Asian: percent of population that is Asian
Pacific: percent of population that is Native Hawaiian or Pacific Islander
Citizen: Number of citizens
Income: Median household income
IncomeErr: Median household income error
IncomePerCap: Income per capita
IncomePerCapErr: Income per capita error
Poverty: percent under poverty level
ChildPoverty: percent of children under poverty level
Professional: percent employed in management, business, science, and arts
Service: percent employed in service jobs
Office: percent employed in sales and office jobs
Construction: percent employed in natural resources, construction, and maintenance
Production: percent employed in production, transportation, and material movement
Drive: percent commuting alone in a car, van, or truck
Carpool: percent carpooling in a car, van, or truck
Transit: percent commuting on public transportation
Walk: percent walking to work
OtherTransp: percent commuting via other means
WorkAtHome: percent working at home
MeanCommute: Mean commute time (minutes)
Employed: Number of employed people (16+)
PrivateWork: percent employed in private industry
PublicWork: percent employed in public jobs
SelfEmployed: percent self-employed
FamilyWork: percent in unpaid family work
Unemployment: Unemployment rate (percent)
To test Hypothesis H1, we propose the following model:
Income = a0 + a1*Hispanic + a2*White + a3*Black + a4*Native + a5*Asian + a6*Pacific + a7*Professional + a8*Service + a9*Office + a10*Construction + a11*Production + error
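setwd("D:/R Internship")
# stringsAsFactors = TRUE keeps State and County as factors on R >= 4.0, matching the str() output below.
US_data<-read.csv("US Demographic Data 2015_Census Tract.csv",stringsAsFactors = TRUE)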
View(US_data)
str(US_data)
## 'data.frame': 74001 obs. of 37 variables:
## $ CensusTract : num 1e+09 1e+09 1e+09 1e+09 1e+09 ...
## $ State : Factor w/ 52 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ County : Factor w/ 1928 levels "Añasco","Abbeville",..: 90 90 90 90 90 90 90 90 90 90 ...
## $ TotalPop : int 1948 2156 2968 4423 10763 3851 2761 3187 10915 5668 ...
## $ Men : int 940 1059 1364 2172 4922 1787 1210 1502 5486 2897 ...
## $ Women : int 1008 1097 1604 2251 5841 2064 1551 1685 5429 2771 ...
## $ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
## $ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
## $ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
## $ Native : num 0.3 0 0.5 1.6 0 0 0 3.1 0 0 ...
## $ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
## $ Pacific : num 0 0 0.3 0 0 0 0 0 0 0 ...
## $ Citizen : int 1503 1662 2335 3306 7666 2642 2060 2391 7778 4217 ...
## $ Income : int 61838 32303 44922 54329 51965 63092 34821 73728 60063 41287 ...
## $ IncomeErr : int 11900 13538 5629 7003 6935 9585 7867 2447 8602 7857 ...
## $ IncomePerCap : int 25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
## $ IncomePerCapErr: int 4548 2474 2817 2870 2813 7550 3245 4669 2233 4149 ...
## $ Poverty : num 8.1 25.5 12.7 2.1 11.4 14.4 28.9 13 13.9 6.8 ...
## $ ChildPoverty : num 8.4 40.3 19.7 1.6 17.5 21.9 41.9 25.9 18.3 10 ...
## $ Professional : num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
## $ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
## $ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
## $ Construction : num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
## $ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
## $ Drive : num 90.2 86.3 94.8 86.6 88 82.7 92.4 84.3 90.1 88.7 ...
## $ Carpool : num 4.8 13.1 2.8 9.1 10.5 6.9 7.6 8.1 8.6 7.9 ...
## $ Transit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Walk : num 0.5 0 0 0 0 0 0 0 0 0 ...
## $ OtherTransp : num 2.3 0.7 0 2.6 0.6 6 0 1.7 0 1.2 ...
## $ WorkAtHome : num 2.1 0 2.5 1.6 0.9 4.5 0 5.9 1.3 2.1 ...
## $ MeanCommute : num 25 23.4 19.6 25.3 24.8 19.8 20 24.3 29.4 32.9 ...
## $ Employed : int 943 753 1373 1782 5037 1560 1166 1502 4348 2485 ...
## $ PrivateWork : num 77.1 77 64.1 75.7 67.1 79.4 82 78.1 73.3 77.9 ...
## $ PublicWork : num 18.3 16.9 23.6 21.2 27.6 14.7 14.6 14.8 22.1 15.2 ...
## $ SelfEmployed : num 4.6 6.1 12.3 3.1 5.3 5.8 3.4 7.1 4.6 6.9 ...
## $ FamilyWork : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Unemployment : num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
dim(US_data)
## [1] 74001 37
fit1<-lm(Income~Hispanic+White+Black+Native+Asian+Pacific+Professional+Service
+Office+Construction+Production,data = US_data)
summary(fit1)
##
## Call:
## lm(formula = Income ~ Hispanic + White + Black + Native + Asian +
## Pacific + Professional + Service + Office + Construction +
## Production, data = US_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -123062 -10576 -842 9236 151257
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 25112.06 107112.79 0.234 0.815
## Hispanic -19.76 29.22 -0.676 0.499
## White -5.65 29.48 -0.192 0.848
## Black -128.99 29.67 -4.347 1.38e-05 ***
## Native -269.09 34.56 -7.787 6.95e-15 ***
## Asian 331.76 32.25 10.288 < 2e-16 ***
## Pacific 395.81 83.93 4.716 2.41e-06 ***
## Professional 1116.58 1070.88 1.043 0.297
## Service -528.32 1070.91 -0.493 0.622
## Office 178.09 1070.93 0.166 0.868
## Construction 191.76 1070.97 0.179 0.858
## Production -132.75 1070.93 -0.124 0.901
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18590 on 72887 degrees of freedom
## (1102 observations deleted due to missingness)
## Multiple R-squared: 0.5792, Adjusted R-squared: 0.5791
## F-statistic: 9120 on 11 and 72887 DF, p-value: < 2.2e-16
We fitted the model with median household income as the dependent variable and, as independent variables, the population percentages by origin and the percentages of people employed in each sector of industry. The fit gives an estimate of the coefficient ‘a’ for each independent variable.
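As a minimal sketch of how the coefficient table of such a fit can be inspected programmatically, the following uses a small synthetic data set (the names `toy`, `x1`, `x2` and `y` are hypothetical and are not part of the census file). It shows one way to pull out the terms whose p-value is below 0.05 from `summary()`.

```r
# Synthetic stand-in data (hypothetical; not the census file).
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 + rnorm(n)   # y depends on x1 only

fit <- lm(y ~ x1 + x2, data = toy)

# coef(summary(fit)) is a matrix with the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)".
coef_table <- coef(summary(fit))

# Names of the terms whose p-value is below 0.05.
significant <- rownames(coef_table)[coef_table[, "Pr(>|t|)"] < 0.05]
print(significant)
```

The same two lines applied to `fit1` would list the significant terms of the census model without reading them off the printed table.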
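We found that our hypothesis was only partly supported, as not every factor had a significant effect on the median household income. In this model the Black, Native, Asian and Pacific percentages have p-values < 0.05, while the Hispanic and White percentages and all five employment-sector percentages do not. The sector result should be read with caution: the five sector percentages sum to roughly 100 in every tract, so together with the intercept they are almost perfectly collinear, which is consistent with their very large standard errors (about 1071 each). The pairwise correlation between Professional and Income reported by corr.test further below is 0.73, so the lack of significance here does not by itself show that employment sector is unrelated to income.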
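The effect of including a full set of percentage shares in a regression can be illustrated with synthetic data. This is a sketch under stated assumptions, not a re-analysis of the census file: the shares `a`, `b`, `c` and the response `y` are hypothetical, and the shares are rounded to one decimal so that they sum to approximately, not exactly, 100, as the sector columns in the data do.

```r
set.seed(42)
n <- 500

# Three hypothetical shares per row, rounded to one decimal place,
# so each row sums to roughly (not exactly) 100.
raw <- matrix(runif(n * 3), ncol = 3)
shares <- round(100 * raw / rowSums(raw), 1)
toy <- data.frame(a = shares[, 1], b = shares[, 2], c = shares[, 3])

# The response depends on share a only.
toy$y <- 1000 + 50 * toy$a + rnorm(n, sd = 100)

# All three shares plus an intercept: nearly collinear predictors.
fit_all <- lm(y ~ a + b + c, data = toy)

# Dropping one share (c) as the reference category removes the near-dependency.
fit_ref <- lm(y ~ a + b, data = toy)

se_all <- coef(summary(fit_all))["a", "Std. Error"]
se_ref <- coef(summary(fit_ref))["a", "Std. Error"]
print(c(all_shares = se_all, reference_dropped = se_ref))
```

In this sketch the standard error of `a` is far larger when all shares are included, even though `a` truly drives `y`. Leaving one category out as a reference is a common remedy; whether it changes the conclusions for the census model would need to be checked on the real data.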
This analysis was conducted in order to understand the patterns in the US Demographic Data. We investigated the median household income in every tract and how different factors affect it. We found that, in our model, the percentages of Black, Native, Asian and Pacific residents are the significant predictors of median household income.
library(psych)
summary(US_data)
## CensusTract State County
## Min. :1.001e+09 California : 8057 Los Angeles: 2346
## 1st Qu.:1.304e+10 Texas : 5265 Cook : 1326
## Median :2.805e+10 New York : 4918 Orange : 939
## Mean :2.839e+10 Florida : 4245 Jefferson : 927
## 3rd Qu.:4.200e+10 Pennsylvania: 3218 Maricopa : 916
## Max. :7.215e+10 Illinois : 3123 Montgomery : 833
## (Other) :45175 (Other) :66714
## TotalPop Men Women Hispanic
## Min. : 0 Min. : 0 Min. : 0 Min. : 0.00
## 1st Qu.: 2891 1st Qu.: 1409 1st Qu.: 1461 1st Qu.: 2.40
## Median : 4063 Median : 1986 Median : 2066 Median : 7.00
## Mean : 4326 Mean : 2128 Mean : 2198 Mean : 16.86
## 3rd Qu.: 5442 3rd Qu.: 2674 3rd Qu.: 2774 3rd Qu.: 20.40
## Max. :53812 Max. :27962 Max. :27250 Max. :100.00
## NA's :690
## White Black Native Asian
## Min. : 0.00 Min. : 0.00 Min. : 0.0000 Min. : 0.000
## 1st Qu.: 39.40 1st Qu.: 0.70 1st Qu.: 0.0000 1st Qu.: 0.200
## Median : 71.40 Median : 3.70 Median : 0.0000 Median : 1.400
## Mean : 62.03 Mean : 13.27 Mean : 0.7277 Mean : 4.588
## 3rd Qu.: 88.30 3rd Qu.: 14.40 3rd Qu.: 0.4000 3rd Qu.: 4.800
## Max. :100.00 Max. :100.00 Max. :100.0000 Max. :91.300
## NA's :690 NA's :690 NA's :690 NA's :690
## Pacific Citizen Income IncomeErr
## Min. : 0.000 Min. : 0 Min. : 2611 Min. : 390
## 1st Qu.: 0.000 1st Qu.: 2037 1st Qu.: 37683 1st Qu.: 5317
## Median : 0.000 Median : 2863 Median : 51094 Median : 7732
## Mean : 0.145 Mean : 3043 Mean : 57226 Mean : 9134
## 3rd Qu.: 0.000 3rd Qu.: 3838 3rd Qu.: 70117 3rd Qu.: 11258
## Max. :84.700 Max. :37416 Max. :248750 Max. :123116
## NA's :690 NA's :1100 NA's :1100
## IncomePerCap IncomePerCapErr Poverty ChildPoverty
## Min. : 128 Min. : 85 Min. : 0.00 Min. : 0.00
## 1st Qu.: 19123 1st Qu.: 2312 1st Qu.: 7.20 1st Qu.: 7.00
## Median : 25344 Median : 3127 Median : 13.40 Median : 17.80
## Mean : 28491 Mean : 3943 Mean : 16.96 Mean : 22.49
## 3rd Qu.: 33894 3rd Qu.: 4537 3rd Qu.: 23.10 3rd Qu.: 33.80
## Max. :254204 Max. :134380 Max. :100.00 Max. :100.00
## NA's :740 NA's :740 NA's :835 NA's :1118
## Professional Service Office Construction
## Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.000
## 1st Qu.: 24.10 1st Qu.: 13.4 1st Qu.: 20.10 1st Qu.: 5.000
## Median : 32.60 Median : 17.9 Median : 23.80 Median : 8.400
## Mean : 34.80 Mean : 19.1 Mean : 23.95 Mean : 9.292
## 3rd Qu.: 43.88 3rd Qu.: 23.6 3rd Qu.: 27.50 3rd Qu.: 12.500
## Max. :100.00 Max. :100.0 Max. :100.00 Max. :100.000
## NA's :807 NA's :807 NA's :807 NA's :807
## Production Drive Carpool Transit
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.000
## 1st Qu.: 7.10 1st Qu.: 72.00 1st Qu.: 6.000 1st Qu.: 0.000
## Median : 11.80 Median : 79.70 Median : 8.800 Median : 1.100
## Mean : 12.86 Mean : 75.53 Mean : 9.627 Mean : 5.456
## 3rd Qu.: 17.40 3rd Qu.: 84.90 3rd Qu.: 12.300 3rd Qu.: 4.700
## Max. :100.00 Max. :100.00 Max. :100.000 Max. :100.000
## NA's :807 NA's :797 NA's :797 NA's :797
## Walk OtherTransp WorkAtHome MeanCommute
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 1.20
## 1st Qu.: 0.400 1st Qu.: 0.400 1st Qu.: 1.800 1st Qu.:20.80
## Median : 1.400 Median : 1.100 Median : 3.500 Median :25.00
## Mean : 3.123 Mean : 1.892 Mean : 4.368 Mean :25.67
## 3rd Qu.: 3.500 3rd Qu.: 2.500 3rd Qu.: 5.900 3rd Qu.:29.80
## Max. :100.000 Max. :100.000 Max. :100.000 Max. :80.00
## NA's :797 NA's :797 NA's :797 NA's :949
## Employed PrivateWork PublicWork SelfEmployed
## Min. : 0 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 1249 1st Qu.: 74.60 1st Qu.: 9.60 1st Qu.: 3.500
## Median : 1846 Median : 80.10 Median : 13.40 Median : 5.500
## Mean : 1984 Mean : 78.98 Mean : 14.62 Mean : 6.234
## 3rd Qu.: 2553 3rd Qu.: 84.60 3rd Qu.: 18.20 3rd Qu.: 8.100
## Max. :24075 Max. :100.00 Max. :100.00 Max. :100.000
## NA's :807 NA's :807 NA's :807
## FamilyWork Unemployment
## Min. : 0.0000 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 5.100
## Median : 0.0000 Median : 7.700
## Mean : 0.1698 Mean : 9.029
## 3rd Qu.: 0.0000 3rd Qu.: 11.400
## Max. :26.5000 Max. :100.000
## NA's :807 NA's :802
describe(US_data)
## vars n mean sd median
## CensusTract 1 74001 2.839113e+10 1.647593e+10 2.8047e+10
## State* 2 74001 2.534000e+01 1.510000e+01 2.5000e+01
## County* 3 74001 9.879300e+02 5.229400e+02 1.0340e+03
## TotalPop 4 74001 4.325590e+03 2.129310e+03 4.0630e+03
## Men 5 74001 2.127650e+03 1.072330e+03 1.9860e+03
## Women 6 74001 2.197940e+03 1.095730e+03 2.0660e+03
## Hispanic 7 73311 1.686000e+01 2.294000e+01 7.0000e+00
## White 8 73311 6.203000e+01 3.068000e+01 7.1400e+01
## Black 9 73311 1.327000e+01 2.176000e+01 3.7000e+00
## Native 10 73311 7.300000e-01 4.490000e+00 0.0000e+00
## Asian 11 73311 4.590000e+00 8.790000e+00 1.4000e+00
## Pacific 12 73311 1.500000e-01 1.040000e+00 0.0000e+00
## Citizen 13 74001 3.043080e+03 1.475490e+03 2.8630e+03
## Income 14 72901 5.722556e+04 2.866333e+04 5.1094e+04
## IncomeErr 15 72901 9.134490e+03 5.920340e+03 7.7320e+03
## IncomePerCap 16 73261 2.849123e+04 1.504707e+04 2.5344e+04
## IncomePerCapErr 17 73261 3.942910e+03 3.023030e+03 3.1270e+03
## Poverty 18 73166 1.696000e+01 1.320000e+01 1.3400e+01
## ChildPoverty 19 72883 2.249000e+01 1.919000e+01 1.7800e+01
## Professional 20 73194 3.480000e+01 1.501000e+01 3.2600e+01
## Service 21 73194 1.910000e+01 8.280000e+00 1.7900e+01
## Office 22 73194 2.395000e+01 5.960000e+00 2.3800e+01
## Construction 23 73194 9.290000e+00 6.020000e+00 8.4000e+00
## Production 24 73194 1.286000e+01 7.670000e+00 1.1800e+01
## Drive 25 73204 7.553000e+01 1.537000e+01 7.9700e+01
## Carpool 26 73204 9.630000e+00 5.370000e+00 8.8000e+00
## Transit 27 73204 5.460000e+00 1.172000e+01 1.1000e+00
## Walk 28 73204 3.120000e+00 5.880000e+00 1.4000e+00
## OtherTransp 29 73204 1.890000e+00 2.600000e+00 1.1000e+00
## WorkAtHome 30 73204 4.370000e+00 3.900000e+00 3.5000e+00
## MeanCommute 31 73052 2.567000e+01 6.960000e+00 2.5000e+01
## Employed 32 74001 1.983910e+03 1.073430e+03 1.8460e+03
## PrivateWork 33 73194 7.898000e+01 8.350000e+00 8.0100e+01
## PublicWork 34 73194 1.462000e+01 7.540000e+00 1.3400e+01
## SelfEmployed 35 73194 6.230000e+00 4.040000e+00 5.5000e+00
## FamilyWork 36 73194 1.700000e-01 4.600000e-01 0.0000e+00
## Unemployment 37 73199 9.030000e+00 5.960000e+00 7.7000e+00
## trimmed mad min max
## CensusTract 2.803846e+10 2.080416e+10 1.00102e+09 7.215375e+10
## State* 2.516000e+01 2.076000e+01 1.00000e+00 5.200000e+01
## County* 9.921900e+02 6.553100e+02 1.00000e+00 1.928000e+03
## TotalPop 4.165830e+03 1.866590e+03 0.00000e+00 5.381200e+04
## Men 2.041010e+03 9.236600e+02 0.00000e+00 2.796200e+04
## Women 2.117140e+03 9.622100e+02 0.00000e+00 2.725000e+04
## Hispanic 1.161000e+01 8.450000e+00 0.00000e+00 1.000000e+02
## White 6.496000e+01 3.039000e+01 0.00000e+00 1.000000e+02
## Black 7.770000e+00 5.340000e+00 0.00000e+00 1.000000e+02
## Native 1.700000e-01 0.000000e+00 0.00000e+00 1.000000e+02
## Asian 2.520000e+00 2.080000e+00 0.00000e+00 9.130000e+01
## Pacific 0.000000e+00 0.000000e+00 0.00000e+00 8.470000e+01
## Citizen 2.935270e+03 1.319510e+03 0.00000e+00 3.741600e+04
## Income 5.378624e+04 2.265116e+04 2.61100e+03 2.487500e+05
## IncomeErr 8.270770e+03 4.123110e+03 3.90000e+02 1.231160e+05
## IncomePerCap 2.649608e+04 1.054425e+04 1.28000e+02 2.542040e+05
## IncomePerCapErr 3.422590e+03 1.460360e+03 8.50000e+01 1.343800e+05
## Poverty 1.509000e+01 1.082000e+01 0.00000e+00 1.000000e+02
## ChildPoverty 2.013000e+01 1.838000e+01 0.00000e+00 1.000000e+02
## Professional 3.383000e+01 1.423000e+01 0.00000e+00 1.000000e+02
## Service 1.847000e+01 7.410000e+00 0.00000e+00 1.000000e+02
## Office 2.383000e+01 5.490000e+00 0.00000e+00 1.000000e+02
## Construction 8.740000e+00 5.490000e+00 0.00000e+00 1.000000e+02
## Production 1.224000e+01 7.560000e+00 0.00000e+00 1.000000e+02
## Drive 7.831000e+01 8.900000e+00 0.00000e+00 1.000000e+02
## Carpool 9.140000e+00 4.600000e+00 0.00000e+00 1.000000e+02
## Transit 2.470000e+00 1.630000e+00 0.00000e+00 1.000000e+02
## Walk 1.920000e+00 2.080000e+00 0.00000e+00 1.000000e+02
## OtherTransp 1.410000e+00 1.480000e+00 0.00000e+00 1.000000e+02
## WorkAtHome 3.860000e+00 2.820000e+00 0.00000e+00 1.000000e+02
## MeanCommute 2.528000e+01 6.520000e+00 1.20000e+00 8.000000e+01
## Employed 1.899360e+03 9.562800e+02 0.00000e+00 2.407500e+04
## PrivateWork 7.963000e+01 7.410000e+00 0.00000e+00 1.000000e+02
## PublicWork 1.387000e+01 6.230000e+00 0.00000e+00 1.000000e+02
## SelfEmployed 5.800000e+00 3.260000e+00 0.00000e+00 1.000000e+02
## FamilyWork 6.000000e-02 0.000000e+00 0.00000e+00 2.650000e+01
## Unemployment 8.210000e+00 4.450000e+00 0.00000e+00 1.000000e+02
## range skew kurtosis se
## CensusTract 7.115273e+10 0.13 -0.92 60566306.31
## State* 5.100000e+01 0.02 -1.33 0.06
## County* 1.927000e+03 -0.08 -1.04 1.92
## TotalPop 5.381200e+04 1.83 14.56 7.83
## Men 2.796200e+04 1.96 17.07 3.94
## Women 2.725000e+04 1.79 13.80 4.03
## Hispanic 1.000000e+02 2.00 3.39 0.08
## White 1.000000e+02 -0.67 -0.86 0.11
## Black 1.000000e+02 2.31 4.77 0.08
## Native 1.000000e+02 15.89 289.71 0.02
## Asian 9.130000e+01 3.95 19.99 0.03
## Pacific 8.470000e+01 26.92 1295.51 0.00
## Citizen 3.741600e+04 1.61 12.68 5.42
## Income 2.461390e+05 1.48 3.46 106.16
## IncomeErr 1.227260e+05 3.00 21.40 21.93
## IncomePerCap 2.540760e+05 2.33 10.80 55.59
## IncomePerCapErr 1.342950e+05 5.81 96.59 11.17
## Poverty 1.000000e+02 1.46 2.69 0.05
## ChildPoverty 1.000000e+02 1.02 0.63 0.07
## Professional 1.000000e+02 0.60 0.09 0.06
## Service 1.000000e+02 0.97 2.44 0.03
## Office 1.000000e+02 0.73 7.23 0.02
## Construction 1.000000e+02 1.49 6.34 0.02
## Production 1.000000e+02 0.97 2.55 0.03
## Drive 1.000000e+02 -2.19 5.67 0.06
## Carpool 1.000000e+02 1.77 12.50 0.02
## Transit 1.000000e+02 3.63 14.68 0.04
## Walk 1.000000e+02 5.61 46.31 0.02
## OtherTransp 1.000000e+02 5.12 70.50 0.01
## WorkAtHome 1.000000e+02 3.95 50.07 0.01
## MeanCommute 7.880000e+01 0.63 0.85 0.03
## Employed 2.407500e+04 1.63 10.65 3.95
## PrivateWork 1.000000e+02 -1.32 5.17 0.03
## PublicWork 1.000000e+02 1.75 7.57 0.03
## SelfEmployed 1.000000e+02 2.92 38.47 0.01
## FamilyWork 2.650000e+01 7.29 188.67 0.00
## Unemployment 1.000000e+02 2.16 10.67 0.02
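# US_data1: a sample of 100 tracts, rows 1-50 (Alabama) and rows 1182-1231 (Alaska).
US_data1<-US_data[c(1:50,1182:1231),]
# US_data2: all tracts in Alabama and Alaska.
US_data2<-US_data[(US_data$State=="Alabama") | (US_data$State=="Alaska"),]
# Copies of State and County with unused factor levels dropped.
US_data1$State_1<-droplevels(US_data1$State)
US_data1$County_1<-droplevels(US_data1$County)
US_data2$State_1<-droplevels(US_data2$State)
US_data2$County_1<-droplevels(US_data2$County)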
dim(US_data1)
## [1] 100 39
View(US_data1)
dim(US_data2)
## [1] 1348 39
View(US_data2)
str(US_data2)
## 'data.frame': 1348 obs. of 39 variables:
## $ CensusTract : num 1e+09 1e+09 1e+09 1e+09 1e+09 ...
## $ State : Factor w/ 52 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ County : Factor w/ 1928 levels "Añasco","Abbeville",..: 90 90 90 90 90 90 90 90 90 90 ...
## $ TotalPop : int 1948 2156 2968 4423 10763 3851 2761 3187 10915 5668 ...
## $ Men : int 940 1059 1364 2172 4922 1787 1210 1502 5486 2897 ...
## $ Women : int 1008 1097 1604 2251 5841 2064 1551 1685 5429 2771 ...
## $ Hispanic : num 0.9 0.8 0 10.5 0.7 13.1 3.8 1.3 1.4 0.4 ...
## $ White : num 87.4 40.4 74.5 82.8 68.5 72.9 74.5 84 89.5 85.5 ...
## $ Black : num 7.7 53.3 18.6 3.7 24.8 11.9 19.7 10.7 8.4 12.1 ...
## $ Native : num 0.3 0 0.5 1.6 0 0 0 3.1 0 0 ...
## $ Asian : num 0.6 2.3 1.4 0 3.8 0 0 0 0 0.3 ...
## $ Pacific : num 0 0 0.3 0 0 0 0 0 0 0 ...
## $ Citizen : int 1503 1662 2335 3306 7666 2642 2060 2391 7778 4217 ...
## $ Income : int 61838 32303 44922 54329 51965 63092 34821 73728 60063 41287 ...
## $ IncomeErr : int 11900 13538 5629 7003 6935 9585 7867 2447 8602 7857 ...
## $ IncomePerCap : int 25713 18021 20689 24125 27526 30480 20442 32813 24028 24710 ...
## $ IncomePerCapErr: int 4548 2474 2817 2870 2813 7550 3245 4669 2233 4149 ...
## $ Poverty : num 8.1 25.5 12.7 2.1 11.4 14.4 28.9 13 13.9 6.8 ...
## $ ChildPoverty : num 8.4 40.3 19.7 1.6 17.5 21.9 41.9 25.9 18.3 10 ...
## $ Professional : num 34.7 22.3 31.4 27 49.6 24.2 19.5 42.8 31.5 29.3 ...
## $ Service : num 17 24.7 24.9 20.8 14.2 17.5 29.6 10.7 17.5 13.7 ...
## $ Office : num 21.3 21.5 22.1 27 18.2 35.4 25.3 34.2 26.1 17.7 ...
## $ Construction : num 11.9 9.4 9.2 8.7 2.1 7.9 10.1 5.5 7.8 11 ...
## $ Production : num 15.2 22 12.4 16.4 15.8 14.9 15.5 6.8 17.1 28.3 ...
## $ Drive : num 90.2 86.3 94.8 86.6 88 82.7 92.4 84.3 90.1 88.7 ...
## $ Carpool : num 4.8 13.1 2.8 9.1 10.5 6.9 7.6 8.1 8.6 7.9 ...
## $ Transit : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Walk : num 0.5 0 0 0 0 0 0 0 0 0 ...
## $ OtherTransp : num 2.3 0.7 0 2.6 0.6 6 0 1.7 0 1.2 ...
## $ WorkAtHome : num 2.1 0 2.5 1.6 0.9 4.5 0 5.9 1.3 2.1 ...
## $ MeanCommute : num 25 23.4 19.6 25.3 24.8 19.8 20 24.3 29.4 32.9 ...
## $ Employed : int 943 753 1373 1782 5037 1560 1166 1502 4348 2485 ...
## $ PrivateWork : num 77.1 77 64.1 75.7 67.1 79.4 82 78.1 73.3 77.9 ...
## $ PublicWork : num 18.3 16.9 23.6 21.2 27.6 14.7 14.6 14.8 22.1 15.2 ...
## $ SelfEmployed : num 4.6 6.1 12.3 3.1 5.3 5.8 3.4 7.1 4.6 6.9 ...
## $ FamilyWork : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Unemployment : num 5.4 13.3 6.2 10.8 4.2 10.9 11.4 8.2 8.7 7.2 ...
## $ State_1 : Factor w/ 2 levels "Alabama","Alaska": 1 1 1 1 1 1 1 1 1 1 ...
## $ County_1 : Factor w/ 96 levels "Aleutians East Borough",..: 4 4 4 4 4 4 4 4 4 4 ...
table(droplevels(US_data2$State))
##
## Alabama Alaska
## 1181 167
table(droplevels(US_data1$County))
##
## Aleutians East Borough Aleutians West Census Area
## 1 2
## Anchorage Municipality Autauga
## 47 12
## Baldwin Barbour
## 32 6
xtabs(~County_1+State_1,data = US_data1)
## State_1
## County_1 Alabama Alaska
## Aleutians East Borough 0 1
## Aleutians West Census Area 0 2
## Anchorage Municipality 0 47
## Autauga 12 0
## Baldwin 32 0
## Barbour 6 0
boxplot(Income~State_1,data = US_data2,
main="Boxplot of Income in 2 specific states in US",
xlab="States",ylab="Income")
boxplot(TotalPop~County_1,data = US_data1,
main="Boxplot of Population in 6 counties",
xlab="Counties",ylab="Population")
axis(side=1,at=2,labels = "Aleutians West")
hist(US_data$Employed,main="Histogram of No.of Employed people above 16 years in US",
xlab="No.of employed people per tract",ylab = "Count",xlim = c(0,10000),
col = "light blue")
hist(US_data$Drive,main="Histogram of percent of people who commute alone in US",
xlab="percent of people who commute alone per tract",ylab = "Count",col = "grey")
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:psych':
##
## logit
scatterplot(Professional~White,data = US_data2,spread=FALSE,
smoother.args=list(lty=2),pch=1,
main="Plot between percent of White people and percent employed in Management,Business,Science and Arts")
scatterplot(Construction~Men,data = US_data2,spread=FALSE,
smoother.args=list(lty=2),pch=1,
main="Plot between no.of men and percent employed in Construction sector")
corr.test(US_data[,4:20])
## Call:corr.test(x = US_data[, 4:20])
## Correlation matrix
## TotalPop Men Women Hispanic White Black Native Asian
## TotalPop 1.00 0.98 0.98 0.11 -0.03 -0.11 -0.04 0.10
## Men 0.98 1.00 0.93 0.12 -0.02 -0.13 -0.03 0.10
## Women 0.98 0.93 1.00 0.10 -0.03 -0.09 -0.04 0.10
## Hispanic 0.11 0.12 0.10 1.00 -0.66 -0.12 -0.04 0.03
## White -0.03 -0.02 -0.03 -0.66 1.00 -0.58 -0.07 -0.25
## Black -0.11 -0.13 -0.09 -0.12 -0.58 1.00 -0.05 -0.11
## Native -0.04 -0.03 -0.04 -0.04 -0.07 -0.05 1.00 -0.04
## Asian 0.10 0.10 0.10 0.03 -0.25 -0.11 -0.04 1.00
## Pacific 0.02 0.03 0.02 0.02 -0.09 -0.04 0.01 0.16
## Citizen 0.94 0.92 0.92 -0.11 0.17 -0.13 -0.04 0.03
## Income 0.17 0.18 0.17 -0.23 0.31 -0.31 -0.07 0.28
## IncomeErr -0.01 0.00 -0.01 -0.10 0.10 -0.13 -0.06 0.22
## IncomePerCap 0.03 0.02 0.04 -0.31 0.38 -0.28 -0.07 0.20
## IncomePerCapErr -0.10 -0.10 -0.09 -0.16 0.17 -0.12 -0.05 0.13
## Poverty -0.15 -0.15 -0.15 0.34 -0.52 0.40 0.09 -0.12
## ChildPoverty -0.15 -0.15 -0.14 0.32 -0.50 0.41 0.07 -0.16
## Professional 0.08 0.07 0.08 -0.33 0.35 -0.25 -0.04 0.26
## Pacific Citizen Income IncomeErr IncomePerCap
## TotalPop 0.02 0.94 0.17 -0.01 0.03
## Men 0.03 0.92 0.18 0.00 0.02
## Women 0.02 0.92 0.17 -0.01 0.04
## Hispanic 0.02 -0.11 -0.23 -0.10 -0.31
## White -0.09 0.17 0.31 0.10 0.38
## Black -0.04 -0.13 -0.31 -0.13 -0.28
## Native 0.01 -0.04 -0.07 -0.06 -0.07
## Asian 0.16 0.03 0.28 0.22 0.20
## Pacific 1.00 0.00 0.01 0.01 -0.03
## Citizen 0.00 1.00 0.20 0.01 0.11
## Income 0.01 0.20 1.00 0.61 0.83
## IncomeErr 0.01 0.01 0.61 1.00 0.60
## IncomePerCap -0.03 0.11 0.83 0.60 1.00
## IncomePerCapErr -0.01 -0.05 0.50 0.52 0.77
## Poverty 0.01 -0.24 -0.70 -0.35 -0.61
## ChildPoverty 0.00 -0.24 -0.66 -0.35 -0.59
## Professional -0.03 0.18 0.73 0.49 0.80
## IncomePerCapErr Poverty ChildPoverty Professional
## TotalPop -0.10 -0.15 -0.15 0.08
## Men -0.10 -0.15 -0.15 0.07
## Women -0.09 -0.15 -0.14 0.08
## Hispanic -0.16 0.34 0.32 -0.33
## White 0.17 -0.52 -0.50 0.35
## Black -0.12 0.40 0.41 -0.25
## Native -0.05 0.09 0.07 -0.04
## Asian 0.13 -0.12 -0.16 0.26
## Pacific -0.01 0.01 0.00 -0.03
## Citizen -0.05 -0.24 -0.24 0.18
## Income 0.50 -0.70 -0.66 0.73
## IncomeErr 0.52 -0.35 -0.35 0.49
## IncomePerCap 0.77 -0.61 -0.59 0.80
## IncomePerCapErr 1.00 -0.29 -0.29 0.53
## Poverty -0.29 1.00 0.90 -0.54
## ChildPoverty -0.29 0.90 1.00 -0.57
## Professional 0.53 -0.54 -0.57 1.00
## Sample Size
## TotalPop Men Women Hispanic White Black Native Asian
## TotalPop 74001 74001 74001 73311 73311 73311 73311 73311
## Men 74001 74001 74001 73311 73311 73311 73311 73311
## Women 74001 74001 74001 73311 73311 73311 73311 73311
## Hispanic 73311 73311 73311 73311 73311 73311 73311 73311
## White 73311 73311 73311 73311 73311 73311 73311 73311
## Black 73311 73311 73311 73311 73311 73311 73311 73311
## Native 73311 73311 73311 73311 73311 73311 73311 73311
## Asian 73311 73311 73311 73311 73311 73311 73311 73311
## Pacific 73311 73311 73311 73311 73311 73311 73311 73311
## Citizen 74001 74001 74001 73311 73311 73311 73311 73311
## Income 72901 72901 72901 72901 72901 72901 72901 72901
## IncomeErr 72901 72901 72901 72901 72901 72901 72901 72901
## IncomePerCap 73261 73261 73261 73261 73261 73261 73261 73261
## IncomePerCapErr 73261 73261 73261 73261 73261 73261 73261 73261
## Poverty 73166 73166 73166 73166 73166 73166 73166 73166
## ChildPoverty 72883 72883 72883 72883 72883 72883 72883 72883
## Professional 73194 73194 73194 73194 73194 73194 73194 73194
## Pacific Citizen Income IncomeErr IncomePerCap
## TotalPop 73311 74001 72901 72901 73261
## Men 73311 74001 72901 72901 73261
## Women 73311 74001 72901 72901 73261
## Hispanic 73311 73311 72901 72901 73261
## White 73311 73311 72901 72901 73261
## Black 73311 73311 72901 72901 73261
## Native 73311 73311 72901 72901 73261
## Asian 73311 73311 72901 72901 73261
## Pacific 73311 73311 72901 72901 73261
## Citizen 73311 74001 72901 72901 73261
## Income 72901 72901 72901 72901 72901
## IncomeErr 72901 72901 72901 72901 72901
## IncomePerCap 73261 73261 72901 72901 73261
## IncomePerCapErr 73261 73261 72901 72901 73261
## Poverty 73166 73166 72901 72901 73120
## ChildPoverty 72883 72883 72748 72748 72874
## Professional 73194 73194 72899 72899 73156
## IncomePerCapErr Poverty ChildPoverty Professional
## TotalPop 73261 73166 72883 73194
## Men 73261 73166 72883 73194
## Women 73261 73166 72883 73194
## Hispanic 73261 73166 72883 73194
## White 73261 73166 72883 73194
## Black 73261 73166 72883 73194
## Native 73261 73166 72883 73194
## Asian 73261 73166 72883 73194
## Pacific 73261 73166 72883 73194
## Citizen 73261 73166 72883 73194
## Income 72901 72901 72748 72899
## IncomeErr 72901 72901 72748 72899
## IncomePerCap 73261 73120 72874 73156
## IncomePerCapErr 73261 73120 72874 73156
## Poverty 73120 73166 72883 73144
## ChildPoverty 72874 72883 72883 72880
## Professional 73156 73144 72880 73194
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
## TotalPop Men Women Hispanic White Black Native Asian
## TotalPop 0.00 0.00 0.00 0 0 0 0.00 0
## Men 0.00 0.00 0.00 0 0 0 0.00 0
## Women 0.00 0.00 0.00 0 0 0 0.00 0
## Hispanic 0.00 0.00 0.00 0 0 0 0.00 0
## White 0.00 0.00 0.00 0 0 0 0.00 0
## Black 0.00 0.00 0.00 0 0 0 0.00 0
## Native 0.00 0.00 0.00 0 0 0 0.00 0
## Asian 0.00 0.00 0.00 0 0 0 0.00 0
## Pacific 0.00 0.00 0.00 0 0 0 0.02 0
## Citizen 0.00 0.00 0.00 0 0 0 0.00 0
## Income 0.00 0.00 0.00 0 0 0 0.00 0
## IncomeErr 0.06 0.23 0.01 0 0 0 0.00 0
## IncomePerCap 0.00 0.00 0.00 0 0 0 0.00 0
## IncomePerCapErr 0.00 0.00 0.00 0 0 0 0.00 0
## Poverty 0.00 0.00 0.00 0 0 0 0.00 0
## ChildPoverty 0.00 0.00 0.00 0 0 0 0.00 0
## Professional 0.00 0.00 0.00 0 0 0 0.00 0
## Pacific Citizen Income IncomeErr IncomePerCap
## TotalPop 0.00 0.00 0.00 0.35 0
## Men 0.00 0.00 0.00 0.69 0
## Women 0.00 0.00 0.00 0.10 0
## Hispanic 0.00 0.00 0.00 0.00 0
## White 0.00 0.00 0.00 0.00 0
## Black 0.00 0.00 0.00 0.00 0
## Native 0.15 0.00 0.00 0.00 0
## Asian 0.00 0.00 0.00 0.00 0
## Pacific 0.00 0.93 0.23 0.01 0
## Citizen 0.46 0.00 0.00 0.40 0
## Income 0.03 0.00 0.00 0.00 0
## IncomeErr 0.00 0.10 0.00 0.00 0
## IncomePerCap 0.00 0.00 0.00 0.00 0
## IncomePerCapErr 0.01 0.00 0.00 0.00 0
## Poverty 0.06 0.00 0.00 0.00 0
## ChildPoverty 0.54 0.00 0.00 0.00 0
## Professional 0.00 0.00 0.00 0.00 0
## IncomePerCapErr Poverty ChildPoverty Professional
## TotalPop 0.00 0.00 0.00 0
## Men 0.00 0.00 0.00 0
## Women 0.00 0.00 0.00 0
## Hispanic 0.00 0.00 0.00 0
## White 0.00 0.00 0.00 0
## Black 0.00 0.00 0.00 0
## Native 0.00 0.00 0.00 0
## Asian 0.00 0.00 0.00 0
## Pacific 0.05 0.35 0.93 0
## Citizen 0.00 0.00 0.00 0
## Income 0.00 0.00 0.00 0
## IncomeErr 0.00 0.00 0.00 0
## IncomePerCap 0.00 0.00 0.00 0
## IncomePerCapErr 0.00 0.00 0.00 0
## Poverty 0.00 0.00 0.00 0
## ChildPoverty 0.00 0.00 0.00 0
## Professional 0.00 0.00 0.00 0
##
## To see confidence intervals of the correlations, print with the short=FALSE option
library(corrgram)
## Warning: package 'corrgram' was built under R version 3.3.3
corrgram(US_data2[,4:20],order = TRUE,lower.panel = panel.shade,
upper.panel = panel.pie,text.panel = panel.txt,
main="Corrgram of numeric variables in the US Demographic Dataset")
scatterplotMatrix(formula=~Professional+Service+Office+Construction+Production,
cex=0.6,data = US_data2)
t.test(US_data2$Native,US_data2$Asian)
##
## Welch Two Sample t-test
##
## data: US_data2$Native and US_data2$Asian
## t = 2.9768, df = 1735, p-value = 0.002953
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.2857514 1.3896023
## sample estimates:
## mean of x mean of y
## 2.435071 1.597394
The p-value (0.003) is below 0.05, which suggests that there is a significant difference between the mean percentages of Native and Asian residents per tract in Alabama and Alaska.
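The decision rule used here can be sketched in code. The example below runs on synthetic vectors (`x` and `y` are hypothetical, not the census columns) and shows how the p-value can be read from the object returned by `t.test()` and compared with the 0.05 threshold.

```r
set.seed(7)

# Two hypothetical samples with clearly different means.
x <- rnorm(100, mean = 5, sd = 1)
y <- rnorm(100, mean = 0, sd = 1)

# t.test() runs the Welch two-sample test by default (var.equal = FALSE).
res <- t.test(x, y)

# A p-value below 0.05 means the null hypothesis of equal means is rejected,
# i.e. the difference is significant at the 5% level.
reject_null <- res$p.value < 0.05
print(res$p.value)
print(reject_null)
```

A p-value above the threshold would instead mean that the test found no significant difference.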
t.test(US_data2$Men,US_data2$Women)
##
## Welch Two Sample t-test
##
## data: US_data2$Men and US_data2$Women
## t = -2.0056, df = 2694, p-value = 0.045
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -166.502622 -1.878683
## sample estimates:
## mean of x mean of y
## 2021.701 2105.892
The p-value (0.045) is just below 0.05, which suggests a significant, though only marginal, difference between the mean numbers of men and women per tract in Alabama and Alaska.
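Because the counts of men and women come from the same tracts, a paired test is one alternative to the Welch test used above. The sketch below is not the analysis performed in this report. It uses synthetic counts (`men`, `women` and `n_tracts` are hypothetical) to show how the two tests can differ when the two measurements are strongly related within each tract.

```r
set.seed(3)
n_tracts <- 200

# Hypothetical per-tract counts: women track men closely,
# with about 80 more women per tract on average.
men <- round(rnorm(n_tracts, mean = 2000, sd = 600))
women <- men + round(rnorm(n_tracts, mean = 80, sd = 100))

# Welch test: treats the two vectors as independent samples.
welch <- t.test(men, women)

# Paired test: works on the within-tract differences men - women.
paired <- t.test(men, women, paired = TRUE)

print(c(welch = welch$p.value, paired = paired$p.value))
```

In this sketch the paired test yields a much smaller p-value, because the large tract-to-tract variation in size cancels out in the within-tract differences.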