US Census
US Census Bureau, which collects census data and finds interesting insights from it. The data is spread across multiple csv files.
I have used this data into R and taken it into reasonable shape so that some analysis can be performed.
# load libraries
library(dplyr)
library(readr)
library(tidyr)
Concatenated all of the data frames in df_list into one data frame called us_census.
# loading CSVs
files <- list.files(pattern="states_.*csv")
head(files)
## [1] "states_0.csv" "states_1.csv" "states_2.csv" "states_3.csv" "states_4.csv"
## [6] "states_5.csv"
df_list <- lapply(files,read_csv)
us_census <- bind_rows(df_list)
head(us_census)
## # A tibble: 6 x 11
## X1 State TotalPop Hispanic White Black Native Asian Pacific Income
## <dbl> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 0 Alab… 4830620 3.75161… 61.8… 31.2… 0.453… 1.05… 0.0343… $4329…
## 2 1 Alas… 733375 5.90958… 60.9… 2.84… 16.39… 5.45… 1.0586… $7035…
## 3 2 Ariz… 6641928 29.5659… 57.1… 3.85… 4.355… 2.87… 0.1676… $5420…
## 4 3 Arka… 2958208 6.21547… 71.1… 18.9… 0.522… 1.14… 0.1468… $4193…
## 5 4 Cali… 38421464 37.2918… 40.2… 5.67… 0.405… 13.0… 0.3514… $6726…
## 6 5 Colo… 5278906 20.7843… 69.8… 3.54… 0.573… 2.66… 0.1196… $6465…
## # … with 1 more variable: GenderPop <chr>
# inspecting data
colnames(us_census)
## [1] "X1" "State" "TotalPop" "Hispanic" "White" "Black"
## [7] "Native" "Asian" "Pacific" "Income" "GenderPop"
str(us_census)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 61 obs. of 11 variables:
## $ X1 : num 0 1 2 3 4 5 0 1 2 3 ...
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ TotalPop : num 4830620 733375 6641928 2958208 38421464 ...
## $ Hispanic : chr "3.7516156462584975%" "5.909580838323351%" "29.565921052631502%" "6.215474452554738%" ...
## $ White : chr "61.878656462585%" "60.910179640718574%" "57.120000000000026%" "71.13781021897813%" ...
## $ Black : chr "31.25297619047618%" "2.8485029940119775%" "3.8509868421052658%" "18.968759124087573%" ...
## $ Native : chr "0.4532312925170065%" "16.39101796407186%" "4.35506578947368%" "0.5229197080291965%" ...
## $ Asian : chr "1.0502551020408146%" "5.450299401197604%" "2.876578947368419%" "1.1423357664233578%" ...
## $ Pacific : chr "0.03435374149659865%" "1.0586826347305378%" "0.16763157894736833%" "0.14686131386861315%" ...
## $ Income : chr "$43296.35860306644" "$70354.74390243902" "$54207.82095490716" "$41935.63396778917" ...
## $ GenderPop: chr "2341093M_2489527F" "384160M_349215F" "3299088M_3342840F" "1451913M_1506295F" ...
Column X1 stores meaningless information. Dropped the X1 column from us_census.
# drop X1 column
us_census <- us_census %>%
select(-X1)
head(us_census)
## # A tibble: 6 x 10
## State TotalPop Hispanic White Black Native Asian Pacific Income GenderPop
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Alaba… 4830620 3.751615… 61.87… 31.25… 0.4532… 1.05… 0.0343… $4329… 2341093M…
## 2 Alaska 733375 5.909580… 60.91… 2.848… 16.391… 5.45… 1.0586… $7035… 384160M_…
## 3 Arizo… 6641928 29.56592… 57.12… 3.850… 4.3550… 2.87… 0.1676… $5420… 3299088M…
## 4 Arkan… 2958208 6.215474… 71.13… 18.96… 0.5229… 1.14… 0.1468… $4193… 1451913M…
## 5 Calif… 38421464 37.29187… 40.21… 5.677… 0.4052… 13.0… 0.3514… $6726… 19087135…
## 6 Color… 5278906 20.78438… 69.89… 3.546… 0.5738… 2.66… 0.1196… $6465… 2648667M…
there are 6 columns representing the population percentage for different races. The columns include the percent symbol %. The Income column also incudes a $ symbol along with the number representing median income for a state.
# removing % from race columns
us_census <- us_census %>%
mutate (Hispanic = gsub("\\%","",Hispanic)) %>%
mutate (White = gsub("\\%","",White)) %>%
mutate (Black = gsub("\\%","",Black)) %>%
mutate (Native = gsub("\\%","",Native)) %>%
mutate (Asian = gsub("\\%","",Asian)) %>%
mutate (Pacific = gsub("\\%","",Pacific))
head(us_census)
## # A tibble: 6 x 10
## State TotalPop Hispanic White Black Native Asian Pacific Income GenderPop
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Alaba… 4830620 3.751615… 61.87… 31.25… 0.4532… 1.05… 0.0343… $4329… 2341093M…
## 2 Alaska 733375 5.909580… 60.91… 2.848… 16.391… 5.45… 1.0586… $7035… 384160M_…
## 3 Arizo… 6641928 29.56592… 57.12… 3.850… 4.3550… 2.87… 0.1676… $5420… 3299088M…
## 4 Arkan… 2958208 6.215474… 71.13… 18.96… 0.5229… 1.14… 0.1468… $4193… 1451913M…
## 5 Calif… 38421464 37.29187… 40.21… 5.677… 0.4052… 13.0… 0.3514… $6726… 19087135…
## 6 Color… 5278906 20.78438… 69.89… 3.546… 0.5738… 2.66… 0.1196… $6465… 2648667M…
# removing $ from Income column
us_census <- us_census %>%
mutate ( Income = gsub("\\$","",Income))
head(us_census)
## # A tibble: 6 x 10
## State TotalPop Hispanic White Black Native Asian Pacific Income GenderPop
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Alaba… 4830620 3.751615… 61.87… 31.25… 0.4532… 1.05… 0.0343… 43296… 2341093M…
## 2 Alaska 733375 5.909580… 60.91… 2.848… 16.391… 5.45… 1.0586… 70354… 384160M_…
## 3 Arizo… 6641928 29.56592… 57.12… 3.850… 4.3550… 2.87… 0.1676… 54207… 3299088M…
## 4 Arkan… 2958208 6.215474… 71.13… 18.96… 0.5229… 1.14… 0.1468… 41935… 1451913M…
## 5 Calif… 38421464 37.29187… 40.21… 5.677… 0.4052… 13.0… 0.3514… 67264… 19087135…
## 6 Color… 5278906 20.78438… 69.89… 3.546… 0.5738… 2.66… 0.1196… 64657… 2648667M…
The GenderPop column appears to hold the male and female population counts. Separated this column at the _ character to create two new columns: male_pop and female_pop & Removed these extra characters from the columns.
# separating GenderPop column
us_census <- us_census %>%
separate (GenderPop, c("male_pop","female_pop"),"_") %>% mutate(male_pop = gsub("\\M","",male_pop)) %>% mutate(female_pop = gsub("\\F","",female_pop))
head(us_census)
## # A tibble: 6 x 11
## State TotalPop Hispanic White Black Native Asian Pacific Income male_pop
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Alab… 4830620 3.75161… 61.8… 31.2… 0.453… 1.05… 0.0343… 43296… 2341093
## 2 Alas… 733375 5.90958… 60.9… 2.84… 16.39… 5.45… 1.0586… 70354… 384160
## 3 Ariz… 6641928 29.5659… 57.1… 3.85… 4.355… 2.87… 0.1676… 54207… 3299088
## 4 Arka… 2958208 6.21547… 71.1… 18.9… 0.522… 1.14… 0.1468… 41935… 1451913
## 5 Cali… 38421464 37.2918… 40.2… 5.67… 0.405… 13.0… 0.3514… 67264… 19087135
## 6 Colo… 5278906 20.7843… 69.8… 3.54… 0.573… 2.66… 0.1196… 64657… 2648667
## # … with 1 more variable: female_pop <chr>
Data type for these columns is still chr, or character. Then I Converted all of these columns (Hispanic,White,Black,Native,Asian,Pacific,Income,male_pop,female_pop) to have a data type of numeric.
# updating column data types
us_census <- us_census %>%
mutate(Hispanic = as.numeric (Hispanic),White = as.numeric(White),Black = as.numeric(Black),Native = as.numeric(Native),Asian = as.numeric(Asian),Pacific = as.numeric (Pacific),Income = as.numeric(Income),male_pop = as.numeric(male_pop),female_pop = as.numeric(female_pop))
head(us_census)
## # A tibble: 6 x 11
## State TotalPop Hispanic White Black Native Asian Pacific Income male_pop
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alab… 4830620 3.75 61.9 31.3 0.453 1.05 0.0344 43296. 2341093
## 2 Alas… 733375 5.91 60.9 2.85 16.4 5.45 1.06 70355. 384160
## 3 Ariz… 6641928 29.6 57.1 3.85 4.36 2.88 0.168 54208. 3299088
## 4 Arka… 2958208 6.22 71.1 19.0 0.523 1.14 0.147 41936. 1451913
## 5 Cali… 38421464 37.3 40.2 5.68 0.405 13.1 0.351 67265. 19087135
## 6 Colo… 5278906 20.8 69.9 3.55 0.574 2.66 0.120 64658. 2648667
## # … with 1 more variable: female_pop <dbl>
# updating values of race columns
us_census <- us_census %>%
mutate(Hispanic = Hispanic/100, White = White /100, Black = Black/100, Native = Native/100, Asian = Asian/100, Pacific = Pacific/100)
head(us_census)
## # A tibble: 6 x 11
## State TotalPop Hispanic White Black Native Asian Pacific Income male_pop
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alab… 4830620 0.0375 0.619 0.313 0.00453 0.0105 3.44e-4 43296. 2341093
## 2 Alas… 733375 0.0591 0.609 0.0285 0.164 0.0545 1.06e-2 70355. 384160
## 3 Ariz… 6641928 0.296 0.571 0.0385 0.0436 0.0288 1.68e-3 54208. 3299088
## 4 Arka… 2958208 0.0622 0.711 0.190 0.00523 0.0114 1.47e-3 41936. 1451913
## 5 Cali… 38421464 0.373 0.402 0.0568 0.00405 0.131 3.51e-3 67265. 19087135
## 6 Colo… 5278906 0.208 0.699 0.0355 0.00574 0.0266 1.20e-3 64658. 2648667
## # … with 1 more variable: female_pop <dbl>
# checking for duplicate rows
duplicates <- us_census %>%
duplicated() %>% table()
print(duplicates)
## .
## FALSE TRUE
## 52 9
Since there are duplicates, updated the value of us_census to be the us_census data frame with only unique/distinct rows.
# removing duplicate rows
us_census <- us_census %>%
distinct ()
head(us_census)
## # A tibble: 6 x 11
## State TotalPop Hispanic White Black Native Asian Pacific Income male_pop
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alab… 4830620 0.0375 0.619 0.313 0.00453 0.0105 3.44e-4 43296. 2341093
## 2 Alas… 733375 0.0591 0.609 0.0285 0.164 0.0545 1.06e-2 70355. 384160
## 3 Ariz… 6641928 0.296 0.571 0.0385 0.0436 0.0288 1.68e-3 54208. 3299088
## 4 Arka… 2958208 0.0622 0.711 0.190 0.00523 0.0114 1.47e-3 41936. 1451913
## 5 Cali… 38421464 0.373 0.402 0.0568 0.00405 0.131 3.51e-3 67265. 19087135
## 6 Colo… 5278906 0.208 0.699 0.0355 0.00574 0.0266 1.20e-3 64658. 2648667
## # … with 1 more variable: female_pop <dbl>
# checking for duplicate rows
updated_duplicate <- us_census %>%
duplicated() %>% table()
print(updated_duplicate)
## .
## FALSE
## 52
str(us_census)
## Classes 'tbl_df', 'tbl' and 'data.frame': 52 obs. of 11 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ TotalPop : num 4830620 733375 6641928 2958208 38421464 ...
## $ Hispanic : num 0.0375 0.0591 0.2957 0.0622 0.3729 ...
## $ White : num 0.619 0.609 0.571 0.711 0.402 ...
## $ Black : num 0.3125 0.0285 0.0385 0.1897 0.0568 ...
## $ Native : num 0.00453 0.16391 0.04355 0.00523 0.00405 ...
## $ Asian : num 0.0105 0.0545 0.0288 0.0114 0.1305 ...
## $ Pacific : num 0.000344 0.010587 0.001676 0.001469 0.003514 ...
## $ Income : num 43296 70355 54208 41936 67265 ...
## $ male_pop : num 2341093 384160 3299088 1451913 19087135 ...
## $ female_pop: num 2489527 349215 3342840 1506295 19334329 ...
summary(us_census)
## State TotalPop Hispanic White
## Length:52 Min. : 579679 Min. :0.01291 Min. :0.007736
## Class :character 1st Qu.: 1792702 1st Qu.:0.04699 1st Qu.:0.569622
## Mode :character Median : 4168293 Median :0.08469 Median :0.715881
## Mean : 6155733 Mean :0.12591 Mean :0.675874
## 3rd Qu.: 6775556 3rd Qu.:0.13385 3rd Qu.:0.801829
## Max. :38421464 Max. :0.98894 Max. :0.939831
## Black Native Asian Pacific
## Min. :0.0009256 Min. :2.818e-05 Min. :0.000752 Min. :0.0000124
## 1st Qu.:0.0264959 1st Qu.:1.989e-03 1st Qu.:0.012377 1st Qu.:0.0002570
## Median :0.0796740 Median :3.766e-03 Median :0.023003 Median :0.0004296
## Mean :0.1173536 Mean :1.556e-02 Mean :0.035917 Mean :0.0028293
## 3rd Qu.:0.1723958 3rd Qu.:1.072e-02 3rd Qu.:0.037578 3rd Qu.:0.0007958
## Max. :0.5177654 Max. :1.639e-01 Max. :0.365921 Max. :0.0875886
## Income male_pop female_pop
## Min. :20721 Min. : 295561 Min. : 284118
## 1st Qu.:48487 1st Qu.: 887839 1st Qu.: 904862
## Median :54240 Median : 2056330 Median : 2111962
## Mean :55977 Mean : 3027849 Mean : 3127884
## 3rd Qu.:63689 3rd Qu.: 3346247 3rd Qu.: 3466387
## Max. :78765 Max. :19087135 Max. :19334329
us_census
## # A tibble: 52 x 11
## State TotalPop Hispanic White Black Native Asian Pacific Income male_pop
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Alab… 4830620 0.0375 0.619 0.313 0.00453 0.0105 3.44e-4 43296. 2341093
## 2 Alas… 733375 0.0591 0.609 0.0285 0.164 0.0545 1.06e-2 70355. 384160
## 3 Ariz… 6641928 0.296 0.571 0.0385 0.0436 0.0288 1.68e-3 54208. 3299088
## 4 Arka… 2958208 0.0622 0.711 0.190 0.00523 0.0114 1.47e-3 41936. 1451913
## 5 Cali… 38421464 0.373 0.402 0.0568 0.00405 0.131 3.51e-3 67265. 19087135
## 6 Colo… 5278906 0.208 0.699 0.0355 0.00574 0.0266 1.20e-3 64658. 2648667
## 7 Conn… 3593222 0.156 0.677 0.103 0.00126 0.0402 1.86e-4 76147. 1751607
## 8 Dela… 926454 0.0882 0.646 0.207 0.00260 0.0327 1.92e-4 61828. 448413
## 9 Dist… 647484 0.0917 0.331 0.518 0.00201 0.0338 2.96e-4 75466. 306674
## 10 Flor… 19645772 0.213 0.591 0.152 0.00210 0.0228 5.15e-4 50690. 9600009
## # … with 42 more rows, and 1 more variable: female_pop <dbl>
Author github: Emon-ProCoder7 2. Author Linked-in:Md Tabassum Hossain Emon↩