Cleaning US. Census Data

US Census

US Census Bureau, which collects census data and finds interesting insights from it. The data is spread across multiple csv files.

I have used this data into R and taken it into reasonable shape so that some analysis can be performed.

# load libraries
library(dplyr)
library(readr)
library(tidyr)

Merging csv’s in One Dataframe

Concatenated all of the data frames in df_list into one data frame called us_census.

# loading CSVs
files <- list.files(pattern="states_.*csv")
head(files)

## [1] "states_0.csv" "states_1.csv" "states_2.csv" "states_3.csv" "states_4.csv"
## [6] "states_5.csv"

df_list <- lapply(files,read_csv)
us_census <- bind_rows(df_list)
head(us_census)

## # A tibble: 6 x 11
##      X1 State TotalPop Hispanic White Black Native Asian Pacific Income
##   <dbl> <chr>    <dbl> <chr>    <chr> <chr> <chr>  <chr> <chr>   <chr> 
## 1     0 Alab…  4830620 3.75161… 61.8… 31.2… 0.453… 1.05… 0.0343… $4329…
## 2     1 Alas…   733375 5.90958… 60.9… 2.84… 16.39… 5.45… 1.0586… $7035…
## 3     2 Ariz…  6641928 29.5659… 57.1… 3.85… 4.355… 2.87… 0.1676… $5420…
## 4     3 Arka…  2958208 6.21547… 71.1… 18.9… 0.522… 1.14… 0.1468… $4193…
## 5     4 Cali… 38421464 37.2918… 40.2… 5.67… 0.405… 13.0… 0.3514… $6726…
## 6     5 Colo…  5278906 20.7843… 69.8… 3.54… 0.573… 2.66… 0.1196… $6465…
## # … with 1 more variable: GenderPop <chr>

# inspecting data
colnames(us_census)

##  [1] "X1"        "State"     "TotalPop"  "Hispanic"  "White"     "Black"    
##  [7] "Native"    "Asian"     "Pacific"   "Income"    "GenderPop"

str(us_census)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 61 obs. of  11 variables:
##  $ X1       : num  0 1 2 3 4 5 0 1 2 3 ...
##  $ State    : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ TotalPop : num  4830620 733375 6641928 2958208 38421464 ...
##  $ Hispanic : chr  "3.7516156462584975%" "5.909580838323351%" "29.565921052631502%" "6.215474452554738%" ...
##  $ White    : chr  "61.878656462585%" "60.910179640718574%" "57.120000000000026%" "71.13781021897813%" ...
##  $ Black    : chr  "31.25297619047618%" "2.8485029940119775%" "3.8509868421052658%" "18.968759124087573%" ...
##  $ Native   : chr  "0.4532312925170065%" "16.39101796407186%" "4.35506578947368%" "0.5229197080291965%" ...
##  $ Asian    : chr  "1.0502551020408146%" "5.450299401197604%" "2.876578947368419%" "1.1423357664233578%" ...
##  $ Pacific  : chr  "0.03435374149659865%" "1.0586826347305378%" "0.16763157894736833%" "0.14686131386861315%" ...
##  $ Income   : chr  "$43296.35860306644" "$70354.74390243902" "$54207.82095490716" "$41935.63396778917" ...
##  $ GenderPop: chr  "2341093M_2489527F" "384160M_349215F" "3299088M_3342840F" "1451913M_1506295F" ...

Removing and Reformating the Columns

Column X1 stores meaningless information. Dropped the X1 column from us_census.

# drop X1 column
us_census <- us_census %>%
 select(-X1)
head(us_census)

## # A tibble: 6 x 10
##   State  TotalPop Hispanic  White  Black  Native  Asian Pacific Income GenderPop
##   <chr>     <dbl> <chr>     <chr>  <chr>  <chr>   <chr> <chr>   <chr>  <chr>    
## 1 Alaba…  4830620 3.751615… 61.87… 31.25… 0.4532… 1.05… 0.0343… $4329… 2341093M…
## 2 Alaska   733375 5.909580… 60.91… 2.848… 16.391… 5.45… 1.0586… $7035… 384160M_…
## 3 Arizo…  6641928 29.56592… 57.12… 3.850… 4.3550… 2.87… 0.1676… $5420… 3299088M…
## 4 Arkan…  2958208 6.215474… 71.13… 18.96… 0.5229… 1.14… 0.1468… $4193… 1451913M…
## 5 Calif… 38421464 37.29187… 40.21… 5.677… 0.4052… 13.0… 0.3514… $6726… 19087135…
## 6 Color…  5278906 20.78438… 69.89… 3.546… 0.5738… 2.66… 0.1196… $6465… 2648667M…

Preparaing Data Points for Calculation

there are 6 columns representing the population percentage for different races. The columns include the percent symbol %. The Income column also incudes a $ symbol along with the number representing median income for a state.

# removing % from race columns
us_census <- us_census %>% 
 mutate (Hispanic = gsub("\\%","",Hispanic)) %>% 
 mutate (White = gsub("\\%","",White)) %>%
 mutate (Black = gsub("\\%","",Black)) %>%
 mutate (Native = gsub("\\%","",Native)) %>%
 mutate (Asian = gsub("\\%","",Asian)) %>%
 mutate (Pacific = gsub("\\%","",Pacific))

head(us_census)

## # A tibble: 6 x 10
##   State  TotalPop Hispanic  White  Black  Native  Asian Pacific Income GenderPop
##   <chr>     <dbl> <chr>     <chr>  <chr>  <chr>   <chr> <chr>   <chr>  <chr>    
## 1 Alaba…  4830620 3.751615… 61.87… 31.25… 0.4532… 1.05… 0.0343… $4329… 2341093M…
## 2 Alaska   733375 5.909580… 60.91… 2.848… 16.391… 5.45… 1.0586… $7035… 384160M_…
## 3 Arizo…  6641928 29.56592… 57.12… 3.850… 4.3550… 2.87… 0.1676… $5420… 3299088M…
## 4 Arkan…  2958208 6.215474… 71.13… 18.96… 0.5229… 1.14… 0.1468… $4193… 1451913M…
## 5 Calif… 38421464 37.29187… 40.21… 5.677… 0.4052… 13.0… 0.3514… $6726… 19087135…
## 6 Color…  5278906 20.78438… 69.89… 3.546… 0.5738… 2.66… 0.1196… $6465… 2648667M…

# removing $ from Income column
us_census <- us_census %>%
 mutate ( Income = gsub("\\$","",Income))
head(us_census)

## # A tibble: 6 x 10
##   State  TotalPop Hispanic  White  Black  Native  Asian Pacific Income GenderPop
##   <chr>     <dbl> <chr>     <chr>  <chr>  <chr>   <chr> <chr>   <chr>  <chr>    
## 1 Alaba…  4830620 3.751615… 61.87… 31.25… 0.4532… 1.05… 0.0343… 43296… 2341093M…
## 2 Alaska   733375 5.909580… 60.91… 2.848… 16.391… 5.45… 1.0586… 70354… 384160M_…
## 3 Arizo…  6641928 29.56592… 57.12… 3.850… 4.3550… 2.87… 0.1676… 54207… 3299088M…
## 4 Arkan…  2958208 6.215474… 71.13… 18.96… 0.5229… 1.14… 0.1468… 41935… 1451913M…
## 5 Calif… 38421464 37.29187… 40.21… 5.677… 0.4052… 13.0… 0.3514… 67264… 19087135…
## 6 Color…  5278906 20.78438… 69.89… 3.546… 0.5738… 2.66… 0.1196… 64657… 2648667M…

The GenderPop column appears to hold the male and female population counts. Separated this column at the _ character to create two new columns: male_pop and female_pop & Removed these extra characters from the columns.

# separating GenderPop column
us_census <- us_census %>%
 separate (GenderPop, c("male_pop","female_pop"),"_") %>% mutate(male_pop = gsub("\\M","",male_pop)) %>% mutate(female_pop = gsub("\\F","",female_pop))
head(us_census)

## # A tibble: 6 x 11
##   State TotalPop Hispanic White Black Native Asian Pacific Income male_pop
##   <chr>    <dbl> <chr>    <chr> <chr> <chr>  <chr> <chr>   <chr>  <chr>   
## 1 Alab…  4830620 3.75161… 61.8… 31.2… 0.453… 1.05… 0.0343… 43296… 2341093 
## 2 Alas…   733375 5.90958… 60.9… 2.84… 16.39… 5.45… 1.0586… 70354… 384160  
## 3 Ariz…  6641928 29.5659… 57.1… 3.85… 4.355… 2.87… 0.1676… 54207… 3299088 
## 4 Arka…  2958208 6.21547… 71.1… 18.9… 0.522… 1.14… 0.1468… 41935… 1451913 
## 5 Cali… 38421464 37.2918… 40.2… 5.67… 0.405… 13.0… 0.3514… 67264… 19087135
## 6 Colo…  5278906 20.7843… 69.8… 3.54… 0.573… 2.66… 0.1196… 64657… 2648667 
## # … with 1 more variable: female_pop <chr>

Conversion

Data type for these columns is still chr, or character. Then I Converted all of these columns (Hispanic,White,Black,Native,Asian,Pacific,Income,male_pop,female_pop) to have a data type of numeric.

# updating column data types
us_census <- us_census %>%
mutate(Hispanic = as.numeric (Hispanic),White = as.numeric(White),Black = as.numeric(Black),Native = as.numeric(Native),Asian = as.numeric(Asian),Pacific = as.numeric (Pacific),Income = as.numeric(Income),male_pop = as.numeric(male_pop),female_pop = as.numeric(female_pop))

head(us_census)

## # A tibble: 6 x 11
##   State TotalPop Hispanic White Black Native Asian Pacific Income male_pop
##   <chr>    <dbl>    <dbl> <dbl> <dbl>  <dbl> <dbl>   <dbl>  <dbl>    <dbl>
## 1 Alab…  4830620     3.75  61.9 31.3   0.453  1.05  0.0344 43296.  2341093
## 2 Alas…   733375     5.91  60.9  2.85 16.4    5.45  1.06   70355.   384160
## 3 Ariz…  6641928    29.6   57.1  3.85  4.36   2.88  0.168  54208.  3299088
## 4 Arka…  2958208     6.22  71.1 19.0   0.523  1.14  0.147  41936.  1451913
## 5 Cali… 38421464    37.3   40.2  5.68  0.405 13.1   0.351  67265. 19087135
## 6 Colo…  5278906    20.8   69.9  3.55  0.574  2.66  0.120  64658.  2648667
## # … with 1 more variable: female_pop <dbl>

# updating values of race columns
us_census <- us_census %>%
 mutate(Hispanic = Hispanic/100, White = White /100, Black = Black/100, Native = Native/100, Asian = Asian/100, Pacific = Pacific/100)
head(us_census)

## # A tibble: 6 x 11
##   State TotalPop Hispanic White  Black  Native  Asian Pacific Income male_pop
##   <chr>    <dbl>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>
## 1 Alab…  4830620   0.0375 0.619 0.313  0.00453 0.0105 3.44e-4 43296.  2341093
## 2 Alas…   733375   0.0591 0.609 0.0285 0.164   0.0545 1.06e-2 70355.   384160
## 3 Ariz…  6641928   0.296  0.571 0.0385 0.0436  0.0288 1.68e-3 54208.  3299088
## 4 Arka…  2958208   0.0622 0.711 0.190  0.00523 0.0114 1.47e-3 41936.  1451913
## 5 Cali… 38421464   0.373  0.402 0.0568 0.00405 0.131  3.51e-3 67265. 19087135
## 6 Colo…  5278906   0.208  0.699 0.0355 0.00574 0.0266 1.20e-3 64658.  2648667
## # … with 1 more variable: female_pop <dbl>

Removing Duplicate Rows

# checking for duplicate rows
duplicates <- us_census %>%
 duplicated() %>% table()
print(duplicates)

## .
## FALSE  TRUE 
##    52     9

Since there are duplicates, updated the value of us_census to be the us_census data frame with only unique/distinct rows.

# removing duplicate rows
us_census <- us_census %>%
 distinct ()
head(us_census)

## # A tibble: 6 x 11
##   State TotalPop Hispanic White  Black  Native  Asian Pacific Income male_pop
##   <chr>    <dbl>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>
## 1 Alab…  4830620   0.0375 0.619 0.313  0.00453 0.0105 3.44e-4 43296.  2341093
## 2 Alas…   733375   0.0591 0.609 0.0285 0.164   0.0545 1.06e-2 70355.   384160
## 3 Ariz…  6641928   0.296  0.571 0.0385 0.0436  0.0288 1.68e-3 54208.  3299088
## 4 Arka…  2958208   0.0622 0.711 0.190  0.00523 0.0114 1.47e-3 41936.  1451913
## 5 Cali… 38421464   0.373  0.402 0.0568 0.00405 0.131  3.51e-3 67265. 19087135
## 6 Colo…  5278906   0.208  0.699 0.0355 0.00574 0.0266 1.20e-3 64658.  2648667
## # … with 1 more variable: female_pop <dbl>

# checking for duplicate rows
updated_duplicate <- us_census %>%
 duplicated() %>% table()
print(updated_duplicate)

## .
## FALSE 
##    52

Tidy DataFrame

str(us_census)

## Classes 'tbl_df', 'tbl' and 'data.frame':    52 obs. of  11 variables:
##  $ State     : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ TotalPop  : num  4830620 733375 6641928 2958208 38421464 ...
##  $ Hispanic  : num  0.0375 0.0591 0.2957 0.0622 0.3729 ...
##  $ White     : num  0.619 0.609 0.571 0.711 0.402 ...
##  $ Black     : num  0.3125 0.0285 0.0385 0.1897 0.0568 ...
##  $ Native    : num  0.00453 0.16391 0.04355 0.00523 0.00405 ...
##  $ Asian     : num  0.0105 0.0545 0.0288 0.0114 0.1305 ...
##  $ Pacific   : num  0.000344 0.010587 0.001676 0.001469 0.003514 ...
##  $ Income    : num  43296 70355 54208 41936 67265 ...
##  $ male_pop  : num  2341093 384160 3299088 1451913 19087135 ...
##  $ female_pop: num  2489527 349215 3342840 1506295 19334329 ...

summary(us_census)

##     State              TotalPop           Hispanic           White         
##  Length:52          Min.   :  579679   Min.   :0.01291   Min.   :0.007736  
##  Class :character   1st Qu.: 1792702   1st Qu.:0.04699   1st Qu.:0.569622  
##  Mode  :character   Median : 4168293   Median :0.08469   Median :0.715881  
##                     Mean   : 6155733   Mean   :0.12591   Mean   :0.675874  
##                     3rd Qu.: 6775556   3rd Qu.:0.13385   3rd Qu.:0.801829  
##                     Max.   :38421464   Max.   :0.98894   Max.   :0.939831  
##      Black               Native              Asian             Pacific         
##  Min.   :0.0009256   Min.   :2.818e-05   Min.   :0.000752   Min.   :0.0000124  
##  1st Qu.:0.0264959   1st Qu.:1.989e-03   1st Qu.:0.012377   1st Qu.:0.0002570  
##  Median :0.0796740   Median :3.766e-03   Median :0.023003   Median :0.0004296  
##  Mean   :0.1173536   Mean   :1.556e-02   Mean   :0.035917   Mean   :0.0028293  
##  3rd Qu.:0.1723958   3rd Qu.:1.072e-02   3rd Qu.:0.037578   3rd Qu.:0.0007958  
##  Max.   :0.5177654   Max.   :1.639e-01   Max.   :0.365921   Max.   :0.0875886  
##      Income         male_pop          female_pop      
##  Min.   :20721   Min.   :  295561   Min.   :  284118  
##  1st Qu.:48487   1st Qu.:  887839   1st Qu.:  904862  
##  Median :54240   Median : 2056330   Median : 2111962  
##  Mean   :55977   Mean   : 3027849   Mean   : 3127884  
##  3rd Qu.:63689   3rd Qu.: 3346247   3rd Qu.: 3466387  
##  Max.   :78765   Max.   :19087135   Max.   :19334329

us_census

## # A tibble: 52 x 11
##    State TotalPop Hispanic White  Black  Native  Asian Pacific Income male_pop
##    <chr>    <dbl>    <dbl> <dbl>  <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>
##  1 Alab…  4830620   0.0375 0.619 0.313  0.00453 0.0105 3.44e-4 43296.  2341093
##  2 Alas…   733375   0.0591 0.609 0.0285 0.164   0.0545 1.06e-2 70355.   384160
##  3 Ariz…  6641928   0.296  0.571 0.0385 0.0436  0.0288 1.68e-3 54208.  3299088
##  4 Arka…  2958208   0.0622 0.711 0.190  0.00523 0.0114 1.47e-3 41936.  1451913
##  5 Cali… 38421464   0.373  0.402 0.0568 0.00405 0.131  3.51e-3 67265. 19087135
##  6 Colo…  5278906   0.208  0.699 0.0355 0.00574 0.0266 1.20e-3 64658.  2648667
##  7 Conn…  3593222   0.156  0.677 0.103  0.00126 0.0402 1.86e-4 76147.  1751607
##  8 Dela…   926454   0.0882 0.646 0.207  0.00260 0.0327 1.92e-4 61828.   448413
##  9 Dist…   647484   0.0917 0.331 0.518  0.00201 0.0338 2.96e-4 75466.   306674
## 10 Flor… 19645772   0.213  0.591 0.152  0.00210 0.0228 5.15e-4 50690.  9600009
## # … with 42 more rows, and 1 more variable: female_pop <dbl>

Author github: Emon-ProCoder7 2. Author Linked-in:Md Tabassum Hossain Emon ↩

Cleaning US. Census Data

Md. Tabassum Hossain Emon ¹

5/23/2020

Merging csv’s in One Dataframe

Removing and Reformating the Columns

Preparaing Data Points for Calculation

Conversion

Removing Duplicate Rows

Tidy DataFrame

Cleaning US. Census Data

Md. Tabassum Hossain Emon 1

5/23/2020

Merging csv’s in One Dataframe

Removing and Reformating the Columns

Preparaing Data Points for Calculation

Conversion

Removing Duplicate Rows

Tidy DataFrame

Md. Tabassum Hossain Emon ¹