You just got hired as a Data Analyst at the Census Bureau, which
collects census data and finds interesting insights from it.
The person who previously had your job left you all the data they had
for the most recent census. The data is spread across multiple csv
files. They didn’t use R, and they would manually look through these csv
files whenever they wanted to find something. Sometimes they would copy
and paste certain numbers into Excel for analysis.
The thought of it makes you shiver. This is not scalable or
repeatable.
Your boss wants you to dig into the data and find some insights by
the end of the day. Can you get this data into R and into reasonable
shape so that you can perform your analysis?
If you get stuck or want to see an experienced programmer tackle this
project, click “get help” to see a walkthrough video.
Load and Inspect the Data!
1.Open some of the census csv files in the navigator. How are they
named? What kind of information do they hold? What tools will you need
to import and clean this data?
2.In the first code block of notebook.Rmd, import the readr, dplyr,
and tidyr libraries to aid you in your data cleaning.
# load libraries
library(dplyr)
library(tidyr)
library(stringr)
3.It will be easier to inspect the data stored in these files once
you have it in a data frame. You can’t even call head() on these csvs!
How are you supposed to read them?
Begin by creating a variable called files and set it equal to the
list.files() of all of the csv files you want to import.
# load CSVs
files <- list.files(pattern = "states_.*csv")
files
[1] "states_0.csv" "states_1.csv" "states_2.csv" "states_3.csv" "states_4.csv" "states_5.csv" "states_6.csv"
[8] "states_7.csv" "states_8.csv" "states_9.csv"
4.Read each file in files into a data frame using lapply() and save
the result to df_list.
# inspect data
df_list <- lapply(files, read.csv)
df_list
[[1]]
[[2]]
[[3]]
[[4]]
[[5]]
[[6]]
[[7]]
[[8]]
[[9]]
[[10]]
NA
5.Concatenate all of the data frames in df_list into one data frame
called us_census.
# inspect data
us_census <- bind_rows(df_list)
us_census
6.Inspect the us_census data frame by printing the column names,
looking at the data types with str(), and viewing the head().
What columns have symbols that will prevent calculations? What are
the data types of the columns? Do any columns contain multiple kinds of
information?
# inspect data
col <- colnames(us_census)
col
[1] "X" "State" "TotalPop" "Hispanic" "White" "Black" "Native" "Asian" "Pacific"
[10] "Income" "GenderPop"
str(us_census)
'data.frame': 61 obs. of 11 variables:
$ X : int 0 1 2 3 4 5 0 1 2 3 ...
$ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ TotalPop : int 4830620 733375 6641928 2958208 38421464 5278906 5278906 3593222 926454 647484 ...
$ Hispanic : chr "3.7516156462584975%" "5.909580838323351%" "29.565921052631502%" "6.215474452554738%" ...
$ White : chr "61.878656462585%" "60.910179640718574%" "57.120000000000026%" "71.13781021897813%" ...
$ Black : chr "31.25297619047618%" "2.8485029940119775%" "3.8509868421052658%" "18.968759124087573%" ...
$ Native : chr "0.4532312925170065%" "16.39101796407186%" "4.35506578947368%" "0.5229197080291965%" ...
$ Asian : chr "1.0502551020408146%" "5.450299401197604%" "2.876578947368419%" "1.1423357664233578%" ...
$ Pacific : chr "0.03435374149659865%" "1.0586826347305378%" "0.16763157894736833%" "0.14686131386861315%" ...
$ Income : chr "$43296.35860306644" "$70354.74390243902" "$54207.82095490716" "$41935.63396778917" ...
$ GenderPop: chr "2341093M_2489527F" "384160M_349215F" "3299088M_3342840F" "1451913M_1506295F" ...
head(us_census)
Remove and Reformat the Columns
7.When inspecting us_census you notice a column X1 that stores
meaningless information. Drop the X1 column from us_census, and save the
resulting data frame to us_census. View the head of us_census.
# drop X1 column
us_census <- us_census %>%
select(-X)
us_census
8.You notice that there are 6 columns representing the population
percentage for different races. The columns include the percent symbol
%. Remove the percent symbol % from each of the race columns
(Hispanic,White,Black,Native,Asian,Pacific). Save the resulting data
frame to us_census, and view the head.
# remove % from race columns
us_census <- us_census %>%
mutate(Hispanic = gsub('%', '',Hispanic), White = gsub('%', '',Hispanic), Black = gsub('%', '',Hispanic), Native = gsub('%', '',Hispanic), Asian = gsub('%', '',Hispanic), Pacific = gsub('%', '',Hispanic))
head(us_census)
9.The Income column also incudes a $ symbol along with the number
representing median income for a state. Remove the $ from the Income
column. Save the resulting data frame to us_census. View the head of
us_census.
# remove $ from Income column
#gsub('$', '', Income) 這其實是「行尾」,不是「$ 符號」
#gsub('$', '', Income) 中的 '$' 是正則表達式中的特殊字元,它代表「行尾」,不是字面上的 $ 字符。
us_census <- us_census %>%
mutate(Income = gsub('\\$', '', Income))
head(us_census)
10.The GenderPop column appears to hold the male and female
population counts. Separate this column at the _ character to create two
new columns: male_pop and female_pop. Save the resulting data frame to
us_census, and view the head.
# separate GenderPop column
us_census <- us_census %>%
separate('GenderPop', c('male_pop', 'female_pop'), "_")
head(us_census)
11.You notice the new male_pop and female_pop columns contain extra
characters M and F, respectively. Remove these extra characters from the
columns. Save the resulting data frame to us_census, and view the
head.
# clean male and female population columns
us_census <- us_census %>%
mutate(male_pop = gsub('M', '', male_pop), female_pop = gsub('F', '', female_pop))
head(us_census)
Update the Data Types
12.Now that you have removed extra symbols from many of the columns
that contain numerical data, you notice that the data type for these
columns is still chr, or character. Convert all of these columns
(Hispanic,White,Black,Native,Asian,Pacific,Income,male_pop,female_pop)
to have a data type of numeric. Save the resulting data frame to
us_census, and view the head.
# update column data types
us_census <- us_census %>%
mutate(Hispanic = as.numeric(Hispanic), White = as.numeric(White), Black = as.numeric(Black), Native = as.numeric(Native), Asian = as.numeric(Asian), Pacific = as.numeric(Pacific), Income = as.numeric(Income), male_pop = as.numeric(male_pop), female_pop = as.numeric(female_pop))
head(us_census)
13.Take a second to look back at the Hispanic, White, Black, Native,
Asian, and Pacific columns. The columns represent the population
percentage for each race. Earlier you removed the % symbol, and then you
just converted the column to numeric type. To make calculations easier,
the column should now represent percentages in decimal form, where 50%
is equivalent to 0.50. Update the values of these columns to be in
decimal form, and save the resulting data frame to us_census. View the
head of us_census.
# update values of race columns
us_census <- us_census %>%
mutate(Hispanic = Hispanic / 100, White = White / 100, Black = Black / 100, Native = Native / 100, Asian = Asian / 100, Pacific = Pacific / 100)
head(us_census)
Remove Duplicate Rows
14.It’s always a good idea to check if there are duplicate rows of
data in a data set. Pipe us_census into the duplicated() function to see
which rows are duplicated. Then pipe the result into table() to get a
count of the duplicated rows.
# check for duplicate rows
duplicate <- us_census %>%
duplicated() %>%
table()
duplicate
.
FALSE TRUE
52 9
15.Since there are duplicates, update the value of us_census to be
the us_census data frame with only unique/distinct rows.
# remove duplicate rows
us_census <- us_census %>%
distinct()
us_census
16.Confirm that there are no more duplicated rows in us_census. Pipe
us_census into the duplicated() function to see which rows are
duplicated. Then pipe the result into table() to get a count of the
duplicated rows.
You should expect to see no TRUEs!
# check for duplicate rows
duplicate <- us_census %>%
duplicated() %>%
table()
duplicate
.
FALSE
52
17.View the head() of us_census. The data frame is all clean and
ready for analysis! What do you want to find out?
# clean data frame
head(us_census)
---
title: "Cleaning US Census Data"
author: "Annabel Kuo"
date: "`r format(Sys.time(), '%Y-%m-%d %H:%M')`"
output: html_notebook
---
You just got hired as a Data Analyst at the Census Bureau, which collects census data and finds interesting insights from it.

The person who previously had your job left you all the data they had for the most recent census. The data is spread across multiple csv files. They didn’t use R, and they would manually look through these csv files whenever they wanted to find something. Sometimes they would copy and paste certain numbers into Excel for analysis.

The thought of it makes you shiver. This is not scalable or repeatable.

Your boss wants you to dig into the data and find some insights by the end of the day. Can you get this data into R and into reasonable shape so that you can perform your analysis?

If you get stuck or want to see an experienced programmer tackle this project, click “get help” to see a walkthrough video.

# Load and Inspect the Data!

1.Open some of the census csv files in the navigator. How are they named? What kind of information do they hold? What tools will you need to import and clean this data?

2.In the first code block of notebook.Rmd, import the readr, dplyr, and tidyr libraries to aid you in your data cleaning.

```{r message=FALSE, warning=FALSE, error=TRUE}
# load libraries
library(dplyr)
library(tidyr)
library(stringr)
```

3.It will be easier to inspect the data stored in these files once you have it in a data frame. You can’t even call head() on these csvs! How are you supposed to read them?

Begin by creating a variable called files and set it equal to the list.files() of all of the csv files you want to import.

```{r message=FALSE, warning=FALSE, error=TRUE}
# load CSVs
files <- list.files(pattern = "states_.*csv")
files
```

4.Read each file in files into a data frame using lapply() and save the result to df_list.

```{r error=TRUE}
# inspect data
df_list <- lapply(files, read.csv)
df_list
```

5.Concatenate all of the data frames in df_list into one data frame called us_census.

```{r error=TRUE}
# inspect data
us_census <- bind_rows(df_list)
us_census
```

6.Inspect the us_census data frame by printing the column names, looking at the data types with str(), and viewing the head().

What columns have symbols that will prevent calculations? What are the data types of the columns? Do any columns contain multiple kinds of information?

```{r error=TRUE}
# inspect data
col <- colnames(us_census)
col
str(us_census)
head(us_census)
```

# Remove and Reformat the Columns

7.When inspecting us_census you notice a column X1 that stores meaningless information. Drop the X1 column from us_census, and save the resulting data frame to us_census. View the head of us_census.

```{r error=TRUE}
# drop X1 column
us_census <- us_census %>%
  select(-X)
us_census
```

8.You notice that there are 6 columns representing the population percentage for different races. The columns include the percent symbol %. Remove the percent symbol % from each of the race columns (Hispanic,White,Black,Native,Asian,Pacific). Save the resulting data frame to us_census, and view the head.

```{r error=TRUE}
# remove % from race columns
us_census <- us_census %>%
  mutate(Hispanic = gsub('%', '',Hispanic), White = gsub('%', '',Hispanic), Black = gsub('%', '',Hispanic), Native = gsub('%', '',Hispanic), Asian = gsub('%', '',Hispanic), Pacific = gsub('%', '',Hispanic))

head(us_census)
```

9.The Income column also incudes a $ symbol along with the number representing median income for a state. Remove the $ from the Income column. Save the resulting data frame to us_census. View the head of us_census.


```{r error=TRUE}
# remove $ from Income column
#gsub('$', '', Income)   這其實是「行尾」，不是「$ 符號」
#gsub('$', '', Income) 中的 '$' 是正則表達式中的特殊字元，它代表「行尾」，不是字面上的 $ 字符。
us_census <- us_census %>%
  mutate(Income = gsub('\\$', '', Income))

head(us_census)
```

10.The GenderPop column appears to hold the male and female population counts. Separate this column at the _ character to create two new columns: male_pop and female_pop. Save the resulting data frame to us_census, and view the head.

```{r error=TRUE}
# separate GenderPop column
us_census <- us_census %>%
  separate('GenderPop', c('male_pop', 'female_pop'), "_")

head(us_census)
```

11.You notice the new male_pop and female_pop columns contain extra characters M and F, respectively. Remove these extra characters from the columns. Save the resulting data frame to us_census, and view the head.

```{r error=TRUE}
# clean male and female population columns
us_census <- us_census %>%
  mutate(male_pop = gsub('M', '', male_pop), female_pop = gsub('F', '', female_pop))

head(us_census)
```

# Update the Data Types

12.Now that you have removed extra symbols from many of the columns that contain numerical data, you notice that the data type for these columns is still chr, or character. Convert all of these columns (Hispanic,White,Black,Native,Asian,Pacific,Income,male_pop,female_pop) to have a data type of numeric. Save the resulting data frame to us_census, and view the head.

```{r error=TRUE}
# update column data types
us_census <- us_census %>%
  mutate(Hispanic = as.numeric(Hispanic), White = as.numeric(White), Black = as.numeric(Black), Native = as.numeric(Native), Asian = as.numeric(Asian), Pacific = as.numeric(Pacific), Income = as.numeric(Income), male_pop = as.numeric(male_pop), female_pop = as.numeric(female_pop))
head(us_census)
```


13.Take a second to look back at the Hispanic, White, Black, Native, Asian, and Pacific columns. The columns represent the population percentage for each race. Earlier you removed the % symbol, and then you just converted the column to numeric type. To make calculations easier, the column should now represent percentages in decimal form, where 50% is equivalent to 0.50. Update the values of these columns to be in decimal form, and save the resulting data frame to us_census. View the head of us_census.

```{r error=TRUE}
# update values of race columns
us_census <- us_census %>%
  mutate(Hispanic = Hispanic / 100, White = White / 100, Black = Black / 100, Native = Native / 100, Asian = Asian / 100, Pacific = Pacific / 100)

head(us_census)
```

# Remove Duplicate Rows

14.It’s always a good idea to check if there are duplicate rows of data in a data set. Pipe us_census into the duplicated() function to see which rows are duplicated. Then pipe the result into table() to get a count of the duplicated rows.

```{r error=TRUE}
# check for duplicate rows
duplicate <- us_census %>%
  duplicated() %>%
  table()
  
duplicate
```

15.Since there are duplicates, update the value of us_census to be the us_census data frame with only unique/distinct rows.

```{r error=TRUE}
# remove duplicate rows
us_census <- us_census %>%
  distinct()
us_census
```

16.Confirm that there are no more duplicated rows in us_census. Pipe us_census into the duplicated() function to see which rows are duplicated. Then pipe the result into table() to get a count of the duplicated rows.

You should expect to see no TRUEs!


```{r error=TRUE}
# check for duplicate rows
duplicate <- us_census %>%
  duplicated() %>%
  table()

duplicate


```

17.View the head() of us_census. The data frame is all clean and ready for analysis! What do you want to find out?

```{r error=TRUE}
# clean data frame
head(us_census)
```

