library(rvest)
library(readxl)
library(dplyr)
library(tidyr)
library(forecast)
This data preprocessing exercise will join three datasets that measure median income, unemployment and interstate migration by state of the USA. The choice of datasets is based on the premise that there may be a link between these measures.
The three datasets will be imported into R, from two Microsoft Excel files and one URL, before being joined using state as key. An inspection of the dataset once merged is provided, and data type conversions are performed as appropriate. The next step is to consider the dataset with respect to tidy principles. To do this, we check if the data is tidy, before creating one new variable from two existing variables.
In addition, we scan the dataset for any missing values or inconsistencies within any variables. We then scan each numeric variable for outliers using boxplots. Last, we perform natural logarithmic transformations on all numeric variables, improving normality of the variable by reducing right-tail skewness and removing outliers.
In this section, we import median income data, then interstate migration data. We will then merge these two datasets together. Last, we import unemployment data and merge this with the previous dataset.
Median income data was obtained from United States Census Bureau (2018a). We import the data using the read_excel()
function from the readxl
package. The original file includes median income by state from 1984 to 2017. Each observation is provided in current US dollars and 2017 CPI-U-RS adjusted dollars. Each observation also has an associated standard error. The only observations we have imported were 2017 figures based on current US dollars (in 2017) which align with the 2017 CPI-U-RS adjusted observations for this year).
income <- read_excel("INCOME.xls", range = "h08!A8:B58", col_names = c("State", "MedIncome"))
head(income)
## # A tibble: 6 x 2
## State MedIncome
## <chr> <dbl>
## 1 Alabama 51113
## 2 Alaska 72231
## 3 Arizona 61125
## 4 Arkansas 48829
## 5 California 69759
## 6 Colorado 74172
dim(income)
## [1] 51 2
The imported dataset, income
, consists of 51 observations of 2 columns; variables are State (character variable identifying the state of USA) and MedIncome (numeric variable measuring median income).
Interstate migration data was obtained from United States Census Bureau (2018b). The data was imported using the read_excel()
function. As this is the most complex dataset used, columns for different variables were selected and imported separately (migration1
and migration2
). These columns will be combined to create the final dataset for interstate migration (migration
).
migration1 <- read_excel("State_to_State_Migrations_Table_2017.xls",
range = "Table!A12:H78",
col_names = c("State", "Pop1yo", "MOE1",
"Sameres1y", "MOE2", "Samest1y",
"MOE3", "Diffst1y"))
dim(migration1)
## [1] 67 8
colSums(is.na(migration1))
## State Pop1yo MOE1 Sameres1y MOE2 Samest1y MOE3
## 14 13 14 13 14 13 14
## Diffst1y
## 12
migration2 <- read_excel("State_to_State_Migrations_Table_2017.xls",
range = "Table!DR12:DS78",
col_names = c("State", "Abroad1y"))
dim(migration2)
## [1] 67 2
colSums(is.na(migration2))
## State Abroad1y
## 14 12
migration1[33:37,] # displays rows with NA values in migration1
## # A tibble: 5 x 8
## State Pop1yo MOE1 Sameres1y MOE2 Samest1y MOE3 Diffst1y
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## 2 Current~ Populati~ <NA> Same house~ <NA> Same state~ <NA> Different s~
## 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> Total
## 4 <NA> Estimate MOE Estimate MOE Estimate MOE Estimate
## 5 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
migration2[33:37,] # displays rows with NA values in migration2
## # A tibble: 5 x 2
## State Abroad1y
## <chr> <chr>
## 1 <NA> <NA>
## 2 Current residence in -- Abroad 1 year ago--
## 3 <NA> Total
## 4 <NA> Estimate
## 5 <NA> <NA>
Datasets migration1
and migration2
each consist of 67 rows, from which around 14 are filled with NA values. These NA values result from the Excel spreadsheet design, with spaces and subheadings arranged throughout the data. An example of these NAs and subheadings have been provided above.
Variables contained in the migration1
dataset include:
Variables contained in the migration2
dataset include:
Next, sets migration1
and migration2
are joined using the left_join()
function. Within the join function, we use na.omit()
to remove the rows containing NA values. We assume there to be no inconsistencies between the State
columns of either dataset as they are obtained from the same section of the spreadsheet.
migration <- left_join(na.omit(migration1), na.omit(migration2), by = "State")
migration <- select(migration, State, Pop1yo, Sameres1y, Samest1y,
Diffst1y, Abroad1y)
head(migration)
## # A tibble: 6 x 6
## State Pop1yo Sameres1y Samest1y Diffst1y Abroad1y
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Alabama 4819343 4158224 535649 111803 13667
## 2 Alaska 730874 595804 98168 30199 6703
## 3 Arizona 6935358 5717189 911120 261727 45322
## 4 Arkansas 2968395 2509717 370309 79411 8958
## 5 California 39085984 34009331 4234770 523131 318752
## 6 Colorado 5542282 4546729 734182 226933 34438
dim(migration)
## [1] 52 6
The left-join()
function prioritizes migration1
dataset, matching observations from migration2
by the key State
. Last, we drop all columns storing a margin of error by using select()
to select all other variables. Resulting dataset migration
has 52 observations of 6 columns.
The next step is to merge income
and migration
. We find inconsistencies in the state names between income
and migration
using setdiff()
; the results are displayed below. We find that Washington, DC is named “D.C.” in income
and “District of Columbia” in migration
. We change the name of this state in income
to match that of migration
. Furthermore, migration
includes an extra row for Puerto Rico, which does not appear as a state in income
. We will drop this row when merging the two dataset.
# Check for inconsistencies
setdiff(income$State, migration$State)
## [1] "D.C."
setdiff(migration$State, income$State)
## [1] "District of Columbia" "Puerto Rico"
# Rename Washington, DC
income[income$State == "D.C.", "State"] <- c("District of Columbia")
We use the left_join()
function prioritising income
to merge income
and migration
into new dataset set1
.
#Join datasets
set1 <- left_join(income, migration, by = "State")
#Inspect
head(set1)
## # A tibble: 6 x 7
## State MedIncome Pop1yo Sameres1y Samest1y Diffst1y Abroad1y
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 Alabama 51113 4819343 4158224 535649 111803 13667
## 2 Alaska 72231 730874 595804 98168 30199 6703
## 3 Arizona 61125 6935358 5717189 911120 261727 45322
## 4 Arkansas 48829 2968395 2509717 370309 79411 8958
## 5 California 69759 39085984 34009331 4234770 523131 318752
## 6 Colorado 74172 5542282 4546729 734182 226933 34438
dim(set1)
## [1] 51 7
The resulting dataset set1
is a 51 by 7 dataframe.
The unemployment dataset has been scraped from Bureau of Labor Statistics (2019). This is done using function read_html
.
#Save URL
unemploy <- read_html("https://www.bls.gov/lau/lastrk17.htm?fbclid=IwAR3SEQ5hc6eJFB9yNBQ-0ZTHoU-lEpiHfRaewdabRX1sa3T2DnjzPLPXpSc")
#Identify and import table
length(html_nodes(unemploy, "table"))
## [1] 2
html_nodes(unemploy, "table")
## {xml_nodeset (2)}
## [1] <table id="main-content-table"><tr>\n<td id="secondary-nav-td">\r\n\ ...
## [2] <table border="1" cellpadding="0" cellspacing="0" class="regular" id ...
unemp <- html_table(html_nodes(unemploy, "table")[[2]])
colnames(unemp) <- c("State", "Unemp", "Rank")
#Inspect imported table
head(unemp)
## State Unemp Rank
## 1 United States 4.4
## 2
## 3 Hawaii 2.4 1
## 4 Colorado 2.7 2
## 5 New Hampshire 2.7 2
## 6 North Dakota 2.7 2
dim(unemp)
## [1] 54 3
The imported dataset - unemp
- is 54 rows by 3 columns; variables have been renamed as the following:
We next join unemp
to set1
. Beforehand, we check for missing values in unemp
and also for any inconsistencies between unemp$State
and set1$State
.
# Check NAs
colSums(is.na(unemp))
## State Unemp Rank
## 0 0 0
# Check inconsistencies
setdiff(unemp$State, set1$State)
## [1] "United States"
## [2] ""
## [3] "Note: Rates shown are a percentage of the labor force. Data refer to place of residence."
setdiff(set1$State, unemp$State)
## character(0)
No missing values are found in unemp
but three character string inconsistencies were found in unemp
that are not in set1
. We will remove these in the merge, specifying a left_join()
function that prioritises set1
and adds values of unemp
using State
as the key.
# Join
finalset <- left_join(set1, unemp, by = "State")
# Inspect final dataset
head(finalset)
## # A tibble: 6 x 9
## State MedIncome Pop1yo Sameres1y Samest1y Diffst1y Abroad1y Unemp Rank
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 Alabama 51113 48193~ 4158224 535649 111803 13667 4.4 29
## 2 Alaska 72231 730874 595804 98168 30199 6703 7.0 51
## 3 Arizona 61125 69353~ 5717189 911120 261727 45322 4.9 40
## 4 Arkans~ 48829 29683~ 2509717 370309 79411 8958 3.7 15
## 5 Califo~ 69759 39085~ 34009331 4234770 523131 318752 4.8 39
## 6 Colora~ 74172 55422~ 4546729 734182 226933 34438 2.7 2
dim(finalset)
## [1] 51 9
The resulting dataset finalset
consists of 51 observations of 9 variables.
To gain an insight into the dataset finalset
, we inspect its structure below.
str(finalset)
## Classes 'tbl_df', 'tbl' and 'data.frame': 51 obs. of 9 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ MedIncome: num 51113 72231 61125 48829 69759 ...
## $ Pop1yo : chr "4819343" "730874" "6935358" "2968395" ...
## $ Sameres1y: chr "4158224" "595804" "5717189" "2509717" ...
## $ Samest1y : chr "535649" "98168" "911120" "370309" ...
## $ Diffst1y : chr "111803" "30199" "261727" "79411" ...
## $ Abroad1y : chr "13667" "6703" "45322" "8958" ...
## $ Unemp : chr "4.4" "7.0" "4.9" "3.7" ...
## $ Rank : chr "29" "51" "40" "15" ...
We make the following observations:
Rank
must be encoded as an ordinal variablePop1y
, Sameres1y
, Samest1y
, Diffst1y
, Abrod1y
, and unemp
)First, we convert Rank
to a factor varible using function mutate()
from the dplyr
package; we specify ordering.
finalset <- mutate(finalset, Rank = as.factor(Rank))
finalset$Rank <- factor(finalset$Rank,
levels = as.character(seq(1,51,1)),
order = TRUE)
levels(finalset$Rank)
## [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51"
We specify the levels of Rank
as integers 1 to 51, in character string format. Ascending order is specified as the order for Rank
. We note that not all of these levels are taken in the dataset as some states tie in ranking. We still specify all integers between 1 and 51 however, as these are the values that can be taken.
Next, we make the conversion from character to numeric format for variables Pop1y
, Sameres1y
, Samest1y
, Diffst1y
, Abrod1y
, and unemp
.
finalset$Pop1yo <- as.numeric(finalset$Pop1yo)
finalset$Sameres1y <- as.numeric(finalset$Sameres1y)
finalset$Samest1y <- as.numeric(finalset$Samest1y)
finalset$Diffst1y <- as.numeric(finalset$Diffst1y)
finalset$Abroad1y <- as.numeric(finalset$Abroad1y)
finalset$Unemp <- as.numeric(finalset$Unemp)
We inspect the structure of finalset
after making the changes in this section.
str(finalset)
## Classes 'tbl_df', 'tbl' and 'data.frame': 51 obs. of 9 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ MedIncome: num 51113 72231 61125 48829 69759 ...
## $ Pop1yo : num 4819343 730874 6935358 2968395 39085984 ...
## $ Sameres1y: num 4158224 595804 5717189 2509717 34009331 ...
## $ Samest1y : num 535649 98168 911120 370309 4234770 ...
## $ Diffst1y : num 111803 30199 261727 79411 523131 ...
## $ Abroad1y : num 13667 6703 45322 8958 318752 ...
## $ Unemp : num 4.4 7 4.9 3.7 4.8 2.7 4.7 4.5 6.1 4.2 ...
## $ Rank : Ord.factor w/ 51 levels "1"<"2"<"3"<"4"<..: 29 51 40 15 39 2 35 31 50 23 ...
We can see several data types, including numeric (MedIncome
, Pop1yo
, Sameres1y
, Samest1y
, Diffst1y
, Abroad1y
, Unemp
), character (State
) and ordinal variable Rank
.
To determine whether finalset
is tidy or not we examine the first several observations below.
head(finalset)
## # A tibble: 6 x 9
## State MedIncome Pop1yo Sameres1y Samest1y Diffst1y Abroad1y Unemp Rank
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <ord>
## 1 Alabama 51113 4.82e6 4158224 535649 111803 13667 4.4 29
## 2 Alaska 72231 7.31e5 595804 98168 30199 6703 7 51
## 3 Arizona 61125 6.94e6 5717189 911120 261727 45322 4.9 40
## 4 Arkans~ 48829 2.97e6 2509717 370309 79411 8958 3.7 15
## 5 Califo~ 69759 3.91e7 34009331 4234770 523131 318752 4.8 39
## 6 Colora~ 74172 5.54e6 4546729 734182 226933 34438 2.7 2
We evaluate this dataset in terms of tidy principles and make some observations:
Pop1yo
and Sameres1y
), we intepret these to be different variables that are merely measured using the same unit (number of people).Hence, we conclude that finalset
is tidy and no adjustments are made in this section.
We create new variable RateAbroad1y
, that measures the percentage of the population that were living abroad 1 year ago. RateAbroad1y
is calculated by dividing Abroad1y
by Pop1yo
, multiplying by 100, and rounding the result to one decimal place.
finalset <- mutate(finalset, RateAbroad1y = round((Abroad1y / Pop1yo)*100, 1))
head(finalset$RateAbroad1y)
## [1] 0.3 0.9 0.7 0.3 0.8 0.6
By using mutate() function, Abroad1y
and Pop1yo
remain in finalset
after RateAbroad1y
has been created. The first 6 observations of resulting finalset
has been provided below.
head(finalset)
## # A tibble: 6 x 10
## State MedIncome Pop1yo Sameres1y Samest1y Diffst1y Abroad1y Unemp Rank
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <ord>
## 1 Alab~ 51113 4.82e6 4158224 535649 111803 13667 4.4 29
## 2 Alas~ 72231 7.31e5 595804 98168 30199 6703 7 51
## 3 Ariz~ 61125 6.94e6 5717189 911120 261727 45322 4.9 40
## 4 Arka~ 48829 2.97e6 2509717 370309 79411 8958 3.7 15
## 5 Cali~ 69759 3.91e7 34009331 4234770 523131 318752 4.8 39
## 6 Colo~ 74172 5.54e6 4546729 734182 226933 34438 2.7 2
## # ... with 1 more variable: RateAbroad1y <dbl>
We scan finaset
for any missing values or errors. First, we check for NA values. Then, we scan for infinite or NAN values. Next, we check to see that variables MedIncome
to Unemp
in the data frame are note negative (a negative value in one of these columns would be an obvious error). We last check the unique state names in finalset
to make sure there are no obvious errors here.
# Check number of missing values per column
colSums(is.na(finalset))
## State MedIncome Pop1yo Sameres1y Samest1y
## 0 0 0 0 0
## Diffst1y Abroad1y Unemp Rank RateAbroad1y
## 0 0 0 0 0
# Check special values (infinite or NAN values)
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))}
sapply(finalset, function(x) sum( is.na(x) ))
## State MedIncome Pop1yo Sameres1y Samest1y
## 0 0 0 0 0
## Diffst1y Abroad1y Unemp Rank RateAbroad1y
## 0 0 0 0 0
# Check for negative values
finalset_negative <- (finalset[,2:8] < 0)
sum(finalset_negative)
## [1] 0
# Check state names are valid
unique(finalset$State)
## [1] "Alabama" "Alaska" "Arizona"
## [4] "Arkansas" "California" "Colorado"
## [7] "Connecticut" "Delaware" "District of Columbia"
## [10] "Florida" "Georgia" "Hawaii"
## [13] "Idaho" "Illinois" "Indiana"
## [16] "Iowa" "Kansas" "Kentucky"
## [19] "Louisiana" "Maine" "Maryland"
## [22] "Massachusetts" "Michigan" "Minnesota"
## [25] "Mississippi" "Missouri" "Montana"
## [28] "Nebraska" "Nevada" "New Hampshire"
## [31] "New Jersey" "New Mexico" "New York"
## [34] "North Carolina" "North Dakota" "Ohio"
## [37] "Oklahoma" "Oregon" "Pennsylvania"
## [40] "Rhode Island" "South Carolina" "South Dakota"
## [43] "Tennessee" "Texas" "Utah"
## [46] "Vermont" "Virginia" "Washington"
## [49] "West Virginia" "Wisconsin" "Wyoming"
We have the following result:
MedIncome
to Rank
.Rank
were already checked previously hence we do not do so here.Hence, we do not find any missing values, inconsistencies or obvious errors.
All numeric variables are scanned for univariate outliers boxplots.
par(mfrow=c(4,2))
boxplot(finalset$Unemp, horizontal = TRUE, xlab = "Unemp")
boxplot(finalset$MedIncome, horizontal = TRUE, xlab = "MedIncome")
boxplot(finalset$Pop1yo, horizontal = TRUE, xlab = "Pop1yo")
boxplot(finalset$Sameres1y, horizontal = TRUE, xlab = "Sameres1y")
boxplot(finalset$Samest1y, horizontal = TRUE, xlab = "Samest1y")
boxplot(finalset$Diffst1y, horizontal = TRUE, xlab = "Diffst1y")
boxplot(finalset$Abroad1y, horizontal = TRUE, xlab = "Abroad1y")
boxplot(finalset$RateAbroad1y, horizontal = TRUE, xlab = "RateAbroad1y")
Outliers are identified for all variables except MedIncome
. These outliers are identified below using the filter()
and select()
functions.
finalset %>% filter(Unemp > 6.5) %>% select(State, Unemp)
## # A tibble: 1 x 2
## State Unemp
## <chr> <dbl>
## 1 Alaska 7
finalset %>% filter(Pop1yo > 15000000) %>% select(State, Pop1yo)
## # A tibble: 4 x 2
## State Pop1yo
## <chr> <dbl>
## 1 California 39085984
## 2 Florida 20772371
## 3 New York 19638923
## 4 Texas 27924632
finalset %>% filter(Sameres1y > 15000000) %>% select(State, Sameres1y)
## # A tibble: 4 x 2
## State Sameres1y
## <chr> <dbl>
## 1 California 34009331
## 2 Florida 17554348
## 3 New York 17553033
## 4 Texas 23534387
finalset %>% filter(Samest1y > 2000000) %>% select(State, Samest1y)
## # A tibble: 3 x 2
## State Samest1y
## <chr> <dbl>
## 1 California 4234770
## 2 Florida 2402381
## 3 Texas 3643333
finalset %>% filter(Diffst1y > 400000) %>% select(State, Diffst1y)
## # A tibble: 3 x 2
## State Diffst1y
## <chr> <dbl>
## 1 California 523131
## 2 Florida 566476
## 3 Texas 524511
finalset %>% filter(Abroad1y > 100000) %>% select(State, Abroad1y)
## # A tibble: 4 x 2
## State Abroad1y
## <chr> <dbl>
## 1 California 318752
## 2 Florida 249166
## 3 New York 145486
## 4 Texas 222401
finalset %>% filter(RateAbroad1y > 1.4) %>% select(State, RateAbroad1y)
## # A tibble: 1 x 2
## State RateAbroad1y
## <chr> <dbl>
## 1 District of Columbia 1.5
The states that tend to appear as outliers are states such as California, Florida, Texas and New York. These are states with the largest populations (measured by Pop1yo
). As these states have more residents, it makes sense that they would have more residents counted in other measures that count people such as Sameres1y
. Other outliers include Alaska with its high unemployment rate and District of Columbia having the highest values of RateAbroad1y
. In each of these cases, excluding these states may exclude important systematic information related to the variables in finalset
. Furthermore, finalset
includes a full set of states of the USA that may be valuable for analysis. Hence, we do not want to remove any states from finalset
; instead, we will conduct transformations in the next section that will reduce the impact of these outliers.
To want to transform at least one variable. We need to choose which transformation to use, and also which variable(s) to transform. We consider the positive skew typically seen in the boxplots in the previous section. We also consider the fact that outliers are seen in all numeric variables except for MedIncome
. We decide to take the natural logarithm of each numeric variable using the log()
function, to try and improve normality and treat outliers.
lnUnemp <- log(finalset$Unemp)
lnMedIncome <- log(finalset$MedIncome)
lnPop1yo <- log(finalset$Pop1yo)
lnSameres1y <- log(finalset$Sameres1y)
lnSamest1y <- log(finalset$Samest1y)
lnDiffst1y <- log(finalset$Diffst1y)
lnAbroad1y <- log(finalset$Abroad1y)
lnRateAbroad1y <- log(finalset$RateAbroad1y)
Natural log transformation of Unemp
par(mfrow=c(1,2))
boxplot(lnUnemp)
hist(lnUnemp)
Natural log transformation of MedIncome
par(mfrow=c(1,2))
boxplot(lnMedIncome)
hist(lnMedIncome)
Natural log transformation of Pop1yo
par(mfrow=c(1,2))
boxplot(lnPop1yo)
hist(lnPop1yo)
Natural log transformation of Sameres1y
par(mfrow=c(1,2))
boxplot(lnSameres1y)
hist(lnSameres1y)
Natural log transformation of Samest1y
par(mfrow=c(1,2))
boxplot(lnSamest1y)
hist(lnSamest1y)
Natural log transformation of Diffst1y
par(mfrow=c(1,2))
boxplot(lnDiffst1y)
hist(lnDiffst1y)
Natural log transformation of Abroad1y
par(mfrow=c(1,2))
boxplot(lnAbroad1y)
hist(lnAbroad1y)
Natural log transformation of RateAbroad1y
par(mfrow=c(1,2))
boxplot(lnRateAbroad1y)
hist(lnRateAbroad1y)
In all cases except for that of MedIncome
, the natural log transformation has removed all outliers. In the case of MedIncome
, taking the natural log results in a more symmetric variable.