Required packages

library(rvest)
library(readxl)
library(dplyr) 
library(tidyr)
library(forecast)

Executive Summary

This data preprocessing exercise will join three datasets that measure median income, unemployment and interstate migration by state of the USA. The choice of datasets is based on the premise that there may be a link between these measures.

The three datasets will be imported into R, from two Microsoft Excel files and one URL, before being joined using state as key. An inspection of the dataset once merged is provided, and data type conversions are performed as appropriate. The next step is to consider the dataset with respect to tidy principles. To do this, we check if the data is tidy, before creating one new variable from two existing variables.

In addition, we scan the dataset for any missing values or inconsistencies within any variables. We then scan each numeric variable for outliers using boxplots. Last, we perform natural logarithmic transformations on all numeric variables, improving normality of the variable by reducing right-tail skewness and removing outliers.

Data

In this section, we import median income data, then interstate migration data. We will then merge these two datasets together. Last, we import unemployment data and merge this with the previous dataset.

Median income data

Median income data was obtained from United States Census Bureau (2018a). We import the data using the read_excel() function from the readxl package. The original file includes median income by state from 1984 to 2017. Each observation is provided in current US dollars and 2017 CPI-U-RS adjusted dollars. Each observation also has an associated standard error. The only observations we have imported were 2017 figures based on current US dollars (in 2017) which align with the 2017 CPI-U-RS adjusted observations for this year).

income <- read_excel("INCOME.xls", range = "h08!A8:B58", col_names = c("State", "MedIncome"))

head(income)

## # A tibble: 6 x 2
##   State      MedIncome
##   <chr>          <dbl>
## 1 Alabama        51113
## 2 Alaska         72231
## 3 Arizona        61125
## 4 Arkansas       48829
## 5 California     69759
## 6 Colorado       74172

dim(income)

## [1] 51  2

The imported dataset, income, consists of 51 observations of 2 columns; variables are State (character variable identifying the state of USA) and MedIncome (numeric variable measuring median income).

Interstate migration data

Interstate migration data was obtained from United States Census Bureau (2018b). The data was imported using the read_excel() function. As this is the most complex dataset used, columns for different variables were selected and imported separately (migration1 and migration2). These columns will be combined to create the final dataset for interstate migration (migration).

migration1 <- read_excel("State_to_State_Migrations_Table_2017.xls", 
                        range = "Table!A12:H78", 
                        col_names = c("State", "Pop1yo", "MOE1", 
                                      "Sameres1y", "MOE2", "Samest1y", 
                                      "MOE3", "Diffst1y"))
dim(migration1)

## [1] 67  8

colSums(is.na(migration1))

##     State    Pop1yo      MOE1 Sameres1y      MOE2  Samest1y      MOE3 
##        14        13        14        13        14        13        14 
##  Diffst1y 
##        12

migration2 <- read_excel("State_to_State_Migrations_Table_2017.xls", 
                        range = "Table!DR12:DS78", 
                        col_names = c("State", "Abroad1y"))
dim(migration2)

## [1] 67  2

colSums(is.na(migration2))

##    State Abroad1y 
##       14       12

migration1[33:37,] # displays rows with NA values in migration1

## # A tibble: 5 x 8
##   State    Pop1yo    MOE1  Sameres1y   MOE2  Samest1y    MOE3  Diffst1y    
##   <chr>    <chr>     <chr> <chr>       <chr> <chr>       <chr> <chr>       
## 1 <NA>     <NA>      <NA>  <NA>        <NA>  <NA>        <NA>  <NA>        
## 2 Current~ Populati~ <NA>  Same house~ <NA>  Same state~ <NA>  Different s~
## 3 <NA>     <NA>      <NA>  <NA>        <NA>  <NA>        <NA>  Total       
## 4 <NA>     Estimate  MOE   Estimate    MOE   Estimate    MOE   Estimate    
## 5 <NA>     <NA>      <NA>  <NA>        <NA>  <NA>        <NA>  <NA>

migration2[33:37,] # displays rows with NA values in migration2

## # A tibble: 5 x 2
##   State                   Abroad1y           
##   <chr>                   <chr>              
## 1 <NA>                    <NA>               
## 2 Current residence in -- Abroad 1 year ago--
## 3 <NA>                    Total              
## 4 <NA>                    Estimate           
## 5 <NA>                    <NA>

Datasets migration1 and migration2 each consist of 67 rows, from which around 14 are filled with NA values. These NA values result from the Excel spreadsheet design, with spaces and subheadings arranged throughout the data. An example of these NAs and subheadings have been provided above.

Variables contained in the migration1 dataset include:

State: state of USA
Pop1yo: Estimated population 1 year old and over.
MOE1: Margin of error based on 90% confidence interval, for Pop1yo.
Sameres1y: Estimated population aged 1 year and older, living in the same house as 1 year ago.
MOE2: Margin of error based on 90% confidence interval, for Sameres1y.
Samest1y: Estimated population aged 1 year and older, not living in the same house but living the same state as 1 year ago.
MOE3: Margin of error based on 90% confidence interal, for Samest1y.
Diffst1y: Estimated population aged 1 year and older, living in a different state 1 year ago.

Variables contained in the migration2 dataset include:

State: state of USA
Abroad1y: Estimated population aged one year and older, living abroad.

Next, sets migration1 and migration2 are joined using the left_join() function. Within the join function, we use na.omit() to remove the rows containing NA values. We assume there to be no inconsistencies between the State columns of either dataset as they are obtained from the same section of the spreadsheet.

migration <- left_join(na.omit(migration1), na.omit(migration2), by = "State")

migration <- select(migration, State, Pop1yo, Sameres1y, Samest1y, 
                    Diffst1y, Abroad1y)

head(migration)

## # A tibble: 6 x 6
##   State      Pop1yo   Sameres1y Samest1y Diffst1y Abroad1y
##   <chr>      <chr>    <chr>     <chr>    <chr>    <chr>   
## 1 Alabama    4819343  4158224   535649   111803   13667   
## 2 Alaska     730874   595804    98168    30199    6703    
## 3 Arizona    6935358  5717189   911120   261727   45322   
## 4 Arkansas   2968395  2509717   370309   79411    8958    
## 5 California 39085984 34009331  4234770  523131   318752  
## 6 Colorado   5542282  4546729   734182   226933   34438

dim(migration)

## [1] 52  6

The left-join() function prioritizes migration1 dataset, matching observations from migration2 by the key State. Last, we drop all columns storing a margin of error by using select() to select all other variables. Resulting dataset migration has 52 observations of 6 columns.

Merging median income and interstate migration datasets

The next step is to merge income and migration. We find inconsistencies in the state names between income and migration using setdiff(); the results are displayed below. We find that Washington, DC is named “D.C.” in income and “District of Columbia” in migration. We change the name of this state in income to match that of migration. Furthermore, migration includes an extra row for Puerto Rico, which does not appear as a state in income. We will drop this row when merging the two dataset.

# Check for inconsistencies
setdiff(income$State, migration$State)

## [1] "D.C."

setdiff(migration$State, income$State)

## [1] "District of Columbia" "Puerto Rico"

# Rename Washington, DC 
income[income$State == "D.C.", "State"] <- c("District of Columbia")

We use the left_join() function prioritising income to merge income and migration into new dataset set1.

#Join datasets
set1 <- left_join(income, migration, by = "State")

#Inspect
head(set1)

## # A tibble: 6 x 7
##   State      MedIncome Pop1yo   Sameres1y Samest1y Diffst1y Abroad1y
##   <chr>          <dbl> <chr>    <chr>     <chr>    <chr>    <chr>   
## 1 Alabama        51113 4819343  4158224   535649   111803   13667   
## 2 Alaska         72231 730874   595804    98168    30199    6703    
## 3 Arizona        61125 6935358  5717189   911120   261727   45322   
## 4 Arkansas       48829 2968395  2509717   370309   79411    8958    
## 5 California     69759 39085984 34009331  4234770  523131   318752  
## 6 Colorado       74172 5542282  4546729   734182   226933   34438

dim(set1)

## [1] 51  7

The resulting dataset set1 is a 51 by 7 dataframe.

Unemployment dataset

The unemployment dataset has been scraped from Bureau of Labor Statistics (2019). This is done using function read_html.

#Save URL
unemploy <- read_html("https://www.bls.gov/lau/lastrk17.htm?fbclid=IwAR3SEQ5hc6eJFB9yNBQ-0ZTHoU-lEpiHfRaewdabRX1sa3T2DnjzPLPXpSc")

#Identify and import table
length(html_nodes(unemploy, "table"))

## [1] 2

html_nodes(unemploy, "table")

## {xml_nodeset (2)}
## [1] <table id="main-content-table"><tr>\n<td id="secondary-nav-td">\r\n\ ...
## [2] <table border="1" cellpadding="0" cellspacing="0" class="regular" id ...

unemp <- html_table(html_nodes(unemploy, "table")[[2]])
colnames(unemp) <- c("State", "Unemp", "Rank")

#Inspect imported table 
head(unemp)

##           State Unemp Rank
## 1 United States   4.4     
## 2                         
## 3        Hawaii   2.4    1
## 4      Colorado   2.7    2
## 5 New Hampshire   2.7    2
## 6  North Dakota   2.7    2

dim(unemp)

## [1] 54  3

The imported dataset - unemp - is 54 rows by 3 columns; variables have been renamed as the following:

State: state of USA
Unemp: 2017 unemployement rates by State
Rank: state’s rank in terms of unemployment, from lowest to highest.

Merging unemployment data with other data

We next join unemp to set1. Beforehand, we check for missing values in unemp and also for any inconsistencies between unemp$State and set1$State.

# Check NAs 
colSums(is.na(unemp))

## State Unemp  Rank 
##     0     0     0

# Check inconsistencies  
setdiff(unemp$State, set1$State)

## [1] "United States"                                                                           
## [2] ""                                                                                        
## [3] "Note: Rates shown are a percentage of the labor force. Data refer to place of residence."

setdiff(set1$State, unemp$State)

## character(0)

No missing values are found in unemp but three character string inconsistencies were found in unemp that are not in set1. We will remove these in the merge, specifying a left_join() function that prioritises set1 and adds values of unemp using State as the key.

# Join 
finalset <- left_join(set1, unemp, by = "State") 

# Inspect final dataset 
head(finalset)

## # A tibble: 6 x 9
##   State   MedIncome Pop1yo Sameres1y Samest1y Diffst1y Abroad1y Unemp Rank 
##   <chr>       <dbl> <chr>  <chr>     <chr>    <chr>    <chr>    <chr> <chr>
## 1 Alabama     51113 48193~ 4158224   535649   111803   13667    4.4   29   
## 2 Alaska      72231 730874 595804    98168    30199    6703     7.0   51   
## 3 Arizona     61125 69353~ 5717189   911120   261727   45322    4.9   40   
## 4 Arkans~     48829 29683~ 2509717   370309   79411    8958     3.7   15   
## 5 Califo~     69759 39085~ 34009331  4234770  523131   318752   4.8   39   
## 6 Colora~     74172 55422~ 4546729   734182   226933   34438    2.7   2

dim(finalset)

## [1] 51  9

The resulting dataset finalset consists of 51 observations of 9 variables.

Understand

To gain an insight into the dataset finalset, we inspect its structure below.

str(finalset)

## Classes 'tbl_df', 'tbl' and 'data.frame':    51 obs. of  9 variables:
##  $ State    : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ MedIncome: num  51113 72231 61125 48829 69759 ...
##  $ Pop1yo   : chr  "4819343" "730874" "6935358" "2968395" ...
##  $ Sameres1y: chr  "4158224" "595804" "5717189" "2509717" ...
##  $ Samest1y : chr  "535649" "98168" "911120" "370309" ...
##  $ Diffst1y : chr  "111803" "30199" "261727" "79411" ...
##  $ Abroad1y : chr  "13667" "6703" "45322" "8958" ...
##  $ Unemp    : chr  "4.4" "7.0" "4.9" "3.7" ...
##  $ Rank     : chr  "29" "51" "40" "15" ...

We make the following observations:

Rank must be encoded as an ordinal variable
A number of variables that have been imported as character string variables must be converted to numeric (Pop1y, Sameres1y, Samest1y, Diffst1y, Abrod1y, and unemp)

First, we convert Rank to a factor varible using function mutate() from the dplyr package; we specify ordering.

finalset <- mutate(finalset, Rank = as.factor(Rank))

finalset$Rank <- factor(finalset$Rank, 
                        levels = as.character(seq(1,51,1)), 
                        order = TRUE)
levels(finalset$Rank)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
## [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
## [29] "29" "30" "31" "32" "33" "34" "35" "36" "37" "38" "39" "40" "41" "42"
## [43] "43" "44" "45" "46" "47" "48" "49" "50" "51"

We specify the levels of Rank as integers 1 to 51, in character string format. Ascending order is specified as the order for Rank. We note that not all of these levels are taken in the dataset as some states tie in ranking. We still specify all integers between 1 and 51 however, as these are the values that can be taken.

Next, we make the conversion from character to numeric format for variables Pop1y, Sameres1y, Samest1y, Diffst1y, Abrod1y, and unemp.

finalset$Pop1yo <- as.numeric(finalset$Pop1yo)
finalset$Sameres1y <- as.numeric(finalset$Sameres1y)
finalset$Samest1y <- as.numeric(finalset$Samest1y)
finalset$Diffst1y <- as.numeric(finalset$Diffst1y)
finalset$Abroad1y <- as.numeric(finalset$Abroad1y)
finalset$Unemp <- as.numeric(finalset$Unemp)

We inspect the structure of finalset after making the changes in this section.

str(finalset)

## Classes 'tbl_df', 'tbl' and 'data.frame':    51 obs. of  9 variables:
##  $ State    : chr  "Alabama" "Alaska" "Arizona" "Arkansas" ...
##  $ MedIncome: num  51113 72231 61125 48829 69759 ...
##  $ Pop1yo   : num  4819343 730874 6935358 2968395 39085984 ...
##  $ Sameres1y: num  4158224 595804 5717189 2509717 34009331 ...
##  $ Samest1y : num  535649 98168 911120 370309 4234770 ...
##  $ Diffst1y : num  111803 30199 261727 79411 523131 ...
##  $ Abroad1y : num  13667 6703 45322 8958 318752 ...
##  $ Unemp    : num  4.4 7 4.9 3.7 4.8 2.7 4.7 4.5 6.1 4.2 ...
##  $ Rank     : Ord.factor w/ 51 levels "1"<"2"<"3"<"4"<..: 29 51 40 15 39 2 35 31 50 23 ...

We can see several data types, including numeric (MedIncome, Pop1yo, Sameres1y, Samest1y, Diffst1y, Abroad1y, Unemp), character (State) and ordinal variable Rank.

Tidy & Manipulate Data I

To determine whether finalset is tidy or not we examine the first several observations below.

head(finalset)

## # A tibble: 6 x 9
##   State   MedIncome Pop1yo Sameres1y Samest1y Diffst1y Abroad1y Unemp Rank 
##   <chr>       <dbl>  <dbl>     <dbl>    <dbl>    <dbl>    <dbl> <dbl> <ord>
## 1 Alabama     51113 4.82e6   4158224   535649   111803    13667   4.4 29   
## 2 Alaska      72231 7.31e5    595804    98168    30199     6703   7   51   
## 3 Arizona     61125 6.94e6   5717189   911120   261727    45322   4.9 40   
## 4 Arkans~     48829 2.97e6   2509717   370309    79411     8958   3.7 15   
## 5 Califo~     69759 3.91e7  34009331  4234770   523131   318752   4.8 39   
## 6 Colora~     74172 5.54e6   4546729   734182   226933    34438   2.7 2

We evaluate this dataset in terms of tidy principles and make some observations:

Firstly, each variable has its own column. While multiple columns may seem to measure similar things (such as Pop1yo and Sameres1y), we intepret these to be different variables that are merely measured using the same unit (number of people).
Secondly, each state has its own row.
Thirdly, each observation has its own cell.

Hence, we conclude that finalset is tidy and no adjustments are made in this section.

Tidy & Manipulate Data II

We create new variable RateAbroad1y, that measures the percentage of the population that were living abroad 1 year ago. RateAbroad1y is calculated by dividing Abroad1y by Pop1yo, multiplying by 100, and rounding the result to one decimal place.

finalset <- mutate(finalset, RateAbroad1y = round((Abroad1y / Pop1yo)*100, 1))

head(finalset$RateAbroad1y)

## [1] 0.3 0.9 0.7 0.3 0.8 0.6

By using mutate() function, Abroad1y and Pop1yo remain in finalset after RateAbroad1y has been created. The first 6 observations of resulting finalset has been provided below.

head(finalset)

## # A tibble: 6 x 10
##   State MedIncome Pop1yo Sameres1y Samest1y Diffst1y Abroad1y Unemp Rank 
##   <chr>     <dbl>  <dbl>     <dbl>    <dbl>    <dbl>    <dbl> <dbl> <ord>
## 1 Alab~     51113 4.82e6   4158224   535649   111803    13667   4.4 29   
## 2 Alas~     72231 7.31e5    595804    98168    30199     6703   7   51   
## 3 Ariz~     61125 6.94e6   5717189   911120   261727    45322   4.9 40   
## 4 Arka~     48829 2.97e6   2509717   370309    79411     8958   3.7 15   
## 5 Cali~     69759 3.91e7  34009331  4234770   523131   318752   4.8 39   
## 6 Colo~     74172 5.54e6   4546729   734182   226933    34438   2.7 2    
## # ... with 1 more variable: RateAbroad1y <dbl>

Scan I

We scan finaset for any missing values or errors. First, we check for NA values. Then, we scan for infinite or NAN values. Next, we check to see that variables MedIncome to Unemp in the data frame are note negative (a negative value in one of these columns would be an obvious error). We last check the unique state names in finalset to make sure there are no obvious errors here.

# Check number of missing values per column 
colSums(is.na(finalset))

##        State    MedIncome       Pop1yo    Sameres1y     Samest1y 
##            0            0            0            0            0 
##     Diffst1y     Abroad1y        Unemp         Rank RateAbroad1y 
##            0            0            0            0            0

# Check special values (infinite or NAN values) 
is.special <- function(x){
if (is.numeric(x)) (is.infinite(x) | is.nan(x))}
sapply(finalset, function(x) sum( is.na(x) ))

##        State    MedIncome       Pop1yo    Sameres1y     Samest1y 
##            0            0            0            0            0 
##     Diffst1y     Abroad1y        Unemp         Rank RateAbroad1y 
##            0            0            0            0            0

# Check for negative values 
finalset_negative <- (finalset[,2:8] < 0)
sum(finalset_negative)

## [1] 0

# Check state names are valid 
unique(finalset$State)

##  [1] "Alabama"              "Alaska"               "Arizona"             
##  [4] "Arkansas"             "California"           "Colorado"            
##  [7] "Connecticut"          "Delaware"             "District of Columbia"
## [10] "Florida"              "Georgia"              "Hawaii"              
## [13] "Idaho"                "Illinois"             "Indiana"             
## [16] "Iowa"                 "Kansas"               "Kentucky"            
## [19] "Louisiana"            "Maine"                "Maryland"            
## [22] "Massachusetts"        "Michigan"             "Minnesota"           
## [25] "Mississippi"          "Missouri"             "Montana"             
## [28] "Nebraska"             "Nevada"               "New Hampshire"       
## [31] "New Jersey"           "New Mexico"           "New York"            
## [34] "North Carolina"       "North Dakota"         "Ohio"                
## [37] "Oklahoma"             "Oregon"               "Pennsylvania"        
## [40] "Rhode Island"         "South Carolina"       "South Dakota"        
## [43] "Tennessee"            "Texas"                "Utah"                
## [46] "Vermont"              "Virginia"             "Washington"          
## [49] "West Virginia"        "Wisconsin"            "Wyoming"

We have the following result:

No NA values are found.
No infinite or NAN values are found.
No values are negative in columns MedIncome to Rank.
All state names are unique and make sense.
Levels of variable Rank were already checked previously hence we do not do so here.

Hence, we do not find any missing values, inconsistencies or obvious errors.

Scan II

All numeric variables are scanned for univariate outliers boxplots.

par(mfrow=c(4,2))
boxplot(finalset$Unemp, horizontal = TRUE, xlab = "Unemp") 
boxplot(finalset$MedIncome, horizontal = TRUE, xlab = "MedIncome")
boxplot(finalset$Pop1yo, horizontal = TRUE, xlab = "Pop1yo") 
boxplot(finalset$Sameres1y, horizontal = TRUE, xlab = "Sameres1y")
boxplot(finalset$Samest1y, horizontal = TRUE, xlab = "Samest1y")
boxplot(finalset$Diffst1y, horizontal = TRUE, xlab = "Diffst1y")
boxplot(finalset$Abroad1y, horizontal = TRUE, xlab = "Abroad1y")
boxplot(finalset$RateAbroad1y, horizontal = TRUE, xlab = "RateAbroad1y")

Outliers are identified for all variables except MedIncome. These outliers are identified below using the filter() and select() functions.

finalset %>%  filter(Unemp > 6.5) %>% select(State, Unemp)

## # A tibble: 1 x 2
##   State  Unemp
##   <chr>  <dbl>
## 1 Alaska     7

finalset %>%  filter(Pop1yo > 15000000) %>% select(State, Pop1yo)

## # A tibble: 4 x 2
##   State        Pop1yo
##   <chr>         <dbl>
## 1 California 39085984
## 2 Florida    20772371
## 3 New York   19638923
## 4 Texas      27924632

finalset %>%  filter(Sameres1y > 15000000) %>% select(State, Sameres1y)

## # A tibble: 4 x 2
##   State      Sameres1y
##   <chr>          <dbl>
## 1 California  34009331
## 2 Florida     17554348
## 3 New York    17553033
## 4 Texas       23534387

finalset %>%  filter(Samest1y > 2000000) %>% select(State, Samest1y)

## # A tibble: 3 x 2
##   State      Samest1y
##   <chr>         <dbl>
## 1 California  4234770
## 2 Florida     2402381
## 3 Texas       3643333

finalset %>%  filter(Diffst1y > 400000) %>% select(State, Diffst1y)

## # A tibble: 3 x 2
##   State      Diffst1y
##   <chr>         <dbl>
## 1 California   523131
## 2 Florida      566476
## 3 Texas        524511

finalset %>%  filter(Abroad1y > 100000) %>% select(State, Abroad1y)

## # A tibble: 4 x 2
##   State      Abroad1y
##   <chr>         <dbl>
## 1 California   318752
## 2 Florida      249166
## 3 New York     145486
## 4 Texas        222401

finalset %>%  filter(RateAbroad1y > 1.4) %>% select(State, RateAbroad1y)

## # A tibble: 1 x 2
##   State                RateAbroad1y
##   <chr>                       <dbl>
## 1 District of Columbia          1.5

The states that tend to appear as outliers are states such as California, Florida, Texas and New York. These are states with the largest populations (measured by Pop1yo). As these states have more residents, it makes sense that they would have more residents counted in other measures that count people such as Sameres1y. Other outliers include Alaska with its high unemployment rate and District of Columbia having the highest values of RateAbroad1y. In each of these cases, excluding these states may exclude important systematic information related to the variables in finalset. Furthermore, finalset includes a full set of states of the USA that may be valuable for analysis. Hence, we do not want to remove any states from finalset; instead, we will conduct transformations in the next section that will reduce the impact of these outliers.

Transform

To want to transform at least one variable. We need to choose which transformation to use, and also which variable(s) to transform. We consider the positive skew typically seen in the boxplots in the previous section. We also consider the fact that outliers are seen in all numeric variables except for MedIncome. We decide to take the natural logarithm of each numeric variable using the log() function, to try and improve normality and treat outliers.

lnUnemp <- log(finalset$Unemp)
lnMedIncome <- log(finalset$MedIncome)
lnPop1yo <- log(finalset$Pop1yo)
lnSameres1y <- log(finalset$Sameres1y)
lnSamest1y <- log(finalset$Samest1y)
lnDiffst1y <- log(finalset$Diffst1y)
lnAbroad1y <- log(finalset$Abroad1y)
lnRateAbroad1y <- log(finalset$RateAbroad1y)

Natural log transformation of Unemp

par(mfrow=c(1,2))
boxplot(lnUnemp) 
hist(lnUnemp)

Natural log transformation of MedIncome

par(mfrow=c(1,2))
boxplot(lnMedIncome) 
hist(lnMedIncome)

Natural log transformation of Pop1yo

par(mfrow=c(1,2))
boxplot(lnPop1yo) 
hist(lnPop1yo)

Natural log transformation of Sameres1y

par(mfrow=c(1,2))
boxplot(lnSameres1y) 
hist(lnSameres1y)

Natural log transformation of Samest1y

par(mfrow=c(1,2))
boxplot(lnSamest1y) 
hist(lnSamest1y)

Natural log transformation of Diffst1y

par(mfrow=c(1,2))
boxplot(lnDiffst1y) 
hist(lnDiffst1y)

Natural log transformation of Abroad1y

par(mfrow=c(1,2))
boxplot(lnAbroad1y) 
hist(lnAbroad1y)

Natural log transformation of RateAbroad1y

par(mfrow=c(1,2))
boxplot(lnRateAbroad1y) 
hist(lnRateAbroad1y)

In all cases except for that of MedIncome, the natural log transformation has removed all outliers. In the case of MedIncome, taking the natural log results in a more symmetric variable.

References

Bureau of Labor Statistics (2019). Unemployment Rates for States, Annual Averages. Table: ‘Unemployment Rates for States, 2017 Annual Averages’. URL: https://www.bls.gov/lau/lastrk17.htm?fbclid=IwAR3SEQ5hc6eJFB9yNBQ-0ZTHoU-lEpiHfRaewdabRX1sa3T2DnjzPLPXpSc (accessed 23 May 2019)
United States Census Bureau (2018a) Historical Income Tables: Households. Table: ‘Table H-8. Median Household Income by State’. URL: https://www.census.gov/data/tables/time-series/demo/income-poverty/historical-income-households.html (accessed 23 May 2019)
United States Census Bureau (2018b) State-to-State Migration Flows. Table: ‘State-to-State Migration Flows: 2017’. URL: https://www.census.gov/data/tables/time-series/demo/geographic-mobility/state-to-state-migration.html?fbclid=IwAR1H-m2ys67FqOFDy91SjTWRgtwhe9vT02ObQ1-WGFkL0RLnKZRiTkdyWjo (accessed 23 May 2019).

MATH2349 Data Preprocessing: Assignment 3

Ariana Tammetta (s3774802), Karen Gonzalez (s3697003)

09/06/2019