Course: Data-Driven Management and Policy

Prof. José Manuel Magallanes, PhD


Session 3 LAB: Data Structures

Lab Instructions

This lab is designed to give you practice manipulating basic R data structures. In addition, it should give you more practice using R and Markdown.

Part 1: Working with lists

You have this information from Governor Inslee:

The information was got from wikipedia.

Make a list, in a way you can answer these questions:

  • How old is he? (compute it from his birthday) He is 68 years old.
  • How long has he been married? He has been married for 47 years.
  • How many Universities has he attended? He has attended 3 Universities.
  • In what state was he born? He was born in Washington state.

Write the code to answer those questions.

Jay =list('FullName' = 'Jay Robert Inslee',
          'DateOfBirth' = '9 February 1951',
          'PlaceOfBirth' = list('Seattle', 'Washington', 'U.S.'),
          'PoliticalParty' = 'Democratic',
          'Spouse' = 'Trudi Inslee',
          'Children' = '3',
          'WeddingYear' = '1972',
          'Residence' = 'Governors Mansion',
          'Education' = list('StanfordUniversity', 'UniversityOfWashington', 'WilliametteUniversity')
          
)
#Jay's Age
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
Jay$DateOfBirth = as.Date(Jay$DateOfBirth, format = "%d %B %Y");str(Jay$DateOfBirth)
##  Date[1:1], format: "1951-02-09"
Sys.Date() - Jay$DateOfBirth
## Time difference of 24909 days
time_length(interval(Jay$DateOfBirth,Sys.Date()),"years")
## [1] 68.19726
#Years of Marriage
Jay$WeddingYear = as.numeric(Jay$WeddingYear)
2019 - Jay$WeddingYear
## [1] 47
#Number of schools attended
length(Jay$Education)
## [1] 3
#State of birth
Jay$PlaceOfBirth[2]
## [[1]]
## [1] "Washington"

Part 2: Solving Data issues

Last session you were asked to get some data from SEATTLE OPEN DATA portal. I will show you an example on how to use R to look for and solve some issues.

EXAMPLE:

  1. Open the data:
linkToData='ExampleLab3.xlsx'

library(rio)

myData=import("ExampleLab3.xlsx")
  1. Verify the data type:
str(myData)
## 'data.frame':    30 obs. of  5 variables:
##  $ personID     : chr  "A145" "A185" "A108" "A172" ...
##  $ age          : chr  "21" "34" "35" NA ...
##  $ County       : chr  "King County" "King County" "King County" "King County" ...
##  $ Degree       : chr  "HighSchoolDiploma" "Bachelor" "Bachelor" "Bachelor" ...
##  $ HouseholdSize: chr  "2" "1" "1" "1" ...
  1. Identify problems

From the result above, age and household size should be numbers, but they have been read as text; that is, there should be a number written as text.

  1. Count missing values in numeric data:
myData[is.na(myData$age),]
##   personID  age      County   Degree HouseholdSize
## 4     A172 <NA> King County Bachelor             1
myData[is.na(myData$HouseholdSize),]
##    personID age         County   Degree HouseholdSize
## 28     C148  35 Spokane County Bachelor          <NA>

Each column has one missing value.

  1. Count missing values in numeric data when turned into numeric:
myData[is.na(as.numeric(myData$age)),]
##   personID  age      County   Degree HouseholdSize
## 4     A172 <NA> King County Bachelor             1

In the previous case, when turning the column into numeric no extra missing value was found.

myData[is.na(as.numeric(myData$HouseholdSize)),]
## Warning in `[.data.frame`(myData,
## is.na(as.numeric(myData$HouseholdSize)), : NAs introduced by coercion
##    personID age         County   Degree HouseholdSize
## 18     A195  43  Kitsap County Bachelor             l
## 28     C148  35 Spokane County Bachelor          <NA>

In the previous case, when turning the column into numeric ONE extra missing value was found. In this case, R also told you that it introduced NAs. When you apply the function as.numeric() to a text, R will return a NA.

  1. Make replacements

In the first column, there were no issues, then I simply alter the data type:

myData$age=as.numeric(myData$age)

In the second column, you first need to modify the value:

myData$HouseholdSize[18]
## [1] "l"
# then:
myData$HouseholdSize[18]=1

Now you can turn the column into numbers:

myData$HouseholdSize=as.numeric(myData$HouseholdSize)
  1. Make a statistical summary of the numeric variables
summary(myData[,c('age','HouseholdSize')])
##       age        HouseholdSize   
##  Min.   :21.00   Min.   : 1.000  
##  1st Qu.:29.00   1st Qu.: 2.000  
##  Median :38.00   Median : 2.000  
##  Mean   :36.72   Mean   : 3.828  
##  3rd Qu.:42.00   3rd Qu.: 3.000  
##  Max.   :49.00   Max.   :44.000  
##  NA's   :1       NA's   :1
  1. Identify weird values

In the previous summary, you see that age values seem well, but it looks as though there might be something weird in household size: there is a person in this data set whose household size is 44. Is this an outlier?

boxplot(myData$HouseholdSize,horizontal = T)

It looks too far from the other answers. As it is an integer value (do not do this if it has decimal values), we could try a frequency table:

table(myData$HouseholdSize)
## 
##  1  2  3  4 44 
##  7  9  6  6  1

My best guess is that this value is atually a 4, so there is some kind of mistyping; however, you might want to spend some time looking for the original record that produced this row.

YOUR TURN:

  • Use the data you downloaded last week.
  • Using str(), verify if there is a numeric value that has been read as text. If so, please modify it. If there were several numeric columns, do not use more than three.
  • Identify if there are weird values in the numeric columns (now all re formatted). If there were several numeric columns, do not use more than three.

The area that has the higest number of AAWDT(annual average weekly daily traffic) in seattle area is 66361 and is located “YALE AVE, NW/O HOWELL ST”. From examining the location through the map, I believe that this is not the outlier despite the aberrant high in number comparing to the other areas. This is because it is located closely to i-5 bridge both the entrance and exit. It is also located between downtown and Capitol Hills which are probably two of the most busiest areas in Seattle area.

linkToData = 'Traffic_Flow_Map_Volumes.csv'

library(rio)

myData2 = import("Traffic_Flow_Map_Volumes.csv")


str(myData2)
## 'data.frame':    437 obs. of  7 variables:
##  $ OBJECTID      : int  17793 27025 20123 35281 44585 39118 898 33200 31642 11579 ...
##  $ STNAME        : chr  "E BOSTON ST" "E PINE ST" "M L KING JR WAY E" "EASTLAKE AVE E" ...
##  $ COUNT_LOCATION: chr  "E BOSTON ST, W/O 11TH AVE E" "E PINE ST, W/O 12TH AVE" "M L KING JR WAY E, N/O E JOHN ST" "EASTLAKE AVE E, N/O THOMAS ST" ...
##  $ YEAR          : int  2017 2018 2017 2018 2018 2018 2017 2017 2018 2018 ...
##  $ SEGKEY        : int  13859 14637 11496 10164 22292 18070 14080 4379 2350 18068 ...
##  $ AAWDT         : int  6235 8763 8706 8988 6295 3866 12342 11974 4283 4187 ...
##  $ INPUT_STUDY_ID: int  324780 325408 325994 327101 325455 325428 325810 324002 325448 326181 ...
colMax <- function(myData2) sapply(myData2,max, na.rm = TRUE)

colMax(myData2)
##                   OBJECTID                     STNAME 
##                    "46655"                 "YALE AVE" 
##             COUNT_LOCATION                       YEAR 
## "YALE AVE, NW/O HOWELL ST"                     "2018" 
##                     SEGKEY                      AAWDT 
##                   "652860"                    "66361" 
##             INPUT_STUDY_ID 
##                   "327691"
max(myData2$AAWDT)
## [1] 66361
summary(myData2)
##     OBJECTID        STNAME          COUNT_LOCATION          YEAR     
##  Min.   :  352   Length:437         Length:437         Min.   :2017  
##  1st Qu.:12228   Class :character   Class :character   1st Qu.:2017  
##  Median :24033   Mode  :character   Mode  :character   Median :2017  
##  Mean   :24570                                         Mean   :2017  
##  3rd Qu.:35702                                         3rd Qu.:2018  
##  Max.   :46655                                         Max.   :2018  
##      SEGKEY           AAWDT       INPUT_STUDY_ID  
##  Min.   :  1000   Min.   :  468   Min.   :323817  
##  1st Qu.:  7911   1st Qu.: 4504   1st Qu.:325393  
##  Median : 12359   Median : 7840   Median :325779  
##  Mean   : 20899   Mean   : 9923   Mean   :325675  
##  3rd Qu.: 17899   3rd Qu.:13133   3rd Qu.:326138  
##  Max.   :652860   Max.   :66361   Max.   :327691
#boxplot(myData2$AAWDT)

Part 3: Final Project

  • Specify what action or intervention you will begin in two weeks as your experiment.

I plan to start download to applications needed to track my distance walked, hours of sleep, calories eating to determine how it affects my weight) * Which of your measures do you expect to see a difference in based on that experiment (you can see it in multiple)?

I think sleep, exercises, and diet are important to weight loss or maintance. * What is your hypothesis, do you think that measure will increase or decrease? It is recommended that you begin your data collection by tonight, although in the next few days will also work.

I think my weight is expected to loss since I will be paying attention more to those three aspects.