This lab is designed to give you practice manipulating basic R data structures. In addition, it should give you more practice using R and Markdown.
You have this information from Governor Inslee:
The information was got from wikipedia.
Make a list, in a way you can answer these questions:
Write the code to answer those questions.
Jay =list('FullName' = 'Jay Robert Inslee',
'DateOfBirth' = '9 February 1951',
'PlaceOfBirth' = list('Seattle', 'Washington', 'U.S.'),
'PoliticalParty' = 'Democratic',
'Spouse' = 'Trudi Inslee',
'Children' = '3',
'WeddingYear' = '1972',
'Residence' = 'Governors Mansion',
'Education' = list('StanfordUniversity', 'UniversityOfWashington', 'WilliametteUniversity')
)
#Jay's Age
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
Jay$DateOfBirth = as.Date(Jay$DateOfBirth, format = "%d %B %Y");str(Jay$DateOfBirth)
## Date[1:1], format: "1951-02-09"
Sys.Date() - Jay$DateOfBirth
## Time difference of 24909 days
time_length(interval(Jay$DateOfBirth,Sys.Date()),"years")
## [1] 68.19726
#Years of Marriage
Jay$WeddingYear = as.numeric(Jay$WeddingYear)
2019 - Jay$WeddingYear
## [1] 47
#Number of schools attended
length(Jay$Education)
## [1] 3
#State of birth
Jay$PlaceOfBirth[2]
## [[1]]
## [1] "Washington"
Last session you were asked to get some data from SEATTLE OPEN DATA portal. I will show you an example on how to use R to look for and solve some issues.
linkToData='ExampleLab3.xlsx'
library(rio)
myData=import("ExampleLab3.xlsx")
str(myData)
## 'data.frame': 30 obs. of 5 variables:
## $ personID : chr "A145" "A185" "A108" "A172" ...
## $ age : chr "21" "34" "35" NA ...
## $ County : chr "King County" "King County" "King County" "King County" ...
## $ Degree : chr "HighSchoolDiploma" "Bachelor" "Bachelor" "Bachelor" ...
## $ HouseholdSize: chr "2" "1" "1" "1" ...
From the result above, age and household size should be numbers, but they have been read as text; that is, there should be a number written as text.
myData[is.na(myData$age),]
## personID age County Degree HouseholdSize
## 4 A172 <NA> King County Bachelor 1
myData[is.na(myData$HouseholdSize),]
## personID age County Degree HouseholdSize
## 28 C148 35 Spokane County Bachelor <NA>
Each column has one missing value.
myData[is.na(as.numeric(myData$age)),]
## personID age County Degree HouseholdSize
## 4 A172 <NA> King County Bachelor 1
In the previous case, when turning the column into numeric no extra missing value was found.
myData[is.na(as.numeric(myData$HouseholdSize)),]
## Warning in `[.data.frame`(myData,
## is.na(as.numeric(myData$HouseholdSize)), : NAs introduced by coercion
## personID age County Degree HouseholdSize
## 18 A195 43 Kitsap County Bachelor l
## 28 C148 35 Spokane County Bachelor <NA>
In the previous case, when turning the column into numeric ONE extra missing value was found. In this case, R also told you that it introduced NAs. When you apply the function as.numeric() to a text, R will return a NA.
In the first column, there were no issues, then I simply alter the data type:
myData$age=as.numeric(myData$age)
In the second column, you first need to modify the value:
myData$HouseholdSize[18]
## [1] "l"
# then:
myData$HouseholdSize[18]=1
Now you can turn the column into numbers:
myData$HouseholdSize=as.numeric(myData$HouseholdSize)
summary(myData[,c('age','HouseholdSize')])
## age HouseholdSize
## Min. :21.00 Min. : 1.000
## 1st Qu.:29.00 1st Qu.: 2.000
## Median :38.00 Median : 2.000
## Mean :36.72 Mean : 3.828
## 3rd Qu.:42.00 3rd Qu.: 3.000
## Max. :49.00 Max. :44.000
## NA's :1 NA's :1
In the previous summary, you see that age values seem well, but it looks as though there might be something weird in household size: there is a person in this data set whose household size is 44. Is this an outlier?
boxplot(myData$HouseholdSize,horizontal = T)
It looks too far from the other answers. As it is an integer value (do not do this if it has decimal values), we could try a frequency table:
table(myData$HouseholdSize)
##
## 1 2 3 4 44
## 7 9 6 6 1
My best guess is that this value is atually a 4, so there is some kind of mistyping; however, you might want to spend some time looking for the original record that produced this row.
The area that has the higest number of AAWDT(annual average weekly daily traffic) in seattle area is 66361 and is located “YALE AVE, NW/O HOWELL ST”. From examining the location through the map, I believe that this is not the outlier despite the aberrant high in number comparing to the other areas. This is because it is located closely to i-5 bridge both the entrance and exit. It is also located between downtown and Capitol Hills which are probably two of the most busiest areas in Seattle area.
linkToData = 'Traffic_Flow_Map_Volumes.csv'
library(rio)
myData2 = import("Traffic_Flow_Map_Volumes.csv")
str(myData2)
## 'data.frame': 437 obs. of 7 variables:
## $ OBJECTID : int 17793 27025 20123 35281 44585 39118 898 33200 31642 11579 ...
## $ STNAME : chr "E BOSTON ST" "E PINE ST" "M L KING JR WAY E" "EASTLAKE AVE E" ...
## $ COUNT_LOCATION: chr "E BOSTON ST, W/O 11TH AVE E" "E PINE ST, W/O 12TH AVE" "M L KING JR WAY E, N/O E JOHN ST" "EASTLAKE AVE E, N/O THOMAS ST" ...
## $ YEAR : int 2017 2018 2017 2018 2018 2018 2017 2017 2018 2018 ...
## $ SEGKEY : int 13859 14637 11496 10164 22292 18070 14080 4379 2350 18068 ...
## $ AAWDT : int 6235 8763 8706 8988 6295 3866 12342 11974 4283 4187 ...
## $ INPUT_STUDY_ID: int 324780 325408 325994 327101 325455 325428 325810 324002 325448 326181 ...
colMax <- function(myData2) sapply(myData2,max, na.rm = TRUE)
colMax(myData2)
## OBJECTID STNAME
## "46655" "YALE AVE"
## COUNT_LOCATION YEAR
## "YALE AVE, NW/O HOWELL ST" "2018"
## SEGKEY AAWDT
## "652860" "66361"
## INPUT_STUDY_ID
## "327691"
max(myData2$AAWDT)
## [1] 66361
summary(myData2)
## OBJECTID STNAME COUNT_LOCATION YEAR
## Min. : 352 Length:437 Length:437 Min. :2017
## 1st Qu.:12228 Class :character Class :character 1st Qu.:2017
## Median :24033 Mode :character Mode :character Median :2017
## Mean :24570 Mean :2017
## 3rd Qu.:35702 3rd Qu.:2018
## Max. :46655 Max. :2018
## SEGKEY AAWDT INPUT_STUDY_ID
## Min. : 1000 Min. : 468 Min. :323817
## 1st Qu.: 7911 1st Qu.: 4504 1st Qu.:325393
## Median : 12359 Median : 7840 Median :325779
## Mean : 20899 Mean : 9923 Mean :325675
## 3rd Qu.: 17899 3rd Qu.:13133 3rd Qu.:326138
## Max. :652860 Max. :66361 Max. :327691
#boxplot(myData2$AAWDT)
I plan to start download to applications needed to track my distance walked, hours of sleep, calories eating to determine how it affects my weight) * Which of your measures do you expect to see a difference in based on that experiment (you can see it in multiple)?
I think sleep, exercises, and diet are important to weight loss or maintance. * What is your hypothesis, do you think that measure will increase or decrease? It is recommended that you begin your data collection by tonight, although in the next few days will also work.
I think my weight is expected to loss since I will be paying attention more to those three aspects.