This is Quiz 1 attempt on the coursera course called “Getting and Cleaning Data”. This course is conducted by John Hopkins university and this is part of their data science course series.
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv
and load the data into R. The code book, describing the variable names is here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf
How many housing units in this survey were worth more than $1,000,000?
Trying to work out the answer
setwd("/Users/njvijay/big_data/datascience_johnhopkins/getting_cleaning_data/Week1/Quiz1")
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv"
download.file(url=fileUrl,destfile="idaho_housing.csv",mode="w",method="curl")
list.files(".")
## [1] "fsspid.csv" "gov_NGAP.xlsx" "idaho_housing.csv"
## [4] "Week1-Quiz1.html" "Week1-Quiz1.Rmd"
dateDownloaded <- date()
dateDownloaded
## [1] "Sun Apr 13 15:17:35 2014"
idaho_housing <- read.csv("idaho_housing.csv")
head(idaho_housing)
## RT SERIALNO DIVISION PUMA REGION ST ADJUST WGTP NP TYPE ACR AGS BDS BLD
## 1 H 186 8 700 4 16 1015675 89 4 1 1 NA 4 2
## 2 H 306 8 700 4 16 1015675 310 1 1 NA NA 1 7
## 3 H 395 8 100 4 16 1015675 106 2 1 1 NA 3 2
## 4 H 506 8 700 4 16 1015675 240 4 1 1 NA 4 2
## 5 H 835 8 800 4 16 1015675 118 4 1 2 1 5 2
## 6 H 989 8 700 4 16 1015675 115 4 1 1 NA 3 2
## BUS CONP ELEP FS FULP GASP HFL INSP KIT MHP MRGI MRGP MRGT MRGX PLM RMS
## 1 2 NA 180 0 2 3 3 600 1 NA 1 1300 1 1 1 9
## 2 NA NA 60 0 2 3 3 NA 1 NA NA NA NA NA 1 2
## 3 2 NA 70 0 2 30 1 200 1 NA NA NA NA 3 1 7
## 4 2 NA 40 0 2 80 1 200 1 NA 1 860 1 1 1 6
## 5 2 NA 250 0 2 3 3 700 1 NA 1 1900 1 1 1 7
## 6 2 NA 130 0 2 3 3 250 1 NA 1 700 1 1 1 6
## RNTM RNTP SMP TEL TEN VACS VAL VEH WATP YBL FES FINCP FPARC GRNTP GRPIP
## 1 NA NA NA 1 1 NA 17 3 840 5 2 105600 2 NA NA
## 2 2 600 NA 1 3 NA NA 1 1 3 NA NA NA 660 23
## 3 NA NA NA 1 2 NA 18 2 50 5 7 9400 2 NA NA
## 4 NA NA 400 1 1 NA 19 3 500 2 1 66000 1 NA NA
## 5 NA NA 650 1 1 NA 20 5 2 3 1 93000 2 NA NA
## 6 NA NA 400 1 1 NA 15 2 1200 5 2 61000 1 NA NA
## HHL HHT HINCP HUGCL HUPAC HUPAOC HUPARC LNGI MV NOC NPF NPP NR NRC
## 1 1 1 105600 0 2 2 2 1 4 2 4 0 0 2
## 2 1 4 34000 0 4 4 4 1 3 0 NA 0 0 0
## 3 1 3 9400 0 2 2 2 1 2 1 2 0 0 1
## 4 1 1 66000 0 1 1 1 1 3 2 4 0 0 2
## 5 1 1 93000 0 2 2 2 1 1 1 4 0 0 1
## 6 1 1 61000 0 1 1 1 1 4 2 4 0 0 2
## OCPIP PARTNER PSF R18 R60 R65 RESMODE SMOCP SMX SRNT SVAL TAXP WIF
## 1 18 0 0 1 0 0 1 1550 3 0 1 24 3
## 2 NA 0 0 0 0 0 2 NA NA 1 0 NA NA
## 3 23 0 0 1 0 0 1 179 NA 0 1 16 1
## 4 26 0 0 1 0 0 2 1422 1 0 1 31 2
## 5 36 0 0 1 0 0 1 2800 1 0 1 25 3
## 6 26 0 0 1 0 0 2 1330 2 0 1 7 1
## WKEXREL WORKSTAT FACRP FAGSP FBDSP FBLDP FBUSP FCONP FELEP FFSP FFULP
## 1 2 3 0 0 0 0 0 0 0 0 0
## 2 NA NA 0 0 0 0 0 0 0 0 0
## 3 13 13 0 0 0 0 0 0 0 0 0
## 4 2 1 0 0 0 0 0 0 0 0 0
## 5 1 1 0 0 0 0 0 0 0 0 0
## 6 7 3 0 0 0 0 0 0 0 0 0
## FGASP FHFLP FINSP FKITP FMHP FMRGIP FMRGP FMRGTP FMRGXP FMVYP FPLMP
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 1 0 0 0 0 0 0 0 0
## FRMSP FRNTMP FRNTP FSMP FSMXHP FSMXSP FTAXP FTELP FTENP FVACSP FVALP
## 1 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 1 0 0 0 0
## FVEHP FWATP FYBLP wgtp1 wgtp2 wgtp3 wgtp4 wgtp5 wgtp6 wgtp7 wgtp8 wgtp9
## 1 0 0 0 87 28 156 95 26 25 95 93 93
## 2 0 0 1 539 363 293 422 566 289 87 242 453
## 3 0 0 0 187 35 184 178 83 95 31 32 177
## 4 0 0 0 232 406 234 270 249 242 406 249 287
## 5 0 0 0 107 194 129 41 156 174 47 113 101
## 6 0 1 0 191 197 127 115 115 107 119 34 32
## wgtp10 wgtp11 wgtp12 wgtp13 wgtp14 wgtp15 wgtp16 wgtp17 wgtp18 wgtp19
## 1 91 87 166 90 25 153 89 148 82 25
## 2 453 334 358 414 102 281 99 108 278 131
## 3 118 110 114 184 107 95 115 33 118 120
## 4 67 72 413 399 77 245 424 67 63 226
## 5 33 115 52 113 95 135 206 100 185 135
## 6 30 123 199 117 33 109 117 31 115 201
## wgtp20 wgtp21 wgtp22 wgtp23 wgtp24 wgtp25 wgtp26 wgtp27 wgtp28 wgtp29
## 1 180 90 24 140 92 25 27 86 84 87
## 2 407 447 264 352 238 390 336 122 374 482
## 3 37 184 35 176 176 110 103 29 30 197
## 4 254 238 69 238 255 239 248 69 234 247
## 5 279 116 33 105 244 38 30 230 123 123
## 6 190 184 198 113 109 117 111 110 33 37
## wgtp30 wgtp31 wgtp32 wgtp33 wgtp34 wgtp35 wgtp36 wgtp37 wgtp38 wgtp39
## 1 93 90 149 91 28 143 81 144 95 27
## 2 468 335 251 613 104 284 116 91 326 102
## 3 127 92 118 177 99 99 109 34 100 105
## 4 437 423 74 61 401 267 72 388 335 229
## 5 243 120 238 98 90 107 44 122 32 127
## 6 36 110 183 114 35 134 119 32 121 188
## wgtp40 wgtp41 wgtp42 wgtp43 wgtp44 wgtp45 wgtp46 wgtp47 wgtp48 wgtp49
## 1 22 90 171 27 83 153 148 92 91 91
## 2 361 107 253 321 289 96 343 564 274 118
## 3 33 173 36 168 175 99 103 30 35 155
## 4 236 239 65 259 247 230 225 82 220 233
## 5 195 116 36 135 237 33 33 249 102 84
## 6 33 34 32 109 115 115 112 119 192 186
## wgtp50 wgtp51 wgtp52 wgtp53 wgtp54 wgtp55 wgtp56 wgtp57 wgtp58 wgtp59
## 1 93 90 26 94 142 24 91 29 84 148
## 2 118 321 261 130 463 294 479 391 307 476
## 3 102 95 107 185 120 114 113 36 115 103
## 4 419 390 69 74 391 276 70 422 409 223
## 5 224 119 250 119 125 126 32 112 33 131
## 6 213 106 34 124 179 106 107 190 112 34
## wgtp60 wgtp61 wgtp62 wgtp63 wgtp64 wgtp65 wgtp66 wgtp67 wgtp68 wgtp69
## 1 30 93 143 24 88 147 145 91 83 83
## 2 283 116 353 323 374 106 236 380 313 90
## 3 29 183 35 179 169 95 110 28 34 233
## 4 245 269 488 221 250 247 240 415 234 219
## 5 45 101 165 125 41 191 195 49 119 92
## 6 35 32 34 119 123 122 121 123 196 196
## wgtp70 wgtp71 wgtp72 wgtp73 wgtp74 wgtp75 wgtp76 wgtp77 wgtp78 wgtp79
## 1 86 81 27 93 151 28 79 25 101 157
## 2 94 292 401 81 494 346 496 615 286 454
## 3 97 123 119 168 107 95 101 30 124 106
## 4 66 68 359 385 71 234 421 76 77 242
## 5 44 127 36 119 121 116 209 97 176 144
## 6 207 120 34 109 199 116 110 211 120 31
## wgtp80
## 1 129
## 2 260
## 3 31
## 4 231
## 5 38
## 6 189
#Code book says VAL with 24 represents any house more than $1000000.
#Following piece of code is trying to get count on house price >$1000000 whose value is not NA
length(idaho_housing$VAL[!is.na(idaho_housing$VAL) & idaho_housing$VAL==24])
## [1] 53
Thus answer for the above questions is 53
Skipping Question 2
Download the Excel spreadsheet on Natural Gas Aquisition Program here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx
Read rows 18-23 and columns 7-15 into R and assign the result to a variable called:
dat
What is the value of:
sum(dat\(Zip*dat\)Ext,na.rm=T)
(original data source: http://catalog.data.gov/dataset/natural-gas-acquisition-program)
#Download data which is in excel from the website
setwd("/Users/njvijay/big_data/datascience_johnhopkins/getting_cleaning_data/Week1/Quiz1")
fileUrl1 <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx"
download.file(url=fileUrl1, destfile="gov_NGAP.xlsx", mode="w", method="curl")
dateDownloaded <- date()
print(dateDownloaded)
## [1] "Sun Apr 13 15:17:37 2014"
library(xlsx)
## Loading required package: xlsxjars
## Loading required package: rJava
rowIndex <- 18:23
colIndx <- 7:15
dat <- read.xlsx(file="gov_NGAP.xlsx",sheetIndex=1,colIndex=colIndx,startRow=18, endRow=23, header=TRUE)
head(dat)
## Zip CuCurrent PaCurrent PoCurrent Contact Ext Fax email
## 1 74136 0 1 0 918-491-6998 0 918-491-6659 NA
## 2 30329 1 0 0 404-321-5711 NA <NA> NA
## 3 74136 1 0 0 918-523-2516 0 918-523-2522 NA
## 4 80203 0 1 0 303-864-1919 0 <NA> NA
## 5 80120 1 0 0 345-098-8890 456 <NA> NA
## Status
## 1 1
## 2 1
## 3 1
## 4 1
## 5 1
sum(dat$Zip*dat$Ext,na.rm=T)
## [1] 36534720
I hope answer is 36534720
Read the XML data on Baltimore restaurants from here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml
How many restaurants have zipcode 21231?
library(XML)
fileUrl2 <- "http://d396qusza40orc.cloudfront.net/getdata/data/restaurants.xml"
doc <- xmlTreeParse(file=fileUrl2,useInternal=TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
## [1] "response"
names(rootNode)
## row
## "row"
rootNode[[1]][[1]]
## <row _id="1" _uuid="93CACF6F-C8C2-4B87-95A8-8177806D5A6F" _position="1" _address="http://data.baltimorecity.gov/resource/k5ry-ef3g/1">
## <name>410</name>
## <zipcode>21206</zipcode>
## <neighborhood>Frankford</neighborhood>
## <councildistrict>2</councildistrict>
## <policedistrict>NORTHEASTERN</policedistrict>
## <location_1 human_address="{"address":"4509 BELAIR ROAD","city":"Baltimore","state":"MD","zip":""}" needs_recoding="true"/>
## </row>
zipcode <- xpathSApply(rootNode,"//zipcode",xmlValue)
length(zipcode[zipcode==21231])
## [1] 127
Answer is 127
The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:
https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv
using the fread() command load the data into an R object DT Which of the following is the fastest way to calculate the average value of the variable pwgtp15 broken down by sex using the data.table package?
#Download data which is in excel from the website
setwd("/Users/njvijay/big_data/datascience_johnhopkins/getting_cleaning_data/Week1/Quiz1")
fileUrl1 <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv"
download.file(url=fileUrl1, destfile="fsspid.csv", mode="w", method="curl")
dateDownloaded <- date()
print(dateDownloaded)
## [1] "Sun Apr 13 15:17:45 2014"
library(data.table)
DT <- fread(input="fsspid.csv", sep=",")
system.time(DT[,mean(pwgtp15),by=SEX])
## user system elapsed
## 0.006 0.000 0.007
system.time(tapply(DT$pwgtp15,DT$SEX,mean))
## user system elapsed
## 0.009 0.001 0.010
system.time(mean(DT$pwgtp15,by=DT$SEX))
## user system elapsed
## 0 0 0
#system.time(rowMeans(DT)[DT$SEX==1])
#system.time(rowMeans(DT)[DT$SEX==2])
system.time(sapply(split(DT$pwgtp15,DT$SEX),mean))
## user system elapsed
## 0.004 0.000 0.004
system.time(mean(DT[DT$SEX==1,]$pwgtp15))
## user system elapsed
## 0.034 0.004 0.039
system.time(mean(DT[DT$SEX==2,]$pwgtp15))
## user system elapsed
## 0.034 0.004 0.038