The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here: https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv and load the data into R. The code book, which describes the variable names, is here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FPUMSDataDict06.pdf

How many properties are worth $1,000,000 or more?

From the code book we know that VAL is the variable that specifies property value, and that code 24 means a value of $1,000,000 or more.
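The transcript that follows starts from a file already saved locally as getdata-data-ss06hid.csv, so the download step the question asks for is not shown. A minimal sketch of that step might look like the following; the helper name fetch_if_missing is hypothetical, and the choice of mode = "wb" is an assumption that matters mainly for binary files on Windows.

```r
# Hypothetical helper (not part of the original walkthrough):
# download a file only if it is not already on disk.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    # mode = "wb" writes the bytes unchanged, which keeps binary files intact
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Example call, left commented out so the sketch does not touch the network:
# fetch_if_missing(
#   "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv",
#   "getdata-data-ss06hid.csv"
# )
```

Skipping the download when the file already exists avoids fetching the same data again each time the script is re-run.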

getwd()
## [1] "D:/data/hopkins-clean"
housing<-read.csv('getdata-data-ss06hid.csv')
str(housing)
## 'data.frame':    6496 obs. of  188 variables:
##  $ RT      : Factor w/ 1 level "H": 1 1 1 1 1 1 1 1 1 1 ...
##  $ SERIALNO: int  186 306 395 506 835 989 1861 2120 2278 2428 ...
##  $ DIVISION: int  8 8 8 8 8 8 8 8 8 8 ...
##  $ PUMA    : int  700 700 100 700 800 700 700 200 400 500 ...
##  $ REGION  : int  4 4 4 4 4 4 4 4 4 4 ...
##  $ ST      : int  16 16 16 16 16 16 16 16 16 16 ...
##  $ ADJUST  : int  1015675 1015675 1015675 1015675 1015675 1015675 1015675 1015675 1015675 1015675 ...
##  $ WGTP    : int  89 310 106 240 118 115 0 35 47 51 ...
##  $ NP      : int  4 1 2 4 4 4 1 1 2 2 ...
##  $ TYPE    : int  1 1 1 1 1 1 2 1 1 1 ...
##  $ ACR     : int  1 NA 1 1 2 1 NA 1 1 1 ...
##  $ AGS     : int  NA NA NA NA 1 NA NA NA NA NA ...
##  $ BDS     : int  4 1 3 4 5 3 NA 2 3 2 ...
##  $ BLD     : int  2 7 2 2 2 2 NA 1 2 1 ...
##  $ BUS     : int  2 NA 2 2 2 2 NA 2 2 2 ...
##  $ CONP    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ ELEP    : int  180 60 70 40 250 130 NA 40 2 20 ...
##  $ FS      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FULP    : int  2 2 2 2 2 2 NA 480 2 2 ...
##  $ GASP    : int  3 3 30 80 3 3 NA 3 3 140 ...
##  $ HFL     : int  3 3 1 1 3 3 NA 4 3 1 ...
##  $ INSP    : int  600 NA 200 200 700 250 NA NA 770 120 ...
##  $ KIT     : int  1 1 1 1 1 1 NA 1 1 1 ...
##  $ MHP     : int  NA NA NA NA NA NA NA NA NA 220 ...
##  $ MRGI    : int  1 NA NA 1 1 1 NA NA 1 NA ...
##  $ MRGP    : int  1300 NA NA 860 1900 700 NA NA 750 NA ...
##  $ MRGT    : int  1 NA NA 1 1 1 NA NA 1 NA ...
##  $ MRGX    : int  1 NA 3 1 1 1 NA NA 1 3 ...
##  $ PLM     : int  1 1 1 1 1 1 NA 1 1 1 ...
##  $ RMS     : int  9 2 7 6 7 6 NA 4 6 5 ...
##  $ RNTM    : int  NA 2 NA NA NA NA NA NA NA NA ...
##  $ RNTP    : int  NA 600 NA NA NA NA NA NA NA NA ...
##  $ SMP     : int  NA NA NA 400 650 400 NA NA NA NA ...
##  $ TEL     : int  1 1 1 1 1 1 NA 1 1 1 ...
##  $ TEN     : int  1 3 2 1 1 1 NA 4 1 2 ...
##  $ VACS    : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ VAL     : int  17 NA 18 19 20 15 NA NA 13 1 ...
##  $ VEH     : int  3 1 2 3 5 2 NA 1 2 2 ...
##  $ WATP    : int  840 1 50 500 2 1200 NA 650 660 2 ...
##  $ YBL     : int  5 3 5 2 3 5 NA 5 3 5 ...
##  $ FES     : int  2 NA 7 1 1 2 NA NA 2 NA ...
##  $ FINCP   : int  105600 NA 9400 66000 93000 61000 NA NA 209000 NA ...
##  $ FPARC   : int  2 NA 2 1 2 1 NA NA 4 NA ...
##  $ GRNTP   : int  NA 660 NA NA NA NA NA NA NA NA ...
##  $ GRPIP   : int  NA 23 NA NA NA NA NA NA NA NA ...
##  $ HHL     : int  1 1 1 1 1 1 NA 1 1 2 ...
##  $ HHT     : int  1 4 3 1 1 1 NA 6 1 5 ...
##  $ HINCP   : int  105600 34000 9400 66000 93000 61000 NA 10400 209000 35400 ...
##  $ HUGCL   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ HUPAC   : int  2 4 2 1 2 1 NA 4 4 4 ...
##  $ HUPAOC  : int  2 4 2 1 2 1 NA 4 4 4 ...
##  $ HUPARC  : int  2 4 2 1 2 1 NA 4 4 4 ...
##  $ LNGI    : int  1 1 1 1 1 1 NA 1 1 2 ...
##  $ MV      : int  4 3 2 3 1 4 5 5 1 1 ...
##  $ NOC     : int  2 0 1 2 1 2 NA 0 0 0 ...
##  $ NPF     : int  4 NA 2 4 4 4 NA NA 2 NA ...
##  $ NPP     : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ NR      : int  0 0 0 0 0 0 NA 0 0 1 ...
##  $ NRC     : int  2 0 1 2 1 2 NA 0 0 0 ...
##  $ OCPIP   : int  18 NA 23 26 36 26 NA NA 5 7 ...
##  $ PARTNER : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ PSF     : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ R18     : int  1 0 1 1 1 1 NA 0 0 0 ...
##  $ R60     : int  0 0 0 0 0 0 NA 1 1 0 ...
##  $ R65     : int  0 0 0 0 0 0 NA 1 1 0 ...
##  $ RESMODE : int  1 2 1 2 1 2 NA 2 1 1 ...
##  $ SMOCP   : int  1550 NA 179 1422 2800 1330 NA NA 805 196 ...
##  $ SMX     : int  3 NA NA 1 1 2 NA NA 3 NA ...
##  $ SRNT    : int  0 1 0 0 0 0 NA 1 0 0 ...
##  $ SVAL    : int  1 0 1 1 1 1 NA 0 1 0 ...
##  $ TAXP    : int  24 NA 16 31 25 7 NA NA 22 4 ...
##  $ WIF     : int  3 NA 1 2 3 1 NA NA 1 NA ...
##  $ WKEXREL : int  2 NA 13 2 1 7 NA NA 6 NA ...
##  $ WORKSTAT: int  3 NA 13 1 1 3 NA NA 3 NA ...
##  $ FACRP   : int  0 0 0 0 0 0 NA 0 0 1 ...
##  $ FAGSP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FBDSP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FBLDP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FBUSP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FCONP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FELEP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FFSP    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ FFULP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FGASP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FHFLP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FINSP   : int  0 0 0 0 0 1 NA 0 0 0 ...
##  $ FKITP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FMHP    : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FMRGIP  : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FMRGP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FMRGTP  : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FMRGXP  : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FMVYP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FPLMP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FRMSP   : int  0 0 0 0 0 0 NA 0 0 1 ...
##  $ FRNTMP  : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FRNTP   : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FSMP    : int  0 0 0 0 0 0 NA 0 0 0 ...
##  $ FSMXHP  : int  0 0 0 0 0 0 NA 0 0 0 ...
##   [list output truncated]
nrow(housing[housing$VAL==24 & !is.na(housing$VAL),])
## [1] 53
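The !is.na() guard in the count above matters because comparing NA with 24 yields NA, and an NA row index produces an extra all-NA row. The sketch below illustrates this on a small made-up data frame (toy_housing and its values are assumptions, not survey data) and shows two equivalent ways to get the same count.

```r
# Toy stand-in for the housing data (values are made up, not from the survey)
toy_housing <- data.frame(SERIALNO = 1:7,
                          VAL = c(17, NA, 24, 19, 24, NA, 1))

# Without the guard, each NA in VAL contributes an all-NA row to the subset
naive <- nrow(toy_housing[toy_housing$VAL == 24, ])

# The form used above: keep only rows where VAL is known and equals 24
guarded <- nrow(toy_housing[toy_housing$VAL == 24 & !is.na(toy_housing$VAL), ])

# Equivalent counts that avoid building a subset
by_sum   <- sum(toy_housing$VAL == 24, na.rm = TRUE)
by_which <- length(which(toy_housing$VAL == 24))  # which() drops NA as well

print(c(naive = naive, guarded = guarded, by_sum = by_sum, by_which = by_which))
```

On this toy data the unguarded count is inflated by the two missing values, while the other three forms agree.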

Excel

Download the Excel spreadsheet on the Natural Gas Acquisition Program here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx

Read rows 18-23 and columns 7-15 into R and assign the result to a variable called dat. Then compute the sum of dat$Zip*dat$Ext, ignoring missing values.

library(xlsx)
## Warning: package 'xlsx' was built under R version 3.1.2
## Loading required package: rJava
## Warning: package 'rJava' was built under R version 3.1.2
## Loading required package: xlsxjars
## Warning: package 'xlsxjars' was built under R version 3.1.2
dat <- read.xlsx('getdata-data-DATA.gov_NGAP.xlsx', sheetIndex=1,rowIndex=18:23, colIndex=7:15)
sum(dat$Zip*dat$Ext,na.rm=TRUE)
## [1] 36534720
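To make the role of na.rm in the sum above concrete, here is a small sketch on invented values. The data frame dat_toy and its numbers are assumptions for illustration only and have nothing to do with the actual spreadsheet.

```r
# Made-up values, only to show how missing entries propagate
dat_toy <- data.frame(Zip = c(10001, 10002, NA, 10004),
                      Ext = c(2, NA, 5, 1))

# The element-wise product is NA wherever either factor is NA
products <- dat_toy$Zip * dat_toy$Ext

# na.rm = TRUE drops those NA products before summing
total <- sum(products, na.rm = TRUE)
print(products)
print(total)
```

Without na.rm = TRUE the sum itself would be NA as soon as a single product is missing.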

XML

Read the XML data on Baltimore restaurants from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml

How many restaurants have zipcode 21231?

library(XML)
## Warning: package 'XML' was built under R version 3.1.2
library(RCurl)
## Loading required package: bitops
## 
## Attaching package: 'RCurl'
## 
## The following object is masked from 'package:rJava':
## 
##     clone
urlXML<- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
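# Not used: the XML package cannot fetch https URLs by itself
#doc<- xmlTreeParse(urlXML, useInternal=TRUE)
# Fetch the text with RCurl instead (ssl.verifypeer=FALSE skips certificate verification)
xData<-getURL(urlXML, ssl.verifypeer=FALSE)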
doc<-xmlParse(xData)
roots<-xmlRoot(doc)
xmlName(roots)
## [1] "response"
names(roots)
##   row 
## "row"
zips<-xpathSApply(roots, "//zipcode",xmlValue)
length(zips[zips=="21231"])
## [1] 127
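The XPath step above can be tried without a network connection by parsing a small XML string. In the sketch below the element layout is modeled on the response and row names printed above, but the exact nesting and all values are assumptions. The code is guarded so that it only runs when the XML package is installed.

```r
# Toy document (made-up restaurants; layout is an assumption)
xml_text <- paste0(
  "<response><row>",
  "<row><name>A</name><zipcode>21231</zipcode></row>",
  "<row><name>B</name><zipcode>21202</zipcode></row>",
  "<row><name>C</name><zipcode>21231</zipcode></row>",
  "</row></response>"
)

if (requireNamespace("XML", quietly = TRUE)) {
  # asText = TRUE tells xmlParse() the argument is XML content, not a file name
  doc_toy <- XML::xmlParse(xml_text, asText = TRUE)

  # "//zipcode" matches every zipcode element at any depth
  zips_toy <- XML::xpathSApply(doc_toy, "//zipcode", XML::xmlValue)

  # Counting TRUE values is equivalent to length(zips[zips == "21231"])
  n_21231 <- sum(zips_toy == "21231")
  print(table(zips_toy))
}
```

Printing table(zips_toy) gives the count for every zip code at once, which is a quick sanity check on the single value being asked for.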

data.table

The American Community Survey distributes downloadable data about United States communities. Download the 2006 microdata survey about housing for the state of Idaho using download.file() from here:

https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv

Using the fread() command, load the data into an R object called DT, then time several ways of calculating the mean of pwgtp15 broken down by SEX.

library(data.table)
DT<-fread("getdata-data-ss06pid.csv")
system.time(tapply(DT$pwgtp15,DT$SEX,mean))
##    user  system elapsed 
##    0.02    0.00    0.01
system.time(sapply(split(DT$pwgtp15,DT$SEX),mean))
##    user  system elapsed 
##       0       0       0
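# Skipped: rowMeans() averages across the columns of each row, so it does not give the mean of pwgtp15 by SEX
#system.time({rowMeans(DT)[DT$SEX==1]; rowMeans(DT)[DT$SEX==2]})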
system.time({mean(DT[DT$SEX==1,]$pwgtp15); mean(DT[DT$SEX==2,]$pwgtp15)})
##    user  system elapsed 
##    0.11    0.00    0.11
# Note: mean() ignores the by argument, so this returns one overall mean, not a mean per SEX
system.time(mean(DT$pwgtp15,by=DT$SEX))
##    user  system elapsed 
##       0       0       0
system.time(DT[,mean(pwgtp15),by=SEX])
##    user  system elapsed 
##       0       0       0
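Timings this close to zero are hard to rank, and speed only matters among the candidates that return the right answer. The sketch below uses random toy data (toy_p is an assumption, not the survey file) to check which forms actually produce one mean per SEX, and repeats a call in a loop to get a less noisy timing. The data.table part is guarded so that it only runs when the package is installed.

```r
set.seed(1)
# Toy stand-in for the two survey columns (random values, not survey data)
toy_p <- data.frame(SEX = sample(1:2, 1000, replace = TRUE),
                    pwgtp15 = sample(1:500, 1000, replace = TRUE))

by_tapply <- tapply(toy_p$pwgtp15, toy_p$SEX, mean)
by_split  <- sapply(split(toy_p$pwgtp15, toy_p$SEX), mean)
overall   <- mean(toy_p$pwgtp15, by = toy_p$SEX)  # by is silently ignored

# The two grouped forms agree; the mean(..., by = ) form is a single number
print(all.equal(as.vector(by_tapply), as.vector(by_split)))
print(length(overall))

# Repeating a call makes very short timings easier to compare
timing <- system.time(for (i in 1:200) tapply(toy_p$pwgtp15, toy_p$SEX, mean))
print(timing[["elapsed"]])

if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)
  toy_dt <- as.data.table(toy_p)
  # Groups come back in order of first appearance, so sort before comparing
  by_dt <- toy_dt[, mean(pwgtp15), by = SEX][order(SEX)]
  print(all.equal(by_dt$V1, as.vector(by_split)))
}
```

The loop count of 200 is arbitrary; the point is only that a repeated measurement separates the candidates better than a single call that rounds to zero.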