Synopsis

This is my homework report for week 2, produced with R Markdown. In this homework I perform the five data importing exercises listed under Week 2’s [Assignment section](http://uc-r.github.io/data_wrangling/week-2), which involve importing the following three data sets:

  1. Artificial Reddit user data
  2. HUD data regarding 2017 fair market rent
  3. Average daily temperatures for Cincinnati dating back to 1995

library(knitr)
# Read the labelled code chunks from the external script week-2.R
read_chunk("week-2.R")

Packages Required

To reproduce the code and results in this homework assignment, the following packages are required:

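library(XML)      # getHTMLLinks() for scraping links from a web page
library(xlsx)     # read.xlsx2() for importing a downloaded .xlsx file
library(gdata)    # read.xls() for importing an Excel file directly from a URL
library(stringr)  # str_detect(), str_locate_all(), str_sub() for string handling
library(DT)       # datatable() for displaying data frames as interactive tables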

Homework Problems

For each problem I imported the data and saved it as a data frame. I then used head() to display the first few rows and str() to display the structure of each data frame.
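
As a minimal sketch of that import-then-inspect pattern, the following uses a small inline CSV so it runs without any download. The values are made up and the object names csv_text and toy_data are hypothetical; only the column names are borrowed from the Reddit data, purely for illustration.

```r
# Hypothetical three-row CSV standing in for a downloaded file
csv_text <- "id,gender,age.range
1,0,25-34
2,0,25-34
3,1,18-24"

# Import the text as a data frame
toy_data <- read.csv(text = csv_text)

# First few rows, then the structure (column types and sample values)
head(toy_data)
str(toy_data)
```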

1. Download & import the csv file located at: (https://bradleyboehmke.github.io/public/data/reddit.csv)

reddit_data_url <- "https://bradleyboehmke.github.io/public/data/reddit.csv"
download.file(reddit_data_url, destfile="reddit_data.csv")
reddit_data <- read.csv("reddit_data.csv")
datatable(head(reddit_data))
# str() prints the structure itself and returns NULL, so it is called directly
str(reddit_data)
## 'data.frame':    32754 obs. of  14 variables:
##  $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender           : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ age.range        : Factor w/ 7 levels "18-24","25-34",..: 2 2 1 2 2 2 2 1 3 2 ...
##  $ marital.status   : Factor w/ 6 levels "Engaged","Forever Alone",..: NA NA NA NA NA 4 3 4 4 3 ...
##  $ employment.status: Factor w/ 6 levels "Employed full time",..: 1 1 2 2 1 1 1 4 1 2 ...
##  $ military.service : Factor w/ 2 levels "No","Yes": NA NA NA NA NA 1 1 1 1 1 ...
##  $ children         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ education        : Factor w/ 7 levels "Associate degree",..: 2 2 5 2 2 2 5 2 2 5 ...
##  $ country          : Factor w/ 439 levels " Canada"," Canada eh",..: 394 394 394 394 394 394 125 394 394 125 ...
##  $ state            : Factor w/ 53 levels "","Alabama","Alaska",..: 33 33 48 33 6 33 1 6 33 1 ...
##  $ income.range     : Factor w/ 8 levels "$100,000 - $149,999",..: 2 2 8 2 7 2 NA 7 2 7 ...
##  $ fav.reddit       : Factor w/ 1834 levels "","'home' page (or front page if you prefer)",..: 720 691 1511 1528 188 691 1318 571 1629 1 ...
##  $ dog.cat          : Factor w/ 3 levels "I like cats.",..: NA NA NA NA NA 2 2 2 1 1 ...
##  $ cheese           : Factor w/ 11 levels "American","Brie",..: NA NA NA NA NA 3 3 1 10 7 ...

2. Now import the above csv file directly from the url provided (without downloading to your local hard drive)

reddit_data <- read.csv(reddit_data_url)
datatable(head(reddit_data))
str(reddit_data)
## 'data.frame':    32754 obs. of  14 variables:
##  $ id               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ gender           : int  0 0 1 0 1 0 0 0 0 0 ...
##  $ age.range        : Factor w/ 7 levels "18-24","25-34",..: 2 2 1 2 2 2 2 1 3 2 ...
##  $ marital.status   : Factor w/ 6 levels "Engaged","Forever Alone",..: NA NA NA NA NA 4 3 4 4 3 ...
##  $ employment.status: Factor w/ 6 levels "Employed full time",..: 1 1 2 2 1 1 1 4 1 2 ...
##  $ military.service : Factor w/ 2 levels "No","Yes": NA NA NA NA NA 1 1 1 1 1 ...
##  $ children         : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ education        : Factor w/ 7 levels "Associate degree",..: 2 2 5 2 2 2 5 2 2 5 ...
##  $ country          : Factor w/ 439 levels " Canada"," Canada eh",..: 394 394 394 394 394 394 125 394 394 125 ...
##  $ state            : Factor w/ 53 levels "","Alabama","Alaska",..: 33 33 48 33 6 33 1 6 33 1 ...
##  $ income.range     : Factor w/ 8 levels "$100,000 - $149,999",..: 2 2 8 2 7 2 NA 7 2 7 ...
##  $ fav.reddit       : Factor w/ 1834 levels "","'home' page (or front page if you prefer)",..: 720 691 1511 1528 188 691 1318 571 1629 1 ...
##  $ dog.cat          : Factor w/ 3 levels "I like cats.",..: NA NA NA NA NA 2 2 2 1 1 ...
##  $ cheese           : Factor w/ 11 levels "American","Brie",..: NA NA NA NA NA 3 3 1 10 7 ...

3. Import the .xlsx file located at: (http://www.huduser.gov/portal/datasets/fmr/fmr2017/FY2017_4050_FMR.xlsx)

fmr_data_url <- "http://www.huduser.gov/portal/datasets/fmr/fmr2017/FY2017_4050_FMR.xlsx"
download.file(fmr_data_url, destfile="fmr_data.xlsx", mode='wb')
fmr_data <- read.xlsx2("fmr_data.xlsx", sheetIndex = 1, header = TRUE, as.data.frame = TRUE)
datatable(head(fmr_data))
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
str(fmr_data)
## 'data.frame':    4769 obs. of  21 variables:
##  $ fips2010         : Factor w/ 4769 levels "0100199999","0100399999",..: 1431 4686 4688 1 2 3 4 5 6 7 ...
##  $ fips2000         : Factor w/ 4764 levels "","0100199999",..: 1 1 1 2 3 4 5 6 7 8 ...
##  $ fmr2             : Factor w/ 508 levels "1000","1002",..: 38 227 216 369 493 221 407 407 176 176 ...
##  $ fmr0             : Factor w/ 450 levels "1004","1007",..: 359 140 54 225 386 139 297 297 129 102 ...
##  $ fmr1             : Factor w/ 479 levels "1002","1003",..: 428 125 117 297 425 124 358 358 113 88 ...
##  $ fmr3             : Factor w/ 679 levels "1000","1001",..: 308 668 642 52 295 520 149 149 534 530 ...
##  $ fmr4             : Factor w/ 763 levels "1000","1001",..: 446 39 158 363 470 723 279 279 638 95 ...
##  $ State            : Factor w/ 56 levels "1","10","11",..: 15 50 52 1 1 1 1 1 1 1 ...
##  $ Metro_code       : Factor w/ 2598 levels "METRO10180M10180",..: 451 2592 2594 384 160 625 55 55 626 627 ...
##  $ areaname         : Factor w/ 2598 levels " Santa Ana-Anaheim-Irvine, CA HUD Metro FMR Area",..: 1903 52 1723 1633 571 122 186 186 263 271 ...
##  $ county           : Factor w/ 330 levels "","1","10","100",..: 1 330 330 2 136 249 293 323 10 24 ...
##  $ CouSub           : Factor w/ 1532 levels "00100","00170",..: 234 1532 1532 1532 1532 1532 1532 1532 1532 1532 ...
##  $ countyname       : Factor w/ 1961 levels "Abbeville County",..: 462 41 1265 92 99 110 163 178 239 249 ...
##  $ county_town_name : Factor w/ 3175 levels "Abbeville County",..: 533 60 2024 136 149 165 254 277 386 401 ...
##  $ pop2010          : Factor w/ 4468 levels "","0","1","10",..: 2446 3379 3331 3354 1221 2034 1665 3425 170 1487 ...
##  $ acs_2016_2       : Factor w/ 492 levels "1000","1001",..: 37 198 187 332 409 181 381 381 143 143 ...
##  $ state_alpha      : Factor w/ 56 levels "AK","AL","AR",..: 24 4 28 2 2 2 2 2 2 2 ...
##  $ fmr_type         : Factor w/ 2 levels "40","50": 1 1 1 1 1 1 1 1 1 1 ...
##  $ metro            : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 2 2 1 1 ...
##  $ FMR_PCT_Change   : Factor w/ 1482 levels "0.775774647887324",..: 213 813 820 879 1322 993 754 754 1228 1228 ...
##  $ FMR_Dollar_Change: Factor w/ 273 levels "-1","-10","-103",..: 43 178 178 195 110 197 182 182 219 219 ...

4. Now import the above .xlsx file directly from the url provided (without downloading to your local hard drive)

fmr_data <- read.xls(fmr_data_url)
datatable(head(fmr_data))
str(fmr_data)
## 'data.frame':    4769 obs. of  21 variables:
##  $ fips2010         : num  2.3e+09 6.1e+09 7.0e+09 1.0e+08 1.0e+08 ...
##  $ fips2000         : num  NA NA NA 1e+08 1e+08 ...
##  $ fmr2             : int  1078 677 666 822 977 671 866 866 621 621 ...
##  $ fmr0             : int  755 502 411 587 807 501 665 665 491 464 ...
##  $ fmr1             : int  851 506 498 682 847 505 751 751 494 467 ...
##  $ fmr3             : int  1454 987 961 1054 1422 839 1163 1163 853 849 ...
##  $ fmr4             : int  1579 1038 1158 1425 1634 958 1298 1298 856 1094 ...
##  $ State            : int  23 60 69 1 1 1 1 1 1 1 ...
##  $ Metro_code       : Factor w/ 2598 levels "METRO10180M10180",..: 451 2592 2594 384 160 625 55 55 626 627 ...
##  $ areaname         : Factor w/ 2598 levels " Santa Ana-Anaheim-Irvine, CA HUD Metro FMR Area",..: 1903 52 1723 1633 571 122 186 186 263 271 ...
##  $ county           : int  NA 999 999 1 3 5 7 9 11 13 ...
##  $ CouSub           : int  12300 99999 99999 99999 99999 99999 99999 99999 99999 99999 ...
##  $ countyname       : Factor w/ 1961 levels "Abbeville County",..: 462 41 1265 92 99 110 163 178 239 249 ...
##  $ county_town_name : Factor w/ 3175 levels "Abbeville County",..: 533 60 2024 136 149 165 254 277 386 401 ...
##  $ pop2010          : int  341 55519 53883 54571 182265 27457 22915 57322 10914 20947 ...
##  $ acs_2016_2       : int  1109 653 642 788 873 636 840 840 569 569 ...
##  $ state_alpha      : Factor w/ 56 levels "AK","AL","AR",..: 24 4 28 2 2 2 2 2 2 2 ...
##  $ fmr_type         : int  40 40 40 40 40 40 40 40 40 40 ...
##  $ metro            : int  1 0 0 1 1 0 1 1 0 0 ...
##  $ FMR_PCT_Change   : num  0.972 1.037 1.037 1.043 1.119 ...
##  $ FMR_Dollar_Change: int  -31 24 24 34 104 35 26 26 52 52 ...
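
The two imports of the same workbook produce different column types: read.xlsx2() in problem 3 returned every column as a factor, while read.xls() in problem 4 returned integers and numerics. If the factor version had to be used for calculations, its columns would need converting first. The sketch below shows one common base R way to do that. The vector fmr2_factor is a made-up stand-in built from the first three fmr2 values in the output above, not the imported data.

```r
# Hypothetical stand-in for a numeric column that was imported as a factor
fmr2_factor <- factor(c("1078", "677", "666"))

# as.integer() on a factor returns the internal level codes, not the values
level_codes <- as.integer(fmr2_factor)

# Converting through character recovers the original numbers
fmr2_numeric <- as.numeric(as.character(fmr2_factor))
fmr2_numeric
```

read.xlsx2() also has a colClasses argument, which may be a tidier way to set column types at import time.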

5. Go to this University of Dayton webpage (http://academic.udayton.edu/kissock/http/Weather/citylistUS.htm), scroll down to Ohio and import the Cincinnati (OHCINCIN.txt) file

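ohio_page_url <- "http://academic.udayton.edu/kissock/http/Weather/citylistUS.htm"
# Collect every link on the page and keep the one pointing at the Cincinnati file
links <- getHTMLLinks(ohio_page_url)
links_cincinnati <- links[str_detect(links, "CINCIN")]
# Locate every "/" in the page URL; the last one marks the end of the base path
slash_positions <- str_locate_all(ohio_page_url, "/")
# Join the base path (up to and including the last "/") with the relative link
cincinnati_data_url <- paste0(str_sub(ohio_page_url, start = 1, end = tail(unlist(slash_positions), n = 1)), links_cincinnati)
cincinnati_data <- read.table(cincinnati_data_url)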
datatable(head(cincinnati_data))
str(cincinnati_data)
## 'data.frame':    7963 obs. of  4 variables:
##  $ V1: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V2: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V3: int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ V4: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
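
The URL-building step in problem 5 can also be sketched with base R alone. The example below assumes the scraped link is a path relative to the page's directory. The value of relative_link is a hypothetical stand-in, since the actual link text on the page is not shown here. It also reads three rows copied from the str() output above and assigns column names, which are an assumption about what V1 to V4 represent (month, day, year, average temperature).

```r
page_url <- "http://academic.udayton.edu/kissock/http/Weather/citylistUS.htm"

# Drop everything after the last "/" to get the directory of the page
base_url <- sub("[^/]*$", "", page_url)

# Hypothetical relative link; the real value comes from getHTMLLinks()
relative_link <- "OHCINCIN.txt"
file_url <- paste0(base_url, relative_link)
file_url

# First three rows from the str() output, read with assumed column names
sample_rows <- "1 1 1995 41.1
1 2 1995 22.2
1 3 1995 22.8"
sample_data <- read.table(text = sample_rows,
                          col.names = c("month", "day", "year", "avg_temp"))
str(sample_data)
```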