This is my homework report for week 2, produced with R Markdown. In this homework I complete the five data importing exercises listed under Week 2's Assignment section, which involve importing the following three data sets:
1. Artificial Reddit user data
2. HUD data on 2017 fair market rents
3. Average daily temperatures for Cincinnati dating back to 1995
The code and results throughout this homework assignment rely on the following packages:
library(gdata)    # read.xls() to read Excel files, including from a URL
library(xlsx)     # functions for reading data from Excel
library(stringr)  # str_detect() to detect a specific pattern in a string
library(XML)      # getHTMLLinks() to identify file links
library(DT)       # datatable() to make formatted data tables
For each problem I imported the data and saved it as a data frame. I then used head() to display the first few rows and str() to display the structure of each data frame.
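As a small illustration of that workflow, the sketch below applies head() and str() to a toy data frame. The `toy` object and its columns are made up for the example and are not part of the assignment data. One thing it shows is that str() prints its summary and returns NULL invisibly, so it is called on its own rather than passed to another function.

```r
# Toy data frame standing in for an imported data set (hypothetical values)
toy <- data.frame(id = 1:10, score = seq(0.5, 5, by = 0.5))

# head() returns the first six rows by default
first_rows <- head(toy)
first_rows

# str() prints a compact description of each column;
# its return value is NULL, so there is nothing useful to store
str_result <- str(toy)
is.null(str_result)
```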
setwd("C:/tauseef/data_wrangling/Data Wrangling with R (BANA 8090)")
url <- "https://bradleyboehmke.github.io/public/data/reddit.csv"
download.file(url, destfile = "reddit.csv", mode = "wb")
list.files("./")
## [1] "Data Wrangling with R (BANA 8090).Rproj"
## [2] "FY2017_4050_FMR.xlsx"
## [3] "hw1.R"
## [4] "markdown_cheetsheet.txt"
## [5] "reddit.csv"
## [6] "week-2.html"
## [7] "week-2.R"
## [8] "week-2.Rmd"
## [9] "week1.Rmd"
## [10] "week1_data_wrangaling.html"
## [11] "week1_data_wrangaling.Rmd"
reddit <- read.csv("reddit.csv")
datatable(head(reddit))
# str() prints its summary and returns NULL, so it is not wrapped in datatable()
str(reddit)
## 'data.frame': 32754 obs. of 14 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : int 0 0 1 0 1 0 0 0 0 0 ...
## $ age.range : Factor w/ 7 levels "18-24","25-34",..: 2 2 1 2 2 2 2 1 3 2 ...
## $ marital.status : Factor w/ 6 levels "Engaged","Forever Alone",..: NA NA NA NA NA 4 3 4 4 3 ...
## $ employment.status: Factor w/ 6 levels "Employed full time",..: 1 1 2 2 1 1 1 4 1 2 ...
## $ military.service : Factor w/ 2 levels "No","Yes": NA NA NA NA NA 1 1 1 1 1 ...
## $ children : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ education : Factor w/ 7 levels "Associate degree",..: 2 2 5 2 2 2 5 2 2 5 ...
## $ country : Factor w/ 439 levels " Canada"," Canada eh",..: 394 394 394 394 394 394 125 394 394 125 ...
## $ state : Factor w/ 53 levels "","Alabama","Alaska",..: 33 33 48 33 6 33 1 6 33 1 ...
## $ income.range : Factor w/ 8 levels "$100,000 - $149,999",..: 2 2 8 2 7 2 NA 7 2 7 ...
## $ fav.reddit : Factor w/ 1834 levels "","'home' page (or front page if you prefer)",..: 720 691 1511 1528 188 691 1318 571 1629 1 ...
## $ dog.cat : Factor w/ 3 levels "I like cats.",..: NA NA NA NA NA 2 2 2 1 1 ...
## $ cheese : Factor w/ 11 levels "American","Brie",..: NA NA NA NA NA 3 3 1 10 7 ...
2. Now import the above CSV file directly from the URL provided (without downloading it to your local hard drive).
url <- "https://bradleyboehmke.github.io/public/data/reddit.csv"
reddit_url <- read.csv(url, stringsAsFactors = FALSE)
datatable(head(reddit_url))
str(reddit_url)
## 'data.frame': 32754 obs. of 14 variables:
## $ id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : int 0 0 1 0 1 0 0 0 0 0 ...
## $ age.range : chr "25-34" "25-34" "18-24" "25-34" ...
## $ marital.status : chr NA NA NA NA ...
## $ employment.status: chr "Employed full time" "Employed full time" "Freelance" "Freelance" ...
## $ military.service : chr NA NA NA NA ...
## $ children : chr "No" "No" "No" "No" ...
## $ education : chr "Bachelor's degree" "Bachelor's degree" "Some college" "Bachelor's degree" ...
## $ country : chr "United States" "United States" "United States" "United States" ...
## $ state : chr "New York" "New York" "Virginia" "New York" ...
## $ income.range : chr "$150,000 or more" "$150,000 or more" "Under $20,000" "$150,000 or more" ...
## $ fav.reddit : chr "getmotivated" "gaming" "snackexchange" "spacedicks" ...
## $ dog.cat : chr NA NA NA NA ...
## $ cheese : chr NA NA NA NA ...
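The difference between the two str() listings (Factor columns in the first import, chr columns in the second) comes from the stringsAsFactors argument. The sketch below reproduces it on a three-row inline CSV. The `csv_text` sample is made up, with column names borrowed from the Reddit data. Because the default for stringsAsFactors changed from TRUE to FALSE in R 4.0.0, the argument is set explicitly in both calls.

```r
# Small inline CSV (hypothetical sample, not the real file)
csv_text <- "id,gender,age.range
1,0,25-34
2,0,25-34
3,1,18-24"

# Same text read twice, once with each setting
as_factor <- read.csv(text = csv_text, stringsAsFactors = TRUE)
as_char   <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factor$age.range)   # factor
class(as_char$age.range)     # character
levels(as_factor$age.range)  # distinct values, sorted
```

Character columns are usually easier to clean with stringr functions, which may be why the second import is the more convenient starting point.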
url <- "http://www.huduser.gov/portal/datasets/fmr/fmr2017/FY2017_4050_FMR.xlsx"
download.file(url, destfile = "FY2017_4050_FMR.xlsx", mode = "wb")
list.files("./")
## [1] "Data Wrangling with R (BANA 8090).Rproj"
## [2] "FY2017_4050_FMR.xlsx"
## [3] "hw1.R"
## [4] "markdown_cheetsheet.txt"
## [5] "reddit.csv"
## [6] "week-2.html"
## [7] "week-2.R"
## [8] "week-2.Rmd"
## [9] "week1.Rmd"
## [10] "week1_data_wrangaling.html"
## [11] "week1_data_wrangaling.Rmd"
FMR <- read.xls("FY2017_4050_FMR.xlsx")
datatable(head(FMR))
str(FMR)
## 'data.frame': 4769 obs. of 21 variables:
## $ fips2010 : num 2.3e+09 6.1e+09 7.0e+09 1.0e+08 1.0e+08 ...
## $ fips2000 : num NA NA NA 1e+08 1e+08 ...
## $ fmr2 : int 1078 677 666 822 977 671 866 866 621 621 ...
## $ fmr0 : int 755 502 411 587 807 501 665 665 491 464 ...
## $ fmr1 : int 851 506 498 682 847 505 751 751 494 467 ...
## $ fmr3 : int 1454 987 961 1054 1422 839 1163 1163 853 849 ...
## $ fmr4 : int 1579 1038 1158 1425 1634 958 1298 1298 856 1094 ...
## $ State : int 23 60 69 1 1 1 1 1 1 1 ...
## $ Metro_code : Factor w/ 2598 levels "METRO10180M10180",..: 451 2592 2594 384 160 625 55 55 626 627 ...
## $ areaname : Factor w/ 2598 levels " Santa Ana-Anaheim-Irvine, CA HUD Metro FMR Area",..: 1903 52 1723 1633 571 122 186 186 263 271 ...
## $ county : int NA 999 999 1 3 5 7 9 11 13 ...
## $ CouSub : int 12300 99999 99999 99999 99999 99999 99999 99999 99999 99999 ...
## $ countyname : Factor w/ 1961 levels "Abbeville County",..: 462 41 1265 92 99 110 163 178 239 249 ...
## $ county_town_name : Factor w/ 3175 levels "Abbeville County",..: 533 60 2024 136 149 165 254 277 386 401 ...
## $ pop2010 : int 341 55519 53883 54571 182265 27457 22915 57322 10914 20947 ...
## $ acs_2016_2 : int 1109 653 642 788 873 636 840 840 569 569 ...
## $ state_alpha : Factor w/ 56 levels "AK","AL","AR",..: 24 4 28 2 2 2 2 2 2 2 ...
## $ fmr_type : int 40 40 40 40 40 40 40 40 40 40 ...
## $ metro : int 1 0 0 1 1 0 1 1 0 0 ...
## $ FMR_PCT_Change : num 0.972 1.037 1.037 1.043 1.119 ...
## $ FMR_Dollar_Change: int -31 24 24 34 104 35 26 26 52 52 ...
FMR_url <- read.xls(url)  # import the Excel file directly from the URL
datatable(head(FMR_url))
str(FMR_url)
## 'data.frame': 4769 obs. of 21 variables:
## $ fips2010 : num 2.3e+09 6.1e+09 7.0e+09 1.0e+08 1.0e+08 ...
## $ fips2000 : num NA NA NA 1e+08 1e+08 ...
## $ fmr2 : int 1078 677 666 822 977 671 866 866 621 621 ...
## $ fmr0 : int 755 502 411 587 807 501 665 665 491 464 ...
## $ fmr1 : int 851 506 498 682 847 505 751 751 494 467 ...
## $ fmr3 : int 1454 987 961 1054 1422 839 1163 1163 853 849 ...
## $ fmr4 : int 1579 1038 1158 1425 1634 958 1298 1298 856 1094 ...
## $ State : int 23 60 69 1 1 1 1 1 1 1 ...
## $ Metro_code : Factor w/ 2598 levels "METRO10180M10180",..: 451 2592 2594 384 160 625 55 55 626 627 ...
## $ areaname : Factor w/ 2598 levels " Santa Ana-Anaheim-Irvine, CA HUD Metro FMR Area",..: 1903 52 1723 1633 571 122 186 186 263 271 ...
## $ county : int NA 999 999 1 3 5 7 9 11 13 ...
## $ CouSub : int 12300 99999 99999 99999 99999 99999 99999 99999 99999 99999 ...
## $ countyname : Factor w/ 1961 levels "Abbeville County",..: 462 41 1265 92 99 110 163 178 239 249 ...
## $ county_town_name : Factor w/ 3175 levels "Abbeville County",..: 533 60 2024 136 149 165 254 277 386 401 ...
## $ pop2010 : int 341 55519 53883 54571 182265 27457 22915 57322 10914 20947 ...
## $ acs_2016_2 : int 1109 653 642 788 873 636 840 840 569 569 ...
## $ state_alpha : Factor w/ 56 levels "AK","AL","AR",..: 24 4 28 2 2 2 2 2 2 2 ...
## $ fmr_type : int 40 40 40 40 40 40 40 40 40 40 ...
## $ metro : int 1 0 0 1 1 0 1 1 0 0 ...
## $ FMR_PCT_Change : num 0.972 1.037 1.037 1.043 1.119 ...
## $ FMR_Dollar_Change: int -31 24 24 34 104 35 26 26 52 52 ...
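Since the local-file import and the URL import should yield the same table, a quick equality check can confirm it. The sketch below shows the idea on two small stand-in data frames. `local_copy` and `url_copy` are hypothetical names holding a few values taken from the output above; with the real objects the call would be identical(FMR, FMR_url).

```r
# Stand-ins for the two imports (first three rows of fmr0 and fmr1)
local_copy <- data.frame(fmr0 = c(755L, 502L, 411L), fmr1 = c(851L, 506L, 498L))
url_copy   <- data.frame(fmr0 = c(755L, 502L, 411L), fmr1 = c(851L, 506L, 498L))

# identical() is TRUE only when values, types, and attributes all match
same_data <- identical(local_copy, url_copy)
same_data

# all.equal() describes what differs when the objects are not the same
url_copy$fmr1[2] <- 507L
diff_report <- all.equal(local_copy, url_copy)
diff_report
```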
url<-"http://academic.udayton.edu/kissock/http/Weather/"
links<-getHTMLLinks(paste0(url,"citylistUS.htm"))
links
## [1] "gsod95-current/ALBIRMIN.txt"
## [2] "gsod95-current/ALHUNTSV.txt"
## [3] "gsod95-current/ALMOBILE.txt"
## [4] "gsod95-current/ALMONTGO.txt"
## [5] "gsod95-current/AKANCHOR.txt"
## [6] "gsod95-current/AKFAIRBA.txt"
## [7] "gsod95-current/AKJUNEAU.txt"
## [8] "gsod95-current/AZFLAGST.txt"
## [9] "gsod95-current/AZPHOENI.txt"
## [10] "gsod95-current/AZTUCSON.txt"
## [11] "gsod95-current/AZYUMA.txt"
## [12] "gsod95-current/ARFTSMIT.txt"
## [13] "gsod95-current/ARLIROCK.txt"
## [14] "gsod95-current/CAFRESNO.txt"
## [15] "gsod95-current/CALOSANG.txt"
## [16] "gsod95-current/CASACRAM.txt"
## [17] "gsod95-current/CASANDIE.txt"
## [18] "gsod95-current/CASANFRA.txt"
## [19] "gsod95-current/COCOSPGS.txt"
## [20] "gsod95-current/CODENVER.txt"
## [21] "gsod95-current/COGRNDJU.txt"
## [22] "gsod95-current/COPUEBLO.txt"
## [23] "gsod95-current/CTBRIDGE.txt"
## [24] "gsod95-current/CTHARTFO.txt"
## [25] "gsod95-current/DEWILMIN.txt"
## [26] "gsod95-current/MDWASHDC.txt"
## [27] "gsod95-current/FLDAYTNA.txt"
## [28] "gsod95-current/FLJACKSV.txt"
## [29] "gsod95-current/FLMIAMIB.txt"
## [30] "gsod95-current/FLORLAND.txt"
## [31] "http://www.engr.udayton.edu/faculty/jkissock/gsod/FLTALLAH.txt"
## [32] "gsod95-current/FLTAMPA.txt"
## [33] "gsod95-current/FLWPALMB.txt"
## [34] "gsod95-current/GAATLANT.txt"
## [35] "gsod95-current/GACOLMBS.txt"
## [36] "gsod95-current/GAMACON.txt"
## [37] "gsod95-current/GASAVANN.txt"
## [38] "gsod95-current/HIHONOLU.txt"
## [39] "gsod95-current/IDBOISE.txt"
## [40] "gsod95-current/IDPOCATE.txt"
## [41] "gsod95-current/ILCHICAG.txt"
## [42] "gsod95-current/ILPEORIA.txt"
## [43] "gsod95-current/ILROCKFO.txt"
## [44] "gsod95-current/ILSPRING.txt"
## [45] "gsod95-current/INEVANSV.txt"
## [46] "gsod95-current/INFTWAYN.txt"
## [47] "gsod95-current/ININDIAN.txt"
## [48] "gsod95-current/INSOBEND.txt"
## [49] "gsod95-current/IADESMOI.txt"
## [50] "gsod95-current/IASIOCTY.txt"
## [51] "gsod95-current/KSGOODLA.txt"
## [52] "gsod95-current/KSTOPEKA.txt"
## [53] "gsod95-current/KSWICHIT.txt"
## [54] "gsod95-current/KYLEXING.txt"
## [55] "gsod95-current/KYLOUISV.txt"
## [56] "gsod95-current/KYPADUCA.txt"
## [57] "gsod95-current/LABATONR.txt"
## [58] "gsod95-current/LALAKECH.txt"
## [59] "gsod95-current/LANEWORL.txt"
## [60] "gsod95-current/LASHREVE.txt"
## [61] "gsod95-current/MECARIBO.txt"
## [62] "gsod95-current/MEPORTLA.txt"
## [63] "gsod95-current/MDBALTIM.txt"
## [64] "gsod95-current/MDWASHDC.txt"
## [65] "gsod95-current/MABOSTON.txt"
## [66] "gsod95-current/MIDETROI.txt"
## [67] "gsod95-current/MIFLINT.txt"
## [68] "gsod95-current/MIGRNDRA.txt"
## [69] "gsod95-current/MILANSIN.txt"
## [70] "gsod95-current/MISTEMAR.txt"
## [71] "gsod95-current/MNDULUTH.txt"
## [72] "gsod95-current/MNMINPLS.txt"
## [73] "gsod95-current/MSJACKSO.txt"
## [74] "gsod95-current/MSTUPELO.txt"
## [75] "gsod95-current/MOKANCTY.txt"
## [76] "gsod95-current/MOSPRING.txt"
## [77] "gsod95-current/MOSTLOUI.txt"
## [78] "gsod95-current/MTBILLIN.txt"
## [79] "gsod95-current/MTGRFALL.txt"
## [80] "gsod95-current/MTHELENA.txt"
## [81] "gsod95-current/NELINCOL.txt"
## [82] "gsod95-current/NENPLATT.txt"
## [83] "gsod95-current/NEOMAHA.txt"
## [84] "gsod95-current/NVRENO.txt"
## [85] "gsod95-current/NVLASVEG.txt"
## [86] "gsod95-current/NHCONCOR.txt"
## [87] "gsod95-current/NJATLCTY.txt"
## [88] "gsod95-current/NJNEWARK.txt"
## [89] "gsod95-current/NMALBUQU.txt"
## [90] "gsod95-current/NYALBANY.txt"
## [91] "gsod95-current/NYBUFFAL.txt"
## [92] "gsod95-current/NYNEWYOR.txt"
## [93] "gsod95-current/NYROCHES.txt"
## [94] "gsod95-current/NYSYRACU.txt"
## [95] "gsod95-current/NCASHEVI.txt"
## [96] "gsod95-current/NCCHARLO.txt"
## [97] "gsod95-current/NCGRNSBO.txt"
## [98] "gsod95-current/NCRALEIG.txt"
## [99] "gsod95-current/NDBISMAR.txt"
## [100] "gsod95-current/NDFARGO.txt"
## [101] "gsod95-current/OHAKRON.txt"
## [102] "gsod95-current/OHCINCIN.txt"
## [103] "gsod95-current/OHCLEVEL.txt"
## [104] "gsod95-current/OHCOLMBS.txt"
## [105] "gsod95-current/OHDAYTON.txt"
## [106] "gsod95-current/OHTOLEDO.txt"
## [107] "gsod95-current/OHYOUNGS.txt"
## [108] "gsod95-current/OKOKLCTY.txt"
## [109] "gsod95-current/OKTULSA.txt"
## [110] "gsod95-current/OREUGENE.txt"
## [111] "gsod95-current/ORMEDFOR.txt"
## [112] "gsod95-current/ORPORTLA.txt"
## [113] "gsod95-current/ORSALEM.txt"
## [114] "gsod95-current/PAALLENT.txt"
## [115] "gsod95-current/PAERIE.txt"
## [116] "gsod95-current/PAHARRIS.txt"
## [117] "gsod95-current/PAPHILAD.txt"
## [118] "gsod95-current/PAPITTSB.txt"
## [119] "gsod95-current/PAWILKES.txt"
## [120] "gsod95-current/RIPROVID.txt"
## [121] "gsod95-current/SCCHARLE.txt"
## [122] "gsod95-current/SCCOLMBA.txt"
## [123] "gsod95-current/SDRAPCTY.txt"
## [124] "gsod95-current/SDRAPCTY.txt"
## [125] "gsod95-current/TNCHATTA.txt"
## [126] "gsod95-current/TNKNOXVI.txt"
## [127] "gsod95-current/TNMEMPHI.txt"
## [128] "gsod95-current/TNNASHVI.txt"
## [129] "gsod95-current/TXABILEN.txt"
## [130] "gsod95-current/TXAMARIL.txt"
## [131] "gsod95-current/TXAUSTIN.txt"
## [132] "gsod95-current/TXBROWNS.txt"
## [133] "gsod95-current/TXCORPUS.txt"
## [134] "gsod95-current/TXDALLAS.txt"
## [135] "gsod95-current/TXELPASO.txt"
## [136] "gsod95-current/TXHOUSTO.txt"
## [137] "gsod95-current/TXLUBBOC.txt"
## [138] "gsod95-current/TXMIDLAN.txt"
## [139] "gsod95-current/TXSANANG.txt"
## [140] "gsod95-current/TXSANANT.txt"
## [141] "gsod95-current/TXWACO.txt"
## [142] "gsod95-current/TXWICHFA.txt"
## [143] "gsod95-current/UTSALTLK.txt"
## [144] "gsod95-current/VTBURLIN.txt"
## [145] "gsod95-current/VANORFOL.txt"
## [146] "gsod95-current/VARICHMO.txt"
## [147] "gsod95-current/VAROANOK.txt"
## [148] "gsod95-current/WASEATTL.txt"
## [149] "gsod95-current/WASPOKAN.txt"
## [150] "gsod95-current/WAYAKIMA.txt"
## [151] "gsod95-current/WVCHARLE.txt"
## [152] "gsod95-current/WVELKINS.txt"
## [153] "gsod95-current/WIGREBAY.txt"
## [154] "gsod95-current/WIMADISO.txt"
## [155] "gsod95-current/WIMILWAU.txt"
## [156] "gsod95-current/WYCASPER.txt"
## [157] "gsod95-current/WYCHEYEN.txt"
## [158] "gsod95-current/PRSANJUA.txt"
## [159] "citylistUS.htm"
## [160] "default.htm"
# keep only the link for the Cincinnati file
link_data <- links[str_detect(links, fixed("OHCINCIN.txt"))]
# paste the base URL onto the relative link to get the full URL of the data set
filenames <- paste0(url, link_data)
filenames
## [1] "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
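One caveat with pasting the base URL onto every link: entry 31 in the list above is already an absolute URL, so a blanket paste0() would produce a broken address for it. That does not affect the Cincinnati file, but a more general version might look like the sketch below. `to_absolute` is a hypothetical helper written for this example, not a function from the XML or stringr packages.

```r
# Hypothetical helper: prepend the base URL only to relative links
to_absolute <- function(links, base) {
  is_absolute <- grepl("^https?://", links)
  ifelse(is_absolute, links, paste0(base, links))
}

base <- "http://academic.udayton.edu/kissock/http/Weather/"
sample_links <- c(
  "gsod95-current/OHCINCIN.txt",
  "http://www.engr.udayton.edu/faculty/jkissock/gsod/FLTALLAH.txt"
)

full_urls <- to_absolute(sample_links, base)
full_urls
```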
# Open question: how do you determine the delimiter of a .txt file hosted online?
# Here sep = "" treats any run of whitespace as the delimiter.
ohcincin <- read.table(filenames, header = FALSE, sep = "")
datatable(head(ohcincin))
str(ohcincin)
## 'data.frame': 7963 obs. of 4 variables:
## $ V1: int 1 1 1 1 1 1 1 1 1 1 ...
## $ V2: int 1 2 3 4 5 6 7 8 9 10 ...
## $ V3: int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ V4: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
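On the delimiter question raised in the code comment: one common approach is to look at the first few raw lines with readLines() before choosing `sep`. The sketch below does that on an inline sample and then names the columns. The sample's spacing is made up, the values are the first three rows shown above, and the column names (`month`, `day`, `year`, `avg_temp`) are an assumption about what V1 to V4 hold, based on the values in the str() output.

```r
# Inline stand-in for the top of the file (spacing is hypothetical)
raw_text <- " 1   1   1995   41.1
 1   2   1995   22.2
 1   3   1995   22.8"

# Look at the raw lines first to see how the fields are separated
con <- textConnection(raw_text)
raw_lines <- readLines(con, n = 2)
close(con)
raw_lines

# sep = "" treats any run of spaces or tabs as a single delimiter
temps <- read.table(text = raw_text, header = FALSE, sep = "",
                    col.names = c("month", "day", "year", "avg_temp"))

# Combine the three date parts into one Date column
temps$date <- as.Date(ISOdate(temps$year, temps$month, temps$day))
str(temps)
```

For the real file, the same inspection step would be readLines(filenames, n = 5), which reads only the first five lines from the URL.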