Training workshop: Introduction to Data Science with the R language.
Rafael Resendiz Ramirez
August 20, 2015
Extraction, scanning, cleaning and handling of data.
First, we must learn to collect our data, bearing in mind that it usually arrives through various media and in various formats. For this reason, we will write some scripts, both to connect to the site where the information we need is hosted and to read the file in question. This time, instead of loading a file from a predetermined location, we will use a function that lets us choose the file we want to work with.
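The file-selection step described above can be sketched with base R's `file.choose()`, which opens the system file dialog in an interactive session. The helper name `choose_input_file` and its `fallback` argument are assumptions made for illustration; they are not part of the workshop scripts.

```r
# Hypothetical helper: ask the user for a file in an interactive session,
# otherwise fall back to a path supplied by the caller.
choose_input_file <- function(fallback = NULL) {
  if (interactive()) {
    file.choose()   # opens the operating system's file picker
  } else {
    fallback        # used when the script runs non-interactively (e.g. Rscript)
  }
}

# Small temporary CSV so the sketch is self-contained.
demo_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3, value = c(10, 20, 30)), demo_path, row.names = FALSE)

chosen  <- choose_input_file(fallback = demo_path)
dataset <- read.csv(chosen)
head(dataset)
```

Run interactively, the dialog appears and `fallback` is ignored; run with `Rscript`, the fallback path is used instead.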
Script_2 Reading Excel files
Example file: “Natural Gas Acquisition Program”.
# Download the file to load it into our work environment
fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2FDATA.gov_NGAP.xlsx?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "/directory_files/filename.xlsx", method = "curl")
list.files("../directory")
dateDownloaded <- date()
dateDownloaded
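A common variation on the download step is to create the target directory first and skip the download when the file is already present. The sketch below assumes a hypothetical helper named `fetch_file`; to stay runnable without a network connection it copies a local temporary file through a `file://` URL rather than contacting the real server.

```r
# Hypothetical helper: download `url` to `destfile` only if it is not there yet.
fetch_file <- function(url, destfile) {
  # Make sure the destination directory exists before downloading.
  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, mode = "wb", quiet = TRUE)
  }
  invisible(destfile)
}

# Stand-in for the remote file: a local file addressed with a file:// URL.
source_path <- tempfile(fileext = ".txt")
writeLines("demo", source_path)
prefix   <- if (.Platform$OS.type == "windows") "file:///" else "file://"
demo_url <- paste0(prefix, normalizePath(source_path, winslash = "/"))

dest <- file.path(tempdir(), "downloads", "copy.txt")
fetch_file(demo_url, dest)

# Record when the data was obtained, as in the script above.
dateDownloaded <- date()
list.files(dirname(dest))
```

With the real data, `demo_url` would be replaced by `fileUrl` and `dest` by the path of the `.xlsx` file.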
Script_3
# Download a file from the web
# 2006 microdata survey about housing for the state of Idaho
# Direct Example
# fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06hid.csv?accessType=DOWNLOAD"
# Copy the following line and replace the word "link" with the address of the file (it must include the extension)
# fileURL <- "link?accessType=DOWNLOAD"
fileURL <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv?accessType=DOWNLOAD"
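Once a CSV file such as this one has been downloaded, it could be loaded and inspected with `read.csv()`. The sketch below is only an illustration: it writes a tiny stand-in file, and the column names (`household_id`, `value_code`, `state_code`) are invented placeholders, not the actual survey variables.

```r
# Stand-in for the downloaded survey file: a tiny CSV with made-up columns.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("household_id,value_code,state_code",
             "1,14,16",
             "2,NA,16",
             "3,24,16"), csv_path)

# Load the file into a data frame and look at its structure.
survey <- read.csv(csv_path)
str(survey)

# Count missing values in one column and summarise the rest.
n_missing <- sum(is.na(survey$value_code))
mean_val  <- mean(survey$value_code, na.rm = TRUE)
```

Checking for `NA` values right after loading is a cheap way to catch columns that need cleaning before any analysis.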
Script_4a
Procedures for working with Fortran files from R
After downloading the file, the variables have to be split by column width rather than by a delimiter.
This example shows the code for working with such a file.
Script_4b
In this example, I write the code to show how you can work with the file.
First procedure for working with Fortran files
# Read the Fortran (fixed-width) file; V1, V2, ..., Vn are placeholders for the column widths
yourworktempfile <- read.fwf("namefortranfile.for", widths = c(V1, V2, ..., Vn))
# Extract the column you want to work with (here Vn is the index of that column)
yourworktempdata <- yourworktempfile[, Vn]
# Turn the extracted data into a matrix
dataextratedmatrix <- matrix(yourworktempdata, nrow = Numberows, ncol = Numbercols, byrow = FALSE, dimnames = NULL)
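To make the placeholders concrete, here is a minimal sketch of the same three steps on a small fixed-width file generated on the fly. The file contents, the widths `c(4, 7, 4)`, and the column names are all assumptions chosen for illustration.

```r
# Build a small fixed-width file: 4 characters for a code,
# 7 for a reading, 4 for a flag (all values are made up).
records <- data.frame(code    = c("ST01", "ST02", "ST03"),
                      reading = c(12.5, 7.25, 100),
                      flag    = c("OK", "OK", "BAD"),
                      stringsAsFactors = FALSE)
fwf_lines <- sprintf("%-4s%7.2f%4s", records$code, records$reading, records$flag)
fwf_path  <- tempfile(fileext = ".for")
writeLines(fwf_lines, fwf_path)

# Read the file, giving the width of each column.
parsed <- read.fwf(fwf_path, widths = c(4, 7, 4),
                   col.names = c("code", "reading", "flag"),
                   strip.white = TRUE, stringsAsFactors = FALSE)

# Extract the column of interest and store it as a one-column matrix.
readings       <- parsed[, "reading"]
reading_matrix <- matrix(readings, nrow = 3, ncol = 1, byrow = FALSE)
reading_matrix
```

Generating the lines with `sprintf()` guarantees that every field has exactly the width passed to `read.fwf()`, which is the property a real fixed-width file must also satisfy.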
Script_4c
# Remove the header rows (only needed in this case)
datamatrixclean <- dataextratedmatrix[Numberow_Ini:Numberow_End,]
# Convert character data to numeric
numericData <- as.numeric(datamatrixclean)
# Apply the 'x' function (mean, sum, etc.)
xFunction(numericData)
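The cleaning steps above can be tried on a small character matrix. In this sketch the values and the row range `3:5` are invented, and `mean()` stands in for the generic `xFunction`.

```r
# A one-column character matrix whose first two rows are header text.
raw_matrix <- matrix(c("TEMP", "C", "21.5", "19.0", "23.5"), nrow = 5, ncol = 1)

# Keep only the data rows (rows 3 to 5 in this made-up example).
clean_matrix <- raw_matrix[3:5, , drop = FALSE]

# Convert the character values to numbers; as.numeric() returns a plain vector.
numeric_values <- as.numeric(clean_matrix)

# Apply the summary function of interest.
total_value   <- sum(numeric_values)
average_value <- mean(numeric_values)
```

Note that `as.numeric()` turns any entry that is not a valid number into `NA` with a warning, which is why the header rows are removed before the conversion.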
Script_5
library(XML)
fileUrl <- "http://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
# fileUrl <- "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Frestaurants.xml"
doc <- xmlTreeParse(fileUrl, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
names(rootNode)
rootNode[[1]]
rootNode[[1]][[1]]
xmlSApply(rootNode,xmlValue)
Script_5
# Find specific variables and values.
xpathSApply(rootNode, "//name", xmlValue)
findzp <- xpathSApply(rootNode, "//zipcode", xmlValue)
findzp
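The same XPath queries can be tested offline on a small XML string before pointing them at the real file. The sketch below assumes the `XML` package is installed; the document structure, restaurant names, and zip codes are invented for illustration and do not come from the restaurants file.

```r
library(XML)

# Made-up XML with the same kind of elements queried above.
xml_text <- '<response>
  <row><name>Cafe One</name><zipcode>11111</zipcode></row>
  <row><name>Cafe Two</name><zipcode>22222</zipcode></row>
  <row><name>Cafe Three</name><zipcode>22222</zipcode></row>
</response>'

# asText = TRUE tells the parser that the input is XML content, not a file name.
doc  <- xmlTreeParse(xml_text, asText = TRUE, useInternalNodes = TRUE)
root <- xmlRoot(doc)

# Extract every zipcode value, then count how often each one occurs.
zipcodes   <- xpathSApply(root, "//zipcode", xmlValue)
zip_counts <- table(zipcodes)
n_target   <- sum(zipcodes == "22222")
```

`xpathSApply()` returns the values as character strings, so comparisons such as `zipcodes == "22222"` should use quoted values.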