Reading and Writing Data in R

Writing Data to files from R

R has several systems for writing output to files. In this course we will only touch on a few.

# writing text ----
write("This is my story", "my_story.txt")
write("...and this is more of my story", "my_story.txt", append = TRUE)

# writing dataframe or tables to .csv or .xlsx files ----
# I strongly suggest you read through the ?write.table help documentation. It is more flexible than what we will be doing.

# writing tables to text in .csv files
data(iris)
head(iris)

write.csv(x = iris, file = "my_own_iris.csv", row.names = FALSE) # can be opened in a text editor or imported into Excel

# Write to excel (write.xlsx)
# try ?write.xlsx # nothing
# try ??write.xlsx # need a different package: first exposure to other packages:

# Introduce R packages

# install.packages("openxlsx") # note the quotation marks around the package name
# can also go to the "Packages" tab and install that way, or use Tools >> Install Packages
# check out the Packages tab in RStudio.

library(openxlsx) # note: no quotation marks needed
# now try ?write.xlsx

# write to excel sheet
write.xlsx(iris, file = "my_own_iris.xlsx")

# check it out...by opening your excel file

# sink() # useful for sending console output to a text file

# for example, I want to send the results from a statistical test to a text file.

# When I call sink(), everything printed afterward will be sent to that file and not to the console.

sink(file = "myTestSink.txt", append = TRUE)

summary(lm( mpg ~ ., data = mtcars))

summary(lm( mpg ~ wt, data = mtcars))

sink(file = NULL) # to stop sinking
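If a line between the two sink() calls fails, output stays redirected and the console appears silent. The following is a minimal sketch of one way to guard against that. The helper name sink_to_file is hypothetical (it is not part of base R or of the course material), and a temporary file stands in for myTestSink.txt.

```r
# Hypothetical helper: print the result of make_output() into the file at
# `path`, and always restore the console afterwards.
sink_to_file <- function(make_output, path) {
  sink(file = path)
  on.exit(sink(file = NULL), add = TRUE)  # runs even if make_output() fails
  print(make_output())
  invisible(path)
}

out_file <- tempfile(fileext = ".txt")    # temporary file, so nothing is overwritten
sink_to_file(function() summary(lm(mpg ~ wt, data = mtcars)), out_file)

sink_lines <- readLines(out_file)         # read the sunk output back in
head(sink_lines)
```

Wrapping the work in a function is what makes on.exit() available; in a plain script the two explicit sink() calls shown above remain the simplest option.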

# sending images to file:
# can use the interactivity of RStudio: Export or Plots >> Save as Image/Save as PDF/Save to Clipboard

# or...(this part in .R NOT .Rmd !!)
# look at a simple image (more about this soon)
hist(iris$Sepal.Length, breaks =20, col = "seagreen3")

# save image as .pdf
pdf(file = "my_iris_pic1.pdf")
hist(iris$Sepal.Length, breaks =20, col = "seagreen3")
dev.off()

# save image as .jpeg
jpeg(filename = "my_iris_pic2.jpeg")
hist(iris$Sepal.Length, breaks =20, col = "violetred")
dev.off()
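Both examples above follow the same three steps: open a device, draw, close the device. A small sketch of that pattern with a check that the file was actually written, assuming a temporary file (plot_path is an illustrative name) and a page size chosen only for the example:

```r
# plot_path is a temporary file, used so that no existing picture is overwritten
plot_path <- tempfile(fileext = ".pdf")

pdf(file = plot_path, width = 7, height = 5)  # open the device; size is in inches
hist(iris$Sepal.Length, breaks = 20, col = "seagreen3")
invisible(dev.off())                          # close the device so the file is completed

file.exists(plot_path)      # was the file created?
file.size(plot_path) > 0    # does it contain anything?
```

Forgetting dev.off() is the usual reason a saved image cannot be opened: the file is only finished when the device is closed.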

Reading Data from files into R

R can also read data from many different kinds of files: text files, PDF documents, JSON files, Word documents, Excel files and many more.

For this course we will introduce you to the following useful functions:

readLines(); read.csv(); read.xlsx(); and source()

A useful function to study what is possible is the help documentation on read.table()

read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...)

# reading data from a text file

read_back_lines <- readLines("/Users/hans-peterbakker/Dropbox/Statistics/consulting/R-Course/R_Course_2020_Repo/Sessions/myTestSink.txt")

read_back_lines

# reading a .csv file into R (worth reading read.table() in full...)
my_iris_read_back_csv <- read.csv("/Users/hans-peterbakker/Dropbox/Statistics/consulting/R-Course/R_Course_2020_Repo/Sessions/my_own_iris.csv")

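str(my_iris_read_back_csv) # in R versions before 4.0.0, character columns were automatically turned into factors; from R 4.0.0 they stay character unless stringsAsFactors = TRUE (more about factors soon...)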

# reading an .xlsx file into R
# library(openxlsx)
my_iris_read_back_xlsx <- read.xlsx("/Users/hans-peterbakker/Dropbox/Statistics/consulting/R-Course/R_Course_2020_Repo/Sessions/my_own_iris.xlsx") # characters return as characters...

str(my_iris_read_back_xlsx)

# source() is a useful function for running previously saved script files. We will see this again with functions
source("/Users/hans-peterbakker/Dropbox/Statistics/consulting/R-Course/R_Course_2020_Repo/t_plotter.R")
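The absolute paths above only exist on the author's machine. As a self-contained sketch of the same write-then-read round trip, the example below uses tempfile() instead; the object names (csv_path, iris_chr, iris_fct) are illustrative and not part of the course material.

```r
# write the built-in iris data to a temporary .csv file
csv_path <- tempfile(fileext = ".csv")
write.csv(x = iris, file = csv_path, row.names = FALSE)

# read it back twice: once with the defaults, once asking for factors explicitly
iris_chr <- read.csv(csv_path)                           # default depends on the R version
iris_fct <- read.csv(csv_path, stringsAsFactors = TRUE)  # character columns become factors

class(iris_chr$Species)
class(iris_fct$Species)
dim(iris_chr)   # same number of rows and columns as the original iris
```

Setting stringsAsFactors explicitly makes the script behave the same way on old and new versions of R.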

More Operators:

We have already seen several operators (+, -, *, /, ^)

Logical and Relational Operators

smaller than ( < ); larger than ( > ); smaller or equal to ( <= ); larger or equal to ( >= ); equal to ( == ); not equal to ( != ); intersection/and ( & ); union/or ( | ); negation ( ! ); value matching ( %in% )

# equal to, less than, greater than, less or equal, greater or equal 

# The output from these operators is logical:

4 < 5
5 <= 4
4 == 3
4 == 4

# not equal to
4 != 4
4 != 3

# intersection and union ('and' and 'or')

4 > 2 & 5 > 3 # TRUE & TRUE == TRUE

4 > 2 & 5 < 3 # TRUE & FALSE == FALSE

4 < 2 & 5 < 3 # FALSE & FALSE == FALSE

4 > 2 | 5 > 3 # TRUE | TRUE == TRUE

4 > 2 | 5 < 3 # TRUE | FALSE == TRUE

4 < 2 | 5 < 3 # FALSE | FALSE == FALSE

# use of negation (!): whatever the result would be is reversed.

4 > 2 & 5 > 3 # TRUE & TRUE == TRUE

# but

!(4 > 2 & 5 > 3)  # !(TRUE & TRUE) == FALSE

# applied to vectors
"a" == c("a", "b", "c") # checks each element and returns a vector

23 >= c(2, 56, 12, 45)

# a useful operator for matching in a vector: %in%
"a" %in% c("a", "b", "c") # single output TRUE

21 %in% c(2, 56, 12, 45)
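The difference between == and %in% is easiest to see when both sides are vectors. A short sketch, with vector names (letters_abc, wanted) invented for the illustration:

```r
letters_abc <- c("a", "b", "c")
wanted      <- c("a", "c", "z")

# == compares element by element: one result per element of letters_abc
elementwise <- "a" == letters_abc
elementwise

# %in% asks, for each element on the left, whether it occurs anywhere on the right
membership <- wanted %in% letters_abc
membership

# negation flips every element of a logical vector
!membership

# any() and all() collapse a logical vector into a single TRUE or FALSE
any(23 >= c(2, 56, 12, 45))
all(23 >= c(2, 56, 12, 45))
```

The result of %in% always has the length of its left-hand side, which is why a single value on the left gives a single TRUE or FALSE.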

Value Generating

In programming and particularly in statistical analysis, being able to create series and samples of values is important.

Here we will consider several functions that allow us to do that.

# we have already seen c()

# seq() ----
# generates a sequence of numbers:
# seq(from = 1, to = 1, by = ((to - from)/(length.out - 1)),
#    length.out = NULL, along.with = NULL, ...)

seq(from = 0, to = 100, by = 10)

twenty_numbers <- seq(0, 100, length.out = 20) # will generate twenty evenly spaced values ...

seq(0, 1, along.with = twenty_numbers) # generates evenly spaced values between 0 and 1, with length.out equal to the number of elements in twenty_numbers

# rep() ----
# repeats
rep(x = 1, times = 10)
rep(c(1, 2, 3), times = 10)
rep(c(1, 2, 3), each = 10)
rep(c("a", "b"), each = 5) # also works for character vectors
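A quick way to convince yourself what these arguments do is to check the length and spacing of the results. The sketch below does that; the object names are illustrative only.

```r
# length.out fixes the number of values; R works out the step size
evenly_spaced <- seq(0, 100, length.out = 20)
length(evenly_spaced)     # number of values generated
diff(evenly_spaced)[1]    # step size, here (100 - 0) / (20 - 1)

# times repeats the whole vector; each repeats every element in turn
times_version <- rep(c(1, 2, 3), times = 2)
each_version  <- rep(c(1, 2, 3), each = 2)
times_version
each_version
```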

# sample() ----
# Draws a random sample based on your arguments
# ?sample: sample(x, size, replace = FALSE, prob = NULL)

# Arguments:
# x = the population to draw from
# size = how many to draw
# replace = should the values I drew be replaced so they can be drawn again?
# prob = a vector of probability values attached to the elements of x. Only really makes sense if replace = TRUE

sample(x = 1:10, size = 10, replace = TRUE, prob = NULL) # equal probabilities
sample(x = 1:10, size = 10, replace = FALSE, prob = NULL) # will just give you all the values again, in random order
sample(x = 1:10, size = 10, replace = TRUE, prob = c(0.9, rep(0.1,9))) # much higher chance on first (1)
sample(x = 1:10, size = 10, replace = TRUE, prob = c(rep(0.1,9), 0.9)) # much higher chance on last (10)

# note if I run them again, I will get different values... it's a random draw!!!
# but I can make sure it's repeatable

# set.seed() ----
sample(x = 1:10, size = 10, replace = TRUE, prob = NULL)
sample(x = 1:10, size = 10, replace = TRUE, prob = NULL) # different values
# vs
set.seed(9) # the seed can be any integer value
sample(x = 1:10, size = 10, replace = TRUE, prob = NULL)
set.seed(9) # resetting the same seed reproduces the same draw
sample(x = 1:10, size = 10, replace = TRUE, prob = NULL)
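Rather than comparing the printed draws by eye, the two seeded samples can be stored and compared directly. The sketch below also looks at the weighted case; note that sample() rescales prob, so the weights do not need to sum to 1. The object names are illustrative only.

```r
# same seed, same call: the two draws should be identical
set.seed(9)
first_draw <- sample(x = 1:10, size = 10, replace = TRUE)
set.seed(9)
second_draw <- sample(x = 1:10, size = 10, replace = TRUE)
identical(first_draw, second_draw)

# weights 0.9, 0.1, ..., 0.1 sum to 1.8, so value 1 is drawn with
# probability 0.9 / 1.8 = 0.5
set.seed(9)
weighted_draw <- sample(x = 1:10, size = 1000, replace = TRUE,
                        prob = c(0.9, rep(0.1, 9)))
table(weighted_draw)        # counts per value
mean(weighted_draw == 1)    # proportion of 1s, expected to be near 0.5
```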

Probability Distribution Sampling (slides)

In the slides, we will give a presentation on what a probability distribution is and why it is important in statistics. We will also introduce the histogram as a step toward density functions.

What is a probability distribution?

We will get back to more details, but for now we want to generate random samples from different probability distributions.

# R's "stats" package, which is loaded as part of the base set of packages, has many distributions to choose from
?distributions

n_sample <- rnorm(n = 1000, mean = 78, sd = 10) # random sample from a normal distribution

u_sample <- runif(n = 1000, min = 1, max = 4) # random sample from a uniform distribution

e_sample <- rexp(n = 1000, rate = 2)  # random sample from an exponential distribution


# to look at the distribution of values a histogram is useful
# in .R !!
hist(n_sample, prob = TRUE)
lines(density(n_sample, bw = 1), col = "blue")
lines(seq(40,120), dnorm(seq(40,120), mean = 78, sd = 10), col = "red")
rug(n_sample)


hist(e_sample)
hist(u_sample)

n_sample # vector of values generated by my function above

# limiting the number of digits.
70/3
round(70/3)
round(70/3, digits = 2)

round(n_sample, digits = 2)

# head and tail
head(n_sample)
head(n_sample, 10)
tail(n_sample, 3)
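Besides looking at histograms, a simple numerical check is to compare summaries of each sample with the parameters used to generate it. A minimal sketch, assuming an arbitrary seed (2020) and illustrative object names so the course objects above are not overwritten:

```r
set.seed(2020)  # arbitrary seed, only so the sketch is repeatable

n_check <- rnorm(n = 1000, mean = 78, sd = 10)
u_check <- runif(n = 1000, min = 1, max = 4)
e_check <- rexp(n = 1000, rate = 2)

# sample mean and sd should be close to (not exactly) 78 and 10
round(c(mean = mean(n_check), sd = sd(n_check)), digits = 2)

# all uniform values fall between min and max
range(u_check)

# the mean of an exponential distribution is 1 / rate, here 0.5
round(mean(e_check), digits = 2)
```

The estimates differ slightly from the true parameters because each sample is random; a larger n brings them closer.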

Dealing with missing values

Missing values should be identified with NA, not with zeros, and not with words like "missing" or explanations of why they are missing ("Jason messed up this one").

For example, I want to construct a data frame with the following information: x = 2, 3, "missing", 4 and y = 1, 2, 3, 5.

x <- c(2, 3, "missing", 4)
y <- c(1, 2, 3, 5)

# x is no longer a numeric variable: all its values have been turned into character.
x[3] <- NA

x

x <- as.numeric(x) # convert back to numeric (the result must be assigned); now I can combine x with y in a data frame
df_NA <- data.frame(x, y)

# function is.na()
# a logical function that identifies which values in a vector are missing:

is.na(x)
is.na(df_NA)

# NB the logical expression
# (x == NA) # is not the same as is.na(x): NA is not really a number but a marker, so any comparison with it returns NA

is.na(x)

# another kind of "missing" value is NaN... Not a Number. This happens when your calculations don't make sense.

# example
0/0
Inf - Inf

# is.na() is TRUE for both NaN and NA, but is.nan() is TRUE only for NaN.

# to extract only the cases (rows) in a data frame that have no NA or NaN values

complete.cases(df_NA) # TRUE for each row without missing values

# so, can use as follows:

df_NA[complete.cases(df_NA),]
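The steps in this section can be put together in one small, self-contained sketch. The object names (x_raw, x_num, df_demo) are illustrative, and suppressWarnings() is used only to hide the "NAs introduced by coercion" warning that as.numeric() gives for the word "missing".

```r
x_raw <- c("2", "3", "missing", "4")          # what R stores when a word is mixed in
x_num <- suppressWarnings(as.numeric(x_raw))  # "missing" cannot be parsed, so it becomes NA
y_num <- c(1, 2, 3, 5)

df_demo <- data.frame(x = x_num, y = y_num)

is.na(df_demo$x)            # TRUE only where the value is missing
df_demo$x == NA             # NA everywhere: a comparison with NA is never TRUE or FALSE
complete.cases(df_demo)     # TRUE for rows with no missing values

df_complete <- df_demo[complete.cases(df_demo), ]  # keep only the complete rows
df_complete

c(is.na(NaN), is.nan(NA))   # is.na() catches NaN, but is.nan() does not catch NA
```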