1 Organization

2 An introductory example

R as a language: an example with heat map developed by Esarey and Pierce Plots


output
output


FIN <- read.dta[file="find.dta"]
varfin <- data.frame(FIN[ which(FIN$rekry == "1"), ]) # only those who answer the self-adminsitered survey

new variable

varfin ["support"] <- (varfin$instrs+varfin$satis)/2
varsp ["support"] <- (varsp$instrs+ varsp$stdem)/2

# Analisys
library (mfx)

# Model Finland 
m12 <- glm(formula= ambiva ~ efin + knowle + support + factor(steco)intr + sctr + resp + ideo+ factor(edu)+ age, data = varfin, family= "binomial" na.action= na.omit))

## Heat Map for Finland 

install.packages(heatmapFit)
library("heatmapFit")

Varfin$a <- predict(m12, varfin, level = 0.90, type = c("response"))

par(mar = c(4 ,6 , 2, 2))

heatmap.fit(varfin$ambiva, varfin$a, calc.boot = T, reps = 1000, span.l = "aicc",
  color = FALSE, compress.obs = TRUE, init.grid = 2000, ret.obs = F,legend = TRUE)

par(xpd=NA)


3 Content of the workshop


  1. Firsts step with R
    • History, familiarization with the browser
    • A crucial distinction: Objects, working directory and workspace
    • Getting packages and data. Rmarkdown and chicken weights
  2. Data Structure
    • Basic calculations
    • How to write good code/workflow!
    • Vectors, List and Data frames
  3. Data Management
    • Select and recoding variables, and rename columns and raws
    • Subsetting data
    • Aggregation = Pipeline! and application
    • functions(?)
  4. Descriptive analyses
    • Basic descriptive analyses and plot them
    • Regression analysis and plot them
    • Customized table, and extracting relevant information


4 Firsts steps with R

4.1 R: The programm

  • What is R? Open source version of the statistics programming language S
    • “Open source”: Source code (+ software) are freely available
    • Runs on all operating systems
    • You can download it here!


  • Who develops R? Core group + Thousands of volunteers


4.2 Why use R and not any other software (e.g. stata, spss)?


4.3 R in, STATA out: Data concept and object-orientiation

  • SPSS/STATA
    • One data set at a time
    • Data sets follow a “row-by-column” structure
      • Rows = observations/units
      • Columns = variables
    • Variables have certain attributes attached to them (e.g. labels)


  • R (similar to Python, C, Basic)
    • “Everything” is an object
    • There are different classes of objects
    • Data, functions, estimation results etc. are all objects
    • Objects are stored in the “workspace”
  • Let’s open R now!


4.4 working directory, workspace, and objects

  • Working directory
    • Storage within R
    • getwd(): Display working directory
    • setwd(): Set working directory
    • dir(): Display files in working directory
    • R does not understand \
      • Use \\ instead or / in directory paths
    • #: Start comment that is not evaluated


  • Workspace
    • stores all objects generated during the session
    • is loaded automatically when you start R
    • ls(): Display workspace content
    • R queries whether you would like to save the workspace when quitting it


  • Objects
    • everything (functions, data, words…)

    • __advice__Meaningful names; No blanks!; one name per object

    • a <- 10 or a = 10: Assign something to an object

    • a: Type name to display content of object a

    • rm(): Delete object from workspace

    • save(a, b, file = "myobjects.RData"): Save single objects a and b

    • load("myobjects.RData"): Load saved objects


  • Help function
    • General help with help.start() or in the menu
    • help.search(Suchwort): Search help file
    • help(function) or ?function
      • Get help for certain objects/functions
      • e.g. call plot.html with ?plot (Explain help file structure!) or ?ls

4.4.1 Example

# Comments in the script start with #
# Everything after # in the line is ignored by R

# Send code with CTRL + R
5+5


getwd() # Display working directory


# Generate a new folder on your hardrive called "2016_03_R_course_EUI"
# Use this folder as your working directory, i.e. to store the downloaded files in there
# If you need the path to your folder, copy it from the path field in the explorer (in windows)
# The course files are available here:  ??????


# Set the your working directory to your new folder
setwd("C:/Users/usuario/Desktop/Workshop in R 2016")
dir() # Display content of working directory

a <- 50 # Generate object that contains the number 50
a       # Show content of object
a = 70
a
# Note that a new object with the same name "X" will replace an previous object called "X" 

ls() # Display objects in workspace

# Create a more complicated object
thesenseoflife <- c("worshipping god","having fun","no idea")
interesting <- thesenseoflife # define another object
interesting


# Objectnames do not have blanks
# Names should make sense and be systematic 
  # e.g. for dummy variables: d.male.voter
  # e.g. variance of variables: var.income
a <- 1
b <- 2

# Save objects a and b in the file "objects-a-and-b.RData"
# in the working directory
save(a,b,file="objects-a-and-b.RData")

# Delete objects a and b
rm(a,b)
rm(list=ls())
ls() # display all objects
load("objects-a-and-b.RData") # load objects from wd

# Check if they were loaded
a
b


4.4.2 Exercise: Working directory, objects and workspace

  1. Open R and create a new script that you save under the name “yourlastname_rworkshop_2016.r”. In this script you save all the code that you use to solve the exercises from now on.
  2. Display the working directory.
  3. Change the working directory to your personal folder.
  4. Create two objects: The object “hundred” in which you save the number 100 and the object “xyz” in which you save the number 250.
  5. Check whether the created objects have really been stored in the workspace.
  6. Save the two objects in your working directory. Delete them from the workspace and load them again from your working directory.
  7. Check whether they are stored again in the working directory.
  8. Download a dataset called ‘ChickWeight’ and save it in your working directory with the label ‘lunch’. Hint: write ChickWeight in the console and see the surprise

4.4.3 Solution

getwd()
hundert <- 100
xyz <- 250
hundert
ls()
save(hundred,xyz,file="workspace.RData")
rm(hundred,xyz)
load("workspace.RData")
ls()
save(ChickWeight, file = "lunch.RData")


4.5 Calculations and logical comparisons

  • R can calculate and compare things
  • Arithmetic operators
    • + - * / ^
  • Logical operators
    • & | == != > < >= <=
  • Different functions (check with ?function)
    • exp(x)=e^x log(x) log10(x) sin(x) cos(x) tan(x)
    • abs(x) sqrt(x) ceiling(x) floor(x) trunc(x) round(x, digits=n)


4.5.1 Example

# Calculations
Ergebnis <- (23+24)*11/(18+15)*5
Ergebnis

# Functions
log(2) # exp(1) = 2.718282^0.6931472

##########################
# Comparisons.. examples #
##########################

x <- -3:3
x

# Equals
x == 0

# is smaller than
x < 0

# is bigger than or equal to
x >= 0

# unequal
x != 0

# unequal(equal to 0)
!(x == 0)

# bigger than -1 and smaller than 1
x > -1 & x < 1

# bigger than 1 and smaller than -1
x > 1 & x < -1

# bigger than 1 or smaller than -1
x > 1 | x < -1


4.5.2 Exercise: Calculations and logical comparisons

  1. Use your r-script and save the code you are using.
  2. Calculate the following terms using R.
    1. \(((3 + 4 - 5) - 9)^{2}\)
    2. \(\frac{-99}{33} + 42\)
    3. \(log(1)\)
    4. \((\sqrt{2})^{2}\)
  3. Check the following terms:
    1. \(5 = 7\)
    2. \(5 \times 5 \geq 6 \times 4\)
    3. \(\sqrt{3} \neq cos(17)\)
  4. In the mean() function it is possible to use and additional argument called trim. Find out what this argument does by checking the help file of the mean() function and write it into your script.


4.5.3 Solution

# 2.
((3+4-5)-9)^2
-99/33+42
log(1)
(sqrt(2))^2

# 3.
5==7
5*5>=6*4
sqrt(3)!=cos(17)

# 4.
# help(mean)
# ?mean
# the fraction (0 to 0.5) of observations to be trimmed from each end of x 
# before the mean is computed.


4.6 How to write good code/workflow!

Q: What is your experience with looking at data analysis code you have written 2 years earlier?


  1. Comment your code
    • Data sources, data manipulations, describe steps of analysis
      • …commenting increases readability
      • …commenting increases reproducability
    • but, you can’t comment too much
      • Comments in R: #
    • Use meaningful names for objects !
  2. Structure you code properly
  3. More rules
    • Code must be usable on machines with different paths to the project
      • Path is set once and afterwards, relative paths are used
      • There is a folder structure separating data and figures (and code)

For larger projects you might consider tools that keep track of versions, branches etc. like Git (http://www.github.com among others offers free git repositories). github can easily be used from within RStudio.


4.6.1 Rmarkdown package

  • What is Markdown? a notebook interface to weave together codes and text

  • Useful? reproduction, automated reports in any format, control (analysis<-> report)

  • Free? please…click here

  • My experience? collection of data on elite defection. Q: What about you?


# Get Rmarkdown installed
install.packages("rmarkdown")

# Generate a new file, check names and format of the file
    # White = the space of text --> Information of the steps of your analysis
    # Grey = R code as a separate paragraph--> analysis and specific comments about a command

#How does it work?
    # treat it as a normal script, at the end of your work, you can click the knit button and get the file you want. 


5 Data Structure


5.1 Objects: Classes and their structure

  • Objects class that can store homogenous content (see Wickham)
    • Vectors: Integer; Numerical; Logical; Character
    • Matrix: Two dimensions
    • Array: Several matrices
  • …heterogeneous, varying content…
    • List: List of other objects
    • Data frame: “List” of vectors of equal length
  • Attributes, i.e. metadata
    • Objects can have additional attributes, e.g. names
    • Factor
      • Vector with predefined values
      • Stores categorical data
  • Convert objects, e.g. as.numeric()
  • Functions (no data structure!)
    • class(): Query object type (e.g. the attributes)
    • str(): to check the internal structure of an object


  • Overview of structure of object classes
# FOCUS ON THE STRUCTURE NOT THE CODE!

# HOMOGENEOUS
# Integer vector => integer numerical values
x <- c(1:9)
class(x)
x


# Numerical vector => numerical values
x <- c(1.3, 2.4, 3.5)
class(x)
x


# Logical vector => take on the values "TRUE/FALSE", are often the results of comparisons
x <- -3:3
y <- x >= 0
class(y)
y

# Character vector => are sequences of letters and/or words
x <- c("a", "b", "Adrian is so cool", "he said #$!?=96464#$%")
class(x)
x


# Matrix
matrix.2mal3 <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol=3, byrow=TRUE)
class(matrix.2mal3)
matrix.2mal3


# Array (e.g. 3 matrices)
arraybsp <- array(1:50, c(5,5,2)) # numbers from 1 to 50 in array with 2 5*5 matrix 
class(arraybsp)
arraybsp


# HETEROGENEOUS

# List
liste <- list(a= c(4:8), b = c(1:3), c = c("Ger", "FR", "It", "SE"))
class(liste)
liste


# Data frame: Examine by calling our ``lunch'' or fix()
class(lunch)
lunch[1:10, 1:3]


# Factor
x <- factor(rep(1:5,5), exclude = NULL)
class(x)
x # Levels are attributes

o <- ordered(rep(1:3,10),levels=c(1,2,3), labels=c("A", "B", "C"))
class(o)
o

# Functions: e.g. cos(); mean()
quadraticfunction <- function(){x^2}
class(quadraticfunction)
quadraticfunction

class(mean)
mean


5.2 Some functions/operations for vectors

  • The basic ones!
    • c(): “concatenate”, e.g. c(1.2,"Hans",5.0,6.7)
    • length(): Get vector length
    • :: indicates from/to for numbers
    • rep("Peter",2): Repeat 5 two times
    • seq(5,8): Sequence from 5 to 8
    • vector[positions]
      • Access elements by inserting a numeric or logical vector for positions
    • Display
      • e.g. [1] 1.20 3.50 5.00 6.70 8.00 10.00 13.55
      • [1] = Position of the first element displayed in that line in the vector (show it!)
      • vector[raw, column]
    • Special values
      • Inf and -Inf: Positive and negative infinity
      • NaN: “Not a number”, e.g. calculate 0/0
      • NA: Missing value
    • Combine vectors
      • rbind(): Combine vectors line-by-line
      • cbind(): Combine vectors column-by-column

5.2.1 Example

# Generate two vectors
x <- 10:19
x
y <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
y
# seq(0,18,2)
z <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
z



# Access vector elements
x[1]              # 1st element of x
x[1:5]           # First 5 elements of x
x[-(2:6)]      # All elements but not positions 2 to 6
z[4] # ?
x[y]
x[x< 16 | x >=12]  # values on X lower and higher than 


# Add names to elements of a vector
names(x)
names(x) <- z

# or like this
another.vector <- c(a=1,b=2)
x
names(x)

# Combine vectors rowwise and columnwise
rbind(x,y)
cbind(y,z)

# Access with names
names(x)
x[c("a","c")]


5.2.2 Exercise: Vectors

  1. Use your usual r-script to save the code for the following exercises.
  2. Create a vector of length 50 that contains the numbers from 1 to 5 repeated for 10 times.
  3. Create a vector x of length 404 in which the numbers 199 to 400 are repeated two times. Generate a new vector y, that contains the elements of x on the positions 53, 78 and 99.
  4. Create a character vector Freunde that contains the names of your three best friends.


5.2.3 Solution

# 2.
x <- rep(1:5,10, length.out= 50)
x

# 3.
x <- rep(199:400,2)
x[x>=240]

length(x)
y <- x[c(53, 78, 99)]
y

# 4. 
freunde <- c("Platon", "Aristoteles", "Corbyn")


install.packages("bookdown")
library


5.3 Factors and lists


  • Factors
    • Class: "factor"
    • factor(): Create an unordered factor
    • ordered(): Create an ordered factor
    • A way to store data for categorical (nominal/ordinal) variables
    • Vector with attributes
    • levels(): Display categories
    • as.numeric(): Convert factor to numeric vector


  • Lists
    • Class: "list"
    • list(): Create a list
    • Collections of arbitrary objects
    • Have a certain length and list elements can carry names
    • list$switzerland: Access element switzerland of the list list
    • list[2]: Access second element of the list list
    • Sometimes results of estimations are stored as lists
  • We’ll see more object classes later on


5.3.1 Example

# Q: How do i get help for the function factor()?


# Create factor (nominal variable)
f <- factor(rep(1:2,10),levels=c(1,2), labels=c("SPD", "CDU"))

# Q: How do I go about to understand what happens in the function?

f
levels(f)
as.numeric(f)


# Create factor (ordinal variable)
o <- ordered(rep(1:3,10),levels=c(1,2,3), labels=c("low", "medium", "high"))
o
as.numeric(o)
class(o)
as.character(a)

a<- 1:10
c

# Create a list
participants <- list(Teacher= "Rudi", 
                     Women = c("Daniela","Johanna"),
                     Men = c("Simon", "Peter", "usw."))

participants$Women
length(participants)

# Access elements or subsets of that list 
participants$Teacher
participants[1]
participants[[1]]
participants[["Teacher"]]

# Q: How can I access the list element "Women"?
participants$Women[2]
# Q: How can i access the Johanna who is in the element "Women"?


5.3.2 Exercise: Factors and lists

  1. Create a list mylist with two elements. The first element first contains the numbers 5 to 105. The second element second contains the numbers -1 to -50. Create another vector x that contains the 70th value of the first element of mylist and the 30th element of the second element of mylist.
  2. Create a list anotherlist with four elements: A, B, C, D. A contains the number 2. B contains a vector with the number 1 to 10. C contains a character vector with the names “Bernd” “Hans” “Peter”. D contains a vector with the numbers 1 to 100.
  3. Extract the third Element C from the list you just generated and save it in a new object names.
  4. Extract the vector elements 25 to 35 out of the fourth element D of the list and save them in an object names xyz.
  5. Create vector using the following code: test <- rep(1:10,10). Convert test to an ordered factor. Check which categories the factor has and whether it’s ordered.


5.3.3 Solution

# 1.
mylist <- list(first=c(5:105),second =c(-1:-50))
mylist
x <- c(mylist$first[70], mylist$second[30])
x

# 2. 
A <- 2
anotherlist <- list(A=2,
                    B=1:10,
                    C=c("Bernd", "Hans", "Peter"),
                    D=1:100)

# 3
names <- anotherlist[[3]]
names <- anotherlist$C

# 4
xyz <- anotherlist[[4]][25:35]
xyz <- anotherlist$D[25:35]
anotherlist$D[anotherlist$D>=25 & anotherlist$D<=35] 
# Was ist der Unterschied zwischen den beiden?!!!?
# Wie w?rde man auf die Elemente 25 und 35 zugreifen?

# 5.
test <- rep(1:10,10)
ordered(test)


5.4 Packages

  • Many objects (functions, data sets etc) are stored in packages (see here)
  • Packages are written by various authors and can be loaded for an R session to use their content
  • Sometimes there are conflicts between packages (“function is masked”)


  • Central functions
    • install.packages("packagename"): Install package
    • library(packagename): Load an installed package
    • detach("package:packagename"): Unload a package
    • remove.packages("packagename"): Unintall package
  • Other functions
    • library(): Display all installed packages
    • library(help=packagename): Describe a package
    • search(): Display all loaded packages
    • ls("package:packagename"): Display all objects within a package
    • packagename::bar: Load just use one object in a package (e.g. a function)


5.4.1 Exercise: Packages

  1. Use your usual script to save your code. Install the package dplyr.
  2. Load the package dplyr.
  3. Install the package foreign using the window

5.4.2 Solution


5.5 Data frames

  • The basics +Data frames are the usual format for data sets
    • …are "lists" of vectors of the same length under the hood
    • …look like matrices, but columns can contain objects of different class


  • Functions
    • data.frame(): Create a data frame
    • as.data.frame(): Convert into a data frame
    • summary() und str(): Get oversight of a data frame’s content
    • head() and tail(): Inspect the data
    • ``View(): Show data frame (formerlyfix()) + **Beware**.. you can't continue to work when the data window offix()``` is open!
    • names(): Display variablenames (and use to rename)
    • na.omit(): Delete missings “listwise”, i.e. rows that contain at least one missing
    • is.na(): Generate logical vector indicating the missings


  • More
    • object$var1: Access variable var1 in data frame object
    • as.numeric(object$var1): Convert class of variable into numeric
    • NA: What was that?!?!


5.5.1 Example

getwd() # Get working directory

# Create a data frame
data <- data.frame(id=1:3, # !
                    weight=c(20,27,24),
                    size=c("small", "large", "medium"))

# create a list 
list<- list(id=c(1:3), weight=c(20,27,24), size=c("small", "large", "medium"))

# create a matrix 
matrix <- cbind(id=c(1:3), weight=c(20,27,24), size=c("small", "large", "medium"))

# Can you observe the main differences?


6 Data Management


6.1 Import and export data

  • Fundamental rules!
    • Always save the data in multiple places (and on your harddrive)
    • Always save the address or source of your data
    • Research must be reproducible, ideally in a 1000 years!!
    • Ideally both code of the analysis and data are saved in one place (see e.g. Harvard Dataverse Network) what abou the DC?
    • Some journals require that now anyways (good!)
    • Prepare replication immediatly on submission!
    • Lots of different packages for scraping/downloading data (check the automated data collection workshop in the DC!)


  • Import

    • read.table(): Function to import most common files (.txt, .csv)
    • read.table("mydata.csv", header = TRUE, sep = ",", dec = ".")
      • header = TRUE: First row equals variable names
      • sep = ",": Columns are separted by a comma
      • dec = ".": Specfiy character for decimal points (. is default)
      • See other possible function arguments with ?read.table
    • Mostly you save files as .csv OR .txt and then import them (e.g. EXCEL files)
    • Other formats such as STATA, SPSS, SAS
      • foreign: Classic package with functions for data import
      • Hmisc: Special package for SPSS format and when foreign does not work to import SAS


  • Export

    • Ideally, as simple comma-separated .txt file
    • write.table(swiss, "swiss.txt", sep=","): Save as :.txt file
    • Check ?write.table for various arguments you can add to the function
    • See also export of other Quick R


6.1.1 Example

# REGARDING .CSV format

#set the working directory
setwd("C:/Users/usuario/Desktop/Workshop in R 2016")

#Package
library(foreign)

# Import the dataset
dataa <- read.csv(file="swiss.csv")

# export the data set in .txt
write.table(dataa, file= "datos.txt")


# REGARDING .SAV format

#set the working directory
setwd ("C:/Users/usuario/Desktop")

#Package
library("Hmisc")

#Import the dataset
mydata <- spss.get(file = "datos.sav%202", use.value.labels=TRUE)
 head(mydata)

# Export the dataset in .dta (stata format)
write.dta(mydata, "C:/Users/usuario/Desktop/datos.dta") # Q: why I have written here the whole path? what does it mean?
 dataa <- read.dta(file="datos.dta")

# Check whether everything is right yes
 summary(mydata$rutinaw10)
 summary(dataa$rutinaw10)


6.2 Procedure in data managment: when working with data

* ...Figure out the logical structure of what you want to do
* ...Describe those tasks in the form of a computer package (dplyr in our case)
* ...Execute codes


6.2.1 Example

setwd("C:/Users/usuario/Desktop/Workshop in R 2016")

library(foreign)

dataa <- read.csv(file="swiss.csv")
a <- dataa


# Case 1

a$inf.mort.cla[a$Infant.Mortality <= 18] <- 0
a$inf.mort.cla[a$Infant.Mortality > 18 & a$Infant.Mortality <= 20] <- 1
a$inf.mort.cla[a$Infant.Mortality > 20 & a$Infant.Mortality <= 21] <- 2
a$inf.mort.cla[a$Infant.Mortality > 21 & a$Infant.Mortality <= 27] <- 3
sub1<- filter(a,inf.mort.cla ==1) 
sub2 <- filter (a, inf.mort.cla == 3)
meana <- mean(sub1$Education)
meanb <- mean(sub2$Education)

# Case 2 

b<-a
library(car) ## used plyr 

b["inf.mort.cla"]<- recode(b$Infant.Mortality, "0:18= '1'; 19:20='2'; 21:27 ='3'; else = '4'") 


library(dplyr)
summary<- b %>% group_by(inf.mort.cla) %>% summarise_each(funs(mean), Education)

# Better right?


6.3 Logic of accessing subsets of data frames

  • Same logic as for vectors but two dimensions
    • dataframe[rows,columns]
    • Replace rows/columns by vector indicating position (numerical, logical, character)
  • Logic similar for other object classes such as lists (remember vectors/lists)


6.4 DPLYR: Grammar of data management (Hadley Wickham)


  • Functions
    • filter(): Select a subset of the rows of a data frame (see also slice())
    • arrange(): Reorders the rows of a data frame
    • select(): Selects columns (can be used for renaming, see rename())
    • distinct(): Returns unique values in a table
    • mutate(): Add new columns to existing columns (see also transmute())
    • summarise(): Collapses a data frame to a single row (e.g. aggregation)
    • group_by(): Break data set into groups (of rows), to apply above functions to each group


6.5 Filter and select rows by position

  • Selecting rows = selecting observations (e.g. countries, individuals etc.)
  • The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame


  • Example
# install.packages("dplyr")
library(dplyr)

# ?swiss # Check out the data set
# fix(swiss)

swiss <- dataa
# Filter
a<- filter(swiss, Agriculture >= 60 & Fertility >= 70)
swiss[swiss$Agriculture >= 60 & swiss$Fertility >= 70, ] # Classic approach

# Q: What if I want all observations/rows with Catholic <= 50 ?
# Q: What if I want all observations/rows with Catholic <= 50 OR Catholic  Catholic >= 80?
# Q: What if I want all observations/rows with Catholic <= 50 AND Catholic  Catholic >= 80?


# Normally, names of observations (e.g. countrys) are not saved as row.names
# but simply in a variable
row.names(swiss)


6.6 Arrange

  • Re-order columns
#Arrage
a<- arrange(swiss, Examination,Education,  Agriculture)
# Q: What if I want to sort according to examination first?

arrange(swiss, desc(Education), Examination)
desc(swiss$Education)
# Q: What if I want to sort Examination in descending order instead of Education?

# arrange() works similarly to filter(), right? but instead of filtering or selecting rows, it reorders them!


6.7 Select and rename columns

  • Classic way
    • names(dataframe) <- charactervector
    • or rename() function in different packages (e.g. package reshape)
    • better with dplyr
library(dplyr)
select(swiss, Fertility, Agriculture, Education) # Select columns by name
select(swiss, Agriculture:Education) # Variables from:to
a<-select(swiss, -(Agriculture:Education)) # Variables without (from:to)
select(swiss, Geburtsrate = Fertility) # Select and rename
a <-rename(swiss, Geburtsrate = Fertility, pacopico =Education) # Alternative because select drops non-mentioned variables!



select(swiss, Agriculture:Education)

# Q: What do I have to type to extract the variables Catholic and Infant.Mortality
# from the swiss dataset and save them in a new object?

# Other related commands: starts_with(), ends_with(), matches(), contains(), everything(). These let you quickly match larger blocks of variables that meet some criterion. See ?select for more details.

select(swiss, contains("Or"))


6.8 Add new (transformed) columns

  • Transformations are executed line-by-line
  • dplyr contains the functions mutate(), transform(), transmute()
swiss<-mutate(swiss, 
       Examination10 = Examination*10, 
       FertAgri = Fertility - Agriculture)

# Q: What does the above function do?

# Don't forget to assign the result to a new object if you want to save it!

# mutate() allows you to refer to variable you just created, transform() doesnt

mutate(swiss, Examination10 = Examination*10, NewExi = Examination10 - 10)
transform(swiss, Examination10 = Examination*10, NewExi = Examination10 - 10)

# Use transmute if you only wan't to keep new variables
transmute(swiss, Examination10 = Examination*100, NewExi = Examination10 - 10)

# Q: What if I want to get a new data set only with the variable Catholic but dividid by 10? What do I have to write?


6.8.1 Exercise: data.frame

  1. Extract all provinces from the data set swiss (?swiss) that have values on the variable Agriculture that are smaller or equal than 20 and bigger or equal than 80 (Agriculture <= 20 or >= 80). Save the results in a new data frame, that comprises the first 3 variables/columns of the old data set.
  2. Reorder the observations/rows in the data set swiss according to the variable Infant.Mortality. Which province has the highest level of Infant.Mortality?
  3. Add a new variable/column called Infant.Mortality.squared that contains the squared values of the variable Infant.Mortality.

###Solution

# 1.
filter(swiss, Agriculture <= 20 | Agriculture >= 80)
new <- select(filter(swiss, Agriculture <= 20 | Agriculture >= 80), 1, 2, 3)
# 2.

arrange(swiss, Infant.Mortality) # Porrentruy

# 3.
mutate(swiss, Infant.Mortality.squared = Infant.Mortality^2)


6.9 Group comparison and aggregation

  • We can apply above functions to subgroups within the dataset
    • Observations that have certain values on a variable
    • Individuals with different levels of education
    • Individuals belonging to different groups
  • dplyr lets you use the group_by() function to describe how to break a dataset down into groups of rows
  • dplyr functions recognize when data frame is grouped by using group_by()
group <- group_by(a, Examination)  # convert the data frame into a 
class(group)                                   # grouped data frame and save in object
                                   # Character variable to aggregate

# Character variable to aggregate
group # we can see the group variable and the dimensions
class(group) # we can see the new class, 
datar <- summarise(group, # summarise collapses data frame
  n = n(), # Add variable with the number of observations in group
  F.m = mean(Fertility, na.rm = TRUE)) # Variable containing mean


Another great function to do group comparisons and structure a dataframe is the one that follows the ‘pipeline’ logic.

  • %>% this is a pipe operator which means “nested in”, so you can encadenate functions

  • quite useful for summary statistics and group comparisons


dataa <- read.csv(file="trolleys.csv")

#the mean
weird.fish <- dataa %>% group_by(River) %>% summarise(mean(Number.of.Trolleys, na.rm = TRUE))
# Which is the winner?

#How to save it in a object?
a<- dataa %>%  group_by(River) %>% mutate(mean.trolley = mean(Number.of.Trolleys, na.rm = TRUE))

# What about if I want to do more descriptive statisitcs? summarise_each!
a<- dataa %>%  group_by(River) %>% filter(Year = 2005) %>%
              mutate(mean = mean(Number.of.Trolleys, na.rm = TRUE),
                        sd = sd(Number.of.Trolleys, na.rm = TRUE))

# Q: We have NA cases in the sample. How can we input a "0" value instead? 
# Q: What if we want to change the layout of the data? e.g. gather columns into rows or separate one column into severals? 


6.10 Changing the layout of your data

  • tidyr package is the most intutive package to reshape your data
    • basic logic
      • identify clearly which are: your observations (rows) and variables (columns)
    • Functions
      • gather(): gather columns into rows
      • separate(): separate one column into several
      • spread(): spread rows into columns


6.10.1 Example

# DATA
elites <- data.frame(
  name = c("Guinnes.plus", "Guinnes.classic", "Guinnes.flowerpower"),
  a = c(1, 2, 3),
  b = c(10, 2, 8)
)


# Gather
G<- elites %>%  gather(pepole, numb.beer, a:b) # gather(newvari, freq, wherefreq.comes) 

# separate
S <- G %>% separate(name, into = c("name", "type"), sep = "\\.") #look at the name variable 

# Spread
Spre <- S %>% spread(people, numb.beer)


6.10.2 Tuesday HOMEWORK: Put in practice what you have learnt today!

  1. Create a vector x of length 101, that starts with the number 33 and ends with the number 133.
  2. Extract the elements on the positions 26 to 50 and save them in a new object y. Extract the elements on position 1 to 25 and save them in a new vector z. Join the two vectors both column by column and subsequently line by line (rows!) and save the results in two new objects colyz and rowyz. What class do the last two objects that you created posses?
  3. Extract the elements (from x) that are smaller than 57 or greater/equal than 83 and save them in a new object subgroup.
  4. Load the dataset “Chickweight” and do the following operation with the vectors. +Extract the first 10 observations of the first two columns and save it in a new object. This object should be a dataframe called “lunch”
  • Generate a new colum in this dataset called “new”. This new column should be a factor vector in which the variable “time” is recorded in the following way. time from 0 to 2 = 1. from 3 to 6 = 2. from 7 to 12 = 3; from 13 to 20 = NA. Add a creative label to each value.
  • Extract the first 10 observations of the Chickweight last two columns (Chick and diet) and save it in a new object called “nonsensical”. Later, merge the dataset “nonsensical” and “lunch” in the way that the columns ‘Chick’ and ‘Diet’ are the first two columns and ‘weight’ and ‘time’ the last two colums.This new dataset should be save it as “tachaan” in your working directory as a csv file.
  • Use the ChickWeight dataset to asnwer the following question. How old are on average those chicken that have been following the Diet number 4 and 1? By looking at the dispersion of age between groups of chickens with the diet 1 and 4, would you say that their differences in age are significant? (hint: use pipelines!!)
  1. Load the “trolley” dataset and change the format in the following way. Every raw must be one river and columns must be the number of lost trolleys per year. Requierement: use the tidyr package. Hint look at the data wrangling cheatsheet!


6.10.3 Solution

# Vectors
# 1
x <- 33:133


# 2.
y <- x[26:50]
z <- x[1:25]
yz <- cbind(y,z)
class(yz)

zy <- rbind(y,z)
zy
class(zy)

# 3.
subgroup <- x[x<57 | x>=83]
subgroup

# 4.
library(dplyr)
objct<- ChickWeight[1:10,1:2]
pa<- ChickWeight[1:10,3:4]
data.frame <- data.frame(objct, pa)

objct["new"]<- as.factor(recode(objct$Time, "0:2= 1; 4:8=2; 10:14 =3 ; else =NA")) # car package
objct["new"]<- recode_factor(objct$Time, `0:2`= "1", `4:8`="2", `10:14` ="3", .default = NA_character_) #dplyr

pa<- ChickWeight[1:10,3:4]
tachan<- data.frame(pa , objct)
write.csv(tachan, "tachan.csv")

diet4<- ChickWeight %>% group_by(Diet) %>% summarise_each(funs(mean , sd), Time)

#5. 
library(tidyr)
clean <- dataa %>% group_by(River) %>% spread(Year, Number.of.Trolleys)


7 Analysis and visualization


7.1 Simple statistics and contingency tables

  • Univariate statistics
    • mean(x): Mean
    • sd(x): Standarddeviation
    • var(x): Variance
    • median(x): Median
    • min(x): Minimum
    • max(x): Maximum


  • Bivariate statistics
    • cov(): Covariance
    • cor(): Correlation
    • cor(x,y,method="kendall") # Kendall’s 0
    • ?cor: Check helpfile


  • Contingency tables
    • table(x): Onedimensional contingency table
    • table(x,y): Twodimensional contingency table
    • prop.table(table(x)): Onedimensional contingency table with proportions
    • 100*prop.table(table(x)): Onedimensional contingency table with proportions in percent


7.1.1 Example

library(foreign)
View(swiss)


# Contingency table for objects
table(Catholic)

a<- table(swiss$Catholic, swiss$Education)

# Table with percentages
100*prop.table(table(swiss$Catholic))
100*prop.table(table(swiss$Catholic, swiss$Education))
round(100*prop.table(table(swiss$Catholic, swiss$Education)), 2) # rounded values

# Mean
mean(swiss$Catholic)

# Median
median(swiss$Catholic) # "middle value"
sort(swiss$Catholic)

# Save several statistics in one vector
x <- swiss$Catholic

c(mean=mean(x), median=median(x), stddev=sd(x), min=min(x), max=max(x))

lista <- list("Mean"= mean(x), "Max" = max(x), "Min"= min(x), "Var"= var(x))
# or
a<- summary(swiss)

# Correlations between vectors
cor(swiss$Catholic,swiss$Education, method="spearman")
cov(swiss$Catholic,swiss$Education)


7.1.2 Exercise: Simple statistics and contingency tables

  1. Store the values for the two variables Assault and UrbanPop (dataframe USArrests) in two vectors x and y. Calculate for both vectors x and y the mean, the maximum, the minimum and the variance and save all the results in two objects statistics.x and statistics.y. Name the elements of these two objects “Mean”, “Max” etc. Save statistics.x and statistics.y as list elements in one list. Finally, check wether the two vectors x and y strongly correlate with each other.


7.1.3 Solution


7.2 Functions

  • …take an object as input and return values or objects
  • …are themselves objects
  • …are hidden behind all calculations and operators (e.g. + or -)
  • When you type the name of an object, the print()-function is called
  • function(object){Action of the function}: Define a function
  • When you enter the name of a function it’s content is displayed
  • curve(expr, from = NULL, to = NULL): Plot a mathematical function
  • Generic functions
    • They understand the input (e.g. print(), plot()) and react differently for different classes of objects
    • methods(functionname): List all available methods for a S3 and S4 generic function, or all methods for an S3 or S4 class.
      • e.g. methods(summary) and then summary.lm


7.2.1 Example

# Examples
cos(pi)
cos
print("Hallo")
print
mean(c(1,2,3))


# Example for defining a function
# e.g. squareroot
sqrt2 <- function(x) {
  x^0.5
  }

sqrt2(4)

# Enter name of function to get the content
sqrt2


# Another example
funktion.bsp <- function(x) {
  (x+10)^2
  }

funktion.bsp(2)

# Plot the function with curve()
curve(funktion.bsp, 1, 200) # Enters number 1 to 200 into the function and plots
                            # the result

# Delete a function
rm(funktion.bsp) # like any other object

# Another exampe: calculate the mean of a vector
mittelwert <- function(x) {
  sum(x)/length(x)
  }

x <- rep(1:70,5)
mittelwert(x)


7.2.2 Exercise: Functions

  1. Write a function that takes a vector x and adds 33 to each element in x.
  2. Generate a function that contains the following formula: \((x + 2)^{2}\) and apply this function to a vector that contains the integers from 1 to 10. Plot this function.


7.2.3 Solution


7.2.4 Another reason to convice you why functions are useful

df <- data.frame(replicate(6, sample(c(1:10, -99), 6, rep= T)))
names(df) <- letters[1:6]
df

#copy and paste procedure
df$e[df$e== -99]<-NA
# but better with a function
fix_missing <- function(x){x[x == -99]<- NA
x}

df[]<-lapply(df, fix_missing)

# more than one value? copy and paste?

missing_fixer <- function(na_value) {
  function(x) {
    x[x == na_value] <- NA
    x
  }
}
fix_missing <- missing_fixer(-99)
fix_missing1 <- missing_fixer(1)


df[]<-lapply(df, fix_missing)
df[]<-lapply(df, fix_missing1)


## Check the file "interaction plot" for a more sophisticated use of functions


7.3 Graphics


7.3.1 The basics

+ Understanding the basic workings of graphs is essential for creating more complex visualizations
+ Most simply: ```plot()```
    + A generic function (Q: What does that mean again?)
    + Opens a window, produces some plot, draws axes and a box around it
  • Steps
    • Open window: windows(8,8) and quartz(8,8)
    • Create graph: e.g. plot(x)
    • Save the plot
      • savePlot(filename="./figures/test.pdf", type="pdf")
        • Choose format with type = c("wmf", "emf", "png", "jpg", "jpeg", "bmp", "tif", "tiff", "ps", "eps", "pdf"
    • Close window: dev.off()
  • Or directly pdf() (and dev.off() after)


  • Example
getwd()

# Open window
window(8,8) # DINA4 page has a width of 8,27 inches
# MAC?

# Create histogram
plot(rnorm(n=101,sd=.8))
hist(rnorm(n=101,sd=.8))

# Save
savePlot(filename="test.jpg", type="jpg")

# Close the graphics window
dev.off()

# If you work on a server where it's not possible to open windows

# open device
pdf("test2.pdf")

# create a histogram
test<- hist(rnorm(n=101,sd=.8))

# close device
dev.off()
savePlot(filename="test.pdf", type="pdf")


7.3.2 Adding basic components

* ```points(xcoords,ycoords)```: Adds points 
*  ```lines(xcoords,ycoords)```: Adds lines
*  ```abline()```: Adds a straight line (e.g. regression line, ```?abline```)
    + e.g. ```abline(h=10)``` or ```abline(v=5)```
    + ```h``` = horizontal, ```v``` = vertical
* ```text()```: Adds text
    + e.g. ```text(xcoords,ycoords, labels="Enter Text here")```


7.3.3 Parameters: Local and global

  • Local parameters are set in the plot command, global parameters are set for several plots
    • Local
      • Main=""; xlab=""; ylab="": Add titles
      • xlim=c(min,max); ylim=c(min,max): Adapt axes scales
      • lwd: Linewidth
      • lty: Linetype (solid, dashed etc. )
      • pch: Plot Symbols (circles, crosses, points, etc.)
      • col: Color for points, lines etc.
      • bg: Backgroundcolor for plot and certain symbols
    • Global (for device/window):
      • par(): Set global parameters (?par), e.g. par(mfrow=c(2,2))
      • mfrow: Several plots in one window
      • oma: Set outer margins for plot
      • mar: Set inner margins for plot
    • Check out graph explaining margins


  • Example
getwd()

# Step for step: Example 1
windows(8,8)
x <- seq(from=0,to=2*pi,length=101)
y <- sin(x) + rnorm(n=101,sd=.5)
plot(x,y, xlim=c(-1,11), ylim=c(-3,3)) # 

lines(x,sin(x),col="red")

lines(c(1,5),c(-1.5,0.5)) # xcoords in first argument, y coords in second

abline(lm(y~x),lty=3)
abline(h=0)
abline(a=mean(y),b=0.1,lty=2) # a = intercept, b = slope


# labs <- paste(c("x"), 1:101, sep="")
labs <- paste(1:101) # create a character vector with labes
text(x,y, labels=labs, pos=1, cex=0.7)


# Global parameters
windows()
par(mfrow=c(2,2), oma=c(1,1,1,1), mar=c(4,4,0,0)) # try different numbers
plot(x,y)
plot(x,y)
plot(x,y)
plot(x,y)
# Try what happens if you change the global parameters
dev.off()


# SKIP



# Step for step: Example 2
x <- seq(from=0,to=2*pi,length=101)
y <- sin(x) + rnorm(n=101,sd=.5)
#
windows()
plot.default(x,y,type="n", xlab="Indep. var: x", ylab="Dep. var.: y")
# plot.default() is the default scatterplot function
lines(x,sin(x),lwd=2)
segments(x0=x,x1=x,y0=sin(x),y1=y)
points(x,y,pch=21,bg="green")
text(4,1, labels="LALALA")
dev.off()


7.3.4 Graphs for categorical variables

  • barplot(table(data$variable)): Barplot for absolute frequencies
  • barplot(prop.table(table(data$variable))): Barplot for relative frequencies
  • barplot(table(data$variable), horiz=TRUE, las=1): Barplot
  • barplot(table(data$variable1, data$variable2)): Stacked barplot
  • barplot(table(data$variable1, data$variable2), beside=TRUE): Grouped barplot


7.3.5 Graphs for metric variables

  • hist(data$variable): Histogramm
  • Boxplot
    • boxplot(x): One variable
    • boxplot(data$variable1 , data$variable2, data$variable3): Several variables
    • boxplot(data$variable1 data$variable2): Grouped
  • Scatterplot with regression line
    • plot(independentVariable, outcomeVariable)
    • abline(h=mean(outcomeVariable, na.rm=TRUE): Add straight line for the mean
    • abline(lm(outcomeVariable ~ independentVariable)): Add regression line


  • Example
setwd("C:/Users/usuario/Desktop/Workshop in R 2016")

library(foreign)
library(car)
swiss <- read.csv(file="swiss.csv")

# Graphs for categorical variables
# Generate a categorical variable in the dataset swisslibrary(car)

swiss["dummy.catholic"]<- recode(swiss$Catholic, "1:70 = '1'; 71:100 ='0'") # with car

swiss$dummy.catholic[swiss$Catholic>=70] <- 1 # classic way
swiss$dummy.catholic[swiss$Catholic<=70] <- 0

swiss["cat.fertility"]<- recode(swiss$Fertility, "1:64 = '1'; 65:70 ='2'; 70:78 ='3';else=4 ") # with car

swiss$cat.fertility[swiss$Fertility<=64] <- 1 # classic way
swiss$cat.fertility[swiss$Fertility>64 & swiss$Fertility<=70] <- 2
swiss$cat.fertility[swiss$Fertility>70 & swiss$Fertility<=78] <- 3
swiss$cat.fertility[swiss$Fertility>78] <- 4
swiss <- swiss[order(swiss$Fertility),]
swiss

# Barplot with absolute frequencies
barplot(table(swiss$dummy.catholic))

# Barplot with relative frequencies
barplot(prop.table(table(swiss$dummy.catholic)))

# barplot
barplot(table(swiss$dummy.catholic), horiz=TRUE, las=1)

# Stacked barplot
table(swiss$dummy.catholic, swiss$cat.fertility)
windows()
barplot(table(swiss$dummy.catholic, swiss$cat.fertility), 
        names.arg = c("x <= 64","64 < x <= 70","70 < x <= 78","78 < x"))

# Grouped barplot
barplot(table(swiss$dummy.catholic, swiss$cat.fertility), beside=TRUE)

### SKIP THIS 
# Four histograms
windows()
par(mfrow=c(2,2))
for (i in c(.8, .2, 2, 1)){
hist(rnorm(n=101,sd=i))
}


# Plot a function and a legend
windows()
curve(x^2,col="blue", xlim=c(0,10), ylim=c(0,10))
curve(x^3,add=TRUE,col="red")
curve(x^5,add=TRUE,col="green")
legends <- c("xhoch2", "xhoch3", "xhoch5")
legend(8,10, c("xhoch2", "xhoch3", "xhoch5"), pch = 20, 
       col=c("blue","red","green"))
abline(h=5)


7.3.6 Exercise: Graphs

  1. The data set swiss contains data for 47 French speaking provinces in Switzerland in the year 1888 (see ?swiss). Inspect the data set with the usual functions (How many variables are there? What class do these variables have? etc.).
  2. Create a scatterplot for the variables Education (x-Achse) and Fertility (y-Achse)
    • The title of the graph is “The relationship between education and fertility”.
    • The label for the x-Axis is “Education”.
    • The label for the y-Axis is “Fertility”.
    • Both the x-Axis as well as the y-Axis have a value range from 10 to 90.
  3. Create the following plot (see the histogram below) for the variable InfantMortality.
    • Tip: Check out Quick R for the normal curve. alt text


7.3.7 Solution


CLICK TO SHOW OR HIDE THE PART ON GGPLOT2.


CLICK TO SHOW OR HIDE THE PART ON GGPLOT2.


7.4 Statistical models

  • Goal:
    • learn the basic logic to set up a regression model
    • how to extract information from our model
    • how to present our results in a sexy way, visually speaking


  • General logic
    • Any statistical model is simply a mathematical equation (or several)
    • The equation contains parameters (e.g. \(\beta\)), which we don’t know but we can estimate them + i.e the values predicted by our equation (\(\hat{y}_{i}\)) are as close as possible to the observed values (\(y_{i}\)) of our outcome variable
    • There are various approaches for estimating parameters + OLS + Maximum Likelihood + Bayesian estimation (see Rstan)


  • Q: how is this set of information translated into R language?


7.4.1 Linear model (Regression)


  • lm(): R function to estimate a linear model (lm stands for “linear model”“)
  • lm(y ~ x): Use outcome variable y and explanatory variable x as input
  • lm(y ~ x,data = MyData): If both variables are located in the data frame “MyData”
  • lm(y~x1 + x2 + x3,data = MyData): Several independent variables
  • lm(y ~ x1 + x2 + x1:x2,data = MyData): Interaction effects
  • lm(y ~ x1 * x2,data = MyData): Same as above
  • result <- lm(y ~ x): Save the results
  • Access results by referring to the object name, e.g. result or summary(result)
  • An object of class "lm" contains different components
    • coef(result): The coefficients
    • residuals(result): The residuals
  • IMPORTANT: Option na.action==na.omit omits missings listwise


  • Example
swiss <- read.csv(file= "swiss.csv")  

?swiss
  summary(swiss$Catholic)
  summary(swiss$Fertility)
  summary(swiss$Education)
  
  fit <- lm(Fertility ~ Education + Catholic, data=swiss) # na.action==na.omit

# summary() is a generic function. If you input an lm() object it guesses you want a summary of regression statistics
  summary(fit)
fit$residuals

# Components of the lm object
  ls(fit)
  names(fit)
  class(fit)

# Access coefficents and other parts
  fit$fitted.values # ^y = y hat = predicted values
  fit$residuals     # residals = Predicted - Observed
  coef(fit)         # Coefficients beta1, beta2, beta3
  fit[1]
  fit[1]$coefficients[1] # Access subelements of object

# Scatterplot
  fit <- lm(Fertility ~ Education,data=swiss)
  plot(Fertility ~ Education,data=swiss)
  abline(fit, col="red")

  # Add labels
  install.packages("sp")
  install.packages("maptools")
  library(maptools)
  pointLabel(swiss$Education , swiss$Fertility, rownames(swiss), 
             cex = 0.7, method = "SANN", allowSmallOverlap = FALSE, 
             trace = FALSE, doPlot = TRUE)
  

# 3D SCATTERPLOT
# Load/install package "car"
  install.packages("car")
  library(car)

# Another Spinning 3d Scatterplot
  scatter3d(swiss$Catholic,swiss$Fertility,swiss$Education, 
            fit="linear") 
# adding a quadratic fit for the fun
  scatter3d(swiss$Catholic,swiss$Fertility,swiss$Education, 
            fit=c("linear","quadratic")) 


7.4.2 Checking assumptions linear regression model (OLS)

## TESTING ASSUMPTIONS OF THE LINEAR REGRESSION MODEL (e.g. Gelman & Hill 2007, 45-47)

# Checking robustness and assumptions
  # Good overview: http://www.stat.uni-muenchen.de/~helmut/limo09/limo_09_kap6.pdf)
  fit <- lm(Fertility ~ Education + Catholic, data=swiss)

# Linearity
  # Cause: Non-linear relationship between X and Y
  # Problem: Linear model does not produce good fit
  # Diagnose: e.g. partial residual plot for every variable
  crPlots(fit) # partial residual plot
  # Formula: http://en.wikipedia.org/wiki/Partial_residual_plot
  # Residuals + beta_i*X_i (y-axis) vs. X_i (x-axis)
  # beta_i is estimated and plotted while controlling for all other variables
  
  ceresPlots(fit) 
  # Generalization of component+residual (partial residual) 



# Independence or errors
  # Causes: autocorrelated time series or clustered data
  # Problem: wrong t-values
  # Diagnose: among others: Durban Watson Test http://de.wikipedia.org/wiki/Durbin-Watson-Test
  durbinWatsonTest(fit)



# Constant variance of errors (Homoskedasticity)
  # e_i dependent of i
  # Causes: errors often increase with x
  # Problem: wrong t-values -> problem for hypothesis testing
  # Diagnose: plot e_i vs fitted values
  spreadLevelPlot(fit)
  summary(fit$fitted.values) 
  # "Solution": calculate robust  standard errors (library(sandwich))
 
# Normality of errors
  # Cause: wrong distribution assumption / wrong model
  # Problem: wrong t-values, bad predictions
  # Diagnose: Plot residuals (are they normally distributed)
    # qq plot for studentized resid vs. theoretic quantiles
    qqPlot(fit, main="QQ Plot")

    
# Outliers based on deviant cases
  # Cause: special cases, data errors
  # Problem: Outliers may biase coefficients
  # Diagnose: e.g. Cook's D plot - http://en.wikipedia.org/wiki/Cook's_distance
  # measures effect of deleting observation
  # identify D values > 4/(n-k-1)
  cutoff <- 4/((nrow(swiss)-length(fit$coefficients)-2))
  plot(fit, which=4, cook.levels=cutoff)
  
  # Added variable plots
  library(car)
  avPlots(fit)
# Outliers based on the residuals 

 swiss["row.names"]<- as.numeric(row.names(swiss)) # Generate id, in
  r1<- cbind(residuals(fit))
  a<- abs(r1)< 2 # what sort of vector is this
  out <- data.frame(subset(swiss, ! swiss$row.names %in% c(5,15)))


7.4.3 Exercise: Linear regression in R

  1. Anscombe’s Quartet
    • Load the package car and inspect the data set Quartet
    • Use the data set Quartet to estimate several regressions and compare the coefficients of the following regression models: y1 on x (Model 1), y2 on x (Model 2), y3 on x (Model 3) and y4 on x4 (Model 4)!
  2. Generate 4 scatter plots + regression line for the regression models above. What would you conclude comparing the results from the regression with the graphs?
# Tip: par(mfrow=c(2,2)) can be used to plot multiple plots in one window! 
library(car)
par(mfrow=c(2,2))
plot(Quartet$x, Quartet$y1) # Scatterplot for the two variables
abline(lm(Quartet$y1 ~ Quartet$x))  

7.4.4 Solution


7.4.5 Summary statistics: the stargazer package

  • Stargazer is probably the best way to go. here the cheat sheet, but inspect the element anyway!

  • Important functions to consider: + type = the format of the table e.g. txt, word, html… +out= allocation and name of the document.’format’ +summary.stat = type and order of the statistics you want + flip = to flip the table axes + style = pre-formatted styles that imitate the most popular journals! + for labelling the table: title ,column.labels,covariate.labels, dep.var.caption, dep.var.labels, column.labels check the package for further information! + add a custom row to report statistics add.lines= list(c("statistic", value, values), c(the same))the rest of statistics remain! + ommit something? omit.stat + add a note? notes

library(stargazer)

a1<- lm(Education ~ Fertility, data= swiss)
a2 <-lm(Education ~ Fertility, data= swiss)
a3 <- lm(Education ~ Fertility, data= swiss)

# basic way of summarizing values 
stargazer(swiss, type = "html", out = "./summary.html")

# more complex!
stargazer(swiss, type = "html", out = "./summary.html", summary.stat = c("n", "mean", "sd"), style= "ajps") # !


stargazer(a1,a2, a3, type = "html", out = "./regression.html")
# what do you want me to change here?

8 For the brave researcher…


8.1 Generalized linear models

  • GLMs.. see Wikipedia and examples here
  • Very easy to fit in R
  • e.g. Logistic regression
    • Predicting a binary outcome (Voter participation: Yes vs. No)
  • e.g. Poisson regression
    • Predicting outcome variable that represents counts (Number of wars started in a given year)


  • Functions
    • glm(): R function to estimate a generalized linear model
    • glm(formula, family = familytype(link = linkfunction), data = )
      • family =: Specify a choice of variance and link functions
        • Every family comes with a default link function (see here) but you can change that
    • glm(outcome ~ x1 + x2 + x3, data = somedata, family = binomial()): Logistic model
    • glm(outcome ~ x1 + x2 + x3, data = somedata, family=poisson()): Poisson model
swiss2 <- swiss

library(plyr)
swiss2$d.Catholic <- cut(swiss2$Catholic,
                     breaks=c(-Inf, 50, Inf),
                     labels=c("low","high"))
class(swiss2$d.Catholic) # Factor!
swiss2$d.Catholic


# Logistic Regression
# where F is a binary factor and 
# x1-x3 are continuous predictors 
fit <- glm(d.Catholic ~ Agriculture + Education,data = swiss2, family = binomial())

# Display results
summary(fit)

# Coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable
# e.g. For every one unit change in Agriculture, the log odds of Catholic (vs. non-Catholic) increases by 0.6173.

confint(fit) # CIs using profiled log-likelihood
 # confidence intervals are based on the profiled log-likelihood function

confint.default(fit) # CIs using standard error 
exp(coef(fit)) # exponentiated coefficients
exp(confint(fit)) # 95% CI for exponentiated coefficients
predict(fit, type="response") # predicted values
residuals(fit, type="deviance") # residuals

# Poisson Regression
# where count is a count and 
# x1-x3 are continuous predictors 
fit <- glm(count ~ x1+x2+x3, data=mydata, family=poisson())
summary(fit) #display results


8.2 Multilevel models


  • Package lme4 and arm
  • The lmer function
    • fit <- lmer(y ~ 1 + (1 | county)): Varying-intercept model with no predictors
      • county: Grouping variable`
    • fit <- lmer(y ~ ilevel.var + (1 | county)): Varying-intercept model with an individual-level predictor
    • display(fit): Summarize estimation results
    • coef(fit): Display coefficients
    • fixef(fit): Average
    • ranef(fit): Group-level errors
    • se.fixef(fit) and se.ranef(fit): Pull out standard errors
    • lmer(y ~ ilevel.var + glevel.var + (1 | county)): Adding a group level predictor
    • lmer(y ~ ilevel.var + (1 + ilevel.var | county)): Varying intercept, varying slope
    • lmer(formula = y ~ ilevel.var + glevel.var + ilevel.var:glevel.var + (1 + ilevel.var | county)): Varying intercept, varying slope + group level predictor included



9 The end?

feelings…
feelings…