R as a language: an example with heat map developed by Esarey and Pierce Plots
FIN <- read.dta[file="find.dta"]
varfin <- data.frame(FIN[ which(FIN$rekry == "1"), ]) # only those who answer the self-adminsitered survey
new variable
varfin ["support"] <- (varfin$instrs+varfin$satis)/2
varsp ["support"] <- (varsp$instrs+ varsp$stdem)/2
# Analisys
library (mfx)
# Model Finland
m12 <- glm(formula= ambiva ~ efin + knowle + support + factor(steco)intr + sctr + resp + ideo+ factor(edu)+ age, data = varfin, family= "binomial" na.action= na.omit))
## Heat Map for Finland
install.packages(heatmapFit)
library("heatmapFit")
Varfin$a <- predict(m12, varfin, level = 0.90, type = c("response"))
par(mar = c(4 ,6 , 2, 2))
heatmap.fit(varfin$ambiva, varfin$a, calc.boot = T, reps = 1000, span.l = "aicc",
color = FALSE, compress.obs = TRUE, init.grid = 2000, ret.obs = F,legend = TRUE)
par(xpd=NA)
Technical superiority because of…
You are not alone. Good online-documentation (packages and tutorials) and lively community (forums)
getwd(): Display working directorysetwd(): Set working directorydir(): Display files in working directory\
\\ instead or / in
directory paths#: Start comment that is not evaluatedls(): Display workspace contenteverything (functions, data, words…)
__advice__Meaningful names; No blanks!; one name per object
a <- 10 or a = 10: Assign something
to an object
a: Type name to display content of object
a
rm(): Delete object from workspace
save(a, b, file = "myobjects.RData"): Save single
objects a and b
load("myobjects.RData"): Load saved objects
help.start() or in the menuhelp.search(Suchwort): Search help filehelp(function) or ?function
?plot (Explain help file
structure!) or ?ls# Comments in the script start with #
# Everything after # in the line is ignored by R
# Send code with CTRL + R
5+5
getwd() # Display working directory
# Generate a new folder on your hardrive called "2016_03_R_course_EUI"
# Use this folder as your working directory, i.e. to store the downloaded files in there
# If you need the path to your folder, copy it from the path field in the explorer (in windows)
# The course files are available here: ??????
# Set the your working directory to your new folder
setwd("C:/Users/usuario/Desktop/Workshop in R 2016")
dir() # Display content of working directory
a <- 50 # Generate object that contains the number 50
a # Show content of object
a = 70
a
# Note that a new object with the same name "X" will replace an previous object called "X"
ls() # Display objects in workspace
# Create a more complicated object
thesenseoflife <- c("worshipping god","having fun","no idea")
interesting <- thesenseoflife # define another object
interesting
# Objectnames do not have blanks
# Names should make sense and be systematic
# e.g. for dummy variables: d.male.voter
# e.g. variance of variables: var.income
a <- 1
b <- 2
# Save objects a and b in the file "objects-a-and-b.RData"
# in the working directory
save(a,b,file="objects-a-and-b.RData")
# Delete objects a and b
rm(a,b)
rm(list=ls())
ls() # display all objects
load("objects-a-and-b.RData") # load objects from wd
# Check if they were loaded
a
b
getwd()
hundert <- 100
xyz <- 250
hundert
ls()
save(hundred,xyz,file="workspace.RData")
rm(hundred,xyz)
load("workspace.RData")
ls()
save(ChickWeight, file = "lunch.RData")
+ - * / ^& | == != > < >= <=?function)
exp(x)=e^x log(x) log10(x) sin(x) cos(x) tan(x)abs(x) sqrt(x) ceiling(x) floor(x) trunc(x) round(x, digits=n)# Calculations
Ergebnis <- (23+24)*11/(18+15)*5
Ergebnis
# Functions
log(2) # exp(1) = 2.718282^0.6931472
##########################
# Comparisons.. examples #
##########################
x <- -3:3
x
# Equals
x == 0
# is smaller than
x < 0
# is bigger than or equal to
x >= 0
# unequal
x != 0
# unequal(equal to 0)
!(x == 0)
# bigger than -1 and smaller than 1
x > -1 & x < 1
# bigger than 1 and smaller than -1
x > 1 & x < -1
# bigger than 1 or smaller than -1
x > 1 | x < -1
mean() function it is possible to use and
additional argument called trim. Find out what this
argument does by checking the help file of the mean()
function and write it into your script.# 2.
((3+4-5)-9)^2
-99/33+42
log(1)
(sqrt(2))^2
# 3.
5==7
5*5>=6*4
sqrt(3)!=cos(17)
# 4.
# help(mean)
# ?mean
# the fraction (0 to 0.5) of observations to be trimmed from each end of x
# before the mean is computed.
Q: What is your experience with looking at data analysis code you have written 2 years earlier?
For larger projects you might consider tools that keep track of versions, branches etc. like Git (http://www.github.com among others offers free git repositories). github can easily be used from within RStudio.
What is Markdown? a notebook interface to weave together codes and text
Useful? reproduction, automated reports in any format, control (analysis<-> report)
Free? please…click here
My experience? collection of data on elite defection. Q: What about you?
# Get Rmarkdown installed
install.packages("rmarkdown")
# Generate a new file, check names and format of the file
# White = the space of text --> Information of the steps of your analysis
# Grey = R code as a separate paragraph--> analysis and specific comments about a command
#How does it work?
# treat it as a normal script, at the end of your work, you can click the knit button and get the file you want.
as.numeric()class(): Query object type (e.g. the attributes)str(): to check the internal structure of an
object# FOCUS ON THE STRUCTURE NOT THE CODE!
# HOMOGENEOUS
# Integer vector => integer numerical values
x <- c(1:9)
class(x)
x
# Numerical vector => numerical values
x <- c(1.3, 2.4, 3.5)
class(x)
x
# Logical vector => take on the values "TRUE/FALSE", are often the results of comparisons
x <- -3:3
y <- x >= 0
class(y)
y
# Character vector => are sequences of letters and/or words
x <- c("a", "b", "Adrian is so cool", "he said #$!?=96464#$%")
class(x)
x
# Matrix
matrix.2mal3 <- matrix(c(1,2,3,11,12,13), nrow = 2, ncol=3, byrow=TRUE)
class(matrix.2mal3)
matrix.2mal3
# Array (e.g. 3 matrices)
arraybsp <- array(1:50, c(5,5,2)) # numbers from 1 to 50 in array with 2 5*5 matrix
class(arraybsp)
arraybsp
# HETEROGENEOUS
# List
liste <- list(a= c(4:8), b = c(1:3), c = c("Ger", "FR", "It", "SE"))
class(liste)
liste
# Data frame: Examine by calling our ``lunch'' or fix()
class(lunch)
lunch[1:10, 1:3]
# Factor
x <- factor(rep(1:5,5), exclude = NULL)
class(x)
x # Levels are attributes
o <- ordered(rep(1:3,10),levels=c(1,2,3), labels=c("A", "B", "C"))
class(o)
o
# Functions: e.g. cos(); mean()
quadraticfunction <- function(){x^2}
class(quadraticfunction)
quadraticfunction
class(mean)
mean
c(): “concatenate”,
e.g. c(1.2,"Hans",5.0,6.7)length(): Get vector length:: indicates from/to for numbersrep("Peter",2): Repeat 5 two timesseq(5,8): Sequence from 5 to 8vector[positions]
[1] 1.20 3.50 5.00 6.70 8.00 10.00 13.55[1] = Position of the first element displayed in that
line in the vector (show it!)Inf and -Inf: Positive and negative
infinityNaN: “Not a number”, e.g. calculate
0/0NA: Missing valuerbind(): Combine vectors line-by-linecbind(): Combine vectors column-by-column # Generate two vectors
x <- 10:19
x
y <- c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
y
# seq(0,18,2)
z <- c("a", "b", "c", "d", "e", "f", "g", "h", "i", "j")
z
# Access vector elements
x[1] # 1st element of x
x[1:5] # First 5 elements of x
x[-(2:6)] # All elements but not positions 2 to 6
z[4] # ?
x[y]
x[x< 16 | x >=12] # values on X lower and higher than
# Add names to elements of a vector
names(x)
names(x) <- z
# or like this
another.vector <- c(a=1,b=2)
x
names(x)
# Combine vectors rowwise and columnwise
rbind(x,y)
cbind(y,z)
# Access with names
names(x)
x[c("a","c")]
x of length 404 in which the numbers
199 to 400 are repeated two times. Generate a new vector y,
that contains the elements of x on the positions 53, 78 and
99.Freunde that contains the
names of your three best friends.# 2.
x <- rep(1:5,10, length.out= 50)
x
# 3.
x <- rep(199:400,2)
x[x>=240]
length(x)
y <- x[c(53, 78, 99)]
y
# 4.
freunde <- c("Platon", "Aristoteles", "Corbyn")
install.packages("bookdown")
library
"factor"factor(): Create an unordered factorordered(): Create an ordered factorlevels(): Display categoriesas.numeric(): Convert factor to numeric vector"list"list(): Create a listlist$switzerland: Access element
switzerland of the list listlist[2]: Access second element of the list
list# Q: How do i get help for the function factor()?
# Create factor (nominal variable)
f <- factor(rep(1:2,10),levels=c(1,2), labels=c("SPD", "CDU"))
# Q: How do I go about to understand what happens in the function?
f
levels(f)
as.numeric(f)
# Create factor (ordinal variable)
o <- ordered(rep(1:3,10),levels=c(1,2,3), labels=c("low", "medium", "high"))
o
as.numeric(o)
class(o)
as.character(a)
a<- 1:10
c
# Create a list
participants <- list(Teacher= "Rudi",
Women = c("Daniela","Johanna"),
Men = c("Simon", "Peter", "usw."))
participants$Women
length(participants)
# Access elements or subsets of that list
participants$Teacher
participants[1]
participants[[1]]
participants[["Teacher"]]
# Q: How can I access the list element "Women"?
participants$Women[2]
# Q: How can i access the Johanna who is in the element "Women"?
mylist with two elements. The first
element first contains the numbers 5 to 105. The second
element second contains the numbers -1 to -50. Create
another vector x that contains the 70th value of the first
element of mylist and the 30th element of the second
element of mylist.anotherlist with four elements: A, B, C,
D. A contains the number 2. B contains a vector with the number 1 to 10.
C contains a character vector with the names “Bernd” “Hans” “Peter”. D
contains a vector with the numbers 1 to 100.names.D of the list and save them in an object names
xyz.test <- rep(1:10,10). Convert test to an
ordered factor. Check which categories the factor has and whether it’s
ordered.# 1.
mylist <- list(first=c(5:105),second =c(-1:-50))
mylist
x <- c(mylist$first[70], mylist$second[30])
x
# 2.
A <- 2
anotherlist <- list(A=2,
B=1:10,
C=c("Bernd", "Hans", "Peter"),
D=1:100)
# 3
names <- anotherlist[[3]]
names <- anotherlist$C
# 4
xyz <- anotherlist[[4]][25:35]
xyz <- anotherlist$D[25:35]
anotherlist$D[anotherlist$D>=25 & anotherlist$D<=35]
# Was ist der Unterschied zwischen den beiden?!!!?
# Wie w?rde man auf die Elemente 25 und 35 zugreifen?
# 5.
test <- rep(1:10,10)
ordered(test)
install.packages("packagename"): Install packagelibrary(packagename): Load an installed packagedetach("package:packagename"): Unload a packageremove.packages("packagename"): Unintall packagelibrary(): Display all installed packageslibrary(help=packagename): Describe a packagesearch(): Display all loaded packagesls("package:packagename"): Display all objects within a
packagepackagename::bar: Load just use one object in a package
(e.g. a function)dplyr.dplyr.foreign using the window"lists" of vectors of the same length under the hooddata.frame(): Create a data frameas.data.frame(): Convert into a data framesummary() und str(): Get oversight of a
data frame’s contenthead() and tail(): Inspect the data: Show data frame (formerlyfix()) + **Beware**.. you can't continue to work when the data window offix()```
is open!names(): Display variablenames (and use to rename)na.omit(): Delete missings “listwise”, i.e. rows that
contain at least one missingis.na(): Generate logical vector indicating the
missingsobject$var1: Access variable var1 in data
frame objectas.numeric(object$var1): Convert class of variable into
numericNA: What was that?!?!getwd() # Get working directory
# Create a data frame
data <- data.frame(id=1:3, # !
weight=c(20,27,24),
size=c("small", "large", "medium"))
# create a list
list<- list(id=c(1:3), weight=c(20,27,24), size=c("small", "large", "medium"))
# create a matrix
matrix <- cbind(id=c(1:3), weight=c(20,27,24), size=c("small", "large", "medium"))
# Can you observe the main differences?
Import ↑
read.table(): Function to import most common files
(.txt, .csv)read.table("mydata.csv", header = TRUE, sep = ",", dec = ".")
header = TRUE: First row equals variable namessep = ",": Columns are separted by a commadec = ".": Specfiy character for decimal points
(. is default)?read.table.csv OR .txt and
then import them (e.g. EXCEL files)foreign: Classic package with functions for data
importHmisc: Special package for SPSS format and when foreign
does not work to import SASExport ↑
.txt filewrite.table(swiss, "swiss.txt", sep=","): Save as
:.txt file?write.table for various arguments you can add to
the function# REGARDING .CSV format
#set the working directory
setwd("C:/Users/usuario/Desktop/Workshop in R 2016")
#Package
library(foreign)
# Import the dataset
dataa <- read.csv(file="swiss.csv")
# export the data set in .txt
write.table(dataa, file= "datos.txt")
# REGARDING .SAV format
#set the working directory
setwd ("C:/Users/usuario/Desktop")
#Package
library("Hmisc")
#Import the dataset
mydata <- spss.get(file = "datos.sav%202", use.value.labels=TRUE)
head(mydata)
# Export the dataset in .dta (stata format)
write.dta(mydata, "C:/Users/usuario/Desktop/datos.dta") # Q: why I have written here the whole path? what does it mean?
dataa <- read.dta(file="datos.dta")
# Check whether everything is right yes
summary(mydata$rutinaw10)
summary(dataa$rutinaw10)
* ...Figure out the logical structure of what you want to do
* ...Describe those tasks in the form of a computer package (dplyr in our case)
* ...Execute codes
setwd("C:/Users/usuario/Desktop/Workshop in R 2016")
library(foreign)
dataa <- read.csv(file="swiss.csv")
a <- dataa
# Case 1
a$inf.mort.cla[a$Infant.Mortality <= 18] <- 0
a$inf.mort.cla[a$Infant.Mortality > 18 & a$Infant.Mortality <= 20] <- 1
a$inf.mort.cla[a$Infant.Mortality > 20 & a$Infant.Mortality <= 21] <- 2
a$inf.mort.cla[a$Infant.Mortality > 21 & a$Infant.Mortality <= 27] <- 3
sub1<- filter(a,inf.mort.cla ==1)
sub2 <- filter (a, inf.mort.cla == 3)
meana <- mean(sub1$Education)
meanb <- mean(sub2$Education)
# Case 2
b<-a
library(car) ## used plyr
b["inf.mort.cla"]<- recode(b$Infant.Mortality, "0:18= '1'; 19:20='2'; 21:27 ='3'; else = '4'")
library(dplyr)
summary<- b %>% group_by(inf.mort.cla) %>% summarise_each(funs(mean), Education)
# Better right?
dataframe[rows,columns]dplyr are highly performant (big data!)
and consistentfilter(): Select a subset of the rows of a data frame
(see also slice())arrange(): Reorders the rows of a data frameselect(): Selects columns (can be used for renaming,
see rename())distinct(): Returns unique values in a tablemutate(): Add new columns to existing columns (see also
transmute())summarise(): Collapses a data frame to a single row
(e.g. aggregation)group_by(): Break data set into groups (of rows), to
apply above functions to each group# install.packages("dplyr")
library(dplyr)
# ?swiss # Check out the data set
# fix(swiss)
swiss <- dataa
# Filter
a<- filter(swiss, Agriculture >= 60 & Fertility >= 70)
swiss[swiss$Agriculture >= 60 & swiss$Fertility >= 70, ] # Classic approach
# Q: What if I want all observations/rows with Catholic <= 50 ?
# Q: What if I want all observations/rows with Catholic <= 50 OR Catholic Catholic >= 80?
# Q: What if I want all observations/rows with Catholic <= 50 AND Catholic Catholic >= 80?
# Normally, names of observations (e.g. countrys) are not saved as row.names
# but simply in a variable
row.names(swiss)
#Arrage
a<- arrange(swiss, Examination,Education, Agriculture)
# Q: What if I want to sort according to examination first?
arrange(swiss, desc(Education), Examination)
desc(swiss$Education)
# Q: What if I want to sort Examination in descending order instead of Education?
# arrange() works similarly to filter(), right? but instead of filtering or selecting rows, it reorders them!
names(dataframe) <- charactervectorrename() function in different packages
(e.g. package reshape)dplyrlibrary(dplyr)
select(swiss, Fertility, Agriculture, Education) # Select columns by name
select(swiss, Agriculture:Education) # Variables from:to
a<-select(swiss, -(Agriculture:Education)) # Variables without (from:to)
select(swiss, Geburtsrate = Fertility) # Select and rename
a <-rename(swiss, Geburtsrate = Fertility, pacopico =Education) # Alternative because select drops non-mentioned variables!
select(swiss, Agriculture:Education)
# Q: What do I have to type to extract the variables Catholic and Infant.Mortality
# from the swiss dataset and save them in a new object?
# Other related commands: starts_with(), ends_with(), matches(), contains(), everything(). These let you quickly match larger blocks of variables that meet some criterion. See ?select for more details.
select(swiss, contains("Or"))
dplyr contains the functions
mutate(), transform(), transmute()swiss<-mutate(swiss,
Examination10 = Examination*10,
FertAgri = Fertility - Agriculture)
# Q: What does the above function do?
# Don't forget to assign the result to a new object if you want to save it!
# mutate() allows you to refer to variable you just created, transform() doesnt
mutate(swiss, Examination10 = Examination*10, NewExi = Examination10 - 10)
transform(swiss, Examination10 = Examination*10, NewExi = Examination10 - 10)
# Use transmute if you only wan't to keep new variables
transmute(swiss, Examination10 = Examination*100, NewExi = Examination10 - 10)
# Q: What if I want to get a new data set only with the variable Catholic but dividid by 10? What do I have to write?
swiss
(?swiss) that have values on the variable
Agriculture that are smaller or equal than 20 and bigger or
equal than 80 (Agriculture <= 20 or
>= 80). Save the results in a new data frame, that
comprises the first 3 variables/columns of the old data set.Infant.Mortality. Which province has the highest
level of Infant.Mortality?Infant.Mortality.squared that contains the squared values
of the variable Infant.Mortality.###Solution
# 1.
filter(swiss, Agriculture <= 20 | Agriculture >= 80)
new <- select(filter(swiss, Agriculture <= 20 | Agriculture >= 80), 1, 2, 3)
# 2.
arrange(swiss, Infant.Mortality) # Porrentruy
# 3.
mutate(swiss, Infant.Mortality.squared = Infant.Mortality^2)
dplyr lets you use the group_by() function
to describe how to break a dataset down into groups of rowsdplyr functions recognize when data frame is grouped by
using group_by()group <- group_by(a, Examination) # convert the data frame into a
class(group) # grouped data frame and save in object
# Character variable to aggregate
# Character variable to aggregate
group # we can see the group variable and the dimensions
class(group) # we can see the new class,
datar <- summarise(group, # summarise collapses data frame
n = n(), # Add variable with the number of observations in group
F.m = mean(Fertility, na.rm = TRUE)) # Variable containing mean
Another great function to do group comparisons and structure a dataframe is the one that follows the ‘pipeline’ logic.
%>% this is a pipe operator which means “nested in”, so you can encadenate functions
quite useful for summary statistics and group comparisons
dataa <- read.csv(file="trolleys.csv")
#the mean
weird.fish <- dataa %>% group_by(River) %>% summarise(mean(Number.of.Trolleys, na.rm = TRUE))
# Which is the winner?
#How to save it in a object?
a<- dataa %>% group_by(River) %>% mutate(mean.trolley = mean(Number.of.Trolleys, na.rm = TRUE))
# What about if I want to do more descriptive statisitcs? summarise_each!
a<- dataa %>% group_by(River) %>% filter(Year = 2005) %>%
mutate(mean = mean(Number.of.Trolleys, na.rm = TRUE),
sd = sd(Number.of.Trolleys, na.rm = TRUE))
# Q: We have NA cases in the sample. How can we input a "0" value instead?
# Q: What if we want to change the layout of the data? e.g. gather columns into rows or separate one column into severals?
gather(): gather columns into rowsseparate(): separate one column into severalspread(): spread rows into columns# DATA
elites <- data.frame(
name = c("Guinnes.plus", "Guinnes.classic", "Guinnes.flowerpower"),
a = c(1, 2, 3),
b = c(10, 2, 8)
)
# Gather
G<- elites %>% gather(pepole, numb.beer, a:b) # gather(newvari, freq, wherefreq.comes)
# separate
S <- G %>% separate(name, into = c("name", "type"), sep = "\\.") #look at the name variable
# Spread
Spre <- S %>% spread(people, numb.beer)
x of length 101, that starts with the
number 33 and ends with the number 133.y. Extract the elements on position 1 to 25 and
save them in a new vector z. Join the two vectors both
column by column and subsequently line by line (rows!) and save the
results in two new objects colyz and rowyz.
What class do the last two objects that you created posses?x) that are smaller than 57
or greater/equal than 83 and save them in a new object
subgroup.# Vectors
# 1
x <- 33:133
# 2.
y <- x[26:50]
z <- x[1:25]
yz <- cbind(y,z)
class(yz)
zy <- rbind(y,z)
zy
class(zy)
# 3.
subgroup <- x[x<57 | x>=83]
subgroup
# 4.
library(dplyr)
objct<- ChickWeight[1:10,1:2]
pa<- ChickWeight[1:10,3:4]
data.frame <- data.frame(objct, pa)
objct["new"]<- as.factor(recode(objct$Time, "0:2= 1; 4:8=2; 10:14 =3 ; else =NA")) # car package
objct["new"]<- recode_factor(objct$Time, `0:2`= "1", `4:8`="2", `10:14` ="3", .default = NA_character_) #dplyr
pa<- ChickWeight[1:10,3:4]
tachan<- data.frame(pa , objct)
write.csv(tachan, "tachan.csv")
diet4<- ChickWeight %>% group_by(Diet) %>% summarise_each(funs(mean , sd), Time)
#5.
library(tidyr)
clean <- dataa %>% group_by(River) %>% spread(Year, Number.of.Trolleys)
mean(x): Meansd(x): Standarddeviationvar(x): Variancemedian(x): Medianmin(x): Minimummax(x): Maximumcov(): Covariancecor(): Correlationcor(x,y,method="kendall") # Kendall’s 0?cor: Check helpfiletable(x): Onedimensional contingency tabletable(x,y): Twodimensional contingency tableprop.table(table(x)): Onedimensional contingency table
with proportions100*prop.table(table(x)): Onedimensional contingency
table with proportions in percentlibrary(foreign)
View(swiss)
# Contingency table for objects
table(Catholic)
a<- table(swiss$Catholic, swiss$Education)
# Table with percentages
100*prop.table(table(swiss$Catholic))
100*prop.table(table(swiss$Catholic, swiss$Education))
round(100*prop.table(table(swiss$Catholic, swiss$Education)), 2) # rounded values
# Mean
mean(swiss$Catholic)
# Median
median(swiss$Catholic) # "middle value"
sort(swiss$Catholic)
# Save several statistics in one vector
x <- swiss$Catholic
c(mean=mean(x), median=median(x), stddev=sd(x), min=min(x), max=max(x))
lista <- list("Mean"= mean(x), "Max" = max(x), "Min"= min(x), "Var"= var(x))
# or
a<- summary(swiss)
# Correlations between vectors
cor(swiss$Catholic,swiss$Education, method="spearman")
cov(swiss$Catholic,swiss$Education)
Assault and
UrbanPop (dataframe USArrests) in two vectors
x and y. Calculate for both vectors
x and y the mean, the maximum, the minimum and
the variance and save all the results in two objects
statistics.x and statistics.y. Name the
elements of these two objects “Mean”, “Max” etc. Save
statistics.x and statistics.y as list elements
in one list. Finally, check wether the two vectors x and
y strongly correlate with each other.+ or -)print()-function is calledfunction(object){Action of the function}: Define a
functioncurve(expr, from = NULL, to = NULL): Plot a
mathematical functionprint(),
plot()) and react differently for different classes of
objectsmethods(functionname): List all available methods for a
S3 and S4 generic function, or all methods for an S3 or S4 class.
methods(summary) and then
summary.lm# Examples
cos(pi)
cos
print("Hallo")
print
mean(c(1,2,3))
# Example for defining a function
# e.g. squareroot
sqrt2 <- function(x) {
x^0.5
}
sqrt2(4)
# Enter name of function to get the content
sqrt2
# Another example
funktion.bsp <- function(x) {
(x+10)^2
}
funktion.bsp(2)
# Plot the function with curve()
curve(funktion.bsp, 1, 200) # Enters number 1 to 200 into the function and plots
# the result
# Delete a function
rm(funktion.bsp) # like any other object
# Another exampe: calculate the mean of a vector
mittelwert <- function(x) {
sum(x)/length(x)
}
x <- rep(1:70,5)
mittelwert(x)
x and adds 33 to
each element in x.df <- data.frame(replicate(6, sample(c(1:10, -99), 6, rep= T)))
names(df) <- letters[1:6]
df
#copy and paste procedure
df$e[df$e== -99]<-NA
# but better with a function
fix_missing <- function(x){x[x == -99]<- NA
x}
df[]<-lapply(df, fix_missing)
# more than one value? copy and paste?
missing_fixer <- function(na_value) {
function(x) {
x[x == na_value] <- NA
x
}
}
fix_missing <- missing_fixer(-99)
fix_missing1 <- missing_fixer(1)
df[]<-lapply(df, fix_missing)
df[]<-lapply(df, fix_missing1)
## Check the file "interaction plot" for a more sophisticated use of functions
+ Understanding the basic workings of graphs is essential for creating more complex visualizations
+ Most simply: ```plot()```
+ A generic function (Q: What does that mean again?)
+ Opens a window, produces some plot, draws axes and a box around it
windows(8,8) and
quartz(8,8)plot(x)savePlot(filename="./figures/test.pdf", type="pdf")
type = c("wmf", "emf", "png", "jpg", "jpeg", "bmp", "tif", "tiff", "ps", "eps", "pdf"dev.off()pdf() (and dev.off()
after)getwd()
# Open window
window(8,8) # DINA4 page has a width of 8,27 inches
# MAC?
# Create histogram
plot(rnorm(n=101,sd=.8))
hist(rnorm(n=101,sd=.8))
# Save
savePlot(filename="test.jpg", type="jpg")
# Close the graphics window
dev.off()
# If you work on a server where it's not possible to open windows
# open device
pdf("test2.pdf")
# create a histogram
test<- hist(rnorm(n=101,sd=.8))
# close device
dev.off()
savePlot(filename="test.pdf", type="pdf")
* ```points(xcoords,ycoords)```: Adds points
* ```lines(xcoords,ycoords)```: Adds lines
* ```abline()```: Adds a straight line (e.g. regression line, ```?abline```)
+ e.g. ```abline(h=10)``` or ```abline(v=5)```
+ ```h``` = horizontal, ```v``` = vertical
* ```text()```: Adds text
+ e.g. ```text(xcoords,ycoords, labels="Enter Text here")```
Main=""; xlab=""; ylab="":
Add titlesxlim=c(min,max); ylim=c(min,max): Adapt
axes scaleslwd: Linewidthlty: Linetype (solid, dashed etc. )pch: Plot Symbols (circles, crosses, points, etc.)col: Color for points, lines etc.bg: Backgroundcolor for plot and certain symbolspar(): Set global parameters (?par),
e.g. par(mfrow=c(2,2))mfrow: Several plots in one windowoma: Set outer margins for plotmar: Set inner margins for plotgetwd()
# Step for step: Example 1
windows(8,8)
x <- seq(from=0,to=2*pi,length=101)
y <- sin(x) + rnorm(n=101,sd=.5)
plot(x,y, xlim=c(-1,11), ylim=c(-3,3)) #
lines(x,sin(x),col="red")
lines(c(1,5),c(-1.5,0.5)) # xcoords in first argument, y coords in second
abline(lm(y~x),lty=3)
abline(h=0)
abline(a=mean(y),b=0.1,lty=2) # a = intercept, b = slope
# labs <- paste(c("x"), 1:101, sep="")
labs <- paste(1:101) # create a character vector with labes
text(x,y, labels=labs, pos=1, cex=0.7)
# Global parameters
windows()
par(mfrow=c(2,2), oma=c(1,1,1,1), mar=c(4,4,0,0)) # try different numbers
plot(x,y)
plot(x,y)
plot(x,y)
plot(x,y)
# Try what happens if you change the global parameters
dev.off()
# SKIP
# Step for step: Example 2
x <- seq(from=0,to=2*pi,length=101)
y <- sin(x) + rnorm(n=101,sd=.5)
#
windows()
plot.default(x,y,type="n", xlab="Indep. var: x", ylab="Dep. var.: y")
# plot.default() is the default scatterplot function
lines(x,sin(x),lwd=2)
segments(x0=x,x1=x,y0=sin(x),y1=y)
points(x,y,pch=21,bg="green")
text(4,1, labels="LALALA")
dev.off()
barplot(table(data$variable)): Barplot for absolute
frequenciesbarplot(prop.table(table(data$variable))): Barplot for
relative frequenciesbarplot(table(data$variable), horiz=TRUE, las=1):
Barplotbarplot(table(data$variable1, data$variable2)): Stacked
barplotbarplot(table(data$variable1, data$variable2), beside=TRUE):
Grouped barplothist(data$variable): Histogrammboxplot(x): One variableboxplot(data$variable1 , data$variable2, data$variable3):
Several variablesboxplot(data$variable1 data$variable2): Groupedplot(independentVariable, outcomeVariable)abline(h=mean(outcomeVariable, na.rm=TRUE): Add
straight line for the meanabline(lm(outcomeVariable ~ independentVariable)): Add
regression linesetwd("C:/Users/usuario/Desktop/Workshop in R 2016")
library(foreign)
library(car)
swiss <- read.csv(file="swiss.csv")
# Graphs for categorical variables
# Generate a categorical variable in the dataset swisslibrary(car)
swiss["dummy.catholic"]<- recode(swiss$Catholic, "1:70 = '1'; 71:100 ='0'") # with car
swiss$dummy.catholic[swiss$Catholic>=70] <- 1 # classic way
swiss$dummy.catholic[swiss$Catholic<=70] <- 0
swiss["cat.fertility"]<- recode(swiss$Fertility, "1:64 = '1'; 65:70 ='2'; 70:78 ='3';else=4 ") # with car
swiss$cat.fertility[swiss$Fertility<=64] <- 1 # classic way
swiss$cat.fertility[swiss$Fertility>64 & swiss$Fertility<=70] <- 2
swiss$cat.fertility[swiss$Fertility>70 & swiss$Fertility<=78] <- 3
swiss$cat.fertility[swiss$Fertility>78] <- 4
swiss <- swiss[order(swiss$Fertility),]
swiss
# Barplot with absolute frequencies
barplot(table(swiss$dummy.catholic))
# Barplot with relative frequencies
barplot(prop.table(table(swiss$dummy.catholic)))
# barplot
barplot(table(swiss$dummy.catholic), horiz=TRUE, las=1)
# Stacked barplot
table(swiss$dummy.catholic, swiss$cat.fertility)
windows()
barplot(table(swiss$dummy.catholic, swiss$cat.fertility),
names.arg = c("x <= 64","64 < x <= 70","70 < x <= 78","78 < x"))
# Grouped barplot
barplot(table(swiss$dummy.catholic, swiss$cat.fertility), beside=TRUE)
### SKIP THIS
# Four histograms
windows()
par(mfrow=c(2,2))
for (i in c(.8, .2, 2, 1)){
hist(rnorm(n=101,sd=i))
}
# Plot a function and a legend
windows()
curve(x^2,col="blue", xlim=c(0,10), ylim=c(0,10))
curve(x^3,add=TRUE,col="red")
curve(x^5,add=TRUE,col="green")
legends <- c("xhoch2", "xhoch3", "xhoch5")
legend(8,10, c("xhoch2", "xhoch3", "xhoch5"), pch = 20,
col=c("blue","red","green"))
abline(h=5)
swiss contains data for 47 French speaking
provinces in Switzerland in the year 1888 (see ?swiss).
Inspect the data set with the usual functions (How many variables are
there? What class do these variables have? etc.).Education
(x-Achse) and Fertility (y-Achse)
InfantMortality.

lm(): R function to estimate a linear model
(lm stands for “linear model”“)lm(y ~ x): Use outcome variable y and
explanatory variable x as inputlm(y ~ x,data = MyData): If both variables are located
in the data frame “MyData”lm(y~x1 + x2 + x3,data = MyData): Several independent
variableslm(y ~ x1 + x2 + x1:x2,data = MyData): Interaction
effectslm(y ~ x1 * x2,data = MyData): Same as aboveresult <- lm(y ~ x): Save the resultsresult or summary(result)"lm" contains different components
coef(result): The coefficientsresiduals(result): The residualsna.action==na.omit omits missings
listwiseswiss <- read.csv(file= "swiss.csv")
?swiss
summary(swiss$Catholic)
summary(swiss$Fertility)
summary(swiss$Education)
fit <- lm(Fertility ~ Education + Catholic, data=swiss) # na.action==na.omit
# summary() is a generic function. If you input an lm() object it guesses you want a summary of regression statistics
summary(fit)
fit$residuals
# Components of the lm object
ls(fit)
names(fit)
class(fit)
# Access coefficents and other parts
fit$fitted.values # ^y = y hat = predicted values
fit$residuals # residals = Predicted - Observed
coef(fit) # Coefficients beta1, beta2, beta3
fit[1]
fit[1]$coefficients[1] # Access subelements of object
# Scatterplot
fit <- lm(Fertility ~ Education,data=swiss)
plot(Fertility ~ Education,data=swiss)
abline(fit, col="red")
# Add labels
install.packages("sp")
install.packages("maptools")
library(maptools)
pointLabel(swiss$Education , swiss$Fertility, rownames(swiss),
cex = 0.7, method = "SANN", allowSmallOverlap = FALSE,
trace = FALSE, doPlot = TRUE)
# 3D SCATTERPLOT
# Load/install package "car"
install.packages("car")
library(car)
# Another Spinning 3d Scatterplot
scatter3d(swiss$Catholic,swiss$Fertility,swiss$Education,
fit="linear")
# adding a quadratic fit for the fun
scatter3d(swiss$Catholic,swiss$Fertility,swiss$Education,
fit=c("linear","quadratic"))
see Gelman/Hill (2007: 45/46); Quick-R: http://www.statmethods.net/stats/ and rdiagnostics.html
Example
## TESTING ASSUMPTIONS OF THE LINEAR REGRESSION MODEL (e.g. Gelman & Hill 2007, 45-47)
# Checking robustness and assumptions
# Good overview: http://www.stat.uni-muenchen.de/~helmut/limo09/limo_09_kap6.pdf)
fit <- lm(Fertility ~ Education + Catholic, data=swiss)
# Linearity
# Cause: Non-linear relationship between X and Y
# Problem: Linear model does not produce good fit
# Diagnose: e.g. partial residual plot for every variable
crPlots(fit) # partial residual plot
# Formula: http://en.wikipedia.org/wiki/Partial_residual_plot
# Residuals + beta_i*X_i (y-axis) vs. X_i (x-axis)
# beta_i is estimated and plotted while controlling for all other variables
ceresPlots(fit)
# Generalization of component+residual (partial residual)
# Independence or errors
# Causes: autocorrelated time series or clustered data
# Problem: wrong t-values
# Diagnose: among others: Durban Watson Test http://de.wikipedia.org/wiki/Durbin-Watson-Test
durbinWatsonTest(fit)
# Constant variance of errors (Homoskedasticity)
# e_i dependent of i
# Causes: errors often increase with x
# Problem: wrong t-values -> problem for hypothesis testing
# Diagnose: plot e_i vs fitted values
spreadLevelPlot(fit)
summary(fit$fitted.values)
# "Solution": calculate robust standard errors (library(sandwich))
# Normality of errors
# Cause: wrong distribution assumption / wrong model
# Problem: wrong t-values, bad predictions
# Diagnose: Plot residuals (are they normally distributed)
# qq plot for studentized resid vs. theoretic quantiles
qqPlot(fit, main="QQ Plot")
# Outliers based on deviant cases
# Cause: special cases, data errors
# Problem: Outliers may biase coefficients
# Diagnose: e.g. Cook's D plot - http://en.wikipedia.org/wiki/Cook's_distance
# measures effect of deleting observation
# identify D values > 4/(n-k-1)
cutoff <- 4/((nrow(swiss)-length(fit$coefficients)-2))
plot(fit, which=4, cook.levels=cutoff)
# Added variable plots
library(car)
avPlots(fit)
# Outliers based on the residuals
swiss["row.names"]<- as.numeric(row.names(swiss)) # Generate id, in
r1<- cbind(residuals(fit))
a<- abs(r1)< 2 # what sort of vector is this
out <- data.frame(subset(swiss, ! swiss$row.names %in% c(5,15)))
car and inspect the data set
QuartetQuartet to estimate several
regressions and compare the coefficients of the following regression
models: y1 on x (Model 1), y2 on x (Model 2), y3 on x (Model 3) and y4
on x4 (Model 4)!# Tip: par(mfrow=c(2,2)) can be used to plot multiple plots in one window!
library(car)
par(mfrow=c(2,2))
plot(Quartet$x, Quartet$y1) # Scatterplot for the two variables
abline(lm(Quartet$y1 ~ Quartet$x))
Stargazer is probably the best way to go. here the cheat sheet, but inspect the element anyway!
Important functions to consider: + type = the format
of the table e.g. txt, word, html… +out= allocation and
name of the document.’format’ +summary.stat = type and
order of the statistics you want + flip = to flip the table
axes + style = pre-formatted styles that imitate the most
popular journals! + for labelling the table: title
,column.labels,covariate.labels,
dep.var.caption, dep.var.labels,
column.labels check the package for further information! +
add a custom row to report statistics
add.lines= list(c("statistic", value, values), c(the same))the
rest of statistics remain! + ommit something? omit.stat +
add a note? notes
library(stargazer)
a1<- lm(Education ~ Fertility, data= swiss)
a2 <-lm(Education ~ Fertility, data= swiss)
a3 <- lm(Education ~ Fertility, data= swiss)
# basic way of summarizing values
stargazer(swiss, type = "html", out = "./summary.html")
# more complex!
stargazer(swiss, type = "html", out = "./summary.html", summary.stat = c("n", "mean", "sd"), style= "ajps") # !
stargazer(a1,a2, a3, type = "html", out = "./regression.html")
# what do you want me to change here?
glm(): R function to estimate a generalized linear
modelglm(formula, family = familytype(link = linkfunction), data = )
family =: Specify a choice of variance and link
functions
glm(outcome ~ x1 + x2 + x3, data = somedata, family = binomial()):
Logistic modelglm(outcome ~ x1 + x2 + x3, data = somedata, family=poisson()):
Poisson modelswiss2 <- swiss
library(plyr)
swiss2$d.Catholic <- cut(swiss2$Catholic,
breaks=c(-Inf, 50, Inf),
labels=c("low","high"))
class(swiss2$d.Catholic) # Factor!
swiss2$d.Catholic
# Logistic Regression
# where F is a binary factor and
# x1-x3 are continuous predictors
fit <- glm(d.Catholic ~ Agriculture + Education,data = swiss2, family = binomial())
# Display results
summary(fit)
# Coefficients give the change in the log odds of the outcome for a one unit increase in the predictor variable
# e.g. For every one unit change in Agriculture, the log odds of Catholic (vs. non-Catholic) increases by 0.6173.
confint(fit) # CIs using profiled log-likelihood
# confidence intervals are based on the profiled log-likelihood function
confint.default(fit) # CIs using standard error
exp(coef(fit)) # exponentiated coefficients
exp(confint(fit)) # 95% CI for exponentiated coefficients
predict(fit, type="response") # predicted values
residuals(fit, type="deviance") # residuals
# Poisson Regression
# where count is a count and
# x1-x3 are continuous predictors
fit <- glm(count ~ x1+x2+x3, data=mydata, family=poisson())
summary(fit) #display results
fit <- lmer(y ~ 1 + (1 | county)): Varying-intercept
model with no predictors
county: Grouping variable`fit <- lmer(y ~ ilevel.var + (1 | county)):
Varying-intercept model with an individual-level predictordisplay(fit): Summarize estimation resultscoef(fit): Display coefficientsfixef(fit): Averageranef(fit): Group-level errorsse.fixef(fit) and se.ranef(fit): Pull out
standard errorslmer(y ~ ilevel.var + glevel.var + (1 | county)):
Adding a group level predictorlmer(y ~ ilevel.var + (1 + ilevel.var | county)):
Varying intercept, varying slopelmer(formula = y ~ ilevel.var + glevel.var + ilevel.var:glevel.var + (1 + ilevel.var | county)):
Varying intercept, varying slope + group level predictor included