"R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other thing it has
It is referred to as the R environment to reflect that it is a fully planned and coherent system, rather than a part of a larger system or piece to be used in conjunction with other data analysis software to reach the eventual goal.
Not traditionally meant for statistics, it is now used by many for statistics and while it does not include all of the packages required to perform varying types of statistical analylsis many can be found at https://CRAN.R-project.org and elsehwere.
“The most convenient way to use R is at a graphics workstation running a windowing system.”
‘>’ Indicates the R program is ready for input commands.
Appendix A A sample Session
help.start()
## Start the HTML interface to on-line help (using a web browser available at your machine. You should breifly explore the features of this facility with the mouse. Iconify the help window and move on to the next part.
## Generate two pseudo-random normal vectors of x- and y-coordinates.
x = rnorm(50)
y = rnorm (x)
#Plot the points in the plane. A graphics window will appear automatically.
plot(x,y)
## See which objects are now in the R workspace.
ls()
[1] "x" "y"
## Remove objects no longer needed (clean up).
rm(x,y)
## Make x = (1,2,...,20).
x = 1:20
## Make w a 'weight' vector of standard deviations.
w = 1 + sqrt(x)/2
## Make a data frame of two columns, x and y, and look at it.
dummy = data.frame(x=x, y=x+rnorm(x)*w)
dummy
## Fit a simple linear regression and look at the analysis. With y to the left of the tilde, we are modelling y dependent on x.
fm = lm(y~x,data=dummy)
summary(fm)
Call:
lm(formula = y ~ x, data = dummy)
Residuals:
Min 1Q Median 3Q Max
-3.6925 -1.9183 0.1055 1.4652 4.2479
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.40447 1.12166 -1.252 0.227
x 1.05081 0.09363 11.222 1.47e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.415 on 18 degrees of freedom
Multiple R-squared: 0.875, Adjusted R-squared: 0.868
F-statistic: 125.9 on 1 and 18 DF, p-value: 1.472e-09
## Since we know the standard deviations, we can do a weighted regression.
fm1 = lm(y~x, data=dummy, weight=1/w^2)
summary(fm1)
Call:
lm(formula = y ~ x, data = dummy, weights = 1/w^2)
Weighted Residuals:
Min 1Q Median 3Q Max
-1.24313 -0.70306 0.00271 0.57434 1.67073
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.97279 0.80134 -1.214 0.24
x 1.01075 0.08171 12.370 3.09e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8855 on 18 degrees of freedom
Multiple R-squared: 0.8947, Adjusted R-squared: 0.8889
F-statistic: 153 on 1 and 18 DF, p-value: 3.09e-10
## Make the columns in the data frame visible as variables.
attach(dummy)
The following object is masked _by_ .GlobalEnv:
x
## Make a nonparametric local regression function.
lrf = lowess(x,y)
plot(x,y) ## Standard point plot.
lines(x,lrf$y) ## Add in the local regression.
abline(0,1,lty=3) ## The true regression line: (intercept 0, slope 1).
abline(coef(fm)) ## Unweighted regression line.
abline(coef(fm1),col="red") ## Weighted regression line.
detach() ## Remove data from the search path.
## A standard regression diagnostic plot to check for heteroscedasticity. Can you see it?
plot(fitted(fm),resid(fm),
xlab="Fitted values",
ylab="Residuals",
main="Residuals vs Fitted")
## A normal scores plot to check for skewness, kurtosis and outliers. (Not very useful here.)
qqnorm(resid(fm),main="Residuals Rankit Plot")
rm(fm,fm1,lrf,x,dummy) ## Clean up again.
The next section will look at data from the classical experiment of Michelson to measure the speed of light. This dataset is available in the morley object, but we will read it to illustrate the read.table function.
## Get the path to the data file.
filepath <- system.file("data","morley.tab",package="datasets")
filepath
[1] "/Library/Frameworks/R.framework/Resources/library/datasets/data/morley.tab"
## Optional. Look at the file.
file.show(filepath)
## Read in the Michelson data as a data frame, and look at it. There are five experminents (column Expt) and each has 20 runs (column Run) and s1 is the recorded speed of light, suitably coded.
mm <- read.table(filepath)
mm
## Change Expt and Run into factors.
mm$Expt=factor(mm$Expt)
mm$Run=factor(mm$Run)
## Make the data frame visible at position 2 (the default).
attach(mm)
## Compare the five experiments with simple boxplots.
plot(Expt,Speed,main="Speed of Light Data",xlab="Experiment No.")
## Analyze as a randomized block, with 'runs' and 'experiments' as factors.
fm=aov(Speed~Run+Expt,data=mm)
summary(fm)
Df Sum Sq Mean Sq F value Pr(>F)
Run 19 113344 5965 1.105 0.36321
Expt 4 94514 23629 4.378 0.00307 **
Residuals 76 410166 5397
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
## Fit the sub-model omitting 'runs', and compare using a formal analysis of variance.
fm0=update(fm,.~.-Run)
anova(fm0,fm)
Analysis of Variance Table
Model 1: Speed ~ Expt
Model 2: Speed ~ Run + Expt
Res.Df RSS Df Sum of Sq F Pr(>F)
1 95 523510
2 76 410166 19 113344 1.1053 0.3632
## Clean up before moving on.
detach()
rm(fm,fm0)
We can now look at some more graphical features: contour and image plots.
x <- seq(-pi,pi,len=50) ## x is a vector of 50 equally spaced values between -pi and pi
y <- x ## y is the same.
f=outer(x,y,function(x,y)cos(y)/(1+x^2)) ## f is a square matrix, with rows and columns indexed by x and y respectively, of values of the function cos(y)/(1+x^2).
oldpar=par(no.readonly=TRUE)
par(pty="s") ## Save the plotting paramaters and set the plotting region to "square".
contour(x,y,f)
contour(x,y,f,nlevels=15,add=TRUE) ## Make a contour map of f; add in more lines for more detail.
fa=(f-t(f))/2 ## fa is the "asymmetric part" of f. (t() is transpose).
contour(x,y,fa,nlevels=15) ## Make a contour plot, ...
par(oldpar) ## ... and restore the old graphics parameters.
image(x,y,f)
image(x,y,fa) ## Make some high density image plots, (of which you can get hardcopies if you wish), ...
objects(); rm(x,y,f,fa) ## ... and clean up before moving on.
[1] "f" "fa" "filepath" "mm" "oldpar" "w" "x"
[8] "y"
R can do complex arithmetic, also.
th <- seq(-pi,pi,len=100)
z <- exp(1i*th) ## 1i is used for the complex number i.
par(pty="s")
plot(z,type="l") ## Plotting complex arguments means plot imaginary versus real parts. This should be a circle.
w=rnorm(100)+rnorm(100)*1i ## Suppose we want to sample points within the unit circle. One method would be to take complex numbers with standard normal real and imaginary parts...
w=ifelse(Mod(w)>1,1/w,w) ## and to map any outside the circle onto their reciprocal.
plot(w,xlim=c(-1,1),ylim=c(-1,1),pch="+",xlab="x",ylab="y")
lines(z) ## All points are inside the unit circle, but the distribution is not uniform.
w=sqrt(runif(100))*exp(2*pi*runif(100)*1i)
plot(w,xlim=c(-1,1),ylim=c(-1,1),pch="+",xlab="x",ylab="y")
lines(z) ## The second method uses the uniform distribution. The points should now look more evenly spaced over the disc.
rm(th,w,z) ## Clean up again.
q() ## Quit the R program. You will be asked if you want to save the R workspace.
End of Appendix A Sample Session
R has an inbuilt help facility similar to the man facility of UNIX. To get more information on any specific named function, for example “solve”, the command is
help(solve) or ?solve
R is case sensitive.
Elementary commands consist of either expressions or assignmnts.
Commands are separated either by a semi-colon (‘;’), or by a newline. Elementary commands can be grouped together into one compound expression by braces (‘{’} and ‘}’). Comments can be put almost anywhere, starting with a hasmhark (‘#’), everything to the end of the line is a comment.
If a command is not complete at the end of a line, R will give a different prompt, by default ‘+’ on second and subsequent lines and continue to read input until the command is syntactically complete.
The vertical arrow keys on the keyboard can be used to scroll forward and backward through a command history.
The recall and editing capabilities under UNIX are highly customizable. You can find out how to do this by reading the manual entry for the readline library.
If commands are stored in an external file, say commands.R in the working directory work, they may be executed at any time in an R session with the command source(“commands.R”)
For Windows Source is also available on the File menu. The function sink, sink(“record.lis”) will divert all subsequent output from the console to an external file, record.lis. The command sink() restores it to the console once again.
The entities that R creates and maniputes are known as objects. These may be variables, arrays of numbers, character strings functions, or more general structures built from such components.
During an R session, objects are created and stored by name. The R commands objects() (alternatively, ls()) can be used to display the names of (most of) the objects which are currently stored within R. The collections of objects currently stored is called the workspace.
To remove objects the function rm is available.
For example: rm(x,y,z,ink,junk,temp,foo,bar)
All objects created during an R session can be stored permanently in a file for use in future R sessions At the end of each R session you are given the opportunity to save all the currently available objects. If you indicate that you want to do this, the objects are written to a file called .RData in the current directory, and the command lines used in the session are saved to a file called .Rhistory.
When R is started at later time from the same directory it reloads the workspace from this file. At the same time the associated commands history is reloaded.
It is recommended that you should use separate working directories for analyses conducted with R. It is quite common for objects with names x and y to be created during an analysis. Names like this are often meaningful in the context of a single analysis, but it can be quite hard to decide what they might be when the several analyses have been conducted in the same directory.