Basic calculations
You can use R for the basic computations you would perform on a
calculator.
# Subtraction
2-3
[1] -1
# Division
2/3
[1] 0.6666667
# Exponentiation
2^3
[1] 8
# Square root
sqrt(2)
[1] 1.414214
# Logarithms
log(2)
[1] 0.6931472
We can see that subtraction, division, exponentiation, square root,
and logarithms work. Note that log() computes the natural logarithm.
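A short sketch of the logarithm variants, added for illustration. The variable names are made up for this example; log(), log10(), and exp() are base R functions, and the base argument of log() selects a different base.

```r
# natural logarithm (base e), the default for log()
natural <- log(2)
# base-10 logarithm
base10 <- log10(1000)
# logarithm with an explicit base
base2 <- log(8, base = 2)
# exp() undoes the natural logarithm
roundtrip <- exp(log(2))

natural
base10
base2
roundtrip
```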
Before we move on to see how comparison operators work, let's test the
exponentiation, square root, and base-10 logarithm operations.
Often you will want to test whether something is less than, greater
than or equal to something.
sqrt(9)
[1] 3
9^2
[1] 81
log10(10)
[1] 1
3 == 8
[1] FALSE
3 != 8
[1] TRUE
3 <= 8
[1] TRUE
We see that the mathematical operations we tested worked. The
comparison operators also functioned as expected. Let's test some more
comparisons.
The logical operators are & for logical AND, | for logical OR, and !
for NOT. These are some examples:
4==4
[1] TRUE
4!=4
[1] FALSE
4<=3
[1] FALSE
# Logical Disjunction (or)
FALSE | FALSE
[1] FALSE
# Logical Conjunction (and)
TRUE & FALSE
[1] FALSE
# Negation
! FALSE
[1] TRUE
# Combination of statements
2 < 3 | 1 == 5
[1] TRUE
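In R, comparisons are evaluated before ! and before & and |, so the combined statement above is read as two comparisons joined by OR. A small sketch that makes the grouping explicit with parentheses (the variable names are made up for this example):

```r
# Comparisons are evaluated before |, so these two lines are equivalent
implicit <- 2 < 3 | 1 == 5
explicit <- (2 < 3) | (1 == 5)
implicit
explicit

# ! is applied after the comparison that follows it,
# so this negates the result of 3 == 8
negated <- !3 == 8
negated
```

When in doubt, adding parentheses costs nothing and makes the intent clear.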
Let's test some operators like & (and) and ! (negation).
Assigning Values to Variables
In R, you create a variable and assign it a value using <- as follows:
3<4&2!=4
[1] TRUE
! TRUE
[1] FALSE
foo <- 2 + 2
foo*3
[1] 12
Here we can see that the logical operator & works, as does the
negation operator !. We also see that the value stored in the variable
“foo” (4) gave 12 when multiplied (*) by 3.
Let's create a new variable called “ton”, store the value 5 in it, and
multiply it by 5 so that we get 25.
To see the variables that are currently defined, use ls (as in “list”).
ton <- 7-2
ton*5
ls()
We can see that the ls() function listed the two stored variables,
“foo” and “ton”.
Now let's test the remove and list functions together. We'll remove the
variable ‘foo’, so only ‘ton’ should be left.
To delete a variable, use rm (as in “remove”)
rm(foo)
ls()
We see that the remove function worked and the only variable left was
the ‘ton’ variable we created.
Either <- or = can be used to assign a value to a variable, but I
prefer <- because it is less likely to be confused with the comparison
operator ==.
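A minimal sketch of the difference, using made-up variable names a and b: both assignment forms store the same value, while == only compares and leaves the variable unchanged.

```r
# Both forms bind the value 10 to a name
a <- 10
b = 10
identical(a, b)

# == compares and returns TRUE or FALSE; it does not assign
a == 3
# a still holds 10
a
```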
Vectors
The basic type of object in R is a vector, which is an
ordered list of values of the same type. You can create a vector using
the c() function (as in “concatenate”).
bar <- c(2, 5, 10, 2, 1)
bar
We see above how the variable bar stored the five values in a vector
by using the concatenate function c().
We will see the same thing achieved below for the baz variable, where I
used the = assignment operator instead of <-.
baz = c(2, 2, 3, 3, 3)
baz
There are also some functions that will create vectors with regular
patterns, like repeated elements.
# replicate function
rep(2, 5)
# consecutive numbers
1:5
# sequence from 1 to 10 with a step of 2
seq(1, 10, by=2)
We can see that the replicate (rep) function created a vector with
the number 2 repeated five times.
The consecutive numbers operator (:) created a vector of consecutive
numbers from 1 through 5. Notice that it included the last number, 5;
unlike Python, we didn't have to set the upper limit to 6 to achieve this.
We also see that seq lets us create a vector that steps through a
specified range by the step we designate.
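A few more patterns with the same functions, as a sketch. The times, each, and length.out arguments are part of base R's rep() and seq(); the variable names are chosen for this example only.

```r
# repeat a whole vector twice: 1 2 1 2
whole_twice <- rep(c(1, 2), times = 2)
# repeat each element twice: 1 1 2 2
each_twice <- rep(c(1, 2), each = 2)
# five evenly spaced values from 0 to 1
evenly_spaced <- seq(0, 1, length.out = 5)
# the step does not have to land on the upper limit: 1 5 9
stops_early <- seq(1, 10, by = 4)

whole_twice
each_twice
evenly_spaced
stops_early
```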
Let's test the rep and seq functions by repeating the number 6 in a
vector four times, and by making a sequence from 1 through 20 with a
step of 3.
Many functions and operators like + or -
will work on all elements of the vector.
rep(6, 4)
seq(1,20, by=3)
# add vectors
bar + baz
# compare vectors
bar == baz
# find length of vector
length(bar)
# find minimum value in vector
min(bar)
# find average value in vector
mean(bar)
We can see that both of the vector generation functions we tested
worked: 6 is listed four times, and the numbers from 1 through 20 are
listed at intervals of three.
We also see that the addition worked element by element for the bar
and baz variables: 2+2=4, 5+2=7, 10+3=13, and so forth.
The comparison operator showed that only the first comparison was
true; none of the other values matched between bar and baz.
The length function counted the length of the ‘bar’ vector, which is
five. The minimum value of bar was one, and the mean of the ‘bar’
vector's values was 4.
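The same element-by-element behaviour can be combined with other base R functions. A short sketch, redefining bar and baz so it runs on its own (the result names are made up):

```r
bar <- c(2, 5, 10, 2, 1)
baz <- c(2, 2, 3, 3, 3)

# a single number is applied to every element
doubled <- bar * 2
# TRUE counts as 1, so sum() counts the matching positions
n_matches <- sum(bar == baz)
# which() reports the positions where the comparison is TRUE
match_positions <- which(bar == baz)

doubled
n_matches
match_positions
```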
Let's test whether the mean of the baz vector is 2.6 and whether its
length is five.
You can access parts of a vector by using [. Recall the value of the
vector bar.
mean(baz)
length(baz)
bar
# If you want to get the first element:
bar[1]
We can see that the values for the baz vector came out as we predicted.
Above we can also see that calling the ‘bar’ variable output its
vector, and that selecting the first element with [] on the bar variable
output 2.
Now let's list the baz vector and select its last element, which
should be 3.
If you want to get the last element of bar without
explicitly typing the number of elements of bar, make use
of the length function, which calculates the length of a
vector:
baz
baz[5]
bar[length(bar)]
We were able to list the baz vector, and its last element was as we
predicted above.
A more useful way to get the last element of a vector is to use the
numerical value of its length as the element position. Since bar has
five values, length(bar) returns 5, which is also the position of the
last element in the vector.
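Two related idioms, sketched here for comparison: tail() from the utils package (loaded by default in R) returns the last elements of a vector, and a negative index drops a position instead of selecting it. The result names are made up for this example.

```r
bar <- c(2, 5, 10, 2, 1)

# last element via the length of the vector
last_by_length <- bar[length(bar)]
# last element via tail(); the second argument is how many to return
last_by_tail <- tail(bar, 1)
# a negative index drops that position: everything except the last element
all_but_last <- bar[-length(bar)]

last_by_length
last_by_tail
all_but_last
```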
You can also extract multiple values from a vector. For instance to
get the 2nd through 4th values use
bar[c(2, 3, 4)]
bar[c(1, 3, 5)]
Above we can see that the 2nd, 3rd, and 4th elements of the bar
variable are listed correctly as 5, 10, and 2.
We also see that this works with elements in any sequence: in the
other line, the 1st, 3rd, and 5th elements of the same variable
(2, 10, and 1) were listed correctly.
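Two other ways to pick several elements, as a sketch: the : operator can build the index vector, and a logical condition inside [] keeps only the elements for which it is TRUE. The result names are made up for this example.

```r
bar <- c(2, 5, 10, 2, 1)

# 2:4 is the same index vector as c(2, 3, 4)
second_to_fourth <- bar[2:4]
# keep only the elements greater than 2
greater_than_two <- bar[bar > 2]

second_to_fourth
greater_than_two
```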
Vectors can also be strings or logical values
quxx <- c("a", "b", "cde", "fg")
quxx
We can see above that we are able to store character strings in a
vector as well.
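A sketch of the logical case, which the example above does not cover. The names flags and mixed are made up for this illustration; class() reports the type of a vector. Because all elements of a vector share one type, mixing types makes R convert everything to a common one.

```r
quxx <- c("a", "b", "cde", "fg")
# a vector of logical values
flags <- c(TRUE, FALSE, TRUE)

class(quxx)
class(flags)

# mixing types: every element is converted to a character string
mixed <- c(1, "a", TRUE)
mixed
class(mixed)
```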
Data Frames
In statistical applications, data is often stored as a data frame,
which is like a spreadsheet, with rows as observations and
columns as variables.
To manually create a data frame, use the data.frame()
function.
data.frame(foo = c(1, 2, 3),
bar = c("a", "b", "c"),
baz = c(1.5, 2.5, 3))
We can see above that we were able to store the records for each
vector in the data frame as columns. Notice that it is balanced. I was
curious to see whether an unbalanced data frame would cause an error,
and after testing I found that it does: the columns must end up with a
matching number of rows.
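A sketch of that test, with one addition: data.frame() recycles a shorter column when its length divides the longer one evenly, so only lengths that cannot be lined up this way raise the error. tryCatch() is used here so the script keeps running; the names result and recycled are made up for this example.

```r
# 3 values against 2 values cannot be lined up, so data.frame() fails;
# tryCatch() captures the error message instead of stopping the script
result <- tryCatch(
  data.frame(foo = c(1, 2, 3), bar = c("a", "b")),
  error = function(e) conditionMessage(e)
)
result

# 2 values divide evenly into 4 rows, so the shorter column is recycled
recycled <- data.frame(foo = 1:4, bar = c("a", "b"))
recycled
```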
Most often you will be using data frames loaded from a file. For
example, load the results of a class survey. The function
load or read.table can be used for this.
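A minimal sketch of both functions. The survey file and its columns are invented for this example and written to a temporary file first so the code is self-contained; read.table() reads a text file, while load() restores objects previously stored with save().

```r
# write a small, made-up survey to a temporary CSV file
survey_file <- tempfile(fileext = ".csv")
writeLines(c("student,smokes,height",
             "1,Yes,170",
             "2,No,165",
             "3,No,180"), survey_file)

# read it back: the first line holds the column names, fields are comma-separated
survey <- read.table(survey_file, header = TRUE, sep = ",")
survey

# save() stores R objects in a file; load() restores them under the same names
survey_rdata <- tempfile(fileext = ".RData")
save(survey, file = survey_rdata)
restored <- new.env()
load(survey_rdata, envir = restored)
ls(restored)
```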
How to Make a Random Sample
To randomly select a sample use the function sample().
The following code selects 5 numbers between 1 and 10 at random (without
duplication)
sample(1:10, size=5)
sample(1:20, size=8)
We can see that we are able to create a vector of randomly selected
values by using the sample function. All we have to do is specify the
range we want the numbers to be selected from and the number of values
we want produced (size).
- The first argument gives the vector of data to select elements from.
- The second argument (size=) gives the size of the sample to select.
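Two further options, sketched here: set.seed() makes a random draw repeatable, and replace = TRUE allows duplicates. The seed value 42 is arbitrary and the variable names are made up for this example.

```r
# the same seed gives the same sample each time
set.seed(42)
first_draw <- sample(1:10, size = 5)
set.seed(42)
second_draw <- sample(1:10, size = 5)
identical(first_draw, second_draw)

# with replace = TRUE values may repeat, so the sample
# can be larger than the vector it is drawn from
with_repeats <- sample(1:3, size = 10, replace = TRUE)
with_repeats
```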
Taking a simple random sample from a data frame is only slightly more
complicated, having two steps:
- Use sample() to select a sample of size n from a vector of the row
numbers of the data frame.
- Use the index operator [ to select those rows from the data frame.
Consider the following example with fake data. First, make
up a data frame with two columns. (LETTERS is a character
vector of length 26 with capital letters “A” to “Z”;
LETTERS is automatically defined and pre-loaded in
R.)
bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
# Check data frame
bar
Above we can see that we created a data frame (‘bar’, replacing the
previous variable of the same name) with two columns, using the
characters from the LETTERS vector for the first variable and the
numbers 1 through 10 for the second variable, created with the :
operator.
Suppose you want to select a random sample of size 5. First, define a
variable n with the size of the sample, i.e. 5
n <- 5
Above we created the variable ‘n’ to store the value 5, so we can
call it again later as the size (number of values) we want selected for
our random sample.
Now, select a sample of size 5 from the vector with 1 to 10 (the
number of rows in bar). Use the function
nrow() to find the number of rows in bar
instead of manually entering that number.
Use : to create a vector with all the integers between 1
and the number of rows in bar.
samplerows <- sample(1:nrow(bar), size=n)
# print sample rows
samplerows
Above we can see that we created the variable “samplerows” by using the
sample function. We set the range from 1 to the number of rows (nrow) in
the ‘bar’ data frame (10), then we set the size (the number of elements
we want output) with the variable ‘n’ we created before. So only five
random numbers between 1 and 10 are listed when we call the variable
‘samplerows’.
The variable samplerows contains the rows of
bar which make a random sample from all the rows in
bar. Extract those rows from bar with
# extract rows
barsample <- bar[samplerows, ]
# print sample
print(barsample)
Above we can see how we can select specific rows from a data frame and
store them in a new variable (barsample) by selecting rows of our
previous data frame (bar) with the values from our random sample
(samplerows). My random sample produced 3 4 7 9 1, and the
corresponding rows were inserted into the barsample data frame.
The code above creates a new data frame called
barsample with a random sample of rows from
bar.
In a single line of code:
bar[sample(1:nrow(bar), n), ]
Above we can see how we replicated the previous work of creating the
‘barsample’ data frame in one line of code, by combining the index
operator [] with the sample function and the nrow function.
Notice that this doesn't store the random sample as we did before;
it only displays the selection.
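A variation on the one-liner, as a sketch: seq_len(nrow(bar)) builds the same vector of row numbers as 1:nrow(bar), but it also behaves sensibly for a data frame with no rows, where 1:0 would count down to 1 0. Storing the result with <- keeps the sample for later use. The name empty is made up for this example.

```r
bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
n <- 5

# same idea as above, with seq_len() and with the result stored
barsample <- bar[sample(seq_len(nrow(bar)), n), ]
barsample

# a data frame with zero rows shows the difference
empty <- bar[0, ]
1:nrow(empty)          # counts down: 1 0
seq_len(nrow(empty))   # empty vector
```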
Using Tables
The table() command allows us to look at tables. Its
simplest usage looks like table(x) where x is
a categorical variable.
For example, a survey asks people if they smoke or not. The data
is
Yes, No, No, Yes, Yes
We can enter this into R with the c() command, and
summarize with the table() command as follows
x <- c("Yes","No","No","Yes","Yes")
table(x)
Above we can see that we created a new variable with character strings
stored in it. Then we used the table() function to display the count of
each value; our output is 2 No's and 3 Yes's, which is correct.
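The result of table() can be stored and used further. A short sketch: a single count can be picked out by its label, and prop.table() (a base R function) turns the counts into proportions. The names counts and shares are made up for this example.

```r
x <- c("Yes", "No", "No", "Yes", "Yes")
counts <- table(x)

# pick out one count by its label
counts["Yes"]

# convert counts to proportions of the total
shares <- prop.table(counts)
shares
```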
Numeric measures of center and spread
Suppose CEO yearly compensations are sampled and the following are
found (in millions):
12 .4 5 2 50 8 3 1 4 0.25
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
# the average
mean(sals)
# the variance
var(sals)
# the standard deviation
sd(sals)
# the median
median(sals)
# Tukey's five number summary, useful for boxplots
# five numbers: min, lower hinge, median, upper hinge, max
fivenum(sals)
# summary statistics
summary(sals)
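Base R has no built-in function for the statistical mode, so one has to be written with function. A minimal sketch of a getMode function, assuming it should return the most frequent value of a vector and, in case of a tie, the value that appears first. This is one possible definition and not necessarily the one used originally; the names values, distinct, and counts are made up for this example.

```r
# return the value that occurs most often in a vector
getMode <- function(values) {
  # the different values, in order of first appearance
  distinct <- unique(values)
  # how many times each distinct value occurs
  counts <- tabulate(match(values, distinct))
  # the distinct value with the highest count (first one on ties)
  distinct[which.max(counts)]
}

# works for character vectors as well as numbers
getMode(c("Yes", "No", "No", "Yes", "Yes"))
```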
Above we use the keyword “function” to create our own function.
In this case the function we create is called ‘getMode’, which
returns the value that is repeated the most in a vector.
As an example, we can use the function defined above to find the most
frequent value in the vector baz
# Most frequent value in baz
getMode(baz)
Here we use the function that we created, ‘getMode’, on the variable
‘baz’, which holds the vector 2, 2, 3, 3, 3. It outputs 3, which we can
see is the mode of the ‘baz’ vector.
