---
title: "First Steps in `R`"
output: html_notebook
---

# Basic calculations

You can use R for the basic computations you would perform on a calculator.

```{r}
# Subtraction
2-3
# Division
2/3
# Exponentiation
2^3 
# Square root
sqrt(2)
# Natural logarithm (base e)
log(2)
```

We can see that subtraction, division, exponentiation, the square root, and the natural logarithm all work.

Before moving on to comparison operators, let's test exponentiation, the square root, and the base-10 logarithm.

Often you will want to test whether something is less than, greater than, or equal to something else.

```{r}
sqrt(9)
9^2
log10(10)
3 == 8
3 != 8
3 <= 8
```

We see that the mathematical operations we tested worked, and the comparison operators behaved as expected. Let's test some more comparisons.
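
The chunk above exercises `==`, `!=`, and `<=`. As a small complementary sketch (the numbers are arbitrary and chosen only for illustration), the remaining comparison operators behave the same way:

```{r}
# Strictly less than
3 < 8
# Strictly greater than
3 > 8
# Greater than or equal to
8 >= 8
```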

The *logical operators* are `&` for logical **AND**, `|` for logical **OR**, and `!` for **NOT**. These are some examples:

```{r}
4==4
4!=4
4<=3

# Logical Disjunction (or)
FALSE | FALSE
# Logical Conjunction (and)
TRUE & FALSE
# Negation
! FALSE
# Combination of statements
2 < 3 | 1 == 5
```
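
When several operators appear in one expression, R evaluates comparisons first, then `!`, then `&`, and finally `|`. The sketch below (values chosen only for illustration) shows how parentheses change the result:

```{r}
# & binds tighter than |, so this reads as TRUE | (FALSE & FALSE)
TRUE | FALSE & FALSE
# Parentheses force the OR to be evaluated first
(TRUE | FALSE) & FALSE
# Comparisons are evaluated before !, so this reads as !(2 < 3)
!2 < 3
```

When in doubt, adding parentheses makes the intended grouping explicit.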

Let's test the `&` (AND) and `!` (negation) operators.

# Assigning Values to Variables

In R, you create a variable and assign it a value using `<-` as follows:

```{r}
3<4&2!=4
! TRUE
foo <- 2 + 2
foo*3
```

Here we can see that the logical operator `&` works, as does negation. We also see that multiplying the value stored in the variable `foo` (4) by 3 gives 12.

Let's create a new variable called `ton`, store the value 5 in it, and multiply it by 5 so that we get 25.

To see the variables that are currently defined, use `ls` (as in "list").

```{r}
ton <- 7-2
ton*5
ls()
```

We can see that the `ls()` function listed the two stored variables, `foo` and `ton`.

Now let's test the remove and list functions. We'll remove the variable `foo`, so only `ton` should be left.

To delete a variable, use `rm` (as in "remove").

```{r}
rm(foo)
ls()
```

We see that the remove function worked: the only variable left is `ton`.

Either `<-` or `=` can be used to assign a value to a variable, but I prefer `<-` because it is less likely to be confused with the equality operator `==`.
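
To make the difference concrete, here is a minimal sketch. The names `count_a` and `count_b` are hypothetical and are removed again at the end so the workspace stays unchanged:

```{r}
# Hypothetical variable names, used only for this illustration
count_a <- 10         # assignment with <-
count_b = 10          # assignment with =, same effect at the top level
count_a == count_b    # comparison: tests equality and assigns nothing
# Clean up so the workspace is left as it was
rm(count_a, count_b)
```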

# Vectors

The basic type of object in R is a *vector*, which is an ordered list of values of the same type. You can create a vector using the `c()` function (as in "concatenate").

```{r}
bar <- c(2, 5, 10, 2, 1) 
bar
```

We see above how the variable `bar` stores the five values in a vector built with the `c()` function.

The same works below for the variable `baz`, where I used the `=` assignment operator instead of `<-`.

```{r}
baz = c(2, 2, 3, 3, 3)
baz
```

There are also some functions that will create vectors with regular patterns, like repeated elements.

```{r}
# replicate function
rep(2, 5)
# consecutive numbers
1:5
# sequence from 1 to 10 with a step of 2
seq(1, 10, by=2)
```

We can see that the `rep()` function created a vector with the number 2 repeated five times.

The `:` operator created a vector of consecutive numbers from 1 through 5. Notice that it includes the last number, 5; unlike Python's `range()`, we did not have to go up to 6 to achieve this.

With `seq()` we can create a vector that steps through a specified range by the increment we choose.

Let's test the `rep()` and `seq()` functions by repeating the number 6 four times and generating a sequence from 1 through 20 in steps of 3.
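
Both functions accept further arguments for other regular patterns. The following is only a sketch of three common ones, with arbitrary values: `times` and `each` for `rep()`, and `length.out` for `seq()`.

```{r}
# Repeat the whole vector three times
rep(c(1, 2), times = 3)
# Repeat each element three times before moving to the next
rep(c(1, 2), each = 3)
# Ask for five evenly spaced values instead of giving a step size
seq(0, 1, length.out = 5)
```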

Many functions and operators like `+` or `-` will work on all elements of the vector.

```{r}
rep(6, 4)
seq(1,20, by=3)
# add vectors
bar + baz
# compare vectors
bar == baz
# find length of vector
length(bar)
# find minimum value in vector
min(bar)
# find average value in vector
mean(bar)
```

We can see that both vector-generating functions worked: 6 is listed four times, and the sequence from 1 through 20 is listed in steps of three, ending at 19.

The addition was applied element by element to `bar` and `baz`: 2 + 2 = 4, 5 + 2 = 7, 10 + 3 = 13, and so forth.

The comparison operator showed that only the first pair of elements is equal; all other values differ between `bar` and `baz`.

The `length()` function counted the length of the vector `bar`, which is five. The minimum value of `bar` is 1, and the mean of its values is 4.

Let's check whether the mean of the vector `baz` is 2.6 and whether its length is also five.
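
The same element-by-element behaviour applies when one operand is a single number: R reuses (recycles) it for every element. A minimal sketch, using a hypothetical vector `prices` that mirrors the values of `bar`:

```{r}
# Hypothetical vector, same values as bar above
prices <- c(2, 5, 10, 2, 1)
# Multiply every element by 2
prices * 2
# Compare every element with 2; the result is a logical vector
prices > 2
# Count how many elements are greater than 2 (TRUE counts as 1)
sum(prices > 2)
```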

You can access parts of a vector by using `[`. Recall the value of the vector `bar`.

```{r}
mean(baz)
length(baz)
bar
# If you want to get the first element:
bar[1]
```

We can see that the values for `baz` came out as predicted.

Above, calling `bar` printed the whole vector, and `bar[1]` used the `[` operator to print its first element, 2.

Now let's print the vector `baz` and select its last element, which should be 3.

If you want to get the last element of `bar` without explicitly typing the number of elements of `bar`, make use of the `length` function, which calculates the length of a vector:

```{r}
baz
baz[5]
bar[length(bar)]
```

We were able to print the vector `baz`, and its last element was 3, as predicted.

A more general way to get the last element is to use the length of the vector as the index. Since `bar` has five values, `length(bar)` returns 5, which is also the position of the last element in that vector.
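
Two related idioms are worth knowing. This is only a sketch on a hypothetical vector `values` with the same contents as `bar`: `tail()` returns elements from the end, and a negative index drops a position instead of selecting it.

```{r}
# Hypothetical vector, same values as bar above
values <- c(2, 5, 10, 2, 1)
# tail() returns the last n elements; n = 1 gives the last one
tail(values, 1)
# A negative index drops that position: everything except the first element
values[-1]
# Everything except the last element
values[-length(values)]
```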

You can also extract multiple values from a vector. For instance, to get the 2nd through 4th values, use:

```{r}
bar[c(2, 3, 4)]
bar[c(1, 3, 5)]

```

Above we can see that the 2nd, 3rd, and 4th elements of `bar` are listed correctly as 5, 10, and 2.

This also works with any set of positions: the second line returned the 1st, 3rd, and 5th elements (2, 10, and 1) of the same vector.

Vectors can also hold strings or logical values.
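
The index itself can be written more compactly or replaced by a condition. A short sketch, assuming a hypothetical vector `readings` with the same values as `bar`:

```{r}
# Hypothetical vector, same values as bar above
readings <- c(2, 5, 10, 2, 1)
# 2:4 is shorthand for c(2, 3, 4)
readings[2:4]
# A logical condition keeps only the elements where it is TRUE
readings[readings >= 5]
```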

```{r}
quxx <- c("a", "b", "cde", "fg")
quxx
```

We can see above that a vector can store character strings as well.
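
Because every element of a vector must have the same type, it can help to check the type with `class()`. The sketch below uses hypothetical names (`flags`, `mixed`) and also shows what happens when types are mixed inside `c()`:

```{r}
# A logical vector (hypothetical name)
flags <- c(TRUE, FALSE, TRUE)
class(flags)
# A character vector
class(c("a", "b"))
# Mixing types in c() converts everything to one common type (here: character)
mixed <- c(1, "a", TRUE)
class(mixed)
mixed
```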

# Data Frames

In statistical applications, data is often stored as a data frame, which is like a spreadsheet, with *rows as observations* and *columns as variables*.

To manually create a data frame, use the `data.frame()` function.

```{r}
data.frame(foo = c(1, 2, 3), 
           bar = c("a", "b", "c"), 
           baz = c(1.5, 2.5, 3)) 

```

We can see above that each vector was stored as a column of the data frame. Notice that the data frame is balanced: every column has three values. I was curious whether an unbalanced data frame would cause an error, and testing showed that it does. All columns must have a matching number of rows (R only makes an exception when a shorter column can be recycled a whole number of times).
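
A test along those lines can be sketched as follows. The column names `id`, `label`, and `group` are made up for the illustration, and `tryCatch()` is used so that the error is captured rather than stopping the notebook:

```{r}
# Column lengths 3 and 2 cannot be matched up, so data.frame() signals an error.
# tryCatch() captures the error message instead of stopping the notebook.
result <- tryCatch(
  data.frame(id = c(1, 2, 3), label = c("a", "b")),
  error = function(e) conditionMessage(e)
)
result
# A shorter column is recycled when its length divides the longer one evenly
recycled <- data.frame(id = 1:4, group = c("x", "y"))
nrow(recycled)
```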

Most often you will be using data frames loaded from a file. For example, load the results of a class survey. The function `load` or `read.table` can be used for this.
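
No survey file accompanies this notebook, so the sketch below first writes a small, made-up data set to a temporary file and then reads it back. The file contents and the names `survey_file` and `survey` are assumptions for illustration only.

```{r}
# Hypothetical survey data, written to a temporary file so the example is self-contained
survey_file <- tempfile(fileext = ".csv")
write.csv(data.frame(student = c("A", "B", "C"), hours = c(5, 3, 8)),
          survey_file, row.names = FALSE)
# read.csv() is a convenience wrapper around read.table() for comma-separated files
survey <- read.csv(survey_file)
survey
# Equivalent call with read.table(), spelling out the header and separator
read.table(survey_file, header = TRUE, sep = ",")
```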

# How to Make a Random Sample

To randomly select a sample, use the function `sample()`. The following code selects 5 numbers between 1 and 10 at random (without duplication):

```{r}
sample(1:10, size=5)
sample(1:20, size=8)
```

We can see that the `sample()` function creates a vector of randomly chosen values. All we have to do is specify the range to select from and the number of values (`size`) we want.

-   The first argument gives the vector of data to select elements from.
-   The second argument (`size=`) gives the size of the sample to select.
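
By default `sample()` draws without replacement, which is why no number repeats. As a rough sketch (the draws differ on every run, and the names `draw` and `rolls` are hypothetical), the `replace` argument changes that and even allows a sample larger than the population:

```{r}
# Without replacement (the default): all five values are distinct
draw <- sample(1:10, size = 5)
anyDuplicated(draw) == 0
# With replacement: the same value may be drawn more than once,
# so the sample can be larger than the population
rolls <- sample(1:6, size = 10, replace = TRUE)
rolls
length(rolls)
```

Calling `set.seed()` with a fixed number before `sample()` would make the draws repeatable from run to run.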

Taking a simple random sample from a data frame is only slightly more complicated, having two steps:

1.  Use `sample()` to select a sample of size `n` from a vector of the row numbers of the data frame.
2.  Use the index operator `[` to select those rows from the data frame.

Consider the following example with *fake data*. First, make up a data frame with two columns. (`LETTERS` is a character vector of length 26 with the capital letters "A" to "Z"; `LETTERS` is automatically defined and pre-loaded in `R`.)

```{r}
bar <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
# Check data frame
bar
```

Above we created a data frame (`bar`, replacing the earlier vector of the same name) with two columns: the first holds the first ten characters of the `LETTERS` vector, and the second holds the numbers 1 through 10, created with the `:` operator.

Suppose you want to select a random sample of size 5. First, define a variable `n` with the size of the sample, i.e. 5:

```{r}
n <- 5
```

Above we created the variable `n` to store the value 5, so that we can reuse it later as the size (the number of values) of our random sample.

Now, select a sample of size 5 from the vector with 1 to 10 (the number of rows in `bar`). Use the function `nrow()` to find the number of rows in `bar` instead of manually entering that number.

Use `:` to create a vector with all the integers between 1 and the number of rows in `bar`.

```{r}
samplerows <- sample(1:nrow(bar), size=n) 
# print sample rows
samplerows
```

Above we created the variable `samplerows` with the `sample()` function. We set the range from 1 to the number of rows in the data frame `bar` (`nrow(bar)`, which is 10), and we set the size (the number of values drawn) with the variable `n` created earlier. As a result, printing `samplerows` lists five random numbers between 1 and 10.

The variable `samplerows` contains the row numbers of `bar` that make up a random sample of its rows. Extract those rows from `bar` with:

```{r}
# extract rows
barsample <- bar[samplerows, ]
# print sample
print(barsample)
```

Above we can see how to select specific rows from a data frame and store them in a new variable (`barsample`): we index the earlier data frame (`bar`) with the row numbers from our random sample (`samplerows`). My random sample produced 3 4 7 9 1, and the corresponding rows were placed in the data frame `barsample`.

The code above creates a new *data frame* called `barsample` with a random sample of rows from `bar`.

In a single line of code:

```{r}
bar[sample(1:nrow(bar), n), ]
```

Above we reproduced the work that created the data frame `barsample` in one line of code, by combining the `[` operator with the `sample()` and `nrow()` functions. Notice that this does not store the sampled data frame as before; it only displays the selection.
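
If the same steps are needed repeatedly, they can be wrapped in a small function. The helper `sample_rows()` and the data frame `letters_df` below are hypothetical names for this sketch and are not part of base R:

```{r}
# Hypothetical helper: return n randomly chosen rows of a data frame
sample_rows <- function(df, n) {
  # seq_len(nrow(df)) is 1, 2, ..., nrow(df), and is empty when df has no rows
  df[sample(seq_len(nrow(df)), size = n), ]
}

# Hypothetical data, built the same way as bar above
letters_df <- data.frame(var1 = LETTERS[1:10], var2 = 1:10)
picked <- sample_rows(letters_df, 5)
picked
```

Using `seq_len()` instead of `1:nrow(df)` is a defensive choice: for a data frame with zero rows, `1:0` would produce the vector 1, 0 rather than an empty one.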

# Using Tables

The `table()` function counts how often each value occurs. Its simplest usage looks like `table(x)`, where `x` is a *categorical variable*.

For example, a survey asks people if they smoke or not. The data is

*Yes, No, No, Yes, Yes*

We can enter this into R with the `c()` function, and summarize it with the `table()` function as follows:

```{r}
x <- c("Yes","No","No","Yes","Yes") 
table(x)

```

Above we created a new variable holding character strings, then used the `table()` function to display the count of each value. The output is 2 for "No" and 3 for "Yes", which is correct.
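
The result of `table()` can be used like a named vector. As a brief sketch (the names `answers` and `counts` are hypothetical), a single count can be looked up by its label, and `prop.table()` turns the counts into proportions:

```{r}
# Hypothetical variable name; same survey answers as x above
answers <- c("Yes", "No", "No", "Yes", "Yes")
counts <- table(answers)
# Pick out one count by its label
counts["Yes"]
# Convert counts to proportions of the total
prop.table(counts)
```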

# Numeric measures of center and spread

Suppose CEO yearly compensations are sampled and the following values are found (in millions):

12 .4 5 2 50 8 3 1 4 0.25

```{r}
sals <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
# the average
mean(sals) 
# the variance
var(sals)
# the standard deviation
sd(sals)
# the median
median(sals)
# Tukey's five number summary, useful for boxplots
# five numbers: min, lower hinge, median, upper hinge, max
fivenum(sals)
# summary statistics
summary(sals)

```

Above we can see some common statistical analyses using built-in functions that return the average, variance, standard deviation, median, five-number summary, and summary statistics, all computed on the values we combined into the vector `sals`. Notice that the summary statistics and the five-number summary differ: `summary()` also reports the mean, and its quartiles are interpolated, whereas `fivenum()` reports Tukey's hinges.
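
To see what `var()` and `sd()` compute, the sample variance can be rebuilt by hand: the squared deviations from the mean are summed and divided by n - 1. The sketch below does this on a hypothetical copy of the data called `pay`, and also prints the two sets of quartiles side by side:

```{r}
# Same values as sals above, under a hypothetical name
pay <- c(12, .4, 5, 2, 50, 8, 3, 1, 4, 0.25)
n_obs <- length(pay)
# Sample variance: squared deviations from the mean, summed and divided by n - 1
manual_var <- sum((pay - mean(pay))^2) / (n_obs - 1)
all.equal(manual_var, var(pay))
# The standard deviation is the square root of the variance
all.equal(sqrt(manual_var), sd(pay))
# fivenum() reports hinges, while quantile() (used by summary()) interpolates
fivenum(pay)
quantile(pay)
```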

### How about the *mode*?

In R we can write our own *functions*. A first example, shown below, computes *the mode* of a vector of observations `x`:

```{r}
# Function to find the mode, i.e. most frequent value
getMode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
```

Above we use the keyword `function` to create our own function. In this case the function is called `getMode`, and it returns the value that occurs most often in a vector.

As an example, we can use the function defined above to find the most frequent value in the vector `baz`:

```{r}
# Most frequent value in baz
getMode(baz)
```

Here we use our function `getMode` on the variable `baz`, which holds the vector 2, 2, 3, 3, 3. It outputs 3, which is the mode of `baz`.
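
Two further cases are worth trying: non-numeric data and ties. The sketch below repeats the function definition so that the chunk runs on its own; the input vectors are made up for illustration.

```{r}
# Same function as above, repeated so this chunk runs on its own
getMode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}
# Works for character data as well as numbers
getMode(c("Yes", "No", "No", "Yes", "Yes"))
# With a tie, which.max() picks the first maximum, so the value seen first wins
getMode(c(1, 1, 2, 2))
```

Because only one value is returned, a vector with several equally frequent values reports just the one that appears first.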
