Lab 2

Functions

  • Functions are reusable and referenceable blocks of code that accomplish a specific task

  • Functions take in one or more parameters, process them, and return a result

    • Parameters are usually a data structure (e.g. a value, vector, or dataframe)

Creating your own functions

Here is an example of a basic function that you could write in R:

my_function <- function(x, y) {
  result <- x + y
  return(result)
}

my_function(2, 3)
[1] 5

As you can see in the last line, you can call a function by writing its name and entering your parameters in parentheses.

Parameters

  • Values are matched to parameters by the order in which they appear inside the parentheses. In this example, x is equal to 2 and y is equal to 3

    • You can also specify which parameters are which by using parameter = ...
my_function <- function(x, y) {
  result <- x / y
  return(result)
}

my_function(2, 3)
[1] 0.6666667
my_function(y = 2, x = 3)
[1] 1.5
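
As a further illustration of the matching rules above, here is a minimal sketch using a hypothetical function called divide (it is not part of the lab, and the parameter names are made up for this example). It shows that positional and named values can also be mixed in one call.

```r
# Hypothetical helper for illustration: divides its first parameter by its second
divide <- function(numerator, denominator) {
  result <- numerator / denominator
  return(result)
}

# Matched by position: numerator = 10, denominator = 4
by_position <- divide(10, 4)

# Matched by name: the order in the call no longer matters
by_name <- divide(denominator = 4, numerator = 10)

# Mixed: named values are matched first, then 10 fills the remaining parameter
mixed <- divide(denominator = 4, 10)

print(c(by_position, by_name, mixed))
```

All three calls divide 10 by 4, so naming your parameters is mainly a way to make a call easier to read and harder to get wrong.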

Functions in R

Here are some examples of functions we have used in R:

library(psych) # Provides describe(); assumes the package is already installed
data("ChickWeight") # A predefined dataset included in R
describe(ChickWeight)
       vars   n   mean    sd median trimmed   mad min max range  skew kurtosis
weight    1 578 121.82 71.07    103  113.18 69.68  35 373   338  0.96     0.34
Time      2 578  10.72  6.76     10   10.77  8.90   0  21    21 -0.02    -1.26
Chick*    3 578  26.26 14.00     26   26.27 17.79   1  50    49  0.00    -1.19
Diet*     4 578   2.24  1.16      2    2.17  1.48   1   4     3  0.31    -1.39
         se
weight 2.96
Time   0.28
Chick* 0.58
Diet*  0.05

These functions are already defined in base R or in the “psych” package you previously installed. Your use of functions in this class will mostly be limited to pre-defined functions.

  • If you want to learn more about a function, use ?funcname to do so.
# ?describe
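
If you only need a quick reminder of a function's parameters rather than the full help page, the following sketch may help. It uses the base R function sd() purely as an example; args() and formals() are base R tools for inspecting a function.

```r
# args() prints a function's parameter list without opening the help page
args(sd)

# formals() returns the parameters (with any default values) as a list;
# names() pulls out just the parameter names
sd_params <- names(formals(sd))
print(sd_params)

# In an interactive session, either of these opens the full help page:
# ?sd
# help("sd")
```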

Vectors and Dataframes

As a review:

  • Vectors are one-dimensional arrays of data

    • All elements in a vector must be the same type:
numeric_vector <- c(1, 2, 3)
char_vector <- c("a", "b", "c")
  • Dataframes are two-dimensional tables of data, with each column stored as a vector (different columns can hold different types)

  • You can specify the name of each column by using = as follows.

df <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 30)
)
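
The same-type rule for vectors can be checked directly. The sketch below is only an illustration: mixed_vector and people are hypothetical names, and people simply repeats the example data from above under a different name so that df is left alone.

```r
# Mixing numbers and text in c() does not raise an error;
# R converts every element to a single common type instead
mixed_vector <- c(1, "a", 3)
print(class(mixed_vector))

# Each column of a data frame is its own vector, so each column has its own type
people <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 30)
)
column_types <- sapply(people, class)
print(column_types)
```

Because the numbers in mixed_vector have become text, a call such as sum(mixed_vector) would fail, which is a common source of confusing errors after reading in messy data.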

Dataframe indexing

  • There are two ways to access elements inside a dataframe: by column name or by index

    • $ indicates that a specific column is being accessed by name.
    • [row, col] indicates that both the row and the column are specified, whether by number or by name
      • If you use a name, you must put quotes around it ('' or "")
      • If you only want to specify one of the two, keep the comma but leave the other side blank.
# The following lines of code are equivalent
df$name
[1] "Alice" "Bob"  
df[, 1]
[1] "Alice" "Bob"  
df[, "name"]
[1] "Alice" "Bob"  
# If you wanted to obtain a specific value
df[1, 1]
[1] "Alice"
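
The examples above leave the row blank; the sketch below shows the other cases. It uses a hypothetical data frame called people (a copy of the example data under a different name), so treat the names as assumptions made for illustration.

```r
people <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 30)
)

# Leave the column side blank to get every column for row 2
second_row <- people[2, ]
print(second_row)

# Mix a row number with a quoted column name to get a single value
bobs_age <- people[2, "age"]
print(bobs_age)

# Leave the row side blank to get the whole column, same as people$age
ages <- people[, "age"]
print(ages)
```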

Manipulations of vectors/dataframes

You can call functions to manipulate data in vectors:

sum(numeric_vector)
[1] 6

And dataframes:

describe(df$age)
   vars n mean   sd median trimmed  mad min max range skew kurtosis  se
X1    1 2 27.5 3.54   27.5    27.5 3.71  25  30     5    0    -2.75 2.5
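
Arithmetic works the same way as these function calls: it is applied to every element at once. The sketch below is an illustration under assumed names (people and age_in_months are hypothetical), showing a vector being shifted and a new column being built from an existing one.

```r
numeric_vector <- c(1, 2, 3)

# Adding a single number adds it to every element of the vector
shifted <- numeric_vector + 2
print(shifted)

# Functions such as mean() reduce the whole vector to one value
vector_mean <- mean(numeric_vector)
print(vector_mean)

# Assigning to a new name with $ creates a new column in the data frame
people <- data.frame(
  name = c("Alice", "Bob"),
  age = c(25, 30)
)
people$age_in_months <- people$age * 12
print(people)
```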

colnames() and head()

Here are two functions which you’ll be asked to use on your assignment:

  • colnames(): Returns the column names of your data frame; assigning to it renames the columns.

  • head(): Outputs only the first few rows of your data frame (six by default).

colnames(df) <- c("name_new", "age_new")
head(df)
  name_new age_new
1    Alice      25
2      Bob      30
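
With only two rows, df does not show what head() really does, so here is a sketch on the larger built-in ChickWeight dataset. The copy named chicks and the new column name body_weight are hypothetical choices made for this example.

```r
data("ChickWeight")

# Called on its own, colnames() returns the current column names
original_names <- colnames(ChickWeight)
print(original_names)

# head() shows six rows by default; the n parameter changes that
first_three <- head(ChickWeight, n = 3)
print(first_three)

# Rename only the first column, working on a copy so ChickWeight is unchanged
chicks <- as.data.frame(ChickWeight)
colnames(chicks)[1] <- "body_weight"
print(colnames(chicks))
```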

Statistical Analyses in R: cor() and lm()

Set-up

  • Download the lab2.csv file from Canvas.

  • In a new R file, follow the same steps as last week to read the .csv file:

    • Save the .R file to the same folder as your .csv.

    • Set the working directory to the source file location.

    • Call read.csv() on the file name.
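
The last step can be sketched as follows. So that the example runs anywhere, it writes a tiny stand-in file to a temporary location first; the values in it are made up and are not the contents of lab2.csv, and the names csv_path and example_df are hypothetical.

```r
# Stand-in for a downloaded file: write a tiny CSV to a temporary location
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x,y,z", "1,2.5,10", "2,3.1,-4", "3,4.8,7"), csv_path)

# read.csv() returns a data frame; the first line of the file becomes the column names
example_df <- read.csv(csv_path)
print(head(example_df))
str(example_df)

# For the lab itself, with the working directory set to the folder holding the file:
# df <- read.csv("lab2.csv")
```

Running head() and str() right after reading a file is a quick way to confirm that the columns came in with the names and types you expect.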

Correlation in R

Correlation measures the strength and direction of the linear relationship between two variables. Let’s say that we’re interested in the correlation between the x and y columns of df:

x <- df$x
y <- df$y
cor(x, y)
[1] 0.9044494
  • cor() is a function that takes in x and y as its parameters

We’re often interested in \(r^2\) in this class (why?), which we can obtain simply by squaring the above value:

r <- cor(x, y)
r^2
[1] 0.8180287
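
To see what cor() computes, the sketch below rebuilds Pearson's r from the covariance and the two standard deviations. It runs on simulated stand-in data rather than lab2.csv, and every name starting with sim_ is hypothetical.

```r
# Simulated stand-in data (not the lab2.csv values)
set.seed(1)
sim_x <- 1:25
sim_y <- 1 + 0.9 * sim_x + rnorm(25, sd = 3)

sim_r <- cor(sim_x, sim_y)

# Pearson's r is the covariance scaled by both standard deviations
sim_r_by_hand <- cov(sim_x, sim_y) / (sd(sim_x) * sd(sim_y))

# Swapping the two parameters gives the same correlation
sim_r_swapped <- cor(sim_y, sim_x)

print(c(r = sim_r, by_hand = sim_r_by_hand, swapped = sim_r_swapped))

# Squaring with ^ is the usual way to write powers in R
print(sim_r^2)
```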

Correlation Tables

You can also do correlations on the entire data frame to produce a correlation matrix.

cor(df)
           x           y           z
x 1.00000000  0.90444939  0.05378997
y 0.90444939  1.00000000 -0.05182359
z 0.05378997 -0.05182359  1.00000000

This produces a correlation matrix (a matrix, not a dataframe) that holds the correlation for every pair of variables.

  • What do you notice about this correlation matrix? How does it relate to the value we computed earlier?

  • You can also square this table; the squaring is applied to each entry separately

cor_table <- cor(df)
cor_table^2
            x           y           z
x 1.000000000 0.818028703 0.002893361
y 0.818028703 1.000000000 0.002685685
z 0.002893361 0.002685685 1.000000000
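
A correlation matrix can be indexed with the same [row, col] syntax used for dataframes. The sketch below uses simulated data with hypothetical column names a, b, and c (not the lab2.csv values) to pull out a single pair and to confirm that squaring works entry by entry.

```r
# Simulated stand-in data (not the lab2.csv values)
set.seed(2)
sim_df <- data.frame(a = rnorm(20), b = rnorm(20), c = rnorm(20))

sim_cor <- cor(sim_df)
print(class(sim_cor))

# Pull out one pair by row and column name
pair_ab <- sim_cor["a", "b"]
print(pair_ab)

# ^ squares each entry separately; it is not matrix multiplication
sim_cor_squared <- sim_cor^2
print(sim_cor_squared["a", "b"])
```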

Linear Models in R

This class is called Applied Linear Regression, so you’ll be working a lot with the lm function in R.

lm() takes two main parameters:

  • formula: Always in the form of y ~ x, with y as the dependent and x as the independent variable.

  • data: The data frame that contains the variables named in the formula.

# lm(formula = target ~ predictor, data = dataframe)
lm(y ~ x, df)

Call:
lm(formula = y ~ x, data = df)

Coefficients:
(Intercept)            x  
      1.042        0.918  
  • You’ll notice that two numbers appear in your output. What does each of them mean?
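
Once a model is stored in a variable, the two numbers can be pulled out with coef() and used in further calculations. This is a minimal sketch on simulated data (not lab2.csv); sim_data, sim_model, and the choice of x = 10 are assumptions made for illustration.

```r
# Simulated stand-in data (not the lab2.csv values)
set.seed(3)
sim_data <- data.frame(x = 1:25)
sim_data$y <- 1 + 0.9 * sim_data$x + rnorm(25, sd = 3)

sim_model <- lm(y ~ x, data = sim_data)

# coef() returns a named vector with entries "(Intercept)" and "x"
estimates <- coef(sim_model)
print(estimates)

# Value on the fitted line at x = 10, computed from the two coefficients
by_hand <- estimates[["(Intercept)"]] + estimates[["x"]] * 10

# predict() performs the same calculation for any new x values
by_predict <- predict(sim_model, newdata = data.frame(x = 10))

print(c(by_hand = by_hand, by_predict = unname(by_predict)))
```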

summary()

Until now you have been using describe() to obtain descriptive statistics from data frames. summary() is a base R function that serves a similar purpose:

summary(df)
       x            y                z          
 Min.   : 1   Min.   :-3.167   Min.   :-32.981  
 1st Qu.: 7   1st Qu.: 8.266   1st Qu.: -5.198  
 Median :13   Median :14.669   Median : 18.671  
 Mean   :13   Mean   :12.976   Mean   : 19.922  
 3rd Qu.:19   3rd Qu.:17.418   3rd Qu.: 38.469  
 Max.   :25   Max.   :27.096   Max.   : 67.687  

However, unlike describe(), summary() can also be used on the output of lm(), which gives you much more information.

model <- lm(y ~ x, df)
summary(model)

Call:
lm(formula = y ~ x, data = df)

Residuals:
   Min     1Q Median     3Q    Max 
-5.148 -2.863  1.618  2.611  4.941 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.04237    1.34205   0.777    0.445    
x            0.91795    0.09028  10.168 5.57e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.255 on 23 degrees of freedom
Multiple R-squared:  0.818, Adjusted R-squared:  0.8101 
F-statistic: 103.4 on 1 and 23 DF,  p-value: 5.575e-10
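
The printed summary is convenient to read, but individual numbers can also be extracted from it with $. The sketch below does this on simulated data (not lab2.csv; the sim_ names are hypothetical) and compares the model's R-squared with the squared correlation between the two variables.

```r
# Simulated stand-in data (not the lab2.csv values)
set.seed(4)
sim_data <- data.frame(x = 1:25)
sim_data$y <- 1 + 0.9 * sim_data$x + rnorm(25, sd = 3)

sim_model <- lm(y ~ x, data = sim_data)
sim_summary <- summary(sim_model)

# The coefficient table is stored as a matrix
coef_table <- sim_summary$coefficients
print(coef_table)

# Multiple R-squared from the summary
r_squared <- sim_summary$r.squared

# Squared correlation between the same two variables
r_squared_from_cor <- cor(sim_data$x, sim_data$y)^2

print(c(from_summary = r_squared, from_cor = r_squared_from_cor))
```

With a single predictor and an intercept the two values agree, which matches the 0.818 that appears both in the lm() summary and in the squared correlation earlier in this lab.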

Interpretation of Linear Model in R

  • Call: repeats the line of code used to generate the linear model.

  • Residuals: descriptive statistics about the residuals, the differences between the observed values and the fitted line.

  • Coefficients: the \(\beta\)’s obtained from your linear model. In this case \(\beta_0\) (y-intercept) and \(\beta_1\) (slope)

    • Estimate: estimated value of the coefficient

    • Std. Error: standard error of the coefficient, a measure of the variability of the estimate

    • t value: the test statistic for the coefficient, equal to Estimate divided by Std. Error

    • Pr(>|t|): the p-value from the t-test, i.e. the probability of observing a t value at least this extreme if the null hypothesis (that the coefficient is 0) were true. It is not the probability that the null hypothesis is true.

  • Some other important information for simple linear regression.

    • degrees of freedom

    • R-squared
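
The columns of the coefficient table are linked to one another, and the sketch below recomputes the t value, the degrees of freedom, and the p-value for the slope from the other pieces. It runs on simulated data with 25 observations (not lab2.csv), and the variable names are hypothetical.

```r
# Simulated stand-in data with 25 observations (not the lab2.csv values)
set.seed(5)
sim_data <- data.frame(x = 1:25)
sim_data$y <- 1 + 0.9 * sim_data$x + rnorm(25, sd = 3)

sim_model <- lm(y ~ x, data = sim_data)
coef_table <- summary(sim_model)$coefficients

# t value = Estimate / Std. Error
slope_estimate <- coef_table["x", "Estimate"]
slope_std_error <- coef_table["x", "Std. Error"]
t_by_hand <- slope_estimate / slope_std_error

# Residual degrees of freedom = observations minus estimated coefficients (25 - 2)
resid_df <- df.residual(sim_model)

# Two-sided p-value: area in both tails of the t distribution beyond |t|
p_by_hand <- 2 * pt(abs(t_by_hand), df = resid_df, lower.tail = FALSE)

print(c(t = t_by_hand, df = resid_df, p = p_by_hand))
```

The same arithmetic can be checked against the printed output for the lab data: 0.91795 divided by 0.09028 gives the reported t value of about 10.168.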

Application Questions

  • If instead y is the independent variable and x is the dependent variable, should the linear model change?

    • Try it out for yourself using another call to lm
  • Find the correlation for x and z, then obtain a summary of the linear model for z on x

    • Interpret these results: How does the relationship between x and y differ from the relationship between x and z?
  • Create a new column in df that is the z column with 2 added to every value. Then repeat the above steps for x and this new column.

    • Did anything change? If so, what did?