Data Science Stream
Topic 6B: Writing R Functions
Welcome to the sixth computer lab for the Data Science stream of STM1001.
By now, you will have gained some experience using a variety of functions in R, ranging from inbuilt ‘base R’ functions, to package-specific functions, such as those contained in the plotly
package.
In this computer lab we will extend our R coding skills and focus on writing our own functions in R.
By the end of this lab, you should understand the R code syntax required to create functions, and feel comfortable writing simple and intermediate functions. Let’s get started!
R Functions Overview
Usually, if you need to carry out a specific statistical process in R, there will already be a relevant function for you to use!
However, when you are conducting your own coding or research, sometimes it can be extremely beneficial to write your own custom functions. These can range from simple functions of a few lines of code, to extremely detailed, multi-layered functions that make use of other functions and packages. The creative possibilities are endless.
R Function Composition
Let’s take a look at the basic composition of an R function, using the example function below:
my_simple_function <- function(argument1,
argument2){
output <- argument1 + argument2
}
The code above consists of four main parts:
- The function has a name,
my_simple_function
.
- We use the
function
function to (unsurprisingly) create a function.
- Inside the open brackets
()
following function
, we specify argument1
and argument2
to be the arguments, i.e. inputs, of our function.
- To begin the ‘main body’ of our function, we use a left curly brace,
{
, directly after the right open bracket )
.
- In the ‘main body’ of the function, we specify the calculations the function should perform - for
my_simple_function
, we are simply adding the argument1
and argument2
values provided. This sum is stored in the object output
, and when the function is computed, the output
value is provided as output.
- To wrap up our function, we conclude with a right curly brace,
}
.
You might be wondering what argument1
and argument2
are, and from where they came. In fact, these are arbitrary names, and don’t correspond to any specific numbers. We can provide whatever numbers or string of numbers we would like to our function, and within the my_simple_function
environment these values will be used in place of argument1
and argument2
. For example, suppose we would like to add 39 and 3. We can do this using our function as follows:
res <- my_simple_function(39, 3)
res
## [1] 42
Copy the code in the two code chunks above, paste it into an R script, and run it. You should obtain the result 42
shown above.
Of course, we didn’t really need to use a function for this - we could have simply computed 39 + 3
in R. In general, functions will be a little more complicated than my_simple_function
.
Important Notes
Strictly speaking, we did not need to include the output <-
code within my_simple_function
.
R will output the last evaluated expression in the function, so we could simply have used argument1 + argument2
in the main body of the function. If however we wanted to output more than one result, then specifying our desired output can be informative and help avoid errors.
This additional output <-
code was primarily included here to demonstrate a key feature of R functions - any objects created within a function are part of that function’s local environment
, meaning that they exist only inside that function, and can’t be called from the global environment
outside the function.
To clarify the distinction, take a look at the code below:
test <- 39 + 3
test
## [1] 42
Here, we assign the sum 39 + 3
to the object test
. This is not done within a function, so the object test
is stored in the global environment
. This means we can then call the test
object at a later date, and see our result of 42
.
Similarly, we could compute this result using my_simple_function
, as shown above in 1.1.
However, try to now call the object output
(which is defined within this function). You will receive an error - something along the lines of Error in eval(expr, envir, enclos) : object 'output' not found
.
This is something worth keeping in mind. If you intend to use an object outside of a function, it might be worth defining it separately. It is also a good idea to use different names for all your objects (inside and outside functions) to avoid confusion.
Writing Simple Functions in R
Now that we have covered the basics of R function composition, let’s try to write some simple functions ourselves. These functions actually already exist in base R, but this will be good practice for subsequent work.
Mean Function
Recall that the mean \(\bar{x}\) of a set of \(n\) values \(x_1, x_2, \cdots, x_n\) is computed using the formula \[\bar{x} = \displaystyle \frac{\sum_{i=1}^n x_i}{n} .\]
Suppose that you would like to write an R function that computes the mean of a set of provided values.
There are several ways that we could write this function. We have made a head-start with the partially completed code below. See if you can complete it by filling in the missing ...
parts.
mean_func <- function(values){
# Argument:
# values: This is our list of values
n <- length(values)
sum(...) / ...
}
Note that, based on this format, our function can compute the mean of a string of numbers of any non-zero length.
Verify that your mean_func
function works by computing the mean of the set of values 2, 3, 4, 5, 6, 7, 8
. If your function is correct, you should receive an answer of 5
.
Sample standard deviation Function
Now that you have had some practice writing R functions, let’s tackle a harder problem.
Recall the equation for the sample standard deviation \(s\) (introduced in Topic 2):
\[s = \displaystyle \sqrt{ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} .\]
Using your mean_func
function, and your code from 2.1 as a guide, write a function called sd_func
that computes the sample standard deviation of a set of \(n\) values \(x_1, x_2, \cdots, x_n\).
Hints:
- You can use the same input as for your
mean_func
function.
- When you are multiplying parts of a function together, you need to use the \(\ast\) symbol.
- You can use the built-in R function
sqrt()
to compute the square root.
- You can use your
mean_func
function within your new function, when computing the \(\bar{x}\) value.
- Be careful to include brackets to separate different parts of your calculations.
- Don’t forget the square term in the equation.
Verify that your sample standard deviation function works by computing the sample standard deviation of the set of values 2, 3, 4, 5, 6, 7, 8
. If your function is correct, you should receive an answer of 2.160247
.
Note: Your answer may be slightly different due to rounding.
Writing a \(t\)-test Function in R
Now that we have had some practice writing R functions, let’s move on to a slightly harder scenario, and try to write an R function for computing a \(t\)-test.
Recall that in Topic 5 we introduced the \(t\)-test as a common type of hypothesis test.
The test statistic \(t\) used for the one-sample \(t\)-test has the following formula:
\[t = \displaystyle \frac{\bar{x} - \mu_0}{s / \sqrt{n}},\]
where \(\bar{x}\), \(s\) and \(n\) are the sample mean, sample standard deviation, and sample size respectively, as previously defined, and \(\mu_0\) denotes the mean under the null hypothesis.
Note: \(s / \sqrt{n}\) is the standard error of \(t\), and is often denoted by \(SE\).
\(t\)-test test statistic
Using your functions mean_func
and sd_func
, write a new R function called t_test_func
that computes the test statistic \(t\) for a one-sample \(t\)-test with an arbitrary null hypothesis mean \(\mu_0\).
Hints:
- Most of the hard work is already done, since you can use your two functions
mean_func
and sd_func
.
- You will need to provide one new input argument, the \(\mu_0\) value, for your
t_test_func
function.
To verify your test statistic function’s accuracy, evaluate it for the set of values 2, 3, 4, 5, 6, 7, 8
, with the null hypothesis mean \(\mu_0 = 4\).
Compare your result to the result of the inbuilt R function t.test
, by running the code t.test(c(2:8), mu = 4)
. If you have written your function correctly, the t.test
\(t=\) output should be identical to the value of \(t\) computed by your t_test_func
function.
Hint: Don’t worry if your t_test_func
function gives a slightly different value - as long as it is correct to 3 or 4 decimal places that is fine. If it is very different, just go back over your function, and check for errors such as insufficient or misplaced brackets. It’s very easy to make mistakes like this.
Adding Details to our \(t\)-test Function
To conclude this computer lab, let’s try adding some additional details and features to our \(t\)-test function, to make it more informative.
Degrees of freedom
Recall that the degrees of freedom for a one-sample \(t\)-test is equal to \(n-1\). It would be good to include this information in your t_test_func
output.
As noted above in 1.1.1, if you want multiple results to be provided as the output of a function, you will need to combine these results into one output object.
There are several ways in which we can do this.
The simplest option is to use the combine function, c()
. For example, if your test statistic \(t\) result is stored in the object t.val
, and your degrees of freedom result is stored in the object df
, you could use c(t.val, df)
to output your results.
Try this now for your t_test_func
function, and then test your updated function, using the inputs discussed in 3.2.
Note that the above approach won’t provide any extra information about your output - that might be fine if you are the only person using your function, but if you intend your function to be used by others, then adding more detail to the output might be desirable.
As a simple extra step, continuing the example above you could use c("test statistic value" = t.val, "degrees of freedom" = df)
to include some helpful additional details about your function output.
Using this information, update your t_test_func
function so that it outputs the test statistic \(t\) and the degrees of freedom, with clear labels.
Test your updated function, using the inputs discussed in 3.2.
Informative Output
As a further improvement, rather than using e.g. c("test statistic value" = t.val, "degrees of freedom" = df)
, we could instead use the concatenate function cat
to provide more detailed output that is easier to read and interpret.
An example application of the cat
function is provided below:
cat("The test statistic is", round(t.val, 4))
This code will output a single character vector, e.g. The test statistic is 1.2247, which is nicer to read than the c
function method we used in 4.1.
Here, anything that is not an R code result is surrounded by quotation marks "..."
. A comma is used to separate sections of R code results and text (or text related code).
If we would like to include several statements in our output, we can use "\n"
to specify a new line. For example, the layout for two lines could be:
cat("Some text here", some R code here, "\n",
"some more text", some more R code)
Using the information above in 4.2, incorporate the cat
function into your t_test_func
so that the output is in a clear, easy to read format.
Test your updated function, using the inputs discussed in 3.2.
\(p\)-value Function
Suppose that we would also like to compute the \(p\)-value that corresponds to the \(t\)-test test statistic. Recall from Topic 5 that for a two-sided hypothesis test, the \(p\)-value is equal to the probability that we observe a test statistic value as extreme, or more extreme, than the observed test statistic we have computed using our sample.
In other words, we have that \[\text{$p$-value} = 1 - P(-t \leq T \leq t) = P(T \leq -t) + P(T \geq t) .\]
Recall that in Computer Lab 4 we discussed the use of functions like dnorm
, pnorm
and qnorm
for computing the density, probability distribution function and quantile value respectively of normally distributed random variables.
Similarly, we can use the inbuilt R function pt
to compute the probability that the test statistic \(T\) (which follows a Student’s \(t\) distribution) takes a value less than or equal to an arbitrary value. We will however also need to include the degrees of freedom as an argument.
For example, if we would like to compute \(P(T \leq 5)\), and our degrees of freedom is 4, we can use the code pt(5, 4)
.
If we would like to compute \(P(T \geq 5)\), we can be clever and use the code 1 - pt(5, 4)
.
Based on this information, add extra code to your t_test_func
function so that it also provides (as a third output) the corresponding \(p\)-value for the test statistic.
Hint: Try to use the symmetry property of the Student’s \(t\) distribution when including code for the \(p\)-value calculations in your function.
Run both your updated t_test_func
function and the inbuilt t.test
R function, using the inputs discussed in 3.2. Verify that your t_test_func
function \(p\)-value is correct, by comparing it to the output of the t.test
R function.
We now have written functions that, combined, can be used to conduct a one-sample \(t\)-test for a two-sided hypothesis. Well done! That concludes our computer lab on R function writing.
Hopefully this lab has demystified R function creation, and shown you that writing functions in R is perhaps easier than you might have thought.
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License
BY-NC-ND.
---
title: "STM1001: Computer Lab 6B"
output:
  bookdown::html_document2: 
    toc: true
    toc_float: true
    code_download: true
    theme: readable
    code_folding: show
bibliography: STM1001_DS_CL_references.bib 
link-citations: yes
---

<style>
#TOC {
  background: url("https://www.latrobe.edu.au/_media/la-trobe-api/v5/img/logo.svg");
  background-size: contain;
  padding-top: 80px !important;
  background-repeat: no-repeat;
}
</style>

### Data Science Stream {-}

### Topic 6B: Writing R Functions {-}

<br>

Welcome to the sixth computer lab for the Data Science stream of STM1001.

By now, you will have gained some experience using a variety of functions in R, ranging from inbuilt 'base R' functions, to package-specific functions, such as those contained in the `plotly` package.

In this computer lab we will extend our R coding skills and focus on writing our **own** functions in R.

By the end of this lab, you should understand the R code syntax required to create functions, and feel comfortable writing simple and intermediate functions. Let's get started!

# R Functions Overview

Usually, if you need to carry out a specific statistical process in R, there will already be a relevant function for you to use!
However, when you are conducting your own coding or research, sometimes it can be extremely beneficial to write your own custom functions. These can range from simple functions of a few lines of code, to extremely detailed, multi-layered functions that make use of other functions and packages. The creative possibilities are endless.

## R Function Composition {#comp}

Let's take a look at the basic composition of an R function, using the example function below:

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
my_simple_function <- function(argument1, 
                               argument2){
    
  output <- argument1 + argument2
  
}
```

The code above consists of four main parts:

* The function has a name, `my_simple_function`.
* We use the `function` function to (unsurprisingly) create a function.
* Inside the open brackets `()` following `function`, we specify `argument1` and `argument2` to be the arguments, i.e. inputs, of our function.
* To begin the 'main body' of our function, we use a left curly brace, `{`, directly after the right open bracket `)`. 
* In the 'main body' of the function, we specify the calculations the function should perform - for `my_simple_function`, we are simply adding the `argument1` and `argument2` values provided. This sum is stored in the object `output`, and when the function is computed, the `output` value is provided as output. 
* To wrap up our function, we conclude with a right curly brace, `}`.

You might be wondering what `argument1` and `argument2` are, and from where they came. In fact, these are arbitrary names, and don't correspond to any specific numbers. We can provide whatever numbers or string of numbers we would like to our function, and within the `my_simple_function` environment these values will be used in place of `argument1` and `argument2`. For example, suppose we would like to add 39 and 3. We can do this using our function as follows:

```{r eval = T, echo = T}
res <- my_simple_function(39, 3)
res
```

Copy the code in the two code chunks above, paste it into an R script, and run it. You should obtain the result `42` shown above.

*Of course, we didn't really need to use a function for this - we could have simply computed `39 + 3` in R. In general, functions will be a little more complicated than `my_simple_function`.*

### Important Notes {#imp}

Strictly speaking, we did not need to include the `output <-` code within `my_simple_function`. 

R will output the last evaluated expression in the function, so we could simply have used `argument1 + argument2` in the main body of the function. **If however we wanted to output more than one result, then specifying our desired output can be informative and help avoid errors**.

This additional `output <-` code was primarily included here to demonstrate a key feature of R functions - any objects created within a function are part of that function's `local environment`, meaning that **they exist only inside that function**, and can't be called from the `global environment` outside the function.

To clarify the distinction, take a look at the code below:

```{r class.source = "fold-show", eval = T, echo = T, warning = F, message = F}
test <- 39 + 3
test
```

Here, we assign the sum `39 + 3` to the object `test`. This is not done within a function, so the object `test` is stored in the `global environment`. This means we can then call the `test` object at a later date, and see our result of `42`.

Similarly, we could compute this result using `my_simple_function`, as shown above in \@ref(comp). 

However, try to now call the object `output` (which is defined within this function). You will receive an error - something along the lines of `Error in eval(expr, envir, enclos) : object 'output' not found`.

This is something worth keeping in mind. If you intend to use an object outside of a function, it might be worth defining it separately. It is also a good idea to use different names for all your objects (inside and outside functions) to avoid confusion.

# Writing Simple Functions in R {#values}

Now that we have covered the basics of R function composition, let's try to write some simple functions ourselves. These functions actually already exist in base R, but this will be good practice for subsequent work.

## Mean Function {#meanfunc}

Recall that the mean $\bar{x}$ of a set of $n$ values $x_1, x_2, \cdots, x_n$ is computed using the formula $$\bar{x} = \displaystyle \frac{\sum_{i=1}^n x_i}{n} .$$

Suppose that you would like to write an R function that computes the **mean** of a set of provided values. 
There are several ways that we could write this function. We have made a head-start with the partially completed code below. See if you can complete it by filling in the missing `...` parts.

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
mean_func <- function(values){
  # Argument:
  # values: This is our list of values
  
  n <- length(values)
  sum(...) / ...

}                      
```

*Note that, based on this format, our function can compute the mean of a string of numbers of any non-zero length.*

###

Verify that your `mean_func` function works by computing the mean of the set of values `2, 3, 4, 5, 6, 7, 8`. If your function is correct, you should receive an answer of `5`.

## Sample standard deviation Function

Now that you have had some practice writing R functions, let's tackle a harder problem.

Recall the equation for the sample standard deviation $s$ (introduced in [Topic 2](https://bookdown.org/a_shaker/STM1001_Topic_2/2-1-variance-and-standard-deviation.html)):

$$s = \displaystyle \sqrt{ \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2} .$$
Using your `mean_func` function, and your code from \@ref(meanfunc) as a guide, write a function called `sd_func` that computes the **sample standard deviation** of a set of $n$ values $x_1, x_2, \cdots, x_n$.

*Hints:* 

* *You can use the same input as for your `mean_func` function.*
* *When you are multiplying parts of a function together, you need to use the $\ast$ symbol.*
* *You can use the built-in R function `sqrt()` to compute the square root.*
* *You can use your `mean_func` function within your new function, when computing the $\bar{x}$ value.*
* *Be careful to include brackets to separate different parts of your calculations.*
* *Don't forget the square term in the equation.*

###

Verify that your sample standard deviation function works by computing the sample standard deviation of the set of values `2, 3, 4, 5, 6, 7, 8`. If your function is correct, you should receive an answer of `2.160247`.  

*Note: Your answer may be slightly different due to rounding.*

# Writing a $t$-test Function in R {#ttestfunc}

Now that we have had some practice writing R functions, let's move on to a slightly harder scenario, and try to write an R function for computing a $t$-test.

Recall that in [Topic 5](https://bookdown.org/content/50a3178d-5432-44a3-bb5c-1718ca3e1fe2/1-tdist.html) we introduced the $t$-test as a common type of hypothesis test.
The test statistic $t$ used for the one-sample $t$-test has the following formula:

$$t = \displaystyle \frac{\bar{x} - \mu_0}{s / \sqrt{n}},$$
where $\bar{x}$, $s$ and $n$ are the sample mean, sample standard deviation, and sample size respectively, as previously defined, and $\mu_0$ denotes the mean under the null hypothesis.

*Note: $s / \sqrt{n}$ is the standard error of $t$, and is often denoted by $SE$.*

## $t$-test test statistic

Using your functions `mean_func` and `sd_func`, write a new R function called `t_test_func` that computes the test statistic $t$ for a one-sample $t$-test with an arbitrary null hypothesis mean $\mu_0$.

*Hints:*

* *Most of the hard work is already done, since you can use your two functions `mean_func` and `sd_func`.*
* *You will need to provide one new input argument, the $\mu_0$ value, for your `t_test_func` function.*

## {#ttest}

To verify your test statistic function's accuracy, evaluate it for the set of values `2, 3, 4, 5, 6, 7, 8`, with the null hypothesis mean $\mu_0 = 4$.

Compare your result to the result of the inbuilt R function `t.test`, by running the code `t.test(c(2:8), mu = 4)`.  If you have written your function correctly, the `t.test` $t=$ output should be identical to the value of $t$ computed by your `t_test_func` function.

*Hint: Don't worry if your `t_test_func` function gives a slightly different value - as long as it is correct to 3 or 4 decimal places that is fine. If it is very different, just go back over your function, and check for errors such as insufficient or misplaced brackets. It's very easy to make mistakes like this.*

# Adding Details to our $t$-test Function

To conclude this computer lab, let's try adding some additional details and features to our $t$-test function, to make it more informative.

## Degrees of freedom {#df}

Recall that the degrees of freedom for a one-sample $t$-test is equal to $n-1$. It would be good to include this information in your `t_test_func` output.

As noted above in \@ref(imp), if you want multiple results to be provided as the output of a function, you will need to combine these results into one output object. 

There are several ways in which we can do this.

The simplest option is to use the combine function, `c()`. For example, if your test statistic $t$ result is stored in the object `t.val`, and your degrees of freedom result is stored in the object `df`, you could use `c(t.val, df)` to output your results. 

Try this now for your `t_test_func` function, and then test your updated function, using the inputs discussed in \@ref(ttest).

###

Note that the above approach won't provide any extra information about your output - that might be fine if you are the only person using your function, but if you intend your function to be used by others, then adding more detail to the output might be desirable.

As a simple extra step, continuing the example above you could use `c("test statistic value" = t.val, "degrees of freedom" = df)` to include some helpful additional details about your function output.

Using this information, update your `t_test_func` function so that it outputs the test statistic $t$ and the degrees of freedom, with clear labels.

Test your updated function, using the inputs discussed in \@ref(ttest).

## Informative Output {#cat}

As a further improvement, rather than using e.g. `c("test statistic value" = t.val, "degrees of freedom" = df)`, we could instead use the concatenate function `cat` to provide more detailed output that is easier to read and interpret.

An example application of the `cat` function is provided below:^[*Note: This brief introduction does not cover all aspects of the concatenate function.*]

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
cat("The test statistic is", round(t.val, 4))
```

This code will output a single character vector, e.g. *The test statistic is 1.2247*, which is nicer to read than the `c` function method we used in \@ref(df).

Here, anything that is not an R code result is surrounded by quotation marks `"..."`. A comma is used to separate sections of R code results and text (or text related code).

If we would like to include several statements in our output, we can use `"\n"` to specify a new line. For example, the layout for two lines could be:

```{r class.source = "fold-show", eval = F, echo = T, warning = F, message = F}
cat("Some text here", some R code here, "\n", 
    "some more text", some more R code)
```

### 

Using the information above in \@ref(cat), incorporate the `cat` function into your `t_test_func` so that the output is in a clear, easy to read format.

Test your updated function, using the inputs discussed in \@ref(ttest).

## $p$-value Function

Suppose that we would also like to compute the $p$-value that corresponds to the $t$-test test statistic. Recall from [Topic 5](https://bookdown.org/content/50a3178d-5432-44a3-bb5c-1718ca3e1fe2/3-1-explaining-the-one-sample-t-test-results.html) that for a two-sided hypothesis test, the $p$-value is equal to the probability that we observe a test statistic value as extreme, or more extreme, than the observed test statistic we have computed using our sample. 

In other words, we have that $$\text{$p$-value} = 1 - P(-t \leq T \leq t) = P(T \leq -t) + P(T \geq t) .$$

Recall that in [Computer Lab 4](https://rpubs.com/LTU_STM1001/CL4) we discussed the use of functions like `dnorm`, `pnorm` and `qnorm` for computing the density, probability distribution function and quantile value respectively of normally distributed random variables.

Similarly, we can use the inbuilt R function `pt` to compute the probability that the test statistic $T$ (which follows a Student's $t$ distribution) takes a value less than or equal to an arbitrary value. We will however also need to include the degrees of freedom as an argument. 

For example, if we would like to compute $P(T \leq 5)$, and our degrees of freedom is 4, we can use the code `pt(5, 4)`.
If we would like to compute $P(T \geq 5)$, we can be clever and use the code `1 - pt(5, 4)`^[since we are dealing with a continuous random variable].

Based on this information, add extra code to your  `t_test_func` function so that it also provides (as a third output) the corresponding $p$-value for the test statistic.

*Hint: Try to use the symmetry property of the Student's $t$ distribution when including code for the $p$-value calculations in your function.*

### 

Run both your updated `t_test_func` function and the inbuilt `t.test` R function, using the inputs discussed in \@ref(ttest). Verify that your `t_test_func` function $p$-value is correct, by  comparing it to the output of the `t.test` R function.

<br>

#### We now have written functions that, combined, can be used to conduct a one-sample $t$-test for a two-sided hypothesis. Well done! That concludes our computer lab on R function writing. #### {-}

Hopefully this lab has demystified R function creation, and shown you that writing functions in R is perhaps easier than you might have thought.

<br>

<font color = "grey">
These notes have been prepared by Rupert Kuveke and Amanda Shaker. The copyright for the material in these notes resides with the authors named above, with the Department of Mathematical and Physical Sciences and with La Trobe University. Copyright in this work is vested in La Trobe University including all La Trobe University branding and naming. Unless otherwise stated, material within this work is licensed under a Creative Commons Attribution-Non Commercial-Non Derivatives License 
<a href = "https://creativecommons.org/licenses/by-nc-nd/4.0/CC" target="_blank"> BY-NC-ND. </a>
</font>