Exercises on Functions

Harold Nelson

1/26/2022

Exercise 1

Create a function my_range() that returns the value of the range of a numeric vector.

Solution

my_range = function(x){
  
  return(max(x) - min(x))
}

rn = rnorm(1000)
my_range(rn)
## [1] 6.688954
range(rn)
## [1] -3.012472  3.676481

Note that the built-in range() function does not do the arithmetic: it returns the minimum and the maximum as a vector of length two, not their difference.
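As a quick check on a small vector (the values below are made up for illustration), wrapping the built-in range() in diff() should give the same number as my_range():

```r
# Same definition as in the solution above
my_range = function(x){
  return(max(x) - min(x))
}

x = c(3, 8, 1, 10)   # made-up values

range(x)             # minimum and maximum: 1 10
diff(range(x))       # their difference: 9
my_range(x)          # also 9
```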

Exercise 2

Create a function range_95() that returns the difference between the 95th percentile and the 5th percentile of a numeric vector.

Solution

range_95 = function(x){
  
  return(quantile(x,.95) - quantile(x,.05))
}

range_95(rn)
##      95% 
## 3.278338

Exercise 3

Create a function range_85() that returns the difference between the 85th percentile and the 15th percentile of a numeric vector.

Solution

range_85 = function(x){
  
  return(quantile(x,.85) - quantile(x,.15))
}

range_85(rn)
##      85% 
## 2.082558

Exercise 4

We’ve created separate functions range_85() and range_95(). In addition we have the built-in function IQR(), which is essentially range_75(). Create a function gen_range(x,pct), where the parameter pct takes the place of the 75, 85, and 95 in our examples.

Solution

gen_range = function(x,pct){
  
  top = quantile(x,pct/100)
  bottom = quantile(x,1 - pct/100)
  return(top - bottom)
}

rn = rnorm(1000)
gen_range(rn,85)
##      85% 
## 1.981619
range_85(rn)
##      85% 
## 1.981619
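One way to sanity-check gen_range() is to compare it with IQR(), as the exercise suggests. This is only a sketch on a made-up vector; unname() is used just to drop the "75%" label before comparing.

```r
gen_range = function(x,pct){
  top = quantile(x,pct/100)
  bottom = quantile(x,1 - pct/100)
  return(top - bottom)
}

x = c(2, 4, 4, 5, 7, 9, 10, 12)   # made-up values

# gen_range() with pct = 75 should agree with the built-in IQR()
all.equal(unname(gen_range(x,75)), IQR(x))

# Caution: a pct below 50 swaps top and bottom, so the result is negative
gen_range(x,25)
```

The function as written assumes pct is between 50 and 100.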

Exercise 5

Create a function rmsd(x,y) which returns the square root of the mean of the squares of the differences between x and y.

Solution

rmsd = function(x,y){
  
  diffs = x - y
  diffs_sq = diffs^2
  mdiffs_sq = mean(diffs_sq)
  return(sqrt(mdiffs_sq))
}

x = rnorm(1000)
y = rnorm(1000)
rmsd(x,y)
## [1] 1.451501
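A small hand-checkable case can help confirm the function. With the made-up vectors below, the differences are -3 and -4, the squares are 9 and 16, their mean is 12.5, and the result is the square root of 12.5. The one-line version is an equivalent alternative, not a replacement for the step-by-step solution.

```r
rmsd = function(x,y){
  diffs = x - y
  diffs_sq = diffs^2
  mdiffs_sq = mean(diffs_sq)
  return(sqrt(mdiffs_sq))
}

x = c(0, 0)   # made-up values
y = c(3, 4)

rmsd(x,y)               # sqrt(12.5)
sqrt(mean((x - y)^2))   # same calculation in one line
```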

Exercise 6

Create a function mad(x,y) which returns the mean of the absolute values of the differences between x and y.

Solution

mad = function(x,y){
  
  diffs = x - y
  abs_diffs = abs(diffs)
  return(mean(abs_diffs))
  
}

x = c(1,2,3,4)
y = c(2,1,4,3)

mad(x,y)
## [1] 1
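One thing to be aware of: R's stats package already has a function named mad(), which computes the median absolute deviation of a single vector. Defining our own mad(x,y) in the workspace masks it. The sketch below uses the hypothetical name mean_abs_diff() to avoid the clash; the built-in stays reachable as stats::mad() either way.

```r
# mean_abs_diff is a hypothetical name chosen to avoid masking stats::mad()
mean_abs_diff = function(x,y){
  return(mean(abs(x - y)))
}

x = c(1,2,3,4)
y = c(2,1,4,3)

mean_abs_diff(x,y)   # 1, as in the solution above

# The built-in median absolute deviation takes a single vector
stats::mad(x)
```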

Quantile

You probably noticed that the quantile() function produces a named vector as a result. You may want to know why. The answer is that its second argument may be a vector of probabilities, one for each percentile requested. In that case, the labels tell you which value belongs to which percentile.

Example

rn = rnorm(1000)
values = quantile(rn,c(.1,.25,.5,.75,.9))
values
##         10%         25%         50%         75%         90% 
## -1.26904475 -0.66170522  0.05865024  0.78821800  1.40129157
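If the labels are not wanted, for example when the result feeds into further arithmetic, they can be dropped. This sketch shows two standard ways on a made-up vector: the names argument of quantile() and the unname() function.

```r
x = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)   # made-up values

quantile(x, .9)                  # labelled "90%"
quantile(x, .9, names = FALSE)   # same value, no label
unname(quantile(x, .9))          # also removes the label
```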

Exercise 7

Create an inverse of the quantile() function, qinv(x,val). The parameter x is a numeric vector. The parameter val is a single number. The function returns the fraction of the values of x that are less than val.

Solution

qinv = function(x,val){
  
  return( mean(x < val) )
}

# Example 

rn = rnorm(1000)
qinv(rn,2)
## [1] 0.982
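Base R's ecdf() does nearly the same job, with one difference: it counts values less than or equal to val, while qinv() uses a strict inequality. The two agree unless val itself occurs in x, as this sketch on a made-up vector shows.

```r
qinv = function(x,val){
  return( mean(x < val) )
}

x = c(1, 2, 3, 4)   # made-up values

qinv(x, 3)      # 0.5: only 1 and 2 are strictly less than 3
ecdf(x)(3)      # 0.75: 1, 2 and 3 are less than or equal to 3

qinv(x, 2.5)    # 0.5
ecdf(x)(2.5)    # 0.5: no tie, so the two agree
```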

The summary function

When we apply the summary function to a numeric vector like county$pop2017, we get some useful results.

Example

load("county.rda")
res = summary(county$pop2017)
res
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's 
##       88    10976    25857   103763    67756 10163507        3
str(res)
##  'summaryDefault' Named num [1:7] 88 10976 25857 103763 67756 ...
##  - attr(*, "names")= chr [1:7] "Min." "1st Qu." "Median" "Mean" ...

res is a named vector. We can wrap this function inside another function and add to the output vector before producing a final result.

tb_summary = function(x){
  res = summary(x)
  out = c(res,sd(x,na.rm = TRUE))
  names(out) = c(names(res),"SD")
  return(out)
}

tb_summary(county$pop2017)
##       Min.    1st Qu.     Median       Mean    3rd Qu.       Max.       NA's 
##       88.0    10975.5    25857.0   103763.4    67756.0 10163507.0        3.0 
##         SD 
##   333194.5
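The file county.rda is local to the course, so here is the same function applied to small made-up vectors. The sketch also shows that the function still works when there are no missing values, in which case summary() omits the NA's entry.

```r
tb_summary = function(x){
  res = summary(x)
  out = c(res,sd(x,na.rm = TRUE))
  names(out) = c(names(res),"SD")
  return(out)
}

with_na = c(1, 2, 3, 4, NA)   # made-up values
no_na = c(1, 2, 3, 4)

tb_summary(with_na)   # seven summary entries plus SD
tb_summary(no_na)     # six summary entries plus SD
```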

Histogram

The base R function hist() is easy to use. Get a histogram of the weight variable in cdc2. You need to load the data first.

Solution

load("cdc2.Rdata")
hist(cdc2$weight)

The ggplot2 Version

Do this using ggplot2.

Solution

library(ggplot2)
ggplot(data = cdc2,aes(x = weight)) +
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Complexity Issue

By comparison with the base R hist(), the ggplot2 version is more complex.

Let’s write an R function that uses ggplot2 to do the histogram but is no more complex in usage than the base R function.

Here’s our first try.

gg_hist = function(df,var) {
  
  ggplot(data = df,aes(x = var)) +
    geom_histogram()
}

gg_hist(cdc2,weight)
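## Error in FUN(X[[i]], ...): object 'weight' not found

That failed. ggplot2 does not know that var stands for a column of df, so R ends up looking for an ordinary object named weight and finds none. We need to “embrace” var with double curly braces.

gg_hist = function(df,var) {
  
  ggplot(data = df,aes(x = {{var}})) +
    geom_histogram()
}

gg_hist(cdc2,weight)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.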
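Because cdc2.Rdata is a local file, this sketch of the embraced version uses the built-in mtcars data instead. It also adds a bins argument, which is not part of the original function, to silence the stat_bin() message. The default of 30 matches what ggplot2 picks on its own.

```r
library(ggplot2)

# bins is an added, illustrative argument; the original gg_hist() has none
gg_hist = function(df,var,bins = 30) {
  ggplot(data = df,aes(x = {{var}})) +
    geom_histogram(bins = bins)
}

p = gg_hist(mtcars,mpg,bins = 10)   # the column name is passed unquoted
p
```

Usage stays as simple as hist(): a data frame and a bare column name.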

Density

Use this method to create the function gg_density().

Solution

gg_density = function(df,var) {
  
  ggplot(data = df,aes(x = {{var}})) +
    geom_density()
}

gg_density(cdc2,weight)

Extension to Facetting

Create a function gg_hist_wrap() that produces a histogram of var and facets it by a categorical variable cat.

Solution

gg_hist_wrap = function(df,var,cat){
  ggplot(data = df,aes(x = {{var}})) +
    geom_histogram() +
    facet_wrap(vars({{cat}}))
}

gg_hist_wrap(cdc2,weight,ageCat)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This actually failed the first time I tried it. I googled for a solution and found this.

https://community.rstudio.com/t/problem-with-facet-wrap-and-curly-curly/36975

The formula form, facet_wrap(~{{cat}}), does not work with embracing. Wrapping the embraced argument in vars(), as above, does.

Statistical Functions

First, let’s get immersed in a 19th-century method of computing probabilities using a normal curve table.

Watch https://www.youtube.com/watch?v=xI9ZHGOSaCg

It’s only 11 minutes.