Functions

Harold Nelson

2/6/2019

library(tidyverse)
## ── Attaching packages ──── tidyverse 1.2.1 ──
## ✔ ggplot2 3.0.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.6
## ✔ tidyr   0.8.1     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ─────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Range

The range function in R does not do exactly what we want. The range of a numeric variable in statistics is the difference between the maximum and the minimum value of the variable. But the range function in R returns the minimu and maximum values, but doesn’t do the subtraction. I’ll create a function myRange, which will do the subtraction.

myRange = function(x){
  # x needs to be a numeric vector.
  # Remove any NA values.
  x = x[!is.na(x)]
  return(max(x) - min(x))
}

Now compare the results of this function and the base R function range().

range(mtcars$mpg)
## [1] 10.4 33.9
myRange(mtcars$mpg)
## [1] 23.5

Fixing summary()

The standard function summary leaves out some standard descriptivs statistics of a numeric variable. We can create our own function mySummary to do what we want.

mySummary = function(x){
  # x needs to be a numeric vector.
  # Remove any NA values.
  x = x[!is.na(x)]
  
  result = summary(x)
result = c(result,sd(x))
result = c(result,myRange(x))
result = c(result,IQR(x))
names(result)[7] = "sd"
names(result)[8] = "range"
names(result)[9] = "IQR"

  return(result)  
}
mySummary(mtcars$mpg)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max.        sd 
## 10.400000 15.425000 19.200000 20.090625 22.800000 33.900000  6.026948 
##     range       IQR 
## 23.500000  7.375000

myNormProb

This is a function I wrote to display the results of normal probability calculations. It’s a good example of the usage of arguments.

MyNormProb <- function(
  lb = NA,                       # Lower bound
  ub = NA,                        # Upper bound
  mean = 0,                      # Mean 
  sd = 1,                        # Standard deviation
  MyLabel = "Standard Normal RV" # Description of the variable
  
  # This function produces a plot and the probability that
  # a normal random variable with the given mean and
  # standrard deviation falls between the given lower and 
  # upper bounds. 
  
  ) 
{

if (is.na(lb)) {lb <- mean - 4 * sd}
if (is.na(ub)) {ub <- mean + 4 * sd}

x <- seq(-4,4,length=1000)*sd + mean
hx <- dnorm(x,mean,sd)

plot(x, hx, type="n", xlab=MyLabel, ylab="Density",
     main="Normal Distribution", axes=FALSE)

i <- x >= lb & x <= ub
lines(x, hx)
polygon(c(lb,x[i],ub), c(0,hx[i],0), col="red")

area <- pnorm(ub, mean, sd) - pnorm(lb, mean, sd)
result <- paste("P(",lb,"< ",MyLabel," <" ,ub,") =",
                signif(area, digits=3))
mtext(result,3)
abline(0,0)
segments(x0 = mean, y0 = 0, x1 = mean, y1 = dnorm(mean,mean,sd))
return(area)
}

v <- MyNormProb(ub=1.5,MyLabel="Standard Normal Distribution")

Exercise

The functions IQR and myRange are similar. What do they have in common? Can we parameterize the common feature and use it to build a generalized function?

midXX = function(x,XX){
  # x is a numeric vector
  # XX is the middle percentage
  # the function returns the length
  # of the interval covered by the middle   # XX% of the data in x.
  
  dec = XX/100
  lb = .5*(1-dec)
  ub = dec + lb
  midrange = (quantile(x,ub) - quantile(x,lb))
  names(midrange) = ""
  return(midrange)
}

nrv = rnorm(1000)
midXX(nrv,50)
##          
## 1.392344
IQR(nrv)
## [1] 1.392344

Exercise

Let’s load the olywthr data and look at the first 15 days of February.

Solution

load("~/Downloads/olywthr.rdata")
earlyFeb = filter(olywthr,mo == 2 & dy >= 1 & dy <= 15)

Exercise

What is the probability that snow occurs on a date in this range?

Solution

mean(earlyFeb$SNOW>0)
## [1] 0.04529617

Exercise

What is the probability that the minimum temperature on a date in this range is below 32 degrees?

Solution

mean(earlyFeb$TMIN < 32)
## [1] 0.4156297