1 R as an expression language


Exercise 1: Write an expression to compute the number of seconds in a 365-day year, and execute the expression.

365*24*60*60
## [1] 31536000


2 Assignment and workspace objects


Exercise 2: Define a workspace object which contains the number of seconds in a 365-day year, and display the result.

(Second_per_year <- 365*24*60*60)
## [1] 31536000


3 Functions and methods


Exercise 3: Find the function name for base-10 logarithms, and compute the base-10 logarithm of 10, 100, and 1000 (use the ?? function at the console to search).

log10(10); log10(100); log10(1000)
## [1] 1
## [1] 2
## [1] 3
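Because log10 is vectorized, the three calls can also be collapsed into one. A minimal sketch (the name powers_of_ten is introduced here only for illustration):

```r
# Illustrative vector holding the three inputs
powers_of_ten <- c(10, 100, 1000)

# log10() is applied element-wise, returning one result per input
log_values <- log10(powers_of_ten)
print(log_values)
```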


Exercise 4: What are the arguments of the rbinom (random numbers following the binomial distribution) function? Do any have defaults, or must all be specified? What is the value returned?

  • rbinom has three arguments, none with a default value, so all must be specified:
## n    number of observations. If length(n) > 1, the length is taken to be the number required.
## size number of trials (zero or more).
## prob probability of success on each trial.
  • rbinom returns a vector of n random deviates, each the number of successes in size trials.
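The argument list can also be checked from the console rather than the help page. A small sketch using base R's args and formals (arg_names is an illustrative name):

```r
# Print the function signature
print(args(rbinom))

# Extract just the argument names as a character vector
arg_names <- names(formals(rbinom))
print(arg_names)
```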


Exercise 5: Display the vector of the number of successes in 24 trials with probability of success 0.2 (20%), with this simulation carried out 128 times.

(vector_e5 <- rbinom(128, 24, 0.2))
##   [1]  7  5  2  8  5  7  6  7  8  7  4  5  4  5  4  4  5  5  7  6  3  4  2  6  3
##  [26]  3  1  5  3  5  1  4  5  3  2  4  3  5  4  4  5  7  4  1  3  7  8  5  4  6
##  [51]  6  8  6  2  4  5  7  4  3  5  4  3  5  5  6  6  4  5  3  2  4  7  5  6  8
##  [76]  3  1  5  3  5  4  6  2  5  5  1  5  4  5  6  5  4  3 10  3  4  6  6  3  4
## [101]  7  4  2  6  3  4  6  5  2 11  4  4  3  2  2  5  6  8  6  2  1  5  3  5  3
## [126]  5  5  0
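Since the draws differ on every run, fixing the random seed is one way to make the output repeatable. A sketch, where the seed value 123 and the object names are arbitrary choices and not part of the exercise:

```r
# Fix the seed so the same pseudo-random draws are produced each time
set.seed(123)
draws_a <- rbinom(128, size = 24, prob = 0.2)

# Resetting to the same seed reproduces the identical vector
set.seed(123)
draws_b <- rbinom(128, size = 24, prob = 0.2)

print(identical(draws_a, draws_b))
```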


4 Including computations in the text


Exercise 6: Summarize the result of rbinom (previous exercise) with the table function. What is the range of results, i.e., the minimum and maximum values? Which is the most likely result? For these, write text which includes the computed results. This is necessary because the results change with each random sampling.

(table(vector_e5))
## vector_e5
##  0  1  2  3  4  5  6  7  8 10 11 
##  1  6 11 19 25 31 17 10  6  1  1
  • The range of vector_e5 is from 0 to 11.
  • The most frequent result is 5 (31 of the 128 simulations).
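The figures in the text above can be computed rather than typed by hand, so that they stay correct when the sample changes. A minimal sketch, assuming a freshly simulated vector (draws and the other names are illustrative); in an R Markdown source the same expressions could be placed in inline code:

```r
draws <- rbinom(128, size = 24, prob = 0.2)
counts <- table(draws)

# Minimum and maximum observed number of successes
draw_range <- range(draws)

# Most frequent value: the table name with the largest count
# (which.max returns the first one if there is a tie)
most_frequent <- as.integer(names(counts)[which.max(counts)])

cat(sprintf("The results range from %d to %d; the most frequent result is %d.\n",
            draw_range[1], draw_range[2], most_frequent))
```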


5 Vectorized operations


Exercise 7: Create and display a vector representing latitudes in degrees from \(0^\circ\) (equator) to \(+90^\circ\) (north pole), in intervals of \(5^\circ\). Compute and display their cosines – recall, the trig functions in R expect arguments in radians. Find and display the maximum cosine.

(latitudes <- seq(0, 90, by=5))
##  [1]  0  5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90
# R's trig functions expect radians: radians = degrees * pi/180
cos(latitudes * pi/180)
##  [1] 1.000000e+00 9.961947e-01 9.848078e-01 9.659258e-01 9.396926e-01
##  [6] 9.063078e-01 8.660254e-01 8.191520e-01 7.660444e-01 7.071068e-01
## [11] 6.427876e-01 5.735764e-01 5.000000e-01 4.226183e-01 3.420201e-01
## [16] 2.588190e-01 1.736482e-01 8.715574e-02 6.123234e-17
max(cos(latitudes * pi/180))
## [1] 1
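R's trig functions take radians, so an angle in degrees has to be multiplied by pi/180 before calling cos. A brief sketch of that conversion, alongside base R's cospi, which computes cos(pi * x) and avoids the tiny rounding residue at 90 degrees (lat_deg and the result names are illustrative):

```r
lat_deg <- seq(0, 90, by = 5)

# Convert degrees to radians before calling cos()
cos_rad <- cos(lat_deg * pi / 180)

# cospi(x) computes cos(pi * x), so pass the angle as a fraction of 180 degrees
cos_pi <- cospi(lat_deg / 180)

print(round(cos_rad, 4))
print(cos_pi[length(cos_pi)])  # cosine at 90 degrees
```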


6 Packages


Exercise 8: Check if the gstat package is installed on your system. If not, install it. Load it into the workspace. Display its help and find the variogram function. What is its description?

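# Install gstat only if it is not already available
if (!requireNamespace('gstat', quietly=TRUE)) {
  install.packages('gstat', dependencies=TRUE)
}
library('gstat')
help(package='gstat')   # package help index, which lists variogram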
search()
##  [1] ".GlobalEnv"        "package:gstat"     "package:stats"    
##  [4] "package:graphics"  "package:grDevices" "package:utils"    
##  [7] "package:datasets"  "package:methods"   "Autoloads"        
## [10] "package:base"
help('variogram')
  • The description of variogram, from its help page:
## Calculates the sample variogram from data, or in case of a linear model is given, for the residuals, with options for directional, robust, and pooled variogram, and for irregular distance intervals.

## In case spatio-temporal data is provided, the function variogramST is called with a different set of parameters.


7 Classes


7.1 Fundamental classes


Exercise 9: Display the classes of the built-in constant pi and of the built-in constant letters.

class(pi)
## [1] "numeric"
class(letters)
## [1] "character"


7.2 Derived classes



7.3 Classes defined by functions


Exercise 10: What is the class of the object returned by the variogram function? (Hint: see the heading “Value” in the help text.)

  • The returned object has class c("gstatVariogram", "data.frame"): a data frame that also carries the gstatVariogram class.
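An object can carry several classes at once, as the two-element answer above shows. A minimal sketch of how this works, using a made-up class name toyVariogram and made-up values rather than any real gstat object:

```r
# Build a data frame and prepend a hypothetical class to its class vector
toy <- data.frame(dist = c(10, 20, 30), gamma = c(0.2, 0.5, 0.7))
class(toy) <- c("toyVariogram", class(toy))

print(class(toy))

# inherits() tests membership in the class vector
print(inherits(toy, "data.frame"))
print(inherits(toy, "toyVariogram"))
```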


8 Example datasets


Exercise 11: List the datasets in the gstat package.

data(package="gstat")


Exercise 12: Load, summarize, and show the structure of the oxford dataset.

data(oxford, package="gstat")
summary(oxford)
##     PROFILE           XCOORD        YCOORD          ELEV       PROFCLASS
##  Min.   :  1.00   Min.   :100   Min.   : 100   Min.   :540.0   Cr:19    
##  1st Qu.: 32.25   1st Qu.:200   1st Qu.: 600   1st Qu.:558.0   Ct:36    
##  Median : 63.50   Median :350   Median :1100   Median :573.0   Ia:71    
##  Mean   : 63.50   Mean   :350   Mean   :1100   Mean   :573.6            
##  3rd Qu.: 94.75   3rd Qu.:500   3rd Qu.:1600   3rd Qu.:584.5            
##  Max.   :126.00   Max.   :600   Max.   :2100   Max.   :632.0            
##  MAPCLASS      VAL1            CHR1           LIME1            VAL2     
##  Cr:31    Min.   :2.000   Min.   :1.000   Min.   :0.000   Min.   :4.00  
##  Ct:36    1st Qu.:3.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:4.00  
##  Ia:59    Median :4.000   Median :2.000   Median :4.000   Median :8.00  
##           Mean   :3.508   Mean   :2.468   Mean   :2.643   Mean   :6.23  
##           3rd Qu.:4.000   3rd Qu.:3.000   3rd Qu.:4.000   3rd Qu.:8.00  
##           Max.   :4.000   Max.   :4.000   Max.   :4.000   Max.   :8.00  
##       CHR2       LIME2          DEPTHCM         DEP2LIME         PCLAY1     
##  Min.   :2   Min.   :0.000   Min.   :10.00   Min.   :20.00   Min.   :10.00  
##  1st Qu.:2   1st Qu.:4.000   1st Qu.:25.00   1st Qu.:20.00   1st Qu.:20.00  
##  Median :2   Median :5.000   Median :36.00   Median :20.00   Median :24.50  
##  Mean   :3   Mean   :3.889   Mean   :46.25   Mean   :30.32   Mean   :24.44  
##  3rd Qu.:4   3rd Qu.:5.000   3rd Qu.:64.75   3rd Qu.:40.00   3rd Qu.:28.00  
##  Max.   :6   Max.   :5.000   Max.   :91.00   Max.   :90.00   Max.   :37.00  
##      PCLAY2           MG1              OM1              CEC1      
##  Min.   :10.00   Min.   : 19.00   Min.   : 2.600   Min.   : 7.00  
##  1st Qu.:10.00   1st Qu.: 44.00   1st Qu.: 4.100   1st Qu.:12.00  
##  Median :10.00   Median : 72.00   Median : 5.350   Median :15.00  
##  Mean   :14.76   Mean   : 93.53   Mean   : 5.995   Mean   :18.88  
##  3rd Qu.:20.00   3rd Qu.:123.25   3rd Qu.: 7.175   3rd Qu.:25.25  
##  Max.   :40.00   Max.   :308.00   Max.   :13.100   Max.   :43.00  
##       PH1            PHOS1             POT1      
##  Min.   :4.200   Min.   : 1.700   Min.   : 83.0  
##  1st Qu.:7.200   1st Qu.: 6.200   1st Qu.:127.0  
##  Median :7.500   Median : 8.500   Median :164.0  
##  Mean   :7.152   Mean   : 8.752   Mean   :181.7  
##  3rd Qu.:7.600   3rd Qu.:10.500   3rd Qu.:194.8  
##  Max.   :7.700   Max.   :25.000   Max.   :847.0
str(oxford)
## 'data.frame':    126 obs. of  22 variables:
##  $ PROFILE  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ XCOORD   : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ YCOORD   : num  2100 2000 1900 1800 1700 1600 1500 1400 1300 1200 ...
##  $ ELEV     : num  598 597 610 615 610 595 580 590 598 588 ...
##  $ PROFCLASS: Factor w/ 3 levels "Cr","Ct","Ia": 2 2 2 3 3 2 3 2 3 3 ...
##  $ MAPCLASS : Factor w/ 3 levels "Cr","Ct","Ia": 2 3 3 3 3 2 2 3 3 3 ...
##  $ VAL1     : num  3 3 4 4 3 3 4 4 4 3 ...
##  $ CHR1     : num  3 3 3 3 3 2 2 3 3 3 ...
##  $ LIME1    : num  4 4 4 4 4 0 2 1 0 4 ...
##  $ VAL2     : num  4 4 5 8 8 4 8 4 8 8 ...
##  $ CHR2     : num  4 4 4 2 2 4 2 4 2 2 ...
##  $ LIME2    : num  4 4 4 5 5 4 5 4 5 5 ...
##  $ DEPTHCM  : num  61 91 46 20 20 91 30 61 38 25 ...
##  $ DEP2LIME : num  20 20 20 20 20 20 20 20 40 20 ...
##  $ PCLAY1   : num  15 25 20 20 18 25 25 35 35 12 ...
##  $ PCLAY2   : num  10 10 20 10 10 20 10 20 10 10 ...
##  $ MG1      : num  63 58 55 60 88 168 99 59 233 87 ...
##  $ OM1      : num  5.7 5.6 5.8 6.2 8.4 6.4 7.1 3.8 5 9.2 ...
##  $ CEC1     : num  20 22 17 23 27 27 21 14 27 20 ...
##  $ PH1      : num  7.7 7.7 7.5 7.6 7.6 7 7.5 7.6 6.6 7.5 ...
##  $ PHOS1    : num  13 9.2 10.5 8.8 13 9.3 10 9 15 12.6 ...
##  $ POT1     : num  196 157 115 172 238 164 312 184 123 282 ...


9 Data frames


Exercise 13: Load the women sample dataset. How many observations (cases) are there, and how many attributes (fields) for each case? What are the column (field) and row names? What is the height of the first-listed woman?

data("women")
str(women)
## 'data.frame':    15 obs. of  2 variables:
##  $ height: num  58 59 60 61 62 63 64 65 66 67 ...
##  $ weight: num  115 117 120 123 126 129 132 135 139 142 ...
colnames(women)
## [1] "height" "weight"
rownames(women)
##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
women[1, "height"]
## [1] 58
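The counts of observations and attributes can also be obtained directly instead of being read from the str output. A short sketch using base R's dim, nrow, and ncol on the same built-in dataset (the object names are illustrative):

```r
data("women")

# dim() returns c(rows, columns): observations, then attributes
women_dim <- dim(women)
print(women_dim)

n_obs <- nrow(women)
n_attr <- ncol(women)
cat(n_obs, "observations of", n_attr, "attributes\n")
```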


9.1 Factors


Exercise 14: List the factors in the oxford dataset.

str(oxford)
## 'data.frame':    126 obs. of  22 variables:
##  $ PROFILE  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ XCOORD   : num  100 100 100 100 100 100 100 100 100 100 ...
##  $ YCOORD   : num  2100 2000 1900 1800 1700 1600 1500 1400 1300 1200 ...
##  $ ELEV     : num  598 597 610 615 610 595 580 590 598 588 ...
##  $ PROFCLASS: Factor w/ 3 levels "Cr","Ct","Ia": 2 2 2 3 3 2 3 2 3 3 ...
##  $ MAPCLASS : Factor w/ 3 levels "Cr","Ct","Ia": 2 3 3 3 3 2 2 3 3 3 ...
##  $ VAL1     : num  3 3 4 4 3 3 4 4 4 3 ...
##  $ CHR1     : num  3 3 3 3 3 2 2 3 3 3 ...
##  $ LIME1    : num  4 4 4 4 4 0 2 1 0 4 ...
##  $ VAL2     : num  4 4 5 8 8 4 8 4 8 8 ...
##  $ CHR2     : num  4 4 4 2 2 4 2 4 2 2 ...
##  $ LIME2    : num  4 4 4 5 5 4 5 4 5 5 ...
##  $ DEPTHCM  : num  61 91 46 20 20 91 30 61 38 25 ...
##  $ DEP2LIME : num  20 20 20 20 20 20 20 20 40 20 ...
##  $ PCLAY1   : num  15 25 20 20 18 25 25 35 35 12 ...
##  $ PCLAY2   : num  10 10 20 10 10 20 10 20 10 10 ...
##  $ MG1      : num  63 58 55 60 88 168 99 59 233 87 ...
##  $ OM1      : num  5.7 5.6 5.8 6.2 8.4 6.4 7.1 3.8 5 9.2 ...
##  $ CEC1     : num  20 22 17 23 27 27 21 14 27 20 ...
##  $ PH1      : num  7.7 7.7 7.5 7.6 7.6 7 7.5 7.6 6.6 7.5 ...
##  $ PHOS1    : num  13 9.2 10.5 8.8 13 9.3 10 9 15 12.6 ...
##  $ POT1     : num  196 157 115 172 238 164 312 184 123 282 ...
  • PROFCLASS and MAPCLASS are oxford’s factors.
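The factor columns can also be picked out programmatically instead of being read off the str output. A sketch with a hypothetical helper factor_columns, demonstrated on the built-in iris data so that it runs without gstat:

```r
# Hypothetical helper: names of all columns of a data frame that are factors
factor_columns <- function(df) {
  names(df)[vapply(df, is.factor, logical(1))]
}

iris_factors <- factor_columns(iris)
print(iris_factors)
```

Calling factor_columns(oxford) after loading the dataset should return the two factors listed above.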


10 Missing values


11 Logical expressions


Exercise 15: Identify the thin trees, defined as those with height/girth ratio more than 1 s.d. above the mean. You will have to define a new field in the dataframe with this ratio, and then use the mean and sd summary functions, along with a logical expression.

(trees$"Height/Girth" <- trees$Height/trees$Girth)
##  [1] 8.433735 7.558140 7.159091 6.857143 7.570093 7.685185 6.000000 6.818182
##  [9] 7.207207 6.696429 6.991150 6.666667 6.666667 5.897436 6.250000 5.736434
## [17] 6.589147 6.466165 5.182482 4.637681 5.571429 5.633803 5.103448 4.500000
## [25] 4.723926 4.682081 4.685714 4.469274 4.444444 4.444444 4.223301
(sd_e15 <- sd(trees$"Height/Girth"))
## [1] 1.186666
(mean_e15 <- mean(trees$"Height/Girth"))
## [1] 5.985513
(trees$Thin <- trees$"Height/Girth" > mean_e15+sd_e15)
##  [1]  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
(trees)
##    Girth Height Volume Height/Girth  Thin
## 1    8.3     70   10.3     8.433735  TRUE
## 2    8.6     65   10.3     7.558140  TRUE
## 3    8.8     63   10.2     7.159091 FALSE
## 4   10.5     72   16.4     6.857143 FALSE
## 5   10.7     81   18.8     7.570093  TRUE
## 6   10.8     83   19.7     7.685185  TRUE
## 7   11.0     66   15.6     6.000000 FALSE
## 8   11.0     75   18.2     6.818182 FALSE
## 9   11.1     80   22.6     7.207207  TRUE
## 10  11.2     75   19.9     6.696429 FALSE
## 11  11.3     79   24.2     6.991150 FALSE
## 12  11.4     76   21.0     6.666667 FALSE
## 13  11.4     76   21.4     6.666667 FALSE
## 14  11.7     69   21.3     5.897436 FALSE
## 15  12.0     75   19.1     6.250000 FALSE
## 16  12.9     74   22.2     5.736434 FALSE
## 17  12.9     85   33.8     6.589147 FALSE
## 18  13.3     86   27.4     6.466165 FALSE
## 19  13.7     71   25.7     5.182482 FALSE
## 20  13.8     64   24.9     4.637681 FALSE
## 21  14.0     78   34.5     5.571429 FALSE
## 22  14.2     80   31.7     5.633803 FALSE
## 23  14.5     74   36.3     5.103448 FALSE
## 24  16.0     72   38.3     4.500000 FALSE
## 25  16.3     77   42.6     4.723926 FALSE
## 26  17.3     81   55.4     4.682081 FALSE
## 27  17.5     82   55.7     4.685714 FALSE
## 28  17.9     80   58.3     4.469274 FALSE
## 29  18.0     80   51.5     4.444444 FALSE
## 30  18.0     80   51.0     4.444444 FALSE
## 31  20.6     87   77.0     4.223301 FALSE
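To list only the thin trees, the logical vector can be used as a row index. A self-contained sketch that recomputes the ratio on a fresh copy of trees (tr, ratio, is_thin, and thin_rows are illustrative names):

```r
tr <- datasets::trees
ratio <- tr$Height / tr$Girth

# TRUE where the ratio exceeds the mean by more than one standard deviation
is_thin <- ratio > mean(ratio) + sd(ratio)

# Row numbers of the thin trees, and the matching rows
thin_rows <- which(is_thin)
print(thin_rows)
print(tr[is_thin, ])
```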


12 Combination: functions, logical expressions, simulation


13 Graphics


Exercise 16: Display a histogram of the diamond prices in the diamonds dataset.

library(ggplot2)
histogram_e16 <- ggplot(data=diamonds) +
        geom_histogram(mapping = aes(x=price), binwidth = 500,
                        colour="pink") +
        geom_rug(mapping = aes(x=price))
print(histogram_e16)


14 Statistical models


Exercise 17: Write a model to predict tree height from tree girth. How much of the height can be predicted from the girth?

model_e17 <- lm(Height ~ Girth, data=trees)
summary(model_e17)
## 
## Call:
## lm(formula = Height ~ Girth, data = trees)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.5816  -2.7686   0.3163   2.4728   9.9456 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  62.0313     4.3833  14.152 1.49e-14 ***
## Girth         1.0544     0.3222   3.272  0.00276 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.538 on 29 degrees of freedom
## Multiple R-squared:  0.2697, Adjusted R-squared:  0.2445 
## F-statistic: 10.71 on 1 and 29 DF,  p-value: 0.002758
  • Girth explains about 27% of the variance in height (multiple R-squared 0.2697; adjusted 0.2445).
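The share of variance explained can be pulled out of the model summary rather than read from the printout. A minimal sketch (fit and r2 are illustrative names):

```r
fit <- lm(Height ~ Girth, data = datasets::trees)

# summary.lm stores the coefficient of determination as r.squared
r2 <- summary(fit)$r.squared
cat(sprintf("Girth explains about %.1f%% of the variance in height.\n", 100 * r2))
```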


Exercise 18: Write a model to predict tree volume as a linear function of tree height and tree girth, with no interaction.

model_e18 <- lm(Volume ~ Height + Girth, data=trees)
summary(model_e18)
## 
## Call:
## lm(formula = Volume ~ Height + Girth, data = trees)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.4065 -2.6493 -0.2876  2.2003  8.4847 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -57.9877     8.6382  -6.713 2.75e-07 ***
## Height        0.3393     0.1302   2.607   0.0145 *  
## Girth         4.7082     0.2643  17.816  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.882 on 28 degrees of freedom
## Multiple R-squared:  0.948,  Adjusted R-squared:  0.9442 
## F-statistic:   255 on 2 and 28 DF,  p-value: < 2.2e-16
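A fitted model like this one can be used to predict the volume of a new tree. A sketch, assuming a hypothetical tree of height 75 and girth 12 in the units of the dataset (fit2, new_tree, and predicted_volume are illustrative names):

```r
fit2 <- lm(Volume ~ Height + Girth, data = datasets::trees)

# Hypothetical new observation; column names must match the predictors
new_tree <- data.frame(Height = 75, Girth = 12)
predicted_volume <- predict(fit2, newdata = new_tree)
print(predicted_volume)
```

The prediction is the intercept plus each coefficient times the corresponding predictor value, with no interaction term.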


15 Control structures


15.1 The apply family of functions


16 User-defined functions


Exercise 19: Write a function to restrict the values of a vector to the range \(0 \ldots 1\). Any values \(< 0\) should be replaced with \(0\), and any values \(>1\) should be replaced with \(1\). Test the function on a vector with elements from \(-1.2\) to \(+1.2\) in increments of \(0.1\) – see the seq “sequence” function.

(vector_e19 <- seq(-1.2, 1.2, by = 0.1))
##  [1] -1.200000e+00 -1.100000e+00 -1.000000e+00 -9.000000e-01 -8.000000e-01
##  [6] -7.000000e-01 -6.000000e-01 -5.000000e-01 -4.000000e-01 -3.000000e-01
## [11] -2.000000e-01 -1.000000e-01  2.220446e-16  1.000000e-01  2.000000e-01
## [16]  3.000000e-01  4.000000e-01  5.000000e-01  6.000000e-01  7.000000e-01
## [21]  8.000000e-01  9.000000e-01  1.000000e+00  1.100000e+00  1.200000e+00
restrict_vector_e19 <- function(v){
  for (i in seq_along(v)){  # seq_along is safe when v is empty, unlike 1:length(v)
    if (v[i] < 0){
      v[i] <- 0
    }
    else if (v[i] > 1){
      v[i] <- 1
    }
  }
  return(v)
}
restrict_vector_e19(vector_e19)
##  [1] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
##  [6] 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00
## [11] 0.000000e+00 0.000000e+00 2.220446e-16 1.000000e-01 2.000000e-01
## [16] 3.000000e-01 4.000000e-01 5.000000e-01 6.000000e-01 7.000000e-01
## [21] 8.000000e-01 9.000000e-01 1.000000e+00 1.000000e+00 1.000000e+00
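The same clamping can be written without an explicit loop, using the vectorized pmax and pmin. A sketch (clamp01 and clamped are illustrative names):

```r
clamp01 <- function(v) {
  # pmax raises values below 0 to 0; pmin then lowers values above 1 to 1
  pmin(pmax(v, 0), 1)
}

clamped <- clamp01(seq(-1.2, 1.2, by = 0.1))
print(range(clamped))
```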


17 Import/Export


18 The Tidyverse


Bonus Exercise: Use tidyverse functions and pipes on the trees dataset to select the trees (use the filter function) with a volume greater than the median volume (use the median function), compute the ratio of girth to height as a new variable (use the mutate function), and sort by this ratio (use the arrange function) from thin to thick trees.

library(dplyr)   # provides filter, mutate, arrange, and the %>% pipe
trees %>% 
  filter(Volume>median(Volume)) %>%
  mutate(ratio_be = round(Girth/Height, 3)) %>%
  arrange(ratio_be)
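For comparison, the same steps can be sketched in base R without pipes (tr and big_trees are illustrative names):

```r
tr <- datasets::trees

# Keep trees with volume above the median
big_trees <- tr[tr$Volume > median(tr$Volume), ]

# Girth-to-height ratio; small values mean thin trees
big_trees$ratio_be <- round(big_trees$Girth / big_trees$Height, 3)

# Sort ascending, i.e., from thin to thick
big_trees <- big_trees[order(big_trees$ratio_be), ]
print(big_trees)
```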