This R Markdown document contains exercises to accompany the course “Data analysis and visualization using R”.
This is part 1 of the exercises.
This document contains the exercises themselves plus (in most cases) an R code chunk to complete, correct or create.
Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button, a document is generated that includes both the content and the output of any embedded R code chunks. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
Before knitting this document, check whether you have the devtools package installed by typing library(devtools) in the console. If this fails, install it by typing install.packages("devtools").
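The check described above can also be done without attaching the package. The following is a minimal sketch, assuming you only want to know whether devtools is available; the variable name has_devtools is made up for this illustration, and nothing is installed by this code.

```r
# requireNamespace() returns TRUE or FALSE instead of stopping with an error,
# which makes it convenient for checking whether a package is available.
has_devtools <- requireNamespace("devtools", quietly = TRUE)

if (has_devtools) {
  message("devtools is installed")
} else {
  message("devtools is missing; run install.packages(\"devtools\") in the console")
}
```

Using library(devtools) in the console, as suggested above, works just as well; the difference is only that library() stops with an error when the package is missing.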
In this section you will need to use R as a calculator to get the requested output. The first question is already solved in part so you can see what is expected of you.
Calculate the following
(a) 42 times 3
42 * 3
## [1] 126
(b) 20 divided by 7
## your code here
(c) the remainder of 20 divided by 7
## your code here
(d) 13 divided by 3, rounded down
## your code here
(e) 13 divided by 3, rounded up
## your code here
(f) 5 divided by 2, rounded
## your code here
(g) 7 divided by 2, rounded
## your code here
(h) can you explain the two results of questions f and g?
<YOUR ANSWER HERE>
Calculate/carry out the following
(a) create a vector with the numbers 2, 4, 7, 1, 5 and assign it to a variable called myNumbers
##your code here
(b) create a new vector, squared, that contains the numbers from myNumbers, but squared
##your code here
(c) create a new logical vector, isEven, that holds a Boolean value for each number in myNumbers, indicating whether it is an even number or not
##your code here
(d) create a vector with 9 empty strings and (bonus) put the string “hello, world” in the 5th position
##your code here
(e) create a factor with the values “blue”, “red”, “blue”, “green”, “blue”, “red”. Plot the frequencies using plot()
##your code here
(f) create a vector with the numbers 6 to 10 in it
##your code here
(g) create a vector with the values 1 2 3 1 2 3 1 2 3 without exactly typing this sequence
##your code here
(h) create a vector with the values 1 1 1 2 2 2 4 4 5 5 6 6 4 4 5 5 6 6, but without exactly typing this sequence
##your code here
(i) create a vector with the sequence of numbers 3.5, 3.7, 3.9, … 4.7 in it, but without exactly typing this sequence
##your code here
Calculate/carry out the following. With all plots, take care to adhere to the rules regarding titles and other decorations. Tip: the site Quick-R has nice detailed information with examples on the different plot types and their configuration. Especially the section on plotting is helpful for these assignments.
The vectors below hold data for a staircase walking experiment. A subject of normal weight and height was asked to ascend a (long) staircase wearing a heart-rate monitor. The subject’s heart rate was recorded for different stair heights. Create a line (!) plot showing the relationship between heart rate and stair height.
#number of steps on the stairs
stair_height <- c(0, 5, 10, 15, 20, 25, 30, 35)
#heart rate after ascending the stairs
heart_rate <- c(66, 65, 67, 69, 73, 79, 86, 97)
##your code here creating the plot
The experiment from the previous question was extended with three more subjects. One of these subjects was also of normal weight, while the other two were obese. The data are given below. Create a single scatter plot with connector lines between the points showing the data for all four subjects. Give the normal-weight subjects a green line/marker and the obese subjects a red line/marker. You can add new data series to a plot by using the points(x, y) function. Use the ylim argument of plot() to adjust the Y-axis range.
#number of steps on the stairs
stair_height <- c(0, 5, 10, 15, 20, 25, 30, 35)
#heart rates for subjects with normal weight
heart_rate_1 <- c(66, 65, 67, 69, 73, 79, 86, 97)
heart_rate_2 <- c(61, 61, 63, 68, 74, 81, 89, 104)
#heart rates for obese subjects
heart_rate_3 <- c(58, 60, 67, 71, 78, 89, 104, 121)
heart_rate_4 <- c(69, 73, 77, 83, 88, 96, 102, 127)
##your code here creating the plot
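When several series share one plot, the Y-axis range has to fit all of them, because plot() only looks at the first series when choosing its limits. The sketch below shows one way to do this with made-up data (x, series_a and series_b are hypothetical and are not the exercise data): compute the limits with range() over all series and pass them to the ylim argument.

```r
# Hypothetical example data, not related to the staircase experiment
x <- 1:5
series_a <- c(10, 12, 15, 19, 24)
series_b <- c(8, 14, 21, 30, 41)

# range() over the combined values gives limits that fit every point
y_limits <- range(c(series_a, series_b))

# Draw the first series with the shared limits, then add the second one
plot(x, series_a, type = "b", col = "darkgreen",
     ylim = y_limits, xlab = "x", ylab = "y",
     main = "Two series on a shared Y-axis")
points(x, series_b, type = "b", col = "red")
```

Typing the limits by hand, for example ylim = c(0, 50), works too; range() simply avoids having to look up the minimum and maximum yourself.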
The body weights of chicks were measured at birth and every second day thereafter until day 20. They were also measured on day 21. There were four groups of chicks on different protein diets. Here are the data for the first four chicks. Chicks one and two were on diet 1 and chicks three and four on diet 2. Create a single line plot showing the data for all four chicks. Give each chick its own color.
# chick weight data
time <- c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 21)
chick_1 <- c(42, 51, 59, 64, 76, 93, 106, 125, 149, 171, 199, 205)
chick_2 <- c(40, 49, 58, 72, 84, 103, 122, 138, 162, 187, 209, 215)
chick_3 <- c(42, 53, 62, 73, 85, 102, 123, 138, 170, 204, 235, 256)
chick_4 <- c(41, 49, 61, 74, 98, 109, 128, 154, 192, 232, 280, 290)
##your code here creating the plot
With the data from the previous question, create a barplot of the maximum weights of the chicks.
##your code here
The R language comes with a wealth of datasets for you to use as practice materials. We will see many of these. One of them is the time-series dataset called discoveries, which holds the numbers of “great” inventions and scientific discoveries in each year from 1860 to 1959. Create plot(s) answering these three questions:
(a) What is the frequency distribution of numbers of discoveries per year?
(b) What is the 5-number summary of discoveries per year?
(c) What is the trend over time for the numbers of discoveries per year?
PS: actually this is not a simple vector, but a vector with some time-related attributes called a time series (class ts). This does not really matter for this assignment.
#load datasets, if not already loaded
library(datasets)
#look at the discoveries dataset
discoveries
## Time Series:
## Start = 1860
## End = 1959
## Frequency = 1
## [1] 5 3 0 2 0 3 2 3 6 1 2 1 2 1 3 3 3 5 2 4 4 0 2
## [24] 3 7 12 3 10 9 2 3 7 7 2 3 3 6 2 4 3 5 2 2 4 0 4
## [47] 2 5 2 3 3 6 5 8 3 6 6 0 5 2 2 2 6 3 4 4 2 2 4
## [70] 7 5 3 3 0 2 2 2 1 3 4 2 2 1 1 1 2 1 4 4 3 2 1
## [93] 4 1 1 1 0 0 2 0
##your code here
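To see what the remark about the ts class means in practice, the following sketch inspects the time-related attributes of discoveries and converts it to a plain numeric vector. It only uses base R functions; the variable names years and counts are made up for this illustration, and the conversion is optional for the assignment.

```r
library(datasets)

# A ts object is a vector that carries time-related attributes
class(discoveries)       # the class of the object
start(discoveries)       # first time point
end(discoveries)         # last time point
frequency(discoveries)   # number of observations per time unit (per year here)

# time() returns the time point (year) belonging to each observation
years <- time(discoveries)

# as.numeric() drops the time-series attributes and leaves a plain vector
counts <- as.numeric(discoveries)
length(counts)
```

Most functions accept the ts object directly, so the conversion is only needed when a function treats time series differently from plain vectors.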
The R datasets package has three related time-series datasets on lung cancer deaths. These are ldeaths, mdeaths and fdeaths for total, male and female deaths, respectively. Create a line plot showing the monthly mortality for all three of these datasets. Use the legend() function to add a legend to the plot, as shown in this example:
t <- 1:5
y1 <- c(2, 3, 5, 4, 6)
y2 <- c(1, 3, 4, 5, 7)
plot(t, y1, type = "b", ylab = "response", ylim = c(0, 8))
points(t, y2, col = "blue", type = "b")
legend("topleft", legend = c("series 1", "series 2"), col = c("black", "blue"), pch = 1, lty = 1)
(a) Create the mentioned line plot. Do you see trends and/or patterns and if so, can you explain these?
(b) Create a combined boxplot of the three time series. Are there outliers? If so, can you figure out when they occurred?
#load datasets, if not already loaded
library(datasets)
#look at the fdeaths dataset
fdeaths
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1974 901 689 827 677 522 406 441 393 387 582 578 666
## 1975 830 752 785 664 467 438 421 412 343 440 531 771
## 1976 767 1141 896 532 447 420 376 330 357 445 546 764
## 1977 862 660 663 643 502 392 411 348 387 385 411 638
## 1978 796 853 737 546 530 446 431 362 387 430 425 679
## 1979 821 785 727 612 478 429 405 379 393 411 487 574
##your code here
Calculate/carry out the following
create a vector with the numbers 2, 4, 7, 1, 5 and assign it to a variable called myNumbers
##your code here
given the vectors below, generate a logical vector that has TRUE values for each position where a is greater than b
a <- c(6, 1, 4, 1, 5, 1, 2)
b <- c(1, 3, 4, 2, 7, 1, 5)
given the vectors above, generate a logical vector that has TRUE values for each position where a is even or smaller than b
##your code here
give the actual values of both a and b from the previous question
##your code here