Exercises: Data structures

Exercise 1: Sums and products
Exercise 2: Yoghurt cake
Problem 3: Mean and variance of a discrete variable
- Part 1
- Part 2
Exercise 4: mortality and life expectancy
- Warm up: short lived fish
- The interesting problem: Human demography
Exercise 5: Precip
Exercise 6: Two lifts and one Markov chain
Exercise 7: Language school
Exercise 8: Adding binomials and real estate
Exercise 9: 3 dice vs 2 dice
Exercise 10: Forbes list
Exercise 11: Rainy days
- Data import and very simple exploratory data analysis
- Markov chain
Exercise 12: Marks

Exercise 1: Sums and products

Compute:

\(\sum_{i=1}^{10^6} i^2\)

\(\prod_{i=1}^{10} \frac{e^i}{i^3}\)

Suggestions: sum, prod, exp
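
As a minimal sketch of how the suggested functions work on vectors, the example below uses much shorter ranges than the exercise asks for (the ranges 1 to 10 and 1 to 3 are arbitrary choices for illustration), so the results are easy to check by hand.

```r
# Arithmetic on a vector is applied element by element,
# and sum() / prod() then collapse the result to a single number.
i <- 1:10
sum_sq <- sum(i^2)             # 1 + 4 + 9 + ... + 100 = 385

j <- 1:3
prod_ratio <- prod(exp(j) / j^3)   # (e/1) * (e^2/8) * (e^3/27) = e^6 / 216

sum_sq
prod_ratio
```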

Exercise 2: Yoghurt cake

Yoghurt cake ("coca de iogurt") is an easy cake recipe. A useful feature of the traditional recipe is that neither a scale nor any other measuring device is needed, because most ingredients are measured using the yoghurt cup. In this exercise we will use vectors in R to convert the ingredients list to commercial units and compute its cost.

Ingredients list:

1 yoghurt
3 yoghurt cups of sugar
2 yoghurt cups of flour
3 eggs
1/2 yoghurt cup of oil
50 g butter

Ingredient prices (in commercial units):

1€ for a pack of 4 yoghurts
1.2€ for a 1 kg package of sugar
0.90€ for a 1 kg package of flour
1.5€ for a dozen eggs
4€ for a 1 litre bottle of oil
1.4€ for a 250 g bar of butter

The capacity of a yoghurt cup is 125 cm³.

The density of sugar is 0.88 g/cm³ and the density of flour is 0.59 g/cm³.

How much of each ingredient are we going to need to make 4 yoghurt cakes?

Using the units in the recipe.
Using commercial units

Make a shopping list to buy the ingredients, taking into account that supermarkets only sell whole packages, packs, bottles or bars. Suggestion: help(ceiling).
How much is each ingredient going to cost?
How much are all the ingredients going to cost?
How much of each ingredient will be left after preparing the cakes?
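
The unit conversion can be sketched with named vectors. The example below only converts the sugar and flour of a single cake from cups to grams, using the cup capacity and densities given above; the variable names and the package counts passed to ceiling() are illustrative assumptions, not part of the exercise.

```r
cup_cm3 <- 125                              # yoghurt cup capacity (from the statement)
density <- c(sugar = 0.88, flour = 0.59)    # g/cm^3 (from the statement)
cups    <- c(sugar = 3, flour = 2)          # cups per cake (from the recipe)

# Vectors are multiplied element by element and the names are kept
grams <- cups * cup_cm3 * density
grams                                       # sugar 330.0, flour 147.5

# ceiling() rounds up, which is what buying whole packages requires.
# The values 2.3 and 0.4 are made-up package counts.
packages <- ceiling(c(2.3, 0.4))
packages                                    # 3 1
```

Keeping every ingredient in the same position (or under the same name) in each vector lets one expression handle the whole list at once.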

Problem 3: Mean and variance of a discrete variable

The two parts of this problem are slightly different versions of the same problem.

Use R vectors to compute the solution to both parts.

Reminder:

Given a discrete random variable \(X\):

Its probability function is \(f(x_i)=P(X=x_i)\)
Its distribution function is \(F(x_i)=P(X \le x_i)\)
Its expected value is \(E(X) = \sum_i P(X=x_i) \cdot x_i\)
Its variance is \(Var(X) = E(X^2)-E(X)^2\)
Its standard deviation is \(sd(X) = \sqrt{Var(X)}\)
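
These definitions translate almost literally into vector operations. The sketch below applies them to a single roll of a fair die, which is simpler than either part of the problem, using \(Var(X) = E(X^2) - E(X)^2\); the variable names are arbitrary.

```r
x <- 1:6              # possible values of one fair die
p <- rep(1/6, 6)      # probability function f(x_i)

Fx   <- cumsum(p)                 # distribution function F(x_i)
EX   <- sum(p * x)                # expected value: 3.5
VarX <- sum(p * x^2) - EX^2       # variance: 35/12
sdX  <- sqrt(VarX)                # standard deviation

stopifnot(isTRUE(all.equal(sum(p), 1)))   # a probability function must add up to 1
```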

Part 1

In each round of a game, each player rolls a die. If the die shows a six, the player rolls the die again and the points of the two rolls are added. Otherwise, the player gets just the points of his single roll.

Consider the amount of points a player gets in one round as a random variable.

Find its probability function and its distribution function.
Find its expected value.
Find its variance.

Part 2

A common rule in several board games is that when a player gets a 6 when rolling a die, he can roll it again and add the points. The same rule applies to that second roll, but if the player gets three sixes in a row, he loses all points and his turn.

Compute expected value and variance of the total amount of points a player gets in a turn.

Exercise 4: mortality and life expectancy

This exercise about life expectancy starts with a warm-up that uses made-up mortality data for a short-lived species of fish and could even be solved by hand. It is meant to ease understanding of the more interesting problem, which uses demographic data from a human population.

Warm up: short lived fish

In a fish species in its natural environment, each individual has a 90% probability of dying during its first year. Fish that survive to one year old have a 50% chance of not reaching two years old, and those that reach two years old have a 20% probability of not surviving to three years old. All individuals that reach the age of three die before reaching the age of four.

a) Consider the lifespan of a newborn individual as a random variable. The expected value of this variable is known as life expectancy at birth. Compute it.

Suggestion: For this problem you can assume that all individuals dying between 0 and 1 years old die at 0.5 years old, that all those dying between 1 and 2 years old die at 1.5 years old, and so on.

b) Find the distribution function of that random variable.

c) Use this distribution function to find the probability that a fish lives for between 1 and 3 years.

d) If a fish has reached the age of one, what is its probability of reaching the age of three?

The interesting problem: Human demography

e) Compute life expectancy as in the previous exercise using data of mortality for the most recent year available in https://www.idescat.cat/pub/?id=aec&n=292.

Suggestion: You might prefer to use mortality data from another source of your interest.

Suggestion: If you are doing this exercise before learning about importing data from Excel and managing data frames, you might prefer to use the following data from year 2017.

For men:

c(`De 0 anys` = 2.36, `D'1 a 4 anys` = 0.14, `De 5 a 9 anys` = 0.08, 
`De 10 a 14 anys` = 0.07, `De 15 a 19 anys` = 0.25, `De 20 a 24 anys` = 0.32, 
`De 25 a 29 anys` = 0.37, `De 30 a 34 anys` = 0.5, `De 35 a 39 anys` = 0.67, 
`De 40 a 44 anys` = 0.98, `De 45 a 49 anys` = 1.73, `De 50 a 54 anys` = 3.26, 
`De 55 a 59 anys` = 5.88, `De 60 a 64 anys` = 9.32, `De 65 a 69 anys` = 14.15, 
`De 70 a 74 anys` = 20.9, `De 75 a 79 anys` = 35.65, `De 80 a 84 anys` = 65.36, 
`De 85 a 89 anys` = 118.53, `De 90 a 94 anys` = 207.54, `De 95 anys i més` = 350.95
)

For women:

c(`De 0 anys` = 2.23, `D'1 a 4 anys` = 0.08, `De 5 a 9 anys` = 0.08, 
`De 10 a 14 anys` = 0.06, `De 15 a 19 anys` = 0.12, `De 20 a 24 anys` = 0.16, 
`De 25 a 29 anys` = 0.15, `De 30 a 34 anys` = 0.23, `De 35 a 39 anys` = 0.37, 
`De 40 a 44 anys` = 0.54, `De 45 a 49 anys` = 1.07, `De 50 a 54 anys` = 1.88, 
`De 55 a 59 anys` = 2.76, `De 60 a 64 anys` = 3.77, `De 65 a 69 anys` = 5.78, 
`De 70 a 74 anys` = 9.04, `De 75 a 79 anys` = 17.59, `De 80 a 84 anys` = 38.17, 
`De 85 a 89 anys` = 81.49, `De 90 a 94 anys` = 166.3, `De 95 anys i més` = 294.16
)

For this problem we can assume that:

Within each age interval (in the data) mortality is the same every year.
People die at the midpoint between two birthdays (i.e., at 0.5 years, at 1.5 years, and so on).
In order to avoid summing infinite series, you can assume that at the age of 110 the mortality rate is 100%.

Note that IDESCAT mortality data are expressed in deaths per 1000 inhabitants, as usual in demography. This is 1000 times the probability of dying at that age.
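
The assumptions above can be sketched on a toy example. The rates, interval widths and variable names below are invented for illustration and are not real demographic data; the last rate is set to 1000 per 1000 so that the probabilities add up to 1.

```r
# Hypothetical mortality rates per 1000 for three age intervals (NOT real data)
rate_per_1000     <- c(first = 100, middle = 500, last = 1000)
years_in_interval <- c(1, 2, 1)    # how many one-year ages each interval covers

# Same mortality for every year inside an interval, as a probability
q <- rep(unname(rate_per_1000) / 1000, times = years_in_interval)

# Probability of being alive at the start of each year of age
alive_at_start <- c(1, cumprod(1 - q))[seq_along(q)]

# Probability of dying during each year of age, with death placed at mid-year
p_die        <- alive_at_start * q
age_at_death <- seq_along(q) - 0.5

total_prob      <- sum(p_die)                  # 1
life_expectancy <- sum(p_die * age_at_death)   # 2.075 for these made-up rates
```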

f) A male professor wrote the first version of this problem at 48 years old. What’s the probability of him reaching retirement age (67 years old) alive?

Exercise 5: Precip

The package datasets, which is loaded by default in most R installations, includes several vectors and data frames to be used in examples. The named vector precip contains the average amount of precipitation (rainfall), in inches, for each of 70 United States (and Puerto Rico) cities. You can find more information about it and its source by typing help(precip).

Get the names and precipitation of the cities with precipitation below 800 mm. Sort the results by precipitation in decreasing order. Indication: 1 in = 25.4 mm
Get the names and precipitation of the cities with precipitation over 1000 mm. Sort the results in alphabetical order.
Get the names of the cities with yearly precipitation greater than 800 mm and a name longer than 10 characters. Indication: help(nchar)
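
Filtering and sorting a named vector can be sketched as follows. The place names and values are invented (they are not taken from precip), and the thresholds are arbitrary.

```r
# Hypothetical yearly precipitation in inches (made-up places and values)
rain_in <- c(Springfield = 40.2, Ely = 9.5, "New Harbor" = 31.0, Aston = 12.1)
rain_mm <- rain_in * 25.4                      # 1 in = 25.4 mm

# A logical condition inside [ ] keeps only the matching elements, names included
dry <- sort(rain_mm[rain_mm < 800], decreasing = TRUE)

# order() applied to the names sorts alphabetically
by_name <- rain_mm[order(names(rain_mm))]

# nchar() counts the characters of each name
long_names <- names(rain_mm)[nchar(names(rain_mm)) > 5]

dry
by_name
long_names
```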

Exercise 6: Two lifts and one Markov chain

In a train station, there are two lifts linking the platform (down) with the hall (up). When a passenger arrives at the platform stop or the hall stop, he can find an available lift, or none available if both of them are in the other stop. We are interested in the probability of finding at least one lift available when arriving at a stop.

We can use the following simplifying assumptions:

Passengers arrive one by one (or equivalently in little groups), after the previous passenger has left the lift.
Passengers arrive at both stops at random, with probability 50% each.
The stop each passenger arrives at is independent of the stops previous passengers have arrived at.
If there is at least one lift at a stop and a passenger arrives at that stop, he takes one lift and leaves it at the other stop.
If there isn’t a lift at a stop and a passenger arrives at that stop, he calls one lift from the other stop, takes it and leaves it again at the other stop.
Lifts don’t move except when passengers use them.

a) Find the transition matrix for the states:

Both lifts at the hall stop
One lift at each stop
Both lifts at the platform stop

b) Now one lift is at each stop. Find the probability of each state after 3 passengers have used the lifts.

c) Find the probability of each state in the long term.

d) What is the probability of finding at least one lift available when arriving at a stop?

e) In this problem we used three states because we didn’t distinguish between the two lifts. Solve it again distinguishing between lifts - and therefore using four states - and check that the results are equivalent.
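
The matrix operations involved can be sketched on a generic two-state chain. The states A and B and their transition probabilities are invented and unrelated to the lifts; finding the long-term distribution from an eigenvector is one possible approach, not necessarily the one intended by the exercise.

```r
# Hypothetical transition matrix: rows = current state, columns = next state
P <- matrix(c(0.9, 0.1,
              0.5, 0.5),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("A", "B"), c("A", "B")))

start  <- c(A = 1, B = 0)              # the chain starts in state A
after3 <- start %*% P %*% P %*% P      # state probabilities after 3 steps: 0.844, 0.156

# The long-term distribution is the left eigenvector of P for eigenvalue 1,
# i.e. an eigenvector of t(P), rescaled so that it adds up to 1
e <- eigen(t(P))
v <- Re(e$vectors[, which.min(abs(e$values - 1))])
stationary <- v / sum(v)               # 5/6, 1/6
```

A quick check is that multiplying the long-term distribution by P returns the same distribution.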

Exercise 7: Language school

In a language school, each week each student is in one of the following states:

Attending class
Not attending (but still enrolled)
Dropped (not enrolled)

Of the students attending one week, 15% don’t attend the next week, 1% drop the course and the remaining ones keep attending.

Of the students not attending one week, 30% attend the next week, 20% drop the course and the remaining ones keep not attending.

Students that drop the course don’t enrol again.

a) A course is 13 weeks long. The first week 100% of students attend the class. Find the probability of students attending, not attending and having dropped in the last week.

b) The course started with 23 students. Find the expected value of the number of students attending, not attending and having dropped in the last week.

c) Find the long-term probability of each state assuming that the course is infinitely long.

Exercise 8: Adding binomials and real estate

A real estate agent deals with properties for rent and properties for sale. An appointment with customers interested in a lease results in a deal with probability 12%, and an appointment with customers interested in buying results in a deal with probability 2%. Each closed lease earns the agent a commission of 600€, and each closed sale earns him a commission of 6000€.

Last month the agent had 120 appointments with customers interested in renting and 20 appointments with customers interested in buying a property. The outcomes of the different appointments are independent.

What’s the probability that the agent closed more than 16 deals?
What’s the probability that the agent earned more than 12000€ in commissions?

Indication: The problem could be solved quite accurately by simulation, and some parts of it could be roughly approximated by a normal distribution (although that is quite debatable). Compute the exact value instead; you may use the other methods to check your result.
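
One way to get exact probabilities for the sum of two independent binomial variables is to build their joint probability table with outer(). The sketch below uses two small hypothetical binomials with the same success probability, chosen only so that the result can be checked against pbinom(); in the exercise the two probabilities differ, so that shortcut is not available.

```r
# Hypothetical variables: X ~ Binomial(3, 0.5) and Y ~ Binomial(2, 0.5), independent
x <- 0:3
y <- 0:2

joint <- outer(dbinom(x, 3, 0.5), dbinom(y, 2, 0.5))   # P(X = x, Y = y)
total <- outer(x, y, "+")                              # value of X + Y in each cell

# Add the probabilities of the cells that satisfy the condition
p_more_than_3 <- sum(joint[total > 3])

# Check: with equal p, X + Y ~ Binomial(5, 0.5)
check <- pbinom(3, 5, 0.5, lower.tail = FALSE)
```

The same table works for other functions of the two variables: replace "+" with a function that computes the quantity of interest in each cell.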

Exercise 9: 3 dice vs 2 dice

In a dice game, the first player rolls a die thrice but the second player rolls the die twice. Each player gets the sum of points of his rolls. Compute the probability of each player winning and the probability of a tie.

Indication: Although the problem could be solved quite accurately using a simulation, compute the exact value.
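
Exact probabilities for dice can be obtained by enumerating all equally likely outcomes. The sketch below does this for a simpler, hypothetical game (two dice against one die) using expand.grid(); the column names are arbitrary.

```r
# Every combination of two dice for player A and one die for player B
rolls <- expand.grid(a1 = 1:6, a2 = 1:6, b = 1:6)   # 216 equally likely rows

a_points <- rolls$a1 + rolls$a2
b_points <- rolls$b

# The mean of a logical vector is the proportion of TRUE values
result <- c(a_wins = mean(a_points > b_points),
            tie    = mean(a_points == b_points),
            b_wins = mean(a_points < b_points))
result    # 181/216, 15/216, 20/216
```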

Exercise 10: Forbes list

In this exercise, we will use the dataset Forbes2000 from package HSAUR. You will need to load the library to access the data frame, and the first time you may also need to install it before loading.

Look at the help for Forbes2000 to see what the data is about and where it comes from.
How many UK banking companies in the list earned (positive) profits?
Which company among the 10 US companies with most assets has the lowest market value?
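
The kind of data frame subsetting needed here can be sketched on a small made-up table. The data frame firms, its columns and its values are hypothetical; they do not come from Forbes2000, whose actual column names should be checked in its help page.

```r
# Made-up data (NOT from Forbes2000)
firms <- data.frame(
  name        = c("Alfa", "Beta", "Gamma", "Delta", "Epsilon"),
  country     = c("Spain", "Spain", "France", "Spain", "France"),
  profits     = c(1.2, -0.3, 0.8, 2.5, NA),
  assets      = c(50, 80, 20, 65, 10),
  marketvalue = c(30, 12, 9, 25, 4)
)

# Count rows that meet two conditions; na.rm = TRUE guards against missing values
n_profitable <- sum(firms$country == "Spain" & firms$profits > 0, na.rm = TRUE)

# Keep one country, sort by assets in decreasing order, keep the top 2 rows
spain <- firms[firms$country == "Spain", ]
top2  <- head(spain[order(spain$assets, decreasing = TRUE), ], 2)

# Among those, find the firm with the lowest market value
lowest <- top2$name[which.min(top2$marketvalue)]
```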

Exercise 11: Rainy days

The file precipitació i temperatura.xlsx contains daily data from several meteorological stations in or around the Llobregat River basin for the years 2007 to 2017. We are interested in frequency of rainy days (precipitation > 0).

Data import and very simple exploratory data analysis

Import the dataset to R.
Select precipitation data from Sant Cugat del Vallès.
Do you see any problem with the data? Suggestion: summary or quantile
Check which other stations have the same problem. Suggestion: summary, quantile, table, is.na…

Markov chain

Select precipitation data from another station without that problem.

Estimate the probability of rain on a day chosen at random.
Estimate the conditional probability of rain on a given day depending on whether it rained the previous day.
Consider rain/no rain as the two states of a discrete-time Markov chain with one-day steps. Use the results from the previous step to build a transition matrix.
Look out of the window to check whether it is raining today and assume that the weather today is the same at the selected station. Compute the probability that it rains next Sunday.
Compute the probability that it rains at least one day next weekend.
Compute the probability that it rains both days next weekend.
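
Estimating the conditional probabilities amounts to counting pairs of consecutive days. The sketch below uses a short invented sequence of rainy (TRUE) and dry (FALSE) days in place of the station data; the variable names are arbitrary.

```r
# Hypothetical sequence of 10 consecutive days: TRUE = rain (made-up data)
rain <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
n <- length(rain)

p_rain <- mean(rain)     # overall proportion of rainy days: 0.5

# Pair each day (except the last) with the following day and count the pairs
counts <- table(previous = rain[-n], current = rain[-1])

# Divide each row by its total: rows = state yesterday, columns = state today
P <- prop.table(counts, margin = 1)
P["TRUE", "TRUE"]        # estimated P(rain today | rain yesterday)    = 0.5
P["FALSE", "TRUE"]       # estimated P(rain today | no rain yesterday) = 0.4
```

With real data, missing values would have to be handled before counting the pairs.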

Exercise 12: Marks

The following commands, which you can copy and paste into R, contain the marks of a Moodle quiz for a whole class and the exam marks for a seminar group.

quiz <- data.frame(name = c("Maria", "Marc", "Anna", "Octavia", "Adrià", 
                            "Marc", "Valentina", "Carme", "Gal·la", "Margarida", "Àlex", 
                            "Aureli", "Gerard", "Terenci", "Marc", "Pompeu", "Virgili", "Adrià", 
                            "Mariana", "Pompeu", "Eva", "Anna", "Antònia", "Carme", "Eulàlia", 
                            "Àlex", "Mireia", "Dolors", "Mercè", "Marc", "Aureli", "Montserrat", 
                            "Valentina", "Margarida", "Núria", "Antoni", "Maximiana"), 
                   surname1 = c("Aladern", 
                                "Alfals", "Àliga", "Arboç", "Arboç", "Avet", "Avet", "Corb", 
                                "Esparver", "Esparver", "Falcó", "Gafarró", "Gralla", "Graula", 
                                "Graula", "Graula", "Heura", "Lledoner", "Lledoner", "Llistó", 
                                "Llorer", "Marfull", "Mostela", "Mussol", "Mussol", "Mussol", 
                                "Mussol", "Mussol", "Pardal", "Pardal", "Pensament", "Pollancre", 
                                "Pollancre", "Ripoll", "Talpó", "Talpó", "Vidalba"), 
                   surname2 = c("Pollancre", 
                                "Mussol", "Gafarró", "Gínjol", "Talpó", "Graula", "Ripoll", "Caragol", 
                                "Aladern", "Llistó", "Heura", "Heura", "Roure", "Aranyó", "Caragol", 
                                "Garsa", "Talpó", "Caragol", "Pollancre", "Talpó", "Pardal", 
                                "Vidiella", "Espàrrec", "Albellatge", "Alfals", "Avet", "Avet", 
                                "Vidiella", "Aladern", "Rossinyol", "Mostela", "Fenàs", "Roure", 
                                "Perera", "Avet", "Marfull", "Palmera"), 
                   mark = c(8.7, 6.5, 5.6, 
                            6.8, 8.6, 9.4, 9.7, 5.4, 8.4, 9.4, 8.2, 7.8, 8.8, 5, 7, 4.5, 
                            4.7, 4.6, 7.8, 7.7, 7.8, 5.7, 9.1, 7.8, 8.7, 5.7, 4, 7.5, 8.4, 
                            7.6, 6.6, 9.1, 3.8, 8.3, 4.7, 5.3, 1.2))


exam <- data.frame(name = c("Maria", "Valentina", "Mariana", "Margarida", 
                            "Àlex", "Aureli", "Gerard", "Terenci", "Pompeu", "Anna", "Antònia", 
                            "Eulàlia", "Mireia", "Mercè", "Núria"), 
                   surname1 = c("Aladern", 
                                "Avet", "Esmerla", "Esparver", "Falcó", "Gafarró", "Gralla", 
                                "Graula", "Llistó", "Marfull", "Mostela", "Mussol", "Mussol", 
                                "Pardal", "Talpó"), 
                   surname2 = c("Pollancre", "Ripoll", "Esmerla", 
                                "Llistó", "Heura", "Heura", "Roure", "Aranyó", "Talpó", "Vidiella", 
                                "Espàrrec", "Alfals", "Avet", "Aladern", "Avet"), 
                   mark = c(9.2, 
                            7.4, 7.6, 0.2, 6.5, 7.5, 7.1, 4.6, 7.9, 5.4, 0.8, 8.7, 4.5, 7.9, 
                            4.3))

a) The lecturer in charge of the seminar needs to do a weighted average of both marks (40% quiz, 60% exam) and check whether all students in the exams list have also quiz marks. Suggestion: merge the data frames.
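
A minimal sketch of the suggested merge, on two tiny made-up data frames with a single key column (the names, marks and suffixes are assumptions for illustration). With the real data, `by` can be a vector of several column names when one column alone does not identify a student.

```r
# Made-up marks (NOT the class data)
quiz_toy <- data.frame(name = c("Ada", "Bo", "Cy"), mark = c(8, 6, 9))
exam_toy <- data.frame(name = c("Ada", "Cy", "Di"), mark = c(7, 5, 10))

# all.x = TRUE keeps every row of the first data frame, filling gaps with NA;
# suffixes tell the two "mark" columns apart
both <- merge(exam_toy, quiz_toy, by = "name", all.x = TRUE,
              suffixes = c("_exam", "_quiz"))

both$final <- 0.4 * both$mark_quiz + 0.6 * both$mark_exam

# Students in the exam list without a quiz mark
missing_quiz <- both$name[is.na(both$mark_quiz)]
both
```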

b) The lecturer wants to study marks for each gender. Since his data on students doesn’t include a gender variable, he plans to use given names to tell males apart from females. Using the following name lists, add a gender variable to the marks dataset.

male <- c("Jordi", "Josep", "Màrius", "Marc", "Pau", "Pere", "Àlex", 
"Gerard", "Ernest", "Daniel", "Sergi", "Eliseu", "Pompeu", "August", 
"Terenci", "Virgili", "Aureli", "Maximià", "Vespasià", "Antoni", 
"Adrià")
female <- c("Maria", "Mariana", "Mireia", "Carme", "Lívia", "Octavia", 
"Eulàlia", "Eva", "Ariadna", "Hipàtia", "Safo", "Mercè", "Gisela", 
"Marta", "Griselda", "Aurèli", "Maximiana", "Constància", "Gal·la", 
"Antònia", "Dolors", "Margarida", "Núria", "Montserrat", "Valentina", 
"Anna")
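
One possible way to build such a variable is the %in% operator combined with ifelse(). The sketch below uses a tiny made-up data frame and shortened name lists (all hypothetical) so that it also shows what happens with a name found in neither list.

```r
# Made-up data and shortened lists, for illustration only
people      <- data.frame(name = c("Pau", "Marta", "Kim"))
male_toy    <- c("Pau", "Pere")
female_toy  <- c("Marta", "Eva")

# %in% returns TRUE for each name found in the list;
# names found in neither list get NA
people$gender <- ifelse(people$name %in% male_toy, "male",
                        ifelse(people$name %in% female_toy, "female", NA_character_))
people
```

Counting the NA values afterwards, for instance with table(people$gender, useNA = "ifany"), shows whether any name was left unclassified.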