MATH1062/MATH1005 - Statistics Assignment 1
Instructions
There are two sections to this assignment, each with multiple parts, covering the material discussed in lectures, exercises and tutorials up to the end of Week 3.
You should only use R functions which have been used in lectures, exercises or tutorials. This includes base R code or the Tidyverse library (ggplot, dplyr). This should be sufficient to complete any task in this assignment. If you use functions which have not been used in lectures or tutorials, you need to provide written justification of why this was necessary, otherwise marks will be deducted.
Do NOT modify the header of this file. Do NOT delete or alter any text description from this file. Do NOT add code blocks. Only work in the space provided.
All answers are either typed into code blocks under the comment
# Your code here: ...
or typed after the prompt: “Your written answer:”. Start your written answer on the line after the prompt, like this:
Your written answer:
Start your answer here…..
Submission: Upon completion, you must render this worksheet (using Render in R Studio) into an html file and submit the html file. Your html file MUST contain all the R code you have written in the worksheet.
First load the tidyverse library:
library(tidyverse)
1. Numerical and Graphical summaries: Gapminder data 🌏
In this part of the assignment, you will use graphical and numerical summaries to perform initial data analysis on the Gapminder data set. You can see more details about this data from here and here. First install the gapminder library by clicking on the Install button in the Packages tab of the right-bottom window. Then type in gapminder in the pop-up and then click install. After installation, you can run the following code cell:
# Load and view the head of the gapminder data:
library(gapminder)
head(gapminder)
# A tibble: 6 × 6
  country     continent  year lifeExp      pop gdpPercap
  <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
1 Afghanistan Asia       1952    28.8  8425333      779.
2 Afghanistan Asia       1957    30.3  9240934      821.
3 Afghanistan Asia       1962    32.0 10267083      853.
4 Afghanistan Asia       1967    34.0 11537966      836.
5 Afghanistan Asia       1972    36.1 13079460      740.
6 Afghanistan Asia       1977    38.4 14880372      786.
1.1 Exploring Life Expectancy
Your specific tasks in this section are the following:
Filter the data set to only include data from the year 2007 and save this into the variable named gapminder_2007 (you can use either base R or tidyverse functions to do this).
Plot a histogram of the lifeExp variable for this filtered data set using 15 bins. Ensure your histogram is visually pleasing and includes labels for the x and y axes as well as a title.
Describe the shape of the distribution. If you were to report a measure of centre for global life expectancy, would you report the mean or median? Justify your answer.
# Your code here: filter the data and save into the variable name gapminder_2007
gapminder_2007 <- gapminder[gapminder$year == 2007, ]
# Your code here: plot a histogram of lifeExp for gapminder_2007
# Note: breaks = 15 is a suggestion only; hist() may choose a nearby "pretty" number of bins
hist(gapminder_2007$lifeExp,
     breaks = 15,
     xlim = range(gapminder_2007$lifeExp),
     main = "Global Life Expectancy (2007)",
     xlab = "Life Expectancy (years)",
     ylab = "Number of Countries",
     col = "lightblue",
     border = "white")
Your written answer:
The distribution is negatively (left) skewed: the highest frequency of life expectancy occurs at 70-75 years, with a longer tail extending toward lower values. As a measure of centre for global life expectancy, I would report the median because it is robust to skew, whereas the mean is pulled toward the tail and is more sensitive to extreme values.
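The robustness of the median can be checked numerically. The following is a minimal sketch using a small made-up vector (the values and variable names are illustrative only, not the Gapminder data): a few low values pull the mean well below the median, and removing them moves the mean far more than the median.

```r
# Illustrative values only (hypothetical, not taken from gapminder):
# most observations sit in the 70s, with a short tail of low values.
life_exp_toy <- c(42, 48, 55, 70, 72, 73, 74, 75, 76, 78, 80)

mean_toy   <- mean(life_exp_toy)
median_toy <- median(life_exp_toy)

# Drop the three lowest values to see how much each summary depends on the tail.
trimmed_toy  <- life_exp_toy[life_exp_toy >= 70]
mean_shift   <- mean(trimmed_toy) - mean_toy
median_shift <- median(trimmed_toy) - median_toy

c(mean = mean_toy, median = median_toy)
c(mean_shift = mean_shift, median_shift = median_shift)
```

In this toy sample the mean shifts by several years when the tail is removed, while the median shifts by much less.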
1.2 Comparing continents
Your specific tasks in this section are the following:
- Create a comparative boxplot of life expectancy by continent in 2007. Ensure your comparative boxplot is visually pleasing with labels for the x and y axes, including a title. Use your plot to answer the following questions. In each case provide a one sentence justification for your answer with regard to the features of the boxplots:
  - Which continent had the highest median life expectancy? Which had the lowest?
  - Which continent has the highest variability in life expectancy? Which had the lowest?
  - Which continent appears to have the most right-skewed life expectancy, which continent has the most left-skewed life expectancy?
# Your code here: A comparative boxplot for lifeExp for gapminder_2007
boxplot(lifeExp ~ continent,
        data = gapminder_2007,
        main = "Life Expectancy by Continent (2007)",
        xlab = "Continent",
        ylab = "Life Expectancy (years)",
        col = "lightblue")
Your written answer:
- Which continent had the highest median life expectancy? Which had the lowest?
Oceania had the highest median life expectancy, because its median line sits above those of all the other continents at around 80. Africa had the lowest, because its median line sits below those of all the other continents at around 52.
- Which continent has the highest variability in life expectancy? Which had the lowest?
Africa had the highest variability and Oceania the lowest: the distance between the ends of the whiskers (the minimum and maximum) is greatest for Africa and smallest for Oceania.
- Which continent appears to have the most right-skewed life expectancy, which continent has the most left-skewed life expectancy?
Europe appears the most right-skewed and Africa the most left-skewed, as seen from the position of the median line relative to the quartiles and to the minimum and maximum markers.
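Readings taken from a boxplot can be backed by group summaries. A minimal sketch, assuming a data frame with a numeric column and a grouping column: the toy data frame `toy` and its values are hypothetical, and with the assignment data `gapminder_2007$lifeExp` and `gapminder_2007$continent` would take the place of `toy$value` and `toy$group`.

```r
# Hypothetical toy data: group A is widely spread, group B is tightly clustered.
toy <- data.frame(
  group = rep(c("A", "B"), each = 5),
  value = c(40, 45, 50, 60, 75,
            78, 79, 80, 81, 82)
)

# Median = centre line of each box; IQR = height of each box;
# range = distance between the minimum and maximum.
group_median <- tapply(toy$value, toy$group, median)
group_iqr    <- tapply(toy$value, toy$group, IQR)
group_range  <- tapply(toy$value, toy$group, function(x) max(x) - min(x))

group_median
group_iqr
group_range
```

The IQR and the range usually rank groups the same way, but the range is more sensitive to a single extreme country, so reporting both gives a fuller picture of variability.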
1.3 How does life expectancy compare across countries in Americas?
We would like to compare life expectancy between Mexico and Haiti to determine which country’s life expectancy is further from the mean for the Americas.
Your specific tasks in this section are the following:
Using the 2007 data again,
- Filter the gapminder_2007 data to the continent of “Americas”, save this as americas_2007.
- Calculate the standard units for lifeExp for the countries of Mexico and Haiti, with respect to countries in the “Americas” continent in 2007. Which country’s life expectancy is further from the mean? In which direction?
# Your code here: filter data to 2007 and the Americas
americas_2007 <- gapminder[gapminder$year == 2007 & gapminder$continent == 'Americas', ]
# Your code here: Compute mean and standard deviation of lifeExp americas_2007
mean_americas <- mean(americas_2007$lifeExp)
sd_americas <- sd(americas_2007$lifeExp)
# Your code here: Filter the americas_2007 data for Mexico and Haiti
mexico_data <- americas_2007$lifeExp[americas_2007$country == 'Mexico']
haiti_data <- americas_2007$lifeExp[americas_2007$country == 'Haiti']
# Your code here: Compute standard units (z-scores) for Mexico and Haiti in 2007
z_mexico <- (mexico_data - mean_americas) / sd_americas
z_haiti <- (haiti_data - mean_americas) / sd_americas
# Your code here: Output results
z_mexico
[1] 0.5825063
z_haiti
[1] -2.857976
Your written answer:
Haiti is further from the mean. The absolute value of Haiti’s z-score is larger than Mexico’s (2.857976 > 0.5825063), so Haiti’s life expectancy lies further from the Americas mean, in the negative direction (below the mean), whereas Mexico’s lies slightly above it.
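The standard-unit calculation can be wrapped in a small helper so that the same formula is applied to every country. This is a minimal sketch: the function name `standard_units` and the toy vector are hypothetical, and it uses `sd()`, which divides by n - 1 (the sample SD), as the code above does. A convention based on the population SD would give slightly different values.

```r
# Standard units: how many SDs a value lies above (+) or below (-) the reference mean.
standard_units <- function(x, reference) {
  (x - mean(reference)) / sd(reference)
}

# Hypothetical reference values (not the Americas data).
toy_values <- c(60, 70, 72, 74, 76, 80)

z_low  <- standard_units(60, toy_values)  # below the mean, so negative
z_high <- standard_units(76, toy_values)  # above the mean, so positive

# The value with the larger absolute z-score is further from the mean.
abs(z_low) > abs(z_high)
```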
2. Chance simulation: Coaching strategy 🏀
In this part of the assignment, you will use the concepts we covered in probability and chance simulation to provide data-driven advice to a basketball coach. Here is the scenario:
You work as a data scientist for a professional basketball team, and the head coach has approached you for advice about strategy for an upcoming important game. Here is some important information to consider:
For three point shots (threes), your team can make these at a 32% rate.
For two point shots, your team can make these at a 49% rate.
Each shot is independent of the others.
Given the players in the team this season, your team can shoot 90 shots in total for the game.
Based on past performance, the opposing team is expected to score 98 points in the upcoming game.
The head coach asks you to evaluate two strategies:
Strategy 1: Aim to shoot 55 threes and 35 two point shots, or;
Strategy 2: Aim to shoot 80 two point shots and 10 threes.
Your specific tasks are the following:
Using R code similar to what’s been done in lectures and exercises, simulate 1000 games using strategy 1.
In addition to your code, write a short description of how your simulation works. Include details of:
the contents of the box/boxes, and;
whether the draws are with or without replacement (Hint: You may need to work out how to combine samples from two boxes in your simulation).
Then plot a histogram of the scores from the simulation. Ensure your histogram is visually pleasing, has clearly labelled x and y axes and a title.
Do the same for strategy 2, as above for strategy 1.
Based on the simulations, report the mean and (sample) standard deviation of scores for both strategies.
Give a recommendation for which strategy is better, and justify this. Explain why one strategy might be better than the other based on the centre and spread of the distribution of the simulations.
2.1 Simulation for Strategy 1.
# Set seed for reproducibility
set.seed(123)
# Your code here: Simulation of scores for strategy 1
scores_strategy1 <- replicate(1000, {
  threes <- sample(c(3, 0), size = 55, replace = TRUE, prob = c(0.32, 0.68))
  twos <- sample(c(2, 0), size = 35, replace = TRUE, prob = c(0.49, 0.51))
  sum(threes) + sum(twos)
})
Your written answer here:
- the contents of the box/boxes, and;
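There are two boxes, one for each type of shot. The three-point box contains the tickets {3, 0}, where 3 is drawn with probability 0.32 and 0 with probability 0.68; 55 draws are made from it. The two-point box contains the tickets {2, 0}, where 2 is drawn with probability 0.49 and 0 with probability 0.51; 35 draws are made from it. The draws from both boxes are summed to give the total score for one game.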
- whether the draws are with or without replacement (Hint: You may need to work out how to combine samples from two boxes in your simulation).
The draws are with replacement because each shot is independent of the others and has the same chance of success; replacing the ticket keeps the box unchanged from one draw to the next. To combine the two boxes, I summed the draws from each box and added the two sums together.
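The box description can be made concrete with explicit tickets. A minimal sketch, assuming a box of 100 tickets per shot type (the 100-ticket boxes, the seed and the variable names are illustrative choices, not part of the assignment); drawing with replacement from these boxes models the same chances as the `prob` argument used in the simulation.

```r
# Three-point box: 32 tickets marked 3 (made shot) and 68 tickets marked 0 (miss).
box_threes <- c(rep(3, 32), rep(0, 68))
# Two-point box: 49 tickets marked 2 (made shot) and 51 tickets marked 0 (miss).
box_twos <- c(rep(2, 49), rep(0, 51))

set.seed(1)  # arbitrary seed, only so this example is reproducible

# One simulated game under strategy 1: 55 draws from the threes box, 35 from the twos box.
one_game <- sum(sample(box_threes, size = 55, replace = TRUE)) +
  sum(sample(box_twos, size = 35, replace = TRUE))

# The box averages are the expected points per attempt of each type.
mean(box_threes)  # 0.96
mean(box_twos)    # 0.98
```

The two box averages are close, which is why the two strategies end up with similar mean scores and differ mainly in spread.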
# Your code here: Histogram plot of simulations for strategy 1
hist(scores_strategy1,
     breaks = 20,
     main = "Strategy 1",
     xlab = "Points Scored",
     ylab = "Frequency",
     col = "lightblue",
     border = "white")
Print the mean and standard deviation for the scores under strategy 1.
# Your code here: print the mean and standard deviation of simulated scores for strategy 1
mean(scores_strategy1)
[1] 86.603
sd(scores_strategy1)
[1] 11.61578
2.2 Simulation for Strategy 2.
# Set seed for reproducibility
set.seed(45)
# Your code here: Simulation of scores for strategy 2
scores_strategy2 <- replicate(1000, {
  threes <- sample(c(3, 0), size = 10, replace = TRUE, prob = c(0.32, 0.68))
  twos <- sample(c(2, 0), size = 80, replace = TRUE, prob = c(0.49, 0.51))
  sum(threes) + sum(twos)
})
Your written answer here:
- the contents of the box/boxes, and;
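There are two boxes, one for each type of shot. The three-point box contains the tickets {3, 0}, where 3 is drawn with probability 0.32 and 0 with probability 0.68; 10 draws are made from it. The two-point box contains the tickets {2, 0}, where 2 is drawn with probability 0.49 and 0 with probability 0.51; 80 draws are made from it. The draws from both boxes are summed to give the total score for one game.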
- whether the draws are with or without replacement (Hint: You may need to work out how to combine samples from two boxes in your simulation).
The draws are with replacement because each shot is independent of the others and has the same chance of success; replacing the ticket keeps the box unchanged from one draw to the next. To combine the two boxes, I summed the draws from each box and added the two sums together.
# Your code here: Histogram plot of simulations for strategy 2
hist(scores_strategy2,
     breaks = 20,
     main = "Strategy 2",
     xlab = "Points Scored",
     ylab = "Frequency",
     col = "lightgreen",
     border = "white")
Print the mean and standard deviation for the scores under strategy 2.
# Your code here: print the mean and standard deviation of simulated scores for strategy 2
mean(scores_strategy2)
[1] 88.07
sd(scores_strategy2)
[1] 10.20477
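As a cross-check on the simulated summaries, the expected value and standard deviation of the total score can be worked out directly from the two boxes. A minimal sketch, assuming independent shots as stated in the scenario (the helper name `theoretical_score` is hypothetical):

```r
# Expected value and SD of the total score for a given mix of shots.
# Each made three is worth 3 points, each made two is worth 2 points.
theoretical_score <- function(n_threes, n_twos, p_three = 0.32, p_two = 0.49) {
  expected <- n_threes * 3 * p_three + n_twos * 2 * p_two
  # Variances of independent shots add: points^2 * p * (1 - p) per shot.
  variance <- n_threes * 3^2 * p_three * (1 - p_three) +
    n_twos * 2^2 * p_two * (1 - p_two)
  c(mean = expected, sd = sqrt(variance))
}

strategy1 <- theoretical_score(n_threes = 55, n_twos = 35)
strategy2 <- theoretical_score(n_threes = 10, n_twos = 80)

round(strategy1, 2)
round(strategy2, 2)
```

The expected scores are 87.1 for strategy 1 and 88.0 for strategy 2, close to the simulated means of 86.603 and 88.07, which suggests the simulations behave as intended.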
2.3 Which strategy would you recommend to the coach?
# Your code here
# Strategy 1: centre, spread, and proportion of simulated games scoring more than 98
mean1 <- mean(scores_strategy1)
sd1 <- sd(scores_strategy1)
p_win1 <- mean(scores_strategy1 > 98)
# Strategy 2: the same three summaries
mean2 <- mean(scores_strategy2)
sd2 <- sd(scores_strategy2)
p_win2 <- mean(scores_strategy2 > 98)
mean1; sd1; p_win1
[1] 86.603
[1] 11.61578
[1] 0.154
mean2; sd2; p_win2
[1] 88.07
[1] 10.20477
[1] 0.147
Your written answer:
Strategy 1 is recommended because it gives a higher estimated chance of scoring more than 98 points, and therefore of winning (0.154 compared with 0.147 in the simulations). Both strategies average below 98 points (about 86.6 and 88.1), so a win requires an above-average game, and the wider spread of strategy 1 (standard deviation about 11.6 compared with 10.2) makes such high scores more likely. Strategy 2 is more consistent because of its smaller spread, but its scores cluster more tightly below 98, so its wins are rarer.
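Because each win proportion is estimated from 1000 simulated games, it carries some simulation error. A rough sketch of that error, using the standard error of a proportion, sqrt(p(1 - p)/n), and treating the two simulations as independent (the variable names are illustrative):

```r
n_games <- 1000

# Win proportions reported in the output above.
p_win <- c(strategy1 = 0.154, strategy2 = 0.147)

# Standard error of each estimated proportion.
se_win <- sqrt(p_win * (1 - p_win) / n_games)

# Difference between the strategies and its standard error.
diff_win <- p_win[["strategy1"]] - p_win[["strategy2"]]
se_diff  <- sqrt(sum(se_win^2))

round(se_win, 4)
c(difference = diff_win, se_difference = round(se_diff, 4))
```

The difference of 0.007 is smaller than its standard error of about 0.016, so running more simulated games would give a sharper comparison of the two win probabilities.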