Load the bike data from the file bike.csv. This file contains information on the number of bikes rented from a ride share service each day, along with some predictors to describe temporal and weather effects on demand. Additional information is given about this data set at this link.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(readr)
Task 1: Plot bike rentals against temperature. Is there evidence of a non-linear effect here?
bike <- read_csv("./bike.csv")
bike$weathersit <- as.factor(bike$weathersit)

bike %>%
  rename(weather = weathersit) %>%
  ggplot(mapping = aes(x = temp, y = cnt)) +
  geom_point() +
  geom_smooth() +
  xlab("Temperature") +
  ylab("Bike rental count") +
  theme_minimal()
Using a moving average, it seems that a linear trend may be acceptable up to around 20 degrees, beyond which any additional heat begins to hamper bike rentals.
Task 2: Test this formally by fitting a generalised additive model and performing an appropriate statistical test.
We wish to compare a linear model for count on temperature against the same model but with an additional smooth term in temperature.
Since these models are nested, this is equivalent to testing whether the smooth term is significantly different from zero. Fitting the larger model we see that this is indeed the case.
library("mgcv")
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-39. For overview type 'help("mgcv-package")'.
bike_gam <- gam(formula = cnt ~ temp + s(temp), data = bike)
summary(bike_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## cnt ~ temp + s(temp)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1622.20 422.47 3.840 0.000134 ***
## temp 188.58 27.75 6.795 2.26e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(temp) 7.01 8.046 16.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 10/11
## R-sq.(adj) = 0.466 Deviance explained = 47.1%
## GCV = 2.0264e+06 Scale est. = 2.0036e+06 n = 731
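Since the linear model is nested within this GAM, the same comparison can also be made explicit with an approximate F-test. A minimal sketch, assuming the objects above (bike_lm is an illustrative name, and anova.gam gives only an approximate test, so the p-value is indicative rather than exact):

# Fit the purely linear model and compare the two nested fits directly
bike_lm <- mgcv::gam(cnt ~ temp, data = bike)
anova(bike_lm, bike_gam, test = "F")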
Task 3: Construct a second graphic showing the effect of temperature on bike rentals for each level of the weathersit factor.
bike %>%
  rename(weather = weathersit) %>%
  ggplot(mapping = aes(x = temp, y = cnt, group = weather)) +
  geom_point(mapping = aes(col = weather)) +
  geom_smooth(mapping = aes(col = weather), se = TRUE) +
  xlab("Temperature") +
  ylab("Bike rental count") +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Task 4: Consider the temperature range 10-18 degrees.
bike %>%
  rename(weather = weathersit) %>%
  ggplot(mapping = aes(x = temp, y = cnt, group = weather)) +
  geom_point(mapping = aes(col = weather)) +
  geom_smooth(mapping = aes(col = weather), se = TRUE) +
  xlab("Temperature") +
  ylab("Bike rental count") +
  xlim(c(10, 18)) +
  theme_minimal()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Warning: Removed 543 rows containing non-finite values (stat_smooth).
## Warning: Removed 543 rows containing missing values (geom_point).
When accounting for uncertainty in the rolling mean estimate, the data do not contradict the conditional mean rental counts for each weather type being linear functions of temperature. (We can draw parallel lines through the pointwise confidence regions for each type of weather.)
This linearity suggests that temperature may have a linear effect on bike rentals, conditional on the weather type.
The data are also consistent with these linear relationships having a common slope but distinct intercepts. That suggests that weather has an additive effect: it changes the predicted number of bikes rented at a given temperature, but the effect of temperature is not altered by the weather type.
Finally, it does not make sense for a factor to have a linear effect, since the levels of the factor are only implicitly ordered and the “distance” between levels is not quantifiable. The difference between what qualifies as good and misty weather cannot be compared to the difference between what qualifies as misty and “bad” weather.
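The “common slope, distinct intercepts” reading can be checked roughly with a nested linear-model comparison on this temperature range. This is a sketch only; bike_mid, fit_add and fit_int are illustrative names:

# Restrict to the 10-18 degree range and compare an additive model
# against one with a temperature-by-weather interaction
bike_mid <- bike %>% filter(temp >= 10, temp <= 18)
fit_add <- lm(cnt ~ temp + weathersit, data = bike_mid)
fit_int <- lm(cnt ~ temp * weathersit, data = bike_mid)
anova(fit_add, fit_int)  # a large p-value is consistent with a common slope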
Task 5: Using the GAM fitted to the entire temperature range, assess whether there is an interaction between the weather type and the smooth effect of temperature on bike rentals.
Fit the relevant GAM:
library(mgcViz)
## Loading required package: qgam
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
## Registered S3 method overwritten by 'mgcViz':
## method from
## +.gg GGally
##
## Attaching package: 'mgcViz'
## The following objects are masked from 'package:stats':
##
## qqline, qqnorm, qqplot
nlm1 <- mgcv::gam(formula = cnt ~ temp*weathersit + s(temp, by = weathersit), data = bike)
summary(nlm1)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## cnt ~ temp * weathersit + s(temp, by = weathersit)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1157.51 276.88 4.181 3.27e-05 ***
## temp 241.41 18.38 13.134 < 2e-16 ***
## weathersitMISTY 249.42 170.47 1.463 0.144
## weathersitRAIN/SNOW/STORM -173.80 306.07 -0.568 0.570
## temp:weathersitMISTY -61.00 13.01 -4.689 3.29e-06 ***
## temp:weathersitRAIN/SNOW/STORM -159.49 33.74 -4.727 2.74e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(temp):weathersitGOOD 4.1421 5.186 36.110 <2e-16 ***
## s(temp):weathersitMISTY 2.1018 2.744 4.958 0.0136 *
## s(temp):weathersitRAIN/SNOW/STORM 0.9781 1.216 1.244 0.3866
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 30/33
## R-sq.(adj) = 0.548 Deviance explained = 55.4%
## GCV = 1.7216e+06 Scale est. = 1.6955e+06 n = 731
nlm1_viz <- mgcViz::getViz(nlm1)
Plot the smooth terms:
mgcViz::plotDiff(s1 = mgcViz::sm(nlm1_viz, 1),
                 s2 = mgcViz::sm(nlm1_viz, 2)) +
  l_ciPoly() +
  l_fitLine() +
  ylab("Effect Difference") +
  xlab("Temperature") +
  ggtitle("Difference in smooth effects: GOOD - MISTY") +
  geom_hline(yintercept = 0, linetype = 2)

mgcViz::plotDiff(s1 = mgcViz::sm(nlm1_viz, 1),
                 s2 = mgcViz::sm(nlm1_viz, 3)) +
  l_ciPoly() +
  l_fitLine() +
  ylab("Effect Difference") +
  xlab("Temperature") +
  ggtitle("Difference in smooth effects: GOOD - BAD") +
  geom_hline(yintercept = 0, linetype = 2)

mgcViz::plotDiff(s1 = mgcViz::sm(nlm1_viz, 2),
                 s2 = mgcViz::sm(nlm1_viz, 3)) +
  l_ciPoly() +
  l_fitLine() +
  ylab("Effect Difference") +
  xlab("Temperature") +
  ggtitle("Difference in smooth effects: MISTY - BAD") +
  geom_hline(yintercept = 0, linetype = 2)
The pointwise confidence intervals for the difference in smooth effects do not envelop the line \(y=0\), and so weather does have an impact on the predicted number of bike rentals.
The intervals also do not envelop any horizontal line, and so the data are not consistent with a purely additive effect of weather. There is sufficient evidence to suggest an interaction between temperature and weather type.
In all weather conditions, moderate temperatures are associated with higher levels of cycling, but this effect is more dramatic during fair weather.
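A more formal complementary check is to compare the interaction model against a model with a common smooth of temperature and an additive weather effect. This is a sketch only (nlm0 is an illustrative name, and because the fits involve smoothing parameter selection the comparison is only approximate):

# Model with an additive weather effect and a single common smooth of temperature
nlm0 <- mgcv::gam(cnt ~ weathersit + s(temp), data = bike)
anova(nlm0, nlm1, test = "F")  # approximate test for the interaction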
In the run up to a mayoral election, a polling company is attempting to measure the effectiveness of an advertisement campaign for the candidate of a progressive political party.
Suppose that the population of the city may be partitioned into three groups of people: those with progressive views, those with neutral views, and those with conservative views. These groups occur with true proportions \(\pi_1 = 0.3\), \(\pi_2 = 0.5\), and \(\pi_3 = 0.2\), that are unknown to the survey company.
The company sends out 10,000 surveys to measure the approval of the progressive party candidate. The surveys ask whether the respondent recalls seeing the campaign and whether they self-identify as having progressive, neutral or conservative values.
The survey also asks a series of further questions. The responses to these questions are processed into a real-valued approval rating for the candidate, where positive values indicate support for the candidate and negative values indicate opposition.
The true probabilities of seeing the campaign and responding to the survey vary between subgroups; so do the distributions of approval scores before and after seeing the campaign. These probabilities and distributions are summarised in the table below.
| Group | Pr(Respond) | Pr(See campaign) | Pre-exposure response distribution | Post-exposure response distribution |
|---|---|---|---|---|
| Progressive | 0.3 | 0.8 | N(\(\mu\) = 50, \(\sigma\) = 15) | N(55, 10) |
| Neutral | 0.1 | 0.2 | N(0, 25) | N(10, 30) |
| Conservative | 0.5 | 0.2 | N(-50, 20) | N(-45, 20) |
Task 6: Identify analytically the expected approval rating at the population level.
We can consider the population to be formed of 6 distinct subgroups, depending on whether the person belongs to the progressive, neutral or conservative ideology and whether or not they have seen the campaign. Within each subgroup the response is normally distributed.
The population distribution of approval ratings is then a mixture of the subgroup distributions, where the mixture weights represent the prevalence of each subgroup within the population. The mean of this mixture is the corresponding weighted sum of the subgroup means.
Let \(Y\) be the approval score, \(R\) be an indicator of responding to the survey, \(G\) be a group indicator and \(T\) be an indicator of exposure to the campaign (treatment). We can express the above argument mathematically as:
\[\begin{align*} \mathbb{E}[Y] &= \sum_{g\in\{1,2,3\}} \sum_{t\in\{0,1\}} \mathbb{E}[Y| G=g, T= t] \Pr( G=g, T=t) \\ &= \sum_{g\in\{1,2,3\}} \sum_{t\in\{0,1\}} \mathbb{E}[Y| G=g, T= t] \Pr(T= t | G= g) \Pr(G = g) \\ &= \pi_1 (0.8 \times 55 + 0.2 \times 50) + \pi_2 (0.2 \times 10 + 0.8 \times 0) + \pi_3 (0.2 \times -45 + 0.8 \times -50) \\ & = 7.4. \end{align*}\]
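This calculation is quick to verify numerically in R. A sketch; the vector names below are introduced here for illustration only:

# Group proportions, exposure probabilities and subgroup means (order P, N, C)
pi_g    <- c(P = 0.3, N = 0.5, C = 0.2)
p_see   <- c(P = 0.8, N = 0.2, C = 0.2)
mu_pre  <- c(P = 50, N = 0, C = -50)
mu_post <- c(P = 55, N = 10, C = -45)
# Weighted sum over groups and exposure status; equals 7.4, matching the derivation
sum(pi_g * (p_see * mu_post + (1 - p_see) * mu_pre))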
Task 7: By simulation, confirm your previous calculation and estimate the variance of the population approval rating.
The code below simulates the approval scores for a population of a given size. By simulating a sufficiently large number of ratings, we can estimate the population mean and variance of the approval rating.
# Function to simulate a given number of approval scores from the population
simulate_population <- function(pop_size = 1000000){
  groups <- c("P", "N", "C")
  # Sample which group each population member belongs to
  pop_data <- data.frame(
    group = sample(
      x = groups,
      size = pop_size,
      replace = TRUE,
      prob = c(0.3, 0.5, 0.2))
  )
  pop_data$group <- as.factor(pop_data$group)
  # Sample whether they have seen the campaign
  pop_data$prob_see_campaign <- 0.2 + 0.6 * (pop_data$group == "P")
  pop_data$has_seen <- rbinom(
    n = pop_size,
    size = 1,
    prob = pop_data$prob_see_campaign)
  # Allocate each population member the correct approval score distribution
  # (factor levels are alphabetical: C, N, P)
  means <- c(-50, -45, 0, 10, 50, 55)
  sds <- c(20, 20, 25, 30, 15, 10)
  pop_data$parameter_index <- 2 * as.numeric(pop_data$group) - 1 + pop_data$has_seen
  # Sample approval rating
  pop_data$approval_rating <- rnorm(
    n = pop_size,
    mean = means[pop_data$parameter_index],
    sd = sds[pop_data$parameter_index]
  )
  return(pop_data)
}
pop_data <- simulate_population()
head(pop_data)
We can estimate the population mean by the sample mean:
mean(pop_data$approval_rating)
## [1] 7.351976
which we see is close to our analytically derived value of 7.4. We can then estimate the population standard deviation and variance by their sample counterparts:
sd(pop_data$approval_rating)
## [1] 42.05385
var(pop_data$approval_rating)
## [1] 1768.526
Task 8: Write a function that simulates one set of responses to this survey.
simulate_survey <- function(survey_size = 10000){
  # Simulate approval ratings for everyone who receives the survey
  survey_data <- simulate_population(pop_size = survey_size)
  # Calculate the probability that each person responds
  survey_data$prob_responds <-
    0.3 * (survey_data$group == "P") +
    0.1 * (survey_data$group == "N") +
    0.5 * (survey_data$group == "C")
  # Sample which people respond
  survey_data$is_responder <- rbinom(
    n = survey_size,
    size = 1,
    prob = survey_data$prob_responds
  )
  # Subset responses accordingly
  survey_responses <- survey_data[survey_data$is_responder == 1, ]
  return(survey_responses)
}
survey_responses <- simulate_survey()
head(survey_responses)
Task 9: Write two further functions. The first should calculate and return a direct estimate of the campaign effect based on the responses. The second should use inverse probability weighting to address the fact that the sample is not representative of the population.
The naive, direct estimation method is simply the difference in mean approval rating between those who have and have not seen the campaign.
estimate_direct <- function(survey){
  # Expected approval rating for those who have seen the campaign
  EAR_seen <- mean(survey$approval_rating[survey$has_seen == 1])
  # Expected approval rating for those who have not seen the campaign
  EAR_unseen <- mean(survey$approval_rating[survey$has_seen == 0])
  # Estimate the campaign effect as the difference between these groups
  EAR_seen - EAR_unseen
}
estimate_direct(survey_responses)
## [1] 54.49099
To implement inverse probability weighting, we need to find the probability of belonging to each group given that a response to the survey has been received, and then the weighting that adjusts this probability back to the population proportion for that group.
To find the probability of belonging to each group we apply Bayes’ theorem and the law of total probability:
\[\Pr(G =g | R = 1) = \frac{\Pr(R = 1 | G = g) \Pr(G = g)}{\sum_{g \in \{1,2,3\}} \Pr(R = 1 | G = g) \Pr(G = g)}. \]
Substituting appropriate values we find that
\[\begin{equation} \Pr(G = g | R = 1) = \left \{ \begin{array}{cc} 9/24 & \text{ for } g = 1; \\ 5/24 & \text{ for } g = 2; \\ 10/24 & \text{ for } g = 3. \end{array} \right. \end{equation}\]
Now we must find weights \(w_g\) such that \(w_g \Pr(G = g | R = 1) = \pi_g\):
\[\begin{align*} w_1 = \frac{3}{10} \frac{24}{9} = \frac{4}{5} \\ w_2 = \frac{5}{10} \frac{24}{5} = \frac{12}{5} \\ w_3 = \frac{2}{10} \frac{24}{10} = \frac{12}{25}. \end{align*}\]
We can sense check these weights by verifying that the groups that are over-represented in the survey responses have weights less than one and groups that are under-represented have weights greater than one.
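These probabilities and weights can also be computed directly in R. A sketch; the vector names are illustrative:

# Group proportions and response probabilities (order P, N, C)
pi_g   <- c(P = 0.3, N = 0.5, C = 0.2)
p_resp <- c(P = 0.3, N = 0.1, C = 0.5)
# Bayes' theorem: Pr(G = g | R = 1)
p_group_given_resp <- p_resp * pi_g / sum(p_resp * pi_g)
# Weights w_g = pi_g / Pr(G = g | R = 1); gives 0.8, 2.4, 0.48 as derived above
weights_g <- pi_g / p_group_given_resp
weights_g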
estimate_weighted <- function(survey){
  # Weights ordered to match the group factor levels (alphabetical: C, N, P),
  # i.e. conservative = 0.48, neutral = 2.4, progressive = 0.8
  weights <- c(0.48, 2.4, 0.8)
  # Mean approval rating for each group and exposure combination
  EARs <- survey %>%
    group_by(group, has_seen) %>%
    dplyr::summarise(EAR = mean(approval_rating))
  EARs <- EARs[[3]]
  # Within-group campaign effects (seen minus unseen), in factor-level order
  estimated_campaign_effects <- c(
    EARs[2] - EARs[1],
    EARs[4] - EARs[3],
    EARs[6] - EARs[5]
  )
  weighted_estimate <- sum(estimated_campaign_effects * weights)
  return(weighted_estimate)
}
estimate_weighted(survey_responses)
## `summarise()` has grouped output by 'group'. You can override using the
## `.groups` argument.
## [1] 17.22562
Task 10: By simulating many sets of responses to the survey, investigate and compare the bias of the two estimators.
The population effect of the campaign can be calculated in a similar manner to the population approval rating. By calculations analogous to Task 6 we can show that
\[ \mathbb{E}[Y|T = 0] = -5 \text{ and } \mathbb{E}[Y|T = 1] = 12.4.\] Combining these gives the expected effect of the campaign in the population as +17.4 points.
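As in Task 6, these two terms can be checked numerically. A sketch, redefining the illustrative vectors from the Task 6 check for completeness:

pi_g    <- c(P = 0.3, N = 0.5, C = 0.2)
p_see   <- c(P = 0.8, N = 0.2, C = 0.2)
mu_pre  <- c(P = 50, N = 0, C = -50)
mu_post <- c(P = 55, N = 10, C = -45)
# Unexposed and exposed terms of the Task 6 sum; give -5 and 12.4,
# whose difference is the +17.4 point campaign effect used below
sum(pi_g * (1 - p_see) * mu_pre)
sum(pi_g * p_see * mu_post)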
We can then simulate the results of 1000 surveys and compute the direct and weighted estimates of the population treatment effect for each run of the survey.
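A sketch of this simulation loop is given below; the object names n_surveys, direct_estimates and weighted_estimates are those used in the plotting code that follows.

n_surveys <- 1000
direct_estimates <- numeric(n_surveys)
weighted_estimates <- numeric(n_surveys)
for (i in seq_len(n_surveys)) {
  # Simulate one survey and store both estimates of the campaign effect
  survey_i <- simulate_survey()
  direct_estimates[i] <- estimate_direct(survey_i)
  weighted_estimates[i] <- estimate_weighted(survey_i)
}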
Plotting the distributions of these estimates, we can see that the mean discrepancy between the true value and the estimate is smaller for the weighted estimator, demonstrating that the weighting has reduced the estimation bias.
dat <- data.frame(
  estimate = c(direct_estimates, weighted_estimates),
  estimate_type = rep(c("direct", "weighted"), each = n_surveys)
)
library(ggplot2)
dat %>%
  ggplot(mapping = aes(x = estimate, group = estimate_type)) +
  geom_histogram(mapping = aes(fill = estimate_type)) +
  geom_vline(xintercept = 17.4)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Task 11: Since the survey company does not know the true proportions of each voter type, they suggest using the sample proportions as plug-in estimates. Comment on this choice.
This approach has (at least) two issues.
Firstly, and most importantly, the sample proportions of the political groups are not representative of the population proportions, which is why we are reweighting in the first place. Using the sample proportions as plug-in estimates of the population proportions therefore cannot correct for this issue.
Secondly, this does not address the uncertainty about the population proportions in each political group.
Task 12: A second proposal is to estimate the population proportion of each voter type from the last election as plug-in estimates. Comment on this choice.
This is an improvement in that it uses an external data source. However, we should be aware that there are many ways in which election turnout might not be representative of the current population. For example, one political group may be more likely to turn out on voting day, or the composition of the population may have changed since the last election.