GLM_HW_Prior
Introduction
Recently, Marketing did a bang-up job with their Promotion Program. Our office has been tasked with determining whether said program was in fact a success. We used the “Bike Share” dataset located in our super secret Data Vault.
Question 1 - Establishing our Model
The BikeShare Marketing department has tasked this office with creating a model to study the efficacy of its most recent campaign.
So we’ve assembled a model which takes a look at Total Ridership considering a number of factors.
What is the temperature and/or humidity? This is probably the biggest factor, and the relationship is curved rather than linear. An increase of x in temperature or humidity does not always mean a commensurate increase of y in Ridership. Rather, past a certain point, further increases in temperature and/or humidity cause ridership growth to slow and eventually reverse.
Put simply, while warmth is certainly a strong positive indicator of increased ridership, it is also possible to be too warm, leading to a decrease in ridership as more people stay home in their Air Conditioning.
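A minimal sketch of that “too warm” effect, using simulated data rather than the Bike Share dataset. The names `temp` and `riders`, the 0-1 temperature scale, and every number below are assumptions made up for illustration. It fits a quadratic in temperature and reads off where predicted ridership peaks.

```r
# Simulated (hypothetical) data: ridership rises with temperature, then falls.
set.seed(42)
temp   <- seq(0.05, 0.95, length.out = 200)   # assumed normalized temperature
riders <- 1000 + 8000 * temp - 6000 * temp^2 + rnorm(200, sd = 100)

# Quadratic fit; raw = TRUE keeps the coefficients on the original temp scale.
fit <- lm(riders ~ poly(temp, 2, raw = TRUE))
b   <- unname(coef(fit))   # b[1] intercept, b[2] linear term, b[3] squared term

# A downward-opening parabola peaks at -(linear) / (2 * squared).
peak_temp <- -b[2] / (2 * b[3])
peak_temp
```

A negative squared term is what produces the turnaround; past `peak_temp`, each extra unit of warmth lowers predicted ridership.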
Other factors were also taken into account: Was it a workday? Was it a holiday? Was it windy? What month was it? Was there a promotion that day? What was the general weather situation? And so on.
The model we came up with, “Bike LM 4”, has an Adjusted R Squared of 87%, meaning it explains roughly 87% of the variation in Total Ridership after adjusting for the number of predictors. That makes it a pretty good model, and one that passes a logic driven “sniff test” to see if it makes sense. To be sure, I could have included more data, such as what ‘season’ it was. I generally consider “Season” to be a social construct, as the calendar definition of “Winter” and the weather situation on the ground will not always match. The other factors considered within the model already take into account weather and time of year.
However, one situation not explicitly covered in my model is when it becomes unseasonably warm or cool (relative to a certain season). Logically, unseasonable warmth in a cool period should drive an uptick, while unseasonable warmth in a warm period should drive a downtick or a slowing. However, my model includes Month, Weather Situation, and Temperature. Ergo, it already includes “warm” December dates and “cool” July dates. Certainly, an argument can be made for including the season, but I remain of the opinion that Season is a reasonable removal.
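For reference, the Adjusted R Squared quoted above follows the standard definition below, where $n$ is the number of observations and $p$ the number of predictors (neither value is stated in this write-up):

$$
\bar{R}^2 = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
$$

The penalty grows with $p$, so adding a variable like Season only raises the adjusted figure if it explains enough extra variation to pay for itself.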
Q2 - Problems with Multi-Collinearity/Linearity
I did not have a lot of problems with the base assumptions surrounding this model, with the exception of the issue of “linearity”. Some multicollinearity is to be expected in this model, because several of the predictors move together: month, temperature, humidity, and weather situation are all related to one another. Their effects also combine; just because it’s warm outside does not mean people are going to ride bikes if it is also rainy and not a promotional day. Multiple portions of this model are interrelated, and that does not surprise me.
Where I did run into issues, initially, was with the assumption of Linearity. Namely, the relationship is not truly linear: a one-unit change in x does not produce the same change in y at every point. Rather, the relationship is polynomial, which is to say that it eventually begins to curve. There are diminishing returns on heat, such that each additional unit of heat does not produce an equal additional unit of ridership; the gain slows, eventually stops, and indeed eventually turns into decreased ridership.
So too does this occur with Humidity and Windspeed, though the relationship there is a little more stark.
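One common way to put a number on multicollinearity is the variance inflation factor (VIF). The sketch below is not taken from the original analysis: it uses simulated predictors, with `humidity` deliberately built to track `temp`, and a hypothetical helper `vif_manual()` written in base R.

```r
# Simulated (hypothetical) predictors; humidity is constructed to follow temp.
set.seed(1)
n   <- 300
sim <- data.frame(temp = runif(n))
sim$humidity  <- 0.5 * sim$temp + rnorm(n, sd = 0.1)
sim$windspeed <- runif(n)

# VIF for one predictor = 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all of the others.
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(sim)
round(vifs, 2)
```

Values near 1 mean a predictor is nearly independent of the rest; the correlated pair here should come out noticeably higher than `windspeed`.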
Q3 - What happens if things get wonky for a certain month?
The model does take month into account. I was not going to include it - if it’s good Bike Ridin’ weather, it doesn’t really bother me that it’s September, I just want to hit the pedals. But…I read all of the questions before beginning the homework, so…I put Month into my model.
Best Practice is to know all of the minimum viable product requirements before beginning work on your product, right?
So, looking at the Months to see which one is the busiest, we see that Month 8 (August) has the highest coefficient (barely), at 8.92. It is roughly comparable to nearby months on the calendar, namely Months 4, 5, and 6 (April, May, and June), which all have coefficients in the 6.3-8.5 range. Because Month enters the model as a factor, each coefficient is measured relative to a reference month, which by R’s default is the first level (presumably Month 1, January).
The odd little duck is July, with a teensy coefficient of 1.14. That doesn’t make sense on a purely month-to-month analysis unless something else comes into the equation. Was it particularly rainy? Was it just unbearably hot or humid?
To determine that, let’s take a look at two things: the “Standard Error” and the Pr(>|t|) value. The latter is the p-value from a t-test of whether the coefficient differs from zero; a small value means the estimate is unlikely to be a product of chance alone.
| Month | Estimate | Std. Error | Pr(>\|t\|) |
|---|---|---|---|
| Apr | 6.39 | 1.63 | .000121 |
| May | 7.92 | 1.86 | 2.36*10^-05 |
| Jun | 8.49 | 2.1 | 3.11*10^-05 |
| Jul | 1.14 | 2.33 | 1.18*10^-06 |
| Aug | 8.92 | 2.15 | 3.92*10^-05 |
So, there are a couple of things that I take away from this table. The first is that July is different from the others; looking at its Coefficient alone is enough to suggest that. But so too does its Error: it’s wider than the other values. Not by a huge margin, mind you, but it’s wider.
Additionally, its Pr(>|t|) (the p-value of a t-test on the coefficient) is markedly smaller than that of its peers.
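A quick consistency check on the month table above, sketched under one loud assumption: the residual degrees of freedom are not given in this write-up, so `resid_df` is a placeholder. The t value is simply the estimate divided by its standard error, and Pr(>|t|) is the two-sided tail area beyond it.

```r
# Estimates and standard errors copied from the month table above.
months <- data.frame(
  month    = c("Apr", "May", "Jun", "Jul", "Aug"),
  estimate = c(6.39, 7.92, 8.49, 1.14, 8.92),
  std_err  = c(1.63, 1.86, 2.10, 2.33, 2.15)
)

resid_df <- 700  # ASSUMPTION: placeholder; the real residual df is not stated

# t statistic = estimate / standard error; p-value = two-sided tail area.
months$t_value <- months$estimate / months$std_err
months$p_value <- 2 * pt(-abs(months$t_value), df = resid_df)
months
```

With the figures as printed, July’s ratio comes out well below 1, which would correspond to a large p-value rather than the very small one in the table. One of the July figures may have lost a digit or an exponent on its way out of the R output, so it is worth rechecking against the original `summary()` printout.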
According to this exhaustively researched climatological data, Toronto in July is both hot as heck and humid as all get out. These two factors combine to suggest that it’s too warm, even for the season, and that ridership dips.
Toronto Weather
Consequently, I feel like, even in August, if it were quite warm and humid like this, you would see the Estimate dip as well. Note, though, that both July and August have relatively high (for this table) standard errors, suggesting that there are good biking days too. It’s not like February, where you just don’t see anyone on a bike if they can help it.
Q4 - Did Promotion Work?
Broadly speaking, yes, I think the Promotion works. Its coefficient (1.95*10^3) is positive and quite large as regression coefficients in this model go. Its Pr(>|t|) value is also quite small (2*10^-16), which means the effect is very unlikely to be zero. However, we’ll dig a little deeper in later questions, as I think there’s more to this question than is apparent at first blush. After all, while we’ve been doing a lot of math about “total riders”, there are really two classes of customers…
Q5 - Diggin’ a Little Deeper
So, let’s look at the differentiated effects on the Registered vs Casual Riders.
We run the same regression, but swap in Casual or Registered Riders as the response:
lm(casual ~ poly(temp, 3, raw = T) + as.factor(Promotion) + as.factor(weathersit) + as.factor(mnth) + as.factor(weekday) + as.factor(workingday) + humidity, data = Bike_Master)
The relevant figures are…
| Rider Type | Intercept | Std. Error | Promotion Effect |
|---|---|---|---|
| Casual | 1200 | 220.89 | 285 (±26) |
| Registered | 1.14 | 3.96 | 1.7 (±4.7) |
There are markedly more casual riders, and the Promotion Effect was numerically more powerful on them, but was proportionally more powerful on the Registered Riders, where it may have almost doubled Ridership among Registered Riders on Promo Days.
I conclude that the Promo is more effective for our Registered Riders, because it had a stronger per-rider effect within that category. That isn’t to suggest, however, that it wasn’t effective among Casual Riders.
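The numeric-versus-proportional comparison can be sketched directly from the table above. This treats the intercept as a rough baseline, which is an assumption of this sketch rather than something the regression guarantees, and takes the table values at face value.

```r
# Figures copied from the Casual/Registered table above.
effects <- data.frame(
  rider     = c("Casual", "Registered"),
  intercept = c(1200, 1.14),
  promo     = c(285, 1.7),
  margin    = c(26, 4.7)   # the plus-or-minus values from the table
)

# Promotion effect relative to the intercept (a rough baseline at best).
effects$relative <- effects$promo / effects$intercept

# Range implied by the plus-or-minus margin around each promotion effect.
effects$low  <- effects$promo - effects$margin
effects$high <- effects$promo + effects$margin
effects
```

As printed, the Registered range spans zero while the Casual range does not, so the per-rider comparison deserves a second look against the full regression output.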
Q6 - What More Do We Need?
So, we know that Promos are effective, but what we don’t know is whether they are Financial Profit Drivers. I’d like to look at some financial considerations. Predominantly, what are the costs associated with maintaining our system? For the purposes of this determination/analysis I don’t need a full P&L sheet or even our Balance Sheet. Rather, I just need to know if we make money on a Promo Day when a Registered User takes a bike out.
We’ve lowered the Transaction Costs to Reg-Users to the point that they’re paying like…20% of what the advertised rate is…and this is a population that already likes to ride bikes to such a degree they’ve signed up for some newfangled subscription service to do so…so…are they really the audience we need to be targeting for incentives?
If we’re making money even at 20% of Sticker? Then hell yes, good on us! But if we’re in fact trying to offer a loss leader to people who are already our customers…we probably need to look at another form of promotional offer.
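The question boils down to a per-ride margin, sketched below. The function `promo_margin()` is hypothetical, as are the sticker price and cost figures; only the 20% share comes from the text above.

```r
# Per-ride margin on a promo day for a Registered User.
# discount_share = 0.20 reflects the "20% of Sticker" figure in the text;
# sticker_price and cost_per_ride are HYPOTHETICAL inputs.
promo_margin <- function(sticker_price, cost_per_ride, discount_share = 0.20) {
  sticker_price * discount_share - cost_per_ride
}

promo_margin(sticker_price = 10, cost_per_ride = 1.50)  # positive: we make money
promo_margin(sticker_price = 10, cost_per_ride = 2.50)  # negative: loss leader
```

Plugging in the real per-ride cost is the one number this analysis still needs.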