Probability and Statistics Homework 2

Problem 1 - Capital Metro UT Ridership

Part 1

Make a faceted line graph that plots average boardings by hour of the day, day of week, and month. You should facet by day of week. Each facet should include three lines of average boardings (y) by hour of the day (x), one line for each month and distinguished by color.

Does the hour of peak boardings change from day to day, or is it broadly similar across days? Why do you think average boardings on Mondays in September look lower, compared to other days and months? Similarly, why do you think average boardings on Weds/Thurs/Fri in November look lower?

Peak boardings are similarly high, around 150 per hour, on school/workdays and similarly low, around 25 per hour, on weekends. I think average boardings were lower on Mondays in September because of Labor Day, a US federal holiday that falls on one of the roughly four Mondays in September and would therefore lower the average significantly. I think average boardings on Weds/Thurs/Fri are lower in November for a similar reason: the Thanksgiving federal holiday.

Part 2

Make a faceted scatter plot showing boardings (y) vs. temperature (x), faceted by hour of the day, and with points colored in according to whether it is a weekday or weekend.

When we hold hour of day and weekend status constant, does temperature seem to have a noticeable effect on the number of UT students riding the bus?

With hour of day and weekend/weekday status held constant, there seems to be neither a positive nor a negative relationship between temperature and boardings. Temperature does not seem to have a noticeable effect on the number of UT students riding the bus.

Problem 2: Wrangling the Billboard Top 100

Part A

Make a table of the top 10 most popular songs since 1958, as measured by the total number of weeks that a song spent on the Billboard Top 100.

performer	song	count
Imagine Dragons	Radioactive	87
AWOLNATION	Sail	79
Jason Mraz	I’m Yours	76
The Weeknd	Blinding Lights	76
LeAnn Rimes	How Do I Live	69
LMFAO Featuring Lauren Bennett & GoonRock	Party Rock Anthem	68
OneRepublic	Counting Stars	68
Adele	Rolling In The Deep	65
Jewel	Foolish Games/You Were Meant For Me	65
Carrie Underwood	Before He Cheats	64

In the table we see songs and their respective performers, ordered by the number of weeks spent on the Billboard Top 100. An interesting trend is that most or all of these top songs are contemporary (from my lifetime).

Part B

Make a line graph that plots this measure of musical diversity over the years. The x axis should show the year, while the y axis should show the number of unique songs appearing at any position on the Billboard Top 100 chart in any week that year.

The above line graph shows the number of unique songs that appeared on at least one week’s Billboard Top 100 chart in each year between 1959 and 2020. The rise of song_diversity before 1970 could be attributed to the cultural explosion of media in the 60s. We then see a negative trend until around 2000, which could be explained by the dominance of pop and rock music in mainstream culture. Song diversity begins to increase again around 2000. This could be due to changing tastes, such as mainstream appetite for rap, but also to new technologies such as Apple’s iPod. Finally, around 2013, we see a massive drop and a subsequent spike. I remember this period, and it did feel like only a couple of artists were dominating the game. The spike could be due to the onset of streaming services, which give listeners more freedom than ever to listen to a huge diversity of artists at virtually no added cost.

Part C

There are 19 artists in U.S. musical history since 1958 who have had at least 30 songs that were “ten-week hits.” Make a bar plot for these 19 artists, showing how many ten-week hits each one had in their musical career.

The above bar plot shows the 19 artists who have had at least 30 songs appear for at least ten weeks on the Billboard 100 weekly list- AKA musical GOATs.

Problem 3: Regression Practice

Part A

What creatinine clearance rate should we expect for a 55-year-old?

Plotting the relationship between age and creatclear we can see the below negative linear trend.

Since we suspect a standard linear relationship, we can use the linear model function in R to obtain coefficients.

## (Intercept)         age 
## 147.8129158  -0.6198159

The intercept and age coefficients provide the following estimate, where \(Rate\) is creatinine clearance rate in mL/min and \(Age\) is age in years:

\(Rate = 147.813 - [0.620]*Age.\)

Plugging in, we get:

\(Rate = 147.813 - [0.620]*55 \\ Rate = 113.713.\)

Therefore we should expect a creatinine clearance rate of approximately 113.7 mL/min for a 55-year-old.
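As a quick check of the plug-in step, the prediction can be reproduced in R. This is a minimal sketch that hard-codes the unrounded coefficients printed above; the helper name `predict_rate` is hypothetical and not part of the original analysis. Because it uses the unrounded coefficients, the result differs slightly from the hand calculation with rounded values.

```r
# Coefficients copied from the model output shown above
intercept <- 147.8129158
slope_age <- -0.6198159

# Hypothetical helper: expected creatinine clearance (mL/min) at a given age
predict_rate <- function(age) {
  intercept + slope_age * age
}

# Expected rate for a 55-year-old
rate_55 <- predict_rate(55)
print(rate_55)
```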

Part B

How does creatinine clearance rate change with age? (This should be a number with units ml/minute per year.)

The estimated change of creatinine clearance rate with age is represented by the age coefficient in the linear model. Per the above equation, creatinine clearance rate decreases by about 0.620 mL/min per year of age.

Part C

Whose creatinine clearance rate is healthier (higher) for their age: a 40-year-old with a rate of 135, or a 60-year-old with a rate of 112? Briefly explain your reasoning.

Based off the linear model, the expected creatinine clearance rates (Rates) for a 40-year-old and a 60-year-old are the following:

\(Rate = 147.813 - [0.620]*40 \\ Rate = 123.013.\)

and,

\(Rate = 147.813 - [0.620]*60 \\ Rate = 110.613.\)

Based on our model, the 40-year-old has a rate around \(135-123 = 12\) mL/min higher than the expected rate for their age, whereas the 60-year-old’s rate is only \(112-110.6 = 1.4\) mL/min higher than expected. Therefore the 40-year-old is healthier for their age.
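The comparison above amounts to comparing residuals (observed minus expected). A minimal sketch in R, assuming the unrounded coefficients from Part A; the data frame and its column names are made up for illustration:

```r
# Coefficients copied from the model output in Part A
intercept <- 147.8129158
slope_age <- -0.6198159

# The two patients from the question (column names are illustrative)
patients <- data.frame(age = c(40, 60), observed = c(135, 112))

# Expected rate from the fitted line, and residual = observed - expected
patients$expected <- intercept + slope_age * patients$age
patients$residual <- patients$observed - patients$expected

# The patient with the larger residual is healthier for their age
print(patients)
healthier_age <- patients$age[which.max(patients$residual)]
```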

Problem 4: Probability Practice

Part A

A shady used car dealer has 30 cars, and 10 of them are “lemons” (that is, mechanically faulty used cars), but you don’t know which cars they are. If you buy 3 cars, what is the probability that you will get at least one lemon?

The probability of getting at least one lemon in a sample of three cars, \(Pr(Lemon)\), is given by the complement of the probability of getting no lemons at all, \(Pr(NotLemon)\):

\(Pr(Lemon) = 1 - Pr(NotLemon)\).

\(Pr(NotLemon)\) can be calculated by taking the total number of combinations of non lemons given by \(\binom{20}{3}\) over the total number of combinations of all cars \(\binom{30}{3}\).

Thus,

\(Pr(NotLemon) = \frac{\binom{20}{3}}{\binom{30}{3}} = \frac{1140}{4060} = 0.2807882\).

The probability that at least one of the three cars will be a lemon is \(Pr(Lemon) = 1 - \frac{1140}{4060} = \frac{2920}{4060}\), or about 0.719.
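The counting argument can be checked with R's built-in `choose()` function. A short sketch; the variable names are illustrative:

```r
# Ways to pick 3 cars from the 20 non-lemons, over ways to pick any 3 of 30
p_no_lemon <- choose(20, 3) / choose(30, 3)

# Complement: at least one of the 3 cars is a lemon
p_at_least_one <- 1 - p_no_lemon
print(p_at_least_one)
```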

Part B

We throw two dice (each with the usual 6 sides, numbered 1-6). What is the probability that the sum of the two numbers is odd? What is the probability that the sum of the two numbers is less than 7? What is the probability that the sum of the two numbers is less than 7, given that it is odd? Are these two events independent?

\(Pr(Odd)\) can be obtained by counting the outcomes with an odd sum over the total number of outcomes. The 11 possible sums (2 through 12) are not equally likely, so we count the 36 equally likely ordered pairs of faces instead. The sum is odd exactly when one die shows an odd number and the other an even number, which happens in \(3 \cdot 3 + 3 \cdot 3 = 18\) of the 36 pairs, thus \(Pr(Odd) = \frac{18}{36} = \frac{1}{2}\).

\(Pr(<7)\) can be obtained similarly. The sums 2, 3, 4, 5 and 6 can occur in 1, 2, 3, 4 and 5 ways respectively, for 15 pairs in total. It follows that \(Pr(<7) = \frac{15}{36} = \frac{5}{12}\), or about 0.417.

\(Pr(<7 | Odd)\) is a conditional probability which can be written as

\(Pr(<7 | Odd) = \frac{Pr(<7 \cap Odd)}{Pr(Odd)}\).

There are 6 pairs in which the sum of the dice is both odd and less than 7: 2 pairs with sum 3 and 4 pairs with sum 5. So \(Pr(<7 \cap Odd) = \frac{6}{36} = \frac{1}{6}\). Therefore, \(Pr(<7 | Odd) = \frac{1}{6}/\frac{1}{2} = \frac{1}{3}\). Since \(Pr(<7 | Odd) = \frac{1}{3} \neq \frac{5}{12} = Pr(<7)\), it follows that the two events are not independent.
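These probabilities can be checked by brute force, enumerating all 36 equally likely ordered outcomes of the two dice. The sketch below uses base R only; all variable names are illustrative.

```r
# All 36 equally likely ordered outcomes of two six-sided dice
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
total <- rolls$die1 + rolls$die2

is_odd <- total %% 2 == 1
is_lt7 <- total < 7

# Marginal probabilities are the share of outcomes in each event
p_odd <- mean(is_odd)
p_lt7 <- mean(is_lt7)

# Conditional probability: restrict attention to the odd outcomes
p_lt7_given_odd <- sum(is_odd & is_lt7) / sum(is_odd)

# Independence would require Pr(<7 | Odd) to equal Pr(<7)
independent <- isTRUE(all.equal(p_lt7_given_odd, p_lt7))

print(c(p_odd = p_odd, p_lt7 = p_lt7, p_lt7_given_odd = p_lt7_given_odd))
print(independent)
```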

Part C

Visitors to your website are asked to answer a single survey question before they get access to the content on the page. Among all of the users, there are two categories: Random Clicker (RC), and Truthful Clicker (TC). There are two possible answers to the survey: yes and no. Random clickers would click either one with equal probability. You are also given the information that the expected fraction of random clickers is 0.3. After a trial period, you get the following survey results: 65% said Yes and 35% said No. What fraction of people who are truthful clickers answered yes?

The rule of total probability holds that the probability of an event is the sum of the probabilities for all the different ways in which that event can happen. In this model, it is clear that every ‘yes’ clicker is either a RC or a TC.

\(Pr(Y) = Pr(RC, Y) + Pr(TC, Y)\).

Plugging in for our givens, we have,

\(0.65 = (0.3)(0.5) + Pr(TC, Y)\).

So \(Pr(TC, Y) = 0.65 - 0.15 = 0.5\). Next we can construct the following:

\(Pr(TC, Y) = Pr(TC) * Pr(Y | TC)\).

Plugging in, we see that \(0.5 = 0.7 * Pr(Y | TC)\). Thus, \(Pr( Y | TC) = \frac{0.5}{0.7} = \frac{5}{7}\).
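The same rearrangement of the rule of total probability can be written out in R as a sanity check. A minimal sketch; the variable names are illustrative:

```r
# Givens from the problem statement
p_yes          <- 0.65  # overall share of "yes" answers
p_rc           <- 0.3   # share of random clickers
p_yes_given_rc <- 0.5   # random clickers pick yes/no with equal probability

# Truthful clickers are everyone else
p_tc <- 1 - p_rc

# Total probability: p_yes = p_rc * p_yes_given_rc + p_tc * p_yes_given_tc
p_yes_given_tc <- (p_yes - p_rc * p_yes_given_rc) / p_tc
print(p_yes_given_tc)
```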

Part D

Imagine a medical test for a disease with the following attributes: 1) The sensitivity is about 0.993. That is, if someone has the disease, there is a probability of 0.993 that they will test positive. 2) The specificity is about 0.9999. This means that if someone doesn’t have the disease, there is probability of 0.9999 that they will test negative. 3) In the general population, incidence of the disease is reasonably rare: about 0.0025% of all people have it (or 0.000025 as a decimal probability).

Suppose someone tests positive. What is the probability that they have the disease?

We are interested in the observable event that someone tests positive. Let us break up that event into the following joint probabilities, where \(Pr(D), Pr(ND)\) are the probabilities of having the disease or not having the disease, and \(Pr(P), Pr(N)\) are the probabilities of testing positive or testing negative.

\(Pr(P, ND) = Pr(ND) * (1 - Pr(N | ND)) \\ Pr(P, D) = Pr(D) * Pr(P | D).\)

The probabilities on the right side of the above equations are all the information we have been given. We can simplify \(1 - Pr(N | ND) = Pr(P | ND)\). Plugging in we have

\(Pr(P, ND) = Pr(ND) * Pr(P | ND) = 0.999975 * (1-0.9999) = 0.0000999975 \\ Pr(P, D) = Pr(D) * Pr(P | D) = 0.000025 * 0.993 = 0.000024825.\)

The above joint probabilities allow us to construct \(Pr(D | P)\) as follows:

\(Pr(D | P) = \frac{Pr(P,D)}{Pr(P,D) + Pr(P,ND)}.\)

Finally, plug in to get,

\(Pr(D | P) = \frac{0.000024825}{0.000024825 + 0.0000999975} \approx 0.19888.\)

Therefore, if someone tests positive there is only about a 19.9% chance they have the disease.
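The Bayes' rule calculation can be reproduced directly from the three givens. A minimal sketch in R; the variable names are illustrative:

```r
# Givens from the problem statement
sensitivity <- 0.993     # Pr(P | D)
specificity <- 0.9999    # Pr(N | ND)
prevalence  <- 0.000025  # Pr(D)

# Joint probabilities of testing positive with and without the disease
p_pos_and_d  <- prevalence * sensitivity
p_pos_and_nd <- (1 - prevalence) * (1 - specificity)

# Bayes' rule: share of all positives that are true positives
p_d_given_pos <- p_pos_and_d / (p_pos_and_d + p_pos_and_nd)
print(p_d_given_pos)
```

Even with a very accurate test, false positives from the large healthy population outnumber true positives because the disease is so rare.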

Part E

If an aircraft is present in a certain area, a radar correctly registers its presence with probability 0.99. If it is not present, the radar falsely registers an aircraft presence with probability 0.10. Suppose that on average across all days, an aircraft is present with probability 0.05. Let the events A and R be defined as follows: A = an aircraft is present, R = the radar registers an aircraft presence. What is P(A | R), the conditional probability that an aircraft is present, given that the radar registers an aircraft presence?

We can approach this problem in the same way as Part D. We are interested in the observable event that an aircraft is registered. Let us break up that event into the following joint probabilities, where \(Pr(NA)\) is the probability that an aircraft is not present:

\(Pr(R, A) = Pr(A) * Pr(R | A) \\ Pr(R, NA) = Pr(NA) * Pr(R | NA).\)

Substituting our given information, we can calculate the above joint probabilities:

\(Pr(R, A) = Pr(A) * Pr(R | A) = 0.05 * 0.99 = 0.0495\\ Pr(R, NA) = Pr(NA) * Pr(R | NA) = 0.95 * 0.1 = 0.095.\)

The above joint probabilities allow us to construct \(Pr(A | R)\) as following:

\(Pr(A | R) = \frac{Pr(R,A)}{Pr(R,A) + Pr(R,NA)}.\)

Plugging in the likelihoods we obtain

\(Pr(A | R) = \frac{0.0495}{0.0495 + 0.095} = 0.3425606.\)

Therefore, if an aircraft is registered, there is only a 34.3% chance there is actually an aircraft present.
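One way to cross-check this result is a small Monte Carlo simulation: generate many hypothetical days, simulate the radar on each, and look only at the days on which the radar registered something. This is a sketch and not part of the original solution; the seed and the number of simulated days are arbitrary choices.

```r
set.seed(1)        # arbitrary seed so the sketch is reproducible
n_days <- 200000   # arbitrary number of simulated days

# Is an aircraft present on each day? Pr(A) = 0.05
aircraft <- runif(n_days) < 0.05

# Radar registers with probability 0.99 if present, 0.10 if not
p_register <- ifelse(aircraft, 0.99, 0.10)
registered <- runif(n_days) < p_register

# Among days with a registration, how often was an aircraft present?
p_a_given_r_sim <- mean(aircraft[registered])

# Exact value from Bayes' rule, for comparison
p_a_given_r_exact <- (0.05 * 0.99) / (0.05 * 0.99 + 0.95 * 0.10)

print(c(simulated = p_a_given_r_sim, exact = p_a_given_r_exact))
```

The simulated share should land close to the exact value, with the gap shrinking as `n_days` grows.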

Problem 5: Modeling Soccer Games with the Poisson Distribution

1. Question: What questions are you trying to answer?

The question I am trying to answer is: what are the estimated probabilities of a win, loss, or draw for matches between (i) Liverpool (home) and Tottenham (away) and (ii) Manchester City (home) and Arsenal (away)?

2. Approach: What approach/statistical tool did you use to answer the questions?

I used R to calculate the defensive and offensive strength variables and to create the lambdas for each team. Lambda was calculated using the method outlined in the article:

\(LambdaHomeTeam = LeagueAverageGF * HomeAttackStrength * AwayDefenseWeakness\)

Lambda was then used to calculate the probability \(P(X = x)\) that the target team would score \(x\) goals. The dpois function in R evaluates the following Poisson probability mass function, or pmf:

\(P(X = x) = \frac{\lambda^{x}}{x!}*e^{-\lambda}\)

Treating the two teams’ scores as independent, our method’s joint probability that the home team scores \(x\) goals and the away team scores \(y\) goals is therefore:

\(P(X = x, Y = y) = \frac{\lambda_{home}^{x}}{x!} e^{-\lambda_{home}} * \frac{\lambda_{away}^{y}}{y!} e^{-\lambda_{away}}\)

Plugging in the lambdas calculated for Liverpool (\(\lambda_{home} = 2.234193\)) and Tottenham (\(\lambda_{away} = 0.7297656\)) we have the following probability that Liverpool will win at home 2-1 against Tottenham:

\(Pr(X = 2, Y = 1) = \frac{2.234193^{2}}{2!} e^{-2.234193} * \frac{0.7297656^{1}}{1!} e^{-0.7297656} \\ Pr(X = 2, Y = 1) = 0.2672475 * 0.3517631 = 0.094007809\)

So, there is a roughly 9.4% chance that Liverpool (home) will beat Tottenham with a score of exactly 2-1.
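The 2-1 scoreline probability can be reproduced with `dpois`, which evaluates the Poisson pmf \(\lambda^{x} e^{-\lambda} / x!\). The sketch below uses the two lambdas quoted above and also computes one term by hand as a check; the variable names are illustrative.

```r
# Lambdas quoted in the text
lambda_home <- 2.234193   # Liverpool (home)
lambda_away <- 0.7297656  # Tottenham (away)

# Probability of exactly 2 home goals and exactly 1 away goal
p_home_2 <- dpois(2, lambda_home)
p_away_1 <- dpois(1, lambda_away)

# Joint probability, treating the two scores as independent
p_2_1 <- p_home_2 * p_away_1

# By-hand evaluation of the pmf for the home team, for comparison with dpois
p_home_2_manual <- lambda_home^2 / factorial(2) * exp(-lambda_home)

print(c(p_home_2 = p_home_2, p_away_1 = p_away_1, p_2_1 = p_2_1))
```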

I calculated the probabilities that 0-5 Goals, respectively, would be scored for each team. Then I copied those probabilities into Excel and created two cross tables of joint probabilities [see Doc. A]. From there I could sum the joint probabilities that corresponded with win, draw or loss. These sums are the estimated probabilities of those events occurring.
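The cross table and the win/draw/loss sums could also be built entirely in R with `outer`, rather than in Excel. This is a sketch of that alternative, not the workflow actually used; it assumes the Liverpool and Tottenham lambdas quoted above and the same 0-5 goal truncation.

```r
# Lambdas quoted in the text
lambda_home <- 2.234193   # Liverpool (home)
lambda_away <- 0.7297656  # Tottenham (away)

goals <- 0:5  # same truncation as the cross tables: 0 to 5 goals per team

# joint[i, j] = Pr(home scores goals[i]) * Pr(away scores goals[j])
joint <- outer(dpois(goals, lambda_home), dpois(goals, lambda_away))
dimnames(joint) <- list(home = goals, away = goals)

# Rows index home goals, so cells below the diagonal have home > away
p_home_win <- sum(joint[lower.tri(joint)])
p_draw     <- sum(diag(joint))
p_away_win <- sum(joint[upper.tri(joint)])

# The three sums fall short of 1 by the probability of a 6+ goal score
check <- p_home_win + p_draw + p_away_win

print(c(home_win = p_home_win, draw = p_draw, away_win = p_away_win, check = check))
```

Extending `goals` to a larger maximum (for example `0:10`) would push `check` closer to 1.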

3. Results: What evidence/results did your approach provide to answer the questions? (E.g. any numbers, tables, figures as appropriate.)

Action	Likelihood
PR(LIVERPOOL WIN)	0.6896897
PR(TIE)	0.1769584
PR(TOTTENHAM WIN)	0.1066698
CHECK	0.9733179

Action	Likelihood
PR(MAN CITY WIN)	0.6676364
PR(TIE)	0.1823213
PR(ARSENAL WIN)	0.1232617
CHECK	0.9732194

The above tables summarize the likelihood of win/loss/tie for each game. The ‘Check’ row is the sum of the three likelihoods. The difference between 1.00 and ‘Check’ (less than 0.03 in both cases) is the probability, left out of the tables, that at least one team scores more than 5 goals.

4. Conclusion: What are your conclusions about your questions? Provide a written interpretation of your results, understandable to stakeholders who might plausibly take an interest in this data set.

The above findings can be interpreted as follows: there is around a 67% chance that Man City will take the home field win against Arsenal, and the other entries in the tables read the same way.

All in all, this method provides a useful, if idealized, way to summarize a team’s performance in a single number. Soccer is very flow-based and psychological. Occasionally there is no clear reason why one team is better than another aside from ‘chemistry’ or ‘it-factor’. For this reason, a team’s past performance is arguably one of the best available indicators of its future performance, better than player biodata, etc. Regarding the independence assumption: while each team’s lambda depends on the opponent’s performance, this method takes no account of the pressure a team might face when down a goal, or how that might push it to attack harder.


Joseph Williams

2023-07-24
