Probability and Statistics Homework 1

Question 1

Part A

Make a 2x2 table of conditional probabilities, conditional on the levels of the david.bowie variable

##          david.bowie
## daft.punk     0     1
##         0 0.925 0.912
##         1 0.075 0.088

Part B

Are the events “plays Johnny Cash” and “plays Pink Floyd” independent? Why or why not?

For these two events to be independent, the estimated probabilities should satisfy \(P(\text{plays Johnny Cash}) \approx P(\text{plays Johnny Cash} \mid \text{plays Pink Floyd}) \approx P(\text{plays Johnny Cash} \mid \text{does not play Pink Floyd})\).

The table below shows probabilities conditional on the pink.floyd variable:

##            pink.floyd
## johnny.cash     0     1
##           0 0.945 0.895
##           1 0.055 0.105

Here, \(P(\text{plays Johnny Cash} \mid \text{plays Pink Floyd}) = 0.105\) is nearly double \(P(\text{plays Johnny Cash} \mid \text{does not play Pink Floyd}) = 0.055\). Pink Floyd listeners are therefore more likely to also listen to Johnny Cash than non-Pink Floyd listeners, and the events are not independent.

Question 2

Part A

What’s the relationship between danger and humor across all Super Bowl commercials? Estimate the following probabilities:

The frequency table is shown below:

##        danger
## funny   FALSE TRUE
##   FALSE    67    9
##   TRUE    105   66

\(P(\text{danger} = \text{TRUE}) = \frac{9+66}{247} = 0.30\)
\(P(\text{danger} = \text{TRUE} \mid \text{funny} = \text{TRUE}) = \frac{66}{105+66} = 0.39\)
\(P(\text{danger} = \text{TRUE} \mid \text{funny} = \text{FALSE}) = \frac{9}{67+9} = 0.12\)

Does it seem that ads using humor are more or less likely to feature danger than ads not using humor? Or, on the other hand, do humor and danger look nearly independent of each other?

Given the large difference between \(P(\text{danger} = \text{TRUE} \mid \text{funny} = \text{TRUE}) = 0.39\) and \(P(\text{danger} = \text{TRUE} \mid \text{funny} = \text{FALSE}) = 0.12\), ads that use humor are more likely to feature danger than ads that do not. Because the two conditional probabilities differ this much, the two events are not independent.
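The arithmetic above can be reproduced directly from the four cell counts. The following is a minimal sketch in plain Python, offered as an illustration rather than the code used to produce the report (the table output appears to come from R); the variable and function names are assumptions made for this example. Note that the four cells of the table sum to 247.

```python
# Counts copied from the frequency table above: counts[funny][danger].
counts = {
    False: {False: 67, True: 9},    # funny = FALSE
    True:  {False: 105, True: 66},  # funny = TRUE
}

# Total number of ads represented in the table.
total = sum(sum(row.values()) for row in counts.values())

# Marginal probability: share of all ads that feature danger.
p_danger = sum(row[True] for row in counts.values()) / total


def p_danger_given_funny(funny):
    """Restrict to one row of the table, then take the share with danger."""
    row = counts[funny]
    return row[True] / sum(row.values())


print(f"total ads in table        = {total}")
print(f"P(danger)                 = {p_danger:.2f}")
print(f"P(danger | funny = TRUE)  = {p_danger_given_funny(True):.2f}")
print(f"P(danger | funny = FALSE) = {p_danger_given_funny(False):.2f}")
```

Conditioning on a row simply changes the denominator from the grand total to that row's total, which is why the two conditional estimates can differ so much from the marginal one.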

Part B

What’s the relationship between animals and sex across all Super Bowl commercials? Estimate the following probabilities:

The frequency table is shown below:

##        use_sex
## animals FALSE TRUE
##   FALSE   114   41
##   TRUE     67   25

\(P(\text{animals} = \text{TRUE}) = \frac{67+25}{247} = 0.37\)
\(P(\text{animals} = \text{TRUE} \mid \text{use\_sex} = \text{TRUE}) = \frac{25}{25+41} = 0.38\)
\(P(\text{animals} = \text{TRUE} \mid \text{use\_sex} = \text{FALSE}) = \frac{67}{67+114} = 0.37\)

Does it seem that ads using sex are more or less likely to feature animals than ads not using sex? Or, on the other hand, do use_sex and animals look nearly independent of each other?

There is no meaningful difference among \(P(\text{animals} = \text{TRUE}) = 0.37\), \(P(\text{animals} = \text{TRUE} \mid \text{use\_sex} = \text{TRUE}) = 0.38\), and \(P(\text{animals} = \text{TRUE} \mid \text{use\_sex} = \text{FALSE}) = 0.37\). The events therefore look nearly independent: ads that feature sex are no more likely to also feature animals than ads that do not feature sex.
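Another way to check near-independence is to compare the joint probability with the product of the two marginal probabilities, since independence means \(P(A \cap B) = P(A)\,P(B)\). The sketch below does this in plain Python using the counts from the table; it is an illustrative check under that definition, not the method used in the report, and the variable names are assumptions.

```python
# Counts copied from the frequency table above: counts[animals][use_sex].
counts = {
    False: {False: 114, True: 41},  # animals = FALSE
    True:  {False: 67, True: 25},   # animals = TRUE
}

total = sum(sum(row.values()) for row in counts.values())

# Marginal probabilities of each event.
p_animals = sum(counts[True].values()) / total
p_sex = (counts[False][True] + counts[True][True]) / total

# Joint probability that an ad features both animals and sex.
p_joint = counts[True][True] / total

# Under exact independence the joint equals the product of the marginals.
p_product = p_animals * p_sex

print(f"P(animals and use_sex)  = {p_joint:.3f}")
print(f"P(animals) * P(use_sex) = {p_product:.3f}")
print(f"absolute difference     = {abs(p_joint - p_product):.3f}")
```

The joint probability (about 0.101) and the product of the marginals (about 0.100) are almost identical, which agrees with the conclusion drawn from the conditional probabilities.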

Question 3

Part A

Create a histogram to display the overall data distribution of course evaluation scores.

The histogram above shows how often each average course evaluation score occurs. The scores are rounded to one decimal place and range from 2.1 to 5. The mean, median, and mode all appear to be at or near eval = 4.

Part B

Use side-by-side boxplots to show the distribution of course evaluation scores by whether or not the professor is a native English speaker.

The boxplots above show the distributions of evaluation scores for professors who are native English speakers and for those who are not. The bottom and top of each box mark the first and third quartiles of the distribution. Ratings for native English speakers are higher in general, although native speakers also account for most of the ‘bad’ evaluation scores. One possible interpretation is that non-native professors are rated more harshly because of their accents. Conversely, there may be fewer ‘bad’ non-native English-speaking professors because they must be good enough to land the job despite their accents.

Part C

Use a faceted histogram with two rows to compare the distribution of course evaluation scores for male and female instructors.

The histograms above show the frequency distribution of evaluation scores for male and female professors. The male panel shows a higher concentration of ratings around 4 to 4.5 and a less steep drop in frequency near eval = 5. This shift of the male distribution toward higher scores, compared with the more symmetric female distribution, could indicate that ‘bad’ professors are rated similarly regardless of gender, whereas ‘good’ professors are rated more favorably if they are male.

Part D

Create a scatterplot to visualize the extent to which there may be an association between the professor’s physical attractiveness (x) and their course evaluations (y).

The scatterplot above shows the relationship between a professor’s subjectively rated beauty and their course evaluations. There is a subtle positive trend in evaluation scores with respect to attractiveness. This positive relationship is consistent with the claim that more attractive professors are more likely to receive favorable ratings.
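A visual impression of a "subtle positive trend" can be backed up with a single number such as the Pearson correlation coefficient. The sketch below is a hand-rolled version in plain Python; the function name `pearson_r` and the toy values are hypothetical and are not taken from the course evaluation data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values for illustration only; NOT the homework data.
toy_beauty = [-1.0, -0.5, 0.0, 0.5, 1.0]
toy_eval = [4.1, 3.6, 4.3, 3.9, 4.4]

r = pearson_r(toy_beauty, toy_eval)
print(f"toy correlation = {r:.2f}")
```

A value near 0 indicates little linear association, while values closer to 1 indicate a stronger positive one. A weak positive trend like the one described above would correspond to a small positive coefficient.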

Question 4

Make a single table of summary statistics of this data. The table should show the following summary statistics for SAT verbal scores, SAT math scores, and graduating GPA across the whole sample: mean, standard deviation, inter-quartile range (IQR), 5th percentile, 25th percentile, median (50th percentile), 75th percentile, and 95th percentile.

Metric	Mean	SD	IQR	q_05	q_25	q_50	q_75	q_95
SAT Verbal	595.049	83.768	110.000	460.000	540.000	590.000	650.000	730.000
SAT Math	619.979	83.082	120.000	480.000	560.000	620.000	680.000	760.000
GPA	3.212	0.480	0.723	2.361	2.872	3.252	3.595	3.921

The table above displays summary statistics for SAT scores (Verbal and Math) and graduating GPA for UT students who graduated in a particular year. The mean and quantiles for SAT Math are slightly higher than those for SAT Verbal, perhaps indicating a strong technical background among admitted UT students, or that students in general score slightly higher on math than on reading. The mean GPA is 3.2, which corresponds to a B, so the summary statistics align with what is colloquially considered an average grade.
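For reference, each row of a table like this can be computed with Python's standard `statistics` module. The sketch below is an assumption-laden illustration, not the report's code: the helper name `summarize` and the toy scores are hypothetical, and `method="inclusive"` is chosen because it interpolates linearly between order statistics, which may or may not match the quantile convention used for the table above.

```python
import statistics


def summarize(values):
    """Return mean, SD, IQR, and selected percentiles for a list of numbers."""
    # n=20 yields 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    q05, q25, q50, q75, q95 = cuts[0], cuts[4], cuts[9], cuts[14], cuts[18]
    return {
        "Mean": statistics.mean(values),
        "SD": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "IQR": q75 - q25,
        "q_05": q05,
        "q_25": q25,
        "q_50": q50,
        "q_75": q75,
        "q_95": q95,
    }


# Hypothetical scores for illustration only; NOT the homework data.
toy_scores = [480, 520, 540, 560, 590, 600, 620, 650, 700, 730]

for name, value in summarize(toy_scores).items():
    print(f"{name:>5}: {value:.3f}")
```

The IQR is always the 75th percentile minus the 25th percentile, which can be checked against the table: 650 - 540 = 110 for the first row and 3.595 - 2.872 = 0.723 for GPA.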

Question 5

Plot A

A line graph showing average hourly bike rentals (total) across all hours of the day (hr)

The line graph above shows the average total number of bike rentals in Washington, DC for each hour of the day, where hr = 0 is midnight. Hourly averages exceed 200 for almost the entire period between 7am and 7pm, with spikes at 8am and 5pm. This pattern supports the common-sense view that rental bicycles are more popular during the daytime and most popular during commute times for the 9-5 workday.
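The quantity plotted here is a grouped mean: rentals are grouped by hour and averaged within each group. The sketch below shows that step in plain Python; the field names `hr` and `total` follow the plot description, while the function name `mean_by_hour` and the toy rows are hypothetical and do not come from the bike rental data.

```python
from collections import defaultdict


def mean_by_hour(records):
    """Average the 'total' field within each value of 'hr'."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for rec in records:
        sums[rec["hr"]] += rec["total"]
        counts[rec["hr"]] += 1
    # Return the hours in ascending order so the result reads like the x-axis.
    return {hr: sums[hr] / counts[hr] for hr in sorted(sums)}


# Hypothetical rows for illustration only; NOT the homework data.
toy_rows = [
    {"hr": 8, "total": 300},
    {"hr": 8, "total": 400},
    {"hr": 3, "total": 10},
    {"hr": 3, "total": 20},
    {"hr": 17, "total": 450},
]

hourly = mean_by_hour(toy_rows)
peak_hour = max(hourly, key=hourly.get)  # hour with the highest average

print(hourly)
print(f"peak hour = {peak_hour}")
```

Grouping by a pair of fields instead of `hr` alone would give the per-panel averages needed for a faceted version of the same plot.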

Plot B

A faceted line graph showing average bike rentals by hour of the day, faceted according to whether it is a working day (workingday)


The faceted plots above show the same hourly averages, now split by whether the day is a working day. Non-working days, such as weekends, clearly see fewer average bike rentals in the morning before hr = 10, presumably because 9-5 workers are sleeping in. The increases at the 8am and 5pm marks are also less steep, presumably because 9-5 workers are not renting bikes to commute to and from work.

Plot C

A faceted bar plot showing average ridership (y) during the 9 AM hour by weather situation code (weathersit, x), faceted according to whether it is a working day or not



Joseph Williams

2023-07-17
