library(tidyverse)
library(openintro)
glimpse(kobe_basket)
## Rows: 133
## Columns: 6
## $ vs <fct> ORL, ORL, ORL, ORL, ORL, ORL, ORL, ORL, ORL, ORL, ORL, ORL~
## $ game <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1~
## $ quarter <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3~
## $ time <fct> 9:47, 9:07, 8:11, 7:41, 7:03, 6:01, 4:07, 0:52, 0:00, 6:35~
## $ description <fct> Kobe Bryant makes 4-foot two point shot, Kobe Bryant misse~
## $ shot <chr> "H", "M", "M", "H", "H", "M", "M", "M", "M", "H", "H", "H"~
In this lab report we define a streak as a run of consecutive made baskets. To determine whether the hot hand effect truly exists, we measure the lengths of Kobe’s streaks, where a streak length is the number of consecutive baskets Kobe made before a miss occurred.
What does a streak length of 1 mean, i.e. how many hits and misses are in a streak of 1? What about a streak length of 0?
Answer: A streak length of 1 means that Kobe made exactly one shot before missing, i.e., one hit followed by one miss. A streak length of 0 means that a miss came immediately after the miss that ended the previous streak, i.e., zero hits and one miss.
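To make the definition concrete, here is a minimal base R sketch of how streak lengths could be counted from a vector of hits and misses. The helper `streak_lengths` and the toy sequence are hypothetical and written only for illustration; `calc_streak` from the lab may be implemented differently, particularly at the ends of the sequence.

```r
# Hypothetical helper: streak lengths from a vector of "H"/"M" outcomes.
streak_lengths <- function(shots) {
  stopifnot(all(shots %in% c("H", "M")))
  # Code hits as 1 and misses as 0, and pad a miss at each end so that
  # every run of hits is bounded by a miss on both sides (an assumption
  # about how the ends of the sequence are treated).
  padded <- c(0, as.integer(shots == "H"), 0)
  miss_positions <- which(padded == 0)
  # The number of hits between two consecutive misses is one streak length.
  diff(miss_positions) - 1
}

toy_shots <- c("H", "M", "M", "H", "H")  # made-up sequence, not Kobe's data
streak_lengths(toy_shots)
# [1] 1 0 2
```

In the toy sequence, the first hit followed by a miss gives a streak of 1, the second miss gives a streak of 0, and the two closing hits give a streak of 2.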
kobe_streak<-calc_streak(kobe_basket$shot)
ggplot(data=kobe_streak, aes(x=length))+geom_bar()
Describe the distribution of Kobe’s streak lengths from the 2009 NBA finals. What was his typical streak length? How long was his longest streak of baskets? Make sure to include the accompanying plot in your answer.
Answer: The plot of Kobe’s streak lengths is right skewed: the counts fall off as the streak length increases, so the mean is greater than the mode. Kobe’s typical streak length was 0, and his longest streak of baskets was 4.
ggplot(data=kobe_streak, aes(x=length))+geom_bar()
In your simulation of flipping the unfair coin 100 times, how many flips came up heads? Include the code for sampling the unfair coin in your response. Since the markdown file will run the code, and generate a new sample each time you Knit it, you should also “set a seed” before you sample. Read more about setting a seed below.
Answer: In my simulation of flipping the unfair coin a hundred times, “heads” came up 26 times.
set.seed(257)
coin_outcomes<-c("heads","tails")
sample(coin_outcomes, size=1, replace=TRUE)
## [1] "tails"
unfair_coin<-sample(coin_outcomes, size=100, replace=TRUE, prob = c(0.2,0.8))
unfair_coin
## [1] "heads" "heads" "heads" "tails" "tails" "tails" "tails" "heads" "tails"
## [10] "tails" "heads" "tails" "tails" "tails" "heads" "tails" "heads" "tails"
## [19] "heads" "tails" "heads" "tails" "tails" "tails" "heads" "tails" "tails"
## [28] "heads" "tails" "tails" "tails" "tails" "tails" "tails" "heads" "heads"
## [37] "heads" "tails" "heads" "tails" "heads" "tails" "tails" "tails" "tails"
## [46] "tails" "tails" "tails" "heads" "tails" "tails" "tails" "heads" "tails"
## [55] "heads" "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
## [64] "tails" "heads" "tails" "tails" "tails" "heads" "tails" "tails" "tails"
## [73] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "heads" "heads"
## [82] "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails" "tails"
## [91] "heads" "tails" "tails" "heads" "tails" "tails" "tails" "tails" "tails"
## [100] "heads"
table(unfair_coin)
## unfair_coin
## heads tails
## 26 74
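As a rough check on this count, the number of heads in 100 independent flips with a 20% chance of heads follows a binomial distribution. The sketch below, with illustrative variable names that are not part of the lab, computes the expected count, its standard deviation, and how far the observed 26 heads sits from the expectation.

```r
n_flips <- 100   # number of flips in the simulation above
p_heads <- 0.2   # probability of heads for the unfair coin

expected_heads <- n_flips * p_heads                    # 20
sd_heads <- sqrt(n_flips * p_heads * (1 - p_heads))    # 4

# Standardized distance of the observed 26 heads from the expected count
z_observed <- (26 - expected_heads) / sd_heads         # 1.5

# Probability of 26 or more heads under the binomial model
p_at_least_26 <- pbinom(25, size = n_flips, prob = p_heads, lower.tail = FALSE)
p_at_least_26
```

A count 1.5 standard deviations above the expected 20 is on the high side but not implausible for a single simulated sample.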
What change needs to be made to the sample function so that it reflects a shooting percentage of 45%? Make this adjustment, then run a simulation to sample 133 shots. Assign the output of this simulation to a new object called sim_basket.
Answer:
shot_outcome<-c("H","M")
set.seed(257)
sim_basket<-sample(shot_outcome, size=133, replace=TRUE, prob=c(0.45,0.55))
table(sim_basket)
## sim_basket
## H M
## 68 65
Using calc_streak, compute the streak lengths of sim_basket, and save the results in a data frame called sim_streak.
sim_streak<-calc_streak(sim_basket)
sim_streak
## length
## 1 4
## 2 1
## 3 3
## 4 2
## 5 0
## 6 1
## 7 1
## 8 1
## 9 1
## 10 1
## 11 2
## 12 4
## 13 1
## 14 3
## 15 3
## 16 0
## 17 0
## 18 1
## 19 0
## 20 0
## 21 2
## 22 0
## 23 1
## 24 1
## 25 2
## 26 0
## 27 1
## 28 0
## 29 3
## 30 0
## 31 1
## 32 0
## 33 0
## 34 1
## 35 2
## 36 1
## 37 3
## 38 0
## 39 1
## 40 1
## 41 0
## 42 0
## 43 2
## 44 1
## 45 0
## 46 0
## 47 1
## 48 1
## 49 0
## 50 1
## 51 0
## 52 0
## 53 0
## 54 1
## 55 3
## 56 0
## 57 0
## 58 0
## 59 3
## 60 0
## 61 4
## 62 0
## 63 1
## 64 1
## 65 0
## 66 0
Describe the distribution of streak lengths. What is the typical streak length for this simulated independent shooter with a 45% shooting percentage? How long is the player’s longest streak of baskets in 133 shots? Make sure to include a plot in your answer.
Answer:
ggplot(data=sim_streak, aes(x=length))+geom_bar()
As the chart shows, the streak lengths are right skewed, with the median appearing to be greater than the mode. The typical streak length for the simulated shooter is 0, which is expected because on each shot a miss (55%) is more likely than a hit (45%). The longest streak, visible in the graph above, is 4.
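A single simulated shooter gives only one longest streak, so it can help to repeat the simulation many times and look at how the longest streak varies. The following is a minimal base R sketch under that idea; the helper `longest_streak`, the number of repetitions, and the reuse of the seed are assumptions made for illustration and are not part of the lab.

```r
# Hypothetical helper: longest run of "H" in a vector of "H"/"M" outcomes.
longest_streak <- function(shots) {
  runs <- rle(shots)                          # run-length encoding of the sequence
  hit_runs <- runs$lengths[runs$values == "H"]
  if (length(hit_runs) == 0) 0L else max(hit_runs)
}

set.seed(257)   # reused here only so the sketch is reproducible
n_sims <- 1000  # number of simulated shooters (an arbitrary choice)

# Longest streak for each simulated independent 45% shooter over 133 shots
longest <- replicate(
  n_sims,
  longest_streak(sample(c("H", "M"), size = 133, replace = TRUE,
                        prob = c(0.45, 0.55)))
)

# Share of simulated shooters whose longest streak was 4 or more
mean(longest >= 4)
table(longest)
```

If a longest streak of 4 turns out to be common among independent shooters, then a longest streak of 4 in the real data is not by itself evidence of a hot hand.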
Loading new data:
library(tidyverse)
library(openintro)
head(fastfood)
## # A tibble: 6 x 17
## restaurant item calories cal_fat total_fat sat_fat trans_fat cholesterol
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Mcdonalds Artisan G~ 380 60 7 2 0 95
## 2 Mcdonalds Single Ba~ 840 410 45 17 1.5 130
## 3 Mcdonalds Double Ba~ 1130 600 67 27 3 220
## 4 Mcdonalds Grilled B~ 750 280 31 10 0.5 155
## 5 Mcdonalds Crispy Ba~ 920 410 45 12 0.5 120
## 6 Mcdonalds Big Mac 540 250 28 10 1 80
## # ... with 9 more variables: sodium <dbl>, total_carb <dbl>, fiber <dbl>,
## # sugar <dbl>, protein <dbl>, vit_a <dbl>, vit_c <dbl>, calcium <dbl>,
## # salad <chr>
Make a plot (or plots) to visualize the distributions of the amount of calories from fat of the options from these two restaurants (McDonalds and Dairy Queen). How do their centers, shapes, and spreads compare?
mcdonalds<-fastfood%>%
  filter(restaurant=="Mcdonalds")
dairy_queen<-fastfood%>%
  filter(restaurant=="Dairy Queen")
ggplot(data=mcdonalds, aes(x=cal_fat))+geom_bar()
ggplot(data=dairy_queen, aes(x=cal_fat))+geom_bar()
The distributions for both McDonald’s and Dairy Queen are right skewed. For McDonald’s, the bar graph rises very steeply near its left end, and most of its items have between 200 and 500 calories from fat; Dairy Queen’s distribution, despite being right skewed, is more evenly spread out. The centers of both distributions lie in the 200-500 range, but because the long right tail pulls the McDonald’s mean upward, its center is located to the right of Dairy Queen’s.
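The visual comparison of centers and spreads can be backed up with a few summary statistics. The sketch below is a base R illustration only: the helper `center_spread` and the toy values are made up, and the commented lines show how it might be applied to the lab's data frames.

```r
# Hypothetical helper: numeric summaries of center and spread.
center_spread <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), iqr = IQR(x))
}

# Made-up right-skewed values standing in for a cal_fat column
toy_cal_fat <- c(60, 120, 150, 200, 220, 250, 300, 410, 600, 1270)
center_spread(toy_cal_fat)

# With the lab's data frames this would be, for example:
# center_spread(mcdonalds$cal_fat)
# center_spread(dairy_queen$cal_fat)
```

For a right-skewed variable the mean is pulled above the median, so comparing medians and IQRs is usually the fairer way to compare the two restaurants.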
dqmean<-mean(dairy_queen$cal_fat)
dqsd<-sd(dairy_queen$cal_fat)
Creating a histogram of the collected data:
ggplot(data = dairy_queen, aes(x = cal_fat)) +
  geom_blank() +
  geom_histogram(aes(y = ..density..)) +
  stat_function(fun = dnorm, args = c(mean = dqmean, sd = dqsd), col = "tomato")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
#Exercise 8
Based on this plot, does it appear that the data follow a nearly normal distribution?
Answer: Yes, based on this plot the data appear to follow a nearly normal distribution, as most bars sit close to the tomato-colored normal curve.
#Evaluating the normal distribution
Rather than eyeballing the histogram, one can use a normal Q-Q plot to evaluate whether the data are normally distributed. To do so:
ggplot(data=dairy_queen, aes(sample=cal_fat))+
  geom_line(stat="qq")
sim_norm<-rnorm(n=nrow(dairy_queen), mean=dqmean, sd=dqsd)
#Exercise 9
Make a normal probability plot of sim_norm. Do all of the points fall on the line? How does this plot compare to the probability plot for the real data? (Since sim_norm is not a dataframe, it can be put directly into the sample argument and the data argument can be dropped.)
ggplot(data=dairy_queen, aes(sample=sim_norm))+
  geom_line(stat="qq")
qqnormsim(sample = cal_fat, data = dairy_queen)
#Exercise 10
Does the normal probability plot for the calories from fat look similar to the plots created for the simulated data? That is, do the plots provide evidence that the calories from fat are nearly normal?
Answer: Yes, the plots provide evidence that the calories from fat are nearly normally distributed.
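A rough numeric companion to this visual comparison is the correlation between the sample quantiles and the theoretical normal quantiles: values close to 1 indicate points lying near a straight line. The sketch below uses made-up samples and a hypothetical helper `qq_correlation`; it is an informal check, not a formal normality test and not part of the lab.

```r
set.seed(257)
# Made-up samples: one drawn from a normal, one strongly right skewed
normal_sample <- rnorm(50, mean = 250, sd = 150)
skewed_sample <- rexp(50, rate = 1 / 250)

# Hypothetical helper: correlation of the points in a normal Q-Q plot.
qq_correlation <- function(x) {
  # qqnorm() returns theoretical quantiles in $x and sample quantiles in $y
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

qq_correlation(normal_sample)
qq_correlation(skewed_sample)

# With the lab's data this would be, for example:
# qq_correlation(dairy_queen$cal_fat)
```

Because even truly normal samples vary, the value for the real data is best judged against values from several simulated normal samples of the same size, which is the same logic `qqnormsim` applies visually.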
#Exercise 11
Using the same technique, determine whether or not the calories from McDonald’s menu appear to come from a normal distribution.
Answer:
qqnormsim(sample=cal_fat, data=mcdonalds)
As seen from the graphs above, the calories from fat on McDonald’s menu do not appear to follow a normal distribution.
#Exercise 12
Write out two probability questions that you would like to answer about any of the restaurants in this dataset. Calculate those probabilities using both the theoretical normal distribution as well as the empirical distribution (four probabilities in all). Which one had a closer agreement between the two methods?
Question 1: What is the probability that a randomly chosen item on McDonald’s menu has a sugar value of less than 15?
Answer 1:
mcdSugarMean=mean(mcdonalds$sugar)
mcdSugarsd=sd(mcdonalds$sugar)
#Theoretical Probability
pnorm(q=15, mean=mcdSugarMean, sd=mcdSugarsd)
## [1] 0.6158
#Actual Probability
mcdonalds%>%
  filter(sugar<15)%>%
  summarise(percent=n()/nrow(mcdonalds))
## # A tibble: 1 x 1
## percent
## <dbl>
## 1 0.842
Question 2: What is the probability that a randomly chosen item on Dairy Queen’s menu has a sugar value of less than 15?
dqSugarMean=mean(dairy_queen$sugar)
dqSugarsd=sd(dairy_queen$sugar)
#Theoretical Probability
pnorm(q=15, mean=dqSugarMean, sd=dqSugarsd)
## [1] 0.9572535
#Actual Probability
dairy_queen%>%
  filter(sugar<15)%>%
  summarise(percent=n()/nrow(dairy_queen))
## # A tibble: 1 x 1
## percent
## <dbl>
## 1 0.976
For Dairy Queen, the theoretical (0.957) and actual (0.976) probabilities are much closer to each other than they are for McDonald’s (0.616 versus 0.842), once again suggesting that Dairy Queen’s menu is closer to normally distributed than McDonald’s.
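The two calculations used for each restaurant can be wrapped into one small function so that the theoretical and empirical probabilities are computed side by side. This is a base R sketch with a hypothetical helper `compare_probs` and made-up toy values; the commented lines show how it might be called on the lab's data.

```r
# Hypothetical helper: P(X < q) under a fitted normal model and empirically.
compare_probs <- function(x, q) {
  theoretical <- pnorm(q, mean = mean(x), sd = sd(x))
  empirical <- mean(x < q)  # proportion of observations below q
  c(theoretical = theoretical,
    empirical = empirical,
    abs_diff = abs(theoretical - empirical))
}

toy_sugar <- c(0, 2, 3, 5, 6, 8, 9, 11, 12, 14, 32, 48, 87)  # made-up values
compare_probs(toy_sugar, q = 15)

# With the lab's data this would be, for example:
# compare_probs(mcdonalds$sugar, q = 15)
# compare_probs(dairy_queen$sugar, q = 15)
```

A smaller `abs_diff` means the normal model describes that restaurant's values better at the chosen cutoff, although agreement at a single cutoff does not guarantee agreement elsewhere in the distribution.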
#Exercise 13
Now let’s consider some of the other variables in the dataset. Out of all the different restaurants, which ones’ distribution is the closest to normal for sodium?
#from http://zevross.com/blog/2019/04/02/easy-multi-panel-plots-in-r-using-facet_wrap-and-facet_grid-from-ggplot2/
fastfood%>%
  group_by(restaurant)%>%
  ggplot(aes(sample=sodium))+geom_line(stat="qq")+facet_wrap(.~restaurant)
Taco Bell’s sodium values appear to be the closest to normally distributed.
#Exercise 14
Note that some of the normal probability plots for sodium distributions seem to have a stepwise pattern. Why do you think this might be the case?
Answer: The graphs for sodium appear to have a stepwise pattern because many sodium values are repeated. This can be confirmed by visually checking the sodium column of the “fastfood” dataset.
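The effect of repeated values can be illustrated with simulated data. In this base R sketch, which uses made-up values rather than the fastfood data, rounding a continuous sample collapses many observations onto the same value, and each group of ties appears as a flat step in a Q-Q plot.

```r
set.seed(257)
smooth_values <- rnorm(200, mean = 1200, sd = 400)  # made-up continuous values
rounded_values <- round(smooth_values, -2)          # rounded to the nearest 100

# Number of distinct values before and after rounding
length(unique(smooth_values))
length(unique(rounded_values))

# The most frequent rounded values; each group of ties forms one step
head(sort(table(rounded_values), decreasing = TRUE))

# With the lab's data, the number of repeated sodium values could be counted as:
# sum(duplicated(fastfood$sodium))
```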