Quiz 1

Question 1: What is the data type of the variable p_5?

data.frame: 5 obs of 5 variables:
$p_1: int 48 2 28 8 16
$p_2: num 3 9 20 6 15
$p_3: chr "S" "X" "I" "G" ...
$p_4: Factor w/ 5 levels "April", "January", ..: 1 5 3 2 4
$p_5: logi TRUE FALSE FALSE TRUE FALSE

Answer: logical

Question 2: Select the output produced by the script

Script

tues <- c(9,36,2,29,44,33)
tues[6]

Answer: 33

Question 3: Complete the script to produce the output shown

Output

     [,1] [,2]
[1,]    1    5
[2,]    2    7
[3,]    4    8

Script

_____(c(1,2,4),c(5,7,8))

Answer: cbind

Question 4: Complete the script to produce the output shown

Following is a preview of the data frame df.

   x y   z
1 17 A Sep
2 37 B Jul
3 12 C Jun
4 48 D Feb
5 19 E Mar

Output

Script
df <- data.frame(
x = c(17,37,12,48,19),
y = c("A","B","C","D","E"),
z = c("Sep","Jul","Jun","Feb","Mar")
)
df[2:4,_____]

Answer: 1:2

Question 5: Select the correct option

If z is a 4 by 4 matrix, what is the result of z[1,4]?

Answer: The element in the first row and fourth column

Question 6: Complete the script to produce the output shown

Output

Oct Nov Sep 
 23   1  15 

Script

x <- c(20,10,15,23,1,26)
y <- c("Jul", "Aug","Sep", "Oct", "Nov", "Dec")
names(x) <- y
x[______]

Answer: c(4,5,3)
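As a cross-check on the positional answer, a named vector can also be indexed by name. The sketch below rebuilds x from the script and compares the two forms using base R only; the object names by_position and by_name are illustrative.

```r
# Rebuild the named vector from the quiz script
x <- c(20, 10, 15, 23, 1, 26)
names(x) <- c("Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Positional indexing (the quiz answer) and name-based indexing
by_position <- x[c(4, 5, 3)]
by_name <- x[c("Oct", "Nov", "Sep")]

by_position
identical(by_position, by_name)
```

Indexing by name tends to be more robust when the order of the elements might change.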

Question 7: Select the correct option

Which of the following is an ordinal categorical variable?

Answer: Survey results (Dissatisfied, Neutral, Satisfied)

Question 8: Select the script that produces the output shown

Following is a preview of the matrix x.

  P Q
A 1 2
B 3 4

Output

A B 
3 7

Answer:

x <- matrix(1:4, nrow = 2, byrow = TRUE, dimnames = list(c("A","B"), c("P", "Q")))
rowSums(x)

Question 9: Select the correct option

Which of the following data structures can contain multiple data types?

Answer: Data frames and lists

Question 10: Complete the script to produce the output shown

Output [1] 12 19 44 26 43

Script

df <- data.frame(
  id = c(1,2,3,4,5),
  prod = c("K", "V", "J", "R", "O"),
  units = c(12,19,44,26,43)
)
df[______]

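Answer: "units", used as the column index: df[, "units"] (or df[["units"]]) returns the vector shown, whereas df["units"] returns a one-column data frame.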
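The quiz output is a plain numeric vector, so it may help to see how the different bracket forms behave on the same df. This is a small base R sketch; which form the original quiz blank expected is an assumption.

```r
df <- data.frame(
  id = c(1, 2, 3, 4, 5),
  prod = c("K", "V", "J", "R", "O"),
  units = c(12, 19, 44, 26, 43)
)

one_col <- df["units"]      # single brackets: a data frame with one column
as_vec1 <- df[, "units"]    # row/column form: drops to a numeric vector
as_vec2 <- df[["units"]]    # double brackets: also a numeric vector

class(one_col)
as_vec1
identical(as_vec1, as_vec2)
```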

Quiz 2

Question 1: Run the following code and characterize the distribution of X

set.seed(231)
X <- rnorm(1000,10,2) + runif(1000,7,13)
hist(X, col = "pink", breaks = "Scott")

Answer:

The distribution of X is symmetric.

Shape: symmetric

Center: 20.122

Spread: sd(X) = 2.621324

It is a histogram.

Question 2: Run the following code and characterize the distribution of Y.

set.seed(893)
Y <- rexp(1000,0.1)*(-1)+100

Shape: skewed left

Center: median(Y) = 93.25677

Spread: IQR(Y) = 10.885

It is a histogram.

Question 3: Use the data stored in X from Question 1 to compute the standardized test statistic (tobs) for testing H0: mu = 20 versus H1: mu != 20. What is the p-value for the test, and what is your English conclusion?
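No worked answer was recorded for this question. One possible approach, sketched below, is a one-sample t-test: compute tobs by hand, take a two-sided p-value from the t distribution with n - 1 degrees of freedom, and cross-check against t.test(). The choice of a t-test, and the names mu0, tobs, and pvalue, are assumptions rather than the course's stated method.

```r
# Recreate X exactly as in Question 1
set.seed(231)
X <- rnorm(1000, 10, 2) + runif(1000, 7, 13)

mu0 <- 20                                      # hypothesized mean under H0
n <- length(X)
tobs <- (mean(X) - mu0) / (sd(X) / sqrt(n))    # standardized test statistic
pvalue <- 2 * pt(-abs(tobs), df = n - 1)       # two-sided p-value

c(tobs = tobs, pvalue = pvalue)

# Cross-check against the built-in one-sample t-test
t.test(X, mu = mu0, alternative = "two.sided")
```

If the p-value is below the chosen significance level, the conclusion would be that there is evidence the true mean differs from 20; otherwise there is not enough evidence to say so.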

Quiz 3

Question 1: Define the random variable (X) and specify its distribution.

Answer: X = number of heads when five biased coins are flipped; X ~ Bin(5, 0.4)

Question 2: Find E(X) and V(X) exactly and verify the reasonableness of your answer with a simulation using 10,000 draws.

E(X) = 5(0.4) = 2

V(X) = 5(0.4)(1 - 0.4) = 1.2

Answer:

EX <- 5*.4
EX

[1] 2

VX <- 5*.4*(1-.4)
VX

[1] 1.2
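The question also asks for a simulation check with 10,000 draws, which is not recorded above. A minimal sketch follows; the seed is an arbitrary choice made only for reproducibility, and the object names are illustrative.

```r
set.seed(13)                       # arbitrary seed, not from the quiz
draws <- rbinom(10000, size = 5, prob = 0.4)

sim_mean <- mean(draws)            # should be close to E(X) = 2
sim_var <- var(draws)              # should be close to V(X) = 1.2
c(sim_mean = sim_mean, sim_var = sim_var)
```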

Question 3: What is the probability that exactly two of the coins are heads? Provide an exact answer and use simulation with 10,000 trials to verify the reasonableness of your exact answer.

P(X = 2) = 0.3456

dbinom(2,5,.4)

[1] 0.3456

r <- rbinom(10000,5,.4)
mean(r == 2)

[1] 0.346

Question 4: What is the probability that at least 3 are heads? Provide an exact answer and use simulation with 10,000 trials to verify the reasonableness of your exact answer.

P(X >= 3) = 0.31744

sum(dbinom(3:5, 5, 0.4))

[1] 0.31744

x <- rbinom(10000,5,0.4)
mean(x >= 3)

[1] 0.3219

Question 5: If X ~ Bin(n = 20, p = 0.3), Y ~ Bin(n = 30, p = 0.2), and X and Y are independent, what are E(3X+2Y) and V(3X+2Y)? Provide an exact answer and use simulation with 10,000 trials to verify the reasonableness of your exact answer.

E(3X+2Y) = 3(20)(0.3) + 2(30)(0.2) = 30

V(3X+2Y) = 9(20)(0.3)(0.7) + 4(30)(0.2)(0.8) = 57

x <- rbinom(1000,20,.3)
y <- rbinom(1000,30,.2)
mean(3*x) + mean(2*y)

[1] 29.885

var(3*x) + var(2*y)

[1] 56.89873
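The exact values follow from linearity of expectation, E(3X+2Y) = 3E(X) + 2E(Y), and, because X and Y are independent, V(3X+2Y) = 9V(X) + 4V(Y). The sketch below computes both and compares them with a simulation of 10,000 trials; the seed and object names are arbitrary choices, not part of the quiz.

```r
# Exact values for the two binomial distributions
EX <- 20 * 0.3; VX <- 20 * 0.3 * 0.7     # X ~ Bin(20, 0.3)
EY <- 30 * 0.2; VY <- 30 * 0.2 * 0.8     # Y ~ Bin(30, 0.2)
exact_mean <- 3 * EX + 2 * EY            # E(3X + 2Y)
exact_var <- 3^2 * VX + 2^2 * VY         # V(3X + 2Y); the constants are squared

# Simulation with 10,000 trials (arbitrary seed)
set.seed(42)
w <- 3 * rbinom(10000, 20, 0.3) + 2 * rbinom(10000, 30, 0.2)
c(exact_mean = exact_mean, sim_mean = mean(w))
c(exact_var = exact_var, sim_var = var(w))
```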

Question 6: We see 13 out of 20 flips from a coin that is either fair (50% chance of heads) or biased (80% chance of heads). How likely is it that the coin is fair? Answer this by simulating 50,000 fair coins and 50,000 biased coins. What is the exact answer?

F <- rbinom(50000,20,.5)
B <- rbinom(50000,20,.8)
B_13 <- mean(B== 13)
F_13 <- mean(F == 13)
F_13/(B_13 + F_13)

[1] 0.5748963

#Another way to do this
B1 <- dbinom(13,20,.8)
F1 <- dbinom(13,20,.5)
F1/(B1+F1)

[1] 0.5754171

Question 7: Suppose we see 16 heads out of 20 flips, which would normally be strong evidence that the coin is biased. However, suppose we had set a prior probability of 99% chance that the coin is fair (50% chance of heads), and only a 1% chance that the coin is biased (80% chance of heads). Find the posterior probability that the coin is fair, given that there is a 99% prior probability that the coin is fair.

F16 <- dbinom(16,20,.5)
B16 <- dbinom(16,20,.8)
top <- F16*.99
bottom <- F16*.99+B16*.01
top/bottom

[1] 0.677045
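Questions 6 and 7 are the same Bayes' rule calculation with different data and priors, so they can be wrapped in one function. The helper posterior_fair() below is hypothetical (it is not part of the quiz); its defaults assume the fair and biased head probabilities of 0.5 and 0.8 used above.

```r
# Posterior probability that the coin is fair, by Bayes' rule.
# posterior_fair() is a hypothetical helper, not part of the original quiz.
posterior_fair <- function(heads, flips, prior_fair = 0.5,
                           p_fair = 0.5, p_biased = 0.8) {
  like_fair <- dbinom(heads, flips, p_fair)       # P(data | fair)
  like_biased <- dbinom(heads, flips, p_biased)   # P(data | biased)
  num <- like_fair * prior_fair
  num / (num + like_biased * (1 - prior_fair))
}

posterior_fair(13, 20)                     # Question 6 (equal priors)
posterior_fair(16, 20, prior_fair = 0.99)  # Question 7 (99% prior on fair)
```

With equal priors the prior terms cancel, which is why Question 6 reduces to F1/(B1 + F1).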

Quiz 4/5

This was the week where we created the R-Markdown Practice document. I will insert it tomorrow after I ask him some questions about mine.

Directions: Recreate this document using R Markdown. Make sure that you use inline R to report your answers. Your document should look like this document when it is knitted including the directions but have your name in place of the current Your Name. Please print (before class) and turn in both the *.Rmd file and the knitted *.pdf file stapled to the back of your *.Rmd file at the start of class 9/14/17. Name your file firstname_lastname.Rmd (mine would be alan_arnholt.Rmd). Use global options to set the height and width of your figures to 1.5 and 2.5 inches, respectively.

Some Code

set.seed(31)
x <- rnorm(1000,100,10)
DF <- data.frame(x = x)
library(ggplot2)
ggplot(data = DF, aes(x = x)) +
  geom_histogram(binwidth = 2, fill = "pink", color = "black") +
  theme_bw()

The mean of the graph shown below is $\bar{x}= 100.31$. The standard deviation of the graph below is $s$ = 10.13. Make sure your answers update properly and are rounded to two decimal places when the value passed to set.seed() changes.

summary(DF$x)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  71.78   93.60  100.12  100.31  107.10  128.85

The third quartile, $Q_3$, is 107.1.
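The reported values are meant to be produced with inline R, for example `r round(mean(DF$x), 2)` in the .Rmd text, so that they update when the seed changes. The sketch below shows the kind of expressions that could sit inside those inline calls; the object names xbar, s, and q3 are illustrative, and the commented knitr line is one common way to set the global figure size.

```r
set.seed(31)
x <- rnorm(1000, 100, 10)

# Values to report inline, rounded to two decimal places
xbar <- round(mean(x), 2)
s <- round(sd(x), 2)
q3 <- round(unname(quantile(x, 0.75)), 2)

c(xbar = xbar, s = s, q3 = q3)

# In the .Rmd source, a setup chunk could set the global figure size, e.g.
# knitr::opts_chunk$set(fig.height = 1.5, fig.width = 2.5)
```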

A Graph

Additional Resources

Quiz 6

1 - [0.5pts] Select the correct option

Which function adds variables to a data frame?

filter()

summarise()

group_by()

mutate()

Answer: mutate()

2 - [0.5pts] Which function returns a subset of variables?

group_by()

mutate()

summarise()

select()

Answer: select()

3 - [0.5pts] Select the correct option

Which function reduces each group to a single row by calculating aggregate measures?

filter()

arrange()

mutate()

summarise()

Answer: summarise()

4 - [0.5pts] Select the output produced by the script

Following is a preview of the data frame sales:

  id units profit revenue
1  1    10      5     150
2  2    18      4     350
3  2     5     12     250
4  1    20      2     300
5  2    25      7     400
6  1    15     10     500

CODE

arrange(sales, desc(units), desc(profit))

  id units profit revenue
1  2    25      7     400
2  1    20      2     300
3  2    18      4     350
4  1    15     10     500
5  1    10      5     150
6  2     5     12     250

  id units profit revenue
1  1    10      5     150
2  2     5     12     250
3  1    20      2     300
4  2    18      4     350
5  2    25      7     400
6  1    15     10     500

  id units profit revenue
1  2     5     12     250
2  1    15     10     500
3  2    25      7     400
4  1    10      5     150
5  2    18      4     350
6  1    20      2     300

Answer: The first output, which is sorted by descending units.

5 - [0.5pts] Complete the script to produce the output shown

Following is a preview of the tibble trains:

# A tibble: 20 x 3
      id cancelled delayed
   <chr>     <dbl>   <lgl>
 1   F28         0   FALSE
 2   I38        NA   FALSE
 3   J39         0   FALSE
 4   T21         1    TRUE
 5   T30         1    TRUE
 6   N20        NA    TRUE
 # ... with 14 more rows

OUTPUT

 # A tibble: 1 x 1
  dperc
  <dbl>
1   0.4

SCRIPT

library(dplyr)

# Find the percentage of trains delayed

trains %>%
  summarise(dperc = mean(delayed))
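This works because R treats TRUE as 1 and FALSE as 0, so the mean of a logical vector is the proportion of TRUE values. The vector below is a hypothetical stand-in for the delayed column, since the full tibble is not shown.

```r
# Hypothetical stand-in for the delayed column (the full tibble is not shown)
delayed <- c(FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

sum(delayed)    # number of TRUE values
mean(delayed)   # proportion of TRUE values
```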

6 - [0.5pts] Complete the script to produce the output shown

Following is a preview of the data frame df:

OUTPUT

SCRIPT

library(dplyr)

mutate(df, z = x + y)

7 - [0.5pts] Complete the script to produce the output shown

OUTPUT

Observations: 3
Variables: 2
$ x1 <dbl> 5, 4, 1
$ x2 <dbl> 6, 8, 9

SCRIPT

library(dplyr)
df <- data.frame(
  x1 = c(5, 4, 1),
  x2 = c(6, 8, 9)
)
glimpse(df)

8 - [0.5pts] Following is a preview of the data frame max_temp:

  month temp
1   Jan   48
2   Jan   61
3   Jan   55
4   Feb   43
5   Feb   52
6   Feb   44

OUTPUT

  month temp
1   Feb   43
2   Feb   44

filter(max_temp, month == "Feb" | temp < 50)

filter(max_temp, month == "Jan" | temp > 50)

filter(max_temp, month == "Feb" & temp < 50)

Answer: The third script, since both conditions must hold (month is Feb and temp is below 50).

9 - [0.5pts] Select the script that produces the output shown

Following is a preview of the data frame sales:

  store product revenue
1     X       A     200
2     X       B     120
3     Y       C     300
4     Y       A     100
5     Z       B     150
6     Z       C     250

OUTPUT

  store product revenue
1     X       A     200
2     X       B     120
3     Y       A     100
4     Z       B     150

filter(sales, store %in% c("Y", "Z"))

filter(sales, product %in% c("A", "B"))

filter(sales, product %in% c("B", "C"))

Answer: The second script, which keeps the rows whose product is A or B.

10 - [0.5pts] Load the dplyr and hflights packages. How many observations and how many variables are contained in the hflights data set? Write your code and answer below.

library(dplyr)
library(hflights)
glimpse(hflights)

Observations: 227,496
Variables: 21
$ Year              <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 20...
$ Month             <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ DayofMonth        <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
$ DayOfWeek         <int> 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6,...
$ DepTime           <int> 1400, 1401, 1352, 1403, 1405, 1359, 1359, 13...
$ ArrTime           <int> 1500, 1501, 1502, 1513, 1507, 1503, 1509, 14...
$ UniqueCarrier     <chr> "AA", "AA", "AA", "AA", "AA", "AA", "AA", "A...
$ FlightNum         <int> 428, 428, 428, 428, 428, 428, 428, 428, 428,...
$ TailNum           <chr> "N576AA", "N557AA", "N541AA", "N403AA", "N49...
$ ActualElapsedTime <int> 60, 60, 70, 70, 62, 64, 70, 59, 71, 70, 70, ...
$ AirTime           <int> 40, 45, 48, 39, 44, 45, 43, 40, 41, 45, 42, ...
$ ArrDelay          <int> -10, -9, -8, 3, -3, -7, -1, -16, 44, 43, 29,...
$ DepDelay          <int> 0, 1, -8, 3, 5, -1, -1, -5, 43, 43, 29, 19, ...
$ Origin            <chr> "IAH", "IAH", "IAH", "IAH", "IAH", "IAH", "I...
$ Dest              <chr> "DFW", "DFW", "DFW", "DFW", "DFW", "DFW", "D...
$ Distance          <int> 224, 224, 224, 224, 224, 224, 224, 224, 224,...
$ TaxiIn            <int> 7, 6, 5, 9, 9, 6, 12, 7, 8, 6, 8, 4, 6, 5, 6...
$ TaxiOut           <int> 13, 9, 17, 22, 9, 13, 15, 12, 22, 19, 20, 11...
$ Cancelled         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ CancellationCode  <chr> "", "", "", "", "", "", "", "", "", "", "", ...
$ Diverted          <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...

11 - [2 pts] Provide the code needed in the space below to add the variables ActualGroundTime (the difference between ActualElapsedTime and AirTime), GroundTime (the sum of TaxiIn and TaxiOut), and AverageSpeed (Distance/AirTime*60) to a copy of hflights and save the result in the object B3.

library(dplyr)
library(hflights)
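hflights <- as_tibble(hflights)  # tbl_df() is deprecated in current dplyr

# Use = (not <-) inside mutate() so the new column gets the intended name
B1 <- mutate(hflights, ActualGroundTime = ActualElapsedTime - AirTime)
B2 <- mutate(B1, GroundTime = TaxiIn + TaxiOut)
B3 <- mutate(B2, AverageSpeed = Distance/AirTime * 60)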

B3

12 - [3 pts] Verizon is the primary local telephone company (incumbent local exchange carrier ILEC) for a large area of the eastern United States. As such, it is responsible for providing repair service for the customers of other telephone companies known as competing local exchange carriers (CLECs) in this region. Verizon is subject to fines if the repair times (the time it takes to fix a problem) for CLEC customers are substantially worse than those for Verizon customers. The data set Verizon.csv contains a random sample of repair times for 1664 ILEC and 23 CLEC customers. The mean repair time for ILEC customers is 8.4 hours, while that for CLEC customers is 16.5 hours. Could a difference this large be easily explained by chance?

Specify the null and alternative hypotheses.

(I'm using m for mu and parentheses for subscripts)
H(o): m(clec) - m(ilec) = 0
H(A): m(clec) - m(ilec) > 0

Add to the code below to find the p-value and report the p-value here: 0.0176

State your conclusion.

Because the p-value (0.0176) is less than 0.05, we reject the null hypothesis: there is evidence that the mean repair time for CLEC customers is greater than the mean repair time for ILEC customers.

library(dplyr)
Ver <- read.csv("http://www1.appstate.edu/~arnholta/Data/Verizon.csv")
TBL <- Ver %>% 
  group_by(Group) %>% 
  summarize(Mean = mean(Time), n())
TBL
obs_diff <- 16.50913 - 8.411611
obs_diff
set.seed(41)
sims <- 10^4 - 1
ans <- numeric(sims)
for(i in 1:sims){
  index <- sample(1664 + 23, 23)
  ans[i] <- mean(Ver$Time[index]) - mean(Ver$Time[-index])
}

mean(ans)

pvalue <- (sum(ans >= obs_diff) + 1)/(sims + 1)  # or, with the rounded difference: (sum(ans >= 8.1) + 1)/10^4
pvalue

[1] 0.0176
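The same permutation logic can be packaged as a reusable function. The sketch below is hypothetical: perm_test_mean_diff() is not part of the quiz, and the repair times are simulated from exponential distributions purely for illustration rather than read from Verizon.csv.

```r
# perm_test_mean_diff() is a hypothetical helper; the data below are simulated,
# not the Verizon repair times.
perm_test_mean_diff <- function(values, is_group1, sims = 10^4 - 1) {
  n1 <- sum(is_group1)
  obs_diff <- mean(values[is_group1]) - mean(values[!is_group1])
  perm_diffs <- numeric(sims)
  for (i in seq_len(sims)) {
    index <- sample(length(values), n1)          # relabel n1 observations at random
    perm_diffs[i] <- mean(values[index]) - mean(values[-index])
  }
  # Upper-tail p-value, counting the observed statistic as one of the draws
  (sum(perm_diffs >= obs_diff) + 1) / (sims + 1)
}

set.seed(41)
times <- c(rexp(23, 1 / 16), rexp(200, 1 / 8))   # simulated repair times
group1 <- rep(c(TRUE, FALSE), c(23, 200))
perm_test_mean_diff(times, group1, sims = 999)
```

Adding 1 to both the numerator and the denominator keeps the estimated p-value strictly positive, matching the (sims + 1) convention used above.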

Quiz 7

Question 1: Load the ggplot2 package and provide code that duplicates Figure 1 using the data in mtcars.

library(ggplot2)
ggplot(data = mtcars, aes(x = wt, y = mpg, size = hp)) +
  geom_point() +
  theme_bw()

Question 2: Provide code that duplicates Figure 2 using the data in mtcars.

ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  theme_bw()

Question 3: Provide code that duplicates Figure 3 using the data in ChickWeight.

ggplot(data = ChickWeight, aes(x = Diet, y = weight)) +
  geom_boxplot() +
  theme_bw()

Extra Credit: Provide code that duplicates Figure 4 using the data in mtcars and a seed value of 23.

set.seed(23)
ggplot(mtcars, aes(cyl, wt, shape = factor(am))) +
  geom_point(position = "jitter", size = 3) +
  theme_bw()

Quiz 8

Question 1: Load the ggplot2 package and provide code that duplicates Figure 1 using the data in mtcars.

library(ggplot2)
ggplot(mtcars, aes(x=factor(cyl), fill = factor(vs))) +
  geom_bar(position = "dodge") +
  theme_bw()

Question 2: Provide code that duplicates Figure 2 using the data in mtcars.

ggplot(mtcars, aes(x = factor(cyl), fill = factor(am))) +
  geom_bar(position = "fill") +
  theme_bw()

Question 3: Provide code that duplicates Figure 3 using the data in ChickWeight.

library(ggplot2)
ggplot(ChickWeight, aes(x = Diet, y = weight)) +
  geom_boxplot() +
  theme_bw()

Question 4: Compute and report appropriate measures of center and spread for the boxplots in Figure 3 using the data in ChickWeight.

library(dplyr)
ChickWeight %>% 
  group_by(Diet) %>% 
  summarize(median(weight), IQR(weight))

# A tibble: 4 x 3
    Diet `median(weight)` `IQR(weight)`
  <fctr>            <dbl>         <dbl>
1      1             88.0         78.75
2      2            104.5         97.50
3      3            125.5        131.25
4      4            129.5        113.50

Quiz 9

Question 1: State the fundamental question of inference.

Answer: How does what you actually observed compare to what would happen if the null hypothesis were true and you repeated the process many times?

Question 2: Use the data frame mtcars and provide the code needed to recreate Figure 1.

library(tidyverse)
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, linetype = 2, aes(color = factor(cyl), group = 1))

Question 3: When are two events considered independent?

Answer: Whenever P(A ∩ B) = P(A)P(B)

Question 4: Based on the figure, do you think using Facebook and using Twitter are independent? Justify your answer. Answer:

Question 5: Use the following code, which creates TDF, to test whether using Facebook and using Twitter are independent with the chisq.test() function. State your English conclusion.

on <- c(40, 20, 10, 20)
ONL <- matrix(data = on, nrow = 2, byrow = TRUE)
dimnames(ONL) <- list(Twitter = c("Yes", "No"), Facebook = c("Yes", "No"))
ONLT <- as.table(ONL)
ONLTDF <- as.data.frame(ONLT)
TDF <- vcdExtra::expand.dft(ONLT)
T1 <- xtabs(~Twitter + Facebook, data = TDF)
T1

       Facebook
Twitter No Yes
    No  20  10
    Yes 20  40

chisq.test(T1, correct = FALSE)


    Pearson's Chi-squared test

data:  T1
X-squared = 9, df = 1, p-value = 0.0027

Answer: X-squared = 9, df = 1, p-value = 0.0027 (with the default continuity correction, chisq.test(T1) instead gives X-squared = 7.7006, p-value = 0.00552). Because the p-value is below 0.05, we reject the null hypothesis of independence: there is evidence of an association between using Twitter and using Facebook. Therefore, the events are not independent.
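The product rule from Question 3 can also be checked descriptively against the same counts. The sketch below rebuilds the table as a plain matrix, so it does not need vcdExtra, and compares the joint proportion with the product of the marginal proportions before running the test; the object names are illustrative.

```r
# Counts from the quiz: rows are Twitter, columns are Facebook
T1 <- matrix(c(20, 10, 20, 40), nrow = 2, byrow = TRUE,
             dimnames = list(Twitter = c("No", "Yes"),
                             Facebook = c("No", "Yes")))
n <- sum(T1)

p_both <- T1["Yes", "Yes"] / n          # P(Twitter and Facebook)
p_twitter <- sum(T1["Yes", ]) / n       # P(Twitter)
p_facebook <- sum(T1[, "Yes"]) / n      # P(Facebook)

c(joint = p_both, product = p_twitter * p_facebook)

# The formal test, without the continuity correction
chisq.test(T1, correct = FALSE)
```

The joint proportion is 40/90 while the product of the marginals is (60/90)(50/90), so the sample proportions do not satisfy the product rule; the chi-squared test assesses whether a gap of that size could plausibly be due to chance.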

Quiz 11

Question 1: State the fundamental question of inference.

Answer: How does what you observe compare to what would happen if the null hypothesis were actually true and you repeated the process many times?

Question 2: Given the code below, write appropriate null and alternative hypotheses for the code as well as your English conclusion.

library(tidyverse)
FD <- read.csv("http://www1.appstate.edu/~arnholta/Data/FlightDelays.csv")
names(FD)

 [1] "ID"           "Carrier"      "FlightNo"     "Destination" 
 [5] "DepartTime"   "Day"          "Month"        "FlightLength"
 [9] "Delay"        "Delayed30"

AN <- FD %>%
  group_by(Carrier) %>%
  summarize(Var = var(Delay), n = n())
AN

# A tibble: 2 x 3
  Carrier      Var     n
   <fctr>    <dbl> <int>
1      AA 1606.457  2906
2      UA 2037.525  1123

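AN <- AN %>%
  summarize(obs_stat = Var[2]/Var[1])  # observed ratio: Var(UA)/Var(AA)
sims <- 10^4 - 1
rat <- numeric(sims)
for(i in 1:sims){
  index <- sample(4029, 1123, replace = FALSE)
  # ratio of the two variances under a random relabeling of carriers
  rat[i] <- var(FD$Delay[index])/var(FD$Delay[-index])
}
pvalue <- (sum(rat >= AN$obs_stat) + 1)/(sims + 1)
pvalue  # no seed is set, so the value varies slightly from run to run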

Ho: sd(UA)^2/sd(AA)^2 = 1
Ha: sd(UA)^2/sd(AA)^2 > 1

English Conclusion: We did not find sufficient evidence to suggest that the variance of delay times for United Airlines is greater than the variance of delay times for American Airlines.

Quiz 12

Question 1: State the fundamental question of inference.

Answer: How does what you observe compare to what would happen if the null hypothesis were actually true and you repeated the process many times?

Question 2: Let X1, X2, …, X25 ~ Exp(lambda = 1/5) and let xbar denote the sample mean.

a. What are the theoretical values of E(X), E(xbar), Var(X), and Var(xbar)?
b. Simulate the sampling distribution of xbar in R by taking 10^4 samples of the appropriate size. Compute hat(E(xbar)) and hat(Var(xbar)).
c. From your simulation, find P(xbar >= 7).

Answer:

a. E(X) = 5, E(xbar) = 5, Var(X) = 25, Var(xbar) = 1
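The theoretical values come from the exponential distribution's mean 1/lambda and variance 1/lambda^2, together with Var(xbar) = Var(X)/n for a sample of n independent draws. A short sketch of that arithmetic, with illustrative object names:

```r
lambda <- 1 / 5   # rate of the exponential distribution
n <- 25           # sample size

EX <- 1 / lambda            # E(X)
VX <- 1 / lambda^2          # Var(X)
EXbar <- EX                 # E(xbar) equals E(X)
VXbar <- VX / n             # Var(xbar) = Var(X) / n

c(EX = EX, EXbar = EXbar, VX = VX, VXbar = VXbar)
```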

sims <- 10^4
xbar <- numeric(sims)
for (i in 1:sims) {
  xbar[i] <- mean(rexp(25,1/5))
}
mean(xbar)

[1] 5.002896

var(xbar)

[1] 0.9872694

mean(xbar >= 7)

[1] 0.0315

Quiz 13

Question 1: Given the pdf f(x) = (16/6)x^3e^(-2x), x >= 0, find E(X) and Var(X)

func <- function(x){
  (16/6)*x^3*exp(-2*x)
}
func_ex <- function(x){
  x*((16/6)*x^3*exp(-2*x))
}
EX <- integrate(func_ex, 0, Inf)$value
func_vx <- function(x){
  (x-EX)^2*((16/6)*x^3*exp(-2*x))
}
VX <- integrate(func_vx, 0, Inf)$value

Question 2: P(X >= 3)?

P <- integrate(func, 3, Inf)$value
P

[1] 0.1512039
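As a cross-check on the numerical integration, the pdf has the form of a gamma density with shape 4 and rate 2, since 2^4/Gamma(4) = 16/6. Under that reading (an observation added here, not part of the quiz), the mean, variance, and tail probability are available in closed form or from pgamma().

```r
# Assumption: the pdf is recognized as a Gamma(shape = 4, rate = 2) density
shape <- 4
rate <- 2

EX_gamma <- shape / rate        # mean of a gamma distribution
VX_gamma <- shape / rate^2      # variance of a gamma distribution
p_tail <- pgamma(3, shape = shape, rate = rate, lower.tail = FALSE)  # P(X >= 3)

# The quiz pdf and dgamma() should agree pointwise
func <- function(x) (16 / 6) * x^3 * exp(-2 * x)
all.equal(func(1.5), dgamma(1.5, shape = shape, rate = rate))

c(EX = EX_gamma, VX = VX_gamma, p_tail = p_tail)
```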

Quiz 14

Question 1:

Quiz 15

Question 1: Z_0.04

qnorm(0.04)

[1] -1.750686

Question 2: t_(0.92,20)

qt(.92,20)

[1] 1.459341

Question 3: P(t_4 <= .10)

pt(.1, 4)

[1] 0.5374221

Question 4: A random sample of 16 students revealed a mean score of 7.8 with a standard deviation of 1.15. Assuming the distribution of quiz scores follows a normal distribution, construct a 93% confidence interval for the true average quiz score.

7.8+ qt(.965, 15)*(1.15/4)

[1] 8.360895

7.8 - qt(.965, 15)*(1.15/4)

[1] 7.239105

To determine the quantile probabilities that bound a confidence interval (example: a 93% confidence interval):

0.93 = 1 - alpha
alpha = 0.07
alpha / 2 = 0.035 = lower quantile probability
1 - 0.035 = 0.965 = upper quantile probability
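The steps above can be collected into a small function that builds a t-based interval from summary statistics. The helper t_interval() is hypothetical (not part of the quiz) and assumes a normal population with unknown standard deviation, as in Question 4.

```r
# t_interval() is a hypothetical helper for a t-based interval from summary statistics
t_interval <- function(xbar, s, n, conf_level) {
  alpha <- 1 - conf_level
  tcrit <- qt(1 - alpha / 2, df = n - 1)   # upper critical value
  margin <- tcrit * s / sqrt(n)            # margin of error
  c(lower = xbar - margin, upper = xbar + margin)
}

t_interval(xbar = 7.8, s = 1.15, n = 16, conf_level = 0.93)
```

Called with the Question 4 values, this reproduces the two bounds computed by hand above.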

Quiz Solutions

Samantha Widman

10/9/2017
