Please put your answers here, following the instructions in the assignment description. Do not change the arguments at the top of the code chunks, and put your answers and word count tallies in the locations indicated. Remember to knit as you go, and submit the knitted version of this on Canvas.

Q1

do_new <- do %>%
 rowwise() %>%
 mutate(morescary = if_else(scariness > loudness, "TRUE", "FALSE")) %>%
 mutate(personality = mean(c(scariness, loudness))) %>%
 select(-c(scariness, loudness))

tibble(do_new)
## # A tibble: 7 Ă— 5
##   name        species height morescary personality
##   <chr>       <chr>    <dbl> <chr>           <dbl>
## 1 super size  bear       164 TRUE              8.5
## 2 big wol     owl         32 TRUE              6  
## 3 rainbow     unicorn     18 FALSE             3.5
## 4 hugo        hippo       22 TRUE              5.5
## 5 sissily     snake        4 TRUE              5  
## 6 little blue penguin     18 FALSE             4  
## 7 kevin       guitar      25 FALSE             4

Q2

dh_new <- pivot_longer(dh, cols = c("lfb", "rainbow", "diff"), names_to = "what", values_to = "score")

dh_new            
## # A tibble: 180 Ă— 3
##    question what    score
##    <chr>    <chr>   <dbl>
##  1 q1       lfb      49.7
##  2 q1       rainbow  40.9
##  3 q1       diff      8.8
##  4 q2       lfb      40.3
##  5 q2       rainbow  35.8
##  6 q2       diff      4.5
##  7 q3       lfb      29.7
##  8 q3       rainbow  26.8
##  9 q3       diff      2.9
## 10 q4       lfb      38  
## # … with 170 more rows

Q3

ggplot(dh_new) +
 aes(x = score, fill = what) +
 geom_histogram(bins = 30L) +
 scale_fill_hue(direction = 1) +
 labs(x = "Score of each question", y = "No. of observations",
 title = "1D distribution of 3 answers from all groups",
 fill = "Answers") +
 ggthemes::theme_base() +
 theme(legend.position = "bottom") +
 facet_wrap(vars(what), scales = "free_x")

Q4

a <- dh_new %>% filter(what=='lfb') 

lapply(a['score'], shapiro.test)
## $score
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.98257, p-value = 0.5462
b <- dh_new %>% filter(what=='rainbow') 

lapply(b['score'], shapiro.test)
## $score
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.98162, p-value = 0.5009
c <- dh_new %>% filter(what=='diff') 

lapply(c['score'], shapiro.test)
## $score
## 
##  Shapiro-Wilk normality test
## 
## data:  X[[i]]
## W = 0.98602, p-value = 0.7232

*ANSWER: The Shapiro-Wilk normality test is performed to check the normality of the scores for the three classes. We assess normality with two hypotheses: null hypothesis H0: the data are normal; alternative hypothesis HA: the data are not normal.

Applying these hypotheses to the scores of the three classes, every p-value (0.55, 0.50, and 0.72) is greater than \(\alpha\) = 0.05, so at the 5% significance level we fail to reject the null hypothesis that the data are normal: the scores of all three classes are consistent with a normal distribution. [Word Count: XX]*
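As a cross-check, the same per-group test can be written more compactly; a minimal sketch, assuming dh_new as built in Q2:

# Sketch: one Shapiro-Wilk test per value of 'what', collected in a single table
dh_new %>%
 group_by(what) %>%
 summarise(W = shapiro.test(score)$statistic,
           p_value = shapiro.test(score)$p.value)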

Q5

t.test(a['score'],b['score'])
## 
##  Welch Two Sample t-test
## 
## data:  a["score"] and b["score"]
## t = 1.0137, df = 117.19, p-value = 0.3128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.198303  3.711636
## sample estimates:
## mean of x mean of y 
##  34.79500  33.53833

ANSWER: We used a Welch two-sample t-test to compare the answers of rainbow and LFB, with null hypothesis H0: the mean scores for LFB and rainbow are equal, and alternative hypothesis HA: the mean scores for LFB and rainbow are different. We get a p-value of 0.31 > 0.05 (5% significance level), so we fail to reject the null hypothesis and conclude that the LFB and rainbow scores are not significantly different. The sample has 60 observations per group (120 in total), which is a reasonable size for a t-test, so the high p-value reflects how close the two group means are (34.8 vs 33.5) rather than a lack of data. [Word Count: XX]
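The same comparison can also be expressed with the formula interface of t.test; a sketch, assuming dh_new from Q2 (the default is the Welch test, matching the output above):

# Sketch: Welch two-sample t-test via the formula interface
t.test(score ~ what, data = dh_new %>% filter(what %in% c("lfb", "rainbow")))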

Q6

Q7

a = table(dp$problem, dp$improved) 
chisq.test(a)
## 
##  Pearson's Chi-squared test
## 
## data:  a
## X-squared = 0.057744, df = 2, p-value = 0.9715

*ANSWER: We applied the chi-square test with the following hypotheses. Null hypothesis H0: the distribution of health problems is not significantly different from the government standard. Alternative hypothesis HA: the distribution of health problems is significantly different from the government standard.

From the result of the test we fail to reject the null hypothesis, since the p-value (0.97) is far greater than the 5% significance level. Note that the degrees of freedom here are 2; for a contingency table this is (rows - 1) x (columns - 1), and with 3 health problems and 2 improvement categories that gives (3 - 1)(2 - 1) = 2. [Word Count: XX]*
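As a quick assumption check, the expected cell counts behind the test can be inspected; a sketch using the table a built above:

# Sketch: expected counts should generally be at least 5 for the chi-square approximation to hold
chisq.test(a)$expected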

Q8

pt(0.8,133)
## [1] 0.7874313

*ANSWER: The probability of seeing 134 or more non-food visits is approximately 78.7%.*

ANSWER: The calculated probability of about 78.7% is the area under the t-distribution curve to the left of t = 0.8; only about 21% of the area remains in the right tail. That right-tail probability can be calculated with the pt function as 1 - pt(0.8, 133), which gives the remaining ~21% of the curve to the right of t = 0.8. Remember that the second argument is the degrees of freedom, equal to n - 1 where n is the sample size.
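A small sketch of that right-tail calculation (using pt, not qt):

# Sketch: probability in the right tail of a t distribution with 133 degrees of freedom
1 - pt(0.8, df = 133)
pt(0.8, df = 133, lower.tail = FALSE)  # equivalent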

Q9

dd_sum <- dd %>% group_by(size,time) %>% summarise(mean(health),median(health),sd(health)) 
## `summarise()` has grouped output by 'size'. You can override using the
## `.groups` argument.
colnames(dd_sum) <- c("size"  ,     "time"  ,     "meanHealth", "mdnHealth" , "sdHealth" )
dd_sum
## # A tibble: 12 Ă— 5
## # Groups:   size [4]
##    size     time  meanHealth mdnHealth sdHealth
##    <chr>    <chr>      <dbl>     <dbl>    <dbl>
##  1 enormous t1          70.3      66      10.2 
##  2 enormous t2          61.7      59       8.33
##  3 enormous t3          51        46      10.4 
##  4 large    t1          81        77.5    12.8 
##  5 large    t2          73.9      73      13.0 
##  6 large    t3          65.5      68      13.6 
##  7 medium   t1          77.6      75      11.2 
##  8 medium   t2          NA        NA      NA   
##  9 medium   t3          67.6      64      13.4 
## 10 small    t1          72.2      76.5    12.8 
## 11 small    t2          69.5      73.5    13.1 
## 12 small    t3          65.1      68.5    11.6
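The NA row for medium at t2 suggests missing health values in that group; if those missing values should simply be ignored, a hedged sketch is:

# Sketch: drop missing health values before summarising (assumes the NAs can be ignored)
dd %>%
 group_by(size, time) %>%
 summarise(meanHealth = mean(health, na.rm = TRUE),
           mdnHealth  = median(health, na.rm = TRUE),
           sdHealth   = sd(health, na.rm = TRUE),
           .groups = "drop")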

Q10

ggplot(dd_sum2) +
 aes(x = time, fill = time, weight = meanHealth) +
 geom_bar(color = "black") +
 scale_fill_brewer(palette = "Blues", direction = 1) +
 labs(x = "Time", y = "Health") +
 ggthemes::theme_base() +
 facet_wrap(vars(size)) +
 geom_errorbar(aes(ymin = meanHealth - sdHealth, ymax = meanHealth + sdHealth), width = 0.2)

ANSWER: The highest mean health is for the large size group, and average health sits roughly between 60 and 75 at each time, declining from t1 to t3 in every size group. [Word Count: XX]
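A hedged alternative for the plot above maps the group means directly to bar height with geom_col and gives each bar its own +/- 1 SD error bar; it assumes dd_sum2 carries the meanHealth and sdHealth columns summarised in Q9:

# Sketch: bars of mean health with per-group +/- 1 SD error bars
ggplot(dd_sum2, aes(x = time, y = meanHealth, fill = time)) +
 geom_col(colour = "black") +
 geom_errorbar(aes(ymin = meanHealth - sdHealth, ymax = meanHealth + sdHealth), width = 0.2) +
 scale_fill_brewer(palette = "Blues") +
 labs(x = "Time", y = "Mean health") +
 facet_wrap(vars(size))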

Q11

ggplot(dd) +
 aes(x = income, y = health, fill = size) +
 geom_point(shape = "circle filled", size = 3L, 
 colour = "#112446") +
 scale_fill_hue(direction = 1) +
 labs(x = "Income of participant", y = "Health rating") +
 ggthemes::theme_base() +
 facet_wrap(vars(size))
## Warning: Removed 2 rows containing missing values (`geom_point()`).

*ANSWER: In the dd dataset there are four size categories, from enormous to small. We have tried to find a trend in the data using faceted point plots. The graph above shows a good mix of sizes across the sample. The independent variable in the graph is income, linked to the dependent variable, health rating. For large-sized participants the health ratings are higher than for the enormous and small groups. In every panel the points cluster around a certain range of income. We cannot explicitly say that there is a linear trend between income and health rating. Overall, the highest ratings are observed for large participants with medium incomes of about 80-120. [Word Count: XX]*
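One way to check the linear-trend question more directly is to overlay a fitted line in each panel; a sketch, assuming the same income, health, and size columns of dd used above:

# Sketch: per-panel linear fit to judge whether income predicts health rating
ggplot(dd, aes(x = income, y = health)) +
 geom_point(aes(colour = size), size = 3) +
 geom_smooth(method = "lm", se = TRUE) +
 labs(x = "Income of participant", y = "Health rating") +
 facet_wrap(vars(size))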

Q12

Q13

  1. The t-statistic does not depend on the sample size alone, which is why the t value is the same in all cases. Although changing the sample size can change the mean values of the two groups, it is the difference between those means, relative to their variability, that the t-test checks.

  2. A negative t-value shows a reversal in the direction of the effect being studied, but it has no effect on the p-value, which is judged against the usual 5% significance level. A t-value with a negative sign is equivalent to the same value with a positive sign, except that the statistic sits on the negative side of the curve; the sign on its own does not tell us whether to reject the null hypothesis. It simply indicates that, in the t-test formula, Foxy has put the lower mean first (subtracted the larger mean from the smaller); the sketch after this list illustrates the point. [Word Count: XX]*
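A small sketch of the sign point, using hypothetical vectors x and y rather than Foxy's data:

# Sketch: swapping the group order flips the sign of t but not the p-value
set.seed(1)
x <- rnorm(30, mean = 10)   # hypothetical group with the smaller mean
y <- rnorm(30, mean = 12)   # hypothetical group with the larger mean
t.test(x, y)$statistic      # negative t
t.test(y, x)$statistic      # same magnitude, positive sign
t.test(x, y)$p.value        # the p-value is identical either way
t.test(y, x)$p.value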

Q14

ANSWER: p and alpha values are used in hypothesis tests to decide whether to reject the null hypothesis. Regarding the first statement, a p-value of 0.4 does not mean there is a 40% probability that the null hypothesis is true (and 60% that it is false); it is the probability of seeing data at least as extreme as ours if the null hypothesis were true. Whenever the p-value is greater than alpha (the significance level) we fail to reject the null hypothesis; the usual companion to alpha = 0.05 is a 95% confidence interval, and with such a large p-value we therefore fail to reject the null hypothesis. Regarding the second statement, the Type I error rate is reduced by choosing a smaller alpha: setting alpha = 0.01 means we accept only a 1% chance of rejecting a true null hypothesis, so if p comes out to be 0.02 we would still fail to reject the null hypothesis at that level.
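A one-line sketch of the decision rule, with the hypothetical values from the second statement:

# Sketch: reject H0 only when p <= alpha
alpha <- 0.01
p <- 0.02
p <= alpha   # FALSE -> fail to reject H0 at the 1% level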

Q15

*ANSWER: A Type I error occurs when the result is a false positive, and a Type II error occurs when the result is a false negative; alpha is the probability of committing a Type I error, while beta is the probability of committing a Type II error. Setting alpha = 0.05 means we accept a 5% chance of rejecting a true null hypothesis. Because we do not know at the outset whether the null hypothesis should be rejected, alpha and beta are considered together: a larger alpha makes a Type I error more likely, and a larger beta makes a Type II error more likely. A common convention is to pair alpha = 0.05 with beta = 0.2 (80% power), although beta also depends on the effect size and the sample size. When the sample size is large enough (more than 30) a z-test can be used for hypothesis testing. We usually have no control over the effect size, but as the effect size increases the sampling distribution moves further away from the null, which reduces beta. [Word Count: XX]*
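The interplay between alpha, beta (power), effect size, and sample size can be explored with power.t.test; a sketch with hypothetical numbers, not values from this assignment:

# Sketch: sample size per group needed for 80% power (beta = 0.2) at alpha = 0.05,
# assuming a hypothetical effect of 5 points and a standard deviation of 10
power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.8)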

Q16

ANSWER: I would like to