Please put your answers here, following the instructions in the assignment description. Do not change the arguments at the top of the code chunks, and put your answers and word count tallies in the locations indicated. Remember to knit as you go, and submit the knitted version of this on Canvas.
do_new <- do %>%
  rowwise() %>%
  mutate(morescary = if_else(scariness>loudness,"TRUE","FALSE")) %>%
  mutate(personality = mean(c(scariness,loudness))) %>%
  select(-c(scariness,loudness))
tibble(do_new)
## # A tibble: 7 × 5
## name species height morescary personality
## <chr> <chr> <dbl> <chr> <dbl>
## 1 super size bear 164 TRUE 8.5
## 2 big wol owl 32 TRUE 6
## 3 rainbow unicorn 18 FALSE 3.5
## 4 hugo hippo 22 TRUE 5.5
## 5 sissily snake 4 TRUE 5
## 6 little blue penguin 18 FALSE 4
## 7 kevin guitar 25 FALSE 4
dh_new <- pivot_longer(dh,cols=c("lfb", "rainbow" , "diff" ),names_to = 'what',values_to='score')
dh_new
## # A tibble: 180 × 3
## question what score
## <chr> <chr> <dbl>
## 1 q1 lfb 49.7
## 2 q1 rainbow 40.9
## 3 q1 diff 8.8
## 4 q2 lfb 40.3
## 5 q2 rainbow 35.8
## 6 q2 diff 4.5
## 7 q3 lfb 29.7
## 8 q3 rainbow 26.8
## 9 q3 diff 2.9
## 10 q4 lfb 38
## # … with 170 more rows
ggplot(dh_new) +
aes(x = score, fill = what) +
geom_histogram(bins = 30L) +
scale_fill_hue(direction = 1) +
labs(x = "Score of each question", y = "No. of observations", title = "1D distribution of 3 answers from all groups",
fill = "Answers") +
ggthemes::theme_base() +
theme(legend.position = "bottom") +
facet_wrap(vars(what),
scales = "free_x")
a <- dh_new %>% filter(what=='lfb')
lapply(a['score'], shapiro.test)
## $score
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.98257, p-value = 0.5462
b <- dh_new %>% filter(what=='rainbow')
lapply(b['score'], shapiro.test)
## $score
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.98162, p-value = 0.5009
c <- dh_new %>% filter(what=='diff')
lapply(c['score'], shapiro.test)
## $score
##
## Shapiro-Wilk normality test
##
## data: X[[i]]
## W = 0.98602, p-value = 0.7232
*ANSWER: The Shapiro-Wilk normality test was performed to check the normality of the scores for each of the three classes (lfb, rainbow and diff). The two hypotheses are: Null hypothesis H0: Data is normal. Alternative hypothesis HA: Data is not normal.
When we apply these hypotheses to the score values of the 3 classes, the p-values (0.5462, 0.5009 and 0.7232) are all greater than \(\alpha\) = 0.05, so at the 5% significance level we fail to reject the null hypothesis: there is no evidence that any of the three sets of scores departs from normality. [Word Count: XX]*
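The three separate filter-and-test steps above can also be written as one grouped call. The sketch below is only an illustration: `sim` is a simulated stand-in for `dh_new` (the values are made up), and the rule "reject normality when p < 0.05" is the decision rule described in the answer.

```r
# Simulated stand-in for dh_new: three groups of 60 scores (values are made up).
set.seed(1)
sim <- data.frame(
  what  = rep(c("lfb", "rainbow", "diff"), each = 60),
  score = c(rnorm(60, 35, 7), rnorm(60, 33, 7), rnorm(60, 1, 5))
)

# Run shapiro.test() once per group and keep only the p-values.
p_values <- sapply(split(sim$score, sim$what),
                   function(s) shapiro.test(s)$p.value)

# Decision at alpha = 0.05: TRUE means "reject H0 (data is normal)".
alpha <- 0.05
reject_normality <- p_values < alpha

print(p_values)
print(reject_normality)
```

A group is flagged as non-normal only when its p-value falls below alpha; a p-value above alpha means we fail to reject normality.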
t.test(a['score'],b['score'])
##
## Welch Two Sample t-test
##
## data: a["score"] and b["score"]
## t = 1.0137, df = 117.19, p-value = 0.3128
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.198303 3.711636
## sample estimates:
## mean of x mean of y
## 34.79500 33.53833
ANSWER: We used a Welch two-sample t-test to compare the answers of rainbow and LFB, with the hypotheses: Null hypothesis H0: The mean scores for LFB and rainbow are equal. Alternative hypothesis HA: The mean scores for LFB and rainbow are different. The p-value of 0.3128 is greater than 0.05 (5% significance level), so we fail to reject the null hypothesis: the mean scores of LFB (34.80) and rainbow (33.54) are not significantly different, and the 95% confidence interval for the difference (-1.20 to 3.71) includes 0. Each group has 60 scores (120 in total), so the test may lack the power to detect a difference this small, which could be one reason for such a high p-value. [Word Count: XX]
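As a check on how the printed numbers fit together, the sketch below recomputes the two-sided p-value from the t statistic and degrees of freedom reported in the output above. The variable names are only illustrative; the numbers are copied from the output.

```r
# Values copied from the Welch t-test output above.
t_stat <- 1.0137
dof    <- 117.19
mean_x <- 34.79500            # LFB
mean_y <- 33.53833            # rainbow
ci     <- c(-1.198303, 3.711636)

# Two-sided p-value: twice the tail area beyond |t|.
p_value <- 2 * pt(-abs(t_stat), df = dof)

# The estimated difference in means sits in the middle of the interval.
mean_diff <- mean_x - mean_y
ci_contains_zero <- ci[1] < 0 && ci[2] > 0

# Decision at the 5% significance level.
alpha <- 0.05
decision <- if (p_value < alpha) "reject H0" else "fail to reject H0"

print(p_value)
print(mean_diff)
print(decision)
```

A confidence interval that contains 0 and a p-value above alpha are two views of the same conclusion.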
a = table(dp$problem, dp$improved)
chisq.test(a)
##
## Pearson's Chi-squared test
##
## data: a
## X-squared = 0.057744, df = 2, p-value = 0.9715
*ANSWER: We applied the chi-square test to the following two hypotheses: Null hypothesis H0: The distribution of health problems is not significantly different from the govt standard. Alternative hypothesis HA: The distribution of health problems is significantly different from the govt standard.
From the result of our test we fail to reject the null hypothesis, since the p-value (0.9715) is greater than the 5% significance level. Note that the degrees of freedom here is 2, which is n-1 for the number of health problems. [Word Count: XX]*
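The p-value in the output can be recovered from the statistic and the degrees of freedom, as sketched below. The second part uses a made-up 3 x 2 table (`toy`, not the real `dp` data) to show that for a two-way table `chisq.test()` uses (rows - 1) x (columns - 1) degrees of freedom; that `dp` has 3 problems and 2 improvement levels is an assumption.

```r
# Values copied from the chi-squared output above.
chi_sq <- 0.057744
dof    <- 2

# Upper-tail area of the chi-squared distribution beyond the statistic.
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)
print(p_value)

# Hypothetical 3 x 2 table of counts (rows = problems, columns = improved no/yes).
toy <- matrix(c(12, 15, 9,
                11, 14, 10), nrow = 3)
res <- chisq.test(toy)

# Degrees of freedom: (3 - 1) * (2 - 1).
toy_df <- unname(res$parameter)
print(toy_df)
```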
pt(0.8,133)
## [1] 0.7874313
*ANSWER: The probability of seeing 134 or more non-food visits is 78__%.*
ANSWER: The calculated probability of about 78.7% is the area under the
t-distribution curve to the left of t = 0.8, so about 21.3% of the area
remains in the right tail. We can calculate that upper-tail probability
with the pt function as 1-pt(0.8,133), which gives the
remaining 21.3% of the curve on the right side of the t-curve.
Remember that the second argument is the degrees of freedom, which is equal
to n-1 where n = sample size.
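The relation between the lower tail, the upper tail and the quantile function can be sketched as follows, using the same t = 0.8 and 133 degrees of freedom as above.

```r
# Lower-tail area: P(T <= 0.8) with 133 degrees of freedom.
lower <- pt(0.8, df = 133)

# Upper-tail area, computed two equivalent ways.
upper_a <- 1 - lower
upper_b <- pt(0.8, df = 133, lower.tail = FALSE)

# qt() is the inverse of pt(): it maps a probability back to a t-value.
t_back <- qt(lower, df = 133)

print(c(lower = lower, upper = upper_a, t_back = t_back))
```

`pt()` turns a t-value into a probability and `qt()` turns a probability into a t-value, so the two are not interchangeable.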
dd_sum <- dd %>% group_by(size,time) %>% summarise(mean(health),median(health),sd(health))
## `summarise()` has grouped output by 'size'. You can override using the
## `.groups` argument.
colnames(dd_sum) <- c("size" , "time" , "meanHealth", "mdnHealth" , "sdHealth" )
dd_sum
## # A tibble: 12 × 5
## # Groups: size [4]
## size time meanHealth mdnHealth sdHealth
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 enormous t1 70.3 66 10.2
## 2 enormous t2 61.7 59 8.33
## 3 enormous t3 51 46 10.4
## 4 large t1 81 77.5 12.8
## 5 large t2 73.9 73 13.0
## 6 large t3 65.5 68 13.6
## 7 medium t1 77.6 75 11.2
## 8 medium t2 NA NA NA
## 9 medium t3 67.6 64 13.4
## 10 small t1 72.2 76.5 12.8
## 11 small t2 69.5 73.5 13.1
## 12 small t3 65.1 68.5 11.6
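The `NA` row for medium at t2 appears because `mean()`, `median()` and `sd()` return `NA` when any value is missing. The sketch below uses a made-up table (`toy`, not the real `dd` data) to show one way to name the summary columns directly and skip missing values. Whether dropping missing values is appropriate for this assignment is an assumption.

```r
library(dplyr)

# Hypothetical stand-in for dd; one health value is missing on purpose.
toy <- tibble(
  size   = c("small", "small", "small", "large", "large", "large"),
  time   = c("t1", "t1", "t1", "t1", "t1", "t1"),
  health = c(70, 80, NA, 60, 65, 70)
)

# Name the columns inside summarise() and ignore missing values.
toy_sum <- toy %>%
  group_by(size, time) %>%
  summarise(
    meanHealth = mean(health, na.rm = TRUE),
    mdnHealth  = median(health, na.rm = TRUE),
    sdHealth   = sd(health, na.rm = TRUE),
    .groups = "drop"
  )

print(toy_sum)
```

Naming the columns in `summarise()` removes the need for a separate `colnames()` step, and `.groups = "drop"` returns an ungrouped table without the grouping message.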
ggplot(dd_sum2) +
aes(x = time, fill = time, weight = meanHealth) +
geom_bar(color="black") +
scale_fill_brewer(palette = "Blues",
direction = 1) +
labs(x = "Time", y = "Health") +
ggthemes::theme_base() +
facet_wrap(vars(size))+ geom_errorbar(aes(ymin=meanHealth-sdHealth,ymax=meanHealth+sdHealth), width=.2, position=position_dodge(.9))
ANSWER: The highest mean health rating is for the large size group (81 at t1). Mean health ranges from about 51 to 81 across sizes and times, and it declines from t1 to t3 in every size group. [Word Count: XX]
ggplot(dd) +
aes(x = income, y = health, fill = size) +
geom_point(shape = "circle filled", size = 3L,
colour = "#112446") +
scale_fill_hue(direction = 1) +
labs(x = "Income of participant", y = "Health problem") +
ggthemes::theme_base() +
facet_wrap(vars(size))
## Warning: Removed 2 rows containing missing values (`geom_point()`).
*ANSWER: In the dd dataset we observe that there are four types of species, from enormous to small, based on their size. We looked for a trend in the dataset using faceted scatter plots. The graph above shows quite a mix of sizes across the sample. The independent variable in our graph is income, and the dependent variable is health rating. We can observe that for the persons with large size the health rating is higher than for enormous and small size persons. In all cases the points cluster within a certain range of income. We can't explicitly say that there is a linear trend between income and health rating for the participants. Overall, the highest rating is observed for the large persons with medium income of 80-120.
The t-statistic does depend on sample size, through the standard error and the degrees of freedom, so its value is the same in all cases only when the same data are being compared. What the t-test checks is the difference in mean value between the 2 groups.
A negative t-value shows a reversal in the direction of the effect being studied, yet it has no effect on the p-value of a two-sided test, which is generally compared against a 5% significance level. t-values with a negative or positive sign are the same, except that a negative value is taken from the left side of the curve. A negative t-value on its own does not mean we can reject the null hypothesis; that decision depends on the p-value. It does indicate that in the formula of the t-test Foxy has put the lower mean before the larger mean. [Word Count: XX]*
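The point about the sign of the t-value can be checked with a small simulation. The data below are made up (two groups of 30 values each); only the order of the two arguments changes between the two calls.

```r
# Two made-up groups; x has the lower mean.
set.seed(42)
x <- rnorm(30, mean = 50, sd = 10)
y <- rnorm(30, mean = 55, sd = 10)

# Same data, arguments in opposite order.
xy <- t.test(x, y)
yx <- t.test(y, x)

t_xy <- unname(xy$statistic)
t_yx <- unname(yx$statistic)

# The t-values differ only in sign; the two-sided p-values are identical.
print(c(t_xy = t_xy, t_yx = t_yx))
print(c(p_xy = xy$p.value, p_yx = yx$p.value))
```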
ANSWER: p and alpha values are used in hypothesis tests to decide whether to reject the null hypothesis. The first statement, with p=0.4, is not correct if it is read as a 40% probability that the null hypothesis is true and 60% for rejecting it: the p-value is the probability of getting data at least this extreme if the null hypothesis is true. Whenever the p-value is greater than alpha (the significance level) we fail to reject the null hypothesis. Another parameter in hypothesis tests is the confidence level, which is normally chosen as 95%. With such a large p-value and alpha = 0.05, we fail to reject the null hypothesis. Regarding the second statement, Type I error can be reduced by choosing a smaller value for alpha. If we set alpha=0.01 it means we have a 1% chance of rejecting a true null hypothesis, and after the hypothesis test we fail to reject the null hypothesis even if p comes out to be 0.02.
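The decision rule in this answer can be written as a tiny helper. The function name `decide` is made up for illustration, and the convention "reject when p is at most alpha" is one common choice.

```r
# Hypothetical helper: compare a p-value against a significance level.
decide <- function(p, alpha) {
  if (p <= alpha) "reject H0" else "fail to reject H0"
}

# p = 0.4 is far above alpha = 0.05.
d1 <- decide(0.4, alpha = 0.05)

# p = 0.02 leads to different decisions under alpha = 0.05 and alpha = 0.01.
d2 <- decide(0.02, alpha = 0.05)
d3 <- decide(0.02, alpha = 0.01)

print(c(d1, d2, d3))
```

The same p-value of 0.02 is significant at alpha = 0.05 but not at alpha = 0.01, which is how a smaller alpha reduces Type I errors.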
*ANSWER: Type I error occurs when the result is a false positive and Type II error occurs when the result is a false negative. alpha is the probability of committing a type I error while beta is the probability of committing a type II error. When we set alpha=0.05 it means we accept a 5% probability of a type I error. Generally we do not know at the initial stage whether we should reject the null hypothesis, which is why alpha and beta are used together. A larger alpha makes type I errors more likely and type II errors less likely, and vice versa. Beta is not fixed by alpha; a conventional choice alongside alpha = 0.05 is beta = 0.2 (power of 0.8). When the sample size is large enough (more than 30) we can use a z-test for hypothesis testing, and a larger sample size also lowers beta. With regards to effect size, we do not have control over it. As the effect size increases, the sampling distribution moves further from the null distribution, which lowers beta.*
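How alpha, beta, effect size and sample size trade off can be explored with `power.t.test()` from base R's stats package. The numbers below (n = 30 per group, sd = 10, differences of 2, 5 and 8) are made up for illustration.

```r
# Power (1 - beta) of a two-sample t-test for a small and a large effect.
power_small <- power.t.test(n = 30, delta = 2, sd = 10, sig.level = 0.05)$power
power_large <- power.t.test(n = 30, delta = 8, sd = 10, sig.level = 0.05)$power

# A stricter alpha lowers power (raises beta) for the same effect and n.
power_strict <- power.t.test(n = 30, delta = 8, sd = 10, sig.level = 0.01)$power

# Sample size per group needed for beta = 0.2 (power 0.8) at alpha = 0.05.
n_needed <- power.t.test(delta = 5, sd = 10, sig.level = 0.05, power = 0.8)$n

print(c(beta_small = 1 - power_small, beta_large = 1 - power_large))
print(c(power_strict = power_strict, n_needed = n_needed))
```

Beta shrinks as the effect size or the sample size grows, and it grows when alpha is made stricter.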
ANSWER: I would like to
Solve the health problems of the participants.
Improve their statistics knowledge.
Increase the health rating of the participants.
Get more data with more species.
Remove the scariness from the persons.