Brozek looks to be approximately normal. Using the method of moments estimation, we can determine that the mu should be equal to the sample mean, and the sigma square should be equal to the sample second moment minus the sample mean square.
The sample mean for brozek is 17.2333 (and thus the mu). The second moment for brozek is 354.59. The sigma squared for brozek is 57.6022.
Thus, brozek is approximately normal with mu = 17.2333 and sigma squared = 57.6022.
The variable siri looks to be approximately normal. We know from previous lessons that the method of maximum likelihood for Normal gives mu as the sample mean and sigma squared as the sample variance.
The sample mean for siri is 17.2833. The sample variance is 80.9937.
Thus, siri is approximately normal with mu = 17.2833 and sigma squared = 80.9937.
I had an issue pulling the data set as originally intended, but I was able to find the csv file on Kaggle.
bodyfat <- read.csv("bodyfat.csv", header=T)
dim(bodyfat)
## [1] 252 15
summary(bodyfat$Chest)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 79.30 94.35 99.65 100.82 105.38 136.20
Since the sample is large, we can generate a 95% confidence interval using the following formula:
\[ \bar{X}\pm t\frac{s}{\sqrt{n}} \]
Which t can be substituted with 1.96 being n>30. Since the mean is only a little bit higher than the median, and both Q1 and Q3 are located nearly at the same distance from the median on both sides, it can be concluded that this distribution is near symmetric, hence a normal distribution can be used to model this variable.
s = 0.25 * (136.20 - 79.3)
mean = mean(bodyfat$Chest)
n = 252
z = 1.96
LB = mean - 1.96 * (s/sqrt(n))
UB = mean + 1.96 * (s/sqrt(n))
LB
## [1] 99.06787
UB
## [1] 102.5805
So a 95% confidence interval for the actual average chest circumference is (99.06787, 102.5805).
Load the dataset:
library("datasets")
UCBAdmissions
## , , Dept = A
##
## Gender
## Admit Male Female
## Admitted 512 89
## Rejected 313 19
##
## , , Dept = B
##
## Gender
## Admit Male Female
## Admitted 353 17
## Rejected 207 8
##
## , , Dept = C
##
## Gender
## Admit Male Female
## Admitted 120 202
## Rejected 205 391
##
## , , Dept = D
##
## Gender
## Admit Male Female
## Admitted 138 131
## Rejected 279 244
##
## , , Dept = E
##
## Gender
## Admit Male Female
## Admitted 53 94
## Rejected 138 299
##
## , , Dept = F
##
## Gender
## Admit Male Female
## Admitted 22 24
## Rejected 351 317
We want to find the confidence intervals and hypothesis tests for differences in proportions The CI formula we will be using is \[ (\hat{p}_{male}-\hat{p}_{female})\pm z *\sqrt{n} \] Where,
\[ S = \sqrt{(\hat{p}*(1-\hat{p})*(\frac{1}{n_1} + \frac{1}{n_2}))} \]
We will also be using the Bonferroni correction. This leads to the confidence intervals being (1 - 6 * 0.05) = 70% Confidence Intervals. Thus, Z = 1.04.
Dept A:
p_hat_male = 0.621, p_hat_female = 0.824, s= 0.049, N = 933 CI for A: LB = -0.205, UB = -0.201
Dept B:
p_hat_male = 0.630, p_hat_female = 0.68, N = 585, s = 0.0986 CI for B: LB = -0.0542, UB = -0.0458
Dept C:
p_hat_male = 0.369, p_hat_female = 0.341, N = 918, s = 0.0329 CI for C: LB = 0.0269, UB = 0.0291
Dept D:
p_hat_male = 0.331, p_hat_female = 0.349, N = 792, s = 0.0337 CI for D: LB = -0.0192, UB = -0.0168
Dept E:
p_hat_male = 0.277, p_hat_female = 0.239, N = 584, s = 0.0383 CI for E: LB = 0.0364, UB = 0.0396
Dept F:
p_hat_male = 0.059, p_hat_female = 0.070, N = 714, s = 0.0184 CI for F: LB = -0.0117, -0.0103
Departments A, B, D, and F definitely have no evidence of discrimination against female students because the acceptance for females is higher than males.
Since we are measuring for discrimination against female students (i.e. the difference being > 0), we only need to test Depts that have female proportion lower than male, which is Depts C and E.
The Bonferroni leads to the hypothesis tests needing 0.05/2 = 0.025 as the probability of Type I error.
\[ H_0 = \hat{p}_{male} - \hat{p}_{female} = 0 \] \[ H_a = \hat{p}_{male} - \hat{p}_{female} > 0 \] \[ Z_C = 0.8511, Z_E = 0.992 \] We need a Z of 1.96 to reach an alpha of 0.025. Thus, we will fail to reject the null in both cases.
Final Answer: There does not seem to be any evidence that female students are discriminated against in any of the departments.
pima <- read.csv("PI_diabetes.csv", header=T)
x_bar_d = mean(pima[pima$Outcome == 1, "Insulin"])
y_bar_nd = mean(pima[pima$Outcome == 0, "Insulin"])
s_d = sd(pima[pima$Outcome == 1, "Insulin"])
s_nd = sd(pima[pima$Outcome == 0, "Insulin"])
n_d = sum(pima[pima$Outcome == 1, "Outcome"])
n_nd = 500
x_bar_d = 100.336, y_bar_nd = 68.792, s_d = 138.689, s_nd = 98.865, n_d 268, n_nd = 500
We will be doing a difference in means CI and hypothesis test.
The formula for the CI is \[ (\bar{x}_d - \bar{y}_{nd}) \pm z*\frac{s}{\sqrt{n}} \] where Z = 1.96 (assuming 95% CI), N = 768, and s = 9.556
So, the LB = 30.868, UB = 32.220.
For the hypothesis test: \[ H_0 = \mu_d - \mu_{nd} = 0 \] \[ H_a = \mu_d - \mu_{nd}\neq0 \] The level of significance is 0.05/2 or 0.025.
The test statistic is 3.3.
The p-value is 0.00048.
Thus, we can reject the null and conclude there is evidence that the insulin levels between people with diabetes and people without diabetes are different. This also lines up with what the CI is telling us.
\[ P(Y_1 = 2, Y_2=1, Y_3=1)=\frac{4!}{2!1!1!}(0.73)^2(0.2)(0.07)\approx0.089 \]
n = 4
Y_1 = 2
p_1 = 0.73
Y_2 = 1
p_2 = 0.2
Y_3 = 1
p_3 = 0.07
Prob = (factorial(n)/(factorial(Y_1)*factorial(Y_2)*factorial(Y_3)))*((p_1)^2)*(p_2)*(p_3)
Prob
## [1] 0.0895272
R Code:
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.0.5
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
pima = read.csv("PI_diabetes.csv", header=TRUE, sep = ",")
diab = select(pima, Insulin, Outcome) %>% filter(Outcome == 1)
non_diab = select(pima, Insulin, Outcome) %>% filter(Outcome == 0)
# Compare Insulin levels between diabetic and non-diabetic patients
db = density(diab$Insulin)
ndb = density(non_diab$Insulin)
plot(db,
xlim = c(0,800),
ylim = c(0,0.01),
main = "Pima Indian Women Insulin Lvls")
lines(ndb, col = 2)
legend("topright",
c("Diabetic", "Non-Diabetic"),
col = 1:2,
lty = 1)
\[ X-NB(r, p) \] \[ X_1, X_2, . . . X_n,\:\bar{X} = E(X) = \frac{1}{n}\sum_{i=1}^{n}X_i \]
Since,
\[ E(X) = \frac{r(1-p)}{p} \] \[ V(X) = \frac{r(1-p)}{p^2} * \frac{1}{p} \Longrightarrow p = \frac{E(X)}{V(X)}= \frac{E(X)}{E(X^2)-(E(X))^2} \]
Thus, \[ \hat{p}_{MM}=\frac{\frac{1}{n}\sum_{i=n}^{n}X_i}{\frac{1}{n}\sum_{i=n}^{n}X_{i}^2-(\frac{1}{n}\sum_{i=n}^{n}X_i)^2} \]
\[ \hat{r}_{MM}=\frac{pE(X)}{(1-p)}= \frac{\frac{1}{n}\sum_{i=n}^{n}X_i\:\hat{p}_{MM}}{1-\hat{p}_{MM}} \]