DSCI 602: Project II

(10 pts) The variable brozek in the bodyfat dataset (see below), denotes the percent body fat using Brozek’s equation 457/Density−414.2. What is a good candidate probability distribution for modeling the variable brozek? Estimate the parameters of the candidate distribution using the method of moments estimation.

Brozek looks to be approximately normal. Using the method of moments estimation, we can determine that the mu should be equal to the sample mean, and the sigma square should be equal to the sample second moment minus the sample mean square.

The sample mean for brozek is 17.2333 (and thus the mu). The second moment for brozek is 354.59. The sigma squared for brozek is 57.6022.

Thus, brozek is approximately normal with mu = 17.2333 and sigma squared = 57.6022.

(10 pts) The variable siri in the bodyfat dataset (see below), denotes the percent body fat using Siri’s equation 495/Density − 450. What is a good candidate probability distribution for modeling the variable siri? Estimate the parameters of the candidate distribution using the method of maximum likelihood.

The variable siri looks to be approximately normal. We know from previous lessons that the method of maximum likelihood for Normal gives mu as the sample mean and sigma squared as the sample variance.

The sample mean for siri is 17.2833. The sample variance is 80.9937.

Thus, siri is approximately normal with mu = 17.2833 and sigma squared = 80.9937.

(15 pts) Consider the variable chest denoting Chest circumference (cm) in the bodyfat data. What is a good candidate probability distribution for modeling the variable chest? Assume µ and σ2 denote the population means and variance of chest circumference. Demonstrate the sampling distribution of the sample average of chest circumference. Construct a 95% confidence interval for the actual average chest circumference.

I had an issue pulling the data set as originally intended, but I was able to find the csv file on Kaggle.

bodyfat <- read.csv("bodyfat.csv", header=T)
dim(bodyfat)

## [1] 252  15

summary(bodyfat$Chest)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   79.30   94.35   99.65  100.82  105.38  136.20

Since the sample is large, we can generate a 95% confidence interval using the following formula:

\[ \bar{X}\pm t\frac{s}{\sqrt{n}} \]

Which t can be substituted with 1.96 being n>30. Since the mean is only a little bit higher than the median, and both Q1 and Q3 are located nearly at the same distance from the median on both sides, it can be concluded that this distribution is near symmetric, hence a normal distribution can be used to model this variable.

s = 0.25 * (136.20 - 79.3)

mean = mean(bodyfat$Chest)
n = 252 
z = 1.96

LB = mean - 1.96 * (s/sqrt(n))
UB = mean + 1.96 * (s/sqrt(n))

LB

## [1] 99.06787

UB

## [1] 102.5805

So a 95% confidence interval for the actual average chest circumference is (99.06787, 102.5805).

(20 pts) Consider the UCBAdmissions dataset from the R package datasets. Plaintiffs at UC Berkeley contended that the university had been discriminating against female applicants. Among the six departments which one has convincing evidence of discrimination against female applicants. Use both hypothesis testing and confidence intervals to support your decision. Apply Bonferroni correction (see below) to the confidence level and the level of significance.

Load the dataset:

library("datasets")
UCBAdmissions

## , , Dept = A
## 
##           Gender
## Admit      Male Female
##   Admitted  512     89
##   Rejected  313     19
## 
## , , Dept = B
## 
##           Gender
## Admit      Male Female
##   Admitted  353     17
##   Rejected  207      8
## 
## , , Dept = C
## 
##           Gender
## Admit      Male Female
##   Admitted  120    202
##   Rejected  205    391
## 
## , , Dept = D
## 
##           Gender
## Admit      Male Female
##   Admitted  138    131
##   Rejected  279    244
## 
## , , Dept = E
## 
##           Gender
## Admit      Male Female
##   Admitted   53     94
##   Rejected  138    299
## 
## , , Dept = F
## 
##           Gender
## Admit      Male Female
##   Admitted   22     24
##   Rejected  351    317

We want to find the confidence intervals and hypothesis tests for differences in proportions The CI formula we will be using is \[ (\hat{p}_{male}-\hat{p}_{female})\pm z *\sqrt{n} \] Where,

\[ S = \sqrt{(\hat{p}*(1-\hat{p})*(\frac{1}{n_1} + \frac{1}{n_2}))} \]

We will also be using the Bonferroni correction. This leads to the confidence intervals being (1 - 6 * 0.05) = 70% Confidence Intervals. Thus, Z = 1.04.

Dept A:

p_hat_male = 0.621, p_hat_female = 0.824, s= 0.049, N = 933 CI for A: LB = -0.205, UB = -0.201

Dept B:

p_hat_male = 0.630, p_hat_female = 0.68, N = 585, s = 0.0986 CI for B: LB = -0.0542, UB = -0.0458

Dept C:

p_hat_male = 0.369, p_hat_female = 0.341, N = 918, s = 0.0329 CI for C: LB = 0.0269, UB = 0.0291

Dept D:

p_hat_male = 0.331, p_hat_female = 0.349, N = 792, s = 0.0337 CI for D: LB = -0.0192, UB = -0.0168

Dept E:

p_hat_male = 0.277, p_hat_female = 0.239, N = 584, s = 0.0383 CI for E: LB = 0.0364, UB = 0.0396

Dept F:

p_hat_male = 0.059, p_hat_female = 0.070, N = 714, s = 0.0184 CI for F: LB = -0.0117, -0.0103

Departments A, B, D, and F definitely have no evidence of discrimination against female students because the acceptance for females is higher than males.

Since we are measuring for discrimination against female students (i.e. the difference being > 0), we only need to test Depts that have female proportion lower than male, which is Depts C and E.

The Bonferroni leads to the hypothesis tests needing 0.05/2 = 0.025 as the probability of Type I error.

\[ H_0 = \hat{p}_{male} - \hat{p}_{female} = 0 \] \[ H_a = \hat{p}_{male} - \hat{p}_{female} > 0 \] \[ Z_C = 0.8511, Z_E = 0.992 \] We need a Z of 1.96 to reach an alpha of 0.025. Thus, we will fail to reject the null in both cases.

Final Answer: There does not seem to be any evidence that female students are discriminated against in any of the departments.

(15 pts) Let µd and µnd denote average insulin levels for diabetic and non-diabetic women in the Pima Indian population. Suppose you are interested in the difference in the average insulin levels in the diabetic and non-diabetic women in the Pima Indian population. Using both confidence intervals (CIs) and hypothesis testing make a decision about your inquiry. State assumptions, if any, and show all the steps of your hypothesis test including the null and alternative hypotheses, level of significance, test statistic, and p-value. Discuss if the results from CIs and hypothesis testing are along the same line.

pima <- read.csv("PI_diabetes.csv", header=T)
x_bar_d = mean(pima[pima$Outcome == 1, "Insulin"])
y_bar_nd = mean(pima[pima$Outcome == 0, "Insulin"])
s_d = sd(pima[pima$Outcome == 1, "Insulin"])
s_nd = sd(pima[pima$Outcome == 0, "Insulin"])
n_d = sum(pima[pima$Outcome == 1, "Outcome"])
n_nd = 500

x_bar_d = 100.336, y_bar_nd = 68.792, s_d = 138.689, s_nd = 98.865, n_d 268, n_nd = 500

We will be doing a difference in means CI and hypothesis test.

The formula for the CI is \[ (\bar{x}_d - \bar{y}_{nd}) \pm z*\frac{s}{\sqrt{n}} \] where Z = 1.96 (assuming 95% CI), N = 768, and s = 9.556

So, the LB = 30.868, UB = 32.220.

For the hypothesis test: \[ H_0 = \mu_d - \mu_{nd} = 0 \] \[ H_a = \mu_d - \mu_{nd}\neq0 \] The level of significance is 0.05/2 or 0.025.

The test statistic is 3.3.

The p-value is 0.00048.

Thus, we can reject the null and conclude there is evidence that the insulin levels between people with diabetes and people without diabetes are different. This also lines up with what the CI is telling us.

(10 pts) The National Fire Incident Reporting Service stated that, among residential fires, 73% are in family homes, 20% are in apartments, and 7% are in other types of dwellings. If four residential fires are independently reported on a single day, what is the probability that two are in family homes, one is in an apartment, and one is in another type of dwelling? To answer this question, first identify the random variable(s), distribution, and the parameter(s) of the distribution.

\[ P(Y_1 = 2, Y_2=1, Y_3=1)=\frac{4!}{2!1!1!}(0.73)^2(0.2)(0.07)\approx0.089 \]

n = 4

Y_1 = 2
p_1 = 0.73

Y_2 = 1
p_2 = 0.2

Y_3 = 1
p_3 = 0.07 

Prob = (factorial(n)/(factorial(Y_1)*factorial(Y_2)*factorial(Y_3)))*((p_1)^2)*(p_2)*(p_3)
Prob

## [1] 0.0895272

(10 pts) Consider the variable Insulin in the Pima Indian Data set. Plot two density estimates for the distribution of Insulin on the same graph. Use plot() for the first and then lines() for the second.

R Code:

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

pima = read.csv("PI_diabetes.csv", header=TRUE, sep = ",")

diab = select(pima, Insulin, Outcome) %>% filter(Outcome == 1)

non_diab = select(pima, Insulin, Outcome) %>% filter(Outcome == 0)

# Compare Insulin levels between diabetic and non-diabetic patients
db = density(diab$Insulin)
ndb = density(non_diab$Insulin)
plot(db, 
     xlim = c(0,800),
     ylim = c(0,0.01),
     main = "Pima Indian Women Insulin Lvls")
lines(ndb, col = 2)
legend("topright",
       c("Diabetic", "Non-Diabetic"),
       col = 1:2,
       lty = 1)

(10 pts) Suppose the distribution of some random variable X is modeled as Negative binomial with parameters, p and r, where p is the probability of success and r is the total number of successes over the trials. Find the method of moment (MM) estimators for (p, r), in closed algebraic form expressions based on a random sample of size n, X1, X2, . . . , Xn.

\[ X-NB(r, p) \] \[ X_1, X_2, . . . X_n,\:\bar{X} = E(X) = \frac{1}{n}\sum_{i=1}^{n}X_i \]

Since,

\[ E(X) = \frac{r(1-p)}{p} \] \[ V(X) = \frac{r(1-p)}{p^2} * \frac{1}{p} \Longrightarrow p = \frac{E(X)}{V(X)}= \frac{E(X)}{E(X^2)-(E(X))^2} \]

Thus, \[ \hat{p}_{MM}=\frac{\frac{1}{n}\sum_{i=n}^{n}X_i}{\frac{1}{n}\sum_{i=n}^{n}X_{i}^2-(\frac{1}{n}\sum_{i=n}^{n}X_i)^2} \]

\[ \hat{r}_{MM}=\frac{pE(X)}{(1-p)}= \frac{\frac{1}{n}\sum_{i=n}^{n}X_i\:\hat{p}_{MM}}{1-\hat{p}_{MM}} \]

DSCI 602: Project II

Cordell Jones, Zak Gray, and Michael Noonan

7/23/2021