Question 1.

Mean = 100

Population SD = 16

mu <- 100
SD <- 16

1 - pnorm(q = 110, mean = 100, sd = 16)
## [1] 0.2659855

A. Probability Statement:

\(P(X \ge 110) | \mu = 110, \sigma = 16)\)

The Probability a randomly selected person for Jury Duty has an IQ of 110 or higher is about 26.6%.

\(P(\overline{x} \ge 110 | n = 12, \mu = 100, \sigma = 16)\)

#calculate the standard error for a known population sigma

sigma_x <- 16/sqrt(12)
#calculate the z-score
z1 <- (110 - 100) / sigma_x
1 - pnorm(q = z1, mean = 0, sd = 1)
## [1] 0.01519141

Let’s check out answer.

1 - pnorm(q = 110, mean = 100, sd = sigma_x)
## [1] 0.01519141

The probability a randomly selected 12 person Jury having a mean IQ of 110 or above with a population mean of 100 and population SD of 16 is..

1.52%

Question 2.

Given:

n = 331

p_hat = .48

Null and Alternative Hypothesis

Ho: \(\pi \ge .50\) - The assumed condition is that half or majority of people do not go to school because they cannot afford it. (status quo)

Ha: \(\pi < .50\) - This is what we are trying to prove: that a minority do not go to school because they cannot afford it.

Significance Level

Significance Level (alpha) = .05

\(\alpha = .05\)

Distribution

Z (proportion) - since we have a large enough sample size (331) we can assume standard normal distribution. Usually with n >= 30, the T distribution becomes largely similar to the Z-distribution.

Test.

We are going to perform a one tailed test. This is because we are testing to see if the true proportion is less than .5. We are testing one side of the distribution.

n2 <- 331
p_hat2 <- .48
alpha2 <- .05

First we will calculate the critical value:

#critical value
critical4 <- qnorm(p = .05, mean = 0, sd = 1)

The critical value is -1.6448 (this represents standard deviations away from the mean given the alpha level).

#test stat
standard_error4 <- sqrt((.48 * .52)/331)
z4 <- (.48 - .5) / (standard_error4)
z4
## [1] -0.7283191

z-score of -.728

#p-value
p_value4 <- pnorm(q = z4, mean = 0, sd = 1)
p_value4
## [1] 0.2332091

P-value of .2332

#make a decison
#method 1
abs(z4) > abs(critical4)
## [1] FALSE
#method 2
myp=function(p, alpha){
  if(p<alpha){print('REJECT Ho')}else{print('FAIL 2 REJECT')}
}
myp(p = p_value4, alpha = .05)
## [1] "FAIL 2 REJECT"

The absolute value of the test-stat is greater than the absolute value of the critical value. This is basis to FAIL TO REJECT THE NULL.

The p-value is greater than the significance value… also BASIS TO FAIL TO REJECT THE NULL

Decision: We FAIL TO REJECT THE NULL that the true proportion of people that do not go to college due to financial reasons is greater or equal to .50.

There is not enough evidence to claim a minority of the Americans that decide not to go to college do so because of financial reasons.

Confidence Interval.

#create a confidence interval.

#point estimate +- Z(alpha/2) * standard error

critical_CI <- qnorm(p = .025, mean = 0, sd = 1)
critical_CI
## [1] -1.959964
CI2_upper <- .48 + abs(critical_CI * standard_error4)
CI2_lower <- .48 - abs(critical_CI * standard_error4)

CI2_lower
## [1] 0.4261784
CI2_upper
## [1] 0.5338216

The 95% (double sided) confidence interval for the true proportion of Americans that did not go to college because they could not afford it is

(.4262, .5338).

B. I would expect a confidence interval (alpha = .05, double sided) for the proportion of American adults who decide not to go to college for financial reasons to include .5.

As seen above the 95% CI does include the proportion of .5. Also, the statistical tests performed above did not have enough statistical evidence to reject the null. Therefore we FAILED TO REJECT THE NULL that the true proportion is greater or equal to .5. For both of those reasons, .5 is expected to be in this 95% CI.

Question 3.

n1 = 22 , mu1 = 4.9, s1 = 1.8

n2 = 22, mu2 = 6.1, s2 = 1.8

mu3_1 <- 4.9
mu3_2 <- 6.1

alpha3  <-  .05

# dist = t, set up the problem
n3_1    <-   22
n3_2   <-   22

df3_1   <-   n3_1-1  #DOF
df3_2   <-   n3_2-1  #DOF

sd3_1  <- 1.8
sd3_2  <- 1.8

var3_1  <- sd3_1^2   #Variances
var3_2 <- sd3_2^2

  
num_point_estimate_diff_3 <- (mu3_1 - mu3_2 )  # point estimate difference 

Se_3 <- sqrt( var3_1/n3_1 + var3_2/n3_2 )  # Se formula - Standard Error using sample standard deviations rather than population standard deviations

t_3   <- num_point_estimate_diff_3 / Se_3

numdf_3 <- (var3_1/n3_1 + var3_2/n3_2)^2                       # Satterthwaite
dendf_3 <- (var3_1/n3_1)^2 / df3_1 + (var3_2/n3_2)^2 / df3_2   # Satterthwaite

df3 <- numdf_3 / dendf_3  

Hypothesis:

Ho: \(\mu_1 = \mu_2\)

Ha: \(\mu1 \ne \mu2\)

Ho: There is no difference in the mean food items recalled between groups.

Ha: There is a difference in the mean food items recalled.

Significance

alpha = .05

Test Stat.

Distribution: T - this is because n is less than 30, so we cannot assume standard normal.

#p-value
p_value3 <- 2 * (1 - pt(q = t_3, df = df3))
p_value3
## [1] 1.967472
#t-stat (calculated above)

t_3
## [1] -2.211083
#critical
critical3 <- qt(p = .025, df3)
critical3
## [1] -2.018082
abs(t_3) > abs(critical3)
## [1] TRUE

Since the absolute value of the T-stat is larger than the absolute value of the critical value, we can REJECT THE NULL. This means there is statistical evidence that there is a difference in the mean food items recalled between groups.

Question 4:

90% CI (65,77).

n = 25

Formula for CI = sample_mean + - Z * (s/sqrt(n))

#step 1 : find z score

z4 <- abs(qnorm(p = .05))
z4
## [1] 1.644854

The z-score here is 1.645 (we will use the absolute value).

#point estimate will be the midpoint of the CI
x_bar4 <- (65 + 77) / 2
x_bar4
## [1] 71

The point estimate from the sample is 71.

#margin of error is going to be the range / 2... or it is the upper limit - point estimate

ME <- (77-65) / 2
ME
## [1] 6
ME2 <- (77-x_bar4)
ME2
## [1] 6

Give us the same answer.

#standard deviation... we know all the other values now.
# s = (margin_error * sqrt(n) / Z)

s4 <- (ME * sqrt(25) / z4)
s4
## [1] 18.2387

sample standard deviation is 18.2387.

SUMMARY:

SAMPLE MEAN = 71

MARGIN OF ERROR = 6

SAMPLE STANDARD DEVIATION = 18.2387

#reverse engineer to check the work.

CI_upper4 <- (x_bar4) + (z4 * (s4/sqrt(25)))
CI_lower4 <- (x_bar4) - (z4 * (s4/sqrt(25)))
CI_lower4
## [1] 65
CI_upper4
## [1] 77

We get back the original CI.

Question 5.

mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
##       control treatment
## alive       4        24
## dead       30        45

The reason we cannot estimate the difference in survival rate using the a confidence interval is…

  1. The sample sizes are relatively small. The control group has n = 34 and the treatment has n = 69. However, both n’s are greater than 30 so we should satisfy this condition.
  2. The population is highly skewed for the the control group. Far more people died than survived. Usually a sample size of 50 or more is recommended for samples with highly skewed data, and here we only have 34.
  3. THE BIG ERROR: When performing hypothesis testing/CIs on proportion data, you want to satisfy the condition that np (number of successes) is 5 or greater, AND n(1-p) (number failures) is 5 or greater. These are not satisfied. In the control group n(1-p) is only equal to 4.
  4. By violating sample size/success/failure requirement you run the risk of obtaining an inaccurate confidence interval. More data should be collected.

Question 6:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
train <- read_csv("train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Clean the Data, remove na values in age
train <- train[!is.na(train$Age), ]

#make sex a binary variable
train <- train %>%
  mutate(Sex_binary = ifelse(Sex == "male", 0, 1)
  )

head(train)
## # A tibble: 6 × 13
##   PassengerId Survived Pclass Name    Sex     Age SibSp Parch Ticket  Fare Cabin
##         <dbl>    <dbl>  <dbl> <chr>   <chr> <dbl> <dbl> <dbl> <chr>  <dbl> <chr>
## 1           1        0      3 Braund… male     22     1     0 A/5 2…  7.25 <NA> 
## 2           2        1      1 Cuming… fema…    38     1     0 PC 17… 71.3  C85  
## 3           3        1      3 Heikki… fema…    26     0     0 STON/…  7.92 <NA> 
## 4           4        1      1 Futrel… fema…    35     1     0 113803 53.1  C123 
## 5           5        0      3 Allen,… male     35     0     0 373450  8.05 <NA> 
## 6           7        0      1 McCart… male     54     0     0 17463  51.9  E46  
## # ℹ 2 more variables: Embarked <chr>, Sex_binary <dbl>
set.seed(100)
train <- sample_n(train, 500)
#Create the logit model

model <- glm(formula = Survived ~ Pclass + Sex_binary + Age, family = binomial(link = 'logit'), data = train)

summary(model)
## 
## Call:
## glm(formula = Survived ~ Pclass + Sex_binary + Age, family = binomial(link = "logit"), 
##     data = train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  2.448847   0.540174   4.533 5.80e-06 ***
## Pclass      -1.335625   0.169602  -7.875 3.41e-15 ***
## Sex_binary   2.654303   0.256141  10.363  < 2e-16 ***
## Age         -0.034247   0.009152  -3.742 0.000183 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 667.85  on 499  degrees of freedom
## Residual deviance: 439.59  on 496  degrees of freedom
## AIC: 447.59
## 
## Number of Fisher Scoring iterations: 5

The logistic regression coefficients is the expencted change in log odds of having the outcome per unit change X

PClass: The PCass variable is defined as follows: 1 = 1st class, 2 = second class, 3 = 3rd class.

Coefficient: -1.335625

exp(-1.335625)
## [1] 0.2629938

Since the coefficient is negative, this is going to be an inverse relationship between PCLass and Survival. This means that as PClass increases (meaning that usually people are poorer) the odds of survival decrease (this is expected based on historical data).

One unit increase in PClass would result in a decrease of about 1.34 log-odds of Survival.

e^coefficient yields a value of .263 meaning for every 1 unit increase in PClass, the odds of survival will decrease by (1-.263 = .737) 73.7%.

SEX_BINARY: Male = 0, Female = 1.

Coefficient = 2.654303

The positive value means that there is a positive relationship in sex and survival (here it means that being a woman would have increased your survival chances).

exp(2.654303)
## [1] 14.21507

For an increase in 1 unit in Sex, the log odds of survival go up 2.654303.

For an increase in 1 unit in Sex, the odds of survival increase by (14.21507 - 1) 1322% !

AGE: The age of the passengers.

Coefficient: -.034247

. This is a negative relationship, so there will be an inverse relationship between age and survival odds (if age goes up, survival odds will decrease).

exp(-.034247)
## [1] 0.9663328

For an increase in 1 unit in age, the log odds of survival decrease by .034247

For an increase in 1 unit in age, the odds of survival decrease by (1 - .9663) = 3.37%

#generate predicted probabilities
predicted_probs <- predict(object = model, type = 'response')
#convert probabilites to binary outcomes
threshold <- .5

binary_predictions <- ifelse(test = predicted_probs >= threshold,
                             yes = 1,
                             no  = 0)

#Attach binary predictions to the original dataset
train$binary_predictions <- binary_predictions

summary(train$binary_predictions)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.364   1.000   1.000
table(train$binary_predictions)
## 
##   0   1 
## 318 182
conf_matrix <- table(train$Survived, train$binary_predictions,
                     dnn = c("Actual", "Predicted"))

conf_matrix
##       Predicted
## Actual   0   1
##      0 261  45
##      1  57 137
true_positives <- conf_matrix[2,2]
false_positives <- conf_matrix[1,2]
true_negatives <- conf_matrix[1,1]
false_negatives <- conf_matrix[2,1]


#calculate sensitivity and specificity
sensitivity <- true_positives/(true_positives + false_negatives)

specificity <- true_negatives/(true_negatives + false_positives)

#print

cat("Sensitivity (True Positive Rate):", sensitivity, "\n")
## Sensitivity (True Positive Rate): 0.7061856

This model (using PClass, Sex, and Age) as predictors for survival in the above logistic model gives a 70.6% accuracy in predicting survival. Meaning, the sensitivity rate of this model (true positives) is 70.6%.

Improved Model : 80% accuracy

#convert probabilites to binary outcomes
threshold2 <- .4

binary_predictions2 <- ifelse(test = predicted_probs >= threshold2,
                             yes = 1,
                             no  = 0)

#Attach binary predictions to the original dataset
train$binary_predictions2 <- binary_predictions2

summary(train$binary_predictions2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   0.000   0.432   1.000   1.000
table(train$binary_predictions2)
## 
##   0   1 
## 284 216
conf_matrix2 <- table(train$Survived, train$binary_predictions2,
                     dnn = c("Actual", "Predicted"))

conf_matrix2
##       Predicted
## Actual   0   1
##      0 246  60
##      1  38 156
true_positives2 <- conf_matrix2[2,2]
false_positives2 <- conf_matrix2[1,2]
true_negatives2 <- conf_matrix2[1,1]
false_negatives2 <- conf_matrix2[2,1]


#calculate sensitivity and specificity
sensitivity2 <- true_positives2/(true_positives2 + false_negatives2)

specificity2 <- true_negatives2/(true_negatives2 + false_positives2)

#print

cat("Sensitivity (True Positive Rate):", sensitivity2, "\n")
## Sensitivity (True Positive Rate): 0.8041237

Lowering the threshold in the binary predictions formula from .5 to .4 will give the model 80% accuracy (80% Sensitivity rate aka “True Positive Rate”).

Question 7:

telecom_customer_churn <- read_csv("~/Data Analysis- Sharma/telecom_customer_churn.csv")
## Rows: 7043 Columns: 38
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (23): Customer ID, Gender, Married, City, Offer, Phone Service, Multiple...
## dbl (15): Age, Number of Dependents, Zip code, Latitude, Longitude, Number o...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
telecom_zipcode_population <- read_csv("~/Data Analysis- Sharma/telecom_zipcode_population.csv")
## Rows: 1671 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Zip Code, Population
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#first we need to rename the zipcode columns because one is capitilzed and one is not, and we need them to be the same.

telecom_customer_churn <- telecom_customer_churn %>%
  rename(ZIPCODE = 'Zip code')
average_age_by_zip <- telecom_customer_churn %>%
  group_by(ZIPCODE) %>%
  summarise(average_age = mean(Age, na.rm = TRUE))
telecom_zipcode_population <- telecom_zipcode_population %>%
  rename(ZIPCODE = 'Zip Code')
head(telecom_customer_churn)
## # A tibble: 6 × 38
##   `Customer ID` Gender   Age Married `Number of Dependents` City         ZIPCODE
##   <chr>         <chr>  <dbl> <chr>                    <dbl> <chr>          <dbl>
## 1 0002-ORFBO    Female    37 Yes                          0 Frazier Park   93225
## 2 0003-MKNFE    Male      46 No                           0 Glendale       91206
## 3 0004-TLHLJ    Male      50 No                           0 Costa Mesa     92627
## 4 0011-IGKFF    Male      78 Yes                          0 Martinez       94553
## 5 0013-EXCHZ    Female    75 Yes                          0 Camarillo      93010
## 6 0013-MHZWF    Female    23 No                           3 Midpines       95345
## # ℹ 31 more variables: Latitude <dbl>, Longitude <dbl>,
## #   `Number of Referrals` <dbl>, `Tenure in Months` <dbl>, Offer <chr>,
## #   `Phone Service` <chr>, `Avg Monthly Long Distance Charges` <dbl>,
## #   `Multiple Lines` <chr>, `Internet Service` <chr>, `Internet Type` <chr>,
## #   `Avg Monthly GB Download` <dbl>, `Online Security` <chr>,
## #   `Online Backup` <chr>, `Device Protection Plan` <chr>,
## #   `Premium Tech Support` <chr>, `Streaming TV` <chr>, …
head(telecom_zipcode_population)
## # A tibble: 6 × 2
##   ZIPCODE Population
##     <dbl>      <dbl>
## 1   90001      54492
## 2   90002      44586
## 3   90003      58198
## 4   90004      67852
## 5   90005      43019
## 6   90006      62784
merged_data <- merge(average_age_by_zip, telecom_zipcode_population, by = "ZIPCODE", all = TRUE)