Mean = 100
Population SD = 16
mu <- 100
SD <- 16
1 - pnorm(q = 110, mean = 100, sd = 16)
## [1] 0.2659855
A. Probability Statement:
\(P(X \ge 110) | \mu = 110, \sigma = 16)\)
The Probability a randomly selected person for Jury Duty has an IQ of 110 or higher is about 26.6%.
\(P(\overline{x} \ge 110 | n = 12, \mu = 100, \sigma = 16)\)
#calculate the standard error for a known population sigma
sigma_x <- 16/sqrt(12)
#calculate the z-score
z1 <- (110 - 100) / sigma_x
1 - pnorm(q = z1, mean = 0, sd = 1)
## [1] 0.01519141
Let’s check out answer.
1 - pnorm(q = 110, mean = 100, sd = sigma_x)
## [1] 0.01519141
1.52%
Given:
n = 331
p_hat = .48
Ho: \(\pi \ge .50\) - The assumed condition is that half or majority of people do not go to school because they cannot afford it. (status quo)
Ha: \(\pi < .50\) - This is what we are trying to prove: that a minority do not go to school because they cannot afford it.
Significance Level (alpha) = .05
\(\alpha = .05\)
Z (proportion) - since we have a large enough sample size (331) we can assume standard normal distribution. Usually with n >= 30, the T distribution becomes largely similar to the Z-distribution.
We are going to perform a one tailed test. This is because we are testing to see if the true proportion is less than .5. We are testing one side of the distribution.
n2 <- 331
p_hat2 <- .48
alpha2 <- .05
First we will calculate the critical value:
#critical value
critical4 <- qnorm(p = .05, mean = 0, sd = 1)
The critical value is -1.6448 (this represents standard deviations away from the mean given the alpha level).
#test stat
standard_error4 <- sqrt((.48 * .52)/331)
z4 <- (.48 - .5) / (standard_error4)
z4
## [1] -0.7283191
z-score of -.728
#p-value
p_value4 <- pnorm(q = z4, mean = 0, sd = 1)
p_value4
## [1] 0.2332091
P-value of .2332
#make a decison
#method 1
abs(z4) > abs(critical4)
## [1] FALSE
#method 2
myp=function(p, alpha){
if(p<alpha){print('REJECT Ho')}else{print('FAIL 2 REJECT')}
}
myp(p = p_value4, alpha = .05)
## [1] "FAIL 2 REJECT"
The absolute value of the test-stat is greater than the absolute value of the critical value. This is basis to FAIL TO REJECT THE NULL.
The p-value is greater than the significance value… also BASIS TO FAIL TO REJECT THE NULL
Decision: We FAIL TO REJECT THE NULL that the true proportion of people that do not go to college due to financial reasons is greater or equal to .50.
There is not enough evidence to claim a minority of the Americans that decide not to go to college do so because of financial reasons.
#create a confidence interval.
#point estimate +- Z(alpha/2) * standard error
critical_CI <- qnorm(p = .025, mean = 0, sd = 1)
critical_CI
## [1] -1.959964
CI2_upper <- .48 + abs(critical_CI * standard_error4)
CI2_lower <- .48 - abs(critical_CI * standard_error4)
CI2_lower
## [1] 0.4261784
CI2_upper
## [1] 0.5338216
The 95% (double sided) confidence interval for the true proportion of Americans that did not go to college because they could not afford it is
(.4262, .5338).
B. I would expect a confidence interval (alpha = .05, double sided) for the proportion of American adults who decide not to go to college for financial reasons to include .5.
As seen above the 95% CI does include the proportion of .5. Also, the statistical tests performed above did not have enough statistical evidence to reject the null. Therefore we FAILED TO REJECT THE NULL that the true proportion is greater or equal to .5. For both of those reasons, .5 is expected to be in this 95% CI.
n1 = 22 , mu1 = 4.9, s1 = 1.8
n2 = 22, mu2 = 6.1, s2 = 1.8
mu3_1 <- 4.9
mu3_2 <- 6.1
alpha3 <- .05
# dist = t, set up the problem
n3_1 <- 22
n3_2 <- 22
df3_1 <- n3_1-1 #DOF
df3_2 <- n3_2-1 #DOF
sd3_1 <- 1.8
sd3_2 <- 1.8
var3_1 <- sd3_1^2 #Variances
var3_2 <- sd3_2^2
num_point_estimate_diff_3 <- (mu3_1 - mu3_2 ) # point estimate difference
Se_3 <- sqrt( var3_1/n3_1 + var3_2/n3_2 ) # Se formula - Standard Error using sample standard deviations rather than population standard deviations
t_3 <- num_point_estimate_diff_3 / Se_3
numdf_3 <- (var3_1/n3_1 + var3_2/n3_2)^2 # Satterthwaite
dendf_3 <- (var3_1/n3_1)^2 / df3_1 + (var3_2/n3_2)^2 / df3_2 # Satterthwaite
df3 <- numdf_3 / dendf_3
Ho: \(\mu_1 = \mu_2\)
Ha: \(\mu1 \ne \mu2\)
Ho: There is no difference in the mean food items recalled between groups.
Ha: There is a difference in the mean food items recalled.
alpha = .05
Distribution: T - this is because n is less than 30, so we cannot assume standard normal.
#p-value
p_value3 <- 2 * (1 - pt(q = t_3, df = df3))
p_value3
## [1] 1.967472
#t-stat (calculated above)
t_3
## [1] -2.211083
#critical
critical3 <- qt(p = .025, df3)
critical3
## [1] -2.018082
abs(t_3) > abs(critical3)
## [1] TRUE
Since the absolute value of the T-stat is larger than the absolute value of the critical value, we can REJECT THE NULL. This means there is statistical evidence that there is a difference in the mean food items recalled between groups.
90% CI (65,77).
n = 25
Formula for CI = sample_mean + - Z * (s/sqrt(n))
#step 1 : find z score
z4 <- abs(qnorm(p = .05))
z4
## [1] 1.644854
The z-score here is 1.645 (we will use the absolute value).
#point estimate will be the midpoint of the CI
x_bar4 <- (65 + 77) / 2
x_bar4
## [1] 71
The point estimate from the sample is 71.
#margin of error is going to be the range / 2... or it is the upper limit - point estimate
ME <- (77-65) / 2
ME
## [1] 6
ME2 <- (77-x_bar4)
ME2
## [1] 6
Give us the same answer.
#standard deviation... we know all the other values now.
# s = (margin_error * sqrt(n) / Z)
s4 <- (ME * sqrt(25) / z4)
s4
## [1] 18.2387
sample standard deviation is 18.2387.
SUMMARY:
SAMPLE MEAN = 71
MARGIN OF ERROR = 6
SAMPLE STANDARD DEVIATION = 18.2387
#reverse engineer to check the work.
CI_upper4 <- (x_bar4) + (z4 * (s4/sqrt(25)))
CI_lower4 <- (x_bar4) - (z4 * (s4/sqrt(25)))
CI_lower4
## [1] 65
CI_upper4
## [1] 77
We get back the original CI.
mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
## control treatment
## alive 4 24
## dead 30 45
The reason we cannot estimate the difference in survival rate using the a confidence interval is…
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readr)
train <- read_csv("train.csv")
## Rows: 891 Columns: 12
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Name, Sex, Ticket, Cabin, Embarked
## dbl (7): PassengerId, Survived, Pclass, Age, SibSp, Parch, Fare
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#Clean the Data, remove na values in age
train <- train[!is.na(train$Age), ]
#make sex a binary variable
train <- train %>%
mutate(Sex_binary = ifelse(Sex == "male", 0, 1)
)
head(train)
## # A tibble: 6 × 13
## PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin
## <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl> <chr>
## 1 1 0 3 Braund… male 22 1 0 A/5 2… 7.25 <NA>
## 2 2 1 1 Cuming… fema… 38 1 0 PC 17… 71.3 C85
## 3 3 1 3 Heikki… fema… 26 0 0 STON/… 7.92 <NA>
## 4 4 1 1 Futrel… fema… 35 1 0 113803 53.1 C123
## 5 5 0 3 Allen,… male 35 0 0 373450 8.05 <NA>
## 6 7 0 1 McCart… male 54 0 0 17463 51.9 E46
## # ℹ 2 more variables: Embarked <chr>, Sex_binary <dbl>
set.seed(100)
train <- sample_n(train, 500)
#Create the logit model
model <- glm(formula = Survived ~ Pclass + Sex_binary + Age, family = binomial(link = 'logit'), data = train)
summary(model)
##
## Call:
## glm(formula = Survived ~ Pclass + Sex_binary + Age, family = binomial(link = "logit"),
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.448847 0.540174 4.533 5.80e-06 ***
## Pclass -1.335625 0.169602 -7.875 3.41e-15 ***
## Sex_binary 2.654303 0.256141 10.363 < 2e-16 ***
## Age -0.034247 0.009152 -3.742 0.000183 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 667.85 on 499 degrees of freedom
## Residual deviance: 439.59 on 496 degrees of freedom
## AIC: 447.59
##
## Number of Fisher Scoring iterations: 5
The logistic regression coefficients is the expencted change in log odds of having the outcome per unit change X
PClass: The PCass variable is defined as follows: 1 = 1st class, 2 = second class, 3 = 3rd class.
Coefficient: -1.335625
exp(-1.335625)
## [1] 0.2629938
Since the coefficient is negative, this is going to be an inverse relationship between PCLass and Survival. This means that as PClass increases (meaning that usually people are poorer) the odds of survival decrease (this is expected based on historical data).
One unit increase in PClass would result in a decrease of about 1.34 log-odds of Survival.
e^coefficient yields a value of .263 meaning for every 1 unit increase in PClass, the odds of survival will decrease by (1-.263 = .737) 73.7%.
SEX_BINARY: Male = 0, Female = 1.
Coefficient = 2.654303
The positive value means that there is a positive relationship in sex and survival (here it means that being a woman would have increased your survival chances).
exp(2.654303)
## [1] 14.21507
For an increase in 1 unit in Sex, the log odds of survival go up 2.654303.
For an increase in 1 unit in Sex, the odds of survival increase by (14.21507 - 1) 1322% !
AGE: The age of the passengers.
Coefficient: -.034247
. This is a negative relationship, so there will be an inverse relationship between age and survival odds (if age goes up, survival odds will decrease).
exp(-.034247)
## [1] 0.9663328
For an increase in 1 unit in age, the log odds of survival decrease by .034247
For an increase in 1 unit in age, the odds of survival decrease by (1 - .9663) = 3.37%
#generate predicted probabilities
predicted_probs <- predict(object = model, type = 'response')
#convert probabilites to binary outcomes
threshold <- .5
binary_predictions <- ifelse(test = predicted_probs >= threshold,
yes = 1,
no = 0)
#Attach binary predictions to the original dataset
train$binary_predictions <- binary_predictions
summary(train$binary_predictions)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.364 1.000 1.000
table(train$binary_predictions)
##
## 0 1
## 318 182
conf_matrix <- table(train$Survived, train$binary_predictions,
dnn = c("Actual", "Predicted"))
conf_matrix
## Predicted
## Actual 0 1
## 0 261 45
## 1 57 137
true_positives <- conf_matrix[2,2]
false_positives <- conf_matrix[1,2]
true_negatives <- conf_matrix[1,1]
false_negatives <- conf_matrix[2,1]
#calculate sensitivity and specificity
sensitivity <- true_positives/(true_positives + false_negatives)
specificity <- true_negatives/(true_negatives + false_positives)
#print
cat("Sensitivity (True Positive Rate):", sensitivity, "\n")
## Sensitivity (True Positive Rate): 0.7061856
This model (using PClass, Sex, and Age) as predictors for survival in the above logistic model gives a 70.6% accuracy in predicting survival. Meaning, the sensitivity rate of this model (true positives) is 70.6%.
#convert probabilites to binary outcomes
threshold2 <- .4
binary_predictions2 <- ifelse(test = predicted_probs >= threshold2,
yes = 1,
no = 0)
#Attach binary predictions to the original dataset
train$binary_predictions2 <- binary_predictions2
summary(train$binary_predictions2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.432 1.000 1.000
table(train$binary_predictions2)
##
## 0 1
## 284 216
conf_matrix2 <- table(train$Survived, train$binary_predictions2,
dnn = c("Actual", "Predicted"))
conf_matrix2
## Predicted
## Actual 0 1
## 0 246 60
## 1 38 156
true_positives2 <- conf_matrix2[2,2]
false_positives2 <- conf_matrix2[1,2]
true_negatives2 <- conf_matrix2[1,1]
false_negatives2 <- conf_matrix2[2,1]
#calculate sensitivity and specificity
sensitivity2 <- true_positives2/(true_positives2 + false_negatives2)
specificity2 <- true_negatives2/(true_negatives2 + false_positives2)
#print
cat("Sensitivity (True Positive Rate):", sensitivity2, "\n")
## Sensitivity (True Positive Rate): 0.8041237
Lowering the threshold in the binary predictions formula from .5 to .4 will give the model 80% accuracy (80% Sensitivity rate aka “True Positive Rate”).
telecom_customer_churn <- read_csv("~/Data Analysis- Sharma/telecom_customer_churn.csv")
## Rows: 7043 Columns: 38
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (23): Customer ID, Gender, Married, City, Offer, Phone Service, Multiple...
## dbl (15): Age, Number of Dependents, Zip code, Latitude, Longitude, Number o...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
telecom_zipcode_population <- read_csv("~/Data Analysis- Sharma/telecom_zipcode_population.csv")
## Rows: 1671 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): Zip Code, Population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#first we need to rename the zipcode columns because one is capitilzed and one is not, and we need them to be the same.
telecom_customer_churn <- telecom_customer_churn %>%
rename(ZIPCODE = 'Zip code')
average_age_by_zip <- telecom_customer_churn %>%
group_by(ZIPCODE) %>%
summarise(average_age = mean(Age, na.rm = TRUE))
telecom_zipcode_population <- telecom_zipcode_population %>%
rename(ZIPCODE = 'Zip Code')
head(telecom_customer_churn)
## # A tibble: 6 × 38
## `Customer ID` Gender Age Married `Number of Dependents` City ZIPCODE
## <chr> <chr> <dbl> <chr> <dbl> <chr> <dbl>
## 1 0002-ORFBO Female 37 Yes 0 Frazier Park 93225
## 2 0003-MKNFE Male 46 No 0 Glendale 91206
## 3 0004-TLHLJ Male 50 No 0 Costa Mesa 92627
## 4 0011-IGKFF Male 78 Yes 0 Martinez 94553
## 5 0013-EXCHZ Female 75 Yes 0 Camarillo 93010
## 6 0013-MHZWF Female 23 No 3 Midpines 95345
## # ℹ 31 more variables: Latitude <dbl>, Longitude <dbl>,
## # `Number of Referrals` <dbl>, `Tenure in Months` <dbl>, Offer <chr>,
## # `Phone Service` <chr>, `Avg Monthly Long Distance Charges` <dbl>,
## # `Multiple Lines` <chr>, `Internet Service` <chr>, `Internet Type` <chr>,
## # `Avg Monthly GB Download` <dbl>, `Online Security` <chr>,
## # `Online Backup` <chr>, `Device Protection Plan` <chr>,
## # `Premium Tech Support` <chr>, `Streaming TV` <chr>, …
head(telecom_zipcode_population)
## # A tibble: 6 × 2
## ZIPCODE Population
## <dbl> <dbl>
## 1 90001 54492
## 2 90002 44586
## 3 90003 58198
## 4 90004 67852
## 5 90005 43019
## 6 90006 62784
merged_data <- merge(average_age_by_zip, telecom_zipcode_population, by = "ZIPCODE", all = TRUE)