IQ scores are known to closely follow a Gaussian distribution with a mean of 100 and a population standard deviation of 16 (Stanford-Binet).
\(P(X \ge 110 \mid \mu=100,\ \sigma=16)\)
x <- 110
mu <- 100
sigma <- 16
pnorm(x, mu, sigma, lower.tail = FALSE) #upper-tail probability P(X >= 110)
## [1] 0.2659855
The probability that a randomly selected person has an IQ of 110 or higher is 26.60%.
\(P(\bar{X} \ge 110 \mid n=12,\ \mu=100,\ \sigma=16)\)
n <- 12
x <- 110
mu <- 100
sigma <- 16
stnd.error <- sigma / sqrt(n) #standard error of the sample mean
pnorm(x, mu, stnd.error, lower.tail = FALSE) #P(X-bar >= 110)
## [1] 0.01519141
The probability of the mean IQ of a 12-person jury being 110 or higher is 1.52%.
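As a quick sanity check on the standard-error calculation, a short simulation should land near the analytic answer (a sketch; the 1e5 replications are an arbitrary choice):

set.seed(1)
jury_means <- replicate(1e5, mean(rnorm(n = 12, mean = 100, sd = 16))) #simulated 12-person jury means
mean(jury_means >= 110) #should be close to the analytic 0.0152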
Is college worth it?
Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.
H0: \(\pi = .50\) OR The proportion of Americans who did not attend college due to financial reasons is equal to 50%.
Ha: \(\pi > .50\) OR The proportion of Americans who did not attend college due to financial reasons is greater than 50%.
Tested at an alpha value of .05
Z-score
pi <- .50 #hypothesized population proportion
n <- 331 #sample size
p.hat <- .48 #point estimate
alpha <- .05 #level of significance
#calculate z statistic
z_score <- (p.hat - pi) / sqrt(pi * (1 - pi) / n)
z_score
## [1] -0.7277362
#calculate critical value (upper tail, since Ha: pi > .50; a z test needs no df)
critical_value <- qnorm(1 - alpha)
critical_value
## [1] 1.644854
#calculate p value (upper tail, matching Ha)
p.value <- pnorm(z_score, lower.tail = FALSE)
print(p.value)
## [1] 0.7666125
#drawing conclusion
myp <- function(p, alpha) {
  if (p < alpha) print("REJECT Ho") else print("FAIL TO REJECT Ho")
}
myp(p.value, alpha)
## [1] "FAIL TO REJECT Ho"
The z-score is -0.7277 and the upper-tail critical value is 1.6449. At an alpha level of .05, we fail to reject the null hypothesis (p = .7666); in other words, there is not sufficient evidence to claim that the proportion of Americans who did not attend college due to financial reasons is greater than 50%.
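As a cross-check, an exact binomial test should agree with this conclusion. A sketch, assuming the reported 48% corresponds to 159 of the 331 respondents (0.48 * 331 is roughly 159):

binom.test(x = 159, n = 331, p = .50, alternative = "greater") #exact one-sided test of Ha: pi > .50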
pi <- .50 #hypothesized population proportion
n <- 331 #sample size
p.hat <- .48 #point estimate
alpha <- .05 #level of significance
#standard error based on the sample proportion
SE <- sqrt((p.hat * (1 - p.hat)) / n)
SE
## [1] 0.02746049
critical_value <- qnorm(1 - alpha / 2) #two-sided 95% critical value
CI_Bounds <- c(p.hat - critical_value * SE, p.hat + critical_value * SE)
CI_Bounds
## [1] 0.4261784 0.5338216
The 95% CI for the population proportion ranges from 42.62% to 53.38%. Since .50 falls within these bounds, the interval is consistent with the hypothesized value, in agreement with the hypothesis test above.
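The built-in prop.test offers a cross-check; a sketch under the same 159-of-331 assumption as above (prop.test returns a Wilson score interval, so the bounds will differ slightly from the hand-computed Wald interval):

prop.test(x = 159, n = 331, conf.level = .95, correct = FALSE)$conf.int #Wilson interval, no continuity correction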
You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group, who ate their lunch while playing solitaire, were asked to do a serial-order recall of the lunch food items they ate. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction; n = 22 as well) was 6.1, with a standard deviation of 1.8. Do these data provide strong evidence that the average numbers of food items recalled by the patients in the treatment and control groups are different? Assume α = 5%.
H0 = \(\mu_1 = \mu_2\) OR The average number of items recalled by the patients in the treatment group is not significantly different from the number of items recalled by the patients in the control group.
Ha = \(\mu_1 \neq \mu_2\) OR The average number of items recalled by the patients in the treatment group is significantly different from the number of items recalled by the patients in the control group.
In which \(\mu_1 = Treatment\ group\) and \(\mu_2 = Control\ group\) and tested at an α value of .05:
xbar1 <- 4.9 #sample mean, treatment group
s1 <- 1.8 #sample sd, treatment group
xbar2 <- 6.1 #sample mean, control group
s2 <- 1.8 #sample sd, control group
n <- 22 #per-group sample size
alpha <- .05
df <- n - 1 #conservative df: min(n1, n2) - 1
#calculate variances
var1 <- s1^2
var2 <- s2^2
#standard error of the difference in means
SE <- sqrt((var1 / n) + (var2 / n))
t_statistic <- (xbar1 - xbar2) / SE
t_statistic
## [1] -2.211083
#calculate critical value (t distribution, two-sided)
critical_value <- qt(alpha / 2, df = df)
critical_value
## [1] -2.079614
#calculate p value
p.value <- 2 * pt(-abs(t_statistic), df = df) #two-sided p-value
print(p.value)
## [1] 0.03825675
#drawing conclusion
myp <- function(p, alpha) {
  if (p < alpha) print("REJECT Ho") else print("FAIL TO REJECT Ho")
}
myp(p.value, alpha)
## [1] "REJECT Ho"
The p-value of .0383 at an alpha level of .05 indicates that we reject the null hypothesis; that is, there is sufficient evidence to support the alternative hypothesis that the treatment and control groups differ. In other words, the average number of items recalled by patients who were distracted while eating differs significantly from the average number recalled by patients who were not.
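With only summary statistics available, t.test cannot be called directly, but the calculation can be wrapped in a small helper. A sketch using the Welch-Satterthwaite degrees of freedom rather than the conservative n - 1 used above:

t_from_summary <- function(xbar1, s1, n1, xbar2, s2, n2) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2) #standard error of the difference
  t_stat <- (xbar1 - xbar2) / se
  #Welch-Satterthwaite df (less conservative than min(n1, n2) - 1)
  df_welch <- se^4 / ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))
  c(t = t_stat, df = df_welch, p = 2 * pt(-abs(t_stat), df_welch))
}
t_from_summary(4.9, 1.8, 22, 6.1, 1.8, 22)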
A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This two-sided confidence interval is based on a simple random sample of 25 observations. Calculate the sample mean, the margin of error, and the sample standard deviation.
alpha <- .1
n <- 25
df <- n - 1
CI_lower <- 65
CI_upper <- 77
#calculate sample mean (the midpoint of the interval)
sample_mean <- (CI_lower + CI_upper) / 2
sample_mean
## [1] 71
#margin of error (half the width of the interval)
margin_of_error <- (CI_upper - CI_lower) / 2
margin_of_error
## [1] 6
#back out the sample sd; the two-sided critical value is qt(1 - alpha/2, df), not qt(.9, df)
sample_sd <- (margin_of_error * sqrt(n)) / qt(1 - alpha / 2, df = df)
sample_sd
## [1] 17.53481
The sample mean is 71, the margin of error is 6, and the sample standard deviation is 17.5348.
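As a consistency check (reusing the variables above), plugging the recovered statistics back into the t-interval formula should reproduce the original bounds:

sample_mean + c(-1, 1) * qt(1 - alpha / 2, df = df) * sample_sd / sqrt(n) #should return 65 and 77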
The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group:
mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
## control treatment
## alive 4 24
## dead 30 45
Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?
We cannot construct a confidence interval from these data because they do not meet the conditions for the normal approximation. Most importantly, the success-failure condition fails: the control group has only 4 survivors, far fewer than the roughly 10 successes per group the approximation requires (the random assignment also left the groups unbalanced, with nearly twice as many participants in the treatment group). With so few successes, the sampling distribution of the difference in proportions is skewed rather than approximately normal, so a normal-based interval would have unreliable coverage and could lead to erroneous interpretations.
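An exact method avoids the normal approximation altogether; a sketch applying Fisher's exact test to the 2x2 table above:

mym #only 4 successes (survivors) in the control column
fisher.test(mym) #exact test of association between group and survival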
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this question, you are to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).
setwd("/Users/jiwonban/ADEC7301/Week 7")
df <- read.csv("train.csv")
df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE) #treating for missing values (FROM HW1)
df$Sex <- ifelse(df$Sex == "male", 1, 0) #numeric vectors
summary(df)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
## Sex Age SibSp Parch
## Min. :0.0000 Min. : 0.42 Min. :0.000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:22.00 1st Qu.:0.000 1st Qu.:0.0000
## Median :1.0000 Median :28.00 Median :0.000 Median :0.0000
## Mean :0.6476 Mean :29.36 Mean :0.523 Mean :0.3816
## 3rd Qu.:1.0000 3rd Qu.:35.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :1.0000 Max. :80.00 Max. :8.000 Max. :6.0000
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
cor.test(df$Survived, df$Pclass)
##
## Pearson's product-moment correlation
##
## data: df$Survived and df$Pclass
## t = -10.725, df = 889, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3953692 -0.2790061
## sample estimates:
## cor
## -0.338481
cor.test(df$Survived, df$Sex)
##
## Pearson's product-moment correlation
##
## data: df$Survived and df$Sex
## t = -19.298, df = 889, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.5880439 -0.4953510
## sample estimates:
## cor
## -0.5433514
cor.test(df$Survived, df$Age)
##
## Pearson's product-moment correlation
##
## data: df$Survived and df$Age
## t = -1.9395, df = 889, p-value = 0.05276
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1300334740 0.0007702699
## sample estimates:
## cor
## -0.06491042
cor.test(df$Survived, df$SibSp)
##
## Pearson's product-moment correlation
##
## data: df$Survived and df$SibSp
## t = -1.0538, df = 889, p-value = 0.2922
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.10076614 0.03042549
## sample estimates:
## cor
## -0.0353225
cor.test(df$Survived, df$Parch)
##
## Pearson's product-moment correlation
##
## data: df$Survived and df$Parch
## t = 2.442, df = 889, p-value = 0.0148
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.01603798 0.14652128
## sample estimates:
## cor
## 0.08162941
cor.test(df$Survived, df$Fare)
##
## Pearson's product-moment correlation
##
## data: df$Survived and df$Fare
## t = 7.9392, df = 889, p-value = 6.12e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1949232 0.3176165
## sample estimates:
## cor
## 0.2573065
The most meaningful (and significant) correlations were between Survived and Pclass (r = -.338), Sex (r = -.543), and Fare (r = .257); these weak-to-moderate correlations suggest the three variables as potential predictors of the sorts of people who were more likely to survive the shipwreck. The correlations between Survived and Age and SibSp were minimal and not significantly different from zero, and the correlation with Parch, while statistically significant (p = .0148), was negligible in size (r = .082). Therefore, my linear regression model will only include Pclass, Sex, and Fare as the predictors for the dependent variable Survived.
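The individual cor.test calls above can also be collected in one pass; a sketch over the numeric predictors considered here:

predictors <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare")
sapply(predictors, function(v) cor(df$Survived, df[[v]])) #correlation of each predictor with Survived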
You can use the sample_n command from the dplyr package in the tidyverse ecosystem, which you will have to install.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
set.seed(seed = 100)
subset.df <- sample_n(df, 500)
summary(lm(subset.df$Survived ~ subset.df$Pclass + subset.df$Sex + subset.df$Fare))
##
## Call:
## lm(formula = subset.df$Survived ~ subset.df$Pclass + subset.df$Sex +
## subset.df$Fare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.8990 -0.2614 -0.1264 0.2620 0.8752
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0017570 0.0768721 13.031 <2e-16 ***
## subset.df$Pclass -0.1339889 0.0260788 -5.138 4e-07 ***
## subset.df$Sex -0.4750254 0.0395460 -12.012 <2e-16 ***
## subset.df$Fare 0.0002059 0.0004688 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.407 on 496 degrees of freedom
## Multiple R-squared: 0.301, Adjusted R-squared: 0.2967
## F-statistic: 71.18 on 3 and 496 DF, p-value: < 2.2e-16
plot(lm(subset.df$Survived ~ subset.df$Pclass + subset.df$Sex + subset.df$Fare)) #residual diagnostic plots
The results of my regression model show that passenger class Pclass and Sex are significant predictors of whether one was likely to have Survived, whereas Fare was not a significant predictor (p = .661). More specifically, holding Sex constant, each one-unit increase in Pclass (i.e., moving to a lower class) was associated with a .13 decrease in the probability of survival, and being male was associated with a .48 decrease.
predict_survival <- predict(lm(subset.df$Survived ~ subset.df$Pclass + subset.df$Sex + subset.df$Fare), subset.df, type='response')
predict_survival[predict_survival < 0.5] = 0 #categorize observations as 0 or 1
predict_survival[predict_survival >= 0.5] = 1 #>= also catches fitted values of exactly 0.5
table(predict_survival)
## predict_survival
## 0 1
## 338 162
Based on my regression model of Pclass, Sex, and Fare fit to the n = 500 subsample, 162 passengers are predicted to survive.
#new regression model using whole dataset
summary(lm(df$Survived ~ df$Pclass + df$Sex + df$Fare))
##
## Call:
## lm(formula = df$Survived ~ df$Pclass + df$Sex + df$Fare)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94084 -0.24487 -0.09287 0.23803 0.90886
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.0581462 0.0538301 19.657 < 2e-16 ***
## df$Pclass -0.1509739 0.0186075 -8.114 1.63e-15 ***
## df$Sex -0.5140878 0.0276565 -18.588 < 2e-16 ***
## df$Fare 0.0002221 0.0003156 0.704 0.482
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3875 on 887 degrees of freedom
## Multiple R-squared: 0.368, Adjusted R-squared: 0.3659
## F-statistic: 172.2 on 3 and 887 DF, p-value: < 2.2e-16
#predicting survival based on my regression model
predict_survival_entiredataset <- predict(lm(df$Survived ~ df$Pclass + df$Sex + df$Fare), df, type = "response")
predict_survival_entiredataset[predict_survival_entiredataset < 0.5] = 0
predict_survival_entiredataset[predict_survival_entiredataset >= 0.5] = 1
table(predict_survival_entiredataset)
## predict_survival_entiredataset
## 0 1
## 575 316
#confusion matrix (rows = actual, columns = predicted)
correctly_classified <- table(df$Survived, predict_survival_entiredataset)
correctly_classified
## predict_survival_entiredataset
## 0 1
## 0 468 81
## 1 107 235
#calculate accuracy
correctly_classified_accur <- sum(diag(correctly_classified)) / sum(correctly_classified)
correctly_classified_accur
## [1] 0.7890011
(468 + 235) / 891
## [1] 0.7890011
This regression model correctly classified 78.90% of the passengers as having survived or died in the shipwreck.
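Accuracy alone can mask class-specific errors; a sketch pulling sensitivity and specificity from the confusion matrix above (rows are actual outcomes, columns are predictions):

sensitivity <- correctly_classified[2, 2] / sum(correctly_classified[2, ]) #survivors correctly predicted
specificity <- correctly_classified[1, 1] / sum(correctly_classified[1, ]) #deaths correctly predicted
c(sensitivity = sensitivity, specificity = specificity)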
summary(lm(df$Survived ~ df$Pclass + df$Sex + df$Age + df$SibSp))
##
## Call:
## lm(formula = df$Survived ~ df$Pclass + df$Sex + df$Age + df$SibSp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09314 -0.21685 -0.07921 0.22229 0.99074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.334594 0.059198 22.544 < 2e-16 ***
## df$Pclass -0.184446 0.016440 -11.219 < 2e-16 ***
## df$Sex -0.509682 0.027267 -18.692 < 2e-16 ***
## df$Age -0.005829 0.001073 -5.435 7.09e-08 ***
## df$SibSp -0.045349 0.011938 -3.799 0.000155 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3799 on 886 degrees of freedom
## Multiple R-squared: 0.3931, Adjusted R-squared: 0.3903
## F-statistic: 143.4 on 4 and 886 DF, p-value: < 2.2e-16
#predicting survival based on my regression model
predict_survival_higher_accur <- predict(lm(df$Survived ~ df$Pclass + df$Sex + df$Age + df$SibSp + df$Parch), df, type = "response")
predict_survival_higher_accur[predict_survival_higher_accur<0.5]=0
predict_survival_higher_accur[predict_survival_higher_accur >= 0.5] = 1
table(predict_survival_higher_accur)
## predict_survival_higher_accur
## 0 1
## 588 303
#confusion matrix (rows = actual, columns = predicted)
correctly_classified_1 <- table(df$Survived, predict_survival_higher_accur)
correctly_classified_1
## predict_survival_higher_accur
## 0 1
## 0 479 70
## 1 109 233
#calculate accuracy
correctly_classified_accur_1 <- sum(diag(correctly_classified_1)) / sum(correctly_classified_1)
correctly_classified_accur_1
## [1] 0.7991021
I was able to improve the accuracy from 78.90% to 79.91% by removing the variable Fare and adding the variables Age, SibSp, and Parch. I suspect the accuracy is not as high as we would like because I am fitting a linear regression to a binary outcome.
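A logistic regression is the natural fit for a binary outcome; a minimal, untuned sketch with the same predictors as the final model (its accuracy may or may not exceed the linear model's):

logit_model <- glm(Survived ~ Pclass + Sex + Age + SibSp + Parch,
                   family = binomial, data = df) #model survival probability on the logit scale
predict_logit <- as.numeric(predict(logit_model, type = "response") >= 0.5)
mean(predict_logit == df$Survived) #overall accuracy of the logistic model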
The above linear regression model was conducted to address the question of: “what sorts of people were more likely to survive on the Titanic shipwreck?”
My initial exploration of the data was done through correlation coefficients. The most meaningful (and significant) correlations were between Survived and Pclass (r = -.338), Sex (r = -.543), and Fare (r = .257); these weak-to-moderate correlations suggested the three variables as potential predictors of the sorts of people who were more likely to survive the shipwreck. The correlations between Survived and Age, SibSp, and Parch were negligible in size. Therefore, my initial linear regression model included only Pclass, Sex, and Fare as predictors of the dependent variable Survived.
However, the results of my regression model, fit to a random subsample of n = 500, showed that Fare was in fact not a significant predictor of the chances of surviving (p = .661); the same held in the full-sample model (p = .482). Meanwhile, passenger class Pclass and Sex were significant predictors of whether one was likely to have Survived. More specifically, in the full-sample model, each one-unit increase in Pclass (i.e., moving to a lower class), holding Sex constant, was associated with a .15 decrease in the probability of survival, and being male was associated with a .51 decrease. This regression model correctly classified 78.90% of the passengers as having survived or died in the shipwreck.
To improve the accuracy in predicting individuals' survival of the Titanic, I removed the variable Fare and included Age, SibSp, and Parch. Upon doing so, I was able to improve the accuracy from 78.90% to 79.91%. I was not able to bring it past 80%, most likely because a linear regression on a binary outcome violates the model's assumptions (e.g., linearity); the logistic regression sketched above is the more appropriate tool.
Import the telecom_customer_churn and telecom_zipcode_population datasets. Create a table which has the zip code in the rows and two columns: average population and average age. HINT: You may have to merge on the geographic variable.
library(dplyr)
setwd("/Users/jiwonban/ADEC7301/Week 7")
df.1 <- read.csv("telecom_customer_churn.csv")
df.2 <- read.csv("telecom_zipcode_population.csv")
head(df.1)
## Customer.ID Gender Age Married Number.of.Dependents City Zip.code
## 1 0002-ORFBO Female 37 Yes 0 Frazier Park 93225
## 2 0003-MKNFE Male 46 No 0 Glendale 91206
## 3 0004-TLHLJ Male 50 No 0 Costa Mesa 92627
## 4 0011-IGKFF Male 78 Yes 0 Martinez 94553
## 5 0013-EXCHZ Female 75 Yes 0 Camarillo 93010
## 6 0013-MHZWF Female 23 No 3 Midpines 95345
## Latitude Longitude Number.of.Referrals Tenure.in.Months Offer Phone.Service
## 1 34.82766 -118.9991 2 9 None Yes
## 2 34.16251 -118.2039 0 9 None Yes
## 3 33.64567 -117.9226 0 4 Offer E Yes
## 4 38.01446 -122.1154 1 13 Offer D Yes
## 5 34.22785 -119.0799 3 3 None Yes
## 6 37.58150 -119.9728 0 9 Offer E Yes
## Avg.Monthly.Long.Distance.Charges Multiple.Lines Internet.Service
## 1 42.39 No Yes
## 2 10.69 Yes Yes
## 3 33.65 No Yes
## 4 27.82 No Yes
## 5 7.38 No Yes
## 6 16.77 No Yes
## Internet.Type Avg.Monthly.GB.Download Online.Security Online.Backup
## 1 Cable 16 No Yes
## 2 Cable 10 No No
## 3 Fiber Optic 30 No No
## 4 Fiber Optic 4 No Yes
## 5 Fiber Optic 11 No No
## 6 Cable 73 No No
## Device.Protection.Plan Premium.Tech.Support Streaming.TV Streaming.Movies
## 1 No Yes Yes No
## 2 No No No Yes
## 3 Yes No No No
## 4 Yes No Yes Yes
## 5 No Yes Yes No
## 6 No Yes Yes Yes
## Streaming.Music Unlimited.Data Contract Paperless.Billing
## 1 No Yes One Year Yes
## 2 Yes No Month-to-Month No
## 3 No Yes Month-to-Month Yes
## 4 No Yes Month-to-Month Yes
## 5 No Yes Month-to-Month Yes
## 6 Yes Yes Month-to-Month Yes
## Payment.Method Monthly.Charge Total.Charges Total.Refunds
## 1 Credit Card 65.6 593.30 0.00
## 2 Credit Card -4.0 542.40 38.33
## 3 Bank Withdrawal 73.9 280.85 0.00
## 4 Bank Withdrawal 98.0 1237.85 0.00
## 5 Credit Card 83.9 267.40 0.00
## 6 Credit Card 69.4 571.45 0.00
## Total.Extra.Data.Charges Total.Long.Distance.Charges Total.Revenue
## 1 0 381.51 974.81
## 2 10 96.21 610.28
## 3 0 134.60 415.45
## 4 0 361.66 1599.51
## 5 0 22.14 289.54
## 6 0 150.93 722.38
## Customer.Status Churn.Category Churn.Reason
## 1 Stayed
## 2 Stayed
## 3 Churned Competitor Competitor had better devices
## 4 Churned Dissatisfaction Product dissatisfaction
## 5 Churned Dissatisfaction Network reliability
## 6 Stayed
head(df.2)
## Zip.Code Population
## 1 90001 54492
## 2 90002 44586
## 3 90003 58198
## 4 90004 67852
## 5 90005 43019
## 6 90006 62784
#the zip code variable is named differently in the two files; align before merging
names(df.2)[names(df.2) == "Zip.Code"] <- "Zip.code"
#merge datasets on the zip code
merge_by_zipcode <- df.2 %>% left_join(df.1, by = "Zip.code")
#create the two summary columns, one row per zip code
df.3 <- merge_by_zipcode %>%
  group_by(Zip.code) %>%
  summarise(average_Population = mean(Population), average_Age = mean(Age))
summary(df.3)
## Zip.code average_Population average_Age
## Min. :90001 Min. : 11 Min. :24.33
## 1st Qu.:92269 1st Qu.: 1789 1st Qu.:40.50
## Median :93664 Median : 14239 Median :46.75
## Mean :93679 Mean : 20276 Mean :46.43
## 3rd Qu.:95408 3rd Qu.: 32942 3rd Qu.:51.60
## Max. :96161 Max. :105285 Max. :76.67
## NA's :45
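The 45 NA values in average_Age correspond to zip codes that appear in the population file but have no customers in the churn file. If only matched zip codes are wanted, an inner join drops them; a sketch:

df.3_matched <- df.1 %>%
  inner_join(df.2, by = "Zip.code") %>% #keep only zip codes present in both files
  group_by(Zip.code) %>%
  summarise(average_Population = mean(Population), average_Age = mean(Age))
summary(df.3_matched$average_Age) #no NA values expected after the inner join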