Question 1: Distribution of an Individual Score vs. the Sample Mean

Individual IQ scores are known to closely follow a Gaussian distribution with a mean of 100 and a population standard deviation of 16 (Stanford-Binet).

Q1a: A single person is randomly selected for jury duty. What is the probability that this person will have an IQ of 110 or higher? Be sure to write the probability statement and show your R code.

Q1a. Solution:

\(P(X \ge 110 \mid \mu=100, \sigma=16)\)

x <- 110     #IQ threshold
mu <- 100    #population mean
sigma <- 16  #population standard deviation

pnorm(x, mu, sigma, lower.tail = FALSE) #P(X >= 110), upper tail directly
## [1] 0.2659855

The probability that a single randomly selected person has an IQ of 110 or higher is about 26.60%.

Q1b: What is the probability that the mean IQ of the 12-person jury would be 110 or above if drawn from a normal population with μ = 100 and σ = 16? Be sure to write the probability statement and show your R code.

Q1b. Solution:

\(P(\bar{X} \ge 110 \mid n=12, \mu=100, \sigma=16)\)

n <- 12      #jury size
x <- 110     #IQ threshold
mu <- 100    #population mean
sigma <- 16  #population standard deviation

stnd.error <- sigma/sqrt(n) #standard error of the sample mean

pnorm(x, mu, stnd.error, lower.tail = FALSE) #P(X-bar >= 110)
## [1] 0.01519141

The probability that the mean IQ of a 12-person jury is 110 or higher is about 1.52%.
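
As a sanity check on the σ/√n shrinkage from 26.6% to 1.5%, here is a short Monte Carlo sketch (the seed and replication count are arbitrary choices, and its output is omitted here); both empirical tail probabilities should land near the answers above:

set.seed(1)  #arbitrary seed, chosen only for reproducibility
reps <- 1e5

#empirical tail probability for one randomly selected person
mean(rnorm(reps, 100, 16) >= 110)  #should land near 0.266

#empirical tail probability for the mean of a 12-person jury
jury_means <- colMeans(matrix(rnorm(12 * reps, 100, 16), nrow = 12))
mean(jury_means >= 110)            #should land near 0.015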

Question 2: Hypothesis testing

Is college worth it?

Among a simple random sample of 331 American adults who do not have a four-year college degree and are not currently enrolled in school, 48% said they decided not to go to college because they could not afford school.

Q2a. A newspaper article states that only a minority of the Americans who decide not to go to college do so because they cannot afford it and uses the point estimate from this survey as evidence. Conduct a hypothesis test with α = .05 to determine if these data provide strong evidence that more than 50% of American adults who decide not to go to college do so for financial reasons. What are the critical value, test statistic, and p-value? Do you reject or fail to reject the null?

Q2a. Solution:

Step 1: Null/Alternative hypothesis

H0: \(\pi = .50\) OR The proportion of Americans who did not attend college due to financial reasons is equal to 50%.

Ha: \(\pi > .50\) OR The proportion of Americans who did not attend college due to financial reasons is greater than 50%.

Step 2: Significance Level

Tested at an alpha value of .05

Step 3: Distribution

Standard normal (z); with n = 331 and π0 = .50, the expected successes and failures (about 166 each) easily exceed 10, so the normal approximation applies.

Step 4: Test Statistic from sample

pi <- .50    #hypothesized population proportion
n <- 331     #sample size
p.hat <- .48 #point estimate
alpha <- .05 #level of significance

#calculate z statistic
z_score <- (p.hat - pi) / sqrt(pi * (1 - pi) / n)
z_score
## [1] -0.7277362
#calculate critical value (upper tail, since Ha: pi > .50; a z, not t, quantile)
critical_value <- qnorm(1 - alpha)
critical_value
## [1] 1.644854
#calculate p value (upper tail, matching Ha: pi > .50)
p.value <- pnorm(z_score, lower.tail = FALSE)
print(p.value)
## [1] 0.7666125
#drawing conclusion
myp <- function(p, alpha){
  if (p < alpha) {print('REJECT Ho')} else {print('FAIL TO REJECT Ho')}
}

myp(p.value, alpha)
## [1] "FAIL TO REJECT Ho"

The z statistic is -0.7277 and the upper-tail critical value is 1.6449. At an alpha level of .05, we fail to reject the null hypothesis (p = .7666); there is not sufficient evidence that more than 50% of American adults who decide not to go to college do so for financial reasons. In fact, the point estimate (48%) lies below the hypothesized value, so the sample cannot support the one-sided alternative.
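
As a cross-check (a sketch: the survey reports only the 48% figure, so a success count of roughly 159 out of 331 is an assumption, not a given), R's exact binomial test points to the same conclusion:

x <- round(0.48 * 331) #about 159 successes; the exact count is assumed, not given
binom.test(x, n = 331, p = 0.50, alternative = "greater")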

Q2b. Would you expect a confidence interval (α = .05, two-sided) for the proportion of American adults who decide not to go to college for financial reasons to include 0.5 (hypothesized value)? Show, or explain.

Q2b. Solution:

pi <- .50    #hypothesized value (for reference)
n <- 331     #sample size
p.hat <- .48 #point estimate
alpha <- .05 #level of significance

SE <- sqrt((p.hat * (1-p.hat))/n)
SE
## [1] 0.02746049
critical_value <- qnorm(1 - alpha/ 2)

CI_Bounds = c( p.hat - critical_value * SE , p.hat + critical_value * SE )
CI_Bounds
## [1] 0.4261784 0.5338216

The 95% CI for the sample proportion runs from 42.62% to 53.38%. Since .50 falls within these bounds, a 95% CI for the proportion of American adults who decide not to go to college for financial reasons does include the hypothesized value, which is consistent with failing to reject the null in Q2a.
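
As a further check (same caveat as before: the success count of about 159 is assumed from the 48% figure), prop.test builds a score-based 95% interval that should likewise contain .50, though its bounds will differ slightly from the Wald interval above:

x <- round(0.48 * 331) #assumed success count
prop.test(x, n = 331, conf.level = 0.95, correct = FALSE)$conf.int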

Question 3: Hypothesis testing

You are investigating the effects of being distracted by a game on how much people eat. The 22 patients in the treatment group, who ate their lunch while playing solitaire, were asked to do a serial-order recall of the food items they ate at lunch. The average number of items recalled by the patients in this group was 4.9, with a standard deviation of 1.8. The average number of items recalled by the patients in the control group (no distraction; n = 22 as well) was 6.1, with a standard deviation of 1.8. Do these data provide strong evidence that the average numbers of food items recalled by the patients in the treatment and control groups are different? Assume α = 5%.

Q3. Solution:

H0: \(\mu_1 = \mu_2\) OR The mean number of items recalled by the patients in the treatment group is equal to the mean number recalled by the patients in the control group.

Ha: \(\mu_1 \neq \mu_2\) OR The mean number of items recalled by the patients in the treatment group differs from the mean number recalled by the patients in the control group.

Here \(\mu_1\) is the treatment-group mean and \(\mu_2\) the control-group mean, tested at an α value of .05:

xbar1 <- 4.9 #treatment sample mean
s1 <- 1.8    #treatment sample sd

xbar2 <- 6.1 #control sample mean
s2 <- 1.8    #control sample sd

n <- 22      #per-group sample size
alpha <- .05
df <- n-1    #conservative df: min(n1, n2) - 1

#calculate variances
var1 = s1^2 
var2 = s2^2

#standard error of the difference in means
SE <- sqrt((var1/n)+(var2/n))

t_statistic <- (xbar1-xbar2)/SE           
t_statistic
## [1] -2.211083
#calculate critical value (a t, not z, quantile, since the population sd is unknown)
critical_value <- qt(alpha/2, df)
critical_value
## [1] -2.079614
#calculate p value
p.value <- 2*pt(t_statistic, df = df, lower.tail = TRUE)
print(p.value)
## [1] 0.03825675
#drawing conclusion
myp <- function(p, alpha){
  if (p < alpha) {print('REJECT Ho')} else {print('FAIL TO REJECT Ho')}
}

myp(p.value, alpha)
## [1] "REJECT Ho"

The t statistic (-2.211) exceeds the critical value (-2.080) in magnitude, and the p-value of .0383 at an alpha level of .05 means we reject the null hypothesis; that is, there is sufficient evidence of a statistically significant difference between the treatment and control groups. In other words, the mean number of items recalled by patients who were distracted while eating differs significantly from the mean number recalled by patients who were not.
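
Because only summary statistics are given, t.test cannot be called on raw data here; a small helper (a hypothetical function written for this sketch, using the conservative df = min(n1, n2) - 1 convention from above) packages the same calculation for reuse:

#two-sample t-test from summary statistics (hypothetical helper)
t_from_summary <- function(xbar1, s1, n1, xbar2, s2, n2, alpha = 0.05) {
  se <- sqrt(s1^2 / n1 + s2^2 / n2)
  t  <- (xbar1 - xbar2) / se
  df <- min(n1, n2) - 1                        #conservative df convention
  p  <- 2 * pt(abs(t), df, lower.tail = FALSE) #two-sided p-value
  c(t = t, df = df, p = p, crit = qt(1 - alpha / 2, df))
}

t_from_summary(4.9, 1.8, 22, 6.1, 1.8, 22) #should reproduce the values above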

Question 4: Confidence Intervals

A 90% confidence interval for a population mean is (65, 77). The population distribution is approximately normal and the population standard deviation is unknown. This confidence interval is based on a simple random sample of 25 observations (two-sided). Calculate the sample mean, the margin of error, and the sample standard deviation.

Q4. Solution:

alpha <- .1
n <- 25
df <- n-1
CI_lower <- 65
CI_upper <- 77

#calculate sample mean
sample_mean <- (CI_lower + CI_upper) / 2
sample_mean
## [1] 71
#margin of error
sample_marginalerror <- (CI_upper - CI_lower) / 2
sample_marginalerror
## [1] 6
#calculate the sample standard deviation (two-sided, so the .95 quantile of t)
sd <- (sample_marginalerror * sqrt(n))/qt(p = 1 - alpha/2, df = df)
sd
## [1] 17.53482

The sample mean is 71, the margin of error is 6, and the sample standard deviation is about 17.5348.
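
A quick consistency check (a sketch reusing the variables above) plugs the recovered statistics back into the CI formula; it should return the original bounds:

#rebuild the 90% CI from the recovered sample statistics
t_crit <- qt(1 - alpha/2, df)                  #same critical value as above
sample_mean + c(-1, 1) * t_crit * sd / sqrt(n) #should return 65 and 77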

Question 5: Assumptions

The Stanford University Heart Transplant Study was conducted to determine whether an experimental heart transplant program increased lifespan. Each patient entering the program was officially designated a heart transplant candidate, meaning that he was gravely ill and might benefit from a new heart. Patients were randomly assigned into treatment and control groups. Patients in the treatment group received a transplant, and those in the control group did not. The table below displays how many patients survived and died in each group.

mym <- matrix(c(4,30,24,45), nrow=2)
colnames(mym) <- c("control", "treatment")
rownames(mym) <- c("alive","dead")
mym
##       control treatment
## alive       4        24
## dead       30        45

Suppose we are interested in estimating the difference in survival rate between the control and treatment groups using a confidence interval. Explain why we cannot construct such an interval using the normal approximation. What might go wrong if we constructed the confidence interval despite this problem?

Q5. Solution:

We cannot construct such an interval because the data do not meet the conditions for the normal approximation. Right away, you can see that the random assignment led to unequal group sizes, with almost twice as many participants in the treatment group. More importantly, the success-failure condition fails: the control group has only 4 survivors, far fewer than the roughly 10 successes per group needed for the sampling distribution of the difference in proportions to be nearly normal. If we built a normal-approximation interval anyway, its actual coverage could fall well short of the nominal 95%, and the symmetric interval could be badly centered or even extend past 0, since the sampling distribution of a proportion this close to 0 is skewed; conclusions drawn from it could therefore be erroneous.
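
The success-failure condition can be checked mechanically against the table (a sketch using the usual rule of thumb of at least 10 successes and 10 failures per group):

#success-failure check: TRUE means the cell meets the rule of thumb
mym >= 10 #the 4 survivors in the control group fail the condition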

Question 6: Data Question

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this question, you are to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

Import the titanic train.csv file into R. You will have to clean the data to run some of the commands and set up the workflow.

setwd("/Users/jiwonban/ADEC7301/Week 7")
df <- read.csv("train.csv") 

df$Age[is.na(df$Age)] <- median(df$Age, na.rm = TRUE) #impute missing ages with the median (FROM HW1)
df$Sex <- ifelse(df$Sex == "male", 1, 0) #recode Sex as numeric: male = 1, female = 0

Exploring the dataset

summary(df)
##   PassengerId       Survived          Pclass          Name          
##  Min.   :  1.0   Min.   :0.0000   Min.   :1.000   Length:891        
##  1st Qu.:223.5   1st Qu.:0.0000   1st Qu.:2.000   Class :character  
##  Median :446.0   Median :0.0000   Median :3.000   Mode  :character  
##  Mean   :446.0   Mean   :0.3838   Mean   :2.309                     
##  3rd Qu.:668.5   3rd Qu.:1.0000   3rd Qu.:3.000                     
##  Max.   :891.0   Max.   :1.0000   Max.   :3.000                     
##       Sex              Age            SibSp           Parch       
##  Min.   :0.0000   Min.   : 0.42   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:22.00   1st Qu.:0.000   1st Qu.:0.0000  
##  Median :1.0000   Median :28.00   Median :0.000   Median :0.0000  
##  Mean   :0.6476   Mean   :29.36   Mean   :0.523   Mean   :0.3816  
##  3rd Qu.:1.0000   3rd Qu.:35.00   3rd Qu.:1.000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :80.00   Max.   :8.000   Max.   :6.0000  
##     Ticket               Fare           Cabin             Embarked        
##  Length:891         Min.   :  0.00   Length:891         Length:891        
##  Class :character   1st Qu.:  7.91   Class :character   Class :character  
##  Mode  :character   Median : 14.45   Mode  :character   Mode  :character  
##                     Mean   : 32.20                                        
##                     3rd Qu.: 31.00                                        
##                     Max.   :512.33
cor.test(df$Survived, df$Pclass) 
## 
##  Pearson's product-moment correlation
## 
## data:  df$Survived and df$Pclass
## t = -10.725, df = 889, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.3953692 -0.2790061
## sample estimates:
##       cor 
## -0.338481
cor.test(df$Survived, df$Sex)
## 
##  Pearson's product-moment correlation
## 
## data:  df$Survived and df$Sex
## t = -19.298, df = 889, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5880439 -0.4953510
## sample estimates:
##        cor 
## -0.5433514
cor.test(df$Survived, df$Age) 
## 
##  Pearson's product-moment correlation
## 
## data:  df$Survived and df$Age
## t = -1.9395, df = 889, p-value = 0.05276
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1300334740  0.0007702699
## sample estimates:
##         cor 
## -0.06491042
cor.test(df$Survived, df$SibSp)
## 
##  Pearson's product-moment correlation
## 
## data:  df$Survived and df$SibSp
## t = -1.0538, df = 889, p-value = 0.2922
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.10076614  0.03042549
## sample estimates:
##        cor 
## -0.0353225
cor.test(df$Survived, df$Parch)
## 
##  Pearson's product-moment correlation
## 
## data:  df$Survived and df$Parch
## t = 2.442, df = 889, p-value = 0.0148
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.01603798 0.14652128
## sample estimates:
##        cor 
## 0.08162941
cor.test(df$Survived, df$Fare)
## 
##  Pearson's product-moment correlation
## 
## data:  df$Survived and df$Fare
## t = 7.9392, df = 889, p-value = 6.12e-15
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.1949232 0.3176165
## sample estimates:
##       cor 
## 0.2573065

The most meaningful (and significant) correlations were between Survived and Pclass (r = -.338), Sex (r = -.543), and Fare (r = .257); these weak-to-moderate correlations suggest the three variables are potential predictors of the sorts of people who were more likely to survive the shipwreck. The correlations between Survived and Age, SibSp, and Parch were minimal and not significantly different from zero. Therefore, my linear regression model will include only Pclass, Sex, and Fare as predictors of the dependent variable Survived.
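
For reference, the point estimates from the same scan can be produced in one call (a minimal sketch; num_vars is just a helper name introduced here, and cor.test is still needed for the p-values):

num_vars <- c("Pclass", "Sex", "Age", "SibSp", "Parch", "Fare")
sapply(df[num_vars], function(v) cor(df$Survived, v)) #correlation of each with Survived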

1. Use set.seed(100) command, and create a subset of train dataset that has only 500 observations.

You can use the sample_n command from the dplyr package (tidyverse ecosystem), which you will have to install.

Q6. Solution Part 1:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
set.seed(100)
subset.df <- sample_n(df, 500)

2. Create an Ordinary Least Squares model / linear regression where Survived is the dependent variable on your n=500 sample. If you wish, you can run a logistic regression instead (which would be more appropriate).

Q6. Solution Part 2:

summary(lm(subset.df$Survived ~ subset.df$Pclass + subset.df$Sex + subset.df$Fare))
## 
## Call:
## lm(formula = subset.df$Survived ~ subset.df$Pclass + subset.df$Sex + 
##     subset.df$Fare)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.8990 -0.2614 -0.1264  0.2620  0.8752 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       1.0017570  0.0768721  13.031   <2e-16 ***
## subset.df$Pclass -0.1339889  0.0260788  -5.138    4e-07 ***
## subset.df$Sex    -0.4750254  0.0395460 -12.012   <2e-16 ***
## subset.df$Fare    0.0002059  0.0004688   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.407 on 496 degrees of freedom
## Multiple R-squared:  0.301,  Adjusted R-squared:  0.2967 
## F-statistic: 71.18 on 3 and 496 DF,  p-value: < 2.2e-16
plot(lm(subset.df$Survived ~ subset.df$Pclass + subset.df$Sex + subset.df$Fare))

The results of my regression model show that passenger class (Pclass) and Sex are significant predictors of whether a passenger Survived, whereas the cost of Fare was not a significant predictor (p = .661). More specifically, holding the other predictors constant, each one-unit increase in Pclass (i.e., a lower class) was associated with a .13 decrease in the predicted probability of survival, and being male was associated with a .48 decrease.

3. Create an estimate of whether an individual survived or not (binary variable) using the predict command on your estimated model. Essentially, you are using the coefficient from your linear model to forecast/predict/estimate the survived variable given independent variable values/data (you may have to convert the predicted probabilities into binary outcome variable)

Q6. Solution Part 3:

predict_survival <- predict(lm(subset.df$Survived ~ subset.df$Pclass + subset.df$Sex + subset.df$Fare), subset.df, type='response')

predict_survival <- ifelse(predict_survival >= 0.5, 1, 0) #categorize observations as 0 or 1 at the 0.5 threshold

table(predict_survival)
## predict_survival
##   0   1 
## 338 162

Based on my regression model with Pclass, Sex, and Fare (n=500), 162 passengers are predicted to survive.

4. Create a confusion matrix, and calculate the accuracy for the entire train dataset. In other words, how many individuals who survived or died the titanic crash were correctly classified by your regression?

Q6. Solution Part 4:

#new regression model using whole dataset
summary(lm(df$Survived ~ df$Pclass + df$Sex + df$Fare))
## 
## Call:
## lm(formula = df$Survived ~ df$Pclass + df$Sex + df$Fare)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94084 -0.24487 -0.09287  0.23803  0.90886 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.0581462  0.0538301  19.657  < 2e-16 ***
## df$Pclass   -0.1509739  0.0186075  -8.114 1.63e-15 ***
## df$Sex      -0.5140878  0.0276565 -18.588  < 2e-16 ***
## df$Fare      0.0002221  0.0003156   0.704    0.482    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3875 on 887 degrees of freedom
## Multiple R-squared:  0.368,  Adjusted R-squared:  0.3659 
## F-statistic: 172.2 on 3 and 887 DF,  p-value: < 2.2e-16
#predicting survival based on my regression model
predict_survival_entiredataset <- predict(lm(df$Survived ~ df$Pclass + df$Sex + df$Fare), df, type = "response")

predict_survival_entiredataset <- ifelse(predict_survival_entiredataset >= 0.5, 1, 0) #classify at the 0.5 threshold

table(predict_survival_entiredataset)
## predict_survival_entiredataset
##   0   1 
## 575 316
#confusion matrix
correctly_classified <- table(df$Survived, predict_survival_entiredataset)
correctly_classified
##    predict_survival_entiredataset
##       0   1
##   0 468  81
##   1 107 235
#calculate accuracy
correctly_classified_accur <- sum(diag(correctly_classified)) / sum(correctly_classified)
correctly_classified_accur
## [1] 0.7890011
(468 + 235) / 891
## [1] 0.7890011

This regression model correctly classified 78.90% of the individuals who either survived or died in the shipwreck.

HINT (modelling): Pclass, Sex, Age can give you an accuracy of about 80%. Can you get higher accuracy?

Exploring the dataset

summary(lm(df$Survived ~ df$Pclass + df$Sex + df$Age + df$SibSp))
## 
## Call:
## lm(formula = df$Survived ~ df$Pclass + df$Sex + df$Age + df$SibSp)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09314 -0.21685 -0.07921  0.22229  0.99074 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.334594   0.059198  22.544  < 2e-16 ***
## df$Pclass   -0.184446   0.016440 -11.219  < 2e-16 ***
## df$Sex      -0.509682   0.027267 -18.692  < 2e-16 ***
## df$Age      -0.005829   0.001073  -5.435 7.09e-08 ***
## df$SibSp    -0.045349   0.011938  -3.799 0.000155 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3799 on 886 degrees of freedom
## Multiple R-squared:  0.3931, Adjusted R-squared:  0.3903 
## F-statistic: 143.4 on 4 and 886 DF,  p-value: < 2.2e-16
#predicting survival from an expanded model (adds Parch to the model summarized above)
predict_survival_higher_accur <- predict(lm(df$Survived ~ df$Pclass + df$Sex + df$Age + df$SibSp + df$Parch), df, type = "response")

predict_survival_higher_accur <- ifelse(predict_survival_higher_accur >= 0.5, 1, 0) #classify at the 0.5 threshold

table(predict_survival_higher_accur)
## predict_survival_higher_accur
##   0   1 
## 588 303
#confusion matrix
correctly_classified_1 <- table(df$Survived, predict_survival_higher_accur)
correctly_classified_1
##    predict_survival_higher_accur
##       0   1
##   0 479  70
##   1 109 233
#calculate accuracy
correctly_classified_accur_1 <- sum(diag(correctly_classified_1)) / sum(correctly_classified_1)
correctly_classified_accur_1
## [1] 0.7991021

I was able to improve the accuracy from 78.90% to 79.91% by removing the variable Fare and adding Age and SibSp (with Parch also included in the prediction model). I suspect the accuracy is not as high as we would like because a linear regression is being fit to a binary outcome.

5. Organise the four subparts above into a 1-page report.

Q6. Solution Part 5:

The linear regression analysis above addresses the question: “what sorts of people were more likely to survive the Titanic shipwreck?”

My initial exploration of the data was done through correlation coefficients. The most meaningful (and significant) correlations were between Survived and Pclass (r = -.338), Sex (r = -.543), and Fare (r = .257); these weak-to-moderate correlations suggested the three variables as potential predictors of the sorts of people who were more likely to survive the shipwreck. The correlations between Survived and Age, SibSp, and Parch were minimal and not significantly different from zero. Therefore, my first linear regression model included only Pclass, Sex, and Fare as predictors of the dependent variable Survived.

However, the results of that regression model, fit on the n=500 sample, showed that Fare was in fact not a significant predictor of the chances of surviving (p = .661), while passenger class (Pclass) and Sex were significant predictors of Survived: each one-unit increase in Pclass (i.e., a lower class), holding the other predictors constant, was associated with a .13 decrease in the predicted probability of survival, and being male with a .48 decrease. Refit on the full dataset, this model correctly classified 78.90% of the individuals who either survived or died in the shipwreck.

To increase the accuracy in predicting individuals’ survival of the Titanic, I removed the variable Fare and included Age, SibSp, and Parch. Upon doing so, the accuracy improved from 78.90% to 79.91%. I was not able to bring it past 80%; I suspect this is because a linear regression is being fit to a binary outcome, violating its assumptions (e.g., linearity), so a logistic regression would be a more appropriate model.
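
For completeness, here is a minimal sketch of that logistic alternative (assuming the same cleaned df; this model was not run above, so its accuracy is an open question here, not a reported result):

#logistic regression alternative (a sketch, not run above)
logit_fit <- glm(Survived ~ Pclass + Sex + Age + SibSp, data = df, family = binomial)
logit_prob <- predict(logit_fit, newdata = df, type = "response") #predicted probabilities
logit_pred <- ifelse(logit_prob >= 0.5, 1, 0)                     #classify at 0.5
mean(logit_pred == df$Survived)                                   #accuracy on train.csv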

Question 7: Data Question

Import the telecom_customer_churn and telecom_zipcode_population datasets. Create a table which has zip code in the rows and two columns: average population and average age. HINT: You may have to merge on the geographic variable.

Q7. Solution:

library(dplyr)
setwd("/Users/jiwonban/ADEC7301/Week 7")
df.1 <- read.csv("telecom_customer_churn.csv") 
df.2 <- read.csv("telecom_zipcode_population.csv")

head(df.1)
##   Customer.ID Gender Age Married Number.of.Dependents         City Zip.code
## 1  0002-ORFBO Female  37     Yes                    0 Frazier Park    93225
## 2  0003-MKNFE   Male  46      No                    0     Glendale    91206
## 3  0004-TLHLJ   Male  50      No                    0   Costa Mesa    92627
## 4  0011-IGKFF   Male  78     Yes                    0     Martinez    94553
## 5  0013-EXCHZ Female  75     Yes                    0    Camarillo    93010
## 6  0013-MHZWF Female  23      No                    3     Midpines    95345
##   Latitude Longitude Number.of.Referrals Tenure.in.Months   Offer Phone.Service
## 1 34.82766 -118.9991                   2                9    None           Yes
## 2 34.16251 -118.2039                   0                9    None           Yes
## 3 33.64567 -117.9226                   0                4 Offer E           Yes
## 4 38.01446 -122.1154                   1               13 Offer D           Yes
## 5 34.22785 -119.0799                   3                3    None           Yes
## 6 37.58150 -119.9728                   0                9 Offer E           Yes
##   Avg.Monthly.Long.Distance.Charges Multiple.Lines Internet.Service
## 1                             42.39             No              Yes
## 2                             10.69            Yes              Yes
## 3                             33.65             No              Yes
## 4                             27.82             No              Yes
## 5                              7.38             No              Yes
## 6                             16.77             No              Yes
##   Internet.Type Avg.Monthly.GB.Download Online.Security Online.Backup
## 1         Cable                      16              No           Yes
## 2         Cable                      10              No            No
## 3   Fiber Optic                      30              No            No
## 4   Fiber Optic                       4              No           Yes
## 5   Fiber Optic                      11              No            No
## 6         Cable                      73              No            No
##   Device.Protection.Plan Premium.Tech.Support Streaming.TV Streaming.Movies
## 1                     No                  Yes          Yes               No
## 2                     No                   No           No              Yes
## 3                    Yes                   No           No               No
## 4                    Yes                   No          Yes              Yes
## 5                     No                  Yes          Yes               No
## 6                     No                  Yes          Yes              Yes
##   Streaming.Music Unlimited.Data       Contract Paperless.Billing
## 1              No            Yes       One Year               Yes
## 2             Yes             No Month-to-Month                No
## 3              No            Yes Month-to-Month               Yes
## 4              No            Yes Month-to-Month               Yes
## 5              No            Yes Month-to-Month               Yes
## 6             Yes            Yes Month-to-Month               Yes
##    Payment.Method Monthly.Charge Total.Charges Total.Refunds
## 1     Credit Card           65.6        593.30          0.00
## 2     Credit Card           -4.0        542.40         38.33
## 3 Bank Withdrawal           73.9        280.85          0.00
## 4 Bank Withdrawal           98.0       1237.85          0.00
## 5     Credit Card           83.9        267.40          0.00
## 6     Credit Card           69.4        571.45          0.00
##   Total.Extra.Data.Charges Total.Long.Distance.Charges Total.Revenue
## 1                        0                      381.51        974.81
## 2                       10                       96.21        610.28
## 3                        0                      134.60        415.45
## 4                        0                      361.66       1599.51
## 5                        0                       22.14        289.54
## 6                        0                      150.93        722.38
##   Customer.Status  Churn.Category                  Churn.Reason
## 1          Stayed                                              
## 2          Stayed                                              
## 3         Churned      Competitor Competitor had better devices
## 4         Churned Dissatisfaction       Product dissatisfaction
## 5         Churned Dissatisfaction           Network reliability
## 6          Stayed
head(df.2)
##   Zip.Code Population
## 1    90001      54492
## 2    90002      44586
## 3    90003      58198
## 4    90004      67852
## 5    90005      43019
## 6    90006      62784
#the zip code column is named differently in the two files
names(df.2)[names(df.2) == "Zip.Code"] <- "Zip.code"

#merge datasets on zip code
merge_by_zipcode <- df.2 %>% left_join(df.1, by = "Zip.code")

#create columns
df.3 <- merge_by_zipcode %>%
  group_by(Zip.code) %>%
  summarise(average_Population = mean(Population), average_Age = mean(Age))

summary(df.3)
##     Zip.code     average_Population  average_Age   
##  Min.   :90001   Min.   :    11     Min.   :24.33  
##  1st Qu.:92269   1st Qu.:  1789     1st Qu.:40.50  
##  Median :93664   Median : 14239     Median :46.75  
##  Mean   :93679   Mean   : 20276     Mean   :46.43  
##  3rd Qu.:95408   3rd Qu.: 32942     3rd Qu.:51.60  
##  Max.   :96161   Max.   :105285     Max.   :76.67  
##                                     NA's   :45
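
The 45 NA values in average_Age correspond to zip codes that appear in the population file but have no customers in the churn file; the left join fills the Age of those rows with NA. A one-line check (a sketch) confirms the count:

sum(is.na(df.3$average_Age)) #zip codes with no matched customers; should print 45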