# Clearing workspace  
rm(list = ls()) # Clear environment
gc() # Free unused memory
##          used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 522598 28.0    1162891 62.2   660491 35.3
## Vcells 948122  7.3    8388608 64.0  1769514 13.6
cat("\f") # Clear the console

A. Implement the logistic regression on any dataset of your choice, and interpret your coefficients. Tell us why you should not run a multivariate regression.

df <- Titanic
head(df)
## , , Age = Child, Survived = No
## 
##       Sex
## Class  Male Female
##   1st     0      0
##   2nd     0      0
##   3rd    35     17
##   Crew    0      0
## 
## , , Age = Adult, Survived = No
## 
##       Sex
## Class  Male Female
##   1st   118      4
##   2nd   154     13
##   3rd   387     89
##   Crew  670      3
## 
## , , Age = Child, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st     5      1
##   2nd    11     13
##   3rd    13     14
##   Crew    0      0
## 
## , , Age = Adult, Survived = Yes
## 
##       Sex
## Class  Male Female
##   1st    57    140
##   2nd    14     80
##   3rd    75     76
##   Crew  192     20

Going back to one of our earlier data sets, I wanted to run a logistic regression of survival on sex using our Titanic data.

# Logit Model
logit.model <- glm(Survived == "Survived" ~ Sex + Survived,  
                   data = df,
                   weights=Freq, family="binomial"
                   )
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit.model)
## 
## Call:
## glm(formula = Survived == "Survived" ~ Sex + Survived, family = "binomial", 
##     data = df, weights = Freq)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -30.2079 55803.4934  -0.001        1
## SexFemale       0.9444 81840.4543   0.000        1
## SurvivedYes     0.5747 82578.9767   0.000        1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 23  degrees of freedom
## Residual deviance: 3.8787e-10  on 21  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25
exp(0.9444)
## [1] 2.57127

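Multivariate regression models several dependent variables at once as functions of the independent variables. Here we have a single binary outcome (Survived) and one predictor of interest (Sex), so a multivariate regression is not necessary; a logistic regression with one response is the appropriate tool. The fit above did not converge (standard errors near 80,000 and p-values of 1), so its coefficients, including exp(0.9444) = 2.57, should not be interpreted as odds ratios. The counts in the table show the pattern directly: 344 of 470 females survived (about 73%) compared with 367 of 1,731 males (about 21%), so a larger share of females survived even though slightly more males survived in absolute terms.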
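
The convergence warnings and the null deviance of 0 in the summary above are consistent with a response that never varies: the levels of Survived in Titanic are "No" and "Yes", so `Survived == "Survived"` is FALSE on every row, and Survived also appears on the right-hand side of the formula. The following is a minimal sketch of one possible correction, not the code that produced the output above. It assumes the intended response is survival ("Yes") and that Sex is the only predictor; the names `titanic_df`, `sex_by_survival`, and `fit` are introduced here for illustration only.

```r
# Titanic is a 4-way contingency table (Class, Sex, Age, Survived).
# Convert it to a data frame with one row per cell and a Freq column.
titanic_df <- as.data.frame(Titanic)

# Logistic regression of survival on sex, weighting each cell by its count.
# With a factor response, glm() treats the first level ("No") as failure,
# so the model describes the log-odds of Survived == "Yes".
fit <- glm(Survived ~ Sex,
           data = titanic_df,
           weights = Freq,
           family = binomial)

# Cross-check: collapse the table to Sex by Survived and compute the
# odds ratio by hand (dimension 2 is Sex, dimension 4 is Survived).
sex_by_survival <- margin.table(Titanic, c(2, 4))
odds <- sex_by_survival[, "Yes"] / sex_by_survival[, "No"]

odds_ratio_table <- unname(odds["Female"] / odds["Male"])  # about 10.15
odds_ratio_model <- unname(exp(coef(fit)["SexFemale"]))

print(sex_by_survival)
print(c(table = odds_ratio_table, model = odds_ratio_model))
```

Because Sex is the only predictor, the exponentiated SexFemale coefficient should match the odds ratio from the collapsed table: under these assumptions, the odds of survival for females are roughly ten times those for males.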

Reflection: Please reflect over the last 14 weeks - maybe even skim over the material that we have seen, to consolidate the topics we have seen in class. What have you learned about data analysis - both theoretically and empirically?

During this class, we learned a great deal about not only how to calculate and read results from large data sets, but also how to interpret them. In our first week, we started with an introduction to statistics and to R. Personally, I had never used R or any other statistical programming language, so I was excited to get used to it throughout the course. We then focused on probability and how to calculate it using different methods: classical probability, Bayes' theorem, and joint probability. This was also our first introduction to plotting in R, where we used decision trees to illustrate Bayes' theorem with different probabilities.

In our third and fourth weeks, we looked at distributions. The three main ones were the normal, binomial, and Poisson distributions. We spent a lot of time applying the empirical rule to normal distributions and plotting our data in R. These distributions, together with the mean and standard deviation, help us better understand our data sets. This also led us into sampling and the proper way to gather samples so that results can be applied to the true population.

Our fifth week focused on inference and the central limit theorem. Inferential statistics were extremely helpful in showing us how to reach a conclusion from evidence, and they simulated real-world situations in which we are trying to solve a problem. We could be given a task along with data and be asked whether the data gave sufficient evidence to support our claim. From there, we could assess significance by comparing our p-value to our alpha. The result of the test tells us whether to reject or fail to reject the null hypothesis.

Our last two weeks focused on regression, an extremely important way to see the relationships between variables. In these assignments and readings, we learned how to relate dependent variables to independent variables and how those relationships can be used to predict values. We also learned how to merge data sets in R, which will be extremely important in any job, since we will often not be given one perfect data set and will need to combine variables from several sources.