# Clearing workspace
rm(list = ls()) # Clear environment
gc() # Clear unused memory
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 522598 28.0 1162891 62.2 660491 35.3
## Vcells 948122 7.3 8388608 64.0 1769514 13.6
cat("\f")
df <- Titanic
head(df)
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
Returning to one of our earlier data sets, I wanted to run a logistic regression of survival on gender using the Titanic data.
print(df)
## , , Age = Child, Survived = No
##
## Sex
## Class Male Female
## 1st 0 0
## 2nd 0 0
## 3rd 35 17
## Crew 0 0
##
## , , Age = Adult, Survived = No
##
## Sex
## Class Male Female
## 1st 118 4
## 2nd 154 13
## 3rd 387 89
## Crew 670 3
##
## , , Age = Child, Survived = Yes
##
## Sex
## Class Male Female
## 1st 5 1
## 2nd 11 13
## 3rd 13 14
## Crew 0 0
##
## , , Age = Adult, Survived = Yes
##
## Sex
## Class Male Female
## 1st 57 140
## 2nd 14 80
## 3rd 75 76
## Crew 192 20
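# Logit Model
# Note: the levels of Survived are "No" and "Yes", so Survived == "Survived"
# is FALSE for every row, and Survived also appears as a predictor.
# This is why the fit produces the convergence warnings.
logit.model <- glm(Survived == "Survived" ~ Sex + Survived,
                   data = df,
                   weights = Freq, family = "binomial"
)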
## Warning: glm.fit: algorithm did not converge
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logit.model)
##
## Call:
## glm(formula = Survived == "Survived" ~ Sex + Survived, family = "binomial",
## data = df, weights = Freq)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.2079 55803.4934 -0.001 1
## SexFemale 0.9444 81840.4543 0.000 1
## SurvivedYes 0.5747 82578.9767 0.000 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 0.0000e+00 on 23 degrees of freedom
## Residual deviance: 3.8787e-10 on 21 degrees of freedom
## AIC: 6
##
## Number of Fisher Scoring iterations: 25
exp(0.9444)
## [1] 2.57127
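The two warnings and the standard errors in the tens of thousands point to a specification problem rather than a real finding: the response `Survived == "Survived"` is never TRUE, because the levels of `Survived` in `Titanic` are `"No"` and `"Yes"`. A minimal sketch of one possible corrected fit follows, assuming the goal is simply survival as a function of sex. The names `titanic_df` and `sex_model` are illustrative and not part of the original analysis.

```r
# Convert the 4-way contingency table into a data frame with a Freq column
titanic_df <- as.data.frame(Titanic)

# Response is TRUE when the person survived ("Yes"); Sex is the only predictor.
# weights = Freq counts each row as many times as there were people in that cell.
sex_model <- glm(Survived == "Yes" ~ Sex,
                 data = titanic_df,
                 weights = Freq,
                 family = binomial)

summary(sex_model)

# Odds ratio of survival for females relative to males
exp(coef(sex_model)["SexFemale"])
```

With a single two-level predictor, the fitted odds ratio should match the one computed directly from the counts in the table, (344 × 1364) / (126 × 367) ≈ 10.15, which gives a quick sanity check on the model.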
Multivariate regression is used to model how dependent variables relate to several independent variables. In this data, we are only looking at who survived and their gender, so it is not necessary here, as we are not trying to predict an outcome from many variables. Because the model above did not converge, its coefficients should not be interpreted. The counts in the table, however, show that a far larger share of females survived (344 of 470) than males (367 of 1,731).
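The comparison of survivors by sex can be read from the table directly, without a model. A small sketch, assuming the built-in `Titanic` table with its dimensions in the order Class, Sex, Age, Survived; the name `sex_by_survival` is illustrative:

```r
# Collapse the 4-way table to Sex x Survived by summing over Class and Age
# (Sex is dimension 2 and Survived is dimension 4 of the Titanic table)
sex_by_survival <- margin.table(Titanic, margin = c(2, 4))
sex_by_survival

# Row proportions: the share of each sex that did or did not survive
round(prop.table(sex_by_survival, margin = 1), 3)
```

Looking at row proportions rather than raw counts matters here, because there were far more males than females on board.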
During this class, we learned not only how to calculate and read results from large data sets, but also how to interpret them. In our first week, we started with an introduction to statistics and to R. I had never used R or any other statistical programming language, so I was excited to get comfortable with it over the course. We then turned to probability and the different methods of calculating it: classical probability, Bayes' theorem, and joint probability. This was also our first introduction to plotting in R, where we used decision trees to illustrate Bayes' theorem with different probabilities.
In our third and fourth weeks, we looked at distributions, mainly the normal, binomial, and Poisson distributions. We spent a lot of time applying the empirical rule to normal distributions and plotting our data in R. Working with these distributions, along with means and standard deviations, helped us better understand our data sets. It also led into sampling and the proper way to gather samples so that results can be generalized to the true population.
Our fifth week focused on inference and the central limit theorem. Inferential statistics were extremely helpful in showing us how to reach a conclusion from evidence, and they simulated real-world situations where we are trying to solve a problem. We could be given a task along with data and asked to determine whether the data provided sufficient evidence to support a claim. From there, we could assess significance by comparing our p-value to our alpha. The result of the test tells us whether to reject or fail to reject the null hypothesis.
Our last two weeks focused on regression, which is an extremely important way to see the relationships between variables. In these assignments and readings, we learned how to relate independent and dependent variables and how the former might allow us to predict values of the latter. We also learned how to merge data sets in R, which will be extremely important in any job, since we will often not receive a single, perfect data set and will need to combine sources to see different variables together.
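As a reminder of what merging looks like in practice, here is a minimal sketch using base R's `merge()`. The data frames `passengers` and `fares` and their values are hypothetical, made up purely for illustration, and are not taken from the course data.

```r
# Hypothetical example data: two small tables that share an "id" column
passengers <- data.frame(id = c(1, 2, 3),
                         sex = c("Male", "Female", "Female"))
fares <- data.frame(id = c(2, 3, 4),
                    fare = c(30.0, 12.5, 8.0))

# Inner join (the default): keeps only ids present in both data frames
inner <- merge(passengers, fares, by = "id")

# Left join: keeps every passenger and fills missing fares with NA
left <- merge(passengers, fares, by = "id", all.x = TRUE)

inner
left
```

Choosing between the default inner join and `all.x = TRUE` depends on whether rows without a match should be dropped or kept with missing values.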