Non-graphical Univariate and Bivariate Statistical Analysis.

1. Prepare the Data (Load and Clean titanic.csv).

setwd("/Users/whinton/src/rstudio/tim8501")
titanic <- read.csv("titanic.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
df <- titanic ## make copy of original dataset to data frame df

Show Initial Missing Values.

## PassengerId    Survived      Pclass        Name         Sex         Age 
##           0           0           0           0           0         177 
##       SibSp       Parch      Ticket        Fare       Cabin    Embarked 
##           0           0           0           0         687           2
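The code that produced these counts is not shown. A minimal sketch of one way to obtain them, assuming blank strings are counted as missing alongside NA (by default read.csv leaves empty text fields as "" rather than NA). The helper name count_missing and the toy data frame are illustrative and not part of the original analysis:

```r
# Hypothetical helper: count NA or empty-string entries per column.
count_missing <- function(data) {
  sapply(data, function(x) sum(is.na(x) | as.character(x) == "", na.rm = TRUE))
}

# Toy data standing in for a few Titanic columns (values are made up).
toy <- data.frame(
  Age      = c(22, NA, 38, NA),
  Cabin    = c("", "C85", "", ""),
  Embarked = c("S", "C", "", "S"),
  stringsAsFactors = TRUE
)

missing_counts <- count_missing(toy)
print(missing_counts)
```

Applied to the real data, the call would be count_missing(df).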

## ####################################################################################

Perform Pre-processing, Imputation and Show Filtered/Cleaned Data.

## Survived   Pclass      Sex      Age    SibSp    Parch     Fare Embarked 
##        0        0        0        0        0        0        0        0
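The cleaning code itself is not shown. The output implies that PassengerId, Name, Ticket and Cabin were dropped and that Age and Embarked were imputed. The sketch below is one plausible version under stated assumptions: mean imputation for Age (suggested by Age having the same mean and median, 29.7, in the descriptive statistics) and most-frequent-value imputation for Embarked. The toy data and the exact rules are assumptions, not the author's reusable functions:

```r
# Toy data (made-up values); column names follow the Titanic file.
toy <- data.frame(
  PassengerId = 1:5,
  Age         = c(20, NA, 40, 30, NA),
  Embarked    = c("S", "", "C", "S", "S"),
  Cabin       = c("", "C85", "", "", "E46"),
  stringsAsFactors = FALSE
)

# Assumption: drop identifier-like and mostly-empty columns.
drop_cols <- c("PassengerId", "Name", "Ticket", "Cabin")
clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]

# Assumption: replace missing Age with the mean of the observed ages.
clean$Age[is.na(clean$Age)] <- mean(clean$Age, na.rm = TRUE)

# Assumption: replace blank Embarked with the most frequent port.
observed <- clean$Embarked[clean$Embarked != ""]
clean$Embarked[clean$Embarked == ""] <- names(which.max(table(observed)))
clean$Embarked <- factor(clean$Embarked)

print(colSums(is.na(clean)))
```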

## ####################################################################################

2. Univariate Statistics (Non-graphical).

Show Univariate Descriptive Statistics

##     Var Obs Mean Median Variance St.Dev  Range   IQR Skewness Kurtosis Outliers
## 1   Age 891 29.7   29.7   169.05     13  79.58    13     0.43     3.95       66
## 2 SibSp 891 0.52      0     1.22    1.1      8     1     3.69    20.77       46
## 3 Parch 891 0.38      0     0.65   0.81      6     0     2.74    12.72      213
## 4  Fare 891 32.2  14.45  2469.44  49.69 512.33 23.09     4.78     36.2      116
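A sketch of how one row of this table could be computed in base R. It assumes moment-based skewness and kurtosis (kurtosis not reduced by 3, so a normal distribution scores 3) and the 1.5 × IQR rule for outliers. The outlier rule is consistent with Parch, where an IQR of 0 flags every non-zero value. The helper name describe_numeric is hypothetical:

```r
# Hypothetical helper: one row of descriptive statistics for a numeric vector.
describe_numeric <- function(x) {
  x   <- x[!is.na(x)]
  m   <- mean(x)
  m2  <- mean((x - m)^2)                       # second central moment
  q   <- quantile(x, c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  # Count values outside the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
  outliers <- sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr)
  data.frame(
    Obs = length(x), Mean = m, Median = median(x),
    Variance = var(x), St.Dev = sd(x),
    Range = max(x) - min(x), IQR = iqr,
    Skewness = mean((x - m)^3) / m2^1.5,
    Kurtosis = mean((x - m)^4) / m2^2,         # not excess kurtosis
    Outliers = outliers
  )
}

stats <- describe_numeric(c(1, 2, 3, 4, 100))  # toy values
print(stats)
```

For the real data, something like do.call(rbind, lapply(df[c("Age", "SibSp", "Parch", "Fare")], describe_numeric)) would stack one row per variable.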

## ####################################################################################

## Quantiles Data Frame of Quantitative Variables

##     qAge qFare qSibSp qParch
## 0%   0.4   0.0      0      0
## 5%   6.0   7.2      0      0
## 10% 16.0   7.6      0      0
## 25% 22.0   7.9      0      0
## 50% 29.7  14.5      0      0
## 75% 35.0  31.0      1      0
## 95% 54.0 112.1      3      2
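A quantile table like this one can be built by applying quantile() to each numeric column. The sketch below uses toy values and passes the probabilities in ascending order, so the rows come out sorted. The q prefix on the column names mirrors the table above; everything else is illustrative:

```r
# Toy numeric columns (made-up values).
toy <- data.frame(Age  = c(2, 18, 25, 31, 47, 60),
                  Fare = c(7, 8, 13, 26, 52, 200))

probs <- c(0, 0.05, 0.10, 0.25, 0.50, 0.75, 0.95)   # ascending order
q <- as.data.frame(sapply(toy, quantile, probs = probs))
names(q) <- paste0("q", names(q))
print(round(q, 1))
```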

## ####################################################################################

## Frequencies, Proportions and Modes of Qualitative Variables

## #####################

##   Survived Freq      Prop
## 1       NO  549 0.6161616
## 2      YES  342 0.3838384
## 3    Total  891 1.0000000

## Mode:

##    Survived        Freq        Prop 
##        "NO"       "549" "0.6161616"

## #####################

##   Gender Freq     Prop
## 1 female  314 0.352413
## 2   male  577 0.647587
## 3  Total  891 1.000000

## Mode:

##     Gender       Freq       Prop 
##     "male"      "577" "0.647587"

## #####################

##   Pclass Freq      Prop
## 1  First  216 0.2424242
## 2 Second  184 0.2065095
## 3  Third  491 0.5510662
## 4  Total  891 1.0000000

## Mode:

##      Pclass        Freq        Prop 
##     "Third"       "491" "0.5510662"

## #####################

##   Embarked Freq       Prop
## 1        C  168 0.18855219
## 2        Q   77 0.08641975
## 3        S  646 0.72502806
## 4    Total  891 1.00000000

## Mode:

##     Embarked         Freq         Prop 
##          "S"        "646" "0.72502806"
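A sketch of a frequency, proportion and mode summary for one categorical variable. Taking the mode as a whole row of the frequency table keeps the label and its count together. The helper name freq_summary is hypothetical, and the factor is rebuilt from the Survived counts shown above:

```r
# Hypothetical helper: frequency, proportion and mode for one categorical variable.
freq_summary <- function(x) {
  counts <- table(x)
  freq <- data.frame(
    Level = names(counts),
    Freq  = as.vector(counts),
    Prop  = as.vector(prop.table(counts))
  )
  # The mode is the row with the highest count, so the label and the
  # frequency always come from the same level.
  list(table = freq, mode = freq[which.max(freq$Freq), ])
}

# Counts taken from the Survived frequency table above.
survived <- factor(rep(c("NO", "YES"), times = c(549, 342)))
res <- freq_summary(survived)
print(res$table)
print(res$mode)
```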

## ####################################################################################

3. Bivariate Statistics (Non-graphical).

## Correlation Table (For Quantitative Variables)
numeric_data <- df[sapply(df, is.numeric)]
cor(numeric_data)

##               Age      SibSp      Parch       Fare
## Age    1.00000000 -0.2326246 -0.1791909 0.09156609
## SibSp -0.23262459  1.0000000  0.4148377 0.15965104
## Parch -0.17919092  0.4148377  1.0000000 0.21622494
## Fare   0.09156609  0.1596510  0.2162249 1.00000000

## Crosstabulation / Contingency Table (For Categorical Variables). 
table(df$Sex, df$Embarked)

##         
##            C   Q   S
##   female  73  36 205
##   male    95  41 441

table(df$Survived, df$Pclass)

##      
##       First Second Third
##   NO     80     97   372
##   YES   136     87   119

table(df$Survived, df$Sex)

##      
##       female male
##   NO      81  468
##   YES    233  109

table(df$Survived, df$Embarked)

##      
##         C   Q   S
##   NO   75  47 427
##   YES  93  30 219

table(df$Pclass, df$Sex)

##         
##          female male
##   First      94  122
##   Second     76  108
##   Third     144  347

table(df$Pclass, df$Embarked)

##         
##            C   Q   S
##   First   85   2 129
##   Second  17   3 164
##   Third   66  72 353

## Chi-squared test (For Testing Association Between Categorical Variables)
chisq.test(table(df$Survived, df$Pclass))

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$Survived, df$Pclass)
## X-squared = 102.89, df = 2, p-value < 2.2e-16

chisq.test(table(df$Survived, df$Sex))

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(df$Survived, df$Sex)
## X-squared = 260.72, df = 1, p-value < 2.2e-16

chisq.test(table(df$Survived, df$Embarked))

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$Survived, df$Embarked)
## X-squared = 25.964, df = 2, p-value = 2.301e-06

chisq.test(table(df$Pclass, df$Sex))

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$Pclass, df$Sex)
## X-squared = 16.971, df = 2, p-value = 0.0002064

chisq.test(table(df$Pclass, df$Embarked))

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$Pclass, df$Embarked)
## X-squared = 122.64, df = 4, p-value < 2.2e-16

chisq.test(table(df$Sex, df$Embarked))

## 
##  Pearson's Chi-squared test
## 
## data:  table(df$Sex, df$Embarked)
## X-squared = 12.917, df = 2, p-value = 0.001567
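The chi-squared statistic can be reproduced directly from a contingency table, which also gives the expected counts and the survival rate per class. The sketch below rebuilds the Survived by Pclass table from the counts printed earlier; the object names are illustrative:

```r
# Survived x Pclass counts copied from the contingency table above.
tab <- matrix(
  c( 80, 97, 372,
    136, 87, 119),
  nrow = 2, byrow = TRUE,
  dimnames = list(Survived = c("NO", "YES"),
                  Pclass   = c("First", "Second", "Third"))
)

res <- chisq.test(tab)
print(res)                               # Pearson's chi-squared test
print(round(res$expected, 1))            # counts expected under independence

# Share of each class that survived (column proportions).
survival_rate <- prop.table(tab, margin = 2)["YES", ]
print(round(survival_rate, 3))
```

Comparing the observed counts with res$expected shows where the association comes from: first class has more survivors than independence would predict, and third class has fewer.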

## T-Test / ANOVA (For Numeric and Categorical Variables). 
t.test(Age ~ Survived, data = df)       # Welch t-test for Age by Survived

## 
##  Welch Two Sample t-test
## 
## data:  Age by Survived
## t = 2.0385, df = 669.03, p-value = 0.04189
## alternative hypothesis: true difference in means between group NO and group YES is not equal to 0
## 95 percent confidence interval:
##  0.06862884 3.66201421
## sample estimates:
##  mean in group NO mean in group YES 
##          30.41510          28.54978

aov(Fare ~ Pclass, data = df)           # ANOVA for Fare by Pclass

## Call:
##    aov(formula = Fare ~ Pclass, data = df)
## 
## Terms:
##                    Pclass Residuals
## Sum of Squares   776030.1 1421768.7
## Deg. of Freedom         2       888
## 
## Residual standard error: 40.01363
## Estimated effects may be unbalanced

aov(Fare ~ Embarked, data = df)         # ANOVA for Fare by Embarked

## Call:
##    aov(formula = Fare ~ Embarked, data = df)
## 
## Terms:
##                  Embarked Residuals
## Sum of Squares   172853.4 2024945.4
## Deg. of Freedom         2       888
## 
## Residual standard error: 47.75295
## Estimated effects may be unbalanced
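Printing an aov object shows the sums of squares but not the F statistic or p-value; summary() on the fitted model reports those. As a sketch, the same quantities can be derived from the printed values above. The helper name anova_f is hypothetical:

```r
# Hypothetical helper: F statistic and p-value from an aov printout.
anova_f <- function(ss_between, ss_within, df_between, df_within) {
  ms_between <- ss_between / df_between      # mean square for the factor
  ms_within  <- ss_within / df_within        # residual mean square
  f <- ms_between / ms_within
  list(F = f,
       p_value = pf(f, df_between, df_within, lower.tail = FALSE),
       residual_se = sqrt(ms_within))
}

# Values copied from the Fare ~ Pclass and Fare ~ Embarked printouts above.
by_pclass   <- anova_f(776030.1, 1421768.7, 2, 888)
by_embarked <- anova_f(172853.4, 2024945.4, 2, 888)
print(by_pclass)
print(by_embarked)

# With the data frame loaded, summary(aov(Fare ~ Pclass, data = df))
# reports the F statistic and p-value directly.
```

The residual_se values reproduce the residual standard errors printed above, which is a quick check that the sums of squares were copied correctly.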

## Point-Biserial Correlation (For One Binary Categorical and One Quantitative Variable)

 # Point-biserial correlation between Survived and Age
cor(as.integer(df$Survived), df$Age, use = "complete.obs")

## [1] -0.06980852

# Point-biserial correlation between Survived and Fare
cor(as.integer(df$Survived), df$Fare, use = "complete.obs")

## [1] 0.2573065
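The point-biserial correlation is the Pearson correlation with the binary variable coded numerically. as.integer() on a two-level factor yields codes 1 and 2 rather than 0 and 1, but the coefficient is unchanged by that shift. A small sketch with made-up values illustrates this and shows cor.test(), which adds a significance test:

```r
# Toy data (made-up values): a binary factor and a numeric variable.
survived <- factor(c("NO", "NO", "YES", "YES", "NO", "YES"))
fare     <- c(7, 8, 30, 70, 10, 25)

r_codes <- cor(as.integer(survived), fare)            # factor codes 1 / 2
r_01    <- cor(as.integer(survived == "YES"), fare)   # explicit 0 / 1 coding
print(c(r_codes, r_01))                               # identical coefficients

# cor.test() reports a t statistic, p-value and confidence interval as well.
print(cor.test(as.integer(survived == "YES"), fare))
```

The sign depends on the level order of the factor: with levels NO then YES, a positive coefficient means higher values go with survival.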

## Spearman's Rank Correlation (For Ordinal Vars or When Assumptions of Pearson Correlation Are Not Met). 
# Spearman correlation between Pclass and Age
cor(as.integer(df$Pclass), df$Age, method = "spearman")

## [1] -0.308875

# Spearman correlation between Pclass and Fare
cor(as.integer(df$Pclass), df$Fare, method = "spearman")

## [1] -0.6880317
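Spearman's coefficient is the Pearson correlation computed on ranks, which is why it suits an ordinal variable such as Pclass. The sketch below uses made-up values and an explicitly ordered factor, so the integer codes do not depend on the default alphabetical level order:

```r
# Toy data (made-up values): an ordered class variable and a numeric fare.
pclass <- factor(c("First", "Third", "Second", "Third", "First", "Second"),
                 levels = c("First", "Second", "Third"), ordered = TRUE)
fare   <- c(80, 7, 20, 9, 60, 15)

rho_direct <- cor(as.integer(pclass), fare, method = "spearman")
rho_ranks  <- cor(rank(as.integer(pclass)), rank(fare))   # Pearson on ranks
print(c(rho_direct, rho_ranks))
```

With First coded as 1 and Third as 3, a negative coefficient means fares fall as the class number rises, which is the direction reported above for Pclass and Fare.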

4. Interpretation and Findings.

In this iteration of the EDA of the Titanic dataset, we moved from graphical analysis to non-graphical univariate and bivariate statistical analysis. Descriptive statistics rest on two key kinds of measures: “central tendency” (such as the mean and median), which describes the middle point of a data set, and “spread or variability” (such as the variance, standard deviation, range and IQR), which describes how dispersed the data are around that central point. Shape measures such as skewness and kurtosis add information about asymmetry and tail weight. Together they summarize the typical value and the variation within the data. For example, Fare has a mean of 32.2 but a median of 14.45 and a skewness of 4.78, which indicates a strongly right-skewed distribution. Ultimately, we see the value of both non-graphical and graphical descriptive statistics and how the two forms complement each other.

Data Preparation

Data Preparation is fairly boilerplate at this point with reusable functions for handling missing data, performing imputation and filtering/cleaning. See prior studies and papers in the References section.

Non-graphical Univariate Statistics

For univariate statistics, variables of different types and classifications call for particular combinations of descriptive statistics, which yield specific insights that may be less clear in a graphical plot. For a continuous numeric variable like Age, the applicable statistics include the mean, median, range, variance, standard deviation, quantiles and interquartile range (IQR). For a discrete numeric variable like SibSp, they include frequency, mean, median, range, variance and standard deviation. For qualitative categorical variables classified as binary, nominal or ordinal, the key statistics are frequency, proportion and mode.

Non-graphical Bivariate Statistics

For bivariate analysis, I applied six different methods to examine the relationships between pairs of variables non-graphically and to understand how they correlate and interact. Not every method applies to every pair; the type and classification of the variables determine which method is appropriate.

- Correlation (For Quantitative Variables).
- Crosstabulation / Contingency Table (For Categorical Variables).
- Chi-squared test (For Testing Association Between Categorical Variables).
- T-Test / ANOVA (For Numeric and Categorical Variables).
- Point-Biserial Correlation (For One Binary Categorical and One Quantitative Variable).
- Spearman’s Rank Correlation (For Ordinal Variables or When Assumptions of Pearson Correlation Are Not Met).

Key Benefits of Non-graphical Statistics

Precise Numerical Summaries: Descriptive statistics like the average age (29.7) or the proportion of passengers in each Pclass (55% in third class) provide foundational knowledge about the data. Correlations between variables like Age and Fare reveal the strength and direction of relationships, which visualizations alone might suggest but not quantify. Crosstabulations and chi-squared tests for categorical variables, such as Survived and Sex, provide counts and tests of association that give more depth than a simple bar chart might.
Statistics Quantify Insights and Relationships: Correlations between continuous variables (e.g., Age and Fare) provide a numerical measure of how closely changes in one variable relate to changes in another. A scatter plot might suggest a weak or strong association, but calculating the correlation coefficient (Pearson or Spearman) quantifies it. Here, the Pearson correlation between Age and Fare is only about 0.09, while the Spearman correlation between Pclass and Fare is about -0.69. Chi-squared tests allow for statistical testing of associations between categorical variables, such as Survived and Pclass. While a stacked bar chart may indicate a difference in survival rates across classes, a chi-squared test assesses whether this difference is statistically significant (here X-squared = 102.89, df = 2, p-value < 2.2e-16), offering a more robust conclusion. Point-biserial correlations between a binary variable (like Survived) and a continuous variable (like Fare) give insight into how survival status relates to fare prices. This measure quantifies how strongly the two are related (here about 0.26), which complements what may be visible on a box plot.
Complementing Visualizations with Non-Graphical Statistics: Histograms and density plots show the distribution of variables like Age or Fare but don’t provide exact values for central tendency or spread. Descriptive statistics such as mean, median, variance, and IQR give precise values that help interpret these distributions more accurately. Box plots can indicate the presence of outliers, but calculating the IQR and identifying the specific data points below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR offers a more exact approach to defining and handling outliers. Scatter plots may show relationships, but they don’t quantify them. Correlation coefficients measure how strong or weak the relationship is, supporting the visual pattern with a quantitative measure.

Summary

Non-graphical statistics and visualizations together form a balanced approach to EDA. Visualizations offer intuitive insights and reveal patterns that guide further analysis, while non-graphical statistics provide precise, quantitative measurements that validate and deepen understanding. By using both, we can:
- Confirm observations made visually.
- Quantify relationships and patterns.
- Assess statistical significance.
- Ensure findings are both visually appealing and analytically rigorous.

This combination helps uncover detailed relationships in the Titanic dataset, such as how passenger class affects the survival rate (136 of 216 first-class passengers survived, about 63%, versus 119 of 491 third-class passengers, about 24%) or how fares vary by embarkation point, allowing for a more robust and accurate EDA.

References

Hinton, W. (2024). A Univariate Visualization and Analysis of the Titanic Data Set. Available at Rpubs. https://www.rpubs.com/whinton/.

Hinton, W. (2024). From Univariate to Bivariate and Multivariate Analysis. Available at Rpubs. https://www.rpubs.com/whinton/.

Packt Publishing. (2018). R programming for statistics and data science (Media from Packt Publishing available freely through O’Reilly Media Inc.). https://learning.oreilly.com/course/r-programming-for/9781789950298/.

Datar, R., & Garg, H. (2019). Hands-on exploratory data analysis with R: Become an expert in exploratory data analysis using R packages. O’Reilly Media, Inc.

Prabhakaran, S. (2023). The complete ggplot2 tutorial. R-statistics.co. https://r-statistics.co/Complete-Ggplot2-Tutorial-Part1-With-R-Code.html

Smeaton, A. (2003). NIST/SEMATECH Engineering Statistics Handbook. https://www.itl.nist.gov/div898/handbook/

Kabacoff, R. (2022). R in Action, Third Edition. O’Reilly Online Learning. https://learning.oreilly.com/library/view/r-in-action/9781617296055/

EPA. (2024). Exploratory Data Analysis. United States Environmental Protection Agency.

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. https://ggplot2.tidyverse.org

This study was conducted by Will Hinton.

Appendix: Graphical Analysis - Visualizations

Plot of a single categorical/factor variable Survived (YES=1, NO=0).

Survived
- Overall, the bar plot is the most useful here, clearly visualizing survival counts.

Plots of a single categorical/factor variable Pclass (First=1, Second=2, Third=3).

Pclass
- Overall: Bar plot is the most effective way to visualize passenger class distribution.

Plot of a single quantitative/numeric variable Age (decimal).

Age
- Overall: Histogram and Box plot are valuable for visualizing Age, while the Probability plot can check for normality.

Plot of a single quantitative/numeric variable Fare (decimal).

Fare
- Overall: Histogram, Box plot, and Probability plot are effective for Fare analysis.

Bivariate Analysis

Bar Plots for Categorical vs. Categorical (and Categorical vs. Continuous).
Survival by Pclass, Survival by Embarked, Survival by Sex, Survival by Age.

## ####################################################################################

Box Plots for Categorical vs. Continuous. Scatter Plot for Continuous vs. Continuous.
Fare by Pclass, Fare by Embarked, Age by Survived, plus Fare by Age.

## ####################################################################################

Multivariate Analysis.

This multi-plot shows different graphical examples of multivariate analysis:
1. Facet Grid. The relationship between age, fare, and survival across gender and class categories.
2. Bubble Plot. Family size (SibSp) is represented by bubble size, to see if larger family sizes impacted survival in relation to age and fare.
3. Heatmap. Survival rates within each embarkation point and class, highlighting differences in survival probability.

## ####################################################################################

TIM-8501 Assignment 5

Will Hinton

2024-11-17