This report presents a comprehensive exploratory data analysis (EDA) of the Titanic dataset. The primary goals of this analysis include uncovering initial patterns, assessing data structure and quality, and preparing the dataset for further statistical modeling and predictive analysis. The Titanic dataset was selected due to its educational significance, well-structured variables, and suitability for classification-based learning objectives in both statistical and machine learning contexts. The analysis offers a high-level overview of passenger demographics and survival patterns that are relevant to future analytical applications.
The dataset provides a real-world classification challenge—predicting survival following the Titanic disaster—and is frequently used in academic and industry training settings. Preliminary findings revealed strong survival associations based on passenger class, gender, and age, underscoring the impact of socio-demographic variables on survival outcomes.
The dataset (train.csv) consists of 891 rows and 12 columns, each capturing individual passenger-level data. Upon initial examination, R imported seven variables as numeric or integer (PassengerId, Survived, Pclass, Age, SibSp, Parch, and Fare) and five as character (Name, Sex, Ticket, Cabin, and Embarked); conceptually, however, only Age, Fare, SibSp, and Parch behave as quantitative measurements, while the remaining variables, including the numerically coded Survived, Pclass, and PassengerId, are qualitative.
Key variables relevant for classification and modeling include:
- Survived: survival status (Died/Survived), the outcome of interest
- Pclass: passenger class (1st, 2nd, or 3rd), a proxy for socio-economic status
- Sex: passenger gender
- Age: passenger age in years
- SibSp and Parch: counts of siblings/spouses and parents/children aboard
- Fare: the ticket fare paid
- Embarked: port of embarkation
These variables are essential for identifying patterns that can inform predictive models and survival analysis.
Each variable was examined and reclassified where necessary to accurately reflect its role in the dataset. Variables were grouped into quantitative and qualitative types. During preprocessing, 7 variables originally classified as numeric or character were recoded as categorical or ordinal to better represent their function: Survived, Sex, Embarked, Ticket, Cabin, Pclass, and PassengerId. This reclassification ensured the proper treatment of these variables during statistical analysis and improved analytical validity.
Measurement scales were also applied to the variables to guide appropriate statistical techniques:
- Nominal: PassengerId, Survived, Name, Sex, Ticket, Cabin, and Embarked
- Ordinal: Pclass
- Ratio: Age and Fare (continuous), and SibSp and Parch (discrete counts)
This classification framework facilitates the selection of valid methods for modeling, visualization, and hypothesis testing.
Accurate classification of variable types is fundamental to the EDA process. The correct identification of nominal, ordinal, interval, and ratio variables ensures that analytical tools and models align with the nature of the data. Failure to recognize variable scale can lead to inappropriate statistical techniques, incorrect visualizations, and invalid inferences. This stage of the analysis provided a structured foundation for future steps such as feature selection, transformation, and model design.
The EDA process was documented and executed within an R Markdown environment. This approach ensured reproducibility and transparency by embedding R code, output, and commentary in a single source document. Using the knit function, the analysis was compiled into a shareable .pdf or .html format, preserving all steps and outputs for stakeholders and reviewers.
The EDA process established the structural integrity and analytic readiness of the Titanic dataset. Key steps included identifying data types, correcting variable classifications, and applying appropriate measurement scales. These actions reinforced best practices in data preparation and set the stage for meaningful statistical modeling. Emphasis was placed on understanding variable behavior and interrelationships before advancing to predictive modeling or inferential statistics.
This preliminary analysis sets the foundation for subsequent advanced analytical activities, including treatment of missing values (most notably Age), feature engineering (for example, deriving a family-size variable from SibSp and Parch), exploratory visualization of survival patterns, and classification modeling with methods such as logistic regression and decision trees.
The EDA performed on the Titanic dataset represents a critical first phase of the data analysis cycle. Through careful examination of structure, types, and classifications, the dataset was validated and prepared for future modeling. The results of this process improve the integrity of the dataset and provide a solid analytical framework for deriving insights related to survival patterns and socio-demographic predictors.
Exploratory Data Analysis (EDA) is a foundational approach in data analysis that emphasizes the use of summary statistics and visual techniques to explore patterns, identify anomalies, and understand the underlying structure of data before formal modeling. Introduced by John W. Tukey in his seminal work Exploratory Data Analysis (1977), EDA shifted the focus of statistical work from purely confirmatory hypothesis testing to open-ended exploration, allowing data to guide inquiry and insight generation. Tukey’s work laid the groundwork for a philosophy of analysis rooted in curiosity, graphical inspection, and robustness to assumptions, encouraging analysts to “let the data speak.”
The goals of EDA include summarizing key features of datasets, detecting data quality issues such as missing values or outliers, checking assumptions required for statistical modeling, identifying patterns and relationships among variables, and generating hypotheses (Jebb, Parrigon, & Woo, 2017). These goals are particularly critical in contemporary data science, where analysts often begin their workflow by performing EDA to inform subsequent modeling decisions and ensure valid interpretations (Behrens, 1997). EDA also serves as a quality control process, helping researchers avoid spurious findings that may arise from blindly applying statistical models to poorly understood data.
In R, EDA is supported through numerous packages and functions that streamline the process of data exploration. The tidyverse suite—including dplyr for data manipulation and ggplot2 for visualization—is designed around principles that align closely with Tukey’s original philosophy (Wickham & Grolemund, 2017b). These tools enable rapid transformation, summarization, and graphical exploration, making R a powerful environment for both educational and professional applications of EDA. Notably, the development of statistical programming languages like S (which later became R) was inspired in part by the interactive, visually driven nature of EDA.
Today, R is one of the predominant languages for performing EDA in both academic and industry settings. R provides a rich ecosystem of packages for data exploration. For example, the tidyverse collection (including packages like dplyr for data manipulation and ggplot2 for plotting) is widely used to quickly summarize data and create exploratory graphics. Data scientists often use R to compute summary statistics (means, medians, frequencies, etc.), create plots (histograms, boxplots, scatterplots), and perform preliminary analyses with just a few commands. The results allow them to iteratively investigate the data, refine questions, and prepare for more formal modeling. The emphasis on visualization in R aligns with EDA’s ethos – as Tukey noted, seeing the data is a powerful way to understand it (Tukey, 1977).
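As a brief illustrative sketch of this workflow (using R's built-in mtcars dataset rather than the Titanic data, purely to show the group-and-summarise and plotting pattern):
library(dplyr)
library(ggplot2)
# Summary statistics by group: sample size, mean, and median mpg per cylinder count
mtcars %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg), median_mpg = median(mpg))
# A quick exploratory graphic: the distribution of fuel efficiency
ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)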
The Titanic dataset, widely used in teaching and competitions, provides a compelling case for applying EDA. It consists of records for passengers aboard the RMS Titanic, which struck an iceberg and sank during its maiden voyage across the Atlantic on April 15, 1912, resulting in the deaths of 1,502 of the 2,224 passengers and crew. This tragedy has been well documented and has produced detailed passenger records, which form the basis of the Titanic dataset commonly used in data analysis contexts. The dataset includes variables such as passenger class, gender, age, ticket fare, number of family members aboard, and survival status. One commonly analyzed version of the dataset contains 891 observations with 12 variables, made publicly available through platforms such as Kaggle and Vanderbilt University (Perrier, 2017).
Kaggle chose this dataset for their introductory competition because of its iconic status and the instructive patterns it contains. The competition and its many tutorials have solidified the Titanic dataset’s role as a benchmark for beginner machine learning techniques (Perrier, 2017). Participants use the training portion of the data (the 891 records described earlier) to train models like logistic regression, decision trees, or support vector machines to predict the survival status of the remaining passengers. In doing so, they learn how to perform EDA on the dataset (e.g., visualizing survival rates by passenger class or creating new features like “family size”), how to handle missing data (such as ages), and how to evaluate model performance. The Kaggle Titanic challenge has become almost a rite of passage for newcomers in data science. It demonstrates how EDA and feature engineering can significantly improve a model’s accuracy – for example, one discovers through EDA that women and children had higher survival rates, which suggests incorporating the Sex and Age variables (and perhaps creating a “women/children first” feature) in a predictive model.
Beyond Kaggle, the Titanic data is widely used in academic coursework for teaching statistics and machine learning. In statistics classes, instructors use the Titanic data for illustrating concepts like contingency tables, odds ratios, and logistic regression. For instance, one can perform a chi-square test to examine the association between passenger class and survival, or fit a logistic regression model to quantify how each variable (gender, class, age, etc.) affects the probability of survival. The data’s binary outcome (survived or not) makes it ideal for teaching classification methods. In fact, the Titanic dataset is commonly used for demonstrating classification algorithms and survival analysis techniques, as it naturally involves predicting survival outcomes (Wickham & Grolemund, 2017b).
Although formal survival analysis (in the statistical sense of time-to-event modeling) isn’t directly applied since the data doesn’t include times of death, the term “survival analysis” is often used informally in this context to mean analyzing who lived or died and why. Students also learn to interpret model outputs in context – for example, a logistic regression might show that the odds of survival for females were several times higher than for males, reflecting the “women and children first” evacuation policy in place during the disaster.
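A minimal sketch of both analyses is shown below; it assumes the Titanic training file has been read into a data frame named train_data, as is done in the analysis section later in this report.
# Chi-square test of association between passenger class and survival
chisq.test(table(train_data$Pclass, train_data$Survived))
# Simple logistic regression of survival on class, gender, and age
# (rows with missing Age are dropped by default)
fit <- glm(Survived ~ factor(Pclass) + Sex + Age, family = binomial, data = train_data)
summary(fit)
exp(coef(fit))  # odds ratios; the Sex coefficient compares males to females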
The Titanic dataset has also been used in data visualization exercises and EDA demonstrations. Because the story behind the data is well-known, exploring it can be engaging. For example, one might plot survival rates by age and see the higher survival of children, or create a bar chart of survival by passenger class to illustrate how first-class passengers had better outcomes than third-class passengers. Such analyses help solidify understanding of EDA: they require formulating questions (“Did socio-economic status matter?”), using the dataset to answer them (by grouping and plotting survival by class), and then interpreting the results in a real-world context (“First-class passengers had a survival rate around 62%, while third-class passengers had around 25% – likely due to better access to lifeboats”; Wickham & Grolemund, 2017).
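A short sketch of such an exploration is given below, again assuming the train_data data frame loaded in the analysis section of this report and the ggplot2 package.
library(ggplot2)
# Row-wise survival proportions by passenger class
prop.table(table(train_data$Pclass, train_data$Survived), margin = 1)
# Stacked proportional bar chart of survival by class
ggplot(train_data, aes(x = factor(Pclass), fill = factor(Survived))) +
  geom_bar(position = "fill") +
  labs(x = "Passenger class", y = "Proportion of passengers", fill = "Survived")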
Educational use of the Titanic dataset is prevalent due to its binary outcome (survived/died), manageable size, and historical relevance. It has been adopted in textbooks, university coursework, and machine learning tutorials to teach data wrangling, feature engineering, and classification modeling. For example, it serves as an introductory case in logistic regression, decision tree modeling, and EDA principles (Wickham & Grolemund, 2017b). Its widespread availability and the presence of clear, interpretable patterns make it ideal for demonstrating the importance of variable types, missing data treatment, and exploratory visualization.
By exploring patterns such as survival rate by gender or passenger class, students and researchers can formulate hypotheses about social factors influencing survival and evaluate them with statistical techniques. Thus, the Titanic dataset not only offers historical insight but also provides a pedagogical bridge between exploratory and confirmatory data analysis. This project applies EDA principles to the Titanic dataset to examine its structure, classify variable types, detect data issues, and uncover relationships that may guide future modeling efforts.
To begin, set the working directory and import the train.csv file.
setwd("C:/My Project Week 1")
train_data <- read.csv("train.csv", header = TRUE, stringsAsFactors = FALSE)
summary(train_data)
## PassengerId Survived Pclass Name
## Min. : 1.0 Min. :0.0000 Min. :1.000 Length:891
## 1st Qu.:223.5 1st Qu.:0.0000 1st Qu.:2.000 Class :character
## Median :446.0 Median :0.0000 Median :3.000 Mode :character
## Mean :446.0 Mean :0.3838 Mean :2.309
## 3rd Qu.:668.5 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :891.0 Max. :1.0000 Max. :3.000
##
## Sex Age SibSp Parch
## Length:891 Min. : 0.42 Min. :0.000 Min. :0.0000
## Class :character 1st Qu.:20.12 1st Qu.:0.000 1st Qu.:0.0000
## Mode :character Median :28.00 Median :0.000 Median :0.0000
## Mean :29.70 Mean :0.523 Mean :0.3816
## 3rd Qu.:38.00 3rd Qu.:1.000 3rd Qu.:0.0000
## Max. :80.00 Max. :8.000 Max. :6.0000
## NA's :177
## Ticket Fare Cabin Embarked
## Length:891 Min. : 0.00 Length:891 Length:891
## Class :character 1st Qu.: 7.91 Class :character Class :character
## Mode :character Median : 14.45 Mode :character Mode :character
## Mean : 32.20
## 3rd Qu.: 31.00
## Max. :512.33
##
str(train_data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...
## $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : chr "male" "female" "female" "female" ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : chr "" "C85" "" "C123" ...
## $ Embarked : chr "S" "C" "S" "S" ...
num_rows <- nrow(train_data)
num_cols <- ncol(train_data)
cat("Number of rows:", num_rows, "\n")
## Number of rows: 891
cat("Number of columns:", num_cols, "\n")
## Number of columns: 12
The code above performs an initial structural analysis of the Titanic dataset using three key R functions:
summary(train_data) generates descriptive statistics for each variable, such as minimum, maximum, mean, and counts of missing values. This helps identify variable distributions and potential data quality issues (e.g., missing ages or skewed fares).
str(train_data) provides a compact display of the data types (e.g., integers, factors, characters) and a preview of values for each variable, allowing us to spot misclassified data early (e.g., Pclass as numeric when it may be ordinal).
nrow() and ncol() calculate the number of rows (observations) and columns (variables), giving us a quick sense of dataset size.
These functions are essential in Exploratory Data Analysis (EDA) because they offer a high-level overview of the dataset’s structure and variable types before deeper analysis. Understanding how many entries we’re working with, what types of data are present, and whether any variables need to be reclassified or cleaned is critical to ensuring valid analysis and avoiding mistakes in later steps.
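Beyond these three functions, a quick tally of missing and blank values (a small base-R sketch) helps quantify the data quality issues that summary() hints at:
# Count NA values in each column (Age is the only column with true NAs in this file)
colSums(is.na(train_data))
# Cabin and Embarked record missingness as empty strings rather than NA, so count those separately
sum(train_data$Cabin == "")
sum(train_data$Embarked == "")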
The str(train_data) output revealed that the dataset contains 891 observations across 12 variables. This function also helped identify the data types R has automatically assigned to each column. During this step of EDA, we classified each variable as either quantitative (numeric/measurable) or qualitative (categorical/label-based), and flagged any variables that may have been misclassified.
Key findings include:
- The dataset contains 891 observations and 12 variables.
- Survived, Pclass, and PassengerId were stored as integers even though they represent categorical or ordinal information.
- Name, Sex, Ticket, Cabin, and Embarked were stored as character strings.
- Age is the only variable with explicit missing values (177 NAs), while Cabin and Embarked contain blank entries rather than NAs.
This classification process is crucial in EDA because it ensures that each variable is analyzed appropriately. Using the wrong data type can lead to invalid assumptions in statistical modeling, incorrect summary statistics, or improper visualization methods. By identifying misclassifications early, this process ensures the rest of our analysis aligns with the true nature of the data.
In this step, I recoded several variables that were initially classified as numeric or character but actually represent categorical or ordinal data. Specifically:
- Survived was converted to a factor with the labels "Died" and "Survived".
- Sex, Embarked, Ticket, Cabin, and PassengerId were converted to unordered factors.
- Pclass was converted to an ordered factor with the levels 1st < 2nd < 3rd.
This recoding ensures that R treats each variable appropriately in future analysis. For example, when plotting or summarizing Survived, category counts rather than a numeric average are preferred. Similarly, statistical models like logistic regression will correctly handle these variables as categorical predictors, rather than incorrectly assuming linear effects from their numeric codes.
Proper variable encoding supports more accurate and interpretable EDA by preventing misleading statistical summaries or plots (Wickham & Grolemund, 2017a). It also ensures that modeling functions in R apply the correct mathematical treatment to each variable (McNamara & Horton, 2017), which is especially important when transitioning from exploration to model building.
train_data$Survived <- factor(train_data$Survived, levels = c(0,1), labels = c("Died", "Survived"))
train_data$Sex <- as.factor(train_data$Sex)
train_data$Embarked <- as.factor(train_data$Embarked)
train_data$Ticket <- as.factor(train_data$Ticket)
train_data$Cabin <- as.factor(train_data$Cabin)
train_data$Pclass <- ordered(train_data$Pclass, levels = c(1,2,3), labels = c("1st", "2nd", "3rd"))
train_data$PassengerId <- as.factor(train_data$PassengerId)
str(train_data)
## 'data.frame': 891 obs. of 12 variables:
## $ PassengerId: Factor w/ 891 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Survived : Factor w/ 2 levels "Died","Survived": 1 2 2 2 1 1 1 1 2 2 ...
## $ Pclass : Ord.factor w/ 3 levels "1st"<"2nd"<"3rd": 3 1 3 1 3 3 1 3 3 2 ...
## $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
## $ Sex : Factor w/ 2 levels "female","male": 2 1 1 1 2 2 2 2 1 1 ...
## $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...
## $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...
## $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...
## $ Ticket : Factor w/ 681 levels "110152","110413",..: 524 597 670 50 473 276 86 396 345 133 ...
## $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...
## $ Cabin : Factor w/ 148 levels "","A10","A14",..: 1 83 1 57 1 1 131 1 1 1 ...
## $ Embarked : Factor w/ 4 levels "","C","Q","S": 4 2 4 4 4 3 4 4 4 2 ...
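With Survived now stored as a factor, frequency tables and proportions, rather than a mean of 0/1 codes, become the natural summaries; a brief sketch:
# Category counts and proportions for the recoded Survived factor
table(train_data$Survived)
prop.table(table(train_data$Survived))
# Cross-tabulation of survival against the ordered Pclass factor
table(train_data$Pclass, train_data$Survived)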
To complete Step 5, each variable in the dataset was examined and classified according to its level of measurement—nominal, ordinal, interval, or ratio—as defined by Stevens’ typology. In addition, quantitative variables were further assessed to determine whether they are discrete (count-based) or continuous (measured across a range). This classification process ensures that subsequent analyses and modeling techniques align with the nature of the data. A structured data frame was created to summarize these classifications, providing a clear reference for understanding how each variable should be interpreted and handled during analysis.
variable_class <- data.frame(
Variable = c("PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"),
Type = c("Nominal", "Nominal", "Ordinal", "Nominal","Nominal","Ratio", "Ratio", "Ratio", "Nominal","Ratio", "Nominal", "Nominal"),
DiscreteOrContinuous = c("N/A", "N/A", "N/A", "N/A", "N/A", "Continuous", "Discrete","Discrete","N/A", "Continuous","N/A", "N/A")
)
variable_class
## Variable Type DiscreteOrContinuous
## 1 PassengerId Nominal N/A
## 2 Survived Nominal N/A
## 3 Pclass Ordinal N/A
## 4 Name Nominal N/A
## 5 Sex Nominal N/A
## 6 Age Ratio Continuous
## 7 SibSp Ratio Discrete
## 8 Parch Ratio Discrete
## 9 Ticket Nominal N/A
## 10 Fare Ratio Continuous
## 11 Cabin Nominal N/A
## 12 Embarked Nominal N/A
Classifying variables as nominal, ordinal, interval, or ratio—and as discrete or continuous—is a foundational part of the EDA cycle. These classifications guide how data is summarized, visualized, and modeled. For example, knowing that Pclass is ordinal (rather than just numeric) indicates that its levels have a meaningful order but not necessarily equal spacing, which affects how it is graphed (e.g., with boxplots) or included in regression models.
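For instance, a small sketch using the recoded train_data from the analysis above:
# Fare distributions across the ordered passenger classes
boxplot(Fare ~ Pclass, data = train_data, xlab = "Passenger class", ylab = "Fare")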
Likewise, identifying SibSp and Parch as discrete ratio variables ensures they should be interpreted as counts, not continuous values, influencing summary statistics (e.g., frequencies vs. means) and plot types (e.g., bar charts vs. histograms). Mistaking nominal variables (like Sex or Embarked) as numeric can lead to misleading averages or invalid assumptions in models.
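The distinction can be made concrete with a short sketch (again using train_data):
# Discrete ratio variables: counts and bar charts are the appropriate summaries
table(train_data$SibSp)
barplot(table(train_data$Parch), xlab = "Parch", ylab = "Number of passengers")
# Continuous ratio variables: distributional summaries and histograms
summary(train_data$Fare)
hist(train_data$Age, main = "Age distribution", xlab = "Age (years)")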
Ultimately, these classifications support valid decision-making throughout the EDA process: they ensure that the right statistical tools are applied, that relationships are properly interpreted, and that the groundwork is laid for meaningful modeling and communication of results (Bhattacherjee, 2012; Price et al., 2015).
Understanding variable types is a foundational principle of Exploratory Data Analysis (EDA) because it determines the kinds of summaries, visualizations, and statistical models that are valid and meaningful. Variables fall into different measurement levels—nominal, ordinal, interval, and ratio—based on their inherent properties. This classification, introduced by psychologist S.S. Stevens in 1946, has become a cornerstone of data analysis (Stevens, 1946).
Each type constrains the kinds of transformations and statistical techniques that can be appropriately used (Price et al., 2015; Bhattacherjee, 2012). For example, one can calculate a mean and standard deviation for interval and ratio variables, but not for nominal or ordinal ones. Misapplying a technique—such as using a Pearson correlation on ordinal data—can produce misleading or invalid results.
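As a small, hedged illustration using this dataset, a rank-based (Spearman) correlation is one defensible way to relate the ordinal Pclass to Fare, whereas a Pearson correlation on the raw class codes would implicitly assume equal spacing between classes:
# Spearman correlation uses ranks, so it respects the ordinal nature of Pclass
cor(as.numeric(train_data$Pclass), train_data$Fare, method = "spearman")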
In practice, understanding variable types helps analysts:
- choose valid summary statistics (for example, frequencies for nominal variables versus means and standard deviations for interval or ratio variables);
- select appropriate visualizations (bar charts for categorical variables, histograms or boxplots for continuous ones); and
- specify models and hypothesis tests that match the measurement scale of each variable.
However, this typology is not without criticism. In their influential article Nominal, Ordinal, Interval, and Ratio Typologies Are Misleading, Velleman and Wilkinson (1993) argue that Stevens’s categories oversimplify the nature of data and can mislead analysis when rigidly applied. They note that variable “type” is not an inherent feature of the data but often depends on the research question and context. For example, a variable like ticket number could be treated as nominal (just an ID), ordinal (arrival order), or even ratio (if sequential and counting attendees). Their critique encourages flexible thinking: rather than letting scale type limit analysis, analysts should use judgment and consider what’s meaningful in context.
Modern statistical thinking acknowledges Stevens’s framework as a useful starting point—but not a rigid rulebook. As Williams (2021) explains, meaningful data analysis depends not just on data types, but also on distributional assumptions, research goals, and empirical patterns discovered through EDA. In summary, identifying variable types is critical for valid EDA, but it must be combined with domain knowledge, flexibility, and an openness to what the data reveal.
Behrens, J. T. (1997). Principles and procedures of exploratory data analysis. Psychological Methods, 2(2), 131–160. https://doi.org/10.1037/1082-989X.2.2.131
Bhattacherjee, A. (2012). Social science research: Principles, methods, and practices (2nd ed.). University of South Florida Scholar Commons. http://scholarcommons.usf.edu/oa_textbooks/3
Jebb, A. T., Parrigon, S., & Woo, S. E. (2017). Exploratory data analysis as a foundation of inductive research. Human Resource Management Review, 27(2), 265–276. https://doi.org/10.1016/j.hrmr.2016.08.003
McNamara, A., & Horton, N. J. (2017). Wrangling categorical data in R. PeerJ Preprints, 5, e3163v2. https://doi.org/10.7287/peerj.preprints.3163v2
Perrier, A. (2017). Introduction to Titanic datasets. In Effective Amazon Machine Learning (pp. 59–65). Packt Publishing.
Price, P. C., Jhangiani, R. S., & Chiang, I. A. (2015). Research methods in psychology (2nd Canadian ed.). BCcampus. https://opentextbc.ca/researchmethods/
Stevens, S. S. (1946). On the theory of scales of measurement. Science, 103(2684), 677–680. https://doi.org/10.1126/science.103.2684.677
Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.
Velleman, P. F., & Wilkinson, L. (1993). Nominal, ordinal, interval, and ratio typologies are misleading. The American Statistician, 47(1), 65–72. https://www.jstor.org/stable/2684788
Wickham, H., & Grolemund, G. (2017a). R for data science. O’Reilly Media.
Wickham, H., & Grolemund, G. (2017b). R for data science: Import, tidy, transform, visualize, and model data. O’Reilly Media. https://r4ds.had.co.nz/
Williams, M. N. (2021). Levels of measurement and statistical analyses: The influence of Stevens’s scales on statistical practice. Meta-Psychology, 5, Article e1916. https://doi.org/10.15626/MP.2019.1916