load("C:/Users/jessi/Desktop/ML7331_ICA5/titanic2.raw.rdata")
# Number of rows
nrow(titanic.raw)
[1] 2201
# Number of columns
ncol(titanic.raw)
[1] 4
# Or both at once (dimension of the data)
dim(titanic.raw)
[1] 2201 4
# Column names
colnames(titanic.raw)
[1] "Class" "Sex" "Age" "Survived"
# Display the first 6 rows
head(titanic.raw)
summary(titanic.raw)
  Class         Sex          Age         Survived
 1st :325   Female: 470   Adult:2092   No :1490
 2nd :285   Male  :1731   Child: 109   Yes: 711
 3rd :706
 Crew:885
str(titanic.raw)
'data.frame': 2201 obs. of 4 variables:
$ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ Age : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 2 2 2 2 2 ...
$ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
# Total missing values per column
colSums(is.na(titanic.raw))
Class Sex Age Survived
0 0 0 0
# Alternatively, check if there are any missing values in the entire dataset
any(is.na(titanic.raw))
[1] FALSE
Q.3:
Summary of each step’s results and interpretation:
Step 1: Load the Dataset
The first few rows and structure of the dataset:
Class Sex Age Survived
0 3rd Male Child No
1 3rd Male Child No
2 3rd Male Child No
3 3rd Male Child No
4 3rd Male Child No
Data structure information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2201 entries, 0 to 2200
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype
---  ------    --------------  -----
 0   Class     2201 non-null   object
 1   Sex       2201 non-null   object
 2   Age       2201 non-null   object
 3   Survived  2201 non-null   object
dtypes: object(4)
memory usage: 68.9+ KB
Summary of each column:
Column    Count  Unique  Top    Freq
Class     2201   4       Crew   885
Sex       2201   2       Male   1731
Age       2201   2       Adult  2092
Survived  2201   2       No     1490
Step 2: Summary of Categorical Distributions
The distribution of values in each categorical column:
Class distribution:
Crew 885
3rd 706
1st 325
2nd 285
Sex distribution:
Male 1731
Female 470
Age distribution:
Adult 2092
Child 109
Survived distribution:
No 1490
Yes 711
Step 3: Support Count Calculations
Support count for \(\{1st, \text{Adult, Yes}\}\):
Support count for {1st, Adult, Yes}: 197
This means there are 197 instances in the dataset where the passenger
was in 1st class, was an adult, and survived.
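One way to obtain such a count is to scan the records and test whether each one contains every item in the itemset. The sketch below is an assumption about how this could be done, not the code that produced the 197; the function name `support_count` and the four sample rows are hypothetical.

```python
def support_count(records, itemset):
    """Count the records that contain every item in `itemset`."""
    items = set(itemset)
    # A record matches when the itemset is a subset of its column values.
    return sum(1 for record in records if items <= set(record.values()))


# Hypothetical sample rows with the same four columns as titanic.raw.
sample = [
    {"Class": "1st", "Sex": "Female", "Age": "Adult", "Survived": "Yes"},
    {"Class": "1st", "Sex": "Male", "Age": "Adult", "Survived": "Yes"},
    {"Class": "1st", "Sex": "Male", "Age": "Adult", "Survived": "No"},
    {"Class": "3rd", "Sex": "Male", "Age": "Child", "Survived": "No"},
]

# Two of the four sample rows are 1st class, adult, and survived.
print(support_count(sample, {"1st", "Adult", "Yes"}))
```

Matching on column values alone works here because no value appears in more than one column of this dataset.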
Additional Steps: Calculating Support and Confidence
Based on this support count and assuming a total of 2201 records, we
can now calculate support and confidence.
Support for \(\{1st, \text{Adult, Yes}\}\):
\[
s = \frac{197}{2201} \approx 0.0895 \text{ or } 8.95\%
\]
Support count for \(\{1st, \text{Adult, Yes, Female}\}\): 140 (the count reported in the Q.2 answer)
Confidence for \(\{1st, \text{Adult, Yes}\} \rightarrow \{\text{Female}\}\): Calculate confidence using the following formula, where \(\sigma(\cdot)\) denotes the support count:
\[
c = \frac{\sigma(\{1st, \text{Adult, Yes, Female}\})}{\sigma(\{1st, \text{Adult, Yes}\})}
\]
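The two formulas above reduce to simple ratios of counts. A minimal sketch, assuming the counts stated in this report (197 for the antecedent, 140 for the full itemset as given in the Q.2 answer, 2201 records in total); the helper names `support` and `confidence` are hypothetical.

```python
def support(count, total):
    """Fraction of all records that contain the itemset."""
    return count / total


def confidence(count_rule, count_antecedent):
    """Support count of the whole rule divided by that of its antecedent."""
    return count_rule / count_antecedent


total_records = 2201
count_antecedent = 197  # {1st, Adult, Yes}
count_rule = 140        # {1st, Adult, Yes, Female}, from the Q.2 answer

print(round(support(count_antecedent, total_records), 4))  # 0.0895
print(round(confidence(count_rule, count_antecedent), 4))  # 0.7107
```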
Q.1 Answer:
Given the dataset size of 2201 records, I calculated
the minimum support count for a minsup threshold of 0.25:
# Total dataset size
total_records = 2201
# Minsup threshold and required support count
minsup_threshold = 0.25
minsup_count = int(minsup_threshold * total_records)  # 0.25 * 2201 = 550.25, truncated to 550
print("Required support count for minsup of 0.25:", minsup_count)
This means an itemset needs a support count of at least 550 to be frequent at this level. Checking this against the support counts we’ve seen:
{1st, Adult, Yes} has a support count of 197.
{1st, Adult, Yes, Female} has a support count of 140.
Since both are below 550, neither itemset meets the minsup threshold of 0.25, so neither is frequent.
Based on this, I conclude:
- By the Apriori principle, no 4-itemset containing {1st, Adult, Yes} can be frequent or maximal at this minsup level, since that 3-itemset already falls below the required support count.
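The threshold check can also be written as a small helper. This is a sketch, not the code used above: `min_support_count` and `is_frequent` are hypothetical names, and it rounds up with `math.ceil` instead of truncating, on the assumption that an itemset must reach the full 25% of records. Under that reading the cutoff is 551 rather than 550; the conclusion for the two itemsets is the same either way.

```python
import math


def min_support_count(minsup, total):
    """Smallest integer count whose share of `total` is at least `minsup`."""
    return math.ceil(minsup * total)


def is_frequent(count, minsup, total):
    """True when the itemset's support (count / total) reaches `minsup`."""
    return count / total >= minsup


total_records = 2201
minsup_threshold = 0.25

print(min_support_count(minsup_threshold, total_records))  # 551
print(is_frequent(197, minsup_threshold, total_records))   # False
print(is_frequent(140, minsup_threshold, total_records))   # False
```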
Q.2 Answer:
- The support count for {1st, Adult, Yes} is 197, with a support of 0.0895 or about 8.95% of the dataset.
- The support count for {1st, Adult, Yes, Female} is 140, with a support of 0.0636 or about 6.36%.
Q.3 Answer:
To calculate the confidence for the rule {1st, Adult, Yes} → {Female}, I use the support counts from previous answers.
Confidence Calculation:
\[
\text{Confidence} = \frac{\text{support count of }\{1st, \text{Adult, Yes, Female}\}}{\text{support count of }\{1st, \text{Adult, Yes}\}} = \frac{140}{197} \approx 0.7107
\]
So, the confidence for {1st, Adult, Yes} → {Female} is about 71.07%. This means that if someone in the dataset is from 1st class, an adult, and survived, there’s a 71.07% chance they are female.
Q.4 Answer:
Support Count for minsup Threshold of 0.25:
- With a minsup threshold of 0.25, an itemset needs a support count of at least 550 (25% of 2201 records) to be considered frequent.
Frequent Itemsets with minsup = 0.25:
- To be frequent, an itemset must have a support count of 550 or more. Looking at our support counts:
{1st, Adult, Yes} has a support count of 197.
{1st, Adult, Yes, Female} has a support count of 140.
- Since both are below 550, neither meets the minsup threshold, so they’re not frequent.
Based on the dataset, broad itemsets such as {No} for Survived (1490 records) or {Male} for Sex (1731 records) do exceed a support count of 550.
Maximal 4-Itemsets:
- An itemset is maximal if it’s frequent and has no frequent supersets. Since the 4-itemset examined here, {1st, Adult, Yes, Female}, does not reach a support count of 550, it is neither frequent nor maximal at this minsup level.
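The definition of a maximal itemset can be checked mechanically once the frequent itemsets are known. A minimal sketch, assuming the frequent itemsets are already available as a list; the function name `maximal_itemsets` and the example list are hypothetical and are not results computed from the dataset.

```python
def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    frozen = [frozenset(itemset) for itemset in frequent]
    # `s < other` is True only when s is a proper subset of other.
    return [s for s in frozen if not any(s < other for other in frozen)]


# Hypothetical list of frequent itemsets, for illustration only.
example_frequent = [
    {"Male"}, {"Adult"}, {"No"},
    {"Male", "Adult"}, {"Male", "No"},
]

for itemset in maximal_itemsets(example_frequent):
    print(sorted(itemset))
```

In this illustration the single items are dropped because each sits inside a frequent pair, leaving the two pairs as maximal.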
Q.5 Answer:
- Step 1: Identify Frequent 3-Itemsets
- With a minimum support count of 200, the following 3-itemsets from the bar graph are frequent:
{Adult, Female, No} with a support count of 1200
{Crew, Male, No} with a support count of 600
{3rd, Male, No} with a support count of 300
- Step 2: Apply the Apriori Principle
- To generate candidate 4-itemsets, I use the Apriori principle, which states that every subset of a frequent itemset must also be frequent. So, I can only form a candidate 4-itemset by merging two frequent 3-itemsets that share two items, and every 3-item subset of the result must itself be frequent.
- Step 3: Attempt to Generate 4-Itemsets
{Adult, Female, No} and {Crew, Male, No} share only No, which is not enough to form a 4-itemset.
{Crew, Male, No} and {3rd, Male, No} share Male and No, but they differ in the Class attribute. Merging them would put two Class values in one itemset, which no record can match, so they don’t combine into a valid 4-itemset.
{Adult, Female, No} and {3rd, Male, No} also share only No, preventing a 4-itemset.
Conclusion
Since none of these frequent 3-itemsets overlap enough to be merged under the Apriori principle, no candidate 4-itemsets can be generated, and so none can meet the minsup threshold.
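The pairwise check in Step 3 can be sketched in code. This is an illustration under stated assumptions, not a full Apriori implementation: it merges any two frequent itemsets whose union has exactly k items (looser than the textbook prefix-based join), rejects unions that hold two values of the same attribute, and then applies the subset pruning from Step 2. The names `ATTRIBUTE_OF`, `one_value_per_attribute`, and `candidate_itemsets` are hypothetical; the attribute levels come from the dataset summary.

```python
from itertools import combinations

# Which column each item belongs to (levels as listed in the summary).
ATTRIBUTE_OF = {
    "1st": "Class", "2nd": "Class", "3rd": "Class", "Crew": "Class",
    "Female": "Sex", "Male": "Sex",
    "Adult": "Age", "Child": "Age",
    "No": "Survived", "Yes": "Survived",
}


def one_value_per_attribute(itemset):
    """False when the itemset holds two values of the same column."""
    attributes = [ATTRIBUTE_OF[item] for item in itemset]
    return len(attributes) == len(set(attributes))


def candidate_itemsets(frequent, k):
    """Merge frequent (k-1)-itemsets into candidate k-itemsets."""
    frequent = {frozenset(itemset) for itemset in frequent}
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) != k:
            continue  # the pair does not share k-2 items
        if not one_value_per_attribute(union):
            continue  # e.g. Crew and 3rd in the same itemset
        # Apriori pruning: every (k-1)-subset must itself be frequent.
        if all(frozenset(sub) in frequent for sub in combinations(union, k - 1)):
            candidates.add(union)
    return candidates


frequent_3 = [
    {"Adult", "Female", "No"},
    {"Crew", "Male", "No"},
    {"3rd", "Male", "No"},
]

print(candidate_itemsets(frequent_3, 4))  # set()
```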
