load("C:/Users/jessi/Desktop/ML7331_ICA5/titanic2.raw.rdata")


# Number of rows
nrow(titanic.raw)
[1] 2201
# Number of columns
ncol(titanic.raw)
[1] 4
# Or both at once (dimension of the data)
dim(titanic.raw)
[1] 2201    4
# Column names
colnames(titanic.raw)
[1] "Class"    "Sex"      "Age"      "Survived"
# Display the first 6 rows
head(titanic.raw)

summary(titanic.raw)
  Class         Sex          Age       Survived  
 1st :325   Female: 470   Adult:2092   No :1490  
 2nd :285   Male  :1731   Child: 109   Yes: 711  
 3rd :706                                        
 Crew:885                                        
str(titanic.raw)
'data.frame':   2201 obs. of  4 variables:
 $ Class   : Factor w/ 4 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Sex     : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age     : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 2 2 2 2 2 ...
 $ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
# Total missing values per column
colSums(is.na(titanic.raw))
   Class      Sex      Age Survived 
       0        0        0        0 
# Alternatively, check if there are any missing values in the entire dataset
any(is.na(titanic.raw))
[1] FALSE

Q.3:

Summary of each step's results and interpretation:

Step 1: Load the Dataset

The first few rows and structure of the dataset:

  Class   Sex    Age Survived
0   3rd  Male  Child       No
1   3rd  Male  Child       No
2   3rd  Male  Child       No
3   3rd  Male  Child       No
4   3rd  Male  Child       No

Data structure information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2201 entries, 0 to 2200
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Class     2201 non-null   object
 1   Sex       2201 non-null   object
 2   Age       2201 non-null   object
 3   Survived  2201 non-null   object
dtypes: object(4)
memory usage: 68.9+ KB

Summary of each column:

Column    Count  Unique  Top    Freq
Class      2201       4  Crew    885
Sex        2201       2  Male   1731
Age        2201       2  Adult  2092
Survived   2201       2  No     1490

Step 2: Summary of Categorical Distributions

The distribution of values in each categorical column:

Class distribution:
 Crew    885
 3rd     706
 1st     325
 2nd     285

Sex distribution:
 Male      1731
 Female     470

Age distribution:
 Adult    2092
 Child     109

Survived distribution:
 No     1490
 Yes     711
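
These counts convert directly into support values by dividing each one by the 2201 records. A minimal Python sketch, using only the counts shown above (the names `item_counts` and `item_support` are illustrative choices, not part of the original analysis):

```python
# Item counts copied from the distributions above.
TOTAL_RECORDS = 2201
item_counts = {
    "Crew": 885, "3rd": 706, "1st": 325, "2nd": 285,
    "Male": 1731, "Female": 470,
    "Adult": 2092, "Child": 109,
    "No": 1490, "Yes": 711,
}

# Support of a single item = its count divided by the number of records.
item_support = {item: count / TOTAL_RECORDS for item, count in item_counts.items()}

# Print items from most to least common.
for item, support in sorted(item_support.items(), key=lambda kv: -kv[1]):
    print(f"{item:<7}{support:.4f}")
```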

Step 3: Support Count Calculations

The support count for {1st, Adult, Yes} is 197. This means there are 197 instances in the dataset where the passenger was in 1st class, was an adult, and survived.
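
A support count of this kind can be obtained by checking, for each record, whether it contains every item in the itemset. The sketch below shows the idea on a hypothetical five-row sample (not the real data); the function name `support_count` and the tuple layout are assumptions made for illustration.

```python
from typing import Iterable

# Each record is (Class, Sex, Age, Survived).
Record = tuple[str, str, str, str]

def support_count(records: Iterable[Record], itemset: set[str]) -> int:
    """Count the records that contain every item in `itemset`."""
    return sum(1 for record in records if itemset <= set(record))

# Hypothetical five-row sample, NOT the real Titanic data.
sample = [
    ("1st", "Female", "Adult", "Yes"),
    ("1st", "Male", "Adult", "Yes"),
    ("1st", "Male", "Adult", "No"),
    ("3rd", "Male", "Child", "No"),
    ("Crew", "Male", "Adult", "No"),
]

print(support_count(sample, {"1st", "Adult", "Yes"}))  # 2 in this toy sample
```

Run against the full 2201 records, the same check would be expected to return the 197 reported above.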

Additional Steps: Calculating Support and Confidence

With this support count and the 2201 records in the dataset, we can now calculate support and confidence.

  1. Support for \(\{1st, \text{Adult, Yes}\}\): \[ s = \frac{197}{2201} \approx 0.0895 \text{ or } 8.95\% \]

  2. Support count for \(\{1st, \text{Adult, Yes, Female}\}\): counting the records that match all four items gives \(\sigma = 140\).

  3. Confidence for \(\{1st, \text{Adult, Yes}\} \rightarrow \{\text{Female}\}\): Calculate confidence using: \[ c = \frac{\sigma(\{1st, \text{Adult, Yes, Female}\})}{\sigma(\{1st, \text{Adult, Yes}\})} = \frac{140}{197} \approx 0.7107 \]

Q.1 Answer:

Given the dataset size of 2201 records, I calculated the minimum support count for a minsup threshold of 0.25:

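import math

# Total dataset size
total_records = 2201

# Minsup threshold and required support count
minsup_threshold = 0.25
# 0.25 * 2201 = 550.25; round up, because a count of 550 gives a support just under 0.25
minsup_count = math.ceil(minsup_threshold * total_records)  # Result is 551

print("Required support count for minsup of 0.25:", minsup_count)

This means an itemset needs a support count of at least 551 (0.25 × 2201 = 550.25, rounded up to a whole record) to be frequent at this level. Checking this against the support counts we’ve seen:

  • {1st, Adult, Yes} has a support count of 197.
  • {1st, Adult, Yes, Female} has a support count of 140.

Since both are far below the required count, neither itemset meets the minsup threshold of 0.25, so neither is frequent.

Based on this, I conclude that no 4-itemset built on {1st, Adult, Yes} is frequent or maximal at this minsup level: by the apriori principle, a superset cannot be frequent when its 3-item subset already falls below the required support count.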
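
The frequency check itself can be sketched by comparing each itemset's support (count divided by total records) with the threshold. The helper name `is_frequent` is hypothetical; the counts 197 and 140 are the ones reported elsewhere in this write-up.

```python
def is_frequent(count: int, minsup: float, total_records: int) -> bool:
    """An itemset is frequent when its support (count / total) reaches minsup."""
    return count / total_records >= minsup

TOTAL_RECORDS = 2201
observed_counts = {
    ("1st", "Adult", "Yes"): 197,
    ("1st", "Adult", "Yes", "Female"): 140,
}

for itemset, count in observed_counts.items():
    support = count / TOTAL_RECORDS
    print(itemset, f"support={support:.4f}", "frequent:", is_frequent(count, 0.25, TOTAL_RECORDS))
```

Comparing the ratio directly avoids having to decide how to round the threshold count.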

Q.2 Answer:

Q.3 Answer:

To calculate the confidence for the rule {1st, Adult, Yes} → {Female}, I use the support counts from previous answers.

Confidence Calculation: \[ \text{Confidence} = \frac{\text{support count of }\{1st, \text{Adult, Yes, Female}\}}{\text{support count of }\{1st, \text{Adult, Yes}\}} = \frac{140}{197} \approx 0.7107 \]

So, the confidence for {1st, Adult, Yes} → {Female} is about 71.07%. This means that if someone in the dataset is from 1st class, an adult, and survived, there’s a 71.07% chance they are female.
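
The same calculation as a small Python sketch. The function name `confidence` and its argument names are illustrative; the two counts are the ones used in the formula above.

```python
def confidence(count_both: int, count_antecedent: int) -> float:
    """Confidence of X -> Y: support count of X and Y together, divided by that of X."""
    if count_antecedent == 0:
        raise ValueError("antecedent never occurs; confidence is undefined")
    return count_both / count_antecedent

# {1st, Adult, Yes} -> {Female}
c = confidence(count_both=140, count_antecedent=197)
print(f"{c:.4f}")  # 0.7107
```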

Q.4 Answer:

  1. Support Count for minsup Threshold of 0.25:

    • With a minsup threshold of 0.25, an itemset needs a support count of at least 551 (0.25 × 2201 = 550.25, rounded up to a whole record) to be considered frequent.
  2. Frequent Itemsets with minsup = 0.25:

    • To be frequent, an itemset must reach that support count. Looking at our support counts:
      • {1st, Adult, Yes} has a support count of 197.
      • {1st, Adult, Yes, Female} has a support count of 140.
    • Since both are far below the required count, neither meets the minsup threshold, so they’re not frequent.

    Among the summary counts, broad single items such as No for Survived (1490) and Male for Sex (1731) comfortably exceed the required count.

  3. Maximal 4-Itemsets:

    • An itemset is maximal if it’s frequent and has no frequent supersets. The 4-itemset examined here, {1st, Adult, Yes, Female}, has a support count of only 140, so it is not frequent and therefore cannot be maximal at this minsup level.
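
The definition of a maximal itemset can be sketched as a short check. The function name `is_maximal` and the small collection of frequent itemsets below are hypothetical, chosen only to illustrate the definition.

```python
def is_maximal(itemset: frozenset, frequent: set) -> bool:
    """True if `itemset` is frequent and no frequent itemset strictly contains it."""
    if itemset not in frequent:
        return False
    return not any(itemset < other for other in frequent)

# Hypothetical collection of frequent itemsets, for illustration only.
frequent = {
    frozenset({"Male"}),
    frozenset({"Adult"}),
    frozenset({"Male", "Adult"}),
}

print(is_maximal(frozenset({"Male"}), frequent))           # False: {Male, Adult} contains it
print(is_maximal(frozenset({"Male", "Adult"}), frequent))  # True: no frequent superset
print(is_maximal(frozenset({"1st", "Adult", "Yes", "Female"}), frequent))  # False: not frequent
```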

Q.5 Answer:

  1. Step 1: Identify Frequent 3-Itemsets
    • With a minimum support count of 200, the following 3-itemsets from the bar graph are frequent:
      • {Adult, Female, No} with a support of 1200
      • {Crew, Male, No} with a support of 600
      • {3rd, Male, No} with a support of 300
  2. Step 2: Apply the Apriori Principle
    • To generate candidate 4-itemsets, I use the apriori principle, which states that every subset of a frequent itemset must also be frequent. A candidate 4-itemset is therefore formed by merging two frequent 3-itemsets that share exactly two items.
  3. Step 3: Attempt to Generate 4-Itemsets
    • {Adult, Female, No} and {Crew, Male, No} share only No, so they cannot be merged into a 4-itemset.
    • {Crew, Male, No} and {3rd, Male, No} share Male and No, but merging them gives {Crew, 3rd, Male, No}, which contains two values of the Class attribute. No record can match both, so this is not a valid candidate.
    • {Adult, Female, No} and {3rd, Male, No} also share only No, so no 4-itemset results.

Conclusion

Since no pair of these frequent 3-itemsets can be merged into a valid 4-itemset, no candidate 4-itemsets are generated, and so none can meet the minsup threshold.
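
The merging step can be sketched in Python. This is a simplified pairwise merge under stated assumptions, not the sorted-prefix join of textbook Apriori: it merges any two frequent itemsets that differ in one item, rejects merges holding two values of the same attribute, and then applies apriori pruning. The function name `generate_candidates` and the `ATTRIBUTE` lookup are illustrative.

```python
from itertools import combinations

# Map each item to the column it belongs to (the dataset's four attributes).
ATTRIBUTE = {
    "1st": "Class", "2nd": "Class", "3rd": "Class", "Crew": "Class",
    "Male": "Sex", "Female": "Sex",
    "Adult": "Age", "Child": "Age",
    "No": "Survived", "Yes": "Survived",
}

def generate_candidates(frequent_k: list[frozenset]) -> set[frozenset]:
    """Merge pairs of frequent k-itemsets into valid (k+1)-item candidates."""
    frequent_set = set(frequent_k)
    candidates = set()
    for a, b in combinations(frequent_k, 2):
        merged = a | b
        if len(merged) != len(a) + 1:
            continue  # the pair does not share k-1 items
        attributes = [ATTRIBUTE[item] for item in merged]
        if len(set(attributes)) != len(attributes):
            continue  # two values of one attribute can never co-occur
        # Apriori pruning: every k-item subset must itself be frequent.
        if all(frozenset(sub) in frequent_set for sub in combinations(merged, len(a))):
            candidates.add(merged)
    return candidates

frequent_3 = [
    frozenset({"Adult", "Female", "No"}),
    frozenset({"Crew", "Male", "No"}),
    frozenset({"3rd", "Male", "No"}),
]
print(generate_candidates(frequent_3))  # set()
```

For the three itemsets above, the only pair sharing two items is rejected because it would hold both Crew and 3rd, so the result is empty.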


