load("C:/Users/jessi/Desktop/ML7331_ICA5/titanic2.raw.rdata")
# Number of rows
nrow(titanic.raw)
[1] 2201
# Number of columns
ncol(titanic.raw)
[1] 4
# Or both at once (dimension of the data)
dim(titanic.raw)
[1] 2201 4
# Column names
colnames(titanic.raw)
[1] "Class" "Sex" "Age" "Survived"
# Display the first 6 rows
head(titanic.raw)
summary(titanic.raw)
Class Sex Age Survived
1st :325 Female: 470 Adult:2092 No :1490
2nd :285 Male :1731 Child: 109 Yes: 711
3rd :706
Crew:885
str(titanic.raw)
'data.frame': 2201 obs. of 4 variables:
$ Class : Factor w/ 4 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 3 3 3 3 ...
$ Sex : Factor w/ 2 levels "Female","Male": 2 2 2 2 2 2 2 2 2 2 ...
$ Age : Factor w/ 2 levels "Adult","Child": 2 2 2 2 2 2 2 2 2 2 ...
$ Survived: Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
# Total missing values per column
colSums(is.na(titanic.raw))
Class Sex Age Survived
0 0 0 0
# Alternatively, check if there are any missing values in the entire dataset
any(is.na(titanic.raw))
[1] FALSE
---
title: "ML7331_ICA5_13_Nov_24: Titanic Raw"
author: "Jessica McPhaul"
output: html_notebook
editor_options: 
  markdown: 
    wrap: 72
---

```{r}
load("C:/Users/jessi/Desktop/ML7331_ICA5/titanic2.raw.rdata")


# Number of rows
nrow(titanic.raw)

# Number of columns
ncol(titanic.raw)

# Or both at once (dimension of the data)
dim(titanic.raw)

# Column names
colnames(titanic.raw)

# Display the first 6 rows
head(titanic.raw)

summary(titanic.raw)

str(titanic.raw)

# Total missing values per column
colSums(is.na(titanic.raw))

# Alternatively, check if there are any missing values in the entire dataset
any(is.na(titanic.raw))
```

### Q.3:

Based on the notebook outputs, here is a summary of each step's results
and interpretation:

### Step 1: Load the Dataset

The first few rows and structure of the dataset:

``` plaintext
   Class   Sex    Age Survived
0   3rd    Male   Child   No
1   3rd    Male   Child   No
2   3rd    Male   Child   No
3   3rd    Male   Child   No
4   3rd    Male   Child   No
```

Data structure information:

``` plaintext
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2201 entries, 0 to 2200
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Class     2201 non-null   object
 1   Sex       2201 non-null   object
 2   Age       2201 non-null   object
 3   Survived  2201 non-null   object
dtypes: object(4)
memory usage: 68.9+ KB
```

Summary of each column:

``` plaintext
       Column   Count Unique    Top      Freq
       Class     2201    4     Crew       885
       Sex       2201    2     Male       1731
       Age       2201    2     Adult      2092
       Survived  2201    2     No         1490
```

### Step 2: Summary of Categorical Distributions

The distribution of values in each categorical column:

``` plaintext
Class distribution:
 Crew    885
 3rd     706
 1st     325
 2nd     285

Sex distribution:
 Male      1731
 Female     470

Age distribution:
 Adult    2092
 Child     109

Survived distribution:
 No     1490
 Yes     711
```
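
The per-column distributions above are simple value counts. As a minimal sketch of that tally, the block below uses only the standard library; the `records` list is a hypothetical four-row stand-in (not the Titanic data), and `value_counts` is an illustrative helper name rather than anything from the notebook.

```python
from collections import Counter

# Hypothetical toy records in (Class, Sex, Age, Survived) order; NOT the real data.
records = [
    ("3rd", "Male", "Child", "No"),
    ("Crew", "Male", "Adult", "No"),
    ("1st", "Female", "Adult", "Yes"),
    ("Crew", "Male", "Adult", "Yes"),
]
columns = ("Class", "Sex", "Age", "Survived")


def value_counts(rows, names):
    """Return a dict mapping each column name to a Counter of its values."""
    return {name: Counter(row[i] for row in rows) for i, name in enumerate(names)}


distributions = value_counts(records, columns)
for name in columns:
    # most_common() lists values from most to least frequent.
    print(name, distributions[name].most_common())
```

Run against the full dataset, the same tally should reproduce the counts listed in this step.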

### Step 3: Support Count Calculations

-   **Support count for** $\{1st, \text{Adult, Yes}\}$:

    ``` plaintext
    Support count for {1st, Adult, Yes}:  197
    ```

This means there are 197 instances in the dataset where the passenger
was in 1st class, was an adult, and survived.
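
A support count is just the number of records that contain every item in the itemset. The sketch below shows that idea on a hypothetical four-record sample (not the real dataset); `support_count` is an illustrative helper, and each record is assumed to be a set of its four attribute values.

```python
# Hypothetical toy records; each record is the set of items it contains.
records = [
    {"1st", "Female", "Adult", "Yes"},
    {"1st", "Male", "Adult", "Yes"},
    {"1st", "Male", "Adult", "No"},
    {"3rd", "Male", "Child", "No"},
]


def support_count(itemset, rows):
    """Count the records that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for row in rows if items <= row)


print(support_count({"1st", "Adult", "Yes"}, records))
print(support_count({"1st", "Adult", "Yes", "Female"}, records))
```

On the toy sample the second count cannot exceed the first, which is the same relationship that holds between 197 and any superset count in the real data.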

### Additional Steps: Calculating Support and Confidence

Based on this support count and the dataset total of 2201 records, we can
now calculate support and confidence.

1.  **Support for** $\{1st, \text{Adult, Yes}\}$:

    $$
    s = \frac{197}{2201} \approx 0.0895 \text{ or } 8.95\%
    $$

2.  **Support count for** $\{1st, \text{Adult, Yes, Female}\}$: counting
    the records that match `{1st, Adult, Yes, Female}` gives 140 (the
    value reported in the Q.2 answer).

3.  **Confidence for**
    $\{1st, \text{Adult, Yes}\} \rightarrow \{\text{Female}\}$:

    $$
    c = \frac{\sigma(\{1st, \text{Adult, Yes, Female}\})}{\sigma(\{1st, \text{Adult, Yes}\})} = \frac{140}{197} \approx 0.7107
    $$
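
The arithmetic in these formulas can be checked with a few lines of Python. This is a minimal sketch that plugs in the counts reported in this notebook (2201 records, 197, and the 140 given in the Q.2 answer); the function and constant names are illustrative.

```python
TOTAL_RECORDS = 2201      # dataset size reported in the notebook
COUNT_ANTECEDENT = 197    # sigma({1st, Adult, Yes})
COUNT_RULE = 140          # sigma({1st, Adult, Yes, Female}), from the Q.2 answer


def support(count, total):
    """Fraction of all records that contain the itemset."""
    return count / total


def confidence(count_rule, count_antecedent):
    """Fraction of antecedent records that also contain the consequent."""
    return count_rule / count_antecedent


s_antecedent = support(COUNT_ANTECEDENT, TOTAL_RECORDS)
s_rule = support(COUNT_RULE, TOTAL_RECORDS)
c_rule = confidence(COUNT_RULE, COUNT_ANTECEDENT)

print(f"support({{1st, Adult, Yes}}) = {s_antecedent:.4f}")
print(f"support({{1st, Adult, Yes, Female}}) = {s_rule:.4f}")
print(f"confidence = {c_rule:.4f}")
```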

### Q.1 Answer:

Given the dataset size of **2201 records**, I calculated the minimum
support count for a minsup threshold of 0.25:

``` python
import math

# Total dataset size
total_records = 2201

# Minsup threshold and required support count.
# 0.25 * 2201 = 550.25, so round up: a count of 550 gives a support just under 0.25.
minsup_threshold = 0.25
minsup_count = math.ceil(minsup_threshold * total_records)  # Result is 551

print("Required support count for minsup of 0.25:", minsup_count)
```

This means an itemset needs a support count of at least **551** to be
frequent at this level (0.25 × 2201 = 550.25, rounded up to the next
whole record). Checking this against the support counts we’ve seen:

-   `{1st, Adult, Yes}` has a support count of **197**.
-   `{1st, Adult, Yes, Female}` has a support count of **140**.

Since both are far below the required count, neither itemset meets the
minsup threshold of 0.25, so **neither is frequent**.

Based on this, I conclude that **no 4-itemset containing
`{1st, Adult, Yes}` can be frequent or maximal** at this minsup level,
because adding items to an itemset can never raise its support count.
Other 4-itemsets would need their own support counts checked before they
can be ruled out.
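
As a small sketch of the threshold check, the block below compares the two reported counts against the smallest count whose support reaches 0.25. Since 0.25 × 2201 = 550.25, the helper rounds up to 551. The helper name `min_support_count` is illustrative, and `Fraction` is used only to avoid floating-point rounding surprises for other thresholds.

```python
import math
from fractions import Fraction

TOTAL_RECORDS = 2201
MINSUP = 0.25

# Support counts reported in this notebook.
counts = {
    "{1st, Adult, Yes}": 197,
    "{1st, Adult, Yes, Female}": 140,
}


def min_support_count(minsup, total):
    """Smallest integer count c such that c / total >= minsup."""
    # Fraction(str(...)) keeps decimal thresholds such as 0.1 exact.
    return math.ceil(Fraction(str(minsup)) * total)


required = min_support_count(MINSUP, TOTAL_RECORDS)
status = {name: count >= required for name, count in counts.items()}

print("Required support count:", required)
for name, frequent in status.items():
    print(name, "frequent" if frequent else "not frequent")
```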

### Q.2 Answer:

-   The **support count for `{1st, Adult, Yes}`** is 197, with a support
    of **0.0895** or about **8.95%** of the dataset.
-   The **support count for `{1st, Adult, Yes, Female}`** is 140, with a
    support of **0.0636** or about **6.36%**.

### Q.3 Answer:

To calculate the confidence for the rule `{1st, Adult, Yes} → {Female}`,
I use the support counts from previous answers.

**Confidence Calculation**:

$$
\text{Confidence} = \frac{\text{support count of }\{1st, \text{Adult, Yes, Female}\}}{\text{support count of }\{1st, \text{Adult, Yes}\}} = \frac{140}{197} \approx 0.7107
$$

So, the **confidence for `{1st, Adult, Yes} → {Female}`** is about
**71.07%**. This means that among the people in the dataset who were in
1st class, adult, and survived, about 71.07% are female.

### Q.4 Answer:

1.  **Support Count for minsup Threshold of 0.25**:

    -   With a minsup threshold of 0.25, an itemset needs a support
        count of at least **551** (25% of 2201 records is 550.25,
        rounded up) to be considered frequent.

2.  **Frequent Itemsets with minsup = 0.25**:

    -   To be frequent, an itemset must have a support count of 551 or
        more. Looking at our support counts:
        -   `{1st, Adult, Yes}` has a support count of 197.
        -   `{1st, Adult, Yes, Female}` has a support count of 140.
    -   Since both are far below that count, **neither meets the minsup
        threshold**, so they’re not frequent.

    From the distributions in Step 2, the single items `Crew` (885),
    `3rd` (706), `Male` (1731), `Adult` (2092), `No` (1490), and `Yes`
    (711) each clear this count, while `1st` (325), `2nd` (285),
    `Female` (470), and `Child` (109) do not. Larger itemsets need
    their own counts.

3.  **Maximal 4-Itemsets**:

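    -   An itemset is **maximal** if it’s frequent and has no frequent
        supersets. `{1st, Adult, Yes, Female}` has a support count of
        140, so it is **neither frequent nor maximal** at this minsup
        level. Ruling out every 4-itemset would require the support
        counts of the remaining combinations, which are not computed
        here.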
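
To make the frequent and maximal definitions concrete, here is a brute-force sketch that counts every itemset in a small sample and keeps the frequent ones with no frequent proper superset. The six `records` are hypothetical (not the Titanic data), the threshold of 3 is chosen only for the toy sample, and the helper names are illustrative. It assumes each value belongs to a single column, as in this dataset.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy records in (Class, Sex, Age, Survived) order; NOT the real data.
records = [
    ("Crew", "Male", "Adult", "No"),
    ("Crew", "Male", "Adult", "No"),
    ("Crew", "Male", "Adult", "Yes"),
    ("3rd", "Male", "Adult", "No"),
    ("1st", "Female", "Adult", "Yes"),
    ("3rd", "Female", "Child", "No"),
]


def frequent_itemsets(rows, min_count):
    """Map each itemset with support count >= min_count to its count."""
    counts = Counter()
    for row in rows:
        # Every non-empty subset of a record is an itemset that record supports.
        for size in range(1, len(row) + 1):
            for combo in combinations(row, size):
                counts[frozenset(combo)] += 1
    return {items: n for items, n in counts.items() if n >= min_count}


def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]


frequent = frequent_itemsets(records, min_count=3)
maximal = maximal_itemsets(frequent)
for itemset in maximal:
    print(sorted(itemset), frequent[itemset])
```

With four attributes, each record contributes only 15 subsets, so this exhaustive count stays cheap even for all 2201 records.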

### Q.5 Answer:

1.  **Step 1: Identify Frequent 3-Itemsets**
    -   With a minimum support count of 200, the following 3-itemsets
        from the bar graph are frequent:
        -   `{Adult, Female, No}` with a support count of 1200
        -   `{Crew, Male, No}` with a support count of 600
        -   `{3rd, Male, No}` with a support count of 300
2.  **Step 2: Apply the Apriori Principle**
    -   To generate candidate 4-itemsets, I use the apriori principle,
        which states that every subset of a frequent itemset must also
        be frequent. So, two frequent 3-itemsets can be merged into a
        candidate 4-itemset only if they share two items, and the
        candidate survives only if all four of its 3-item subsets are
        frequent.
3.  **Step 3: Attempt to Generate 4-Itemsets**
    -   `{Adult, Female, No}` and `{Crew, Male, No}` share only `No`,
        so they cannot be merged into a 4-itemset.
    -   `{Crew, Male, No}` and `{3rd, Male, No}` share `Male` and `No`,
        but merging them gives `{Crew, 3rd, Male, No}`, which contains
        two values of the `Class` attribute. No record can match it,
        and its subset `{Crew, 3rd, Male}` is not frequent.
    -   `{Adult, Female, No}` and `{3rd, Male, No}` also share only
        `No`, so they cannot be merged either.

**Conclusion**

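Since no pair of these frequent 3-itemsets yields a 4-itemset whose
3-item subsets are all frequent (a 4-itemset has four such subsets, and
only three frequent 3-itemsets are available), **no candidate 4-itemsets
can be generated** that would meet the minsup threshold.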
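
The join-and-prune reasoning can be sketched in a few lines. The block below starts from the three frequent 3-itemsets listed in Step 1, joins any pair whose union has exactly four items, and then prunes candidates that have a non-frequent 3-item subset. The function name `generate_candidates` is illustrative, and the join is a simplified variant (any two shared items, rather than a sorted-prefix match); pruning gives the same final result.

```python
from itertools import combinations

# Frequent 3-itemsets listed in Step 1.
frequent_3 = [
    frozenset({"Adult", "Female", "No"}),
    frozenset({"Crew", "Male", "No"}),
    frozenset({"3rd", "Male", "No"}),
]


def generate_candidates(frequent_prev):
    """Join (k-1)-itemsets sharing k-2 items, then prune by the apriori principle."""
    prev = set(frequent_prev)
    if not prev:
        return set()
    k = len(next(iter(prev))) + 1
    # Join step: two (k-1)-itemsets share k-2 items exactly when their union has k items.
    joined = {a | b for a, b in combinations(prev, 2) if len(a | b) == k}
    # Prune step: every (k-1)-subset of a candidate must itself be frequent.
    return {
        cand
        for cand in joined
        if all(frozenset(sub) in prev for sub in combinations(cand, k - 1))
    }


candidates_4 = generate_candidates(frequent_3)
print(candidates_4)
```

The only joined set is `{Crew, 3rd, Male, No}`, and it is pruned because `{Crew, 3rd, Male}` and `{Crew, 3rd, No}` are not among the frequent 3-itemsets.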

[Colab](https://colab.research.google.com/drive/1PPfHOBs-wXhyFYRyYl3VhtuY4vFpHzyT#scrollTo=Ws2C2hlROZTa)

[RPubs](https://rpubs.com/Texaschikkita/ml7331_ICA5)
