set.seed(20)

1 Introduction

Naive Bayes is a popular and powerful classification algorithm used in machine learning and data mining tasks. It’s based on Bayes’ theorem, which describes the probability of an event occurring based on prior knowledge of conditions that might be related to the event.

In Naive Bayes classification, the algorithm assumes that, within a given class, each feature is independent of every other feature. This assumption is called “naive” because it rarely holds exactly, but it greatly simplifies the calculation of probabilities. Despite its simplicity, Naive Bayes often performs surprisingly well in practice, especially in text classification and spam filtering tasks.

One of the key advantages of Naive Bayes is its computational efficiency and its ability to handle large datasets with high-dimensional features. It’s also relatively easy to implement and can often be trained on less data than more flexible classifiers need.

Naive Bayes is a versatile and efficient algorithm that’s widely used in various real-world applications due to its simplicity, speed, and effectiveness, especially in scenarios where the independence assumption holds reasonably well.

In this manual, we’ll do a short demonstration of this algorithm, and also implement it in R.

To start with, consider the table below, which is the training data. From this data we’ll train a model that predicts the Airline based on some given features. Once the model is trained, a new traveller in the test dataset should be able to decide which airline (A, B, or C) to book based on those features. In this problem, the target variable is Airline, while the features are Aircraft_Type, Demand, Weather_Conditions and Number_of_Stops.

Aircraft_Type	Demand	Weather_Conditions	Number_of_Stops	Airline
Airbus A320	Low	Cloudy	0	Airline A
Airbus A380	Low	Rain	0	Airline A
Airbus A380	Low	Snow	1	Airline B
Boeing 777	Medium	Clear	1	Airline C
Boeing 777	Low	Rain	3	Airline B
Boeing 737	Low	Rain	0	Airline B
Airbus A320	Low	Rain	0	Airline A
Boeing 787	Low	Snow	0	Airline B
Airbus A380	Medium	Cloudy	0	Airline B

2 Computational Implementation

Naive Bayes applies Bayes’ theorem to each class $C_k$: \[ P(C_k | x) = \frac{P(x | C_k) \cdot P(C_k)}{P(x)}\] In this formula:

$P(C_k|x)$ is the posterior probability of class $C_k$ given predictor x.
$P(x|C_k)$ is the likelihood, the probability of predictor x given class $C_k$.
$P(C_k)$ is the prior probability of class $C_k$.
$P(x)$ is the total probability of predictor x, also known as evidence. It’s usually computed as the sum of $P(x|C_k) \times P(C_k)$ over all classes.
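
As a minimal sketch of how these four quantities fit together, the R snippet below uses two made-up classes (`C1`, `C2`) with hypothetical priors and likelihoods; none of these numbers come from the airline data. It computes the evidence as the sum described above and then the posterior for each class.

```r
# Hypothetical numbers for illustration only (not taken from the airline data).
prior      <- c(C1 = 0.6, C2 = 0.4)   # P(C_k)
likelihood <- c(C1 = 0.2, C2 = 0.5)   # P(x | C_k)

# Evidence P(x): sum of likelihood * prior over all classes.
evidence <- sum(likelihood * prior)

# Posterior P(C_k | x) for every class; the posteriors sum to 1.
posterior <- likelihood * prior / evidence
print(posterior)

# The predicted class is the one with the largest posterior.
print(names(which.max(posterior)))
```

Because the evidence is the same for every class, it does not change which class has the largest posterior, which is why the hand calculations in this manual compare only likelihood times prior.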

2.1 Process of Training

We start training the model by calculating the prior probability of each class. \[\begin{equation} \text{Probability of Airline A} = P(\text{Airline} = \text{Airline A}) = \frac{3}{9} = \frac{1}{3} \end{equation}\]

\[\begin{equation} \text{Probability of Airline B} = P(\text{Airline} = \text{Airline B}) = \frac{5}{9} \end{equation}\]

\[\begin{equation} \text{Probability of Airline C} = P(\text{Airline} = \text{Airline C}) = \frac{1}{9} \end{equation}\]

\[\begin{equation} n(\text{Airline} = \text{A}) = 3, \quad n(\text{Airline} = \text{B}) = 5, \quad n(\text{Airline} = \text{C}) = 1 \end{equation}\]
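
The same counts and priors can be reproduced in base R. The sketch below types in the Airline column of the nine training rows by hand (the vector name `airline` is just an illustrative choice) and uses `table()` and `prop.table()`.

```r
# Class labels of the nine training rows, in the order of the training table.
airline <- c("Airline A", "Airline A", "Airline B", "Airline C", "Airline B",
             "Airline B", "Airline A", "Airline B", "Airline B")

class_counts <- table(airline)       # n(Airline = k)
priors <- prop.table(class_counts)   # P(Airline = k) = n(Airline = k) / 9

print(class_counts)
print(round(priors, 4))
```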

2.1.1 Then calculate the conditional probability of each attribute

2.1.1.1 Conditional probability of Aircraft type

Aircraft_Type	Airline.A	Airline.B	Airline.C
Airbus A320	0.6666667	0.0	0
Airbus A380	0.3333333	0.4	0
Boeing 777	0.0000000	0.2	1
Boeing 737	0.0000000	0.2	0
Boeing 787	0.0000000	0.2	0

2.1.1.2 Conditional probability of Demand

Demand	Airline.A	Airline.B	Airline.C
Low	1	0.8	0
Medium	0	0.2	1
High	0	0.0	0

2.1.1.3 Conditional probability of Weather Condition

Weather.Condition	Airline.A	Airline.B	Airline.C
Cloudy	0.3333333	0.2	0
Rain	0.6666667	0.4	0
Snow	0.0000000	0.4	0
Clear	0.0000000	0.0	1

2.1.1.4 Conditional probability of Number of stops

Number.of.stops	Airline.A	Airline.B	Airline.C
0	1	0.6	0
1	0	0.2	1
2	0	0.0	0
3	0	0.2	0
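
The four tables above can be rebuilt from the training data with `table()` and `prop.table()`. This is a sketch that re-types the nine training rows into a data frame; the helper name `cond_prob` is an assumption made for illustration. Values never observed in training (Demand = High, Number_of_Stops = 2) do not appear in the output, whereas the tables above list them with probability 0.

```r
# The nine training rows from the Introduction, typed in by hand.
train <- data.frame(
  Aircraft_Type = c("Airbus A320", "Airbus A380", "Airbus A380", "Boeing 777",
                    "Boeing 777", "Boeing 737", "Airbus A320", "Boeing 787",
                    "Airbus A380"),
  Demand = c("Low", "Low", "Low", "Medium", "Low", "Low", "Low", "Low", "Medium"),
  Weather_Conditions = c("Cloudy", "Rain", "Snow", "Clear", "Rain", "Rain",
                         "Rain", "Snow", "Cloudy"),
  Number_of_Stops = c(0, 0, 1, 1, 3, 0, 0, 0, 0),
  Airline = c("Airline A", "Airline A", "Airline B", "Airline C", "Airline B",
              "Airline B", "Airline A", "Airline B", "Airline B")
)

# P(feature value | Airline): column-wise proportions of the
# feature-by-class count table (each column sums to 1).
cond_prob <- function(feature) {
  prop.table(table(train[[feature]], train$Airline), margin = 2)
}

features <- c("Aircraft_Type", "Demand", "Weather_Conditions", "Number_of_Stops")
cond_tables <- lapply(setNames(features, features), cond_prob)
print(round(cond_tables$Weather_Conditions, 3))
```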

2.2 Process of Prediction

We now predict and classify the new instances in the table below. Imagine you are a traveller planning a journey but unsure which airline suits it best. Fortunately, you have some information about the journey, and the model we’ve trained will use it to help you decide.

Aircraft_Type	Demand	Weather_Conditions	Number_of_Stops	Airline
Boeing 777	Low	Rain	3	?
Airbus A320	Low	Snow	0	?
Airbus A380	High	Clear	0	?
Airbus A320	Low	Rain	0	?
Airbus A380	High	Clear	0	?
Boeing 737	Low	Snow	1	?
Boeing 777	Low	Snow	1	?
Boeing 777	Low	Cloudy	1	?
Airbus A320	Medium	Clear	1	?
Airbus A380	Low	Rain	1	?

2.2.1 Prediction formula

\[ U_{NB} = \underset{U_j \in J}{\text{argmax}} \left[ \prod_{i=1}^{n} P(a_i | U_j) \right] \times P(U_j) \]

Where:

$U_{NB}$ is the predicted class using Naive Bayes.
$U_j$ is a class label in the set of all possible classes $J$.
$a_i$ is the value of the $i$-th attribute of the new instance, and $n$ is the number of attributes.
$P(a_i | U_j)$ is the conditional probability of attribute value $a_i$ given class $U_j$.
$P(U_j)$ is the prior probability of class $U_j$.

To predict the values for the “Airline” column based on the given features using a Naive Bayes classifier, we follow these steps:

Calculate the posterior score for each class label (airline) given the predictor variables.
Select the class label with the highest score as the predicted class for each instance.
With the four features used here, the prediction formula can be written as: \[\begin{equation} \text{Predicted Class} = \underset{U_j \in J}{\text{argmax}} \left[ \prod_{i=1}^{4} P(\text{Attribute}_i = a_i | \text{Airline} = U_j) \right] \times P(\text{Airline} = U_j) \end{equation}\]

Where:

Predicted Class is the predicted value for the “Airline” column.
$U_j$ represents each possible class label (airline).
$P(\text{Attribute}_i = a_i | \text{Airline} = U_j)$ is the conditional probability that the $i$-th attribute (Aircraft_Type, Demand, Weather_Conditions or Number_of_Stops) takes the value $a_i$ given class label $U_j$.
$P(\text{Airline} = U_j)$ is the prior probability of each class label.
$J$ represents the set of all possible class labels.

2.2.2 Prediction for the first row

To predict the airline for the first row using the given features (Aircraft_Type = Boeing 777, Demand = Low, Weather_Conditions = Rain, Number_of_Stops = 3), we can use the Naive Bayes prediction formula: \[\begin{equation} \text{Predicted Class} = \underset{U_j}{\text{argmax}} \left[ P(\text{Aircraft_Type} = \text{Boeing 777} | \text{Airline} = U_j) \\ \times P(\text{Demand} = \text{Low} | \text{Airline} = U_j) \times P(\text{Weather_Conditions} = \text{Rain} | \text{Airline} = U_j) \\ \times P(\text{Number_of_Stops} = 3 | \text{Airline} = U_j) \right] \times P(\text{Airline} = U_j) \end{equation}\] Here, we substitute the specific values from the conditional probability tables: \[\begin{equation} \text{Airline A} = \left[ 0 \times 1 \times 2/3 \times 0 \right] \times 1/3 = 0 \end{equation}\]

\[\begin{equation} \text{Airline B} = \left[ 1/5 \times 4/5 \times 2/5 \times 1/5 \right] \times 5/9 = 0.0071 \end{equation}\]

\[\begin{equation} \text{Airline C} = \left[ 1 \times 0 \times 0 \times 0 \right] \times 1/9 = 0 \end{equation}\]

Decision: Since $P(\text{Airline B}) > P(\text{Airline A})$ and $P(\text{Airline B}) > P(\text{Airline C})$, we classify the new instance as Airline B.
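
The hand calculation for the first row can be checked with a few lines of base R. The sketch below re-types the training table and defines an illustrative helper, `nb_score` (not part of any package), that multiplies the class prior by the relative frequency of each feature value within that class. No smoothing is applied, so a single unseen value drives the score to 0.

```r
# The nine training rows from the Introduction, typed in by hand.
train <- data.frame(
  Aircraft_Type = c("Airbus A320", "Airbus A380", "Airbus A380", "Boeing 777",
                    "Boeing 777", "Boeing 737", "Airbus A320", "Boeing 787",
                    "Airbus A380"),
  Demand = c("Low", "Low", "Low", "Medium", "Low", "Low", "Low", "Low", "Medium"),
  Weather_Conditions = c("Cloudy", "Rain", "Snow", "Clear", "Rain", "Rain",
                         "Rain", "Snow", "Cloudy"),
  Number_of_Stops = c(0, 0, 1, 1, 3, 0, 0, 0, 0),
  Airline = c("Airline A", "Airline A", "Airline B", "Airline C", "Airline B",
              "Airline B", "Airline A", "Airline B", "Airline B")
)

# Unsmoothed score for one class: prior * product of P(feature value | class).
nb_score <- function(new_row, class_label) {
  in_class <- train[train$Airline == class_label, ]
  prior <- nrow(in_class) / nrow(train)
  likelihoods <- sapply(names(new_row),
                        function(f) mean(in_class[[f]] == new_row[[f]]))
  prior * prod(likelihoods)
}

# First test row: Boeing 777, Low demand, Rain, 3 stops.
row1 <- list(Aircraft_Type = "Boeing 777", Demand = "Low",
             Weather_Conditions = "Rain", Number_of_Stops = 3)

classes <- sort(unique(train$Airline))
scores <- sapply(classes, nb_score, new_row = row1)
print(round(scores, 4))
print(names(which.max(scores)))   # class with the highest score
```

Changing `row1` to any other test row gives the corresponding unsmoothed scores; when every score is 0, `which.max` simply returns the first class, so such ties need separate handling.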

2.2.3 Prediction for the second row

To predict the airline for the second row using the given features (Aircraft_Type = Airbus A320, Demand = Low, Weather_Conditions = Snow, Number_of_Stops = 0), we can use the Naive Bayes prediction formula: \[\begin{equation} \text{Predicted Class} = \underset{U_j}{\text{argmax}} \left[ P(\text{Aircraft_Type} = \text{Airbus A320} | \text{Airline} = U_j) \times \\ P(\text{Demand} = \text{Low} | \text{Airline} = U_j) \times P(\text{Weather_Conditions} = \text{Snow} | \text{Airline} = U_j) \\ \times P(\text{Number_of_Stops} = 0 | \text{Airline} = U_j) \right] \times P(\text{Airline} = U_j) \end{equation}\] Here, we substitute the specific values from the conditional probability tables: \[\begin{equation} \text{Airline A} = \left[ 2/3 \times 1 \times 0 \times 1 \right] \times 1/3 = 0 \end{equation}\]

\[\begin{equation} \text{Airline B} = \left[ 0 \times 4/5 \times 2/5 \times 3/5 \right] \times 5/9 = 0 \end{equation}\]

\[\begin{equation} \text{Airline C} = \left[ 0 \times 0 \times 0 \times 0 \right] \times 1/9 = 0 \end{equation}\]

Decision: Since $P(\text{Airline A}) = P(\text{Airline B}) = P(\text{Airline C}) = 0$, the result is inconclusive. Hence, we use the Laplace smoothing formula:

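\[P_{LAP,K}(x | y) = \frac{c(x,y)+K}{c(y) + K|X|}\] where $c(x,y)$ is the number of training rows with attribute value $x$ and class $y$, $c(y)$ is the number of training rows with class $y$, and $K$ is the smoothing parameter, set to 2 here. $|X|$ is conventionally the number of distinct values the attribute can take; the calculations below take $|X| = 9$ for every attribute.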
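
A minimal sketch of the smoothed estimate in R follows. The function name `laplace` and its argument names are illustrative assumptions; the defaults `K = 2` and `n_values = 9` mirror the denominators of the form 3 + 2(9), 5 + 2(9) and 1 + 2(9) used in the calculations in this manual. The example scores Airline B for the second test row, where the raw counts among the five Airline B training rows are 0 (Airbus A320), 4 (Low), 2 (Snow) and 3 (0 stops).

```r
# Smoothed estimate (c(x, y) + K) / (c(y) + K * n_values).
# K = 2 and n_values = 9 follow the worked calculations in this manual.
laplace <- function(count_xy, count_y, K = 2, n_values = 9) {
  (count_xy + K) / (count_y + K * n_values)
}

# Airline B, second test row: counts of each feature value among
# the five Airline B training rows.
counts_b <- c(Aircraft_Type = 0, Demand = 4, Weather_Conditions = 2,
              Number_of_Stops = 3)

# Product of the smoothed conditional probabilities times the prior 5/9.
score_b <- prod(laplace(counts_b, count_y = 5)) * 5 / 9
print(signif(score_b, 3))
```

Because every count is increased by `K`, no factor in the product can be exactly 0, which is what breaks the three-way tie at 0.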

2.2.4 Laplace Smoothing Prediction

2.2.4.1 Probability Calculation:

Airline A: \[ \begin{align*} &= \left( \frac{2 + 2}{3+2(9)} \times \frac{3 + 2}{3+2(9)}\times \frac{0 + 2}{3+2(9)} \times \frac{3+2}{3+2(9)} \right) \times \frac{1}{3} \\ &= \left( \frac{4}{21} \times \frac{5}{21} \times \frac{2}{21} \times \frac{5}{21} \right) \times \frac{1}{3} \\ &= \left( \frac{200}{194,481} \right) \times \frac{1}{3} \\ &= \frac{200}{583,443} \\ &= 0.00034 \end{align*} \]
Airline B: \[ \begin{align*} &= \left( \frac{0+2}{5+2(9)} \times \frac{4+2}{5+2(9)}\times \frac{2+2}{5+2(9)} \times \frac{3+2}{5+2(9)}\right) \times \frac{5}{9} \\ &= \left( \frac{2}{23} \times \frac{6}{23} \times \frac{4}{23} \times \frac{5}{23} \right) \times \frac{5}{9} \\ &= \left( \frac{240}{279,841} \right) \times \frac{5}{9} \\ &= \frac{1,200}{2,518,569} \\ &= 0.00048 \end{align*} \]
Airline C: \[ \begin{align*} &= \left( \frac{0 + 2}{1+2(9)} \times \frac{0 + 2}{1 + 2(9)} \times \frac{0+2}{1+2(9)} \times \frac{0+2}{1+2(9)} \right) \times \frac{1}{9} \\ &= \left( \frac{2}{19} \times \frac{2}{19} \times \frac{2}{19} \times \frac{2}{19} \right) \times \frac{1}{9} \\ &= \left( \frac{16}{130,321} \right) \times \frac{1}{9} \\ &= \frac{16}{1,172,889} \\ &= 0.000014 \end{align*} \]

Now, we can compare these probabilities and make a decision. From the results above, Airline B > Airline A and Airline B > Airline C. Therefore, we predict this instance to be Airline B.

2.2.5 Prediction for new instance in third row

To predict the airline for the third row using the given features (Aircraft_Type = Airbus A380, Demand = High, Weather_Conditions = Clear, Number_of_Stops = 0), we use: \[\begin{equation} \text{Predicted Class} = \underset{U_j}{\text{argmax}} \left[ P(\text{Aircraft_Type} = \text{Airbus A380} | \text{Airline} = U_j) \times \\ P(\text{Demand} = \text{High} | \text{Airline} = U_j) \times P(\text{Weather_Conditions} = \text{Clear} | \text{Airline} = U_j) \times \\ P(\text{Number_of_Stops} = 0 | \text{Airline} = U_j) \right] \times P(\text{Airline} = U_j) \end{equation}\] Here, we substitute the specific values from the conditional probability tables: \[\begin{equation} \text{Airline A} = \left[ 1/3 \times 0 \times 0 \times 1 \right] \times 1/3 = 0 \end{equation}\]

\[\begin{equation} \text{Airline B} = \left[ 2/5 \times 0/5 \times 0/5 \times 3/5 \right] \times 5/9 = 0 \end{equation}\]

\[\begin{equation} \text{Airline C} = \left[ 0 \times 0 \times 1 \times 0 \right] \times 1/9 = 0 \end{equation}\]

Decision: Since $P(\text{Airline A}) = P(\text{Airline B}) = P(\text{Airline C}) = 0$, the result is inconclusive. Hence, we use the Laplace smoothing formula.

2.2.6 Laplace Smoothing Prediction

Airline A: \[ \begin{align*} &= \left( \frac{1 + 2}{3+2(9)} \times \frac{0 + 2}{3+2(9)}\times \frac{0 + 2}{3+2(9)} \times \frac{3+2}{3+2(9)} \right) \times \frac{1}{3} \\ &= \left( \frac{3}{21} \times \frac{2}{21} \times \frac{2}{21} \times \frac{5}{21} \right) \times \frac{1}{3} \\ &= \left( \frac{60}{194,481} \right) \times \frac{1}{3} \\ &= \frac{60}{583,443} \\ &= 0.00010 \end{align*} \]
Airline B: \[ \begin{align*} &= \left( \frac{2+2}{5+2(9)} \times \frac{0+2}{5+2(9)}\times \frac{0+2}{5+2(9)} \times \frac{3+2}{5+2(9)}\right) \times \frac{5}{9} \\ &= \left( \frac{4}{23} \times \frac{2}{23} \times \frac{2}{23} \times \frac{5}{23} \right) \times \frac{5}{9} \\ &= \left( \frac{80}{279,841} \right) \times \frac{5}{9} \\ &= \frac{400}{2,518,569} \\ &= 0.00016 \end{align*} \]
Airline C: \[ \begin{align*} &= \left( \frac{0 + 2}{1+2(9)} \times \frac{0 + 2}{1 + 2(9)} \times \frac{1+2}{1+2(9)} \times \frac{0+2}{1+2(9)} \right) \times \frac{1}{9} \\ &= \left( \frac{2}{19} \times \frac{2}{19} \times \frac{3}{19} \times \frac{2}{19} \right) \times \frac{1}{9} \\ &= \left( \frac{24}{130,321} \right) \times \frac{1}{9} \\ &= \frac{24}{1,172,889} \\ &= 0.00002 \end{align*} \]

Now, we can compare these probabilities and make a decision. From the results above, Airline B > Airline A and Airline B > Airline C. Therefore, we predict this instance to be Airline B.

2.2.7 Prediction for new instance in fourth row

To predict the airline for the fourth row using the given features (Aircraft_Type = Airbus A320, Demand = Low, Weather_Conditions = Rain, Number_of_Stops = 0), we use: \[\begin{equation} \text{Predicted Class} = \underset{U_j}{\text{argmax}} \left[ P(\text{Aircraft_Type} = \text{Airbus A320} | \text{Airline} = U_j) \times \\ P(\text{Demand} = \text{Low} | \text{Airline} = U_j) \times P(\text{Weather_Conditions} = \text{Rain} | \text{Airline} = U_j) \times \\ P(\text{Number_of_Stops} = 0 | \text{Airline} = U_j) \right] \times P(\text{Airline} = U_j) \end{equation}\] Here, we substitute the specific values from the conditional probability tables: \[\begin{equation} \text{Airline A} = \left[ 2/3 \times 3/3 \times 2/3 \times 3/3 \right] \times 1/3 \\ = 4/27 = 0.15 \end{equation}\]

\[\begin{equation} \text{Airline B} = \left[ 0 \times 4/5 \times 2/5 \times 3/5 \right] \times 5/9 = 0 \end{equation}\]

\[\begin{equation} \text{Airline C} = \left[ 0 \times 0 \times 0 \times 0 \right] \times 1/9 = 0 \end{equation}\]

Decision: Since $P(\text{Airline A}) > P(\text{Airline B}) = P(\text{Airline C})$, we conclude that the new instance is Airline A.

2.2.8 Prediction for new instance in fifth row: Note: This instance is the same as that of the third row, so it receives the same prediction. Hence, we proceed to the sixth instance.

2.2.9 Prediction for new instance in sixth row

To predict the airline for the sixth row using the given features (Aircraft_Type = Boeing 737, Demand = Low, Weather_Conditions = Snow, Number_of_Stops = 1), we use: \[\begin{equation} \text{Predicted Class} = \underset{U_j}{\text{argmax}} \left[ P(\text{Aircraft_Type} = \text{Boeing 737} | \text{Airline} = U_j) \times \\ P(\text{Demand} = \text{Low} | \text{Airline} = U_j) \times P(\text{Weather_Conditions} = \text{Snow} | \text{Airline} = U_j) \times \\ P(\text{Number_of_Stops} = 1 | \text{Airline} = U_j) \right] \times P(\text{Airline} = U_j) \end{equation}\] Here, we substitute the specific values from the conditional probability tables: \[\begin{equation} \text{Airline A} = \left[ 0/3 \times 3/3 \times 0/3 \times 0/3 \right] \times 1/3 = 0 \end{equation}\]

\[\begin{equation} \text{Airline B} = \left[ 1/5 \times 4/5 \times 2/5 \times 1/5 \right] \times 5/9 = 8/1125 = 0.0071 \end{equation}\]

\[\begin{equation} \text{Airline C} = \left[ 0/1 \times 0/1 \times 0/1 \times 1/1 \right] \times 1/9 = 0 \end{equation}\]

Decision: Since $P(\text{Airline B}) > P(\text{Airline A}) = P(\text{Airline C})$, we conclude that the new instance is Airline B.

2.2.10 Prediction for new instance in seventh row

To predict the airline for the seventh row using the given features (Aircraft_Type = Boeing 777, Demand = Low, Weather_Conditions = Snow, Number_of_Stops = 1), we use: \[\begin{equation} \text{Predicted Class} = \underset{U_j}{\text{argmax}} \left[ P(\text{Aircraft_Type} = \text{Boeing 777} | \text{Airline} = U_j) \times \\ P(\text{Demand} = \text{Low} | \text{Airline} = U_j) \times P(\text{Weather_Conditions} = \text{Snow} | \text{Airline} = U_j) \times \\ P(\text{Number_of_Stops} = 1 | \text{Airline} = U_j) \right] \times P(\text{Airline} = U_j) \end{equation}\] Here, we substitute the specific values from the conditional probability tables: \[\begin{equation} \text{Airline A} = \left[ 0/3 \times 3/3 \times 0/3 \times 0/3 \right] \times 1/3 = 0 \end{equation}\]

\[\begin{equation} \text{Airline B} = \left[ 1/5 \times 4/5 \times 2/5 \times 1/5 \right] \times 5/9 = 8/1125 = 0.0071 \end{equation}\]

\[\begin{equation} \text{Airline C} = \left[ 1/1 \times 0/1 \times 0/1 \times 1/1 \right] \times 1/9 = 0 \end{equation}\]

Decision: Since $P(\text{Airline B}) > P(\text{Airline A}) = P(\text{Airline C})$, we conclude that the new instance is Airline B.

2.2.11 Prediction for new instance in eighth row

To predict the airline for the eighth row using the given features (Aircraft_Type = Boeing 777, Demand = Low, Weather_Conditions = Cloudy, Number_of_Stops = 1), we use: \[\begin{equation} \text{Predicted Class} = \underset{U_j}{\text{argmax}} \left[ P(\text{Aircraft_Type} = \text{Boeing 777} | \text{Airline} = U_j) \times \\ P(\text{Demand} = \text{Low} | \text{Airline} = U_j) \times P(\text{Weather_Conditions} = \text{Cloudy} | \text{Airline} = U_j) \times \\ P(\text{Number_of_Stops} = 1 | \text{Airline} = U_j) \right] \times P(\text{Airline} = U_j) \end{equation}\] Here, we substitute the specific values from the conditional probability tables: \[\begin{equation} \text{Airline A} = \left[ 0/3 \times 3/3 \times 1/3 \times 0/3 \right] \times 1/3 = 0 \end{equation}\]

\[\begin{equation} \text{Airline B} = \left[ 1/5 \times 4/5 \times 1/5 \times 1/5 \right] \times 5/9 = 4/1125 = 0.0036 \end{equation}\]

\[\begin{equation} \text{Airline C} = \left[ 1/1 \times 0/1 \times 0/1 \times 1/1 \right] \times 1/9 = 0 \end{equation}\]

Decision: Since $P(\text{Airline B}) > P(\text{Airline A}) = P(\text{Airline C})$, we conclude that the new instance is Airline B.

2.2.12 Prediction for new instance in ninth row

To predict the airline for the ninth row using the given features (Aircraft_Type = Airbus A320, Demand = Medium, Weather_Conditions = Clear, Number_of_Stops = 1), we use: \[\begin{equation} \text{Predicted Class} = \underset{U_j}{\text{argmax}} \left[ P(\text{Aircraft_Type} = \text{Airbus A320} | \text{Airline} = U_j) \\ \times P(\text{Demand} = \text{Medium} | \text{Airline} = U_j) \\ \times P(\text{Weather_Conditions} = \text{Clear} | \text{Airline} = U_j) \times P(\text{Number_of_Stops} = 1 | \text{Airline} = U_j) \right] \\ \times P(\text{Airline} = U_j) \end{equation}\] Here, we substitute the specific values from the conditional probability tables: \[\begin{equation} \text{Airline A} = \left[ 2/3 \times 0/3 \times 0/3 \times 0/3 \right] \times 1/3 = 0 \end{equation}\]

\[\begin{equation} \text{Airline B} = \left[ 0/5 \times 1/5 \times 0/5 \times 1/5 \right] \times 5/9 = 0 \end{equation}\]

\[\begin{equation} \text{Airline C} = \left[ 0/1 \times 1/1 \times 1/1 \times 1/1 \right] \times 1/9 = 0 \end{equation}\]

Decision: Since $P(\text{Airline A}) = P(\text{Airline B}) = P(\text{Airline C}) = 0$, the result is inconclusive. Hence, we use the Laplace smoothing formula.

2.2.13 Laplace Smoothing Prediction

Airline A: \[ \begin{align*} &= \left( \frac{2 + 2}{3+2(9)} \times \frac{0 + 2}{3+2(9)}\times \frac{0 + 2}{3+2(9)} \times \frac{0+2}{3+2(9)} \right) \times \frac{1}{3} \\ &= \left( \frac{4}{21} \times \frac{2}{21} \times \frac{2}{21} \times \frac{2}{21} \right) \times \frac{1}{3} \\ &= \left( \frac{32}{194,481} \right) \times \frac{1}{3} \\ &= \frac{32}{583,443} \\ &= 0.000055 \end{align*} \]
Airline B: \[ \begin{align*} &= \left( \frac{0+2}{5+2(9)} \times \frac{1+2}{5+2(9)}\times \frac{0+2}{5+2(9)} \times \frac{1+2}{5+2(9)}\right) \times \frac{5}{9} \\ &= \left( \frac{2}{23} \times \frac{3}{23} \times \frac{2}{23} \times \frac{3}{23} \right) \times \frac{5}{9} \\ &= \left( \frac{36}{279,841} \right) \times \frac{5}{9} \\ &= \frac{180}{2,518,569} \\ &= 0.000071 \end{align*} \]
Airline C: \[ \begin{align*} &= \left( \frac{0 + 2}{1+2(9)} \times \frac{1 + 2}{1 + 2(9)} \times \frac{1+2}{1+2(9)} \times \frac{1+2}{1+2(9)} \right) \times \frac{1}{9} \\ &= \left( \frac{2}{19} \times \frac{3}{19} \times \frac{3}{19} \times \frac{3}{19} \right) \times \frac{1}{9} \\ &= \left( \frac{54}{130,321} \right) \times \frac{1}{9} \\ &= \frac{54}{1,172,889} \\ &= 0.000046 \end{align*} \]

Now, we can compare these probabilities and make a decision. In the training data, Airline A has no flights with 1 stop, and Airline C's only flight has Clear weather, so the counts above are 0 and 1 respectively. From the results above, Airline B > Airline A and Airline B > Airline C. Therefore, we predict this instance to be Airline B.

2.2.14 Prediction for new instance in tenth row

To predict the airline for the tenth row using the given features (Aircraft_Type = Airbus A380, Demand = Low, Weather_Conditions = Rain, Number_of_Stops = 1), we use: \[\begin{equation} \text{Predicted Class} = \underset{U_j}{\text{argmax}} \left[ P(\text{Aircraft_Type} = \text{Airbus A380} | \text{Airline} = U_j) \times \\ P(\text{Demand} = \text{Low} | \text{Airline} = U_j) \times P(\text{Weather_Conditions} = \text{Rain} | \text{Airline} = U_j) \times \\ P(\text{Number_of_Stops} = 1 | \text{Airline} = U_j) \right] \times P(\text{Airline} = U_j) \end{equation}\] Here, we substitute the specific values from the conditional probability tables: \[\begin{equation} \text{Airline A} = \left[ 1/3 \times 3/3 \times 2/3 \times 0/3 \right] \times 1/3 = 0 \end{equation}\]

\[\begin{equation} \text{Airline B} = \left[ 2/5 \times 4/5 \times 2/5 \times 1/5 \right] \times 5/9 = 16/1125 = 0.014 \end{equation}\]

\[\begin{equation} \text{Airline C} = \left[ 0/1 \times 0/1 \times 0/1 \times 1/1 \right] \times 1/9 = 0 \end{equation}\]

Decision: Airline A has no training flights with 1 stop, so its score is 0. Since $P(\text{Airline B}) > P(\text{Airline A}) = P(\text{Airline C})$, we conclude that the new instance is Airline B.

Aircraft_Type	Demand	Weather_Conditions	Number_of_Stops	Airline
Boeing 777	Low	Snow	1	Airline B
Boeing 737	Low	Rain	0	Airline B
Boeing 777	Low	Rain	1	Airline B
Boeing 787	Low	Clear	0	Airline A
Boeing 737	Low	Cloudy	0	Airline B
Airbus A320	Low	Cloudy	0	Airline B
Airbus A320	Low	Rain	0	Airline B
Boeing 787	Medium	Clear	0	Airline B
Airbus A320	Low	Clear	0	Airline A
Boeing 777	Low	Cloudy	0	Airline A

3 Implementation in R

3.1 Loading necessary packages

library(dlookr)

## Warning: package 'dlookr' was built under R version 4.4.0

## Registered S3 methods overwritten by 'dlookr':
##   method          from  
##   plot.transform  scales
##   print.transform scales

## 
## Attaching package: 'dlookr'

## The following object is masked from 'package:tidyr':
## 
##     extract

## The following object is masked from 'package:base':
## 
##     transform

library(tidyverse)
library(caret)

## Warning: package 'caret' was built under R version 4.3.3

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(caTools)
library(e1071)

## 
## Attaching package: 'e1071'

## The following objects are masked from 'package:dlookr':
## 
##     kurtosis, skewness

3.2 Data cleaning and Data Wrangling

train <- read.csv("C:/Users/DELL/Desktop/2024_Projects/Healthcare prediction/Full linear regression/train.csv")
test <- read.csv("C:/Users/DELL/Desktop/2024_Projects/Healthcare prediction/Full linear regression/test.csv")
train_subset <- sample_n(train, 10) %>% select(Aircraft_Type, Demand, Weather_Conditions, Number_of_Stops, Airline )
head(test, 5)

##   Flight_ID   Airline  Departure_City  Arrival_City Distance Departure_Time
## 1    F45001 Airline B       Davidstad     Moorebury     3096          18:43
## 2    F45002 Airline A      Lake Tyler   Camachoberg     8760           1:16
## 3    F45003 Airline C       New Carol West Ryanfurt     6365          12:17
## 4    F45004 Airline A Richardsonshire   Jordanburgh     7836           0:11
## 5    F45005 Airline B     Tiffanytown    Morganstad     1129           3:22
##   Arrival_Time Duration Aircraft_Type Number_of_Stops Day_of_Week
## 1         0:14     5.52    Boeing 737               1    Saturday
## 2        13:04    11.80   Airbus A380               1    Thursday
## 3        21:52     9.59    Boeing 777               1      Sunday
## 4        10:23    10.21   Airbus A380               0    Thursday
## 5         5:13     1.86   Airbus A320               1    Saturday
##   Month_of_Travel Holiday_Season Demand Weather_Conditions Passenger_Count
## 1          August         Summer Medium              Clear             110
## 2           April           None   High              Clear             295
## 3         January           None    Low               Rain             223
## 4           March           None    Low               Rain             223
## 5          August         Summer   High             Cloudy             145
##   Promotion_Type Fuel_Price
## 1           None       0.95
## 2       Discount       1.05
## 3       Discount       0.63
## 4           None       0.88
## 5  Special Offer       1.11

head(train, 5)

##   Flight_ID   Airline   Departure_City    Arrival_City Distance Departure_Time
## 1        F1 Airline B                       Greenshire     8286           8:23
## 2        F2 Airline C      Leonardland     New Stephen     2942          20:28
## 3        F3 Airline B South Dylanville Port Ambermouth     2468          11:30
## 4        F4                  Blakefort      Crosbyberg     3145          20:24
## 5        F5 Airline B      Michaelport    Onealborough     5558          21:59
##   Arrival_Time Duration Aircraft_Type Number_of_Stops Day_of_Week
## 1        20:19    11.94    Boeing 787               0   Wednesday
## 2         1:45     5.29   Airbus A320               0   Wednesday
## 3        15:54     4.41    Boeing 787               1      Sunday
## 4         1:21     4.96    Boeing 787               0      Sunday
## 5         6:04     8.09    Boeing 737               1    Thursday
##   Month_of_Travel Holiday_Season Demand Weather_Conditions Passenger_Count
## 1        December         Summer    Low               Rain             240
## 2           March         Spring    Low               Rain             107
## 3       September         Summer   High             Cloudy             131
## 4        February           Fall    Low             Cloudy             170
## 5         January           None                     Clear             181
##   Promotion_Type Fuel_Price Flight_Price
## 1  Special Offer       0.91       643.93
## 2           None       1.08       423.13
## 3                      0.52       442.17
## 4       Discount       0.71       394.42
## 5           None       1.09       804.35

emp <-train%>%filter(Airline=="" ) %>% count() %>% summarise(Empty = (n/length(train$Airline))*100)
nempt <-train%>%filter(Airline!="" )%>% count() %>% summarise(Not_Empty = (n/length(train$Airline))*100)
cbind(emp, nempt) %>% pivot_longer(cols = everything())

## # A tibble: 2 × 2
##   name      value
##   <chr>     <dbl>
## 1 Empty      7.94
## 2 Not_Empty 92.1

3.3 Checking for Missing values

plot_na_pareto(test)

plot_na_pareto(train)

3.4 Checking the percentage of missing rows

emp_c <-test%>%filter(Month_of_Travel=="" ) %>% count() %>% summarise(Empty = (n/length(test$Month_of_Travel))*100)
nempt_c <-test%>%filter(Month_of_Travel!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Month_of_Travel))*100)
cbind(emp_c, nempt_c) %>% pivot_longer(cols = everything())

## # A tibble: 2 × 2
##   name      value
##   <chr>     <dbl>
## 1 Empty      0.68
## 2 Not_Empty 99.3

empp <-test%>%filter(Day_of_Week=="" ) %>% count() %>% summarise(Empty = (n/length(test$Day_of_Week))*100)
nemptp <-test%>%filter(Day_of_Week!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Day_of_Week))*100)
cbind(empp, nemptp) %>% pivot_longer(cols = everything())

## # A tibble: 2 × 2
##   name      value
##   <chr>     <dbl>
## 1 Empty       0.5
## 2 Not_Empty  99.5

emp_de <-test%>%filter(Demand=="" ) %>% count() %>% summarise(Empty = (n/length(test$Demand))*100)
nempt_de <-test%>%filter(Demand!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Demand))*100)
cbind(emp_de, nempt_de) %>% pivot_longer(cols = everything())

## # A tibble: 2 × 2
##   name      value
##   <chr>     <dbl>
## 1 Empty      0.68
## 2 Not_Empty 99.3

emp_hs <-test%>%filter(Holiday_Season=="" ) %>% count() %>% summarise(Empty = (n/length(test$Holiday_Season))*100)
nempt_hs <-test%>%filter(Holiday_Season!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Holiday_Season))*100)
cbind(emp_hs, nempt_hs) %>% pivot_longer(cols = everything())

## # A tibble: 2 × 2
##   name      value
##   <chr>     <dbl>
## 1 Empty         0
## 2 Not_Empty   100

emp_h <-test%>%filter(Weather_Conditions=="" ) %>% count() %>% summarise(Empty = (n/length(test$Weather_Conditions))*100)
nempt_h <-test%>%filter(Weather_Conditions!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Weather_Conditions))*100)
cbind(emp_h, nempt_h) %>% pivot_longer(cols = everything())

## # A tibble: 2 × 2
##   name      value
##   <chr>     <dbl>
## 1 Empty      0.98
## 2 Not_Empty 99.0

emp_hsf <-test%>%filter(Promotion_Type=="" ) %>% count() %>% summarise(Empty = (n/length(test$Promotion_Type))*100)
nempt_hsf <-test%>%filter(Promotion_Type!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Promotion_Type))*100)
cbind(emp_hsf, nempt_hsf) %>% pivot_longer(cols = everything())

## # A tibble: 2 × 2
##   name      value
##   <chr>     <dbl>
## 1 Empty      0.98
## 2 Not_Empty 99.0

emp_hsp <-test%>%filter(Number_of_Stops=="" ) %>% count() %>% summarise(Empty = (n/length(test$Number_of_Stops))*100)
nempt_hsp <-test%>%filter(Number_of_Stops!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Number_of_Stops))*100)
 cbind(emp_hsp, nempt_hsp) %>% pivot_longer(cols = everything())

## # A tibble: 2 × 2
##   name      value
##   <chr>     <dbl>
## 1 Empty         0
## 2 Not_Empty   100

emp_hspq <-test%>%filter(Aircraft_Type=="" ) %>% count() %>% summarise(Empty = (n/length(test$Aircraft_Type))*100)
nempt_hspq <-test%>%filter(Aircraft_Type!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Aircraft_Type))*100)
cbind(emp_hspq, nempt_hspq) %>% pivot_longer(cols = everything())

## # A tibble: 2 × 2
##   name      value
##   <chr>     <dbl>
## 1 Empty      0.16
## 2 Not_Empty 99.8
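
The column-by-column checks above all follow one pattern, so they can be folded into a single helper. The base R sketch below is one possible way to do that; `pct_empty` and the toy data frame are hypothetical names introduced for illustration. Unlike the checks above, which count only empty strings, this version also treats `NA` as empty.

```r
# Percentage of empty-string (or NA) entries in every column of a data frame.
pct_empty <- function(df) {
  sapply(df, function(col) mean(is.na(col) | as.character(col) == "") * 100)
}

# Hypothetical toy data standing in for `test`.
toy <- data.frame(Demand = c("Low", "", "High", "Low"),
                  Number_of_Stops = c(0, 1, 1, 3))
print(pct_empty(toy))
```

Calling `pct_empty(test)` on the real data would return one percentage per column in a single named vector.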

3.5 Cleaning the dataset : Removing all empty values

test_nempty <-test%>%filter(Airline!="" )
test_nempty <-test_nempty%>%filter(Day_of_Week!="" )
test_nempty <-test_nempty%>%filter(Departure_City!="" )
test_nempty <-test_nempty%>%filter(Arrival_City!="" )
test_nempty <-test_nempty%>%filter(Aircraft_Type!="" )
test_nempty <-test_nempty%>%filter(Promotion_Type!="" )
test_nempty <-test_nempty%>%filter(Number_of_Stops!="" )
test_nempty <-test_nempty%>%filter(Weather_Conditions!="" )
test_nempty <-test_nempty%>%filter(Demand!="" )
test_nempty <-test_nempty%>%filter(Holiday_Season!="" )
test_nempty <-test_nempty%>%filter(Month_of_Travel!="" )
head(test_nempty,5)

##   Flight_ID   Airline  Departure_City  Arrival_City Distance Departure_Time
## 1    F45001 Airline B       Davidstad     Moorebury     3096          18:43
## 2    F45002 Airline A      Lake Tyler   Camachoberg     8760           1:16
## 3    F45003 Airline C       New Carol West Ryanfurt     6365          12:17
## 4    F45004 Airline A Richardsonshire   Jordanburgh     7836           0:11
## 5    F45005 Airline B     Tiffanytown    Morganstad     1129           3:22
##   Arrival_Time Duration Aircraft_Type Number_of_Stops Day_of_Week
## 1         0:14     5.52    Boeing 737               1    Saturday
## 2        13:04    11.80   Airbus A380               1    Thursday
## 3        21:52     9.59    Boeing 777               1      Sunday
## 4        10:23    10.21   Airbus A380               0    Thursday
## 5         5:13     1.86   Airbus A320               1    Saturday
##   Month_of_Travel Holiday_Season Demand Weather_Conditions Passenger_Count
## 1          August         Summer Medium              Clear             110
## 2           April           None   High              Clear             295
## 3         January           None    Low               Rain             223
## 4           March           None    Low               Rain             223
## 5          August         Summer   High             Cloudy             145
##   Promotion_Type Fuel_Price
## 1           None       0.95
## 2       Discount       1.05
## 3       Discount       0.63
## 4           None       0.88
## 5  Special Offer       1.11

test_nempty <-test_nempty%>%drop_na()

4 Data Exploration: Descriptive Analysis and Feature

Airline <- test_nempty %>% select(Airline, Flight_ID)%>%group_by(Airline)%>%count() %>%summarise(percentage = (n / length(test_nempty$Flight_ID)) * 100)

# Plot the bar plot with data labels as percentages
ggplot(Airline, aes(x = Airline, y = percentage)) +
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, size=7.5) +  # Percentage labels (ggrepel::geom_text_repel() is an alternative if labels overlap)
  labs(y = " ",x =" ", title = "Distribution of Airlines: the classes are roughly balanced") +
  theme_minimal()+ 
  theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1))

# Calculate percentage of each category
Day_of_Week <- test_nempty %>%
  group_by(Day_of_Week, Airline) %>%
  count() %>%
  summarise(percentage = (n / nrow(test_nempty)) * 100)

## `summarise()` has grouped output by 'Day_of_Week'. You can override using the
## `.groups` argument.

# Plot the bar plot with data labels as percentages
ggplot(Day_of_Week, aes(x = Day_of_Week, y = percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +  # Percentage labels (ggrepel::geom_text_repel() is an alternative if labels overlap)
  labs(y = " ",x =" ", title = "Distribution of Airlines by Day of the Week") +
  theme_minimal()+ 
  theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))

# Calculate percentage of each category
Aircraft_Type <- test_nempty %>%
  group_by(Aircraft_Type, Airline) %>%
  count() %>%
  summarise(percentage = (n / nrow(test_nempty)) * 100)

## `summarise()` has grouped output by 'Aircraft_Type'. You can override using the
## `.groups` argument.

# Plot the bar plot with data labels as percentages
ggplot(Aircraft_Type, aes(x = Aircraft_Type, y = percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +  # Percentage labels (ggrepel::geom_text_repel() is an alternative if labels overlap)
  labs(y = " ",x =" ", title = "Distribution of Airlines by Aircraft_Type") +
  theme_minimal()+ 
 theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))

# Calculate percentage of each category
Promotion_Type <- test_nempty %>%
  group_by(Promotion_Type, Airline) %>%
  count() %>%
  summarise(percentage = (n / nrow(test_nempty)) * 100)

## `summarise()` has grouped output by 'Promotion_Type'. You can override using
## the `.groups` argument.

# Plot the bar plot with data labels as percentages
ggplot(Promotion_Type, aes(x = Promotion_Type, y = percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +  # Percentage labels (ggrepel::geom_text_repel() is an alternative if labels overlap)
  labs(y = " ",x =" ", title = "Distribution of Airlines by Promotion_Type") +
  theme_minimal()+ 
 theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))

# Calculate percentage of each category
Number_of_Stops <- test_nempty %>%
  group_by(Number_of_Stops, Airline) %>%
  count() %>%
  summarise(percentage = (n / nrow(test_nempty)) * 100)

## `summarise()` has grouped output by 'Number_of_Stops'. You can override using
## the `.groups` argument.

# Plot the bar plot with data labels as percentages
ggplot(Number_of_Stops, aes(x = Number_of_Stops, y = percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +  # Percentage labels (ggrepel::geom_text_repel() is an alternative if labels overlap)
  labs(y = " ",x =" ", title = "Distribution of Airlines by Number_of_Stops") +
  theme_minimal()+ 
 theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))

# Calculate percentage of each category
Weather_Conditions <- test_nempty %>%
  group_by(Weather_Conditions, Airline) %>%
  count() %>%
  summarise(percentage = (n / nrow(test_nempty)) * 100)

## `summarise()` has grouped output by 'Weather_Conditions'. You can override
## using the `.groups` argument.

# Plot the bar plot with data labels as percentages
ggplot(Weather_Conditions, aes(x = Weather_Conditions, y = percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +  # Percentage labels (ggrepel::geom_text_repel() is an alternative if labels overlap)
  labs(y = " ",x =" ", title = "Distribution of Airlines by Weather_Conditions") +
  theme_minimal()+ 
  theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))

# Calculate percentage of each category
Demand <- test_nempty %>%
  group_by(Demand, Airline) %>%
  count() %>%
  summarise(percentage = (n / nrow(test_nempty)) * 100)

## `summarise()` has grouped output by 'Demand'. You can override using the
## `.groups` argument.

# Plot the bar plot with data labels as percentages
ggplot(Demand, aes(x = Demand, y = percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +  # Percentage labels (ggrepel::geom_text_repel() is an alternative if labels overlap)
  labs(y = " ",x =" ", title = "Distribution of Airlines by Demand") +
  theme_minimal()+ 
 theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))

# Calculate percentage of each category
Holiday_Season <- test_nempty %>%
  group_by(Holiday_Season, Airline) %>%
  count() %>%
  summarise(percentage = (n / nrow(test_nempty)) * 100)

## `summarise()` has grouped output by 'Holiday_Season'. You can override using
## the `.groups` argument.

# Plot the bar plot with data labels as percentages
ggplot(Holiday_Season, aes(x = Holiday_Season, y = percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +  # Percentage labels (ggrepel::geom_text_repel() is an alternative if labels overlap)
  labs(y = " ",x =" ", title = "Distribution of Airlines by Holiday_Season") +
  theme_minimal()+ 
  theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))

# Calculate percentage of each category
Month_of_Travel <- test_nempty %>%
  group_by(Month_of_Travel, Airline) %>%
  count() %>%
  summarise(percentage = (n / nrow(test_nempty)) * 100)

## `summarise()` has grouped output by 'Month_of_Travel'. You can override using
## the `.groups` argument.

# Plot the bar plot with data labels as percentages
ggplot(Month_of_Travel, aes(x = Month_of_Travel, y = percentage, fill = Airline)) +
  geom_bar(stat = "identity", position = "dodge") +
  geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +  # Percentage labels (ggrepel::geom_text_repel() is an alternative if labels overlap)
  labs(y = " ",x =" ", title = "Distribution of Airlines by Month of Travel") +
  theme_minimal()+ 
  theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))

5 Model training

test_nempty$Airline <- as.factor(test_nempty$Airline)
test_nempty$Number_of_Stops <- as.factor(test_nempty$Number_of_Stops)

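# sample.split() returns a logical vector that is TRUE for about 80% of the rows
sample <- sample.split(test_nempty$Flight_ID, SplitRatio = 0.8)
# Note: as written, the TRUE partition (about 80% of the rows) becomes test_data
# and the remaining 20% becomes train_data; the outputs below reflect this split.
test_data <- subset(test_nempty, sample == TRUE)
train_data <- subset(test_nempty, sample == FALSE)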
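
The more common convention is to fit the model on the larger share of the data and to evaluate it on the smaller one. The sketch below shows that convention in base R on made-up row indices; `n_rows`, `is_train`, `train_idx` and `test_idx` are hypothetical names, and simple random sampling stands in for `sample.split()`.

```r
set.seed(20)

n_rows <- 100                                  # hypothetical number of cleaned rows

# Mark a random 80% of the rows as training rows
is_train <- rep(FALSE, n_rows)
is_train[sample(n_rows, size = round(0.8 * n_rows))] <- TRUE

train_idx <- which(is_train)                   # 80% of rows: used to fit the model
test_idx  <- which(!is_train)                  # 20% of rows: held out for evaluation

length(train_idx)
length(test_idx)
```

With the real data, the two index vectors would be used as `test_nempty[train_idx, ]` and `test_nempty[test_idx, ]`. Adopting this convention would change every result reported in the following sections.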

5.1 Training the model

NB_AT <-naiveBayes(Airline~ Aircraft_Type, train_data)
NB_NS <-naiveBayes(Airline~ Number_of_Stops, train_data)
NB_DW <-naiveBayes(Airline~ Day_of_Week, train_data)
NB_MT <-naiveBayes(Airline~ Month_of_Travel, train_data)
NB_HS <-naiveBayes(Airline~ Holiday_Season, train_data)
NB_DE <-naiveBayes(Airline~ Demand, train_data)
NB_WC <-naiveBayes(Airline~ Weather_Conditions, train_data)
NB_PT <-naiveBayes(Airline~ Promotion_Type, train_data)
# Let's take a look at the results

NB_AT

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## Airline A Airline B Airline C 
## 0.3294798 0.3364162 0.3341040 
## 
## Conditional probabilities:
##            Aircraft_Type
## Y           Airbus A320 Airbus A380 Boeing 737 Boeing 777 Boeing 787
##   Airline A   0.1964912   0.1894737  0.2421053  0.1824561  0.1894737
##   Airline B   0.1752577   0.1855670  0.1958763  0.2268041  0.2164948
##   Airline C   0.2006920   0.2076125  0.2387543  0.1730104  0.1799308

NB_NS

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## Airline A Airline B Airline C 
## 0.3294798 0.3364162 0.3341040 
## 
## Conditional probabilities:
##            Number_of_Stops
## Y                    0          1          3
##   Airline A 0.40000000 0.52982456 0.07017544
##   Airline B 0.44329897 0.50515464 0.05154639
##   Airline C 0.48096886 0.46712803 0.05190311

NB_DW

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## Airline A Airline B Airline C 
## 0.3294798 0.3364162 0.3341040 
## 
## Conditional probabilities:
##            Day_of_Week
## Y              Friday    Monday  Saturday    Sunday  Thursday   Tuesday
##   Airline A 0.1508772 0.1333333 0.1263158 0.1894737 0.1438596 0.1368421
##   Airline B 0.1271478 0.1683849 0.1512027 0.1512027 0.1305842 0.1340206
##   Airline C 0.1349481 0.1176471 0.1764706 0.1384083 0.1487889 0.1522491
##            Day_of_Week
## Y           Wednesday
##   Airline A 0.1192982
##   Airline B 0.1374570
##   Airline C 0.1314879

NB_MT

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## Airline A Airline B Airline C 
## 0.3294798 0.3364162 0.3341040 
## 
## Conditional probabilities:
##            Month_of_Travel
## Y                April     August   December   February    January       July
##   Airline A 0.07368421 0.08771930 0.08070175 0.05263158 0.08070175 0.09122807
##   Airline B 0.08591065 0.07216495 0.05841924 0.08591065 0.09621993 0.07216495
##   Airline C 0.09688581 0.07266436 0.08650519 0.06920415 0.09342561 0.08650519
##            Month_of_Travel
## Y                 June      March        May   November    October  September
##   Airline A 0.09473684 0.11578947 0.12631579 0.07719298 0.07719298 0.04210526
##   Airline B 0.09965636 0.11340206 0.09621993 0.08591065 0.08591065 0.04810997
##   Airline C 0.06228374 0.05536332 0.09688581 0.09688581 0.07958478 0.10380623

NB_HS

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## Airline A Airline B Airline C 
## 0.3294798 0.3364162 0.3341040 
## 
## Conditional probabilities:
##            Holiday_Season
## Y                Fall      None    Spring    Summer    Winter
##   Airline A 0.2000000 0.2140351 0.1719298 0.2105263 0.2035088
##   Airline B 0.2164948 0.1924399 0.1890034 0.2199313 0.1821306
##   Airline C 0.2006920 0.2006920 0.1937716 0.2422145 0.1626298

NB_DE

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## Airline A Airline B Airline C 
## 0.3294798 0.3364162 0.3341040 
## 
## Conditional probabilities:
##            Demand
## Y                High       Low    Medium
##   Airline A 0.1438596 0.6350877 0.2210526
##   Airline B 0.1408935 0.6254296 0.2336770
##   Airline C 0.1487889 0.6262976 0.2249135

NB_WC

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## Airline A Airline B Airline C 
## 0.3294798 0.3364162 0.3341040 
## 
## Conditional probabilities:
##            Weather_Conditions
## Y               Clear    Cloudy      Rain      Snow
##   Airline A 0.2596491 0.2631579 0.2561404 0.2210526
##   Airline B 0.3092784 0.2336770 0.2027491 0.2542955
##   Airline C 0.2664360 0.2387543 0.2456747 0.2491349

NB_PT

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
## Airline A Airline B Airline C 
## 0.3294798 0.3364162 0.3341040 
## 
## Conditional probabilities:
##            Promotion_Type
## Y            Discount      None Special Offer
##   Airline A 0.2982456 0.3228070     0.3789474
##   Airline B 0.2886598 0.3780069     0.3333333
##   Airline C 0.3529412 0.2906574     0.3564014
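
For categorical predictors and no smoothing, the "A-priori probabilities" and "Conditional probabilities" blocks printed above are class proportions and row-normalised contingency tables. The base R sketch below reproduces that calculation on the first five rows of the training table from the introduction; `toy`, `priors` and `cond` are illustrative names, not objects from the analysis.

```r
# First five rows of the manual training table (Weather_Conditions and Airline only)
toy <- data.frame(
  Weather_Conditions = c("Cloudy", "Rain", "Snow", "Clear", "Rain"),
  Airline = c("Airline A", "Airline A", "Airline B", "Airline C", "Airline B"),
  stringsAsFactors = FALSE
)

# A-priori probabilities: share of each class in the training rows
priors <- prop.table(table(Y = toy$Airline))

# Conditional probabilities P(feature value | class): each row sums to 1
cond <- prop.table(
  table(Y = toy$Airline, Weather_Conditions = toy$Weather_Conditions),
  margin = 1
)

print(priors)
print(cond)
```

Each row of `cond` sums to 1, as do the rows of the conditional probability tables in the model output.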

# Let's summarise the results

summary(NB_AT)
summary(NB_NS)
summary(NB_DW)
summary(NB_MT)
summary(NB_HS)
summary(NB_DE)
summary(NB_WC)
summary(NB_PT)

5.2 Predictions

# predict() for naiveBayes models selects its output with `type` ("class" or "raw")
pre_NB_AT <- predict(NB_AT, newdata = test_data, type = "class")
pre_NB_NS <- predict(NB_NS, newdata = test_data, type = "class")
pre_NB_DW <- predict(NB_DW, newdata = test_data, type = "class")
pre_NB_MT <- predict(NB_MT, newdata = test_data, type = "class")
pre_NB_HS <- predict(NB_HS, newdata = test_data, type = "class")
pre_NB_DE <- predict(NB_DE, newdata = test_data, type = "class")
pre_NB_WC <- predict(NB_WC, newdata = test_data, type = "class")
pre_NB_PT <- predict(NB_PT, newdata = test_data, type = "class")

5.3 Model Evaluation: Confusion Matrix

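# Note: confusionMatrix() takes the predictions first (data) and the observed
# labels second (reference). The calls below pass them in the opposite order,
# so the statistics in the output (for example AccuracyNull) treat the
# predictions as the reference.
eva_NB_AT <- confusionMatrix(as.factor(test_data$Airline), pre_NB_AT)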
eva_NB_NS <- confusionMatrix(as.factor(test_data$Airline), pre_NB_NS)
eva_NB_DW <- confusionMatrix(as.factor(test_data$Airline), pre_NB_DW)
eva_NB_MT <- confusionMatrix(as.factor(test_data$Airline), pre_NB_MT)
eva_NB_HS <- confusionMatrix(as.factor(test_data$Airline), pre_NB_HS)
eva_NB_DE <- confusionMatrix(as.factor(test_data$Airline), pre_NB_DE)
eva_NB_WC <- confusionMatrix(as.factor(test_data$Airline), pre_NB_WC)
eva_NB_PT <- confusionMatrix(as.factor(test_data$Airline), pre_NB_PT)
### Results

ca<-eva_NB_AT$byClass
da<-eva_NB_NS$byClass
ea<-eva_NB_DW$byClass
fa<-eva_NB_MT$byClass
ga<-eva_NB_HS$byClass
ha<-eva_NB_DE$byClass
ia<-eva_NB_WC$byClass
ja<-eva_NB_PT$byClass
da <-rbind(ca,da,ea,fa,ga,ha,ia,ja)
da <- as.data.frame(da)


install.packages("gt")

## Warning: package 'gt' is in use and will not be installed

library(gt)
##############


c<-eva_NB_AT$overall
d<-eva_NB_NS$overall
e<-eva_NB_DW$overall
f<-eva_NB_MT$overall
g<-eva_NB_HS$overall
h<-eva_NB_DE$overall
i<-eva_NB_WC$overall
j<-eva_NB_PT$overall
d <-rbind(c,d,e,f,g,h,i,j)
nam <- c("NB_AT","NB_NS", "NB_DW","NB_MT","NB_HS","NB_DE","NB_WC","NB_PT")
rownames(d)<-nam
head(d)

##        Accuracy         Kappa AccuracyLower AccuracyUpper AccuracyNull
## NB_AT 0.3274515 -0.0062732412     0.3118148     0.3433806    0.4012149
## NB_NS 0.3222447 -0.0249573396     0.3066772     0.3381134    0.5620480
## NB_DW 0.3341047  0.0005871132     0.3183826     0.3501079    0.4298525
## NB_MT 0.3349725  0.0024501818     0.3192395     0.3509851    0.3335262
## NB_HS 0.3341047 -0.0019609845     0.3183826     0.3501079    0.4052647
## NB_DE 0.3277408  0.0041266293     0.3121003     0.3436732    0.8623084
##       AccuracyPValue McnemarPValue
## NB_AT      1.0000000  1.004804e-33
## NB_NS      1.0000000 2.147008e-245
## NB_DW      1.0000000  1.937358e-14
## NB_MT      0.4347563  7.534563e-01
## NB_HS      1.0000000  1.980062e-31
## NB_DE      1.0000000  0.000000e+00

d <- as.data.frame(d)
d$Model <- nam
d_long <- pivot_longer(d, -Model, names_to = "Metric", values_to = "Value")


ggplot(d_long, aes(x = Metric, y = Value, fill = Model)) +
  geom_bar(stat = "identity", position = "dodge", col = "black") +
  labs(y = " ", x = " ", title = "Performance metrics of the eight models with \n different features to predict the type of Airline") +
  theme_minimal()+ 
  theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 19, face = "bold"))

# Classifications based on multiple conditions
NB_DC_T <-naiveBayes(Airline~Number_of_Stops+Day_of_Week+Month_of_Travel+
                     Holiday_Season+Demand+Weather_Conditions+Promotion_Type , train_data)


pre_NB_DC_T <- predict(NB_DC_T, newdata = test_data, type = "class")
confusionMatrix(pre_NB_DC_T, test_data$Airline)

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Airline A Airline B Airline C
##   Airline A       355       331       368
##   Airline B       367       353       368
##   Airline C       460       434       421
## 
## Overall Statistics
##                                          
##                Accuracy : 0.3266         
##                  95% CI : (0.311, 0.3425)
##     No Information Rate : 0.3419         
##     P-Value [Acc > NIR] : 0.9727908      
##                                          
##                   Kappa : -0.0101        
##                                          
##  Mcnemar's Test P-Value : 0.0005549      
## 
## Statistics by Class:
## 
##                      Class: Airline A Class: Airline B Class: Airline C
## Sensitivity                    0.3003           0.3157           0.3639
## Specificity                    0.6927           0.6858           0.6113
## Pos Pred Value                 0.3368           0.3244           0.3202
## Neg Pred Value                 0.6558           0.6771           0.6564
## Prevalence                     0.3419           0.3234           0.3347
## Detection Rate                 0.1027           0.1021           0.1218
## Detection Prevalence           0.3049           0.3147           0.3804
## Balanced Accuracy              0.4965           0.5008           0.4876
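
The headline statistics in this output can be checked by hand from the confusion matrix. The base R sketch below copies the nine counts printed above and recomputes the accuracy, the no-information rate and Cohen's Kappa; `cm`, `nir`, `expected` and `kappa_stat` are illustrative names.

```r
# Confusion matrix copied from the output above (rows = prediction, columns = reference)
cm <- matrix(
  c(355, 331, 368,
    367, 353, 368,
    460, 434, 421),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    Prediction = c("Airline A", "Airline B", "Airline C"),
    Reference  = c("Airline A", "Airline B", "Airline C")
  )
)

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                       # share of correct predictions
nir <- max(colSums(cm)) / n                         # no-information rate: largest class share
expected <- sum(rowSums(cm) * colSums(cm)) / n^2    # agreement expected by chance
kappa_stat <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, NIR = nir, Kappa = kappa_stat), 4)
```

The accuracy is slightly below the no-information rate and the Kappa is close to zero, so the combined model does no better than always predicting the most frequent airline.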

6 Conclusion

The model performed poorly: the combined classifier reached an accuracy of about 0.33 with a Kappa close to zero, which is no better than guessing among three roughly balanced airlines. This outcome should be read in light of the study’s objective, which was not to achieve optimal predictive accuracy but to demonstrate how the Naive Bayes algorithm is applied to the airline prediction problem.

Many machine learning algorithms exist, each with its own strengths and weaknesses. Naive Bayes is a valuable tool in some scenarios, but it may fall short in prediction tasks like airline classification, where the features appear to carry little information about the class: the conditional probabilities estimated in Section 5.1 are nearly the same for all three airlines.

Nonetheless, the study offers useful insights. One is the use of Laplace smoothing in the manual predictions of Section 2.2, where some decision outcomes were otherwise inconclusive. Laplace smoothing, also known as additive smoothing, handles cases where a feature value never occurs together with a class in the training data and would therefore receive a probability of zero. By adding a small pseudo-count to every feature-value count, it prevents zero probabilities and gives more robust predictions.
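
As a minimal illustration of additive smoothing, the base R sketch below smooths a set of made-up counts for one class. The counts, the function name `laplace_prob` and the choice `alpha = 1` are assumptions for the example, not values from the study.

```r
# Hypothetical counts of Weather_Conditions within one airline class
counts <- c(Clear = 0, Cloudy = 1, Rain = 1, Snow = 0)

# Additive (Laplace) smoothing: add `alpha` pseudo-counts to every category,
# then renormalise so that the probabilities still sum to 1
laplace_prob <- function(counts, alpha = 1) {
  (counts + alpha) / (sum(counts) + alpha * length(counts))
}

unsmoothed <- counts / sum(counts)
smoothed   <- laplace_prob(counts, alpha = 1)

round(rbind(unsmoothed, smoothed), 3)
```

Without smoothing, "Clear" and "Snow" have probability zero, which would force the whole product of conditional probabilities for this class to zero. With smoothing every category keeps a small positive probability. The `laplace` argument that appears in the `Call:` lines of the model output in Section 5.1 controls the same adjustment in `naiveBayes()`.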

In summary, although the Naive Bayes model performed modestly here, the study provides methodological insights and shows the value of techniques such as Laplace smoothing for making predictive models more robust and reliable in real-world applications.

Exploring the Application of Naive Bayes Algorithm Classifier: Theory, Implementation, and Evaluation in Solving the Airline Prediction Problem

Adekunle Joseph Damilare

2024-04-21

1 Introduction

2 Computational Implementation

2.1 Process of Training

2.1.1 Then calculate the conditional probability of each attribute

2.1.1.1 Conditional probability of Airbus type

2.1.1.2 Conditional probability of Airbus type

2.1.1.3 Conditional probability of Weather Condition

2.1.1.4 Conditional probability of Number of stops

2.2 Process of Prediction

2.2.1 Prediction formula

2.2.2 Prediction for the first row

2.2.3 Prediction for the second row

2.2.4 Laplace Smoothing Prediction

2.2.4.1 Probability Calculation:

2.2.5 Prediction for new instance in third row

2.2.6 Laplace Smoothing Prediction

2.2.7 Prediction for new instance in fourth row

2.2.8 Prediction for new instance in fifth row: Note: This instance is the same as that of the third row. Hence, we proceed to predicting the sixth instance.

2.2.9 Prediction for new instance in sixth row

2.2.10 Prediction for new instance in seventh row

2.2.11 Prediction for new instance in eighth row

2.2.12 Prediction for new instance in ninth row

2.2.13 Laplace Smoothing Prediction

2.2.14 Prediction for new instance in tenth row

3 Implementation in R

3.1 Unpacking necessary packages

3.2 Data cleaning and Data Wrangling

3.3 Checking for Missing values

3.4 Checking the percentage of missing rows

3.5 Cleaning the dataset: Removing all empty values

4 Data Exploration: Descriptive Analysis and Feature

5 Model training

5.1 Training the model

5.2 Predictions

5.3 Model Evaluation: Confusion Matrix

6 Conclusion