set.seed(20)
Naive Bayes is a popular and powerful classification algorithm used in machine learning and data mining tasks. It’s based on Bayes’ theorem, which describes the probability of an event occurring based on prior knowledge of conditions that might be related to the event.
In Naive Bayes classification, the algorithm assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. This assumption is known as “naive” because it simplifies the calculation of probabilities. Despite its simplicity, Naive Bayes often performs surprisingly well in practice, especially in text classification and spam filtering tasks.
One of the key advantages of Naive Bayes is its computational efficiency and ability to handle large datasets with high-dimensional features. It’s also relatively easy to implement and requires minimal training data compared to other classification algorithms.
Naive Bayes is a versatile and efficient algorithm that’s widely used in various real-world applications due to its simplicity, speed, and effectiveness, especially in scenarios where the independence assumption holds reasonably well.
In this manual, we’ll do a short demonstration of this algorithm, and also implement it in R.
To start with, consider the table below, which is the training data. From this data we train a model capable of predicting the airline based on a set of given features. Once the model is trained, a new traveller in the test dataset should be able to decide which of the airlines (A, B, or C) to book based on those features. In this problem, the target variable is Airline, while the features are Aircraft_Type, Demand, Weather_Conditions, and Number_of_Stops.
| Aircraft_Type | Demand | Weather_Conditions | Number_of_Stops | Airline |
|---|---|---|---|---|
| Airbus A320 | Low | Cloudy | 0 | Airline A |
| Airbus A380 | Low | Rain | 0 | Airline A |
| Airbus A380 | Low | Snow | 1 | Airline B |
| Boeing 777 | Medium | Clear | 1 | Airline C |
| Boeing 777 | Low | Rain | 3 | Airline B |
| Boeing 737 | Low | Rain | 0 | Airline B |
| Airbus A320 | Low | Rain | 0 | Airline A |
| Boeing 787 | Low | Snow | 0 | Airline B |
| Airbus A380 | Medium | Cloudy | 0 | Airline B |
\[ P(C_k | x) = \frac{P(x | C_k) \cdot P(C_k)}{P(x)}\] In this formula, \(P(C_k | x)\) is the posterior probability of class \(C_k\) given the observed features \(x\), \(P(x | C_k)\) is the likelihood of those features under class \(C_k\), \(P(C_k)\) is the prior probability of the class, and \(P(x)\) is the probability of the evidence, which acts as a normalising constant.
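To make the theorem concrete, here is a tiny numeric sketch in R; the prior, likelihood, and evidence values below are made up purely for illustration:
prior      <- 1/3   # P(C_k): prior probability of the class (hypothetical value)
likelihood <- 2/3   # P(x | C_k): probability of the observed features given the class (hypothetical)
evidence   <- 0.4   # P(x): overall probability of the observed features (hypothetical)
posterior  <- likelihood * prior / evidence
posterior            # approximately 0.56, i.e. P(C_k | x)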
We start training the model by calculating the prior probability of each class from the training data. \[\begin{equation} \text{Probability of Airline A} = P(\text{Airline} = \text{Airline A}) = \frac{3}{9} = \frac{1}{3} \end{equation}\]
\[\begin{equation} \text{Probability of Airline B} = P(\text{Airline} = \text{Airline B}) = \frac{5}{9} \end{equation}\]
\[\begin{equation} \text{Probability of Airline C} = P(\text{Airline} = \text{Airline C}) = \frac{1}{9} \end{equation}\]
\[\begin{equation} n(\text{Airline} = \text{A}) = 3, \quad n(\text{Airline} = \text{B}) = 5, \quad n(\text{Airline} = \text{C}) = 1 \end{equation}\]
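These priors are easy to reproduce in R. The sketch below rebuilds the nine training rows as a small data frame (toy_train is a name introduced here for illustration only) and computes the class proportions:
toy_train <- data.frame(
  Aircraft_Type = c("Airbus A320", "Airbus A380", "Airbus A380", "Boeing 777", "Boeing 777",
                    "Boeing 737", "Airbus A320", "Boeing 787", "Airbus A380"),
  Demand = c("Low", "Low", "Low", "Medium", "Low", "Low", "Low", "Low", "Medium"),
  Weather_Conditions = c("Cloudy", "Rain", "Snow", "Clear", "Rain", "Rain", "Rain", "Snow", "Cloudy"),
  Number_of_Stops = c(0, 0, 1, 1, 3, 0, 0, 0, 0),
  Airline = c("Airline A", "Airline A", "Airline B", "Airline C", "Airline B",
              "Airline B", "Airline A", "Airline B", "Airline B")
)
# Class priors P(Airline): 1/3, 5/9 and 1/9 respectively
prop.table(table(toy_train$Airline))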
Next, we compute the conditional probability of each feature value given each class; the four tables below summarise these conditional probabilities.

| Aircraft_Type | Airline.A | Airline.B | Airline.C |
|---|---|---|---|
| Airbus A320 | 0.6666667 | 0.0 | 0 |
| Airbus A380 | 0.3333333 | 0.4 | 0 |
| Boeing 777 | 0.0000000 | 0.2 | 1 |
| Boeing 737 | 0.0000000 | 0.2 | 0 |
| Boeing 787 | 0.0000000 | 0.2 | 0 |
| Demand | Airline.A | Airline.B | Airline.C |
|---|---|---|---|
| Low | 1 | 0.8 | 0 |
| Medium | 0 | 0.2 | 1 |
| High | 0 | 0.0 | 0 |
| Weather.Condition | Airline.A | Airline.B | Airline.C |
|---|---|---|---|
| Cloudy | 0.3333333 | 0.2 | 0 |
| Rain | 0.6666667 | 0.4 | 0 |
| Snow | 0.0000000 | 0.4 | 0 |
| Clear | 0.0000000 | 0.0 | 1 |
| Number.of.stops | Airline.A | Airline.B | Airline.C |
|---|---|---|---|
| 0 | 1 | 0.6 | 0 |
| 1 | 0 | 0.2 | 1 |
| 2 | 0 | 0.0 | 0 |
| 3 | 0 | 0.2 | 0 |
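The four conditional probability tables above can be reproduced by normalising the cross-tabulation of each feature against the class, assuming the toy_train data frame from the earlier sketch:
# P(feature value | Airline): each column of the result sums to 1
prop.table(table(toy_train$Aircraft_Type,      toy_train$Airline), margin = 2)
prop.table(table(toy_train$Demand,             toy_train$Airline), margin = 2)
prop.table(table(toy_train$Weather_Conditions, toy_train$Airline), margin = 2)
prop.table(table(toy_train$Number_of_Stops,    toy_train$Airline), margin = 2)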
We now predict and classify the new instances in the table below. Imagine yourself as a traveller planning a journey, unsure which airline best suits it. Fortunately, you have some information about the journey, and the model we have just trained will help you decide.
| Aircraft_Type | Demand | Weather_Conditions | Number_of_Stops | Airline |
|---|---|---|---|---|
| Boeing 777 | Low | Rain | 3 | ? |
| Airbus A320 | Low | Snow | 0 | ? |
| Airbus A380 | High | Clear | 0 | ? |
| Airbus A320 | Low | Rain | 0 | ? |
| Airbus A380 | High | Clear | 0 | ? |
| Boeing 737 | Low | Snow | 1 | ? |
| Boeing 777 | Low | Snow | 1 | ? |
| Boeing 777 | Low | Cloudy | 1 | ? |
| Airbus A320 | Medium | Clear | 1 | ? |
| Airbus A380 | Low | Rain | 1 | ? |
\[ U_{NB} = \underset{U_j \in U}{\text{argmax}} \; P(U_j) \prod_{i=1}^{n} P(a_i | U_j) \]
Where \(U_{NB}\) is the class returned by the Naive Bayes classifier, \(U_j\) ranges over the possible classes \(U = \{\text{Airline A}, \text{Airline B}, \text{Airline C}\}\), \(a_1, \dots, a_n\) are the observed feature values (Aircraft_Type, Demand, Weather_Conditions, Number_of_Stops), and \(P(a_i | U_j)\) is the conditional probability of feature value \(a_i\) given class \(U_j\), read from the tables above.
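As a minimal sketch of this decision rule, the helper below (nb_score, a name used only in this manual) returns the unnormalised score of every class for one new observation, estimating each conditional probability as a relative frequency in the toy_train data frame defined earlier:
nb_score <- function(newrow, train, target = "Airline") {
  classes  <- unique(train[[target]])
  features <- setdiff(names(train), target)
  sapply(classes, function(cl) {
    rows  <- train[train[[target]] == cl, ]            # training rows of this class
    prior <- nrow(rows) / nrow(train)                  # P(U_j)
    conds <- sapply(features, function(f)              # P(a_i | U_j) for each feature
      mean(rows[[f]] == newrow[[f]]))
    prior * prod(conds)                                # P(U_j) * prod_i P(a_i | U_j)
  })
}
Calling which.max() on the returned vector gives the predicted class.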
Following this, we make the predictions by hand for each row of the test data: for every row we substitute the corresponding conditional probabilities and the class prior into the formula above and choose the class with the largest score.
To predict the airline for the first test row (Aircraft_Type = Boeing 777, Demand = Low, Weather_Conditions = Rain, Number_of_Stops = 3), we apply the Naive Bayes prediction formula: \[\begin{equation} \text{Predicted Class} = \underset{U_j}{\text{argmax}} \left[ P(\text{Aircraft_Type} = \text{Boeing 777} | \text{Airline} = U_j) \\ \times P(\text{Demand} = \text{Low} | \text{Airline} = U_j) \times P(\text{Weather_Conditions} = \text{Rain} | \text{Airline} = U_j) \\ \times P(\text{Number_of_Stops} = 3 | \text{Airline} = U_j) \times P(\text{Airline} = U_j) \right] \end{equation}\] Substituting the values from the conditional probability tables gives one score per class: \[\begin{equation} \text{Airline A} = 0 \times 1 \times \frac{2}{3} \times 0 \times \frac{1}{3} = 0 \end{equation}\]
\[\begin{equation} \text{Airline B} = \frac{1}{5} \times \frac{4}{5} \times \frac{2}{5} \times \frac{1}{5} \times \frac{5}{9} \approx 0.0071 \end{equation}\]
\[\begin{equation} \text{Airline C} = 1 \times 0 \times 0 \times 0 \times \frac{1}{9} = 0 \end{equation}\]
Decision: Since \(P(\text{Airline B}) > P(\text{Airline A})\) and \(P(\text{Airline B}) > P(\text{Airline C})\), we classify this instance as Airline B.
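The three scores can also be checked by direct arithmetic in R:
# Scores for (Boeing 777, Low, Rain, 3 stops); Airline B wins with about 0.0071
c(Airline_A = 0     * 1     * (2/3) * 0     * (1/3),
  Airline_B = (1/5) * (4/5) * (2/5) * (1/5) * (5/9),
  Airline_C = 1     * 0     * 0     * 0     * (1/9))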
For the second test row (Aircraft_Type = Airbus A320, Demand = Low, Weather_Conditions = Snow, Number_of_Stops = 0), we use the same formula with the corresponding conditional probabilities: \[\begin{equation} \text{Airline A} = \frac{2}{3} \times 1 \times 0 \times 1 \times \frac{1}{3} = 0 \end{equation}\]
\[\begin{equation} \text{Airline B} = 0 \times \frac{4}{5} \times \frac{2}{5} \times \frac{3}{5} \times \frac{5}{9} = 0 \end{equation}\]
\[\begin{equation} \text{Airline C} = 0 \times 0 \times 0 \times 0 \times \frac{1}{9} = 0 \end{equation}\]
Decision: Since \(P(\text{Airline A}) = P(\text{Airline B}) = P(\text{Airline C}) = 0\), the comparison is inconclusive. Hence, we use the Laplace smoothing formula
\[P_{\text{Lap},K}(x \mid y) = \frac{c(x,y)+K}{c(y) + K \cdot N}\] where \(c(x,y)\) is the number of training rows of class \(y\) with feature value \(x\), \(c(y)\) is the number of training rows of class \(y\), \(N = 9\) is the total number of training rows, and \(K\) is the smoothing parameter, set to 2 here.
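The helper below (laplace_est, another name introduced only for this manual) mirrors the hand calculations that follow, using the same K = 2 and the same denominator convention, namely the class count plus K times the nine training rows:
laplace_est <- function(value, feature, cl, train, K = 2, target = "Airline") {
  rows <- train[train[[target]] == cl, ]
  (sum(rows[[feature]] == value) + K) / (nrow(rows) + K * nrow(train))
}
# e.g. P_lap(Aircraft_Type = "Airbus A320" | Airline A) = (2 + 2) / (3 + 2 * 9) = 4/21
laplace_est("Airbus A320", "Aircraft_Type", "Airline A", toy_train)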
Airline A: \[ \begin{align*} &= \left( \frac{2 + 2}{3+2(9)} \times \frac{3 + 2}{3+2(9)}\times \frac{0 + 2}{3+2(9)} \times \frac{3+2}{3+2(9)} \right) \times \frac{1}{3} \\ &= \left( \frac{4}{21} \times \frac{5}{21} \times \frac{2}{21} \times \frac{5}{21} \right) \times \frac{1}{3} \\ &= \left( \frac{200}{194,481} \right) \times \frac{1}{3} \\ &= \frac{200}{583,443} \\ &\approx 0.00034 \end{align*} \]
Airline B: \[ \begin{align*} &= \left( \frac{0+2}{5+2(9)} \times \frac{4+2}{5+2(9)}\times \frac{2+2}{5+2(9)} \times \frac{3+2}{5+2(9)}\right) \times \frac{5}{9} \\ &= \left( \frac{2}{23} \times \frac{6}{23} \times \frac{4}{23} \times \frac{5}{23} \right) \times \frac{5}{9} \\ &= \left( \frac{240}{279,841} \right) \times \frac{5}{9} \\ &= \frac{1,200}{2,518,569} \\ &= 0.0005 \end{align*} \]
Airline C: \[ \begin{align*} &= \left( \frac{0 + 2}{1+2(9)} \times \frac{0 + 2}{1 + 2(9)} \times \frac{0+2}{1+2(9)} \times \frac{0+2}{1+2(9)} \right) \times \frac{1}{9} \\ &= \left( \frac{2}{19} \times \frac{2}{19} \times \frac{2}{19} \times \frac{2}{19} \right) \times \frac{1}{9} \\ &= \left( \frac{16}{130,321} \right) \times \frac{1}{9} \\ &= \frac{16}{1,172,889} \\ &\approx 0.000014 \end{align*} \]
Now, we can compare these probabilities and make a decision. From the results above, Airline B > Airline A and Airline B > Airline C. Therefore, we predict this instance to be Airline B.
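As a cross-check, the smoothed score of Airline B for this row can be reproduced with the helper defined above:
# (2/23) * (6/23) * (4/23) * (5/23) * (5/9), approximately 0.00048
laplace_est("Airbus A320", "Aircraft_Type",      "Airline B", toy_train) *
  laplace_est("Low",       "Demand",             "Airline B", toy_train) *
  laplace_est("Snow",      "Weather_Conditions", "Airline B", toy_train) *
  laplace_est(0,           "Number_of_Stops",    "Airline B", toy_train) * (5/9)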
For the third test row (Aircraft_Type = Airbus A380, Demand = High, Weather_Conditions = Clear, Number_of_Stops = 0), substituting the corresponding conditional probabilities gives: \[\begin{equation} \text{Airline A} = \frac{1}{3} \times 0 \times 0 \times 1 \times \frac{1}{3} = 0 \end{equation}\]
\[\begin{equation} \text{Airline B} = \frac{2}{5} \times 0 \times 0 \times \frac{3}{5} \times \frac{5}{9} = 0 \end{equation}\]
\[\begin{equation} \text{Airline C} = 0 \times 0 \times 1 \times 0 \times \frac{1}{9} = 0 \end{equation}\]
Decision: Since \(P(\text{Airline A}) = P(\text{Airline B}) = P(\text{Airline C}) = 0\), the comparison is again inconclusive, so we apply the Laplace smoothing formula.
Airline A: \[ \begin{align*} &= \left( \frac{1 + 2}{3+2(9)} \times \frac{0 + 2}{3+2(9)}\times \frac{0 + 2}{3+2(9)} \times \frac{3+2}{3+2(9)} \right) \times \frac{1}{3} \\ &= \left( \frac{3}{21} \times \frac{2}{21} \times \frac{2}{21} \times \frac{5}{21} \right) \times \frac{1}{3} \\ &= \left( \frac{60}{194,481} \right) \times \frac{1}{3} \\ &= \frac{60}{583,443} \\ &\approx 0.0001 \end{align*} \]
Airline B: \[ \begin{align*} &= \left( \frac{2+2}{5+2(9)} \times \frac{0+2}{5+2(9)}\times \frac{0+2}{5+2(9)} \times \frac{3+2}{5+2(9)}\right) \times \frac{5}{9} \\ &= \left( \frac{4}{23} \times \frac{2}{23} \times \frac{2}{23} \times \frac{5}{23} \right) \times \frac{5}{9} \\ &= \left( \frac{80}{279,841} \right) \times \frac{5}{9} \\ &= \frac{400}{2,518,569} \\ &\approx 0.00016 \end{align*} \]
Airline C: \[ \begin{align*} &= \left( \frac{(0 + 2)}{1+2(9)} \times \frac{0 + 2}{1 + 2(9)} \times \frac{1+2}{1+2(9)} \times \frac{0+2}{1+2(9)} \right) \times \frac{1}{9} \\ &= \left( \frac{2}{19} \times \frac{2}{19} \times \frac{3}{19} \times \frac{2}{19} \right) \times \frac{1}{9} \\ &= \left( \frac{24}{130,321} \right) \times \frac{1}{9} \\ &= \frac{24}{1,172,889} \\ &= 0.00002 \end{align*} \]
Now we can compare these probabilities: Airline B > Airline A and Airline B > Airline C, so we predict this instance to be Airline B. The fifth test row has exactly the same feature values, so it receives the same prediction.
For the fourth test row (Aircraft_Type = Airbus A320, Demand = Low, Weather_Conditions = Rain, Number_of_Stops = 0), substituting the corresponding conditional probabilities gives: \[\begin{equation} \text{Airline A} = \frac{2}{3} \times 1 \times \frac{2}{3} \times 1 \times \frac{1}{3} = \frac{4}{27} \approx 0.15 \end{equation}\]
\[\begin{equation} \text{Airline B} = 0 \times \frac{4}{5} \times \frac{2}{5} \times \frac{3}{5} \times \frac{5}{9} = 0 \end{equation}\]
\[\begin{equation} \text{Airline C} = 0 \times 0 \times 0 \times 0 \times \frac{1}{9} = 0 \end{equation}\]
Decision: Since \(P(\text{Airline A}) > P(\text{Airline B})\) and \(P(\text{Airline A}) > P(\text{Airline C})\), we classify this instance as Airline A.
For the sixth test row (Aircraft_Type = Boeing 737, Demand = Low, Weather_Conditions = Snow, Number_of_Stops = 1), substituting the corresponding conditional probabilities gives: \[\begin{equation} \text{Airline A} = 0 \times 1 \times 0 \times 0 \times \frac{1}{3} = 0 \end{equation}\]
\[\begin{equation} \text{Airline B} = \frac{1}{5} \times \frac{4}{5} \times \frac{2}{5} \times \frac{1}{5} \times \frac{5}{9} \approx 0.0071 \end{equation}\]
\[\begin{equation} \text{Airline C} = 0 \times 0 \times 0 \times 1 \times \frac{1}{9} = 0 \end{equation}\]
Decision: Since \(P(\text{Airline B}) > P(\text{Airline A})\) and \(P(\text{Airline B}) > P(\text{Airline C})\), we classify this instance as Airline B.
For the seventh test row (Aircraft_Type = Boeing 777, Demand = Low, Weather_Conditions = Snow, Number_of_Stops = 1), substituting the corresponding conditional probabilities gives: \[\begin{equation} \text{Airline A} = 0 \times 1 \times 0 \times 0 \times \frac{1}{3} = 0 \end{equation}\]
\[\begin{equation} \text{Airline B} = \frac{1}{5} \times \frac{4}{5} \times \frac{2}{5} \times \frac{1}{5} \times \frac{5}{9} \approx 0.0071 \end{equation}\]
\[\begin{equation} \text{Airline C} = 1 \times 0 \times 0 \times 1 \times \frac{1}{9} = 0 \end{equation}\]
Decision: Since \(P(\text{Airline B}) > P(\text{Airline A})\) and \(P(\text{Airline B}) > P(\text{Airline C})\), we classify this instance as Airline B.
For the eighth test row (Aircraft_Type = Boeing 777, Demand = Low, Weather_Conditions = Cloudy, Number_of_Stops = 1), substituting the corresponding conditional probabilities gives: \[\begin{equation} \text{Airline A} = 0 \times 1 \times \frac{1}{3} \times 0 \times \frac{1}{3} = 0 \end{equation}\]
\[\begin{equation} \text{Airline B} = \frac{1}{5} \times \frac{4}{5} \times \frac{1}{5} \times \frac{1}{5} \times \frac{5}{9} = \frac{4}{1125} \approx 0.0036 \end{equation}\]
\[\begin{equation} \text{Airline C} = 1 \times 0 \times 0 \times 1 \times \frac{1}{9} = 0 \end{equation}\]
Decision: Since \(P(\text{Airline B}) > P(\text{Airline A})\) and \(P(\text{Airline B}) > P(\text{Airline C})\), we classify this instance as Airline B.
For the ninth test row (Aircraft_Type = Airbus A320, Demand = Medium, Weather_Conditions = Clear, Number_of_Stops = 1), substituting the corresponding conditional probabilities gives: \[\begin{equation} \text{Airline A} = \frac{2}{3} \times 0 \times 0 \times 0 \times \frac{1}{3} = 0 \end{equation}\]
\[\begin{equation} \text{Airline B} = 0 \times \frac{1}{5} \times 0 \times \frac{1}{5} \times \frac{5}{9} = 0 \end{equation}\]
\[\begin{equation} \text{Airline C} = 0 \times 1 \times 1 \times 1 \times \frac{1}{9} = 0 \end{equation}\]
Decision: Since \(P(\text{Airline A}) = P(\text{Airline B}) = P(\text{Airline C}) = 0\), the comparison is inconclusive. Hence, we use the Laplace smoothing formula
Airline A: \[ \begin{align*} &= \left( \frac{2 + 2}{3+2(9)} \times \frac{0 + 2}{3+2(9)}\times \frac{0 + 2}{3+2(9)} \times \frac{0+2}{3+2(9)} \right) \times \frac{1}{3} \\ &= \left( \frac{4}{21} \times \frac{2}{21} \times \frac{2}{21} \times \frac{2}{21} \right) \times \frac{1}{3} \\ &= \left( \frac{32}{194,481} \right) \times \frac{1}{3} \\ &= \frac{32}{583,443} \\ &\approx 0.000055 \end{align*} \]
Airline B: \[ \begin{align*} &= \left( \frac{0+2}{5+2(9)} \times \frac{1+2}{5+2(9)}\times \frac{0+2}{5+2(9)} \times \frac{1+2}{5+2(9)}\right) \times \frac{5}{9} \\ &= \left( \frac{2}{23} \times \frac{3}{23} \times \frac{2}{23} \times \frac{3}{23} \right) \times \frac{5}{9} \\ &= \left( \frac{36}{279,841} \right) \times \frac{5}{9} \\ &= \frac{180}{2,518,569} \\ &= 0.000072 \end{align*} \]
Airline C: \[ \begin{align*} &= \left( \frac{0 + 2}{1+2(9)} \times \frac{1 + 2}{1 + 2(9)} \times \frac{1+2}{1+2(9)} \times \frac{1+2}{1+2(9)} \right) \times \frac{1}{9} \\ &= \left( \frac{2}{19} \times \frac{3}{19} \times \frac{3}{19} \times \frac{3}{19} \right) \times \frac{1}{9} \\ &= \left( \frac{54}{130,321} \right) \times \frac{1}{9} \\ &= \frac{54}{1,172,889} \\ &\approx 0.000046 \end{align*} \]
Now we can compare these probabilities: Airline B (0.000072) > Airline A (0.000055) > Airline C (0.000046). Therefore, we predict this instance to be Airline B.
For the tenth test row (Aircraft_Type = Airbus A380, Demand = Low, Weather_Conditions = Rain, Number_of_Stops = 1), substituting the corresponding conditional probabilities gives: \[\begin{equation} \text{Airline A} = \frac{1}{3} \times 1 \times \frac{2}{3} \times 0 \times \frac{1}{3} = 0 \end{equation}\]
\[\begin{equation} \text{Airline B} = \frac{2}{5} \times \frac{4}{5} \times \frac{2}{5} \times \frac{1}{5} \times \frac{5}{9} = \frac{80}{5625} \approx 0.014 \end{equation}\]
\[\begin{equation} \text{Airline C} = 0 \times 0 \times 0 \times 1 \times \frac{1}{9} = 0 \end{equation}\]
Decision: Since \(P(\text{Airline B}) > P(\text{Airline A})\) and \(P(\text{Airline B}) > P(\text{Airline C})\), we classify this instance as Airline B.
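Putting the pieces together, the same manual scorer can be applied to all ten test rows at once; rows whose scores are all zero are exactly the ones that required Laplace smoothing above. This sketch assumes the toy_train data frame and nb_score helper from the earlier snippets, and toy_test is introduced here for illustration:
toy_test <- data.frame(
  Aircraft_Type = c("Boeing 777", "Airbus A320", "Airbus A380", "Airbus A320", "Airbus A380",
                    "Boeing 737", "Boeing 777", "Boeing 777", "Airbus A320", "Airbus A380"),
  Demand = c("Low", "Low", "High", "Low", "High", "Low", "Low", "Low", "Medium", "Low"),
  Weather_Conditions = c("Rain", "Snow", "Clear", "Rain", "Clear", "Snow", "Snow", "Cloudy", "Clear", "Rain"),
  Number_of_Stops = c(3, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)
predicted <- sapply(seq_len(nrow(toy_test)), function(i) {
  s <- nb_score(toy_test[i, ], toy_train)
  if (all(s == 0)) NA_character_ else names(which.max(s))  # NA: needs smoothing
})
cbind(toy_test, Predicted_Airline = predicted)
The non-missing predictions agree with the hand calculations above.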
| Aircraft_Type | Demand | Weather_Conditions | Number_of_Stops | Airline |
|---|---|---|---|---|
| Boeing 777 | Low | Snow | 1 | Airline B |
| Boeing 737 | Low | Rain | 0 | Airline B |
| Boeing 777 | Low | Rain | 1 | Airline B |
| Boeing 787 | Low | Clear | 0 | Airline A |
| Boeing 737 | Low | Cloudy | 0 | Airline B |
| Airbus A320 | Low | Cloudy | 0 | Airline B |
| Airbus A320 | Low | Rain | 0 | Airline B |
| Boeing 787 | Medium | Clear | 0 | Airline B |
| Airbus A320 | Low | Clear | 0 | Airline A |
| Boeing 777 | Low | Cloudy | 0 | Airline A |
library(dlookr)
## Warning: package 'dlookr' was built under R version 4.4.0
## Registered S3 methods overwritten by 'dlookr':
## method from
## plot.transform scales
## print.transform scales
##
## Attaching package: 'dlookr'
## The following object is masked from 'package:tidyr':
##
## extract
## The following object is masked from 'package:base':
##
## transform
library(tidyverse)
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(caTools)
library(e1071)
##
## Attaching package: 'e1071'
## The following objects are masked from 'package:dlookr':
##
## kurtosis, skewness
train <- read.csv("C:/Users/DELL/Desktop/2024_Projects/Healthcare prediction/Full linear regression/train.csv")
test <- read.csv("C:/Users/DELL/Desktop/2024_Projects/Healthcare prediction/Full linear regression/test.csv")
train_subset <- sample_n(train, 10) %>% select(Aircraft_Type, Demand, Weather_Conditions, Number_of_Stops, Airline )
head(test, 5)
## Flight_ID Airline Departure_City Arrival_City Distance Departure_Time
## 1 F45001 Airline B Davidstad Moorebury 3096 18:43
## 2 F45002 Airline A Lake Tyler Camachoberg 8760 1:16
## 3 F45003 Airline C New Carol West Ryanfurt 6365 12:17
## 4 F45004 Airline A Richardsonshire Jordanburgh 7836 0:11
## 5 F45005 Airline B Tiffanytown Morganstad 1129 3:22
## Arrival_Time Duration Aircraft_Type Number_of_Stops Day_of_Week
## 1 0:14 5.52 Boeing 737 1 Saturday
## 2 13:04 11.80 Airbus A380 1 Thursday
## 3 21:52 9.59 Boeing 777 1 Sunday
## 4 10:23 10.21 Airbus A380 0 Thursday
## 5 5:13 1.86 Airbus A320 1 Saturday
## Month_of_Travel Holiday_Season Demand Weather_Conditions Passenger_Count
## 1 August Summer Medium Clear 110
## 2 April None High Clear 295
## 3 January None Low Rain 223
## 4 March None Low Rain 223
## 5 August Summer High Cloudy 145
## Promotion_Type Fuel_Price
## 1 None 0.95
## 2 Discount 1.05
## 3 Discount 0.63
## 4 None 0.88
## 5 Special Offer 1.11
head(train, 5)
## Flight_ID Airline Departure_City Arrival_City Distance Departure_Time
## 1 F1 Airline B Greenshire 8286 8:23
## 2 F2 Airline C Leonardland New Stephen 2942 20:28
## 3 F3 Airline B South Dylanville Port Ambermouth 2468 11:30
## 4 F4 Blakefort Crosbyberg 3145 20:24
## 5 F5 Airline B Michaelport Onealborough 5558 21:59
## Arrival_Time Duration Aircraft_Type Number_of_Stops Day_of_Week
## 1 20:19 11.94 Boeing 787 0 Wednesday
## 2 1:45 5.29 Airbus A320 0 Wednesday
## 3 15:54 4.41 Boeing 787 1 Sunday
## 4 1:21 4.96 Boeing 787 0 Sunday
## 5 6:04 8.09 Boeing 737 1 Thursday
## Month_of_Travel Holiday_Season Demand Weather_Conditions Passenger_Count
## 1 December Summer Low Rain 240
## 2 March Spring Low Rain 107
## 3 September Summer High Cloudy 131
## 4 February Fall Low Cloudy 170
## 5 January None Clear 181
## Promotion_Type Fuel_Price Flight_Price
## 1 Special Offer 0.91 643.93
## 2 None 1.08 423.13
## 3 0.52 442.17
## 4 Discount 0.71 394.42
## 5 None 1.09 804.35
emp <-train%>%filter(Airline=="" ) %>% count() %>% summarise(Empty = (n/length(train$Airline))*100)
nempt <-train%>%filter(Airline!="" )%>% count() %>% summarise(Not_Empty = (n/length(train$Airline))*100)
cbind(emp, nempt) %>% pivot_longer(cols = everything())
## # A tibble: 2 × 2
## name value
## <chr> <dbl>
## 1 Empty 7.94
## 2 Not_Empty 92.1
plot_na_pareto(test)
plot_na_pareto(train)
emp_c <-test%>%filter(Month_of_Travel=="" ) %>% count() %>% summarise(Empty = (n/length(test$Month_of_Travel))*100)
nempt_c <-test%>%filter(Month_of_Travel!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Month_of_Travel))*100)
cbind(emp_c, nempt_c) %>% pivot_longer(cols = everything())
## # A tibble: 2 × 2
## name value
## <chr> <dbl>
## 1 Empty 0.68
## 2 Not_Empty 99.3
empp <-test%>%filter(Day_of_Week=="" ) %>% count() %>% summarise(Empty = (n/length(test$Day_of_Week))*100)
nemptp <-test%>%filter(Day_of_Week!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Day_of_Week))*100)
cbind(empp, nemptp) %>% pivot_longer(cols = everything())
## # A tibble: 2 × 2
## name value
## <chr> <dbl>
## 1 Empty 0.5
## 2 Not_Empty 99.5
emp_de <-test%>%filter(Demand=="" ) %>% count() %>% summarise(Empty = (n/length(test$Demand))*100)
nempt_de <-test%>%filter(Demand!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Demand))*100)
cbind(emp_de, nempt_de) %>% pivot_longer(cols = everything())
## # A tibble: 2 × 2
## name value
## <chr> <dbl>
## 1 Empty 0.68
## 2 Not_Empty 99.3
emp_hs <-test%>%filter(Holiday_Season=="" ) %>% count() %>% summarise(Empty = (n/length(test$Holiday_Season))*100)
nempt_hs <-test%>%filter(Holiday_Season!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Holiday_Season))*100)
cbind(emp_hs, nempt_hs) %>% pivot_longer(cols = everything())
## # A tibble: 2 × 2
## name value
## <chr> <dbl>
## 1 Empty 0
## 2 Not_Empty 100
emp_h <-test%>%filter(Weather_Conditions=="" ) %>% count() %>% summarise(Empty = (n/length(test$Weather_Conditions))*100)
nempt_h <-test%>%filter(Weather_Conditions!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Weather_Conditions))*100)
cbind(emp_h, nempt_h) %>% pivot_longer(cols = everything())
## # A tibble: 2 × 2
## name value
## <chr> <dbl>
## 1 Empty 0.98
## 2 Not_Empty 99.0
emp_hsf <-test%>%filter(Promotion_Type=="" ) %>% count() %>% summarise(Empty = (n/length(test$Promotion_Type))*100)
nempt_hsf <-test%>%filter(Promotion_Type!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Promotion_Type))*100)
cbind(emp_hsf, nempt_hsf) %>% pivot_longer(cols = everything())
## # A tibble: 2 × 2
## name value
## <chr> <dbl>
## 1 Empty 0.98
## 2 Not_Empty 99.0
emp_hsp <-test%>%filter(Number_of_Stops=="" ) %>% count() %>% summarise(Empty = (n/length(test$Number_of_Stops))*100)
nempt_hsp <-test%>%filter(Number_of_Stops!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Number_of_Stops))*100)
cbind(emp_hsp, nempt_hsp) %>% pivot_longer(cols = everything())
## # A tibble: 2 × 2
## name value
## <chr> <dbl>
## 1 Empty 0
## 2 Not_Empty 100
emp_hspq <-test%>%filter(Aircraft_Type=="" ) %>% count() %>% summarise(Empty = (n/length(test$Aircraft_Type))*100)
nempt_hspq <-test%>%filter(Aircraft_Type!="" )%>% count() %>% summarise(Not_Empty = (n/length(test$Aircraft_Type))*100)
cbind(emp_hspq, nempt_hspq) %>% pivot_longer(cols = everything())
## # A tibble: 2 × 2
## name value
## <chr> <dbl>
## 1 Empty 0.16
## 2 Not_Empty 99.8
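The blocks above compute the same quantity column by column. For reference, a more compact equivalent using dplyr::across() returns the percentage of empty strings in every character column of the test set in a single call:
test %>%
  summarise(across(where(is.character), ~ mean(.x == "") * 100)) %>%
  pivot_longer(everything(), names_to = "column", values_to = "pct_empty")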
test_nempty <-test%>%filter(Airline!="" )
test_nempty <-test_nempty%>%filter(Day_of_Week!="" )
test_nempty <-test_nempty%>%filter(Departure_City!="" )
test_nempty <-test_nempty%>%filter(Arrival_City!="" )
test_nempty <-test_nempty%>%filter(Aircraft_Type!="" )
test_nempty <-test_nempty%>%filter(Promotion_Type!="" )
test_nempty <-test_nempty%>%filter(Number_of_Stops!="" )
test_nempty <-test_nempty%>%filter(Weather_Conditions!="" )
test_nempty <-test_nempty%>%filter(Demand!="" )
test_nempty <-test_nempty%>%filter(Holiday_Season!="" )
test_nempty <-test_nempty%>%filter(Month_of_Travel!="" )
head(test_nempty,5)
## Flight_ID Airline Departure_City Arrival_City Distance Departure_Time
## 1 F45001 Airline B Davidstad Moorebury 3096 18:43
## 2 F45002 Airline A Lake Tyler Camachoberg 8760 1:16
## 3 F45003 Airline C New Carol West Ryanfurt 6365 12:17
## 4 F45004 Airline A Richardsonshire Jordanburgh 7836 0:11
## 5 F45005 Airline B Tiffanytown Morganstad 1129 3:22
## Arrival_Time Duration Aircraft_Type Number_of_Stops Day_of_Week
## 1 0:14 5.52 Boeing 737 1 Saturday
## 2 13:04 11.80 Airbus A380 1 Thursday
## 3 21:52 9.59 Boeing 777 1 Sunday
## 4 10:23 10.21 Airbus A380 0 Thursday
## 5 5:13 1.86 Airbus A320 1 Saturday
## Month_of_Travel Holiday_Season Demand Weather_Conditions Passenger_Count
## 1 August Summer Medium Clear 110
## 2 April None High Clear 295
## 3 January None Low Rain 223
## 4 March None Low Rain 223
## 5 August Summer High Cloudy 145
## Promotion_Type Fuel_Price
## 1 None 0.95
## 2 Discount 1.05
## 3 Discount 0.63
## 4 None 0.88
## 5 Special Offer 1.11
test_nempty <-test_nempty%>%drop_na()
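The chain of filters above can also be written as one call with dplyr::if_all(), keeping only the rows in which none of the listed columns is an empty string; this is a compact equivalent of the same cleaning step:
test_nempty <- test %>%
  filter(if_all(c(Airline, Day_of_Week, Departure_City, Arrival_City, Aircraft_Type,
                  Promotion_Type, Number_of_Stops, Weather_Conditions, Demand,
                  Holiday_Season, Month_of_Travel),
                ~ .x != "")) %>%
  drop_na()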
Airline <- test_nempty %>% select(Airline, Flight_ID)%>%group_by(Airline)%>%count() %>%summarise(percentage = (n / length(test_nempty$Flight_ID)) * 100)
# Plot the bar plot with data labels as percentages
ggplot(Airline, aes(x = Airline, y = percentage)) +
geom_bar(stat = "identity", fill = "red") + geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, size=7.5) + # Use geom_text_repel instead
labs(y = " ",x =" ", title = "Distribution of Airlines:This result shows data balancing") +
theme_minimal()+
theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1))
# Calculate percentage of each category
Day_of_Week <- test_nempty %>%
group_by(Day_of_Week, Airline) %>%
count() %>%
summarise(percentage = (n / nrow(test_nempty)) * 100)
## `summarise()` has grouped output by 'Day_of_Week'. You can override using the
## `.groups` argument.
# Plot the bar plot with data labels as percentages
ggplot(Day_of_Week, aes(x = Day_of_Week, y = percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +
labs(y = " ",x =" ", title = "Distribution of Airlines by Day of the Week") +
theme_minimal()+
theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))
# Calculate percentage of each category
Aircraft_Type <- test_nempty %>%
group_by(Aircraft_Type, Airline) %>%
count() %>%
summarise(percentage = (n / nrow(test_nempty)) * 100)
## `summarise()` has grouped output by 'Aircraft_Type'. You can override using the
## `.groups` argument.
# Plot the bar plot with data labels as percentages
ggplot(Aircraft_Type, aes(x = Aircraft_Type, y = percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +
labs(y = " ",x =" ", title = "Distribution of Airlines by Aircraft_Type")+
theme_minimal()+
theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))
# Calculate percentage of each category
Promotion_Type <- test_nempty %>%
group_by(Promotion_Type, Airline) %>%
count() %>%
summarise(percentage = (n / nrow(test_nempty)) * 100)
## `summarise()` has grouped output by 'Promotion_Type'. You can override using
## the `.groups` argument.
# Plot the bar plot with data labels as percentages
ggplot(Promotion_Type, aes(x = Promotion_Type, y = percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +
labs(y = " ",x =" ", title = "Distribution of Airlines by Promotion_Type") +
theme_minimal()+
theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))
# Calculate percentage of each category
Number_of_Stops <- test_nempty %>%
group_by(Number_of_Stops, Airline) %>%
count() %>%
summarise(percentage = (n / nrow(test_nempty)) * 100)
## `summarise()` has grouped output by 'Number_of_Stops'. You can override using
## the `.groups` argument.
# Plot the bar plot with data labels as percentages
ggplot(Number_of_Stops, aes(x = Number_of_Stops, y = percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +
labs(y = " ",x =" ", title = "Distribution of Airlines by Number_of_Stops")+
theme_minimal()+
theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))
# Calculate percentage of each category
Weather_Conditions <- test_nempty %>%
group_by(Weather_Conditions, Airline) %>%
count() %>%
summarise(percentage = (n / nrow(test_nempty)) * 100)
## `summarise()` has grouped output by 'Weather_Conditions'. You can override
## using the `.groups` argument.
# Plot the bar plot with data labels as percentages
ggplot(Weather_Conditions, aes(x = Weather_Conditions, y = percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +
labs(y = " ",x =" ", title = "Distribution of Airlines by Weather_Conditions")+
theme_minimal()+
theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))
# Calculate percentage of each category
Demand <- test_nempty %>%
group_by(Demand, Airline) %>%
count() %>%
summarise(percentage = (n / nrow(test_nempty)) * 100)
## `summarise()` has grouped output by 'Demand'. You can override using the
## `.groups` argument.
# Plot the bar plot with data labels as percentages
ggplot(Demand, aes(x = Demand, y = percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +
labs(y = " ",x =" ", title = "Distribution of Airlines by Demand") +
theme_minimal()+
theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))
# Calculate percentage of each category
Holiday_Season <- test_nempty %>%
group_by(Holiday_Season, Airline) %>%
count() %>%
summarise(percentage = (n / nrow(test_nempty)) * 100)
## `summarise()` has grouped output by 'Holiday_Season'. You can override using
## the `.groups` argument.
# Plot the bar plot with data labels as percentages
ggplot(Holiday_Season, aes(x = Holiday_Season, y = percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +
labs(y = " ",x =" ", title = "Distribution of Airlines by Holiday_Season") +
theme_minimal()+
theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))
# Calculate percentage of each category
Month_of_Travel <- test_nempty %>%
group_by(Month_of_Travel, Airline) %>%
count() %>%
summarise(percentage = (n / nrow(test_nempty)) * 100)
## `summarise()` has grouped output by 'Month_of_Travel'. You can override using
## the `.groups` argument.
# Plot the bar plot with data labels as percentages
ggplot(Month_of_Travel, aes(x = Month_of_Travel, y = percentage, fill = Airline)) +
geom_bar(stat = "identity", position = "dodge") +
geom_text(aes(label = paste0(round(percentage, 1), "%")), position = position_dodge(width = 0.9), vjust=-0.5, hjust=0.5, angle =90, size=7.5) +
labs(y = " ",x =" ", title = "Distribution of Airlines by Month of Travel") +
theme_minimal()+
theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 18, face = "bold"))
test_nempty$Airline <- as.factor(test_nempty$Airline)
test_nempty$Number_of_Stops <- as.factor(test_nempty$Number_of_Stops)
sample <- sample.split(test_nempty$Flight_ID, SplitRatio = 0.8)
test_data <- subset(test_nempty, sample == TRUE)    # 80% of the rows, used below for evaluation
train_data <- subset(test_nempty, sample == FALSE)  # remaining 20%, used to fit the models
NB_AT <-naiveBayes(Airline~ Aircraft_Type, train_data)
NB_NS <-naiveBayes(Airline~ Number_of_Stops, train_data)
NB_DW <-naiveBayes(Airline~ Day_of_Week, train_data)
NB_MT <-naiveBayes(Airline~ Month_of_Travel, train_data)
NB_HS <-naiveBayes(Airline~ Holiday_Season, train_data)
NB_DE <-naiveBayes(Airline~ Demand, train_data)
NB_WC <-naiveBayes(Airline~ Weather_Conditions, train_data)
NB_PT <-naiveBayes(Airline~ Promotion_Type, train_data)
# Let's take a look at the results
NB_AT
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Airline A Airline B Airline C
## 0.3294798 0.3364162 0.3341040
##
## Conditional probabilities:
## Aircraft_Type
## Y Airbus A320 Airbus A380 Boeing 737 Boeing 777 Boeing 787
## Airline A 0.1964912 0.1894737 0.2421053 0.1824561 0.1894737
## Airline B 0.1752577 0.1855670 0.1958763 0.2268041 0.2164948
## Airline C 0.2006920 0.2076125 0.2387543 0.1730104 0.1799308
NB_NS
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Airline A Airline B Airline C
## 0.3294798 0.3364162 0.3341040
##
## Conditional probabilities:
## Number_of_Stops
## Y 0 1 3
## Airline A 0.40000000 0.52982456 0.07017544
## Airline B 0.44329897 0.50515464 0.05154639
## Airline C 0.48096886 0.46712803 0.05190311
NB_DW
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Airline A Airline B Airline C
## 0.3294798 0.3364162 0.3341040
##
## Conditional probabilities:
## Day_of_Week
## Y Friday Monday Saturday Sunday Thursday Tuesday
## Airline A 0.1508772 0.1333333 0.1263158 0.1894737 0.1438596 0.1368421
## Airline B 0.1271478 0.1683849 0.1512027 0.1512027 0.1305842 0.1340206
## Airline C 0.1349481 0.1176471 0.1764706 0.1384083 0.1487889 0.1522491
## Day_of_Week
## Y Wednesday
## Airline A 0.1192982
## Airline B 0.1374570
## Airline C 0.1314879
NB_MT
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Airline A Airline B Airline C
## 0.3294798 0.3364162 0.3341040
##
## Conditional probabilities:
## Month_of_Travel
## Y April August December February January July
## Airline A 0.07368421 0.08771930 0.08070175 0.05263158 0.08070175 0.09122807
## Airline B 0.08591065 0.07216495 0.05841924 0.08591065 0.09621993 0.07216495
## Airline C 0.09688581 0.07266436 0.08650519 0.06920415 0.09342561 0.08650519
## Month_of_Travel
## Y June March May November October September
## Airline A 0.09473684 0.11578947 0.12631579 0.07719298 0.07719298 0.04210526
## Airline B 0.09965636 0.11340206 0.09621993 0.08591065 0.08591065 0.04810997
## Airline C 0.06228374 0.05536332 0.09688581 0.09688581 0.07958478 0.10380623
NB_HS
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Airline A Airline B Airline C
## 0.3294798 0.3364162 0.3341040
##
## Conditional probabilities:
## Holiday_Season
## Y Fall None Spring Summer Winter
## Airline A 0.2000000 0.2140351 0.1719298 0.2105263 0.2035088
## Airline B 0.2164948 0.1924399 0.1890034 0.2199313 0.1821306
## Airline C 0.2006920 0.2006920 0.1937716 0.2422145 0.1626298
NB_DE
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Airline A Airline B Airline C
## 0.3294798 0.3364162 0.3341040
##
## Conditional probabilities:
## Demand
## Y High Low Medium
## Airline A 0.1438596 0.6350877 0.2210526
## Airline B 0.1408935 0.6254296 0.2336770
## Airline C 0.1487889 0.6262976 0.2249135
NB_WC
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Airline A Airline B Airline C
## 0.3294798 0.3364162 0.3341040
##
## Conditional probabilities:
## Weather_Conditions
## Y Clear Cloudy Rain Snow
## Airline A 0.2596491 0.2631579 0.2561404 0.2210526
## Airline B 0.3092784 0.2336770 0.2027491 0.2542955
## Airline C 0.2664360 0.2387543 0.2456747 0.2491349
NB_PT
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Airline A Airline B Airline C
## 0.3294798 0.3364162 0.3341040
##
## Conditional probabilities:
## Promotion_Type
## Y Discount None Special Offer
## Airline A 0.2982456 0.3228070 0.3789474
## Airline B 0.2886598 0.3780069 0.3333333
## Airline C 0.3529412 0.2906574 0.3564014
# Let's summarise the results
summary(NB_AT)
summary(NB_NS)
summary(NB_DW)
summary(NB_MT)
summary(NB_HS)
summary(NB_DE)
summary(NB_WC)
summary(NB_PT)
pre_NB_AT <- predict(NB_AT, newdata = test_data, type = "class")
pre_NB_NS <- predict(NB_NS, newdata = test_data, type = "class")
pre_NB_DW <- predict(NB_DW, newdata = test_data, type = "class")
pre_NB_MT <- predict(NB_MT, newdata = test_data, type = "class")
pre_NB_HS <- predict(NB_HS, newdata = test_data, type = "class")
pre_NB_DE <- predict(NB_DE, newdata = test_data, type = "class")
pre_NB_WC <- predict(NB_WC, newdata = test_data, type = "class")
pre_NB_PT <- predict(NB_PT, newdata = test_data, type = "class")
eva_NB_AT <- confusionMatrix(as.factor(test_data$Airline), pre_NB_AT)
eva_NB_NS <- confusionMatrix(as.factor(test_data$Airline), pre_NB_NS)
eva_NB_DW <- confusionMatrix(as.factor(test_data$Airline), pre_NB_DW)
eva_NB_MT <- confusionMatrix(as.factor(test_data$Airline), pre_NB_MT)
eva_NB_HS <- confusionMatrix(as.factor(test_data$Airline), pre_NB_HS)
eva_NB_DE <- confusionMatrix(as.factor(test_data$Airline), pre_NB_DE)
eva_NB_WC <- confusionMatrix(as.factor(test_data$Airline), pre_NB_WC)
eva_NB_PT <- confusionMatrix(as.factor(test_data$Airline), pre_NB_PT)
### Results
ca<-eva_NB_AT$byClass
da<-eva_NB_NS$byClass
ea<-eva_NB_DW$byClass
fa<-eva_NB_MT$byClass
ga<-eva_NB_HS$byClass
ha<-eva_NB_DE$byClass
ia<-eva_NB_WC$byClass
ja<-eva_NB_PT$byClass
da <-rbind(ca,da,ea,fa,ga,ha,ia,ja)
da <- as.data.frame(da)
install.packages("gt")
## Warning: package 'gt' is in use and will not be installed
library(gt)
##############
c<-eva_NB_AT$overall
d<-eva_NB_NS$overall
e<-eva_NB_DW$overall
f<-eva_NB_MT$overall
g<-eva_NB_HS$overall
h<-eva_NB_DE$overall
i<-eva_NB_WC$overall
j<-eva_NB_PT$overall
d <-rbind(c,d,e,f,g,h,i,j)
nam <- c("NB_AT","NB_NS", "NB_DW","NB_MT","NB_HS","NB_DE","NB_WC","NB_PT")
rownames(d)<-nam
head(d)
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## NB_AT 0.3274515 -0.0062732412 0.3118148 0.3433806 0.4012149
## NB_NS 0.3222447 -0.0249573396 0.3066772 0.3381134 0.5620480
## NB_DW 0.3341047 0.0005871132 0.3183826 0.3501079 0.4298525
## NB_MT 0.3349725 0.0024501818 0.3192395 0.3509851 0.3335262
## NB_HS 0.3341047 -0.0019609845 0.3183826 0.3501079 0.4052647
## NB_DE 0.3277408 0.0041266293 0.3121003 0.3436732 0.8623084
## AccuracyPValue McnemarPValue
## NB_AT 1.0000000 1.004804e-33
## NB_NS 1.0000000 2.147008e-245
## NB_DW 1.0000000 1.937358e-14
## NB_MT 0.4347563 7.534563e-01
## NB_HS 1.0000000 1.980062e-31
## NB_DE 1.0000000 0.000000e+00
d <- as.data.frame(d)
d$Model <- nam
d_long <- pivot_longer(d, -Model, names_to = "Metric", values_to = "Value")
ggplot(d_long , aes(x = Metric, y= Value, fill=Model))+geom_bar(stat = "identity", position = "dodge", col="black") +
labs(y = " ",x =" ", title = "Performance metric of the eight models with \n difference features to predict type of Airline") +
theme_minimal()+
theme(plot.title = element_text(size=25), axis.text.y = element_blank(), axis.text.x = element_text(face="bold", color='black', size = 18, angle = 45, hjust=1), legend.text = element_text(size = 19, face = "bold"))
# Classifications based on multiple conditions
NB_DC_T <-naiveBayes(Airline~Number_of_Stops+Day_of_Week+Month_of_Travel+
Holiday_Season+Demand+Weather_Conditions+Promotion_Type , train_data)
pre_NB_DC_T <- predict(NB_DC_T, newdata = test_data, type = "class")
confusionMatrix(pre_NB_DC_T, test_data$Airline)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Airline A Airline B Airline C
## Airline A 355 331 368
## Airline B 367 353 368
## Airline C 460 434 421
##
## Overall Statistics
##
## Accuracy : 0.3266
## 95% CI : (0.311, 0.3425)
## No Information Rate : 0.3419
## P-Value [Acc > NIR] : 0.9727908
##
## Kappa : -0.0101
##
## Mcnemar's Test P-Value : 0.0005549
##
## Statistics by Class:
##
## Class: Airline A Class: Airline B Class: Airline C
## Sensitivity 0.3003 0.3157 0.3639
## Specificity 0.6927 0.6858 0.6113
## Pos Pred Value 0.3368 0.3244 0.3202
## Neg Pred Value 0.6558 0.6771 0.6564
## Prevalence 0.3419 0.3234 0.3347
## Detection Rate 0.1027 0.1021 0.1218
## Detection Prevalence 0.3049 0.3147 0.3804
## Balanced Accuracy 0.4965 0.5008 0.4876
The performance of the model in this study was observed to be lower than expected. However, it’s essential to contextualize this outcome within the broader scope of the study’s objectives. The primary aim of this research endeavor was not solely focused on achieving optimal predictive accuracy but rather to demonstrate the application of the Naive Bayes algorithm in the context of airline prediction.
It’s crucial to recognize that various machine learning algorithms exist, each with its unique strengths and weaknesses. While Naive Bayes is a valuable tool in certain scenarios, its performance may not always meet expectations, especially in complex prediction tasks like airline classification.
Nonetheless, the study’s findings offer valuable insights. One noteworthy aspect highlighted by this research is the implementation of Laplace smoothing to address instances where decision outcomes are inconclusive. Laplace smoothing, also known as additive smoothing, is a technique used to handle cases where certain combinations of feature values have zero probabilities in the training data. By introducing a small amount of pseudo-counts to all observed feature-value combinations, Laplace smoothing prevents zero probabilities and ensures more robust model predictions.
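For completeness, the same idea is available directly in e1071: naiveBayes() accepts a laplace argument that adds the chosen number of pseudo-counts to every feature/class cell. The sketch below re-fits the combined model from above with laplace = 1; the value 1 is an arbitrary choice for illustration:
NB_DC_lap <- naiveBayes(Airline ~ Number_of_Stops + Day_of_Week + Month_of_Travel +
                          Holiday_Season + Demand + Weather_Conditions + Promotion_Type,
                        data = train_data, laplace = 1)
pre_NB_DC_lap <- predict(NB_DC_lap, newdata = test_data, type = "class")
confusionMatrix(pre_NB_DC_lap, test_data$Airline)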
In summary, while the observed performance of the Naive Bayes model may be modest, the study provides valuable methodological insights and highlights the importance of considering alternative approaches, such as Laplace smoothing, to enhance the robustness and reliability of predictive models in real-world applications.