KOMPUTASI STATISTIKA
~ Final Semester Exam ~
Contact:
• yosia.yosia@student.matanauniversity.ac.id
• yyosia
• RPubs: https://rpubs.com/yosia/
Supervised Learning
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd
# visualization
import seaborn as sns
import matplotlib.pyplot as plt
# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
Input Data
| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings/spouses aboard the Titanic | |
| parch | # of parents/children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |
Variable Notes
pclass: Socio-economic status: 1st = Upper, 2nd = Middle, 3rd = Lower
age: Age is fractional if less than 1; if the age is estimated, it is in the form xx.5
sibsp: The dataset defines family relations in this way: Sibling = brother, sister, stepbrother, stepsister; Spouse = husband, wife (mistresses and fiancés were ignored)
parch: The dataset defines family relations in this way: Parent = mother, father; Child = daughter, son, stepdaughter, stepson. Some children travelled only with a nanny, therefore parch=0 for them.
train_df = pd.read_csv('input/train.csv')
test_df = pd.read_csv('input/test.csv')
combine = [train_df, test_df]

print(train_df.columns.values)
## ['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
## 'Ticket' 'Fare' 'Cabin' 'Embarked']
train_df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 891 entries, 0 to 890
## Data columns (total 12 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 PassengerId 891 non-null int64
## 1 Survived 891 non-null int64
## 2 Pclass 891 non-null int64
## 3 Name 891 non-null object
## 4 Sex 891 non-null object
## 5 Age 714 non-null float64
## 6 SibSp 891 non-null int64
## 7 Parch 891 non-null int64
## 8 Ticket 891 non-null object
## 9 Fare 891 non-null float64
## 10 Cabin 204 non-null object
## 11 Embarked 889 non-null object
## dtypes: float64(2), int64(5), object(5)
## memory usage: 83.7+ KB
test_df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 418 entries, 0 to 417
## Data columns (total 11 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 PassengerId 418 non-null int64
## 1 Pclass 418 non-null int64
## 2 Name 418 non-null object
## 3 Sex 418 non-null object
## 4 Age 332 non-null float64
## 5 SibSp 418 non-null int64
## 6 Parch 418 non-null int64
## 7 Ticket 418 non-null object
## 8 Fare 417 non-null float64
## 9 Cabin 91 non-null object
## 10 Embarked 418 non-null object
## dtypes: float64(2), int64(4), object(5)
## memory usage: 36.0+ KB
train_df.describe()
## PassengerId Survived Pclass ... SibSp Parch Fare
## count 891.000000 891.000000 891.000000 ... 891.000000 891.000000 891.000000
## mean 446.000000 0.383838 2.308642 ... 0.523008 0.381594 32.204208
## std 257.353842 0.486592 0.836071 ... 1.102743 0.806057 49.693429
## min 1.000000 0.000000 1.000000 ... 0.000000 0.000000 0.000000
## 25% 223.500000 0.000000 2.000000 ... 0.000000 0.000000 7.910400
## 50% 446.000000 0.000000 3.000000 ... 0.000000 0.000000 14.454200
## 75% 668.500000 1.000000 3.000000 ... 1.000000 0.000000 31.000000
## max 891.000000 1.000000 3.000000 ... 8.000000 6.000000 512.329200
##
## [8 rows x 7 columns]
train_df.describe(include=['O'])
## Name Sex Ticket Cabin Embarked
## count 891 891 891 204 889
## unique 891 2 681 147 3
## top Braund, Mr. Owen Harris male 347082 B96 B98 S
## freq 1 577 7 4 644
• Names are unique across the dataset (count=unique=891).
• Sex takes two possible values, with 65% male (top=male, freq=577/count=891).
• Cabin values have several duplicates across the sample; alternatively, several passengers shared a cabin.
• Embarked takes three possible values; port S was used by most passengers (top=S).
• The Ticket feature has a high ratio (22%) of duplicate values (unique=681).
Analyze by pivoting features
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
## Pclass Survived
## 0 1 0.629630
## 1 2 0.472826
## 2 3 0.242363
train_df[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)## Sex Survived
## 0 female 0.742038
## 1 male 0.188908
train_df[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)## SibSp Survived
## 1 1 0.535885
## 2 2 0.464286
## 0 0 0.345395
## 3 3 0.250000
## 4 4 0.166667
## 5 5 0.000000
## 6 8 0.000000
train_df[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)## Parch Survived
## 3 3 0.600000
## 1 1 0.550847
## 2 2 0.500000
## 0 0 0.343658
## 5 5 0.200000
## 4 4 0.000000
## 6 6 0.000000
Analyze by visualizing data
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=20)

# grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend();

# grid = sns.FacetGrid(train_df, col='Embarked')
grid = sns.FacetGrid(train_df, row='Embarked', height=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()

# grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', height=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
grid.add_legend()
Wrangle data
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)## Before (891, 12) (418, 11) (891, 12) (418, 11)
train_df = train_df.drop(['Ticket', 'Cabin'], axis=1)
test_df = test_df.drop(['Ticket', 'Cabin'], axis=1)
combine = [train_df, test_df]
"After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape## ('After', (891, 10), (418, 9), (891, 10), (418, 9))
Observation
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(r' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train_df['Title'], train_df['Sex'])
## Sex female male
## Title
## Capt 0 1
## Col 0 2
## Countess 1 0
## Don 0 1
## Dr 1 6
## Jonkheer 0 1
## Lady 1 0
## Major 0 2
## Master 0 40
## Miss 182 0
## Mlle 2 0
## Mme 1 0
## Mr 0 517
## Mrs 125 0
## Ms 1 0
## Rev 0 6
## Sir 0 1
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
                                                 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
## Title Survived
## 0 Master 0.575000
## 1 Miss 0.702703
## 2 Mr 0.156673
## 3 Mrs 0.793651
## 4 Rare 0.347826
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()
## PassengerId Survived Pclass ... Fare Embarked Title
## 0 1 0 3 ... 7.2500 S 1
## 1 2 1 1 ... 71.2833 C 3
## 2 3 1 3 ... 7.9250 S 2
## 3 4 1 1 ... 53.1000 S 3
## 4 5 0 3 ... 8.0500 S 1
##
## [5 rows x 11 columns]
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape
## ((891, 9), (418, 9))
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)

train_df.head()
## Survived Pclass Sex Age SibSp Parch Fare Embarked Title
## 0 0 3 0 22.0 1 0 7.2500 S 1
## 1 1 1 1 38.0 1 0 71.2833 C 3
## 2 1 3 1 26.0 0 0 7.9250 S 2
## 3 1 1 1 35.0 1 0 53.1000 S 3
## 4 0 3 0 35.0 0 0 8.0500 S 1
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Gender')
grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', height=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()

guess_ages = np.zeros((2,3))
for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) &
                               (dataset['Pclass'] == j+1)]['Age'].dropna()
            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)
            age_guess = guess_df.median()
            # Convert random age float to nearest .5 age
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),
                        'Age'] = guess_ages[i,j]
    dataset['Age'] = dataset['Age'].astype(int)

train_df.head()
## Survived Pclass Sex Age SibSp Parch Fare Embarked Title
## 0 0 3 0 22 1 0 7.2500 S 1
## 1 1 1 1 38 1 0 71.2833 C 3
## 2 1 3 1 26 0 0 7.9250 S 2
## 3 1 1 1 35 1 0 53.1000 S 3
## 4 0 3 0 35 0 0 8.0500 S 1
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
## AgeBand Survived
## 0 (-0.08, 16.0] 0.550000
## 1 (16.0, 32.0] 0.337374
## 2 (32.0, 48.0] 0.412037
## 3 (48.0, 64.0] 0.434783
## 4 (64.0, 80.0] 0.090909
for dataset in combine:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    # note: ages above 64 are only displayed here, never assigned a band,
    # so those rows keep their original integer ages (shown below)
    dataset.loc[ dataset['Age'] > 64, 'Age']
## 33 66
## 54 65
## 96 71
## 116 70
## 280 65
## 456 65
## 493 71
## 630 80
## 672 70
## 745 70
## 851 74
## Name: Age, dtype: int32
## 81 67
## 96 76
## Name: Age, dtype: int32
train_df.head()
## Survived Pclass Sex Age ... Fare Embarked Title AgeBand
## 0 0 3 0 1 ... 7.2500 S 1 (16.0, 32.0]
## 1 1 1 1 2 ... 71.2833 C 3 (32.0, 48.0]
## 2 1 3 1 1 ... 7.9250 S 2 (16.0, 32.0]
## 3 1 1 1 2 ... 53.1000 S 3 (32.0, 48.0]
## 4 0 3 0 2 ... 8.0500 S 1 (32.0, 48.0]
##
## [5 rows x 10 columns]
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]
train_df.head()
## Survived Pclass Sex Age SibSp Parch Fare Embarked Title
## 0 0 3 0 1 1 0 7.2500 S 1
## 1 1 1 1 2 1 0 71.2833 C 3
## 2 1 3 1 1 0 0 7.9250 S 2
## 3 1 1 1 2 1 0 53.1000 S 3
## 4 0 3 0 2 0 0 8.0500 S 1
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
## FamilySize Survived
## 3 4 0.724138
## 2 3 0.578431
## 1 2 0.552795
## 6 7 0.333333
## 0 1 0.303538
## 4 5 0.200000
## 5 6 0.136364
## 7 8 0.000000
## 8 11 0.000000
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
## IsAlone Survived
## 0 0 0.505650
## 1 1 0.303538
train_df = train_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
test_df = test_df.drop(['Parch', 'SibSp', 'FamilySize'], axis=1)
combine = [train_df, test_df]
train_df.head()
## Survived Pclass Sex Age Fare Embarked Title IsAlone
## 0 0 3 0 1 7.2500 S 1 0
## 1 1 1 1 2 71.2833 C 3 0
## 2 1 3 1 1 7.9250 S 2 1
## 3 1 1 1 2 53.1000 S 3 0
## 4 0 3 0 2 8.0500 S 1 1
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)
## Age*Class Age Pclass
## 0 3 1 3
## 1 2 2 1
## 2 3 1 3
## 3 2 2 1
## 4 6 2 3
## 5 3 1 3
## 6 3 3 1
## 7 0 0 3
## 8 3 1 3
## 9 0 0 2
freq_port = train_df.Embarked.dropna().mode()[0]
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
## Embarked Survived
## 0 C 0.553571
## 1 Q 0.389610
## 2 S 0.339009
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)

train_df.head()
## Survived Pclass Sex Age Fare Embarked Title IsAlone Age*Class
## 0 0 3 0 1 7.2500 0 1 0 3
## 1 1 1 1 2 71.2833 1 3 0 2
## 2 1 3 1 1 7.9250 0 2 1 3
## 3 1 1 1 2 53.1000 0 3 0 2
## 4 0 3 0 2 8.0500 0 1 1 6
test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
test_df.head()
## PassengerId Pclass Sex Age Fare Embarked Title IsAlone Age*Class
## 0 892 3 0 2 7.8292 2 1 1 6
## 1 893 3 1 2 7.0000 0 3 0 6
## 2 894 2 0 3 9.6875 2 1 1 6
## 3 895 3 0 1 8.6625 0 1 1 3
## 4 896 3 1 1 12.2875 0 3 0 3
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)
## FareBand Survived
## 0 (-0.001, 7.91] 0.197309
## 1 (7.91, 14.454] 0.303571
## 2 (14.454, 31.0] 0.454955
## 3 (31.0, 512.329] 0.581081
for dataset in combine:
dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
dataset['Fare'] = dataset['Fare'].astype(int)
train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]
train_df.head(10)
## Survived Pclass Sex Age Fare Embarked Title IsAlone Age*Class
## 0 0 3 0 1 0 0 1 0 3
## 1 1 1 1 2 3 1 3 0 2
## 2 1 3 1 1 1 0 2 1 3
## 3 1 1 1 2 3 0 3 0 2
## 4 0 3 0 2 1 0 1 1 6
## 5 0 3 0 1 1 2 1 1 3
## 6 0 1 0 3 3 0 1 1 3
## 7 0 3 0 0 2 0 4 0 0
## 8 1 3 1 1 1 0 3 0 3
## 9 1 2 1 0 2 1 3 0 0
test_df.head(10)
## PassengerId Pclass Sex Age Fare Embarked Title IsAlone Age*Class
## 0 892 3 0 2 0 2 1 1 6
## 1 893 3 1 2 0 0 3 0 6
## 2 894 2 0 3 1 2 1 1 6
## 3 895 3 0 1 1 0 1 1 3
## 4 896 3 1 1 1 0 3 0 3
## 5 897 3 0 0 1 0 1 1 0
## 6 898 3 1 1 0 2 2 1 3
## 7 899 2 0 1 2 0 1 0 2
## 8 900 3 1 1 0 1 3 1 3
## 9 901 3 0 1 2 0 1 0 3
Model, Predict and Solve
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
## ((891, 8), (891,), (418, 8))
Logistic Regression
Logistic regression is a useful model to run early in the workflow. It measures the relationship between the categorical dependent variable (the target) and one or more independent variables (the features) by estimating probabilities with a logistic function, which is the cumulative logistic distribution.
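In symbols, for a feature vector \(x\) the model estimates the survival probability by passing a linear combination of the features through the sigmoid:
\[ P(\text{Survived} = 1 \mid x) = \sigma(w^\top x + b) = \frac{1}{1 + e^{-(w^\top x + b)}} \]
The fitted weights \(w\) are the log-odds coefficients reported in the "Correlation" table below.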
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
## LogisticRegression()
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
## 80.36
coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=False)
## Feature Correlation
## 1 Sex 2.201619
## 5 Title 0.397888
## 2 Age 0.287011
## 4 Embarked 0.261473
## 6 IsAlone 0.126553
## 3 Fare -0.086655
## 7 Age*Class -0.311069
## 0 Pclass -0.750700
KNN or k-Nearest Neighbors
In pattern recognition, the k-Nearest Neighbors algorithm (k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its nearest neighbors: it is assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its single nearest neighbor.
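As a minimal sketch of that majority vote (not part of the original notebook; it assumes the X_train, Y_train, and X_test frames built above, and plain Euclidean distance):
import numpy as np

def knn_predict(x_new, X, y, k=3):
    # Euclidean distance from the new point to every training sample
    dists = np.linalg.norm(X - x_new, axis=1)
    # labels of the k closest training samples
    nearest_labels = y[np.argsort(dists)[:k]]
    # majority vote among those labels
    return np.bincount(nearest_labels).argmax()

# e.g. classify the first test passenger
knn_predict(X_test.values[0], X_train.values, Y_train.values, k=3)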
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
## KNeighborsClassifier(n_neighbors=3)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
## 83.84
Gaussian Naive Bayes
In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) of a learning problem.
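Formally, under the independence assumption the posterior factorizes as
\[ P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y) \]
and GaussianNB models each per-feature likelihood \(P(x_i \mid y)\) as a normal distribution estimated from the training data.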
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
## GaussianNB()
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
## 72.28
Linear SVC (Support Vector Classification)
The Linear Support Vector Classifier (LinearSVC) applies a linear kernel function to perform classification, and it works well with large numbers of samples.
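Under the hood this amounts to fitting a separating hyperplane via the soft-margin objective (with labels \(y_i \in \{-1, +1\}\); scikit-learn's LinearSVC minimizes a squared variant of the hinge term by default):
\[ \min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{m} \max\left(0,\; 1 - y_i (w^\top x_i + b)\right) \]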
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
## LinearSVC()
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc
## 79.01
Decision Tree
This model uses a decision tree as a predictive model that maps features (the tree's branches) to conclusions about the target value (the tree's leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees.
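By default, scikit-learn's DecisionTreeClassifier chooses each split to minimize Gini impurity; for a node with class proportions \(p_k\):
\[ G = 1 - \sum_{k} p_k^2 \]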
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
## DecisionTreeClassifier()
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree
## 86.76
Random Forest
The next model, Random Forest, is one of the most popular. Random forests (or random decision forests) are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time (here n_estimators=100) and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
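Conceptually, the forest's classification output is the mode of its trees' votes,
\[ \hat{y}(x) = \operatorname{mode}\{\, T_1(x), \dots, T_{100}(x) \,\} \]
(scikit-learn implements this by averaging the trees' predicted class probabilities, which serves the same purpose).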
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
## RandomForestClassifier()
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
## 0.867564534231201
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
## 86.76
Model Evaluation
We can now rank our evaluation of all the models to choose the best one for our problem.
models = pd.DataFrame({
'Model': [ 'KNN', 'Logistic Regression',
'Random Forest', 'Naive Bayes', 'Linear SVC',
'Decision Tree'],
'Score': [ acc_knn, acc_log,
acc_random_forest, acc_gaussian,
acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
## Model Score
## 2 Random Forest 86.76
## 5 Decision Tree 86.76
## 0 KNN 83.84
## 1 Logistic Regression 80.36
## 4 Linear SVC 79.01
## 3 Naive Bayes 72.28
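One caveat: the scores above are training accuracies, which flatter models that can memorize the training set (the unpruned Decision Tree and the Random Forest in particular). As a minimal sketch of a fairer comparison (not part of the original exam code), 5-fold cross-validation on the same X_train and Y_train gives a less optimistic ranking:
from sklearn.model_selection import cross_val_score

for name, model in [('Random Forest', RandomForestClassifier(n_estimators=100)),
                    ('Decision Tree', DecisionTreeClassifier()),
                    ('KNN', KNeighborsClassifier(n_neighbors=3)),
                    ('Logistic Regression', LogisticRegression())]:
    # mean/std accuracy over 5 held-out folds
    scores = cross_val_score(model, X_train, Y_train, cv=5)
    print('%s: %.2f +/- %.2f' % (name, scores.mean() * 100, scores.std() * 100))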
submission = pd.DataFrame({
"PassengerId": test_df["PassengerId"],
"Survived": Y_pred
})
submission
## PassengerId Survived
## 0 892 0
## 1 893 0
## 2 894 0
## 3 895 0
## 4 896 1
## .. ... ...
## 413 1305 0
## 414 1306 1
## 415 1307 0
## 416 1308 0
## 417 1309 1
##
## [418 rows x 2 columns]
Unsupervised Learning
Input Data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt # for data visualization
import seaborn as sns # for statistical data visualization

df = pd.read_csv('input/Live.csv')
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 7050 entries, 0 to 7049
## Data columns (total 16 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 status_id 7050 non-null object
## 1 status_type 7050 non-null object
## 2 status_published 7050 non-null object
## 3 num_reactions 7050 non-null int64
## 4 num_comments 7050 non-null int64
## 5 num_shares 7050 non-null int64
## 6 num_likes 7050 non-null int64
## 7 num_loves 7050 non-null int64
## 8 num_wows 7050 non-null int64
## 9 num_hahas 7050 non-null int64
## 10 num_sads 7050 non-null int64
## 11 num_angrys 7050 non-null int64
## 12 Column1 0 non-null float64
## 13 Column2 0 non-null float64
## 14 Column3 0 non-null float64
## 15 Column4 0 non-null float64
## dtypes: float64(4), int64(9), object(3)
## memory usage: 881.4+ KB
df.isnull().sum()
## status_id 0
## status_type 0
## status_published 0
## num_reactions 0
## num_comments 0
## num_shares 0
## num_likes 0
## num_loves 0
## num_wows 0
## num_hahas 0
## num_sads 0
## num_angrys 0
## Column1 7050
## Column2 7050
## Column3 7050
## Column4 7050
## dtype: int64
df.drop(['Column1', 'Column2', 'Column3', 'Column4'], axis=1, inplace=True)
EDA
len(df['status_id'].unique())
## 6997
len(df['status_published'].unique())
## 6913
len(df['status_type'].unique())
## 4
df.drop(['status_id', 'status_published'], axis=1, inplace=True)
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 7050 entries, 0 to 7049
## Data columns (total 10 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 status_type 7050 non-null object
## 1 num_reactions 7050 non-null int64
## 2 num_comments 7050 non-null int64
## 3 num_shares 7050 non-null int64
## 4 num_likes 7050 non-null int64
## 5 num_loves 7050 non-null int64
## 6 num_wows 7050 non-null int64
## 7 num_hahas 7050 non-null int64
## 8 num_sads 7050 non-null int64
## 9 num_angrys 7050 non-null int64
## dtypes: int64(9), object(1)
## memory usage: 550.9+ KB
df.head()
## status_type num_reactions num_comments ... num_hahas num_sads num_angrys
## 0 video 529 512 ... 1 1 0
## 1 photo 150 0 ... 0 0 0
## 2 video 227 236 ... 1 0 0
## 3 photo 111 0 ... 0 0 0
## 4 photo 213 0 ... 0 0 0
##
## [5 rows x 10 columns]
Convert categorical variable into integers
X = df                   # feature matrix (all columns, including status_type)
y = df['status_type']    # keep the original labels as the target
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X['status_type'] = le.fit_transform(X['status_type'])
y = le.transform(y)
X.head()
## status_type num_reactions num_comments ... num_hahas num_sads num_angrys
## 0 3 529 512 ... 1 1 0
## 1 1 150 0 ... 0 0 0
## 2 3 227 236 ... 1 0 0
## 3 1 111 0 ... 0 0 0
## 4 1 213 0 ... 0 0 0
##
## [5 rows x 10 columns]
Feature Scaling
cols = X.columns
from sklearn.preprocessing import MinMaxScaler
ms = MinMaxScaler()
X = ms.fit_transform(X)
X = pd.DataFrame(X, columns=[cols])
X.head()
## status_type num_reactions num_comments ... num_hahas num_sads num_angrys
## 0 1.000000 0.112314 0.024393 ... 0.006369 0.019608 0.0
## 1 0.333333 0.031847 0.000000 ... 0.000000 0.000000 0.0
## 2 1.000000 0.048195 0.011243 ... 0.006369 0.000000 0.0
## 3 0.333333 0.023567 0.000000 ... 0.000000 0.000000 0.0
## 4 0.333333 0.045223 0.000000 ... 0.000000 0.000000 0.0
##
## [5 rows x 10 columns]
K-Means model
Use elbow method to find optimal number of clusters
from sklearn.cluster import KMeans
cs = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)
    kmeans.fit(X)
    cs.append(kmeans.inertia_)
## KMeans(n_clusters=10, random_state=0)
plt.plot(range(1, 11), cs)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('CS')
plt.show()

kmeans = KMeans(n_clusters=2, random_state=0)
kmeans.fit(X)
## KMeans(n_clusters=2, random_state=0)
labels = kmeans.labels_
# check how many of the samples were correctly labeled
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))## Result: 63 out of 7050 samples were correctly labeled.
print('Accuracy score: {0:0.2f}'.format(correct_labels/float(y.size)))
## Accuracy score: 0.01
kmeans = KMeans(n_clusters=3, random_state=0)
kmeans.fit(X)
## KMeans(n_clusters=3, random_state=0)
# check how many of the samples were correctly labeled
labels = kmeans.labels_
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))## Result: 138 out of 7050 samples were correctly labeled.
print('Accuracy score: {0:0.2f}'.format(correct_labels/float(y.size)))
## Accuracy score: 0.02
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(X)
## KMeans(n_clusters=4, random_state=0)
# check how many of the samples were correctly labeled
labels = kmeans.labels_
correct_labels = sum(y == labels)
print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))## Result: 4340 out of 7050 samples were correctly labeled.
print('Accuracy score: {0:0.2f}'.format(correct_labels/float(y.size)))
## Accuracy score: 0.62
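One caveat on the comparison above: K-Means cluster indices are arbitrary, so testing them for raw equality against the label-encoded classes can understate the real agreement. A minimal sketch (not part of the original code) that first maps each cluster to its most frequent true label before scoring:
import numpy as np

def cluster_accuracy(y_true, cluster_labels):
    mapped = np.zeros_like(cluster_labels)
    for c in np.unique(cluster_labels):
        mask = cluster_labels == c
        # relabel every member of cluster c with the majority true class in c
        mapped[mask] = np.bincount(y_true[mask]).argmax()
    return (mapped == y_true).mean()

print('Majority-mapped accuracy: {0:0.2f}'.format(cluster_accuracy(y, labels)))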