> import pandas as pd
+ import numpy as np
+ import matplotlib.pyplot as plt
+ import seaborn as sns

The Titanic data set is from Kaggle:
https://www.kaggle.com/c/titanic
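The train DataFrame used throughout is assumed to have been loaded from the Kaggle training CSV; a minimal sketch (the filename is hypothetical):

> # hypothetical filename for the Kaggle training CSV
+ train = pd.read_csv('titanic_train.csv')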
> train.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | NaN | S |
| 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
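The summary below is assumed to come from a call to info():

> train.info()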
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
> plt.figure(figsize=(8,6))
+ sns.heatmap(train.isnull(),
+ yticklabels=False,cbar=False,
+ cmap='viridis');
+ plt.tight_layout()
+ plt.show()

Roughly 20 percent of the Age values are missing (shown in yellow). That proportion is probably small enough to fill in with some form of imputation. The Cabin column, on the other hand, is missing far too much data to be useful at a basic level.
> plt.figure(figsize=(8,6))
+ sns.set_style('whitegrid')
+ sns.countplot(x='Survived',
+ data=train,palette='RdBu_r');
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.set_style('whitegrid')
+ sns.countplot(x='Survived',hue='Sex',
+ data=train,palette='RdBu_r');
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.set_style('whitegrid')
+ sns.countplot(x='Survived',hue='Pclass',
+ data=train,palette='rainbow');
+ plt.show()

> plt.figure(figsize=(8,6))
+ sns.distplot(train['Age'].dropna(),
+ kde=False,color='darkred',bins=30);
+ plt.show()

> # siblings or spouses aboard (SibSp)
+ plt.figure(figsize=(8,6))
+ sns.countplot(x='SibSp',data=train,
+ palette='pastel');
+ plt.show()

> import plotly.express as px
+ fig = px.histogram(train,x="Fare", nbins=40,
+ color_discrete_sequence=['orange'])
+ fig.write_html("T_hist.html")

We want to fill in the missing age data rather than simply dropping those rows. One way to do this is to fill in the mean or median age of all the passengers (imputation).
> # Age distribution by passenger class
+ plt.figure(figsize=(10,6))
+ sns.boxplot(x='Pclass',y='Age',
+ data=train,palette='winter');
+ plt.show()

The per-class median ages:

24.0
29.0
37.0
We can see that the wealthier passengers in the higher classes tend to be older, which makes sense. We’ll use these per-class median ages to impute the missing Age values based on Pclass.
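For reference, the per-class medians can also be computed directly rather than read off the boxplot; a minimal sketch:

> # median Age within each passenger class
+ train.groupby('Pclass')['Age'].median()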
> def impute_age(cols):
+ Age = cols[0]
+ Pclass = cols[1]
+
+ if pd.isnull(Age):
+
+ if Pclass == 1:
+ return 37
+
+ elif Pclass == 2:
+ return 29
+
+ else:
+ return 24
+
+ else:
+ return Age

Now apply that function.
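A minimal sketch of that step, applying impute_age row-wise over the Age and Pclass columns:

> train['Age'] = train[['Age','Pclass']].apply(impute_age,axis=1)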
Now let’s check the heat map again.
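One way to redo the check is to rerun the earlier null-value heatmap:

> plt.figure(figsize=(8,6))
+ sns.heatmap(train.isnull(),
+ yticklabels=False,cbar=False,
+ cmap='viridis');
+ plt.show()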
We can drop the Cabin column and the rows where Embarked is NaN.
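A minimal sketch of that cleanup (the in-place calls are an assumption):

> train.drop('Cabin',axis=1,inplace=True)
+ train.dropna(inplace=True)

The cleaned frame then looks like this: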
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | S |
| 6 | 0 | 3 | Moran, Mr. James | male | 24 | 0 | 0 | 330877 | 8.4583 | Q |
You need to convert categorical features to dummy variables using pandas. Otherwise the machine learning algorithm won’t be able to directly take in those features as inputs.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 889 entries, 0 to 890
Data columns (total 11 columns):
PassengerId 889 non-null int64
Survived 889 non-null int64
Pclass 889 non-null int64
Name 889 non-null object
Sex 889 non-null object
Age 889 non-null float64
SibSp 889 non-null int64
Parch 889 non-null int64
Ticket 889 non-null object
Fare 889 non-null float64
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(4)
memory usage: 83.3+ KB
female male
0 0 1
1 1 0
2 1 0
3 1 0
4 0 1
C Q S
0 0 0 1
1 1 0 0
2 0 0 1
3 0 0 1
4 0 0 1
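For reference, those previews can be reproduced with get_dummies (without drop_first); a minimal sketch:

> print(pd.get_dummies(train['Sex']).head())
+ print(pd.get_dummies(train['Embarked']).head())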
You need to drop the first dummy column from each set; its value can always be predicted from the remaining dummy columns, which would otherwise introduce perfect multicollinearity.
> sex = pd.get_dummies(train['Sex'],
+ drop_first=True)
+ embark = pd.get_dummies(train['Embarked'],
+ drop_first=True)

> # drop the columns we no longer need and attach the dummy columns
+ train.drop(['Sex','Embarked','Name','Ticket','PassengerId'],
+ axis=1,inplace=True)
+ train = pd.concat([train,sex,embark],axis=1)

| | Survived | Pclass | Age | SibSp | Parch | Fare | male | Q | S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 22 | 1 | 0 | 7.2500 | 1 | 0 | 1 |
| 1 | 1 | 1 | 38 | 1 | 0 | 71.2833 | 0 | 0 | 0 |
| 2 | 1 | 3 | 26 | 0 | 0 | 7.9250 | 0 | 0 | 1 |
| 3 | 1 | 1 | 35 | 1 | 0 | 53.1000 | 0 | 0 | 1 |
| 4 | 0 | 3 | 35 | 0 | 0 | 8.0500 | 1 | 0 | 1 |
| 5 | 0 | 3 | 24 | 0 | 0 | 8.4583 | 1 | 1 | 0 |
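The fitted estimator echoed below comes from a train/test split and a logistic regression fit; a minimal sketch, where the 70/30 split (consistent with the test support of 267 shown later), the random_state, and the variable names are assumptions:

> from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LogisticRegression
+ # hold out 30% of the 889 rows for testing (split ratio and seed are assumptions)
+ X_train, X_test, y_train, y_test = train_test_split(
+     train.drop('Survived',axis=1), train['Survived'],
+     test_size=0.30, random_state=101)
+ logmodel = LogisticRegression(solver='liblinear')
+ logmodel.fit(X_train,y_train)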
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)
You can check precision, recall, and f1-score using a classification report.
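A minimal sketch of that step, assuming predictions on the held-out test set and the variable names above; the confusion matrix shown after the report comes from the same predictions:

> from sklearn.metrics import classification_report,confusion_matrix
+ predictions = logmodel.predict(X_test)
+ print(classification_report(y_test,predictions))
+ confusion_matrix(y_test,predictions)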
precision recall f1-score support
0 0.80 0.91 0.85 163
1 0.82 0.65 0.73 104
accuracy 0.81 267
macro avg 0.81 0.78 0.79 267
weighted avg 0.81 0.81 0.80 267
array([[148, 15],
[ 36, 68]], dtype=int64)
This is an artificial advertising data set indicating whether or not a particular internet user clicked on an advertisement on a company website. We will try to create a model that predicts whether or not a user will click on an ad, based on that user's features.
This data set contains the following features:
- Daily Time Spent on Site: consumer time on site, in minutes
- Age: customer age, in years
- Area Income: average income of the consumer's geographical area
- Daily Internet Usage: average minutes per day the consumer is on the internet
- Ad Topic Line: headline of the advertisement
- City: city of the consumer
- Male: whether or not the consumer was male
- Country: country of the consumer
- Timestamp: time at which the consumer clicked on the ad or closed the window
- Clicked on Ad: 0 or 1, indicating whether the consumer clicked on the ad

| Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Ad Topic Line | City | Male | Country | Timestamp | Clicked on Ad |
|---|---|---|---|---|---|---|---|---|---|
| 68.95 | 35 | 61833.90 | 256.09 | Cloned 5thgeneration orchestration | Wrightburgh | 0 | Tunisia | 2016-03-27 00:53:11 | 0 |
| 80.23 | 31 | 68441.85 | 193.77 | Monitored national standardization | West Jodi | 1 | Nauru | 2016-04-04 01:39:02 | 0 |
| 69.47 | 26 | 59785.94 | 236.50 | Organic bottom-line service-desk | Davidton | 0 | San Marino | 2016-03-13 20:35:42 | 0 |
| 74.15 | 29 | 54806.18 | 245.89 | Triple-buffered reciprocal time-frame | West Terrifurt | 1 | Italy | 2016-01-10 02:31:19 | 0 |
| 68.37 | 35 | 73889.99 | 225.58 | Robust logistical utilization | South Manuel | 0 | Iceland | 2016-06-03 03:36:18 | 0 |
| 59.99 | 23 | 59761.56 | 226.74 | Sharable client-driven software | Jamieberg | 1 | Norway | 2016-05-19 14:30:17 | 0 |
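The preview above and the summary below are assumed to come from loading the CSV and calling head() and info(); the filename is hypothetical:

> # hypothetical filename for the advertising data
+ ad_data = pd.read_csv('advertising.csv')
+ ad_data.head()
+ ad_data.info()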
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 10 columns):
Daily Time Spent on Site 1000 non-null float64
Age 1000 non-null int64
Area Income 1000 non-null float64
Daily Internet Usage 1000 non-null float64
Ad Topic Line 1000 non-null object
City 1000 non-null object
Male 1000 non-null int64
Country 1000 non-null object
Timestamp 1000 non-null object
Clicked on Ad 1000 non-null int64
dtypes: float64(3), int64(3), object(4)
memory usage: 78.2+ KB
> ad_data.describe()
| | Daily Time Spent on Site | Age | Area Income | Daily Internet Usage | Male | Clicked on Ad |
|---|---|---|---|---|---|---|
| count | 1000.00000 | 1000.000000 | 1000.00 | 1000.00000 | 1000.0000000 | 1000.0000000 |
| mean | 65.00020 | 36.009000 | 55000.00 | 180.00010 | 0.4810000 | 0.5000000 |
| std | 15.85361 | 8.785562 | 13414.63 | 43.90234 | 0.4998889 | 0.5002502 |
| min | 32.60000 | 19.000000 | 13996.50 | 104.78000 | 0.0000000 | 0.0000000 |
| 25% | 51.36000 | 29.000000 | 47031.80 | 138.83000 | 0.0000000 | 0.0000000 |
| 50% | 68.21500 | 35.000000 | 57012.30 | 183.13000 | 0.0000000 | 0.5000000 |
A histogram of Age.
> sns.set_style('whitegrid')
+ ad_data['Age'].hist(bins=30,figsize=(8,6));
+ plt.xlabel('Age');
+ plt.show()

Jointplot showing Area Income versus Age.
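A minimal sketch of that plot (the default scatter style is an assumption):

> sns.jointplot(x='Age',y='Area Income',
+ data=ad_data);
+ plt.show()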
Jointplot showing the KDE distributions of Daily Time Spent on Site vs. Age.
> sns.jointplot(x='Age',y='Daily Time Spent on Site',
+ data=ad_data,color='red',kind='kde');
+ plt.show()

Jointplot of Daily Time Spent on Site vs. Daily Internet Usage.
> sns.jointplot(x='Daily Time Spent on Site',
+ y='Daily Internet Usage',
+ data=ad_data,color='green');
+ plt.show()

Pairplot with the hue defined by the Clicked on Ad column.
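A minimal sketch of that plot (pairplot uses only the numeric columns by default):

> sns.pairplot(ad_data,hue='Clicked on Ad');
+ plt.show()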
Split the data into a training set and a testing set using train_test_split.
> X = ad_data[['Daily Time Spent on Site', 'Age',
+ 'Area Income','Daily Internet Usage',
+ 'Male']]
+ y = ad_data['Clicked on Ad']

Train and fit a logistic regression model on the training set.
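A minimal sketch of the split and fit; the 0.33 test fraction (consistent with the test support of 330 shown later), the random_state, and the variable names are assumptions:

> from sklearn.model_selection import train_test_split
+ from sklearn.linear_model import LogisticRegression
+ # hold out roughly a third of the 1000 rows for testing (ratio and seed are assumptions)
+ X_train, X_test, y_train, y_test = train_test_split(
+     X, y, test_size=0.33, random_state=42)
+ logmodel = LogisticRegression(solver='liblinear')
+ logmodel.fit(X_train,y_train)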
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, l1_ratio=None, max_iter=100,
multi_class='warn', n_jobs=None, penalty='l2',
random_state=None, solver='liblinear', tol=0.0001, verbose=0,
warm_start=False)
Now predict values for the testing data.
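A minimal sketch, assuming the fitted model is called logmodel:

> predictions = logmodel.predict(X_test)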
Create a classification report for the model.
> from sklearn.metrics import classification_report
+ print(classification_report(y_test,predictions))

precision recall f1-score support
0 0.87 0.96 0.91 162
1 0.96 0.86 0.91 168
accuracy 0.91 330
macro avg 0.91 0.91 0.91 330
weighted avg 0.91 0.91 0.91 330