Resources and libraries

library(reticulate)


import sys

print(sys.version)

## 3.7.5 (default, Oct 31 2019, 15:18:51) [MSC v.1916 64 bit (AMD64)]


import sys

if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")


import pandas as pd


import matplotlib.pyplot as plt

import matplotlib as mpl

mpl.use('ps') # generate postscript output by default


import seaborn as sb

sb.set_style('whitegrid')


pd.set_option('display.precision', 3)

pd.set_option('display.expand_frame_repr', True)

#pd.set_option('max_colwidth', -1)

Data

Structure of the dataset

library(MASS)

data("birthwt")

str(birthwt)

## 'data.frame':    189 obs. of  10 variables:
##  $ low  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ age  : int  19 33 20 21 18 21 22 17 29 26 ...
##  $ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
##  $ race : int  2 3 1 1 1 3 1 3 1 1 ...
##  $ smoke: int  0 0 1 1 1 0 0 0 1 1 ...
##  $ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ht   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ui   : int  1 0 0 1 1 0 0 0 0 0 ...
##  $ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
##  $ bwt  : int  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 ...

Describe the variables

#	variable name	variable label	coded levels
1	low	indicator of birth weight less than 2.5 kg	0, 1
2	age	mother’s age in years	continuous variable
3	lwt	mother’s weight in pounds at last menstrual period	continuous variable
4	race	mother’s race (1 = white, 2 = black, 3 = other)	1, 2, 3
5	smoke	smoking status during pregnancy	0, 1
6	ptl	number of previous premature labours	0, 1, 2, 3
7	ht	history of hypertension	0, 1
8	ui	presence of uterine irritability	0, 1
9	ftv	number of physician visits during the first trimester	0, 1, 2, 3, 4, 6
10	bwt	birth weight in grams	continuous variable

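The outcome variable is low. Its related variables are age, lwt, race, smoke, ptl, ht, ui and ftv. The remaining variable, bwt, is the continuous birth weight from which low is defined (bwt below 2.5 kg).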

Import and glimpse the data

# save the 'birthwt' data as a 'csv' file and import using pandas 'read_csv' function 

pbwt=pd.read_csv('birthwt.csv')
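One detail worth checking when the CSV comes from R: `write.csv()` writes row names by default, which adds an unnamed leading column. The `info()` output in this document lists exactly 10 columns, so the file used here presumably had no such column; the sketch below only illustrates what to do if it does. The CSV text is retyped from the first two rows of the data, and the row labels `r1` and `r2` are hypothetical.

```python
import io
import pandas as pd

# Two rows of the data as R's write.csv() would emit them with its default
# row.names = TRUE. The row labels "r1" and "r2" are hypothetical.
csv_text = '''"","low","age","lwt","race","smoke","ptl","ht","ui","ftv","bwt"
"r1",0,19,182,2,0,0,0,1,0,2523
"r2",0,33,155,3,0,0,0,0,3,2551
'''

# Read as-is: the unnamed first column becomes a data column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 moves the row-label column into the index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(raw.shape)    # one extra column
print(clean.shape)  # only the ten variables
```

Exporting from R with `row.names = FALSE` avoids the extra column altogether.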

Information about the data set

# information about data set
pbwt.info()

## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 189 entries, 0 to 188
## Data columns (total 10 columns):
##  #   Column  Non-Null Count  Dtype
## ---  ------  --------------  -----
##  0   low     189 non-null    int64
##  1   age     189 non-null    int64
##  2   lwt     189 non-null    int64
##  3   race    189 non-null    int64
##  4   smoke   189 non-null    int64
##  5   ptl     189 non-null    int64
##  6   ht      189 non-null    int64
##  7   ui      189 non-null    int64
##  8   ftv     189 non-null    int64
##  9   bwt     189 non-null    int64
## dtypes: int64(10)
## memory usage: 14.9 KB

Data types

# data types: integer/object/category/floating-point
pbwt.dtypes

## low      int64
## age      int64
## lwt      int64
## race     int64
## smoke    int64
## ptl      int64
## ht       int64
## ui       int64
## ftv      int64
## bwt      int64
## dtype: object

First five rows of data


pbwt.head()

##    low  age  lwt  race  smoke  ptl  ht  ui  ftv   bwt
## 0    0   19  182     2      0    0   0   1    0  2523
## 1    0   33  155     3      0    0   0   0    3  2551
## 2    0   20  105     1      1    0   0   0    1  2557
## 3    0   21  108     1      1    0   0   1    2  2594
## 4    0   18  107     1      1    0   0   1    0  2600

Last five rows of data


pbwt.tail()

##      low  age  lwt  race  smoke  ptl  ht  ui  ftv   bwt
## 184    1   28   95     1      1    0   0   0    2  2466
## 185    1   14  100     3      0    0   0   0    2  2495
## 186    1   23   94     3      1    0   0   0    0  2495
## 187    1   17  142     2      0    0   1   0    0  2495
## 188    1   21  130     1      1    0   1   0    3  2495

Data shape or dimension



pbwt.shape

## (189, 10)

Column names


pbwt.columns

## Index(['low', 'age', 'lwt', 'race', 'smoke', 'ptl', 'ht', 'ui', 'ftv', 'bwt'], dtype='object')

Number of columns


len(pbwt.columns)

## 10

Number of observations


len(pbwt)

## 189

Length of a variable

len(pbwt['age'])

## 189

Number of rows

# rows
pbwt.index

## RangeIndex(start=0, stop=189, step=1)
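`pbwt.index` returns the index object rather than a number. As a small sketch, the row and column counts can be read in several equivalent ways; the frame below is only a hand-typed copy of five columns from the `pbwt.head()` output above, and the name `pbwt_head` is an assumption made for this example.

```python
import pandas as pd

# Five columns of the first five rows, retyped from the pbwt.head() output.
pbwt_head = pd.DataFrame({
    "low":   [0, 0, 0, 0, 0],
    "age":   [19, 33, 20, 21, 18],
    "lwt":   [182, 155, 105, 108, 107],
    "race":  [2, 3, 1, 1, 1],
    "smoke": [0, 0, 1, 1, 1],
})

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = pbwt_head.shape

# Equivalent ways of getting the same two numbers.
rows_from_len = len(pbwt_head)
rows_from_index = len(pbwt_head.index)
cols_from_columns = len(pbwt_head.columns)

print(n_rows, rows_from_len, rows_from_index)
print(n_cols, cols_from_columns)
```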

Plotting categorical outcome and its related variables using `seaborn`

Count plot using `seaborn.countplot()`


sb.countplot(x='low', data=pbwt, palette='hls')

#plt.xlabel('Low')

#plt.ylabel('Count')

plt.title('Frequency of low')

plt.show()


#f, ax = plt.subplots(figsize=(8, 8))

sb.countplot(x='low', hue="ftv", data=pbwt, palette='hls')

plt.title('Frequency of ftv by low')

plt.show()


# change axis and color

sb.countplot(y='low', hue="ftv", data=pbwt, color='purple')

plt.title('Frequency of ftv by low')

plt.show()
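A count plot with `hue` draws one bar per combination of the two variables, so the bar heights are the cells of a cross-tabulation. The sketch below shows that table with `pd.crosstab`, using only the ten rows printed by `pbwt.head()` and `pbwt.tail()` (retyped by hand; the name `sample` is an assumption for this example), so the counts are not those of the full data set.

```python
import pandas as pd

# 'low' and 'ftv' for the ten rows shown by pbwt.head() and pbwt.tail().
sample = pd.DataFrame({
    "low": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "ftv": [0, 3, 1, 2, 0, 2, 2, 0, 0, 3],
})

# Rows are levels of 'low', columns are levels of 'ftv';
# each cell is the height of one bar in the hue-split count plot.
counts = pd.crosstab(sample["low"], sample["ftv"])
print(counts)
```

Combinations that never occur appear as zeros in the table and as missing bars in the plot.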

Count plot using `seaborn.catplot()`


sb.catplot(x="low", kind="count", palette="ch:.25", data=pbwt)

## <seaborn.axisgrid.FacetGrid object at 0x000000002CC4C348>

plt.title('Frequency of low')

plt.show()


# horizontal orientation: change AXIS

sb.catplot(y="low", kind="count", palette="ch:.25", data=pbwt)

## <seaborn.axisgrid.FacetGrid object at 0x000000002D1AA688>

plt.title('Frequency of low')

plt.show()


# adding another variable as hue

sb.catplot( y="low", hue="ftv", kind="count", data=pbwt)

## <seaborn.axisgrid.FacetGrid object at 0x000000002D1F0488>

plt.title('Frequency of ftv by low')

plt.show()


# different color scheme

sb.catplot( y="low", hue="ftv", kind="count", data=pbwt, palette="pastel", edgecolor=".6")

## <seaborn.axisgrid.FacetGrid object at 0x000000002D22E308>

plt.title('Frequency of ftv by low')

plt.show()

Bar plot


# bivariate bar graph with confidence interval
# bar height is the mean of 'low' within each race, i.e. the proportion of low-birth-weight infants

#pbwt[['low', 'race']]=pbwt[['low', 'race']].astype('int64')

sb.catplot(y="low", x='race', data=pbwt, kind="bar")

## <seaborn.axisgrid.FacetGrid object at 0x000000002D2FE708>

plt.title('Proportion of low birth weight by race')

plt.show()


# bivariate bar graph without confidence interval

sb.catplot(y="low", x='race', data=pbwt, kind="bar", ci=None)

## <seaborn.axisgrid.FacetGrid object at 0x000000002D1C20C8>

plt.title('Proportion of low birth weight by race')

plt.show()
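With `kind="bar"`, seaborn plots the mean of the `y` variable within each level of `x`. Because `low` is coded 0/1, that mean is the proportion of low-birth-weight infants in each group. A minimal sketch of the same quantity with `groupby`, again using only the ten rows printed by `pbwt.head()` and `pbwt.tail()` (the names `sample` and `prop_low` are assumptions for this example):

```python
import pandas as pd

# 'low' and 'race' for the ten rows shown by pbwt.head() and pbwt.tail().
sample = pd.DataFrame({
    "low":  [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "race": [2, 3, 1, 1, 1, 1, 3, 3, 2, 1],
})

# Mean of a 0/1 variable per group = proportion of 1s in that group.
prop_low = sample.groupby("race")["low"].mean()
print(prop_low)
```

These values come from ten rows only and will differ from the bar heights drawn from all 189 observations.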


# adding another variable as hue

sb.catplot( x='race', y="low", hue="smoke", kind="bar", data=pbwt)

## <seaborn.axisgrid.FacetGrid object at 0x000000002D2D1A48>

plt.title('Proportion of low birth weight by race and smoking status')

plt.show()

Point plot


sb.catplot(x='race', y='low', kind='point', data=pbwt)

## <seaborn.axisgrid.FacetGrid object at 0x000000002D3D9748>

plt.title('Proportion of low birth weight by race')

plt.show()


# add another variable as hue

sb.catplot(x='race', y='low', hue='smoke', kind='point', data=pbwt)

## <seaborn.axisgrid.FacetGrid object at 0x000000002D40B248>

plt.title('Proportion of low birth weight by race and smoking status')


plt.show()


# different markers and line styles

sb.catplot(x='race', y='low', hue='smoke', kind='point', data=pbwt, markers=["^", "o"], linestyles=["-", "--"])

## <seaborn.axisgrid.FacetGrid object at 0x000000002D3AEEC8>

plt.title('Proportion of low birth weight by race and smoking status')

plt.show()


# if you want to provide labels

recode_smoke = {0:'non_smoker', 1:'smoker'}

pbwt['smokel']= pbwt['smoke'].map(recode_smoke)


recode_race = {1:'white', 2:'black', 3:'others'}

pbwt['racel']= pbwt['race'].map(recode_race)


#sb.catplot(x="racel", y="low", hue="smokel", kind="point", data=pbwt)

sb.catplot(x="racel", y="low", hue="smokel", kind="point", data=pbwt, order=['white', 'black', 'others'])

## <seaborn.axisgrid.FacetGrid object at 0x000000002B9AF888>

plt.title('Proportion of low birth weight by race and smoking status')

plt.show()
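The `order=` argument above is needed because mapped string labels carry no inherent order. One alternative, sketched here under the assumption that the same display order is wanted in every plot, is to store the labels as an ordered categorical; seaborn generally follows the category order of categorical columns. The `race` values are retyped from the `pbwt.head()` and `pbwt.tail()` output, and `race_dtype` is a name introduced for this example.

```python
import pandas as pd

# 'race' for the ten rows shown by pbwt.head() and pbwt.tail().
race = pd.Series([2, 3, 1, 1, 1, 1, 3, 3, 2, 1], name="race")

recode_race = {1: 'white', 2: 'black', 3: 'others'}

# Fix the label order once, in the dtype itself.
race_dtype = pd.CategoricalDtype(categories=['white', 'black', 'others'], ordered=True)
racel = race.map(recode_race).astype(race_dtype)

print(racel.cat.categories)
print(racel.sort_values().tolist())
```

Sorting and grouping then follow white, black, others rather than alphabetical order.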

Categorical scatter plot


sb.catplot(x="low", y="age", data=pbwt)

## <seaborn.axisgrid.FacetGrid object at 0x000000002E59A8C8>

plt.xlabel('Birth weight category')

plt.ylabel('age')

plt.title("Distribution/spread of mother's age by low")

plt.show()

Linear model: Logistic plots


# when the predictor is categorical

sb.lmplot(x="race", y="low", data=pbwt, logistic=True, y_jitter=.03)

## <seaborn.axisgrid.FacetGrid object at 0x000000002C6FB148>

plt.title('Logistic relationship of race with low')

plt.show()


# when the predictor is continuous

sb.lmplot(x="lwt", y="low", data=pbwt, logistic=True, y_jitter=.03)

## <seaborn.axisgrid.FacetGrid object at 0x000000002E570D48>

plt.title('Logistic relationship of lwt with low')

plt.show()


# dichotomize a continuous predictor for logistic plotting: age_cat is True when age is less than 30

pbwt["age_cat"] = pbwt['age'] < 30

sb.lmplot(x="age_cat", y="low", data=pbwt, logistic=True, y_jitter=.03)

## <seaborn.axisgrid.FacetGrid object at 0x000000002E5A0948>

plt.title('Logistic relationship of categorized age with low')

plt.show()


# dichotomize a continuous outcome for logistic plotting: bwt_cat is True when bwt is less than 2500 g

pbwt["bwt_cat"] = pbwt['bwt'] < 2500

sb.lmplot(x="lwt", y="bwt_cat", data=pbwt, logistic=True, y_jitter=.03)

## <seaborn.axisgrid.FacetGrid object at 0x000000002E5A0C88>

plt.title('Logistic relationship of lwt with categorized infant body weight')

plt.show()
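The dichotomization used above is a single threshold comparison. As a sketch of an alternative that keeps readable labels and extends to more than two groups, `pd.cut` can bin the same variable; the bin labels `under_30` and `30_or_over` are hypothetical names chosen for this example, and the ages are retyped from the `pbwt.head()` and `pbwt.tail()` output.

```python
import numpy as np
import pandas as pd

# 'age' for the ten rows shown by pbwt.head() and pbwt.tail().
age = pd.Series([19, 33, 20, 21, 18, 28, 14, 23, 17, 21], name="age")

# Threshold comparison: True when age is below 30.
age_cat = age < 30

# Same split with pd.cut; right=False makes the bins [0, 30) and [30, inf).
age_group = pd.cut(
    age,
    bins=[0, 30, np.inf],
    right=False,
    labels=["under_30", "30_or_over"],  # hypothetical labels
)

print(age_cat.tolist())
print(age_group.tolist())
```

Both approaches assign each mother to the same group; only the representation (boolean versus labelled category) differs.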

Plotting categorical outcome and its related variables using `matplotlib`

Bar plot


#f, ax = plt.subplots(figsize=(8, 4))

pbwt['low'].value_counts().plot(kind='bar')

plt.xlabel('Birth weight category')

plt.ylabel('Frequency')

plt.title('Frequency of low')

plt.show()


# horizontal bar plot

pbwt['low'].value_counts().plot(kind='barh')

plt.title('Frequency of low')

plt.show()


pbwt['low'].value_counts().plot.bar()

plt.title('Frequency of low')

plt.show()


# horizontal bar plot using `barh` following `plot` command

pbwt['low'].value_counts().plot.barh()

plt.title('Frequency of low')

plt.show()

Pie chart



# with legend 

pbwt['low'].value_counts().plot.pie()

#plt.pie(pbwt['low'].value_counts())

plt.legend(['not_low_body_weight', 'low_body_weight'], loc='best')

plt.title('Frequency of low')

plt.show()



# with label

pbwt['low'].value_counts().plot.pie(labels=['not_low_body_weight', 'low_body_weight'])

#plt.pie(pbwt['low'].value_counts(), labels=['not_low_body_weight', 'low_body_weight'])

plt.title('Frequency of low')

plt.show()
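Passing a hand-written label list to the pie chart relies on the labels being in the same order as the rows of `value_counts()`, which sorts by frequency rather than by code. A minimal sketch of a safer pattern is to attach the labels to the counts first, so each wedge takes its label from the index. The label strings are the ones used above; the `low` values are retyped from the `pbwt.head()` and `pbwt.tail()` output, and `low_labels` is a name introduced for this example.

```python
import pandas as pd

# 'low' for the ten rows shown by pbwt.head() and pbwt.tail().
low = pd.Series([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], name="low")

low_labels = {0: 'not_low_body_weight', 1: 'low_body_weight'}

# Sort by code, then replace each code with its label in the index.
counts = low.value_counts().sort_index().rename(index=low_labels)
print(counts)

# counts.plot.pie() or counts.plot.bar() would then pick up the labels
# from the index, whatever the relative frequencies are.
```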

pbwt.head()

##    low  age  lwt  race  smoke  ...   bwt      smokel   racel  age_cat  bwt_cat
## 0    0   19  182     2      0  ...  2523  non_smoker   black     True    False
## 1    0   33  155     3      0  ...  2551  non_smoker  others    False    False
## 2    0   20  105     1      1  ...  2557      smoker   white     True    False
## 3    0   21  108     1      1  ...  2594      smoker   white     True    False
## 4    0   18  107     1      1  ...  2600      smoker   white     True    False
## 
## [5 rows x 14 columns]

Plotting categorical outcome and its associated variables in python using seaborn and matplotlib

Bhagirathi Dash

1/10/2020
