# Load the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Statistics Chapter 1: Variable Types
1 Goal
Review the common variable types you will encounter when working with data, define these variable types, and learn how to store them in Python using the pandas library.
Identifying variable types and storing them properly is an important first step for any statistical analysis. For one, different variable types require different methods for summarizing data and running hypothesis tests. Understanding your data will help you choose a method to answer your questions! It is also impossible to run any analysis in Python unless your computer can interpret your data correctly. If you store numerical values as strings, for example, then you cannot calculate an average.
At the end of this module, you will be able to:
Identify the following variable types in a dataset:
Quantitative variables: continuous vs. discrete
Categorical variables: nominal vs. ordinal vs. binary
Load data in Python and check whether assigned data types match variable types
Encode categorical variables in Python using label encoding and one-hot encoding
Prerequisites: Basic knowledge of Python and pandas
2 Variable Types
In tabular data (e.g., a spreadsheet), variables are represented by the columns of the spreadsheet. The types of variables within our dataset will have a great impact on the insights we can gain from our data. It is therefore important to understand variable types, and how different variables can offer different perspectives and functionalities within our data.
Generally, variables come in two varieties: categorical and quantitative.
Categorical variables group observations into separate categories that can be ordered or unordered.
Quantitative variables, on the other hand, are expressed numerically, whether as a count or a measurement.
2.0.1 Quantitative Variables
We can think of quantitative variables as any information about an observation that can only be described with numbers. Quantitative variables are generally counts or measurements of something (e.g., the number of points earned in a game, or height). There are two types of quantitative variables, discrete and continuous, and they serve different functions in a dataset.
Discrete Variables: Discrete quantitative variables are numeric values that represent counts and can only take on integer values. They represent whole units that cannot be broken down into smaller pieces, and as such cannot be meaningfully expressed with decimals or fractions. Examples of discrete variables are the number of children in a person's family or the number of coin flips a person makes.
Continuous Variables: Continuous quantitative variables are numeric measurements that can be expressed with decimal precision. Theoretically, continuous variables can take on infinitely many values within a given range. Examples of continuous variables are length, weight, and age, which can all be described with decimal values.
Sometimes the line between discrete and continuous variables can be a bit blurry. For example, age with decimal values is a continuous variable, but age in closest whole years is by definition discrete. The precision with which a quantitative variable is recorded can also determine how we classify the variable.
2.0.2 Categorical Variables
Categorical variables differ from quantitative variables in that they focus on the different ways data can be grouped rather than counted or measured. With categorical variables, we want to understand how the observations in our dataset can be grouped and separated from one another based on their attributes. There are two types of categorical variables: ordinal and nominal.
Ordinal Variables: When the groupings of a categorical variable have a specific order or ranking, it is an ordinal variable. Suppose there was a variable containing responses to the question "Rate your agreement with the statement: The minimum age to drive should be lowered." The response options are "strongly disagree", "disagree", "neutral", "agree", and "strongly agree". Because we can see an order where "strongly disagree" < "disagree" < "neutral" < "agree" < "strongly agree" in relation to agreement, we consider the variable to be ordinal.
Nominal Variables: If there is no apparent order/ranking to the categories of a categorical variable, we refer to it as a nominal variable.
Nominal categorical variables are those variables with two or more categories that do not have any relational order. Examples of nominal categories could be states in India, brands of computers, tribes. Each of these variables has no intrinsic ordering that distinguishes a category as greater than or less than another category.
The number of possible values for a nominal variable can be quite large. It's even possible that a nominal categorical variable will take on a unique value for every observation in a dataset, as in the case of unique identifiers such as name or email_address.
Binary Variables: Binary or dichotomous variables are a special kind of nominal variable that have only two categories.
Because there are only two possible values for binary variables, the categories are mutually exclusive. We can imagine a variable that describes whether a picture contains a cat or a dog as a binary variable. In this case, if the picture is not a dog, it must be a cat, and vice versa. Binary variables can also be described with numbers, similar to bits, with 0 or 1 values. Likewise, you may find binary variables containing boolean values of True or False.
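As a small sketch of this equivalence (the values below are made up for illustration):

# A binary variable stored as booleans can be recast as 0/1 integers
is_dog = pd.Series([True, False, True])
is_dog.astype(int) # 1, 0, 1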
3 Assessing Variables
The first step to working with datasets is figuring out what kind of variables (columns in a dataframe) are present, and whether they are quantitative or categorical.
An effective way of taking a peek at our dataframes in pandas is to look at the first few rows of the data. This helps us learn the variable names within our dataset, get a sample of the values in each variable, and understand what types of variables we have in our data.
Sometimes the variable type of the data may be unclear. In these cases, we may need to confirm our assessments by inspecting the data dictionary or using domain knowledge. Reading through the associated documentation, plus a little domain knowledge, can often save us from false assumptions about our dataset.
Let's inspect a dataset to identify its variable types. We will be working with the movies dataframe, which contains television shows and movies hosted on the Netflix platform in 2019.
Let's inspect the top 5 rows using the .head() method.
3.1 Data Ingestion
# Load the dataset into a dataframe movies
movies = pd.read_excel("Datasets/movie_data_netflix.xlsx")

# Top 5 rows
movies.head()

| | type | title | country | release_year | rating | duration |
|---|---|---|---|---|---|---|
| 0 | Movie | Norm of the ... | United States | missing | PG | 91.071 |
| 1 | Movie | Jandino: Wha... | United Kingdom | 2016 | R | 94.516 |
| 2 | TV Show | Transformers... | United States | 2013 | G | 1.127 |
| 3 | TV Show | Transformers... | United States | 2016 | TV-14 | 1.687 |
| 4 | Movie | #realityhigh... | United States | 2017 | TV-14 | 99.248 |
3.2 Data Understanding
3.2.1 Data information
A useful way of assessing data types in pandas is the .info() method, which returns the number of non-null instances in each column along with their data types, a tally of the data types, and the dataframe memory usage.
# Dataset information
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 6234 non-null object
1 title 6234 non-null object
2 country 5758 non-null object
3 release_year 6234 non-null object
4 rating 6234 non-null object
5 duration 6234 non-null float64
dtypes: float64(1), object(5)
memory usage: 292.3+ KB
Our dataset consists of 6234 rows (observations) across 6 columns (features).
5 variables are categorical (dtype = object) and 1 variable is numeric (dtype = float64).
3.2.2 Data types
When using Python, Pandas dataframe data types can be inspected using the .dtypes attribute
# Data types
movies.dtypes

type object
title object
country object
release_year object
rating object
duration float64
dtype: object
3.2.3 Unique values
Having looked at the first few rows of your data and established that a variable is categorical, an important next step is to determine whether the categories are ordered or not.
We use .unique() to view the unique values of a variable.
We use .nunique() to count the number of unique values.
# Unique ratings
movies["rating"].unique()array(['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED'],
dtype=object)
# Number of Unique Ratings
movies["rating"].nunique()7
# Unique countries
movies["country"].unique()array(['United States', 'United Kingdom', 'Spain', 'Bulgaria', 'Chile',
nan, 'Netherlands', 'France', 'Thailand', 'China', 'Belgium',
'India', 'Pakistan', 'Canada', 'South Korea', 'Denmark', 'Turkey',
'Brazil', 'Indonesia', 'Ireland', 'Hong Kong', 'Mexico', 'Vietnam',
'Nigeria', 'Japan', 'Norway', 'Lebanon', 'Cambodia', 'Russia',
'Poland', 'Israel', 'Italy', 'Germany', 'United Arab Emirates',
'Egypt', 'Taiwan', 'Australia', 'Czech Republic', 'Argentina',
'Switzerland', 'Malaysia', 'Philippines', 'Serbia', 'Colombia',
'Singapore', 'Peru', 'South Africa', 'New Zealand', 'Venezuela',
'Saudi Arabia', 'Iceland', 'Austria', 'Uruguay', 'Finland',
'Ghana', 'Iran', 'Sweden', 'Hungary', 'Guatemala', 'Portugal',
'Paraguay', 'Somalia', 'Ukraine', 'Dominican Republic', 'Romania',
'Slovenia', 'Croatia', 'Bangladesh', 'Soviet Union', 'Georgia',
'West Germany', 'Mauritius', 'Cyprus'], dtype=object)
# Number of unique countries
movies["country"].nunique()72
3.2.4 Missing values
We can apply the .isnull() method followed by .sum() to get an overview of missing values per column.
# Missing values
movies.isnull().sum()

type 0
title 0
country 476
release_year 0
rating 0
duration 0
dtype: int64
3.3 Data Summary
Here is a short code snippet that summarizes the key attributes of our dataset.
# Tabular view of Data
# Creating the Data Dictionary with first column being datatype.
Data_dict = pd.DataFrame(movies.dtypes)
# Identifying unique values. nunique() returns the number of unique elements in each column.
Data_dict['UniqueVal'] = movies.nunique()
# Identifying the missing values from the dataset.
Data_dict['MissingVal'] = movies.isnull().sum()
# Percentage of Missing Values
Data_dict['Percent Missing'] = round(movies.isnull().sum()/len(movies)*100, 2)
# identifying count of the variable.
Data_dict['Count'] = movies.count()
# Renaming the first column using rename()
Data_dict = Data_dict.rename(columns = {0:'DataType'})
Data_dict

| | DataType | UniqueVal | MissingVal | Percent Missing | Count |
|---|---|---|---|---|---|
| type | object | 2 | 0 | 0.00 | 6234 |
| title | object | 5731 | 0 | 0.00 | 6234 |
| country | object | 72 | 476 | 7.64 | 5758 |
| release_year | object | 73 | 0 | 0.00 | 6234 |
| rating | object | 7 | 0 | 0.00 | 6234 |
| duration | float64 | 5517 | 0 | 0.00 | 6234 |
Our dataset consists of 6234 rows (observations) across 6 columns (features).
5 variables are categorical (dtype = object):
Type: 2 unique values
Title: 5731 unique values
Country: 72 unique values
Release Year: 73 unique values
Rating: 7 unique values
The duration variable is numeric (dtype = float64).
The country column has 476 missing values (7.64%).
Make a habit of checking each variable name and what it describes in your dataset, then comparing that to the data type assigned to it; this will save you time and ensure you get the most insight from your data.
4 Altering Data Types
Sometimes, the data types that are automatically assigned to variables may not be appropriate. This can pose challenges and introduce errors in our analysis as data type determines the kind of operations that can be performed on a variable. Also, incorrect assignment of variable types can eliminate important information (like the order of the categories).
If we know that an incorrect data type has been assigned to a variable, there are two options:
a) Read the data into Python in a different way (e.g., alter the call to .read_csv() or .read_excel(); see the sketch after this list)
b) Change the data type directly
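For option (a), here is a minimal sketch; the file name and column name are hypothetical, used only for illustration:

# Request a specific data type while reading the file (hypothetical example)
df = pd.read_csv("Datasets/example.csv", dtype={"some_column": str})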
We can use the .astype() method to alter the data type of a variable.
The .astype() method is one of the most effective ways of altering the data type of a column. It can cast any column within a pandas dataframe to the data type of your choosing, as long as the translation is relatively obvious.
For example, .astype() is smart enough to know that the character “1” should translate to the integer 1; however, it won't know how to convert the character string “one” to an integer.
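A quick sketch of both cases, with made-up values:

# A numeric string converts cleanly
pd.Series(['1', '2']).astype(int) # 1, 2
# A written-out number does not; the line below would raise a ValueError
# pd.Series(['one', 'two']).astype(int)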
4.0.1 Type conversion: categorical to numeric
A number can always be converted to a string, but a string can only be converted to a number if its contents are numeric. If a numeric value is wrapped in quotes, it is stored as a string and can be converted to a number directly with the .astype() method.
Let’s look at the top 5 rows of our movies data
# Top 5 rows
movies.head()

| | type | title | country | release_year | rating | duration |
|---|---|---|---|---|---|---|
| 0 | Movie | Norm of the ... | United States | missing | PG | 91.071 |
| 1 | Movie | Jandino: Wha... | United Kingdom | 2016 | R | 94.516 |
| 2 | TV Show | Transformers... | United States | 2013 | G | 1.127 |
| 3 | TV Show | Transformers... | United States | 2016 | TV-14 | 1.687 |
| 4 | Movie | #realityhigh... | United States | 2017 | TV-14 | 99.248 |
We have one movie whose release year is recorded as 'missing'; we know this movie was released in 2019. We also know that the release year should not be a categorical variable. Let's convert it.
Here are the steps we follow:
- Replace "missing" with "2019" using the .replace() method. We replace it with the string '2019' because release_year is of object type, i.e., its values are currently stored as strings.
# Replace "missing" with "2019"
movies['release_year'] = movies['release_year'].replace('missing', '2019')
movies.head()

| | type | title | country | release_year | rating | duration |
|---|---|---|---|---|---|---|
| 0 | Movie | Norm of the ... | United States | 2019 | PG | 91.071 |
| 1 | Movie | Jandino: Wha... | United Kingdom | 2016 | R | 94.516 |
| 2 | TV Show | Transformers... | United States | 2013 | G | 1.127 |
| 3 | TV Show | Transformers... | United States | 2016 | TV-14 | 1.687 |
| 4 | Movie | #realityhigh... | United States | 2017 | TV-14 | 99.248 |
- Now we will convert release_year from object to int using .astype()
# Convert release_year to int
movies['release_year'] = movies['release_year'].astype('int')
# Recheck the data types with .dtypes.
movies.dtypes

type object
title object
country object
release_year int32
rating object
duration float64
dtype: object
5 Category Data Type
For ordinal categorical variables, we often want to store two different pieces of information: category labels and their order. None of the data types we've covered so far can store both of these at once. For example, let's take another look at the rating column in our movies dataframe, which contains movie ratings stored as strings. We can use the .unique() method to inspect the category names:
# Unique movie ratings
movies['rating'].unique()

array(['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED'],
dtype=object)
At this point, Python does not know that these categories have an inherent order.
5.1 Converting to category
pandas has a specific data type for categorical variables, called category, that addresses this problem.
The pd.Categorical() function can be used to store data as type category and indicate the order of the categories.
# Assign order and change Rating to category
movies['rating'] = pd.Categorical(
movies['rating'],
    ['G','PG','PG-13','TV-14','R','UNRATED','NOT RATED'], ordered=True)

Now, not only does Python recognize that the rating column is an ordinal variable, it understands that G < PG < PG-13 < TV-14 < R < UNRATED < NOT RATED. If we call .unique() on this column again, we see how Python retains the correct rankings.
# Converted unique movie ratings
movies['rating'].unique()

['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED']
Categories (7, object): ['G' < 'PG' < 'PG-13' < 'TV-14' < 'R' < 'UNRATED' < 'NOT RATED']
Let’s look at the data types again
# Data types
movies.dtypes

type object
title object
country object
release_year int32
rating category
duration float64
dtype: object
This conversion of ordinal variables to the category data type is very helpful in the event that we would like to sort the column by category. If we use .sort_values(), the dataframe will be sorted by the logical order of the rating column as opposed to alphabetical order.
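For instance, a minimal sketch of this behavior on our movies dataframe (output not shown):

# Rows are ordered G < PG < PG-13 < TV-14 < R < UNRATED < NOT RATED, not alphabetically
movies.sort_values('rating').head()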
Let's now look at the clothes dataframe, a dataset of customer reviews from an e-commerce clothing store.
# Clothes reviews
clothes = pd.read_csv("Datasets/clothes_data.csv")
clothes.head()

| | Clothing ID | Age | Title | Rating | Division Name |
|---|---|---|---|---|---|
| 0 | 767 | 33 | NaN | satisfied | Initmates |
| 1 | 1080 | 34 | NaN | very satisfied | General |
| 2 | 1077 | 60 | Some major design flaws | neutral | General |
| 3 | 1049 | 50 | My favorite buy! | very satisfied | General Petite |
| 4 | 847 | 47 | Flattering shirt | very satisfied | General |
# Dataset information
clothes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Clothing ID 23486 non-null int64
1 Age 23486 non-null int64
2 Title 19676 non-null object
3 Rating 23486 non-null object
4 Division Name 23472 non-null object
dtypes: int64(2), object(3)
memory usage: 917.6+ KB
# Unique values of rating
clothes["Rating"].unique()array(['satisfied', 'very satisfied', 'neutral', 'unsatisfied',
'very unsatisfied'], dtype=object)
We have 5 different ratings. Now let us change the data type of the Rating variable to category and set the order ['very unsatisfied', 'unsatisfied', 'neutral', 'satisfied', 'very satisfied'].
# Assign order and change Rating to category
clothes['Rating'] = pd.Categorical(
clothes['Rating'],
['very unsatisfied', 'unsatisfied', 'neutral', 'satisfied', 'very satisfied'],
    ordered=True)

# Dataset information
clothes.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Clothing ID 23486 non-null int64
1 Age 23486 non-null int64
2 Title 19676 non-null object
3 Rating 23486 non-null category
4 Division Name 23472 non-null object
dtypes: category(1), int64(2), object(2)
memory usage: 757.2+ KB
# Recheck the values of `Rating` with .unique()
clothes['Rating'].unique()

['satisfied', 'very satisfied', 'neutral', 'unsatisfied', 'very unsatisfied']
Categories (5, object): ['very unsatisfied' < 'unsatisfied' < 'neutral' < 'satisfied' < 'very satisfied']
6 Encoding Categorical Variables
When working with categorical variables, it is sometimes necessary to convert the categories to numbers. This is called categorical encoding and can be achieved through several encoding methods.
6.1 Label Encoding
Label encoding is a simple method of assigning unique numerical values to each category present in a categorical feature. Each category is mapped to an integer, starting from 0.
To perform label encoding, we use the .cat.codes accessor on a variable stored as type category.
In the following example, the Education variable has four unique categories: high school diploma, associates, bachelors, and post doctorate. The .cat.codes accessor will convert the categories into numeric values.
Note:
Education first has to be converted to a categorical type by applying the pd.Categorical() function with the specified order, wherein High School Diploma < Associates < Bachelors < Post Doctorate.
We then apply the .cat.codes accessor to convert education to numerical codes, as sketched below.
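Since the Education data is not loaded in this module, here is a minimal sketch with made-up values:

# Hypothetical Education variable (values assumed for illustration)
education = pd.Series(['bachelors', 'high school diploma', 'post doctorate', 'associates'])
# Convert to an ordered categorical type
education = pd.Categorical(
    education,
    ['high school diploma', 'associates', 'bachelors', 'post doctorate'],
    ordered=True)
# Wrap in a Series to expose the .cat.codes accessor; 0 = lowest category
pd.Series(education).cat.codes # 2, 0, 3, 1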
Let's apply label encoding to the Rating column of the clothes dataset.
# load the data into a dataframe and view top 5 rows
myclothes = pd.read_csv("Datasets/clothes_data.csv")
myclothes.head()

| | Clothing ID | Age | Title | Rating | Division Name |
|---|---|---|---|---|---|
| 0 | 767 | 33 | NaN | satisfied | Initmates |
| 1 | 1080 | 34 | NaN | very satisfied | General |
| 2 | 1077 | 60 | Some major design flaws | neutral | General |
| 3 | 1049 | 50 | My favorite buy! | very satisfied | General Petite |
| 4 | 847 | 47 | Flattering shirt | very satisfied | General |
# ratings unique
myclothes['Rating'].unique()

array(['satisfied', 'very satisfied', 'neutral', 'unsatisfied',
'very unsatisfied'], dtype=object)
# Convert to categorical with appropriate order
myclothes['Rating'] = pd.Categorical(
    myclothes['Rating'], ['very unsatisfied','unsatisfied',
    'neutral','satisfied','very satisfied'],
    ordered=True)
myclothes['Rating'].unique()

['satisfied', 'very satisfied', 'neutral', 'unsatisfied', 'very unsatisfied']
Categories (5, object): ['very unsatisfied' < 'unsatisfied' < 'neutral' < 'satisfied' < 'very satisfied']
# Create a new variable rating_code, which contains the numerical codes associated with each category
myclothes['rating_code'] = myclothes['Rating'].cat.codes

# 1st five rows
myclothes.head()

| | Clothing ID | Age | Title | Rating | Division Name | rating_code |
|---|---|---|---|---|---|---|
| 0 | 767 | 33 | NaN | satisfied | Initmates | 3 |
| 1 | 1080 | 34 | NaN | very satisfied | General | 4 |
| 2 | 1077 | 60 | Some major design flaws | neutral | General | 2 |
| 3 | 1049 | 50 | My favorite buy! | very satisfied | General Petite | 4 |
| 4 | 847 | 47 | Flattering shirt | very satisfied | General | 4 |
Let's apply label encoding to the rating column of our movies dataset.
# load the data into a dataframe and view top 5 rows
mymovies = pd.read_excel("Datasets/movie_data_netflix.xlsx")
mymovies.head()

| | type | title | country | release_year | rating | duration |
|---|---|---|---|---|---|---|
| 0 | Movie | Norm of the ... | United States | missing | PG | 91.071 |
| 1 | Movie | Jandino: Wha... | United Kingdom | 2016 | R | 94.516 |
| 2 | TV Show | Transformers... | United States | 2013 | G | 1.127 |
| 3 | TV Show | Transformers... | United States | 2016 | TV-14 | 1.687 |
| 4 | Movie | #realityhigh... | United States | 2017 | TV-14 | 99.248 |
# ratings unique
mymovies['rating'].unique()

array(['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED'],
dtype=object)
# Convert to categorical with appropriate order
mymovies['rating'] = pd.Categorical(
    mymovies['rating'],
    ['G','PG','PG-13','TV-14','R','UNRATED', 'NOT RATED'], ordered=True)
mymovies['rating'].unique()

['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED']
Categories (7, object): ['G' < 'PG' < 'PG-13' < 'TV-14' < 'R' < 'UNRATED' < 'NOT RATED']
# Create a new variable rating_code, which contains the numerical codes associated with each category
mymovies['rating_code'] = mymovies['rating'].cat.codes

# 1st five rows
mymovies.head()

| | type | title | country | release_year | rating | duration | rating_code |
|---|---|---|---|---|---|---|---|
| 0 | Movie | Norm of the ... | United States | missing | PG | 91.071 | 1 |
| 1 | Movie | Jandino: Wha... | United Kingdom | 2016 | R | 94.516 | 4 |
| 2 | TV Show | Transformers... | United States | 2013 | G | 1.127 | 0 |
| 3 | TV Show | Transformers... | United States | 2016 | TV-14 | 1.687 | 3 |
| 4 | Movie | #realityhigh... | United States | 2017 | TV-14 | 99.248 | 3 |
6.2 One Hot Encoding
We have seen how label encoding can be useful for ordinal categorical variables. But sometimes we need a different approach. This could be because:
We have a nominal categorical variable (like breed of dog), so it doesn't really make sense to assign numbers like 0, 1, 2, 3, 4, 5 to our categories, as this could create an order among the breeds that is not present.
We have an ordinal categorical variable but we don't want to assume that there's equal spacing between categories.
This is where One-Hot Encoding (OHE) comes in. With OHE, we essentially create a new binary variable for each of the categories within our original variable.
OHE is useful when managing nominal variables because it encodes the variable without creating an order among the categories.
To perform OHE on a variable within a pandas dataframe, we can use the pd.get_dummies() function, which creates a binary or "dummy" variable for each category. We can name the columns to be encoded in the columns parameter, and set the data parameter to the dataframe we intend to alter. pd.get_dummies() will also work on data types other than category.
Notice that when using pd.get_dummies(), we are effectively creating a new dataframe that contains a different set of variables from the original dataframe.
By passing the dataset and the column we want to encode into pd.get_dummies(), we create a new dataframe that contains one new binary variable per category, with values of 1 for True and 0 for False (see the sketch below). Also note that we haven't assigned any weighting to our nominal variable.
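To make this concrete, here is a minimal sketch with a made-up three-category pet variable:

# Hypothetical nominal variable with three categories
pets = pd.DataFrame({'animal': ['cat', 'dog', 'bird', 'dog']})
# Creates one binary column per category: animal_bird, animal_cat, animal_dog
pd.get_dummies(pets, columns=['animal'], dtype=int)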
It is important to note that OHE works best when we do not create too many additional variables, as increasing the dimensionality of our dataframe can create problems when working with certain machine learning models.
Let’s look at a dataset of Cereal Ratings from Nestle
# Load and review the cereals dataset
cereal = pd.read_csv("Datasets/nestle_cereals_data.csv")
cereal.head()

| | id | name | mfr | type | fiber | rating | shelf | vitamins | coupons | price |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 100% Bran | Nestle | C | 10.0 | 68.40 | top | 25 | 4 | 4.06 |
| 1 | 1 | 100% Natural Bran | Quaker Oats | C | 2.0 | 33.98 | top | 0 | 1 | 4.62 |
| 2 | 2 | All-Bran | Kelloggs | C | 9.0 | 59.43 | top | 25 | 3 | 4.39 |
| 3 | 3 | All-Bran with Extra Fiber | Kelloggs | C | 14.0 | 93.70 | top | 25 | 2 | 6.18 |
| 4 | 4 | Almond Delight | Ralston Purina | C | 1.0 | 34.38 | top | 25 | 2 | 6.05 |
# Dataset description
cereal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 77 non-null int64
1 name 77 non-null object
2 mfr 77 non-null object
3 type 77 non-null object
4 fiber 77 non-null float64
5 rating 77 non-null float64
6 shelf 77 non-null object
7 vitamins 77 non-null int64
8 coupons 77 non-null int64
9 price 77 non-null float64
dtypes: float64(3), int64(3), object(4)
memory usage: 6.1+ KB
Let’s look at unique manufacturers (mfr)
# unique cereal manufacturers
cereal['mfr'].unique()

array(['Nestle', 'Quaker Oats', 'Kelloggs', 'Ralston Purina',
'General Mills', 'Post', 'American Home'], dtype=object)
# Count of unique cereal manufacturers
len(cereal['mfr'].unique())

7
# One-Hot Encode the mfr variable with pd.get_dummies()
cereal_dummy = pd.get_dummies(cereal["mfr"],
dtype = int ,
prefix = "mfr",
drop_first=False)
cereal_dummy.head()

| | mfr_American Home | mfr_General Mills | mfr_Kelloggs | mfr_Nestle | mfr_Post | mfr_Quaker Oats | mfr_Ralston Purina |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
# Concatenate the one-hot-encoded dataframe to the original dataframe
cereal = pd.concat([cereal, cereal_dummy], axis=1)
cereal.head()

| | id | name | mfr | type | fiber | rating | shelf | vitamins | coupons | price | mfr_American Home | mfr_General Mills | mfr_Kelloggs | mfr_Nestle | mfr_Post | mfr_Quaker Oats | mfr_Ralston Purina |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 100% Bran | Nestle | C | 10.0 | 68.40 | top | 25 | 4 | 4.06 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 100% Natural Bran | Quaker Oats | C | 2.0 | 33.98 | top | 0 | 1 | 4.62 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2 | 2 | All-Bran | Kelloggs | C | 9.0 | 59.43 | top | 25 | 3 | 4.39 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 3 | All-Bran with Extra Fiber | Kelloggs | C | 14.0 | 93.70 | top | 25 | 2 | 6.18 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 4 | Almond Delight | Ralston Purina | C | 1.0 | 34.38 | top | 25 | 2 | 6.05 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
7 columns have been created, one per manufacturer category.
Column names have the prefix mfr_, denoting a dummy-coded column; a value of 1 implies True, i.e., mfr_Post = 1 implies the manufacturer is Post.
Notice also that the original mfr column still exists in the new dataframe.
We use drop_first=True when we want to build machine learning models. This drops the first category, which becomes the reference category, so that it is not redundantly encoded; all other categories are then coded with respect to it (see the sketch below).
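As a sketch, the same call as above with drop_first=True would drop the alphabetically first category (here mfr_American Home), which then serves as the reference (output not shown):

# Drop the first category so it becomes the implicit reference category
pd.get_dummies(cereal["mfr"], dtype=int, prefix="mfr", drop_first=True).head()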
cereal.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 77 non-null int64
1 name 77 non-null object
2 mfr 77 non-null object
3 type 77 non-null object
4 fiber 77 non-null float64
5 rating 77 non-null float64
6 shelf 77 non-null object
7 vitamins 77 non-null int64
8 coupons 77 non-null int64
9 price 77 non-null float64
10 mfr_American Home 77 non-null int32
11 mfr_General Mills 77 non-null int32
12 mfr_Kelloggs 77 non-null int32
13 mfr_Nestle 77 non-null int32
14 mfr_Post 77 non-null int32
15 mfr_Quaker Oats 77 non-null int32
16 mfr_Ralston Purina 77 non-null int32
dtypes: float64(3), int32(7), int64(3), object(4)
memory usage: 8.2+ KB
7 Project 1: Used Car Prices
We are given a dataset that contains data on used car prices.
7.1 Step 1: Data Ingestion
# Load the dataset
auto_data = pd.read_excel("Datasets/auto_mpg_3.xlsx")
auto_data.head()

| | id | auto_make | auto_body-style | auto_length | auto_width | auto_city-mpg | auto_price | auto_num_cyl | auto_two_door | auto_engine_size |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | alfa-romero | convertible | 168.8 | 64.1 | 21 | 13495 | 4 | 1 | medium |
| 1 | 1 | alfa-romero | convertible | 168.8 | 64.1 | 21 | 16500 | 4 | 1 | medium |
| 2 | 2 | alfa-romero | hatchback | 171.2 | 65.5 | 19 | 16500 | 6 | 1 | medium |
| 3 | 3 | audi | sedan | 176.6 | 66.2 | 24 | 13950 | 4 | 0 | medium |
| 4 | 4 | audi | sedan | 176.6 | 66.4 | 18 | 17450 | 5 | 0 | medium |
7.2 Step 2: Data Understanding
# Dataset information
auto_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 201 non-null int64
1 auto_make 201 non-null object
2 auto_body-style 201 non-null object
3 auto_length 201 non-null float64
4 auto_width 201 non-null float64
5 auto_city-mpg 201 non-null int64
6 auto_price 201 non-null int64
7 auto_num_cyl 201 non-null int64
8 auto_two_door 201 non-null object
9 auto_engine_size 201 non-null object
dtypes: float64(2), int64(4), object(4)
memory usage: 15.8+ KB
The dataset consists of 201 observations across 10 variables.
id, automobile city mileage, automobile price, and automobile number of cylinders are integer type.
Automobile length and automobile width are float type.
Automobile make, automobile body style, number of doors, and engine size are object (categorical) type.
7.3 Step 3: Data Summary
# Tabular view of Data
# Creating the Data Dictionary with first column being datatype.
Data_dict = pd.DataFrame(auto_data.dtypes)
# Identifying unique values. nunique() returns the number of unique elements in each column.
Data_dict['UniqueVal'] = auto_data.nunique()
# Identifying the missing values from the dataset.
Data_dict['MissingVal'] = auto_data.isnull().sum()
# Percentage of Missing Values
Data_dict['Percent Missing'] = round(auto_data.isnull().sum()/len(auto_data)*100, 2)
# identifying count of the variable.
Data_dict['Count'] = auto_data.count()
# Renaming the first column using rename()
Data_dict = Data_dict.rename(columns = {0:'DataType'})
Data_dict

| | DataType | UniqueVal | MissingVal | Percent Missing | Count |
|---|---|---|---|---|---|
| id | int64 | 201 | 0 | 0.0 | 201 |
| auto_make | object | 22 | 0 | 0.0 | 201 |
| auto_body-style | object | 5 | 0 | 0.0 | 201 |
| auto_length | float64 | 73 | 0 | 0.0 | 201 |
| auto_width | float64 | 43 | 0 | 0.0 | 201 |
| auto_city-mpg | int64 | 29 | 0 | 0.0 | 201 |
| auto_price | int64 | 186 | 0 | 0.0 | 201 |
| auto_num_cyl | int64 | 7 | 0 | 0.0 | 201 |
| auto_two_door | object | 3 | 0 | 0.0 | 201 |
| auto_engine_size | object | 3 | 0 | 0.0 | 201 |
Our dataset consists of 201 rows (observations) across 10 columns (features).
4 variables are categorical (dtype = object):
Auto Make: 22 unique values
Body Style: 5 unique values
Doors: 3 unique values
Engine Size: 3 unique values
4 variables are integer (dtype = int64) and 2 variables are float (dtype = float64).
ID: This is a redundant variable.
Length, Width, Auto Price, and City MPG may not need type conversion.
Number of Cylinders: This can be converted to categorical (see the sketch below).
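As a sketch of that last point (not applied here, since Step 4 below focuses on price and engine size):

# Treat the cylinder count as a categorical variable (illustrative only)
auto_data['auto_num_cyl'].astype('category')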
7.4 Step 4: Data type conversion
- Looking at the values in price, we can convert this to a float for consistency

# Convert int to float
auto_data['auto_price'] = auto_data['auto_price'].astype('float')
# review after type conversion
auto_data['auto_price'].dtype

dtype('float64')
- Convert the engine_size variable to the category data type with an order of ['small', 'medium', 'large'], and check the order with the .unique() method.
# how many levels in engine size
auto_data["auto_engine_size"].unique()array(['medium', 'small', 'large'], dtype=object)
# convert to categorical with appropriate order
auto_data['auto_engine_size'] = pd.Categorical(
auto_data['auto_engine_size'],
['small','medium' , 'large'],
    ordered=True)

# how many levels in engine size
auto_data["auto_engine_size"].unique()

['medium', 'small', 'large']
Categories (3, object): ['small' < 'medium' < 'large']
- Create a new variable called engine_codes containing the numerical codes associated with each category in the engine_size variable, using the .cat.codes accessor. Check the result with the .describe() method.
# Create a new variable engine_codes, which contains the numerical codes associated with each category
auto_data['engine_codes'] = auto_data['auto_engine_size'].cat.codes

# Check
auto_data.describe()

| | id | auto_length | auto_width | auto_city-mpg | auto_price | auto_num_cyl | engine_codes |
|---|---|---|---|---|---|---|---|
| count | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 | 201.000000 |
| mean | 102.900498 | 174.200995 | 65.889055 | 25.179104 | 13207.129353 | 4.363184 | 1.000000 |
| std | 59.234956 | 12.322175 | 2.101471 | 6.423220 | 7947.066342 | 1.059452 | 0.223607 |
| min | 0.000000 | 141.100000 | 60.300000 | 13.000000 | 5118.000000 | 2.000000 | 0.000000 |
| 25% | 53.000000 | 166.800000 | 64.100000 | 19.000000 | 7775.000000 | 4.000000 | 1.000000 |
| 50% | 103.000000 | 173.200000 | 65.500000 | 24.000000 | 10295.000000 | 4.000000 | 1.000000 |
| 75% | 154.000000 | 183.500000 | 66.600000 | 30.000000 | 16500.000000 | 4.000000 | 1.000000 |
| max | 204.000000 | 208.100000 | 72.000000 | 49.000000 | 45400.000000 | 12.000000 | 2.000000 |
- One-Hot Encode the body-style variable in the auto dataframe and check with .head()
# One-Hot Encode the body_style variable with pd.get_dummies()
auto_data_dummies = pd.get_dummies(data = auto_data ,
                    columns = ['auto_body-style'],
                    dtype = int ,
                    prefix = "body",
                    drop_first=False)
auto_data_dummies.head()

| | id | auto_make | auto_length | auto_width | auto_city-mpg | auto_price | auto_num_cyl | auto_two_door | auto_engine_size | engine_codes | body_convertible | body_hardtop | body_hatchback | body_sedan | body_wagon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | alfa-romero | 168.8 | 64.1 | 21 | 13495.0 | 4 | 1 | medium | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | alfa-romero | 168.8 | 64.1 | 21 | 16500.0 | 4 | 1 | medium | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2 | alfa-romero | 171.2 | 65.5 | 19 | 16500.0 | 6 | 1 | medium | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 3 | audi | 176.6 | 66.2 | 24 | 13950.0 | 4 | 0 | medium | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 4 | audi | 176.6 | 66.4 | 18 | 17450.0 | 5 | 0 | medium | 1 | 0 | 0 | 0 | 1 | 0 |
# Concatenate the one-hot-encoded dataframe to the original dataframe
auto_data = pd.concat([auto_data, auto_data_dummies], axis=1)
auto_data.head()

| | id | auto_make | auto_body-style | auto_length | auto_width | auto_city-mpg | auto_price | auto_num_cyl | auto_two_door | auto_engine_size | ... | auto_price | auto_num_cyl | auto_two_door | auto_engine_size | engine_codes | body_convertible | body_hardtop | body_hatchback | body_sedan | body_wagon |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | alfa-romero | convertible | 168.8 | 64.1 | 21 | 13495.0 | 4 | 1 | medium | ... | 13495.0 | 4 | 1 | medium | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | alfa-romero | convertible | 168.8 | 64.1 | 21 | 16500.0 | 4 | 1 | medium | ... | 16500.0 | 4 | 1 | medium | 1 | 1 | 0 | 0 | 0 | 0 |
| 2 | 2 | alfa-romero | hatchback | 171.2 | 65.5 | 19 | 16500.0 | 6 | 1 | medium | ... | 16500.0 | 6 | 1 | medium | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 3 | audi | sedan | 176.6 | 66.2 | 24 | 13950.0 | 4 | 0 | medium | ... | 13950.0 | 4 | 0 | medium | 1 | 0 | 0 | 0 | 1 | 0 |
| 4 | 4 | audi | sedan | 176.6 | 66.4 | 18 | 17450.0 | 5 | 0 | medium | ... | 17450.0 | 5 | 0 | medium | 1 | 0 | 0 | 0 | 1 | 0 |
5 rows × 26 columns

Note that because we passed data = auto_data to pd.get_dummies(), the returned dataframe already contains all of the original columns alongside the new dummy columns. Concatenating it back onto auto_data therefore duplicates every column (11 + 15 = 26 columns); unlike the cereal example, where only the encoded series was concatenated, the concatenation step is unnecessary here.
8 Summary of learning:
We discovered the different types of variables you will encounter when working with data and their corresponding data types in Python.
We explored datasets with the .head() and .info() methods.
We assessed categories within variables with the .unique() method.
We practiced ways to check the data types of variables using the .dtypes attribute and the .info() method.
We altered data with the .replace() method.
We learned how to change the data types of variables using the .astype() method.
We investigated the pandas category data type.
We developed Label Encoding and One-Hot Encoding skills with the .cat.codes accessor and pd.get_dummies().