Statistics_Chapter_1_Variable_Types

Author

Ravi Prabhakar Mummigatti

1 Goal

Review common variable types that one will work with in data , define these variable types and learn how to store them in Python using the pandas library.

Identifying variable types and storing them properly is an important first step for any statistical analysis. For one, different variable types require different methods for summarizing data and running hypothesis tests. Understanding your data will help you choose a method to answer your questions! It is also impossible to run any analysis in Python unless your computer can interpret your data correctly. If you store numerical values as strings, then you cannot calculate an average, for example

At the end of this module, you will be able to:

  1. Identify the following variable types in a dataset:

  2. Quantitative variables: continuous vs. discrete

  3. Categorical variables: nominal vs. ordinal vs. binary

  4. Load data in Python and check whether assigned data types match variable types

  5. Encode categorical variables in Python using categorical encoding and one-hot encoding

Prerequisites : Basic knowledge of Python and Pandas

# Load the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

2 Variable Types

In tabular data (eg., a spreadsheet), variables are represented by the columns of the spreadsheet. The types of variables within our dataset will have a great impact on the insights we can gain from our data. It is therefoe important to understand variable types, and how different variables can offer different perspectives and functionalities within our data.

Generally, variables come in two varieties; categorical and quantitative.

  1. Categorical variables group observations into separate categories that can be ordered or unordered.

  2. Quantitative variables on the other hand are variables expressed numerically, whether as a count or measurement.

2.0.1 Quantitative Variables

We can think of quantitative variables as any information about an observation that can only be described with numbers. Quantitative variables are generally counts or measurements of something (eg number of points earned in a game or height).There are two types of quantitative variables; discrete and continuous, and they both help to serve different functions in a dataset.

Discrete Variables : Discrete quantitative variables are numeric values that represent counts and can only take on integer values. They represent whole units that can not be broken down into smaller pieces, and as such cannot be meaningfully expressed with decimals or fractions. Examples of discrete variables are the number of children in a person's family or the number of coin flips a person makes.

Continuous Variables : Continuous quantitative variables are numeric measurements that can be expressed with decimal precision.Theoretically, continuous variables can take on infinitely many values within a given range. Examples of continuous variables are length, weight, and age which can all be described with decimal values.

Sometimes the line between discrete and continuous variables can be a bit blurry. For example, age with decimal values is a continuous variable, but age in closest whole years by definition is discrete. The precision with which a quantitative variable is recorded can also determine how we classify the variable

2.0.2 Categorical Variables

Categorical variables differ from quantitative variables in that they focus on the different ways data can be grouped rather than counted or measured. With categorical variables, we want to understand how the observations in our dataset can be grouped and separated from one another based on their attributes.There are two types of categorical variables : Ordinal and Nominal.

Ordinal Variables :When the groupings of a categorical variable have a specific order or ranking , it is an ordinal variable. Suppose there was a variable containing responses to the question "Rate your agreement with the statement: The minimum age to drive should be lowered." The response options are "strongly disagree", "disagree", "neutral", "agree", and "strongly agree".Because we can see an order where "strongly disagree" < "disagree " < "neutral" < "agree" < "strongly agree" in relation to agreement, we consider the variable to be ordinal

Nominal Variables : If there is no apparent order/ranking to the categories of a categorical variable, we refer to to it as a nominal variable.
Nominal categorical variables are those variables with two or more categories that do not have any relational order. Examples of nominal categories could be states in India, brands of computers, tribes. Each of these variables has no intrinsic ordering that distinguishes a category as greater than or less than another category.
The number of possible values for a nominal variable can be quite large. It's even possible that a nominal categorical variable will take on a unique value for every observation in a dataset, like in the case of unique identifiers such as name or email_addres

Binary Variables : Binary or dichotomous variables are a special kind of nominal variable that have only two categories.
Because there are only two possible values for binary variables, they are mutually exclusive to one another. We can imagine a variable that describes if a picture contains a cat or a dog as a binary variable. In this case, if the picture is not a dog, it must be a cat, and vice versa. Binary variables can also be described with numbers similar to bits with 0 or 1 values. Likewise you may find binary variables containing boolean values of True or False.

3 Assessing Variables

The first step to working with datasets is figuring out what kind of variables (columns in a dataframe) are present, and whether they are quantitative or categorical.

An effective way of taking a peek into our dataframes in pandas is to look at the first few rows of the data. This helps us learn the variable names within our dataset, and get a sample of the values in each variable and understand what types of variables we have in our data.

Sometimes the variable type of the data may be unclear.In these cases, we may need to confirm our assessments by inspecting the data dictionary or using domain knowledge. Researching through the associated documentation and a little domain knowledge can often save us from any false assumptions we may have made about our dataset.

Let's inspect a dataset to identify its variable types. We will be working with the movies dataframe , comprising of television shows and movies hosted on the Netflix platform in 2019.

Let’s inspect the top 5 rows applying the head() method

3.1 Data Ingestion

# Load the dataset into a dataframe movies
movies = pd.read_excel("Datasets/movie_data_netflix.xlsx")
# Top 5 rows
movies.head()
type title country release_year rating duration
0 Movie Norm of the ... United States missing PG 91.071
1 Movie Jandino: Wha... United Kingdom 2016 R 94.516
2 TV Show Transformers... United States 2013 G 1.127
3 TV Show Transformers... United States 2016 TV-14 1.687
4 Movie #realityhigh... United States 2017 TV-14 99.248

3.2 Data Understanding

3.2.1 Data information

A useful way of assessing data types in Pandas is the .info() method, which returns the number of instances in each column along with the data types, a tally of the data types, and the dataframe memory usage

# Dataset information
movies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   type          6234 non-null   object 
 1   title         6234 non-null   object 
 2   country       5758 non-null   object 
 3   release_year  6234 non-null   object 
 4   rating        6234 non-null   object 
 5   duration      6234 non-null   float64
dtypes: float64(1), object(5)
memory usage: 292.3+ KB
  • Our dataset consists of 6233 rows (observations) across 6 columns (features)

  • 5 variables are “Categorical : types = object” and 1 variable is “Numeric : type = float”

3.2.2 Data types

When using Python, Pandas dataframe data types can be inspected using the .dtypes attribute

# Data types
movies.dtypes
type             object
title            object
country          object
release_year     object
rating           object
duration        float64
dtype: object

3.2.3 Unique values

Having looked at the first few rows of your data and establishing that a variable is categorical , an important next step is to determine whether the categories are ordered or not.

We use .unique() to view the unique values in the variable to further inspect their data types

We use .nunique to count the number of unique values

# Unique ratings
movies["rating"].unique()
array(['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED'],
      dtype=object)
# Number of Unique Ratings
movies["rating"].nunique()
7
# Unique countries
movies["country"].unique()
array(['United States', 'United Kingdom', 'Spain', 'Bulgaria', 'Chile',
       nan, 'Netherlands', 'France', 'Thailand', 'China', 'Belgium',
       'India', 'Pakistan', 'Canada', 'South Korea', 'Denmark', 'Turkey',
       'Brazil', 'Indonesia', 'Ireland', 'Hong Kong', 'Mexico', 'Vietnam',
       'Nigeria', 'Japan', 'Norway', 'Lebanon', 'Cambodia', 'Russia',
       'Poland', 'Israel', 'Italy', 'Germany', 'United Arab Emirates',
       'Egypt', 'Taiwan', 'Australia', 'Czech Republic', 'Argentina',
       'Switzerland', 'Malaysia', 'Philippines', 'Serbia', 'Colombia',
       'Singapore', 'Peru', 'South Africa', 'New Zealand', 'Venezuela',
       'Saudi Arabia', 'Iceland', 'Austria', 'Uruguay', 'Finland',
       'Ghana', 'Iran', 'Sweden', 'Hungary', 'Guatemala', 'Portugal',
       'Paraguay', 'Somalia', 'Ukraine', 'Dominican Republic', 'Romania',
       'Slovenia', 'Croatia', 'Bangladesh', 'Soviet Union', 'Georgia',
       'West Germany', 'Mauritius', 'Cyprus'], dtype=object)
# Number of unique countries
movies["country"].nunique()
72

3.2.4 Missing values

We can apply the isnull() method and sum() to get an overview of missing values

# Missing values
movies.isnull().sum()
type              0
title             0
country         476
release_year      0
rating            0
duration          0
dtype: int64

3.3 Data Summary

Here is a simple code that summarizes the key attributes of our dataset

# Tabular view of Data
# Creating the Data Dictionary with first column being datatype.
Data_dict = pd.DataFrame(movies.dtypes)
# Identifying unique values . For this I've used nunique() which returns unique elements in the object.
Data_dict['UniqueVal'] = movies.nunique()
# Identifying the missing values from the dataset.
Data_dict['MissingVal'] = movies.isnull().sum()
# Percentage of Missing Values
Data_dict['Percent Missing'] = round(movies.isnull().sum()/len(movies)*100, 2)
# identifying count of the variable.
Data_dict['Count'] = movies.count()
# Renaming the first column using rename()
Data_dict = Data_dict.rename(columns = {0:'DataType'})
Data_dict
DataType UniqueVal MissingVal Percent Missing Count
type object 2 0 0.00 6234
title object 5731 0 0.00 6234
country object 72 476 7.64 5758
release_year object 73 0 0.00 6234
rating object 7 0 0.00 6234
duration float64 5517 0 0.00 6234
  • Our dataset consists of 6233 rows (observations) across 6 columns (features)

  • 5 variables are “Categorical : types = object”

    • Type : 2 sub-categories

    • Title : 5731 sub-categories

    • Country : 72 sub-categories

    • Release Year : 73 sub-categories

    • Rating : 7 sub-categories

  • The duration variable is “Numeric : type = float”

  • Country column has 476 missing values

Making a habit of checking each variable name and what it describes in your dataset, then comparing that to the data type that is assigned to it will save you time and ensure you get the most insight from your data.

4 Altering Data Types

Sometimes, the data types that are automatically assigned to variables may not be appropriate. This can pose challenges and introduce errors in our analysis as data type determines the kind of operations that can be performed on a variable. Also, incorrect assignment of variable types can eliminate important information (like the order of the categories).

If we know that an incorrect data-type has been assigned to a variable, there are two options:

a) Read the data into Python in a different way (eg., alter the call to .read_csv())

b) Change the data-type directly

We can use the astype() method to alter the data type of a variable.

The .astype() method is one of the most effective ways of altering the data type of a column. It can cast any column within a pandas dataframe to the data type of your choosing, as long as the translation is relatively obvious.

The astype() function is smart enough to know that the character “1” should translate to the integer 1; However, it won't know how to convert the character string “one” to an integer.

4.0.1 Type conversion : cat to num

A numeric integer can be converted to a string but not vice-versa. If a number variable is in “” , it is a string and can be converted to Number directly with astype() method.

Let’s look at the top 5 rows of our movies data

# Top 5 rows
movies.head()
type title country release_year rating duration
0 Movie Norm of the ... United States missing PG 91.071
1 Movie Jandino: Wha... United Kingdom 2016 R 94.516
2 TV Show Transformers... United States 2013 G 1.127
3 TV Show Transformers... United States 2016 TV-14 1.687
4 Movie #realityhigh... United States 2017 TV-14 99.248

We have a “missing” observation for a movie that released in 2019 denoted by ‘missing’. We also know that the release year cannot be a categorical variable. Let’s convert it.

Here are the steps we follow :

  1. Replace “missing” with “2019” , applying the .replace() method.The reason we are replacing with ‘2019’ is because Release Year is “Object Type” , stored as a string.
# Replace "missing" with "2019"
movies['release_year'] = movies['release_year'].replace('missing' , 2019)
movies.head()
type title country release_year rating duration
0 Movie Norm of the ... United States 2019 PG 91.071
1 Movie Jandino: Wha... United Kingdom 2016 R 94.516
2 TV Show Transformers... United States 2013 G 1.127
3 TV Show Transformers... United States 2016 TV-14 1.687
4 Movie #realityhigh... United States 2017 TV-14 99.248
  1. Now we will convert the “release_year” from “Categorical” to “int” by as.type()
# Convert release_year to int
movies['release_year'] = movies['release_year'].astype('int')  

# Recheck the data types with .dtypes.
movies.dtypes
type             object
title            object
country          object
release_year      int32
rating           object
duration        float64
dtype: object

5 Category Data Type

For ordinal categorical variables, we often want to store two different pieces of information: category labels and their order. None of the data types we've covered so far can store both of these at once. For example, let's take another look at the rating column in our movies dataframe, which contains movie ratings stored as strings. We can use the .unique() method to inspect the category names:

# Unique movie ratings
movies['rating'].unique()
array(['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED'],
      dtype=object)

At this point, Python does not know that these categories have an inherent order.

5.1 Converting to category

A specific data type for categorical variables in pandas called category addresses this problem.
The pandas .Categorical() method can be used to store data as type category and indicate the order of the categories.

# Assign order and change Rating to category
movies['rating'] = pd.Categorical(
                    movies['rating'], 
                    ['G','PG','PG-13','TV-14','R','UNRATED','NOT RATED'],   ordered=True)

Now, not only does Python recognize that the rating column is an ordinal variable, it understands that G < PG < PG-13 < TV-14 < R < UNRATED < NOT RATED. If we call .unique() on this column again, we see how Python retains the correct rankings.

# Converted unique movie ratings
movies['rating'].unique()
['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED']
Categories (7, object): ['G' < 'PG' < 'PG-13' < 'TV-14' < 'R' < 'UNRATED' < 'NOT RATED']

Let’s look at the data types again

# Data types
movies.dtypes
type              object
title             object
country           object
release_year       int32
rating          category
duration         float64
dtype: object

This type of conversion of “nominal to categorical” variables is very helpful in the event that we would like to sort the column by category. If we use .sort_values(), the dataframe will be sorted by the logical order of the rating column as opposed to the alphabetical order.

Let’s look at the clothes dataframe is a dataset focusing on the reviews of an Ecommerce store

# Clothes reviews
clothes = pd.read_csv("Datasets/clothes_data.csv")
clothes.head()
Clothing ID Age Title Rating Division Name
0 767 33 NaN satisfied Initmates
1 1080 34 NaN very satisfied General
2 1077 60 Some major design flaws neutral General
3 1049 50 My favorite buy! very satisfied General Petite
4 847 47 Flattering shirt very satisfied General
# Dataset information
clothes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   Clothing ID    23486 non-null  int64 
 1   Age            23486 non-null  int64 
 2   Title          19676 non-null  object
 3   Rating         23486 non-null  object
 4   Division Name  23472 non-null  object
dtypes: int64(2), object(3)
memory usage: 917.6+ KB
# Unique values of rating
clothes["Rating"].unique()
array(['satisfied', 'very satisfied', 'neutral', 'unsatisfied',
       'very unsatisfied'], dtype=object)

We have 5 different ratings. Now Let us change the data type of the Rating variable to categorical and set an order of [‘very unsatisfied’, ‘unsatisfied’, ‘neutral’, ‘satisfied’, ‘very satisfied’]

# Assign order and change Rating to category
clothes['Rating'] = pd.Categorical(
                    clothes['Rating'],
                    ['very unsatisfied', 'unsatisfied', 'neutral', 'satisfied', 'very satisfied'], 
                    ordered=True)
# Dataset information
clothes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23486 entries, 0 to 23485
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   Clothing ID    23486 non-null  int64   
 1   Age            23486 non-null  int64   
 2   Title          19676 non-null  object  
 3   Rating         23486 non-null  category
 4   Division Name  23472 non-null  object  
dtypes: category(1), int64(2), object(2)
memory usage: 757.2+ KB
# Recheck the values of `Rating` with .unique()
clothes['Rating'].unique()
['satisfied', 'very satisfied', 'neutral', 'unsatisfied', 'very unsatisfied']
Categories (5, object): ['very unsatisfied' < 'unsatisfied' < 'neutral' < 'satisfied' < 'very satisfied']

6 Encoding Categorical Variables

When working with categorical variables, it is sometimes necessary to convert the categories to numbers. This is called categorical encoding and can be achieved through several encoding methods.

6.1 Label Encoding

Label encoding is a simple method of assigning unique numerical values to each category present in a categorical feature. Each category is mapped to an integer, starting from 0.

To perform label encoding,we use the .cat.codes accessor for a variable stored as type category

In the following example, the Education variable has four unique categories: high school diploma, associates, bachelors, and post doctorate. The .cat.codes accessor will convert the categories into numeric values

Note :

  1. Education has to be first converted to a categorical type by alpplying the pd.Categorical method with the specified order where-in High School Diploma < Associates < Bachelors < Post Doctrate.

  2. We then apply the .cat.codes accessor to convert eduation to numerical variables with the approproate codes

Let’s apply label encoding to the “Clothes Rating Column”

# load the data into a dataframe and view top 5 rows
myclothes = pd.read_csv("Datasets/clothes_data.csv")
myclothes.head()
Clothing ID Age Title Rating Division Name
0 767 33 NaN satisfied Initmates
1 1080 34 NaN very satisfied General
2 1077 60 Some major design flaws neutral General
3 1049 50 My favorite buy! very satisfied General Petite
4 847 47 Flattering shirt very satisfied General
# ratings unique
myclothes['Rating'].unique()
array(['satisfied', 'very satisfied', 'neutral', 'unsatisfied',
       'very unsatisfied'], dtype=object)
# Convert to categorical with appropropriate order
myclothes['Rating'] = pd.Categorical(
                      clothes['Rating'], ['very unsatisfied','unsatisfied',
                      'neutral','satisfied','very satisfied'], 
                      ordered=True)

myclothes['Rating'].unique()
['satisfied', 'very satisfied', 'neutral', 'unsatisfied', 'very unsatisfied']
Categories (5, object): ['very unsatisfied' < 'unsatisfied' < 'neutral' < 'satisfied' < 'very satisfied']
# Create a new variable rating_codes, which contains the numerical codes associated with each category 
myclothes['rating_code'] = myclothes['Rating'].cat.codes
# 1st five rows
myclothes.head()
Clothing ID Age Title Rating Division Name rating_code
0 767 33 NaN satisfied Initmates 3
1 1080 34 NaN very satisfied General 4
2 1077 60 Some major design flaws neutral General 2
3 1049 50 My favorite buy! very satisfied General Petite 4
4 847 47 Flattering shirt very satisfied General 4

Let’s apply Label Encoding to the Ratings columns of our movies dataset

# load the data into a dataframe and view top 5 rows
mymovies = pd.read_excel("Datasets/movie_data_netflix.xlsx")
mymovies.head()
type title country release_year rating duration
0 Movie Norm of the ... United States missing PG 91.071
1 Movie Jandino: Wha... United Kingdom 2016 R 94.516
2 TV Show Transformers... United States 2013 G 1.127
3 TV Show Transformers... United States 2016 TV-14 1.687
4 Movie #realityhigh... United States 2017 TV-14 99.248
# ratings unique
mymovies['rating'].unique()
array(['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED'],
      dtype=object)
# Convert to categorical with appropropriate order
mymovies['rating'] = pd.Categorical(
                     mymovies['rating'] , 
                     ['G','PG','PG-13','TV-14','R','UNRATED', 'NOT RATED'],                        ordered=True)

mymovies['rating'].unique()
['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED']
Categories (7, object): ['G' < 'PG' < 'PG-13' < 'TV-14' < 'R' < 'UNRATED' < 'NOT RATED']
# Create a new variable rating_codes, which contains the numerical codes associated with each category 
mymovies['rating_code'] = mymovies['rating'].cat.codes
# 1st five rows
mymovies.head()
type title country release_year rating duration rating_code
0 Movie Norm of the ... United States missing PG 91.071 1
1 Movie Jandino: Wha... United Kingdom 2016 R 94.516 4
2 TV Show Transformers... United States 2013 G 1.127 0
3 TV Show Transformers... United States 2016 TV-14 1.687 3
4 Movie #realityhigh... United States 2017 TV-14 99.248 3

6.2 One Hot Encoding

We have seen how label encoding can be useful for ordinal categorical variables. But sometimes we need a different approach. This could be because:

  1. We have a nominal categorical variable (like breed of Dog), so it doesn't really make sense to assign numbers like 0,1,2,3,4,5 to our categories, as this could create an order among the species that is not present.

  2. We have an ordinal categorical variable but we don't want to assume that there's equal spacing between categories.

Here comes One-Hot Encoding (OHE). With OHE, we essentially create a new binary variable for each of the categories within our original variable.

OHE is useful when managing nominal variables because it encodes the variable without creating an order among the categories.

To perform OHE on a variable within a pandas dataframe, we can use .get_dummies() method which creates a binary or "dummy" variable for each category. We can assign the columns to be encoded in the columns parameter, and set the data parameter to the dataset we intend to alter. The pd.get_dummies() method will also work on data types other than category..

Notice that when using pd.dummies(), we are effectively creating a new dataframe that contains a different set of variables to the original dataframe.
By passing in the dataset and column that we want to encode into pd.get_dummies(), we have created a new dataframe that contains three new binary variables with values of 1 for True and 0 for False. Also note that we haven't assigned any weighting to our nominal variable.

It is important to note that OHE works best when we do not create too many additional variables, as increasing the dimensionality of our dataframe can create problems when working with certain machine learning models

Let’s look at a dataset of Cereal Ratings from Nestle

# Load and review the cereals dataset
cereal = pd.read_csv("Datasets/nestle_cereals_data.csv")
cereal.head()
id name mfr type fiber rating shelf vitamins coupons price
0 0 100% Bran Nestle C 10.0 68.40 top 25 4 4.06
1 1 100% Natural Bran Quaker Oats C 2.0 33.98 top 0 1 4.62
2 2 All-Bran Kelloggs C 9.0 59.43 top 25 3 4.39
3 3 All-Bran with Extra Fiber Kelloggs C 14.0 93.70 top 25 2 6.18
4 4 Almond Delight Ralston Purina C 1.0 34.38 top 25 2 6.05
# Dataset description
cereal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   id        77 non-null     int64  
 1   name      77 non-null     object 
 2   mfr       77 non-null     object 
 3   type      77 non-null     object 
 4   fiber     77 non-null     float64
 5   rating    77 non-null     float64
 6   shelf     77 non-null     object 
 7   vitamins  77 non-null     int64  
 8   coupons   77 non-null     int64  
 9   price     77 non-null     float64
dtypes: float64(3), int64(3), object(4)
memory usage: 6.1+ KB

Let’s look at unique manufacturers (mfr)

# unique cereal manufacturers
cereal['mfr'].unique()
array(['Nestle', 'Quaker Oats', 'Kelloggs', 'Ralston Purina',
       'General Mills', 'Post', 'American Home'], dtype=object)
# count of uniquecereal manufacturers
len(cereal['mfr'].unique())
7
# One-Hot Encode the mfr variable with pd.get_dummies()
cereal_dummy = pd.get_dummies(cereal["mfr"],
                              dtype = int , 
                              prefix = "mfr",
                              drop_first=False)
cereal_dummy.head()
mfr_American Home mfr_General Mills mfr_Kelloggs mfr_Nestle mfr_Post mfr_Quaker Oats mfr_Ralston Purina
0 0 0 0 1 0 0 0
1 0 0 0 0 0 1 0
2 0 0 1 0 0 0 0
3 0 0 1 0 0 0 0
4 0 0 0 0 0 0 1
# Concatenate the one-hot-encoded series to the original dataframe
cereal = pd.concat([cereal, cereal_dummy], axis=1)
cereal.head()
id name mfr type fiber rating shelf vitamins coupons price mfr_American Home mfr_General Mills mfr_Kelloggs mfr_Nestle mfr_Post mfr_Quaker Oats mfr_Ralston Purina
0 0 100% Bran Nestle C 10.0 68.40 top 25 4 4.06 0 0 0 1 0 0 0
1 1 100% Natural Bran Quaker Oats C 2.0 33.98 top 0 1 4.62 0 0 0 0 0 1 0
2 2 All-Bran Kelloggs C 9.0 59.43 top 25 3 4.39 0 0 1 0 0 0 0
3 3 All-Bran with Extra Fiber Kelloggs C 14.0 93.70 top 25 2 6.18 0 0 1 0 0 0 0
4 4 Almond Delight Ralston Purina C 1.0 34.38 top 25 2 6.05 0 0 0 0 0 0 1
  • 7 Columns have been created , 1 per manufacturer category...

  • Column Names have a pre-fix mfr_ denoting this is the dummy codec column and value of “1” implies True i.e. mfr_Post = 1 implies manufacturer is “Post”.

  • Notice also that the original “mfr” column still existing in the new DataFrame

  • We use the “drop_first” = True when we want to build ML models. This ensures that the “Reference Category” is not repeated and all other categories are coded w.r.t this category

cereal.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77 entries, 0 to 76
Data columns (total 17 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   id                  77 non-null     int64  
 1   name                77 non-null     object 
 2   mfr                 77 non-null     object 
 3   type                77 non-null     object 
 4   fiber               77 non-null     float64
 5   rating              77 non-null     float64
 6   shelf               77 non-null     object 
 7   vitamins            77 non-null     int64  
 8   coupons             77 non-null     int64  
 9   price               77 non-null     float64
 10  mfr_American Home   77 non-null     int32  
 11  mfr_General Mills   77 non-null     int32  
 12  mfr_Kelloggs        77 non-null     int32  
 13  mfr_Nestle          77 non-null     int32  
 14  mfr_Post            77 non-null     int32  
 15  mfr_Quaker Oats     77 non-null     int32  
 16  mfr_Ralston Purina  77 non-null     int32  
dtypes: float64(3), int32(7), int64(3), object(4)
memory usage: 8.2+ KB

7 Project 1 : Used Car Prices

We are given a dataset that contains data around used car prices.

7.1 Step 1 : Data Ingestion

# Load the dataset
auto_data = pd.read_excel("Datasets/auto_mpg_3.xlsx")
auto_data.head()
id auto_make auto_body-style auto_length auto_width auto_city-mpg auto_price auto_num_cyl auto_two_door auto_engine_size
0 0 alfa-romero convertible 168.8 64.1 21 13495 4 1 medium
1 1 alfa-romero convertible 168.8 64.1 21 16500 4 1 medium
2 2 alfa-romero hatchback 171.2 65.5 19 16500 6 1 medium
3 3 audi sedan 176.6 66.2 24 13950 4 0 medium
4 4 audi sedan 176.6 66.4 18 17450 5 0 medium

7.2 Step 2 : Data Understanding

# Dataset information
auto_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   id                201 non-null    int64  
 1   auto_make         201 non-null    object 
 2   auto_body-style   201 non-null    object 
 3   auto_length       201 non-null    float64
 4   auto_width        201 non-null    float64
 5   auto_city-mpg     201 non-null    int64  
 6   auto_price        201 non-null    int64  
 7   auto_num_cyl      201 non-null    int64  
 8   auto_two_door     201 non-null    object 
 9   auto_engine_size  201 non-null    object 
dtypes: float64(2), int64(4), object(4)
memory usage: 15.8+ KB
  • The dataset consists of 200 observations across 10 variables

  • id , automobile city mileage , automobile price and automobile # of cylinders are Integer type

  • automobile length and automobile width are Float type

  • automobile make , automobile body style , # of doors and engine size are Categorical type

7.3 Step 3 : Data Summary

# Tabular view of Data
# Creating the Data Dictionary with first column being datatype.
Data_dict = pd.DataFrame(auto_data.dtypes)
# Identifying unique values . For this I've used nunique() which returns unique elements in the object.
Data_dict['UniqueVal'] = auto_data.nunique()
# Identifying the missing values from the dataset.
Data_dict['MissingVal'] = auto_data.isnull().sum()
# Percentage of Missing Values
Data_dict['Percent Missing'] = round(auto_data.isnull().sum()/len(auto_data)*100, 2)
# identifying count of the variable.
Data_dict['Count'] = auto_data.count()
# Renaming the first column using rename()
Data_dict = Data_dict.rename(columns = {0:'DataType'})
Data_dict
DataType UniqueVal MissingVal Percent Missing Count
id int64 201 0 0.0 201
auto_make object 22 0 0.0 201
auto_body-style object 5 0 0.0 201
auto_length float64 73 0 0.0 201
auto_width float64 43 0 0.0 201
auto_city-mpg int64 29 0 0.0 201
auto_price int64 186 0 0.0 201
auto_num_cyl int64 7 0 0.0 201
auto_two_door object 3 0 0.0 201
auto_engine_size object 3 0 0.0 201
  • Our dataset consists of 200 rows (observations) across 10 columns (features)

  • 4 variables are “Categorical : types = object”

    • Auto Make : 5 sub-categories

    • Body Style : 5731 sub-categories

    • Doors : 3 sub-categories

    • Engine Size : 3 sub-categories

  • 4 variables are “Integer : types = int64” and 2 variables are “Float : types = float64”

    • ID : This is a redundant variable

    • Length , Width , Auto Price and City MPG may not need type conversion

    • Number of Cylinders : This can be converted to categorical

7.4 Step 4 : Data type conversion

  1. Looking at the values in price , we can convert this to a float for consistency
# convert int into to float
auto_data['auto_price'] = auto_data['auto_price'].astype('float')

# review after type conversion
auto_data['auto_price'].dtype
dtype('float64')
  1. Convert the engine_size variable to the category data type with an order of ['small', 'medium', 'large'], and check the order with the .unique() method.
# how many levels in engine size
auto_data["auto_engine_size"].unique()
array(['medium', 'small', 'large'], dtype=object)
# convert to categorical with appropriate order
auto_data['auto_engine_size'] = pd.Categorical(
                         auto_data['auto_engine_size'], 
                         ['small','medium' , 'large'], 
                         ordered=True)
# how many levels in engine size
auto_data["auto_engine_size"].unique()
['medium', 'small', 'large']
Categories (3, object): ['small' < 'medium' < 'large']
  1. Create a new variable called engine_codes containing numerical codes associated with each category in the engine_size variable with the .cat.codes accessor.Check the new values with the .head() method
# Create a new variable engine_codes, which contains the numerical codes associated with each category 
auto_data['engine_codes'] = auto_data['auto_engine_size'].cat.codes
# Check
auto_data.describe()
id auto_length auto_width auto_city-mpg auto_price auto_num_cyl engine_codes
count 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000 201.000000
mean 102.900498 174.200995 65.889055 25.179104 13207.129353 4.363184 1.000000
std 59.234956 12.322175 2.101471 6.423220 7947.066342 1.059452 0.223607
min 0.000000 141.100000 60.300000 13.000000 5118.000000 2.000000 0.000000
25% 53.000000 166.800000 64.100000 19.000000 7775.000000 4.000000 1.000000
50% 103.000000 173.200000 65.500000 24.000000 10295.000000 4.000000 1.000000
75% 154.000000 183.500000 66.600000 30.000000 16500.000000 4.000000 1.000000
max 204.000000 208.100000 72.000000 49.000000 45400.000000 12.000000 2.000000
  1. One-Hot Encode the body-style category in the auto dataframe & check with .head()
# One-Hot Encode the body_style variable with pd.get_dummies()
auto_data_dummies = pd.get_dummies(data = auto_data , 
                                   columns = ['auto_body-style'],
                                   dtype = int ,
                                   prefix = "mfr",
                                   drop_first=False)
auto_data_dummies.head()
id auto_make auto_length auto_width auto_city-mpg auto_price auto_num_cyl auto_two_door auto_engine_size engine_codes mfr_convertible mfr_hardtop mfr_hatchback mfr_sedan mfr_wagon
0 0 alfa-romero 168.8 64.1 21 13495.0 4 1 medium 1 1 0 0 0 0
1 1 alfa-romero 168.8 64.1 21 16500.0 4 1 medium 1 1 0 0 0 0
2 2 alfa-romero 171.2 65.5 19 16500.0 6 1 medium 1 0 0 1 0 0
3 3 audi 176.6 66.2 24 13950.0 4 0 medium 1 0 0 0 1 0
4 4 audi 176.6 66.4 18 17450.0 5 0 medium 1 0 0 0 1 0
# Concatenate the one-hot-encoded series to the original dataframe
auto_data = pd.concat([auto_data, auto_data_dummies], axis=1)
auto_data.head()
id auto_make auto_body-style auto_length auto_width auto_city-mpg auto_price auto_num_cyl auto_two_door auto_engine_size ... auto_price auto_num_cyl auto_two_door auto_engine_size engine_codes mfr_convertible mfr_hardtop mfr_hatchback mfr_sedan mfr_wagon
0 0 alfa-romero convertible 168.8 64.1 21 13495.0 4 1 medium ... 13495.0 4 1 medium 1 1 0 0 0 0
1 1 alfa-romero convertible 168.8 64.1 21 16500.0 4 1 medium ... 16500.0 4 1 medium 1 1 0 0 0 0
2 2 alfa-romero hatchback 171.2 65.5 19 16500.0 6 1 medium ... 16500.0 6 1 medium 1 0 0 1 0 0
3 3 audi sedan 176.6 66.2 24 13950.0 4 0 medium ... 13950.0 4 0 medium 1 0 0 0 1 0
4 4 audi sedan 176.6 66.4 18 17450.0 5 0 medium ... 17450.0 5 0 medium 1 0 0 0 1 0

5 rows × 26 columns

8 Summary of learning:

  1. We discovered the different types of variables you will encounter when working with data and their corresponding data types in Python.

  2. We explored datasets with .head() and info().

  3. We assessed categories within variables with the .unique() method.

  4. We practiced ways to check the data type of variables applying .dtypes and .info() methods.

  5. We altered data with the .replace() method.

  6. We learned how to change the data types of variables using the .astype() method.

  7. We investigated the pandas category data type.

  8. We developed Label Encoding and One-Hot Encoding skills with the .cat.codes accessor and pd.get_dummies()