# Load the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

Statistics_Chapter_1_Variable_Types
1 Goal
Review the common variable types found in data, define them, and learn how to store them in Python using the pandas library.
Identifying variable types and storing them properly is an important first step for any statistical analysis. For one, different variable types require different methods for summarizing data and running hypothesis tests. Understanding your data will help you choose a method to answer your questions! Python also cannot analyze your data unless it can interpret the values correctly. If you store numerical values as strings, for example, you cannot calculate an average.
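The point about numbers stored as strings can be sketched as follows. This is a minimal illustration, assuming a small made-up Series named scores_as_text; pd.to_numeric is one common way to convert text to numbers.

```python
import pandas as pd

# Hypothetical scores that were read in as text rather than numbers
scores_as_text = pd.Series(["90", "85", "70"])
print(scores_as_text.dtype)  # a text dtype, not a numeric one

# Convert the strings to numbers before summarizing
scores = pd.to_numeric(scores_as_text)
print(scores.dtype)

# The average is now a meaningful numeric result
average_score = scores.mean()
print(average_score)
```

Once the values are numeric, summaries such as mean() behave as expected.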
At the end of this module, you will be able to:
Identify the following variable types in a dataset:
Quantitative variables: continuous vs. discrete
Categorical variables: nominal vs. ordinal vs. binary
Load data in Python and check whether assigned data types match variable types
Encode categorical variables in Python using categorical encoding and one-hot encoding
Prerequisites: basic knowledge of Python and pandas
2 Variable Types
In tabular data (e.g., a spreadsheet), variables are represented by the columns of the spreadsheet. The types of variables within our dataset will have a great impact on the insights we can gain from our data. It is therefore important to understand variable types and how different variables offer different perspectives on our data.
Generally, variables come in two varieties: categorical and quantitative.
Categorical variables group observations into separate categories that can be ordered or unordered.
Quantitative variables, on the other hand, are expressed numerically, whether as a count or a measurement.
2.0.1 Quantitative Variables
We can think of quantitative variables as any information about an observation that can only be described with numbers. Quantitative variables are generally counts or measurements of something (e.g., the number of points earned in a game, or height). There are two types of quantitative variables, discrete and continuous, and they serve different functions in a dataset.
Discrete Variables: Discrete quantitative variables are numeric values that represent counts and can only take on integer values. They represent whole units that cannot be broken down into smaller pieces, and as such cannot be meaningfully expressed with decimals or fractions. Examples of discrete variables are the number of children in a person's family or the number of coin flips a person makes.
Continuous Variables: Continuous quantitative variables are numeric measurements that can be expressed with decimal precision. Theoretically, continuous variables can take on infinitely many values within a given range. Examples of continuous variables are length, weight, and age, which can all be described with decimal values.
Sometimes the line between discrete and continuous variables can be a bit blurry. For example, age with decimal values is a continuous variable, but age rounded to the nearest whole year is, by definition, discrete. The precision with which a quantitative variable is recorded can therefore determine how we classify it.
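A rough sketch of how this distinction tends to show up in pandas: counts are usually stored as integers and measurements as floats. The dataframe and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical observations: one count (discrete) and one measurement (continuous)
people = pd.DataFrame({
    "num_children": [0, 2, 1, 3],            # counts: whole numbers only
    "age_years": [34.6, 41.2, 29.9, 52.4],   # measurement with decimals
})
print(people.dtypes)

# Recording age to the nearest whole year makes it behave like a discrete variable
people["age_rounded"] = people["age_years"].round().astype("int64")
print(people["age_rounded"].tolist())
```

The same quantity (age) ends up with a different data type depending on the precision at which it is recorded.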
2.0.2 Categorical Variables
Categorical variables differ from quantitative variables in that they focus on the different ways data can be grouped rather than counted or measured. With categorical variables, we want to understand how the observations in our dataset can be grouped and separated from one another based on their attributes. There are two main types of categorical variables, ordinal and nominal, and binary variables are a special case of nominal variables.
Ordinal Variables: When the groupings of a categorical variable have a specific order or ranking, it is an ordinal variable. Suppose there was a variable containing responses to the question "Rate your agreement with the statement: The minimum age to drive should be lowered." The response options are "strongly disagree", "disagree", "neutral", "agree", and "strongly agree". Because we can see an order where "strongly disagree" < "disagree" < "neutral" < "agree" < "strongly agree" in relation to agreement, we consider the variable to be ordinal.
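One way to store such an ordering in pandas is an ordered categorical. The sketch below uses the agreement scale from the example above; the responses themselves are made up.

```python
import pandas as pd

# The agreement scale, listed from lowest to highest
agreement_levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical survey responses
responses = pd.Series(["agree", "neutral", "strongly disagree", "agree"])

# ordered=True tells pandas that the categories have a ranking
agreement_series = pd.Series(
    pd.Categorical(responses, categories=agreement_levels, ordered=True)
)

# Integer codes follow the position of each level in agreement_levels
codes = agreement_series.cat.codes.tolist()
print(codes)

# Order-aware operations now work
lowest_response = agreement_series.min()
at_least_neutral = (agreement_series >= "neutral").tolist()
print(lowest_response, at_least_neutral)
```

Without ordered=True, pandas would treat the levels as unordered labels and comparisons such as >= would raise an error.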
Nominal Variables: If there is no apparent order or ranking among the categories of a categorical variable, we refer to it as a nominal variable.
Nominal categorical variables have two or more categories with no relational order. Examples of nominal categories could be states in India, brands of computers, or tribes. None of these variables has an intrinsic ordering that ranks one category as greater than or less than another.
The number of possible values for a nominal variable can be quite large. It's even possible that a nominal categorical variable will take on a unique value for every observation in a dataset, as in the case of unique identifiers such as a name or an email address.
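Because nominal categories have no order, a common way to encode them for analysis is one-hot encoding, where each category gets its own indicator column. A minimal sketch with pd.get_dummies, using made-up computer brands:

```python
import pandas as pd

# Hypothetical nominal variable: brand of computer
computers = pd.DataFrame({"brand": ["Dell", "Apple", "Lenovo", "Apple"]})

# One indicator column per category; no ordering is implied
brand_dummies = pd.get_dummies(computers["brand"], prefix="brand")
print(brand_dummies.columns.tolist())

# Each row belongs to exactly one category
one_category_per_row = bool((brand_dummies.sum(axis=1) == 1).all())
print(one_category_per_row)
```

One-hot encoding is best suited to variables with a modest number of categories; an identifier column with a unique value per row would produce one column per observation.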
Binary Variables: Binary or dichotomous variables are a special kind of nominal variable that has only two categories.
Because a binary variable has only two possible values, the two categories are mutually exclusive. We can imagine a variable that describes whether a picture contains a cat or a dog as a binary variable. In this case, if the picture is not a dog, it must be a cat, and vice versa. Binary variables can also be described with numbers, similar to bits, using the values 0 and 1. Likewise, you may find binary variables containing boolean values of True or False.
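The cat-or-dog example can be sketched with either boolean or 0/1 storage. The labels below are made up, and the name is_dog is only illustrative.

```python
import pandas as pd

# Hypothetical picture labels with exactly two categories
pets = pd.Series(["cat", "dog", "dog", "cat"])

# Boolean representation: True where the picture shows a dog
is_dog = pets == "dog"

# Equivalent 0/1 representation
is_dog_int = is_dog.astype(int)
print(is_dog_int.tolist())

# With 0/1 coding, the mean is the proportion of dogs
proportion_dogs = is_dog_int.mean()
print(proportion_dogs)
```

The 0/1 form is convenient because ordinary numeric summaries then have a direct interpretation as proportions.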
3 Assessing variable types
The first step in working with a dataset is figuring out what kinds of variables (columns in a dataframe) are present, and whether they are quantitative or categorical.
An effective way of taking a peek into a pandas dataframe is to look at the first few rows of the data. This shows us the variable names, gives us a sample of the values in each variable, and helps us understand what types of variables we have.
Sometimes the variable type may be unclear. In these cases, we may need to confirm our assessment by inspecting the data dictionary or using domain knowledge. Reading the associated documentation and applying a little domain knowledge can often save us from false assumptions about our dataset.
Let's inspect a dataset to identify its variable types. We will be working with the movies dataframe, which comprises television shows and movies hosted on the Netflix platform in 2019.
Let's inspect the first 5 rows using the head() method.
3.1 Data ingestion
# Load the dataset into a dataframe named movies
movies = pd.read_excel("Datasets/movie_data_netflix.xlsx")
# Top 5 rows
movies.head()

| | type | title | country | release_year | rating | duration |
|---|---|---|---|---|---|---|
| 0 | Movie | Norm of the ... | United States | missing | PG | 91.071 |
| 1 | Movie | Jandino: Wha... | United Kingdom | 2016 | R | 94.516 |
| 2 | TV Show | Transformers... | United States | 2013 | G | 1.127 |
| 3 | TV Show | Transformers... | United States | 2016 | TV-14 | 1.687 |
| 4 | Movie | #realityhigh... | United States | 2017 | TV-14 | 99.248 |
3.2 Data review
3.2.1 Information
A useful way of assessing data types in pandas is the .info() method, which reports the number of non-null values in each column along with each column's data type, a tally of the data types, and the dataframe's memory usage.
# Dataset information
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 type 6234 non-null object
1 title 6234 non-null object
2 country 5758 non-null object
3 release_year 6234 non-null object
4 rating 6234 non-null object
5 duration 6234 non-null float64
dtypes: float64(1), object(5)
memory usage: 292.3+ KB
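Our dataset consists of 6234 rows (observations) across 6 columns (features).
Five variables are stored with the object data type, which pandas uses for text or mixed values, and one variable (duration) is numeric with the float64 data type. Note that an object data type does not by itself make a variable categorical: release_year is stored as object even though years are numbers, most likely because the column contains text such as the "missing" entry visible in the first row above.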
3.2.2 Data types
In pandas, the data type of each dataframe column can be inspected with the .dtypes attribute.
# Data types
movies.dtypes

type object
title object
country object
release_year object
rating object
duration float64
dtype: object
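When a dataframe has many columns, it can help to split them by assigned data type programmatically. The sketch below uses select_dtypes on a tiny stand-in dataframe (named sample, with values copied from the rows shown earlier) rather than the full movies data.

```python
import pandas as pd

# Small stand-in for the movies dataframe (an assumption for illustration)
sample = pd.DataFrame({
    "type": ["Movie", "TV Show"],
    "release_year": ["missing", "2013"],   # stored as text, as in the original data
    "duration": [91.071, 1.127],
})

# Columns pandas treats as numeric
numeric_columns = sample.select_dtypes(include="number").columns.tolist()

# Everything else (text and other non-numeric columns)
non_numeric_columns = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)
print(non_numeric_columns)
```

Any column that lands in the non-numeric list but describes a quantity, such as release_year here, is worth a closer look.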
3.2.3 Unique values
Having looked at the first few rows of your data and established that a variable is categorical, an important next step is to determine whether the categories are ordered or not.
We use .unique() to view the distinct values of a variable and judge whether they have a natural order.
We use .nunique() to count the number of distinct values.
# Unique ratings
movies["rating"].unique()

array(['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED'],
dtype=object)
# Number of unique ratings
movies["rating"].nunique()

7
# Unique countries
movies["country"].unique()

array(['United States', 'United Kingdom', 'Spain', 'Bulgaria', 'Chile',
nan, 'Netherlands', 'France', 'Thailand', 'China', 'Belgium',
'India', 'Pakistan', 'Canada', 'South Korea', 'Denmark', 'Turkey',
'Brazil', 'Indonesia', 'Ireland', 'Hong Kong', 'Mexico', 'Vietnam',
'Nigeria', 'Japan', 'Norway', 'Lebanon', 'Cambodia', 'Russia',
'Poland', 'Israel', 'Italy', 'Germany', 'United Arab Emirates',
'Egypt', 'Taiwan', 'Australia', 'Czech Republic', 'Argentina',
'Switzerland', 'Malaysia', 'Philippines', 'Serbia', 'Colombia',
'Singapore', 'Peru', 'South Africa', 'New Zealand', 'Venezuela',
'Saudi Arabia', 'Iceland', 'Austria', 'Uruguay', 'Finland',
'Ghana', 'Iran', 'Sweden', 'Hungary', 'Guatemala', 'Portugal',
'Paraguay', 'Somalia', 'Ukraine', 'Dominican Republic', 'Romania',
'Slovenia', 'Croatia', 'Bangladesh', 'Soviet Union', 'Georgia',
'West Germany', 'Mauritius', 'Cyprus'], dtype=object)
# Number of unique countries
movies["country"].nunique()

72
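Note that the array returned by unique() for country includes nan, while nunique() reports 72. By default nunique() ignores missing values, whereas unique() lists them. A minimal sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical country values with one missing entry
country = pd.Series(["United States", "Spain", np.nan, "Spain"])

n_listed = len(country.unique())                   # nan is listed as a value
n_unique = country.nunique()                       # nan is ignored by default
n_unique_with_nan = country.nunique(dropna=False)  # nan is counted

print(n_listed, n_unique, n_unique_with_nan)

# value_counts shows how often each value occurs; dropna=False keeps nan
counts = country.value_counts(dropna=False)
print(counts)
```

value_counts() is often a useful companion to unique(), since it also shows how the observations are spread across categories.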
3.2.4 Missing values
We can chain the isnull() and sum() methods to count the missing values in each column.
# Missing values
movies.isnull().sum()

type 0
title 0
country 476
release_year 0
rating 0
duration 0
dtype: int64
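One caveat: isnull() only detects true missing values (NaN/None). Placeholder text, such as the "missing" entry seen in release_year in the first rows of the data, is counted as a regular value. The sketch below, on a made-up Series, shows one possible way to surface such entries with pd.to_numeric and errors="coerce"; whether coercion is appropriate for the real dataset is a judgment call.

```python
import pandas as pd

# Hypothetical release years stored as text, with a placeholder string
release_year = pd.Series(["missing", "2016", "2013"])

# The placeholder is ordinary text, so it is not reported as missing
missing_before = int(release_year.isnull().sum())

# errors="coerce" turns values that cannot be parsed as numbers into NaN
release_year_numeric = pd.to_numeric(release_year, errors="coerce")
missing_after = int(release_year_numeric.isnull().sum())

print(missing_before, missing_after)
```

After coercion the column is numeric, and the placeholder shows up in the missing-value count.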
Here is a short snippet that summarizes the key attributes of our dataset.
# Tabular view of the data
# Create the data dictionary; its first column holds each variable's data type
Data_dict = pd.DataFrame(movies.dtypes)
# Number of distinct non-null values per column, from nunique()
Data_dict['UniqueVal'] = movies.nunique()
# Number of missing values per column
Data_dict['MissingVal'] = movies.isnull().sum()
# Percentage of missing values per column
Data_dict['Percent Missing'] = (movies.isnull().sum() / len(movies) * 100).round(2)
# Number of non-null values per column
Data_dict['Count'] = movies.count()
# Rename the first column, which pandas labels 0 by default
Data_dict = Data_dict.rename(columns={0: 'DataType'})
Data_dict

| | DataType | UniqueVal | MissingVal | Percent Missing | Count |
|---|---|---|---|---|---|
| type | object | 2 | 0 | 0.00 | 6234 |
| title | object | 5731 | 0 | 0.00 | 6234 |
| country | object | 72 | 476 | 7.64 | 5758 |
| release_year | object | 73 | 0 | 0.00 | 6234 |
| rating | object | 7 | 0 | 0.00 | 6234 |
| duration | float64 | 5517 | 0 | 0.00 | 6234 |
Our dataset consists of 6234 rows (observations) across 6 columns (features).
5 variables are stored with the object data type:
Type: 2 unique values
Title: 5731 unique values
Country: 72 unique values
Release Year: 73 unique values
Rating: 7 unique values
The duration variable is numeric (float64).
The country column has 476 missing values.
Making a habit of checking each variable name and what it describes in your dataset, then comparing that to the data type assigned to it, will save you time and help you get the most insight from your data.
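Once the intended variable type of each column is known, the assigned data types can be brought in line with astype. The sketch below uses a small stand-in dataframe and a hypothetical mapping named intended_types; the right mapping for the real dataset depends on the checks described above.

```python
import pandas as pd

# Small stand-in for the movies dataframe (values copied from the rows shown earlier)
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "rating": ["PG", "G", "R"],
    "duration": [91.071, 1.127, 94.516],
})

# Hypothetical mapping from column name to the data type we intend it to have
intended_types = {"type": "category", "rating": "category", "duration": "float64"}

# astype accepts a dict and converts each listed column
converted = sample.astype(intended_types)
print(converted.dtypes)
```

Columns stored with the category data type keep track of their set of categories, which is a natural fit for variables such as type and rating.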