Statistics_Chapter_1_Variable_Types

1 Goal

Review the common variable types found in data, define each type, and learn how to store them in Python using the pandas library.

Identifying variable types and storing them properly is an important first step for any statistical analysis. For one, different variable types require different methods for summarizing data and running hypothesis tests, so understanding your data helps you choose a method that answers your questions. Python also cannot analyze data that it interprets incorrectly: if numerical values are stored as strings, for example, you cannot calculate an average.
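
The point about strings can be illustrated with a minimal sketch. The height values below are hypothetical and serve only to show the conversion; pd.to_numeric is the pandas function that turns text into numbers.

```python
import pandas as pd

# Hypothetical measurements that were stored as text instead of numbers
heights_as_text = pd.Series(["150", "160", "170"])

# Convert the strings to a numeric dtype before summarizing
heights = pd.to_numeric(heights_as_text)

# The text column is not numeric, the converted column is
print(pd.api.types.is_numeric_dtype(heights_as_text))  # False
print(pd.api.types.is_numeric_dtype(heights))          # True

# The average can now be calculated
print(heights.mean())  # 160.0
```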

At the end of this module, you will be able to:

Identify the following variable types in a dataset:
Quantitative variables: continuous vs. discrete
Categorical variables: nominal vs. ordinal vs. binary
Load data in Python and check whether assigned data types match variable types
Encode categorical variables in Python using categorical encoding and one-hot encoding

Prerequisites: Basic knowledge of Python and pandas

# Load the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

2 Variable Types

In tabular data (e.g., a spreadsheet), variables are represented by the columns. The types of variables in our dataset have a great impact on the insights we can gain from it. It is therefore important to understand variable types and how different variables offer different perspectives on our data.

Generally, variables come in two varieties: categorical and quantitative.

Categorical variables group observations into separate categories that can be ordered or unordered.
Quantitative variables, on the other hand, are expressed numerically, whether as a count or a measurement.

2.0.1 Quantitative Variables

We can think of quantitative variables as any information about an observation that is described with numbers. Quantitative variables are generally counts or measurements of something (e.g., the number of points earned in a game, or height). There are two types of quantitative variables, discrete and continuous, and they serve different functions in a dataset.

Discrete Variables: Discrete quantitative variables are numeric values that represent counts and can only take on integer values. They represent whole units that cannot be broken down into smaller pieces, and so cannot be meaningfully expressed with decimals or fractions. Examples of discrete variables are the number of children in a person's family or the number of coin flips a person makes.

Continuous Variables: Continuous quantitative variables are numeric measurements that can be expressed with decimal precision. Theoretically, continuous variables can take on infinitely many values within a given range. Examples of continuous variables are length, weight, and age, which can all be described with decimal values.

Sometimes the line between discrete and continuous variables can be a bit blurry. For example, age with decimal values is a continuous variable, but age rounded to the nearest whole year is by definition discrete. The precision with which a quantitative variable is recorded can therefore determine how we classify it.
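
As a rough illustration of how recording precision changes the classification, the sketch below stores the same hypothetical ages twice: once with decimals (float64) and once rounded to whole years (int64). The values are invented for the example.

```python
import pandas as pd

# Hypothetical ages recorded with decimal precision (continuous)
age_exact = pd.Series([23.4, 31.8, 45.1])

# The same ages rounded to the nearest whole year (discrete)
age_years = age_exact.round().astype("int64")

print(age_exact.dtype)     # float64
print(age_years.dtype)     # int64
print(age_years.tolist())  # [23, 32, 45]
```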

2.0.2 Categorical Variables

Categorical variables differ from quantitative variables in that they focus on the different ways data can be grouped rather than counted or measured. With categorical variables, we want to understand how the observations in our dataset can be grouped and separated from one another based on their attributes. There are two types of categorical variables: ordinal and nominal.

Ordinal Variables: When the groupings of a categorical variable have a specific order or ranking, it is an ordinal variable. Suppose a variable contains responses to the question "Rate your agreement with the statement: The minimum age to drive should be lowered." The response options are "strongly disagree", "disagree", "neutral", "agree", and "strongly agree". Because the options follow the order "strongly disagree" < "disagree" < "neutral" < "agree" < "strongly agree" in terms of agreement, we consider the variable to be ordinal.
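
One way to store such a variable in pandas is as an ordered categorical. The sketch below is illustrative only: the responses are hypothetical, and the category order is taken from the survey options above.

```python
import pandas as pd

# Response options from the text, listed from lowest to highest agreement
levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# Hypothetical survey responses, used only for illustration
responses = pd.Series(["agree", "neutral", "strongly disagree", "agree"])

# Store the responses as an ordered categorical so pandas knows the ranking
agreement = responses.astype(pd.CategoricalDtype(categories=levels, ordered=True))

# Each response is encoded as its position in `levels`
print(agreement.cat.codes.tolist())      # [3, 2, 0, 3]

# Order-aware operations now work
print(agreement.max())                   # agree
print((agreement > "neutral").tolist())  # [True, False, False, True]
```

Without ordered=True, pandas would treat the categories as nominal and refuse comparisons such as > between them.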

Nominal Variables: If the categories of a categorical variable have no apparent order or ranking, we refer to it as a nominal variable.
Nominal variables have two or more categories with no relational order. Examples include states in India, brands of computers, and tribes. None of these variables has an intrinsic ordering that makes one category greater than or less than another.
The number of possible values for a nominal variable can be quite large. A nominal variable can even take on a unique value for every observation in a dataset, as with unique identifiers such as name or email_address.

Binary Variables: Binary (or dichotomous) variables are a special kind of nominal variable with only two categories.
The two categories are mutually exclusive: every observation falls into exactly one of them. Imagine a variable that describes whether a picture contains a cat or a dog. If the picture is not a dog, it must be a cat, and vice versa. Binary variables can also be stored as numbers, like bits, with values of 0 or 1, or as boolean values of True or False.
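
Nominal and binary variables can be encoded numerically without implying any order. The following is a minimal sketch with a hypothetical dataframe (the brand and animal values are invented): pd.get_dummies performs one-hot encoding, and Series.map turns a two-category variable into 0/1 values.

```python
import pandas as pd

# Hypothetical data: a nominal variable (brand) and a binary variable (animal)
df = pd.DataFrame({
    "brand": ["BrandA", "BrandB", "BrandA"],
    "animal": ["cat", "dog", "cat"],
})

# One-hot encoding: one 0/1 indicator column per brand
one_hot = pd.get_dummies(df["brand"], prefix="brand", dtype=int)
print(list(one_hot.columns))           # ['brand_BrandA', 'brand_BrandB']
print(one_hot["brand_BrandA"].tolist())  # [1, 0, 1]

# Binary encoding: map the two categories to 0 and 1
is_dog = df["animal"].map({"cat": 0, "dog": 1})
print(is_dog.tolist())                 # [0, 1, 0]
```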

3 Assessing variable types

The first step to working with datasets is figuring out what kind of variables (columns in a dataframe) are present, and whether they are quantitative or categorical.

An effective way of taking a peek into a pandas dataframe is to look at the first few rows of the data. This shows us the variable names in our dataset and a sample of the values in each variable, which helps us understand what types of variables we have.

Sometimes the variable type of the data may be unclear. In these cases, we may need to confirm our assessment by inspecting the data dictionary or using domain knowledge. Reading the associated documentation can often save us from false assumptions about our dataset.

Let's inspect a dataset to identify its variable types. We will be working with the movies dataframe, comprising television shows and movies hosted on the Netflix platform in 2019.

Let’s inspect the top 5 rows using the head() method.

3.1 Data ingestion

# Load the dataset into a dataframe movies
movies = pd.read_excel("Datasets/movie_data_netflix.xlsx")

# Top 5 rows
movies.head()

	type	title	country	release_year	rating	duration
0	Movie	Norm of the ...	United States	missing	PG	91.071
1	Movie	Jandino: Wha...	United Kingdom	2016	R	94.516
2	TV Show	Transformers...	United States	2013	G	1.127
3	TV Show	Transformers...	United States	2016	TV-14	1.687
4	Movie	#realityhigh...	United States	2017	TV-14	99.248

3.2 Data review

3.2.1 Information

A useful way of assessing data types in pandas is the .info() method, which prints the number of non-null values in each column along with each column's data type, a tally of the data types, and the dataframe's memory usage.

# Dataset information
movies.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234 entries, 0 to 6233
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   type          6234 non-null   object 
 1   title         6234 non-null   object 
 2   country       5758 non-null   object 
 3   release_year  6234 non-null   object 
 4   rating        6234 non-null   object 
 5   duration      6234 non-null   float64
dtypes: float64(1), object(5)
memory usage: 292.3+ KB

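Our dataset consists of 6234 rows (observations) across 6 columns (features).
5 columns are stored as object (the pandas dtype for text and mixed values), which suggests categorical variables, and 1 column, duration, is stored as float64 (numeric). Note that release_year is stored as object even though years are numbers; the first row shown by head() holds the text "missing" in that column.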
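
The head() output shows the text "missing" in release_year, which is one reason a column of years can end up with the object dtype. The sketch below shows one possible way to handle this, using a small hypothetical series modeled on the first five rows rather than the real file: pd.to_numeric with errors="coerce" turns unparseable entries into NaN, and the nullable Int64 dtype keeps the years as integers.

```python
import pandas as pd

# Hypothetical sample modeled on the first five rows of release_year
release_year_raw = pd.Series(["missing", 2016, 2013, 2016, 2017], dtype=object)

# Entries that cannot be parsed as numbers (such as "missing") become NaN
release_year = pd.to_numeric(release_year_raw, errors="coerce")

# The nullable integer dtype stores whole years alongside missing values
release_year = release_year.astype("Int64")

print(release_year.dtype)         # Int64
print(release_year.isna().sum())  # 1
```

After the conversion, the placeholder is counted as a missing value, which it was not while stored as text.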

3.2.2 Data types

The data types of a pandas dataframe can also be inspected using the .dtypes attribute.

# Data types
movies.dtypes

type             object
title            object
country          object
release_year     object
rating           object
duration        float64
dtype: object

3.2.3 Unique values

After looking at the first few rows of your data and establishing that a variable is categorical, an important next step is to determine whether the categories are ordered.

We use .unique() to view the distinct values of a variable and judge whether they have a natural order.

We use .nunique() to count the number of unique values.

# Unique ratings
movies["rating"].unique()

array(['PG', 'R', 'G', 'TV-14', 'PG-13', 'UNRATED', 'NOT RATED'],
      dtype=object)

# Number of Unique Ratings
movies["rating"].nunique()

# Unique countries
movies["country"].unique()

array(['United States', 'United Kingdom', 'Spain', 'Bulgaria', 'Chile',
       nan, 'Netherlands', 'France', 'Thailand', 'China', 'Belgium',
       'India', 'Pakistan', 'Canada', 'South Korea', 'Denmark', 'Turkey',
       'Brazil', 'Indonesia', 'Ireland', 'Hong Kong', 'Mexico', 'Vietnam',
       'Nigeria', 'Japan', 'Norway', 'Lebanon', 'Cambodia', 'Russia',
       'Poland', 'Israel', 'Italy', 'Germany', 'United Arab Emirates',
       'Egypt', 'Taiwan', 'Australia', 'Czech Republic', 'Argentina',
       'Switzerland', 'Malaysia', 'Philippines', 'Serbia', 'Colombia',
       'Singapore', 'Peru', 'South Africa', 'New Zealand', 'Venezuela',
       'Saudi Arabia', 'Iceland', 'Austria', 'Uruguay', 'Finland',
       'Ghana', 'Iran', 'Sweden', 'Hungary', 'Guatemala', 'Portugal',
       'Paraguay', 'Somalia', 'Ukraine', 'Dominican Republic', 'Romania',
       'Slovenia', 'Croatia', 'Bangladesh', 'Soviet Union', 'Georgia',
       'West Germany', 'Mauritius', 'Cyprus'], dtype=object)

# Number of unique countries
movies["country"].nunique()

3.2.4 Missing values

We can chain the isnull() and sum() methods to count the missing values in each column.

# Missing values
movies.isnull().sum()

type              0
title             0
country         476
release_year      0
rating            0
duration          0
dtype: int64

Here is a short snippet that summarizes the key attributes of our dataset in one table.

# Tabular summary of the data
# Build the data dictionary; its first column holds each column's data type
Data_dict = pd.DataFrame(movies.dtypes)
# Number of distinct non-null values per column (nunique() returns a count, not the values)
Data_dict['UniqueVal'] = movies.nunique()
# Number of missing values per column
Data_dict['MissingVal'] = movies.isnull().sum()
# Percentage of missing values per column, rounded to 2 decimals
Data_dict['Percent Missing'] = (movies.isnull().sum() / len(movies) * 100).round(2)
# Number of non-null values per column
Data_dict['Count'] = movies.count()
# Rename the first column (default name 0) to 'DataType'
Data_dict = Data_dict.rename(columns={0: 'DataType'})
Data_dict

	DataType	UniqueVal	MissingVal	Percent Missing	Count
type	object	2	0	0.00	6234
title	object	5731	0	0.00	6234
country	object	72	476	7.64	5758
release_year	object	73	0	0.00	6234
rating	object	7	0	0.00	6234
duration	float64	5517	0	0.00	6234

Our dataset consists of 6234 rows (observations) across 6 columns (features)
5 variables are stored as object, the dtype pandas uses for text
- Type : 2 unique values
- Title : 5731 unique values
- Country : 72 unique values
- Release Year : 73 unique values
- Rating : 7 unique values
The duration variable is numeric (float64)
The country column has 476 missing values

Make a habit of checking each variable name and what it describes, then comparing that to the data type assigned to it. This saves time and helps you get the most insight from your data.
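
This habit can be turned into a small check. The sketch below is only one possible approach: it uses a hypothetical miniature of the movies dataframe and a hand-written expected_numeric mapping (an assumption, not part of the dataset) to flag columns whose stored dtype disagrees with the variable type we expect.

```python
import pandas as pd

# Hypothetical miniature of the movies dataframe
sample = pd.DataFrame({
    "type": ["Movie", "TV Show"],
    "release_year": ["missing", "2013"],
    "duration": [91.071, 1.127],
})

# Our own judgment of which variables should be numeric (an assumption)
expected_numeric = pd.Series(
    {"type": False, "release_year": True, "duration": True}
)

# What pandas actually stored for each column
stored_numeric = pd.Series(
    {col: pd.api.types.is_numeric_dtype(sample[col]) for col in sample.columns}
)

# Flag the columns where expectation and storage disagree
check = pd.DataFrame({
    "expected_numeric": expected_numeric,
    "stored_numeric": stored_numeric,
})
check["mismatch"] = check["expected_numeric"] != check["stored_numeric"]
print(check)
```

Here only release_year is flagged, because it is expected to be numeric but is stored as text.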