2024-02-20

Introduction to Variability

At the heart of statistics lies the concept of variability, which measures how much data points in a dataset differ from each other and from the average. It’s a fundamental aspect because it provides us with insights into the spread or dispersion within a dataset.

Why Is Variability Important?

  • Decision Making: Understanding variability helps in making informed decisions.

  • Insight into Data Spread: Variability gives us a glimpse into how spread out the data is.

  • Risk Assessment: In finance, variability is closely associated with risk.

Breaking Down the Data

To explore the statistical concept of variability, we will be using a dataset on Credit Card Spending Trends in India. This dataset is sourced from Kaggle and has 26052 observations of 7 variables, including City, Date, Card Type, Expenditure Type, Gender, and Amount. For this project, we will be taking a sample of 1000 cases from the original data. To introduce the key concepts and measures of variability, we will demonstrate:

  • The Range, Interquartile Range (IQR), Variance, and Standard Deviation of the set.

  • The application of these concepts to a real-world dataset on credit card spending.

  • How spending behaviors vary across different demographics and other factors.

The Variance and Standard Deviation of the Dataset

Our credit dataset has a variance of 9203431928.26687 and a standard deviation of 95934.5189609. This implies there is a large spread in our data: the standard deviation is roughly 62% of the sample mean of 154737 reported in the summary further below. But what do these numbers actually measure? Variance (\(s^2\)) and standard deviation (\(s\)) are two key measures of variability that indicate how far individual data points in a dataset are spread out from the mean. To further understand the variability of our data, we will use visualizations, but first, let’s understand the math behind these concepts…

The Math of Sample Variance

  • Sample Variance is the sum of the squared differences from the sample mean, divided by n - 1 (approximately the average squared difference).

\[s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}\]

  • Sample Standard Deviation is the square root of the variance.

\[s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}\]

Breaking Down the Symbols

  • \(\sum_{i=1}^{n}\): This is the summation symbol; it adds up the expression that follows it for every data point, from i = 1 to i = n.

  • n: The sample size

  • \(\bar{x}\): The sample mean

  • \(x_i\): This represents each individual data point in the set

  • \(s^2\): The Sample Variance

  • \(s\): The Sample Standard Deviation

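Notice that the equation for the standard deviation is the square root, \(\sqrt{}\), of the equation for variance. Variance is expressed in squared units (here, squared currency amounts), so taking the square root returns the measure to the same units as the data, which makes it easier to interpret alongside the mean.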
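To make the formulas concrete, here is a minimal Python sketch that applies them step by step. The `amounts` list is a set of hypothetical values chosen for illustration, not rows from the credit card dataset, and the function names are our own. The result is cross-checked against Python's built-in `statistics` module, which also divides by n - 1.

```python
import math
import statistics

# Hypothetical spending amounts (illustrative only, not taken from the dataset).
amounts = [1200.0, 4500.0, 3300.0, 9800.0, 2700.0]


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("sample variance needs at least two data points")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance, in the same units as the data."""
    return math.sqrt(sample_variance(values))


s2 = sample_variance(amounts)
s = sample_std(amounts)
print(f"sample variance s^2 = {s2:.2f}")
print(f"sample std dev  s   = {s:.2f}")

# Cross-check against the standard library, which uses the same n - 1 divisor.
assert math.isclose(s2, statistics.variance(amounts))
assert math.isclose(s, statistics.stdev(amounts))
```

Writing the variance out by hand makes the n - 1 divisor explicit; in practice a library routine is the more convenient choice.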

Variability in Credit Card Spending

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1694   80979  151515  154737  225644  834691

To find the Interquartile Range (IQR), we subtract the 1st Quartile from the 3rd Quartile. This tells us how wide the middle 50% of our data is, and it is more resistant to outliers than the range or the variance. To determine if we have outliers, we want to calculate our upper and lower spending limits. We can find these by:

  • Lower: Q1 - 1.5*IQR

  • Upper: Q3 + 1.5*IQR

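This calculation shows us that the lower limit is -163077 and the upper limit is 469315. (Applying the same formulas to the rounded quartiles printed above gives an IQR of 144665 and limits of roughly -136019 and 442642, so the conclusion does not depend on which pair is used.) Our maximum value is 834691 and our minimum is 1694. The maximum lies far above the upper limit, so it is an outlier; the minimum is above the lower limit, which is negative and therefore not a meaningful bound for spending amounts. Outliers can skew the distribution of the data, which is why it is important to identify them.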
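The outlier check can be sketched in a few lines of Python. This example takes the rounded quartiles, minimum, and maximum exactly as printed in the summary output, which is an assumption: limits computed from unrounded values or from a different random sample will not match these figures exactly. The helper name `tukey_fences` is our own.

```python
# Values copied from the printed five-number summary (rounded).
q1, q3 = 80979.0, 225644.0
minimum, maximum = 1694.0, 834691.0


def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier limits Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


iqr = q3 - q1                   # width of the middle 50% of the data
data_range = maximum - minimum  # full spread, sensitive to extreme values
lower, upper = tukey_fences(q1, q3)

print(f"range = {data_range}, IQR = {iqr}")
print(f"lower limit = {lower}, upper limit = {upper}")
print("maximum is an outlier:", maximum > upper)
print("minimum is an outlier:", minimum < lower)
```

The range is several times wider than the IQR here, which shows how strongly a single extreme value can stretch the range while leaving the IQR unchanged.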

The Unedited Distribution of Spending

The Log Distribution of Spending

Interpretation

The first graph is a little difficult to interpret, since the values are so large and the distribution is heavily skewed to the right by outliers. In the second graph, we take the logarithm of every value, which compresses the long right tail and makes the variability far easier to see. On the log scale we can confirm that the data is unimodal, with the majority of observations clustered around a single peak, and that it is skewed rather than normally distributed. What factors are influencing this distribution?
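The effect of the log transform can be illustrated with a small Python sketch. The `amounts` below are hypothetical right-skewed values, not rows from the dataset, and the skewness function uses the simple moment-based coefficient, which is only one of several common definitions. Base-10 logarithms are an assumption; the skewness of the logged values is the same for any base.

```python
import math

# Hypothetical right-skewed spending amounts (illustrative only).
amounts = [2000.0, 5000.0, 8000.0, 12000.0, 20000.0, 45000.0, 150000.0, 800000.0]


def skewness(values):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    return m3 / m2 ** 1.5


# Take the logarithm of every value; amounts must be strictly positive.
log_amounts = [math.log10(x) for x in amounts]

raw_skew = skewness(amounts)
log_skew = skewness(log_amounts)
print(f"skewness of raw amounts:    {raw_skew:.2f}")
print(f"skewness of logged amounts: {log_skew:.2f}")
```

A positive value indicates a longer right tail. For these toy values the skewness drops noticeably after the transform, which is the same effect the second graph relies on.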

Analyzing Variability by Gender

Variability by Card Type

Average Spending by Card Type

Expenditure Type Pie Chart

## [1] "Bills"         "Grocery"       "Food"          "Fuel"         
## [5] "Entertainment" "Travel"

Spending by Expense Type Visualization

Time Trends in Spending Variability

Code for Seasonal Trends Graph

Summary Interpretations

There are many influencing factors in our data. From our visualizations, we can ascertain that most credit is spent on Food, Bills, and Fuel, and that Gender does not appear to be an influencing factor. We can also see that, on average, spenders with a Platinum card spend higher amounts on credit, and that credit spending peaks in November and March. March is the end of the financial year in India, and November is the start of the festival season, with large celebrations such as Diwali. These could be reasons for increased spending. Because the variance in this dataset is so large, we theorized that there were many influencing factors. However, these variations were extremely difficult to see without transforming and correctly graphing our data. This case study has been an example of why it is essential to dig beyond surface-level interpretations of data.