2024-02-01

Normal Distribution and Normality

In this presentation, we’ll delve into the fascinating world of Normal Distributions, also known as Gaussian Distributions. We’ll explore their key characteristics, how well they describe data science salaries, and how they can inform our understanding of various real-world phenomena. Our dataset, Jobs and Salaries in Data Science:

  • Is sourced from Kaggle

  • Is updated annually

  • Has 8080 observations of 12 variables

  • Represents a sample of Data Scientists in the United States

What is Normal Distribution?

The Normal Distribution, often nicknamed the “bell curve,” is a fundamental concept in statistics and probability. Its distinctive bell shape reflects the likelihood of different values occurring within a dataset. From human heights to exam scores, the Normal Distribution pops up in various contexts, making it a powerful tool for data analysis. Key features can help us identify it:

  • Symmetry: The distribution is symmetrical around the mean, with equal probability of values occurring on either side.

  • Unimodality: It has a single peak at the mean, indicating the most frequent value.

  • Tail behavior: The distribution tails off gradually towards both positive and negative infinity.

Elements of Normal Distribution

The Normal Distribution has specific properties that make it predictable and informative. In particular, the share of the data that lies within a given number of standard deviations of the mean is fixed, a pattern known as the empirical rule:

  • The total area under the curve equals 1, representing all possible values.

  • About 68% of the data falls within 1 standard deviation of the mean (\(\mu \pm \sigma\))

  • About 95% of the data falls within 2 standard deviations of the mean (\(\mu \pm 2\sigma\))

  • About 99.7% of the data falls within 3 standard deviations of the mean (\(\mu \pm 3\sigma\))
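The percentages in the empirical rule can be checked directly from the standard normal distribution function. The short sketch below uses base R's `pnorm()`; the variable names are illustrative only.

```r
# Share of a normal distribution within k standard deviations of the mean.
# pnorm(k) is P(Z <= k) for a standard normal Z, so the share inside
# [-k, k] is pnorm(k) - pnorm(-k).
k <- 1:3
coverage <- pnorm(k) - pnorm(-k)
names(coverage) <- paste0("within_", k, "_sd")

# Rounded to four decimals: 0.6827, 0.9545, 0.9973
print(round(coverage, 4))
```

Because the normal distribution is defined by its mean and standard deviation alone, the same shares apply to any \(\mu\) and \(\sigma\).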

Data Science and Distribution

Now, let’s bring the theory to life! We’ll be working with a dataset specifically curated for data science jobs and salaries. This dataset will serve as our springboard to explore how Normal Distributions can shed light on the distribution of salaries within this exciting field. We want to look at the:

  • Visual Distribution: We can use histograms, pie charts, dot plots, and density plots to visualize data.

  • Density: The higher the density at a particular point, the more likely it is to find data points near that value.

  • Distribution Function: This function provides information about the accumulation of probabilities.

Histogram of US Salaries

Explaining our Histogram

This dataset appears to be approximately normally distributed, although the mean sits above the median, which hints at a mild right skew. Below is a summary of the salaries:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24000  117875  150000  158599  192000  450000
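One quick way to judge symmetry from a summary like this is to compare the distance from the median to each quartile. The sketch below uses only the numbers printed above and computes the quartile (Bowley) skewness coefficient, which is 0 for a perfectly symmetric distribution. It is a rough check under that one measure, not a formal normality test.

```r
# Values copied from the summary output above
q1         <- 117875
median_sal <- 150000
mean_sal   <- 158599
q3         <- 192000

# Distance from the median to each quartile; equal in a symmetric distribution
lower_gap <- median_sal - q1
upper_gap <- q3 - median_sal

# Quartile (Bowley) skewness: positive values indicate a longer right tail
bowley <- (q3 + q1 - 2 * median_sal) / (q3 - q1)

cat("Lower gap:", lower_gap, " Upper gap:", upper_gap, "\n")
cat("Mean minus median:", mean_sal - median_sal, "\n")
cat("Quartile skewness:", round(bowley, 3), "\n")
```

A positive coefficient, together with a mean above the median, is consistent with a right tail stretched by a small number of very high salaries.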

When we evaluate the distribution function at $150,000, we get 0.44. Is that enough to accurately predict a data science salary? Not by itself. The distribution function accumulates probability, so 0.44 is best read as the probability that a data scientist earns $150,000 or less, not the probability of earning exactly $150,000 (for a continuous distribution, the probability of any single exact value is zero). The value simply indicates the relative position of $150,000 within the normal distribution: slightly below the middle, since $150,000 is below the mean of $158,599. We need to understand the context and math behind this to interpret the value.
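The difference between a cumulative probability and a density is easier to see in code. The sketch below rests on an assumption: that the 0.44 figure came from the normal distribution function (`pnorm()`) evaluated with the sample mean from the summary. The standard deviation is not reported in this presentation, so the code back-solves the value that this assumption implies; treat it as illustrative rather than as the author's actual calculation.

```r
mean_salary <- 158599  # sample mean from the summary
target      <- 150000  # salary of interest
reported_p  <- 0.44    # value quoted in the text, ASSUMED to be a CDF value

# ASSUMPTION: reported_p == pnorm(target, mean_salary, sd).
# Solving for sd gives the standard deviation implied by that assumption.
implied_sd <- (target - mean_salary) / qnorm(reported_p)

# Cumulative probability: P(salary <= target)
p_at_most <- pnorm(target, mean = mean_salary, sd = implied_sd)

# Density at the target: the height of the curve, not a probability
density_at_target <- dnorm(target, mean = mean_salary, sd = implied_sd)

# Probability of a salary range is a difference of two CDF values
p_range <- pnorm(160000, mean_salary, implied_sd) -
  pnorm(140000, mean_salary, implied_sd)

cat("Implied SD:", round(implied_sd), "\n")
cat("P(salary <= 150000):", round(p_at_most, 2), "\n")
cat("Density at 150000:", signif(density_at_target, 3), "\n")
cat("P(140000 <= salary <= 160000):", round(p_range, 3), "\n")
```

The density comes out as a very small number because it is measured per dollar; only areas under the curve, such as `p_at_most` and `p_range`, are probabilities.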

The Formula for Probability Density Function

The formula for the probability density function (PDF) of a normal distribution is:

\[ f(x) = \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left( -\frac{(x - \mu)^2}{2\sigma^2} \right) \]

  • \(\mu\): Represents the mean of the distribution.

  • \(\sigma\): Represents the standard deviation of the distribution.

  • \(\pi\): Represents the mathematical constant pi (approximately 3.14159).

  • exp(): Represents the exponential function.
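The formula can be written out by hand and compared against R's built-in `dnorm()`. In the sketch below, `normal_pdf` is an illustrative helper name, not part of any package.

```r
# Direct translation of the PDF formula above.
# normal_pdf is a hypothetical helper written for this example.
normal_pdf <- function(x, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
}

# Evaluate at a few points of a standard normal (mu = 0, sigma = 1)
x <- c(-2, 0, 0.5, 3)
manual  <- normal_pdf(x, mu = 0, sigma = 1)
builtin <- dnorm(x, mean = 0, sd = 1)

# TRUE when the hand-written formula agrees with dnorm()
print(all.equal(manual, builtin))
```

The curve is highest at \(x = \mu\), where the exponential term equals 1 and the density reduces to \(1 / \sqrt{2\pi\sigma^2}\).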

Can We Interpret This Value?

The value this formula returns is a density: the height of the curve at a given salary, not a probability. Probabilities come from the area under the curve, which is what the distribution function accumulates. Context is also essential to interpreting data. A cumulative probability of 0.44 at $150,000 does not mean that any one data scientist has a 44% chance of being offered that salary. Let’s look at additional factors, such as:

  • Work Setting: What is the salary distribution for remote data scientists compared to in-person?

  • Experience Level: Do experience and position level influence the expected salary range?

Pie Chart of Data Science Work Settings

We can see that most Data Science jobs are either in-person or remote. Does the distribution of salaries change with work setting and experience? Does experience level correlate with how widely salaries are spread, and does that relationship change by work environment?

Dot Plot of Experience Distribution

Code for Dot Plot

Code:

library(dplyr)
library(ggplot2)

# Order experience levels from junior to senior so the y-axis reads logically
mine_ordered <- mine %>%
  mutate(exp_level_ordered = factor(experience_level,
    levels = c("Entry-level", "Mid-level", "Senior", "Executive")))

# One panel per work setting; semi-transparent points reveal overlapping salaries
ggplot(mine_ordered, aes(y = exp_level_ordered, x = salary_in_usd, color = exp_level_ordered)) +
  geom_point(alpha = 0.3) +
  labs(title = "Salaries by Work Setting and Experience", x = "Salary (USD)", y = "") +
  facet_wrap(~ work_setting, ncol = 1) +
  theme_bw()

Average Salary by Position Level

Median and Mean by Position Level

## Mean Salary Table
##   exp_level_ordered salary_in_usd
## 1       Entry-level      104849.0
## 2         Mid-level      130431.6
## 3            Senior      166277.9
## 4         Executive      195731.1
## Median Salary Table
##   exp_level_ordered salary_in_usd
## 1       Entry-level         91000
## 2         Mid-level        125000
## 3            Senior        159095
## 4         Executive        194500
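Tables of this shape can be produced with base R's `aggregate()`. The sketch below is a minimal illustration on a small made-up data frame (`toy`, with invented salaries), since the code behind the tables above is not shown in this presentation; only the column names are taken from the dot plot code.

```r
# TOY DATA: six invented rows for illustration, not values from the dataset
toy <- data.frame(
  experience_level = c("Senior", "Entry-level", "Senior",
                       "Mid-level", "Entry-level", "Executive"),
  salary_in_usd    = c(160000, 90000, 170000, 125000, 100000, 195000)
)

# Same ordering of levels as in the dot plot code
toy$exp_level_ordered <- factor(
  toy$experience_level,
  levels = c("Entry-level", "Mid-level", "Senior", "Executive")
)

# One row per experience level, in factor-level order
mean_table   <- aggregate(salary_in_usd ~ exp_level_ordered, data = toy, FUN = mean)
median_table <- aggregate(salary_in_usd ~ exp_level_ordered, data = toy, FUN = median)

print(mean_table)
print(median_table)
```

Comparing the two tables level by level shows where a few high salaries pull the mean above the median.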

Understanding the Impact

This is how we can use Normal Distribution concepts to derive insights from real-world data:

A normal distribution lets us estimate how likely it is for a value to fall within a given range, which we applied to salaries. By analyzing the data through the lens of a normal distribution, we could estimate how likely it is for someone in the dataset to earn up to a given salary. We saw that work setting does not appear to be strongly associated with Data Science salaries, whereas experience level does; these are associations in the data, not proof of causation. Most Data Scientist positions are Senior level, and the spread of salaries suggests that we cannot expect $150,000 as an Entry-level offer. The means and medians for Mid-level, Senior, and Executive positions are within about $7,000 of each other, which gives us confidence in those estimates. For Entry-level positions the mean ($104,849) sits well above the median ($91,000), so a typical Entry-level salary is better described by the median, at around $90,000 to $95,000.