This section of the narrated notebook discusses the dataset derived from a comprehensive manager salary survey conducted in 2024. The survey aimed to gather detailed information about the salaries, industries, job functions, and demographics of managers across various sectors globally.

### b. Data Collection

The data was collected via an online survey that included multiple-choice questions and open-ended responses to capture detailed information about the respondents' job titles, industries, and salaries.

### c. Purpose of the Data

The primary purpose of this dataset is to analyze trends in manager salaries across different industries and regions. It helps identify factors influencing salary variations and provides insights that can assist stakeholders like HR professionals, recruiters, and policy-makers in making informed decisions.

### d. Data Description

The dataset contains responses to questions about age, industry, job function, salary, work experience, education level, and demographic information. The key stakeholders interested in this dataset include:

- HR departments analyzing compensation fairness.
- Recruiters seeking industry-specific salary benchmarks.
- Economic researchers studying employment trends.
- Policy-makers shaping labor market regulations.

This dataset was sourced from an online survey platform and is crucial for understanding compensation trends and disparities in the labor market.

## Let’s Dive in:

# 1. Setup
Load Packages
Use the library() function to load the tidyverse package. Then read the CSV and save it as a new object called salary_df.
library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tibble' was built under R version 4.3.3
Warning: package 'tidyr' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'purrr' was built under R version 4.3.3
Warning: package 'dplyr' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.3
Warning: package 'forcats' was built under R version 4.3.3
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# URL to the CSV published from the Google Sheet
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vST9_KrP-oqKCOxWxsrcZ1AQUCys5hFJquQY0iH3dTxY2LKAdcia2vQhs6uhOmFSBMRxDp3E3iZY85M/pub?gid=1401121012&single=true&output=csv"
# Read the CSV data directly into R
salary_df <- read_csv(url)
Rows: 13349 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): Timestamp, How old are you?, What industry is your employer in?, W...
dbl (2): What is your annual salary? This should be your GROSS (pre-tax) in...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
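The last message above suggests either specifying the column types or setting `show_col_types = FALSE`. A minimal sketch of both options follows. It uses a small inline CSV (`toy_csv`, hypothetical data, not the survey) so it runs without network access; the column names are assumptions chosen for illustration.

```r
library(readr)

# Hypothetical two-row CSV standing in for the survey export
toy_csv <- I("job_title,annual_salary\nAnalyst,73000\nDirector,150000\n")

# Option 1: keep automatic type guessing but silence the message
toy_df <- read_csv(toy_csv, show_col_types = FALSE)

# Option 2: declare the column types explicitly
toy_typed <- read_csv(
  toy_csv,
  col_types = cols(
    job_title = col_character(),
    annual_salary = col_double()
  )
)
```

Declaring types explicitly is more verbose, but it makes the import fail loudly if a column stops matching the type you expect.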
1a. Your Turn ⤵
Inspect the dataframe with the glimpse() function.
# ADD YOUR CODE BELOW with comments
glimpse(salary_df)
Rows: 13,349
Columns: 21
$ Timestamp <chr> …
$ `How old are you?` <chr> …
$ `What industry is your employer in?` <chr> …
$ `What is the functional area of your job (this might be different from your company's industry)?` <chr> …
$ `Job title` <chr> …
$ `If your job title needs additional context, please clarify here:` <chr> …
$ `What is your annual salary? This should be your GROSS (pre-tax) income. (You'll indicate the currency in a later question.) If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.` <dbl> …
$ `How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Only include monetary compensation here, not the value of benefits, tuition reimbursement, etc. If your bonus or overtime varies from year to year, use the most recent figures.` <dbl> …
$ `Please indicate the currency` <chr> …
$ `If "Other," please indicate the currency here:` <chr> …
$ `If your income needs additional context, please provide it here:` <chr> …
$ `What country do you work in? (Countries listed had by far the largest representation last year. Please write in your country if it's not listed.)` <chr> …
$ `If you're in the U.S., what state do you work in?` <chr> …
$ `What city/region do you work in?` <chr> …
$ `Are you remote or on-site?` <chr> …
$ `Is your job unionized?` <chr> …
$ `How many years of professional work experience do you have overall?` <chr> …
$ `How many years of professional work experience do you have in your field?` <chr> …
$ `What is your highest level of education completed?` <chr> …
$ `What is your gender?` <chr> …
$ `What is your race? (Choose all that apply.)` <chr> …
What is happening? The glimpse() function provides a quick overview of the dataframe, including the number of observations (rows), variables (columns), and the first few entries of each column. How many observations does it have? How many variables? What are the variables? What are the data types of the variables?

# 2. Data Structure

### **2a. Your Turn** ⤵

a. Let's look at the names of the variables using the names() function, and then rename them.
# ADD YOUR CODE BELOW with comments
names(salary_df)
[1] "Timestamp"
[2] "How old are you?"
[3] "What industry is your employer in?"
[4] "What is the functional area of your job (this might be different from your company's industry)?"
[5] "Job title"
[6] "If your job title needs additional context, please clarify here:"
[7] "What is your annual salary? This should be your GROSS (pre-tax) income. (You'll indicate the currency in a later question.) If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year."
[8] "How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Only include monetary compensation here, not the value of benefits, tuition reimbursement, etc. If your bonus or overtime varies from year to year, use the most recent figures."
[9] "Please indicate the currency"
[10] "If \"Other,\" please indicate the currency here:"
[11] "If your income needs additional context, please provide it here:"
[12] "What country do you work in? (Countries listed had by far the largest representation last year. Please write in your country if it's not listed.)"
[13] "If you're in the U.S., what state do you work in?"
[14] "What city/region do you work in?"
[15] "Are you remote or on-site?"
[16] "Is your job unionized?"
[17] "How many years of professional work experience do you have overall?"
[18] "How many years of professional work experience do you have in your field?"
[19] "What is your highest level of education completed?"
[20] "What is your gender?"
[21] "What is your race? (Choose all that apply.)"
To rename variables using the dplyr package's rename() function, you specify the new name you want for each variable (new_name = old_name) while keeping your dataframe tidy and easy to work with. This method is especially useful because it lets you selectively rename variables without having to list all variables in your dataset, as was the case with the direct assignment method we used in the last assignment.
# rename the variables
salary_df <- salary_df %>%
  rename(
    timestamp = `Timestamp`,
    age = `How old are you?`,
    industry = `What industry is your employer in?`,
    functional_area = `What is the functional area of your job (this might be different from your company's industry)?`,
    job_title = `Job title`,
    Job_title_context = `If your job title needs additional context, please clarify here:`,
    annual_salary = `What is your annual salary? This should be your GROSS (pre-tax) income. (You'll indicate the currency in a later question.) If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.`,
    additional_compensation = `How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Only include monetary compensation here, not the value of benefits, tuition reimbursement, etc. If your bonus or overtime varies from year to year, use the most recent figures.`,
    currency = `Please indicate the currency`,
    other_currency = `If "Other," please indicate the currency here:`,
    income_context = `If your income needs additional context, please provide it here:`,
    country = `What country do you work in? (Countries listed had by far the largest representation last year. Please write in your country if it's not listed.)`,
    state = `If you're in the U.S., what state do you work in?`,
    city_region = `What city/region do you work in?`,
    work_mode = `Are you remote or on-site?`,
    unionized = `Is your job unionized?`,
    yr_experience = `How many years of professional work experience do you have overall?`,
    experience_in_field = `How many years of professional work experience do you have in your field?`,
    education_level = `What is your highest level of education completed?`,
    gender = `What is your gender?`,
    race = `What is your race? (Choose all that apply.)`
  )
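To see the new_name = old_name pattern in isolation, here is a minimal sketch on a hypothetical three-column tibble (`toy_df` and its values are made up for illustration). Only two columns are renamed; the third is left untouched, which is the selective behaviour described above.

```r
library(dplyr)

# Hypothetical toy data with survey-style column names
toy_df <- tibble(
  `How old are you?` = c("25-34", "35-44"),
  `Job title` = c("Analyst", "Director"),
  salary = c(73000, 150000)
)

# rename() takes new_name = old_name; backticks are needed for names with spaces
toy_renamed <- toy_df %>%
  rename(
    age = `How old are you?`,
    job_title = `Job title`
  )

# Column order is preserved; only the two listed names change
names(toy_renamed)
```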
**2c. Your Turn** ⤵
Now use the str() function to inspect the data. The str() function is particularly useful in large datasets where manual inspection of each column isn't feasible.
# ADD YOUR CODE BELOW with comments
str(salary_df)
spc_tbl_ [13,349 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ timestamp : chr [1:13349] "4/9/2024 11:01:42" "4/9/2024 11:02:14" "4/9/2024 11:02:18" "4/9/2024 11:02:19" ...
$ age : chr [1:13349] "25-34" "35-44" "35-44" "25-34" ...
$ industry : chr [1:13349] "Media & Digital" "Education (Higher Education)" "Nonprofits" "Government & Public Administration" ...
$ functional_area : chr [1:13349] "Media & Digital" "Health care" "Administration" "Government & Public Administration" ...
$ job_title : chr [1:13349] "Digital Project Manager" "Senior Director" "Advancement Operations Manager" "Program Analyst" ...
$ Job_title_context : chr [1:13349] NA NA NA NA ...
$ annual_salary : num [1:13349] 73000 150000 53800 97000 64000 128000 136000 50000 68000 80000 ...
$ additional_compensation: num [1:13349] NA 4500 NA 0 NA 0 0 NA 5000 2500 ...
$ currency : chr [1:13349] "USD" "USD" "USD" "USD" ...
$ other_currency : chr [1:13349] NA NA NA NA ...
$ income_context : chr [1:13349] NA NA NA NA ...
$ country : chr [1:13349] "United States" "United States" "United States" "United States" ...
$ state : chr [1:13349] "New York" "California" "Maryland" "Colorado" ...
$ city_region : chr [1:13349] "New York" "Norhtern" "Olney" "Fort Collins" ...
$ work_mode : chr [1:13349] "Hybrid" "Hybrid" "Hybrid" "Fully remote" ...
$ unionized : chr [1:13349] "No" "No" "No" "No" ...
$ yr_experience : chr [1:13349] "5-7 years" "11-20 years" "11-20 years" "8-10 years" ...
$ experience_in_field : chr [1:13349] "5-7 years" "11-20 years" "8-10 years" "8-10 years" ...
$ education_level : chr [1:13349] "College degree" "College degree" "Master's degree" "Master's degree" ...
$ gender : chr [1:13349] "Woman" "Woman" "Woman" "Woman" ...
$ race : chr [1:13349] "White" "White" "White" "White" ...
- attr(*, "spec")=
.. cols(
.. Timestamp = col_character(),
.. `How old are you?` = col_character(),
.. `What industry is your employer in?` = col_character(),
.. `What is the functional area of your job (this might be different from your company's industry)?` = col_character(),
.. `Job title` = col_character(),
.. `If your job title needs additional context, please clarify here:` = col_character(),
.. `What is your annual salary? This should be your GROSS (pre-tax) income. (You'll indicate the currency in a later question.) If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.` = col_double(),
.. `How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Only include monetary compensation here, not the value of benefits, tuition reimbursement, etc. If your bonus or overtime varies from year to year, use the most recent figures.` = col_double(),
.. `Please indicate the currency` = col_character(),
.. `If "Other," please indicate the currency here:` = col_character(),
.. `If your income needs additional context, please provide it here:` = col_character(),
.. `What country do you work in? (Countries listed had by far the largest representation last year. Please write in your country if it's not listed.)` = col_character(),
.. `If you're in the U.S., what state do you work in?` = col_character(),
.. `What city/region do you work in?` = col_character(),
.. `Are you remote or on-site?` = col_character(),
.. `Is your job unionized?` = col_character(),
.. `How many years of professional work experience do you have overall?` = col_character(),
.. `How many years of professional work experience do you have in your field?` = col_character(),
.. `What is your highest level of education completed?` = col_character(),
.. `What is your gender?` = col_character(),
.. `What is your race? (Choose all that apply.)` = col_character()
.. )
- attr(*, "problems")=<externalptr>
Looking at the columns, what are you seeing?

### **2d. Your Turn** ⤵

How many numeric, character, etc. columns are you seeing, and what are the names of the columns?

- All but two columns in the dataset are of type character. In total there are 19 character columns.
- annual_salary and additional_compensation are the two numeric columns.

Do you think any need to be converted? If so, why or why not?

- Some of these could be converted. The unionized column could be converted into binary numeric values, since its entries are the character strings “No” or “Yes”. The age column could also be converted: it contains numeric intervals but is still stored as character strings.

### **2e. Your Turn** ⤵

Now let's look at the summary() function to get a summary of the data. The summary() function provides additional information. It can be used for the entire dataset or for individual variables.
# inspect the data using the summary() function
summary(salary_df)
timestamp age industry functional_area
Length:13349 Length:13349 Length:13349 Length:13349
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
job_title Job_title_context annual_salary
Length:13349 Length:13349 Min. : 0
Class :character Class :character 1st Qu.: 63860
Mode :character Mode :character Median : 89000
Mean : 122366
3rd Qu.: 125000
Max. :58400000
additional_compensation currency other_currency
Min. : 0 Length:13349 Length:13349
1st Qu.: 0 Class :character Class :character
Median : 1800 Mode :character Mode :character
Mean : 12275
3rd Qu.: 10000
Max. :2500000
NA's :3224
income_context country state city_region
Length:13349 Length:13349 Length:13349 Length:13349
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
work_mode unionized yr_experience experience_in_field
Length:13349 Length:13349 Length:13349 Length:13349
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
education_level gender race
Length:13349 Length:13349 Length:13349
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
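The output above shows a mean annual_salary well above the median, alongside a very large maximum. A minimal sketch of summary() on a single variable illustrates why this happens. The vector `toy_salary` is hypothetical and only stands in for a column such as salary_df$annual_salary.

```r
# Hypothetical salaries, including one extreme value and one missing value
toy_salary <- c(50000, 64000, 73000, 97000, 5000000, NA)

# summary() on a single numeric vector reports quartiles, mean, and the NA count
summary(toy_salary)

# The median is barely affected by the extreme value; the mean is pulled upward
toy_median <- median(toy_salary, na.rm = TRUE)
toy_mean <- mean(toy_salary, na.rm = TRUE)
```

On the real data, the same call would be summary(salary_df$annual_salary).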
# 3. Wrangle

### 3a. Column Types:
<chr> (Character Strings): Most of the columns, including "How old are you?", "What industry is your employer in?", "What is the functional area of your job?", and "What is your gender?", are read as character strings.
Why it might need conversion: Some of these fields likely represent categorical data (factors) rather than free text (e.g., “Age,” “Industry,” “Gender”). Factors are useful in R for summarizing categorical data, making it easier to run descriptive statistics, create plots, and build models. Thus, converting them from character to factors will be beneficial.
Conversion needed: We will convert these character columns to factors using the as.factor() function.
<dbl> (Numeric Data): Some columns, like "What is your annual salary?" and "How much additional monetary compensation do you get?", are read as numeric (double) types.
No conversion needed: These numeric columns seem appropriate, so no type conversion is necessary here.

### 3b. Columns to Consider for Factor Conversion:

But first, what is a factor? Factors are data structures in R used to handle categorical data, which are variables that have a fixed number of distinct categories or levels. Factors store both the labels and the internal integer codes that represent each unique category. They are especially useful in statistical modeling as they correctly inform the model that the data is categorical, not numeric.

#### Age (How old are you?):

- Currently: `<chr>`
- Should be: **Factor** (`as.factor()`), because it represents categorical age ranges rather than free text.

# Convert age to a factor
salary_df$age <- as.factor(salary_df$age)
# Inspect the data
salary_df
# A tibble: 13,349 × 21
timestamp age industry functional_area job_title Job_title_context
<chr> <fct> <chr> <chr> <chr> <chr>
1 4/9/2024 11:01:42 25-34 Media & … Media & Digital Digital … <NA>
2 4/9/2024 11:02:14 35-44 Educatio… Health care Senior D… <NA>
3 4/9/2024 11:02:18 35-44 Nonprofi… Administration Advancem… <NA>
4 4/9/2024 11:02:19 25-34 Governme… Government & P… Program … <NA>
5 4/9/2024 11:02:24 25-34 Governme… Administration Project … <NA>
6 4/9/2024 11:02:30 35-44 Health c… Health care Clinical… Advanced Practic…
7 4/9/2024 11:02:37 45-54 Governme… Health care Microbio… <NA>
8 4/9/2024 11:02:37 18-24 Nonprofi… Marketing, Adv… Public E… <NA>
9 4/9/2024 11:02:40 25-34 Computin… Business or Co… Client S… <NA>
10 4/9/2024 11:02:45 35-44 Accounti… Accounting, Ba… Accounta… <NA>
# ℹ 13,339 more rows
# ℹ 15 more variables: annual_salary <dbl>, additional_compensation <dbl>,
# currency <chr>, other_currency <chr>, income_context <chr>, country <chr>,
# state <chr>, city_region <chr>, work_mode <chr>, unionized <chr>,
# yr_experience <chr>, experience_in_field <chr>, education_level <chr>,
# gender <chr>, race <chr>
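To make the idea of labels plus internal integer codes concrete, here is a minimal sketch on a hypothetical character vector (`age_chr` is made up; it is not taken from salary_df).

```r
# Hypothetical character vector of age ranges
age_chr <- c("25-34", "35-44", "35-44", "18-24")

# Convert to a factor, as done for salary_df$age above
age_fct <- as.factor(age_chr)

levels(age_fct)      # the distinct labels, sorted alphabetically
as.integer(age_fct)  # the integer code stored for each element
nlevels(age_fct)     # number of distinct categories
```

Each element is stored as the position of its label in levels(), which is why a factor is more compact and more informative to modeling functions than free text.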
View the Levels of a Factor
To see the different levels that a factor variable has, you can use the levels() function. This is useful to understand what categories are included in your factor data
# View the levels of the age factor
levels(salary_df$age)
Alternatively, you could convert it to an ordered factor, because the age ranges have a natural order (e.g., “18-24,” “25-34,” etc.). Note that factor() sorts the levels alphabetically unless you pass the levels argument, so check the resulting order with levels().
# Convert age to an ordered factor
salary_df$age <- factor(salary_df$age, ordered = TRUE)
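With ordered = TRUE alone, the levels keep their default alphabetical order, which is why “under 18” is listed last in the outputs in this notebook. A minimal sketch of setting the order explicitly follows. The vector `age_chr` is a hypothetical sample, and `age_levels` is an assumed ordering built from the age labels that appear in the survey output.

```r
# Hypothetical sample using age labels that appear in the survey
age_chr <- c("25-34", "under 18", "65 or over", "18-24", "35-44")

# Default ordering is alphabetical, so "under 18" ends up as the highest level
age_default <- factor(age_chr, ordered = TRUE)
levels(age_default)

# Passing levels explicitly puts the ranges in their natural order
age_levels <- c("under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or over")
age_ordered <- factor(age_chr, levels = age_levels, ordered = TRUE)

# Comparisons now follow age order rather than alphabetical order
younger_than_25 <- age_ordered < "25-34"
```

Applied to the survey data, the same idea would be factor(salary_df$age, levels = age_levels, ordered = TRUE).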
Summary of a Factor
A quick way to get the count of each level within a factor is by using the summary() function. This function provides a count of occurrences for each level.
# Summary of the age factor
summary(salary_df$age)
18-24 25-34 35-44 45-54 55-64 65 or over under 18
175 4100 5552 2403 993 125 1
Table of a Factor
To get a frequency table of the levels, you can use the table() function. This is similar to summary() but is used specifically for getting the frequency of each level.
table(salary_df$age)
18-24 25-34 35-44 45-54 55-64 65 or over under 18
175 4100 5552 2403 993 125 1
Using dplyr to Summarize Factors
If you are using the dplyr package, you can also create grouped summaries or counts of factor levels easily.
salary_df %>%
  count(age) %>%
  arrange(desc(n)) # Arranges the output in descending order of counts
# A tibble: 7 × 2
age n
<ord> <int>
1 35-44 5552
2 25-34 4100
3 45-54 2403
4 55-64 993
5 18-24 175
6 65 or over 125
7 under 18 1
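Counts are often easier to compare as shares of the total. A minimal sketch extending the count() pattern with a proportion column is shown below, on a hypothetical four-row tibble (`toy_df` is made up, and `share` is a column name chosen for illustration).

```r
library(dplyr)

# Hypothetical toy data with a factor column
toy_df <- tibble(age = factor(c("25-34", "35-44", "35-44", "18-24")))

toy_counts <- toy_df %>%
  count(age) %>%                # one row per level with its frequency n
  mutate(share = n / sum(n)) %>% # proportion of all rows in each level
  arrange(desc(n))              # most frequent level first

toy_counts
```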
**3c. Your Turn** ⤵
Your Task: For the other character variables, decide whether they need to be converted to factors or ordered factors. Once you have converted your character columns to factors, explore these factors to understand the levels they contain and how the data within each category is structured.

#### Hint - you need to look at 8 variables in total.

# ADD YOUR CODE BELOW with comments
salary_df$work_mode <- as.factor(salary_df$work_mode)
salary_df$industry <- as.factor(salary_df$industry)
salary_df$unionized <- as.factor(salary_df$unionized)
salary_df$currency <- as.factor(salary_df$currency)
salary_df$education_level <- as.factor(salary_df$education_level)
salary_df$country <- as.factor(salary_df$country)
salary_df$gender <- as.factor(salary_df$gender)
salary_df$race <- as.factor(salary_df$race)
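The column-by-column conversion above can also be written in one step with dplyr's across(), followed by a quick look at the resulting levels. This is a sketch on a hypothetical tibble (`toy_df` and its values are made up), not a replacement for the answer above.

```r
library(dplyr)

# Hypothetical toy data with two categorical columns and one numeric column
toy_df <- tibble(
  work_mode = c("Hybrid", "Fully remote", "Hybrid"),
  unionized = c("No", "No", "Yes"),
  annual_salary = c(73000, 150000, 53800)
)

# Convert several character columns to factors in a single call
toy_df <- toy_df %>%
  mutate(across(c(work_mode, unionized), as.factor))

# Explore the levels of each converted column
toy_levels <- lapply(toy_df[c("work_mode", "unionized")], levels)
toy_levels

# Frequency of each level for one factor
toy_df %>% count(work_mode)
```

Listing the columns inside c() keeps the numeric columns untouched, so annual_salary stays a double.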
Calculating descriptive statistics
Univariate EDA is the process of exploring and summarizing the main characteristics of a single variable. This process helps you understand the distribution of the data, identify outliers, and detect patterns or trends.

Now that we have made sure that the type conversions are correct, we can look at the describe() and describeBy() functions. describe() computes descriptive statistics for numerical data. Descriptive statistics help determine the distribution of numerical variables. Like dplyr functions, its first argument is the tibble (or data frame). To describe only certain variables, subset the data frame to those columns first. The first thing we need to do is to load the correct package:
# Check if the psych package is installed. If not, install it.
if (!require(psych)) {
  install.packages("psych", dependencies = TRUE)
}
Loading required package: psych
Warning: package 'psych' was built under R version 4.3.3
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
# Load the psych package
# ADD YOUR CODE BELOW with comments
library(psych)
3d. Numerical Variables
Your Task: Use the describe() function from the psych package to generate descriptive statistics for numerical variables.
# Describe the two numerical variables
numerical_descriptions <- describe(salary_df[c("annual_salary", "additional_compensation")])
# Inspect the object
numerical_descriptions
vars n mean sd median trimmed mad
annual_salary 1 13349 122365.52 793049.45 89000 93870.03 42995.40
additional_compensation 2 10125 12275.22 53872.31 1800 4487.77 2668.68
min max range skew kurtosis se
annual_salary 0 58400000 58400000 56.74 3552.38 6863.98
additional_compensation 0 2500000 2500000 22.20 758.57 535.39
Do you want to see this as a nice table?
# read in knitr package
library(knitr)
Warning: package 'knitr' was built under R version 4.3.3
# Use kable() for a cleaner table format
kable(numerical_descriptions)

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|:--|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|--:|
| annual_salary | 1 | 13349 | 122365.52 | 793049.45 | 89000 | 93870.026 | 42995.40 | 0 | 58400000 | 58400000 | 56.73891 | 3552.3788 | 6863.9783 |
| additional_compensation | 2 | 10125 | 12275.22 | 53872.31 | 1800 | 4487.766 | 2668.68 | 0 | 2500000 | 2500000 | 22.19539 | 758.5696 | 535.3873 |
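Several of the statistics in this table can be reproduced with base R, which helps show what each column means. The sketch below uses a hypothetical vector `x` with one extreme value. The 10% trim and the skewness formula are assumptions made for illustration; the exact variants that psych uses by default may differ.

```r
# Hypothetical salaries with one extreme value
x <- c(50000, 64000, 68000, 73000, 80000, 97000, 128000, 136000, 150000, 5000000)

n <- length(x)
m <- mean(x)
s <- sd(x)

stats <- c(
  n       = n,
  mean    = m,
  sd      = s,
  median  = median(x),
  trimmed = mean(x, trim = 0.1),    # drops the lowest and highest 10% before averaging (assumed trim)
  mad     = mad(x),                 # median absolute deviation, scaled by 1.4826 by default
  min     = min(x),
  max     = max(x),
  range   = max(x) - min(x),
  skew    = mean((x - m)^3) / s^3,  # one common skewness formula (an assumption here)
  se      = s / sqrt(n)             # standard error of the mean
)

round(stats, 2)
```

As in the survey data, the single extreme value pulls the mean far above the median, while the trimmed mean stays close to it.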
Columns in the Output
vars: Variable identifier (ordinal number in the dataset). This column simply numbers the variables being described. For example, annual_salary is labeled as 1, indicating it's the first variable analyzed.
n: Number of non-missing observations for each variable. For annual_salary, there are 13,349 entries, which tells you how many people's salaries were considered in this analysis.
mean: The average value of the variable. For annual_salary, the average is $122,365.52. This is what you get if you add up all the salaries and divide by the number of people.
sd: Standard deviation, which measures how spread out the salaries are from the average. A high standard deviation, like $793,049.45 for annual_salary, means that salaries vary a lot from the average. Some salaries are much higher or much lower than the average.
median: The middle value when all the salaries are lined up from smallest to largest. For annual_salary, the middle value is $89,000. This is often a better measure of central tendency than the average when the data is skewed or includes very high or very low values (outliers).
trimmed: The mean after trimming a certain percentage of the highest and lowest values, which reduces the effect of outliers. For annual_salary, this trimmed average is $93,870.03, slightly higher than the median and far below the untrimmed mean, indicating that ignoring the extremes brings the average closer to the median.
mad: Median absolute deviation, a measure of variability that is similar to the standard deviation but more robust to outliers. It measures how much the values differ from the median salary. For annual_salary, it’s $42,995.40.
min: The smallest value in the dataset. For annual_salary, the smallest is $0 (perhaps indicating unpaid positions or data entry errors).
max: The largest value in the dataset. For annual_salary, the largest is $58,400,000, showing a huge range in salaries.
range: The difference between the maximum and minimum values. For annual_salary, the range is $58,400,000, confirming the vast disparity in salaries.
skew: Skewness of the distribution, a measure of how asymmetrical the distribution of salaries is around the average. Positive values indicate a tail to the right, negative values a tail to the left. A positive skew, like 56.73891 for annual_salary, means there are a lot of lower salaries and a few extremely high ones (the tail is on the high side).
kurtosis: A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. High kurtosis indicates a distribution with heavy tails. A very high kurtosis, like 3552.3788 for annual_salary, suggests that most salaries cluster near the center while a few extreme outliers sit far out in the tails.
se: Standard error of the mean, which measures how much the average (mean) salary would vary from sample to sample taken from the same population. For annual_salary, it is $6,863.98, a large value that reflects the high standard deviation and the outliers.

### 3e. DescribeBy

Use describeBy() to analyze these variables grouped by another variable (e.g., industry or, as in the code that follows, age).
# Use describeBy() to describe data by groups (here, by age)
factor_descriptions_by_age <- describeBy(salary_df, group = salary_df$age)
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
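describeBy() was given the whole data frame, including text and factor columns, so the table that follows reports values such as a mean for timestamp and job_title that appear to be computed on internal category codes. The warnings above likely come from columns that have no non-missing values within some group. One way to sidestep both issues is to summarize only the numeric columns by group. The sketch below does this with dplyr on a hypothetical tibble (`toy_df` and its values are made up); it is an alternative technique, not what describeBy() does internally.

```r
library(dplyr)

# Hypothetical toy data: one grouping factor and two numeric columns
toy_df <- tibble(
  age = factor(c("25-34", "25-34", "35-44", "35-44", "35-44")),
  annual_salary = c(73000, 64000, 150000, 53800, 128000),
  additional_compensation = c(NA, 0, 4500, NA, 0)
)

# Grouped descriptive statistics for the numeric columns only
toy_by_age <- toy_df %>%
  group_by(age) %>%
  summarise(
    n = n(),
    mean_salary = mean(annual_salary),
    median_salary = median(annual_salary),
    mean_additional = mean(additional_compensation, na.rm = TRUE),
    .groups = "drop"
  )

toy_by_age
```

With psych, the equivalent move would be to pass only the numeric columns, for example salary_df[c("annual_salary", "additional_compensation")], as the first argument to describeBy().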
print(factor_descriptions_by_age)
Descriptive statistics by group
group: 18-24
vars n mean sd median trimmed mad
timestamp 1 175 6105.22 3528.23 6167 6110.77 4379.60
age 2 175 1.00 0.00 1 1.00 0.00
industry 3 175 164.28 96.58 148 160.58 108.23
functional_area 4 170 245.57 176.29 214 234.32 182.36
job_title 5 175 3629.57 2153.05 3678 3590.94 2722.05
Job_title_context 6 35 1403.54 821.33 1439 1398.62 1037.82
annual_salary 7 175 87468.45 276384.48 51000 56248.07 20104.06
additional_compensation 8 109 5852.70 16489.45 1000 2364.54 1482.60
currency 9 175 9.83 2.65 11 10.48 0.00
other_currency 10 3 17.67 8.08 13 17.67 0.00
income_context 11 21 554.95 409.51 501 526.18 461.09
country 12 175 55.85 13.02 60 59.89 0.00
state 13 141 25.30 13.64 24 25.27 17.79
city_region 14 175 1394.09 925.64 1386 1360.74 1264.66
work_mode 15 174 2.35 0.69 2 2.42 1.48
unionized 16 170 1.11 0.31 1 1.01 0.00
yr_experience 17 175 2.94 1.83 3 2.67 0.00
experience_in_field 18 175 2.35 1.46 3 2.18 0.00
education_level 19 174 64.57 27.14 50 59.67 0.00
gender 20 174 8.13 3.28 10 8.74 0.00
race 21 173 30.61 11.69 36 33.47 0.00
min max range skew kurtosis se
timestamp 74 12037 11963 0.01 -1.18 266.71
age 1 1 0 NaN NaN 0.00
industry 4 389 385 0.37 -0.55 7.30
functional_area 9 635 626 0.49 -0.80 13.52
job_title 15 7923 7908 0.13 -1.07 162.76
Job_title_context 38 2839 2801 0.10 -1.10 138.83
annual_salary 2950 3360000 3357050 10.42 114.44 20892.70
additional_compensation 0 130000 130000 5.53 34.39 1579.40
currency 1 11 10 -2.01 2.57 0.20
other_currency 13 27 14 0.38 -2.33 4.67
income_context 14 1336 1322 0.50 -1.13 89.36
country 2 60 58 -3.05 7.86 0.98
state 3 50 47 -0.04 -1.24 1.15
city_region 95 3131 3036 0.13 -1.39 69.97
work_mode 1 4 3 -0.36 -0.61 0.05
unionized 1 2 1 2.54 4.47 0.02
yr_experience 1 8 7 1.14 0.74 0.14
experience_in_field 1 7 6 1.38 2.67 0.11
education_level 9 139 130 1.22 0.57 2.06
gender 1 11 10 -1.48 0.52 0.25
race 1 36 35 -1.84 1.62 0.89
------------------------------------------------------------
group: 25-34
vars n mean sd median trimmed
timestamp 1 4100 5923.93 3395.26 5976.5 5906.27
age 2 4100 2.00 0.00 2.0 2.00
industry 3 4087 161.77 95.91 157.0 158.49
functional_area 4 4061 231.69 173.06 209.0 218.98
job_title 5 4100 3984.15 2307.42 4176.5 4002.45
Job_title_context 6 856 1438.70 829.11 1445.5 1445.00
annual_salary 7 4100 123122.04 1051153.44 79000.0 82606.73
additional_compensation 8 2978 10348.66 60669.76 1500.0 3721.56
currency 9 4100 9.67 2.93 11.0 10.42
other_currency 10 22 18.64 7.86 18.5 19.00
income_context 11 404 664.91 389.52 690.0 665.74
country 12 4100 54.54 15.25 60.0 59.43
state 13 3289 25.48 14.46 24.0 25.26
city_region 14 4099 1470.68 921.76 1484.0 1452.84
work_mode 15 4085 2.14 0.76 2.0 2.15
unionized 16 4042 1.14 0.35 1.0 1.05
yr_experience 17 4100 5.93 2.47 7.0 6.18
experience_in_field 18 4100 5.45 2.46 7.0 5.62
education_level 19 4089 68.97 25.21 50.0 66.32
gender 20 4086 8.68 2.99 10.0 9.45
race 21 4083 32.48 9.24 36.0 35.29
mad min max range skew kurtosis se
timestamp 4243.20 5 12129 12124 0.01 -1.12 53.03
age 0.00 2 2 0 NaN NaN 0.00
industry 111.19 1 406 405 0.28 -0.68 1.50
functional_area 220.91 1 638 637 0.42 -0.89 2.72
job_title 2945.18 2 7962 7960 -0.09 -1.22 36.04
Job_title_context 1066.73 5 2873 2868 -0.05 -1.21 28.34
annual_salary 35582.40 0 45000000 45000000 37.07 1437.28 16416.26
additional_compensation 2223.90 0 2500000 2500000 28.56 1043.74 1111.76
currency 0.00 1 12 11 -1.90 1.95 0.05
other_currency 10.38 3 30 27 -0.37 -1.06 1.68
income_context 504.08 1 1343 1342 -0.05 -1.21 19.38
country 0.00 1 60 59 -2.62 5.15 0.24
state 19.27 1 51 50 0.05 -1.27 0.25
city_region 1266.14 1 3138 3137 0.06 -1.25 14.40
work_mode 1.48 1 4 3 0.01 -0.77 0.01
unionized 0.00 1 2 1 2.08 2.34 0.01
yr_experience 1.48 1 8 7 -0.78 -1.21 0.04
experience_in_field 1.48 1 8 7 -0.40 -1.56 0.04
education_level 0.00 6 153 147 0.54 -0.30 0.39
gender 0.00 1 11 10 -2.02 2.36 0.05
race 0.00 1 37 36 -2.60 5.38 0.14
------------------------------------------------------------
group: 35-44
vars n mean sd median trimmed mad
timestamp 1 5552 6146.30 3421.71 6272.5 6167.92 4263.22
age 2 5552 3.00 0.00 3.0 3.00 0.00
industry 3 5535 158.72 93.79 157.0 154.15 102.30
functional_area 4 5505 229.61 173.69 209.0 216.49 216.46
job_title 5 5552 3914.08 2303.83 3958.0 3915.46 2950.37
Job_title_context 6 1230 1464.38 834.96 1503.0 1469.31 1075.63
annual_salary 7 5552 126308.42 819481.71 93000.0 99522.56 44478.00
additional_compensation 8 4252 14163.90 56969.81 2000.0 5390.69 2965.20
currency 9 5552 9.73 2.94 11.0 10.52 0.00
other_currency 10 26 13.77 8.27 13.5 13.36 11.12
income_context 11 567 666.16 393.97 650.0 663.56 512.98
country 12 5552 54.26 15.79 60.0 59.26 0.00
state 13 4573 25.84 14.51 24.0 25.72 19.27
city_region 14 5548 1495.27 919.94 1515.0 1484.79 1243.90
work_mode 15 5535 2.11 0.79 2.0 2.10 1.48
unionized 16 5510 1.14 0.34 1.0 1.05 0.00
yr_experience 17 5552 3.00 1.98 2.0 2.51 0.00
experience_in_field 18 5552 4.28 2.67 3.0 4.12 1.48
education_level 19 5540 73.94 26.39 90.0 72.19 44.48
gender 20 5530 8.68 3.09 10.0 9.47 0.00
race 21 5527 33.40 8.13 36.0 35.89 0.00
min max range skew kurtosis se
timestamp 1 12128 12127 -0.06 -1.12 45.92
age 3 3 0 NaN NaN 0.00
industry 2 404 402 0.41 -0.41 1.26
functional_area 9 643 634 0.43 -0.88 2.34
job_title 1 7963 7962 -0.02 -1.24 30.92
Job_title_context 2 2874 2872 -0.06 -1.23 23.81
annual_salary 0 58400000 58400000 65.62 4612.08 10998.02
additional_compensation 0 2000000 2000000 17.68 468.27 873.67
currency 1 12 11 -2.01 2.32 0.04
other_currency 1 31 30 0.34 -1.10 1.62
income_context 4 1348 1344 0.04 -1.26 16.55
country 2 66 64 -2.51 4.50 0.21
state 1 51 50 0.02 -1.25 0.21
city_region 3 3138 3135 0.01 -1.27 12.35
work_mode 1 4 3 0.11 -0.80 0.01
unionized 1 2 1 2.10 2.42 0.00
yr_experience 1 8 7 1.83 1.78 0.03
experience_in_field 1 8 7 0.45 -1.66 0.04
education_level 3 148 145 0.10 -0.44 0.35
gender 1 13 12 -2.02 2.20 0.04
race 1 37 36 -3.20 9.01 0.11
------------------------------------------------------------
group: 45-54
vars n mean sd median trimmed
timestamp 1 2403 6372.43 3320.15 6414.0 6424.71
age 2 2403 4.00 0.00 4.0 4.00
industry 3 2400 160.61 94.49 157.0 156.44
functional_area 4 2386 216.52 174.39 190.0 200.28
job_title 5 2403 3947.54 2312.28 3930.0 3937.84
Job_title_context 6 543 1455.07 801.40 1460.0 1452.58
annual_salary 7 2403 118099.88 203187.98 95000.0 101243.41
additional_compensation 8 1909 13024.78 46028.81 1500.0 4632.80
currency 9 2403 9.78 2.92 11.0 10.60
other_currency 10 9 13.78 7.24 16.0 13.78
income_context 11 240 699.80 374.72 683.5 703.36
country 12 2403 54.42 15.66 60.0 59.46
state 13 2000 26.11 14.63 24.0 26.07
city_region 14 2402 1511.38 917.88 1567.0 1506.83
work_mode 15 2399 2.12 0.81 2.0 2.13
unionized 16 2376 1.13 0.34 1.0 1.04
yr_experience 17 2403 3.80 1.05 4.0 3.81
experience_in_field 18 2403 3.84 1.94 4.0 3.57
education_level 19 2397 74.86 29.51 90.0 73.90
gender 20 2393 8.66 3.15 10.0 9.45
race 21 2394 33.93 7.41 36.0 36.00
mad min max range skew kurtosis se
timestamp 3968.92 4 12116 12112 -0.09 -1.02 67.73
age 0.00 4 4 0 NaN NaN 0.00
industry 81.54 4 402 398 0.39 -0.39 1.93
functional_area 201.63 2 641 639 0.55 -0.75 3.57
job_title 2948.89 3 7965 7962 0.03 -1.21 47.17
Job_title_context 985.93 1 2856 2855 0.03 -1.12 34.39
annual_salary 45905.74 0 8000000 8000000 28.91 1031.36 4144.97
additional_compensation 2223.90 0 1000000 1000000 11.51 186.82 1053.48
currency 0.00 1 12 11 -2.10 2.66 0.06
other_currency 4.45 2 24 22 -0.30 -1.48 2.41
income_context 455.90 3 1345 1342 -0.01 -1.08 24.19
country 0.00 1 61 60 -2.57 4.76 0.32
state 19.27 1 51 50 -0.01 -1.27 0.33
city_region 1229.82 2 3133 3131 -0.02 -1.25 18.73
work_mode 1.48 1 4 3 0.04 -0.95 0.02
unionized 0.00 1 2 1 2.17 2.69 0.01
yr_experience 0.00 2 8 6 0.31 3.01 0.02
experience_in_field 2.97 1 8 7 0.93 -0.12 0.04
education_level 44.48 9 154 145 0.05 -0.51 0.60
gender 0.00 1 13 12 -1.99 2.01 0.06
race 0.00 1 37 36 -3.70 12.57 0.15
------------------------------------------------------------
group: 55-64
vars n mean sd median trimmed mad
timestamp 1 993 6321.46 3421.68 6444.0 6365.11 4201.69
age 2 993 5.00 0.00 5.0 5.00 0.00
industry 3 989 159.71 96.15 157.0 155.14 80.06
functional_area 4 983 203.90 172.14 189.0 185.26 203.12
job_title 5 993 3933.37 2318.57 3877.0 3927.64 2997.82
Job_title_context 6 237 1334.72 846.30 1182.0 1309.66 1011.13
annual_salary 7 993 115262.90 157356.24 96000.0 100070.72 45960.60
additional_compensation 8 779 8859.83 24093.45 1200.0 3647.28 1779.12
currency 9 993 10.15 2.51 11.0 10.96 0.00
other_currency 10 2 20.50 2.12 20.5 20.50 2.22
income_context 11 116 704.79 394.47 735.0 710.62 512.24
country 12 993 55.93 13.57 60.0 59.98 0.00
state 13 874 25.79 14.73 24.0 25.70 20.76
city_region 14 993 1493.95 911.27 1506.0 1485.05 1197.94
work_mode 15 990 2.19 0.82 2.0 2.22 1.48
unionized 16 984 1.14 0.35 1.0 1.05 0.00
yr_experience 17 993 4.67 0.95 5.0 4.71 0.00
experience_in_field 18 993 4.22 1.67 4.0 4.07 1.48
education_level 19 990 75.33 31.41 75.0 74.91 37.06
gender 20 990 8.72 3.11 10.0 9.52 0.00
race 21 991 34.32 6.92 36.0 36.00 0.00
min max range skew kurtosis se
timestamp 2 12105 12103 -0.12 -1.10 108.58
age 5 5 0 NaN NaN 0.00
industry 4 407 403 0.45 -0.33 3.06
functional_area 5 644 639 0.64 -0.59 5.49
job_title 5 7964 7959 0.04 -1.18 73.58
Job_title_context 6 2869 2863 0.26 -1.16 54.97
annual_salary 45 4300000 4299955 19.73 504.48 4993.55
additional_compensation 0 280000 280000 6.36 52.55 863.24
currency 1 11 10 -2.75 5.88 0.08
other_currency 19 22 3 0.00 -2.75 1.50
income_context 17 1347 1330 -0.09 -1.27 36.63
country 2 64 62 -3.16 8.17 0.43
state 1 51 50 -0.03 -1.29 0.50
city_region 4 3125 3121 0.00 -1.29 28.92
work_mode 1 4 3 -0.15 -1.07 0.03
unionized 1 2 1 2.08 2.33 0.01
yr_experience 1 8 7 -0.64 2.99 0.03
experience_in_field 1 8 7 0.51 0.07 0.05
education_level 1 143 142 0.02 -0.65 1.00
gender 1 11 10 -2.06 2.29 0.10
race 1 37 36 -4.22 16.58 0.22
------------------------------------------------------------
group: 65 or over
vars n mean sd median trimmed mad
timestamp 1 125 6875.11 2953.76 7048.0 6993.06 2876.24
age 2 125 6.00 0.00 6.0 6.00 0.00
industry 3 125 163.60 97.58 157.0 159.19 81.54
functional_area 4 124 213.95 171.52 189.5 199.50 202.37
job_title 5 125 4081.90 2086.48 4027.0 4098.37 2155.70
Job_title_context 6 32 1143.53 848.87 924.5 1085.27 1034.85
annual_salary 7 125 110685.06 70513.02 101000.0 102485.69 51499.59
additional_compensation 8 97 8653.00 25367.94 100.0 2989.01 148.26
currency 9 125 10.31 2.29 11.0 11.00 0.00
other_currency 10 0 NaN NA NA NaN NA
income_context 11 12 854.67 370.59 861.0 884.90 432.92
country 12 125 56.59 12.30 60.0 60.00 0.00
state 13 113 25.71 15.70 23.0 25.48 20.76
city_region 14 125 1697.59 875.06 1612.0 1714.36 1106.02
work_mode 15 125 2.27 0.84 2.0 2.30 1.48
unionized 16 124 1.13 0.34 1.0 1.04 0.00
yr_experience 17 125 5.37 0.94 6.0 5.54 0.00
experience_in_field 18 125 4.57 1.59 5.0 4.56 1.48
education_level 19 123 80.90 32.02 90.0 80.60 53.37
gender 20 124 8.73 3.11 10.0 9.51 0.00
race 21 125 33.80 7.90 36.0 36.00 0.00
min max range skew kurtosis se
timestamp 70 12100 12030 -0.29 -0.56 264.19
age 6 6 0 NaN NaN 0.00
industry 4 381 377 0.43 -0.45 8.73
functional_area 4 615 611 0.51 -0.83 15.40
job_title 23 7759 7736 -0.03 -0.89 186.62
Job_title_context 97 2797 2700 0.47 -1.23 150.06
annual_salary 18 600000 599982 2.98 16.92 6306.88
additional_compensation 0 200000 200000 5.39 33.72 2575.72
currency 2 11 9 -3.10 7.94 0.21
other_currency Inf -Inf -Inf NA NA NA
income_context 97 1310 1213 -0.40 -0.94 106.98
country 9 60 51 -3.43 10.10 1.10
state 1 50 49 0.08 -1.44 1.48
city_region 50 3132 3082 -0.12 -1.12 78.27
work_mode 1 4 3 -0.21 -1.01 0.07
unionized 1 2 1 2.19 2.80 0.03
yr_experience 2 6 4 -1.77 3.24 0.08
experience_in_field 1 8 7 -0.12 -0.43 0.14
education_level 9 150 141 0.00 -0.88 2.89
gender 1 10 9 -2.06 2.28 0.28
race 1 36 35 -3.53 11.16 0.71
------------------------------------------------------------
group: under 18
vars n mean sd median trimmed mad min max range
timestamp 1 1 12021 NA 12021 12021 0 12021 12021 0
age 2 1 7 NA 7 7 0 7 7 0
industry 3 1 4 NA 4 4 0 4 4 0
functional_area 4 1 9 NA 9 9 0 9 9 0
job_title 5 1 999 NA 999 999 0 999 999 0
Job_title_context 6 1 333 NA 333 333 0 333 333 0
annual_salary 7 1 0 NA 0 0 0 0 0 0
additional_compensation 8 1 0 NA 0 0 0 0 0 0
currency 9 1 11 NA 11 11 0 11 11 0
other_currency 10 0 NaN NA NA NaN NA Inf -Inf -Inf
income_context 11 0 NaN NA NA NaN NA Inf -Inf -Inf
country 12 1 60 NA 60 60 0 60 60 0
state 13 1 1 NA 1 1 0 1 1 0
city_region 14 1 331 NA 331 331 0 331 331 0
work_mode 15 0 NaN NA NA NaN NA Inf -Inf -Inf
unionized 16 0 NaN NA NA NaN NA Inf -Inf -Inf
yr_experience 17 1 1 NA 1 1 0 1 1 0
experience_in_field 18 1 1 NA 1 1 0 1 1 0
education_level 19 0 NaN NA NA NaN NA Inf -Inf -Inf
gender 20 0 NaN NA NA NaN NA Inf -Inf -Inf
race 21 0 NaN NA NA NaN NA Inf -Inf -Inf
skew kurtosis se
timestamp NA NA NA
age NA NA NA
industry NA NA NA
functional_area NA NA NA
job_title NA NA NA
Job_title_context NA NA NA
annual_salary NA NA NA
additional_compensation NA NA NA
currency NA NA NA
other_currency NA NA NA
income_context NA NA NA
country NA NA NA
state NA NA NA
city_region NA NA NA
work_mode NA NA NA
unionized NA NA NA
yr_experience NA NA NA
experience_in_field NA NA NA
education_level NA NA NA
gender NA NA NA
race NA NA NA
### Group or Individual Tasks: Your Turn ⤵

#### TASK 1:

Use the describe() function to analyze the distribution of annual salaries within a particular demographic or job function category. You will generate a detailed statistical table and write a short analysis based on your findings.

- Select a Variable for Analysis
  - Choose one categorical variable from the dataset (e.g., industry, country, or job function) to focus your analysis on how annual salaries vary within the chosen category.
- Filter and Prepare the Data
  - Optionally, filter the data if you want to focus on a specific subset (e.g., managers in a specific country or age group).
- Use describe() to Generate Descriptive Statistics
  - Apply the describe() function to the salary data grouped by the selected categorical variable.
- Print and Interpret the Results
  - Print the output table using print() and provide a concise interpretation of the key statistics such as mean, median, standard deviation, and range.
# ADD YOUR CODE BELOW with comments
# Filter the data to focus on a subset, e.g., a specific country
industry_data <- salary_df %>%
  filter(country == "United States") %>% # This line is optional
  select(industry, annual_salary)

# Apply describe() to the filtered data
descriptive_stats <- describe(industry_data$annual_salary)
kable(descriptive_stats)
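|    | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|----|------|---|------|----|--------|---------|-----|-----|-----|-------|------|----------|----|
| X1 | 1 | 11069 | 105566.6 | 79852.41 | 90500 | 96677.8 | 42254.1 | 0 | 4300000 | 4300000 | 17.27037 | 734.624 | 758.9861 |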
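Task 1 asks for statistics grouped by a categorical variable, while the `describe()` call above summarises all filtered salaries together. The following is a minimal sketch of a grouped summary, assuming dplyr is available. The data frame `toy_salaries` and its values are hypothetical and not taken from the survey. With the psych package loaded, `describeBy(industry_data$annual_salary, group = industry_data$industry)` would be the grouped counterpart of `describe()`.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration only
toy_salaries <- data.frame(
  industry = c("Tech", "Tech", "Tech", "Education", "Education", "Education"),
  annual_salary = c(90000, 120000, 150000, 50000, 60000, 70000)
)

# One row of descriptive statistics per industry
salary_by_industry <- toy_salaries %>%
  group_by(industry) %>%
  summarise(
    n = n(),
    mean = mean(annual_salary),
    median = median(annual_salary),
    sd = sd(annual_salary),
    range = max(annual_salary) - min(annual_salary)
  )

salary_by_industry
```

Comparing rows of such a table makes it easier to discuss how the mean, median, and spread differ between categories.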
Questions for Further Discussion:
Choose one or two questions below or create your own questions to discuss the results
of your analysis:
- What is the median salary within your chosen category, and how does it compare to
the overall median salary of the dataset? Why might there be differences?
- Identify and discuss the implications of the skewness in the salary distribution
for your chosen category. What does the direction and magnitude of the skewness imply
about the majority of salaries in this group?
- Analyze the trimmed mean of salaries between different job functions or industries.
How does removing outliers affect the comparison of average salaries across these
categories?
- Discuss how the salary data might influence policy-making or business strategies in
terms of diversity and inclusion within the workplace.
- Reflect on any potential biases in the data collection or analysis process. How
might these biases impact the conclusions drawn from your analysis?
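To make the trimmed-mean question concrete, here is a small sketch in base R. The salary values are hypothetical; only the idea of trimming comes from the `trimmed` column in the tables above (psych's `describe()` documents a default of `trim = .1`).

```r
# Hypothetical salaries: nine typical values plus one extreme outlier
salaries <- c(50000, 60000, 70000, 80000, 90000,
              100000, 110000, 120000, 130000, 4300000)

plain_mean   <- mean(salaries)              # pulled upward by the outlier
trimmed_mean <- mean(salaries, trim = 0.1)  # drops the lowest and highest 10%
middle       <- median(salaries)

c(mean = plain_mean, trimmed = trimmed_mean, median = middle)
```

In this toy case the single outlier moves the plain mean far above the typical salary, while the trimmed mean and the median stay close to each other. A large gap between `mean` and `trimmed` in a real table is a hint that a few extreme values dominate the average.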
#### TASK 2:

Use R with the dplyr package to filter, group, and summarize the dataset to answer the following question: “Which age groups in each country have an average annual salary exceeding $100,000?”

- Data Filtering
  - Filter the dataset to include only those entries where the annual salary is greater than $100,000.
- Grouping Data
  - Group the filtered data by country and age.
- Summarizing Data
  - Calculate the average salary for each group.
  - Count the number of individuals in each group.
- Arranging Data
  - Arrange your results to show groups with the highest average salaries at the top.
# ADD YOUR CODE BELOW with comments
# Perform the analysis
result <- salary_df %>%
  filter(annual_salary > 100000) %>%
  group_by(country, age) %>%
  summarise(
    Count = n(),
    Average_Salary = mean(annual_salary, na.rm = TRUE)
  ) %>%
  arrange(desc(Average_Salary))
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
# Print the results
result
# A tibble: 89 × 4
# Groups: country [40]
country age Count Average_Salary
<fct> <ord> <int> <dbl>
1 South Korea 35-44 1 58400000
2 South Korea 25-34 3 37733333.
3 Hungary 35-44 1 8160000
4 Iceland 45-54 1 8000000
5 Japan 35-44 6 5170038
6 Japan 25-34 5 4745083.
7 Japan 45-54 1 4500000
8 Japan 18-24 1 3360000
9 India 25-34 1 2000000
10 India 18-24 2 928000
# ℹ 79 more rows
Questions for Further Discussion:
- How does the age distribution of high earners vary between countries?
- What factors might influence the differences in salary distributions across
countries?
- Could there be external factors affecting the accuracy of this analysis?
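One factor worth checking is the currency column: the largest averages in the table (for example 58,400,000 for South Korea) are presumably reported in local currency, so they are not directly comparable with USD amounts. The sketch below is one possible variation, not the task's required solution. It keeps only USD rows and applies the $100,000 threshold to the group average instead of to individual rows. The data frame `toy_salaries` and all of its values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration only
toy_salaries <- data.frame(
  country = c("United States", "United States", "United States",
              "Japan", "Japan", "Canada"),
  age = c("25-34", "25-34", "35-44", "35-44", "35-44", "25-34"),
  currency = c("USD", "USD", "USD", "JPY", "JPY", "CAD"),
  annual_salary = c(90000, 130000, 95000, 5000000, 6000000, 120000)
)

high_avg_usd <- toy_salaries %>%
  filter(currency == "USD") %>%          # compare like with like
  group_by(country, age) %>%
  summarise(
    Count = n(),
    Average_Salary = mean(annual_salary, na.rm = TRUE),
    .groups = "drop"                     # return an ungrouped result
  ) %>%
  filter(Average_Salary > 100000) %>%    # threshold applied to the average
  arrange(desc(Average_Salary))

high_avg_usd
```

Filtering rows before averaging, as in the task steps, guarantees that every group average exceeds the threshold. Filtering after averaging instead answers which groups earn more than $100,000 on average across all of their members.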
FIRST rename your file to DSC406_001_FA24_WA_3_unityID
ggplot2:

Today, we’re diving into exploratory data analysis (EDA) using ggplot2. You’ve watched a video about the basics of ggplot, so now we’ll start applying what you’ve learned. Our focus will be on exploring a large dataset of salaries across various industries and regions. Remember, ggplot2 allows us to iteratively build up layers to create meaningful data visualizations.

Resources:

- https://datavizproject.com/
- https://www.data-to-viz.com/
- Kieran Healy - DataViz

### Step-by-Step Guide

#### Step 1: Basic Bar Plot

Let’s begin with the simplest form of visualization, a bar plot. This will help us count how often a certain value appears. We’ll start by counting how often each type of currency appears in the dataset.
# Create a bar chart to show currency distribution
viz_1 <- ggplot(salary_df, aes(x = currency)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Currency Distribution", x = "Currency", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

viz_1
This plot shows how many respondents are paid in each type of currency. The x-axis lists the different currencies, and the y-axis shows the number of respondents for each currency. This helps us understand the global distribution of salaries in the dataset.

### **1. Your Turn** ⤵

❓ What happens if we change the aesthetic to another variable, like work_mode?

1. Task: Modify the bar chart by changing the aesthetic from currency to another variable of your choice, such as work_mode.
2. Steps:
   - Update the aes(x = currency) argument to reflect a different categorical variable, such as work_mode.
   - Experiment with changing the color of the bars.
   - Modify the labs to ensure the title and axis labels match the new variable you’ve chosen.
3. Goal: Create a new bar chart that explores a different variable distribution (e.g., work mode), and make sure the chart is easy to read by adjusting the title, labels, and any other necessary aesthetics.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.
# ADD CODE AND COMMENTS
# Create a bar chart to show work mode distribution
viz_2 <- ggplot(salary_df, aes(x = work_mode)) +
  geom_bar(fill = "red") +
  labs(title = "Work Mode Distribution", x = "Work mode", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

viz_2
This bar plot counts how many people work in each mode, such as remote or hybrid. Take a moment to compare the different work modes and think about why certain modes might be more common in different industries.

#### Step 2: Exploring Continuous Variables (Histogram)

Now, let’s move to a continuous variable like annual salary. A histogram helps us visualize how salaries are distributed.
options(scipen = 999) # instructs R to avoid using scientific notation in its output

# Create a histogram for annual salary
ggplot(salary_df, aes(x = annual_salary)) +
  geom_histogram(binwidth = 1000, fill = "steelblue", color = "black") +
  theme_minimal() +
  labs(title = "Distribution of Annual Salary", x = "Annual Salary", y = "Count")
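A note on the bin width used in this histogram: with all currencies mixed, the salary range is very wide, so a bin width of 1,000 produces a very large number of narrow bins. The arithmetic can be sketched as follows. The minimum (0) and maximum (58,400,000) are read from the per-age-group statistics printed earlier, while the helper `approx_bins()` and the coarser width of 100,000 are hypothetical choices for illustration.

```r
# Minimum and maximum annual_salary as reported in the statistics above
salary_min <- 0
salary_max <- 58400000

# Approximate number of bins for a given bin width
approx_bins <- function(binwidth) ceiling((salary_max - salary_min) / binwidth)

approx_bins(1000)    # bin width used in the histogram
approx_bins(100000)  # a coarser, hypothetical alternative
```

Tens of thousands of bins, most of them empty, is one reason the unfiltered histogram can be hard to read. Restricting the data to one currency or widening the bins are two possible remedies.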
After exploring the continuous variable annual salary, we might notice that the values are quite high, especially in certain bins. This raises a critical question: are the salaries in this dataset separated by currency? Given that salaries in different currencies can vary significantly, we should focus on one currency at a time. In this case, we’ll filter the dataset to include only salaries reported in USD. By doing this, we ensure that our analysis is relevant and that we’re not mixing different currencies.

### **2. Your Turn** ⤵

❓ How would you filter the dataset to focus only on salaries reported in USD?

1. Task: Filter the dataset to only include rows where the salary is reported in USD.
2. Steps:
   - Use the filter() function from dplyr to create a new data frame that only includes rows where the currency is “USD”.
   - Ensure the filtered data is assigned to a new object, such as usd_salary.
   - Display the resulting data frame to confirm the filtering worked correctly.
3. Goal: By completing this task, you’ll be able to filter out non-USD salaries and work with a cleaner, more focused dataset.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.
   - The reduced table, which contains only entries where the salary is in USD, still has 11,114 rows, a substantial share of the 13,349 in the full dataset. This makes sense given the earlier bar plot of the currencies used in the dataset.
# ADD CODE AND COMMENTS
# Keep only the rows where the salary is reported in USD
usd_salary <- filter(salary_df, currency == "USD")
usd_salary
# A tibble: 11,114 × 21
timestamp age industry functional_area job_title Job_title_context
<chr> <ord> <fct> <chr> <chr> <chr>
1 4/9/2024 11:01:42 25-34 Media & … Media & Digital Digital … <NA>
2 4/9/2024 11:02:14 35-44 Educatio… Health care Senior D… <NA>
3 4/9/2024 11:02:18 35-44 Nonprofi… Administration Advancem… <NA>
4 4/9/2024 11:02:19 25-34 Governme… Government & P… Program … <NA>
5 4/9/2024 11:02:24 25-34 Governme… Administration Project … <NA>
6 4/9/2024 11:02:37 45-54 Governme… Health care Microbio… <NA>
7 4/9/2024 11:02:37 18-24 Nonprofi… Marketing, Adv… Public E… <NA>
8 4/9/2024 11:02:40 25-34 Computin… Business or Co… Client S… <NA>
9 4/9/2024 11:02:45 35-44 Accounti… Accounting, Ba… Accounta… <NA>
10 4/9/2024 11:02:46 45-54 Nonprofi… Nonprofits Developm… <NA>
# ℹ 11,104 more rows
# ℹ 15 more variables: annual_salary <dbl>, additional_compensation <dbl>,
# currency <fct>, other_currency <chr>, income_context <chr>, country <fct>,
# state <chr>, city_region <chr>, work_mode <fct>, unionized <fct>,
# yr_experience <chr>, experience_in_field <chr>, education_level <fct>,
# gender <fct>, race <fct>
This code filters the dataset so that we’re only looking at respondents who reported their salary in USD. This step is important because it keeps the analysis consistent: we’re not comparing salaries reported in other currencies, which might distort the distribution.

What else might we want to do? Check the salary data to see its distribution.

#### Summary table
summary(usd_salary$annual_salary)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0 66500 90386 105516 127950 4300000
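The mean (105,516) sits well above the median (90,386), which is consistent with a right-skewed distribution. A common convention, and the one box plots conventionally use for their whiskers, flags values more than 1.5 times the interquartile range beyond the quartiles. The following is a worked sketch of that rule using the quartiles printed above. The variable names are hypothetical, and the fences that ggplot2 computes may differ slightly because it derives its hinges from the raw data.

```r
# Quartiles copied from the summary() output above
q1 <- 66500
q3 <- 127950

iqr <- q3 - q1                  # interquartile range
upper_fence <- q3 + 1.5 * iqr   # values above this are conventionally flagged
lower_fence <- q1 - 1.5 * iqr   # negative here, so no salary falls below it

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
```

Under this rule the upper fence lands at 220,125, so the maximum of 4,300,000 is far outside the typical range.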
#### Step 3: Box Plot
# Create a box plot of annual salary
ggplot(usd_salary, aes(y = annual_salary)) +
  geom_boxplot() +
  labs(title = "Box Plot of Annual Salary (USD)", x = "", y = "Annual Salary (USD)") +
  theme_minimal()
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `yr_experience_numeric = as.numeric(yr_experience)`.
Caused by warning:
! NAs introduced by coercion
# Create a scatter plot of annual salary vs. years of experience (numeric)
ggplot(usd_salary_clean, aes(x = yr_experience_numeric, y = annual_salary)) +
  geom_point(alpha = 0.6, color = "pink") +
  labs(
    title = "Scatter Plot of Annual Salary vs. Years of Experience",
    x = "Years of Experience",
    y = "Annual Salary (USD)"
  ) +
  theme_minimal()
Warning: Removed 11114 rows containing missing values or values outside the scale range
(`geom_point()`).
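The two warnings fit together: the first reports that `as.numeric(yr_experience)` introduced NAs, and the second shows that all 11,114 rows were dropped from the scatter plot. A small base R sketch of why this happens, using three of the labels that appear in the `yr_experience` column (the vector name `experience_labels` is hypothetical):

```r
# Labels as they appear in the yr_experience column
experience_labels <- c("1 year or less", "2-4 years", "11-20 years")

# as.numeric() cannot parse these strings, so every element becomes NA
converted <- suppressWarnings(as.numeric(experience_labels))
converted

# Share of values lost in the conversion
mean(is.na(converted))
```

Because the column holds ranges written as text, a direct numeric conversion loses every value. Each range has to be mapped to a representative number explicitly.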
# inspect visualization
viz_2
# A tibble: 11,114 × 22
timestamp age industry functional_area job_title Job_title_context
<chr> <ord> <fct> <chr> <chr> <chr>
1 4/9/2024 11:01:42 25-34 Media & … Media & Digital Digital … <NA>
2 4/9/2024 11:02:14 35-44 Educatio… Health care Senior D… <NA>
3 4/9/2024 11:02:18 35-44 Nonprofi… Administration Advancem… <NA>
4 4/9/2024 11:02:19 25-34 Governme… Government & P… Program … <NA>
5 4/9/2024 11:02:24 25-34 Governme… Administration Project … <NA>
6 4/9/2024 11:02:37 45-54 Governme… Health care Microbio… <NA>
7 4/9/2024 11:02:37 18-24 Nonprofi… Marketing, Adv… Public E… <NA>
8 4/9/2024 11:02:40 25-34 Computin… Business or Co… Client S… <NA>
9 4/9/2024 11:02:45 35-44 Accounti… Accounting, Ba… Accounta… <NA>
10 4/9/2024 11:02:46 45-54 Nonprofi… Nonprofits Developm… <NA>
# ℹ 11,104 more rows
# ℹ 16 more variables: annual_salary <dbl>, additional_compensation <dbl>,
# currency <fct>, other_currency <chr>, income_context <chr>, country <fct>,
# state <chr>, city_region <chr>, work_mode <fct>, unionized <fct>,
# yr_experience <chr>, experience_in_field <chr>, education_level <fct>,
# gender <fct>, race <fct>, yr_experience_numeric <dbl>
Filter out the extreme values
# Filter out the extreme values
usd_salary_no_outliers <- usd_salary_clean %>%
  filter(annual_salary < 200000)
### **3. Your Turn** ⤵

❓ Now that you’ve filtered out the extreme values, it’s time to create another boxplot. What do you observe after removing the outliers?

1. Task: Create a new box plot for the filtered dataset that excludes extreme salary values.
2. Steps:
   - Use the filtered dataset (e.g., usd_salary_no_outliers) to create a box plot that visualizes the distribution of salaries.
   - Use the geom_boxplot() function to display the distribution.
   - Add appropriate title and labels using labs() to ensure the plot is well-labeled.
3. Goal: Visualize the distribution of annual salaries after removing outliers. Compare this new box plot with the previous one that included extreme values, and reflect on how the distribution has changed.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.
# ADD CODE AND COMMENTS
# Create a box plot of annual salary without the extreme values
ggplot(usd_salary_no_outliers, aes(y = annual_salary)) +
  geom_boxplot() +
  labs(title = "Box Plot of Annual Salary (USD)", x = "", y = "Annual Salary (USD)") +
  theme_minimal()
We will now explore the trend between years of experience and annual salary.
usd_salary_no_outliers %>%
  count(yr_experience, name = "Count") %>%
  arrange(desc(Count))
# A tibble: 8 × 2
yr_experience Count
<chr> <int>
1 11-20 years 4217
2 21-30 years 2012
3 8-10 years 1740
4 5-7 years 1142
5 31-40 years 676
6 2-4 years 475
7 41 years or more 147
8 1 year or less 81
usd_salary_no_outliers <- usd_salary_no_outliers %>%
  mutate(yr_experience_numeric = case_when(
    yr_experience == "1 year or less" ~ 1,
    yr_experience == "2-4 years" ~ 3,
    yr_experience == "5-7 years" ~ 6,
    yr_experience == "8-10 years" ~ 9,
    yr_experience == "11-20 years" ~ 15,
    yr_experience == "21-30 years" ~ 25,
    yr_experience == "31-40 years" ~ 35,
    yr_experience == "41 years or more" ~ 45
  ))
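The same mapping can also be written as a named lookup vector, which keeps the labels and their representative values in one place. This is only an alternative sketch in base R. The names `experience_midpoints` and `responses` are hypothetical, and the numbers are copied from the `case_when()` call above.

```r
# Representative value (in years) for each experience band, as in case_when()
experience_midpoints <- c(
  "1 year or less"   = 1,
  "2-4 years"        = 3,
  "5-7 years"        = 6,
  "8-10 years"       = 9,
  "11-20 years"      = 15,
  "21-30 years"      = 25,
  "31-40 years"      = 35,
  "41 years or more" = 45
)

# Hypothetical responses, including one label that is not in the table
responses <- c("2-4 years", "11-20 years", "41 years or more", "unknown")

# Indexing by name returns the mapped value, or NA for unmatched labels
years_numeric <- unname(experience_midpoints[responses])
years_numeric
```

As with `case_when()`, any label without a match becomes `NA`, so counting the `NA` values afterwards is a quick check that every band was covered.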
# Create a scatter plot of annual salary vs. numeric years of experience
ggplot(usd_salary_no_outliers, aes(x = yr_experience_numeric, y = annual_salary)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  labs(
    title = "Scatter Plot of Annual Salary vs. Years of Experience",
    x = "Years of Experience (Numeric)",
    y = "Annual Salary (USD)"
  ) +
  theme_minimal()

# Create a scatter plot of annual salary vs. numeric years of experience with custom labels
viz_3 <- ggplot(usd_salary_no_outliers, aes(x = yr_experience_numeric, y = annual_salary)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  scale_x_continuous(
    breaks = c(1, 3, 6, 9, 15, 25, 35, 45), # Numeric values
    labels = c("1 year or less", "2-4 years", "5-7 years", "8-10 years",
               "11-20 years", "21-30 years", "31-40 years", "41 years or more") # Custom labels
  ) +
  labs(
    title = "Scatter Plot of Annual Salary vs. Years of Experience",
    x = "Years of Experience",
    y = "Annual Salary (USD)"
  ) +
  theme_minimal()

# inspect visualization
viz_3

# Violin plot of annual salary by years of experience
viz_4 <- ggplot(usd_salary_no_outliers, aes(x = as.factor(yr_experience_numeric), y = annual_salary)) +
  geom_violin(fill = "lightblue") +
  scale_x_discrete(
    labels = c("1 year or less", "2-4 years", "5-7 years", "8-10 years",
               "11-20 years", "21-30 years", "31-40 years", "41 years or more")
  ) +
  labs(
    title = "Violin Plot of Annual Salary by Years of Experience",
    x = "Years of Experience",
    y = "Annual Salary (USD)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# inspect viz
viz_4
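In the violin plot above, the labels passed to `scale_x_discrete()` are matched to the axis by position, so the labelling relies on all eight experience bands being present in the data. One possible alternative, sketched here in base R with hypothetical names (`experience_levels`, `responses`), is to turn the original text labels into a factor with explicit levels, so that each category carries its own label and keeps the intended order:

```r
# Experience bands in their natural order
experience_levels <- c("1 year or less", "2-4 years", "5-7 years", "8-10 years",
                       "11-20 years", "21-30 years", "31-40 years",
                       "41 years or more")

# Hypothetical subset in which some bands are absent
responses <- c("11-20 years", "2-4 years", "41 years or more", "2-4 years")

# Explicit levels keep the intended order and tie each label to its own band
experience_factor <- factor(responses, levels = experience_levels)

levels(experience_factor)
table(experience_factor)
```

A factor built this way could be mapped to the x-axis directly, without a separate vector of labels.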
### **4. Your Turn** ⤵

❓ Can you use facet_wrap to break down the violin plot by gender?

1. Task: Create a violin plot to show the distribution of annual salary by years of experience, and then use facet_wrap to break down the plot by gender.
2. Steps:
   - Use the filtered dataset (e.g., usd_salary_no_outliers) to create a violin plot of annual salary against years of experience.
   - Use geom_violin() to visualize the distribution.
   - Use facet_wrap(~gender) to split the plot by gender, so you can compare salary distributions for different genders.
   - Ensure the labels and titles clearly explain the plot.
3. Goal: Create a violin plot that compares salary distributions by years of experience and gender. Reflect on how the distribution differs between genders.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.
# ADD CODE AND COMMENTS
# Violin plot of annual salary by years of experience, one panel per gender
ggplot(usd_salary_no_outliers, aes(x = factor(yr_experience_numeric), y = annual_salary)) +
  geom_violin(fill = "lightblue", color = "black") +
  facet_wrap(~ gender) +
  theme_minimal() +
  labs(
    title = "Distribution of Annual Salary by Years of Experience and Gender",
    x = "Years of Experience",
    y = "Annual Salary",
    caption = "Data filtered to exclude outliers"
  )
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Warning in max(data$density, na.rm = TRUE): no non-missing arguments to max;
returning -Inf
Warning: Computation failed in `stat_ydensity()`.
Caused by error in `$<-.data.frame`:
! replacement has 1 row, data has 0
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Warning in max(data$density, na.rm = TRUE): no non-missing arguments to max;
returning -Inf
Warning: Computation failed in `stat_ydensity()`.
Caused by error in `$<-.data.frame`:
! replacement has 1 row, data has 0
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Warning in max(data$density, na.rm = TRUE): no non-missing arguments to max;
returning -Inf
Warning: Computation failed in `stat_ydensity()`.
Caused by error in `$<-.data.frame`:
! replacement has 1 row, data has 0
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
I wanted to compare salaries across genders, but I noticed that some entries list multiple genders, separated by commas. We want to split these entries into separate rows so that each gender can be counted.
# Use separate_rows to split gender but keep other data intact
usd_salary_no_outliers <- usd_salary_no_outliers %>%
  separate_rows(gender, sep = ",")
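To see what `separate_rows()` does to the data, here is a small sketch on a hypothetical data frame (`toy` and its values are invented). It uses the separator `",\\s*"` instead of a bare comma, on the assumption that the combined entries may contain a space after each comma. Whether the survey data does is not confirmed here.

```r
library(tidyr)

# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  id = 1:3,
  gender = c("Woman", "Man", "Woman, Non-binary"),
  annual_salary = c(80000, 90000, 70000)
)

# ",\\s*" splits on a comma and swallows any spaces that follow it
toy_long <- separate_rows(toy, gender, sep = ",\\s*")

toy_long
table(toy_long$gender)
```

Respondent 3 now appears in two rows, once per listed gender, with the same salary in both. This is useful for counting, but it means such respondents contribute to more than one panel in a faceted plot.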
# Violin plot of annual salary by years of experience
ggplot(usd_salary_no_outliers, aes(x = as.factor(yr_experience_numeric), y = annual_salary)) +
  geom_violin(fill = "lightblue") +
  scale_x_discrete(
    labels = c("1 year or less", "2-4 years", "5-7 years", "8-10 years",
               "11-20 years", "21-30 years", "31-40 years", "41 years or more")
  ) +
  labs(
    title = "Violin Plot of Annual Salary by Years of Experience",
    x = "Years of Experience",
    y = "Annual Salary (USD)"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  facet_wrap(~gender)
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
### **5. Your Turn** ⤵

❓ Create a similar violin plot for annual salary and age variables. Can you visualize the salary distribution across different age groups?

1. Task: Create a violin plot for annual salary by age groups, and use facet_wrap to break it down by gender.
2. Steps:
   - Use the filtered dataset (e.g., usd_salary_no_outliers) to create a violin plot of annual salary by age.
   - Use geom_violin() to visualize the distribution for different age groups.
   - Modify the x-axis labels to show specific age groups.
   - Add facet_wrap(~gender) to split the plot by gender and allow for comparison.
   - Ensure that the title and axis labels are updated to reflect the new variables.
3. Goal: Visualize how annual salary varies across different age groups, and examine how this pattern differs between genders.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.
#ADD CODE AND COMMENTS
# Violin plot of annual salary by age group, split into one panel per gender
ggplot(usd_salary_no_outliers, aes(x = factor(age), y = annual_salary)) +
  geom_violin(fill = "lightgreen", color = "black") +
  facet_wrap(~ gender) +
  theme_minimal() +
  labs(title = "Distribution of Annual Salary by Age Groups and Gender",
       x = "Age Group",
       y = "Annual Salary",
       caption = "Data filtered to exclude outliers") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
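One detail worth checking in the plot above: if age is stored as text, factor(age) orders the groups alphabetically, which may not match their natural order. The sketch below sets the level order explicitly. The age-group labels in age_levels are an assumption made for illustration; check unique(usd_salary_no_outliers$age) for the labels actually used in the survey.

```r
library(ggplot2)

# Hypothetical age-group labels in their natural order (an assumption, not taken from the data)
age_levels <- c("under 18", "18-24", "25-34", "35-44", "45-54", "55-64", "65 or over")

# Toy data with invented salaries, only to show the ordering behaviour
toy_age <- data.frame(
  age = c("25-34", "under 18", "65 or over", "18-24", "25-34", "18-24"),
  annual_salary = c(60000, 20000, 90000, 40000, 65000, 42000)
)

# Default behaviour: levels are sorted alphabetically
default_order <- levels(factor(toy_age$age))

# Explicit levels keep the groups in their natural order on the x-axis
toy_age$age <- factor(toy_age$age, levels = age_levels)

age_plot <- ggplot(toy_age, aes(x = age, y = annual_salary)) +
  geom_violin(fill = "lightgreen", color = "black") +
  scale_x_discrete(drop = FALSE) +  # keep empty age groups visible on the axis
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Comparing default_order with levels(toy_age$age) shows the difference. Using scale_x_discrete(drop = FALSE) is optional; it keeps age groups with no respondents on the axis so that panels stay comparable.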
Advanced GGPlot: Interactive Plots with Plotly
To make our visualizations interactive, we can use the plotly package. This allows us to transform our ggplot graphs into interactive visualizations that users can explore. First, install and load the plotly package if you have not already:
if (!require(plotly)) {install.packages("plotly")}
Loading required package: plotly
Warning: package 'plotly' was built under R version 4.3.3
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(plotly)
ggplotly(viz_1)
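Since viz_1 was built earlier in the notebook, here is a self-contained sketch of the same idea on a small made-up data frame. It also uses the tooltip argument of ggplotly() to limit the hover text to the x and y aesthetics. The object names and salary values are illustrative assumptions only.

```r
library(ggplot2)
library(plotly)

# Toy summary table with invented values, used only to demonstrate ggplotly()
toy_summary <- data.frame(
  yr_experience = factor(
    c("1 year or less", "2-4 years", "5-7 years"),
    levels = c("1 year or less", "2-4 years", "5-7 years")
  ),
  median_salary = c(50000, 62000, 75000)
)

# A static ggplot object, analogous to viz_1
toy_bar <- ggplot(toy_summary, aes(x = yr_experience, y = median_salary)) +
  geom_col(fill = "steelblue") +
  labs(title = "Median Salary by Years of Experience (toy data)",
       x = "Years of Experience",
       y = "Median Annual Salary (USD)") +
  theme_minimal()

# Convert to an interactive widget; tooltip restricts hover text to these aesthetics
toy_interactive <- ggplotly(toy_bar, tooltip = c("x", "y"))

# Type toy_interactive at the console (or end a notebook chunk with it) to display the widget
```

Storing the result in an object, rather than calling ggplotly() directly, makes it easier to reuse or save the widget later.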
Beeswarm Plot for Enhanced Data Visualization
Another interesting way to visualize distributions is a [beeswarm plot](https://r-graph-gallery.com/beeswarm.html), which shows individual data points without too much overlap. First, install and load the ggbeeswarm package:
if (!require(ggbeeswarm)) {install.packages("ggbeeswarm")}
Loading required package: ggbeeswarm
Warning: package 'ggbeeswarm' was built under R version 4.3.3
library(ggbeeswarm)
Create a beeswarm plot:
ggplot(usd_salary_no_outliers, aes(x = as.factor(yr_experience_numeric), y = annual_salary)) +
  geom_beeswarm(color = "blue", size = .03, alpha = 0.6, method = "compactswarm", side = -1L) +
  scale_x_discrete(labels = c("1 year or less", "2-4 years", "5-7 years", "8-10 years",
                              "11-20 years", "21-30 years", "31-40 years", "41 years or more")) +
  labs(title = "Beeswarm Plot of Annual Salary by Years Experience",
       x = "Years Experience",
       y = "Annual Salary (USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
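With many respondents per group, an exact beeswarm layout can become slow or very wide. The ggbeeswarm package also provides geom_quasirandom(), which jitters points within the shape of the distribution instead of packing them without overlap. The sketch below tries it on simulated data; the data frame toy_swarm and its salary values are invented for illustration and are not the survey data.

```r
library(ggplot2)
library(ggbeeswarm)

set.seed(42)  # make the simulated salaries reproducible

experience_levels <- c("1 year or less", "2-4 years", "5-7 years")

# Simulated salaries: 200 points per experience group (illustrative values only)
toy_swarm <- data.frame(
  experience = factor(rep(experience_levels, each = 200), levels = experience_levels),
  annual_salary = c(
    rnorm(200, mean = 50000, sd = 8000),
    rnorm(200, mean = 62000, sd = 9000),
    rnorm(200, mean = 75000, sd = 10000)
  )
)

# Quasirandom jitter: points spread out roughly in the shape of a violin
quasi_plot <- ggplot(toy_swarm, aes(x = experience, y = annual_salary)) +
  geom_quasirandom(color = "blue", size = 0.5, alpha = 0.6) +
  labs(title = "Quasirandom Plot of Annual Salary by Years Experience (toy data)",
       x = "Years Experience",
       y = "Annual Salary (USD)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```

Which layout reads better depends on the data; trying both geom_beeswarm() and geom_quasirandom() on the same groups is a quick way to compare them.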
**6. Your Turn** ⤵
Create a beeswarm plot using usd_salary_no_outliers to visualize the relationship between age and annual salary.
1. Task: Create a beeswarm plot with age groups on the x-axis and annual salary on the y-axis.
2. Steps:
   - Use the ggplot() function to map age to the x-axis and annual salary to the y-axis.
   - Add the beeswarm layer using geom_beeswarm(), adjusting parameters like color, size, alpha (transparency), and method to make the plot clear and visually appealing.
   - Customize the title and axis labels using the labs() function.
   - Make sure the x-axis labels are angled for readability.
   - Finally, make the plot interactive with ggplotly().
3. Goal: Create a visually appealing beeswarm plot that clearly displays the distribution of salaries across different age groups. Reflect on the spread of salaries within each age group.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.
#ADD CODE AND COMMENTS
# Create a beeswarm plot with age groups on the x-axis and annual salary on the y-axis
beeswarm_plot <- ggplot(usd_salary_no_outliers, aes(x = as.factor(age), y = annual_salary)) +
  geom_beeswarm(color = "dodgerblue", size = 1, alpha = 0.4, method = "swarm") +
  theme_minimal() +
  labs(title = "Beeswarm Plot of Annual Salary by Age Groups",
       x = "Age Group",
       y = "Annual Salary",
       caption = "Data filtered to exclude outliers") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability

# Make the plot interactive using plotly
interactive_plot <- ggplotly(beeswarm_plot)

# Display the interactive plot
interactive_plot
Insights about the graph: It seems that most salaries are paid to people between 18 and 65, which makes sense because this is the typical working-age range. Furthermore, there doesn't seem to be any difference in the maximum annual salary across the age groups, which could be cause for further analysis, as one might expect salary to increase with age. One idea is to compare annual salary with years of experience to gain further insight into the distribution of salaries in the data. That said, the graph also shows that there are still many more young people receiving a lower salary, between 50,000 and 100,000 USD, which showcases a fairly wide spread of salaries within each age group. This may mean that the data contains majori