WA_3_manager_salary_surveys

DSC_406.001

Author

August Pallesen

Published

October 27, 2024

0. Introduction

a. Overview

This section of the narrated notebook discusses the dataset derived from a comprehensive manager salary survey conducted in 2024. The survey aimed to gather detailed information about the salaries, industries, job functions, and demographics of managers across various sectors globally.

b. Data Collection

The data was collected via an online survey that included multiple-choice questions and open-ended responses to capture detailed information about the respondents' job titles, industries, and salaries.

c. Purpose of the Data

The primary purpose of this dataset is to analyze trends in manager salaries across different industries and regions. It helps identify factors influencing salary variations and provides insights that can assist stakeholders like HR professionals, recruiters, and policy-makers in making informed decisions.

d. Data Description

The dataset contains responses to questions about age, industry, job function, salary, work experience, education level, and demographic information. The key stakeholders interested in this dataset include:

  • HR departments analyzing compensation fairness.
  • Recruiters seeking industry-specific salary benchmarks.
  • Economic researchers studying employment trends.
  • Policy-makers shaping labor market regulations.

This dataset was sourced from an online survey platform and is crucial for understanding compensation trends and disparities in the labor market.

Let’s Dive in:

1. Setup

Load Packages

Use the library() function to load the tidyverse package. Then read the CSV from the published Google Sheet and save it as a new object called salary_df.

library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.3.3
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tibble' was built under R version 4.3.3
Warning: package 'tidyr' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
Warning: package 'purrr' was built under R version 4.3.3
Warning: package 'dplyr' was built under R version 4.3.3
Warning: package 'stringr' was built under R version 4.3.3
Warning: package 'forcats' was built under R version 4.3.3
Warning: package 'lubridate' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# URL to the CSV published Google Sheet
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vST9_KrP-oqKCOxWxsrcZ1AQUCys5hFJquQY0iH3dTxY2LKAdcia2vQhs6uhOmFSBMRxDp3E3iZY85M/pub?gid=1401121012&single=true&output=csv"

# Read the CSV data directly into R
salary_df <- read_csv(url)
Rows: 13349 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (19): Timestamp, How old are you?, What industry is your employer in?, W...
dbl  (2): What is your annual salary? This should be your GROSS (pre-tax) in...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1a. Your Turn ⤵

  • Inspect the dataframe with the glimpse() function.
# ADD YOUR CODE BELOW with comments
glimpse(salary_df)
Rows: 13,349
Columns: 21
$ Timestamp                                                                                                                                                                                                                                                                                              <chr> …
$ `How old are you?`                                                                                                                                                                                                                                                                                     <chr> …
$ `What industry is your employer in?`                                                                                                                                                                                                                                                                   <chr> …
$ `What is the functional area of your job (this might be different from your company's industry)?`                                                                                                                                                                                                      <chr> …
$ `Job title`                                                                                                                                                                                                                                                                                            <chr> …
$ `If your job title needs additional context, please clarify here:`                                                                                                                                                                                                                                     <chr> …
$ `What is your annual salary? This should be your GROSS (pre-tax) income. (You'll indicate the currency in a later question.) If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.`                     <dbl> …
$ `How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Only include monetary compensation here, not the value of benefits, tuition reimbursement, etc. If your bonus or overtime varies from year to year, use the most recent figures.` <dbl> …
$ `Please indicate the currency`                                                                                                                                                                                                                                                                         <chr> …
$ `If "Other," please indicate the currency here:`                                                                                                                                                                                                                                                       <chr> …
$ `If your income needs additional context, please provide it here:`                                                                                                                                                                                                                                     <chr> …
$ `What country do you work in? (Countries listed had by far the largest representation last year. Please write in your country if it's not listed.)`                                                                                                                                                    <chr> …
$ `If you're in the U.S., what state do you work in?`                                                                                                                                                                                                                                                    <chr> …
$ `What city/region do you work in?`                                                                                                                                                                                                                                                                     <chr> …
$ `Are you remote or on-site?`                                                                                                                                                                                                                                                                           <chr> …
$ `Is your job unionized?`                                                                                                                                                                                                                                                                               <chr> …
$ `How many years of professional work experience do you have overall?`                                                                                                                                                                                                                                  <chr> …
$ `How many years of professional work experience do you have in your field?`                                                                                                                                                                                                                            <chr> …
$ `What is your highest level of education completed?`                                                                                                                                                                                                                                                   <chr> …
$ `What is your gender?`                                                                                                                                                                                                                                                                                 <chr> …
$ `What is your race? (Choose all that apply.)`                                                                                                                                                                                                                                                          <chr> …
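
The row count, column count, and column types that glimpse() reports can also be queried one at a time. The following is a minimal sketch using base R on a small, hypothetical data frame (toy_df and its values are made up for illustration and are not part of the survey data); the same calls should work on salary_df.

```r
# Hypothetical toy data standing in for salary_df
toy_df <- data.frame(
  age = c("25-34", "35-44", "35-44"),
  annual_salary = c(73000, 150000, 53800),
  stringsAsFactors = FALSE
)

n_obs  <- nrow(toy_df)               # number of observations (rows)
n_vars <- ncol(toy_df)               # number of variables (columns)
col_types <- sapply(toy_df, class)   # class of each column, named by column

print(n_obs)
print(n_vars)
print(col_types)
print(table(col_types))              # how many columns there are of each type
```

Tabulating the column classes is a quick way to count how many character and numeric columns a dataframe has without reading every line of the glimpse() output.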

What is happening? The glimpse() function provides a quick overview of the dataframe, including the number of observations (rows), the number of variables (columns), and the first few entries of each column. How many observations does it have? How many variables? What are the variables? What are the data types of the variables?

2. Data Structure

2a. Your Turn

  a. Let's look at the names of the variables using the names() function before renaming them.

# ADD YOUR CODE BELOW with comments
names(salary_df)
 [1] "Timestamp"                                                                                                                                                                                                                                                                                           
 [2] "How old are you?"                                                                                                                                                                                                                                                                                    
 [3] "What industry is your employer in?"                                                                                                                                                                                                                                                                  
 [4] "What is the functional area of your job (this might be different from your company's industry)?"                                                                                                                                                                                                     
 [5] "Job title"                                                                                                                                                                                                                                                                                           
 [6] "If your job title needs additional context, please clarify here:"                                                                                                                                                                                                                                    
 [7] "What is your annual salary? This should be your GROSS (pre-tax) income. (You'll indicate the currency in a later question.) If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year."                    
 [8] "How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Only include monetary compensation here, not the value of benefits, tuition reimbursement, etc. If your bonus or overtime varies from year to year, use the most recent figures."
 [9] "Please indicate the currency"                                                                                                                                                                                                                                                                        
[10] "If \"Other,\" please indicate the currency here:"                                                                                                                                                                                                                                                    
[11] "If your income needs additional context, please provide it here:"                                                                                                                                                                                                                                    
[12] "What country do you work in? (Countries listed had by far the largest representation last year. Please write in your country if it's not listed.)"                                                                                                                                                   
[13] "If you're in the U.S., what state do you work in?"                                                                                                                                                                                                                                                   
[14] "What city/region do you work in?"                                                                                                                                                                                                                                                                    
[15] "Are you remote or on-site?"                                                                                                                                                                                                                                                                          
[16] "Is your job unionized?"                                                                                                                                                                                                                                                                              
[17] "How many years of professional work experience do you have overall?"                                                                                                                                                                                                                                 
[18] "How many years of professional work experience do you have in your field?"                                                                                                                                                                                                                           
[19] "What is your highest level of education completed?"                                                                                                                                                                                                                                                  
[20] "What is your gender?"                                                                                                                                                                                                                                                                                
[21] "What is your race? (Choose all that apply.)"                                                                                                                                                                                                                                                         
  1. To rename variables using the dplyr package's rename() function, specify the new name on the left and the existing name on the right (new_name = old_name). This keeps your dataframe tidy and easy to work with. The method is especially useful because it lets you rename selected variables without listing every variable in your dataset, as was the case with the direct assignment method we used in the last assignment.
#rename the variables
salary_df <- salary_df %>%
  rename(
    timestamp = `Timestamp`,
    age = `How old are you?`,
    industry = `What industry is your employer in?`,
    functional_area = `What is the functional area of your job (this might be different from your company's industry)?`,
    job_title = `Job title`,
    Job_title_context = `If your job title needs additional context, please clarify here:`,
    annual_salary = `What is your annual salary? This should be your GROSS (pre-tax) income. (You'll indicate the currency in a later question.) If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.`,
    additional_compensation = `How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Only include monetary compensation here, not the value of benefits, tuition reimbursement, etc. If your bonus or overtime varies from year to year, use the most recent figures.`,
    currency = `Please indicate the currency`,
    other_currency = `If "Other," please indicate the currency here:`,
    income_context = `If your income needs additional context, please provide it here:`,
    country = `What country do you work in? (Countries listed had by far the largest representation last year. Please write in your country if it's not listed.)`,
    state = `If you're in the U.S., what state do you work in?`,
    city_region = `What city/region do you work in?`,
    work_mode = `Are you remote or on-site?`,
    unionized = `Is your job unionized?`,
    yr_experience = `How many years of professional work experience do you have overall?`,
    experience_in_field = `How many years of professional work experience do you have in your field?`,
    education_level = `What is your highest level of education completed?`,
    gender = `What is your gender?`,
    race = `What is your race? (Choose all that apply.)`
  )
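
To show the selective renaming on a smaller scale, here is a hedged sketch on a hypothetical three-column data frame (toy_df and its values are invented for illustration). Only two columns are renamed; the third is left untouched, which is the property the text above describes.

```r
library(dplyr)

# Hypothetical toy data with survey-style column names;
# check.names = FALSE keeps the spaces and question marks intact
toy_df <- data.frame(
  `How old are you?` = c("25-34", "35-44"),
  `Job title` = c("Analyst", "Director"),
  Timestamp = c("4/9/2024 11:01:42", "4/9/2024 11:02:14"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# rename() takes new_name = old_name pairs; unlisted columns keep their names
renamed_df <- toy_df %>%
  rename(
    age = `How old are you?`,
    job_title = `Job title`
  )

print(names(renamed_df))
```

Backticks are needed around the old names because they contain spaces and punctuation.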

2c. Your Turn

  1. Now use the str() function to inspect the data. str() is particularly useful in large datasets where manual inspection of each column isn't feasible.
# ADD YOUR CODE BELOW with comments
str(salary_df)
spc_tbl_ [13,349 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ timestamp              : chr [1:13349] "4/9/2024 11:01:42" "4/9/2024 11:02:14" "4/9/2024 11:02:18" "4/9/2024 11:02:19" ...
 $ age                    : chr [1:13349] "25-34" "35-44" "35-44" "25-34" ...
 $ industry               : chr [1:13349] "Media & Digital" "Education (Higher Education)" "Nonprofits" "Government & Public Administration" ...
 $ functional_area        : chr [1:13349] "Media & Digital" "Health care" "Administration" "Government & Public Administration" ...
 $ job_title              : chr [1:13349] "Digital Project Manager" "Senior Director" "Advancement Operations Manager" "Program Analyst" ...
 $ Job_title_context      : chr [1:13349] NA NA NA NA ...
 $ annual_salary          : num [1:13349] 73000 150000 53800 97000 64000 128000 136000 50000 68000 80000 ...
 $ additional_compensation: num [1:13349] NA 4500 NA 0 NA 0 0 NA 5000 2500 ...
 $ currency               : chr [1:13349] "USD" "USD" "USD" "USD" ...
 $ other_currency         : chr [1:13349] NA NA NA NA ...
 $ income_context         : chr [1:13349] NA NA NA NA ...
 $ country                : chr [1:13349] "United States" "United States" "United States" "United States" ...
 $ state                  : chr [1:13349] "New York" "California" "Maryland" "Colorado" ...
 $ city_region            : chr [1:13349] "New York" "Norhtern" "Olney" "Fort Collins" ...
 $ work_mode              : chr [1:13349] "Hybrid" "Hybrid" "Hybrid" "Fully remote" ...
 $ unionized              : chr [1:13349] "No" "No" "No" "No" ...
 $ yr_experience          : chr [1:13349] "5-7 years" "11-20 years" "11-20 years" "8-10 years" ...
 $ experience_in_field    : chr [1:13349] "5-7 years" "11-20 years" "8-10 years" "8-10 years" ...
 $ education_level        : chr [1:13349] "College degree" "College degree" "Master's degree" "Master's degree" ...
 $ gender                 : chr [1:13349] "Woman" "Woman" "Woman" "Woman" ...
 $ race                   : chr [1:13349] "White" "White" "White" "White" ...
 - attr(*, "spec")=
  .. cols(
  ..   Timestamp = col_character(),
  ..   `How old are you?` = col_character(),
  ..   `What industry is your employer in?` = col_character(),
  ..   `What is the functional area of your job (this might be different from your company's industry)?` = col_character(),
  ..   `Job title` = col_character(),
  ..   `If your job title needs additional context, please clarify here:` = col_character(),
  ..   `What is your annual salary? This should be your GROSS (pre-tax) income. (You'll indicate the currency in a later question.) If you are part-time or hourly, please enter an annualized equivalent -- what you would earn if you worked the job 40 hours a week, 52 weeks a year.` = col_double(),
  ..   `How much additional monetary compensation do you get, if any (for example, bonuses or overtime in an average year)? Only include monetary compensation here, not the value of benefits, tuition reimbursement, etc. If your bonus or overtime varies from year to year, use the most recent figures.` = col_double(),
  ..   `Please indicate the currency` = col_character(),
  ..   `If "Other," please indicate the currency here:` = col_character(),
  ..   `If your income needs additional context, please provide it here:` = col_character(),
  ..   `What country do you work in? (Countries listed had by far the largest representation last year. Please write in your country if it's not listed.)` = col_character(),
  ..   `If you're in the U.S., what state do you work in?` = col_character(),
  ..   `What city/region do you work in?` = col_character(),
  ..   `Are you remote or on-site?` = col_character(),
  ..   `Is your job unionized?` = col_character(),
  ..   `How many years of professional work experience do you have overall?` = col_character(),
  ..   `How many years of professional work experience do you have in your field?` = col_character(),
  ..   `What is your highest level of education completed?` = col_character(),
  ..   `What is your gender?` = col_character(),
  ..   `What is your race? (Choose all that apply.)` = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

Looking at the columns, what are you seeing?

2d. Your Turn

How many numeric, character, etc. columns are you seeing, and what are the names of the columns?

  • All but two columns in the data set are of type character. In total there are 19 character columns.
  • annual_salary and additional_compensation are the two numeric columns.

Do you think any need to be converted? If so, why, or why not?

  • Some of these could be converted. The unionized column could be converted into binary numeric values, since its entries consist of the character strings “No” or “Yes”. The age column could also be converted, since it contains numeric intervals but is still stored as character strings.

2e. Your Turn

Now let's look at the summary() function to get a summary of the data. The summary function provides additional information. It can be used for the entire dataset or for individual variables.

#inspect the data using summary() function
summary(salary_df)
  timestamp             age              industry         functional_area   
 Length:13349       Length:13349       Length:13349       Length:13349      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  job_title         Job_title_context  annual_salary     
 Length:13349       Length:13349       Min.   :       0  
 Class :character   Class :character   1st Qu.:   63860  
 Mode  :character   Mode  :character   Median :   89000  
                                       Mean   :  122366  
                                       3rd Qu.:  125000  
                                       Max.   :58400000  
                                                         
 additional_compensation   currency         other_currency    
 Min.   :      0         Length:13349       Length:13349      
 1st Qu.:      0         Class :character   Class :character  
 Median :   1800         Mode  :character   Mode  :character  
 Mean   :  12275                                              
 3rd Qu.:  10000                                              
 Max.   :2500000                                              
 NA's   :3224                                                 
 income_context       country             state           city_region       
 Length:13349       Length:13349       Length:13349       Length:13349      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  work_mode          unionized         yr_experience      experience_in_field
 Length:13349       Length:13349       Length:13349       Length:13349       
 Class :character   Class :character   Class :character   Class :character   
 Mode  :character   Mode  :character   Mode  :character   Mode  :character   
                                                                             
                                                                             
                                                                             
                                                                             
 education_level       gender              race          
 Length:13349       Length:13349       Length:13349      
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character  
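
Since summary() can also be applied to an individual variable, a short sketch of that usage follows. The vector toy_salary is hypothetical and only stands in for a column such as annual_salary. In the output above the mean of annual_salary (122366) is well above the median (89000), which suggests a few very large values such as the maximum of 58400000 pull the mean upward, so looking at one variable at a time can be worthwhile.

```r
# Hypothetical salary values standing in for salary_df$annual_salary
toy_salary <- c(50000, 64000, 73000, 97000, 150000)

# summary() on a numeric vector returns the five-number summary plus the mean
salary_summary <- summary(toy_salary)
print(salary_summary)

# Individual statistics can be pulled out by name
median_salary <- salary_summary[["Median"]]
max_salary    <- salary_summary[["Max."]]
print(median_salary)
print(max_salary)
```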
                                                         
                                                         
                                                         
                                                         

3 Wrangle

3a. Column Types:

  • <chr> (Character Strings): Most of the columns, including "How old are you?", "What industry is your employer in?", "What is the functional area of your job?", and "What is your gender?", are read as character strings.
  • Why it might need conversion: Some of these fields likely represent categorical data (factors) rather than free text (e.g., “Age,” “Industry,” “Gender”). Factors are useful in R for summarizing categorical data, making it easier to run descriptive statistics, create plots, and build models. Thus, converting them from character to factors will be beneficial.
  • Conversion needed: We will convert these character columns to factors using the as.factor() function.
  • <dbl> (Numeric Data): Some columns, like "What is your annual salary?" and "How much additional monetary compensation do you get?", are read as numeric (double) types.
  • No conversion needed: These numeric columns seem appropriate, so no type conversion is necessary here.

3b. Columns to Consider for Factor Conversion:

But first, what is a factor? Factors are data structures in R used to handle categorical data, which are variables that have a fixed number of distinct categories or levels. Factors store both the labels and the internal integer codes that represent each unique category. They are especially useful in statistical modeling because they correctly inform the model that the data is categorical, not numeric.

Age (How old are you?):

- Currently: `<chr>`
- Should be: **Factor** (`as.factor()`), because it represents categorical age ranges rather than free text.
# Convert age to a factor
salary_df$age <- as.factor(salary_df$age)
# Inspect the data
salary_df
# A tibble: 13,349 × 21
   timestamp         age   industry  functional_area job_title Job_title_context
   <chr>             <fct> <chr>     <chr>           <chr>     <chr>            
 1 4/9/2024 11:01:42 25-34 Media & … Media & Digital Digital … <NA>             
 2 4/9/2024 11:02:14 35-44 Educatio… Health care     Senior D… <NA>             
 3 4/9/2024 11:02:18 35-44 Nonprofi… Administration  Advancem… <NA>             
 4 4/9/2024 11:02:19 25-34 Governme… Government & P… Program … <NA>             
 5 4/9/2024 11:02:24 25-34 Governme… Administration  Project … <NA>             
 6 4/9/2024 11:02:30 35-44 Health c… Health care     Clinical… Advanced Practic…
 7 4/9/2024 11:02:37 45-54 Governme… Health care     Microbio… <NA>             
 8 4/9/2024 11:02:37 18-24 Nonprofi… Marketing, Adv… Public E… <NA>             
 9 4/9/2024 11:02:40 25-34 Computin… Business or Co… Client S… <NA>             
10 4/9/2024 11:02:45 35-44 Accounti… Accounting, Ba… Accounta… <NA>             
# ℹ 13,339 more rows
# ℹ 15 more variables: annual_salary <dbl>, additional_compensation <dbl>,
#   currency <chr>, other_currency <chr>, income_context <chr>, country <chr>,
#   state <chr>, city_region <chr>, work_mode <chr>, unionized <chr>,
#   yr_experience <chr>, experience_in_field <chr>, education_level <chr>,
#   gender <chr>, race <chr>
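
To make the idea of labels and internal integer codes concrete, here is a small sketch on a hypothetical character vector (toy_age is invented for illustration). It converts the vector with as.factor() and then shows the two parts a factor stores.

```r
# Hypothetical age ranges as plain character strings
toy_age <- c("25-34", "35-44", "35-44", "18-24")

# as.factor() collects the distinct values as levels (sorted) and
# stores each element as an integer index into those levels
age_factor <- as.factor(toy_age)

age_labels <- levels(age_factor)      # the distinct category labels
age_codes  <- as.integer(age_factor)  # the integer code behind each element

print(age_labels)
print(age_codes)
```

Each integer code is the position of that element's label in the levels vector, which is why repeated values share the same code.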

View the Levels of a Factor

To see the different levels that a factor variable has, you can use the levels() function. This is useful for understanding which categories are included in your factor data.

# View the levels of the age factor
levels(salary_df$age)
[1] "18-24"      "25-34"      "35-44"      "45-54"      "55-64"     
[6] "65 or over" "under 18"  

Converting to Ordered Factor

Alternatively, you could convert it to an ordered factor, since the age ranges have a logical order. This makes sense whenever a variable is a set of ranges with a natural order (e.g., “1 year or less,” “2-4 years,” etc.). Note that when no levels argument is supplied, factor() keeps the existing level order, which here is alphabetical, so “under 18” is treated as the highest level rather than the lowest.

# Convert age to an ordered factor
salary_df$age <- factor(salary_df$age, ordered = TRUE)
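
One way to get the natural order instead of the alphabetical one is to pass the levels explicitly. The sketch below uses a hypothetical vector (toy_age) and the level labels taken from the levels() output above; the chosen order, with "under 18" first, is an assumption about what the analysis intends.

```r
# Hypothetical sample of age ranges
toy_age <- c("25-34", "under 18", "65 or over", "18-24")

# Level labels as they appear in the data, listed from youngest to oldest
age_levels <- c("under 18", "18-24", "25-34", "35-44",
                "45-54", "55-64", "65 or over")

# Supplying levels fixes the order; ordered = TRUE makes comparisons meaningful
age_ordered <- factor(toy_age, levels = age_levels, ordered = TRUE)

print(levels(age_ordered))
print(max(age_ordered))         # highest age range present in the sample
print(age_ordered < "25-34")    # which respondents are younger than 25-34
```

With the levels given in this order, summaries and plots list "under 18" first, and comparisons such as < follow the age order.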

Summary of a Factor

A quick way to get the count of each level within a factor is by using the summary() function. This function provides a count of occurrences for each level.

# Summary of the age factor
summary(salary_df$age)
     18-24      25-34      35-44      45-54      55-64 65 or over   under 18 
       175       4100       5552       2403        993        125          1 

Table of a Factor

To get a frequency table of the levels, you can use the table() function. This is similar to summary() but is used specifically for getting the frequency of each level.

table(salary_df$age)

     18-24      25-34      35-44      45-54      55-64 65 or over   under 18 
       175       4100       5552       2403        993        125          1 

Using dplyr to Summarize Factors

If you are using the dplyr package, you can also create grouped summaries or counts of factor levels easily.

salary_df %>%
 count(age) %>%
 arrange(desc(n)) # Arranges the output in descending order of counts
# A tibble: 7 × 2
  age            n
  <ord>      <int>
1 35-44       5552
2 25-34       4100
3 45-54       2403
4 55-64        993
5 18-24        175
6 65 or over   125
7 under 18       1
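Counts are often easier to compare as shares of the total. As a small sketch on made-up data (toy_df and age_share are illustrative names), count() can be combined with mutate() to add a proportion column:

```r
library(dplyr)

# Toy data standing in for salary_df (values are made up for illustration)
toy_df <- data.frame(
  age = factor(c("25-34", "35-44", "35-44", "18-24", "35-44", "25-34"))
)

age_share <- toy_df %>%
  count(age, sort = TRUE) %>%   # sort = TRUE orders by descending n
  mutate(share = n / sum(n))    # proportion of all rows in each level

age_share
```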

3c. Your Turn

Your Task: Look at the other character variables and decide whether each should be converted to a factor or an ordered factor. Once you have converted your character columns to factors, explore them to understand the levels they contain and how the data within each category is structured.

Hint: you need to look at 8 variables in total.

# ADD YOUR CODE BELOW with comments
salary_df$work_mode <- as.factor(salary_df$work_mode)
salary_df$industry <- as.factor(salary_df$industry)
salary_df$unionized <- as.factor(salary_df$unionized)
salary_df$currency <- as.factor(salary_df$currency)
salary_df$education_level <- as.factor(salary_df$education_level)
salary_df$country <- as.factor(salary_df$country)
salary_df$gender <- as.factor(salary_df$gender)
salary_df$race <- as.factor(salary_df$race)
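The eight conversions can also be written in one step, followed by a quick look at the resulting levels. This is only a sketch on a made-up data frame (toy_df and factor_cols are illustrative names); it assumes dplyr is loaded, as it is when tidyverse is attached.

```r
library(dplyr)

# Toy data standing in for salary_df (values are made up for illustration)
toy_df <- data.frame(
  work_mode     = c("Remote", "Hybrid", "On-site", "Remote"),
  unionized     = c("No", "No", "Yes", NA),
  annual_salary = c(50000, 72000, 61000, 88000),
  stringsAsFactors = FALSE
)

# Columns to convert; numeric columns are left untouched
factor_cols <- c("work_mode", "unionized")

toy_df <- toy_df %>%
  mutate(across(all_of(factor_cols), as.factor))

# Number of levels in each converted column
sapply(toy_df[factor_cols], nlevels)

# Counts per level; missing values are reported as NA's
summary(toy_df$unionized)
```

Columns with hundreds of levels (free-text answers, for example) usually show up immediately in the nlevels() output and may be better left as character.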

Calculating descriptive statistics

Univariate EDA is the process of exploring and summarizing the main characteristics of a single variable. This process helps you understand the distribution of the data, identify outliers, and detect patterns or trends.

Now that we have made sure that the type conversions are correct, we can look at the describe() and describeBy() functions. describe() computes descriptive statistics for numerical data, which help determine the distribution of numerical variables. Like the dplyr functions, it takes the tibble (or data frame) as its first argument; the remaining arguments control how the statistics are computed.

The first thing we need to do is to add the correct package:

# Check if the psych package is installed. If not, install it.
if (!require(psych)) {
 install.packages("psych", dependencies = TRUE)
}
Loading required package: psych
Warning: package 'psych' was built under R version 4.3.3

Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':

    %+%, alpha
# Load the psych package
# ADD YOUR CODE BELOW with comments
library(psych)

3d. Numerical Variables

Your Task: Use the describe() function from the psych package to generate descriptive statistics for numerical variables.

# Describe the two numeric salary variables
numerical_descriptions <- describe(salary_df[c("annual_salary", 
"additional_compensation")])
# Inspect the object
numerical_descriptions
                        vars     n      mean        sd median  trimmed      mad
annual_salary              1 13349 122365.52 793049.45  89000 93870.03 42995.40
additional_compensation    2 10125  12275.22  53872.31   1800  4487.77  2668.68
                        min      max    range  skew kurtosis      se
annual_salary             0 58400000 58400000 56.74  3552.38 6863.98
additional_compensation   0  2500000  2500000 22.20   758.57  535.39

Do you want to see this as a nice table?

# read in knitr package
library(knitr)
Warning: package 'knitr' was built under R version 4.3.3
# Use kable() for a cleaner table format
kable(numerical_descriptions)
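|                         | vars |     n |      mean |        sd | median |   trimmed |      mad | min |      max |    range |     skew |  kurtosis |        se |
|:------------------------|-----:|------:|----------:|----------:|-------:|----------:|---------:|----:|---------:|---------:|---------:|----------:|----------:|
| annual_salary           |    1 | 13349 | 122365.52 | 793049.45 |  89000 | 93870.026 | 42995.40 |   0 | 58400000 | 58400000 | 56.73891 | 3552.3788 | 6863.9783 |
| additional_compensation |    2 | 10125 |  12275.22 |  53872.31 |   1800 |  4487.766 |  2668.68 |   0 |  2500000 |  2500000 | 22.19539 |  758.5696 |  535.3873 |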
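Several of the columns reported by describe() can be reproduced with base R, which helps show what each number means. The sketch below uses a small made-up salary vector with one extreme value; it assumes describe() uses its documented default of a 10% trimmed mean and the standard scaled mad().

```r
# Made-up salaries with one extreme outlier (not taken from the survey)
x <- c(40000, 52000, 61000, 75000, 89000, 93000,
       110000, 125000, 150000, 5000000)

n         <- length(x)
mean_x    <- mean(x)               # pulled upward by the outlier
median_x  <- median(x)             # middle value, barely affected
trimmed_x <- mean(x, trim = 0.1)   # drops the lowest and highest 10% first
mad_x     <- mad(x)                # median absolute deviation (scaled by 1.4826)
se_x      <- sd(x) / sqrt(n)       # standard error of the mean

c(mean = mean_x, median = median_x, trimmed = trimmed_x,
  mad = mad_x, se = se_x)
```

With a single outlier, the mean lands far above both the median and the trimmed mean, which is the same pattern visible in the annual_salary row.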

Columns in the Output

  • vars: Variable identifier. This column simply numbers the variables being described. For example, annual_salary is labeled 1 because it's the first variable analyzed.
  • n: Number of non-missing observations for each variable. For annual_salary, there are 13,349 entries, which tells you how many people's salaries were considered in this analysis.
  • mean: The average value of the variable. For annual_salary, the average is $122,365.52. If you added up all the salaries and divided by the number of people, you would get this amount.
  • sd: Standard deviation, which measures how spread out the values are around the mean. A high standard deviation, like $793,049.45 for annual_salary, means that salaries vary a lot: some are much higher or much lower than the average.
  • median: The middle value when all the salaries are lined up from smallest to largest. For annual_salary, the median is $89,000. This is often a better measure of central tendency than the mean when the data is skewed or includes very high or very low values (outliers).
  • trimmed: The mean after dropping a fixed percentage of the highest and lowest values (10% from each end by default), which reduces the effect of outliers. For annual_salary, the trimmed mean is $93,870.03, slightly higher than the median and far below the untrimmed mean, showing how strongly the extremes pull the average upward.
  • mad: Median absolute deviation, a measure of variability similar to the standard deviation but more robust to outliers. It measures how much the values typically differ from the median. For annual_salary, it's $42,995.40.
  • min: The smallest value recorded. For annual_salary, the smallest is $0 (perhaps indicating unpaid positions or data entry errors).
  • max: The largest value recorded. For annual_salary, the largest is $58,400,000.
  • range: The difference between the maximum and minimum values. For annual_salary, the range is $58,400,000, confirming the vast disparity in reported salaries.
  • skew: Skewness of the distribution, a measure of its asymmetry. Positive values indicate a tail to the right, negative values a tail to the left. A positive skew, like 56.74 for annual_salary, means there are many lower salaries and a few extremely high ones (the tail is on the high side).
  • kurtosis: A measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. High kurtosis indicates heavy tails, that is, more extreme values than a normal distribution would produce. A very high kurtosis, like 3552.38 for annual_salary, signals that most salaries cluster near the median while a few extreme outliers dominate the tails.
  • se: Standard error of the mean, computed as sd divided by the square root of n. It estimates how much the mean salary would vary from sample to sample drawn from the same population. For annual_salary, it is $6,863.98, which is large because the standard deviation is inflated by outliers.

3e. DescribeBy

Use describeBy() to compute the same statistics separately for each level of a grouping variable (e.g., age or industry).

# Use describeBy() to describe data by groups (here, by age)
factor_descriptions_by_age <- describeBy(salary_df, group = salary_df$age)
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
Warning in FUN(newX[, i], ...): no non-missing arguments to max; returning -Inf
print(factor_descriptions_by_age)

 Descriptive statistics by group 
group: 18-24
                        vars   n     mean        sd median  trimmed      mad
timestamp                  1 175  6105.22   3528.23   6167  6110.77  4379.60
age                        2 175     1.00      0.00      1     1.00     0.00
industry                   3 175   164.28     96.58    148   160.58   108.23
functional_area            4 170   245.57    176.29    214   234.32   182.36
job_title                  5 175  3629.57   2153.05   3678  3590.94  2722.05
Job_title_context          6  35  1403.54    821.33   1439  1398.62  1037.82
annual_salary              7 175 87468.45 276384.48  51000 56248.07 20104.06
additional_compensation    8 109  5852.70  16489.45   1000  2364.54  1482.60
currency                   9 175     9.83      2.65     11    10.48     0.00
other_currency            10   3    17.67      8.08     13    17.67     0.00
income_context            11  21   554.95    409.51    501   526.18   461.09
country                   12 175    55.85     13.02     60    59.89     0.00
state                     13 141    25.30     13.64     24    25.27    17.79
city_region               14 175  1394.09    925.64   1386  1360.74  1264.66
work_mode                 15 174     2.35      0.69      2     2.42     1.48
unionized                 16 170     1.11      0.31      1     1.01     0.00
yr_experience             17 175     2.94      1.83      3     2.67     0.00
experience_in_field       18 175     2.35      1.46      3     2.18     0.00
education_level           19 174    64.57     27.14     50    59.67     0.00
gender                    20 174     8.13      3.28     10     8.74     0.00
race                      21 173    30.61     11.69     36    33.47     0.00
                         min     max   range  skew kurtosis       se
timestamp                 74   12037   11963  0.01    -1.18   266.71
age                        1       1       0   NaN      NaN     0.00
industry                   4     389     385  0.37    -0.55     7.30
functional_area            9     635     626  0.49    -0.80    13.52
job_title                 15    7923    7908  0.13    -1.07   162.76
Job_title_context         38    2839    2801  0.10    -1.10   138.83
annual_salary           2950 3360000 3357050 10.42   114.44 20892.70
additional_compensation    0  130000  130000  5.53    34.39  1579.40
currency                   1      11      10 -2.01     2.57     0.20
other_currency            13      27      14  0.38    -2.33     4.67
income_context            14    1336    1322  0.50    -1.13    89.36
country                    2      60      58 -3.05     7.86     0.98
state                      3      50      47 -0.04    -1.24     1.15
city_region               95    3131    3036  0.13    -1.39    69.97
work_mode                  1       4       3 -0.36    -0.61     0.05
unionized                  1       2       1  2.54     4.47     0.02
yr_experience              1       8       7  1.14     0.74     0.14
experience_in_field        1       7       6  1.38     2.67     0.11
education_level            9     139     130  1.22     0.57     2.06
gender                     1      11      10 -1.48     0.52     0.25
race                       1      36      35 -1.84     1.62     0.89
------------------------------------------------------------ 
group: 25-34
                        vars    n      mean         sd  median  trimmed
timestamp                  1 4100   5923.93    3395.26  5976.5  5906.27
age                        2 4100      2.00       0.00     2.0     2.00
industry                   3 4087    161.77      95.91   157.0   158.49
functional_area            4 4061    231.69     173.06   209.0   218.98
job_title                  5 4100   3984.15    2307.42  4176.5  4002.45
Job_title_context          6  856   1438.70     829.11  1445.5  1445.00
annual_salary              7 4100 123122.04 1051153.44 79000.0 82606.73
additional_compensation    8 2978  10348.66   60669.76  1500.0  3721.56
currency                   9 4100      9.67       2.93    11.0    10.42
other_currency            10   22     18.64       7.86    18.5    19.00
income_context            11  404    664.91     389.52   690.0   665.74
country                   12 4100     54.54      15.25    60.0    59.43
state                     13 3289     25.48      14.46    24.0    25.26
city_region               14 4099   1470.68     921.76  1484.0  1452.84
work_mode                 15 4085      2.14       0.76     2.0     2.15
unionized                 16 4042      1.14       0.35     1.0     1.05
yr_experience             17 4100      5.93       2.47     7.0     6.18
experience_in_field       18 4100      5.45       2.46     7.0     5.62
education_level           19 4089     68.97      25.21    50.0    66.32
gender                    20 4086      8.68       2.99    10.0     9.45
race                      21 4083     32.48       9.24    36.0    35.29
                             mad min      max    range  skew kurtosis       se
timestamp                4243.20   5    12129    12124  0.01    -1.12    53.03
age                         0.00   2        2        0   NaN      NaN     0.00
industry                  111.19   1      406      405  0.28    -0.68     1.50
functional_area           220.91   1      638      637  0.42    -0.89     2.72
job_title                2945.18   2     7962     7960 -0.09    -1.22    36.04
Job_title_context        1066.73   5     2873     2868 -0.05    -1.21    28.34
annual_salary           35582.40   0 45000000 45000000 37.07  1437.28 16416.26
additional_compensation  2223.90   0  2500000  2500000 28.56  1043.74  1111.76
currency                    0.00   1       12       11 -1.90     1.95     0.05
other_currency             10.38   3       30       27 -0.37    -1.06     1.68
income_context            504.08   1     1343     1342 -0.05    -1.21    19.38
country                     0.00   1       60       59 -2.62     5.15     0.24
state                      19.27   1       51       50  0.05    -1.27     0.25
city_region              1266.14   1     3138     3137  0.06    -1.25    14.40
work_mode                   1.48   1        4        3  0.01    -0.77     0.01
unionized                   0.00   1        2        1  2.08     2.34     0.01
yr_experience               1.48   1        8        7 -0.78    -1.21     0.04
experience_in_field         1.48   1        8        7 -0.40    -1.56     0.04
education_level             0.00   6      153      147  0.54    -0.30     0.39
gender                      0.00   1       11       10 -2.02     2.36     0.05
race                        0.00   1       37       36 -2.60     5.38     0.14
------------------------------------------------------------ 
group: 35-44
                        vars    n      mean        sd  median  trimmed      mad
timestamp                  1 5552   6146.30   3421.71  6272.5  6167.92  4263.22
age                        2 5552      3.00      0.00     3.0     3.00     0.00
industry                   3 5535    158.72     93.79   157.0   154.15   102.30
functional_area            4 5505    229.61    173.69   209.0   216.49   216.46
job_title                  5 5552   3914.08   2303.83  3958.0  3915.46  2950.37
Job_title_context          6 1230   1464.38    834.96  1503.0  1469.31  1075.63
annual_salary              7 5552 126308.42 819481.71 93000.0 99522.56 44478.00
additional_compensation    8 4252  14163.90  56969.81  2000.0  5390.69  2965.20
currency                   9 5552      9.73      2.94    11.0    10.52     0.00
other_currency            10   26     13.77      8.27    13.5    13.36    11.12
income_context            11  567    666.16    393.97   650.0   663.56   512.98
country                   12 5552     54.26     15.79    60.0    59.26     0.00
state                     13 4573     25.84     14.51    24.0    25.72    19.27
city_region               14 5548   1495.27    919.94  1515.0  1484.79  1243.90
work_mode                 15 5535      2.11      0.79     2.0     2.10     1.48
unionized                 16 5510      1.14      0.34     1.0     1.05     0.00
yr_experience             17 5552      3.00      1.98     2.0     2.51     0.00
experience_in_field       18 5552      4.28      2.67     3.0     4.12     1.48
education_level           19 5540     73.94     26.39    90.0    72.19    44.48
gender                    20 5530      8.68      3.09    10.0     9.47     0.00
race                      21 5527     33.40      8.13    36.0    35.89     0.00
                        min      max    range  skew kurtosis       se
timestamp                 1    12128    12127 -0.06    -1.12    45.92
age                       3        3        0   NaN      NaN     0.00
industry                  2      404      402  0.41    -0.41     1.26
functional_area           9      643      634  0.43    -0.88     2.34
job_title                 1     7963     7962 -0.02    -1.24    30.92
Job_title_context         2     2874     2872 -0.06    -1.23    23.81
annual_salary             0 58400000 58400000 65.62  4612.08 10998.02
additional_compensation   0  2000000  2000000 17.68   468.27   873.67
currency                  1       12       11 -2.01     2.32     0.04
other_currency            1       31       30  0.34    -1.10     1.62
income_context            4     1348     1344  0.04    -1.26    16.55
country                   2       66       64 -2.51     4.50     0.21
state                     1       51       50  0.02    -1.25     0.21
city_region               3     3138     3135  0.01    -1.27    12.35
work_mode                 1        4        3  0.11    -0.80     0.01
unionized                 1        2        1  2.10     2.42     0.00
yr_experience             1        8        7  1.83     1.78     0.03
experience_in_field       1        8        7  0.45    -1.66     0.04
education_level           3      148      145  0.10    -0.44     0.35
gender                    1       13       12 -2.02     2.20     0.04
race                      1       37       36 -3.20     9.01     0.11
------------------------------------------------------------ 
group: 45-54
                        vars    n      mean        sd  median   trimmed
timestamp                  1 2403   6372.43   3320.15  6414.0   6424.71
age                        2 2403      4.00      0.00     4.0      4.00
industry                   3 2400    160.61     94.49   157.0    156.44
functional_area            4 2386    216.52    174.39   190.0    200.28
job_title                  5 2403   3947.54   2312.28  3930.0   3937.84
Job_title_context          6  543   1455.07    801.40  1460.0   1452.58
annual_salary              7 2403 118099.88 203187.98 95000.0 101243.41
additional_compensation    8 1909  13024.78  46028.81  1500.0   4632.80
currency                   9 2403      9.78      2.92    11.0     10.60
other_currency            10    9     13.78      7.24    16.0     13.78
income_context            11  240    699.80    374.72   683.5    703.36
country                   12 2403     54.42     15.66    60.0     59.46
state                     13 2000     26.11     14.63    24.0     26.07
city_region               14 2402   1511.38    917.88  1567.0   1506.83
work_mode                 15 2399      2.12      0.81     2.0      2.13
unionized                 16 2376      1.13      0.34     1.0      1.04
yr_experience             17 2403      3.80      1.05     4.0      3.81
experience_in_field       18 2403      3.84      1.94     4.0      3.57
education_level           19 2397     74.86     29.51    90.0     73.90
gender                    20 2393      8.66      3.15    10.0      9.45
race                      21 2394     33.93      7.41    36.0     36.00
                             mad min     max   range  skew kurtosis      se
timestamp                3968.92   4   12116   12112 -0.09    -1.02   67.73
age                         0.00   4       4       0   NaN      NaN    0.00
industry                   81.54   4     402     398  0.39    -0.39    1.93
functional_area           201.63   2     641     639  0.55    -0.75    3.57
job_title                2948.89   3    7965    7962  0.03    -1.21   47.17
Job_title_context         985.93   1    2856    2855  0.03    -1.12   34.39
annual_salary           45905.74   0 8000000 8000000 28.91  1031.36 4144.97
additional_compensation  2223.90   0 1000000 1000000 11.51   186.82 1053.48
currency                    0.00   1      12      11 -2.10     2.66    0.06
other_currency              4.45   2      24      22 -0.30    -1.48    2.41
income_context            455.90   3    1345    1342 -0.01    -1.08   24.19
country                     0.00   1      61      60 -2.57     4.76    0.32
state                      19.27   1      51      50 -0.01    -1.27    0.33
city_region              1229.82   2    3133    3131 -0.02    -1.25   18.73
work_mode                   1.48   1       4       3  0.04    -0.95    0.02
unionized                   0.00   1       2       1  2.17     2.69    0.01
yr_experience               0.00   2       8       6  0.31     3.01    0.02
experience_in_field         2.97   1       8       7  0.93    -0.12    0.04
education_level            44.48   9     154     145  0.05    -0.51    0.60
gender                      0.00   1      13      12 -1.99     2.01    0.06
race                        0.00   1      37      36 -3.70    12.57    0.15
------------------------------------------------------------ 
group: 55-64
                        vars   n      mean        sd  median   trimmed      mad
timestamp                  1 993   6321.46   3421.68  6444.0   6365.11  4201.69
age                        2 993      5.00      0.00     5.0      5.00     0.00
industry                   3 989    159.71     96.15   157.0    155.14    80.06
functional_area            4 983    203.90    172.14   189.0    185.26   203.12
job_title                  5 993   3933.37   2318.57  3877.0   3927.64  2997.82
Job_title_context          6 237   1334.72    846.30  1182.0   1309.66  1011.13
annual_salary              7 993 115262.90 157356.24 96000.0 100070.72 45960.60
additional_compensation    8 779   8859.83  24093.45  1200.0   3647.28  1779.12
currency                   9 993     10.15      2.51    11.0     10.96     0.00
other_currency            10   2     20.50      2.12    20.5     20.50     2.22
income_context            11 116    704.79    394.47   735.0    710.62   512.24
country                   12 993     55.93     13.57    60.0     59.98     0.00
state                     13 874     25.79     14.73    24.0     25.70    20.76
city_region               14 993   1493.95    911.27  1506.0   1485.05  1197.94
work_mode                 15 990      2.19      0.82     2.0      2.22     1.48
unionized                 16 984      1.14      0.35     1.0      1.05     0.00
yr_experience             17 993      4.67      0.95     5.0      4.71     0.00
experience_in_field       18 993      4.22      1.67     4.0      4.07     1.48
education_level           19 990     75.33     31.41    75.0     74.91    37.06
gender                    20 990      8.72      3.11    10.0      9.52     0.00
race                      21 991     34.32      6.92    36.0     36.00     0.00
                        min     max   range  skew kurtosis      se
timestamp                 2   12105   12103 -0.12    -1.10  108.58
age                       5       5       0   NaN      NaN    0.00
industry                  4     407     403  0.45    -0.33    3.06
functional_area           5     644     639  0.64    -0.59    5.49
job_title                 5    7964    7959  0.04    -1.18   73.58
Job_title_context         6    2869    2863  0.26    -1.16   54.97
annual_salary            45 4300000 4299955 19.73   504.48 4993.55
additional_compensation   0  280000  280000  6.36    52.55  863.24
currency                  1      11      10 -2.75     5.88    0.08
other_currency           19      22       3  0.00    -2.75    1.50
income_context           17    1347    1330 -0.09    -1.27   36.63
country                   2      64      62 -3.16     8.17    0.43
state                     1      51      50 -0.03    -1.29    0.50
city_region               4    3125    3121  0.00    -1.29   28.92
work_mode                 1       4       3 -0.15    -1.07    0.03
unionized                 1       2       1  2.08     2.33    0.01
yr_experience             1       8       7 -0.64     2.99    0.03
experience_in_field       1       8       7  0.51     0.07    0.05
education_level           1     143     142  0.02    -0.65    1.00
gender                    1      11      10 -2.06     2.29    0.10
race                      1      37      36 -4.22    16.58    0.22
------------------------------------------------------------ 
group: 65 or over
                        vars   n      mean       sd   median   trimmed      mad
timestamp                  1 125   6875.11  2953.76   7048.0   6993.06  2876.24
age                        2 125      6.00     0.00      6.0      6.00     0.00
industry                   3 125    163.60    97.58    157.0    159.19    81.54
functional_area            4 124    213.95   171.52    189.5    199.50   202.37
job_title                  5 125   4081.90  2086.48   4027.0   4098.37  2155.70
Job_title_context          6  32   1143.53   848.87    924.5   1085.27  1034.85
annual_salary              7 125 110685.06 70513.02 101000.0 102485.69 51499.59
additional_compensation    8  97   8653.00 25367.94    100.0   2989.01   148.26
currency                   9 125     10.31     2.29     11.0     11.00     0.00
other_currency            10   0       NaN       NA       NA       NaN       NA
income_context            11  12    854.67   370.59    861.0    884.90   432.92
country                   12 125     56.59    12.30     60.0     60.00     0.00
state                     13 113     25.71    15.70     23.0     25.48    20.76
city_region               14 125   1697.59   875.06   1612.0   1714.36  1106.02
work_mode                 15 125      2.27     0.84      2.0      2.30     1.48
unionized                 16 124      1.13     0.34      1.0      1.04     0.00
yr_experience             17 125      5.37     0.94      6.0      5.54     0.00
experience_in_field       18 125      4.57     1.59      5.0      4.56     1.48
education_level           19 123     80.90    32.02     90.0     80.60    53.37
gender                    20 124      8.73     3.11     10.0      9.51     0.00
race                      21 125     33.80     7.90     36.0     36.00     0.00
                        min    max  range  skew kurtosis      se
timestamp                70  12100  12030 -0.29    -0.56  264.19
age                       6      6      0   NaN      NaN    0.00
industry                  4    381    377  0.43    -0.45    8.73
functional_area           4    615    611  0.51    -0.83   15.40
job_title                23   7759   7736 -0.03    -0.89  186.62
Job_title_context        97   2797   2700  0.47    -1.23  150.06
annual_salary            18 600000 599982  2.98    16.92 6306.88
additional_compensation   0 200000 200000  5.39    33.72 2575.72
currency                  2     11      9 -3.10     7.94    0.21
other_currency          Inf   -Inf   -Inf    NA       NA      NA
income_context           97   1310   1213 -0.40    -0.94  106.98
country                   9     60     51 -3.43    10.10    1.10
state                     1     50     49  0.08    -1.44    1.48
city_region              50   3132   3082 -0.12    -1.12   78.27
work_mode                 1      4      3 -0.21    -1.01    0.07
unionized                 1      2      1  2.19     2.80    0.03
yr_experience             2      6      4 -1.77     3.24    0.08
experience_in_field       1      8      7 -0.12    -0.43    0.14
education_level           9    150    141  0.00    -0.88    2.89
gender                    1     10      9 -2.06     2.28    0.28
race                      1     36     35 -3.53    11.16    0.71
------------------------------------------------------------ 
group: under 18
                        vars n  mean sd median trimmed mad   min   max range
timestamp                  1 1 12021 NA  12021   12021   0 12021 12021     0
age                        2 1     7 NA      7       7   0     7     7     0
industry                   3 1     4 NA      4       4   0     4     4     0
functional_area            4 1     9 NA      9       9   0     9     9     0
job_title                  5 1   999 NA    999     999   0   999   999     0
Job_title_context          6 1   333 NA    333     333   0   333   333     0
annual_salary              7 1     0 NA      0       0   0     0     0     0
additional_compensation    8 1     0 NA      0       0   0     0     0     0
currency                   9 1    11 NA     11      11   0    11    11     0
other_currency            10 0   NaN NA     NA     NaN  NA   Inf  -Inf  -Inf
income_context            11 0   NaN NA     NA     NaN  NA   Inf  -Inf  -Inf
country                   12 1    60 NA     60      60   0    60    60     0
state                     13 1     1 NA      1       1   0     1     1     0
city_region               14 1   331 NA    331     331   0   331   331     0
work_mode                 15 0   NaN NA     NA     NaN  NA   Inf  -Inf  -Inf
unionized                 16 0   NaN NA     NA     NaN  NA   Inf  -Inf  -Inf
yr_experience             17 1     1 NA      1       1   0     1     1     0
experience_in_field       18 1     1 NA      1       1   0     1     1     0
education_level           19 0   NaN NA     NA     NaN  NA   Inf  -Inf  -Inf
gender                    20 0   NaN NA     NA     NaN  NA   Inf  -Inf  -Inf
race                      21 0   NaN NA     NA     NaN  NA   Inf  -Inf  -Inf
                        skew kurtosis se
timestamp                 NA       NA NA
age                       NA       NA NA
industry                  NA       NA NA
functional_area           NA       NA NA
job_title                 NA       NA NA
Job_title_context         NA       NA NA
annual_salary             NA       NA NA
additional_compensation   NA       NA NA
currency                  NA       NA NA
other_currency            NA       NA NA
income_context            NA       NA NA
country                   NA       NA NA
state                     NA       NA NA
city_region               NA       NA NA
work_mode                 NA       NA NA
unionized                 NA       NA NA
yr_experience             NA       NA NA
experience_in_field       NA       NA NA
education_level           NA       NA NA
gender                    NA       NA NA
race                      NA       NA NA

### Group or Individual Tasks: Your Turn

#### TASK 1: Use the describe() function to analyze the distribution of annual salaries within a particular demographic or job function category.

You will generate a detailed statistical table and write a short analysis based on your findings.

- Select a Variable for Analysis
  - Choose one categorical variable from the dataset (e.g., industry, country, or job function) to focus your analysis on how annual salaries vary within the chosen category.
- Filter and Prepare the Data
  - Optionally, filter the data if you want to focus on a specific subset (e.g., managers in a specific country or age group).
- Use describe() to Generate Descriptive Statistics
  - Apply the describe() function to the salary data grouped by the selected categorical variable.
- Print and Interpret the Results
  - Print the output table using print() and provide a concise interpretation of the key statistics such as mean, median, standard deviation, and range.

# ADD YOUR CODE BELOW with comments
# Filter the data to focus on a subset, e.g., a specific country
industry_data <- salary_df %>%
 filter(country == "United States") %>% # This line is optional
 select(industry, annual_salary)
# Apply describe() to the filtered data
descriptive_stats <- describe(industry_data$annual_salary)
kable(descriptive_stats)
   vars     n     mean       sd median trimmed     mad min     max   range     skew kurtosis       se
X1    1 11069 105566.6 79852.41  90500 96677.8 42254.1   0 4300000 4300000 17.27037  734.624 758.9861
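The task asks for statistics grouped by the chosen category, whereas the call above summarises the whole salary column at once. A minimal sketch of a grouped summary with dplyr follows. The data frame `toy_df` and all of its values are invented for illustration and are not taken from the survey.

```r
library(dplyr)

# Hypothetical toy data standing in for salary_df (all values are invented)
toy_df <- tibble(
  industry      = c("Tech", "Tech", "Tech", "Education", "Education", "Education"),
  annual_salary = c(90000, 120000, 300000, 50000, 60000, 70000)
)

# One row of descriptive statistics per industry
grouped_stats <- toy_df %>%
  group_by(industry) %>%
  summarise(
    n      = n(),
    mean   = mean(annual_salary),
    median = median(annual_salary),
    sd     = sd(annual_salary),
    min    = min(annual_salary),
    max    = max(annual_salary),
    .groups = "drop"
  )

grouped_stats
```

In the toy data the single large Tech salary pulls the mean well above the median, which is the same pattern the skew value in the table above points to. The psych package also provides describeBy() for grouped output in the same layout as describe().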

Questions for Further Discussion:

Choose one or two questions below or create your own questions to discuss the results 
of your analysis:
- What is the median salary within your chosen category, and how does it compare to 
the overall median salary of the dataset? Why might there be differences?
- Identify and discuss the implications of the skewness in the salary distribution 
for your chosen category. What does the direction and magnitude of the skewness imply 
about the majority of salaries in this group?
- Analyze the trimmed mean of salaries between different job functions or industries. 
How does removing outliers affect the comparison of average salaries across these 
categories?
- Discuss how the salary data might influence policy-making or business strategies in 
terms of diversity and inclusion within the workplace.
- Reflect on any potential biases in the data collection or analysis process. How 
might these biases impact the conclusions drawn from your analysis?

TASK 2: Use R with the dplyr package to filter, group, and summarize the dataset to answer the following question: “Which age groups in each country have an average annual salary exceeding $100,000?”

- Data Filtering
  - Filter the dataset to include only those entries where the annual salary is greater than $100,000.
- Grouping Data
  - Group the filtered data by country and age.
- Summarizing Data
  - Calculate the average salary for each group.
  - Count the number of individuals in each group.
- Arranging Data
  - Arrange your results to show groups with the highest average salaries at the top.

# ADD YOUR CODE BELOW with comments
# Perform the analysis
result <- salary_df %>%
 filter(annual_salary > 100000) %>%
 group_by(country, age) %>%
 summarise(
 Count = n(),
 Average_Salary = mean(annual_salary, na.rm = TRUE)
 ) %>%
 arrange(desc(Average_Salary))
`summarise()` has grouped output by 'country'. You can override using the
`.groups` argument.
# Print the results
result
# A tibble: 89 × 4
# Groups:   country [40]
   country     age   Count Average_Salary
   <fct>       <ord> <int>          <dbl>
 1 South Korea 35-44     1      58400000 
 2 South Korea 25-34     3      37733333.
 3 Hungary     35-44     1       8160000 
 4 Iceland     45-54     1       8000000 
 5 Japan       35-44     6       5170038 
 6 Japan       25-34     5       4745083.
 7 Japan       45-54     1       4500000 
 8 Japan       18-24     1       3360000 
 9 India       25-34     1       2000000 
10 India       18-24     2        928000 
# ℹ 79 more rows
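The very large averages at the top of this table most likely reflect salaries reported in local currencies rather than genuinely higher pay, since the dataset has a separate currency column. A minimal sketch, assuming that is the cause, is to add currency to the grouping so that amounts are only averaged within one currency. The data frame `toy_df`, its currency codes, and its values are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data: similar jobs, but salaries reported in different currencies
toy_df <- tibble(
  country       = c("United States", "United States", "South Korea", "South Korea"),
  age           = c("25-34", "25-34", "25-34", "25-34"),
  currency      = c("USD", "USD", "KRW", "USD"),
  annual_salary = c(120000, 140000, 58400000, 110000)
)

# Group by currency as well, so averages never mix currencies
high_earners <- toy_df %>%
  filter(annual_salary > 100000) %>%
  group_by(currency, country, age) %>%
  summarise(
    Count = n(),
    Average_Salary = mean(annual_salary),
    .groups = "drop"   # also silences the "grouped output" message
  ) %>%
  arrange(currency, desc(Average_Salary))

high_earners
```

With the currency kept as a grouping column, rows are comparable within each currency block, and the $100,000 threshold can be interpreted correctly for the USD rows only.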

Questions for Further Discussion:

 - How does the age distribution of high earners vary between countries?
 - What factors might influence the differences in salary distributions across 
countries?
 - Could there be external factors affecting the accuracy of this analysis?

FIRST rename your file to DSC406_001_FA24_WA_3_unityID

ggplot2:

Today, we’re diving into exploratory data analysis (EDA) using ggplot2. You’ve watched a video about the basics of ggplot, so now we’ll start applying what you’ve learned. Our focus will be on exploring a large dataset of salaries across various industries and regions. Remember, ggplot2 allows us to iteratively build up layers to create meaningful data visualizations.

Resources:
- https://datavizproject.com/
- https://www.data-to-viz.com/
- Kieran Healy - DataViz

### Step-by-Step Guide

#### Step 1: Basic Bar Plot

Let’s begin with the simplest form of visualization, a bar plot. This will help us count how often a certain value appears. We’ll start by counting how often each type of currency appears in the dataset.

# Create a bar chart to show currency distribution
viz_1 <- 
 ggplot(salary_df, aes(x = currency)) +
 geom_bar(fill = "steelblue") +
 labs(title = "Currency Distribution", x = "Currency", y = "Count") +
 theme_minimal() +
 theme(axis.text.x = element_text(angle = 45, hjust = 1))
viz_1

This plot shows how many respondents are paid in each type of currency. The x-axis lists the different currencies, and the y-axis shows the number of respondents for each currency. This helps us understand the global distribution of salaries in the dataset.

### **1. Your Turn**

❓ What happens if we change the aesthetic to another variable, like work_mode?

1. Task: Modify the bar chart by changing the aesthetic from currency to another variable of your choice, such as work_mode.
2. Steps:
   - Update the aes(x = currency) argument to reflect a different categorical variable, such as work_mode.
   - Experiment with changing the color of the bars.
   - Modify the labs to ensure the title and axis labels match the new variable you’ve chosen.
3. Goal: Create a new bar chart that explores a different variable distribution (e.g., work mode), and make sure the chart is easy to read by adjusting the title, labels, and any other necessary aesthetics.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.

#ADD CODE AND COMMENTS
# Create a bar chart to show work mode distribution
viz_2 <- ggplot(salary_df, aes(x = work_mode)) +
 geom_bar(fill = "red") +
 labs(title = "Work Mode Distribution", x = "Work Mode", y = "Count") +
 theme_minimal() +
 theme(axis.text.x = element_text(angle = 45, hjust = 1))
viz_2

This bar plot counts how many people work in each mode, such as remote or hybrid. Take a moment to compare the different work modes and think about why certain modes might be more common in different industries.

#### 2: Exploring Continuous Variables (Histogram)

Now, let’s move to a continuous variable like annual salary. A histogram helps us visualize how salaries are distributed.

options(scipen = 999) # instructs R to avoid using scientific notation in its output
# Create a histogram for annual salary
ggplot(salary_df, aes(x = annual_salary)) +
 geom_histogram(binwidth = 1000, fill = "steelblue", color = "black") +
 theme_minimal() +
 labs(title = "Distribution of Annual Salary", x = "Annual Salary", y = "Count")

After exploring the continuous variable annual salary, we might notice that the values are quite high, especially in certain bins. This raises a critical question: are the salaries in this dataset separated by currency? Given that salaries in different currencies can vary significantly, we should focus on one currency at a time. In this case, we’ll filter the dataset to include only salaries reported in USD. By doing this, we ensure that our analysis is relevant and that we’re not mixing different currencies.

### **2. Your Turn**

❓ How would you filter the dataset to focus only on salaries reported in USD?

1. Task: Filter the dataset to only include rows where the salary is reported in USD.
2. Steps:
   - Use the filter() function from dplyr to create a new data frame that only includes rows where the currency is “USD”.
   - Ensure the filtered data is assigned to a new object, such as usd_salary.
   - Display the resulting data frame to confirm the filtering worked correctly.
3. Goal: By completing this task, you’ll be able to filter out non-USD salaries and work with a cleaner, more focused dataset.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.
   - The reduced table, which contains only entries where the salary is in USD, still has 11,114 rows, a substantial share of the 13,349 in the full dataset. This is consistent with the earlier bar plot of the currency distribution.

#ADD CODE AND COMMENTS
usd_salary <- filter(salary_df, currency == 'USD')
usd_salary
# A tibble: 11,114 × 21
   timestamp         age   industry  functional_area job_title Job_title_context
   <chr>             <ord> <fct>     <chr>           <chr>     <chr>            
 1 4/9/2024 11:01:42 25-34 Media & … Media & Digital Digital … <NA>             
 2 4/9/2024 11:02:14 35-44 Educatio… Health care     Senior D… <NA>             
 3 4/9/2024 11:02:18 35-44 Nonprofi… Administration  Advancem… <NA>             
 4 4/9/2024 11:02:19 25-34 Governme… Government & P… Program … <NA>             
 5 4/9/2024 11:02:24 25-34 Governme… Administration  Project … <NA>             
 6 4/9/2024 11:02:37 45-54 Governme… Health care     Microbio… <NA>             
 7 4/9/2024 11:02:37 18-24 Nonprofi… Marketing, Adv… Public E… <NA>             
 8 4/9/2024 11:02:40 25-34 Computin… Business or Co… Client S… <NA>             
 9 4/9/2024 11:02:45 35-44 Accounti… Accounting, Ba… Accounta… <NA>             
10 4/9/2024 11:02:46 45-54 Nonprofi… Nonprofits      Developm… <NA>             
# ℹ 11,104 more rows
# ℹ 15 more variables: annual_salary <dbl>, additional_compensation <dbl>,
#   currency <fct>, other_currency <chr>, income_context <chr>, country <fct>,
#   state <chr>, city_region <chr>, work_mode <fct>, unionized <fct>,
#   yr_experience <chr>, experience_in_field <chr>, education_level <fct>,
#   gender <fct>, race <fct>

This code filters the dataset so that we’re only looking at respondents who reported their salary in USD. This step is important because it keeps our analysis consistent: we’re not comparing salaries reported in other currencies, which might distort the distribution.

What else might we want to do? Check out the salary data to see the distribution.

#### Summary table

summary(usd_salary$annual_salary)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0   66500   90386  105516  127950 4300000 

3: Box Plot

# Create a box plot of annual salary
ggplot(usd_salary, aes(y = annual_salary)) +
 geom_boxplot() +
 labs(title = "Box Plot of Annual Salary (USD)", x = "", y = "Annual Salary (USD)") +
 theme_minimal()

4: Scatter Plot

viz_2 <- usd_salary_clean <- usd_salary %>%
 mutate(yr_experience_numeric = as.numeric(yr_experience)) # Convert to numeric
Warning: There was 1 warning in `mutate()`.
ℹ In argument: `yr_experience_numeric = as.numeric(yr_experience)`.
Caused by warning:
! NAs introduced by coercion
# Create a scatter plot of annual salary vs. years of experience (numeric)
ggplot(usd_salary_clean, aes(x = yr_experience_numeric, y = annual_salary)) +
 geom_point(alpha = 0.6, color = "pink") +
 labs(title = "Scatter Plot of Annual Salary vs. Years of Experience",
 x = "Years of Experience",
 y = "Annual Salary (USD)") +
 theme_minimal()
Warning: Removed 11114 rows containing missing values or values outside the scale range
(`geom_point()`).

# Inspect viz_2 (note: it holds the mutated data frame, not a plot)
viz_2
# A tibble: 11,114 × 22
   timestamp         age   industry  functional_area job_title Job_title_context
   <chr>             <ord> <fct>     <chr>           <chr>     <chr>            
 1 4/9/2024 11:01:42 25-34 Media & … Media & Digital Digital … <NA>             
 2 4/9/2024 11:02:14 35-44 Educatio… Health care     Senior D… <NA>             
 3 4/9/2024 11:02:18 35-44 Nonprofi… Administration  Advancem… <NA>             
 4 4/9/2024 11:02:19 25-34 Governme… Government & P… Program … <NA>             
 5 4/9/2024 11:02:24 25-34 Governme… Administration  Project … <NA>             
 6 4/9/2024 11:02:37 45-54 Governme… Health care     Microbio… <NA>             
 7 4/9/2024 11:02:37 18-24 Nonprofi… Marketing, Adv… Public E… <NA>             
 8 4/9/2024 11:02:40 25-34 Computin… Business or Co… Client S… <NA>             
 9 4/9/2024 11:02:45 35-44 Accounti… Accounting, Ba… Accounta… <NA>             
10 4/9/2024 11:02:46 45-54 Nonprofi… Nonprofits      Developm… <NA>             
# ℹ 11,104 more rows
# ℹ 16 more variables: annual_salary <dbl>, additional_compensation <dbl>,
#   currency <fct>, other_currency <chr>, income_context <chr>, country <fct>,
#   state <chr>, city_region <chr>, work_mode <fct>, unionized <fct>,
#   yr_experience <chr>, experience_in_field <chr>, education_level <fct>,
#   gender <fct>, race <fct>, yr_experience_numeric <dbl>
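The coercion warning and the 11,114 removed rows above have the same cause: yr_experience stores text brackets such as "11-20 years", which as.numeric() cannot parse. A minimal sketch of how to confirm this on a few labels is shown below. The vector `brackets` reuses labels from the dataset, while the lower-bound extraction is only one illustrative way to get a number out of each label.

```r
# A few bracket labels as they appear in yr_experience
brackets <- c("1 year or less", "2-4 years", "11-20 years", "41 years or more")

# as.numeric() cannot parse text labels, so every value becomes NA
coerced <- suppressWarnings(as.numeric(brackets))
sum(is.na(coerced))   # number of values lost to coercion

# Extracting the leading digits gives a rough lower bound for each bracket
lower_bound <- as.numeric(sub("^([0-9]+).*$", "\\1", brackets))
lower_bound
```

Checking sum(is.na(...)) right after a conversion is a quick way to catch this before plotting, since ggplot2 drops rows with missing coordinates.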

Filter out the extreme values

#filter out the extreme values
usd_salary_no_outliers <- usd_salary_clean %>%
 filter(annual_salary < 200000)
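The 200,000 cutoff used here is a fixed threshold. As a point of comparison, a common data-driven alternative is the 1.5 × IQR rule, which is also what geom_boxplot() uses to decide how far the whiskers extend. The sketch below applies it to the quartiles printed by summary(usd_salary$annual_salary) earlier. It is an illustration, not the method used in this notebook.

```r
# Quartiles taken from the summary() output for usd_salary$annual_salary
q1 <- 66500
q3 <- 127950

iqr         <- q3 - q1            # interquartile range
upper_fence <- q3 + 1.5 * iqr     # values above this are flagged as high outliers
lower_fence <- q1 - 1.5 * iqr     # negative here, so no low outliers are flagged

c(iqr = iqr, lower_fence = lower_fence, upper_fence = upper_fence)
```

The upper fence works out to 220,125 USD, which is close to the 200,000 threshold chosen above.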

**3. Your Turn**

❓ Now that you’ve filtered out the extreme values, it’s time to create another boxplot. What do you observe after removing the outliers?

1. Task: Create a new box plot for the filtered dataset that excludes extreme salary values.
2. Steps:
   - Use the filtered dataset (e.g., usd_salary_no_outliers) to create a box plot that visualizes the distribution of salaries.
   - Use the geom_boxplot() function to display the distribution.
   - Add appropriate title and labels using labs() to ensure the plot is well-labeled.
3. Goal: Visualize the distribution of annual salaries after removing outliers. Compare this new box plot with the previous one that included extreme values, and reflect on how the distribution has changed.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.

#ADD CODE AND COMMENTS
# Create a box plot of annual salary after removing extreme values
ggplot(usd_salary_no_outliers, aes(y = annual_salary)) +
 geom_boxplot() +
 labs(title = "Box Plot of Annual Salary (USD), Extreme Values Removed", x = "", y = "Annual Salary (USD)") +
 theme_minimal()

We will now explore the trend between years of experience and annual salary.

usd_salary_no_outliers %>%
 count(yr_experience, name = "Count") %>%
 arrange(desc(Count))
# A tibble: 8 × 2
  yr_experience    Count
  <chr>            <int>
1 11-20 years       4217
2 21-30 years       2012
3 8-10 years        1740
4 5-7 years         1142
5 31-40 years        676
6 2-4 years          475
7 41 years or more   147
8 1 year or less      81
# Map each experience bracket to a representative numeric value
usd_salary_no_outliers <- usd_salary_no_outliers %>%
 mutate(yr_experience_numeric = case_when(
 yr_experience == "1 year or less" ~ 1,
 yr_experience == "2-4 years" ~ 3,
 yr_experience == "5-7 years" ~ 6,
 yr_experience == "8-10 years" ~ 9,
 yr_experience == "11-20 years" ~ 15,
 yr_experience == "21-30 years" ~ 25,
 yr_experience == "31-40 years" ~ 35,
 yr_experience == "41 years or more" ~ 45
 ))
# Create a scatter plot of annual salary vs. numeric years of experience
ggplot(usd_salary_no_outliers, aes(x = yr_experience_numeric, y = annual_salary)) +
 geom_point(alpha = 0.6, color = "steelblue") +
 labs(title = "Scatter Plot of Annual Salary vs. Years of Experience",
 x = "Years of Experience (Numeric)",
 y = "Annual Salary (USD)") +
 theme_minimal()

# Create a scatter plot of annual salary vs. numeric years of experience with custom labels
viz_3 <- ggplot(usd_salary_no_outliers, aes(x = yr_experience_numeric, y = annual_salary)) +
  geom_point(alpha = 0.6, color = "steelblue") +
  scale_x_continuous(
    breaks = c(1, 3, 6, 9, 15, 25, 35, 45),  # Numeric values
    labels = c("1 year or less", "2-4 years", "5-7 years", "8-10 years", 
               "11-20 years", "21-30 years", "31-40 years", "41 years or more")  # Custom labels
  ) +
  labs(title = "Scatter Plot of Annual Salary vs. Years of Experience",
       x = "Years of Experience",
       y = "Annual Salary (USD)") +
  theme_minimal()

#inspect visualization
viz_3

# Violin plot of annual salary by years of experience
viz_4 <- ggplot(usd_salary_no_outliers, aes(x = as.factor(yr_experience_numeric), y = annual_salary)) +
 geom_violin(fill = "lightblue") +
 scale_x_discrete(
 labels = c("1 year or less", "2-4 years", "5-7 years", "8-10 years", 
 "11-20 years", "21-30 years", "31-40 years", "41 years or more")
 ) +
 labs(title = "Violin Plot of Annual Salary by Years of Experience",
 x = "Years of Experience",
 y = "Annual Salary (USD)") +
 theme_minimal()+
 theme(axis.text.x = element_text(angle = 45, hjust = 1))

#inspect viz
viz_4
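One caveat with the plot above: as far as I understand, an unnamed labels vector in scale_x_discrete() is matched to the axis categories by position, so the labels would shift if any experience bracket were missing from the data. A more defensive sketch is to attach the labels to the values themselves with factor(). The vector `yr_experience_numeric` below is a hypothetical subset in which several brackets are absent.

```r
# Numeric codes and their bracket labels, listed in the same order
codes  <- c(1, 3, 6, 9, 15, 25, 35, 45)
labels <- c("1 year or less", "2-4 years", "5-7 years", "8-10 years",
            "11-20 years", "21-30 years", "31-40 years", "41 years or more")

# Hypothetical subset in which some brackets do not occur
yr_experience_numeric <- c(3, 15, 15, 45)

# Each code is tied to its own label, regardless of which codes are present
experience_factor <- factor(yr_experience_numeric, levels = codes, labels = labels)

as.character(experience_factor)
levels(experience_factor)
```

Mapping such a factor to the x aesthetic would make the separate scale_x_discrete(labels = ...) call unnecessary, because the axis text comes directly from the factor levels.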

**4. Your Turn**

❓ Can you use facet_wrap to break down the violin plot by gender?

1. Task: Create a violin plot to show the distribution of annual salary by years of experience, and then use facet_wrap to break down the plot by gender.
2. Steps:
   - Use the filtered dataset (e.g., usd_salary_no_outliers) to create a violin plot of annual salary against years of experience.
   - Use geom_violin() to visualize the distribution.
   - Use facet_wrap(~gender) to split the plot by gender, so you can compare salary distributions for different genders.
   - Ensure the labels and titles clearly explain the plot.
3. Goal: Create a violin plot that compares salary distributions by years of experience and gender. Reflect on how the distribution differs between genders.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.

#ADD CODE AND COMMENTS
ggplot(usd_salary_no_outliers, aes(x = factor(yr_experience_numeric), y = annual_salary)) +
 geom_violin(fill = "lightblue", color = "black") +
 facet_wrap(~ gender) +
 theme_minimal() +
 labs(title = "Distribution of Annual Salary by Years of Experience and Gender",
 x = "Years of Experience",
 y = "Annual Salary",
 caption = "Data filtered to exclude outliers")
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.
Warning in max(data$density, na.rm = TRUE): no non-missing arguments to max;
returning -Inf
Warning: Computation failed in `stat_ydensity()`.
Caused by error in `$<-.data.frame`:
! replacement has 1 row, data has 0

I was interested in comparing genders, but I noticed that some responses list multiple genders in a single comma-separated value. We want to split these into separate rows so that each gender can be counted.
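As a minimal sketch of what this splitting does, the example below runs tidyr's separate_rows() on a small invented tibble (`toy_df` and its values are hypothetical). The separator pattern `",\\s*"` is an assumption that also strips any space after the comma, and the real responses may be formatted differently.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: respondent 2 selected two gender options
toy_df <- tibble(
  id            = 1:3,
  gender        = c("Woman", "Woman, Non-binary", "Man"),
  annual_salary = c(80000, 95000, 90000)
)

# Split multi-select answers into one row per selected option
split_df <- toy_df %>%
  separate_rows(gender, sep = ",\\s*")

split_df
```

Note that respondent 2 now appears twice with the same salary, so row counts after splitting count selections rather than people.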

#Use separate_rows to split gender but keep other data intact
usd_salary_no_outliers <- usd_salary_no_outliers %>%
 separate_rows(gender, sep = ",")
# Violin plot of annual salary by years of experience

ggplot(usd_salary_no_outliers, aes(x = as.factor(yr_experience_numeric), y = annual_salary)) +
 geom_violin(fill = "lightblue") +
 scale_x_discrete(
 labels = c("1 year or less", "2-4 years", "5-7 years", "8-10 years", 
 "11-20 years", "21-30 years", "31-40 years", "41 years or more")
 ) +
 labs(title = "Violin Plot of Annual Salary by Years of Experience",
 x = "Years of Experience",
 y = "Annual Salary (USD)") +
 theme_minimal()+
 theme(axis.text.x = element_text(angle = 45, hjust = 1))+
 facet_wrap(~gender)
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.

**5. Your Turn**

❓ Create a similar violin plot for annual salary and age variables. Can you visualize the salary distribution across different age groups?

1. Task: Create a violin plot for annual salary by age groups, and use facet_wrap to break it down by gender.
2. Steps:
   - Use the filtered dataset (e.g., usd_salary_no_outliers) to create a violin plot of annual salary by age.
   - Use geom_violin() to visualize the distribution for different age groups.
   - Modify the x-axis labels to show specific age groups.
   - Add facet_wrap(~gender) to split the plot by gender and allow for comparison.
   - Ensure that the title and axis labels are updated to reflect the new variables.
3. Goal: Visualize how annual salary varies across different age groups, and examine how this pattern differs between genders.
4. Write: Write a few insights you have noticed from the graph and what you still wonder about.

#ADD CODE AND COMMENTS
ggplot(usd_salary_no_outliers, aes(x = factor(age), y = annual_salary)) +
 geom_violin(fill = "lightgreen", color = "black") +
 facet_wrap(~ gender) +
 theme_minimal() +
 labs(title = "Distribution of Annual Salary by Age Groups and Gender",
 x = "Age Group",
 y = "Annual Salary",
 caption = "Data filtered to exclude outliers") +
 theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for better readability
Warning: Groups with fewer than two datapoints have been dropped.
ℹ Set `drop = FALSE` to consider such groups for position adjustment purposes.

Advanced GGPlot: Interactive Plots with Plotly

To make our visualizations interactive, we can use the plotly package. This allows us to transform our ggplot graphs into interactive visualizations that users can explore. First, install and load the plotly package if you have not already:

if (!require(plotly)) {
 install.packages("plotly")
}
Loading required package: plotly
Warning: package 'plotly' was built under R version 4.3.3

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
library(plotly)
ggplotly(viz_1)

Beeswarm Plot for Enhanced Data Visualization

Another interesting way to visualize distributions is by using a [beeswarm plot](https://r-graph-gallery.com/beeswarm.html), which can help to show individual data points without too much overlap. First, install and load the ggbeeswarm package:

if (!require(ggbeeswarm)) {
 install.packages("ggbeeswarm")
}
Loading required package: ggbeeswarm
Warning: package 'ggbeeswarm' was built under R version 4.3.3
library(ggbeeswarm)

Create a beeswarm plot:

ggplot(usd_salary_no_outliers, aes(x = as.factor(yr_experience_numeric), y = annual_salary)) +
 geom_beeswarm(color = "blue", size = .03, alpha = 0.6, method = "compactswarm", side = -1L) +
 scale_x_discrete(
 labels = c("1 year or less", "2-4 years", "5-7 years", "8-10 years", 
 "11-20 years", "21-30 years", "31-40 years", "41 years or more")
 ) +
 labs(title = "Beeswarm Plot of Annual Salary by Years Experience", 
 x = "Years Experience", 
 y = "Annual Salary (USD)") +
 theme_minimal() +
 theme(axis.text.x = element_text(angle = 45, hjust = 1))

**6. Your Turn**

Create a beeswarm plot using usd_salary_no_outliers to visualize the relationship between age and annual salary.

  1. Task: Create a beeswarm plot with age groups on the x-axis and annual salary on the y-axis.
  2. Steps:
  • Use the ggplot() function to map age to the x-axis and annual salary to the y-axis.
  • Add the beeswarm plot using geom_beeswarm(), adjusting parameters like color, size, alpha (transparency), and method to make the plot clear and visually appealing.
  • Customize the title and axis labels using the labs() function.
  • Make sure the x-axis labels are angled for readability.
  • Finally, make it interactive.
  3. Goal: Create a visually appealing beeswarm plot that clearly displays the distribution of salaries across different age groups. Reflect on the spread of salaries within each age group.
  4. Write: Write a few insights you have noticed from the graph and what you still wonder about.
#ADD CODE AND COMMENTS
# Create a beeswarm plot with age groups on the x-axis and annual salary on the y-axis
beeswarm_plot <- ggplot(usd_salary_no_outliers, aes(x = as.factor(age), y = annual_salary)) +
 geom_beeswarm(color = "dodgerblue", size = 1, alpha = 0.4, method = "swarm") +
 theme_minimal() +
 labs(title = "Beeswarm Plot of Annual Salary by Age Groups",
 x = "Age Group",
 y = "Annual Salary",
 caption = "Data filtered to exclude outliers") +
 theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
# Make the plot interactive using plotly
interactive_plot <- ggplotly(beeswarm_plot)
# Display the interactive plot
interactive_plot
write_csv(usd_salary_no_outliers, "usd_salary_no_outliers.csv")

Insights about the graph:
It seems that most salaries are paid to people between 18 and 65, which makes sense as this is the normal working-age range. Furthermore, there doesn’t seem to be any difference in the maximum annual salary across age groups, which could be cause for further analysis, as one might expect salary to increase with age. One idea would be to compare annual salary with years of experience to gain further insight into the distribution of salaries in the data. That said, the graph also shows that many more young people receive a lower salary, between 50,000 and 100,000 USD, and that there is a fairly wide spread of salaries within each age group. This maybe means that the data contains majori