Project 2 DATA-101

Loading Libraries & Dataset

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.1     ✔ readr     2.2.0
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# dplyr and ggplot2 are already attached by library(tidyverse) above
 
setwd("~/Documents/EC/Spring 2026/DATA 101/Project 2")
 
salary <- read_csv("blizzard_salary.csv")

## Rows: 466 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): timestamp, status, current_title, salary_type, other_info, location...
## dbl (2): current_salary, percent_incr
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

str(salary)

## spc_tbl_ [466 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ timestamp         : chr [1:466] "8/6/20 18:57" "8/6/20 18:56" "8/6/20 18:56" "7/31/20 16:50" ...
##  $ status            : chr [1:466] "Full Time Employee" "Full Time Employee" "Full Time Employee" "Full Time Employee" ...
##  $ current_title     : chr [1:466] "Consultant" "Engineer" "Engineer" "Customer Support" ...
##  $ current_salary    : num [1:466] 1 1 1 16.3 16.7 ...
##  $ salary_type       : chr [1:466] "year" "year" "year" "hour" ...
##  $ percent_incr      : num [1:466] 1 1 1 1 NA NA NA 0 4 1.2 ...
##  $ other_info        : chr [1:466] NA NA NA "Near smack dab in the middle of my pay band" ...
##  $ location          : chr [1:466] "Irvine" "Irvine" "Irvine" "Austin" ...
##  $ performance_rating: chr [1:466] "High" "Successful" "High" "Successful" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   timestamp = col_character(),
##   ..   status = col_character(),
##   ..   current_title = col_character(),
##   ..   current_salary = col_double(),
##   ..   salary_type = col_character(),
##   ..   percent_incr = col_double(),
##   ..   other_info = col_character(),
##   ..   location = col_character(),
##   ..   performance_rating = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(salary)

## # A tibble: 6 × 9
##   timestamp     status     current_title current_salary salary_type percent_incr
##   <chr>         <chr>      <chr>                  <dbl> <chr>              <dbl>
## 1 8/6/20 18:57  Full Time… Consultant               1   year                   1
## 2 8/6/20 18:56  Full Time… Engineer                 1   year                   1
## 3 8/6/20 18:56  Full Time… Engineer                 1   year                   1
## 4 7/31/20 16:50 Full Time… Customer Sup…           16.3 hour                   1
## 5 3/11/21 10:28 Full Time… Game Master             16.7 hour                  NA
## 6 7/31/20 15:03 Contractor Tester                  17   hour                  NA
## # ℹ 3 more variables: other_info <chr>, location <chr>,
## #   performance_rating <chr>

Introduction

Is there a significant difference in mean current salary among the different performance ratings? That is the question this project sets out to answer. The data set I selected contains salary information from anonymous, employee-generated surveys. It has 466 observations and 9 variables, which gives enough data to compare groups. I will use two of these variables: performance_rating (categorical) and current_salary (numeric). I found the dataset on the OpenIntro website, which was linked on Blackboard. OpenIntro Dataset Link: https://www.openintro.org/data/index.php?data=blizzard_salary.

Data Analysis

To test whether mean current salary differs significantly among the performance ratings, I will run a one-way ANOVA (Analysis of Variance) test, which compares the mean of current_salary across the levels of the performance_rating variable. The test produces an F statistic and a p-value; a small p-value indicates that at least one group mean differs from the others. First, I will clean the data set and select the two variables used in this project, performance_rating and current_salary. I will then compute the mean salary for each rating and plot these means in a bar graph to visualize the differences among the performance ratings.

Cleaning

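# Replace parentheses, dots, spaces, and hyphens in column names with underscores
names(salary) <- gsub("[(). \\-]", "_", names(salary))
# Drop any trailing underscore left behind by the replacement
names(salary) <- gsub("_$", "", names(salary))
# Use lowercase names throughout
names(salary) <- tolower(names(salary))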

head(salary)

## # A tibble: 6 × 9
##   timestamp     status     current_title current_salary salary_type percent_incr
##   <chr>         <chr>      <chr>                  <dbl> <chr>              <dbl>
## 1 8/6/20 18:57  Full Time… Consultant               1   year                   1
## 2 8/6/20 18:56  Full Time… Engineer                 1   year                   1
## 3 8/6/20 18:56  Full Time… Engineer                 1   year                   1
## 4 7/31/20 16:50 Full Time… Customer Sup…           16.3 hour                   1
## 5 3/11/21 10:28 Full Time… Game Master             16.7 hour                  NA
## 6 7/31/20 15:03 Contractor Tester                  17   hour                  NA
## # ℹ 3 more variables: other_info <chr>, location <chr>,
## #   performance_rating <chr>

Selecting Variables

salary2 <- salary |>
  select(current_salary, performance_rating) |>
  filter(!is.na(current_salary)) |>
  filter(!is.na(performance_rating))
head(salary2)

## # A tibble: 6 × 2
##   current_salary performance_rating
##            <dbl> <chr>             
## 1            1   High              
## 2            1   Successful        
## 3            1   High              
## 4           16.3 Successful        
## 5           16.7 High              
## 6           19.8 Top
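The str() and head() output earlier shows that salary_type takes both "hour" and "year" values, and that some yearly rows report a current_salary of 1. Averaging these together mixes pay scales, so it may be worth checking the mix before comparing means. The following is a minimal sketch in base R on a made-up toy data frame, not the real data; the name `min_plausible_yearly` and its cutoff of 1000 are assumptions chosen only for illustration.

```r
# Toy stand-in for `salary` (values are made up for illustration only).
toy <- data.frame(
  current_salary     = c(1, 16.3, 17, 95000, 120000, 88000),
  salary_type        = c("year", "hour", "hour", "year", "year", "year"),
  performance_rating = c("High", "Successful", NA, "Top", "High", "Successful")
)

# How many rows report each pay basis?
table(toy$salary_type, useNA = "ifany")

# Hypothetical cutoff: yearly figures below this are treated as placeholders.
min_plausible_yearly <- 1000

# Keep complete rows that report a plausible yearly salary.
keep <- !is.na(toy$current_salary) &
  !is.na(toy$performance_rating) &
  toy$salary_type %in% "year" &
  toy$current_salary >= min_plausible_yearly

toy_yearly <- toy[keep, c("current_salary", "performance_rating")]
toy_yearly
```

The same conditions could be applied to `salary` before building salary2. Whether to drop hourly rows or convert them to a yearly figure is a judgment call that depends on the research question.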

Table of Mean Salaries of Each Performance Rating

salary2_mean <- salary2 |>
  group_by(performance_rating) |>
  summarize(mean_salary = mean(current_salary, na.rm = TRUE))
salary2_mean

## # A tibble: 4 × 2
##   performance_rating mean_salary
##   <chr>                    <dbl>
## 1 Developing             112344.
## 2 High                    88713.
## 3 Successful              71194.
## 4 Top                     81247.
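A table of means is easier to interpret when it also reports how many observations sit behind each mean and how spread out they are. The sketch below shows one way to compute the count, mean, and standard deviation per group in base R. It uses a small made-up data frame (`toy`) rather than salary2, so the numbers it prints are illustrative only.

```r
# Toy stand-in for `salary2` (values are made up for illustration only).
toy <- data.frame(
  current_salary = c(100000, 110000, 80000, 90000, 85000, 70000, 75000, 72000),
  performance_rating = c("Developing", "Developing", "High", "High", "High",
                         "Successful", "Successful", "Successful")
)

# Split salaries by rating, then summarize each group in one row.
by_rating <- split(toy$current_salary, toy$performance_rating)
group_summary <- do.call(rbind, lapply(by_rating, function(x) {
  data.frame(n = length(x), mean_salary = mean(x), sd_salary = sd(x))
}))

# Row names are the rating labels; columns are n, mean_salary, sd_salary.
group_summary
```

In the tidyverse style used in this report, the same columns could be produced by adding `n = n()` and `sd_salary = sd(current_salary)` to the existing summarize() call.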

Bar Graph of the Difference in Mean Current Salary Among the Different Types of Performance Ratings

ggplot(data = salary2_mean) +
  # geom_col() plots the precomputed means directly, like geom_bar(stat = "identity")
  geom_col(mapping = aes(x = performance_rating, y = mean_salary, fill = performance_rating)) +
  labs(x = "Performance Rating", y = "Mean Current Salary", fill = "Performance Rating",
       title = "Mean Current Salary by Performance Rating",
       caption = "Source: blizzard_salary.csv (OpenIntro)") +
  theme_minimal()

Statistical Analysis

Hypothesis

H₀: μ₁ = μ₂ = μ₃ = μ₄

Hₐ: At least one group’s mean salary differs

Null Hypothesis: The mean salary is the same for all performance rating groups. Alternative Hypothesis: The mean salary of at least one performance rating group differs from the others.

ANOVA (Analysis of Variance) Test

anova_result <- aov(current_salary ~ performance_rating, data = salary2)
anova_result

## Call:
##    aov(formula = current_salary ~ performance_rating, data = salary2)
## 
## Terms:
##                 performance_rating    Residuals
## Sum of Squares         25679215660 710047849310
## Deg. of Freedom                  3          294
## 
## Residual standard error: 49143.96
## Estimated effects may be unbalanced

summary(anova_result)

##                     Df    Sum Sq   Mean Sq F value Pr(>F)  
## performance_rating   3 2.568e+10 8.560e+09   3.544  0.015 *
## Residuals          294 7.100e+11 2.415e+09                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
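The F value and p-value in the summary table follow directly from the sums of squares and degrees of freedom printed by aov(). As a check on how the pieces fit together, the sketch below recomputes them by hand from those printed numbers. The eta-squared line is an addition that is not part of the original analysis; it is one common way to express how much of the total variation the grouping accounts for.

```r
# Values copied from the aov() output above.
ss_between <- 25679215660   # Sum of Squares, performance_rating
ss_within  <- 710047849310  # Sum of Squares, Residuals
df_between <- 3
df_within  <- 294

# Mean squares are sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group mean squares.
f_value <- ms_between / ms_within

# Upper-tail probability of the F distribution gives Pr(>F).
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

# Share of total variation associated with performance_rating (eta squared).
eta_sq <- ss_between / (ss_between + ss_within)

round(c(F = f_value, p = p_value, eta_sq = eta_sq), 4)
```

The recomputed F and p agree with the 3.544 and 0.015 in the summary table. Eta squared comes out at roughly 0.035, which suggests that performance rating accounts for only a small share of the variation in reported salary even though the test is significant.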

Interpreting Results

For this test, the significance level (alpha) is 0.05. The ANOVA gives F(3, 294) = 3.544 with a p-value of 0.015. Because 0.015 is less than 0.05, we reject the null hypothesis. There is statistically significant evidence that at least one performance rating group has a mean current salary that differs from the others. The test does not tell us which groups differ.
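A significant ANOVA result says only that some group mean differs, not which one. A common follow-up is Tukey's Honest Significant Difference test, which compares every pair of groups while adjusting for multiple comparisons. The sketch below runs it with base R's TukeyHSD() on a made-up toy data frame; it was not part of the original analysis, and the values are illustrative only.

```r
# Toy stand-in for `salary2` (values are made up for illustration only).
toy <- data.frame(
  current_salary = c(118, 105, 112, 90, 86, 94, 70, 74, 68, 80, 83, 79) * 1000,
  performance_rating = factor(rep(c("Developing", "High", "Successful", "Top"),
                                  each = 3))
)

# Fit the same one-way model as in the report.
fit <- aov(current_salary ~ performance_rating, data = toy)

# Pairwise comparisons with family-wise 95% confidence intervals.
tukey <- TukeyHSD(fit, "performance_rating", conf.level = 0.95)
tukey

# One row per pair of groups: difference, interval bounds, adjusted p-value.
pairs <- tukey$performance_rating

# Keep only the pairs whose adjusted p-value is below 0.05.
pairs[pairs[, "p adj"] < 0.05, , drop = FALSE]
```

Applied to the real data, `TukeyHSD(anova_result)` would produce the same kind of table for the four performance ratings. With very small groups the intervals will be wide, so the output should be read with that in mind.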

Conclusion and Future Directions

The table of means and the bar graph show that mean salary varies by performance rating. The "Developing" category has the highest mean (about 112,344), well above the other three categories, whose means range from about 71,194 to 88,713. The bar graph only describes these differences; the ANOVA test is what establishes significance. The ANOVA test between current salary and performance rating gave a p-value of 0.015, which is below the alpha value of 0.05, so we conclude that mean current salary differs significantly for at least one performance rating.

For future work, the data collection could be improved by gathering more responses from the "Developing" category. When I filtered the data into salary2, I observed that this category had only 5 observations, far fewer than the other categories, so its mean is less reliable and may have influenced the result. A more balanced sample would make future analysis of this research question more trustworthy.
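Because the group sizes are so uneven, one way to check how much the result depends on the usual ANOVA assumptions is to rerun the comparison with methods that relax them. The sketch below uses a made-up, unbalanced toy data frame to show Welch's one-way test (which does not assume equal variances) and the Kruskal-Wallis rank test (which does not assume normality). Neither was part of the original analysis, and the values are illustrative only.

```r
# Unbalanced toy stand-in for `salary2` (values are made up for illustration).
toy <- data.frame(
  current_salary = c(150, 60, 90, 85, 95, 88, 70, 72, 68, 75, 80, 83, 79, 86) * 1000,
  performance_rating = factor(c(rep("Developing", 2),
                                rep(c("High", "Successful", "Top"), each = 4)))
)

# Group sizes: a very small group is a warning sign for the classic ANOVA.
table(toy$performance_rating)

# Welch's one-way test: compares means without assuming equal variances.
welch <- oneway.test(current_salary ~ performance_rating, data = toy,
                     var.equal = FALSE)
welch

# Kruskal-Wallis: rank-based comparison that does not assume normality.
kw <- kruskal.test(current_salary ~ performance_rating, data = toy)
kw
```

If these tests agree with the classic ANOVA on the real data, the conclusion is less likely to hinge on the small "Developing" group; if they disagree, that group deserves a closer look.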

References

OpenIntro Dataset Link: https://www.openintro.org/data/index.php?data=blizzard_salary