library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.2.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
setwd("~/Documents/EC/Spring 2026/DATA 101/Project 2")
salary <- read_csv("blizzard_salary.csv")
## Rows: 466 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (7): timestamp, status, current_title, salary_type, other_info, location...
## dbl (2): current_salary, percent_incr
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(salary)
## spc_tbl_ [466 × 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ timestamp : chr [1:466] "8/6/20 18:57" "8/6/20 18:56" "8/6/20 18:56" "7/31/20 16:50" ...
## $ status : chr [1:466] "Full Time Employee" "Full Time Employee" "Full Time Employee" "Full Time Employee" ...
## $ current_title : chr [1:466] "Consultant" "Engineer" "Engineer" "Customer Support" ...
## $ current_salary : num [1:466] 1 1 1 16.3 16.7 ...
## $ salary_type : chr [1:466] "year" "year" "year" "hour" ...
## $ percent_incr : num [1:466] 1 1 1 1 NA NA NA 0 4 1.2 ...
## $ other_info : chr [1:466] NA NA NA "Near smack dab in the middle of my pay band" ...
## $ location : chr [1:466] "Irvine" "Irvine" "Irvine" "Austin" ...
## $ performance_rating: chr [1:466] "High" "Successful" "High" "Successful" ...
## - attr(*, "spec")=
## .. cols(
## .. timestamp = col_character(),
## .. status = col_character(),
## .. current_title = col_character(),
## .. current_salary = col_double(),
## .. salary_type = col_character(),
## .. percent_incr = col_double(),
## .. other_info = col_character(),
## .. location = col_character(),
## .. performance_rating = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(salary)
## # A tibble: 6 × 9
## timestamp status current_title current_salary salary_type percent_incr
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 8/6/20 18:57 Full Time… Consultant 1 year 1
## 2 8/6/20 18:56 Full Time… Engineer 1 year 1
## 3 8/6/20 18:56 Full Time… Engineer 1 year 1
## 4 7/31/20 16:50 Full Time… Customer Sup… 16.3 hour 1
## 5 3/11/21 10:28 Full Time… Game Master 16.7 hour NA
## 6 7/31/20 15:03 Contractor Tester 17 hour NA
## # ℹ 3 more variables: other_info <chr>, location <chr>,
## # performance_rating <chr>
Is there a significant difference in mean current salary among the different performance ratings? The data set I selected contains salary information from anonymous, employee-generated surveys. It has 466 observations and 9 variables, which gives me enough data to work with for this project. I will answer the question above using two variables from this data set: performance_rating and current_salary. I found the dataset on the OpenIntro website, which was linked on Blackboard. OpenIntro Dataset Link: https://www.openintro.org/data/index.php?data=blizzard_salary
To find out whether mean current salary differs significantly among the performance ratings, I will perform an ANOVA (Analysis of Variance) test, which compares the mean of current_salary across the levels of the performance_rating variable. The test produces a p-value that indicates whether the differences between the group means are larger than would be expected by chance alone. First, I will clean the data set and select the two variables I am going to use in this project, performance_rating and current_salary. I will then plot the group means in a bar graph to visualize the difference in mean current salary among the performance ratings.
names(salary) <- gsub("[(). \\-]", "_", names(salary))
names(salary) <- gsub("_$", "", names(salary))
names(salary) <- tolower(names(salary))
head(salary)
## # A tibble: 6 × 9
## timestamp status current_title current_salary salary_type percent_incr
## <chr> <chr> <chr> <dbl> <chr> <dbl>
## 1 8/6/20 18:57 Full Time… Consultant 1 year 1
## 2 8/6/20 18:56 Full Time… Engineer 1 year 1
## 3 8/6/20 18:56 Full Time… Engineer 1 year 1
## 4 7/31/20 16:50 Full Time… Customer Sup… 16.3 hour 1
## 5 3/11/21 10:28 Full Time… Game Master 16.7 hour NA
## 6 7/31/20 15:03 Contractor Tester 17 hour NA
## # ℹ 3 more variables: other_info <chr>, location <chr>,
## # performance_rating <chr>
salary2 <- salary |>
  select(current_salary, performance_rating) |>
  filter(!is.na(current_salary)) |>
  filter(!is.na(performance_rating))
head(salary2)
## # A tibble: 6 × 2
## current_salary performance_rating
## <dbl> <chr>
## 1 1 High
## 2 1 Successful
## 3 1 High
## 4 16.3 Successful
## 5 16.7 High
## 6 19.8 Top
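The str(salary) output above shows that current_salary holds both yearly and hourly figures (salary_type is "year" or "hour") and that some yearly entries are just 1. A possible extra cleaning step, sketched here on made-up toy rows, would keep only plausible yearly salaries before comparing groups. The toy values and the cutoff `min_plausible_salary` are assumptions for illustration, not part of the original analysis.

```r
library(dplyr)

# Toy rows shaped like the survey data; the values are illustrative,
# not taken from the real file.
toy <- tibble(
  current_salary     = c(1, 16.3, 17, 85000, 120000, NA),
  salary_type        = c("year", "hour", "hour", "year", "year", "year"),
  performance_rating = c("High", "Successful", NA, "Top", "Developing", "High")
)

# Assumed cutoff: a yearly salary below this is treated as a placeholder entry.
min_plausible_salary <- 1000

toy_yearly <- toy |>
  filter(
    salary_type == "year",                    # drop hourly wages
    !is.na(current_salary),                   # drop missing salaries
    !is.na(performance_rating),               # drop missing ratings
    current_salary >= min_plausible_salary    # drop placeholder values such as 1
  ) |>
  select(current_salary, performance_rating)

toy_yearly
```

Whether a filter like this is appropriate depends on how the survey was filled in, so the cutoff should be checked against the data before it is applied to salary.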
salary2_mean <- salary2 |>
  group_by(performance_rating) |>
  summarize(mean_salary = mean(current_salary, na.rm = TRUE))
salary2_mean
## # A tibble: 4 × 2
## performance_rating mean_salary
## <chr> <dbl>
## 1 Developing 112344.
## 2 High 88713.
## 3 Successful 71194.
## 4 Top 81247.
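A group mean is easier to interpret next to the group size and spread. The sketch below adds a count and a standard deviation to the same kind of summary. It runs on a small made-up table (the toy values are assumptions, not the survey data), so the printed numbers will not match the table above.

```r
library(dplyr)

# Made-up toy data with the same two columns as salary2.
toy <- tibble(
  performance_rating = c("Developing", "High", "High", "High",
                         "Successful", "Successful", "Top", "Top"),
  current_salary     = c(110000, 80000, 95000, 90000,
                         70000, 72000, 78000, 84000)
)

toy_summary <- toy |>
  group_by(performance_rating) |>
  summarize(
    n           = n(),                  # number of responses in the group
    mean_salary = mean(current_salary), # group mean
    sd_salary   = sd(current_salary)    # group spread (NA when n is 1)
  )

toy_summary
```

Running the same summarize() call on salary2 would show how many responses sit behind each mean in the table above.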
ggplot(data = salary2_mean) +
  geom_bar(mapping = aes(x = performance_rating, y = mean_salary, fill = performance_rating), stat = "identity") +
  labs(x = "Performance Rating", y = "Mean Current Salary",
       title = "Difference in Mean Current Salary Among the Different Types of Performance Ratings",
       caption = "Source: blizzard_salary.csv (OpenIntro)") +
  theme_minimal()
H₀: μ₁ = μ₂ = μ₃ = … = μₖ
Hₐ: At least one group’s mean salary differs
Null Hypothesis: The mean salary is the same across all performance rating groups. Alternative Hypothesis: The mean salary of at least one performance rating group differs from the others.
anova_result <- aov(current_salary ~ performance_rating, data = salary2)
anova_result
## Call:
## aov(formula = current_salary ~ performance_rating, data = salary2)
##
## Terms:
## performance_rating Residuals
## Sum of Squares 25679215660 710047849310
## Deg. of Freedom 3 294
##
## Residual standard error: 49143.96
## Estimated effects may be unbalanced
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## performance_rating 3 2.568e+10 8.560e+09 3.544 0.015 *
## Residuals 294 7.100e+11 2.415e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
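The F value and p-value in the summary table can be rebuilt by hand from the sums of squares and degrees of freedom that aov() printed. The short check below copies those four numbers from the output above and uses base R's pf() for the upper tail of the F distribution. The variable names are my own.

```r
# Values copied from the aov() output above.
ss_between <- 25679215660   # Sum of Squares, performance_rating
ss_within  <- 710047849310  # Sum of Squares, Residuals
df_between <- 3             # 4 rating groups - 1
df_within  <- 294           # 298 complete rows - 4 groups

# Mean squares are the sums of squares divided by their degrees of freedom.
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F is the ratio of between-group to within-group variability.
f_stat  <- ms_between / ms_within
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(f_stat, 3)   # compare with "F value" in the summary table
round(p_value, 3)  # compare with "Pr(>F)" in the summary table
sqrt(ms_within)    # compare with "Residual standard error"
```

The results should agree with the summary table up to rounding, which confirms how the table entries relate to each other.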
In this test, our alpha value is 0.05. Because the p-value of 0.015 is less than 0.05, we reject the null hypothesis. We conclude that there is significant evidence that at least one performance rating group’s mean salary differs from the others.
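Rejecting the null hypothesis only says that at least one group mean differs. It does not say which ones. One common follow-up, which was not part of this project, is Tukey's Honest Significant Difference test. The sketch below shows the call on R's built-in PlantGrowth data as a stand-in, so that it runs without the survey file: weight plays the role of current_salary and group the role of performance_rating.

```r
# PlantGrowth is a built-in R data set used here only as a stand-in
# for the salary data.
fit <- aov(weight ~ group, data = PlantGrowth)

# Tukey's HSD compares every pair of groups while controlling
# the family-wise error rate.
tukey <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair of groups: difference in means, confidence
# interval bounds, and adjusted p-value.
tukey
```

Applied to this report, the call would be TukeyHSD(anova_result). I am assuming performance_rating would first need to be converted with factor(), since it is stored as character in salary2.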
Looking at my findings, the bar graph shows that the mean salary differs across the performance ratings. The “Developing” category has the highest mean by a large margin (about 112,344), while the means of the other three categories are closer together (about 71,194 to 88,713). The bar graph alone cannot show whether these differences are statistically significant, but the ANOVA test between current salary and performance rating gave a p-value of 0.015, which is below the alpha value of 0.05. Taken together, the bar graph and the ANOVA test indicate that there is a significant difference in mean current salary among the performance ratings, although the test does not identify which groups differ.
One limitation is the size of the “Developing” category. When I filtered the data to create salary2, I observed that this category had only 5 observations, far fewer than the other categories, so its mean is estimated less reliably and may have skewed the comparison for my research question. Collecting more responses, especially from the “Developing” category, would make future analysis of this question more reliable.
OpenIntro Dataset Link: https://www.openintro.org/data/index.php?data=blizzard_salary