IMPORTING FILE
``` r
data <- read.csv("C:\\Users\\dukel\\Downloads\\Advanced_Analytics_Project\\Data_Salaries.csv")
#1 MAKING A HISTOGRAM AND SCATTERPLOT - I made a histogram showing how the observations are distributed across “Company Rating” (1-5 stars), and a scatterplot of company rating against salary to see whether the two are visibly correlated (no correlation was immediately apparent).
library(ggplot2)
ggplot(data, aes(x = Company.Score)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +
  labs(title = "Histogram of Company Score", x = "Company Score", y = "Frequency")
## Warning: Removed 98 rows containing non-finite outside the scale range
## (`stat_bin()`).
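# Note: Converted_Salary is only converted to numeric in #2, so this plot uses the column exactly as read from the CSV.
ggplot(data, aes(x = Converted_Salary, y = Company.Score)) +
  geom_point(color = "red") +
  labs(title = "Scatterplot of Converted Salary vs. Company Score", x = "Converted Salary", y = "Company Score")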
## Warning: Removed 98 rows containing missing values or values outside the scale range
## (`geom_point()`).
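# A possible check on the warnings above (a sketch, not part of the original
# analysis): counting the missing values per column shows where the removed
# rows come from. `count_missing` is a hypothetical helper name.
count_missing <- function(df, cols) {
  # sum(is.na(x)) counts the NA entries in a single column
  vapply(df[cols], function(x) sum(is.na(x)), integer(1))
}
# Example call, assuming the column names used in the plots above:
# count_missing(data, c("Company.Score", "Converted_Salary"))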
#2 MEAN VALUE - I calculated the average hourly wage across all positions listed in the dataset, which came out to approximately $35.35 per hour. For my question, this gives me a baseline against which to compare the t-test results and see whether location makes any quantifiable difference.
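# Convert the salary column to numeric; entries that are not valid numbers become NA (hence the coercion warning below).
data$Converted_Salary <- as.numeric(as.character(data$Converted_Salary))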
## Warning: NAs introduced by coercion
mean_value <- mean(data$Converted_Salary, na.rm = TRUE)
print(mean_value)
## [1] 35.34979
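# To put the mean in context, a few more summary statistics can be computed
# the same way (a sketch only; `summary_stats` is a hypothetical helper and
# was not part of the original analysis).
summary_stats <- function(x) {
  c(
    n      = sum(!is.na(x)),          # number of non-missing values used
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE), # less sensitive to extreme salaries than the mean
    sd     = sd(x, na.rm = TRUE)
  )
}
# Example call on the converted column:
# summary_stats(data$Converted_Salary)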
#3 SCATTERPLOT WITH REGRESSION LINE - It was interesting to see such a stable trendline drawn across the chart. It suggests that there is little or no linear relationship between a company’s score (overall satisfaction) and the salary it pays. (I have included the R squared value and the correlation data in the screenshots.)
ggplot(data, aes(x = Converted_Salary, y = Company.Score)) +
  geom_point(color = "red") +                               # Scatterplot points
  geom_smooth(method = "lm", se = FALSE, color = "blue") +  # Linear regression line
  labs(title = "Scatterplot of Converted Salary vs. Company Score with Regression Line",
       x = "Converted Salary",
       y = "Company Score")
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 163 rows containing non-finite outside the scale range
## (`stat_smooth()`).
## Warning: Removed 163 rows containing missing values or values outside the scale range
## (`geom_point()`).
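# The R squared value and correlation mentioned in #3 could be computed along
# these lines. This is a sketch under assumptions, not the code behind the
# screenshots; `score_salary_fit` is a hypothetical helper name.
score_salary_fit <- function(salary, score) {
  ok  <- complete.cases(salary, score)  # keep rows where both values are present
  x   <- salary[ok]
  y   <- score[ok]
  fit <- lm(y ~ x)                      # same y ~ x model that geom_smooth(method = "lm") fits
  list(
    n         = sum(ok),
    pearson_r = cor(x, y),
    r_squared = summary(fit)$r.squared,
    slope     = unname(coef(fit)[2])
  )
}
# Example call:
# score_salary_fit(data$Converted_Salary, data$Company.Score)
# A slope and R squared near 0 would be consistent with the reading in #3.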
#4 T-TEST - Initially, I wanted to test whether being located in the state of Georgia had a noticeable impact on compensation (salary) compared with other locations. However, after separating the Georgia salaries from the rest and comparing the two averages with a t-test, the difference turned out not to be statistically significant (which surprised me, to be honest).
# Hypothesis
# Null Hypothesis (H₀): There is no difference in the average hourly rate between GA locations and other locations in the nation.
# Alternative Hypothesis (H₁): There is a difference in the average hourly rate between GA locations and other locations in the nation.
# Findings from the T-Test
# With a p-value of 0.6896, we fail to reject the null hypothesis. The sample means are close (37.10 for GA versus 35.27 elsewhere), and the 95 percent confidence interval for the difference (-7.55 to 11.21) includes 0, so the data provide no evidence of a statistically significant difference between the average hourly rates of GA locations and other locations in the dataset.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
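# Split the salaries into Georgia and non-Georgia groups
data_yes <- data %>% filter(In_GA == "Yes") %>% pull(Converted_Salary)
data_no <- data %>% filter(In_GA == "No") %>% pull(Converted_Salary)
# t.test() defaults to the Welch test, which does not assume equal variances
t_test_result <- t.test(data_yes, data_no)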
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: data_yes and data_no
## t = 0.40499, df = 20.753, p-value = 0.6896
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.554503 11.205147
## sample estimates:
## mean of x mean of y
## 37.10000 35.27468
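# For reference, the Welch statistic reported above can be reproduced by hand.
# This is a sketch of the standard Welch formulas, not code from the original
# analysis; `welch_t` is a hypothetical helper name.
welch_t <- function(x, y) {
  x  <- x[!is.na(x)]                    # t.test() also drops missing values
  y  <- y[!is.na(y)]
  vx <- var(x) / length(x)              # squared standard error of each group mean
  vy <- var(y) / length(y)
  t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)
  # Welch-Satterthwaite approximation; this is why df is fractional (20.753 above)
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  p  <- 2 * pt(-abs(t_stat), df)        # two-sided p-value
  c(t = t_stat, df = df, p_value = p)
}
# Example call, which should agree with print(t_test_result):
# welch_t(data_yes, data_no)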