RMarkdown enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents that help you communicate your data analysis and research findings more effectively.
In this project, you play the role of a statistical advisor to a large international firm. In particular, you will analyze data on the salaries and experience of a sample of the firm's employees. Use R and hand calculations to complete the following tasks with the provided data set. The data present information about the salaries and previous experience of a sample of 475 employees drawn from a large international corporation of 3,210 employees in total. Throughout, you may assume that both the sample and the population of employee salaries are normally distributed.
# Read in the data set and preview the first and last rows
salary <- read.csv("C:\\Users\\user\\Downloads\\salary.csv")
attach(salary)
head(salary,5)
ID Birth.Date Jobcat Salary.... Prev.Exp
1 1 3-Feb-1952 3 57000 144
2 2 23-May-1958 1 40200 36
3 3 26-Jul-1929 1 21450 381
4 4 15-Apr-1947 1 21900 190
5 5 9-Feb-1955 1 45000 138
tail(salary,5)
ID Birth.Date Jobcat Salary.... Prev.Exp
471 471 3-Aug-1966 1 26400 32
472 472 21-Feb-1966 1 39150 46
473 473 25-Nov-1937 1 21450 139
474 474 5-Nov-1968 1 29400 9
475 475 4-Dec-81 1 34200 22
if(!require(dplyr)){install.packages('dplyr')} # install dplyr if it is not already present
library(stargazer)
library(dplyr) #loading the library
library(gtsummary)
library(magrittr)
library(ggplot2)
library(knitr)
library(gmodels)
stargazer(salary[,-1], type = "text")
===================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------
Jobcat 475 1.411 0.773 1 3
Salary.... 475 34,105.420 15,983.470 15,750 103,500
Prev.Exp 475 95.705 104.531 0 476
---------------------------------------------------
This is a summary table of three variables for a sample of 475 observations:

- Jobcat: a categorical variable indicating job category, with values ranging from 1 to 3.
- Salary....: a continuous variable giving annual salary in dollars.
- Prev.Exp: a continuous variable giving previous work experience in months.

The table reports the following descriptive statistics for each variable:

- N: the number of observations in the sample.
- Mean: the arithmetic mean (average) of the variable.
- St. Dev.: the standard deviation, which measures how spread out the data are around the mean.
- Min: the minimum value observed for the variable.
- Max: the maximum value observed for the variable.

From these descriptive statistics we can see that there are 475 observations in the sample. The average job category code is 1.411, with a standard deviation of 0.773; the minimum is 1 and the maximum is 3. The average salary is $34,105.42, with a standard deviation of $15,983.47; the lowest salary is $15,750 and the highest is $103,500. The average previous work experience is 95.705 months, with a standard deviation of 104.531 months; the lowest value is 0 (presumably indicating no previous work experience) and the highest is 476 months.
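Since dplyr is already loaded, the same statistics can also be broken down by job category. A minimal sketch (column names as in the data set):

# Group-wise descriptive statistics by job category (dplyr)
salary %>%
  group_by(Jobcat) %>%
  summarise(n = n(),
            mean_salary = mean(Salary....),
            sd_salary = sd(Salary....),
            mean_exp = mean(Prev.Exp))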
summary(salary$Salary....)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15750 24075 29100 34105 37275 103500
ggplot(data=salary, aes(Salary....)) +
geom_histogram(aes(y = ..density..),color="red")+
scale_x_continuous(limits = c(14000,115000))+
stat_function(fun = dnorm,
args = list(mean = mean(Salary....),
sd = sd(Salary....)),
col = "#1b98e0",
size = 1)+
labs(title = "Histogram showing the distribution of salaries",
caption="Salaries")
[Figure: Histogram showing the distribution of salaries, with a normal density curve overlaid.]
A histogram is a graphical representation of a dataset using bars of different heights where each bar represents the frequency of different data categories or intervals. Histograms are used to display the distribution of data, and they are useful for visualizing the shape of data, including skewness. Skewness refers to the lack of symmetry in a distribution. If most of the data is on the left side of the histogram but a few larger values are on the right, the data are said to be skewed to the right, and the histogram has a long tail to the right. Conversely, if most of the data are on the right, with a few smaller values showing up on the left side of the histogram, the data are skewed to the left, and the histogram has a long tail to the left.
In this case, since the histogram has a long tail to the right, the salary distribution is skewed to the right. This means that the mean is larger than the median: the few larger values pull the mean upwards but have little effect on the median. An example of such data would be NBA team salaries, where star players make much more than their teammates.
It is important to determine the skewness of a distribution because it can affect the choice of summary statistics and hypothesis tests that are appropriate for the data. For right-skewed distributions, the long tail extends to the right while most values cluster on the left; conversely, for left-skewed distributions, the long tail extends to the left while most values cluster on the right.
To compare two histograms, you can roughly estimate the median to be located near the middle of each histogram, which allows you to compare the median values of the distributions. You can also visually see which histogram is more spread out, which gives you an idea of which distribution has values that are more dispersed. Another way to compare histograms is to check for skewness. If a histogram has a “tail” on the left side of the plot, it is said to be negatively skewed. Conversely, if a histogram has a “tail” on the right side of the plot, it is said to be positively skewed.
Histograms are useful because they allow us to gain a quick understanding of the distribution of values in a dataset, and they can display a large amount of data along with the frequency of data values. The approximate median and the shape of the distribution can be read from a histogram, which can also reveal outliers or gaps in the data, and histograms make it easy to compare two different datasets.
In conclusion, a histogram with a long tail to the right indicates that the data are skewed to the right and that the mean is larger than the median. Skewness matters because it affects the choice of appropriate summary statistics and hypothesis tests. To compare two histograms, you can estimate the medians, see which distribution is more spread out, and check for skewness. A quick numeric check of the skew in our data follows.
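A minimal sketch comparing the mean with the median and computing a simple moment-based skewness estimate (a positive value indicates right skew):

# Mean exceeds median for right-skewed data
mean(salary$Salary....)    # 34105.42
median(salary$Salary....)  # 29100
# Simple moment-based skewness estimate
x <- salary$Salary....
mean((x - mean(x))^3) / sd(x)^3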
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. A boxplot can show outliers and their values, whether the data are symmetrical, how tightly the data are grouped, and whether and how the data are skewed.
In this case, we created a boxplot of salaries across job categories. A boxplot is useful for comparing multiple groups and is compact in its summarization of data: it is easy to compare groups through the positions of the boxes and whiskers. Boxplots are less useful when only one group's distribution is plotted, because they offer only a high-level summary and cannot show the detailed shape of a distribution; with a single group, a more detailed chart type such as a histogram or density curve is a better choice.
In R, a boxplot can be created with the base boxplot() function or, as in the code below, with ggplot2's geom_boxplot(). The code below creates a boxplot of the salary column across the different job categories, after converting Jobcat to a labelled factor.
salary$Jobcat<-factor(salary$Jobcat, levels = c(1,2,3),
labels = c("Category_1", "Category_2", "Category_3"))
# Load ggplot2 package
library(ggplot2)
# Create a boxplot of salaries by job category
ggplot(data = salary, aes(x = Jobcat, y = Salary....)) +
geom_boxplot() +
ggtitle("Boxplot of Salaries Across Job Categories") +
xlab("Job Category") +
ylab("Salary")
From the boxplot, employees in Category_3 earn substantially more than those in the first and second categories, as shown by the noticeably higher median salary for Category_3.
# Create a boxplot of Previous Experience by job category
ggplot(data = salary, aes(x = Jobcat, y = Prev.Exp)) +
geom_boxplot() +
ggtitle("Boxplot of Previous Experience Across Job Categories") +
xlab("Job Category") +
ylab("Previous Experience")
Employees in the second job category have a higher median previous experience compared to those in the other two categories.
library(ggplot2)
# Create scatter plot with line of best fit
ggplot(salary, aes(x = Prev.Exp, y = Salary....)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, color = "blue") +
ggtitle("Scatter Plot with Line of Best Fit") +
xlab("Previous Experience") +
ylab("Salary") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
annotate("text", x = max(salary$Prev.Exp), y = max(salary$Salary....),
label = paste("y =", round(coef(lm(Salary.... ~ Prev.Exp, data = salary))[1], 2),
"+", round(coef(lm(Salary.... ~ Prev.Exp, data = salary))[2], 2), "x",
"\nR-squared =", round(summary(lm(Salary.... ~ Prev.Exp, data = salary))$r.squared, 2)))
library(ggpubr)
ggplot(salary,aes(x = Prev.Exp, y = Salary....)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE) +
ggtitle("Scatter Plot with Line of Best Fit") +
xlab("Previous Experience") +
ylab("Salary") +
stat_regline_equation(label.x = 300, label.y = 100000, aes(label = ..eq.label..)) +
stat_regline_equation(label.x = 320, label.y = 95000, aes(label = ..rr.label..))
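The regression underlying both plots can also be fitted once and inspected directly; a minimal sketch (the object name fit is ours):

# Fit the simple linear regression of salary on previous experience
fit <- lm(Salary.... ~ Prev.Exp, data = salary)
coef(fit)               # intercept and slope shown in the plot annotation
summary(fit)$r.squared  # R-squared shown in the plot annotation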
# Calculate the sample mean and sample standard deviation
x_bar <- mean(Salary....)
s <- sd(Salary....)
s
[1] 15983.47
# Calculate the standard error of the mean
n <- length(Salary....)
n
[1] 475
sem <- s / sqrt(n)
# Calculate the t-value for a 99% confidence level with n-1 degrees of freedom
t_value <- qt(0.995, df = n - 1)
t_value
[1] 2.586241
# Calculate the confidence interval
lower_bound <- x_bar - t_value * sem
upper_bound <- x_bar + t_value * sem
# Print the confidence interval
cat("99% confidence interval for population mean: (", round(lower_bound, 2),
", ", round(upper_bound, 2), ")\n")
99% confidence interval for population mean: ( 32208.74 , 36002.1 )
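As a cross-check, t.test() reports the same 99% t-based interval directly:

# Built-in cross-check of the 99% t-interval
t.test(salary$Salary...., conf.level = 0.99)$conf.int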
# Calculate the sample mean and the sample standard deviation
# (used here as a stand-in for the population standard deviation)
x_bar <- mean(Salary....)
x_bar
[1] 34105.42
s <- sd(Salary....)
s
[1] 15983.47
# Calculate the standard error of the mean
n <- length(Salary....)
n
[1] 475
sem <- s / sqrt(n)
# Calculate the z-value for a 99% confidence level
z_value <- qnorm(0.995)
z_value
[1] 2.575829
# Calculate the confidence interval
lower_bound <- x_bar - z_value * sem
upper_bound <- x_bar + z_value * sem
# Print the confidence interval
cat("99% confidence interval for population mean: (", round(lower_bound, 2),
", ", round(upper_bound, 2), ")\n")
99% confidence interval for population mean: ( 32216.38 , 35994.46 )
\[\begin{equation} CI = \bar{X} \pm Z_{\alpha/2} * \frac{\sigma}{\sqrt{n}} \end{equation}\]
Using the sample values from the previous example, the confidence interval works out to:
\[\begin{equation} CI = \overline{X} \pm 2.576 * \frac{15983.47}{\sqrt{475}} = \overline{X} \pm 1889.04 \end{equation}\]
The 99% confidence interval now becomes \[[32216.38 , 35994.46]\]
# Set the desired confidence level and maximum error
conf_level <- 0.97
max_error <- 125
# Set the population standard deviation
sigma <- 8700
# Calculate the required sample size
z_value <- qnorm((1 + conf_level) / 2)
n <- ceiling((z_value * sigma / max_error)^2)
# Print the required sample size
cat("Sample size:", n)
Sample size: 22813
To estimate the required sample size to construct a 97% confidence interval with a maximum error of $125, we can use the formula:
\[ n = \left(\frac{Z_{\alpha/2} \cdot \sigma}{E}\right)^2 \]
where \(Z_{\alpha/2}\) is the z-score corresponding to the desired confidence level, \(\sigma\) is the population standard deviation, and \(E\) is the maximum error we are willing to tolerate.
Plugging the values calculated above into this formula gives the required sample size, as worked out below.
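Substituting the values used in the code above, with \(Z_{0.015} \approx 2.1701\) (from qnorm(0.985)):

\[ n = \left(\frac{2.1701 \times 8700}{125}\right)^2 \approx 22812.6, \]

which rounds up to \(n = 22813\), matching the R output.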
To verify whether the population mean salary is actually greater than $35,500 at a significance level of 𝛼=0.025, we can use a one-tailed hypothesis test with the following null and alternative hypotheses:
\[ Null \space hypothesis: \mu \le 35500 \]
\[ Alternative \space hypothesis: \mu > 35500 \]
We will reject the null hypothesis if the test statistic falls in the rejection region, which is determined by the significance level and the degrees of freedom.
To conduct the hypothesis test in R, we first need to calculate the test statistic and the p-value. We can use the following code:
# Set the hypothesized mean under the null hypothesis
mu_null <- 35500
# Set the significance level
alpha <- 0.025
# Set the sample mean, sample standard deviation, and sample size
x_bar <- 34105.42
s <- 15983.47
n <- 475
# Calculate the test statistic and p-value
t_stat <- (x_bar - mu_null) / (s / sqrt(n))
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)
# Print the test statistic and p-value
cat("Test statistic:", t_stat, "\n")
Test statistic: -1.9016
cat("p-value:", p_value, "\n")
p-value: 0.9710852
# Set the hypothesized mean under the null hypothesis
mu_null <- 35500
# Set the significance level
alpha <- 0.025
# Set the sample mean, the sample standard deviation (treated as the
# population standard deviation for the z-test), and the sample size
x_bar <- 34105.42
sigma <- 15983.47
n <- 475
# Calculate the test statistic and p-value
z_stat <- (x_bar - mu_null) / (sigma / sqrt(n))
p_value <- pnorm(z_stat, lower.tail = FALSE)
# Print the test statistic and p-value
cat("Test statistic:", z_stat, "\n")
Test statistic: -1.9016
cat("p-value:", p_value, "\n")
p-value: 0.9713883
The test statistic is a measure of how many standard errors the sample mean is away from the null hypothesis mean. In this case, the test statistic is -1.9016, which means that the sample mean is 1.9016 standard errors below the null hypothesis mean. The negative sign indicates that the sample mean is less than the null hypothesis mean.
The p-value is the probability of observing a test statistic as extreme as the one calculated from the sample data, assuming that the null hypothesis is true. In this case, the p-value is 0.9713883, which means that if the null hypothesis is true (i.e., the population mean salary is less than or equal to $35,500), there is a 97.14% chance of obtaining a test statistic as extreme or more extreme than the one calculated from the sample data.
Since the p-value is greater than the significance level of 0.025, we fail to reject the null hypothesis. This means that we do not have enough evidence to conclude that the population mean salary is actually greater than $35,500. However, we cannot conclude that the null hypothesis is true either; we can only say that we do not have enough evidence to reject it.
In summary, based on the test statistic of -1.9016 and the p-value of 0.9713883, we fail to reject the null hypothesis that the population mean salary is less than or equal to $35,500 at a significance level of 0.025.
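The same test can also be run in one line on the raw data; the reported t statistic and p-value should match the manual calculation above:

# One-sample t-test against mu = 35500 (upper-tailed)
t.test(salary$Salary...., mu = 35500, alternative = "greater")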
\[ Null \space hypothesis: \mu_1 = \mu_2 = \mu_3 \]
\[ Alternative \space hypothesis: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j \]
where \(i\) and \(j\) index the three job categories.
The assumptions of one-way ANOVA are similar to the general assumptions for any parametric test. Three primary assumptions need to be met for the results of a one-way ANOVA to be valid: normality of the response within each group, homogeneity of variance across groups, and independence of observations. The first two are checked below.
# Q-Q plot to assess normality visually
qqnorm(salary$Salary....)
# Shapiro-Wilk test for normality
shapiro.test(salary$Salary....)
Shapiro-Wilk normality test
data: salary$Salary....
W = 0.79462, p-value < 2.2e-16
library(car)
leveneTest(Salary.... ~ Jobcat, data = salary)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)
group    2  49.207 < 2.2e-16 ***
       472
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is essential to check these assumptions before conducting a one-way ANOVA. If any of these assumptions are violated, it may affect the validity of the results. For example, if the normality assumption is violated, the p-values and confidence intervals will be incorrect. If the homogeneity of variance assumption is violated, the F-test may not be reliable.
It is also important to note that one-way ANOVA is not robust to violations of these assumptions. If the assumptions are not met, alternative non-parametric tests can be used instead.
In summary, the assumptions of one-way ANOVA are normality, homogeneity of variance, and independence. These assumptions should be checked before conducting a one-way ANOVA, and if any are violated, alternative non-parametric tests should be used instead; one such option is sketched below.
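Since the Shapiro-Wilk and Levene tests above both reject their null hypotheses, a non-parametric alternative is worth reporting alongside the ANOVA. One option is the Kruskal-Wallis rank-sum test; a minimal sketch:

# Kruskal-Wallis test: compares salary distributions across job categories
# without assuming normality or equal variances
kruskal.test(Salary.... ~ Jobcat, data = salary)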
# Create a boxplot of salaries by job category
ggplot(data = salary, aes(x = Jobcat, y = Salary....)) +
geom_boxplot() +
ggtitle("Boxplot of Salaries Across Job Categories") +
xlab("Job Category") +
ylab("Salary")
The boxplot above indicates that employees in Category_3 earn substantially more than those in Category_1 and Category_2. It is also evident that employees in Category_2 earn more than those in Category_1.
#fit the one-way ANOVA model
model <- aov(Salary.... ~ Jobcat, data = salary)
#view the model output
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
Jobcat 2 7.944e+10 3.972e+10 450.1 <2e-16 ***
Residuals 472 4.165e+10 8.825e+07
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show a statistically significant difference in average salary across the job categories (p < 2e-16).
# TukeyHSD() is in base R's stats package, so multcomp is not needed here
plot(TukeyHSD(model), las = 0)  # pairwise Tukey comparisons, reusing the fitted model
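The pairwise differences behind the plot can also be printed numerically, giving the estimated difference, confidence bounds, and adjusted p-value for each pair of categories:

# Numeric Tukey HSD pairwise comparisons for salary
TukeyHSD(model)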
# Create a boxplot of previous experience by job category
ggplot(data = salary, aes(x = Jobcat, y = Prev.Exp)) +
geom_boxplot() +
ggtitle("Boxplot of Previous Experience Across Job Categories") +
xlab("Job Category") +
ylab("Previous Experience")
The graph above indicates that employees in Category_2 have substantially more previous experience than those in Category_1 and Category_3.
#fit the one-way ANOVA model
model1 <- aov(Prev.Exp~ Jobcat, data = salary)
#view the model output
summary(model1)
Df Sum Sq Mean Sq F value Pr(>F)
Jobcat 2 1176388 588194 69.36 <2e-16 ***
Residuals 472 4002863 8481
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, it is evident that there is a statistically significant difference in average previous experience across the three job categories, as indicated by the p-value (< 2e-16).
# Again, TukeyHSD() comes from base R's stats package
plot(TukeyHSD(model1), las = 1)  # pairwise Tukey comparisons, reusing the fitted model
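As with salaries, the numeric pairwise comparisons for previous experience can be printed as well:

# Numeric Tukey HSD pairwise comparisons for previous experience
TukeyHSD(model1)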
An ANOVA, or analysis of variance, is a statistical test used to compare the means of two or more groups. In this case, the ANOVA results indicate that average salary and previous experience vary significantly across job categories; that is, there are statistically significant differences in average salary and previous experience between the different job categories.
One possible conclusion from these results is that certain job categories require more experience or qualifications and therefore offer higher salaries. Another is that there may be a disparity in pay and experience across job categories, which could be investigated further to ensure fair compensation and hiring practices. In other words, the job categories are not equal in terms of compensation and experience, and there may be underlying factors that contribute to these differences. The small p-values (< 0.001) indicate that the observed differences are unlikely to be due to chance. These results suggest that job category is a significant factor in determining both salary and previous experience, and that further investigation may be needed to understand why these differences exist.