RMarkdown enables you to create dynamic, reproducible, and visually appealing reports, presentations, and documents that help you communicate your data analysis and research findings more effectively.
In this project, you play the role of a statistical advisor to a large international firm. In particular, you will analyze data on the salaries and experience of a sample of the firm's employees. Use R and hand calculations to complete the following tasks with the provided data set. The data present information about the salaries and previous experience of a sample of 475 employees drawn from a large international corporation of 3,210 employees in total. Throughout, you may assume that both the sample and the population of employee salaries are normally distributed.
# Read in the data set and preview the first and last rows
salary <- read.csv("C:\\Users\\user\\Downloads\\salary.csv")
attach(salary)
head(salary,5)
ID Birth.Date Jobcat Salary.... Prev.Exp
1 1 3-Feb-1952 3 57000 144
2 2 23-May-1958 1 40200 36
3 3 26-Jul-1929 1 21450 381
4 4 15-Apr-1947 1 21900 190
5 5 9-Feb-1955 1 45000 138
tail(salary,5)
ID Birth.Date Jobcat Salary.... Prev.Exp
471 471 3-Aug-1966 1 26400 32
472 472 21-Feb-1966 1 39150 46
473 473 25-Nov-1937 1 21450 139
474 474 5-Nov-1968 1 29400 9
475 475 4-Dec-81 1 34200 22
if(!require(dplyr)){install.packages('dplyr')} # install dplyr if it is not already present
library(stargazer)
library(dplyr) #loading the library
library(gtsummary)
library(magrittr)
library(ggplot2)
library(knitr)
library(gmodels)
stargazer(salary[,-1], type = "text")
===================================================
Statistic N Mean St. Dev. Min Max
---------------------------------------------------
Jobcat 475 1.411 0.773 1 3
Salary.... 475 34,105.420 15,983.470 15,750 103,500
Prev.Exp 475 95.705 104.531 0 476
---------------------------------------------------
This is a summary table of three variables for a sample of 475 observations:

- Jobcat: a categorical variable indicating job category, with values ranging from 1 to 3.
- Salary....: a continuous variable giving annual salary in dollars.
- Prev.Exp: a continuous variable giving previous work experience in months.

The table reports the following descriptive statistics for each variable:

- N: the number of observations in the sample.
- Mean: the arithmetic mean (average) of the variable.
- St. Dev.: the standard deviation, which measures how spread out the data are around the mean.
- Min: the minimum value observed for the variable.
- Max: the maximum value observed for the variable.

From these descriptive statistics we can see that there are 475 observations in the sample. The average job category code is 1.411, with a standard deviation of 0.773; the minimum is 1 and the maximum is 3. The average salary is $34,105.42, with a standard deviation of $15,983.47; the lowest salary is $15,750 and the highest is $103,500. The average previous work experience is 95.705 months, with a standard deviation of 104.531 months; the lowest value is 0 (presumably indicating no previous work experience) and the highest is 476 months.
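Since dplyr is already loaded, the same statistics can also be broken down by job category. A minimal sketch (column names as in the data set):

# Group-wise descriptive statistics by job category (dplyr)
salary %>%
  group_by(Jobcat) %>%
  summarise(n = n(),
            mean_salary = mean(Salary....),
            sd_salary = sd(Salary....),
            mean_exp = mean(Prev.Exp))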
summary(salary$Salary....)
Min. 1st Qu. Median Mean 3rd Qu. Max.
15750 24075 29100 34105 37275 103500
ggplot(data=salary, aes(Salary....)) +
geom_histogram(aes(y = ..density..),color="red")+
scale_x_continuous(limits = c(14000,115000))+
stat_function(fun = dnorm,
args = list(mean = mean(Salary....),
sd = sd(Salary....)),
col = "#1b98e0",
size = 1)+
labs(title = "Histogram showing the distribution of salaries",
caption="Salaries")
[Figure: Histogram showing the distribution of salaries, with a normal density curve overlaid.]
A histogram is a graphical representation of a dataset using bars of different heights where each bar represents the frequency of different data categories or intervals. Histograms are used to display the distribution of data, and they are useful for visualizing the shape of data, including skewness. Skewness refers to the lack of symmetry in a distribution. If most of the data is on the left side of the histogram but a few larger values are on the right, the data are said to be skewed to the right, and the histogram has a long tail to the right. Conversely, if most of the data are on the right, with a few smaller values showing up on the left side of the histogram, the data are skewed to the left, and the histogram has a long tail to the left.
In this case, since the histogram has a long tail to the right, the salary distribution is skewed to the right. This means that the mean is larger than the median: the few larger values pull the mean upwards but have little effect on the median. An example of such data would be NBA team salaries, where star players make much more than their teammates.
It is important to determine the skewness of a distribution because it can affect the choice of summary statistics and hypothesis tests that are appropriate for the data. For right-skewed distributions, the long tail extends to the right while most values cluster on the left; conversely, for left-skewed distributions, the long tail extends to the left while most values cluster on the right.
To compare two histograms, you can roughly estimate the median to be located near the middle of each histogram, which allows you to compare the median values of the distributions. You can also visually see which histogram is more spread out, which gives you an idea of which distribution has values that are more dispersed. Another way to compare histograms is to check for skewness. If a histogram has a “tail” on the left side of the plot, it is said to be negatively skewed. Conversely, if a histogram has a “tail” on the right side of the plot, it is said to be positively skewed.
Histograms are useful because they allow us to gain a quick understanding of the distribution of values in a dataset, and they can display a large amount of data along with the frequency of data values. The approximate median and the shape of the distribution can be read from a histogram, which can also reveal outliers or gaps in the data, and histograms make it easy to compare two different datasets.
In conclusion, a histogram with a long tail to the right indicates that the data are skewed to the right and that the mean is larger than the median. Skewness matters because it affects the choice of appropriate summary statistics and hypothesis tests. To compare two histograms, you can estimate the medians, see which distribution is more spread out, and check for skewness. A quick numeric check of the skew in our data follows.
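A minimal sketch comparing the mean with the median and computing a simple moment-based skewness estimate (a positive value indicates right skew):

# Mean exceeds median for right-skewed data
mean(salary$Salary....)    # 34105.42
median(salary$Salary....)  # 29100
# Simple moment-based skewness estimate
x <- salary$Salary....
mean((x - mean(x))^3) / sd(x)^3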
A boxplot is a standardized way of displaying the distribution of data based on a five-number summary: the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. A boxplot can show outliers and their values, whether the data are symmetrical, how tightly the data are grouped, and whether and how the data are skewed.
In this case, we created a boxplot of salaries across job categories. A boxplot is useful for comparing multiple groups and is compact in its summarization of data: it is easy to compare groups through the positions of the boxes and whiskers. Boxplots are less useful when only one group's distribution is plotted, because they offer only a high-level summary and cannot show the detailed shape of a distribution; with a single group, a more detailed chart type such as a histogram or density curve is a better choice.
In R, a boxplot can be created with the base boxplot() function or, as in the code below, with ggplot2's geom_boxplot(). The code below creates a boxplot of the salary column across the different job categories, after converting Jobcat to a labelled factor.
salary$Jobcat<-factor(salary$Jobcat, levels = c(1,2,3),
labels = c("Category_1", "Category_2", "Category_3"))
# Load ggplot2 package
library(ggplot2)
# Create a boxplot of salaries by job category
ggplot(data = salary, aes(x = Jobcat, y = Salary....)) +
geom_boxplot() +
ggtitle("Boxplot of Salaries Across Job Categories") +
xlab("Job Category") +
ylab("Salary")
From the boxplot, employees in Category_3 earn substantially more than those in the first and second categories, as shown by the noticeably higher median salary for Category_3.
# Create a boxplot of Previous Experience by job category
ggplot(data = salary, aes(x = Jobcat, y = Prev.Exp)) +
geom_boxplot() +
ggtitle("Boxplot of Previous Experience Across Job Categories") +
xlab("Job Category") +
ylab("Previous Experience")
Employees in the second job category have a higher median previous experience compared to those in the other two categories.
library(ggplot2)
# Create scatter plot with line of best fit
ggplot(salary, aes(x = Prev.Exp, y = Salary....)) +
geom_point() +
geom_smooth(method = lm, se = FALSE, color = "blue") +
ggtitle("Scatter Plot with Line of Best Fit") +
xlab("Previous Experience") +
ylab("Salary") +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5)) +
annotate("text", x = max(salary$Prev.Exp), y = max(salary$Salary....),
label = paste("y =", round(coef(lm(Salary.... ~ Prev.Exp, data = salary))[1], 2),
"+", round(coef(lm(Salary.... ~ Prev.Exp, data = salary))[2], 2), "x",
"\nR-squared =", round(summary(lm(Salary.... ~ Prev.Exp, data = salary))$r.squared, 2)))
library(ggpubr)
ggplot(salary,aes(x = Prev.Exp, y = Salary....)) +
geom_point() +
geom_smooth(method = "lm", se=FALSE) +
ggtitle("Scatter Plot with Line of Best Fit") +
xlab("Previous Experience") +
ylab("Salary") +
stat_regline_equation(label.x = 300, label.y = 100000, aes(label = ..eq.label..)) +
stat_regline_equation(label.x = 320, label.y = 95000, aes(label = ..rr.label..))
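The regression underlying both plots can also be fitted once and inspected directly; a minimal sketch (the object name fit is ours):

# Fit the simple linear regression of salary on previous experience
fit <- lm(Salary.... ~ Prev.Exp, data = salary)
coef(fit)               # intercept and slope shown in the plot annotation
summary(fit)$r.squared  # R-squared shown in the plot annotation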
# Calculate the sample mean and sample standard deviation
x_bar <- mean(Salary....)
s <- sd(Salary....)
s
[1] 15983.47
# Calculate the standard error of the mean
n <- length(Salary....)
n
[1] 475
sem <- s / sqrt(n)
# Calculate the t-value for a 99% confidence level with n-1 degrees of freedom
t_value <- qt(0.995, df = n - 1)
t_value
[1] 2.586241
# Calculate the confidence interval
lower_bound <- x_bar - t_value * sem
upper_bound <- x_bar + t_value * sem
# Print the confidence interval
cat("99% confidence interval for population mean: (", round(lower_bound, 2),
", ", round(upper_bound, 2), ")\n")
99% confidence interval for population mean: ( 32208.74 , 36002.1 )
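As a cross-check, t.test() reports the same 99% t-based interval directly:

# Built-in cross-check of the 99% t-interval
t.test(salary$Salary...., conf.level = 0.99)$conf.int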
# Calculate the sample mean and the sample standard deviation
# (used here as a stand-in for the population standard deviation)
x_bar <- mean(Salary....)
x_bar
[1] 34105.42
s <- sd(Salary....)
s
[1] 15983.47
# Calculate the standard error of the mean
n <- length(Salary....)
n
[1] 475
sem <- s / sqrt(n)
# Calculate the z-value for a 99% confidence level
z_value <- qnorm(0.995)
z_value
[1] 2.575829
# Calculate the confidence interval
lower_bound <- x_bar - z_value * sem
upper_bound <- x_bar + z_value * sem
# Print the confidence interval
cat("99% confidence interval for population mean: (", round(lower_bound, 2),
", ", round(upper_bound, 2), ")\n")
99% confidence interval for population mean: ( 32216.38 , 35994.46 )
\[\begin{equation} CI = \bar{X} \pm Z_{\alpha/2} * \frac{\sigma}{\sqrt{n}} \end{equation}\]
Using the sample values from the previous example, the confidence interval works out to:
\[\begin{equation} CI = \overline{X} \pm 2.576 * \frac{15983.47}{\sqrt{475}} = \overline{X} \pm 1889.04 \end{equation}\]
The 99% confidence interval now becomes \[[32216.38 , 35994.46]\]
# Set the desired confidence level and maximum error
conf_level <- 0.97
max_error <- 125
# Set the population standard deviation
sigma <- 8700
# Calculate the required sample size
z_value <- qnorm((1 + conf_level) / 2)
n <- ceiling((z_value * sigma / max_error)^2)
# Print the required sample size
cat("Sample size:", n)
Sample size: 22813
To estimate the required sample size to construct a 97% confidence interval with a maximum error of $125, we can use the formula:
\[ n = \left(\frac{Z_{\alpha/2} \cdot \sigma}{E}\right)^2 \]
where \(Z_{\alpha/2}\) is the z-score corresponding to the desired confidence level, \(\sigma\) is the population standard deviation, and \(E\) is the maximum error we are willing to tolerate.
Plugging the values calculated above into this formula gives the required sample size, as worked out below.
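Substituting the values used in the code above, with \(Z_{0.015} \approx 2.1701\) (from qnorm(0.985)):

\[ n = \left(\frac{2.1701 \times 8700}{125}\right)^2 \approx 22812.6, \]

which rounds up to \(n = 22813\), matching the R output.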
To verify whether the population mean salary is actually greater than $35,500 at a significance level of 𝛼=0.025, we can use a one-tailed hypothesis test with the following null and alternative hypotheses:
\[ Null \space hypothesis: \mu \le 35500 \]
\[ Alternative \space hypothesis: \mu > 35500 \]
We will reject the null hypothesis if the test statistic falls in the rejection region, which is determined by the significance level and the degrees of freedom.
To conduct the hypothesis test in R, we first need to calculate the test statistic and the p-value. We can use the following code:
# Set the hypothesized mean under the null hypothesis
mu_null <- 35500
# Set the significance level
alpha <- 0.025
# Set the sample mean, sample standard deviation, and sample size
x_bar <- 34105.42
s <- 15983.47
n <- 475
# Calculate the test statistic and p-value
t_stat <- (x_bar - mu_null) / (s / sqrt(n))
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)
# Print the test statistic and p-value
cat("Test statistic:", t_stat, "\n")
Test statistic: -1.9016
cat("p-value:", p_value, "\n")
p-value: 0.9710852
# Set the hypothesized mean under the null hypothesis
mu_null <- 35500
# Set the significance level
alpha <- 0.025
# Set the sample mean, the sample standard deviation (treated as the
# population standard deviation for the z-test), and the sample size
x_bar <- 34105.42
sigma <- 15983.47
n <- 475
# Calculate the test statistic and p-value
z_stat <- (x_bar - mu_null) / (sigma / sqrt(n))
p_value <- pnorm(z_stat, lower.tail = FALSE)
# Print the test statistic and p-value
cat("Test statistic:", z_stat, "\n")
Test statistic: -1.9016
cat("p-value:", p_value, "\n")
p-value: 0.9713883
The test statistic is a measure of how many standard errors the sample mean is away from the null hypothesis mean. In this case, the test statistic is -1.9016, which means that the sample mean is 1.9016 standard errors below the null hypothesis mean. The negative sign indicates that the sample mean is less than the null hypothesis mean.
The p-value is the probability of observing a test statistic as extreme as the one calculated from the sample data, assuming that the null hypothesis is true. In this case, the p-value is 0.9713883, which means that if the null hypothesis is true (i.e., the population mean salary is less than or equal to $35,500), there is a 97.14% chance of obtaining a test statistic as extreme or more extreme than the one calculated from the sample data.
Since the p-value is greater than the significance level of 0.025, we fail to reject the null hypothesis. This means that we do not have enough evidence to conclude that the population mean salary is actually greater than $35,500. However, we cannot conclude that the null hypothesis is true either; we can only say that we do not have enough evidence to reject it.
In summary, based on the test statistic of -1.9016 and the p-value of 0.9713883, we fail to reject the null hypothesis that the population mean salary is less than or equal to $35,500 at a significance level of 0.025.
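The same test can also be run in one line on the raw data; the reported t statistic and p-value should match the manual calculation above:

# One-sample t-test against mu = 35500 (upper-tailed)
t.test(salary$Salary...., mu = 35500, alternative = "greater")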
\[ Null \space hypothesis: \mu_1 = \mu_2 = \mu_3 \]
\[ Alternative \space hypothesis: \mu_i \neq \mu_j \text{ for at least one pair } i \neq j \]
where \(i\) and \(j\) index the three job categories.
The assumptions of one-way ANOVA are similar to the general assumptions for any parametric test. Three primary assumptions need to be met for the results of a one-way ANOVA to be valid: normality of the response within each group, homogeneity of variance across groups, and independence of observations. The first two are checked below.
# Q-Q plot to assess normality visually
qqnorm(salary$Salary....)
# Shapiro-Wilk test for normality
shapiro.test(salary$Salary....)
Shapiro-Wilk normality test
data: salary$Salary....
W = 0.79462, p-value < 2.2e-16
library(car)
leveneTest(Salary.... ~ Jobcat, data = salary)
Levene's Test for Homogeneity of Variance (center = median)
        Df F value    Pr(>F)
group    2  49.207 < 2.2e-16 ***
       472
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It is essential to check these assumptions before conducting a one-way ANOVA. If any of these assumptions are violated, it may affect the validity of the results. For example, if the normality assumption is violated, the p-values and confidence intervals will be incorrect. If the homogeneity of variance assumption is violated, the F-test may not be reliable.
It is also important to note that one-way ANOVA is not robust to violations of these assumptions. If the assumptions are not met, alternative non-parametric tests can be used instead.
In summary, the assumptions of one-way ANOVA are normality, homogeneity of variance, and independence. These assumptions should be checked before conducting a one-way ANOVA, and if any are violated, alternative non-parametric tests should be used instead; one such option is sketched below.
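Since the Shapiro-Wilk and Levene tests above both reject their null hypotheses, a non-parametric alternative is worth reporting alongside the ANOVA. One option is the Kruskal-Wallis rank-sum test; a minimal sketch:

# Kruskal-Wallis test: compares salary distributions across job categories
# without assuming normality or equal variances
kruskal.test(Salary.... ~ Jobcat, data = salary)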
# Create a boxplot of salaries by job category
ggplot(data = salary, aes(x = Jobcat, y = Salary....)) +
geom_boxplot() +
ggtitle("Boxplot of Salaries Across Job Categories") +
xlab("Job Category") +
ylab("Salary")
The boxplot above indicates that employees in Category_3 earn substantially more than those in Category_1 and Category_2. It is also evident that employees in Category_2 earn more than those in Category_1.
#fit the one-way ANOVA model
model <- aov(Salary.... ~ Jobcat, data = salary)
#view the model output
summary(model)
Df Sum Sq Mean Sq F value Pr(>F)
Jobcat 2 7.944e+10 3.972e+10 450.1 <2e-16 ***
Residuals 472 4.165e+10 8.825e+07
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results show a statistically significant difference in average salary across the job categories (p < 2e-16).
# TukeyHSD() is in base R's stats package, so multcomp is not needed here
plot(TukeyHSD(model), las = 0)  # pairwise Tukey comparisons, reusing the fitted model
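The pairwise differences behind the plot can also be printed numerically, giving the estimated difference, confidence bounds, and adjusted p-value for each pair of categories:

# Numeric Tukey HSD pairwise comparisons for salary
TukeyHSD(model)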
# Create a boxplot of previous experience by job category
ggplot(data = salary, aes(x = Jobcat, y = Prev.Exp)) +
geom_boxplot() +
ggtitle("Boxplot of Previous Experience Across Job Categories") +
xlab("Job Category") +
ylab("Previous Experience")
The graph above indicates that employees in Category_2 have substantially more previous experience than those in Category_1 and Category_3.
#fit the one-way ANOVA model
model1 <- aov(Prev.Exp~ Jobcat, data = salary)
#view the model output
summary(model1)
Df Sum Sq Mean Sq F value Pr(>F)
Jobcat 2 1176388 588194 69.36 <2e-16 ***
Residuals 472 4002863 8481
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
From the results above, it is evident that there is a statistically significant difference in average previous experience across the three job categories, as indicated by the p-value (< 2e-16).
# Again, TukeyHSD() comes from base R's stats package
plot(TukeyHSD(model1), las = 1)  # pairwise Tukey comparisons, reusing the fitted model
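As with salaries, the numeric pairwise comparisons for previous experience can be printed as well:

# Numeric Tukey HSD pairwise comparisons for previous experience
TukeyHSD(model1)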
An ANOVA, or analysis of variance, is a statistical test used to compare the means of two or more groups. In this case, the ANOVA results indicate that average salary and previous experience vary significantly across job categories; that is, there are statistically significant differences in average salary and previous experience between the different job categories.
One possible conclusion from these results is that certain job categories require more experience or qualifications and therefore offer higher salaries. Another is that there may be a disparity in pay and experience across job categories, which could be investigated further to ensure fair compensation and hiring practices. In other words, the job categories are not equal in terms of compensation and experience, and there may be underlying factors that contribute to these differences. The small p-values (< 0.001) indicate that the observed differences are unlikely to be due to chance. These results suggest that job category is a significant factor in determining both salary and previous experience, and that further investigation may be needed to understand why these differences exist.