ADVANCED COMPUTERIZED METHODS

Load the required packages

library(readxl)
library(janitor)

## Warning: package 'janitor' was built under R version 3.6.3

## 
## Attaching package: 'janitor'

## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test

Read the data

stat325 <- read_xlsx("STAT 325 Assignment Data.xlsx")
names(stat325)

Clean column names

stat325 <- stat325 %>% janitor::clean_names()

The top 10 observations

head(stat325,10)

The structure of the data

str(stat325)

1. Present appropriate summaries for the data

The data has both numeric and categorical variables. Therefore, we summarize the numeric and the categorical variables separately.

Numeric Variables

To do this, we can create a data frame containing numeric variables only.

numeric_vars <- stat325[,c(3,7:14)]
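Selecting columns by position breaks silently if the column order of the spreadsheet changes. As an alternative sketch, columns can be picked by type instead. The data frame `toy` below is hypothetical and only stands in for `stat325`; note that an identifier column stored as a number is also selected and has to be dropped by name.

```r
# Hypothetical stand-in for stat325 (names and values are illustrative only)
toy <- data.frame(
  id        = c(101, 102, 103),
  sex       = c("Female", "Male", "Female"),
  age       = c(29, 41, 36),
  quarter_1 = c(15000.5, 21000.0, 18750.2),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns
is_num <- vapply(toy, is.numeric, logical(1))
toy_numeric <- toy[, is_num, drop = FALSE]

# Drop the identifier by name, since it is numeric but not a measurement
toy_numeric <- toy_numeric[, setdiff(names(toy_numeric), "id"), drop = FALSE]

names(toy_numeric)
```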

The first few observations of the numeric variables

head(numeric_vars)

To compute summary statistics for the numeric variables, we can use the describe() function from the psych package. There are other ways to do the same, but describe() is comprehensive and simple to use.

Load the package

library(psych)

Create an object to store the summaries.

num_summaries<- describe(numeric_vars)

Format the output such that it is a table

knitr::kable(apply(num_summaries,2,round,2))

	vars	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
age	1	153	40.15	10.79	39.00	39.96	8.90	19.00	67.00	48.00	0.20	-0.44	0.87
number_of_assistants	2	153	12.39	7.28	12.00	12.07	8.90	2.00	31.00	29.00	0.31	-0.80	0.59
experience	3	153	10.75	5.96	10.00	10.60	7.41	1.00	25.00	24.00	0.20	-0.83	0.48
marketing_budget	4	153	1426151.90	889186.57	1288260.00	1364300.65	986833.39	165920.00	4599160.00	4433240.00	0.61	0.01	71886.47
appraisal_score	5	153	80.64	8.36	81.10	80.69	10.82	66.20	94.20	28.00	-0.05	-1.24	0.68
quarter_1	6	153	18961.60	4879.04	19080.62	18920.92	4786.87	8574.16	30068.04	21493.88	0.02	-0.58	394.45
quarter_2	7	153	20703.36	4880.24	20767.62	20640.39	4280.09	9144.08	33333.32	24189.24	0.11	-0.21	394.54
quarter_3	8	153	23034.45	4628.41	22772.40	23035.63	4565.07	10948.20	32534.36	21586.16	-0.01	-0.51	374.18
quarter_4	9	153	23356.82	5035.83	23329.86	23194.62	5514.12	13058.68	35907.48	22848.80	0.28	-0.50	407.12

Categorical data

For this, we may wish to create frequency tables and contingency tables.

categorical <- stat325[,c("sex","marital_status","education_level","department")]
apply(categorical, 2, table)

## $sex
## 
## Female   Male 
##     82     71 
## 
## $marital_status
## 
## Divorced  Married   Single  Widowed 
##       32       68       30       23 
## 
## $education_level
## 
##     Bachelors   Certificate       Diploma Post graduate 
##            39            37            37            40 
## 
## $department
## 
##        Agriculture             Energy Financial services      Manufacturing 
##                 33                 15                 27                 21 
##             Mining            Tourism 
##                 27                 30

Contingency tables.

table(categorical$sex, categorical$department)

##         
##          Agriculture Energy Financial services Manufacturing Mining Tourism
##   Female          19     10                 15            12     11      15
##   Male            14      5                 12             9     16      15

2. Test the hypothesis of equality of proportions of female and male salespersons in each department and the entire organization

Testing the equality of sex proportions in each department

attach(stat325)
table1 <- table(department,sex)

prop.test(table1)

## 
##  6-sample test for equality of proportions without continuity
##  correction
## 
## data:  table1
## X-squared = 3.3385, df = 5, p-value = 0.648
## alternative hypothesis: two.sided
## sample estimates:
##    prop 1    prop 2    prop 3    prop 4    prop 5    prop 6 
## 0.5757576 0.6666667 0.5555556 0.5714286 0.4074074 0.5000000

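The p-value (0.648) is greater than 0.05, so we fail to reject the null hypothesis. Note that this 6-sample test compares the proportion of female salespersons across the six departments: there is no significant evidence that this proportion differs from one department to another.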
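The 6-sample test compares the female share across departments. To test male against female within each department, one option is a separate test of a 0.5 proportion per department. The sketch below uses the counts from the sex-by-department contingency table; the choice of an exact binomial test is an assumption, not part of the original analysis.

```r
# Counts copied from the sex-by-department contingency table
female <- c(Agriculture = 19, Energy = 10, `Financial services` = 15,
            Manufacturing = 12, Mining = 11, Tourism = 15)
male   <- c(Agriculture = 14, Energy = 5, `Financial services` = 12,
            Manufacturing = 9, Mining = 16, Tourism = 15)

# Exact binomial test of H0: P(female) = 0.5 within each department
p_values <- mapply(
  function(f, m) binom.test(f, f + m, p = 0.5)$p.value,
  female, male
)

# Share of female salespersons per department, with the test p-value
dept_tests <- data.frame(
  department  = names(female),
  prop_female = round(female / (female + male), 3),
  p_value     = round(p_values, 3),
  row.names   = NULL
)
dept_tests
```

The `prop_female` column reproduces the sample estimates printed by prop.test (for example 19/33 = 0.576 for Agriculture). Because six tests are run, the p-values could also be passed through p.adjust() before drawing conclusions.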

Testing the equality of sex proportions in the entire organization

table2 <- table(sex)

prop.test(table2)

## 
##  1-sample proportions test with continuity correction
## 
## data:  table2, null probability 0.5
## X-squared = 0.65359, df = 1, p-value = 0.4188
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
##  0.4537919 0.6162707
## sample estimates:
##         p 
## 0.5359477

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. There is no significant difference between the proportions of male and female salespersons in the entire organization.
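The organization-wide test can be reproduced from the two counts in the sex frequency table alone. The sketch below also recomputes the continuity-corrected chi-squared statistic by hand so that the arithmetic behind the prop.test output is visible.

```r
# Counts taken from the sex frequency table: 82 female, 71 male
n_female <- 82
n_total  <- 82 + 71

# One-sample test of H0: p = 0.5, with the default continuity correction
res <- prop.test(n_female, n_total, p = 0.5)

# Chi-squared statistic by hand: (|x - n*p| - 0.5)^2 / (n * p * (1 - p))
expected <- n_total * 0.5
stat_by_hand <- (abs(n_female - expected) - 0.5)^2 / (n_total * 0.5 * 0.5)

c(prop_test = unname(res$statistic), by_hand = stat_by_hand)
```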

3. Is there a significant relationship between marital status and education level?

table3 <-table(stat325$marital_status, stat325$education_level)
chisq.test(table3)

## 
##  Pearson's Chi-squared test
## 
## data:  table3
## X-squared = 19.236, df = 9, p-value = 0.02326

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis of independence. There is a significant relationship between marital status and education level.
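A significant chi-squared test says nothing about how strong the association is. One common effect size is Cramér's V, which is not part of the original analysis; the sketch below computes it from the reported statistic, the sample size of 153, and the 4 x 4 table dimensions.

```r
chi_sq <- 19.236          # X-squared reported by chisq.test(table3)
n      <- 153             # number of salespersons
k      <- min(4, 4) - 1   # smaller table dimension minus one (4 x 4 table)

# Cramér's V = sqrt(X^2 / (n * (min(rows, cols) - 1)))
cramers_v <- sqrt(chi_sq / (n * k))
round(cramers_v, 3)
```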

4. Is there a significant relationship between education level and department?

Edu_Dep<-table(stat325$education_level, stat325$department)
chisq.test(Edu_Dep)

## Warning in stats::chisq.test(x, y, ...): Chi-squared approximation may be
## incorrect

## 
##  Pearson's Chi-squared test
## 
## data:  Edu_Dep
## X-squared = 20.611, df = 15, p-value = 0.1497

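The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. There is no significant relationship between education level and department. Because R warns that the chi-squared approximation may be incorrect (some expected cell counts are small), this result should be read with caution.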
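The warning from chisq.test is triggered by small expected cell counts. These can be worked out from the marginal totals in the frequency tables, without the raw data. The variable names `edu_totals` and `dept_totals` are introduced here for illustration.

```r
# Marginal totals copied from the frequency tables
edu_totals  <- c(Bachelors = 39, Certificate = 37, Diploma = 37,
                 `Post graduate` = 40)
dept_totals <- c(Agriculture = 33, Energy = 15, `Financial services` = 27,
                 Manufacturing = 21, Mining = 27, Tourism = 30)

# Expected count under independence = row total * column total / grand total
expected <- outer(edu_totals, dept_totals) / sum(edu_totals)

round(expected, 2)

# Cells with an expected count below 5, which is what triggers the warning
small_cells <- which(expected < 5, arr.ind = TRUE)
small_cells
```

All of the small expected counts fall in the Energy column, the smallest department with 15 salespersons. One option in that situation is `chisq.test(Edu_Dep, simulate.p.value = TRUE)`, which replaces the chi-squared approximation with a Monte Carlo p-value.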

5. Compare the mean quarterly and annual sales by sex, age, marital status, education level and department. For age categorize young employees as those aged less than 35 years and old employees otherwise

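# Data preprocessing
# Annual sales: sum of the four quarterly sales figures
stat325$annual_sales <- with(stat325, quarter_1 + quarter_2 + quarter_3 + quarter_4)
# Age group: younger than 35 is "Young", otherwise "Old"
stat325$age_group <- ifelse(stat325$age < 35, "Young", "Old")
# Re-attach so the new columns are visible by name
attach(stat325)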

## The following objects are masked from stat325 (pos = 3):
## 
##     age, appraisal_score, department, education_level, experience,
##     marital_status, marketing_budget, number_of_assistants,
##     personel_number, quarter_1, quarter_2, quarter_3, quarter_4, sex
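The age rule places anyone aged exactly 35 in the "Old" group. The sketch below checks this boundary on a few illustrative ages and shows cut() as an alternative that returns a factor; both the ages and the use of cut() are assumptions made for illustration.

```r
ages <- c(19, 34, 35, 67)  # illustrative ages, including the cut-off value

# ifelse(): anyone younger than 35 is "Young", 35 and above is "Old"
grp_ifelse <- ifelse(ages < 35, "Young", "Old")

# cut(): same rule, but returns a factor with an explicit level order
grp_cut <- cut(ages, breaks = c(-Inf, 35, Inf), right = FALSE,
               labels = c("Young", "Old"))

data.frame(ages, grp_ifelse, grp_cut)
```

With a character column, t.test() orders the groups alphabetically (Old, then Young), which is why the reported differences are Old minus Young.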

Quarter 1 and sex

t.test(quarter_1~sex)

## 
##  Welch Two Sample t-test
## 
## data:  quarter_1 by sex
## t = -1.329, df = 142.69, p-value = 0.186
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2626.2673   514.5749
## sample estimates:
## mean in group Female   mean in group Male 
##             18471.64             19527.48

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean quarter 1 sales do not differ significantly between female and male salespersons.

Quarter 1 and age group

t.test(quarter_1~age_group)

## 
##  Welch Two Sample t-test
## 
## data:  quarter_1 by age_group
## t = 7.241, df = 98.331, p-value = 1.001e-10
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3736.412 6557.399
## sample estimates:
##   mean in group Old mean in group Young 
##            20509.04            15362.13

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean quarter 1 sales differ significantly between young and old employees.

Quarter 1 and marital status

q1_mstatus <- aov(quarter_1~marital_status)
summary(q1_mstatus)

##                 Df    Sum Sq  Mean Sq F value Pr(>F)
## marital_status   3 5.074e+07 16914948   0.706   0.55
## Residuals      149 3.568e+09 23943771

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean quarter 1 sales do not differ significantly across marital status groups.
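The F value and p-value in an ANOVA table follow directly from the two mean squares and their degrees of freedom. As a worked check, the sketch below recomputes them from the quarter 1 by marital status table.

```r
ms_between <- 16914948   # Mean Sq for marital_status
ms_within  <- 23943771   # Mean Sq for Residuals
df_between <- 3
df_within  <- 149

# F is the ratio of between-group to within-group mean squares
f_value <- ms_between / ms_within

# p-value: upper tail of the F distribution with (3, 149) degrees of freedom
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

round(c(F = f_value, p = p_value), 3)
```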

Quarter 1 and education level

q1_edu <- aov(quarter_1~education_level)
summary(q1_edu)

##                  Df    Sum Sq   Mean Sq F value  Pr(>F)   
## education_level   3 3.143e+08 104762883   4.724 0.00353 **
## Residuals       149 3.304e+09  22175021                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean quarter 1 sales differ significantly across education levels; at least one level has a different mean.
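A significant ANOVA only says that at least one group mean differs, not which one. A post-hoc procedure such as Tukey's HSD can locate the differences. The sketch below runs on simulated stand-in data (the data frame `sim`, its group means, and the seed are all hypothetical), since the raw data are not reproduced here.

```r
set.seed(325)  # arbitrary seed so the simulated data are reproducible

# Simulated stand-in data: 4 education levels, 30 salespersons each
sim <- data.frame(
  education_level = factor(rep(c("Bachelors", "Certificate",
                                 "Diploma", "Post graduate"), each = 30)),
  quarter_1 = rnorm(120,
                    mean = rep(c(19000, 17000, 18000, 21000), each = 30),
                    sd = 4500)
)

fit <- aov(quarter_1 ~ education_level, data = sim)

# Tukey's honest significant differences: all pairwise comparisons,
# with family-wise adjusted p-values and confidence intervals
posthoc <- TukeyHSD(fit)
posthoc$education_level
```

On the real data, the equivalent call would be `TukeyHSD(q1_edu)`, provided `education_level` is stored as a factor.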

Quarter 1 and department

q1_dept <- aov(quarter_1~department)
summary(q1_dept)

##              Df    Sum Sq  Mean Sq F value Pr(>F)
## department    5 1.006e+08 20122778   0.841  0.523
## Residuals   147 3.518e+09 23930292

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean quarter 1 sales do not differ significantly across departments.

Quarter 2 and sex

t.test(quarter_2~sex)

## 
##  Welch Two Sample t-test
## 
## data:  quarter_2 by sex
## t = -1.4582, df = 146.09, p-value = 0.1469
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2714.6023   409.5354
## sample estimates:
## mean in group Female   mean in group Male 
##             20168.53             21321.06

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean quarter 2 sales do not differ significantly between female and male salespersons.

Quarter 2 and age group

t.test(quarter_2~age_group)

## 
##  Welch Two Sample t-test
## 
## data:  quarter_2 by age_group
## t = 6.2731, df = 96.79, p-value = 9.929e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3159.053 6083.273
## sample estimates:
##   mean in group Old mean in group Young 
##            22092.73            17471.57

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean quarter 2 sales differ significantly between young and old employees.

Quarter 2 and marital status

q2_mstatus <- aov(quarter_2~marital_status)
summary(q2_mstatus)

##                 Df    Sum Sq  Mean Sq F value Pr(>F)  
## marital_status   3 1.643e+08 54753924   2.361 0.0738 .
## Residuals      149 3.456e+09 23193850                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean quarter 2 sales do not differ significantly across marital status groups.

Quarter 2 and education level

q2_edu <- aov(quarter_2~education_level)
summary(q2_edu)

##                  Df    Sum Sq  Mean Sq F value Pr(>F)  
## education_level   3 2.177e+08 72572857   3.178 0.0259 *
## Residuals       149 3.402e+09 22835080                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean quarter 2 sales differ significantly across education levels; at least one level has a different mean.

Quarter 2 and department

q2_dept <- aov(quarter_2~department)
summary(q2_dept)

##              Df    Sum Sq  Mean Sq F value Pr(>F)
## department    5 2.131e+08 42624441   1.839  0.109
## Residuals   147 3.407e+09 23177029

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean quarter 2 sales do not differ significantly across departments.

Quarter 3 and sex

t.test(quarter_3~sex)

## 
##  Welch Two Sample t-test
## 
## data:  quarter_3 by sex
## t = -2.0617, df = 144, p-value = 0.04104
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3014.10777   -63.50156
## sample estimates:
## mean in group Female   mean in group Male 
##             22320.37             23859.17

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean quarter 3 sales differ significantly between female and male salespersons.

Quarter 3 and age group

t.test(quarter_3~age_group)

## 
##  Welch Two Sample t-test
## 
## data:  quarter_3 by age_group
## t = 5.3694, df = 99.238, p-value = 5.206e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2406.514 5227.490
## sample estimates:
##   mean in group Old mean in group Young 
##            24182.05            20365.05

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean quarter 3 sales differ significantly between young and old employees.

Quarter 3 and marital status

q3_mstatus <- aov(quarter_3~marital_status)
summary(q3_mstatus)

##                 Df    Sum Sq  Mean Sq F value Pr(>F)  
## marital_status   3 2.030e+08 67661315   3.302 0.0221 *
## Residuals      149 3.053e+09 20491172                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean quarter 3 sales differ significantly across marital status groups; at least one group has a different mean.

Quarter 3 and education level

q3_edu <- aov(quarter_3~education_level)
summary(q3_edu)

##                  Df    Sum Sq  Mean Sq F value  Pr(>F)   
## education_level   3 2.769e+08 92287639   4.615 0.00406 **
## Residuals       149 2.979e+09 19995340                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean quarter 3 sales differ significantly across education levels; at least one level has a different mean.

Quarter 3 and department

q3_dept <- aov(quarter_3~department)
summary(q3_dept)

##              Df    Sum Sq  Mean Sq F value Pr(>F)
## department    5 1.172e+08 23449551   1.098  0.364
## Residuals   147 3.139e+09 21353203

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean quarter 3 sales do not differ significantly across departments.

Quarter 4 and sex

t.test(quarter_4~sex)

## 
##  Welch Two Sample t-test
## 
## data:  quarter_4 by sex
## t = -0.95341, df = 143.38, p-value = 0.342
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2407.5750   840.7809
## sample estimates:
## mean in group Female   mean in group Male 
##             22993.28             23776.68

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean quarter 4 sales do not differ significantly between female and male salespersons.

Quarter 4 and age group

t.test(quarter_4~age_group)

## 
##  Welch Two Sample t-test
## 
## data:  quarter_4 by age_group
## t = 6.3998, df = 115.51, p-value = 3.455e-09
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3158.437 5989.766
## sample estimates:
##   mean in group Old mean in group Young 
##            24732.04            20157.94

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean quarter 4 sales differ significantly between young and old employees.

Quarter 4 and marital status

q4_mstatus <- aov(quarter_4~marital_status)
summary(q4_mstatus)

##                 Df    Sum Sq  Mean Sq F value Pr(>F)
## marital_status   3 1.364e+08 45451395   1.821  0.146
## Residuals      149 3.718e+09 24955051

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean quarter 4 sales do not differ significantly across marital status groups.

Quarter 4 and education level

q4_edu <- aov(quarter_4~education_level)
summary(q4_edu)

##                  Df    Sum Sq  Mean Sq F value Pr(>F)  
## education_level   3 2.401e+08 80043657     3.3 0.0221 *
## Residuals       149 3.615e+09 24258562                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean quarter 4 sales differ significantly across education levels; at least one level has a different mean.

Quarter 4 and department

q4_dept <- aov(quarter_4~department)
summary(q4_dept)

##              Df    Sum Sq  Mean Sq F value Pr(>F)
## department    5 1.798e+08 35954460   1.438  0.214
## Residuals   147 3.675e+09 24999214

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean quarter 4 sales do not differ significantly across departments.

Annual Sales and sex

t.test(annual_sales~sex)

## 
##  Welch Two Sample t-test
## 
## data:  annual_sales by sex
## t = -1.6724, df = 143.52, p-value = 0.09662
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9885.2905   824.1276
## sample estimates:
## mean in group Female   mean in group Male 
##             83953.81             88484.40

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean annual sales do not differ significantly between female and male salespersons.

Annual Sales and age group

t.test(annual_sales~age_group)

## 
##  Welch Two Sample t-test
## 
## data:  annual_sales by age_group
## t = 8.0544, df = 116.3, p-value = 7.765e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  13693.86 22624.49
## sample estimates:
##   mean in group Old mean in group Young 
##            91515.86            73356.69

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean annual sales differ significantly between young and old employees.

Annual Sales and marital status

an_mstatus <- aov(annual_sales~marital_status)
summary(an_mstatus)

##                 Df    Sum Sq   Mean Sq F value Pr(>F)  
## marital_status   3 1.935e+09 645122373   2.373 0.0726 .
## Residuals      149 4.050e+10 271838511                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean annual sales do not differ significantly across marital status groups.

Annual Sales and education level

an_edu <- aov(annual_sales~education_level)
summary(an_edu)

##                  Df    Sum Sq   Mean Sq F value  Pr(>F)   
## education_level   3 3.798e+09 1.266e+09   4.881 0.00289 **
## Residuals       149 3.864e+10 2.593e+08                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean annual sales differ significantly across education levels; at least one level has a different mean.

Annual Sales and department

an_dept <- aov(annual_sales~department)
summary(an_dept)

##              Df    Sum Sq   Mean Sq F value Pr(>F)
## department    5 1.901e+09 380205609   1.379  0.236
## Residuals   147 4.054e+10 275770593

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean annual sales do not differ significantly across departments.
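The comparisons in this section repeat the same call for each sales variable. As a sketch of a more compact approach, the tests can be generated in a loop, passing `data =` so that attach() is not needed. The data frame `sim` is simulated stand-in data, and only two quarters are included for brevity.

```r
set.seed(1)  # arbitrary seed for the simulated stand-in data

sim <- data.frame(
  sex = rep(c("Female", "Male"), times = c(40, 35)),
  quarter_1 = rnorm(75, 19000, 4800),
  quarter_2 = rnorm(75, 20700, 4800)
)
sim$annual_sales <- sim$quarter_1 + sim$quarter_2  # two quarters only here

outcomes <- c("quarter_1", "quarter_2", "annual_sales")

# One Welch t-test per outcome; reformulate() builds e.g. quarter_1 ~ sex
results <- do.call(rbind, lapply(outcomes, function(y) {
  tt <- t.test(reformulate("sex", response = y), data = sim)
  data.frame(outcome = y,
             t       = unname(tt$statistic),
             df      = unname(tt$parameter),
             p_value = tt$p.value)
}))
results
```

The same pattern works for aov() by swapping the function inside the loop and extracting the p-value from `summary()`.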

6. Is there a quarter where the sales were significantly different from others?

Quarter 1 and Quarter 2

t.test(quarter_1,quarter_2,paired = TRUE)

## 
##  Paired t-test
## 
## data:  quarter_1 and quarter_2
## t = -5.3236, df = 152, p-value = 3.591e-07
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2388.165 -1095.351
## sample estimates:
## mean of the differences 
##               -1741.758

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean sales differ significantly between quarter 1 and quarter 2.

Quarter 1 and Quarter 3

t.test(quarter_1,quarter_3,paired = TRUE)

## 
##  Paired t-test
## 
## data:  quarter_1 and quarter_3
## t = -12.297, df = 152, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -4727.220 -3418.479
## sample estimates:
## mean of the differences 
##                -4072.85

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean sales differ significantly between quarter 1 and quarter 3.
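A paired t-test is equivalent to a one-sample t-test on the per-person differences, which is why the output reports a "mean of the differences". The sketch below illustrates this on simulated stand-in data; the vectors `q1` and `q3` and the seed are hypothetical.

```r
set.seed(2)  # arbitrary seed for the simulated stand-in data

q1 <- rnorm(50, 19000, 4800)
q3 <- q1 + rnorm(50, 4000, 4000)   # same 50 salespersons, later quarter

# Paired test on the two columns
paired <- t.test(q1, q3, paired = TRUE)

# One-sample test on the differences, H0: mean difference = 0
one_sample <- t.test(q1 - q3, mu = 0)

c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
```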

Quarter 1 and Quarter 4

t.test(quarter_1,quarter_4,paired = TRUE)

## 
##  Paired t-test
## 
## data:  quarter_1 and quarter_4
## t = -12.492, df = 152, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5090.333 -3700.098
## sample estimates:
## mean of the differences 
##               -4395.216

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean sales differ significantly between quarter 1 and quarter 4.

Quarter 2 and Quarter 3

t.test(quarter_2,quarter_3,paired = TRUE)

## 
##  Paired t-test
## 
## data:  quarter_2 and quarter_3
## t = -7.6167, df = 152, p-value = 2.581e-12
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2935.752 -1726.431
## sample estimates:
## mean of the differences 
##               -2331.092

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean sales differ significantly between quarter 2 and quarter 3.

Quarter 2 and Quarter 4

t.test(quarter_2,quarter_4,paired = TRUE)

## 
##  Paired t-test
## 
## data:  quarter_2 and quarter_4
## t = -8.0137, df = 152, p-value = 2.726e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3307.638 -1999.277
## sample estimates:
## mean of the differences 
##               -2653.458

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Mean sales differ significantly between quarter 2 and quarter 4.

Quarter 3 and Quarter 4

t.test(quarter_3,quarter_4,paired = TRUE)

## 
##  Paired t-test
## 
## data:  quarter_3 and quarter_4
## t = -1.0227, df = 152, p-value = 0.3081
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -945.1123  300.3803
## sample estimates:
## mean of the differences 
##                -322.366

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Mean sales do not differ significantly between quarter 3 and quarter 4.
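Running six pairwise tests at the 0.05 level inflates the overall chance of a false positive. A simple safeguard is to adjust the p-values for multiple comparisons. The sketch below applies Holm's method to the six p-values reported in this section; the choice of Holm is an assumption, and the two values printed as "< 2.2e-16" are entered at that upper bound.

```r
# p-values copied from the six paired t-tests;
# "< 2.2e-16" is entered as its upper bound 2.2e-16
p_raw <- c(q1_q2 = 3.591e-07, q1_q3 = 2.2e-16, q1_q4 = 2.2e-16,
           q2_q3 = 2.581e-12, q2_q4 = 2.726e-13, q3_q4 = 0.3081)

# Holm's step-down adjustment controls the family-wise error rate
p_holm <- p.adjust(p_raw, method = "holm")

data.frame(p_raw, p_holm, significant = p_holm < 0.05)
```

With these values the adjustment does not change any conclusion: quarter 1 and quarter 2 each differ from every other quarter, while quarter 3 and quarter 4 remain the only pair without a significant difference.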

7. Repeat 5) but use the non-parametric approach.

Quarter 1 and sex

wilcox.test(quarter_1~sex)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  quarter_1 by sex
## W = 2481, p-value = 0.1161
## alternative hypothesis: true location shift is not equal to 0

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Quarter 1 sales show no significant location shift between female and male salespersons.
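The Wilcoxon rank sum test works on ranks rather than on the raw values, so it compares the locations of two distributions instead of their means. As a sketch of where the W statistic comes from, the code below recomputes it by hand on simulated stand-in data; the vectors `female` and `male` and the seed are hypothetical.

```r
set.seed(3)  # arbitrary seed for the simulated stand-in data

female <- rnorm(12, 18500, 4800)
male   <- rnorm(10, 19500, 4800)

# W = (sum of the first group's ranks in the pooled sample) - n1 * (n1 + 1) / 2
pooled_ranks <- rank(c(female, male))
n1 <- length(female)
w_by_hand <- sum(pooled_ranks[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Statistic reported by R for the same two samples
w_r <- wilcox.test(female, male)$statistic

c(by_hand = w_by_hand, wilcox_test = unname(w_r))
```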

Quarter 1 and age group

wilcox.test(quarter_1~age_group)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  quarter_1 by age_group
## W = 3993, p-value = 1.104e-09
## alternative hypothesis: true location shift is not equal to 0

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Quarter 1 sales show a significant location shift between young and old employees.

Quarter 1 and marital status

stat325 <- transform(stat325,marital_status = as.factor(marital_status),education_level = as.factor(education_level),department = as.factor(department))
attach(stat325)

## The following objects are masked from stat325 (pos = 3):
## 
##     age, age_group, annual_sales, appraisal_score, department,
##     education_level, experience, marital_status, marketing_budget,
##     number_of_assistants, personel_number, quarter_1, quarter_2,
##     quarter_3, quarter_4, sex

## The following objects are masked from stat325 (pos = 4):
## 
##     age, appraisal_score, department, education_level, experience,
##     marital_status, marketing_budget, number_of_assistants,
##     personel_number, quarter_1, quarter_2, quarter_3, quarter_4, sex

kruskal.test(quarter_1~marital_status)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_1 by marital_status
## Kruskal-Wallis chi-squared = 2.4976, df = 3, p-value = 0.4757

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. The distribution of quarter 1 sales does not differ significantly across marital status groups.

Quarter 1 and education level

kruskal.test(quarter_1~education_level)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_1 by education_level
## Kruskal-Wallis chi-squared = 14.312, df = 3, p-value = 0.00251

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. The distribution of quarter 1 sales differs significantly across education levels.
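Like ANOVA, a significant Kruskal-Wallis test does not say which groups differ. One possible follow-up, not part of the original analysis, is a set of pairwise Wilcoxon tests with adjusted p-values. The sketch below uses simulated stand-in data; the data frame `sim`, its group means, and the seed are hypothetical.

```r
set.seed(4)  # arbitrary seed for the simulated stand-in data

sim <- data.frame(
  education_level = factor(rep(c("Bachelors", "Certificate",
                                 "Diploma", "Post graduate"), each = 25)),
  quarter_1 = rnorm(100,
                    mean = rep(c(19000, 17000, 18000, 21000), each = 25),
                    sd = 4500)
)

# Omnibus test across the four education levels
kruskal.test(quarter_1 ~ education_level, data = sim)

# Pairwise rank sum tests with Holm-adjusted p-values
pw <- pairwise.wilcox.test(sim$quarter_1, sim$education_level,
                           p.adjust.method = "holm")
pw$p.value
```

The result is a lower-triangular matrix with one adjusted p-value per pair of education levels.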

Quarter 1 and department

kruskal.test(quarter_1~department)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_1 by department
## Kruskal-Wallis chi-squared = 3.8982, df = 5, p-value = 0.5642

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. The distribution of quarter 1 sales does not differ significantly across departments.

Quarter 2 and sex

wilcox.test(quarter_2~sex)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  quarter_2 by sex
## W = 2573, p-value = 0.2169
## alternative hypothesis: true location shift is not equal to 0

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Quarter 2 sales show no significant location shift between female and male salespersons.

Quarter 2 and age group

wilcox.test(quarter_2~age_group)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  quarter_2 by age_group
## W = 3802, p-value = 9.625e-08
## alternative hypothesis: true location shift is not equal to 0

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Quarter 2 sales show a significant location shift between young and old employees.

Quarter 2 and marital status

kruskal.test(quarter_2~marital_status)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_2 by marital_status
## Kruskal-Wallis chi-squared = 6.8368, df = 3, p-value = 0.07728

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. The distribution of quarter 2 sales does not differ significantly across marital status groups.

Quarter 2 and education level

kruskal.test(quarter_2~education_level)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_2 by education_level
## Kruskal-Wallis chi-squared = 8.9715, df = 3, p-value = 0.02967

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. The distribution of quarter 2 sales differs significantly across education levels.

Quarter 2 and department

kruskal.test(quarter_2~department)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_2 by department
## Kruskal-Wallis chi-squared = 11.446, df = 5, p-value = 0.04322

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. The distribution of quarter 2 sales differs significantly across departments. This differs from the ANOVA for the same comparison, which was not significant (p = 0.109).

Quarter 3 and sex

wilcox.test(quarter_3~sex)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  quarter_3 by sex
## W = 2419, p-value = 0.07216
## alternative hypothesis: true location shift is not equal to 0

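The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Quarter 3 sales show no significant location shift between female and male salespersons. This differs from the Welch t-test for the same comparison, which was significant (p = 0.04104).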

Quarter 3 and age group

wilcox.test(quarter_3~age_group)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  quarter_3 by age_group
## W = 3611, p-value = 4.792e-06
## alternative hypothesis: true location shift is not equal to 0

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Quarter 3 sales show a significant location shift between young and old employees.

Quarter 3 and marital status

kruskal.test(quarter_3~marital_status)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_3 by marital_status
## Kruskal-Wallis chi-squared = 10.513, df = 3, p-value = 0.01468

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. The distribution of quarter 3 sales differs significantly across marital status groups.

Quarter 3 and education level

kruskal.test(quarter_3~education_level)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_3 by education_level
## Kruskal-Wallis chi-squared = 12.948, df = 3, p-value = 0.004751

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. The distribution of quarter 3 sales differs significantly across education levels.

Quarter 3 and department

kruskal.test(quarter_3~department)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_3 by department
## Kruskal-Wallis chi-squared = 6.247, df = 5, p-value = 0.2829

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. The distribution of quarter 3 sales does not differ significantly across departments.

Quarter 4 and sex

wilcox.test(quarter_4~sex)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  quarter_4 by sex
## W = 2691, p-value = 0.422
## alternative hypothesis: true location shift is not equal to 0

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Quarter 4 sales show no significant location shift between female and male salespersons.

Quarter 4 and age group

wilcox.test(quarter_4~age_group)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  quarter_4 by age_group
## W = 3772, p-value = 1.845e-07
## alternative hypothesis: true location shift is not equal to 0

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Quarter 4 sales show a significant location shift between young and old employees.

Quarter 4 and marital status

kruskal.test(quarter_4~marital_status)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_4 by marital_status
## Kruskal-Wallis chi-squared = 5.0574, df = 3, p-value = 0.1676

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. The distribution of quarter 4 sales does not differ significantly across marital status groups.

Quarter 4 and education level

kruskal.test(quarter_4~education_level)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_4 by education_level
## Kruskal-Wallis chi-squared = 10.02, df = 3, p-value = 0.01839

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. The distribution of quarter 4 sales differs significantly across education levels.

Quarter 4 and department

kruskal.test(quarter_4~department)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  quarter_4 by department
## Kruskal-Wallis chi-squared = 7.731, df = 5, p-value = 0.1717

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. The distribution of quarter 4 sales does not differ significantly across departments.

Annual Sales and sex

wilcox.test(annual_sales~sex)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  annual_sales by sex
## W = 2489, p-value = 0.1231
## alternative hypothesis: true location shift is not equal to 0

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. Annual sales show no significant location shift between female and male salespersons.

Annual Sales and age group

wilcox.test(annual_sales~age_group)

## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  annual_sales by age_group
## W = 4042, p-value = 3.203e-10
## alternative hypothesis: true location shift is not equal to 0

The p-value is less than \(\alpha = 0.05\), so we reject the null hypothesis. Annual sales show a significant location shift between young and old employees.

Annual Sales and marital status

kruskal.test(annual_sales~marital_status)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  annual_sales by marital_status
## Kruskal-Wallis chi-squared = 7.3164, df = 3, p-value = 0.06247

The p-value is greater than \(\alpha = 0.05\), so we fail to reject the null hypothesis. The distribution of annual sales does not differ significantly across marital status groups.

Annual Sales and education level

kruskal.test(annual_sales~education_level)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  annual_sales by education_level
## Kruskal-Wallis chi-squared = 13.836, df = 3, p-value = 0.003138

The p-value is less than \[\alpha = 0.05\], so we reject the null hypothesis. The distribution of annual sales differs significantly for at least one education level.

Annual Sales and department

kruskal.test(annual_sales~department)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  annual_sales by department
## Kruskal-Wallis chi-squared = 7.9549, df = 5, p-value = 0.1587

The p-value is greater than \[\alpha = 0.05\], so we fail to reject the null hypothesis. There is no significant difference in the distribution of annual sales across departments.
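The Kruskal-Wallis tests in this section all follow the same pattern: one sales variable, one grouping factor, one p-value compared with 0.05. A minimal sketch of running them in a loop and collecting the results in a single table is given below. It uses simulated data, and the helper `run_kw`, the data frame `toy` and the group labels are hypothetical names introduced for illustration.

```r
# Simulated stand-in for the assignment data (hypothetical labels and values).
set.seed(1)
toy <- data.frame(
  marital_status = factor(sample(c("Single", "Married", "Divorced", "Widowed"),
                                 120, replace = TRUE)),
  department     = factor(sample(LETTERS[1:6], 120, replace = TRUE)),
  quarter_4      = rnorm(120, 20000, 3000),
  annual_sales   = rnorm(120, 80000, 9000)
)

# Hypothetical helper: one Kruskal-Wallis test returned as a one-row data frame.
run_kw <- function(outcome, group, data) {
  test <- kruskal.test(reformulate(group, response = outcome), data = data)
  data.frame(outcome = outcome, group = group,
             statistic = unname(test$statistic),
             df = unname(test$parameter),
             p_value = test$p.value)
}

# Every combination of outcome and grouping factor.
pairs <- expand.grid(outcome = c("quarter_4", "annual_sales"),
                     group = c("marital_status", "department"),
                     stringsAsFactors = FALSE)

results <- do.call(rbind, Map(run_kw, pairs$outcome, pairs$group,
                              MoreArgs = list(data = toy)))
rownames(results) <- NULL
results$reject_h0 <- results$p_value < 0.05
print(results)
```

Collecting the tests in one data frame makes it easier to see the decisions side by side and to pass the table to `knitr::kable()`.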

8. Repeat 6) but use the non-parametric approach

Quarter 1 and Quarter 2

wilcox.test(quarter_1,quarter_2,paired = TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  quarter_1 and quarter_2
## V = 3261, p-value = 1.678e-06
## alternative hypothesis: true location shift is not equal to 0

The p-value is less than \[\alpha = 0.05\], so we reject the null hypothesis. There is a significant difference between quarter 1 and quarter 2 sales.

Quarter 1 and Quarter 3

wilcox.test(quarter_1,quarter_3,paired = TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  quarter_1 and quarter_3
## V = 966, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The p-value is less than \[\alpha = 0.05\], so we reject the null hypothesis. There is a significant difference between quarter 1 and quarter 3 sales.
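The statistic `V` reported by the paired test is the sum of the ranks of the positive differences, which is why the test concerns the location of the paired differences rather than the two means. The sketch below reproduces `V` by hand on simulated measurements. The vectors `before` and `after` are hypothetical and stand in for two quarterly sales columns.

```r
set.seed(42)
before <- rnorm(25, mean = 100, sd = 10)        # hypothetical first measurement
after  <- before + rnorm(25, mean = 4, sd = 6)  # hypothetical second measurement

d <- before - after            # paired differences, x - y as in wilcox.test()
ranks <- rank(abs(d))          # rank the absolute differences
v_manual <- sum(ranks[d > 0])  # V = sum of ranks belonging to positive differences

v_r <- unname(wilcox.test(before, after, paired = TRUE)$statistic)
c(manual = v_manual, wilcox_test = v_r)
```

A small `V` means most differences are negative (the second measurement tends to be larger), and a `V` near its maximum means the opposite.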

Quarter 1 and Quarter 4

wilcox.test(quarter_1,quarter_4,paired = TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  quarter_1 and quarter_4
## V = 955, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0

The p-value is less than \[\alpha = 0.05\], so we reject the null hypothesis. There is a significant difference between quarter 1 and quarter 4 sales.

Quarter 2 and Quarter 3

wilcox.test(quarter_2,quarter_3,paired = TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  quarter_2 and quarter_3
## V = 2301.5, p-value = 6.299e-11
## alternative hypothesis: true location shift is not equal to 0

The p-value is less than \[\alpha = 0.05\], so we reject the null hypothesis. There is a significant difference between quarter 2 and quarter 3 sales.

Quarter 2 and Quarter 4

wilcox.test(quarter_2,quarter_4,paired = TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  quarter_2 and quarter_4
## V = 2223.5, p-value = 2.413e-11
## alternative hypothesis: true location shift is not equal to 0

The p-value is less than \[\alpha = 0.05\], so we reject the null hypothesis. There is a significant difference between quarter 2 and quarter 4 sales.

Quarter 3 and Quarter 4

wilcox.test(quarter_3,quarter_4,paired = TRUE)

## 
##  Wilcoxon signed rank test with continuity correction
## 
## data:  quarter_3 and quarter_4
## V = 5622.5, p-value = 0.6261
## alternative hypothesis: true location shift is not equal to 0

The p-value is greater than \[\alpha = 0.05\], so we fail to reject the null hypothesis. There is no significant difference between quarter 3 and quarter 4 sales.
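Six tests on the same data inflate the chance of at least one false rejection. As a rough check, the p-values printed above can be adjusted for multiple comparisons. The sketch below uses Holm's method, which is an assumption made here rather than part of the assignment, and it enters each `< 2.2e-16` result as its upper bound `2.2e-16`.

```r
# P-values as printed above; "< 2.2e-16" is entered as its upper bound.
p_raw <- c(q1_q2 = 1.678e-06, q1_q3 = 2.2e-16, q1_q4 = 2.2e-16,
           q2_q3 = 6.299e-11, q2_q4 = 2.413e-11, q3_q4 = 0.6261)

# Holm's step-down adjustment controls the family-wise error rate.
p_holm <- p.adjust(p_raw, method = "holm")

data.frame(p_raw, p_holm, reject_h0 = p_holm < 0.05)
```

Under this adjustment the decisions stay the same: the first five pairs remain significant and quarter 3 versus quarter 4 does not.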

9. Consider the relationship between age, number of assistants, experience, marketing budget, appraisal score, quarterly and annual sales

a. Determine the product moment correlation coefficients. Interpret your result

numeric_vars$annual_sales <- stat325$annual_sales
knitr::kable(round(cor(numeric_vars),2))

	age	number_of_assistants	experience	marketing_budget	appraisal_score	quarter_1	quarter_2	quarter_3	quarter_4	annual_sales
age	1.00	0.79	0.87	0.76	-0.10	0.66	0.65	0.65	0.64	0.76
number_of_assistants	0.79	1.00	0.91	0.96	-0.13	0.77	0.72	0.74	0.76	0.87
experience	0.87	0.91	1.00	0.88	-0.16	0.74	0.71	0.72	0.72	0.84
marketing_budget	0.76	0.96	0.88	1.00	-0.12	0.77	0.73	0.73	0.76	0.87
appraisal_score	-0.10	-0.13	-0.16	-0.12	1.00	0.03	0.11	0.01	0.09	0.07
quarter_1	0.66	0.77	0.74	0.77	0.03	1.00	0.66	0.63	0.62	0.84
quarter_2	0.65	0.72	0.71	0.73	0.11	0.66	1.00	0.68	0.66	0.87
quarter_3	0.65	0.74	0.72	0.73	0.01	0.63	0.68	1.00	0.68	0.86
quarter_4	0.64	0.76	0.72	0.76	0.09	0.62	0.66	0.68	1.00	0.86
annual_sales	0.76	0.87	0.84	0.87	0.07	0.84	0.87	0.86	0.86	1.00

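There is a moderate to strong positive correlation among age, number of assistants, experience, marketing budget, quarterly sales and annual sales (coefficients between 0.62 and 0.96).
There is a weak negative correlation between appraisal score and each of age, number of assistants, experience and marketing budget (coefficients between -0.16 and -0.10).
There is a very weak positive correlation between appraisal score and the quarterly and annual sales (coefficients between 0.01 and 0.11), so there is practically no linear relationship.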
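Whether a coefficient differs significantly from zero can be checked with the usual t-test, \(t = r\sqrt{n-2}/\sqrt{1-r^2}\) on \(n-2\) degrees of freedom. The sketch below applies it to two coefficients from the table, assuming all 153 observations reported by `describe()` enter every pair. The helper `cor_p_value` is a hypothetical name, and because the inputs are rounded to two decimals the p-values are approximate.

```r
# Two-sided p-value for H0: rho = 0, given a sample correlation r from n pairs.
cor_p_value <- function(r, n) {
  t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t_stat), df = n - 2)
}

n <- 153  # number of observations, as reported by describe()

p_weak   <- cor_p_value(-0.10, n)  # appraisal_score vs age
p_strong <- cor_p_value(0.76, n)   # age vs annual_sales
c(appraisal_vs_age = p_weak, age_vs_annual_sales = p_strong)
```

On the raw data the same test is available directly as `cor.test(x, y)`, which avoids the rounding.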

b. Compute the Spearman’s Rank Correlation coefficients. Interpret your result.

knitr::kable(round(cor(numeric_vars,method = "spearman"),2))

	age	number_of_assistants	experience	marketing_budget	appraisal_score	quarter_1	quarter_2	quarter_3	quarter_4	annual_sales
age	1.00	0.79	0.88	0.77	-0.12	0.67	0.61	0.63	0.63	0.75
number_of_assistants	0.79	1.00	0.90	0.96	-0.14	0.77	0.71	0.74	0.76	0.87
experience	0.88	0.90	1.00	0.88	-0.16	0.74	0.67	0.71	0.72	0.84
marketing_budget	0.77	0.96	0.88	1.00	-0.11	0.77	0.72	0.73	0.76	0.88
appraisal_score	-0.12	-0.14	-0.16	-0.11	1.00	0.02	0.12	0.00	0.07	0.07
quarter_1	0.67	0.77	0.74	0.77	0.02	1.00	0.64	0.64	0.64	0.85
quarter_2	0.61	0.71	0.67	0.72	0.12	0.64	1.00	0.64	0.64	0.84
quarter_3	0.63	0.74	0.71	0.73	0.00	0.64	0.64	1.00	0.68	0.85
quarter_4	0.63	0.76	0.72	0.76	0.07	0.64	0.64	0.68	1.00	0.86
annual_sales	0.75	0.87	0.84	0.88	0.07	0.85	0.84	0.85	0.86	1.00

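There is a moderate to strong positive rank correlation among age, number of assistants, experience, marketing budget, quarterly sales and annual sales (coefficients between 0.61 and 0.96).
There is a weak negative rank correlation between appraisal score and each of age, number of assistants, experience and marketing budget (coefficients between -0.16 and -0.11).
There is a very weak positive rank correlation between appraisal score and the quarterly and annual sales (coefficients between 0.00 and 0.12), so there is practically no monotonic relationship.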
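The two tables are close because Spearman's coefficient is simply the product moment coefficient computed on the ranks of the data. The sketch below shows this on simulated values. The variables `x` and `y` are hypothetical and deliberately skewed so that the two coefficients differ.

```r
set.seed(9)
x <- rexp(50)                    # hypothetical right-skewed variable
y <- x^2 + rnorm(50, sd = 0.1)   # increasing but non-linear relationship

pearson          <- cor(x, y)                       # measures linear association
spearman         <- cor(x, y, method = "spearman")  # measures monotonic association
pearson_on_ranks <- cor(rank(x), rank(y))           # same value as spearman

c(pearson = pearson, spearman = spearman, pearson_on_ranks = pearson_on_ranks)
```

When the two coefficients nearly agree, as they do in the tables above, the relationships are close to linear and not dominated by outliers.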

10. Fit multiple linear regression models for quarterly and annual sales on age, number of assistants, experience, marketing budget, and appraisal score. Comment on the significance of the fitted models as well as the significance of each of the independent variables

Quarter 1

fit1 <- lm(quarter_1 ~ age + number_of_assistants + experience + marketing_budget + appraisal_score, data = stat325)
summary(fit1)

## 
## Call:
## lm(formula = quarter_1 ~ age + number_of_assistants + experience + 
##     marketing_budget + appraisal_score, data = stat325)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -5789  -2454   -131   2114   6409 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          5.232e+03  2.666e+03   1.962  0.05162 . 
## age                  1.729e+01  4.663e+01   0.371  0.71129   
## number_of_assistants 1.292e+02  1.315e+02   0.983  0.32730   
## experience           1.942e+02  1.232e+02   1.577  0.11694   
## marketing_budget     1.994e-03  9.494e-04   2.100  0.03747 * 
## appraisal_score      8.065e+01  2.972e+01   2.714  0.00745 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3009 on 147 degrees of freedom
## Multiple R-squared:  0.6322, Adjusted R-squared:  0.6197 
## F-statistic: 50.53 on 5 and 147 DF,  p-value: < 2.2e-16

The model explains 63.22% of the variability in quarter 1 sales, and the overall F-test (F = 50.53 on 5 and 147 DF, p-value < 2.2e-16) shows that the model is significant.
Age, number of assistants and experience have p-values greater than 0.05, so they are not statistically significant at the 5% level.
Marketing budget and appraisal score have p-values less than 0.05, so they are statistically significant.
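The F-statistic and the adjusted R-squared in the summary both follow from the multiple R-squared, the number of predictors and the residual degrees of freedom. The sketch below recomputes them from the values printed for the quarter 1 model as a consistency check. Small discrepancies are expected because the printed R-squared is rounded to four decimals.

```r
r2       <- 0.6322  # Multiple R-squared from summary(fit1)
k        <- 5       # number of predictors
df_resid <- 147     # residual degrees of freedom
n        <- df_resid + k + 1  # number of observations used in the fit

# Overall F-test: explained variance per predictor over residual variance per df.
f_stat  <- (r2 / k) / ((1 - r2) / df_resid)
p_value <- pf(f_stat, k, df_resid, lower.tail = FALSE)

# Adjusted R-squared penalises the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_resid

c(n = n, f_stat = f_stat, adj_r2 = adj_r2, p_value = p_value)
```

The same arithmetic applies to the other four models by swapping in their R-squared values.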

Quarter 2

fit2 <- lm(quarter_2 ~ age + number_of_assistants + experience + marketing_budget + appraisal_score, data = stat325)
summary(fit2)

## 
## Call:
## lm(formula = quarter_2 ~ age + number_of_assistants + experience + 
##     marketing_budget + appraisal_score, data = stat325)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6402.4 -2179.7  -346.7  2870.0  6367.9 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.961e+03  2.784e+03   1.064 0.289252    
## age                  5.363e+01  4.869e+01   1.102 0.272433    
## number_of_assistants 6.921e+01  1.373e+02   0.504 0.614982    
## experience           1.777e+02  1.286e+02   1.382 0.169017    
## marketing_budget     2.047e-03  9.914e-04   2.064 0.040746 *  
## appraisal_score      1.228e+02  3.103e+01   3.957 0.000118 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3142 on 147 degrees of freedom
## Multiple R-squared:  0.5992, Adjusted R-squared:  0.5855 
## F-statistic: 43.95 on 5 and 147 DF,  p-value: < 2.2e-16

The model explains 59.92% of the variability in quarter 2 sales, and the overall F-test (F = 43.95 on 5 and 147 DF, p-value < 2.2e-16) shows that the model is significant.
Age, number of assistants and experience have p-values greater than 0.05, so they are not statistically significant at the 5% level.
Marketing budget and appraisal score have p-values less than 0.05, so they are statistically significant.

Quarter 3

fit3 <- lm(quarter_3 ~ age + number_of_assistants + experience + marketing_budget + appraisal_score, data = stat325)
summary(fit3)

## 
## Call:
## lm(formula = quarter_3 ~ age + number_of_assistants + experience + 
##     marketing_budget + appraisal_score, data = stat325)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -6583  -2099    277   2229   5861 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.057e+04  2.710e+03   3.899 0.000146 ***
## age                  4.578e+01  4.739e+01   0.966 0.335573    
## number_of_assistants 2.105e+02  1.336e+02   1.575 0.117450    
## experience           1.366e+02  1.252e+02   1.092 0.276745    
## marketing_budget     9.791e-04  9.649e-04   1.015 0.311914    
## appraisal_score      6.398e+01  3.020e+01   2.118 0.035825 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3058 on 147 degrees of freedom
## Multiple R-squared:  0.5779, Adjusted R-squared:  0.5635 
## F-statistic: 40.24 on 5 and 147 DF,  p-value: < 2.2e-16

The model explains 57.79% of the variability in quarter 3 sales, and the overall F-test (F = 40.24 on 5 and 147 DF, p-value < 2.2e-16) shows that the model is significant.
Age, number of assistants, experience and marketing budget have p-values greater than 0.05, so they are not statistically significant at the 5% level.
Appraisal score has a p-value less than 0.05, so it is statistically significant.

Quarter 4

fit4 <- lm(quarter_4 ~ age + number_of_assistants + experience + marketing_budget + appraisal_score, data = stat325)
summary(fit4)

## 
## Call:
## lm(formula = quarter_4 ~ age + number_of_assistants + experience + 
##     marketing_budget + appraisal_score, data = stat325)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5466.6 -2282.3  -678.5  2356.2  7074.6 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.275e+03  2.731e+03   2.298 0.022996 *  
## age                  1.496e+01  4.776e+01   0.313 0.754534    
## number_of_assistants 1.671e+02  1.347e+02   1.240 0.216821    
## experience           1.474e+02  1.261e+02   1.168 0.244605    
## marketing_budget     2.144e-03  9.725e-04   2.205 0.029005 *  
## appraisal_score      1.212e+02  3.044e+01   3.980 0.000108 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3082 on 147 degrees of freedom
## Multiple R-squared:  0.6378, Adjusted R-squared:  0.6254 
## F-statistic: 51.76 on 5 and 147 DF,  p-value: < 2.2e-16

The model explains 63.78% of the variability in quarter 4 sales, and the overall F-test (F = 51.76 on 5 and 147 DF, p-value < 2.2e-16) shows that the model is significant.
Age, number of assistants and experience have p-values greater than 0.05, so they are not statistically significant at the 5% level.
Marketing budget and appraisal score have p-values less than 0.05, so they are statistically significant.

Annual Sales

fitAn <- lm(annual_sales ~ age + number_of_assistants + experience + marketing_budget + appraisal_score, data = stat325)
summary(fitAn)

## 
## Call:
## lm(formula = annual_sales ~ age + number_of_assistants + experience + 
##     marketing_budget + appraisal_score, data = stat325)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13408  -4812  -1364   4629  19835 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.503e+04  6.331e+03   3.954 0.000119 ***
## age                  1.317e+02  1.107e+02   1.189 0.236272    
## number_of_assistants 5.760e+02  3.122e+02   1.845 0.067106 .  
## experience           6.560e+02  2.924e+02   2.243 0.026384 *  
## marketing_budget     7.163e-03  2.254e-03   3.177 0.001811 ** 
## appraisal_score      3.886e+02  7.057e+01   5.506 1.59e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7145 on 147 degrees of freedom
## Multiple R-squared:  0.8232, Adjusted R-squared:  0.8172 
## F-statistic: 136.9 on 5 and 147 DF,  p-value: < 2.2e-16

The model explains 82.32% of the variability in annual sales, and the overall F-test (F = 136.9 on 5 and 147 DF, p-value < 2.2e-16) shows that the model is significant.
Age and number of assistants have p-values greater than 0.05, so they are not statistically significant at the 5% level.
Experience, marketing budget and appraisal score have p-values less than 0.05, so they are statistically significant.
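In every model the overall F-test is highly significant while several individual predictors are not. One possible explanation is multicollinearity, since the correlation tables show that age, number of assistants, experience and marketing budget are strongly correlated with one another (up to 0.96). A common diagnostic is the variance inflation factor, \(VIF_j = 1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on the other predictors. The sketch below computes it in base R on simulated data. The variables `x1`, `x2`, `x3` and the data frame `toy` are hypothetical, and this diagnostic was not part of the original analysis.

```r
# Simulated predictors: x2 is nearly a copy of x1, x3 is unrelated to both.
set.seed(10)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.2)
x3 <- rnorm(n)
y  <- 1 + x1 + x2 + x3 + rnorm(n)
toy <- data.frame(y, x1, x2, x3)

predictors <- c("x1", "x2", "x3")

# VIF for each predictor: regress it on the remaining predictors.
vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  r2 <- summary(lm(reformulate(others, response = p), data = toy))$r.squared
  1 / (1 - r2)
})
round(vif, 2)
```

Large values for `x1` and `x2` signal that their coefficients are estimated imprecisely, while the value near 1 for `x3` shows a predictor that is unaffected. Applying the same loop to the five predictors in `stat325` would show whether this explanation holds for the fitted models.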


ADVANCED COMPUTERIZED METHODS

Tarus Gilbert

12/30/2020
