question 1
pizza_Toppings <- c("pineaple","pepper","pepperoni")
toppings <- c(0,1,2,3)
#toppings = (1/pizza_Toppings * 1/toppings)
prob <- (1/3 *1/4)
prob
## [1] 0.08333333
There is a probability of 0.08333333 that a random customer will ask for two toppings. Question 2
x=5
dbinom(x<= 5, size =100, prob = 0.03)
## [1] 0.1470696
There is a 0.1470696 probability that at most five tomatoes are bad.
Question 3
two.dice <- function(){
dice <- c(3,3,5,5,7,7)
}
more.sims <- replicate(1000, two.dice())
table(more.sims)/length(more.sims)
## more.sims
## 3 5 7
## 0.3333333 0.3333333 0.3333333
two.dice2 <- function(){
dice <- c(2,2,4,4,9,9)
}
more.sims <- replicate(1000, two.dice2())
table(more.sims)/length(more.sims)
## more.sims
## 2 4 9
## 0.3333333 0.3333333 0.3333333
two.dice2 <- function(){
dice <- c(1,1,6,6,8,8)
}
more.sims <- replicate(1000, two.dice2())
table(more.sims)/length(more.sims)
## more.sims
## 1 6 8
## 0.3333333 0.3333333 0.3333333
Question 4, binomial distribution probabilities
dbinom(5,10, prob = 0.5)
## [1] 0.2460938
dbinom(50,100, prob = 0.5)
## [1] 0.07958924
dbinom(500,1000, prob=0.5)
## [1] 0.02522502
5 heads in 10 coin flips has the highest likelihood with a 24.60938% chance of occurrence, followed by 5o heads in 100 coin flips with 7.958924% chance and 500 heads has the least chance of occurrence with a 2.522502% chance of occurrence.
Question 5
my.dice.sum <- function(n.dice, n.sides){
dice <- sample(1:n.sides, size = n.dice, replace = TRUE)
return(sum(dice))
}
replicate(10, my.dice.sum(3,4))
## [1] 7 5 6 8 4 7 9 6 8 9
Question 6
experimentA <- function(){
rolls <- sample(1:6, size = 4, replace = TRUE)
condition <- sum(rolls == 6) > 0
return(condition)
}
experimentB <- function(){
first.die <- sample(1:6, size = 24, replace = TRUE)
second.die <- sample(1:6, size = 24, replace = TRUE)
condition <- sum((first.die == second.die) & (first.die == 6)) > 0
return(condition)
}
simsA <- replicate(100000, experimentA())
sum(simsA)/length(simsA)
## [1] 0.51783
simsB <- replicate(100000, experimentB())
sum(simsB)/length(simsB)
## [1] 0.49217
There is a 51.63% likelihood of getting at least one 6 when rolling a single fair six-sided die 4 times and a 48.996% for getting at least one pair of sixes when two fair , six-sided dice are thrown 24 times. It is therefore conclusive that option (A) is more likely.
Perform an analysis of the “store24” data set. Importing the data set and converting to a data frame
library(readr)
library(psych)
setwd("~/R codes and pdf/my data")
Store24 <- read_csv(data.frame("Store24.csv"))
Question 7(a) comparing the results of the summary statistics and the psych library data summary
summary <- summary(Store24);summary
## store Sales Profit MTenure
## Min. : 1.0 Min. : 699306 Min. :122180 Min. : 0.00
## 1st Qu.:19.5 1st Qu.: 984579 1st Qu.:211004 1st Qu.: 6.67
## Median :38.0 Median :1127332 Median :265014 Median : 24.12
## Mean :38.0 Mean :1205413 Mean :276314 Mean : 45.30
## 3rd Qu.:56.5 3rd Qu.:1362388 3rd Qu.:331314 3rd Qu.: 50.92
## Max. :75.0 Max. :2113089 Max. :518998 Max. :277.99
## CTenure Pop Comp Visibility
## Min. : 0.8871 Min. : 1046 Min. : 1.651 Min. :2.00
## 1st Qu.: 4.3943 1st Qu.: 5616 1st Qu.: 3.151 1st Qu.:3.00
## Median : 7.2115 Median : 8896 Median : 3.629 Median :3.00
## Mean : 13.9315 Mean : 9826 Mean : 3.788 Mean :3.08
## 3rd Qu.: 17.2156 3rd Qu.:14104 3rd Qu.: 4.230 3rd Qu.:4.00
## Max. :114.1519 Max. :26519 Max. :11.128 Max. :5.00
## PedCount Res Hours24 CrewSkill MgrSkill
## Min. :1.00 Min. :0.00 Min. :0.00 Min. :2.060 Min. :2.957
## 1st Qu.:2.00 1st Qu.:1.00 1st Qu.:1.00 1st Qu.:3.225 1st Qu.:3.344
## Median :3.00 Median :1.00 Median :1.00 Median :3.500 Median :3.589
## Mean :2.96 Mean :0.96 Mean :0.84 Mean :3.457 Mean :3.638
## 3rd Qu.:4.00 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:3.655 3rd Qu.:3.925
## Max. :5.00 Max. :1.00 Max. :1.00 Max. :4.640 Max. :4.622
## ServQual
## Min. : 57.90
## 1st Qu.: 78.95
## Median : 89.47
## Mean : 87.15
## 3rd Qu.: 99.90
## Max. :100.00
library(dplyr)
library(tidyverse)
psych <- describe(Store24);psych
## vars n mean sd median trimmed mad
## store 1 75 38.00 21.79 38.00 38.00 28.17
## Sales 2 75 1205413.12 304531.31 1127332.00 1182031.25 288422.04
## Profit 3 75 276313.61 89404.08 265014.00 270260.34 90532.00
## MTenure 4 75 45.30 57.67 24.12 33.58 29.67
## CTenure 5 75 13.93 17.70 7.21 10.60 6.14
## Pop 6 75 9825.59 5911.67 8896.00 9366.07 7266.22
## Comp 7 75 3.79 1.31 3.63 3.66 0.82
## Visibility 8 75 3.08 0.75 3.00 3.07 0.00
## PedCount 9 75 2.96 0.99 3.00 2.97 1.48
## Res 10 75 0.96 0.20 1.00 1.00 0.00
## Hours24 11 75 0.84 0.37 1.00 0.92 0.00
## CrewSkill 12 75 3.46 0.41 3.50 3.47 0.34
## MgrSkill 13 75 3.64 0.41 3.59 3.62 0.45
## ServQual 14 75 87.15 12.61 89.47 88.62 15.61
## min max range skew kurtosis se
## store 1.00 75.00 74.00 0.00 -1.25 2.52
## Sales 699306.00 2113089.00 1413783.00 0.71 -0.09 35164.25
## Profit 122180.00 518998.00 396818.00 0.62 -0.21 10323.49
## MTenure 0.00 277.99 277.99 2.01 3.90 6.66
## CTenure 0.89 114.15 113.26 3.52 15.00 2.04
## Pop 1046.00 26519.00 25473.00 0.62 -0.23 682.62
## Comp 1.65 11.13 9.48 2.48 11.31 0.15
## Visibility 2.00 5.00 3.00 0.25 -0.38 0.09
## PedCount 1.00 5.00 4.00 0.00 -0.52 0.11
## Res 0.00 1.00 1.00 -4.60 19.43 0.02
## Hours24 0.00 1.00 1.00 -1.82 1.32 0.04
## CrewSkill 2.06 4.64 2.58 -0.43 1.64 0.05
## MgrSkill 2.96 4.62 1.67 0.27 -0.53 0.05
## ServQual 57.90 100.00 42.10 -0.66 -0.72 1.46
The mean,median,min and max are the statistical information provided by the summary statistics function and the psych library description.
Question 8
Measuring the Mean and standard deviation of the store profit, management tenure (MTenure), and crew tenure (Ctenure).
library(dplyr)
importantvariables <- cbind.data.frame(Store24$Profit, Store24$MTenure, Store24$CTenure)
mean(importantvariables$`Store24$Profit`)
## [1] 276313.6
mean(importantvariables$`Store24$MTenure`)
## [1] 45.29644
mean(importantvariables$`Store24$CTenure`)
## [1] 13.9315
The means of profit,MTenure and CTenure are 276313.6, 45.29644 and 13.9315 respectively.
#Standard deviation
sd(importantvariables$`Store24$Profit`)
## [1] 89404.08
sd(importantvariables$`Store24$MTenure`)
## [1] 57.67155
sd(importantvariables$`Store24$CTenure`)
## [1] 17.69752
The standard deviation of profit,MTenure and CTenure 89404.08, 57.67155 and 17.69752 respectively.
Question 7 (C)
Sorting and Sub-setting data in R. Printing the (storeID, sales, profit, MTenure and CTenure) of the top 10 most profitable and bottom 10 least profitable stores
profitability <- Store24[order(desc(Store24$Profit)),]
#Selecting the top 10 most profitable stores
top10_profitable <- head(profitability, 10)
subset(top10_profitable, select = c("store","Sales","Profit","MTenure","CTenure"))
## # A tibble: 10 × 5
## store Sales Profit MTenure CTenure
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 74 1782957 518998 171. 29.5
## 2 7 1809256 476355 62.5 7.33
## 3 9 2113089 474725 109. 6.06
## 4 6 1703140 469050 150. 11.4
## 5 44 1807740 439781 182. 114.
## 6 2 1619874 424007 86.2 6.64
## 7 45 1602362 410149 47.6 9.17
## 8 18 1704826 394039 240. 33.8
## 9 11 1583446 389886 44.8 2.04
## 10 47 1665657 387853 12.8 6.64
#Selecting the bottom 10 least profitable stores
bottom10_profitable <- tail(profitability,10)
subset(bottom10_profitable, select = c("store","Sales","Profit","MTenure","CTenure"))
## # A tibble: 10 × 5
## store Sales Profit MTenure CTenure
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 37 1202917 187765 23.2 1.35
## 2 61 716589 177046 21.8 13.3
## 3 52 1073008 169201 24.1 3.42
## 4 54 811190 159792 6.67 3.88
## 5 13 857843 152513 0.657 1.58
## 6 32 828918 149033 36.1 6.64
## 7 55 925744 147672 6.67 18.4
## 8 41 744211 147327 14.9 11.9
## 9 66 879581 146058 115. 3.88
## 10 57 699306 122180 24.3 2.96
Question 7(d) Correlation between different variables.
#Correlation results and scatter plot of correlation between the profit and CTenure variables
cor.test(Store24$Profit,Store24$CTenure)
##
## Pearson's product-moment correlation
##
## data: Store24$Profit and Store24$CTenure
## t = 2.2786, df = 73, p-value = 0.02562
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03262507 0.45786339
## sample estimates:
## cor
## 0.2576789
correlation is a statistical measure of the interdependence of variables in a data set. A correlation coefficient of 0.2576789 indicates that there is a weak positive (negative) linear relationship via a shaky linear rule between the profit and the crew tenure.
#Scatter plot
Store24$Profit <- 0.01*(Store24$Profit)
plot(Store24$Profit,Store24$CTenure, main="Scatter plot for correlation between profits and crew tenure", xlab="profit",ylab = "CTenure")
#Correlation results and scatter plot of correlation between the profit and MTenure variables
cor.test(Store24$Profit,Store24$MTenure)
##
## Pearson's product-moment correlation
##
## data: Store24$Profit and Store24$MTenure
## t = 4.1731, df = 73, p-value = 8.193e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2353497 0.6055175
## sample estimates:
## cor
## 0.4388692
correlation is a statistical measure of the interdependence of variables in a data set. A correlation coefficient of 0.4388692 indicate a moderate positive (negative) linear relationship via a fuzzy-firm linear rule. between the profit and the management tenure.
plot(Store24$Profit,Store24$MTenure, main="Scatter plot for correlation between profits and management tenure", xlab="profit",ylab = "MTenure")
Question 7(e) correlation matrix
library(Hmisc)
Correlation matrix between all variables
library(round)
matrix <- cor(as.matrix(Store24))
round(matrix, digits=2)
## store Sales Profit MTenure CTenure Pop Comp Visibility PedCount
## store 1.00 -0.23 -0.20 -0.06 0.02 -0.29 0.03 -0.03 -0.22
## Sales -0.23 1.00 0.92 0.45 0.25 0.40 -0.24 0.13 0.42
## Profit -0.20 0.92 1.00 0.44 0.26 0.43 -0.33 0.14 0.45
## MTenure -0.06 0.45 0.44 1.00 0.24 -0.06 0.18 0.16 0.06
## CTenure 0.02 0.25 0.26 0.24 1.00 0.00 -0.07 0.07 -0.08
## Pop -0.29 0.40 0.43 -0.06 0.00 1.00 -0.27 -0.05 0.61
## Comp 0.03 -0.24 -0.33 0.18 -0.07 -0.27 1.00 0.03 -0.15
## Visibility -0.03 0.13 0.14 0.16 0.07 -0.05 0.03 1.00 -0.14
## PedCount -0.22 0.42 0.45 0.06 -0.08 0.61 -0.15 -0.14 1.00
## Res -0.03 -0.17 -0.16 -0.06 -0.34 -0.24 0.22 0.02 -0.28
## Hours24 0.03 0.06 -0.03 -0.17 0.07 -0.22 0.13 0.05 -0.28
## CrewSkill 0.05 0.16 0.16 0.10 0.26 0.28 -0.04 -0.20 0.21
## MgrSkill -0.07 0.31 0.32 0.23 0.12 0.08 0.22 0.07 0.09
## ServQual -0.32 0.39 0.36 0.18 0.08 0.12 0.02 0.21 -0.01
## Res Hours24 CrewSkill MgrSkill ServQual
## store -0.03 0.03 0.05 -0.07 -0.32
## Sales -0.17 0.06 0.16 0.31 0.39
## Profit -0.16 -0.03 0.16 0.32 0.36
## MTenure -0.06 -0.17 0.10 0.23 0.18
## CTenure -0.34 0.07 0.26 0.12 0.08
## Pop -0.24 -0.22 0.28 0.08 0.12
## Comp 0.22 0.13 -0.04 0.22 0.02
## Visibility 0.02 0.05 -0.20 0.07 0.21
## PedCount -0.28 -0.28 0.21 0.09 -0.01
## Res 1.00 -0.09 -0.15 -0.03 0.09
## Hours24 -0.09 1.00 0.11 -0.04 0.06
## CrewSkill -0.15 0.11 1.00 -0.02 -0.03
## MgrSkill -0.03 -0.04 -0.02 1.00 0.36
## ServQual 0.09 0.06 -0.03 0.36 1.00
Pearson’s correlation test between profit, CTenure and MTenure
part(i)
#Pearson's correlation test between profit and MTenure
cor.test(Store24$Profit, Store24$MTenure, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: Store24$Profit and Store24$MTenure
## t = 4.1731, df = 73, p-value = 8.193e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2353497 0.6055175
## sample estimates:
## cor
## 0.4388692
A p-value is a probability that the null hypothesis is true. The p-value of the correlation between the profit and management Tenure is 8.193e-05 which is less than the significance level of 0.05, therefore we reject the null hypothesis and accept the alternative hypothesis since the correlation statistically significant.
part(ii)
#Pearsons correlation test between profit ans CTenure
cor.test(Store24$Profit, Store24$CTenure, method = "pearson")
##
## Pearson's product-moment correlation
##
## data: Store24$Profit and Store24$CTenure
## t = 2.2786, df = 73, p-value = 0.02562
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.03262507 0.45786339
## sample estimates:
## cor
## 0.2576789
A p-value is a probability that the null hypothesis is true. The p-value of the correlation between the profit and crew tenure is 0.02562 which is less than the significance level of 0.05, therefore we reject the null hypothesis and accept the alternative hypothesis since the correlation statistically significant.
Question 7(f)
#regression
regression <- subset(Store24, select = c("Profit","MTenure","CTenure","Comp","Pop","PedCount","Res","Hours24","Visibility"))
Model <- lm(Profit ~., data = regression)
summary(Model)
##
## Call:
## lm(formula = Profit ~ ., data = regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1057.89 -359.46 -70.69 337.80 1123.90
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 76.10041 668.21994 0.114 0.909674
## MTenure 7.60993 1.27086 5.988 9.72e-08 ***
## CTenure 9.44978 4.21687 2.241 0.028400 *
## Comp -252.86887 54.91937 -4.604 1.94e-05 ***
## Pop 0.03667 0.01466 2.501 0.014890 *
## PedCount 340.87359 90.73196 3.757 0.000366 ***
## Res 915.84675 392.31283 2.334 0.022623 *
## Hours24 632.33307 196.41114 3.219 0.001994 **
## Visibility 126.25447 90.87620 1.389 0.169411
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 569.7 on 66 degrees of freedom
## Multiple R-squared: 0.6379, Adjusted R-squared: 0.594
## F-statistic: 14.53 on 8 and 66 DF, p-value: 5.382e-12
part(i)
Explanatory variables whose coefficients are statistically significant are; Management tenure, Crew tenure, Comp, Pop, PedCount, Res,and Hours24.
part(ii)
Explanatory variables whose coefficients are not statistically significant are: Visibility and profit
part(iii)
From the correlations results there is a positive relationship between the profit, management tenure and the crew tenure variables.Therefore, An increase in a months experience would result to an equal proportionate increase in the profits.
Question 8
passadieci <- replicate(100000, my.dice.sum(n.dice = 3, n.sides = 6))
sum(passadieci >= 11)/length(passadieci)
## [1] 0.49951
sum(passadieci == 11)/length(passadieci)
## [1] 0.1243
sum(passadieci == 12)/length(passadieci)
## [1] 0.11527
sum((passadieci <= 7) | (passadieci >= 15))/length(passadieci)
## [1] 0.25479
There is a 25.529% likelihood of getting a sum greater than 7 or no less than q5 when throwing three dice
plot of the simulated probabilities of each possible sum when throwing three fair,six-sided dice.
plot(table(passadieci)/length(passadieci), xlab = 'Sum',
ylab = 'Relative Frequency', main = 'Passadieci Simulation: 100000 Throws')