library(DescTools)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(moments)
library(rmarkdown)
library(knitr)
library(rsconnect)
#rmarkdown::render("homework1_probabilitytheory.Rmd")

Descriptive Statistics

Descriptive statistics are brief informational coefficients that summarize a given data set, which can be either a representation of the entire population or a sample of a population. Descriptive statistics are broken down into measures of central tendency and measures of variability (spread). Measures of central tendency include the mean, median, and mode, while measures of variability include the standard deviation, variance, minimum and maximum values, kurtosis, and skewness.
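
As a quick illustration (a minimal sketch using a small, hypothetical vector x that is not part of the exercise data; skewness() and kurtosis() come from the moments package and Mode() from DescTools), all of these measures are one-line calls in R:

# hypothetical toy vector, for illustration only
x <- c(2, 4, 4, 5, 7, 9)
mean(x); median(x); DescTools::Mode(x)      # measures of central tendency
var(x); sd(x); max(x) - min(x)              # measures of variability
moments::skewness(x); moments::kurtosis(x)  # shape of the distribution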

Objective

Analyze any given data set to compute summary statistics, identify potential outliers, and assess the homogeneity or heterogeneity of the data.

Exercise 1

Given the following sales data (the higher the better), in thousands of units, compute the summary statistics and analyze them.

Data
#vector
sales <- c(106, 100, 112, 124, 118, 114, 113, 116, 117, 114, 115, 117, 113, 110, 112, 115, 114, 113, 116, 114, 112, 114, 116, 115, 113, 115)
Number of observations
#length(sales)
print(paste("The provided information contains information of", length(sales), "sales"))
## [1] "The provided information contains information of 26 sales"
Number of unique observations
#For gathering this value we need to first call "dplyr" library
#library(dplyr)
#n_distinct(sales)
print(paste("The sales list contains", n_distinct(sales), "unique within the values"))
## [1] "The sales list contains 11 unique within the values"

Given that there are 26 total sales values but only 11 unique entries, duplicate values are clearly present (the frequency check below shows which values repeat). With this information, we can proceed to calculate the minimum and maximum values and, from those, the range, to understand the span within which the observations fall.
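
An optional check with base R's table() shows exactly which values repeat (the most frequent value also anticipates the mode computed later):

# frequency of each distinct sales value; counts greater than 1 are the duplicates
table(sales)
# number of duplicated entries: 26 observations - 11 unique values = 15
sum(duplicated(sales))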

Minimum and maximum values
cat("The minimum value within the total sales is: ", min(sales), ", while the maximum one is: ", max(sales), ". Therefore, the range is: ", range(sales)[2] - range(sales)[1], "\n")
## The minimum value within the total sales is:  100 , while the maximum one is:  124 . Therefore, the range is:  24

The range looks small, suggesting that there is not much dispersion in the data. Let’s look at the measures of central tendency to gain more insight.

Measures of central tendency

These measures include the mean, the median, and the mode.

print("1. Average")
## [1] "1. Average"
cat("The average or mean for the provided information is: ", round(mean(sales),2), 
    ". We did not compute the weighted average as we did not have more information; hence, we assume arithmetic mean. \n\n"  )
## The average or mean for the provided information is:  113.77 . We did not compute the weighted average as we did not have more information; hence, we assume arithmetic mean.
print("2. Mode")
## [1] "2. Mode"
cat("To compute the mode we might need to use an external package as the mode() function that is built-in in R works only for unimodal which means that multiple values within a dataset will not be showed. \n\n")
## To compute the mode we might need to use an external package as the mode() function that is built-in in R works only for unimodal which means that multiple values within a dataset will not be showed.
#install.packages("DescTools")
#library(DescTools)
cat("The mode for the provided information is: ", Mode(sales), 
    ". The close proximity of the mean and the mode suggests that the data may exhibit a concentration on the lower end, which could imply a right-skewed distribution. In the context of 'sales', this is not ideal, as we would expect a higher concentration of values on the higher end, or at least values that are closer to the higher end.\n\n")
## The mode for the provided information is:  114 . The close proximity of the mean and the mode suggests that the data may exhibit a concentration on the lower end, which could imply a right-skewed distribution. In the context of 'sales', this is not ideal, as we would expect a higher concentration of values on the higher end, or at least values that are closer to the higher end.
print("3. Median")
## [1] "3. Median"
cat("The median for the provided information is: ", median(sales), 
    ". This indicates that 50% of the observations fall between ", 
    range(sales)[1], " and ", median(sales), ". When analyzing sales data, this suggests that most sales typically range between these values. Now, considering that the maximun value is ", range(sales)[2], "we might say that values are following a normal distribution. \n\n")
## The median for the provided information is:  114 . This indicates that 50% of the observations fall between  100  and  114 . When analyzing sales data, this suggests that most sales typically range between these values. Now, considering that the maximun value is  124 we might say that values are following a normal distribution.

Before continuing with the exercise, it is always a good idea to draw some charts to better understand the data. Good options for this purpose are histograms and box plots.

Histogram and Box plot
par(mfrow = c(1, 2))
hist(sales, main = "Histogram of sales", xlab = "sales")
#hist(sales, freq=FALSE)
avg <- mean(sales)
mod <- Mode(sales)
med <- median(sales)
# dashed reference lines: mean (blue), mode (red), median (green)
abline(v = c(avg, mod, med), col = c('blue', 'red', 'green'), lwd = 2, lty = 'dashed')

boxplot(sales, horizontal = TRUE, main = "Boxplot of sales", xlab = "sales")
# mark the mean (blue) and mode (red) on the box plot
points(c(avg, mod), y = rep(1, 2), col = c('blue', 'red'), pch = 20)

par(mfrow = c(1, 1))
Figure 1: Histogram and box plot of sales

Based on the above charts, the data appears to follow an approximately normal distribution; nevertheless, there seem to be some outliers: two at the lower end and one at the upper end, which may be distorting the measures of central tendency.
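
To back up the visual impression, the points flagged by the box plot's whisker rule (1.5 times the IQR beyond the hinges) can be listed with base R's boxplot.stats(); for this data it should return the two low values and the single high value mentioned above:

# values falling outside the whiskers of the box plot
boxplot.stats(sales)$out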

hist(sales, freq = F, main = "Histogram of sales", xlab = "sales")
abline(v = c(avg, mod, med), col = c('blue', 'red', 'green'), lwd = 2, lty = 'dashed')
lines(density(sales), lwd = 2)

Based on the previous chart, a histogram with the probability density overlaid, and considering that the mode and median are nearly identical and close to the mean, we can infer that the distribution of the data is approximately symmetric. However, visual inspection alone is not sufficient for a comprehensive analysis. Therefore, skewness and kurtosis will be calculated in the following chunks to provide a more robust assessment of the data distribution.

Before doing so, and leveraging the box plot displayed above, measures of position can also help us understand the data: they give a range within which a certain percentage of the observations fall.

Measures of Position: Quantiles
print("1. Quartiles")
## [1] "1. Quartiles"
cat("The data can be divided into four quartiles as follows:\n",
    "1. The first quartile (Q1) is the range from the minimum value", "\033[1m", quantile(sales)[1], "\033[0m", "up to", "\033[1m", quantile(sales)[2], "\033[0m", ", where 25% of the data falls below ", quantile(sales)[2], ".\n",
    "2. The second quartile (Q2) is the range from", "\033[1m", quantile(sales)[2], "\033[0m", "up to the median (50th percentile) ", "\033[1m", quantile(sales)[3], "\033[0m", ", covering the next 25% of the data.\n",
    "3. The third quartile (Q3) is the range from the median ", "\033[1m", quantile(sales)[3], "\033[0m", "up to", "\033[1m", quantile(sales)[4], "\033[0m", ", covering the next 25% of the data.\n",
    "4. The fourth quartile (Q4) extends from ", "\033[1m", quantile(sales)[4], "\033[0m", " to the maximum value ", "\033[1m", quantile(sales)[5], "\033[0m", ", covering the final 25% of the data.\n\n",
    "   • The middle 50% of the data is between ", quantile(sales)[1], " and ", quantile(sales)[3], ".\n",
    "   • Overall, 100% of the data spans from the minimum value ", quantile(sales)[1], " to the maximum value ", quantile(sales)[5], ".\n\n\n")
## The data can be divided into four quartiles as follows:
##  1. The first quartile (Q1) is the range from the minimum value  100  up to  113  , where 25% of the data falls below  113 .
##  2. The second quartile (Q2) is the range from  113  up to the median (50th percentile)   114  , covering the next 25% of the data.
##  3. The third quartile (Q3) is the range from the median   114  up to  115.75  , covering the next 25% of the data.
##  4. The fourth quartile (Q4) extends from   115.75   to the maximum value   124  , covering the final 25% of the data.
## 
##     • The middle 50% of the data is between  113  and  115.75 .
##     • Overall, 100% of the data spans from the minimum value  100  to the maximum value  124 .
print("2. IQR: Interquartile range")
## [1] "2. IQR: Interquartile range"
cat("The Interquartile Range (IQR) is: ", "\033[1m", round(IQR(sales), 2), "\033[0m",
    ". \nThis value indicates the range within which the central 50% of the data falls. A small IQR suggests that there is limited dispersion among the middle 50% of the data points, implying less variability in this central portion of the data.\n\n")
## The Interquartile Range (IQR) is:   2.75  . 
## This value indicates the range within which the central 50% of the data falls. A small IQR suggests that there is limited dispersion among the middle 50% of the data points, implying less variability in this central portion of the data.
print("3. Interdecile range")
## [1] "3. Interdecile range"
cat("The interdecile range is between: ", "\033[1m", quantile(sales, 0.1), "\033[0m", "and","\033[1m", quantile(sales,0.9), "\033[0m", ". Its result is: ", "\033[1m", quantile(sales, 0.9) - quantile(sales, 0.1), "\033[0m",". 
Considering that the interdecile range covers 80% of the data and is relatively small,",
    "we can infer that there is limited dispersion within the central portion of the data.\n\n")
## The interdecile range is between:   111  and  117  . Its result is:   6  . 
## Considering that the interdecile range covers 80% of the data and is relatively small, we can infer that there is limited dispersion within the central portion of the data.
print("4. Intersixtile range")
## [1] "4. Intersixtile range"
cat("The intersixtile range is between: ", "\033[1m", quantile(sales, 1/6), "\033[0m", "and","\033[1m", quantile(sales, 5/6), "\033[0m", ". Its result is: ", "\033[1m", round(quantile(sales, 5/6) - quantile(sales, 1/6),2), "\033[0m",". 
Considering that this measure captures roughly the central two-thirds (about 67%) of the data and its value is relatively small,",
    "we can infer that there is limited dispersion and the values tend to be homogeneous \n\n")
## The intersixtile range is between:   112  and  116  . Its result is:   4  . 
## Considering that this measure captures roughly the central two-thirds (about 67%) of the data and its value is relatively small, we can infer that there is limited dispersion and the values tend to be homogeneous
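
Since the IQR, the interdecile range and the intersixtile range all follow the same pattern (the width between two symmetric quantiles), a small helper avoids repeating the formula; mid_range is a hypothetical name used only in this sketch:

# width of the central region that leaves a proportion p in each tail
mid_range <- function(x, p) unname(diff(quantile(x, c(p, 1 - p))))
mid_range(sales, 0.25)  # IQR (central 50%)
mid_range(sales, 0.10)  # interdecile range (central 80%)
mid_range(sales, 1/6)   # intersixtile range (central ~67%)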

Based on the previous analysis, we know that there are a couple of outliers, but the values toward the center of the sample tend to be similar to one another. Measures of dispersion can be used to reinforce the previous calculations.

Measures of Dispersion
Absolute Measures
print("1. Range")
## [1] "1. Range"
cat("We have already compute the range and knew that the value was 24\n\n")
## We have already compute the range and knew that the value was 24
print("2. Variance")
## [1] "2. Variance"
cat("As we learnt in class, we know that variance could not have a well interpretation, nevertheless its value is :", round(var(sales),2), "\n\n")
## As we learnt in class, we know that variance could not have a well interpretation, nevertheless its value is : 17.62
print("3. Standard Deviation")
## [1] "3. Standard Deviation"
cat("With the idea of knowing the dispersion / spread of the data we might calculate the Standard Deviation which is the root square of the variance and does have interpretation. In this case its value is: ", round(sd(sales),2), ". In this case value is small and we can infer that there is not much spread in the data \n\n")
## With the idea of knowing the dispersion / spread of the data we might calculate the Standard Deviation which is the root square of the variance and does have interpretation. In this case its value is:  4.2 . In this case value is small and we can infer that there is not much spread in the data
Relative Measures
print("1. Coefficient of variation")
## [1] "1. Coefficient of variation"
cv = round(((sd(sales)/mean(sales))*100),2)
cat("We agreed that an acceptable coefficient of variation (CV) should be 15% or lower. In this case, the CV for the sales data is ", cv, "%, which is quite favorable. This indicates that the sales values tend to be relatively consistent and are included within a narrow range.\n\n")
## We agreed that an acceptable coefficient of variation (CV) should be 15% or lower. In this case, the CV for the sales data is  3.69 %, which is quite favorable. This indicates that the sales values tend to be relatively consistent and are included within a narrow range.
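
Since the coefficient of variation is reused later (and again in Exercise 2), it can be wrapped in a small helper; cv_pct is a hypothetical name used for this sketch, not a built-in R function:

# coefficient of variation as a percentage: (sd / mean) * 100
cv_pct <- function(x) round(sd(x) / mean(x) * 100, 2)
cv_pct(sales)  # should reproduce the 3.69 % reported above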
Measures of Skewness and Kurtosis
print("1. Skewness")
## [1] "1. Skewness"
cat("The skewness of the 'sales' data set is: ", round(skewness(sales), 2), 
    ". A negative skewness value indicates that the distribution is skewed to the right, meaning there is a higher concentration of values on the right end of the data. However, leftward outliers can influence this measure. In the context of sales data, this skewness is favorable, as we generally expect a higher concentration of values at the higher end of the scale.\n\n")
## The skewness of the 'sales' data set is:  -1.05 . A negative skewness value indicates that the distribution is skewed to the right, meaning there is a higher concentration of values on the right end of the data. However, leftward outliers can influence this measure. In the context of sales data, this skewness is favorable, as we generally expect a higher concentration of values at the higher end of the scale.
print("2. Kurtosis")
## [1] "2. Kurtosis"
cat("The kurtosis of the 'sales' data set is: ", round(kurtosis(sales), 2), 
    ". A positive kurtosis (any value higher than zero) means that the distribution of the data is concentrated around the mean and it is called leptokurtic as it was shown in the histogram with the proability density \n\n")
## The kurtosis of the 'sales' data set is:  6.88 . A positive kurtosis (any value higher than zero) means that the distribution of the data is concentrated around the mean and it is called leptokurtic as it was shown in the histogram with the proability density
Conclusion

We analyzed a small data set containing sales information, focusing on basic descriptive statistics. This included central tendency measures, positional measures, dispersion measures, and assessments of skewness and kurtosis.

Our findings indicate that, despite the presence of outliers, the data set approximates a normal distribution. This is advantageous for the business context, where we seek homogeneity in sales values. Additionally, the negative skewness observed suggests that most values are concentrated towards the higher end of the distribution, which aligns well with our objective of having more sales concentrated in the upper range.

Appendix
sort(sales)
##  [1] 100 106 110 112 112 112 113 113 113 113 114 114 114 114 114 115 115 115 115
## [20] 116 116 116 117 117 118 124
stem(sales, scale = 1)
## 
##   The decimal point is at the |
## 
##   100 | 0
##   102 | 
##   104 | 
##   106 | 0
##   108 | 
##   110 | 0
##   112 | 0000000
##   114 | 000000000
##   116 | 00000
##   118 | 0
##   120 | 
##   122 | 
##   124 | 0

Using the stem-and-leaf plot we can see the distribution of the values and clearly identify the outliers: 100 and 106 on the left (top of the chart) and 124 at the bottom of the chart (the right, or higher, end of the data). What would happen if we removed those values?
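
Rather than hard-coding the three values, the same outliers could be dropped programmatically with the whisker rule from boxplot.stats(); for this particular vector the result should be identical to the manual removal performed below (sales_no_out is just an illustrative name):

# keep only the observations not flagged as outliers by the box plot rule
sales_no_out <- sales[!sales %in% boxplot.stats(sales)$out]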

sales1 <- sales[ ! sales %in%  c(100, 106, 124)]
sales1
##  [1] 112 118 114 113 116 117 114 115 117 113 110 112 115 114 113 116 114 112 114
## [20] 116 115 113 115
sales
##  [1] 106 100 112 124 118 114 113 116 117 114 115 117 113 110 112 115 114 113 116
## [20] 114 112 114 116 115 113 115

Now that the outliers have been removed, let’s compute and compare some of the previous measures.

par(mfrow = c(1, 2))

hist(sales, freq = F, main = "First Histogram of sales with outliers", xlab = "sales", cex.main=1, cex.lab=0.9)
abline(v = c(avg, mod, med), col = c('blue', 'red', 'green'), lwd = 2, lty = 'dashed')
lines(density(sales), lwd = 2)

hist(sales1, freq = F, main = "Second Histogram of sales without outliers", xlab = "sales1", cex.main=0.9, cex.lab=0.9)
abline(v = c(mean(sales1), Mode(sales1), median(sales1)), col = c('blue', 'red', 'green'), lwd = 2, lty = 'dashed')
lines(density(sales1), lwd = 2)
par(mfrow = c(1, 1))
Figure 2: Histograms of sales with and without outliers

It’s not magic! After removing the outliers, the distribution of the data looks much closer to a normal one. Let’s compute the new skewness and kurtosis to see how the concentration of the data has changed.

print("1. New Skewness")
## [1] "1. New Skewness"
cat("The new skewness of the 'sales1' data set is: ", round(skewness(sales1), 2), 
    ". This value is almost 0 which allows us to say that values are well distributed around the mean (average).\n\n")
## The new skewness of the 'sales1' data set is:  -0.06 . This value is almost 0 which allows us to say that values are well distributed around the mean (average).
print("2. New Kurtosis")
## [1] "2. New Kurtosis"
cat("The new kurtosis of the 'sales1' data set is: ", round(kurtosis(sales1), 2), 
    ". We moved from almost 7 to an almost 3 value. We still have a leptokurtik distribution \n\n")
## The new kurtosis of the 'sales1' data set is:  2.68 . We moved from almost 7 to an almost 3 value. We still have a leptokurtik distribution

What about the new CV?

print("1. New coefficient of variation")
## [1] "1. New coefficient of variation"
print(paste("The new average is: ", round(mean(sales1),2)))
## [1] "The new average is:  114.26"
print(paste("The new standard deviation is: ", round(sd(sales1),2)))
## [1] "The new standard deviation is:  1.91"
cv = round(((sd(sales1)/mean(sales1))*100),2)
cat("New average did not change much from previous one, this as we only had three outliers and that most of the values were located around the central tendenct. In this case, the new CV for the sales data is ", cv, "%, which is even better than previous one. This indicates that we have homogenous data now.\n\n")
## New average did not change much from previous one, this as we only had three outliers and that most of the values were located around the central tendenct. In this case, the new CV for the sales data is  1.67 %, which is even better than previous one. This indicates that we have homogenous data now.
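
A compact way to compare the two versions of the data side by side is a small table; this is only a sketch, and the column values are computed on the fly rather than typed in:

# summary metrics before and after removing the outliers
data.frame(
  metric           = c("mean", "sd", "CV (%)", "skewness", "kurtosis"),
  with_outliers    = round(c(mean(sales),  sd(sales),  sd(sales) / mean(sales) * 100,
                             skewness(sales),  kurtosis(sales)), 2),
  without_outliers = round(c(mean(sales1), sd(sales1), sd(sales1) / mean(sales1) * 100,
                             skewness(sales1), kurtosis(sales1)), 2)
)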

Let’s finish by plotting the previous and the new box plots and reviewing the final considerations.

par(mfrow = c(1, 2))

boxplot(sales, horizontal = TRUE, main = "First Boxplot of sales with outliers", xlab = "sales", cex.main=1, cex.lab=0.9)
# mark the mean (blue) and mode (red) of the original data
points(c(avg, mod), y = rep(1, 2), col = c('blue', 'red'), pch = 20)

boxplot(sales1, horizontal = TRUE, main = "Second Boxplot of sales without outliers", xlab = "sales1", cex.main=1, cex.lab=0.9)
# mark the mean (blue) and mode (red) of the data without outliers
points(c(mean(sales1), Mode(sales1)), y = rep(1, 2), col = c('blue', 'red'), pch = 20)

par(mfrow = c(1, 1))
Figure 3: Box plots of sales with and without outliers

Some important considerations: the range moved from 24 to 8 (see the check below), and while in the first scenario the average (113.77) was slightly below the mode (114), after removing the outliers the average (114.26) sits slightly above the mode, which is a small but positive change for the sales context.
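
Both ranges quoted above can be verified with a one-line check:

diff(range(sales))   # range with the outliers: 24
diff(range(sales1))  # range after removing them: 8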

Exercise 2

Given the following revenue data (the higher the better), in millions of units, compute the summary statistics and analyze them.

#vector
revenue <- c(10, 19, 15, 13, 16, 15, 14, 17, 18, 16, 17, 14, 12, 15, 12, 13, 18, 16, 14, 17, 15, 14, 16, 24, 15, 13)
#number of revenue observations
cat("For this second exercise we have :", length(revenue), "records in millions units and its average and standard deviation are :", round(mean(revenue),2), ",", round(sd(revenue),2), "respectively\n\n")
## For this second exercise we have : 26 records in millions units and its average and standard deviation are : 15.31 , 2.74 respectively
cat("The range is :", max(revenue) - min(revenue), ", and values are between", min(revenue), " as the minimum and", max(revenue), "as the maximum\n\n")
## The range is : 14 , and values are between 10  as the minimum and 24 as the maximum
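
Before the detailed analysis, base R's summary() gives a quick five-number overview plus the mean in a single call, a simple sanity check on the vector:

summary(revenue)  # minimum, quartiles, median, mean and maximum in one line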

We can take advantage of the box plot and the histogram to get a first impression of how the values are distributed:

par(mfrow = c(1, 2))

hist(revenue, main = "Histogram of revenue", xlab = "revenue")
avg <- mean(revenue)
mod <- Mode(revenue)
med <- median(revenue)
abline(v = c(avg, mod, med), col = c('blue', 'red', 'green'), lwd = 2, lty = 'dashed')

boxplot(revenue, horizontal = TRUE, main = "Boxplot of revenue", xlab = "revenue")
# mark the mean (blue) and mode (red) on the box plot
points(c(avg, mod), y = rep(1,2), col = c('blue', 'red'), pch = 20)

par(mfrow = c(1, 1))
Figure 4: Histogram and box plot of revenue

Based on the histogram and box plot, the value 24 is flagged as an outlier, since it lies well above the upper whisker (the check just below confirms this). The mean, mode, and median are nearly equal (15.31, 15, and 15, respectively), which often indicates a roughly symmetric distribution; however, the presence of the outlier may skew it.
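
As with the sales data, the box plot's whisker rule can list the flagged points explicitly; for this vector it should flag only the value 24:

# values outside the whiskers of the revenue box plot
boxplot.stats(revenue)$out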

It is crucial not to rely solely on visual inspection but to analyze the computed statistics for accurate insights. Although preliminary observations suggest a possible normal distribution with outliers, further examination of skewness and kurtosis is necessary to validate this assumption.

The histogram suggests that the distribution may be leptokurtic with positive skewness. In the context of the business problem, this is not ideal: we aim for higher revenue and would prefer the values to be concentrated toward the higher end rather than piled up around or below the mean.

print("1. Skewness")
## [1] "1. Skewness"
cat("The skewness of the 'revenue' data set is: ", round(skewness(revenue), 2), 
    ". A positive skewness value indicates that the distribution is skewed to the left, meaning there is a higher concentration of values on the left (lower) end of the data. In the revenue context, this skewness might not be favorable, as we generally expect a higher concentration of values at the higher (right) end of the scale or in other words that most ot he profits be higher than the average.\n\n")
## The skewness of the 'revenue' data set is:  0.98 . A positive skewness value indicates that the distribution is skewed to the left, meaning there is a higher concentration of values on the left (lower) end of the data. In the revenue context, this skewness might not be favorable, as we generally expect a higher concentration of values at the higher (right) end of the scale or in other words that most ot he profits be higher than the average.
print("2. Kurtosis")
## [1] "2. Kurtosis"
cat("The kurtosis of the 'sales' data set is: ", round(kurtosis(revenue), 2), 
    ". A positive kurtosis (any value higher than zero) means that the distribution of the data is concentrated around the mean and it is called leptokurtic as it was shown in the histogram with the proability density \n\n")
## The kurtosis of the 'sales' data set is:  5.29 . A positive kurtosis (any value higher than zero) means that the distribution of the data is concentrated around the mean and it is called leptokurtic as it was shown in the histogram with the proability density

We now have information about how the data is distributed: we know the mean and standard deviation, the skewness, and the kurtosis. We can now analyze the measures of position and then decide whether it is necessary to remove the outlier.

print("1. Quartiles")
## [1] "1. Quartiles"
cat("The data can be divided into four quartiles as follows:\n",
    "1. The first quartile (Q1) is the range from the minimum value", "\033[1m", quantile(revenue)[1], "\033[0m", "up to", "\033[1m", quantile(revenue)[2], "\033[0m", ", where 25% of the data falls below ", quantile(revenue)[2], ".\n",
    "2. The second quartile (Q2) is the range from", "\033[1m", quantile(revenue)[2], "\033[0m", "up to the median (50th percentile) ", "\033[1m", quantile(revenue)[3], "\033[0m", ", covering the next 25% of the data.\n",
    "3. The third quartile (Q3) is the range from the median ", "\033[1m", quantile(revenue)[3], "\033[0m", "up to", "\033[1m", quantile(revenue)[4], "\033[0m", ", covering the next 25% of the data.\n",
    "4. The fourth quartile (Q4) extends from ", "\033[1m", quantile(revenue)[4], "\033[0m", " to the maximum value ", "\033[1m", quantile(revenue)[5], "\033[0m", ", covering the final 25% of the data.\n\n",
    "   • The middle 50% of the data is between ", quantile(revenue)[1], " and ", quantile(revenue)[3], ".\n",
    "   • Overall, 100% of the data spans from the minimum value ", quantile(revenue)[1], " to the maximum value ", quantile(revenue)[5], ".\n\n\n")
## The data can be divided into four quartiles as follows:
##  1. The first quartile (Q1) is the range from the minimum value  10  up to  14  , where 25% of the data falls below  14 .
##  2. The second quartile (Q2) is the range from  14  up to the median (50th percentile)   15  , covering the next 25% of the data.
##  3. The third quartile (Q3) is the range from the median   15  up to  16.75  , covering the next 25% of the data.
##  4. The fourth quartile (Q4) extends from   16.75   to the maximum value   24  , covering the final 25% of the data.
## 
##     • The middle 50% of the data is between  14  and  16.75 .
##     • Overall, 100% of the data spans from the minimum value  10  to the maximum value  24 .
print("2. IQR: Interquartile range")
## [1] "2. IQR: Interquartile range"
cat("The Interquartile Range (IQR) is: ", "\033[1m", round(IQR(revenue), 2), "\033[0m",
    ". \nThis value indicates the range within which the central 50% of the data falls. A small IQR suggests that there is limited dispersion among the middle 50% of the data points, implying less variability in this central portion of the data.\n\n")
## The Interquartile Range (IQR) is:   2.75  . 
## This value indicates the range within which the central 50% of the data falls. A small IQR suggests that there is limited dispersion among the middle 50% of the data points, implying less variability in this central portion of the data.
print("3. Interdecile range")
## [1] "3. Interdecile range"
cat("The interdecile range is between: ", "\033[1m", quantile(revenue, 0.9), "\033[0m", "and","\033[1m", quantile(revenue,0.1), "\033[0m", ". Its result is: ", "\033[1m", quantile(revenue, 0.9) - quantile(revenue, 0.1), "\033[0m",". 
Considering that the interdecile range covers 80% of the data and is relatively small,",
    "we can infer that there is limited dispersion within the central portion of the data that for the revenue context could be favorable.\n\n")
## The interdecile range is between:   18  and  12.5  . Its result is:   5.5  . 
## Considering that the interdecile range covers 80% of the data and is relatively small, we can infer that there is limited dispersion within the central portion of the data that for the revenue context could be favorable.
print("4. Intersixtile range")
## [1] "4. Intersixtile range"
cat("The intersixtile range is between: ", "\033[1m", quantile(revenue, 1/6), "\033[0m", "and","\033[1m", quantile(revenue, 5/6), "\033[0m", ". Its result is: ", "\033[1m", round(quantile(revenue, 5/6) - quantile(revenue, 1/6),2), "\033[0m",". 
Considering that this measure captures roughly the central two-thirds (about 67%) of the data and its value is relatively small,",
    "we can infer that there is limited dispersion and the values tend to be homogeneous \n\n")
## The intersixtile range is between:   13  and  17  . Its result is:   4  . 
## Considering that this measure captures roughly the central two-thirds (about 67%) of the data and its value is relatively small, we can infer that there is limited dispersion and the values tend to be homogeneous

We have just seen, using both the quantiles and the standard deviation, that the spread in the data appears limited. We can compute the coefficient of variation to conclude; once that is done, we can remove the outlier and compare the old values with the new ones.

cat("1. Coefficient of variation: \nRelative measure of dispersion that is used, along with some others, to compare values wihtout units\n\n")
## 1. Coefficient of variation: 
## Relative measure of dispersion that is used, along with some others, to compare values without units
cv = round(((sd(revenue)/mean(revenue))*100),2)
cat("We agreed that an acceptable coefficient of variation (CV) should be 15% or lower. In this case, the CV for the revenue data is ", cv, "%, which is quite favorable. This indicates that the sales values tend to be relatively consistent and are included within a narrow range.\n\n")
## We agreed that an acceptable coefficient of variation (CV) should be 15% or lower. In this case, the CV for the revenue data is  17.89 %, which is quite favorable. This indicates that the sales values tend to be relatively consistent and are included within a narrow range.

Since the CV is only slightly above the threshold, we might not strictly need to remove the outlier, but the stem-and-leaf plot makes it easy to locate:

stem(revenue)
## 
##   The decimal point is at the |
## 
##   10 | 0
##   12 | 00000
##   14 | 000000000
##   16 | 0000000
##   18 | 000
##   20 | 
##   22 | 
##   24 | 0

Of course, between 18 million and 24 million there is a difference of 6 million, and considering that the average is 15.31 and that most of the data sits close to it, removing this single value should not cause a dramatic difference.

Let’s have a look at it:

revenue1 <- revenue[revenue != 24]
sort(revenue1)
##  [1] 10 12 12 13 13 13 14 14 14 14 15 15 15 15 15 16 16 16 16 17 17 17 18 18 19
sort(revenue)
##  [1] 10 12 12 13 13 13 14 14 14 14 15 15 15 15 15 16 16 16 16 17 17 17 18 18 19
## [26] 24

We have removed the outlier value; what do the measures look like without it?

par(mfrow = c(1, 2))

hist(revenue, freq = F, main = "First Histogram of revenue with the outlier", xlab = "revenue", cex.main=1, cex.lab=0.9)
abline(v = c(mean(revenue), Mode(revenue), median(revenue)), col = c('blue', 'red', 'green'), lwd = 2, lty = 'dashed')
lines(density(revenue), lwd = 2)

hist(revenue1, freq = F, main = "Second Histogram of revenue without the outlier", xlab = "revenue1", cex.main=0.9, cex.lab=0.9)
abline(v = c(mean(revenue1), Mode(revenue1), median(revenue1)), col = c('blue', 'red', 'green'), lwd = 2, lty = 'dashed')
lines(density(revenue1), lwd = 2)

par(mfrow = c(1, 1))
Figure 5: Histograms of revenue with and without the outlier

It looks like the kurtosis has changed; let’s take a look at its value and compare the other metrics.

print("1. New Skewness")
## [1] "1. New Skewness"
cat("The new skewness of the 'revenue1' data set is: ", round(skewness(revenue1), 2), "that compared to the previous value :", round(skewness(revenue), 2), ". \nWe moved from a positive value to a smaller but negative one which now tells us that the distribution could be distributed around the new average value which is : ", round(mean(revenue1),2), "that compared to the previous one :", round(mean(revenue),2), ". This value is almost 0 but negative which for the revenue context is favorable.\n\n")
## The new skewness of the 'revenue1' data set is:  -0.21 that compared to the previous value : 0.98 . 
## We moved from a positive value to a smaller but negative one which now tells us that the distribution could be distributed around the new average value which is :  14.96 that compared to the previous one : 15.31 . This value is almost 0 but negative which for the revenue context is favorable.
print("2. New Kurtosis")
## [1] "2. New Kurtosis"
cat("The new kurtosis of the 'revenue1' data set is: ", round(kurtosis(revenue1), 2), "that compared to the previous value :", round(kurtosis(revenue), 2), ". We know that kurtosis function in R works with 3, in this case as the value is lower than 3 we have a playkurtic kurtosis.
This suggests that removing the extreme value we had in the right end could have led to have fewer and less extreme outliers than a normal distribution. \n\n")
## The new kurtosis of the 'revenue1' data set is:  2.74 that compared to the previous value : 5.29 . We know that kurtosis function in R works with 3, in this case as the value is lower than 3 we have a playkurtic kurtosis.
## This suggests that removing the extreme value we had in the right end could have led to have fewer and less extreme outliers than a normal distribution.
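
To close Exercise 2, the before/after metrics can be laid out in the same side-by-side format used for the sales data (again only a sketch; the values are computed, not typed in):

# comparison of the revenue metrics with and without the outlier (24)
data.frame(
  metric          = c("mean", "sd", "CV (%)", "skewness", "kurtosis"),
  with_outlier    = round(c(mean(revenue),  sd(revenue),  sd(revenue) / mean(revenue) * 100,
                            skewness(revenue),  kurtosis(revenue)), 2),
  without_outlier = round(c(mean(revenue1), sd(revenue1), sd(revenue1) / mean(revenue1) * 100,
                            skewness(revenue1), kurtosis(revenue1)), 2)
)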

In conclusion, we have analyzed a toy data set, which has helped us apply the knowledge we have been gathering in the classroom. In both Exercise 1 and Exercise 2 we removed outliers in order to extend the analysis and get more practice; in real life, however, this should be done with the help of an expert in the relevant field, since in the context of these two problems it would be normal to have some months, for example peak seasons, in which both sales and revenue report higher values than usual. We have put into practice what a distribution looks like, using three different charts: the histogram, the box plot, and the stem-and-leaf plot. By also using measures of central tendency, dispersion, position, skewness, and kurtosis, we have covered the key concepts to use in future exploratory data analysis.