Blob 4 - Common Statistical Methods

Common Statistical Methods Used in Business

Mean, Median and Mode

The mean is calculated by taking the sum of the values and dividing by the number of values in a data series. The function mean() is used to calculate this in R.

The median is the middle value of a sorted data series; when the series has an even number of values, it is the average of the two middle values. The median() function is used in R to calculate this value.

The mode is the value that occurs most often in a set of data. Unlike the mean and median, the mode can be computed for both numeric and character data. R has no built-in function for the statistical mode (the built-in mode() function returns an object's storage mode instead), so a custom function can be used.

If there are missing values, mean() returns NA. To drop the missing values from the calculation, use na.rm = TRUE, which removes the NA values before the mean is computed.
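As a small illustration of the na.rm argument, the sketch below uses a made-up vector y (not part of the original example) that contains one missing value.

```r
# y is a hypothetical vector with one missing value, used only for illustration
y <- c(12, 7, NA, 3)

mean(y)                  # NA, because the missing value propagates
mean(y, na.rm = TRUE)    # mean of the three observed values: 22 / 3
median(y, na.rm = TRUE)  # median() accepts the same argument
```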

x <- c(12,7,3,4.2,18,2,54,-21,8,-5)

mean(x, trim = 0, na.rm = FALSE)

## [1] 8.22

median(x, na.rm = FALSE)

## [1] 5.6

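# Return the most frequent value in v (the first one encountered if there is a tie)
funMode <- function(v) {
   uq <- unique(v)                          # distinct values, in order of first appearance
   uq[which.max(tabulate(match(v, uq)))]    # count each distinct value and pick the most frequent
}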

funMode(x)

## [1] 12
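Every value in x occurs exactly once, so funMode(x) simply returns the first element, 12. The sketch below applies the same function to two made-up vectors (sizes and colours are hypothetical names) with repeated values, one numeric and one character.

```r
# Same helper as above, repeated so this block runs on its own
funMode <- function(v) {
  uq <- unique(v)
  uq[which.max(tabulate(match(v, uq)))]
}

sizes   <- c(2, 1, 2, 3, 1, 2)                 # hypothetical numeric data
colours <- c("red", "blue", "blue", "green")   # hypothetical character data

funMode(sizes)     # 2 occurs three times
funMode(colours)   # "blue" occurs twice

# table() shows the full frequency count, which helps when checking for ties
table(colours)
```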

Standard Deviation

Standard Deviation is a measure of the amount of variation of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.

x <- c(12,7,3,4.2,18,2,54,-21,8,-5)
sd(x, na.rm = FALSE)

## [1] 19.20057
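To make the calculation concrete, the following sketch reproduces sd() by hand. It assumes the sample definition with an n - 1 denominator, which is what sd() uses; the names n, deviations and sd_manual are introduced here for illustration.

```r
x <- c(12, 7, 3, 4.2, 18, 2, 54, -21, 8, -5)

n <- length(x)
deviations <- x - mean(x)                        # distance of each value from the mean
sd_manual  <- sqrt(sum(deviations^2) / (n - 1))  # sample standard deviation

all.equal(sd_manual, sd(x))                      # TRUE: matches the built-in function
```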

Regression

Regression models the relationship between a dependent variable and one or more explanatory variables, which is often visualised on a scatterplot with a fitted regression line. How closely the points follow the line indicates whether the relationship is strong or weak. Regression is commonly taught in high school or college statistics courses, with applications in science and business such as identifying trends over time. The example below fits a linear model of stopping distance (dist) on speed using R's built-in cars data set.

linearMod <- lm(dist ~ speed, data=cars)
summary(linearMod)

## 
## Call:
## lm(formula = dist ~ speed, data = cars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.069  -9.525  -2.272   9.215  43.201 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -17.5791     6.7584  -2.601   0.0123 *  
## speed         3.9324     0.4155   9.464 1.49e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 15.38 on 48 degrees of freedom
## Multiple R-squared:  0.6511, Adjusted R-squared:  0.6438 
## F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

# Show the four standard diagnostic plots for the fitted model in a 2 x 2 grid
par(mfrow = c(2, 2))
plot(linearMod)
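Beyond the printed summary, the fitted model can be queried directly. The sketch below extracts the coefficients and R-squared and predicts the stopping distance for a speed of 20; that speed is an arbitrary value chosen for illustration, and r2 and pred are names introduced here.

```r
linearMod <- lm(dist ~ speed, data = cars)

coef(linearMod)                       # intercept and slope as a named vector
r2 <- summary(linearMod)$r.squared    # share of the variance in dist explained by speed

# Predicted stopping distance for a hypothetical speed of 20
pred <- predict(linearMod, newdata = data.frame(speed = 20))
pred
```

With the coefficients shown in the summary, the prediction works out to roughly -17.58 + 3.93 * 20, or about 61.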

Sample Size Determination

How many is enough? Over the years, researchers have grappled with the problem of finding the perfect sample size for statistically sound results.

The size of the sample is very important for getting accurate, statistically significant results and running your study successfully.

If your sample is too small, it may include a disproportionate number of individuals that are outliers or anomalies. These skew the results, and you don’t get a fair picture of the whole population.
If the sample is too big, the whole study becomes complex, expensive and time-consuming to run, and although the results are more accurate, the benefits may not outweigh the costs.
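The text above does not give a formula, so the following is only one common textbook approach: to estimate a population mean within a margin of error E at a given confidence level, take n = (z * sigma / E)^2, rounded up. The values of sigma, margin and conf_level below are hypothetical and would need to come from a pilot study or prior knowledge.

```r
# Hypothetical inputs, chosen only for illustration
sigma      <- 15     # assumed population standard deviation
margin     <- 2      # desired margin of error, in the same units as the data
conf_level <- 0.95   # desired confidence level

z <- qnorm(1 - (1 - conf_level) / 2)            # two-sided critical value, about 1.96
n_required <- ceiling((z * sigma / margin)^2)   # round up to a whole observation
n_required
```

Halving the margin of error roughly quadruples the required sample size, which is the cost trade-off described above.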

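# Setup for the variance, skewness and kurtosis examples below
library(e1071)                    # provides skewness() and kurtosis()

duration <- faithful$eruptions    # eruption durations from the built-in faithful data set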

Variance

Variance measures how far the numbers in a data set are spread out from their mean. It is the average of the squared distances between each value and the mean; the var() function in R computes the sample variance, which divides by n - 1. In finance, variance can help gauge the risk an investor takes on when buying an investment.

var(duration)

## [1] 1.302728
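As a check on the definition, this sketch computes the sample variance by hand and compares it with var() and with the squared standard deviation. The names n and var_manual are introduced here for illustration.

```r
duration <- faithful$eruptions

n <- length(duration)
var_manual <- sum((duration - mean(duration))^2) / (n - 1)   # sample variance

all.equal(var_manual, var(duration))       # TRUE
all.equal(sd(duration)^2, var(duration))   # TRUE: variance is the squared standard deviation
```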

Skewness

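The skewness of a data population is defined as γ1 = μ3 / μ2^(3/2), where μ2 and μ3 are the second and third central moments. A negative value indicates a longer left tail, and a positive value a longer right tail. The skewness() function comes from the e1071 package loaded above.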

skewness(duration)

## [1] -0.4135498
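The moment-based definition can be sketched in base R as follows. As documented for e1071, skewness() defaults to type = 3, which divides the third central moment by the cube of the sample standard deviation rather than by μ2^(3/2), so the printed value should correspond to skew_sample rather than skew_population. All variable names here are introduced for illustration.

```r
duration <- faithful$eruptions
n <- length(duration)

m2 <- mean((duration - mean(duration))^2)   # second central moment
m3 <- mean((duration - mean(duration))^3)   # third central moment

skew_population <- m3 / m2^(3/2)            # moment-based definition
skew_sample     <- m3 / sd(duration)^3      # same ratio with the sample standard deviation

c(population = skew_population, sample = skew_sample)
```

The two values differ only by the factor ((n - 1) / n)^(3/2), which is close to 1 for a data set of this size.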

Kurtosis

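The excess kurtosis of a univariate population is defined as γ2 = μ4 / μ2^2 - 3, where μ2 and μ4 are respectively the second and fourth central moments. A negative value indicates lighter tails than a normal distribution, and a positive value heavier tails.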

kurtosis(duration)

## [1] -1.511605
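The same base R approach works for excess kurtosis. As with skewness, e1071's kurtosis() defaults to type = 3, which uses the sample standard deviation in the denominator, so the printed value should correspond to kurt_sample below. The variable names are introduced for illustration.

```r
duration <- faithful$eruptions
n <- length(duration)

m2 <- mean((duration - mean(duration))^2)   # second central moment
m4 <- mean((duration - mean(duration))^4)   # fourth central moment

kurt_population <- m4 / m2^2 - 3            # moment-based excess kurtosis
kurt_sample     <- m4 / sd(duration)^4 - 3  # same ratio with the sample standard deviation

c(population = kurt_population, sample = kurt_sample)
```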