Data Management and Analysis

Salmonella Case Study - (Question 1a)

Outbreak of an illness has been attributed tot ice cream ingestion. Some samples of 9 ice cream were collected and the levels of salmonella tested in MPG/g. This is the R code normality for normality of the samples collected from the test using Shapiro-Wilk test and a histogram for visualization

Data Input

salmonella_level <- c(0.59, 0.14, 0.33,0.69, 0.23, 0.79, 0.52, 0.39, 0.42)

Histogram

Description of the Histogram.

The Bars are mostly uniform except for the bar at the center of the distribution. The curve is over-layered and after transformation, it represents a bell-shaped curve. This suggests that the samples of ice cream collected depict that of a normal distribution.

Shapiro-Wilk Test for Normality

This is a statistical test that determines whether a data set follows a normal distribution. Its simply a hypothesis test with testing the null hypothesis testing for normality against the alternative hypothesis that the data significantly deviates from a normal deviation.

From the results, Shapiro-Wilk statistic is found out to be 0.98 with a probability value of also 0.98. This result means that there is no sufficient evidence to reject the null hypothesis described earlier. We therefore assume that the levels of salmonella recorded in the nine ice cream samples is approximately normally distributed.

This histogram combined with Shapiro-Wilk test confirm that the samples collected follow a normal distribution.

T-Test

The One Sample t-test indicates a strong statistical evidence to reject the null hypothesis. Its established that the true mean of salmonella level is greater than 0.3 MPG/g at a 5% significance level. The t-statistic is found to be 2.1974 with a probability value of 0.03 which shows a meaningful difference from the null hypothesis. The confidence interval further supports the alternative hypothesis as it ranges from 0.3239 to positive infinity.

Z-Test Comparison for Two Population Means - (Question 1b)

A function z is created that compares means for two populations. For instance, a null hypothesis and alternative hypothesis are formulated that states:

  • \(H_0\): There is no significant difference between the means of the two populations i.e \(( \bar{X_1} = \bar{X_2})\)

  • \(H_1\) : There is a significant difference between the means of the two populations \(( \bar{X_1} \neq \bar{X_2})\)

After obtaining the z-statistic. we compare it to the critical value from the standard tables or we can use qnorm to compare the critical value and the z-test value. A function is created that outputs the results of a z.test as displayed below.

Example

# Example
n <- 20
z_test_extended(rnorm(n,0,1),rnorm(n,1,0),0.05)
## [1] "Null hypothesis rejected and the means are significantly different with z-statistic as: 4.76 and z-critical as: 1.96"

Tidy Data - (Question 1c)

Wide format data follows the data structure where the observations are presented in single rows and different attributes organized in columns. On the other hand, long format is the data structure where there is columns for identification or grouping variables and even so variables for time stamps. The data set income was transformed from wide format to long format using pivot_longer.

income <- data.frame(
  id <- c(1,2,3),
  sex <- c("M", "F", "M"),
  "1980" <- c(5000, 2000, 3000),
  "1981" <- c(5500, 2200, 2000),
  "1982" <- c(6000, 3300, 1000)
)
colnames(income) <- c("id", "sex", "1980", "1981", "1982")

#Change to long form
income_long <- income %>% pivot_longer(
  cols = -c(id, sex),
  names_to = "year",
  values_to = "income"
)
datatable(income_long, options = list( dom = 't'), rownames = FALSE)

Average

The dataframe is first parsed to a grouping function before finding the average income based on the grouping option.

income_long %>% 
  group_by(sex) %>% 
  summarise(
    average = mean(income)
  ) %>% datatable(options = list(dom = 't'), rownames = F)

T-Distribution- (Question 1d)

The differnce clearly lies on the shapes of the two distribution when the degrees of freedom are changed.

  • On the first plot, df = 1 shows that the t-distribution has heavier tails compared to the normal distribution. The peak is also sharper compared to the broader peak of the normal distribution.

  • As the degrees of freedom increase to 30, the shape of the t-distribution almost resembles z-distribution. The t-distribution becomes more similar to the normal distribution.

Categorization - (Question 1e)

This is a scenario for nested if. For simplicity, mutate and case_when are used for easier work flow and easy interpretation.

score <- c(45, 20, 35, 85, 20, 28, 35, 45, 50, 60, 75, 65, 60, 55, 75, 80, 55)
score <- as_tibble(score)
table <- score %>% 
  mutate(grade = case_when(
    score < 49 ~'E',
    score < 59 ~ 'C',
    score < 69 ~ 'B',
    score >69 ~ 'A')) %>% 
  mutate(
    status = case_when(
      score<50 ~ "FAIL",
      score>49 ~ "PASS"))
datatable(table, options = list(dom = 't'), rownames = F)

Regression - (Question 2)

The task requires random generation of 1000 observations that follow normal distribution for Crop Yields, 1000 observations for Rainfall in mm that follow t-distribution and likewise number of observations for factor variables Weeding and Region.

By default, the entry method used in the regression is step-wise regression. The entry method is used for variable elimination if there are existing variables that don’t affect the performance of the model when they are included. For the model generated, Region is eliminated since it has no significant impact on the performance of the model overall.

The regression model generated can be interpreted as follows;

  • The intercept is estimated as 67.84. This is the predicted value of the dependent variable when all predictors are held constant. It has no practical implication in a real world scenario.

  • The variable Rainfall has a coefficient of 0.01756. This implies that with each unit increase in rainfall, there is an expected increase in crop yield by 0.01756 which is really small comparing to a real world scenario.

  • The variable Weeding has a coefficient of 0.8486 meaning that by an increase in a unit of weeding, we would expect crop yields to increase by 0.85. This shows that weeding has a positive impact on weeding.

  • Region variable has been eliminated from the model by the step-wise entry method.

The overall equation of the regression model can be written as:

\(Yields = 67.84 + (0.02)*Rainfall +(0.84)*Weeding\)

Diagnostic Plots - (Question 2b)

model1 <-lm(Crop_Yield~ Rainfall + Weeding + Region, Yields2)
autoplot(model1)+
  theme_minimal()

Diagnostic plots are used to evaluate the assumptions and diagnose regression models. Validity checks and identify outliers affecting performance are done using diagnostic plots. From the figure, there are four plots and what can be said is as follow:

  • In Residual vs Fitted plot,no pattern is exhibited from the scatter points and therefore the model seems to be a good fit to predict crop yields
  • The QQ plots checks for normality in the residuals. Most of the points within the diagonal line fall along it so its easy to assume the residuals are normally distributed.
  • In scale location, it checks for the assumption of equal variance or homoscedasticity. The line is roughly horizontal but then forms a straight line. This is an indication that there is change in variance in the residuals as the fitted values increase.
  • The Residuals vs Leverage plot checks for influential observations. The dashed line is the cooks distance and points far from it represent influential observations. For example, 702 is the point furthest from the cooks distance and thus the most influential observation.

Hypothesis Testing - (Question 3a)

compare <- function(group1, group2){
  var.test(x = group1, y = group2,
           alternative = "two.sided",
           conf.level = 0.95)
}

Example

Comparison of two groups

Example

Confidence Interval

compare3 <- function(group1, group2){
  test <- t.test(group1,group2,
                 alternative = "two.sided")
  conf <- test$confint
  return(conf)
}

Example

compare3(rnorm(20,1,0), rnorm(20,3,5))
## NULL

Hypothesis Testing

group1 <- c(23,24,32,54,56,37,87,29,32)
group2 <- c(30,39,31,26,87,49,95,30,98)
datatable(compare2(group1,group2), options = list(dom = 't'), rownames = F)

Independent Sample t.test - (Question 3b)

manual.test_independent <- function(group1, group2){
  se = sqrt(((((length(group1)-1)*var(group1))+
                (length(group2)-1)*var(group2))/(length(group1)
               +length(group2)-2))*(1/length(group1)+1/length(group2)))
  t = (mean(group1)-mean(group2))/se
  return(print(paste("The t-statistic obtained is",round(t,4))))
}

manual.test_independent(group1,group2)
## [1] "The t-statistic obtained is -1.0041"