Final Exam (50 pts)

Instructions

You will answer the questions directly in the Rmarkdown file provided.
Please change your name in the template file.
How to insert R code chunks: You can quickly insert chunks into your R Markdown file with the keyboard shortcut Cmd + Option + I (Windows Ctrl + Alt + I).
You can knit a PDF/word/html and upload the knitted file to Canvas for submission.

Problem 1 (8 pts). The dataset called textbooks is a random sample of all the textbooks for UCLA courses. In this dataset, each textbook has two corresponding prices in the data set: one for the UCLA Bookstore (ucla_new) and one for Amazon (amaz_new). The price differences (diff = ucla_new - amaz_new) between Amazon and UCLA bookstore were also calculated for each textbook. The question is whether Amazon prices are, on average, different than those of the UCLA Bookstore for UCLA courses.

Load the dataset using the following R code, then identify whether the data are paired and whether they are in wide or long format.

textbooks <- read.csv("https://raw.githubusercontent.com/livelyjing/data/main/textbooks.csv")

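The data are paired, because each textbook has both a UCLA Bookstore price and an Amazon price. They are in wide format: one row per textbook, with the two prices in separate columns (ucla_new and amaz_new).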

Plot a histogram of the variable called diff to check the distribution of the price differences between UCLA bookstore and Amazon.

hist(textbooks$diff)

Conduct a two sided hypothesis test to check whether there is a difference in the average of Amazon price and UCLA bookstore for UCLA textbooks. (hint: please include the R code, report the test you used, the test statistic, the degree of freedom if any, the P value, and the conclusion)

t.test(textbooks$ucla_new,textbooks$amaz_new,paired = TRUE)

## 
##  Paired t-test
## 
## data:  textbooks$ucla_new and textbooks$amaz_new
## t = 7.6488, df = 72, p-value = 6.928e-11
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##   9.435636 16.087652
## sample estimates:
## mean difference 
##        12.76164

Test used: paired t-test. t-statistic = 7.6488, df = 72, p-value = 6.928e-11

The p-value (6.928e-11) is far below our significance level of 0.05, so we reject the null hypothesis of no mean difference. On average, the UCLA Bookstore price is $12.76 higher than the Amazon price.

Find out a 95% confidence interval for the average textbook price difference between UCLA bookstore and Amazon. (hint: Please report the result as a summary sentence. )

We are 95% confident that the mean price difference (UCLA Bookstore minus Amazon) is between $9.44 and $16.09.
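The interval can be reproduced from the numbers printed in the t-test output alone. The following is a minimal sketch that uses only the reported mean difference, t statistic, and degrees of freedom; the variable names are illustrative and not part of the exam template.

```r
# Values copied from the paired t-test output.
mean_diff <- 12.76164
t_stat    <- 7.6488
df        <- 72

# The standard error is implied by t = mean_diff / se.
se <- mean_diff / t_stat

# 95% CI: mean difference +/- critical t value * standard error.
t_crit <- qt(0.975, df)
ci <- mean_diff + c(-1, 1) * t_crit * se
print(round(ci, 2))
```

This is the same calculation t.test() performs internally for a paired test, which is equivalent to a one-sample t-test on the diff column.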

Problem 2 (12 pts) Go to the website https://www.stat.cmu.edu/isle/data_explorers.html, choose a dataset and download it. Pick the variables you will analyze, where the explanatory variable is categorical type with two levels, and the response variable is numerical.

Load your dataset into R.

Broadway.Show <- read.csv("C:/Users/maaudley-hinds/Downloads/Broadway Show.csv")
Broadway.Show$Type <-c("Play", "Musical")

Categorical Explanatory: Type
Numerical Response: Gross

Conduct appropriate plot(s) for the data.

boxplot(Gross ~ Type, data = Broadway.Show)

Provide a few summary statistics – sample size, mean, median, standard deviation for each group.

Musical - Mean: 562860.3, Median: 492718.5, St. Dev: 331235.2
Play - Mean: 533735.8, Median: 460753.0, St. Dev: 320264.7
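The question also asks for the sample size of each group. One way to get all four statistics per group in a single table is sketched below. The data frame shows and its values are hypothetical stand-ins so the code runs on its own; with the real data, Broadway.Show would take its place.

```r
# Hypothetical stand-in data, NOT the Broadway data set.
shows <- data.frame(
  Type  = c("Musical", "Musical", "Musical", "Play", "Play", "Play"),
  Gross = c(400, 500, 900, 300, 450, 600)
)

# Split Gross by Type, then compute n, mean, median, and sd for each group.
group_stats <- do.call(rbind, lapply(split(shows$Gross, shows$Type), function(g) {
  data.frame(n = length(g), mean = mean(g), median = median(g), sd = sd(g))
}))
print(group_stats)
```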

A description of the skew of the data based on the plot and summary statistics.

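Both groups are right-skewed: in each, the mean exceeds the median (musicals: 562860.3 vs. 492718.5; plays: 533735.8 vs. 460753.0), so a relatively small number of high-grossing weeks pull the mean upward. Musicals have a slightly higher mean and median than plays.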

Conduct appropriate two sample hypothesis test, with a formal set up of your hypotheses and statistical conclusion of your results.

t.test(Broadway.Show$Gross~Broadway.Show$Type)

## 
##  Welch Two Sample t-test
## 
## data:  Broadway.Show$Gross by Broadway.Show$Type
## t = 1.9989, df = 1995.7, p-value = 0.04575
## alternative hypothesis: true difference in means between group Musical and group Play is not equal to 0
## 95 percent confidence interval:
##    550.3418 57698.5362
## sample estimates:
## mean in group Musical    mean in group Play 
##              562860.3              533735.8

Null Hypothesis: There is no difference in mean gross weekly earnings between musicals and plays.
Alternative Hypothesis: There is a difference in mean gross weekly earnings between musicals and plays.

Since the p-value (0.04575) is below the 0.05 significance level (Welch t = 1.9989, df = 1995.7), we reject the null hypothesis and conclude that the mean gross weekly earnings of musicals and plays differ.

State the confidence interval for the mean difference and its interpretation.

We are 95% confident that the mean gross weekly earnings of musicals exceed those of plays by between 550.34 and 57698.54. The interval does not contain 0, which agrees with the test result.

Problem 3 (10 pts) Carl the farmer has three fields of tomatoes: on one he used no fertilizer, on another he used organic fertilizer, and on the third he used a chemical fertilizer. He wants to see if there is a difference in the mean weights of tomatoes from the different fields. Carl claims there is a difference in the mean weight for all tomatoes between the different fertilizing methods.

The sample data called tomato can be loaded using R code.

Create a plot illustrating the relationship between different fields and tomato weight. Describe what you see.

boxplot(tomato$weight~tomato$field)

Conduct a formal statistical analysis of the relationship between tomato weight and fertilizer usage. Summarize your findings.

anova_result <- aov(weight~field, data = tomato)

summary(anova_result)

##             Df Sum Sq Mean Sq F value  Pr(>F)   
## field        2   2092  1046.2   5.792 0.00807 **
## Residuals   27   4877   180.6                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

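The ANOVA gives F = 5.792 on 2 and 27 degrees of freedom, with p = 0.00807. Since the p-value is below the 0.05 significance level, we reject the null hypothesis that the three fields have equal mean tomato weights; at least one field mean differs from the others.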
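As a check on how the table fits together, the F statistic and p-value can be recomputed from the printed sums of squares and degrees of freedom. This is a minimal sketch using only those printed (rounded) values; the variable names are illustrative.

```r
# Values copied from the ANOVA table.
ss_field <- 2092
df_field <- 2
ss_resid <- 4877
df_resid <- 27

# Mean squares are sums of squares divided by their degrees of freedom.
ms_field <- ss_field / df_field
ms_resid <- ss_resid / df_resid

# F is the ratio of the mean squares; the p-value is the upper tail of F(2, 27).
f_stat  <- ms_field / ms_resid
p_value <- pf(f_stat, df_field, df_resid, lower.tail = FALSE)
print(c(F = f_stat, p = p_value))
```

Small differences from the printed F value come from rounding in the displayed sums of squares.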

If the result from the above question is significant, please proceed to pairwise comparisons to find out which pair(s) of groups have mean tomato weights different from each other.

tukey_result <- TukeyHSD(anova_result)

print(tukey_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = weight ~ field, data = tomato)
## 
## $field
##                                       diff        lwr       upr     p adj
## NoFertilize-ChemicalFertilizer       -20.2 -35.102349 -5.297651 0.0063840
## OrganicFertilizer-ChemicalFertilizer  -7.3 -22.202349  7.602349 0.4550443
## OrganicFertilizer-NoFertilize         12.9  -2.002349 27.802349 0.0993314

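Only the No Fertilizer and Chemical Fertilizer fields differ significantly: mean tomato weight in the No Fertilizer field is 20.2 lower (95% CI -35.10 to -5.30, adjusted p = 0.0064). The Organic vs. Chemical (adjusted p = 0.455) and Organic vs. No Fertilizer (adjusted p = 0.099) comparisons are not significant at the 0.05 level.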
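The Tukey intervals all have the same width, which can be reconstructed from the residual mean square in the ANOVA table. The sketch below assumes 10 tomatoes per field; that group size is an assumption inferred from 27 residual degrees of freedom plus 3 groups (30 observations) and the equal interval widths, not something stated in the exam.

```r
# Values copied from the ANOVA table.
mse      <- 180.6   # residual mean square
df_resid <- 27
k        <- 3       # number of fields

# ASSUMPTION: balanced design with 10 observations per field.
n_per_group <- 10

# Tukey HSD half-width for a difference of two group means.
q_crit     <- qtukey(0.95, nmeans = k, df = df_resid)
half_width <- q_crit / sqrt(2) * sqrt(mse * 2 / n_per_group)

# Interval for NoFertilize - ChemicalFertilizer (printed difference of -20.2).
interval <- -20.2 + c(-1, 1) * half_width
print(round(interval, 2))
```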

Problem 4 (10 pts) This data set comes from a sample of the members of the Donner party.

In the spring of 1846, a group of American pioneers set out for California. However, they suffered a series of setbacks and did not arrive at the Sierra Nevada mountains until October. While crossing the mountains, they became trapped by an early snowfall and had to spend the winter there.

Please load the dataset donner using the following R code. Our variables of interest are:

survived: 0 dead, 1 survived
sex: Male/Female

The aim was to assess whether there is an association between survival outcome and gender.

donner <- read.csv("https://raw.githubusercontent.com/livelyjing/data/main/donner.csv")

Please explore the variables of interest and write your R code to create a contingency table.

df = data.frame( "Survived" = donner$survived, "Sex" = donner$sex)

conTable = table(df)
print(conTable)

##         Sex
## Survived Female Male
##        0     10   32
##        1     25   23

Formally assess whether there is evidence of an association between survival and gender.

lm_sex <- lm(donner$survived~donner$sex)
summary(lm_sex)

## 
## Call:
## lm(formula = donner$survived ~ donner$sex)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7143 -0.4182  0.2857  0.5078  0.5818 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     0.71429    0.08163   8.750 1.34e-13 ***
## donner$sexMale -0.29610    0.10442  -2.836  0.00567 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4829 on 88 degrees of freedom
## Multiple R-squared:  0.08372,    Adjusted R-squared:  0.07331 
## F-statistic:  8.04 on 1 and 88 DF,  p-value: 0.005675

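The coefficient for males is -0.296 (t = -2.836, p = 0.00567), which is below 0.05, so there is evidence of an association between gender and survival. The fitted survival proportions are 0.714 for females and 0.418 for males.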

anova_result <- aov(survived~sex, data = donner)


tukey_result <- TukeyHSD(anova_result)

print(tukey_result)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = survived ~ sex, data = donner)
## 
## $sex
##                   diff        lwr       upr     p adj
## Male-Female -0.2961039 -0.5036258 -0.088582 0.0056746

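Men were less likely to survive than women: the estimated difference in survival proportion (male minus female) is -0.296, with a 95% confidence interval from -0.504 to -0.089.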
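Because both variables are categorical, a chi-square test of independence on the contingency table is the more conventional formal test of association. The sketch below rebuilds the table from the printed counts, so it runs without the CSV. Note that correct = FALSE turns off the continuity correction that chisq.test() applies to 2x2 tables by default; that choice is an assumption here, not a requirement of the exam.

```r
# Counts copied from the printed contingency table (rows: survived, cols: sex).
counts <- matrix(
  c(10, 25,   # Female: dead, survived
    32, 23),  # Male:   dead, survived
  nrow = 2,
  dimnames = list(Survived = c("0", "1"), Sex = c("Female", "Male"))
)

# Column proportions: survival rate within each sex.
print(round(prop.table(counts, margin = 2), 3))

# Chi-square test of independence without continuity correction.
chi_res <- chisq.test(counts, correct = FALSE)
print(chi_res)
```

With the data frame loaded, the same call would be chisq.test(table(donner$survived, donner$sex), correct = FALSE).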

Problem 5 (10 pts) The data set called alcohol involves a study of 18 women and 14 men and contains the following variables:

Metabol: First-pass metabolism of alcohol in the stomach (mmol/liter-hour); this is the outcome/dependent variable
Gastric: Gastric alcohol dehydrogenase activity in the stomach ($\mu$mol/min/g of tissue)
Sex: Sex of the subject
Alcohol: Whether a person is Alcoholic or not (You do not need to use this variable)

You may load the dataset from the following R code:

The research question was to study the relationship between Metabolism and two predictor variables including gastric and gender.

Please check the data and create a plot to visualize the relationship between Metabolism of alcohol and Gastric alcohol dehydrogenase activity.

# Predictor (Gastric) on the x-axis, outcome (Metabol) on the y-axis
plot(alcohol$Gastric, alcohol$Metabol)

Please perform multiple linear regression in R to study the research question and write your code below

lm_mlr_gastric <- lm(Gastric~Metabol, data = alcohol)
summary(lm_mlr_gastric)

## 
## Call:
## lm(formula = Gastric ~ Metabol, data = alcohol)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.87592 -0.44090 -0.07588  0.26909  1.21381 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.13585    0.13260   8.566 1.48e-09 ***
## Metabol      0.30004    0.03719   8.067 5.27e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5505 on 30 degrees of freedom
## Multiple R-squared:  0.6845, Adjusted R-squared:  0.674 
## F-statistic: 65.08 on 1 and 30 DF,  p-value: 5.266e-09

lm_lmr_gender <- lm(Metabol~Sex, data = alcohol)
summary(lm_lmr_gender)

## 
## Call:
## lm(formula = Metabol ~ Sex, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.8214 -1.0304 -0.4607  0.4250  8.1786 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.1000     0.5221   2.107 0.043601 *  
## SexMale       3.0214     0.7894   3.828 0.000612 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.215 on 30 degrees of freedom
## Multiple R-squared:  0.3281, Adjusted R-squared:  0.3057 
## F-statistic: 14.65 on 1 and 30 DF,  p-value: 0.0006117
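The two fits above are separate simple regressions, and the first one uses Gastric as the response. The research question calls for a single model with Metabol as the outcome and both Gastric and Sex as predictors. The following is a minimal sketch of that model. Since the loading code for alcohol is not shown, it runs on simulated data: the data frame sim and every number used to generate it are made up for illustration, so its output says nothing about the real data set.

```r
# SIMULATED stand-in data with the same column names and group sizes
# (18 women, 14 men). The generating coefficients are arbitrary.
set.seed(1)
n <- 32
sim <- data.frame(
  Gastric = runif(n, 0.8, 5),
  Sex     = factor(rep(c("Female", "Male"), times = c(18, 14)))
)
sim$Metabol <- -1 + 1.5 * sim$Gastric + 1 * (sim$Sex == "Male") + rnorm(n)

# Multiple linear regression: outcome on the left, both predictors on the right.
fit <- lm(Metabol ~ Gastric + Sex, data = sim)
print(summary(fit))

# Coefficient estimates and the proportion of variance explained (R-squared).
print(coef(fit))
print(summary(fit)$r.squared)
```

With the real data, replacing sim with alcohol in the lm() call gives the adjusted coefficient for each predictor: the Gastric row is the effect of gastric activity after adjusting for sex, and the SexMale row is the male-female difference after adjusting for gastric activity.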

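Write your predicted linear regression model.
$\widehat{\text{Metabol}} = \hat\beta_0 + \hat\beta_1\,\text{Gastric} + \hat\beta_2\,\text{SexMale}$, where SexMale is 1 for males and 0 for females, and the estimates come from a single fit of lm(Metabol ~ Gastric + Sex, data = alcohol).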
Interpret the model in term of the coefficient estimate by filling the blanks below.

We can conclude that every one unit increase of gastric alcohol dehydrogenase activity in the stomach (measured by $\mu$mol/min/g of tissue) is significantly associated with increased first-pass metabolism of alcohol in the stomach by -1.8271___ mmol/liter-hour, after adjusting for gender (p is less_ (greater/less) than 0.001).
The results from linear regression show that males have ______increased____ (increased/decreased) first-pass metabolism of alcohol in the stomach than females by 1.1_mmol/liter-hour, after adjusting for gastric alcohol dehydrogenase activity in the stomach (p value = 0.0006117____).

Find the proportion of the variance of the metabolism of alcohol explained by the multiple linear regression model 0.305____.

Bonus Question (5 pts)

Using the same dataset as problem 5, the research question was to study whether the relationship between Metabolism and gastric differs by gender.

Fit a multiple linear regression with an interaction term between gastric alcohol dehydrogenase activity in the stomach and gender.

lm_lmr_gender <- lm(Gastric~Sex, data = alcohol)
summary(lm_lmr_gender)

## 
## Call:
## lm(formula = Gastric ~ Sex, data = alcohol)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1643 -0.6750 -0.0500  0.4393  2.9357 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.5500     0.2143   7.233 4.73e-08 ***
## SexMale       0.7143     0.3240   2.205   0.0353 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9092 on 30 degrees of freedom
## Multiple R-squared:  0.1394, Adjusted R-squared:  0.1108 
## F-statistic: 4.861 on 1 and 30 DF,  p-value: 0.03528
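The fit above regresses Gastric on Sex and contains no interaction term. A model that lets the Metabol-Gastric slope differ by gender includes the product of the two predictors. The following is a minimal sketch, again on simulated data because the loading code for alcohol is not shown; the data frame sim and its generating values are made up for illustration only.

```r
# SIMULATED stand-in data; the generating coefficients are arbitrary.
set.seed(2)
n <- 32
sim <- data.frame(
  Gastric = runif(n, 0.8, 5),
  Sex     = factor(rep(c("Female", "Male"), times = c(18, 14)))
)
is_male <- as.numeric(sim$Sex == "Male")
sim$Metabol <- -0.5 + 1 * sim$Gastric + 0.5 * is_male +
  1 * sim$Gastric * is_male + rnorm(n)

# Gastric * Sex expands to Gastric + Sex + Gastric:Sex.
fit_int <- lm(Metabol ~ Gastric * Sex, data = sim)
print(summary(fit_int))
```

In the summary, the Gastric:SexMale row estimates how much the slope of Metabol on Gastric differs for males relative to females; its p-value tests whether the relationship differs by gender.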