Applied Regression HW #1

Question 4

setwd("/Users/wonny/Downloads/Applied reg")
rainfall <- read.csv(file='rainfall.csv',header=T)
groundhog <- read.table(file='groundhog.table',sep=",",header=T)
range(rainfall$WY);range(groundhog$year)

## [1] 1921 2013

## [1] 1990 2010

Separate the years Phil sees his shadow from the years he does not

YT <- groundhog[groundhog$shadow=="Y","year"] # Years Phil sees his shadow
YF <- groundhog[groundhog$shadow=="N","year"] # Years Phil does not see his shadow
YT;YF

##  [1] 1991 1992 1993 1994 1996 1998 2000 2001 2002 2003 2004 2005 2006 2008 2009
## [16] 2010

## [1] 1990 1995 1997 1999 2007

Calculate the average rainfall for each year by dividing that year's Total by 12

av_YT <- rainfall[which(rainfall$WY%in%YT),"Total"]/12
av_YF <- rainfall[which(rainfall$WY%in%YF),"Total"]/12

Make a boxplot of the average rainfall in Northern California comparing the years Phil sees his shadow versus the years he does not.

boxplot(av_YT,av_YF,names=c("Phil sees his shadow","Phil does not see his shadow"),ylab="Average rainfall",col=c("grey","darkgrey"))

Compute a 90% confidence interval for the difference between the mean rainfall in years Phil sees his shadow and years he does not.

Check whether the variances are equal or not

var(av_YT);var(av_YF)

## [1] 1.796807

## [1] 3.092972

var.test(av_YT,av_YF)

## 
##  F test to compare two variances
## 
## data:  av_YT and av_YF
## F = 0.58093, num df = 15, denom df = 4, p-value = 0.3952
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.06710904 2.21002367
## sample estimates:
## ratio of variances 
##          0.5809322

Since the p-value 0.3952 is greater than 0.05, we cannot reject H0 that the two variances are equal. There is not enough evidence of unequal variances, so we use the pooled-variance t interval.
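
The F statistic reported by var.test is the ratio of the two sample variances, and the two-sided p-value is twice the smaller tail area of the F distribution with 15 and 4 degrees of freedom. A minimal sketch that reproduces those figures from the rounded variances printed above; the names var_YT, var_YF, f_stat and p_two_sided are illustrative and not part of the original script:

```r
# Sample variances as printed above (rounded to 7 significant digits)
var_YT <- 1.796807   # years Phil sees his shadow, n = 16
var_YF <- 3.092972   # years Phil does not see his shadow, n = 5

# F statistic: ratio of the sample variances
f_stat <- var_YT / var_YF

# Two-sided p-value: twice the smaller tail of F(15, 4)
lower_tail <- pf(f_stat, df1 = 15, df2 = 4)
p_two_sided <- 2 * min(lower_tail, 1 - lower_tail)

round(f_stat, 5)       # compare with F = 0.58093 in the var.test output
round(p_two_sided, 4)  # should agree with the reported 0.3952 up to rounding
```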

t.test(av_YT,av_YF,conf.level = 0.90,var.equal = T)$conf.int

## [1] -1.6779319  0.8710985
## attr(,"conf.level")
## [1] 0.9

Interpret the interval in part 2.
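We are 90% confident that the true difference in mean rainfall (years Phil sees his shadow minus years he does not) lies between -1.68 and 0.87. That is, if the sampling were repeated many times, about 90% of the intervals constructed this way would contain the true difference. Because the interval contains 0, the data do not show a difference in mean rainfall at the 10% level.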
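
A 90% confidence level describes how often intervals built this way cover the true difference under repeated sampling. The following is a small simulation sketch of that idea. The group sizes match the data above, but the population mean and standard deviation, the seed, and the number of replications are arbitrary assumptions chosen only for illustration:

```r
set.seed(1)  # arbitrary seed, for reproducibility only

n_shadow <- 16      # group sizes as in the data above
n_no_shadow <- 5
true_diff <- 0      # hypothetical true difference in means
n_sim <- 2000       # number of simulated data sets

covered <- replicate(n_sim, {
  # Hypothetical normal populations with equal variance (NOT the rainfall data)
  x <- rnorm(n_shadow, mean = 4, sd = 1.5)
  y <- rnorm(n_no_shadow, mean = 4, sd = 1.5)
  ci <- t.test(x, y, conf.level = 0.90, var.equal = TRUE)$conf.int
  ci[1] <= true_diff && true_diff <= ci[2]
})

mean(covered)  # share of simulated intervals that contain the true difference
```

Under these assumptions the share should land near 0.90, which is what the confidence level promises about the procedure rather than about any single interval.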
At level α = 0.05, would you reject the null hypothesis that the average rainfall in Northern California during the month of February was the same in years Phil sees his shadow versus years he does not?

rain_YT_Feb <- rainfall[which(rainfall$WY%in%YT),"Feb"]
rain_YF_Feb <- rainfall[which(rainfall$WY%in%YF),"Feb"]
var.test(rain_YT_Feb,rain_YF_Feb)

## 
##  F test to compare two variances
## 
## data:  rain_YT_Feb and rain_YF_Feb
## F = 0.66525, num df = 15, denom df = 4, p-value = 0.5022
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.07684912 2.53078222
## sample estimates:
## ratio of variances 
##          0.6652476

t.test(rain_YT_Feb,rain_YF_Feb,var.equal = T)

## 
##  Two Sample t-test
## 
## data:  rain_YT_Feb and rain_YF_Feb
## t = 0.8636, df = 19, p-value = 0.3986
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.597957  8.652707
## sample estimates:
## mean of x mean of y 
##  9.779375  7.252000

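Since the p-value 0.3986 is greater than 0.05, the null hypothesis cannot be rejected at α = 0.05. There is not enough evidence that average February rainfall differs between the years Phil sees his shadow and the years he does not.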

What assumptions are you making in forming your confidence interval and in your hypothesis test?
The two samples are independent of each other and are drawn from normal distributions with equal but unknown variance.
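
With only 5 no-shadow years, an F test tends to have little power to detect unequal variances, so failing to reject it does not establish that the variances are equal. Welch's t-test, which t.test uses when var.equal = FALSE (the default), drops the equal-variance assumption. A sketch comparing the two versions follows; the vectors feb_shadow and feb_no_shadow are made-up stand-ins, since the rainfall file is not reproduced here:

```r
# Hypothetical stand-ins for rain_YT_Feb and rain_YF_Feb (NOT the real data)
feb_shadow    <- c(9.1, 12.4, 3.8, 15.2, 7.7, 10.3, 6.5, 11.9)
feb_no_shadow <- c(4.2, 13.6, 2.1, 9.8, 6.4)

pooled <- t.test(feb_shadow, feb_no_shadow, var.equal = TRUE)   # equal variances assumed
welch  <- t.test(feb_shadow, feb_no_shadow, var.equal = FALSE)  # Welch correction (R default)

# Pooled df is n1 + n2 - 2; the Welch df is estimated from the data and is never larger
c(pooled = unname(pooled$parameter), welch = unname(welch$parameter))

# Compare the two p-values; large disagreement suggests the variance assumption matters
c(pooled = pooled$p.value, welch = welch$p.value)
```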

Question 5

Hertz <- c(37.16, 14.36, 17.59, 19.73, 30.77, 26.29, 30.03, 29.02, 22.63, 39.21)
Thrifty <- c(29.49, 12.19, 15.07, 15.17, 24.52, 22.32, 25.30, 22.74, 19.35, 34.44)

Explain why this is a paired-sample problem.
The two samples are not independent: each Hertz rate is paired with the Thrifty rate for the same car type, so the analysis is based on the within-pair differences.
Use a graph to determine whether the assumption of normality is reasonable.

diff <- Hertz-Thrifty
s.diff <- sort(diff)
qqnorm(s.diff);qqline(s.diff)

The data points seem to follow the straight line, so we can conclude that the normality assumption is reasonable.

Using p-value, test at α = 0.05 whether Thrifty has a lower mean rental rate than Hertz via a t-test

t.test(diff,alternative="less", mu=0)

## 
##  One Sample t-test
## 
## data:  diff
## t = 8.3756, df = 9, p-value = 1
## alternative hypothesis: true mean is less than 0
## 95 percent confidence interval:
##      -Inf 5.631148
## sample estimates:
## mean of x 
##      4.62

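\(\mu=\mu_{H}-\mu_{T}\)
H0: \(\mu \le 0\) vs H1: \(\mu > 0\)
Because diff is Hertz minus Thrifty, Thrifty having the lower mean rate corresponds to \(\mu > 0\), so the test should use alternative="greater". The call above used alternative="less", and its p-value of nearly 1 is the complement of the one we need: with the same t = 8.3756 on 9 degrees of freedom, the p-value for H1: \(\mu > 0\) is below 0.0001. Since this is less than 0.05, we reject H0 and conclude that Thrifty has a lower mean rental rate than Hertz.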
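
Since the differences are defined as Hertz minus Thrifty, "Thrifty has a lower mean rate" corresponds to a positive mean difference, so the one-sided alternative is "greater". A sketch of the test in that direction, written as a paired call on the two rate vectors (equivalent to a one-sample test on the differences); the name res is illustrative:

```r
Hertz   <- c(37.16, 14.36, 17.59, 19.73, 30.77, 26.29, 30.03, 29.02, 22.63, 39.21)
Thrifty <- c(29.49, 12.19, 15.07, 15.17, 24.52, 22.32, 25.30, 22.74, 19.35, 34.44)

# H1: mean(Hertz - Thrifty) > 0, i.e. Thrifty is cheaper on average
res <- t.test(Hertz, Thrifty, paired = TRUE, alternative = "greater")

res$statistic  # same t statistic as the one-sample test on the differences
res$p.value    # one-sided p-value in the direction of H1
```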

Question 6

Create a boxplot of the supervisor rating Y, splitting the data based on the median of X4

P060 <- read.table(file="P060.txt",header=TRUE)
med4 <- median(P060$X4)
above_med <- subset(P060,X4>=med4)$Y # split the data based on the median of X4
below_med <- subset(P060,X4<med4)$Y
boxplot(above_med,below_med,names=c("above the median of X4","below the median of X4"),ylab="Y")

Compute the sample mean and sample standard deviation of Y in the two groups
Let the first group be the Y values whose X4 is at or above the median of X4, and let the second group be the Y values whose X4 is below the median.

# mean and sd of Y in the first group
mean(above_med);sd(above_med)

## [1] 70.46667

## [1] 9.605554

# mean and sd of Y in the second group
mean(below_med);sd(below_med)

## [1] 58.8

## [1] 11.90558

Create a histogram of Y within each group

par(mfrow=c(1,2))
hist(above_med,main="first group",xlab="Y");hist(below_med,main="second group",xlab="Y")

Compute a 90% confidence interval for the difference in supervisor performance between the two groups. What assumptions are you making?

var.test(above_med,below_med)

## 
##  F test to compare two variances
## 
## data:  above_med and below_med
## F = 0.65094, num df = 14, denom df = 14, p-value = 0.4318
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.2185412 1.9388938
## sample estimates:
## ratio of variances 
##           0.650944

t.test(above_med,below_med,conf.level = 0.90,var.equal = T)$conf.int

## [1]  4.947601 18.385732
## attr(,"conf.level")
## [1] 0.9

The two samples are independent of each other and are drawn from normal distributions with equal but unknown variance.
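
The pooled-variance interval returned by t.test can be reproduced from the group means and standard deviations computed earlier. A minimal sketch, assuming 15 observations per group (inferred from the 14 and 14 degrees of freedom in the var.test output) and using the rounded summaries as printed; the variable names are illustrative:

```r
# Summary statistics printed above
n1 <- 15; m1 <- 70.46667; s1 <- 9.605554   # first group: X4 at or above the median
n2 <- 15; m2 <- 58.8;     s2 <- 11.90558   # second group: X4 below the median

# Pooled standard deviation and standard error of the difference in means
sp <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
se <- sp * sqrt(1 / n1 + 1 / n2)

# 90% interval: difference +/- t quantile (0.95, n1 + n2 - 2 df) * standard error
dof <- n1 + n2 - 2
margin <- qt(0.95, dof) * se
ci_90 <- c(m1 - m2 - margin, m1 - m2 + margin)

round(ci_90, 3)            # compare with 4.947601 and 18.385732 from t.test
round((m1 - m2) / se, 4)   # t statistic for the difference in means
```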

At level α = 5%, test the null hypothesis that the average supervisor performance does not differ between the two groups. What assumptions are you making? What can you conclude?

t.test(above_med,below_med,conf.level = 0.95,var.equal = T)

## 
##  Two Sample t-test
## 
## data:  above_med and below_med
## t = 2.9538, df = 28, p-value = 0.006296
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.575942 19.757391
## sample estimates:
## mean of x mean of y 
##  70.46667  58.80000

The assumptions are the same as in the previous part. Since the p-value 0.006296 is less than 0.05, we reject the null hypothesis that the true difference in means is 0. There is strong evidence that average supervisor performance differs between the two groups, with the higher sample mean in the group whose X4 is at or above the median.

Jiwon Lee

9/4/2021