Problem 1

Part (b)

#Import Data Into R from Excel file
#(1) Run install.packages("readxl") if you have never done so
#(2) Save the data file to the same location as the RMarkdown file
#(3) Run the following three commands
library(readxl)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
mitchellData = read_excel("mitchell.xlsx")

Part (c)

head(mitchellData)

## # A tibble: 6 × 3
##     Obs Month  Temp
##   <dbl> <dbl> <dbl>
## 1     1     0 -5.18
## 2     2     1 -1.65
## 3     3     2  2.49
## 4     4     3 10.4 
## 5     5     4 15.0 
## 6     6     5 21.7

str(mitchellData)

## tibble [204 × 3] (S3: tbl_df/tbl/data.frame)
##  $ Obs  : num [1:204] 1 2 3 4 5 6 7 8 9 10 ...
##  $ Month: num [1:204] 0 1 2 3 4 5 6 7 8 9 ...
##  $ Temp : num [1:204] -5.18 -1.65 2.49 10.4 14.99 ...

Part (d)

plot(Temp ~ Month, data = mitchellData, pch = 1, col = 'black', xlab = 'month', ylab = 'temp')

Problem 2

Part (a)

#I would expect the relationship between score and driving distance to be negative because a longer driving distance should lead to a better (lower) score. Thus, golfers with lower scores would be expected to have longer driving distances.

Part (b)

#I would expect the relationship between score and putting average to be positive because fewer strokes result in a better score. A lower putting average should therefore go with a lower score, and a higher putting average with a higher score.

Part (c)

library(readxl)
setwd(dirname(rstudioapi::getActiveDocumentContext()$path))
golfersData = read_excel("golfers.xls")
plot(avgscore~driving, data=golfersData, pch=1, col='black', xlab="Driving Distance",ylab="Average Score")

cor(golfersData$driving,golfersData$avgscore)

## [1] -0.2654319

Part (d)

plot(avgscore~avgputts,data=golfersData,pch=19,col='black',xlab="Average Putts", ylab="Average Score")

cor(golfersData$avgputts,golfersData$avgscore)

## [1] 0.4441574

Part (e)

Yes. In part (a) I expected a greater driving distance to go with a lower average score, and the correlation of about -0.27 is negative, as expected. In part (b) I expected a positive relationship, and the correlation of about 0.44 shows that a higher putting average tends to go with a higher average score.

Part (f)

Based on the scatterplots, the relationship between average score and average putts is stronger because the points cluster more tightly around a straight line than they do in the plot of average score against driving distance. The correlations agree: 0.44 for putting is larger in magnitude than -0.27 for driving distance.
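The comparison of strength can also be checked numerically. The sketch below reuses the two correlations reported in parts (c) and (d); the names `r_driving` and `r_putts` are introduced here for illustration only.

```r
# Correlations copied from the cor() output in parts (c) and (d)
r_driving <- -0.2654319
r_putts   <-  0.4441574

# Strength depends on the magnitude of r, not on its sign
abs(r_putts) > abs(r_driving)

# Squared correlations for each pair (roughly 0.07 and 0.20)
r_squared <- round(c(driving = r_driving^2, putts = r_putts^2), 2)
r_squared
```

The sign only gives the direction of the relationship, so the two values are compared on their absolute values.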

Problem 3

Part (a)

#stores the airfare values as a vector using the function c()
y = c(631.8, 338.6, 627.9, 352.6, 669.8, 470.7, 557.8, 547.8, 569.83, 321.1, 344.7, 427.6)

y

##  [1] 631.80 338.60 627.90 352.60 669.80 470.70 557.80 547.80 569.83 321.10
## [11] 344.70 427.60

#finds the median and mean of y
median(y)

## [1] 509.25

mean(y)

## [1] 488.3525

#Generates a sequence of candidate values from 400 to 600 in steps of 0.1. The sequence is stored as yhat
yhat=seq(400, 600, by=0.1)

#create a function that computes the SAE (sum of absolute errors) for a given yhat
SAE.fun <- function(yhat){
  sum(abs(y-yhat))
}

#compute the SAE for all the candidate yhat's
SAE=sapply(yhat,SAE.fun)

#plot the SAE for each candidate yhat
#cex=0.5 makes the size of the points smaller 
plot(yhat,SAE,cex=0.5)

#determine the yhat(s) that minimize the SAE
yhat[SAE==min(SAE)]

##   [1] 470.7 471.0 471.2 471.5 471.7 472.0 472.2 472.5 472.7 473.0 473.2 473.5
##  [13] 473.7 474.0 474.2 474.5 474.7 475.0 475.2 475.5 475.7 476.0 476.2 476.5
##  [25] 476.7 477.0 477.2 477.5 477.7 478.0 478.2 478.5 478.7 479.0 479.2 479.5
##  [37] 479.7 480.0 480.2 480.5 480.7 481.0 481.2 481.5 481.7 482.0 482.2 482.5
##  [49] 482.7 483.0 483.2 483.5 483.7 484.0 484.2 484.5 484.7 485.0 485.2 485.5
##  [61] 485.7 486.0 486.2 486.5 486.7 487.0 487.2 487.5 487.7 488.0 488.2 488.5
##  [73] 488.7 489.0 489.2 489.5 489.7 490.0 490.2 490.5 490.7 491.0 491.2 491.5
##  [85] 491.7 492.0 492.2 492.5 492.7 493.0 493.2 493.5 493.7 494.0 494.2 494.5
##  [97] 494.7 495.0 495.2 495.5 495.7 496.0 496.2 496.5 496.7 497.0 497.2 497.5
## [109] 497.7 498.0 498.2 498.5 498.7 499.0 499.2 499.5 499.7 500.0 500.2 500.5
## [121] 500.7 501.0 501.2 501.5 501.7 502.0 502.2 502.5 502.7 503.0 503.2 503.5
## [133] 503.7 504.0 504.2 504.5 504.7 505.0 505.2 505.5 505.7 506.0 506.2 506.5
## [145] 506.7 507.0 507.2 507.5 507.7 508.0 508.2 508.5 508.7 509.0 509.2 509.5
## [157] 509.7 510.0 510.2 510.5 510.7 511.0 511.2 511.5 511.7 512.0 512.2 512.3
## [169] 512.5 512.7 512.8 513.0 513.2 513.3 513.5 513.7 513.8 514.0 514.2 514.3
## [181] 514.5 514.7 514.8 515.0 515.2 515.3 515.5 515.7 515.8 516.0 516.2 516.3
## [193] 516.5 516.7 516.8 517.0 517.2 517.3 517.5 517.7 517.8 518.0 518.2 518.3
## [205] 518.5 518.7 518.8 519.0 519.2 519.3 519.5 519.7 519.8 520.0 520.2 520.3
## [217] 520.5 520.7 520.8 521.0 521.2 521.3 521.5 521.7 521.8 522.0 522.2 522.3
## [229] 522.5 522.7 522.8 523.0 523.2 523.3 523.5 523.7 523.8 524.0 524.2 524.3
## [241] 524.5 524.7 524.8 525.0 525.2 525.3 525.5 525.7 525.8 526.0 526.2 526.3
## [253] 526.5 526.7 526.8 527.0 527.2 527.3 527.5 527.7 527.8 528.0 528.2 528.3
## [265] 528.5 528.7 528.8 529.0 529.2 529.3 529.5 529.7 529.8 530.0 530.2 530.3
## [277] 530.5 530.7 530.8 531.0 531.2 531.3 531.5 531.7 531.8 532.0 532.2 532.3
## [289] 532.5 532.7 532.8 533.0 533.2 533.3 533.5 533.7 533.8 534.0 534.2 534.3
## [301] 534.5 534.7 534.8 535.0 535.2 535.3 535.5 535.7 535.8 536.0 536.2 536.3
## [313] 536.5 536.7 536.8 537.0 537.2 537.3 537.5 537.7 537.8 538.0 538.2 538.3
## [325] 538.5 538.7 538.8 539.0 539.2 539.3 539.5 539.7 539.8 540.0 540.2 540.3
## [337] 540.5 540.7 540.8 541.0 541.2 541.3 541.5 541.7 541.8 542.0 542.2 542.3
## [349] 542.5 542.7 542.8 543.0 543.2 543.3 543.5 543.7 543.8 544.0 544.2 544.3
## [361] 544.5 544.7 544.8 545.0 545.2 545.3 545.5 545.7 545.8 546.0 546.2 546.3
## [373] 546.5 546.7 546.8 547.0 547.2 547.3 547.5 547.7 547.8
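
Only some grid values appear in the output above, most likely because `==` compares floating-point sums exactly, so candidates whose SAE differs from the minimum by rounding error are dropped. A minimal sketch of a tolerance-based comparison follows; the tolerance `1e-8` and the name `minimizers` are assumptions chosen for illustration, not part of the assignment.

```r
# Airfare data and candidate grid, as defined earlier
y <- c(631.8, 338.6, 627.9, 352.6, 669.8, 470.7, 557.8, 547.8, 569.83, 321.1, 344.7, 427.6)
yhat <- seq(400, 600, by = 0.1)

# Sum of absolute errors for each candidate value
SAE <- sapply(yhat, function(candidate) sum(abs(y - candidate)))

# Treat values within a small tolerance of the minimum as ties
# (the tolerance is an assumption, not a recommended default)
tol <- 1e-8
minimizers <- yhat[SAE <= min(SAE) + tol]

# Endpoints of the flat region of the SAE curve
range(minimizers)

# Check that the sample median falls inside that region
median(y) >= min(minimizers) && median(y) <= max(minimizers)
```

With an even number of observations, the SAE is flat between the two middle sorted values, which is why a whole interval of candidates ties for the minimum.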

###################################
#Sum of squared errors (SSE)      #
###################################

#create a function that computes the SSE for a given yhat
SSE.fun <- function(yhat){
  sum((y-yhat)^2)
}

#compute the SSE for all the candidate yhat's
SSE=sapply(yhat,SSE.fun)

#plot the SSE for each candidate yhat's
#cex=0.5 makes the size of the points smaller 
plot(yhat,SSE,cex=0.5)

#determine the yhat(s) that minimizes the SSE
yhat[SSE==min(SSE)]

## [1] 488.4

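I learned how to write a function that computes the SAE or SSE for a candidate yhat, apply it to a grid of candidates with sapply(), and plot the result against yhat. The median minimizes the sum of absolute deviations: every yhat between 470.7 and 547.8 gives the minimum SAE, and the median 509.25 lies in that range. The mean minimizes the sum of squared deviations: the grid minimizer 488.4 is the mean 488.3525 rounded to the nearest 0.1.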
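
The claim that the mean minimizes the squared deviations can be checked without a grid. The sketch below uses base R's `optimize()` to minimize the SSE over the interval from 400 to 600 and compares the result with `mean(y)`; the interval mirrors the grid used earlier, and the names `sse` and `opt` are introduced here for illustration.

```r
# Airfare data, as defined earlier
y <- c(631.8, 338.6, 627.9, 352.6, 669.8, 470.7, 557.8, 547.8, 569.83, 321.1, 344.7, 427.6)

# Sum of squared errors for a single candidate value
sse <- function(candidate) sum((y - candidate)^2)

# One-dimensional numerical minimization over [400, 600]
opt <- optimize(sse, interval = c(400, 600))

# The numerical minimizer should agree with the sample mean
opt$minimum
mean(y)
```

The agreement is expected: setting the derivative of the SSE with respect to the candidate, -2 * sum(y - candidate), equal to zero gives candidate = mean(y).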

Problem 4

#Part(a) One observational unit is a single book. Pages is a variable recorded for each book, not the unit itself.

#Part(b) Price is the response variable.

#Part(c) The explanatory variable is the number of pages.

#Part(d) Predicted Price = -5.13 + 0.1465(Pages)

#Part(e) 0.1

#Part(f) The residual for observation A is negative because the point lies below the line of best fit, so its observed price is less than its predicted price.
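
A residual is the observed value minus the predicted value, so a point below the fitted line has a negative residual. The sketch below uses the fitted equation from part (d); the book with 500 pages and an observed price of $60 is hypothetical, invented only to illustrate the sign, and the names `predict_price` and `residual_value` are not from the assignment.

```r
# Fitted line from part (d)
predict_price <- function(pages) -5.13 + 0.1465 * pages

# Hypothetical observation (not taken from the data set)
pages_obs <- 500
price_obs <- 60

# Predicted price on the line, then observed minus predicted
predicted <- predict_price(pages_obs)
residual_value <- price_obs - predicted

# TRUE when the point lies below the fitted line
residual_value < 0
```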

#Part(g)

-5.13+.1465*1300

## [1] 185.32

Based on the regression equation, the statement is correct: plugging 1300 pages into the regression equation gives a predicted price of $185.32.

Problem 5

##Part (a) The way the study measured hangovers may have introduced bias. People may under-report or over-report the severity of a hangover because of social norms.

##Part(b) The sample consisted of college students, 62% of whom were women and 91% of whom were Caucasian. I don’t think the sample can be used to represent a larger population because ethnicity also plays a role in hangovers. For example, some people of Asian descent tend to have more severe reactions to alcohol (often described as an alcohol flush reaction), which could make hangovers worse.

##Part(c) The two variables reported were the frequency of drinking and the score on the Hangover Symptoms Scale (HSS). The logical explanatory variable is the frequency of alcohol consumption because how often someone drinks plausibly influences their hangover symptoms, which makes the HSS score the response.

#Part(d) r measures the strength and direction of the linear relationship between the two variables. r = 0.44 implies that there is a moderate positive correlation between the two variables.

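#Part(e) "Significantly" means that the correlation was statistically significant: a correlation this large would be unlikely to arise by chance alone if there were no linear relationship in the population. This typically means that the p-value < 0.05.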
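
To illustrate how the significance of a correlation is usually judged, the sketch below computes the t statistic for testing a population correlation of zero, t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. The value r = 0.44 comes from part (d); the sample size n = 100 is a hypothetical placeholder, since the study's sample size is not given here.

```r
# Correlation reported in part (d)
r <- 0.44

# Hypothetical sample size (an assumption, not from the study)
n <- 100

# t statistic for H0: population correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

t_stat
p_value < 0.05
```

With the actual sample size substituted for `n`, the same calculation would show whether r = 0.44 meets the 0.05 threshold in the study itself.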

Homework 1

STAT 334

Nam Nguyen

2025-04-16