Exercise 1

Exercise 1.1

The scatter diagram shows scores on the midterm and final in a certain course.

  1. Was the average midterm score around 25, 50, or 75? Ans: 75

  2. Was the SD of the midterm scores around 5, 10, or 20? Ans: 10, because most of the data should lie within 2 SDs of the average.

  3. Was the SD of the final scores around 5, 10, or 20? Ans: 20

  4. Which exam was harder? Ans: The final exam, because its average is lower and more students got low marks.

  5. Was there more spread in the midterm scores, or the final exam? Ans: The final exam.

  6. True or False: there was a strong positive association between midterm scores and final scores. Ans: True

Exercise 1.2

  1. (a) Would the correlation between the age of a second-hand car and its price be positive or negative? Why? (Antiques are not included) Ans: Negative. The price of a car drops as it ages, so older cars tend to have lower prices and the correlation coefficient will be negative.

(b) What about the correlation between weight and miles per gallon? Ans: Negative. Fuel consumption increases with the weight of the car, so we would expect miles per gallon to decrease as weight increases. Therefore, the correlation coefficient will be negative.

  1. For each scatter diagram below:
  1. The average of x is around? 1, 1.5, 2, 2.5, 3, 3.5, 4 Ans: 3.5

  2. Same for y? Ans: 1.5

  3. The SD of x is around? 0.25, 0.5, 1, 1.5 Ans: 1 (spread of x should be covered by around 2 standard deviations)

  4. Same for y? Ans: 1

  5. Is the correlation positive, negative or 0? Ans: For the first scatter plot, r will be positive. For the second, r will be negative.
  1. For which of the diagrams in the previous exercise is the correlation closer to 0, forgetting about signs? Ans: The first one, as it is slightly more scattered around the line.

Exercise 1.3

  1. (a)somewhat positive
  1. Nearly -1, because the total of the two must be 80k to 90k, i.e., almost constant. The more the wife makes, the less the husband makes.
  1. True or false and explain: if the correlation coefficient is 0.90, then 90% of the points are highly correlated. Ans: False. The correlation coefficient is a number between -1 and 1 inclusive, not a percentage of points. r = 0.9 means that there is a very strong positive linear association between the two quantitative variables.
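
The income answer above can be checked with a small sketch. The numbers below are hypothetical and chosen only for illustration (they are not data from the exercise): the household total is held between 80 and 90 (thousand), so the husband's income is forced down as the wife's goes up.

```r
# Hypothetical incomes in thousands of dollars (illustrative values only).
wife    <- c(20, 30, 40, 50, 60)
total   <- c(85, 80, 90, 82, 88)   # household total stays between 80 and 90
husband <- total - wife

# Nearly constant total: r should be close to -1 but not exactly -1.
r_nearly_constant <- cor(wife, husband)

# Exactly constant total: the points fall on a line with slope -1.
r_exactly_constant <- cor(wife, 85 - wife)

print(c(r_nearly_constant, r_exactly_constant))
```

The closer the total is to a constant, the closer r gets to -1.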

Exercise 1.4 (r Translations)

  1. A small data set is shown below; r = 0.76. If you switch the two columns, does this change r? Explain or calculate. X = 1, 2, 3, 4, 5; Y = 2, 3, 1, 5, 6. Ans:
x= c(1, 2, 3, 4, 5)
y= c(2, 3, 1, 5, 6)
cor(x, y)
## [1] 0.7624929
#switching x and y
cor(y,x)
## [1] 0.7624929

Therefore cor(x,y) = cor(y,x). Swapping x and y does not change the correlation coefficient because the pairwise relationship does not change.

  1. As in exercise 2, but you add 3 to each value of y instead of interchanging the columns. Ans:
cor(x,y+3)
## [1] 0.7624929

Adding 3 to each value of y does not change the correlation coefficient because adding a constant to one of the variables does not alter the pairwise relationship.

Further:

cor(x+8,y+3)
## [1] 0.7624929

Adding a constant shifts the average but leaves the SD and the deviations from the average unchanged, so shifting either variable up or down does not change the correlation coefficient.
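
One way to see why a shift cannot matter is to compute r by hand as the average of the products of the two variables in standard units. The sketch below uses the same x and y as above; the helper names `sd_n` and `standard_units` are made up for this illustration, and the SD with divisor n is an assumption (R's `sd()` divides by n - 1, but r comes out the same either way).

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 5, 6)

# SD with divisor n (hypothetical helper, not a base R function)
sd_n <- function(v) sqrt(mean((v - mean(v))^2))

# Convert to standard units: how many SDs each value is above or below the average
standard_units <- function(v) (v - mean(v)) / sd_n(v)

# r is the average of the products of the standard units
r_manual <- mean(standard_units(x) * standard_units(y))

# The manual value agrees with cor(), and shifting y leaves its standard units unchanged
same_r <- isTRUE(all.equal(r_manual, cor(x, y)))
shift_invariant <- isTRUE(all.equal(standard_units(y), standard_units(y + 3)))
print(c(same_r, shift_invariant))
```

Because y and y + 3 have identical standard units, every product in the average is identical, and so is r.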

Exercise 3

Exercise 3.1 Domain Knowledge

  1. What is the AQI? Ans: The air quality index (AQI) is an index for reporting daily and hourly air quality.

  2. What does it measure? Ans: It measures how clean or polluted the air is in areas across NSW.

  3. Why is it important? Ans: It is important because the AQI provides a numerical measure of how clean or polluted the air is. If the AQI is high, we can take appropriate action.

Exercise 3.2 Source Data

library(readxl)
air= read_excel("Data/AQI_July2015.xls")
dim(air) 
## [1] 31  3
str(air)
## Classes 'tbl_df', 'tbl' and 'data.frame':    31 obs. of  3 variables:
##  $ Date       : chr  "01/07/2015" "02/07/2015" "03/07/2015" "04/07/2015" ...
##  $ SydneyCEAQI: num  99 32 70 74 95 71 31 58 108 82 ...
##  $ SydneyNWAQI: num  92 44 82 96 100 98 65 71 74 67 ...
head(air)
## # A tibble: 6 x 3
##   Date       SydneyCEAQI SydneyNWAQI
##   <chr>            <dbl>       <dbl>
## 1 01/07/2015         99.         92.
## 2 02/07/2015         32.         44.
## 3 03/07/2015         70.         82.
## 4 04/07/2015         74.         96.
## 5 05/07/2015         95.        100.
## 6 06/07/2015         71.         98.

There are 31 rows and 3 columns (variables). These variables are “Date”, “SydneyCEAQI” and “SydneyNWAQI”.
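
Note that `str(air)` reports Date as a character column. If real dates are wanted (for example, to sort or plot by day), a minimal sketch is shown below. It assumes the strings are in day/month/year order, which is a guess based on the file name and the 31 rows; `date_chr` is a hypothetical stand-in holding the first three values printed above.

```r
# First three Date values from the output above (stand-in for air$Date)
date_chr <- c("01/07/2015", "02/07/2015", "03/07/2015")

# Assumed format: day/month/year
date_parsed <- as.Date(date_chr, format = "%d/%m/%Y")

str(date_parsed)
range(date_parsed)
```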

Exercise 3.3 Explore univariate

date = air$Date
CE = air$SydneyCEAQI
NW = air$SydneyNWAQI
  1. What day has the worst air quality?
date[CE == max(CE)] #worst AQI for CE region
## [1] "09/07/2015"
date[NW == max(NW)] #worst AQI for NW region
## [1] "05/07/2015"

Which region has had the best air quality?

summary(CE) #give summary statistic for CE
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   35.50   41.00   50.77   60.50  108.00
summary(NW) #give summary statistic for NW
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.00   38.50   54.00   56.13   67.00  100.00

The CE region has the best air quality because it has a lower mean and median AQI than the NW region (a lower AQI means cleaner air).

Exercise 3.5 Explore bivariate

  1. Produce a scatter plot and correlation. What does this tell you?
plot(CE, NW)

cor(CE, NW)
## [1] 0.757917

r = 0.757917 shows that there is a strong positive linear relationship between the AQI in the CE and NW regions.

  1. Find the regression line for NW Sydney regressed on CE Sydney. Plot the regression line on the scatter plot.
L = lm(NW ~ CE) # this will run the simple linear regression model by using NW as the dependent variable and CE as the independent variable.
L$coeff #gives the estimated intercept and slope of the fitted regression line
## (Intercept)          CE 
##  19.8873954   0.7137806

The fitted regression line is NW_hat = 19.887 + 0.714*CE, where NW_hat is the predicted value of NW.

plot(CE, NW)
abline(lm(NW ~ CE)) #plot the regression line on the scatter plot
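
The coefficients returned by `lm()` can be reproduced by hand: the slope is r times SD(NW)/SD(CE), and the line passes through the point of averages. The sketch below is only an illustration and uses the first ten rows shown in the `str(air)` output, not the full 31 days, so its coefficients will differ from the ones above; the variable names `slope_manual` and `intercept_manual` are made up here.

```r
# First ten observations from the str(air) output (a subset, for illustration only)
CE <- c(99, 32, 70, 74, 95, 71, 31, 58, 108, 82)
NW <- c(92, 44, 82, 96, 100, 98, 65, 71, 74, 67)

# Slope = r * SD(y) / SD(x); intercept makes the line pass through the point of averages
r <- cor(CE, NW)
slope_manual <- r * sd(NW) / sd(CE)
intercept_manual <- mean(NW) - slope_manual * mean(CE)

# Compare with the coefficients from lm()
fit <- lm(NW ~ CE)
print(rbind(manual = c(intercept_manual, slope_manual),
            lm     = unname(coef(fit))))
```

The same relationship can be checked against the full-data output: 56.13 - 0.7138 * 50.77 is about 19.89, which agrees with the reported intercept up to rounding of the means.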

  1. Produce a residual plot. What does this tell you? (remember you want to see no pattern)
plot(CE, L$residuals, xlab = "CE airquality", ylab = "Residuals")
abline(h = 0, col = "blue")

The residual plot does not exhibit any pattern, i.e., it is a random scatter around zero, so it does not appear to violate the linearity or constant variance assumptions of a linear model. In other words, a linear model is appropriate.
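
It helps to know what a residual plot can and cannot show. For a least-squares line with an intercept, the residuals always sum to zero and are uncorrelated with the predictor, so a straight-line trend can never appear; what we look for instead is curvature or a change in spread. A minimal sketch, again using only the first ten rows from the `str(air)` output as an illustrative subset:

```r
# First ten observations from the str(air) output (a subset, for illustration only)
CE <- c(99, 32, 70, 74, 95, 71, 31, 58, 108, 82)
NW <- c(92, 44, 82, 96, 100, 98, 65, 71, 74, 67)

fit <- lm(NW ~ CE)
res <- residuals(fit)

# Both quantities are zero up to floating-point error, by construction of least squares
sum_res <- sum(res)
cor_res <- cor(CE, res)
print(c(sum_res, cor_res))
```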

  1. Could you predict the air quality in one region from the other? If prediction is valid, predict the air quality in NW Sydney on a day when the air quality in CE Sydney is 40. Predict the air quality in NW Sydney on a day when the air quality in CE Sydney is 120.

Ans: Yes, the linear model is appropriate as indicated by the residual plot. We can use the fitted line: NW_hat = 19.887 + 0.714*CE to predict the air quality in one region from the other.

summary(CE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   35.50   41.00   50.77   60.50  108.00

Prediction appears to be valid because CE = 40 is within the range of the data used to establish the regression line (interpolation).

AQI_CE = 40
yintc = unname(L$coeff[1])  #create a variable for the y-intercept, drop name
slope = unname(L$coeff[2])  #create a variable for the slope, drop name
pred = yintc + slope * AQI_CE  #predict AQI in NW region when CE = 40
c(yintc, slope, pred)              
## [1] 19.8873954  0.7137806 48.4386214

The predicted value of NW is 48.4386 when CE = 40.
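
As an alternative to computing the prediction from the coefficients, base R's `predict()` gives the same result when the model is fitted from a data frame. The sketch below uses a hypothetical data frame `aqi` built from the first ten rows of the `str(air)` output, so its predicted value will not match the full-data answer above; it only shows that the two approaches agree.

```r
# Hypothetical data frame holding the first ten observations (illustrative subset)
aqi <- data.frame(
  CE = c(99, 32, 70, 74, 95, 71, 31, 58, 108, 82),
  NW = c(92, 44, 82, 96, 100, 98, 65, 71, 74, 67)
)

fit <- lm(NW ~ CE, data = aqi)

# Manual prediction: intercept + slope * 40
pred_manual <- unname(coef(fit)[1] + coef(fit)[2] * 40)

# Same prediction via predict(); newdata must use the predictor's column name
pred_builtin <- unname(predict(fit, newdata = data.frame(CE = 40)))

print(c(pred_manual, pred_builtin))
```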

summary(CE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   35.50   41.00   50.77   60.50  108.00

Prediction is not valid when CE = 120 because it is outside the range of the data used to establish the regression model (extrapolation). Extrapolation is dangerous because the relationship might not be linear outside the range of the data used to build the regression model.
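
The interpolation check can be built into the prediction step. The sketch below defines a hypothetical helper, `predict_nw`, that uses the coefficients and the CE range (30 to 108) reported above and returns NA instead of extrapolating. Returning NA is a design choice made for this illustration, not something the exercise prescribes.

```r
# Coefficients and observed CE range taken from the output above
intercept <- 19.8873954
slope     <- 0.7137806
ce_range  <- c(30, 108)

# Hypothetical helper: predict NW only when CE lies inside the observed range
predict_nw <- function(ce) {
  inside <- ce >= ce_range[1] & ce <= ce_range[2]
  ifelse(inside, intercept + slope * ce, NA_real_)
}

# CE = 40 is interpolation; CE = 120 is extrapolation and yields NA
print(predict_nw(c(40, 120)))
```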