The scatter diagram shows scores on the midterm and final in a certain course.
Was the average midterm score around 25, 50, or 75? Ans: 75
Was the SD of the midterm scores around 5, 10, or 20? Ans: 10, because about 2 SDs on either side of the average should cover most of the data.
Was the SD of the final scores around 5, 10, or 20? Ans: 20
Which exam was harder? Ans: The final exam, because its average is lower and more students received low marks.
Was there more spread in the midterm scores or the final exam scores? Ans: The final exam scores.
True or False: there was a strong positive association between midterm scores and final scores. Ans: True
(b) What about the correlation between weight and miles per gallon? Ans: Negative. Fuel consumption increases with the weight of the car, so we would expect miles per gallon to decrease as weight increases. Therefore, the correlation coefficient will be negative.
The average of x is around? 1, 1.5, 2, 2.5, 3, 3.5, 4 Ans: 3.5
Same for y? Ans: 1.5
The SD of x is around? 0.25, 0.5, 1, 1.5 Ans: 1 (about 2 SDs on either side of the average should cover most of the spread of x)
Same for y? Ans: 1
x= c(1, 2, 3, 4, 5)
y= c(2, 3, 1, 5, 6)
cor(x, y)
## [1] 0.7624929
#switching x and y
cor(y,x)
## [1] 0.7624929
Therefore cor(x,y) = cor(y,x): swapping x and y does not change the correlation coefficient because the pairwise relationship does not change.
cor(x,y+3)
## [1] 0.7624929
Adding 3 to each value of y does not change the correlation coefficient because adding a constant to one of the variables does not alter the pairwise relationship.
Further:
cor(x+8,y+3)
## [1] 0.7624929
Shifting either variable up or down by a constant changes its average but not its SD, and each value's deviation from the average stays the same, so the correlation coefficient does not change.
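One way to see why shifts leave the correlation unchanged is to compute it from standard units. The following is a minimal sketch using the same x and y as above; the helper name `standard_units` is made up for this illustration and is not a base R function.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 5, 6)

# Hypothetical helper: convert a vector to standard units
# (subtract the mean, then divide by the sample SD).
standard_units <- function(v) (v - mean(v)) / sd(v)

# r is the sum of the products of standard units divided by n - 1
# (n - 1 because sd() returns the sample SD).
r_manual <- sum(standard_units(x) * standard_units(y)) / (length(x) - 1)

# Adding 3 to y moves its mean but not its deviations from the mean,
# so the standard units (and therefore r) are unchanged.
same_units <- all.equal(standard_units(y), standard_units(y + 3))

r_manual                        # should agree with cor(x, y)
same_units                      # TRUE if the shift left the standard units alone
all.equal(r_manual, cor(x, y))
```

Because the shift cancels when the mean is subtracted, the same argument applies to x + 8 or to any other constant.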
What is the AQI? Ans: The air quality index (AQI) is an index for reporting daily and hourly air quality.
What does it measure? Ans: It measures how clean or polluted the air is in areas across NSW.
Why is it important? Ans: It is important because the AQI provides a numerical measure of how clean or polluted the air is. If the AQI is high, we can take appropriate action.
library(readxl)
air= read_excel("Data/AQI_July2015.xls")
dim(air)
## [1] 31 3
str(air)
## Classes 'tbl_df', 'tbl' and 'data.frame': 31 obs. of 3 variables:
## $ Date : chr "01/07/2015" "02/07/2015" "03/07/2015" "04/07/2015" ...
## $ SydneyCEAQI: num 99 32 70 74 95 71 31 58 108 82 ...
## $ SydneyNWAQI: num 92 44 82 96 100 98 65 71 74 67 ...
head(air)
## # A tibble: 6 x 3
## Date SydneyCEAQI SydneyNWAQI
## <chr> <dbl> <dbl>
## 1 01/07/2015 99. 92.
## 2 02/07/2015 32. 44.
## 3 03/07/2015 70. 82.
## 4 04/07/2015 74. 96.
## 5 05/07/2015 95. 100.
## 6 06/07/2015 71. 98.
There are 31 rows and 3 columns (variables). These variables are “Date”, “SydneyCEAQI”, and “SydneyNWAQI”.
date = air$Date
CE = air$SydneyCEAQI
NW = air$SydneyNWAQI
date[CE == max(CE)] #worst AQI for CE region
## [1] "09/07/2015"
date[NW == max(NW)] #worst AQI for NW region
## [1] "05/07/2015"
Which region has had the best air quality?
summary(CE) #give summary statistic for CE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 35.50 41.00 50.77 60.50 108.00
summary(NW) #give summary statistic for NW
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33.00 38.50 54.00 56.13 67.00 100.00
The CE region had the better air quality because its mean and median AQI are both lower than those of the NW region.
plot(CE, NW)
cor(CE, NW)
## [1] 0.757917
r = 0.757917 shows that there is a strong positive linear relationship between the AQI in the CE and NW regions.
L = lm(NW ~ CE) # fit a simple linear regression with NW as the dependent variable and CE as the independent variable
L$coeff #gives the estimated intercept and slope of the fitted regression line
## (Intercept) CE
## 19.8873954 0.7137806
The fitted regression line is NW_hat = 19.887 + 0.714*CE, where NW_hat is the predicted value of NW.
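The coefficients reported by lm() can also be recovered from the correlation, the SDs, and the averages: the slope is r × SD(y) / SD(x), and the line passes through the point of averages. The sketch below checks this on the small x and y vectors from the correlation exercise rather than on the AQI data, since the spreadsheet itself is not reproduced here; the variable names are illustrative only.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 5, 6)

# Slope of the least-squares line: r * SD(y) / SD(x)
slope_manual <- cor(x, y) * sd(y) / sd(x)

# The line passes through (mean(x), mean(y)), which fixes the intercept
intercept_manual <- mean(y) - slope_manual * mean(x)

# Compare with the coefficients estimated by lm()
fit <- lm(y ~ x)
c(intercept_manual, slope_manual)
unname(coef(fit))
```

The same two formulas applied to CE and NW should reproduce the intercept and slope shown in L$coeff.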
plot(CE, NW)
abline(lm(NW ~ CE)) #plot the regression line on the scatter plot
plot(CE, L$residuals, xlab = "CE airquality", ylab = "Residuals")
abline(h = 0, col = "blue")
The residual plot does not exhibit any pattern, i.e., it is a random scatter around zero, so it does not appear to violate the linearity or constant-variance assumptions of a linear model. In other words, a linear model is appropriate.
Ans: Yes, the linear model is appropriate, as indicated by the residual plot. We can use the fitted line NW_hat = 19.887 + 0.714*CE to predict the air quality in the NW region from the air quality in the CE region.
summary(CE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 35.50 41.00 50.77 60.50 108.00
The prediction appears to be valid because CE = 40 is within the range of the data used to fit the regression line (interpolation).
AQI_CE = 40
yintc = unname(L$coeff[1]) #create a variable for the y-intercept, drop name
slope = unname(L$coeff[2]) #create a variable for the slope, drop name
pred = yintc + slope * AQI_CE #predict AQI in NW region when CE = 40
c(yintc, slope, pred)
## [1] 19.8873954 0.7137806 48.4386214
The predicted value of NW is 48.4386 when CE = 40.
summary(CE)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 35.50 41.00 50.77 60.50 108.00
The prediction is not valid when CE = 120 because this value is outside the range of the data used to fit the regression model (extrapolation). Extrapolation is dangerous because the relationship might not be linear outside the range of the data used to build the regression model.
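The range check behind the interpolation and extrapolation answers can be made explicit in code. The following is a rough sketch on the small x and y vectors from the correlation exercise, not on the AQI data; `in_range` is a hypothetical helper written for this example, and predict() is used in place of the manual intercept-plus-slope calculation.

```r
x <- c(1, 2, 3, 4, 5)
y <- c(2, 3, 1, 5, 6)
fit <- lm(y ~ x)

# Hypothetical helper: TRUE when new_x lies inside the observed range of the predictor
in_range <- function(new_x, observed) {
  new_x >= min(observed) & new_x <= max(observed)
}

# One value inside the observed range (2.5) and one far outside it (12)
new_x <- c(2.5, 12)

# predict() evaluates the fitted line at the new predictor values
pred <- unname(predict(fit, newdata = data.frame(x = new_x)))

# Flag which predictions are interpolations and which are extrapolations
data.frame(x = new_x, prediction = pred, interpolation = in_range(new_x, x))
```

With the AQI vectors, the same check gives TRUE for CE = 40 and FALSE for CE = 120, since the observed CE values run from 30 to 108.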