The dataset “airquality” contains information on air quality measurements in New York City during 1973. The dataset includes six variables, two of which record the day and month of the measurement. Use the following R code to load the data.
data(airquality)
(a) Choose the appropriate graph for the variable “Wind” and plot the data.
options(scipen = 8) # Discourages scientific notation in printed and plotted numbers
hist(airquality$Wind, main = "Histogram of Wind in NYC in 1973",
     xlab = "Wind (mph)", ylab = "Frequency")
(b) Describe the shape of the distribution. What does that suggest about the mean and median for this variable?
The histogram is bimodal with two peaks (one from 6-8 and one from 10-12), and skews slightly to the right (with a longer “tail” trailing off to the right). Because of the influence of the greater values on the right, the mean is probably greater than the median for this variable.
(c) Provide the mean and median for the “Wind” variable and discuss whether your results are consistent (or not) with your response to (b).
As shown below, the mean is 9.957516 and the median is 9.7. This is consistent with (b): the median is resistant to outliers and skewness, whereas the mean is pulled toward the longer right tail, which makes it slightly greater than the median.
mean(airquality$Wind)
## [1] 9.957516
median(airquality$Wind)
## [1] 9.7
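To make the resistance of the median concrete, here is a minimal sketch using a small made-up vector. The values in `x` and `x_outlier` are hypothetical and are not part of the airquality data. Replacing the largest value with an extreme one moves the mean but leaves the median where it was.

```r
# Hypothetical toy data, used only to illustrate the idea
x <- c(2, 4, 6, 8, 10)
x_outlier <- c(2, 4, 6, 8, 100)  # same data, largest value replaced by an extreme one

mean_shift <- mean(x_outlier) - mean(x)        # the mean moves toward the extreme value
median_shift <- median(x_outlier) - median(x)  # the median stays put

print(c(mean_shift = mean_shift, median_shift = median_shift))
```

The same logic explains why a right-skewed variable such as Wind tends to have a mean above its median.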
Using the same “airquality” dataset, create side-by-side boxplots of the “Wind” variable for the months of May and June. The easiest way to do this is to first subset the data to include only May and June. This can be done using the following code, which also generates the boxplots:
# Create a new dataset, "airquality2", that includes only May and June
airquality2 <- airquality[airquality$Month %in% c(5, 6), ]
boxplot(Wind ~ Month, data = airquality2, main = "Comparison of Wind Between May and June")
tapply(airquality2$Wind, airquality2$Month, summary)
## $`5`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    5.70    8.90   11.50   11.62   14.05   20.10
##
## $`6`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.70    8.00    9.70   10.27   11.50   20.70
Describe what you see in the boxplots of the two months.
Overall, the boxplot for May is skewed right, with a longer upper whisker, indicating that the greatest 25% of the data points are more widely dispersed. The boxplot for June is roughly symmetric, with outliers representing both the minimum and maximum values. With the outliers included, June has a wider spread (a range of 19.0 versus 14.4 for May); if we set the outliers aside, however, May’s spread is greater (an IQR of 5.15 versus 3.5 for June).
The maximum value is nearly the same for both months (20.1 in May and 20.7 in June); in June, however, the maximum value is an outlier. In June, the minimum wind is lower (1.7 versus 5.7), but again, it is an outlier.
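The outlier claims above can be checked numerically rather than read off the plot. The sketch below assumes the default 1.5 × IQR rule that `boxplot()` uses, and relies on `boxplot.stats()`, whose `$out` component holds the values drawn beyond the whiskers. The names `may_june` and `outliers_by_month` are introduced here for illustration only.

```r
data(airquality)
may_june <- airquality[airquality$Month %in% c(5, 6), ]

# boxplot.stats() applies the same 1.5 * IQR rule that boxplot() uses;
# $out contains the observations that fall beyond the whiskers
outliers_by_month <- tapply(may_june$Wind, may_june$Month,
                            function(w) boxplot.stats(w)$out)
print(outliers_by_month)
```

If the description above is right, June should list its minimum and maximum as flagged values, and May should list none.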
Load the dataset “USArrests,” which contains crime statistics by state in the United States. The data include four variables: arrests per 100,000 residents for murder, assault, and rape, and the percentage of the population living in urban areas. Use the following R code to load the data.
data("USArrests")
(a) Make a scatterplot of the variable “Assault” vs “UrbanPop.” Clearly label the axes and give your plot a title. Describe the relationship between the two variables in terms of form, direction, and strength.
# Base R
plot(Assault ~ UrbanPop, data = USArrests,
     main = "Number of Arrests for Assault vs. Urban Population",
     xlab = "Urban population (%)", ylab = "Assault arrests (per 100,000)")
# Correlation
cor(USArrests$Assault,
USArrests$UrbanPop)
## [1] 0.2588717
cor(USArrests$UrbanPop,USArrests$Assault)
## [1] 0.2588717
cor(USArrests$UrbanPop,USArrests$Assault^3)
## [1] 0.1433171
(b) Compute the correlation \(r\) between the number of arrests for assault and the urban population. Then, compute the correlation between urban population and the number of arrests for assault. Explain why your answers are either the same or different.
Correlation measures the direction and strength of the linear relationship between two quantitative variables. The correlations are the same (both 0.2588717), because correlation makes no use of the distinction between explanatory and response variables. It makes no difference which variable you call x and which you call y in calculating the correlation.
(c) How would the correlation change if the number of arrests for assault in the data all increased by \(100\) and the urban population also increased by \(200\)?
The correlation would not change at all. If you add a constant to every value in the dataset, all of the points will shift their position in the same way (up 100 and to the right 200), meaning the point cloud will move but the strength and direction stay the same. The overall shape of the distribution will remain unchanged.
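A quick way to confirm this is to add the constants and recompute. This is a minimal sketch; the names `r_original` and `r_shifted` are introduced here for illustration, and `all.equal()` is used so that the comparison tolerates floating-point rounding.

```r
data("USArrests")

r_original <- cor(USArrests$Assault, USArrests$UrbanPop)

# Shift every Assault value up by 100 and every UrbanPop value up by 200
r_shifted <- cor(USArrests$Assault + 100, USArrests$UrbanPop + 200)

# TRUE if the two correlations agree up to numerical tolerance
print(all.equal(r_original, r_shifted))
```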
(d) How would the correlation between “Assault” and “UrbanPop” change if the number of assaults were cubed (that is, raised to the third power)? How is this scenario different from (c)?
If the number of assaults were cubed, the correlation would decrease. I tested this with cor(USArrests$UrbanPop, USArrests$Assault^3) in the code above, which produced 0.1433171. Cubing a variable is a non-linear transformation: it stretches the data, with larger values becoming much larger, so the relationship with UrbanPop becomes less linear. Scenario (c), by contrast, was a linear transformation, which leaves the correlation unchanged.
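The contrast between a linear and a non-linear transformation can be sketched side by side. Multiplying by 3 is used here as an assumed example of a linear change (it is not part of the original question); the variable names are illustrative.

```r
data("USArrests")

r_original <- cor(USArrests$UrbanPop, USArrests$Assault)

# Linear transformation: multiplying by a positive constant rescales the axis only
r_scaled <- cor(USArrests$UrbanPop, 3 * USArrests$Assault)

# Non-linear transformation: cubing stretches large values far more than small ones
r_cubed <- cor(USArrests$UrbanPop, USArrests$Assault^3)

print(c(original = r_original, scaled = r_scaled, cubed = r_cubed))
```

The scaled correlation should match the original, while the cubed one should come out lower, in line with the value reported above.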
Consider the same “USArrests” dataset.
(a) Give the five number summary (with the mean) for the “Assault” variable. What do the values of the mean and median suggest about the distributional shape of this variable?
summary(USArrests$Assault)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    45.0   109.0   159.0   170.8   249.0   337.0
The mean is greater than the median, which suggests that the data are positively skewed (right-skewed). Most data points are clustered at the lower end, but there are a few large values pulling the mean up. The median is robust (extreme values barely move it), whereas the mean is sensitive to extreme values.
(b) Compute the IQR for “Assault.” Based on the IQR, are there any values that are potential outliers?
The IQR is Q3 - Q1. That’s 249 minus 109, which is 140. That means that the middle 50% of the data falls within the range of 109 to 249. Using the 1.5 × IQR rule for outliers, I multiply 140 by 1.5, getting 210.
When I subtract 210 from Q1 (109), I see that the lower bound is -101. When I add 210 to Q3 (249), I see that the upper bound is 459.
When I look at the dataset, I see that the minimum value is 45 (North Dakota) and the maximum value is 337 (North Carolina). In other words, all values are within the range of -101 to 459, so there are no outliers based on this method.
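The arithmetic above can be reproduced in R. This is a minimal sketch that assumes R's default quantile method (the one `summary()` uses); the names `lower_fence`, `upper_fence`, and `potential_outliers` are introduced here for illustration.

```r
data("USArrests")
assault <- USArrests$Assault

# Quartiles with R's default quantile method; unname() drops the "25%"/"75%" labels
q1 <- unname(quantile(assault, 0.25))
q3 <- unname(quantile(assault, 0.75))
iqr <- q3 - q1  # equivalent to IQR(assault)

# 1.5 * IQR rule: anything outside these fences is a potential outlier
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
potential_outliers <- assault[assault < lower_fence | assault > upper_fence]

print(c(IQR = iqr, lower = lower_fence, upper = upper_fence))
print(potential_outliers)
```

An empty result for `potential_outliers` corresponds to the conclusion that no values fall outside the fences.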
Consider the scatterplot of the number of arrests for assault against the urban population from Q3a. Based on the form and strength (correlation) of the plot, do you think the relationship between the two variables would be suitable for prediction? Why or why not?
No, the relationship between the two variables is NOT suitable for prediction. It is more appropriate to make a prediction when the correlation is stronger, that is, closer to +1 (in this case, given that the direction is positive). The correlation between these variables is 0.2588717, which is considered relatively weak. If we were to draw a line of best fit on the scatterplot, we would see data points a considerable distance both above and below that line, indicating that it is not appropriate for making predictions.
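One way to quantify how little a fitted line would explain is the coefficient of determination, which for a simple linear regression equals the squared correlation. With r = 0.2588717, that is roughly 0.067, or about 7% of the variation in assault arrests. The sketch below assumes a simple least-squares fit with `lm()`; the object names are illustrative.

```r
data("USArrests")

# Least-squares line predicting Assault from UrbanPop
fit <- lm(Assault ~ UrbanPop, data = USArrests)

r_squared <- summary(fit)$r.squared  # share of variation in Assault explained by the line
residual_sd <- summary(fit)$sigma    # typical size of a prediction error, in Assault units

print(c(r_squared = r_squared, residual_sd = residual_sd))
```

A small `r_squared` together with a large `residual_sd` relative to the spread of Assault supports the conclusion that predictions from this line would be unreliable.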