---
title: "Computer Project 1: Regression"
author: "Abigail Starkey"
date: "2026-06-17"
output: pdf_document
---
The airquality data set has daily air quality measurements in New York, May to September 1973. Here is a plot of ozone levels (in ppb) as a function of daily high temperature (degrees F).
plot(airquality$Temp, airquality$Ozone, col='blue', xlab="Temperature", ylab="Pollution")
Based on the graph, would you say the correlation is positive or negative? Is the magnitude close to 1, 0 or something in between?
The correlation is positive because the points trend upward as temperature increases. The magnitude is somewhere in between: the points are too spread out for it to be close to 1, but there is clearly a relationship between the two variables, so it is not close to 0 either.
Here is the correlation.
cor(airquality$Temp, airquality$Ozone, use='complete.obs')
## [1] 0.6983603
What does the correlation tell you about the association between temperature and ozone? Is it positive or negative? Is it a strong or weak association?
The association is positive: as the temperature goes up, so do the ozone levels. It is moderately strong. There is a clear upward trend, but the points are somewhat spread out rather than tightly clustered around a line, which is consistent with a correlation of 0.6983603, or about 0.70. I found this value in R with the command
cor(airquality$Temp, airquality$Ozone, use='complete.obs')
which gave 0.6983603.
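As a quick check on what use='complete.obs' does, the sketch below drops the days where either value is missing and computes the correlation on what is left. The names `ok`, `r_manual`, and `r_builtin` are my own and are not part of the assignment.

```r
# Keep only the days where both Temp and Ozone were recorded
ok <- complete.cases(airquality$Temp, airquality$Ozone)

# Correlation computed on the complete rows only
r_manual <- cor(airquality$Temp[ok], airquality$Ozone[ok])

# Correlation as computed in the report
r_builtin <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")

sum(ok)                         # number of days actually used
all.equal(r_manual, r_builtin)  # TRUE if the two approaches agree
```

If the two values agree, the reported correlation describes only the days with an ozone reading, not all 153 days in the data set.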
Here is a calculation of the regression line.
model <- lm(airquality$Ozone ~ airquality$Temp)
summary(model)
##
## Call:
## lm(formula = airquality$Ozone ~ airquality$Temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.729 -17.409 -0.587 11.306 118.271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -146.9955 18.2872 -8.038 9.37e-13 ***
## airquality$Temp 2.4287 0.2331 10.418 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23.71 on 114 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.4877, Adjusted R-squared: 0.4832
## F-statistic: 108.5 on 1 and 114 DF, p-value: < 2.2e-16
Now plot the data with the regression line.
plot(airquality$Temp, airquality$Ozone, col='blue', xlab="Temperature", ylab="Pollution")
abline(model, col='green')
Use the output to write down the equation for the regression line. Does the data seem to fit the line well or would it be better to use a quadratic or some other curve?
To find the equation, I fit the model in R as follows:
model <- lm(Ozone ~ Temp, data = airquality)
summary(model)
This gave the same output as the regression summary on page 3 of the PDF.
Then I entered coef(model), which gave an intercept of -146.9955 and a Temp coefficient of 2.4287.
Then I added the regression line to the plot:
plot(airquality$Temp, airquality$Ozone, col = "blue", xlab = "Temperature", ylab = "Ozone")
abline(model, col = "green")
This gave a picture of the graph with the regression line.
Then, to calculate the RMS error (residual standard error), I entered sigma(model) in R and got 23.71, the typical size of the prediction errors.
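The value from sigma(model) is the residual standard error, which is close to, but not exactly, the root-mean-square of the residuals: it divides by the degrees of freedom (n - 2) instead of n. A minimal sketch of the difference, with `res`, `n`, `rse`, and `rms` as names I made up for illustration:

```r
# Fit the same temperature model used in the report
model <- lm(Ozone ~ Temp, data = airquality)

res <- residuals(model)  # one residual per day with complete data
n   <- length(res)

# Residual standard error: divides by the degrees of freedom, n - 2
rse <- sqrt(sum(res^2) / (n - 2))

# Plain root-mean-square of the residuals: divides by n
rms <- sqrt(mean(res^2))

c(sigma = sigma(model), rse = rse, rms = rms)
```

With over a hundred observations the two numbers should be nearly the same, so reading 23.71 as the typical prediction error is reasonable.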
Then, to get R-squared, I entered summary(model)$r.squared in R, which gave 0.4877. About 49% of the variation in ozone is explained by temperature.
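For a regression with a single predictor, R-squared is the square of the correlation, so the 0.4877 here should match 0.6983603 squared. A short sketch to confirm this, refitting the model so the block runs on its own (the names `r` and `r2` are mine):

```r
# Correlation between temperature and ozone on complete rows
r <- cor(airquality$Temp, airquality$Ozone, use = "complete.obs")

# R-squared from the simple linear regression
model <- lm(Ozone ~ Temp, data = airquality)
r2 <- summary(model)$r.squared

c(r_squared_from_cor = r^2, r_squared_from_lm = r2)
```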
Regression equation: y = −146.9955 + 2.4287x, where y is ozone (ppb) and x is temperature (degrees F).
Does the data fit the line? Yes, reasonably well. The R-squared is 0.4877, which means about 49% of the variation in ozone is explained by temperature. That is a moderate fit: the line captures a clear upward trend, but there is still a lot of scatter around it. The scatter looks roughly linear rather than following an obvious curved pattern, so I did not see a reason to prefer a quadratic or another curve. The linear fit works OK for this question.
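To show how the regression equation can be used, here is a small sketch that predicts ozone at a few temperatures. The temperatures 70, 80, and 90 are hypothetical values picked only for illustration, and `new_days` and `pred` are my own names. The model is fit with data = airquality so that predict() can match the Temp column in the new data.

```r
model <- lm(Ozone ~ Temp, data = airquality)

# Hypothetical temperatures (degrees F) chosen only for illustration
new_days <- data.frame(Temp = c(70, 80, 90))

# Predicted ozone (ppb) from the fitted line
pred <- predict(model, newdata = new_days)
cbind(new_days, Ozone_pred = pred)

# Same calculation by hand for 80 degrees, using the coefficients
b <- coef(model)
b["(Intercept)"] + b["Temp"] * 80
```

One thing worth noticing is that the line gives negative ozone predictions below roughly 60.5 degrees, which cannot happen in reality, so the equation should only be trusted in the warmer part of the temperature range.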
Here is a plot of ozone levels (in ppb) as a function of wind speed (mph).
plot(airquality$Wind, airquality$Ozone, col='blue', xlab="Wind Speed", ylab="Pollution")
Based on the graph, would you say the correlation is positive or negative? Is the magnitude close to 1, 0 or something in between?
The correlation is negative because the points trend downward at a steady rate: as the wind speed increases, the pollution level decreases. The magnitude is somewhere in between. The points are not clustered tightly enough around a line for it to be close to 1 (or, in this case, negative 1), but there is definitely some sort of relationship, so it cannot be close to zero either.
Here is the correlation.
cor(airquality$Wind, airquality$Ozone, use='complete.obs')
## [1] -0.6015465
What does the correlation tell you about the association between wind speed and ozone? Is it positive or negative? Is it a strong or weak association?
The association is negative: as the wind speed increases, ozone levels decrease. A likely explanation is that the wind spreads out the pollutants, so the ozone concentrations go down.
The association is moderately strong. The correlation is about -0.6, which is a fairly strong negative linear relationship, though a little weaker in magnitude than the ozone-temperature correlation of 0.70.
The correlation between ozone levels and wind speed is -0.6015465, or about -0.60.
Here is a calculation of the regression line.
model <- lm(airquality$Ozone ~ airquality$Wind)
summary(model)
##
## Call:
## lm(formula = airquality$Ozone ~ airquality$Wind)
##
## Residuals:
## Min 1Q Median 3Q Max
## -51.572 -18.854 -4.868 15.234 90.000
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 96.8729 7.2387 13.38 < 2e-16 ***
## airquality$Wind -5.5509 0.6904 -8.04 9.27e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 26.47 on 114 degrees of freedom
## (37 observations deleted due to missingness)
## Multiple R-squared: 0.3619, Adjusted R-squared: 0.3563
## F-statistic: 64.64 on 1 and 114 DF, p-value: 9.272e-13
Now plot the data with the regression line.
plot(airquality$Wind, airquality$Ozone, col='blue', xlab="Wind Speed", ylab="Pollution")
abline(model, col='green')
Use the output to write down the equation for the regression line. Does the data seem to fit the line well or would it be better to use a quadratic or some other curve?
To find the equation of the regression line, I typed these commands into R:
model <- lm(airquality$Ozone ~ airquality$Wind)
summary(model)
This gave the same regression summary shown on page 6 of the PDF.
Then I entered coef(model), which gave (Intercept) = 96.872895 and airquality$Wind = -5.550923.
Then I entered
plot(airquality$Wind, airquality$Ozone, col='blue', xlab="Wind Speed", ylab="Pollution")
abline(model, col='green')
which showed the graph with the regression line.
Then, to get the RMS error, I entered sigma(model), which gave 26.46729, the typical size of the prediction errors.
Then I got R-squared by entering summary(model)$r.squared, which gave 0.3618582, or about 0.36. Wind explains about 36% of the variation in ozone.
The equation for the regression line is y = 96.87 − 5.55x, where y is ozone (ppb) and x is wind speed (mph).
Does the data fit the line well? Not really. An R² of 0.36 means 64% of the variation is left unexplained by the linear model. A straight line captures the general downward trend, but there is too much scatter around it for the line to describe the data reliably.
Would a quadratic or other curve be better? Maybe. With ozone and wind, the relationship appears to flatten out at higher wind speeds, so ozone does not keep dropping at the same rate. A quadratic term (Wind²) could capture that curvature.
I tested this in R with the commands
plot(airquality$Wind, airquality$Ozone, col = 'blue', xlab = "Wind Speed", ylab = "Ozone")
abline(model, col = 'green', lwd = 2) # linear fit
ok <- complete.cases(airquality$Wind, airquality$Ozone) # drop days with missing Ozone before smoothing
lines(lowess(airquality$Wind[ok], airquality$Ozone[ok]), col = 'red', lwd = 2) # smooth curve
The smooth curve follows the data more closely than the straight line, although it still does not account for all of the points. Since the fit improves noticeably, I think a curve works better than a straight line in this case.
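One way to put a number on whether a curve helps is to fit the quadratic model directly and compare it with the straight line. This is only a sketch of that comparison, not something the assignment asked for, and `fit_lin` and `fit_quad` are my own names.

```r
# Linear and quadratic fits of ozone on wind speed
fit_lin  <- lm(Ozone ~ Wind, data = airquality)
fit_quad <- lm(Ozone ~ Wind + I(Wind^2), data = airquality)

# Share of the variation in ozone explained by each model
c(linear    = summary(fit_lin)$r.squared,
  quadratic = summary(fit_quad)$r.squared)

# Adjusted R-squared penalizes the extra term in the quadratic model
c(linear    = summary(fit_lin)$adj.r.squared,
  quadratic = summary(fit_quad)$adj.r.squared)

# p-value for the Wind^2 coefficient
summary(fit_quad)$coefficients["I(Wind^2)", "Pr(>|t|)"]
```

R-squared can never go down when a term is added, so the adjusted R-squared and the p-value for the squared term are the more useful things to look at here.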
Here is a plot of ozone levels (in ppb) as a function of solar radiation level (in lang).
plot(airquality$Solar.R, airquality$Ozone, col='blue', xlab="Solar Radiation", ylab="Pollution")
Based on the graph, would you say the correlation is positive or negative? Is the magnitude close to 1, 0 or something in between?
The correlation is positive because the points gradually trend upward. The magnitude is somewhere in between: the points are too scattered for it to be close to 1, but there does appear to be some kind of relationship, so it is not close to 0.
Here is the correlation.
cor(airquality$Solar.R, airquality$Ozone, use='complete.obs')
## [1] 0.3483417
What does the correlation tell you about the association between solar radiation and ozone? Is it positive or negative? Is it a strong or weak association?
It is positive, because the points on the graph climb upward.
It is a weak association. The correlation is only about 0.35, and R-squared = 0.1213, so solar radiation explains only about 12% of the variation in ozone. That is a small proportion. Because the points are so scattered, solar radiation is not very precise for pinning down ozone.
Here is a calculation of the regression line.
model <- lm(airquality$Ozone ~ airquality$Solar.R)
summary(model)
##
## Call:
## lm(formula = airquality$Ozone ~ airquality$Solar.R)
##
## Residuals:
## Min 1Q Median 3Q Max
## -48.292 -21.361 -8.864 16.373 119.136
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.59873 6.74790 2.756 0.006856 **
## airquality$Solar.R 0.12717 0.03278 3.880 0.000179 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.33 on 109 degrees of freedom
## (42 observations deleted due to missingness)
## Multiple R-squared: 0.1213, Adjusted R-squared: 0.1133
## F-statistic: 15.05 on 1 and 109 DF, p-value: 0.0001793
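The summary above reports 42 observations deleted for missingness, compared with 37 for the temperature and wind models. The sketch below counts the missing values to show where the difference comes from. The names `n_temp`, `n_wind`, and `n_solar` are mine.

```r
# Missing values per variable
colSums(is.na(airquality[, c("Ozone", "Solar.R", "Wind", "Temp")]))

# Rows available to each simple regression
n_temp  <- sum(complete.cases(airquality[, c("Ozone", "Temp")]))
n_wind  <- sum(complete.cases(airquality[, c("Ozone", "Wind")]))
n_solar <- sum(complete.cases(airquality[, c("Ozone", "Solar.R")]))

c(Temp = n_temp, Wind = n_wind, Solar.R = n_solar)
nrow(airquality) - n_solar  # rows dropped by the solar model
```

Because the solar radiation model is fit on slightly fewer days than the other two, its R-squared is not computed on exactly the same data, which is worth keeping in mind when the three models are compared.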
Now plot the data with the regression line.
plot(airquality$Solar.R, airquality$Ozone, col='blue', xlab="Solar Radiation", ylab="Pollution")
abline(model, col='green')
Use the output to write down the equation for the regression line. Does the data seem to fit the line well or would it be better to use a quadratic or some other curve? Which of the three variables would be the most helpful in predicting ozone levels? Which the least? Explain your conclusions.
Here is the first command I gave to R:
model <- lm(airquality$Ozone ~ airquality$Solar.R)
summary(model)
This gave the same regression summary shown on page 9 of the PDF.
The next command I gave to R was coef(model), which gave (Intercept) = 18.5987278 and airquality$Solar.R = 0.1271653. Then I added the regression line to the plot by entering
plot(airquality$Solar.R, airquality$Ozone, col = "blue", xlab = "Solar Radiation", ylab = "Ozone")
abline(model, col = "green")
which gave a picture of the graph with the regression line.
Next I entered sigma(model) to get the RMS error, which gave 31.33457, the typical size of the prediction errors.
Then, to get R-squared, I entered summary(model)$r.squared, which gave 0.1213419, or about 0.12. So 12% of the variation in ozone is explained by solar radiation.
The regression line equation is y = 18.60 + 0.127x, where y is ozone (ppb) and x is solar radiation (lang).
Does the data seem to fit the line well or would it be better to use a quadratic or some other curve?
The line fits, but not well. Compared with the other options I tried, though, the linear fit is the one I would keep. An R-squared of 0.12 means solar radiation explains only about 12% of the variation in ozone. The RMS error is 31.33, which is large compared with the ozone values. The slope is statistically significant (p < 0.001), but the relationship is weak because the points are so scattered.
To see whether a quadratic or some other curve would work better, I entered the following in R. First I tested the quadratic model:
model_quad <- lm(Ozone ~ Solar.R + I(Solar.R^2), data = airquality)
summary(model_quad)
This gave:
## Call:
## lm(formula = Ozone ~ Solar.R + I(Solar.R^2), data = airquality)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.155 -22.793 -6.438 18.061 115.117
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.7561171 9.2761865 -0.513 0.609192
## Solar.R 0.5550868 0.1264847 4.389 2.67e-05 ***
## I(Solar.R^2) -0.0013147 0.0003766 -3.491 0.000698 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
If the squared term is significant (p < 0.05) and R-squared goes up, then a curve fits better. In the output above the squared term is significant (p = 0.000698), which suggests some curvature. Even so, the relationship between solar radiation and ozone stays weak overall. The low R-squared seems to be more about noise and other factors (like temperature and wind) than about a curved pattern. So I do not think the quadratic gives us much more than the line does for predicting ozone.
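A more formal way to compare the straight line with the quadratic is an F-test on the two nested models using anova(). The sketch below is one way to do that. It is my own addition, not part of the original assignment steps, and `fit_lin`, `fit_quad`, and `cmp` are names chosen for illustration.

```r
# Both models use the same rows (days with Ozone and Solar.R recorded)
fit_lin  <- lm(Ozone ~ Solar.R, data = airquality)
fit_quad <- lm(Ozone ~ Solar.R + I(Solar.R^2), data = airquality)

# F-test: does adding the squared term reduce the residual error
# by more than chance alone would explain?
cmp <- anova(fit_lin, fit_quad)
cmp

# R-squared for each model, to judge the practical improvement
c(linear    = summary(fit_lin)$r.squared,
  quadratic = summary(fit_quad)$r.squared)
```

With only one added term, the p-value from this F-test is the same as the p-value for the squared term in summary(fit_quad), so the two outputs should tell the same story. A small p-value says the curve is a statistically better fit, while the change in R-squared shows how much that matters in practice.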
So then I tried a smooth (lowess) curve by entering the following in R:
plot(airquality$Solar.R, airquality$Ozone, col = 'blue', xlab = "Solar Radiation", ylab = "Ozone")
abline(model, col = 'green', lwd = 2) # linear fit
ok <- complete.cases(airquality$Solar.R, airquality$Ozone) # drop days with missing values before smoothing
lines(lowess(airquality$Solar.R[ok], airquality$Ozone[ok]), col = 'red', lwd = 2) # smooth curve
This method did not help much either. The smooth curve still misses most of the points, so predictions from it would not be accurate.
If the red lowess curve stays close to the green line, my best bet is to stay with the linear equation.
Which of the three variables would be the most helpful in predicting ozone levels?
To find this, I compared solar radiation, wind, and temperature with the following R commands:
summary(lm(Ozone ~ Solar.R, data = airquality))$r.squared # 0.12
summary(lm(Ozone ~ Wind, data = airquality))$r.squared # 0.36
summary(lm(Ozone ~ Temp, data = airquality))$r.squared # 0.49
Temperature is the most helpful for predicting ozone. It has the highest R-squared (0.49) and the strongest correlation (0.70), which fits with ozone levels often rising on hotter days.
Which is the least? Solar radiation has the weakest relationship of the three, with the lowest R-squared (0.12).
The plot shows that the points for solar radiation are all over the place, with ozone values ranging from near 0 to over 100. This makes solar radiation unreliable for pinning down ozone.
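The three R-squared values can also be collected in one step, which makes the ranking easier to see. This is a small sketch of that idea, where `predictors` and `r2` are names I chose for illustration.

```r
predictors <- c("Temp", "Wind", "Solar.R")

# Fit Ozone ~ predictor for each variable and pull out R-squared
r2 <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "Ozone"), data = airquality)
  summary(fit)$r.squared
})

# Largest R-squared first
sort(round(r2, 4), decreasing = TRUE)
```

The solar radiation model is fit on a few fewer days than the other two because of missing values, so the comparison is not on exactly the same rows, but the gap between the variables is large enough that the ranking should not depend on that.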