Exploration of Air Quality Dataset

Tasks:

Load necessary packages into R

In this assignment you will need to use commands that are part of various ‘packages’. If these have never been installed on the machine you are using, you will need to install them using install.packages("<name of package>"). You will then need to load them into R to make them available for your session, using library(<name of package>).
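As a minimal sketch of the install-then-load routine described above, the following checks which of the packages used in this assignment are missing from your machine. The names pkgs and missing_pkgs are only illustrative, and the install line is left commented out so that nothing is downloaded by accident.

```r
# Packages used in this assignment
pkgs <- c("dplyr", "ggplot2", "readr", "lubridate", "ggfortify", "GGally")

# Compare against everything currently installed on this machine
missing_pkgs <- setdiff(pkgs, rownames(installed.packages()))

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install the missing ones
} else {
  message("All packages are already installed.")
}
```

Installation is needed only once per machine, whereas library() has to be called in every new R session.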

library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
library(ggfortify)
library(GGally)

Write a code chunk to load these packages

Load the air quality data set into your current session

df<-airquality

R comes with many data sets pre-installed. The air quality data set is one of them. They can be very useful for practising all the analyses and visualisations you might want to perform on your own data.

Whatever data we have, it is a good idea to look at it in various ways, then plot it, and only then carry out any statistical tests that still seem useful.

R provides various ways to look at data:

glimpse(df)
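glimpse() comes from the dplyr package. As a small sketch of some other ways to look at the same data, the commands below use only base R. The name aq_raw is an illustrative stand-in for a fresh copy of the built-in data set.

```r
# Base R alternatives to glimpse(); no packages are needed
aq_raw <- airquality   # illustrative name for a fresh copy of the built-in data

dim(aq_raw)     # number of rows and columns
names(aq_raw)   # column names
str(aq_raw)     # type of each column plus its first few values
head(aq_raw)    # first six rows
```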

Describe and Display the data

Type ?airquality into the console window to access the help information on this data set.

Summary of the data.

Write a code chunk to display summary statistical information of the data

summary(df)

Which of the variables have missing values?
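Besides reading the NA counts off the summary, one possible way to count the missing values directly is sketched below. The name na_counts is illustrative only.

```r
# is.na() marks each missing cell as TRUE; colSums() adds them up per column
na_counts <- colSums(is.na(airquality))
na_counts

# Keep only the columns that actually contain missing values
na_counts[na_counts > 0]
```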

Create a new data frame and remove missing values

Create a new data frame called aq, which is a copy of the original data. In this data frame, eliminate any rows that contain missing values.

aq<-df[complete.cases(df),]
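It is worth checking that the rows really have gone. The sketch below repeats the step on an illustrative copy called aq_check and compares the row counts before and after. na.omit() is shown as an alternative base R route to the same rows.

```r
# Illustrative copy, built the same way as aq above
aq_check <- airquality[complete.cases(airquality), ]

nrow(airquality)   # rows in the original data
nrow(aq_check)     # rows left once incomplete rows are dropped
anyNA(aq_check)    # FALSE if no missing values remain

# na.omit() drops the same rows
nrow(na.omit(airquality)) == nrow(aq_check)
```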

Creation of factor (categorical) variables.

The day and month variables are stored as numeric variables. Convert them to factors, with 31 and 5 levels respectively (days 1 to 31, months May to September).

aq$Month<-as.factor(aq$Month)
aq$Day<-as.factor(aq$Day)
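A quick way to confirm what the conversion produced is to inspect the levels. The sketch below does this on an illustrative copy called aq_check, and also shows a common trap when turning a factor back into numbers.

```r
# Illustrative copy with the same conversion applied
aq_check <- airquality[complete.cases(airquality), ]
aq_check$Month <- as.factor(aq_check$Month)
aq_check$Day <- as.factor(aq_check$Day)

levels(aq_check$Month)   # the distinct months present in the data
nlevels(aq_check$Day)    # how many distinct days of the month are present

# Caution: as.numeric() on a factor returns the level code, not the label.
as.numeric(aq_check$Month[1])                 # level code of the first row
as.numeric(as.character(aq_check$Month[1]))   # the actual month number
```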

Distribution of the data

Using either the base plot functions within R or ggplot2, plot a histogram of the ozone concentration values. Using this, plus the skewness() and kurtosis() functions (these come from an add-on package such as moments or e1071, not from base R), or by other means*, determine whether the ozone data are nearly normally distributed.

g<-ggplot(aq,aes(x=Ozone))+
  geom_histogram(bins=10)
g

*Another good way is to use the commands

qqnorm(aq$Ozone)
qqline(aq$Ozone)

No, they are not - the histogram is highly right-skewed, and the qqplot is not at all linear.
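The skew can also be put into numbers. The sketch below computes skewness and excess kurtosis straight from their moment definitions in base R, so no extra package is needed. The function names moment_skewness and moment_excess_kurtosis are made up for this example, and packaged versions may apply slightly different small-sample corrections, so the values need not match theirs exactly.

```r
# Ozone values from the complete rows only
ozone <- airquality$Ozone[complete.cases(airquality)]

# Skewness: mean cubed standardised deviation (0 for a symmetric distribution)
moment_skewness <- function(x) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))   # population standard deviation
  mean(((x - m) / s)^3)
}

# Excess kurtosis: mean fourth-power standardised deviation minus 3
# (0 for a normal distribution)
moment_excess_kurtosis <- function(x) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))
  mean(((x - m) / s)^4) - 3
}

skew_ozone <- moment_skewness(ozone)
kurt_ozone <- moment_excess_kurtosis(ozone)
skew_ozone
kurt_ozone
```

A positive skewness value indicates a longer tail to the right, which is what the histogram shows.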

If they are not, apply a square root transformation to the ozone data to make them more nearly so. To do this, create a new column of data called osq in which the values in each row are the square root of the ozone concentration in that row.

aq<-mutate(aq,osq=sqrt(Ozone))

Determine whether the osq data are nearly normally distributed.

g<-ggplot(aq,aes(x=osq))+
  geom_histogram(bins=10)
g
qqnorm(aq$osq)
qqline(aq$osq)

It is still not great, but the histogram is more nearly normal than before we applied the transform, and the qqplot is more nearly linear.
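As an optional numerical check alongside the plots, the sketch below runs the Shapiro-Wilk normality test from base R on the ozone values before and after the transform. This test is not something the assignment asks for, and the names aq_check, sw_raw and sw_sqrt are illustrative. The W statistic moves closer to 1 as the data look more normal.

```r
# Illustrative copy of the complete rows
aq_check <- airquality[complete.cases(airquality), ]

sw_raw  <- shapiro.test(aq_check$Ozone)         # untransformed ozone
sw_sqrt <- shapiro.test(sqrt(aq_check$Ozone))   # square-root transformed ozone

# Compare the W statistics and p-values side by side
c(raw = unname(sw_raw$statistic), sqrt = unname(sw_sqrt$statistic))
c(raw = sw_raw$p.value, sqrt = sw_sqrt$p.value)
```

A test like this should support the plots, not replace them: with enough data even small departures from normality give small p-values.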

Does the wind speed appear to be fairly normally distributed?

g<-ggplot(aq,aes(x=Wind))+
  geom_histogram(bins=15)
g
qqnorm(aq$Wind)
qqline(aq$Wind)

Yes, not too bad. The qqplot shows, as we see in the histogram, evidence of a slight right skew.

Variation with month

Plot box plots to show how the ozone concentration, solar intensity, temperature and wind speeds vary across the months.

g<-ggplot(aq,aes(x=Month,y=Ozone))+
  geom_boxplot()+
  xlab('Month')+
  ylab('Ozone Concentration')
g
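The ozone plot can be repeated for the other three variables by changing the y variable. As one possible shortcut, the sketch below uses the base plot functions to draw all four box plots on a single page. The names aq_check, vars and old_par are illustrative.

```r
# Illustrative copy of the complete rows
aq_check <- airquality[complete.cases(airquality), ]
vars <- c("Ozone", "Solar.R", "Temp", "Wind")

old_par <- par(mfrow = c(2, 2))   # 2 x 2 grid of plots; keep the old settings
for (v in vars) {
  # One box per month for the current variable
  boxplot(aq_check[[v]] ~ aq_check$Month,
          xlab = "Month", ylab = v, main = paste(v, "by month"))
}
par(old_par)                      # restore the previous plot layout
```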

Briefly summarise what can be learned from these box plots.

Data summaries by month

Use the group_by() and summarise() commands from the dplyr package to produce a table which shows the mean ozone concentration, solar intensity, temperature and wind speed for each month.

means<- aq %>% 
  group_by(Month) %>% 
  summarise(meanO3 = mean(Ozone),       # ozone is O3, not O2
            meanSolar = mean(Solar.R),
            meanWind = mean(Wind),
            meanTemp = mean(Temp))
means
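As a cross-check on the dplyr table, the same monthly means can be sketched in base R with aggregate(). The names aq_check and monthly_means are illustrative.

```r
# Illustrative copy of the complete rows
aq_check <- airquality[complete.cases(airquality), ]

# cbind() on the left lists the columns to average; Month on the right is the grouping
monthly_means <- aggregate(cbind(Ozone, Solar.R, Wind, Temp) ~ Month,
                           data = aq_check, FUN = mean)
monthly_means
```

The result has one row per month, and its values should agree with the summarise() output.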

Variation with day

Plot a box plot to show the variation of ozone concentration with day of the week.

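# These lines create a new column, day of the week. The data were collected in 1973,
# so we rebuild each row's date and ask lubridate which weekday it falls on.
# (Month and Day are factors by now, so they are converted back to numbers first.)
aq<-mutate(aq,dow=wday(make_date(1973,as.integer(as.character(Month)),as.integer(as.character(Day))),label=TRUE))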

Now we will plot the data:

g<-ggplot(aq,aes(x=dow,y=Ozone))+
  geom_boxplot()+
  xlab('Day of the week')+
  ylab('Ozone Concentration')
g

Is there any evidence from this plot to suggest that the ozone might be linked to traffic volumes? If not, what other evidence would be useful in establishing whether or not the two were correlated?

Correlation

ggpairs(aq,columns=1:4, aes(alpha = 0.4))

Use the cor() and pairs() functions to determine which pairs of variables, if any, are strongly correlated, and which are weakly correlated.
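A minimal sketch of that step, using only base R, is shown below. The names aq_num and cor_mat are illustrative, and the data are restricted to complete rows so that cor() does not return NA.

```r
# The four measured variables, complete rows only
aq_num <- airquality[complete.cases(airquality),
                     c("Ozone", "Solar.R", "Wind", "Temp")]

# Correlation matrix, rounded for easier reading
cor_mat <- cor(aq_num)
round(cor_mat, 2)

# Matrix of scatter plots for the same four variables
pairs(aq_num)
```

Values near +1 or -1 indicate a strong linear relationship, and values near 0 a weak one.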

Scatter plots

Produce scatter plots of:

* ozone concentration vs temperature
* ozone concentration vs wind speed
* solar intensity vs temperature.

In each plot, show the data for each month in a different colour.

Scatter plot of Ozone Concentration vs temperature

g<-ggplot(aq,aes(x=Temp,y=Ozone,colour=Month))+
  geom_point()+
  xlab('Temperature F')+
  ylab('Ozone Concentration ppb' )+
  ggtitle('Ozone Concentration vs temperature')
g

Now you write the code for the other two:

Scatter plot of Ozone Concentration vs wind speed

Scatter plot of Solar Intensity vs temperature

Do these plots confirm the results of the pairs plots?

Save your work and email it to Mike.

Try to ‘knit’