For this exercise you are required to carry out an exploratory analysis of the airquality data set that comes with R. This data set contains various air quality measurements taken over a five-month period in 1973 in New York.
The idea of the exercise, besides giving you some practice in R, is to emphasise a good ‘work-flow’ with data: first look at and summarise the data in various ways, then plot it in various ways. Only then, if it still seems like a good idea, should you carry out any statistical tests on the data.
For this exercise, you should work within the RStudio IDE.
You will work from the following instructions:
To start with, do the following:
You should create and enter your work in a markdown notebook. Choose File/New File/R Notebook. Save the notebook to your scripts folder.
A notebook written in markdown is a document that combines literate text, code and the results of the code, such as figures. Help on how to write Markdown can be found in the Help menu of RStudio.
To tell R where to look for data and where to save files, we need to use the concept of the working directory. We will set this to be your RStuff folder. There are various ways to set this directory. Today, I suggest you go to Session/Set Working Directory/Choose Directory, and go and find it.
In this assignment you will need to make use of commands that are part of various ‘packages’. If these have never been installed on the machine you are using, you will need to install them using install.packages("<name of package>"). You will then need to load them into R to make them available for your session, using library(<name of package>).
library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
library(ggfortify)
library(GGally)
Write a code chunk to load these packages
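Before loading the packages, it can help to check which of them are already installed. The following is a minimal sketch using only base R; the variable names pkgs, installed and missing_pkgs are illustrative, and the install step is left commented out so that nothing is downloaded unless you choose to.

```r
# Packages used in this exercise (taken from the library() calls above)
pkgs <- c("dplyr", "ggplot2", "readr", "lubridate", "ggfortify", "GGally")

# requireNamespace() returns FALSE, rather than stopping, when a package is absent
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Not yet installed: ", paste(missing_pkgs, collapse = ", "))
  # install.packages(missing_pkgs)  # uncomment to install them
}
```

Once nothing is reported as missing, the library() calls should run without error.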
df<-airquality
R comes with many data sets pre-installed. The air quality data set is one of them. They can be very useful for practising all the analyses and visualisations you might want to perform on your own data.
Whatever data we have, it is a good idea to look at it in various ways, then plot it, and only then do any statistical tests that still seem useful.
R provides various ways to look at data:
glimpse(df)
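glimpse() comes from dplyr. As a sketch of some of the other ways to look at the data, the base R functions below give similar overviews without any extra packages.

```r
df <- airquality

# Dimensions: number of rows (days) and columns (variables)
dim(df)

# Compact description of each column: type and first few values
str(df)

# First and last few rows
head(df)
tail(df)
```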
Type ?airquality into the console window to access the help information on this data set.
Write a code chunk to display summary statistical information of the data
summary(df)
Which of the variables have missing values?
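The summary() output reports the number of NA values for each column. A more direct way to count them, sketched here with base R only (na_counts is an illustrative name), is:

```r
# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs in each column
na_counts <- colSums(is.na(airquality))
na_counts

# Names of the variables that have at least one missing value
names(na_counts)[na_counts > 0]
```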
Create a new data frame called aq, which is a copy of the original data. In this data frame, eliminate any rows that contain missing values
aq<-df[complete.cases(df),]
The day and month variables are stored as numeric variables. Convert them to factors, with 31 and 5 levels respectively.
aq$Month<-as.factor(aq$Month)
aq$Day<-as.factor(aq$Day)
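As a quick check that the conversion did what was intended, the number of levels in each factor can be inspected. This sketch rebuilds aq from scratch so that it runs on its own.

```r
# Rebuild the cleaned data frame and convert Month and Day to factors
aq <- airquality[complete.cases(airquality), ]
aq$Month <- as.factor(aq$Month)
aq$Day <- as.factor(aq$Day)

# Number of distinct months and days of the month present in the data
nlevels(aq$Month)
nlevels(aq$Day)

# The month labels themselves
levels(aq$Month)
```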
Using either the base plot functions within R or ggplot2, plot a histogram of the ozone concentration values. Using this, plus the skewness() and kurtosis() functions (these are provided by add-on packages such as e1071 or moments, not by base R), or by other means*, determine whether the ozone data are nearly normally distributed.
g<-ggplot(aq,aes(x=Ozone))+
geom_histogram(bins=10)
g
*Another good way is to use the commands
qqnorm(aq$Ozone)
qqline(aq$Ozone)
No, they are not - the histogram is highly right-skewed, and the qqplot is not at all linear.
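Skewness and kurtosis can also be computed by hand if no package providing them is installed. The sketch below uses the simple moment-based definitions; sample_skewness and sample_excess_kurtosis are illustrative names, and package versions of these statistics may apply slightly different small-sample corrections, so the numbers need not agree exactly.

```r
# Ozone values from the complete rows only, matching the aq data frame
ozone <- airquality$Ozone[complete.cases(airquality)]

# Moment-based skewness: third central moment divided by variance^(3/2)
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

# Excess kurtosis: fourth central moment divided by variance^2, minus 3
sample_excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

sample_skewness(ozone)
sample_excess_kurtosis(ozone)
```

For normally distributed data both values would be close to zero; a clearly positive skewness indicates a long right-hand tail.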
If they are not, apply a square root transformation to the ozone data to make them more nearly so. To do this, create a new column of data called osq in which the values in each row are the square root of the ozone concentration in that row.
aq<-mutate(aq,osq=sqrt(Ozone))
Determine whether the osq data are nearly normally distributed.
g<-ggplot(aq,aes(x=osq))+
geom_histogram(bins=10)
g
qqnorm(aq$osq)
qqline(aq$osq)
It is still not great, but the histogram is more nearly normal than before we applied the transform, and the qqplot is more nearly linear.
Does the wind speed appear to be fairly normally distributed?
g<-ggplot(aq,aes(x=Wind))+
geom_histogram(bins=15)
g
qqnorm(aq$Wind)
qqline(aq$Wind)
Yes, not too bad. The qqplot shows, as we see in the histogram, evidence of a slight right skew.
Plot box plots to show how the ozone concentration, solar intensity, temperature and wind speeds vary across the months.
g<-ggplot(aq,aes(x=Month,y=Ozone))+
geom_boxplot()+
xlab('Month')+
ylab('Ozone Concentration')
g
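The code above covers ozone only. One possible way to produce the remaining three box plots without repeating the same lines is a small helper function. This is a sketch: month_boxplot is a hypothetical name, not part of any package, and the axis labels use the units listed in ?airquality.

```r
library(ggplot2)

# Rebuild the cleaned data so the sketch runs on its own
aq <- airquality[complete.cases(airquality), ]
aq$Month <- as.factor(aq$Month)

# month_boxplot() is a hypothetical helper: one box per month for the named column
month_boxplot <- function(data, variable, label) {
  ggplot(data, aes(x = Month, y = .data[[variable]])) +
    geom_boxplot() +
    xlab("Month") +
    ylab(label)
}

p_solar <- month_boxplot(aq, "Solar.R", "Solar radiation (Langleys)")
p_temp <- month_boxplot(aq, "Temp", "Temperature (F)")
p_wind <- month_boxplot(aq, "Wind", "Wind speed (mph)")

p_solar
p_temp
p_wind
```

The .data[[variable]] form lets the column be chosen by a character string, which is what makes the helper reusable.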
Briefly summarise what can be learned from these box plots.
Use the group_by() and summarise() commands from dplyr to produce a table that shows the mean ozone concentration, solar intensity, temperature and wind speed for each month.
means<- aq %>%
group_by(Month) %>%
summarise(meanOzone = mean(Ozone), meanSolar = mean(Solar.R), meanWind = mean(Wind), meanTemp = mean(Temp))
means
Plot a box plot to show the variation of ozone concentration with day of the week.
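# These first lines create a new column, dow, holding the day of the week - don't spend too much time trying to understand this
# The day is worked out from the actual date (all the data are from 1973), so rows removed for missing values cannot shift the sequence
aq<-mutate(aq,date=make_date(1973,as.integer(as.character(Month)),as.integer(as.character(Day))))
aq$dow<-wday(aq$date,label=TRUE)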
Now we will plot the data:
g<-ggplot(aq,aes(x=dow,y=Ozone))+
geom_boxplot()+
xlab('Day of the week')+
ylab('Ozone Concentration')
g
Is there any evidence from this plot to suggest that the ozone might be linked to traffic volumes? If not, what other evidence would be useful in establishing whether or not the two were correlated?
ggpairs(aq,columns=1:4, aes(alpha = 0.4))
Use the cor() and pairs() functions to determine which pairs of variables, if any, are strongly correlated, and which are weakly correlated.
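A minimal sketch of how cor() and pairs() might be applied to the four measured variables is given below. The names num_vars and cor_matrix are illustrative, and the data frame is rebuilt so that the block runs on its own.

```r
# Keep complete rows only, then select the four measured variables
aq <- airquality[complete.cases(airquality), ]
num_vars <- aq[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Pearson correlation between every pair of variables
cor_matrix <- cor(num_vars)
round(cor_matrix, 2)

# Matrix of scatter plots, one panel per pair of variables
pairs(num_vars)
```

Values close to +1 or -1 indicate a strong linear relationship and values close to 0 a weak one; the sign shows whether the two variables rise together or move in opposite directions.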
Produce scatter plots of:
* ozone concentration vs temperature
* ozone concentration vs wind speed
* solar intensity vs temperature.
In each plot, show the data for each month in a different colour.
g<-ggplot(aq,aes(x=Temp,y=Ozone,colour=Month))+
geom_point()+
xlab('Temperature F')+
ylab('Ozone Concentration ppb' )+
ggtitle('Ozone Concentration vs temperature')
g
Now you write the code for the other two:
Do these plots confirm the results of the pairs plots?
Try to ‘knit’