Introduction

For this exercise you are required to carry out an exploratory analysis of the airquality data set that comes built into R. This data set contains various measurements relating to air quality that were taken over a five-month period in 1973 in New York.

The idea of the exercise, besides giving you some practice in R, is to emphasise a good ‘work-flow’ with data: first look at and summarise the data in various ways, then plot it in various ways. Only then, if it still seems like a good idea, should you carry out any statistical tests on the data.

For this exercise, you should work within the RStudio IDE.

You will work from the following instructions:

https://rpubs.com/mbh038/583109

Set up your work spaces

To start with, do the following:

You should create and enter your work in a markdown notebook. Choose File/New File/R Notebook. Save the notebook to your scripts folder.

A notebook written in markdown is a document that combines literate text, code and the results of the code, such as figures. Help on how to write Markdown can be found in the Help menu of RStudio.

Set the working directory

To tell R where to look for data and where to save files, we need to use the concept of the working directory. We will set this to be your RStuff folder. There are various ways to set this directory. Today, I suggest you go to Session/Set Working Directory/Choose Directory, and go and find it.
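As a sketch, the working directory can also be inspected and set from the console with getwd() and setwd(). The path shown is only a placeholder for wherever your RStuff folder actually lives, so that line is left commented out.

```r
# Show the directory R is currently using
current_dir <- getwd()
current_dir

# Point R at your own folder (hypothetical path, adjust before running)
# setwd("~/RStuff")
```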

Tasks:

Load necessary packages into R

In this assignment you will need to make use of commands that are part of various ‘packages’. If these have never been installed before on the machine you are using, you will need to install them using install.packages("<name of package>"). You will then need to load them into R to make them available for your session, using library(<name of package>).

library(dplyr)
library(ggplot2)
library(readr)
library(lubridate)
library(ggfortify)
library(GGally)

Write a code chunk to load these packages
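If you are not sure which of these packages are already installed, something like the following sketch can tell you. The variable names needed and not_installed are just illustrative, and the install line is commented out so that nothing is downloaded until you choose to run it.

```r
# The packages used in this exercise
needed <- c("dplyr", "ggplot2", "readr", "lubridate", "ggfortify", "GGally")

# Which of them are not yet installed on this machine?
not_installed <- setdiff(needed, rownames(installed.packages()))
not_installed

# Uncomment to install whatever is missing
# install.packages(not_installed)
```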

Load the air quality data set into your current session

df<-airquality

R comes with many data sets pre-installed. The air quality data set is one of them. They can be very useful for practising all the analyses and visualisations you might want to perform on your own data.

Whatever data we have, it is a good idea to look at it in various ways, then plot it, and only then do any statistical tests that still seem useful.

R provides various ways to look at data:

glimpse(df)
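Base R offers some similar tools that need no extra packages. A short sketch (df is recreated here so the chunk runs on its own):

```r
df <- airquality

dim(df)    # number of rows and columns
head(df)   # the first six rows
str(df)    # the type of each column, with a few example values
```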

Describe and Display the data

Type ?airquality into the console window to access the help information on this data set.

Summary of the data.

Write a code chunk to display summary statistical information of the data

summary(df)

Which of the variables have missing values?
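One way to check your answer, offered as a sketch, is to count the missing values in each column directly. is.na() marks every missing entry and colSums() adds them up column by column. The name na_counts is just illustrative.

```r
df <- airquality

# Number of NA values in each column
na_counts <- colSums(is.na(df))
na_counts

# Names of the columns that have at least one missing value
names(na_counts)[na_counts > 0]
```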

Create a new data frame and remove missing values

Create a new data frame called aq, which is a copy of the original data. In this data frame, eliminate any rows that contain missing values.

aq<-df[complete.cases(df),]
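It is worth checking how many rows were dropped. The sketch below compares the row counts before and after, and shows that base R's na.omit() is an alternative that should give the same number of rows. The name aq_alt is only illustrative.

```r
df <- airquality
aq <- df[complete.cases(df), ]

nrow(df)   # rows in the original data
nrow(aq)   # rows left once incomplete rows are removed

# An alternative way to drop rows with missing values
aq_alt <- na.omit(df)
nrow(aq_alt) == nrow(aq)
```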

Creation of factor (categorical) variables.

The Month and Day variables are stored as numeric variables. Convert them to factors, with 5 and 31 levels respectively (the data cover the five months from May to September).

aq$Month<-as.factor(aq$Month)
aq$Day<-as.factor(aq$Day)
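To confirm that the conversion worked, you can inspect the levels of each factor. A minimal sketch, with aq rebuilt here so that the chunk is self-contained:

```r
aq <- airquality[complete.cases(airquality), ]
aq$Month <- as.factor(aq$Month)
aq$Day <- as.factor(aq$Day)

levels(aq$Month)    # the distinct months present
nlevels(aq$Month)   # how many months
nlevels(aq$Day)     # how many distinct days of the month
```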

Distribution of the data

Using either the base plot functions within R or ggplot2, plot a histogram of the ozone concentration values. Using this, plus the skewness() and kurtosis() functions (available from add-on packages such as e1071 or moments), or by other means*, determine whether the ozone data are nearly normally distributed.

g<-ggplot(aq,aes(x=Ozone))+
  geom_histogram(bins=10)
g

*Another good way is to use the commands:

qqnorm(aq$Ozone)
qqline(aq$Ozone)

No, they are not - the histogram is highly right-skewed, and the qqplot is not at all linear.
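The skewness() and kurtosis() functions are not part of base R or of the packages loaded above. As a sketch, simple moment-based versions can be written by hand. The function names moment_skewness and moment_excess_kurtosis are made up for this illustration, and packaged versions may use slightly different formulas. Both values are close to zero for normally distributed data.

```r
aq <- airquality[complete.cases(airquality), ]

# Skewness: third central moment divided by the cube of the standard deviation
moment_skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Excess kurtosis: fourth central moment over squared variance, minus 3
moment_excess_kurtosis <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2 - 3
}

moment_skewness(aq$Ozone)          # positive values indicate right skew
moment_excess_kurtosis(aq$Ozone)
```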

If they are not, apply a square root transformation to the ozone data to make them more nearly so. To do this, create a new column of data called osq in which the values in each row are the square root of the ozone concentration in that row.

aq<-mutate(aq,osq=sqrt(Ozone))

Determine whether the osq data are nearly normally distributed.

g<-ggplot(aq,aes(x=osq))+
  geom_histogram(bins=10)
g
qqnorm(aq$osq)
qqline(aq$osq)

It is still not great, but the histogram is more nearly normal than before we applied the transform, and the qqplot is more nearly linear.

Does the wind speed appear to be fairly normally distributed?

g<-ggplot(aq,aes(x=Wind))+
  geom_histogram(bins=15)
g
qqnorm(aq$Wind)
qqline(aq$Wind)

Yes, not too bad. The qqplot shows, as we see in the histogram, evidence of slight right skew.

Variation with month

Plot box plots to show how the ozone concentration, solar intensity, temperature and wind speeds vary across the months.

g<-ggplot(aq,aes(x=Month,y=Ozone))+
  geom_boxplot()+
  xlab('Month')+
  ylab('Ozone Concentration')
g
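For the remaining variables, one possible approach is base R's boxplot() with a formula. This is only a sketch. The axis labels take their units from the ?airquality help page, and the name temp_box is illustrative.

```r
aq <- airquality[complete.cases(airquality), ]
aq$Month <- as.factor(aq$Month)

# One box per month for each of the other three variables
boxplot(Solar.R ~ Month, data = aq, xlab = "Month", ylab = "Solar radiation (Langleys)")
boxplot(Wind ~ Month, data = aq, xlab = "Month", ylab = "Wind speed (mph)")

# boxplot() also returns the statistics it draws, which can be kept for inspection
temp_box <- boxplot(Temp ~ Month, data = aq, xlab = "Month", ylab = "Temperature (degrees F)")
temp_box$stats   # each column holds the five box plot statistics for one month
```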

Briefly summarise what can be learned from these box plots.

Data summaries by month

Use the group_by() and summarise() commands from dplyr to produce a table which shows the mean ozone concentration, solar intensity, temperature and wind speeds for each month.

means<- aq %>% 
  group_by(Month) %>% 
  summarise(meanO3 = mean(Ozone),
            meanSolar = mean(Solar.R),
            meanWind = mean(Wind),
            meanTemp = mean(Temp))
means
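As a cross-check on the dplyr result, the same monthly means can be sketched in base R with aggregate(). The name monthly_means is only illustrative.

```r
aq <- airquality[complete.cases(airquality), ]

# Mean of each variable within each month
monthly_means <- aggregate(cbind(Ozone, Solar.R, Wind, Temp) ~ Month,
                           data = aq, FUN = mean)
monthly_means
```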

Variation with day

Plot a box plot to show the variation of ozone concentration with day of the week.

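# These lines create a new column, day of the week - don't spend too much time trying to understand this.
# Rows with missing values were removed earlier, so the day is worked out from the actual date (all in 1973), not from row position.
aq<-mutate(aq,date=make_date(1973,as.integer(as.character(Month)),as.integer(as.character(Day))))
aq<-mutate(aq,dow=wday(date,label=TRUE))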

Now we will plot the data:

g<-ggplot(aq,aes(x=dow,y=Ozone))+
  geom_boxplot()+
  xlab('Day of the week')+
  ylab('Ozone Concentration')
g

Is there any evidence from this plot to suggest that the ozone might be linked to traffic volumes? If not, what other evidence would be useful in establishing whether or not the two were correlated?

Correlation

ggpairs(aq,columns=1:4, aes(alpha = 0.4))

Use the cor() and pairs() functions to determine which pairs of variables, if any, are strongly correlated, and which are weakly correlated.
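A minimal sketch of how cor() and pairs() might be applied to the four numeric measurements. The names num_vars and cor_matrix are illustrative.

```r
aq <- airquality[complete.cases(airquality), ]

# Keep just the four numeric measurements
num_vars <- aq[, c("Ozone", "Solar.R", "Wind", "Temp")]

# Correlation matrix: values near +1 or -1 are strong, values near 0 are weak
cor_matrix <- cor(num_vars)
round(cor_matrix, 2)

# Matrix of scatter plots, one panel per pair of variables
pairs(num_vars)
```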

Scatter plots

Produce scatter plots of:
* ozone concentration vs temperature
* ozone concentration vs wind speed
* solar intensity vs temperature.

In each plot, show the data for each month in a different colour.

Scatter plot of Ozone Concentration vs temperature

g<-ggplot(aq,aes(x=Temp,y=Ozone,colour=Month))+
  geom_point()+
  xlab('Temperature F')+
  ylab('Ozone Concentration ppb' )+
  ggtitle('Ozone Concentration vs temperature')
g

Now you write the code for the other two:

Scatter plot of Ozone Concentration vs wind speed

Scatter plot of Solar Intensity vs temperature

Do these plots confirm the results of the pairs plots?

Save your work and email it to Mike.

Try to ‘knit’