knitr::opts_chunk$set(warning = FALSE, message = FALSE)
Discussion 2 - Types of Data
Air Quality - Data Description
The Air Quality data set looks at various factors that impact air quality in New York from May to September 1973. The factors are numeric measurements of Ozone, Solar Radiation, Wind, and Temperature in Fahrenheit. The data set also records the month and day on which each measurement was taken.
Air Quality - Using R
The following commands help us understand the data better.
First we get a description of the data from its help page.
# this command pulls up the help page for this built-in dataset
?airquality
Then we can run some basic analysis to summarize the data. summary() shows the min, max, mean, etc. of each variable, which gives a sense of the spread and of how many NAs there are. head() lets us view the first 6 rows of the data. Last, View() opens the data in a separate tab or window, much like viewing a spreadsheet.
#call upon and name our datasets
library(datasets)
data("airquality")
mydata <- airquality
summary(airquality)
     Ozone           Solar.R           Wind             Temp
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00
 NA's   :37       NA's   :7
     Month            Day
 Min.   :5.000   Min.   : 1.0
 1st Qu.:6.000   1st Qu.: 8.0
 Median :7.000   Median :16.0
 Mean   :6.993   Mean   :15.8
 3rd Qu.:8.000   3rd Qu.:23.0
 Max.   :9.000   Max.   :31.0
head(mydata, 6)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
View(mydata)
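The summary output reports missing values only for Ozone and Solar.R. As a minimal sketch using base R only, the NAs can also be counted per column directly. The names na_counts and complete_rows are illustrative and not part of the original analysis.

```r
# Count missing values in each column of the built-in airquality data
library(datasets)
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Count the rows that have no missing value in any column
complete_rows <- sum(complete.cases(airquality))
print(complete_rows)
```

Knowing how many rows are complete is useful before plotting, since rows with a missing Ozone value cannot be drawn as points.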
Air Quality - Data Type
The Air Quality data set is panel data: it consists of two or more observations on the same sample of cross-sectional members at two or more points in time. In this case, we observe variables such as solar radiation and temperature on a given day of a given month.
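To make the time dimension explicit, the Month and Day columns can be combined into a single date. The sketch below assumes the year 1973 from the data description; the names dated and Date are illustrative.

```r
library(datasets)

# Work on a copy so the original data frame is left untouched
dated <- airquality

# Build an ISO-formatted date string; the year 1973 is taken from the data description
dated$Date <- as.Date(sprintf("1973-%02d-%02d", dated$Month, dated$Day))

# First and last day covered by the data
print(range(dated$Date))

# TRUE if every row corresponds to a distinct calendar day
print(nrow(dated) == length(unique(dated$Date)))
```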
options(repos = c(CRAN = "https://cloud.r-project.org"))
Air Quality - Visualization
To illustrate the relationship between these variables, ggplot is used below to analyze the ozone variable in particular: how it varies by month (one panel per month) and by day (on the x axis, showing the fluctuation in ozone over the month).
#install.packages("ggplot2")
library(ggplot2)
ggplot(airquality, aes(x = Day, y = Ozone)) +
geom_point() +
facet_wrap(~ Month) +
labs(title = "Ozone Levels by Day and Month")
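A numeric summary can complement the faceted plot. As a rough sketch, base R's aggregate() computes the mean ozone level for each month; with the formula interface, rows with a missing Ozone value are dropped by default. The name monthly_ozone is illustrative.

```r
library(datasets)

# Mean ozone per month; the formula interface omits rows where Ozone is NA
monthly_ozone <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)
print(monthly_ozone)
```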
Racing - Data Description
This data set comes from Kaggle and provides detailed results for Formula 1 races, including finishing positions and points awarded.
racing_data <- read.csv("/Users/ginaocchipinti/Documents/Econometrics Course - BC/Race_Results.csv")
Racing - Using R
Using R, we can use the head function again to get a sense of the values for each of the variables. Here I'm using the simpler plot function to plot relevant data points such as grid (the racer's starting position on track) and the points awarded for the race.
head(racing_data)
  resultId raceId driverId constructorId number grid position positionText
1        1     18        1             1     22    1        1            1
2        2     18        2             2      3    5        2            2
3        3     18        3             3      7    7        3            3
4        4     18        4             4      5   11        4            4
5        5     18        5             1     23    3        5            5
6        6     18        6             3      8   13        6            6
  positionOrder points laps        time milliseconds fastestLap rank
1             1     10   58 1:34:50.616      5690616         39    2
2             2      8   58      +5.478      5696094         41    3
3             3      6   58      +8.163      5698779         41    5
4             4      5   58     +17.181      5707797         58    7
5             5      4   58     +18.014      5708630         43    1
6             6      3   57         \\N          \\N         50   14
  fastestLapTime fastestLapSpeed statusId
1       1:27.452         218.300        1
2       1:27.739         217.586        1
3       1:28.090         216.719        1
4       1:28.603         215.464        1
5       1:27.418         218.385        1
6       1:29.639         212.974       11
?plot
Help on topic 'plot' was found in the following packages:
Package Library
base /Library/Frameworks/R.framework/Resources/library
graphics /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library
Using the first match ...
# plot() has no data argument here, so the columns are passed directly
plot(x = racing_data$grid,
y = racing_data$points,
type = "h",
main = "Racing Start Position vs. Awarded Points", col.main = "blue",
xlab = "Starting Position",
ylab = "Points Awarded")
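The head output shows \\N in the time and milliseconds columns of the sixth row, which suggests the file marks missing values with the text \N. One possible way to handle this, sketched here on a small inline sample rather than the real file, is the na.strings argument of read.csv. The sample rows are copied from the output above, while csv_text and sample_results are illustrative names, and the assumption that the raw file contains a literal \N has not been checked against the file itself.

```r
# Small inline stand-in for the CSV file (assumption: the raw file uses \N for missing values)
csv_text <- "resultId,grid,points,time,milliseconds
1,1,10,1:34:50.616,5690616
2,5,8,+5.478,5696094
6,13,3,\\N,\\N"

# Treat the text \N as NA while reading
sample_results <- read.csv(text = csv_text, na.strings = "\\N")

# milliseconds is now read as a number instead of text
str(sample_results)
```

With the marker converted to NA, columns such as milliseconds can be used in numeric summaries and plots without a manual conversion step.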
Racing - Data Type
This Racing Results data is cross-sectional because it tracks observations for agents or units (in this case, racers) at a single point in time, i.e., for a single race.
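One way to check the single-race assumption is to count the distinct values of raceId and confirm that each driver appears only once. The sketch below uses a hypothetical stand-in built from the six rows shown earlier, not the full file; sample_ids, n_races, and rows_per_driver are illustrative names.

```r
# Hypothetical stand-in for racing_data, using the first six rows shown earlier
sample_ids <- data.frame(raceId = rep(18, 6), driverId = 1:6)

# Number of distinct races in the sample
n_races <- length(unique(sample_ids$raceId))
print(n_races)

# Number of rows per driver; all ones means one observation per unit
rows_per_driver <- table(sample_ids$driverId)
print(rows_per_driver)
```

Running the same two checks on the full racing_data would show whether the cross-sectional description holds for the whole file or only for the rows displayed by head.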