Discussion 2 - Types of Data

Author

Gina Occhipinti

knitr::opts_chunk$set(warning = FALSE, message = FALSE) 

Air Quality - Data Description

The Air Quality data set records daily air quality measurements in New York from May to September 1973. The numeric variables are ozone, solar radiation, wind, and temperature in degrees Fahrenheit. The data set also records the month and day on which each measurement was taken.

Air Quality - Using R

The following commands help us understand the data better.

First, we get a description of the data from its help page.

# this command pulls up the help page for this built-in dataset
?airquality

Then, we can do some basic data analysis to get a summary of the data. Summary shows the min, max, mean, etc. of each variable, which gives a sense of the spread as well as how many NAs there are. Head lets us view the first 6 rows of the data. Last, View opens the data in a separate tab, much like a spreadsheet.

# load the built-in dataset and keep a copy under our own name
library(datasets)
data("airquality")
mydata <- airquality
summary(airquality)
     Ozone           Solar.R           Wind             Temp      
 Min.   :  1.00   Min.   :  7.0   Min.   : 1.700   Min.   :56.00  
 1st Qu.: 18.00   1st Qu.:115.8   1st Qu.: 7.400   1st Qu.:72.00  
 Median : 31.50   Median :205.0   Median : 9.700   Median :79.00  
 Mean   : 42.13   Mean   :185.9   Mean   : 9.958   Mean   :77.88  
 3rd Qu.: 63.25   3rd Qu.:258.8   3rd Qu.:11.500   3rd Qu.:85.00  
 Max.   :168.00   Max.   :334.0   Max.   :20.700   Max.   :97.00  
 NA's   :37       NA's   :7                                       
     Month            Day      
 Min.   :5.000   Min.   : 1.0  
 1st Qu.:6.000   1st Qu.: 8.0  
 Median :7.000   Median :16.0  
 Mean   :6.993   Mean   :15.8  
 3rd Qu.:8.000   3rd Qu.:23.0  
 Max.   :9.000   Max.   :31.0  
                               
head(mydata, 6)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
View(mydata)
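
The NA counts reported by summary can also be pulled out directly. The lines below are a minimal sketch using base R only; the names na_counts and complete_share are illustrative and not part of the original analysis.

```r
library(datasets)

# Count missing values in each column of the built-in airquality data
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Share of rows that have no missing value in any column
complete_share <- mean(complete.cases(airquality))
print(round(complete_share, 2))
```

The counts for Ozone and Solar.R should match the NA's row of the summary output above (37 and 7).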

Air Quality - Data Type

The Air Quality data set is best described as time series data: several variables (ozone, solar radiation, temperature, etc.) are measured for a single unit, New York, on each day from May to September. It is not panel data, which would require two or more cross-sectional members, each observed at two or more points in time.
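
A quick look at how the rows are organized can support the classification. The sketch below rebuilds a calendar date for each row (the year 1973 is taken from the data description; obs_date and the other names are illustrative) and checks whether any date repeats and whether the dates advance one day at a time. If both checks pass, the rows form a single daily sequence for one location rather than repeated observations on several units.

```r
library(datasets)

# Build a calendar date for each row; 1973 is assumed from the data description
obs_date <- as.Date(sprintf("1973-%02d-%02d", airquality$Month, airquality$Day))

# TRUE if no calendar date appears more than once
no_repeats <- anyDuplicated(obs_date) == 0

# TRUE if every row is exactly one day after the previous row
daily_steps <- all(diff(obs_date) == 1)

# Number of rows recorded in each month
rows_per_month <- table(airquality$Month)

print(no_repeats)
print(daily_steps)
print(rows_per_month)
```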

options(repos = c(CRAN = "https://cloud.r-project.org"))

Air Quality - Visualization

To illustrate how ozone changes over time, ggplot is used below to plot the ozone variable by day (on the x axis, showing the fluctuation in ozone over the month), with one panel per month.

#install.packages("ggplot2")
library(ggplot2)

ggplot(airquality, aes(x = Day, y = Ozone)) +
  geom_point() +
  facet_wrap(~ Month) +
  labs(title = "Ozone Levels by Day and Month")
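
As a numeric companion to the faceted plot, the mean ozone level in each month can be computed with base R. This is a sketch only; monthly_mean is an illustrative name, and missing ozone readings are dropped with na.rm = TRUE rather than imputed.

```r
library(datasets)

# Mean ozone per month, ignoring missing readings
monthly_mean <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
print(round(monthly_mean, 1))

# Overall mean, for comparison with the value reported by summary()
overall_mean <- mean(airquality$Ozone, na.rm = TRUE)
print(round(overall_mean, 2))
```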

Racing - Data Description

This data set comes from Kaggle and provides detailed results for Formula 1 races, including finishing positions and points awarded.

racing_data <- read.csv("/Users/ginaocchipinti/Documents/Econometrics Course - BC/Race_Results.csv")
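
The rows printed further down show \\N in the time and milliseconds columns, which appears to be the file's marker for a missing value. A minimal sketch of one way to handle this is to pass na.strings to read.csv so that those entries become NA and the column stays numeric. The inline CSV text below is a hypothetical three-column stand-in for the real file, built from values shown in the printed rows.

```r
# Hypothetical stand-in for the CSV file; "\\N" in R source is the two characters \N
csv_text <- "resultId,laps,milliseconds\n1,58,5690616\n6,57,\\N"

# Default import: the \N marker forces the whole column to character
raw <- read.csv(text = csv_text)

# Treat \N as missing so the column is read as numeric with an NA
clean <- read.csv(text = csv_text, na.strings = "\\N")

str(raw)
str(clean)
```

With the real file, the same na.strings argument could be added to the read.csv call above, assuming \N is used consistently as the missing-value marker.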

Racing - Using R

Using R, we can use the head function again to get a sense of the values of each variable. Here I’m using the simpler plot function to plot two relevant variables: grid, the racer's starting position on track, and points, the points awarded for the race.

head(racing_data)
  resultId raceId driverId constructorId number grid position positionText
1        1     18        1             1     22    1        1            1
2        2     18        2             2      3    5        2            2
3        3     18        3             3      7    7        3            3
4        4     18        4             4      5   11        4            4
5        5     18        5             1     23    3        5            5
6        6     18        6             3      8   13        6            6
  positionOrder points laps        time milliseconds fastestLap rank
1             1     10   58 1:34:50.616      5690616         39    2
2             2      8   58      +5.478      5696094         41    3
3             3      6   58      +8.163      5698779         41    5
4             4      5   58     +17.181      5707797         58    7
5             5      4   58     +18.014      5708630         43    1
6             6      3   57         \\N          \\N         50   14
  fastestLapTime fastestLapSpeed statusId
1       1:27.452         218.300        1
2       1:27.739         217.586        1
3       1:28.090         216.719        1
4       1:28.603         215.464        1
5       1:27.418         218.385        1
6       1:29.639         212.974       11
?plot
Help on topic 'plot' was found in the following packages:

  Package               Library
  base                  /Library/Frameworks/R.framework/Resources/library
  graphics              /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library


Using the first match ...
# plot() has no data argument, so the columns are passed directly as x and y
plot(x = racing_data$grid,
     y = racing_data$points,
     type = "h",
     main = "Racing Start Position vs. Awarded Points", col.main = "blue",
     xlab = "Starting Position",
     ylab = "Points Awarded")
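
To put a number on the pattern in the plot, a rank correlation between starting position and points is one option. The sketch below uses only the six rows printed by head above (first_rows is an illustrative name), so the result says nothing reliable about the full file; it only shows the call.

```r
# The six rows shown by head(racing_data), grid and points columns only
first_rows <- data.frame(
  grid   = c(1, 5, 7, 11, 3, 13),
  points = c(10, 8, 6, 5, 4, 3)
)

# Spearman rank correlation: negative if better grid slots go with more points
rho <- cor(first_rows$grid, first_rows$points, method = "spearman")
print(rho)
```

On the full data, the same call could be run on racing_data$grid and racing_data$points, assuming both columns are numeric after import.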

Racing - Data Type

The Racing Results data is cross-sectional because it tracks observations for agents or units (in this case, racers) at a single point in time, i.e., for a single race.
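
Whether the file covers a single race can be checked by counting distinct raceId values. The sketch below uses only the six rows printed by head (first_rows is an illustrative name), so it is a template for the check rather than a result for the whole file.

```r
# raceId and driverId as shown in the six printed rows
first_rows <- data.frame(
  raceId   = rep(18, 6),
  driverId = 1:6
)

# Number of distinct races among these rows
n_races <- length(unique(first_rows$raceId))

# TRUE if no driver appears twice among these rows
one_row_per_driver <- anyDuplicated(first_rows$driverId) == 0

print(n_races)
print(one_row_per_driver)
```

Running the same two lines on racing_data would show whether the full file contains one race or several; more than one distinct raceId would mean the data span multiple points in time.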