class: center, middle, title-slide

.title[
# Exploratory Data Analysis
]
.subtitle[
## JSC 370: Data Science II
]
.date[
### January 20, 2025
]

---

<style>
.html-widget {
  margin: auto;
}
</style>

## Exploratory Data Analysis

* Exploratory data analysis is the process of summarizing data
* It should be the first step in your analysis pipeline
* It involves:

--

  + 1) checking data (import issues, outliers, missing values, data errors)
  + 2) cleaning data
  + 3) summary statistics of key variables (univariate and bivariate)
  + 4) basic plots and graphs

---

## Pipeline

<div class="figure" style="text-align: center">
<img src="img/data-science2.png" alt="From: R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund" width="65%" />
<p class="caption">From: R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund</p>
</div>

EDA involves Import -> Tidy -> Transform -> Visualize. It is essentially everything we do before modeling, prediction, or inference.

---

## EDA Checklist

The goal of EDA is to better understand your data. Let's use the **checklist**:

--

1. Formulate a question

--

2. Read in the data

--

3. Check the dimensions and headers and footers of the data

--

4. Check the variable types in the data

--

5. Take a closer look at some/all of the variables

--

6. Validate with an external source

--

7. Conduct some summary statistics to answer the initial question

--

8. Make exploratory graphs

---

## Case study

We are going to use a dataset created from the National Center for Environmental Information (https://www.ncei.noaa.gov/). The data are hourly measurements from August 2019 for weather stations across the continental U.S.

<img src="img/weather.png" width="65%" height="55%" style="display: block; margin: auto;" />

---

## Formulate a Question

It is a good idea to first have a question, such as:

--

* which weather stations reported the hottest and coldest daily temperatures?
* what day of the month was on average the hottest?
* is there covariation between temperature and humidity in my dataset?

---

## Read in the Data

There are several ways to read in data (some depend on the type of data you have):

* `read.table` or `read.csv` in base R for delimited files
* `readRDS` if you have a .rds dataset
* `read_csv`, `read_csv2`, `read_delim`, `read_fwf` from `library(readr)`, which is part of the tidyverse
* `read_excel` from `library(readxl)` for .xls and .xlsx files
* `read_sas`, `read_spss`, `read_stata` from `library(haven)`
* `fread` from `library(data.table)` for efficiently importing large, regularly delimited files

---

## Read in the Data

In R we will use the `data.table` and `tidyverse` packages. For exploratory maps we will use `maps` and `leaflet`. If we want to use Python, we can set up code chunks that run Python, but we need the `reticulate` package. We also want to use pandas in Python.

Let's load all of the libraries we need to read in the data in R and Python.

``` r
library(data.table)
library(tidyverse)
library(leaflet)
library(reticulate)
py_install("pandas")
```

```
## Using virtual environment '/Users/meredith/.virtualenvs/r-reticulate' ...
```

---

## Read in the data: R

Let's load in the data with `data.table`. I have it stored locally, but we will see how to load it straight from GitHub in the lab.

``` r
met <- data.table::fread("met_all.gz")
```

---

## Read in the data: R

`data.table` is a more efficient version of base R's `data.frame`.
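To see the speed difference yourself, here is a minimal sketch (not in the original slides; it assumes `met_all.gz` sits in the working directory as in the `fread()` call above, and `met_base`/`met_dt` are new names):

``` r
# Base R can read gzipped CSVs too, but fread() is typically much
# faster on a file with ~2.4 million rows.
system.time(met_base <- read.csv("met_all.gz"))
system.time(met_dt <- data.table::fread("met_all.gz"))
```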
We create a `data.table` from an external data source by reading it in with `fread()`.

We can convert existing `data.frame` or `list` objects to `data.table` with `setDT()`.

One nice thing about `data.table` is that it handles large datasets efficiently and never converts character variables to factors, which can be a pain with `data.frame`.

Another nice thing is that many base R functions work on a `data.table` just as if it were a `data.frame`.

---

## Read in the data: Python

``` python
import pandas as pd

# Read the .gz file into a DataFrame
met = pd.read_csv("met_all.gz", compression='gzip')
```

---

## Check the data: R

We should check the dimensions of the data set. This can be done several ways:

``` r
dim(met)
```

```
## [1] 2377343      30
```

``` r
nrow(met)
```

```
## [1] 2377343
```

``` r
ncol(met)
```

```
## [1] 30
```

---

## Check the data: Python

- check the first few rows with `met.head()` and the shape with `met.shape`

``` python
print(met.head())
```

```
##    USAFID   WBAN  year  month  ...  dew.point.qc  atm.press  atm.press.qc         rh
## 0  690150  93121  2019      8  ...             5     1009.9             5  19.881266
## 1  690150  93121  2019      8  ...             5     1010.3             5  21.760980
## 2  690150  93121  2019      8  ...             5     1010.6             5  18.482122
## 3  690150  93121  2019      8  ...             5     1011.6             5  16.888622
## 4  690150  93121  2019      8  ...             5     1012.7             5  17.384101
## 
## [5 rows x 30 columns]
```

``` python
print("Shape of the dataset:", met.shape)
```

```
## Shape of the dataset: (2377343, 30)
```

---

## Check the data

* We see that there are 2,377,343 records of hourly temperature in August 2019 from all of the weather stations in the US. The data set has 30 variables.
* We should also check the top and bottom of the dataset to make sure that it imported correctly. Use `head(met)` and `tail(met)` for this.
* Next we can take a deeper dive into the contents of the data with `str()`

---

## Check variables: R

```
## Classes 'data.table' and 'data.frame':  2377343 obs. of  30 variables:
##  $ USAFID           : int  690150 690150 690150 690150 690150 690150 690150 690150 690150 690150 ...
##  $ WBAN             : int  93121 93121 93121 93121 93121 93121 93121 93121 93121 93121 ...
##  $ year             : int  2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
##  $ month            : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ day              : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ hour             : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ min              : int  56 56 56 56 56 56 56 56 56 56 ...
##  $ lat              : num  34.3 34.3 34.3 34.3 34.3 34.3 34.3 34.3 34.3 34.3 ...
##  $ lon              : num  -116 -116 -116 -116 -116 ...
##  $ elev             : int  696 696 696 696 696 696 696 696 696 696 ...
##  $ wind.dir         : int  220 230 230 210 120 NA 320 10 320 350 ...
##  $ wind.dir.qc      : chr  "5" "5" "5" "5" ...
##  $ wind.type.code   : chr  "N" "N" "N" "N" ...
##  $ wind.sp          : num  5.7 8.2 6.7 5.1 2.1 0 1.5 2.1 2.6 1.5 ...
##  $ wind.sp.qc       : chr  "5" "5" "5" "5" ...
##  $ ceiling.ht       : int  22000 22000 22000 22000 22000 22000 22000 22000 22000 22000 ...
##  $ ceiling.ht.qc    : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ ceiling.ht.method: chr  "9" "9" "9" "9" ...
##  $ sky.cond         : chr  "N" "N" "N" "N" ...
##  $ vis.dist         : int  16093 16093 16093 16093 16093 16093 16093 16093 16093 16093 ...
##  $ vis.dist.qc      : chr  "5" "5" "5" "5" ...
##  $ vis.var          : chr  "N" "N" "N" "N" ...
##  $ vis.var.qc       : chr  "5" "5" "5" "5" ...
##  $ temp             : num  37.2 35.6 34.4 33.3 32.8 31.1 29.4 28.9 27.2 26.7 ...
##  $ temp.qc          : chr  "5" "5" "5" "5" ...
##  $ dew.point        : num  10.6 10.6 7.2 5 5 5.6 6.1 6.7 7.8 7.8 ...
##  $ dew.point.qc     : chr  "5" "5" "5" "5" ...
##  $ atm.press        : num  1010 1010 1011 1012 1013 ...
##  $ atm.press.qc     : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ rh               : num  19.9 21.8 18.5 16.9 17.4 ...
##  - attr(*, ".internal.selfref")=<externalptr>
```

---

## Check variables: Python

``` python
met.info()
```

```
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 2377343 entries, 0 to 2377342
## Data columns (total 30 columns):
##  #   Column             Dtype  
## ---  ------             -----  
##  0   USAFID             int64  
##  1   WBAN               int64  
##  2   year               int64  
##  3   month              int64  
##  4   day                int64  
##  5   hour               int64  
##  6   min                int64  
##  7   lat                float64
##  8   lon                float64
##  9   elev               int64  
##  10  wind.dir           float64
##  11  wind.dir.qc        object 
##  12  wind.type.code     object 
##  13  wind.sp            float64
##  14  wind.sp.qc         object 
##  15  ceiling.ht         float64
##  16  ceiling.ht.qc      int64  
##  17  ceiling.ht.method  object 
##  18  sky.cond           object 
##  19  vis.dist           float64
##  20  vis.dist.qc        object 
##  21  vis.var            object 
##  22  vis.var.qc         object 
##  23  temp               float64
##  24  temp.qc            object 
##  25  dew.point          float64
##  26  dew.point.qc       object 
##  27  atm.press          float64
##  28  atm.press.qc       int64  
##  29  rh                 float64
## dtypes: float64(10), int64(10), object(10)
## memory usage: 544.1+ MB
```

---

## Check variables

* First, we see that `str()` gives us the class of the data, which in this case is both `data.table` and `data.frame`, as well as the dimensions of the data
* We also see the variable names and their types (integer, numeric, character)
* We can identify major problems with the data at this stage (e.g., a variable that has all missing values; see the sketch after the summaries below)

--

We can get summary statistics on our `data.table` using `summary()` as we would with a regular `data.frame`

---

## Check variables: R

```
##       lat             lon         
##  Min.   :24.55   Min.   :-124.29  
##  1st Qu.:33.97   1st Qu.: -98.02  
##  Median :38.35   Median : -91.71  
##  Mean   :37.94   Mean   : -92.15  
##  3rd Qu.:41.94   3rd Qu.: -82.99  
##  Max.   :48.94   Max.   : -68.31  
##                                   
##       elev           wind.dir     
##  Min.   : -13.0   Min.   :  3     
##  1st Qu.: 101.0   1st Qu.:120     
##  Median : 252.0   Median :180     
##  Mean   : 415.8   Mean   :185     
##  3rd Qu.: 400.0   3rd Qu.:260     
##  Max.   :9999.0   Max.   :360     
##                   NA's   :785290  
##  wind.dir.qc        wind.type.code    
##  Length:2377343     Length:2377343    
##  Class :character   Class :character  
##  Mode  :character   Mode  :character  
##                                       
##                                       
##                                       
## 
```

---

## Check variables: Python

``` python
met.describe()
```

```
##              USAFID          WBAN  ...  atm.press.qc            rh
## count  2.377343e+06  2.377343e+06  ...  2.377343e+06  2.310917e+06
## mean   7.230995e+05  2.953885e+04  ...  7.728208e+00  7.164138e+01
## std    2.156678e+03  3.336921e+04  ...  2.018170e+00  2.275361e+01
## min    6.901500e+05  1.160000e+02  ...  1.000000e+00  8.334298e-01
## 25%    7.209280e+05  3.706000e+03  ...  5.000000e+00  5.579015e+01
## 50%    7.227280e+05  1.386000e+04  ...  9.000000e+00  7.655374e+01
## 75%    7.250900e+05  5.476700e+04  ...  9.000000e+00  9.062916e+01
## max    7.268130e+05  9.499800e+04  ...  9.000000e+00  1.000000e+02
## 
## [8 rows x 20 columns]
```
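---

## Check variables

To spot a variable that is mostly or entirely missing, here is a minimal sketch (not in the original slides; base R only, applied to the `met` object above):

``` r
# Count and proportion of missing values in each column; a column that
# is entirely NA would show a count of 2377343 here.
colSums(is.na(met))
round(colMeans(is.na(met)), 3)
```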
---

## Check variables more closely

We know that we are supposed to have hourly measurements of weather data for the month of August 2019 for the entire US. We should check that we have all of these components.

Let us check:

* the year
* the month
* the hours
* the range of locations (latitude and longitude)

---

## Check variables more closely

We can generate tables for hour (top) and month (bottom) for integer values using `table()`

```
## 
##      0      1      2      3      4      5      6 
##  99434  93482  93770  96703 110504 112128 106235 
##      7      8      9     10     11     12     13 
## 101985 100310 102915 101880 100470 103605  97004 
##     14     15     16     17     18     19     20 
##  96507  97635  94942  94184 100179  94604  94928 
##     21     22     23 
##  96070  94046  93823
```

```
## 
##       8 
## 2377343
```

---

## Check variables more closely

For numeric values we should do a summary to see the quantiles, max, and min

``` r
table(met$year)
```

```
## 
##    2019 
## 2377343
```

``` r
summary(met$lat)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   24.55   33.97   38.35   37.94   41.94   48.94
```

``` r
summary(met$lon)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -124.29  -98.02  -91.71  -92.15  -82.99  -68.31
```

---

## Check variables more closely

If we return to our initial question (which weather stations reported the hottest and coldest temperatures?), we should take a closer look at our key variable, temperature (`temp`)

``` r
summary(met$temp)
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -40.00   19.60   23.50   23.59   27.80   56.00   60089
```

It looks like the temperatures are in Celsius. A temperature of -40 in August is really cold, so we should check whether this value is plausible.

---

## Check variables more closely

It also looks like there are a lot of missing data. Let us check the proportion of missing values

``` r
mean(is.na(met$temp))
```

```
## [1] 0.0252757
```

2.5% of the data are missing, which is not a huge amount.

---

## Check variables more closely

In `data.table` we can easily subset the data and select certain columns

``` r
met_ss <- met[temp == -40.00, .(hour, lat, lon, elev, wind.sp)]

dim(met_ss)
```

```
## [1] 36  5
```

``` r
summary(met_ss)
```

```
##       hour            lat             lon        
##  Min.   : 0.00   Min.   :29.12   Min.   :-89.55  
##  1st Qu.: 2.75   1st Qu.:29.12   1st Qu.:-89.55  
##  Median : 5.50   Median :29.12   Median :-89.55  
##  Mean   : 5.50   Mean   :29.12   Mean   :-89.55  
##  3rd Qu.: 8.25   3rd Qu.:29.12   3rd Qu.:-89.55  
##  Max.   :11.00   Max.   :29.12   Max.   :-89.55  
##                                                  
##       elev       wind.sp    
##  Min.   :36   Min.   : NA   
##  1st Qu.:36   1st Qu.: NA   
##  Median :36   Median : NA   
##  Mean   :36   Mean   :NaN   
##  3rd Qu.:36   3rd Qu.: NA   
##  Max.   :36   Max.   : NA   
##               NA's   :36
```

---

## Check variables more closely

In `dplyr` we can do the same thing using `filter` and `select`

``` r
met_ss <- filter(met, temp == -40.00) |>
  select(USAFID, day, hour, lat, lon, elev, wind.sp)

dim(met_ss)
```

```
## [1] 36  7
```

``` r
names(met_ss)
```

```
## [1] "USAFID"  "day"     "hour"    "lat"    
## [5] "lon"     "elev"    "wind.sp"
```

---

## Validate against an external source

We should check outside sources to make sure that our data make sense. For example, the observation of -40C is suspicious, so we should look up the location of the weather station.

Go to [Google Maps](https://www.google.com/maps/) and enter the coordinates for the site with -40C (29.12, -89.55)

<img src="img/map.png" width="45%" height="45%" style="display: block; margin: auto;" />

It doesn't make much sense to have a -40C reading in the Gulf of Mexico off the coast of Louisiana.
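---

## Validate against an external source

An alternative to copying coordinates into Google Maps is to plot the suspicious site directly. A minimal sketch with `leaflet` (loaded earlier; not in the original slides, coordinates taken from the summary above):

``` r
# Drop a marker at the station that reported -40C.
leaflet() %>%
  addProviderTiles('CartoDB.Positron') %>%
  addMarkers(lat = 29.12, lng = -89.55, popup = "Reported -40C in August")
```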
---

## Summary statistics

If we return to our initial question, we need to generate a list of weather stations ordered by temperature. We can then examine the top and bottom of this new dataset.

First let us remove the -40C observations and then sort. In `data.table` we can use `order()` to sort.

With over 2 million observations, `data.table` is the best solution since it is more computationally efficient.

``` r
met <- met[temp > -40]
setorder(met, temp)
```

This is equivalent to

``` r
met <- met[temp > -40][order(temp)]
```

---

## Summary statistics

``` r
head(met)[, c(1, 8:10, 24)]
```

```
##    USAFID    lat      lon  elev  temp
##     <int>  <num>    <num> <int> <num>
## 1: 726764 44.683 -111.116  2025  -3.0
## 2: 726764 44.683 -111.116  2025  -3.0
## 3: 726764 44.683 -111.116  2025  -3.0
## 4: 726764 44.683 -111.116  2025  -3.0
## 5: 720411 36.422 -105.290  2554  -2.4
## 6: 726764 44.683 -111.116  2025  -2.0
```

``` r
tail(met)[, c(1, 8:10, 24)]
```

```
##    USAFID    lat      lon  elev  temp
##     <int>  <num>    <num> <int> <num>
## 1: 720267 38.955 -121.081   467  52.0
## 2: 690150 34.300 -116.166   696  52.8
## 3: 690150 34.296 -116.162   625  52.8
## 4: 690150 34.300 -116.166   696  53.9
## 5: 690150 34.300 -116.166   696  54.4
## 6: 720267 38.955 -121.081   467  56.0
```

---

## Summary statistics

The maximum hourly temperature is 56C at site 720267, and the minimum hourly temperature is -3C at site 726764.

---

## Summary statistics

We need to transform our data to answer our initial question. Let's compute the **daily** mean temperature for each station in `data.table` and sort

``` r
met_daily <- met[, .(
  temp = mean(temp),
  lat  = mean(lat),
  lon  = mean(lon),
  elev = mean(elev)
), by = c("USAFID", "day")][order(temp)]

head(met_daily)
```

```
##    USAFID   day     temp   lat   lon  elev
##     <int> <int>    <num> <num> <num> <num>
## 1: 726130    11 4.278261 44.27 -71.3  1909
## 2: 726130    31 4.304348 44.27 -71.3  1909
## 3: 726130    10 4.583333 44.27 -71.3  1909
## 4: 726130    25 4.705882 44.27 -71.3  1909
## 5: 726130    24 5.050000 44.27 -71.3  1909
## 6: 726130    23 5.500000 44.27 -71.3  1909
```

``` r
tail(met_daily)
```

```
##    USAFID   day     temp      lat       lon     elev
##     <int> <int>    <num>    <num>     <num>    <num>
## 1: 722749     5 40.85714 33.26900 -111.8120 379.0000
## 2: 723805     5 40.97500 34.76800 -114.6180 279.0000
## 3: 720339    14 41.00000 32.14600 -111.1710 737.0000
## 4: 723805     4 41.18333 34.76800 -114.6180 279.0000
## 5: 722787     5 41.35714 33.52700 -112.2950 325.0000
## 6: 690150    31 41.71667 34.29967 -116.1657 690.0833
```

---

## Summary statistics

Let's compute the same **daily** means in `dplyr`

``` r
met_daily_dplyr <- met |>
  group_by(USAFID, day) |>
  summarize(temp = mean(temp)) |>
  arrange(desc(temp))
```

(try it yourself)

---

## Summary statistics

The maximum **daily** mean temperature is 41.72C at site 690150 and the minimum is 4.28C at site 726130.
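---

## Summary statistics

To double-check these two values without scrolling the sorted table, a minimal sketch (not in the original slides; uses only `met_daily` from above):

``` r
# which.min()/which.max() find the extreme rows regardless of sort order.
met_daily[which.min(temp)]  # site 726130, about 4.28C
met_daily[which.max(temp)]  # site 690150, about 41.72C
```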
---

## Exploratory Graphs

With exploratory graphs we aim to:

* debug any issues remaining in the data
* understand properties of the data
* look for patterns in the data
* inform modeling strategies

Exploratory graphs do not need to be perfect; we will look at presentation-ready plots next week.

---

## Exploratory Graphs

Examples of exploratory graphs include:

* histograms (for continuous data)
* boxplots (for continuous data)
* scatterplots (for continuous data)
* barplots (for categorical data)
* simple maps (for spatial data)

---

## Exploratory Graphs

Focusing on the variable of interest, temperature, let's look at the distribution (after removing -40C) using a histogram, `hist()`

``` r
hist(met$temp)
```

<img src="JSC370-slides-03_files/figure-html/unnamed-chunk-16-1.png" width="40%" height="40%" style="display: block; margin: auto;" />

---

## Exploratory Graphs

Let's look at the daily data

``` r
hist(met_daily$temp)
```

<img src="JSC370-slides-03_files/figure-html/unnamed-chunk-17-1.png" width="30%" height="30%" style="display: block; margin: auto;" />

---

## Exploratory Graphs

A boxplot gives us an idea of the quantiles of the distribution and any outliers

``` r
boxplot(met$temp, col = "blue")
```

<img src="JSC370-slides-03_files/figure-html/unnamed-chunk-18-1.png" width="30%" height="30%" style="display: block; margin: auto;" />

---

## Exploratory Graphs

Let's look at the daily data

``` r
boxplot(met_daily$temp, col = "blue")
```

<img src="JSC370-slides-03_files/figure-html/unnamed-chunk-19-1.png" width="30%" height="30%" style="display: block; margin: auto;" />

---

## Exploratory Graphs

A map will show us where the weather stations are located. First let's get the unique latitudes and longitudes and see how many meteorological sites there are

``` r
met_stations <- unique(met[, c("lat", "lon")])
dim(met_stations)
```

```
## [1] 2827    2
```

---

## Exploratory Graphs

A simple map can be made with the `maps` package

``` r
library(maps)
map("world", "usa")
points(met_stations$lon, met_stations$lat, pch = 19, col = 'red')
```

<img src="JSC370-slides-03_files/figure-html/unnamed-chunk-21-1.png" style="display: block; margin: auto;" />

---

## Exploratory Graphs

An interactive map is helpful because it provides a basemap (background)

``` r
leaflet(met_stations) %>%
  addProviderTiles('CartoDB.Positron') %>%
  addCircles(lat = ~lat, lng = ~lon,
             opacity = 1, fillOpacity = 1, radius = 100)
```
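---

## Exploratory Graphs

With 2,827 stations the circles can overplot at low zoom. A variation (a sketch, not in the original slides; same `met_stations` object) clusters nearby markers:

``` r
# markerClusterOptions() groups nearby points and expands them on zoom.
leaflet(met_stations) %>%
  addProviderTiles('CartoDB.Positron') %>%
  addCircleMarkers(lat = ~lat, lng = ~lon,
                   clusterOptions = markerClusterOptions())
```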
---

## Exploratory Graphs - Mapping

Let's map the locations of the max and min daily temperatures.

``` r
min <- met_daily[1]  # First observation.
max <- met_daily[.N] # Last obs, .N is a special symbol in data.table

leaflet() %>%
  addProviderTiles('CartoDB.Positron') %>%
  addCircles(
    data = min,
    lat = ~lat, lng = ~lon, popup = "Min temp.",
    opacity = 1, fillOpacity = 1, radius = 400, color = "blue"
  ) %>%
  addCircles(
    data = max,
    lat = ~lat, lng = ~lon, popup = "Max temp.",
    opacity = 1, fillOpacity = 1, radius = 1400, color = "red"
  )
```

---

## Mapping

(The interactive map renders here, with the minimum temperature station in blue and the maximum in red.)
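---

## Exploratory Graphs - Mapping

An aside on `.N` (a sketch, not in the original slides): because it evaluates to the number of rows, both extremes can also be pulled in a single subset instead of two.

``` r
# First and last rows of the temp-sorted met_daily, i.e. the coldest
# and hottest station-days.
met_daily[c(1, .N)]
```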
---

## Exploratory Graphs

Scatterplots help us look at pairwise relationships. Let's see if there is any trend in temperature with latitude

``` r
plot(met_daily$lat, met_daily$temp, pch = 19, cex = 0.5)
```

<img src="JSC370-slides-03_files/figure-html/unnamed-chunk-25-1.png" width="30%" height="30%" style="display: block; margin: auto;" />

---

## Exploratory Graphs

This is equivalent to

``` r
met_daily[, plot(lat, temp, pch = 19, cex = 0.5)]
```

<img src="JSC370-slides-03_files/figure-html/plot_daily-bis-1.png" width="25%" style="display: block; margin: auto;" />

There is a clear decrease in temperature as latitude increases (i.e., as you go north).

---

## Exploratory Graphs

We can add a simple linear regression line to this plot using `lm()` and `abline()`. We can also add a title and change the axis labels.

``` r
mod <- lm(temp ~ lat, data = met_daily)

met_daily[, plot(
  lat, temp, pch = 19, cex = 0.5,
  main = "Temperature and Latitude",
  xlab = "Latitude",
  ylab = "Temperature (deg C)")
]
abline(mod, lwd = 2, col = "red")
```

---

## Exploratory Graphs

<img src="JSC370-slides-03_files/figure-html/unnamed-chunk-27-1.png" width="45%" style="display: block; margin: auto;" />

---

## Using `ggplot2` (more next class)

``` r
ggplot(data = met_daily, mapping = aes(x = lat, y = temp)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Temperature and Latitude",
       x = "Latitude",
       y = "Temperature (deg C)") +
  theme_bw()
```

---

## Using `ggplot2` (next class)

<img src="JSC370-slides-03_files/figure-html/unnamed-chunk-29-1.png" width="45%" height="30%" style="display: block; margin: auto;" />

---

## Summary

In EDA we:

* have an initial question that we aim to answer
* import, check, clean the data
* perform any data transformations to answer the initial question
* make some basic graphs to examine the data and visualize the initial question
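---

## Exercise: temperature and humidity

The third initial question, whether temperature and humidity covary, was not answered above. A minimal sketch (not in the original slides; assumes the `rh` column seen in `str()` is relative humidity, and `met_daily2` is a new name):

``` r
# Daily station means of temperature and relative humidity, then their
# correlation and a quick scatterplot; rh has NAs, hence na.rm/complete.obs.
met_daily2 <- met[, .(temp = mean(temp), rh = mean(rh, na.rm = TRUE)),
                  by = .(USAFID, day)]
cor(met_daily2$temp, met_daily2$rh, use = "complete.obs")
plot(met_daily2$rh, met_daily2$temp, pch = 19, cex = 0.3)
```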