Documentation

In this data dive of the Seoul Bike Share data set, I investigate the importance of data documentation by identifying variables with unclear meanings, exploring the documentation, and visualizing variables to gain clarity. I highlight examples of poor documentation in the data set, discuss their potential consequences, and suggest improvements to the documentation. I also look into missing values, empty groups, and outliers.

Unclear Column Names

These three variables have unclear meanings or documentation errors:

  • Date

    • Date is formatted as dd/mm/yyyy, an irregular format that needed to be cleaned within the program (demonstrated below). The documentation incorrectly states that the format is yyyy-mm-dd.
  • Visibility (10m)

    • The range of values in this column is [27, 2000]. It is not clear what the units mean or how they relate to “10m.”
  • Functioning Day

    • There are no details about what it means for the bike share program to be down. Are all bikes down? Just a percentage? The only information available, the name Functioning Day itself, implies the system is down for an entire day, yet each observation represents an hour.

The format and meaning of these variables are unclear from the documentation, and the description column that could provide more detail is left blank.

Variable Table of Information in the Seoul Bike Data Documentation

The documentation includes a second table that provides more insight into the attribute meanings. Date is described as yyyy-mm-dd, but the actual data uses dd/mm/yyyy. Visibility (10m) is still not clarified. The Functioning Day description is vague, but it at least clarifies that the attribute refers to whether the bike share system was working during that hour.

Second Variable Table of Information in the Seoul Bike Data Documentation

If I had not read the documentation, I would have relied on exploratory data analysis to figure out the meanings of the unclear columns. This would have been simple for Functioning Day because the data is clearly broken out by hour rather than by date, and some dates have both “Yes” and “No” values for Functioning Day, as shown in Figure 1.

SeoulBikeData |>
  group_by(date) |>
  summarize(uptime = mean(functioning_day == "Yes")) |>
  ggplot(aes(x = uptime)) +
  geom_histogram() +
  labs(
    title = "Histogram of Daily System Uptime in Seoul Bike Data",
    x = "Proportion of Hours the System Was Functioning in a Day",
    y = "Count of Days"
  )
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Figure 1
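
To confirm the hourly granularity directly, one could also list the dates that mix “Yes” and “No” values; a quick check, using the same cleaned column names as in the code above:

SeoulBikeData |>
  group_by(date) |>
  summarize(down_hours = sum(functioning_day == "No")) |>
  # Dates with some, but not all, hours down can only occur if
  # Functioning Day is recorded per hour rather than per day
  filter(down_hours > 0, down_hours < 24)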

Having read the documentation, I know that the incorrect Date format listed there could have caused errors in my analysis had I not thoroughly explored the data. Visibility (10m) remained unclear, and additional research was needed to figure it out. For example, is it a measure of fog, daylight/moonlight, or both?
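
For instance, a cleaning step that trusts the documented year-first format silently yields NA rather than failing loudly, while the true dd/mm/yyyy format parses cleanly. A minimal sketch with lubridate, assuming a raw Date column as in the source CSV:

library(dplyr)
library(lubridate)

# The documented yyyy-mm-dd format fails on the real values...
ymd("01/12/2017")  # NA, with a parse warning
# ...while the actual dd/mm/yyyy format parses correctly
dmy("01/12/2017")  # "2017-12-01"

# So the cleaning step must use dmy(), not ymd()
SeoulBikeData <- SeoulBikeData |>
  mutate(date = dmy(Date))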

The simplicity and intuitiveness of most fields in this data set alleviate most of the problems caused by poor documentation. For the non-intuitive fields, unclear documentation creates opportunities to misunderstand the variables and draw wrong conclusions. One way to prevent these misunderstandings is exploratory data analysis: by visualizing unclear attributes, you can gain some understanding of what they are trying to communicate.

For example, we see in Figure 2 that the average of visibility_10m grouped by hour of day follows a sinusoidal pattern. The peaks align with daylight hours, which suggests that sunlight makes up a portion of the value.

SeoulBikeData |>
  group_by(hour) |>
  summarize(mean_visibility = mean(visibility_10m)) |>
  ggplot(aes(x = hour, y = mean_visibility)) +
  geom_point() +
  labs(
    title = "Average Visibility by Hour in Seoul Bike Data",
    x = "Hour of Day",
    y = "Average Hourly Visibility"
  )

Figure 2

This still was not a full explanation. What about humidity?

# Fit a linear model of visibility on humidity
lm_model <- lm(visibility_10m ~ humid_pct, data = SeoulBikeData)
r2_value <- summary(lm_model)$r.squared

# Create the scatter plot with a regression line and R^2 annotation
SeoulBikeData |>
  ggplot(aes(x = humid_pct, y = visibility_10m)) +
  geom_point(alpha = 0.6, color = "blue") +
  geom_smooth(method = "lm", color = "red", se = TRUE) +
  annotate("text", x = max(SeoulBikeData$humid_pct) * 0.3,
           y = max(SeoulBikeData$visibility_10m) * 1.1,
           label = paste("R² =", round(r2_value, 3)),
           color = "black", size = 5, hjust = 0) +
  labs(
    title = "Relationship Between Humidity and Visibility in Seoul Bike Data",
    x = "Humidity (%)",
    y = "Visibility (meters)"
  )
## `geom_smooth()` using formula = 'y ~ x'

Figure 3

Figure 3 shows that about 30% of the variation in visibility_10m is explained by the humidity percentage, so weather affects visibility_10m as well.

After doing some research, I found that visibility_10m refers to visibility measured at 10 meters above the ground, which aligns with standard meteorological reporting. If this measure were not standard and exploratory data analysis were not insightful, one would need to contact the researcher who created the data set for more information.

Missing Values

The Seoul Bike Share data set has 8760 observations but no implicitly or explicitly missing values. However, the attribute “Functioning Day” is a binary variable whose “No” values correspond to a “Rented Bike Count” of zero because the bike share system was down. This encoding indicates that the creator of the data set had already cleaned the data to avoid missing values, likely to produce a data set ready for multivariate regression analysis.
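
This encoding is easy to verify; a quick check, assuming the cleaned names functioning_day and rented_bikes used throughout:

# Every non-functioning hour should report exactly zero rentals
SeoulBikeData |>
  filter(functioning_day == "No") |>
  summarize(all_zero = all(rented_bikes == 0))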

Surprisingly, the weather information was complete. The entities collecting this data may have had backup sensors in place in case of failure so that they could capture the conditions every hour.

If the attribute Functioning Day were not present, Rented Bike Count could have contained implicitly missing data, with the rows for hours when bike share was down absent entirely. However, given the completeness of the weather attributes, Rented Bike Count could not contain any implicitly missing rows. The Rented Bike Count data could have contained explicitly missing data if the counts were NA for hours when bike share was down.

sprintf("There are %d missing values.", sum(is.na(SeoulBikeData)))
## [1] "There are 0 missing values."
hour_counts <- SeoulBikeData |>
  group_by(date) |>
  summarize(count = n())

sprintf("The max observations in a single day is %d, and the minimum observations in a single day is %d. Therefore, there are no implicitly missing data.", max(hour_counts$count), min(hour_counts$count))
## [1] "The max observations in a single day is 24, and the minimum observations in a single day is 24. Therefore, there are no implicitly missing data."

Empty Groups

There are no empty groups among the categorical variables.

unique(SeoulBikeData$seasons)
## [1] "Winter" "Spring" "Summer" "Autumn"
unique(SeoulBikeData$holiday)
## [1] "No Holiday" "Holiday"
unique(SeoulBikeData$functioning_day)
## [1] "Yes" "No"

Outliers

I investigated the data set for outliers. First, I examined the sum of bikes rented each day. Figure 4 uses a bin width of 1,000 daily rentals, and visually there are no outliers.

daily_rented <- SeoulBikeData |>
  group_by(date) |>
  summarize(daily_sum = sum(rented_bikes))

daily_rented |>
  ggplot(aes(x = daily_sum)) +
  geom_histogram(binwidth = 1000) +
  labs(
    title = "Histogram of Daily Bike Rentals in Seoul Bike Data",
    x = "Daily Bikes Rented",
    y = "Count of Days"
  )

Figure 4

To be sure there were no outlying days for bike rental counts, I used the interquartile range (IQR), an arbitrary but widely used metric for assessing outliers. No values fell outside the “fences” formed by multiplying the IQR by 1.5 and adding/subtracting the result to/from the third/first quartile, respectively.

iqr <- IQR(daily_rented$daily_sum)
q1 <- quantile(daily_rented$daily_sum, probs = 0.25)
q3 <- quantile(daily_rented$daily_sum, probs = 0.75)

# Count the days beyond each fence; the first value of dim() is the row count
dim(filter(daily_rented, daily_sum > q3 + 1.5 * iqr))
## [1] 0 2
dim(filter(daily_rented, daily_sum < q1 - 1.5 * iqr))
## [1] 0 2

There are no outlier days for bike counts, but are there outlier hours?

iqr <- IQR(SeoulBikeData$rented_bikes)
q1 <- quantile(SeoulBikeData$rented_bikes, probs = 0.25)
q3 <- quantile(SeoulBikeData$rented_bikes, probs = 0.75)

# Count the hours beyond each fence
dim(filter(SeoulBikeData, rented_bikes > q3 + 1.5 * iqr))
## [1] 158  18
dim(filter(SeoulBikeData, rented_bikes < q1 - 1.5 * iqr))
## [1]  0 18

There are 158 instances of an hourly bike share count landing more than 1.5 × IQR above the third quartile.

SeoulBikeData |>
  ggplot(aes(x = rented_bikes)) +
  geom_histogram(binwidth = 100) +
  labs(
    title = "Histogram of Hourly Bike Rentals in Seoul Bike Data",
    x = "Hourly Bikes Rented",
    y = "Count of Hours"
  )

Figure 5

Figure 5 shows many hours whose rental counts fall far above the rest of the data points. These are clear outliers.
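
One way to probe these outliers further, reusing the hourly fences computed above, is to see which hours of the day they concentrate in:

# Tally the outlier hours by hour of day, most frequent first
SeoulBikeData |>
  filter(rented_bikes > q3 + 1.5 * iqr) |>
  count(hour, sort = TRUE)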

Some final questions to close on: what constraints exist on the count of rented bikes per hour? There are a finite number of rental bikes and people to ride them. How close do these outlier hours get to full system capacity? If more capacity were added to the system (more bike rental docks and more bikes at each station), how much greater could the outliers have been?
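
As a starting point for the capacity question, the observed peak gives a lower bound on what the system can handle; the actual fleet size is not documented:

# The largest hourly rental count observed in the data
max(SeoulBikeData$rented_bikes)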

Poor documentation adds workload for everyone using your data set. It can also lead to misunderstandings about the meaning of variables and, ultimately, incorrect conclusions.