W11-P2-MDData

Author

Project 2 - MD Datasets

Introduction & Background

I chose to explore the “Dataset Freshness” dataset from the MD Department of Information Technology. The dataset serves as an inventory for the 1665 datasets hosted in the MD Open Data Portal and sister sites within data.maryland.gov, and how “fresh” they are, meaning how recently and frequently they’re updated. This data is used to by administrators to keep track of the data, generate reports for council meetings, and ensure data stays up to date.

I accessed the data on Nov 13, 2024, but it was created Aug 12, 2015, and has been updated daily since.As a data science student, having access to all these publicly available datasets is massively helpful, but parsing the catalog for data that is relevant to whatever I’m trying to do can be difficult. I thought it’d be interesting to explore data about data.

There are 18 variables, including the datasets’ names, column names, number of rows, dates of and days since last update, source, department managing the data, tags and keywords, a judgement on how “fresh” the dataset is, and more. Most of them are categorical, but some can be converted to numeric as well, as I will do to find the number of tags/keywords and columns.

Links

The dataset: https://opendata.maryland.gov/Administrative/Dataset-Freshness/8ypa-c9d9/about_data

More information: https://opendata.maryland.gov/stories/s/xdqw-5b5w

Data Exploration

setup

#load necessary libraries
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#set working directory
setwd("C:/Users/zivsa/Desktop/OneDrive - montgomerycollege.edu/school/DATA 110/W11 - Project 2")

#import the data
dsets_raw <- read_csv("Dataset_Freshness_20241113.csv")

Rows: 1665 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Unique Identifier, Dataset Name, Link, Agency Performing Data Upda...
dbl  (2): Days Since Most Recent Data Change, Number of Rows

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

cleaning

#a few rows inventory older datasets with weirdly formatted missing values, I don't know what they mean so I'm deciding to exclude them 
dsets <- dsets_raw |> filter(`Number of Rows`>-9999 & `Days Since Most Recent Data Change`>-9999)

#count the number of tags / keywords for each dataset. NULL means no tags / keywords
dsets <- dsets |> 
  mutate(`Number of Tags / Keywords`= str_split(`Tags / Keywords`, ", ") |> lengths(), 
         `Number of Tags / Keywords`= if_else(`Tags / Keywords`!= "NULL", 
                                              `Number of Tags / Keywords`, 0)
         )

#count the number of columns for each dataset
dsets <- dsets |> 
  mutate(`Number of Columns`= str_split(`Column Names`, ", ") |> lengths())

#change date columns to recognizable r dates
dsets$`Date of Most Recent Data Change` <- dsets$`Date of Most Recent Data Change`|> as.Date.character("%m/%d/%Y")
dsets$`Date of Most Recent Change (Data Change or Metadata Change)` <- dsets$`Date of Most Recent Change (Data Change or Metadata Change)` |> as.Date.character("%m/%d/%Y")

#clean column names
colnames(dsets) <- gsub(" ", "_", colnames(dsets))
colnames(dsets) <- str_replace_all(colnames(dsets), "[()/?]", "")

linear regression analysis

dsets_fit <- lm(Days_Since_Most_Recent_Data_Change ~ Number_of_Columns + Number_of_Tags__Keywords,
                data = dsets)
summary(dsets_fit)


Call:
lm(formula = Days_Since_Most_Recent_Data_Change ~ Number_of_Columns + 
    Number_of_Tags__Keywords, data = dsets)

Residuals:
    Min      1Q  Median      3Q     Max 
-4227.6 -1192.2  -414.1   841.9 16823.3 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)              3058.616    112.362  27.221  < 2e-16 ***
Number_of_Columns         -14.895      2.357  -6.319 3.57e-10 ***
Number_of_Tags__Keywords   34.307      6.261   5.479 5.09e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2375 on 1349 degrees of freedom
Multiple R-squared:  0.04479,   Adjusted R-squared:  0.04337 
F-statistic: 31.62 on 2 and 1349 DF,  p-value: 3.782e-14

Equation: Days Since Most Recent Data Change = 2990.145 + 36.424(Number of Tags / Keywords) - 15.189(Number of Columns)

There’s a negative association between number of columns and days since most recent data change, but a stronger positive association between number of tags / keywords and days since most recent data change.

p-values: < 2e-16 for the intercept, 1.21e-10 for the number of columns, 5.43e-09 for the number of tags / keywords, 1.787e-15 for the entire model

All of these are very low, which indicates statistical significance of the model.

Adjusted R²: 0.04776

About 4.8% of the variation in days since the most recent data change of a dataset can be explained by the number of columns and the number of tags / keywords, which isn’t very much.

visualizing a regression

dsets |> ggplot(aes(y=Days_Since_Most_Recent_Data_Change, x=Number_of_Tags__Keywords))+
  geom_point(aes(colour = Portal, size=Number_of_Columns), alpha=.4)+
  scale_size_continuous(range=c(.5,3.5))+
  geom_smooth(method = "lm") + 
  labs(caption = "source: MD Open Data Portal", 
       title = "MD Open Data Datasets' Most Recent Update", 
       x = "Number of Tags/Keywords", 
       y = "Days Since Most Recent Data Change", 
       size = "Number of Columns")

`geom_smooth()` using formula = 'y ~ x'

exploration of dataset size

dsets|> filter(Number_of_Rows<2000000) |>ggplot(aes(x=Number_of_Columns, y= Number_of_Rows)) +
  geom_point(aes(color=Days_Since_Most_Recent_Data_Change), alpha = .5) +
  labs(caption = "source: MD Open Data Portal", 
       title = "MD Open Data Dataset Size", 
       x = "Number of Columns", y = "Number of Rows", 
       color = "Days Since Most Recent Data Change")

Even after filtering out some outliers with a very high amount of rows, a higher number of columns seems to be related to a lower number of rows. It’s difficult to glean anything about the most recent update through color when the points are so densely packed near the origin.

narrowing the data range by category

#creating a simpler "update needed?" column
dsets <- dsets |> mutate(Update_Judgement = case_when(
  Updated_Recently_Enough %in% c("Yes", "Yes. The data are updated as needed, which makes evaluation difficult. As an approximate measure, this dataset is evaluated as updated recently enough because it has been updated in the past month.") ~ "Yes", 
  Updated_Recently_Enough == "No" ~ "No", 
  T ~ "Better Metadata Needed"))

#are there categories of datasets that are more likely to be adequately updated?
dsets |> ggplot(aes(y=Category)) +
  geom_bar( aes(fill = Update_Judgement)) +
  labs(caption = "source: MD Open Data Portal", 
       title = "Is MD Data Outdated?", 
       color = "Updated Recently Enough")

There seem to be a few categories that are applied to very few datasets. The majority of data is judged to be updated recently enough, with the most obvious exception being the transportation category.

focusing on transportation

#create new dataframe focused on transportation data
transets <- dsets |> filter(Category=="Transportation"); head(transets)

# A tibble: 6 × 21
  Unique_Identifier              Dataset_Name Link  Agency_Performing_Da…¹ Owner
  <chr>                          <chr>        <chr> <chr>                  <chr>
1 1f7af0c9fa2c4d168f3f3b8ca21fe… Maryland Al… http… U.S. DOE               mdim…
2 2e934685306d43a094f642c8dfb8e… Maryland Al… http… U.S. DOE               mdim…
3 7eaa7eae084548d0b24c40323b235… Maryland Al… http… U.S. DOE               mdim…
4 052945270da7408fbcfec43151851… Maryland Al… http… U.S. DOE               mdim…
5 22e4569f85e94d49b9333bad0b27b… Maryland Al… http… U.S. DOE               mdim…
6 3f4b959826c34480be3e4740e4ee0… Maryland An… http… MDOT SHA               mdim…
# ℹ abbreviated name: ¹Agency_Performing_Data_Updates
# ℹ 16 more variables: Data_Provided_By <chr>, Source_URL <chr>,
#   Update_Frequency <chr>, Date_of_Most_Recent_Data_Change <date>,
#   Days_Since_Most_Recent_Data_Change <dbl>,
#   Date_of_Most_Recent_Change_Data_Change_or_Metadata_Change <date>,
#   Updated_Recently_Enough <chr>, Number_of_Rows <dbl>, Tags__Keywords <chr>,
#   Column_Names <chr>, Missing_Metadata_Fields <chr>, Portal <chr>, …

Who is in charge of updating the data, and are they doing it?

#clean some of the multiples of agency names
transets <- transets |> mutate(Agency_Performing_Data_Updates = recode(Agency_Performing_Data_Updates,
                                      "Maryland Department of Transportation" = "MDOT", 
                                      "Department of Information Technology" = "DoIT", 
                                      "Department of Commerce" = "COMMERCE"), 
                               Update_Frequency = recode(Update_Frequency, 
                                      "Continual" = "Continually", 
                                      "NULL" = "Unknown")
                               )

#visualize 
transets |> ggplot(aes(x=Number_of_Columns, y=Number_of_Tags__Keywords)) +
  geom_point(aes(color = Update_Judgement, size=Number_of_Rows), alpha = .4) +
  labs(caption = "source: MD Open Data Portal", 
       title = "MD Transportation Data Inventory", 
       color = "Updated Recently Enough?",
       x = "Number of Columns", y = "Number of Tags / Keywords",
       size = "Number of Rows"
  )

add interactivity: final visulaization

MD Transportation Data Inventory

library(highcharter) #load highcharter library

Registered S3 method overwritten by 'quantmod':
  method            from
  as.zoo.data.frame zoo

Highcharts (www.highcharts.com) is a Highsoft software product which is

not free for commercial and Governmental use

#create visualization
highchart() |>
  hc_add_series(
    data = transets,
    type = "scatter", 
    hcaes(size =Number_of_Rows, 
          x = Number_of_Columns, 
          y = Number_of_Tags__Keywords, 
          group = Update_Judgement)
  ) |> 
  #hc_colorAxis() |>
  hc_tooltip( useHTML=T, headerFormat = "",
              pointFormat = "<strong><a href={point.Link}>{point.Dataset_Name}</a></strong>
                            <br><em>{point.Number_of_Rows} rows, {point.Number_of_Columns} columns</em>
                            <br><b>Last Updated:</b> 
                            {point.Date_of_Most_Recent_Change_Data_Change_or_Metadata_Change}
                            <br><b>Agency:</b> {point.Agency_Performing_Data_Updates}
                            <br><b>Update Frequency:</b> {point.Update_Frequency}
                            <br><b>Updated Recently Enough?</b> {point.Update_Judgement}") |>
  hc_title(text = "MD Transportation Data Inventory") |>
  hc_caption(text = "source: MD Open Data Portal") |>
  hc_xAxis(title = list(text="Number of Columns")) |>
  hc_yAxis(title = list(text="Number of Tags / Keywords")) |>
  hc_legend(title = list(text="Updated Recently Enough?")) |> 
  hc_add_theme(hc_theme_economist())

plot description

The scatter plot “MD Transportation Data Inventory” visualizes multiple variables of datasets that are categorized as transportation. Variables include the datasets’ size (columns on the x axis, point size scaled to number of rows), how many tags the dataset is labeled with on the website on the y axis, a judgement on whether the dataset was updated recently enough or not, and a few more variables in the intercative tooltip. I couldn’t figure out how to add the scale of the points to the legend to show it corresponds to the number of rows. It’s also difficult to identify a few of the points that overlap.

A pattern that I find interesting is the cluster of datasets in the top left with very high amounts of observations that are not up to date. By hovering over them, I also see they should be updated annually under the responsibility of the MDOT SHA Agency. I wonder if they’re more difficult to maintain or there are other factors influencing the agency’s ability to update the data as it should be. I also am intrigued by the gap between the majority of datasets with less than 30 columns and a the few that have about 80 or more, and that there’s no gradual decline. Even more interesting, however, is the less obvious gap in datasets with about 7-25 tags/keywords, but the fact that these are the datasets that are up to date, while those with less than 7 or more than 25 are largely out of date.