I chose to explore the “Dataset Freshness” dataset from the MD Department of Information Technology. The dataset serves as an inventory for the 1665 datasets hosted in the MD Open Data Portal and sister sites within data.maryland.gov, and how “fresh” they are, meaning how recently and frequently they’re updated. This data is used to by administrators to keep track of the data, generate reports for council meetings, and ensure data stays up to date.
I accessed the data on Nov 13, 2024, but it was created Aug 12, 2015, and has been updated daily since.As a data science student, having access to all these publicly available datasets is massively helpful, but parsing the catalog for data that is relevant to whatever I’m trying to do can be difficult. I thought it’d be interesting to explore data about data.
There are 18 variables, including the datasets’ names, column names, number of rows, dates of and days since last update, source, department managing the data, tags and keywords, a judgement on how “fresh” the dataset is, and more. Most of them are categorical, but some can be converted to numeric as well, as I will do to find the number of tags/keywords and columns.
Links
The dataset: https://opendata.maryland.gov/Administrative/Dataset-Freshness/8ypa-c9d9/about_data
More information: https://opendata.maryland.gov/stories/s/xdqw-5b5w
Data Exploration
setup
#load necessary librarieslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#set working directorysetwd("C:/Users/zivsa/Desktop/OneDrive - montgomerycollege.edu/school/DATA 110/W11 - Project 2")#import the datadsets_raw <-read_csv("Dataset_Freshness_20241113.csv")
Rows: 1665 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (16): Unique Identifier, Dataset Name, Link, Agency Performing Data Upda...
dbl (2): Days Since Most Recent Data Change, Number of Rows
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cleaning
#a few rows inventory older datasets with weirdly formatted missing values, I don't know what they mean so I'm deciding to exclude them dsets <- dsets_raw |>filter(`Number of Rows`>-9999&`Days Since Most Recent Data Change`>-9999)#count the number of tags / keywords for each dataset. NULL means no tags / keywordsdsets <- dsets |>mutate(`Number of Tags / Keywords`=str_split(`Tags / Keywords`, ", ") |>lengths(), `Number of Tags / Keywords`=if_else(`Tags / Keywords`!="NULL", `Number of Tags / Keywords`, 0) )#count the number of columns for each datasetdsets <- dsets |>mutate(`Number of Columns`=str_split(`Column Names`, ", ") |>lengths())#change date columns to recognizable r datesdsets$`Date of Most Recent Data Change`<- dsets$`Date of Most Recent Data Change`|>as.Date.character("%m/%d/%Y")dsets$`Date of Most Recent Change (Data Change or Metadata Change)`<- dsets$`Date of Most Recent Change (Data Change or Metadata Change)`|>as.Date.character("%m/%d/%Y")#clean column namescolnames(dsets) <-gsub(" ", "_", colnames(dsets))colnames(dsets) <-str_replace_all(colnames(dsets), "[()/?]", "")
Call:
lm(formula = Days_Since_Most_Recent_Data_Change ~ Number_of_Columns +
Number_of_Tags__Keywords, data = dsets)
Residuals:
Min 1Q Median 3Q Max
-4227.6 -1192.2 -414.1 841.9 16823.3
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3058.616 112.362 27.221 < 2e-16 ***
Number_of_Columns -14.895 2.357 -6.319 3.57e-10 ***
Number_of_Tags__Keywords 34.307 6.261 5.479 5.09e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2375 on 1349 degrees of freedom
Multiple R-squared: 0.04479, Adjusted R-squared: 0.04337
F-statistic: 31.62 on 2 and 1349 DF, p-value: 3.782e-14
Equation: Days Since Most Recent Data Change = 2990.145 + 36.424(Number of Tags / Keywords) - 15.189(Number of Columns)
There’s a negative association between number of columns and days since most recent data change, but a stronger positive association between number of tags / keywords and days since most recent data change.
p-values: < 2e-16 for the intercept, 1.21e-10 for the number of columns, 5.43e-09 for the number of tags / keywords, 1.787e-15 for the entire model
All of these are very low, which indicates statistical significance of the model.
Adjusted R2: 0.04776
About 4.8% of the variation in days since the most recent data change of a dataset can be explained by the number of columns and the number of tags / keywords, which isn’t very much.
visualizing a regression
dsets |>ggplot(aes(y=Days_Since_Most_Recent_Data_Change, x=Number_of_Tags__Keywords))+geom_point(aes(colour = Portal, size=Number_of_Columns), alpha=.4)+scale_size_continuous(range=c(.5,3.5))+geom_smooth(method ="lm") +labs(caption ="source: MD Open Data Portal", title ="MD Open Data Datasets' Most Recent Update", x ="Number of Tags/Keywords", y ="Days Since Most Recent Data Change", size ="Number of Columns")
`geom_smooth()` using formula = 'y ~ x'
exploration of dataset size
dsets|>filter(Number_of_Rows<2000000) |>ggplot(aes(x=Number_of_Columns, y= Number_of_Rows)) +geom_point(aes(color=Days_Since_Most_Recent_Data_Change), alpha = .5) +labs(caption ="source: MD Open Data Portal", title ="MD Open Data Dataset Size", x ="Number of Columns", y ="Number of Rows", color ="Days Since Most Recent Data Change")
Even after filtering out some outliers with a very high amount of rows, a higher number of columns seems to be related to a lower number of rows. It’s difficult to glean anything about the most recent update through color when the points are so densely packed near the origin.
narrowing the data range by category
#creating a simpler "update needed?" columndsets <- dsets |>mutate(Update_Judgement =case_when( Updated_Recently_Enough %in%c("Yes", "Yes. The data are updated as needed, which makes evaluation difficult. As an approximate measure, this dataset is evaluated as updated recently enough because it has been updated in the past month.") ~"Yes", Updated_Recently_Enough =="No"~"No", T ~"Better Metadata Needed"))#are there categories of datasets that are more likely to be adequately updated?dsets |>ggplot(aes(y=Category)) +geom_bar( aes(fill = Update_Judgement)) +labs(caption ="source: MD Open Data Portal", title ="Is MD Data Outdated?", color ="Updated Recently Enough")
There seem to be a few categories that are applied to very few datasets. The majority of data is judged to be updated recently enough, with the most obvious exception being the transportation category.
focusing on transportation
#create new dataframe focused on transportation datatransets <- dsets |>filter(Category=="Transportation"); head(transets)
# A tibble: 6 × 21
Unique_Identifier Dataset_Name Link Agency_Performing_Da…¹ Owner
<chr> <chr> <chr> <chr> <chr>
1 1f7af0c9fa2c4d168f3f3b8ca21fe… Maryland Al… http… U.S. DOE mdim…
2 2e934685306d43a094f642c8dfb8e… Maryland Al… http… U.S. DOE mdim…
3 7eaa7eae084548d0b24c40323b235… Maryland Al… http… U.S. DOE mdim…
4 052945270da7408fbcfec43151851… Maryland Al… http… U.S. DOE mdim…
5 22e4569f85e94d49b9333bad0b27b… Maryland Al… http… U.S. DOE mdim…
6 3f4b959826c34480be3e4740e4ee0… Maryland An… http… MDOT SHA mdim…
# ℹ abbreviated name: ¹Agency_Performing_Data_Updates
# ℹ 16 more variables: Data_Provided_By <chr>, Source_URL <chr>,
# Update_Frequency <chr>, Date_of_Most_Recent_Data_Change <date>,
# Days_Since_Most_Recent_Data_Change <dbl>,
# Date_of_Most_Recent_Change_Data_Change_or_Metadata_Change <date>,
# Updated_Recently_Enough <chr>, Number_of_Rows <dbl>, Tags__Keywords <chr>,
# Column_Names <chr>, Missing_Metadata_Fields <chr>, Portal <chr>, …
Who is in charge of updating the data, and are they doing it?
#clean some of the multiples of agency namestransets <- transets |>mutate(Agency_Performing_Data_Updates =recode(Agency_Performing_Data_Updates,"Maryland Department of Transportation"="MDOT", "Department of Information Technology"="DoIT", "Department of Commerce"="COMMERCE"), Update_Frequency =recode(Update_Frequency, "Continual"="Continually", "NULL"="Unknown") )#visualize transets |>ggplot(aes(x=Number_of_Columns, y=Number_of_Tags__Keywords)) +geom_point(aes(color = Update_Judgement, size=Number_of_Rows), alpha = .4) +labs(caption ="source: MD Open Data Portal", title ="MD Transportation Data Inventory", color ="Updated Recently Enough?",x ="Number of Columns", y ="Number of Tags / Keywords",size ="Number of Rows" )
add interactivity: final visulaization
MD Transportation Data Inventory
library(highcharter) #load highcharter library
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Highcharts (www.highcharts.com) is a Highsoft software product which is
not free for commercial and Governmental use
#create visualizationhighchart() |>hc_add_series(data = transets,type ="scatter", hcaes(size =Number_of_Rows, x = Number_of_Columns, y = Number_of_Tags__Keywords, group = Update_Judgement) ) |>#hc_colorAxis() |>hc_tooltip( useHTML=T, headerFormat ="",pointFormat ="<strong><a href={point.Link}>{point.Dataset_Name}</a></strong> <br><em>{point.Number_of_Rows} rows, {point.Number_of_Columns} columns</em> <br><b>Last Updated:</b> {point.Date_of_Most_Recent_Change_Data_Change_or_Metadata_Change} <br><b>Agency:</b> {point.Agency_Performing_Data_Updates} <br><b>Update Frequency:</b> {point.Update_Frequency} <br><b>Updated Recently Enough?</b> {point.Update_Judgement}") |>hc_title(text ="MD Transportation Data Inventory") |>hc_caption(text ="source: MD Open Data Portal") |>hc_xAxis(title =list(text="Number of Columns")) |>hc_yAxis(title =list(text="Number of Tags / Keywords")) |>hc_legend(title =list(text="Updated Recently Enough?")) |>hc_add_theme(hc_theme_economist())
plot description
The scatter plot “MD Transportation Data Inventory” visualizes multiple variables of datasets that are categorized as transportation. Variables include the datasets’ size (columns on the x axis, point size scaled to number of rows), how many tags the dataset is labeled with on the website on the y axis, a judgement on whether the dataset was updated recently enough or not, and a few more variables in the intercative tooltip. I couldn’t figure out how to add the scale of the points to the legend to show it corresponds to the number of rows. It’s also difficult to identify a few of the points that overlap.
A pattern that I find interesting is the cluster of datasets in the top left with very high amounts of observations that are not up to date. By hovering over them, I also see they should be updated annually under the responsibility of the MDOT SHA Agency. I wonder if they’re more difficult to maintain or there are other factors influencing the agency’s ability to update the data as it should be. I also am intrigued by the gap between the majority of datasets with less than 30 columns and a the few that have about 80 or more, and that there’s no gradual decline. Even more interesting, however, is the less obvious gap in datasets with about 7-25 tags/keywords, but the fact that these are the datasets that are up to date, while those with less than 7 or more than 25 are largely out of date.