Juan C Gonzalez - PAD 6833 - Research Data Selection

R Markdown - A tall white fountain played

options(scipen = 100) # Could not find a way to round away the exponents, so scientific notation is suppressed with scipen instead
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readxl)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats   1.0.0     ✔ readr     2.1.5
## ✔ ggplot2   4.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.3.0
## ✔ purrr     1.1.0     ✔ tidyr     1.3.1

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
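As a possible alternative to the global `scipen` option set at the top, here is a minimal sketch that formats a single number without scientific notation. The value is made up for illustration and is not taken from the report.

```r
# Hypothetical value for illustration only (not from the SBEDA report)
total_paid <- 170367099.25

# format() with scientific = FALSE affects only this call, not the whole session
formatted <- format(total_paid, scientific = FALSE, big.mark = ",", nsmall = 2)
print(formatted)

# round() changes the number itself; here it is expressed in millions
millions <- round(total_paid / 1e6, 1)
print(millions)
```

`options(scipen = 100)` is still the simpler choice when every printed number in the document should avoid exponents.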

sbeda_SetUp <- read_xlsx("sbeda-annual-report.xlsx", sheet = "S|M|WBE Utilization by Industry")

## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`

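sbeda_Report <- sbeda_SetUp
# The first row of the sheet holds the real column names: promote it, then drop it
colnames(sbeda_Report) <- unlist(sbeda_SetUp[1, ], use.names = FALSE)
sbeda_Report <- sbeda_Report[-1, ]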
summary(sbeda_Report)

##      Year             Industry         SBEDA Certification Type
##  Length:45          Length:45          Length:45               
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##  Aspirational Goal  Percent of Dollars Paid  Total Paid       
##  Length:45          Length:45               Length:45         
##  Class :character   Class :character        Class :character  
##  Mode  :character   Mode  :character        Mode  :character
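The header fix above (first data row becomes the column names) can be sketched on a toy data frame. The column names and values here are invented stand-ins, not the real worksheet. If the title row is the only thing above the header, passing `skip = 1` to `read_xlsx()` might avoid this step, but that is an assumption about the workbook layout.

```r
# Toy stand-in for the imported sheet: the real names sit in row 1 (made-up data)
raw <- data.frame(
  X1 = c("Year", "2023", "2024"),
  X2 = c("Total Paid", "100", "N/A"),
  stringsAsFactors = FALSE
)

# Promote the first row to column names, then drop that row
promoted <- raw[-1, ]
names(promoted) <- unlist(raw[1, ], use.names = FALSE)
rownames(promoted) <- NULL

str(promoted)
```

Every column is still character after this step, which is why `summary()` only reports Length, Class, and Mode.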

sbeda_Clean <- sbeda_Report |> 
  mutate(across(c(`Aspirational Goal`, `Percent of Dollars Paid`, `Total Paid`), as.numeric)) |> 
  drop_na(`Aspirational Goal`)
summary(sbeda_Clean)

##      Year             Industry         SBEDA Certification Type
##  Length:30          Length:30          Length:30               
##  Class :character   Class :character   Class :character        
##  Mode  :character   Mode  :character   Mode  :character        
##                                                                
##                                                                
##                                                                
##  Aspirational Goal Percent of Dollars Paid   Total Paid       
##  Min.   :0.00100   Min.   :0.00000         Min.   :        0  
##  1st Qu.:0.00625   1st Qu.:0.01389         1st Qu.:   672449  
##  Median :0.03450   Median :0.06303         Median :  5673375  
##  Mean   :0.06617   Mean   :0.17097         Mean   : 19516986  
##  3rd Qu.:0.08250   3rd Qu.:0.33619         3rd Qu.: 16022300  
##  Max.   :0.28800   Max.   :0.59742         Max.   :170367099

# The columns were imported as character, so summary() did not report the N/A entries at first.
# Converting the three measure columns to numeric coerced those entries to NA, which made them visible to summary() and let drop_na() remove them.
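A small sketch of why the conversion matters, assuming the missing entries are stored as text such as "N/A". The vector below is made up and only mimics one of the columns.

```r
# Toy column mimicking a text column with "N/A" placeholders (made-up values)
goal_chr <- c("0.05", "N/A", "0.288", "N/A")

# As character, "N/A" is just a string, so is.na() finds nothing
missing_before <- sum(is.na(goal_chr))
print(missing_before)

# as.numeric() turns non-numeric strings into real NA values (normally with a warning)
goal_num <- suppressWarnings(as.numeric(goal_chr))
missing_after <- sum(is.na(goal_num))
print(missing_after)

# Once the NAs are real, the affected entries can be filtered out
kept <- goal_num[!is.na(goal_num)]
print(kept)
```

`read_xlsx()` also has an `na` argument, so something like `na = "N/A"` at import time might flag these cells earlier, depending on how they are written in the workbook.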

library(ggplot2)
hist(sbeda_Clean$`Percent of Dollars Paid`)

ggplot(sbeda_Clean, aes(x = `Percent of Dollars Paid`, y = `Total Paid`, colour = `SBEDA Certification Type`, shape = Industry)) +
  geom_point() +
  labs(x = "Percent of dollars paid by industry", y = "Total paid")

cor(sbeda_Clean$`Percent of Dollars Paid`, sbeda_Clean$`Total Paid`)

## [1] 0.6123938
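`cor()` uses the Pearson method by default. Because `Total Paid` is strongly skewed (the maximum is far above the median), one possible follow-up is to compare it with a rank-based correlation and to look at a confidence interval. The sketch below uses simulated data, not the SBEDA values, so the numbers it prints say nothing about the report.

```r
# Simulated, skewed data standing in for the two columns (not the SBEDA values)
set.seed(1)
pct_paid   <- runif(30, min = 0, max = 0.6)
total_paid <- exp(rnorm(30, mean = 15 + 3 * pct_paid, sd = 1))

pearson  <- cor(pct_paid, total_paid)                       # default method
spearman <- cor(pct_paid, total_paid, method = "spearman")  # rank-based, less sensitive to outliers

# cor.test() adds a confidence interval and p-value for the Pearson estimate
test_result <- cor.test(pct_paid, total_paid)
print(c(pearson = pearson, spearman = spearman))
print(test_result$conf.int)
```

With only 30 rows, the confidence interval is likely to be wide, so the single `cor()` value is best read as a rough indication.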

The data set comes with a data dictionary, included as a worksheet in the Excel workbook.
No sensitive data is present.
This is public data.
After cleaning, only about 30 rows with numeric values are plotted on the graph, so a larger data set could be explored.
Yes. The data comes from the Economic Development Department of the City of San Antonio, so it gives insight into how the city promotes small businesses.

2025-09-12