R Markdown - A tall white fountain played
options(scipen = 100) # Turn off scientific notation in printed output (used because I could not find a way to round the exponents away)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(readxl)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.5
## ✔ ggplot2 4.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.4 ✔ tibble 3.3.0
## ✔ purrr 1.1.0 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
sbeda_SetUp <- read_xlsx("sbeda-annual-report.xlsx", sheet = "S|M|WBE Utilization by Industry")
## New names:
## • `` -> `...2`
## • `` -> `...3`
## • `` -> `...4`
## • `` -> `...5`
## • `` -> `...6`
sbeda_Report <- sbeda_SetUp
colnames(sbeda_Report) <- as.character(unlist(sbeda_SetUp[1, ])) # The real header sits in the first data row, so promote it to column names
sbeda_Report <- sbeda_Report[-1,]
summary(sbeda_Report)
## Year Industry SBEDA Certification Type
## Length:45 Length:45 Length:45
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
## Aspirational Goal Percent of Dollars Paid Total Paid
## Length:45 Length:45 Length:45
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
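The header promotion above can be tried on a small stand-in table. This is only a sketch: the tibble `raw` and its column names `X1` and `X2` are made up for illustration and are not from the SBEDA workbook. If the sheet layout allows it, passing `skip = 1` to `read_xlsx()` might achieve the same result at read time, but that has not been checked against this file.

```r
library(tibble)

# Hypothetical stand-in for the raw sheet: placeholder column names,
# with the real header stored in the first data row.
raw <- tibble(
  X1 = c("Year", "2021", "2022"),
  X2 = c("Total Paid", "100", "250")
)

# Drop the header row from the data, then use its values as column names.
promoted <- raw[-1, ]
names(promoted) <- as.character(unlist(raw[1, ]))

names(promoted)
```

Every column is still character after this step, which is why `summary()` reports only length, class, and mode.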
sbeda_Clean <- sbeda_Report |>
  mutate(across(c(`Aspirational Goal`, `Percent of Dollars Paid`, `Total Paid`), as.numeric)) |>
  drop_na(`Aspirational Goal`)
summary(sbeda_Clean)
## Year Industry SBEDA Certification Type
## Length:30 Length:30 Length:30
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Aspirational Goal Percent of Dollars Paid Total Paid
## Min. :0.00100 Min. :0.00000 Min. : 0
## 1st Qu.:0.00625 1st Qu.:0.01389 1st Qu.: 672449
## Median :0.03450 Median :0.06303 Median : 5673375
## Mean :0.06617 Mean :0.17097 Mean : 19516986
## 3rd Qu.:0.08250 3rd Qu.:0.33619 3rd Qu.: 16022300
## Max. :0.28800 Max. :0.59742 Max. :170367099
# The value columns were read in as character rather than numeric, so summary() did not report the N/A entries. Converting those columns to numeric coerced the N/A entries to NA, which made them visible to summary() and let me remove them with drop_na().
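A minimal sketch of that coercion step on made-up data follows. The tibble `toy`, its values, and the assumption that missing cells hold the text "N/A" are all illustrative. `suppressWarnings()` is used only to hide the "NAs introduced by coercion" warning that `as.numeric()` raises for such text.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: numbers stored as text, with "N/A" marking missing cells.
toy <- tibble::tibble(
  Industry = c("Construction", "Goods", "Other"),
  `Aspirational Goal` = c("0.05", "N/A", "0.10"),
  `Total Paid` = c("1000", "2000", "N/A")
)

# While the column is character, "N/A" is ordinary text, not a missing value.
sum(is.na(toy$`Aspirational Goal`))

toy_clean <- toy |>
  # Text that cannot be parsed as a number becomes NA.
  mutate(across(c(`Aspirational Goal`, `Total Paid`),
                \(x) suppressWarnings(as.numeric(x)))) |>
  # Only rows missing an Aspirational Goal are dropped.
  drop_na(`Aspirational Goal`)

toy_clean
```

Because `drop_na()` is given a single column, a row that is missing only `Total Paid` is kept. Calling `drop_na()` with no column arguments would instead remove rows with a missing value in any column.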
library(ggplot2)
hist(sbeda_Clean$`Percent of Dollars Paid`)

ggplot(sbeda_Clean, aes(x = `Percent of Dollars Paid`, y = `Total Paid`, colour = `SBEDA Certification Type`, shape = Industry)) +
  geom_point() +
  labs(x = "Percent of dollars paid by industry", y = "Total paid")

cor(sbeda_Clean$`Percent of Dollars Paid`, sbeda_Clean$`Total Paid`)
## [1] 0.6123938
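To get a feel for how much a correlation from this few rows can vary, one option is an approximate 95% confidence interval from the Fisher z-transformation. This is a rough sketch, not part of the original analysis: it takes `r` from the output above and `n = 30` from the row count in `summary(sbeda_Clean)`, and it assumes roughly bivariate normal data, which the skewed `Total Paid` column may not satisfy.

```r
r <- 0.6123938 # correlation reported by cor() above
n <- 30        # rows in sbeda_Clean, per summary()

# Fisher z-transformation: atanh(r) is approximately normal
# with standard error 1 / sqrt(n - 3).
z <- atanh(r)
se <- 1 / sqrt(n - 3)

# Build the interval on the z scale, then transform back to the r scale.
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
round(ci, 2)
```

The interval runs from about 0.3 to about 0.8, so the estimate is quite imprecise at this sample size.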
- The data set comes with a data dictionary, included as a worksheet
in the Excel workbook
- No sensitive data is present
- This is public data
- Only about 30 data points are plotted on the graph. A larger data
set could be looked into.
- Yes. Because this data comes from the economic development
department of the City of San Antonio, it gives insight into how the
city helps promote small businesses