This file is a preliminary resource until official data is added to the
critstats package. You may also adapt this code to gather data for your
class project, thesis, or other academic work beyond what is provided
below. The content in this file comes from a variety of sources, which
you should be familiar with before accessing and analyzing any data.
Open a new .Rmd file and use {r setup, include=F} as the header of your first code chunk.
knitr::opts_chunk$set(echo = TRUE)
# Load necessary libraries
library(knitr)
library(kableExtra)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::group_rows() masks kableExtra::group_rows()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(readr)
library(dplyr)
library(tidyr)
We start with Pew to get an idea of the publicly available files.
We then load the data from the Annual Business Survey (ABS) Program.
Here, we extract data from the 2007 Survey of Business Owners (SBO) Public Use Microdata Sample (PUMS).
# Load necessary library
# install.packages("readr") # Uncomment if 'readr' is not installed
library(readr)
# Define the URL of the ZIP file
url <- "https://www2.census.gov/programs-surveys/sbo/datasets/2007/pums_csv.zip"
# Download the ZIP file
download.file(url, destfile = "pums_csv.zip")
# Unzip the file
unzip("pums_csv.zip", exdir = "pums_data")
# List the contents of the unzipped directory
files <- list.files("pums_data", full.names = TRUE)
print(files)
# Read the extracted CSV file (adjust the path to match a filename printed by list.files() above)
data <- read_csv("pums_data/pums.csv")
# Sample 10% of the rows and store the result so it can be reused
data_sample <- data %>% sample_frac(0.1)
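Because the download step runs every time the document is knitted, it can help to skip it when the ZIP file is already on disk. The following is a minimal sketch, not part of the original workflow: `download_if_missing()` and its `timeout_sec` argument are hypothetical names introduced here for illustration. It relies only on base R (`file.exists()`, `options()`, `download.file()`).

```r
# Hypothetical helper: download a file only if it is not already on disk.
download_if_missing <- function(url, destfile, timeout_sec = 600) {
  if (!file.exists(destfile)) {
    # R's default download timeout is 60 seconds, which may be too short
    # for a large archive; raise it temporarily and restore it on exit.
    old <- options(timeout = max(timeout_sec, getOption("timeout")))
    on.exit(options(old), add = TRUE)
    # mode = "wb" keeps binary files such as ZIP archives intact on Windows.
    download.file(url, destfile = destfile, mode = "wb")
  }
  invisible(destfile)
}

# Usage with the URL defined above (commented out to avoid an accidental download):
# zip_path <- download_if_missing(url, "pums_csv.zip")
# unzip(zip_path, exdir = "pums_data")
```

With this pattern, deleting the local ZIP file is enough to force a fresh download on the next run.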
Try to work with the data and generate some summary statistics.
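As a starting point, grouped summary statistics can be sketched with dplyr as shown below. The data frame `toy` and its columns `sector` and `receipts` are made-up stand-ins, not actual PUMS variable names; substitute your own data and the columns you care about.

```r
library(dplyr)

# Toy data standing in for the sampled rows; column names are hypothetical.
toy <- data.frame(
  sector   = c("A", "A", "B", "B", "B"),
  receipts = c(100, 300, 50, NA, 150)
)

# Per-group summary: row count plus mean and median of a numeric column,
# ignoring missing values.
toy_summary <- toy %>%
  group_by(sector) %>%
  summarise(
    n_rows          = n(),
    mean_receipts   = mean(receipts, na.rm = TRUE),
    median_receipts = median(receipts, na.rm = TRUE),
    .groups = "drop"
  )

print(toy_summary)
```

For a quick overview of every column at once, base R's `summary()` applied to the data frame is a reasonable first look before writing grouped summaries.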