TIM-8501 Week 5 Assignment

Introduction

For this week’s assignment, the State of Washington, Labor & Industries (L&I) contractor license and registration dataset was selected for analysis. The dataset is housed on the Data.WA.gov website and the source of the data is the Washington State Department of Labor & Industries (L&I) (Data.WA.gov, 2015 – 2023). The dataset was created on October 13, 2015, and it has been continuously updated three times per day through the present. According to the website, the dataset reflects the L&I contractor licensing data with 150k rows of such data with 23 columns of attributes collected per row. The website includes a data dictionary, which was thoroughly reviewed prior to starting analysis. A comma-separated values (CSV) version of the file was downloaded on July 18, 2023.

After uploading the CSV file to R Studio with stringsAsFactors = TRUE,the lsr (Navarro, 2015), ggplot2 (Wickham, 2016) and tidyverse (Wickham et al., 2019) packages were loaded from the library for use in the analysis process.

#Upload data set

LI <- read.csv("L_I_Contractor_License_Data_-_General.csv", stringsAsFactors = TRUE)

#Load packages from library

library(lsr)
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)

Variable Types

Using the str() function, the dimensions of the dataset were confirmed: 150,205 rows of observations with 23 columns of variables. The str() function also provided the variable type for each column. All variables are listed as factors (categorical) except for Phone Number, which is listed as numeric, and UBI, which is listed as an integer. However, Phone Number and UBI are actually both categorical variables as well since they do not represent measurements or have quantitative meaning.

#Detect the variables and data type

str(LI)

## 'data.frame':    150205 obs. of  23 variables:
##  $ BusinessName                 : Factor w/ 142051 levels "# JUAN HANDYMAN",..: 94800 106498 113275 70989 136569 67441 72278 88128 127009 27761 ...
##  $ ContractorLicenseNumber      : Factor w/ 150205 levels "04CONCL862CR",..: 100240 112084 119748 74409 144155 70059 76047 93065 41407 28120 ...
##  $ ContractorLicenseTypeCode    : Factor w/ 4 levels "CC","EC","LC",..: 2 1 1 1 1 1 1 2 1 1 ...
##  $ ContractorLicenseTypeCodeDesc: Factor w/ 4 levels "CONSTRUCTION CONTRACTOR",..: 2 1 1 1 1 1 1 2 1 1 ...
##  $ Address1                     : Factor w/ 126785 levels "# 2 Harbor View Drive",..: 72859 121750 50922 102984 75211 36772 85929 99130 102752 116365 ...
##  $ Address2                     : Factor w/ 5939 levels "","-","#-556",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ City                         : Factor w/ 3642 levels "","Abbotsford",..: 3195 792 3076 3301 2484 3623 1680 1008 949 582 ...
##  $ State                        : Factor w/ 67 levels "","AB","AK","AL",..: 64 64 64 64 64 64 64 64 64 64 ...
##  $ Zip                          : Factor w/ 8668 levels "10001","10003",..: 6935 6114 6855 7105 8390 7370 6178 4870 4918 8040 ...
##  $ PhoneNumber                  : num  2.53e+09 2.07e+09 2.53e+09 3.60e+09 5.10e+09 ...
##  $ LicenseEffectiveDate         : Factor w/ 11153 levels "1/1/1954","1/1/1962",..: 5592 6954 857 3401 8478 1585 1587 1752 4701 988 ...
##  $ LicenseExpirationDate        : Factor w/ 3370 levels "1/1/2017","1/1/2018",..: 1667 482 1134 1053 536 467 482 534 1675 482 ...
##  $ BusinessTypeCode             : Factor w/ 9 levels "","A","C","I",..: 6 4 4 3 4 4 6 6 6 6 ...
##  $ BusinessTypeCodeDesc         : Factor w/ 9 levels "","Association",..: 6 4 4 3 4 4 6 6 6 6 ...
##  $ SpecialtyCode1               : Factor w/ 84 levels "0","1","10","2",..: 9 50 49 2 35 45 50 10 71 58 ...
##  $ SpecialtyCode1Desc           : Factor w/ 87 levels "","ALL AREAS",..: 47 80 79 30 55 58 80 36 41 9 ...
##  $ SpecialtyCode2               : Factor w/ 16 levels "","10","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ SpecialtyCode2Desc           : Factor w/ 14 levels "","APPLIANCE REPAIR",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ UBI                          : int  604970689 602411448 602977969 602709309 604146482 604022998 604663640 602160098 603587534 602485300 ...
##  $ PrimaryPrincipalName         : Factor w/ 139504 levels "","'YEVICH KURKOV, TIMOFEY VALER",..: 10589 133673 112242 62534 104620 127401 64245 95579 39053 11665 ...
##  $ StatusCode                   : Factor w/ 11 levels "A","B","C","E",..: 1 1 1 1 1 4 1 4 1 1 ...
##  $ ContractorLicenseStatus      : Factor w/ 11 levels "ACTIVE","EXPIRED",..: 1 1 1 1 1 2 1 2 1 1 ...
##  $ ContractorLicenseSuspendDate : Factor w/ 3198 levels "","1/1/2015",..: 1 1 1 1 1 1 1 1 1 1 ...

Missing Data

The is.na() function was used to determine the extent of missing data in the dataset. This function revealed that there are 108 values missing from the dataset. The researcher selected the variable ContractorLicenseStatus for the univariate analysis in this assignment. An examination of the ContractorLicenseStatus variable in isolation found no data missing in this variable. Therefore, no action was taken to address the missing data for this assignment as the missing data will not affect the univariate analysis.

#Count missing values in dataset

sum(is.na(LI))

## [1] 108

#Count missing values in ContractorLicenseStatus

sum(is.na(LI$ContractorLicenseStatus))

## [1] 0

Discussion of Univariate Methods

Univariate analysis is used to analyze a single variable, seeking to discover patterns in the sample data that can be used to gain information from the larger population. There are 3 predominant types of univariate analyses: determining and examining 1) frequencies, 2) central tendency, and 3) dispersion. For the first method, examining the frequencies within a variable, frequency distribution tables are typically used. This method is most appropriate for categorical variables. On the other hand, the methods of examining central tendency by looking at the mean, median, or mode of the variable and analyzing dispersion, which includes investigating the range, variance, standard deviation, and quartiles of a variable, are only appropriate for continuous data. Since ContractorLicenseStatus is a categorical variable, the goal of univariate analysis for this assignment will be to determine the frequency or count of observations by category for the variable.

Per Mukhiya & Ahmed (2020), the primary steps for performing univariate analysis in Python or R are to import the required libraries and upload the dataset, check the data types of each column, compute the measure of central tendency, and visualize the analysis in a graphical manner. This approach applies to numeric variables but must be modified slightly for categorical variables. Here, instead of finding the measure of central tendency, the variable’s frequency distribution should be examined and plotted in a visualization. The data types have already been determined (as shown above), so the frequency distribution will be examined next through hypothesis testing.

Hypothesis Testing

The researcher designed a one-sample hypothesis test for the ContractorLicenseStatus variable since this assignment entails analysis of a single variable. The general steps for hypothesis testing are to develop research questions with associated null and alternate hypotheses, establish the rejection region, establish the significance level for the test, and lastly, perform the test (365 Careers, 2018). For numeric variables, testing is done to see if the mean value for the variable differs from a specific values, whereas for categorical variables, testing is done on the proportion to test how the categories (levels) are distributed.

Research Questions and Hypotheses

The following set of research questions (RQs) and corresponding null (H_O) and alternative (H_A) hypotheses were created:

RQ1: Are all categories of contractor licenses status evenly distributed in the dataset?
- H_O1: Each category in the contractor license status variable has the same probability of occurring in the dataset.
- H_A1: The categories in the contractor license status variable do not have the same probability of occurring in the dataset.
RQ2: Are more than 50% of contractor licenses in Active status?
- H_O2: The proportion of contractor licenses in Active status is greater than 50%.
- H_A2: The proportion of contractor licenses in Active status is less than 50%.

Chi-Square Goodness of Fit Test

To address the first research question (RQ1), the researcher performed a Chi-square goodness of fit test using the lsr package in R (Navarro, 2015) to determine how well the sample data fits the expected data; specifically, whether the probability of all factors having the same frequency is accurate. First, the rejection region was calculated using the qchisq() function with a significance level of 0.05 and degrees of freedom of 10 (since there are 11 factor levels for the ContractorLicenseStatus variable). Because the sample data pertains to humans and more random behavior is expected, a significance level of 0.05 is appropriate (365 Careers, 2018). The qchisq() function gave a value of 18.30704, meaning that if the chi-squared statistic is greater than 18.30704, the null hypothesis should be rejected. The goodnessOfFitTest() function was then applied to the ContractorLicenseStatus variable. The output shows the observed frequencies for each factor (the actual frequencies in the dataset) and the expected frequencies per the null hypothesis (each value being equally represented as 9.09% of the variable’s observations). The test yielded a chi-squared statistic of 470493.7 and a p-value of <0.001. Since the chi-squared statistic is significantly larger than the rejection region value, the null hypothesis should be rejected and the alternative hypothesis should be accepted. The small p-value (<0.001) indicates extremely strong evidence against the null hypothesis. It can therefore be concluded that the levels/factors in the contractor license status variable do not have the same probability of occurring in the dataset.

#Determine rejection region (using alpha = 0.05 and df=10)
qchisq(p = 0.95, df= 10)

## [1] 18.30704

#Goodness of fit test
goodnessOfFitTest(LI$ContractorLicenseStatus)

## 
##      Chi-square test against specified probabilities
## 
## Data variable:   LI$ContractorLicenseStatus 
## 
## Hypotheses: 
##    null:        true probabilities are as specified
##    alternative: true probabilities differ from those specified
## 
## Descriptives: 
##                        observed freq. expected freq. specified prob.
## ACTIVE                          74346          13655      0.09090909
## EXPIRED                         52457          13655      0.09090909
## INACTIVE                          175          13655      0.09090909
## OUT OF BUSINESS                  4291          13655      0.09090909
## PASSED AWAY                        24          13655      0.09090909
## RE-LICENSED                      9816          13655      0.09090909
## RESTORED FROM ARCHIVED              2          13655      0.09090909
## REVOKED DUE DEPT ERR                1          13655      0.09090909
## REVOKED LICENSE BY L&I              1          13655      0.09090909
## SUPERCEDED                         49          13655      0.09090909
## SUSPENDED                        9043          13655      0.09090909
## 
## Test results: 
##    X-squared statistic:  470493.7 
##    degrees of freedom:  10 
##    p-value:  <.001

One-Sample Proportion Test

For the second research question (RQ2), a one-sample proportions test was conducted using the prop.test() function in R. Using an observed frequency value of 74,346 for Active (which was obtained from the goodness of fit test), a sample size of 150,205 (which was obtained from the str() function), and the probability of 0.5 (or 50%), the test yielded a p value of 0.4949635. As shown in the R output below, since the p value is less than 0.5 (just barely), the null hypothesis should be rejected and the alternative hypothesis should be accepted. Hence, to can be concluded that the proportion of contractor licenses in Active status is less than 50%. However, since the p value is very close to the probability value, it can be inferred that the null hypothesis is only slightly off (which will be described in more detail below).

prop.test(x = 74346, n = 150205, p = 0.5, correct = FALSE,
          alternative = "less")

## 
##  1-sample proportions test without continuity correction
## 
## data:  74346 out of 150205, null probability 0.5
## X-squared = 15.24, df = 1, p-value = 4.733e-05
## alternative hypothesis: true p is less than 0.5
## 95 percent confidence interval:
##  0.0000000 0.4970856
## sample estimates:
##         p 
## 0.4949635

Continuing with Univariate Analysis

The researcher continued with univariate analysis by performing exploratory data analysis (EDA) on the ContractorLicenseStatus variable. The str() function confirmed that the ContractorLicenseStatus variable has 11 levels, or categories, of values, as determined by the goodnessOfFitTest() output. The summary() function provided a breakdown of each category by count, further confirming the observed frequency data from the goodness of fit test.

#Obtain basic statistical information for ContractorLicenseStatus variable
str(LI$ContractorLicenseStatus)

##  Factor w/ 11 levels "ACTIVE","EXPIRED",..: 1 1 1 1 1 2 1 2 1 1 ...

summary(LI$ContractorLicenseStatus)

##                 ACTIVE                EXPIRED               INACTIVE 
##                  74346                  52457                    175 
##        OUT OF BUSINESS            PASSED AWAY            RE-LICENSED 
##                   4291                     24                   9816 
## RESTORED FROM ARCHIVED   REVOKED DUE DEPT ERR REVOKED LICENSE BY L&I 
##                      2                      1                      1 
##             SUPERCEDED              SUSPENDED 
##                     49                   9043

A relative frequency distribution table was created using the tidyverse package (Wickham et al., 2019). The relative frequency distribution table confirms that Active is the status with the highest frequency, accounting for just under half of the observations (49.5%). This confirms the findings from the one-sample proportion test. The Expired status has the next highest frequency (34.9%). Together, these two values account for 84% of all data recorded for this variable. The remaining factors (Inactive, Out of Business, Passed Away, Re-Licensed, Restored from Archived, Revoked Due Dept Err, Revoked License by L&I, Superceded, and Suspended) make up the remaining 16% of the distribution, with Re-Licensed and Suspended having the next highest frequencies, 6.5% and 6%, respectively.

#Create relative frequency distribution table
LI %>%
  group_by(ContractorLicenseStatus) %>%
  summarise(n = n()) %>%
  mutate(freq = paste0((100 * n/sum(n)), '%'))

## # A tibble: 11 × 3
##    ContractorLicenseStatus     n freq                
##    <fct>                   <int> <chr>               
##  1 ACTIVE                  74346 49.4963549815253%   
##  2 EXPIRED                 52457 34.92360440731%     
##  3 INACTIVE                  175 0.116507439832229%  
##  4 OUT OF BUSINESS          4291 2.85676242468626%   
##  5 PASSED AWAY                24 0.0159781631769914% 
##  6 RE-LICENSED              9816 6.5350687393895%    
##  7 RESTORED FROM ARCHIVED      2 0.00133151359808262%
##  8 REVOKED DUE DEPT ERR        1 0.00066575679904131%
##  9 REVOKED LICENSE BY L&I      1 0.00066575679904131%
## 10 SUPERCEDED                 49 0.0326220831530242% 
## 11 SUSPENDED                9043 6.02043873373057%

Bar Plot

Using ggplot2 (Wickham, 2016), a bar plot was created to depict the frequency distribution. The visualization shows that Active and Inactive have the highest frequencies for ContractorLicenseStatus variables by a considerable amount, followed by Re-Licensed, Suspended, and Out of Business. All other categories are almost negligible.

#Bar plot with ggplot
ggplot(LI) + geom_bar(aes(x= ContractorLicenseStatus), fill = "blue") + 
  xlab("Contractor License Status") + ylab("Count") +
  theme(axis.text.x = element_text(size = 5), axis.text.y = element_text(size = 5))

Pie Chart

A pie chart was also created using base R. The main message of this graphic is that the Active status makes up approximately half of all values for the ContractorLicenseStatus, which further confirms the hypothesis test results.

#Pie chart with base R
percent <- round(prop.table(table(LI$ContractorLicenseStatus))*100,2)
labels<- as.character(unique(LI$ContractorLicenseStatus))

LI %>% 
  pull(ContractorLicenseStatus) %>%
  table() %>%
  pie(x=percent, label=paste(labels, " ", percent, "%"), col=rainbow(length(labels)), main="Contractor Licensing Status", cex = 0.3)

References

365 Careers. (2018). Statistics for Data Science and Business Analysis. Packt Publishing.

Data.WA.gov (2015 – 2023). State of Washington, Labor & Industries contractor license and registration data; L&I Contractor License Data - General [Data set]. Washington Department of Labor & Industries. Retrieved July 18, 2023, from: https://data.wa.gov/Labor/L-I-Contractor-License-Data-General/m8qx-ubtq

Mukhiya, S. K., & Ahmed, U. (2020). Hands-on exploratory data analysis with Python. Packt Publishing.

Navarro, D. (2015). Learning statistics with R: A tutorial for psychology students and other beginners. (Version 0.6). University of New South Wales, Sydney, Australia. R package version 0.5.1, https://learningstatisticswithr.com.

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L.D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T.L., Miller, E., Bache, S.M., Müller, K., Ooms, J., Robinson, D., Seidel, D.P., Spinu, V.,… Yutani, H. (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686.

Wickham, H. (2016). ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York. ISBN 978-3-319-24277-4, https://ggplot2.tidyverse.org.