DATA 606 Data Project

Grando - Data Project Proposal

Part 1 - Introduction:

LEED is a framework for identifying, implementing, and measuring green building and neighborhood design, construction, operations, and maintenance. The United States Green Building Council provides publicly available data that identifies all LEED Certified buildings. This list is known as the LEED Project Directory (http://www.usgbc.org/projects/) and features non-confidential projects under each LEED rating system. Once the project has been certified, its team is invited to submit content to update their profile, which contains basic project details, an overview on how the project achieved sustainability using LEED, and a photo slideshow. Project teams also have the option of linking to a video of their LEED building and a relevant website.

Generally, LEED projects register near the beginning of a building’s design phase (the drawing/planning stage). After preparing documentation, and after the construction phase (the building stage) ends, the project team will have their project certified by the Green Business Certification Inc. (GBCI). GBCI is the only certification and credentialing body within the green business and sustainability industry to exclusively administer project certifications and professional credentials of LEED, EDGE, PEER, WELL, SITES, GRESB, Parksmart and Zero Waste. The data available on this LEED Project Directory (http://www.usgbc.org/project-download-all) is a summary of every LEED registered project. Each project can be broken down into further subcategories, such as Rating System Families, based on the type of Certification. While this data is not directly accessible in a single file, the relations can be created from multiple locations on the website.

Within the company, it is important to be able to predict the number of days it takes for a project to certify (from the time it was regiestered) in order to coordinate employee time commitments to perform certification work. Additionally, it is essential that any variables which signficantly affect this time period (submittal time) be identified so that if there is a shift in submittals for one group, the company is aware of the future impacts. By having the ability to predict the length of time it takes for projects to certify, the GBCI can forecast their needs and manage the review staff as needed.

Part 2 - Data:

Upon registering a LEED project, the information in this data set is populated as a part of the requirments of the LEED process. Each case represents a registered LEED project.

The project data is collected by the United States Green Building Council and is available online here: http://www.usgbc.org/project-download-all. The version mapping table, used to break projects down into sub-categories, is not publicly available in a single format. however, all the infomration needed to create this data set is available on http://www.usgbc.org.

There are 126237 observations in the overall data set. The data consist of projects which are registered under different Platforms (v2 and v3 are analyzed for this study). These platforms are created in a sequential manner which means that v2 is an older version than v3. In order to normalize for this issue, v2 projects that were pending longer than the maximum amount of days v3 has been open are considered to have never become certified and remained open for the same amount of time as the maximum number of days v3 has been open. After filtering to projects with sufficient information, and pertinent categories, there are 67694 observations.
This is an observational study because it is based on past, and current, data which we have not influenced in any manner and are only evaluating the results.

The population of interest is all future project buildings which will register for LEED Certification. Given that this data actually represents the entire current population, it is assumed that this analysis can be generalized to future projects. Some issues that may cause a discrepancy between this analysis and the time it takes future projects to certify could include the following; changes to the LEED Rating System requirements, expansion into new building markets that are unfamiliar with referenced procedures and protocols, market acclimation to the standard (which may speed up certification time), and new Rating System Families or altenrative compliance paths to Certification.

The response variable chosen is number of days between a project’s registration and certification date (SubmittalTime). This is a discrete numerical value. After a preliminary analysis, shown in the next section, the explanatory variables chosen for this study are the rating system families (categorical, nominal) and gross floor area (numerical, continuous).

Since this is an observational study, we cannot determine causal connections between the variables of interest. We will only be able to determine whether the variables of interest are associated or not.

Part 3 - Exploratory data analysis:

I have prepared a data set which includes calculations that determine the length of time it took for each project to certify. Since v3 projects can only have been open for 2871 days, v2 projects (previous version) which were certified after 2871 days were considered to be un-certified.

require(ggplot2)

## Loading required package: ggplot2

#### Graphs based on present assumptions
table(completed_projects$RatingSystemFamily)

## 
##   BDC   IDC    OM 
## 20419  8183  4983

table(completed_projects$Platform)

## 
##    v2    v3 
## 14329 19256

# General Rating System Family histogram
ggplot(completed_projects, aes(x = SubmittalTime, fill = RatingSystemFamily)) + geom_histogram(binwidth = 100, 
    alpha = 0.5, position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("Project Age vs. Rating System Family") + theme(plot.title = element_text(hjust = 0.5))

# General Platform ~ Rating System Family boxplot
ggplot(completed_projects, aes(y = SubmittalTime, x = Platform)) + geom_boxplot() + facet_wrap(~RatingSystemFamily) + 
    labs(y = "Days Between Registration and Certification") + ggtitle("Project Age Broken Down By Platform and Rating System Family") + 
    theme(plot.title = element_text(hjust = 0.5))

# Summary Tables
tabledata <- subset(completed_projects, completed_projects$RatingSystemFamily == "BDC")
summary(tabledata$SubmittalTime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2     744    1055    1132    1450    2870

sd(tabledata$SubmittalTime)

## [1] 537.2698

tabledata <- subset(completed_projects, completed_projects$RatingSystemFamily == "IDC")
summary(tabledata$SubmittalTime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0   274.0   462.0   528.7   711.0  2814.0

sd(tabledata$SubmittalTime)

## [1] 387.4607

tabledata <- subset(completed_projects, completed_projects$RatingSystemFamily == "OM")
summary(tabledata$SubmittalTime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     6.0   255.0   406.0   510.1   657.0  2860.0

sd(tabledata$SubmittalTime)

## [1] 380.7195

#### Rating System Family Tests
completed_projects_BDC <- subset(completed_projects, completed_projects$RatingSystemFamily == 
    "BDC")
ggplot(completed_projects_BDC, aes(x = SubmittalTime, fill = Platform)) + geom_histogram(binwidth = 50, 
    alpha = 0.5, position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("BD+C Projects") + theme(plot.title = element_text(hjust = 0.5))

completed_projects_IDC <- subset(completed_projects, completed_projects$RatingSystemFamily == 
    "IDC")
ggplot(completed_projects_IDC, aes(x = SubmittalTime, fill = Platform)) + geom_histogram(binwidth = 50, 
    alpha = 0.5, position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("ID+C Projects") + theme(plot.title = element_text(hjust = 0.5))

completed_projects_OM <- subset(completed_projects, completed_projects$RatingSystemFamily == 
    "OM")
ggplot(completed_projects_OM, aes(x = SubmittalTime, fill = Platform)) + geom_histogram(binwidth = 50, 
    alpha = 0.5, position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("OM Projects") + theme(plot.title = element_text(hjust = 0.5))

#### Gross floor area vs. submittal time tests

# I have scaled the y axis to exclude some outliers in favor of seeing the main data set
# more clearly

ggplot(completed_projects, aes(y = GrossSqFoot, x = SubmittalTime, color = RatingSystemFamily)) + 
    geom_point() + scale_y_continuous(limits = c(0, 5e+06)) + labs(x = "Days Between Registration and Certification", 
    y = "Gross Floor Area") + ggtitle("Project Age vs Total Area") + theme(plot.title = element_text(hjust = 0.5))

## Warning: Removed 14 rows containing missing values (geom_point).

# When a facet wrap is applied, it is a little more clear there may be some association
# with at least one rating system.

ggplot(completed_projects, aes(y = GrossSqFoot, x = SubmittalTime)) + geom_point() + facet_wrap(~RatingSystemFamily) + 
    scale_y_continuous(limits = c(0, 5e+06)) + labs(x = "Days Between Registration and Certification", 
    y = "Gross Floor Area") + ggtitle("Project Age vs Total Area By Rating System Family") + 
    theme(plot.title = element_text(hjust = 0.5))

## Warning: Removed 14 rows containing missing values (geom_point).

# This is a test to see if project timelines decrease as the time between the registration
# date and the launch date increases Create a data frame of completed v2 projects that are
# at least as old as the v3 launch.  This is so that we know we aren't creating a bias with
# recently registered projects that certified quickly.
completed_projects_v2 <- subset(completed_projects, completed_projects$Platform == "v2" & completed_projects$RegYear > 
    0)
table(completed_projects_v2$RegYear)

## 
##    1    2    3    4    5    6    7    8    9   10   11   12 
## 1791 3228 3588 3973  605   24   20   81   36   26   21    6

completed_projects_v2 <- subset(completed_projects_v2, completed_projects_v2$RegYear < 5 & 
    completed_projects_v2$RegYear > 0)
ggplot(completed_projects_v2, aes(y = SubmittalTime, x = RatingSystemFamily)) + geom_boxplot() + 
    facet_wrap(~RegYear) + labs(x = "Rating Sytem Family", y = "Days Between Registration and Certification") + 
    ggtitle("v2 Projects By Number of Years Registered after Platform Launch") + theme(plot.title = element_text(hjust = 0.5))

completed_projects_v3 <- subset(completed_projects, completed_projects$Platform == "v3")
table(completed_projects_v3$RegYear)

## 
##    1    2    3    4    5    6    7    8 
## 2281 3430 3568 3238 2693 2074 1350  622

completed_projects_v3 <- subset(completed_projects_v3, completed_projects_v3$RegYear < 7 & 
    completed_projects_v3$RegYear > 0)
ggplot(completed_projects_v3, aes(y = SubmittalTime, x = RatingSystemFamily)) + geom_boxplot() + 
    facet_wrap(~RegYear) + labs(x = "Rating Sytem Family", y = "Days Between Registration and Certification") + 
    ggtitle("v3 Projects By Number of Years Registered after Platform Launch") + theme(plot.title = element_text(hjust = 0.5))

younger_cutoff <- latest_date - 744
older_cutoff <- latest_date - 1750
completed_projects <- subset(completed_projects, RegistrationDate > older_cutoff & RegistrationDate < 
    younger_cutoff & Platform == "v3")

While doing the analysis, it was determined that older projects, and very recent projects, were singificant leverage points and it was unclear if these data accurately depicted typical timelines. Pertaining to older projects, these submittals are so old that there is an opportunity for a few projects to have very large submittal times which skew the data. Pertaining to newer projects, it appears that there is a bias towards only the “quick” projects since we only are evaluating certified projects. Therefore, we have reduced the data set to only include projects after 2012-05-22 and before 2015-02-22.

Since most projects using the v2 Platform have either been certified or will soon be sunsetted and no longer have the ability to certify, it does not benefit us to make any predictions based on this data. Additionally, there is more than enough v3 Platform project data to do an analyses; therefore, the v2 data will be removed for all inference calculations. Below are the exploratory graphs repeated for the v3 platform only.

#### Graphs based on present assumptions
table(completed_projects$RatingSystemFamily)

## 
##  BDC  IDC   OM 
## 3724 2153 1599

# General Rating System Family histogram
ggplot(completed_projects, aes(x = SubmittalTime, fill = RatingSystemFamily)) + geom_histogram(binwidth = 100, 
    alpha = 0.5, position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("Project Age vs. Rating System Family") + theme(plot.title = element_text(hjust = 0.5))

# General Platform ~ Rating System Family boxplot
ggplot(completed_projects, aes(y = SubmittalTime, x = Platform)) + geom_boxplot() + facet_wrap(~RatingSystemFamily) + 
    labs(y = "Days Between Registration and Certification") + ggtitle("Project Age Broken Down By Platform and Rating System Family") + 
    theme(plot.title = element_text(hjust = 0.5))

# Summary Tables
tabledata <- subset(completed_projects, completed_projects$RatingSystemFamily == "BDC")
summary(tabledata$SubmittalTime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      34     576     812     816    1058    1730

sd(tabledata$SubmittalTime)

## [1] 348.8812

tabledata <- subset(completed_projects, completed_projects$RatingSystemFamily == "IDC")
summary(tabledata$SubmittalTime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     4.0   289.0   441.0   494.9   669.0  1654.0

sd(tabledata$SubmittalTime)

## [1] 298.2147

tabledata <- subset(completed_projects, completed_projects$RatingSystemFamily == "OM")
summary(tabledata$SubmittalTime)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    17.0   146.5   292.0   379.4   486.0  1691.0

sd(tabledata$SubmittalTime)

## [1] 300.6006

#### Rating System Family Tests
completed_projects_BDC <- subset(completed_projects, completed_projects$RatingSystemFamily == 
    "BDC")
ggplot(completed_projects_BDC, aes(x = SubmittalTime, fill = Platform)) + geom_histogram(binwidth = 50, 
    alpha = 0.5, position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("BD+C Projects") + theme(plot.title = element_text(hjust = 0.5))

completed_projects_IDC <- subset(completed_projects, completed_projects$RatingSystemFamily == 
    "IDC")
ggplot(completed_projects_IDC, aes(x = SubmittalTime, fill = Platform)) + geom_histogram(binwidth = 50, 
    alpha = 0.5, position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("ID+C Projects") + theme(plot.title = element_text(hjust = 0.5))

completed_projects_OM <- subset(completed_projects, completed_projects$RatingSystemFamily == 
    "OM")
ggplot(completed_projects_OM, aes(x = SubmittalTime, fill = Platform)) + geom_histogram(binwidth = 50, 
    alpha = 0.5, position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("OM Projects") + theme(plot.title = element_text(hjust = 0.5))

#### Gross floor area vs. submittal time tests

# I have scaled the y axis to exclude some outliers in favor of seeing the main data set
# more clearly

ggplot(completed_projects, aes(y = GrossSqFoot, x = SubmittalTime, color = RatingSystemFamily)) + 
    geom_point() + scale_y_continuous(limits = c(0, 5e+06)) + labs(x = "Days Between Registration and Certification", 
    y = "Gross Floor Area") + ggtitle("Project Age vs Total Area") + theme(plot.title = element_text(hjust = 0.5))

## Warning: Removed 2 rows containing missing values (geom_point).

# When a facet wrap is applied, it is a little more clear there may be some association
# with at least one rating system.

ggplot(completed_projects, aes(y = GrossSqFoot, x = SubmittalTime)) + geom_point() + facet_wrap(~RatingSystemFamily) + 
    scale_y_continuous(limits = c(0, 5e+06)) + labs(x = "Days Between Registration and Certification", 
    y = "Gross Floor Area") + ggtitle("Project Age vs Total Area By Rating System Family") + 
    theme(plot.title = element_text(hjust = 0.5))

## Warning: Removed 2 rows containing missing values (geom_point).

# This is a test to see if project timelines decrease as the time between the registration
# date and the launch date increases Create a data frame of completed v2 projects that are
# at least as old as the v3 launch.  This is so that we know we aren't creating a bias with
# recently registered projects that certified quickly.
completed_projects_v3 <- subset(completed_projects, completed_projects$Platform == "v3")
table(completed_projects_v3$RegYear)

## 
##    2    3    4    5    6 
##   20   23 2986 2691 1756

completed_projects_v3 <- subset(completed_projects_v3, completed_projects_v3$RegYear < 7 & 
    completed_projects_v3$RegYear > 0)
ggplot(completed_projects_v3, aes(y = SubmittalTime, x = RatingSystemFamily)) + geom_boxplot() + 
    facet_wrap(~RegYear) + labs(x = "Rating Sytem Family", y = "Days Between Registration and Certification") + 
    ggtitle("v3 Projects By Number of Years Registered after Platform Launch") + theme(plot.title = element_text(hjust = 0.5))

Part 4 - Inference:

Frome the previous section, we can see that it appears the most signinficant difference between groups is the rating system families. Additionally, we will test to see if gross floor area is a significant predictor of the time it takes for a project to certify (submittal time).

Analysis of Variance (ANOVA) Test Between Rating System Families.

First, we will test to see if there is a statistically significant difference of at least one of the means between Rating System Families (BDC, IDC, OM). Using an analysis of variace (ANOVA) test, we can evaluate whether the null hypthoesis (\({ H }_{ O }\)), which assumes there is no difference, can be rejected in favor of the althernative hypothesis (\({ H }_{ A }\)). These hypothesis can be summarized as follows

\[{ H }_{ O }:\quad The\quad mean\quad submittal\quad time\quad is\quad the\quad same\quad across\quad all\quad Rating\quad System\quad Families\\ { H }_{ A }:\quad The\quad mean\quad submittal\quad time\quad for\quad at\quad least\quad one\quad Rating\quad System\quad Family\quad is\quad different\]

In order to do this, we must first verify the following assumptions:

The observations are independent within and across groups

Since the data actually represent all current projects, it can be assumed that there is no bias in the sampling because no sampling is being performed. Also, projects can only be in one Rating System Family. Therefore, the obseravations are considered independent.

The data within each group are nearly normal

BD+C Projects

completed_projects_BDC_V3 <- subset(completed_projects_BDC, Platform == "v3")
ggplot(completed_projects_BDC_V3, aes(x = SubmittalTime)) + geom_histogram(binwidth = 50, alpha = 0.5, 
    position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("BD+C Projects") + theme(plot.title = element_text(hjust = 0.5)) + stat_function(fun = dnorm, 
    color = "black", args = list(mean = mean(completed_projects_BDC_V3$SubmittalTime), sd(completed_projects_BDC_V3$SubmittalTime)))

qqnorm(completed_projects_BDC_V3$SubmittalTime)
qqline(completed_projects_BDC_V3$SubmittalTime)

IDC Projects

completed_projects_IDC_V3 <- subset(completed_projects_IDC, Platform == "v3")
ggplot(completed_projects_IDC_V3, aes(x = SubmittalTime)) + geom_histogram(binwidth = 50, alpha = 0.5, 
    position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("ID+C Projects") + theme(plot.title = element_text(hjust = 0.5)) + stat_function(fun = dnorm, 
    color = "black", args = list(mean = mean(completed_projects_IDC_V3$SubmittalTime), sd(completed_projects_IDC_V3$SubmittalTime)))

qqnorm(completed_projects_IDC_V3$SubmittalTime)
qqline(completed_projects_IDC_V3$SubmittalTime)

OM Projects

completed_projects_OM_V3 <- subset(completed_projects_OM, Platform == "v3")
ggplot(completed_projects_OM_V3, aes(x = SubmittalTime)) + geom_histogram(binwidth = 50, alpha = 0.5, 
    position = "identity", aes(y = ..density..)) + labs(x = "Days Between Registration and Certification") + 
    ggtitle("OM Projects") + theme(plot.title = element_text(hjust = 0.5)) + stat_function(fun = dnorm, 
    color = "black", args = list(mean = mean(completed_projects_OM_V3$SubmittalTime), sd(completed_projects_OM_V3$SubmittalTime)))

qqnorm(completed_projects_OM_V3$SubmittalTime)
qqline(completed_projects_OM_V3$SubmittalTime)

The variability across the groups is about equal.

completed_projects_V3 <- subset(completed_projects, Platform == "v3")
ggplot(completed_projects_V3, aes(y = SubmittalTime, x = RatingSystemFamily)) + geom_boxplot() + 
    labs(y = "Days Between Registration and Certification") + ggtitle("Project Age Broken Down By Rating System Family") + 
    theme(plot.title = element_text(hjust = 0.5))

The boxplots indicate that the variability across groups does not have any significant outliers. The standard deviation is 348.88 for BD+C projects, 298.21 for ID+C projects, and 300.6 for OM projects. It appears the variance in the BD+C group is somewhat greater than the variance in the ID+C and OM groups. Therefore, we shoud procede with caution and note this issue in the findings. Additionally, each group (BD+C, ID+C, and OM) has issues pertaining to their distributions. The BD+C data set appears narrower than a typical normal distribution (short tails). The ID+C and OM data sets are heavily right skewed and indicate that this might not be the best method to determine if the groups are significantly different; however, we will procede since this is the best available test we can perform at the moment.

We will now perform an ANOVA test. For this analysis, we will use a significance level of \(\alpha = 0.05\). This corresponds to an assumption that 95% of the time, we will not reject the null hypothesis (i.e. no difference) when it is true and 5% of the time we will, which is considered a Type I error.

RSF_aov <- aov(SubmittalTime ~ RatingSystemFamily, data = completed_projects_V3)
summary(RSF_aov)

##                      Df    Sum Sq   Mean Sq F value Pr(>F)    
## RatingSystemFamily    2 268624193 134312097    1272 <2e-16 ***
## Residuals          7473 788934628    105571                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

lb <- 0
ub <- 3750
f_value <- summary(RSF_aov)[[1]][["F value"]][1]
f1 <- qf(p = (1 - 0.05), df1 = summary(RSF_aov)[[1]][[1]][1], df2 = summary(RSF_aov)[[1]][[1]][2])
pick_line1 <- round(f_value, digits = 2)
ggplot(data.frame(x = c(lb, ub)), aes(x = x)) + stat_function(fun = df, args = list(df1 = summary(RSF_aov)[[1]][[1]][1], 
    df2 = summary(RSF_aov)[[1]][[1]][2])) + stat_function(fun = df, args = list(df1 = summary(RSF_aov)[[1]][[1]][1], 
    df2 = summary(RSF_aov)[[1]][[1]][2]), xlim = c(f1, ub), geom = "area", alpha = 0.5) + geom_vline(xintercept = pick_line1, 
    color = "black", alpha = 0.75) + geom_text(aes(x = pick_line1, y = 0.5, label = sprintf("F = %s\n", 
    pick_line1)), color = "black", angle = 90) + labs(x = "F Value") + ggtitle("ANOVA test F Distribution") + 
    theme(plot.title = element_text(hjust = 0.5))

Using a confidence level of 95%, we can see that there is sufficient evidence to reject the null hypothesis since the anova test has returned a p-value of 2e-16, which is much less than 0.05. Therefore we reject the hypothesis that the average time between registration and certification (submittal time) is equal.

Now that we know there is a difference of one of the means, we can perform individual pairwise t-tests to determine which means are statistically significant from each other. Since we are performing multiple comparisons across groups, it is necessary to implement a Bonferroni correction so that the Type I error rate is not increased. This is calculated by taking the significance level (\(\alpha = 0.05\)) and dividing it by the number of groups (3).

bonf_alpha <- 0.05/3

We will now test each project against each other using two-sample t-tests.

# BDC vs OM
(BDC_vs_OM_t <- t.test(completed_projects_BDC_V3$SubmittalTime, completed_projects_OM_V3$SubmittalTime, 
    conf.level = (1 - bonf_alpha)))

## 
##  Welch Two Sample t-test
## 
## data:  completed_projects_BDC_V3$SubmittalTime and completed_projects_OM_V3$SubmittalTime
## t = 46.234, df = 3481.2, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 98.33333 percent confidence interval:
##  414.0322 459.2732
## sample estimates:
## mean of x mean of y 
##  816.0392  379.3865

lb <- -4
ub <- 4
t_value <- BDC_vs_OM_t[[1]]
t1 <- qt(p = bonf_alpha, df = BDC_vs_OM_t[[2]])
t2 <- qt(p = 1 - bonf_alpha, df = BDC_vs_OM_t[[2]])
pick_line1 <- round(t_value, digits = 2)
ggplot(data.frame(x = c(lb, ub)), aes(x = x)) + stat_function(fun = dt, args = list(df = BDC_vs_OM_t[[2]])) + 
    stat_function(fun = dt, args = list(df = BDC_vs_OM_t[[2]]), xlim = c(lb, t1), geom = "area", 
        alpha = 0.5) + geom_vline(xintercept = pick_line1, color = "black", alpha = 0.75) + 
    stat_function(fun = dt, args = list(df = BDC_vs_OM_t[[2]]), xlim = c(t2, ub), geom = "area", 
        alpha = 0.5) + geom_vline(xintercept = pick_line1, color = "black", alpha = 0.75) + 
    geom_text(aes(x = pick_line1, y = 0.2, label = sprintf("t = %s\n", pick_line1)), color = "black", 
        angle = 90) + labs(x = "t Value") + ggtitle("BD+C vs EBOM Projects") + theme(plot.title = element_text(hjust = 0.5))

# BDC vs IDC
(BDC_vs_IDC_t <- t.test(completed_projects_BDC_V3$SubmittalTime, completed_projects_IDC_V3$SubmittalTime, 
    conf.level = (1 - bonf_alpha)))

## 
##  Welch Two Sample t-test
## 
## data:  completed_projects_BDC_V3$SubmittalTime and completed_projects_IDC_V3$SubmittalTime
## t = 37.336, df = 5070.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 98.33333 percent confidence interval:
##  300.5560 341.7547
## sample estimates:
## mean of x mean of y 
##  816.0392  494.8839

lb <- -4
ub <- 4
t_value <- BDC_vs_IDC_t[[1]]
t1 <- qt(p = bonf_alpha, df = BDC_vs_IDC_t[[2]])
t2 <- qt(p = 1 - bonf_alpha, df = BDC_vs_IDC_t[[2]])
pick_line1 <- round(t_value, digits = 2)
ggplot(data.frame(x = c(lb, ub)), aes(x = x)) + stat_function(fun = dt, args = list(df = BDC_vs_IDC_t[[2]])) + 
    stat_function(fun = dt, args = list(df = BDC_vs_IDC_t[[2]]), xlim = c(lb, t1), geom = "area", 
        alpha = 0.5) + geom_vline(xintercept = pick_line1, color = "black", alpha = 0.75) + 
    stat_function(fun = dt, args = list(df = BDC_vs_IDC_t[[2]]), xlim = c(t2, ub), geom = "area", 
        alpha = 0.5) + geom_vline(xintercept = pick_line1, color = "black", alpha = 0.75) + 
    geom_text(aes(x = pick_line1, y = 0.2, label = sprintf("t = %s\n", pick_line1)), color = "black", 
        angle = 90) + labs(x = "t Value") + ggtitle("BD+C vs IDC Projects") + theme(plot.title = element_text(hjust = 0.5))

# OM vs IDC
(OM_vs_IDC_t <- t.test(completed_projects_OM_V3$SubmittalTime, completed_projects_IDC_V3$SubmittalTime, 
    conf.level = (1 - bonf_alpha)))

## 
##  Welch Two Sample t-test
## 
## data:  completed_projects_OM_V3$SubmittalTime and completed_projects_IDC_V3$SubmittalTime
## t = -11.678, df = 3427.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 98.33333 percent confidence interval:
##  -139.18605  -91.80873
## sample estimates:
## mean of x mean of y 
##  379.3865  494.8839

lb <- -4
ub <- 4
t_value <- OM_vs_IDC_t[[1]]
t1 <- qt(p = bonf_alpha, df = OM_vs_IDC_t[[2]])
t2 <- qt(p = 1 - bonf_alpha, df = OM_vs_IDC_t[[2]])
pick_line1 <- round(t_value, digits = 2)
ggplot(data.frame(x = c(lb, ub)), aes(x = x)) + stat_function(fun = dt, args = list(df = OM_vs_IDC_t[[2]])) + 
    stat_function(fun = dt, args = list(df = OM_vs_IDC_t[[2]]), xlim = c(lb, t1), geom = "area", 
        alpha = 0.5) + geom_vline(xintercept = pick_line1, color = "black", alpha = 0.75) + 
    stat_function(fun = dt, args = list(df = OM_vs_IDC_t[[2]]), xlim = c(t2, ub), geom = "area", 
        alpha = 0.5) + geom_vline(xintercept = pick_line1, color = "black", alpha = 0.75) + 
    geom_text(aes(x = pick_line1, y = 0.2, label = sprintf("t = %s\n", pick_line1)), color = "black", 
        angle = 90) + labs(x = "t Value") + ggtitle("EBOM vs IDC Projects") + theme(plot.title = element_text(hjust = 0.5))

As can be seen from the pariwise t-tests, it has been determined that we are 95% confident that the mean submittal time between Rating System Famiilies is different between all groups (BD+C, ID+C, and OM).

Now, there are not many numerical values that can be used as predictors of how long it will take a project to certify. Note, attributes such as the number of points achieved and certification level are only known after a project has certified; therefore, have limited value. One possible numerical indicator of submittal time that can be used in the gross floor area of the project. We will now use this predictor to determine a least squares linear regerssion line to predict the response variable of submittal time. Due to the large differences in the maximum and minimum values, a log translation will be applied to the data. Also, since we have already determined that each Rating System Family has significantly different submittal times, we will perform separate linear regression analsyses for BD+C, ID+C, and OM.

Generate Linear Regression Models and Graphs

# BDC
gf_lm_BDC_V3 <- lm(SubmittalTime ~ log(GrossSqFoot), data = completed_projects_BDC_V3)
summary(gf_lm_BDC_V3)

## 
## Call:
## lm(formula = SubmittalTime ~ log(GrossSqFoot), data = completed_projects_BDC_V3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -853.60 -240.18   -7.34  241.48  926.94 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       521.414     44.757  11.650  < 2e-16 ***
## log(GrossSqFoot)   26.465      3.988   6.636 3.68e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 346.9 on 3722 degrees of freedom
## Multiple R-squared:  0.01169,    Adjusted R-squared:  0.01143 
## F-statistic: 44.04 on 1 and 3722 DF,  p-value: 3.678e-11

ggplot(completed_projects_BDC_V3, aes(y = SubmittalTime, x = log(GrossSqFoot))) + geom_point() + 
    geom_smooth(method = lm, fullrange = TRUE) + labs(x = "Log of Gross Floor Area", y = "Days Between Registration and Certification") + 
    ggtitle("Log of BD+C Gross Floor Area Against Submittal Time") + theme(plot.title = element_text(hjust = 0.5))

# IDC
gf_lm_IDC_V3 <- lm(SubmittalTime ~ log(GrossSqFoot), data = completed_projects_IDC_V3)
summary(gf_lm_IDC_V3)

## 
## Call:
## lm(formula = SubmittalTime ~ log(GrossSqFoot), data = completed_projects_IDC_V3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -605.92 -206.05  -53.11  140.11 1229.05 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -355.499     42.700  -8.325   <2e-16 ***
## log(GrossSqFoot)   87.639      4.358  20.108   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 273.7 on 2151 degrees of freedom
## Multiple R-squared:  0.1582, Adjusted R-squared:  0.1578 
## F-statistic: 404.3 on 1 and 2151 DF,  p-value: < 2.2e-16

ggplot(completed_projects_IDC_V3, aes(y = SubmittalTime, x = log(GrossSqFoot))) + geom_point() + 
    geom_smooth(method = lm, fullrange = TRUE) + labs(x = "Log of Gross Floor Area", y = "Days Between Registration and Certification") + 
    ggtitle("Log of ID+C Gross Floor Area Against Submittal Time") + theme(plot.title = element_text(hjust = 0.5))

# OM
gf_lm_OM_V3 <- lm(SubmittalTime ~ log(GrossSqFoot), data = completed_projects_OM_V3)
summary(gf_lm_OM_V3)

## 
## Call:
## lm(formula = SubmittalTime ~ log(GrossSqFoot), data = completed_projects_OM_V3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -473.39 -177.64  -97.53   79.56 1304.92 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -160.342     52.723  -3.041  0.00239 ** 
## log(GrossSqFoot)   45.340      4.387  10.336  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 291.1 on 1597 degrees of freedom
## Multiple R-squared:  0.0627, Adjusted R-squared:  0.06212 
## F-statistic: 106.8 on 1 and 1597 DF,  p-value: < 2.2e-16

ggplot(completed_projects_OM_V3, aes(y = SubmittalTime, x = log(GrossSqFoot))) + geom_point() + 
    geom_smooth(method = lm, fullrange = TRUE) + labs(x = "Log of Gross Floor Area", y = "Days Between Registration and Certification") + 
    ggtitle("Log of OM Gross Floor Area Against Submittal Time") + theme(plot.title = element_text(hjust = 0.5))

As we can see from the R-square values, these models do not appear to be great predictors of submittal time but do produce statistically significant postitive prediction slopes. While these results do not appear to be of much use at the moment, we will continue the analysis and validate the models by checking the following asusmptions:

the residuals of the model are nearly normal

# BDC
ggplot(data = gf_lm_BDC_V3, aes(x = resid(gf_lm_BDC_V3))) + geom_histogram(binwidth = 100, 
    position = "identity", aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", 
    args = list(mean = mean(gf_lm_BDC_V3$residuals), sd(gf_lm_BDC_V3$residuals))) + labs(x = "Residuals") + 
    ggtitle("Log of BD+C Gross Floor Area Residuals") + theme(plot.title = element_text(hjust = 0.5))

qqnorm(gf_lm_BDC_V3$residuals)
qqline(gf_lm_BDC_V3$residuals)

# IDC
ggplot(data = gf_lm_IDC_V3, aes(x = resid(gf_lm_IDC_V3))) + geom_histogram(binwidth = 100, 
    position = "identity", aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", 
    args = list(mean = mean(gf_lm_IDC_V3$residuals), sd(gf_lm_IDC_V3$residuals))) + labs(x = "Residuals") + 
    ggtitle("Log of ID+C Gross Floor Area Residuals") + theme(plot.title = element_text(hjust = 0.5))

qqnorm(gf_lm_IDC_V3$residuals)
qqline(gf_lm_IDC_V3$residuals)

# OM
ggplot(data = gf_lm_OM_V3, aes(x = resid(gf_lm_OM_V3))) + geom_histogram(binwidth = 100, position = "identity", 
    aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", args = list(mean = mean(gf_lm_OM_V3$residuals), 
    sd(gf_lm_OM_V3$residuals))) + labs(x = "Residuals") + ggtitle("Log of OM Gross Floor Area Residuals") + 
    theme(plot.title = element_text(hjust = 0.5))

qqnorm(gf_lm_OM_V3$residuals)
qqline(gf_lm_OM_V3$residuals)

the variability of the residuals is nearly constant

# BDC
ggplot(gf_lm_BDC_V3, aes(y = abs(resid(gf_lm_BDC_V3)), x = gf_lm_BDC_V3$fitted.values)) + geom_point() + 
    geom_hline(yintercept = 0) + labs(x = "Fitted Values", y = "Absolute Values of residuals") + 
    ggtitle("Log of BD+C Gross Floor Area Fitted Values Against Residuals") + theme(plot.title = element_text(hjust = 0.5))

# IDC
ggplot(gf_lm_IDC_V3, aes(y = abs(resid(gf_lm_IDC_V3)), x = gf_lm_IDC_V3$fitted.values)) + geom_point() + 
    geom_hline(yintercept = 0) + labs(x = "Fitted Values", y = "Absolute Values of residuals") + 
    ggtitle("Log of ID+C Gross Floor Area Fitted Values Against Residuals") + theme(plot.title = element_text(hjust = 0.5))

# OM
ggplot(gf_lm_OM_V3, aes(y = abs(resid(gf_lm_OM_V3)), x = gf_lm_OM_V3$fitted.values)) + geom_point() + 
    geom_hline(yintercept = 0) + labs(x = "Fitted Values", y = "Absolute Values of residuals") + 
    ggtitle("Log of OM Gross Floor Area Fitted Values Against Residuals") + theme(plot.title = element_text(hjust = 0.5))

the residuals are independent

# BDC
ggplot(gf_lm_BDC_V3, aes(y = resid(gf_lm_BDC_V3), x = completed_projects_BDC_V3$RegistrationDate)) + 
    geom_point() + geom_hline(yintercept = 0) + labs(x = "Registration Date", y = "Residuals") + 
    ggtitle("Log of BD+C Registration Date Against Residuals") + theme(plot.title = element_text(hjust = 0.5))

# IDC
ggplot(gf_lm_IDC_V3, aes(y = resid(gf_lm_IDC_V3), x = completed_projects_IDC_V3$RegistrationDate)) + 
    geom_point() + geom_hline(yintercept = 0) + labs(x = "Registration Date", y = "Residuals") + 
    ggtitle("Log of ID+C Registration Date Against Residuals") + theme(plot.title = element_text(hjust = 0.5))

# OM
ggplot(gf_lm_OM_V3, aes(y = resid(gf_lm_OM_V3), x = completed_projects_OM_V3$RegistrationDate)) + 
    geom_point() + geom_hline(yintercept = 0) + labs(x = "Registration Date", y = "Residuals") + 
    ggtitle("Log of OM Registration Date Against Residuals") + theme(plot.title = element_text(hjust = 0.5))

each variable is linearly related to the outcome.

# BDC
ggplot(gf_lm_BDC_V3, aes(y = resid(gf_lm_BDC_V3), x = log(completed_projects_BDC_V3$GrossSqFoot))) + 
    geom_point() + geom_hline(yintercept = 0) + labs(x = "Log of Gross Floor Area", y = "Residuals") + 
    ggtitle("Log of BD+C Gross Floor Area Against Residuals") + theme(plot.title = element_text(hjust = 0.5))

# IDC
ggplot(gf_lm_IDC_V3, aes(y = resid(gf_lm_IDC_V3), x = log(completed_projects_IDC_V3$GrossSqFoot))) + 
    geom_point() + geom_hline(yintercept = 0) + labs(x = "Log of Gross Floor Area", y = "Residuals") + 
    ggtitle("Log of ID+C Gross Floor Area Against Residuals") + theme(plot.title = element_text(hjust = 0.5))

# OM
ggplot(gf_lm_OM_V3, aes(y = resid(gf_lm_OM_V3), x = log(completed_projects_OM_V3$GrossSqFoot))) + 
    geom_point() + geom_hline(yintercept = 0) + labs(x = "Log of Gross Floor Area", y = "Residuals") + 
    ggtitle("Log of OM Gross Floor Area Against Residuals") + theme(plot.title = element_text(hjust = 0.5))

The histogram shows that the data do not exactly follow a normal distribution. From the normal probability plot it appears the data is narrower than a typical normal distribution (short tails) and has a right skew for the ID+C data set. The OM data set appear to have a right skew and it is unclear if there are short tails. The residuals appear to be independent from the registration date vs residuals and gross floor area vs residuals graphs. Note that there is apparent decrease of the maximum possible value in the registration date vs residuals graph. This is due to the fact that older projects have a larger window to certify since they have been open longer, thus they can have a larger maximum possible residual and the newer projects only represent projects that have certified early. In addition to the issues with skewness, each model reports a low r-squared value. Even if all the assumptions were completely validated, the model would only explain 1.14% of the variability for the BD+C data set, 15.78% of the variability for the ID+C data set, and 6.21% of the variability for the OM data set, which is much too small to be considered useful.

As previously noted, the gross floor is not a great predictor variable of the time it takes for a project to certify; however, we can perform a multi-linear regression analysis, which uses more than one predictor variable, to estimate the submittal times. In order to do this, we must filter the data to ensure there are sufficient values in each predictor variable we want to use and then perform an analysis. For this study, we will start with the following variables since they are known at the time of registration: confidential status (Isconfidential), Country (Note - any countries that only had a small number of submittals have been grouped into an “other” category), Rating System Family, owner types (OwnerTypesAdj - Note - the number of categories has been significantly reduced for meaningfullness), and gross floor area (GrossSqFoot).

# Make sure all columns have good data and that v4, ND, and Homes projects are excluded due
# to small size.  Only v3 since that is what we want to guess.
filtered_projects <- subset(project_data_adjusted, !(identical(project_data_adjusted$RatingSystemFamily, 
    character(0))) & !(is.na(project_data_adjusted$RatingSystemFamily)) & !(project_data_adjusted$RatingSystemFamily == 
    "ND") & !(project_data_adjusted$RatingSystemFamily == "Homes") & project_data_adjusted$Platform == 
    "v3" & project_data_adjusted$GrossSqFoot > 0)

# Leverage point adjustment
filtered_projects <- subset(filtered_projects, RegistrationDate > older_cutoff & RegistrationDate < 
    younger_cutoff & Platform == "v3")

# Remove all NAs that are present
filtered_projects_lm <- subset(filtered_projects, filtered_projects$SubmittalTime > 0 & !(is.na(filtered_projects$Isconfidential)) & 
    !(is.na(filtered_projects$Country)) & !(is.na(filtered_projects$RatingSystemFamily)) & 
    !(is.na(filtered_projects$Platform)) & !(is.na(filtered_projects$OwnerTypesAdj)) & !(is.na(filtered_projects$GrossSqFoot)))



# Relevel countries so that US is the reference (base) value in the linear regression.
filtered_projects_lm$Country <- as.factor(filtered_projects_lm$Country)
filtered_projects_lm$Country <- relevel(filtered_projects_lm$Country, ref = "US")

# Filter out the countries with only a few projects
country_count <- data.frame(names(table(filtered_projects_lm$Country)), unname(table(filtered_projects_lm$Country)))
country_count_small <- subset(country_count, Freq > 99)
names(country_count_small) <- c("Country", "Var1", "Freq")

filtered_projects_lm$CountryAdj <- sapply(filtered_projects_lm$Country, function(x) {
    x <- as.character(country_count_small[which(country_count_small$Country == x), "Country"][1])
    ifelse(is.na(x), "Other", x)
})
# Relevel adjusted countries so that US is the reference (base) value in the linear
# regression.
filtered_projects_lm$CountryAdj <- as.factor(filtered_projects_lm$CountryAdj)
filtered_projects_lm$CountryAdj <- relevel(filtered_projects_lm$CountryAdj, ref = "US")

Now, by using backward elimination and forward selection, we can determine the best combination of predictor variables.

ST_lm <- lm(SubmittalTime ~ Isconfidential + CountryAdj + RatingSystemFamily + OwnerTypesAdj + 
    log(GrossSqFoot), data = filtered_projects_lm)
summary(ST_lm)

## 
## Call:
## lm(formula = SubmittalTime ~ Isconfidential + CountryAdj + RatingSystemFamily + 
##     OwnerTypesAdj + log(GrossSqFoot), data = filtered_projects_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -896.48 -202.11  -49.75  166.86 1340.20 
## 
## Coefficients: (1 not defined because of singularities)
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              104.110     28.957   3.595 0.000326 ***
## IsconfidentialYes        102.112     15.947   6.403 1.61e-10 ***
## CountryAdjBR              17.340     27.358   0.634 0.526217    
## CountryAdjCA            -552.926     24.560 -22.514  < 2e-16 ***
## CountryAdjCN             -47.592     16.660  -2.857 0.004294 ** 
## CountryAdjIN             -67.630     22.270  -3.037 0.002399 ** 
## CountryAdjMX             -56.488     29.259  -1.931 0.053564 .  
## CountryAdjOther          -29.227      9.643  -3.031 0.002446 ** 
## CountryAdjTR            -153.743     28.667  -5.363 8.43e-08 ***
## RatingSystemFamilyIDC   -216.104      9.200 -23.489  < 2e-16 ***
## RatingSystemFamilyOM    -450.602      9.525 -47.309  < 2e-16 ***
## OwnerTypesAdjcorporate    57.661     12.967   4.447 8.85e-06 ***
## OwnerTypesAdjeducation   304.584     17.475  17.430  < 2e-16 ***
## OwnerTypesAdjgovernment  259.076     17.518  14.789  < 2e-16 ***
## OwnerTypesAdjindividual  145.794     30.629   4.760 1.97e-06 ***
## OwnerTypesAdjinvestor     51.516     15.721   3.277 0.001054 ** 
## OwnerTypesAdjnon-profit  226.671     23.375   9.697  < 2e-16 ***
## OwnerTypesAdjother            NA         NA      NA       NA    
## log(GrossSqFoot)          56.009      2.409  23.249  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 293.1 on 7458 degrees of freedom
## Multiple R-squared:  0.394,  Adjusted R-squared:  0.3926 
## F-statistic: 285.2 on 17 and 7458 DF,  p-value: < 2.2e-16

# Backward Radj - First step remove Isconfidential
ST_lm_b <- lm(SubmittalTime ~ CountryAdj + RatingSystemFamily + OwnerTypesAdj + log(GrossSqFoot), 
    data = filtered_projects_lm)
summary(ST_lm_b)

## 
## Call:
## lm(formula = SubmittalTime ~ CountryAdj + RatingSystemFamily + 
##     OwnerTypesAdj + log(GrossSqFoot), data = filtered_projects_lm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -896.48 -202.11  -49.75  166.86 1340.20 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              206.222     29.632   6.959 3.71e-12 ***
## CountryAdjBR              17.340     27.358   0.634 0.526217    
## CountryAdjCA            -552.926     24.560 -22.514  < 2e-16 ***
## CountryAdjCN             -47.592     16.660  -2.857 0.004294 ** 
## CountryAdjIN             -67.630     22.270  -3.037 0.002399 ** 
## CountryAdjMX             -56.488     29.259  -1.931 0.053564 .  
## CountryAdjOther          -29.227      9.643  -3.031 0.002446 ** 
## CountryAdjTR            -153.743     28.667  -5.363 8.43e-08 ***
## RatingSystemFamilyIDC   -216.104      9.200 -23.489  < 2e-16 ***
## RatingSystemFamilyOM    -450.602      9.525 -47.309  < 2e-16 ***
## OwnerTypesAdjcorporate   -44.451     11.472  -3.875 0.000108 ***
## OwnerTypesAdjeducation   202.472     16.371  12.368  < 2e-16 ***
## OwnerTypesAdjgovernment  156.964     16.563   9.477  < 2e-16 ***
## OwnerTypesAdjindividual   43.682     30.065   1.453 0.146286    
## OwnerTypesAdjinvestor    -50.596     14.550  -3.477 0.000509 ***
## OwnerTypesAdjnon-profit  124.559     22.592   5.513 3.64e-08 ***
## OwnerTypesAdjother      -102.112     15.947  -6.403 1.61e-10 ***
## log(GrossSqFoot)          56.009      2.409  23.249  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 293.1 on 7458 degrees of freedom
## Multiple R-squared:  0.394,  Adjusted R-squared:  0.3926 
## F-statistic: 285.2 on 17 and 7458 DF,  p-value: < 2.2e-16

# #Forward ST_lm_f <- lm(SubmittalTime ~ Isconfidential, data = filtered_projects_lm)
# summary(ST_lm_f) ST_lm_f <- lm(SubmittalTime ~ CountryAdj, data = filtered_projects_lm)
# summary(ST_lm_f) ST_lm_f <- lm(SubmittalTime ~ RatingSystemFamily, data =
# filtered_projects_lm) summary(ST_lm_f) ST_lm_f <- lm(SubmittalTime ~ OwnerTypesAdj, data
# = filtered_projects_lm) summary(ST_lm_f) ST_lm_f <- lm(SubmittalTime ~ log(GrossSqFoot),
# data = filtered_projects_lm) summary(ST_lm_f) #Forward pick RatingSystemFamily ST_lm_f <-
# lm(SubmittalTime ~ RatingSystemFamily + Isconfidential, data = filtered_projects_lm)
# summary(ST_lm_f) ST_lm_f <- lm(SubmittalTime ~ RatingSystemFamily + CountryAdj, data =
# filtered_projects_lm) summary(ST_lm_f) ST_lm_f <- lm(SubmittalTime ~ RatingSystemFamily +
# OwnerTypesAdj, data = filtered_projects_lm) summary(ST_lm_f) ST_lm_f <- lm(SubmittalTime
# ~ RatingSystemFamily + log(GrossSqFoot), data = filtered_projects_lm) summary(ST_lm_f)
# #Forward pick RatingSystemFamily and OwnerTypesAdj ST_lm_f <- lm(SubmittalTime ~
# RatingSystemFamily + OwnerTypesAdj + Isconfidential, data = filtered_projects_lm)
# summary(ST_lm_f) ST_lm_f <- lm(SubmittalTime ~ RatingSystemFamily + OwnerTypesAdj +
# CountryAdj, data = filtered_projects_lm) summary(ST_lm_f) ST_lm_f <- lm(SubmittalTime ~
# RatingSystemFamily + OwnerTypesAdj + log(GrossSqFoot), data = filtered_projects_lm)
# summary(ST_lm_f) #Forward pick RatingSystemFamily and OwnerTypesAdj and log(GrossSqFoot)
# ST_lm_f <- lm(SubmittalTime ~ RatingSystemFamily + OwnerTypesAdj + log(GrossSqFoot) +
# Isconfidential, data = filtered_projects_lm) summary(ST_lm_f) ST_lm_f <- lm(SubmittalTime
# ~ RatingSystemFamily + OwnerTypesAdj + log(GrossSqFoot) + CountryAdj, data =
# filtered_projects_lm) summary(ST_lm_f)

# Forward pick RatingSystemFamily and OwnerTypesAdj and CountryAdj and GrossSqFoot ST_lm_f
# <- lm(SubmittalTime ~ RatingSystemFamily + OwnerTypesAdj + log(GrossSqFoot) + CountryAdj,
# data = filtered_projects_lm) summary(ST_lm_f)


# Test step_ST_lm <- step(object = ST_lm, direction = 'backward') summary(step_ST_lm)

From this analysis, we can see Country, Rating System Family, owner types, and gross floor area are good predictors for the model. Additionally, we see that the model explains 39.26% of the variability for respone variable (submittal time), which is much better than the previous model. However, in order to validate the model, the following assumptions must be checked:

the residuals of the model are nearly normal

ggplot(data = ST_lm_b, aes(x = resid(ST_lm_b))) + geom_histogram(binwidth = 100, position = "identity", 
    aes(y = ..density..)) + stat_function(fun = dnorm, color = "black", args = list(mean = mean(ST_lm_b$residuals), 
    sd(ST_lm_b$residuals))) + labs(x = "Residuals") + ggtitle("Multi-linear Regression Model Residuals") + 
    theme(plot.title = element_text(hjust = 0.5))

qqnorm(ST_lm_b$residuals)
qqline(ST_lm_b$residuals)

the variability of the residuals is nearly constant

ggplot(ST_lm_b, aes(y = abs(resid(ST_lm_b)), x = ST_lm_b$fitted.values)) + geom_point() + geom_hline(yintercept = 0) + 
    labs(x = "Fitted Values", y = "Absolute value of Residuals") + ggtitle("Multi-linear Regression Fitted Values against residuals") + 
    theme(plot.title = element_text(hjust = 0.5))

the residuals are independent

ggplot(ST_lm_b, aes(y = resid(ST_lm_b), x = filtered_projects_lm$RegistrationDate)) + geom_point() + 
    geom_hline(yintercept = 0) + labs(x = "Registration Date", y = "Residuals") + ggtitle("Multi-linear Regression Model Registration Date Against Residuals") + 
    theme(plot.title = element_text(hjust = 0.5))

each variable is linearly related to the outcome.

ggplot(ST_lm_b, aes(y = resid(ST_lm_b), x = filtered_projects_lm$CountryAdj)) + geom_boxplot() + 
    labs(x = "Country Code", y = "Residuals") + ggtitle("Multi-linear Regression Model Residuals By Country") + 
    theme(plot.title = element_text(hjust = 0.5)) + coord_flip()

ggplot(ST_lm_b, aes(y = resid(ST_lm_b), x = filtered_projects_lm$RatingSystemFamily)) + geom_boxplot() + 
    labs(x = "Rating System Family", y = "Residuals") + ggtitle("Multi-linear Regression Model Residuals By Rating System Family") + 
    theme(plot.title = element_text(hjust = 0.5))

ggplot(ST_lm_b, aes(y = resid(ST_lm_b), x = filtered_projects_lm$OwnerTypesAdj)) + geom_boxplot() + 
    labs(x = "Owner Types", y = "Residuals") + ggtitle("Multi-linear Regression Model Residuals By Owner Type") + 
    theme(plot.title = element_text(hjust = 0.5))

ggplot(ST_lm_b, aes(y = resid(ST_lm_b), x = log(filtered_projects_lm$GrossSqFoot))) + geom_point() + 
    geom_hline(yintercept = 0) + labs(x = "Log of Gross Floor Area", y = "Residuals") + ggtitle("Multi-linear Regression Model Residuals By Gross Floor Area") + 
    theme(plot.title = element_text(hjust = 0.5))

Unlike the the previous linear model, the residual histogram plot shows that the data distribution is a approximately normal. The remaining data show that each predictor apepars to be linearly related to the outcome and appear to be independent based on the registration date graph.

Part 5 - Conclusion:

In this analysis, we have tested the assumption that the three Rating System Families (BD+C, ID+C, OM) have a different mean time it takes to certify a project from the registration date (Submittal Time). Using an ANOVA test, we determined that at least one mean is statistically different from the set. Note, the assumptions for the ANOVA test were not entirely met since the distributions for each group do not closely follow a normal distribution; therefore, we should interpret the results with caution. A set of pairwise t-tests were then performed for each possible combination of groups which determined that all three Rating System Faimly submittal times are statistically significant from each other. All tests were performed using a significance level of \(\alpha = 0.05\). Note, the t-tests were adjusted using the Bonferroni correction so that the Type I error rate is not increased (\(\alpha = 0.05 / 3\)).

Additionally, gross floor area was evaluated to see if it would be a good predictor variable to determine submittal time. Due to the large values, and shape of the data, a log() transformation of the values was taken for the linear regression analysis. The results of this analysis determined that gross floor area, on its own, is not a good predictor fo submittal time based on r-squared value (BD+C - 0.01, ID+C - 0.16, OM - 0.06).

To expand upon this study, we then performed a multiple regression analysis using a set of predictor variables. With this study, we determined that by using Country (Note - any countries that only had a small number of submittals have been grouped into an “other” category), Rating System Family, owner types (OwnerTypesAdj - Note - the number of categories has been significantly reduced for meaningfullness), and gross floor area (GrossSqFoot - Note this variable has been log() transformed) we have significantly improved the model which now explains 39.26% of the variability for the submittal time. However, as noted in the Inference section and at the beginning of this summary, the data appear to deviate from some of the validation assumptions; therefore, should be interpreted with caution.

For future possible adjustments to the data, we could set more aggressive cutoff submittal times that would start to limit the v3 platform data. This adjustment would help us determine what we consider a success and failure in terms of submittal times. However, we must evalute this cutoff carefully as we do not want to simply reduce the cutoff date to just “make the data fit” a normal distribution. Currently, it appears that multiple cutoff time periods would be necessary, since we have shown there is a difference in mean submittal times between Rating System Families, and the use of hierarchical modeling techniques may be necessary.

Appendix (optional):

Self Reflection

During a meetup, a question was proposed to the class about whether we felt that a short slide presentation would be preferable instead of the long-form report format. I think it would be a great experience to give a short presentation over my findings but I do think that the long-form report helped me work through a lot of my ideas and assumptions. Therefore, if possible I would support the idea of creating a slide deck summarizing a data project in addition to a long-form report.

Future work

The remainder of this appendix contains additional analysis work that will be completed in the future and need not be reviewed.

# Create a glm that predicts the probability of Certification.

filtered_projects$TimeOpen <- ifelse(!(is.na(filtered_projects$SubmittalTime)), filtered_projects$SubmittalTime, 
    ifelse(!(is.na(filtered_projects$SubmittalTimePending)), filtered_projects$SubmittalTimePending, 
        NA))

filtered_projects$IsCertifiedBol <- ifelse(filtered_projects$IsCertified == "Yes", 1, ifelse(filtered_projects$IsCertified == 
    "No", 0, NA))

filtered_projects_glm <- subset(filtered_projects, filtered_projects$TimeOpen > 0 & !(is.na(filtered_projects$Isconfidential)) & 
    !(is.na(filtered_projects$Country)) & !(is.na(filtered_projects$RatingSystemFamily)) & 
    !(is.na(filtered_projects$Platform)) & !(is.na(filtered_projects$OwnerTypesAdj)) & !(is.na(filtered_projects$GrossSqFoot)) & 
    !(is.na(filtered_projects$IsCertifiedBol)))

# Relevel countries so that US is the reference (base) value in the linear regression.
filtered_projects_lm$Country <- as.factor(filtered_projects_lm$Country)
filtered_projects_lm$Country <- relevel(filtered_projects_lm$Country, ref = "US")

filtered_projects_glm$CountryAdj <- sapply(filtered_projects_glm$Country, function(x) {
    x <- as.character(country_count_small[which(country_count_small$Country == x), "Country"][1])
    ifelse(is.na(x), "Other", x)
})
# Relevel adjusted countries so that US is the reference (base) value in the linear
# regression.
filtered_projects_glm$CountryAdj <- as.factor(filtered_projects_glm$CountryAdj)
filtered_projects_glm$CountryAdj <- relevel(filtered_projects_glm$CountryAdj, ref = "US")


ST_glm <- glm(IsCertifiedBol ~ RatingSystemFamily + OwnerTypesAdj + CountryAdj + log(GrossSqFoot) + 
    TimeOpen, data = filtered_projects_glm)
summary(ST_glm)

## 
## Call:
## glm(formula = IsCertifiedBol ~ RatingSystemFamily + OwnerTypesAdj + 
##     CountryAdj + log(GrossSqFoot) + TimeOpen, data = filtered_projects_glm)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7416  -0.2725  -0.0311   0.2248   1.1641  
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.074e+00  2.521e-02  42.583  < 2e-16 ***
## RatingSystemFamilyIDC    8.135e-02  8.721e-03   9.328  < 2e-16 ***
## RatingSystemFamilyOM     1.830e-02  9.371e-03   1.953  0.05081 .  
## OwnerTypesAdjcorporate   1.580e-01  8.232e-03  19.192  < 2e-16 ***
## OwnerTypesAdjeducation   1.406e-01  1.173e-02  11.986  < 2e-16 ***
## OwnerTypesAdjgovernment  9.103e-02  1.109e-02   8.211 2.36e-16 ***
## OwnerTypesAdjindividual  1.170e-01  2.252e-02   5.194 2.08e-07 ***
## OwnerTypesAdjinvestor    1.606e-01  1.182e-02  13.586  < 2e-16 ***
## OwnerTypesAdjnon-profit  1.627e-01  1.791e-02   9.085  < 2e-16 ***
## OwnerTypesAdjother       6.561e-02  1.294e-02   5.070 4.01e-07 ***
## CountryAdjBR            -4.316e-02  1.849e-02  -2.335  0.01957 *  
## CountryAdjCA            -4.581e-02  2.908e-02  -1.575  0.11519    
## CountryAdjCN            -7.214e-02  1.263e-02  -5.711 1.14e-08 ***
## CountryAdjIN            -3.308e-02  2.128e-02  -1.555  0.11996    
## CountryAdjMX            -9.948e-02  2.253e-02  -4.415 1.02e-05 ***
## CountryAdjOther         -2.806e-02  8.119e-03  -3.456  0.00055 ***
## CountryAdjTR            -4.850e-02  2.168e-02  -2.237  0.02527 *  
## log(GrossSqFoot)        -5.013e-03  2.066e-03  -2.426  0.01529 *  
## TimeOpen                -7.036e-04  7.490e-06 -93.933  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1319751)
## 
##     Null deviance: 3993.5  on 16048  degrees of freedom
## Residual deviance: 2115.6  on 16030  degrees of freedom
## AIC: 13065
## 
## Number of Fisher Scoring iterations: 2