Final-Project DATA-698

#uploading and viewing of data
#df=read.csv('data.csv', header = TRUE)


#upload of libraries(all these libraries should firstly be installed at the packages section)
library(readr) # reading data of different types
library(ggplot2) # for visualization
library(plyr)
library(dplyr)
library(tidyr) #tidying up data
library(psych) #for modeling
library(caTools)
library(mice) #for cleaning
library(VIM) #for cleaning
library(viridis) #for additional coloring
library(knitr)
download.file("https://raw.githubusercontent.com/ashishsm1986/DATA698/main/Entrepreneurial_Acclimatization.csv",destfile = "Entrepreneurial_Acclimatization.csv",method = "curl")
df=read.csv('Entrepreneurial_Acclimatization.csv', header = TRUE)
#df=read.csv('Entrepreneurial Acclimatization.csv', header = TRUE)
head(df)

The Dataset is a copy of dataset located here:- https://www.kaggle.com/manishkc06/startup-success-prediction as well as answers received from survey conducted for the following questions and they have been mapped into the file as additional columns:-

C1    Top management pushes for Research and Development, Technological acclimatization and innovations hence high entrepreneurial acclimatization?
C2    Projects are executed under stipulated budgets/timelines, with targets met and concerns addressed, hence bringing forth positive influence and satisfaction of different stakeholders.
C3    There is a fast adoption of new technological acclimatization and there is the tendency of using the most updated technology in performing tasks.Single choice.
C4    Top management has high rate of positive support and influence on projects hence high rate of success noted.Single choice.

Lets Dig into the data

dim(df)

## [1] 923  57

The Data has 923 Rows and 57 Columns

kable(df[1:5,c(2, 5, 21, 29, 57)])

state_code	zip_code	funding_total_usd	category_code	status
CA	92101	375000	music	acquired
CA	95032	40100000	enterprise	acquired
CA	92121	2600000	web	acquired
CA	95014	40000000	software	acquired
CA	94105	1300000	games_video	closed

##Data Preprocessing
#Deletion of the rows that have missing data cells
#checking for missing values and cleaning data
head(is.na(df))

##         ID state_code latitude longitude zip_code    id  city Unnamed..6  name
## [1,] FALSE      FALSE    FALSE     FALSE    FALSE FALSE FALSE      FALSE FALSE
## [2,] FALSE      FALSE    FALSE     FALSE    FALSE FALSE FALSE      FALSE FALSE
## [3,] FALSE      FALSE    FALSE     FALSE    FALSE FALSE FALSE      FALSE FALSE
## [4,] FALSE      FALSE    FALSE     FALSE    FALSE FALSE FALSE      FALSE FALSE
## [5,] FALSE      FALSE    FALSE     FALSE    FALSE FALSE FALSE      FALSE FALSE
## [6,] FALSE      FALSE    FALSE     FALSE    FALSE FALSE FALSE      FALSE FALSE
##      labels founded_at closed_at first_funding_at last_funding_at
## [1,]  FALSE      FALSE     FALSE            FALSE           FALSE
## [2,]  FALSE      FALSE     FALSE            FALSE           FALSE
## [3,]  FALSE      FALSE     FALSE            FALSE           FALSE
## [4,]  FALSE      FALSE     FALSE            FALSE           FALSE
## [5,]  FALSE      FALSE     FALSE            FALSE           FALSE
## [6,]  FALSE      FALSE     FALSE            FALSE           FALSE
##      age_first_funding_year age_last_funding_year age_first_milestone_year
## [1,]                  FALSE                 FALSE                    FALSE
## [2,]                  FALSE                 FALSE                    FALSE
## [3,]                  FALSE                 FALSE                    FALSE
## [4,]                  FALSE                 FALSE                    FALSE
## [5,]                  FALSE                 FALSE                    FALSE
## [6,]                  FALSE                 FALSE                    FALSE
##      age_last_milestone_year relationships funding_rounds funding_total_usd
## [1,]                   FALSE         FALSE          FALSE             FALSE
## [2,]                   FALSE         FALSE          FALSE             FALSE
## [3,]                   FALSE         FALSE          FALSE             FALSE
## [4,]                   FALSE         FALSE          FALSE             FALSE
## [5,]                   FALSE         FALSE          FALSE             FALSE
## [6,]                   FALSE         FALSE          FALSE             FALSE
##      milestones state_code.1 is_CA is_NY is_MA is_TX is_otherstate
## [1,]      FALSE        FALSE FALSE FALSE FALSE FALSE         FALSE
## [2,]      FALSE        FALSE FALSE FALSE FALSE FALSE         FALSE
## [3,]      FALSE        FALSE FALSE FALSE FALSE FALSE         FALSE
## [4,]      FALSE        FALSE FALSE FALSE FALSE FALSE         FALSE
## [5,]      FALSE        FALSE FALSE FALSE FALSE FALSE         FALSE
## [6,]      FALSE        FALSE FALSE FALSE FALSE FALSE         FALSE
##      category_code is_software is_web is_mobile is_enterprise is_advertising
## [1,]         FALSE       FALSE  FALSE     FALSE         FALSE          FALSE
## [2,]         FALSE       FALSE  FALSE     FALSE         FALSE          FALSE
## [3,]         FALSE       FALSE  FALSE     FALSE         FALSE          FALSE
## [4,]         FALSE       FALSE  FALSE     FALSE         FALSE          FALSE
## [5,]         FALSE       FALSE  FALSE     FALSE         FALSE          FALSE
## [6,]         FALSE       FALSE  FALSE     FALSE         FALSE          FALSE
##      is_gamesvideo is_ecommerce is_biotech is_consulting is_othercategory
## [1,]         FALSE        FALSE      FALSE         FALSE            FALSE
## [2,]         FALSE        FALSE      FALSE         FALSE            FALSE
## [3,]         FALSE        FALSE      FALSE         FALSE            FALSE
## [4,]         FALSE        FALSE      FALSE         FALSE            FALSE
## [5,]         FALSE        FALSE      FALSE         FALSE            FALSE
## [6,]         FALSE        FALSE      FALSE         FALSE            FALSE
##      object_id has_VC has_angel has_roundA has_roundB has_roundC has_roundD
## [1,]     FALSE  FALSE     FALSE      FALSE      FALSE      FALSE      FALSE
## [2,]     FALSE  FALSE     FALSE      FALSE      FALSE      FALSE      FALSE
## [3,]     FALSE  FALSE     FALSE      FALSE      FALSE      FALSE      FALSE
## [4,]     FALSE  FALSE     FALSE      FALSE      FALSE      FALSE      FALSE
## [5,]     FALSE  FALSE     FALSE      FALSE      FALSE      FALSE      FALSE
## [6,]     FALSE  FALSE     FALSE      FALSE      FALSE      FALSE      FALSE
##      avg_participants is_top500 Gender   Age Experience Education    C1    C2
## [1,]            FALSE     FALSE  FALSE FALSE      FALSE     FALSE FALSE FALSE
## [2,]            FALSE     FALSE  FALSE FALSE      FALSE     FALSE FALSE FALSE
## [3,]            FALSE     FALSE  FALSE FALSE      FALSE     FALSE FALSE FALSE
## [4,]            FALSE     FALSE  FALSE FALSE      FALSE     FALSE FALSE FALSE
## [5,]            FALSE     FALSE  FALSE FALSE      FALSE     FALSE FALSE FALSE
## [6,]            FALSE     FALSE  FALSE FALSE      FALSE     FALSE FALSE FALSE
##         C3    C4 status
## [1,] FALSE FALSE  FALSE
## [2,] FALSE FALSE  FALSE
## [3,] FALSE FALSE  FALSE
## [4,] FALSE FALSE  FALSE
## [5,] FALSE FALSE  FALSE
## [6,] FALSE FALSE  FALSE

P=function(x) {sum(is.na(x))/length(x)*100}
apply(df, 2, P)

##                       ID               state_code                 latitude 
##                  0.00000                  0.00000                  0.00000 
##                longitude                 zip_code                       id 
##                  0.00000                  0.00000                  0.00000 
##                     city               Unnamed..6                     name 
##                  0.00000                  0.00000                  0.00000 
##                   labels               founded_at                closed_at 
##                  0.00000                  0.00000                  0.00000 
##         first_funding_at          last_funding_at   age_first_funding_year 
##                  0.00000                  0.00000                  0.00000 
##    age_last_funding_year age_first_milestone_year  age_last_milestone_year 
##                  0.00000                 16.46804                 16.46804 
##            relationships           funding_rounds        funding_total_usd 
##                  0.00000                  0.00000                  0.00000 
##               milestones             state_code.1                    is_CA 
##                  0.00000                  0.00000                  0.00000 
##                    is_NY                    is_MA                    is_TX 
##                  0.00000                  0.00000                  0.00000 
##            is_otherstate            category_code              is_software 
##                  0.00000                  0.00000                  0.00000 
##                   is_web                is_mobile            is_enterprise 
##                  0.00000                  0.00000                  0.00000 
##           is_advertising            is_gamesvideo             is_ecommerce 
##                  0.00000                  0.00000                  0.00000 
##               is_biotech            is_consulting         is_othercategory 
##                  0.00000                  0.00000                  0.00000 
##                object_id                   has_VC                has_angel 
##                  0.00000                  0.00000                  0.00000 
##               has_roundA               has_roundB               has_roundC 
##                  0.00000                  0.00000                  0.00000 
##               has_roundD         avg_participants                is_top500 
##                  0.00000                  0.00000                  0.00000 
##                   Gender                      Age               Experience 
##                  0.00000                  0.00000                  0.00000 
##                Education                       C1                       C2 
##                  0.00000                  0.00000                  0.00000 
##                       C3                       C4                   status 
##                  0.00000                  0.00000                  0.00000

#the dataset df, has lots of cells with missing data entries

#viewing actual missing data
md.pattern(df)

##     ID state_code latitude longitude zip_code id city Unnamed..6 name labels
## 771  1          1        1         1        1  1    1          1    1      1
## 152  1          1        1         1        1  1    1          1    1      1
##      0          0        0         0        0  0    0          0    0      0
##     founded_at closed_at first_funding_at last_funding_at
## 771          1         1                1               1
## 152          1         1                1               1
##              0         0                0               0
##     age_first_funding_year age_last_funding_year relationships funding_rounds
## 771                      1                     1             1              1
## 152                      1                     1             1              1
##                          0                     0             0              0
##     funding_total_usd milestones state_code.1 is_CA is_NY is_MA is_TX
## 771                 1          1            1     1     1     1     1
## 152                 1          1            1     1     1     1     1
##                     0          0            0     0     0     0     0
##     is_otherstate category_code is_software is_web is_mobile is_enterprise
## 771             1             1           1      1         1             1
## 152             1             1           1      1         1             1
##                 0             0           0      0         0             0
##     is_advertising is_gamesvideo is_ecommerce is_biotech is_consulting
## 771              1             1            1          1             1
## 152              1             1            1          1             1
##                  0             0            0          0             0
##     is_othercategory object_id has_VC has_angel has_roundA has_roundB
## 771                1         1      1         1          1          1
## 152                1         1      1         1          1          1
##                    0         0      0         0          0          0
##     has_roundC has_roundD avg_participants is_top500 Gender Age Experience
## 771          1          1                1         1      1   1          1
## 152          1          1                1         1      1   1          1
##              0          0                0         0      0   0          0
##     Education C1 C2 C3 C4 status age_first_milestone_year
## 771         1  1  1  1  1      1                        1
## 152         1  1  1  1  1      1                        0
##             0  0  0  0  0      0                      152
##     age_last_milestone_year    
## 771                       1   0
## 152                       0   2
##                         152 304

Looking at the two attributes that do have actual missing value entries, we have it that the two are having a total missing values of 16.46804% in total. The imputation of the two data attributes is as the code lines that follow below;

#imputing data
df$age_first_milestone_year[which(is.na(df$age_first_milestone_year))]=
  mean(df$age_first_milestone_year,na.rm = TRUE)

df$age_last_milestone_year[which(is.na(df$age_last_milestone_year))]=
  mean(df$age_last_milestone_year,na.rm = TRUE)

#farther cleaning requires that irrelevant attributes are dropped
df=df[, -c(3,4,5,8,23,40)]
head(df)

##   ID state_code      id          city              name labels founded_at
## 1  1         CA  c:6669     San Diego       Bandsintown      1   1/1/2007
## 2  2         CA c:16283     Los Gatos         TriCipher      1   1/1/2000
## 3  3         CA c:65620     San Diego             Plixi      1  3/18/2009
## 4  4         CA c:42668     Cupertino Solidcore Systems      1   1/1/2002
## 5  5         CA c:65806 San Francisco    Inhale Digital      0   8/1/2010
## 6  6         CA c:22898 Mountain View  Matisse Networks      0   1/1/2002
##   closed_at first_funding_at last_funding_at age_first_funding_year
## 1                   4/1/2009        1/1/2010                 2.2493
## 2                  2/14/2005      12/28/2009                 5.1260
## 3                  3/30/2010       3/30/2010                 1.0329
## 4                  2/17/2005       4/25/2007                 3.1315
## 5 10/1/2012         8/1/2010        4/1/2012                 0.0000
## 6 2/15/2009        7/18/2006       7/18/2006                 4.5452
##   age_last_funding_year age_first_milestone_year age_last_milestone_year
## 1                3.0027                   4.6685                  6.7041
## 2                9.9973                   7.0055                  7.0055
## 3                1.0329                   1.4575                  2.2055
## 4                5.3151                   6.0027                  6.0027
## 5                1.6685                   0.0384                  0.0384
## 6                4.5452                   5.0027                  5.0027
##   relationships funding_rounds funding_total_usd milestones is_CA is_NY is_MA
## 1             3              3            375000          3     1     0     0
## 2             9              4          40100000          1     1     0     0
## 3             5              1           2600000          2     1     0     0
## 4             5              3          40000000          1     1     0     0
## 5             2              2           1300000          1     1     0     0
## 6             3              1           7500000          1     1     0     0
##   is_TX is_otherstate   category_code is_software is_web is_mobile
## 1     0             0           music           0      0         0
## 2     0             0      enterprise           0      0         0
## 3     0             0             web           0      1         0
## 4     0             0        software           1      0         0
## 5     0             0     games_video           0      0         0
## 6     0             0 network_hosting           0      0         0
##   is_enterprise is_advertising is_gamesvideo is_ecommerce is_biotech
## 1             0              0             0            0          0
## 2             1              0             0            0          0
## 3             0              0             0            0          0
## 4             0              0             0            0          0
## 5             0              0             1            0          0
## 6             0              0             0            0          0
##   is_consulting is_othercategory has_VC has_angel has_roundA has_roundB
## 1             0                1      0         1          0          0
## 2             0                0      1         0          0          1
## 3             0                0      0         0          1          0
## 4             0                0      0         0          0          1
## 5             0                0      1         1          0          0
## 6             0                1      0         0          0          1
##   has_roundC has_roundD avg_participants is_top500            Gender   Age
## 1          0          0           1.0000         0                        
## 2          1          1           4.7500         1 Prefer not to say 41-50
## 3          0          0           4.0000         1              Male 31-40
## 4          1          1           3.3333         1              Male 31-40
## 5          0          0           1.0000         1              Male 31-40
## 6          0          0           3.0000         1              Male 31-40
##     Experience Education  C1  C2  C3  C4   status
## 1    6-8 Years       PhD Yes Yes Yes Yes acquired
## 2 Over 8 Years   Masters Yes Yes Yes Yes acquired
## 3 Over 8 Years Bachelors Yes Yes Yes Yes acquired
## 4   8-10 Years   Masters Yes Yes Yes Yes acquired
## 5   8-10 Years   Masters Yes  No Yes  No   closed
## 6 Over 8 Years Bachelors Yes Yes Yes Yes   closed

The next step that would follow would be the Descriptive Research Design and for this matter therefore, there will be a run of the descriptive analytics and to start off it is wise to note that the dataset has a mix of integer, numeric and character attributes in the mix up.

#Running Descriptives
str(df)

## 'data.frame':    923 obs. of  51 variables:
##  $ ID                      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ state_code              : Factor w/ 35 levels "AR","AZ","CA",..: 3 3 3 3 3 3 3 3 13 3 ...
##  $ id                      : Factor w/ 922 levels "c:10054","c:101312",..: 812 171 808 593 809 311 162 699 15 675 ...
##  $ city                    : Factor w/ 221 levels "Acton","Addison",..: 173 109 173 56 174 126 126 178 215 144 ...
##  $ name                    : Factor w/ 922 levels "#waywire","1000memories",..: 77 815 613 743 367 469 665 147 859 217 ...
##  $ labels                  : int  1 1 1 1 0 0 1 1 1 1 ...
##  $ founded_at              : Factor w/ 217 levels "1/1/1984","1/1/1985",..: 16 9 107 11 184 11 14 13 11 151 ...
##  $ closed_at               : Factor w/ 203 levels "","1/1/2001",..: 1 1 1 1 26 75 1 1 1 1 ...
##  $ first_funding_at        : Factor w/ 585 levels "1/1/2000","1/1/2001",..: 296 217 279 219 491 460 559 514 507 206 ...
##  $ last_funding_at         : Factor w/ 680 levels "1/1/2001","1/1/2004",..: 8 206 323 380 341 534 298 112 281 274 ...
##  $ age_first_funding_year  : num  2.25 5.13 1.03 3.13 0 ...
##  $ age_last_funding_year   : num  3 10 1.03 5.32 1.67 ...
##  $ age_first_milestone_year: num  4.6685 7.0055 1.4575 6.0027 0.0384 ...
##  $ age_last_milestone_year : num  6.7041 7.0055 2.2055 6.0027 0.0384 ...
##  $ relationships           : int  3 9 5 5 2 3 6 25 13 14 ...
##  $ funding_rounds          : int  3 4 1 3 2 1 3 3 3 3 ...
##  $ funding_total_usd       : num  375000 40100000 2600000 40000000 1300000 7500000 26000000 34100000 9650000 5750000 ...
##  $ milestones              : int  3 1 2 1 1 1 2 3 4 4 ...
##  $ is_CA                   : int  1 1 1 1 1 1 1 1 0 1 ...
##  $ is_NY                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_MA                   : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ is_TX                   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_otherstate           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ category_code           : Factor w/ 35 levels "advertising",..: 20 9 35 31 12 21 31 11 19 35 ...
##  $ is_software             : int  0 0 0 1 0 0 1 0 0 0 ...
##  $ is_web                  : int  0 0 1 0 0 0 0 0 0 1 ...
##  $ is_mobile               : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ is_enterprise           : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ is_advertising          : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_gamesvideo           : int  0 0 0 0 1 0 0 0 0 0 ...
##  $ is_ecommerce            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_biotech              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_consulting           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ is_othercategory        : int  1 0 0 0 0 1 0 1 0 0 ...
##  $ has_VC                  : int  0 1 0 0 1 0 1 0 1 1 ...
##  $ has_angel               : int  1 0 0 0 1 0 0 0 0 1 ...
##  $ has_roundA              : int  0 0 1 0 0 0 1 1 1 1 ...
##  $ has_roundB              : int  0 1 0 1 0 1 1 1 0 0 ...
##  $ has_roundC              : int  0 1 0 1 0 0 0 0 0 0 ...
##  $ has_roundD              : int  0 1 0 1 0 0 0 1 1 0 ...
##  $ avg_participants        : num  1 4.75 4 3.33 1 ...
##  $ is_top500               : int  0 1 1 1 1 1 1 1 1 1 ...
##  $ Gender                  : Factor w/ 4 levels "","Female","Male",..: 1 4 3 3 3 3 3 3 3 3 ...
##  $ Age                     : Factor w/ 54 levels "","31-40","41-100",..: 1 5 2 2 2 2 2 2 2 2 ...
##  $ Experience              : Factor w/ 3 levels "6-8 Years","8-10 Years",..: 1 3 3 2 2 3 3 3 3 3 ...
##  $ Education               : Factor w/ 3 levels "Bachelors","Masters",..: 3 2 1 2 2 1 1 1 1 2 ...
##  $ C1                      : Factor w/ 1 level "Yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ C2                      : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ C3                      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 1 2 2 1 ...
##  $ C4                      : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 2 1 ...
##  $ status                  : Factor w/ 2 levels "acquired","closed": 1 1 1 1 2 2 1 1 1 1 ...

summary(df)

##        ID          state_code         id                 city    
##  Min.   :  1.0   CA     :488   c:28482 :  2   San Francisco:128  
##  1st Qu.:231.5   NY     :106   c:10054 :  1   New York     : 91  
##  Median :462.0   MA     : 83   c:101312:  1   Mountain View: 47  
##  Mean   :462.0   TX     : 42   c:10137 :  1   Palo Alto    : 35  
##  3rd Qu.:692.5   WA     : 42   c:10153 :  1   Austin       : 27  
##  Max.   :923.0   CO     : 19   c:10158 :  1   Santa Clara  : 27  
##                  (Other):143   (Other) :916   (Other)      :568  
##               name         labels          founded_at     closed_at  
##  Redwood Systems:  2   Min.   :0.0000   1/1/2003: 55           :588  
##  #waywire       :  1   1st Qu.:0.0000   1/1/2002: 54   6/1/2013: 25  
##  1000memories   :  1   Median :1.0000   1/1/2005: 54   1/1/2012: 24  
##  41st Parameter :  1   Mean   :0.6468   1/1/2006: 54   7/1/2013: 15  
##  5min Media     :  1   3rd Qu.:1.0000   1/1/2000: 53   5/1/2013: 12  
##  Aardvark       :  1   Max.   :1.0000   1/1/2004: 50   1/1/2011:  8  
##  (Other)        :916                    (Other) :603   (Other) :251  
##   first_funding_at  last_funding_at age_first_funding_year
##  1/1/2008 : 18     1/1/2008 : 11    Min.   :-9.0466       
##  1/1/2007 : 16     1/1/2012 :  6    1st Qu.: 0.5767       
##  1/1/2006 : 12     12/1/2007:  6    Median : 1.4466       
##  3/1/2006 : 10     9/1/2008 :  6    Mean   : 2.2356       
##  12/1/2006:  9     4/1/2012 :  5    3rd Qu.: 3.5753       
##  2/1/2005 :  9     7/1/2008 :  5    Max.   :21.8959       
##  (Other)  :849     (Other)  :884                          
##  age_last_funding_year age_first_milestone_year age_last_milestone_year
##  Min.   :-9.047        Min.   :-14.170          Min.   :-7.005         
##  1st Qu.: 1.670        1st Qu.:  1.252          1st Qu.: 2.930         
##  Median : 3.529        Median :  3.055          Median : 4.754         
##  Mean   : 3.931        Mean   :  3.055          Mean   : 4.754         
##  3rd Qu.: 5.560        3rd Qu.:  4.003          3rd Qu.: 6.040         
##  Max.   :21.896        Max.   : 24.685          Max.   :24.685         
##                                                                        
##  relationships    funding_rounds   funding_total_usd     milestones   
##  Min.   : 0.000   Min.   : 1.000   Min.   :1.100e+04   Min.   :0.000  
##  1st Qu.: 3.000   1st Qu.: 1.000   1st Qu.:2.725e+06   1st Qu.:1.000  
##  Median : 5.000   Median : 2.000   Median :1.000e+07   Median :2.000  
##  Mean   : 7.711   Mean   : 2.311   Mean   :2.542e+07   Mean   :1.842  
##  3rd Qu.:10.000   3rd Qu.: 3.000   3rd Qu.:2.472e+07   3rd Qu.:3.000  
##  Max.   :63.000   Max.   :10.000   Max.   :5.700e+09   Max.   :8.000  
##                                                                       
##      is_CA            is_NY            is_MA             is_TX       
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :1.0000   Median :0.0000   Median :0.00000   Median :0.0000  
##  Mean   :0.5276   Mean   :0.1148   Mean   :0.08992   Mean   :0.0455  
##  3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000  
##                                                                      
##  is_otherstate       category_code  is_software         is_web     
##  Min.   :0.000   software   :153   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.000   web        :144   1st Qu.:0.0000   1st Qu.:0.000  
##  Median :0.000   mobile     : 79   Median :0.0000   Median :0.000  
##  Mean   :0.221   enterprise : 73   Mean   :0.1658   Mean   :0.156  
##  3rd Qu.:0.000   advertising: 62   3rd Qu.:0.0000   3rd Qu.:0.000  
##  Max.   :1.000   games_video: 52   Max.   :1.0000   Max.   :1.000  
##                  (Other)    :360                                   
##    is_mobile       is_enterprise     is_advertising    is_gamesvideo    
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.08559   Mean   :0.07909   Mean   :0.06717   Mean   :0.05634  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.00000  
##                                                                         
##   is_ecommerce       is_biotech      is_consulting     is_othercategory
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.0000  
##  Mean   :0.02709   Mean   :0.03684   Mean   :0.00325   Mean   :0.3229  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :1.00000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000  
##                                                                        
##      has_VC         has_angel        has_roundA       has_roundB    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :1.0000   Median :0.0000  
##  Mean   :0.3261   Mean   :0.2546   Mean   :0.5081   Mean   :0.3922  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                     
##    has_roundC       has_roundD      avg_participants   is_top500     
##  Min.   :0.0000   Min.   :0.00000   Min.   : 1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.: 1.500   1st Qu.:1.0000  
##  Median :0.0000   Median :0.00000   Median : 2.500   Median :1.0000  
##  Mean   :0.2329   Mean   :0.09967   Mean   : 2.839   Mean   :0.8093  
##  3rd Qu.:0.0000   3rd Qu.:0.00000   3rd Qu.: 3.800   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :1.00000   Max.   :16.000   Max.   :1.0000  
##                                                                      
##                Gender         Age             Experience      Education  
##                   : 52   31-40  :819   6-8 Years   : 52   Bachelors:562  
##  Female           : 51          : 52   8-10 Years  :104   Masters  :309  
##  Male             :768   41-100 :  1   Over 8 Years:767   PhD      : 52  
##  Prefer not to say: 52   41-101 :  1                                     
##                          41-50  :  1                                     
##                          41-51  :  1                                     
##                          (Other): 48                                     
##    C1        C2        C3        C4           status   
##  Yes:923   No :256   No :153   No :307   acquired:597  
##            Yes:667   Yes:770   Yes:616   closed  :326  
##                                                        
##                                                        
##                                                        
##                                                        
##

print('Those that were acquired totaled to 597 and those that were closed totaled to 326. From this data summary and summation, it is as clear, that a considerable number of firms had struggled in their operations and were eventually closed whereas others must have been promising enough but due to lack of enough funds of operations were instead acquired. The main reasons for acquisition will be largely expounded on under the discussion section. To this point though, it is only wise to note that the results do dictate the fact that, there is a clear view that more firms were performing close to better and looked promising and due to this fact, could easily be selected for acquisition')

## [1] "Those that were acquired totaled to 597 and those that were closed totaled to 326. From this data summary and summation, it is as clear, that a considerable number of firms had struggled in their operations and were eventually closed whereas others must have been promising enough but due to lack of enough funds of operations were instead acquired. The main reasons for acquisition will be largely expounded on under the discussion section. To this point though, it is only wise to note that the results do dictate the fact that, there is a clear view that more firms were performing close to better and looked promising and due to this fact, could easily be selected for acquisition"

#Doing Univaraite Analysis
describe(df$age_first_funding_year)

##    vars   n mean   sd median trimmed  mad   min  max range skew kurtosis   se
## X1    1 923 2.24 2.51   1.45    1.89 1.77 -9.05 21.9 30.94  2.1     9.97 0.08

describe(df$age_last_funding_year)

##    vars   n mean   sd median trimmed  mad   min  max range skew kurtosis  se
## X1    1 923 3.93 2.97   3.53    3.65 2.88 -9.05 21.9 30.94 1.09     3.12 0.1

describe(df$age_first_milestone_year)

##    vars   n mean   sd median trimmed  mad    min   max range skew kurtosis   se
## X1    1 923 3.06 2.72   3.06    2.83 2.19 -14.17 24.68 38.85 1.03     7.27 0.09

describe(df$age_last_milestone_year)

##    vars   n mean   sd median trimmed  mad   min   max range skew kurtosis  se
## X1    1 923 4.75 2.94   4.75     4.6 2.58 -7.01 24.68 31.69 0.78     3.04 0.1

describe(df$relationships)

##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 923 7.71 7.27      5     6.5 4.45   0  63    63 2.32     8.56 0.24

describe(df$funding_rounds)

##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 923 2.31 1.39      2     2.1 1.48   1  10     9 1.35     2.23 0.05

describe(df$funding_total_usd)

##    vars   n     mean        sd median  trimmed      mad   min     max
## X1    1 923 25419749 189634364  1e+07 13381063 12883794 11000 5.7e+09
##         range  skew kurtosis      se
## X1 5699989000 29.06   865.75 6241891

describe(df$milestones)

##    vars   n mean   sd median trimmed  mad min max range skew kurtosis   se
## X1    1 923 1.84 1.32      2    1.75 1.48   0   8     8 0.58     0.25 0.04

#Getting to know the statistical distribution through Histograms

hist(df$age_first_funding_year, col = blues9)

#above is age of firms first funding distribution.

The dataset from the above histogram plot, clearly indicates a normally distributed data points with the peak at the center and yet the shorter data points spread out to the ends of the data points

hist(df$age_last_funding_year, col = blues9)

#above is age of firms last funding distribution

For the above two cases, most firms, got their first and last funding in the periods that were between 0 and 5 years on total. These years are the most crucial years of firm and a firm in most cases does not look at expanding its operations, but looks forward into expanding its operations instead. The firms that receive funding years later than 5 years of existence on the other side tends to be seeking expansion funds and that is allowed as this stage needs greater investments.

hist(df$age_first_milestone_year, col = blues9)

#above is age of firms first milestone distribution

hist(df$age_last_milestone_year, col = blues9)

#above is age of firms last milestone distribution

hist(df$relationships, col = blues9)

#above is the different firms' relations to other firms distribution

The number of relationships as indicated by the above graph has it that of the firms that portrayed levels of having relationships with other firms, most firms in the list had relationship periods that spanned from 0 years to 5 years and that number kept dropping and from here the data is highly skewed to the right.

hist(df$funding_rounds, col = blues9)

#above is different firms distribution on the funding rounds directed to firms

The data for the number of funding given to firms indicates that, number if round that firms were funded ranged between 1 and 2 and then the data frequency drops and therefore the dataset is skewed to the right. Most firms are just well on their own after the first or even second funding and that’s why there are fewer funding frequencies as numbers move along.

hist(df$funding_total_usd, col = blues9)

#the above illustrates the total funds directed to firms

hist(df$milestones, col = blues9)

#the above gives the total milestones range for each firm

Doing bi-variate analysis

#doing bi-variate analysis
#we will do on continuous and categorical
#we will do on categorical and categorical
#we will do on continuous and continous

?by
by(df$funding_total_usd, df$status, summary)

## df$status: acquired
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 5.970e+04 4.575e+06 1.270e+07 3.105e+07 2.706e+07 5.700e+09 
## ------------------------------------------------------------ 
## df$status: closed
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##     11000   1000000   5000000  15115322  16525000 510000000

by(df$funding_total_usd, df$status, mean)

## df$status: acquired
## [1] 31046622
## ------------------------------------------------------------ 
## df$status: closed
## [1] 15115322

by(df$funding_total_usd, df$status, median)

## df$status: acquired
## [1] 12700000
## ------------------------------------------------------------ 
## df$status: closed
## [1] 5e+06

?by
by(df$milestones, df$status, summary)

## df$status: acquired
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.162   3.000   8.000 
## ------------------------------------------------------------ 
## df$status: closed
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.255   2.000   6.000

by(df$milestones, df$status, mean)

## df$status: acquired
## [1] 2.162479
## ------------------------------------------------------------ 
## df$status: closed
## [1] 1.254601

by(df$milestones, df$status, median)

## df$status: acquired
## [1] 2
## ------------------------------------------------------------ 
## df$status: closed
## [1] 1

?boxplot
boxplot(df$funding_total_usd~df$status, notch=TRUE)

boxplot(df$milestones~df$status, notch=TRUE)

?boxplot
boxplot(df$funding_total_usd~df$C1, notch=TRUE)

boxplot(df$milestones~df$C1, notch=TRUE)

boxplot(df$funding_total_usd~df$C2, notch=TRUE)

boxplot(df$milestones~df$C2, notch=TRUE)

boxplot(df$funding_total_usd~df$C3, notch=TRUE)

boxplot(df$milestones~df$C3, notch=TRUE)

boxplot(df$funding_total_usd~df$C4, notch=TRUE)

boxplot(df$milestones~df$C4, notch=TRUE)

#Allocating Equipment
#Doing probability plots
# Filter for firm types
dr1=df %>%
  filter(status=='acquired')
qqnorm(dr1$milestones)
qqline(dr1$milestones, col=3)

dr2=df %>%
  filter(status=='closed')
qqnorm(dr1$milestones)
qqline(dr1$milestones, col=3)

dr1=df %>%
  filter(status=='acquired')
qqnorm(dr1$funding_total_usd)
qqline(dr1$funding_total_usd, col=3)

Density plot

#density plots: with these plots the intensity of an entry in a column is noticeable by it's overiding nature   
df %>%
  filter(status=='closed' | status=='acquired') %>%
  ggplot(aes(x=funding_total_usd, fill = status)) +
  geom_density(alpha=0.9, color='darkblue') +
  ggtitle('funding_total_usd')

df %>%
  filter(C2=='Yes' | status=='No') %>%
  ggplot(aes(x=funding_total_usd, fill = status)) +
  geom_density(alpha=0.9, color='darkblue') +
  ggtitle('C2')

df %>%
  filter(C4=='Yes' | status=='No') %>%
  ggplot(aes(x=funding_total_usd, fill = status)) +
  geom_density(alpha=0.9, color='darkblue') +
  ggtitle('C4')

Building Linear Regression Model

#linear regression is to asses the statistically significant attributes for building classifiction algorithms  
mod1=lm(milestones~age_first_funding_year+
          age_last_funding_year+
          age_first_milestone_year+
          age_last_milestone_year+
          relationships+
          funding_rounds+
          funding_total_usd, data=df)
summary(mod1)#gives information on the figures from linear regression including error rates

## 
## Call:
## lm(formula = milestones ~ age_first_funding_year + age_last_funding_year + 
##     age_first_milestone_year + age_last_milestone_year + relationships + 
##     funding_rounds + funding_total_usd, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6120 -0.5293  0.0425  0.5834  3.6984 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               8.947e-01  8.668e-02  10.322  < 2e-16 ***
## age_first_funding_year   -4.403e-02  3.090e-02  -1.425  0.15457    
## age_last_funding_year    -7.552e-02  2.913e-02  -2.592  0.00968 ** 
## age_first_milestone_year -2.353e-01  2.038e-02 -11.543  < 2e-16 ***
## age_last_milestone_year   2.804e-01  1.850e-02  15.156  < 2e-16 ***
## relationships             7.149e-02  5.226e-03  13.681  < 2e-16 ***
## funding_rounds            7.958e-02  3.892e-02   2.045  0.04117 *  
## funding_total_usd        -2.657e-10  1.732e-10  -1.534  0.12533    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9848 on 915 degrees of freedom
## Multiple R-squared:  0.4498, Adjusted R-squared:  0.4456 
## F-statistic: 106.9 on 7 and 915 DF,  p-value: < 2.2e-16

##status Column will be taken as our independent variable.
#Splitting will be done with respect to status
split_model=sample.split(df,SplitRatio = 0.70)
train=subset(df,split_model==TRUE)
test=subset(df,split_model==FALSE)

nrow(train)

## [1] 634

nrow(test)

## [1] 289

#now lets proceed and develop logistic regression
modlog=glm(formula = status~age_first_funding_year+
             age_last_funding_year+
             age_first_milestone_year+
             age_last_milestone_year+
             relationships+
             funding_rounds+
             funding_total_usd+milestones+
             avg_participants,
           data = train,family = 'binomial')

#predicting the values using the test dataset
resultlog=predict(modlog,newdata = test,type = 'response')
head(resultlog)

##          5          8         13         19         23         24 
## 0.69664595 0.01081804 0.78656464 0.12708446 0.78663663 0.27325968

range(resultlog)

## [1] 0.001978452 1.000000000

##from the range of probabilities above we can now get the confusion matrix
#for us to do that we develop a table of the actual and the predicted values
table(test$status,resultlog>0.5)

##           
##            FALSE TRUE
##   acquired   158   32
##   closed      42   57

table(test$status,resultlog>0.4)

##           
##            FALSE TRUE
##   acquired   135   55
##   closed      29   70

It is clearly evident that, only the attributes age_first_funding_year and funding_total_usd are not statistically significant and for that matter and the remaining attributes, those that are statistically significant will be used for building the classification algorithms.

#predicting the values using the test dataset
resultlog=predict(modlog, newdata = test, type = 'response')
head(resultlog)

##          5          8         13         19         23         24 
## 0.69664595 0.01081804 0.78656464 0.12708446 0.78663663 0.27325968

log_new=cbind(Actual=test$status,Predicted=resultlog)


#finding accuracy and errors
log_new=as.data.frame(log_new)
#getting the error in prediction(subtract the predicted values from the actual values)
error1=log_new$Actual-log_new$Predicted
head(error1)

## [1] 1.3033540 0.9891820 1.2134354 0.8729155 1.2133634 1.7267403

#lets bind the error back to the same data set
log_new = cbind(log_new,error1)
head(log_new)

##    Actual  Predicted    error1
## 5       2 0.69664595 1.3033540
## 8       1 0.01081804 0.9891820
## 13      2 0.78656464 1.2134354
## 19      1 0.12708446 0.8729155
## 23      2 0.78663663 1.2133634
## 24      2 0.27325968 1.7267403

#finding the root mean square error
sqrt(mean((log_new$error1)^2))

## [1] 1.07117

rmse1=sqrt(mean((log_new$error1)^2))

#Mean Absolute Error
library(MLmetrics)
MAE=MAE(log_new$Actual,log_new$Predicted)
MAE

## [1] 0.9856348

##Relative Root Mean Square Error
#RMSE/mean of observed data
mean(resultlog)

## [1] 0.3569257

rmse1=1.110258
RRSME=1.110258/0.3400743
RRSME

## [1] 3.264751

#r squared
summary(modlog)$r.squared

## NULL

Using the statistically significant attributes and splitting the main dataset in a 70:30 ratio for train and test sets respectively, the developed model was trained using 632 rows and tested on a test data with 291 rows. Of the developed model, the accuracy rate was ranging between values of 6.234853e-06 and 8.097985e-01.

library(ROCR)

#Developing an ROCR curve as a measure of accuracy
PL=prediction(resultlog,test$status)
ACC=performance(PL,"acc")
plot(ACC)

From the above codeline, the highest value of perfromance is when the threshhold is at 0.58 With the accuracy degree of 0.74 which is 74%. For there to be highest level and most accurate predictions.

##Building a Decision Tree
library(tree)
tree.df=tree(status~age_first_funding_year+
               age_last_funding_year+
               age_first_milestone_year+
               age_last_milestone_year+
               relationships+
               funding_rounds+
               funding_total_usd+milestones+
               avg_participants,
             data=df)
summary(tree.df)

## 
## Classification tree:
## tree(formula = status ~ age_first_funding_year + age_last_funding_year + 
##     age_first_milestone_year + age_last_milestone_year + relationships + 
##     funding_rounds + funding_total_usd + milestones + avg_participants, 
##     data = df)
## Variables actually used in tree construction:
## [1] "relationships"           "funding_total_usd"      
## [3] "milestones"              "age_last_milestone_year"
## [5] "age_last_funding_year"   "age_first_funding_year" 
## [7] "avg_participants"       
## Number of terminal nodes:  10 
## Residual mean deviance:  0.9438 = 861.7 / 913 
## Misclassification error rate: 0.2243 = 207 / 923

##Plotting
plot(tree.df)
text(tree.df,pretty = 0)

##Using Split Data
tree.df1=tree(status~age_first_funding_year+
               age_last_funding_year+
               age_first_milestone_year+
               age_last_milestone_year+
               relationships+
               funding_rounds+
               funding_total_usd+milestones+
               avg_participants,
             data=train)

plot(tree.df1)
text(tree.df1,pretty = 0)

##Prediciting values
tree.pred=predict(tree.df1, test, type = "class")

#confusion matrix 
table(test$status,tree.pred)

##           tree.pred
##            acquired closed
##   acquired      183      7
##   closed         61     38

#cross validation
cv.df=cv.tree(tree.df1, FUN = prune.misclass)
cv.df

## $size
## [1] 9 8 5 2 1
## 
## $dev
## [1] 163 160 159 164 227
## 
## $k
## [1]      -Inf  0.000000  1.333333  7.000000 69.000000
## 
## $method
## [1] "misclass"
## 
## attr(,"class")
## [1] "prune"         "tree.sequence"

plot(cv.df)

#pruning
prune.df=prune.misclass(tree.df1, best = 8)
plot(prune.df)
text(prune.df, pretty = 0)

#predicting values again
tree.pred=predict(tree.df1, test, type = "class")
table(test$status,tree.pred)

##           tree.pred
##            acquired closed
##   acquired      183      7
##   closed         61     38

##Building Random Forest
library(randomForest)
set.seed(111)
rf1=randomForest(status~age_first_funding_year+
                    age_last_funding_year+
                    age_first_milestone_year+
                    age_last_milestone_year+
                    relationships+
                    funding_rounds+
                    funding_total_usd+milestones+
                    avg_participants,
                  data=train)

rf1

## 
## Call:
##  randomForest(formula = status ~ age_first_funding_year + age_last_funding_year +      age_first_milestone_year + age_last_milestone_year + relationships +      funding_rounds + funding_total_usd + milestones + avg_participants,      data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 21.92%
## Confusion matrix:
##          acquired closed class.error
## acquired      368     39   0.0958231
## closed        100    127   0.4405286

#prediction
rp1=predict(rf1,test)
table(test$status,rp1)

##           rp1
##            acquired closed
##   acquired      174     16
##   closed         34     65

#number of nodes for tree
hist(treesize(rf1),
     main = "Number of Nodes",
     col = "blue")

#variable importance
varImpPlot(rf1,
           sort=T,
           n.var = 10,
           main = "Top 10 - Variable Imortance")

importance(rf1)

##                          MeanDecreaseGini
## age_first_funding_year           36.20969
## age_last_funding_year            31.33394
## age_first_milestone_year         27.17028
## age_last_milestone_year          38.22115
## relationships                    56.50446
## funding_rounds                   14.78873
## funding_total_usd                42.08283
## milestones                       23.11928
## avg_participants                 20.88417

varUsed(rf1)

## [1] 7201 6978 5702 6029 5665 3206 7633 2831 4810

##K-Nearst Neighbor
# Libraries
library(caret)
library(pROC)
library(mlbench)



# KNN Model
trControl <- trainControl(method = "repeatedcv",
                          number = 10,
                          repeats = 3,
                          classProbs = TRUE,
                          summaryFunction = twoClassSummary)
set.seed(222)
fit <- train(status ~ age_first_funding_year+
               age_last_funding_year+
               age_first_milestone_year+
               age_last_milestone_year+
               relationships+
               funding_rounds+
               funding_total_usd+milestones+
               avg_participants,
             data = train,
             method = 'knn',
             tuneLength = 20,
             trControl = trControl,
             preProc = c("center", "scale"),
             metric = "ROC",
             tuneGrid = expand.grid(k = 1:60))

# Model Performance
fit

## k-Nearest Neighbors 
## 
## 634 samples
##   9 predictor
##   2 classes: 'acquired', 'closed' 
## 
## Pre-processing: centered (9), scaled (9) 
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 571, 571, 572, 570, 570, 571, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens       Spec     
##    1  0.6662039  0.7951220  0.5372859
##    2  0.7027739  0.7936789  0.5390646
##    3  0.7225433  0.8444309  0.5155468
##    4  0.7326698  0.8329268  0.5364295
##    5  0.7437672  0.8575813  0.5125823
##    6  0.7554646  0.8592073  0.5099473
##    7  0.7628925  0.8648780  0.5258893
##    8  0.7674427  0.8673374  0.5389987
##    9  0.7700594  0.8738211  0.5170619
##   10  0.7720584  0.8722358  0.5272069
##   11  0.7734950  0.8796138  0.5215415
##   12  0.7712766  0.8665854  0.5169302
##   13  0.7742161  0.8788008  0.5198287
##   14  0.7770114  0.8713211  0.5125823
##   15  0.7786278  0.8780081  0.5185771
##   16  0.7774859  0.8731098  0.5214756
##   17  0.7770979  0.8821341  0.5095520
##   18  0.7789911  0.8804472  0.5065217
##   19  0.7771332  0.8771545  0.5094862
##   20  0.7791949  0.8804065  0.5080369
##   21  0.7803944  0.8755285  0.5021739
##   22  0.7811409  0.8714228  0.4991436
##   23  0.7839361  0.8763415  0.5050725
##   24  0.7839984  0.8754878  0.4991436
##   25  0.7833236  0.8771545  0.5021080
##   26  0.7844370  0.8853455  0.5065876
##   27  0.7854299  0.8820732  0.5050725
##   28  0.7877190  0.8836585  0.5094862
##   29  0.7895708  0.8869715  0.5065876
##   30  0.7902137  0.8927236  0.5080369
##   31  0.7907341  0.8943293  0.5109354
##   32  0.7909993  0.8984146  0.5094203
##   33  0.7928511  0.8984146  0.5095520
##   34  0.7942122  0.8976220  0.5050725
##   35  0.7950387  0.9025813  0.5110672
##   36  0.7948743  0.9026016  0.5020422
##   37  0.7964964  0.9034553  0.5065876
##   38  0.7972950  0.8992886  0.5066535
##   39  0.7990403  0.9010163  0.5066535
##   40  0.7993030  0.9034756  0.5007246
##   41  0.8002782  0.9010163  0.5063900
##   42  0.7994470  0.9026220  0.5036232
##   43  0.7977072  0.9017683  0.5050066
##   44  0.7989044  0.9009146  0.5065217
##   45  0.7993054  0.9001220  0.5065876
##   46  0.7975124  0.8976016  0.5110672
##   47  0.7966311  0.8967886  0.5024374
##   48  0.7955276  0.8968699  0.5007246
##   49  0.7957459  0.8984959  0.5052701
##   50  0.7965302  0.8976829  0.5067194
##   51  0.7970009  0.8968699  0.5022398
##   52  0.7966643  0.8952439  0.5022398
##   53  0.7955690  0.8968699  0.5038867
##   54  0.7959459  0.8976626  0.4965086
##   55  0.7963677  0.8976829  0.4935441
##   56  0.7971509  0.8935976  0.4936100
##   57  0.7979647  0.8960366  0.4949934
##   58  0.7985234  0.8960366  0.4965086
##   59  0.7990948  0.8927846  0.4921607
##   60  0.7995663  0.8927642  0.4877470
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 41.

plot(fit)

varImp(fit)

## ROC curve variable importance
## 
##                          Importance
## relationships                100.00
## milestones                    71.57
## funding_rounds                50.77
## age_last_milestone_year       49.35
## funding_total_usd             40.54
## avg_participants              37.29
## age_first_milestone_year      28.30
## age_last_funding_year         17.16
## age_first_funding_year         0.00

pred <- predict(fit, newdata = test)
confusionMatrix(pred, test$status)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction acquired closed
##   acquired      172     45
##   closed         18     54
##                                           
##                Accuracy : 0.782           
##                  95% CI : (0.7299, 0.8282)
##     No Information Rate : 0.6574          
##     P-Value [Acc > NIR] : 2.573e-06       
##                                           
##                   Kappa : 0.4822          
##                                           
##  Mcnemar's Test P-Value : 0.001054        
##                                           
##             Sensitivity : 0.9053          
##             Specificity : 0.5455          
##          Pos Pred Value : 0.7926          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.6574          
##          Detection Rate : 0.5952          
##    Detection Prevalence : 0.7509          
##       Balanced Accuracy : 0.7254          
##                                           
##        'Positive' Class : acquired        
##

###Niave Bayes Classification
# Libraries
library(naivebayes)
library(dplyr)
library(ggplot2)
library(psych)

# Naive Bayes Model
md=naive_bayes(status ~ age_first_funding_year+
                 age_last_funding_year+
                 age_first_milestone_year+
                 age_last_milestone_year+
                 relationships+
                 funding_rounds+
                 funding_total_usd+milestones+
                 avg_participants, data = train, usekernel = T)
md

## 
## ================================== Naive Bayes ================================== 
##  
##  Call: 
## naive_bayes.formula(formula = status ~ age_first_funding_year + 
##     age_last_funding_year + age_first_milestone_year + age_last_milestone_year + 
##     relationships + funding_rounds + funding_total_usd + milestones + 
##     avg_participants, data = train, usekernel = T)
## 
## --------------------------------------------------------------------------------- 
##  
## Laplace smoothing: 0
## 
## --------------------------------------------------------------------------------- 
##  
##  A priori probabilities: 
## 
##  acquired    closed 
## 0.6419558 0.3580442 
## 
## --------------------------------------------------------------------------------- 
##  
##  Tables: 
## 
## --------------------------------------------------------------------------------- 
##  ::: age_first_funding_year::acquired (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (407 obs.);  Bandwidth 'bw' = 0.5093
## 
##        x                 y            
##  Min.   :-2.0264   Min.   :6.751e-05  
##  1st Qu.: 0.7464   1st Qu.:2.152e-02  
##  Median : 3.5192   Median :6.654e-02  
##  Mean   : 3.5192   Mean   :9.007e-02  
##  3rd Qu.: 6.2920   3rd Qu.:1.309e-01  
##  Max.   : 9.0648   Max.   :2.905e-01  
## 
## --------------------------------------------------------------------------------- 
##  ::: age_first_funding_year::closed (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (227 obs.);  Bandwidth 'bw' = 0.7533
## 
##        x                 y            
##  Min.   :-11.306   Min.   :8.000e-08  
##  1st Qu.: -2.713   1st Qu.:4.083e-04  
##  Median :  5.879   Median :2.328e-03  
##  Mean   :  5.879   Mean   :2.906e-02  
##  3rd Qu.: 14.472   3rd Qu.:2.013e-02  
##  Max.   : 23.065   Max.   :2.224e-01  
## 
## --------------------------------------------------------------------------------- 
##  ::: age_last_funding_year::acquired (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (407 obs.);  Bandwidth 'bw' = 0.6932
## 
##        x                y            
##  Min.   :-2.493   Min.   :1.663e-05  
##  1st Qu.: 1.626   1st Qu.:7.015e-03  
##  Median : 5.745   Median :4.243e-02  
##  Mean   : 5.745   Mean   :6.063e-02  
##  3rd Qu.: 9.864   3rd Qu.:1.167e-01  
##  Max.   :13.984   Max.   :1.511e-01  
## 
## --------------------------------------------------------------------------------- 
##  ::: age_last_funding_year::closed (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (227 obs.);  Bandwidth 'bw' = 0.9765
## 
##        x                 y            
##  Min.   :-11.976   Min.   :5.430e-06  
##  1st Qu.: -3.048   1st Qu.:8.298e-04  
##  Median :  5.879   Median :2.983e-03  
##  Mean   :  5.879   Mean   :2.797e-02  
##  3rd Qu.: 14.807   3rd Qu.:3.354e-02  
##  Max.   : 23.735   Max.   :1.462e-01  
## 
## --------------------------------------------------------------------------------- 
##  ::: age_first_milestone_year::acquired (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (407 obs.);  Bandwidth 'bw' = 0.6144
## 
##        x                 y            
##  Min.   :-5.5939   Min.   :1.811e-05  
##  1st Qu.:-0.4819   1st Qu.:3.391e-03  
##  Median : 4.6302   Median :1.506e-02  
##  Mean   : 4.6302   Mean   :4.886e-02  
##  3rd Qu.: 9.7422   3rd Qu.:7.780e-02  
##  Max.   :14.8542   Max.   :2.016e-01  
## 
## --------------------------------------------------------------------------------- 
##  ::: age_first_milestone_year::closed (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (227 obs.);  Bandwidth 'bw' = 0.5613
## 
##        x                y            
##  Min.   :-8.689   Min.   :1.000e-08  
##  1st Qu.:-1.930   1st Qu.:6.107e-04  
##  Median : 4.829   Median :4.143e-03  
##  Mean   : 4.829   Mean   :3.695e-02  
##  3rd Qu.:11.588   3rd Qu.:3.847e-02  
##  Max.   :18.347   Max.   :2.791e-01  
## 
## --------------------------------------------------------------------------------- 
##  ::: age_last_milestone_year::acquired (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (407 obs.);  Bandwidth 'bw' = 0.7206
## 
##        x                 y            
##  Min.   :-5.9126   Min.   :1.571e-05  
##  1st Qu.:-0.4974   1st Qu.:2.453e-03  
##  Median : 4.9178   Median :2.446e-02  
##  Mean   : 4.9178   Mean   :4.612e-02  
##  3rd Qu.:10.3330   3rd Qu.:8.403e-02  
##  Max.   :15.7482   Max.   :1.638e-01  
## 
## --------------------------------------------------------------------------------- 
##  ::: age_last_milestone_year::closed (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (227 obs.);  Bandwidth 'bw' = 0.7199
## 
##        x                y            
##  Min.   :-9.165   Min.   :2.781e-05  
##  1st Qu.:-1.753   1st Qu.:1.268e-03  
##  Median : 5.659   Median :2.571e-03  
##  Mean   : 5.659   Mean   :3.370e-02  
##  3rd Qu.:13.071   3rd Qu.:4.343e-02  
##  Max.   :20.483   Max.   :2.384e-01  
## 
## --------------------------------------------------------------------------------- 
##  ::: relationships::acquired (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (407 obs.);  Bandwidth 'bw' = 1.817
## 
##        x                y            
##  Min.   :-5.452   Min.   :4.770e-06  
##  1st Qu.:13.024   1st Qu.:4.150e-04  
##  Median :31.500   Median :1.555e-03  
##  Mean   :31.500   Mean   :1.352e-02  
##  3rd Qu.:49.976   3rd Qu.:1.503e-02  
##  Max.   :68.452   Max.   :8.707e-02  
## 
## --------------------------------------------------------------------------------- 
##  ::: relationships::closed (KDE)
## --------------------------------------------------------------------------------- 
## 
## Call:
##  density.default(x = x, na.rm = TRUE)
## 
## Data: x (227 obs.);  Bandwidth 'bw' = 0.7943
## 
##        x                y            
##  Min.   :-2.383   Min.   :6.220e-06  
##  1st Qu.: 5.559   1st Qu.:9.871e-04  
##  Median :13.500   Median :7.235e-03  
##  Mean   :13.500   Mean   :3.145e-02  
##  3rd Qu.:21.441   3rd Qu.:4.189e-02  
##  Max.   :29.383   Max.   :1.569e-01  
## 
## ---------------------------------------------------------------------------------
## 
## # ... and 4 more tables
## 
## ---------------------------------------------------------------------------------

#plotting model and finding several plots
plot(md)

#Prediction
p <- predict(md, train, type = 'prob')
head(cbind(p, train))

##    acquired     closed ID state_code      id          city
## 1 0.8994251 0.10057493  1         CA  c:6669     San Diego
## 2 0.9765062 0.02349377  2         CA c:16283     Los Gatos
## 3 0.5495379 0.45046212  3         CA c:65620     San Diego
## 4 0.9346075 0.06539251  4         CA c:42668     Cupertino
## 6 0.3733746 0.62662545  6         CA c:22898 Mountain View
## 7 0.9589714 0.04102858  7         CA c:16191 Mountain View
##                    name labels founded_at closed_at first_funding_at
## 1           Bandsintown      1   1/1/2007                   4/1/2009
## 2             TriCipher      1   1/1/2000                  2/14/2005
## 3                 Plixi      1  3/18/2009                  3/30/2010
## 4     Solidcore Systems      1   1/1/2002                  2/17/2005
## 6      Matisse Networks      0   1/1/2002 2/15/2009        7/18/2006
## 7 RingCube Technologies      1   1/1/2005                  9/21/2006
##   last_funding_at age_first_funding_year age_last_funding_year
## 1        1/1/2010                 2.2493                3.0027
## 2      12/28/2009                 5.1260                9.9973
## 3       3/30/2010                 1.0329                1.0329
## 4       4/25/2007                 3.1315                5.3151
## 6       7/18/2006                 4.5452                4.5452
## 7       3/18/2010                 1.7205                5.2110
##   age_first_milestone_year age_last_milestone_year relationships funding_rounds
## 1                   4.6685                  6.7041             3              3
## 2                   7.0055                  7.0055             9              4
## 3                   1.4575                  2.2055             5              1
## 4                   6.0027                  6.0027             5              3
## 6                   5.0027                  5.0027             3              1
## 7                   3.0000                  6.6082             6              3
##   funding_total_usd milestones is_CA is_NY is_MA is_TX is_otherstate
## 1            375000          3     1     0     0     0             0
## 2          40100000          1     1     0     0     0             0
## 3           2600000          2     1     0     0     0             0
## 4          40000000          1     1     0     0     0             0
## 6           7500000          1     1     0     0     0             0
## 7          26000000          2     1     0     0     0             0
##     category_code is_software is_web is_mobile is_enterprise is_advertising
## 1           music           0      0         0             0              0
## 2      enterprise           0      0         0             1              0
## 3             web           0      1         0             0              0
## 4        software           1      0         0             0              0
## 6 network_hosting           0      0         0             0              0
## 7        software           1      0         0             0              0
##   is_gamesvideo is_ecommerce is_biotech is_consulting is_othercategory has_VC
## 1             0            0          0             0                1      0
## 2             0            0          0             0                0      1
## 3             0            0          0             0                0      0
## 4             0            0          0             0                0      0
## 6             0            0          0             0                1      0
## 7             0            0          0             0                0      1
##   has_angel has_roundA has_roundB has_roundC has_roundD avg_participants
## 1         1          0          0          0          0           1.0000
## 2         0          0          1          1          1           4.7500
## 3         0          1          0          0          0           4.0000
## 4         0          0          1          1          1           3.3333
## 6         0          0          1          0          0           3.0000
## 7         0          1          1          0          0           1.6667
##   is_top500            Gender   Age   Experience Education  C1  C2  C3  C4
## 1         0                            6-8 Years       PhD Yes Yes Yes Yes
## 2         1 Prefer not to say 41-50 Over 8 Years   Masters Yes Yes Yes Yes
## 3         1              Male 31-40 Over 8 Years Bachelors Yes Yes Yes Yes
## 4         1              Male 31-40   8-10 Years   Masters Yes Yes Yes Yes
## 6         1              Male 31-40 Over 8 Years Bachelors Yes Yes Yes Yes
## 7         1              Male 31-40 Over 8 Years Bachelors Yes Yes  No  No
##     status
## 1 acquired
## 2 acquired
## 3 acquired
## 4 acquired
## 6   closed
## 7 acquired

#confusion matrix
#Confusion Matrix - train data
p1 <- predict(md, train)
(tab1 <- table(p1, train$status))

##           
## p1         acquired closed
##   acquired      324     60
##   closed         83    167

1 - sum(diag(tab1)) / sum(tab1)

## [1] 0.2255521

Conclusion-

Random Forest Model gives most accurate prediction at 76% for a company’s success or failure

There are more firms that were acquired than those that were closed

Most Managers Seem to be between age 31 to 40

A positive association exists between Project Success and Entrepreneurial Acclimatization

A positive association exists between Acclimatization in Technology and Success in Projects

A positive arbitration that exists in the relationship of Entrepreneurial Acclimatization and Project Success

Final-Project DATA-698

Ashish Kumar

May 19 2021

Lets Dig into the data

The next step that would follow would be the Descriptive Research Design and for this matter therefore, there will be a run of the descriptive analytics and to start off it is wise to note that the dataset has a mix of integer, numeric and character attributes in the mix up.

Conclusion-