##INTRODUCTION

#The Future 500 dataset contains financial records for 500 companies belonging to seven different industries. It has 500 rows and 11 columns.
#The dataset has several discrepancies, outliers and missing values that need to be rectified before robust analysis can be done on the prepared data.

#Our aim is to analyse profit and growth across the various industries. How are they related, and how do the industries compare? Which industry looks likely to decline and which holds the brightest future? Which industry generates the most revenue yet makes relatively little profit? Answering these and similar questions yields insights with real business value.

# So let's import the data and run some summary statistics to understand it.

#Importing the dataset

fin <- read.csv("Future-500.csv", na.strings=c("")) # treat empty strings as NA on import
head(fin, 10) # Let's see how our data looks
##    ID         Name           Industry Inception Employees State
## 1   1     Over-Hex           Software      2006        25    TN
## 2   2    Unimattax        IT Services      2009        36    PA
## 3   3     Greenfax             Retail      2012        NA    SC
## 4   4    Blacklane        IT Services      2011        66    CA
## 5   5     Yearflex           Software      2013        45    WI
## 6   6 Indigoplanet        IT Services      2013        60    NJ
## 7   7      Treslam Financial Services      2009       116    MO
## 8   8    Rednimdox       Construction      2013        73    NY
## 9   9      Lamtone        IT Services      2009        55    CA
## 10 10    Stripfind Financial Services      2010        25    FL
##              City     Revenue          Expenses   Profit Growth
## 1        Franklin  $9,684,527 1,130,700 Dollars  8553827    19%
## 2  Newtown Square $14,016,543   804,035 Dollars 13212508    20%
## 3      Greenville  $9,746,272 1,044,375 Dollars  8701897    16%
## 4          Orange $15,359,369 4,631,808 Dollars 10727561    19%
## 5         Madison  $8,567,910 4,374,841 Dollars  4193069    19%
## 6       Manalapan $12,805,452 4,626,275 Dollars  8179177    22%
## 7         Clayton  $5,387,469 2,127,984 Dollars  3259485    17%
## 8        Woodside        <NA>              <NA>       NA   <NA>
## 9       San Ramon $11,757,018 6,482,465 Dollars  5274553    30%
## 10     Boca Raton $12,329,371   916,455 Dollars 11412916    20%
library(psych)
describe(fin)
##           vars   n       mean         sd    median    trimmed        mad
## ID           1 500     250.50     144.48     250.5     250.50     185.32
## Name*        2 500     250.50     144.48     250.5     250.50     185.32
## Industry*    3 498       4.25       1.79       5.0       4.32       1.48
## Inception    4 499    2010.17       3.23    2011.0    2010.71       1.48
## Employees    5 498     148.61     397.35      56.0      80.68      52.63
## State*       6 496      21.95      13.33      21.5      22.16      20.02
## City*        7 500     147.09      85.86     153.0     147.41     111.19
## Revenue*     8 498     249.50     143.90     249.5     249.50     184.58
## Expenses*    9 497     249.00     143.62     249.0     249.00     183.84
## Profit      10 498 6539474.01 3869933.65 6513366.0 6421961.00 4537424.65
## Growth*     11 499      17.70       8.36      16.0      17.54       8.90
##             min      max    range  skew kurtosis        se
## ID            1      500      499  0.00    -1.21      6.46
## Name*         1      500      499  0.00    -1.21      6.46
## Industry*     1        7        6 -0.28    -0.82      0.08
## Inception  1999     2014       15 -1.59     2.39      0.14
## Employees     1     7125     7124 11.98   192.87     17.81
## State*        1       42       41 -0.09    -1.49      0.60
## City*         1      297      296 -0.06    -1.27      3.84
## Revenue*      1      498      497  0.00    -1.21      6.45
## Expenses*     1      497      496  0.00    -1.21      6.44
## Profit    12434 19624534 19612100  0.25    -0.55 173415.87
## Growth*       1       32       31  0.25    -1.14      0.37
str(fin)
## 'data.frame':    500 obs. of  11 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Name     : Factor w/ 500 levels "Abstractedchocolat",..: 297 451 168 40 485 199 435 339 242 395 ...
##  $ Industry : Factor w/ 7 levels "Construction",..: 7 5 6 5 7 5 2 1 5 2 ...
##  $ Inception: int  2006 2009 2012 2011 2013 2013 2009 2013 2009 2010 ...
##  $ Employees: int  25 36 NA 66 45 60 116 73 55 25 ...
##  $ State    : Factor w/ 42 levels "AL","AZ","CA",..: 36 33 35 3 41 27 22 29 3 8 ...
##  $ City     : Factor w/ 297 levels "Addison","Alexandria",..: 94 181 105 195 151 154 53 295 232 26 ...
##  $ Revenue  : Factor w/ 498 levels "$1,614,585","$1,835,717",..: 479 194 485 246 402 141 308 NA 96 117 ...
##  $ Expenses : Factor w/ 497 levels "1,026,548 Dollars",..: 6 485 3 248 227 247 57 NA 402 495 ...
##  $ Profit   : int  8553827 13212508 8701897 10727561 4193069 8179177 3259485 NA 5274553 11412916 ...
##  $ Growth   : Factor w/ 32 levels "-2%","-3%","0%",..: 14 16 11 14 14 18 12 NA 26 16 ...
summary(fin) 
##        ID                        Name                   Industry  
##  Min.   :  1.0   Abstractedchocolat:  1   IT Services       :146  
##  1st Qu.:125.8   Abusivebong       :  1   Health            : 86  
##  Median :250.5   Acclaimedcirl     :  1   Software          : 64  
##  Mean   :250.5   Admitruppell      :  1   Financial Services: 54  
##  3rd Qu.:375.2   Admonishbadelynge :  1   Construction      : 50  
##  Max.   :500.0   Ahemparticular    :  1   (Other)           : 98  
##                  (Other)           :494   NA's              :  2  
##    Inception      Employees           State              City    
##  Min.   :1999   Min.   :   1.00   CA     : 57   San Diego  : 13  
##  1st Qu.:2009   1st Qu.:  27.25   VA     : 50   New York   : 11  
##  Median :2011   Median :  56.00   TX     : 47   Reston     : 10  
##  Mean   :2010   Mean   : 148.61   FL     : 34   Houston    :  9  
##  3rd Qu.:2012   3rd Qu.: 126.00   MD     : 25   Austin     :  8  
##  Max.   :2014   Max.   :7125.00   (Other):283   Minneapolis:  8  
##  NA's   :1      NA's   :2         NA's   :  4   (Other)    :441  
##         Revenue                 Expenses       Profit        
##  $1,614,585 :  1   1,026,548 Dollars:  1   Min.   :   12434  
##  $1,835,717 :  1   1,040,662 Dollars:  1   1st Qu.: 3272074  
##  $10,064,297:  1   1,044,375 Dollars:  1   Median : 6513366  
##  $10,067,223:  1   1,097,353 Dollars:  1   Mean   : 6539474  
##  $10,072,452:  1   1,117,206 Dollars:  1   3rd Qu.: 9303951  
##  (Other)    :493   (Other)          :492   Max.   :19624534  
##  NA's       :  2   NA's             :  3   NA's   :2         
##      Growth   
##  20%    : 39  
##  19%    : 35  
##  17%    : 27  
##  6%     : 25  
##  12%    : 24  
##  (Other):349  
##  NA's   :  1
#Looking at the data types, ID and Inception should be converted into factors, since we don't want to summarise them numerically.
#Revenue, Expenses and Growth have been read in as factors because they are not purely numerical: they carry characters such as "$", ",", "Dollars" and "%". We need to clean them into numeric form so that analysis can be performed on them. This is the main part of the data cleaning, so let's get into it.

fin$ID <- factor(fin$ID)
fin$Inception <- factor(fin$Inception)
#Let's check the data types of ID and Inception after the conversion.
class(fin$Inception)
## [1] "factor"
class(fin$ID)
## [1] "factor"
#See, the conversion was successful.
#Let's move on to cleaning the other columns. For that we will use the functions sub() and gsub() to strip the special characters from the Revenue, Expenses and Growth columns.
#gsub() replaces every match of a pattern within a string, while sub() replaces only the first match.

fin$Revenue <- gsub("\\$","",fin$Revenue)
fin$Revenue <- gsub( ",","",fin$Revenue)
fin$Expenses <- gsub(",","",fin$Expenses)
fin$Expenses <- gsub(" Dollars","",fin$Expenses)
fin$Growth <- gsub("%","",fin$Growth)
str(fin) 
## 'data.frame':    500 obs. of  11 variables:
##  $ ID       : Factor w/ 500 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Name     : Factor w/ 500 levels "Abstractedchocolat",..: 297 451 168 40 485 199 435 339 242 395 ...
##  $ Industry : Factor w/ 7 levels "Construction",..: 7 5 6 5 7 5 2 1 5 2 ...
##  $ Inception: Factor w/ 16 levels "1999","2000",..: 8 11 14 13 15 15 11 15 11 12 ...
##  $ Employees: int  25 36 NA 66 45 60 116 73 55 25 ...
##  $ State    : Factor w/ 42 levels "AL","AZ","CA",..: 36 33 35 3 41 27 22 29 3 8 ...
##  $ City     : Factor w/ 297 levels "Addison","Alexandria",..: 94 181 105 195 151 154 53 295 232 26 ...
##  $ Revenue  : chr  "9684527" "14016543" "9746272" "15359369" ...
##  $ Expenses : chr  "1130700" "804035" "1044375" "4631808" ...
##  $ Profit   : int  8553827 13212508 8701897 10727561 4193069 8179177 3259485 NA 5274553 11412916 ...
##  $ Growth   : chr  "19" "20" "16" "19" ...
# Stripping the special characters leaves these columns as character vectors, so let's convert them to numeric.
fin$Revenue <- as.numeric(fin$Revenue)
fin$Expenses <- as.numeric(fin$Expenses)
fin$Growth <- as.numeric(fin$Growth)
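#As an aside, the same cleaning could have been done in one step with readr::parse_number(),
#which strips currency symbols, grouping commas and trailing text and returns a numeric vector.
#This is only an alternative sketch (readr is part of the tidyverse loaded later), not part of
#the workflow above, so it is left commented out:

# library(readr)
# fin$Revenue  <- parse_number(as.character(fin$Revenue))
# fin$Expenses <- parse_number(as.character(fin$Expenses))
# fin$Growth   <- parse_number(as.character(fin$Growth))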

#Let's see how our dataset looks now!!
head(fin[,c(8,9,10,11)])
##    Revenue Expenses   Profit Growth
## 1  9684527  1130700  8553827     19
## 2 14016543   804035 13212508     20
## 3  9746272  1044375  8701897     16
## 4 15359369  4631808 10727561     19
## 5  8567910  4374841  4193069     19
## 6 12805452  4626275  8179177     22
summary(fin)
##        ID                      Name                   Industry  
##  1      :  1   Abstractedchocolat:  1   IT Services       :146  
##  2      :  1   Abusivebong       :  1   Health            : 86  
##  3      :  1   Acclaimedcirl     :  1   Software          : 64  
##  4      :  1   Admitruppell      :  1   Financial Services: 54  
##  5      :  1   Admonishbadelynge :  1   Construction      : 50  
##  6      :  1   Ahemparticular    :  1   (Other)           : 98  
##  (Other):494   (Other)           :494   NA's              :  2  
##    Inception     Employees           State              City    
##  2011   : 93   Min.   :   1.00   CA     : 57   San Diego  : 13  
##  2010   : 83   1st Qu.:  27.25   VA     : 50   New York   : 11  
##  2012   : 80   Median :  56.00   TX     : 47   Reston     : 10  
##  2013   : 69   Mean   : 148.61   FL     : 34   Houston    :  9  
##  2009   : 60   3rd Qu.: 126.00   MD     : 25   Austin     :  8  
##  (Other):114   Max.   :7125.00   (Other):283   Minneapolis:  8  
##  NA's   :  1   NA's   :2         NA's   :  4   (Other)    :441  
##     Revenue            Expenses           Profit             Growth     
##  Min.   : 1614585   Min.   :  71219   Min.   :   12434   Min.   :-3.00  
##  1st Qu.: 8695702   1st Qu.:2758425   1st Qu.: 3272074   1st Qu.: 8.00  
##  Median :10647231   Median :4365512   Median : 6513366   Median :15.00  
##  Mean   :10845170   Mean   :4310134   Mean   : 6539474   Mean   :14.38  
##  3rd Qu.:13106928   3rd Qu.:5832473   3rd Qu.: 9303951   3rd Qu.:20.00  
##  Max.   :21810051   Max.   :9860686   Max.   :19624534   Max.   :30.00  
##  NA's   :2          NA's   :3         NA's   :2          NA's   :1
# Nice, the columns now have suitable data types.
# That completes the first part of data preparation: the data has been cleansed. Now it's time to deal with the missing values one by one. Dealing with missing values demands some judgement. How should we handle them?

#1) If the missing values are few in number and have no serious impact on the analysis, the affected rows can simply be deleted. If they are many, deletion is not an option as it would result in a loss of useful information.
#2) If the data is normally distributed, missing values can be replaced with the mean. Normality can be checked with statistical tests, boxplots, histograms, QQ-plots etc. (see the sketch right after this list).
#3) If the data is not normally distributed, or if the column's values depend on the category they belong to, we use the median (per category) instead.
#4) Sometimes a missing value can be computed mathematically/statistically from other columns and/or the same column, which is more accurate than the generic imputation methods above.
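#For example, before choosing between mean and median imputation for Employees, a quick normality
#check could look like this (a sketch for illustration, not part of the original workflow; given
#the heavy right skew already visible in describe(), the median is the safer choice here):

hist(fin$Employees, breaks = 30)               # visual check of the distribution's shape
qqnorm(fin$Employees); qqline(fin$Employees)   # points far off the line suggest non-normality
shapiro.test(fin$Employees)                    # p < 0.05 => reject normality, prefer the median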

#Note: Data preparation also includes dealing with outliers and anomalies before handling missing values. Outliers occasionally point to genuine discoveries, but usually they indicate data entry faults. They can be ignored if their influence is negligible; otherwise their effect needs to be analysed. We are not treating them here, as this dataset contains only 500 rows and their negative impact should be limited. Still, it is good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results and you can't figure out why they're there, it's reasonable to replace them with missing values and move on. However, if they have a substantial effect on your results, you shouldn't drop them without justification: figure out what caused them (e.g. a data entry error) and disclose that you removed them in your write-up.
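#Purely for illustration, here is one way the 1.5*IQR boxplot rule could be used to flag potential
#outliers in a numeric column such as Employees, and to compare the summaries with and without the
#flagged rows (a sketch only; nothing is removed from the data used below):

emp_q     <- quantile(fin$Employees, c(0.25, 0.75), na.rm = TRUE)
emp_fence <- emp_q[2] + 1.5 * diff(emp_q)                                  # upper fence of the boxplot rule
fin[which(fin$Employees > emp_fence), c("Name", "Industry", "Employees")]  # candidate outliers
summary(fin$Employees)                                                     # with the extreme values
summary(fin$Employees[which(fin$Employees <= emp_fence)])                  # without them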

#Enough theory, let's dive straight back into the work.
# Dealing with missing data

#Replacing Missing Data: Factual Analysis
#-----
#Dealing with missing State values where the City is New York
fin[is.na(fin$State),] # data with missing state values
##      ID            Name Industry Inception Employees State          City
## 11   11 Canecorporation   Health      2012         6  <NA>      New York
## 84   84      Drilldrill Software      2010        30  <NA> San Francisco
## 267 267      Circlechop Software      2010        14  <NA> San Francisco
## 379 379       Stovepuck   Retail      2013        73  <NA>      New York
##      Revenue Expenses  Profit Growth
## 11  10597009  7591189 3005820      7
## 84   7800620  2785799 5014821     17
## 267  9067070  5929828 3137242     20
## 379 13814975  5904502 7910473     10
fin[is.na(fin$State) & fin$City=="New York",] #Data with missing values having city as New York
##      ID            Name Industry Inception Employees State     City
## 11   11 Canecorporation   Health      2012         6  <NA> New York
## 379 379       Stovepuck   Retail      2013        73  <NA> New York
##      Revenue Expenses  Profit Growth
## 11  10597009  7591189 3005820      7
## 379 13814975  5904502 7910473     10
fin[is.na(fin$State) & fin$City=="New York","State"] <- "NY" # replace the missing State with NY where the City is New York
fin[is.na(fin$State) & fin$City=="New York",] # empty - no missing State values with City "New York" remain
##  [1] ID        Name      Industry  Inception Employees State     City     
##  [8] Revenue   Expenses  Profit    Growth   
## <0 rows> (or 0-length row.names)
fin[!complete.cases(fin),]
##      ID        Name           Industry Inception Employees State
## 3     3    Greenfax             Retail      2012        NA    SC
## 8     8   Rednimdox       Construction      2013        73    NY
## 14   14    Techline               <NA>      2006        65    CA
## 15   15     Cityace               <NA>      2010        25    CO
## 17   17     Ganzlax        IT Services      2011        75    NJ
## 22   22  Lathotline             Health      <NA>       103    VA
## 44   44   Ganzgreen       Construction      2010       224    TN
## 84   84  Drilldrill           Software      2010        30  <NA>
## 267 267  Circlechop           Software      2010        14  <NA>
## 332 332 Westminster Financial Services      2010        NA    MI
##              City  Revenue Expenses   Profit Growth
## 3      Greenville  9746272  1044375  8701897     16
## 8        Woodside       NA       NA       NA     NA
## 14      San Ramon 13898119  5470303  8427816     23
## 15     Louisville  9254614  6249498  3005116      6
## 17         Iselin 14001180       NA 11901180     18
## 22         McLean  9418303  7567233  1851070      2
## 44       Franklin       NA       NA       NA      9
## 84  San Francisco  7800620  2785799  5014821     17
## 267 San Francisco  9067070  5929828  3137242     20
## 332          Troy 11861652  5245126  6616526     15
#----
#Dealing with missing State values where the City is San Francisco
fin[is.na(fin$State) & fin$City=="San Francisco", "State"] <- "CA"
fin[!complete.cases(fin),]
##      ID        Name           Industry Inception Employees State
## 3     3    Greenfax             Retail      2012        NA    SC
## 8     8   Rednimdox       Construction      2013        73    NY
## 14   14    Techline               <NA>      2006        65    CA
## 15   15     Cityace               <NA>      2010        25    CO
## 17   17     Ganzlax        IT Services      2011        75    NJ
## 22   22  Lathotline             Health      <NA>       103    VA
## 44   44   Ganzgreen       Construction      2010       224    TN
## 332 332 Westminster Financial Services      2010        NA    MI
##           City  Revenue Expenses   Profit Growth
## 3   Greenville  9746272  1044375  8701897     16
## 8     Woodside       NA       NA       NA     NA
## 14   San Ramon 13898119  5470303  8427816     23
## 15  Louisville  9254614  6249498  3005116      6
## 17      Iselin 14001180       NA 11901180     18
## 22      McLean  9418303  7567233  1851070      2
## 44    Franklin       NA       NA       NA      9
## 332       Troy 11861652  5245126  6616526     15
#State has been taken care of.
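#If more cities with missing states turned up, a small lookup table would scale better than writing
#one replacement per city. A sketch with a hypothetical lookup vector city_to_state is shown below
#(commented out, since the two cases above are already handled):

# city_to_state <- c("New York" = "NY", "San Francisco" = "CA")
# idx <- is.na(fin$State) & as.character(fin$City) %in% names(city_to_state)
# fin$State[idx] <- city_to_state[as.character(fin$City[idx])]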


#Replacing Missing Data: Median Imputation Method, Part 1 - Employees Column

med_employee_retail <- median(fin[fin$Industry=="Retail", "Employees"], na.rm=T)
fin[is.na(fin$Employees) & fin$Industry=="Retail", "Employees"] <- med_employee_retail
fin[is.na(fin$Employees),]
##      ID        Name           Industry Inception Employees State City
## 332 332 Westminster Financial Services      2010        NA    MI Troy
##      Revenue Expenses  Profit Growth
## 332 11861652  5245126 6616526     15
med_employee_FinancialServices <- median(fin[fin$Industry=="Financial Services", "Employees"], na.rm=T)
fin[is.na(fin$Employees),] # row 332 still has a missing Employees value
##      ID        Name           Industry Inception Employees State City
## 332 332 Westminster Financial Services      2010        NA    MI Troy
##      Revenue Expenses  Profit Growth
## 332 11861652  5245126 6616526     15
fin[is.na(fin$Employees) & fin$Industry=="Financial Services", "Employees"] <- med_employee_FinancialServices
fin[332,] # row 332 has been imputed with the median value of 80
##      ID        Name           Industry Inception Employees State City
## 332 332 Westminster Financial Services      2010        80    MI Troy
##      Revenue Expenses  Profit Growth
## 332 11861652  5245126 6616526     15
fin[!complete.cases(fin),] 
##    ID       Name     Industry Inception Employees State       City
## 8   8  Rednimdox Construction      2013        73    NY   Woodside
## 14 14   Techline         <NA>      2006        65    CA  San Ramon
## 15 15    Cityace         <NA>      2010        25    CO Louisville
## 17 17    Ganzlax  IT Services      2011        75    NJ     Iselin
## 22 22 Lathotline       Health      <NA>       103    VA     McLean
## 44 44  Ganzgreen Construction      2010       224    TN   Franklin
##     Revenue Expenses   Profit Growth
## 8        NA       NA       NA     NA
## 14 13898119  5470303  8427816     23
## 15  9254614  6249498  3005116      6
## 17 14001180       NA 11901180     18
## 22  9418303  7567233  1851070      2
## 44       NA       NA       NA      9
#Nice, only 6 incomplete rows are left now.
#Replacing Missing Data: Median Imputation Method, Part 2 - Growth Column

median_growth_construction <- median(fin[fin$Industry=="Construction", "Growth"], na.rm=T)
fin[is.na(fin$Growth) & fin$Industry=="Construction", "Growth"] <- median_growth_construction
fin[8,]
##   ID      Name     Industry Inception Employees State     City Revenue
## 8  8 Rednimdox Construction      2013        73    NY Woodside      NA
##   Expenses Profit Growth
## 8       NA     NA     10
fin[!complete.cases(fin),] 
##    ID       Name     Industry Inception Employees State       City
## 8   8  Rednimdox Construction      2013        73    NY   Woodside
## 14 14   Techline         <NA>      2006        65    CA  San Ramon
## 15 15    Cityace         <NA>      2010        25    CO Louisville
## 17 17    Ganzlax  IT Services      2011        75    NJ     Iselin
## 22 22 Lathotline       Health      <NA>       103    VA     McLean
## 44 44  Ganzgreen Construction      2010       224    TN   Franklin
##     Revenue Expenses   Profit Growth
## 8        NA       NA       NA     10
## 14 13898119  5470303  8427816     23
## 15  9254614  6249498  3005116      6
## 17 14001180       NA 11901180     18
## 22  9418303  7567233  1851070      2
## 44       NA       NA       NA      9
#The missing Growth value in row 8 has been filled; the same 6 rows remain incomplete because of other missing values.

median_revenue_construction <- median(fin[fin$Industry=="Construction", "Revenue"], na.rm=T)

#fin[is.na(fin$Revenue) & fin$Industry=="Construction","Revenue"] <- median_revenue_construction   # not used - see explanation below

fin[is.na(fin$Revenue) & fin$Industry=="Construction" & is.na(fin$Profit),"Revenue"] <- median_revenue_construction

median_expenses_construction <- median(fin[fin$Industry=="Construction", "Expenses"], na.rm=T)

#fin[is.na(fin$Expenses) & fin$Industry=="Construction","Expenses"] <- median_expenses_construction   # not used - see explanation below

fin[is.na(fin$Expenses) & fin$Industry=="Construction" & is.na(fin$Profit),"Expenses"] <- median_expenses_construction
fin[!complete.cases(fin),] 
##    ID       Name     Industry Inception Employees State       City
## 8   8  Rednimdox Construction      2013        73    NY   Woodside
## 14 14   Techline         <NA>      2006        65    CA  San Ramon
## 15 15    Cityace         <NA>      2010        25    CO Louisville
## 17 17    Ganzlax  IT Services      2011        75    NJ     Iselin
## 22 22 Lathotline       Health      <NA>       103    VA     McLean
## 44 44  Ganzgreen Construction      2010       224    TN   Franklin
##     Revenue Expenses   Profit Growth
## 8   9055059  4506976       NA     10
## 14 13898119  5470303  8427816     23
## 15  9254614  6249498  3005116      6
## 17 14001180       NA 11901180     18
## 22  9418303  7567233  1851070      2
## 44  9055059  4506976       NA      9
#The unconditional versions (commented out above) were not used because a row might be missing only Expenses (or only Revenue) while the other two financial columns are present; in that case the missing value can be computed exactly from the identity Profit = Revenue - Expenses, which is more accurate than a median. We therefore restrict the median imputation to rows where Profit is also missing, i.e. rows with at least two gaps where the exact computation is impossible.

#The exact mathematical computation, used where the neighbouring values are present, is shown below.
#Replacing Missing Values: Expenses = Revenue - Profit, Profit = Revenue - Expenses

fin[is.na(fin$Profit), "Profit"] <- (fin$Revenue - fin$Expenses)[is.na(fin$Profit)]
#fin[is.na(fin$Profit), "Profit"] <- fin[is.na(fin$Profit), "Revenue"] - fin[is.na(fin$Profit), "Expenses"] # equivalent alternative

fin[!complete.cases(fin$Profit),] # note the comma: we want the rows (not columns) where Profit is still missing
##  [1] ID        Name      Industry  Inception Employees State     City     
##  [8] Revenue   Expenses  Profit    Growth   
## <0 rows> (or 0-length row.names)
#Profit has been taken care of now. Let's do the same for Expenses.

fin[is.na(fin$Expenses), "Expenses"] <- (fin$Revenue - fin$Profit)[is.na(fin$Expenses)]
#fin[is.na(fin$Expenses), "Expenses"] <- fin[is.na(fin$Expenses), "Revenue"] - fin[is.na(fin$Expenses), "Profit"] # equivalent alternative

fin[!complete.cases(fin),]
##    ID       Name Industry Inception Employees State       City  Revenue
## 14 14   Techline     <NA>      2006        65    CA  San Ramon 13898119
## 15 15    Cityace     <NA>      2010        25    CO Louisville  9254614
## 22 22 Lathotline   Health      <NA>       103    VA     McLean  9418303
##    Expenses  Profit Growth
## 14  5470303 8427816     23
## 15  6249498 3005116      6
## 22  7567233 1851070      2
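#As a quick sanity check (not part of the original workflow), the accounting identity
#Profit = Revenue - Expenses should hold for the rows we just filled in, and - if the raw data
#is internally consistent - for the rest of the dataset too, so we would expect this count of
#violations to be zero or close to it:

sum(fin$Profit != (fin$Revenue - fin$Expenses), na.rm = TRUE) # number of rows breaking the identity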
fi <- fin[complete.cases(fin$Industry),] # drop the rows with a missing Industry - it cannot be imputed and our analysis is by industry
fi[!complete.cases(fi),]
##    ID       Name Industry Inception Employees State   City Revenue
## 22 22 Lathotline   Health      <NA>       103    VA McLean 9418303
##    Expenses  Profit Growth
## 22  7567233 1851070      2
fin <- fi # continue with the dataset excluding the rows with unknown Industry
fin[!complete.cases(fin),]
##    ID       Name Industry Inception Employees State   City Revenue
## 22 22 Lathotline   Health      <NA>       103    VA McLean 9418303
##    Expenses  Profit Growth
## 22  7567233 1851070      2
#The only remaining gap is the Inception year in row 22; we leave it as it is because Inception is not needed for the analysis below.
#----------The end of data preparation----------# Let's move on to the visualisation part.
#Visualisation Part

#Scatterplot showing the relationship between Revenue and Expenses, with point size mapped to Profit.

library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.0.0     v purrr   0.2.5
## v tibble  1.4.2     v dplyr   0.7.6
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::%+%()   masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()
ggplot(fin, aes(Revenue, Expenses,size=Profit)) + geom_point() + geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# As revenue increases, profit also tends to increase (the larger points sit towards the right), suggesting a positive correlation. Let's test it formally.
cor.test(fin$Revenue, fin$Profit)
## 
##  Pearson's product-moment correlation
## 
## data:  fin$Revenue and fin$Profit
## t = 34.141, df = 496, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8092424 0.8619849
## sample estimates:
##       cor 
## 0.8375544
#Since p < 0.05, there is a statistically significant positive correlation between revenue and profit (r is about 0.84).
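#The same check could be repeated within each industry to see whether the relationship holds
#everywhere (a dplyr sketch; the resulting table is not shown here):

fin %>%
  group_by(Industry) %>%
  summarise(cor_revenue_profit = cor(Revenue, Profit, use = "complete.obs")) # per-industry correlation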

#Scatterplot showing Revenue against Expenses, coloured by industry, with point size mapped to Profit.

ggplot(fin, aes(Revenue, Expenses, color=Industry, size=Profit)) + geom_point() + geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#The highest revenues are generated by the IT Services and Software industries.
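#To attach rough numbers to that claim, a quick dplyr summary by industry could be run
#(a sketch; the resulting figures are not shown here):

fin %>%
  group_by(Industry) %>%
  summarise(total_revenue  = sum(Revenue, na.rm = TRUE),
            median_revenue = median(Revenue, na.rm = TRUE),
            median_growth  = median(Growth, na.rm = TRUE)) %>%
  arrange(desc(total_revenue)) # industries ordered by total revenue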

#Comparing different industries.

ggplot(fin, aes(Revenue, Expenses, color=Industry)) + geom_point() + geom_smooth(se=F) + facet_wrap(~Industry)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#In the IT Services and Retail industries, profit tails off as revenue increases.

#Boxplots showing the revenue and growth distributions across industries

ggplot(fin, aes(Industry, Revenue)) + geom_boxplot(aes(reorder(Industry, Revenue, median),  color=Industry)) + coord_flip()

#The Retail and IT Services industries produce the highest revenues.

ggplot(fin, aes(Industry,Growth )) + geom_boxplot(aes(reorder(Industry, Growth, median),  color=Industry))  

#The Software and IT Services industries have the highest growth.

ggplot(fin, aes(Industry,Growth, color=Industry )) + geom_boxplot(aes(reorder(Industry, Growth, median)), outlier.color = NA)+ geom_jitter() # boxplot outliers hidden so they are not plotted twice alongside the jittered points

##Conclusion: We have made it to the end! We have invested a lot of skill and time in this project: we systematically imported the data, cleaned it and converted the columns into forms suitable for analysis, and dealt with the missing values along with the reasoning behind each treatment. Once the data was prepared for business use, we drew insights holding strong business value with the help of ggplot visualisations.

#And finally, thank you very much for investing your valuable time in reading my project. Take care!