##INTRODUCTION
#The Future 500 dataset contains financial records for 500 companies spread across seven industries. It has 500 rows and 11 columns.
#The dataset has several discrepancies, outliers and missing values, which need to be rectified before any robust analysis can be done.
#Our aim is to analyse profit and growth across industries and how they compare. Which industry is likely to decline and which has the brightest future? Which industry generates the most revenue but makes less profit? These and similar insights are what add business value.
#So let's import our data and compute some summary statistics to understand it.
#Importing the dataset
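fin <- read.csv("Future-500.csv", na.strings=c(""), stringsAsFactors=TRUE) # stringsAsFactors=TRUE reproduces the factor columns shown below on R >= 4.0, where the default is FALSE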
head(fin, 10) # Let's see how our data looks
## ID Name Industry Inception Employees State
## 1 1 Over-Hex Software 2006 25 TN
## 2 2 Unimattax IT Services 2009 36 PA
## 3 3 Greenfax Retail 2012 NA SC
## 4 4 Blacklane IT Services 2011 66 CA
## 5 5 Yearflex Software 2013 45 WI
## 6 6 Indigoplanet IT Services 2013 60 NJ
## 7 7 Treslam Financial Services 2009 116 MO
## 8 8 Rednimdox Construction 2013 73 NY
## 9 9 Lamtone IT Services 2009 55 CA
## 10 10 Stripfind Financial Services 2010 25 FL
## City Revenue Expenses Profit Growth
## 1 Franklin $9,684,527 1,130,700 Dollars 8553827 19%
## 2 Newtown Square $14,016,543 804,035 Dollars 13212508 20%
## 3 Greenville $9,746,272 1,044,375 Dollars 8701897 16%
## 4 Orange $15,359,369 4,631,808 Dollars 10727561 19%
## 5 Madison $8,567,910 4,374,841 Dollars 4193069 19%
## 6 Manalapan $12,805,452 4,626,275 Dollars 8179177 22%
## 7 Clayton $5,387,469 2,127,984 Dollars 3259485 17%
## 8 Woodside <NA> <NA> NA <NA>
## 9 San Ramon $11,757,018 6,482,465 Dollars 5274553 30%
## 10 Boca Raton $12,329,371 916,455 Dollars 11412916 20%
library(psych)
describe(fin)
## vars n mean sd median trimmed mad
## ID 1 500 250.50 144.48 250.5 250.50 185.32
## Name* 2 500 250.50 144.48 250.5 250.50 185.32
## Industry* 3 498 4.25 1.79 5.0 4.32 1.48
## Inception 4 499 2010.17 3.23 2011.0 2010.71 1.48
## Employees 5 498 148.61 397.35 56.0 80.68 52.63
## State* 6 496 21.95 13.33 21.5 22.16 20.02
## City* 7 500 147.09 85.86 153.0 147.41 111.19
## Revenue* 8 498 249.50 143.90 249.5 249.50 184.58
## Expenses* 9 497 249.00 143.62 249.0 249.00 183.84
## Profit 10 498 6539474.01 3869933.65 6513366.0 6421961.00 4537424.65
## Growth* 11 499 17.70 8.36 16.0 17.54 8.90
## min max range skew kurtosis se
## ID 1 500 499 0.00 -1.21 6.46
## Name* 1 500 499 0.00 -1.21 6.46
## Industry* 1 7 6 -0.28 -0.82 0.08
## Inception 1999 2014 15 -1.59 2.39 0.14
## Employees 1 7125 7124 11.98 192.87 17.81
## State* 1 42 41 -0.09 -1.49 0.60
## City* 1 297 296 -0.06 -1.27 3.84
## Revenue* 1 498 497 0.00 -1.21 6.45
## Expenses* 1 497 496 0.00 -1.21 6.44
## Profit 12434 19624534 19612100 0.25 -0.55 173415.87
## Growth* 1 32 31 0.25 -1.14 0.37
str(fin)
## 'data.frame': 500 obs. of 11 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 500 levels "Abstractedchocolat",..: 297 451 168 40 485 199 435 339 242 395 ...
## $ Industry : Factor w/ 7 levels "Construction",..: 7 5 6 5 7 5 2 1 5 2 ...
## $ Inception: int 2006 2009 2012 2011 2013 2013 2009 2013 2009 2010 ...
## $ Employees: int 25 36 NA 66 45 60 116 73 55 25 ...
## $ State : Factor w/ 42 levels "AL","AZ","CA",..: 36 33 35 3 41 27 22 29 3 8 ...
## $ City : Factor w/ 297 levels "Addison","Alexandria",..: 94 181 105 195 151 154 53 295 232 26 ...
## $ Revenue : Factor w/ 498 levels "$1,614,585","$1,835,717",..: 479 194 485 246 402 141 308 NA 96 117 ...
## $ Expenses : Factor w/ 497 levels "1,026,548 Dollars",..: 6 485 3 248 227 247 57 NA 402 495 ...
## $ Profit : int 8553827 13212508 8701897 10727561 4193069 8179177 3259485 NA 5274553 11412916 ...
## $ Growth : Factor w/ 32 levels "-2%","-3%","0%",..: 14 16 11 14 14 18 12 NA 26 16 ...
summary(fin)
## ID Name Industry
## Min. : 1.0 Abstractedchocolat: 1 IT Services :146
## 1st Qu.:125.8 Abusivebong : 1 Health : 86
## Median :250.5 Acclaimedcirl : 1 Software : 64
## Mean :250.5 Admitruppell : 1 Financial Services: 54
## 3rd Qu.:375.2 Admonishbadelynge : 1 Construction : 50
## Max. :500.0 Ahemparticular : 1 (Other) : 98
## (Other) :494 NA's : 2
## Inception Employees State City
## Min. :1999 Min. : 1.00 CA : 57 San Diego : 13
## 1st Qu.:2009 1st Qu.: 27.25 VA : 50 New York : 11
## Median :2011 Median : 56.00 TX : 47 Reston : 10
## Mean :2010 Mean : 148.61 FL : 34 Houston : 9
## 3rd Qu.:2012 3rd Qu.: 126.00 MD : 25 Austin : 8
## Max. :2014 Max. :7125.00 (Other):283 Minneapolis: 8
## NA's :1 NA's :2 NA's : 4 (Other) :441
## Revenue Expenses Profit
## $1,614,585 : 1 1,026,548 Dollars: 1 Min. : 12434
## $1,835,717 : 1 1,040,662 Dollars: 1 1st Qu.: 3272074
## $10,064,297: 1 1,044,375 Dollars: 1 Median : 6513366
## $10,067,223: 1 1,097,353 Dollars: 1 Mean : 6539474
## $10,072,452: 1 1,117,206 Dollars: 1 3rd Qu.: 9303951
## (Other) :493 (Other) :492 Max. :19624534
## NA's : 2 NA's : 3 NA's :2
## Growth
## 20% : 39
## 19% : 35
## 17% : 27
## 6% : 25
## 12% : 24
## (Other):349
## NA's : 1
#Looking at the data types, ID and Inception need to be converted into factors, as we don't need to summarise them numerically.
#Revenue, Expenses and Growth were read as factors because they are not purely numeric: they carry symbols and text such as $, commas, "Dollars" and %. We need to strip those so the columns become numeric and can be analysed. This is the main part of data cleaning, so let's get into it.
fin$ID <- factor(fin$ID)
fin$Inception <- factor(fin$Inception)
#Let's check the data type of ID and Inception after changing the data type.
class(fin$Inception)
## [1] "factor"
class(fin$ID)
## [1] "factor"
#The conversion worked.
#Next, let's clean the other columns. We use gsub() to strip the non-numeric characters from the Revenue, Expenses and Growth columns.
#gsub() replaces every match of a pattern; sub() would replace only the first one.
fin$Revenue <- gsub("\\$","",fin$Revenue)
fin$Revenue <- gsub( ",","",fin$Revenue)
fin$Expenses <- gsub(",","",fin$Expenses)
fin$Expenses <- gsub(" Dollars","",fin$Expenses)
fin$Growth <- gsub("%","",fin$Growth)
str(fin)
## 'data.frame': 500 obs. of 11 variables:
## $ ID : Factor w/ 500 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Name : Factor w/ 500 levels "Abstractedchocolat",..: 297 451 168 40 485 199 435 339 242 395 ...
## $ Industry : Factor w/ 7 levels "Construction",..: 7 5 6 5 7 5 2 1 5 2 ...
## $ Inception: Factor w/ 16 levels "1999","2000",..: 8 11 14 13 15 15 11 15 11 12 ...
## $ Employees: int 25 36 NA 66 45 60 116 73 55 25 ...
## $ State : Factor w/ 42 levels "AL","AZ","CA",..: 36 33 35 3 41 27 22 29 3 8 ...
## $ City : Factor w/ 297 levels "Addison","Alexandria",..: 94 181 105 195 151 154 53 295 232 26 ...
## $ Revenue : chr "9684527" "14016543" "9746272" "15359369" ...
## $ Expenses : chr "1130700" "804035" "1044375" "4631808" ...
## $ Profit : int 8553827 13212508 8701897 10727561 4193069 8179177 3259485 NA 5274553 11412916 ...
## $ Growth : chr "19" "20" "16" "19" ...
# Removing the extra characters left these columns as character vectors, so let's convert them to numeric.
fin$Revenue <- as.numeric(fin$Revenue)
fin$Expenses <- as.numeric(fin$Expenses)
fin$Growth <- as.numeric(fin$Growth)
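# A minimal sketch, not part of the original workflow: the same cleaning can be
# folded into one helper that drops everything except digits, the minus sign and
# the decimal point before converting. The name to_number() and the sample
# strings in demo_money are assumptions made up for illustration.
to_number <- function(x) {
  # as.character() guards against factors, whose as.numeric() returns level codes
  as.numeric(gsub("[^0-9.-]", "", as.character(x)))
}
demo_money <- c("$9,684,527", "1,130,700 Dollars", "-3%", NA)
to_number(demo_money)   # returns 9684527 1130700 -3 NA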
#Let's see how our dataset looks now!!
head(fin[,c(8,9,10,11)])
## Revenue Expenses Profit Growth
## 1 9684527 1130700 8553827 19
## 2 14016543 804035 13212508 20
## 3 9746272 1044375 8701897 16
## 4 15359369 4631808 10727561 19
## 5 8567910 4374841 4193069 19
## 6 12805452 4626275 8179177 22
summary(fin)
## ID Name Industry
## 1 : 1 Abstractedchocolat: 1 IT Services :146
## 2 : 1 Abusivebong : 1 Health : 86
## 3 : 1 Acclaimedcirl : 1 Software : 64
## 4 : 1 Admitruppell : 1 Financial Services: 54
## 5 : 1 Admonishbadelynge : 1 Construction : 50
## 6 : 1 Ahemparticular : 1 (Other) : 98
## (Other):494 (Other) :494 NA's : 2
## Inception Employees State City
## 2011 : 93 Min. : 1.00 CA : 57 San Diego : 13
## 2010 : 83 1st Qu.: 27.25 VA : 50 New York : 11
## 2012 : 80 Median : 56.00 TX : 47 Reston : 10
## 2013 : 69 Mean : 148.61 FL : 34 Houston : 9
## 2009 : 60 3rd Qu.: 126.00 MD : 25 Austin : 8
## (Other):114 Max. :7125.00 (Other):283 Minneapolis: 8
## NA's : 1 NA's :2 NA's : 4 (Other) :441
## Revenue Expenses Profit Growth
## Min. : 1614585 Min. : 71219 Min. : 12434 Min. :-3.00
## 1st Qu.: 8695702 1st Qu.:2758425 1st Qu.: 3272074 1st Qu.: 8.00
## Median :10647231 Median :4365512 Median : 6513366 Median :15.00
## Mean :10845170 Mean :4310134 Mean : 6539474 Mean :14.38
## 3rd Qu.:13106928 3rd Qu.:5832473 3rd Qu.: 9303951 3rd Qu.:20.00
## Max. :21810051 Max. :9860686 Max. :19624534 Max. :30.00
## NA's :2 NA's :3 NA's :2 NA's :1
# Nice, suitable data types have been formed.
# So the first part of data preparation is done: the data has been cleansed. Now it's time to deal with the missing values one by one. Dealing with missing values takes some care. How do we deal with them?
#1) If the missing values are few and dropping them has no serious impact on the analysis, the rows can be deleted. If they are significant in number, deletion is not an option as it would discard useful information.
#2) If the data is normally distributed, missing values can be replaced with the mean. Normality can be checked using statistical tests, boxplots, histograms, Q-Q plots etc.
#3) If the data is not normally distributed, or if the column's values depend on the category they belong to, we use the median instead (per category where relevant).
#4) Sometimes missing values can be computed mathematically/statistically from other columns or from the same column, which is more accurate than the imputation methods above.
#Note: Data preparation also includes dealing with outliers and anomalies before dealing with missing values. Sometimes outliers point to something genuinely new, but usually they indicate data entry faults. They can be ignored if their influence is negligible; otherwise their effect needs to be analysed. We are not doing that here: the dataset contains only 500 rows, so their negative impact might not be significant.
#Still, it's good practice to repeat your analysis with and without the outliers. If they have minimal effect on the results and you can't figure out why they're there, it's reasonable to replace them with missing values and move on. However, if they have a substantial effect on your results, you shouldn't drop them without justification. You'll need to figure out what caused them (e.g., a data entry error) and disclose that you removed them in your write-up.
#Enough theory, let's dive straight back into our work.
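# A minimal sketch of rules 2 and 3 above, assuming a Shapiro-Wilk test at the
# 5% level is an acceptable normality check. The vector demo_emp is invented
# for illustration and is not taken from the dataset.
demo_emp <- c(25, 36, 66, 45, 60, 116, 73, 55, 7125, NA)
demo_p <- shapiro.test(demo_emp)$p.value   # shapiro.test() ignores missing values
# Use the mean only when normality is not rejected; otherwise fall back to the median
demo_fill <- if (demo_p > 0.05) mean(demo_emp, na.rm = TRUE) else median(demo_emp, na.rm = TRUE)
demo_emp[is.na(demo_emp)] <- demo_fill
demo_emp
# A histogram or Q-Q plot (hist(), qqnorm()) is a useful visual cross-check on the test.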
# Dealing with missing data
#Replacing Missing Data: Factual Analysis
#-----
#Dealing with missing State values where the City is New York
fin[is.na(fin$State),] # data with missing state values
## ID Name Industry Inception Employees State City
## 11 11 Canecorporation Health 2012 6 <NA> New York
## 84 84 Drilldrill Software 2010 30 <NA> San Francisco
## 267 267 Circlechop Software 2010 14 <NA> San Francisco
## 379 379 Stovepuck Retail 2013 73 <NA> New York
## Revenue Expenses Profit Growth
## 11 10597009 7591189 3005820 7
## 84 7800620 2785799 5014821 17
## 267 9067070 5929828 3137242 20
## 379 13814975 5904502 7910473 10
fin[is.na(fin$State) & fin$City=="New York",] #Data with missing values having city as New York
## ID Name Industry Inception Employees State City
## 11 11 Canecorporation Health 2012 6 <NA> New York
## 379 379 Stovepuck Retail 2013 73 <NA> New York
## Revenue Expenses Profit Growth
## 11 10597009 7591189 3005820 7
## 379 13814975 5904502 7910473 10
fin[is.na(fin$State) & fin$City=="New York","State"] <- "NY" #Rows with a missing State and New York as City get the State NY
fin[is.na(fin$State) & fin$City=="New York",] #Empty: no row with a missing State and New York as City remains
## [1] ID Name Industry Inception Employees State City
## [8] Revenue Expenses Profit Growth
## <0 rows> (or 0-length row.names)
fin[!complete.cases(fin),]
## ID Name Industry Inception Employees State
## 3 3 Greenfax Retail 2012 NA SC
## 8 8 Rednimdox Construction 2013 73 NY
## 14 14 Techline <NA> 2006 65 CA
## 15 15 Cityace <NA> 2010 25 CO
## 17 17 Ganzlax IT Services 2011 75 NJ
## 22 22 Lathotline Health <NA> 103 VA
## 44 44 Ganzgreen Construction 2010 224 TN
## 84 84 Drilldrill Software 2010 30 <NA>
## 267 267 Circlechop Software 2010 14 <NA>
## 332 332 Westminster Financial Services 2010 NA MI
## City Revenue Expenses Profit Growth
## 3 Greenville 9746272 1044375 8701897 16
## 8 Woodside NA NA NA NA
## 14 San Ramon 13898119 5470303 8427816 23
## 15 Louisville 9254614 6249498 3005116 6
## 17 Iselin 14001180 NA 11901180 18
## 22 McLean 9418303 7567233 1851070 2
## 44 Franklin NA NA NA 9
## 84 San Francisco 7800620 2785799 5014821 17
## 267 San Francisco 9067070 5929828 3137242 20
## 332 Troy 11861652 5245126 6616526 15
#----
#Dealing with missing State values where the City is San Francisco
fin[is.na(fin$State) & fin$City=="San Francisco", "State"] <- "CA"
fin[!complete.cases(fin),]
## ID Name Industry Inception Employees State
## 3 3 Greenfax Retail 2012 NA SC
## 8 8 Rednimdox Construction 2013 73 NY
## 14 14 Techline <NA> 2006 65 CA
## 15 15 Cityace <NA> 2010 25 CO
## 17 17 Ganzlax IT Services 2011 75 NJ
## 22 22 Lathotline Health <NA> 103 VA
## 44 44 Ganzgreen Construction 2010 224 TN
## 332 332 Westminster Financial Services 2010 NA MI
## City Revenue Expenses Profit Growth
## 3 Greenville 9746272 1044375 8701897 16
## 8 Woodside NA NA NA NA
## 14 San Ramon 13898119 5470303 8427816 23
## 15 Louisville 9254614 6249498 3005116 6
## 17 Iselin 14001180 NA 11901180 18
## 22 McLean 9418303 7567233 1851070 2
## 44 Franklin NA NA NA 9
## 332 Troy 11861652 5245126 6616526 15
#State has been taken care of.
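# A minimal sketch, not the author's method: with more than a couple of cities,
# a named lookup vector avoids writing one assignment per city. demo_geo,
# city_to_state and demo_gap are hypothetical names, and the columns are plain
# character vectors here rather than factors. City names are not unique to one
# state, so any such lookup should be checked by hand before it is applied.
demo_geo <- data.frame(City  = c("New York", "San Francisco", "Orange", "New York"),
                       State = c(NA, NA, "CA", "NY"),
                       stringsAsFactors = FALSE)
city_to_state <- c("New York" = "NY", "San Francisco" = "CA")
# Only touch rows where State is missing and the city is in the lookup
demo_gap <- is.na(demo_geo$State) & demo_geo$City %in% names(city_to_state)
demo_geo$State[demo_gap] <- city_to_state[demo_geo$City[demo_gap]]
demo_geo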
#Replacing Missing data: Median Imputation Method part - 1 - Employee Column
med_employee_retail <- median(fin[fin$Industry=="Retail", "Employees"], na.rm=T)
fin[is.na(fin$Employees) & fin$Industry=="Retail", "Employees"] <- med_employee_retail
fin[is.na(fin$Employees),]
## ID Name Industry Inception Employees State City
## 332 332 Westminster Financial Services 2010 NA MI Troy
## Revenue Expenses Profit Growth
## 332 11861652 5245126 6616526 15
med_employee_FinancialServices <- median(fin[fin$Industry=="Financial Services", "Employees"], na.rm=T)
fin[is.na(fin$Employees),] #332
## ID Name Industry Inception Employees State City
## 332 332 Westminster Financial Services 2010 NA MI Troy
## Revenue Expenses Profit Growth
## 332 11861652 5245126 6616526 15
fin[is.na(fin$Employees) & fin$Industry=="Financial Services", "Employees"] <- med_employee_FinancialServices
fin[332,] #Updated with the median (80)
## ID Name Industry Inception Employees State City
## 332 332 Westminster Financial Services 2010 80 MI Troy
## Revenue Expenses Profit Growth
## 332 11861652 5245126 6616526 15
fin[!complete.cases(fin),]
## ID Name Industry Inception Employees State City
## 8 8 Rednimdox Construction 2013 73 NY Woodside
## 14 14 Techline <NA> 2006 65 CA San Ramon
## 15 15 Cityace <NA> 2010 25 CO Louisville
## 17 17 Ganzlax IT Services 2011 75 NJ Iselin
## 22 22 Lathotline Health <NA> 103 VA McLean
## 44 44 Ganzgreen Construction 2010 224 TN Franklin
## Revenue Expenses Profit Growth
## 8 NA NA NA NA
## 14 13898119 5470303 8427816 23
## 15 9254614 6249498 3005116 6
## 17 14001180 NA 11901180 18
## 22 9418303 7567233 1851070 2
## 44 NA NA NA 9
#Wow, only 6 incomplete rows are left now.
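# A minimal sketch, assuming every missing value should take the median of its
# own industry: ave() applies the fill group by group, so no per-industry
# variable is needed. demo_ind and fill_median() are invented for illustration.
demo_ind <- data.frame(Industry  = c("Retail", "Retail", "Retail", "Health", "Health"),
                       Employees = c(10, NA, 20, 80, NA))
# Replace NAs within one group by that group's median, leave other values as-is
fill_median <- function(v) ifelse(is.na(v), median(v, na.rm = TRUE), v)
demo_ind$Employees <- ave(demo_ind$Employees, demo_ind$Industry, FUN = fill_median)
demo_ind$Employees   # 10 15 20 80 80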
#Replacing Missing data: Median Imputation Method part - 2 - Growth
median_growth_construction <- median(fin[fin$Industry=="Construction", "Growth"], na.rm=T)
fin[is.na(fin$Growth) & fin$Industry=="Construction", "Growth"] <- median_growth_construction
fin[8,]
## ID Name Industry Inception Employees State City Revenue
## 8 8 Rednimdox Construction 2013 73 NY Woodside NA
## Expenses Profit Growth
## 8 NA NA 10
fin[!complete.cases(fin),]
## ID Name Industry Inception Employees State City
## 8 8 Rednimdox Construction 2013 73 NY Woodside
## 14 14 Techline <NA> 2006 65 CA San Ramon
## 15 15 Cityace <NA> 2010 25 CO Louisville
## 17 17 Ganzlax IT Services 2011 75 NJ Iselin
## 22 22 Lathotline Health <NA> 103 VA McLean
## 44 44 Ganzgreen Construction 2010 224 TN Franklin
## Revenue Expenses Profit Growth
## 8 NA NA NA 10
## 14 13898119 5470303 8427816 23
## 15 9254614 6249498 3005116 6
## 17 14001180 NA 11901180 18
## 22 9418303 7567233 1851070 2
## 44 NA NA NA 9
#Next, fill Revenue and Expenses for the Construction rows where Profit is missing as well.
median_revenue_construction <- median(fin[fin$Industry=="Construction", "Revenue"], na.rm=T)
#fin[is.na(fin$Revenue) & fin$Industry=="Construction","Revenue"] <- median_revenue_construction - Not used explanation given below.
fin[is.na(fin$Revenue) & fin$Industry=="Construction" & is.na(fin$Profit),"Revenue"] <- median_revenue_construction
median_expenses_construction <- median(fin[fin$Industry=="Construction", "Expenses"], na.rm=T)
#fin[is.na(fin$Expenses) & fin$Industry=="Construction","Expenses"] <- median_expenses_construction - Not used explanation given below.
fin[is.na(fin$Expenses) & fin$Industry=="Construction" &is.na(fin$Profit),"Expenses"] <- median_expenses_construction
fin[!complete.cases(fin),]
## ID Name Industry Inception Employees State City
## 8 8 Rednimdox Construction 2013 73 NY Woodside
## 14 14 Techline <NA> 2006 65 CA San Ramon
## 15 15 Cityace <NA> 2010 25 CO Louisville
## 17 17 Ganzlax IT Services 2011 75 NJ Iselin
## 22 22 Lathotline Health <NA> 103 VA McLean
## 44 44 Ganzgreen Construction 2010 224 TN Franklin
## Revenue Expenses Profit Growth
## 8 9055059 4506976 NA 10
## 14 13898119 5470303 8427816 23
## 15 9254614 6249498 3005116 6
## 17 14001180 NA 11901180 18
## 22 9418303 7567233 1851070 2
## 44 9055059 4506976 NA 9
#The simpler filters (commented out above) are not used because they would also overwrite a missing Revenue or Expenses value in rows where the other two figures are present. In those rows the missing value can be computed exactly from Revenue, Expenses and Profit, so median imputation is restricted to rows where Profit is missing as well, i.e. rows with at least two missing values, where the arithmetic isn't possible.
#The arithmetic for rows where the other two values are present is shown below.
#Replacing Missing Values: Expenses = Revenue - Profit, Profit = Revenue - Expenses
fin[is.na(fin$Profit), "Profit"] <- (fin$Revenue - fin$Expenses)[is.na(fin$Profit)]
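#fin[is.na(fin$Profit), "Profit"] <- fin[is.na(fin$Profit),"Revenue"] - fin[is.na(fin$Profit), "Expenses"] - another method
fin[!complete.cases(fin$Profit),] #The comma selects rows; without it the index would be applied to columns
## [1] ID Name Industry Inception Employees State City
## [8] Revenue Expenses Profit Growth
## <0 rows> (or 0-length row.names)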
#Profit has been taken care of now. Let's do the same for Expenses.
fin[is.na(fin$Expenses), "Expenses"] <- (fin$Revenue - fin$Profit)[is.na(fin$Expenses)]
#fin[is.na(fin$Expenses), "Expenses"] <- fin[is.na(fin$Expenses),"Revenue"] - fin[is.na(fin$Expenses), "Profit"] - another method
fin[!complete.cases(fin),]
## ID Name Industry Inception Employees State City Revenue
## 14 14 Techline <NA> 2006 65 CA San Ramon 13898119
## 15 15 Cityace <NA> 2010 25 CO Louisville 9254614
## 22 22 Lathotline Health <NA> 103 VA McLean 9418303
## Expenses Profit Growth
## 14 5470303 8427816 23
## 15 6249498 3005116 6
## 22 7567233 1851070 2
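# A minimal sketch of the identity Profit = Revenue - Expenses used above, on an
# invented three-row data frame (demo_pl is a hypothetical name). Each missing
# figure is recovered from the other two values in the same row.
demo_pl <- data.frame(Revenue  = c(100, 200, 300),
                      Expenses = c(40, NA, 120),
                      Profit   = c(60, 150, NA))
demo_pl$Profit   <- ifelse(is.na(demo_pl$Profit),   demo_pl$Revenue - demo_pl$Expenses, demo_pl$Profit)
demo_pl$Expenses <- ifelse(is.na(demo_pl$Expenses), demo_pl$Revenue - demo_pl$Profit,   demo_pl$Expenses)
demo_pl
# A row missing two of the three figures cannot be recovered this way and needs
# imputation first, which is why the median step came before the arithmetic.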
fi <- fin[complete.cases(fin$Industry),]
fi[!complete.cases(fi),]
## ID Name Industry Inception Employees State City Revenue
## 22 22 Lathotline Health <NA> 103 VA McLean 9418303
## Expenses Profit Growth
## 22 7567233 1851070 2
fin <- fi
fin[!complete.cases(fin),]
## ID Name Industry Inception Employees State City Revenue
## 22 22 Lathotline Health <NA> 103 VA McLean 9418303
## Expenses Profit Growth
## 22 7567233 1851070 2
#---------- End of data preparation ---------- Let's move to the visualisation part. (Row 22 still lacks an Inception year; none of the plots below use that column.)
#Visualisation Part
#Scatterplot showing the variations of Expenses, Revenue and Profit.
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------------------------------ tidyverse 1.2.1 --
## v ggplot2 3.0.0 v purrr 0.2.5
## v tibble 1.4.2 v dplyr 0.7.6
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts --------------------------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x ggplot2::%+%() masks psych::%+%()
## x ggplot2::alpha() masks psych::alpha()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
ggplot(fin, aes(Revenue, Expenses,size=Profit)) + geom_point() + geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# As revenue increases, profit tends to increase too, which suggests a positive correlation. Let's test the correlation.
cor.test(fin$Revenue, fin$Profit)
##
## Pearson's product-moment correlation
##
## data: fin$Revenue and fin$Profit
## t = 34.141, df = 496, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.8092424 0.8619849
## sample estimates:
## cor
## 0.8375544
#Confirmed: r is about 0.84 and p < 2.2e-16, so there is a statistically significant positive correlation between revenue and profit.
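# A minimal sketch: cor.test() returns a list, so the estimate, p-value and
# confidence interval can be pulled out rather than read off the printout.
# demo_rev and demo_prof are invented numbers, not values from the dataset.
demo_rev  <- c(5, 8, 9, 11, 12, 14, 15, 18)
demo_prof <- c(2, 4, 3, 6, 7, 6, 9, 11)
demo_ct <- cor.test(demo_rev, demo_prof)
demo_ct$estimate    # Pearson's r
demo_ct$p.value     # compare against the chosen significance level
demo_ct$conf.int    # 95% confidence interval for r
# Profit is derived from Revenue (Profit = Revenue - Expenses), so some positive
# correlation between the two is expected by construction.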
#Scatterplot showing the variations of Expenses, Revenue and Profit in different industries.
ggplot(fin, aes(Revenue, Expenses, color=Industry, size=Profit)) + geom_point() + geom_smooth(se=F)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#The highest revenues are generated by the IT Services and Software industries.
#Comparing different industries.
ggplot(fin, aes(Revenue, Expenses, color=Industry)) + geom_point() + geom_smooth(se=F) + facet_wrap(~Industry)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

#In the IT Services and Retail panels, profit appears to fall off as revenue increases.
#Boxplots showing growth and revenue distribution of different industries
ggplot(fin, aes(Industry, Revenue)) + geom_boxplot(aes(reorder(Industry, Revenue, median), color=Industry)) + coord_flip()

#The Retail and IT Services industries produce the highest revenue.
ggplot(fin, aes(Industry,Growth )) + geom_boxplot(aes(reorder(Industry, Growth, median), color=Industry))

#The Software and IT Services industries have the highest growth.
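# A minimal sketch for checking a boxplot reading numerically: aggregate()
# computes the per-industry median, the same statistic the plots are ordered by.
# demo_grow is an invented data frame, not a sample of the real data.
demo_grow <- data.frame(Industry = c("Software", "Software", "Retail", "Retail", "Health"),
                        Growth   = c(20, 24, 8, 12, 15))
demo_med <- aggregate(Growth ~ Industry, data = demo_grow, FUN = median)
demo_med[order(-demo_med$Growth), ]   # highest median growth first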
ggplot(fin, aes(Industry,Growth, color=Industry )) + geom_boxplot(aes(reorder(Industry, Growth, median)), outlier.color = NA)+ geom_jitter()

##Conclusion: Yo!! We have made it to the end. A lot of skill and time went into building this project. We systematically imported the data, cleaned it and converted the columns to types suitable for analysis, and dealt with the missing values along with the reasoning behind each choice. Once the data was prepared, we drew insights with business value using ggplot visualisations.
#Finally, thank you for investing your valuable time in reading my project. Take care!