Exercises

Instructions: Descriptive statistics with R_practice
1.  Choose a dataset within your respective units
2.  How many variables and observations does your dataset have?
3.  If your dataset has information about time/date, filter it to keep only observations that fall in your desired time interval.
4.  Identify a variable that can help you to categorize your population
5.  Use stratified sampling to sample a certain number of individuals from each category/group
6.  Analyze the dataset and provide summary statistics with interpretations

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(readxl)
1.  Choose a dataset within your respective units
Startup_connect <- read_excel("C:\\Users\\CcHUB\\Downloads\\R Training\\Startup_Connect_Dataset.xlsx")

Context of the data: This dataset covers the Startup Connect application from the Innovation Consulting Unit, CcHUB. It lists all startups that applied for the 2018 Union Bank Startup Connect and has 20 columns, including Startup name, Year founded, State, Sector, and External capital raised.

2.  How many variables and observations does your dataset have?

The dataset has 727 observations and 21 variables.
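
The row and column counts can be read directly from the loaded data frame. The following is a minimal sketch on a small made-up data frame; `toy_startups` and its values are hypothetical stand-ins, not the real dataset.

```r
# Toy stand-in for the real data; names and values are invented for illustration.
toy_startups <- data.frame(
  Startup = c("A", "B", "C", "D"),
  Sector = c("Agriculture", "ICT", "ICT", "Law"),
  `Launch Year` = c(2016, 2017, 2017, 2018),
  check.names = FALSE  # keep the space in the column name "Launch Year"
)

n_obs <- nrow(toy_startups)   # number of observations (rows)
n_vars <- ncol(toy_startups)  # number of variables (columns)
dim(toy_startups)             # both at once: rows first, then columns
```

Calling `dim(Startup_connect)` in the same way would confirm the counts reported for the real data.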

3.  If your dataset has information about time/date, filter it to keep only observations that fall in your desired time interval.
Startup_connect_2017 <- filter(Startup_connect, `Launch Year` == 2017)

The Startup Connect dataset was filtered to keep only startups that launched in 2017. The number of observations decreased from 727 to 305.
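
As a rough illustration of the filtering step, the sketch below applies the same kind of filter to a made-up data frame and compares row counts before and after. `toy_startups` and its values are hypothetical; the `between()` call shows how a range of years, rather than a single year, could be kept.

```r
library(dplyr)

# Toy stand-in for the real data; values are invented for illustration.
toy_startups <- data.frame(
  Startup = c("A", "B", "C", "D", "E"),
  `Launch Year` = c(2016, 2017, 2017, 2018, 2017),
  check.names = FALSE
)

# Single year: backticks are needed because the column name contains a space.
toy_2017 <- dplyr::filter(toy_startups, `Launch Year` == 2017)

# Interval: between() keeps values from 2016 to 2017, both ends included.
toy_2016_2017 <- dplyr::filter(toy_startups, between(`Launch Year`, 2016, 2017))

rows_before <- nrow(toy_startups)
rows_after <- nrow(toy_2017)
```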

4.  Identify a variable that can help you to categorize your population

One variable that can categorize the population is the sector of each startup. The population can also be classified by: 1) the gender of the startup founder, 2) whether the startup has raised external capital, 3) the location of the startup.

5.  Use stratified sampling to sample a certain number of individuals from each category/group
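# Stratified samples: draw a fixed number of startups from each group.
# Note: sample_n() stops with an error if a group has fewer rows than requested.
Gender_category <- Startup_connect_2017 %>% group_by(Gender) %>% sample_n(50)

Capital_category <- Startup_connect_2017 %>% group_by(`Capital`) %>% sample_n(100)

Location_category <- Startup_connect_2017 %>% group_by(`Zone`) %>% sample_n(10)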
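
Stratified samples like these are random, so they change on every run unless a seed is set, and they depend on each group being large enough. The sketch below shows one possible way to handle both on a made-up data frame. It assumes dplyr 1.0.0 or later for `slice_sample()`; the seed value and the toy data are arbitrary choices for illustration.

```r
library(dplyr)

set.seed(42)  # arbitrary seed so the same rows are drawn on every run

# Toy stand-in: 4 female-founded and 8 male-founded startups (invented values).
toy_startups <- data.frame(
  id = 1:12,
  Gender = rep(c("Female", "Male"), times = c(4, 8))
)

# Check group sizes before sampling, to pick a sample size every group can supply.
group_sizes <- count(toy_startups, Gender)

# Draw 3 rows from each gender group, then drop the grouping.
stratified <- toy_startups %>%
  group_by(Gender) %>%
  slice_sample(n = 3) %>%
  ungroup()
```

Unlike `sample_n()`, `slice_sample()` returns the whole group when it holds fewer rows than requested, so a small stratum shrinks the sample instead of stopping the script.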
6.  Analyze the dataset and provide summary statistics with interpretations
#Number of startups by sector

ggplot(Startup_connect_2017) + geom_bar(aes(x = Sector, fill = Gender)) + theme_minimal() + coord_flip() + labs(title = 'Number of Startups by Sector', y = 'Startup')

The startups were grouped by industry sector and then by the gender of the startup's founder. Observations: 1) A high number of startups are in the Agriculture, ICT and Beauty/Fashion sectors. 2) There are more male founders than female founders in all sectors except Beauty/Fashion, Human Resources, Hospitality and Law.
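
The bar heights in a chart like this can also be reported as a table of counts, which makes the comparison between sectors and genders explicit. The following is a minimal sketch on made-up data; `toy_startups` and its values are hypothetical.

```r
library(dplyr)

# Toy stand-in for the real data; values are invented for illustration.
toy_startups <- data.frame(
  Sector = c("Agriculture", "Agriculture", "Agriculture", "ICT", "ICT", "Law"),
  Gender = c("Male", "Male", "Female", "Male", "Female", "Female")
)

# One row per Sector/Gender combination, largest counts first.
sector_gender_counts <- toy_startups %>%
  count(Sector, Gender, name = "n") %>%
  arrange(desc(n))

# The same counts as a base R cross-tabulation (sectors in rows, genders in columns).
sector_gender_table <- table(toy_startups$Sector, toy_startups$Gender)
```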

#Number of startups by sector and external capital raised

ggplot(Startup_connect_2017) + geom_bar(aes(x = Sector, fill = Gender)) + theme_minimal() + coord_flip() + facet_wrap(~Capital) + labs(title = 'Number of Startups by Sector & External Capital Raised', y = 'Startup')

The dataset was further split by whether the startups have raised external capital for their business. Observation: 1) More startups have not raised external capital than have raised it.
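
A comparison like this can be quantified as a share of all startups. The sketch below is one way to do that on made-up data; the `"Yes"`/`"No"` coding of `Capital` is an assumption, since the actual values in the dataset are not shown.

```r
library(dplyr)

# Toy stand-in; the "Yes"/"No" coding of Capital is assumed for illustration.
toy_startups <- data.frame(
  Startup = c("A", "B", "C", "D", "E"),
  Capital = c("No", "No", "Yes", "No", "Yes")
)

# Count startups per Capital value and express each count as a share of the total.
capital_share <- toy_startups %>%
  count(Capital) %>%
  mutate(share = n / sum(n))
```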

#Number of startups by Geopolitical Zone

ggplot(Startup_connect_2017) + geom_bar(aes(x = Sector, fill = Zone)) + theme_minimal() + coord_flip() + labs(title = 'Number of Startups by Geopolitical Zone', y = 'Startup')

#Number of startups by Geopolitical Zone, one panel per zone

ggplot(Startup_connect_2017) + geom_bar(aes(x = Sector, fill = Zone)) + theme_minimal() + coord_flip() + facet_wrap(~Zone, nrow = 1) + labs(title = 'Number of Startups by Geopolitical Zone', y = 'Startup') + theme(legend.position = "none")

The startups were also grouped by sector and geopolitical zone in Nigeria. Observations: 1) The highest number of startups are located in the West zone, because a large number of the startups are from Lagos. 2) The East and North zones have the lowest numbers of startups.