library(readxl)
Shark_Tank <- read_excel("Shark_Tank.xlsx")
View(Shark_Tank)
In this exercise, I will be analyzing the data sets from the reality show Shark Tank. The show depicts the entrepreneurial ecosystem in the USA, or more precisely, it shows the pitches of over a thousand entrepreneurs, who wish to receive investment from the “sharks” (investors). This data set encompasses different information, some of which I won’t be needing, so the first thing is to create a clean format.
I will create a new data set named Shark_Tanks_Clean that includes only the variables that I want:
Shark_Tank_Clean <- Shark_Tank [ , c(5, 6, 8, 10, 16, 17, 18, 19, 20, 21, 22, 23)]
Out of the total 48 variables, in the new data set, I included the following:
Shark_Tank_Clean$`Got Deal` <- factor(Shark_Tank_Clean$`Got Deal` ,
levels = c(0, 1),
labels = c("No", "Yes"))
In the “Got deal” variable, the categorical observations include the numbers 0 and 1. Instead, I converted the zeros into “No” and the ones (1’s) into “Yes”.
summary(Shark_Tank_Clean)
## Startup Name Industry Pitchers Gender Pitchers State
## Length:1274 Length:1274 Length:1274 Length:1274
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Original Ask Amount Original Offered Equity Valuation Requested Got Deal
## Min. : 10000 Min. : 1.0 Min. : 40000 No :509
## 1st Qu.: 100000 1st Qu.: 10.0 1st Qu.: 666667 Yes:765
## Median : 200000 Median : 10.0 Median : 1500000
## Mean : 284137 Mean : 13.8 Mean : 3550595
## 3rd Qu.: 350000 3rd Qu.: 20.0 3rd Qu.: 4000000
## Max. :5000000 Max. :100.0 Max. :100000000
##
## Total Deal Amount Total Deal Equity Deal Valuation
## Min. : 10000 Min. : 0.00 Min. : 0
## 1st Qu.: 100000 1st Qu.: 15.00 1st Qu.: 400000
## Median : 200000 Median : 20.00 Median : 1000000
## Mean : 296063 Mean : 24.28 Mean : 2178226
## 3rd Qu.: 350000 3rd Qu.: 30.00 3rd Qu.: 2083333
## Max. :5000000 Max. :100.00 Max. :36000000
## NA's :509 NA's :509 NA's :509
## Number of sharks in deal
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.311
## 3rd Qu.:2.000
## Max. :5.000
## NA's :509
Shark_Tank_Stat <- Shark_Tank_Clean[ , c(-1, -2, -3, -4)]
summary(Shark_Tank_Stat)
## Original Ask Amount Original Offered Equity Valuation Requested Got Deal
## Min. : 10000 Min. : 1.0 Min. : 40000 No :509
## 1st Qu.: 100000 1st Qu.: 10.0 1st Qu.: 666667 Yes:765
## Median : 200000 Median : 10.0 Median : 1500000
## Mean : 284137 Mean : 13.8 Mean : 3550595
## 3rd Qu.: 350000 3rd Qu.: 20.0 3rd Qu.: 4000000
## Max. :5000000 Max. :100.0 Max. :100000000
##
## Total Deal Amount Total Deal Equity Deal Valuation
## Min. : 10000 Min. : 0.00 Min. : 0
## 1st Qu.: 100000 1st Qu.: 15.00 1st Qu.: 400000
## Median : 200000 Median : 20.00 Median : 1000000
## Mean : 296063 Mean : 24.28 Mean : 2178226
## 3rd Qu.: 350000 3rd Qu.: 30.00 3rd Qu.: 2083333
## Max. :5000000 Max. :100.00 Max. :36000000
## NA's :509 NA's :509 NA's :509
## Number of sharks in deal
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.311
## 3rd Qu.:2.000
## Max. :5.000
## NA's :509
In the table above, we can see the summary of the numerical values only. I removed the categorical variables, while leaving out only the “deals gotten” categorical variable. Interpretations:
The minimum ask amount was 10,000$, while the maximum ask amount was 5,000,000 $.
Interestingly enough, the maximum offered equity that a start up extended to the investors was 100%. The average offered equity was 13.8%.
On average, the valuation requested by the start ups was 3,550,595 $. On the other hand, 50% of the start ups asked for an investment amounted below or equal to 1,500,000 $, while the other 50% asked for an investment amount more than 1,500,000 $.
Out of the total start ups, 765 of them received an investment (got a deal), while 509 of them didn’t (deal not gotten). The number of NA’s after the variable “Got deal” refer to the deals not gotten. I will remove these rows and will further work with “deals gotten only” data.
library(tidyr)
Shark_Tank_Deals <- drop_na(Shark_Tank_Stat)
Now I will calculate some statistical information regarding the deals gotten. I will also remove the “Got deal” variable because it is a categorical variable and these are not applicable when calculating statistical parameters.
library(pastecs)
##
## Attaching package: 'pastecs'
## The following object is masked from 'package:tidyr':
##
## extract
options(width = 120)
options(scipen = 999)
round(stat.desc(Shark_Tank_Deals), 2)
## Original Ask Amount Original Offered Equity Valuation Requested Got Deal Total Deal Amount
## nbr.val 765.00 765.00 765.00 NA 765.00
## nbr.null 0.00 0.00 0.00 NA 0.00
## nbr.na 0.00 0.00 0.00 NA 0.00
## min 10000.00 1.00 40000.00 NA 10000.00
## max 5000000.00 51.00 50000000.00 NA 5000000.00
## range 4990000.00 50.00 49960000.00 NA 4990000.00
## sum 200776000.00 9996.10 2590975817.00 NA 226488166.00
## median 200000.00 10.00 1600000.00 NA 200000.00
## mean 262452.29 13.07 3386896.49 NA 296062.96
## SE.mean 11245.23 0.29 182999.08 NA 12973.46
## CI.mean.0.95 22075.23 0.56 359240.72 NA 25467.87
## var 96738305639.39 62.31 25618828274796.72 NA 128757715315.35
## std.dev 311027.82 7.89 5061504.55 NA 358828.25
## coef.var 1.19 0.60 1.49 NA 1.21
## Total Deal Equity Deal Valuation Number of sharks in deal
## nbr.val 765.00 765.00 765.00
## nbr.null 8.00 9.00 0.00
## nbr.na 0.00 0.00 0.00
## min 0.00 0.00 1.00
## max 100.00 36000000.00 5.00
## range 100.00 36000000.00 4.00
## sum 18572.18 1666343265.00 1003.00
## median 20.00 1000000.00 1.00
## mean 24.28 2178226.49 1.31
## SE.mean 0.56 136320.07 0.02
## CI.mean.0.95 1.10 267606.37 0.04
## var 238.96 14216118786686.66 0.38
## std.dev 15.46 3770426.87 0.61
## coef.var 0.64 1.73 0.47
Out of all deals gotten, the maximum deal valuation, was 36,000,000$.
Out of all the variables, the highest variation can be seen in deal valuation with a coefficient variation of 1.73.
The total deal amounts vary 358,828.25 $ around the mean.
hist(Shark_Tank_Deals$`Total Deal Equity`,
main = "Distribution of Investor Deal Equities",
xlab = "Percentages (%)",
ylab = "Frequency")
The most frequent equity percentage negotiated was 20%. We can also check this information using the mode function:
library(modeest)
mlv(Shark_Tank_Deals$`Total Deal Equity`)
## Warning: argument 'method' is missing. Data are supposed to be continuous.
## Default method 'shorth' is used
## Warning: encountered a tie, and the difference between minimal and
## maximal value is > length('x') * 'tie.limit'
## the distribution could be multimodal
## [1] 20.49922
library(readxl)
Business_School <- read_excel("R Take Home Exam 2024/Task 2/Business School.xlsx")
View(Business_School)
library(pastecs)
round(stat.desc(Business_School[ , c(-1, -2, -5, -8)]), 2)
## Undergrad Grade MBA Grade Employability (Before) Employability (After) Annual Salary
## nbr.val 100.00 100.00 100.00 100.00 100.00
## nbr.null 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00
## min 61.20 58.14 101.00 119.00 20000.00
## max 100.00 95.00 421.00 631.03 340000.00
## range 38.80 36.86 320.00 512.03 320000.00
## sum 7689.90 7604.06 25793.08 42269.06 10905800.00
## median 76.65 76.38 256.83 435.64 103500.00
## mean 76.90 76.04 257.93 422.69 109058.00
## SE.mean 0.75 0.77 5.93 12.92 4150.15
## CI.mean.0.95 1.48 1.52 11.78 25.64 8234.80
## var 55.68 58.91 3522.10 16701.30 1722373474.75
## std.dev 7.46 7.68 59.35 129.23 41501.49
## coef.var 0.10 0.10 0.23 0.31 0.38
library(ggplot2)
ggplot(Business_School, aes(x=`Undergrad Degree`)) +
geom_bar(color= "white", fill= "deepskyblue4") +
ylab("Frequency")
The most common undergraduate degree is Business.
round(stat.desc(Business_School$`Annual Salary`), 2)
## nbr.val nbr.null nbr.na min max range sum median
## 100.00 0.00 0.00 20000.00 340000.00 320000.00 10905800.00 103500.00
## mean SE.mean CI.mean.0.95 var std.dev coef.var
## 109058.00 4150.15 8234.80 1722373474.75 41501.49 0.38
library(ggplot2)
ggplot(Business_School, aes(x= `Annual Salary`)) +
geom_histogram(width= 5, colour= "white", fill= "darkolivegreen") +
ylab("Frequency")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The annual salary distribution is right skewed. The reason for that are the three values above 200,000, especially the highest that is pulling the data to be skewed. But this extreme amount is not an outlier because maybe in some cases it is normal for the annual starting salary to be above 30,000 - it’s just not as frequent. From the data we can see that the student had a previous working experience and we can assume that s/he started from a higher position. Either way, the value is explainable and will not be removed. From the graph we can analyze the most frequent salary which is 100,000.
𝐻0:𝜇MBA Grade = 74
𝐻A:𝜇MBA Grade ≠ 74
Degrees of freedom => 100 - 1 = 99
t.test(Business_School$`MBA Grade`,
mu= 74,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Business_School$`MBA Grade`
## t = 2.6587, df = 99, p-value = 0.00915
## alternative hypothesis: true mean is not equal to 74
## 95 percent confidence interval:
## 74.51764 77.56346
## sample estimates:
## mean of x
## 76.04055
p-value = 0,00915
Alpha = 0,05
Alpha > p-value
The p-value is less than the significance level of 0,05. So, we can reject the null hypothesis (p < 0,009). With 95% confidence interval, we can say that the population mean of the MBA Grade this year is not equal to 74 (conf. interval is 74,51 - 77,56) .
This year, the average of the MBA Grade has increased (mean = 76,04).
library(readxl)
Apartments <- read_excel("R Take Home Exam 2024/Task 3/Apartments.xlsx")
View(Apartments)
Description:
Apartments$Parking <- factor(Apartments$Parking,
levels= c(0, 1),
labels = c("No", "Yes"))
Apartments$Balcony <- factor(Apartments$Balcony,
levels= c(0, 1),
labels = c("No", "Yes"))
t.test(Apartments$Price,
mu= 1900,
alternative = "two.sided")
##
## One Sample t-test
##
## data: Apartments$Price
## t = 2.9022, df = 84, p-value = 0.004731
## alternative hypothesis: true mean is not equal to 1900
## 95 percent confidence interval:
## 1937.443 2100.440
## sample estimates:
## mean of x
## 2018.941
The p-value is less than the significance level of 0,05. So, we can reject the null hypothesis (p < 0,005). With 95% confidence interval, we can say that the population mean of the prices of apartments is not equal to 1900 EUR. (conf. interval is 1937,4 - 2100,4) .
The average of the apartment prices is higher (mean = 2018.9).
fit1 <- lm(Price ~ Age,
data = Apartments)
summary(fit1)
##
## Call:
## lm(formula = Price ~ Age, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -623.9 -278.0 -69.8 243.5 776.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2185.455 87.043 25.108 <0.0000000000000002 ***
## Age -8.975 4.164 -2.156 0.034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 369.9 on 83 degrees of freedom
## Multiple R-squared: 0.05302, Adjusted R-squared: 0.04161
## F-statistic: 4.647 on 1 and 83 DF, p-value: 0.03401
Price = 2185,4 - 8.97*Age
On average, if the Age of the apartment increases for 1 year, the price per m2 will decrease for 8.97 EUR.
Regression coefficient = -8.97. It depicts the relationship between the dependent variable (Price) and the explanatory variable (Age). In this case, it tells us that on average, if the Age of the apartment increases for 1 year, the price per m2 will decrease for 8.97 EUR.
Coefficient of determination - how well the explanatory variable explains the dependent variable. In this case, the coefficient of determination is quite low and closer to zero than to 1, so we can conclude that it explains only 5.3% of the variability in the prices per m2.
cor(Apartments$Price, Apartments$Age)
## [1] -0.230255
Coefficient of correlation (Pearson Correlation Coefficient) - This coefficient shows what kind of relationship exists between the two variables. In this case, the relationship between price per m2 and age in years is linear, weak and negative.
library(car)
## Loading required package: carData
scatterplotMatrix(Apartments[c("Price", "Age", "Distance")],
smooth = TRUE)
Price - Age
The upper middle graph (where price = y and age = x) indicates a negative relationship, meaning that as age increases, price increases. However, as analysed before, this relationship is weak. As for the multicollinearity, there is no multicolinearity since we can not observe values that are quite close to the line which will create a multicolinearity problem. Even when using the “smooth” option, we can detect some observations stepping out of this predetermined line.
Price - Distance
The same observation and conclusion about multicollinearity goes for the relationship that can be observed on the top right graph. Though, compared to the price-age, this is a stronger negative relationship.
fit2 <- lm(Price ~ Age + Distance,
data = Apartments)
vif(fit2)
## Age Distance
## 1.001845 1.001845
Because the VIF of both variables is the same = 1.001845, and is both below 5 and approximate to 1. We can conclude that there is no significant multicollinearity between age and distance.
Apartments$StdRes <- round(rstandard(fit2), 2)
hist(Apartments$StdRes,
xlab = "Standardised Residuals",
ylab = "Frequency",
main = "Histogram of standardised residuals",
col = "magenta4")
shapiro.test(Apartments$StdRes)
##
## Shapiro-Wilk normality test
##
## data: Apartments$StdRes
## W = 0.95307, p-value = 0.003667
The standardised residuals are left skewed. The Shapiro test shows that the test distributions are not normally distributed because p < 0.05 - we reject H0 that the data is normally distributed.
The values above +3 and below -3 are outliers and should be removed.
Apartments$CooksDist <- round(cooks.distance(fit2), 3)
hist(Apartments$CooksDist,
xlab = "Cooks Distances",
ylab = "Frequency",
main = "Histogram of Cooks Distances",
col = "darkseagreen3")
In this graph we can observe a value that has a great impace on our data, and through the below outlined function, we can detect outliers or high values.
head(Apartments[order(-Apartments$CooksDist), c("Price", "CooksDist")], 6)
## # A tibble: 6 × 2
## Price CooksDist
## <dbl> <dbl>
## 1 2180 0.32
## 2 1740 0.104
## 3 2790 0.069
## 4 1760 0.066
## 5 2540 0.061
## 6 2400 0.038
Outliers High Values:
Price - 2180; Price - 1740; Price - 2790; Price - 1760.
I remove the high values:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:pastecs':
##
## first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ApartmentsNew <- Apartments %>% filter(!Price %in% c(2180, 1740, 2790, 1760))
fit2.2 <- lm(Price ~ Age + Distance,
data = ApartmentsNew)
ApartmentsNew$CooksDist <- round(cooks.distance(fit2.2), 3)
hist(ApartmentsNew$CooksDist,
xlab = "Cooks Distances",
ylab = "Frequency",
main = "Histogram of Cooks Distances with removed outliers",
col = "green4")
ApartmentsNew$StdRes <- round(rstandard(fit2.2), 3)
hist(ApartmentsNew$StdRes,
xlab = "Standardised Residuals",
ylab = "Frequency",
main = "Histogram of standardised residuals with removed outliers",
col = "darksalmon")
shapiro.test(ApartmentsNew$StdRes)
##
## Shapiro-Wilk normality test
##
## data: ApartmentsNew$StdRes
## W = 0.94297, p-value = 0.001286
Despite removing the outliers, the p-value of the Shapiro-Wilk normality test is less than 0.05, which indicates that the data is not normally distributed. However, our sample size is over 30, so a violation of this assumption would not significantly affect the results.
ApartmentsNew$Fitted <- scale (fit2.2$fitted.values)
library(car)
scatterplot(y = ApartmentsNew$StdRes,
x = ApartmentsNew$Fitted,
ylab = "Standardized residuals",
xlab = "Standardized fitted values",
boxplots = FALSE,
regLine = FALSE,
smooth = FALSE,
col = "red4")
Looking at this graph, there is no potential risk of heteroskedasticity. We can also check the heteroskedasticity using the Breusch-Pagan test below:
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
ols_test_breusch_pagan(fit2.2)
##
## Breusch Pagan Test for Heteroskedasticity
## -----------------------------------------
## Ho: the variance is constant
## Ha: the variance is not constant
##
## Data
## ---------------------------------
## Response : Price
## Variables: fitted values of Price
##
## Test Summary
## ----------------------------
## DF = 1
## Chi2 = 1.64762
## Prob > Chi2 = 0.1992832
Since the p-value is greater than 1, we can not reject the null hypothesis stating that the variance is constant. This confirms our data homoskedasticity.
summary(fit2)
##
## Call:
## lm(formula = Price ~ Age + Distance, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -603.23 -219.94 -85.68 211.31 689.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2460.101 76.632 32.10 < 0.0000000000000002 ***
## Age -7.934 3.225 -2.46 0.016 *
## Distance -20.667 2.748 -7.52 0.0000000000618 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 286.3 on 82 degrees of freedom
## Multiple R-squared: 0.4396, Adjusted R-squared: 0.4259
## F-statistic: 32.16 on 2 and 82 DF, p-value: 0.00000000004896
Age
On average, if the age of the apartment increases for one year, the price of the apartment decreases for 7.9 EUR per m2 (p < 0.05).
Distance
On average, if the apartment’s distance from the city center increases for 1 km, the price of the apartment decreases for 20.6 EUR per m2 (p < 0.01).
fit3 <- lm(Price ~ Age + Distance + Parking + Balcony,
data = Apartments)
anova(fit3, fit2)
## Analysis of Variance Table
##
## Model 1: Price ~ Age + Distance + Parking + Balcony
## Model 2: Price ~ Age + Distance
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 80 5991088
## 2 82 6720983 -2 -729894 4.8732 0.01007 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, fit3 fits better than fit2 data.
summary(fit3)
##
## Call:
## lm(formula = Price ~ Age + Distance + Parking + Balcony, data = Apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -459.92 -200.66 -57.48 260.08 594.37
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2301.667 94.271 24.415 < 0.0000000000000002 ***
## Age -6.799 3.110 -2.186 0.03172 *
## Distance -18.045 2.758 -6.543 0.00000000528 ***
## ParkingYes 196.168 62.868 3.120 0.00251 **
## BalconyYes 1.935 60.014 0.032 0.97436
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 273.7 on 80 degrees of freedom
## Multiple R-squared: 0.5004, Adjusted R-squared: 0.4754
## F-statistic: 20.03 on 4 and 80 DF, p-value: 0.00000000001849
Balcony
It’s not statistically significant because p > 0.05.
Parking
Given the values of the other explanatory variables (age, distance, and balcony), apartments that have a parking are on average more expensive for 196.17 EUR per m2, than those who don’t have a parking (p < 0.05).
Hypothesis
H0: There is no relationship between the dependent variable and the explanatory variables. H1: There is a relationship between the dependent variable and the explanatory variables.
Since we don’t reject the H0, we can conclude that the regression model is significant, in a sense that, the dependent variable (price), can be explained with the given explanatory variables, i.e. they have an influence over the price.
Apartments$Fitted <- fitted.values(fit3)
Apartments$Residuals <- residuals(fit3)
head(Apartments[ , -6])
## # A tibble: 6 × 8
## Age Distance Price Parking Balcony CooksDist Fitted Residuals
## <dbl> <dbl> <dbl> <fct> <fct> <dbl> <dbl> <dbl>
## 1 7 28 1640 No Yes 0.007 1751. -111.
## 2 18 1 2800 Yes No 0.03 2357. 443.
## 3 7 28 1660 No No 0.006 1749. -88.8
## 4 28 29 1850 No Yes 0.008 1590. 260.
## 5 18 18 1640 Yes Yes 0.005 2053. -413.
## 6 28 12 1770 No Yes 0.005 1897. -127.
Residual for apartment ID2 is the residual for the apartment in the second row, which in this example is 442.59 ≈ 443.