1. Virat Kohli’s scores in the last ten innings are as follows:
78, 94, 134, 56, 2, 67, 89, 152, 26, 42
Based on the above data, compute the expected value and standard deviation of Virat Kohli’s batting score.
c1 <- c(78, 94, 134, 56, 2, 67, 89, 152, 26, 42)
mc1 <- sum(c1)/length(c1)
# The problem does not say whether the scores are a population or a sample, so this computes the simple (population) standard deviation, dividing by n rather than n - 1
std <- sqrt(sum((c1 - mc1)**2)/length(c1))
print(paste("Mean: ",mc1))
## [1] "Mean: 74"
print(paste("Simple standard deviation: ",std))
## [1] "Simple standard deviation: 43.8292140016223"
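If the ten innings are instead treated as a sample from a longer career, the usual estimate divides by n - 1. The following is a minimal sketch comparing the two conventions with base R's `mean()` and `sd()`; the names `scores`, `popSD` and `sampleSD` are illustrative and not part of the original answer.

```r
# Same ten scores as in c1 above
scores <- c(78, 94, 134, 56, 2, 67, 89, 152, 26, 42)
n <- length(scores)
avg <- mean(scores)

# Population convention: divide the sum of squared deviations by n
popSD <- sqrt(sum((scores - avg)^2) / n)

# Sample convention: sd() divides by n - 1
sampleSD <- sd(scores)

print(c(mean = avg, popSD = popSD, sampleSD = sampleSD))
```

The two values differ only by the factor sqrt(n / (n - 1)), so the gap shrinks as the number of innings grows.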
2. The amount of regular unleaded gasoline purchased every week at a gas station near UCLA follows the normal distribution with mean 50000 gallons and standard deviation 10000 gallons. The starting supply of gasoline is 74000 gallons, and there is a scheduled weekly delivery of 47000 gallons.
a. Find the probability that, after 11 weeks, the supply of gasoline will be below 20000 gallons.
b. How much should the weekly delivery be so that after 11 weeks the probability that the supply is below 20000 gallons is only 0.5%?
purchMean = 50000
purchSD = 10000
supplyStart = 74000
supplyDel = 47000
exp11WkSupply = supplyStart + supplyDel*11
exp11WkDemand = purchMean*11
# a. Supply after 11 weeks = exp11WkSupply - (total demand), so supply is below 20000 exactly when total demand exceeds exp11WkSupply - 20000.
# Assuming weekly purchases are independent, total demand over 11 weeks is normal with mean 11*purchMean and standard deviation sqrt(11)*purchSD, so we want the upper (right) tail of that curve
pnorm((exp11WkSupply - 20000),exp11WkDemand,(sqrt(11)*purchSD),lower.tail = FALSE)
## [1] 0.2633101
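One way to sanity-check the analytic answer for part a is a small Monte Carlo simulation. The sketch below assumes independent weekly purchases, only looks at the supply at the end of week 11, and uses an arbitrary seed and simulation count; the names `nSim`, `finalSupply` and `simProb` are illustrative.

```r
set.seed(1)        # arbitrary seed, only for reproducibility
nSim <- 100000     # number of simulated 11-week periods (assumed value)
weeks <- 11

# Each row holds 11 simulated weekly purchases
demand <- matrix(rnorm(nSim * weeks, mean = 50000, sd = 10000), nrow = nSim)

# Starting supply plus deliveries minus total purchases
finalSupply <- 74000 + 47000 * weeks - rowSums(demand)

# Share of simulated periods ending below 20000 gallons
simProb <- mean(finalSupply < 20000)
simProb
```

The simulated share should land close to the `pnorm()` result above, with small differences due to sampling noise.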
# b. For the probability of supply falling below 20000 gallons to be only 0.5%, we need to solve for the weekly delivery, so supplyDel is now the unknown x
# First find the z score that leaves an upper-tail probability of 0.5%, i.e. p = 0.005 (qnorm(0.995), rounded)
z = 2.57578
# z = (supplyStart + x*11 - 20000 - exp11WkDemand)/(sqrt(11)*purchSD). Solve this equation for x
expDelivery <- ((z * sqrt(11) * purchSD) + exp11WkDemand + 20000 - supplyStart)/11
expDelivery
## [1] 52857.18
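The delivery amount can be checked by plugging it back into the part a calculation. This sketch takes the z score from `qnorm()` instead of a rounded constant, so the delivery differs from the value above in the first decimal place; the names `sdTotal`, `meanTotal`, `delivery` and `shortfallProb` are illustrative.

```r
z <- qnorm(0.995)               # upper-tail probability of 0.005
sdTotal <- sqrt(11) * 10000     # sd of total demand over 11 weeks
meanTotal <- 11 * 50000         # mean of total demand over 11 weeks

# Solve z = (74000 + 11*x - 20000 - meanTotal) / sdTotal for x
delivery <- (z * sdTotal + meanTotal + 20000 - 74000) / 11

# Plug the delivery back in: probability that supply ends below 20000
shortfallProb <- pnorm(74000 + 11 * delivery - 20000, meanTotal, sdTotal,
                       lower.tail = FALSE)
c(delivery = delivery, shortfallProb = shortfallProb)
```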
3. If the average number of claims handled daily by an insurance company is 5, what proportion of days have fewer than 3 claims? What is the probability that there will be 4 claims on exactly 3 of the next 5 days? Assume that the numbers of claims on different days are independent.
avgClaimsDaily = 5
# Using Poisson distribution - P(x < 3) = P(x = 0) + P(x = 1) + P(x = 2)
exp(-avgClaimsDaily) + exp(-avgClaimsDaily) * avgClaimsDaily + (exp(-avgClaimsDaily) * (avgClaimsDaily ** 2)/2)
## [1] 0.124652
# For part b, first calculate the probability of exactly 4 claims on a single day.
probOf4Claims <- exp(-avgClaimsDaily) * (avgClaimsDaily ** 4)/factorial(4)
# Each of the next 5 days independently has exactly 4 claims with this probability, so the number of such days is binomial; we want exactly 3 of the 5.
choose(5,3) * (probOf4Claims ** 3) * ((1-probOf4Claims)**2)
## [1] 0.03672864
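The same two answers can be cross-checked with base R's distribution functions `ppois()`, `dpois()` and `dbinom()` rather than the hand-written formulas. This is a minimal sketch; the names `lambda`, `pFewerThan3`, `pFourClaims` and `pThreeOfFive` are illustrative.

```r
lambda <- 5

# P(X < 3) = P(X <= 2) for a Poisson count with mean 5
pFewerThan3 <- ppois(2, lambda)

# P(X = 4) on a single day
pFourClaims <- dpois(4, lambda)

# Exactly 3 of 5 independent days with exactly 4 claims each
pThreeOfFive <- dbinom(3, size = 5, prob = pFourClaims)

c(pFewerThan3 = pFewerThan3, pThreeOfFive = pThreeOfFive)
```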
4. An architect is designing a doorway for a public building to be used by people whose heights are normally distributed, with mean 1 meter 75 centimeters and standard deviation 7.5 centimeters. How tall should the doorway be so that no more than 1% of the people bump their heads?
meanVal <- 175
sdVal <- 7.5
# If no more than 1% of people should bump their heads, 99% of the population must be shorter than the doorway, so we need the 99th percentile of height (in cm)
qnorm(0.99,meanVal,sdVal)
## [1] 192.4476
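The percentile can also be written out as mean + z * sd and then checked by computing the share of people taller than the resulting doorway. A short sketch, with illustrative names `meanHeight`, `sdHeight`, `doorHeight` and `shareBumping`:

```r
meanHeight <- 175   # cm
sdHeight <- 7.5     # cm

# 99th percentile written as mean + z * sd
doorHeight <- meanHeight + qnorm(0.99) * sdHeight

# Share of people taller than the doorway; should come back as 1%
shareBumping <- pnorm(doorHeight, meanHeight, sdHeight, lower.tail = FALSE)

c(doorHeight = doorHeight, shareBumping = shareBumping)
```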
5. Prepare a brief report consisting of the analysis of the Boston Housing dataset in R (on any two or three variables of different types). Use both graphical and numerical summaries. Your report should briefly describe what those summaries tell you, and anything of particular note (both univariate and bivariate analysis are required).
library(mlbench)
library(ggplot2)
data("BostonHousing")
summary(BostonHousing)
##       crim                zn             indus       chas         nox
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471   Min.   :0.3850
##  1st Qu.: 0.08205   1st Qu.:  0.00   1st Qu.: 5.19   1: 35   1st Qu.:0.4490
##  Median : 0.25651   Median :  0.00   Median : 9.69           Median :0.5380
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14           Mean   :0.5547
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10           3rd Qu.:0.6240
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74           Max.   :0.8710
##        rm             age              dis              rad
##  Min.   :3.561   Min.   :  2.90   Min.   : 1.130   Min.   : 1.000
##  1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100   1st Qu.: 4.000
##  Median :6.208   Median : 77.50   Median : 3.207   Median : 5.000
##  Mean   :6.285   Mean   : 68.57   Mean   : 3.795   Mean   : 9.549
##  3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188   3rd Qu.:24.000
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.000
##       tax           ptratio            b              lstat
##  Min.   :187.0   Min.   :12.60   Min.   :  0.32   Min.   : 1.73
##  1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38   1st Qu.: 6.95
##  Median :330.0   Median :19.05   Median :391.44   Median :11.36
##  Mean   :408.2   Mean   :18.46   Mean   :356.67   Mean   :12.65
##  3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23   3rd Qu.:16.95
##  Max.   :711.0   Max.   :22.00   Max.   :396.90   Max.   :37.97
##       medv
##  Min.   : 5.00
##  1st Qu.:17.02
##  Median :21.20
##  Mean   :22.53
##  3rd Qu.:25.00
##  Max.   :50.00
# 1. 75% of the towns in Boston have a per capita crime rate below about 3.68. The maximum of 88.98 is an extreme outlier
# 2. On average, more than 68% of the owner-occupied units in a town were built prior to 1940
# 3. Pupil-teacher ratio is left skewed (mean 18.46 below median 19.05); however, the difference between mean and median is small
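# qplot() is deprecated in recent ggplot2 releases, so the histogram is built with ggplot() + geom_histogram()
ggplot(BostonHousing,aes(x=ptratio)) + geom_histogram() + labs(title = 'Distribution of Pupil-Teacher ratio', x='pupil-teacher ratio by town', y = 'Frequency')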
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
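The message above asks for an explicit `binwidth`. One common choice is the Freedman-Diaconis rule, 2 * IQR / n^(1/3). The helper below is a sketch of that rule; `fd_binwidth` is a hypothetical name, not part of ggplot2, and it is demonstrated on a made-up vector so the block runs without the dataset.

```r
# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1/3)
}

# Made-up example values, only to show the helper running
exampleValues <- 1:100
bw <- fd_binwidth(exampleValues)
bw

# Possible use with the data above (assumes BostonHousing and ggplot2 are loaded):
# ggplot(BostonHousing, aes(x = ptratio)) +
#   geom_histogram(binwidth = fd_binwidth(BostonHousing$ptratio))
```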
# 4. Property tax rate is highly right skewed
ggplot(BostonHousing,aes(x=tax)) + geom_boxplot() + labs(title = 'Distribution of Property tax rate', x='full-value property-tax rate per USD 10,000')
summary(BostonHousing$tax)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   187.0   279.0   330.0   408.2   666.0   711.0
# 5. Average number of rooms per dwelling is approximately normally distributed, with a mean of 6.285 rooms per dwelling
# qplot() is deprecated in recent ggplot2 releases, so the histogram is built with ggplot() + geom_histogram()
ggplot(BostonHousing,aes(x=rm)) + geom_histogram() + labs(title = 'Distribution of avg. no. of rooms per dwelling', x='average number of rooms per dwelling', y = 'Frequency')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(BostonHousing$rm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.561   5.886   6.208   6.285   6.623   8.780
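The skewness statements in the univariate summaries rest on comparing the mean with the median. A moment-based skewness coefficient is one way to put a number on that. The sketch below defines a hypothetical helper `skewness_coef` in base R and runs it on two made-up vectors; applying it to columns such as `BostonHousing$tax` is left as a commented suggestion.

```r
# Moment-based skewness: third central moment divided by variance^(3/2)
# Positive values suggest a longer right tail, negative a longer left tail
skewness_coef <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Made-up vectors for illustration
rightSkewed <- c(1, 2, 3, 4, 100)
symmetric <- c(1, 2, 3, 4, 5)

c(rightSkewed = skewness_coef(rightSkewed),
  symmetric = skewness_coef(symmetric))

# Possible use with the data above (assumes BostonHousing is loaded):
# sapply(BostonHousing[, c("ptratio", "tax", "rm")], skewness_coef)
```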
# Let's now turn to bivariate analysis by checking the following:
# 1. Is the per capita crime rate related to the % of lower status population in the town?
ggplot(BostonHousing,aes(x=lstat,y=crim)) + geom_point() + labs(title = 'Lower status pop. vs crime rate', x='Lower status population % per town', y = 'Per capita crime rate')
# The correlation test gives a moderate positive correlation (about 0.46): towns with a larger lower status population tend to have a higher per capita crime rate, although the scatter plot does not show a clear linear pattern.
cor.test(BostonHousing$lstat,BostonHousing$crim)
##
## Pearson's product-moment correlation
##
## data: BostonHousing$lstat and BostonHousing$crim
## t = 11.491, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.3836915 0.5220562
## sample estimates:
## cor
## 0.4556215
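Pearson's coefficient measures linear association, so it can understate a relationship that is monotonic but curved, which is one possible reason a scatter plot and the coefficient seem to disagree. A rank-based Spearman coefficient is a common robustness check. The sketch below uses synthetic data (not the Boston data) with an arbitrary seed to show the difference; all variable names are illustrative.

```r
set.seed(42)   # arbitrary seed, only for reproducibility

# Synthetic example: y increases with x, but exponentially rather than linearly
x <- rnorm(500)
y <- exp(2 * x + rnorm(500, sd = 0.5))

pearson <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
c(pearson = pearson, spearman = spearman)

# Possible use with the data above (assumes BostonHousing is loaded):
# cor.test(BostonHousing$lstat, BostonHousing$crim, method = "spearman", exact = FALSE)
```

On data like this the Spearman value is typically much closer to 1 than the Pearson value, because ranks ignore the curvature.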
# 2. A sensitive question: is the variable b related to the crime rate? Note that b is not a proportion: the dataset defines it as 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town.
# There is a negative correlation between b and the crime rate. Because b is a non-monotonic transformation of Bk, this cannot be read directly as a statement about the proportion itself.
ggplot(BostonHousing,aes(x=b,y=crim)) + geom_point() + labs(title = 'b index vs crime rate', x='b = 1000(Bk - 0.63)^2 per town', y = 'Per capita crime rate')
cor.test(BostonHousing$b,BostonHousing$crim)
##
## Pearson's product-moment correlation
##
## data: BostonHousing$b and BostonHousing$crim
## t = -9.367, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4568967 -0.3082415
## sample estimates:
## cor
## -0.3850639
ggplot(BostonHousing,aes(x=b)) + geom_boxplot()
# Since there are so many low-end outliers, let's exclude them and redo the analysis for towns with b of 380 or above
ggplot(BostonHousing[BostonHousing$b >= 380,],aes(x=b,y=crim)) + geom_point() + labs(title = 'b index (>= 380) vs crime rate', x='b = 1000(Bk - 0.63)^2 per town', y = 'Per capita crime rate')
# After filtering, the correlation is positive but very weak (about 0.06) and not statistically significant (p = 0.239; the confidence interval includes 0).
cor.test(BostonHousing[BostonHousing$b >= 380,]$b,BostonHousing[BostonHousing$b >= 380,]$crim)
##
## Pearson's product-moment correlation
##
## data: BostonHousing[BostonHousing$b >= 380, ]$b and BostonHousing[BostonHousing$b >= 380, ]$crim
## t = 1.1794, df = 348, p-value = 0.239
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.04201094 0.16682315
## sample estimates:
## cor
## 0.06309677
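The `b` column in this dataset is documented as 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town, so it is not monotonic in Bk. The short sketch below tabulates the transformation over a grid of proportions to show why a correlation with `b` is hard to interpret in terms of the proportion; `blackShare` and `bIndex` are illustrative names.

```r
# Grid of possible proportions from 0 to 1
blackShare <- seq(0, 1, by = 0.1)

# The transformation used to define the b column
bIndex <- 1000 * (blackShare - 0.63)^2

data.frame(blackShare, bIndex)
```

The index is largest (396.9) at a proportion of 0, falls to near 0 around 0.63, and rises again beyond that, so the many towns with `b` close to the maximum are towns with a proportion close to 0.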
# 3. Does having fewer teachers per pupil go with a higher crime rate? For this question let's take ptratio vs crime rate
ggplot(BostonHousing,aes(x=ptratio)) + geom_boxplot()
# The pupil-teacher ratio is left skewed and its correlation with the crime rate is positive. The correlation is statistically significant but weak (about 0.29), so on its own it is not enough to conclude that a higher pupil-teacher ratio means a higher crime rate.
ggplot(BostonHousing,aes(x=ptratio,y=crim)) + geom_point() + labs(title = 'Pupil-Teacher ratio vs crime rate', x='Pupil-Teacher ratio', y = 'Per capita crime rate')
cor.test(BostonHousing$ptratio,BostonHousing$crim)
##
## Pearson's product-moment correlation
##
## data: BostonHousing$ptratio and BostonHousing$crim
## t = 6.8014, df = 504, p-value = 2.943e-11
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.2080348 0.3678180
## sample estimates:
## cor
## 0.2899456
# 4. Is the crime rate correlated with the number of rooms per dwelling?
# This plot weakly suggests a negative correlation: as the average number of rooms increases, the crime rate decreases.
ggplot(BostonHousing,aes(x=rm,y=crim)) + geom_point() + labs(title = 'Avg. no. of rooms/dwelling vs crime rate', x='Average number of rooms per dwelling', y = 'Per capita crime rate')
# The negative correlation below suggests the same as well.
cor.test(BostonHousing$rm,BostonHousing$crim)
##
## Pearson's product-moment correlation
##
## data: BostonHousing$rm and BostonHousing$crim
## t = -5.0448, df = 504, p-value = 6.347e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3006692 -0.1346514
## sample estimates:
## cor
## -0.2192467
# If we drop the outliers in rm and concentrate on the majority (4.9 to 7.5 rooms), the negative correlation persists but is weaker (about -0.13). This is consistent with, though it does not prove, lower crime rates in more prosperous towns.
summary(BostonHousing$rm)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.561   5.886   6.208   6.285   6.623   8.780
cor.test(BostonHousing[BostonHousing$rm >= 4.9 & BostonHousing$rm <= 7.5,]$rm,BostonHousing[BostonHousing$rm >= 4.9 & BostonHousing$rm <= 7.5,]$crim)
##
## Pearson's product-moment correlation
##
## data: BostonHousing[BostonHousing$rm >= 4.9 & BostonHousing$rm <= 7.5, ]$rm and BostonHousing[BostonHousing$rm >= 4.9 & BostonHousing$rm <= 7.5, ]$crim
## t = -2.726, df = 468, p-value = 0.006651
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.21305958 -0.03496735
## sample estimates:
## cor
## -0.1250204
ggplot(BostonHousing[BostonHousing$rm >= 4.9 & BostonHousing$rm <= 7.5,],aes(x=rm,y=crim)) + geom_point() + geom_smooth() + labs(title = 'Filtered Avg. no. of rooms/dwelling vs crime rate', x='Average number of rooms per dwelling (between 4.9 and 7.5)', y = 'Per capita crime rate')
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
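The cutoffs of 4.9 and 7.5 rooms used in question 4 were picked by eye. A more reproducible alternative is the 1.5 * IQR rule that boxplots use for their whiskers. The helper below is a sketch of that rule; `iqr_fences` is a hypothetical name, and it is run on a made-up vector so the block works without the dataset.

```r
# Lower and upper fences: Q1 - k * IQR and Q3 + k * IQR (k = 1.5 by convention)
iqr_fences <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Made-up vector with one obvious outlier
exampleValues <- c(1:10, 100)
fences <- iqr_fences(exampleValues)
fences

# Keep only the values inside the fences
kept <- exampleValues[exampleValues >= fences["lower"] & exampleValues <= fences["upper"]]
kept

# Possible use with the data above (assumes BostonHousing is loaded):
# f <- iqr_fences(BostonHousing$rm)
# inside <- BostonHousing[BostonHousing$rm >= f["lower"] & BostonHousing$rm <= f["upper"], ]
```

Storing the filtered rows once, as in the commented lines, also avoids repeating the same subsetting expression in every call.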
# 5. Does a larger proportion of non-retail business acres per town go with a higher nitric oxides concentration?
cor.test(BostonHousing$indus,BostonHousing$nox)
##
## Pearson's product-moment correlation
##
## data: BostonHousing$indus and BostonHousing$nox
## t = 26.554, df = 504, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7247252 0.7977188
## sample estimates:
## cor
## 0.7636514
ggplot(BostonHousing,aes(x=indus,y=nox)) + geom_point() + labs(title = 'Industrial area vs Nitric oxides concentration', x='proportion of non-retail business acres per town', y = 'nitric oxides concentration (parts per 10 million)') + geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Both the graph and the correlation (about 0.76) clearly indicate higher nitric oxides concentrations in towns with a larger share of industrial area.
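The numbers in the `cor.test()` output for `indus` and `nox` are linked by a simple formula: for a Pearson correlation r with n - 2 degrees of freedom, t = r * sqrt(df / (1 - r^2)). The sketch below recomputes the t statistic from the reported r and df and also gives r squared, the share of variance a straight-line fit would account for; the names `r`, `degFree`, `tStat`, `rSquared` and `pValue` are illustrative.

```r
r <- 0.7636514    # correlation reported by cor.test() above
degFree <- 504    # degrees of freedom reported above (n - 2)

# t statistic implied by r and the degrees of freedom
tStat <- r * sqrt(degFree / (1 - r^2))

# Share of variance in one variable linearly associated with the other
rSquared <- r^2

# Two-sided p-value from the t distribution
pValue <- 2 * pt(abs(tStat), degFree, lower.tail = FALSE)

c(tStat = tStat, rSquared = rSquared, pValue = pValue)
```

The recomputed t statistic agrees with the 26.554 shown in the output, and an r squared of roughly 0.58 is a reminder that a sizeable share of the variation in `nox` is not captured by `indus` alone.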