*Note: Exercises from Sections 2.1 and 2.2 are not on this link; they are on a separate PDF on Canvas.

Chapter 1.2

Exercise 20. b.

Construct a histogram using class boundaries 0, 1000, 2000, 3000, 4000, 5000, and 6000.

streetlength <- c(1280, 5320, 4390, 2100, 1240, 3060, 4770, 1050, 360, 3330, 3380, 340, 1000, 960, 1320, 530, 3350, 540, 3870, 1250, 2400, 960, 1120, 2120, 450, 2250, 2320, 2400, 3150, 5700, 5220, 500, 1850, 2460, 5850, 2700, 2730, 1670, 100, 5770, 3150, 1890, 510, 240, 396, 1419, 2109)

hist(streetlength, main = "Histogram of Street Total Length", breaks = c(0, 1000, 2000, 3000, 4000, 5000, 6000), col = "lightblue", xlab = "Street Length", labels = TRUE)

*Note that in this histogram, values equal to a bin boundary, e.g. 2000, are placed in the bin to their left, since R's hist() uses right-closed intervals by default; 2000 is therefore counted in the 1001-2000 bin.
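
To make the boundary rule concrete, the bin counts can be extracted without drawing the plot (hist() with plot = FALSE returns the counts, computed with the same right-closed intervals):

hist(streetlength, breaks = c(0, 1000, 2000, 3000, 4000, 5000, 6000), plot = FALSE)$counts
## [1] 13 10 10  7  2  5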

What proportion of subdivisions have total length less than 2000?

There are 13 streets with total length at most 1000 and 10 streets with total length greater than 1000 but less than 2000. There are 47 streets in total.

(13+10)/47 
## [1] 0.4893617

The proportion of streets with total length less than 2000 is 23/47, which is approximately 48.9% of the streets in the data set.
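
As a quick check, the same proportion can be computed directly from the raw data (no value equals exactly 2000, so the strict inequality matches the bin counts):

mean(streetlength < 2000)
## [1] 0.4893617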

Between 2000 and 4000?

There are 10 streets with total length between 2001 and 3000, and 7 between 3001 and 4000.

(10+7)/47
## [1] 0.3617021

The proportion of streets with total length between 2000 and 4000 is 17/47, which is approximately 36.2% of the streets.
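
Again, a direct check, counting values greater than 2000 and at most 4000 to match the bins:

mean(streetlength > 2000 & streetlength <= 4000)
## [1] 0.3617021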

How would you describe the shape of the histogram?

There is an accumulation of streets between 0 and 2000, around 49% of the total, followed by a progressive decrease in the number of streets from 2001 through 5000. However, this is not simply a positively skewed histogram: the count in the 5001-6000 bin increases again, counteracting the trend observed in the previous bins. The histogram therefore appears bimodal, although there is a higher concentration of streets between 1 and 3000 than between 3001 and 6000.

Chapter 1.3

Exercise 34) a) and b)

Here are the data sets for the problem:

U <- c(6.0, 5.0, 11.0, 33.0, 4.0, 5.0, 80.0, 18.0, 35.0, 17.0, 23.0)
F <- c(4.0, 14.0, 11.0, 9.0, 9.0, 8.0, 4.0, 20.0, 5.0, 8.9, 21.0, 9.2, 3.0, 2.0, 0.3)

U stands for the settled dust concentration in urban homes, and F stands for the settled dust concentration in farm homes.

a)

To determine the mean of each sample, all the values in the sample must be added and the total divided by the number of observations: 11 for U and 15 for F.

The mean of each sample is:

mean(U)
## [1] 21.54545
mean(F)
## [1] 8.56
mean(U)/mean(F)
## [1] 2.516992

It appears that the sample mean for urban homes is higher than that for farm homes. More precisely, the mean for urban homes is around 2.52 times the mean for farm homes.
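
As a sanity check on the definition, the means can also be computed from first principles:

sum(U)/length(U)
## [1] 21.54545
sum(F)/length(F)
## [1] 8.56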

b)

The median is the middle value of each sample when all the values are arranged in increasing order. Let's first sort the samples:

sort(U)
##  [1]  4  5  5  6 11 17 18 23 33 35 80
sort(F)
##  [1]  0.3  2.0  3.0  4.0  4.0  5.0  8.0  8.9  9.0  9.0  9.2 11.0 14.0 20.0
## [15] 21.0

For sample U, that will be the number in position 6, and for sample F, the number in position 8.

median(U)
## [1] 17
median(F)
## [1] 8.9
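
Equivalently, the medians can be read off by indexing the sorted vectors at those positions:

sort(U)[6]
## [1] 17
sort(F)[8]
## [1] 8.9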

The median for farm homes, 8.90, is similar to the sample's mean, 8.56; the two values are very close to each other.

The same, however, is not the case for urban homes, where the median is 17 but the mean is approximately 21.5, a noticeably larger number. This may be due to the presence of a potential outlier.

sort(U)
##  [1]  4  5  5  6 11 17 18 23 33 35 80

If 80 is set aside, the largest value in the sample is 35, less than half of 80. The median depends only on the ordering of the values, so a single extreme value such as 80 cannot radically change it. The mean, however, adds 80 directly into the sum, so this one value has a much more noticeable effect on the result. In sum, the median was not affected by the extreme value in the sample, whereas the mean was, explaining the disparity between the two results.
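
A small sketch of this effect (the capped vector U_capped is only for illustration): replacing 80 with the next-largest value, 35, leaves the median untouched but pulls the mean down by roughly 4.

U_capped <- replace(U, U == 80, 35)  # cap the suspected outlier at 35
median(U_capped)
## [1] 17
mean(U_capped)
## [1] 17.45455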

Exercise 38

Here is the data set for this exercise; the data represent the blood pressure (BP) of 9 randomly selected individuals:

BP <- c(118.6, 127.4, 138.4, 130.0, 113.7, 122.0, 108.3, 131.5, 133.2)

a)

To find the median, let’s turn the sample into an ordered list:

sort(BP)
## [1] 108.3 113.7 118.6 122.0 127.4 130.0 131.5 133.2 138.4

There are 9 values in the sample, so the median will be the value in position 5, which is 127.4.

median(BP)
## [1] 127.4
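
More generally, for an odd sample size n the median is the sorted value in position (n + 1)/2:

sort(BP)[(length(BP) + 1)/2]
## [1] 127.4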

b)

Suppose that the blood pressure of the second individual (127.4) is now 127.6.

Let’s take a look at the new sample:

BP2 <- c(118.6, 127.6, 138.4, 130.0, 113.7, 122.0, 108.3, 131.5, 133.2)
sort(BP2)
## [1] 108.3 113.7 118.6 122.0 127.6 130.0 131.5 133.2 138.4
median(BP2)
## [1] 127.6

The new median of the sample is 127.6. This is because the second patient's blood pressure is the fifth smallest value in both cases, i.e. the middle value. The median therefore changed only by 0.2. However, if the blood pressure of patient two had instead increased to, say, 150, or decreased to 100, the median would still have remained in the range 122-130. This shows that the median is a rather stable measure of a sample's center: it is not dramatically changed by a large increase or decrease in one particular value. The median is, however, sensitive to very small changes in the middle value itself, as this problem demonstrates. If the values were rounded, one would still get essentially the same median for the different values of patient two. In sum, the median may capture very small changes in the value at the center of the sample, but it is not sensitive to very large changes in a single observation; as such, the median is a stable representation of the center of the data set.
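
A small sketch of this stability, using replace() to swap in hypothetical values for patient two:

median(replace(BP, 2, 150))  # patient two's value jumps to 150
## [1] 130
median(replace(BP, 2, 100))  # patient two's value drops to 100
## [1] 122

Both medians stay within the 122-130 range claimed above.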

Exercise 43

The data set for this problem is below; it deals with the lifetimes (LT) of components:

LT <- c(48, 79, 100, 35, 92, 86, 57, 100, 17, 29)

Note: the two 100 values in the sample are not actually 100, but rather 100+; they represent the lifetimes of components that were still functioning after 100 hours, so their exact lifetimes are unknown.

Given that not all the values of the elements in the sample are known, it is not possible to use the average as a measure of the center of the data set. Nevertheless, of the 10 elements, 8 are known, and all of these 8 values are smaller than 100; therefore it is possible to determine the median of this sample.

Let’s first order the sample:

sort(LT)
##  [1]  17  29  35  48  57  79  86  92 100 100

There are 10 elements, therefore the median will be the average of the 5th and 6th ordered elements of the sample: (57 + 79)/2.

median(LT)
## [1] 68
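
Note that the exact magnitude assigned to the censored 100+ lifetimes is irrelevant as long as it stays above the two middle observations; a small sketch, substituting an arbitrary 500:

median(replace(LT, LT == 100, 500))
## [1] 68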

Chapter 1.4

Exercise 45

Here is the data:

GPa <- c(123, 125, 128, 132, 137)

a)

The mean of the sample is:

mean(GPa)
## [1] 129

The deviations from the mean are:

123-129
## [1] -6
125-129
## [1] -4
128-129
## [1] -1
132-129
## [1] 3
137-129
## [1] 8
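
The same deviations can be obtained in a single vectorized step:

GPa - mean(GPa)
## [1] -6 -4 -1  3  8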

b)

The sample variance is:

((-6)^2+(-4)^2+(-1)^2+(3)^2+(8)^2)/(5-1)
## [1] 31.5

Check:

var(GPa)
## [1] 31.5

The sample standard deviation is:

(31.5)^(1/2)
## [1] 5.612486

The standard deviation is approximately 5.612.

Check:

sd(GPa)
## [1] 5.612486

c)

Find the sample variance using the computational formula for Sxx:

((123^2+125^2+128^2+132^2+137^2)-((123+125+128+132+137)^2)/5)/(4)
## [1] 31.5

Or

(sum(GPa^2)-((sum(GPa))^2)/5)/4
## [1] 31.5
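
The shortcut formula generalizes to any sample. A small sketch (the helper name sxx_var is made up for illustration):

sxx_var <- function(x) {
  n <- length(x)
  # Sxx = sum(x^2) - (sum(x))^2/n, then divide by n - 1
  (sum(x^2) - sum(x)^2/n)/(n - 1)
}
sxx_var(GPa)
## [1] 31.5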

d)

Subtract 100 from each observation:

GPa2 <- c(23, 25, 28, 32, 37)

Sample variance:

(sum(GPa2^2)-((sum(GPa2))^2)/5)/4
## [1] 31.5

The sample variance is the same as before. Note that the deviations from the mean did not change either:

mean(GPa2)
## [1] 29
23-29
## [1] -6
25-29
## [1] -4
28-29
## [1] -1
32-29
## [1] 3
37-29
## [1] 8

If 100 is subtracted from every element, then the differences between the elements and the mean (now 29) are the same as in the original data; as such, the sample variance is the same.
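
R confirms this shift invariance directly:

var(GPa - 100)
## [1] 31.5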

Supplementary Exercises

Exercise 70

Here is the data:

Weight <- c(14.6, 14.4, 19.5, 24.3, 16.3, 22.1, 23.0, 18.7, 19.0, 17.0, 19.1, 19.6, 23.2, 18.5, 15.9)
Treadmill <- c(11.3, 5.3, 9.1, 15.2, 10.1, 19.6, 20.8, 10.3, 10.3, 2.6, 16.6, 22.4, 23.6, 12.6, 4.4)

a)

The five-number summaries of the two samples are the following:

fivenum(Weight)
## [1] 14.40 16.65 19.00 20.85 24.30
fivenum(Treadmill)
## [1]  2.6  9.6 11.3 18.1 23.6

Construct a comparative boxplot:

df <- data.frame(Weight, Treadmill)
boxplot(df, xlab="Oxygen Consumption (liters)", horizontal = TRUE)

From the boxplots it is possible to observe that the range of values for the treadmill is much wider than that for weight exercise, and that the median for weight exercise (19.00) is greater than that for the treadmill (11.3). Half of the treadmill data is below 11.3, whereas all the weight data is above 14.40, suggesting that oxygen-consumption values for weight exercise tend to be higher than those for treadmill exercise. It is also worth noting that weight exercise has a greater maximum value (24.30) than treadmill exercise (23.6). Based on this data it can be conjectured that there is more oxygen consumption after weight exercise than after treadmill exercise.

b)

Find the sample differences:

Weight - Treadmill 
##  [1]  3.3  9.1 10.4  9.1  6.2  2.5  2.2  8.4  8.7 14.4  2.5 -2.8 -0.4  5.9
## [15] 11.5
Difference <- Weight - Treadmill

Here is the five number summary for the sample:

fivenum(Difference)
## [1] -2.8  2.5  6.2  9.1 14.4

Construct a box plot:

boxplot(Difference, xlab = "Difference in Oxygen Consumption", horizontal = TRUE)

It is possible to observe that most of the differences (at least 75%) are positive, which suggests that there is more oxygen consumption after weight exercise than after treadmill exercise. It is also worth noticing that the largest 25% of the differences (9.1 to 14.4) have larger absolute values than the smallest 25% (-2.8 to 2.5). This shows that in this sample, when there is more oxygen consumption after treadmill exercise the difference is rather small, but when there is more after weight exercise the difference can be rather large. This further supports the belief that more oxygen is consumed after weight exercise than after treadmill exercise.
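
A quick numeric check of the first claim, since 13 of the 15 differences are positive:

mean(Difference > 0)
## [1] 0.8666667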

Not in Textbook Problem

This is the solution to the not-in-textbook problem.

a) Draw a stem-and-leaf plot of the prices.

*Note that the values on the stem-and-leaf plot are rounded to the nearest ten thousand.

Appartment2020 <- c(350000, 442000, 466000, 475000, 498000, 499000, 529000, 539000, 545000, 549000, 580000, 595000, 600000, 619000, 625000, 639000, 750000, 1150000, 1160000, 1170000, 1185000, 1190000, 1200000, 1210000, 1220000, 1230000, 1260000, 1265000, 1265000, 1280000, 1280000, 1290000, 1295000, 1295000, 1305000, 1310000, 1310000, 1325000, 1325000, 1340000, 1340000, 1355000, 1375000)

stem(Appartment2020,scale=2, width=80, atom=1e-80)
## 
##   The decimal point is 5 digit(s) to the right of the |
## 
##    3 | 5
##    4 | 478
##    5 | 0034558
##    6 | 00234
##    7 | 5
##    8 | 
##    9 | 
##   10 | 
##   11 | 56799
##   12 | 0123677889
##   13 | 00111334468

b) Compare to the stem-and-leaf plot from the class slides for one-bedroom apartments in Morningside Heights in 2016. How do they differ?

Here is a reminder of what the data from class looked like:

*Notice that the prices are in thousands of dollars and were rounded to the nearest ten thousand.

AppartmentClass<-c(380,430,450,450,500,530,540,540,550,600,670,680,700,700,730,730,750,800)

stem(AppartmentClass,width=80)
## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   3 | 8
##   4 | 355
##   5 | 03445
##   6 | 078
##   7 | 00335
##   8 | 0

Between 300,000 and 800,000 there does not appear to be a significant difference between the 2016 and 2020 prices, except that the 2020 prices appear slightly negatively skewed, while the 2016 prices are slightly bimodal (at 500,000 and 700,000).

However, in 2020 there is a whole range of apartments costing between 1,100,000 and 1,400,000 that is not in the 2016 listing, making the 2020 stem-and-leaf plot look very different from the 2016 one.

c) Remove the Vandewater listings from the current data set and redraw. How does the stem-and-leaf compare to the 2016 data now?

Now remove the Vandewater listings:

Withoutvanderwater <- c(350000, 442000, 466000, 475000, 498000, 499000, 529000, 539000, 545000, 549000, 580000, 595000, 600000, 619000, 625000, 639000, 750000)

stem(Withoutvanderwater,scale=1,width = 80,atom=1e-80)
## 
##   The decimal point is 5 digit(s) to the right of the |
## 
##   3 | 5
##   4 | 478
##   5 | 0034558
##   6 | 00234
##   7 | 5

Stem-and-leaf from class:

AppartmentClass<-c(380,430,450,450,500,530,540,540,550,600,670,680,700,700,730,730,750,800)

stem(AppartmentClass,width=80)
## 
##   The decimal point is 2 digit(s) to the right of the |
## 
##   3 | 8
##   4 | 355
##   5 | 03445
##   6 | 078
##   7 | 00335
##   8 | 0

Now the plots look a lot more similar: again, there does not appear to be a significant difference between the 2016 and 2020 prices, except that the 2020 prices appear slightly negatively skewed, while the 2016 prices are slightly bimodal (at 500,000 and 700,000).

d) Draw a cumulative frequency histogram for the full dataset. Describe any prominent features.

h <- hist(Appartment2020, main = "Histogram of 2020 Apartment Prices", col = "lightblue", xlab = "Apartment Prices", las = 1)

h$counts <- cumsum(h$counts)
plot(h, main = "Cumulative Histogram of 2020 Apartment Prices", col = "lightblue", labels = TRUE, xlab = "Apartment Prices", las = 1)

From the cumulative histogram it is possible to see that very few apartments cost 400,000 or less (only 1), that around half of the apartments (23) cost 1,200,000 or less, and that the remaining 20, all Vandewater listings, cost between 1,200,001 and 1,400,000. It is also worth noticing that there are no apartments between 800,001 and 1,000,000, suggesting that apartments tend to cost between 200,001 and 800,000 on the low end, and between 1,000,001 and 1,400,000 on the high end.
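
The cumulative counts behind the plot can be checked numerically:

cumsum(hist(Appartment2020, plot = FALSE)$counts)
## [1]  1 13 17 17 23 43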

e) Draw a density histogram of the full dataset. What is the sum of the area of the bars?

hist(Appartment2020, freq = FALSE, col = "lightblue", xlab = "Apartment Prices", las = 1)

Let's rescale the prices to thousands of dollars so that the densities on the y-axis have a larger magnitude:

Appartments <- c(350, 442, 466, 475, 498, 499, 529, 539, 545, 549, 580, 595, 600, 619, 625, 639, 750, 1150, 1160, 1170, 1185, 1190, 1200, 1210, 1220, 1230, 1260, 1265, 1265, 1280, 1280, 1290, 1295, 1295, 1305, 1310, 1310, 1325, 1325, 1340, 1340, 1355, 1375)
hist(Appartments, freq = FALSE, col = "lightblue", xlab = "Apartment Prices in the Thousands", las = 1)

Since this is a density histogram, the sum of the areas of the bars must be 1. The relative frequency of each class equals the class width times the density of that class, which is the rectangle width times the rectangle height, i.e. the rectangle's area. Adding the relative frequencies of all the classes naturally yields 1. As such, the sum of the areas of the bars is 1.

Let’s check if this is true:

First, the frequency of apartments in each class is used to determine the proportion of apartments belonging to each class. Second, the proportions are divided by the class width (200, in thousands of dollars) to obtain the densities. Third, each density value is multiplied by the class width (rectangle height times rectangle width) and the products are summed, yielding the total area of the bars in the graph, which should be 1.

Appartmentfrequency <- c(1, 12, 4, 0, 6, 20)

Appartmentprobability <- Appartmentfrequency/sum(Appartmentfrequency)

Appartmentprobability
## [1] 0.02325581 0.27906977 0.09302326 0.00000000 0.13953488 0.46511628
Density <- Appartmentprobability/200

Density
## [1] 0.0001162791 0.0013953488 0.0004651163 0.0000000000 0.0006976744
## [6] 0.0023255814
Area <- sum(Density*200)

Area
## [1] 1
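
Alternatively, the total area can be read directly off the object hist() returns (a small sketch; h2 is a throwaway name):

h2 <- hist(Appartments, plot = FALSE)
sum(diff(h2$breaks) * h2$density)
## [1] 1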