Blomfeldt Homework 2

#1) There are 12 numbers on a list, and the mean is 24. The smallest number on the list is changed from 11.9 to 1.19.

#(a) Is it possible to determine the direction in which (increase/decrease) the mean changes? Or how much the mean changes? If so, by how much does it change? If not, why not?

mean(c(11.9,24,24,24,24,24,24,24,24,24,24,36.1))

## [1] 24

mean(c(1.19,24,24,24,24,24,24,24,24,24,24,36.1))

## [1] 23.1075

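From the output above we can see that it is possible to determine both the direction and the size of the change. Lowering one value lowers the sum, so the mean decreases, and it does so by (11.9 - 1.19)/12 = 0.8925 no matter what the other eleven numbers are. The new mean is 24 - 0.8925 = 23.1075.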
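
The size of the change does not depend on the particular example list used above. A minimal sketch of the arithmetic, with variable names chosen here only for illustration:

```r
# Changing one value shifts the mean by (new - old) / n,
# whatever the other values on the list are.
n <- 12            # number of values on the list
old_mean <- 24     # mean given in the problem
old_value <- 11.9  # smallest value before the change
new_value <- 1.19  # smallest value after the change

mean_change <- (new_value - old_value) / n
new_mean <- old_mean + mean_change

mean_change  # negative, so the mean decreases
new_mean
```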

#(b) Is it possible to determine the direction in which the median changes? Or how much the median changes? If so, by how much does it change? If not, why not?

Given the conditions of the problem (12 values, a mean of 24, and the smallest value changed from 11.9 to 1.19), the median does not change, so there is no direction or amount of change to report. With 12 values, the median is the average of the 6th and 7th values in sorted order. Since 11.9 is already the smallest value, lowering it leaves it in first position and does not affect the two middle values.
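
As a quick check, the sketch below lowers the minimum of two lists and compares medians. `list_a` is the example list from part (a); `list_b` is a second, hypothetical list made up here that also has 12 values, a mean of 24, and a minimum of 11.9.

```r
# Lower the smallest value of a list to 1.19 and leave the rest alone.
lower_min <- function(x) replace(x, which.min(x), 1.19)

list_a <- c(11.9, rep(24, 10), 36.1)                             # list from part (a)
list_b <- c(11.9, 12, 14, 18, 20, 22, 26, 28, 30, 32, 36, 38.1)  # hypothetical list

# The two middle values are untouched, so each median stays the same.
c(before = median(list_a), after = median(lower_min(list_a)))
c(before = median(list_b), after = median(lower_min(list_b)))
```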

#(c) Is it possible to predict the direction in which the standard deviation changes? If so, does it get larger or smaller? If not, why not? Describe why it is difficult to predict by how much the standard deviation will change in this case

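The standard deviation must increase. The smallest value is at or below the mean, so lowering it moves it further from the center of the data; the mean also drops by 0.8925, but that shift is too small to offset the added spread. In fact, the sum of squared deviations from the mean rises by 2(10.71)(24 - 11.9) + (10.71)^2(11/12), which is about 364.33, whatever the other values are. It is still difficult to determine by how much the standard deviation changes, because the standard deviation is the square root of that sum divided by 11, so the size of the change depends on the initial standard deviation, which we are not given.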
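
A small numerical sketch of this, using the example list from part (a) and a second hypothetical list (made up here) with the same mean and minimum. Under these assumptions the sum of squared deviations grows by the same amount in both lists, while the standard deviation grows by different amounts:

```r
# Sum of squared deviations from the mean.
ss <- function(x) sum((x - mean(x))^2)

# Lower the smallest value of a list to 1.19.
lower_min <- function(x) replace(x, which.min(x), 1.19)

list_a <- c(11.9, rep(24, 10), 36.1)                             # list from part (a)
list_b <- c(11.9, 12, 14, 18, 20, 22, 26, 28, 30, 32, 36, 38.1)  # hypothetical list

# Change in the sum of squared deviations: the same for both lists.
ss_change_a <- ss(lower_min(list_a)) - ss(list_a)
ss_change_b <- ss(lower_min(list_b)) - ss(list_b)
c(ss_change_a, ss_change_b)

# Change in the standard deviation: positive for both, but not equal.
sd_change_a <- sd(lower_min(list_a)) - sd(list_a)
sd_change_b <- sd(lower_min(list_b)) - sd(list_b)
c(sd_change_a, sd_change_b)
```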

#2. A zoologist collected wild lizards in the Southwestern United States. Thirty lizards from the genus Phrynosoma were placed on a treadmill and their speed measured. The recorded speeds (meters/second) (the fastest time to run a half meter) for the thirty lizards are summarized in the relative frequency histogram below. (Data Courtesy of K. Bonine *)

#(a) Is the percent of lizards with recorded speed below 1.25 closest to: 25%, 50%, or 75%?

Based on the histogram, 6 lizards had speeds below 1.25. With 30 lizards tested in total, 6/30 x 100 = 20%, which is closest to 25%.

#(b) In which interval are there more speeds recorded: 1.5-1.75 or 2-2.5?

More speeds were recorded in the 1.5-1.75 interval (7) than in the 2-2.5 interval (4).

#(c) About how many lizards had recorded speeds above 1 meters/second?

28 lizards had recorded speeds above 1 m/s

#(d) In which bin does the median fall? Show how you know.

The median falls in the 1.5-1.75 bin. You can tell by adding up the relative frequencies from left to right until the cumulative relative frequency reaches 0.5, which happens in the 1.5-1.75 bin.
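
The cumulative-frequency method can be sketched in R. The bin counts below are hypothetical placeholders, not values read from the actual histogram; only the total of 30 lizards and the 0.25-wide bins are taken from the problem.

```r
# Hypothetical bin counts for 30 lizards (illustration only).
bins <- c("0.75-1.00", "1.00-1.25", "1.25-1.50", "1.50-1.75",
          "1.75-2.00", "2.00-2.25", "2.25-2.50")
counts <- c(2, 4, 6, 7, 7, 2, 2)

rel_freq <- counts / sum(counts)  # relative frequency of each bin
cum_rel_freq <- cumsum(rel_freq)  # running total from left to right

data.frame(bins, rel_freq, cum_rel_freq)

# The median bin is the first bin where the running total reaches 0.5.
median_bin <- bins[which(cum_rel_freq >= 0.5)[1]]
median_bin
```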

#3) After manufacture, computer disks are tested for errors. The table below gives the number of errors detected on a random sample of 100 disks. Hint: You can use the rep() function in R to make a vector of repeated numbers.

#(a) Describe the type of data (ex: nominal) that is being recorded about the sample of 100 disks, being as specific as possible.

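The data are quantitative (numerical), discrete, and univariate: each disk has one recorded value, a count of errors from 0 to 4. Counts are measured on a ratio scale, which is more specific than ordinal because differences between values and a true zero are meaningful. The distribution of the counts is right-skewed.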

#(b) Construct a frequency histogram of the information with R.

library(tidyverse)

## ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0

## ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

disks <- c(rep(0,41), rep(1,31), rep(2,15), rep(3,8), rep(4,5))
frames <- data.frame(disks, stringsAsFactors = TRUE) 
frames

##     disks
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 7       0
## 8       0
## 9       0
## 10      0
## 11      0
## 12      0
## 13      0
## 14      0
## 15      0
## 16      0
## 17      0
## 18      0
## 19      0
## 20      0
## 21      0
## 22      0
## 23      0
## 24      0
## 25      0
## 26      0
## 27      0
## 28      0
## 29      0
## 30      0
## 31      0
## 32      0
## 33      0
## 34      0
## 35      0
## 36      0
## 37      0
## 38      0
## 39      0
## 40      0
## 41      0
## 42      1
## 43      1
## 44      1
## 45      1
## 46      1
## 47      1
## 48      1
## 49      1
## 50      1
## 51      1
## 52      1
## 53      1
## 54      1
## 55      1
## 56      1
## 57      1
## 58      1
## 59      1
## 60      1
## 61      1
## 62      1
## 63      1
## 64      1
## 65      1
## 66      1
## 67      1
## 68      1
## 69      1
## 70      1
## 71      1
## 72      1
## 73      2
## 74      2
## 75      2
## 76      2
## 77      2
## 78      2
## 79      2
## 80      2
## 81      2
## 82      2
## 83      2
## 84      2
## 85      2
## 86      2
## 87      2
## 88      3
## 89      3
## 90      3
## 91      3
## 92      3
## 93      3
## 94      3
## 95      3
## 96      4
## 97      4
## 98      4
## 99      4
## 100     4

ggplot(data=frames,aes(x= disks)) + 
  geom_histogram(fill="white",color="black", binwidth = 1) + 
  ggtitle("Histogram of Disk Defects (3B)")

#(c) What is the shape of the histogram for the number of defects observed in this sample?

The histogram is right-skewed.

#(d) Calculate the mean and median number of errors detected on the 100 disks by hand and with R. How do the mean and median values compare and is that consistent with what we would guess based on the shape?

total_errors <- 0*41+1*31+2*15+3*8+4*5
n_disks <- 41+31+15+8+5
total_errors/n_disks

## [1] 1.05

41+9

## [1] 50

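From the R output above, the mean is 1.05. With 100 disks, the median is the average of the 50th and 51st values in sorted order; the first 41 values are 0 and the next 31 are 1, so both fall in the second bin (41 + 9 = 50) and the median is 1. This is consistent with a right-skewed histogram, since the mean is slightly higher than the median.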
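
The hand calculation can be checked against R's built-in functions. A minimal sketch that rebuilds the `disks` vector from part (b) so it runs on its own:

```r
# Rebuild the data: 41 disks with 0 errors, 31 with 1, and so on.
disks <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))

mean(disks)    # mean number of errors per disk
median(disks)  # middle of the sorted data

# The 50th and 51st sorted values, which are averaged to give the median.
sort(disks)[c(50, 51)]
```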

#(e) Calculate the sample standard deviation with R. Explain what this value means in the context of the problem.

sd(disks)

## [1] 1.157976

The standard deviation describes how far, on average, the individual disks' error counts fall from the mean. Here a typical disk's error count differs from the mean of 1.05 errors by about 1.16 errors.
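
For reference, `sd()` computes the sample standard deviation, which divides by n - 1. A short sketch that reproduces the value step by step (the `disks` vector is rebuilt so the block runs on its own):

```r
disks <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))

# Distance of each disk's error count from the mean.
deviations <- disks - mean(disks)

# Sample standard deviation: square root of the sum of squared
# deviations divided by (n - 1).
sd_by_hand <- sqrt(sum(deviations^2) / (length(disks) - 1))

sd_by_hand
all.equal(sd_by_hand, sd(disks))  # compare with the built-in function
```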

#(f) Calculate the first and third quartiles and IQR by hand and with R. Are the values consistent between the two methods? Explain what the three values mean in the context of the problem.

quantile(disks, probs = 0.25)

## 25% 
##   0

quantile(disks, probs = 0.75)

## 75% 
##   2

IQR = 3rd quartile - 1st quartile = 2 - 0 = 2. The values calculated by hand match the R output. In this context, the 1st quartile is the median of the lower half of the data, which is 0, so at least 25% of the disks had no errors. The 3rd quartile is the median of the upper half of the data, which is 2, so at least 75% of the disks had 2 or fewer errors. The IQR measures the spread of the middle 50% of the data: those disks' error counts span a range of 2.
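
The by-hand method (median of each half of the sorted data) can be written out as a short sketch and compared with R's `IQR()`. The `disks` vector is rebuilt so the block runs on its own:

```r
disks <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))
sorted <- sort(disks)

# Split the 100 sorted values into a lower and an upper half.
lower_half <- sorted[1:50]
upper_half <- sorted[51:100]

q1 <- median(lower_half)  # first quartile by hand
q3 <- median(upper_half)  # third quartile by hand

c(Q1 = q1, Q3 = q3, IQR = q3 - q1)
IQR(disks)  # built-in calculation for comparison
```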

#(g) What proportion of the computer disks had a number of errors greater than the mean number of errors?

mean((disks>1.05))

## [1] 0.28

From the proportion shown above, 28% of the computer disks had more errors than the mean of 1.05.

#(h) What range of values for this sample data are not considered outliers using the [Q1-1.5IQR, Q3+1.5IQR] designation (using the IQR you calculated by hand)?

For the lower bound: 0 - 1.5 x 2 = -3. For the upper bound: 2 + 1.5 x 2 = 5. Values in the range [-3, 5] are not considered outliers. Since every disk had between 0 and 4 errors, this sample has no outliers.
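
The same bounds can be computed in R and used to count any outliers. A minimal sketch, with `lower_fence` and `upper_fence` as names chosen here for illustration:

```r
disks <- c(rep(0, 41), rep(1, 31), rep(2, 15), rep(3, 8), rep(4, 5))

q1 <- 0         # first quartile from part (f)
q3 <- 2         # third quartile from part (f)
iqr <- q3 - q1  # interquartile range

lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr
c(lower_fence, upper_fence)

# Number of disks whose error count falls outside the fences.
n_outliers <- sum(disks < lower_fence | disks > upper_fence)
n_outliers
```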

#(i) Make a boxplot of the data using R and compare the lines to the values you calculated by hand.

ggplot(data=frames,aes(y=disks)) + 
  geom_boxplot(fill="darkorchid") + 
  ggtitle("Box Plot of Defect Frequency")

The 1st and 3rd Quartiles as well as the Median match the data calculated by hand.

#(j) Compare and contrast (briefly) the information about the data given by the histogram in part b and the boxplot in part i.

The histogram shows the frequency of each number of defects, but it does a poor job of displaying the mean and median, especially since the data are skewed. The box plot does a good job of presenting the median and the range of the middle 50% of the data, but it gives no sense of how large the sample is or how many disks had each count.

#4. The file brexit.csv contains the results of 127 polls, including both online polls and telephone polls, carried out from January 2016 to the referendum date on June 23, 2016. Use that dataset to answer the following questions.

#a. Use R to create a histogram for the proportion who answered “Remain” when polled. Describe the shape of the data.

brexit <- read_csv("brexit.csv")

## Parsed with column specification:
## cols(
##   startdate = col_date(format = ""),
##   enddate = col_date(format = ""),
##   pollster = col_character(),
##   poll_type = col_character(),
##   samplesize = col_double(),
##   remain = col_double(),
##   leave = col_double(),
##   undecided = col_double(),
##   spread = col_double()
## )

brexit

## # A tibble: 127 x 9
##    startdate  enddate    pollster poll_type samplesize remain leave undecided
##    <date>     <date>     <chr>    <chr>          <dbl>  <dbl> <dbl>     <dbl>
##  1 2016-06-23 2016-06-23 YouGov   Online          4772   0.52  0.48      0   
##  2 2016-06-22 2016-06-22 Populus  Online          4700   0.55  0.45      0   
##  3 2016-06-20 2016-06-22 YouGov   Online          3766   0.51  0.49      0   
##  4 2016-06-20 2016-06-22 Ipsos M… Telephone       1592   0.49  0.46      0.01
##  5 2016-06-20 2016-06-22 Opinium  Online          3011   0.44  0.45      0.09
##  6 2016-06-17 2016-06-22 ComRes   Telephone       1032   0.54  0.46      0   
##  7 2016-06-17 2016-06-22 ComRes   Telephone       1032   0.48  0.42      0.11
##  8 2016-06-16 2016-06-22 TNS      Online          2320   0.41  0.43      0.16
##  9 2016-06-20 2016-06-20 Survati… Telephone       1003   0.45  0.44      0.11
## 10 2016-06-18 2016-06-19 YouGov   Online          1652   0.42  0.44      0.13
## # … with 117 more rows, and 1 more variable: spread <dbl>

ggplot(data=brexit,aes(x=remain))+
  geom_histogram(fill="blue",color="black")+
  ggtitle("Remain Responses")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This graph is unimodal. It is close to symmetric, but the right side jumps around too much and has gaps that are too large to call it symmetric.

#b. Now construct two separate histograms for the proportion who answered “Remain”. Make one histogram for online polls and another histogram for telephone polls. Describe the shape and relative position of the data.

online <- brexit %>% filter(poll_type=="Online")
online

## # A tibble: 85 x 9
##    startdate  enddate    pollster poll_type samplesize remain leave undecided
##    <date>     <date>     <chr>    <chr>          <dbl>  <dbl> <dbl>     <dbl>
##  1 2016-06-23 2016-06-23 YouGov   Online          4772   0.52  0.48      0   
##  2 2016-06-22 2016-06-22 Populus  Online          4700   0.55  0.45      0   
##  3 2016-06-20 2016-06-22 YouGov   Online          3766   0.51  0.49      0   
##  4 2016-06-20 2016-06-22 Opinium  Online          3011   0.44  0.45      0.09
##  5 2016-06-16 2016-06-22 TNS      Online          2320   0.41  0.43      0.16
##  6 2016-06-18 2016-06-19 YouGov   Online          1652   0.42  0.44      0.13
##  7 2016-06-16 2016-06-17 YouGov   Online          1694   0.44  0.43      0.09
##  8 2016-06-14 2016-06-17 Opinium  Online          2006   0.44  0.44      0.12
##  9 2016-06-15 2016-06-16 YouGov   Online          1734   0.42  0.44      0.09
## 10 2016-06-10 2016-06-15 BMG Res… Online          1468   0.37  0.47      0.16
## # … with 75 more rows, and 1 more variable: spread <dbl>

telephone <- brexit %>% filter(poll_type=="Telephone")
telephone

## # A tibble: 42 x 9
##    startdate  enddate    pollster poll_type samplesize remain leave undecided
##    <date>     <date>     <chr>    <chr>          <dbl>  <dbl> <dbl>     <dbl>
##  1 2016-06-20 2016-06-22 Ipsos M… Telephone       1592   0.49  0.46      0.01
##  2 2016-06-17 2016-06-22 ComRes   Telephone       1032   0.54  0.46      0   
##  3 2016-06-17 2016-06-22 ComRes   Telephone       1032   0.48  0.42      0.11
##  4 2016-06-20 2016-06-20 Survati… Telephone       1003   0.45  0.44      0.11
##  5 2016-06-16 2016-06-19 ORB/Tel… Telephone        800   0.53  0.46      0.02
##  6 2016-06-17 2016-06-18 Survati… Telephone       1004   0.45  0.42      0.13
##  7 2016-06-15 2016-06-15 Survati… Telephone       1104   0.42  0.45      0.13
##  8 2016-06-10 2016-06-15 BMG Res… Telephone       1064   0.46  0.43      0.11
##  9 2016-06-11 2016-06-14 Ipsos M… Telephone       1257   0.43  0.49      0.03
## 10 2016-06-10 2016-06-13 ICM      Telephone       1000   0.45  0.5       0.05
## # … with 32 more rows, and 1 more variable: spread <dbl>

plot1=ggplot(data=online,aes(x=remain))+
  geom_histogram(fill="blue",color="black")+
  ggtitle("Online Remain Responses")

plot2=ggplot(data=telephone,aes(x=remain))+
  geom_histogram(fill="blue",color="black")+
  ggtitle("Telephone Remain Responses")

library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

grid.arrange(plot1, plot2, ncol=2)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

As seen in the histograms above, the online histogram is unimodal, with much of the data on the left side of the graph (lower Remain proportions). The telephone histogram is bimodal, with the data concentrated on the right side of the graph (higher Remain proportions).

#c. Compute the mean and median proportion voting “Remain” observed for the online and telephone polls. Compare both measures of center across the two groups.

mean(c(online$remain))

## [1] 0.4212941

median(c(online$remain))

## [1] 0.42

mean(c(telephone$remain))

## [1] 0.485

median(c(telephone$remain))

## [1] 0.48

Within each group, the mean and median are very close to one another. Comparing across groups, the online polls have a smaller mean and median than the telephone polls: the means differ by about 0.064 and the medians by 0.06.

#d. Compute and compare the standard deviation observed in the two groups.

sd(online$remain)

## [1] 0.03750518

sd(telephone$remain)

## [1] 0.04232568

The online polls have a smaller standard deviation (about 0.038 versus 0.042), indicating that their results are, on average, closer to their mean than the telephone polls' results are to theirs.

#e. Use R to help you create side by side boxplots of the two sets so they are easily comparable.

plot3=ggplot(data=online,aes(y=remain)) + 
  geom_boxplot(fill="darkorchid") + 
  ggtitle("Box Plot of Online-Remain")

plot4=ggplot(data=telephone,aes(y=remain)) + 
  geom_boxplot(fill="darkorchid") + 
  ggtitle("Box Plot of Telephone-Remain")

library(gridExtra)
grid.arrange(plot3, plot4, ncol=2)

#f. How many values were identified as outliers? Would these values have been identified as an outlier in the other type of poll? Use the 1.5IQR rule for identifying outliers.

For the online polls there are 3 outliers; for the telephone polls there is 1. These values would not have been identified as outliers if they had appeared in the other type of poll.
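
The 1.5IQR rule can be applied numerically as well as read off a boxplot. A minimal sketch: `iqr_outliers()` is a hypothetical helper written here, and the `example` vector is made-up data, not values from brexit.csv. Applying the helper to `online$remain` and `telephone$remain` would list the flagged polls, assuming those data frames are loaded.

```r
# Return the values of x that fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
iqr_outliers <- function(x) {
  q <- unname(quantile(x, probs = c(0.25, 0.75)))
  iqr <- q[2] - q[1]
  lower_fence <- q[1] - 1.5 * iqr
  upper_fence <- q[2] + 1.5 * iqr
  x[x < lower_fence | x > upper_fence]
}

# Made-up proportions for illustration only.
example <- c(0.40, 0.41, 0.42, 0.42, 0.43, 0.44, 0.55)
iqr_outliers(example)
```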

#h. What would be the mean and median proportion answering “Remain” if we combined the two poll types together? Show how one of these can be calculated directly from your summary measures in part (c).

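(85*0.4212941 + 42*0.485)/127

## [1] 0.4423622

The combined mean is the weighted average of the two group means from part (c), weighted by the number of polls in each group (85 online and 42 telephone), which gives about 0.442. Simply adding the two means or the two medians does not work: the sums (about 0.906 and 0.9) are larger than any proportion in either group. The combined median cannot be calculated from the two group medians alone. It has to be found from the raw data, for example with median(brexit$remain), and it must fall between the group medians of 0.42 and 0.48.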
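
A small sketch of how group summaries combine, using made-up values rather than the poll data. Under these assumptions, the pooled mean equals the size-weighted average of the group means, while two pairs of groups with identical medians can still have different pooled medians:

```r
# Made-up groups for illustration only.
group_a <- c(0.40, 0.42, 0.44)
group_b <- c(0.46, 0.50)

# Weighted average of the group means, weighted by group size.
weighted_mean <- (length(group_a) * mean(group_a) +
                  length(group_b) * mean(group_b)) /
                 (length(group_a) + length(group_b))
pooled_mean <- mean(c(group_a, group_b))
c(weighted_mean, pooled_mean)  # these agree

# group_c has the same median as group_a, but the pooled medians differ,
# so group medians alone do not determine the combined median.
group_c <- c(0.30, 0.42, 0.43)
pooled_median_ab <- median(c(group_a, group_b))
pooled_median_cb <- median(c(group_c, group_b))
c(pooled_median_ab, pooled_median_cb)
```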

#i. Next, calculate the mean proportion of respondents that answered “Leave” for both online and telephone polls. What other factor in the data can explain the much smaller gap between means here compared to part c? Explain.

mean(c(online$leave))

## [1] 0.4258824

mean(c(telephone$leave))

## [1] 0.415

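The gap between the mean “Leave” proportions for online and telephone polls is small because of the respondents who were undecided. In each poll the “Remain”, “Leave”, and undecided proportions together account for roughly all respondents, so the larger “Remain” gap from part (c) has to be offset somewhere else. Since the “Leave” means are nearly equal, this suggests the difference sits mostly in the undecided column, with online polls recording more undecided respondents than telephone polls. If no respondents had been undecided, the gap between telephone and online polls for “Leave” would be the same size as the gap for “Remain”.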

Andrew Blomfeldt

2/3/2020