Problem 1

Your first task is to download R and RStudio and install them on your computer. After you do so, you can download the file hw1.Rmd from the homework section on Compass. That file is this one, an R Markdown document. Markdown is a simple formatting syntax for authoring HTML documents (i.e., documents that can be viewed in any web browser). For more details on using R Markdown see http://rmarkdown.rstudio.com. You can complete this homework by filling in the rest of the .Rmd document.

When you click the Knit button in RStudio, a document will be generated that includes both the content and the output of any embedded R code chunks within the document.

# This line is a comment. Anything starting with "#" will not be executed by R
# Remove the "#" before the next line to make it execute. Then run **Knit** again.
# sessionInfo()

After you run Knit, you should see some additional information about the version of R you are running. Throughout the rest of this document, you will write small bits of R code in chunks, which will be displayed after you Knit your document. You can also try things out at the R prompt; you can copy and paste from the R Markdown document to the prompt and vice versa. Several of the problems have data you will need to use. These data sets print out as tables in the Knit version of the document and are available as variables that you can use to do your exploratory data analysis. Good luck!

Problem 2

Material   Weight  PercentRecycled  PercentWeight
Other        34.8              2.8           13.9
Wood         71.3             62.5           28.5
Trimmings    15.9             14.5            6.4
G+M          33.4             57.5           13.4
Food          8.6             16.3            3.4
R+P          33.9             28.0           13.6
Paper        51.9             10.9           20.8
  1. Make a bar graph of the percentage of each type by weight. You can order the bars from tallest to shortest by sorting the data:
garbage.sorted.by.weight <- garbage[order(garbage$PercentWeight), ]

Use the barplot function in R to make your plot. Don’t forget to label the sections. How can you tell R to use the rownames from your table as the labels in the bar plot? (See ?barplot for help.)

# place your bar plot here
barplot(garbage.sorted.by.weight$PercentWeight, main = "Bar Chart of Garbage by Percent Weight",
        xlab = "Materials", ylab = "Percent Weight", ylim = c(0, 35), col = 1:7,
        names.arg = garbage.sorted.by.weight$Material,  # labels must follow the sorted order
        legend.text = "R+P = Rubber and Plastics, G+M = Glass and Metals")
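
On the rownames question in the prompt: if the materials were stored as the row names of the data frame (an assumption; the table above keeps them in a Material column), they could be passed straight to names.arg. A minimal sketch:

# sketch: set the materials as row names, then reuse them as the bar labels
rownames(garbage) <- garbage$Material
garbage.sorted.by.weight <- garbage[order(garbage$PercentWeight), ]
barplot(garbage.sorted.by.weight$PercentWeight, names.arg = rownames(garbage.sorted.by.weight),
        main = "Bar Chart of Garbage by Percent Weight", xlab = "Materials", ylab = "Percent Weight")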

  2. Make a pie chart of each type by weight. See the help page at ?pie for information on how to use this function.

# place your pie chart here
pie(garbage.sorted.by.weight$PercentWeight,
    labels = garbage.sorted.by.weight$Material,  # labels must follow the sorted order
    main = "Pie Chart of Garbage by Percent Weight",
    col = c(8, 9, 10, 11, 12, 13, 14))  # numeric colours above 8 wrap around R's default 8-colour palette

  3. Now make a bar plot using the percentage of each type of material recycled. Again, sort your bar plot.
# place bar plot here
garbage.sorted.by.recycled <- garbage[order(garbage$PercentRecycled), ]
barplot(garbage.sorted.by.recycled$PercentRecycled, main = "Bar Chart of Percent of Garbage Recycled",
        xlab = "Materials", ylab = "Percent Recycled", ylim = c(0, 70), col = 1:7,
        names.arg = garbage.sorted.by.recycled$Material)  # take the labels from the sorted data rather than retyping them

  4. Why is it inappropriate to make a pie chart of the PercentRecycled column? How is this column different from the PercentWeight column?

It is inappropriate to make a pie chart of PercentRecycled because a pie chart only makes sense when the values are parts of a single whole. The PercentWeight values are each material's share of all the garbage by weight, so they add up to 100% and can be shown as slices of one pie. The PercentRecycled values each refer to a different whole, namely the fraction of that particular material that gets recycled, so they do not add up to 100% and the slices would not represent shares of anything. A bar chart is still fine here, because each bar is read against its own axis rather than as a share of a whole. That is the main difference between the PercentRecycled column and the PercentWeight column.
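
A quick numerical check of this reasoning (assuming the table above is stored in the data frame garbage, as in the earlier parts):

sum(garbage$PercentWeight)     # the weight shares add up to (essentially) 100
sum(garbage$PercentRecycled)   # well over 100, so these values are not shares of one whole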

Problem 3

Year Time
1972 190
1973 186
1974 167
1975 162
1976 167
1977 168
1978 165
1979 155
1980 154
1981 147
1982 150
1983 143
1984 149
1985 154
1986 145
1987 146
1988 145
1989 144
1990 145
1991 144
1992 144
1993 145
1994 142
1995 145
1996 147
1997 146
1998 143
1999 143
2000 146
2001 144
2002 141
2003 145
2004 144
2005 145
2006 143
2007 149
2008 145
2009 152
2010 146
2011 142
2012 151

The above data record the winning times (in minutes) for the women's race at the Boston Marathon over the period 1972 to 2012.

  1. Use the function summary on the Time column of the marathon data.frame.
# put your summary here
summary(marathon$Time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   141.0   144.0   146.0   150.6   152.0   190.0

What does this information tell you?

It tells me the fastest (141 minutes) and slowest (190 minutes) winning times over this period. It also shows the median winning time of 146 minutes (half of the years had a winning time at or below this value, half at or above). Lastly, it shows the first and third quartiles, which together give a general idea of how spread out the winning times are from year to year.
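
As a small follow-up sketch (not asked for by the problem), the spread suggested by those quartiles can be computed directly:

range(marathon$Time)   # fastest and slowest winning times
IQR(marathon$Time)     # width of the middle half of the winning times (Q3 - Q1)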

  2. Make a line graph of time (on the y-axis) versus year (on the x-axis). See the type argument for the function plot by typing ?plot at the prompt.
# put your plot here.
plot(marathon$Year, marathon$Time, type = "l", xlab = "Year", ylab = "Time")

  3. What pattern do you see in this graph?

The clearest pattern I see is that the winning times dropped sharply from the early 1970s to the early 1980s; in other words, the winners got much faster over that stretch. After that drop, the winning times have stayed relatively constant, mostly in the 140s.
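
One way to put numbers on that pattern (a sketch; the split year of 1982 is my own choice, not something given in the problem):

mean(marathon$Time[marathon$Year <= 1981])   # average winning time in the early, slower era
mean(marathon$Time[marathon$Year >= 1982])   # average winning time after the big drop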

  4. Propose a question you could answer using these data. What would your answer be?

What is the slowest winning time in the women's Boston Marathon between 1972 and 2012? The answer is 190 minutes, recorded in 1972.
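
This can be checked directly in R; which.max also shows which year it happened:

max(marathon$Time)                     # the slowest winning time
marathon[which.max(marathon$Time), ]   # the year that time was recorded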

Problem 4

Agricultural producers need to know not only averages (e.g., how much does a bag of potatoes weigh on average?) but also the amount of variation within bags (e.g., does this bag have too many big or small potatoes?).

# weights of 25 potatoes in ounces from a (nominal) 10lbs bag
potatoes <- c(7.6, 7.9, 8.0, 6.9, 6.7, 7.9, 7.9, 7.9, 7.6, 7.8, 7.0, 4.7, 7.6,
              6.3, 4.7, 4.7, 4.7, 6.3, 6.0, 5.3, 4.3, 7.9, 5.2, 6.0, 3.7)
  1. Compute the mean, median, and standard deviation of the data directly. You can use R to do the arithmetic, but show each step of your work. You may find the sum function helpful. You can check your work with the mean, median, and sd functions.

Calculating the mean:

potatoAverage <- sum(potatoes) / 25   # total weight of the 25 potatoes divided by n = 25

potatoAverage
## [1] 6.424

Calculating the median:

Note: if I were to calculate the median by hand, I would sort the 25 values from smallest to largest and then cross off 12 from each side until I am left with the middle (13th) value of the distribution; that step is sketched after the table below.

potatoes.sorted.by.weight <- sort(potatoes)   # equivalent to potatoes[order(potatoes)]
table(potatoes.sorted.by.weight)
## potatoes.sorted.by.weight
## 3.7 4.3 4.7 5.2 5.3   6 6.3 6.7 6.9   7 7.6 7.8 7.9   8 
##   1   1   4   1   1   2   2   1   1   1   3   1   5   1
median(potatoes.sorted.by.weight)
## [1] 6.7
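
The "cross off twelve from each side" procedure amounts to picking the 13th of the 25 sorted values, which can be sketched as:

# with n = 25 (odd), the middle value is the (25 + 1) / 2 = 13th sorted weight
potatoes.sorted.by.weight[(length(potatoes) + 1) / 2]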

Calculating the Standard Deviation:

The standard deviation is the square root of the sum of the squared deviations from the mean divided by n - 1, i.e. sd = sqrt( sum((x - xbar)^2) / (n - 1) ). By hand, I would first take the deviation of each weight from the mean and square it, for example (7.6 - potatoAverage)^2, (7.9 - potatoAverage)^2, (8.0 - potatoAverage)^2, and so on, where the potatoAverage computed earlier plays the role of xbar. I would then sum all of those squared deviations, divide the sum by 24 (n = 25, so n - 1 = 24), and finally take the square root of that result. The same steps are carried out in R below, followed by a check with the built-in sd function.
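
A minimal sketch of that calculation, reusing the potatoAverage value from above:

# step 1: squared deviation of each weight from the mean
squared.deviations <- (potatoes - potatoAverage)^2
# step 2: sum the squared deviations and divide by n - 1 = 24 to get the sample variance
potatoVariance <- sum(squared.deviations) / (length(potatoes) - 1)
# step 3: take the square root to get the standard deviation
sqrt(potatoVariance)   # should agree with sd(potatoes) below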

sd(potatoes)
## [1] 1.399786
  2. Do you think your numerical summaries do an effective job of describing these data? Why or why not?

I think the mean, median, and standard deviation do a good job of describing this data set. The median tells me that half of the potatoes weigh less than its value and half weigh more. The standard deviation of about 1.4 ounces tells me how far, on average, a potato's weight is from the mean. Taken together, these three summaries give a good sense of what to expect from a sample of 25 potatoes: we have an idea of both the center and the spread of the distribution.

  3. There appear to be two distinct clusters of weights for these potatoes. Divide the data into two subsamples and provide the mean and standard deviation for each subsample (you may use the pre-made R functions this time). Here is some R code to get you started:
group1 <- potatoes[potatoes < 7]
group2 <- potatoes[potatoes >= 7]

group1mean = mean(group1)
group1sd = sd(group1)

group2mean = mean(group2)
group2sd = sd(group2)

group1mean
## [1] 5.392857
group1sd
## [1] 0.9762284
group2mean
## [1] 7.736364
group2sd
## [1] 0.2838053

Do you think that this way of summarizing the data is better than using all the data? Give a reason for your answer.

Yes. After looking at the two subsamples, there are clear differences that make this way of summarizing the data better. The first group has a much lower mean (about 5.39 ounces) and a much larger standard deviation (about 0.976), while the second group has a higher mean (about 7.74) and a much smaller standard deviation (about 0.284). The two groups are so different that mixing them together gives summary numbers that do not really tell the true story; for example, a new bag of potatoes could lean toward the heavier cluster or toward the lighter one. Grouping everything together makes the overall mean, median, and standard deviation less meaningful.
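
A histogram (a quick sketch, not required by the problem) makes the two clusters visible:

# the weights split into a lighter group (below about 7 oz) and a heavier group
hist(potatoes, main = "Histogram of Potato Weights", xlab = "Weight (ounces)")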

Problem 5

C-reactive protein (CRP) is a substance that can be measured in the blood, and elevated levels are linked to disease. The following data represent a sample of apparently healthy children.

crp <- c(0, 3.9, 5.64, 8.22, 0, 5.62, 3.92, 6.81, 30.61, 0, 73.2, 0, 
46.7, 0, 0, 26.41, 22.82, 0, 0, 3.49, 0, 0, 4.81, 9.57, 5.36, 
0, 5.66, 0, 59.76, 12.38, 15.74, 0, 0, 0, 0, 9.37, 20.78, 7.1, 
7.89, 5.53)
  1. Find the five number summary for these data.
summary(crp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   5.085  10.030   9.420  73.200
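
As an aside, summary() also reports the mean; the classical five-number summary by itself can be obtained with fivenum():

fivenum(crp)   # minimum, lower hinge, median, upper hinge, maximum
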
  2. Make a box plot for these data (hint: use the boxplot function).
boxplot(crp, range = 1.5, col = "bisque", horizontal = TRUE)

  3. Make a histogram (hint: use the hist function).
hist(crp, col = "bisque", main = "Histogram of CRP in Healthy Children", xlab = "CRP", ylim = c(0,35))

  4. Write a short summary of the major features of this distribution. Are there any outliers in the data? If so, why do you think so?

This distribution is strongly right-skewed. A large share of the children (the entire first quartile) have a CRP value of 0 and most of the remaining values are small, but a handful of children have much larger values. A few possible outliers appear starting around a CRP value of 30, and one clear outlier sits between 60 and 80. Because of these large values, the mean (about 10) is roughly double the median (about 5).
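
The outlier impression can be checked against the 1.5 * IQR rule that the box plot uses; boxplot.stats reports the points plotted beyond the whiskers (a sketch):

boxplot.stats(crp)$out   # values flagged beyond the whiskers of the box plot above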