Problem 1

Your first task is to download R and RStudio and install them on your computer. After you do so, you can download the file hw1.Rmd from the homework section on Compass. That file is this one, an R Markdown document. Markdown is a simple formatting syntax for authoring HTML documents (i.e., documents that can be viewed in any web browser). For more details on using R Markdown see http://rmarkdown.rstudio.com. You can complete this homework by filling in the rest of the .Rmd document.

When you click the Knit button in RStudio, a document will be generated that includes both the content and the output of any embedded R code chunks within the document.

# This line is a comment. Anything starting with "#" will not be executed by R
# Remove the "#" before the next line to make it execute. Then run **Knit** again.
# sessionInfo()

After you run Knit, you should see some additional information about the version of R you are running. Throughout the rest of this document, you will write small bits of R code in chunks, which will be displayed after you Knit your document. You can also try things out at the R prompt; you can copy and paste from the R Markdown document to the prompt and vice versa. Several of the problems have data you will need to use. These data sets print out as tables in the Knit version of the document and are available as variables that you can use to do your exploratory data analysis. Good luck!

Problem 2

Material   Weight  PercentRecycled  PercentWeight
Other        34.8              2.8           13.9
Wood         71.3             62.5           28.5
Trimmings    15.9             14.5            6.4
G+M          33.4             57.5           13.4
Food          8.6             16.3            3.4
R+P          33.9             28.0           13.6
Paper        51.9             10.9           20.8
  1. Make a bar graph of the percentage of each type by weight. You can order the bars from tallest to shortest by sorting the data:
garbage.sorted.by.weight <- garbage[order(garbage$PercentWeight), ]

Use the barplot function in R to make your plot. Don’t forget to label the sections. How can you tell R to use the rownames from your table as the labels in the bar plot? (See ?barplot for help.)

# place your bar plot here
barplot(garbage.sorted.by.weight$PercentWeight, main = "Bar Chart of Garbage by Percent Weight",
        xlab = "Materials", ylab = "Percent Weight", ylim = c(0, 35), col = 1:7,
        names.arg = garbage.sorted.by.weight$Material,  # labels must follow the sorted order
        legend.text = "R+P = Rubber and Plastics, G+M = Glass and Metals")
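
On the rownames question in the prompt: if the materials were stored as the row names of the data frame (an assumption; the table above keeps them in a Material column), they could be passed straight to names.arg. A minimal sketch:

# sketch: set the materials as row names, then reuse them as the bar labels
rownames(garbage) <- garbage$Material
garbage.sorted.by.weight <- garbage[order(garbage$PercentWeight), ]
barplot(garbage.sorted.by.weight$PercentWeight, names.arg = rownames(garbage.sorted.by.weight),
        main = "Bar Chart of Garbage by Percent Weight", xlab = "Materials", ylab = "Percent Weight")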

  2. Make a pie chart of each type by weight. See the help page at ?pie for information on how to use this function.

# place your pie chart here
pie(garbage.sorted.by.weight$PercentWeight,
    labels = garbage.sorted.by.weight$Material,  # labels must follow the sorted order
    main = "Pie Chart of Garbage by Percent Weight",
    col = c(8, 9, 10, 11, 12, 13, 14))  # numeric colours above 8 wrap around R's default 8-colour palette

  3. Now make a bar plot using the percentage of each type of material recycled. Again, sort your bar plot.
# place bar plot here
garbage.sorted.by.recycled <- garbage[order(garbage$PercentRecycled), ]
barplot(garbage.sorted.by.recycled$PercentRecycled, main = "Bar Chart of Percent of Garbage Recycled",
        xlab = "Materials", ylab = "Percent Recycled", ylim = c(0, 70), col = 1:7,
        names.arg = garbage.sorted.by.recycled$Material)  # take the labels from the sorted data rather than retyping them

  4. Why is it inappropriate to make a pie chart of the PercentRecycled column? How is this column different from the PercentWeight column?

It is inappropriate to make a pie chart of PercentRecycled because a pie chart only makes sense when the values are parts of a single whole. The PercentWeight values are each material's share of all the garbage by weight, so they add up to 100% and can be shown as slices of one pie. The PercentRecycled values each refer to a different whole, namely the fraction of that particular material that gets recycled, so they do not add up to 100% and the slices would not represent shares of anything. A bar chart is still fine here, because each bar is read against its own axis rather than as a share of a whole. That is the main difference between the PercentRecycled column and the PercentWeight column.
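
A quick numerical check of this reasoning (assuming the table above is stored in the data frame garbage, as in the earlier parts):

sum(garbage$PercentWeight)     # the weight shares add up to (essentially) 100
sum(garbage$PercentRecycled)   # well over 100, so these values are not shares of one whole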

Problem 3

Year Time
1972 190
1973 186
1974 167
1975 162
1976 167
1977 168
1978 165
1979 155
1980 154
1981 147
1982 150
1983 143
1984 149
1985 154
1986 145
1987 146
1988 145
1989 144
1990 145
1991 144
1992 144
1993 145
1994 142
1995 145
1996 147
1997 146
1998 143
1999 143
2000 146
2001 144
2002 141
2003 145
2004 144
2005 145
2006 143
2007 149
2008 145
2009 152
2010 146
2011 142
2012 151

The above data record the winning times (in minutes) for the women's race at the Boston Marathon over the period 1972 to 2012.

  1. Use the function summary on the Time column of the marathon data.frame.
# put your summary here
summary(marathon$Time)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   141.0   144.0   146.0   150.6   152.0   190.0

What does this information tell you?

It tells me the fastest (141 minutes) and slowest (190 minutes) winning times over this period. It also shows the median winning time of 146 minutes (half of the years had a winning time at or below this value, half at or above). Lastly, it shows the first and third quartiles, which together give a general idea of how spread out the winning times are from year to year.
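
As a small follow-up sketch (not asked for by the problem), the spread suggested by those quartiles can be computed directly:

range(marathon$Time)   # fastest and slowest winning times
IQR(marathon$Time)     # width of the middle half of the winning times (Q3 - Q1)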

  2. Make a line graph of time (on the y-axis) versus year (on the x-axis). See the type argument for the function plot by typing ?plot at the prompt.
# put your plot here.
plot(marathon$Year, marathon$Time, type = "l", xlab = "Year", ylab = "Time")

  3. What pattern do you see in this graph?

The clearest pattern I see is that the winning times dropped sharply from the early 1970s to the early 1980s; in other words, the winners got much faster over that stretch. After that drop, the winning times have stayed relatively constant, mostly in the 140s.
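
One way to put numbers on that pattern (a sketch; the split year of 1982 is my own choice, not something given in the problem):

mean(marathon$Time[marathon$Year <= 1981])   # average winning time in the early, slower era
mean(marathon$Time[marathon$Year >= 1982])   # average winning time after the big drop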

  4. Propose a question you could answer using these data. What would your answer be?

What is the slowest winning time in the women's Boston Marathon between 1972 and 2012? The answer is 190 minutes, recorded in 1972.
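
This can be checked directly in R; which.max also shows which year it happened:

max(marathon$Time)                     # the slowest winning time
marathon[which.max(marathon$Time), ]   # the year that time was recorded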

Problem 4

Agricultural producers need to know not only averages (e.g., how much does a bag of potatoes weigh on average?) but also the amount of variation within bags (e.g., does this bag have too many big or small potatoes?).

# weights of 25 potatoes in ounces from a (nominal) 10lbs bag
potatoes <- c(7.6, 7.9, 8.0, 6.9, 6.7, 7.9, 7.9, 7.9, 7.6, 7.8, 7.0, 4.7, 7.6,
              6.3, 4.7, 4.7, 4.7, 6.3, 6.0, 5.3, 4.3, 7.9, 5.2, 6.0, 3.7)
  1. Compute the mean, median, and standard deviation of the data directly. You can use R to do the arithmetic, but show each step of your work. You may find the sum function helpful. You can check your work with the mean, median, and sd functions.

Calculating the mean:

potatoAverage <- sum(potatoes) / 25   # total weight of the 25 potatoes divided by n = 25

potatoAverage
## [1] 6.424

Calculating the median:

Note: if I were to calculate the median by hand, I would sort the 25 values from smallest to largest and then cross off 12 from each side until I am left with the middle (13th) value of the distribution; that step is sketched after the table below.

potatoes.sorted.by.weight <- sort(potatoes)   # equivalent to potatoes[order(potatoes)]
table(potatoes.sorted.by.weight)
## potatoes.sorted.by.weight
## 3.7 4.3 4.7 5.2 5.3   6 6.3 6.7 6.9   7 7.6 7.8 7.9   8 
##   1   1   4   1   1   2   2   1   1   1   3   1   5   1
median(potatoes.sorted.by.weight)
## [1] 6.7
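
The "cross off twelve from each side" procedure amounts to picking the 13th of the 25 sorted values, which can be sketched as:

# with n = 25 (odd), the middle value is the (25 + 1) / 2 = 13th sorted weight
potatoes.sorted.by.weight[(length(potatoes) + 1) / 2]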

Calculating the Standard Deviation:

The standard deviation is the square root of the sum of the squared deviations from the mean divided by n - 1, i.e. sd = sqrt( sum((x - xbar)^2) / (n - 1) ). By hand, I would first take the deviation of each weight from the mean and square it, for example (7.6 - potatoAverage)^2, (7.9 - potatoAverage)^2, (8.0 - potatoAverage)^2, and so on, where the potatoAverage computed earlier plays the role of xbar. I would then sum all of those squared deviations, divide the sum by 24 (n = 25, so n - 1 = 24), and finally take the square root of that result. The same steps are carried out in R below, followed by a check with the built-in sd function.
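
A minimal sketch of that calculation, reusing the potatoAverage value from above:

# step 1: squared deviation of each weight from the mean
squared.deviations <- (potatoes - potatoAverage)^2
# step 2: sum the squared deviations and divide by n - 1 = 24 to get the sample variance
potatoVariance <- sum(squared.deviations) / (length(potatoes) - 1)
# step 3: take the square root to get the standard deviation
sqrt(potatoVariance)   # should agree with sd(potatoes) below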

sd(potatoes)
## [1] 1.399786
  2. Do you think your numerical summaries do an effective job of describing these data? Why or why not?

I think the mean, median, and standard deviation do a good job of describing this data set. The median tells me that half of the potatoes weigh less than its value and half weigh more. The standard deviation of about 1.4 ounces tells me how far, on average, a potato's weight is from the mean. Taken together, these three summaries give a good sense of what to expect from a sample of 25 potatoes: we have an idea of both the center and the spread of the distribution.

  3. There appear to be two distinct clusters of weights for these potatoes. Divide the data into two subsamples and provide the mean and standard deviation for each subsample (you may use the pre-made R functions this time). Here is some R code to get you started:
group1 <- potatoes[potatoes < 7]
group2 <- potatoes[potatoes >= 7]

group1mean = mean(group1)
group1sd = sd(group1)

group2mean = mean(group2)
group2sd = sd(group2)

group1mean
## [1] 5.392857
group1sd
## [1] 0.9762284
group2mean
## [1] 7.736364
group2sd
## [1] 0.2838053

Do you think that this way of summarizing the data is better than using all the data? Give a reason for your answer.

Yes. After looking at the two subsamples, there are clear differences that make this way of summarizing the data better. The first group has a much lower mean (about 5.39 ounces) and a much larger standard deviation (about 0.976), while the second group has a higher mean (about 7.74) and a much smaller standard deviation (about 0.284). The two groups are so different that mixing them together gives summary numbers that do not really tell the true story; for example, a new bag of potatoes could lean toward the heavier cluster or toward the lighter one. Grouping everything together makes the overall mean, median, and standard deviation less meaningful.
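
A histogram (a quick sketch, not required by the problem) makes the two clusters visible:

# the weights split into a lighter group (below about 7 oz) and a heavier group
hist(potatoes, main = "Histogram of Potato Weights", xlab = "Weight (ounces)")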

Problem 5

C-reactive protein (CRP) is a substance that can be measured in the blood, and elevated levels are linked to disease. The following data represent a sample of apparently healthy children.

crp <- c(0, 3.9, 5.64, 8.22, 0, 5.62, 3.92, 6.81, 30.61, 0, 73.2, 0, 
46.7, 0, 0, 26.41, 22.82, 0, 0, 3.49, 0, 0, 4.81, 9.57, 5.36, 
0, 5.66, 0, 59.76, 12.38, 15.74, 0, 0, 0, 0, 9.37, 20.78, 7.1, 
7.89, 5.53)
  1. Find the five number summary for these data.
summary(crp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   5.085  10.030   9.420  73.200
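
As an aside, summary() also reports the mean; the classical five-number summary by itself can be obtained with fivenum():

fivenum(crp)   # minimum, lower hinge, median, upper hinge, maximum
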
  2. Make a box plot for these data (hint: use the boxplot function).
boxplot(crp, range = 1.5, col = "bisque", horizontal = TRUE)

  3. Make a histogram (hint: use the hist function).
hist(crp, col = "bisque", main = "Histogram of CRP in Healthy Children", xlab = "CRP", ylim = c(0,35))

  4. Write a short summary of the major features of this distribution. Are there any outliers in the data? If so, why do you think so?

This distribution is strongly right-skewed. A large share of the children (the entire first quartile) have a CRP value of 0 and most of the remaining values are small, but a handful of children have much larger values. A few possible outliers appear starting around a CRP value of 30, and one clear outlier sits between 60 and 80. Because of these large values, the mean (about 10) is roughly double the median (about 5).
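
The outlier impression can be checked against the 1.5 * IQR rule that the box plot uses; boxplot.stats reports the points plotted beyond the whiskers (a sketch):

boxplot.stats(crp)$out   # values flagged beyond the whiskers of the box plot above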