Your first task is to download R and RStudio and install them on your computer. After you do so, you can download the file hw1.Rmd from the homework section on Compass. That is this file, which is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML documents (i.e., they can be viewed in any web browser). For more details on using R Markdown, see http://rmarkdown.rstudio.com. You can complete this homework by filling in the rest of the .Rmd document.
When you click the Knit button in RStudio, a document is generated that includes both the content and the output of any embedded R code chunks within the document.
# This line is a comment. Anything starting with "#" will not be executed by R.
# Remove the "#" before the next line to make it execute. Then run **Knit** again.
# sessionInfo()
After you run Knit, you should see some additional information about the version of R you are running. Throughout the rest of this document, you will write small bits of R code in chunks, which will be displayed after you Knit your document. You can also try things out at the R prompt: you can copy and paste from the R Markdown document to the prompt and vice versa. Several of the problems have data you will need to use. These data sets print out as tables in the Knit version of the document and are available as variables that you can use for your exploratory data analysis. Good luck!
Material | Weight | PercentRecycled | PercentWeight |
---|---|---|---|
Other | 34.8 | 2.8 | 13.9 |
Wood | 71.3 | 62.5 | 28.5 |
Trimmings | 15.9 | 14.5 | 6.4 |
G+M | 33.4 | 57.5 | 13.4 |
Food | 8.6 | 16.3 | 3.4 |
R+P | 33.9 | 28.0 | 13.6 |
Paper | 51.9 | 10.9 | 20.8 |
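For reference, here is a sketch of how a `garbage` data frame matching the table above could be constructed. The homework file presumably provides this variable already, so treat the exact column names as an assumption.

```r
# assumed reconstruction of the garbage data set from the table above
garbage <- data.frame(
  Material        = c("Other", "Wood", "Trimmings", "G+M", "Food", "R+P", "Paper"),
  Weight          = c(34.8, 71.3, 15.9, 33.4, 8.6, 33.9, 51.9),
  PercentRecycled = c(2.8, 62.5, 14.5, 57.5, 16.3, 28.0, 10.9),
  PercentWeight   = c(13.9, 28.5, 6.4, 13.4, 3.4, 13.6, 20.8)
)
```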
garbage.sorted.by.weight <- garbage[order(garbage$PercentWeight), ]
Use the `barplot` function in R to make your plot. Don't forget to label the sections. How can you tell R to use the rownames from your table as the labels in the bar plot? (See `?barplot` for help.)
# place your bar plot chart here
barplot(garbage.sorted.by.weight$PercentWeight,
        main = "Bar Chart of Garbage by Percent Weight",
        xlab = "Materials", ylab = "Percent Weight",
        names.arg = garbage.sorted.by.weight$Material,  # labels must match the sorted order
        ylim = c(0, 35), col = 1:7,
        legend.text = "R+P = Rubber and Plastics, G+M = Glass and Metals")
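The question above asks about row names specifically. If `Material` were stored as row names rather than as a column (an assumption; the sketch above keeps it as a column), `rownames` would supply the labels:

```r
# alternative: move Material into the row names and label bars from rownames()
garbage2 <- garbage.sorted.by.weight
rownames(garbage2) <- garbage2$Material
barplot(garbage2$PercentWeight, names.arg = rownames(garbage2),
        xlab = "Materials", ylab = "Percent Weight")
```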
2. Make a pie chart of each type by weight. See the help page at `?pie` for information on how to use this function.
# place your pie chart here
pie(garbage.sorted.by.weight$PercentWeight,
    labels = garbage.sorted.by.weight$Material,  # labels must match the sorted order
    main = "Pie Chart of Garbage by Percent Weight", col = 1:7)
# place bar plot here
garbage.sorted.by.recycled <- garbage[order(garbage$PercentRecycled), ]
barplot(garbage.sorted.by.recycled$PercentRecycled,
        main = "Bar Chart of Percent of Garbage Recycled",
        xlab = "Materials", ylab = "Percent Recycled",
        names.arg = garbage.sorted.by.recycled$Material,  # safer than hard-coding the label order
        ylim = c(0, 70), col = 1:7)
Why is it inappropriate to make a pie chart of the `PercentRecycled` column? How is this column different from the `PercentWeight` column?

It is inappropriate to make a pie chart of `PercentRecycled` because its values are not parts of a whole. Each entry is the percentage of that one material's weight that gets recycled, so the column sums to far more than 100%. A pie chart should only be used when the slices are shares of a single total. The `PercentWeight` column is different: each entry is that material's share of the total garbage weight, so the values sum to 100% and a pie chart is appropriate there. A bar chart, by contrast, simply compares the recycling rates side by side without implying that they form a whole, so it works fine for `PercentRecycled`. That is the main difference between the two columns.
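A quick numeric check makes the difference concrete (using the `garbage` data frame sketched above):

```r
# PercentWeight describes shares of one whole, so it sums to 100;
# PercentRecycled is a separate rate for each material, so it does not
sum(garbage$PercentWeight)     # 100.0
sum(garbage$PercentRecycled)   # 192.5
```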
Year | Time |
---|---|
1972 | 190 |
1973 | 186 |
1974 | 167 |
1975 | 162 |
1976 | 167 |
1977 | 168 |
1978 | 165 |
1979 | 155 |
1980 | 154 |
1981 | 147 |
1982 | 150 |
1983 | 143 |
1984 | 149 |
1985 | 154 |
1986 | 145 |
1987 | 146 |
1988 | 145 |
1989 | 144 |
1990 | 145 |
1991 | 144 |
1992 | 144 |
1993 | 145 |
1994 | 142 |
1995 | 145 |
1996 | 147 |
1997 | 146 |
1998 | 143 |
1999 | 143 |
2000 | 146 |
2001 | 144 |
2002 | 141 |
2003 | 145 |
2004 | 144 |
2005 | 145 |
2006 | 143 |
2007 | 149 |
2008 | 145 |
2009 | 152 |
2010 | 146 |
2011 | 142 |
2012 | 151 |
The above data record the winning times, in minutes, for women in the Boston Marathon during the period 1972 to 2012.
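As with the garbage data, here is a sketch of how a `marathon` data frame matching the table could be built; the homework file presumably defines this variable, so the column names are an assumption.

```r
# assumed reconstruction of the marathon data set from the table above
marathon <- data.frame(
  Year = 1972:2012,
  Time = c(190, 186, 167, 162, 167, 168, 165, 155, 154, 147, 150, 143,
           149, 154, 145, 146, 145, 144, 145, 144, 144, 145, 142, 145,
           147, 146, 143, 143, 146, 144, 141, 145, 144, 145, 143, 149,
           145, 152, 146, 142, 151)
)
```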
Run `summary` on the `Time` column of the `marathon` data frame.

# put your summary here
summary(marathon$Time)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 141.0 144.0 146.0 150.6 152.0 190.0
What does this information tell you?
It shows the fastest (141 minutes) and slowest (190 minutes) winning times. It also shows the median winning time (146 minutes): half of the winning times were faster than this and half slower. Lastly, it shows the first and third quartiles, which together give a general idea of how the winning times are spread around the center.
You can read about the `type` argument for the function `plot` by typing `?plot` at the prompt.

# put your plot here.
plot(marathon$Year, marathon$Time, type = "l", xlab = "Year", ylab = "Time (minutes)")
One clear pattern is that the winning times dropped sharply from the early 1970s to the early 1980s; that is, the women winning the Boston Marathon got much faster over that span. After this drop, the winning times have stayed relatively constant.
What is the slowest winning time recorded by women running the Boston Marathon between 1972 and 2012? The answer is 190 minutes, recorded in 1972.
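This can be confirmed directly in R (using the `marathon` data frame from above):

```r
# find the slowest winning time and the year it occurred
marathon[which.max(marathon$Time), ]   # 1972, 190 minutes
```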
Agricultural producers need to know not only averages (e.g., how much does a bag of potatoes weigh on average?) but also the amount of variation within bags (e.g., does this bag have too many big or small potatoes?).
# weights of 25 potatoes in ounces from a (nominal) 10 lb bag
potatoes <- c(7.6, 7.9, 8.0, 6.9, 6.7, 7.9, 7.9, 7.9, 7.6, 7.8, 7.0, 4.7, 7.6,
6.3, 4.7, 4.7, 4.7, 6.3, 6.0, 5.3, 4.3, 7.9, 5.2, 6.0, 3.7)
You may find the `sum` function helpful. You can check your work with the `mean`, `median`, and `sd` functions.

Calculating the mean:
potatoAverage <- sum(potatoes) / length(potatoes)  # sum of the weights divided by n = 25
potatoAverage
## [1] 6.424
Calculating the median:
Note: to calculate the median by hand, I would sort the weights from smallest to largest and then cross off 12 values from each end, leaving the middle (13th) value of the potato distribution.
potatoes.sorted.by.weight <- potatoes[order(potatoes)]
table(potatoes.sorted.by.weight)
## potatoes.sorted.by.weight
## 3.7 4.3 4.7 5.2 5.3 6 6.3 6.7 6.9 7 7.6 7.8 7.9 8
## 1 1 4 1 1 2 2 1 1 1 3 1 5 1
median(potatoes.sorted.by.weight)
## [1] 6.7
Calculating the standard deviation:

The sample standard deviation is $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$. First, compute the squared deviation of each weight from the mean, for example: standardDev = (7.6 - potatoAverage)^2 + (7.9 - potatoAverage)^2 + (8.0 - potatoAverage)^2 + … (note that potatoAverage, calculated earlier, plays the role of $\bar{x}$ here). After summing all 25 squared deviations, divide the sum by $n - 1 = 24$ (since $n = 25$), and then take the square root of the result. This yields the standard deviation.
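As a check, the same by-hand steps can be carried out in R (a short sketch; the built-in `sd(potatoes)` below should give the same answer):

```r
# by-hand sample standard deviation:
# sum the squared deviations, divide by n - 1, then take the square root
n <- length(potatoes)                    # n = 25
sqDev <- (potatoes - potatoAverage)^2    # squared deviation of each weight from the mean
sqrt(sum(sqDev) / (n - 1))               # should match sd(potatoes): 1.399786
```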
sd(potatoes)
## [1] 1.399786
I think the mean, median, and standard deviation together do a good job of describing this data set. For example, the median tells me that 50% of the potato weights fall below its value and 50% above. The standard deviation, which equals about 1.4 ounces, tells me how far a potato's weight typically is from the mean. Combining all three of these summaries gives a good sense of what to expect from a sample of 25 random potatoes: we have an idea of both the center and the spread of the distribution.
group1 <- potatoes[potatoes < 7]
group2 <- potatoes[potatoes >= 7]
group1mean = mean(group1)
group1sd = sd(group1)
group2mean = mean(group2)
group2sd = sd(group2)
group1mean
## [1] 5.392857
group1sd
## [1] 0.9762284
group2mean
## [1] 7.736364
group2sd
## [1] 0.2838053
Do you think that this way of summarizing the data is better than using all the data? Give a reason for your answer.
Yes. After looking at the two groups, there are clear differences that make this method better. The first group has a much lower mean (5.39) but a much higher standard deviation (0.98). The second group has a much higher mean (7.74) but a much lower standard deviation (0.28). The two sets are so different that mixing them together gives a summary that does not really tell the true story. For example, if you receive a new batch of random potatoes, they could all be on the heavier side or all on the lighter side. Pooling the two groups makes the summaries (mean, median, standard deviation) less meaningful.
C-reactive protein (CRP) is a substance that can be measured in the blood, and elevated levels are linked to disease. The following data represent a sample of apparently healthy children.
crp <- c(0, 3.9, 5.64, 8.22, 0, 5.62, 3.92, 6.81, 30.61, 0, 73.2, 0,
46.7, 0, 0, 26.41, 22.82, 0, 0, 3.49, 0, 0, 4.81, 9.57, 5.36,
0, 5.66, 0, 59.76, 12.38, 15.74, 0, 0, 0, 0, 9.37, 20.78, 7.1,
7.89, 5.53)
summary(crp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 5.085 10.030 9.420 73.200
Make a boxplot of the data (use the `boxplot` function).

boxplot(crp, range = 1.5, col = "bisque", horizontal = TRUE)
Make a histogram of the data (use the `hist` function).

hist(crp, col = "bisque", main = "Histogram of CRP in Healthy Children", xlab = "CRP", ylim = c(0, 35))
This data set is clearly right-skewed. The single most common CRP value is 0 (16 of the 40 children). A few possible outliers begin at around a CRP value of 30, and one clear outlier sits on the far right, between 60 and 80. Consistent with the skew, the median (5.085) is much smaller than the mean (10.03), which is pulled upward by the outliers.
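One more quick check of how common the zeros are (using the `crp` vector defined above):

```r
# count and proportion of children with a CRP value of exactly 0
sum(crp == 0)    # 16 of the 40 children
mean(crp == 0)   # proportion: 0.4
```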