Exercise 1

R (10 points)

Part 1

Write a function that accepts a vector, a vector of integers, a main axis label and an x axis label. This function should iterate over each element in the vector of integers and produce a histogram for each integer value, setting the bin count to the element in the input vector, and labeling main and x-axis with the specified parameters. You should label the y-axis to read Frequency, bins = and the number of bins.

Hint: You can simplify this function by using the parameter … - see ?plot or ?hist

Load all required packages

require(dplyr)
require(ggplot2)
require(reshape2)
require(readr)

Load Data into R

hidalgo <- read.csv("C:/Users/Jeremiah Lowhorn/Desktop/STAT 700/Final/hidalgo.dat")

Create the plot.histograms function

plot.histograms <- function(x, bins, main, xlab){
  
  #where x is a data.frame, bins is a numeric vector, main is a text string, and xlab is a text string
  #build one ggplot per bin count and return the plots as a list, so any number of bin counts works
  
  lapply(bins, function(b){
    
    #set the aesthetic parameter x to the first column of the x data.frame
    
    ggplot(x, aes(x = x[[1]])) + 
      
      #create a histogram layer with b bins, a light blue fill, and a black outline
      
      geom_histogram(bins = b,
                     fill = "lightblue",
                     color = 'black') +
      
      #create the labels layer: title from main, x from xlab, and a y label that reports the bin count
      
      labs(title = main,
           x = xlab,
           y = paste0("Frequency, bins = ", b))
  })
  
}
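
The hint in the prompt points at the `...` parameter. As a minimal sketch of that idea, and not the solution used in this report, a base-R version can forward `main`, `xlab`, and any other graphical arguments straight to `hist()`. The name `plot_histograms_base` and the simulated `thickness` vector are assumptions made for illustration only.

```r
# Hypothetical base-R alternative; not part of the original solution.
plot_histograms_base <- function(x, bins, ...) {
  # x is a numeric vector, bins is a vector of requested bin counts
  invisible(lapply(bins, function(b) {
    # a single number for breaks is only a suggestion; hist() may round it to "pretty" breaks
    hist(x, breaks = b, ylab = paste0("Frequency, bins = ", b), ...)
  }))
}

set.seed(1)
# simulated stand-in for the stamp thickness measurements
thickness <- rnorm(200, mean = 0.08, sd = 0.01)
result <- plot_histograms_base(thickness, c(12, 36, 60),
                               main = "1872 Hidalgo issue",
                               xlab = "Thickness (mm)")
```

Because every extra argument travels through `...`, the function does not need its own `main` or `xlab` parameters.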

Part 2

Test your function with the hidalgo data set (see below), using bin numbers 12, 36, and 60. You should be able to call your function with something like

plot.histograms(hidalgo.dat[,1],c(12,36,60), main="1872 Hidalgo issue",xlab= "Thickness (mm)")

to plot three different histograms of the hidalgo data set.

Plot the histograms with the given parameters for the exercise.

plot.histograms(hidalgo,bins=c(12,36,60), main="1872 Hidalgo issue",xlab= "Thickness (mm)")[[1]]

plot.histograms(hidalgo,bins=c(12,36,60), main="1872 Hidalgo issue",xlab= "Thickness (mm)")[[2]]

plot.histograms(hidalgo,bins=c(12,36,60), main="1872 Hidalgo issue",xlab= "Thickness (mm)")[[3]]

Part 3 (Either R or SAS) (8 points)

Produce a QQ-normal plot for the hidalgo data. You don’t need to modify titles or axes, just the default plot.

What might be an advantage of QQ-normal plots versus histograms? What might be an advantage of histograms?

Answer: A Q-Q plot is a good way to see whether the data are normally distributed without the drawback of having to choose a number of bins. The points on a Q-Q plot fall on a straight line if the data are perfectly normally distributed. The histogram has the advantage that it is easier to recognize different types of distributions visually. In our sample histograms the distribution of the data more closely resembles a Weibull distribution than a normal distribution, which cannot be determined using the Q-Q plot.

#Use ggplot to plot hidalgo with stat_qq as the plot type. Within stat qq set the sample as the variable X.060
ggplot(hidalgo) +
  stat_qq(aes(sample = X.060)) +
  labs(title = 'QQ Plot of Hidalgo Data')
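
To illustrate the "straight line" reading of a Q-Q plot, here is a small sketch on simulated data (not the hidalgo measurements). It uses the correlation between theoretical and sample quantiles as a rough straightness score; the helper `qq_straightness` is a hypothetical name introduced for this example, and the score is only an informal check, not a formal normality test.

```r
set.seed(42)
normal_sample <- rnorm(500)
# simulated two-component mixture, standing in for multi-batch data
mixture_sample <- c(rnorm(250, mean = 0), rnorm(250, mean = 6))

# correlation between theoretical and sample quantiles;
# values near 1 mean the Q-Q points lie close to a straight line
qq_straightness <- function(x) {
  qq <- qqnorm(x, plot.it = FALSE)
  cor(qq$x, qq$y)
}

normal_fit <- qq_straightness(normal_sample)
mixture_fit <- qq_straightness(mixture_sample)
```

The normal sample should score closer to 1 than the mixture, which is the numeric counterpart of the bend that a mixture produces in the plot.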

Part 4 (Either R or SAS) (2 points)

The hidalgo data set is in the file hidalgo.dat. These data consist of paper thickness measurements of stamps from the 1872 Hidalgo issue of Mexico. This data set is commonly used to illustrate methods of determining the number of components in a mixture (in this case, different batches of paper). See https://www.jstor.org/stable/2290118, https://books.google.com/books?id=1CuznRORa3EC&lpg=PA95&pg=PA94#v=onepage&q&f=false and https://books.google.com/books?id=c2_fAox0DQoC&pg=PA180&lpg=PA180&f=false .

Some analyses suggest there are three different batches of paper used to produce the 1872 Hidalgo issue; other analyses suggest seven. Why do you think there might be disagreement about the number of batches?

Answer: I believe the reason we were asked to produce three plots gives us the answer to this question. If you examine the histogram with only twelve bins you can see three clear groups in the distribution. Conversely, in the last histogram, with 60 bins, you can begin to see seven groups of paper because there are seven peaks and valleys. This is why it is important to choose the number of bins carefully and to examine and understand your data before you call an analysis final. In this scenario it is easy to see why some analysts would say there are three groups of paper and others seven: they are looking at the same distribution at different resolutions.
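
The dependence of the apparent number of groups on the bin count can be sketched numerically. The example below is an illustration under assumptions: the `thickness` vector is a simulated three-component mixture rather than the hidalgo data, and `count_peaks` and `bin_counts` are hypothetical helpers that count local maxima in the histogram bin counts.

```r
# count local maxima in a vector of bin counts
count_peaks <- function(counts) {
  # pad with zeros so peaks at either edge are counted
  slopes <- sign(diff(c(0, counts, 0)))
  # drop flat steps so a plateau counts as a single peak
  slopes <- slopes[slopes != 0]
  # a peak is a rise immediately followed by a fall
  sum(diff(slopes) == -2)
}

# bin counts for exactly b equal-width bins spanning the data
bin_counts <- function(x, b) {
  hist(x, breaks = seq(min(x), max(x), length.out = b + 1), plot = FALSE)$counts
}

set.seed(7)
# simulated mixture of three batches; not the real measurements
thickness <- c(rnorm(150, 0.07, 0.003),
               rnorm(100, 0.08, 0.003),
               rnorm(50, 0.10, 0.004))

peaks_by_bins <- sapply(c(12, 36, 60),
                        function(b) count_peaks(bin_counts(thickness, b)))
```

Finer bins generally expose more local maxima, some of which are sampling noise rather than real components, which is one reason analysts can disagree.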

Exercise 2.

Part 1 (Both R and SAS) (20 points)

Using countyPrecipitation and countyTemperature data from the R Exam, plot a summary of precipitation and temperature by month, using box-whisker plots, one plot for each data file.

You might need to reshape the data. The boxes should be in order, left to right, from Jan to Dec, but the x-axis tick labels should show the abbreviations of the month (“Jan”, “Feb”, etc). You can create an index for month to get the order correct, but you will need to set the axis tick labels.

Label the y-axis “Temperature (degrees F)” or “Precipitation (inches)”

Create two plots, one for each data set, using both R and SAS.

Read the data into R and melt the tables to long format.

countyTemperature <- read_delim("C:/Users/Jeremiah Lowhorn/Desktop/STAT 700/R Exam/countyTemperature.tab", 
                                    "\t", escape_double = FALSE, trim_ws = TRUE)

countyPrecipitation <- read_delim("C:/Users/Jeremiah Lowhorn/Desktop/STAT 700/R Exam/countyPrecipitation.tab", 
                                       "\t", escape_double = FALSE, trim_ws = TRUE)

#melt the countyTemperature data set and rename the value and variable columns to Temp and Month
countyTemperature <- reshape2::melt(countyTemperature) %>%
  rename(Temp = value, Month = variable)

#melt the countyPrecipitation data set and rename the value and variable columns to Precipitation and Month
countyPrecipitation <- reshape2::melt(countyPrecipitation) %>%
  rename(Precipitation = value, Month = variable)
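
As a small sketch of what the reshape does and why the boxes come out in calendar order, the example below converts a toy wide table to long format in base R and stores Month as a factor whose levels follow `month.abb`. The toy data and the column layout (one County column plus one column per month) are assumptions for illustration; `stack()` is used here instead of `reshape2::melt()` only to keep the example dependency-free.

```r
# toy wide table: one row per county, one column per month (assumed layout)
wide <- data.frame(County = c("A", "B"),
                   Jan = c(10, 12),
                   Feb = c(15, 17),
                   stringsAsFactors = FALSE)

month_cols <- intersect(month.abb, names(wide))
stacked <- stack(wide[month_cols])

long <- data.frame(
  County = rep(wide$County, times = length(month_cols)),
  # factor levels in calendar order control the left-to-right order of the boxes
  Month = factor(as.character(stacked$ind), levels = month.abb),
  Temp = stacked$values
)
```

With Month stored this way, plotting functions order the categories Jan through Dec and use the abbreviations as tick labels.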

Plot the countyTemperature data set.

ggplot(countyTemperature, aes(x = Month, y = Temp)) +
  geom_boxplot() +
  labs(y = 'Temperature (degrees F)',
       title="Box Plot of County Temperature")

Plot the countyPrecipitation data set.

ggplot(countyPrecipitation, aes(x = Month, y = Precipitation)) +
  geom_boxplot() +
  labs(y = 'Precipitation (inches)',
       title="Box Plot of County Precipitation")

Part 2 (Either R or SAS)

What month has the most variability in temperature across counties in South Dakota? Which month has the most variability in precipitation? If necessary, produce a summary to support your choices.

Examine the variance in the countyTemperature data set. Answer: The table below shows that January has the highest variability in temperature.

countyTemperature %>%
  group_by(Month) %>%
  summarize(Variance = var(Temp))

## # A tibble: 12 x 2
##     Month  Variance
##    <fctr>     <dbl>
##  1    Jan 14.455317
##  2    Feb  9.806819
##  3    Mar  4.564152
##  4    Apr  1.690177
##  5    May  2.227352
##  6    Jun  2.618977
##  7    Jul  1.717281
##  8    Aug  2.051890
##  9    Sep  2.257055
## 10    Oct  2.012044
## 11    Nov  3.054431
## 12    Dec  7.003327

Examine the variance in the countyPrecipitation data set. Answer: The table below shows that September has the highest variability in precipitation.

countyPrecipitation %>%
  group_by(Month) %>%
  summarize(Variance = var(Precipitation)) %>%
  arrange(desc(Variance))

## # A tibble: 12 x 2
##     Month    Variance
##    <fctr>       <dbl>
##  1    Sep 0.346761026
##  2    Aug 0.286733636
##  3    Apr 0.226268089
##  4    Jun 0.195846061
##  5    Jul 0.158431166
##  6    May 0.096719277
##  7    Oct 0.087345944
##  8    Nov 0.067534266
##  9    Mar 0.052152448
## 10    Dec 0.011861445
## 11    Jan 0.009271445
## 12    Feb 0.004790396
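
The same per-month summary can be sketched in base R, which also makes it easy to pull out the most variable month programmatically. The `toy` data frame below is invented for illustration and is not the county data.

```r
# toy long-format data; values are made up for illustration
toy <- data.frame(
  Month = factor(c("Jan", "Jan", "Jan", "Feb", "Feb", "Feb"),
                 levels = c("Jan", "Feb")),
  Temp = c(10, 14, 18, 20, 21, 22)
)

# sample variance of Temp within each month
monthly_var <- tapply(toy$Temp, toy$Month, var)

# name of the month with the largest variance
most_variable <- names(which.max(monthly_var))
```

In this toy data January has variance 16 and February has variance 1, so `most_variable` is "Jan".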

Exercise 3 (Either R or SAS) (10 points)

Part 1

Merge countyPrecipitation and countyTemperature and generate a scatter plot showing the relationship between temperature (x axis) and precipitation (y-axis).

Use a different color or symbol (or both) for Month.

Label the x-axis “Temperature (degrees F)” and the y-axis “Precipitation (inches)”

Merge the two data sets.

merged <- left_join(countyPrecipitation,countyTemperature,by=c('County','Month'))
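
A left join keeps every row of the first table and fills in missing values where the second table has no match, so it is worth checking for unmatched rows after merging. The sketch below shows the idea on invented toy tables, using base R `merge()` with `all.x = TRUE` as the left-join equivalent.

```r
# toy tables; county B has no temperature record on purpose
precip <- data.frame(County = c("A", "A", "B"),
                     Month = c("Jan", "Feb", "Jan"),
                     Precipitation = c(0.4, 0.5, 0.3))
temp <- data.frame(County = c("A", "A"),
                   Month = c("Jan", "Feb"),
                   Temp = c(15, 20))

# all.x = TRUE keeps every precipitation row, like left_join()
joined <- merge(precip, temp, by = c("County", "Month"), all.x = TRUE)

# rows that found no temperature match carry NA in Temp
unmatched <- sum(is.na(joined$Temp))
```

If `unmatched` is greater than zero in real data, the key columns (for example County spellings) may differ between the two files.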

Create the requested plot.

ggplot(merged, aes(y=Precipitation, x=Temp,color=Month)) +
  geom_point(shape=19) +
  labs(x = 'Temperature (degrees F)',
       y = 'Precipitation (inches)')

Part 2

Compare the plots in Exercises 2 and 3. Does the shape of the scatter plot follow the patterns seen in the box-whisker plots? That is, the scatter plot shows how precipitation varies with temperature; the box-whisker plots show how both temperature and precipitation vary with time. Is there a correlation among time, precipitation and temperature?

Answer: Yes. The spread of the scatter plot, both horizontally (temperature) and vertically (precipitation), corresponds to the spread seen in the temperature and precipitation box plots. The variance and the seasonal trend are harder for me to see in the scatter plot, but it makes the relationship between the two variables easier to examine. There is a positive correlation between precipitation and temperature over time, as the scatter plot shows visually. Pearson's correlation coefficient, which is 0.86, confirms this.

Examine correlation coefficient of the merged data set.

cor(merged$Precipitation,merged$Temp,method='pearson')

## [1] 0.864649
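
For reference, Pearson's coefficient is the sum of the products of the centered values divided by the square root of the product of their sums of squares. The sketch below computes it by hand on invented toy vectors and compares the result with `cor()`; the helper name `pearson` and the data are assumptions for illustration.

```r
# Pearson's r from its definition
pearson <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# toy values, not the county data
temp <- c(15, 20, 35, 50, 65, 70)
precip <- c(0.4, 0.5, 1.2, 2.1, 3.0, 2.8)

r_manual <- pearson(temp, precip)
r_builtin <- cor(temp, precip, method = "pearson")
```

The two values agree up to floating-point error, which confirms that `cor()` with `method = 'pearson'` implements this formula.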

STAT 700 Final

Jeremiah Lowhorn

08/01/2017
