Homework 1

1. First paste into R Markdown the information above as well as the question and create the bullets within the R Markdown file.

Use the data set called airquality, which contains air quality measurements from New York from May 1, 1973 to September 30, 1973.

Daily readings of the following air quality values for May 1, 1973 (a Tuesday) to September 30, 1973.

Ozone: Mean ozone in parts per billion (ppb) from 1300 to 1500 hours at Roosevelt Island
Solar.R: Solar radiation in Langleys (lang) in the frequency band 4000–7700 Angstroms from 0800 to 1200 hours at Central Park
Wind: Average wind speed in miles per hour (mph) at 0700 and 1000 hours at LaGuardia Airport
Temp: Maximum daily temperature in degrees Fahrenheit at La Guardia Airport.
Month: Numeric, 1-12
Day: Numeric within the month 1-31
Format: A data frame with 153 observations on 6 variables.

The data set is described at: https://www.rdocumentation.org/packages/datasets/versions/3.6.1/topics/airquality

2. Create a histogram and a boxplot with all the details specified in class for the wind speed from May 1, 1973 to September 30, 1973.

data(airquality)
head(airquality)  # To find out the variable name for wind speed.  The variable is called Wind.

##   Ozone Solar.R Wind Temp Month Day
## 1    41     190  7.4   67     5   1
## 2    36     118  8.0   72     5   2
## 3    12     149 12.6   74     5   3
## 4    18     313 11.5   62     5   4
## 5    NA      NA 14.3   56     5   5
## 6    28      NA 14.9   66     5   6

Now to make the detailed histogram on airquality$Wind. First, I determine the min and the max so I can set the bins. Remember that R is case sensitive, so wind is not Wind. At first, with “wind,” my command returned NULL.
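As a quick check of the case-sensitivity point, here is a minimal sketch. The names `wrong_case` and `right_case` are only for illustration and are not part of the assignment.

```r
# List the exact column names; note the capital W in "Wind".
names(airquality)

# `$` with a name that does not match any column returns NULL, not an error.
wrong_case <- airquality$wind
is.null(wrong_case)        # TRUE

# The correctly capitalized name returns the numeric vector of wind speeds.
right_case <- airquality$Wind
length(right_case)         # 153
```

Because the misspelled name fails silently, printing `names(airquality)` first is a cheap way to catch this kind of mistake.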

# Detailed Relative Frequency Histograms required for the course

summary(airquality$Wind) # Run first to determine where to start and stop bins.  min = 1.7 max = 20.7

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.700   7.400   9.700   9.958  11.500  20.700

length(airquality$Wind) #get sample size for the title of the histogram

## [1] 153

The min is 1.7 mph and the max is 20.7 mph. For 153 data points, I am thinking about 10 to 15 bins and just picking whatever bin width divides the range well. The range is 20.7 - 1.7 = 19. Since 19 is prime, no convenient bin width divides it evenly, so I am choosing to cover a span of 20 (ten bins of width 2) instead of 18. My rationale is to go with the next larger divisible number so that the maximum falls inside a regular bin instead of being left in an interval by itself.
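The arithmetic behind this choice can be sketched in a few lines. The bin width of 2 and the names `bin_width`, `n_bins`, and `breaks_sketch` are assumptions made for illustration; other widths would be equally defensible.

```r
wind <- airquality$Wind

rng    <- range(wind)     # smallest and largest wind speed: 1.7 and 20.7
spread <- diff(rng)       # 20.7 - 1.7 = 19

bin_width <- 2                                # assumed convenient width
n_bins    <- ceiling(spread / bin_width)      # round up so the max is covered

# Break points start at the minimum and step by the bin width.
breaks_sketch <- rng[1] + bin_width * (0:n_bins)
breaks_sketch
```

Rounding the number of bins up rather than down is what guarantees the last break lies at or beyond the maximum.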

sorted <- sort(airquality$Wind)
sorted

##   [1]  1.7  2.3  2.8  3.4  4.0  4.1  4.6  4.6  4.6  4.6  5.1  5.1  5.1  5.7
##  [15]  5.7  5.7  6.3  6.3  6.3  6.3  6.3  6.3  6.3  6.3  6.9  6.9  6.9  6.9
##  [29]  6.9  6.9  6.9  6.9  6.9  7.4  7.4  7.4  7.4  7.4  7.4  7.4  7.4  7.4
##  [43]  7.4  8.0  8.0  8.0  8.0  8.0  8.0  8.0  8.0  8.0  8.0  8.0  8.6  8.6
##  [57]  8.6  8.6  8.6  8.6  8.6  8.6  9.2  9.2  9.2  9.2  9.2  9.2  9.2  9.2
##  [71]  9.7  9.7  9.7  9.7  9.7  9.7  9.7  9.7  9.7  9.7  9.7 10.3 10.3 10.3
##  [85] 10.3 10.3 10.3 10.3 10.3 10.3 10.3 10.3 10.9 10.9 10.9 10.9 10.9 10.9
##  [99] 10.9 10.9 11.5 11.5 11.5 11.5 11.5 11.5 11.5 11.5 11.5 11.5 11.5 11.5
## [113] 11.5 11.5 11.5 12.0 12.0 12.0 12.0 12.6 12.6 12.6 13.2 13.2 13.8 13.8
## [127] 13.8 13.8 13.8 14.3 14.3 14.3 14.3 14.3 14.3 14.9 14.9 14.9 14.9 14.9
## [141] 14.9 14.9 14.9 15.5 15.5 15.5 16.1 16.6 16.6 16.6 18.4 20.1 20.7

# Note: You can define multiple bin sets.  You can also use the seq() command to define the bins.
bins1 <- seq(1.7, 22.7, by=2.0)     #defining for code below where 1.7 is included in the first interval.
# As compared to Excel, where you stop one bin early and the last group is classified as “More,” in
# R you must go past the last data point or R will tell you that you left out numbers.
#Notice that even though I went to 22.7, the breaks stop at 21.7. seq() never goes past its upper limit, and the next break (23.7) would exceed 22.7. The data max is 20.7, so the last bin (19.7 to 21.7) still contains it. However, if I put 32.7, which gives more complete bins of width two instead of a partial bin, the histogram will extend past the max and show the extra bins with a count of zero.

# MAKING THE FREQUENCY HISTOGRAM
#Run the histogram to get the counts, for two purposes: 1) to see if there are lots of bins with just 1 data point at the ends (as would happen with a bin width of 0.5, in which case you would need a larger bin width) and 2) to create the relative frequencies from h$counts.
h<-hist(airquality$Wind, 
        main="Figure 3: Frequency Histogram of Wind Speed (n = 153)",
        xlab="Wind Speed (mph)",
        ylab="Frequency",
        breaks=bins1,
        ylim=c(0,40),
        axes=FALSE,
        labels=TRUE)                  # axes=F turns off the axes so you can redefine the tick marks
axis(2)
axis(1,at=bins1, labels=bins1)

#MAKING THE RELATIVE FREQUENCY HISTOGRAM
#The following command fools R by putting relative frequencies back into the "counts" for the histogram stored in "h."
h$counts <- round(h$counts/sum(h$counts),2)

#You may add all the labels the same way that you do with hist() or any other graphics command. You DO NOT USE the breaks argument within the plot() command.

plot(h, main="Figure 4: Relative Frequency Histogram of Wind Speed (n = 153)",
     cex.main=1.0,
     xlab="Wind Speed (mph)",
     ylab="Relative Frequency",
     ylim=c(0,0.3),
     axes=FALSE,
     labels=TRUE)                  # axes=F turns off the axes so you can redefine the tick marks
axis(2)                         # y axis is put back

#The next command places the tick marks at the bin boundaries with "at", and "labels" puts number labels at those tick marks.

axis(1,at=bins1, labels=bins1)

It is interesting that even though 1 of the 153 days falls in the interval 17.7 to 19.7 mph and 2 of the 153 days fall in the interval from 19.7 to 21.7 mph, the relative frequency histogram lacks this detail, and both intervals look like they have the same number of days in them. This is because 1/153 = .007 and 2/153 = .013, and both values round to .01.
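The rounding effect can be reproduced without drawing anything. This is a minimal sketch: `bins_sketch`, `h_sketch`, and `last_two` are illustrative names, and `plot = FALSE` asks hist() to return the counts without plotting.

```r
# Same breaks as the histogram: 1.7, 3.7, ..., 21.7
bins_sketch <- seq(1.7, 22.7, by = 2)

# plot = FALSE returns the histogram object without drawing it.
h_sketch <- hist(airquality$Wind, breaks = bins_sketch, plot = FALSE)

# Unrounded relative frequencies for every bin.
rel_freq <- h_sketch$counts / sum(h_sketch$counts)

# The last two bins hold 1 and 2 days respectively.
last_two <- tail(rel_freq, 2)
round(last_two, 2)   # two decimals: the two bins become indistinguishable
round(last_two, 3)   # three decimals: the difference is visible again
```

Rounding to three decimals instead of two would keep the bar labels distinct, at the cost of longer labels on every bar.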

## Interpretation

The histogram shows that the majority of the days (about 65 percent) have average wind speeds between 5.7 and 11.7 mph. About twenty-five percent of the days between May 1 and September 30 in 1973 have average wind speeds that range from 11.7 to 21.7 mph. Since the right tail spans 10 mph while the left tail spans only about 4 mph (1.7 to 5.7 mph), the distribution of average wind speeds is skewed to the right. Because of this skew, the median may be a better measure of center than the mean. The median is 9.7 mph while the mean is 10.0 mph.
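The proportions quoted in this interpretation can be checked from the bin counts. The sketch below assumes the same breaks as the histogram (1.7 to 21.7 in steps of 2); the names `prop_middle`, `prop_upper`, and `center` are illustrative.

```r
wind <- airquality$Wind

# Bin counts for breaks 1.7, 3.7, ..., 21.7 (no plot is drawn).
bins_sketch <- seq(1.7, 22.7, by = 2)
h_sketch <- hist(wind, breaks = bins_sketch, plot = FALSE)

# Bins 3 to 5 cover 5.7 to 11.7 mph; bins 6 to 10 cover 11.7 to 21.7 mph.
prop_middle <- sum(h_sketch$counts[3:5])  / length(wind)
prop_upper  <- sum(h_sketch$counts[6:10]) / length(wind)
round(c(middle = prop_middle, upper = prop_upper), 3)

# Compare the two measures of center; a mean above the median
# is consistent with a right skew.
center <- c(median = median(wind), mean = mean(wind))
round(center, 1)
```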

3. Create a boxplot with all the details specified in class for the wind speed from May 1, 1973 to September 30, 1973. Interpret your boxplot within R Markdown outside of the code chunks.

summary(airquality$Wind) #To get the min and max for the y axis label.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.700   7.400   9.700   9.958  11.500  20.700

boxplot(airquality$Wind,
        main="Figure 5: Boxplot of Wind Speed in New York (n = 153) \n May 1, 1973 to September 30, 1973",
        cex.main=1.2,
        sub="New York City",
        cex.sub=0.9,
        ylab="Windspeed (mph)",
        xlab="Windspeed",
        boxwex=0.4)     #boxwex is a scale factor for the box width; 0.4 draws a narrower box than the default

#Label the 5-number summary on the boxplot
# x=1.20 is the location of the labels horizontally and the label location can be moved
# cex = 0.9 tells R to print the labels at 90% of what the normal font is
text(y = boxplot.stats(airquality$Wind)$stats, 
labels = round(boxplot.stats(airquality$Wind)$stats,1), x = 1.20, cex = 0.9)

#To label outliers--IF NEEDED
text(y = round(max(airquality$Wind),1), labels = round(max(airquality$Wind),1), x = 1.20, cex = 0.9)
text(y = round(min(airquality$Wind),1), labels = round(min(airquality$Wind),1), x = 1.20, cex = 0.9) 

#Label the mean on the boxplot. Useful for telling if the quantitative variable is skewed.
#List of pch symbols: http://www.endmemo.com/program/R/pchsymbols.php
#pch=7 dictates the shape: a square with an “x” in the square

points(x = 1, y = mean(airquality$Wind), pch=7) # x = 1 is the horizontal position of the single box; pch=7 is the box with an "x" inside
text(y = round(mean(airquality$Wind),1), labels = round(mean(airquality$Wind),1), x = 0.80, cex = 0.9)
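The min and max labels above only mark the extremes. As a hedged alternative, `boxplot.stats()` also returns every point that falls beyond the whiskers in its `out` component. The names `bp`, `five_num`, and `outliers` are illustrative and not part of the assignment.

```r
# boxplot.stats() returns the numbers that boxplot() draws.
bp <- boxplot.stats(airquality$Wind)

# stats: lower whisker, lower hinge, median, upper hinge, upper whisker
five_num <- bp$stats
five_num

# out: every observation beyond the whiskers
outliers <- sort(bp$out)
outliers
```

After the boxplot has been drawn, `text(y = outliers, labels = outliers, x = 1.20, cex = 0.9)` would label each outlier individually rather than only the maximum.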

Susan Mathews Hardy

8/10/2019