Homework Week 3

Synopsis

This assignment is primarily for developing good practices when getting a new data set.

The three objectives are:

* Review the codebook to understand the source of the data and the meaning of the variables and measures, and identify any missing data or values used to indicate missing data.
* Learn about the data by exploring its structure: number of observations, names and number of variables, how variables are coded, dimensions, and missing values. View the raw data and create summary statistics. Some variables may need to be converted to a different type to be more meaningful.
* Visualize the data to reveal more about the data set and the relationships between variables.

Load Packages Required

library(readr)
library(rmarkdown)

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

Import Data
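
The import step is not shown in the rendered output. The `str()` output in the next section (a plain `data.frame` with default column names `V1` to `V4`) is consistent with a headerless, whitespace-delimited text file read with `read.table()`. The following is a minimal sketch under that assumption; the file is a temporary stand-in with made-up rows, and the real file name and location are not given here.

```r
# Sketch only: write a tiny stand-in file with the codebook's four fields
# (month, day, year, average daily temperature in F), then read it back.
tmp <- tempfile(fileext = ".txt")   # hypothetical path, not the real data file
writeLines(c(" 1  1 1995  41.1",
             " 1  2 1995  22.2",
             " 1  3 1995 -99"), tmp)

# header = FALSE yields the default column names V1..V4
demo <- read.table(tmp, header = FALSE)
str(demo)
```

With the real file, the same kind of call would presumably be assigned to `CINCIN`.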

Dimensions, Variables and Type

str(CINCIN)

## 'data.frame':    7963 obs. of  4 variables:
##  $ V1: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ V2: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ V3: int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ V4: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
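
Beyond `str()`, a few base R calls cover the other exploration steps named in the objectives (dimensions, variable names, raw data, summary statistics). This is a sketch on a toy data frame standing in for `CINCIN`; the values are illustrative only.

```r
# Toy data frame standing in for CINCIN (values are illustrative only)
toy <- data.frame(V1 = c(1L, 1L, 1L),
                  V2 = c(1L, 2L, 3L),
                  V3 = c(1995L, 1995L, 1995L),
                  V4 = c(41.1, 22.2, -99))

dim(toy)       # number of rows and columns
names(toy)     # variable names
head(toy, 2)   # first rows of the raw data
summary(toy)   # per-column summary; a minimum of -99 hints at a no-data flag
```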

Source Code From Codebook

This archive contains files of average daily temperatures for 157 U.S. and 167 international cities. Source data for these files are from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The average daily temperatures posted on this site are computed from 24 hourly temperature readings in the Global Summary of the Day (GSOD) data.

The data fields in each file posted on this site are: month, day, year, average daily temperature (F). We use “-99” as a no-data flag when data are not available.

The average daily temperatures posted on this site are from Global Summary of the Day (GSOD) dataset and are computed from 24 hourly temperature readings. The GSOD dataset also includes daily minimum and maximum temperatures. Some earlier datasets compiled by the NCDC, such as the Local Climatological Data Monthly Summary, contained daily minimum and maximum temperatures, but did not contain the average daily temperature computed from 24 hourly readings. As a result, some users calculated the average daily temperature as the average of the daily minimum and maximum temperatures. We compared average daily temperatures calculated from 24 hourly readings, T24, to average daily temperatures calculated as the average of the daily minimum and maximum temperatures, Tminmax, for 53,004 daily temperature records in the GSOD dataset. We found that, on average, the absolute value of the deviation between T24 and Tminmax was 1.48 F. In addition, we found that, on average, Tminmax was 0.0790 F higher than T24. Temperatures in the GSOD dataset are reported with a precision of 0.1 F. Thus, the average bias is less than the precision of the source data, and we conclude that the bias between T24 and Tminmax is not statistically significant. If the bias is negligible, then the deviation is random and will sum to zero over any sufficiently long time period. Thus, use of either T24 or Tminmax “average” daily temperatures should give similar results.
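
The two quantities the codebook compares (mean absolute deviation and mean bias between T24 and Tminmax) can be sketched as follows. The temperature values are made up for illustration and are not taken from the GSOD data.

```r
# Illustrative values only; these are not GSOD records
t24     <- c(40.2, 55.1, 61.7, 33.4)   # mean of 24 hourly readings
tmin    <- c(30.0, 45.0, 52.0, 27.0)   # daily minimum
tmax    <- c(51.0, 66.0, 70.0, 41.0)   # daily maximum
tminmax <- (tmin + tmax) / 2           # midpoint of daily min and max

diffs <- tminmax - t24
mean(abs(diffs))   # mean absolute deviation (the codebook reports 1.48 F)
mean(diffs)        # mean bias; positive means Tminmax runs higher than T24
```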

Data Description

Data frame of 7963 observations of 4 variables:
* month (V1): integer
* day (V2): integer
* year (V3): integer
* average daily temperature in F (V4): numeric

“-99” is used as a no-data flag when data are not available.

Change Column Names to Be More Descriptive

colnames(CINCIN) <- c("Month", "Day", "Year", "TempF")

Change Integers to Categorical Data (Factor)

CINCIN$Month <- as.factor(CINCIN$Month)
CINCIN$Day <- as.factor(CINCIN$Day)
CINCIN$Year <- as.factor(CINCIN$Year)
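
One caveat when storing numbers such as years as factors: converting a factor straight to numeric returns its internal level codes, not the original values. A small sketch with illustrative years:

```r
year_f <- as.factor(c(1995L, 1996L, 2016L))

as.numeric(year_f)                 # internal level codes, not the years
as.numeric(as.character(year_f))   # the original year values
```

Going through `as.character()` first is the usual way to recover the numbers when a later step (for example a regression on year) needs them.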

Change Missing Values (-99) to NA

CINCIN[CINCIN == -99] <- NA   # recode the codebook's no-data flag as NA
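
Recoding matters because a flag left in place is treated as a real reading. A minimal sketch with illustrative values:

```r
temp <- c(41.1, 22.2, -99, 14.9)   # illustrative values with one no-data flag

mean(temp)                 # the flag is averaged in and drags the mean down
temp[temp == -99] <- NA    # recode the flag as missing
mean(temp, na.rm = TRUE)   # mean of the three valid readings only
```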

Count NAs

sum(is.na(CINCIN))

## [1] 14
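
`sum(is.na())` gives one total for the whole data frame. To see which columns the missing values fall in, `colSums()` can be applied to the same logical matrix. A sketch on a toy data frame with illustrative values:

```r
toy_na <- data.frame(Month = c(1, 1, 1),
                     TempF = c(41.1, NA, NA))

colSums(is.na(toy_na))   # count of NA values per column
```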

Remove NA values

CINCINNA <- na.omit(CINCIN)
sum(is.na(CINCINNA))

## [1] 0

Data Visualization

Histogram with Density Plot and Mean Temperature Line

Shows distribution and frequency of temperatures

P1 <- ggplot(CINCINNA, aes(x=TempF)) +   # refer to the column by name inside aes()
  geom_histogram(aes(y=..density..), 
                 binwidth=.5,
                 colour="black", fill="white") +
  geom_density(alpha=.2, fill="#99CCFF") +
  labs(title="Histogram for Cincinnati Temperature with Mean 1995-2015") +
  labs(x="Temperature F", y="Density") +
  geom_vline(aes(xintercept=mean(TempF, na.rm=TRUE)),   # Ignore NA values for mean
           color="red", linetype="dashed", size=1)
P1

Line Graph of Average Annual Temperature Over Time with Trend Line

Displays the annual average temperature and whether its relationship with year is negative or positive. The trend line suggests that temperature increases over time.

AVGYR<- aggregate(TempF ~ Year, FUN=mean, data=CINCINNA)
P2<-ggplot(data=AVGYR, aes(x=Year, y=TempF, group=1)) + 
  geom_point(colour="black", size = 2, shape=8, fill = "black") +
  geom_smooth(colour="red", size=2) +
  labs(title="Average Yearly Temperature 1995-2015 with Trend Line") +
  labs(x="Year", y="Average Temperature F")
P2
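
The default smoother in `geom_smooth()` shows the shape of the trend but not a single rate of change. One way to put a number on it is a simple linear fit of yearly average against year. This is a sketch, not part of the original analysis; the data frame `avg` and its values are illustrative and are not the Cincinnati results.

```r
# Illustrative yearly averages; not the Cincinnati results
avg <- data.frame(Year  = as.factor(2001:2005),
                  TempF = c(53.0, 53.5, 54.0, 54.5, 55.0))

# Year is a factor, so convert it back to a number before modelling
avg$YearNum <- as.numeric(as.character(avg$Year))

fit <- lm(TempF ~ YearNum, data = avg)
coef(fit)["YearNum"]   # estimated change in degrees F per year
```

A positive coefficient corresponds to warming over the period; `summary(fit)` would additionally report its standard error.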

Boxplot of Average Monthly Temperature 1995-2015

Displays the distribution of temperatures by month and shows median temperatures, range, and outliers.

AVGMONTH<- aggregate(TempF~ Year + Month, FUN=mean, data=CINCINNA)
P3<- ggplot(AVGMONTH, aes(x=Month, y=TempF, fill=Month)) + 
    stat_boxplot(geom ='errorbar') +
    geom_boxplot() + 
    stat_summary(fun.y=mean, geom="point", shape=21, size = 3)
P3

Week-3.Rmd

Lawrence Porter

October 25, 2016