This assignment is primarily about developing good practices for working with a new data set.
The three objectives are:
* Review the codebook to understand the source of the data, what the variables and measures mean, and which values are used to indicate missing data.
* Learn about the data by exploring its structure: dimensions, number of observations, names and number of variables, how variables are coded, and missing values. View the raw data and create summary statistics. Some variables may need to be converted to a different type to be more meaningful.
* Visualize the data to reveal more about the data set and the relationships between variables.
library(readr)
library(rmarkdown)
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
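The step that creates CINCIN is not shown in this report. As a minimal sketch, assuming the city file is a whitespace-delimited text file with no header row, base R's read.table() would produce the V1 to V4 column names that appear in the str() output below. The file used here is a temporary stand-in built from the first three rows of that output; the real file name and location are not given, so demo_path and demo are illustrative names only.

```r
# Stand-in for the real city file (its path is not given in this report).
# The three rows reuse the first values shown by str(CINCIN).
demo_path <- tempfile(fileext = ".txt")
writeLines(c(" 1  1  1995  41.1",
             " 1  2  1995  22.2",
             " 1  3  1995  22.8"), demo_path)

# header = FALSE makes read.table() name the columns V1, V2, V3, V4;
# the default sep = "" splits on any run of whitespace
demo <- read.table(demo_path, header = FALSE)
str(demo)
```

If the no-data flag is written in the file exactly as -99, passing na.strings = "-99" to read.table() is one way to convert it to NA at load time instead of recoding it afterwards.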
Dimensions, Variables and Type
str(CINCIN)
## 'data.frame': 7963 obs. of 4 variables:
## $ V1: int 1 1 1 1 1 1 1 1 1 1 ...
## $ V2: int 1 2 3 4 5 6 7 8 9 10 ...
## $ V3: int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ V4: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
This archive contains files of average daily temperatures for 157 U.S. and 167 international cities. Source data for these files are from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The average daily temperatures posted on this site are computed from 24 hourly temperature readings in the Global Summary of the Day (GSOD) data.
The data fields in each file posted on this site are: month, day, year, average daily temperature (F). We use “-99” as a no-data flag when data are not available.
The average daily temperatures posted on this site are from Global Summary of the Day (GSOD) dataset and are computed from 24 hourly temperature readings. The GSOD dataset also includes daily minimum and maximum temperatures. Some earlier datasets compiled by the NCDC, such as the Local Climatological Data Monthly Summary, contained daily minimum and maximum temperatures, but did not contain the average daily temperature computed from 24 hourly readings. As a result, some users calculated the average daily temperature as the average of the daily minimum and maximum temperatures. We compared average daily temperatures calculated from 24 hourly readings, T24, to average daily temperatures calculated as the average of the daily minimum and maximum temperatures, Tminmax, for 53,004 daily temperature records in the GSOD dataset. We found that, on average, the absolute value of the deviation between T24 and Tminmax was 1.48 F. In addition, we found that, on average, Tminmax was 0.0790 F higher than T24. Temperatures in the GSOD dataset are reported with a precision of 0.1 F. Thus, the average bias is less than the precision of the source data, and we conclude that the bias between T24 and Tminmax is not statistically significant. If the bias is negligible, then the deviation is random and will sum to zero over any sufficiently long time period. Thus, use of either T24 or Tminmax “average” daily temperatures should give similar results.
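The two statistics quoted above measure different things: the mean absolute deviation describes how far the two averages typically differ on a given day, while the bias (the mean signed difference) describes whether one runs systematically higher. A small sketch with invented readings, not GSOD values, shows how each could be computed; the names t24, tminmax, and delta are illustrative.

```r
# Invented daily values in degrees F, for illustration only
t24     <- c(40.2, 55.1, 63.4, 71.0, 48.7)   # mean of 24 hourly readings
tminmax <- c(41.5, 53.9, 64.8, 70.1, 50.0)   # (daily min + daily max) / 2

delta <- tminmax - t24                  # signed difference for each day

mean_abs_deviation <- mean(abs(delta))  # typical size of the disagreement
mean_bias          <- mean(delta)       # positive means tminmax runs higher

mean_abs_deviation
mean_bias
```

Positive and negative differences cancel in the bias but not in the absolute deviation, so the bias can be close to zero even when individual days differ by a degree or more.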
Data frame of 7963 observations of 4 variables
Variables
* month (V1): integer
* day (V2): integer
* year (V3): integer
* average daily temperature (F) (V4): numeric
* “-99” is the no-data flag used when data are not available.
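# Give the columns descriptive names (they load as V1..V4)
colnames(CINCIN) <- c("Month", "Day", "Year", "TempF")
# Treat the date parts as categorical for grouping and plotting
CINCIN$Month <- as.factor(CINCIN$Month)
CINCIN$Day <- as.factor(CINCIN$Day)
CINCIN$Year <- as.factor(CINCIN$Year)
# Recode the -99 no-data flag in the temperature column as NA
CINCIN$TempF[CINCIN$TempF == -99] <- NA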
sum(is.na(CINCIN))
## [1] 14
CINCINNA <- na.omit(CINCIN)
sum(is.na(CINCINNA))
## [1] 0
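Before dropping rows, it can help to see where the missing values sit. The sketch below uses a toy data frame with invented values and the same column names as above; it counts NA per column and lists the rows that na.omit() would remove. The names toy, na_per_column, and rows_with_na are illustrative.

```r
# Toy data with invented values; -99 stands in for the no-data flag
toy <- data.frame(Month = c(1, 1, 1, 2),
                  Day   = c(1, 2, 3, 1),
                  Year  = c(1995, 1995, 1995, 1995),
                  TempF = c(41.1, -99, 22.8, -99))

toy$TempF[toy$TempF == -99] <- NA             # recode the flag as NA

na_per_column <- colSums(is.na(toy))          # NA count for each column
rows_with_na  <- toy[!complete.cases(toy), ]  # rows na.omit() would drop

na_per_column
rows_with_na
summary(toy$TempF)                            # summary() also reports the NA count
```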
The histogram shows the distribution of daily average temperatures, with a density curve overlaid and a dashed red line marking the mean.
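# Map the column by name inside aes(), not as CINCINNA$TempF
P1 <- ggplot(CINCINNA, aes(x = TempF)) +
  # Density scale so the curve can overlay the bars (after_stat needs ggplot2 >= 3.3)
  geom_histogram(aes(y = after_stat(density)),
                 binwidth = 0.5,
                 colour = "black", fill = "white") +
  geom_density(alpha = 0.2, fill = "#99CCFF") +
  labs(title = "Histogram of Cincinnati Temperature with Mean, 1995-2015",
       x = "Temperature (F)", y = "Density") +
  # One dashed line at the mean; NA rows were already dropped by na.omit()
  # (linewidth needs ggplot2 >= 3.4; older versions use size)
  geom_vline(xintercept = mean(CINCINNA$TempF),
             colour = "red", linetype = "dashed", linewidth = 1)
P1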
Displays the average temperature for each year with a smoothed trend line, to show whether the relationship between year and temperature is positive or negative. Average temperature appears to increase over time.
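# Mean temperature for each year
AVGYR <- aggregate(TempF ~ Year, FUN = mean, data = CINCINNA)
# group = 1 lets the smoother treat the factor Year as a single series
P2 <- ggplot(data = AVGYR, aes(x = Year, y = TempF, group = 1)) +
  geom_point(colour = "black", size = 2, shape = 8) +
  # linewidth needs ggplot2 >= 3.4; older versions use size
  geom_smooth(colour = "red", linewidth = 2) +
  labs(title = "Average Yearly Temperature 1995-2015 with Trend Line",
       x = "Year", y = "Average Temperature (F)")
P2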
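A smoothed line shows the shape of the trend; a linear fit gives a single slope whose sign says whether the relationship is positive or negative. The sketch below uses invented yearly means, not the Cincinnati values, and converts a factor Year back to numbers first, since Year is stored as a factor in this report. The names toy_avg, YearNum, fit, and slope are illustrative.

```r
# Invented yearly averages, for illustration only
toy_avg <- data.frame(Year  = factor(2001:2005),
                      TempF = c(53.0, 53.4, 53.1, 53.9, 54.2))

# as.numeric() on a factor returns level codes, so go through as.character()
toy_avg$YearNum <- as.numeric(as.character(toy_avg$Year))

fit   <- lm(TempF ~ YearNum, data = toy_avg)
slope <- coef(fit)[["YearNum"]]   # degrees F per year; above 0 means warming

slope
```

On the real data the same steps would apply to AVGYR, and summary(fit) would report whether the slope is distinguishable from zero.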
Displays the distribution of monthly average temperatures for each calendar month across years, showing the median, range, and outliers. The circle in each box marks the mean.
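# Mean temperature for each month of each year
AVGMONTH <- aggregate(TempF ~ Year + Month, FUN = mean, data = CINCINNA)
P3 <- ggplot(AVGMONTH, aes(x = Month, y = TempF, fill = Month)) +
  stat_boxplot(geom = "errorbar") +   # adds caps to the whisker ends
  geom_boxplot() +
  # fun replaces the deprecated fun.y in ggplot2 >= 3.3
  stat_summary(fun = mean, geom = "point", shape = 21, size = 3)
P3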
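The month axis in the boxplot shows the factor levels 1 to 12. As an optional sketch, base R's built-in month.abb vector can relabel them with month abbreviations; months and month_labelled are illustrative names.

```r
# A small factor coded like Month above (levels 1..12)
months <- factor(c(1, 2, 12), levels = 1:12)

# Relabel levels 1..12 as "Jan".."Dec" using the built-in month.abb constant
month_labelled <- factor(months, levels = 1:12, labels = month.abb)

month_labelled
```

Applied to CINCINNA$Month before aggregating, the same call would make the boxplot axis and legend read Jan through Dec.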