Synopsis

This is my homework report for week 3. This week's focus is on good practices for working with a new data set. Understanding the basic characteristics of the data is the first step towards a meaningful analysis. The report covers three objectives to pursue when first opening a new data set:

1. Review the codebook
2. Learn about the data
3. Visualize the data

library(knitr)
read_chunk("week-3.R")

Packages Required

library(knitr)
library(tidyverse)

Source Code

The data is scraped from the University of Dayton temperature archive (academic.udayton.edu). This archive contains files of average daily temperatures for 157 U.S. and 167 international cities. The files are updated on a regular basis and contain data from January 1, 1995 to present. Source data for these files are from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The average daily temperatures posted on the site are computed from 24 hourly temperature readings in the GSOD data.

Details about the Data:

1. The data fields in each file are: month, day, year, average daily temperature (°F).
2. The archive uses -99 as a no-data flag when data are not available.
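
Because missing readings are coded as -99, one option is to turn the flag into NA while the file is being read. The sketch below is only an illustration on three made-up lines in the archive's four-column layout. The values and the name `sample_data` are hypothetical, and it assumes the flag always appears literally as -99.

```r
# Three made-up lines in the archive's layout: month, day, year, temperature.
sample_lines <- c(
  "1 1 1995 41.1",
  "1 2 1995 -99",
  "1 3 1995 22.8"
)

# na.strings tells read.table to treat the -99 flag as NA while parsing.
sample_data <- read.table(
  text = sample_lines,
  header = FALSE,
  col.names = c("Month", "Day", "Year", "Temperature"),
  na.strings = "-99"
)

# TRUE marks the reading that was flagged as missing.
is.na(sample_data$Temperature)
```

Converting at read time keeps the flag from ever being treated as a real temperature, but it hides how many flags there were unless the NAs are counted afterwards.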

Scraping the Data:

url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
weather_data <- read.table(url, header=FALSE, col.names = c("Month","Day","Year","Temperature"))
kable(head(weather_data))
Month Day Year Temperature
1 1 1995 41.1
1 2 1995 22.2
1 3 1995 22.8
1 4 1995 14.9
1 5 1995 9.5
1 6 1995 23.8

Data Description

Number of Variables

ncol(weather_data)
## [1] 4

Names of the Variables

names(weather_data)
## [1] "Month"       "Day"         "Year"        "Temperature"

Number of Rows

nrow(weather_data)
## [1] 7963

Number of Rows and Variables

dim(weather_data)
## [1] 7963    4

Structure of the Data

str(weather_data)
## 'data.frame':    7963 obs. of  4 variables:
##  $ Month      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Day        : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Year       : int  1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
##  $ Temperature: num  41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...

A Few Observations

kable(head(weather_data))
Month Day Year Temperature
1 1 1995 41.1
1 2 1995 22.2
1 3 1995 22.8
1 4 1995 14.9
1 5 1995 9.5
1 6 1995 23.8
kable(tail(weather_data))
Row Month Day Year Temperature
7958 10 14 2016 54.4
7959 10 15 2016 63.2
7960 10 16 2016 68.7
7961 10 17 2016 71.1
7962 10 18 2016 74.4
7963 10 19 2016 75.3

Number of Missing Values

sum(weather_data == -99)
## [1] 14
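
The total above does not say which column holds the flags. A minimal sketch of a per-column count, using a small made-up data frame (`toy` and its values are hypothetical, not taken from the Cincinnati file):

```r
# Made-up data with two flagged temperature readings.
toy <- data.frame(
  Month = c(1, 1, 1),
  Day = 1:3,
  Year = 1995,
  Temperature = c(41.1, -99, -99)
)

# Comparing the data frame to -99 gives a logical matrix;
# colSums then counts the TRUE values in each column.
flag_counts <- colSums(toy == -99)
flag_counts
```

If every flag falls in the Temperature column, as the codebook suggests it should, the other three counts are zero.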

Removing the Missing Values

weather_data[weather_data==-99] <- NA
weather_data <- na.omit(weather_data)
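
Note that na.omit drops every row containing an NA, so the affected dates disappear from the data. The sketch below, on a made-up data frame (`toy` is hypothetical), shows how one might record how many rows were dropped, and an alternative that keeps the rows and skips NAs only when computing a statistic.

```r
# Made-up data with one flagged reading.
toy <- data.frame(
  Month = c(1, 1, 1),
  Day = 1:3,
  Year = 1995,
  Temperature = c(41.1, -99, 22.8)
)

rows_before <- nrow(toy)

# Same two steps as in the report: recode the flag, then drop incomplete rows.
toy[toy == -99] <- NA
cleaned <- na.omit(toy)
rows_dropped <- rows_before - nrow(cleaned)
rows_dropped

# Alternative: keep all rows and ignore NAs inside the calculation.
mean_kept <- mean(toy$Temperature, na.rm = TRUE)
```

Keeping the rows preserves an unbroken sequence of dates, which can matter for time-based plots.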

Summary of the Data

summary(weather_data)
##      Month             Day             Year       Temperature   
##  Min.   : 1.000   Min.   : 1.00   Min.   :1995   Min.   :-2.20  
##  1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2000   1st Qu.:40.20  
##  Median : 6.000   Median :16.00   Median :2005   Median :57.10  
##  Mean   : 6.477   Mean   :15.71   Mean   :2005   Mean   :54.73  
##  3rd Qu.: 9.000   3rd Qu.:23.00   3rd Qu.:2011   3rd Qu.:70.70  
##  Max.   :12.000   Max.   :31.00   Max.   :2016   Max.   :89.20
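
The overall summary mixes winter and summer readings, so a per-month summary may be more informative. A minimal sketch using base R's aggregate on made-up values (`toy` and `monthly_mean` are hypothetical names):

```r
# Made-up readings for two months.
toy <- data.frame(
  Month = c(1, 1, 7, 7),
  Temperature = c(30, 34, 75, 79)
)

# Mean temperature for each month.
monthly_mean <- aggregate(Temperature ~ Month, data = toy, FUN = mean)
monthly_mean
```

The same call could be applied to weather_data after the missing values have been handled.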

Data Visualization

  1. A basic ggplot of year vs temperature colored by month
  2. A smooth ggplot of year vs temperature
  3. A boxplot for each month showing temperature values

# Scatter plot of temperature by year; Month is still numeric here,
# so the color scale is a continuous gradient
ggplot(data = weather_data) +
  geom_point(mapping = aes(x = Year, y = Temperature, color = Month))

# Smoothed trend of temperature across years
ggplot(data = weather_data) +
  geom_smooth(mapping = aes(x = Year, y = Temperature))

# Convert Month to a factor so that each month gets its own box
weather_data$Month <- as.factor(weather_data$Month)
ggplot(data = weather_data) +
  geom_boxplot(mapping = aes(x = Month, y = Temperature))
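
Since Year is an integer, plotting against it stacks all of a year's readings in one vertical column. One possible refinement is to combine the three date fields into a single Date column and plot against that. The sketch below uses made-up values, and the names `toy` and `p` are hypothetical.

```r
library(ggplot2)

# Made-up readings in the same four-column layout.
toy <- data.frame(
  Month = c(1, 1, 1, 2),
  Day = c(1, 2, 3, 1),
  Year = c(1995, 1995, 1995, 1995),
  Temperature = c(41.1, 22.2, 22.8, 30.5)
)

# ISOdate builds a date-time from year, month, and day; as.Date keeps the date part.
toy$Date <- as.Date(ISOdate(toy$Year, toy$Month, toy$Day))

# Line plot of temperature over calendar time.
p <- ggplot(data = toy) +
  geom_line(mapping = aes(x = Date, y = Temperature))
p
```

With the full data set, this view would show the seasonal cycle directly along the x-axis.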