This is my homework report for week 3. This week, the main focus is to investigate good practices for working with a new data set. Understanding the basic characteristics of the data is the first step towards a meaningful analysis. We focus on three objectives to keep in mind when first opening a new data set:
1. Review the codebook
2. Learn about the data
3. Visualize the data
library(knitr)
read_chunk("week-3.R")
library(tidyverse)
The data is scraped from an online archive hosted at academic.udayton.edu (the file URL appears in the scraping code below). This archive contains files of average daily temperatures for 157 U.S. and 167 international cities. The files are updated on a regular basis and contain data from January 1, 1995 to present. Source data for these files are from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The average daily temperatures posted on the site are computed from 24 hourly temperature readings in the GSOD data.
Details about the Data:
1. The data fields in each file posted on the site are: month, day, year, average daily temperature (F).
2. The archive uses “-99” as a no-data flag when data are not available.
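Because the no-data flag is a single fixed value, it can also be mapped to `NA` while the file is being read. The following is only a minimal sketch of that idea, not the approach used in the rest of this report (which reads the flag as a number and recodes it later). The inline `sample_text` is made up for illustration and simply mimics the four-column layout described above.

```r
# Hypothetical three-row sample in the layout described above:
# month, day, year, average daily temperature (F).
sample_text <- "1 1 1995 41.1
1 2 1995 -99
1 3 1995 22.8"

# na.strings tells read.table() to treat the field "-99" as missing.
sample_data <- read.table(
  text = sample_text,
  header = FALSE,
  col.names = c("Month", "Day", "Year", "Temperature"),
  na.strings = "-99"
)

# Count the flagged observations that were read as NA.
n_missing <- sum(is.na(sample_data$Temperature))
print(n_missing)
```

With this approach the flag never enters the data as a real temperature, so summaries computed before any cleaning step cannot be distorted by it.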
Scraping the Data:
url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
weather_data <- read.table(url, header=FALSE, col.names = c("Month","Day","Year","Temperature"))
kable(head(weather_data))
| Month | Day | Year | Temperature |
|---|---|---|---|
| 1 | 1 | 1995 | 41.1 |
| 1 | 2 | 1995 | 22.2 |
| 1 | 3 | 1995 | 22.8 |
| 1 | 4 | 1995 | 14.9 |
| 1 | 5 | 1995 | 9.5 |
| 1 | 6 | 1995 | 23.8 |
Number of Variables
ncol(weather_data)
## [1] 4
Names of the Variables
names(weather_data)
## [1] "Month" "Day" "Year" "Temperature"
Number of Rows
nrow(weather_data)
## [1] 7963
Number of Rows and Variables
dim(weather_data)
## [1] 7963 4
Structure of the Data
str(weather_data)
## 'data.frame': 7963 obs. of 4 variables:
## $ Month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Day : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Year : int 1995 1995 1995 1995 1995 1995 1995 1995 1995 1995 ...
## $ Temperature: num 41.1 22.2 22.8 14.9 9.5 23.8 31.1 26.9 31.3 31.5 ...
A Few Observations
kable(head(weather_data))
| Month | Day | Year | Temperature |
|---|---|---|---|
| 1 | 1 | 1995 | 41.1 |
| 1 | 2 | 1995 | 22.2 |
| 1 | 3 | 1995 | 22.8 |
| 1 | 4 | 1995 | 14.9 |
| 1 | 5 | 1995 | 9.5 |
| 1 | 6 | 1995 | 23.8 |
kable(tail(weather_data))
|  | Month | Day | Year | Temperature |
|---|---|---|---|---|
| 7958 | 10 | 14 | 2016 | 54.4 |
| 7959 | 10 | 15 | 2016 | 63.2 |
| 7960 | 10 | 16 | 2016 | 68.7 |
| 7961 | 10 | 17 | 2016 | 71.1 |
| 7962 | 10 | 18 | 2016 | 74.4 |
| 7963 | 10 | 19 | 2016 | 75.3 |
Number of Missing Values
sum(weather_data == -99)
## [1] 14
Removing the Missing Values
weather_data[weather_data==-99] <- NA
weather_data <- na.omit(weather_data)
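Since `na.omit()` drops every row that contains any `NA`, it can be worth checking which column holds the missing values and how many rows are lost. The sketch below illustrates one way to do that on a small, invented data frame (`sample_data` is hypothetical and is not the Cincinnati data). It also recodes the flag in the `Temperature` column only, which is a slightly narrower variant of the whole-data-frame replacement above.

```r
# Hypothetical sample with one flagged temperature reading.
sample_data <- data.frame(
  Month = c(1, 1, 1, 1),
  Day = c(1, 2, 3, 4),
  Year = c(1995, 1995, 1995, 1995),
  Temperature = c(41.1, -99, 22.8, 14.9)
)

# Recode the no-data flag in the Temperature column only.
sample_data$Temperature[sample_data$Temperature == -99] <- NA

# Missing values per column, then the number of rows na.omit() removes.
missing_per_column <- colSums(is.na(sample_data))
rows_before <- nrow(sample_data)
clean_data <- na.omit(sample_data)
rows_dropped <- rows_before - nrow(clean_data)

print(missing_per_column)
print(rows_dropped)
```

If `rows_dropped` matches the number of flagged values counted earlier, each dropped row corresponds to exactly one missing temperature.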
Summary of the Data
summary(weather_data)
## Month Day Year Temperature
## Min. : 1.000 Min. : 1.00 Min. :1995 Min. :-2.20
## 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2000 1st Qu.:40.20
## Median : 6.000 Median :16.00 Median :2005 Median :57.10
## Mean : 6.477 Mean :15.71 Mean :2005 Mean :54.73
## 3rd Qu.: 9.000 3rd Qu.:23.00 3rd Qu.:2011 3rd Qu.:70.70
## Max. :12.000 Max. :31.00 Max. :2016 Max. :89.20
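The overall summary pools all months together, so the seasonal pattern is hidden in a single mean. As a rough sketch of a per-month view, base R's `aggregate()` can compute one mean per month. The `sample_data` frame below is invented for illustration, so its numbers say nothing about the real series.

```r
# Hypothetical sample: two winter days and two summer days.
sample_data <- data.frame(
  Month = c(1, 1, 7, 7),
  Temperature = c(20, 30, 70, 80)
)

# Mean temperature for each month, using the formula interface.
monthly_mean <- aggregate(Temperature ~ Month, data = sample_data, FUN = mean)
print(monthly_mean)
```

The same call should work on `weather_data` once the missing values have been removed.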
ggplot(data = weather_data) +
  geom_point(mapping = aes(x = Year, y = Temperature, color = Month))
ggplot(data = weather_data) +
  geom_smooth(mapping = aes(x = Year, y = Temperature))
weather_data$Month <- as.factor(weather_data$Month)
ggplot(data = weather_data) +
  geom_boxplot(mapping = aes(x = Month, y = Temperature, color = Year))
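Plotting against `Year` places every day of a given year at the same x position. One possible refinement, sketched here on invented rows rather than the real data, is to combine the three date fields into a single `Date` column that could then be mapped to the x axis. The column name `Date` and the `sample_data` frame are assumptions made for this example.

```r
# Hypothetical sample rows in the Month / Day / Year / Temperature layout.
sample_data <- data.frame(
  Month = c(1, 1, 12),
  Day = c(1, 2, 31),
  Year = c(1995, 1995, 1995),
  Temperature = c(41.1, 22.2, 30.0)
)

# ISOdate() builds a date-time from year, month and day;
# as.Date() then keeps only the calendar date.
sample_data$Date <- as.Date(
  ISOdate(sample_data$Year, sample_data$Month, sample_data$Day)
)

print(sample_data$Date)
```

A column like this gives each observation its own position on a time axis, which makes the yearly cycle visible in a scatter plot.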