This R Markdown file was created for the Week 3 assignment of the Data Wrangling in R course taught at UC. This week, I scraped the Cincinnati weather data file from the University of Dayton website, computed summary statistics of the data, and created three visualizations of the data.
I used the following packages to create this RMD file:
library(printr) # for proper formatting while printing
library(tidyverse) # for creating visualizations
The file contains the average daily temperature of Cincinnati from 1995 to the present. The University of Dayton sources and regularly updates this data from the Global Summary of the Day (GSOD) database archived by the National Climatic Data Center (NCDC). The average daily temperature is calculated from 24 hourly temperature readings in the GSOD data.
The file has 4 variables:
month: Has the month the observation corresponds to
day: Has the day of the month the observation corresponds to
year: Has the year the observation corresponds to
avg_temp: Has the average daily temperature in degrees Fahrenheit, measured as the mean of 24 hourly temperature readings in the GSOD data
The missing values are represented by -99 in the original file, which I later changed to NA for computational purposes.
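As an alternative to recoding after import, `read.table()` can map the sentinel to NA while reading, through its `na.strings` argument. The sketch below is a minimal illustration on a few made-up lines passed via `text =`, not the actual file. It assumes the sentinel appears literally as `-99` in the raw text; if the file writes it differently (for example `-99.0`), that string would need to be listed as well.

```r
# Made-up sample lines in the same four-column layout (not real data)
raw_lines <- "1 1 1995 41.1
1 2 1995 -99
1 3 1995 22.8"

# na.strings converts any field equal to "-99" into NA during import
toy_data <- read.table(
  text = raw_lines,
  sep = "",
  header = FALSE,
  col.names = c("month", "day", "year", "avg_temp"),
  na.strings = c("NA", "-99")
)

print(toy_data)
```

Note that `na.strings` applies to every column, so this only suits data where -99 cannot be a valid value in any field.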
cincy_url <- "http://academic.udayton.edu/kissock/http/Weather/gsod95-current/OHCINCIN.txt"
cincy_data <- read.table(cincy_url, sep = "", header = FALSE,
                         col.names = c("month", "day", "year", "avg_temp"))
head(cincy_data)
| month | day | year | avg_temp |
|---|---|---|---|
| 1 | 1 | 1995 | 41.1 |
| 1 | 2 | 1995 | 22.2 |
| 1 | 3 | 1995 | 22.8 |
| 1 | 4 | 1995 | 14.9 |
| 1 | 5 | 1995 | 9.5 |
| 1 | 6 | 1995 | 23.8 |
## computations for printing inline in summary statistics below
num_var<-ncol(cincy_data)
num_row<-nrow(cincy_data)
cincy_data$avg_temp[cincy_data$avg_temp==-99] <- NA
n_miss<-sum(is.na(cincy_data))
mean_temp<-mean(cincy_data$avg_temp, na.rm=TRUE)
med_temp<-median(cincy_data$avg_temp, na.rm=TRUE)
min_temp<-min(cincy_data$avg_temp, na.rm=TRUE)
max_temp<-max(cincy_data$avg_temp, na.rm=TRUE)
Number of variables = 4
Number of observations = 7963
Number of missing values = 14
Mean value of avg_temp = 54.7322934
Median value of avg_temp = 57.1
Minimum value of avg_temp = -2.2
Maximum value of avg_temp = 89.2
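The missing-value count above is taken over the whole data frame. A per-column breakdown can confirm that the NAs sit only in avg_temp, and `summary()` reports the same statistics in one call. The following is a minimal sketch on a made-up data frame named `toy_data` (a hypothetical stand-in for `cincy_data`), so the numbers it prints are not results from the real file.

```r
# Made-up stand-in for cincy_data (values are illustrative only)
toy_data <- data.frame(
  month    = c(1, 1, 1, 1),
  day      = c(1, 2, 3, 4),
  year     = c(1995, 1995, 1995, 1995),
  avg_temp = c(41.1, -99, 22.8, -99)
)

# Recode the -99 sentinel to NA, as done for the real data
toy_data$avg_temp[toy_data$avg_temp == -99] <- NA

# Count missing values separately for each column
n_miss_by_col <- colSums(is.na(toy_data))
print(n_miss_by_col)

# Min, quartiles, median, mean, max, and NA count in one call
temp_summary <- summary(toy_data$avg_temp)
print(temp_summary)
```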
The following is a visualization of the average yearly temperature from 1995 to the present:
ggplot(cincy_data, aes(x = factor(year), y = avg_temp)) +
  stat_summary(fun = "mean", geom = "bar", fill = "red") # `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
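The bar heights in this plot are per-year means of avg_temp, computed inside `stat_summary()`. To inspect those values as numbers, the same grouping can be done explicitly with base R's `aggregate()`. This is a minimal sketch on a made-up data frame named `toy_data` (a hypothetical stand-in for `cincy_data`); the formula interface drops rows with NA by default, which mirrors how the plot ignores missing temperatures.

```r
# Made-up stand-in for cincy_data (values are illustrative only)
toy_data <- data.frame(
  year     = c(1995, 1995, 1996, 1996),
  avg_temp = c(40, 50, NA, 60)
)

# Mean of avg_temp within each year; rows with NA are omitted by default
yearly_mean <- aggregate(avg_temp ~ year, data = toy_data, FUN = mean)
print(yearly_mean)
```

Replacing `year` with `month` in the formula gives the values behind a by-month plot in the same way.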
The following is a visualization of the temperature variation by month, based on the daily average temperature data from 1995 to the present:
ggplot(cincy_data, aes(x = factor(month), y = avg_temp)) +
  stat_summary(fun = "mean", geom = "bar", fill = "blue") # `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
The following visualization shows the variation of the average monthly temperature within each year from 1995 to the present:
ggplot(cincy_data, aes(x = factor(month), y = avg_temp)) +
  stat_summary(fun = "mean", geom = "bar", fill = "green", color = "black") + # `fun` replaces the deprecated `fun.y` (ggplot2 >= 3.3.0)
  facet_wrap(~year)