ST406 - Project 2 Data, EDA

P.I.N.Kehelbedda

1/3/2021

1 Introduction

2 Data

This data set contains meteorological data recorded at the HI-SEAS weather station over four months (September through December 2016), between Mission IV and Mission V.

The data set contains columns such as wind direction, wind speed, humidity and temperature. The response variable to be predicted is solar radiation (named “Radiation” in the data). The task is to use these four months of measurements to predict the level of solar radiation. For example, imagine that you own solar energy batteries and want to know whether it will be reasonable to use them in the future.

2.1 Reading data

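library(readr)      # read_csv(), cols(), col_datetime(), col_time()
library(dplyr)      # %>%, filter(), select()
library(ggplot2)    # ggplot()
library(lubridate)  # month(), day()
library(openair)    # polarFreq()
SolarRadPrediction <- read_csv("Data/SolarRadPrediction.csv", 
    col_types = cols(Data = col_datetime(format = "%m/%d/%Y %H:%M:%S %p"), 
        Time = col_time(format = "%H:%M:%S"), 
        TimeSunRise = col_time(format = "%H:%M:%S"), 
        TimeSunSet = col_time(format = "%H:%M:%S")))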

colnames(SolarRadPrediction)[8] <- "WindDirection"
dim(SolarRadPrediction)
## [1] 32686    11
str(SolarRadPrediction)
## tibble [32,686 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ UNIXTime     : num [1:32686] 1.48e+09 1.48e+09 1.48e+09 1.48e+09 1.48e+09 ...
##  $ Data         : POSIXct[1:32686], format: "2016-09-29" "2016-09-29" ...
##  $ Time         : 'hms' num [1:32686] 23:55:26 23:50:23 23:45:26 23:40:21 ...
##   ..- attr(*, "units")= chr "secs"
##  $ Radiation    : num [1:32686] 1.21 1.21 1.23 1.21 1.17 1.21 1.2 1.24 1.23 1.21 ...
##  $ Temperature  : num [1:32686] 48 48 48 48 48 48 49 49 49 49 ...
##  $ Pressure     : num [1:32686] 30.5 30.5 30.5 30.5 30.5 ...
##  $ Humidity     : num [1:32686] 59 58 57 60 62 64 72 71 80 85 ...
##  $ WindDirection: num [1:32686] 177 177 159 138 105 ...
##  $ Speed        : num [1:32686] 5.62 3.37 3.37 3.37 5.62 5.62 6.75 5.62 4.5 4.5 ...
##  $ TimeSunRise  : 'hms' num [1:32686] 06:13:00 06:13:00 06:13:00 06:13:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ TimeSunSet   : 'hms' num [1:32686] 18:13:00 18:13:00 18:13:00 18:13:00 ...
##   ..- attr(*, "units")= chr "secs"
##  - attr(*, "spec")=
##   .. cols(
##   ..   UNIXTime = col_double(),
##   ..   Data = col_datetime(format = "%m/%d/%Y %H:%M:%S %p"),
##   ..   Time = col_time(format = "%H:%M:%S"),
##   ..   Radiation = col_double(),
##   ..   Temperature = col_double(),
##   ..   Pressure = col_double(),
##   ..   Humidity = col_double(),
##   ..   `WindDirection(Degrees)` = col_double(),
##   ..   Speed = col_double(),
##   ..   TimeSunRise = col_time(format = "%H:%M:%S"),
##   ..   TimeSunSet = col_time(format = "%H:%M:%S")
##   .. )
head(SolarRadPrediction)
UNIXTime    Data        Time      Radiation  Temperature  Pressure  Humidity  WindDirection  Speed  TimeSunRise  TimeSunSet
1475229326  2016-09-29  23:55:26  1.21       48           30.46     59        177.39         5.62   06:13:00     18:13:00
1475229023  2016-09-29  23:50:23  1.21       48           30.46     58        176.78         3.37   06:13:00     18:13:00
1475228726  2016-09-29  23:45:26  1.23       48           30.46     57        158.75         3.37   06:13:00     18:13:00
1475228421  2016-09-29  23:40:21  1.21       48           30.46     60        137.71         3.37   06:13:00     18:13:00
1475228124  2016-09-29  23:35:24  1.17       48           30.46     62        104.95         5.62   06:13:00     18:13:00
1475227824  2016-09-29  23:30:24  1.21       48           30.46     64        120.20         5.62   06:13:00     18:13:00

2.2 About data

The data set contains the following variables:

  • UNIXTime: Time stamp in Unix format (seconds since 1970-01-01 00:00:00 UTC).
  • Data: Date in the format yyyy-mm-dd.
  • Time: Local time in the 24-hour format hh:mm:ss.
  • Radiation: Solar radiation in watts per square meter (\(W/m^2 = kg/s^3\)).
  • Temperature: Temperature in degrees Fahrenheit (\(^\circ F\)).
  • Pressure: Barometric pressure in inches of mercury (\(inHg\)).
  • Humidity: Humidity in percent.
  • WindDirection: Wind direction in degrees.
  • Speed: Wind speed in miles per hour (mph).
  • TimeSunRise: Time of sunrise (Hawaii time).
  • TimeSunSet: Time of sunset (Hawaii time).
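For readers who prefer SI units, the conversions from the units listed above can be sketched as follows. The helper names (`f_to_c`, `inhg_to_hpa`, `mph_to_ms`) are hypothetical and not part of the original analysis, and the pressure conversion assumes that the column is recorded in inches of mercury. The sample values come from the first row of the data.

```r
# Hypothetical helpers (not part of the original analysis) that convert
# the imperial units used in the data set to SI units.
f_to_c      <- function(f)    (f - 32) * 5 / 9   # degrees Fahrenheit -> degrees Celsius
inhg_to_hpa <- function(inhg) inhg * 33.8639     # inches of mercury  -> hectopascals
mph_to_ms   <- function(mph)  mph * 0.44704      # miles per hour     -> metres per second

# Values from the first row of the data set
f_to_c(48)
inhg_to_hpa(30.46)
mph_to_ms(5.62)
```

All three helpers are vectorised, so they could be applied to whole columns in the same way.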

I am assuming that the measurement location is Hawaii. Furthermore, the wind direction is taken to be measured in degrees clockwise from north (0 degrees).
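Under the assumption that 0 degrees is north and angles increase clockwise, a direction in degrees can be mapped to a compass sector. The sketch below is only an illustration: `deg_to_compass` is a hypothetical helper, and the choice of eight sectors is arbitrary.

```r
# Hypothetical helper: map a direction in degrees (clockwise from north)
# to one of eight compass sectors. Assumes 0 = N, 90 = E, 180 = S, 270 = W.
deg_to_compass <- function(deg) {
  sectors <- c("N", "NE", "E", "SE", "S", "SW", "W", "NW")
  # Shift by half a sector (22.5 degrees) so that N covers 337.5 to 22.5
  idx <- floor(((deg + 22.5) %% 360) / 45) + 1
  sectors[idx]
}

# Wind directions taken from rows of the data set
deg_to_compass(c(177.39, 3.56, 192.35))
```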

2.3 Data Pre-Processing

2.3.1 Missing values

mean(is.na(SolarRadPrediction))
## [1] 0

There are no missing values, so the data set is complete and no imputation is required.
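The overall proportion of missing values does not show where they occur. If some were present, a per-column count would locate them. The sketch below uses a small toy data frame (`toy`, with made-up gaps) rather than the real data.

```r
# Toy data frame standing in for SolarRadPrediction (gaps are made up)
toy <- data.frame(
  Radiation   = c(1.21, NA, 1.23),
  Temperature = c(48, 48, NA),
  Humidity    = c(59, 58, 57)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Indices of rows with at least one missing value
incomplete_rows <- which(!complete.cases(toy))
incomplete_rows
```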

2.3.2 Dates

3 EDA

3.1 Data

df <- SolarRadPrediction[, -1]   # drop the first column (UNIXTime)
colnames(df)[1] <- "Date"        # rename "Data" to "Date"
colnames(df)[8] <- "WindSpeed"   # rename "Speed" to "WindSpeed"

3.2 Dependent variable Radiation

3.2.1 Radiation

3.2.1.1 Tables

summary(df$Radiation)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.11    1.23    2.66  207.12  354.24 1601.26
# At what time does Radiation reach its highest value?
df %>%
  filter(Radiation == max(Radiation))
Date        Time      Radiation  Temperature  Pressure  Humidity  WindDirection  WindSpeed  TimeSunRise  TimeSunSet
2016-09-04  12:15:04  1601.26    61           30.47     93        3.56           9          06:08:00     18:35:00
# At what time does Radiation reach its lowest value?
df %>%
  filter(Radiation == min(Radiation))
Date        Time      Radiation  Temperature  Pressure  Humidity  WindDirection  WindSpeed  TimeSunRise  TimeSunSet
2016-12-29  02:50:49  1.11       37           30.35     54        192.35         6.75       06:56:00     17:53:00
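The median (2.66) is far below the mean (207.12), which would be consistent with a large share of the readings being taken at night, like the minimum at 02:50:49. One way to check this is to flag each observation as day or night using the sunrise and sunset columns. The sketch below is a minimal illustration on the two rows shown above. `to_secs` and `IsDay` are hypothetical names. With the `hms` columns in `df`, `as.numeric()` should give seconds directly, so the string parsing would not be needed.

```r
# Hypothetical helper: convert "HH:MM:SS" strings to seconds since midnight
to_secs <- function(x) {
  parts <- do.call(rbind, strsplit(x, ":", fixed = TRUE))
  as.numeric(parts[, 1]) * 3600 + as.numeric(parts[, 2]) * 60 + as.numeric(parts[, 3])
}

# The two rows with maximum and minimum radiation, copied from the output above
toy <- data.frame(
  Time        = c("12:15:04", "02:50:49"),
  Radiation   = c(1601.26, 1.11),
  TimeSunRise = c("06:08:00", "06:56:00"),
  TimeSunSet  = c("18:35:00", "17:53:00"),
  stringsAsFactors = FALSE
)

# TRUE when the observation falls between sunrise and sunset
toy$IsDay <- to_secs(toy$Time) >= to_secs(toy$TimeSunRise) &
             to_secs(toy$Time) <= to_secs(toy$TimeSunSet)
toy[, c("Time", "Radiation", "IsDay")]
```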

3.2.1.2 Figures

ggplot(data = df, aes(x = Time, y = Radiation)) +
  geom_line(col = "blue") + labs(title = "Radiation by Time")
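Because all days share the same time-of-day axis, a single line through every observation overlays many days on top of each other. An hourly mean may give a cleaner picture of the daily profile. The sketch below shows the aggregation step on toy values (illustrative numbers, not taken from the data). The column names `secs` and `Hour` are assumptions.

```r
# Toy data: time of day in seconds since midnight and radiation values
# (illustrative numbers, not taken from the data set)
toy <- data.frame(
  secs      = c(300, 3000, 43500, 44100, 46900, 47500),
  Radiation = c(1.2, 1.3, 900, 1100, 800, 600)
)

# Hour of day (0 to 23) derived from seconds since midnight
toy$Hour <- toy$secs %/% 3600

# Mean radiation for each hour of the day
hourly <- aggregate(Radiation ~ Hour, data = toy, FUN = mean)
hourly
```

For the real data, the hour could be derived in the same way from `as.numeric(df$Time) %/% 3600`, since the `Time` column stores seconds.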

3.3 Handle Date variable

# Get months
m <- month(df$Date)
table(m)
## m
##    9   10   11   12 
## 7417 8821 8284 8164
# Get day of the month
d <- day(df$Date)
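Month numbers are harder to read in tables and plots than month names. A minimal sketch of turning them into an ordered factor with base R is shown below. The vector `m` here is a short toy stand-in for the real one, and `month_label` is a hypothetical name.

```r
# Toy month numbers standing in for month(df$Date)
m <- c(9, 10, 11, 12, 9, 12)

# Ordered factor with full month names; month.name is a base R constant
month_label <- factor(month.name[m], levels = month.name[9:12], ordered = TRUE)
table(month_label)
```

Fixing the levels to September through December keeps the months in calendar order rather than alphabetical order.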

3.4 Additional Tables

summary(df[, c(3,4,5,6,8)])
##    Radiation        Temperature      Pressure        Humidity     
##  Min.   :   1.11   Min.   :34.0   Min.   :30.19   Min.   :  8.00  
##  1st Qu.:   1.23   1st Qu.:46.0   1st Qu.:30.40   1st Qu.: 56.00  
##  Median :   2.66   Median :50.0   Median :30.43   Median : 85.00  
##  Mean   : 207.12   Mean   :51.1   Mean   :30.42   Mean   : 75.02  
##  3rd Qu.: 354.24   3rd Qu.:55.0   3rd Qu.:30.46   3rd Qu.: 97.00  
##  Max.   :1601.26   Max.   :71.0   Max.   :30.56   Max.   :103.00  
##    WindSpeed     
##  Min.   : 0.000  
##  1st Qu.: 3.370  
##  Median : 5.620  
##  Mean   : 6.244  
##  3rd Qu.: 7.870  
##  Max.   :40.500

3.5 Additional Figures

# Wind speed and wind direction frequencies
# polarFreq() expects columns named ws, wd and date
df1 <- df %>%
  select(ws = WindSpeed, wd = WindDirection, date = Date)
# All observations
polarFreq(mydata = df1, cols = "jet")

# Split by weekday
polarFreq(mydata = df1, cols = "jet", type = "weekday")

# Split by weekday and season
# polarFreq(mydata = df1, cols = "jet", type = c("weekday", "season"))

#Correlations
#numData <- df[, -c(1,2,9,10)]
#ggpairs(numData)
dfd <- df # copy, so that df itself stays unchanged

# Store month and day as columns so that ggplot can find them in dfd
dfd$Month <- month(dfd$Date)
dfd$Day <- day(dfd$Date)

ggplot(dfd, aes(factor(Month), Radiation)) + 
  geom_boxplot(aes(fill = factor(Month))) +
  ggtitle("Boxplot of Radiation values for each month") +
  scale_x_discrete(labels = c("September", "October", "November", "December")) +
  scale_fill_discrete(name = "Months", 
                      labels = c("September", "October", "November", "December")) +
  xlab("Month")
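A boxplot shows the median and spread of each month visually. The same quantities can also be tabulated. The sketch below computes the median and interquartile range per month on toy values (illustrative numbers only). The names `toy`, `med` and `iqr` are assumptions and not part of the original analysis.

```r
# Toy data: month number and radiation (illustrative values only)
toy <- data.frame(
  Month     = c(9, 9, 9, 10, 10, 10),
  Radiation = c(1.2, 300, 900, 1.3, 200, 700)
)

# Median and interquartile range of radiation for each month
med <- tapply(toy$Radiation, toy$Month, median)
iqr <- tapply(toy$Radiation, toy$Month, IQR)

data.frame(Month = names(med), Median = as.numeric(med), IQR = as.numeric(iqr))
```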

# df <- SolarRadPrediction
# # any(is.na(df))
# 
# getDate<-function(x,pos1,pos2){
#   if(pos1==1){
#     val<-as.numeric(strsplit(strsplit(as.character(x)," ")[[1]][pos1],'/')[[1]][pos2])
#   }
#   else if(pos1==3 & pos2==0){
#     val<-as.factor(strsplit(strsplit(as.character(x)," ")[[1]][pos1],'/')[[1]])
#   }
#   return(val)
# }
# 
# getTIME<-function(x,pos){
#   val<-strsplit(as.character(x),":")[[1]][pos]
#   return(as.numeric(val))
# }
# df$Month <- sapply(df$Data, getDate,1,1)
# df$Day <- sapply(df$Data,getDate,1,2)
# df$Year <- sapply(df$Data,getDate,1,3)
# df$TimeAbbr <- sapply(df$Data,getDate,3,0)
# df$hour <- sapply(df$Time,getTIME,1)
# df$minute <- sapply(df$Time,getTIME,2)
# df$sec <- sapply(df$Time,getTIME,3)
# mymonths <- c("January","February",
#               "March","April","May","June","July",
#               "August","September","October","November","December")
# 
# df$MonthAbb <- mymonths[ df$Month ]
# df$ordered_month <- factor(df$MonthAbb, levels = month.name)
# 
# df$DateTs<-as.POSIXct(paste0(df$Year,'-',
#                              df$Month,'-',
#                              df$Day,' ',
#                              as.character(df$Time)),
#                       format="%Y-%m-%d %H:%M:%S")
# 
# df$DailyTs <- as.POSIXct(as.character(df$Time), format="%H:%M:%S")
# 
# df$DiffTime<-as.numeric(difftime(as.POSIXct(paste0(df$Year,
#                                                    '-',df$Month,
#                                                    '-',df$Day,' ',
#                                                    as.character(df$TimeSunSet)), 
#                                             format="%Y-%m-%d %H:%M:%S"),
#   
#                                  as.POSIXct(paste0(df$Year,'-',
#                                                    df$Month,'-',
#                                                    df$Day,' ',
#                                                    as.character(df$TimeSunRise)), 
#                                             format="%Y-%m-%d %H:%M:%S"),
#                                  units='sec'))
# plot
# ggplot(data=df,aes(x=Radiation,fill=ordered_month)) + 
#   geom_histogram(bins=100) + 
#   scale_y_log10() + 
#   scale_fill_manual(name="",values=rainbow(4)) + 
#   theme(legend.position='top') + 
#   facet_wrap(~ordered_month) +
#   xlab('Radiation level [W/m^2]') + ylab('Count')
# 
# # plot
# ggplot(data=df,aes(x=DiffTime,y=Radiation)) + 
#   geom_point(aes(color=ordered_month)) +
#   scale_color_manual(name="",values=rainbow(4)) + 
#   theme(legend.position='top') + xlab("SunSet -SunRise [sec]")
# 
# # plot
# df %>% select(ordered_month,Day,Radiation) %>% 
#   group_by(ordered_month,Day) %>% 
#   summarise(dailyRad = mean(Radiation)) %>% 
#   ggplot(aes(x=ordered_month,y=dailyRad,color=dailyRad)) + 
#   scale_color_gradientn(colours=rev(brewer.pal(10,'Spectral'))) + 
#   geom_boxplot(colour='black',size=.4,alpha=.5) + 
#   geom_jitter(shape=16,width=.2,size=2) +
#   xlab('') + ylab('') + theme(legend.position='top')