library(dplyr)
library(lubridate)
library(magrittr)
library(readr)
library(readxl)
library(tidyr)
library(outliers)
library(graphics)
Data pre-processing is a process of importing data, understanding, tidy and manipulate, scan and transform. We have considered the Mount Rainier Weather and Climbing data for processing. There are three datasets - climbing statistics, Rainier weather and factor. The climbing statistics dataset has five variables and 4077 observations, it states the attempted,succeeded, sucess percentage, route, date of people climbing the Mt Rainier for each day. The Rainier weather dataset provides information of the day to day weather conditions on Mt Rainier and has 7 variables with 464 observations.We have the information on the date, battery voltage, temperature, humidity, wind direction, wind speed and solar radiation. The last dataset is the Factor dataset states the difficulty level for each route for the mountain climbing with 2 variables and 25 observations.
After importing these data sets, we have transformed date variable from climbing statistics and Rainier weather datasets so that it is easier to combine the datasets. After joining the two data sets, we have factorized it. We have investigated this data set to check whether the data was tidy or not. We have created a new variable to extract the month. Then we have scanned the data set for missing values, inconsistencies and obvious errors. Further, we have scanned all the numeric data for outliers. For this step we have used boxplot which we have removed the outliers for the data set finally, the transformation of the data to normal distribution.
This project will provide the pathway to identify the relationship between temperature and success rate. And also, this gives a brief idea about the relationship between temperature and humidity, which can be said that when temperature increases the humidity will relatively increase provided the availability of water (sea level). This will provide evidence for the decrease in humidity. The main advantage will be that while using regression analysis, statistical calculation gives a inference .Therefore, we have transformed variables to follow the normal distribution which is one the main assumptions for linear regression analysis to prove the relationship between temperature and humidity. The major impact of this investigation would be analysing the success rate of climbing the Mt Rainier over months.
We extracted the datasets from https://www.kaggle.com/codersree/mount-rainier-weather-and-climbing-data. There are three datasets where we perform the preprocessing steps. The purpose of the report is, we try to figure out which month of the year has majority of successes in climbing the Mt Rainier and further the weather dataset can help us analyse the reasons for the successes and failure rates for that month. Now, to proceed with this we have to join all the datasets with some common variable. However, before combining the data we are to change the format for date variable.
climbing_statistics <- read_csv("climbing_statistics.csv")
Parsed with column specification:
cols(
Date = [31mcol_character()[39m,
Route = [31mcol_character()[39m,
Attempted = [32mcol_double()[39m,
Succeeded = [32mcol_double()[39m,
`Success Percentage` = [32mcol_double()[39m
)
head(climbing_statistics)
Rainier_Weather <- read_csv("Rainier_Weather.csv")
Parsed with column specification:
cols(
Date = [31mcol_character()[39m,
`Battery Voltage AVG` = [32mcol_double()[39m,
`Temperature AVG` = [32mcol_double()[39m,
`Relative Humidity AVG` = [32mcol_double()[39m,
`Wind Speed Daily AVG` = [32mcol_double()[39m,
`Wind Direction AVG` = [32mcol_double()[39m,
`Solare Radiation AVG` = [32mcol_double()[39m
)
head(Rainier_Weather)
Factor <- read_excel("Factor.xlsx")
head(Factor)
NA
For the climbing statistics and Rainier weather datasets, we have observed that the observations for date variable are classified as character. We transform the date (character variable to date format) and then we join the two datasets on the date variable. We use the factor dataset and joined the main dataset with the route variable. The new dataset main1 has 1895 observations with 12 variables. Out of the 12 variables, 1 is in the date format, route and difficulty levels are in character format and rest are in numeric format. We converted the difficulty level variable to factors and labeled it.
str(climbing_statistics)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 4077 obs. of 5 variables:
$ Date : chr "11/27/2015" "11/21/2015" "10/15/2015" "10/13/2015" ...
$ Route : chr "Disappointment Cleaver" "Disappointment Cleaver" "Disappointment Cleaver" "Little Tahoma" ...
$ Attempted : num 2 3 2 8 2 10 2 2 2 2 ...
$ Succeeded : num 0 0 0 0 0 0 0 0 0 0 ...
$ Success Percentage: num 0 0 0 0 0 0 0 0 0 0 ...
- attr(*, "spec")=
.. cols(
.. Date = [31mcol_character()[39m,
.. Route = [31mcol_character()[39m,
.. Attempted = [32mcol_double()[39m,
.. Succeeded = [32mcol_double()[39m,
.. `Success Percentage` = [32mcol_double()[39m
.. )
str(Rainier_Weather)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 464 obs. of 7 variables:
$ Date : chr "12/31/2015" "12/30/2015" "12/29/2015" "12/28/2015" ...
$ Battery Voltage AVG : num 13.8 13.8 13.8 13.7 13.4 ...
$ Temperature AVG : num 19.06 14.63 6.61 8.69 14.14 ...
$ Relative Humidity AVG: num 21.9 18.5 34.1 70.6 95.8 ...
$ Wind Speed Daily AVG : num 21.98 3.54 0 0 0 ...
$ Wind Direction AVG : num 62.3 121.5 130.3 164.7 268.5 ...
$ Solare Radiation AVG : num 84.9 86.2 85.1 86.2 31.1 ...
- attr(*, "spec")=
.. cols(
.. Date = [31mcol_character()[39m,
.. `Battery Voltage AVG` = [32mcol_double()[39m,
.. `Temperature AVG` = [32mcol_double()[39m,
.. `Relative Humidity AVG` = [32mcol_double()[39m,
.. `Wind Speed Daily AVG` = [32mcol_double()[39m,
.. `Wind Direction AVG` = [32mcol_double()[39m,
.. `Solare Radiation AVG` = [32mcol_double()[39m
.. )
str(Factor)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 25 obs. of 2 variables:
$ Route : chr "Curtis RIngraham Directge" "Disappointment Cleaver" "Edmonds HW" "Emmons-Winthrop" ...
$ Difficulty level: chr "IV" "II" "II" "II" ...
climbing_statistics$Date <- mdy(climbing_statistics$Date)
Rainier_Weather$Date <- mdy(Rainier_Weather$Date)
main <- climbing_statistics %>% inner_join(Rainier_Weather,by = "Date")
main1 <- main %>% left_join(Factor, by = "Route")
main1 %<>% subset(,c(1:10,12))
head(main1)
main1$`Difficulty level` <- as.factor(main1$`Difficulty level`)
main1$`Difficulty level` <- factor(main1$`Difficulty level`, levels = c("II","III","IV","V+","N/A"),
labels = c("Easy","Moderate","More difficult","Most difficult","Not applicable"),ordered = TRUE)
head(main1$`Difficulty level`)
[1] Easy Easy Easy Moderate Easy Easy
Levels: Easy < Moderate < More difficult < Most difficult < Not applicable
After collating the data the next task was to check if the data fulfils the conditions of tidy data set. The main1 dataset we note that each column has a unique variable, every observation has its own row and every value has its own cell too. In conclusion we can say that the dataset is now Tidy and we can proceed with the further steps of preprocessing.
As we are preparing the dataset for the testing monthly success rate we used the mutate function to extract the months and create a new variable from it. We also factorize the months data.
main1 %<>% mutate(Month = month(Date))
main1$Month <- as.factor(main1$Month)
After assuring the dataset is tidy and adding the new variable we will test the dataset for missing values, inconsistencies and obvious errors. To do the infinite and NaN testing for the data frame we use the sapply function. The result of the testing is that there are no null, infinite or not a number in the dataset.
colSums(is.na(main1))
Date Route Attempted Succeeded
0 0 0 0
Success Percentage Battery Voltage AVG Temperature AVG Relative Humidity AVG
0 0 0 0
Wind Speed Daily AVG Wind Direction AVG Difficulty level Month
0 0 27 0
colSums(sapply(main1,FUN = is.infinite))
Date Route Attempted Succeeded
0 0 0 0
Success Percentage Battery Voltage AVG Temperature AVG Relative Humidity AVG
0 0 0 0
Wind Speed Daily AVG Wind Direction AVG Difficulty level Month
0 0 0 0
colSums(sapply(main1,FUN = is.nan))
Date Route Attempted Succeeded
0 0 0 0
Success Percentage Battery Voltage AVG Temperature AVG Relative Humidity AVG
0 0 0 0
Wind Speed Daily AVG Wind Direction AVG Difficulty level Month
0 0 0 0
We used the boxplot function to plot our outliers for each of the numeric variables and with the “out” function from the outlier package the number of outliers were determined. Out of all the different numeric variables we encountered outliers in the ‘battery voltage’, ‘no. of success’ and ‘wind speed daily average’ To filter the outliers for no. of success we filtered out the vaues which were greater than 10.5, for battery voltage avg outliers were beyond both the min and max ends and hence we filered it by defining the quantiles. Wind speed daily average had outliers whose values were below 30.84 and hence the were filtered out. We ignored the two outliers which appear after filtering the data as they were only reflected after the filtering was done and hence they cannot be considered as outliers. We did not observe any outliers for ‘attempted’, ‘average temperature’, ‘avgerage humidity’, ‘relative humidity’ and ‘average wind direction’.
#variable 1
boxplot(main1$Attempted, main = "Attempted") #boxplot for no. of people who attempted to climb Mt Rainier
#variable 2
success <- boxplot(main1$Succeeded, main ="No. of success") #boxplot for no. of People successful in climbing Mt Rainier
success$out #count the no. of outliers
[1] 12 11 11 12 11 12 12 12 12 12 11 12 11 12 71 12 11 11 11 11 12 12 11 11 12 12 12 11 11 11 12 12 12
[34] 11 12 12 12 12 11 12 12 12 12 12 11 11 12 11 11 12 12 12 11 11 11 11 12 12 12 12 11 11 11 12 12 11
[67] 12 12 12 11 11 12 12 11 11 11 11 11 11
main1 %<>% filter(Succeeded < 10.5) #filtering out the outliers
#variable 3
boxplot(main1$`Success Percentage`, main = "Success percentage")
#variable 4
boxplot(main1$`Battery Voltage AVG`, main = "Battery voltage") #boxplot for avg. Battery voltage
quantiles <- quantile( main1$`Battery Voltage AVG`, c(.25, 0.75),na.rm = TRUE ) #quantiles for avg. battery voltage
main1 %<>% filter((quantiles[1] - 1.5*(quantiles[2]-quantiles[1]))<`Battery Voltage AVG`)
main1 %<>% filter((quantiles[2] + 1.5*(quantiles[2]-quantiles[1]))>`Battery Voltage AVG`) #filtering the outliers
boxplot(main1$`Battery Voltage AVG`, main = "Battery voltage") #boxplot to re-check for outliers
#variable 5
boxplot(main1$`Temperature AVG`, main = "Average Temperature") #boxplot for avg. temperature
#variable 6
boxplot(main1$`Relative Humidity AVG`, main = "Average Humidity") #boxplot for avg. relative humidity
#variable 7
winddaily <- boxplot(main1$`Wind Speed Daily AVG`, main = "Average daily wind speed") #boxplot for avg. wind speed
daily <- winddaily$out #count the no. of outliers
min(daily) #finding the min value to filter the outliers
[1] 30.84
main1 %<>% filter(`Wind Speed Daily AVG`<30.84) #filtering the outliers whose values are lower than the min value
boxplot(main1$`Wind Speed Daily AVG`, main = "Average daily wind speed") #ignore the two outliers as they were not shown before
#variable 8
boxplot(main1$`Wind Direction AVG`, main = "Average wind direction") #boxplot for avg. wind direction
table(main1$Month)
1 2 3 4 5 6 7 8 9 10 11 12
12 17 13 20 197 552 500 290 107 9 2 0
For carrying out the tests to check the dependency of the success rates on the time of the year, climatic conditions and to test the relation of the variables like humidity vs temperture we need to make sure that the data is in normal form or the histogram of the data follows normal distribution and hence we transform the temperature and humidity data to normal form. It was observed that the temperature variable was left skewed and to transform the left skewness we use square of the variable values but after trail and error we concluded to use power of 1.75 to transform the temperature variable. The histogram for humidity average was slightly skewed to the right and hence we used the square root method to obtain normality for that data. Normalising the temperature average and humidity average data we use scatterplot to test the linearity. It is clear from the scatterplot that with increase in the tempreture we notice a drop in humidity.
#Transormation for tempertaure and humidity
hist(main1$`Temperature AVG`, main = "Histogram of temperature")
temp2 <- (main1$`Temperature AVG`)^1.75
hist(temp2, main = "Histogram of transformed Temperature avg.")
hist(main1$`Relative Humidity AVG`, main = "Histogram of humidity")
humid2 <- sqrt(main1$`Relative Humidity AVG`)
hist(humid2, main = "Histogram of transformed Relative humidity avg.")
plot(temp2,humid2, main = "Relationship between Humidity and Temperature", xlab = "Temperature",
ylab = "Humidity")
The main1 data is preprocessed and is ready to be used for the Hypothesis testing. The change in the data will not affect the preprocessing steps. Irrespective of the update in the data (year on year) we will follow the same steps, this is one of the advantages of the process.