Data Preparation

# load data
library(RCurl)

weather_data <- getURL("https://raw.githubusercontent.com/josephsimone/DATA607/master/ww-ii-data..csv")
weather.raw <- read.csv(text = weather_data)
head(weather.raw, 5)
##     STA     Date Precip WindGustSpd  MaxTemp  MinTemp MeanTemp Snowfall
## 1 10001 7/1/1942  1.016          NA 25.55556 22.22222 23.88889        0
## 2 10001 7/2/1942      0          NA 28.88889 21.66667 25.55556        0
## 3 10001 7/3/1942   2.54          NA 26.11111 22.22222 24.44444        0
## 4 10001 7/4/1942   2.54          NA 26.66667 22.22222 24.44444        0
## 5 10001 7/5/1942      0          NA 26.66667 21.66667 24.44444        0
##   PoorWeather YR MO DA PRCP DR SPD MAX MIN MEA SNF SND FT FB FTI ITH PGT
## 1             42  7  1 0.04 NA  NA  78  72  75   0  NA NA NA  NA  NA  NA
## 2             42  7  2    0 NA  NA  84  71  78   0  NA NA NA  NA  NA  NA
## 3             42  7  3  0.1 NA  NA  79  72  76   0  NA NA NA  NA  NA  NA
## 4             42  7  4  0.1 NA  NA  80  72  76   0  NA NA NA  NA  NA  NA
## 5             42  7  5    0 NA  NA  80  71  76   0  NA NA NA  NA  NA  NA
##   TSHDSBRSGF SD3 RHX RHN RVG WTE
## 1             NA  NA  NA  NA  NA
## 2             NA  NA  NA  NA  NA
## 3             NA  NA  NA  NA  NA
## 4             NA  NA  NA  NA  NA
## 5             NA  NA  NA  NA  NA

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

Is there a relationship between the daily minimum and maximum temperature? Can you predict the maximum temperature given the minimum temperature?

Cases

What are the cases, and how many are there?

Each case is one day's weather record at a single station. At first glance, the information in this dataset includes precipitation, snowfall, temperatures, wind speed, and whether the day included thunderstorms or other poor weather conditions. There are 31 variables in this dataset; however, many of them are empty (NA). Therefore, I will be eliminating them from my dataset for analysis. The variables that I will be keeping include the station number, date, precipitation, wind gust speed, maximum, minimum, and mean temperature, and snowfall.
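The plan to drop empty columns could be checked with a per-column missing-value share. The following is a minimal sketch, not the analysis itself: `toy` is a hypothetical stand-in built from three columns of the `head()` output above, and the 50% cutoff is an assumption chosen only for illustration.

```r
# Toy stand-in for weather.raw: values copied from the head() output above.
toy <- data.frame(
  MaxTemp     = c(25.55556, 28.88889, 26.11111, 26.66667, 26.66667),
  MinTemp     = c(22.22222, 21.66667, 22.22222, 22.22222, 21.66667),
  WindGustSpd = c(NA, NA, NA, NA, NA)
)

# Share of missing values in each column (0 = complete, 1 = entirely NA)
na_share <- colMeans(is.na(toy))
print(na_share)

# Keep only columns that are less than 50% missing (cutoff is an assumption)
keep <- names(na_share)[na_share < 0.5]
toy_clean <- toy[, keep, drop = FALSE]
print(names(toy_clean))
```

On the full data, the same two steps would be applied to `weather.raw`. Note that blank text fields (such as `PoorWeather` in the output above) are empty strings rather than NA, so they would need a separate check.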

Data collection

Describe the method of data collection.

The dataset contains 1940–1945 data for 162 stations outside of the United States. The actual period of data availability varies depending upon the station’s activity. Many stations in the European and Pacific theaters of operation are included.
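The station count and date range quoted from the source could be compared against the downloaded file. This is a sketch under assumptions: `toy` is a hypothetical stand-in holding only the five rows shown in the `head()` output, and the `Date` column is assumed to be in month/day/year order, which matches the `MO` and `DA` columns in that output.

```r
# Toy stand-in for weather.raw: STA and Date copied from the head() output above.
toy <- data.frame(
  STA  = c(10001, 10001, 10001, 10001, 10001),
  Date = c("7/1/1942", "7/2/1942", "7/3/1942", "7/4/1942", "7/5/1942"),
  stringsAsFactors = FALSE
)

# Number of distinct stations
n_stations <- length(unique(toy$STA))

# Parse month/day/year strings and find the period they cover
dates <- as.Date(toy$Date, format = "%m/%d/%Y")
date_range <- range(dates, na.rm = TRUE)

print(n_stations)
print(date_range)
```

Run on `weather.raw` instead of `toy`, the two results could be set against the 162 stations and the 1940–1945 period stated in the description.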

Type of study

What type of study is this (observational/experiment)?

This is an observational study: the data are a collection of weather conditions recorded each day at various weather stations around the world.

Data Source

If you collected the data, state self-collected. If not, provide a citation/link.

“World War II Era Data.” National Climatic Data Center, www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/world-war-ii-era-data.

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable is the daily maximum temperature: given the minimum temperature, can you predict the maximum? This is a quantitative variable.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

The variables that I will be using for this linear regression analysis are the minimum and maximum temperatures. The minimum temperature is the quantitative predictor, and the maximum temperature is the response.
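A simple linear regression of maximum on minimum temperature could look like the sketch below. It is illustrative only: `toy` is a hypothetical stand-in with the five rows from the `head()` output above (a single station in July 1942), so the fitted coefficients say nothing about the full dataset, and the new minimum temperature of 22 is an invented value.

```r
# Toy stand-in for weather_df: five rows copied from the head() output above.
toy <- data.frame(
  MinTemp = c(22.22222, 21.66667, 22.22222, 22.22222, 21.66667),
  MaxTemp = c(25.55556, 28.88889, 26.11111, 26.66667, 26.66667)
)

# Fit MaxTemp = b0 + b1 * MinTemp by ordinary least squares
fit <- lm(MaxTemp ~ MinTemp, data = toy)
print(coef(fit))

# Predict the maximum temperature for a hypothetical day with MinTemp = 22
new_day <- data.frame(MinTemp = 22)
pred <- predict(fit, newdata = new_day)
print(pred)
```

With the real data, `summary(fit)` would report the slope, its standard error, and R-squared, which speak directly to the research question.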

Relevant summary statistics

Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

names(weather.raw)
##  [1] "STA"         "Date"        "Precip"      "WindGustSpd" "MaxTemp"    
##  [6] "MinTemp"     "MeanTemp"    "Snowfall"    "PoorWeather" "YR"         
## [11] "MO"          "DA"          "PRCP"        "DR"          "SPD"        
## [16] "MAX"         "MIN"         "MEA"         "SNF"         "SND"        
## [21] "FT"          "FB"          "FTI"         "ITH"         "PGT"        
## [26] "TSHDSBRSGF"  "SD3"         "RHX"         "RHN"         "RVG"        
## [31] "WTE"
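# keep station, date, and temperatures (MaxTemp/MinTemp/MeanTemp are Celsius; MAX/MIN/MEA are the same readings in Fahrenheit)
weather_df <- subset(weather.raw, select = c(STA, Date, MaxTemp, MinTemp, MeanTemp, MAX, MIN, MEA))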
head(weather_df, 5)
##     STA     Date  MaxTemp  MinTemp MeanTemp MAX MIN MEA
## 1 10001 7/1/1942 25.55556 22.22222 23.88889  78  72  75
## 2 10001 7/2/1942 28.88889 21.66667 25.55556  84  71  78
## 3 10001 7/3/1942 26.11111 22.22222 24.44444  79  72  76
## 4 10001 7/4/1942 26.66667 22.22222 24.44444  80  72  76
## 5 10001 7/5/1942 26.66667 21.66667 24.44444  80  71  76
summary(weather_df$MaxTemp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -33.33   25.56   29.44   27.05   31.67   50.00
boxplot(weather_df$MaxTemp)

# a histogram shows the distribution; barplot() would draw one bar per observation
hist(weather_df$MaxTemp)

summary(weather_df$MinTemp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -38.33   15.00   21.11   17.79   23.33   34.44
# a histogram shows the distribution; barplot() would draw one bar per observation
hist(weather_df$MinTemp)

summary(weather_df$MeanTemp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -35.56   20.56   25.56   22.41   27.22   40.00
# a histogram shows the distribution; barplot() would draw one bar per observation
hist(weather_df$MeanTemp)
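Because the research question is about the relationship between minimum and maximum temperature, a scatter plot and a correlation coefficient are natural companions to the single-variable summaries. The sketch below uses a hypothetical `toy` data frame with the five rows from the `head()` output above, so the plot and the number it prints are illustrative only.

```r
# Toy stand-in for weather_df: five rows copied from the head() output above.
toy <- data.frame(
  MinTemp = c(22.22222, 21.66667, 22.22222, 22.22222, 21.66667),
  MaxTemp = c(25.55556, 28.88889, 26.11111, 26.66667, 26.66667)
)

# Pearson correlation, ignoring rows where either value is missing
r <- cor(toy$MinTemp, toy$MaxTemp, use = "complete.obs")
print(r)

# Scatter plot of the predictor (x) against the response (y)
plot(toy$MinTemp, toy$MaxTemp,
     xlab = "Minimum temperature (C)",
     ylab = "Maximum temperature (C)",
     main = "Daily maximum vs. minimum temperature")
```

Replacing `toy` with `weather_df` would give the same view for the full dataset.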