Harold Nelson
10/18/2019
You are in a field where data exists.
You want to answer questions using the data.
Data Analysis builds on statistics.
Data Analysis covers what you need to do statistics in the real world or graduate school.
It is much more exploratory and visual than inferential.
Learn the language R and the use of the RStudio IDE.
Learn to import data and deal with anomalies.
Learn to modify data.
Learn to visualize data.
Learn to build predictive models.
Learn to produce presentable documents, which is easy in RStudio.
The only prerequisite is an introductory statistics course like MTH 201.
No prior programming experience is required. R can be your first programming language.
The current catalog specifies CSC 101 as a prerequisite, but this will be waived.
You will need some maturity in your thinking, but this is hard to define in terms of coursework.
All instructional material is free.
Textbooks are free on the internet.
Datacamp courses are free during the class.
Software is all available in the cloud and free.
Software can be installed on your personal computer (Mac or PC).
Let’s see what you can do with the weather. I’ll use this data to answer a few questions.
I obtained data from NOAA using https://www.ncdc.noaa.gov/. The data from NOAA is in the form of a CSV file. After a little cleaning up in Excel, it’s ready to be imported into R.
Look at the data with a few simple commands and understand what it means. Does it pass some obvious validity tests?
## Observations: 28,649
## Variables: 5
## $ DATE <date> 1941-05-13, 1941-05-14, 1941-05-15, 1941-05-16, 1941-05-17…
## $ PRCP <dbl> 0.00, 0.00, 0.30, 1.08, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ SNOW <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ TMAX <dbl> 66, 63, 58, 55, 57, 59, 58, 65, 68, 85, 84, 75, 72, 59, 61,…
## $ TMIN <dbl> 50, 47, 44, 45, 46, 39, 40, 50, 42, 46, 46, 50, 41, 37, 48,…
## DATE PRCP SNOW TMAX
## Min. :1941-05-13 Min. :0.0000 Min. : 0.00 Min. : 18.00
## 1st Qu.:1960-12-21 1st Qu.:0.0000 1st Qu.: 0.00 1st Qu.: 50.00
## Median :1980-07-31 Median :0.0000 Median : 0.00 Median : 59.00
## Mean :1980-07-31 Mean :0.1362 Mean : 0.04 Mean : 60.56
## 3rd Qu.:2000-03-10 3rd Qu.:0.1400 3rd Qu.: 0.00 3rd Qu.: 71.00
## Max. :2019-10-19 Max. :4.8200 Max. :14.20 Max. :104.00
## NA's :3 NA's :5407 NA's :11
## TMIN
## Min. :-8.00
## 1st Qu.:33.00
## Median :40.00
## Mean :39.82
## 3rd Qu.:47.00
## Max. :69.00
## NA's :11
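The listing above has the shape of output from glimpse() and summary(). A minimal sketch of the import-and-inspect step follows, assuming the readr and dplyr packages. The file written here is a temporary stand-in holding only the first five rows shown above, and the three validity checks are illustrative choices rather than the checks used in the original analysis.

```r
library(readr)
library(dplyr)

# Stand-in for the cleaned NOAA file: only the first five rows shown above.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "DATE,PRCP,SNOW,TMAX,TMIN",
  "1941-05-13,0.00,0,66,50",
  "1941-05-14,0.00,0,63,47",
  "1941-05-15,0.30,0,58,44",
  "1941-05-16,1.08,0,55,45",
  "1941-05-17,0.06,0,57,46"
), csv_path)

# read_csv() guesses column types; ISO-formatted dates are parsed as <date>.
olywthr <- read_csv(csv_path, show_col_types = FALSE)

glimpse(olywthr)  # one line per column: name, type, first values
summary(olywthr)  # ranges, quartiles, and NA counts per column

# Illustrative validity checks: dates strictly increasing,
# no negative precipitation, and TMAX never below TMIN.
stopifnot(
  all(diff(as.numeric(olywthr$DATE)) > 0),
  all(olywthr$PRCP >= 0, na.rm = TRUE),
  all(olywthr$TMAX >= olywthr$TMIN, na.rm = TRUE)
)
```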
For my purposes, I want direct access to the components of the date. Those are easy to extract using some features from the lubridate package.
## Observations: 28,649
## Variables: 8
## $ DATE <date> 1941-05-13, 1941-05-14, 1941-05-15, 1941-05-16, 1941-05-17…
## $ PRCP <dbl> 0.00, 0.00, 0.30, 1.08, 0.06, 0.00, 0.00, 0.00, 0.00, 0.00,…
## $ SNOW <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ TMAX <dbl> 66, 63, 58, 55, 57, 59, 58, 65, 68, 85, 84, 75, 72, 59, 61,…
## $ TMIN <dbl> 50, 47, 44, 45, 46, 39, 40, 50, 42, 46, 46, 50, 41, 37, 48,…
## $ yr <dbl> 1941, 1941, 1941, 1941, 1941, 1941, 1941, 1941, 1941, 1941,…
## $ mo <dbl> 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6,…
## $ dy <int> 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27,…
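The three new columns above can be produced with the lubridate accessors year(), month(), and day(). This is a sketch on a three-row stand-in built from the first dates in the listing, not necessarily the exact code behind it.

```r
library(dplyr)
library(lubridate)

# Three-row stand-in using the first dates and TMAX values shown above.
olywthr <- tibble(
  DATE = as.Date(c("1941-05-13", "1941-05-14", "1941-05-15")),
  TMAX = c(66, 63, 58)
)

# Add the year, month, and day of month as separate columns.
olywthr <- olywthr %>%
  mutate(
    yr = year(DATE),
    mo = month(DATE),
    dy = day(DATE)
  )

glimpse(olywthr)
```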
Today is October 29, 2019.
Let’s extract all of the historical data for this month and day.
Let’s run a summary of all of the variables in our smaller dataframe.
## DATE PRCP SNOW TMAX
## Min. :1941-10-29 Min. :0.0000 Min. :0 Min. :44.00
## 1st Qu.:1961-01-28 1st Qu.:0.0000 1st Qu.:0 1st Qu.:52.00
## Median :1980-04-29 Median :0.0200 Median :0 Median :55.00
## Mean :1980-04-29 Mean :0.1465 Mean :0 Mean :55.24
## 3rd Qu.:1999-07-29 3rd Qu.:0.1800 3rd Qu.:0 3rd Qu.:58.00
## Max. :2018-10-29 Max. :1.6400 Max. :0 Max. :68.00
## NA's :15
## TMIN yr mo dy
## Min. :21.00 Min. :1941 Min. :10 Min. :29
## 1st Qu.:32.00 1st Qu.:1960 1st Qu.:10 1st Qu.:29
## Median :39.50 Median :1980 Median :10 Median :29
## Mean :37.97 Mean :1980 Mean :10 Mean :29
## 3rd Qu.:43.00 3rd Qu.:1999 3rd Qu.:10 3rd Qu.:29
## Max. :51.00 Max. :2018 Max. :10 Max. :29
##
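The extraction behind this summary is a filter on the month and day columns. In the sketch below, the name oct29 is hypothetical (the handout does not name the smaller dataframe), and the stand-in olywthr contains only dates, with the measurement columns omitted.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for olywthr: dates only, measurement columns omitted.
olywthr <- tibble(
  DATE = seq(as.Date("2016-10-27"), as.Date("2018-10-31"), by = "day")
) %>%
  mutate(yr = year(DATE), mo = month(DATE), dy = day(DATE))

# Keep only the rows that fall on October 29 in any year.
oct29 <- olywthr %>%
  filter(mo == 10, dy == 29)

summary(oct29)
```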
There are three variables in which we are most interested.
We can examine these graphically using some features of ggplot2. I’ll use both a histogram and a density plot, which I prefer.
The maximum temperature.
The minimum temperature.
The precipitation.
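A histogram and a density plot of one variable might be drawn as follows. The dataframe name oct29 is hypothetical, its values are made-up placeholders rather than NOAA observations, and the bin width is an arbitrary choice. The same two calls apply to TMIN and PRCP by changing the x aesthetic.

```r
library(ggplot2)

# Placeholder values for illustration only, NOT NOAA observations.
oct29 <- data.frame(
  TMAX = c(52, 55, 58, 55, 54, 61, 49, 57, 56, 53)
)

# Histogram: counts of days in 2-degree bins (bin width chosen arbitrarily).
p_hist <- ggplot(oct29, aes(x = TMAX)) +
  geom_histogram(binwidth = 2)

# Density plot: a smoothed view of the same distribution.
p_dens <- ggplot(oct29, aes(x = TMAX)) +
  geom_density()

print(p_hist)
print(p_dens)
```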
I’ll create a similar dataframe with data from a month ago and a few months into the future. Then I’ll do a faceted graphic to compare all of the months.
First the data.
Now the graphic.
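Both steps can be sketched together. The month range (August through December), the name multi, and the stacked single-column layout are assumptions. TMAX here follows a made-up seasonal curve and is not the NOAA data.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

# Synthetic stand-in for olywthr: TMAX follows a made-up seasonal curve,
# NOT the NOAA observations.
olywthr <- tibble(
  DATE = seq(as.Date("2015-01-01"), as.Date("2018-12-31"), by = "day")
) %>%
  mutate(
    yr = year(DATE), mo = month(DATE), dy = day(DATE),
    TMAX = 60 + 20 * cos(2 * pi * (yday(DATE) - 210) / 365) + 1.5 * (yr - 2016)
  )

# The data: the 29th of each month from August through December (assumed range).
multi <- olywthr %>%
  filter(dy == 29, mo %in% 8:12)

# The graphic: one density panel per month, stacked for easy comparison.
p <- ggplot(multi, aes(x = TMAX)) +
  geom_density() +
  facet_wrap(~ mo, ncol = 1)

print(p)
```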
I can see how the distribution of daily maximum temperatures has drifted down since August and that it will decline more as the month changes until we reach December.
I’ll stick with the 29th day of each month and do a different style of graph, a side-by-side boxplot.
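A side-by-side boxplot needs the month treated as a category. This is a sketch on synthetic values, where the name multi, the month range, and the axis labels are assumptions.

```r
library(ggplot2)

# Synthetic values for the 29th of August through December in four years,
# NOT NOAA observations.
multi <- expand.grid(yr = 2015:2018, mo = 8:12)
multi$TMAX <- 120 - 6 * multi$mo + (multi$yr - 2016)

# factor(mo) makes ggplot draw one box per month instead of a single box.
p <- ggplot(multi, aes(x = factor(mo), y = TMAX)) +
  geom_boxplot() +
  labs(x = "Month", y = "Daily maximum temperature (TMAX)")

print(p)
```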
During July 2018, I thought that the weather was unusually hot for July. Let’s look at this question graphically. I’ll create a marker variable for 2018 and filter the data to include only the July days. Then I can look at the distribution of temperatures for the 31 days of July 2018 in comparison with all of the other July days.
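# Keep only the July days and flag 2018 against all other years.
QJ18 <- olywthr %>%
  filter(mo == 7) %>%
  mutate(marker = ifelse(yr == 2018, "2018", "Other"))
# Overlay the densities of daily maximum temperature for the two groups.
QJ18 %>%
  ggplot(aes(x = TMAX, color = marker)) +
  geom_density()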
Most of the time I’m satisfied with graphical answers to my questions. However, it’s certainly possible to use traditional statistical methods. Here, I should test the null hypothesis that the average daily maximum temperature for July 2018 is the same as the average daily maximum temperature for July in all other years. In other words, is the difference I see in the graphic nothing but random fluctuation?
I’ll use t.test() to do this.
##
## Welch Two Sample t-test
##
## data: TMAX by marker
## t = 3.4349, df = 30.678, p-value = 0.001721
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.182376 8.568658
## sample estimates:
## mean in group 2018 mean in group Other
## 82.70968 77.33416
The p-value (about 0.002) is far below 5%, so I’ll reject the null hypothesis and dismiss the idea of random fluctuation.
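The header "TMAX by marker" in the test output corresponds to the formula interface of t.test(), which runs the Welch version by default. This sketch uses a small made-up QJ18 so that it runs on its own, and its numbers will not match the output above.

```r
# Made-up stand-in for QJ18: placeholder temperatures, NOT NOAA observations.
QJ18 <- data.frame(
  marker = rep(c("2018", "Other"), times = c(6, 12)),
  TMAX = c(84, 88, 79, 91, 86, 82,
           75, 78, 80, 72, 77, 81, 74, 79, 76, 83, 73, 78)
)

# Compare mean TMAX between the two marker groups.
# var.equal defaults to FALSE, so this is the Welch two-sample test.
result <- t.test(TMAX ~ marker, data = QJ18)

print(result)

# The pieces of the result can also be read off directly.
result$p.value
result$conf.int
```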
All of the resources used in CSC360 are free. The links below take you to the most important resources.
Modern Dive is written as an alternative introductory statistics textbook focused on R and the tidyverse collection of packages. You can see it at https://moderndive.com/.
Hands-on Programming with R is an excellent introduction to the R language. See https://rstudio-education.github.io/hopr/.
Datacamp is a very popular provider of online courses in data science. Normally there is a monthly charge for access, but students in CSC360 will have free access for six months. You can do the first chapter of any course for no fee right now. If you want to look, I suggest “Introduction to R” and “Introduction to the Tidyverse”. See https://www.datacamp.com/.
R for Data Science is the most complete textbook for the course. See https://r4ds.had.co.nz/.
Contact Harold Nelson: hnelson@stmartin.edu.