This worksheet illustrates how I used Mark Biegert’s RPubs file (retrieved from http://mathscinotes.com/2017/02/an-example-of-cleaning-untidy-data-with-tidyr/ ) to show examples of tidy vs. untidy data. I then used completely fictitious .xlsx files that I created myself to complete part 2 of the assignment.
These are the libraries that were loaded:
library(readxl)
library(dplyr)
library(csvread)
I simply copied this file from the web site linked above. This was a file of untidy data:
#UNTIDY
untidy <- read_excel("~/Desktop/WFED 540 FA17/Class4Homework/Untidy.xlsx", col_names = FALSE)
untidy
# A tibble: 66 x 7
X__1 X__2
<chr> <chr>
1 "Type\tModel \t1941 \t1942 \t1943 \t1944 \t1945" <NA>
2 Very Heavy Bombers <NA>
3 "B-29 \t\t897" "730 \t \t605"
4 Heavy Bombers <NA>
5 "B-17 \t301" "221 \t258"
6 "B-24 \t379" "162 \t304"
7 "B-32 \t- \t790" "433 \t- \t790"
8 Medium Bombers <NA>
9 "B-25 \t180" "031 \t153"
10 "B-26 \t261" "062 \t239"
# ... with 56 more rows, and 5 more variables: X__3 <chr>, X__4 <chr>,
# X__5 <chr>, X__6 <chr>, X__7 <chr>
#TIDY
tidy <- read_excel("~/Desktop/WFED 540 FA17/Class4Homework/Tidy.xlsx")
tidy
# A tibble: 280 x 4
Type Model Year Cost
<chr> <chr> <dbl> <chr>
1 NA B-29 1941 NA
2 NA B-17 1941 301221
3 NA B-24 1941 379162
4 NA B-32 1941 NA
5 NA B-25 1941 180031
6 NA B-26 1941 261062
7 NA A-20 1941 136813
8 NA A-26 1941 224498
9 NA A-28 1941 NA
10 NA A-29 1941 NA
# ... with 270 more rows
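The move from the untidy layout to the tidy one is essentially a wide-to-long reshape: one column per year becomes a single Year column plus a Cost column. Below is a minimal sketch of that idea in base R, assuming a small hand-typed table. The 1941 figures are copied from the output above; the 1942 figures and the names `wide`, `long`, and `year_cols` are made up for illustration. The linked post, going by its title, does this with tidyr, so this is not its code.

```r
# Hypothetical wide table: one row per model, one column per year.
# 1941 values come from the tidy output above; 1942 values are placeholders.
wide <- data.frame(
  Model  = c("B-17", "B-24"),
  `1941` = c(301221, 379162),
  `1942` = c(100000, 200000),
  check.names = FALSE,        # keep "1941"/"1942" as literal column names
  stringsAsFactors = FALSE
)

# Every column except Model holds one year's values.
year_cols <- setdiff(names(wide), "Model")

# Stack the year columns: one row per Model/Year combination.
long <- data.frame(
  Model = rep(wide$Model, times = length(year_cols)),
  Year  = rep(as.numeric(year_cols), each = nrow(wide)),
  Cost  = unlist(wide[year_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

print(long)
```

Each row of `long` now describes exactly one observation (a model in a year), which is the property that makes the Tidy.xlsx layout easier to filter and summarize.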
The three datasets below (Examples 1 through 3) are completely fictitious. In retrospect, I probably should have chosen data covering a wider variety of variable types, which I discuss in more detail below.
## Example 1: Year and Age
First, we will look at an Excel file that includes data about the year and age of participants:
age <- read_excel("~/Desktop/WFED 540 FA17/Class4Homework/Example1.xlsx")
age
## # A tibble: 10 x 3
## X__1 year age
## <dbl> <dbl> <dbl>
## 1 1 2017 22
## 2 2 2016 21
## 3 3 2015 20
## 4 4 2014 19
## 5 5 2013 18
## 6 6 2012 17
## 7 7 2011 16
## 8 8 2010 15
## 9 9 2009 14
## 10 10 2008 13
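Age is a ratio variable. Year is strictly an interval variable (differences between years are meaningful, but there is no true zero), although it acts as a nominal label when it is used only to group records.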
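One way to make these measurement levels explicit in R is sketched below, with the values from the printed tibble re-entered by hand so the block runs without the Excel file. It assumes that `X__1` is only a row index (the name looks like one readxl generates for a column with a blank header) and that year is being used as a grouping label; `year_label` and `age_ratio` are names introduced here for illustration.

```r
# Values re-entered from the printed tibble above.
age <- data.frame(X__1 = 1:10, year = 2017:2008, age = 22:13)

# Assumption: X__1 is just a row index, so it can be dropped.
# (With dplyr loaded, dplyr::select(age, -X__1) does the same thing.)
age <- age[, c("year", "age")]

# Store year as a factor when it is used only as a grouping label.
age$year_label <- factor(age$year)

# Ratios are meaningful for a ratio variable such as age.
age_ratio <- age$age[1] / age$age[10]   # 22 / 13

str(age)
```

Keeping the numeric `year` column alongside the factor leaves both uses available: arithmetic on the number, grouping on the label.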
Now, we will look at an Excel file that includes data about student GPA and the amount of time spent studying:
gpa <- read_excel("~/Desktop/WFED 540 FA17/Class4Homework/Example2.xlsx")
gpa
## # A tibble: 10 x 3
## X__1 gpa `hours studying`
## <dbl> <dbl> <dbl>
## 1 1 4.0 38.0
## 2 2 3.9 29.0
## 3 3 2.2 2.2
## 4 4 3.5 32.0
## 5 5 3.2 12.0
## 6 6 2.9 21.0
## 7 7 3.1 14.0
## 8 8 3.6 26.0
## 9 9 2.5 19.0
## 10 10 2.0 5.0
GPA is a ratio variable and hours of study is also a ratio variable.
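Because both columns are treated as numeric here, one simple summary of how they move together is a Pearson correlation. The sketch below re-enters the printed values by hand; the column name `hours_studying` (in place of `hours studying`) and the variable `r` are assumptions made for convenience. Since the data are fictitious, the result only illustrates the call.

```r
# Values re-entered from the printed tibble above.
gpa <- data.frame(
  gpa            = c(4.0, 3.9, 2.2, 3.5, 3.2, 2.9, 3.1, 3.6, 2.5, 2.0),
  hours_studying = c(38, 29, 2.2, 32, 12, 21, 14, 26, 19, 5)
)

# Pearson correlation between GPA and hours studied (base R stats::cor).
r <- cor(gpa$gpa, gpa$hours_studying)
print(r)
```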
Finally, we will look at an Excel file that includes data about worship attendance and offering/collection received:
giving <- read_excel("~/Desktop/WFED 540 FA17/Class4Homework/Example3.xlsx")
giving
## # A tibble: 10 x 3
## X__1 attendance offering
## <dbl> <dbl> <dbl>
## 1 1 35 572
## 2 2 57 999
## 3 3 125 1357
## 4 4 20 782
## 5 5 46 777
## 6 6 222 3456
## 7 7 350 9891
## 8 8 425 12145
## 9 9 87 1101
## 10 10 108 1234
Worship attendance is a ratio variable and giving is also a ratio variable.
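Since both variables have a true zero, dividing one by the other gives an interpretable quantity: the offering per person attending. A minimal sketch, again with the printed values re-entered by hand; `per_person` is a column name introduced here for illustration.

```r
# Values re-entered from the printed tibble above.
giving <- data.frame(
  attendance = c(35, 57, 125, 20, 46, 222, 350, 425, 87, 108),
  offering   = c(572, 999, 1357, 782, 777, 3456, 9891, 12145, 1101, 1234)
)

# Offering per attendee; meaningful because both are ratio variables.
giving$per_person <- giving$offering / giving$attendance

# Show the services ordered from highest to lowest per-person offering.
print(giving[order(-giving$per_person), ])
```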