Rongen Zhang
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## âś“ ggplot2 3.3.3 âś“ purrr 0.3.4
## âś“ tibble 3.1.2 âś“ dplyr 1.0.6
## âś“ tidyr 1.1.3 âś“ stringr 1.4.0
## âś“ readr 1.4.0 âś“ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
read.csv() vs. read_csv()First, let’s use the usual way to read csv data into R (make sure that you are using the right working directory).
df_original <- read.csv(file="HousePrices.csv",
header=TRUE, stringsAsFactors=FALSE)
class(df_original) # this is a data frame## [1] "data.frame"
We can also use read_csv() from the readr package in the tidyverse to read the same csv file:
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
If you haven’t done this already (from lab-3), download HousePrices.csv and put it into your working directory.
Run the following code:
install.packages("tidyverse")
library(tidyverse)
df_original <- read.csv(file="HousePrices.csv",
header=TRUE, stringsAsFactors=FALSE)
df <- read_csv(file="HousePrices.csv")And then print these two data sets by entering their names to the console. What are your observations?
dplyr provides the following verbs for data manipulation.
We will be using HousePrices.csv throughout this lab. Please download and put the file in your working directory. (Use getwd and setwd to make sure you have a correct working directory.)
select(): Pick columns by name## # A tibble: 1,728 x 2
## price lot_size
## <dbl> <dbl>
## 1 132500 0.09
## 2 181115 0.92
## 3 109000 0.19
## 4 155000 0.41
## 5 86060 0.11
## 6 120000 0.68
## 7 153000 0.4
## 8 170000 1.21
## 9 90000 0.83
## 10 122900 1.94
## # … with 1,718 more rows
Pipe operator %>% makes the code much readable. Essentially, you send the data through a set of operations and the operations are connected by a pipe.
Note: You must have a space before and after %>%
# in tidyverse, we use dot/period to represent whatever comes from the previous stage
df %>% select(., c(price, lot_size)) # notice that df is outside## # A tibble: 1,728 x 2
## price lot_size
## <dbl> <dbl>
## 1 132500 0.09
## 2 181115 0.92
## 3 109000 0.19
## 4 155000 0.41
## 5 86060 0.11
## 6 120000 0.68
## 7 153000 0.4
## 8 170000 1.21
## 9 90000 0.83
## 10 122900 1.94
## # … with 1,718 more rows
# if the first argument is a dot/period, you may omit it in your code
# the packages in the tidyverse will automatically fill it in for you
df %>% select(c(price, lot_size))## # A tibble: 1,728 x 2
## price lot_size
## <dbl> <dbl>
## 1 132500 0.09
## 2 181115 0.92
## 3 109000 0.19
## 4 155000 0.41
## 5 86060 0.11
## 6 120000 0.68
## 7 153000 0.4
## 8 170000 1.21
## 9 90000 0.83
## 10 122900 1.94
## # … with 1,718 more rows
## # A tibble: 1,728 x 9
## price lot_size waterfront age land_value construction air_cond fuel heat
## <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr> <chr>
## 1 132500 0.09 No 42 50000 No No Elec… Elec…
## 2 181115 0.92 No 0 22300 No No Gas Hot …
## 3 109000 0.19 No 133 7300 No No Gas Hot …
## 4 155000 0.41 No 13 18700 No No Gas Hot …
## 5 86060 0.11 No 0 15000 Yes Yes Gas Hot …
## 6 120000 0.68 No 31 14000 No No Gas Hot …
## 7 153000 0.4 No 33 23300 No No Oil Hot …
## 8 170000 1.21 No 23 14600 No No Oil Hot …
## 9 90000 0.83 No 36 22200 No No Elec… Elec…
## 10 122900 1.94 No 4 21200 No No Gas Hot …
## # … with 1,718 more rows
## # A tibble: 1,728 x 2
## bathrooms rooms
## <dbl> <dbl>
## 1 1 5
## 2 2.5 6
## 3 1 8
## 4 1.5 5
## 5 1 3
## 6 1 8
## 7 1.5 8
## 8 1.5 9
## 9 1.5 8
## 10 1.5 6
## # … with 1,718 more rows
Select columns age and land_value from df
filter(): Keep rows that match criteria## # A tibble: 246 x 15
## X1 price lot_size waterfront age land_value construction air_cond fuel
## <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 1 132500 0.09 No 42 50000 No No Elec…
## 2 9 90000 0.83 No 36 22200 No No Elec…
## 3 13 85860 8.97 No 13 4800 No No Elec…
## 4 21 112000 1 No 12 8600 No No Elec…
## 5 22 104900 0.43 No 21 5600 No No Elec…
## 6 25 90400 0.36 No 16 5200 No No Elec…
## 7 60 175000 0.47 No 15 27200 No No Elec…
## 8 65 171000 3.16 No 15 24100 No No Elec…
## 9 69 182000 1 No 16 26600 No No Elec…
## 10 81 114000 0.06 No 14 1600 No No Elec…
## # … with 236 more rows, and 6 more variables: heat <chr>, sewer <chr>,
## # living_area <dbl>, fireplaces <dbl>, bathrooms <dbl>, rooms <dbl>
## # A tibble: 246 x 3
## price fuel air_cond
## <dbl> <chr> <chr>
## 1 132500 Electric No
## 2 90000 Electric No
## 3 85860 Electric No
## 4 112000 Electric No
## 5 104900 Electric No
## 6 90400 Electric No
## 7 175000 Electric No
## 8 171000 Electric No
## 9 182000 Electric No
## 10 114000 Electric No
## # … with 236 more rows
Note: In the interest of readability, put different operations in separate lines.
After the operations, you may want to save the result as a new data frame. You can then use the new data frame for other analyses.
new_df <- df %>%
filter(price < 200000, fuel == "Electric") %>%
select(price, fuel, air_cond)
nrow(new_df)## [1] 246
Use pipe %>% to chain the following two operations
arrange(): Reorder rows## # A tibble: 1,728 x 3
## price air_cond rooms
## <dbl> <chr> <dbl>
## 1 5000 Yes 6
## 2 10300 No 4
## 3 10300 No 6
## 4 20000 No 4
## 5 25000 No 6
## 6 45000 No 4
## 7 49387 No 5
## 8 58500 No 5
## 9 60000 No 6
## 10 60000 No 7
## # … with 1,718 more rows
use desc() for a descending order
## # A tibble: 1,728 x 3
## price air_cond rooms
## <dbl> <chr> <dbl>
## 1 775000 Yes 9
## 2 775000 Yes 12
## 3 760000 Yes 12
## 4 725000 Yes 10
## 5 670000 No 8
## 6 658300 Yes 11
## 7 655000 No 6
## 8 650000 Yes 10
## 9 650000 Yes 12
## 10 649000 Yes 12
## # … with 1,718 more rows
mutate(): Add new variablesCreate new variables that are functions of existing variables
df %>%
mutate(bedrooms = rooms - bathrooms) %>%
filter(bedrooms>3) %>%
select(bedrooms, bathrooms, rooms)## # A tibble: 1,381 x 3
## bedrooms bathrooms rooms
## <dbl> <dbl> <dbl>
## 1 4 1 5
## 2 3.5 2.5 6
## 3 7 1 8
## 4 3.5 1.5 5
## 5 7 1 8
## 6 6.5 1.5 8
## 7 7.5 1.5 9
## 8 6.5 1.5 8
## 9 4.5 1.5 6
## 10 11 1 12
## # … with 1,371 more rows
expensive_house, which equals “yes” if the house price is greater than 150000 and “no” otherwise.
price and expensive_house columns and arrange the rows in the descending order of price.## # A tibble: 1,728 x 2
## price expensive_house
## <dbl> <chr>
## 1 775000 yes
## 2 775000 yes
## 3 760000 yes
## 4 725000 yes
## 5 670000 yes
## 6 658300 yes
## 7 655000 yes
## 8 650000 yes
## 9 650000 yes
## 10 649000 yes
## # … with 1,718 more rows
summarise(): Reduce variables to valuesgroup_by() creates the groups that will be operated onsummarise() uses the provided aggregation function to summarise each group## # A tibble: 2 x 2
## air_cond avg_price
## <chr> <dbl>
## 1 No 187022.
## 2 Yes 254904.
You can have multiple summary/aggregate statistics:
df %>%
group_by(air_cond) %>%
summarise(n_house = n(), # n() gives the number of rows in each group
avg_price = mean(price)) ## # A tibble: 2 x 3
## air_cond n_house avg_price
## <chr> <int> <dbl>
## 1 No 1093 187022.
## 2 Yes 635 254904.
Some houses have a drive way, some don’t. Use group_by and summarise to obtain the following:
## # A tibble: 2 x 4
## construction n_house min_lotsize max_lotsize
## <chr> <int> <dbl> <dbl>
## 1 No 1647 0 12.2
## 2 Yes 81 0.01 3.47
Note: left_join(a, b, by="x1") is equivalent to a %>% left_join(b, by="x1")
This lab assignment involves 2 tasks (see the following slides). Once you finish the following tasks, please put everything in one single R file with the file name assignment2.R (.R is the file extension) and upload it to iCollege (Lab Assignment 2).
You will lose 50% of the points if you use a different file name or put your code in multiple files.
In addition, lab assignments will be graded based on:
1.1. Show price, air_cond, heat, fireplaces for houses that have no fireplaces
1.2. Create a new variable price_per_bedroom which is price divided by the number of bedrooms. Show only price, bedrooms, and price_per_bedroom columns and arrange the rows in the descending order of price_per_bedroom
1.3. Create a new variable has_4_or_more_bedrooms which is TRUE if the house has 4 or more bedrooms and FALSE otherwise. Use this variable and group_by and summarise() to find how many houses have 4 or more bedrooms and how many don’t
2.1. Define and write a function called is_even which takes one number as input and detects whether the number is even. Return TRUE if it is even, and FALSE otherwise. You need to first detect if the input is an integer; if not, return NA.
1, 2, 1.5, and "gsu".%% operator (a.k.a. the modulus operator). Example:## [1] 2
2.2. Write a for loop to get the sum of all the even numbers from -50 to 50. You have to use the function you wrote for task 2.1 above.