Agenda

  1. Set up R and RStudio
  2. Understand the layout and functionality of RStudio
  3. Experiment with basic R expressions and data types
  4. Advanced data types
  5. Data manipulation

Introduction

R is a programming environment

  • uses a well-developed but simple programming language
  • allows for rapid development of new tools according to user demand
  • supports data analytics tasks

Downloading and installing R

RStudio

RStudio is a development environment for R, and provides many advanced features to improve efficiency and ease of use for R users.

Rgui

Downloading and installing RStudio

Getting started with RStudio

RStudio: console panel

This is the most important panel, because this is where R actually does stuff

RStudio: editor panel

Collections of commands (scripts) can be edited and saved.

RStudio: a typical workflow

RStudio: a typical workflow (cont.)

There are ways to speed up the workflow:

  • If you don’t select any code, R will just execute the line where the blinking cursor is

  • Instead of clicking the “Run” icon, you can just use the keyboard shortcut: Ctrl + Enter

RStudio: run the whole R script

RStudio: save your R script

Operators: Arithmetics

7 + 5
## [1] 12
7 - 5
## [1] 2
7 * 5
## [1] 35
7 / 5
## [1] 1.4

Like a calculator, R also has many functions that let you do more sophisticated manipulations.

round(2.05)
## [1] 2
factorial(3)  # 3! = 3 * 2 * 1
## [1] 6
sqrt(9)       # square root 
## [1] 3

Note:

  • Use # (number sign) to comment your codes
  • R will ignore anything in a line that follows #
  • It is a good idea to comment your codes so that others can understand what you are trying to do.

Getting Help

There will be many occasions where you want to learn more about a built-in command or function. Type help(function_name) or ?function_name to get more information. For example:

help(factorial)
?factorial

Use two question marks to search the whole help database, especially when you don’t know exactly the function name. For example,

??read

Data types

R can recognize different types of data:

  • numbers
  • character strings (text)
  • data frame

Numeric

Any number, no quotes.

Appropriate for math.

1 + 1
200000
sqrt(9)
class(0.3)  # "class" is function that shows the data type of an input
## [1] "numeric"

Character

Any symbols surrounded by single quotes (’) or double quotes (“)

class("Business Intelligence")
## [1] "character"
nchar('Business Intelligence')
## [1] 21
toupper("Business Intelligence")
## [1] "BUSINESS INTELLIGENCE"
paste("Business", "Intelligence", sep="_") 
## [1] "Business_Intelligence"

Your turn

How many characters are in the following strings:

  • 123
  • Email
  • meiling_jiang@baylor.edu

Use paste to join the following words so that the result looks like How#are#you? (hint: ?paste)

  • How
  • are
  • you?

Data can have names

We can give names to data objects; these give us variables

Variables are created with the assignment operator, <- or =

Be careful that R is a case sensitive language. FOO, Foo, and foo are three different variables!

x = 2      # use the equal sign to assign value
y <- 3     # you can also use an arrow to assign value
x          # print the value of a variable by typing its name
## [1] 2
x * y
## [1] 6

The assignment operator also changes values:

x
## [1] 2
x <- 8
x
## [1] 8

Using names and variables makes code: easier to design, easier to debug, less prone to bugs, easier to improve, and easier for others to read

Variable names

Variable names cannot begin with numbers. Wise to avoid special characters, except for period (.) and underline (_)

Example of valid names:

  • a
  • b
  • FOO
  • my_var
  • .day

Example of invalid names:

  • 1
  • 2nd
  • ^mean
  • !bad
  • $

Data frame

A data frame is a set of vectors of equal length. Consider data frame as an Excel sheet or a database table.

Column names are preserved or guessed if not explicitly set

course <- c("MIS4730", "MIS4710", "MIS4950", "MIS1234") 
num_of_students <- c(20, 10, 40, 30) 
data_analytics_minor <- c(TRUE, TRUE, TRUE, FALSE) 
df <- data.frame(course, n_students=num_of_students, data_analytics_minor,
                stringsAsFactors=F)
df # notice the column names and row names
##    course n_students data_analytics_minor
## 1 MIS4730         20                 TRUE
## 2 MIS4710         10                 TRUE
## 3 MIS4950         40                 TRUE
## 4 MIS1234         30                FALSE

Useful functions for data frames

ncol(df) # number of columns
## [1] 3
nrow(df) # number of rows
## [1] 4
colnames(df) # get column names
## [1] "course"               "n_students"           "data_analytics_minor"
rownames(df) # get row names
## [1] "1" "2" "3" "4"

You can change column and row names:

df2 <- df # create a copy of df, and name it as "df2"
colnames(df2) <- c("col1", "col2", "col3") # assign column names
colnames(df2)   # they were "course", "n_students", "ba_minor_course"
## [1] "col1" "col2" "col3"
rownames(df2) <- c("row1", "row2", "row3", "row4") # assign row names
rownames(df2)   # they were "1", "2", "3", "4"
## [1] "row1" "row2" "row3" "row4"

Getting values from a column

There are many ways you can get values out of a column:

  • The most readable way: dataframe_name$column_name
df$course
## [1] "MIS4730" "MIS4710" "MIS4950" "MIS1234"
df$n_students
## [1] 20 10 40 30

Getting values from rows

df
##    course n_students data_analytics_minor
## 1 MIS4730         20                 TRUE
## 2 MIS4710         10                 TRUE
## 3 MIS4950         40                 TRUE
## 4 MIS1234         30                FALSE

All of a row

df[2,]  # row 2
##    course n_students data_analytics_minor
## 2 MIS4710         10                 TRUE

df
##    course n_students data_analytics_minor
## 1 MIS4730         20                 TRUE
## 2 MIS4710         10                 TRUE
## 3 MIS4950         40                 TRUE
## 4 MIS1234         30                FALSE

Multiple rows

df[c(1,3),]  # rows 1 & 3
##    course n_students data_analytics_minor
## 1 MIS4730         20                 TRUE
## 3 MIS4950         40                 TRUE

df[2,1]              # row 2, column 1
## [1] "MIS4710"
df[c(3,4),c(1,2)]    # rows 3 & 4, columns 1 & 2
##    course n_students
## 3 MIS4950         40
## 4 MIS1234         30
df["2","n_students"] # "2": row name, "n_students": column name
## [1] 10

Rows matching a condition:

df
##    course n_students data_analytics_minor
## 1 MIS4730         20                 TRUE
## 2 MIS4710         10                 TRUE
## 3 MIS4950         40                 TRUE
## 4 MIS1234         30                FALSE
# Be careful that it's df$course in the brackets, not just course
df[df$course == 'MIS4730', ] 
##    course n_students data_analytics_minor
## 1 MIS4730         20                 TRUE

Your Turn

df
##    course n_students data_analytics_minor
## 1 MIS4730         20                 TRUE
## 2 MIS4710         10                 TRUE
## 3 MIS4950         40                 TRUE
## 4 MIS1234         30                FALSE

Retrieve Course MIS4950’s number of students from the data frame (using the data frame name and column name)

#Please do not use: df[3,2] or df$n_student[3]

Getting a dataset into R

Importing data into R is fairly simple. We can use built-in functions or libraries to read data from the following sources:

  • Text file (.txt)
  • Comma-separated values (.csv)
  • Excel (.xlsx or .xls)
  • Database table

Example data files

You can go to Canvas - Dataset Module to download the following data.

  • HousePrices.xlsx
  • HousePrices.csv

Working directory

Your working directory is the folder on your computer in which you are currently working.

# Show your current working directory
getwd() 

# List the files and folders in the current working directory
list.files()

Set Working directory with GUI

You can set your working directory in the following ways:

Set Working directory with GUI

You can set your working directory in the following ways:

Set Working directory with GUI

You can set your working directory in the following ways:

Your turn

  1. Create a new folder BI-Dataset for this course; Download the two HousePrices data files and move them into the folder

  2. Set this new folder as your working directory

  3. Run list.files() in the R console. Do you see these files?


Important: From this point forward and throughout the rest of the semester, I will assume that you are in the BI-Dataset project when you are developing R code in RStudio.

Import your dataset

You can import your dataset in the following ways:

Import the HousePrices.csv file to R

What the data looks like?

head() is a function allows you to see the top few rows of the data frame

HousePrices <- read.csv("~/Downloads/HousePrices.csv")
head(HousePrices, n=3) # n indicates how many rows you'd like to see
##   X  price lot_size waterfront age land_value construction air_cond     fuel
## 1 1 132500     0.09         No  42      50000           No       No Electric
## 2 2 181115     0.92         No   0      22300           No       No      Gas
## 3 3 109000     0.19         No 133       7300           No       No      Gas
##        heat   sewer living_area fireplaces bathrooms rooms
## 1  Electric Private         906          1       1.0     5
## 2 Hot Water Private        1953          0       2.5     6
## 3 Hot Water  Public        1944          1       1.0     8

Read Excel file

To read Excel data, we need to use a library: readxl

First, you need to install the package on your computer.

install.packages("readxl") # R will then go online and donwload the libray
library("readxl") # You need to load the library into R before you can use it.
df <- read_excel("HousePrices.xlsx", col_names = TRUE)
head(df, n=3)
## # A tibble: 3 × 15
##    ...1  price lot_size waterfront   age land_value construction air_cond fuel  
##   <dbl>  <dbl>    <dbl> <chr>      <dbl>      <dbl> <chr>        <chr>    <chr> 
## 1     1 132500     0.09 No            42      50000 No           No       Elect…
## 2     2 181115     0.92 No             0      22300 No           No       Gas   
## 3     3 109000     0.19 No           133       7300 No           No       Gas   
## # ℹ 6 more variables: heat <chr>, sewer <chr>, living_area <dbl>,
## #   fireplaces <dbl>, bathrooms <dbl>, rooms <dbl>

Write data to a file

file_path <- "my_test.csv"
write.csv(df, file=file_path, row.names = FALSE) 
# row.names are usually not very useful. So I like to exclude them.

file_path <- "my_test2.csv"
write.table(df, file_path, sep=",", row.names = FALSE)

Data summary: str()

str(df)
## tibble [1,728 × 15] (S3: tbl_df/tbl/data.frame)
##  $ ...1        : num [1:1728] 1 2 3 4 5 6 7 8 9 10 ...
##  $ price       : num [1:1728] 132500 181115 109000 155000 86060 ...
##  $ lot_size    : num [1:1728] 0.09 0.92 0.19 0.41 0.11 0.68 0.4 1.21 0.83 1.94 ...
##  $ waterfront  : chr [1:1728] "No" "No" "No" "No" ...
##  $ age         : num [1:1728] 42 0 133 13 0 31 33 23 36 4 ...
##  $ land_value  : num [1:1728] 50000 22300 7300 18700 15000 14000 23300 14600 22200 21200 ...
##  $ construction: chr [1:1728] "No" "No" "No" "No" ...
##  $ air_cond    : chr [1:1728] "No" "No" "No" "No" ...
##  $ fuel        : chr [1:1728] "Electric" "Gas" "Gas" "Gas" ...
##  $ heat        : chr [1:1728] "Electric" "Hot Water" "Hot Water" "Hot Air" ...
##  $ sewer       : chr [1:1728] "Private" "Private" "Public" "Private" ...
##  $ living_area : num [1:1728] 906 1953 1944 1944 840 ...
##  $ fireplaces  : num [1:1728] 1 0 1 1 0 1 1 1 0 0 ...
##  $ bathrooms   : num [1:1728] 1 2.5 1 1.5 1 1 1.5 1.5 1.5 1.5 ...
##  $ rooms       : num [1:1728] 5 6 8 5 3 8 8 9 8 6 ...

Data summary: summary()

summary(df)
##       ...1            price           lot_size        waterfront       
##  Min.   :   1.0   Min.   :  5000   Min.   : 0.0000   Length:1728       
##  1st Qu.: 432.8   1st Qu.:145000   1st Qu.: 0.1700   Class :character  
##  Median : 864.5   Median :189900   Median : 0.3700   Mode  :character  
##  Mean   : 864.5   Mean   :211967   Mean   : 0.5002                     
##  3rd Qu.:1296.2   3rd Qu.:259000   3rd Qu.: 0.5400                     
##  Max.   :1728.0   Max.   :775000   Max.   :12.2000                     
##       age           land_value     construction         air_cond        
##  Min.   :  0.00   Min.   :   200   Length:1728        Length:1728       
##  1st Qu.: 13.00   1st Qu.: 15100   Class :character   Class :character  
##  Median : 19.00   Median : 25000   Mode  :character   Mode  :character  
##  Mean   : 27.92   Mean   : 34557                                        
##  3rd Qu.: 34.00   3rd Qu.: 40200                                        
##  Max.   :225.00   Max.   :412600                                        
##      fuel               heat              sewer            living_area  
##  Length:1728        Length:1728        Length:1728        Min.   : 616  
##  Class :character   Class :character   Class :character   1st Qu.:1300  
##  Mode  :character   Mode  :character   Mode  :character   Median :1634  
##                                                           Mean   :1755  
##                                                           3rd Qu.:2138  
##                                                           Max.   :5228  
##    fireplaces       bathrooms       rooms       
##  Min.   :0.0000   Min.   :0.0   Min.   : 2.000  
##  1st Qu.:0.0000   1st Qu.:1.5   1st Qu.: 5.000  
##  Median :1.0000   Median :2.0   Median : 7.000  
##  Mean   :0.6019   Mean   :1.9   Mean   : 7.042  
##  3rd Qu.:1.0000   3rd Qu.:2.5   3rd Qu.: 8.250  
##  Max.   :4.0000   Max.   :4.5   Max.   :12.000

First few rows of data

head(df, n=2)
## # A tibble: 2 × 15
##    ...1  price lot_size waterfront   age land_value construction air_cond fuel  
##   <dbl>  <dbl>    <dbl> <chr>      <dbl>      <dbl> <chr>        <chr>    <chr> 
## 1     1 132500     0.09 No            42      50000 No           No       Elect…
## 2     2 181115     0.92 No             0      22300 No           No       Gas   
## # ℹ 6 more variables: heat <chr>, sewer <chr>, living_area <dbl>,
## #   fireplaces <dbl>, bathrooms <dbl>, rooms <dbl>
df[1:2,] # df[ rows_you_want, columns_you_want ]
## # A tibble: 2 × 15
##    ...1  price lot_size waterfront   age land_value construction air_cond fuel  
##   <dbl>  <dbl>    <dbl> <chr>      <dbl>      <dbl> <chr>        <chr>    <chr> 
## 1     1 132500     0.09 No            42      50000 No           No       Elect…
## 2     2 181115     0.92 No             0      22300 No           No       Gas   
## # ℹ 6 more variables: heat <chr>, sewer <chr>, living_area <dbl>,
## #   fireplaces <dbl>, bathrooms <dbl>, rooms <dbl>

Last few rows of data

tail(df, n=2)
## # A tibble: 2 × 15
##    ...1  price lot_size waterfront   age land_value construction air_cond fuel 
##   <dbl>  <dbl>    <dbl> <chr>      <dbl>      <dbl> <chr>        <chr>    <chr>
## 1  1727 125000     0.24 No            48      16800 No           No       Gas  
## 2  1728 111300     0.59 No            86      26000 No           No       Gas  
## # ℹ 6 more variables: heat <chr>, sewer <chr>, living_area <dbl>,
## #   fireplaces <dbl>, bathrooms <dbl>, rooms <dbl>

The tidyverse

  • The tidyverse is a collection of R packages designed for data science.
  • Install the complete tidyverse with:
install.packages("tidyverse")
  • Load the tidyverse into the R environment
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Core tidyverse

  • ggplot2: ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics.
  • tibble: tibble is a modern re-imagining of the data frame, keeping what time has proven to be effective, and throwing out what it has not.
  • tidyr: tidyr provides a set of functions that help you get to tidy data.
  • readr: readr provides a fast and friendly way to read rectangular data (like csv).
  • purrr: purrr provides a complete and consistent set of tools for working with functions and vectors.
  • dplyr: dplyr provides a grammar of data manipulation

read.csv() vs. read_csv()

First, let’s use the usual way to read csv data into R (make sure that you are using the right working directory).

df_original <- read.csv(file="HousePrices.csv", header=TRUE, stringsAsFactors=FALSE)
class(df_original) # this is a data frame
## [1] "data.frame"

We can also use read_csv() from the readr package in the tidyverse to read the same csv file:

df <- read_csv(file="HousePrices.csv")
class(df) # this is a tibble (tbl_df)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

Your turn

If you haven’t done this already, download HousePrices.csv and put it into your working directory.

Run the following code:

install.packages("tidyverse")
library(tidyverse)
df_original <- read.csv(file="HousePrices.csv", 
                        header=TRUE, stringsAsFactors=FALSE)
df <- read_csv(file="HousePrices.csv")

And then print these two data sets by entering their names to the console. What are your observations?

df_original
df

Package dplyr

dplyr provides the following verbs for data manipulation.

  1. select
  2. filter
  3. arrange
  4. mutate
  5. summarise & group_by
  6. joining (merging) data frames / tables

We will be using HousePrices.csv throughout this lab. Please download from Canvas (Sample Dataset Module) and put the file in your working directory. (Use getwd and setwd to make sure you have a correct working directory.)

select(): Pick columns by name

select(df, c(price, lot_size)) # equivalent to df[, c("price", "lotsize")]
## # A tibble: 1,728 × 2
##     price lot_size
##     <dbl>    <dbl>
##  1 132500     0.09
##  2 181115     0.92
##  3 109000     0.19
##  4 155000     0.41
##  5  86060     0.11
##  6 120000     0.68
##  7 153000     0.4 
##  8 170000     1.21
##  9  90000     0.83
## 10 122900     1.94
## # ℹ 1,718 more rows

Chaining/Pipelining

Pipe operator %>% makes the code much readable. Essentially, you send the data through a set of operations and the operations are connected by a pipe.

Note: You must have a space before and after %>%

df %>% select(,c(price, lot_size)) #or more easily: df %>% select(c(price, lot_size))
## # A tibble: 1,728 × 2
##     price lot_size
##     <dbl>    <dbl>
##  1 132500     0.09
##  2 181115     0.92
##  3 109000     0.19
##  4 155000     0.41
##  5  86060     0.11
##  6 120000     0.68
##  7 153000     0.4 
##  8 170000     1.21
##  9  90000     0.83
## 10 122900     1.94
## # ℹ 1,718 more rows

df %>% select(price:heat) # all columns between price and fuel
## # A tibble: 1,728 × 9
##     price lot_size waterfront   age land_value construction air_cond fuel  heat 
##     <dbl>    <dbl> <chr>      <dbl>      <dbl> <chr>        <chr>    <chr> <chr>
##  1 132500     0.09 No            42      50000 No           No       Elec… Elec…
##  2 181115     0.92 No             0      22300 No           No       Gas   Hot …
##  3 109000     0.19 No           133       7300 No           No       Gas   Hot …
##  4 155000     0.41 No            13      18700 No           No       Gas   Hot …
##  5  86060     0.11 No             0      15000 Yes          Yes      Gas   Hot …
##  6 120000     0.68 No            31      14000 No           No       Gas   Hot …
##  7 153000     0.4  No            33      23300 No           No       Oil   Hot …
##  8 170000     1.21 No            23      14600 No           No       Oil   Hot …
##  9  90000     0.83 No            36      22200 No           No       Elec… Elec…
## 10 122900     1.94 No             4      21200 No           No       Gas   Hot …
## # ℹ 1,718 more rows

# select columns that contain "room" in their column name
df %>% select(contains("room")) 
## # A tibble: 1,728 × 2
##    bathrooms rooms
##        <dbl> <dbl>
##  1       1       5
##  2       2.5     6
##  3       1       8
##  4       1.5     5
##  5       1       3
##  6       1       8
##  7       1.5     8
##  8       1.5     9
##  9       1.5     8
## 10       1.5     6
## # ℹ 1,718 more rows

Your turn

Select columns age and land_value from df

filter(): Keep rows that match criteria

df %>% filter(price < 200000, fuel == "Electric")
## # A tibble: 246 × 15
##     ...1  price lot_size waterfront   age land_value construction air_cond fuel 
##    <dbl>  <dbl>    <dbl> <chr>      <dbl>      <dbl> <chr>        <chr>    <chr>
##  1     1 132500     0.09 No            42      50000 No           No       Elec…
##  2     9  90000     0.83 No            36      22200 No           No       Elec…
##  3    13  85860     8.97 No            13       4800 No           No       Elec…
##  4    21 112000     1    No            12       8600 No           No       Elec…
##  5    22 104900     0.43 No            21       5600 No           No       Elec…
##  6    25  90400     0.36 No            16       5200 No           No       Elec…
##  7    60 175000     0.47 No            15      27200 No           No       Elec…
##  8    65 171000     3.16 No            15      24100 No           No       Elec…
##  9    69 182000     1    No            16      26600 No           No       Elec…
## 10    81 114000     0.06 No            14       1600 No           No       Elec…
## # ℹ 236 more rows
## # ℹ 6 more variables: heat <chr>, sewer <chr>, living_area <dbl>,
## #   fireplaces <dbl>, bathrooms <dbl>, rooms <dbl>

Chaining multiple operations

df %>% 
  filter(price < 200000, fuel == "Electric") %>%
  select(price, fuel, air_cond)
## # A tibble: 246 × 3
##     price fuel     air_cond
##     <dbl> <chr>    <chr>   
##  1 132500 Electric No      
##  2  90000 Electric No      
##  3  85860 Electric No      
##  4 112000 Electric No      
##  5 104900 Electric No      
##  6  90400 Electric No      
##  7 175000 Electric No      
##  8 171000 Electric No      
##  9 182000 Electric No      
## 10 114000 Electric No      
## # ℹ 236 more rows

Note: In the interest of readability, put different operations in separate lines.

After the operations, you may want to save the result as a new data frame. You can then use the new data frame for other analyses.

new_df <- df %>% 
  filter(price < 200000, fuel == "Electric") %>%
  select(price, fuel, air_cond)

nrow(new_df)
## [1] 246

Your turn

Use pipe %>% to chain the following two operations

  1. Select columns waterfront and age from df
  2. Include only houses that have no waterfront

arrange(): Reorder rows

df %>%
  select(price, air_cond, rooms) %>%
  arrange(price)
## # A tibble: 1,728 × 3
##    price air_cond rooms
##    <dbl> <chr>    <dbl>
##  1  5000 Yes          6
##  2 10300 No           4
##  3 10300 No           6
##  4 20000 No           4
##  5 25000 No           6
##  6 45000 No           4
##  7 49387 No           5
##  8 58500 No           5
##  9 60000 No           6
## 10 60000 No           7
## # ℹ 1,718 more rows

use desc() for a descending order

df %>%
  select(price, air_cond, rooms) %>%
  arrange(desc(price))
## # A tibble: 1,728 × 3
##     price air_cond rooms
##     <dbl> <chr>    <dbl>
##  1 775000 Yes          9
##  2 775000 Yes         12
##  3 760000 Yes         12
##  4 725000 Yes         10
##  5 670000 No           8
##  6 658300 Yes         11
##  7 655000 No           6
##  8 650000 Yes         10
##  9 650000 Yes         12
## 10 649000 Yes         12
## # ℹ 1,718 more rows

Your turn

  1. Select houses with Private sewer
  2. Order the results in a descending order of lot_size

mutate(): Add new variables

Create new variables that are functions of existing variables. (But this won’t be saved to original df)

df %>%
  mutate(bedrooms = rooms - bathrooms) %>%
  filter(bedrooms>3) %>%
  select(bedrooms, bathrooms, rooms)
## # A tibble: 1,381 × 3
##    bedrooms bathrooms rooms
##       <dbl>     <dbl> <dbl>
##  1      4         1       5
##  2      3.5       2.5     6
##  3      7         1       8
##  4      3.5       1.5     5
##  5      7         1       8
##  6      6.5       1.5     8
##  7      7.5       1.5     9
##  8      6.5       1.5     8
##  9      4.5       1.5     6
## 10     11         1      12
## # ℹ 1,371 more rows

mutate(): Add new variables

If we want to save to the original df

#Option 1
df<- df %>% 
  mutate(bedrooms = rooms - bathrooms) 
#Option 2
df$bedrooms = df$rooms - df$bathrooms

Use Ifelse function to create new variable

If we want to use mutate to create a new column, expensive_house, which equals “yes” if the house price is greater than 150000 and “no” otherwise.

  • After that, we can select the price and expensive_house columns and arrange the rows in the descending order of price.
df %>%
  mutate(expensive_house = ifelse(price>150000, "yes", "no")) %>%
  arrange(desc(price)) %>%
  select(price, expensive_house)
## # A tibble: 1,728 × 2
##     price expensive_house
##     <dbl> <chr>          
##  1 775000 yes            
##  2 775000 yes            
##  3 760000 yes            
##  4 725000 yes            
##  5 670000 yes            
##  6 658300 yes            
##  7 655000 yes            
##  8 650000 yes            
##  9 650000 yes            
## 10 649000 yes            
## # ℹ 1,718 more rows

Your Turn

  • Use mutate to create a new column, old_house, which equals “TRUE” if the house age is greater than 40 years and “False” otherwise.

  • After that, select the price, age and old_house columns and arrange the rows in the descending order of price.

## # A tibble: 1,728 × 3
##     price   age old_house
##     <dbl> <dbl> <chr>    
##  1 775000     5 FALSE    
##  2 775000    31 FALSE    
##  3 760000     2 FALSE    
##  4 725000     3 FALSE    
##  5 670000   121 TRUE     
##  6 658300     3 FALSE    
##  7 655000    55 TRUE     
##  8 650000     2 FALSE    
##  9 650000     3 FALSE    
## 10 649000    10 FALSE    
## # ℹ 1,718 more rows

summarise(): Reduce variables to values

  • group_by() creates the groups that will be operated on
  • summarise() uses the provided aggregation function to summarise each group
df %>%
  group_by(air_cond) %>%
  summarise(avg_price = mean(price))
## # A tibble: 2 × 2
##   air_cond avg_price
##   <chr>        <dbl>
## 1 No         187022.
## 2 Yes        254904.

You can have multiple summary/aggregate statistics:

df %>%
  group_by(air_cond) %>%
  summarise(n_house = n(), # n() gives the number of rows in each group
            avg_price = mean(price))  
## # A tibble: 2 × 3
##   air_cond n_house avg_price
##   <chr>      <int>     <dbl>
## 1 No          1093   187022.
## 2 Yes          635   254904.

Your turn

Some houses have a drive way, some don’t. Use group_by and summarise to obtain the following:

## # A tibble: 2 × 4
##   construction n_house min_lotsize max_lotsize
##   <chr>          <int>       <dbl>       <dbl>
## 1 No              1647        0          12.2 
## 2 Yes               81        0.01        3.47

Lab Assignment 1

Once you finish the following tasks, please put everything in one single R file with the file name assignment1.R (.R is the file extension) and upload it to Canvas.

In addition, lab assignments will be graded based on:

  • Accuracy: whether the R script achieves the objectives
  • Readability: whether the R script is clean, well-formatted, and easily readable
    • You risk loosing 10 points if your code has no proper indentation or has more than 80 characters in a line.

Lab assignment 1

Task 1

##         name female age
## row_1   Alex  FALSE  20
## row_2    Bob  FALSE  25
## row_3 Claire   TRUE  30
## row_4 Denise   TRUE  35
  1. Create the above data frame (don’t forget the column/row names!)
  2. Obtain the mean of the age column from the data frame
    • Note: You will not get any point if you do not get the answer through the data.frame.
  3. Retrieve Claire’s age from the data frame
    • Note: You will not get any point if you do not get the answer through the data.frame.

Lab Assignment 1

Task 2

Use the HousePrices.csv dataset and import it as a tibble with read_csv function.

  1. Show price, air_cond, heat, fireplaces for houses that have no fireplaces

  2. First create a new variable bedroom using roomsminusbathrooms. Then create a new variable price_per_bedroom which is price divided by the number of bedrooms. Show only price, bedrooms, and price_per_bedroom columns and arrange the rows in the descending order of price_per_bedroom

  3. Create a new variable has_4_or_more_bedrooms which is TRUE if the house has 4 or more bedrooms and FALSE otherwise. Use this variable and group_by and summarise() to find how many houses have 4 or more bedrooms and how many don’t