Case Study: Richard Liu is the CEO of the XYZ company in the state of NSW, Australia. He is interested in increasing the number of branches and plans to buy an office for the new operation. He has discussed this with the PQR bank manager and has taken out a loan of 375000 for 7 years with various interest rates such as 4, 4.5, 5, 5.5, 6, 6.5 and 7 accordingly, which will cover the cost of the building. Now answer the following questions:
Assign a variable called loanAmount with the value 375000.
Assign an array called interestRate with 7 numbers, 4, 4.5, 5, 5.5, 6, 6.5 and 7.
Write a single expression to work out the remaining principal if a repayment of $50000 is made after a year of interest being added to loanAmount for each of these interestRates (no compounding). You will need to work out the mathematical equation required to write the code.
Calculate principalPaid, based on repaying an annual
sum of $25000 (no compounding). You will need to work out the
mathematical equation required to write the code.
Do the following:
data.frame(loanAmount, annualInterest, principalPaid)
# Write your code and your comments here.
# 1. Assigning a variable called loanAmount with the value 375000.
loanAmount <- 375000
# 2. Assigning an array called interestRate with 7 numbers.
interestRate <- c(4, 4.5, 5, 5.5, 6, 6.5, 7)
# 3. Calculating the remaining principal after one year of interest for each interest rate.
# No compounding, so subtract the annual interest from the loan amount.
annualInterest <- loanAmount * (interestRate / 100)
remainingPrincipal <- loanAmount - annualInterest - 50000
# 4. Calculating principalPaid based on repaying an annual sum of $25000.
principalPaid <- 25000
# 5. Creating a data frame with loanAmount, annualInterest, and principalPaid.
df <- data.frame(loanAmount, annualInterest, principalPaid)
# Printing the data frame
print(df)
## loanAmount annualInterest principalPaid
## 1 375000 15000 25000
## 2 375000 16875 25000
## 3 375000 18750 25000
## 4 375000 20625 25000
## 5 375000 22500 25000
## 6 375000 24375 25000
## 7 375000 26250 25000
# Any plot you generate will be shown in the document.
Case Study: The House Sales in King County, USA is a dataset that contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. The original dataset can be found here: The dataset consists of:
id : A notation for a house
date: Date house was sold
price: Price is prediction target
bedrooms: Number of bedrooms
bathrooms: Number of bathrooms
sqft_living: Square footage of the home
sqft_lot: Square footage of the lot
floors :Total floors (levels) in house
waterfront :House which has a view to a waterfront
view: Has been viewed
condition :How good the condition is overall
grade: overall grade given to the housing unit, based on King County grading system
sqft_above : Square footage of house apart from basement
sqft_basement: Square footage of the basement
yr_built : Built Year
yr_renovated : Year when house was renovated
zipcode: Zip code
lat: Latitude coordinate
long: Longitude coordinate
sqft_living15 : Living room area in 2015(implies– some renovations) This might or might not have affected the lotsize area
sqft_lot15 : LotSize area in 2015(implies– some renovations)
Now answer the following questions:
How do you import the House Sales in King County, USA dataset?
How many rows and column does the House Sales in King County, USA data have?
What are the names and classes of each column?
Check the presence of missing observations in the dataset.
What do we use to see the first few lines and last few lines of the dataset?
Drop the columns “id” and ” sqft_basement” from the dataset, then use the method summary () to summarise the data of each and every variable present in the dataset. hea
# Write your code and your comments here (delete these comments).
# Import the dataset
#1)
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("readxl")
df <- read_csv("~/Desktop/House-Sales-in-King-County-USA.csv", show_col_types = FALSE)
df
## # A tibble: 21,613 × 21
## id date price bedrooms bathrooms sqft_living sqft_lot
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7129300520 2014-10-13 00:00:00 2.22e5 3 1 1180 5650
## 2 6414100192 2014-12-09 00:00:00 5.38e5 3 2.25 2570 7242
## 3 5631500400 2015-02-25 00:00:00 1.80e5 2 1 770 10000
## 4 2487200875 2014-12-09 00:00:00 6.04e5 4 3 1960 5000
## 5 1954400510 2015-02-18 00:00:00 5.10e5 3 2 1680 8080
## 6 7237550310 2014-05-12 00:00:00 1.23e6 4 4.5 5420 101930
## 7 1321400060 2014-06-27 00:00:00 2.57e5 3 2.25 1715 6819
## 8 2008000270 2015-01-15 00:00:00 2.92e5 3 1.5 1060 9711
## 9 2414600126 2015-04-15 00:00:00 2.29e5 3 1 1780 7470
## 10 3793500160 2015-03-12 00:00:00 3.23e5 3 2.5 1890 6560
## # ℹ 21,603 more rows
## # ℹ 14 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
## # condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
## # yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
## # sqft_living15 <dbl>, sqft_lot15 <dbl>
glimpse(df) # to see the dataset using my RStudio invaroment.
## Rows: 21,613
## Columns: 21
## $ id <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-02…
## $ price <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15 <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
dim("df") # check the dimensions : number of rows n columns.
## NULL
# Column names
colnames <- names(df)
# Classes of each column
column_classes <- sapply("df", class)
# Combine column names and classes NOT WORKING CODE ????
ncol(df)
## [1] 21
#4)
a <- head (df, n = 10 )# first 10 rows
a
## # A tibble: 10 × 21
## id date price bedrooms bathrooms sqft_living sqft_lot
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7129300520 2014-10-13 00:00:00 2.22e5 3 1 1180 5650
## 2 6414100192 2014-12-09 00:00:00 5.38e5 3 2.25 2570 7242
## 3 5631500400 2015-02-25 00:00:00 1.80e5 2 1 770 10000
## 4 2487200875 2014-12-09 00:00:00 6.04e5 4 3 1960 5000
## 5 1954400510 2015-02-18 00:00:00 5.10e5 3 2 1680 8080
## 6 7237550310 2014-05-12 00:00:00 1.23e6 4 4.5 5420 101930
## 7 1321400060 2014-06-27 00:00:00 2.57e5 3 2.25 1715 6819
## 8 2008000270 2015-01-15 00:00:00 2.92e5 3 1.5 1060 9711
## 9 2414600126 2015-04-15 00:00:00 2.29e5 3 1 1780 7470
## 10 3793500160 2015-03-12 00:00:00 3.23e5 3 2.5 1890 6560
## # ℹ 14 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
## # condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
## # yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
## # sqft_living15 <dbl>, sqft_lot15 <dbl>
b <- tail (df, n = 10 ) # last 10 rows
b
## # A tibble: 10 × 21
## id date price bedrooms bathrooms sqft_living sqft_lot
## <dbl> <dttm> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 7852140040 2014-08-25 00:00:00 5.07e5 3 2.5 2270 5536
## 2 9834201367 2015-01-26 00:00:00 4.29e5 3 2 1490 1126
## 3 3448900210 2014-10-14 00:00:00 6.11e5 4 2.5 2520 6023
## 4 7936000429 2015-03-26 00:00:00 1.01e6 4 3.5 3510 7200
## 5 2997800021 2015-02-19 00:00:00 4.75e5 3 2.5 1310 1294
## 6 263000018 2014-05-21 00:00:00 3.60e5 3 2.5 1530 1131
## 7 6600060120 2015-02-23 00:00:00 4 e5 4 2.5 2310 5813
## 8 1523300141 2014-06-23 00:00:00 4.02e5 2 0.75 1020 1350
## 9 291310100 2015-01-16 00:00:00 4 e5 3 2.5 1600 2388
## 10 1523300157 2014-10-15 00:00:00 3.25e5 2 0.75 1020 1076
## # ℹ 14 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
## # condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
## # yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
## # sqft_living15 <dbl>, sqft_lot15 <dbl>
str(df)
## spc_tbl_ [21,613 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:21613] 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : POSIXct[1:21613], format: "2014-10-13" "2014-12-09" ...
## $ price : num [1:21613] 221900 538000 180000 604000 510000 ...
## $ bedrooms : num [1:21613] 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num [1:21613] 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : num [1:21613] 1180 2570 770 1960 1680 ...
## $ sqft_lot : num [1:21613] 5650 7242 10000 5000 8080 ...
## $ floors : num [1:21613] 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
## $ view : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : num [1:21613] 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : num [1:21613] 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : num [1:21613] 1180 2170 770 1050 1680 ...
## $ sqft_basement: num [1:21613] 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : num [1:21613] 1955 1951 1933 1965 1987 ...
## $ yr_renovated : num [1:21613] 0 1991 0 0 0 ...
## $ zipcode : num [1:21613] 98178 98125 98028 98136 98074 ...
## $ lat : num [1:21613] 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num [1:21613] -122 -122 -122 -122 -122 ...
## $ sqft_living15: num [1:21613] 1340 1690 2720 1360 1800 ...
## $ sqft_lot15 : num [1:21613] 5650 7639 8062 5000 7503 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. date = col_datetime(format = ""),
## .. price = col_double(),
## .. bedrooms = col_double(),
## .. bathrooms = col_double(),
## .. sqft_living = col_double(),
## .. sqft_lot = col_double(),
## .. floors = col_double(),
## .. waterfront = col_double(),
## .. view = col_double(),
## .. condition = col_double(),
## .. grade = col_double(),
## .. sqft_above = col_double(),
## .. sqft_basement = col_double(),
## .. yr_built = col_double(),
## .. yr_renovated = col_double(),
## .. zipcode = col_double(),
## .. lat = col_double(),
## .. long = col_double(),
## .. sqft_living15 = col_double(),
## .. sqft_lot15 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
str (df) # structure of dataframe
## spc_tbl_ [21,613 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:21613] 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
## $ date : POSIXct[1:21613], format: "2014-10-13" "2014-12-09" ...
## $ price : num [1:21613] 221900 538000 180000 604000 510000 ...
## $ bedrooms : num [1:21613] 3 3 2 4 3 4 3 3 3 3 ...
## $ bathrooms : num [1:21613] 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
## $ sqft_living : num [1:21613] 1180 2570 770 1960 1680 ...
## $ sqft_lot : num [1:21613] 5650 7242 10000 5000 8080 ...
## $ floors : num [1:21613] 1 2 1 1 1 1 2 1 1 2 ...
## $ waterfront : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
## $ view : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
## $ condition : num [1:21613] 3 3 3 5 3 3 3 3 3 3 ...
## $ grade : num [1:21613] 7 7 6 7 8 11 7 7 7 7 ...
## $ sqft_above : num [1:21613] 1180 2170 770 1050 1680 ...
## $ sqft_basement: num [1:21613] 0 400 0 910 0 1530 0 0 730 0 ...
## $ yr_built : num [1:21613] 1955 1951 1933 1965 1987 ...
## $ yr_renovated : num [1:21613] 0 1991 0 0 0 ...
## $ zipcode : num [1:21613] 98178 98125 98028 98136 98074 ...
## $ lat : num [1:21613] 47.5 47.7 47.7 47.5 47.6 ...
## $ long : num [1:21613] -122 -122 -122 -122 -122 ...
## $ sqft_living15: num [1:21613] 1340 1690 2720 1360 1800 ...
## $ sqft_lot15 : num [1:21613] 5650 7639 8062 5000 7503 ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. date = col_datetime(format = ""),
## .. price = col_double(),
## .. bedrooms = col_double(),
## .. bathrooms = col_double(),
## .. sqft_living = col_double(),
## .. sqft_lot = col_double(),
## .. floors = col_double(),
## .. waterfront = col_double(),
## .. view = col_double(),
## .. condition = col_double(),
## .. grade = col_double(),
## .. sqft_above = col_double(),
## .. sqft_basement = col_double(),
## .. yr_built = col_double(),
## .. yr_renovated = col_double(),
## .. zipcode = col_double(),
## .. lat = col_double(),
## .. long = col_double(),
## .. sqft_living15 = col_double(),
## .. sqft_lot15 = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
colnames(df) # names of columns
## [1] "id" "date" "price" "bedrooms"
## [5] "bathrooms" "sqft_living" "sqft_lot" "floors"
## [9] "waterfront" "view" "condition" "grade"
## [13] "sqft_above" "sqft_basement" "yr_built" "yr_renovated"
## [17] "zipcode" "lat" "long" "sqft_living15"
## [21] "sqft_lot15"
sapply(df, class) # clasess
## $id
## [1] "numeric"
##
## $date
## [1] "POSIXct" "POSIXt"
##
## $price
## [1] "numeric"
##
## $bedrooms
## [1] "numeric"
##
## $bathrooms
## [1] "numeric"
##
## $sqft_living
## [1] "numeric"
##
## $sqft_lot
## [1] "numeric"
##
## $floors
## [1] "numeric"
##
## $waterfront
## [1] "numeric"
##
## $view
## [1] "numeric"
##
## $condition
## [1] "numeric"
##
## $grade
## [1] "numeric"
##
## $sqft_above
## [1] "numeric"
##
## $sqft_basement
## [1] "numeric"
##
## $yr_built
## [1] "numeric"
##
## $yr_renovated
## [1] "numeric"
##
## $zipcode
## [1] "numeric"
##
## $lat
## [1] "numeric"
##
## $long
## [1] "numeric"
##
## $sqft_living15
## [1] "numeric"
##
## $sqft_lot15
## [1] "numeric"
# Check for missing observations
missing_values <- colSums(is.na(df)) #?????????? NOT RUNNING CODE
# Any plot you generate will be shown in the document.