Task A: Question 1

Case Study: Richard Liu is the CEO of the XYZ company in the state of NSW, Australia. He is interested in increasing the number of branches and plans to buy an office for the new operation. He has discussed this with the PQR bank manager and has taken out a loan of 375000 for 7 years with various interest rates such as 4, 4.5, 5, 5.5, 6, 6.5 and 7 accordingly, which will cover the cost of the building. Now answer the following questions:

  1. Assign a variable called loanAmount with the value 375000.

  2. Assign an array called interestRate with 7 numbers, 4, 4.5, 5, 5.5, 6, 6.5 and 7.

  3. Write a single expression to work out the remaining principal if a repayment of $50000 is made after a year of interest being added to loanAmount for each of these interestRates (no compounding). You will need to work out the mathematical equation required to write the code.

  4. Calculate principalPaid, based on repaying an annual sum of $25000 (no compounding). You will need to work out the mathematical equation required to write the code.

  5. Do the following: data.frame(loanAmount, annualInterest, principalPaid)

# Write your code and your comments here.

# 1. Assigning a variable called loanAmount with the value 375000.
loanAmount <- 375000

# 2. Assigning an array called interestRate with 7 numbers.
interestRate <- c(4, 4.5, 5, 5.5, 6, 6.5, 7)

# 3. Calculating the remaining principal after one year of interest for each interest rate.
#    No compounding, so subtract the annual interest from the loan amount.
annualInterest <- loanAmount * (interestRate / 100)
remainingPrincipal <- loanAmount - annualInterest - 50000

# 4. Calculating principalPaid based on repaying an annual sum of $25000.
principalPaid <- 25000

# 5. Creating a data frame with loanAmount, annualInterest, and principalPaid.
df <- data.frame(loanAmount, annualInterest, principalPaid)

# Printing the data frame
print(df)
##   loanAmount annualInterest principalPaid
## 1     375000          15000         25000
## 2     375000          16875         25000
## 3     375000          18750         25000
## 4     375000          20625         25000
## 5     375000          22500         25000
## 6     375000          24375         25000
## 7     375000          26250         25000
# Any plot you generate will be shown in the document.

Task A: Question 2

Case Study: The House Sales in King County, USA is a dataset that contains house sale prices for King County, which includes Seattle. It includes homes sold between May 2014 and May 2015. The original dataset can be found here: The dataset consists of:

id : A notation for a house

date: Date house was sold

price: Price is prediction target

bedrooms: Number of bedrooms

bathrooms: Number of bathrooms

sqft_living: Square footage of the home

sqft_lot: Square footage of the lot

floors :Total floors (levels) in house

waterfront :House which has a view to a waterfront

view: Has been viewed

condition :How good the condition is overall

grade: overall grade given to the housing unit, based on King County grading system

sqft_above : Square footage of house apart from basement

sqft_basement: Square footage of the basement

yr_built : Built Year

yr_renovated : Year when house was renovated

zipcode: Zip code

lat: Latitude coordinate

long: Longitude coordinate

sqft_living15 : Living room area in 2015(implies– some renovations) This might or might not have affected the lotsize area

sqft_lot15 : LotSize area in 2015(implies– some renovations)

Now answer the following questions:

  1. How do you import the House Sales in King County, USA dataset?

  2. How many rows and column does the House Sales in King County, USA data have?

  3. What are the names and classes of each column?

  4. Check the presence of missing observations in the dataset.

  5. What do we use to see the first few lines and last few lines of the dataset?

  6. Drop the columns “id” and ” sqft_basement” from the dataset, then use the method summary () to summarise the data of each and every variable present in the dataset. hea

# Write your code and your comments here (delete these comments).
# Import the dataset
#1) 
library("tidyverse")
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.3     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library("readxl")
df <- read_csv("~/Desktop/House-Sales-in-King-County-USA.csv", show_col_types = FALSE)
df
## # A tibble: 21,613 × 21
##            id date                 price bedrooms bathrooms sqft_living sqft_lot
##         <dbl> <dttm>               <dbl>    <dbl>     <dbl>       <dbl>    <dbl>
##  1 7129300520 2014-10-13 00:00:00 2.22e5        3      1           1180     5650
##  2 6414100192 2014-12-09 00:00:00 5.38e5        3      2.25        2570     7242
##  3 5631500400 2015-02-25 00:00:00 1.80e5        2      1            770    10000
##  4 2487200875 2014-12-09 00:00:00 6.04e5        4      3           1960     5000
##  5 1954400510 2015-02-18 00:00:00 5.10e5        3      2           1680     8080
##  6 7237550310 2014-05-12 00:00:00 1.23e6        4      4.5         5420   101930
##  7 1321400060 2014-06-27 00:00:00 2.57e5        3      2.25        1715     6819
##  8 2008000270 2015-01-15 00:00:00 2.92e5        3      1.5         1060     9711
##  9 2414600126 2015-04-15 00:00:00 2.29e5        3      1           1780     7470
## 10 3793500160 2015-03-12 00:00:00 3.23e5        3      2.5         1890     6560
## # ℹ 21,603 more rows
## # ℹ 14 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
## #   condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
## #   yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
## #   sqft_living15 <dbl>, sqft_lot15 <dbl>
glimpse(df) # to see the dataset using my RStudio invaroment. 
## Rows: 21,613
## Columns: 21
## $ id            <dbl> 7129300520, 6414100192, 5631500400, 2487200875, 19544005…
## $ date          <dttm> 2014-10-13, 2014-12-09, 2015-02-25, 2014-12-09, 2015-02…
## $ price         <dbl> 221900, 538000, 180000, 604000, 510000, 1225000, 257500,…
## $ bedrooms      <dbl> 3, 3, 2, 4, 3, 4, 3, 3, 3, 3, 3, 2, 3, 3, 5, 4, 3, 4, 2,…
## $ bathrooms     <dbl> 1.00, 2.25, 1.00, 3.00, 2.00, 4.50, 2.25, 1.50, 1.00, 2.…
## $ sqft_living   <dbl> 1180, 2570, 770, 1960, 1680, 5420, 1715, 1060, 1780, 189…
## $ sqft_lot      <dbl> 5650, 7242, 10000, 5000, 8080, 101930, 6819, 9711, 7470,…
## $ floors        <dbl> 1.0, 2.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0, 2.0, 1.0, 1…
## $ waterfront    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ view          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0,…
## $ condition     <dbl> 3, 3, 3, 5, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 4, 4,…
## $ grade         <dbl> 7, 7, 6, 7, 8, 11, 7, 7, 7, 7, 8, 7, 7, 7, 7, 9, 7, 7, 7…
## $ sqft_above    <dbl> 1180, 2170, 770, 1050, 1680, 3890, 1715, 1060, 1050, 189…
## $ sqft_basement <dbl> 0, 400, 0, 910, 0, 1530, 0, 0, 730, 0, 1700, 300, 0, 0, …
## $ yr_built      <dbl> 1955, 1951, 1933, 1965, 1987, 2001, 1995, 1963, 1960, 20…
## $ yr_renovated  <dbl> 0, 1991, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ zipcode       <dbl> 98178, 98125, 98028, 98136, 98074, 98053, 98003, 98198, …
## $ lat           <dbl> 47.5112, 47.7210, 47.7379, 47.5208, 47.6168, 47.6561, 47…
## $ long          <dbl> -122.257, -122.319, -122.233, -122.393, -122.045, -122.0…
## $ sqft_living15 <dbl> 1340, 1690, 2720, 1360, 1800, 4760, 2238, 1650, 1780, 23…
## $ sqft_lot15    <dbl> 5650, 7639, 8062, 5000, 7503, 101930, 6819, 9711, 8113, …
dim("df") # check the dimensions : number of rows n columns. 
## NULL
# Column names
colnames <- names(df)

# Classes of each column
column_classes <- sapply("df", class)

# Combine column names and classes NOT WORKING CODE ???? 
ncol(df)
## [1] 21
#4)
a <- head (df, n = 10 )# first 10 rows 
a 
## # A tibble: 10 × 21
##            id date                 price bedrooms bathrooms sqft_living sqft_lot
##         <dbl> <dttm>               <dbl>    <dbl>     <dbl>       <dbl>    <dbl>
##  1 7129300520 2014-10-13 00:00:00 2.22e5        3      1           1180     5650
##  2 6414100192 2014-12-09 00:00:00 5.38e5        3      2.25        2570     7242
##  3 5631500400 2015-02-25 00:00:00 1.80e5        2      1            770    10000
##  4 2487200875 2014-12-09 00:00:00 6.04e5        4      3           1960     5000
##  5 1954400510 2015-02-18 00:00:00 5.10e5        3      2           1680     8080
##  6 7237550310 2014-05-12 00:00:00 1.23e6        4      4.5         5420   101930
##  7 1321400060 2014-06-27 00:00:00 2.57e5        3      2.25        1715     6819
##  8 2008000270 2015-01-15 00:00:00 2.92e5        3      1.5         1060     9711
##  9 2414600126 2015-04-15 00:00:00 2.29e5        3      1           1780     7470
## 10 3793500160 2015-03-12 00:00:00 3.23e5        3      2.5         1890     6560
## # ℹ 14 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
## #   condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
## #   yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
## #   sqft_living15 <dbl>, sqft_lot15 <dbl>
b <- tail (df, n = 10 ) # last 10 rows 
b
## # A tibble: 10 × 21
##            id date                 price bedrooms bathrooms sqft_living sqft_lot
##         <dbl> <dttm>               <dbl>    <dbl>     <dbl>       <dbl>    <dbl>
##  1 7852140040 2014-08-25 00:00:00 5.07e5        3      2.5         2270     5536
##  2 9834201367 2015-01-26 00:00:00 4.29e5        3      2           1490     1126
##  3 3448900210 2014-10-14 00:00:00 6.11e5        4      2.5         2520     6023
##  4 7936000429 2015-03-26 00:00:00 1.01e6        4      3.5         3510     7200
##  5 2997800021 2015-02-19 00:00:00 4.75e5        3      2.5         1310     1294
##  6  263000018 2014-05-21 00:00:00 3.60e5        3      2.5         1530     1131
##  7 6600060120 2015-02-23 00:00:00 4   e5        4      2.5         2310     5813
##  8 1523300141 2014-06-23 00:00:00 4.02e5        2      0.75        1020     1350
##  9  291310100 2015-01-16 00:00:00 4   e5        3      2.5         1600     2388
## 10 1523300157 2014-10-15 00:00:00 3.25e5        2      0.75        1020     1076
## # ℹ 14 more variables: floors <dbl>, waterfront <dbl>, view <dbl>,
## #   condition <dbl>, grade <dbl>, sqft_above <dbl>, sqft_basement <dbl>,
## #   yr_built <dbl>, yr_renovated <dbl>, zipcode <dbl>, lat <dbl>, long <dbl>,
## #   sqft_living15 <dbl>, sqft_lot15 <dbl>
str(df)
## spc_tbl_ [21,613 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id           : num [1:21613] 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
##  $ date         : POSIXct[1:21613], format: "2014-10-13" "2014-12-09" ...
##  $ price        : num [1:21613] 221900 538000 180000 604000 510000 ...
##  $ bedrooms     : num [1:21613] 3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num [1:21613] 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : num [1:21613] 1180 2570 770 1960 1680 ...
##  $ sqft_lot     : num [1:21613] 5650 7242 10000 5000 8080 ...
##  $ floors       : num [1:21613] 1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : num [1:21613] 3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : num [1:21613] 7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : num [1:21613] 1180 2170 770 1050 1680 ...
##  $ sqft_basement: num [1:21613] 0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : num [1:21613] 1955 1951 1933 1965 1987 ...
##  $ yr_renovated : num [1:21613] 0 1991 0 0 0 ...
##  $ zipcode      : num [1:21613] 98178 98125 98028 98136 98074 ...
##  $ lat          : num [1:21613] 47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num [1:21613] -122 -122 -122 -122 -122 ...
##  $ sqft_living15: num [1:21613] 1340 1690 2720 1360 1800 ...
##  $ sqft_lot15   : num [1:21613] 5650 7639 8062 5000 7503 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   date = col_datetime(format = ""),
##   ..   price = col_double(),
##   ..   bedrooms = col_double(),
##   ..   bathrooms = col_double(),
##   ..   sqft_living = col_double(),
##   ..   sqft_lot = col_double(),
##   ..   floors = col_double(),
##   ..   waterfront = col_double(),
##   ..   view = col_double(),
##   ..   condition = col_double(),
##   ..   grade = col_double(),
##   ..   sqft_above = col_double(),
##   ..   sqft_basement = col_double(),
##   ..   yr_built = col_double(),
##   ..   yr_renovated = col_double(),
##   ..   zipcode = col_double(),
##   ..   lat = col_double(),
##   ..   long = col_double(),
##   ..   sqft_living15 = col_double(),
##   ..   sqft_lot15 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
str (df) # structure of  dataframe
## spc_tbl_ [21,613 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id           : num [1:21613] 7.13e+09 6.41e+09 5.63e+09 2.49e+09 1.95e+09 ...
##  $ date         : POSIXct[1:21613], format: "2014-10-13" "2014-12-09" ...
##  $ price        : num [1:21613] 221900 538000 180000 604000 510000 ...
##  $ bedrooms     : num [1:21613] 3 3 2 4 3 4 3 3 3 3 ...
##  $ bathrooms    : num [1:21613] 1 2.25 1 3 2 4.5 2.25 1.5 1 2.5 ...
##  $ sqft_living  : num [1:21613] 1180 2570 770 1960 1680 ...
##  $ sqft_lot     : num [1:21613] 5650 7242 10000 5000 8080 ...
##  $ floors       : num [1:21613] 1 2 1 1 1 1 2 1 1 2 ...
##  $ waterfront   : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
##  $ view         : num [1:21613] 0 0 0 0 0 0 0 0 0 0 ...
##  $ condition    : num [1:21613] 3 3 3 5 3 3 3 3 3 3 ...
##  $ grade        : num [1:21613] 7 7 6 7 8 11 7 7 7 7 ...
##  $ sqft_above   : num [1:21613] 1180 2170 770 1050 1680 ...
##  $ sqft_basement: num [1:21613] 0 400 0 910 0 1530 0 0 730 0 ...
##  $ yr_built     : num [1:21613] 1955 1951 1933 1965 1987 ...
##  $ yr_renovated : num [1:21613] 0 1991 0 0 0 ...
##  $ zipcode      : num [1:21613] 98178 98125 98028 98136 98074 ...
##  $ lat          : num [1:21613] 47.5 47.7 47.7 47.5 47.6 ...
##  $ long         : num [1:21613] -122 -122 -122 -122 -122 ...
##  $ sqft_living15: num [1:21613] 1340 1690 2720 1360 1800 ...
##  $ sqft_lot15   : num [1:21613] 5650 7639 8062 5000 7503 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   date = col_datetime(format = ""),
##   ..   price = col_double(),
##   ..   bedrooms = col_double(),
##   ..   bathrooms = col_double(),
##   ..   sqft_living = col_double(),
##   ..   sqft_lot = col_double(),
##   ..   floors = col_double(),
##   ..   waterfront = col_double(),
##   ..   view = col_double(),
##   ..   condition = col_double(),
##   ..   grade = col_double(),
##   ..   sqft_above = col_double(),
##   ..   sqft_basement = col_double(),
##   ..   yr_built = col_double(),
##   ..   yr_renovated = col_double(),
##   ..   zipcode = col_double(),
##   ..   lat = col_double(),
##   ..   long = col_double(),
##   ..   sqft_living15 = col_double(),
##   ..   sqft_lot15 = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>
colnames(df) # names of columns 
##  [1] "id"            "date"          "price"         "bedrooms"     
##  [5] "bathrooms"     "sqft_living"   "sqft_lot"      "floors"       
##  [9] "waterfront"    "view"          "condition"     "grade"        
## [13] "sqft_above"    "sqft_basement" "yr_built"      "yr_renovated" 
## [17] "zipcode"       "lat"           "long"          "sqft_living15"
## [21] "sqft_lot15"
sapply(df, class) # clasess 
## $id
## [1] "numeric"
## 
## $date
## [1] "POSIXct" "POSIXt" 
## 
## $price
## [1] "numeric"
## 
## $bedrooms
## [1] "numeric"
## 
## $bathrooms
## [1] "numeric"
## 
## $sqft_living
## [1] "numeric"
## 
## $sqft_lot
## [1] "numeric"
## 
## $floors
## [1] "numeric"
## 
## $waterfront
## [1] "numeric"
## 
## $view
## [1] "numeric"
## 
## $condition
## [1] "numeric"
## 
## $grade
## [1] "numeric"
## 
## $sqft_above
## [1] "numeric"
## 
## $sqft_basement
## [1] "numeric"
## 
## $yr_built
## [1] "numeric"
## 
## $yr_renovated
## [1] "numeric"
## 
## $zipcode
## [1] "numeric"
## 
## $lat
## [1] "numeric"
## 
## $long
## [1] "numeric"
## 
## $sqft_living15
## [1] "numeric"
## 
## $sqft_lot15
## [1] "numeric"
# Check for missing observations
missing_values <- colSums(is.na(df)) #?????????? NOT RUNNING CODE 



# Any plot you generate will be shown in the document.