Q1.Opening Data, Merging Data (10 points) ● Read in the list of towns (#1) and Real Estate Sales (#2) datasets into R. o I’ve filtered data for years 2014-2016 to reduce the number of rows. You can find the clean original data online as a reference.

# import csv files
townList <- read.csv("List_of_Towns.csv")
realEstate <- read.csv("Real_Estate_Sales_2014-2016.csv")

townList
realEstate

● Use the list of towns (#1) dataset to extract the county column to join it with Real Estate dataset (#2) Hint: Use Merge

# merge both datasets on the shared key 'Town'
myData <- merge(x = realEstate, 
                y = townList[ , c('Town','County')], 
                by.x = 'Town', 
                by.y = 'Town')

# print merged data
myData

Q2: Filtering out missing data (5 points) ● It’s often helpful to make a subset of the data that you need to clean. Then you can join it back together later. You can use Excel to verify the columns with missing values. o Identify the columns with missing values using R. Hint: You will observe that Date Recorded and Sales Amount Columns have missing data.

# find out the datatypes of all columns
str(myData)
## 'data.frame':    145987 obs. of  12 variables:
##  $ Town           : Factor w/ 169 levels "Andover","Ansonia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ï..ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SerialNumber   : int  14046 14011 15006 14044 14035 15051 14002 15011 14043 14029 ...
##  $ ListYear       : int  2014 2014 2015 2014 2014 2015 2014 2015 2014 2014 ...
##  $ DateRecorded   : Factor w/ 911 levels "","1/1/2015",..: 890 12 207 887 755 888 121 57 871 658 ...
##  $ Address        : Factor w/ 134658 levels "   WORTH AVE 113",..: 66 1250 1647 3220 4743 6908 9425 10624 13162 13704 ...
##  $ AssessedValue  : int  10720 153100 102900 108700 164000 188400 89400 213200 52400 131700 ...
##  $ SaleAmount     : num  75000 190000 50000 128368 230000 ...
##  $ SalesRatio     : num  0.143 0.806 2.058 0.847 0.713 ...
##  $ PropertyType   : Factor w/ 6 levels "Apartments","Commercial",..: 6 5 5 5 5 5 5 5 5 5 ...
##  $ ResidentialType: Factor w/ 5 levels "Four Family",..: NA 3 3 3 3 3 3 3 3 3 ...
##  $ County         : Factor w/ 8 levels "Fairfield County",..: 7 7 7 7 7 7 7 7 7 7 ...
# convert factors to character as no operations can be performed on datatype factor
myData$DateRecorded <- as.character(myData$DateRecorded)
myData$Town <- as.character(myData$Town)
myData$PropertyType <- as.character(myData$PropertyType)
myData$Address <- as.character(myData$Address)
myData$ResidentialType <- as.character(myData$ResidentialType)
myData$County <- as.character(myData$County)

# recheck datatypes of the columns
str(myData)
## 'data.frame':    145987 obs. of  12 variables:
##  $ Town           : chr  "Andover" "Andover" "Andover" "Andover" ...
##  $ ï..ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SerialNumber   : int  14046 14011 15006 14044 14035 15051 14002 15011 14043 14029 ...
##  $ ListYear       : int  2014 2014 2015 2014 2014 2015 2014 2015 2014 2014 ...
##  $ DateRecorded   : chr  "9/29/2015" "1/14/2015" "11/30/2015" "9/28/2015" ...
##  $ Address        : chr  " US ROUTE 6  M 33 B 36 L 22" "1 JUROVATY LANE" "1 ROSE LANE" "10 PINE RIDGE  DR" ...
##  $ AssessedValue  : int  10720 153100 102900 108700 164000 188400 89400 213200 52400 131700 ...
##  $ SaleAmount     : num  75000 190000 50000 128368 230000 ...
##  $ SalesRatio     : num  0.143 0.806 2.058 0.847 0.713 ...
##  $ PropertyType   : chr  "Vacant Land" "Residential" "Residential" "Residential" ...
##  $ ResidentialType: chr  NA "Single Family" "Single Family" "Single Family" ...
##  $ County         : chr  "Tolland County" "Tolland County" "Tolland County" "Tolland County" ...
# Replace any existing blanks in the dataset with NA for R to be able to detect them
myData[myData==""] <- NA

# Find the total number of NAs in all columns
print("Total missing values in each column:")
## [1] "Total missing values in each column:"
colSums(is.na(myData))
##            Town           ï..ID    SerialNumber        ListYear 
##               0               0               0               0 
##    DateRecorded         Address   AssessedValue      SaleAmount 
##               6               2               0            5283 
##      SalesRatio    PropertyType ResidentialType          County 
##               0               0           11905               0

It can be observed that there are missing values in DateRecorded, Address, SaleAmount and ResidentialType.

Q3: Replacing missing data (10 points) ● Replace missing date recorded data with the corresponding year in ListYear column.

# convert data type of ListYear from integer to character
myData$ListYear <- as.character(myData$ListYear)

# check number of NAs in DateRecorded
print("Before replacement:")
## [1] "Before replacement:"
colSums(is.na(myData))
##            Town           ï..ID    SerialNumber        ListYear 
##               0               0               0               0 
##    DateRecorded         Address   AssessedValue      SaleAmount 
##               6               2               0            5283 
##      SalesRatio    PropertyType ResidentialType          County 
##               0               0           11905               0
# create a variable temp that holds a list of indexes where DateRecorded in NA
temp <- which(is.na(myData$DateRecorded))

# list of rows where DateRecorded is NA
print("Rows where DateRecorded is NA:")
## [1] "Rows where DateRecorded is NA:"
temp
## [1]   1080   2090  77004  89625 129158 129592
# Replace NAs in DateRecorded with the corresponding entries in ListYear
myData$DateRecorded[temp] <- myData$ListYear[temp]

# Recheck number of missing values in DateRecorded
print("After replacement:")
## [1] "After replacement:"
colSums(is.na(myData))
##            Town           ï..ID    SerialNumber        ListYear 
##               0               0               0               0 
##    DateRecorded         Address   AssessedValue      SaleAmount 
##               0               2               0            5283 
##      SalesRatio    PropertyType ResidentialType          County 
##               0               0           11905               0
# compare values in DateRecorded and ListYear for rows that were previously
print("Values replaced in DateRecorded:")
## [1] "Values replaced in DateRecorded:"
myData$DateRecorded[temp]
## [1] "2014" "2014" "2014" "2014" "2014" "2014"
print("Corresponding value in ListYear:")
## [1] "Corresponding value in ListYear:"
myData$ListYear[temp]
## [1] "2014" "2014" "2014" "2014" "2014" "2014"
# Change the datatype of ListYear back to numeric
myData$ListYear <- as.numeric(myData$ListYear)

● Replace missing Sales Amount data with the corresponding with Assessed Value + (Assessed Value multiplied by Sales Ratio)

# check number of NAs in SaleAmount
print("Before replacement:")
## [1] "Before replacement:"
colSums(is.na(myData))
##            Town           ï..ID    SerialNumber        ListYear 
##               0               0               0               0 
##    DateRecorded         Address   AssessedValue      SaleAmount 
##               0               2               0            5283 
##      SalesRatio    PropertyType ResidentialType          County 
##               0               0           11905               0
# reassign value of temp with indices where SalesAmount is NA
temp <- which(is.na(myData$SaleAmount))

# check if temp matches with number of NAs in SalesAmount
print("Number of missing values in SaleAmount:")
## [1] "Number of missing values in SaleAmount:"
length(temp)
## [1] 5283
# replace NAs in SalesAmount using the given formula
myData$SaleAmount[temp] <- myData$AssessedValue[temp] + (myData$AssessedValue[temp]*myData$SalesRatio[temp])

# check number of NAs in SalesAmount
print("After replacement:")
## [1] "After replacement:"
colSums(is.na(myData))
##            Town           ï..ID    SerialNumber        ListYear 
##               0               0               0               0 
##    DateRecorded         Address   AssessedValue      SaleAmount 
##               0               2               0               0 
##      SalesRatio    PropertyType ResidentialType          County 
##               0               0           11905               0
# Recheck if the replacement has been implemented correctly
print("Values replaced in SaleAmount:")
## [1] "Values replaced in SaleAmount:"
myData$SaleAmount[temp[1:4]]
## [1] 276145.8 467707.5 320192.1 320192.1
print("Corresponding value in AssessedValue:")
## [1] "Corresponding value in AssessedValue:"
myData$AssessedValue[temp[1:4]]
## [1] 155400 253500 172100 172100
print("Corresponding value in SalesRatio:")
## [1] "Corresponding value in SalesRatio:"
myData$SalesRatio[temp[1:4]]
## [1] 0.7770 0.8450 0.8605 0.8605
print("Assessed Value + (Assessed Value multiplied by Sales Ratio):")
## [1] "Assessed Value + (Assessed Value multiplied by Sales Ratio):"
myData$AssessedValue[temp[1:4]] + (myData$AssessedValue[temp[1:4]]*myData$SalesRatio[temp[1:4]])
## [1] 276145.8 467707.5 320192.1 320192.1

● Add this subset back to the dataset Hint: Check to see if the row counts match to that of the original dataset.

# check if the data has as many rows as it had at the beginning
nrow(myData)
## [1] 145987

Q4: Binning (10 points) ● Make a column “Property Value” based on Column AssessedValue given the conditions: o ‘LowRange’ if Assessed Value <=300,000 o ‘MidRange’ if Assessed Value >300,000 and <=800,000 o ‘HighRange’ if Assessed Value >800,000

# create a list bins of cuts to be used to bin the AssessedValue column
bins <- c(-Inf, 300000, 800000, Inf)

# create list of labels to be given for each bin
binNames <- c('LowRange', 'MidRange', 'HighRange')

# use cut() function to break the AssessedValue column into bins
# save labels in the new column PropertyValue
myData$PropertyValue <- cut(myData$AssessedValue, 
                            breaks = bins, 
                            labels = binNames)

# check the bins created
myData[,c(7,13)]

Q5: Measures of Central Tendency & Variability (25 points) ● Define, Interpret and Calculate the mean, median, mode, standard deviation, variance, range, minimum and maximum values, skewness and kurtosis for all the continuous data columns using the existing individual built in R functions. Hint: You can create a function and use loops to print results. ● Use describe() in psych package to calculate the summary statistics for all continuous columns.

# check datatypes of all variables
str(myData)
## 'data.frame':    145987 obs. of  13 variables:
##  $ Town           : chr  "Andover" "Andover" "Andover" "Andover" ...
##  $ ï..ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SerialNumber   : int  14046 14011 15006 14044 14035 15051 14002 15011 14043 14029 ...
##  $ ListYear       : num  2014 2014 2015 2014 2014 ...
##  $ DateRecorded   : chr  "9/29/2015" "1/14/2015" "11/30/2015" "9/28/2015" ...
##  $ Address        : chr  " US ROUTE 6  M 33 B 36 L 22" "1 JUROVATY LANE" "1 ROSE LANE" "10 PINE RIDGE  DR" ...
##  $ AssessedValue  : int  10720 153100 102900 108700 164000 188400 89400 213200 52400 131700 ...
##  $ SaleAmount     : num  75000 190000 50000 128368 230000 ...
##  $ SalesRatio     : num  0.143 0.806 2.058 0.847 0.713 ...
##  $ PropertyType   : chr  "Vacant Land" "Residential" "Residential" "Residential" ...
##  $ ResidentialType: chr  NA "Single Family" "Single Family" "Single Family" ...
##  $ County         : chr  "Tolland County" "Tolland County" "Tolland County" "Tolland County" ...
##  $ PropertyValue  : Factor w/ 3 levels "LowRange","MidRange",..: 1 1 1 1 1 1 1 1 1 1 ...
# import library moments to calculate skewness and kurtosis
library(moments)

# create a list of indices of variables that are continuous
cols <- c(7, 8, 9)

# create a function to calculate mode
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

temp <- data.frame(Variable= character(0),
                   Mean = double(0),
                   Median = double(0),
                   Mode = double(0),
                   StandardDeviation = double(0),
                   Variance = double(0),
                   Range = character(0),
                   Minimum = double(0), 
                   Maximum = double(0),
                   Skewness=double(0), 
                   Kurtosis=double(0), 
                   stringsAsFactors = FALSE)

# write a for loop to calculate the summary statistics for all continuous variables
for(i in 1:length(cols)){
   temp[i,] <- c(as.character(colnames(myData[cols[i]])),
                 as.double(mean(myData[,cols[i]])),
                 as.double(median(myData[,cols[i]])),
                 as.double(getmode(myData[,cols[i]])),
                 as.double(sd(myData[,cols[i]])),
                 as.double(var(myData[,cols[i]])),
                 as.character(range(myData[,cols[i]])[2]-range(myData[,cols[i]])[1]),
                 as.double(min(myData[,cols[i]])),
                 as.double(max(myData[,cols[i]])),
                 as.double(skewness(myData[,cols[i]])),
                 as.double(kurtosis(myData[,cols[i]])))
   }
temp
# import library psych
library(psych)

# use describe function in psych package to get summary statistics of continuous variables
describe(myData[,cols], 
         na.rm = TRUE, 
         interp=FALSE,
         skew = TRUE, 
         ranges = TRUE,
         trim=.1,
         type=3,
         check=TRUE,
         fast=NULL,
         quant=NULL,
         IQR=FALSE,
         omit=FALSE)

Q6: Make three interesting plots using the data and describe what you see (40 points) ● Choose one variable and make a four-panel plot with a density plot, histogram, a boxplot and an overlapping plot with histogram and density plot.

# Remove outliers from SaleAmount
outlier_values <- boxplot.stats(myData$SaleAmount)$out

#myData[which(myData$SaleAmount %in% outlier_values),]

# To not lose the original data, store the data without outliers in a new variable
treated_myData = myData[-which(myData$SaleAmount %in% outlier_values),] 

# set option to display numbers as integers and not in exponential form
options(scipen = 5)    

# create a 4-paneled plot
par(mfrow=c(2,2))

# Plot kernel density plot for SaleAmount
plot(density(treated_myData$SaleAmount),
     main = "Kernel Density Plot for SaleAmount",
     cex.main=1.1)

# Plot a histogram for SaleAmount
hist(treated_myData$SaleAmount,
     main = "Histogram of SaleAmount",
     cex.main=1.1,
     xlab = "SaleAmount")

# Create a boxplot for SaleAmount
boxplot(treated_myData$SaleAmount,
        data=treated_myData, 
        main="Box Plot of SaleAmount",
        cex.main=1.1)

# Plot a overlapping plot with histogram and kernel density
hist(treated_myData$SaleAmount,
     main = "Histogram with Kernel Density of SaleAmount",
     cex.main=1.1,
     xlab = "SaleAmount",
     freq = FALSE)
lines(density(treated_myData$SaleAmount),
      col = "blue")

The panel above shows 4 plots based on the variable SaleAmount.

● Think of impressing your boss by telling a story from the data and build three interesting plots Hint: Box Plots, Scatterplots, Contingency Tables, Map plots, Dot plots, Density plots or Histograms

# Remove outliers from AssessedValue
outlier_values <- boxplot.stats(myData$AssessedValue)$out

#treated_myData[which(treated_myData$AssessedValue %in% outlier_values),]

# To not lose the original data, store the data without outliers in a new variable
treatedNew_myData = treated_myData[-which(treated_myData$AssessedValue %in% outlier_values),]

#ScatterPlot
plot(treatedNew_myData$AssessedValue,
     treatedNew_myData$SaleAmount,
     main = "Scatter Plot for SaleAmount vs AssessedValue",
     xlab = "AssessedValue",
     ylab = "SaleAmount",
     pch = 20,
     xlim = c(0, 460000),
     ylim = c(0, 705000))

# Plot regression line (AssessedValue ~ SaleAmount)
abline(lm(treatedNew_myData$AssessedValue~treatedNew_myData$SaleAmount),
       col="red")

# Basic Scatterplot Matrix
pairs(~SaleAmount+SalesRatio+AssessedValue,
      data=treatedNew_myData,
      main="Simple Scatterplot Matrix")

# Grouped Bar Plot

# create an aggregated table for County and Property Value
bp <- table(treated_myData$County, treated_myData$PropertyValue)

# Plot a grouped bar graph for the aggregated table
barplot(bp, 
        main="PropertyValue Distribution by County",
        xlab="County",
        col=c("red","blue","darkgreen","yellow", "orange","cyan","green","brown"),
        legend = rownames(bp),
     #  horiz = TRUE,
        beside = TRUE)

# create a seperate data frame for SaleAmount, County and ListYear
dc_table <- data.frame(County = treated_myData$County, 
                       ListYear = treated_myData$ListYear,
                       SaleAmount = treated_myData$SaleAmount)

# import library dplyr
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# create an aggregated table for SaleAmount based on County and ListYear
dc <- dc_table%>%
      group_by(County, ListYear)%>%
      summarise(n=sum(SaleAmount))

# sort the table according to ListYear
dc = dc[order(dc$ListYear),]

# Assign colors for each ListYear
dc$color[dc$ListYear==2014]="red"
## Warning: Unknown or uninitialised column: 'color'.
dc$color[dc$ListYear==2015]="blue"
dc$color[dc$ListYear==2016]="darkgreen"

# create a multi-paneled plot
par(mfrow = c(1, 1))

# create a dotchart
dotchart(dc$n,
         labels=dc$County, 
         groups = dc$ListYear, 
         cex=.7,
         main="Total Sales per County per year",
         xlab="Sale Amount",
         color = dc$color)

# create a list of unique values in ListYear and corresponding color for Legend
ly <- unique(dc$ListYear)
clr <- unique(dc$color)

# create a legend for the dot plot
legend("topright", 
       inset = .0125, 
       title = "ListYear",
       legend= ly,
       fill = clr)

  1. Scatter Plot:
  1. Scatterplot Matrix:
  1. Grouped Bar Chart:
  1. Dot Plot:

Q7: Aggregating, Multidimensional Tables (30 points) ● Generate a Frequency tables based on the count of properties in each town.

# import library dplyr
library(dplyr)

# group data by Town and calculate the number of entries for each town
myData%>%
  group_by(Town)%>%
  summarize(n=n())

● Build a contingency table using the variable ‘Property value’ based on the percentages of row and overall total of properties in each range.

# calculate the percentage of data for each Town
myData%>%
  group_by(PropertyValue)%>%
  summarize(n=n())%>%
  mutate(prop=n/sum(n))
# calculate the % and total % of data for each town using Package epidisplay
library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## Loading required package: nnet
## 
## Attaching package: 'epiDisplay'
## The following objects are masked from 'package:psych':
## 
##     alpha, cs, lookup
tab1(myData$PropertyValue,
     main = "Contingency graph for PropertyValue")

## myData$PropertyValue : 
##           Frequency Percent Cum. percent
## LowRange     120076    82.3         82.3
## MidRange      19486    13.3         95.6
## HighRange      6425     4.4        100.0
##   Total      145987   100.0        100.0

● Create a multidimensional table using the dataset based on Town, County and list year

# Multidimensional table for number of entries according to Town, County and ListYear 
# using dplyr
myData%>%
  group_by(Town, County, ListYear)%>%
  summarize(n=n())

Q8: Storytelling with Data (20 points) ● Tell a story based on your understanding and analysis of the data. Hint: Feel free to use R Markdown