Q1.Opening Data, Merging Data (10 points) ● Read in the list of towns (#1) and Real Estate Sales (#2) datasets into R. o I’ve filtered data for years 2014-2016 to reduce the number of rows. You can find the clean original data online as a reference.
# import csv files
townList <- read.csv("List_of_Towns.csv")
realEstate <- read.csv("Real_Estate_Sales_2014-2016.csv")
townList
realEstate
● Use the list of towns (#1) dataset to extract the county column to join it with Real Estate dataset (#2) Hint: Use Merge
# merge both datasets on the shared key 'Town'
myData <- merge(x = realEstate,
y = townList[ , c('Town','County')],
by.x = 'Town',
by.y = 'Town')
# print merged data
myData
Q2: Filtering out missing data (5 points) ● It’s often helpful to make a subset of the data that you need to clean. Then you can join it back together later. You can use Excel to verify the columns with missing values. o Identify the columns with missing values using R. Hint: You will observe that Date Recorded and Sales Amount Columns have missing data.
# find out the datatypes of all columns
str(myData)
## 'data.frame': 145987 obs. of 12 variables:
## $ Town : Factor w/ 169 levels "Andover","Ansonia",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ï..ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SerialNumber : int 14046 14011 15006 14044 14035 15051 14002 15011 14043 14029 ...
## $ ListYear : int 2014 2014 2015 2014 2014 2015 2014 2015 2014 2014 ...
## $ DateRecorded : Factor w/ 911 levels "","1/1/2015",..: 890 12 207 887 755 888 121 57 871 658 ...
## $ Address : Factor w/ 134658 levels " WORTH AVE 113",..: 66 1250 1647 3220 4743 6908 9425 10624 13162 13704 ...
## $ AssessedValue : int 10720 153100 102900 108700 164000 188400 89400 213200 52400 131700 ...
## $ SaleAmount : num 75000 190000 50000 128368 230000 ...
## $ SalesRatio : num 0.143 0.806 2.058 0.847 0.713 ...
## $ PropertyType : Factor w/ 6 levels "Apartments","Commercial",..: 6 5 5 5 5 5 5 5 5 5 ...
## $ ResidentialType: Factor w/ 5 levels "Four Family",..: NA 3 3 3 3 3 3 3 3 3 ...
## $ County : Factor w/ 8 levels "Fairfield County",..: 7 7 7 7 7 7 7 7 7 7 ...
# convert factors to character as no operations can be performed on datatype factor
myData$DateRecorded <- as.character(myData$DateRecorded)
myData$Town <- as.character(myData$Town)
myData$PropertyType <- as.character(myData$PropertyType)
myData$Address <- as.character(myData$Address)
myData$ResidentialType <- as.character(myData$ResidentialType)
myData$County <- as.character(myData$County)
# recheck datatypes of the columns
str(myData)
## 'data.frame': 145987 obs. of 12 variables:
## $ Town : chr "Andover" "Andover" "Andover" "Andover" ...
## $ ï..ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SerialNumber : int 14046 14011 15006 14044 14035 15051 14002 15011 14043 14029 ...
## $ ListYear : int 2014 2014 2015 2014 2014 2015 2014 2015 2014 2014 ...
## $ DateRecorded : chr "9/29/2015" "1/14/2015" "11/30/2015" "9/28/2015" ...
## $ Address : chr " US ROUTE 6 M 33 B 36 L 22" "1 JUROVATY LANE" "1 ROSE LANE" "10 PINE RIDGE DR" ...
## $ AssessedValue : int 10720 153100 102900 108700 164000 188400 89400 213200 52400 131700 ...
## $ SaleAmount : num 75000 190000 50000 128368 230000 ...
## $ SalesRatio : num 0.143 0.806 2.058 0.847 0.713 ...
## $ PropertyType : chr "Vacant Land" "Residential" "Residential" "Residential" ...
## $ ResidentialType: chr NA "Single Family" "Single Family" "Single Family" ...
## $ County : chr "Tolland County" "Tolland County" "Tolland County" "Tolland County" ...
# Replace any existing blanks in the dataset with NA for R to be able to detect them
myData[myData==""] <- NA
# Find the total number of NAs in all columns
print("Total missing values in each column:")
## [1] "Total missing values in each column:"
colSums(is.na(myData))
## Town ï..ID SerialNumber ListYear
## 0 0 0 0
## DateRecorded Address AssessedValue SaleAmount
## 6 2 0 5283
## SalesRatio PropertyType ResidentialType County
## 0 0 11905 0
It can be observed that there are missing values in DateRecorded, Address, SaleAmount and ResidentialType.
Q3: Replacing missing data (10 points) ● Replace missing date recorded data with the corresponding year in ListYear column.
# convert data type of ListYear from integer to character
myData$ListYear <- as.character(myData$ListYear)
# check number of NAs in DateRecorded
print("Before replacement:")
## [1] "Before replacement:"
colSums(is.na(myData))
## Town ï..ID SerialNumber ListYear
## 0 0 0 0
## DateRecorded Address AssessedValue SaleAmount
## 6 2 0 5283
## SalesRatio PropertyType ResidentialType County
## 0 0 11905 0
# create a variable temp that holds a list of indexes where DateRecorded in NA
temp <- which(is.na(myData$DateRecorded))
# list of rows where DateRecorded is NA
print("Rows where DateRecorded is NA:")
## [1] "Rows where DateRecorded is NA:"
temp
## [1] 1080 2090 77004 89625 129158 129592
# Replace NAs in DateRecorded with the corresponding entries in ListYear
myData$DateRecorded[temp] <- myData$ListYear[temp]
# Recheck number of missing values in DateRecorded
print("After replacement:")
## [1] "After replacement:"
colSums(is.na(myData))
## Town ï..ID SerialNumber ListYear
## 0 0 0 0
## DateRecorded Address AssessedValue SaleAmount
## 0 2 0 5283
## SalesRatio PropertyType ResidentialType County
## 0 0 11905 0
# compare values in DateRecorded and ListYear for rows that were previously
print("Values replaced in DateRecorded:")
## [1] "Values replaced in DateRecorded:"
myData$DateRecorded[temp]
## [1] "2014" "2014" "2014" "2014" "2014" "2014"
print("Corresponding value in ListYear:")
## [1] "Corresponding value in ListYear:"
myData$ListYear[temp]
## [1] "2014" "2014" "2014" "2014" "2014" "2014"
# Change the datatype of ListYear back to numeric
myData$ListYear <- as.numeric(myData$ListYear)
● Replace missing Sales Amount data with the corresponding with Assessed Value + (Assessed Value multiplied by Sales Ratio)
# check number of NAs in SaleAmount
print("Before replacement:")
## [1] "Before replacement:"
colSums(is.na(myData))
## Town ï..ID SerialNumber ListYear
## 0 0 0 0
## DateRecorded Address AssessedValue SaleAmount
## 0 2 0 5283
## SalesRatio PropertyType ResidentialType County
## 0 0 11905 0
# reassign value of temp with indices where SalesAmount is NA
temp <- which(is.na(myData$SaleAmount))
# check if temp matches with number of NAs in SalesAmount
print("Number of missing values in SaleAmount:")
## [1] "Number of missing values in SaleAmount:"
length(temp)
## [1] 5283
# replace NAs in SalesAmount using the given formula
myData$SaleAmount[temp] <- myData$AssessedValue[temp] + (myData$AssessedValue[temp]*myData$SalesRatio[temp])
# check number of NAs in SalesAmount
print("After replacement:")
## [1] "After replacement:"
colSums(is.na(myData))
## Town ï..ID SerialNumber ListYear
## 0 0 0 0
## DateRecorded Address AssessedValue SaleAmount
## 0 2 0 0
## SalesRatio PropertyType ResidentialType County
## 0 0 11905 0
# Recheck if the replacement has been implemented correctly
print("Values replaced in SaleAmount:")
## [1] "Values replaced in SaleAmount:"
myData$SaleAmount[temp[1:4]]
## [1] 276145.8 467707.5 320192.1 320192.1
print("Corresponding value in AssessedValue:")
## [1] "Corresponding value in AssessedValue:"
myData$AssessedValue[temp[1:4]]
## [1] 155400 253500 172100 172100
print("Corresponding value in SalesRatio:")
## [1] "Corresponding value in SalesRatio:"
myData$SalesRatio[temp[1:4]]
## [1] 0.7770 0.8450 0.8605 0.8605
print("Assessed Value + (Assessed Value multiplied by Sales Ratio):")
## [1] "Assessed Value + (Assessed Value multiplied by Sales Ratio):"
myData$AssessedValue[temp[1:4]] + (myData$AssessedValue[temp[1:4]]*myData$SalesRatio[temp[1:4]])
## [1] 276145.8 467707.5 320192.1 320192.1
● Add this subset back to the dataset Hint: Check to see if the row counts match to that of the original dataset.
# check if the data has as many rows as it had at the beginning
nrow(myData)
## [1] 145987
Q4: Binning (10 points) ● Make a column “Property Value” based on Column AssessedValue given the conditions: o ‘LowRange’ if Assessed Value <=300,000 o ‘MidRange’ if Assessed Value >300,000 and <=800,000 o ‘HighRange’ if Assessed Value >800,000
# create a list bins of cuts to be used to bin the AssessedValue column
bins <- c(-Inf, 300000, 800000, Inf)
# create list of labels to be given for each bin
binNames <- c('LowRange', 'MidRange', 'HighRange')
# use cut() function to break the AssessedValue column into bins
# save labels in the new column PropertyValue
myData$PropertyValue <- cut(myData$AssessedValue,
breaks = bins,
labels = binNames)
# check the bins created
myData[,c(7,13)]
Q5: Measures of Central Tendency & Variability (25 points) ● Define, Interpret and Calculate the mean, median, mode, standard deviation, variance, range, minimum and maximum values, skewness and kurtosis for all the continuous data columns using the existing individual built in R functions. Hint: You can create a function and use loops to print results. ● Use describe() in psych package to calculate the summary statistics for all continuous columns.
# check datatypes of all variables
str(myData)
## 'data.frame': 145987 obs. of 13 variables:
## $ Town : chr "Andover" "Andover" "Andover" "Andover" ...
## $ ï..ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SerialNumber : int 14046 14011 15006 14044 14035 15051 14002 15011 14043 14029 ...
## $ ListYear : num 2014 2014 2015 2014 2014 ...
## $ DateRecorded : chr "9/29/2015" "1/14/2015" "11/30/2015" "9/28/2015" ...
## $ Address : chr " US ROUTE 6 M 33 B 36 L 22" "1 JUROVATY LANE" "1 ROSE LANE" "10 PINE RIDGE DR" ...
## $ AssessedValue : int 10720 153100 102900 108700 164000 188400 89400 213200 52400 131700 ...
## $ SaleAmount : num 75000 190000 50000 128368 230000 ...
## $ SalesRatio : num 0.143 0.806 2.058 0.847 0.713 ...
## $ PropertyType : chr "Vacant Land" "Residential" "Residential" "Residential" ...
## $ ResidentialType: chr NA "Single Family" "Single Family" "Single Family" ...
## $ County : chr "Tolland County" "Tolland County" "Tolland County" "Tolland County" ...
## $ PropertyValue : Factor w/ 3 levels "LowRange","MidRange",..: 1 1 1 1 1 1 1 1 1 1 ...
# import library moments to calculate skewness and kurtosis
library(moments)
# create a list of indices of variables that are continuous
cols <- c(7, 8, 9)
# create a function to calculate mode
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
temp <- data.frame(Variable= character(0),
Mean = double(0),
Median = double(0),
Mode = double(0),
StandardDeviation = double(0),
Variance = double(0),
Range = character(0),
Minimum = double(0),
Maximum = double(0),
Skewness=double(0),
Kurtosis=double(0),
stringsAsFactors = FALSE)
# write a for loop to calculate the summary statistics for all continuous variables
for(i in 1:length(cols)){
temp[i,] <- c(as.character(colnames(myData[cols[i]])),
as.double(mean(myData[,cols[i]])),
as.double(median(myData[,cols[i]])),
as.double(getmode(myData[,cols[i]])),
as.double(sd(myData[,cols[i]])),
as.double(var(myData[,cols[i]])),
as.character(range(myData[,cols[i]])[2]-range(myData[,cols[i]])[1]),
as.double(min(myData[,cols[i]])),
as.double(max(myData[,cols[i]])),
as.double(skewness(myData[,cols[i]])),
as.double(kurtosis(myData[,cols[i]])))
}
temp
# import library psych
library(psych)
# use describe function in psych package to get summary statistics of continuous variables
describe(myData[,cols],
na.rm = TRUE,
interp=FALSE,
skew = TRUE,
ranges = TRUE,
trim=.1,
type=3,
check=TRUE,
fast=NULL,
quant=NULL,
IQR=FALSE,
omit=FALSE)
Q6: Make three interesting plots using the data and describe what you see (40 points) ● Choose one variable and make a four-panel plot with a density plot, histogram, a boxplot and an overlapping plot with histogram and density plot.
# Remove outliers from SaleAmount
outlier_values <- boxplot.stats(myData$SaleAmount)$out
#myData[which(myData$SaleAmount %in% outlier_values),]
# To not lose the original data, store the data without outliers in a new variable
treated_myData = myData[-which(myData$SaleAmount %in% outlier_values),]
# set option to display numbers as integers and not in exponential form
options(scipen = 5)
# create a 4-paneled plot
par(mfrow=c(2,2))
# Plot kernel density plot for SaleAmount
plot(density(treated_myData$SaleAmount),
main = "Kernel Density Plot for SaleAmount",
cex.main=1.1)
# Plot a histogram for SaleAmount
hist(treated_myData$SaleAmount,
main = "Histogram of SaleAmount",
cex.main=1.1,
xlab = "SaleAmount")
# Create a boxplot for SaleAmount
boxplot(treated_myData$SaleAmount,
data=treated_myData,
main="Box Plot of SaleAmount",
cex.main=1.1)
# Plot a overlapping plot with histogram and kernel density
hist(treated_myData$SaleAmount,
main = "Histogram with Kernel Density of SaleAmount",
cex.main=1.1,
xlab = "SaleAmount",
freq = FALSE)
lines(density(treated_myData$SaleAmount),
col = "blue")
The panel above shows 4 plots based on the variable SaleAmount.
● Think of impressing your boss by telling a story from the data and build three interesting plots Hint: Box Plots, Scatterplots, Contingency Tables, Map plots, Dot plots, Density plots or Histograms
# Remove outliers from AssessedValue
outlier_values <- boxplot.stats(myData$AssessedValue)$out
#treated_myData[which(treated_myData$AssessedValue %in% outlier_values),]
# To not lose the original data, store the data without outliers in a new variable
treatedNew_myData = treated_myData[-which(treated_myData$AssessedValue %in% outlier_values),]
#ScatterPlot
plot(treatedNew_myData$AssessedValue,
treatedNew_myData$SaleAmount,
main = "Scatter Plot for SaleAmount vs AssessedValue",
xlab = "AssessedValue",
ylab = "SaleAmount",
pch = 20,
xlim = c(0, 460000),
ylim = c(0, 705000))
# Plot regression line (AssessedValue ~ SaleAmount)
abline(lm(treatedNew_myData$AssessedValue~treatedNew_myData$SaleAmount),
col="red")
# Basic Scatterplot Matrix
pairs(~SaleAmount+SalesRatio+AssessedValue,
data=treatedNew_myData,
main="Simple Scatterplot Matrix")
# Grouped Bar Plot
# create an aggregated table for County and Property Value
bp <- table(treated_myData$County, treated_myData$PropertyValue)
# Plot a grouped bar graph for the aggregated table
barplot(bp,
main="PropertyValue Distribution by County",
xlab="County",
col=c("red","blue","darkgreen","yellow", "orange","cyan","green","brown"),
legend = rownames(bp),
# horiz = TRUE,
beside = TRUE)
# create a seperate data frame for SaleAmount, County and ListYear
dc_table <- data.frame(County = treated_myData$County,
ListYear = treated_myData$ListYear,
SaleAmount = treated_myData$SaleAmount)
# import library dplyr
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# create an aggregated table for SaleAmount based on County and ListYear
dc <- dc_table%>%
group_by(County, ListYear)%>%
summarise(n=sum(SaleAmount))
# sort the table according to ListYear
dc = dc[order(dc$ListYear),]
# Assign colors for each ListYear
dc$color[dc$ListYear==2014]="red"
## Warning: Unknown or uninitialised column: 'color'.
dc$color[dc$ListYear==2015]="blue"
dc$color[dc$ListYear==2016]="darkgreen"
# create a multi-paneled plot
par(mfrow = c(1, 1))
# create a dotchart
dotchart(dc$n,
labels=dc$County,
groups = dc$ListYear,
cex=.7,
main="Total Sales per County per year",
xlab="Sale Amount",
color = dc$color)
# create a list of unique values in ListYear and corresponding color for Legend
ly <- unique(dc$ListYear)
clr <- unique(dc$color)
# create a legend for the dot plot
legend("topright",
inset = .0125,
title = "ListYear",
legend= ly,
fill = clr)
Q7: Aggregating, Multidimensional Tables (30 points) ● Generate a Frequency tables based on the count of properties in each town.
# import library dplyr
library(dplyr)
# group data by Town and calculate the number of entries for each town
myData%>%
group_by(Town)%>%
summarize(n=n())
● Build a contingency table using the variable ‘Property value’ based on the percentages of row and overall total of properties in each range.
# calculate the percentage of data for each Town
myData%>%
group_by(PropertyValue)%>%
summarize(n=n())%>%
mutate(prop=n/sum(n))
# calculate the % and total % of data for each town using Package epidisplay
library(epiDisplay)
## Loading required package: foreign
## Loading required package: survival
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Loading required package: nnet
##
## Attaching package: 'epiDisplay'
## The following objects are masked from 'package:psych':
##
## alpha, cs, lookup
tab1(myData$PropertyValue,
main = "Contingency graph for PropertyValue")
## myData$PropertyValue :
## Frequency Percent Cum. percent
## LowRange 120076 82.3 82.3
## MidRange 19486 13.3 95.6
## HighRange 6425 4.4 100.0
## Total 145987 100.0 100.0
● Create a multidimensional table using the dataset based on Town, County and list year
# Multidimensional table for number of entries according to Town, County and ListYear
# using dplyr
myData%>%
group_by(Town, County, ListYear)%>%
summarize(n=n())
Q8: Storytelling with Data (20 points) ● Tell a story based on your understanding and analysis of the data. Hint: Feel free to use R Markdown