Assignment 1

Q1.Opening Data, Merging Data (10 points) ● Read in the list of towns (#1) and Real Estate Sales (#2) datasets into R. o I’ve filtered data for years 2014-2016 to reduce the number of rows. You can find the clean original data online as a reference.

# import csv files
townList <- read.csv("List_of_Towns.csv")
realEstate <- read.csv("Real_Estate_Sales_2014-2016.csv")

townList

realEstate

● Use the list of towns (#1) dataset to extract the county column to join it with Real Estate dataset (#2) Hint: Use Merge

# merge both datasets on the shared key 'Town'
myData <- merge(x = realEstate, 
                y = townList[ , c('Town','County')], 
                by.x = 'Town', 
                by.y = 'Town')

# print merged data
myData

Q2: Filtering out missing data (5 points) ● It’s often helpful to make a subset of the data that you need to clean. Then you can join it back together later. You can use Excel to verify the columns with missing values. o Identify the columns with missing values using R. Hint: You will observe that Date Recorded and Sales Amount Columns have missing data.

# find out the datatypes of all columns
str(myData)

## 'data.frame':    145987 obs. of  12 variables:
##  $ Town           : Factor w/ 169 levels "Andover","Ansonia",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ ï..ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SerialNumber   : int  14046 14011 15006 14044 14035 15051 14002 15011 14043 14029 ...
##  $ ListYear       : int  2014 2014 2015 2014 2014 2015 2014 2015 2014 2014 ...
##  $ DateRecorded   : Factor w/ 911 levels "","1/1/2015",..: 890 12 207 887 755 888 121 57 871 658 ...
##  $ Address        : Factor w/ 134658 levels "   WORTH AVE 113",..: 66 1250 1647 3220 4743 6908 9425 10624 13162 13704 ...
##  $ AssessedValue  : int  10720 153100 102900 108700 164000 188400 89400 213200 52400 131700 ...
##  $ SaleAmount     : num  75000 190000 50000 128368 230000 ...
##  $ SalesRatio     : num  0.143 0.806 2.058 0.847 0.713 ...
##  $ PropertyType   : Factor w/ 6 levels "Apartments","Commercial",..: 6 5 5 5 5 5 5 5 5 5 ...
##  $ ResidentialType: Factor w/ 5 levels "Four Family",..: NA 3 3 3 3 3 3 3 3 3 ...
##  $ County         : Factor w/ 8 levels "Fairfield County",..: 7 7 7 7 7 7 7 7 7 7 ...

# convert factors to character as no operations can be performed on datatype factor
myData$DateRecorded <- as.character(myData$DateRecorded)
myData$Town <- as.character(myData$Town)
myData$PropertyType <- as.character(myData$PropertyType)
myData$Address <- as.character(myData$Address)
myData$ResidentialType <- as.character(myData$ResidentialType)
myData$County <- as.character(myData$County)

# recheck datatypes of the columns
str(myData)

## 'data.frame':    145987 obs. of  12 variables:
##  $ Town           : chr  "Andover" "Andover" "Andover" "Andover" ...
##  $ ï..ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SerialNumber   : int  14046 14011 15006 14044 14035 15051 14002 15011 14043 14029 ...
##  $ ListYear       : int  2014 2014 2015 2014 2014 2015 2014 2015 2014 2014 ...
##  $ DateRecorded   : chr  "9/29/2015" "1/14/2015" "11/30/2015" "9/28/2015" ...
##  $ Address        : chr  " US ROUTE 6  M 33 B 36 L 22" "1 JUROVATY LANE" "1 ROSE LANE" "10 PINE RIDGE  DR" ...
##  $ AssessedValue  : int  10720 153100 102900 108700 164000 188400 89400 213200 52400 131700 ...
##  $ SaleAmount     : num  75000 190000 50000 128368 230000 ...
##  $ SalesRatio     : num  0.143 0.806 2.058 0.847 0.713 ...
##  $ PropertyType   : chr  "Vacant Land" "Residential" "Residential" "Residential" ...
##  $ ResidentialType: chr  NA "Single Family" "Single Family" "Single Family" ...
##  $ County         : chr  "Tolland County" "Tolland County" "Tolland County" "Tolland County" ...

# Replace any existing blanks in the dataset with NA for R to be able to detect them
myData[myData==""] <- NA

# Find the total number of NAs in all columns
print("Total missing values in each column:")

## [1] "Total missing values in each column:"

colSums(is.na(myData))

##            Town           ï..ID    SerialNumber        ListYear 
##               0               0               0               0 
##    DateRecorded         Address   AssessedValue      SaleAmount 
##               6               2               0            5283 
##      SalesRatio    PropertyType ResidentialType          County 
##               0               0           11905               0

It can be observed that there are missing values in DateRecorded, Address, SaleAmount and ResidentialType.

Q3: Replacing missing data (10 points) ● Replace missing date recorded data with the corresponding year in ListYear column.

# convert data type of ListYear from integer to character
myData$ListYear <- as.character(myData$ListYear)

# check number of NAs in DateRecorded
print("Before replacement:")

## [1] "Before replacement:"

colSums(is.na(myData))

##            Town           ï..ID    SerialNumber        ListYear 
##               0               0               0               0 
##    DateRecorded         Address   AssessedValue      SaleAmount 
##               6               2               0            5283 
##      SalesRatio    PropertyType ResidentialType          County 
##               0               0           11905               0

# create a variable temp that holds a list of indexes where DateRecorded in NA
temp <- which(is.na(myData$DateRecorded))

# list of rows where DateRecorded is NA
print("Rows where DateRecorded is NA:")

## [1] "Rows where DateRecorded is NA:"

temp

## [1]   1080   2090  77004  89625 129158 129592

# Replace NAs in DateRecorded with the corresponding entries in ListYear
myData$DateRecorded[temp] <- myData$ListYear[temp]

# Recheck number of missing values in DateRecorded
print("After replacement:")

## [1] "After replacement:"

colSums(is.na(myData))

##            Town           ï..ID    SerialNumber        ListYear 
##               0               0               0               0 
##    DateRecorded         Address   AssessedValue      SaleAmount 
##               0               2               0            5283 
##      SalesRatio    PropertyType ResidentialType          County 
##               0               0           11905               0

# compare values in DateRecorded and ListYear for rows that were previously
print("Values replaced in DateRecorded:")

## [1] "Values replaced in DateRecorded:"

myData$DateRecorded[temp]

## [1] "2014" "2014" "2014" "2014" "2014" "2014"

print("Corresponding value in ListYear:")

## [1] "Corresponding value in ListYear:"

myData$ListYear[temp]

## [1] "2014" "2014" "2014" "2014" "2014" "2014"

# Change the datatype of ListYear back to numeric
myData$ListYear <- as.numeric(myData$ListYear)

● Replace missing Sales Amount data with the corresponding with Assessed Value + (Assessed Value multiplied by Sales Ratio)

# check number of NAs in SaleAmount
print("Before replacement:")

## [1] "Before replacement:"

colSums(is.na(myData))

##            Town           ï..ID    SerialNumber        ListYear 
##               0               0               0               0 
##    DateRecorded         Address   AssessedValue      SaleAmount 
##               0               2               0            5283 
##      SalesRatio    PropertyType ResidentialType          County 
##               0               0           11905               0

# reassign value of temp with indices where SalesAmount is NA
temp <- which(is.na(myData$SaleAmount))

# check if temp matches with number of NAs in SalesAmount
print("Number of missing values in SaleAmount:")

## [1] "Number of missing values in SaleAmount:"

length(temp)

## [1] 5283

# replace NAs in SalesAmount using the given formula
myData$SaleAmount[temp] <- myData$AssessedValue[temp] + (myData$AssessedValue[temp]*myData$SalesRatio[temp])

# check number of NAs in SalesAmount
print("After replacement:")

## [1] "After replacement:"

colSums(is.na(myData))

##            Town           ï..ID    SerialNumber        ListYear 
##               0               0               0               0 
##    DateRecorded         Address   AssessedValue      SaleAmount 
##               0               2               0               0 
##      SalesRatio    PropertyType ResidentialType          County 
##               0               0           11905               0

# Recheck if the replacement has been implemented correctly
print("Values replaced in SaleAmount:")

## [1] "Values replaced in SaleAmount:"

myData$SaleAmount[temp[1:4]]

## [1] 276145.8 467707.5 320192.1 320192.1

print("Corresponding value in AssessedValue:")

## [1] "Corresponding value in AssessedValue:"

myData$AssessedValue[temp[1:4]]

## [1] 155400 253500 172100 172100

print("Corresponding value in SalesRatio:")

## [1] "Corresponding value in SalesRatio:"

myData$SalesRatio[temp[1:4]]

## [1] 0.7770 0.8450 0.8605 0.8605

print("Assessed Value + (Assessed Value multiplied by Sales Ratio):")

## [1] "Assessed Value + (Assessed Value multiplied by Sales Ratio):"

myData$AssessedValue[temp[1:4]] + (myData$AssessedValue[temp[1:4]]*myData$SalesRatio[temp[1:4]])

## [1] 276145.8 467707.5 320192.1 320192.1

● Add this subset back to the dataset Hint: Check to see if the row counts match to that of the original dataset.

# check if the data has as many rows as it had at the beginning
nrow(myData)

## [1] 145987

Q4: Binning (10 points) ● Make a column “Property Value” based on Column AssessedValue given the conditions: o ‘LowRange’ if Assessed Value <=300,000 o ‘MidRange’ if Assessed Value >300,000 and <=800,000 o ‘HighRange’ if Assessed Value >800,000

# create a list bins of cuts to be used to bin the AssessedValue column
bins <- c(-Inf, 300000, 800000, Inf)

# create list of labels to be given for each bin
binNames <- c('LowRange', 'MidRange', 'HighRange')

# use cut() function to break the AssessedValue column into bins
# save labels in the new column PropertyValue
myData$PropertyValue <- cut(myData$AssessedValue, 
                            breaks = bins, 
                            labels = binNames)

# check the bins created
myData[,c(7,13)]

Q5: Measures of Central Tendency & Variability (25 points) ● Define, Interpret and Calculate the mean, median, mode, standard deviation, variance, range, minimum and maximum values, skewness and kurtosis for all the continuous data columns using the existing individual built in R functions. Hint: You can create a function and use loops to print results. ● Use describe() in psych package to calculate the summary statistics for all continuous columns.

# check datatypes of all variables
str(myData)

## 'data.frame':    145987 obs. of  13 variables:
##  $ Town           : chr  "Andover" "Andover" "Andover" "Andover" ...
##  $ ï..ID          : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SerialNumber   : int  14046 14011 15006 14044 14035 15051 14002 15011 14043 14029 ...
##  $ ListYear       : num  2014 2014 2015 2014 2014 ...
##  $ DateRecorded   : chr  "9/29/2015" "1/14/2015" "11/30/2015" "9/28/2015" ...
##  $ Address        : chr  " US ROUTE 6  M 33 B 36 L 22" "1 JUROVATY LANE" "1 ROSE LANE" "10 PINE RIDGE  DR" ...
##  $ AssessedValue  : int  10720 153100 102900 108700 164000 188400 89400 213200 52400 131700 ...
##  $ SaleAmount     : num  75000 190000 50000 128368 230000 ...
##  $ SalesRatio     : num  0.143 0.806 2.058 0.847 0.713 ...
##  $ PropertyType   : chr  "Vacant Land" "Residential" "Residential" "Residential" ...
##  $ ResidentialType: chr  NA "Single Family" "Single Family" "Single Family" ...
##  $ County         : chr  "Tolland County" "Tolland County" "Tolland County" "Tolland County" ...
##  $ PropertyValue  : Factor w/ 3 levels "LowRange","MidRange",..: 1 1 1 1 1 1 1 1 1 1 ...

# import library moments to calculate skewness and kurtosis
library(moments)

# create a list of indices of variables that are continuous
cols <- c(7, 8, 9)

# create a function to calculate mode
getmode <- function(v) {
   uniqv <- unique(v)
   uniqv[which.max(tabulate(match(v, uniqv)))]
}

temp <- data.frame(Variable= character(0),
                   Mean = double(0),
                   Median = double(0),
                   Mode = double(0),
                   StandardDeviation = double(0),
                   Variance = double(0),
                   Range = character(0),
                   Minimum = double(0), 
                   Maximum = double(0),
                   Skewness=double(0), 
                   Kurtosis=double(0), 
                   stringsAsFactors = FALSE)

# write a for loop to calculate the summary statistics for all continuous variables
for(i in 1:length(cols)){
   temp[i,] <- c(as.character(colnames(myData[cols[i]])),
                 as.double(mean(myData[,cols[i]])),
                 as.double(median(myData[,cols[i]])),
                 as.double(getmode(myData[,cols[i]])),
                 as.double(sd(myData[,cols[i]])),
                 as.double(var(myData[,cols[i]])),
                 as.character(range(myData[,cols[i]])[2]-range(myData[,cols[i]])[1]),
                 as.double(min(myData[,cols[i]])),
                 as.double(max(myData[,cols[i]])),
                 as.double(skewness(myData[,cols[i]])),
                 as.double(kurtosis(myData[,cols[i]])))
   }
temp

# import library psych
library(psych)

# use describe function in psych package to get summary statistics of continuous variables
describe(myData[,cols], 
         na.rm = TRUE, 
         interp=FALSE,
         skew = TRUE, 
         ranges = TRUE,
         trim=.1,
         type=3,
         check=TRUE,
         fast=NULL,
         quant=NULL,
         IQR=FALSE,
         omit=FALSE)

Q6: Make three interesting plots using the data and describe what you see (40 points) ● Choose one variable and make a four-panel plot with a density plot, histogram, a boxplot and an overlapping plot with histogram and density plot.

# Remove outliers from SaleAmount
outlier_values <- boxplot.stats(myData$SaleAmount)$out

#myData[which(myData$SaleAmount %in% outlier_values),]

# To not lose the original data, store the data without outliers in a new variable
treated_myData = myData[-which(myData$SaleAmount %in% outlier_values),] 

# set option to display numbers as integers and not in exponential form
options(scipen = 5)    

# create a 4-paneled plot
par(mfrow=c(2,2))

# Plot kernel density plot for SaleAmount
plot(density(treated_myData$SaleAmount),
     main = "Kernel Density Plot for SaleAmount",
     cex.main=1.1)

# Plot a histogram for SaleAmount
hist(treated_myData$SaleAmount,
     main = "Histogram of SaleAmount",
     cex.main=1.1,
     xlab = "SaleAmount")

# Create a boxplot for SaleAmount
boxplot(treated_myData$SaleAmount,
        data=treated_myData, 
        main="Box Plot of SaleAmount",
        cex.main=1.1)

# Plot a overlapping plot with histogram and kernel density
hist(treated_myData$SaleAmount,
     main = "Histogram with Kernel Density of SaleAmount",
     cex.main=1.1,
     xlab = "SaleAmount",
     freq = FALSE)
lines(density(treated_myData$SaleAmount),
      col = "blue")

The panel above shows 4 plots based on the variable SaleAmount.

● Think of impressing your boss by telling a story from the data and build three interesting plots Hint: Box Plots, Scatterplots, Contingency Tables, Map plots, Dot plots, Density plots or Histograms

# Remove outliers from AssessedValue
outlier_values <- boxplot.stats(myData$AssessedValue)$out

#treated_myData[which(treated_myData$AssessedValue %in% outlier_values),]

# To not lose the original data, store the data without outliers in a new variable
treatedNew_myData = treated_myData[-which(treated_myData$AssessedValue %in% outlier_values),]

#ScatterPlot
plot(treatedNew_myData$AssessedValue,
     treatedNew_myData$SaleAmount,
     main = "Scatter Plot for SaleAmount vs AssessedValue",
     xlab = "AssessedValue",
     ylab = "SaleAmount",
     pch = 20,
     xlim = c(0, 460000),
     ylim = c(0, 705000))

# Plot regression line (AssessedValue ~ SaleAmount)
abline(lm(treatedNew_myData$AssessedValue~treatedNew_myData$SaleAmount),
       col="red")

# Basic Scatterplot Matrix
pairs(~SaleAmount+SalesRatio+AssessedValue,
      data=treatedNew_myData,
      main="Simple Scatterplot Matrix")

# Grouped Bar Plot

# create an aggregated table for County and Property Value
bp <- table(treated_myData$County, treated_myData$PropertyValue)

# Plot a grouped bar graph for the aggregated table
barplot(bp, 
        main="PropertyValue Distribution by County",
        xlab="County",
        col=c("red","blue","darkgreen","yellow", "orange","cyan","green","brown"),
        legend = rownames(bp),
     #  horiz = TRUE,
        beside = TRUE)

# create a seperate data frame for SaleAmount, County and ListYear
dc_table <- data.frame(County = treated_myData$County, 
                       ListYear = treated_myData$ListYear,
                       SaleAmount = treated_myData$SaleAmount)

# import library dplyr
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

# create an aggregated table for SaleAmount based on County and ListYear
dc <- dc_table%>%
      group_by(County, ListYear)%>%
      summarise(n=sum(SaleAmount))

# sort the table according to ListYear
dc = dc[order(dc$ListYear),]

# Assign colors for each ListYear
dc$color[dc$ListYear==2014]="red"

## Warning: Unknown or uninitialised column: 'color'.

dc$color[dc$ListYear==2015]="blue"
dc$color[dc$ListYear==2016]="darkgreen"

# create a multi-paneled plot
par(mfrow = c(1, 1))

# create a dotchart
dotchart(dc$n,
         labels=dc$County, 
         groups = dc$ListYear, 
         cex=.7,
         main="Total Sales per County per year",
         xlab="Sale Amount",
         color = dc$color)

# create a list of unique values in ListYear and corresponding color for Legend
ly <- unique(dc$ListYear)
clr <- unique(dc$color)

# create a legend for the dot plot
legend("topright", 
       inset = .0125, 
       title = "ListYear",
       legend= ly,
       fill = clr)

Scatter Plot:

Among the plots above, the first plot is a scatterplot for SaleAmount vs AssessedValue. It can be seen that the spread visible indicates some positive correlation between both the variables.
Also, the red line across the plot is the best fit regression line that gives an idea of a positive correlation.

Scatterplot Matrix:

The second plot is a scatterplot matrix between SaleAmount, SalesRatio and AssessedValue. It can be seen again that SaleAmount vs AssessedValue has a wide spread of points, with majority laying close to the diagonal direction. Thus, it can be said that both have a positive correlation.
In the SaleAmount vs SaleRatio scatterplot, it can be seen that SaleAmount doesn’t appear to be varying with the SalesRatio. On the contrary, the SaleAmount is varying for the same low values of SalesRatio.
The AssessedValue vs SalesRation scatterplot shows a similar trend as the SaleAmount vs SalesRatio scatterplot. However, the points appear to be a bit more scattered, making it possible for the both to have a non-zero correlation. Further analysis would be required to obtain the exact relation between these two variables.

Grouped Bar Chart:

The third plot is a grouped bar chart that shows the distribution of all the properties among different Counties. From the graph it can be inferred that:
- Majority of the properties in all the counties are Lowrange properties
- Hartford County has the maximum number of Lowrange properties
- Fairfield County has the maximum number of Midrange and Highrange properties
- Tolland County has the least number of Lowrange properties

Dot Plot:

The fourth plot is a Dotplot depicting variation of SaleAmount across all the counties through 2014 to 2016. As seen in the plot:
- Fairfield County had the highest SaleAmount generation in 2014 and 2016. However, in 2015, it had the 2nd highest SaleAmount, following Hartford County with the highest SaleAmount in 2015.
- Windham County has the least SaleAmount all through 2014-2016.
- It can be infered that the total SaleAmount of all the counties was least in 2015.

Q7: Aggregating, Multidimensional Tables (30 points) ● Generate a Frequency tables based on the count of properties in each town.

# import library dplyr
library(dplyr)

# group data by Town and calculate the number of entries for each town
myData%>%
  group_by(Town)%>%
  summarize(n=n())

● Build a contingency table using the variable ‘Property value’ based on the percentages of row and overall total of properties in each range.

# calculate the percentage of data for each Town
myData%>%
  group_by(PropertyValue)%>%
  summarize(n=n())%>%
  mutate(prop=n/sum(n))

# calculate the % and total % of data for each town using Package epidisplay
library(epiDisplay)

## Loading required package: foreign

## Loading required package: survival

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## Loading required package: nnet

## 
## Attaching package: 'epiDisplay'

## The following objects are masked from 'package:psych':
## 
##     alpha, cs, lookup

tab1(myData$PropertyValue,
     main = "Contingency graph for PropertyValue")

## myData$PropertyValue : 
##           Frequency Percent Cum. percent
## LowRange     120076    82.3         82.3
## MidRange      19486    13.3         95.6
## HighRange      6425     4.4        100.0
##   Total      145987   100.0        100.0

● Create a multidimensional table using the dataset based on Town, County and list year

# Multidimensional table for number of entries according to Town, County and ListYear 
# using dplyr
myData%>%
  group_by(Town, County, ListYear)%>%
  summarize(n=n())

Q8: Storytelling with Data (20 points) ● Tell a story based on your understanding and analysis of the data. Hint: Feel free to use R Markdown

The given data had two datasets. One of the datasets contained information about the towns in Connecticut. The second dataset consisted of information about the Real Estate sales that happened from the year 2014 to 2016
After merging both the datasets, the resultant dataset contains the town-wise real estate sale information for the 169 towns in all the eight counties of Connecticut
There were missing values in the columns DateYear, SaleAmount and ResidentialType. To get the data ready for analysis, missing values in DateYear and SaleAmount were imputed in reference to the columns ListYear, AssessedValue and SalesRatio
The summary statistics, along with the plots generated for SaleAmount, SalesRatio and AssessedValue give an idea that SaleAmount and AssessedValue are positively correlated, whareas SalesRatio doesn’t really have an effect of the variation of SaleAmount and AssessedValue
Through further analysis of the data by plotting graphs of SaleAmount with respect to the Counties and ListYear, it can be observed that:
- In 2014, the SaleAmount in Fairfield County were the highest, with Stamford having the highest SaleAmount for Apartments. In the same year, SaleAmount was lowest in the towns of Woodstock and Woodbury, both for residential properties
- In 2015, the SaleAmount was highest for Hartford County, whereas Stamford was the town with highest SaleAmount
- In 2016 too, Stamford had the highest SaleAmount and Fairfield county has the highest SaleAmount among all counties
- Over the course of these three years, the SaleAmount for Commercial properties has been higher than rest of the PropertyTypes
- Looking at the aggregation results for PropertyValue, it is clear that during 2014 to 2016, 82.3% of the properties sold were low range properties, 4.4% were high range properties, the remaining were mid range properties.
Through the above observations and analysis, it can be said that among all the counties in Connecticut, Fairfield county has the highest SaleAmount for real-estate sales, specially for Apartments and Commercial properties.

Assignment 1

NetID-MWS19001