In this project I will be cleaning the Nashville Housing dataset,
that was shared by AlexTheAnalyst trough GitHub using MySQL queries with
R. The dataset is available here
This particular project could have been done completely with R
because I have the .csv file available. However, I decided to make this
project out of the personal interest of learning and illustrating how to
use this amazing tools, MySQL and R, together to query our databases,
clean data, and use the results immediately for analysis and charts. I
believe this concepts can be very useful for me in my future career and
for others.
Notes on importing data to MySQl database
There are multiple ways to import a cvs file into a database. I
believe one of the easiest ways is to convert the cvs file into an sql
file using websites like convertcsv.com,
provided the dataset is not too big (like this one that has 56477 rows),
for different reasons.
First of all, you can customize pretty much everything, including the
data type of the fields, and very especially the date format, which can
save you a lot of time after importing your file. The date in a csv file
usually doesn’t match the format used by MySQL of YYYY-MM-DD and this
creates problems very often during the importation or need to be fixed
after importation.
Second it is very easy to use and fast. It creates the CREATE TABLE
and INSERT INTO statements. After downloading the .sql file, I opened it
with VS Code to checked everything was correct. In this case, the last
three rows were all blanks, so I removed them. Then I just created the
table by simply executing the sql statements from a file using the
source command via the command line like mysql> source file_name like
this:
mysql> source /Users/alexandravelez/Desktop.realestate.sql
Setting up the connection with MySQL and R
There are two options for connecting R to a MySQL database: 1. Using
the odbc package with a database driver 2. Using the RMariaDB
package
I used the first method. I installed MySQL community server on my
mac, MySQL Workbench, and iODBC driver manager. Credits to this video that shows
an easy and quick way to set everything up without even needing the
terminal.
Then I installed the R packages:
The DBI package helps connecting R to database management systems
(DBMS). The odbc package provides a DBI-compliant interface to Open
Database Connectivity (ODBC) drivers. It allows for an efficient, easy
way to setup connection to any database using an ODBC driver, including
SQL Server, Oracle, MySQL, PostgreSQL, SQLite and others. Tidyverse to
use dplyr, dbplyr and ggplot2. Dbplyr is included in the tidyverse and
converts dplyr into SQL code.
Connecting R to the database:
library("DBI")
con <- dbConnect(odbc(), Driver = "/usr/local/mysql-connector-odbc-8.2.0-macos13-arm64/lib/libmyodbc8a.so",
Server = "localhost", Database = "sys", UID = "root", PWD = "testDB_123",
Port = 3306, Password = "password")
The connection was stored in an object named con, and I can see under
connections all my SQL databases.
I can call dbListTables(con) to see all the tables in my sys
database. The name of the table I will work with is ‘realestate’.
I can create a direct link with a table using tbl() from dplyr. This
table is not in R yet, it is just a connection to a table that is in the
database
realestate_db <- tbl(con, "realestate")
Registered S3 methods overwritten by 'dbplyr':
method from
print.tbl_lazy
print.tbl_sql
realestate_db
R displays also Source: table [?? x 19] Database: mysql
[root@:/sys] because R does not know how many rows are in this table, it
just take a look to the first rows of the table that is in the
database.
To bring the realestate table to R I have to use the collect ().
Now I have a data frame exactly like if I used read.csv()
Checking for nulls
Creating a query to look for nulls in UniqueID. In this case I stored
the query in an object to pass it to a dbGetQuery function, but
alternatively I can pass the query directly into the function.
null_UniqueID_query <- "SELECT *
FROM realestate
WHERE UniqueID IS NULL"
Now I will use the dbGetQuery function from de DBI package to pass my
query. The result is a a data frame that can be stored to work with in
the future.
dbGetQuery(con, null_UniqueID_query)
Creating a query to look for nulls in ParcelID
null_ParcelID_query <- "SELECT *
FROM realestate
WHERE ParcelID IS NULL"
Passing the query into the dbGetQuery function:
dbGetQuery(con, null_ParcelID_query)
Creating a query to look for nulls in LandUse:
null_LandUse_query <- "SELECT *
FROM realestate
WHERE LandUse IS NULL"
Passing the query into the dbGetQuery function:
dbGetQuery(con, null_LandUse_query)
I got 0 nulls for UniqueID, Parcel ID and LandUse.
Creating a query to look for nulls in PropertyAddress:
null_PropertyAddress_query <- "SELECT *
FROM realestate
WHERE PropertyAddress IS NULL"
dbGetQuery(con, null_PropertyAddress_query)
I got 29 nulls on the PropertyAddress column.
Populating the nulls in the PropertyAddress.
Taking a closer look to the parcel ID for some of the property
address nulls:
ParceID_query <- "SELECT *
FROM realestate
WHERE ParcelID = '052 01 0 296.00'"
dbGetQuery(con, ParceID_query)
ParcelID_query2 <- "SELECT *
FROM realestate
WHERE ParcelID = '093 08 0 054.00'"
dbGetQuery(con, ParcelID_query2)
When inspecting the nulls in PropertyAddress we can see that there
are several properties with the same ParcelID but one of them without a
Property address. In this case this information can be completed. To do
this we can create a joint to the realestate table to itself and obtain
the values where the parcel id is the same but the propertyaddress is
complete.
With the IFNULL function we want to replace the null value with the
property address of our match in the joint
joint_query <- "SELECT a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, IFNULL(a.PropertyAddress, b.PropertyAddress) as populate_address
FROM sys.realestate a
JOIN sys.realestate b
on a.ParcelID = b.ParcelID
AND a.UniqueID <> b.UniqueID
WHERE
a.PropertyAddress is NULL;"
joint <- dbGetQuery(con, joint_query)
joint
Now I want to populate the nulls with the results that we got. For
this I have to pass an update statement using the dbExecute function.
For SQL queries that affect the underlying database, such as UPDATE,
DELETE, INSERT INTO, and DROP TABLE, DBI provides two functions.
dbExecute() passes the SQL statement to the DBMS for execution and
returns the number of rows affected. dbSendStatement() performs in the
same manner, but returns a result object. Call dbGetRowsAffected() with
the result object to get the count of the affected rows. You then need
to call dbClearResult() with the result object afterwards to release
resources.
dbExecute() is a convenience function that calls dbSendStatement(),
dbGetRowsAffected(), and dbClearResult().
dbExecute(
con,
"UPDATE sys.realestate a
JOIN sys.realestate b
on a.ParcelID = b.ParcelID
AND a.UniqueID <> b.UniqueID
SET a.PropertyAddress = IFNULL(a.PropertyAddress, b.PropertyAddress)
WHERE
a.PropertyAddress is NULL;"
)
[1] 29
The console returned [1] 29 which means the 29 rows affected. Now
passing again the query for nulls in the PropertyAddress column, I get 0
nulls.
dbGetQuery(con, null_PropertyAddress_query)
Breaking the PropertyAddress columns in two separate columns.
The second issue with property address contains the address itself
and the city in a single string. It is ideal to have them in separate
columns. To do this we can simply use the substring index function which
returns a substring from a string before a specified number of
occurrences of the delimiter. In this case the delimiter is a ','.
dbExecute(
con,
"ALTER TABLE realestate
ADD AddressProperty VARCHAR(60),
ADD PropertyCity VARCHAR(50);
")
dbExecute(
con,
"UPDATE realestate
SET AddressProperty = SUBSTRING_INDEX(PropertyAddress, ',', 1),
PropertyCity = SUBSTRING_INDEX(PropertyAddress, ',', -1) ;
")
[1] 56477
Checking the property address information was splitted correctly:
realestate_db <- tbl(con, "realestate")
realestate_db
Breaking the Owner Address columns in three separate columns.
The Property address had the address and the city whereas the owner
address has the address, city and estate in the same string. I extracted
the information with the substring_index function, and in the case of
the information between the two commas, I nested a substring_index
function.
dbExecute(
con,
"ALTER TABLE realestate
ADD AddressOwner VARCHAR(60),
ADD OwnerCity VARCHAR(50),
ADD OwnerEstate VARCHAR(50);
")
dbExecute(
con,
"UPDATE realestate
SET AddressOwner = SUBSTRING_INDEX(OwnerAddress, ',', 1),
OwnerCity = SUBSTRING_INDEX(SUBSTRING_INDEX(OwnerAddress, ',', 2), ',',-1),
OwnerEstate = SUBSTRING_INDEX(OwnerAddress, ',', -1);
")
[1] 26015
Search and remove duplicates
It is safe to assume that if two or more entries share the same
parcel ID, address, legal reference, sale date and sale price then there
is a duplicated entry.
To look for entry I will use the row_number window function, that
will assign a row number to each element in a partition. The partition
will be over the parcel ID, address, legal reference, sale date and sale
price to identify how many elements share exactly the same information
in these columns.
There are 104 duplicates. To remove the duplicates I will use the
dbExdecute function with the same query I got the duplicates
dbExecute(
con,
"WITH DuplicatesCTE as (
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY PropertyAddress,
SalePrice,
SaleDate,
LegalReference,
ParcelID
order by UniqueID
) row_num
FROM sys.realestate
)
DELETE FROM sys.realestate
WHERE UniqueID IN (
SELECT UniqueID
FROM DuplicatesCTE
WHERE row_num > 1
);
")
[1] 104
Descriptive analysis on land use
We can store the entire table or the results of our queries in data
frames in R, and this makes it very easy to create a chart with it in
R.
land_use <- realestate_clean%>%
count(LandUse, sort=TRUE)
land_use
Then organizing the data to display in the next chart, where I will
display the 5 more common land uses, and the remaining 32 categories of
land uses summarized as “Others” since their numbers are very low. This
information will be stored in the land_use_p1 object, that I will pass
into ggplot() later.
land_use_p1 <- land_use[land_use$n >=247, ]
Land_use_others <- sum(land_use[which(land_use$n<1047), 2])
land_use_p1[nrow(land_use_p1) + 1,] = list("OTHERS", Land_use_others)
land_use_p1
Creating a bar char for land use:
I created and stored a theme to apply to the next charts to keep
consistency.
mytheme <- theme(
plot.title = element_text(family = "Arial", face = "bold", size = (15), colour = "#5A5A5A"),
axis.title = element_text(family = "Arial", size = (10), colour = "#808080", hjust=c(1), vjust=c(0)),
axis.text = element_text(family = "Arial", size = (10), colour = "#808080"),
legend.title = element_text(colour = "#808080", face = "bold", family = "Arial"),
legend.text = element_text(colour = "#808080", family = "Arial"),
plot.subtitle = element_text(colour = "#5A5A5A", family = "Arial"),
plot.caption = element_text(colour = "#5A5A5A", family = "Arial")
)
Creating a bar char with ggplot2:
p1 <- ggplot(land_use_p1, aes(x = LandUse, y = n, fill = LandUse)) +
geom_col(width = 0.4) +
scale_x_discrete(labels = c("Duplex", "Others", "Residential\nCondo", "Single\nFamily", "Vacant\nResidential\nLand", "Zero Lot\nLine"))+
scale_fill_manual(values = c("#b3cde0", "#005b96", "#011F4B","#8a0303", "#03396c","#6497b1" ))+
labs(y= "Number of properties", x = "Land Use", title = "Land use of the Different Properties Sold\nIn Nahsville, Tennessee from 2018 to 2021", fill = "LandUse", caption = "Properties sold in Nashville, Tennessee from 2018 to 2021", substitle = "Single family is by far the most common land use of the properties sold during the period, followed by residential condo ")+
mytheme
p1

Descriptive analysis on sale date
To take a quick look about the property sales per year I will group
the information by year and then count the sales:
sale_date_p2 <- realestate_clean %>%
mutate(year = year(SaleDate)) %>%
group_by(year) %>%
count()
sale_date_p2
Creating a bar char for sales per year:
p2 <- ggplot(sale_date_p2, aes(x = year, y = n, fill = year)) +
geom_col(width = 0.4)+
labs(y= "Year of Sale", x = "Number of Sales", title = "Number of Sales Per year in Nahsville, Tennessee,\nfrom 2018 to 2021", fill = "year", caption = "Properties sold in Nashville, Tennessee from 2018 to 2021", subtitle = "The properties sales increased steadily from 2018 to 2020, declining slightly again in 2021 ")+
mytheme+
theme(legend.position="none")
p2

To wrap up, the ODBC, DBI, dplyr and dbplyr used together, allow us
to connect our databases to R and execute queries directly in R. The
results of our queries can be visualized as they are on the database
with the tbl function or we can bring the results to R and store them
into objects with dbGetQuery to analyze the information and create
charts without the need of importing or exporting the information as a
separate file.
---
title: "Nashville Housing - MySQL Database Queries With R "
output: html_notebook
editor_options: 
  chunk_output_type: inline
---

In this project I will be cleaning the Nashville Housing dataset, that was shared by AlexTheAnalyst trough GitHub using MySQL queries with R.
The dataset is available [here](https://github.com/AlexTheAnalyst/PortfolioProjects/blob/main/Nashville%20Housing%20Data%20for%20Data%20Cleaning%20(reuploaded).xlsx)

This particular project could have been done completely with R because I have the .csv file available. However, I decided to make this project out of the personal interest of learning and illustrating how to use this amazing tools, MySQL and R, together to query our databases, clean data, and use the results immediately for analysis and charts. I believe this concepts can be very useful for me in my future career and for others. 

## Notes on importing data to MySQl database

There are multiple ways to import a cvs file into a database. I believe one of the easiest ways is to convert the cvs file into an sql file using websites like [convertcsv.com](https://www.convertcsv.com/csv-to-sql.htm), provided the dataset is not too big (like this one that has 56477 rows), for different reasons. 

First of all, you can customize pretty much everything, including the data type of the fields, and very especially the date format, which can save you a lot of time after importing your file. The date in a csv file usually doesn't match the format used by MySQL of YYYY-MM-DD and this creates problems very often during the importation or need to be fixed after importation. 

Second it is very easy to use and fast. It creates the CREATE TABLE and INSERT INTO statements. After downloading the .sql file, I opened it with VS Code to checked everything was correct. In this case, the last three rows were all blanks, so I removed them. Then I just created the table by simply executing the sql statements from a file using the source command via the command line like mysql> source file_name like this: 

mysql> source /Users/alexandravelez/Desktop.realestate.sql

## Setting up the connection with MySQL and R

There are two options for connecting R to a MySQL database:
1. Using the odbc package with a database driver
2. Using the RMariaDB package

I used the first method. I installed MySQL community server on my mac, MySQL Workbench, and iODBC driver manager. Credits to this [video](https://www.youtube.com/watch?v=uRkvvODWA1Q) that shows an easy and quick way to set everything up without even needing the terminal.

Then I installed the R packages:

The DBI package helps connecting R to database management systems (DBMS). 
The odbc package provides a DBI-compliant interface to Open Database Connectivity (ODBC) drivers. It allows for an efficient, easy way to setup connection to any database using an ODBC driver, including SQL Server, Oracle, MySQL, PostgreSQL, SQLite and others. 
Tidyverse to use dplyr, dbplyr and ggplot2.
Dbplyr is included in the tidyverse and converts dplyr into SQL code.

```{r Installing packages, results='hide'}
install.packages("DBI")
library(DBI)
install.packages("odbc")
library(odbc)
install.packages("tidyverse")
library(tidyverse)

```

Connecting R to the database:

```{r connecting R to MySQL, eval=FALSE}
library("DBI")
con <- dbConnect(odbc(), Driver = "/usr/local/mysql-connector-odbc-8.2.0-macos13-arm64/lib/libmyodbc8a.so", 
                 Server = "localhost", Database = "sys", UID = "root", PWD = "testDB_123", 
                 Port = 3306, Password = "password")
```

```{r echo=FALSE}
library("DBI")
con <- dbConnect(odbc(), Driver = "/usr/local/mysql-connector-odbc-8.2.0-macos13-arm64/lib/libmyodbc8a.so", 
                 Server = "localhost", Database = "sys", UID = "root", PWD = "testDB_123", 
                 Port = 3306, Password = "!Avr3138667")
```

The connection was stored in an object named con, and I can see under connections all my SQL databases.

I can call dbListTables(con) to see all the tables in my sys database. The name of the table I will work with is 'realestate'.

I can create a direct link with a table using tbl() from dplyr. This table is not in R yet, it is just a connection to a table that is in the database

```{r connecting to my table}
realestate_db <- tbl(con, "realestate")
realestate_db
```

R displays also Source:   table<mytable> [?? x 19] Database: mysql  [root@:/sys] because R does not know how many rows are in this table, it just take a look to the first rows of the table that is in the database.

To bring the realestate table to R I have to use the collect (). 

```{r storing the table in a data frame, echo=TRUE}
realestate_df <- collect(realestate_db)
realestate_df
```

Now I have a data frame exactly like if I used read.csv()

## Checking for nulls

Creating a query to look for nulls in UniqueID. In this case I stored the query in an object to pass it to a dbGetQuery function, but alternatively I can pass the query directly into the function.

```{r query nulls UniqueID}
null_UniqueID_query <- "SELECT *
  FROM realestate
  WHERE UniqueID IS NULL"
```
  
Now I will use the dbGetQuery function from de DBI package to pass my query. The result is a a data frame that can be stored to work with in the future.

```{sql}
dbGetQuery(con, null_UniqueID_query)
```

```{r passing the query to look for null in the UniqueID column, echo=TRUE }
dbGetQuery(con, null_UniqueID_query)
```

Creating a query to look for nulls in ParcelID

```{r query nulls ParcelID}
null_ParcelID_query <- "SELECT *
  FROM realestate
  WHERE ParcelID IS NULL"
```

Passing the query into the dbGetQuery function:

```{sql }
dbGetQuery(con, null_ParcelID_query)
```

```{r passing the query to look for null in the ParcelID column }
dbGetQuery(con, null_ParcelID_query)
```
Creating a query to look for nulls in LandUse:

```{r query nulls LandUse}
null_LandUse_query <- "SELECT *
  FROM realestate
  WHERE LandUse IS NULL"
```

Passing the query into the dbGetQuery function:

```{sql}
dbGetQuery(con, null_LandUse_query)
```

```{r passing the query to look for null in the LandUse column }
dbGetQuery(con, null_LandUse_query)
```

I got 0 nulls for UniqueID, Parcel ID and LandUse.

Creating a query to look for nulls in PropertyAddress:

```{r query nulls PropertyAddress}
null_PropertyAddress_query <- "SELECT *
  FROM realestate
  WHERE PropertyAddress IS NULL"
```

```{sql }
dbGetQuery(con, null_PropertyAddress_query)
```

```{r passing the query to look for null in the PropertyAddress column }
dbGetQuery(con, null_PropertyAddress_query)
```

I got 29 nulls on the PropertyAddress column. 

## Populating the nulls in the PropertyAddress.

Taking a closer look to the parcel ID for some of the property address nulls:

```{r creating the query to look for an especific parcel id where the address was null}
ParceID_query <- "SELECT *
                  FROM realestate
                  WHERE ParcelID = '052 01 0 296.00'" 
```

```{sql}
dbGetQuery(con, ParceID_query)
```

```{r passing the query}
dbGetQuery(con, ParceID_query)
```
```{r creating a second query to look for an especific parcel id where the address was null}
ParcelID_query2 <- "SELECT *
                  FROM realestate
                  WHERE ParcelID = '093 08 0 054.00'" 
```

```{sql }
dbGetQuery(con, ParcelID_query2)
```

```{r passing the second query }
dbGetQuery(con, ParcelID_query2)
```
When inspecting the nulls in PropertyAddress we can see that there are several properties with the same ParcelID but one of them without a Property address. In this case this information can be completed. To do this we can create a joint to the realestate table to itself and obtain the values where the parcel id is the same but the propertyaddress is complete. 

With the IFNULL function we want to replace the null value with the property address of our match in the joint
```{r joint query}
joint_query <- "SELECT a.ParcelID, a.PropertyAddress, b.ParcelID, b.PropertyAddress, IFNULL(a.PropertyAddress, b.PropertyAddress) as populate_address
FROM sys.realestate a
JOIN sys.realestate b
   on a.ParcelID = b.ParcelID
   AND a.UniqueID <> b.UniqueID
WHERE 
a.PropertyAddress is NULL;"
```

```{sql}
joint <- dbGetQuery(con, joint_query)
joint
```

```{r passing the joint query}
joint <- dbGetQuery(con, joint_query)
joint
```

Now I want to populate the nulls with the results that we got. For this I have to pass an update statement using the dbExecute function. 
For SQL queries that affect the underlying database, such as UPDATE, DELETE, INSERT INTO, and DROP TABLE, DBI provides two functions. dbExecute() passes the SQL statement to the DBMS for execution and returns the number of rows affected. dbSendStatement() performs in the same manner, but returns a result object. Call dbGetRowsAffected() with the result object to get the count of the affected rows. You then need to call dbClearResult() with the result object afterwards to release resources.

dbExecute() is a convenience function that calls dbSendStatement(), dbGetRowsAffected(), and dbClearResult().

```{r populating the null values in the PropertyAddress column}
dbExecute(
con,
"UPDATE sys.realestate a
JOIN sys.realestate b
   on a.ParcelID = b.ParcelID
   AND a.UniqueID <> b.UniqueID
SET a.PropertyAddress = IFNULL(a.PropertyAddress, b.PropertyAddress)
WHERE 
a.PropertyAddress is NULL;"
)
```

The console returned [1] 29 which means the 29 rows affected. 
Now passing again the query for nulls in the PropertyAddress column, I get 0 nulls. 

```{sql}
dbGetQuery(con, null_PropertyAddress_query)
```

```{r checking all nulls in property address were fixed}
dbGetQuery(con, null_PropertyAddress_query)
```


## Breaking the PropertyAddress columns in two separate columns.

The second issue with property address contains the address itself and the city in a single string. It is ideal to have them in separate columns. To do this we can simply use the substring index function which returns a substring from a string before a specified number of occurrences of the delimiter. In this case the delimiter is a \',\'.

```{r Creating two new columns, AddressProperty and PropertyCity}
dbExecute(
con,
"ALTER TABLE realestate
  ADD AddressProperty VARCHAR(60),
  ADD PropertyCity VARCHAR(50);
")
```

```{r Spliting the property address into two separate columns}
dbExecute(
con,
"UPDATE realestate
  SET AddressProperty = SUBSTRING_INDEX(PropertyAddress, ',', 1),
      PropertyCity = SUBSTRING_INDEX(PropertyAddress, ',', -1) ;
")
```

Checking the property address information was splitted correctly:
```{sql}
realestate_db <- tbl(con, "realestate")
realestate_db
```

```{r Checking the property address information was splitted correctly}
realestate_db <- tbl(con, "realestate")
realestate_db
```

## Breaking the Owner Address columns in three separate columns.
The Property address had the address and the city whereas the owner address has the address, city and estate in the same string. I extracted the information with the substring_index function, and in the case of the information between the two commas, I nested a substring_index function. 

```{r Creating the threee new columns for owner address, owner city and owner estate}
dbExecute(
con,
"ALTER TABLE realestate
  ADD AddressOwner VARCHAR(60),
  ADD OwnerCity VARCHAR(50),
  ADD OwnerEstate VARCHAR(50);
")
```

```{r Splitting Owner address information into the three new columns}
dbExecute(
con,
"UPDATE realestate
  SET AddressOwner = SUBSTRING_INDEX(OwnerAddress, ',', 1),
      OwnerCity = SUBSTRING_INDEX(SUBSTRING_INDEX(OwnerAddress, ',', 2), ',',-1),
      OwnerEstate = SUBSTRING_INDEX(OwnerAddress, ',', -1);
")
```


```{r Checking the owner address information was splitted correctly}
realestate_db <- tbl(con, "realestate")
realestate_db
```

## Unify the format in the SoldAsVacant column

Right now we have Y, N, yes, no in the SoldAsVacant column:

```{r creating a query to look for teh different values in the SoldAsVacant column}
SoldAsVacant_query <- "SELECT DISTINCT(SoldAsVacant), COUNT(SoldAsVacant)
FROM sys.realestate
GROUP BY SoldAsVacant;"

```

```{sql}
dbGetQuery(con, SoldAsVacant_query)
```

```{r passing the SoldAsVacant query}
dbGetQuery(con, SoldAsVacant_query)
```
I used a case statement to unify the format of the SoldAsVacant column : 
```{r case estatement to unify format of the SoldAsVacant column}
dbExecute(
con,
"UPDATE realestate
  SET SoldAsVacant =
case
    when SoldAsVacant = 'Y' then 'Yes'
when SoldAsVacant = 'N' then 'No'
else SoldAsVacant
end;
")
```
Checking the format is Yes and No:

```{r checking the SoldAsVacant format is uniform }
dbGetQuery(con, SoldAsVacant_query)
```

## Search and remove duplicates
 It is safe to assume that if two or more entries share the same parcel ID, address, legal reference, sale date and sale price then there is a duplicated entry.
 
To look for entry I will use the row_number window function, that will assign a row number to each element in a partition. The partition will be over the parcel ID, address, legal reference, sale date and sale price to identify how many elements share exactly the same information in these columns. 

```{r looking for duplicates with the row_number function}
dbGetQuery(
con,
"WITH DuplicatesCTE as (
    SELECT *,
           ROW_NUMBER() OVER(
               PARTITION BY PropertyAddress,
                   SalePrice,
                   SaleDate,
                   LegalReference,
                   ParcelID
               ORDER BY UniqueID
               ) row_num
    FROM sys.realestate
)
    SELECT *
    FROM DuplicatesCTE
    WHERE row_num > 1;
")
```

There are 104 duplicates.
To remove the duplicates I will use the dbExdecute function with the same query I got the duplicates
```{r Removing the duplicates}
dbExecute(
con,
"WITH DuplicatesCTE as (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY PropertyAddress,
                   SalePrice,
                   SaleDate,
                   LegalReference,
                   ParcelID
               order by UniqueID
               ) row_num
    FROM sys.realestate
)
DELETE FROM sys.realestate
WHERE UniqueID IN (
    SELECT UniqueID
    FROM DuplicatesCTE
    WHERE row_num > 1
    );
")
```


## Unify Format in LandUse column. 

Similar to what happened with the SoldAsVacant column, in the LandUse column there is VACANT RESIDENTIAL LAND (3450 rows) and VACANT RES LAND(1549 rows) referring to the same category. This can also be fixed with a case statement

```{r case estatement to unify format of the LandUse column for vacant residential land}
dbExecute(
con,
"UPDATE realestate
  SET LandUse =
case
    when LandUse = 'VACANT RES LAND' then 'VACANT RESIDENTIAL LAND'
else LandUse
end;
")
```

The same issue is happening with RESIDENTIAL CONDO and CONDO, since there is a separate category for non residential condos CONDOMINIUM OFC  OR OTHER COM CONDO. I will fix this with a case statement as well
```{r case estatement to unify format of the LandUse column for condos}
dbExecute(
con,
"UPDATE realestate
  SET LandUse =
case
    when LandUse = 'CONDO' then 'RESIDENTIAL CONDO'
else LandUse
end;
")
```

Now the data is clean, I can bring the result to R and store it in a data frame for further analysis, including creating charts.
```{sql}
realestate_clean <- tbl(con, "realestate")
realestate_clean <- collect(realestate_clean)
realestate_clean
```

```{r storing clean data in a data frame}
realestate_clean <- tbl(con, "realestate")
realestate_clean <- collect(realestate_clean)
realestate_clean
```

## Descriptive analysis on land use 

We can store the entire table or the results of our queries in data frames in R, and this makes it very easy to create a chart with it in R.
```{sql}
land_use <- realestate_clean%>%
count(LandUse, sort=TRUE)
land_use
```

```{r Inspecting the different types of land use in the data set}
land_use <- realestate_clean%>%
count(LandUse, sort=TRUE)
land_use
```

Then organizing the data to display in the next chart, where I will display the 5 more common land uses, and the remaining 32 categories of land uses summarized as "Others" since their numbers are very low. This information will be stored in the land_use_p1 object, that I will pass into ggplot() later. 

```{r}
land_use_p1 <- land_use[land_use$n >=247, ]
Land_use_others <- sum(land_use[which(land_use$n<1047), 2])
land_use_p1[nrow(land_use_p1) + 1,] = list("OTHERS", Land_use_others)
land_use_p1 
```


Creating a bar char for land use:

I created and stored a theme to apply to the next charts to keep consistency.
```{r Creating a consistent theme to apply across all the charts}
mytheme <- theme(
  plot.title = element_text(family = "Arial", face = "bold", size = (15), colour = "#5A5A5A"),
  axis.title = element_text(family = "Arial", size = (10), colour = "#808080",  hjust=c(1), vjust=c(0)),
  axis.text = element_text(family = "Arial", size = (10), colour = "#808080"),
  legend.title = element_text(colour = "#808080", face = "bold", family = "Arial"),
  legend.text = element_text(colour = "#808080", family = "Arial"),
  plot.subtitle = element_text(colour = "#5A5A5A", family = "Arial"),
  plot.caption = element_text(colour = "#5A5A5A", family = "Arial")
)

```

Creating a bar char with ggplot2:

```{sql}
p1 <- ggplot(land_use_p1, aes(x = LandUse, y = n, fill = LandUse)) + 
  geom_col(width = 0.4) +
  scale_x_discrete(labels = c("Duplex", "Others", "Residential\nCondo", "Single\nFamily", "Vacant\nResidential\nLand", "Zero Lot\nLine"))+
  scale_fill_manual(values = c("#b3cde0", "#005b96", "#011F4B","#8a0303", "#03396c","#6497b1" ))+
  labs(y= "Number of properties", x = "Land Use", title = "Land use of the Different Properties Sold\nIn Nahsville, Tennessee from 2018 to 2021", fill = "LandUse", caption = "Properties sold in Nashville, Tennessee from 2018 to 2021", substitle = "Single family is by far the most common land use of the properties sold during the period, followed by residential condo ")+
  mytheme
p1
```

```{r Creating a bar char of land use, echo=TRUE}
p1 <- ggplot(land_use_p1, aes(x = LandUse, y = n, fill = LandUse)) + 
  geom_col(width = 0.4) +
  scale_x_discrete(labels = c("Duplex", "Others", "Residential\nCondo", "Single\nFamily", "Vacant\nResidential\nLand", "Zero Lot\nLine"))+
  scale_fill_manual(values = c("#b3cde0", "#005b96", "#011F4B","#8a0303", "#03396c","#6497b1" ))+
  labs(y= "Number of properties", x = "Land Use", title = "Land use of the Different Properties Sold\nIn Nahsville, Tennessee from 2018 to 2021", fill = "LandUse", caption = "Properties sold in Nashville, Tennessee from 2018 to 2021", substitle = "Single family is by far the most common land use of the properties sold during the period, followed by residential condo ")+
  mytheme
p1
```
## Descriptive analysis on sale date

To take a quick look about the property sales per year I will group the information by year and then count the sales:

```{sql}
sale_date_p2 <- realestate_clean %>%
mutate(year = year(SaleDate)) %>%
group_by(year) %>%
count()
sale_date_p2
```

```{r Grouping information by sale date}
sale_date_p2 <- realestate_clean %>%
mutate(year = year(SaleDate)) %>%
group_by(year) %>%
count()
sale_date_p2
```

Creating a bar char for sales per year:
```{sql}
p2 <- ggplot(sale_date_p2, aes(x = year, y = n, fill = year)) + 
  geom_col(width = 0.4)+
  labs(y= "Year of Sale", x = "Number of Sales", title = "Number of Sales Per year in Nahsville, Tennessee,\nfrom 2018 to 2021", fill = "year", caption = "Properties sold in Nashville, Tennessee from 2018 to 2021", subtitle = "The properties sales increased steadily from 2018 to 2020, declining slightly again in 2021 ")+
   mytheme+
   theme(legend.position="none")
p2
```

```{r Creating a bar char of ride lengths by customer type, echo=TRUE}
p2 <- ggplot(sale_date_p2, aes(x = year, y = n, fill = year)) + 
  geom_col(width = 0.4)+
  labs(y= "Year of Sale", x = "Number of Sales", title = "Number of Sales Per year in Nahsville, Tennessee,\nfrom 2018 to 2021", fill = "year", caption = "Properties sold in Nashville, Tennessee from 2018 to 2021", subtitle = "The properties sales increased steadily from 2018 to 2020, declining slightly again in 2021 ")+
   mytheme+
   theme(legend.position="none")
p2
```
To wrap up, the ODBC, DBI, dplyr and dbplyr used together, allow us to connect our databases to R and execute queries directly in R. The results of our queries can be visualized as they are on the database with the tbl function or we can bring the results to R and store them into objects with dbGetQuery to analyze the information and create charts without the need of importing or exporting the information as a separate file. 

