1. Synopsis

These days, Analytics permits organizations to assemble the data that they need to find high-fit leads and make an enduring establishment for developing business. To deal with high expectations from customers, retailers need to use data being gathered and effectively incorporate and analyzed to improve their dynamic procedure.

New York City has multiple bike services, they offer option to rent bikes to and from multiple locations in New York City. The company which we are portraying is NYC Bike Service Inc. Their business is helping city’s people by renting bikes. As we already discussed that nowadays Big data and Analytics has taken over all market and all the businesses are using Analytics for improving their forecasting and business process. So, NYC Bike Service Inc. wanted to implement big data integration and analytics to forecast their business demand changes with the weather variance in New York City. For this they wanted to collect available weather data of NYC and integrate that with their trip data to analyze how weather affected their business.

So, our team has worked on the requirement of NYC Bike Service Inc. and collected their data and weather data from available sources. Since the company maintained their data on trip details in their database, so I was a structured data. Since we had to collect the weather data from internet, and it was not maintained somewhere properly so we got it as semi structured data. Since the volume of the data was quite big, we used big data integration concepts to integrate the weather and bikeshare datasets and then visualize the dependencies that bike tips had on weather, so that the managers of NYC Bike Service Inc. can decide what all can be changed or what more can be made to increase their business with weather changes.

2. Data and Software

The Data was collected from Kaggle.

New York Citi Bike Share Dataset –> new-york-city-bike-share-dataset

Weather Dataset –> Weather Dataset.

Data Integration and Analysis Tools

We will be using snowflake to integrate the structured and semi structured data if bike trips and weather and create a view, which could directly be used from one of the most popular data visualization tool that is Tableau and create the necessary visuals of the data so that required decisions could be made for business improvements.

Snowflake:

Snowflake is an analytic data warehouse provided as Software-as-a-Service (SaaS). Snowflake provides a data warehouse that is faster, easier to use, and far more flexible than traditional data warehouse offerings. Snowflake’s data warehouse is not built on an existing database or “big data” software platform such as Hadoop. The Snowflake data warehouse uses a new SQL database engine with a unique architecture designed for the cloud. To the user, Snowflake has many similarities to other enterprise data warehouses, but also has additional functionality and unique capabilities.

Tableau:

Tableau is a data visualization software that is used for data science and business intelligence. Tableau can create a wide range of different visualization to interactively present the data and showcase insights. It comes with tools that allow to drill down data and see the impact in a visual format that can be easily understood by any individual. Tableau also comes with real-time data analytics capabilities and cloud support.

R:

R is a programming language and free software environment for statistical computing and graphics supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians and data miners for developing statistical software and data analysis.

3. Data Integration

Integrating Data in Snowflake:

We will now discuss about how we integrated the structured and semi structured data using snowflake.

Schema Alignment:

First step of data integration architecture

Schema Alignment is a process that transforms the tabular data in a project to the data format used by a knowledge base.

  1. Mediated Schema
  • First, a mediated schema is created to provide a unified and virtual view of the disparate sources and capture the salient aspects of the domain being considered.
  1. Attribute Matching
  • Next, attributes in each source schema are matched to the corresponding attributes in the mediated schema.
  1. Schema Mapping
  • According to the correspondences of attribute matching, a schema mapping is built between each source schema and the mediated schema. Such mappings specify the semantic relationships between the contents of different data sources and would be used to reformulate a user query on the mediated schema into a set of queries on the underlying data sources.
  1. Query Answering
  • Users query the underlying data in a data-integration system by formulating queries over the mediated schema.

So, the how we used this data integration process in our project is explained below

As we can see in the below image we have firstly created a table in snowflake which contains the data on the trips which NYC Bike Service Inc. has completed, we have created table Trips with all the details about duration, times, starting and ending stations and their locations etc.

Next step was to put data in table Trips. So, as we got the data from Kaggle in CSV format, we imported csv files into snowflake and then copied those data into the table Trips after removing delimiter etc. Below is the screen capture of loading the data into table and showing the data by using select query on table Trips.

Next step was to collect and load the semi structured data of weather Weather data was on json format, as we can see in below screen capture.

So, we have created the below view from the available json format data and

So, after creating the view, once we query the view we get below data.

Challenges faced during these steps were mostly because the of the weather data, since it was in json format, we had to work more on the technical part to extract the original information from the json format to out tabular format. So, when we want to integrate these data and do the visualization, we could either export these data in csv format and use that directly in Tableau or we could connect tableau with snowflake data warehouse and then do the visualization. Below is the process to export the data

The goal of an entity identification and integration project is to determine that a pair of instances of entity representations taken from two data sets refers to the same thing. Depending on the application, the rules for determining similarity may be stricter or looser, but even in the strictest sense; there must be some identifying key that is shared by both records in the pair.

Record Linkage in Data Integration

Record linkage or combining two or more sets of records referring to the same data from different data sets in the interest of analysis, research, poses another set of privacy problem. Records are often matched according to individual identifiers like names, addresses, Social Security Numbers, dates of birth, or other sensitive information. In our project we have matched the data as per the time from two data sets. Interest in record linkage projects has grown as computer technology has made record-keeping easier and policymakers have increasingly relied on rigorous statistical analyses to evaluate the effects of social programs. Below screen capture shows how we did the record linkage part and integrated the Trips table and Weather view with the help of time and extracted below information for our analysis.

Again, once we have this integrated data available in snowflake database, we can either directly connect it with tableau and directly use these data for visualization or we can extract it from snowflake in csv format and then use it in tableau.

Since NYC Bike Service Inc. also requested us to help them to see which are the starting stations that helps them getting maximum number of rides so that they can work more on putting extra vehicle on that part of the city, we have extracted the data of the starting station and number or rides to help them visualize on tableau and work on the betterment of their business. Below is the screen capture of the query and the extracted data.

Now since we were ready with all our necessary data, we started working on visualizing those data using Tableau.

4. Visualisation

Dashboard 1

  1. The first chart shows the pattern of number of bikers for each month 2. Year wise records - shows total number of riders for each station , the dotted line shows the average number of riders when all stations are taken in account . 3. Age and gender wise splits - it gives us the idea of which age group uses the bikes more and how is the gender split in each age bin. 4. The last graph shows the density of timings for each day , both stop and start times.

Dashboard 2

5. Conclusion

We get the following analysis from the Dashboard,

  1. The Number of Trips are highest in the month of October which is around 130k.
  2. All the stations had the highest number of trips in the year 2016 and very low in the year 2017 as compared to 2016.
  3. Majority of the Rider are male.
  4. More then 50k rider are in the age range of 50-59.
  5. Of all the Stations Pershing Square has the highest number of trips which is very close to 500k while W 31 St & 7 Ave have the lowest number of trips which is close to 300k.
  6. People seem to travel more when the sky is clearer & less when there is smoke or drizzle.