Introduction

In this project, I create database schema for two different databases and implement one of the databases in the SQL Server database management system(DBMS). The databases considered include:

  1. A database for an Animal Hospital to maintain pet histories and accounting information.
  2. A database of baseball game-level statistics.

The Animal Hospital database can be considered as an example of a top-down approach to database developmnet where a new database is developed as part of an information systems planning and develpment. I create a database schema for this database without its physical implementation in a specific database management system.

The second database, on the otherhand, is an example of a bottom-up approach to database development where, for example, a database may be created to improve the management of existing data. Here, I study existing data on baseball games (stored in csv files) and create a normalized database to manage this data. A database schema is developed and this is physically implemented in the SQL Server DBMS.

It must be noted I am not familiar at all with the two contexts of database development. For instance, I’m not a baseball fan and knows nothing about the game. However, these scenarios characterize some of the projects that data analyts/scientists work on. In such cases, they have to read relevant literature and also interact with domain experts to gain appropriate understanding of the problem domain. I use the former approach as well as my knowledge of data modelling in designing the two databases.


Animal Hospital Database

As noted earlier, the Animal Hospital database is to help maintain information about pets. For accountability purposes, it should also help track the action of employees and manage transactions with customers, in this case pet owners.

I use the REA(Resources-Events-Agents) data model in designing the Animal Hospital database. The REA data model is especially used for designing databases for an accounting information systems. It provides guidance for identifying entities and structuring the relationship among the entities. An entity is anything about which a firm wants to collect and store information. As the name suggests, it classifies entities into three categories:

  1. Resources - things that have economic value to an organization
  2. Events - business activities that the organization engages in. Some events affect a resource by either increasing (“get” event) or decreasing the quantity of a resource (“give” event)
  3. Agents - people and organizations (both internal and external to the organization) that participate in events.

The three types of entities and the relationships between them are briefly discussed next as applied to the Animal Hospital Database.

REA Data Model

I use the following brief scenario in developing the data model for the Animal Hospital database.

A pet owner must schedule an appointment for every pet that is to be seen by a doctor. For, each appointment, the pet’s weight and the doctor’s diagnoses are recorded. Depending on the diagnoses, the doctor may prescribe any number of different medicines to cure the pet. Pet owners are charged different amounts for each appointment depending on the complexity of the appointment. There are also additional costs for medicines.

Developing a REA data model starts with identifying events and next, the resources and agents. Following the above scenario, here are the REA entities I consider important in developing a database for the Animal Hospital.

  1. Events: scheculing, appointment, cash receipt (i.e., payment received from pet owner).
  2. Resources: an account for cash deposit, medicine, diagnoses (i.e., pet histories)
  3. Agents: pet owner, pet, an employee of the hospital.

An Entity-Relationship(ER) diagram of the REA data model for the Animal Hospital database showing the relationship between entities is provided below. Some attributes of the entities have also been defined.

REA Diagram for an Animal Hospital Database

REA Diagram for an Animal Hospital Database

All the events entities are linked to resources and agents except the Schedule event which does not affect a resource. The Schedule event is an example of what is called a commitment event. In this case, it means a commitment for a pet appointment and it is only when the appointment occurs that resources are affected. Commitment events are therefore linked to other events in a sequential cause-effect relationships. As shown in the diagram, the Schedule event occurs before the Appointment event.

Appointment is a critical event which acts as both a “get” event and a “give” event as every pet appointment may add to the diagnosis information maintained and/or decrease the quantity of avaialable medicines if prescriptions are made for a pet. It must be noted that Appointment Diagnosis List and Prescription are associative entities which resolve the M:N relationship between Appointment and the entities, Diagnosis and Medication.An appointment will also usually result in a cash payment by the pet owner to the hospital.The Cash Receipt event captures this payment and as a “give” event, it increases the cash resource (i.e., Account) of the hospital.

The relationship between events and agents is straightforward;every event is handled by an Employee of the hospital and is link to an external agent which in this case is either a Pet Owner or a Pet.


Baseball Game Database

As noted in the introduction, the purpose of the baseball game database is to ensure proper management of existing game data which is stored in csv files. The data relates to Major League Baseball games and is compiled by Retrosheet, a non-profit organization that has gathered game statistics since the 1800s. We will be working with four main datasets obtained from Retrosheet:

  • game_log.csv
  • team_codes.csv
  • park_codes.csv
  • person_codes.csv.

Let’s begin by importing all the R modules relevant for this project and do some basic exploration of the datasets to get acquainted with the data.

Importing Modules and Data

library(odbc)
library(DBI)
library(readr)
library(dplyr)
#read in the datasets and store them as dataframes
game_log <- read_csv('game_log.csv')
park_codes <- read_csv('park_codes.csv')
person_codes <- read_csv('person_codes.csv')
team_codes <- read_csv('team_codes.csv')
#create a function which shows the dimensions(i.e., total rows and columns) of dataframes

list_of_df <- list(game_log, park_codes, person_codes, team_codes)
names(list_of_df) <- c("game_log", "park_codes", "person_codes", "team_codes")

get_df_dims <- function(df_list) {
 
  dim_mat <- matrix(ncol = 2)
  
  for (df in df_list) {
    if (all(is.na(dim_mat))) {
      dim_mat <- rbind(c(dim(df)[1], dim(df)[2]))
    }
    else 
      dim_mat <- rbind(dim_mat, c(dim(df)[1], dim(df)[2]))
    
  }
  
  colnames(dim_mat) <- c("Num_of_rows", "Num_of_cols")
  rownames(dim_mat) <- names(df_list)
  dim_mat
}

get_df_dims(list_of_df)
##              Num_of_rows Num_of_cols
## game_log          171907         161
## park_codes           252           9
## person_codes       20494           7
## team_codes           150           8

We can see that the dataset containing game-level statistics, game_log, is the largest of the datasets. This is the main file we will be working from. The other datasets will serve as basis for three of the tables needed in creating a normalized database of baseball game statistics.Let’s learn a bit more about each dataset.

Exploring the Data

When working with a new data, it is always important to develop a good understanding of the data before manipulating or analysing it. The purpose of this project is to develop a normalised database. Thus, our exploration of the data will focus on understanding the variables/columns in each dataset and the relationships between variables within and across datasets.

#fetch top 6 rows from game_log
head(game_log)
#fetch the last 6 rows from game_log
tail(game_log)

It appears game data is stored chronologically by the date the game was played and the available data covers the period 1871-2016.Details captured about each game include general game information, stats for each team, players from each team numbered with their defensive positions 1-9, umpires that officiated the game, and awards such as winning and losing pitchers.

The following article specifies the names of the 9 positions as follows:

  1. Pitcher (P)
  2. Catcher (C)
  3. 1st Base (1B)
  4. 2nd Base (2B)
  5. 3rd Base (3B)
  6. Shortstop (SS)
  7. Left Field (LF)
  8. Center Field (CF)
  9. Right Field (RF)
head(park_codes)

park_codes contains details (name, nickname, city of the park, etc) about baseball parks. The park_id field has a matching field in game_log.

head(team_codes)

team_codes lists all teams; team_id matches v_name and h_name in game_log.

head(person_codes)

person_codes seems to capture all persons and their roles in a baseball game: players, managers, coaches and umpires. The data shows a person can be one or more of these roles. The id column identifies each person and seems to match up with data in game_log.

Designing a Normalized Schema

Normalization is the formal process of successively reducing relations that have anomalies to produce smaller, well-structured and stable relations. A relation is a named, two-dimensional table of data. A well-structured relation contains minimum amount of redundancy and allows users to INSERT, UPDATE and DELETE rows in a table without errors or inconsistencies(Hoffer et al., 2016).

Critical to normalization is designating a primary key for a relation. Primary keys could be identified in all the explored datasets except game_log. However, the data dictionary on the Retrosheet site shows that Retrosheet uses a twelve-character ID comprising of the date, h_name and number_of_game fields to identify each game.We will create a game_id field which combines these fields to identify games in our database.

Following the above data exploration, here are some examples of anomalies that need to be addressed in creating a normalized database.

1. Column Repetition

  • In game_log, all game-level stats are repeated for the home team and the visiting team. We can address this column repetition by having a table that lists each game stat twice, one for each team.
  • The appearance in a game has also been repeated for players, managers and umpires. We can create a table to capture appearances in games by these persons.

2. Repetition of data across tables

  • The start, end and sequence columns in team_codes can be derived from game-level data in game_log.
  • Similarly, the start and end columns in park_codes can be reproduced using data in game_log.
  • The identifying information (iDs, names, etc) of teams and persons in game_log can be obtained from team_codes and person_codes respectively. Though, person_codes has data about coaches, game_log has no information about coaches so will not store information about coaches in the database.
  • The debut dates in person_codes can be derived from game-level data in game_log.

3. Non-primary key columns not attributes of the primary key

In a normalized table, every non-key attribute must be determined by the primary key. All attributes not determined by the primary key are better placed in their own tables.

  • In park_codes, state can be identified by city. This is an example of transitive dependency and these attributes should be in a separate table(Hoffer et al., 2016).
  • As noted above regarding column repetition, in game_log, we can keep team-specific stats in a sperate table and game appearances of persons in another table. All other stats not referring to a specific team or player can also be kept in another table.
  • We can also have a table to store information about the various leagues.

Below is the normalized schema of the baseball game database based on the above normalization considerations.

Schema of Baseball Game Database

Schema of Baseball Game Database


Importing Original Game Data

With our normalized schema ready, let’s import the original data into a database called MLB which has been created in SQL Server.

#establish a connection to MLB database in SQL SERVER
db_conn <- dbConnect(odbc(),
                  Driver = "SQL SERVER",
                  Server = "ERNEST",
                  Database = "MLB",
                  Trusted_Connection = "True")

#call dbExecute() from within a custom function to prevent printing its output
execute_command <- function(conn, query) {
  x <- dbExecute(conn, query)
}
#check if table already exists and drop it
for (name in c("game_log", "park_code", "person_code", "team_code")) {
  query <- "DROP TABLE IF EXISTS %s;"
  query <- gsub("%s", name, query, fixed = TRUE)
  execute_command(db_conn, query)
}
#create tables in MLB database and write the dataframes into them.
dbWriteTable(conn=db_conn, name="game_log", value=game_log, header=TRUE)
dbWriteTable(conn=db_conn, name="park_code", value=park_codes, header=TRUE)
dbWriteTable(conn=db_conn, name="person_code", value=person_codes, header=TRUE)
dbWriteTable(conn=db_conn, name="team_code", value=team_codes, header=TRUE)

#show all tables in MLB database
dbGetQuery(db_conn, "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE='BASE TABLE'")

Let’s update game_log table by adding game_id column which, as noted earlier, will comprise of the date, h_name and number_of_game columns to identify each game.

add_col <- "
              IF NOT EXISTS (
                SELECT 
                  * 
                FROM INFORMATION_SCHEMA.COLUMNS 
                WHERE TABLE_NAME = 'game_log' AND COLUMN_NAME='game_id'
              )
              
              BEGIN
                ALTER TABLE game_log
                ADD game_id VARCHAR(20)
              END;
           "

update_col <- '
                UPDATE game_log
                SET game_id = CONCAT(STR(date,8,0), h_name, number_of_game)
                WHERE game_id IS NULL;
              '

execute_command(db_conn, add_col)
execute_command(db_conn, update_col)

#check if game_id has been added to game_log
dbGetQuery(db_conn, "SELECT TOP(5) game_id, date, h_name, number_of_game FROM game_log")

Creating and Populating Normalized Tables

We can now create and populate the normalized tables with the original game data imported into SQL Server. We will start by first creating all parent tables which need to exist before the child tables which will have foreign key relations to these tables.

Adding Person, League, City and Appearance_type Tables

create_tab <- "
                IF NOT EXISTS (
                    SELECT 
                      * 
                    FROM INFORMATION_SCHEMA.TABLES 
                    WHERE TABLE_NAME = 'person'
                )
                
                BEGIN
                   CREATE TABLE person (
                      person_id vARCHAR(10) PRIMARY KEY,
                      first_name VARCHAR(25),
                      last_name VARCHAR(25)
                   )
                END;
              "


insert_vals <- "
                  INSERT INTO person
                  SELECT 
                    id, 
                    first, 
                    last 
                  FROM person_code pc
                  WHERE 
                  NOT EXISTS (
                    SELECT 1
                    FROM person p
                    WHERE pc.id = p.person_id
                  );
               "

execute_command(db_conn, create_tab)
execute_command(db_conn, insert_vals)
dbGetQuery(db_conn, "SELECT TOP(5)* FROM person")

Full names of leagues are not provided in the original datasets. Let’s retrieve the various leagues and the years when they were played. This may help in searchig for the names of the leagues.

#mlb <- na.omit(distinct(game_log, h_league))
mlb_dates <- game_log %>% 
              select(date, h_league)%>%
              group_by(h_league)%>%
              summarize(Date_From = min(date), Date_To = max(date)) %>%
              arrange(Date_From)

mlb_dates <- mlb_dates %>%
                rename(League = h_league) %>%
                mutate(Date_From = as.Date(as.character(Date_From), "%Y%m%d")) %>%
                mutate(Date_To = as.Date(as.character(Date_To), "%Y%m%d"))
              
mlb_dates

Altogether, there are six leagues captured in the datasets and it seems that games played between 1871 and 1875 were not part of any league. As the results may suggest, a Wikipedia article shows National League(NL) and American League(AL) as the only current leagues. I used the dates of the leagues to search for the names of the remaining leagues which are listed below:

  1. American Association(AA)
  2. Union Association(UA)
  3. Players League(PL)
  4. Federal League(FL)

With the names of the leagues established, we can create and populate the League table.

create_tab <- "
                  IF NOT EXISTS (
                    SELECT 
                      * 
                    FROM INFORMATION_SCHEMA.TABLES 
                    WHERE TABLE_NAME = 'league'
                 )
                 CREATE TABLE league (
                    league_id CHAR(2) PRIMARY KEY,
                    name VARCHAR(30)
                 );
              "

insert_vals <- "
                  IF NOT EXISTS (
                    SELECT 1
                    FROM league
                    WHERE league_id = 'AA'
                  )
                  
                  BEGIN
                    INSERT INTO league
                    VALUES
                    ('AA', 'American Association'),
                    ('AL', 'American League'),
                    ('FL', 'Federal League'),
                    ('NL', 'National League'),
                    ('PL', 'Players League'),
                    ('UA', 'Union Association')
                  END;
               "

execute_command(db_conn, create_tab)
execute_command(db_conn, insert_vals)
dbGetQuery(db_conn, "SELECT * FROM league")
create_tab <- "
                  IF NOT EXISTS (
                    SELECT 
                      * 
                    FROM INFORMATION_SCHEMA.TABLES 
                    WHERE TABLE_NAME = 'city'
                  )
                  CREATE TABLE city (
                      city_id INT IDENTITY(1,1) PRIMARY KEY,
                      name VARCHAR(30),
                      state TEXT
                  );
              "

insert_vals <- '
                  WITH park_and_team_cities AS
                  (
                    SELECT city FROM park_code
                    UNION
                    SELECT city FROM team_code
                  )
                  
                  INSERT INTO city (
                    name,
                    state
                  )
              
                  SELECT
                    DISTINCT pt.city,
                    p.state
                  FROM park_and_team_cities pt
                  LEFT JOIN park_code p ON pt.city = p.city
                  WHERE 
                  NOT EXISTS (
                    SELECT 1
                    FROM city c
                    WHERE c.name = pt.city
                  );
              '

execute_command(db_conn, create_tab)
execute_command(db_conn, insert_vals)
dbGetQuery(db_conn, "SELECT TOP(5)* FROM city")

The appearance_type table stores information on the different types of appearances available by players, umpires and managers. It also includes data on awards like winning or losing pitcher. These data are contained in appearance_type.csv file which we will use to populate the table.

create_tab <- "
                 IF NOT EXISTS (
                    SELECT 
                      * 
                    FROM INFORMATION_SCHEMA.TABLES 
                    WHERE TABLE_NAME = 'appearance_type'
                  )
                 CREATE TABLE appearance_type (
                    appearance_type_id VARCHAR(5) PRIMARY KEY,
                    name VARCHAR(25),
                    category VARCHAR(10)
                 );
 
             "
execute_command(db_conn, create_tab)

appearance_type_df <- read_csv('appearance_type.csv')

is_populated <- dbGetQuery(db_conn, "SELECT COUNT(*) num_of_rows FROM appearance_type")

if(is_populated$num_of_rows == 0) {
  dbWriteTable(db_conn, "appearance_type", appearance_type_df, append=TRUE, row.names=FALSE)
}

dbGetQuery(db_conn, "SELECT TOP(5)* FROM appearance_type")

Adding Park, Team and Game Tables

create_tab <- "
                 IF NOT EXISTS (
                    SELECT 
                      * 
                    FROM INFORMATION_SCHEMA.TABLES 
                    WHERE TABLE_NAME = 'park'
                 )
                 
                 BEGIN
                    CREATE TABLE park (
                        park_id CHAR(5) PRIMARY KEY,
                        name VARCHAR(70),
                        nickname VARCHAR(70),
                        city_id INTEGER,
                        notes VARCHAR(70),
                        FOREIGN KEY (city_id) REFERENCES city(city_id)
                    )
                END;
              "

insert_vals <- '
                  INSERT INTO park 
                  SELECT
                    pc.park_id,
                    pc.name,
                    pc.aka,
                    c.city_id,
                    pc.notes
                  FROM park_code pc
                  LEFT JOIN city c ON pc.city = c.name
                  WHERE 
                  NOT EXISTS (
                    SELECT 1
                    FROM park p
                    WHERE pc.park_id = p.park_id
                  );
              '

execute_command(db_conn, create_tab)
execute_command(db_conn, insert_vals)
dbGetQuery(db_conn, "SELECT TOP(5)* FROM park")               
create_tab <- "
                  IF NOT EXISTS (
                    SELECT 
                      * 
                    FROM INFORMATION_SCHEMA.TABLES 
                    WHERE TABLE_NAME = 'team'
                  )
                  
                  BEGIN
                    CREATE TABLE team (
                        team_id CHAR(3) PRIMARY KEY,
                        city_id INTEGER,
                        nickname VARCHAR(50),
                        franch_id CHAR(3),
                        FOREIGN KEY (city_id) REFERENCES city(city_id)
                    )
                  END;
              "

insert_vals <- '
                  INSERT INTO team 
                  SELECT
                    DISTINCT tc.team_id,
                    c.city_id,
                    tc.nickname,
                    tc.franch_id
                  FROM team_code tc
                  LEFT JOIN city c ON tc.city = c.name
                  WHERE 
                  NOT EXISTS (
                    SELECT 1
                    FROM team t
                    WHERE tc.team_id = t.team_id
                  );
              '

execute_command(db_conn, create_tab)
execute_command(db_conn, insert_vals)
dbGetQuery(db_conn, "SELECT TOP(5)* FROM team")                
create_tab <- "
                  IF NOT EXISTS (
                    SELECT 
                      * 
                    FROM INFORMATION_SCHEMA.TABLES 
                    WHERE TABLE_NAME = 'game'
                  )
                  
                  BEGIN
                    CREATE TABLE game (
                        game_id VARCHAR(20) PRIMARY KEY,
                        date_of_game DATE,
                        number_of_game TINYINT,
                        park_id CHAR(5),
                        length_outs TINYINT,
                        day_or_night CHAR(1),
                        completion TEXT,
                        forfeit CHAR(1),
                        protest VARCHAR(2),
                        attendance INT,
                        length_minutes SMALLINT,
                        additional_info TEXT,
                        acquisition_info CHAR(1),
                        FOREIGN KEY (park_id) REFERENCES park(park_id)
                    )
                  END;
              "
                  
insert_vals <- '
                  INSERT INTO game
                  SELECT
                      game_id,
                      STR(date,8,0),
                      number_of_game,
                      park_id,
                      length_outs,
                      day_night,
                      completion,
                      forefeit,
                      protest,
                      attendance,
                      length_minutes,
                      additional_info,
                      acquisition_info
                  FROM game_log gl
                  WHERE 
                  NOT EXISTS (
                    SELECT 1
                    FROM game g
                    WHERE gl.game_id = g.game_id
                  );
              '

execute_command(db_conn, create_tab)
execute_command(db_conn, insert_vals)
games <- dbGetQuery(db_conn, "SELECT COUNT(game_id) tot_count FROM game")
sprintf("Total number of games played: %i", games$tot_count)
## [1] "Total number of games played: 171907"
dbGetQuery(db_conn, "SELECT TOP(5)game_id, date_of_game, park_id, attendance, length_minutes FROM game")

Adding Team Appearance Table

create_tab <- "
                IF NOT EXISTS (
                    SELECT 
                      * 
                    FROM INFORMATION_SCHEMA.TABLES 
                    WHERE TABLE_NAME = 'team_appearance'
                )
                
                BEGIN
                  CREATE TABLE team_appearance (
                     team_id CHAR(3),
                     game_id VARCHAR (20),
                     home BIT,
                     league_id CHAR(2),
                     score SMALLINT,
                     line_score TEXT,
                     at_bats SMALLINT,
                     hits SMALLINT,
                     doubles SMALLINT,
                     triples SMALLINT,
                     homeruns SMALLINT,
                     rbi SMALLINT,
                     sacrifice_hits SMALLINT,
                     sacrifice_flies SMALLINT,
                     hit_by_pitch SMALLINT,
                     walks SMALLINT,
                     intentional_walks SMALLINT,
                     strikeouts SMALLINT,
                     stolen_bases SMALLINT,
                     caught_stealing SMALLINT,
                     grounded_into_double SMALLINT,
                     first_catcher_interference SMALLINT,
                     left_on_base SMALLINT,
                     pitchers_used TINYINT,
                     individual_earned_runs SMALLINT,
                     team_earned_runs SMALLINT,
                     wild_pitches SMALLINT,
                     balks SMALLINT,
                     putouts SMALLINT,
                     assists SMALLINT,
                     errors SMALLINT,
                     passed_balls SMALLINT,
                     double_plays SMALLINT,
                     triple_plays SMALLINT,
                     PRIMARY KEY (team_id, game_id),
                     FOREIGN KEY (team_id) REFERENCES team (team_id),
                     FOREIGN KEY (game_id) REFERENCES game (game_id),
                     FOREIGN KEY (league_id) REFERENCES league (league_id)
                )
              END;
            "

execute_command(db_conn, create_tab)
insert_vals <- '
                  INSERT INTO team_appearance
                    SELECT
                        h_name,
                        game_id,
                        1 AS home,
                        h_league,
                        h_score,
                        h_line_score,
                        h_at_bats,
                        h_hits,
                        h_doubles,
                        h_triples,
                        h_homeruns,
                        h_rbi,
                        h_sacrifice_hits,
                        h_sacrifice_flies,
                        h_hit_by_pitch,
                        h_walks,
                        h_intentional_walks,
                        h_strikeouts,
                        h_stolen_bases,
                        h_caught_stealing,
                        h_grounded_into_double,
                        h_first_catcher_interference,
                        h_left_on_base,
                        h_pitchers_used,
                        h_individual_earned_runs,
                        h_team_earned_runs,
                        h_wild_pitches,
                        h_balks,
                        h_putouts,
                        h_assists,
                        h_errors,
                        h_passed_balls,
                        h_double_plays,
                        h_triple_plays
                    FROM game_log g
                    WHERE 
                    NOT EXISTS (
                      SELECT 1
                      FROM team_appearance t
                      WHERE g.h_name = t.team_id
                      AND g.game_id = t.game_id
                    )
                  
                  UNION
                    
                    SELECT
                        v_name,
                        game_id,
                        0 AS home,
                        v_league,
                        v_score,
                        v_line_score,
                        v_at_bats,
                        v_hits,
                        v_doubles,
                        v_triples,
                        v_homeruns,
                        v_rbi,
                        v_sacrifice_hits,
                        v_sacrifice_flies,
                        v_hit_by_pitch,
                        v_walks,
                        v_intentional_walks,
                        v_strikeouts,
                        v_stolen_bases,
                        v_caught_stealing,
                        v_grounded_into_double,
                        v_first_catcher_interference,
                        v_left_on_base,
                        v_pitchers_used,
                        v_individual_earned_runs,
                        v_team_earned_runs,
                        v_wild_pitches,
                        v_balks,
                        v_putouts,
                        v_assists,
                        v_errors,
                        v_passed_balls,
                        v_double_plays,
                        v_triple_plays
                    FROM game_log g
                    WHERE 
                    NOT EXISTS (
                      SELECT 1
                      FROM team_appearance t
                      WHERE g.v_name = t.team_id
                      AND g.game_id = t.game_id
                    );
            '
                    
execute_command(db_conn, insert_vals)
games <- dbGetQuery(db_conn, "SELECT COUNT(DISTINCT game_id) tot_count FROM team_appearance")
sprintf("Total number of games played: %i", games$tot_count)
## [1] "Total number of games played: 171907"
#fetch first and last games played
query <- "
            DECLARE @first_game VARCHAR(20)
            DECLARE @last_game VARCHAR(20)
            
            SET @first_game = (SELECT MIN(game_id) FROM team_appearance)
            SET @last_game = (SELECT MAX(game_id) FROM team_appearance)
            
            
            SELECT
                game_id,
                team_id,
                league_id,
                score
            FROM team_appearance
            WHERE game_id IN (@first_game, @last_game);
         "
dbGetQuery(db_conn, query)

Adding Person Appearance Table

create_tab <- "
                IF NOT EXISTS (
                    SELECT 
                      * 
                    FROM INFORMATION_SCHEMA.TABLES 
                    WHERE TABLE_NAME = 'person_appearance'
                )
                CREATE TABLE person_appearance (
                  appearance_id INT IDENTITY(1,1) PRIMARY KEY,
                  game_id VARCHAR (20),
                  team_id CHAR(3),
                  person_id VARCHAR(10),
                  appearance_type_id VARCHAR(5),
                  FOREIGN KEY (game_id) REFERENCES game (game_id),
                  FOREIGN KEY (team_id) REFERENCES team (team_id),
                  FOREIGN KEY (person_id) REFERENCES person (person_id),
                  FOREIGN KEY (appearance_type_id) REFERENCES appearance_type (appearance_type_id)
                );
              "
execute_command(db_conn, create_tab)
insert_vals <- "
                  INSERT INTO person_appearance (
                      game_id,
                      team_id,
                      person_id,
                      appearance_type_id
                  )
                  
                      SELECT
                          game_id,
                          NULL,
                          hp_umpire_id,
                          'UHP'
                      FROM game_log
                      WHERE hp_umpire_id IS NOT NULL
                  
                  UNION
                  
                      SELECT
                          game_id,
                          NULL,
                          [1b_umpire_id],
                          'U1B'
                      FROM game_log
                      WHERE [1b_umpire_id] IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          NULL,
                          [2b_umpire_id],
                          'U2B'
                      FROM game_log
                      WHERE [2b_umpire_id] IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          NULL,
                          [3b_umpire_id],
                          'U3B'
                      FROM game_log
                      WHERE [3b_umpire_id] IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          NULL,
                          lf_umpire_id,
                          'ULF'
                      FROM game_log
                      WHERE lf_umpire_id IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          NULL,
                          rf_umpire_id,
                          'URF'
                      FROM game_log
                      WHERE rf_umpire_id IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          v_name,
                          v_manager_id,
                          'MM'
                      FROM game_log
                      WHERE v_manager_id IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          h_name,
                          h_manager_id,
                          'MM'
                      FROM game_log
                      WHERE h_manager_id IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          CASE
                              WHEN h_score > v_score THEN h_name
                              ELSE v_name
                              END,
                          winning_pitcher_id,
                          'AWP'
                      FROM game_log
                      WHERE winning_pitcher_id IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          CASE
                              WHEN h_score < v_score THEN h_name
                              ELSE v_name
                              END,
                          losing_pitcher_id,
                          'ALP'
                      FROM game_log
                      WHERE losing_pitcher_id IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          CASE
                              WHEN h_score > v_score THEN h_name
                              ELSE v_name
                              END,
                          saving_pitcher_id,
                          'ASP'
                      FROM game_log
                      WHERE saving_pitcher_id IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          CASE
                              WHEN h_score > v_score THEN h_name
                              ELSE v_name
                              END,
                          winning_rbi_batter_id,
                          'AWB'
                      FROM game_log
                      WHERE winning_rbi_batter_id IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          v_name,
                          v_starting_pitcher_id,
                          'PSP'
                      FROM game_log
                      WHERE v_starting_pitcher_id IS NOT NULL
                      
                  UNION
                  
                      SELECT
                          game_id,
                          h_name,
                          h_starting_pitcher_id,
                          'PSP'
                      FROM game_log
                      WHERE h_starting_pitcher_id IS NOT NULL;
               "

execute_command(db_conn, insert_vals)
for (letter in c("h", "v")) {
  for (num in 1:9) {
    template <- "
      INSERT INTO person_appearance (
          game_id,
          team_id,
          person_id,
          appearance_type_id
      )
          SELECT 
            game_id,
            %s_name,
            %s_player_%f_id,
            CONCAT('O', %f)
          FROM game_log
          WHERE %s_player_%f_id IS NOT NULL
          
      UNION
          
          SELECT
            game_id,
            %s_name,
            %s_player_%f_id,
            CONCAT('D', %s_player_%f_def_pos)
          FROM game_log
          WHERE %s_player_%f_id IS NOT NULL;
    "
    
    #replace all of the %s and %f with the correct letter number
    template <- gsub("%s", letter, template, fixed = TRUE)
    template <- gsub("%f", num, template, fixed = TRUE)
    
    execute_command(db_conn, template)
  }
}
games <- dbGetQuery(db_conn, "SELECT COUNT(DISTINCT game_id) tot_count FROM person_appearance")
sprintf("Total number of games played: %i", games$tot_count)
## [1] "Total number of games played: 171907"
query <- "
            DECLARE @last_game VARCHAR(20)
            
            SET @last_game = (SELECT MAX(game_id) FROM person_appearance)
            
            
            SELECT
                p.*,
                pa.game_id,
                pa.team_id,
                at.*
            FROM person p
            INNER JOIN person_appearance pa ON p.person_id = pa.person_id
            INNER JOIN appearance_type at ON at.appearance_type_id = pa.appearance_type_id
            WHERE pa.game_id = @last_game
            ORDER BY pa.team_id;
         "

dbGetQuery(db_conn, query)

Deleting Original Tables

#show all tables in MLB database
dbGetQuery(db_conn, "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE='BASE TABLE'")
#check and drop original tables
for (name in c("game_log", "park_code", "person_code", "team_code")) {
  query <- "DROP TABLE IF EXISTS %s;"
  query <- gsub("%s", name, query, fixed = TRUE)
  execute_command(db_conn, query)
}

#show all tables in MLB database
dbGetQuery(db_conn, "SELECT TABLE_NAME FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_TYPE='BASE TABLE'")
#close database connection
dbDisconnect(db_conn)