# MOVES Speciation Post-Processing SQL Script

The goal of the post-processing script is to calculate the NonHAPTOG emissions that are assigned to each speciation profile, so that the work of speciating the NonHAPTOG can be done outside of MOVES. The SQL script is a translation of work done by Claudia Toro in R. 

The SQL script was tested to confirm that its output exactly matches the output of Claudia's R script. Documentation of that testing can be found elsewhere in this repository.

## SQL Procedures

The design of the script is based on a SQL stored procedure: a routine that builds queries and executes them based on its inputs. Calling a SQL procedure is a lot like calling a function in other languages such as R, Java, and Go.

Procedures are created in a database, so the database name must be prepended to the procedure, unless the call is preceded by a `USE` statement. The inputs can be of any type, including strings.

```sql
-- Example Procedure Call
CALL exampleDB.exampleProcedure(exampleInput1, 'exampleInput2');
```
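For reference, a procedure that builds and executes a query from its inputs might be defined roughly as follows. This is a minimal sketch, not the actual script: the parameter names `minRunID` and `outputDB` are made up for illustration, and the query against `movesoutput` is only a stand-in for real work. `DELIMITER` is a command of the `mysql` command-line client, so the sketch assumes it is run from that client.

```sql
-- Hypothetical definition of the example procedure called above.
CREATE DATABASE IF NOT EXISTS exampleDB;

DELIMITER $$
DROP PROCEDURE IF EXISTS exampleDB.exampleProcedure $$
CREATE PROCEDURE exampleDB.exampleProcedure(IN minRunID INT, IN outputDB VARCHAR(64))
BEGIN
    -- Database names cannot be bound as statement parameters, so the
    -- query text is assembled with CONCAT and run as a prepared statement.
    SET @sql = CONCAT('SELECT COUNT(*) AS nRows FROM `', outputDB,
                      '`.movesoutput WHERE MOVESRunID >= ', minRunID);
    PREPARE stmt FROM @sql;
    EXECUTE stmt;
    DEALLOCATE PREPARE stmt;
END $$
DELIMITER ;
```

Because the database name is concatenated into the query text, a procedure written this way should only be given trusted input.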

A procedure can also be called from the SQL command line, assuming it already exists.

```batch
mysql --user=moves --password=moves
MariaDB [(none)]> CALL exampleDB.exampleProcedure(exampleInput1, 'exampleInput2');
```

Finally, the procedure can be called from the command line as a one-liner by using the `--execute` flag.

```batch
mysql --user=moves --password=moves --execute="CALL exampleDB.exampleProcedure(exampleInput1, 'exampleInput2')"
```

When calling procedures using the `--execute` flag, care needs to be taken when handling strings. Because the code is, itself, a string, I stick to the rule that any string I pass into a procedure is wrapped in single quotes (`'exampleInput2'`), while the command as a whole is wrapped in double quotes.

## Script Design

The script does two things:

1. Drops and creates the collector database, with the hardcoded name `speciation_outside_moves_collected`. 
2. Creates the procedure (called `speciate`) that post-processes the output and writes the results to the collector database. 

The procedure is run on a MOVES output database, and uses a working database (similar in principle to the MOVES execution database). These are the procedure inputs, so calling the procedure looks like so:

```sql
CALL speciation_outside_moves_collected.speciate('db_results_inv_2018_20210412_batch0001_c34035_2018_7_other', 'speciate_working');
```

### Using the Script

#### Option 1: In the SQL script itself

The procedure can be called in the SQL script itself, after it is defined, on any number of databases. This probably requires editing the script that defines the procedure, which I'd rather not do every time the procedure needs to be run on a new set of databases.

#### Option 2: Create a second SQL script

A second SQL script can be created, where each line is a procedure call on an output database. For example, we could create a processor script "processNonHAPTOG.sql" like so:

```sql
CALL speciation_outside_moves_collected.speciate('db_results_inv_2018_20210412_batch0001_c34035_2018_10_start', 'speciate_working');
CALL speciation_outside_moves_collected.speciate('db_results_inv_2018_20210412_batch0001_c34035_2018_11_start', 'speciate_working');
CALL speciation_outside_moves_collected.speciate('db_results_inv_2018_20210412_batch0001_c34035_2018_12_start', 'speciate_working');
CALL speciation_outside_moves_collected.speciate('db_results_inv_2018_20210412_batch0001_c34035_2018_1_other', 'speciate_working');
-- more lines for other databases here
```

Then we can source the 2 SQL scripts, in order, to do the processing:

```batch
mysql --user=moves --password=moves
MariaDB [(none)]> source speciationProcedure.sql;
MariaDB [(none)]> source processNonHAPTOG.sql;
```

Alternatively, we could combine them into one:

```sql
source speciationProcedure.sql
CALL speciation_outside_moves_collected.speciate('db_results_inv_2018_20210412_batch0001_c34035_2018_10_start', 'speciate_working');
-- more lines for other databases here
```

#### Option 3: Batch/Bash script

Another option is to do everything via batch using the `--execute` flag. The first step is to source the script, followed by any number of calls to the procedure.  

```batch
mysql --user=moves --password=moves --execute="source speciationProcedure.sql;"
mysql --user=moves --password=moves --execute="CALL speciation_outside_moves_collected.speciate('db_results_inv_2018_20210412_batch0001_c34035_2018_evp', 'speciate_working');"
REM more lines for other databases here (in Bash, use # instead of REM)
```

#### Option 4: Advanced scripting

This option requires more advanced scripting skill. Once the SQL file with the procedure is sourced, it's possible to call the procedure in a more programmatic way. A script in almost any language (R, Python, Perl, Go, Java, etc.) can do the following:

1. Source the SQL script "speciationProcedure.sql"
2. Take advantage of the naming convention for the output databases and list the databases that need to be run through the procedure
3. For each database, call the procedure

This approach also gives the flexibility to access the collector database after calling the procedure, so that the tables can be written to file, modified if necessary, or used in other operations.
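The database-listing step can also be done in SQL. The sketch below generates one `CALL` statement per output database by querying `information_schema.schemata`. The prefix `db_results_inv_2018_` is an assumption inferred from the example database names in this document, so adjust the pattern to the naming convention actually in use.

```sql
-- Emits one CALL statement per database whose name matches the assumed
-- naming convention. Underscores are escaped because "_" is a
-- single-character wildcard in LIKE patterns.
SELECT CONCAT('CALL speciation_outside_moves_collected.speciate(''',
              schema_name, ''', ''speciate_working'');') AS call_statement
FROM information_schema.schemata
WHERE schema_name LIKE 'db\_results\_inv\_2018\_%'
ORDER BY schema_name;
```

The resulting rows can be saved to a file (for example, by running the query through the `mysql` client with `--batch --skip-column-names` and redirecting the output) and then sourced like the processor script in Option 2.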

## Procedure Design

The procedure itself has 8 steps. For each one, design decisions and assumptions are noted. Appendix 1 contains a list of these assumptions, so that they can be easily revisited later on.

### Step 1: Prepare the collector database

This step creates all the required tables in the collector database if they do not already exist. The database has 6 tables. The first is a base output table, which contains the columns needed for SMOKE-MOVES, plus monthID:

```sql
CREATE TABLE IF NOT EXISTS speciation_outside_moves_collected.base_schema (
    monthID 				SMALLINT(6),
    SMOKE_SCC 				VARCHAR(10),
    togSpeciationProfileID  VARCHAR(10),
    pollutantID 			SMALLINT(6),
    pollutantName 			VARCHAR(50),
    SMOKE_mode 				VARCHAR(20),
    countyID 				INT(11),
    ratio 					DOUBLE,
    PRIMARY KEY (monthID, SMOKE_SCC, togSpeciationProfileID, pollutantID, pollutantName, SMOKE_mode,countyID)
);
```

This `base_schema` table will not contain data. Instead, 4 tables with the same schema are created, one for each SMOKE "mode". That way, changing the schema in the future requires changing one table definition rather than 4. These 4 tables map to the 4 csv files created by Claudia's R script.

```sql
CREATE TABLE IF NOT EXISTS speciation_outside_moves_collected.exh_nhtog LIKE speciation_outside_moves_collected.base_schema;
CREATE TABLE IF NOT EXISTS speciation_outside_moves_collected.epm_nhtog LIKE speciation_outside_moves_collected.base_schema;
CREATE TABLE IF NOT EXISTS speciation_outside_moves_collected.evp_nhtog LIKE speciation_outside_moves_collected.base_schema;
CREATE TABLE IF NOT EXISTS speciation_outside_moves_collected.rfl_nhtog LIKE speciation_outside_moves_collected.base_schema;
```

The final table created is a SMOKE-MOVES mapping table, which maps combinations of MOVES process and road type to SMOKE processes. This is a utility table that's useful when speciating the output for SMOKE-MOVES. The data that populates this table is hardcoded in the script, so that there's no need to move around a csv or other text file in addition to the SQL file. Hardcoding is also acceptable in this case because the mappings will not change very often.

```sql
CREATE TABLE IF NOT EXISTS speciation_outside_moves_collected.SMOKE_MOVES_mapping (
    processID 		SMALLINT(6),
    processName 	VARCHAR(50),
    roadTypeID 		SMALLINT(6),
    rateTable 		VARCHAR(10),
    SMOKE_process 	SMALLINT(6),
    SMOKE_mode 		VARCHAR(20),
    PRIMARY KEY (processID, processName, roadTypeID, rateTable, SMOKE_process, SMOKE_mode)
);
```

The mapping table can be seen in Appendix 2.

### Step 2: Drop and create working database

This step is fairly straightforward. It's worth noting, however, that the working database is *not* dropped after the script runs. Like the MOVES execution database, keeping it around after the script runs can be helpful for debugging.
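Because the working database name is a procedure input, the drop and create statements have to be built as strings. A minimal sketch of how that might look is below; the session variable `@workingDB` stands in for the procedure input, and the actual script may do this differently.

```sql
-- @workingDB stands in for the procedure's working database input.
SET @workingDB = 'speciate_working';

SET @sql = CONCAT('DROP DATABASE IF EXISTS `', @workingDB, '`');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

SET @sql = CONCAT('CREATE DATABASE `', @workingDB, '`');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;
```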

### Step 3: Read MOVES output metadata from output database

This step reads the default database used, the county database (CDB) used, the county that was run, and the year. The databases and county are read from the `movesrun` table, while the year comes from the county database's `year` table.

The script assumes that each output database has only one MOVES run in it, so it only looks for one combination of default database, county database, and year. **If there are multiple combinations in the output database, only the first will be processed.**
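As a rough sketch of this step, the metadata could be read into session variables as shown below. The column names `defaultDatabaseUsed`, `domainDatabaseName`, and `domainCountyID` reflect my understanding of the MOVES `movesrun` schema and should be checked against the MOVES version in use. The output database name is one of the examples used earlier in this document.

```sql
-- Take the first run only, mirroring the one-run-per-database assumption.
SELECT defaultDatabaseUsed, domainDatabaseName, domainCountyID
INTO @defaultDB, @countyDB, @countyID
FROM db_results_inv_2018_20210412_batch0001_c34035_2018_7_other.movesrun
ORDER BY MOVESRunID
LIMIT 1;

-- The county database name is only known at run time, so the year lookup
-- is built as a string and executed as a prepared statement.
SET @sql = CONCAT('SELECT yearID INTO @yearID FROM `', @countyDB, '`.`year` LIMIT 1');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;

-- Show what was found.
SELECT @defaultDB, @countyDB, @countyID, @yearID;
```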

### Step 4: Get prerequisite data from the default database

There are two areas of concern for this step: 

1. The county's fuel mix is determined using the county database's `fuelsupply` and `fuelformulation` tables, along with the default database's `regioncounty` and `fuelsubtype` tables.
2. The speciation profiles are obtained from the default database. 
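The fuel mix query in item 1 might look roughly like the sketch below. The database names `movesdb_default` and `county_cdb` are placeholders, the county is taken from the example database names in this document, and the join columns (including the `regionCodeID = 1` filter for fuel regions) reflect my understanding of the MOVES schema rather than the actual script.

```sql
-- movesdb_default and county_cdb are placeholder database names.
SELECT fst.fuelTypeID,
       ff.fuelSubtypeID,
       fs.monthGroupID,
       SUM(fs.marketShare) AS marketShare
FROM county_cdb.fuelsupply AS fs
JOIN county_cdb.fuelformulation AS ff
  ON ff.fuelFormulationID = fs.fuelFormulationID
JOIN movesdb_default.fuelsubtype AS fst
  ON fst.fuelSubtypeID = ff.fuelSubtypeID
JOIN movesdb_default.regioncounty AS rc
  ON rc.regionID = fs.fuelRegionID
 AND rc.fuelYearID = fs.fuelYearID
WHERE rc.countyID = 34035
  AND rc.regionCodeID = 1  -- assumed to denote fuel regions
GROUP BY fst.fuelTypeID, ff.fuelSubtypeID, fs.monthGroupID;
```

Summing `marketShare` collapses multiple fuel formulations that share a fuel subtype into a single market share per subtype and month group.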

### Step 5: Get NonHAPTOG emissions from output database

This step is straightforward. It selects all NonHAPTOG (pollutantID 88) rows from the `movesoutput` table, keeping only the relevant columns and skipping others, such as the nonroad columns `hpID` and `sectorID`.
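A sketch of this selection is shown below. The target table name `nonhaptog_output` is hypothetical, and the choice of which columns count as relevant is my assumption, not necessarily the list used by the script.

```sql
-- nonhaptog_output is a hypothetical name for the intermediate table.
CREATE TABLE speciate_working.nonhaptog_output AS
SELECT MOVESRunID, yearID, monthID, countyID, roadTypeID, SCC,
       pollutantID, processID, sourceTypeID, regClassID, fuelTypeID,
       modelYearID, emissionQuant
FROM db_results_inv_2018_20210412_batch0001_c34035_2018_7_other.movesoutput
WHERE pollutantID = 88;  -- NonHAPTOG
```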

### Step 6: Assign NonHAPTOG emissions to speciation profiles

First, the MOVES output emissions are split by fuelSubtypeID using the county's fuel mix and market shares. Then the emissions are assigned to their speciation profile according to process, fuel subtype, and regulatory class. The final step is to convert the MOVES SCCs to SMOKE SCCs and aggregate accordingly. 

This produces an intermediate table called `nonhaptog_speciated`.
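The first part of this step, splitting emissions by fuel subtype, can be illustrated with a toy example. The table names, the reduced column sets, and the values below are all made up for illustration; the real tables carry more columns.

```sql
-- Toy inputs: one emissions row and a two-subtype fuel mix.
CREATE TEMPORARY TABLE demo_output (
    monthID SMALLINT, fuelTypeID SMALLINT, processID SMALLINT, emissionQuant DOUBLE
);
CREATE TEMPORARY TABLE demo_fuel_mix (
    monthID SMALLINT, fuelTypeID SMALLINT, fuelSubtypeID SMALLINT, marketShare DOUBLE
);
INSERT INTO demo_output VALUES (1, 1, 1, 100.0);
INSERT INTO demo_fuel_mix VALUES (1, 1, 10, 0.25), (1, 1, 12, 0.75);

-- Each emissions row is repeated once per fuel subtype and scaled by
-- that subtype's market share.
SELECT o.monthID, o.processID, f.fuelSubtypeID,
       o.emissionQuant * f.marketShare AS emissionQuant
FROM demo_output AS o
JOIN demo_fuel_mix AS f
  ON f.monthID = o.monthID
 AND f.fuelTypeID = o.fuelTypeID;
-- The 100 units are split into 25 (subtype 10) and 75 (subtype 12).
```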

### Step 7: Write the output to the collector database

In this step, the final intermediate table is broken out by "SMOKE mode" (exhaust, permeation, evap, and refueling), and each subset is inserted into its corresponding table in the collector database. During this process, the raw emissions output is normalized to give a weight to each profile within an SCC.
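The normalization amounts to dividing each profile's emissions by the total for its SCC. The toy example below sketches that idea with a window function, which requires MariaDB 10.2+ or MySQL 8.0+. The table name, the profile IDs, the values, and the choice to normalize within month, county, and SCC are assumptions for illustration.

```sql
-- Toy input: two profiles sharing one SCC.
CREATE TEMPORARY TABLE demo_speciated (
    monthID SMALLINT, countyID INT, SMOKE_SCC VARCHAR(10),
    togSpeciationProfileID VARCHAR(10), emissionQuant DOUBLE
);
INSERT INTO demo_speciated VALUES
    (1, 34035, 'SCC_A', 'P1', 30.0),
    (1, 34035, 'SCC_A', 'P2', 10.0);

-- ratio = a profile's emissions divided by the total for its group.
SELECT monthID, countyID, SMOKE_SCC, togSpeciationProfileID,
       emissionQuant / SUM(emissionQuant)
           OVER (PARTITION BY monthID, countyID, SMOKE_SCC) AS ratio
FROM demo_speciated;
-- P1 gets a ratio of 0.75 and P2 gets 0.25.
```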

The collector tables have a strict primary key (everything but the assigned ratio is in the key), so if multiple MOVES runs with overlapping emissions are gathered into the same collector database, the procedure will generate an error for having duplicate primary keys. This can be changed, if needed.

### Step 8: Adjust factors as necessary

This is the final step, which updates the collector database tables as necessary. Right now, there are 2 adjustments:

1. All non-January CNG factors need to be changed to match the January factors.
2. For years before 2010, all non-January factors for diesel source types 51, 61, and 62 need to be changed to match the January factors.
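The first adjustment could be sketched as a self-joined `UPDATE`, shown here for the exhaust table only. This is an illustration under two assumptions that should be verified: that the fuel type can be read from characters 3-4 of `SMOKE_SCC` (with `'03'` meaning CNG), and that every non-January row has a matching January row with the same profile.

```sql
-- Copy January ratios onto the matching non-January rows for CNG.
-- Assumption: characters 3-4 of SMOKE_SCC hold the fuel type ('03' = CNG).
UPDATE speciation_outside_moves_collected.exh_nhtog AS t
JOIN speciation_outside_moves_collected.exh_nhtog AS jan
  ON jan.monthID = 1
 AND jan.SMOKE_SCC = t.SMOKE_SCC
 AND jan.togSpeciationProfileID = t.togSpeciationProfileID
 AND jan.pollutantID = t.pollutantID
 AND jan.SMOKE_mode = t.SMOKE_mode
 AND jan.countyID = t.countyID
SET t.ratio = jan.ratio
WHERE t.monthID <> 1
  AND SUBSTRING(t.SMOKE_SCC, 3, 2) = '03';
```

Rows whose profile does not appear in January are left untouched by this sketch, so a delete-and-reinsert approach may be needed if profiles differ between months. The same statement would be repeated for the other collector tables, and the diesel adjustment would follow the same pattern with a different filter.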

## Appendix 1: List of Assumptions and Design Decisions

The following list of assumptions and design decisions can be revisited and changed at any time. 

- The collector database name is hardcoded. Because SQL procedures must be stored in a database, I don't see a way around this.
- It is assumed that each output database will only have data from one MOVES run. 
- Likewise, it is assumed that the default database is available in the same MySQL instance as the output databases.
- The result of the procedure is a database with a table for each "SMOKE mode". From here, it is easy to write them to a file, save them in another location, or do other post-processing. This can be done in the script or as part of a follow-on script, depending on which is most convenient.
- The collector database is dropped and created every time the script with the procedure is called. Therefore, the procedure doesn't clean up after itself when it's done, although it very easily can.
- The SMOKE-MOVES SCC mapping is hardcoded into the procedure, but can be tracked in a csv file if necessary.
- The script writes the final output to the collector database using `INSERT INTO`, so that duplicate keys raise an error instead of being written. This assumes that none of the output databases have overlapping output (for example, if 2 runs have ONI activity). If this assumption isn't safe, either `INSERT IGNORE INTO` (which keeps the first data written and ignores the rest) or `REPLACE INTO` (which overwrites previously existing data if necessary) can be used instead.
- The script only writes data for months that exist in the MOVES output databases. This can be converted to include all 12 months, if necessary.
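The difference between the three insert variants mentioned in the list above can be seen with a small toy table. The table name and values are made up for illustration, and the key is shortened to two columns.

```sql
CREATE TEMPORARY TABLE demo_nhtog (
    monthID SMALLINT,
    SMOKE_SCC VARCHAR(10),
    ratio DOUBLE,
    PRIMARY KEY (monthID, SMOKE_SCC)
);

INSERT INTO demo_nhtog VALUES (1, 'SCC_A', 0.25);

-- A plain INSERT with the same key would stop with a duplicate-key error:
-- INSERT INTO demo_nhtog VALUES (1, 'SCC_A', 0.50);

-- INSERT IGNORE skips the duplicate, so the ratio stays at 0.25.
INSERT IGNORE INTO demo_nhtog VALUES (1, 'SCC_A', 0.50);

-- REPLACE deletes the existing row and writes the new one, so the
-- ratio becomes 0.75.
REPLACE INTO demo_nhtog VALUES (1, 'SCC_A', 0.75);

SELECT * FROM demo_nhtog;
```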

## Appendix 2: SMOKE-MOVES SCC Mapping

| processID | processName                       | roadTypeID | rateTable | SMOKE_process | SMOKE_mode |
| --------- | --------------------------------- | ---------- | --------- | ------------- | ---------- |
| 1         | Running Exhaust                   | 1          | RPHO | 92            | EXH_NHTOG  |
| 1         | Running Exhaust                   | 2          | RPD  | 72            | EXH_NHTOG  |
| 1         | Running Exhaust                   | 3          | RPD  | 72            | EXH_NHTOG  |
| 1         | Running Exhaust                   | 4          | RPD  | 72            | EXH_NHTOG  |
| 1         | Running Exhaust                   | 5          | RPD  | 72            | EXH_NHTOG  |
| 2         | Start Exhaust                     | 1          | RPS  | 72            | EXH_NHTOG  |
| 11        | Evap Permeation                   | 1          | RPV  | 72            | EPM_NHTOG  |
| 11        | Evap Permeation                   | 2          | RPD  | 72            | EPM_NHTOG  |
| 11        | Evap Permeation                   | 3          | RPD  | 72            | EPM_NHTOG  |
| 11        | Evap Permeation                   | 4          | RPD  | 72            | EPM_NHTOG  |
| 11        | Evap Permeation                   | 5          | RPD  | 72            | EPM_NHTOG  |
| 12        | Evap Fuel Vapor Venting           | 1          | RPP  | 72            | EVP_NHTOG  |
| 13        | Evap Fuel Leaks                   | 1          | RPV  | 72            | EVP_NHTOG  |
| 13        | Evap Fuel Leaks                   | 2          | RPD  | 72            | EVP_NHTOG  |
| 13        | Evap Fuel Leaks                   | 3          | RPD  | 72            | EVP_NHTOG  |
| 13        | Evap Fuel Leaks                   | 4          | RPD  | 72            | EVP_NHTOG  |
| 13        | Evap Fuel Leaks                   | 5          | RPD  | 72            | EVP_NHTOG  |
| 15        | Crankcase Running Exhaust         | 2          | RPD  | 72            | EXH_NHTOG  |
| 15        | Crankcase Running Exhaust         | 3          | RPD  | 72            | EXH_NHTOG  |
| 15        | Crankcase Running Exhaust         | 4          | RPD  | 72            | EXH_NHTOG  |
| 15        | Crankcase Running Exhaust         | 5          | RPD  | 72            | EXH_NHTOG  |
| 15        | Crankcase Running Exhaust         | 1          | RPHO | 92            | EXH_NHTOG  |
| 16        | Crankcase Start Exhaust           | 1          | RPS  | 72            | EXH_NHTOG  |
| 17        | Crankcase Extended Idle Exhaust   | 1          | RPH  | 53            | EXH_NHTOG  |
| 18        | Refueling Displacement Vapor Loss | 1          | RPD  | 62            | RFL_NHTOG  |
| 18        | Refueling Displacement Vapor Loss | 2          | RPD  | 62            | RFL_NHTOG  |
| 18        | Refueling Displacement Vapor Loss | 3          | RPD  | 62            | RFL_NHTOG  |
| 18        | Refueling Displacement Vapor Loss | 4          | RPD  | 62            | RFL_NHTOG  |
| 18        | Refueling Displacement Vapor Loss | 5          | RPD  | 62            | RFL_NHTOG  |
| 19        | Refueling Spillage Loss           | 1          | RPD  | 62            | RFL_NHTOG  |
| 19        | Refueling Spillage Loss           | 2          | RPD  | 62            | RFL_NHTOG  |
| 19        | Refueling Spillage Loss           | 3          | RPD  | 62            | RFL_NHTOG  |
| 19        | Refueling Spillage Loss           | 4          | RPD  | 62            | RFL_NHTOG  |
| 19        | Refueling Spillage Loss           | 5          | RPD  | 62            | RFL_NHTOG  |
| 90        | Extended Idle Exhaust             | 1          | RPH  | 53            | EXH_NHTOG  |
| 91        | Auxiliary Power Exhaust           | 1          | RPH  | 91            | EXH_NHTOG  |