Harmonizing Product Codes with R
Christoph Baumgartner
Stjepan Srhoj
Janette Walde
Abstract
Innovation is a major engine of economic growth. To compare products over time, harmonization of product codes is mandatory. This package provides an 
easy-to-use approach to harmonize product codes. Moreover, it offers an application that allows finding all new and dropped products for given firm-level data based on harmonized product codes.
 
This package provides several functions to harmonize CN8 product codes (Combined Nomenclature 8 digits) as well as PC8 product codes (Production Communautaire 8 digits), HS6 (Harmonized System 6 digits) and BEC (Broad Economic Categories). All functions are listed below:
Main Functions
harmonize_cn8() 
provides for a given time period a data frame that contains all CN8 product codes and their history, harmonized CN8plus codes, harmonized HS6plus codes, and BEC classification. The “plus-codes” are the main outcome of the function. They provide harmonized information of the product codes, i.e. comparable codes. Every harmonization refers to the last year of interest. The following table offers an overview of all provided variables.
| CN8_xxxx | a specific CN8 code in a given year | 
| CN8plus | the harmonization code for CN8, which refers to the last year of the time period | 
| HS6plus | the harmonization code of HS6, which refers to the last year of the time period | 
| BEC | provides the BEC classification at a high aggregation level (1 digit) | 
| BEC_agr | provides the BEC classification at a lower aggregation level (up to 3 digits) | 
| BEC_basic_class | provides information if the code is classified as consumption, capital or intermediate good in BEC | 
| flag | either 0 or 1; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition | 
| flagyear | indicates the first year in which the flag was set to 1 | 
For more application details, see ?harmonize_cn8.
harmonize_pc8() 
provides for a given time period a data frame that contains all PC8 product codes and their history, harmonized PC8plus codes, harmonized HS6plus codes, and BEC classification. The “plus-codes” are the main outcome of the function. They provide harmonized information of the product codes, i.e. comparable codes. Every harmonization refers to the last year of interest. The following table offers an overview of all provided variables.
| PC8_xxxx | a specific PC8 code in a given year | 
| PC8plus | the harmonization code for PC8, which refers to the last year of the time period | 
| HS6plus | the harmonization code of HS6, which refers to the last year of the time period | 
| BEC | provides the BEC classification at a high aggregation level (1 digit) | 
| BEC_agr | provides the BEC classification at a lower aggregation level (up to 3 digits) | 
| BEC_basic_class | provides information if the code is classified as consumption, capital or intermediate good in BEC | 
| flag | either 0 or 1; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition | 
| flagyear | indicates the first year in which the flag was set to 1 | 
For more application details, see ?harmonize_pc8.
 
Support Functions
All support functions are used within the main functions. They provide intermediate steps to harmonize the data. However, they can be used as stand-alone functions as well.
history_cn8() 
provides a data frame that contains all CN8 product codes and their history over time for the requested time period. This data set is the basis for the main function harmonize_cn8() and can also be obtained through it. The following table offers an overview of all provided variables.
| CN8_xxxx | a specific CN8 code in a given year | 
| flag | either 0 or 1; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition | 
| flagyear | indicates the first year in which the flag was set to 1 | 
For more application details, see ?history_cn8.
history_pc8() 
provides a data frame that contains all PC8 product codes and their history over time for the requested time period. This data set is the basis for the main function harmonize_pc8() and can also be obtained through it. The following table offers an overview of all provided variables.
| PC8_xxxx | a specific PC8 code in a given year | 
| flag | either 0 or 1; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition | 
| flagyear | indicates the first year in which the flag was set to 1 | 
For more application details, see ?history_pc8.
cn8_to_bec() 
provides a data frame that contains all CN8 product codes and related BEC and HS6 codes in a given time period. Therefore, this data serves as a connection between CN8 and BEC classification and between CN8 and HS6 classification. It forms the basis of some output of the main function, namely: BEC, BEC_agr, BEC_basic_class and HS6plus. The following table offers an overview of all provided variables.
| CN8 | a specific CN8 code | 
| HS6 | provides the HS6 classification of the CN8plus code | 
| BEC | provides the BEC classification on a high aggregation level (1 digit) | 
| BEC_agr | provides the BEC classification on a lower aggregation level (up to 3 digits) | 
For more application details, see ?cn8_to_bec.
pc8_to_bec() 
provides a data frame that contains all PC8 product codes and related BEC and HS6 codes in a given time period. Therefore, this data serves as a connection between PC8 and BEC classification and between PC8 and HS6 classification. It forms the basis of some output of the main function, namely: BEC, BEC_agr, BEC_basic_class and HS6plus. The following table offers an overview of all provided variables.
| PC8_xxxx | a specific PC8 code | 
| HS6 | provides the HS6 classification of the PC8plus code | 
| BEC | provides the BEC classification on a high aggregation level (1 digit) | 
| BEC_agr | provides the BEC classification on a lower aggregation level (up to 3 digits) | 
For more application details, see ?pc8_to_bec.
get_data_directory() 
provides the directory where custom data must be stored and where the data in use (e.g., concordance lists, lists of codes) can be edited. However, before editing this data or adding further concordance lists, it is highly recommended to first read the instructions in this vignette carefully (see also the sections Data Sets and Custom Data). The directory is printed in the R console. Further features (such as opening a file explorer or printing the available data in the console) are only executable if the directory path does not contain any blanks.
For more application details, see ?get_data_directory.
 
Additional Functions
These functions go beyond the primary purpose of this package. They provide an application of the data frames obtained by the main functions. To use them, firm-level data is required, which the package does not provide. The firm-level data must contain columns with the following names: ID, year and CN8 or PC8. Other columns may exist; however, they will not be used by the functions. The following table summarizes the variables that need to be included in the firm-level data.
| ID | specific code that describes a firm over the years (this code does not change over time) | 
| year | year in which the firm produced a product | 
| CN8 | CN8 code of firm product | 
| PC8 | PC8 code of firm product | 
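As a minimal sketch of the expected input, the following base R snippet builds a small firm-level data frame with the required columns and checks that they are present. The firm IDs, years and product codes are invented for illustration, and the helper has_required_columns() is hypothetical and not part of the package.

```r
# Toy firm-level data; all values are invented for illustration.
# Product codes are stored as character so that leading zeros survive.
firm_data <- data.frame(
  ID   = c("F1", "F1", "F1", "F2"),
  year = c(2008, 2008, 2009, 2009),
  CN8  = c("01011100", "01011910", "01011100", "01021010"),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not a package function): checks that the column
# names described in the table above exist for a given classification.
has_required_columns <- function(data, code_column = "CN8") {
  all(c("ID", "year", code_column) %in% names(data))
}

has_required_columns(firm_data, "CN8")  # columns for the CN8 case
has_required_columns(firm_data, "PC8")  # this toy data has no PC8 column
```

Running such a check before calling the additional functions may help to spot misspelled column names early.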
  utilize_cn8() 
may provide two data frames:
a. A data frame that contains all changed CN8 product codes per firm per year. In more detail, this means how many products remained the same, were added, were dropped, how many products were produced by a certain firm in a given year, and how many products were produced in the year after.
b. A data frame that is based on the entered firm data. The entered firm data is extended by harmonized data (that is CN8plus, flag, flagyear, HS6plus, BEC, BEC_agr, BEC_basic_class).
The tables at the end of this section offer an overview of all provided variables.
  utilize_pc8() 
may provide two data frames:
a. A data frame that contains all changed PC8 product codes per firm per year. In more detail, this means how many products remained the same, were added or were dropped, the value of the same, added and dropped products, how many products were produced by a certain firm in a given year, and how many products were produced in the year after.
b. A data frame that is based on the entered firm data. The entered firm data is extended by harmonized data (that is PC8plus, flag, flagyear, HS6plus, BEC, BEC_agr, BEC_basic_class).
The tables at the end of this section offer an overview of all provided variables.
Since the provided data frames do not differ between utilize_cn8() and utilize_pc8(), in terms of notation, the tables are only provided once here.
Table that summarizes the output, described by the notation a. above:
| firmID | specific code that describes a firm over the years (this code does not change over time) | 
| period_UL | upper limit of the time period | 
| period | time period in which the product was produced | 
| gap | indicating if the time period is greater than one (i.e. upper limit - lower limit > 1) | 
| same_products | number of products that were produced in both years (i.e. remained in the product portfolio of this firm) | 
| value_same_products | value of products that were produced in both years (i.e. remained in the product portfolio of this firm); the value is calculated in the upper limit of the time period | 
| new_products | number of added products in the upper limit of the time period (i.e. added to the product portfolio of this firm) | 
| value_new_products | value of added products in the upper limit of the time period (i.e. added to the product portfolio of this firm) | 
| dropped_products | number of dropped products in the upper limit of the time period (i.e. removed from the product portfolio of this firm) | 
| value_dropped_products | value of dropped products in the upper limit of the time period (i.e. removed from the product portfolio of this firm); the value is calculated in the lower limit of the time period | 
| nbr_of_products_period_LL | number of all products produced in the lower limit of the time period (i.e. entire product portfolio of this firm) | 
| nbr_of_products_period_UL | number of all products produced in the upper limit of the time period (i.e. entire product portfolio of this firm) | 
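The counts and values in this table can be illustrated with plain set operations. The following sketch is not the package's implementation; it assumes that the codes of both years have already been harmonized, and the portfolios and values are invented.

```r
# Toy portfolios of one firm in the lower and upper limit of a period.
# Codes and values are invented; codes are assumed to be harmonized already.
lower <- data.frame(code  = c("01011100", "01011910", "01021010"),
                    value = c(100, 50, 30),
                    stringsAsFactors = FALSE)
upper <- data.frame(code  = c("01011100", "01021010", "01012010", "01012090"),
                    value = c(120, 40, 10, 5),
                    stringsAsFactors = FALSE)

same    <- intersect(lower$code, upper$code)  # produced in both years
added   <- setdiff(upper$code, lower$code)    # only in the upper limit
dropped <- setdiff(lower$code, upper$code)    # only in the lower limit

changes <- data.frame(
  same_products            = length(same),
  value_same_products      = sum(upper$value[upper$code %in% same]),
  new_products             = length(added),
  value_new_products       = sum(upper$value[upper$code %in% added]),
  dropped_products         = length(dropped),
  # dropped products no longer exist in the upper limit, so their value
  # is taken from the lower limit
  value_dropped_products   = sum(lower$value[lower$code %in% dropped]),
  nbr_of_products_period_LL = nrow(lower),
  nbr_of_products_period_UL = nrow(upper)
)
changes
```

Without harmonized codes, a product whose code was merely renamed would wrongly appear once as dropped and once as new, which is why the comparison relies on the plus-codes.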
Table that summarizes the output, described by the notation b. above:
| firmID | specific code that describes a firm over the years (this code does not change over time, provided by user) | 
| year | year in which the firm produced a product (provided by user) | 
| CN8 | CN8 code of firm product (provided by user) | 
| PC8 | PC8 code of firm product (provided by user) | 
| (value) | value of the corresponding product code (may be provided by user) | 
| … | additional columns from original firm data (provided by user) | 
| CN8plus | final harmonization, which refers to the last year of the time period | 
| PC8plus | final harmonization, which refers to the last year of the time period | 
| flag | either 0 or 1; 1 indicates that this code remained the same in notation over the whole time period but was split or merged in addition | 
| HS6 | provides the HS6 classification of the PC8plus / CN8plus code | 
| HS6plus | also adjusts for the change lists of HS6 | 
| BEC | provides the BEC classification on a high aggregated level (1 digit) | 
| BEC_agr | provides the BEC classification on a less aggregated level (up to 3 digits) | 
| BEC_basic_class | provides information if the code is classified as consumption, capital or intermediate good in BEC | 
 
Data Sets
By default, the package provides several data sets for the CN8, PC8, HS6 and BEC classifications. This data allows for harmonization of CN8 product codes between 1995 and 2022 and PC8 product codes between 2007 and 2017. All available data sets are stored within the package. The function get_data_directory() provides support to access the data more easily. All data included in the package was originally downloaded from the EU server Ramon and altered if needed.
Provided data in more detail:
- CN8 data is provided in the corresponding CN8 folder. This folder contains two different types of files. Firstly, a list of all existing CN8 codes for every year, e.g. CN8_2000.rds for the year 2000. More technically speaking, these files provide a data frame with one column and n rows, where n is the number of existing CN8 codes in a given year. An example (first six lines) for the year 2000 is the following:
     group
1 01011100
2 01011910
3 01011990
4 01012010
5 01012090
6 01021010
- Secondly, the CN8 folder contains a concordance list of all CN8 codes over time, a .csv file, where the separator is a semicolon, i.e. “;”. A header is necessary. The header names must be the following: from, to, obsolete, new. The period between “from” and “to” is always one year and describes when the code changed. The “obsolete” and “new” codes represent the outdated code and the replacement, respectively. The first six lines of the default csv-file look like the following:
from;to;obsolete;new
1988;1989;02012011;02012021
1988;1989;02012011;02012029
1988;1989;02012019;02012029
1988;1989;03036010;03036011
1988;1989;03036010;03036019
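A file with this layout can be read in base R as sketched below. The snippet reads the lines shown above from a text string instead of the actual file, and it keeps every column as character so that leading zeros are preserved. How the package itself reads the file is not shown here; the split and merge checks are only an illustration of what the concordance expresses.

```r
# The first lines of the concordance list, as shown above.
concordance_text <- "from;to;obsolete;new
1988;1989;02012011;02012021
1988;1989;02012011;02012029
1988;1989;02012019;02012029
1988;1989;03036010;03036011
1988;1989;03036010;03036019"

# colClasses = "character" prevents R from turning codes into numbers,
# which would drop the leading zeros.
concordance <- read.csv(text = concordance_text, sep = ";",
                        colClasses = "character")

# An obsolete code listed more than once was split into several new codes.
split_codes <- names(which(table(concordance$obsolete) > 1))

# A new code listed more than once absorbed several obsolete codes.
merged_codes <- names(which(table(concordance$new) > 1))

split_codes
merged_codes
```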
 
- PC8 data is provided in the corresponding PC8 folder. This folder contains three different types of files. Firstly, a list of all existing PC8 codes for every year, e.g. PC8_2010.rds for the year 2010. More technically speaking, these files provide a data frame with one column and n rows, where n is the number of existing PC8 codes in a given year. An example (first six lines) for the year 2010 is the following:
      2010
1 07101000
2 07291100
3 07291200
4 07291300
5 07291400
6 07291500
- Secondly, a concordance between consecutive years is necessary. These files contain two years in their filenames, with a period of one year in between, e.g. between 2010 and 2011 this results in PC8_2010_2011.rds. More technically speaking, these files are data frames with two columns, which must be named “new” and “old”, and n rows, where n is the number of changes in a given year. An example (first six lines) of the changes between 2010 and 2011 is the following:
       new      old
1 07101000 07101000
2 07291100 07291100
3 07291200 07291200
4 07291300 07291300
5 07291400 07291400
6 07291500 07291500
- Thirdly, the PC8 folder contains concordance lists between the PC8 and CN8 classifications for every year. This data is needed to translate PC8 into BEC. An example for the year 2010 would be PC8_CN8_2010.rds. Technically, each yearly file is a data frame with two columns, named “PRCCODE” for PC8 codes and “CNCODE” for CN8 codes, and n rows, where n is the number of concordances between specific codes. However, for some PC8 codes there may be no corresponding CN8 code. In this case, the missing value is filled with NA. Some examples from the associated file for the year 2010 can be found below:
   PRCCODE CNCODE
1 10131430   <NA>
2 10139100   <NA>
3 10399100   <NA>
4 13301110   <NA>
5 13301121   <NA>
6 13301122   <NA>
     PRCCODE   CNCODE
2400 8111136 25151200
2401 8111150 25152000
2402 8111233 25161100
2403 8111236 25161200
2404 8111250 25162000
2405 8111290 25169000
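The following sketch shows one way to separate PC8 codes with and without a CN8 counterpart in such a data frame. It uses a few of the rows printed above as toy data and is not taken from the package.

```r
# A few rows as printed above, rebuilt by hand for illustration.
pc8_cn8 <- data.frame(
  PRCCODE = c("10131430", "10139100", "8111136", "8111150"),
  CNCODE  = c(NA, NA, "25151200", "25152000"),
  stringsAsFactors = FALSE
)

# PC8 codes for which no CN8 concordance is available.
unmatched <- pc8_cn8$PRCCODE[is.na(pc8_cn8$CNCODE)]

# Rows that can be used to translate PC8 into CN8.
matched <- pc8_cn8[!is.na(pc8_cn8$CNCODE), ]

unmatched
matched
```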
 
- HS6 data is provided in the corresponding HS6 folder. This folder contains only one type of file: concordance lists that track the changes of HS6 codes over time. Those changes happened in several years: 1992, 1996, 2002, 2007, 2012 and 2017. For every period, a separate concordance list is necessary. The data is provided as csv-files, where the separator is a semicolon, i.e. “;”, and the filenames contain both years. For example, for the change between 1992 and 1996, the file is called HS_1996_to_HS_1992.csv. Headers are also included in this file. For this specific case, they are “HS 1996” and “HS 1992”. For other periods the headers change accordingly. An example (first six lines) of the changes between 1996 and 1992 is the following:
HS 1996;HS 1992
10111;10111
10119;10119
10120;10120
10210;10210
10290;10290
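CSV files saved by some spreadsheet programs begin with a UTF-8 byte order mark, which can end up attached to the first header name when the file is read. The sketch below writes a small temporary file with such a mark and reads it back cleanly in base R. It only illustrates the reading step for custom files; whether the package expects or tolerates the mark is not stated in this vignette.

```r
# Write a small semicolon-separated file that starts with a UTF-8 byte
# order mark (bytes EF BB BF), mimicking a spreadsheet export.
tmp <- tempfile(fileext = ".csv")
writeBin(
  c(as.raw(c(0xEF, 0xBB, 0xBF)),
    charToRaw("HS 1996;HS 1992\n10111;10111\n10119;10119\n")),
  tmp
)

# fileEncoding = "UTF-8-BOM" removes the mark while reading;
# check.names = FALSE keeps the blank inside the header names;
# colClasses = "character" keeps the codes as text.
hs <- read.csv(tmp, sep = ";", fileEncoding = "UTF-8-BOM",
               check.names = FALSE, colClasses = "character")
names(hs)

unlink(tmp)  # remove the temporary file
```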
 
- BEC data is provided in the HS6toBEC folder. This folder contains only one type of file: concordance lists between the HS6 and BEC classifications in the years in which HS6 codes changed (i.e. 2002, 2007, 2012, 2017). For each year, a separate concordance list is necessary. csv-files are used for this data, where the separator is a semicolon, i.e. “;”, and the filenames contain the year. For example, in 2002, the file is called HS2002toBEC.csv. Headers are also included in this file, namely “HS” for the HS6 codes and “BEC” for the BEC codes. An example (first six lines) of the concordance in 2002 is the following:
HS;BEC
10110;41
10190;111
10210;41
10290;111
10310;41
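To illustrate how such a list can connect a CN8 code to a BEC class, the sketch below derives the HS6 code from the first six digits of a CN8 code and looks up the BEC code. It rests on several assumptions that are not confirmed by the package: the five-digit HS entries are treated as six-digit codes that lost their leading zero, the one-digit BEC level is taken as the first digit of the detailed BEC code, and the two CN8 codes are illustrative only.

```r
# Part of the concordance shown above, read from a text string.
hs_bec <- read.csv(text = "HS;BEC\n10110;41\n10190;111\n10210;41",
                   sep = ";", colClasses = "character")

# Assumption: entries with five digits are HS6 codes without leading zero.
hs_bec$HS <- paste0(strrep("0", 6L - nchar(hs_bec$HS)), hs_bec$HS)

# Illustrative CN8 codes; the first six digits of a CN8 code are its HS6 code.
cn8 <- c("01011010", "01019011")
hs6 <- substr(cn8, 1, 6)

# Look up the detailed BEC code and derive the 1-digit level from it
# (assumption: the first digit is the high aggregation level).
bec_agr <- hs_bec$BEC[match(hs6, hs_bec$HS)]
bec     <- substr(bec_agr, 1, 1)

result <- data.frame(CN8 = cn8, HS6 = hs6, BEC = bec, BEC_agr = bec_agr,
                     stringsAsFactors = FALSE)
result
```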
 
Custom Data 
It is possible to use additional data, for example further concordance lists, or to alter the provided data. However, it is highly recommended to first read the instructions in this vignette carefully. If new data is added, there are some mandatory aspects and some valuable aspects to consider.
Mandatory aspects:
- New data must be stored inside the package. This can be easily done by adding new files to the appropriate subfolder of the package database. get_data_directory() may help to find the correct folder to store new data.
 
- Chosen filenames must be analogous to those of the already existing files. 
- The structure of the new data is crucial. The section Data Sets may provide more details. In short: file-type, header names, column numbers and datatype (numeric, character, …) are very important. 
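As a sketch of how the structure of a custom file could be verified before it is copied into the package folder, the snippet below checks a yearly PC8 concordance against the layout described in the section Data Sets. The helper check_pc8_concordance() is hypothetical and not part of the package, the character type of the columns is an assumption, and the file is written to a temporary directory rather than to the package database.

```r
# Hypothetical helper (not a package function): checks for two columns
# named "new" and "old" that hold eight-character codes stored as text.
check_pc8_concordance <- function(x) {
  is.data.frame(x) &&
    ncol(x) == 2L &&
    setequal(names(x), c("new", "old")) &&
    all(vapply(x, is.character, logical(1))) &&
    isTRUE(all(nchar(unlist(x)) == 8L))
}

good <- data.frame(new = c("07101000", "07291100"),
                   old = c("07101000", "07291100"),
                   stringsAsFactors = FALSE)

# Codes stored as numbers lose their leading zero and fail the check.
bad <- data.frame(new = c(7101000, 7291100),
                  old = c(7101000, 7291100))

# Filename built analogously to the existing files; saved to a temporary
# directory here, not to the package database.
path <- file.path(tempdir(), "PC8_2010_2011.rds")
saveRDS(good, path)
reloaded <- readRDS(path)

check_pc8_concordance(reloaded)
check_pc8_concordance(bad)
```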
Valuable aspects:
- It is highly recommended to download new data from EU server Ramon and only alter content-related data if necessary.
- Product codes need to have the correct length, e.g. CN8 codes must be eight digits long. Some programs tend to interpret codes as numeric values and cut off leading zeros, which leads to completely wrong results.
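If leading zeros have already been lost, they can be restored before the data is used. The helper pad_code() below is a hypothetical base R sketch and not part of the package; it assumes that every shortened code only lacks zeros at the front.

```r
# Hypothetical helper: left-pads product codes with zeros to a fixed width.
# Accepts numeric or character input; missing values stay missing.
pad_code <- function(x, width = 8L) {
  if (is.numeric(x)) {
    # "%.0f" avoids scientific notation such as "1e+07"
    x <- ifelse(is.na(x), NA_character_, sprintf("%.0f", x))
  }
  x <- as.character(x)
  out <- x
  ok <- !is.na(x)
  missing_zeros <- pmax(width - nchar(x[ok]), 0L)
  out[ok] <- paste0(strrep("0", missing_zeros), x[ok])
  out
}

pad_code(c(1011100, 25151200))       # numeric input
pad_code(c("8111136", NA))           # character input with a missing value
pad_code("10111", width = 6L)        # six-digit HS6 code
```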