MEASURING REGULATORY COMPLEXITY

Jean-Edouard Colliard (colliard@hec.fr)
Co-Pierre Georg (co.georg@fs.de)

This version: 14 July 2025

Citation:
Colliard, Jean-Edouard and Georg, Co-Pierre, Measuring Regulatory Complexity (May 09, 2025). HEC Paris Research Paper No. FIN-2020-1358, Available at SSRN: https://ssrn.com/abstract=3523824 or http://dx.doi.org/10.2139/ssrn.3523824


OVERVIEW

This folder contains the code and data for the paper "Measuring Regulatory Complexity" by Colliard and Georg. Each subfolder contains the code for one application of our methodology. The code was written to run on OS X using a bash shell, e.g., in Terminal or Visual Studio Code. All data are provided under Source_datasets/ in the directories for the various sections of the paper.

Since we use only public and experimental data, today's date serves as the data's creation date.

.m files were run using Matlab R2023a.
.do files were run using Stata/MP 18 (64-bit) for Windows.
.nb files were run using Wolfram Mathematica 13.2.
.sh files are bash shell scripts that can be executed on OS X (make them executable using 'chmod +x *.sh').
.py files are Python code that can be executed on OS X (make them executable using 'chmod +x *.py').
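For example, a script can be made executable and run from Terminal as follows (example.sh below is a stand-in for the actual scripts, such as 01_do_analysis.sh):

```shell
# Create a stand-in script (in practice, the .sh files already exist in each folder)
printf '#!/bin/bash\necho "analysis complete"\n' > example.sh

# Make it executable, then run it
chmod +x example.sh
./example.sh
```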

* Data availability statement

Data file / source / provided
01_Basel_I/Source_datasets/bcbs04a.pdf / BIS / yes
02_Dodd_Frank_Act/Source_datasets/DFA_Dictionary.csv / US Congress / yes
02_Dodd_Frank_Act/Source_datasets/Dodd_Frank_Act.pdf / US Congress / yes
03_EBA/Source_datasets/EBA_regulations.pdf / eur-lex / yes

* Computational requirements:
Python 3.13.4:
    pip install networkx
(pathlib is included in the Python 3 standard library and does not need to be installed via pip; for more detail see 06_Other/python_requirements.txt)

The code was executed on a 2025 MacBook Pro; with this equipment, it took about fifteen minutes to run.


MAIN CONTENT:
01_Basel_I/

./Source_datasets/ contains the two source datasets for this section, Basel_I_Algo_Data.xlsx and Basel_I_Text_Data.xlsx. For completeness we also give the (modified) Basel I text we used to manually produce Basel_I_Text_Data.xlsx, and the original Basel I text for comparison (the relevant text, Annex 2, is on pp. 21-22).

1. Run Compute_Measures.m . This returns the following in the folder "Output":

- Correlations_Table_2.tex (Table 2) [Note that the output for "quantity" is NaN. We set it to 1.00 in Table 2, for reasons explained in the main text.]

- Measures_Algo_Table_OA1.tex (Table OA.1) [In the table we manually input the name of each regulation]

- Correlation_Algo_Table_OA2a.tex (Table OA.2 Panel A) and Correlation_Algo_TableOA2b.tex (Table OA.2 Panel B)

- Measures_Text_Table_OA4.tex (Table OA.4) [In the table we manually input the name of each regulation]

- Correlation_Text_Table_OA5a.tex (Table OA.5 Panel A) and Correlation_Text_Table_OA5b.tex (Table OA.5 Panel B)

2. Table OA.3 is produced manually from Basel_I_Text_Data.xlsx.

3. Additionally, we provide the original pdf of the Basel I regulations obtained from https://bis.org/ in ./Source_datasets/bcbs04a.pdf 


02_Dodd_Frank_Act/

./Source_datasets/ contains the DFA dictionary DFA_Dictionary.csv and Dodd_Frank_Act.pdf, the original pdf version of the Dodd-Frank Act, downloaded from https://www.congress.gov/.

./Source_datasets/DFA-titles/ contains 16 files named title_x.txt, where x is the number of the title in the DFA.

./Source_datasets/DFA-titles_processed/ contains the files normally produced in step 1 below [for users who do not wish to execute the shell script]

1. Run the shell script 01_do_analysis.sh . It executes process_text.py, which uses the raw text of the titles in ./Source_datasets/DFA-titles/ as well as the dictionary ./Source_datasets/DFA_Dictionary.csv and creates in ./Source_datasets/DFA-titles_processed/ :

- The files cons-count_title_x.csv, where x corresponds to a title. Each file lists the dictionary words (including their variants) that appear in title x, together with their number of occurrences.

- The file all_cons-count.csv (concatenated version of all cons-count_title_x.csv files).

- The files residual_cons-count_title_x.txt; each contains the words of title x not found in the dictionary (used for debugging only).
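The script itself is included in the package; the sketch below only illustrates the general idea of dictionary-based counting with variant grouping, using a made-up miniature dictionary (the actual format of DFA_Dictionary.csv and the matching logic in process_text.py may differ):

```python
import re
from collections import Counter

# Hypothetical miniature dictionary mapping each variant to a canonical key
dictionary = {
    "require": "require", "requires": "require", "required": "require",
    "prohibit": "prohibit", "prohibits": "prohibit",
}

def count_dictionary_words(text, dictionary):
    """Count dictionary words in a text, grouping variants under one key."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()    # analogue of cons-count_title_x.csv
    residual = set()      # analogue of residual_cons-count_title_x.txt
    for tok in tokens:
        if tok in dictionary:
            counts[dictionary[tok]] += 1
        else:
            residual.add(tok)
    return counts, residual

counts, residual = count_dictionary_words(
    "The agency requires banks to report; reporting is required.", dictionary)
# counts groups "requires" and "required" under the key "require"
```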

2. Run 02_Prepare_Counts.do . It uses the files "./Source_datasets/DFA-titles_processed/cons-count_title_x.csv" for x between 1 and 16. It generates in the folder ./Source_datasets/DFA-titles_processed/ :

- category_cons_count_all_titles.csv, which gives a count of the number of ngrams in each category, for each title.

- category_cons_all_titles_most_frequent_keys.csv, which gives the ten most frequent ngrams in each category, for the whole text.

- category_unique_count.csv, which gives a count of the number of unique ngrams in each category, for each title.

- cons-count_title_x.dta, for each title x, which is the same data as cons-count_title_x.csv, to be used in step 4.
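The category aggregation in step 2 can be sketched as follows (the category mapping and per-title counts below are made-up illustrations; the actual computation is done in Stata by 02_Prepare_Counts.do):

```python
from collections import defaultdict

# Hypothetical ngram -> category mapping (the real mapping lives in the
# dictionary file; the example categories here are illustrative only)
categories = {"require": "operator", "bank": "operand", "capital": "operand"}

# Hypothetical per-title counts, as in the cons-count_title_x.csv files
title_counts = {1: {"require": 5, "bank": 3}, 2: {"capital": 7}}

category_counts = {t: defaultdict(int) for t in title_counts}  # total occurrences
unique_counts = {t: defaultdict(int) for t in title_counts}    # distinct ngrams
for title, counts in title_counts.items():
    for ngram, n in counts.items():
        cat = categories[ngram]
        category_counts[title][cat] += n   # -> category_cons_count_all_titles.csv
        unique_counts[title][cat] += 1     # -> category_unique_count.csv
```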

3. Run 03_Summary_Stats.do . This returns: 

- A dataset with the words in each category sorted by number of occurrences, which we use to produce Table 3.

- counts_titles.dta in the folder ./Datasets, which contains a count of ngrams in each category for each title

- Table OA.7

4. Run 04_Completeness.do . This returns Table OA.6.


03_EBA/
./Source_datasets/ contains EBA_Dictionary.csv (the EBA dictionary), Bankers_Data.csv (the bankers' survey data), Regulators_Data.csv (the regulators' survey data), and EBA_Regulations.pdf, the original pdf version of the ITS templates, downloaded from https://eur-lex.europa.eu/.
./Source_datasets/EBA-individual/ contains 101 files named x.txt, where x is the name of a template used either in the bankers' survey or in the regulators' survey.
./Source_datasets/EBA_processed/ contains the files normally produced in step 1 below [for users who do not wish to execute the shell script]

1. Run the shell script 01_do_analysis.sh . It executes process_text.py, which uses the raw text of the regulations in ./Source_datasets/EBA-individual/ as well as the dictionary ./Source_datasets/EBA_Dictionary.csv and creates in ./Source_datasets/EBA_processed/ :

- The files cons-count_x.csv, where x corresponds to each template / group of templates we are using in Tables 7 and 8. Each file lists the dictionary words (including their variants) that appear in template x, together with their number of occurrences.

- The file all_cons-count.csv (concatenated version of all cons-count_x.csv files)

- The files residual_cons-count_x.txt; each contains the words of template x not found in the dictionary (used for debugging only)

2. Run 02_Prepare_EBA_Master.do . This creates in ./Datasets :

- EBA_Master.dta , which is used for all the analyses on the bankers' survey.

- EBA_Master_Regulators.dta , which is used for all the analyses on the regulators' survey.

3. Run 03_Summary_Stats.do . This returns:

- Table OA.8 

- Table OA.9

4. Run 04_Regressions.do . This returns:

- Table 7

- Table OA.23

- Table OA.24

- Table OA.25

- Table OA.26

- Table 9, numbers reported in the third column of Panels A and B

5. Run 05_Regulators.do . This returns:

- Table 8

- Table OA.27

- Table OA.28

- Table OA.29

- Table OA.30

- Table OA.31

- Table 9, numbers reported in the fourth column of Panels A and B


04_Experiments/

The repository for the experiments website (https://regulatorycomplexity.org), which is hosted on Heroku, can be found at [removed for double-blind peer review]. The balance sheet used in the experiments is always the same and included on the website as an image (the left-hand side of Fig. 3 in the paper). For completeness, the code we used to generate the random regulations can be found in "Additional Material".

./Source_datasets/ contains:

- random_regulations_data.xlsx , which gives a count of all words that appear in each randomly generated regulation (this file was generated manually).

- master.dta , which contains the results of the experiment.

- The folder ./random_regulations , which contains all the randomly generated regulations used in the experiments.

0. Run measures_preparation.m . This returns Measures.csv in the folder "Datasets"

1. Run 01_data_preparation.do . It matches the participants' answers with the complexity measures for each regulation and returns:

- main.dta in the folder "Datasets"

- master.dta in the folder "Datasets"

- measures.dta in the folder "Datasets"

- Table 4

2. Run 02_regressions_mistakes_panel.do . This returns:

- Table 5

- Table OA.12

- Table OA.13

- Table OA.14

- Table OA.18

- Table 9, numbers reported in the first column of Panels A and B

3. Run 03_regressions_time_panel.do . This returns:

- Table 6

- Table OA.15

- Table OA.16

- Table OA.17

- Table 9, numbers reported in the second column of Panels A and B

4. Run 04_regressions_mistakes_pooled.do . This returns:

- Descriptive statistics on the proportion of mistakes across regulations, mentioned in the main text.

- Table OA.10

5. Run 05_regressions_time_pooled.do . This returns:

- Table OA.11

6. Run 06_regressions_time_alternative.do . This returns:

- Table OA.19

- Table OA.20

- Table OA.21

- Table OA.22


05_Model/

Run Model.nb . This returns the four graphs shown in Fig. OA.8.


06_Other/

1. Run 01_Figures.do, which generates the subfigures of Figure 2 in folder ./Output:

- Figure2_length.png

- Figure2_cyclomatic.png

- Figure2_quantity.png

- Figure2_diversity.png

- Figure2_level.png

2. Run 02_DictionaryComparison.do , which generates the number of new operands and operators in the EBA dictionary relative to the DFA dictionary, as mentioned on p. 6 in the paper.

3. Run 06_Table9_Synthesis.do, which generates the TeX code for Table 9 in the paper.