Data is not only fueling the economy; it has also become an increasingly important driver of empirical legal research. Three reasons are chiefly responsible for this. First, the internet, better search engines and bigger databases put more international law data, on everything from treaties to disputes and arbitrators, at a scholar’s disposal than ever before. Second, researchers are beginning to treat the primary material of law, legal texts, as data. By conceiving of text as data and transforming it into numerical representations using natural language processing techniques, scholars can analyze more written material than they could ever read. Third, neighboring disciplines, including legal informatics, computer science and the digital humanities, provide international lawyers with new tools for digesting large amounts of legal data, including machine learning and artificial intelligence.
In a Special Issue for the Journal of International Economic Law, we are beginning to explore this new data-driven frontier in empirical legal scholarship. We have been fortunate to assemble strong contributions that engage with major international economic law debates through a data-driven lens using state-of-the-art empirical techniques. In this blog post, we want to set out the main issues that, we believe, are raised by this new frontier of empirical scholarship.
What is different in data-driven research?
Three aspects distinguish data-driven empirical scholarship from traditional approaches. First, data is no longer only a means to empirically test a theory, but increasingly the starting point for empirical research. Big data science allows for the inductive analysis of large amounts of data to detect patterns and trends that we did not know existed or did not expect beforehand, by letting “data speak for itself”. Results generated through this “data-first” attitude can then be used to test established theories or build new ones.
Second, the availability of more information and more efficient means for its analysis allows researchers to investigate entire data populations rather than subsamples thereof. Even though datasets will not always be complete, as some investment awards will remain secret and some treaties unpublished, access to more comprehensive data will improve the accuracy of results and reveal patterns only visible in the aggregate.
Third, in analyzing this new wealth of data, researchers are increasingly relying on computing rather than reading or counting. Traditional content analysis, which employs an “army” of research assistants to read and hand-code documents, is giving way to the use of computers and artificial intelligence to make sense of texts. While machines are better than humans at spotting patterns across large amounts of text (which is why we use them in plagiarism detection software, for instance), they are worse than humans at resolving interpretive ambiguities. Human coders are thus not going to be completely replaced any time soon, but computers and artificial intelligence are beginning to play a larger role in legal document analysis.
The promises of data-driven research
Data-driven empirical research promises to uncover latent patterns in international law data, debunk past myths and forecast the future, while contributing to new theory-building, as the contributions to the JIEL Special Issue show.
Behn et al., for instance, map the network of investment arbitration practitioners and quantify double-hatting, a phenomenon in which a lawyer acts as both arbitrator and counsel in concurrent, unrelated proceedings and which is otherwise discussed only in the abstract. Charlotin similarly puts hard figures on an abstract debate. Drawing on a dataset of 75,000 citations of international courts and tribunals compiled for this Issue, Charlotin puts numbers on international law’s fragmentation by investigating the degree of cross-citation between international economic law tribunals and other international adjudicatory institutions.
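To make the idea of quantifying double-hatting concrete, the following is a minimal sketch and not the method used in the Special Issue. It assumes a hypothetical table of appointment records (all names, cases and years are invented for illustration) and flags anyone who sits as arbitrator in one case while acting as counsel in a different case during overlapping years.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical appointment records, invented for illustration:
# (case_id, person, role, start_year, end_year)
APPOINTMENTS = [
    ("case-1", "A. Smith", "arbitrator", 2010, 2013),
    ("case-2", "A. Smith", "counsel", 2012, 2015),
    ("case-3", "B. Jones", "arbitrator", 2010, 2011),
    ("case-4", "B. Jones", "counsel", 2014, 2016),
    ("case-5", "C. Lee", "arbitrator", 2011, 2014),
    ("case-6", "C. Lee", "arbitrator", 2012, 2013),
]


def overlaps(a_start, a_end, b_start, b_end):
    """Return True if two closed year intervals share at least one year."""
    return a_start <= b_end and b_start <= a_end


def find_double_hatters(appointments):
    """Map each person to the case pairs in which they double-hat."""
    by_person = defaultdict(list)
    for case_id, person, role, start, end in appointments:
        by_person[person].append((case_id, role, start, end))

    result = {}
    for person, records in by_person.items():
        pairs = []
        for (c1, r1, s1, e1), (c2, r2, s2, e2) in combinations(records, 2):
            different_roles = {r1, r2} == {"arbitrator", "counsel"}
            if c1 != c2 and different_roles and overlaps(s1, e1, s2, e2):
                pairs.append((c1, c2))
        if pairs:
            result[person] = pairs
    return result


double_hatters = find_double_hatters(APPOINTMENTS)
print(double_hatters)
```

In this toy data only one person is flagged: the second holds both roles but at different times, and the third holds the same role twice. How "concurrent" is defined (here, overlapping calendar years) is itself an assumption that a real study would need to justify.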
Allee et al. debunk a different type of myth, challenging the view that preferential trade agreements (PTAs) and the World Trade Organization (WTO) are competing projects on trade regulation. Using textual similarity metrics, they show that references to WTO agreements and the incorporation of WTO language in PTAs have increased rather than decreased over time. Indeed, the countries most forcefully pursuing PTAs are also the ones that link their treaties most explicitly to the WTO. Daku and Pelc use similar techniques to compare party submissions in WTO disputes with Appellate Body reports in order to track parties’ influence over dispute settlement outcomes. They find that WTO members with less legal capacity also have less of an impact on the precedents that shape WTO jurisprudence.
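As an illustration of what a textual similarity metric can look like, here is a minimal sketch using cosine similarity over word counts. This is one common generic measure, chosen here as an assumption; it is not necessarily the metric used by the authors above, and the sample clauses are invented rather than quoted from any treaty.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[w] * counts_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Invented clauses for illustration only.
REFERENCE = "each party shall accord national treatment to the goods of the other party"
CLOSE_COPY = "each party shall accord national treatment to the goods of another party"
UNRELATED = "the parties agree to cooperate on environmental protection"

sim_close = cosine_similarity(REFERENCE, CLOSE_COPY)
sim_far = cosine_similarity(REFERENCE, UNRELATED)
print(f"close copy: {sim_close:.2f}, unrelated clause: {sim_far:.2f}")
```

A score near 1 indicates near-identical wording and a score near 0 indicates little shared vocabulary. Word counts ignore word order, so studies of copied treaty language often prefer measures that are sensitive to sequences of words.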
On the more conceptual side, Derlén and Lindholm show how citation analysis can be used to track the importance of precedent over time using the European Court of Justice’s decisions that shaped the European Single Market as case studies. Broude et al. empirically operationalize the idea of regulatory space in investment agreements and compare the Trans-Pacific Partnership to overlapping treaties on that ground. Finally, Morin et al. argue that the universe of trade agreements can be understood as a complex adaptive system and support that claim by empirically tracing innovation and adoption of environmental provisions in trade agreements.
The challenges and limitations of data-driven research
New data and new tools thus provide exciting new opportunities for empirical analysis that would have been impossible at this scale or depth using more traditional methods. At the same time, they also come with challenges and limitations.
Challenges span the entire life-cycle of data-driven research. It is often difficult to obtain machine-readable data at the outset, and those who have it may not be willing to share it. Once data is obtained, most legal researchers lack the methodological training to fully exploit it. And even when research is ready for dissemination, outlets may be hesitant to publish work that is descriptive rather than normative. For data-driven research to thrive, empirical legal scholars thus need to do three things: collaborate more closely in building and disseminating joint datasets; work together with other disciplines, in particular computer science, to benefit from their complementary skillsets; and prepare legal research outlets for more data-intensive scholarship, including by broadening the pool of reviewers and putting in place mandatory data publication conventions.
Even when these challenges are overcome, some limitations remain. Data-driven research is particularly prone to being mistaken for theory-less research where data not only speaks but also thinks for itself. That is why researchers engaged in data-driven work need to be careful to separate pattern from noise and to complement elaborate quantitative tools with equally elaborate qualitative evidence backed up by sound theory.
Furthermore, data-driven research such as text-as-data analysis or network analysis is especially vulnerable to drawing wrong conclusions from skewed data. Think of a research project that only looks at English-language treaties because of their greater availability. How generalizable are its results? Or consider a network of cross-appointments of arbitrators where links are missing because cases remain secret. Data-driven research is thus particularly sensitive to the quality of the underlying data.
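The arbitrator example can be sketched in a few lines. The snippet below is a toy illustration under stated assumptions: the cases, the single-letter arbitrator names and the "public" flag are all invented. It builds a co-appointment network twice, once from all cases and once from the public ones only, and compares how many distinct co-arbitrators each person appears to have.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical tribunals, invented for illustration. "public" marks whether
# the case would be visible to a researcher.
CASES = {
    "case-1": {"arbitrators": ["A", "B", "C"], "public": True},
    "case-2": {"arbitrators": ["A", "D", "E"], "public": False},
    "case-3": {"arbitrators": ["B", "D", "F"], "public": True},
}


def co_appointment_degree(cases):
    """Count, per arbitrator, the distinct people they have sat with."""
    neighbours = defaultdict(set)
    for record in cases.values():
        for x, y in combinations(record["arbitrators"], 2):
            neighbours[x].add(y)
            neighbours[y].add(x)
    return {name: len(others) for name, others in neighbours.items()}


full = co_appointment_degree(CASES)
observed = co_appointment_degree(
    {case_id: rec for case_id, rec in CASES.items() if rec["public"]}
)

for name in sorted(full):
    print(name, "true degree:", full[name], "observed:", observed.get(name, 0))
```

With one secret case removed, arbitrators A and D each appear half as connected as they are, and E disappears from the network altogether, while B's position is unaffected. Any ranking of "central" arbitrators drawn from the observed network would therefore be distorted in ways that cannot be detected from the public data alone.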
The time is ripe
In spite of these challenges and limitations, we still believe that the time is ripe for a greater role of data-driven research in empirical legal scholarship. Important normative questions, from international law’s fragmentation to the double-hatting of arbitration practitioners and legal innovation in trade agreements, can finally be tackled through empirical research. And data-driven research provides new opportunities not only for legal scholars but also for practitioners, who can benefit from big data research when it is disseminated through dedicated websites and applications.
We thus hope that this Special Issue will help introduce this emerging field and its growing cohort of computer-savvy scholars to a wider range of legal researchers and practitioners.