Disrupting Corruption: The Case for Data Science

By Ina Cilliers MPL, DA Gauteng Spokesperson on the Standing Committee for Public Accounts (SCOPA)

South Africa has a big problem with corruption in government supply chains. The most salient recent example is the looting of funds during the Covid-19 pandemic, specifically the procurement of personal protective equipment in the Gauteng Department of Health. Mark Heywood correctly asserted in the Daily Maverick that unless we introduce the certainty of punishment for corrupt public officials, we will lose the fight against corruption. The "July Insurrection" of 2021 taught us that these events affect our daily lives: they cause job losses and food price increases, and they hit young people especially hard. Even Cosatu recognised back in 2017 that corruption costs us at least R27 billion and 76 000 jobs a year. That was before the pandemic.

Now imagine a scenario where an accounting officer in the Department of Health could accurately predict the likelihood of fruitless and wasteful expenditure, and act to prevent it. Ponder for a moment the transformative power of predicting the likelihood of xenophobic attacks or unrest, such as the now infamous July unrest of 2021, so that they could be averted altogether. What if an entire government supply chain could be managed by a distributed ledger such as a blockchain, so that not a single public official is involved?

Industries such as banking and insurance have been making these kinds of predictions with data science tools for some time now. They employ machine learning to model risks to their own business, predicting the likelihood of a client defaulting on a loan or lodging an insurance claim. Have you ever wondered why the bank did not want to give you a loan? There is an algorithm behind that!
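For the curious, the mechanics are less mysterious than they sound. Below is a minimal sketch of how such a scoring model might be trained. The data is randomly generated and the feature names are invented purely for illustration; no real lender's method is being described.

    # A minimal sketch of a default-risk model, trained on made-up data.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(42)

    # Invented features: income, debt ratio, years at current employer
    X = rng.normal(size=(1000, 3))
    # Invented label: 1 = defaulted, 0 = repaid
    y = (X[:, 1] - X[:, 0] + rng.normal(size=1000) > 1).astype(int)

    model = LogisticRegression().fit(X, y)

    # The "algorithm behind that": a default probability for a new applicant
    print(model.predict_proba([[0.2, 1.5, -0.3]])[0, 1])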

Data science is an emerging field of inquiry usually associated with buzzwords such as Big Data (BD), Machine Learning (ML) and Artificial Intelligence (AI). All of these terms have their roots in classical statistics. Stated plainly, statistical learning is simply learning from data. This is made possible by two conspiring realities: the cost of storing data has decreased enormously over the years, and the computational power of hardware has increased exponentially (McCallum, 2008). This means it is now computationally possible to find patterns and correlations in very large datasets (hence, Big Data). One way of understanding this ability is to say: if the data is too large for an Excel spreadsheet, or too large for a single CPU to handle, it is potentially a job for data science.
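To make the "too large for Excel" test concrete, here is a small hypothetical sketch using the pandas library. The file name and column names are invented, but the pattern of reading a large file in manageable chunks rather than all at once is the everyday workaround.

    # Hypothetical: summarise a transactions file too big to open in Excel.
    import pandas as pd

    spend_by_dept = {}
    # Read 100 000 rows at a time instead of loading the whole file
    for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
        for dept, amount in chunk.groupby("department")["amount"].sum().items():
            spend_by_dept[dept] = spend_by_dept.get(dept, 0) + amount

    print(spend_by_dept)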

What if we brought data science and good governance into the same room for a chat? I believe such a conversation holds enormous benefits for oversight, evidence-based policy making and the fight against corruption. Full disclosure: I am wearing more than one hat.

As a legislator in the Gauteng Provincial Legislature (GPL) and a member of the Standing Committee on Public Accounts (SCOPA), I often hear well-founded complaints about the ex post facto way we do oversight. The Sector Oversight Model (SOM) adopted by the South African legislative sector is a backward-looking tool: oversight typically occurs months after irregular public expenditure has taken place, and committee recommendations do not scare corrupt officials.

As a social researcher and budding code writer (Python is fun!), I wonder about the potential to use an Artificial Neural Network (ANN) or a Decision Tree Regressor as a catalytic mechanism in the fight against corruption.
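To make that concrete, here is a purely hypothetical sketch of what such a model could look like, using scikit-learn's Decision Tree Regressor. The file name, the features and the target are all invented; as I argue below, no such ready-made dataset exists yet.

    # Hypothetical sketch: scoring the risk of fruitless and wasteful
    # expenditure from past procurement records. All names are invented.
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    df = pd.read_csv("procurement_history.csv")        # hypothetical dataset

    features = ["contract_value", "supplier_age_days",
                "deviation_count", "days_to_payment"]  # invented columns
    target = "wasteful_expenditure_score"              # invented label

    X_train, X_test, y_train, y_test = train_test_split(
        df[features], df[target], random_state=0)

    model = DecisionTreeRegressor(max_depth=5, random_state=0)
    model.fit(X_train, y_train)

    # How does the model fare on transactions it has never seen?
    print("R^2 on unseen data:", model.score(X_test, y_test))

The specific model is not as important as a few reality checks, though: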

The first reality check concerns the quality of freely available resources. In the web-native world of data science, anybody can write code, train a computer model and set it free on real data, a practice encouraged by the ubiquity of free and open-source material. Urban and Pineda adequately problematize this abundance of free information: most of it is not sufficiently rigorous to warrant deep enquiry, and few resources, if any, are aimed specifically at the policy maker. A simple Google search reveals thousands of short-form resources in the form of "How To" video tutorials, articles, listicles and snippets, each dealing with a specific subset of the myriad elements of data science. Topics such as "Preparing data with Pandas", "How to select features and responses" or "How to determine if my chosen algorithm is performing" all offer enticing glimpses of problem solving. Rarely is the world of data science systematically unpacked, referenced and peer-reviewed specifically for the policy maker, the legislator and the government official. And yet, on the periphery of applied policy making, most officials are aware of concepts such as BD, ML and AI. These concepts first need to be demystified before they can be introduced formally into the governance domain.

The second question is whether the data exists. Let me explain. Data about government performance is everywhere, and it is abundant. In my case, SCOPA members are inundated with data all the time. Sources include portfolio committee reports, the AGSA, the Public Service Commission, the Financial and Fiscal Commission, the SIU, internal audit reports, departmental quarterly reports, and the list goes on. Yet amid all this information, I very much doubt that a dataset exists that is ready for machine learning. If anyone reading this would like to rebut my assertion, I would welcome such a development. It would save my research several months!
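For clarity, "ready for machine learning" means something quite specific: one row per observation, tidy columns, and a labelled outcome. A toy illustration of the shape, with every value invented, would be:

    # A toy example of an ML-ready oversight dataset. All values invented.
    import pandas as pd

    df = pd.DataFrame({
        "department":     ["Health", "Health", "Education"],
        "contract_value": [1_200_000, 85_000, 430_000],
        "open_tender":    [0, 1, 1],   # was the tender openly advertised?
        "audit_finding":  [1, 0, 0],   # label: irregular expenditure found?
    })
    print(df)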

The third issue is reproducibility. If my team and I build an ML model that performs well on unseen data, we must share! It should be standard practice that not only the datasets but the actual code are made available as part of the research (Data School, 2014). Sometimes algorithms do not work, machine learning models degrade over time, or we make the wrong business decisions on the strength of the data. In such cases we can all learn from the failures just as much as from the triumphs.
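In practice, sharing can be as simple as pinning the random seed and publishing the fitted model alongside the data it was trained on. A rough sketch, with placeholder file names:

    # Reproducibility habits: fix the seed, ship the model and data together.
    import joblib
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeRegressor

    df = pd.read_csv("procurement_history.csv")        # placeholder file
    X = df.drop(columns="wasteful_expenditure_score")
    y = df["wasteful_expenditure_score"]

    # random_state pins the split so others can recreate it exactly
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

    joblib.dump(model, "model.joblib")   # publish alongside the write-up
    df.to_csv("dataset_as_used.csv", index=False)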

For the policy maker, machine learning can become the tool that helps us prove or disprove our intuition about the problem we are trying to solve. Here are a few of my recommendations for disrupting corruption with a data-driven approach: 

Firstly, governments need to become data-driven learning organisations. For this to happen, much more research and experimentation is needed. Officials and domain experts in government departments are vital to the success of the undertaking, because they make the vital inferences from the data on which decisions (corruption busting, say) hinge.

Secondly, we need to find a place for data science in the policy life cycle, and ideally more than one. Statistical models can help policymakers move from measuring inputs and outputs to measuring outcomes and impacts. How far this goes will depend on the computational capacity available and the particular problem the machine is learning, but every stage of the policy life cycle can benefit from machine learning.

Thirdly, we need to bring together the data scientist, the statistician and the domain expert. Governance is an immensely complex undertaking. Every moment of the day, thousands upon thousands of financial transactions happen across municipal, provincial and national budget line items, often involving staggering amounts, all in an environment regulated by a myriad of complex laws and regulations. Officials often underestimate their domain knowledge, and they seldom get credit for being able to assimilate all this complexity. The current consensus is that the best problem-solving teams include data scientists, statisticians and domain experts (Mukherjee, 2019). All three have distinct roles to play in arriving at informed governance decisions.

The possibilities for machine learning to tackle corruption are exciting. But realising them requires humility from the data scientist, an understanding of the theory, and a willingness from the government official to share their domain knowledge freely. And finally, no model can ever predict with 100% accuracy. We should be honest about that when pitching solutions to legislatures and governments alike.