Uncovering causal relationships using adaptive interventions
Introduction
Will eating kaya toast or roti prata for breakfast make me feel more satisfied? If I exercised more, would I become fitter? Should I major in Computer Science or Medicine if I want to have a more meaningful career?
Kaya toast [Image credit] | Roti prata [Image credit]
Life is full of what-ifs, and we often wonder what would have happened if we had done X instead of Y. Such counterfactual questions are one of the key objects of study in the field of causality. More broadly, the field of causality aims to understand and reason about the causal relationship between variables of interest, beyond simply statistical correlations.
Correlation does not imply causation
The phrase “correlation does not imply causation”[1] is often cited to warn others against making erroneous causal conclusions, especially when spurious correlations[2] are common in everyday life. For example, one may observe that the number of ice cream sales is highly positively correlated with the number of shark attacks. Does this mean that we should stop selling ice cream to save lives?[3] In this case, a more plausible causal explanation for this trend is that both are affected by weather conditions: people are more likely to buy ice cream and to visit beaches (hence increasing the chance of being attacked by sharks) when the weather is hotter. While this toy example may seem frivolous, misinterpreting spurious correlations may lead to poor performance in machine learning models, or even conspiracy theories[4].
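To make this concrete, here is a minimal simulation sketch (with made-up numbers, not real sales or attack data) in which a shared cause induces a strong correlation between two variables even though neither causes the other:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# A common cause: hypothetical daily temperature (in Celsius).
temperature = rng.normal(30, 5, n)

# Both "effects" depend on temperature, plus independent noise.
ice_cream_sales = 10 * temperature + rng.normal(0, 10, n)
shark_attacks = 0.3 * temperature + rng.normal(0, 1, n)

# The two effects are strongly correlated, yet banning ice cream
# would do nothing to reduce shark attacks.
print(np.corrcoef(ice_cream_sales, shark_attacks)[0, 1])
```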
Causal structure learning
Suppose we are now given measurements of a bunch of variables and we wish to learn the underlying causal relationships between them[5]. Depending on the setting, these variables could be health measurements (diet, exercise, blood pressure, etc.), stock prices (AAPL, AMZN, TSLA, etc.), or activity levels of different genes.
Observational data alone is insufficient
Unfortunately, one can typically only recover the causal relationships up to an equivalence class. For example, suppose we obtain the following paired measurements of two variables X1 and X2:
| | Obs. 1 | Obs. 2 | Obs. 3 | Obs. 4 | Obs. 5 | Obs. 6 | Obs. 7 | Obs. 8 | Obs. 9 |
|---|---|---|---|---|---|---|---|---|---|
| Variable X1 | -0.27 | 0.29 | 0.37 | -0.09 | 0.34 | 0.33 | 0.30 | -1.34 | 0.68 |
| Variable X2 | -0.10 | 1.65 | 0.47 | 1.92 | 2.04 | 1.67 | 0.11 | -3.58 | 1.97 |
Even if we are guaranteed that one variable is the cause of the other, we cannot determine which is the cause and which is the effect from the data given above. This is because there exist two different data generating processes that could have produced the data.
- One possible data generating process (X1 → X2): Let ε1 ~ N(0, 1) and ε2 ~ N(0, 1) be two independent Gaussians. Then, define X1 = ε1 and X2 = X1 + ε2.
- Another possible data generating process (X2 → X1): Let ε3 ~ N(0, 1/2) and ε4 ~ N(0, 2) be two independent Gaussians. Then, define X1 = X2/2 + ε3 and X2 = ε4.
In both cases, the pair (X1, X2) follows the same joint Gaussian distribution, with marginals X1 ~ N(0, 1) and X2 ~ N(0, 2) and covariance 1, so one cannot distinguish between the two data generating processes unless one is willing to make further assumptions about how the data was generated.
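As a sanity check, here is a small simulation sketch (illustrative code, not part of the original analysis; N(μ, σ²) is read as mean and variance) showing that both processes produce data with the same joint distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# Process 1 (X1 -> X2): X1 = eps1, X2 = X1 + eps2, with eps1, eps2 ~ N(0, 1).
eps1 = rng.normal(0, 1, n)
eps2 = rng.normal(0, 1, n)
x1_a, x2_a = eps1, eps1 + eps2

# Process 2 (X2 -> X1): X2 = eps4, X1 = X2/2 + eps3, with eps3 ~ N(0, 1/2), eps4 ~ N(0, 2).
eps3 = rng.normal(0, np.sqrt(1 / 2), n)  # numpy takes standard deviations, hence the sqrt
eps4 = rng.normal(0, np.sqrt(2), n)
x2_b = eps4
x1_b = x2_b / 2 + eps3

# Both empirical covariance matrices are close to [[1, 1], [1, 2]]:
# the two datasets are statistically indistinguishable.
print(np.cov(x1_a, x2_a))
print(np.cov(x1_b, x2_b))
```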
Now, you may ask: why should we even care about such a simple two-variable example? Who cares whether X1 → X2 or X2 → X1? What if X1 = Smoking and X2 = Cancer, and we have the following observations?
| | Obs. 1 | Obs. 2 | Obs. 3 | Obs. 4 | Obs. 5 | Obs. 6 | … |
|---|---|---|---|---|---|---|---|
| Smoking | Yes | Yes | Yes | No | No | No | … |
| Cancer | No | Yes | Yes | No | No | Yes | … |
The question of whether smoking causes cancer was actually the subject of intense debate in the 1950s, and the renowned statistician Ronald Fisher even penned a Nature letter about it (emphasis mine)[6]:
“The curious associations with lung cancer found in relation to smoking habits do not, in the minds of some of us, lend themselves easily to the simple conclusion that the products of combustion reaching the surface of the bronchus induce, though after a long interval, the development of a cancer… Such results suggest that an error has been made, of an old kind, in arguing from correlation to causation, and that the possibility should be explored that the different smoking classes, non-smokers, cigarette smokers, cigar smokers, pipe smokers, etc., have adopted their habits partly by reason of their personal temperaments and dispositions, and are not lightly to be assumed to be equivalent in their genotypic composition…”
Source: https://www.nature.com/articles/182596a0.pdf
In short, Fisher was complaining that one cannot simply conclude that smoking causes cancer from observational data of the form collected above. For instance, he suggested that there may be some unobserved gene that both predisposes someone to enjoy smoking and causes cancer.
Plausible causal explanation for why smoking and cancer are correlated
One way to answer Fisher’s question is to perform randomized controlled trials (RCTs), the gold standard in scientific exploration, where we flip a fair coin for each person and force the person to smoke (or not)[7], and then measure whether the forced smokers have a higher chance of developing cancer as compared to the forced non-smokers. In causality language, RCTs are also known as interventions.
A hypothetical randomized controlled trial to check if smoking causes cancer.
Interventions
What are interventions, and what do they reveal about causal relationships within a system of variables?
In causality, interventions refer to the act of forcefully setting a variable to a specific value and observing what changes in the intervened system. Examples include the randomized controlled trials (RCTs) mentioned in the previous section and modifying genes using the CRISPR-Cas9 system[8].
Suppose you are a doctor who wishes to answer the health-related question: Does amount of exercise affect amount of food intake, or does amount of food intake affect amount of exercise, or are they not directly causally related? To determine whether exercise affects diet, a possible RCT would be to make different groups of trial patients exercise for different amounts of time and then measure the change in their food intake. Informally speaking[9], when we intervene on a variable, we are able to uncover the causal relationships between that variable and the other variables in the system.
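Returning to the two-variable example from earlier, here is a small simulation sketch (purely illustrative, not a real trial) of why an intervention settles the question: forcing X1 to a fixed value shifts X2 only in the world where X1 actually causes X2.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
do_x1 = 3.0  # value we force X1 to take in both hypothetical worlds

# World A (X1 -> X2): X2 = X1 + eps2, so do(X1 = 3) propagates to X2.
x2_world_a = do_x1 + rng.normal(0, 1, n)

# World B (X2 -> X1): X2 = eps4 is generated without reference to X1,
# so forcing X1 leaves the distribution of X2 untouched.
x2_world_b = rng.normal(0, np.sqrt(2), n)

print(x2_world_a.mean())  # close to 3: X2 responds to the intervention
print(x2_world_b.mean())  # close to 0: X2 does not respond
```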
Note that since interventions can be costly (or even unethical in certain cases), one would ideally want to use as few interventions as possible in any causal discovery endeavor.
Non-adaptive versus adaptive interventions
Given a set of variables of interest, a non-adaptive intervention setup decides all interventions in advance and then sets out to perform them. Meanwhile, an adaptive intervention setup decides on an intervention, performs it, and then observes the outcome to update its current causal knowledge before deciding on the next intervention to perform. It is known that there are causal graph structures where an adaptive approach uses exponentially fewer interventions than any non-adaptive one.
While non-adaptive intervention setups have been studied and are well understood in the context of causal graph discovery, relatively little was known about how to choose a good sequence of adaptive interventions given a set of variables of interest.
How to effectively select adaptive interventions to learn a causal graph?
One of the reasons that adaptive interventions were not yet well understood was the lack of a well-characterized performance measure for adaptive search algorithms. Without such a performance metric, it is hard to quantify or compare the efficacy of any proposed method.
Suppose you know the answer to a particular problem and wish to provide a sequence of actions to a friend who does not, so that your friend can recover the answer independently. What is the minimum number of actions you need to tell your friend to perform? In this problem of causal graph discovery via adaptive interventions, the answer is the true causal graph and the actions are adaptive interventions.
In a recent NeurIPS 2022 paper [R1], we answered the above question by characterizing a quantity called the verification number ν(G*), which measures the minimum number of adaptive interventions that is both necessary and sufficient when the true causal graph is G*. In the paper, we also showed how our characterization provides new perspectives on existing results in the causal discovery literature. Building upon this insight and existing methods in graph theory, we then provided a simple algorithm which provably incurs only a slight overhead[10], in terms of the number of adaptive interventions used compared to ν(G*), when uncovering the full causal relationships between a set of variables without knowing ν(G*) itself.
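To convey the flavor of such an adaptive procedure, here is a highly simplified, self-contained toy sketch (this is not the algorithm from [R1]; the greedy target choice and the simulated intervention outcome are assumptions made purely for illustration), where an ideal intervention on a node reveals the orientation of every unresolved edge incident to it:

```python
def adaptive_discovery(true_parents, skeleton):
    """true_parents: node -> set of parents in the hidden true DAG (unknown to the learner).
    skeleton: set of frozenset edges recovered from observational data."""
    oriented = set()            # directed edges (u, v) meaning u -> v
    unresolved = set(skeleton)
    num_interventions = 0
    while unresolved:
        # Adaptive step: greedily pick the node touching the most unresolved edges.
        candidates = {u for edge in unresolved for u in edge}
        target = max(candidates, key=lambda u: sum(u in edge for edge in unresolved))
        num_interventions += 1
        # Simulated outcome of an ideal intervention on `target`:
        # every unresolved edge incident to it becomes oriented.
        for edge in list(unresolved):
            if target in edge:
                u, v = tuple(edge)
                oriented.add((u, v) if u in true_parents.get(v, set()) else (v, u))
                unresolved.remove(edge)
    return oriented, num_interventions

# Example: hidden chain A -> B -> C, with observed skeleton {A-B, B-C}.
parents = {"B": {"A"}, "C": {"B"}}
skeleton = {frozenset({"A", "B"}), frozenset({"B", "C"})}
print(adaptive_discovery(parents, skeleton))  # a single intervention on B suffices here
```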
Some extensions and follow-up works
There are several possible extensions to the problem setting which we have described above, some of which we have explored in subsequent follow-up works.
- What if we are only concerned with uncovering the causal relationships of a subgraph amongst a large number of variables? [R2]
Suppose there is a HUGE number of variables and we are only interested in recovering the causal relationships within a small subgraph H of this huge network. Can we perform fewer interventions than what is necessary to learn the full causal relationships amongst all the variables?
- In many problem domains, there are domain experts that we can consult. Can we make use of their domain expertise (which may be imperfect) to improve the performance of our methods? [R3]
Suppose a domain expert looks at the equivalence class of graphs and pinpoints a particular graph G̃ as his/her belief of the true causal relationships. Can we use this domain knowledge to perform fewer interventions to uncover the true causal graph G*?
- While adaptive methods can require far fewer interventions than non-adaptive methods, they are inherently sequential as we need to wait for the outcome of one intervention before deciding on the next. What if we have strict deadlines? [R5]
In each adaptive round, we may decide to perform one batch of non-adaptive interventions. Given a maximum number of adaptive rounds, how do we achieve the minimum number of interventions necessary to solve our problem?
- What if we could perform more than one intervention at the same time? Furthermore, instead of counting the number of interventions performed, can we assign different costs to intervening on different variables? [R1,R2,R3,R4,R5]
In a health study, one could require a trial patient to run 10km and to eat an apple every day, effectively intervening on both the exercise and diet variables simultaneously. In causality terminology, this is called a non-atomic intervention. Meanwhile, it is more costly in practice to make a patient run 10km daily than to eat an apple a day, so it would be desirable to encode such differing costs in the problem setting directly. Our above discussions extend naturally to accommodate such settings.
Relations to AISG’s pillars and technical areas of interest
Within the field of machine learning, problems related to causality naturally arise when studying explainable AI, knowledge representation, and reasoning under uncertainty. By approaching the problem from a theoretical perspective, we were able to obtain optimal competitive ratios in a bid towards resource-efficient AI. Finally, our recent work involving domain experts explores aspects of collaborative AI and how one can design algorithms that harness domain-specific heuristics in a provable manner.
References
This article is based on the following joint work with Arnab Bhattacharyya, Themis Gouleakis, and Kirankumar Shiragur.
[R1] Davin Choo, Kirankumar Shiragur, and Arnab Bhattacharyya. Verification and search algorithms for causal DAGs. Neural Information Processing Systems (NeurIPS), 2022. [Link].
[R2] Davin Choo and Kirankumar Shiragur. Subset verification and search algorithms for causal DAGs. International Conference on Artificial Intelligence and Statistics (AISTATS), 2023. [Link].
[R3] Davin Choo, Themis Gouleakis, and Arnab Bhattacharyya. Active causal structure learning with advice. International Conference on Machine Learning (ICML), 2023. [Link].
[R4] Davin Choo and Kirankumar Shiragur. New metrics and search algorithms for weighted causal DAGs. International Conference on Machine Learning (ICML), 2023. [Link].
[R5] Davin Choo and Kirankumar Shiragur. Adaptivity complexity for causal graph discovery. Uncertainty in Artificial Intelligence (UAI), 2023. [Link].
Footnotes:
[1] https://en.wikipedia.org/wiki/Correlation_does_not_imply_causation
[2] https://www.tylervigen.com/spurious-correlations
[3] Or more hilariously, should an unscrupulous ice cream seller release more sharks onto the public to boost sales?
[4] Van Der Wal, R. C., Sutton, R. M., Lange, J., & Braga, J. P. (2018). Suspicious binds: Conspiracy thinking and tenuous perceptions of causal connections between co‐occurring and spuriously correlated events. European Journal of Social Psychology, 48 (7), 970-989. Available at: https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/ejsp.2507
[5] Here, we assume that there are no hidden confounders and some classic causal assumptions such as faithfulness are satisfied.
[6] See also https://priceonomics.com/why-the-father-of-modern-statistics-didnt-believe/ for a nice blog post about it.
[7] Clearly this is unethical and fortunately was not used to establish a link between smoking and lung cancer. See https://tobaccocontrol.bmj.com/content/21/2/87 for a brief historical account.
[8] https://en.wikipedia.org/wiki/CRISPR
[9] We assume perfect/ideal interventions, where intervening on a variable cuts off any causal influence from its parents, and that our measurements are able to detect the resulting changes.
[10] Our overhead is worst case necessary as there exists causal graphs where such an overhead is unavoidable even if the verification number ν(G*) is known.