We will walk through code that was submitted as part of my Master’s thesis “Interventional Causal Structure Learning With Gaussian Process Regression (GPR)” @ TUM. The R code is not optimized for speed or efficiency; it is scripted to demonstrate the performance of the causal discovery machinery. We will discuss the Bayesian score (chapter 4.3) and the Alternative score (chapter 4.4) for learning the GP-SCM (Gaussian Process Structural Causal Model, chapter 4.1). For details, precise notation and references, please consult the thesis, which can be found here along with the complete R scripts for both scores.
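The snippets below do not repeat the package setup. Judging from the functions that appear later (graph(), the data.table := syntax, kernelMatrix(), maxlogL(), quad.form(), optimr(), tr()), the following libraries are assumed to be loaded; this list is an inference, not part of the original scripts:
# assumed package setup (inferred from the functions used below)
library(igraph)          # graph()
library(data.table)      # := syntax, as.data.table()
library(kernlab)         # kernelMatrix(), class "kernel"
library(EstimationTools) # maxlogL()
library(emulator)        # quad.form()
library(optimx)          # optimr()
library(psych)           # tr()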
First, we define the true causal relations on which the simulations are based. We consider 3 variables \(X_1, X_2, X_3\)
n_var <- 3
all_var <- 1:n_var
all_var_names <- c("x_1", "x_2", "x_3")
with the following graph structure
g_true <- graph(edges = c(1,3, 2,3), n = n_var, directed = T)
and (noisy) functional dependencies \[\begin{equation} \begin{split} &X_1 := N_1 \sim \cal N(0,1)\\ &X_2 := N_2 \sim \cal N(1,1)\\ &X_3 := 2\tanh(X_1) + 4\tanh(X_2) + N_3,\; N_3 \sim \cal N(0,0.01) \end{split} \end{equation}\]
Equipped with knowledge of the true structural equations, we can generate samples from the underlying joint distribution in the case of no intervention
#' generate n samples (x_1, ...) from the joint underlying distribution
#'
#' @param scm input SCM (either the true observational one or some interventional one)
#' @param n sample size
#'
#' @return the samples (x_1, ...) as data table
joint_sample <- function(scm, n){
# evaluation of variables given the input SCM
x1 = scm$X1(n)
x2 = scm$X2(n)
x <- list(x_1 = x1,
x_2 = x2,
x_3 = scm$X3(x1, x2, n))
return(as.data.table(x))
}
and in the case of an (actual) intervention
#' generate n samples (default 1) from the interventional distribution P(.|do(X_i = x))
#'
#' @param scm_intv the SCM that's scheduled for intervention
#' @param i_do index of variable that is being intervened on
#' @param x_do intervention value on X_i_do
#' @param n number of returned samples from the interventional SCM
#'
#' @return the samples (x_1, ...) as data table
Do_sample <- function(scm_intv, i_do, x_do, n = 1){
# set the intervention variable to the intervention value
body(scm_intv[[i_do]]) <- substitute(x_do)
# get new samples
smp <- joint_sample(scm = scm_intv, n = n)
# tag the samples
smp <- smp[,doX := i_do,]
return(smp)
}
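The object scm_true used from here on is not defined in the excerpt. A minimal sketch, assuming the SCM is stored as a named list of functions that matches the interface expected by joint_sample and Do_sample and the structural equations above (the element names X1, X2, X3 are taken from joint_sample):
# sketch of the true SCM as a named list of functions (assumed representation);
# N_3 has variance 0.01, i.e. standard deviation 0.1
scm_true <- list(
  X1 = function(n) rnorm(n, mean = 0, sd = 1),
  X2 = function(n) rnorm(n, mean = 1, sd = 1),
  X3 = function(x1, x2, n) 2*tanh(x1) + 4*tanh(x2) + rnorm(n, mean = 0, sd = 0.1)
)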
We define a parametrized base kernel \(k_{base}(\cdot, \cdot): \mathbb{R} \times \mathbb{R} \rightarrow \mathbb R\). Here we use a (variant of the) squared exponential kernel, but depending on the application scenario any customized kernel could be considered.
#' customized base kernel: squared exponential (+ linear, turned off)
#'
#' @param v outer coefficient of exponential part
#' @param w inner coefficient (inverse squared length-scale) of exponential part
#' @param a coefficient of linear part
#' @param x first input
#' @param y second input
#'
#' @return kernel value
k_base <- function(v, w, a, x, y) return(v*exp(-w*((abs(x-y))^2)/2))
# define class from kernlab package
class(k_base) <- "kernel"
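As a quick illustrative check (not in the original code), the base kernel can be evaluated on a small grid with kernlab once the hyperparameters are fixed, using the same class(...) <- "kernel" idiom:
# illustrative only: evaluate k_base with fixed hyperparameters on a small grid
k_fix <- function(x, y) k_base(v = 1, w = 2, a = 0, x = x, y = y)
class(k_fix) <- "kernel"
K_test <- kernelMatrix(kernel = k_fix, x = as.matrix(seq(-1, 1, length.out = 5)))
dim(K_test)   # 5 x 5 symmetric matrix with ones on the diagonal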
With this definition we can determine our covariance matrices \(K_{\cdot,\cdot} \in \mathbb{R}^{n\times n}\) via the full covariance functions \(k(\cdot, \cdot): \mathbb{R}^D \times \mathbb{R}^D \rightarrow \mathbb R\) in the style of additive kernels (chapters 3.2 and 3.4)
#' generic kernel/covariance function with customized kernel with noise
#'
#' @param lhp list of log parameters v, w, a, k_add_o, vv
#' @param X_l data in design matrix form (columns = features resp. dimensions)
#' @param X_r secondary data for K(X_l,X_r) in design matrix form
#' @param vv_incl Boolean for inclusion of variance in resulting kernel matrix, only sensible for X_l == X_r
#'
#' @return the kernel/covariance matrix
cov_cust <- function(lhp, X_l, X_r, vv_incl = F){
# number of input dimensions
D = ncol(X_l)
# additive kernel
ker_cust <- function(x,y){
# base kernel values (see Duvenaud)
z <- k_base(v = exp(lhp$ex.v), w = exp(lhp$ex.w), a = exp(lhp$lin.a), x = x, y = y)
# full additive kernel w/ resp. weights up to full order D
# (not recursive)
k_add <- sum(exp(lhp$k_add_w) * sapply(1:D, function(o) es_poly(z, o)))
return(k_add)
}
# register the closure as a kernlab kernel so that kernelMatrix() dispatches on it
class(ker_cust) <- "kernel"
if(vv_incl){
# case of X_l == X_r, number of observations:
n <- nrow(X_l)
return(kernelMatrix(kernel = ker_cust, x = as.matrix(X_l),
y = as.matrix(X_r)) + diag(exp(lhp$vv), n))
}
else return(kernelMatrix(kernel = ker_cust, x = as.matrix(X_l), y = as.matrix(X_r)))
}
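The helper es_poly(z, o) is not shown in the excerpt. Following the additive-kernel construction referenced above (Duvenaud), it is presumably the o-th elementary symmetric polynomial of the per-dimension base kernel values z; a minimal, non-recursive sketch under that assumption:
# sketch (assumed): o-th elementary symmetric polynomial of the vector z,
# i.e. the sum over all products of o distinct entries of z
es_poly <- function(z, o){
  sum(apply(combn(length(z), o), 2, function(idx) prod(z[idx])))
}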
We also need derivatives with respect to the hyperparameters for hyperparameter optimization (omitted).
For the sources (nodes with no incoming edges) in a considered graph, we estimate the parameters of a simple Gaussian distribution. With real-world data, we would probably need to consider other options for modeling the univariate variables we currently regard as sources.
#' source node estimation w/ ML and simple univariate Gaussians
#'
#' @param xtrain univariate samples (expected still in data.table format)
#'
#' @return list of estimated mean and variance
est_source <- function(xtrain){
# simple ML scheme for Gaussian
f_Xloglike <- maxlogL(x = as.double(unlist(xtrain)), dist = 'dnorm', start=c(0, 1), lower=c(-4, 0.01), upper=c(4, 4))
# estimates for mean and variance
return(list(X_mea = f_Xloglike$fit$par[1], X_var = f_Xloglike$fit$par[2]^2))
}
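As an illustrative sanity check (not part of the original scripts, and assuming the scm_true sketch above), the source estimate for \(X_1\) should recover a mean near 0 and a variance near 1:
# illustrative call: fit the univariate Gaussian source model to samples of X_1
obs <- joint_sample(scm = scm_true, n = 100)
est_source(obs[, "x_1"])   # expect X_mea close to 0 and X_var close to 1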
For the dependant nodes (incoming edges exist) on the other hand, we model the functional relationship with GPR by maximizing the marginal log likelihood \[\begin{equation}\label{gppmlike}
\begin{split}
\log p(\textbf{y}|\bar{\textbf{x}}) = -\frac{1}{2} \textbf{y}^T (K_{\bar{\textbf{x}}, \bar{\textbf{x}}} + \sigma^2 I)^{-1} \textbf{y} - \frac{1}{2} \log \det(K_{\bar{\textbf{x}}, \bar{\textbf{x}}} + \sigma^2 I) - \frac{n}{2} \log(2\pi)
\end{split}
\end{equation}\] with data \(\bar{\textbf{x}}\) from the respective parents and data \(\textbf{y}\) from the dependant node itself
#' dependant node estimation: GPR w/ ML-II scheme for hyperparameter optimization
#'
#' @param ytrain univariate data of the dependant node
#' @param xtrain possibly multivariate data of parents
#' @param dep_par list of parameters for the regression situation (level of dep_param[[1]][[1]])
#'
#' @return list of estimates in standard form: optimized hyperparameters (para), convergence flag (conv) and covariance matrix (Cov)
est_dependant <- function(ytrain, xtrain, dep_par){
# ensure canonical form: column vectors for the various dimensions
ytrain <- t(t(ytrain))
xtrain <- t(t(xtrain))
# number of observations and dimensions
n = nrow(xtrain)
D = ncol(xtrain)
# get initial hyperparameter
log_hp <- dep_par$para
#------ functions
#' negative (marginal) GP log likelihood
#'
#' @param lpara log hyperparameter
#'
#' @return negative marginal log likelihood value
minus_maloglike <- function(lpara){
lhp <- list("ex.v" = lpara[1], "ex.w" = lpara[2], "lin.a" = lpara[3], "k_add_w" = lpara[4:(D+3)],
"vv" = lpara[(D+4)])
Cv <- cov_cust(lhp, X_l = xtrain, X_r = xtrain, vv_incl = T)
return(0.5*quad.form(solve(Cv), unlist(ytrain)) + 0.5*log(det(Cv)) + 0.5*n*log(2*pi))
}
#' gradient of the negative GP log likelihood
#'
#' @param lpara log hyperparameter
#'
#' @return gradient w.r.t. the log hyperparameters
D_minus_maloglike <- function(lpara){
lhp <- list("ex.v" = lpara[1], "ex.w" = lpara[2], "lin.a" = lpara[3], "k_add_w" = lpara[4:(D+3)],
"vv" = lpara[(D+4)])
Cv <- cov_cust(lhp, X_l = xtrain, X_r = xtrain, vv_incl = T)
Cv_inv <- solve(Cv)
D_Cv <- D_cov_cust(lhp, X = xtrain)
alph <- Cv_inv %*% as.matrix(ytrain)
# return derivative of minus loglikelihood
return(unlist(lapply( D_Cv, function(i) return(- 0.5 * tr(((alph %*% t(alph)) - Cv_inv) %*% i)))))
}
#------ main
lhp_init <- list("ex.v" = log(5), "ex.w" = log(10), "lin.a" = log(10),
"k_add_w" = log(rep(1/D, D)), "vv" = log(1))
o_res <- optimr(fn = minus_maloglike, gr = D_minus_maloglike, par = as.double(unlist(lhp_init)))
dep_par$para = list("ex.v" = o_res$par[1], "ex.w" = o_res$par[2], "lin.a" = o_res$par[3],
"k_add_w" = o_res$par[4:(D+3)], "vv" = o_res$par[(D+4)])
dep_par$conv = o_res$convergence
# calculate covariance matrices with optimized parameters
dep_par$Cov <- cov_cust(dep_par$para, X_l = xtrain, X_r = xtrain, vv_incl = T)
return(dep_par)
}
We need to specify all graphs we want to test (including those with independent nodes)
models <- list(
# v-structures
graph(edges = c(1,3, 2,3), n = 3, directed = T),
graph(edges = c(1,2, 3,2), n = 3, directed = T),
graph(edges = c(2,1, 3,1), n = 3, directed = T),
# reversed v-structures
graph(edges = c(1,2, 1,3), n = 3, directed = T),
graph(edges = c(2,1, 2,3), n = 3, directed = T),
graph(edges = c(3,1, 3,2), n = 3, directed = T),
# hierarchical lines
graph(edges = c(1,2, 2,3), n = 3, directed = T),
graph(edges = c(1,3, 3,2), n = 3, directed = T),
graph(edges = c(2,1, 1,3), n = 3, directed = T),
graph(edges = c(2,3, 3,1), n = 3, directed = T),
graph(edges = c(3,1, 1,2), n = 3, directed = T),
graph(edges = c(3,2, 2,1), n = 3, directed = T),
# v-structures plus parent dependency
graph(edges = c(1,2, 1,3, 2,3), n = 3, directed = T),
graph(edges = c(1,3, 1,2, 3,2), n = 3, directed = T),
graph(edges = c(2,1, 2,3, 1,3), n = 3, directed = T),
graph(edges = c(2,3, 2,1, 3,1), n = 3, directed = T),
graph(edges = c(3,1, 3,2, 1,2), n = 3, directed = T),
graph(edges = c(3,2, 3,1, 2,1), n = 3, directed = T),
# one independent variable
graph(edges = c(1,2), n = 3, directed = T),
graph(edges = c(2,1), n = 3, directed = T),
graph(edges = c(1,3), n = 3, directed = T),
graph(edges = c(3,1), n = 3, directed = T),
graph(edges = c(2,3), n = 3, directed = T),
graph(edges = c(3,2), n = 3, directed = T),
# all independent
graph(edges = c(), n = 3, directed = T)
)
and initialize a distribution on those graphs, here the (discrete) uniform distribution
# number of candidate graphs
n_models <- length(models)
prob_g <- rep(1/n_models, n_models)
We now examine the implementation of a Bayesian way of learning the causal structure (chapter 4.3) represented as a GP-SCM, which is essentially an SCM whose unknown functional dependencies are modeled as Gaussian Processes (chapter 4.1).
First, we generate a few observational samples
# current sample size
n <- 5
# generate 5 samples
data <- joint_sample(scm = scm_true, n)
# intervention indicator: 0 = no (obs), 1 = doX_1, 2 = doX_2, ...
data <- data[,doX := 0,]
Before we proceed with the Bayesian update for mixed data (i.e. observational and interventional samples), we need to identify the sources, the dependants and the parents of those dependants in all considered graphs
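The list adjm of adjacency matrices of all candidate graphs is not constructed in the excerpt; a minimal sketch, assuming dense matrices with entry \([i,j] = 1\) iff there is an edge \(i \rightarrow j\) (so that colSums() counts incoming edges):
# assumed construction of the adjacency matrices for all candidate graphs
adjm <- lapply(models, function(g) as.matrix(as_adjacency_matrix(g)))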
sources <- lapply(adjm, function(m) which(colSums(m) == 0))
dependants <- lapply(1:n_models, function(i_m) setdiff(all_var, sources[[i_m]]))
# result: list[models] of lists[dependant nodes] of lists[dependant node (1) and its parents (2)]
dep_parent <- lapply(1:n_models, function(i_m) lapply(dependants[[i_m]],
function(i_d) list("dep" = i_d, "parent" = which(adjm[[i_m]][, i_d] == 1))))
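For the true graph (model no. 1, \(X_1 \rightarrow X_3 \leftarrow X_2\)), for instance, this bookkeeping yields nodes 1 and 2 as sources and node 3 as the only dependant:
# illustrative contents for model no. 1:
# sources[[1]]    : 1 2
# dependants[[1]] : 3
# dep_parent[[1]] : list(list(dep = 3, parent = c(1, 2)))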
We conduct (random) experiments, more precisely atomic interventions (actively fixing one variable) of the form \(do\left(X_i := \tilde{N}\right)\) with \(\tilde{N} \sim \cal U[-1,1]\) (see chapters 2.3 and 2.4 for background on interventions and their role in causality)
for(i in 1:10){
# Choose interventions at random uniformly in [-1, 1]
x_intv <- runif(n_var, min = -1, max = 1)
for(i_v in 1:n_var) data = rbind(data, Do_sample(scm_intv = scm_true, i_do = i_v, x_do = x_intv[i_v], n = 1))
n = n + 1
}
We now have mixed data with columns x_1, x_2, x_3 and the indicator column doX, whose value specifies which variable was intervened on (0 for observational samples).
We estimate the parameters of the Gaussian (univariate) densities \(p(x)\) of the variables we identified as sources based on valid data (no intervention on that specific node)
# ESTIMATE all source parameter for all models considered (outer loop)
# and within each model for all source nodes (inner loop)
# result: list[models] of lists[source nodes] of lists[parameters of that source node]
source_param <- lapply(1:n_models, function(i_m) lapply(sources[[i_m]], function(i_s)
list("sou" = i_s, "para" = est_source(data[doX != i_s][,get(names(data)[i_s])]))))
Now for the dependant nodes, we estimate the various parameters of the Gaussian Processes we use to model the functional dependencies. The parameter list consists of the kernel hyperparameters and the additive noise variance assumed in “noisy GPR”
# INITIALIZE all dependent parameters for GPR
#' @param ex.v log linear weight for (exp)^2 part
#' @param ex.w log exp weight for (exp)^2 part
#' @param lin.a log linear weight lin part
#' @param k_add_w log vector of weights for the o-th order additive kernel respectively
#' @param vv log variance of GPR w/ normal noise
# i_d now list in style of "dep_parent[[1]][[1]]"
dep_param <- lapply(1:n_models, function(i_m) lapply(dep_parent[[i_m]], function(i_d) {
i_d[["para"]] <- list("ex.v" = log(5), "ex.w" = log(10), "lin.a" = log(10),
"k_add_w" = log(rep(1/length(i_d$parent), length(i_d$parent))), "vv" = log(1))
i_d[["conv"]] <- 0
i_d[["Cov"]] <- matrix()
return(i_d)}))
Then the estimation for the dependant variables amounts to optimizing the hyperparameters with an ML-II scheme in the GPR setting (chapter 3.3)
# ESTIMATE all dependant parameters via the Gaussian process likelihoods
dep_param <- lapply(1:(n_models-1), function(i_m) lapply(dep_param[[i_m]], function(i_d)
est_dependant(as.matrix(data[doX != i_d$dep])[,i_d$dep],
as.matrix(data[doX != i_d$dep])[,i_d$parent], i_d)))
In order to calculate the Bayesian score, we need to evaluate the total marginal likelihood for a graph \(\cal G\) with \(d\) nodes, data \(\cal D\) and optimized parameters collected in \(\theta\) (with \(\textbf{SO}\) as the set of source node indices) \[\begin{equation}\label{likepara}
\begin{split}
p(\mathcal{D}|\theta,\mathcal{G}) = \prod_{i=1}^d \; p(\textbf{x}_i|\bar{\textbf{pa}_i}) = \prod_{i \notin \textbf{SO}} \; \mathcal{N}\left(\textbf{x}_i|\textbf{0}, K_{\bar{\textbf{pa}_i}, \bar{\textbf{pa}_i}, {{\theta}}} + \sigma_{i,{{\theta}}}^2 I \right) \times \prod_{i \in \textbf{SO}} p_{{\theta}}(\textbf{x}_i)
\end{split}
\end{equation}\] where \(\textbf{x}_i\) and \(\bar{\textbf{pa}_i}\) gather the valid data of the nodes and of their parents used in the estimation procedures. This form follows from the Markov factorization of the GP-SCM as a graphical model.
For source nodes, we calculate the sum of log normal likelihoods from \(\prod_{i \in \textbf{SO}} p_{{\theta}}(\textbf{x}_i)\)
maloglike_sources <- lapply(1:n_models, function(i_m) sum(sapply(source_param[[i_m]], function(i_s){
return(sum(log(unlist(lapply(as.matrix(data[doX != i_s$sou])[,i_s$sou],
dnorm, mean = i_s$para$X_mea, sd = sqrt(i_s$para$X_var))))))})))
and for dependant nodes the sum of log GP likelihoods (multivariate Gaussians) from \(\prod_{i \notin \textbf{SO}} \; \mathcal{N}\left(\textbf{x}_i|\textbf{0}, K_{\bar{\textbf{pa}_i}, \bar{\textbf{pa}_i}, {{\theta}}} + \sigma_{i,{{\theta}}}^2 I \right)\)
maloglike_dependants <- lapply(1:(n_models-1), function(i_m)
sum(sapply(dep_param[[i_m]], function(i_d){
return(-0.5*quad.form(solve(i_d$Cov), as.matrix(as.matrix(data[doX != i_d$dep])[,i_d$dep]))
- 0.5*log(det(i_d$Cov)) - 0.5*n*log(2*pi))})))
The last model (all variables independent) has no dependants, so we set its contribution to zero, and obtain the total likelihood
# nothing for the last model of independent variables
maloglike_dependants[[n_models]] <- 0
totallike <- exp(unlist(maloglike_sources) + unlist(maloglike_dependants))
and with
# Bayesian normalization constant
norm_const <- sum(prob_g*totallike)
we can perform the Bayesian update for \(M\) distinct graphical models with graph structures \(\cal G_m\) \[\begin{equation}\label{bayesscore} \begin{split} p(\cal G_m|\cal D) = \frac{p(\cal D|\cal G_m) p(\cal G_m)}{\sum_{m=1}^M p(\cal D|\cal G_m) p(\cal G_m)} \end{split} \end{equation}\]
prob_g = prob_g*totallike/norm_const
prob_g
## [1] 3.479208e-05 3.173090e-23 5.893274e-26 1.009724e-22 5.249697e-16
## [6] 4.167997e-19 7.859656e-17 2.690704e-19 6.744243e-22 1.857752e-16
## [11] 1.564098e-22 1.177806e-18 1.302157e-01 1.812126e-19 8.697495e-01
## [16] 2.619772e-16 2.807047e-19 5.877644e-19 1.768059e-26 1.180939e-25
## [21] 2.697860e-26 4.179081e-26 2.100005e-20 4.711510e-23 4.724039e-30
We see that the true underlying graph (no. 1 in our list) could not be identified: the two graphs with an additional edge between \(X_1\) and \(X_2\) (no. 13 and 15) receive a higher posterior probability. This single example is by no means a definitive empirical proof of a systemic issue with the Bayesian score. Still, it points to a real problem: modeling the sources with a specific parametric distribution (in our case normal) while modeling the dependants with GPR (which has a higher fitting capacity) leads to an imbalance in the likelihood-based score, which in turn favors graphs with more edges.
As shown in the previous section, the Bayesian score has its limitations. In addition, it uses interventional data only implicitly. With the Alternative score, we aim to utilize the interventions for causal discovery in a more direct manner.
We mentioned earlier that the total likelihood formulation (in the Bayesian score) relies on the Markov factorization property for a graph \(\cal G\) of \(d\) nodes/variables \(x_i\) with parents \(\textbf{pa}_i^{\cal G}\) \[\begin{equation} \begin{split} p(x_1,...,x_d) = \prod_{i=1}^d p(x_i|\textbf{pa}_i^{\cal G}) \end{split} \end{equation}\] This product is altered by interventions: variables that are not part of the intervention keep their Markov kernels \(p(x_i|\textbf{pa}_i^{\cal G})\) unchanged (chapter 2.4). Yet only the true graph can guarantee that the proposed Markov kernels remain the same and that the dependence on the parents is not broken (Theorem 4.2). With this insight, we are able to falsify proposed sources and dependants by looking for interventions that violate those guarantees (Corollaries 4.4 and 4.5).
As in the Bayesian approach, we first acquire a few observational samples
# current sample size
n_obs <- 5
# generate 5 samples
data <- joint_sample(scm = scm_true, n_obs)
# intervention indicator: 0 = no (obs), 1 = doX_1, 2 = doX_2, ...
data <- data[,doX := 0,]
We conduct the interventions in two batches associated with two different characteristics and purposes. The first batch \(K_1\) consists of atomic interventions of the form \(do\left(X_i := \tilde{N}\right)\) with uniformly distributed noise \(\tilde{N}\) (comparable to the experiments of the Bayesian score) to identify dependants with their parents.
# number of interventions
n_intv <- 10
# total number of values that GPs are fitted on
n_total <- ((n_var-1)*n_intv) + n_obs
# Experiments: get interventional data of K_1
for(i in 1:n_intv){
# choose interventions at random uniformly in [-1, 1]
x_intv <- runif(n_var, min = -1, max = 1)
# get interventional data
for(i_v in 1:n_var) data = rbind(data, Do_sample(scm_intv = scm_true, i_do = i_v, x_do = x_intv[i_v], n = 1))
}
In the second batch \(K_2\) we intervene on every variable but one simultaneously, i.e. \(do(...,X_{i-1} := c_{i-1}, X_{i+1} := c_{i+1},...)\) with \(c_1,... \in \mathbb R\) (deterministic intervention)
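The sampler Do2_sample used below is not part of the excerpt. A minimal sketch, assuming it mirrors Do_sample, intervenes on two variables at once and tags the samples with doXi/doXj while leaving doX as NA (consistent with the filters used further below):
# sketch (assumed): sample from P(.|do(X_i = xi, X_j = xj)), analogous to Do_sample
Do2_sample <- function(scm_intv, i_do, xi_do, j_do, xj_do, n = 1){
  # fix both intervention variables to their intervention values
  body(scm_intv[[i_do]]) <- substitute(xi_do)
  body(scm_intv[[j_do]]) <- substitute(xj_do)
  # sample from the intervened SCM
  smp <- joint_sample(scm = scm_intv, n = n)
  # tag: doX = NA marks the K_2 batch, doXi/doXj record the intervention targets
  smp <- smp[, doX := NA][, doXi := i_do][, doXj := j_do]
  return(smp)
}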
# ... and get interventional data of K_2
# generate intervention data for fixed-value-interventions
data <- data[,doXi := NA,][,doXj := NA,]
# intervene on every variable w/ fixed intv value 0
for(not_v in 1:n_var){
i_v <- setdiff(1:n_var, not_v)[1]
j_v <- setdiff(1:n_var, not_v)[2]
# sample size (n_var-1)*n_intv to match sample size for GP = n_obs + (n_var-1)*n_intv
smp <- Do2_sample(scm_intv = scm_true, i_do = i_v, xi_do = 0, j_do = j_v, xj_do = 0, n = ((n_var-1)*n_intv))
data = rbind(data, smp)
}
We now have data with the additional columns doXi and doXj indicating the targets of the fixed-value interventions of \(K_2\) (for those samples, doX is NA).
For the dependants we estimate GPR parameters as in the Bayesian score, yet our understanding of how the resulting marginal GP likelihood helps to identify the correct causal structure is different. Based on our previous considerations, together with some reasonable ad-hoc assumptions, a GPR with the wrong parents will likely be fit on data that includes points which are not generated by the proposed Markov kernel.
The following figure illustrates the situation for a simplified simulation with only two variables \(X, Y\), the true graph \(X \rightarrow Y\) and some ground-truth functional dependency. Orange samples stem from the observational distribution, green samples from an intervention on \(X\) and blue samples from an intervention on \(Y\). The images show GP fits, i.e. black line: posterior mean, gray shade: region of the 95%-quantile of each marginal univariate normal distribution at the respective input \(x\) or \(y\). The upper image depicts a successful fit on observational samples and samples with an intervention on the input \(X\). The lower image shows a bad fit, accompanied by a low GP marginal likelihood, as we try to fit both observational samples and samples with an intervention on \(Y\); the latter samples (blue) are in reality associated with the marginal density of \(X\) at randomly sampled intervention values of \(Y\).
Hence, we code
# INITIALIZE all dependant parameters for GPR
#' @param ex.v log linear weight for (exp)^2 part
#' @param ex.w log exp weight for (exp)^2 part
#' @param lin.a log linear weight lin part
#' @param k_add_w log vector of weights for the o-th order additive kernel respectively
#' @param vv log variance of GPR w/ normal noise
# i_d now list in style of "dep_parent[[1]][[1]]"
dep_param <- lapply(1:n_models, function(i_m) lapply(dep_parent[[i_m]],
function(i_d) {
i_d[["para"]] <- list("ex.v" = log(5), "ex.w" = log(10), "lin.a" = log(10),
"k_add_w" = log(rep(1/length(i_d$parent), length(i_d$parent))), "vv" = log(1))
i_d[["conv"]] <- 0
i_d[["Cov"]] <- matrix()
return(i_d)
}))
# ESTIMATE all dependant parameters via the Gaussian process likelihoods
dep_param <- lapply(1:(n_models-1), function(i_m) lapply(dep_param[[i_m]],
function(i_d) est_dependant(as.matrix(data[doX != i_d$dep & !is.na(doX)])[,i_d$dep], as.matrix(data[doX != i_d$dep & !is.na(doX)])[,i_d$parent], i_d)))
For the sources, on the other hand, we don’t actually estimate a density for later (inferential) use, but only GPR hyperparameters belonging to a pseudo likelihood that utilizes the interventions of \(K_2\). This pseudo likelihood for the sources proposed by a given graph is set up in such a way that the situation is symmetric with respect to the GPR of the dependants, which avoids the imbalances of the likelihood-based score seen with the Bayesian approach.
# draw 15 additional observational samples, used only to estimate the sample variances
# for the pseudo sources later on equal grounds (not used otherwise)
data_obs_add <- joint_sample(scm = scm_true, 15)
#' pseudo source node estimation: GPR w/ ML-II scheme for hyperparameter optimization
#'
#' @param psou_data data table with observational data and data from fixed-value interventions on all other d-1 variables
#' @param psou_par list of parameters for the regression situation (level of psou_param[[1]][[1]])
#'
#' @return list of estimates in standard form: optimized hyperparameters, convergence flag, covariance matrix and the pseudo targets (ydat)
est_pseudo_source <- function(psou_data, psou_par){
# select correct (regular + additional) observational sample data, restricted to max. 3 variable case
if(psou_par$sou == 1){
psou_data <- psou_data[,-c(2,3)]
sou_data_obs_add <- data_obs_add[,1]
}
if(psou_par$sou == 2){
psou_data <-psou_data[,-c(1,3)]
sou_data_obs_add <- data_obs_add[,2]
}
if(psou_par$sou == 3){
psou_data <- psou_data[,-c(1,2)]
sou_data_obs_add <- data_obs_add[,3]
}
# calculate observational sample variance (corrected), also from additional observational data
smp_var_obs = as.double(var(rbind(psou_data[doX == 0][,1],sou_data_obs_add)))
# calculate interventional sample variance (corrected)
smp_var_intv = as.double(var(psou_data[is.na(doX)][,1]))
# set base noise variance equal to the noise variance around functional values in GPR settings (base level now fixed sd = 0.1, var = 0.01)
eps_base <- 0.01
# set additional noise according to a linear function:
# if the interventional sample variance is high (probably a source with fixed interventions) -> small additional noise, low penalty
# else -> additional variance as high as the spread of the entire variable 'smp_var_obs', high penalty
eps_add <- max(0,-smp_var_intv + smp_var_obs)
# calculate total noise as sum of base and additional noise
eps_total <- eps_add + eps_base
# generate pseudo GP fitting data, first the observational data with just base noise
data_pseudofit_obs <- list(x_1 = psou_data[doX == 0][,1],
x_2 = rnorm(n_obs, 0, sqrt(eps_base)))
# then, for the "interventional data", randomly distributed inputs with full noise eps_total (change with a different "support interval" if desired)
data_pseudofit_intv <- list(x_1 = runif(((n_var-1)*n_intv), min = -2, max = 2),
x_2 = rnorm(((n_var-1)*n_intv), 0, sqrt(eps_total)))
# entire data to do GP pseudo fit on
data_pseudofit <- rbind(as.data.table(data_pseudofit_obs), as.data.table(data_pseudofit_intv))
#plot(data_pseudofit)
# fit pseudo GP
res_psou_par <- est_dependant(ytrain = data_pseudofit$x_2, xtrain = data_pseudofit$x_1, dep_par = psou_par)
res_psou_par[["ydat"]] <- data_pseudofit$x_2
return(res_psou_par)
}
print("estimating pseudo source parameters...")
## [1] "estimating pseudo source parameters..."
# INITIALIZE all pseudo source parameters for GPR
#' @param ex.v log linear weight for (exp)^2 part
#' @param ex.w log exp weight for (exp)^2 part
#' @param lin.a log linear weight lin part
#' @param k_add_w log vector of weights for the o-th order additive kernel respectively
#' @param vv log variance of GPR w/ normal noise
# each i_s is built as a list analogous to the dep_param entries
psou_param <- lapply(1:n_models, function(i_m) lapply(sources[[i_m]],
function(i_s) {
l <- list()
l[["sou"]] <- i_s
l[["para"]] <- list("ex.v" = log(5), "ex.w" = log(10), "lin.a" = log(10),
"k_add_w" = log(1), "vv" = log(1))
l[["conv"]] <- 0
l[["Cov"]] <- matrix()
l[["ydat"]] <- c()
return(l)
}))
# ESTIMATE all pseudo source parameter via the Gaussian process likelihoods
psou_param <- lapply(1:n_models, function(i_m) lapply(psou_param[[i_m]],
function(i_s) est_pseudo_source(data[(doXi != i_s$sou & doXj != i_s$sou & is.na(doX)) | doX == 0], i_s)))
Now we combine the marginal likelihood from the dependants and the pseudo likelihood from the sources into one score similar to the Bayesian score
# log likelihoods using Markov factorization
# ... for sources
maloglike_psources <- lapply(1:n_models, function(i_m) sum(sapply(psou_param[[i_m]], function(i_s){
return(-0.5*quad.form(solve(i_s$Cov), t(t(i_s$ydat))) - 0.5*log(det(i_s$Cov)) - 0.5*n_total*log(2*pi))
})))
# ... for dependants (n_total valid samples per node)
maloglike_dependants <- lapply(1:(n_models-1), function(i_m) sum(sapply(dep_param[[i_m]], function(i_d){
return(-0.5*quad.form(solve(i_d$Cov), as.matrix(as.matrix(data[doX != i_d$dep & !is.na(doX)])[,i_d$dep])) - 0.5*log(det(i_d$Cov)) - 0.5*n_total*log(2*pi))
})))
# nothing for last model of independent variables
maloglike_dependants[[25]] <- 0
# ... total
totallike <- exp(unlist(maloglike_psources) + unlist(maloglike_dependants))
# normalization constant
norm_const <- sum(prob_g*totallike)
# probabilities for the individual graphs
prob_g = prob_g*totallike/norm_const
prob_g
## [1] 9.668430e-01 1.167868e-28 1.071893e-26 4.174243e-34 1.406684e-21
## [6] 1.258077e-34 1.729922e-26 6.800616e-31 5.243961e-27 3.428846e-22
## [11] 4.180669e-36 4.431193e-40 2.667431e-08 3.349585e-32 3.315698e-02
## [16] 1.035797e-21 6.086323e-37 1.942432e-38 1.177322e-26 3.886997e-31
## [21] 7.800949e-22 1.458611e-29 6.666393e-18 1.734197e-33 7.214091e-23
The true model (no. 1) is detected, as it features by far the highest posterior probability.
Inspecting an instructive example of a causal discovery task, we learned that the Bayesian score has fundamental deficiencies, which can be addressed by thinking in alternative ways. The Alternative score presented here solves the problem in this instance in an ad-hoc way and relies on some (reasonable) assumptions. Nevertheless, the core idea is promising in general: use statements in the vein of Theorem 4.1 and its corollaries to detect the causal structure by exploiting the unique interplay of the true causal structure and experimental data, combined with flexible regression methods such as GPR.