{"title": "Towards Hardware-Aware Tractable Learning of Probabilistic Models", "book": "Advances in Neural Information Processing Systems", "page_first": 13726, "page_last": 13736, "abstract": "Smart portable applications increasingly rely on edge computing due to privacy and latency concerns. But guaranteeing always-on functionality comes with two major challenges: heavily resource-constrained hardware; and dynamic application conditions. Probabilistic models present an ideal solution to these challenges: they are robust to missing data, allow for joint predictions and have small data needs. In addition, ongoing efforts in field of tractable learning have resulted in probabilistic models with strict inference efficiency guarantees. However, the current notions of tractability are often limited to model complexity, disregarding the hardware's specifications and constraints. We propose a novel resource-aware cost metric that takes into consideration the hardware's properties in determining whether the inference task can be efficiently deployed. We use this metric to evaluate the performance versus resource trade-off relevant to the application of interest, and we propose a strategy that selects the device-settings that can optimally meet users' requirements. We showcase our framework on a mobile activity recognition scenario, and on a variety of benchmark datasets representative of the field of tractable learning and of the applications of interest.", "full_text": "Towards Hardware-Aware Tractable\n\nLearning of Probabilistic Models\n\nLaura I. 
Galindez Olascoaga\u2020, Wannes Meert\u2021 , Nimish Shah\u2020\n\nMarian Verhelst\u2020 and Guy Van den Broeck\u2020\u2020\n\u2020 Electrical Engineering Department, KU Leuven\n\u2021 Computer Science Department, KU Leuven\n\n\u2020\u2020 Computer Science Department, University of California, Los Angeles\n\n{laura.galindez,nimish.shah,marian.verhelst}@esat.kuleuven.be\n\nwannes.meert@cs.kuleuven.be, guyvdb@cs.ucla.edu\n\nAbstract\n\nSmart portable applications increasingly rely on edge computing due to privacy\nand latency concerns. But guaranteeing always-on functionality comes with two\nmajor challenges: heavily resource-constrained hardware; and dynamic application\nconditions. Probabilistic models present an ideal solution to these challenges:\nthey are robust to missing data, allow for joint predictions and have small data\nneeds. In addition, ongoing efforts in the \ufb01eld of tractable learning have resulted\nin probabilistic models with strict inference ef\ufb01ciency guarantees. However, the\ncurrent notions of tractability are often limited to model complexity, disregarding\nthe hardware\u2019s speci\ufb01cations and constraints. We propose a novel resource-aware\ncost metric that takes into consideration the hardware\u2019s properties in determining\nwhether the inference task can be ef\ufb01ciently deployed. We use this metric to\nevaluate the performance versus resource trade-off relevant to the application\nof interest, and we propose a strategy that selects the device settings that can\noptimally meet users\u2019 requirements. We showcase our framework on a mobile\nactivity recognition scenario, and on a variety of benchmark datasets representative\nof the \ufb01eld of tractable learning and of the applications of interest.\n\n1\n\nIntroduction\n\nTractable learning aims to balance the trade-off between how well the resulting models \ufb01t the\navailable data and how ef\ufb01ciently queries are answered. 
Most implementations focus on maximizing model performance and only factor in query efficiency by subjecting the learning stage to a fixed tractability constraint (e.g. max treewidth [2]). While recent notions of tractability consider the cost of probabilistic inference as the number of arithmetic operations involved in a query [26, 27], they still disregard hardware implementation nuances of the target application. This is of special concern for edge computing on embedded applications, where the target algorithm must run on resource-constrained hardware, such as a small ARM or RISC-V embedded processor, or a microcontroller. For such architectures running a lightweight operating system, the overall compute cost is mostly determined by the cost of fundamental arithmetic operations, the interaction with sensor interfaces and the device's memory transactions [17, 12].
In addition, efforts towards hardware-efficient realizations of probabilistic inference are currently scarce [36, 21, 34]. This is in stark contrast with the tremendous progress achieved by embedded neural network implementations [37, 18, 29].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Arithmetic Circuit from a compiled noisy-OR and its mapping to hardware.

We address these limitations of the field of tractable learning by proposing a novel resource-aware cost metric that takes into consideration the target embedded device's properties (e.g. energy consumption) and system-level configuration (e.g. sensors used). We map these hardware characteristics to the cost vs. performance trade-off space, and propose a set of techniques to find the optimal system-level configuration. 
Specifically, we address the following points: (a) Section 3 discusses the relevant hardware-aware tractability metrics, and Section 4 defines the problem statement; (b) Section 5 discusses how to exploit the model's properties to exchange task performance for hardware efficiency, and introduces techniques to find the optimal set of system configurations in the cost vs. performance trade-off space; and (c) Section 6 shows practical examples of these optimal solutions. This work constitutes one of the first efforts to introduce the field of tractable probabilistic reasoning to the emerging domain of edge computing. This is motivated by probabilistic models' traits, several of which are ideal for portable applications that require reasoning on the edge: robustness to missing information, small data needs, joint predictions, and expert knowledge integration. Moreover, unlike fixed neural architecture training, tractable learning allows us to explicitly vary the level of complexity of the inference task, which in turn allows us to model resource tunability.

2 Background and motivation

We use standard notation: random variables are denoted by upper-case letters X and their instantiations by lower-case letters x. Sets of variables are denoted in bold upper case X and their joint instantiations in bold lower case x. Sets of variable sets are denoted with X.
The model representation of choice in this paper is the Arithmetic Circuit (AC), a state-of-the-art, compact representation for a variety of machine learning models such as probabilistic graphical models (PGMs) [6] and probabilistic programs [10]. Recent developments show how the structure of ACs can also be learned from data [24, 23]. Furthermore, ACs can be complemented with deep learning architectures [41, 28] to achieve the best of both worlds. 
An alternative representation of ACs is the Sum-Product Network (SPN), which can also encode probability distributions as a computational graph [32, 13]. SPNs can be efficiently converted to ACs and vice versa [33].

2.1 Probabilistic inference with Arithmetic Circuits

An AC is a directed acyclic graph where inner nodes represent addition or multiplication and leaf nodes are real-valued variables. ACs constitute a standard representation for computing polynomials, but they have proven to be efficient for reasoning over knowledge bases and probabilistic models when a number of additional properties are enforced on them [6]. Once the circuit is known, the complexity of executing the encoded formula is also known, since marginalization and partition function operations are polynomial in the size of the circuit [4], thus making ACs a well-suited representation for tractable learning. ACs represent a joint probability distribution over a set of random variables X: the leaf nodes are either binary indicator variables λX=x, where X ∈ X, or parameters θ. Figure 1 shows an example of an AC that encodes the joint probability distribution of a noisy-OR model [15].
This representation allows us to perform inference to answer a number of probabilistic queries. For example, given an instantiation f of F ⊆ X, the marginal probability Pr(f) can be computed by setting the indicator variables to 1 if they correspond to instantiations consistent with the observed values, λX=x ← 1x∼f, and subsequently performing an upward pass on the AC [4]. In a binary classification task, one can define a class variable C, a feature set F and a classification threshold T, assumed to be 0.5 in this work. 
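The upward-pass evaluation just described is simple enough to sketch in a few lines. The node encoding and the toy two-variable circuit below are assumptions of this illustration, not the paper's implementation:

```python
import math

class Node:
    """One AC node: '+', '*', 'theta' (parameter) or 'ind' (indicator)."""
    def __init__(self, kind, children=(), value=None, var=None, val=None):
        self.kind = kind
        self.children = list(children)
        self.value = value              # parameter value for 'theta' nodes
        self.var, self.val = var, val   # (variable, instantiation) for indicators

def upward(node, evidence):
    """One upward pass: indicators are 1 unless contradicted by the evidence."""
    if node.kind == 'theta':
        return node.value
    if node.kind == 'ind':
        return 1.0 if evidence.get(node.var, node.val) == node.val else 0.0
    vals = [upward(c, evidence) for c in node.children]
    return sum(vals) if node.kind == '+' else math.prod(vals)

# AC for the fully factorized distribution Pr(A, B) = Pr(A) Pr(B),
# with Pr(A=1) = 0.3 and Pr(B=1) = 0.6 (an illustrative toy model):
def ind(var, val): return Node('ind', var=var, val=val)
def theta(v): return Node('theta', value=v)
factor_a = Node('+', [Node('*', [ind('A', 1), theta(0.3)]),
                      Node('*', [ind('A', 0), theta(0.7)])])
factor_b = Node('+', [Node('*', [ind('B', 1), theta(0.6)]),
                      Node('*', [ind('B', 0), theta(0.4)])])
ac = Node('*', [factor_a, factor_b])

pr_a = upward(ac, {'A': 1})            # marginal Pr(A=1) = 0.3
pr_ab = upward(ac, {'A': 1, 'B': 1})   # joint Pr(A=1, B=1) = 0.18
pr_b = upward(ac, {'B': 1})
conditional = pr_ab / pr_b             # Pr(A=1 | B=1), via two upward passes
```

Dividing the results of two such passes yields a conditional probability, which is exactly how the classification task below is answered.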
For a given instance f, the task consists of selecting the class c for which the condition Pr(c | f) ≥ 0.5 is met. The conditional probability can be calculated by performing two upward passes on the AC1 that encodes Pr(C, F), after setting the indicator variables λ in accordance with instance f. ACs' straightforward mapping to embedded hardware, and the fact that they readily encode the algorithm necessary for inference, motivate our choice of this probabilistic model representation. Moreover, the process of learning them already entails a trade-off between their predictive performance and their computational efficiency. The following section motivates our proposed hardware-aware tractability metric.

Figure 2: Sensory embedded classification example.

2.2 Motivating example

Consider the mobile classification scenario in Figure 2, where the feature set F={FA1, FA2, FB1, FB2, FD1, FD2} is extracted from sensors A, B and D, and where the AC is assumed to be the most compact model that maximizes classification accuracy. Suppose there are two feature subsets available, F1={FA1, FB2, FD1} and F2={FB1, FB2, FD1}, and that they attain the same accuracy. Hence, the goal is to find the least expensive subset. The solution to this problem would clearly be F1 when considering only feature cost, a common approach to the problem of feature subset selection. But when also considering the costs of the sensors, F2 turns out to be a better choice, as sensor A is unused and can be turned off. 
This example shows that trade-off opportunities can be missed when failing to describe realistic hardware-aware system-level costs.

3 Hardware-aware cost

In this section we formalize the notion of hardware-aware cost, the basis of our optimization framework. Let α = ⟨+, ×, θ⟩ be an AC that encodes a joint probability distribution over variables F, extracted from the set of sensor interfaces S. The hardware-aware cost (CHA) is defined as:

CHA(α, S, F) = CAC(α) + Σ_{S∈S} CSI(S, FS),   (1)

where CAC are the computation costs, pertaining to inference on arithmetic circuit α, CSI are the sensor interfacing and feature extraction costs, and FS is the feature subset extracted from sensor S.

Computation costs. At a high level, a typical embedded hardware architecture entails two components: an on-chip main memory (typically SRAM), which commonly houses the algorithm's parameter set; and a processing unit, where operations are performed and intermediate values are cached in a local memory. Performing an upward pass on an AC involves the following actions (see Figure 1): 1) fetching parameters from the main memory, 2) performing arithmetic operations, consisting of additions and multiplications, 3) caching intermediate values in a local memory (e.g. register file or low-level cache) and 4) fetching intermediate values from local memory, as needed.2
Each action has a significantly different hardware resource cost. For example, post-synthesis energy models of a simple embedded CPU show that multiplications can require 4 times as much energy as additions, and memory transactions 5 times as much energy as multiplications [17]. When it
When it\n\n1Conditional probability can also be performed by an upward and a downward pass [6].\n2In this work we assume that the local cache size is suf\ufb01ciently large to store intermediate values, but not\nlarge enough to store parameters. However, for some learned circuits, there are about as many parameters as\nedges, so depending on the local memory size, one might need to store intermediate values also in main memory.\n\n3\n\n\fcomes to the design of embedded hardware, energy ef\ufb01ciency is indeed one of the main challenges to\naddress. Hence, we continue to focus on this resource as a proof of concept without loss of generality;\nexamples of other relevant hardware resource metrics are throughput and latency. It is evident that the\ntotal hardware cost of performing a pass on an AC must factor in all the aforementioned transactions.\nLet nb be the number of bits used to represents parameters \u03b8 and perform arithmetic operations +\nand \u00d7. The computation cost (CAC) of AC \u03b1 is de\ufb01ned as:\n\nCAC(\u03b1, nb) = C+(nb) + C\u00d7(nb) + Cmem(nb) + Ccache(nb),\n\n(2)\nwhere the terms in CAC de\ufb01ne the cost incurred by each type of operation. Here, C+ and C\u00d7 are the\ncosts of addition and multiplication; Cmem is the cost from fetching parameter leaf nodes from main\nmemory and Ccache is the cost from storing and fetching from local cache (as in Figure 1):\n\na[a (cid:54)=t + and a (cid:54)=t \u00d7] \u00b7 \u03c6mem(nb),\na[a =t + or a =t \u00d7] \u00b7 \u03c6cache(nb),\n\nC+(nb) =(cid:80)\nC\u00d7(nb) =(cid:80)\n\na[a =t +] \u00b7 \u03c6+(nb),\na[a =t \u00d7] \u00b7 \u03c6\u00d7(nb),\n\nCmem(nb) =(cid:80)\nCcache(nb) =(cid:80)\n\nwhere a denotes a node in \u03b1, the equality =t holds when node a matches the operation type on\nthe right side and [\u03b2] is equal to 1 when \u03b2 is true. The function \u03c6(.) 
describes the effective cost of the particular operation and can be derived from empirical benchmarks, customized to the target hardware [17, 35]. When expressing cost in terms of energy consumption, computation costs scale with the precision in number of bits used to represent parameters and perform arithmetic operations (nb), which is typically the same for all nodes in the AC. To conclude, the cost incurred by each node in an AC is determined by its type (whether addition, multiplication, local parameter fetch, or remote memory access) and the resolution of the operation or parameter (in nb).

Sensor interfacing costs. The computational block described above is often part of a larger system, which repeatedly performs a task based on external inputs or observations, such as classification. In this scenario, one must factor in the costs incurred by interfacing with the environment or the user. A sensory interface consists of a set of sensors S, which gather, process and digitize environmental information (typically in the analog domain), and a (typically digital) feature extraction unit, which generates the feature set F to be used by the machine learning algorithm. Let S be the set of available sensors and F the feature set extracted from them. The sensor interfacing cost (CSI) is:

CSI(S, FS) = CS(S) + Σ_{F∈FS} CF(F),   (3)

where CS describes the cost incurred by sensor S and CF the cost of extracting a feature F from the subset FS ⊆ F. The sensing cost function CS can be customized to the target platform and applications through measurements or data sheets. Note that, if no features from a given sensor are used, it can be shut down and its cost dropped (see Figure 1). 
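Putting Eqs. (1)-(3) together, the following sketch computes CHA for hypothetical operation counts and a hypothetical two-sensor front-end. The per-operation weights, the linear scaling with nb and all example numbers are assumptions for illustration, not measured values:

```python
def c_ac(n_add, n_mul, n_param, nb, phi=None):
    """Computation cost C_AC (Eq. 2): additions, multiplications, parameter
    fetches from main memory, and caching of every intermediate result.
    phi holds assumed relative energies at 64 bit; scaling with nb is
    simplified to linear here."""
    phi = phi or {'add': 0.1, 'mul': 0.6, 'mem': 1.0, 'cache': 0.2}
    scale = nb / 64.0
    return scale * (n_add * phi['add'] + n_mul * phi['mul']
                    + n_param * phi['mem'] + (n_add + n_mul) * phi['cache'])

def c_si(sensors, features):
    """Sensor-interface cost C_SI (Eq. 3), summed over sensors (Eq. 1):
    sensor cost plus per-feature extraction cost; a sensor none of whose
    features is used is switched off and contributes nothing."""
    total = 0.0
    for name, (sensor_cost, feat_costs) in sensors.items():
        used = [f for f in feat_costs if f in features]
        if used:
            total += sensor_cost + sum(feat_costs[f] for f in used)
    return total

def c_ha(n_add, n_mul, n_param, nb, sensors, features):
    return c_ac(n_add, n_mul, n_param, nb) + c_si(sensors, features)  # Eq. 1

# Hypothetical system: sensor A feeds FA1, sensor B feeds FB1 and FB2.
sensors = {'A': (5.0, {'FA1': 0.5}), 'B': (5.0, {'FB1': 0.5, 'FB2': 0.5})}
cost_f1 = c_ha(10, 12, 8, 32, sensors, {'FA1', 'FB2'})  # both sensors on
cost_f2 = c_ha(10, 12, 8, 32, sensors, {'FB1', 'FB2'})  # sensor A off
```

As in the motivating example of Section 2.2, the two feature subsets incur the same computation cost, yet the second is cheaper overall because sensor A can be shut down.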
In most systems, CF can be defined from the type and number of arithmetic and memory operations involved, in a similar fashion to the computation cost estimation CAC, as will be illustrated in the experiments (Section 6.1).

4 Problem statement

We have seen so far that CHA depends on four system properties: 1) the complexity of model α, determined by the number and type of its operations; 2) the size and type of the feature set F; 3) the size and type of the available sensor set S; and 4) the number of bits nb used within α. We refer to an instantiation of these four properties σ = {α, F, S, nb} as a system configuration. Clearly, the system configuration also determines the algorithm's performance, defined according to the application of interest. The methods proposed in this work can accommodate any performance metric or misclassification cost, but we will only consider accuracy, due to its generality. Specifically, we set the classification threshold to T = 0.5, and we consider the accuracy of the Bayes-optimal predictions (Acc) over a set of feature instantiations {f1, ..., fl}.
The example in Section 2.2 asks us to identify the system configuration that incurs the lowest cost for a desired accuracy. Similarly, we might be interested in the configuration that achieves the highest accuracy for a given cost constraint. Thus, the problem we aim to address is how to select the system configurations that map to the Pareto frontier of the hardware-cost vs. accuracy trade-off space. The inputs to our problem are the class variable C, the available feature set F and sensor set S, and the set of available precisions nb. The output is the set of Pareto-optimal system configurations σ* = {{α*i, F*i, S*i, nb*i}}i=1:p.

5 Trade-off space search

We propose to search the cost vs. 
accuracy trade-off space by scaling four properties (see Section 4):
Model complexity scaling. We learn a set of ACs α of increasing complexity. Each maps to a specific classification accuracy and computation cost CAC (see Eq. 2). Although discriminative AC learners have shown state-of-the-art classification accuracy [24], we have opted for a generative learning strategy: the LearnPSDD algorithm [23]. The motivation for this choice is twofold: this algorithm improves the model incrementally, and each iteration already leads to a valid AC that can be used to populate the set α. Moreover, the learned ACs encode a joint probability distribution, which ensures they are robust to missing data, as demanded by the application range of interest: embedded reasoning tasks must often deal with missing values, either due to malfunction (e.g., a sensor is blocked in an autonomous driving system), or to enforce hardware-cost efficiency (e.g., when energy consumption is excessive, the driving system has the choice to turn off an expensive sensor and the features extracted from it).
Feature and sensor set scaling. We scale the feature set F by sequentially pruning individual features (see Section 5.1). The effect of feature pruning on classification accuracy has been discussed in numerous works [11, 5, 22], and its impact on the hardware-aware cost is clear from Eq. 3. Pruning features can also have an impact on the computation costs CAC: if a variable is always unobserved, its indicator variables are fixed (see Section 2.1), which we exploit to simplify the circuit (see Algorithm 2). In addition, a sensor S ∈ S can be pruned when none of the features originating from it is used anymore; a strategy that has not been explored by the state of the art, but that is straightforward with our approach, since it considers the full system.
Precision scaling. 
We consider four different standard IEEE 754 floating-point representations, as they can be implemented in almost any embedded hardware platform. Reducing the precision of arithmetic operations and numerical representations entails information loss and results in performance degradation [35]. The effect on computation costs CAC is clear from Eq. 2.

5.1 Search strategy

Finding the smallest possible AC that computes a given function is Σᵖ₂-hard [3], thus computationally harder than NP. No single optimal solution is known for this problem; it is a central question in the field of knowledge compilation [7]. Optimizing for the lowest-cost/highest-accuracy AC further increases the complexity. We therefore opt for a greedy optimization algorithm. Specifically, we rely on a series of heuristics to search the trade-off space. In each step, we independently scale one of the available configuration properties ⟨α, F, S, nb⟩, as described in the previous section, and aim to find its locally optimal setting. The search begins by learning the model set α = {αk}k=1:n. Then, as shown by Algorithm 1, starting from each model αk, we perform a greedy neighborhood search that aims to maximize cost savings and minimize accuracy losses by sequentially pruning the sets F and S, and simplifying αk accordingly (Algorithm 2). At each iteration, we evaluate the accuracy and cost of m feature subset candidates, where each considers the impact of removing one feature from the user-defined prunable set Fprun ⊆ F. We then select the feature and sensor subsets Fsel ⊆ F, Ssel ⊆ S and the simplified model αsel that maximize the objective function OF = acc/costnorm, where costnorm is the evaluated hardware-aware cost CHA, normalized according to the maximum achievable cost (that of the most complex model available, αn). 
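The circuit simplification invoked at each candidate evaluation (Algorithm 2) can be sketched as follows. The tuple-based node encoding, topologically ordered with children before parents, is an assumption of this sketch, not the paper's data structure:

```python
import math

def prune_ac(nodes, kept_features):
    """Sketch of PruneAC: nodes is a list of ('ind', feature),
    ('theta', value), ('+', child_ids) or ('*', child_ids) tuples, with child
    ids referring to earlier positions. Indicators of pruned features are
    fixed to the constant 1, and any +/x node whose children all became
    constants is folded into a single constant."""
    out = []
    for kind, payload in nodes:
        if kind == 'ind' and payload not in kept_features:
            out.append(('theta', 1.0))              # always-marginalized indicator
        elif kind in ('+', '*'):
            kids = [out[i] for i in payload]
            if all(k == 'theta' for k, _ in kids):  # constant folding
                vals = [v for _, v in kids]
                out.append(('theta', sum(vals) if kind == '+' else math.prod(vals)))
            else:
                out.append((kind, payload))
        else:
            out.append((kind, payload))
    return out

# Toy factor (lambda_F1 * 0.25 + lambda_not_F1 * 0.75): pruning feature F1
# folds the whole factor into the constant 1, shrinking the circuit.
ac = [('ind', 'F1'), ('ind', 'F1'), ('theta', 0.25), ('theta', 0.75),
      ('*', [0, 2]), ('*', [1, 3]), ('+', [4, 5])]
pruned = prune_ac(ac, kept_features=set())
```

Because every pruned indicator becomes a constant, the folding step propagates upward and removes entire sub-circuits, which is where the per-feature computation-cost savings reported in the experiments come from.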
Note that feature subset selection drives sensor subset selection Ssel, as described before and as defined in lines 12 and 19 of Algorithm 1.
The output of Algorithm 1, ⟨F(k), S(k), M(k)⟩, is a set of system configurations of the form {{Fsel,1, Ssel,1, αsel,1}, ..., {Fsel,q, Ssel,q, αsel,q}}, where q = |Fprunable| and the superscript (k) denotes the number of the input model αk, taken from α. For each configuration resulting from Algorithm 1, we can sweep the available precision configurations nb, for a final space described by σ = ⟨F, S, M, N⟩ of size |α| · |Fprunable| · |N|, where N contains the selected precisions. In the experimental section we show a work-around to reduce the search space size and the number of steps required by the Pareto-optimal search. Regarding complexity, the feature selection in Algorithm 1 is a greedy search, thus its complexity is linear in the number of features times the number of iterations needed for convergence.3 The AC pruning routine consists of an upward pass on the AC and its complexity is therefore linear in the size of the AC.

3In Alg. 
1 the user can provide the desired accuracy or cost as the while-loop break criterion.

Algorithm 1: ScaleSI(αk, Fprun, Sprun)
Input: αk: the kth model in α; Fprun, Sprun: the sets of prunable features and sensors.
Output: ⟨Fk, Sk, Mk⟩, acc, cost: the kth collection of pruned feature, sensor and model sets, with their accuracy and cost.
1  Fsel ← Fprun, Ssel ← Sprun, αsel ← αk
2  ⟨Fk, Sk, Mk⟩ ← ⟨Fsel, Ssel, αsel⟩
3  accsel = Acc(αsel, Fsel), costsel = CHA(αsel, Fsel, Ssel)
4  ⟨acc, cost⟩ ← ⟨accsel, costsel⟩
5  while |Fsel| > 1 do
6      obmax ← 0                            // Initialize objective value
7      foreach F ∈ Fprun do
8          Fca ← Fsel \ F
9          Sca ← Ssel
10         foreach S ∈ Sprun do
11             if Fca ∩ FS = ∅ then
12                 Sca ← Sca \ S            // Prune sensor
13         αca ← PruneAC(αsel, Fca)
14         accca ← Acc(αca, Fca)
15         costca ← CHA(αca, Fca, Sca)
16         obca ← OF(accca, costca)
17         if obca > obmax then
18             obmax ← obca
19             Fsel ← Fca, Ssel ← Sca, αsel ← αca
20             accsel ← accca, costsel ← costca
21     Fk.insert(Fsel), Sk.insert(Ssel), Mk.insert(αsel)
22     acc.insert(accsel), cost.insert(costsel)
23 return ⟨Fk, Sk, Mk⟩, acc, cost

Algorithm 2: PruneAC(α, F)
Input: α: the input AC; F: the observed feature set used to guide the pruning of α.
Output: αpr: the pruned AC.
1  αpr ← copy(α)
   /* Loop through AC, children before parents */
2  foreach a in αpr do
3      if a is an indicator variable λF=f and F ∉ F then
4          replace a in αpr by a constant 1
5      else if a is + or × with constant children then
6          replace a in αpr by an equivalent constant
7  return αpr

5.2 Pareto-optimal configuration selection

Algorithm 3: 
GetPareto(σ, acc, cost)
Input: σ, acc, cost: the configuration set, their accuracy and cost.
Output: σ*, acc*, cost*: the Pareto-optimal configurations, their accuracy and cost.
1  ⟨cost*, σ*, acc*⟩ ← ⟨{}, {}, {}⟩
   /* Sort according to ascending cost */
2  ⟨cost, σ, acc⟩ ← sorted(⟨cost, σ, acc⟩)
3  i ← |σ| + 1
4  while i > 0 do
5      i ← arg max acc0:i
6      σ*.insert(σi)
7      acc*.insert(acci)
8      cost*.insert(costi)
9      i ← i − 1
10 return σ*, acc*, cost*

Algorithm 3 describes how we extract the Pareto-optimal configuration subset, but any convex hull algorithm can be used. The input is the configuration set σ = ⟨F, S, M, N⟩ and the corresponding accuracies (acc) and costs (cost). The output of this algorithm is the set of Pareto-optimal system configurations σ* = {{α*i, F*i, S*i, nb*i}}i=1:p, each guaranteed to achieve the largest reachable accuracy for any given cost, or the lowest reachable cost for any given accuracy (acc*, cost*).
Note that the framework introduced thus far can balance the trade-off between hardware-aware cost and any other application-specific performance metric, by simply replacing the accuracy terms in Algorithms 1, 2 and 3 with such a metric. For instance, medical applications often aim to balance precision and recall, and may favor the latter at night, under scarce medical supervision. Furthermore, our framework can be used for density estimation tasks by deploying the model complexity scaling followed by the precision scaling stage, and forgoing the pruning stages of Algorithms 1 and 2, in order to keep the full joint distribution. 
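The Pareto-front extraction of Algorithm 3 can be sketched as an equivalent "skyline" sweep: sort configurations by ascending cost and keep a configuration whenever it strictly improves on the best accuracy seen so far. The configuration labels below are illustrative placeholders:

```python
def get_pareto(configs, acc, cost):
    """Return the (config, acc, cost) triples on the cost-accuracy Pareto
    front: every kept point has the highest accuracy among all
    configurations of equal or lower cost."""
    order = sorted(range(len(configs)), key=lambda i: cost[i])
    pareto, best_acc = [], float('-inf')
    for i in order:
        if acc[i] > best_acc:   # a dominated point (cheaper and better exists) drops out
            pareto.append((configs[i], acc[i], cost[i]))
            best_acc = acc[i]
    return pareto

points = get_pareto(['s1', 's2', 's3', 's4'],
                    acc=[0.90, 0.95, 0.88, 0.96],
                    cost=[1.0, 3.0, 2.0, 4.0])
# 's3' is dominated by 's1' (higher cost, lower accuracy) and is dropped
```

A single sorted sweep suffices here because, with one cost and one accuracy dimension, dominance reduces to a running maximum over the cost-ordered sequence.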
The next section illustrates how our methodology can reap the benefits of scalable embedded hardware.

6 Experimental evaluation

We empirically evaluate the proposed techniques on a relevant embedded sensing use case: the Human Activity Recognition (HAR) benchmark [1]. Additionally, we show our method's general applicability on a number of other publicly available datasets [8, 14, 20, 25, 30], two of them commonly used as density estimation benchmarks and the rest commonly used for classification (see Table 1).4

Computational costs. The computation costs CAC are based on the energy benchmarks discussed in [17] and [35]. Table 2 shows the relative costs of each term in CAC and how they scale with precision nb. The baseline is 64 floating-point bits, because it is the standard IEEE representation in software environments. For the rest of the experiments, we consider three other standard low-precision representations: 32 bits (8 exponent and 24 mantissa), 16 bits (5 exponent and 11 mantissa) and 8 bits (4 exponent and 4 mantissa) [19].

4Code available at https://github.com/laurago894/HwAwareProb.

Dataset pre-processing. For the classification benchmarks, we discretized numerical features using the method in [9]. We then binarized them using a one-hot encoding and subjected them to a 75%-train, 10%-validation and 15%-test random split. For the HAR benchmark, we preserved the time-series information by using the first 85% of the samples for training and validation and the remaining samples for testing. For the density estimation datasets, we used the splits provided in [25], and we assumed the last feature in the set to be the class variable. 
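The reduced-precision formats listed above can be emulated in software by rounding a double to a given number of stored mantissa bits. The sketch below is an illustrative model only: exponent range limits and rounding-mode details of a real hardware format are ignored.

```python
import math

def quantize(x, mantissa_bits):
    """Round x to the nearest value representable with `mantissa_bits`
    stored mantissa bits (e.g. 4 for the 8-bit format above)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)               # x = m * 2**e, with 0.5 <= |m| < 1
    step = 2.0 ** (mantissa_bits + 1)  # +1 accounts for the implicit leading bit
    return math.ldexp(round(m * step) / step, e)

theta64 = 0.3                          # a parameter at full precision
theta8 = quantize(theta64, 4)          # 4-bit mantissa: 0.296875
```

Applying such a rounding to every parameter θ (and, more coarsely, to the arithmetic itself) is one simple way to estimate the accuracy degradation that precision scaling trades against the CAC savings of Table 2.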
On all datasets, we performed wrapper feature selection (evaluating the features' value on a Tree Augmented Naive Bayes classifier) before going through the hardware-aware optimization process, to avoid over-fitting on the baseline model and ensure it is a fair reference point. The number of effectively used features |F| is shown in Table 1. In addition, we consider all the features to be in the prunable set Fprun for datasets with fewer than 30 features. For the rest, we consider the 20 with the highest correlation to the class variable. Within the context of an application, the prunable set can be user-defined. For instance, in a multi-sensor seizure detection application, medical experts might advise against pruning features extracted from an EEG monitor.

Table 1: Experimental datasets (†: classification, ⋆: density estimation).

Dataset     |α|  |F|  |Fprun|
Banknote†   11   15   15
HAR†        11   28   28
Houses†     11   36   20
Jester⋆     11   99   20
Madelone†   11   20   20
Nltcs⋆      11   15   15
Six-HAR†    11   54   20
Wilt†       11   11   11

Table 2: Experiment computational costs.

Model learning. We learned the models on the train and validation sets with the LearnPSDD algorithm [23], using the same settings reported therein, and following the bottom-up vtree induction strategy. To populate the model set α, we retained a model after every N/10 iterations, where N is the number of iterations needed for convergence (in the algorithm, this is until the log-likelihood on validation data stagnates). Table 1 shows |α| for each dataset. 
Furthermore, as a baseline, we trained a Tree Augmented Naive Bayes (TAN) classifier and compiled it to an AC.5

Operation   Operation cost                Relative cost at 64 bits
Cmem        φmem = γmem · nb              1
Ccache      φcache = γcache · nb          0.2
C×          φ× = γ×² · nb² · log(nb)      0.6
C+          φ+ = γ+ · nb                  0.1

6.1 Embedded Human Activity Recognition

The HAR dataset aims to recognize the activity a person is performing, based on statistical and energy features extracted from smartphone accelerometer and gyroscope data. We perform binary classification by discerning "walking downstairs" from the other activities. For the experiments, we use a total of 28 binary features, 8 of which are extracted from the gyroscope's signal and the rest from the accelerometer, as detailed in Appendix B.4. All computation costs for this dataset are normalized according to the energy consumption trends of an embedded ARM M9 CPU, assuming 0.1 nJ per operation [38]. Sensors are estimated to consume 2 mW, while the cost of each feature is defined as 30 MAC operations (see Appendix B.1 for more details).
Pareto-optimal configuration. This experiment consisted of three stages, performed on the training set (Figure 3(a)): 1) We first mapped each model in α to the trade-off space, as shown in black. 2) Starting from each model, we scaled the feature and sensor sets F, S, as shown in blue. 3) We then scaled the precision nb of each of these pruned configurations (shown by the grey curves) and finally extracted the Pareto front, shown in red. As shown by the Pareto configurations highlighted in green, our method preserves the highest baseline train-set accuracy while pruning 11 of the available 28 features, which results in CHA savings of 53%. 
When the user is willing to tolerate a further accuracy loss of 0.4%, our method outputs a configuration that consumes only 13% of the original cost by using a smaller model (α3), pruning 18 features, turning one sensor off and using a 32-bit representation. Figures 3(c,d) break down the computational cost CAC and the sensor costs CSI. When considering only the costs of the AC evaluation (graph (c)), our method yields savings of almost two orders of magnitude with accuracy losses lower than 1%, and up to three orders of magnitude when tolerating larger losses. Indeed, these computational cost savings result from the joint effects of precision reduction and of simplifying the AC with Algorithm 2, with savings of about 2 to 10% per feature pruned (see Appendix B.2). Sensor and feature costs, as shown in graph (d), can only be reduced by up to 50%, since at least one of the sensors must always be operating. This demonstrates the importance of taking these costs into account: even though the computational cost savings are impressive, the system is still limited by the sensing costs.

⁵ Using the ACE compiler available at http://reasoning.cs.ucla.edu/ace/.

[Figure 3: Experiments on the Human Activity Recognition benchmark.]

Robustness and online deployment. The red curve in Figure 3(b) shows that the evaluation of the selected Pareto configurations on testing data stays within a range of ±1% with respect to Figure 3(a). Comparing our method to the TAN classifier, denoted by the cyan marker, we can see that it provides further cost-saving opportunities while achieving competitive accuracy. We also assessed the robustness of our method by simulating, per configuration, ten iterations of random failure of feature sets of varying sizes (|F|/10, |F|/5, |F|/2). The green and magenta dotted curves show the median of these experiments for the Pareto configurations and for the original model set.
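The Pareto-front extraction in the final stage of the search can be sketched as a standard dominance filter over (cost, accuracy) pairs. This is a generic illustration of the selection criterion, not the authors' implementation:

```python
def pareto_front(configs):
    """Return the Pareto-optimal subset of (cost, accuracy) configurations.
    A configuration is dropped if some other configuration is no more
    expensive and at least as accurate (i.e. it is weakly dominated)."""
    front = []
    for c in configs:
        dominated = any(
            o != c and o[0] <= c[0] and o[1] >= c[1] for o in configs
        )
        if not dominated:
            front.append(c)
    return sorted(front)
```

Applied to the full set of (model, feature-set, sensor-set, precision) candidates mapped into the trade-off space, this filter yields the red front of Figure 3(a).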
These trials stay within a range of −2% compared to the fully functional results in red and black, which validates our choice of a generative learner that can naturally cope with missing features at prediction time.

In embedded sensing scenarios, environmental circumstances, power consumption requirements and accuracy constraints commonly vary over time. This calls for dynamic operating settings, where the system can switch between different accuracy-cost operating points at run-time. Figure 3(e) shows such a scenario for nine operating points selected off-line (highlighted with circular markers in the background graph), which comply with hypothetical user needs of accuracy > 95% and normalized cost < 20%. The implemented policy assumes an energy-efficient mode when the classifier has recently identified that there is no ongoing activity, and a high-reliability (and costlier) mode when it has identified that there is ongoing activity (see Appendix B.3 for more details). The foreground of Figure 3(e) contrasts the test-set cost-accuracy performance attained when always using the same model (in red) with the cost-accuracy performance resulting from the implementation of our model-switching policy (purple cross). Even with its simplicity, the proposed policy attains accuracy vs. cost improvements that go beyond the static Pareto front. Figure 3(f) shows that this is achieved by making balanced use of the nine available configurations. Note that this switching action incurs overhead only in terms of memory, since the set of Pareto switching configurations is always determined off-line and is only fetched when needed. In most portable applications, predictions must be made at a much higher frequency than configuration changes are necessary [12]. The incurred memory overhead in our experiments is less than 3% of the total cost, since model switching is only necessary on 120 out of the 1470 predictions.

Table 3: Results for benchmarking datasets [CAC, Acctr%, Accte%].

Dataset    Operating pt. 1    Operating pt. 2    Operating pt. 3    Operating pt. 4    TAN
Banknote   [1, 94.5, 95.6]    [0.09, 94.5, 95.6] [0.04, 89.9, 93.1] [0.01, 84.5, 86.8] [0.24, 93.8, 92.2]
Houses     [1, 97.6, 97.4]    [0.12, 97.6, 97.3] [0.04, 97.1, 96.6] [0.01, 94.3, 94.0] [0.05, 97.2, 97.1]
Jester     [1, 75.6, 76.4]    [0.35, 75.6, 76.4] [0.12, 74.7, 75.7] [0.02, 72.6, 73.1] [0.12, 73.1, 72.3]
Madelone   [1, 68.1, 68.4]    [0.05, 68.6, 69.1] [0.02, 66.9, 68.8] [0.01, 62.6, 62.9] [0.13, 66.0, 65.7]
Nltcs      [1, 93.5, 93.9]    [0.19, 93.6, 93.8] [0.03, 93.4, 94.2] [0.01, 91.7, 92.0] [0.11, 91.4, 91.9]
Six-HAR    [1, 91.5, 89.8]    [0.38, 91.6, 89.9] [0.15, 89.3, 89.3] [0.04, 89.8, 89.8] [0.36, 91.7, 90.3]
Wilt       [1, 97.1, 97.5]    [0.07, 97.1, 97.5] [0.03, 97.1, 97.5] [0.01, 96.9, 97.5] [0.25, 97.1, 97.5]

6.2 Generality of the method: evaluation on benchmark datasets

We now apply our optimization sequence to the datasets in Table 1. Lacking information on the hardware that originated these datasets, we only consider the computational cost CAC, again evaluated on the cost model of the ARM M9 CPU. Table 3 shows this cost along with the training and testing accuracy (Acctr, Accte) at four operating points for every dataset. Note that we have also included the six-class HAR benchmark, to demonstrate the applicability of our method beyond binary classification. We can see that all the benchmarks strongly benefit from our proposed methodology, that they are all robust when evaluated on the test dataset, and that they are competitive when compared to a TAN classifier.
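The run-time model-switching policy used in the activity-recognition scenario above can be sketched as follows. The configuration labels ("P1", "P4") and the window length are hypothetical placeholders; the actual policy is detailed in Appendix B.3.

```python
def select_config(recent_predictions, low_cost_cfg, high_reliability_cfg):
    """Pick the next operating point: stay in an energy-efficient mode while
    the classifier has recently detected no ongoing activity (all predictions
    are 0), and switch to a costlier, more reliable mode as soon as activity
    is observed."""
    return high_reliability_cfg if any(recent_predictions) else low_cost_cfg

def run_policy(predictions, window=3, low="P1", high="P4"):
    """Replay a prediction stream and count configuration switches, i.e.
    how often a new configuration would have to be fetched from memory."""
    chosen, switches = [], 0
    for t in range(len(predictions)):
        cfg = select_config(predictions[max(0, t - window):t], low, high)
        if chosen and cfg != chosen[-1]:
            switches += 1
        chosen.append(cfg)
    return chosen, switches
```

Because switches are rare relative to predictions (120 out of 1470 in our experiments), the fetch overhead of such a policy stays small even though each configuration is stored off-line.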
Appendix A shows a figure with the Pareto fronts for all the experiments presented here.

7 Related work

The problem of hardware-efficient probabilistic inference has been addressed from several perspectives by both the probabilistic-models and the embedded-hardware communities. The works by Tschiatschek and Pernkopf [39] and Piatkowski et al. [31] propose reduced-precision and integer representation schemes for PGMs as a strategy to address the constraints of embedded devices. In [35], Shah et al. propose a framework that automatically chooses an appropriate energy-efficient low-precision representation and generates custom hardware. Other efficient hardware implementation efforts have been made by Zermani et al. [42], Schumann et al. [34], and Sommer et al. [36], who have proposed to accelerate inference on SPNs, capitalizing on their tractable properties.

Our work constitutes an effort to integrate the expertise from both communities under a unified framework, which considers the impact of all scalable aspects of the model in order to optimize it in a hardware-aware fashion. To that end, it leverages the properties of the selected AC representation. This representation enables the use of our framework with any probabilistic model that is compute-efficient at prediction time: see [16] by Holtzen et al. and [40] by Vlasselaer et al. for examples of probabilistic program compilation to ACs, and [27] by Lowd and Rooshenas on how to perform efficient inference with Markov networks represented as ACs.

8 Conclusions

We proposed a novel hardware-aware cost metric to deal with the limitations of the efficiency vs. performance trade-off considered by the field of tractable learning. Our method obtains the Pareto-optimal system-configuration set in the hardware-aware cost vs. accuracy space. The proposed solution consists of a sequential hardware-aware search and a Pareto-optimal configuration selection stage.
Experiments on a variety of benchmarks demonstrated the effectiveness of the approach, which sacrifices little to no accuracy for significant cost savings. This opens up opportunities for the efficient implementation of probabilistic models on resource-constrained edge devices operating in dynamic environments.

Acknowledgements

We thank Yitao Liang for the LearnPSDD algorithm and for helpful feedback. This work is partially supported by the EU ERC Project Re-SENSE under Grant ERC-2016-STG-71503; NSF grants #IIS-1633857, #CCF-1837129, DARPA XAI grant #N66001-17-2-4032, NEC Research, and gifts from Intel and Facebook Research.

References

[1] D. Anguita, A. Ghio, L. Oneto, X. Parra, and J. L. Reyes-Ortiz. A public domain dataset for human activity recognition using smartphones. In 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN), 2013.

[2] F. R. Bach and M. I. Jordan. Thin junction trees. In Advances in Neural Information Processing Systems, pages 569–576, 2002.

[3] D. Buchfuhrer and C. Umans. The complexity of boolean formula minimization. In International Colloquium on Automata, Languages, and Programming, pages 24–35. Springer, 2008.

[4] M. Chavira and A. Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7):772–799, 2008.

[5] Y. Choi and G. Van den Broeck. On robust trimming of Bayesian network classifiers. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), July 2018.

[6] A. Darwiche. Modeling and Reasoning with Bayesian Networks. Cambridge University Press, 2009.

[7] A. Darwiche and P. Marquis. A knowledge compilation map. Journal of Artificial Intelligence Research, 17:229–264, 2002.

[8] D. Dua and C. Graff. UCI machine learning repository, 2017.

[9] U. Fayyad and K. Irani.
Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence (IJCAI), 1993.

[10] D. Fierens, G. Van den Broeck, I. Thon, B. Gutmann, and L. De Raedt. Inference and learning in probabilistic logic programs using weighted CNFs. Theory and Practice of Logic Programming, 15, 2015.

[11] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29(2-3):131–163, 1997.

[12] L. Galindez Olascoaga, K. Badami, J. Vlasselaer, W. Meert, and M. Verhelst. Dynamic sensor-frontend tuning for resource efficient embedded classification. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 8(4):858–872, Dec 2018.

[13] R. Gens and P. Domingos. Learning the structure of sum-product networks. In International Conference on Machine Learning, pages 873–880, 2013.

[14] I. Guyon and S. Gunn. NIPS feature selection challenge, 2003.

[15] D. Heckerman. Causal independence for knowledge acquisition and inference. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pages 122–127. Elsevier, 1993.

[16] S. Holtzen, T. Millstein, and G. Van den Broeck. Symbolic exact inference for discrete probabilistic programs. In Proceedings of the ICML Workshop on Tractable Probabilistic Modeling (TPM), June 2019.

[17] M. Horowitz. 1.1 Computing's energy problem (and what we can do about it). In 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pages 10–14, Feb 2014.

[18] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.

[19] IEEE. IEEE standard for floating-point arithmetic.
IEEE Std 754-2008, pages 1–70, Aug 2008.

[20] B. A. Johnson, R. Tateishi, and N. T. Hoan. A hybrid pansharpening approach and multiscale object-based image analysis for mapping diseased pine and oak trees. International Journal of Remote Sensing, 34(20):6969–6982, 2013.

[21] O. U. Khan and D. D. Wentzloff. Hardware accelerator for probabilistic inference in 65-nm CMOS. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24(3):837–845, 2016.

[22] P. Khosravi, Y. Liang, Y. Choi, and G. Van den Broeck. What to expect of classifiers? Reasoning about logistic regression with missing features. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), Aug 2019.

[23] Y. Liang, J. Bekker, and G. Van den Broeck. Learning the structure of probabilistic sentential decision diagrams. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2017.

[24] Y. Liang and G. Van den Broeck. Learning logistic circuits. In Proceedings of the 33rd Conference on Artificial Intelligence (AAAI), 2019.

[25] D. Lowd and J. Davis. Learning Markov network structure with decision trees. In 2010 IEEE International Conference on Data Mining, pages 334–343. IEEE, 2010.

[26] D. Lowd and P. Domingos. Learning arithmetic circuits. arXiv preprint arXiv:1206.3271, 2012.

[27] D. Lowd and A. Rooshenas. Learning Markov networks with arithmetic circuits. In Artificial Intelligence and Statistics, pages 406–414, 2013.

[28] R. Manhaeve, S. Dumančić, A. Kimmig, T. Demeester, and L. De Raedt. DeepProbLog: Neural probabilistic logic programming. In Advances in Neural Information Processing Systems 31 (NeurIPS), 2018.

[29] B. Moons, D. Bankman, L. Yang, B. Murmann, and M. Verhelst. BinarEye: An always-on energy-accuracy-scalable binary CNN processor with all memory on chip in 28nm CMOS.
In 2018 IEEE Custom Integrated Circuits Conference (CICC), pages 1–4. IEEE, 2018.

[30] R. K. Pace and R. Barry. Sparse spatial autoregressions. Statistics & Probability Letters, 33(3):291–297, 1997.

[31] N. Piatkowski, S. Lee, and K. Morik. Integer undirected graphical models for resource-constrained systems. Neurocomputing, 173:9–23, 2016.

[32] H. Poon and P. Domingos. Sum-product networks: A new deep architecture. In 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops), pages 689–690. IEEE, 2011.

[33] A. Rooshenas and D. Lowd. Learning sum-product networks with direct and indirect variable interactions. In International Conference on Machine Learning, pages 710–718, 2014.

[34] J. Schumann, K. Y. Rozier, T. Reinbacher, O. J. Mengshoel, T. Mbaya, and C. Ippolito. Towards real-time, on-board, hardware-supported sensor and software health management for unmanned aerial systems. International Journal of Prognostics and Health Management, 2015.

[35] N. Shah, L. I. Galindez Olascoaga, W. Meert, and M. Verhelst. ProbLP: A framework for low-precision probabilistic inference. In Proceedings of the 56th Annual Design Automation Conference 2019, page 190. ACM, 2019.

[36] L. Sommer, J. Oppermann, A. Molina, C. Binnig, K. Kersting, and A. Koch. Automatic mapping of the sum-product network inference problem to FPGA-based accelerators. In 2018 IEEE 36th International Conference on Computer Design (ICCD), pages 350–357. IEEE, 2018.

[37] M. Tan, B. Chen, R. Pang, V. Vasudevan, and Q. V. Le. MnasNet: Platform-aware neural architecture search for mobile. arXiv preprint arXiv:1807.11626, 2018.

[38] S. Tarkoma, M. Siekkinen, E. Lagerspetz, and Y. Xiao. Smartphone Energy Consumption: Modeling and Optimization. Cambridge University Press, 2014.

[39] S. Tschiatschek and F. Pernkopf.
On Bayesian network classifiers with reduced precision parameters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4):774–785, 2015.

[40] J. Vlasselaer, G. Van den Broeck, A. Kimmig, W. Meert, and L. De Raedt. Tp-compilation for inference in probabilistic logic programs. International Journal of Approximate Reasoning, June 2016.

[41] J. Xu, Z. Zhang, T. Friedman, Y. Liang, and G. Van den Broeck. A semantic loss function for deep learning with symbolic knowledge. In Proceedings of the 35th International Conference on Machine Learning (ICML), July 2018.

[42] S. Zermani, C. Dezan, R. Euler, and J.-P. Diguet. Bayesian network-based framework for the design of reconfigurable health management monitors. In 2015 NASA/ESA Conference on Adaptive Hardware and Systems (AHS), pages 1–8. IEEE, 2015.