Probabilistic programming

Probabilistic logic programming

Inference

Learning

Applications

Users specify a probabilistic model in its entirety (e.g., by writing code that generates a sample from the joint distribution) and inference follows automatically given the specification.

PP languages provide the full power of modern programming languages for describing complex distributions

Reuse of libraries of models

Interactive modeling

Abstraction

Source: https://en.wikipedia.org/wiki/Probabilistic_programming_language

Only three of the languages listed there are logic-based; the others are imperative, functional, or object-oriented.

In 2013, DARPA released the funding call “Probabilistic Programming for Advancing Machine Learning” (PPAML).

Aim: develop probabilistic programming languages and accompanying tools to facilitate the construction of new machine learning applications across a wide range of domains.

Focus: functional PP

What are we missing?

Is logic programming to blame?

Probabilistic logic programming is alive and kicking!

Relationships are first-class citizens

Conceptually easier to lift

Strong semantics

Inductive systems

Handling non-termination

Continuous variables

Non-termination is possible when the number of explanations for the query is infinite.

Church code from http://forestdb.org/models/arithmetic.html:

```
(define (random-arithmetic-fn)
  (if (flip 0.3)
      (random-combination (random-arithmetic-fn)
                          (random-arithmetic-fn))
      (if (flip)
          (lambda (x) x)
          (random-constant-fn))))

(define (random-combination f g)
  (define op (uniform-draw (list + -)))
  (lambda (x) (op (f x) (g x))))

(define (random-constant-fn)
  (define i (sample-integer 10))
  (lambda (x) i))
```

LPAD (`cplint`) code from http://cplint.lamping.unife.it/example/inference/arithm.pl:

```
% eval(X,Y): Y is the result of applying the random function to X
eval(X,Y) :-
    random_fn(X,0,F),
    Y is F.
% each node L combines its children with + or - (probability 0.5 each)
op(L,+):0.5;op(L,-):0.5.
random_fn(X,L,F) :-
    comb(L), random_fn(X,l(L),F1), random_fn(X,r(L),F2),
    op(L,Op), F=..[Op,F1,F2].
random_fn(X,L,F) :-
    \+ comb(L), base_random_fn(X,L,F).
% a node is a combination with probability 0.3
comb(_):0.3.
% a leaf is the identity with probability 0.5, else a random constant
base_random_fn(X,L,X) :- identity(L).
base_random_fn(_X,L,C) :- \+ identity(L), random_const(L,C).
identity(_):0.5.
random_const(_,C):discrete(C,[0:0.1,1:0.1,2:0.1,3:0.1,4:0.1,
    5:0.1,6:0.1,7:0.1,8:0.1,9:0.1]).
```

Aim: given observed input-output pairs of the random function, predict the output for a new input.

Arbitrarily complex functions have a non-zero probability of being selected

The program has a non-terminating execution.

Exact inference: infinite number of explanations

```
(define (sample)
  (rejection-query
   (define my-proc (random-arithmetic-fn))
   (my-proc 2)
   (= (my-proc 1) 3)))
(hist (repeat 100 sample))
```

Use cyclic explanation graphs (T. Sato, P. Meyer, Infinite probability computation by cyclic explanation graphs, Theor. Pract. Log. Prog. 14, 2014) or probabilistic tabling (A. Gorlin, C. R. Ramakrishnan, S. A. Smolka, Model checking with probabilistic tabled logic programming, Theor. Pract. Log. Prog. 12(4-5), 2012)

or resort to sampling: as complexity increases, the probability of a function tends to 0, so an infinite trace has probability 0.

Metropolis-Hastings: (Nampally, A., Ramakrishnan, C. R.: Adaptive MCMC-based inference in probabilistic logic programs. arXiv preprint arXiv:1403.6036, 2014)

Monte Carlo sampling is attractive for the simplicity of its implementation and because you can improve the estimate as more time is available.
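For example, a plain Monte Carlo estimate of the probability of `eval(2,4)` can be computed in `cplint` (a minimal sketch, assuming the LPAD above is loaded under the `mcintyre` inference module):

```
?- mc_sample(eval(2,4),1000,Prob).
```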

The disjunctive clause

\(C_r=H_1:\alpha_1\vee \ldots\vee H_n:\alpha_n\leftarrow L_1,\ldots,L_m.\)

is transformed into the set of clauses \(MC(C_r)\)

\(\begin{array}{l} MC(C_r,1)=H_1\leftarrow L_1,\ldots,L_m,sample\_head(n,r,VC,NH),NH=1.\\ \ldots\\ MC(C_r,n)=H_n\leftarrow L_1,\ldots,L_m,sample\_head(n,r,VC,NH),NH=n.\\ \end{array}\)
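For instance, applying the transformation to the `op/2` clause of the arithmetic program above gives (a sketch; here `r` stands for the clause index and `[L]` is the list of clause variables):

```
op(L,+) :- sample_head(2,r,[L],NH), NH=1.
op(L,-) :- sample_head(2,r,[L],NH), NH=2.
```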

The truth value of the query `Q` is sampled with code of the form `... (call(Q) -> NT1 is NT+1 ; NT1 = NT), ...`
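A minimal sketch of the surrounding sampling loop (`sample_n/4` is a hypothetical helper, not the actual `cplint` implementation):

```
% sample_n(Q,N,NT0,NT): take N samples of query Q, accumulating
% the number of successes from NT0 into NT (hypothetical sketch)
sample_n(_Q,0,NT,NT) :- !.
sample_n(Q,N,NT0,NT) :-
    N > 0,
    (call(Q) -> NT1 is NT0+1 ; NT1 = NT0),
    N1 is N-1,
    sample_n(Q,N1,NT1,NT).
```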

A Markov chain is built by taking an initial sample and by generating successor samples.

The initial sample is built by randomly sampling choices so that the evidence is true.

A successor sample is obtained by deleting a fixed number of sampled probabilistic choices.

Then the evidence is queried; if it succeeds, the query goal is sampled in turn.

The sample is accepted with a probability of \(\min\{1,\frac{N_0}{N_1}\}\) where \(N_0\) (\(N_1\)) is the number of choices sampled in the previous (current) sample.
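For example, if the previous sample involved \(N_0=4\) choices and the current one \(N_1=6\), the new sample is accepted with probability \(\min\{1,4/6\}\approx 0.67\).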

In `cplint`:

`?- mc_mh_sample(eval(2,4),eval(1,3),100,100,3,T,F,P).`

Probability of `eval(2,4)` given that `eval(1,3)` is true:

```
F = 90,
T = 10,
P = 0.1
```
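Here the probability is estimated as the fraction of successful samples: \(P = T/(T+F) = 10/100 = 0.1\).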

You can also try rejection sampling (usually slower)

```
?- mc_rejection_sample(eval(2,4),eval(1,3),100,T,F,P).
```

You may be interested in the distribution of the output

In `cplint`:

```
?- mc_mh_sample_arg_bar(eval(2,Y),eval(1,3),100,100,3,Y,V).
```

You may be interested in the expected value of the output

In `cplint`:

```
?- mc_mh_expectation(eval(2,Y),eval(1,3),100,100,3,Y,E).
E = 3.21
```
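The expectation is estimated as the sample mean of the values of `Y` in the collected samples.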

Distributional clauses (B. Gutmann, I. Thon, A. Kimmig, M. Bruynooghe, and L. De Raedt, “The magic of logical inference in probabilistic programming,” Theory and Practice of Logic Programming, 2011)

Gaussian mixture model in `cplint`:

```
heads:0.6;tails:0.4.
g(X): gaussian(X,0,1).
h(X): gaussian(X,5,2).
mix(X) :- heads, g(X).
mix(X) :- tails, h(X).
```

Inference by sampling

With no evidence, or with evidence on discrete random variables only, the same methods can be reused.

Sampling the arguments of goals builds an empirical probability density of the arguments.

```
heads:0.6;tails:0.4.
g(X): gaussian(X,0,1).
h(X): gaussian(X,5,2).
mix(X) :- heads, g(X).
mix(X) :- tails, h(X).

?- mc_sample_arg(mix(X),10000,X,L0),
   histogram(L0,40,Chart).
```

With evidence on continuous random variables, you cannot use rejection sampling or Metropolis-Hastings, as the probability of the evidence is 0.

You can use likelihood weighting to obtain samples of continuous arguments of a goal. (Nitti, D., De Laet, T., De Raedt, L.: Probabilistic logic programming for hybrid relational domains. Mach. Learn. 103(3), 407-449 2016)

For each sample to be taken, likelihood weighting samples the query and then assigns a weight to the sample on the basis of evidence.

The weight is computed by deriving the evidence backward in the same sample used for the query, starting with a weight of one.

Each time a discrete choice must be made or a continuous variable sampled, if the choice/variable has already been sampled, the current weight is multiplied by the probability of the choice (discrete case) or by the density value at the sampled value (continuous case).
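In the Gaussian example below, a sample of the query fixes the mean \(\mu\); deriving the evidence (the observations 9 and 8, with variance \(\sigma^2=2\)) then yields the weight \(w=\mathcal{N}(9;\mu,\sigma^2)\cdot\mathcal{N}(8;\mu,\sigma^2)\).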

Problem from http://www.robots.ox.ac.uk/~fwood/anglican/examples/viewer/?worksheet=gaussian-posteriors

Estimate the true value of a Gaussian distributed random variable, given some observed data.

The variance is known and we suppose that the mean has itself a Gaussian distribution with mean 1 and variance 5 (prior on the parameter)

We take different measurements (e.g., at different times), indexed by an integer.

Anglican code:

```
(def dataset [9 8])

(defquery gaussian-model [data]
  (let [mu (sample (normal 1 (sqrt 5)))
        sigma (sqrt 2)]
    (doall (map (fn [x] (observe (normal mu sigma) x)) data))
    mu))

(def posterior
  ((conditional gaussian-model :smc :number-of-particles 10) dataset))

(def posterior-samples (repeatedly 20000 #(sample* posterior)))
```

The same model in `cplint`:

```
value(I,X) :-
    mean(M),
    value(I,M,X).

mean(M): gaussian(M,1.0,5.0).
value(_,M,X): gaussian(X,M,2.0).

?- mc_sample_arg(value(0,Y),10000,Y,L0),
   mc_lw_sample_arg(value(0,X),(value(1,9),value(2,8)),10000,X,L),
   densities(L0,L,40,Chart).
```

Parameter learning

Structure learning is more developed for PLP, but see:

(Perov, Yura N., and Frank D. Wood. Learning Probabilistic Programs. arXiv preprint arXiv:1407.2646 2014).

(Lake, Brenden M., Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science 350.6266 2015).

(Gaunt, Alexander L., et al. TerpreT: A Probabilistic Programming Language for Program Induction. arXiv preprint arXiv:1608.04428 2016).

Problem: given a set of interpretations and a program, find the parameters maximizing the likelihood of the interpretations (or of the instances of a target predicate).
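In symbols, with interpretations \(I_1,\ldots,I_K\) and parameters \(\theta\), the goal is \(\hat{\theta}=\arg\max_{\theta}\sum_{k=1}^{K}\log P_{\theta}(I_k)\).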

Exploit the equivalence with Bayesian networks (BN) to use BN learning algorithms.

The interpretations record the truth value of ground atoms, not of the choice variables

Since the choice variables are unseen, relative frequency cannot be used.

(Thon et al. ECML 2008) proposed an adaptation of EM for CPT-L, a simplified version of LPADs

The algorithm computes the counts efficiently by repeatedly traversing the BDDs representing the explanations

(Ishihata et al. ILP 2008) independently proposed a similar algorithm

LFI-ProbLog (Gutmann et al. ECML 2011) is the adaptation of EM to ProbLog

EMBLEM (Riguzzi & Bellodi IDAJ 2013) adapts (Ishihata et al. ILP 2008) to LPADs

Given a trivial or empty LPAD and a set of interpretations (data), *find the model and the parameters* that maximize the probability of the data (log-likelihood).

SLIPCOVER: Structure LearnIng of Probabilistic logic programs by searching OVER the clause space:

Beam search in the space of clauses to find the promising ones

Greedy search in the space of probabilistic programs, guided by the LL of the data

*Parameter learning* by means of EMBLEM
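In `cplint`, both steps are exposed by the `slipcover` library (a minimal sketch, assuming a learning file with background knowledge, an initial program, and a fold named `train` already defined):

```
% parameter learning with EMBLEM on the fold train
?- induce_par([train],P).
% structure learning with SLIPCOVER on the fold train
?- induce([train],P).
```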

Link prediction: given a (social) network, compute the probability of the existence of a link between two entities (UW-CSE)

```
advisedby(X,Y):0.3 :-
    publication(P,X),
    publication(P,Y),
    student(X).
```

Classify web pages on the basis of the link structure (WebKB)

```
coursePage(Page1): 0.3 :- linkTo(Page2,Page1),coursePage(Page2).
coursePage(Page1): 0.3 :- linkTo(Page2,Page1),facultyPage(Page2).
...
coursePage(Page): 0.3 :- has('abstract',Page).
...
```

Entity resolution: identify identical entities in text or databases

```
samebib(A,B):0.3 :-
    samebib(A,C), samebib(C,B).
sameauthor(A,B):0.3 :-
    sameauthor(A,C), sameauthor(C,B).
sametitle(A,B):0.3 :-
    sametitle(A,C), sametitle(C,B).
samevenue(A,B):0.3 :-
    samevenue(A,C), samevenue(C,B).
samebib(B,C):0.3 :-
    author(B,D), author(C,E), sameauthor(D,E).
samebib(B,C):0.3 :-
    title(B,D), title(C,E), sametitle(D,E).
samebib(B,C):0.3 :-
    venue(B,D), venue(C,E), samevenue(D,E).
samevenue(B,C):0.3 :-
    haswordvenue(B,word_06),
    haswordvenue(C,word_06).
...
```

Chemistry: given the chemical composition of a substance, predict its mutagenicity or its carcinogenicity

```
active(A):0.5 :-
    atm(A,B,c,29,C),
    gteq(C,-0.003),
    ring_size_5(A,D).
active(A):0.5 :-
    lumo(A,B), lteq(B,-2.072).
active(A):0.5 :-
    bond(A,B,C,2),
    bond(A,C,D,1),
    ring_size_5(A,E).
active(A):0.5 :-
    carbon_6_ring(A,B).
active(A):0.5 :-
    anthracene(A,B).
...
```

Medicine: diagnose diseases on the basis of patient information (Hepatitis), the influence of genes on HIV, the risk of falling for elderly people (FFRAT)

System | HIV | UW-CSE | Mondial
---|---|---|---
SLIPCOVER | \(0.82\pm0.05\) | \(0.11\pm0.08\) | \(0.86\pm0.07\)
SLIPCASE | \(0.78\pm0.05\) | \(0.03\pm0.01\) | \(0.65\pm0.06\)
LSM | \(0.37\pm0.03\) | \(0.07\pm0.02\) | -
ALEPH++ | - | \(0.05\pm0.01\) | \(0.87\pm0.07\)
RDN-B | \(0.28\pm0.06\) | \(0.28\pm0.06\) | \(0.77\pm0.07\)
MLN-BT | \(0.29\pm0.04\) | \(0.18\pm0.07\) | \(0.74\pm0.10\)
MLN-BC | \(0.51\pm0.04\) | \(0.06\pm0.01\) | \(0.59\pm0.09\)
BUSL | \(0.38\pm0.03\) | \(0.01\pm0.01\) | -

System | Carcinogenesis | Mutagenesis | Hepatitis
---|---|---|---
SLIPCOVER | \(0.60\) | \(0.95\pm0.01\) | \(0.80\pm0.01\)
SLIPCASE | \(0.63\) | \(0.92\pm0.08\) | \(0.71\pm0.05\)
LSM | - | - | \(0.53\pm0.04\)
ALEPH++ | \(0.74\) | \(0.95\pm0.01\) | -
RDN-B | \(0.55\) | \(0.97\pm0.03\) | \(0.88\pm0.01\)
MLN-BT | \(0.50\) | \(0.92\pm0.09\) | \(0.78\pm0.02\)
MLN-BC | \(0.62\) | \(0.69\pm0.20\) | \(0.79\pm0.02\)
BUSL | - | - | \(0.51\pm0.03\)

http://cplint.lamping.unife.it/

Inference (knowledge compilation, Monte Carlo)

Parameter learning (EMBLEM)

Structure learning (SLIPCOVER)

http://www.cs.kuleuven.be/~dtai/problog/

Inference (knowledge compilation, Monte Carlo)

Parameter learning (LFI-ProbLog)

PLP is still a fertile field but...

...we must look at other communities and build bridges and...

...join forces!

Much is left to do:

Tractable sublanguages (see following talk)

Lifted inference

Structure/Parameter learning (also for programs with continuous variables)