# Outline

• Probabilistic programming

• Probabilistic logic programming

• Inference

• Learning

• Applications

# Probabilistic Programming

• Users specify a probabilistic model in its entirety (e.g., by writing code that generates a sample from the joint distribution) and inference follows automatically given the specification.

• PP languages provide the full power of modern programming languages for describing complex distributions

• Reuse of libraries of models

• Interactive modeling

• Abstraction
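A probabilistic program in this sense can be sketched in a few lines of Python: code that generates a sample from the joint distribution, with inference (here, rejection sampling) layered on top. The burglary/alarm model and all its numbers are illustrative, not from the text:

```python
import random

def flip(p):
    """Bernoulli sample: True with probability p."""
    return random.random() < p

def sample_world():
    """Generate one sample from the joint distribution.
    All probabilities here are made up for illustration."""
    burglary = flip(0.1)
    earthquake = flip(0.2)
    alarm = flip(0.9) if (burglary or earthquake) else flip(0.01)
    return burglary, alarm

def prob_burglary_given_alarm(n=100_000):
    """Inference by rejection: keep only worlds where the alarm rang."""
    hits = kept = 0
    for _ in range(n):
        burglary, alarm = sample_world()
        if alarm:
            kept += 1
            hits += burglary
    return hits / kept

random.seed(0)
print(prob_burglary_given_alarm())  # close to the exact value 0.09/0.2592 ≈ 0.347
```

The point is the division of labor: the user writes only `sample_world`; the inference routine is generic and follows automatically from the specification.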

# Probabilistic Programming Languages

• Only three are logic-based; the others are imperative, functional, or object-oriented

• In 2013, DARPA released the funding call “Probabilistic Programming for Advancing Machine Learning” (PPAML)

• Aim: develop probabilistic programming languages and accompanying tools to facilitate the construction of new machine learning applications across a wide range of domains.

• Focus: functional PP

# Probabilistic Logic Programming

• What are we missing?

• Is logic programming to blame?

# Thesis

Probabilistic logic programming is alive and kicking!

# Strengths

• Relationships are first class citizens

• Conceptually easier to lift

• Strong semantics

• Inductive systems

# Weaknesses

• Handling non-termination

• Continuous variables

# Non-termination

• Possible when the number of explanations for the query is infinite

# Non-termination: Inducing Arithmetic Functions

Church code http://forestdb.org/models/arithmetic.html

(define (random-arithmetic-fn)
  (if (flip 0.3)
      (random-combination (random-arithmetic-fn)
                          (random-arithmetic-fn))
      (if (flip)
          (lambda (x) x)
          (random-constant-fn))))

(define (random-combination f g)
  (define op (uniform-draw (list + -)))
  (lambda (x) (op (f x) (g x))))

(define (random-constant-fn)
  (define i (sample-integer 10))
  (lambda (x) i))

# Non-termination: Inducing Arithmetic Functions

LPAD (cplint) code http://cplint.lamping.unife.it/example/inference/arithm.pl

eval(X,Y):-
  random_fn(X,0,F),
  Y is F.
op(L,+):0.5; op(L,-):0.5.
random_fn(X,L,F):-
  comb(L),
  random_fn(X,l(L),F1),
  random_fn(X,r(L),F2),
  op(L,Op),
  F=..[Op,F1,F2].
random_fn(X,L,F):-
  \+ comb(L),
  base_random_fn(X,L,F).
comb(_):0.3.
base_random_fn(X,L,X):-
  identity(L).
base_random_fn(_X,L,C):-
  \+ identity(L),
  random_const(L,C).
identity(_):0.5.
random_const(_,C):discrete(C,[0:0.1,1:0.1,2:0.1,3:0.1,4:0.1,
  5:0.1,6:0.1,7:0.1,8:0.1,9:0.1]).

# Non-termination: Inducing Arithmetic Functions

• Aim: given observations of input-output pairs for the random function, predict the output for a new input

• Arbitrarily complex functions have a non-zero probability of being selected

• The program may have non-terminating executions.

• Exact inference: infinite number of explanations

# Non-termination: Inducing Arithmetic Functions

(define (sample)
  (rejection-query
    (define my-proc (random-arithmetic-fn))
    (my-proc 2)
    (= (my-proc 1) 3)))

(hist (repeat 100 sample))

# Solution

• Use the techniques of (T. Sato, P. Meyer, Infinite probability computation by cyclic explanation graphs, Theor. Pract. Log. Prog. 14, 2014) or (A. Gorlin, C. R. Ramakrishnan, S. A. Smolka, Model checking with probabilistic tabled logic programming, Theor. Pract. Log. Prog. 12 (4-5), 2012)

• or resort to sampling: as complexity increases, the probability of a function tends to 0, so the probability of an infinite trace is 0

• Metropolis Hastings: (Nampally, A., Ramakrishnan, C.: Adaptive MCMC-based inference in probabilistic logic programs. arXiv preprint arXiv:1403.6036 2014)

• Monte Carlo sampling is attractive for the simplicity of its implementation and because you can improve the estimate as more time is available.

# Monte Carlo

• The disjunctive clause

$$C_r=H_1:\alpha_1\vee \ldots\vee H_n:\alpha_n\leftarrow L_1,\ldots,L_m.$$

is transformed into the set of clauses $$MC(C_r)$$

$$\begin{array}{l} MC(C_r,1)=H_1\leftarrow L_1,\ldots,L_m,sample\_head(n,r,VC,NH),NH=1.\\ \ldots\\ MC(C_r,n)=H_n\leftarrow L_1,\ldots,L_m,sample\_head(n,r,VC,NH),NH=n.\\ \end{array}$$

• Sample truth value of query Q:

...
(call(Q) -> NT1 is NT+1 ; NT1 = NT),
...
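The counting scheme above (increment the success counter when `call(Q)` succeeds) can be sketched in Python for the arithmetic-function model; the function generator mirrors the Church/LPAD program, but this is an illustration, not cplint's implementation:

```python
import random

def random_arithmetic_fn():
    """Mirror of the Church/LPAD generator: with probability 0.3 combine
    two random functions with + or -, otherwise with probability 0.5
    return the identity, else a random constant in 0..9."""
    if random.random() < 0.3:
        f, g = random_arithmetic_fn(), random_arithmetic_fn()
        op = random.choice([lambda a, b: a + b, lambda a, b: a - b])
        return lambda x: op(f(x), g(x))
    if random.random() < 0.5:
        return lambda x: x
    c = random.randrange(10)
    return lambda x: c

def mc_prob(query, n=10_000):
    """Monte Carlo estimate: fraction of samples in which the query succeeds."""
    return sum(query() for _ in range(n)) / n

random.seed(1)
print(mc_prob(lambda: random_arithmetic_fn()(1) == 3))
```

Each sample terminates with probability 1 (the expected number of recursive calls is 0.6 < 1), so sampling sidesteps the infinite set of explanations, and the estimate improves as more samples are drawn.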

# Metropolis-Hastings MCMC

• A Markov chain is built by taking an initial sample and by generating successor samples.

• The initial sample is built by randomly sampling choices so that the evidence is true.

• A successor sample is obtained by deleting a fixed number of sampled probabilistic choices; the remaining choices are kept and new ones are sampled as needed.

• Then the evidence is queried.

• If the evidence query succeeds, the goal is queried.

• The sample is accepted with a probability of $$\min\{1,\frac{N_0}{N_1}\}$$ where $$N_0$$ ($$N_1$$) is the number of choices sampled in the previous (current) sample.
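The scheme above can be sketched as a generic Python skeleton; `propose`, `n_choices`, and `evidence` are hypothetical helpers standing in for choice resampling, choice counting, and evidence checking, not cplint internals:

```python
import random

def mh_chain(initial, propose, n_choices, evidence, steps):
    """Metropolis-Hastings skeleton following the slide:
    - propose(sample): delete/resample some probabilistic choices
    - n_choices(sample): number of choices sampled in `sample`
    - evidence(sample): does the evidence hold in this sample?
    A candidate where the evidence fails is rejected outright;
    otherwise it is accepted with probability min(1, N0/N1)."""
    current = initial
    chain = []
    for _ in range(steps):
        candidate = propose(current)
        if evidence(candidate):
            ratio = n_choices(current) / n_choices(candidate)
            if random.random() < min(1.0, ratio):
                current = candidate
        chain.append(current)
    return chain

# Toy use: integer states, symmetric proposal, trivial evidence.
random.seed(0)
chain = mh_chain(0, lambda s: s + random.choice([-1, 1]),
                 lambda s: 1, lambda s: True, 50)
print(len(chain))  # 50 states, one per step
```

With `n_choices` constant the ratio is 1 and every evidence-satisfying candidate is accepted; the interesting behavior comes from samples of different sizes, as in the arithmetic-function program.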

# Solution

• In cplint:

?- mc_mh_sample(eval(2,4),eval(1,3),100,100,3,T,F,P).
• Probability of eval(2,4) given that eval(1,3) is true

T = 10,
F = 90,
P = 0.1
• You can also try rejection sampling (usually slower)

?- mc_rejection_sample(eval(2,4),eval(1,3),100,T,F,P).

# Solution

• You may be interested in the distribution of the output

• In cplint:

?- mc_mh_sample_arg_bar(eval(2,Y),eval(1,3),100,100,3,Y,V).

# Solution

• You may be interested in the expected value of the output

• In cplint:

?- mc_mh_expectation(eval(2,Y),eval(1,3),100,100,3,Y,E).
E = 3.21

# Continuous Random Variables

• Distributional clauses (B. Gutmann, I. Thon, A. Kimmig, M. Bruynooghe, and L. De Raedt, “The magic of logical inference in probabilistic programming,” Theory and Practice of Logic Programming, 2011)

• Gaussian mixture model in cplint:

heads:0.6;tails:0.4.
g(X): gaussian(X,0,1).
h(X): gaussian(X,5,2).
mix(X) :- heads, g(X).
mix(X) :- tails, h(X).

# Continuous Random Variables

• Inference by sampling

• Without evidence, or with evidence on discrete random variables, you can reuse the same methods

• Sampling arguments of goals for building a probability density of the arguments.

# Gaussian Mixture Model

heads:0.6;tails:0.4.
g(X): gaussian(X,0,1).
h(X): gaussian(X,5,2).
mix(X) :- heads, g(X).
mix(X) :- tails, h(X).

?- mc_sample_arg(mix(X),10000,X,L0),
   histogram(L0,40,Chart).
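Forward sampling from this two-component mixture can be sketched in plain Python: heads (probability 0.6) draws from N(0,1), tails (probability 0.4) from a Gaussian with mean 5 and variance 2. Note that cplint's `gaussian/3` takes mean and variance, while Python's `random.gauss` takes a standard deviation, hence the square root:

```python
import random

def sample_mix():
    """One sample from the mixture: with probability 0.6 the heads
    branch N(0,1), with probability 0.4 the tails branch with mean 5
    and variance 2 (so standard deviation sqrt(2))."""
    if random.random() < 0.6:
        return random.gauss(0, 1)
    return random.gauss(5, 2 ** 0.5)

random.seed(0)
samples = [sample_mix() for _ in range(10_000)]
mean = sum(samples) / len(samples)
print(mean)  # should be close to 0.6*0 + 0.4*5 = 2.0
```

A histogram of `samples` would reproduce the bimodal density that the `histogram/3` call above charts.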

# Evidence on Continuous Random Variables

• You cannot use rejection sampling or Metropolis-Hastings, as the probability of the evidence is 0

• You can use likelihood weighting to obtain samples of continuous arguments of a goal. (Nitti, D., De Laet, T., De Raedt, L.: Probabilistic logic programming for hybrid relational domains. Mach. Learn. 103(3), 407-449 2016)

# Likelihood Weighting

• For each sample to be taken, likelihood weighting samples the query and then assigns a weight to the sample on the basis of evidence.

• The weight is computed by deriving the evidence backward in the same sample of the query starting with a weight of one

• Each time a choice should be taken or a continuous variable sampled: if the choice/variable has already been taken, the current weight is multiplied by the probability of the choice or by the density at the sampled continuous value.
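The core idea reduces to importance sampling from the prior: sample the query forward, then weight each sample by the density of the evidence under it. A minimal sketch on a toy one-observation Gaussian model (the helper names are mine, not cplint's):

```python
import math
import random

def normal_pdf(x, mu, var):
    """Density of N(mu, var) at x."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def likelihood_weighting(prior_sample, evidence_density, n=10_000):
    """Sample the query forward, weight by the evidence density,
    and return the weighted mean of the sampled argument."""
    samples, weights = [], []
    for _ in range(n):
        x = prior_sample()
        samples.append(x)
        weights.append(evidence_density(x))
    total = sum(weights)
    return sum(x * w for x, w in zip(samples, weights)) / total

# Toy model: mu ~ N(0,1); evidence: one observation y = 2 with variance 1.
random.seed(0)
post_mean = likelihood_weighting(
    lambda: random.gauss(0, 1),
    lambda mu: normal_pdf(2.0, mu, 1.0))
print(post_mean)  # exact posterior mean is 2/(1+1) = 1.0
```

Unlike rejection sampling, no sample is discarded: evidence with probability 0 (a point observation of a continuous variable) still yields a positive density, and hence a usable weight.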

# Bayesian Estimation

• Estimate the true value of a Gaussian distributed random variable, given some observed data.

• The variance is known and we suppose that the mean itself has a Gaussian distribution with mean 1 and variance 5 (a prior on the parameter)

• We take different measurements (e.g., at different times), indexed by an integer.

# Bayesian Estimation

• Anglican code

(def dataset [9 8])

(defquery gaussian-model [data]
  (let [mu (sample (normal 1 (sqrt 5)))
        sigma (sqrt 2)]
    (doall (map (fn [x] (observe (normal mu sigma) x)) data))
    mu))

(def posterior
  ((conditional gaussian-model :smc :number-of-particles 10) dataset))

(def posterior-samples (repeatedly 20000 #(sample* posterior)))

# Bayesian Estimation

value(I,X) :-
  mean(M),
  value(I,M,X).
mean(M): gaussian(M,1.0,5.0).
value(_,M,X): gaussian(X,M,2.0).

?- mc_sample_arg(value(0,Y),10000,Y,L0),
   mc_lw_sample_arg(value(0,X),(value(1,9),value(2,8)),10000,X,L),
   densities(L0,L,40,Chart).
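This model is conjugate, so the posterior the sampler approximates can be checked in closed form with the standard normal-normal update (prior N(1,5), observations 9 and 8 with known variance 2):

```python
# Closed-form normal-normal update for the model above.
prior_mean, prior_var = 1.0, 5.0
obs, obs_var = [9.0, 8.0], 2.0

# Posterior precision is the sum of the prior precision and one
# likelihood precision per observation.
post_prec = 1.0 / prior_var + len(obs) / obs_var      # 0.2 + 1.0 = 1.2
post_var = 1.0 / post_prec
post_mean = post_var * (prior_mean / prior_var + sum(obs) / obs_var)
print(post_mean, post_var)  # 7.25 and 1/1.2 ≈ 0.833
```

The density charted by `densities/4` should therefore be centered near 7.25, pulled from the prior mean 1 toward the data mean 8.5.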

# Learning

• Parameter learning

• Structure learning is more developed for PLP, but see:

• (Perov, Yura N., and Frank D. Wood. Learning Probabilistic Programs. arXiv preprint arXiv:1407.2646 2014).

• (Lake, Brenden M., Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science 350.6266 2015).

• (Gaunt, Alexander L., et al. TerpreT: A Probabilistic Programming Language for Program Induction. arXiv preprint arXiv:1608.04428 2016).

# Parameter Learning

• Problem: given a set of interpretations and a program, find the parameters maximizing the likelihood of the interpretations (or of instances of a target predicate)

• Exploit the equivalence with BN to use BN learning algorithms

• The interpretations record the truth value of ground atoms, not of the choice variables

• The choice variables are unseen: relative frequency can't be used

# Parameter Learning

• (Thon et al. ECML 2008) proposed an adaptation of EM for CPT-L, a simplified version of LPADs

• The algorithm computes the counts efficiently by repeatedly traversing the BDDs representing the explanations

• (Ishihata et al. ILP 2008) independently proposed a similar algorithm

• LFI-ProbLog (Gutmann et al. ECML 2011) is the adaptation of EM to ProbLog

• EMBLEM (Riguzzi & Bellodi IDAJ 2013) adapts (Ishihata et al. ILP 2008) to LPADs
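The reason relative frequency fails, and EM works, can be seen on a tiny example: two clauses `h :- b1.` and `h :- b2.` where the interpretations record only `h`, while the choice variables `b1`, `b2` are hidden. The sketch below is a plain EM for this noisy-OR, loosely in the spirit of the algorithms above, not a reimplementation of any of them:

```python
import random

def em_noisy_or(data, iters=200):
    """EM for h :- b1. / h :- b2. with b1 ~ Bern(p1), b2 ~ Bern(p2),
    observing only h. data: list of booleans (truth values of h)."""
    p1, p2 = 0.6, 0.2                      # asymmetric start to break symmetry
    for _ in range(iters):
        e1 = e2 = 0.0
        for h in data:
            if h:
                # E-step: P(bi=true | h=true) = pi / P(h=true) by Bayes,
                # since bi=true already implies h=true.
                ph = 1 - (1 - p1) * (1 - p2)
                e1 += p1 / ph
                e2 += p2 / ph
            # If h is false, both b1 and b2 must be false: contribute 0.
        # M-step: expected relative frequency of each hidden choice.
        p1, p2 = e1 / len(data), e2 / len(data)
    return p1, p2

# Toy data generated from true parameters 0.3 and 0.5.
random.seed(0)
data = [(random.random() < 0.3) or (random.random() < 0.5)
        for _ in range(5000)]
p1, p2 = em_noisy_or(data)
print(p1, p2, 1 - (1 - p1) * (1 - p2))
```

At convergence the implied probability of `h`, namely 1-(1-p1)(1-p2), matches the empirical frequency of `h` in the data, even though the individual parameters are only identifiable up to this constraint.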

# Structure Learning

• Given a trivial or empty LPAD and a set of interpretations (data)

• Find the model and the parameters that maximize the probability of the data (log-likelihood)

• SLIPCOVER: Structure LearnIng of Probabilistic logic programs by searChing OVER the clause space

1. Beam search in the space of clauses to find the promising ones

2. Greedy search in the space of probabilistic programs guided by the LL of the data.

• Parameter learning by means of EMBLEM
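Step 1 above can be sketched as a generic beam search skeleton; `refine` and `score` are placeholders for clause refinement and log-likelihood scoring, not SLIPCOVER's actual implementation:

```python
import heapq

def beam_search(initial, refine, score, beam_size, steps):
    """Keep the beam_size best candidates at each refinement step,
    remembering the best candidate seen overall."""
    beam = [initial]
    best = initial
    for _ in range(steps):
        candidates = [c for b in beam for c in refine(b)]
        if not candidates:
            break
        beam = heapq.nlargest(beam_size, candidates, key=score)
        if score(beam[0]) > score(best):
            best = beam[0]
    return best

# Toy use: "clauses" are tuples of literals 1..5; the score rewards
# literals 2 and 4 and penalizes clause length (all invented here).
refine = lambda c: [c + (l,) for l in range(1, 6) if l not in c]
score = lambda c: sum(1 for l in c if l in (2, 4)) - 0.1 * len(c)
print(beam_search((), refine, score, beam_size=3, steps=4))  # (2, 4)
```

The real system scores refinements by the log-likelihood of the data after parameter learning with EMBLEM, which is what makes the greedy theory search in step 2 meaningful.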

# Applications

• Link prediction: given a (social) network, compute the probability of the existence of a link between two entities (UWCSE)

advisedby(X,Y):0.3 :-
  publication(P,X),
  publication(P,Y),
  student(X).

# Applications

• Classify web pages on the basis of the link structure (WebKB)

coursePage(Page1): 0.3 :- linkTo(Page2,Page1),coursePage(Page2).
...
coursePage(Page): 0.3 :- has('abstract',Page).
...

# Applications

• Entity resolution: identify identical entities in text or databases

samebib(A,B):0.3 :-
  samebib(A,C), samebib(C,B).
sameauthor(A,B):0.3 :-
  sameauthor(A,C), sameauthor(C,B).
sametitle(A,B):0.3 :-
  sametitle(A,C), sametitle(C,B).
samevenue(A,B):0.3 :-
  samevenue(A,C), samevenue(C,B).
samebib(B,C):0.3 :-
  author(B,D), author(C,E), sameauthor(D,E).
samebib(B,C):0.3 :-
  title(B,D), title(C,E), sametitle(D,E).
samebib(B,C):0.3 :-
  venue(B,D), venue(C,E), samevenue(D,E).
samevenue(B,C):0.3 :-
  haswordvenue(B,word_06),
  haswordvenue(C,word_06).
...

# Applications

• Chemistry: given the chemical composition of a substance, predict its mutagenicity or its carcinogenicity

active(A):0.5 :-
  atm(A,B,c,29,C),
  gteq(C,-0.003),
  ring_size_5(A,D).
active(A):0.5 :-
  lumo(A,B), lteq(B,-2.072).
active(A):0.5 :-
  bond(A,B,C,2),
  bond(A,C,D,1),
  ring_size_5(A,E).
active(A):0.5 :-
  carbon_6_ring(A,B).
active(A):0.5 :-
  anthracene(A,B).
...

# Applications

• Medicine: diagnose diseases on the basis of patient information (Hepatitis), influence of genes on HIV, risk of falling of elderly people (FFRAT)

# Experiments - Area Under the PR Curve

| System    | HIV               | UW-CSE            | Mondial           |
|-----------|-------------------|-------------------|-------------------|
| SLIPCOVER | $$0.82\pm0.05$$   | $$0.11\pm0.08$$   | $$0.86\pm0.07$$   |
| SLIPCASE  | $$0.78\pm0.05$$   | $$0.03\pm0.01$$   | $$0.65\pm0.06$$   |
| LSM       | $$0.37\pm0.03$$   | $$0.07\pm0.02$$   | -                 |
| ALEPH++   | -                 | $$0.05\pm0.01$$   | $$0.87\pm0.07$$   |
| RDN-B     | $$0.28\pm0.06$$   | $$0.28\pm0.06$$   | $$0.77\pm0.07$$   |
| MLN-BT    | $$0.29\pm0.04$$   | $$0.18\pm0.07$$   | $$0.74\pm0.10$$   |
| MLN-BC    | $$0.51\pm0.04$$   | $$0.06\pm0.01$$   | $$0.59\pm0.09$$   |
| BUSL      | $$0.38\pm0.03$$   | $$0.01\pm0.01$$   | -                 |

# Experiments - Area Under the PR Curve

| System    | Carcinogenesis | Mutagenesis       | Hepatitis         |
|-----------|----------------|-------------------|-------------------|
| SLIPCOVER | $$0.60$$       | $$0.95\pm0.01$$   | $$0.80\pm0.01$$   |
| SLIPCASE  | $$0.63$$       | $$0.92\pm0.08$$   | $$0.71\pm0.05$$   |
| LSM       | -              | -                 | $$0.53\pm0.04$$   |
| ALEPH++   | $$0.74$$       | $$0.95\pm0.01$$   | -                 |
| RDN-B     | $$0.55$$       | $$0.97\pm0.03$$   | $$0.88\pm0.01$$   |
| MLN-BT    | $$0.50$$       | $$0.92\pm0.09$$   | $$0.78\pm0.02$$   |
| MLN-BC    | $$0.62$$       | $$0.69\pm0.20$$   | $$0.79\pm0.02$$   |
| BUSL      | -              | -                 | $$0.51\pm0.03$$   |

# Conclusions

• PLP is still a fertile field but...

• ...we must look at other communities and build bridges and...

• ...join forces!

• Much is left to do:

• Tractable sublanguages (see following talk)

• Lifted inference

• Structure/Parameter learning (also for programs with continuous variables)