Deductive and Inductive Probabilistic Programming

Fabrizio Riguzzi

University of Ferrara
http://ds.ing.unife.it/~friguzzi/DIPP/
http://cplint.lamping.unife.it
image

Outline

  • Probabilistic programming

  • Probabilistic logic programming

  • Inference

  • Learning

  • Applications

Probabilistic Programming

  • Users specify a probabilistic model in its entirety (e.g., by writing code that generates a sample from the joint distribution), and inference follows automatically from the specification (see the minimal example after this list).

  • PP languages provide the full power of modern programming languages for describing complex distributions

  • Reuse of libraries of models

  • Interactive modeling

  • Abstraction
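
  • For instance, a minimal LPAD (cplint) program for a biased coin (an illustrative sketch added here, not from the original slides; prob/2 is the exact-inference predicate of cplint's PITA library):

:- use_module(library(pita)).
:- pita.
:- begin_lpad.
% tossing coin C lands heads with probability 0.6 and tails with 0.4
heads(C):0.6 ; tails(C):0.4 :- toss(C).
toss(coin).
:- end_lpad.
% ?- prob(heads(coin), P).
% P = 0.6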

Probabilistic Programming

image

Probabilistic Programming Languages

Source https://en.wikipedia.org/wiki/Probabilistic_programming_language

Probabilistic Programming Languages

  • Only three are based on logic programming; the others are imperative, functional, or object-oriented

  • In 2013, DARPA released the funding call “Probabilistic Programming for Advancing Machine Learning” (PPAML)

  • Aim: develop probabilistic programming languages and accompanying tools to facilitate the construction of new machine learning applications across a wide range of domains.

  • Focus: functional PP

Probabilistic Logic Programming

  • What are we missing?

  • Is logic programming to blame?

image

Thesis

Probabilistic logic programming is alive and kicking!

Strengths

  • Relationships are first class citizens

  • Conceptually easier to lift

  • Strong semantics

  • Inductive systems

Weaknesses

  • Handling non-termination

  • Continuous variables

Non-termination

  • Possible when the number of explanations for the query is infinite
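
  • For example (an illustrative toy program, not from the slides), each recursion depth yields a new explanation for the query:

p(_):0.1.
p(X):0.5 :- p(s(X)).
% ?- prob(p(0), Prob).
% explanations: {base choice at 0}, {recursive choice at 0, base at s(0)}, ...
% exact inference would have to sum over infinitely many of them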

Non-termination: Inducing Arithmetic Functions

Church code http://forestdb.org/models/arithmetic.html

;; with probability 0.3, combine two recursively generated functions;
;; otherwise return the identity or a random constant function
(define (random-arithmetic-fn)
  (if (flip 0.3)
      (random-combination (random-arithmetic-fn)
                          (random-arithmetic-fn))
      (if (flip)
          (lambda (x) x)
          (random-constant-fn))))

;; combine f and g with an operator drawn uniformly from {+, -}
(define (random-combination f g)
  (define op (uniform-draw (list + -)))
  (lambda (x) (op (f x) (g x))))

;; constant function returning an integer drawn uniformly from 0..9
(define (random-constant-fn)
  (define i (sample-integer 10))
  (lambda (x) i))

Non-termination: Inducing Arithmetic Functions

LPAD (cplint) code http://cplint.lamping.unife.it/example/inference/arithm.pl

% eval(X,Y): Y is the value at X of the sampled random function
eval(X,Y):-
  random_fn(X,0,F),
  Y is F.
% the operator at node L is + or - with equal probability
op(L,+):0.5;op(L,-):0.5.
% with probability 0.3 (comb(L)) node L combines two random subfunctions
random_fn(X,L,F):-
  comb(L), random_fn(X,l(L),F1), random_fn(X,r(L),F2),
  op(L,Op), F=..[Op,F1,F2].
random_fn(X,L,F):-
  \+ comb(L), base_random_fn(X,L,F).
comb(_):0.3.
% a base function is either the identity or a constant drawn uniformly from 0..9
base_random_fn(X,L,X):- identity(L).
base_random_fn(_X,L,C):- \+ identity(L), random_const(L,C).
identity(_):0.5.
random_const(_,C):discrete(C,[0:0.1,1:0.1,2:0.1,3:0.1,4:0.1,
  5:0.1,6:0.1,7:0.1,8:0.1,9:0.1]).

Non-termination: Inducing Arithmetic Functions

  • Aim: given observed input-output pairs of the random function, predict the output for a new input

  • Arbitrarily complex functions have a non-zero probability of being selected

  • The program has a non-terminating execution.

  • Exact inference: infinite number of explanations

Non-termination: Inducing Arithmetic Functions

(define (sample)
  (rejection-query
   (define my-proc (random-arithmetic-fn))
   (my-proc 2)              ; query: the value of the function at 2
   (= (my-proc 1) 3)))      ; condition: the function maps 1 to 3
(hist (repeat 100 sample))

image

Solution

  • Use (T. Sato, P. Meyer, Infinite probability computation by cyclic explanation graphs, Theor. Pract. Log. Prog. 14, 2014) or (A. Gorlin, C. R. Ramakrishnan, S. A. Smolka, Model checking with probabilistic tabled logic programming, Theor. Pract. Log. Prog. 12(4-5), 2012)

  • or resort to sampling: as complexity increases, the probability of a function tends to 0, and infinite traces have probability 0

  • Metropolis Hastings: (Nampally, A., Ramakrishnan, C.: Adaptive MCMC-based inference in probabilistic logic programs. arXiv preprint arXiv:1403.6036 2014)

  • Monte Carlo sampling is attractive for the simplicity of its implementation and because you can improve the estimate as more time is available.

Monte Carlo

  • The disjunctive clause

    \(C_r=H_1:\alpha_1\vee \ldots\vee H_n:\alpha_n\leftarrow L_1,\ldots,L_m.\)

    is transformed into the set of clauses \(MC(C_r)\)

    \(\begin{array}{l} MC(C_r,1)=H_1\leftarrow L_1,\ldots,L_m,sample\_head(n,r,VC,NH),NH=1.\\ \ldots\\ MC(C_r,n)=H_n\leftarrow L_1,\ldots,L_m,sample\_head(n,r,VC,NH),NH=n.\\ \end{array}\)

  • Sample truth value of query Q:

    ...
      (call(Q) -> NT1 is NT + 1 ; NT1 = NT),
    ...
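
  • A possible implementation of sample_head/4 (a sketch under assumptions, not the actual cplint code): the choice made for a ground instance of clause r with bindings VC is memoized, so repeated calls return the same head index

:- dynamic sampled/3.

% sample_head(+N,+R,+VC,-NH): NH is the index of the head sampled for
% clause R under bindings VC; a previously sampled choice is reused
sample_head(_N, R, VC, NH) :-
  sampled(R, VC, NH), !.
sample_head(_N, R, VC, NH) :-
  clause_params(R, Alphas),   % assumed helper returning [alpha_1,...,alpha_n]
  random(U),                  % U uniform in [0,1)
  pick_head(U, Alphas, 1, NH),
  assertz(sampled(R, VC, NH)).

% pick_head(+U,+Probs,+I,-NH): walk the cumulative distribution
pick_head(U, [P|_], I, I) :- U =< P, !.
pick_head(U, [P|Ps], I, NH) :-
  U1 is U - P, I1 is I + 1,
  pick_head(U1, Ps, I1, NH).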

Metropolis-Hastings MCMC

  • A Markov chain is built by taking an initial sample and by generating successor samples.

  • The initial sample is built by randomly sampling choices so that the evidence is true.

  • A successor sample is obtained by deleting a fixed number of sampled probabilistic choices.

  • Then the evidence is queried in the resulting sample

  • If the evidence query succeeds, the goal is queried by taking a sample.

  • The sample is accepted with a probability of \(\min\{1,\frac{N_0}{N_1}\}\) where \(N_0\) (\(N_1\)) is the number of choices sampled in the previous (current) sample.
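
  • A minimal sketch of the acceptance test, assuming the counts \(N_0\) and \(N_1\) have already been computed:

% accept(+N0,+N1): succeed with probability min(1, N0/N1)
accept(N0, N1) :-
  P is min(1, N0/N1),
  random(U),
  U =< P.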

Solution

  • Probability of eval(2,4) given that eval(1,3) is true, in cplint:

?- mc_mh_sample(eval(2,4),eval(1,3),100,100,3,T,F,P).
F = 90,
T = 10,
P = 0.1
  • You can also try rejection sampling (usually slower)

?- mc_rejection_sample(eval(2,4),eval(1,3),100,
   T,F,P).

Solution

  • You may be interested in the distribution of the output

  • In cplint:

?- mc_mh_sample_arg_bar(eval(2,Y),eval(1,3),100,
   100,3,Y,V).

image

Solution

  • You may be interested in the expected value of the output

  • In cplint:

?- mc_mh_expectation(eval(2,Y),eval(1,3),
   100,100,3,Y,E).
E = 3.21

Continuous Random Variables

  • Distributional clauses (B. Gutmann, I. Thon, A. Kimmig, M. Bruynooghe, and L. De Raedt, “The magic of logical inference in probabilistic programming,” Theory and Practice of Logic Programming, 2011)

  • Gaussian mixture model in cplint:

heads:0.6;tails:0.4. 
g(X): gaussian(X,0, 1).
h(X): gaussian(X,5, 2).
mix(X) :- heads, g(X).
mix(X) :- tails, h(X).

Continuous Random Variables

  • Inference by sampling

  • With no evidence, or with evidence on discrete random variables only, the same methods can be reused

  • Sample the arguments of goals to build a probability density of the arguments.

Gaussian Mixture Model

heads:0.6;tails:0.4. 
g(X): gaussian(X,0, 1).
h(X): gaussian(X,5, 2).
mix(X) :- heads, g(X).
mix(X) :- tails, h(X).

?- mc_sample_arg(mix(X),10000,X,L0),
   histogram(L0,40,Chart).

image

Evidence on Continuous Random Variables

  • You cannot use rejection sampling or Metropolis-Hastings, as the probability of the evidence is 0

  • You can use likelihood weighting to obtain samples of continuous arguments of a goal. (Nitti, D., De Laet, T., De Raedt, L.: Probabilistic logic programming for hybrid relational domains. Mach. Learn. 103(3), 407-449 2016)

Likelihood Weighting

  • For each sample to be taken, likelihood weighting samples the query and then assigns a weight to the sample on the basis of evidence.

  • The weight is computed by deriving the evidence backward in the same sample used for the query, starting with a weight of one

  • Each time a choice should be taken or a continuous variable sampled, if the choice/variable has already been sampled, the current weight is multiplied by the probability of the choice (or by the density value at the sampled continuous value); otherwise the choice/variable is sampled and the weight is left unchanged.
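
  • A sketch of the weight update for a single choice met during the backward derivation of the evidence (all predicate names here are assumptions for illustration, not the cplint implementation):

% update_weight(+W0,+Choice,-W): if the value of Choice was already fixed
% when sampling the query, multiply the weight by its probability (discrete)
% or density (continuous); otherwise sample it and leave the weight unchanged
update_weight(W0, Choice, W) :-
  (   already_sampled(Choice, Value)
  ->  choice_density(Choice, Value, P),
      W is W0 * P
  ;   sample_choice(Choice, _Value),
      W = W0
  ).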

Bayesian Estimation

Bayesian Estimation

  • Anglican code

(def dataset [9 8])

;; prior on the mean mu; each data point is observed from Normal(mu, sigma)
(defquery gaussian-model [data]
  (let [mu (sample (normal 1 (sqrt 5)))
        sigma (sqrt 2)]
    (doall (map (fn [x] (observe (normal mu sigma) x)) data))
    mu))

;; posterior over mu via SMC with 10 particles, then 20000 posterior samples
(def posterior
  ((conditional gaussian-model :smc :number-of-particles 10) dataset))
(def posterior-samples (repeatedly 20000 #(sample* posterior)))

image

Bayesian Estimation

% value(I,X): X is observation I, drawn from a Gaussian with latent mean M
value(I,X) :-
  mean(M),
  value(I,M,X).
mean(M): gaussian(M,1.0,5.0).     % prior on the mean: variance 5
value(_,M,X): gaussian(X,M,2.0).  % likelihood: variance 2
?- mc_sample_arg(value(0,Y),10000,Y,L0),                            % prior samples
   mc_lw_sample_arg(value(0,X),(value(1,9),value(2,8)),10000,X,L),  % posterior by likelihood weighting
   densities(L0,L,40,Chart).

image

Learning

  • Parameter learning

  • Structure learning is more developed for PLP, but see:

  • (Perov, Yura N., and Frank D. Wood. Learning Probabilistic Programs. arXiv preprint arXiv:1407.2646 2014).

  • (Lake, Brenden M., Ruslan Salakhutdinov, and Joshua B. Tenenbaum. Human-level concept learning through probabilistic program induction. Science 350.6266 2015).

  • (Gaunt, Alexander L., et al. TerpreT: A Probabilistic Programming Language for Program Induction. arXiv preprint arXiv:1608.04428 2016).

Parameter Learning

  • Problem: given a program and a set of interpretations, find the parameters maximizing the likelihood of the interpretations (or of the instances of a target predicate)

  • Exploit the equivalence with BN to use BN learning algorithms

  • The interpretations record the truth value of ground atoms, not of the choice variables

  • Because of these unseen data, relative frequency can't be used; an EM approach is needed

Parameter Learning

  • (Thon et al. ECML 2008) proposed an adaptation of EM for CPT-L, a simplified version of LPADs

  • The algorithm computes the counts efficiently by repeatedly traversing the BDDs representing the explanations

  • (Ishihata et al. ILP 2008) independently proposed a similar algorithm

  • LFI-ProbLog (Gutmann et al. ECML 2011) is the adaptation of EM to ProbLog

  • EMBLEM (Riguzzi & Bellodi IDAJ 2013) adapts (Ishihata et al. ILP 2008) to LPADs
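
  • As a generic sketch (not the exact EMBLEM formulas), the M-step re-estimates the annotation \(\alpha_{r,k}\) of head \(H_k\) in clause \(C_r\) from the expected choice counts computed in the E-step over the BDDs:

    \(\alpha_{r,k} = \frac{\sum_{I} \mathbf{E}[c_{r,k} \mid I]}{\sum_{I} \sum_{j=1}^{n} \mathbf{E}[c_{r,j} \mid I]}\)

    where \(c_{r,j}\) is the number of groundings of \(C_r\) in interpretation \(I\) for which head \(H_j\) is chosen.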

Structure Learning

  • Given a trivial or empty LPAD and a set of interpretations (data)

  • Find the model and the parameters that maximize the probability of the data (log-likelihood)

  • SLIPCOVER: Structure LearnIng of Probabilistic logic programs by searching OVER the clause space

    1. Beam search in the space of clauses to find the promising ones

    2. Greedy search in the space of probabilistic programs guided by the LL of the data.

  • Parameter learning by means of EMBLEM
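
  • A typical learning session with the cplint slipcover library looks roughly as follows (a sketch; the fold name train and the chosen setting are assumptions):

:- use_module(library(slipcover)).
:- sc.                      % initialize SLIPCOVER
:- set_sc(verbosity, 1).    % one of the learning settings (assumed value)
% ... language bias (modeh/modeb declarations) and interpretations in fold train ...
% ?- induce([train], Program).   % Program is the learned LPAD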

Applications

  • Link prediction: given a (social) network, compute the probability of the existence of a link between two entities (UWCSE)

image

advisedby(X, Y) :0.3 :-
  publication(P, X),
  publication(P, Y),
  student(X).

Applications

  • Classify web pages on the basis of the link structure (WebKB)

image

coursePage(Page1): 0.3 :- linkTo(Page2,Page1),coursePage(Page2).
coursePage(Page1): 0.3 :- linkTo(Page2,Page1),facultyPage(Page2).
...
coursePage(Page): 0.3 :- has('abstract',Page).
...

Applications

  • Entity resolution: identify identical entities in text or databases

image

samebib(A,B):0.3 :- 
    samebib(A,C), samebib(C,B).
sameauthor(A,B):0.3 :- 
  sameauthor(A,C), sameauthor(C,B).
sametitle(A,B):0.3 :- 
  sametitle(A,C), sametitle(C,B).
samevenue(A,B):0.3 :- 
  samevenue(A,C), samevenue(C,B).
samebib(B,C):0.3 :-
  author(B,D),author(C,E),sameauthor(D,E).
samebib(B,C):0.3 :-
  title(B,D),title(C,E),sametitle(D,E).
samebib(B,C):0.3 :-
  venue(B,D),venue(C,E),samevenue(D,E).
samevenue(B,C):0.3 :-
  haswordvenue(B,word_06),
  haswordvenue(C,word_06).
...

Applications

  • Chemistry: given the chemical composition of a substance, predict its mutagenicity or its carcinogenicity

image

active(A):0.5 :-
   atm(A,B,c,29,C),
   gteq(C,-0.003), 
   ring_size_5(A,D).
active(A):0.5 :-
   lumo(A,B), lteq(B,-2.072).
active(A):0.5 :-
   bond(A,B,C,2), 
   bond(A,C,D,1), 
   ring_size_5(A,E).
active(A):0.5 :-
   carbon_6_ring(A,B).
active(A):0.5 :-
   anthracene(A,B).
...

Applications

  • Medicine: diagnose diseases on the basis of patient information (Hepatitis), the influence of genes on HIV, the risk of falling for elderly people (FFRAT)

image   image

Experiments - Area Under the PR Curve

System HIV UW-CSE Mondial
SLIPCOVER \(0.82 \pm0.05\) \(0.11\pm 0.08\) \(0.86 \pm 0.07\)
SLIPCASE \(0.78\pm0.05\) \(0.03\pm 0.01\) \(0.65 \pm 0.06\)
LSM \(0.37\pm0.03\) \(0.07\pm0.02\) -
ALEPH++ - \(0.05\pm0.01\) \(0.87 \pm 0.07\)
RDN-B \(0.28 \pm 0.06\) \(0.28 \pm 0.06\) \(0.77 \pm 0.07\)
MLN-BT \(0.29 \pm 0.04\) \(0.18 \pm 0.07\) \(0.74 \pm 0.10\)
MLN-BC \(0.51 \pm 0.04\) \(0.06 \pm 0.01\) \(0.59 \pm 0.09\)
BUSL \(0.38 \pm 0.03\) \(0.01 \pm 0.01\) -

Experiments - Area Under the PR Curve

System Carcinogenesis Mutagenesis Hepatitis
SLIPCOVER \(0.60\) \(0.95\pm0.01\) \(0.80\pm0.01\)
SLIPCASE \(0.63\) \(0.92\pm 0.08\) \(0.71\pm0.05\)
LSM - - \(0.53\pm 0.04\)
ALEPH++ \(0.74\) \(0.95\pm0.01\) -
RDN-B \(0.55\) \(0.97 \pm 0.03\) \(0.88 \pm 0.01\)
MLN-BT \(0.50\) \(0.92 \pm 0.09\) \(0.78 \pm 0.02\)
MLN-BC \(0.62\) \(0.69 \pm 0.20\) \(0.79 \pm 0.02\)
BUSL - - \(0.51 \pm 0.03\)

PLP Online

Conclusions

  • PLP is still a fertile field but...

  • ...we must look at other communities and build bridges and...

  • ...join forces!

  • Much is left to do:

    • Tractable sublanguages (see following talk)

    • Lifted inference

    • Structure/Parameter learning (also for programs with continuous variables)

image