
Matthew Varble

matthew@rodent.club

Mathematician for hire!

roadmap

  • formulate generative models and associated language
  • give examples of generative models for intuition
  • discuss inference of a generative model
  • summarize a cool inference algorithm

my research

[Diagram: my research sits inside mathematics and probability, touching Monte Carlo, large deviations, and "esoteric" objects; this talk lies at the intersection with statistics and machine learning.]

quick terminology

  • Probability.  The comprehensive study of probability measures, measurable functions, kernels, and their related operations and properties.

  • Statistics. The study of modeling indeterminism in real-world measurements as quantities $Y$ distributed under a measure $\mu_\theta$ determined by a parameter $\theta \in \Theta$. Investigate $\mu_\theta$ to solve for $\theta$.

  • Machine learning. Construct a massive family $(F(\cdot,\alpha))_{\alpha \in \mathcal{A}}$ of functions $F(\cdot,\alpha)$ and use data-centric variational methods to select $\alpha \in \mathcal{A}$, ultimately coercing $F(\cdot,\alpha)$ into behaving nicely in some context.

probabilistic operations

  • Probability.  The comprehensive study of probability measures, measurable functions, kernels, and their related operations and properties.

$(\mu, f) \mapsto \displaystyle\int_{\mathbb{X}} f(x)\,\mu(\mathrm{d}x)$

$(\mu, T) \mapsto T_\#\mu = \big(\Gamma \mapsto \mu(T^{-1}\Gamma)\big)$

Other notation: $X_\#\mu = \mu(X \in \cdot) = \mu_X$

$(\mu, \kappa) \mapsto \kappa \ast \mu = \Big(\Gamma \mapsto \displaystyle\int_{\mathbb{X}} \kappa(x, \Gamma)\,\mu(\mathrm{d}x)\Big)$

$(\kappa, T) \mapsto T_\#\kappa = \big((x, \Gamma) \mapsto \kappa\big(T(x), T^{-1}\Gamma\big)\big)$

$(\mu, X, Y) \mapsto \mu_{X|Y}, \quad \mu_{(X,Y)} = \mu_{X|Y} \ast \mu_Y$
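To make these operations concrete, here is a minimal numerical sketch (assuming NumPy and a standard Gaussian source; every specific choice below is illustrative rather than taken from the slides): integration against $\mu$ becomes a Monte Carlo average, a pushforward $T_\#\mu$ is sampled by transforming draws of $\mu$, and $\kappa \ast \mu$ is sampled by drawing $x \sim \mu$ and then $y \sim \kappa(x, \cdot)$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 100_000

# Source: draws from mu = N(0, 1).
x = rng.standard_normal(M)

# Integration: int f dmu approximated by a Monte Carlo average.
f = lambda x: x**2
integral = f(x).mean()          # ~ 1.0, the second moment of N(0, 1)

# Pushforward: samples of T_# mu are just T applied to samples of mu.
T = lambda x: 2.0 * x + 1.0
y_pushforward = T(x)            # distributed as N(1, 4)

# Kernel: kappa(x, .) = N(x, 0.5^2); samples of kappa * mu mix over x ~ mu.
y_kernel = x + 0.5 * rng.standard_normal(M)   # distributed as N(0, 1.25)

print(integral, y_pushforward.mean(), y_kernel.var())
```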

generative modeling

Construct computer algorithms that sample quantities/measures via successive operations (a small sketch follows the list below):

  • Sources. $x \sim \mu$ for simple $\mu$.

  • Transports/maps. $y \sim T_\#\nu$;

    i.e. $x \sim \nu$ and $y = T(x)$.

  • Kernels/flatmaps. $y \sim \kappa \ast \nu$;

    i.e. $x \sim \nu$ and $y \sim \kappa(x, \cdot)$.

  • All sorts of combinations thereof!
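As a tiny illustrative composition (my own toy choices, assuming NumPy), the sketch below chains a source, a transport, and a kernel into one generative program:

```python
import numpy as np

rng = np.random.default_rng(1)

def generate(n_samples: int) -> np.ndarray:
    """Source -> transport -> kernel, composed into one generative model."""
    x = rng.exponential(scale=1.0, size=n_samples)   # source: x ~ Exp(1)
    z = np.log1p(x)                                   # transport: z = T(x) = log(1 + x)
    y = rng.normal(loc=z, scale=0.1)                  # kernel: y ~ N(z, 0.1^2)
    return y

samples = generate(10_000)
print(samples.mean(), samples.std())
```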

example: hidden Markov model

[Diagram: a hidden Markov model as a generative model. Hidden states $X_0, X_1, \ldots$ evolve through kernels $\kappa_1 \ast \cdot,\ \kappa_2 \ast \cdot, \ldots$ while observations $Y_0, Y_1, \ldots$ are emitted through kernels $\lambda_1 \ast \cdot,\ \lambda_2 \ast \cdot, \ldots$]
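A minimal sampler in this spirit might look as follows (illustrative only, assuming NumPy, with time-homogeneous Gaussian transition kernels $\kappa$ and emission kernels $\lambda$):

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_hmm(n_steps: int, x0: float = 0.0):
    """Generate (X_0..X_{n-1}, Y_0..Y_{n-1}) from a toy Gaussian HMM."""
    xs, ys = [], []
    x = x0
    for _ in range(n_steps):
        y = rng.normal(loc=x, scale=0.5)        # emission kernel lambda(x, .) = N(x, 0.5^2)
        xs.append(x)
        ys.append(y)
        x = rng.normal(loc=0.9 * x, scale=1.0)  # transition kernel kappa(x, .) = N(0.9 x, 1)
    return np.array(xs), np.array(ys)

xs, ys = sample_hmm(100)
print(xs[:5], ys[:5])
```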

example: recurrent neural network

[Diagram: a recurrent neural network as a generative model. At each step, weights $W_k \sim \mu_W$ and quantities $L_k \sim \mu_L$ feed a state $X_k$ that is propagated by the transport map $T(\cdot, \theta)_\# \cdot$, while observations $Y_k$ are produced through kernels $\lambda_1 \ast \cdot,\ \lambda_2 \ast \cdot, \ldots$]

calibration

When our measurement is a quantity $Y$, we may calibrate $\theta$ by repeatedly generating $Y$ under $\mu_\theta$.
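One naive way to do this, sketched below under assumptions of my own (NumPy, a toy location model, and a mean/variance-matching criterion not prescribed by the slides), is to scan candidate parameters and compare summary statistics of simulated draws of $Y$ against the observed data:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_y(theta: float, n: int) -> np.ndarray:
    """Toy generative model mu_theta: Y = theta + standard Gaussian noise."""
    return theta + rng.standard_normal(n)

y_observed = simulate_y(theta=2.0, n=500)   # pretend this is real data

# Calibrate theta by matching simulated mean/variance to the observed ones.
grid = np.linspace(-5, 5, 201)

def mismatch(theta):
    y_sim = simulate_y(theta, 5_000)
    return (y_sim.mean() - y_observed.mean())**2 + (y_sim.var() - y_observed.var())**2

theta_hat = min(grid, key=mismatch)
print(theta_hat)   # should land near 2.0
```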

inference

All generative models produce some joint measure $\mu$ (respectively $\mu_\theta$). Inference is the notion of studying one quantity $X$ conditioned on another $Y$; i.e. making judgements about $\mu_{X|Y}$.

Bayesian inference.

$$\mu(X \in \mathrm{d}x, Y \in \mathrm{d}y) = p_{XY}(x,y)\,\mathrm{d}x\,\mathrm{d}y$$

$$\leadsto \quad \begin{aligned} \mu(X \in \mathrm{d}x) &= p_X(x)\,\mathrm{d}x \\ \mu(Y \in \mathrm{d}y) &= p_Y(y)\,\mathrm{d}y \\ \mu(X \in \mathrm{d}x \mid Y = y) &= p_{X|Y}(x|y)\,\mathrm{d}x \\ \mu(Y \in \mathrm{d}y \mid X = x) &= p_{Y|X}(y|x)\,\mathrm{d}y \end{aligned}$$

$$\begin{aligned} p_{X|Y} &= p_{XY}/p_Y \\ &= p_X\, p_{Y|X}/p_Y \end{aligned}$$
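As a quick sanity check of this density manipulation (a standard conjugate-Gaussian example of my own choosing, not taken from the slides), suppose $X \sim N(0, \sigma_X^2)$ and $Y \mid X = x \sim N(x, \sigma_Y^2)$. Then

$$p_{X|Y}(x|y) \propto p_X(x)\, p_{Y|X}(y|x) \propto \exp\!\left(-\frac{x^2}{2\sigma_X^2} - \frac{(y-x)^2}{2\sigma_Y^2}\right),$$

and completing the square gives

$$X \mid Y = y \;\sim\; N\!\left(\frac{\sigma_X^2}{\sigma_X^2 + \sigma_Y^2}\, y,\; \frac{\sigma_X^2 \sigma_Y^2}{\sigma_X^2 + \sigma_Y^2}\right).$$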

Bayesian inference

Various estimators for $X \mid Y = y$, often based on $p_{X|Y} \propto p_{Y|X}\, p_X$:

  • MLE (respectively MAP). Maximize $p_{Y|X}(y|\cdot)$ (respectively $p_{X|Y}(\cdot|y)$).

  • Importance sampling. Sample a weighted prior (a runnable sketch follows this list).

    $\displaystyle\int_{\mathbb{X}} x\, \mu_{X|Y}(\mathrm{d}x|y) \propto \int_{\mathbb{X}} x\, p_{Y|X}(y|x)\, \mu_X(\mathrm{d}x)$

    i.e. sample $x_1, \ldots, x_M \sim \mu_X$ and take $\displaystyle \hat{x} = \sum_{i=1}^M \frac{p(y|x_i)\, x_i}{\sum_{j=1}^M p(y|x_j)}$.

  • MCMC. Construct a Markov chain with proposal kernel

    $Q(\mathrm{d}x'|x) = q(x'|x)\,\mathrm{d}x'$

    and a rejection scheme to ensure the invariant distribution is $\mu_{X|Y}$.
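Here is a self-normalized importance-sampling sketch for the posterior mean, under an assumed Gaussian prior $X \sim N(0,1)$ and Gaussian likelihood $Y \mid X = x \sim N(x, 0.5^2)$ (all specific choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)

# Assumed toy model: X ~ N(0, 1) (prior), Y | X = x ~ N(x, 0.5^2) (likelihood).
def likelihood(y, x, sigma=0.5):
    return np.exp(-0.5 * ((y - x) / sigma) ** 2)

y = 1.3                                    # observed value
M = 50_000
xs = rng.standard_normal(M)                # x_1, ..., x_M ~ mu_X (the prior)

weights = likelihood(y, xs)
weights /= weights.sum()                   # self-normalize the weights
x_hat = np.sum(weights * xs)               # estimate of E[X | Y = y]

print(x_hat)   # close to the exact posterior mean 1.3 / 1.25 = 1.04
```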

Markov-chain Monte Carlo

Do we get the idea?

1. Initialize state $x_0$.
2. For $k = 0, \ldots, L-1$ do:
3.   Sample proposal $\tilde x_{k+1} \sim Q(\cdot \mid x_k)$.
4.   Sample $a_k \sim U(0,1)$ and set the following.

$$a(x_k, \tilde x_{k+1}) = \min\left\{1,\ \frac{p_X(\tilde x_{k+1})\, p_{Y|X}(y \mid \tilde x_{k+1})\, q(x_k \mid \tilde x_{k+1})}{p_X(x_k)\, p_{Y|X}(y \mid x_k)\, q(\tilde x_{k+1} \mid x_k)}\right\}$$

$$x_{k+1} = \begin{cases} \tilde x_{k+1} & a_k \leq a(x_k, \tilde x_{k+1}) \\ x_k & \text{otherwise} \end{cases}$$
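A compact runnable version of this Metropolis-Hastings loop, using the same toy Gaussian prior/likelihood as the earlier sketches and a symmetric random-walk proposal $Q(\cdot \mid x) = N(x, \tau^2)$ so the $q$ terms cancel (the specific numbers are mine, for illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

def log_target(x, y, sigma_x=1.0, sigma_y=0.5):
    """log p_X(x) + log p_{Y|X}(y|x), up to an additive constant."""
    return -0.5 * (x / sigma_x) ** 2 - 0.5 * ((y - x) / sigma_y) ** 2

def metropolis_hastings(y, L=20_000, tau=1.0, x0=0.0):
    xs = np.empty(L + 1)
    xs[0] = x0
    for k in range(L):
        x = xs[k]
        x_prop = x + tau * rng.standard_normal()   # symmetric proposal: q ratio is 1
        log_a = log_target(x_prop, y) - log_target(x, y)
        xs[k + 1] = x_prop if np.log(rng.uniform()) <= log_a else x
    return xs

chain = metropolis_hastings(y=1.3)
print(chain[2_000:].mean())   # again close to the exact posterior mean of about 1.04
```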

transport map Markov-chain Monte Carlo

Learn a proposal kernel $T(\cdot, \alpha)_\#\, Q$ through a variational method.

1. Initialize state $x_0$ and parameter $\alpha$.
2. For $k = 0, \ldots, L-1$ do:
3a.   Compute reference $r_k = T(x_k, \alpha)$.
3b.   Sample proposal reference $\tilde r_{k+1} \sim Q(\cdot \mid r_k)$.
3c.   Evaluate proposal $\tilde x_{k+1} = T^{-1}(\tilde r_{k+1}, \alpha)$.
4.   Sample $a_k \sim U(0,1)$ and set the following.

$$a(x_k, \tilde x_{k+1}) = \min\left\{1,\ \frac{p_X(\tilde x_{k+1})\, p_{Y|X}(y \mid \tilde x_{k+1})\, q(r_k \mid \tilde r_{k+1})\, |\det \nabla T(x_k, \alpha)|}{p_X(x_k)\, p_{Y|X}(y \mid x_k)\, q(\tilde r_{k+1} \mid r_k)\, |\det \nabla T(\tilde x_{k+1}, \alpha)|}\right\}$$

$$x_{k+1} = \begin{cases} \tilde x_{k+1} & a_k \leq a(x_k, \tilde x_{k+1}) \\ x_k & \text{otherwise} \end{cases}$$

5. If $(k+1) \bmod K_U = 0$ then
6.   Update $\alpha$ by optimizing an estimated divergence $\gamma \mapsto C\big(T(\cdot; \gamma)\big)$ induced by the running chain $\{x_1, \ldots, x_{k+1}\}$.
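To make the idea concrete, here is a heavily simplified, illustrative sketch (my own toy choices throughout): the transport map is affine, $T(x, \alpha) = (x - m)/s$ with $\alpha = (m, s)$, the reference proposal is a symmetric random walk, and the periodic "variational" update simply refits $m, s$ to the running chain's mean and standard deviation rather than optimizing a formal divergence.

```python
import numpy as np

rng = np.random.default_rng(6)

def log_target(x, y, sigma_x=1.0, sigma_y=0.5):
    """log p_X(x) + log p_{Y|X}(y|x), up to a constant (same toy model as before)."""
    return -0.5 * (x / sigma_x) ** 2 - 0.5 * ((y - x) / sigma_y) ** 2

def transport_map_mh(y, L=20_000, K_U=1_000, tau=1.0):
    m, s = 0.0, 1.0                      # alpha = (m, s): T(x, alpha) = (x - m) / s
    xs = np.empty(L + 1)
    xs[0] = 0.0
    for k in range(L):
        x = xs[k]
        r = (x - m) / s                                  # 3a: reference r_k = T(x_k, alpha)
        r_prop = r + tau * rng.standard_normal()         # 3b: symmetric proposal in reference space
        x_prop = m + s * r_prop                          # 3c: proposal = T^{-1}(r_prop, alpha)
        # 4: q is symmetric and |det grad T| = 1/s is constant in x, so both ratios cancel.
        log_a = log_target(x_prop, y) - log_target(x, y)
        xs[k + 1] = x_prop if np.log(rng.uniform()) <= log_a else x
        if (k + 1) % K_U == 0:
            # 5-6: crude stand-in for the variational update: refit the affine map to the chain.
            m, s = xs[:k + 2].mean(), max(xs[:k + 2].std(), 1e-3)
    return xs

chain = transport_map_mh(y=1.3)
print(chain[2_000:].mean())
```

With an affine map the proposal is just an adaptively rescaled random walk; the richer version in the algorithm above uses a learned, nonlinear $T$ whose Jacobian determinant no longer cancels.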

That's it!
