Probability. The comprehensive study of probability measures, measurable functions, kernels, and their related operations and properties.
Statistics. The study of modeling the indeterminism of real-world measurements as quantities Y distributed under a measure μ_θ determined by a parameter θ ∈ Θ. We investigate μ_θ in order to infer θ.
Machine learning. Construct a large family (F(⋅,α))_{α∈A} of functions and use data-driven variational methods to select α ∈ A, ultimately coercing F(⋅,α) into behaving well in some context.
The basic operations on measures, functions, and kernels include the following (a Monte Carlo sketch follows the list):
- Integration: (μ, f) ↦ ∫_X f(x) μ(dx)
- Pushforward: (μ, T) ↦ T_#μ = (Γ ↦ μ(T⁻¹Γ)). Other notation: X_#μ = μ(X ∈ ⋅) = μ_X.
- Kernel application: (μ, κ) ↦ κ∗μ = (Γ ↦ ∫_X κ(x, Γ) μ(dx))
- Pushforward of a kernel: (κ, T) ↦ T_#κ = (Γ ↦ κ(T(x), T⁻¹Γ))
- Conditioning: (μ, X, Y) ↦ μ_{X|Y}, with μ_{(X,Y)} = μ_{X|Y} ∗ μ_Y
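Here is a minimal Monte Carlo sketch (Python/NumPy; the choices of μ, f, T, and κ are illustrative assumptions, not from the notes) of how the first three operations act on samples.

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo view of the operations, acting on a simple mu = N(0, 1).
x = rng.standard_normal(100_000)              # samples x_i ~ mu

# (mu, f) -> integral of f dmu: approximated by the sample mean of f(x_i).
f = lambda t: t**2
print(f(x).mean())                            # ~ E[X^2] = 1

# (mu, T) -> T_# mu: samples of the pushforward are simply T(x_i),
# consistent with  integral f d(T_# mu) = integral (f o T) dmu.
T = lambda t: np.exp(t)
print(T(x).mean())                            # ~ E[e^X] = e^{1/2} ≈ 1.65

# (mu, kappa) -> kappa * mu: for each x_i ~ mu, draw y_i ~ kappa(x_i, .).
y = rng.normal(loc=x, scale=0.5)              # kappa(x, .) = N(x, 0.5^2)
print(y.var())                                # ~ variance under kappa * mu = 1 + 0.25
```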
Construct computer algorithms that sample quantities and measures via successive operations:
- Sources: x ∼ μ for a simple μ.
- Transports/maps: y ∼ T_#ν, i.e. x ∼ ν and y = T(x).
- Kernels/flatmaps: y ∼ κ∗ν, i.e. x ∼ ν and y ∼ κ(x, ⋅).
- All sorts of combinations thereof! (See the sketch below.)
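As a concrete composition, here is a minimal sketch (Python/NumPy; the particular source, transport, and kernel are illustrative assumptions) of a sampler built by chaining the three primitives.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n samples from the composed model kappa * (T_# mu)."""
    x = rng.standard_normal(n)          # source:    x ~ mu, a simple measure
    y = np.tanh(x)                      # transport: y ~ T_# mu with T = tanh
    z = rng.normal(loc=y, scale=0.1)    # kernel:    z ~ kappa(y, .) = N(y, 0.1^2)
    return z

print(sample(5))
```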
[Diagram: a generative chain X0 → X1 → ⋯ evolving through kernels κ1∗⋅, κ2∗⋅, with observations Y0, Y1, … produced through kernels λ1∗⋅, λ2∗⋅.]
[Diagram: the same chain realized from simple sources W0, W1, … ∼ μW and L0, L1, … ∼ μL pushed through parametric transports T(⋅,θ)_#⋅, with observations Y0, Y1, … again produced through λ1∗⋅, λ2∗⋅.]
When our measurement is a quantity Y, we may calibrate θ by repeatedly generating Y under μθ.
All generative models give rise to some joint measure μ (respectively μ_θ). Inference is the task of studying a quantity X conditioned on another quantity Y, i.e. making judgements about μ_{X|Y}.
Bayesian inference.
$$\mu(X \in dx,\, Y \in dy) = p_{XY}(x, y)\, dx\, dy$$
$$\leadsto\quad \mu(X \in dx) = p_X(x)\, dx, \qquad \mu(Y \in dy) = p_Y(y)\, dy,$$
$$\mu(X \in dx \mid Y = y) = p_{X\mid Y}(x \mid y)\, dx, \qquad \mu(Y \in dy \mid X = x) = p_{Y\mid X}(y \mid x)\, dy,$$
$$p_{X\mid Y} = p_{XY} / p_Y = p_X\, p_{Y\mid X} / p_Y.$$
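A small numerical sketch of the last identity (Python/NumPy; the Gaussian prior and likelihood are illustrative assumptions): tabulate p_X(⋅) p_{Y|X}(y|⋅) on a grid and normalize to obtain p_{X|Y}(⋅|y).

```python
import numpy as np

# Grid over the latent quantity X; prior X ~ N(0, 1), likelihood Y | X = x ~ N(x, 0.5^2).
x_grid = np.linspace(-5, 5, 2001)
dx = x_grid[1] - x_grid[0]
p_X = np.exp(-0.5 * x_grid**2) / np.sqrt(2 * np.pi)

y_obs = 1.2
p_Y_given_X = np.exp(-0.5 * ((y_obs - x_grid) / 0.5)**2) / (0.5 * np.sqrt(2 * np.pi))

# Bayes' rule: p_{X|Y}(x|y) = p_X(x) p_{Y|X}(y|x) / p_Y(y), with p_Y(y) = sum of the numerator over x.
unnormalized = p_X * p_Y_given_X
p_X_given_Y = unnormalized / (unnormalized.sum() * dx)

# Posterior mean; for this conjugate Gaussian pair it should be y_obs / (1 + 0.25) = 0.96.
print((x_grid * p_X_given_Y * dx).sum())
```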
Various estimators for X ∣ Y = y exist, often based on p_{X|Y} ∝ p_{Y|X} p_X.
MLE (respectively MAP). Maximize p_{Y|X}(y ∣ ⋅) (respectively p_{X|Y}(⋅ ∣ y)).
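A quick sketch contrasting the two (Python/NumPy; the Gaussian prior and likelihood are illustrative assumptions): the MLE maximizes the likelihood x ↦ p_{Y|X}(y|x) alone, while the MAP also weights by the prior p_X.

```python
import numpy as np

y_obs = 1.2
x_grid = np.linspace(-5, 5, 20_001)

# Illustrative model: prior X ~ N(0, 1), likelihood Y | X = x ~ N(x, 0.5^2).
log_prior = -0.5 * x_grid**2
log_lik = -0.5 * ((y_obs - x_grid) / 0.5)**2

x_mle = x_grid[np.argmax(log_lik)]               # maximizes p_{Y|X}(y | .)
x_map = x_grid[np.argmax(log_lik + log_prior)]   # maximizes p_{X|Y}(. | y)

print(x_mle)   # ≈ 1.2, the observation itself
print(x_map)   # ≈ 0.96, pulled toward the prior mean 0
```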
Importance sampling. Sample from the prior and reweight by the likelihood:
$$\int_X x\, \mu_{X\mid Y}(dx \mid y) \;\propto\; \int_X x\, p_{Y\mid X}(y \mid x)\, \mu_X(dx),$$
i.e. sample $x_1, \dots, x_M \sim \mu_X$ and take
$$\hat{x} = \sum_{i=1}^{M} \frac{p(y \mid x_i)}{\sum_{j=1}^{M} p(y \mid x_j)}\, x_i.$$
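A minimal self-normalized importance-sampling sketch of exactly this weighted estimate of the posterior mean (Python/NumPy; the Gaussian prior and likelihood are illustrative assumptions).

```python
import numpy as np

rng = np.random.default_rng(2)
y_obs = 1.2
M = 100_000

# Sample from the prior mu_X = N(0, 1), then weight by the likelihood p_{Y|X}(y | x_i) = N(y; x_i, 0.5^2).
x = rng.standard_normal(M)
log_w = -0.5 * ((y_obs - x) / 0.5)**2          # unnormalized log-weights log p(y | x_i)
w = np.exp(log_w - log_w.max())                 # stabilize before normalizing
w /= w.sum()

x_hat = np.sum(w * x)                           # self-normalized estimate of E[X | Y = y]
print(x_hat)                                    # ≈ 0.96 for this conjugate Gaussian example
```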
MCMC. Construct a Markov chain with proposal kernel Q(dx′ ∣ x) = q(x′ ∣ x) dx′ and an acceptance/rejection scheme that ensures the invariant distribution is μ_{X|Y}.
Do we get the idea?
1. Initialize state x_0.
2. For k = 0, …, L−1 do:
3.   Sample proposal x̃_{k+1} ∼ Q(⋅ ∣ x_k).
4.   Sample a_k ∼ U(0, 1) and set
$$a(x_k, \tilde{x}_{k+1}) = \min\left\{1,\ \frac{p_X(\tilde{x}_{k+1})\, p_{Y\mid X}(y \mid \tilde{x}_{k+1})\, q(x_k \mid \tilde{x}_{k+1})}{p_X(x_k)\, p_{Y\mid X}(y \mid x_k)\, q(\tilde{x}_{k+1} \mid x_k)}\right\},$$
$$x_{k+1} = \begin{cases} \tilde{x}_{k+1}, & a_k \le a(x_k, \tilde{x}_{k+1}) \\ x_k, & \text{otherwise.} \end{cases}$$
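A minimal sketch of the scheme above (Python/NumPy; the Gaussian target is an illustrative assumption, and a symmetric random-walk proposal is used so the q-terms cancel in the acceptance ratio).

```python
import numpy as np

rng = np.random.default_rng(3)
y_obs = 1.2

# Illustrative target: prior p_X = N(0, 1), likelihood p_{Y|X}(y|x) = N(y; x, 0.5^2),
# so the chain should have invariant measure mu_{X|Y=y_obs} = N(0.96, 0.2).
def log_post(x):
    return -0.5 * x**2 - 0.5 * ((y_obs - x) / 0.5)**2

L, step = 50_000, 1.0
x = np.empty(L + 1)
x[0] = 0.0                                       # 1: initialize state x_0
for k in range(L):                               # 2: for k = 0, ..., L-1
    prop = x[k] + step * rng.standard_normal()   # 3: proposal from Q(.|x_k), a symmetric random walk
    # 4: acceptance probability; the q-terms cancel for a symmetric proposal
    a = min(1.0, np.exp(log_post(prop) - log_post(x[k])))
    x[k + 1] = prop if rng.uniform() <= a else x[k]

print(x[L // 2:].mean())                         # ≈ 0.96 after discarding burn-in
```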
Learn a proposal kernel T(⋅, α)_#Q through a variational method.
1. Initialize state x_0 and parameter α.
2. For k = 0, …, L−1 do:
3a.   Compute reference r_k = T(x_k, α).
3b.   Sample proposal reference r̃_{k+1} ∼ Q(⋅ ∣ r_k).
3c.   Evaluate proposal x̃_{k+1} = T⁻¹(r̃_{k+1}, α).
4.   Sample a_k ∼ U(0, 1) and set
$$a(x_k, \tilde{x}_{k+1}) = \min\left\{1,\ \frac{p_X(\tilde{x}_{k+1})\, p_{Y\mid X}(y \mid \tilde{x}_{k+1})\, q(r_k \mid \tilde{r}_{k+1})\, \lvert \det \nabla T(x_k, \alpha) \rvert}{p_X(x_k)\, p_{Y\mid X}(y \mid x_k)\, q(\tilde{r}_{k+1} \mid r_k)\, \lvert \det \nabla T(\tilde{x}_{k+1}, \alpha) \rvert}\right\},$$
$$x_{k+1} = \begin{cases} \tilde{x}_{k+1}, & a_k \le a(x_k, \tilde{x}_{k+1}) \\ x_k, & \text{otherwise.} \end{cases}$$
5.   If (k+1) mod K_U = 0 then
6.     update α by optimizing the estimated divergence γ ↦ C(T(⋅; γ)) induced by the running chain {x_1, …, x_{k+1}}.
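A minimal sketch of steps 3a-4 (Python/NumPy; the Gaussian target and the fixed map T(x, α) = sinh(αx) are illustrative assumptions, and the update of α in steps 5-6 is omitted), showing the Jacobian correction |det ∇T| in the acceptance ratio.

```python
import numpy as np

rng = np.random.default_rng(4)
y_obs = 1.2

def log_post(x):
    # Illustrative target: prior N(0, 1), likelihood N(y; x, 0.5^2), as before.
    return -0.5 * x**2 - 0.5 * ((y_obs - x) / 0.5)**2

# Illustrative fixed transport T(x, alpha) = sinh(alpha * x); in the full scheme alpha
# would be re-optimized every K_U steps (steps 5-6), which is not done here.
alpha = 1.5
T = lambda x: np.sinh(alpha * x)
T_inv = lambda r: np.arcsinh(r) / alpha
log_abs_det_grad_T = lambda x: np.log(alpha * np.cosh(alpha * x))

L, step = 50_000, 1.0
x = np.empty(L + 1)
x[0] = 0.0                                         # 1: initialize state x_0
for k in range(L):                                 # 2: for k = 0, ..., L-1
    r = T(x[k])                                    # 3a: reference r_k = T(x_k, alpha)
    r_prop = r + step * rng.standard_normal()      # 3b: proposal reference from Q(.|r_k), symmetric so q cancels
    x_prop = T_inv(r_prop)                         # 3c: proposal x~_{k+1} = T^{-1}(r~_{k+1}, alpha)
    # 4: acceptance ratio with the Jacobian correction |det grad T|
    log_a = (log_post(x_prop) + log_abs_det_grad_T(x[k])
             - log_post(x[k]) - log_abs_det_grad_T(x_prop))
    x[k + 1] = x_prop if rng.uniform() <= min(1.0, np.exp(log_a)) else x[k]

print(x[L // 2:].mean())                           # ≈ 0.96, same posterior mean as before
```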