A Detailed Guide to the EM Algorithm and Gaussian Mixture Models (GMM)

1. Maximum Likelihood Estimation

Maximum Likelihood Estimation (MLE) is the process of working backwards from observed sample outcomes to the parameter values that are most likely (highest probability) to have produced them. Plainly put: given some data assumed to be drawn at random from a known family of distributions whose specific parameter values are unknown ("the model is fixed, the parameters are unknown"), MLE estimates those parameters. Its goal is to find the set of model parameters under which the model assigns the greatest probability to the observed data:
$$
\arg\max_{\theta} p(X;\theta)
$$

The MLE solution procedure:
  • Write the likelihood function, i.e., with the sample fixed, the probability of the sample occurring as a function of the parameter θ
  • Take the logarithm of the likelihood function
  • Differentiate the log-likelihood function
  • Solve the resulting likelihood equation

$$
L(X; \theta)\to l(\theta)=\ln L(X;\theta)\to\frac{\partial l}{\partial\theta}, \qquad \arg\max_p P(x_i;p)=\arg\max_p\ln P(x_i;p)
$$
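As a concrete illustration (a minimal sketch, not from the original text): for a Gaussian, carrying out these steps yields the sample mean and the (biased) sample variance in closed form, which the snippet below verifies numerically.

```python
import numpy as np

# Minimal MLE sketch: estimate the mean and variance of a Gaussian
# from observed samples. Setting the derivative of the log-likelihood
# to zero gives the sample mean and the (biased, divide-by-n) sample
# variance as the MLE solutions.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=10_000)  # "model known, parameters unknown"

mu_mle = x.mean()                        # solves d l / d mu = 0
sigma2_mle = ((x - mu_mle) ** 2).mean()  # solves d l / d sigma^2 = 0

print(mu_mle, sigma2_mle)  # approximately 2.0 and 1.5**2 = 2.25
```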

Maximum A Posteriori Estimation

Maximum A Posteriori estimation (MAP), like MLE, estimates the parameter $\theta$ from samples. MLE finds the $\theta$ that maximizes the likelihood $P(x|\theta)$, implicitly assuming a uniform prior over $\theta$. MAP instead finds the $\theta$ that maximizes $P(x|\theta)P(\theta)$: $\theta$ must not only make the likelihood large, but also have a reasonably large prior probability itself. MAP can be viewed as an application of Bayes' rule.
$$
P(\theta'|X)=\frac{P(\theta')P(X|\theta')}{P(X)}\;\to\;\arg\max_{\theta'}P(\theta'|X)\;\to\;\arg\max_{\theta'}P(\theta')P(X|\theta')
$$
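As a small worked example (my own illustration, with an assumed Beta prior): for coin flips with a $\mathrm{Beta}(a,b)$ prior on the heads probability, the MAP estimate has a closed form that visibly pulls the MLE toward the prior.

```python
import numpy as np

# Minimal MAP sketch (illustrative, not from the original text):
# coin flips with a Beta(a, b) prior on the heads probability theta.
# MLE maximizes P(x|theta); MAP maximizes P(x|theta) * P(theta),
# which for the Beta-Bernoulli pair has the closed form below.
rng = np.random.default_rng(0)
flips = rng.binomial(1, 0.7, size=20)   # small sample: the prior matters
heads, n = flips.sum(), flips.size
a, b = 5.0, 5.0                         # prior belief: theta is near 0.5

theta_mle = heads / n
theta_map = (heads + a - 1) / (n + a + b - 2)

print(theta_mle, theta_map)  # MAP is pulled toward the prior mean 0.5
```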

2. The EM Algorithm

The EM algorithm (Expectation-Maximization) is an iterative algorithm for finding maximum likelihood or maximum a posteriori estimates of the parameters of a probabilistic model, where the model depends on unobserved hidden (latent) variables.

The EM algorithm flow:
  • Initialize the distribution parameters
  • Repeat the following two steps until convergence:
  • E-step: estimate the expected distribution of the hidden variables
  • M-step: re-estimate the distribution parameters from that expectation

The principle behind the EM algorithm:

Given m training samples $x^{(1)}, x^{(2)}, \dots, x^{(m)}$, assumed mutually independent, we seek the model parameter $\theta$ that maximizes the log-likelihood of the model distribution:
$$
\theta=\arg\max_{\theta}\sum_{i=1}^m\log P(x^{(i)};\theta)
$$
Now suppose the data involve hidden variables $z=\{z^{(1)}, z^{(2)}, \dots, z^{(k)}\}$. The log-likelihood to be maximized becomes:
$$
\begin{aligned}
\theta&=\arg\max_{\theta}\sum_{i=1}^m\log P(x^{(i)};\theta) \\
&=\arg\max_{\theta}\sum_{i=1}^m\log\sum_{z^{(i)}}P(z^{(i)})P(x^{(i)}|z^{(i)};\theta) \\
&=\arg\max_{\theta}\sum_{i=1}^m\log\sum_{z^{(i)}}P(x^{(i)},z^{(i)};\theta)
\end{aligned}
$$
Let $Q(z;\theta)$ be a distribution over $z$ with $Q(z;\theta)\ge 0$. Then:
$$
\begin{aligned}
\sum_zQ(z;\theta)&=1 \\
l(\theta)&=\sum_{i=1}^m\log\sum_zp(x,z;\theta) \\
&=\sum_{i=1}^m\log\sum_zQ(z;\theta)\cdot\frac{p(x,z;\theta)}{Q(z;\theta)} \\
&=\sum_{i=1}^m\log\!\left(E_Q\!\left[\frac{p(x,z;\theta)}{Q(z;\theta)}\right]\right) \\
&\ge\sum_{i=1}^mE_Q\!\left[\log\frac{p(x,z;\theta)}{Q(z;\theta)}\right] \\
&=\sum_{i=1}^m\sum_zQ(z;\theta)\log\frac{p(x,z;\theta)}{Q(z;\theta)}
\end{aligned}
$$

Jensen's inequality:

If $f$ is a convex function, then:
$$
f(\theta x+(1-\theta)y)\le\theta f(x)+(1-\theta)f(y)
$$

More generally, if $\theta_1, \theta_2, \dots, \theta_k\ge 0$ and $\theta_1+\dots+\theta_k=1$, then
$$
\begin{aligned}
f(\theta_1x_1+\dots+\theta_kx_k)&\le\theta_1f(x_1)+\dots+\theta_kf(x_k) \\
f(E[x])&\le E[f(x)]
\end{aligned}
$$
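A quick numeric check of the inequality (an illustrative sketch, not part of the original derivation): since $\log$ is concave, Jensen flips to $E[\log X]\le\log E[X]$, which is exactly the direction used to lower-bound $l(\theta)$ above.

```python
import numpy as np

# Numeric check of Jensen's inequality for the concave log:
# E[log X] <= log E[X].
rng = np.random.default_rng(0)
x = rng.uniform(0.5, 2.0, size=100_000)

print(np.log(x).mean())   # E[log X]
print(np.log(x.mean()))   # log E[X] -- the larger of the two
```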
Note that $\log$ is concave, so Jensen's inequality reverses direction, which is what produces the lower bound on $l(\theta)$ above. By the equality condition of Jensen's inequality, $l(\theta)$ attains the bound with equality only when the following quantity is a constant:
$$
\begin{aligned}
\frac{p(x,z;\theta)}{Q(z;\theta)}&=c \\
\sum_zQ(z;\theta)&=1 \\
Q(z;\theta)&=\frac{p(x, z; \theta)}{c}
=\frac{p(x, z; \theta)}{c\cdot\sum_{z^i}Q(z^i;\theta)}
=\frac{p(x, z; \theta)}{\sum_{z^i}c\cdot Q(z^i;\theta)}
=\frac{p(x, z; \theta)}{\sum_{z^i}p(x, z^i; \theta)}
=\frac{p(x, z; \theta)}{p(x;\theta)}
=p(z|x;\theta) \\
\theta&=\arg\max_{\theta}l(\theta)
=\arg\max_{\theta}\sum_{i=1}^m\sum_zQ(z;\theta)\log\frac{p(x,z;\theta)}{Q(z;\theta)} \\
&=\arg\max_{\theta}\sum_{i=1}^m\sum_zQ(z|x;\theta)\log\frac{p(x,z;\theta)}{Q(z|x;\theta)} \\
&=\arg\max_{\theta}\sum_{i=1}^m\sum_zQ(z|x;\theta)\log p(x,z;\theta)
\end{aligned}
$$

(In the last step the $-\log Q(z|x;\theta)$ term is dropped because $Q$ is held fixed at the current parameter estimate during the maximization, so it does not affect the $\arg\max$ over $\theta$.)

The full EM algorithm procedure:

Given sample data $x=\{x_1, x_2, \dots, x_k\}$, the joint distribution $p(x, z;\theta)$, the conditional distribution $p(z|x;\theta)$, and a maximum number of iterations $J$:

  • Randomly initialize the model parameter θ to some initial value $\theta_0$
  • Run the EM iteration (a generic skeleton is sketched after this list):
  • E-step: compute the conditional expectation of the joint distribution
    $$
    \begin{aligned}
    Q_j&=p(z|x;\theta_j) \\
    l(\theta)&=\sum_{i=1}^m\sum_zQ_j\log p(x,z;\theta)
    \end{aligned}
    $$
  • M-step: maximize $l(\theta)$ to obtain $\theta_{j+1}$
    $$
    \theta_{j+1}=\arg\max_{\theta}l(\theta)
    $$
  • If $\theta_{j+1}$ has converged, stop and output the final model parameter $\theta$; otherwise continue iterating
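Here is that loop as a generic skeleton in Python (a minimal sketch; `e_step` and `m_step` are hypothetical callbacks the user supplies for a concrete model, and `theta` is assumed to be a numpy array):

```python
import numpy as np

def em(x, theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM loop: alternate E- and M-steps until theta stabilizes."""
    theta = theta0
    for _ in range(max_iter):            # at most J iterations
        q = e_step(x, theta)             # E-step: q = p(z | x; theta_j)
        new_theta = m_step(x, q)         # M-step: maximize the expected
                                         # complete-data log-likelihood
        if np.max(np.abs(new_theta - theta)) < tol:  # convergence check
            return new_theta
        theta = new_theta
    return theta
```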
Proof that the EM algorithm converges

To establish convergence of the EM algorithm, it suffices to show that the log-likelihood does not decrease across iterations, i.e., that the following holds:
$$
\sum_{i=1}^m\log p(x_i; \theta_{j+1})\ge\sum_{i=1}^m\log p(x_i; \theta_j)
$$
The proof proceeds as follows. Define:
$$
\begin{aligned}
L(\theta, \theta_j)&=\sum_{i=1}^m\sum_zp(z|x_i;\theta_j)\log p(x_i, z;\theta) \\
H(\theta, \theta_j)&=\sum_{i=1}^m\sum_zp(z|x_i;\theta_j)\log p(z|x_i;\theta) \\
L(\theta, \theta_j)-H(\theta, \theta_j)&=\sum_{i=1}^m\log p(x_i;\theta)
\end{aligned}
$$

Evaluating this difference at $\theta_{j+1}$ and at $\theta_j$ and subtracting:

$$
[L(\theta_{j+1}, \theta_j)-L(\theta_j, \theta_j)]-[H(\theta_{j+1}, \theta_j)-H(\theta_j, \theta_j)]=\sum_{i=1}^m\log p(x_i;\theta_{j+1})-\sum_{i=1}^m\log p(x_i;\theta_j)
$$

The M-step chooses $\theta_{j+1}$ to maximize $L(\theta, \theta_j)$ over $\theta$, so

$$
L(\theta_{j+1}, \theta_j)-L(\theta_j, \theta_j)\ge 0
$$

while by Jensen's inequality (again, $\log$ is concave):

$$
H(\theta_{j+1}, \theta_j)-H(\theta_j, \theta_j)=\sum_{i=1}^m\sum_zp(z|x_i;\theta_j)\log\frac{p(z|x_i;\theta_{j+1})}{p(z|x_i;\theta_j)}
\le\sum_{i=1}^m\log\Big(\sum_zp(z|x_i;\theta_j)\cdot\frac{p(z|x_i;\theta_{j+1})}{p(z|x_i;\theta_j)}\Big)=\sum_{i=1}^m\log 1=0
$$

Combining the two bounds gives

$$
\sum_{i=1}^m\log p(x_i;\theta_{j+1})-\sum_{i=1}^m\log p(x_i;\theta_j)\ge0
$$

which completes the proof.

3. Gaussian Mixture Models

A GMM (Gaussian Mixture Model) is a model built as a linear superposition of several Gaussian distributions; each Gaussian is called a component. A GMM describes a distribution that the data itself is assumed to follow.
GMM is commonly used for clustering, where the number of components can be taken as the number of clusters. Assuming the GMM is a linear superposition of K Gaussians, its probability density function is:
$$
p(x)=\sum_{k=1}^Kp(k)\,p(x|k)=\sum_{k=1}^K\pi_k\,p(x;\mu_k, \Sigma_k)
$$
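To make the density concrete, here is a minimal sketch that evaluates the mixture density of a toy two-component GMM (the parameter values are made up for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy 2-component GMM: p(x) = sum_k pi_k * N(x; mu_k, Sigma_k).
pis = np.array([0.4, 0.6])
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
covs = [np.eye(2), 0.5 * np.eye(2)]

def gmm_pdf(x):
    return sum(pi * multivariate_normal(mu, cov).pdf(x)
               for pi, mu, cov in zip(pis, mus, covs))

print(gmm_pdf(np.array([1.0, 1.0])))
```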

  • Log-likelihood:
    $$
    l(\pi, \mu, \Sigma)=\sum_{i=1}^N\log\Big(\sum_{k=1}^K\pi_k\,p(x_i;\mu_k,\Sigma_k)\Big)
    $$
  • E-step:
    $$
    w_j^{(i)}=Q_i(z^{(i)}=j)=p(z^{(i)}=j\mid x^{(i)};\pi, \mu, \Sigma)
    $$
  • M-step:
    $$
    \begin{aligned}
    l(\pi, \mu, \Sigma)&=\sum_{i=1}^m\sum_{z^{(i)}}Q_i(z^{(i)})\log\frac{p(x^{(i)}, z^{(i)};\pi, \mu, \Sigma)}{Q_i(z^{(i)})} \\
    &=\sum_{i=1}^m\sum_{j=1}^kQ_i(z^{(i)}=j)\log\frac{p(x^{(i)}|z^{(i)}=j;\mu, \Sigma)\cdot p(z^{(i)}=j;\pi)}{Q_i(z^{(i)}=j)} \\
    &=\sum_{i=1}^m\sum_{j=1}^kw_j^{(i)}\log\frac{\frac{1}{(2\pi)^{n/2}|\Sigma_j|^{1/2}}\exp\!\big(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\big)\cdot\pi_j}{w_j^{(i)}}
    \end{aligned}
    $$
  • Taking the partial derivative with respect to the means:
    $$
    \begin{aligned}
    l(\pi, \mu, \Sigma)&=\sum_{i=1}^m\sum_{j=1}^kw_j^{(i)}\Big(-\frac{1}{2}(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\Big)+c \\
    \frac{\partial l}{\partial\mu_l}&=-\frac{1}{2}\frac{\partial}{\partial\mu_l}\sum_{i=1}^mw_l^{(i)}\big(x^{(i)T}\Sigma_l^{-1}x^{(i)}-x^{(i)T}\Sigma_l^{-1}\mu_l-\mu_l^T\Sigma_l^{-1}x^{(i)}+\mu_l^T\Sigma_l^{-1}\mu_l\big) \\
    &=\frac{1}{2}\sum_{i=1}^mw_l^{(i)}\big((x^{(i)T}\Sigma_l^{-1})^T+\Sigma_l^{-1}x^{(i)}-((\Sigma_l^{-1})^T+\Sigma_l^{-1})\mu_l\big) \\
    &=\sum_{i=1}^mw_l^{(i)}\big(\Sigma_l^{-1}x^{(i)}-\Sigma_l^{-1}\mu_l\big) \\
    \frac{\partial l}{\partial\mu_l}=0 &\;\Rightarrow\; \mu_l=\frac{\sum_{i=1}^mw_l^{(i)}x^{(i)}}{\sum_{i=1}^mw_l^{(i)}}
    \end{aligned}
    $$
  • Taking the partial derivative with respect to the covariances (the result involves $\mu_l$, the mean of the same component):
    $$
    \begin{aligned}
    l(\pi, \mu, \Sigma)&=\frac{1}{2}\sum_{i=1}^m\sum_{j=1}^kw_j^{(i)}\big(\log|\Sigma_j^{-1}|-(x^{(i)}-\mu_j)^T\Sigma_j^{-1}(x^{(i)}-\mu_j)\big)+c \\
    \frac{\partial l}{\partial\Sigma_l}&=\frac{1}{2}\sum_{i=1}^mw_l^{(i)}\big(\Sigma_l-(x^{(i)}-\mu_l)(x^{(i)}-\mu_l)^T\big) \\
    \frac{\partial l}{\partial\Sigma_l}=0 &\;\Rightarrow\; \Sigma_l=\frac{\sum_{i=1}^mw_l^{(i)}(x^{(i)}-\mu_l)(x^{(i)}-\mu_l)^T}{\sum_{i=1}^mw_l^{(i)}}
    \end{aligned}
    $$
  • Solving for the mixture weights with a Lagrange multiplier (the three resulting updates are put together in the sketch after this list):
    $$
    \begin{aligned}
    l(\pi, \mu, \Sigma)&=\sum_{i=1}^m\sum_{j=1}^kw_j^{(i)}\log\pi_j+c \\
    \text{s.t.}\quad&\sum_{j=1}^k\pi_j=1 \\
    L(\pi)&=\sum_{i=1}^m\sum_{j=1}^kw_j^{(i)}\log\pi_j+\beta\Big(\sum_{j=1}^k\pi_j-1\Big) \\
    \frac{\partial L}{\partial\pi_l}&=\sum_{i=1}^m\frac{w_l^{(i)}}{\pi_l}+\beta \\
    \frac{\partial L}{\partial\pi_l}=0 &\;\Rightarrow\;
    \begin{cases}
    \beta=-\sum_{i=1}^m\sum_{j=1}^kw_j^{(i)}=-m \\[4pt]
    \pi_l=\frac{1}{m}\sum_{i=1}^mw_l^{(i)}
    \end{cases}
    \end{aligned}
    $$
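Putting the three closed-form updates together, the following is a minimal numpy sketch of EM for a GMM (illustrative only; the function name `gmm_em` and its hyperparameters are my own choices, and production implementations such as sklearn.mixture.GaussianMixture add safeguards this sketch only hints at, like covariance regularization and proper convergence checks):

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(x, K, n_iter=100, seed=0):
    """Minimal EM for a GMM, following the update rules derived above."""
    m, n = x.shape
    rng = np.random.default_rng(seed)
    pis = np.full(K, 1.0 / K)                       # uniform initial weights
    mus = x[rng.choice(m, K, replace=False)]        # K random points as means
    covs = np.array([np.cov(x.T) + 1e-6 * np.eye(n) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: responsibilities w[i, j] = p(z^(i) = j | x^(i))
        dens = np.column_stack([
            pis[j] * multivariate_normal(mus[j], covs[j]).pdf(x)
            for j in range(K)
        ])
        w = dens / dens.sum(axis=1, keepdims=True)

        # M-step: closed-form updates for mu_l, Sigma_l, pi_l
        nk = w.sum(axis=0)                          # sum_i w_l^(i)
        mus = (w.T @ x) / nk[:, None]               # weighted means
        for j in range(K):
            d = x - mus[j]
            covs[j] = (w[:, j, None] * d).T @ d / nk[j] + 1e-6 * np.eye(n)
        pis = nk / m                                # pi_l = (1/m) sum_i w_l^(i)
    return pis, mus, covs
```

Run on data drawn from two well-separated Gaussians, this recovers weights, means, and covariances close to the generating parameters; the small `1e-6` ridge on each covariance keeps the matrices invertible when a component collapses onto very few points.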