Machine Learning Basics: Probability and Information Theory

  1. Preface
  2. Probability
    1. Conditional probability
    2. Independence
    3. Covariance
    4. Gaussian Distribution
    5. Dirac delta distribution & empirical distribution
    6. Mixture distribution
    7. Latent variable
    8. Useful properties of Common Functions
    9. Bayes’ Rule
    10. prior-probability and posterior-probability
    11. PDF of y where y=g(x)
  3. Information theory
    1. Self-information & Shannon entropy
    2. Structured Probabilistic Models
      1. DAG
      2. UAG

Preface

The body of this article is the probability theory used in machine learning, so it follows lazy evaluation: deeper exploration of these topics is done only when absolutely necessary. The article assumes you have already taken a course in probability and mathematical statistics; it only points out where the definitions in the Deep Learning book (花书) differ, and supplements the material specific to that book.

Probability

Conditional probability

Chain rule / product rule of conditional probability:

$$P(x^{(1)}, \ldots, x^{(n)}) = P(x^{(1)}) \prod_{i=2}^{n} P\big(x^{(i)} \mid x^{(1)}, \ldots, x^{(i-1)}\big)$$

e.g.

$$P(a, b, c) = P(a \mid b, c)\, P(b \mid c)\, P(c)$$
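As a quick sanity check of the chain rule, here is a minimal sketch in Python. The joint table, the helper `P`, and the particular assignment are all made up for illustration; the point is only that the product of the conditionals recovers the joint probability.

```python
import numpy as np

# A hypothetical joint distribution over three binary variables a, b, c,
# stored as a 2x2x2 table indexed as joint[a, b, c]. Values sum to 1.
joint = np.array([[[0.10, 0.05], [0.15, 0.10]],
                  [[0.20, 0.05], [0.05, 0.30]]])

def P(table, axes_to_keep):
    """Marginalize the joint onto the given axes (0=a, 1=b, 2=c)."""
    drop = tuple(ax for ax in range(3) if ax not in axes_to_keep)
    return table.sum(axis=drop)

a, b, c = 1, 0, 1  # an arbitrary assignment

# Chain rule: P(a, b, c) = P(a | b, c) * P(b | c) * P(c)
p_c = P(joint, (2,))[c]
p_b_given_c = P(joint, (1, 2))[b, c] / p_c
p_a_given_bc = joint[a, b, c] / P(joint, (1, 2))[b, c]

lhs = joint[a, b, c]
rhs = p_a_given_bc * p_b_given_c * p_c
print(lhs, rhs)  # both are (up to floating point) 0.05, so the factorization holds
```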

Independence

Independence:

$$\forall x, y:\quad p(x, y) = p(x)\,p(y)$$

Then x and y are independent of each other, written $x \perp y$.

Conditional independence:

$$\forall x, y, z:\quad p(x, y \mid z) = p(x \mid z)\,p(y \mid z)$$

Then x and y are conditionally independent given z, written $x \perp y \mid z$.

Covariance

Covariance measures how much two values vary together:

$$\operatorname{Cov}\big(f(x), g(y)\big) = \mathbb{E}\Big[\big(f(x) - \mathbb{E}[f(x)]\big)\big(g(y) - \mathbb{E}[g(y)]\big)\Big]$$

A large absolute covariance means that the values of f and g vary a lot and are far from their respective means at the same time. If the covariance is positive, f and g tend to take relatively large values simultaneously; if it is negative, one tends to take a high value while the other takes a low value.

The covariance matrix of a vector $\boldsymbol{x} \in \mathbb{R}^n$ is an $n \times n$ square matrix with

$$\operatorname{Cov}(\boldsymbol{x})_{i,j} = \operatorname{Cov}(x_i, x_j)$$

For the diagonal elements,

$$\operatorname{Cov}(x_i, x_i) = \operatorname{Var}(x_i)$$
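A small sketch of this definition in NumPy (the data matrix X is made up): the covariance matrix is built entry by entry from the formula above and compared with np.cov, and its diagonal is checked against the per-coordinate variances.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))        # 1000 samples of a 3-dimensional vector x
X[:, 2] = 0.5 * X[:, 0] + X[:, 2]     # introduce some correlation between coordinates

mean = X.mean(axis=0)
n = X.shape[1]

# Cov(x)_{i,j} = E[(x_i - E[x_i]) (x_j - E[x_j])]
cov = np.empty((n, n))
for i in range(n):
    for j in range(n):
        cov[i, j] = np.mean((X[:, i] - mean[i]) * (X[:, j] - mean[j]))

print(np.allclose(cov, np.cov(X, rowvar=False, bias=True)))  # True
print(np.allclose(np.diag(cov), X.var(axis=0)))              # diagonal elements are the variances
```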

Gaussian Distribution

The Gaussian distribution, also known as the normal distribution.
Probability density function (PDF):

$$\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}\, \exp\left(-\frac{1}{2\sigma^2}(x-\mu)^2\right)$$

Evaluating the density repeatedly requires the reciprocal of $\sigma^2$, so in practice the distribution is often parametrized instead by its precision $\beta \in (0, \infty)$:

$$\mathcal{N}(x; \mu, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}}\, \exp\left(-\frac{1}{2}\beta\,(x-\mu)^2\right)$$

Key properties of the Gaussian distribution:

  • Many complex real-world systems can be modeled well by a Gaussian distribution (central limit theorem).
  • Among all distributions with the same variance over the reals, the Gaussian has the greatest “uncertainty” (entropy). In other words, of all distributions it makes the fewest prior assumptions about the samples.

N-dimensional Gaussian distribution:

$$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \sqrt{\frac{1}{(2\pi)^n \det(\boldsymbol{\Sigma})}}\, \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x}-\boldsymbol{\mu})\right)$$

where $\boldsymbol{\Sigma}$ is a symmetric positive definite matrix. $\boldsymbol{\mu}$ is the mean of the distribution in vector form, and $\boldsymbol{\Sigma}$ is its covariance matrix. For ease of computation, the N-dimensional Gaussian is often parametrized by a precision matrix $\boldsymbol{\beta}$ instead:

$$\mathcal{N}(\boldsymbol{x}; \boldsymbol{\mu}, \boldsymbol{\beta}^{-1}) = \sqrt{\frac{\det(\boldsymbol{\beta})}{(2\pi)^n}}\, \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\beta}\, (\boldsymbol{x}-\boldsymbol{\mu})\right)$$

In practice the covariance matrix is usually restricted to be diagonal. An even simpler choice is an isotropic covariance matrix, i.e. a scalar multiple of the identity matrix.
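A sketch of the two parameterizations side by side (NumPy only; the mean, covariance, and evaluation point are arbitrary). Both functions implement the densities above; with the precision matrix no matrix inverse is needed at evaluation time.

```python
import numpy as np

def gaussian_pdf_cov(x, mu, Sigma):
    """N-dimensional Gaussian density parametrized by the covariance matrix Sigma."""
    n = len(mu)
    diff = x - mu
    norm = np.sqrt(1.0 / ((2 * np.pi) ** n * np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

def gaussian_pdf_prec(x, mu, Beta):
    """Same density, parametrized by the precision matrix Beta = inv(Sigma)."""
    n = len(mu)
    diff = x - mu
    norm = np.sqrt(np.linalg.det(Beta) / (2 * np.pi) ** n)
    return norm * np.exp(-0.5 * diff @ Beta @ diff)

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])          # symmetric positive definite
x = np.array([0.5, 0.5])

print(gaussian_pdf_cov(x, mu, Sigma))
print(gaussian_pdf_prec(x, mu, np.linalg.inv(Sigma)))  # same value
```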

Dirac delta distribution & empirical distribution

Sometimes we want all of the probability mass of a density to concentrate around a single point. This can be accomplished with the Dirac delta function:

$$p(x) = \delta(x - \mu)$$

The Dirac delta distribution is often used as a building block of the empirical distribution:

$$\hat{p}(\boldsymbol{x}) = \frac{1}{m} \sum_{i=1}^{m} \delta\big(\boldsymbol{x} - \boldsymbol{x}^{(i)}\big)$$

The empirical distribution puts probability mass $\frac{1}{m}$ on each of the m observed points $\boldsymbol{x}^{(1)}, \ldots, \boldsymbol{x}^{(m)}$.
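In code, sampling from the empirical distribution is just picking one of the observed points uniformly at random; a minimal sketch with a made-up dataset:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.array([1.5, -0.3, 2.2, 0.7, -1.1])   # m = 5 observed points

# The empirical distribution puts mass 1/m on each observed point,
# so drawing from it means choosing one of the points uniformly at random.
samples = rng.choice(data, size=10, replace=True)
print(samples)
```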

Mixture distribution

Defining a probability distribution in terms of other, simpler distributions is very common, and the mixture distribution is one such construction. A mixture distribution is made up of several components. On every draw, the outcome of a multinoulli distribution chooses the component identity, which determines which component distribution produces the final sample:

$$P(x) = \sum_{i} P(c = i)\, P(x \mid c = i)$$

where P(c) is the multinoulli distribution over the component identities.

A common and powerful mixture model is the Gaussian mixture model (GMM), in which every component is a Gaussian distribution with its own parameters $\boldsymbol{\mu}^{(i)}$ and $\boldsymbol{\Sigma}^{(i)}$. Additional constraints can be imposed, for example making all components share a single covariance matrix.

A Gaussian mixture is a universal approximator of densities: with enough components, a GMM can approximate any smooth density to any desired nonzero amount of error.

PS: when using Gaussian distributions for a classification problem, letting the two class-conditional distributions share a covariance matrix $\boldsymbol{\Sigma}$ usually works better than giving each class its own.
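A minimal sketch of ancestral sampling from a 1-D Gaussian mixture (the weights, means, and standard deviations are made up): first draw the component identity c from the multinoulli P(c), then draw x from the selected Gaussian; the mixture density is the weighted sum of the component densities.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical 1-D mixture of three Gaussians.
weights = np.array([0.5, 0.3, 0.2])   # P(c): multinoulli over component identities
means   = np.array([-2.0, 0.0, 3.0])  # mu of each component
stds    = np.array([0.5, 1.0, 0.8])   # sigma of each component

def sample_gmm(n):
    c = rng.choice(len(weights), size=n, p=weights)   # pick a component per sample
    return rng.normal(means[c], stds[c]), c           # then sample from that Gaussian

x, c = sample_gmm(5)
print(c)   # which (latent) component generated each sample
print(x)

def gmm_pdf(x):
    """P(x) = sum_i P(c=i) * N(x; mu_i, sigma_i^2)."""
    x = np.atleast_1d(x)[:, None]
    comp = np.exp(-0.5 * ((x - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return comp @ weights

print(gmm_pdf(0.0))
```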

Latent variable

A latent variable is a random variable that cannot be observed directly. The component identity variable c of a mixture distribution is a latent variable.

The distribution P(c) over the latent variable and the conditional distribution P(x|c) together determine the distribution P(x), even though P(x) can be described without any reference to the latent variable.

Useful properties of Common Functions

logistic sigmoid:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

softplus:

$$\zeta(x) = \log\left(1 + e^{x}\right)$$

Why the name softplus?
It is a “softened” version of $x^{+} = \max(0, x)$.
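A small sketch of both functions in numerically stable form (NumPy only), including a check of the identity $\zeta(x) - \zeta(-x) = x$, one of the useful properties of these functions:

```python
import numpy as np

def sigmoid(x):
    """Logistic sigmoid sigma(x) = 1 / (1 + exp(-x)), computed without overflow."""
    out = np.empty_like(x, dtype=float)
    pos = x >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-x[pos]))
    ex = np.exp(x[~pos])
    out[~pos] = ex / (1.0 + ex)
    return out

def softplus(x):
    """Softplus zeta(x) = log(1 + exp(x)); logaddexp(0, x) avoids overflow."""
    return np.logaddexp(0.0, x)

x = np.linspace(-5, 5, 5)
print(sigmoid(x))
print(softplus(x))
print(np.allclose(softplus(x) - softplus(-x), x))  # zeta(x) - zeta(-x) = x
```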

Bayes’ Rule

When P(y|x) and P(x) are known and P(x|y) is needed, we can use Bayes’ rule:

$$P(x \mid y) = \frac{P(x)\, P(y \mid x)}{P(y)}$$

where

$$P(y) = \sum_{x} P(y \mid x)\, P(x)$$

prior-probability and posterior-probability

  • prior-probability
    The prior probability: a probability obtained from past experience and analysis, before the current evidence is taken into account.
  • posterior-probability
    The posterior probability: the conditional probability obtained after the relevant evidence or data has been taken into account.

Consider Bayes’ Rule:

$$P(\theta \mid x) = \frac{P(x \mid \theta)\, P(\theta)}{P(x)}$$

  • θ: parameter
  • x: observed value
  • P(x): evidence
  • P(θ): prior
  • P(x|θ): likelihood
  • P(θ|x): posterior
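A tiny sketch of these terms on a discrete parameter (the parameter grid and the coin-flip setup are made up for illustration): θ is the unknown heads probability restricted to three candidate values, x is an observed sequence of flips, and the posterior follows from the prior and likelihood via Bayes’ rule.

```python
import numpy as np

theta = np.array([0.3, 0.5, 0.7])      # candidate parameter values
prior = np.array([1/3, 1/3, 1/3])      # P(theta): uniform prior

flips = np.array([1, 1, 0, 1, 1])      # observed data x (1 = heads)

# P(x | theta): likelihood of the whole sequence under each candidate theta
likelihood = np.prod(np.where(flips[:, None] == 1, theta, 1 - theta), axis=0)

evidence = np.sum(likelihood * prior)       # P(x) = sum_theta P(x|theta) P(theta)
posterior = likelihood * prior / evidence   # P(theta | x)

print(posterior)  # mass shifts toward theta = 0.7 after seeing mostly heads
```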

PDF of y where y=g(x)

Suppose we have random variables x and y with y = g(x), and we want the PDF of y.

By the definition of a PDF, $p_x(x)\,\delta x$ is the probability that x falls in an infinitesimal neighborhood of width $\delta x$. Requiring that this probability mass be preserved under the transform gives

$$\left| p_y\big(g(x)\big)\, dy \right| = \left| p_x(x)\, dx \right|
\quad\Longrightarrow\quad
p_y(y) = p_x\big(g^{-1}(y)\big)\, \left| \frac{\partial x}{\partial y} \right|$$

In the higher-dimensional case, where $\boldsymbol{x}$ and $\boldsymbol{y}$ are vectors, define the Jacobian matrix $J$ with $J_{i,j} = \frac{\partial x_i}{\partial y_j}$; then

$$p_y(\boldsymbol{y}) = p_x(\boldsymbol{x})\, \left| \det(J) \right|$$
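A sketch of the scalar change-of-variables formula checked by Monte Carlo; the transform y = exp(x) with x ~ N(0, 1) is chosen only for illustration. The analytic density from the formula above is compared with a histogram of transformed samples.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=1_000_000)   # x ~ N(0, 1)
y = np.exp(x)                    # y = g(x) = exp(x)

def p_y(y):
    """p_y(y) = p_x(g^{-1}(y)) * |dx/dy|, with g^{-1}(y) = log(y) and dx/dy = 1/y."""
    x = np.log(y)
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi) / y

# Compare the analytic density against a histogram estimate on [0.5, 3].
hist, edges = np.histogram(y, bins=50, range=(0.5, 3.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - p_y(centers))))   # small: the formula matches the samples
```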

Information theory

Basic assumptions:

  • Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever.
  • Less likely events should have higher information content.
  • Independent events should have additive information. For example, finding out that a tossed coin has come up as heads twice should convey twice as much information as finding out that a tossed coin has come up as heads once.

Self-information & Shannon entropy

The information contained in a single event, measured in nats:

$$I(x) = -\log P(x)$$

The uncertainty of an entire probability distribution, i.e. the Shannon entropy:

$$H(x) = \mathbb{E}_{x \sim P}\big[I(x)\big] = -\mathbb{E}_{x \sim P}\big[\log P(x)\big]$$

also written H(P).

When P(x) and Q(x) are distributions over the same random variable x, the “distance” between the two distributions is defined by the Kullback-Leibler (KL) divergence:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \mathbb{E}_{x \sim P}\left[\log \frac{P(x)}{Q(x)}\right]$$

Its value is the extra amount of information needed to send a message containing symbols drawn from probability distribution P, when we use a code that was designed to minimize the length of messages drawn from probability distribution Q.

Properties of the KL divergence:

  • non-negative
  • not symmetric

A quantity closely related to the KL divergence is the cross-entropy, defined as

$$H(P, Q) = H(P) + D_{\mathrm{KL}}(P \,\|\, Q) = -\mathbb{E}_{x \sim P}\big[\log Q(x)\big]$$
Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term.
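A sketch for discrete distributions (P and Q below are arbitrary) verifying the relation H(P, Q) = H(P) + D_KL(P || Q); everything is measured in nats since natural logarithms are used.

```python
import numpy as np

P = np.array([0.6, 0.3, 0.1])   # two arbitrary distributions over the same 3 outcomes
Q = np.array([0.2, 0.5, 0.3])

def entropy(p):
    """Shannon entropy H(P) = -E_{x~P}[log P(x)], in nats."""
    return -np.sum(p * np.log(p))

def kl(p, q):
    """D_KL(P || Q) = E_{x~P}[log P(x) - log Q(x)]; non-negative, not symmetric."""
    return np.sum(p * (np.log(p) - np.log(q)))

def cross_entropy(p, q):
    """H(P, Q) = -E_{x~P}[log Q(x)]."""
    return -np.sum(p * np.log(q))

print(entropy(P))
print(kl(P, Q), kl(Q, P))                                      # not symmetric
print(np.isclose(cross_entropy(P, Q), entropy(P) + kl(P, Q)))  # True
```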

Structured Probabilistic Models

Machine learning models involve many thousands of parameters (random variables), and working with a single distribution over all of them is impractical. Using the product rule of conditional probability, a large distribution can be split into a product of smaller ones (a process called factorization). When a graph, in the computer-science sense, is used to represent this factorization, the model is called a structured probabilistic model or graphical model. Structured probabilistic models come in two flavors, based on DAGs and UAGs respectively.

DAG

Directed models use graphs with directed edges, and they represent factorizations into conditional probability distributions. Specifically, a directed model contains one factor for every random variable $x_i$ in the distribution:

$$p(\boldsymbol{x}) = \prod_{i} p\big(x_i \mid Pa_{\mathcal{G}}(x_i)\big)$$

where $Pa_{\mathcal{G}}(x_i)$ denotes the parents of $x_i$ in the graph $\mathcal{G}$.
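For instance, for a hypothetical directed graph with edges a→c, b→c, and c→d, the factorization reads

$$p(a, b, c, d) = p(a)\, p(b)\, p(c \mid a, b)\, p(d \mid c)$$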

UAG

Undirected models use graphs with undirected edges, and they represent factorizations into a set of functions; unlike in the directed case, these functions are usually not probability distributions of any kind. Any set of nodes that are all connected to each other in $\mathcal{G}$ is called a clique. Each clique $\mathcal{C}^{(i)}$ in an undirected model is associated with a factor $\phi^{(i)}$. These factors are just functions, not probability distributions. The output of each factor must be non-negative, but there is no constraint that the factor must sum or integrate to 1 like a probability distribution.

The probability of a configuration of random variables is proportional to the product of all of these factors; assignments that result in larger factor values are more likely. Of course, there is no guarantee that this product will sum to 1. We therefore divide by a normalizing constant Z, defined to be the sum or integral over all states of the product of the φ functions, in order to obtain a normalized probability distribution:

$$p(\boldsymbol{x}) = \frac{1}{Z} \prod_{i} \phi^{(i)}\big(\mathcal{C}^{(i)}\big)$$
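A minimal sketch (two binary variables forming one clique, with made-up factor values) of turning non-negative factors into a normalized distribution by dividing by Z:

```python
import numpy as np

# A hypothetical undirected model over two binary variables (a, b) forming one clique,
# with a single factor phi(a, b) >= 0. Larger factor values mean more likely configurations.
phi = np.array([[4.0, 1.0],
                [1.0, 4.0]])           # phi[a, b]

Z = phi.sum()                          # normalizing constant: sum over all states
p = phi / Z                            # p(a, b) = phi(a, b) / Z

print(Z)        # 10.0
print(p)        # a valid distribution: non-negative entries
print(p.sum())  # 1.0
```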

DAGs and UAGs are both ways of describing a probability distribution; they are not mutually exclusive families of distributions. Being directed or undirected is not a property of the probability distribution itself, but a property of one particular description of it.