Chapter 3 Uncertainty

Probability

Uncertainty can be represented as a number of events and the likelihood, or probability, of each of them happening.

Axioms in Probability

  • $0 \le P(\omega) \le 1$: every value representing probability must range between 0 and 1.
    • Zero is an impossible event, like rolling a standard die and getting a 7.
    • One is an event that is certain to happen, like rolling a standard die and getting a value less than 10.
    • In general, the higher the value, the more likely the event is to happen.
  • The probabilities of every possible event, when summed together, are equal to 1.

$$
\sum_{\omega \in \Omega} P(\omega) = 1
$$

To get the probability of an event, we divide the number of worlds in which it occurs by the number of total possible worlds. For example, there are 36 possible worlds when rolling two dice. Only in one of these worlds, when both dice yield a 6, do we get the sum of 12. Thus, $P(12) = 1/36$, or, in words, the probability of rolling two dice and getting two numbers whose sum is 12 is 1/36. What is $P(7)$? We count and see that the sum 7 occurs in 6 worlds. Thus, $P(7) = 6/36 = 1/6$.
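This counting argument is easy to check directly. Here is a minimal Python sketch (an example from outside the text) that enumerates all 36 worlds and counts the favorable ones:

```python
from itertools import product

# Enumerate all 36 equally likely worlds for two six-sided dice.
worlds = list(product(range(1, 7), repeat=2))

def p_sum(target):
    """Probability that the two dice sum to target."""
    favorable = [w for w in worlds if sum(w) == target]
    return len(favorable) / len(worlds)

print(p_sum(12))  # 1/36 ≈ 0.028
print(p_sum(7))   # 6/36 = 1/6 ≈ 0.167
```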

Unconditional Probability

Unconditional probability is the degree of belief in a proposition in the absence of any other evidence. All the questions that we have asked so far were questions of unconditional probability, because the result of rolling a die is not dependent on previous events.

Conditional Probability

Conditional probability is the degree of belief in a proposition given some evidence that has already been revealed. As discussed in the introduction, AI can use partial information to make educated guesses about the future. To use this information, which affects the probability that the event occurs in the future, we rely on conditional probability.

Conditional probability is expressed using the following notation: $P(a \mid b)$, meaning “the probability of event a occurring given that we know event b to have occurred,” or, more succinctly, “the probability of a given b.” Now we can ask questions like what is the probability of rain today given that it rained yesterday $P(rain today \mid rain yesterday)$, or what is the probability of the patient having the disease given their test results $P(disease \mid test results)$.

Mathematically, to compute the conditional probability of a given b, we use the following formula:

$$P(a \mid b) = \frac{P(a \wedge b)}{P(b)}$$

To put it in words, the probability that a given b is true is equal to the probability of a and b being true, divided by the probability of b. An intuitive way of reasoning about this is the thought “we are interested in the events where both a and b are true (the numerator), but only from the worlds where we know b to be true (the denominator).” Dividing by b restricts the possible worlds to the ones where b is true. The following are algebraically equivalent forms to the formula above:

$$P(a \wedge b) = P(b)P(a \mid b)$$

$$P(a \wedge b) = P(a)P(b \mid a)$$
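To make the formula concrete, here is a small sketch (again from outside the text) that computes a conditional probability for the two-dice setting by restricting attention to the worlds where the evidence holds:

```python
# P(a | b) for two dice, with a = "the sum is 12" and b = "the first
# die shows 6", computed as P(a ∧ b) / P(b).
worlds = [(i, j) for i in range(1, 7) for j in range(1, 7)]

p_b = sum(1 for w in worlds if w[0] == 6) / len(worlds)  # 6/36
p_a_and_b = sum(
    1 for w in worlds if w[0] == 6 and sum(w) == 12
) / len(worlds)                                          # 1/36

print(p_a_and_b / p_b)  # 1/6 ≈ 0.167
```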

Random Variables

A random variable is a variable in probability theory with a domain of possible values that it can take on. For example, to represent possible outcomes when rolling a die, we can define a random variable Roll that can take on the values {1, 2, 3, 4, 5, 6}. To represent the status of a flight, we can define a variable Flight that takes on the values {on time, delayed, canceled}.

Often, we are interested in the probability with which each value occurs. We represent this using a probability distribution. For example,

  • $P(Flight = on time) = 0.6$
  • $P(Flight = delayed) = 0.3$
  • $P(Flight = canceled) = 0.1$

To interpret the probability distribution in words, this means that there is a 60% chance that the flight is on time, a 30% chance that it is delayed, and a 10% chance that it is canceled. Note that, as shown previously, the sum of the probabilities of all possible outcomes is 1.

A probability distribution can be represented more succinctly as a vector. For example, $P(Flight) = \langle 0.6, 0.3, 0.1 \rangle$. For this notation to be interpretable, the values have a set order (in our case, on time, delayed, canceled).
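As a quick illustration (not from the lecture), here is one way to represent this distribution in Python, keeping the fixed value order:

```python
# The Flight distribution as a mapping; the vector form relies on the
# fixed order (on time, delayed, canceled).
flight = {"on time": 0.6, "delayed": 0.3, "canceled": 0.1}
assert abs(sum(flight.values()) - 1.0) < 1e-9  # probabilities sum to 1

flight_vector = tuple(flight.values())
print(flight_vector)  # (0.6, 0.3, 0.1)
```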

Independence

Independence is the knowledge that the occurrence of one event does not affect the probability of the other event. For example, when rolling two dice, the result of each die is independent from the other. Rolling a 4 with the first die does not influence the value of the second die that we roll. This is opposed to dependent events, like clouds in the morning and rain in the afternoon. If it is cloudy in the morning, it is more likely that it will rain in the afternoon, so these events are dependent.

Independence can be defined mathematically: events a and b are independent if and only if the probability of a and b is equal to the probability of a times the probability of b: $P(a \wedge b) = P(a)P(b)$.
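Here is a short sketch (from outside the text) that verifies this definition for the two-dice example:

```python
import math

# Check P(a ∧ b) = P(a) P(b) for a = "first die is 4", b = "second die is 6".
worlds = [(i, j) for i in range(1, 7) for j in range(1, 7)]

p_a = sum(1 for w in worlds if w[0] == 4) / len(worlds)          # 1/6
p_b = sum(1 for w in worlds if w[1] == 6) / len(worlds)          # 1/6
p_a_and_b = sum(1 for w in worlds if w == (4, 6)) / len(worlds)  # 1/36

print(math.isclose(p_a_and_b, p_a * p_b))  # True: the dice are independent
```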

Bayes’ Rule

Bayes’ rule is commonly used in probability theory to compute conditional probability. In words, Bayes’ rule says that the probability of b given a is equal to the probability of a given b, times the probability of b, divided by the probability of a.

$$
P(b \mid a) = \frac{P(b)P(a \mid b)}{P(a)}
$$

Knowing $P(a \mid b)$, in addition to $P(a)$ and $P(b)$, allows us to calculate $P(b \mid a)$. This is helpful, because knowing the conditional probability of a visible effect given an unknown cause, P(visible effect | unknown cause), allows us to calculate the probability of the unknown cause given the visible effect, P(unknown cause | visible effect). For example, we can learn P(medical test results | disease) through medical trials, where we test people with the disease and see how often the test picks up on that. Knowing this, we can calculate P(disease | medical test results), which is valuable diagnostic information.
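Here is a hedged sketch of this calculation. The numbers below are invented for illustration (the text gives no specific figures), but the arithmetic is exactly Bayes' rule:

```python
def bayes(p_b, p_a_given_b, p_a):
    """P(b | a) = P(b) P(a | b) / P(a)."""
    return p_b * p_a_given_b / p_a

# Hypothetical numbers, invented for illustration:
p_disease = 0.01            # prior P(disease)
p_pos_given_disease = 0.95  # P(positive test | disease), from trials
p_pos = 0.06                # overall P(positive test)

print(bayes(p_disease, p_pos_given_disease, p_pos))  # ≈ 0.158
```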

Joint Probability

Joint probability is the likelihood of multiple events all occurring.

To reason about two variables together, for example clouds in the morning (C) and rain in the afternoon (R), we need to look at the joint probabilities of all the possible outcomes of the two variables.

|             | C = cloud | C = $\neg$ cloud |
|-------------|-----------|------------------|
| probability | 0.4       | 0.6              |

|             | R = rain | R = $\neg$ rain |
|-------------|----------|-----------------|
| probability | 0.1      | 0.9             |

|                  | R = rain | R = $\neg$ rain |
|------------------|----------|-----------------|
| C = cloud        | 0.08     | 0.32            |
| C = $\neg$ cloud | 0.02     | 0.58            |

In fact, the joint probabilities here are given as data rather than derived by calculation: clouds and rain are not independent, so the entries are not simply the products of the individual probabilities (e.g., $P(cloud, rain) = 0.08 \neq 0.4 \times 0.1$).

Now we are able to know information about the co-occurrence of the events. For example, we know that the probability of a certain day having clouds in the morning and rain in the afternoon is 0.08. The probability of no clouds in the morning and no rain in the afternoon is 0.58.

Using joint probabilities, we can deduce conditional probability. For example, suppose we are interested in the probability distribution of clouds in the morning given rain in the afternoon: $P(C \mid rain) = \frac{P(C, rain)}{P(rain)}$. (A side note: in probability, commas and $\wedge$ are used interchangeably; thus, $P(C, rain) = P(C \wedge rain)$.) In words, we divide the joint probability of rain and clouds by the probability of rain.

In the last equation, it is possible to view $P(rain)$ as some constant by which $P(C, rain)$ is multiplied. Thus, we can rewrite $\frac{P(C, rain)}{P(rain)} = \alpha P(C, rain)$, or $\alpha \langle 0.08, 0.02 \rangle$. Factoring out $\alpha$ leaves us with the proportions of the probabilities of the possible values of C given that there is rain in the afternoon. Namely, if there is rain in the afternoon, the proportion of the probabilities of clouds in the morning and no clouds in the morning is 0.08 : 0.02. Note that 0.08 and 0.02 don't sum up to 1; however, since this is the probability distribution for the random variable C, we know that they should sum up to 1. Therefore, we need to normalize the values by computing $\alpha$ such that $\alpha \cdot 0.08 + \alpha \cdot 0.02 = 1$. Finally, we can say that $P(C \mid rain) = \langle 0.8, 0.2 \rangle$.

Because this must cover both possible cases of C, clear or cloudy, whose probabilities sum to 1 with no third possibility, scaling both values by the same factor so that they sum to 1 lets us recover the probability of each individual case under the given condition.
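The same normalization is easy to carry out in code. A minimal sketch using the joint values from the table above:

```python
# Recovering P(C | rain) by normalizing the joint probabilities.
joint = {"cloud": 0.08, "no cloud": 0.02}  # P(C, rain) from the table

alpha = 1 / sum(joint.values())            # alpha = 1 / P(rain) = 10
p_c_given_rain = {c: alpha * p for c, p in joint.items()}

print(p_c_given_rain)  # ≈ {'cloud': 0.8, 'no cloud': 0.2}
```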

Probability Rules

Negation: $P(\neg a) = 1 - P(a)$. This stems from the fact that the sum of the probabilities of all the possible worlds is 1, and the complementary literals $a$ and $\neg a$ include all the possible worlds.

Inclusion-Exclusion: $P(a \vee b) = P(a) + P(b) - P(a \wedge b)$. This can be interpreted in the following way: the worlds in which a or b is true are all the worlds where a is true, plus the worlds where b is true. However, in this case, some worlds are counted twice (the worlds where both a and b are true). To get rid of this overlap, we subtract once the worlds where both a and b are true (since they were counted twice).

Here is an example from outside lecture that can elucidate this. Suppose I eat ice cream 80% of days and cookies 70% of days. If we’re calculating the probability that today I eat ice cream or cookies $P(ice cream \vee cookies)$ without subtracting $P(ice cream \wedge cookies)$, we erroneously end up with 0.7 + 0.8 = 1.5. This contradicts the axiom that probability ranges between 0 and 1. To correct for counting twice the days when I ate both ice cream and cookies, we need to subtract $P(ice cream \wedge cookies)$ once.

Marginalization

$P(a) = P(a, b) + P(a, \neg b)$. The idea here is that $b$ and $\neg b$ are disjoint events: the probability of $b$ and $\neg b$ occurring at the same time is 0. We also know that the probabilities of $b$ and $\neg b$ sum up to 1. Thus, when a happens, $b$ either happens or doesn't. When we add the probability of a and b happening to the probability of $a$ and $\neg b$, we end up with simply the probability of a.

Marginalization can be expressed for random variables the following way:

$$P(X = x_{i}) = \sum_{j}P(X = x_{i}, Y = y_{j})$$

The left side of the equation means “The probability of random variable X having the value $x_{i}$.” For example, for the variable C we mentioned earlier, the two possible values are clouds in the morning and no clouds in the morning. The right part of the equation is the idea of marginalization. $P(X = x_{i})$ is equal to the sum of all the joint probabilities of $x_{i}$ and every single value of the random variable Y.
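A short sketch (outside the text) that applies this to the joint table from the Joint Probability section:

```python
# Marginalization over the joint table from the Joint Probability section.
joint = {
    ("cloud", "rain"): 0.08, ("cloud", "no rain"): 0.32,
    ("no cloud", "rain"): 0.02, ("no cloud", "no rain"): 0.58,
}

# P(C = cloud) = P(cloud, rain) + P(cloud, ¬rain)
p_cloud = sum(p for (c, r), p in joint.items() if c == "cloud")
print(p_cloud)  # 0.08 + 0.32 = 0.4
```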

Conditioning

$P(a) = P(a \mid b)P(b) + P(a \mid \neg b)P(\neg b)$. This is a similar idea to marginalization. The probability of event $a$ occurring is equal to the probability of a given $b$ times the probability of $b$, plus the probability of a given $\neg b$ times the probability of $\neg b$.

$$P(X = x_{i}) = \sum_{j}P(X = x_{i} \mid Y = y_{j})P(Y = y_{j})$$

In this formula, the random variable X takes the value $x_{i}$ with probability that is equal to the sum of the probabilities of $x_{i}$ given each value of the random variable Y multiplied by the probability of variable Y taking that value. This makes sense if we remember that $P(a \mid b) = \frac{P(a, b)}{P(b)}$. If we multiply this expression by $P(b)$, we end up with $P(a, b)$, and from here we do the same as we did with marginalization.
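Continuing the same example, here is a sketch showing that conditioning recovers the unconditional probability of rain:

```python
# Conditioning recovers P(rain) from the earlier tables:
# P(rain) = P(rain | cloud) P(cloud) + P(rain | ¬cloud) P(¬cloud)
p_cloud, p_no_cloud = 0.4, 0.6
p_rain_given_cloud = 0.08 / p_cloud        # = 0.2
p_rain_given_no_cloud = 0.02 / p_no_cloud  # ≈ 0.033

p_rain = (p_rain_given_cloud * p_cloud
          + p_rain_given_no_cloud * p_no_cloud)
print(p_rain)  # ≈ 0.1, matching the unconditional table
```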

Bayesian Networks

A Bayesian network is a data structure that represents the dependencies among random variables. Bayesian networks have the following properties:

  • They are directed graphs.
  • Each node on the graph represents a random variable.
  • An arrow from X to Y represents that X is a parent of Y. That is, the probability distribution of Y depends on the value of X.
  • Each node X has probability distribution $P(X \mid Parents(X))$.

In other words, if we want to find the probability of an event under specific conditions in a complex Bayesian network, we only need to condition each node on its immediate parents and trace back to the initial conditions: the conditional probability at each node takes the outcomes of its preconditions as input, so the joint probability factors into a chain of conditional probabilities leading back to the root.

As a hypothesis, suppose A is the initial condition, A points to B, A also points to C, B points to C, and C points to D, and suppose we want to find the probability of a specific case of D under specific cases of A, B, and C. The network is shown in the following diagram:

A
$\downarrow$ $\searrow$
B $\to$ C
    $\downarrow$
    D

And we assume that each of the conditions contains three sub-cases: $\alpha$, $\beta$, and $\gamma$.

So now, suppose we want to calculate the probability that D is $\alpha$ when C is $\beta$, B is $\gamma$, and A is also $\gamma$. We need to combine all of these circumstances by multiplying the conditional probabilities along the network, as follows:

$$P(A_{\gamma}, B_{\gamma}, C_{\beta}, D_{\alpha}) = P(D_{\alpha} \mid C_{\beta})P(C_{\beta} \mid A_{\gamma}, B_{\gamma})P(B_{\gamma} \mid A_{\gamma})P(A_{\gamma})$$
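A minimal sketch of this computation; the conditional probability values are invented for illustration, since the text does not specify them:

```python
# Hypothetical conditional probability tables for the network above;
# all numbers are invented for illustration.
p_a_gamma = 0.3                 # P(A = γ)
p_b_gamma_given_a_gamma = 0.5   # P(B = γ | A = γ)
p_c_beta_given_a_b_gamma = 0.2  # P(C = β | A = γ, B = γ)
p_d_alpha_given_c_beta = 0.4    # P(D = α | C = β)

# P(Aγ, Bγ, Cβ, Dα) = P(Dα | Cβ) P(Cβ | Aγ, Bγ) P(Bγ | Aγ) P(Aγ)
joint = (p_d_alpha_given_c_beta * p_c_beta_given_a_b_gamma
         * p_b_gamma_given_a_gamma * p_a_gamma)
print(joint)  # 0.4 * 0.2 * 0.5 * 0.3 = 0.012
```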

Inference

Previously, we could definitively conclude new information based on the information that we already had. We can also infer new information based on probabilities. While this does not allow us to know new information for certain, it allows us to figure out the probability distributions for some values. Inference has multiple properties.

  • Query X: the variable for which we want to compute the probability distribution.
  • Evidence variables E: one or more variables that have been observed for event e. For example, we might have observed that there is light rain, and this observation helps us compute the probability that the train is delayed.
  • Hidden variables Y: variables that aren’t the query and also haven’t been observed. For example, standing at the train station, we can observe whether there is rain, but we can’t know if there is maintenance on the track further down the road. Thus, Maintenance would be a hidden variable in this situation.
  • The goal: calculate $P(X \mid e)$. For example, compute the probability distribution of the Train variable (the query) based on the evidence e that we know there is light rain.

Inference by Enumeration

Inference by enumeration is a process of finding the probability distribution of variable X given observed evidence e and some hidden variables Y.

$$P(X \mid e) = \alpha P(X, e) = \alpha \sum_{y} P(X, e, y)$$

In this equation, $X$ stands for the query variable, $e$ for the observed evidence, $y$ for all the values of the hidden variables, and $\alpha$ normalizes the result such that we end up with probabilities that add up to 1. To explain the equation in words, it is saying that the probability distribution of $X$ given $e$ is equal to a normalized probability distribution of $X$ and $e$. To get to this distribution, we sum the normalized probability of $X$, $e$, and $y$, where $y$ takes a different value of the hidden variables $Y$ each time.
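Here is a minimal sketch of this procedure, with a toy joint table invented for illustration (the hypothetical helper `joint` returns $P(X = x, e, y)$ with the evidence already fixed):

```python
# A minimal sketch of inference by enumeration: sum out the hidden
# variable, then normalize.
def enumeration_ask(x_values, y_values, joint):
    unnormalized = {x: sum(joint(x, y) for y in y_values) for x in x_values}
    alpha = 1 / sum(unnormalized.values())  # normalize so values sum to 1
    return {x: alpha * p for x, p in unnormalized.items()}

# Toy values of P(X, e, Y) for a fixed observation e (invented):
table = {("a", "y1"): 0.02, ("a", "y2"): 0.06,
         ("b", "y1"): 0.08, ("b", "y2"): 0.04}
result = enumeration_ask(["a", "b"], ["y1", "y2"], lambda x, y: table[(x, y)])
print(result)  # ≈ {'a': 0.4, 'b': 0.6}
```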

Sampling

Sampling is one technique of approximate inference. In sampling, each variable is sampled for a value according to its probability distribution. We will start with an example from outside lecture, and then cover the example from lecture.

In most cases, we don't need exact predictions, only good probability estimates, and sampling lets us avoid a lot of complex and cumbersome calculation. The core Bayesian view also differs from studying how samples are distributed under fixed parameters: instead, the samples are treated as fixed, randomness and all, and we analyze the distribution of the parameters themselves.
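As a starting example from outside lecture, here is a minimal sketch of drawing one sample from a discrete distribution (the Flight distribution from earlier):

```python
import random

# Sample one value from a discrete distribution by walking the
# cumulative probabilities until a uniform draw is exceeded.
def sample(distribution):
    r = random.random()
    cumulative = 0.0
    for value, p in distribution.items():
        cumulative += p
        if r < cumulative:
            return value
    return value  # guard against floating-point rounding at the tail

flight = {"on time": 0.6, "delayed": 0.3, "canceled": 0.1}
samples = [sample(flight) for _ in range(10_000)]
print(samples.count("on time") / len(samples))  # ≈ 0.6
```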

Likelihood Weighting

  • Start by fixing the values for evidence variables.
  • Sample the non-evidence variables using conditional probabilities in the Bayesian network.
  • Weight each sample by its likelihood: the probability of all the evidence occurring.

For example, if we have the observation that the train was on time, we will start sampling as before. We sample a value of Rain given its probability distribution, then Maintenance, but when we get to Train, we always give it the observed value, in our case, on time. Then we proceed and sample Appointment based on its probability distribution given $Train = on time$. Now that this sample exists, we weight it by the conditional probability of the observed variable given its sampled parents. That is, if we sampled Rain and got light, and then we sampled Maintenance and got yes, then we will weight this sample by $P(Train = on time \mid light, yes)$.
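A sketch of generating one such weighted sample; all conditional probability values are invented for illustration, since the text does not specify them:

```python
import random

# One likelihood-weighted sample for the Rain/Maintenance/Train/
# Appointment network described above (hypothetical CPT numbers).
def sample_from(dist):
    return random.choices(list(dist), weights=list(dist.values()))[0]

rain = sample_from({"none": 0.7, "light": 0.2, "heavy": 0.1})
maintenance = sample_from({"none": {"yes": 0.4, "no": 0.6},
                           "light": {"yes": 0.2, "no": 0.8},
                           "heavy": {"yes": 0.1, "no": 0.9}}[rain])

# Evidence: Train is fixed to the observed value, never sampled.
train = "on time"
p_on_time = {("none", "yes"): 0.8, ("none", "no"): 0.9,
             ("light", "yes"): 0.6, ("light", "no"): 0.7,
             ("heavy", "yes"): 0.4, ("heavy", "no"): 0.5}
weight = p_on_time[(rain, maintenance)]  # P(evidence | sampled parents)

appointment = sample_from({"attend": 0.9, "miss": 0.1})  # given on time
print(rain, maintenance, train, appointment, "weight:", weight)
```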

Markov Models

So far, we have looked at questions of probability given some information that we observed. In this kind of paradigm, the dimension of time is not represented in any way. However, many tasks do rely on the dimension of time, such as prediction. To represent the variable of time we will create a new variable, $X$, and change it based on the event of interest, such that $X_{t}$ is the current event, $X_{t+1}$ is the next event, and so on. To be able to predict events in the future, we will use Markov Models.

The Markov Assumption

  • Treat each random variable as a node; if two random variables are correlated (not independent), connect them with an edge. Given several random variables, this forms a directed graph, i.e., a network.
  • If the network is a directed acyclic graph, it is called a Bayesian network.
  • If the graph degenerates into a linear chain, we get a Markov model; since every node is a random variable, we can view the nodes as related changes across successive times (or positions), and from the perspective of a stochastic process this is a Markov process.
  • If the above network is undirected, it is an undirected graphical model, also known as a Markov random field or Markov network.
  • If we study this Markov random field under certain given conditions, we get a conditional random field.
  • If a conditional random field is used to solve labeling problems, and its network topology is further restricted to a linear chain, we get a linear-chain conditional random field.

The Markov assumption is an assumption that the current state depends on only a finite fixed number of previous states. This is important to us. Think of the task of predicting weather. In theory, we could use all the data from the past year to predict tomorrow's weather. However, this is infeasible, both because of the computational power it would require and because there is probably no information about the conditional probability of tomorrow's weather based on the weather 365 days ago. Using the Markov assumption, we restrict our previous states (e.g. how many previous days we consider when predicting tomorrow's weather), thereby making the task manageable. This means that we might get a rougher approximation of the probabilities of interest, but this is often good enough for our needs. Moreover, we can use a Markov model based only on the information of the last event (e.g. predicting tomorrow's weather based on today's weather).

Markov Chain

A Markov chain is a sequence of random variables where the distribution of each variable follows the Markov assumption. That is, each event in the chain occurs based on the probability of the event before it.

To start constructing a Markov chain, we need a transition model that will specify the probability distributions of the next event based on the possible values of the current event.

A process in which each state transition depends only on the previous n states is called an n-th order model, where n is the number of states affecting the transition. The simplest Markov process is a first-order process, where each state transition depends only on the single preceding state; this is also called the Markov property.

$$P(X_{n+1} = x \mid X_{1} = x_{1}, X_{2} = x_{2}, \ldots, X_{n} = x_{n}) = P(X_{n+1} = x \mid X_{n} = x_{n})$$
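A minimal sketch of sampling from such a chain, with a hypothetical sun/rain transition model (numbers invented):

```python
import random

# Sample a first-order Markov chain of weather states: each next state
# depends only on the current one.
transition = {
    "sun":  {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.3, "rain": 0.7},
}

state = "sun"  # initial state
chain = [state]
for _ in range(9):
    nxt = transition[state]  # P(X_{n+1} | X_n = state)
    state = random.choices(list(nxt), weights=list(nxt.values()))[0]
    chain.append(state)

print(chain)  # e.g. ['sun', 'sun', 'rain', 'rain', ...]
```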

Hidden Markov Models

A hidden Markov model is a type of a Markov model for a system with hidden states that generate some observed event. This means that sometimes, the AI has some measurement of the world but no access to the precise state of the world. In these cases, the state of the world is called the hidden state and whatever data the AI has access to are the observations. Here are a few examples of this:

  • For a robot exploring uncharted territory, the hidden state is its position, and the observation is the data recorded by the robot’s sensors.
  • In speech recognition, the hidden state is the words that were spoken, and the observation is the audio waveforms.
  • When measuring user engagement on websites, the hidden state is how engaged the user is, and the observation is the website or app analytics.

Sensor Markov Assumption

The assumption that the evidence variable depends only on the corresponding state. For example, for our models, we assume that whether people bring umbrellas to the office depends only on the weather. This is not necessarily reflective of the complete truth, because, for example, more conscientious, rain-averse people might take an umbrella with them everywhere even when it is sunny, and if we knew everyone’s personalities it would add more data to the model. However, the sensor Markov assumption ignores these data, assuming that only the hidden state affects the observation.

A hidden Markov model can be represented in a Markov chain with two layers. The top layer, variable X, stands for the hidden state. The bottom layer, variable E, stands for the evidence, the observations that we have.

Based on hidden Markov models, multiple tasks can be achieved:

  • Filtering: given observations from start until now, calculate the probability distribution for the current state. For example, given information on when people bring umbrellas from the start of time until today, we generate a probability distribution for whether it is raining today or not (see the sketch after this list).
  • Prediction: given observations from start until now, calculate the probability distribution for a future state.
  • Smoothing: given observations from start until now, calculate the probability distribution for a past state. For example, calculating the probability of rain yesterday given that people brought umbrellas today.
  • Most likely explanation: given observations from start until now, calculate most likely sequence of events.
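As a sketch of the first task, filtering, here is a minimal forward-algorithm step for a hypothetical umbrella model (all transition and sensor-model numbers invented for illustration):

```python
# Filtering for an umbrella HMM: hidden weather states emit umbrella
# observations.
transition = {"sun": {"sun": 0.8, "rain": 0.2},
              "rain": {"sun": 0.3, "rain": 0.7}}
emission = {"sun": {"umbrella": 0.2, "no umbrella": 0.8},
            "rain": {"umbrella": 0.9, "no umbrella": 0.1}}

def filter_step(belief, observation):
    # Predict: push the current belief through the transition model.
    predicted = {s2: sum(belief[s1] * transition[s1][s2] for s1 in belief)
                 for s2 in transition}
    # Update: weight by how likely the observation is, then normalize.
    unnormalized = {s: p * emission[s][observation]
                    for s, p in predicted.items()}
    alpha = 1 / sum(unnormalized.values())
    return {s: alpha * p for s, p in unnormalized.items()}

belief = {"sun": 0.5, "rain": 0.5}  # uniform prior over the hidden state
for obs in ["umbrella", "umbrella", "no umbrella"]:
    belief = filter_step(belief, obs)
print(belief)  # distribution over today's weather given all observations
```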