4.4. Law of Total Probability and Bayes’ Rule

When we make decisions under uncertainty, we often need to revise our probability assessments as new information emerges. Medical diagnoses, legal judgments, and even everyday decisions typically involve updating our beliefs based on partial evidence. In this section, we develop the Law of Total Probability and Bayes’ Rule, which together provide a framework for this process of learning from evidence.

Road Map 🧭

  • Define partitions of the sample space and derive the law of partitions.

  • Build upon this to establish the law of total probability.

  • Develop Bayes’ rule for inverting conditional probabilities.

4.4.1. Law of Partitions

What is a Partition?

A collection of events \(\{A_1, A_2, \cdots, A_n\}\) forms a partition of the sample space \(\Omega\) if the following two conditions are satisfied.

  1. The events are mutually exclusive:

    \[A_i \cap A_j = \emptyset \text{ for all } i \neq j.\]
  2. The events are exhaustive:

    \[A_1 \cup A_2 \cup \cdots \cup A_n = \Omega.\]

In other words, a partition divides the sample space into non-overlapping pieces that, when combined, reconstruct the entire space. You can think of a partition as pizza slices—each slice represents an event, the slices do not overlap, and together they make up a whole pizza.

The law of partitions provides a way to calculate the probability of a new event by examining how it intersects with each part of a partition.

Note ✏️

The simplest example of a partition consists of just two events: any event \(A\) and its complement \(A'\). These two events are

  • mutually exclusive because \(A \cap A' = \emptyset\), and

  • exhaustive because together they cover the entire sample space (\(A \cup A' = \Omega\)).

Law of Partitions

If \(\{A_1, A_2, \cdots, A_n\}\) forms a partition of the sample space \(\Omega\), then for any event \(B\) in the same sample space:

\[P(B) = \sum_{i=1}^{n} P(A_i \cap B)\]

What Does It Say?

Visual representation of the law of partitions

Fig. 4.17 Law of partitions

Take a partition that consists of three events as in Fig. 4.17. Then, the Law of Partitions can be expanded to

\[P(B) = P(A_1 \cap B) + P(A_2 \cap B) + P(A_3 \cap B).\]

The left-hand side of the equation corresponds to the relative area of the whole blue region, while each term on the right-hand side corresponds to the area of a smaller piece created by the overlap of \(B\) with one of the events in the partition.

The core message of the Law of Partitions is quite simple; the probability of the whole is equal to the sum of the probabilities of its parts.

4.4.2. Law of Total Probability

The Law of Total Probability takes the Law of Partitions one step further by rewriting the intersection probabilities using the general multiplication rule.

Reminder🔎: The General Multiplication Rule

For any two events \(C\) and \(D\), \(P(C \cap D) = P(C|D) P(D) = P(D|C) P(C).\)

Statement

If \(\{A_1, A_2, \cdots, A_n\}\) forms a partition of the sample space \(\Omega\), then for any event \(B \subseteq \Omega\),

\[P(B) = \sum_{i=1}^{n} P(B|A_i) P(A_i).\]

What Does It Say?

Visual representation of the law of total probability

Fig. 4.18 Law of Total Probability

Let us continue to use the simple three-event partition. The Law of Total Probability says

\[P(B) = P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + P(B|A_3)P(A_3).\]

The Law of Total Probability now expresses the probability of event \(B\) as a weighted average of conditional probabilities. Each weight \(P(A_i)\) represents the probability of a particular part in the sample space, and each conditional probability \(P(B|A_i)\) represents the likelihood of \(B\) given that we are in that part.
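If you want to verify such a weighted-average computation numerically, here is a minimal Python sketch; the partition and conditional probabilities are made-up illustrative values, not taken from the text.

# Law of Total Probability: P(B) = sum_i P(B|A_i) * P(A_i)
# Illustrative three-event partition (hypothetical numbers)
priors = [0.5, 0.3, 0.2]          # P(A_1), P(A_2), P(A_3); must sum to 1
likelihoods = [0.10, 0.40, 0.70]  # P(B|A_1), P(B|A_2), P(B|A_3)

p_B = sum(p_a * p_b_given_a for p_a, p_b_given_a in zip(priors, likelihoods))
print(round(p_B, 2))  # 0.05 + 0.12 + 0.14 = 0.31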

The Law of Total Probability on a Tree diagram

Recall that in a tree diagram, the set of branches extending from the same node must represent all possible outcomes given the preceding path. This requirement is, in fact, another way of saying that these branches must form a partition. As a result, a tree diagram provides an ideal setting for applying the Law of Total Probability.

Computing a single-stage probability \(P(B)\) using the Law of Total Probability is equivalent to

  1. finding the path probabilities of all paths involving \(B\),

  2. then summing the probabilities.

Try writing these steps down in mathematical notation and confirm that they are identical to applying the Law of Total Probability directly.

the law of total probability on a tree diagram

Fig. 4.19 Using the Law of Total Probability with a tree diagram

Example💡: The Law of Partitions and the Law of Total Probability

The tree diagram for the Indianapolis problem

Fig. 4.20 Tree diagram for the Indianapolis problem

Recall the Indianapolis example from the previous section. In this problem, what is the probability that it rains?

\[\begin{split}P(R) &= P(R \cap Sun) + P(R \cap Sat) + P(R \cap Fri) \\ &= P(Sun)P(R|Sun) + P(Sat)P(R|Sat) + P(Fri)P(R|Fri)\\ &= 1/10 + 1/8 + 3/40 \\ &= 0.1 + 0.125 + 0.075 \\ &= 0.3\end{split}\]
  • First equality uses the Law of Partitions. The second equality uses the Law of Total Probability.

  • Confirm that the mathematical steps and the final outcome are identical when the tree diagram is used.
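As a quick numerical check of the calculation above, here is a minimal Python sketch using the three joint probabilities read off the tree diagram; the variable names are ours.

from fractions import Fraction

# Joint probabilities P(R ∩ day) from the tree diagram: Sun, Sat, Fri
joint = [Fraction(1, 10), Fraction(1, 8), Fraction(3, 40)]

p_R = sum(joint)          # Law of Partitions: add up the pieces
print(p_R, float(p_R))    # 3/10 0.3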

4.4.3. Bayes’ Rule

Bayes’ rule allows us to invert conditional probabilities. That is, it allows us to compute \(P(A|B)\) from our knowledge of \(P(B|A).\)

Statement

If \(\{A_1, A_2, \cdots, A_n\}\) forms a partition of the sample space \(\Omega\), and \(B\) is an event with \(P(B) > 0\), then for any \(i=1,2,\cdots,n\),

\[P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)}.\]

For the simplified case of a three-event partition, Bayes’ rule for \(P(A_1|B)\) is:

\[P(A_1|B) = \frac{P(B|A_1)P(A_1)}{P(B|A_1)P(A_1) + P(B|A_2)P(A_2) + P(B|A_3)P(A_3)}.\]

Graphically, this equation represents the ratio of the area of the first blue piece (\(A_1 \cap B\)) over the whole area of \(B\) in Fig. 4.21.

Visual representation of Bayes’ rule

Fig. 4.21 Visual aid for Bayes’ Rule

Derivation of Bayes’ Rule

\[P(A_i|B) = \frac{P(A_i \cap B)}{P(B)} = \frac{P(A_i \cap B)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)} = \frac{P(B|A_i)P(A_i)}{\sum_{j=1}^{n} P(B|A_j)P(A_j)}\]
  • First equality: definition of conditional probability

  • Second equality: Law of Total Probability for the denominator

  • Third equality: the general multiplication rule for the numerator
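The derivation translates directly into a short helper function. The sketch below is ours, not from the text: it takes the partition’s prior probabilities \(P(A_j)\) and the likelihoods \(P(B|A_j)\) as lists and returns the posterior \(P(A_i|B)\).

def posterior(priors, likelihoods, i):
    """Bayes' rule for a partition: P(A_i | B)."""
    # Denominator: Law of Total Probability, P(B) = sum_j P(B|A_j) P(A_j)
    p_B = sum(p * l for p, l in zip(priors, likelihoods))
    # Numerator: general multiplication rule, P(A_i ∩ B) = P(B|A_i) P(A_i)
    return likelihoods[i] * priors[i] / p_B

For instance, the medical-testing example later in this section corresponds to posterior([0.01, 0.99], [0.95, 0.10], 0) ≈ 0.0876.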

Example💡: Bayes’ Rule

The Indianapolis example is continued. Knowing that it didn’t rain on the day Glen and Jia went to Indianapolis, find the probability that this was Friday.

\[P(Fri|R') = \frac{P(Fri \cap R')}{P(R')} = \frac{P(Fri \cap R')}{1 - P(R)} = \frac{11/120}{1 - 0.3} \approx 0.131\]
  • \(P(R')\) can be computed directly using the tree diagram or the Law of Total Probability. However, using the complement rule is more convenient since we already have \(P(R)\) from the previous part.
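The same answer can be checked numerically; a minimal sketch, using \(P(Fri \cap R') = 11/120\) from the tree diagram and \(P(R) = 0.3\) from the previous part:

from fractions import Fraction

p_Fri_and_dry = Fraction(11, 120)                # P(Fri ∩ R') from the tree diagram
p_rain = Fraction(3, 10)                         # P(R) from the previous example

p_Fri_given_dry = p_Fri_and_dry / (1 - p_rain)   # complement rule: P(R') = 1 - P(R)
print(p_Fri_given_dry, round(float(p_Fri_given_dry), 3))   # 11/84 ≈ 0.131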

4.4.4. Understanding the Bayesian Approach to Probability through Bayes’ Rule

Bayes’ rule forms the foundation of the Bayesian approach to probability, which interprets probabilities as degrees of belief that can be updated as new evidence emerges.

Each component of Bayes’ rule has a Bayesian interpretation:

  1. \(P(A_i)\): the prior probability

    The initial assessment of the probability of event \(A_i\).

  2. \(P(B|A_i)\): the likelihood

    The probability of observing the new evidence \(B\) given that \(A_i\) holds. This measures how consistent the evidence is with \(A_i\).

  3. \(P(A_i|B)\): the posterior probability

    The updated probability of \(A_i\) accounting for the evidence \(B\).

  4. \(P(B)\): the normalizing constant

    Once the evidence \(B\) is observed, the sample space effectively shrinks to the region in which \(B\) is possible. Computing a posterior probability therefore involves dividing by \(P(B)\), the probability of this new whole (see the second step in the derivation of Bayes’ rule).

As we gather more evidence, we can repeatedly apply Bayes’ rule, using the posterior probability from one calculation as the prior probability for the next. This iterative process allows our probability assessments to continuously improve as we incorporate new information.
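The sketch below illustrates this posterior-becomes-prior loop for the simplest partition, an event \(A\) and its complement; the starting prior and the two likelihood values are made up for illustration.

def update(prior, lik_if_A, lik_if_not_A):
    """One Bayesian update: return P(A | evidence) from P(A) and the two likelihoods."""
    p_evidence = lik_if_A * prior + lik_if_not_A * (1 - prior)   # Law of Total Probability
    return lik_if_A * prior / p_evidence                         # Bayes' rule

belief = 0.5                                         # hypothetical starting prior P(A)
for lik_A, lik_not_A in [(0.8, 0.3), (0.7, 0.2)]:    # two hypothetical pieces of evidence
    belief = update(belief, lik_A, lik_not_A)        # yesterday's posterior is today's prior
print(round(belief, 4))                              # ≈ 0.9032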

Comprehensive Example💡: Medical Testing

Consider a disease that affects a small percentage of the population and a diagnostic test used to detect it.

  • Let \(D\) be the event that a person has the disease.

  • Let \(+\) be the event that the test gives a positive result.

  • Define \(D'\) and \(-\) as the complements of \(D\) and \(+\), respectively.

Given these events, we can identify:

  • \(P(D)\): The prevalence of the disease in the population (prior probability)

  • \(P(+|D)\): The sensitivity of the test (true positive rate)

  • \(P(+|D')\): The false positive rate (1 - specificity)

What doctors and patients typically want to know is \(P(D|+)\), the probability that a person has the disease given a positive test result. This posterior probability can be calculated using Bayes’ rule:

\[P(D|+) = \frac{P(+|D)P(D)}{P(+|D) P(D) + P(+|D')P(D')}\]

Suppose a disease affects 1% of the population, the test has a sensitivity of 95%, and a specificity of 90%. What is the probability that someone with a positive test result actually has the disease?

Step 1: Write the building blocks in mathematical notation

  • \(P(D) = 0.01\)

  • \(P(+|D) = 0.95\)

  • \(P(+|D') = 1-P(-|D') = 1 - 0.9 = 0.1\)

Step 2: Compute the posterior probability

\[\begin{split}P(D|+) &= \frac{(0.95)(0.01)}{(0.95)(0.01) + (0.1)(0.99)} \\ &= \frac{0.0095}{0.0095 + 0.099} \\ &= \frac{0.0095}{0.1085} \\ &\approx 0.0876\end{split}\]

Despite the test being quite accurate (95% sensitivity, 90% specificity), the probability that a positive result indicates disease is less than 9%. This illustrates the importance of considering the base rate (prior probability) when interpreting test results, especially for rare conditions. Even a very accurate test will generate many false positives when applied to a population where the condition is uncommon.

  • Also try solving this problem using a tree diagram, and confirm that the results are consistent.
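For reference, the same numbers can be pushed through a few lines of Python (a sketch; the variable names are ours):

p_D = 0.01                 # prevalence, P(D)
sensitivity = 0.95         # P(+|D)
specificity = 0.90         # P(-|D')

p_pos_given_healthy = 1 - specificity                        # P(+|D') = 0.10
p_pos = sensitivity * p_D + p_pos_given_healthy * (1 - p_D)  # Law of Total Probability
p_D_given_pos = sensitivity * p_D / p_pos                    # Bayes' rule
print(round(p_pos, 4), round(p_D_given_pos, 4))              # 0.1085 0.0876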

4.4.5. Bringing It All Together

Key Takeaways 📝

  1. The Law of Partitions decomposes the probability of an event across a partition.

  2. The Law of Total Probability expresses an event’s probability as a weighted average of conditional probabilities.

  3. Bayes’ rule lets us calculate “inverse” conditional probabilities.

  4. Tree diagrams serve as an assisting tool for the three rules above.

  5. Bayes’ rule forms the foundation of the Bayesian approach to probability.

4.4.6. Exercises

These exercises develop your skills in applying the Law of Total Probability and Bayes’ Rule to solve multi-stage probability problems.

Exercise 1: Identifying Partitions

For each scenario, determine whether the given collection of events forms a valid partition of the sample space. If not, explain which condition (mutually exclusive or exhaustive) is violated.

  1. Sample space: All students in a class. Events: \(A_1\) = “freshman”, \(A_2\) = “sophomore”, \(A_3\) = “junior”, \(A_4\) = “senior”

  2. Sample space: All possible outcomes when rolling a six-sided die. Events: \(B_1\) = “even number”, \(B_2\) = “odd number”

  3. Sample space: All employees at a company. Events: \(C_1\) = “works in engineering”, \(C_2\) = “works in marketing”, \(C_3\) = “has been with the company more than 5 years”

  4. Sample space: All real numbers from 0 to 100. Events: \(D_1 = [0, 50)\), \(D_2 = [50, 100]\)

  5. Sample space: Results of a software test. Events: \(E_1\) = “test passes”, \(E_2\) = “test fails with minor error”

Solution

Part (a): Valid Partition

  • Mutually exclusive: A student can only be in one class year at a time. ✓

  • Exhaustive: Every student must be a freshman, sophomore, junior, or senior. ✓

Part (b): Valid Partition

  • Mutually exclusive: A number cannot be both even and odd. ✓

  • Exhaustive: Every integer is either even or odd. ✓

Note: \(B_1\) and \(B_2\) form the simplest partition — an event and its complement.

Part (c): NOT a Valid Partition

  • Mutually exclusive: VIOLATED. An engineer could also have been with the company more than 5 years. The events overlap.

  • Exhaustive: VIOLATED. An employee in finance with 2 years tenure wouldn’t be in any of these events.

Part (d): Valid Partition

  • Mutually exclusive: \([0, 50)\) and \([50, 100]\) don’t overlap (50 is only in the second set). ✓

  • Exhaustive: Together they cover all numbers from 0 to 100. ✓

Part (e): NOT a Valid Partition

  • Mutually exclusive: These could be mutually exclusive if properly defined. ✓

  • Exhaustive: VIOLATED. What about “test fails with major error” or “test crashes”? The events don’t cover all possible outcomes.
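Parts like (b) can also be checked mechanically with Python sets; the integer encoding of die outcomes below is our own choice for illustration.

# Exercise 1(b): verify that {even, odd} partitions the die's sample space
omega = {1, 2, 3, 4, 5, 6}
B1 = {2, 4, 6}   # "even number"
B2 = {1, 3, 5}   # "odd number"

mutually_exclusive = (B1 & B2 == set())   # no overlap
exhaustive = (B1 | B2 == omega)           # union covers the sample space
print(mutually_exclusive, exhaustive)     # True True -> valid partition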


Exercise 2: Law of Total Probability

A data center has three server clusters that handle incoming requests:

  • Cluster A handles 50% of all requests

  • Cluster B handles 30% of all requests

  • Cluster C handles 20% of all requests

Due to different hardware and configurations, the probability of a request being processed successfully varies by cluster:

  • Cluster A: 99% success rate

  • Cluster B: 97% success rate

  • Cluster C: 95% success rate

  1. Define appropriate events and write out the given information using probability notation.

  2. Use the Law of Total Probability to find the overall probability that a randomly selected request is processed successfully.

  3. What is the probability that a randomly selected request fails?

  4. Draw a tree diagram representing this situation and verify your answer to part (b).

Solution

Part (a): Define Events and Notation

Let:

  • \(A\) = request handled by Cluster A

  • \(B\) = request handled by Cluster B

  • \(C\) = request handled by Cluster C

  • \(S\) = request processed successfully

Given information:

  • \(P(A) = 0.50\), \(P(B) = 0.30\), \(P(C) = 0.20\)

  • \(P(S|A) = 0.99\), \(P(S|B) = 0.97\), \(P(S|C) = 0.95\)

Note: \(\{A, B, C\}\) forms a partition since clusters are mutually exclusive and exhaustive.

Part (b): Law of Total Probability

\[\begin{split}P(S) &= P(S|A)P(A) + P(S|B)P(B) + P(S|C)P(C) \\ &= (0.99)(0.50) + (0.97)(0.30) + (0.95)(0.20) \\ &= 0.495 + 0.291 + 0.190 \\ &= 0.976\end{split}\]

The overall success rate is 97.6%.

Part (c): Probability of Failure

Using the complement rule:

\[P(S') = 1 - P(S) = 1 - 0.976 = 0.024\]

The overall failure rate is 2.4%.

Part (d): Tree Diagram

[Start]
 ├─ [A] 0.50
 │    ├─ [S]  0.99 → 0.50 × 0.99 = 0.495
 │    └─ [S'] 0.01 → 0.50 × 0.01 = 0.005
 ├─ [B] 0.30
 │    ├─ [S]  0.97 → 0.30 × 0.97 = 0.291
 │    └─ [S'] 0.03 → 0.30 × 0.03 = 0.009
 └─ [C] 0.20
      ├─ [S]  0.95 → 0.20 × 0.95 = 0.190
      └─ [S'] 0.05 → 0.20 × 0.05 = 0.010

Sum of success paths: \(0.495 + 0.291 + 0.190 = 0.976\)
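A short Python check of parts (b) and (c), a sketch with the given numbers:

# Exercise 2: overall success rate via the Law of Total Probability
clusters = {"A": (0.50, 0.99), "B": (0.30, 0.97), "C": (0.20, 0.95)}  # (P(cluster), P(S|cluster))

p_success = sum(share * rate for share, rate in clusters.values())
print(round(p_success, 3), round(1 - p_success, 3))   # 0.976 0.024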


Exercise 3: Bayes’ Rule — Manufacturing

An electronics manufacturer sources microprocessors from three suppliers:

  • Supplier X provides 40% of processors with a 2% defect rate

  • Supplier Y provides 35% of processors with a 3% defect rate

  • Supplier Z provides 25% of processors with a 5% defect rate

A randomly selected processor is found to be defective.

  1. What is the probability that it came from Supplier X?

  2. What is the probability that it came from Supplier Y?

  3. What is the probability that it came from Supplier Z?

  4. Verify that your answers to parts (a), (b), and (c) sum to 1.

  5. Which supplier is most likely responsible for a defective processor? Is this the same as the supplier with the highest defect rate?

Solution

Setup:

Let \(X, Y, Z\) denote the suppliers and \(D\) = “processor is defective.”

Given:

  • \(P(X) = 0.40\), \(P(D|X) = 0.02\)

  • \(P(Y) = 0.35\), \(P(D|Y) = 0.03\)

  • \(P(Z) = 0.25\), \(P(D|Z) = 0.05\)

First, find P(D) using Law of Total Probability:

\[\begin{split}P(D) &= P(D|X)P(X) + P(D|Y)P(Y) + P(D|Z)P(Z) \\ &= (0.02)(0.40) + (0.03)(0.35) + (0.05)(0.25) \\ &= 0.008 + 0.0105 + 0.0125 \\ &= 0.031\end{split}\]

Part (a): P(X|D)

\[P(X|D) = \frac{P(D|X)P(X)}{P(D)} = \frac{(0.02)(0.40)}{0.031} = \frac{0.008}{0.031} \approx 0.2581\]

Part (b): P(Y|D)

\[P(Y|D) = \frac{P(D|Y)P(Y)}{P(D)} = \frac{(0.03)(0.35)}{0.031} = \frac{0.0105}{0.031} \approx 0.3387\]

Part (c): P(Z|D)

\[P(Z|D) = \frac{P(D|Z)P(Z)}{P(D)} = \frac{(0.05)(0.25)}{0.031} = \frac{0.0125}{0.031} \approx 0.4032\]

Part (d): Verification

\(P(X|D) + P(Y|D) + P(Z|D) = 0.2581 + 0.3387 + 0.4032 = 1.0000\)

This must be true since \(\{X, Y, Z\}\) partitions the sample space.

Part (e): Interpretation

Supplier Z is most likely responsible for a defective processor (40.32% probability).

This is the same as the supplier with the highest defect rate (5%). However, this is not always the case: if Supplier Z provided only 5% of processors instead of 25%, the answer would change. Bayes’ Rule accounts for both:

  • The defect rate (likelihood)

  • The proportion of supply (prior)
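A minimal Python sketch reproducing parts (a) through (d) with the given supplier shares and defect rates:

# Exercise 3: posterior probability of each supplier given a defective processor
suppliers = {"X": (0.40, 0.02), "Y": (0.35, 0.03), "Z": (0.25, 0.05)}  # (P(supplier), P(D|supplier))

p_defect = sum(share * rate for share, rate in suppliers.values())     # Law of Total Probability
posteriors = {name: share * rate / p_defect for name, (share, rate) in suppliers.items()}
print(round(p_defect, 3), {name: round(p, 4) for name, p in posteriors.items()})
# 0.031 {'X': 0.2581, 'Y': 0.3387, 'Z': 0.4032} -- posteriors sum to 1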


Exercise 4: Bayes’ Rule — Diagnostic Testing

A new screening test for a rare genetic condition is being evaluated. The condition affects 1 in 500 people in the general population. Clinical trials show:

  • Sensitivity (true positive rate): 98% — P(+|Disease) = 0.98

  • Specificity (true negative rate): 96% — P(−|No Disease) = 0.96

  1. Calculate the probability that a person who tests positive actually has the condition. (This is called the positive predictive value.)

  2. Calculate the probability that a person who tests negative does not have the condition. (This is called the negative predictive value.)

  3. If the test is used to screen 10,000 people, approximately how many will test positive? Of those, how many actually have the condition?

  4. Why is the positive predictive value so much lower than the sensitivity, even though both the sensitivity and specificity are quite high?

Solution

Setup:

  • \(P(D) = 1/500 = 0.002\) (prevalence)

  • \(P(D') = 0.998\)

  • \(P(+|D) = 0.98\) (sensitivity)

  • \(P(-|D') = 0.96\) (specificity)

  • \(P(+|D') = 1 - 0.96 = 0.04\) (false positive rate)

Part (a): Positive Predictive Value P(D|+)

First, find P(+) using Law of Total Probability:

\[\begin{split}P(+) &= P(+|D)P(D) + P(+|D')P(D') \\ &= (0.98)(0.002) + (0.04)(0.998) \\ &= 0.00196 + 0.03992 \\ &= 0.04188\end{split}\]

Now apply Bayes’ Rule:

\[P(D|+) = \frac{P(+|D)P(D)}{P(+)} = \frac{(0.98)(0.002)}{0.04188} = \frac{0.00196}{0.04188} \approx 0.0468\]

Positive Predictive Value ≈ 4.68%

Part (b): Negative Predictive Value P(D’|−)

First, find P(−):

\[P(-) = 1 - P(+) = 1 - 0.04188 = 0.95812\]

Find \(P(-|D) = 1 - P(+|D) = 1 - 0.98 = 0.02\) (false negative rate)

\[P(D'|-) = \frac{P(-|D')P(D')}{P(-)} = \frac{(0.96)(0.998)}{0.95812} = \frac{0.95808}{0.95812} \approx 0.99996\]

Negative Predictive Value ≈ 99.996%

Part (c): Screening 10,000 People

Expected positive tests: \(10,000 \times P(+) = 10,000 \times 0.04188 \approx 419\) people

Of those with positive tests, expected true positives:

  • People with disease: \(10,000 \times 0.002 = 20\)

  • True positives (disease AND positive): \(20 \times 0.98 \approx 20\) people

So approximately 419 test positive, but only about 20 actually have the condition.

Part (d): Why is PPV So Low?

Even though the test has high sensitivity (98%) and specificity (96%), the base rate of the disease is very low (0.2%).

When the disease is rare:

  • The number of true positives is small (98% of a small number)

  • The number of false positives is much larger (4% of a very large number)

With 10,000 people:

  • True positives: ~20 (from the 20 people with disease)

  • False positives: ~399 (4% of the 9,980 healthy people)

  • Total positives: ~419, but only ~20 are real!

This is the base rate fallacy — people often overestimate PPV when the condition is rare.
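All four parts can be reproduced with a few lines of Python (a sketch; rounding matches the values above):

# Exercise 4: predictive values and expected counts when screening 10,000 people
p_D, sens, spec = 1 / 500, 0.98, 0.96

p_pos = sens * p_D + (1 - spec) * (1 - p_D)        # P(+) via Law of Total Probability
ppv = sens * p_D / p_pos                            # P(D|+)
npv = spec * (1 - p_D) / (1 - p_pos)                # P(D'|-)
print(round(ppv, 4), round(npv, 5))                 # 0.0468 0.99996
print(round(10_000 * p_pos), round(10_000 * p_D * sens))   # ≈ 419 positives, ≈ 20 true positives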


Exercise 5: Multi-Stage Problem with Tree Diagram

A delivery company has two distribution centers (DC1 and DC2) that ship packages to customers. DC1 handles 60% of packages and DC2 handles 40%.

  • Packages from DC1 are shipped via Ground (80%) or Express (20%)

  • Packages from DC2 are shipped via Ground (50%) or Express (50%)

The on-time delivery rates are:

  • DC1 Ground: 92% on-time

  • DC1 Express: 98% on-time

  • DC2 Ground: 88% on-time

  • DC2 Express: 99% on-time

  1. Draw a complete three-stage tree diagram for this problem.

  2. What is the probability that a randomly selected package is delivered on time?

  3. Given that a package was delivered late, what is the probability it came from DC1?

  4. Given that a package was delivered late, what is the probability it was shipped via Ground?

Solution

Part (a): Tree Diagram

[Start]
 ├─ [DC1] 0.60
 │    ├─ [G] 0.80
 │    │    ├─ [OT] 0.92
 │    │    └─ [L]  0.08
 │    └─ [E] 0.20
 │         ├─ [OT] 0.98
 │         └─ [L]  0.02
 └─ [DC2] 0.40
      ├─ [G] 0.50
      │    ├─ [OT] 0.88
      │    └─ [L]  0.12
      └─ [E] 0.50
           ├─ [OT] 0.99
           └─ [L]  0.01

Path probabilities:

  • DC1, Ground, On-Time: 0.60 × 0.80 × 0.92 = 0.4416

  • DC1, Ground, Late: 0.60 × 0.80 × 0.08 = 0.0384

  • DC1, Express, On-Time: 0.60 × 0.20 × 0.98 = 0.1176

  • DC1, Express, Late: 0.60 × 0.20 × 0.02 = 0.0024

  • DC2, Ground, On-Time: 0.40 × 0.50 × 0.88 = 0.1760

  • DC2, Ground, Late: 0.40 × 0.50 × 0.12 = 0.0240

  • DC2, Express, On-Time: 0.40 × 0.50 × 0.99 = 0.1980

  • DC2, Express, Late: 0.40 × 0.50 × 0.01 = 0.0020

Part (b): P(On-Time)

Sum all on-time paths:

\[P(OT) = 0.4416 + 0.1176 + 0.1760 + 0.1980 = 0.9332\]

93.32% of packages are delivered on time.

Part (c): P(DC1|Late)

First, find P(Late):

\[P(L) = 1 - P(OT) = 1 - 0.9332 = 0.0668\]

P(DC1 ∩ Late) = P(DC1, Ground, Late) + P(DC1, Express, Late) = 0.0384 + 0.0024 = 0.0408

\[P(DC1|L) = \frac{P(DC1 \cap L)}{P(L)} = \frac{0.0408}{0.0668} \approx 0.6108\]

Given a late package, there’s about a 61.08% chance it came from DC1.

Part (d): P(Ground|Late)

P(Ground ∩ Late) = P(DC1, Ground, Late) + P(DC2, Ground, Late) = 0.0384 + 0.0240 = 0.0624

\[P(G|L) = \frac{P(G \cap L)}{P(L)} = \frac{0.0624}{0.0668} \approx 0.9341\]

Given a late package, there’s about a 93.41% chance it was shipped via Ground.
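The tree can also be enumerated programmatically; a minimal sketch with the given rates (the dictionary names are ours):

# Exercise 5: enumerate all paths of the three-stage tree, then condition on "late"
from itertools import product

p_dc = {"DC1": 0.60, "DC2": 0.40}
p_method = {"DC1": {"Ground": 0.80, "Express": 0.20},
            "DC2": {"Ground": 0.50, "Express": 0.50}}
p_on_time = {("DC1", "Ground"): 0.92, ("DC1", "Express"): 0.98,
             ("DC2", "Ground"): 0.88, ("DC2", "Express"): 0.99}

paths = {}                                   # path -> joint probability
for dc, method in product(p_dc, ["Ground", "Express"]):
    base = p_dc[dc] * p_method[dc][method]
    paths[(dc, method, "OT")] = base * p_on_time[(dc, method)]
    paths[(dc, method, "L")] = base * (1 - p_on_time[(dc, method)])

p_late = sum(p for (dc, m, s), p in paths.items() if s == "L")
p_dc1_given_late = sum(p for (dc, m, s), p in paths.items() if dc == "DC1" and s == "L") / p_late
p_ground_given_late = sum(p for (dc, m, s), p in paths.items() if m == "Ground" and s == "L") / p_late
print(round(p_late, 4), round(p_dc1_given_late, 4), round(p_ground_given_late, 4))
# 0.0668 0.6108 0.9341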


Exercise 6: Prior and Posterior Probabilities

A machine learning model classifies emails as “spam” or “not spam.” Before seeing any features of an email:

  • Prior probability: 30% of incoming emails are spam

The model uses the presence of the word “FREE” as a feature:

  • P(“FREE” appears | Spam) = 0.60

  • P(“FREE” appears | Not Spam) = 0.10

  1. An email arrives containing the word “FREE.” What is the posterior probability that it is spam?

  2. An email arrives that does NOT contain the word “FREE.” What is the posterior probability that it is spam?

  3. Compare the prior and posterior probabilities. How much does observing “FREE” change our belief about whether the email is spam?

  4. The model now also considers the presence of the word “URGENT”:

    • P(“URGENT” | Spam) = 0.40

    • P(“URGENT” | Not Spam) = 0.05

    If an email contains BOTH “FREE” and “URGENT” (assume these are conditionally independent given spam status), what is the posterior probability it is spam?

Solution

Setup:

  • \(P(S) = 0.30\) (prior — email is spam)

  • \(P(S') = 0.70\) (prior — email is not spam)

  • \(P(F|S) = 0.60\) (“FREE” given spam)

  • \(P(F|S') = 0.10\) (“FREE” given not spam)

Part (a): P(Spam | “FREE” appears)

First, find P(F) using Law of Total Probability:

\[P(F) = P(F|S)P(S) + P(F|S')P(S') = (0.60)(0.30) + (0.10)(0.70) = 0.18 + 0.07 = 0.25\]

Apply Bayes’ Rule:

\[P(S|F) = \frac{P(F|S)P(S)}{P(F)} = \frac{(0.60)(0.30)}{0.25} = \frac{0.18}{0.25} = 0.72\]

Posterior probability of spam given “FREE”: 72%

Part (b): P(Spam | “FREE” does NOT appear)

First, find P(F’) = 1 − 0.25 = 0.75

Find \(P(F'|S) = 1 - P(F|S) = 1 - 0.60 = 0.40\)

\[P(S|F') = \frac{P(F'|S)P(S)}{P(F')} = \frac{(0.40)(0.30)}{0.75} = \frac{0.12}{0.75} = 0.16\]

Posterior probability of spam given NO “FREE”: 16%

Part (c): Comparison

  • Prior P(Spam) = 0.30 (before seeing any features)

  • Posterior P(Spam | “FREE”) = 0.72 (increased dramatically)

  • Posterior P(Spam | no “FREE”) = 0.16 (decreased)

Observing “FREE” increases our belief that the email is spam from 30% to 72% — more than doubling it. The absence of “FREE” decreases our belief to 16% — nearly halving it.

The word “FREE” is a strong indicator because it’s 6 times more likely to appear in spam (60%) than in legitimate email (10%).

Part (d): P(Spam | “FREE” AND “URGENT”)

With conditional independence, we can update sequentially. Start with the posterior from part (a) as the new prior:

  • New prior: P(S) = 0.72 (given “FREE”)

  • P(U|S) = 0.40

  • P(U|S’) = 0.05

Find \(P(U|F)\), the probability of observing “URGENT” given that “FREE” has already been observed (by conditional independence, the likelihoods \(P(U|S)\) and \(P(U|S')\) stay the same):

\[P(U|F) = P(U|S) \cdot P(S|F) + P(U|S') \cdot P(S'|F) = (0.40)(0.72) + (0.05)(0.28) = 0.288 + 0.014 = 0.302\]

Apply Bayes’ Rule:

\[P(S|F,U) = \frac{P(U|S) \cdot P(S|F)}{P(U|F)} = \frac{(0.40)(0.72)}{0.302} = \frac{0.288}{0.302} \approx 0.9536\]

Posterior probability of spam given BOTH “FREE” and “URGENT”: ~95.4%

Each piece of evidence updates our belief. Starting from 30%, “FREE” brings us to 72%, and “URGENT” further increases it to 95.4%.
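A minimal sketch of the sequential update in parts (a) and (d), with the numbers from the exercise:

# Exercise 6: sequential Bayesian updates for the spam filter
def update(prior, lik_spam, lik_not_spam):
    """Return P(Spam | evidence) from the current prior and the two likelihoods."""
    p_evidence = lik_spam * prior + lik_not_spam * (1 - prior)   # Law of Total Probability
    return lik_spam * prior / p_evidence                         # Bayes' rule

p_spam = 0.30
p_spam = update(p_spam, 0.60, 0.10)     # observe "FREE"   -> 0.72
p_spam = update(p_spam, 0.40, 0.05)     # observe "URGENT" -> ≈ 0.9536
print(round(p_spam, 4))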


4.4.7. Additional Practice Problems

True/False Questions (1 point each)

  1. A partition of a sample space must consist of at least three events.

    Ⓣ or Ⓕ

  2. The Law of Total Probability requires that the conditioning events form a partition.

    Ⓣ or Ⓕ

  3. Bayes’ Rule allows us to compute \(P(A|B)\) when we know \(P(B|A)\).

    Ⓣ or Ⓕ

  4. In Bayes’ Rule, the denominator is computed using the Law of Total Probability.

    Ⓣ or Ⓕ

  5. If a medical test has high sensitivity (99%), then a positive result means the patient almost certainly has the disease.

    Ⓣ or Ⓕ

  6. When using Bayes’ Rule, the posterior probabilities for all events in the partition must sum to 1.

    Ⓣ or Ⓕ

Multiple Choice Questions (2 points each)

  1. Factory A produces 60% of a product with 2% defective. Factory B produces 40% with 5% defective. What is P(defective)?

    Ⓐ 0.012

    Ⓑ 0.020

    Ⓒ 0.032

    Ⓓ 0.035

  2. Using the information from the previous question, if a product is defective, what is P(from Factory A)?

    Ⓐ 0.375

    Ⓑ 0.400

    Ⓒ 0.600

    Ⓓ 0.625

  3. A disease affects 2% of the population. A test has 95% sensitivity and 90% specificity. What is P(Disease | Positive)?

    Ⓐ About 16%

    Ⓑ About 50%

    Ⓒ About 90%

    Ⓓ About 95%

  4. In the Bayesian framework, which term describes \(P(A_i)\) in Bayes’ Rule?

    Ⓐ Likelihood

    Ⓑ Posterior probability

    Ⓒ Prior probability

    Ⓓ Normalizing constant

Answers to Practice Problems

True/False Answers:

  1. False — The simplest partition consists of just two events: an event A and its complement A’.

  2. True — The Law of Total Probability requires that the events \(\{A_1, A_2, \ldots, A_n\}\) be mutually exclusive and exhaustive (i.e., form a partition).

  3. True — This is exactly what Bayes’ Rule does: it “inverts” conditional probabilities.

  4. True — The denominator \(\sum P(B|A_i)P(A_i)\) is the Law of Total Probability applied to find P(B).

  5. False — High sensitivity means the test correctly identifies most people with the disease. But if the disease is rare, most positive results may still be false positives (low positive predictive value).

  6. True — Since the partition events cover the entire sample space, the posterior probabilities must sum to 1: \(\sum P(A_i|B) = 1\).

Multiple Choice Answers:

  1. Ⓒ — P(D) = P(D|A)P(A) + P(D|B)P(B) = (0.02)(0.60) + (0.05)(0.40) = 0.012 + 0.020 = 0.032

  2. Ⓐ — P(A|D) = P(D|A)P(A) / P(D) = (0.02)(0.60) / 0.032 = 0.012 / 0.032 = 0.375

  3. Ⓐ — P(+) = (0.95)(0.02) + (0.10)(0.98) = 0.019 + 0.098 = 0.117. P(D|+) = 0.019/0.117 ≈ 0.162 ≈ 16%

  4. Ⓒ — \(P(A_i)\) is the prior probability — our initial belief before observing evidence B.