Worksheet 6: Named Discrete Distributions
Learning Objectives 🎯
Understand when to use named distributions instead of explicit PMFs
Master the Bernoulli distribution as the building block for other distributions
Apply the Binomial distribution to count successes in independent trials
Use the Poisson distribution for rare events over fixed intervals
Calculate probabilities using both formulas and R functions
Recognize the assumptions required for each distribution
Introduction
Previously, we examined discrete random variables by explicitly listing their probability mass functions (PMFs) in tabular form. While this approach works well for small, well-defined cases, it can become cumbersome when dealing with random variables that take on many possible values.
In many situations, patterns emerge in how probabilities are assigned to outcomes. Instead of defining a PMF from scratch, we can use named discrete random variables, which follow standard probability distributions that have been studied extensively. These named distributions provide a structured way to describe random variables with known properties, simplifying calculations and allowing us to apply established theorems and results.
Advantages of using named distributions include:
Compact Representation: Instead of listing probabilities for each possible outcome, we can describe a random variable using a few key parameters
Generalization: Many processes share similar probability structures, so named distributions allow us to apply the same mathematical tools across different contexts
Efficient Computation: Expectation, variance, and probabilities can be computed using well-established formulas rather than recalculating them from first principles
In this course, we will focus on three fundamental named discrete distributions: the Bernoulli, Binomial, and Poisson Distributions. However, in this worksheet, you will also get a preview of additional named discrete distributions, providing insight into how different random variables arise in various contexts. These will be explored in greater depth in later coursework.
Part 1: The Bernoulli Distribution
A Bernoulli random variable \(X\) is the simplest discrete random variable, as it represents the occurrence or non-occurrence of an event in a single trial. It takes on only two possible values, typically 0 and 1, meaning its support is: \(\text{Supp}(X) = \{0, 1\}\).
This setup is useful for modeling scenarios where an event either happens (1) or does not happen (0). The probability mass function (PMF) of a Bernoulli random variable with success probability \(p\) is:
\[
p_X(x) = \begin{cases} p & \text{if } x = 1 \\ 1 - p & \text{if } x = 0 \end{cases}
\]
and we write:
\[
X \sim \text{Bern}(p),
\]
where the tilde symbol (∼) is used to denote that \(X\) “follows a” Bernoulli distribution with success probability \(p\).
You can think of a Bernoulli random variable as an indicator function, where it simply “indicates” whether a specific event occurs. In fact, if \(A\) is an event, the random variable:
\[
X = \mathbf{1}_A = \begin{cases} 1 & \text{if } A \text{ occurs} \\ 0 & \text{otherwise} \end{cases}
\]
follows a Bernoulli distribution with success probability \(p = P(A)\).
This makes Bernoulli random variables fundamental in probability, as they serve as building blocks for more complex models, such as the Binomial distribution, which describes the total number of successes in multiple independent Bernoulli trials.
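As a quick optional illustration (the event, seed, and sample size below are arbitrary choices, not part of the worksheet), R generates Bernoulli draws via rbinom with size = 1, since a Bernoulli(\(p\)) is a Binomial(1, \(p\)):
# Simulating a Bernoulli(p) random variable in R
p <- 0.3
draws <- rbinom(10000, size = 1, prob = p)  # 10,000 Bernoulli(0.3) trials
mean(draws)                                 # should be close to p = 0.3
# Bernoulli as an indicator: X = 1 if the event A = "a fair die shows a six" occurs
rolls <- sample(1:6, 10000, replace = TRUE)
indicator <- as.numeric(rolls == 6)         # Bernoulli with p = P(A) = 1/6
mean(indicator)                             # should be close to 1/6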
Part 2: The Binomial Distribution
A Binomial random variable counts the number of successes in \(n\) independent and identical Bernoulli trials, each with success probability \(p\). Here \(n\) and \(p\) are what we call parameters of the distribution; once we know their values, we know everything about the distribution. A Binomial random variable is simply the sum of \(n\) independent Bernoulli random variables:
\[
X = \sum_{i=1}^{n} X_i,
\]
where each \(X_i \overset{\text{i.i.d.}}{\sim} \text{Bern}(p)\) for \(i \in \{1, 2, \ldots, n\}\). Here i.i.d. (independent and identically distributed) means that:
Each \(X_i\) follows the same Bernoulli distribution with success probability \(p\)
The outcomes of different trials do not influence each other
This property makes the Binomial distribution a natural extension of the Bernoulli, allowing it to model repeated independent trials of the same process. Companies and researchers leverage the Binomial distribution alongside statistical inference in quality control to test defect rates, in medical research to estimate treatment success probabilities, and in politics to analyze voter behavior and predict election outcomes, among other applications.
A Binomial random variable \(X\) has support \(\text{Supp}(X) = \{0, 1, 2, \ldots, n\}\) and we write \(X \sim \text{Binomial}(n, p)\).
It has probability mass function:
\[
p_X(x) = \binom{n}{x} p^x (1 - p)^{n - x}, \quad x \in \{0, 1, \ldots, n\},
\]
and the expected value and variance are simply functions of the parameters:
\(E[X] = np\)
\(\text{Var}(X) = np(1 - p)\)
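As an optional aside (not part of the worksheet solutions; the simulation size of 100,000 is an arbitrary choice), these formulas can be checked empirically with a quick simulation:
# Empirical check of the Binomial mean and variance formulas
x <- rbinom(100000, size = 50, prob = 0.3)  # 100,000 draws from Binomial(50, 0.3)
mean(x)   # should be close to n * p = 15
var(x)    # should be close to n * p * (1 - p) = 10.5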
Question 1: Power Outages During Storms
Northgate faces severe storms each year, leading to the possibility of widespread power outages. City planners want to estimate how likely it is for a storm to knock out power in many neighborhoods simultaneously and, if it does, how effectively emergency crews can restore service. Use the Binomial distribution and the BINS criteria (Binary outcomes, Independent trials, fixed Number of trials, Same success probability) to guide your analysis.
Important Note on Independence
In practice, power outages across neighborhoods are often correlated (for example, if a major substation goes down, it can affect multiple neighborhoods at once). For this exercise, however, we assume that each neighborhood’s power status is independent of the others.
During a storm, each of the city’s 50 neighborhoods independently loses power with probability \(p = 0.3\). Let \(X\) denote the total number of neighborhoods that lose power during a single storm.
a) Explain why \(X\) follows a Binomial distribution and identify its parameters.
b) Compute by hand the probability that exactly 20 neighborhoods lose power in a storm under these assumptions. Then repeat the calculation in R, using the function dbinom (see help(dbinom)) to check your work.
c) Express symbolically the probability that at least 20 neighborhoods lose power, and show how you would compute this probability using the Binomial PMF formula.
d) Directly calculating the probability in c) would be tedious. We will later learn a technique that leverages the continuous normal distribution to approximate it. However, we can still calculate it exactly using R. Use both of the following approaches:
Method 1: Create a vector of possible success counts from 20 to 50, apply dbinom to the vector, and sum the results.
Method 2: Use the pbinom function to compute the probability.
e) Given that at least 20 neighborhoods lost power during a particular storm, what is the probability that exactly 25 neighborhoods lost power? Assume the same Binomial model applies to this scenario.
f) For a random storm, determine the expected number of neighborhoods that will lose power and the standard deviation.
g) For a randomly selected storm, determine the probability that the number of neighborhoods that lose power exceeds two standard deviations above its expected value. Express your answer symbolically in terms of the Binomial PMF and compute the probability using R.
R Code for Binomial Distribution:
# Parameters
n <- 50 # number of neighborhoods
p <- 0.3 # probability of power loss per neighborhood
# Part b: P(X = 20)
# By hand calculation
choose(50, 20) * (0.3)^20 * (0.7)^30
# Using R function
dbinom(20, size = n, prob = p)
# Part d: P(X >= 20)
# Method 1: Sum individual probabilities
sum(dbinom(20:50, size = n, prob = p))
# Method 2: Using cumulative distribution
1 - pbinom(19, size = n, prob = p)
# or equivalently
pbinom(19, size = n, prob = p, lower.tail = FALSE)
# Part e: P(X = 25 | X >= 20)
p_25 <- dbinom(25, size = n, prob = p)
p_at_least_20 <- pbinom(19, size = n, prob = p, lower.tail = FALSE)
conditional_prob <- p_25 / p_at_least_20
cat("P(X = 25 | X >= 20) =", conditional_prob, "\n")
# Part f: Expected value and standard deviation
expected <- n * p
variance <- n * p * (1 - p)
std_dev <- sqrt(variance)
cat("E[X] =", expected, "\n")
cat("SD[X] =", std_dev, "\n")
# Part g: P(X > E[X] + 2*SD[X])
threshold <- expected + 2 * std_dev
cat("Two SDs above mean:", threshold, "\n")
prob_exceed <- pbinom(floor(threshold), size = n, prob = p, lower.tail = FALSE)
cat("P(X > E[X] + 2*SD[X]) =", prob_exceed, "\n")
Part 3: The Poisson Distribution
Many real-world phenomena involve random events occurring over time or space at an average rate. In such cases, the Poisson distribution provides a powerful mathematical model for counting the number of events in a fixed interval of time, space, or volume. A scenario is Poisson-distributed if:
Unique Events (no clustering): Events occur one at a time. The probability of two or more events happening simultaneously in a very small interval is negligible
Independence: The occurrence of one event does not affect the probability of another occurring. Events happen randomly and are not influenced by previous occurrences. Additionally, the number of events that occur in any interval is independent of the number that occur in any non-overlapping interval
Stationarity (Constant Rate): Events occur at a steady average rate over time or space. The expected number of events in an interval does not change unless external conditions shift the distribution
Proportionality: The expected number of events is proportional to the size of the interval. Doubling the length of time or space results in twice as many expected occurrences
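For reference, if \(X \sim \text{Poisson}(\lambda)\), its support is \(\text{Supp}(X) = \{0, 1, 2, \ldots\}\), its PMF is
\[
p_X(x) = \frac{\lambda^x e^{-\lambda}}{x!},
\]
and both its expected value and variance equal the parameter: \(E[X] = \text{Var}(X) = \lambda\). You will need this PMF for the by-hand calculation in the next question.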
Question 2: MRI Artifact Detection
Modern MRI scanners are highly sensitive but sometimes suffer from artifacts, which are unwanted noise distortions caused by patient movement, hardware issues, or interference. AI-powered medical imaging tools detect these artifacts to improve diagnostic accuracy.
In medical imaging, artifacts appear randomly across the spatial area of a scan. These artifacts do not overlap, occur independently, and appear at an average rate for a given image size. Because the number of artifacts is counted over a fixed spatial region rather than over time, the Poisson distribution provides a natural way to model this process.
To ensure the Poisson model remains valid, we assume that each MRI scan covers the same fixed spatial area. If scan sizes varied, the number of artifacts would depend on scan area, and the Poisson rate \(\lambda\) would need to be scaled accordingly. In this case, however, we assume that all scans are uniform in size, meaning that the average number of artifacts per scan remains constant.
A hospital’s radiology department is analyzing the number of MRI artifacts detected by an AI system per scan. Based on past data, the AI detects an average of \(\lambda = 3\) artifacts per scan, meaning that each complete image contains an average of 3 randomly occurring noise distortions. The department now wants to analyze data for 10 consecutive MRI scans to determine the likelihood of different artifact patterns across multiple scans.
a) Let \(Y\) denote the total number of artifacts detected by the AI in 10 scans. Justify why the number of MRI artifacts detected across 10 scans should follow a Poisson distribution, and identify the parameter \(\lambda_Y\) representing the expected number of artifacts across the 10 scans. Bonus exercise: prove this mathematically using induction.
b) Compute by hand the probability that exactly 20 artifacts are detected across the 10 scans. Then repeat the calculation in R, using the function dpois (see help(dpois)) to check your work.
c) Compute the probability that at most 20 artifacts are detected across the 10 scans. This calculation is tedious to do by hand. Use R to calculate it as before:
Method 1: Create a vector of possible counts from 0 to 20, apply dpois to the vector, and sum the results.
Method 2: Use the ppois function to compute the probability.
d) The hospital’s radiology department is analyzing two separate collections of MRI scans. Box A contains 10 scans, while Box B contains 15 scans. Assuming the artifact counts in the two boxes are independent, compute the probability that Box A contains at most 20 artifacts and Box B contains at most 35 artifacts.
R Code for Poisson Distribution:
# Parameters
lambda_per_scan <- 3
n_scans <- 10
lambda_total <- lambda_per_scan * n_scans
# Part b: P(Y = 20)
# By hand (would use formula)
# P(Y = 20) = e^(-30) * 30^20 / 20!
# Using R
dpois(20, lambda = lambda_total)
# Part c: P(Y <= 20)
# Method 1: Sum individual probabilities
sum(dpois(0:20, lambda = lambda_total))
# Method 2: Using cumulative distribution
ppois(20, lambda = lambda_total)
# Part d: Independent boxes
lambda_A <- 3 * 10 # 10 scans
lambda_B <- 3 * 15 # 15 scans
prob_A <- ppois(20, lambda = lambda_A)
prob_B <- ppois(35, lambda = lambda_B)
prob_both <- prob_A * prob_B # Independent events
cat("P(Box A <= 20 AND Box B <= 35) =", prob_both, "\n")
Part 4: Other Named Discrete Distributions
While we have focused on three named discrete random variables, it is important to recognize that there are many others, each designed to model different types of random processes. Not all discrete distributions count the number of successes in a fixed number of trials, nor do they all assume independent events.
For example:
Geometric Distribution: models the number of trials until the first success occurs. This can model repeated failures before success, such as the number of free throw attempts needed before the first basket
Negative Binomial Distribution: extends the geometric case by counting the number of trials until the \(r^{\text{th}}\) success occurs
Hypergeometric Distribution: counts the number of successes in a fixed number of trials but differs from the Binomial distribution because it models dependent trials. The Hypergeometric distribution applies to cases where sampling occurs without replacement, such as drawing defective items from a small batch or selecting colored gummy bears from a finite jar
A random variable \(X\) following a Hypergeometric distribution has the following parameters: \(N\), the total number of items considered; \(n\), the number of items sampled (or trials performed); and \(M\), the total number of successes possible. It has the following probability mass function:
\[
p_X(x) = \frac{\binom{M}{x} \binom{N - M}{n - x}}{\binom{N}{n}},
\]
with \(\binom{N}{n}\) denoting the number of ways to select \(n\) items out of \(N\) total items, \(\binom{M}{x}\) denoting the number of ways to select \(x\) successes out of \(M\) total successes, and \(\binom{N - M}{n - x}\) denoting the number of ways to choose \(n - x\) failures out of \(N - M\) total failures.
Question 3: Hypergeometric Distribution
Let’s consider our gummy bear example:
Jar₁ contains 30 red, 10 green, and 10 blue gummies
Jar₂ contains 20 red and 40 green gummies
Jar₃ contains 35 yellow gummies
Using the Hypergeometric distribution, determine the probability of getting 2 green gummies when you draw 2 gummies (without replacement) from Jar₁, and confirm your answer against how we calculated this in a previous worksheet.
R Code for Hypergeometric Distribution:
# Jar 1 contents
total_jar1 <- 50 # Total gummies
green_jar1 <- 10 # Green gummies (successes)
sample_size <- 2 # Drawing 2 gummies
# Probability of exactly 2 green gummies
# Using hypergeometric PMF
dhyper(x = 2, # number of successes desired
m = green_jar1, # number of success states in population
n = total_jar1 - green_jar1, # number of failure states
k = sample_size) # number of draws
# Manual calculation for verification
choose(10, 2) * choose(40, 0) / choose(50, 2)
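As an optional empirical check (seed and simulation count are arbitrary choices), sampling without replacement directly should agree with dhyper:
# Simulation check: draw 2 gummies without replacement from Jar 1 many times
set.seed(7)                                    # arbitrary seed for reproducibility
jar1 <- c(rep("red", 30), rep("green", 10), rep("blue", 10))
greens <- replicate(100000, sum(sample(jar1, size = 2) == "green"))
mean(greens == 2)                              # should be close to dhyper(2, 10, 40, 2)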
Key Takeaways
Summary 📝
Named distributions provide efficient ways to model common random processes
Bernoulli is the building block: single trial with success probability \(p\)
Binomial counts successes in \(n\) independent Bernoulli trials
Poisson models rare events occurring at rate \(\lambda\) over fixed intervals
Each distribution has specific assumptions that must be verified
R provides functions for the PMF (prefix d), CDF (prefix p), quantiles (prefix q), and random generation (prefix r)
Other distributions (Geometric, Negative Binomial, Hypergeometric) handle different scenarios