6.3. Cumulative Distribution Functions

Computing probabilities for continuous random variables requires integrating the PDF over intervals, which becomes tedious when many different probabilities are needed. Just as an antiderivative lets us evaluate many definite integrals without starting over each time, the cumulative distribution function (CDF) lets us integrate once and then obtain probabilities by simple evaluation.

Road Map 🧭

  • Define the cumulative distribution function (CDF).

  • Master the relationship between PDF and CDF through the fundamental theorem of calculus.

  • Learn essential properties of CDFs that make them valid probability functions.

  • Practice computing probabilities using CDFs instead of direct integration.

  • Apply CDFs to find percentiles and quantiles for continuous distributions.

6.3.1. The Universal Language of Cumulative Probability

The cumulative distribution function (CDF) provides a unified approach to probability that works seamlessly for both discrete and continuous random variables. Instead of dealing with probability mass functions and probability density functions separately, the CDF gives us one consistent framework.

Definition

For any random variable \(X\) (discrete or continuous), the cumulative distribution function \(F_X(x)\) is defined as:

\[F_X(x) = P(X \leq x)\]

The definition expresses a simple idea: \(F_X(x)\) gives the probability that the random variable \(X\) falls at or below any given value \(x\).

Applying the CDF Definition to Different Variable Types

Discrete:

\[F_X(x) = P(X \leq x) = \sum_{t \leq x} p_X(t)\]

Continuous:

\[F_X(x) = P(X \leq x) = \int_{-\infty}^{x} f_X(t) \, dt\]

In both cases, the argument of the PMF or PDF is replaced with a dummy variable \(t\) to avoid confusion with \(x\), the true argument of \(F_X\).

In the continuous case, \(F_X(x)\) gives both \(P(X\leq x)\) and \(P(X<x)\), because \(P(X = x) = 0\) for any single point \(x\).
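The two formulas above can be sketched in a few lines of Python. This is only an illustration: the fair-die PMF, the uniform density on \([0, 2]\), and the helper names `discrete_cdf` and `continuous_cdf` are assumptions introduced here, and the integral is approximated with a simple midpoint rule rather than computed exactly.

```python
from fractions import Fraction

# Hypothetical discrete example: PMF of a fair six-sided die.
pmf = {face: Fraction(1, 6) for face in range(1, 7)}

def discrete_cdf(x):
    """F_X(x): add up p_X(t) over every support point t <= x."""
    return sum(p for t, p in pmf.items() if t <= x)

def continuous_cdf(x, pdf, lower=-10.0, steps=100_000):
    """Midpoint-rule approximation of the integral of pdf from `lower` to x.

    `lower` stands in for -infinity, so it must lie below the support of the PDF.
    """
    if x <= lower:
        return 0.0
    h = (x - lower) / steps
    return sum(pdf(lower + (i + 0.5) * h) for i in range(steps)) * h

# Hypothetical continuous example: uniform density on [0, 2].
def uniform_pdf(t):
    return 0.5 if 0 <= t <= 2 else 0.0

print(discrete_cdf(3))      # 1/2
print(discrete_cdf(3.7))    # 1/2 (the CDF is flat between support points)
print(round(continuous_cdf(1.5, uniform_pdf), 3))  # approximately 0.75
```

Note that the discrete CDF only changes at support points, while the continuous CDF grows smoothly wherever the density is positive.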

6.3.2. Requirements for a Valid CDF

Cumulative distribution functions must satisfy several mathematical properties that reflect their probabilistic interpretation.

1. Monotonicity

As we move from a smaller to a larger value, we can only accumulate probability. Therefore, \(F_X\) is a non-decreasing function. That is, for any real numbers \(a < b\),

\[F_X(a) \leq F_X(b)\]

2. Limiting Behavior

The accumulated probability must be close to 0 at the far left of the x-axis (no probability accumulates infinitely far to the left), and close to 1 at the far right (all probability accumulates when we include everything):

\[\lim_{x \to -\infty} F_X(x) = 0 \quad \text{and} \quad \lim_{x \to +\infty} F_X(x) = 1\]

3. Right Continuity

For any point \(c\),

\[\lim_{x \to c^+} F_X(x) = F_X(c).\]

This technical property ensures that the CDF behaves consistently at boundary points, particularly important when dealing with mixed discrete-continuous distributions or piecewise functions. This property is satisfied automatically for continuous distributions.
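The three requirements can be spot-checked numerically. The sketch below is an assumption-laden illustration, not a proof of validity: the function name `looks_like_valid_cdf`, the grid limits, and the tolerances are all choices made here, and a finite grid can miss violations that occur between grid points. The two step functions are hypothetical examples for a variable that takes the values 0 and 1 with probability 1/2 each.

```python
def looks_like_valid_cdf(F, lo=-50.0, hi=50.0, n=10_001, tol=1e-9):
    """Spot-check monotonicity, limiting behavior, and right continuity on a grid."""
    xs = [lo + i * (hi - lo) / (n - 1) for i in range(n)]
    values = [F(x) for x in xs]
    # 1. Monotonicity: values never drop as x increases.
    non_decreasing = all(b >= a - tol for a, b in zip(values, values[1:]))
    # 2. Limiting behavior: near 0 at the far left, near 1 at the far right.
    limits_ok = abs(values[0]) <= tol and abs(values[-1] - 1.0) <= tol
    # 3. Right continuity: stepping slightly to the right must not change F much.
    right_continuous = all(abs(F(x + 1e-9) - F(x)) <= 1e-6 for x in xs)
    return non_decreasing and limits_ok and right_continuous

def step_cdf(x):
    """Right-continuous step function: the jump value is attained at the jump point."""
    if x < 0:
        return 0.0
    if x < 1:
        return 0.5
    return 1.0

def left_continuous_step(x):
    """Same steps, but the jump value is only attained just after the jump point."""
    if x <= 0:
        return 0.0
    if x <= 1:
        return 0.5
    return 1.0

print(looks_like_valid_cdf(step_cdf))              # True
print(looks_like_valid_cdf(left_continuous_step))  # False
print(looks_like_valid_cdf(lambda x: 0.5))         # False (limits are wrong)
```

The second function fails only the right-continuity check, which shows why that requirement is stated separately from the other two.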

6.3.3. The PDF-CDF Connection

For continuous random variables, the relationship between the PDF and CDF mirrors the fundamental theorem of calculus.

From PDF to CDF: Integration

By the definition,

\[F_X(x) = \int_{-\infty}^{x} f_X(t) \, dt\]

\(F_X(x)\) is the accumulation of all the probability from the left tail of the distribution up to the point \(x\).

From CDF to PDF: Differentiation

Given a CDF \(F_X(x)\), we can recover the PDF by differentiation:

\[f_X(x) = \frac{d}{dx} F_X(x) = F_X'(x)\]

wherever the derivative exists. This relationship explains why the PDF measures the “rate of accumulation” of probability at each point.
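As a rough numerical illustration of the differentiation step, the sketch below approximates \(F_X'(x)\) with a central difference and compares it with the exact derivative. The CDF \(F(x) = x^3\) on \([0, 1]\) is a hypothetical example chosen here, as are the names `F` and `numeric_pdf`.

```python
def F(x):
    """Hypothetical CDF: 0 for x < 0, x^3 on [0, 1], and 1 for x > 1."""
    if x < 0:
        return 0.0
    if x <= 1:
        return x ** 3
    return 1.0

def numeric_pdf(x, h=1e-5):
    """Central-difference approximation of the derivative F'(x)."""
    return (F(x + h) - F(x - h)) / (2 * h)

for x in (0.25, 0.5, 0.75):
    # Compare the numerical slope with the exact PDF 3x^2.
    print(x, round(numeric_pdf(x), 4), 3 * x ** 2)

print(numeric_pdf(2.0))  # 0.0, since the CDF is flat beyond x = 1
```

Where the CDF is steep the PDF is large, and where the CDF is flat the PDF is zero, which is the "rate of accumulation" reading in numerical form.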

Visualizing the Relationship

The connection becomes clear when we graph the two functions together.

Fig. 6.4 Relationship between PDF and CDF

  • If one function is piecewise-defined, the other is as well. They share the same set of boundaries.

  • A probability of the form \(P(X\leq a)\) corresponds to the area to the left of \(a\) under a PDF, and to the height of the curve at \(a\) on a CDF.

  • A PDF can take any shape, as long as it never drops below the horizontal axis and integrates to 1. A CDF is more constrained in shape; it approaches 0 on the left, 1 on the right, and is never decreasing in between.

  • In any region where the PDF is \(0\), the CDF is flat, but not necessarily 0. It retains any area accumulated in previous regions, but no additional area is added.

6.3.4. Building a Piecewise CDF

Constructing a CDF based on a piecewise-defined PDF requires careful attention. The most crucial aspect of working with piecewise CDFs is remembering that they are cumulative—they must account for all probability accumulated up to each point, not just in the current region.

Useful tips for constructing a piecewise CDF:

  • Make a rough sketch similar to Fig. 6.4. Visually identify the regions where the CDF is approaching 0, increasing, flat, or approaching 1. At the end, check whether the computational result matches the sketch.

  • Stick to the definition at first. By definition,

    \[F_X(x) = \int_{-\infty}^x f_X(t) \, dt\]

    is an integral from negative infinity up to the input value, \(x\). Starting directly from this definition, without skipping steps, will help you become familiar with the subtle patterns involved in constructing piecewise CDFs.

Example💡: Constructing the CDF for a piecewise PDF

Construct the CDF of a random variable \(X\) which has the following PDF. Use Fig. 6.4 as reference.

\[\begin{split}f_X(x) = \begin{cases} \frac{3x}{2} & \text{for } 0 \leq x \leq 1 \\ 0 & \text{for } 1 < x < 5 \\ \frac{1}{4} & \text{for } 5 \leq x \leq 6 \\ 0 & \text{elsewhere} \end{cases}\end{split}\]

Work on each region given by the PDF separately:

Region 1 (\(x < 0\)): No probability accumulated yet

\[F_X(x) = \int_{-\infty}^x f_X(t) \, dt = \int_{-\infty}^x 0 \, dt = 0\]

Region 2 (\(0 \leq x \leq 1\)): Accumulating triangular area

\[F_X(x) = \int_{-\infty}^x f_X(t) \, dt = \underbrace{\int_{-\infty}^0 f_X(t)dt}_{=F_X(0) = 0} + \int_0^x \frac{3t}{2} \, dt = \frac{3x^2}{4}\]

Region 3 (\(1 < x < 5\)): No new probability, but keep previous accumulation

\[\begin{split}F_X(x) &= \int_{-\infty}^x f_X(t) \, dt \\ &= \underbrace{\int_{-\infty}^0 f_X(t)dt + \int_0^1 \frac{3t}{2} \, dt}_{=F_X(1) = \frac{3}{4}} + \underbrace{\int_1^x f_X(t) \, dt}_{=0} = \frac{3}{4}\end{split}\]

The area is constant at the triangle’s total area.

Region 4 (\(5 \leq x \leq 6\)): Add rectangular area to previous accumulation

\[\begin{split}F_X(x) &= \int_{-\infty}^x f_X(t) \, dt \\ &= \underbrace{\int_{-\infty}^0 f_X(t)dt + \int_0^1 \frac{3t}{2} \, dt + \int_1^5 f_X(t) \, dt}_{=F_X(5)=\frac{3}{4}} + \int_5^x f_X(t) \, dt\\ &=\frac{3}{4} + \int_5^x \frac{1}{4} \, dt = \frac{3}{4} + \frac{x-5}{4} = \frac{x-2}{4}\end{split}\]

Region 5 (\(x > 6\)): All probability has been accumulated

\[\begin{split}F_X(x) &= \int_{-\infty}^x f_X(t) \, dt \\ &= F_X(6) + \int_{6}^{x} f_X(t) \, dt = 1 + \int_6^{x} 0 \, dt = 1.\end{split}\]

Putting it all together,

\[\begin{split}F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{3x^2}{4} & \text{for } 0 \leq x \leq 1 \\ \frac{3}{4} & \text{for } 1 < x < 5 \\ \frac{x-2}{4} & \text{for } 5 \leq x \leq 6 \\ 1 & \text{for } x > 6 \end{cases}\end{split}\]
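One way to sanity-check a hand-derived piecewise CDF is to compare it with a direct numerical integration of the PDF. The sketch below does this for the example above; the function names and the midpoint-rule integrator are assumptions made for illustration, and the lower limit of \(-1\) stands in for \(-\infty\), which is safe here only because the PDF is zero below 0.

```python
def pdf(x):
    """Piecewise PDF from the example."""
    if 0 <= x <= 1:
        return 1.5 * x
    if 5 <= x <= 6:
        return 0.25
    return 0.0

def cdf(x):
    """Closed-form CDF derived region by region in the example."""
    if x < 0:
        return 0.0
    if x <= 1:
        return 0.75 * x ** 2
    if x < 5:
        return 0.75
    if x <= 6:
        return (x - 2) / 4
    return 1.0

def integrate_pdf(x, lower=-1.0, steps=100_000):
    """Midpoint-rule approximation of the integral of pdf from `lower` to x."""
    if x <= lower:
        return 0.0
    h = (x - lower) / steps
    return sum(pdf(lower + (i + 0.5) * h) for i in range(steps)) * h

# One test point per region, plus the region boundaries 1 and 6.
for x in (-0.5, 0.5, 1.0, 3.0, 5.5, 6.0, 7.0):
    print(x, round(cdf(x), 3), round(integrate_pdf(x), 3))
```

The two columns should agree up to the small error of the numerical integration. A mismatch in a later region usually means the area from earlier regions was left out.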

🛑 Know how to avoid the common mistake before simplifying steps

In all regions beyond the first in the previous example, the CDF \(F_X(x)\) is computed as the sum of the area accumulated in the earlier regions plus the integral over the current region.

Forgetting to factor in the area from previous regions is a very common mistake made by students. While you may eventually skip some redundant steps as you gain familiarity, your top priority should be to avoid this error.
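To see what goes wrong, suppose the accumulated area is dropped in Region 4 of the previous example and only the current region is integrated. A quick check at the right endpoint exposes the error:

\[\int_5^x \frac{1}{4} \, dt = \frac{x-5}{4}, \qquad \text{which at } x = 6 \text{ gives } \frac{6-5}{4} = \frac{1}{4} \neq 1.\]

The missing \(\frac{3}{4}\) is exactly \(F_X(5)\), the area accumulated before Region 4. Checking that the final piece reaches 1 at the right end of the support is a cheap way to catch this mistake.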

6.3.5. Computing Probabilities with CDFs

Once we have the CDF, calculating probabilities becomes remarkably straightforward. We can handle any probability question without additional integration.

Recall that equality signs do not matter for continuous RVs

All the rules below still hold when \(<\) and \(\leq\) are interchanged, or when \(>\) and \(\geq\) are interchanged, because \(P(X = a) = 0\) for any single point \(a\).

Basic CDF Evaluation

For \(P(X \leq a)\),

\[P(X \leq a) = F_X(a)\]

This is just the definition of the CDF—plug in the value and read the result.

“Greater Than” Probabilities

For \(P(X > a)\), we use the complement rule:

\[P(X > a) = 1 - P(X \leq a) = 1 - F_X(a)\]

The probability of exceeding \(a\) equals one minus the probability of not exceeding \(a\).

Interval Probabilities

For \(P(a < X \leq b)\), we take the accumulated probability up to \(b\) and subtract the accumulated probability up to \(a\), leaving just the probability in the interval.

\[P(a < X \leq b) = F_X(b) - F_X(a)\]

Fig. 6.5 P(a < X ≤ b) equals the difference between CDF values
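The interval rule can be justified in one step. Assuming \(a < b\), the event \(\{X \leq b\}\) splits into the two disjoint events \(\{X \leq a\}\) and \(\{a < X \leq b\}\), so their probabilities add:

\[\begin{split}P(X \leq b) &= P(X \leq a) + P(a < X \leq b) \\ \Rightarrow \quad P(a < X \leq b) &= P(X \leq b) - P(X \leq a) = F_X(b) - F_X(a)\end{split}\]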

Example💡: Computing Probabilities Using a CDF

Use the PDF and CDF found in the previous example to answer the questions below.

\[\begin{split}f_X(x) = \begin{cases} \frac{3x}{2} & \text{for } 0 \leq x \leq 1 \\ \frac{1}{4} & \text{for } 5 \leq x \leq 6 \\ 0 & \text{elsewhere} \end{cases}\end{split}\]
\[\begin{split}F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{3x^2}{4} & \text{for } 0 \leq x < 1 \\ \frac{3}{4} & \text{for } 1 \leq x < 5 \\ \frac{x-2}{4} & \text{for } 5 \leq x \leq 6 \\ 1 & \text{for } x > 6 \end{cases}\end{split}\]
  1. Find \(P(X \leq 0.5)\).

    \(x=0.5\) belongs to Region 2 of the CDF, so \(F_X(0.5) = \frac{3(0.5)^2}{4} = \frac{3}{16}\).

  2. Find \(P(X \geq 5.5)\).

    \(P(X \geq 5.5) = 1 - P(X < 5.5) = 1 - F_X(5.5)\), where the last step uses \(P(X < 5.5) = P(X \leq 5.5)\) for a continuous random variable. \(x=5.5\) is in Region 4. Therefore, \(P(X \geq 5.5) = 1 - \frac{5.5-2}{4} = 1 - \frac{7}{8} = \frac{1}{8}\).

  3. Find \(P(0.2 < X \leq 5.8)\).

    \[\begin{split}P(0.2 < X \leq 5.8) &= F_X(5.8) - F_X(0.2) = \frac{5.8-2}{4} -\frac{3(0.2)^2}{4}\\ &= 0.95 - 0.03 = 0.92.\end{split}\]
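The three answers above can be reproduced with a short script. This is a minimal sketch: the helper names `prob_at_most`, `prob_greater`, and `prob_between` are introduced here for illustration, and exact rational arithmetic from the standard `fractions` module is used so the results match the hand calculations without rounding.

```python
from fractions import Fraction

def cdf(x):
    """CDF from the example, evaluated exactly with rational arithmetic."""
    x = Fraction(x)
    if x < 0:
        return Fraction(0)
    if x < 1:
        return Fraction(3, 4) * x ** 2
    if x < 5:
        return Fraction(3, 4)
    if x <= 6:
        return (x - 2) / 4
    return Fraction(1)

def prob_at_most(a):
    """P(X <= a) = F(a)."""
    return cdf(a)

def prob_greater(a):
    """P(X > a) = 1 - F(a); equals P(X >= a) for a continuous variable."""
    return 1 - cdf(a)

def prob_between(a, b):
    """P(a < X <= b) = F(b) - F(a)."""
    return cdf(b) - cdf(a)

# Decimal inputs are passed as strings so Fraction parses them exactly.
print(prob_at_most("0.5"))         # 3/16
print(prob_greater("5.5"))         # 1/8
print(prob_between("0.2", "5.8"))  # 23/25, which is 0.92
```

No integration appears anywhere in the script: once the CDF is known, every probability is an evaluation or a subtraction.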

6.3.6. Finding Percentiles using CDF

Definition

Take a value \(p \in [0,1]\). The \(100p\)th percentile of a random variable \(X\), denoted \(x_p\), is the value satisfying:

\[F_X(x_p) = P(X \leq x_p) = p.\]

For example,

  • The 50th percentile of a random variable \(X\) is written as \(x_{0.5}\) and satisfies the condition:

    \[F_X(x_{0.5}) = P(X \leq x_{0.5})= 0.5.\]

    \(x_{0.5}\) represents the vertical cutoff on the PDF of \(X\) such that the area to its left is exactly \(0.5\). This special percentile is also called the median (\(\tilde{\mu}\)) of \(X\).

  • The 25th percentile \(x_{0.25}\) satisfies \(F_X(x_{0.25}) = 0.25\). It splits the PDF into a left region with area 0.25 and right region with area 0.75.

  • The 75th percentile \(x_{0.75}\) satisfies \(F_X(x_{0.75}) = 0.75\). It splits the PDF into a left region with area 0.75 and right region with area 0.25.

  • The true IQR of \(X\) is the difference \(x_{0.75} - x_{0.25}\).

Connection to sample percentiles

Recall that we’ve already discussed the concept of sample percentiles for a dataset in Chapter 3. The percentiles for a random variable are considered their population equivalent in the framework of data analysis.

If we generate data points \(x_1, x_2,\cdots, x_n\) from a random variable \(X\), we would compute the sample percentiles using \(x_1, x_2,\cdots, x_n\), and the true percentiles using the CDF of \(X\). The two sets of objects are closely related, but different.
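A small simulation can make the distinction concrete. The sketch below assumes a hypothetical \(X\) that is uniform on \([0, 1]\), so that \(F_X(x) = x\) on that interval and the true percentile is simply \(x_p = p\). The helper `sample_percentile` uses one simple order-statistic convention, which may differ slightly from the sample percentile rule introduced in Chapter 3.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable
n = 100_000
# Draw n data points from Uniform(0, 1) and sort them once.
sample = sorted(random.random() for _ in range(n))

def sample_percentile(sorted_data, p):
    """Order-statistic estimate of the 100p-th percentile of the data."""
    index = min(int(p * len(sorted_data)), len(sorted_data) - 1)
    return sorted_data[index]

for p in (0.25, 0.5, 0.75):
    # True percentile is p itself; the sample percentile should land close to it.
    print(p, round(sample_percentile(sample, p), 3))
```

The sample percentiles change with every new dataset, while the true percentiles are fixed properties of the distribution.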

Computing Percentiles of a Continuous Random Variable

Percentiles for a continuous random variable are found by replacing the general definition \(F_X(x_p) = p\) with the specific expression for the CDF, then solving for \(x_p\). See the example below for further detail.

Example💡: Finding a Percentile

Find the 85th percentile of a random variable \(X\) which has the following CDF:

\[\begin{split}F_X(x) = \begin{cases} 0 & \text{for } x < 0 \\ \frac{3x^2}{4} & \text{for } 0 \leq x < 1 \\ \frac{3}{4} & \text{for } 1 \leq x < 5 \\ \frac{x-2}{4} & \text{for } 5 \leq x \leq 6 \\ 1 & \text{for } x > 6 \end{cases}\end{split}\]

We must solve \(F_X(x_{0.85}) = 0.85\) for \(x_{0.85}\).

Which piece of the CDF should we use to replace the general expression \(F_X(x_{0.85})\)? Since \(F_X(5) = \frac{3}{4} = 0.75 < 0.85\), we know that the 85th percentile has to be strictly greater than \(5\). Using Region 4,

\[\begin{split}&\frac{x_{0.85} - 2}{4} = 0.85\\ &x_{0.85} - 2 = 3.4\\ &x_{0.85} = 5.4\end{split}\]
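When the equation \(F_X(x_p) = p\) is awkward to solve by hand, it can also be solved numerically. The sketch below uses bisection on the CDF from this example; the function name `percentile`, the bracketing interval \([0, 6]\) (the support of \(X\)), and the tolerance are assumptions made for illustration.

```python
def cdf(x):
    """CDF from the example."""
    if x < 0:
        return 0.0
    if x < 1:
        return 0.75 * x ** 2
    if x < 5:
        return 0.75
    if x <= 6:
        return (x - 2) / 4
    return 1.0

def percentile(p, lo=0.0, hi=6.0, tol=1e-10):
    """Bisection search for the smallest x in [lo, hi] with F(x) >= p."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if cdf(mid) < p:
            lo = mid   # not enough probability accumulated yet; move right
        else:
            hi = mid   # enough accumulated; the answer is at mid or to its left
    return (lo + hi) / 2

print(round(percentile(0.85), 6))  # 5.4, matching the hand calculation
print(round(percentile(0.5), 6))   # 0.816497, the median sqrt(2/3)
```

Bisection works here because a CDF is non-decreasing. On a flat stretch of the CDF the equation has more than one solution, and this sketch returns the left end of that stretch.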

6.3.7. More examples

6.3.8. Bringing It All Together

Key Takeaways 📝

  1. The cumulative distribution function \(F_X(x) = P(X \leq x)\) provides a universal framework for both discrete and continuous random variables. It eliminates the need for repeated integration when computing probabilities for continuous random variables.

  2. For continuous variables, PDFs and CDFs are related through differentiation and integration: \(f_X(x) = F_X'(x)\) and \(F_X(x) = \int_{-\infty}^x f_X(t) \, dt\).

  3. Valid CDFs must be non-decreasing, approach 0 as \(x \to -\infty\), approach 1 as \(x \to +\infty\), and be right-continuous.

  4. Probability calculations become simpler with CDFs: \(P(X \leq a) = F_X(a)\), \(P(X > a) = 1 - F_X(a)\), and \(P(a < X \leq b) = F_X(b) - F_X(a)\).

  5. Piecewise CDFs require careful attention—each region must include all probability accumulated from previous regions.

  6. Percentiles are found by solving \(F_X(x_p) = p\).

In our next sections, we’ll see how these CDF concepts apply to specific named distributions, starting with the most important continuous distribution in all of statistics: the normal distribution.

Exercises

  1. CDF Construction: Given the PDF \(f_X(x) = 2x\) for \(0 \leq x \leq 1\), 0 elsewhere:

    1. Find the CDF \(F_X(x)\)

    2. Verify that your CDF satisfies all required properties

    3. Use the CDF to find \(P(0.2 < X \leq 0.8)\)

  2. Piecewise CDF: Consider the PDF:

    \[\begin{split}f_X(x) = \begin{cases} kx & \text{for } 0 \leq x \leq 2 \\ k & \text{for } 2 < x \leq 4 \\ 0 & \text{elsewhere} \end{cases}\end{split}\]
    1. Find the value of \(k\) that makes this a valid PDF.

    2. Construct the complete CDF.

    3. Find \(P(1 < X < 3)\) using the CDF.

  3. CDF Validation: Determine whether the following function could be a valid CDF:

    \[\begin{split}F(x) = \begin{cases} 0 & \text{for } x < 0 \\ x^2 & \text{for } 0 \leq x < 1 \\ 1 & \text{for } x \geq 1 \end{cases}\end{split}\]

    If valid, find the corresponding PDF.

  4. Percentile Calculations: For the CDF from Exercise 2,

    1. Find the median.

    2. Find the 75th percentile.

    3. Find the interquartile range.

  5. PDF-CDF Relationship: Given \(F_X(x) = 1 - e^{-x}\) for \(x \geq 0\), 0 elsewhere,

    1. Find the PDF \(f_X(x)\).

    2. Calculate \(P(X > 2)\).

    3. Find the value \(x\) such that \(P(X \leq x) = 0.95\).