Normal Distribution
=========================================
We now encounter the most important continuous distribution in all of statistics:
the normal distribution.
.. admonition:: Road Map 🧭
:class: important
• Understand the **historical development** and significance of the normal distribution.
• Master the **mathematical definition** and properties of the normal PDF.
• Explore how the **parameters μ and σ** control location and shape.
• Learn the famous **empirical rule** for quick probability estimates.
• Understand why **standardization** is essential for normal computations.
The Historical Legacy: From Gauss to Modern Statistics
--------------------------------------------------------
The normal distribution carries a rich mathematical heritage spanning over two centuries.
While often called the "Gaussian distribution" in honor of Carl Friedrich Gauss (1777-1855),
the distribution's development involved several brilliant mathematicians who recognized
patterns in natural variation.
Gauss and the Method of Least Squares
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/Gauss.jpg
:alt: Carl Friedrich Gauss
:align: right
:figwidth: 30%
Carl Friedrich Gauss (1777-1855)
In the late 1700s and early 1800s, Gauss was working on astronomical calculations and
geodetic surveys—problems requiring precise measurements where small errors were
inevitable. He sought to understand how these measurement errors behaved and how
to optimally combine multiple measurements of the same quantity.
Gauss discovered that measurement errors followed a specific pattern: most errors
were small and clustered around zero, with larger errors becoming increasingly rare.
More importantly, he found that this error distribution had a particular exponential
form with quadratic decay that optimized his least squares fitting procedure.
The Connection to Binomial Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gauss recognized that his continuous error distribution emerged as a limiting case
of discrete binomial distributions. When the number of trials becomes very large
(with the success probability held fixed), the jagged, discrete binomial
distribution smooths into the graceful bell curve we now call the normal
distribution.
This connection between discrete counting processes and continuous measurement errors
revealed a profound unity in probability theory—the same mathematical structure appears
whether we're flipping coins or measuring stellar positions.
A Universal Pattern in Nature
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
What makes the normal distribution truly remarkable is its ubiquity.
It describes not just measurement errors, but heights and weights of organisms,
intelligence test scores, particle velocities in gases, and countless other
natural phenomena. This universality isn't coincidental—it emerges from a deep
mathematical principle we'll encounter later called the Central Limit Theorem.
The Mathematical Definition: Anatomy of the Bell Curve
---------------------------------------------------------
Notation and Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~
If a random variable :math:`X` has a normal distribution, we write:
.. math::
X \sim N(\mu, \sigma^2) \quad \text{or} \quad X \sim N(\mu, \sigma).
A normal random variable takes two parameters:
.. flat-table::
:header-rows: 1
:align: center
:width: 100%
* -
- Mean :math:`\mu`
- Standard Deviation :math:`\sigma`
* - **Possible values**
- :math:`\mu \in (-\infty, +\infty)`. It can be any real number.
- :math:`\sigma > 0`. It must be a positive value.
* - **Interpretation**
- The *location parameter*. It represents the center of the distribution of :math:`X`.
- The *scale parameter*. It represents how spread out the distribution of :math:`X` is.
* - **Effect on the appearance of the PDF**
- Slides the curve left or right, without changing the shape
- Makes the graph tall and narrow (small :math:`\sigma`) or wide and flat (large :math:`\sigma`). It does not change the
location of the center.
.. admonition:: Variance or Standard Deviation?
:class: important
It is standard to describe a normal distribution using either variance or
standard deviation, but **we must be explicit about which we're using**.
The constraints and interpretations of standard deviation transfer almost directly
to variance. Variance must be a positive number, and it controls how wide
the distribution is. The only difference is their scale—variance is in the squared scale,
while standard deviation is on the same scale as :math:`X`.
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/normal-pdfs.png
:alt: Normal pdfs for different sets of parameters mu and sigma
:align: center
:figwidth: 60%
How different values of μ and σ affect the normal distribution's appearance
The Normal PDF
~~~~~~~~~~~~~~~~~
The PDF of a normal random variable :math:`X` takes the form:
.. math::
f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \quad \text{for all } x \in \mathbb{R}
This elegant formula contains several key components:
- The **normalizing constant** :math:`\frac{1}{\sigma\sqrt{2\pi}}` ensures the total
area under the curve equals 1.
- The **exponential function** :math:`e^{-(\cdot)}` creates the smooth, continuous decay.
- The **quadratic expression** :math:`\left(\frac{x-\mu}{\sigma}\right)^2` in the exponent
produces the symmetric, bell-shaped curve.
- The **parameters** :math:`\mu` and :math:`\sigma` control the distribution's location and spread.
Fundamental Properties
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Regardless of its parameters, every normal distribution satisfies the following properties:
1. It is **symmetrical** about the mean :math:`\mu`.
2. It is **unimodal** with a single peak at :math:`x = \mu`.
3. Since the distribution is perfectly symmetric, the **mean equals the median**:
:math:`\mu = \tilde{\mu}`.
4. It is **bell-shaped** with smooth, continuous curves.
5. The two tails **approach but never reach zero** as :math:`x \to \pm\infty`.
This implies that :math:`\text{supp}(X) = (-\infty, +\infty)`.
6. The points where the normal curve changes from concave down to concave up (its **inflection points**) occur exactly at
:math:`x = \mu - \sigma` and :math:`x = \mu + \sigma`.
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/inflection-points.png
:alt: Normal distribution showing inflection points at μ ± σ
:align: center
:width: 70%
The normal curve changes concavity at exactly one standard deviation from the mean
The Empirical Rule: A Practical Tool
----------------------------------------------
One of the most useful properties of normal distributions is that they all follow the same probability
pattern, regardless of their specific parameter values. This universal pattern is called the
**empirical rule** or **68-95-99.7 rule**.
For any normal distribution :math:`X \sim N(\mu, \sigma)`:
1. **68% of the probability** lies within one standard deviation:
:math:`P(\mu - \sigma < X < \mu + \sigma) \approx 0.68`
2. **95% of the probability** lies within two standard deviations:
:math:`P(\mu - 2\sigma < X < \mu + 2\sigma) \approx 0.95`
3. **99.7% of the probability** lies within three standard deviations:
:math:`P(\mu - 3\sigma < X < \mu + 3\sigma) \approx 0.997`
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/empirical-rule-labeled.png
:alt: Empirical rule showing 68-95-99.7 percentages
:align: center
:width: 80%
The empirical rule provides quick probability estimates for any normal distribution
Extended Breakdown of the Empirical Rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. **34%** of probability lies in each of :math:`(\mu, \mu + \sigma)`
   and :math:`(\mu - \sigma, \mu)`.
* Each interval is half of 68%
2. **13.5%** of probability lies in each of :math:`(\mu + \sigma, \mu + 2\sigma)`
   and :math:`(\mu - 2\sigma, \mu - \sigma)`.
* Each interval is half of 95%, minus an interval from #1.
3. **2.35%** of probability lies in each of :math:`(\mu + 2\sigma, \mu + 3\sigma)`
and :math:`(\mu - 3\sigma, \mu - 2\sigma)`.
* Each interval is half of 99.7%, minus an interval from #2 and an interval from #1.
4. **0.15%** of probability lies beyond :math:`\mu + 3\sigma` and
another **0.15%** beyond :math:`\mu - 3\sigma`.
* Each region is half of 100% - 99.7%.
Insights from the Empirical Rule
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- About **2/3** of values fall within one standard deviation of the mean.
- About **19 out of 20** values fall within two standard deviations.
- **Nearly all values** (99.7%) fall within three standard deviations.
.. admonition:: Example💡: Computing Normal Probabilities Using the Empirical Rule
:class: note
A chemical lab reports that the amount of active ingredient in a single tablet of
a medication is normally distributed with a mean of 500 mg and a standard deviation of 5 mg.
Q. What percentage of the tablets contain between 490 mg and 505 mg of active ingredient?
:math:`490 = \mu - 2\sigma \text{ and } 505 = \mu + \sigma`. Therefore, we are looking for
.. math:: P(\mu -2\sigma \leq X \leq \mu + \sigma)
There are many different ways to solve this using the empirical rule. One way is to view
the probability as
.. math:: P(\mu -2\sigma \leq X \leq \mu + 2\sigma) - P(\mu+\sigma \leq X \leq \mu +2\sigma)
The first term is approximately 0.95 by the empirical rule,
and the second term is approximately 0.135. Then finally,
.. math:: P(\mu -2\sigma \leq X \leq \mu + \sigma) \approx 0.95 - 0.135 = 0.815
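As a sanity check, R's ``pnorm()`` function (introduced in detail at the end of this
section) gives the exact probability; a minimal sketch using the example's numbers:

.. code-block:: r

   # Exact probability for the tablet example
   pnorm(505, mean = 500, sd = 5) - pnorm(490, mean = 500, sd = 5)
   # 0.8186, close to the empirical-rule estimate of 0.815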
The Standard Normal Distribution: The Foundation of All Normal Computations
---------------------------------------------------------------------------------
While normal distributions can have any mean and standard deviation, there's one particular
normal distribution that serves as the foundation for all normal probability calculations.
Definition of the Standard Normal Distribution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The **standard normal distribution** is the normal distribution with mean 0 and standard deviation 1.
When a random variable follows the standard normal distribution, we denote it with :math:`Z` and write:
.. math::
Z \sim N(0, 1)
Its PDF is obtained by plugging in 0 and 1 for :math:`\mu` and :math:`\sigma`, respectively, in the
general form:
.. math::
f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} \quad \text{for all } z \in \mathbb{R}
Because the standard normal is so important, it also gets **special notations** for its PDF and CDF:
- **PDF**: :math:`\phi(z) = f_Z(z)` (lowercase Greek phi)
- **CDF**: :math:`\Phi(z) = P(Z \leq z)` (uppercase Greek phi)
Standardization of Normal Random Variables
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Any normal random variable can be converted to a standard normal random variable
using the **standardization formula**:
.. math::
Z = \frac{X - \mu}{\sigma}.
If :math:`X \sim N(\mu, \sigma)`, then :math:`Z \sim N(0, 1)`.
**Why Standardization Works**
Standardization is essentially a change of variables (u-substitution) that:
1. **Centers** the distribution at 0 by subtracting the mean.
2. **Rescales** the distribution to unit variance by dividing by the standard deviation.
For a more concrete demonstration, we first need a special property of the normal distribution:
* When a normal random variable is **multiplied by a constant or has a constant added to it,
the resulting random variable is still normal**, just with a new set of mean and variance parameters.
Since :math:`\mu` and :math:`\sigma` are constants, the operation on :math:`X` to get to :math:`Z` leaves us
with another normal random variable. Also,
* :math:`E[Z] = E\left[\frac{X-\mu}{\sigma}\right]= \frac{E[X]-\mu}{\sigma} = \frac{\mu - \mu}{\sigma}=0`.
* :math:`\sigma^2[Z] = \text{Var}(Z) = \text{Var}\left(\frac{X-\mu}{\sigma}\right) = \frac{\text{Var}(X)}{\sigma^2}= \frac{\sigma^2}{\sigma^2} =1`.
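These facts can be confirmed by simulation. A minimal sketch with ``rnorm()`` (an R function
covered at the end of this section), using illustrative parameters :math:`\mu = 112` and
:math:`\sigma = 10`:

.. code-block:: r

   set.seed(1)
   x <- rnorm(100000, mean = 112, sd = 10)  # simulated normal sample
   z <- (x - 112) / 10                      # standardize each value
   c(mean(z), sd(z))                        # approximately 0 and 1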
Why Do We Standardize?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The fundamental problem with normal distributions is that their CDFs cannot be expressed in terms of
elementary functions. There's no simple formula for:
.. math::
P(X\leq x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^2} dt
However, we can **numerically approximate these integrals for the standard normal distribution** and tabulate the results.
Instead of creating tables of approximations for all possible pairs of parameters—which would be impossible—we standardize,
so that one table suffices for any normal random variable.
Forward Problems: :math:`x` to Probability
--------------------------------------------------
Now that we understand the theoretical foundation, let's learn how to actually compute probabilities
for normal distributions. Since we cannot integrate the normal PDF analytically, we rely on numerical
approximations tabulated in standard normal tables.
The Standard Normal Table (Z-Table)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Statisticians have computed high-precision numerical approximations for the standard normal CDF
:math:`\Phi(z) = P(Z \leq z)` and compiled them into tables. These tables typically provide probabilities
accurate to four decimal places for z-values given to two decimal places.
For example, to find :math:`P(Z \leq -1.38)`, first locate :math:`-1.3` among the row
labels. Then find the column labeled :math:`0.08`. The intersection
of the row and the column gives the desired probability: :math:`P(Z \leq -1.38)=0.0838`.
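R's ``pnorm()`` function (covered at the end of this section) returns the same CDF
values without a table lookup; a quick check:

.. code-block:: r

   pnorm(-1.38)  # 0.0838 (to four decimal places), matching the Z-table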
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/z-table-neg-z.png
:alt: Half of the Z-table for negative z values
:align: center
:width: 90%
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/z-table-pos-z.png
:alt: The other half of the Z-table for positive z- values
:align: center
:width: 90%
The Strategy for Non-standard Normal RVs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We said we would apply the standardization technique to use the Z-table for any normal distribution.
How will this work? The key steps are the following:
1. Recognize that subtracting the same value on both sides or multiplying by the same positive value on both sides
does not change the truth of an (in)equality. It follows that the probability of the
(in)equality also remains unchanged.
2. Using #1,
.. math::
P(X \leq a) = P\left(\frac{X-\mu}{\sigma} \leq \frac{a-\mu}{\sigma}\right)
= P\left(Z \leq \frac{a-\mu}{\sigma}\right) = \Phi\left(\frac{a-\mu}{\sigma}\right).
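Both sides of this identity can be checked with ``pnorm()``; the parameters below
(:math:`\mu = 50`, :math:`\sigma = 10`, :math:`a = 60`) are illustrative only:

.. code-block:: r

   mu <- 50; sigma <- 10; a <- 60   # hypothetical values
   pnorm(a, mean = mu, sd = sigma)  # P(X <= a) computed directly
   pnorm((a - mu) / sigma)          # Phi((a - mu)/sigma); identical value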
The Strategy for Probabilities Which Do Not Match the CDF
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
We are often interested in probabilities which are not in the form :math:`\Phi(z) = P(Z \leq z)`.
* For **"greater than" probabilities**, use the complement rule: :math:`P(Z > z) = 1 - \Phi(z)`.
* For **probabilities of intervals**, use :math:`P(a < Z < b) = \Phi(b) - \Phi(a)`
* Because the standard normal distribution is symmetric **around zero**, we have
an additional tool: :math:`\Phi(-z) = 1 - \Phi(z)` (:numref:`z-symmetry-property`).
.. _z-symmetry-property:
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/phi-symmetry.png
:alt: Special property of standard normal CDF due to symmetry around 0
:align: center
:width: 40%
Due to symmetry around zero, the two grey regions have equal probability.
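A one-line check of this symmetry property in R, using an arbitrary :math:`z = 1.96`:

.. code-block:: r

   z <- 1.96
   pnorm(-z)     # 0.025
   1 - pnorm(z)  # also 0.025, since Phi(-z) = 1 - Phi(z)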
Forward Problems
~~~~~~~~~~~~~~~~~~~
When a problem gives a value and asks for a related probability, we call it a
**forward problem**. The systematic approach is:
1. **Identify** what probability you need to calculate in correct probability notation.
**Sketch** the region on a normal PDF plot if needed.
2. **Standardize** by converting x-values to z-scores using :math:`z = \frac{x-\mu}{\sigma}`.
3. **Modify the probability statement** to an expression involving :math:`P(Z \leq z)` only
so the Z-table can be used directly.
4. **Round the z-score** to two decimal places and look it up in the table.
5. **Write your conclusion** in the context of the original problem.
.. admonition:: Example💡: Systolic Blood Pressure
:class: note
Systolic blood pressure readings for healthy adults, in mmHg, follow a normal distribution
with :math:`\mu=112` and :math:`\sigma^2= 100`. Find the probability that a randomly
selected adult has blood pressure between 90 and 134 mmHg.
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/forward-example-sketch.png
:alt: Sketch of the probability P(90 < X < 134).
:align: right
:figwidth: 30%
A sketch of :math:`P(90 < X < 134)`
* **Step 1: Write the random variable and its distribution in correct notation**
Let :math:`X` be the blood pressure readings for healthy adults. :math:`X \sim N(\mu=112, \sigma^2=100)`.
* **Step 2: Find the correct probability statement**
We are looking for
.. math:: P(90 < X < 134) = P(X < 134) - P(X < 90).
We need to find :math:`z_1` and :math:`z_2` such that :math:`P(X < 134) = P(Z< z_2)` and
:math:`P(X< 90)=P(Z< z_1)`.
* **Step 3: Standardize to find** :math:`z_1` **and** :math:`z_2`
Note that the spread parameter is given as a variance. We must use :math:`\sigma = \sqrt{100} = 10`
for standardization.
.. math::
z_1 = \frac{90 - 112}{10} = \frac{-22}{10} = -2.2 \text{ and }
z_2 = \frac{134 - 112}{10} = \frac{22}{10} = 2.2
* **Step 4: Convert to standard normal probability**
.. math:: P(90 < X < 134) = P(Z< z_2) - P(Z< z_1) = \Phi(2.2) - \Phi(-2.2)
* **Step 5: Use symmetry to simplify**
We can look up the CDF values for :math:`z=2.2` and :math:`z=-2.2`
separately in the Z-table, but when the two z values are negatives of each other,
we can simplify the search step using :math:`\Phi(-2.2) = 1 - \Phi(2.2)`.
.. math:: P(-2.2 < Z < 2.2) = \Phi(2.2) - (1 - \Phi(2.2)) = 2\Phi(2.2) - 1
* **Step 6: Look up in the Z-table and calculate the final answer**
From the standard normal table: :math:`\Phi(2.2) = 0.9861`. So finally,
.. math:: P(90 < X < 134) = 2(0.9861) - 1 = 0.9722
There is approximately a 0.9722 probability that a randomly selected healthy adult
will have systolic blood pressure between 90 and 134 mmHg.
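For comparison, a direct computation with R's ``pnorm()`` (no rounding of z-scores)
gives the same answer:

.. code-block:: r

   pnorm(134, mean = 112, sd = 10) - pnorm(90, mean = 112, sd = 10)
   # 0.9722 (to four decimal places)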
Backward Problems: Probability to :math:`x` (Percentile)
----------------------------------------------------------
Backward problems reverse the process: given a probability, we must find the corresponding value (percentile).
Walkthrough of a Backward Problem
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Consider a typical backward question:
* The gas price on a fixed date in State A follows normal distribution with
mean $3.30 and standard deviation $0.12. If Gas Station B has a price higher than 63% of all gas stations
in the state that day, what is the gas price in Gas Station B?
In this problem, a **probability is given (63% or 0.63)**, and we are **asked for the cutoff** whose
lower region has an area of 0.63 (the 63rd percentile).
To solve this type of problem, we begin by setting up the correct probability statement.
.. math:: P(X \leq x_{0.63}) = 0.63.
Standardize to get a probability statement in terms of :math:`Z`:
.. math:: P(Z \leq \frac{x_{0.63}-\mu}{\sigma})=0.63.
The right-hand side of the inequality above fits the definition of the 63rd percentile of a standard normal random variable.
That is,
.. math:: z_{0.63} = \frac{x_{0.63}-\mu}{\sigma}.
We will now look for :math:`z_{0.63}` and convert back to :math:`x_{0.63}` using the above relationship.
To find :math:`z_{0.63}`, we locate 0.63 (or the value closest to it) in the **main body** of the table,
then obtain the :math:`z` **value from its margins**. 0.6293 is the value closest to 0.63 in the main body,
and its margins give us :math:`z_{0.63}=0.33`.
Converting back, :math:`x_{0.63} = \sigma z_{0.63} +\mu = (0.12)(0.33) + 3.3 = 3.3396`.
Finally, the price at Gas Station B is around $3.34.
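The whole backward calculation collapses to one call to R's ``qnorm()`` (covered at the
end of this section):

.. code-block:: r

   qnorm(0.63, mean = 3.30, sd = 0.12)
   # 3.3398; the hand calculation gave 3.3396 because z was rounded to 0.33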
Summary of the Key Steps
~~~~~~~~~~~~~~~~~~~~~~~~~~~
1. Identify the value you need to find using correct probability notation. **Sketch the region**
if needed.
2. **Find the z-score** by looking up the probability in the body of the standard normal table.
3. **Convert the z-score to the original scale** using :math:`x = \mu + \sigma z`.
4. **Write your conclusion** in context.
Points That Require Special Attention
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
* The probability given may correspond to an **upper region** rather than a lower one.
Since percentiles are always based on the area in the lower region, you need to adjust accordingly.
For example, if Gas Station C has a price lower than 23% of all other stations in the state,
its price corresponds to the (100 − 23) = 77th percentile.
* If the given probability does not have an exact match in the table, take the z-value for the **closest entry**.
If it is exactly in the middle of two values, **take the average between the z-values of the two entries**.
.. admonition:: Example💡: Systolic Blood Pressure, Continued
:class: note
Continue with the RV of blood pressure measurements: :math:`X \sim N(112, 100)`.
1. Find the 95th percentile.
We want to find :math:`x_{0.95}` such that :math:`P(X \leq x_{0.95}) = 0.95`.
First, find :math:`z_{0.95}` such that :math:`\Phi(z_{0.95}) = 0.95`.
Searching the body of the standard normal table for 0.95, we find it's between 0.9495 and 0.9505.
Since 0.95 is exactly halfway between these values, we average the corresponding z-values:
.. math:: z_{0.95} = \frac{1.64 + 1.65}{2} = 1.645.
Converting to the original scale,
:math:`x_{0.95} = \mu + \sigma z_{0.95} = 112 + 10(1.645) = 128.45`.
*Conclusion*: The 95th percentile of systolic blood pressure is 128.45 mmHg.
This means 95% of healthy adults have blood pressure at or below this value.
2. Find the cutoffs for the middle 50% of blood pressure measurements.
Using the cutoffs, also compute the interquartile range.
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/backward-example-sketch.png
:alt: Sketch of the cutoffs for middle 50%
:align: right
:figwidth: 30%
A sketch of problem 2
We need to find two cutoffs: the 25th percentile and the 75th percentile.
*For the 25th percentile*:
- :math:`\Phi(z_{0.25}) = 0.25`
- From the table (using symmetry): :math:`z_{0.25} = -0.67`
- :math:`x_{0.25} = 112 + 10(-0.67) = 105.3` mmHg
*For the 75th percentile*:
- :math:`\Phi(z_{0.75}) = 0.75`
- From the table: :math:`z_{0.75} = 0.67`
- :math:`x_{0.75} = 112 + 10(0.67) = 118.7` mmHg
*Conclusion*: The middle 50% of systolic blood pressure readings fall between 105.3 and 118.7 mmHg.
The interquartile range is :math:`118.7 - 105.3 = 13.4` mmHg.
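Both parts of this example can be verified with ``qnorm()``:

.. code-block:: r

   qnorm(0.95, mean = 112, sd = 10)  # 128.45, the 95th percentile
   q <- qnorm(c(0.25, 0.75), mean = 112, sd = 10)
   q                                 # 105.26 and 118.74 (table-rounded: 105.3, 118.7)
   diff(q)                           # IQR of about 13.49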
Proving the Theoretical Properties of the Normal Distribution
----------------------------------------------------------------------
Validity of the PDF
~~~~~~~~~~~~~~~~~~~~~~
To establish that the normal PDF is legitimate, we must verify that it satisfies the two fundamental
requirements for any probability density function.
Property 1: Non-Negativity
^^^^^^^^^^^^^^^^^^^^^^^^^^^
We need to show that :math:`f_X(x) \geq 0` for all :math:`x`.
Since :math:`\sigma > 0`, we have :math:`\frac{1}{\sigma\sqrt{2\pi}} > 0`. The exponential function :math:`e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}` is always positive because:
- The exponent :math:`-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2` is always negative (or zero).
- :math:`e^{\text{negative number}}` is always positive.
- :math:`e^0 = 1 > 0`.
Therefore, :math:`f_X(x) > 0` for all :math:`x \in \mathbb{R}`. ✓
Property 2: Integration to Unity
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
We must prove that :math:`\int_{-\infty}^{\infty} f_X(x) \, dx = 1`.
**Step 1: Change of Variables**
Let :math:`z = \frac{x - \mu}{\sigma}`, so :math:`x = \sigma z + \mu` and :math:`dx = \sigma \, dz`.
:math:`z = -\infty` when :math:`x = -\infty`, and :math:`z = +\infty` when :math:`x = +\infty`.
The integral becomes:
.. math::
I = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} dz
**Step 2: The Squaring Trick**
This integral has no elementary antiderivative, so we use a clever approach. Let's compute :math:`I^2`:
.. math::
I^2 = \left(\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} dz\right)\left(\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{v^2}{2}} dv\right)
Since the integrals converge absolutely, we can rewrite this as a double integral:
.. math::
I^2 = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \frac{1}{2\pi} e^{-\frac{1}{2}(z^2 + v^2)} \, dz \, dv
**Step 3: Polar Coordinate Transformation**
Let:
- :math:`z = r\cos\theta`
- :math:`v = r\sin\theta`
- :math:`z^2 + v^2 = r^2`
- :math:`dz \, dv = r \, dr \, d\theta`
The integration limits become:
- :math:`r`: from 0 to :math:`\infty`
- :math:`\theta`: from 0 to :math:`2\pi`
Therefore:
.. math::
I^2 = \int_0^{2\pi} \int_0^{\infty} \frac{1}{2\pi} e^{-\frac{r^2}{2}} \cdot r \, dr \, d\theta
**Step 4: Separating the Integrals**
.. math::
I^2 = \frac{1}{2\pi} \int_0^{2\pi} d\theta \int_0^{\infty} r e^{-\frac{r^2}{2}} dr
The first integral gives us :math:`2\pi`. For the second integral, use substitution :math:`u = \frac{r^2}{2}`, so :math:`du = r \, dr`:
.. math::
\int_0^{\infty} r e^{-\frac{r^2}{2}} dr = \int_0^{\infty} e^{-u} du = \left[-e^{-u}\right]_0^{\infty} = 0 - (-1) = 1
**Step 5: Final Result**
.. math::
I^2 = \frac{1}{2\pi} \cdot 2\pi \cdot 1 = 1
Since :math:`I > 0` (the integrand is positive), we have :math:`I = 1`. ✓
This completes the proof that the normal PDF is a valid probability density function.
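R's numerical integrator offers a quick empirical confirmation; the parameters below
are arbitrary:

.. code-block:: r

   # Integrate the normal PDF over the real line; result is 1 up to numerical error
   integrate(dnorm, -Inf, Inf, mean = 5, sd = 2)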
The Parameter Relationships: Expected Value and Variance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
To complete our theoretical understanding, we must prove that the parameters :math:`\mu` and :math:`\sigma^2` are indeed the mean and variance of the distribution.
Theorem: The Expected Value is μ
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For :math:`X \sim N(\mu, \sigma)`, :math:`E[X] = \mu`.
*Proof:*
.. math::
E[X] = \int_{-\infty}^{\infty} x \cdot \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx
Using the standardization substitution :math:`z = \frac{x-\mu}{\sigma}`, we have :math:`x = \sigma z + \mu` and :math:`dx = \sigma \, dz`.
.. math::
E[X] = \int_{-\infty}^{\infty} (\sigma z + \mu) \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} dz
Distributing the integral,
.. math::
E[X] = \sigma \int_{-\infty}^{\infty} z \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} dz + \mu \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} dz
The second integral equals 1 since it's the integral of the standard normal PDF.
For the first integral, note that :math:`z \phi(z)` is an odd function, and we're integrating over a symmetric interval, so:
.. math::
\int_{-\infty}^{\infty} z \cdot \phi(z) \, dz = 0
Therefore, :math:`E[X] = \sigma \cdot 0 + \mu \cdot 1 = \mu`. ✓
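A numerical check of this theorem, with arbitrary illustrative parameters:

.. code-block:: r

   mu <- 5; sigma <- 2  # arbitrary values
   integrate(function(x) x * dnorm(x, mu, sigma), -Inf, Inf)$value  # returns mu = 5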
Theorem: The Variance is :math:`\sigma^2`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
For :math:`X \sim N(\mu, \sigma)`, :math:`\text{Var}(X) = \sigma^2` .
*Proof:*
Using the standardization :math:`Z = \frac{X-\mu}{\sigma}`, we know that :math:`X = \sigma Z + \mu`. By the properties of variance:
.. math::
\text{Var}(X) = \text{Var}(\sigma Z + \mu) = \sigma^2 \text{Var}(Z)
So we need to show that :math:`\text{Var}(Z) = 1` for the standard normal.
.. math::
\text{Var}(Z) = E[Z^2] - (E[Z])^2 = E[Z^2] - 0^2 = E[Z^2]
.. math::
E[Z^2] = \int_{-\infty}^{\infty} z^2 \cdot \frac{1}{\sqrt{2\pi}} e^{-\frac{z^2}{2}} dz
Using integration by parts with :math:`u = z` and :math:`dv = z e^{-\frac{z^2}{2}} dz`,
we have :math:`du = dz` and :math:`v = -e^{-\frac{z^2}{2}}`. Then:
.. math::
\int z^2 e^{-\frac{z^2}{2}} dz = z(-e^{-\frac{z^2}{2}}) - \int (-e^{-\frac{z^2}{2}}) dz = -ze^{-\frac{z^2}{2}} + \int e^{-\frac{z^2}{2}} dz
The boundary term :math:`\left[-ze^{-\frac{z^2}{2}}\right]_{-\infty}^{\infty} = 0` since exponential decay dominates linear growth.
Therefore,
.. math::
E[Z^2] = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-\frac{z^2}{2}} dz = \frac{1}{\sqrt{2\pi}} \cdot \sqrt{2\pi} = 1
Thus :math:`\text{Var}(Z) = 1` and :math:`\text{Var}(X) = \sigma^2`. ✓
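And the corresponding numerical check for the variance, again with arbitrary parameters:

.. code-block:: r

   mu <- 5; sigma <- 2  # arbitrary values
   EX2 <- integrate(function(x) x^2 * dnorm(x, mu, sigma), -Inf, Inf)$value
   EX2 - mu^2           # Var(X) = E[X^2] - mu^2 = 4 = sigma^2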
Assessing Normality in Practice: Why It Matters
-----------------------------------------------
In statistical practice, we frequently need to determine whether observed data comes from a normal distribution.
This assessment is crucial because many statistical procedures—confidence intervals, t-tests, ANOVA, and
regression—assume normality or rely on estimators whose sampling distributions are approximately normal.
While we've established the theoretical foundation of the normal distribution, real data is messy.
Heights, weights, test scores, and measurement errors may approximately follow normal patterns,
but we need systematic methods to evaluate how close our data comes to this idealized mathematical model.
The Challenge of Real-World Assessment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Unlike our theoretical examples with known parameters, real data presents several challenges:
- We don't know the true population parameters μ and σ
- Sample sizes are finite, introducing sampling variability
- Real phenomena may deviate from perfect normality in subtle ways
- We need to distinguish between minor departures that don't affect our analyses and serious violations that require different approaches
A Multi-Faceted Approach
~~~~~~~~~~~~~~~~~~~~~~~~~
Assessing normality requires multiple complementary methods because no single approach provides complete information. We combine:
1. **Visual methods** that reveal patterns and deviations at a glance
2. **Numerical checks** that quantify adherence to normal distribution properties
3. **Formal statistical tests** that provide rigorous hypothesis testing frameworks
Visual Assessments for Normality
----------------------------------------------------
A. Histograms with Overlaid Curves
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The most intuitive approach overlays three elements on a histogram of the data:
- The **histogram** itself, showing the actual distribution of observations
- A **kernel density estimate** (smooth red curve) that traces the data's shape without assuming any particular distribution
- A **normal density curve** (blue curve) fitted using the sample mean and standard deviation
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/histogram.png
:alt: Histogram with kernel density and normal curve overlay
:align: center
:figwidth: 70%
Comparing actual data distribution (purple histogram) with its smooth estimate (red) and fitted normal curve (blue)
When data follows a normal distribution, these three elements align closely. Deviations reveal specific patterns:
- **Skewness**: The red curve shifts away from the blue curve
- **Heavy tails**: The red curve extends further than the blue curve
- **Light tails**: The red curve falls short of the blue curve's extent
- **Multimodality**: The red curve shows multiple peaks while the blue curve shows only one
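A minimal base-R sketch of the overlay plot described above, using simulated stand-in
data (substitute your own sample for ``x``):

.. code-block:: r

   set.seed(42)
   x <- rnorm(200, mean = 50, sd = 10)      # stand-in data
   hist(x, freq = FALSE, col = "thistle",
        main = "Histogram with density overlays")
   lines(density(x), col = "red", lwd = 2)  # kernel density estimate
   m <- mean(x); s <- sd(x)
   curve(dnorm(x, mean = m, sd = s),        # fitted normal curve
         add = TRUE, col = "blue", lwd = 2)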
B. Normal Probability Plots: A Sophisticated Diagnostic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Normal probability plots (also called QQ-plots for "quantile-quantile plots") provide a more sensitive method for
detecting departures from normality. These plots directly compare the quantiles of our data with the quantiles we
would expect if the data truly came from a normal distribution.
Steps of Constructing a QQ-Plot
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1. Order the Data
Arrange the n observations from smallest to largest: :math:`x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}`
2. Assign Theoretical Probabilities
Each ordered observation :math:`x_{(i)}` represents approximately the :math:`\frac{i-0.5}{n}`
quantile of the data distribution. The adjustment of :math:`-0.5` centers each data point within
its expected quantile interval, providing more accurate comparisons.
3. Find Corresponding Normal Quantiles
For each probability :math:`p_i = \frac{i-0.5}{n}`, find the z-value :math:`z_i` such that
:math:`\Phi(z_i) = p_i`. These are the theoretical quantiles we would expect if the data came
from a standard normal distribution.
4. Create the Plot
Plot the ordered data values :math:`x_{(i)}` (y-axis) against the theoretical quantiles :math:`z_i` (x-axis).
5. Add a Reference Line
The reference line :math:`y = \bar{x} + s \cdot z` shows where points would fall if the data
perfectly matched a normal distribution with the sample's mean and standard deviation.
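The five steps above translate directly into a few lines of R; the sample here is
simulated for illustration:

.. code-block:: r

   set.seed(7)
   x <- rnorm(50, mean = 100, sd = 15)  # illustrative sample
   n <- length(x)
   p <- (seq_len(n) - 0.5) / n          # step 2: probabilities (i - 0.5)/n
   z <- qnorm(p)                        # step 3: theoretical normal quantiles
   plot(z, sort(x),                     # steps 1 and 4: ordered data vs. z
        xlab = "Theoretical quantiles", ylab = "Ordered data")
   abline(a = mean(x), b = sd(x))       # step 5: reference line y = xbar + s*z
   # qqnorm(x) and qqline(x) automate a similar construction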
Interpreting QQ-Plots
^^^^^^^^^^^^^^^^^^^^^^^^
The power of QQ-plots lies in how different departures from normality create characteristic patterns.
**Perfect Normality**: Points fall exactly on the reference line.
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/qq-normal.png
:alt: QQ-plot showing perfectly normal data
:align: center
:figwidth: 80%
Normal probability plot for normal data
**Long Tails**: Points begin below the line but curve above it for larger values.
- Data has more extreme values than a normal distribution would predict
- The lower tail extends further left, upper tail extends further right
- Common in financial data, measurement errors with occasional large mistakes
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/qq-long-tail.png
:alt: QQ-plot showing long-tailed data
:align: center
:figwidth: 80%
Normal probability plot for long-tailed data
**Short Tails**: Points begin above the line but curve below it for larger values.
- Data is more concentrated around the center than normal
- Fewer extreme values than expected
- Sometimes seen in truncated or bounded measurements
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/qq-short-tail.png
:alt: QQ-plot showing short-tailed data
:align: center
:figwidth: 80%
Normal probability plot for short-tailed data
**Right (Positive) Skewness**: Concave-up curve
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/qq-right-skew.png
:alt: QQ-plot showing right-skewed data
:align: center
:figwidth: 80%
Normal probability plot for right-skewed data
**Left (Negative) Skewness**: Concave-down curve
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/qq-left-skew.png
:alt: QQ-plot showing left-skewed data
:align: center
:figwidth: 80%
Normal probability plot for left-skewed data
**Bimodality**: S-shaped curve with plateaus
- Points cluster in the middle region of the plot
- Suggests the data might come from a mixture of two populations
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter6/qq-bimodal.png
:alt: QQ-plot showing bimodal data
:align: center
:figwidth: 80%
Normal probability plot for bimodal data
Numerical Assessments for Normality
------------------------------------------------
While visual methods provide intuitive insights, numerical methods offer precise,
quantifiable assessments of normality.
A. The Empirical Rule in Reverse
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Instead of using the 68-95-99.7 rule to predict probabilities, we can use it in reverse to check whether our data behaves as a normal distribution should:
For truly normal data,
- Approximately **68%** of observations should fall within one standard deviation: :math:`\bar{x} \pm s`.
- Approximately **95%** should fall within two standard deviations: :math:`\bar{x} \pm 2s`.
- Approximately **99.7%** should fall within three standard deviations: :math:`\bar{x} \pm 3s`.
Implementation Steps
^^^^^^^^^^^^^^^^^^^^^^
1. Calculate the sample mean :math:`\bar{x}` and sample standard deviation :math:`s`.
2. Count observations within each interval.
3. Compare observed proportions to expected proportions (0.68, 0.95, 0.997).
4. Large deviations suggest non-normality.
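A sketch of these steps in R, with simulated data standing in for a real sample:

.. code-block:: r

   set.seed(3)
   x <- rnorm(500, mean = 50, sd = 10)  # replace with your own data
   xbar <- mean(x); s <- sd(x)
   sapply(1:3, function(k) mean(abs(x - xbar) <= k * s))
   # compare the three proportions to 0.68, 0.95, and 0.997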
B. The IQR-to-Standard Deviation Ratio
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
For normal distributions, there's a consistent relationship between the interquartile range and the
standard deviation. This relationship arises from the fixed positions of quantiles in any normal distribution.
For any normal distribution :math:`N(\mu, \sigma)`:
- The first quartile (25th percentile) occurs at :math:`\mu - 0.674\sigma`.
- The third quartile (75th percentile) occurs at :math:`\mu + 0.674\sigma`.
- Therefore: :math:`IQR = Q_3 - Q_1 = 1.348\sigma`.
- The ratio :math:`\frac{IQR}{\sigma} \approx 1.35` (often rounded to 1.4).
Implementation Steps
^^^^^^^^^^^^^^^^^^^^^^
1. Calculate the sample IQR and sample standard deviation :math:`s`.
2. Compute the ratio :math:`\frac{IQR}{s}`.
3. Values close to 1.35 suggest normality.
4. Values substantially different indicate departures from normality.
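In R, this check is a one-liner once the data are loaded; simulated data are used
here for illustration:

.. code-block:: r

   set.seed(4)
   x <- rnorm(500, mean = 50, sd = 10)  # replace with your own data
   IQR(x) / sd(x)                       # near 1.35 for approximately normal data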
Formal Statistical Tests for Assessing Normality
------------------------------------------------------------
While visual and numerical methods provide insights, formal statistical tests
such as the Shapiro-Wilk test and the Kolmogorov-Smirnov test offer rigorous
frameworks for hypothesis testing about normality. These tests are covered in more advanced statistics courses.
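Although the theory behind these tests is beyond our scope, running them in R is
straightforward; a minimal sketch on simulated data:

.. code-block:: r

   set.seed(5)
   x <- rnorm(100, mean = 50, sd = 10)  # illustrative sample
   shapiro.test(x)                      # Shapiro-Wilk test of normality
   # Kolmogorov-Smirnov test; parameters are estimated from the data,
   # so the p-value is only approximate
   ks.test(x, "pnorm", mean(x), sd(x))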
Integrating Multiple Assessment Methods
--------------------------------------------
The most robust approach to assessing normality combines multiple methods, as each has strengths and limitations:
**A Systematic Approach:**
1. **Start with visual inspection**: Histograms and QQ-plots reveal the nature of any departures
2. **Apply numerical checks**: Empirical rule and IQR ratios quantify the degree of departure
3. **Consider formal tests if needed**: Particularly useful for borderline cases or when documentation is required (outside scope of course)
4. **Make an informed decision**: Balance statistical evidence with practical considerations
**Decision Guidelines:**
- **Strong evidence for normality**: Visual plots look good, numerical checks align with expectations, and formal tests (if used) agree
- **Mild departures**: Some visual deviation but numerical checks are reasonable; consider robust methods or transformations
- **Clear violations**: Multiple methods indicate serious departures; use non-parametric methods or appropriate transformations
**Remember the Context:**
The decision about whether data is "normal enough" depends on your intended analysis:
- Some procedures are robust to mild departures from normality
- Others require strict normality assumptions
- Large sample sizes often mitigate the impact of non-normality through the Central Limit Theorem
Bringing It All Together
--------------------------------------------------
.. admonition:: Key Takeaways 📝
:class: important
1. The **normal distribution** emerged from Gauss's work on measurement errors and has
become the most important continuous distribution in statistics.
2. The **PDF** :math:`f_X(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}`
is completely determined by two parameters: location :math:`\mu` and scale :math:`\sigma`.
3. **All normal distributions** are symmetric, unimodal, and bell-shaped, with inflection
points at :math:`\mu \pm \sigma`.
4. The **empirical rule** (68-95-99.7) provides quick probability estimates and applies to every
normal distribution regardless of parameters.
5. **Mathematical rigor**: We proved the normal PDF is valid through polar coordinate integration and
confirmed that :math:`E[X] = \mu` and :math:`\text{Var}(X) = \sigma^2`.
6. The **standard normal** :math:`N(0,1)` serves as the foundation for all normal computations through the
standardization transformation :math:`Z = \frac{X-\mu}{\sigma}`.
7. **Assessing normality** requires multiple approaches: visual methods (histograms, QQ-plots),
numerical checks (empirical rule, IQR ratios), and formal tests (Shapiro-Wilk, Kolmogorov-Smirnov variants).
Exercises
~~~~~~~~~~~~~~~~
1. **Empirical Rule Applications**: A normal distribution has :math:`\mu = 100` and :math:`\sigma = 15`.
a) Find the intervals containing approximately 68%, 95%, and 99.7% of the probability
b) What percentage of values fall between 85 and 130?
c) Values beyond what points would be considered unusual (more than 2 standard deviations from the mean)?
2. **Standardization Practice**: For :math:`X \sim N(25, 16)`, find the standardized values corresponding to:
a) :math:`x = 25`
b) :math:`x = 29`
c) :math:`x = 17`
d) What do these z-values tell you about the original x-values?
3. **Parameter Estimation**: If you know that a normal distribution has its inflection points at :math:`x = 12`
and :math:`x = 18`, determine :math:`\mu` and :math:`\sigma`.
Working with Normal Distributions in R
---------------------------------------
R provides comprehensive functions for working with normal distributions, eliminating the need for standard normal tables while offering powerful visualization and analysis capabilities.
**The Four Essential R Functions**
R follows its standard naming convention for the normal distribution:
- **rnorm()**: Generates random samples from a normal distribution
- **dnorm()**: Calculates the probability density function (PDF values)
- **pnorm()**: Calculates the cumulative distribution function (CDF values)
- **qnorm()**: Finds quantiles (percentiles)
**Basic Function Usage**
.. code-block:: r
# Generate random samples from N(100, 15²)
set.seed(123)
random_values <- rnorm(n = 10, mean = 100, sd = 15)
print(round(random_values, 1))
# Calculate density at x = 105 for N(100, 15²)
density_at_105 <- dnorm(x = 105, mean = 100, sd = 15)
print(paste("Density at x = 105:", round(density_at_105, 6)))
# Calculate P(X ≤ 110) for N(100, 15²)
prob_less_than_110 <- pnorm(q = 110, mean = 100, sd = 15)
print(paste("P(X ≤ 110) =", round(prob_less_than_110, 4)))
# Find the 90th percentile
percentile_90 <- qnorm(p = 0.9, mean = 100, sd = 15)
print(paste("90th percentile:", round(percentile_90, 2)))
**Standardization and Z-scores**
.. code-block:: r
# Working with standardization
x_val <- 85
mu <- 100
sigma <- 15
# Calculate z-score manually
z_score <- (x_val - mu) / sigma
print(paste("Z-score for x =", x_val, "is", round(z_score, 2)))
# Use standard normal distribution
prob_standard <- pnorm(z_score) # defaults to mean=0, sd=1
prob_original <- pnorm(x_val, mean = mu, sd = sigma)
print(paste("P(Z ≤", round(z_score, 2), ") =", round(prob_standard, 4)))
print(paste("P(X ≤", x_val, ") =", round(prob_original, 4)))
print(paste("Results match:", prob_standard == prob_original))
**Computing Various Probabilities**
.. code-block:: r
# For X ~ N(50, 10²), calculate different probability types
# P(X > 60)
prob_greater_60 <- pnorm(60, mean = 50, sd = 10, lower.tail = FALSE)
# Alternative: 1 - pnorm(60, mean = 50, sd = 10)
# P(45 ≤ X ≤ 65)
prob_between <- pnorm(65, mean = 50, sd = 10) - pnorm(45, mean = 50, sd = 10)
# P(|X - 50| > 20) = P(X < 30 or X > 70)
prob_extreme <- pnorm(30, mean = 50, sd = 10) +
pnorm(70, mean = 50, sd = 10, lower.tail = FALSE)
print(paste("P(X > 60) =", round(prob_greater_60, 4)))
print(paste("P(45 ≤ X ≤ 65) =", round(prob_between, 4)))
print(paste("P(|X - 50| > 20) =", round(prob_extreme, 4)))
# Verify empirical rule
within_1sd <- pnorm(60, 50, 10) - pnorm(40, 50, 10)
within_2sd <- pnorm(70, 50, 10) - pnorm(30, 50, 10)
within_3sd <- pnorm(80, 50, 10) - pnorm(20, 50, 10)
print(paste("Within 1 SD:", round(within_1sd, 4), "(should be ~0.68)"))
print(paste("Within 2 SD:", round(within_2sd, 4), "(should be ~0.95)"))
print(paste("Within 3 SD:", round(within_3sd, 4), "(should be ~0.997)"))
**Visualizing Normal Distributions**
.. code-block:: r
library(ggplot2)
# Compare different normal distributions
x_vals <- seq(-10, 110, by = 0.5)
# Create data for multiple distributions
plot_data <- data.frame(
x = rep(x_vals, 3),
density = c(dnorm(x_vals, mean = 50, sd = 5),
dnorm(x_vals, mean = 50, sd = 10),
dnorm(x_vals, mean = 50, sd = 15)),
distribution = rep(c("N(50, 5²)", "N(50, 10²)", "N(50, 15²)"),
each = length(x_vals))
)
# Plot PDFs
ggplot(plot_data, aes(x = x, y = density, color = distribution)) +
geom_line(size = 1.2) +
labs(title = "Normal Distributions with Different Standard Deviations",
x = "x", y = "Density", color = "Distribution") +
theme_minimal() +
geom_vline(xintercept = 50, linetype = "dashed", alpha = 0.5)
# Demonstrate empirical rule visually
x_range <- seq(20, 80, by = 0.1)
pdf_vals <- dnorm(x_range, mean = 50, sd = 10)
empirical_data <- data.frame(x = x_range, y = pdf_vals)
ggplot(empirical_data, aes(x = x, y = y)) +
geom_line(size = 1.2, color = "blue") +
geom_area(data = subset(empirical_data, x >= 40 & x <= 60),
aes(x = x, y = y), fill = "lightblue", alpha = 0.7) +
geom_area(data = subset(empirical_data, x >= 30 & x <= 70),
aes(x = x, y = y), fill = "lightgreen", alpha = 0.3) +
labs(title = "Empirical Rule: N(50, 10²)",
x = "x", y = "Density") +
theme_minimal() +
annotate("text", x = 50, y = 0.025, label = "68% within 1 SD",
color = "blue", fontface = "bold") +
annotate("text", x = 50, y = 0.015, label = "95% within 2 SD",
color = "darkgreen", fontface = "bold")
**Applied Example: Quality Control**
.. code-block:: r
# Manufacturing process produces items with N(500, 25²) grams
target_weight <- 500
tolerance <- 25
# Specification limits: 450g to 550g
lower_spec <- 450
upper_spec <- 550
# Calculate proportion within specifications
prob_in_spec <- pnorm(upper_spec, target_weight, tolerance) -
pnorm(lower_spec, target_weight, tolerance)
# Calculate defect rates
prob_underweight <- pnorm(lower_spec, target_weight, tolerance)
prob_overweight <- pnorm(upper_spec, target_weight, tolerance, lower.tail = FALSE)
print(paste("Proportion meeting specifications:", round(prob_in_spec, 4)))
print(paste("Underweight defect rate:", round(prob_underweight, 4)))
print(paste("Overweight defect rate:", round(prob_overweight, 4)))
# Find control limits that capture 99% of production
control_lower <- qnorm(0.005, target_weight, tolerance)
control_upper <- qnorm(0.995, target_weight, tolerance)
print(paste("99% control limits:", round(control_lower, 1), "to", round(control_upper, 1)))
# Simulate production and analyze
set.seed(456)
production_sample <- rnorm(n = 1000, mean = target_weight, sd = tolerance)
# Check empirical vs theoretical statistics
sample_mean <- mean(production_sample)
sample_sd <- sd(production_sample)
print(paste("Sample mean:", round(sample_mean, 2),
"(theoretical:", target_weight, ")"))
print(paste("Sample SD:", round(sample_sd, 2),
"(theoretical:", tolerance, ")"))
# Visualize production data
production_data <- data.frame(weight = production_sample)
ggplot(production_data, aes(x = weight)) +
geom_histogram(bins = 30, fill = "lightblue", alpha = 0.7, aes(y = ..density..)) +
stat_function(fun = dnorm, args = list(mean = target_weight, sd = tolerance),
color = "red", size = 1.2) +
geom_vline(xintercept = c(lower_spec, upper_spec),
color = "darkgreen", linetype = "dashed", size = 1) +
labs(title = "Production Weight Distribution vs Normal Model",
x = "Weight (grams)", y = "Density") +
theme_minimal() +
annotate("text", x = 475, y = 0.012, label = "Specification\nLimits",
color = "darkgreen", fontface = "bold")