The general form: task prefix + name of the distribution

In R, functions involving known distributions are generally named with a single-letter prefix representing the desired task, followed by the name of the distribution.

For example, try running:

rnorm(10, mean=2, sd=0.1)
##  [1] 2.057324 2.107380 2.051944 1.972280 1.906241 1.974466 2.053314 2.181800
##  [9] 1.944498 1.939041

Above, r represents the task of random sampling, and norm represents the normal distribution. rnorm(10, mean=2, sd=0.1) creates a length-10 vector filled with random observations from a normal distribution with \(\mu=2\) and \(\sigma=0.1\). When mean and sd are not specified, rnorm assumes the standard normal distribution (\(\mu=0\), \(\sigma=1\)).
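To check the default behaviour, the following minimal sketch compares a call that omits mean and sd with one that spells them out. It relies on set.seed, which is covered later in this tutorial under "Using a seed"; the seed value 1 is an arbitrary choice for illustration.

```r
# With the same seed, omitting mean and sd gives the same draws
# as explicitly requesting mean = 0 and sd = 1.
set.seed(1)                       # arbitrary seed, for illustration only
a <- rnorm(3)
set.seed(1)
b <- rnorm(3, mean = 0, sd = 1)
identical(a, b)                   # TRUE
```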

Other prefixes

Continue to suppose that we are working with a random variable \(X\) following normal distribution with \(\mu=2\) and \(\sigma=0.1\).

“d” for density (pdf)

d+name of distribution gives the evaluation of the pmf or the pdf at the given input value. For example,

dnorm(2, mean=2, sd=0.1)
## [1] 3.989423

is the evaluation of the pdf of \(X\) at \(x=2\).
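As a sanity check, the value returned by dnorm can be reproduced from the normal pdf formula. This is only a sketch; the variable names mu, sigma, x and by_hand are introduced here for illustration.

```r
mu <- 2
sigma <- 0.1
x <- 2

# Normal pdf written out by hand:
# f(x) = 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
by_hand <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))

by_hand
dnorm(x, mean = mu, sd = sigma)
all.equal(by_hand, dnorm(x, mean = mu, sd = sigma))   # TRUE
```

Note that the result is larger than 1. This is not an error: a density is not a probability, and only the area under the curve is bounded by 1.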

“p” for cumulative probability (cdf)

p+name of distribution computes the cumulative probability \(P(X \leq \text{input})\).

pnorm(2, mean=2, sd=0.1)
## [1] 0.5

Remember that the p+name of distribution functions are always defined using the “less than or equal to” sign, \(\leq\). This matters for discrete distributions, where \(P(X < x)\) and \(P(X \leq x)\) generally differ. For an integer-valued \(X\) and integer \(x\), for instance, \(P(X < x) = P(X \leq x-1)\).
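The effect of the \(\leq\) convention can be seen with a binomial example. The distribution and its parameters (size = 10, prob = 0.5) are chosen only for illustration and are not part of the assignment.

```r
# X ~ Binomial(size = 10, prob = 0.5); values chosen only for illustration
size <- 10
prob <- 0.5

p_le_3 <- pbinom(3, size = size, prob = prob)       # P(X <= 3)
p_lt_3 <- pbinom(2, size = size, prob = prob)       # P(X < 3)  = P(X <= 2)
p_eq_3 <- dbinom(3, size = size, prob = prob)       # P(X = 3)
p_ge_3 <- 1 - pbinom(2, size = size, prob = prob)   # P(X >= 3) = 1 - P(X <= 2)

# The gap between "<=" and "<" is exactly the probability at the boundary.
all.equal(p_le_3 - p_lt_3, p_eq_3)                  # TRUE
```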

“q” for quantile

q+name of distribution computes the quantile \(x\) satisfying \(\text{input} = P(X \leq x)\).

qnorm(0.5, mean=2, sd=0.1)
## [1] 2
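Because the quantile function inverts the cdf for a continuous distribution, applying qnorm to the output of pnorm should return the original input. A short sketch, with the input 2.1 chosen arbitrarily:

```r
mu <- 2
sigma <- 0.1

p <- pnorm(2.1, mean = mu, sd = sigma)   # cumulative probability P(X <= 2.1)
q <- qnorm(p, mean = mu, sd = sigma)     # quantile at that probability

all.equal(q, 2.1)                        # TRUE
```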

“r” for random sample

Finally, the random sampling task is restated for completeness.

rnorm(10, mean=2, sd=0.1)
##  [1] 1.983257 1.853933 1.938687 2.023015 1.957880 2.058344 1.837056 1.841285
##  [9] 2.008765 1.989763

Other distributions

  • Beta: beta
  • Binomial: binom
  • Cauchy: cauchy
  • Chi-square: chisq
  • Exponential: exp
  • F: f
  • Gamma: gamma
  • LogNormal: lnorm
  • Normal: norm
  • Poisson: pois
  • Student’s t: t
  • Uniform: unif
  • Weibull: weibull

Combine one of these names with a task prefix introduced earlier in this tutorial (d, p, q or r) to form the function name. Some examples are

dpois(2, lambda=1) #Poisson
## [1] 0.1839397
qexp(0.2, rate=1) #exponential; rate = lambda
## [1] 0.2231436
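The two results above can be verified against the closed-form expressions for the Poisson pmf and the exponential quantile. The sketch below also shows the same naming pattern for two more distributions; the arguments passed to pbinom and qt are arbitrary illustrative values.

```r
# Poisson pmf at 2 with lambda = 1: exp(-lambda) * lambda^2 / 2!
pois_check <- all.equal(dpois(2, lambda = 1), exp(-1) * 1^2 / factorial(2))

# Exponential quantile: solve 0.2 = 1 - exp(-rate * x) for x
exp_check <- all.equal(qexp(0.2, rate = 1), -log(1 - 0.2))

pois_check   # TRUE
exp_check    # TRUE

# The same prefix + name pattern applies to the other distributions.
pbinom(3, size = 10, prob = 0.5)   # cdf of Binomial(10, 0.5) at 3
qt(0.975, df = 19)                 # 97.5th percentile of Student's t, 19 df
```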

Working with a random sample

All r+name of distribution functions give their output as a vector, which means that we can apply the vector functions learned in CA 1 and CA 2.
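As an illustration, here are a few vector operations applied to a random sample. Which functions were covered in CA 1 and CA 2 is an assumption here; the ones below are standard base R functions. No output is shown because the numbers change on every run.

```r
x <- rnorm(10, mean = 2, sd = 0.1)

length(x)       # number of observations
mean(x)         # sample mean
sd(x)           # sample standard deviation
range(x)        # smallest and largest observation
sort(x)[1:3]    # three smallest observations
x[x > 2]        # observations larger than 2
mean(x > 2)     # proportion of observations larger than 2
```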

Using a seed

In CA 3, all parts related to one distribution must be completed using the same set of observations. This can be tricky because every time you click “run” in the default setting, a new set of numbers is generated. For reproducibility, R allows you to fix the seed of random generation. When a fixed seed is used, the same set of random numbers will be generated over different runs.

Without seed specification

rnorm(5)
## [1] -0.04701438  0.44454431  0.66766416  1.20222598 -0.69045818
rnorm(5)
## [1]  0.1528669 -1.2344644 -0.9206679 -1.8512973 -0.8135921
rnorm(5)
## [1] -0.3770250 -0.9271884  0.8429667  1.2228096  1.0682855

When a seed is fixed

set.seed(1234)
rnorm(5)
## [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247
set.seed(1234)
rnorm(5)
## [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247
set.seed(1234)
rnorm(5)
## [1] -1.2070657  0.2774292  1.0844412 -2.3456977  0.4291247

To use this feature, it suffices to write set.seed(your seed number) at the beginning of your R script.
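Setting the seed once at the top does not make every call return the same numbers. It makes the whole sequence of calls reproducible from run to run. A small sketch, with variable names chosen for illustration:

```r
set.seed(1234)
first  <- rnorm(5)
second <- rnorm(5)        # continues the random stream, so it differs from first

set.seed(1234)            # resetting the seed restarts the same stream
first_again  <- rnorm(5)
second_again <- rnorm(5)

identical(first, first_again)     # TRUE
identical(second, second_again)   # TRUE
identical(first, second)          # FALSE
```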

Completing the assignment

Suppose we would like to work with a sample of size \(n=20\) drawn from a normal distribution with \(\mu=572\) and \(\sigma=51\).

Generate the data

set.seed(12345)
n <- 20
RandomData <- rnorm(n, mean=572, sd=51)
print(RandomData)
##  [1] 601.8620 608.1828 566.4255 548.8716 602.9003 479.2842 604.1350 557.9146
##  [9] 557.5079 525.1146 566.0714 664.6829 590.9020 598.5310 533.7229 613.6619
## [17] 526.7958 555.0895 629.1563 587.2349

Compute the numerical summary

xbar <- mean(RandomData)
xbar
## [1] 575.9024
s <- sd(RandomData)
s
## [1] 42.53071
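For reference, mean and sd correspond to the usual formulas \(\bar{x} = \frac{1}{n}\sum x_i\) and \(s = \sqrt{\frac{1}{n-1}\sum (x_i - \bar{x})^2}\). The following sketch recomputes both by hand; the names xbar_by_hand and s_by_hand are introduced only for this check.

```r
set.seed(12345)
n <- 20
RandomData <- rnorm(n, mean = 572, sd = 51)

# Sample mean: sum of the observations divided by n
xbar_by_hand <- sum(RandomData) / n

# Sample standard deviation: note the n - 1 in the denominator
s_by_hand <- sqrt(sum((RandomData - xbar_by_hand)^2) / (n - 1))

all.equal(xbar_by_hand, mean(RandomData))   # TRUE
all.equal(s_by_hand, sd(RandomData))        # TRUE
```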

Graphically assessing normality

Before beginning this section, make sure that the ggplot2 package is installed. If not, revisit the CA 2a tutorial. If ggplot2 is already installed, load it into the current session by running

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3

A ggplot function only accepts data in the form of a data.frame (R's term for a table). Therefore, we must first reformat the RandomData vector into a single-column data.frame by running

RandomDataDF <- data.frame(RandomData=RandomData)
RandomDataDF
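To confirm that the conversion worked, a few base R functions can be used to inspect the result. This is an optional check, not a required step of the assignment.

```r
set.seed(12345)
RandomData <- rnorm(20, mean = 572, sd = 51)
RandomDataDF <- data.frame(RandomData = RandomData)

is.data.frame(RandomDataDF)   # TRUE
dim(RandomDataDF)             # 20 rows and 1 column
names(RandomDataDF)           # the single column is called "RandomData"
head(RandomDataDF, 3)         # first three rows
```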

Assessing normality through a histogram

Define the number of bins.

number_bins <- max(round(sqrt(n))+2,5)
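The rule takes the rounded square root of the sample size, adds 2, and never goes below 5 bins. The sketch below wraps it in a hypothetical helper, number_bins_for, to show the result for a few sample sizes.

```r
# Hypothetical helper wrapping the bin rule used above
number_bins_for <- function(n) max(round(sqrt(n)) + 2, 5)

# n = 4:   round(2) + 2 = 4, raised to the minimum of 5
# n = 20:  round(4.47) + 2 = 6
# n = 100: 10 + 2 = 12
# n = 400: 20 + 2 = 22
bins <- sapply(c(4, 20, 100, 400), number_bins_for)
bins   # 5 6 12 22
```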

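Draw the histogram. Always include both curves in a histogram: the red curve is a smoothed density estimate of the data (geom_density), and the blue curve is the normal density with mean xbar and standard deviation s (stat_function with dnorm).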

ggplot(RandomDataDF, aes(x=RandomData)) + 
  geom_histogram(aes(y= after_stat(density)), bins = number_bins, 
                 fill="grey", col="black") +
  geom_density(col="red", lwd=1) + 
  stat_function(fun=dnorm, args=list(mean=xbar, sd=s), col="blue", lwd=1) +
  xlab("Data") +
  ylab("Density") +
  ggtitle("Histogram of RandomData")

There are two complementary ways to determine if a histogram looks normal.

  1. Assess whether the blue normal curve is ‘close’ to the red smoothed curve.
  2. Determine whether the histogram itself forms an approximately normal shape based on the symmetry, modality and tails.

It is recommended to always use both methods. In this case, the two curves are similar, and the histogram looks unimodal and approximately symmetric. Therefore, we conclude that this histogram resembles a normal distribution.

Make a normal probability plot

ggplot(RandomDataDF, aes(sample=RandomData)) + 
  stat_qq() +
  geom_abline(slope=s, intercept=xbar) + 
  ggtitle("Normal Probability Plot of RandomData")

Since the data points follow the straight line reasonably well, the randomly generated data does not appear to deviate substantially from a normal distribution.
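The quantities behind the plot can also be computed directly. The sketch below assumes that stat_qq() with its default settings plots the ordered sample against standard normal quantiles at the probabilities given by ppoints(n); the correlation at the end is an informal numerical companion to the visual check, not a formal test.

```r
set.seed(12345)
n <- 20
RandomData <- rnorm(n, mean = 572, sd = 51)
xbar <- mean(RandomData)
s <- sd(RandomData)

theoretical <- qnorm(ppoints(n))   # standard normal quantiles (horizontal axis)
observed <- sort(RandomData)       # ordered sample (vertical axis)

# Reference line drawn by geom_abline: y = xbar + s * z
line_values <- xbar + s * theoretical

# A correlation close to 1 means the points lie close to a straight line.
qq_cor <- cor(theoretical, observed)
qq_cor
```

The slope s and intercept xbar are used because, if the data are normal with mean \(\mu\) and standard deviation \(\sigma\), each observation is approximately \(\mu + \sigma z\) for the matching standard normal quantile \(z\).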

Other distributions

The assignment will ask you to repeat the example above using four other distributions. For all distributions, including normal, you must use the parameter values provided by the assignment, not this tutorial. Use the help function in R to determine what parameters need to be set in R and how to sample from the specified distributions. For example, the help page for the LogNormal distribution is reproduced below.

help(rlnorm)

The Log Normal Distribution

Description

Density, distribution function, quantile function and random generation for the log normal distribution whose logarithm has mean equal to meanlog and standard deviation equal to sdlog.

Usage

dlnorm(x, meanlog = 0, sdlog = 1, log = FALSE)
plnorm(q, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
qlnorm(p, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
rlnorm(n, meanlog = 0, sdlog = 1)

Arguments

x, q

vector of quantiles.

p

vector of probabilities.

n

number of observations. If length(n) > 1, the length is taken to be the number required.

meanlog, sdlog

mean and standard deviation of the distribution on the log scale with default values of 0 and 1 respectively.

log, log.p

logical; if TRUE, probabilities p are given as log(p).

lower.tail

logical; if TRUE (default), probabilities are \(P[X \le x]\), otherwise, \(P[X > x]\).

Details

The log normal distribution has density

\[f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-(\log x - \mu)^2 / (2\sigma^2)}\]

where \(\mu\) and \(\sigma\) are the mean and standard deviation of the logarithm. The mean is \(E(X) = \exp(\mu + \sigma^2/2)\), the median is \(\mathrm{med}(X) = \exp(\mu)\), and the variance is \(\mathrm{Var}(X) = \exp(2\mu + \sigma^2)(\exp(\sigma^2) - 1)\); hence the coefficient of variation is \(\sqrt{\exp(\sigma^2) - 1}\), which is approximately \(\sigma\) when that is small (e.g., \(\sigma < 1/2\)).

Value

dlnorm gives the density, plnorm gives the distribution function, qlnorm gives the quantile function, and rlnorm generates random deviates.

The length of the result is determined by n for rlnorm, and is the maximum of the lengths of the numerical arguments for the other functions.

The numerical arguments other than n are recycled to the length of the result. Only the first elements of the logical arguments are used.

Note

The cumulative hazard \(H(t) = -\log(1 - F(t))\) is -plnorm(t, r, lower = FALSE, log = TRUE).

Source

dlnorm is calculated from the definition (in ‘Details’). [pqr]lnorm are based on the relationship to the normal.

Consequently, they model a single point mass at exp(meanlog) for the boundary case sdlog = 0.

References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.

Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995) Continuous Univariate Distributions, volume 1, chapter 14. Wiley, New York.

See Also

Distributions for other standard distributions, including dnorm for the normal distribution.

Examples

dlnorm(1) == dnorm(0)
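Returning to the tutorial's own workflow: based on the Usage section of the help page, a LogNormal sample can be generated in the same way as the normal sample earlier. The values meanlog = 0 and sdlog = 0.5 and the name LogNormalData are hypothetical placeholders; use the parameter values given in the assignment.

```r
set.seed(12345)
n <- 20
meanlog <- 0     # hypothetical value; replace with the assignment's value
sdlog <- 0.5     # hypothetical value; replace with the assignment's value

LogNormalData <- rlnorm(n, meanlog = meanlog, sdlog = sdlog)

all(LogNormalData > 0)     # LogNormal observations are always positive
mean(log(LogNormalData))   # on the log scale, compare with meanlog
sd(log(LogNormalData))     # on the log scale, compare with sdlog
median(LogNormalData)      # compare with exp(meanlog), the population median
```

Note that meanlog and sdlog describe the logarithm of the data, not the data themselves, which is why the checks above are made on the log scale.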