In R, functions involving known distributions are generally named with a single-letter prefix representing the desired task, followed by the name of the distribution.
For example, try running rnorm(10, mean=2, sd=0.1):
## [1] 2.057324 2.107380 2.051944 1.972280 1.906241 1.974466 2.053314 2.181800
## [9] 1.944498 1.939041
Above, r represents the task of random sampling, and norm represents the normal distribution. rnorm(10, mean=2, sd=0.1) creates a length-10 vector filled with random observations from a normal distribution with \(\mu=2\) and \(\sigma=0.1\). When mean and sd are not specified, rnorm assumes the standard normal distribution.
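As a small check of the default behaviour described above, the sketch below draws three values with and without the mean and sd arguments. The seed value and the variable names are arbitrary choices made for this illustration only.

```r
# The seed is arbitrary; it only makes the two draws comparable.
set.seed(1)
default_draws <- rnorm(3)                      # mean and sd omitted

set.seed(1)
explicit_draws <- rnorm(3, mean = 0, sd = 1)   # standard normal written out

# Both calls use the same parameters, so the draws match exactly.
identical(default_draws, explicit_draws)
```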
Continue to suppose that we are working with a random variable \(X\) following a normal distribution with \(\mu=2\) and \(\sigma=0.1\).
d+name of distribution gives the value of the pmf or the pdf at the given input value. For example,
## [1] 3.989423
is the value of the pdf of \(X\) at \(x=2\).
p+name of distribution computes the cumulative probability \(P(X \leq \text{input})\).
## [1] 0.5
Remember that the p+name of distribution functions are always defined using the “less than or equal to” sign, \(\leq\). This matters for discrete distributions, where \(P(X \leq x)\) and \(P(X < x)\) are generally different.
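To see why the \(\leq\) convention matters for a discrete distribution, here is a minimal sketch using a binomial random variable. The parameters (size 10, probability 0.5) and the variable names are chosen only for illustration and do not come from the assignment.

```r
# X ~ Binomial(size = 10, prob = 0.5); illustrative parameters only.
p_le_3 <- pbinom(3, size = 10, prob = 0.5)      # P(X <= 3)
p_lt_3 <- pbinom(2, size = 10, prob = 0.5)      # P(X < 3) is P(X <= 2)
p_ge_3 <- 1 - pbinom(2, size = 10, prob = 0.5)  # P(X >= 3) is 1 - P(X <= 2)

# The two cumulative probabilities differ by exactly P(X = 3).
p_le_3 - p_lt_3
dbinom(3, size = 10, prob = 0.5)
```

For a continuous distribution such as the normal, \(P(X = x) = 0\), so the distinction has no effect.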
q+name of distribution computes the quantile \(x\) satisfying \(\text{input} = P(X \leq x)\).
## [1] 2
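The calls behind the three outputs above are not shown. The sketch below is one set of calls that reproduces the same values, assuming the input is 2 for the d and p functions and 0.5 for the q function.

```r
# X ~ Normal(mean = 2, sd = 0.1), as in the running example.
# The input values 2 and 0.5 are assumptions that match the printed outputs.
pdf_at_2 <- dnorm(2, mean = 2, sd = 0.1)    # pdf evaluated at x = 2
cdf_at_2 <- pnorm(2, mean = 2, sd = 0.1)    # P(X <= 2)
median_x <- qnorm(0.5, mean = 2, sd = 0.1)  # x such that P(X <= x) = 0.5

pdf_at_2
cdf_at_2
median_x
```

Note that qnorm undoes pnorm: feeding the output of pnorm into qnorm with the same parameters returns the original input.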
beta
binom
cauchy
chisq
exp
f
gamma
lnorm
norm
pois
t
unif
weibull
Combine one of these names with a task prefix (r, d, p or q) introduced above to form the function name. Some examples are
## [1] 0.1839397
## [1] 0.2231436
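The calls that produced the two outputs above are not shown. As an illustration of combining a prefix with a distribution name, the following pair of calls returns the same two values; whether these are the original calls is an assumption.

```r
# d + pois: pmf of a Poisson(lambda = 1) variable evaluated at 2
pois_pmf <- dpois(2, lambda = 1)

# q + exp: the 0.2 quantile of an Exponential(rate = 1) variable
exp_quantile <- qexp(0.2, rate = 1)

pois_pmf
exp_quantile
```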
All r+name of distribution functions return their output as a vector, which means that we can apply the vector functions learned in CA 1 and CA 2.
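For instance, a sample returned by rnorm can be passed straight to the usual vector functions. This is a minimal sketch; the seed and the name x are arbitrary, and the functions shown are common base R choices rather than a list taken from CA 1 or CA 2.

```r
set.seed(10)                        # arbitrary seed for repeatability
x <- rnorm(10, mean = 2, sd = 0.1)  # a numeric vector of length 10

length(x)   # number of observations
mean(x)     # sample mean
sd(x)       # sample standard deviation
sort(x)     # observations in increasing order
x[x > 2]    # only the observations above 2
```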
In CA 3, all parts related to one distribution must be completed using the same set of observations. This can be tricky because every time you click “run” in the default setting, a new set of numbers is generated. For reproducibility, R allows you to fix the seed of random generation. When a fixed seed is used, the same set of random numbers will be generated over different runs.
## [1] -0.04701438 0.44454431 0.66766416 1.20222598 -0.69045818
## [1] 0.1528669 -1.2344644 -0.9206679 -1.8512973 -0.8135921
## [1] -0.3770250 -0.9271884 0.8429667 1.2228096 1.0682855
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
## [1] -1.2070657 0.2774292 1.0844412 -2.3456977 0.4291247
To use this feature, it suffices to write set.seed(your seed number) at the beginning of your R script.
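The behaviour shown in the outputs above can be sketched as follows. The seed value 2024 is a hypothetical choice; any integer works, and the variable names are for illustration only.

```r
# Without a fixed seed, two consecutive draws differ.
unseeded_a <- rnorm(5)
unseeded_b <- rnorm(5)

# Resetting the same seed before each draw gives the same numbers.
set.seed(2024)   # hypothetical seed value
seeded_a <- rnorm(5)
set.seed(2024)
seeded_b <- rnorm(5)

identical(seeded_a, seeded_b)
```

Setting the seed once at the top of a script is enough, provided the whole script is re-run from the top each time.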
Suppose we would like to work with a sample of size \(n=20\) drawn from a normal distribution with \(\mu=572\) and \(\sigma=51\).
## [1] 601.8620 608.1828 566.4255 548.8716 602.9003 479.2842 604.1350 557.9146
## [9] 557.5079 525.1146 566.0714 664.6829 590.9020 598.5310 533.7229 613.6619
## [17] 526.7958 555.0895 629.1563 587.2349
## [1] 575.9024
## [1] 42.53071
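The three outputs above are the sample, its mean, and its standard deviation. A sketch of code that produces output of this form is given below. The names RandomData, xbar and s are the ones used in the plotting code later in this tutorial; the seed is a hypothetical value, so the numbers will not match the ones printed above.

```r
set.seed(2024)   # hypothetical seed; use your own seed number

# Sample of size n = 20 from a normal distribution with mu = 572, sigma = 51
RandomData <- rnorm(20, mean = 572, sd = 51)
RandomData

xbar <- mean(RandomData)   # sample mean
xbar

s <- sd(RandomData)        # sample standard deviation
s
```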
Before beginning this section, make sure that ggplot2 is installed in your copy of R (RStudio). If not, revisit the CA 2a tutorial. If ggplot2 is already installed, then load it into the current session by first running library(ggplot2).
## Warning: package 'ggplot2' was built under R version 4.3.3
A ggplot function only accepts data in the form of a data.frame (R's term for a table). Therefore, we must first reformat the RandomData vector into a data.frame with a single column.
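The conversion can be sketched as follows. The name RandomDataDF is the one used in the plotting code of this tutorial; the first two lines only regenerate a sample (with a hypothetical seed) so that the block runs on its own.

```r
set.seed(2024)                                 # hypothetical seed
RandomData <- rnorm(20, mean = 572, sd = 51)   # sample from the example above

# data.frame() names the column after the vector, so the column is "RandomData".
RandomDataDF <- data.frame(RandomData)

str(RandomDataDF)   # 20 observations of 1 variable
```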
Define the number of bins.
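The rule used for the number of bins is not shown here. As one possible choice, the sketch below uses the square-root rule, which is an assumption and not necessarily the rule required by the assignment. The name number_bins is the one used in the histogram code.

```r
set.seed(2024)                                 # hypothetical seed
RandomData <- rnorm(20, mean = 572, sd = 51)   # sample from the example above

# Square-root rule (an assumption): about sqrt(n) bins, rounded up.
number_bins <- ceiling(sqrt(length(RandomData)))
number_bins
```

Base R also provides nclass.Sturges(), which implements Sturges' rule and could be used in the same place.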
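Draw the histogram. Always include the red and blue curves in a histogram: the red curve is a kernel density estimate of the data, and the blue curve is the normal density with mean xbar and standard deviation s.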
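ggplot(RandomDataDF, aes(x = RandomData)) +
  # histogram on the density scale, so the bar areas sum to 1
  geom_histogram(aes(y = after_stat(density)), bins = number_bins,
                 fill = "grey", col = "black") +
  # red: kernel density estimate of the data
  geom_density(col = "red", lwd = 1) +
  # blue: normal density with the sample mean and standard deviation
  stat_function(fun = dnorm, args = list(mean = xbar, sd = s), col = "blue", lwd = 1) +
  xlab("Data") +
  ylab("Density") +
  ggtitle("Histogram of RandomData")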
There are two complementary ways to determine if a histogram looks normal: compare the red and blue curves, and check whether the histogram is unimodal and approximately symmetric.
It is recommended to always use both methods. In this case, the two curves are similar, and the histogram looks unimodal and approximately symmetric. Therefore, we conclude that this histogram resembles a normal distribution.
ggplot(RandomDataDF, aes(sample = RandomData)) +
  stat_qq() +
  # reference line: sample quantile = xbar + s * (standard normal quantile)
  geom_abline(slope = s, intercept = xbar) +
  ggtitle("Normal Probability Plot of RandomData")
Since the data points follow the straight line reasonably well, the randomly generated data does not appear to deviate substantially from a normal distribution.
The assignment will ask you to repeat the example above using four other distributions. For all distributions, including normal, you must use the parameter values provided by the assignment, not this tutorial. Use the help function in R to determine what parameters need to be set in R and how to sample from the specified distributions. For example, running ?rlnorm displays the following help page for the log normal distribution.
Lognormal | R Documentation |
Density, distribution function, quantile function and random generation for the log normal distribution whose logarithm has mean equal to meanlog and standard deviation equal to sdlog.
dlnorm(x, meanlog = 0, sdlog = 1, log = FALSE)
plnorm(q, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
qlnorm(p, meanlog = 0, sdlog = 1, lower.tail = TRUE, log.p = FALSE)
rlnorm(n, meanlog = 0, sdlog = 1)
x, q | vector of quantiles.
p | vector of probabilities.
n | number of observations. If length(n) > 1, the length is taken to be the number required.
meanlog, sdlog | mean and standard deviation of the distribution on the log scale with default values of 0 and 1 respectively.
log, log.p | logical; if TRUE, probabilities p are given as log(p).
lower.tail | logical; if TRUE (default), probabilities are \(P[X \leq x]\), otherwise, \(P[X > x]\).
The log normal distribution has density
\[f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x} \, e^{-(\log(x) - \mu)^2 / (2\sigma^2)}\]
where \(\mu\) and \(\sigma\) are the mean and standard deviation of the logarithm.
The mean is \(E(X) = \exp(\mu + \sigma^2/2)\), the median is \(\text{med}(X) = \exp(\mu)\), and the variance is \(\text{Var}(X) = \exp(2\mu + \sigma^2)(\exp(\sigma^2) - 1)\), and hence the coefficient of variation is \(\sqrt{\exp(\sigma^2) - 1}\), which is approximately \(\sigma\) when that is small (e.g., \(\sigma < 1/2\)).
dlnorm gives the density, plnorm gives the distribution function, qlnorm gives the quantile function, and rlnorm generates random deviates.
The length of the result is determined by n for rlnorm, and is the maximum of the lengths of the numerical arguments for the other functions.
The numerical arguments other than n are recycled to the length of the result. Only the first elements of the logical arguments are used.
The cumulative hazard \(H(t) = -\log(1 - F(t))\) is -plnorm(t, r, lower = FALSE, log = TRUE).
dlnorm is calculated from the definition (in ‘Details’). [pqr]lnorm are based on the relationship to the normal.
Consequently, they model a single point mass at exp(meanlog) for the boundary case sdlog = 0.
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
Johnson, N. L., Kotz, S. and Balakrishnan, N. (1995) Continuous Univariate Distributions, volume 1, chapter 14. Wiley, New York.
Distributions for other standard distributions, including dnorm for the normal distribution.
dlnorm(1) == dnorm(0)
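Putting the help page to use, a sample from a log normal distribution might be drawn as in the sketch below. The parameter values (meanlog = 0, sdlog = 0.5), the seed, and the name LogNormalData are hypothetical; the assignment supplies its own parameter values.

```r
set.seed(2024)   # hypothetical seed

meanlog <- 0     # hypothetical mean on the log scale
sdlog   <- 0.5   # hypothetical standard deviation on the log scale

# Sample of size 20; the arguments follow the Usage lines of the help page.
LogNormalData <- rlnorm(20, meanlog = meanlog, sdlog = sdlog)

# meanlog and sdlog describe log(X), not X itself.
mean(log(LogNormalData))
sd(log(LogNormalData))

# Theoretical mean of X from the help page: exp(mu + sigma^2 / 2)
theoretical_mean <- exp(meanlog + sdlog^2 / 2)
theoretical_mean
```

The key point taken from the help page is that the parameters are given on the log scale, so they should not be confused with the mean and standard deviation of the sample itself.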