In statistics, data visualization plays a crucial role in gaining insights into the underlying behavior of the attributes being analyzed, assessing the validity of inference assumptions, identifying necessary corrections (such as transformations) to ensure valid inference, and effectively presenting findings to others. To meet these demands, the ggplot2 library serves as a powerful tool in our course for creating informative and visually appealing plots.
The ggplot2 package is a widely used graphing library in R that takes a different approach to data visualization than traditional graphing tools. Built on the principles of the grammar of graphics (see The Grammar of Graphics by Leland Wilkinson), it provides a flexible, layered framework for creating a wide range of visualizations: each layer represents a specific aspect of the plot, such as data points, geometric objects, statistical transformations, or annotations, and layers can be combined to build complex, informative plots with ease. ggplot2 adheres to the tidy data philosophy and supports extensibility, making it a powerful and versatile tool for visualizing data. Through aesthetic mapping, it can represent multiple dimensions in a single plot, while automatic legends and scales reduce the need for manual configuration. Integrated statistical transformations further enhance data exploration and analysis.
An entrepreneur faces many bureaucratic and legal hurdles when starting a new business. The World Bank collects information about starting businesses throughout the world. It has determined the time, in days, to complete all of the procedures required to start a business. Data for 195 countries are included in the data set. For this section we will examine data for a sample of 24 of these countries. Here are the data:
A “separator” is a character that marks the end of one cell and the beginning of the next one when R reads in data from plain text. Some common separators are spaces, tabs, and commas. By writing sep = ","
inside the read.table
function, we tell R that comma is the correct separator for this dataset. It is a good practice to always specify the separator.
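For reference, the Import Dataset wizard described in this tutorial runs a call similar to the following sketch (the file name is taken from this tutorial, and the file is assumed to be in the working directory):

```r
# Read a comma-separated file; header = TRUE uses the first row as column
# names, and stringsAsFactors = TRUE matches the wizard's "strings as
# factors" checkbox.
TimeStart <- read.table("eg01-23time24.csv", header = TRUE, sep = ",",
                        stringsAsFactors = TRUE)
```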
See Tutorial for CA 1 for the steps. If a problem asks for the code used for importing, explain the procedure in words instead. For example, write “Import Dataset \(\to\) From Text (base) \(\to\) Browse to eg01-23time24.csv \(\to\) Change name to TimeStart \(\to\) Check ‘Yes’ for heading \(\to\) Check strings as factors \(\to\) Import”. Make sure that the dataset looks correct in the preview before clicking “Import”.
Make sure that the data was imported correctly using View(TimeStart)
We need to access a variable inside a table to perform operations like computing the mean. There are four ways to do this:

1. tableName$variableName
2. Run attach(tableName) once, then just type the variable name in the rest of the code. This saves typing, but brings potential sources of confusion or error; make sure that none of your column names are exactly the same as the table name.
3. tableName[, columnNumber]
4. tableName[, "variableName"]

The third and fourth methods would apply to our problem as TimeStart[, 2] and TimeStart[, "TimeToStart"], respectively.
This tutorial will only use Method 1. Computer Assignment 2b will have examples of the rest of the methods.
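As a sketch, here are the four access methods applied to our dataset (this assumes, as stated above, that TimeToStart is the second column of TimeStart):

```r
# Method 1: dollar-sign access (used throughout this tutorial)
mean(TimeStart$TimeToStart)

# Method 2: attach once, then use the bare column name
attach(TimeStart)
mean(TimeToStart)
detach(TimeStart)   # good practice: detach when done to avoid name clashes

# Methods 3 and 4: index by column number or by column name
mean(TimeStart[, 2])
mean(TimeStart[, "TimeToStart"])
```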
There is no mathematically optimal choice for the bin size due to the large differences in data from different populations. For small to medium datasets, we will use the square root rule. The square root rule has been shown to work well in practice for most standard data of medium to small size. However, there are other rules available that one can explore for special cases. I will list some of them here for the interested student.
From now on, we will call the number of observations in the dataset \(n\), and the number of bins (or classes) \(n_{bins}\).
For small to medium datasets (\(n \leq 400\)), the number of bins is based on the square root rule, \(n_{bins} \approx round(\sqrt{n})\), with a small adjustment to ensure we have a minimal number of bins:
\[n_{bins} = \max (round(\sqrt{n})+2, 5)\] The +2 in the formula adds a couple of additional bins to help with what are called edge effects, where the bins at the edges contain much fewer observations than those in the interior. We also ensure that there are at least 5 bins, as it can be difficult to interpret the histogram when there are too few bins.
For large datasets (\(n>400\)), you will be required to explore a range of values between 20 - 30 bins to determine the best number of bins to display the distributional properties of the data. If you use the same rule as for small/medium data, the histogram will not correctly represent the properties of the data. You will lose points if you do not use a reasonable bin size. No matter how many bins you choose, always look at the result to see if there is a problem in the shape. If there is a problem, change the number of bins.
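For our sample of 24 countries, the small/medium rule can be computed directly:

```r
n <- nrow(TimeStart)                        # n = 24 for this sample
number_bins <- max(round(sqrt(n)) + 2, 5)   # round(sqrt(24)) + 2 = 7
number_bins
```

For large datasets, you would instead set number_bins by hand to a value in the 20 to 30 range and inspect the result.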
We will now plot the histogram of the variable TimeToStart
, which is part of the dataset TimeStart
. The package ggplot2
is already available on our version of RStudio, but when using a local version, you must first download it by running install.packages("ggplot2")
. In this course, you only need to import the downloaded package into your current session, using:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
This needs to be run only once, at the beginning of each R session.
We are now ready to write a ggplot
code. A typical ggplot
code begins by specifying the dataset as the first argument,
then specifying the variable(s) of interest from this dataset inside the aes()
argument.
Then desired layers are added one by one using the +
sign. Observe what happens as we add each additional layer to the base ggplot(TimeStart, aes(TimeToStart))
in the following series of code.
Compute the summary statistics. They give us some information about the shape of the sample. Further, they will be used as ingredients in the code for our histogram later on.
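As a sketch, using the dollar-sign access from Method 1 and the names xbar and s that the histogram code in this tutorial expects:

```r
xbar <- mean(TimeStart$TimeToStart)   # sample mean, center of the normal curve
s <- sd(TimeStart$TimeToStart)        # sample standard deviation
summary(TimeStart$TimeToStart)        # quick five-number summary plus the mean
```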
We begin by adding the base histogram layer geom_histogram
.
ggplot(TimeStart, aes(TimeToStart)) +
geom_histogram(aes(y = after_stat(density)),
bins=number_bins, fill="grey", col="black")
The argument aes(y = after_stat(density))
tells R that we want the y-axis to represent the density (relative frequency). Try running the code after deleting this part, and the plot will revert to its default option for the y-axis, the counts. We will always use the density option in this course.
ggplot(TimeStart, aes(TimeToStart)) +
geom_histogram(aes(y = after_stat(density)),
bins=number_bins, fill="grey", col="black") +
geom_density(col="red", lwd=1)
geom_density
adds a smooth curve that follows the trend of the histogram; this curve is called a kernel density estimate.
ggplot(TimeStart, aes(TimeToStart)) +
geom_histogram(aes(y = after_stat(density)),
bins=number_bins, fill="grey", col="black") +
geom_density(col="red", lwd=1) +
stat_function(fun=dnorm, args = list(mean = xbar, sd = s),
col = "blue", lwd = 1)
This stat_function
layer draws a normal density (fun=dnorm
) whose center is at the sample mean of TimeToStart
(mean = xbar
) and the standard deviation is also equal to the sample sd of TimeToStart
(sd = s
). Make sure that xbar
and s
are defined BEFORE running this part, or you will get an error due to undefined or incorrectly defined objects.
ggplot(TimeStart, aes(TimeToStart)) +
geom_histogram(aes(y = after_stat(density)),
bins=number_bins, fill="grey", col="black") +
geom_density(col="red", lwd=1) +
stat_function(fun=dnorm, args = list(mean = xbar, sd = s),
col = "blue", lwd = 1) +
ggtitle("Histogram of Time To Start")
Histograms used in our course should always contain all the layers used in this final version: the histogram bars, red smoothing curve, blue normal curve, and the title. ONLY the final version is necessary.
When a
ggplot
command is run successfully, the plot will appear in the lower right window of RStudio, under the Plots tab (blue arrow). You can flip through all the plots produced in the current R session using the left and right arrows (red). To save a plot as an image file, click export (green), then choose your preferred option among Save as Image, Save as PDF, and Copy to Clipboard.
This distribution is right skewed because the peak is on the left of the histogram and there is a long tail to the right.
According to the histogram, there is a possibility of outliers because the upper bin is separated from the rest of the bins.
This tutorial will cover the following materials:
Accessing variables inside a table using $, attach, and direct reference; computing the five-number summary; comparing the mean and the median; and creating a modified boxplot.
## 0% 25% 50% 75% 100%
## 5 7 13 30 94
From the R output above, Min = 5, Q1 = 7, Median = 13, Q3 = 30, Max = 94.
Note: There are other ways of computing the five-number summary. However, this method will generate the values that are used in the boxplot generated from ggplot2. Therefore, to receive full credit, you can only use the function quantile().
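The output shown earlier in this section comes from a call like the following (using the dollar-sign access from Method 1):

```r
# Five-number summary: the 0%, 25%, 50%, 75%, and 100% quantiles
quantile(TimeStart$TimeToStart)
```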
There are a number of different ways that this can be accomplished.
Compare the numbers from parts a) and b): Mean = 23.625, Median = 13. I would say that these numbers are not close in this data set because their absolute difference is 10.625, which is large compared to the range of the data (94 − 5 = 89).
Visually assess their distance on the histogram.
library(ggplot2)
median <- median(TimeStart$TimeToStart)   # sample median
xbar <- mean(TimeStart$TimeToStart)       # sample mean
s <- sd(TimeStart$TimeToStart)            # sample standard deviation
n <- nrow(TimeStart)                      # number of observations
number_bins <- max(round(sqrt(n))+2,5)    # square root rule for the bin count
ggplot(TimeStart, aes(TimeToStart)) +
geom_histogram(aes(y = after_stat(density)),
bins=number_bins, fill="grey", col="black") +
geom_density(col="red", lwd=1) +
stat_function(fun=dnorm, args = list(mean = xbar, sd = s),
col = "blue", lwd = 1) +
geom_vline(xintercept=xbar, linetype=1, lwd=1) +
geom_vline(xintercept=median, linetype=2, lwd=1) +
ggtitle("Histogram of Time To Start")
It is also okay to mark the locations approximately, by hand or with another application, on the histogram generated in the Tutorial for CA 2a. Here, the dotted line is the median and the solid line is the mean. To me, these numbers do not look close to each other.
Compare the difference of the numbers to the total spread of the data \[\frac{\textrm{mean}-\textrm{median}}{\textrm{maximum}-\textrm{minimum}} = \frac{23.625-13}{94-5} = 0.119\] This is about 12%, which is fairly large.
Compare the difference of the numbers to the standard deviation \[\frac{\textrm{mean}-\textrm{median}}{\textrm{standard deviation}} = \frac{23.625-13}{23.82876} = 0.446\] Therefore, the difference is about half the standard deviation, which is fairly large.
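The last two comparisons can be computed directly in R (the numeric values match those given above):

```r
xbar <- mean(TimeStart$TimeToStart)      # 23.625
med  <- median(TimeStart$TimeToStart)    # 13
rng  <- max(TimeStart$TimeToStart) - min(TimeStart$TimeToStart)  # 94 - 5 = 89

(xbar - med) / rng                        # difference relative to the range
(xbar - med) / sd(TimeStart$TimeToStart)  # difference relative to the sd
```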
In this case, I would say that the two numbers are not close to each other. However, these four methods will not always provide the same answer. When answering this question, you need to look at all of the information to make a decision. In some cases, it is possible for different people to have different answers.
The following is the procedure for generating a modified boxplot. In a modified boxplot, the outliers calculated from the 1.5 IQR Rule are explicitly plotted. These calculated points might or might not be ‘real’ outliers which have to be far from the rest of the data.
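As a sketch, the fences from the 1.5 IQR Rule can be computed by hand from the five-number summary found earlier (Q1 = 7, Q3 = 30):

```r
q <- quantile(TimeStart$TimeToStart)
iqr <- q[4] - q[2]                 # IQR = Q3 - Q1 = 30 - 7 = 23
upper_fence <- q[4] + 1.5 * iqr    # 30 + 34.5 = 64.5
lower_fence <- q[2] - 1.5 * iqr    # 7 - 34.5 = -27.5

# observations outside the fences are the ones plotted individually
TimeStart$TimeToStart[TimeStart$TimeToStart > upper_fence |
                      TimeStart$TimeToStart < lower_fence]
```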
To begin, load the ggplot2 package with library(ggplot2), once per session (skip this step if it is already loaded).
ggplot(TimeStart, aes(x = "", y = TimeToStart)) +
stat_boxplot(geom="errorbar") +
geom_boxplot() +
ggtitle("Boxplot of Time To Start") +
stat_summary(fun = mean, col = "black", geom = "point", size = 3)
Note the stat_summary()
function is an additional feature that puts the mean on the boxplot. The circle on the boxplot is the location of the mean. You may modify the color and size of the symbol.
Note: When putting together a document for upload, resize your graphics so that they fit on the page, especially when you have more than one plot.
The shape of the distribution looks right skewed. This is because the upper whisker is longer than the lower whisker, and the median is closer to the bottom of the box than to the top. In addition, there are explicitly plotted points at the higher end and not the lower end.
Looking at the boxplot, the lower of the two explicitly plotted points looks close to the top of the upper whisker. Remember that the whisker stops at an actual data point, not at the fence. Therefore, this point would not be considered a ‘real’ outlier because it is not that far from the rest of the data. However, the point at around 90 is far from the lower plotted point, so it would be considered a ‘real’ outlier.