Part 1: Graphical Summaries

Introduction

ggplot logo

In statistics, data visualization plays a crucial role in gaining insights into the underlying behavior of the attributes being analyzed, assessing the validity of inference assumptions, identifying necessary corrections (such as transformations) to ensure valid inference, and effectively presenting findings to others. To meet these demands, the ggplot2 library serves as a powerful tool in our course for creating informative and visually appealing plots.

The ggplot2 package is a widely used graphing library in R that offers a unique approach to data visualization compared to other traditional graphing tools. It provides a flexible and layered framework for creating a wide range of visualizations. With ggplot2, users can easily add various components to their plots, such as data points, aesthetic mappings, geometric objects, statistical transformations, and annotations. ggplot2 used a layered approach where each layer represents a specific aspect of the plot, such as data points, geometric objects, statistical transformations, or annotations. This layered approach allows for seamless customization and the creation of complex and informative plots with ease. Built on the principles of the grammar of graphics (see The Grammar of Graphics by Leland Wilkinson), ggplot2 follows a structured and consistent approach to constructing plots. It adheres to the tidy data philosophy and supports extensibility, making it a powerful and versatile tool for visualizing data. By leveraging aesthetic mapping, ggplot2 enables the representation of multiple dimensions in a single plot, while automatic legends and scales reduce the need for manual configuration. Additionally, integrated statistical transformations enhance data exploration and analysis.

Problem: Time to Start a Business (data: eg01-23time24.csv)

An entrepreneur faces many bureaucratic and legal hurdles when starting a new business. The World Bank collects information about starting businesses throughout the world. It has determined the time, in days, to complete all of the procedures required to start a business. Data for 195 countries are included in the data set. For this section we will examine data for a sample of 24 of these countries. Here are the data:

data

Importing the dataset

If using code

TimeStart <- read.table("Data/eg01-23time24.csv", header = T, sep = ",")

A “separator” is a character that marks the end of one cell and the beginning of the next one when R reads in data from plain text. Some common separators are spaces, tabs, and commas. By writing sep = "," inside the read.table function, we tell R that comma is the correct separator for this dataset. It is a good practice to always specify the separator.

If using RStudio user interface

See Tutorial for CA 1 for the steps. If a problem asks for the code used for importing, explain the procedure in words instead. For example, write “Import Dataset \(\to\) From Text (base) \(\to\) Browse to eg01-23time24.csv \(\to\) Change name to TimeStart \(\to\) Check ‘Yes’ for heading \(\to\) Check strings as factors \(\to\) Import”. Make sure that the dataset looks correct in the preview before clicking “Import”.

Check the dataset

Make sure that the data was imported correctly using View(TimeStart) or:

head(TimeStart)

Accessing variables inside a table

We need to access a variable inside a table to perform operations like computing the mean. There are a few different ways to do this.

  • tableName$variableName
  • (Not recommended) Run attach(TableName) once, then just type the variable name in the rest of the code. This allows for less writing, but brings sources of potential confusion or error. Make sure that none of your column names are exactly the same as the table name.
  • tableName[, columnNumber]
  • tableName[, "variableName"]

The third and fourth methods would apply to our problem as TimeStart[, 2] and TimeStart[, "TimeToStart"], respectively. This tutorial will only use Method 1. Computer Assignment 2b will have examples of the rest of the methods.

Creating Histograms

Selecting the number of bins

There is no mathematically optimal choice for the bin size due to the large differences in data from different populations. For small to medium datasets, we will use the square root rule. The square root rule has been shown to work well in practice for most standard data of medium to small size. However, there are other rules available that one can explore for special cases. I will list some of them here for the interested student.

  • Sturges’ Rule
  • Scott’s Rule
  • Freedman-Diaconis Rule
  • Rice’s Rule
  • Doane’s Formula

From now on, we will call the number of observations in the dataset \(n\), and the number of bins (or classes) \(n_{bins}\).

Small to medium datasets

For small to medium datasets (\(n \leq 400\)), the number of bins we shall use is \(n_{bins} \approx round(\sqrt{n})\). If a dataset is small, we will use the following formula to ensure we have a minimal number of bins.

\[n_{bins} = \max (round(\sqrt{n})+2, 5)\] The +2 in the formula adds a couple additional bins to help with what are called edge effects where the bins at the edges contain much fewer observations than the interior. We also ensure that there are at least 5 bins as it can be difficult to understand the histrogram when there are too few bins.

Large datasets

For large datasets (\(n>400\)), you will be required to explore a range of values between 20 - 30 bins to determine the best number of bins to display the distributional properties of the data. If you use the same rule as for small/medium data, the histogram will not correctly represent the properties of the data. You will lose points if you do not use a reasonable bin size. No matter how many bins you choose, always look at the result to see if there is a problem in the shape. If there is a problem, change the number of bins.

Import the ggplot2 package

We will now plot the histogram of the variable TimeToStart, which is part of the dataset TimeStart. The package ggplot2 is already available on our version of RStudio, but when using a local version, you must first download it by running install.packages("ggplot2"). In this course, you only need to import the downloaded package into your current session, using:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3

This needs to be run only once, at the beginning of each R session.

Organization of a ggplot function

We are now ready to write a ggplot code. A typical ggplot code begins by specifying the dataset as the first argument, then specifying the variable(s) of interest from this dataset inside the aes() argument. Then desired layers are added one by one using the + sign. Observe what happens as we add each additional layer to the base ggplot(TimeStart, aes(TimeToStart))in the following series of code.

Compute the summary statistics and the number of bins

Compute the summary statistics. They give us some information about the shape of the sample. Further, they will be used as ingredients in the code for our histogram later on.

xbar <- mean(TimeStart$TimeToStart)
s <- sd(TimeStart$TimeToStart)
n <- nrow(TimeStart)
number_bins <- max(round(sqrt(n))+2,5)

Histogram layer

We begin by adding the base histogram layer geom_histogram.

ggplot(TimeStart, aes(TimeToStart)) + 
  geom_histogram(aes(y = after_stat(density)),
             bins=number_bins, fill="grey", col="black")

The argument aes(y = after_stat(density)) tells R that we want the y-axis to represent the density (relative frequency). Try running the code after deleting this part, and the plot will revert to its default option for the y-axis, the counts. We will always use the density option in this course.

A smooth curve representing the histogram

ggplot(TimeStart, aes(TimeToStart)) + 
  geom_histogram(aes(y = after_stat(density)),
             bins=number_bins, fill="grey", col="black") +
  geom_density(col="red", lwd=1)

geom_density adds a curve that follows the trend of the histogram, sometimes called a ‘kernel’.

A normal curve sharing the same mean and sd as the data

ggplot(TimeStart, aes(TimeToStart)) + 
  geom_histogram(aes(y = after_stat(density)),
             bins=number_bins, fill="grey", col="black") +
  geom_density(col="red", lwd=1) +
  stat_function(fun=dnorm, args = list(mean = xbar, sd = s),
                col = "blue", lwd = 1)

This stat_function layer draws a normal density (fun=dnorm) whose center is at the sample mean of TimeToStart (mean = xbar) and the standard deviation is also equal to the sample sd of TimeToStart (sd = s). Make sure that xbar and s are defined BEFORE running this part, or you will get an error due to undefined or incorrectly defined objects.

Finalize by adding the title

ggplot(TimeStart, aes(TimeToStart)) + 
  geom_histogram(aes(y = after_stat(density)),
             bins=number_bins, fill="grey", col="black") +
  geom_density(col="red", lwd=1) +
  stat_function(fun=dnorm, args = list(mean = xbar, sd = s),
                col = "blue", lwd = 1) +
  ggtitle("Histogram of Time To Start")

Histograms used in our course should always contain all the layers used in this final version: the histogram bars, red smoothing curve, blue normal curve, and the title. ONLY the final version is necessary.

Solving the problem

a) Create a histogram of the times it took to start a new business.

Exporting a plot from RStudio to an image file

Exporting graphs When a ggplot command is run successfully, the plot will appear in the lower right window of RStudio, under the Plots tab (blue arrow). You can flip through all the plots produced in the current R session using the left and right arrows (red). To save a plot as an image file, click export (green), then choose your preferred option among Save as Image, Save as PDF, and Copy to Clipboard.

b) Describe the shape of the histogram.

This distribution is right skewed because the peak is on the left of the histogram and there is a long tail to the right.

c) Are there any outliers?

According to the histogram, there is a possibility of outliers because the upper bin is separated from the rest of the bins.

Part 2: Numerical Summaries

Introduction

This tutorial will cover the following materials:

  • How to reference variables using \(\$\), attach, and direct reference
  • How to calculate the mean, median, standard deviation, and quartiles
  • How to generate and interpret a boxplot in terms of shape and ‘real’ outliers

Problem: Time to Start a Business (data: eg01-23time24.csv)

An entrepreneur faces many bureaucratic and legal hurdles when starting a new business. The World Bank collects information about starting businesses throughout the world. It has determined the time, in days, to complete all of the procedures required to start a business. Data for 195 countries are included in the data set. For this section we will examine data for a sample of 24 of these countries. Here are the data:

dataset

Importing the dataset

If using code

TimeStart <- read.table("Data/eg01-23time24.csv", header = T, sep = ",")

A “separator” is a character that marks the end of one cell and the beginning of the next one when R reads in data from plain text. Some common separators are spaces, tabs, and commas. By writing sep = "," inside the read.table function, we tell R that comma is the correct separator for this dataset. It is a good practice to always specify the separator.

If using RStudio user interface

See Tutorial for CA 1 for the steps. If a problem asks for the code used for importing, explain the procedure in words instead. For example, write “Import Dataset \(\to\) From Text (base) \(\to\) Browse to eg01-23time24.csv \(\to\) Change name to TimeStart \(\to\) Check ‘Yes’ for heading \(\to\) Check strings as factors \(\to\) Import”. Make sure that the dataset looks correct in the preview before clicking “Import”. After importing, use View(TimeStart) to confirm that the import was successful.

Accessing variables inside a table

We need to access a variable inside a table to perform operations like computing the mean. There are a few different ways to do this.

  • tableName$variableName
  • (Not recommended) Run attach(TableName) once, then just type the variable name in the rest of the code. This allows for less writing, but brings sources of potential confusion or error. Make sure that none of your column names are exactly the same as the table name.
  • tableName[, columnNumber]
  • tableName[, "variableName"]

The third and fourth methods would apply to our problem as TimeStart[, 2] and TimeStart[, "TimeToStart"], respectively. This tutorial will only use Method 1. Computer Assignment 2b will have examples of the rest of the methods.

Solving the problem

a) Find the mean and the standard deviation of the times it took to start a new business among all countries in the data set.

Method 1

mean(TimeStart$TimeToStart)
## [1] 23.625
sd(TimeStart$TimeToStart)
## [1] 23.82876

Method 2

attach(TimeStart)
mean(TimeToStart)
## [1] 23.625
sd(TimeToStart)
## [1] 23.82876

Method 3

mean(TimeStart[,2])
## [1] 23.625
sd(TimeStart[,2])
## [1] 23.82876

From any of the methods above, Mean = 23.625, Standard deviation = 23.82876.

As long as you include the command before the output, you do not need to retype the numbers unless specifically asked to.

b) Find the five-number summary of the times it took to start a new business.

quantile(TimeStart$TimeToStart)
##   0%  25%  50%  75% 100% 
##    5    7   13   30   94

From the R output above, Min = 5, Q1 = 7, Median = 13, Q3 = 30, Max = 94.

Note: There are other ways of computing the five-number summary. However, this method will generate the values that are used in the boxplot generated from ggplot2. Therefore, to receive full credit, you can only use the function quantile().

c) Comparing Central Tendency Measures

There are a number of different ways that this can be accomplished.

  1. Compare the numbers From parts a) and b), Mean = 23.625, Median = 13. I would say that these numbers are not close in this data set because their absolute difference is 10.625 which is large compared to the range of data (approximately 100).

  2. Visual assess their distance on the histogram.

library(ggplot2)
median <- median(TimeStart$TimeToStart)
xbar <- mean(TimeStart$TimeToStart)
s <- sd(TimeStart$TimeToStart)
n <- nrow(TimeStart)
number_bins <- max(round(sqrt(n))+2,5)
ggplot(TimeStart, aes(TimeToStart)) + 
  geom_histogram(aes(y = after_stat(density)),
             bins=number_bins, fill="grey", col="black") +
  geom_density(col="red", lwd=1) +
  stat_function(fun=dnorm, args = list(mean = xbar, sd = s),
                col = "blue", lwd = 1) +
  geom_vline(xintercept=xbar, linetype=1, lwd=1) +
  geom_vline(xintercept=median, linetype=2, lwd=1) +
  ggtitle("Histogram of Time To Start")

It is also okay to mark the locations approximately by hand or another application on the histogram generated from the Tutorial for CA 2a. Here, the dotted line is the median and the solid line is the mean.To me, these numbers do not look close to each other.

  1. Compare the difference of the numbers to the total spread of the numbers \[\frac{\textrm{mean}-\textrm{median}}{\textrm{maximum}-\textrm{minimum}} = \frac{23.625-13}{94-5} = 0.119\] This is ~12% which is fairly large.

  2. Compare the difference of the numbers to the standard deviation \[\frac{\textrm{mean}-\textrm{median}}{\textrm{standard deviation}} = \frac{23.625-13}{23.82876} = 0.446\] Therefore, the difference is about half the standard deviation which is fairly large.

In this case, I would say that the two numbers are not close to each other. However, these four methods will not always provide the same answer. When answering this question, you need to look at all of the information to make a decision. In some cases, it is possible for different people to have different answers.

d) Create a (modified) boxplot of the times it took to start the new businesses.

The following is the procedure for generating a modified boxplot. In a modified boxplot, the outliers calculated from the 1.5 IQR Rule are explicitly plotted. These calculated points might or might not be ‘real’ outliers which have to be far from the rest of the data.

To begin, import the ggplot2 once.

library(ggplot2)
ggplot(TimeStart, aes(x = "", y = TimeToStart)) +
  stat_boxplot(geom="errorbar") +
  geom_boxplot() +
  ggtitle("Boxplot of Time To Start") +
  stat_summary(fun = mean, col = "black", geom = "point", size = 3) 

Note the stat_summary() function is an additional feature that puts the mean on the boxplot. The circle on the boxplot is the location of the mean. You may modify the color and size of the symbol.

Note: When putting together a document for upload, resize your graphics so that they fit on the page, especially when you have more than one plot.

e) From the boxplot, what do you think the shape of the distribution is?

The shape of the distribution looks right skewed. This is because the upper whisker is longer than the lower whisker. The median is also closer to the bottom of the box versus the top of the box. In addition, there are explicit points at the higher end and not the lower end.

f) From the boxplot, are there any ‘real’ outliers?

Looking at the boxplot, the lower explicit point looks close to the top of the whisker. Remember that the whisker stops at an actual data point and not the fence. Therefore, this point would not be considered a ‘real’ outlier because it is not that far from the rest of the data. However, the point at around 90 is far from the lower explicit point so this point would be considered a ‘real’ outlier.