Introduction

This tutorial covers the following topics:

Data Subsetting: How to subset data based on logical conditions.
One-Sample Statistical Inference: How to perform statistical inference procedures for the study of an unknown population mean.

Subsetting in R

Subsetting in R is a technique that allows analysts and researchers to focus on specific portions of their data for deeper analysis or inference. This practice is not just a matter of convenience; it’s a fundamental approach to handling complex datasets and extracting meaningful insights. There are several reasons why one might want to use subsetting to consider a subset of the sample for inference:

Specific Group Analysis: In many cases, researchers are interested in understanding the behavior or characteristics of a specific group within the larger population. Subsetting allows for the isolation of these groups so their properties can be studied in detail without the noise from irrelevant data and direct inference can be conducted regarding the groups of interest.
Controlled Comparisons: When investigating the effects of a particular factor, analysts often need to compare different groups under varying conditions. Subsetting enables the creation of these comparison groups.
Handling Missing Data: In datasets with missing values, it might be necessary to analyze only the subset of data that is complete for certain variables of interest. This approach can help maintain the integrity of statistical analyses, although it’s important to consider the potential biases introduced by excluding data.
Efficiency in Computing: Working with a subset of data can significantly reduce computational load and processing time, especially when dealing with very large datasets. This can be crucial for exploratory data analysis, where multiple iterations and analyses are performed.
Tailored Modeling: Different subsets of data might exhibit different patterns or relationships between variables. By subsetting the data, models can be tailored to specific segments, potentially leading to more accurate predictions and inferences.
Statistical Significance and Power: Restricting attention to a more homogeneous subset can reduce variability, which may make a real effect easier to detect. However, a subset also has a smaller sample size, which tends to lower the power of a test, so this trade-off should be weighed before subsetting.
Ethical and Legal Considerations: When dealing with sensitive data, it may be necessary to subset data to ensure that analyses comply with privacy laws and ethical guidelines. This could involve excluding personally identifiable information or focusing on anonymized subsets of the data.

In practice, subsetting is often one of the first steps in a detailed data analysis workflow in R. It enables researchers to focus their attention and resources on the most relevant parts of the data, leading to more efficient and insightful analyses. However, it’s essential to apply subsetting thoughtfully, ensuring that the subsets analyzed are representative and that the process does not introduce bias into the findings.

You will use an R function called subset(). The subset() function extracts specific rows from a data frame based on conditions that you supply. To use it, pass your data frame as the first argument and the condition for subsetting as the second argument. The condition is usually written as a logical expression in terms of the variables in your data frame.

Logical Operators

The ‘&’ operator in R conducts an element-wise logical ‘AND’ operation on values or vectors of equal length, returning TRUE for each position where both vectors have TRUE values, and FALSE otherwise. It’s pivotal for filtering data by combining multiple conditions.
The ‘|’ operator (pipe symbol) in R executes an element-wise logical ‘OR’ operation on values or vectors of equal length, returning TRUE for each position where either or both vectors have TRUE values, and FALSE otherwise. It is essential for selecting data that satisfies any one of multiple conditions.
The ‘==’ operator in R conducts an element-wise equality test between values or vectors, returning TRUE for positions where elements are equal, and FALSE where they differ. It’s essential for selecting data that matches specific criteria or for comparing datasets for consistency.
The ‘!’ operator in R negates the truth value of its operand, turning TRUE into FALSE and vice versa. It’s critical for reversing conditions, particularly in filtering data to exclude rather than include matches.
The ‘>’ operator in R performs an element-wise comparison, returning TRUE for each position where the element in the first vector is greater than the corresponding element in the second vector, and FALSE otherwise. It’s vital for selecting data points that exceed a specific threshold.
The ‘<’ operator in R conducts an element-wise comparison, returning TRUE for each position where the element in the first vector is less than the corresponding element in the second vector, and FALSE otherwise. It’s crucial for filtering data points that fall below a certain threshold.
The ‘>=’ operator in R executes an element-wise comparison, returning TRUE for each position where the element in the first vector is greater than or equal to the corresponding element in the second vector, and FALSE otherwise. It’s essential for selecting data points that meet or exceed a specific threshold.
The ‘<=’ operator in R performs an element-wise comparison, returning TRUE for each position where the element in the first vector is less than or equal to the corresponding element in the second vector, and FALSE otherwise. It’s important for filtering data points that are at or below a certain threshold.
The combination of ‘!’ and ‘=’ as the ‘!=’ operator in R is used for testing inequality between values or vectors of equal length, returning TRUE where elements differ and FALSE where they are equal. It’s key for excluding specific data points or identifying discrepancies in datasets.
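As a quick illustration of how these operators behave element-wise, here is a minimal sketch on two small made-up vectors. The names `rating` and `team` and their values are illustrative assumptions, not part of the tutorial data sets.

```r
# Hypothetical example vectors (not from the tutorial data sets)
rating <- c(3, 5, 7, 9)
team   <- c("Colts", "Giants", "Colts", "Saints")

is_colts    <- team == "Colts"      # element-wise equality test
high_rating <- rating >= 5          # element-wise comparison

both    <- is_colts & high_rating   # TRUE only where both are TRUE
either  <- is_colts | high_rating   # TRUE where at least one is TRUE
not_col <- !is_colts                # negation of each element

# '!=' gives the same result as negating '=='
same <- identical(not_col, team != "Colts")

print(both)
print(either)
print(same)
```

Each result is a logical vector with one entry per element, which is the form subset() expects for its condition.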

Building on the foundation of logical operators, let’s delve into the subset() function and explore how it utilizes these operators to filter and extract specific portions of data for focused analysis.

To use subset(), you provide it with the data object you wish to subset from and a condition (or set of conditions) that specifies which rows (or elements) should be included in the output. The condition is expressed using logical operators, applying them to the columns (or elements) of the data object.

subset(x, subset, select)

x: The data frame, matrix, or vector from which you want to subset.
subset: The logical condition that rows must meet to be included in the output. This is where logical operators play a crucial role.
select: Optionally, specifies which columns to keep and the order in which they should be represented in the resulting data frame.
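The sketch below shows the three arguments working together on a small made-up data frame. The data frame `players` and its values are hypothetical. It also illustrates one behavior worth knowing: rows for which the condition evaluates to NA are treated as not meeting the condition and are dropped.

```r
# Hypothetical toy data frame (values are made up for illustration)
players <- data.frame(
  Name   = c("A", "B", "C", "D"),
  Team   = c("Colts", "Giants", "Colts", "Saints"),
  Rating = c(6, NA, 4, 8)
)

# Keep rows where Rating >= 5. The row with a missing Rating is dropped
# because subset() treats an NA condition as FALSE.
# select keeps only the listed columns, in the listed order.
high <- subset(players, subset = Rating >= 5, select = c(Rating, Name))
print(high)
```

Only players A and D remain, and the columns appear in the order given to select.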

Logical operators are integral to forming the subset argument. By combining these operators, you can create complex conditions that precisely define the subset of data you’re interested in. In R, remember to enclose text (character) values in quotes and to use the appropriate logical operators for filtering. The following templates illustrate how logical operators are used with subset():

Subsetting with ‘==’: Select rows where a specific column equals a particular value.

subset(data_frame, column_name == specific_value)

Combining Conditions with ‘&’: Select rows that meet multiple criteria simultaneously.

subset(data_frame, column1 == value1 & column2 > value2)

Using ‘OR’ with ‘|’: Select rows that satisfy at least one of multiple conditions.

subset(data_frame, column1 == value1 | column2 < value2)

Excluding with ‘!=’: Select rows where a column does not equal a specific value.

subset(data_frame, column_name != excluded_value)
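The templates above use placeholder names. A runnable version might look like the following sketch, which assumes a small made-up data frame called `toy` (hypothetical names and values, not the linebackers data).

```r
# Hypothetical toy data frame (made-up values for illustration)
toy <- data.frame(
  Team   = c("Colts", "Giants", "Colts", "Saints", "Bears"),
  Rating = c(6, 3, 4, 8, 7)
)

eq_rows  <- subset(toy, Team == "Colts")               # '==': Colts rows only
and_rows <- subset(toy, Team == "Colts" & Rating > 5)  # '&' : both conditions hold
or_rows  <- subset(toy, Team == "Colts" | Rating < 4)  # '|' : at least one holds
ne_rows  <- subset(toy, Team != "Colts")               # '!=': everything but Colts

# Number of rows kept by each condition
print(c(eq = nrow(eq_rows), and = nrow(and_rows),
        or = nrow(or_rows), ne = nrow(ne_rows)))
```

Inside subset(), column names such as Team and Rating are looked up in the data frame, so they can be written without the `toy$` prefix.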

Example: (DATA SET: linebackers.csv) NFL Linebacker contracts from approximately 2006-2010

linebackers <- read.csv("Data/linebackers.csv")

The dataset contains the following variables for NFL linebackers:

Name: The player’s name.
Team: The NFL team the player is associated with.
Base Salary: The basic annual compensation the player receives.
Signing Bonus: An upfront bonus received upon signing a contract.
Other Bonus: Additional bonuses, possibly including roster bonuses, workout bonuses, or performance incentives.
Total Salary: The sum of the base salary, signing bonus, and other bonuses for the year.
Cap Value: The amount of the player’s salary that counts against the team’s salary cap.
Percentage: Represents the player’s Other.Bonus relative to the Total.Salary as a percentage.
Rating: A performance rating, which could be based on various metrics or evaluations.
Sqrt.rating.: The square root of the player’s performance rating.

As an example, we will use the subset() function to filter the data based on specific criteria. We narrow the dataset to the teams that won the Super Bowl championships held between 2006 and 2010: the Pittsburgh Steelers (2006), Indianapolis Colts (2007), New York Giants (2008), Pittsburgh Steelers (2009), and New Orleans Saints (2010).

# Column names are evaluated inside the data frame, so no linebackers$ prefix is needed
linebacker_superbowl_subset <- subset(linebackers, Team == "Steelers" | Team == "Colts" | Team == "Giants" | Team == "Saints")

We can further refine our results to consider only the players with a rating of at least 5. Make sure to use parentheses to control the order of the logical operations.

# Parentheses group the team conditions before '&' is applied
linebacker_superbowl_subset <- subset(linebackers, (Team == "Steelers" | Team == "Colts" | Team == "Giants" | Team == "Saints") & Rating >= 5)

Let’s also select only the following columns, in order: Name, Team, Rating, Total.Salary, Other.Bonus, Percentage.

# select lists the columns to keep, in the order they should appear
linebacker_superbowl_subset <- subset(linebackers, (Team == "Steelers" | Team == "Colts" | Team == "Giants" | Team == "Saints") & Rating >= 5, select = c(Name, Team, Rating, Total.Salary, Other.Bonus, Percentage))
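A long chain of `==` comparisons joined by `|` can also be written with R’s `%in%` operator, which tests membership in a set of values. This is an alternative to the approach used above, not a replacement for it. The sketch below uses a small made-up data frame (the names `toy` and `winners` and all values are hypothetical) to check that both forms select the same rows.

```r
# Hypothetical toy data frame (made-up values, not the linebackers data)
toy <- data.frame(
  Name   = c("A", "B", "C", "D", "E"),
  Team   = c("Steelers", "Bears", "Colts", "Saints", "Giants"),
  Rating = c(6, 9, 4, 5, 7)
)
winners <- c("Steelers", "Colts", "Giants", "Saints")

# Chained '|' conditions, grouped with parentheses before applying '&'
with_or <- subset(
  toy,
  (Team == "Steelers" | Team == "Colts" | Team == "Giants" | Team == "Saints") &
    Rating >= 5,
  select = c(Name, Team, Rating)
)

# Same filter written with %in%
with_in <- subset(toy, Team %in% winners & Rating >= 5,
                  select = c(Name, Team, Rating))

same_rows <- isTRUE(all.equal(with_or, with_in))
print(with_in)
print(same_rows)
```

The `%in%` form is shorter when the list of values is long, and the list can be stored in a variable and reused.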

Conducting a t-procedure for inference of the population mean.

The t.test() function in R is used both for constructing confidence intervals or bounds and for conducting hypothesis tests about population means. It handles one-sample and two-sample t-procedures; two-sample procedures will be discussed in a subsequent tutorial.

The one-sample t-procedure serves a dual purpose: first, to test whether the sample mean differs significantly from the hypothesized population mean; and second, to estimate the range within which the true population mean is likely to fall at a specified level of confidence. Here is the general form of a t.test() call for inference about a single unknown population mean.

t.test(x, mu = mu_0, alternative = "alternative_hypothesis", conf.level = C)

x: Vector of data values.
mu: The hypothesized mean value under the null hypothesis. Defaults to \(0\).
alternative: The alternative hypothesis direction chosen from “two.sided”, “less”, or “greater”. This parameter determines whether the test is looking for a difference in either direction (“two.sided”) or a specific direction (“less” or “greater”). Defaults to “two.sided”. This also defines the type of confidence interval/bound constructed.
conf.level: The numerical confidence coefficient. Defaults to \(0.95\).
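To see how the alternative argument changes the type of interval that is returned, consider the following sketch. The vector `x` and the values `mu = 10` and `conf.level = 0.90` are made-up assumptions chosen only for illustration.

```r
# Hypothetical sample (made-up values for illustration only)
x <- c(12.1, 9.8, 11.4, 10.9, 13.2, 10.1, 11.7, 12.5)

two_sided  <- t.test(x, mu = 10, alternative = "two.sided", conf.level = 0.90)
upper_tail <- t.test(x, mu = 10, alternative = "greater",   conf.level = 0.90)
lower_tail <- t.test(x, mu = 10, alternative = "less",      conf.level = 0.90)

print(two_sided$conf.int)   # two-sided interval: finite lower and upper limits
print(upper_tail$conf.int)  # lower bound: the upper limit is Inf
print(lower_tail$conf.int)  # upper bound: the lower limit is -Inf
```

The test statistic is the same in all three calls; only the p-value and the interval type depend on the chosen alternative.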

Example: (DATA SET: DMS.csv) Dimethyl sulfide odor detection thresholds for 10 oenology students.

Many food products contain small quantities of substances that would give an undesirable taste or smell if they were present in large amounts. An example is the “off-odors” caused by sulfur compounds in wine. Oenologists (wine experts) have determined the odor threshold, the lowest concentration of a compound that the human nose can detect. For example, the odor threshold for dimethyl sulfide (DMS) is given in the oenology literature as 25 micrograms per liter of wine (μg/L). Untrained noses may be less sensitive, however. Listed below are the DMS odor thresholds for 10 beginning students of oenology:

wine_data <- read.csv("Data/DMS.csv")
wine_data$DMS

##  [1] 31 31 43 36 23 34 32 30 20 24

This study focuses on the sensitivity of novice oenology students to DMS, investigating whether their detection thresholds are larger than the established norm of 25 micrograms per liter of wine (μg/L).

Checking Assumptions

Before performing a t-procedure on this dataset, we must check the normality assumption through visual inspection of the data. We will assume that the data were collected as a simple random sample (SRS).

library(ggplot2) # Needed for the plots below

xbar <- mean(wine_data$DMS) # Sample mean
s <- sd(wine_data$DMS) # Sample standard deviation

#BOXPLOT
ggplot(wine_data, aes(x = "", y = DMS)) + 
  stat_boxplot(geom = "errorbar") + 
  geom_boxplot() + 
  ggtitle("Boxplot of DMS") + 
  stat_summary(fun = mean, col = "black", geom = "point", size = 3)

# HISTOGRAM 
n <- nrow(wine_data)
n_bins <- max(5,round(sqrt(n))+2)
ggplot(wine_data, aes(DMS)) + 
  geom_histogram(aes(y = after_stat(density)), bins = n_bins, fill = "grey", col = "black") + 
  geom_density(col = "red", lwd = 1) + 
  stat_function(fun = dnorm, args = list(mean = xbar, sd = s), col="blue", lwd = 1) + 
  ggtitle("Histogram of DMS") + 
  xlab("DMS") + 
  ylab("Density")

# QQPlot 
ggplot(wine_data, aes(sample = DMS)) + 
  stat_qq() + 
  geom_abline(slope = s, intercept = xbar) + 
  ggtitle("QQ Plot of DMS")

Assuming that the sample is an SRS, the only other assumption required is that the population distribution is normal. Since the sample size is only \(10\), we cannot rely on the Central Limit Theorem. However, the normal probability plot and the histogram show that the sample is approximately normal, which is consistent with a normal population. Therefore, we treat the assumptions as met and proceed with the one-sample t-procedures.

Computing the confidence lower bound and performing the hypothesis test using t.test.

To investigate whether the detection threshold of novice oenology students is larger than the established norm of 25 micrograms per liter of wine (μg/L), we obtain a 95% confidence lower bound and conduct the corresponding upper-tailed hypothesis test with \(\mu_0=25\).

t.test(wine_data$DMS, mu = 25, alternative = "greater", conf.level = 0.95)

## 
##  One Sample t-test
## 
## data:  wine_data$DMS
## t = 2.5288, df = 9, p-value = 0.01615
## alternative hypothesis: true mean is greater than 25
## 95 percent confidence interval:
##  26.48554      Inf
## sample estimates:
## mean of x 
##      30.4

From the output we obtain the following:

t-Test Statistic: \(t_{\text{TS}}=2.5288\)
Degrees of freedom: \(\text{df}= n-1 = 9\)
\(p\)-value: \(p\)-value\(=0.01615\)
95% Confidence Lower Bound: \(\bar{x}-t_{0.05,9}\frac{s}{\sqrt{n}}=26.48554\), reported as the interval \((26.48554,\infty)\)
Sample Mean: \(\bar{x}=30.4\)

Note: The t.test() output does not report the t critical value, which is needed to compute the confidence bound manually.

From the confidence lower bound, we are 95% confident that the true value (i.e., the population mean detection threshold for novice oenology students) is greater than \(26.4855\). Since \(26.4855\) is greater than \(25\), we are also 95% confident that the true value is greater than the established norm of \(25\) micrograms per liter of wine (μg/L). This is further confirmed by the \(p\)-value\(=0.01615<0.05=\alpha\).
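The quantities read from the printed output can also be pulled out of the object that t.test() returns. The sketch below enters the ten DMS values printed earlier by hand so that it runs without the CSV file. The variable names are illustrative assumptions.

```r
# DMS thresholds as printed earlier (entered by hand so the block is self-contained)
dms <- c(31, 31, 43, 36, 23, 34, 32, 30, 20, 24)

result <- t.test(dms, mu = 25, alternative = "greater", conf.level = 0.95)

t_stat   <- unname(result$statistic)  # test statistic
df_t     <- unname(result$parameter)  # degrees of freedom
p_val    <- result$p.value            # p-value
lower_bd <- result$conf.int[1]        # 95% confidence lower bound

reject_null <- p_val < 0.05           # decision at alpha = 0.05

print(c(t = t_stat, df = df_t, p = p_val, lower = lower_bd))
print(reject_null)
```

Storing the result this way avoids retyping rounded numbers when they are needed in later calculations.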

Computing the confidence lower bound and performing the hypothesis test manually.

Obtain the confidence lower bound:

C <- 0.95
n <- nrow(wine_data)
xbar <- mean(wine_data$DMS)
s <- sd(wine_data$DMS)
t_crit <- qt(p = 1-C, df = n-1, lower.tail = FALSE) # Obtain critical value (named t_crit to avoid masking the base function t())
conf.lowerBound <- xbar - t_crit*s/sqrt(n)
conf.lowerBound

## [1] 26.48554

Conduct the t-test:

C <- 0.95
alpha <- 1-C # One-sided test: no need to divide alpha by 2.
mu_0 <- 25
n <- nrow(wine_data)
xbar <- mean(wine_data$DMS)
s <- sd(wine_data$DMS)
t_TS <- (xbar-mu_0)/(s/sqrt(n)) #Obtain test statistic
pvalue <- pt(q = t_TS, df = n-1, lower.tail = FALSE) #Obtain p-value
pvalue

## [1] 0.01614994
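The manual steps above can be collected into one function and checked against t.test(). This is only a sketch: `upper_tailed_t` is a hypothetical helper name, not a base R function, and the DMS values are entered by hand so the block runs without the CSV file. It also shows the critical value approach, in which the null hypothesis is rejected when the test statistic exceeds the critical value.

```r
# Hypothetical helper (not part of base R): upper-tailed one-sample t-procedure
upper_tailed_t <- function(x, mu_0, conf_level = 0.95) {
  n      <- length(x)
  xbar   <- mean(x)
  se     <- sd(x) / sqrt(n)                                   # standard error
  t_crit <- qt(p = 1 - conf_level, df = n - 1, lower.tail = FALSE)
  t_TS   <- (xbar - mu_0) / se                                # test statistic
  list(
    t_TS        = t_TS,
    t_crit      = t_crit,
    p_value     = pt(q = t_TS, df = n - 1, lower.tail = FALSE),
    lower_bound = xbar - t_crit * se,
    reject      = t_TS > t_crit    # critical value approach
  )
}

dms     <- c(31, 31, 43, 36, 23, 34, 32, 30, 20, 24)
manual  <- upper_tailed_t(dms, mu_0 = 25)
builtin <- t.test(dms, mu = 25, alternative = "greater", conf.level = 0.95)

# Check that the manual results agree with t.test()
agree <- isTRUE(all.equal(manual$p_value, builtin$p.value)) &&
  isTRUE(all.equal(manual$lower_bound, builtin$conf.int[1]))
print(agree)
print(manual$reject)
```

Comparing the two sets of results is a simple way to confirm that the manual formulas were coded correctly.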

R Tutorial for CA 2: Subsetting and Basic Statistical Inference

Authors:
Leonore Findsen, Timothy Reese,
Sarah H. Sellke, Halin Shin, Chunyan Sun, Jeremy Troisi
STAT 350

