1.1. What Is Statistics?

Welcome to the fascinating world of statistics! Statistics surrounds us daily—from weather forecasts and medical studies to economic reports and political polls. As we begin our journey together this semester, let’s explore two fundamental questions: what exactly is statistics, and why does it appear in so many places?

Statistics is a framework for making sense of the real world through data and drawing meaningful conclusions in the face of uncertainty. The American Statistical Association defines statistics as

“the science of learning from data and of measuring, controlling, and communicating uncertainty.”

In the remainder of this section, we will see how this definition captures the field of statistics by examining its major branches and by considering the statistician’s role in making statistics accessible to a broad audience.

Road Map 🧭

  • Define the four branches of statistics, and understand how they interact.

  • Understand the role of a statistician.

  • Understand what it means to achieve the three stages of statistical competence, and identify where we are now and where we will be at the end of this course.

1.1.1. Primary Branches of Statistics

Statistics is a diverse ecosystem of interconnected components. One way to view statistics is to divide it into four branches: data management, exploratory data analysis, inferential statistics, and predictive analytics.

A. Data Management: Collection, Cleaning, and Storage

Data Collection

A data-collection process begins with careful planning. This entails

  • defining clear objectives for what you want to achieve with the data,

  • determining the specific questions the data must answer,

  • developing a detailed plan, including timelines, resources needed, roles and responsibilities, and the data types to be recorded,

  • planning for potential challenges and how to address them, and

  • continuously reviewing the data collection process.

The selection of a data collection method must take into account the question being addressed and the available resources. Ethical considerations must also be factored in, especially when the procedure involves human participants. It is important to comply with relevant laws and regulations (e.g., FERPA, HIPAA, GDPR), obtain informed consent from all participants, and ensure privacy and confidentiality.

Some widely used collection methods include surveys and questionnaires, interviews, observations, experiments, web scraping, sensor-based and electronic data collection, and document and record reviews.
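
To make one of these methods concrete, below is a minimal sketch, in Python, of selecting participants for a survey by simple random sampling. The sampling frame of 500 student IDs and the sample size of 25 are invented for illustration.

```python
import random

# Hypothetical sampling frame: student IDs 1 through 500.
# In a real survey, the frame would come from actual records.
random.seed(42)  # fix the seed so the draw is reproducible
frame = list(range(1, 501))

# Draw a simple random sample of 25 IDs without replacement.
sample = random.sample(frame, k=25)
print(sorted(sample))
```

Every ID in the frame has the same chance of being selected, which is what makes the resulting sample representative of the frame.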

Data Cleaning

This stage consists of an overall inspection and organization of collected datasets. Some possible issues addressed during this stage are:

  • Correcting errors such as duplicate entries

  • Consolidating data from various sources

  • Resolving conflicts in data types, values, and formats

  • Handling missing data. Different strategies may be chosen depending on the nature of the missing entries. Some possibilities are deletion, imputation (making an educated guess), and flagging (taking no immediate action other than making a note); see the sketch below.

Data cleaning is often an iterative process, especially when new data is collected periodically. It is advisable to maintain comprehensive documentation of the data cleaning steps; this facilitates reviewing and refining the process to accommodate new data or adapt to updated standards.
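
As a concrete illustration, here is a small sketch of these cleaning steps using Python’s pandas library. The dataset is made up and deliberately tiny: it contains a duplicate row, a numeric column stored as text, and a missing value that we both flag and impute.

```python
import pandas as pd
import numpy as np

# A tiny made-up dataset exhibiting the issues listed above.
df = pd.DataFrame({
    "id":    [1, 2, 2, 3],
    "score": ["87", "92", "92", np.nan],  # numbers stored as text; one missing
})

df = df.drop_duplicates()                 # correct duplicate entries
df["score"] = pd.to_numeric(df["score"])  # resolve the type conflict
df["score_missing"] = df["score"].isna()  # flagging: record which entries were missing
df["score"] = df["score"].fillna(df["score"].mean())  # imputation: fill with the mean
print(df)
```

Note that the flag column is created before imputing, so the fact that a value was originally missing is not lost.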

Data Storage

Poster illustrating the growth in the amount of available data (Raconteur)

Fig. 1.1 Growth of data

We live in an era of unprecedented data abundance:

  • Social media platforms generate billions of interactions daily

  • Location services track geographical movements of people and vehicles

  • Wearable devices monitor health metrics around the clock

  • Business transactions create detailed records of economic activity

By 2025, experts predict, approximately 463 exabytes of data will be created globally every day—a volume almost impossible to conceptualize (one exabyte equals one billion gigabytes). This explosion of data creates both opportunities and challenges for statisticians.

In this course, we’ll focus primarily on structured data in manageable volumes, building foundational skills to master essential principles that remain relevant regardless of data scale or format.

When data is large or has a complex structure, an efficient storage solution must be implemented by establishing appropriate data schemas. When a dataset consists of multiple parts, it should be cataloged or integrated into a unified, condensed form. Various storage solutions exist for such datasets, including relational databases, data warehouses, data lakes, and cloud storage.

Once the data is stored, accessing and retrieving it also requires careful planning due to factors such as volume, privacy constraints, or regulatory sensitivity. It may be necessary to develop efficient querying mechanisms using SQL, utilize APIs, implement robust user access controls, apply encryption, and more.
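
As a small taste of what querying looks like, the following Python sketch uses the standard-library sqlite3 module with an in-memory database. The measurements table and its contents are invented for illustration; a production system would more likely involve a dedicated database server, access controls, and encryption.

```python
import sqlite3

# An in-memory SQLite database stands in for a real storage system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("A", 21.5), ("A", 22.1), ("B", 19.8)],
)

# An efficient query retrieves only the summary we need
# instead of pulling every row out of storage.
for sensor, avg_value in conn.execute(
    "SELECT sensor, AVG(value) FROM measurements GROUP BY sensor"
):
    print(sensor, avg_value)
conn.close()
```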

Connection to future chapters

Good practices in data collection will be discussed in Chapter 8. Most Computer Assignments will require various steps of data cleaning to suit the questions being addressed.

B. Exploratory Data Analysis

diagram of exploratory data analysis

Fig. 1.2 Exploratory data analysis

Once we have collected and prepared our data, our first task is to perform Exploratory Data Analysis (EDA). During EDA, key features of the data are visually and numerically summarized through descriptive statistics. Descriptive statistics reveal patterns and detect peculiarities in the dataset.

EDA is typically an iterative process—we examine the data, formulate questions, explore further, refine our understanding, and repeat. This cycle helps us develop hypotheses and determine which inferential methods might be most appropriate for deeper analysis.
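
As a preview, the sketch below performs a first pass of EDA in Python (assuming the pandas and matplotlib libraries are available) on a made-up dataset of study hours and exam scores for ten students.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Made-up data: weekly study hours and exam scores for ten students.
df = pd.DataFrame({
    "hours": [2, 5, 1, 3, 8, 4, 6, 2, 7, 5],
    "score": [65, 80, 58, 70, 93, 75, 85, 62, 90, 78],
})

print(df.describe())  # numerical summaries: mean, quartiles, spread
print(df.corr())      # a first look at how the two variables move together
df.hist()             # visual summaries: one histogram per column
plt.show()
```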

Connection to future chapters

Descriptive statistics are introduced in Chapters 2 and 3. The techniques learned there will serve as supporting tools in Chapters 7-13.

C. Inferential Statistics

Perhaps the most powerful branch of statistics, inferential statistics allows us to extend what we learn from samples to make conclusions about entire populations. Two major branches of statistical inference are:

Parameter Estimation

diagram of parameter estimation

Fig. 1.3 Parameter estimation

Parameter estimation means making an educated guess about a key value of the population, such as the mean, median, or variance. This can be done through

  • Point estimation: finding a single “best guess” for the unknown value

  • Interval estimation: constructing a range to which the unknown value is expected to belong.
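
The following Python sketch (assuming the NumPy and SciPy libraries are available) previews both approaches for a population mean, using a small sample simulated for illustration; the reasoning behind these calculations is developed in Chapter 9.

```python
import numpy as np
from scipy import stats

# Simulate a made-up sample of 12 measurements.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=5, size=12)

# Point estimation: a single best guess for the population mean.
xbar = sample.mean()

# Interval estimation: a 95% confidence interval for the mean,
# based on the t distribution.
low, high = stats.t.interval(0.95, df=len(sample) - 1,
                             loc=xbar, scale=stats.sem(sample))
print(f"point estimate: {xbar:.2f}, 95% CI: ({low:.2f}, {high:.2f})")
```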

Hypothesis Testing

Hypothesis tests evaluate specific claims against evidence from data. Hypothesis testing can be split into four stages:

  1. Making a claim - usually posed as a “yes or no” question asking whether something has changed from a previous belief (the status quo)

  2. Gathering evidence through experiment

  3. Assessing the likelihood - does the evidence support the claim?

  4. Making a conclusion

diagram of hypothesis test

Fig. 1.4 Hypothesis testing
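
To see the four stages in action, here is a minimal Python sketch (assuming NumPy and SciPy are available) that uses a one-sample t test on made-up data. Treat it purely as a preview; the test is developed carefully in Chapter 10.

```python
import numpy as np
from scipy import stats

# Stage 1 (claim): has the mean changed from the status-quo value of 50?
# Stage 2 (evidence): a made-up sample gathered from an experiment.
sample = np.array([52.1, 49.8, 53.4, 51.0, 50.7, 54.2, 52.8, 51.5])

# Stage 3 (likelihood): a one-sample t test weighs the evidence.
result = stats.ttest_1samp(sample, popmean=50)
print(f"t = {result.statistic:.2f}, p-value = {result.pvalue:.4f}")

# Stage 4 (conclusion): a small p-value suggests the data are
# inconsistent with the status quo.
```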

Regardless of the branch, statistical inference always involves the following key elements:

  • Assumption validation: Most inference methods are constructed under a set of assumptions about the characteristics of the data-generating population. We must verify that our data meets these requirements to ensure the reliability of the inference results.

  • Uncertainty quantification: The core of statistical inference is the ability to draw conclusions in the face of uncertainty and to numerically express the degree of confidence in the result.

Connection to future chapters

Foundations of parameter estimation and hypothesis testing will be covered in Chapters 9 and 10, respectively. Both will continue to be used for various inference scenarios in Chapters 11-13.

D. Predictive Analytics

Rather than describing current data, predictive analytics focuses on what might happen next. It often requires large datasets, as it relies on patterns and relationships identified in historical data, which are then used to predict future events. Key elements of predictive analytics are:

  • Modeling of variable relationships: Formalize the relationship between two or more variables, and use this structure as a tool for making predictions. The model can be as simple as a linear pattern or as complex as a deep neural network.

  • Prediction: Make educated guesses about unobserved values based on the identified model and the individual’s observed characteristics (see the sketch below).
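
Here is a minimal Python sketch of both elements. It fits the simplest possible model, a straight line, to made-up historical data on advertising spend and sales, then uses the fitted line to predict sales at an unobserved spend level.

```python
import numpy as np

# Made-up historical data: advertising spend vs. resulting sales.
spend = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sales = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Modeling: formalize the relationship as a line,
# sales ≈ slope * spend + intercept.
slope, intercept = np.polyfit(spend, sales, deg=1)

# Prediction: an educated guess at an unobserved spend level.
new_spend = 6.0
predicted = slope * new_spend + intercept
print(f"predicted sales at spend {new_spend}: {predicted:.1f}")
```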

Connection to future chapters

We’ll touch on predictive methods primarily through linear regression in Chapter 13.

1.1.2. Course Goals: Growing as a Competent Statistician

Having explored what statistics is, let’s consider who practices it. The American Statistical Association defines a statistician as

“a person who applies statistical thinking and methods to a wide variety of scientific, social, and business endeavors.”

This broad definition encompasses professionals working across diverse fields—astronomy, biology, education, economics, engineering, genetics, and many others.

Statisticians serve several crucial functions in research and decision-making:

  • They provide guidance on what information can be considered reliable

  • They help determine which predictions warrant confidence

  • They offer insights that illuminate complex scientific questions

  • They protect against false impressions that might mislead investigators

This course aims to develop your statistical abilities across three progressively sophisticated levels:

Statistical Literacy

At the most fundamental level, statistical literacy includes:

  • Understanding basic data management principles

  • Exploring data effectively through visual and numerical methods

  • Comprehending the vocabulary and notation statisticians use

  • Grasping how probability serves as our framework for measuring uncertainty

  • Recognizing when particular statistical methods are appropriate

Statistical literacy allows you to read and understand statistical information—the minimum required to be an informed consumer of research and data-based claims.

diagram of three stages of statistical competence

Statistical Reasoning

Moving beyond literacy, statistical reasoning involves:

  • Applying statistical tools effectively to answer specific questions

  • Interpreting results correctly within their proper context

  • Communicating findings clearly to various stakeholders

With statistical reasoning skills, you can actively engage with data analysis rather than simply consuming others’ conclusions.

Statistical Thinking

The most sophisticated level, statistical thinking encompasses:

  • Understanding how statistical models represent and simulate real-world phenomena

  • Selecting appropriate inferential tools for specific analytical situations

  • Seeing the entire pipeline from data collection through analysis to interpretation

  • Designing studies and experiments, and understanding why designed experiments are needed to establish causation

Statistical thinking represents the mindset of a practitioner who can navigate the entire statistical process independently.

While developing complete statistical thinking extends beyond a single course, you should achieve statistical literacy and begin developing statistical reasoning skills by semester’s end. These capabilities will serve you well regardless of your career path, enhancing your ability to make decisions in our data-rich world.