What Is Statistics?
========================================
Welcome to the fascinating world of statistics! Statistics surrounds us daily—from weather forecasts and
medical studies to economic reports and political polls. As we begin our journey together this semester,
let's explore some fundamental questions: what exactly is statistics, and what powers its universal presence?
Statistics is a framework for making sense of the real world through data and drawing meaningful conclusions
in the face of uncertainty. The American Statistical Association defines statistics as
**"the science of learning from data and of measuring, controlling, and communicating uncertainty."**
In the remainder of this section, we will see how this definition captures the field of statistics by
examining its major branches and understanding the role of a statistician in making statistics accessible
to a broad audience.
.. admonition:: Road Map đź§
:class: important
* Define the **four branches of statistics**, and understand how they interact.
* Understand the **role of a statistician**.
* Understand what it means to achieve the **three stages of statistical competence**. Understand
where we are now and where we will be at the end of this course.
Primary Branches of Statistics
-----------------------------------
Statistics is a diverse ecosystem of interconnected components. One way to view statistics is
to divide it into four branches: data management, exploratory data analysis, inferential statistics,
and predictive analytics.
A. Data Management: Collection, Cleaning, and Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Data Collection
^^^^^^^^^^^^^^^^^
A data-collecting process begins with careful planning. This entails
- defining **clear objectives** for what you want to achieve with the data,
- determining the **specific questions** that the data is needed to answer,
- developing a **detailed plan**, including timelines, needed resources, roles and responsibilities,
and the data types to be recorded,
- planning for **potential challenges** and how to address them, and
- continuously reviewing the data collection process.
The selection of a data collection method must take into account the question being addressed
and the available resources. Ethical considerations must also be factored in, especially when
the procedure involves human participants. It is important to comply with relevant laws and
regulations (e.g., FERPA, HIPAA, GDPR), obtain informed consent from all participants, and
ensure privacy and confidentiality.
Some widely used collection methods include surveys and questionnaires, interviews, observations,
experiments, web scraping, sensor-based and electronic data collection, and document and record reviews.
Data Cleaning
^^^^^^^^^^^^^^^^^
This stage consists of an overall inspection and organization of collected datasets.
Some possible issues addressed during this stage are:
- **Correction of errors** such as duplicate entries
- **Consolidating data** from various sources
- **Resolving conflicts** in data types, values, and formats
- Handling **missing data**. Different strategies may be chosen depending on the nature of the missing entry.
Some possibilities are deletion, imputation (making an educated guess),
and flagging (not taking an immediate action other than making a note).
Data cleaning is often a repeated process, especially when new data is collected periodically. It is
advisable to maintain comprehensive documentation of the data cleaning steps. This facilitates
reviewing and refining the process to accommodate new data or adapt to updated standards.
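To make these ideas concrete, here is a minimal sketch of a typical cleaning pass. The computing tools
for this course are not fixed in this section, so the example uses Python with the pandas library and a
small, made-up table; the column names and values are purely hypothetical.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Hypothetical raw data with common issues: a duplicate entry,
   # a value stored with the wrong type ("72" as text), and a missing value.
   raw = pd.DataFrame({
       "participant_id": [101, 102, 102, 103, 104],
       "age":            [34, 28, 28, "72", 45],
       "income":         [52000, 61000, 61000, np.nan, 48000],
   })

   # Correction of errors: remove exact duplicate entries.
   clean = raw.drop_duplicates().copy()

   # Resolving conflicts in data types: coerce age to a numeric column.
   clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

   # Handling missing data: impute income with the median, and flag the
   # imputed entry so the decision is documented.
   clean["income_flagged"] = clean["income"].isna()
   clean["income"] = clean["income"].fillna(clean["income"].median())

   print(clean)

Each comment above corresponds to one of the issues listed earlier; keeping such a script under version
control is one simple way to maintain the documentation recommended above.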
Data Storage
^^^^^^^^^^^^^^^
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter1/data-growth.png
:alt: Poster illustrating the growth in the amount of available data (Raconteur)
:align: center
:figwidth: 90%
Growth of data
We live in an era of unprecedented data abundance:
- Social media platforms generate billions of interactions daily
- Location services track geographical movements of people and vehicles
- Wearable devices monitor health metrics around the clock
- Business transactions create detailed records of economic activity
By 2025, experts predict that approximately 463 exabytes of data will be created worldwide every day—a volume
almost impossible to conceptualize (one exabyte equals one billion gigabytes). This explosion of data
creates both opportunities and challenges for statisticians.
In this course, we'll focus primarily on structured data in manageable volumes, building foundational
skills to master essential principles that remain relevant regardless of data scale or format.
When data is large or has a complex structure, an efficient storage solution must be implemented by
establishing appropriate data schemas. When a dataset consists of multiple parts, it should be
cataloged or integrated into a unified, condensed form. Various storage solutions exist for such
datasets, including relational databases, data warehouses, data lakes, and cloud storage.
Once the data is stored, accessing and retrieving it also requires careful planning due to factors
such as volume, privacy constraints, or regulatory sensitivity. It may be necessary to develop
efficient querying mechanisms using SQL, utilize APIs, implement robust user access controls, apply
encryption, and more.
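As a small illustration of querying stored data, the sketch below uses Python's built-in ``sqlite3``
module as a stand-in for a relational database; the table, columns, and values are hypothetical, and a
production system would add the access controls and encryption mentioned above.

.. code-block:: python

   import sqlite3

   # Hypothetical relational store; sqlite3 stands in for a real database server.
   conn = sqlite3.connect(":memory:")
   conn.execute("CREATE TABLE measurements (sensor_id INTEGER, reading REAL)")
   conn.executemany(
       "INSERT INTO measurements VALUES (?, ?)",
       [(1, 20.4), (1, 21.1), (2, 19.8)],
   )

   # Efficient retrieval: query only the summary that is needed instead of
   # pulling the entire dataset into memory.
   rows = conn.execute(
       "SELECT sensor_id, AVG(reading) FROM measurements GROUP BY sensor_id"
   ).fetchall()
   print(rows)  # [(1, 20.75), (2, 19.8)]
   conn.close()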
.. admonition:: Connection to future chapters
:class: note
Good practices in data collection will be discussed in **Chapter 8**. Most **Computer Assignments**
will require various steps of data cleaning to suit the questions being addressed.
B. Exploratory Data Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter1/eda-diagram.png
:alt: diagram of exploratory data analysis
:align: left
:figwidth: 40%
Exploratory data analysis
Once we have collected and prepared our data, our first task is to perform
**Exploratory Data Analysis (EDA)**. During EDA, key features of the data are visually and
numerically summarized through **descriptive statistics**. Descriptive statistics reveal
patterns and detect peculiarities in the dataset.
EDA is typically an iterative process—we examine the data, formulate questions, explore further, refine
our understanding, and repeat. This cycle helps us develop hypotheses and determine which inferential methods might
be most appropriate for deeper analysis.
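As a brief illustration, the sketch below computes a few descriptive statistics and a simple plot for a
small, hypothetical sample; it assumes Python with pandas and matplotlib, though the same summaries can be
produced with any statistical software.

.. code-block:: python

   import pandas as pd
   import matplotlib.pyplot as plt

   # Hypothetical sample of exam scores, used only to illustrate EDA.
   scores = pd.Series([72, 85, 91, 64, 78, 88, 95, 70, 83, 79])

   # Numerical summaries: count, mean, standard deviation, and quartiles.
   print(scores.describe())

   # Visual summary: a histogram of the sample.
   scores.plot(kind="hist", bins=5, title="Exam scores")
   plt.show()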
.. admonition:: Connection to future chapters
:class: note
Descriptive statistics are introduced in **Chapters 2 and 3**. The techniques learned there will serve as
supporting tools in **Chapters 7-13**.
C. Inferential Statistics
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Perhaps the most powerful branch of statistics, inferential statistics allows us to **extend what we learn from samples
to make conclusions about entire populations**. Two major branches of statistical inference are:
Parameter Estimation
^^^^^^^^^^^^^^^^^^^^^^
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter1/estimation-diagram.png
:alt: diagram of parameter estimation
:align: right
:figwidth: 40%
Parameter estimation
Parameter estimation is the process of making an educated guess about a key value of the population, such as its mean, median, or variance.
This can be done through
* **Point estimation**: finding a single "best guess" for the unknown value
* **Interval estimation**: constructing a range to which the unknown value is expected to belong.
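As a preview, the sketch below produces both kinds of estimates for a population mean from a small,
hypothetical sample. It assumes Python with NumPy and SciPy; the confidence-interval machinery it relies
on is developed properly in later chapters.

.. code-block:: python

   import numpy as np
   from scipy import stats

   # Hypothetical sample from a population whose mean we want to estimate.
   sample = np.array([4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 4.7])

   # Point estimation: a single "best guess" for the population mean.
   point_estimate = sample.mean()

   # Interval estimation: a 95% confidence interval for the population mean,
   # based on the t distribution (covered in detail later in the course).
   low, high = stats.t.interval(
       0.95, df=len(sample) - 1, loc=point_estimate, scale=stats.sem(sample)
   )

   print(f"point estimate: {point_estimate:.2f}")
   print(f"95% interval estimate: ({low:.2f}, {high:.2f})")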
Hypothesis Testing
^^^^^^^^^^^^^^^^^^^^
Hypothesis tests evaluate specific claims against evidence from data. Hypothesis testing
can be split into four stages:
1) Making a claim - usually phrased as a "yes or no" question asking whether something has changed
from a previous belief (the **status quo**)
2) Gathering evidence through an experiment
3) Assessing the likelihood - does the evidence support the claim?
4) Making a conclusion
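The sketch below walks through these four stages on a made-up example using a one-sample t-test from
SciPy; the scenario, the numbers, and the 0.05 threshold are all hypothetical choices for illustration,
and the reasoning behind each stage is developed in Chapter 10.

.. code-block:: python

   import numpy as np
   from scipy import stats

   # Stage 1: Make a claim. Status quo: a machine fills bottles to 500 ml on
   # average. Question: has the mean fill amount changed?

   # Stage 2: Gather evidence through an experiment (hypothetical measurements).
   fills = np.array([498.2, 501.5, 497.9, 499.4, 500.8, 496.7, 498.9, 499.1])

   # Stage 3: Assess the likelihood of seeing such evidence if the status quo
   # were true (the p-value quantifies this).
   t_stat, p_value = stats.ttest_1samp(fills, popmean=500)

   # Stage 4: Make a conclusion by comparing the p-value to a chosen threshold.
   print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")
   if p_value < 0.05:
       print("Evidence suggests the mean fill amount has changed.")
   else:
       print("Not enough evidence to conclude the mean has changed.")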
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter1/hypothesis-test-diagram.png
:alt: diagram of hypothesis test
:align: right
:figwidth: 40%
Hypothesis testing
Regardless of the branch, statistical inference always involves the following key elements:
- **Assumption validation**: Most inference methods are constructed under a set of assumptions
about the characteristics of the data-generating population. It must be verified that our
data meets the requirements to ensure reliability of the inference results.
- **Uncertainty quantification**: The core of statistical inference is the ability to
draw conclusions in the face of uncertainty and to numerically express the degree of confidence
in the result.
.. admonition:: Connection to future chapters
:class: note
Foundations of parameter estimation and hypothesis testing will be covered in **Chapters 9 and 10**,
respectively. Both will continue to be used for various inference scenarios in **Chapters 11-13**.
D. Predictive Analytics
~~~~~~~~~~~~~~~~~~~~~~~~~~
While the other branches focus on making sense of the data at hand, predictive analytics focuses on what might happen next.
Predictive analytics often requires large datasets, as it relies on identifying patterns and
relationships in historical data, which are then used to predict future events.
Key elements of predictive analytics are:
- **Modeling of variable relationships**: Formalize the relationship between two or more
variables and use this structure as a tool for making predictions. The model can be as simple as
a linear pattern or as complex as a deep neural network.
- **Prediction**: Make educated guesses about unobserved values based on the identified **model**
and the individual's observed characteristics.
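As a preview of Chapter 13, here is a minimal sketch of both elements using a simple linear model; the
data (hours studied versus exam score) are made up, and the least-squares fit is done with NumPy purely
for illustration.

.. code-block:: python

   import numpy as np

   # Hypothetical historical data: hours studied and the resulting exam score.
   hours = np.array([2, 4, 5, 7, 8, 10])
   score = np.array([58, 65, 71, 80, 84, 93])

   # Modeling of variable relationships: fit a simple linear pattern
   # (a least-squares line) to the historical observations.
   slope, intercept = np.polyfit(hours, score, deg=1)
   print(f"model: score = {slope:.2f} * hours + {intercept:.2f}")

   # Prediction: use the fitted model and an individual's observed
   # characteristic (6 hours of study) to guess the unobserved score.
   predicted = slope * 6 + intercept
   print(f"predicted score for 6 hours of study: {predicted:.1f}")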
.. admonition:: Connection to future chapters
:class: note
We'll touch on predictive methods primarily through linear regression in **Chapter 13**.
Course Goals: Growing as a Competent Statistician
-----------------------------------------------------------
Having explored what statistics is, let's consider who practices it. The American Statistical Association
defines a statistician as
**"a person who applies statistical thinking and methods to a wide variety of scientific, social,
and business endeavors."**
This broad definition encompasses professionals working across diverse fields—astronomy, biology,
education, economics, engineering, genetics, and many others.
Statisticians serve several crucial functions in research and decision-making:
- They provide guidance on what information can be considered reliable
- They help determine which predictions warrant confidence
- They offer insights that illuminate complex scientific questions
- They protect against false impressions that might mislead investigators
This course aims to develop your statistical abilities across three progressively sophisticated levels:
Statistical Literacy
~~~~~~~~~~~~~~~~~~~~~~
At the most fundamental level, statistical literacy includes:
- Understanding basic data management principles
- Exploring data effectively through visual and numerical methods
- Comprehending the vocabulary and notation statisticians use
- Grasping how probability serves as our framework for measuring uncertainty
- Recognizing when particular statistical methods are appropriate
Statistical literacy allows you to read and understand statistical information—the minimum required
to be an informed consumer of research and data-based claims.
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter1/statistical-competence-diagram.png
:alt: diagram of three stages of statistical competence
:figwidth: 40%
:align: right
Statistical Reasoning
~~~~~~~~~~~~~~~~~~~~~~~~
Moving beyond literacy, statistical reasoning involves:
- Applying statistical tools effectively to answer specific questions
- Interpreting results correctly within their proper context
- Communicating findings clearly to various stakeholders
With statistical reasoning skills, you can actively engage with data analysis rather
than simply consuming others' conclusions.
Statistical Thinking
~~~~~~~~~~~~~~~~~~~~~~~~~
The most sophisticated level, statistical thinking encompasses:
- Understanding how statistical models represent and simulate real-world phenomena
- Selecting appropriate inferential tools for specific analytical situations
- Seeing the entire pipeline from data collection through analysis to interpretation
- Being able to design studies and experiments, and understanding why designed
experiments are needed to establish causation.
Statistical thinking represents the mindset of a practitioner who can navigate
the entire statistical process independently.
While developing complete statistical thinking extends beyond a single course, by semester's end,
you should achieve statistical literacy and begin developing reasoning skills. These capabilities
will serve you well regardless of your career path, enhancing your ability to make decisions in our
data-rich world.