What Is Statistics?
========================================
Welcome to the fascinating world of statistics! Statistics surrounds us daily—from weather forecasts and
medical studies to economic reports and political polls. As we begin our journey together this semester,
let's explore some fundamental questions: what exactly is statistics, and what powers its universal presence?
Statistics is a framework for making sense of the real world through data and drawing meaningful conclusions
in the face of uncertainty. The American Statistical Association defines statistics as
**"the science of learning from data and of measuring, controlling, and communicating uncertainty."**
In the remainder of this section, we will see how this definition captures the field of statistics by
examining its major branches and understanding the role of a statistician in making statistics accessible
to a broad audience.
.. admonition:: Road Map đź§
:class: important
* Define the **four branches of statistics**, and understand how they interact.
* Understand the **role of a statistician**.
* Understand what it means to achieve the **three stages of statistical competence**. Understand
where we are now and where we will be at the end of this course.
Primary Branches of Statistics
-----------------------------------
Statistics is a diverse ecosystem of interconnected components. One way to view statistics is
to divide it into four branches: data management, exploratory data analysis, inferential statistics,
and predictive analytics.
A. Data Management: Collection, Cleaning, and Storage
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Data Collection
^^^^^^^^^^^^^^^^^
A data-collecting process begins with careful planning. This entails
- defining **clear objectives** for what you want to achieve with the data,
- determining the **specific questions** that the data is needed to answer,
- developing a **detailed plan**, including timelines, needed resources, roles and responsibilities,
and the data types to be recorded,
- planning for **potential challenges** and how to address them, and
- continuously reviewing the data collection process.
The selection of a data collection method must take into account the question being addressed
and the available resources. Ethical considerations must also be factored in, especially when
the procedure involves human participants. It is important to comply with relevant laws and
regulations (e.g., FERPA, HIPAA, GDPR), obtain informed consent from all participants, and
ensure privacy and confidentiality.
Some widely used collection methods include surveys and questionnaires, interviews, observations,
experiments, web scraping, sensor-based and electronic data collection, and document and record reviews.
Data Cleaning
^^^^^^^^^^^^^^^^^
This stage consists of an overall inspection and organization of collected datasets.
Some possible issues addressed during this stage are:
- **Correction of errors** such as duplicate entries
- **Consolidating data** from various sources
- **Resolving conflicts** in data types, values, and formats
- Handling **missing data**. Different strategies may be chosen depending on the nature of the missing entry.
Some possibilities are deletion, imputation (making an educated guess),
and flagging (not taking an immediate action other than making a note).
Data cleaning is often a repeated process, especially when new data is collected periodically. It is
advisable to maintain comprehensive documentation of the data cleaning steps. This facilitates
reviewing and refining the process to accommodate new data or adapt to updated standards.
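To make these ideas concrete, here is a minimal sketch of a typical cleaning pass. The computing tools
for this course are not fixed in this section, so the example uses Python with the pandas library and a
small, made-up table; the column names and values are purely hypothetical.

.. code-block:: python

   import numpy as np
   import pandas as pd

   # Hypothetical raw data with common issues: a duplicate entry,
   # a value stored with the wrong type ("72" as text), and a missing value.
   raw = pd.DataFrame({
       "participant_id": [101, 102, 102, 103, 104],
       "age":            [34, 28, 28, "72", 45],
       "income":         [52000, 61000, 61000, np.nan, 48000],
   })

   # Correction of errors: remove exact duplicate entries.
   clean = raw.drop_duplicates().copy()

   # Resolving conflicts in data types: coerce age to a numeric column.
   clean["age"] = pd.to_numeric(clean["age"], errors="coerce")

   # Handling missing data: impute income with the median, and flag the
   # imputed entry so the decision is documented.
   clean["income_flagged"] = clean["income"].isna()
   clean["income"] = clean["income"].fillna(clean["income"].median())

   print(clean)

Each comment above corresponds to one of the issues listed earlier; keeping such a script under version
control is one simple way to maintain the documentation recommended above.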
Data Storage
^^^^^^^^^^^^^^^
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter1/data-growth.png
:alt: Poster illustrating the growth in the amount of available data (Raconteur)
:align: center
:figwidth: 90%
Growth of data
We live in an era of unprecedented data abundance:
- Social media platforms generate billions of interactions daily
- Location services track geographical movements of people and vehicles
- Wearable devices monitor health metrics around the clock
- Business transactions create detailed records of economic activity
By 2025, experts predict that approximately 463 exabytes of data will be created worldwide every day—a volume
almost impossible to conceptualize (one exabyte equals one billion gigabytes). This explosion of data
creates both opportunities and challenges for statisticians.
In this course, we'll focus primarily on structured data in manageable volumes, building foundational
skills to master essential principles that remain relevant regardless of data scale or format.
When data is large or has a complex structure, an efficient storage solution must be implemented by
establishing appropriate data schemas. When a dataset consists of multiple parts, it should be
cataloged or integrated into a unified, condensed form. Various storage solutions exist for such
datasets, including relational databases, data warehouses, data lakes, and cloud storage.
Once the data is stored, accessing and retrieving it also requires careful planning due to factors
such as volume, privacy constraints, or regulatory sensitivity. It may be necessary to develop
efficient querying mechanisms using SQL, utilize APIs, implement robust user access controls, apply
encryption, and more.
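As a small illustration of querying stored data, the sketch below uses Python's built-in ``sqlite3``
module as a stand-in for a relational database; the table, columns, and values are hypothetical, and a
production system would add the access controls and encryption mentioned above.

.. code-block:: python

   import sqlite3

   # Hypothetical relational store; sqlite3 stands in for a real database server.
   conn = sqlite3.connect(":memory:")
   conn.execute("CREATE TABLE measurements (sensor_id INTEGER, reading REAL)")
   conn.executemany(
       "INSERT INTO measurements VALUES (?, ?)",
       [(1, 20.4), (1, 21.1), (2, 19.8)],
   )

   # Efficient retrieval: query only the summary that is needed instead of
   # pulling the entire dataset into memory.
   rows = conn.execute(
       "SELECT sensor_id, AVG(reading) FROM measurements GROUP BY sensor_id"
   ).fetchall()
   print(rows)  # [(1, 20.75), (2, 19.8)]
   conn.close()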
.. admonition:: Connection to future chapters
:class: note
Good practices in data collection will be discussed in **Chapter 8**. Most **Computer Assignments**
will require various steps of data cleaning to suit the questions being addressed.
B. Exploratory Data Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter1/eda-diagram.png
:alt: diagram of exploratory data analysis
:align: left
:figwidth: 40%
Exploratory data analysis
Once we have collected and prepared our data, our first task is to perform
**Exploratory Data Analysis (EDA)**. During EDA, key features of the data are visually and
numerically summarized through **descriptive statistics**. Descriptive statistics reveal
patterns and detect peculiarities in the dataset.
EDA is typically an iterative process—we examine the data, formulate questions, explore further, refine
our understanding, and repeat. This cycle helps us develop hypotheses and determine which inferential methods might
be most appropriate for deeper analysis.
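As a brief illustration, the sketch below computes a few descriptive statistics and a simple plot for a
small, hypothetical sample; it assumes Python with pandas and matplotlib, though the same summaries can be
produced with any statistical software.

.. code-block:: python

   import pandas as pd
   import matplotlib.pyplot as plt

   # Hypothetical sample of exam scores, used only to illustrate EDA.
   scores = pd.Series([72, 85, 91, 64, 78, 88, 95, 70, 83, 79])

   # Numerical summaries: count, mean, standard deviation, and quartiles.
   print(scores.describe())

   # Visual summary: a histogram of the sample.
   scores.plot(kind="hist", bins=5, title="Exam scores")
   plt.show()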
.. admonition:: Connection to future chapters
:class: note
Descriptive statistics are introduced in **Chapters 2 and 3**. The techniques learned there will serve as
supporting tools in **Chapters 7-13**.
C. Inferential Statistics
~~~~~~~~~~~~~~~~~~~~~~~~~~~
Perhaps the most powerful branch of statistics, inferential statistics allows us to **extend what we learn from samples
to make conclusions about entire populations**. Two major branches of statistical inference are:
Parameter Estimation
^^^^^^^^^^^^^^^^^^^^^^
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter1/estimation-diagram.png
:alt: diagram of parameter estimation
:align: right
:figwidth: 40%
Parameter estimation
Parameter estimation is the process of making an educated guess about a key value of the population, such as its mean, median, or variance.
This can be done through
* **Point estimation**: finding a single "best guess" for the unknown value
* **Interval estimation**: constructing a range to which the unknown value is expected to belong.
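As a preview, the sketch below produces both kinds of estimates for a population mean from a small,
hypothetical sample. It assumes Python with NumPy and SciPy; the confidence-interval machinery it relies
on is developed properly in later chapters.

.. code-block:: python

   import numpy as np
   from scipy import stats

   # Hypothetical sample from a population whose mean we want to estimate.
   sample = np.array([4.8, 5.1, 5.6, 4.9, 5.3, 5.0, 5.4, 4.7])

   # Point estimation: a single "best guess" for the population mean.
   point_estimate = sample.mean()

   # Interval estimation: a 95% confidence interval for the population mean,
   # based on the t distribution (covered in detail later in the course).
   low, high = stats.t.interval(
       0.95, df=len(sample) - 1, loc=point_estimate, scale=stats.sem(sample)
   )

   print(f"point estimate: {point_estimate:.2f}")
   print(f"95% interval estimate: ({low:.2f}, {high:.2f})")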
Hypothesis Testing
^^^^^^^^^^^^^^^^^^^^
Hypothesis tests evaluate specific claims against evidence from data. Hypothesis testing
can be split into four stages:
1) Making a claim - usually phrased as a "yes or no" question asking whether something has changed
from a previous belief (the **status quo**)
2) Gathering evidence through an experiment
3) Assessing the likelihood - does the evidence support the claim?
4) Making a conclusion
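The sketch below walks through these four stages on a made-up example using a one-sample t-test from
SciPy; the scenario, the numbers, and the 0.05 threshold are all hypothetical choices for illustration,
and the reasoning behind each stage is developed in Chapter 10.

.. code-block:: python

   import numpy as np
   from scipy import stats

   # Stage 1: Make a claim. Status quo: a machine fills bottles to 500 ml on
   # average. Question: has the mean fill amount changed?

   # Stage 2: Gather evidence through an experiment (hypothetical measurements).
   fills = np.array([498.2, 501.5, 497.9, 499.4, 500.8, 496.7, 498.9, 499.1])

   # Stage 3: Assess the likelihood of seeing such evidence if the status quo
   # were true (the p-value quantifies this).
   t_stat, p_value = stats.ttest_1samp(fills, popmean=500)

   # Stage 4: Make a conclusion by comparing the p-value to a chosen threshold.
   print(f"t = {t_stat:.2f}, p-value = {p_value:.3f}")
   if p_value < 0.05:
       print("Evidence suggests the mean fill amount has changed.")
   else:
       print("Not enough evidence to conclude the mean has changed.")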
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter1/hypothesis-test-diagram.png
:alt: diagram of hypothesis test
:align: right
:figwidth: 40%
Hypothesis testing
Regardless of the branch, statistical inference always involves the following key elements:
- **Assumption validation**: Most inference methods are constructed under a set of assumptions
about the characteristics of the data-generating population. It must be verified that our
data meets the requirements to ensure reliability of the inference results.
- **Uncertainty quantification**: The core of statistical inference is the ability to
draw conclusions in the face of uncertainty and to numerically express the degree of confidence
in the result.
.. admonition:: Connection to future chapters
:class: note
Foundations of parameter estimation and hypothesis testing will be covered in **Chapters 9 and 10**,
respectively. Both will continue to be used for various inference scenarios in **Chapters 11-13**.
D. Predictive Analytics
~~~~~~~~~~~~~~~~~~~~~~~~~~
While the other branches focus on making sense of the data at hand, predictive analytics focuses on what might happen next.
Predictive analytics often requires large datasets, as it relies on identifying patterns and
relationships in historical data, which are then used to predict future events.
Key elements of predictive analytics are:
- **Modeling of variable relationships**: Formalize the relationship between two or more
variables and use this structure as a tool for making predictions. The model can be as simple as
a linear pattern or as complex as a deep neural network.
- **Prediction**: Make educated guesses about unobserved values based on the identified **model**
and the individual's observed characteristics.
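As a preview of Chapter 13, here is a minimal sketch of both elements using a simple linear model; the
data (hours studied versus exam score) are made up, and the least-squares fit is done with NumPy purely
for illustration.

.. code-block:: python

   import numpy as np

   # Hypothetical historical data: hours studied and the resulting exam score.
   hours = np.array([2, 4, 5, 7, 8, 10])
   score = np.array([58, 65, 71, 80, 84, 93])

   # Modeling of variable relationships: fit a simple linear pattern
   # (a least-squares line) to the historical observations.
   slope, intercept = np.polyfit(hours, score, deg=1)
   print(f"model: score = {slope:.2f} * hours + {intercept:.2f}")

   # Prediction: use the fitted model and an individual's observed
   # characteristic (6 hours of study) to guess the unobserved score.
   predicted = slope * 6 + intercept
   print(f"predicted score for 6 hours of study: {predicted:.1f}")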
.. admonition:: Connection to future chapters
:class: note
We'll touch on predictive methods primarily through linear regression in **Chapter 13**.
Course Goals: Growing as a Competent Statistician
-----------------------------------------------------------
Having explored what statistics is, let's consider who practices it. The American Statistical Association
defines a statistician as
**"a person who applies statistical thinking and methods to a wide variety of scientific, social,
and business endeavors."**
This broad definition encompasses professionals working across diverse fields—astronomy, biology,
education, economics, engineering, genetics, and many others.
Statisticians serve several crucial functions in research and decision-making:
- They provide guidance on what information can be considered reliable
- They help determine which predictions warrant confidence
- They offer insights that illuminate complex scientific questions
- They protect against false impressions that might mislead investigators
This course aims to develop your statistical abilities across three progressively sophisticated levels:
Statistical Literacy
~~~~~~~~~~~~~~~~~~~~~~
At the most fundamental level, statistical literacy includes:
- Understanding basic data management principles
- Exploring data effectively through visual and numerical methods
- Comprehending the vocabulary and notation statisticians use
- Grasping how probability serves as our framework for measuring uncertainty
- Recognizing when particular statistical methods are appropriate
Statistical literacy allows you to read and understand statistical information—the minimum required
to be an informed consumer of research and data-based claims.
.. figure:: https://yjjpfnblgtrogqvcjaon.supabase.co/storage/v1/object/public/stat-350-assets/images/chapter1/statistical-competence-diagram.png
:alt: diagram of three stages of statistical competence
:figwidth: 40%
:align: right
Statistical Reasoning
~~~~~~~~~~~~~~~~~~~~~~~~
Moving beyond literacy, statistical reasoning involves:
- Applying statistical tools effectively to answer specific questions
- Interpreting results correctly within their proper context
- Communicating findings clearly to various stakeholders
With statistical reasoning skills, you can actively engage with data analysis rather
than simply consuming others' conclusions.
Statistical Thinking
~~~~~~~~~~~~~~~~~~~~~~~~~
The most sophisticated level, statistical thinking encompasses:
- Understanding how statistical models represent and simulate real-world phenomena
- Selecting appropriate inferential tools for specific analytical situations
- Seeing the entire pipeline from data collection through analysis to interpretation
- Being able to design studies and experiments, and understanding why designed
experiments are needed to establish causation.
Statistical thinking represents the mindset of a practitioner who can navigate
the entire statistical process independently.
While developing complete statistical thinking extends beyond a single course, by semester's end,
you should achieve statistical literacy and begin developing reasoning skills. These capabilities
will serve you well regardless of your career path, enhancing your ability to make decisions in our
data-rich world.