.. _r_datasets:

Course Datasets
-------------------------------------------------

Primary Course Dataset - AppRating
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The **AppRating dataset** is the central dataset for all computer assignments. It contains app ratings and related metrics, which students analyze with different statistical techniques as they progress through the course.

**Fall 2025 Session**

- Dataset: `AppRatingFALL2025.csv <https://treese41528.github.io/STAT350/AppRatingFALL2025.csv>`_
- Description: `AppRatingDescription.pdf <https://treese41528.github.io/STAT350/AppRatingDescription.pdf>`_

**Winter 2025 Session**

- Dataset: `AppRatingWINTER2025.csv <https://treese41528.github.io/STAT350/AppRatingWINTER2025.csv>`_
- Description: `AppRatingDescription.pdf <https://treese41528.github.io/STAT350/AppRatingDescription.pdf>`_

.. important::
   The AppRating dataset is used in **all six computer assignments**. Download the appropriate version for your session at the beginning of the course and use it throughout.

Tutorial Support Datasets
~~~~~~~~~~~~~~~~~~~~~~~~~~

These datasets are available in the Computer Assignment Tutorials Data folder and are used for demonstrations and additional practice:

**CSV Format Datasets**

- `Bikedata_clean.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/Bikedata_clean.csv>`_ - Cleaned bicycle data
- `DMS.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/DMS.csv>`_ - DMS measurements
- `eduproduct.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/eduproduct.csv>`_ - Educational product data
- `eg01-23time24.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/eg01-23time24.csv>`_ - Time series example
- `ex07-39mpgdiff.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/ex07-39mpgdiff.csv>`_ - MPG difference data
- `furnace.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/furnace.csv>`_ - Furnace efficiency data
- `helicon_cleaned.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/helicon_cleaned.csv>`_ - Cleaned helicon measurements
- `helicon_m.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/helicon_m.csv>`_ - Helicon measurement data
- `linebackers.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/linebackers.csv>`_ - Football linebacker statistics
- `loc.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/loc.csv>`_ - Location data
- `movies.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/movies.csv>`_ - Movie ratings and information
- `studyhabits.csv <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/studyhabits.csv>`_ - Student study habits survey

**Text Format Datasets**

- `ANOVA paxil.txt <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/ANOVA%20paxil.txt>`_ - ANOVA example with Paxil data
- `linebackers.txt <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/linebackers.txt>`_ - Text version of linebacker data
- `singer1.txt <https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/singer1.txt>`_ - Singer height data

Loading Datasets in R
~~~~~~~~~~~~~~~~~~~~~~

**Loading CSV files:**

.. code-block:: r

   # From local file (after downloading)
   d <- read.csv("data/helicon_m.csv")
   
   # Directly from URL
   d <- read.csv("https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/helicon_m.csv")

**Loading text files:**

.. code-block:: r

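   # Whitespace-separated text file (the default sep = "" splits on spaces and tabs)
   d <- read.table("data/linebackers.txt", header = TRUE)

   # If the file is tab-separated and fields may contain spaces, set sep explicitly
   d <- read.table("data/ANOVA paxil.txt", header = TRUE, sep = "\t")

   # From URL (note the %20 encoding for the space in the file name);
   # add sep = "\t" here as well if the file is tab-separated
   d <- read.table("https://treese41528.github.io/STAT350/Computer_Assignment_Tutorials/Data/ANOVA%20paxil.txt",
                   header = TRUE)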
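
When it is unclear whether a text file is space- or tab-separated, one option is to inspect the raw lines and count the fields under each candidate separator before calling ``read.table()``. The sketch below uses a small made-up file written to a temporary location; its contents and column names are assumptions for illustration and do not come from the course datasets.

.. code-block:: r

   # Hypothetical example file (not a course dataset), written to a temp location
   tmp <- tempfile(fileext = ".txt")
   writeLines(c("group\tscore", "low dose\t12.5", "high dose\t9.8"), tmp)

   # Look at the raw lines first; a printed "\t" marks a tab character
   readLines(tmp, n = 3)

   # Count the fields on each line under each candidate separator
   fields_ws  <- count.fields(tmp, sep = "")    # split on any whitespace
   fields_tab <- count.fields(tmp, sep = "\t")  # split on tabs only

   fields_ws   # 2 3 3: inconsistent counts, so whitespace splitting is wrong here
   fields_tab  # 2 2 2: consistent counts, so the file is tab-separated

   # Read the file with the separator that gave consistent field counts
   d <- read.table(tmp, header = TRUE, sep = "\t")
   str(d)

A separator that yields the same field count on every line is usually the right one; unequal counts are a sign that ``read.table()`` would fail or misalign columns.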

Built-in R Datasets Used in Course
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The course also uses several built-in R datasets for examples and demonstrations:

**Primary Built-in Datasets**

- ``iris`` - Fisher's iris flower measurements (150 obs, 5 variables)
- ``mtcars`` - Motor Trend car statistics (32 cars, 11 variables)
- ``sleep`` - Student's sleep data (effect of two soporific drugs), used for paired t-tests (20 obs, 3 variables)
- ``CO2`` - Carbon dioxide uptake in grass plants (84 obs, 5 variables)
- ``AirPassengers`` - Monthly airline passenger numbers (time series)

**Additional Built-in Datasets for Practice**

- ``chickwts`` - Chicken weights by feed type (ANOVA examples)
- ``PlantGrowth`` - Plant growth under different treatments
- ``InsectSprays`` - Effectiveness of insect sprays
- ``ToothGrowth`` - Tooth growth in guinea pigs
- ``faithful`` - Old Faithful geyser eruption data

**Loading Built-in Datasets**

.. code-block:: r

   # Load a specific dataset into the workspace
   # (optional: built-in datasets can also be used directly by name)
   data(iris)
   
   # View available datasets
   data()
   
   # Get help on a dataset
   ?iris
   
   # View structure
   str(iris)
   head(iris)
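
As a quick check, the sizes quoted for the primary built-in datasets can be confirmed directly in R. This is a minimal sketch that uses only base R and the ``datasets`` package that ships with it.

.. code-block:: r

   # Dimensions of the data-frame datasets: rows (observations) x columns (variables)
   dim(iris)    # 150   5
   dim(mtcars)  #  32  11
   dim(sleep)   #  20   3
   dim(CO2)     #  84   5

   # AirPassengers is a time series (class "ts"), not a data frame
   class(AirPassengers)      # "ts"
   frequency(AirPassengers)  # 12 observations per year (monthly data)

Because ``AirPassengers`` is not a data frame, data-frame helpers may not behave as you expect on it; ``str(AirPassengers)`` shows its time-series structure instead.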

Data Download and Organization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Recommended Folder Structure:**

.. code-block:: text

   STAT350_Project/
   ├── data/
   │   ├── AppRatingFALL2025.csv  # Your main dataset (use your session's file)
   │   ├── helicon_m.csv        # Tutorial datasets
   │   └── ...other datasets
   ├── scripts/
   │   ├── CA1.R
   │   ├── CA2.R
   │   └── ...
   └── output/
       ├── figures/
       └── tables/

**Download Instructions:**

1. **Create project structure:** Set up the folders as shown above.
2. **Download AppRating:** Save your session's version to the ``data/`` folder.
3. **Download tutorial data:** Save tutorial datasets to ``data/`` as needed for each assignment.
4. **Set working directory:** Open the folder as an RStudio Project, or call ``setwd()`` with the path to your project folder.
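
The steps above can also be scripted. The following is a minimal sketch, assuming it is run from inside the project folder with an internet connection. The variable names (``base_url``, ``session_file``, ``dest``, ``app``) are illustrative choices, and the file name shown is the Fall 2025 version listed on this page.

.. code-block:: r

   # Sketch only: swap in your own session's file name if it differs
   base_url     <- "https://treese41528.github.io/STAT350"
   session_file <- "AppRatingFALL2025.csv"

   # Step 1: create the folders (no warning if they already exist)
   for (folder in c("data", "scripts", "output/figures", "output/tables")) {
     dir.create(folder, recursive = TRUE, showWarnings = FALSE)
   }

   # Step 2: download the dataset once; skip the download if the file is present
   dest <- file.path("data", session_file)
   if (!file.exists(dest)) {
     download.file(paste0(base_url, "/", session_file), destfile = dest, mode = "wb")
   }

   # Step 4: relative paths depend on the working directory, so confirm it
   getwd()
   app <- read.csv(dest)

Keeping a local copy in ``data/`` means later scripts can use the short relative path and do not depend on network access.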

**Verification After Loading:**

After loading a dataset, check that it was read as expected before starting any analysis:

.. code-block:: r

   # Check structure
   str(d)
   
   # Check dimensions
   dim(d)
   
   # Count missing values across the whole data frame
   sum(is.na(d))
   
   # Summary statistics
   summary(d)
   
   # First/last few rows
   head(d)
   tail(d)
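
The total from ``sum(is.na(d))`` does not show which columns contain the missing values. One possible way to get a per-column overview is a small helper like the one below. The function name ``check_data`` and the example data frame are hypothetical and are not part of the course materials.

.. code-block:: r

   # Hypothetical helper: one row per variable with its class and missing count
   check_data <- function(d) {
     stopifnot(is.data.frame(d))
     data.frame(
       variable  = names(d),
       class     = vapply(d, function(col) class(col)[1], character(1)),
       n_missing = colSums(is.na(d)),
       row.names = NULL
     )
   }

   # Small made-up data frame, used only to show the output format
   example <- data.frame(
     rating = c(4.5, NA, 3.8),
     genre  = c("Game", "Tool", NA)
   )
   check_data(example)

A column reported with an unexpected class (for example, ``character`` where numbers were expected) often points to a wrong separator or to stray text in the file.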