# Data & Dataset

**Module:** `leaspy.io.data`

Leaspy separates the **user-facing data container** (`Data`) from the **computation-optimized representation** (`Dataset`).

## `Data`: The User Interface

The **`Data`** class is what you will use 99% of the time. It wraps your raw data (usually a Pandas DataFrame) and prepares it for use with Leaspy.

*   **Flexible**: You can easily inspect, modify, slice, and reload variables (cofactors, headers).
*   **Convenient**: Methods like `Data.from_csv_file` or `Data.from_dataframe` handle the complex formatting logic for you.

## `Dataset`: The Computational Core

The **`Dataset`** class is an internal, read-only optimization of `Data`. When you call `model.fit(data)`, Leaspy automatically converts your `Data` object into a `Dataset` behind the scenes.

**You rarely need to instantiate this class yourself.**

### Why does it exist?

While `Data` is user-friendly, `Dataset` is machine-friendly:

1.  **Tensorized**: Converts everything to PyTorch tensors for fast math.
2.  **Locked**: Prevents accidental modification during training.
3.  **Optimized Layout**: Handles padding, masking for missing values, and memory layout for batch operations.

### When should you use `Dataset` directly?

The only real use case for manually creating a `Dataset` is **performance** in advanced scripts. If you are running thousands of iterations or repeated fit/personalize calls on the **same static data**, you can convert it once:

```python
from leaspy.io.data import Dataset

# Convert once (`data` is an existing `Data` object, `model` an existing model)
dataset = Dataset(data)

# Reuse many times (avoids re-converting the data at every call)
for _ in range(100):
    model.fit(dataset, ...)
```

## Where is Data Handled in the Code?

Users typically pass `Data` (or even a `pandas.DataFrame`) to the model's main methods: `fit`, `predict`, or `personalize`.

The conversion happens in [`BaseModel`](../models/BaseModel.md), the ancestor of all Leaspy models (including `LogisticModel`):

1.  **`BaseModel.fit(data)`**:
    *   Accepts `DataFrame`, `Data`, or `Dataset`.
    *   Calls the internal helper `_get_dataset(data)`.
    *   `_get_dataset` converts `DataFrame` $\rightarrow$ `Data` $\rightarrow$ `Dataset` as needed.
2.  **Algorithm Execution**: The optimized `Dataset` is then passed to the algorithm runner (e.g., `algo.run(model, dataset)`).

This means high-level models like `LogisticModel` never worry about data formatting: they simply rely on `BaseModel` to hand them a clean, tensorized `Dataset`.
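
The conversion chain described above can be illustrated with a small, dependency-free sketch. It is not Leaspy's implementation: `ToyData`, `ToyDataset`, and `get_dataset` are hypothetical stand-ins, a list of `(subject_id, time, value)` rows plays the role of the `DataFrame`, and plain tuples replace PyTorch tensors. It only shows the dispatch pattern (convert what is needed, pass through what is already converted) together with the padding and masking idea.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union

# A plain list of (subject_id, time, value) rows stands in for a pandas.DataFrame.
Rows = List[Tuple[str, float, float]]


@dataclass
class ToyData:
    """Hypothetical stand-in for `Data`: mutable, grouped by individual."""
    visits: Dict[str, List[Tuple[float, float]]]

    @classmethod
    def from_rows(cls, rows: Rows) -> "ToyData":
        visits: Dict[str, List[Tuple[float, float]]] = {}
        for subject_id, time, value in rows:
            visits.setdefault(subject_id, []).append((time, value))
        return cls(visits)


@dataclass(frozen=True)
class ToyDataset:
    """Hypothetical stand-in for `Dataset`: frozen, padded, with a mask."""
    ids: Tuple[str, ...]
    values: Tuple[Tuple[float, ...], ...]
    mask: Tuple[Tuple[int, ...], ...]

    @classmethod
    def from_data(cls, data: ToyData) -> "ToyDataset":
        ids = tuple(data.visits)
        max_len = max(len(v) for v in data.visits.values())
        values, mask = [], []
        for subject_id in ids:
            # Order each individual's observations by visit time.
            obs = [value for _, value in sorted(data.visits[subject_id])]
            pad = max_len - len(obs)
            values.append(tuple(obs + [0.0] * pad))        # pad short series
            mask.append(tuple([1] * len(obs) + [0] * pad))  # 1 = real, 0 = padding
        return cls(ids, tuple(values), tuple(mask))


def get_dataset(data: Union[Rows, ToyData, ToyDataset]) -> ToyDataset:
    """Convert rows -> ToyData -> ToyDataset, skipping steps already done."""
    if isinstance(data, ToyDataset):
        return data  # already optimized: no conversion cost
    if not isinstance(data, ToyData):
        data = ToyData.from_rows(data)
    return ToyDataset.from_data(data)


rows = [("A", 70.0, 0.2), ("A", 71.0, 0.3), ("B", 65.0, 0.5)]
dataset = get_dataset(rows)
print(dataset.values)  # ((0.2, 0.3), (0.5, 0.0))
print(dataset.mask)    # ((1, 1), (1, 0))
```

In this sketch, passing an already-converted object returns it unchanged, which is the reason converting once pays off when the same static data is reused across many calls.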