Data & Dataset#

Module: leaspy.io.data

Leaspy separates the user-facing data container (Data) from the computation-optimized representation (Dataset).

Data: The User Interface#

The Data class is what you will use 99% of the time. It wraps your raw data (usually a pandas DataFrame) and prepares it for use with Leaspy.

  • Flexible: You can easily inspect, modify, slice, and reload variables (cofactors, headers).

  • Convenient: Methods like Data.from_csv_file or Data.from_dataframe handle the complex formatting logic for you.
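As a sketch of the typical workflow, here is how a long-format DataFrame (one row per visit, identified by ID and TIME columns, as Leaspy's documentation describes) might be prepared; the column names and the `Data.from_dataframe` call below assume a standard Leaspy setup:

```python
import pandas as pd

# One row per patient visit; feature columns hold the observed scores.
# "ID" and "TIME" are the identifiers Leaspy's long format expects.
df = pd.DataFrame({
    "ID":    ["patient_1", "patient_1", "patient_2", "patient_2"],
    "TIME":  [70.1, 71.3, 65.0, 66.2],
    "score": [0.20, 0.35, 0.10, 0.18],
}).set_index(["ID", "TIME"])

# With Leaspy installed, this DataFrame can then be wrapped:
# from leaspy import Data
# data = Data.from_dataframe(df)
```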

Dataset: The Computational Core#

The Dataset class is an internal, read-only optimization of Data.

When you call model.fit(data), Leaspy automatically converts your Data object into a Dataset behind the scenes. You rarely need to instantiate this class yourself.

Why does it exist?#

While Data is user-friendly, Dataset is machine-friendly:

  1. Tensorized: Converts everything to PyTorch tensors for fast math.

  2. Locked: Prevents accidental modification during training.

  3. Optimized Layout: Handles padding, masking for missing values, and memory layout for batch operations.
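The padding-and-masking idea can be illustrated with a small self-contained sketch. This is plain Python for illustration, not Leaspy's actual internals: patients have different numbers of visits, so values are padded to a rectangular array and a mask records which entries are real observations.

```python
# Conceptual sketch of padding + masking (NOT Leaspy's actual code).
visits = {
    "patient_1": [0.20, 0.35, 0.40],  # 3 visits
    "patient_2": [0.10, 0.18],        # 2 visits
}

max_visits = max(len(v) for v in visits.values())

values, mask = [], []
for scores in visits.values():
    pad = max_visits - len(scores)
    values.append(scores + [0.0] * pad)         # pad with zeros
    mask.append([1] * len(scores) + [0] * pad)  # 1 = observed, 0 = padding

# values -> [[0.2, 0.35, 0.4], [0.1, 0.18, 0.0]]
# mask   -> [[1, 1, 1], [1, 1, 0]]
```

In the real Dataset these arrays would be PyTorch tensors, so batched operations can multiply by the mask to ignore padded (or missing) entries.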

When should you use Dataset directly?#

The only real use case for manually creating a Dataset is for performance in advanced scripts. If you are running thousands of iterations or repeated fit/personalize calls on the same static data, you can convert it once:

# Convert once
dataset = Dataset(data)

# Reuse many times (avoids re-converting the data on every call)
for i in range(100):
    model.fit(dataset, ...)

Where is Data Handled in the Code?#

Users typically pass Data (or even a pandas.DataFrame) to the model’s main methods: fit, predict, or personalize.

The conversion happens in BaseModel, the ancestor of all Leaspy models (including LogisticModel):

  1. BaseModel.fit(data):

    • Accepts DataFrame, Data, or Dataset.

    • Calls internal helper _get_dataset(data).

    • _get_dataset converts DataFrame → Data → Dataset as needed.

  2. Algorithm Execution: The optimized Dataset is then passed to the algorithm runner (e.g., algo.run(model, dataset)).

This means high-level models like LogisticModel never need to worry about data formatting: they simply rely on BaseModel to hand them a clean, tensorized Dataset.