Data & Dataset
Module: leaspy.io.data
Leaspy separates the user-facing data container (Data) from the computation-optimized representation (Dataset).
Data: The User Interface
The Data class is what you will use 99% of the time. It wraps your raw data (usually a pandas DataFrame) and prepares it for use with Leaspy.
Flexible: You can easily inspect, modify, slice, and reload variables (cofactors, headers).
Convenient: Methods like Data.from_csv_file or Data.from_dataframe handle the complex formatting logic for you.
Dataset: The Computational Core
The Dataset class is an internal, read-only optimization of Data.
When you call model.fit(data), Leaspy automatically converts your Data object into a Dataset behind the scenes. You rarely need to instantiate this class yourself.
Why does it exist?
While Data is user-friendly, Dataset is machine-friendly:
Tensorized: Converts everything to PyTorch tensors for fast math.
Locked: Prevents accidental modification during training.
Optimized Layout: Handles padding, masking for missing values, and memory layout for batch operations.
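To make the padding and masking idea concrete, here is a minimal plain-Python sketch. It is not Leaspy's implementation: Dataset works on PyTorch tensors and its exact internal layout is not described here. The helper names pad_and_mask and masked_mean are hypothetical and exist only for this illustration.

```python
import math


def pad_and_mask(visits_per_patient):
    """Turn ragged per-patient score lists into a rectangular layout.

    Hypothetical helper for illustration only (not part of Leaspy).
    Returns (values, mask): both are lists of equal-length rows, where
    mask is 1.0 for an observed value and 0.0 for padding or NaN.
    """
    max_visits = max((len(v) for v in visits_per_patient), default=0)
    values, mask = [], []
    for visits in visits_per_patient:
        pad = max_visits - len(visits)
        # Missing values (NaN) and padding slots are both stored as 0.0
        # and flagged with 0.0 in the mask so they can be ignored later.
        values.append([0.0 if math.isnan(x) else x for x in visits] + [0.0] * pad)
        mask.append([0.0 if math.isnan(x) else 1.0 for x in visits] + [0.0] * pad)
    return values, mask


def masked_mean(values, mask):
    """Average only the observed entries, using the mask as weights."""
    total = sum(v * m for row_v, row_m in zip(values, mask)
                for v, m in zip(row_v, row_m))
    count = sum(m for row in mask for m in row)
    return total / count


# Three patients with 3, 2 and 1 visits; one value is missing (NaN).
scores = [[0.1, 0.2, 0.3], [0.5, math.nan], [0.4]]
values, mask = pad_and_mask(scores)
print(mask)
print(masked_mean(values, mask))
```

Because every row has the same length, the same arithmetic can be applied to all patients at once, and the mask keeps padded or missing entries from contributing to the result.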
When should you use Dataset directly?
The only real use case for manually creating a Dataset is for performance in advanced scripts. If you are running thousands of iterations or repeated fit/personalize calls on the same static data, you can convert it once:
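from leaspy.io.data import Dataset  # module named at the top of this page

# Optimize once: convert the Data object a single time
dataset = Dataset(data)

# Reuse many times (avoids re-converting data at every call)
for i in range(100):
    model.fit(dataset, ...)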
Where is Data Handled in the Code?
Users typically pass Data (or even a pandas.DataFrame) to the model’s main methods: fit, predict, or personalize.
The conversion happens in BaseModel, the ancestor of all Leaspy models (including LogisticModel):
BaseModel.fit(data):
Accepts DataFrame, Data, or Dataset.
Calls internal helper _get_dataset(data).
_get_dataset converts DataFrame \(\rightarrow\) Data \(\rightarrow\) Dataset as needed.
Algorithm Execution: The optimized Dataset is then passed to the algorithm runner (e.g., algo.run(model, dataset)).
This means high-level models like LogisticModel never have to worry about data formatting: they simply rely on BaseModel to hand them a clean, tensorized Dataset.
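The conversion chain described above can be sketched with stand-in classes. This is an assumption-laden illustration, not Leaspy's source: FakeDataFrame, FakeData, FakeDataset, to_data, to_dataset and get_dataset are all hypothetical names, and the real _get_dataset may differ in its details. The counter shows why passing an already-converted object skips the conversion work.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for pandas.DataFrame, Data and Dataset.
@dataclass
class FakeDataFrame:
    rows: list  # list of (patient_id, time, value) tuples


@dataclass
class FakeData:
    visits: dict  # patient_id -> list of (time, value)


@dataclass(frozen=True)  # frozen mimics the "locked" nature of Dataset
class FakeDataset:
    ids: tuple
    values: tuple


# Counts how often each conversion step actually runs.
conversions = {"to_data": 0, "to_dataset": 0}


def to_data(frame):
    """Group raw rows by patient (stand-in for building a Data object)."""
    conversions["to_data"] += 1
    visits = {}
    for patient_id, time, value in frame.rows:
        visits.setdefault(patient_id, []).append((time, value))
    return FakeData(visits)


def to_dataset(data):
    """Freeze grouped visits into an immutable, ordered layout."""
    conversions["to_dataset"] += 1
    ids = tuple(sorted(data.visits))
    values = tuple(tuple(sorted(data.visits[i])) for i in ids)
    return FakeDataset(ids, values)


def get_dataset(obj):
    """Convert DataFrame -> Data -> Dataset, skipping steps already done."""
    if isinstance(obj, FakeDataset):
        return obj  # already optimized: nothing to convert
    if isinstance(obj, FakeDataFrame):
        obj = to_data(obj)
    if isinstance(obj, FakeData):
        return to_dataset(obj)
    raise TypeError(f"Unsupported input type: {type(obj).__name__}")


frame = FakeDataFrame([("p2", 70.0, 0.4), ("p1", 65.0, 0.1), ("p1", 66.0, 0.2)])
dataset = get_dataset(frame)   # runs both conversion steps once
same = get_dataset(dataset)    # returns the input untouched
print(dataset.ids, conversions, same is dataset)
```

Under these assumptions, a caller that converts once and reuses the result pays the conversion cost a single time, which is the pattern recommended for repeated fit or personalize calls.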