Add Dataset
To add support for a new dataset to the Data Integration step, follow these steps:
Fork the repository, pull the latest changes, and create a new branch named add-dataset-{dataset_name}.
Create a bash script that downloads the raw data and stores it in a new folder named after the dataset.
Place it in the datasets/ folder.
Implement a Dataset subclass for your dataset.
    from ..dataset import Dataset
Your subclass must override the following methods:
    __init__() to instantiate the class.
    load_item_data() for converting the item data into a pd.DataFrame().
    load_user_data() for converting the user data into a pd.DataFrame().
    load_rating_data() for converting the rating data into a pd.DataFrame().
    load_social_data() for converting the social link data into a pd.DataFrame(), if supported.
    entity_linking() for matching each item with a DBpedia resource.
    enrich() for enriching each item with DBpedia’s properties.
Your code must be placed in a new file at data_integration/datasets/.
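The required methods can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Dataset` base class is a stand-in defined inline (in the repository you would use `from ..dataset import Dataset` instead), the method signatures are guessed to take no arguments, and `ToyDataset`, the column names, and the in-memory CSV strings are hypothetical. A real subclass would read the files produced by the download script and query DBpedia rather than use a hard-coded lookup.

```python
import io
from abc import ABC, abstractmethod

import pandas as pd


class Dataset(ABC):
    """Hypothetical stand-in for the project's Dataset base class."""

    @abstractmethod
    def load_item_data(self) -> pd.DataFrame: ...

    @abstractmethod
    def load_user_data(self) -> pd.DataFrame: ...

    @abstractmethod
    def load_rating_data(self) -> pd.DataFrame: ...

    @abstractmethod
    def load_social_data(self) -> pd.DataFrame: ...

    @abstractmethod
    def entity_linking(self) -> pd.DataFrame: ...

    @abstractmethod
    def enrich(self) -> pd.DataFrame: ...


# Tiny in-memory samples standing in for the downloaded raw files.
ITEMS_CSV = "item_id,title\n1,Toy Story (1995)\n2,GoldenEye (1995)\n"
USERS_CSV = "user_id,age\n10,24\n11,53\n"
RATINGS_CSV = "user_id,item_id,rating\n10,1,5\n10,2,3\n11,1,4\n"


class ToyDataset(Dataset):
    def __init__(self):
        # Hard-coded title-to-resource lookup; a real implementation
        # would match items against DBpedia instead.
        self.known_links = {
            "Toy Story (1995)": "http://dbpedia.org/resource/Toy_Story",
        }

    def load_item_data(self) -> pd.DataFrame:
        return pd.read_csv(io.StringIO(ITEMS_CSV))

    def load_user_data(self) -> pd.DataFrame:
        return pd.read_csv(io.StringIO(USERS_CSV))

    def load_rating_data(self) -> pd.DataFrame:
        return pd.read_csv(io.StringIO(RATINGS_CSV))

    def load_social_data(self) -> pd.DataFrame:
        # This toy dataset has no social links, so return an empty frame.
        return pd.DataFrame(columns=["source", "target"])

    def entity_linking(self) -> pd.DataFrame:
        items = self.load_item_data()
        # Unmatched titles map to NaN.
        items["URI"] = items["title"].map(self.known_links)
        return items[["item_id", "URI"]]

    def enrich(self) -> pd.DataFrame:
        # Placeholder: a real implementation would fetch DBpedia
        # properties for every matched item.
        return pd.DataFrame(columns=["item_id", "property", "value"])
```

Calling `ToyDataset().entity_linking()` returns one row per item, with a missing `URI` for items that could not be matched. Counting the non-missing values gives the number of matched items that the documentation table asks for.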
Store the submodule path so the subclass can be loaded dynamically.
Go to the data_integration/dataset2class.py file. This file stores the mapping between the dataset name and the submodule path and class name.
Create a new key with the dataset’s name. This dataset name will be used to identify the dataset when using the data_integration.py script.
Under this new key, store submodule (the path, relative to data_integration/) and class (the class name).
For example:
    dataset2class = {
        # ....
        'ml-100k': {  # Dataset name
            'submodule': 'datasets.movielens',  # Submodule path
            'class': 'MovieLens100k'  # Class name
        }
    }
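To show why the mapping stores both a submodule path and a class name, here is a minimal sketch of how such an entry could be resolved at runtime with `importlib`. The project's actual loading code is not shown in this guide, so `load_dataset_class` is a hypothetical helper, and the `demo` entry points at a standard-library module only so that the sketch runs without the repository installed.

```python
import importlib

# Stand-in mapping with the same shape as dataset2class. The entry is
# illustrative: it targets the standard library, not a real dataset.
dataset2class = {
    "demo": {
        "submodule": "collections",  # submodule path
        "class": "OrderedDict",      # class name
    }
}


def load_dataset_class(name, mapping):
    """Import the submodule registered for `name` and return its class."""
    entry = mapping[name]
    module = importlib.import_module(entry["submodule"])
    return getattr(module, entry["class"])


cls = load_dataset_class("demo", dataset2class)
print(cls.__name__)
```

An unknown dataset name raises `KeyError`, a wrong submodule path raises `ModuleNotFoundError`, and a wrong class name raises `AttributeError`, so a typo in the new entry surfaces as soon as the dataset is requested.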
Add the dataset to the documentation.
In the Datasets section of the file docs/source/getting_started/support.md, add the dataset to the table. Provide the dataset name with a reference link, the number of matched items, and the total number of items in the dataset.
Open a pull request on GitHub.