# Setting up configuration file The directive `experiment` configures the experimetn pipeline that includes: - `dataset` configuration - `preprocess` methods - `split` method - `models` to be evaluated - `evaluation` configuration - `report` to summarize experiment results Example: ```yaml experiment: dataset: name: ml-100k item: path: datasets/ml-100k/processed/item.csv extra_features: [movie_year, movie_title] user: path: datasets/ml-100k/processed/user.csv extra_features: [gender, occupation] ratings: path: datasets/ml-100k/processed/rating.csv timestamp: True enrich: map_path: datasets/ml-100k/processed/map.csv enrich_path: datasets/ml-100k/processed/enriched.csv remove_unmatched: False properties: - type: subject grouped: True sep: "::" - type: director grouped: True sep: "::" preprocess: - method: filter_kcore parameters: k: 20 iterations: 1 target: user split: seed: 42 test: method: k_fold k: 2 level: "user" models: - name: deepwalk_based config: save_weights: True parameters: walk_len: 10 p: 1.0 q: 1.0 n_walks: 50 embedding_size: 64 epochs: 1 - name: deepwalk_based config: save_weights: True parameters: walk_len: 10 p: 0.8 q: 0.6 n_walks: 50 embedding_size: 64 epochs: 1 evaluation: k: 5 relevance_threshold: 3 metrics: [MAP, nDCG] report: file: "experiment_results/ml100k_enriched/run1.csv" ``` The next session goes through all details of each directive. ## dataset In this section, we can define the dataset name and provide details about items, users, ratings and enriching data. ```yaml experiment: dataset: name: dataset_name item: path: /path/to/item.csv extra_features: [feat1_name, feat2_name, ...] user: path: path/to/user.csv extra_features: [feat1_name, feat2_name, ...] ratings: path: path/to/rating.csv timestamp: True|False enrich: map_path: path/to/map.csv enrich_path: path/to/enriched.csv remove_unmatched: True|False properties: [property1_name, property2_name] ``` Let's break down the main directives for the dataset: - `item`: specifies the item info to be added to the network. (mandatory) - `path`: filepath of the standardized **item.csv**. (mandatory) - `extra_features`: For default, the only column to be added is the `item_id`. With a list of column names the user can specify additional features to be added as property node. (optional) - `user`: specifies the user info. (mandatory) - `path`: filepath of the standardized **user.csv**. (mandatory) - `extra_features`: For default, the only column to be added is the `item_id`. With a list of column names the user can specify additional features to be added as property node. (optional) - `ratings`: specifies the ratings info. (mandatory) - `path`: filepath of the standardized `ratings.csv`. (mandatory) - `timestamp`: boolean that indicates if the column `timestamp` is present. - `enrich`: specifies the enriched info. (mandatory) - `map_path`: filepath of the standardized `map.csv`. (mandatory) - `enrich_path`: filepath of the standardized `user.csv`. (mandatory) - `remove_unmatched`: boolean to specify if nodes unmatched with DBpedia should be removed. (mandatory) - `properties`: list of properties to enrich the dataset (mandatory) ## preprocess In the .yaml file, the directive **preprocess** is used to define the list of pre-processing methods to be performed during the experiment pipeline. The pre-processing step can be configured as: ```yaml experiment: # ... preprocess: - method: method1_name parameters: parameter_1: 3 parameter_2: val - method: method2_name parameters: parameter_1: 4 ``` Where, - `preprocess`: specifies a list of pre-processing methods. (optional) - `method`: method name (mandatory) - `parameters`: method parameters in the format `parameter_name: parameter_value` The current supported pre-processing methods can be found in [support documentation](./support.md). ### binarize parameters In the _.yaml_ file, the method name is `binarize` and the only parameter is a `threshold` number. Example: ```yaml experiment: preprocess: - method: binarize parameters: threshold: 4 ``` ### filter_kcore parameters In the _.yaml_ file, the method name is `filter_kcore`, and the parameters are `k`, the number of `iterations` and the `target` type of node (user or item). Example: ```yaml experiment: # ... preprocess: - method: filter_kcore parameters: k: 20 iterations: 3 target: user # user or rating ``` ## split In the .yaml file, the directive **split** is used to define the split method. For example: ```yaml experiment: # ... split: seed: 42 test: method: method1_name parameter1_name: 0.2 parameter2_name: value validation: method: method2_name parameter1_name: value_2 parameter2_name: 100 ``` Where, - `split`: specifies the splitting method used. (mandatory) - `seed`: random seed value for reproducibility - `test`: directive for the test split - `validation`: directive for the validation split - `method`: splitting method name (mandatory) - Parameters as a dictionary where the key is the splitting method parameter name and the value is the corresponding value of this parameter. Example: `parameter1: value1` The current supported splitting methods can be found in [support documentation](./support.md). ### random_by_ratio parameters In the _.yaml_ file, the method name is `random_by_ratio`, and the parameters are: - `p`: test set proportion (mandatory) - `level`: **global** or **user** level (mandatory). Example: ```yaml experiment: # ... split: seed: 42 test: method: random_by_ratio level: global # or user p: 0.2 validation: method: random_by_ratio level: global # or user p: 0.2 ``` ### timestamp_by_ratio parameters In the _.yaml_ file, the method name is `timestamp_by_ratio`, and the parameters are the same as Random by Ratio. Example: ```yaml experiment: # ... split: test: method: timestamp_by_ratio level: user # or global p: 0.1 validation: method: timestamp_by_ratio level: user # or global p: 0.2 ``` ### fixed_timestamp parameters In the _.yaml_ file, the method name is `fixed_timestamp`. The only parameter is the `timestamp` number. Example: ```yaml experiment: # ... split: test: method: fixed_timestamp timestamp: 890000000 validation: method: fixed_timestamp timestamp: 880000000 ``` ### k_fold parameters In the _.yaml_ file, the method name is `k_fold`. The parameters are: - `k`: number of folds (mandatory) - `level`: **global** or **user** level (mandatory). **Note:** This method does not support validation splitting. Example: ```yaml experiment: # ... split: test: method: k_fold k: 3 level: "user" ``` ## models In the _.yaml_ file, the **models** directive defines a list of models to be evaluated during the experiment pipeline. Example: ```yaml experiment: models: - name: deepwalk_based config: save_weights: True parameters: parameter1: 10 paramater2: value - # Other model to be evaluated... ``` Where, - `models`: specifies a list of models to be evaluated. (mandatory) - `name`: model name (mandatory) - `config`: metadata config - _save_weights_: boolean that indicates if the model parameters must be saved after training. - `parameters`: model parameters in the format `parameter_name: parameter_value` The supported models and their parameters can be found at the [support](./support.md) documentation. ## evaluation In the _.yaml_ file, the **evaluation** directive defines the experiment evaluation. The evaluation metadata is given in the format `metadata1: metadata1_value`. Example: ```yaml experiment: # ... evaluation: k: 5 relevance_threshold: 3 metrics: [MAP, nDCG] ``` Where, - `evaluation`: specifies the evaluation metadata (mandatory) - `k`: evaluates the first k recommendations (mandatory) - `relevance_threshold`: threshold value to consider a rating relevant. - `metrics`: list of metric names to be evaluated. (mandatory) Check out the availables evaluation metric in the [support](./support.md) documentation. ## report In the _.yaml_ file, the **report** directive defines the experiment summarization. ```yaml experiment: # ... report: file: "experiment_results/ml100k_enriched/run1.csv" ``` Where, - `report`: specifies the report metadata (mandatory) - `file`: **_.csv_** filename of the experiment report. (mandatory)