Setting up configuration file

The directive experiment configures the experimetn pipeline that includes:

dataset configuration
preprocess methods
split method
models to be evaluated
evaluation configuration
report to summarize experiment results

Example:

experiment:
  dataset:
    name: ml-100k
    item:
      path: datasets/ml-100k/processed/item.csv
      extra_features: [movie_year, movie_title]
    user:
      path: datasets/ml-100k/processed/user.csv
      extra_features: [gender, occupation]
    ratings:
      path: datasets/ml-100k/processed/rating.csv
      timestamp: True
    enrich:
      map_path: datasets/ml-100k/processed/map.csv
      enrich_path: datasets/ml-100k/processed/enriched.csv
      remove_unmatched: False
      properties:
        - type: subject
          grouped: True
          sep: "::"
        - type: director
          grouped: True
          sep: "::"

  preprocess:
    - method: filter_kcore
      parameters:
        k: 20
        iterations: 1
        target: user

  split:
    seed: 42
    test:
      method: k_fold
      k: 2
      level: "user"

  models:
    - name: deepwalk_based
      config:
        save_weights: True
      parameters:
        walk_len: 10
        p: 1.0
        q: 1.0
        n_walks: 50
        embedding_size: 64
        epochs: 1
    - name: deepwalk_based
      config:
        save_weights: True
      parameters:
        walk_len: 10
        p: 0.8
        q: 0.6
        n_walks: 50
        embedding_size: 64
        epochs: 1

  evaluation:
    k: 5
    relevance_threshold: 3
    metrics: [MAP, nDCG]

  report:
    file: "experiment_results/ml100k_enriched/run1.csv"

The next session goes through all details of each directive.

dataset

In this section, we can define the dataset name and provide details about items, users, ratings and enriching data.

experiment:
  dataset:
    name: dataset_name
    item:
      path: /path/to/item.csv
      extra_features: [feat1_name, feat2_name, ...]
    user:
      path: path/to/user.csv
      extra_features: [feat1_name, feat2_name, ...]
    ratings:
      path: path/to/rating.csv
      timestamp: True|False
    enrich:
      map_path: path/to/map.csv
      enrich_path: path/to/enriched.csv
      remove_unmatched: True|False
      properties: [property1_name, property2_name]

Let’s break down the main directives for the dataset:

item: specifies the item info to be added to the network. (mandatory)
- path: filepath of the standardized item.csv. (mandatory)
- extra_features: For default, the only column to be added is the item_id. With a list of column names the user can specify additional features to be added as property node. (optional)
user: specifies the user info. (mandatory)
- path: filepath of the standardized user.csv. (mandatory)
- extra_features: For default, the only column to be added is the item_id. With a list of column names the user can specify additional features to be added as property node. (optional)
ratings: specifies the ratings info. (mandatory)
- path: filepath of the standardized ratings.csv. (mandatory)
- timestamp: boolean that indicates if the column timestamp is present.
enrich: specifies the enriched info. (mandatory)
- map_path: filepath of the standardized map.csv. (mandatory)
- enrich_path: filepath of the standardized user.csv. (mandatory)
- remove_unmatched: boolean to specify if nodes unmatched with DBpedia should be removed. (mandatory)
- properties: list of properties to enrich the dataset (mandatory)

preprocess

In the .yaml file, the directive preprocess is used to define the list of pre-processing methods to be performed during the experiment pipeline. The pre-processing step can be configured as:

experiment:
  # ...
  preprocess:
    - method: method1_name
      parameters:
        parameter_1: 3
        parameter_2: val
    - method: method2_name
      parameters:
        parameter_1: 4

Where,

preprocess: specifies a list of pre-processing methods. (optional)
- method: method name (mandatory)
- parameters: method parameters in the format parameter_name: parameter_value

The current supported pre-processing methods can be found in support documentation.

binarize parameters

In the .yaml file, the method name is binarize and the only parameter is a threshold number. Example:

experiment:
  preprocess:
    - method: binarize
      parameters:
        threshold: 4

filter_kcore parameters

In the .yaml file, the method name is filter_kcore, and the parameters are k, the number of iterations and the target type of node (user or item). Example:

experiment:
  # ...
  preprocess:
    - method: filter_kcore
      parameters:
        k: 20
        iterations: 3
        target: user # user or rating

split

In the .yaml file, the directive split is used to define the split method. For example:

experiment:
  # ...
  split:
    seed: 42
    test:
      method: method1_name
      parameter1_name: 0.2
      parameter2_name: value
    validation:
      method: method2_name
      parameter1_name: value_2
      parameter2_name: 100

Where,

split: specifies the splitting method used. (mandatory)
- seed: random seed value for reproducibility
- test: directive for the test split
- validation: directive for the validation split
- method: splitting method name (mandatory)
  - Parameters as a dictionary where the key is the splitting method parameter name and the value is the corresponding value of this parameter. Example: parameter1: value1

The current supported splitting methods can be found in support documentation.

random_by_ratio parameters

In the .yaml file, the method name is random_by_ratio, and the parameters are:

p: test set proportion (mandatory)
level: global or user level (mandatory).

Example:

experiment:
  # ...
  split:
    seed: 42
    test:
      method: random_by_ratio
      level: global # or user
      p: 0.2
    validation:
      method: random_by_ratio
      level: global # or user
      p: 0.2

timestamp_by_ratio parameters

In the .yaml file, the method name is timestamp_by_ratio, and the parameters are the same as Random by Ratio. Example:

experiment:
  # ...
  split:
    test:
      method: timestamp_by_ratio
      level: user # or global
      p: 0.1
    validation:
      method: timestamp_by_ratio
      level: user # or global
      p: 0.2

fixed_timestamp parameters

In the .yaml file, the method name is fixed_timestamp. The only parameter is the timestamp number. Example:

experiment:
  # ...
  split:
    test:
      method: fixed_timestamp
      timestamp: 890000000
    validation:
      method: fixed_timestamp
      timestamp: 880000000

k_fold parameters

In the .yaml file, the method name is k_fold. The parameters are:

k: number of folds (mandatory)
level: global or user level (mandatory).

Note: This method does not support validation splitting.

Example:

experiment:
  # ...
  split:
    test:
      method: k_fold
      k: 3
      level: "user"

models

In the .yaml file, the models directive defines a list of models to be evaluated during the experiment pipeline. Example:

experiment:
  models:
    - name: deepwalk_based
      config:
        save_weights: True
      parameters:
        parameter1: 10
        paramater2: value
    -  # Other model to be evaluated...

Where,

models: specifies a list of models to be evaluated. (mandatory)
- name: model name (mandatory)
- config: metadata config
  - save_weights: boolean that indicates if the model parameters must be saved after training.
- parameters: model parameters in the format parameter_name: parameter_value

The supported models and their parameters can be found at the support documentation.

evaluation

In the .yaml file, the evaluation directive defines the experiment evaluation. The evaluation metadata is given in the format metadata1: metadata1_value. Example:

experiment:
  # ...
  evaluation:
    k: 5
    relevance_threshold: 3
    metrics: [MAP, nDCG]

Where,

evaluation: specifies the evaluation metadata (mandatory)
- k: evaluates the first k recommendations (mandatory)
- relevance_threshold: threshold value to consider a rating relevant.
- metrics: list of metric names to be evaluated. (mandatory)

Check out the availables evaluation metric in the support documentation.

report

In the .yaml file, the report directive defines the experiment summarization.

experiment:
  # ...
  report:
    file: "experiment_results/ml100k_enriched/run1.csv"

Where,

report: specifies the report metadata (mandatory)
- file: .csv filename of the experiment report. (mandatory)