Documentation

Installation

This section describes the different ways to install and use Morph-KGC. Depending on the data sources you work with, you may need to install additional libraries; see Advanced Setup.

PyPI

PyPI is the fastest way to install Morph-KGC:

pip install morph-kgc

We recommend installing Morph-KGC in a virtual environment.

From Source

You can also grab the latest source code from the GitHub repository. Clone the repository:

git clone https://github.com/oeg-upm/morph-kgc.git

Access the root directory of the repository:

cd morph-kgc

Install Morph-KGC:

pip3 install .

Usage

Morph-KGC uses an INI file to configure the materialization process; see Configuration.

Command Line

To run the engine from the command line, execute the following:

python3 -m morph_kgc path/to/config.ini
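If you prefer to launch the command-line engine from a Python script (for example, in a scheduled job), the same invocation can be wrapped with the standard library. This is only a sketch: the helper names build_morph_kgc_command and run_morph_kgc are hypothetical and not part of Morph-KGC, and the config path is a placeholder.

```python
import subprocess
import sys
from pathlib import Path


def build_morph_kgc_command(config_path):
    """Return the argument list equivalent to `python3 -m morph_kgc <config>`."""
    # sys.executable is the interpreter running this script, so the command
    # uses the same (virtual) environment in which Morph-KGC is installed.
    return [sys.executable, "-m", "morph_kgc", str(Path(config_path))]


def run_morph_kgc(config_path):
    """Run the engine; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_morph_kgc_command(config_path), check=True)


if __name__ == "__main__":
    # Placeholder path; replace it with your own configuration file.
    print(build_morph_kgc_command("path/to/config.ini"))
```

Using sys.executable instead of a hard-coded python3 avoids accidentally picking up an interpreter outside the active virtual environment.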

Library

Morph-KGC can be used as a library, providing different methods to materialize the RDF or RDF-star knowledge graph. It integrates with RDFLib and Oxigraph to easily create and work with knowledge graphs in Python.

The methods in the API accept the config as a string or as the path to an INI file.

import morph_kgc

config = """
            [DataSource1]
            mappings: /path/to/mapping/mapping_file.rml.ttl
            db_url: mysql+pymysql://user:password@localhost:3306/db_name
         """

Note: Morph-KGC does not parallelize when running as a library.
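When the config is passed as a string, it can be convenient to assemble it programmatically rather than by hand. The sketch below uses Python's configparser to serialize one data source section. The helper build_config is hypothetical, and it assumes that Morph-KGC accepts the "key = value" form that configparser writes as well as the "key: value" form shown above.

```python
import configparser
import io


def build_config(mappings, db_url, section="DataSource1"):
    """Serialize one data source section to an INI-formatted string."""
    # interpolation=None keeps characters such as '%' in db_url from being
    # treated as interpolation markers while the string is assembled.
    parser = configparser.ConfigParser(interpolation=None)
    parser[section] = {"mappings": mappings, "db_url": db_url}
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


config = build_config(
    "/path/to/mapping/mapping_file.rml.ttl",
    "mysql+pymysql://user:password@localhost:3306/db_name",
)
print(config)
```

The resulting string could then be passed to the materialization methods described below in place of a hand-written one.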

RDFLib

morph_kgc.materialize(config)

Materialize the knowledge graph to RDFLib.

# generate the triples and load them to an RDFLib graph

graph = morph_kgc.materialize(config)
# or
graph = morph_kgc.materialize('/path/to/config.ini')

# work with the RDFLib graph
q_res = graph.query(' SELECT DISTINCT ?classes WHERE { ?s a ?classes } ')

Note: RDFLib does not support RDF-star, hence materialize does not support RML-star.

Oxigraph

morph_kgc.materialize_oxigraph(config)

Materialize the knowledge graph to Oxigraph.

# generate the triples and load them to Oxigraph

graph = morph_kgc.materialize_oxigraph(config)
# or
graph = morph_kgc.materialize_oxigraph('/path/to/config.ini')

# work with Oxigraph
q_res = graph.query(' SELECT DISTINCT ?classes WHERE { ?s a ?classes } ')

Set of Triples

morph_kgc.materialize_set(config)

Materialize the knowledge graph to a Python set of triples.

# create a Python Set with the triples

graph = morph_kgc.materialize_set(config)
# or
graph = morph_kgc.materialize_set('/path/to/config.ini')

# work with the Python set
print(len(graph))
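Beyond counting, a set of triples can be summarized with ordinary Python. The sketch below counts triples per predicate. It does not call Morph-KGC: the graph variable is a hand-written stand-in, and it assumes that each element is one triple serialized as an N-Triples-style string with an IRI subject (RDF-star quoted triples would need a real parser).

```python
from collections import Counter

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Hypothetical stand-in for the result of morph_kgc.materialize_set(config).
# Assumption: every element is a string of the form "<s> <p> <o>".
graph = {
    "<http://example.com/person/1> " + RDF_TYPE + " <http://example.com/Person>",
    '<http://example.com/person/1> <http://example.com/name> "Ada Lovelace"',
    "<http://example.com/person/2> " + RDF_TYPE + " <http://example.com/Person>",
}


def predicate_counts(triples):
    """Count triples per predicate."""
    counts = Counter()
    for triple in triples:
        # maxsplit=2 keeps literals that contain spaces intact in the object.
        _subject, predicate, _obj = triple.split(" ", 2)
        counts[predicate] += 1
    return counts


counts = predicate_counts(graph)
print(counts.most_common())
```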

Configuration

The configuration of Morph-KGC is done via an INI file. This configuration file can contain the following sections:

CONFIGURATION

One section for each DATA SOURCE

  • Each input data source has its own section, see Data Sources.

DEFAULT

  • This section is optional. It declares variables that can be used in all other sections for convenience. For instance, you can set main_dir: ../testing so that main_dir can be referenced as ${main_dir} in the rest of the sections.

Below is an example configuration file with one relational input source. In this case DataSource1 is the only data source section, but further data sources can be added by including additional sections. Here you can find a more complete configuration file.

[DEFAULT]
main_dir: ../testing

[CONFIGURATION]
output_file: knowledge-graph.nt

[DataSource1]
mappings: ${main_dir}/mapping_file.rml.ttl
db_url: mysql+pymysql://user:password@localhost:3306/db_name
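To see how a variable declared in DEFAULT is substituted into other sections, you can load an equivalent file with Python's configparser and extended interpolation. This is a sketch for checking a configuration by hand; that Morph-KGC reads the file with exactly these parser settings is an assumption.

```python
import configparser

config_text = """
[DEFAULT]
main_dir: ../testing

[CONFIGURATION]
output_file: knowledge-graph.nt

[DataSource1]
mappings: ${main_dir}/mapping_file.rml.ttl
db_url: mysql+pymysql://user:password@localhost:3306/db_name
"""

# ExtendedInterpolation resolves ${name} references, falling back to DEFAULT.
parser = configparser.ConfigParser(
    interpolation=configparser.ExtendedInterpolation()
)
parser.read_string(config_text)

resolved = parser["DataSource1"]["mappings"]
print(resolved)  # ../testing/mapping_file.rml.ttl

# sections() never lists DEFAULT, so what remains after dropping
# CONFIGURATION are the data source sections.
data_sources = [name for name in parser.sections() if name != "CONFIGURATION"]
print(data_sources)  # ['DataSource1']
```

A reference to a variable that is not declared anywhere (for example, a misspelled name) cannot be resolved, so the names in ${...} must match those in DEFAULT or in the same section.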

The parameters of the sections in the INI file are explained below.

Engine Configuration

The execution of Morph-KGC can be tuned via the CONFIGURATION section in the INI file. This section can be empty, in which case Morph-KGC will use the default property values.

Property | Description | Values
output_file | File to write the resulting knowledge graph to. | Default: knowledge-graph.nt
output_dir | Directory to write the resulting knowledge graph to. If it is specified, output_file will be ignored and multiple output files will be generated, one for each mapping partition. | Default: (empty)
na_values | Set of values to be interpreted as NULL when retrieving data from the input sources. The values must be separated by commas. | Default: #N/A,N/A,#N/A N/A,n/a,NA,<NA>,#NA,NULL,null,NaN,nan,,None
output_format | RDF serialization to use for the resulting knowledge graph. | Valid: N-TRIPLES, N-QUADS. Default: N-TRIPLES
only_printable_characters | Remove characters in the generated RDF that are not printable. | Valid: yes, no, true, false, on, off, 1, 0. Default: no
safe_percent_encoding | Set of ASCII characters that should not be percent encoded. All characters are encoded by default. | Example: :/ Default: (empty)
mapping_partition | Mapping partitioning algorithm to use. Mapping partitioning can also be disabled. | Valid: PARTIAL-AGGREGATIONS, MAXIMAL, no, false, off, 0. Default: PARTIAL-AGGREGATIONS
infer_sql_datatypes | Infer datatypes for relational databases. If a datatypeable term map has a rr:datatype property, then the datatype will not be inferred. | Valid: yes, no, true, false, on, off, 1, 0. Default: no
number_of_processes | The number of processes to use. If 1, Morph-KGC will use sequential processing (minimizing memory consumption); otherwise parallel processing is used (minimizing execution time). | Default: 2 * number of CPUs in the system
logging_level | Sets the level of the log messages to show. | Valid: DEBUG, INFO, WARNING, ERROR, CRITICAL, NOTSET. Default: INFO
logging_file | If not provided, log messages will be redirected to stdout. If a file path is provided, log messages will be written to the file. | Default: (empty)
oracle_client_lib_dir | lib_dir directory specified in a call to cx_Oracle.init_oracle_client(). | Default: (empty)
oracle_client_config_dir | config_dir directory specified in a call to cx_Oracle.init_oracle_client(). | Default: (empty)
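Two of the properties above, na_values and safe_percent_encoding, are easier to understand with a concrete example. The sketch below is not Morph-KGC's implementation: it assumes that na_values is split on commas without trimming, and it uses urllib.parse.quote to illustrate what leaving characters unencoded means. The function names are hypothetical.

```python
from urllib.parse import quote

# Default na_values string copied from the table above.
DEFAULT_NA_VALUES = "#N/A,N/A,#N/A N/A,n/a,NA,<NA>,#NA,NULL,null,NaN,nan,,None"


def parse_na_values(raw):
    """Split the comma-separated property value into a set of NULL markers."""
    return set(raw.split(","))


def encode_iri_part(value, safe_percent_encoding=""):
    """Percent-encode value, leaving the listed safe characters untouched."""
    return quote(value, safe=safe_percent_encoding)


na_values = parse_na_values(DEFAULT_NA_VALUES)
# The ',,' near the end of the default yields the empty string as a marker.
print("" in na_values)  # True

# With no safe characters, '/', ':' and the space are all encoded.
print(encode_iri_part("a/b c:d"))  # a%2Fb%20c%3Ad
# With the example value ':/', only the space is encoded.
print(encode_iri_part("a/b c:d", ":/"))  # a/b%20c:d
```

Note that quote never encodes ASCII letters, digits and the characters "_.-~", so those pass through unchanged in both calls.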

Note: there are some configuration properties that are ignored when using Morph-KGC as a library, such as output_file.

Data Sources

One data source section should be included in the INI file for each data source to be materialized. The properties in the data source section vary depending on the data source type (relational database or data file). Remote mapping files are supported.

Note: Morph-KGC is case sensitive regarding identifiers. This means that table, column and reference names in the mappings must match those in the data sources exactly, regardless of whether the mapping uses delimited identifiers.

Relational Databases

The properties to be specified for relational databases are listed below. All of the properties are required.

Property | Description | Values
mappings | Specifies the mapping file(s) or URL(s) for the relational database. [REQUIRED] | Valid: the path to a mapping file or URL; the paths to multiple mapping files or URLs separated by commas; the path to a directory containing all the mapping files.
db_url | URL that configures the database engine (username, password, hostname, database name). See the SQLAlchemy documentation on how to create database URLs. [REQUIRED] | Example: dialect+driver://username:password@host:port/db_name

Example db_url values for the DBAPI drivers recommended in Advanced Setup (see the SQLAlchemy documentation for full details) are:

  • MySQL: mysql+pymysql://username:password@host:port/db_name
  • PostgreSQL: postgresql+psycopg2://username:password@host:port/db_name
  • Oracle: oracle+cx_oracle://username:password@host:port/db_name
  • Microsoft SQL Server: mssql+pymssql://username:password@host:port/db_name
  • MariaDB: mariadb+pymysql://username:password@host:port/db_name
  • SQLite: sqlite:///db_name.db
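Usernames and passwords that contain characters such as "@" or "/" need to be URL-encoded before they are placed in a db_url, as is usual for SQLAlchemy database URLs. The sketch below assembles a URL of the form shown above with the standard library. The helper build_db_url and the sample credentials are hypothetical.

```python
from urllib.parse import quote_plus, urlsplit


def build_db_url(dialect, driver, username, password, host, port, db_name):
    """Assemble dialect+driver://username:password@host:port/db_name."""
    # Percent-encode the credentials so that characters such as '@' or '/'
    # in a password are not mistaken for URL delimiters.
    user = quote_plus(username)
    secret = quote_plus(password)
    return f"{dialect}+{driver}://{user}:{secret}@{host}:{port}/{db_name}"


url = build_db_url(
    "mysql", "pymysql", "user", "p@ss/word", "localhost", 3306, "db_name"
)
print(url)  # mysql+pymysql://user:p%40ss%2Fword@localhost:3306/db_name

# Splitting the URL again confirms that host and port are still recognized.
parts = urlsplit(url)
print(parts.hostname, parts.port)  # localhost 3306
```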

Data Files

The properties to be specified for data files are listed below. Remote data files are supported. The mappings property is required.

Property | Description | Values
mappings | Specifies the mapping file(s) or URL(s) for the data file. [REQUIRED] | Valid: the path to a mapping file or URL; the paths to multiple mapping files or URLs separated by commas; the path to a directory containing all the mapping files.
file_path | Specifies the local path or URL of the data file. It is optional since it can be provided within the mapping file with rml:source. If it is provided it will override the local path or URL provided in the mapping files. | Default: (empty)

Note: CSV, TSV, Stata and SAS support compressed files (gzip, bz2, zip, xz, tar). Files are decompressed on the fly and the compression format is inferred automatically.
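To illustrate what on-the-fly decompression with an inferred format means, the sketch below picks a decompressor from the file extension and streams a CSV file without unpacking it to disk first. It uses only the standard library and is not how Morph-KGC implements this feature; the helper names are hypothetical and only gzip, bz2 and xz are covered.

```python
import bz2
import csv
import gzip
import lzma
import tempfile
from pathlib import Path

# Map file extensions to the matching standard-library opener.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_text(path):
    """Open a possibly compressed file in text mode, based on its extension."""
    path = Path(path)
    opener = OPENERS.get(path.suffix, open)  # plain open() for other files
    return opener(path, mode="rt", encoding="utf-8", newline="")


def read_rows(path):
    """Read all CSV rows as dictionaries, decompressing while reading."""
    with open_text(path) as handle:
        return list(csv.DictReader(handle))


with tempfile.TemporaryDirectory() as tmp:
    # Write a small gzip-compressed CSV file to demonstrate the round trip.
    sample = Path(tmp) / "people.csv.gz"
    with gzip.open(sample, mode="wt", encoding="utf-8", newline="") as handle:
        handle.write("id,name\n1,Ada\n2,Grace\n")
    rows = read_rows(sample)

print(rows)
```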

Advanced Setup

Relational Databases

The supported DBMSs are MySQL, PostgreSQL, Oracle, Microsoft SQL Server, MariaDB and SQLite. To use relational databases it is necessary to first install the DBAPI driver. We recommend the following ones:

  • MySQL: PyMySQL
  • PostgreSQL: psycopg2
  • Oracle: cx_Oracle
  • Microsoft SQL Server: pymssql
  • MariaDB: PyMySQL
  • SQLite: does not need any additional DBAPI driver

Morph-KGC relies on SQLAlchemy, so additional DBAPI drivers are supported; see the SQLAlchemy documentation for the full list. For MySQL and MariaDB you may also need to install cryptography.

Note: to run Morph-KGC with Oracle, the libraries of the Oracle Client need to be loaded. See cx_Oracle Installation to install these libraries. See cx_Oracle Initialization to set up the initialization of Oracle. Depending on the selected option, provide the properties oracle_client_lib_dir and oracle_client_config_dir in the CONFIGURATION section accordingly.

Tabular Files

The supported tabular file formats are CSV, TSV, Excel, Parquet, Feather, ORC, Stata, SAS, SPSS and ODS. To work with some of them it is necessary to install some libraries:

Hierarchical Files

The supported hierarchical file formats are XML and JSON.

Morph-KGC uses XPath 3.0 to query XML files and JSONPath to query JSON files.
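As a rough illustration of the kind of path queries involved, the sketch below selects the same values from a small XML document and its JSON counterpart. It uses only the standard library, which is an approximation: xml.etree.ElementTree supports a limited XPath subset rather than XPath 3.0, and Python has no built-in JSONPath, so the JSONPath expression appears only as a comment. The sample data is made up.

```python
import json
import xml.etree.ElementTree as ET

xml_doc = """
<people>
  <person id="1"><name>Ada</name></person>
  <person id="2"><name>Grace</name></person>
</people>
"""

# XPath-style query: the name element of every person under the root.
root = ET.fromstring(xml_doc)
xml_names = [element.text for element in root.findall("./person/name")]
print(xml_names)  # ['Ada', 'Grace']

json_doc = json.loads(
    '{"people": [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]}'
)

# Plain traversal equivalent to the JSONPath expression $.people[*].name
json_names = [person["name"] for person in json_doc["people"]]
print(json_names)  # ['Ada', 'Grace']
```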

Note: the specific JSONPath syntax supported by Morph-KGC can be consulted here.

OEG UPM