Skip to content

Documentation

Tutorial

You can get quickly started with the tutorial in Google Colaboratory.

Installation

In the following we describe different ways in which you can install and use Morph-KGC.

PyPi

PyPi is the fastest way to install Morph-KGC:

pip install morph-kgc

Some data sources require additional dependencies. Check Advanced Setup for specific installation instructions or install all the dependencies:

pip install morph-kgc[all]

We recommend to use virtual environments to install Morph-KGC.

From Source

You can also grab the latest source code from the GitHub repository:

pip install git+https://github.com/morph-kgc/morph-kgc.git

Usage

Morph-KGC uses an INI file to configure the materialization process, see Configuration.

Command Line

To run the engine using the command line you just need to execute the following:

python3 -m morph_kgc path/to/config.ini

Library

Morph-KGC can be used as a library, providing different methods to materialize the RDF or RDF-star knowledge graph. It integrates with RDFLib, Oxigraph and Kafka to easily create and work with knowledge graphs in Python.

The methods in the API accept the config as a string or as the path to an INI file.

import morph_kgc

config = """
            [DataSource1]
            mappings: /path/to/mapping/mapping_file.rml.ttl
            db_url: mysql+pymysql://user:password@localhost:3306/db_name
         """

RDFLib

morph_kgc.materialize(config)

Materialize the knowledge graph to RDFLib.

# generate the triples and load them to an RDFLib graph

graph = morph_kgc.materialize(config)
# or
graph = morph_kgc.materialize('/path/to/config.ini')

# work with the RDFLib graph
q_res = graph.query(' SELECT DISTINCT ?classes WHERE { ?s a ?classes } ')

Note: RDFLib does not support RDF-star, hence materialize does not support RML-star.

Oxigraph

morph_kgc.materialize_oxigraph(config)

Materialize the knowledge graph to Oxigraph.

# generate the triples and load them to Oxigraph

graph = morph_kgc.materialize_oxigraph(config)
# or
graph = morph_kgc.materialize_oxigraph('/path/to/config.ini')

# work with Oxigraph
q_res = graph.query(' SELECT DISTINCT ?classes WHERE { ?s a ?classes } ')

Set of Triples

morph_kgc.materialize_set(config)

Materialize the knowledge graph to a Python Set of triples.

# create a Python Set with the triples

graph = morph_kgc.materialize_set(config)
# or
graph = morph_kgc.materialize_set('/path/to/config.ini')

# work with the Python set
print(len(graph))

Kafka

morph_kgc.materialize_kafka(config)

Materialize the knowledge graph to a Kafka topic. To use this method, ensure that the config file includes the output_kafka_server and output_kafka_topic parameters.

# generate the triples and sent them to Kafka topic

graph = morph_kgc.materialize_kafka(config)
# or
graph = morph_kgc.materialize_kafka('/path/to/config.ini')

Configuration

The configuration of Morph-KGC is done via an INI file. This configuration file can contain the following sections:

CONFIGURATION

One section for each DATA SOURCE

  • Each input data source has its own section, see Data Sources.

DEFAULT

  • It is optional and it declares variables that can be used in all other sections for convenience. For instance, you can set main_dir: ../testing so that main_dir can be used in the rest of the sections.

Below is an example configuration file with one input relational source. In this case DataSource1 is the only data source section, but other data sources can be considered by including additional sections. Here you can find a configuration file which is more complete.

[DEFAULT]
main_dir: ../testing

[CONFIGURATION]
output_file: knowledge-graph.nt

[DataSource1]
mappings: ${mappings_dir}/mapping_file.rml.ttl
db_url: mysql+pymysql://user:password@localhost:3306/db_name

The parameters of the sections in the INI file are explained below.

Engine Configuration

The execution of Morph-KGC can be tuned via the CONFIGURATION section in the INI file. This section can be empty, in which case Morph-KGC will use the default property values.

Property
Description Values
output_file File to write the resulting knowledge graph to. Default: knowledge-graph.nt
output_dir Directory to write the resulting knowledge graph to. If it is specified, output_file will be ignored and multiple output files will generated, one for each mapping partition. Default:
output_kafka_server Kafka server address for sending the resulting knowledge graph. Default:
output_kafka_topic Kafka topic to send the resulting knowledge graph to. Default:
na_values Set of values to be interpreted as NULL when retrieving data from the input sources. The set of values must be separated by commas. Default: ,nan
output_format RDF serialization to use for the resulting knowledge graph. Valid: N-TRIPLES, N-QUADS
Default: N-TRIPLES
only_printable_chars Remove characters in the genarated RDF that are not printable. Valid: yes, no, true, false, on, off, 1, 0
Default: no
safe_percent_encoding Set of ASCII characters that should not be percent encoded. All characters are encoded by default. Example: :/
Default:
udfs File with Python user-defined functions to be called from RML-FNML. Default:
mapping_partitioning Mapping partitioning algorithm to use. Mapping partitioning can also be disabled. Valid: PARTIAL-AGGREGATIONS, MAXIMAL, no, false, off, 0
Default: PARTIAL-AGGREGATIONS
infer_sql_datatypes Infer datatypes for relational databases. If a datatypeable term map has a rr:datatype property, then the datatype will not be inferred. Valid: yes, no, true, false, on, off, 1, 0
Default: no
number_of_processes The number of processes to use. If 1, Morph-KGC will use sequential processing (minimizing memory consumption), otherwise parallel processing is used (minimizing execution time). Default: 2 * number of CPUs in the system
logging_level Sets the level of the log messages to show. Valid: DEBUG, INFO, WARNING, ERROR, CRITICAL, NOTSET
Default: INFO
logging_file If not provided, log messages will be redirected to stdout. If a file path is provided, log messages will be written to the file. Default:
oracle_client_lib_dir lib_dir directory specified in a call to cx_Oracle.init_oracle_client(). Default:
oracle_client_config_dir config_dir directory specified in a call to cx_Oracle.init_oracle_client(). Default:

Note: there are some configuration properties that are ignored when using Morph-KGC as a library, such as output_file.

Data Sources

One data source section should be included in the INI file for each data source to be materialized. The properties in the data source section vary depending on the data source type (relational database or data file). Remote mapping files are supported.

Note: Morph-KGC is case sensitive regarding identifiers. This means that table, column and reference names in the mappings must be the same as those in the data sources (no matter if the mapping uses delimited identifiers).

Relational Databases

The properties to be specified for relational databases are listed below. All of the properties are required.

Property
Description
Values
mappings Specifies the mapping file(s) or URL(s) for the relational database. [REQUIRED]
Valid:
- The path to a mapping file or URL.
- The paths to multiple mapping files or URLs separated by commas.
- The path to a directory containing all the mapping files.
db_url It is a URL that configures the database engine (username, password, hostname, database name). See here how to create the database URLs. [REQUIRED]
Example: dialect+driver://username:password@host:port/db_name
connect_args A dictionary string of options for SQLAlchemy. See here the SQLAlchemy documentation. Example: {"http_path": ""}

Example db_url values (see here all the information):

  • MySQL: mysql+pymysql://username:password@host:port/db_name
  • PostgreSQL: postgresql+psycopg://username:password@host:port/db_name
  • Oracle: oracle+cx_oracle://username:password@host:port/db_name
  • Microsoft SQL Server: mssql+pymssql://username:password@host:port/db_name
  • MariaDB: mariadb+pymysql://username:password@host:port/db_name
  • SQLite: sqlite:///db_path/db_name.db

Data Files

The properties to be specified for data files are listed below. Remote data files are supported. The mappings property is required.

Property
Description
Values
mappings Specifies the mapping file(s) or URL(s) for the data file. [REQUIRED]
Valid:
- The path to a mapping file or URL.
- The paths to multiple mapping files or URLs separated by commas.
- The path to a directory containing all the mapping files.
file_path Specifies the local path or URL of the data file. It is optional since it can be provided within the mapping file with rml:source. If it is provided it will override the local path or URL provided in the mapping files. Default:

Note: CSV, TSV, Stata and SAS support compressed files (gzip, bz2, zip, xz, tar). Files are decompressed on-the-fly and compression format is automatically inferred.

Advanced Setup

Relational Databases

The supported DBMSs are MySQL, PostgreSQL, Oracle, Microsoft SQL Server, MariaDB and SQLite. To use relational databases it is necessary to additionally install DBAPI drivers. You can install them via:

To run Morph-KGC with Oracle, the libraries of the Oracle Client need to be loaded. See cx_Oracle Installation to install these libraries. See cx_Oracle Initialization to setup the initialization of Oracle. Depending on the selected option, provide the properties oracle_client_lib_dir and oracle_client_config_dir in the CONFIGURATION section accordingly.

Tabular Files

The supported tabular files formats are CSV, TSV, Excel, Parquet, Feather, ORC, Stata, SAS, SPSS and ODS. To work with some of them it is necessary to install additional dependencies. You can install them via:

Hierarchical Files

The supported hierarchical files formats are XML and JSON.

Morph-KGC uses XPath 3.0 to query XML files and JSONPath to query JSON files. The specific JSONPath syntax supported by Morph-KGC can be consulted here.

Docker

You can also use Morph-KGC with the provided Dockerfile.

Image Building

Build the container as follows:

docker build -t morph-kgc .

To include optional dependencies, use the optional_dependencies option as follows:

docker build -t morph-kgc --build-arg optional_dependencies="sqlite,kafka" .

Execution

The container is designed to mount a local directory containing the required files. To run the container, use the following command, replacing $(pwd)/files with the path to the local directory containing your files:

docker run -v $(pwd)/files:/app/files morph-kgc files/config.ini

This will mount the local directory to /app/files within the container and execute the application using the provided configuration file.

Mappings

Morph-KGC is compliant with the W3C Recommendation RDB to RDF Mapping Language (R2RML) and the RDF Mapping Language (RML). You can refer to their associated specifications to consult the syntaxes.

RML-FNML

Declarative transformation functions are supported via RML-FNML. Morph-KGC comes with a subset of the GREL functions as built-in functions that can be directly used from the mappings. Python user-defined functions are additionally supported. A Python script with user-defined functions is provided to Morph-KGC via the udfs parameter. Decorators for these functions must be defined to link the Python parameters to the FNML parameters. An example of a user-defined function:

@udf(
    fun_id='http://example.com/toUpperCase',
    text='http://users.ugent.be/~bjdmeest/function/grel.ttl#valueParam')
def to_upper_case(text):
    return text.upper()

An RML-FNML mapping calling this functions would be:

<#TM1>
    rml:logicalSource [
        rml:source "test/rml-fnml/udf/student.csv";
        rml:referenceFormulation ql:CSV;
    ];
    rr:subjectMap [
        rr:template "http://example.com/{Name}";
    ];
    rr:predicateObjectMap [
        rr:predicate foaf:name;
        rr:objectMap [
            fnml:execution <#Execution>;
        ];
    ].

<#Execution>
    fnml:function ex:toUpperCase;
    fnml:input [
        fnml:parameter grel:valueParam;
        fnml:valueMap [
            rml:reference "Name";
        ];
    ].

The complete set of built-in functions can be consulted here.

RML-star

Morph-KGC supports the new RML-star mapping language to generate RDF-star knowledge graphs. RML-star introduces the star map class to generate RDF-star triples. A star map can be either at the place of a subject map or an object map, generating quoted triples in either the subject or object positions. The rml:embeddedTriplesMap property connects the star maps to the triples map that defines how the quoted triples will be generated. Triples map can be declared as rml:NonAssertedTriplesMap if they are to be referenced from an embedded triples map, but are not supposed to generate asserted triples in the output RDF-star graph. The following example from the RML-star specification uses a non-asserted triples map to generate quoted triples.

<#TM1> a rml:NonAssertedTriplesMap;
    rml:logicalSource ex:ConfidenceSource;
    rml:subjectMap [
        rr:template "http://example.com/{entity}";
    ];
    rr:predicateObjectMap [
        rr:predicate rdf:type;
        rml:objectMap [
            rr:template "http://example.com/{class}";
        ];
    ].

<#TM2> a rr:TriplesMap;
    rml:logicalSource ex:ConfidenceSource;
    rml:subjectMap [
        rml:quotedTriplesMap <#TM1>;
    ];
    rr:predicateObjectMap [
        rr:predicate ex:confidence;
        rml:objectMap [
            rml:reference "confidence";
        ];
    ].

YARRRML

YARRRML is a human-friendly serialization of RML that uses YAML. Morph-KGC supports YARRRML, also for RML-FNML and RML-star. The mapping below shows a YARRRML example.

prefixes:
  foaf: http://xmlns.com/foaf/0.1/
  ex: http://example.com/
  rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
  xsd: http://www.w3.org/2001/XMLSchema#

mappings:
  TM1:
   sources:
     - ['student.csv~csv']
   s: http://example.com/$(Name)
   po:
     - [foaf:name, $(Name)]

RML Views

In addition to R2RML views, Morph-KGC also supports RML views over tabular data (CSV and Parquet formats) and JSON files. RML views enable transformation functions, complex joins or mixed content using the SQL query language. For instance, the following triples map takes as input a CSV file and filters the data based on the language of some codes.

<#TM1>
    rml:logicalSource [
        rml:query """
            SELECT "Code", "Name", "Lan"
            FROM 'country.csv'
            WHERE "Lan" = 'EN';
        """
    ];
    rr:subjectMap [
        rr:template "http://example.com/{Code}";
    ];
    rr:predicateObjectMap [
        rr:predicate rdfs:label;
        rr:objectMap [
            rr:column "Name";
            rr:language "en";
        ];
    ].

Morph-KGC uses DuckDB to evaluate queries over tabular sources, the supported SQL syntax can be consulted in its documentation. For views over JSON check the corresponding JSON section in the DuckDB documentation and this blog post.

RML In-Memory

Morph-KGC supports the definition of in-memory logical sources (Pandas DataFrames and Python Dictionaries) within RML using the SD Ontology. The following RML rules show the transformation of a Pandas Dataframe to RDF.

@prefix sd: <https://w3id.org/okn/o/sd#>.

<#TM1>
    rml:logicalSource [
        rml:source [
            a sd:DatasetSpecification;
            sd:name "variable1";
            sd:hasDataTransformation [
                sd:hasSoftwareRequirements "pandas>=1.1.0";
                sd:hasSourceCode [
                    sd:programmingLanguage "Python3.9";
                ];
            ];   
        ];
        rml:referenceFormulation ql:DataFrame;
    ];
    rr:subjectMap [
        rr:template "http://example.com/data/user{Id}";
    ];
    rr:predicateObjectMap [
        rr:predicate rdf:type;
        rr:objectMap [
            rr:constant ex:User;
        ];
    ].

The above mappings can be executed from Python as follows:

import morph_kgc
import pandas as pd

users_df = pd.DataFrame({'Id': [1,2,3,4],\
           'Username': ["@jude","@emily","@wayne","@jordan1"]})
data_dict = {"variable1": users_df}

config = """
    [DataSource]
    mappings = mapping_rml.ttl
"""

g_rdflib = morph_kgc.materialize(config, data_dict)

OEG UPM