alibaba/GraphScope

Add more popular datasets to graphscope built-in datasets

Open

#1,015 opened on Nov 17, 2021

good first issue


Description

We have several built-in datasets that can each be loaded in one line. The data files are located in the dataset directory of the Aliyun OSS bucket graphscope, and the corresponding utility functions to load them are located in python/graphscope/dataset/. We plan to keep enriching the datasets.

Here is the procedure for adding a new dataset:

  1. Find a popular and appropriate dataset, and adapt its format to a property graph if necessary.
  2. Put all data files inside a folder and give the folder a meaningful name.
  3. Compress the folder, then upload the compressed file together with the original folder to the dataset folder of the OSS bucket. For example, given a folder named foo/ containing two files, foo/nodes.csv and foo/edge.csv, the bucket will have the following file structure after this step:
     dataset
     |-- foo.tar.gz
     |-- foo
         |-- nodes.csv
         |-- edge.csv
  4. Write the loading function load_foo in a new file named python/graphscope/dataset/foo.py.
  5. A corresponding unit test is appreciated!
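
As a rough illustration of steps 3 and 4, here is a minimal sketch using the foo example. It is not the project's actual implementation: pack_dataset is a hypothetical helper, the vertex and edge labels ("node", "edge") are made up, and load_foo assumes the session object exposes g(), and the graph exposes add_vertices() and add_edges() taking a file path, as in the GraphScope Python client. The existing loaders in python/graphscope/dataset/ may be structured differently (for example, they may download and extract the archive themselves), so treat them as the reference.

```python
import os
import tarfile


def pack_dataset(folder):
    """Compress `folder` into `<folder>.tar.gz` next to it; return the archive path.

    Hypothetical helper for step 3; uploading to the bucket is not covered here.
    """
    folder = os.path.abspath(folder)  # also drops a trailing slash, e.g. "foo/"
    archive = folder + ".tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps the top-level folder name (foo/) inside the archive.
        tar.add(folder, arcname=os.path.basename(folder))
    return archive


def load_foo(sess, prefix):
    """Sketch of a loader for step 4.

    Assumptions: `sess` behaves like a GraphScope session, `prefix` is a local
    directory that already contains nodes.csv and edge.csv, and the labels
    below are placeholders chosen for illustration.
    """
    prefix = os.path.abspath(os.path.expanduser(prefix))
    graph = sess.g()
    graph = graph.add_vertices(os.path.join(prefix, "nodes.csv"), label="node")
    graph = graph.add_edges(
        os.path.join(prefix, "edge.csv"),
        label="edge",
        src_label="node",
        dst_label="node",
    )
    return graph
```

Because load_foo only relies on the session it is given, a unit test (step 5) can pass either a real session or a small stand-in object that records the add_vertices and add_edges calls.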
