Data Access Overview
The sponsor of a Gen3 data commons typically decides how users will access data in object storage. In some cases, approved users may be allowed to download data directly to their local computer from within an internet browser or with the cdis-data-client. When more security is required, users may be required to download and analyze data in a protected environment, such as a virtual machine (VM) in a virtual private cloud (VPC).
Accessing data from within a browser
Once data files are registered, their address in s3 object storage can be obtained by providing the file's UUID to the following URL: https://data.gen3.org/index/index/UUID
Data files can be downloaded by providing the file's UUID to the following URL: https://data.gen3.org/user/data/download/UUID
Downloading data with the cdis-data-client
Data files can also be downloaded using the "cdis-data-client", which provides a simple command-line interface for downloading and uploading data files.
Download the latest release of the client here.
Once downloaded and installed, the client can be configured with the API credentials.json downloaded from your Profile in the data portal:
./cdis-data-client configure --profile <profile_name> --cred /path/to/api/credentials.json
The client will then prompt you for the API. Enter the API of your commons, e.g.:
API endpoint: https://gen3.datacommons.io/
To download a data file, pass the file's UUID to the client:
cdis-data-client download --profile <profile_name> --file ./filename.tsv --uuid d7a5XXXX-XXXX-XXXX-XXXX-XXXX53583014
In the above command, download
mode is specified, the profile_name
we configured with the API credentails earlier is used, and a filename (filename.tsv
) was specified for our local copy of the downloaded file.
Accessing data from the Virtual Private Cloud
If additional security is required, the cdis-data-client can be installed on a VM in the VPC to download files. The following instructions detail how to login to a VM.
Send public SSH Key and OpenConnectID account to Gen3 commons team.
To access the Virtual Private Cloud (VPC), users will need to send their public ssh key (or "pubkey") and an email that supports Oauth (often gmail) to support@datacommons.io. Do not send your private ssh key! This is confidential and should never be shared with anyone.
Github has a very nice ssh tutorial with step-by-step instructions for Mac, Windows (users with gitbash installed), and Linux users. If you're new to using SSH we'd recommend reviewing the links:
Gen3 users with the appropriate signed legal documents will be sent an email that gives the following unique information:
- username (this is used to ssh to the VPC login node) - eg:
ssh -A <username>@<login-ip-address>
- an IP associated with your new VM
SSH to Virtual Machine
In some data commons, you may need to ssh in to a "login" node prior to accessing your personal virtual machine:
- SSH to login node:
ssh -A <username>@3<login-node-IP>
- SSH from login node to your VM:
ssh -A ubuntu@<my_VM_IP>
NOTE: the
-A
flag forwards all the keys in your key chain. For more details on managing SSH keys, check the guides linked in the previous step.
It is recommended to update the SSH config file to use a 'multihop' ssh tunnel. To 'multihop' in this context is to use a single command to get to your VM. What follows are instructions for updating your .ssh/config
file to get to your VM in a single command.
Go to your .ssh directory in your laptop home directory: cd ~/.ssh
Within this directory, edit the file named "config" (if it doesn't exist, create it first): vim config
In this file you can specify various hosts for access via ssh. Your host for the head login node should look something like this:
Host login-node
Hostname <ip.address.of.login.node>
User <YOUR_USERNAME>
IdentityFile </path/to/YOUR_CREDFILE>
ForwardAgent yes
Where /path/to/YOUR_CREDFILE might be: ~/.ssh/id_rsa
The username will be provided to you by the Gen3 support team and will be tied to the credential file that you provide us when setting up the account (your private ssh key). Save this file and exit. Back in the command line terminal you should now be able to log in to the login node using this host:
ssh login-node
Exit the head login node and return to your local machine. Back in your config file, add a new host for your analysis VM:
Host analysis
Hostname <YOUR_VM_IP>
User ubuntu
ProxyCommand ssh -q -AXY login-node -W %h:%p
Once again you will receive the Hostname IP from the support team. This host will route you through the head login node and take you directly to your personal analysis VM. Once again, save the file and exit. In the terminal, try to log in to the submission VM:
ssh analysis
If you've done everything correctly, you should now be in the analysis VM.
Ready to work!
You're ready to use whatever tools you wish to analyze data in the commons within your VM. For requests for alternative configurations, analysis storage, or other needs please contact support@datacommons.io
.
For an example of how to use a Jupyter Notebook for data analyses in a browser window on your local computer, please continue on to the next section.