The datacollider provides you with a straightforward way to upload your own datasets. Uploaded datasets will only be available to you. You can upload your dataset either from the quick start section on the dashboard page or from the dataset page.
After you click the upload dataset button, you will be guided through a series of steps to prepare your dataset for use in the datacollider. This is necessary because the tool requires all data to be in the same internal format. But don’t worry, the steps are pretty straightforward:
To begin, you’ll be asked to choose which kind of dataset you would like to prepare. Usually this is a temporal dataset, meaning a dataset where each record has a timestamp that ties an event to a point in time. This short guide assumes a temporal dataset (some open NYC taxi data); however, the process is similar for other kinds of data.
First, you need to upload your raw dataset. The data can be in a variety of formats; however, each record needs to be on a separate line, and all fields of a record must be separated by the same delimiter. In our case, we are using a comma-separated file for the New York taxi dataset. If your dataset consists of multiple files, you should upload all of them now because they cannot be added to the dataset later.
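Before uploading, it can help to check locally that every record really has the same number of fields. The following is only a minimal sketch of such a check and is not part of the datacollider; the function name `field_counts` and the sample column names are made up for illustration.

```python
import csv
import io


def field_counts(text, delimiter=","):
    """Return the set of field counts seen across all non-empty lines."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return {len(row) for row in reader if row}


# Hypothetical sample: a header line plus two records.
sample = "medallion,trip_time_in_secs\nA1,382\nB2,259\n"

counts = field_counts(sample)
if len(counts) == 1:
    print("every line has the same number of fields")
else:
    print("inconsistent field counts:", sorted(counts))
```

If the set contains more than one value, some line is probably using a different delimiter or is missing a field.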
As a second step, you need to specify the structure of your file. The first field asks for the line number of the header names; this simplifies the later step where you give a name to each field. If your dataset doesn’t have headers, you can leave this blank. The second and third fields describe the syntax of your file. The second lets you ignore lines starting with a certain prefix (such as ‘#’ for comments). The third specifies the field delimiter (for CSV this would be a comma). Once you have entered the parameters, you can click the Parse sample data button to see whether your file was parsed correctly.
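To make the three settings more concrete, here is a rough Python sketch of how a header line number, an ignore prefix and a delimiter could be applied to a few sample lines. It is an illustration only, not the datacollider's actual parser. The function `parse_sample`, its parameter names and the sample data are hypothetical, and it assumes the header line number counts every line in the file, including ignored ones.

```python
def parse_sample(lines, header_line=None, ignore_prefix=None, delimiter=","):
    """Split lines into (headers, records).

    header_line is 1-based; None means the file has no header row.
    """
    headers = None
    records = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        if ignore_prefix and line.startswith(ignore_prefix):
            continue  # skip comment lines
        fields = line.split(delimiter)
        if header_line is not None and number == header_line:
            headers = fields
        else:
            records.append(fields)
    return headers, records


# Hypothetical sample: a comment, a header on line 2, then one record.
lines = ["# exported sample", "medallion,trip_time_in_secs", "A1,382"]
headers, records = parse_sample(lines, header_line=2, ignore_prefix="#")
print(headers)
print(records)
```

The plain `split` call does not handle quoted fields that contain the delimiter; a real CSV reader would be needed for such files.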
On the third page, you’ll need to declare the details for each field. These include the name, an optional description, the value range, the data type and an optional alternative null value. Please see the screenshot below for an example. We defined the medallion ID as a field of type String, the trip time in seconds as a field of type long (which is basically an integer) and the latitude and longitude fields as type double (a floating point number).
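The effect of the data type and the alternative null value can be pictured with a small Python sketch. It assumes that the alternative null value is a literal string standing in for a missing value, which the guide does not spell out. The function `convert` is hypothetical, and Python's `int` and `float` only approximate the long and double types named above.

```python
def convert(value, data_type, null_value=None):
    """Convert a raw text field to the declared type, or None if it is null."""
    if null_value is not None and value == null_value:
        return None
    if data_type == "String":
        return value
    if data_type == "long":
        return int(value)
    if data_type == "double":
        return float(value)
    raise ValueError(f"unknown data type: {data_type}")


# Hypothetical raw values for the fields described above.
print(convert("A1", "String"))
print(convert("382", "long"))
print(convert("NA", "double", null_value="NA"))
```

A value that does not match its declared type, such as text in a long field, raises a `ValueError` in this sketch.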
Since we are preparing a temporal dataset, we need to select the field in each record that represents the event time. In our case, we use the pickup date as the timestamp. You can also select multiple fields, for example if your time information is split into a date field and a time field.
Once you have selected the fields and pressed Next, you will have to enter the format of your timestamp and the time zone your time is in. The description of the syntax for the date format can be found at the link specified. Please also note the example below to get a better understanding of how this works.
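To illustrate the general idea of a timestamp format plus a time zone, here is a small Python sketch that joins a separate date field and time field, parses them with a format string and attaches an offset. Please note that the `%Y-%m-%d` style directives are Python's own and are not necessarily the syntax the datacollider expects; use the syntax described at the link. The sample values and the function `parse_event_time` are hypothetical, and a fixed UTC offset is used for simplicity, so daylight saving time is not handled.

```python
from datetime import datetime, timedelta, timezone


def parse_event_time(date_part, time_part, fmt, utc_offset_hours):
    """Join split date/time fields, parse them, and attach a fixed UTC offset."""
    naive = datetime.strptime(f"{date_part} {time_part}", fmt)
    return naive.replace(tzinfo=timezone(timedelta(hours=utc_offset_hours)))


# Hypothetical pickup time, recorded five hours behind UTC.
event = parse_event_time("2013-01-01", "15:11:48", "%Y-%m-%d %H:%M:%S", -5)
print(event.isoformat())                           # 2013-01-01T15:11:48-05:00
print(event.astimezone(timezone.utc).isoformat())  # 2013-01-01T20:11:48+00:00
```

If the format string does not match the data, `strptime` raises a `ValueError`, which is a quick way to spot a wrong format before uploading.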
And that’s it. Once you have entered all this information, you can click the start structuring button to begin processing. The processing time depends on the size of your data file. For small datasets, this should only take a few minutes. We’ll send you an email when the dataset is ready to use. Once you get the email, log back in to the datacollider and you will see your new dataset in the list of datasets available to you.