Data Wizard

Introduced in PerceptiLabs v0.12, the Data Wizard comprises a series of screens on the New Model popup (accessed from the Model Hub) that help you import your data and get up and running with a basic, working neural network model in PerceptiLabs' Modeling Tool.

The expected data consists of:

  • a .csv file that maps data (e.g., image files) to labels

  • (optional) additional raw data (e.g., .png image files) that PerceptiLabs will use as a data source.

Workflow Overview

The general workflow is as follows:

1) Click Create in the Model Hub.

2) Click Load data in the New Model popup to start the Data Wizard:

3) Locate your CSV file and click Confirm.

4) Define your dataset.

5) Configure your model for training.

CSV File Format

Your .csv file must contain two or more columns, each separated by a comma. The first row must contain column headers separated in the same way (the headers can be any strings you want). The following is an example of valid CSV data that maps image files in an images subdirectory (located in the same directory as the .csv file itself) to classification labels:

Image_paths,Labels
images/0.png,7
images/1.png,2
images/2.png,1
images/3.png,0
images/4.png,4

This example is used in the Basic Image Recognition Tutorial.
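 
If you want to sanity-check a mapping file like this before importing it, a short script along the following lines can help. This is a generic pandas illustration, not part of PerceptiLabs itself; the file name dataset.csv is an assumption, and the column names Image_paths and Labels match the example above.

# Sanity-check a PerceptiLabs-style mapping CSV before importing it.
# Generic pandas illustration; not part of the PerceptiLabs API.
from pathlib import Path

import pandas as pd

csv_path = Path("dataset.csv")          # assumed location of the mapping file
df = pd.read_csv(csv_path)

print(df.head())                        # first rows, e.g. Image_paths / Labels

# Image paths in the CSV are relative to the CSV file itself.
missing = [p for p in df["Image_paths"]
           if not (csv_path.parent / p).is_file()]
if missing:
    print(f"{len(missing)} image file(s) referenced in the CSV are missing:")
    print(missing[:5])
else:
    print("All referenced image files exist.")

print("Label distribution:")
print(df["Labels"].value_counts())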

Define Your Dataset

After loading your .csv file, the Data Wizard requires you to define the data so that PerceptiLabs can prepare to train on it:

The main elements of this screen are as follows:

  1. Dataset Column Definitions: allows you to configure each column.

  2. Column Pre-processing Settings (available only for certain data types, such as images and numerical data): displays a popup with the following options that allow you to specify how PerceptiLabs should pre-process the data before loading it (a rough code sketch of some of these options follows this list):

    1. Normalize: lets you choose which method to use to normalize your data, bringing it into a specific range of values. This is useful in most cases, as long as the absolute value of the data is not itself important.

    2. Random Flip: doubles the size of the dataset and randomly selects specific images to flip.

    3. Resize: resizes the image. Set to Custom to specify the width and height in pixels to resize each image. Set to Automatic and select one of the following options:

      1. Dataset mode: determine the most common (mode) image size in the dataset and resize all images to that size.

      2. Dataset mean: determine the average image size and resize all images to that size.

      3. Dataset max: determine the largest image size and resize all images to that size.

      4. Dataset min: determine the smallest image size and resize all images to that size.

    4. Random Rotation (based on the RandomRotation layer): randomly rotates some of the images. The regions exposed by the rotation are filled using one of the following methods:

      1. Reflect (d c b a | a b c d | d c b a): the input is extended by reflecting about the edge of the last pixel.

      2. Constant (k k k k | a b c d | k k k k): the input is extended by filling all values beyond the edge with the same constant value k = 0.

      3. Wrap (a b c d | a b c d | a b c d): the input is extended by wrapping around to the opposite edge.

      4. Nearest (a a a a | a b c d | d d d d): the input is extended by the nearest pixel.

      Factor can be set to a value between 0 and 2pi to specify the maximum rotation; images are randomly rotated by an angle between the negative and positive of that value. Set Seed to seed the random number generator.

    5. Random Crop: randomly crops some of the images to the specified size. This also doubles the size of the dataset.

  3. Dataset Column Examples: shows a small sample of the columns and data loaded from the .csv file so you can preview its contents.

  4. Input/Target: specifies whether the CSV column shown directly above this field represents input or target (classification) data. In the screenshot above, the images column is defined as Input and the Labels column as Target. To ignore a column, set it to Do not use.

  5. Data Type: specifies the type of data represented in the column directly above this field. In the screenshot above, the images column is configured as representing image data, and the labels column is configured as representing categorical (i.e., classification) data. The currently available data types are:

    1. Categorical: strings or numbers; they are automatically converted into numbers and one-hot encoded.

    2. Image: loaded as a path to image data; the supported file types are .jpg, .jpeg, .png, .tif, and .tiff.

    3. Text: string data.

    4. Numerical: numerical data.

  6. Data Partition: partitions the data into three sets (a code sketch of this split appears at the end of this section):

    1. Training: core training data on which to train the model.

    2. Validation (aka verification data): data used to test model fit during training.

    3. Test: data to test the model against after training, to see how well the trained model handles data it hasn't seen before.

  7. Randomize partition: when enabled, constructs the partitions using a random order of data samples.

  8. Reload dataset: returns to the .csv file selection screen.
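
To make the pre-processing options above more concrete, the sketch below shows roughly equivalent operations using standard TensorFlow/Keras layers. This is an illustration of the underlying ideas (min-max normalization, random flipping, and random rotation with a fill mode), not the exact code PerceptiLabs generates; the factor value, seed, and image shape are arbitrary examples.

# Rough TensorFlow/Keras equivalents of some pre-processing options.
# Illustrative only; not the exact code PerceptiLabs generates.
import tensorflow as tf

# Normalize: bring pixel values into the 0-1 range (min-max scaling).
normalize = tf.keras.layers.Rescaling(1.0 / 255)

# Random Flip: flip some images horizontally at random.
flip = tf.keras.layers.RandomFlip("horizontal", seed=42)

# Random Rotation: rotate by a random angle and fill the exposed regions
# using one of the fill modes described above
# ("reflect", "constant", "wrap", or "nearest").
# In tf.keras, factor is a fraction of 2*pi, so 0.1 means up to +/-36 degrees.
rotate = tf.keras.layers.RandomRotation(
    factor=0.1,
    fill_mode="reflect",
    seed=42,
)

augment = tf.keras.Sequential([normalize, flip, rotate])

# Example: apply the pipeline to a batch of dummy 28x28 grayscale images.
images = tf.random.uniform((4, 28, 28, 1), maxval=255)
augmented = augment(images, training=True)   # training=True enables randomness
print(augmented.shape)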

After completing this configuration, click Next in the bottom right-hand corner to configure your model for training.
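
As a rough illustration of what the Input/Target, Categorical, and Data Partition settings amount to, the sketch below splits a mapping CSV into training, validation, and test sets and one-hot encodes the label column. PerceptiLabs performs these steps for you internally; pandas and scikit-learn are used here only as stand-ins, and the 70/20/10 split and file name are arbitrary examples.

# Illustrative train/validation/test split and one-hot encoding of labels.
# PerceptiLabs does this internally; this is a generic stand-in.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("dataset.csv")          # Image_paths (Input), Labels (Target)

# "Randomize partition": shuffle the samples before splitting.
train_df, rest_df = train_test_split(df, test_size=0.30, shuffle=True,
                                     random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=1 / 3,
                                   random_state=42)   # 20% val, 10% test

print(len(train_df), len(val_df), len(test_df))

# "Categorical": labels are converted to numbers and one-hot encoded.
one_hot_labels = pd.get_dummies(train_df["Labels"])
print(one_hot_labels.head())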

Configuring Your Model for Training

After defining your dataset, the final step is to configure the training settings:

The main elements of this screen are as follows:

  1. Name and Model Path: allows you to specify a unique name and path at which to store your model. PerceptiLabs will generate a subdirectory in the model's location using the specified model name. The model will be saved to a model.json file within that directory every time you save the model.

  2. Epochs: sets the number of epochs to perform. One epoch corresponds to one complete pass through the entire dataset. The higher the number, the better the model will learn your training data; note, however, that training for too long may overfit your model to your training data.

  3. Batch size: the number of samples that the algorithm should train on at a time before updating the weights in the model. Higher values will speed up training and may make your model generalize better. However, values that are too high may prevent your model from learning the data.

  4. Loss: specifies which loss function to apply.

  5. Learning rate: sets the learning rate for the algorithm. The value must be between 0 and 1 (the default is 0.001). The higher the value, the quicker your model will learn. If the value is too high, training can skip over good local minima; if it is too low, training can get stuck in a poor local minimum.

  6. Save checkpoint every epoch: when enabled, saves a training checkpoint every epoch.

  7. Optimizer: specifies which optimizer algorithm to use for the model. The optimizer continually tries new weights and biases during training until it reaches its goal of finding the optimal values for the model to make accurate predictions. Optimizers available in PerceptiLabs' Training components include: ADAM, Stochastic gradient descent (SGD), Adagrad, Momentum, and RMSprop.

  8. Beta1: optimizer-specific parameter. See the TensorFlow Optimizers page for optimizer-specific definitions.

  9. Beta2: optimizer-specific parameter. See the TensorFlow Optimizers page for optimizer-specific definitions.

  10. Shuffle: randomizes the order in which the training data is presented to the model, which can make the model more robust.
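
For readers who want to see how settings like these typically translate into code, the sketch below compiles and trains a Keras model with an Adam optimizer, a learning rate, beta values, a loss function, epochs, batch size, and shuffling. It is a generic TensorFlow example, not the code PerceptiLabs generates; the model, loss choice, and dummy data are placeholders assumed for illustration.

# How Epochs, Batch size, Loss, Learning rate, Beta1/Beta2, Optimizer and
# Shuffle typically translate into TensorFlow/Keras code. Generic sketch;
# the model below is a placeholder, not what PerceptiLabs generates.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# Optimizer, Learning rate, Beta1 and Beta2.
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001,
                                     beta_1=0.9, beta_2=0.999)

# Loss: cross-entropy is a common choice for classification.
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Dummy data standing in for the training partition.
x_train = tf.random.uniform((256, 28, 28, 1))
y_train = tf.random.uniform((256,), maxval=10, dtype=tf.int32)

# Epochs, Batch size and Shuffle.
model.fit(x_train, y_train, epochs=10, batch_size=32, shuffle=True)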

After configuring these settings, click one of the following:

  • Run model: starts training the model and displays the Statistics View where you can see how training is progressing.

  • Customize: displays the Modeling Tool where you can view and edit the model's architecture.
