Using a U-Net to Enhance Dark Photos

This topic describes how a special type of convolutional neural network (CNN) called a U-Net can be used to enhance dark photos, and how to implement this approach in PerceptiLabs.

The approach is based on Learning to See in the Dark (SID), a scholarly article available at https://arxiv.org/abs/1805.01934 that demonstrates how machine learning (ML) can be used in place of traditional digital image processing techniques to enhance very dark images. The authors of that project have also made their TensorFlow code available on GitHub.

Overview of SID's Approach

SID's U-Net takes in raw camera data from a dark image and learns to predict what the image would look like with better lighting. More precisely, it learns to predict each pixel of an image taken from the same angle with more light.

The training data consists of pairs of images. The first item in each pair is the raw sensor data collected from the camera (i.e., the raw data for a really dark image), while the second item is the corresponding sRGB image, as it would appear if correctly lit.

The entire pipeline behind SID's approach is illustrated in Fig 1:

Fig 1: Overview of SID's pipeline including image preprocessing, ConvNet upscaling, and ConvNet output.

Here we can see that the raw sensor data exists as a Bayer array, which is first pre-processed into separate input channels for each pixel color and is spatially reduced in both dimensions. The purpose of rearranging the Bayer array during pre-processing is to put the input data into a form that is better suited for processing; no data is lost in this step. This is done because the first layer of the U-Net that will eventually work on this data performs better with four smaller channels separated by color than with one larger image channel.
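The packing step described above can be sketched in NumPy. This is a minimal illustration, assuming the Bayer array is a single-channel array with even height and width and a repeating 2x2 pattern; the function name pack_bayer and the channel order are assumptions made for this example and are not taken from the PerceptiLabs model.

```python
import numpy as np


def pack_bayer(raw):
    """Rearrange an (H, W) Bayer mosaic into an (H/2, W/2, 4) array.

    Each output channel holds one position of the repeating 2x2 pattern,
    so every input value is kept exactly once (no data is lost).
    """
    h, w = raw.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    return np.stack(
        [
            raw[0::2, 0::2],  # top-left position of each 2x2 block
            raw[0::2, 1::2],  # top-right position
            raw[1::2, 1::2],  # bottom-right position
            raw[1::2, 0::2],  # bottom-left position
        ],
        axis=-1,
    )


raw = np.arange(16, dtype=np.float32).reshape(4, 4)
packed = pack_bayer(raw)
print(packed.shape)  # (2, 2, 4)
```

The output has half the height and width but four channels, so the total number of values is unchanged.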

Preprocessing then removes the black levels from the rearranged data and applies an amplification factor to specify the desired level of brightness.

The preprocessed data is then fed into the U-Net for training. The last layer of the U-Net makes use of a sub-pixel upsampling layer to ultimately produce a brightened image from the original raw sensor data. During training, the U-Net uses the corresponding sRGB images as the ground truth for supervised learning.
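As a rough sketch of the black-level removal and amplification described above, the following NumPy function subtracts a black level, scales the result to the range [0, 1], and multiplies by an amplification ratio. The function name and the numeric values (a 14-bit sensor with a black level of 512 and a ratio of 100) are illustrative assumptions; the correct values depend on the camera and the desired brightness.

```python
import numpy as np


def subtract_black_and_amplify(packed, black_level, white_level, ratio):
    """Remove the sensor's black level, scale to [0, 1], then brighten."""
    normalized = np.maximum(packed.astype(np.float32) - black_level, 0.0)
    normalized /= float(white_level - black_level)
    # Clip so that amplified highlights stay within the valid range.
    return np.minimum(normalized * ratio, 1.0)


# Illustrative values only (assumed, not taken from the sample model).
packed = np.array([[[512, 600, 700, 16383]]], dtype=np.uint16)
out = subtract_black_and_amplify(
    packed, black_level=512, white_level=16383, ratio=100
)
print(out.shape)  # (1, 1, 4)
```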

A Closer Look at the U-Net

The power of this approach lies in its use of a U-Net, which extracts features from images across multiple levels of resolution. The U-Net repeatedly downsamples the data into lower resolutions and then upsamples it again, making use of convolutional layers along the way, until the output is an image of the same resolution as the input. In this case, the U-Net provides us with an image at the desired level of brightness.

A U-Net has a contracting path consisting of a series of layers that repeatedly convolve the data into lower resolutions using pairs of convolution and max-pooling layers, while extracting increasingly detailed feature information (maps). A U-Net also has an expansive path that takes the feature information, concatenates it with image data, and repeatedly upscales the image. The feature information is fed from the convolution layers in the contracting path to the concatenation layers in the expansive path through skip connections. These skip connections let the network reuse features that were extracted earlier in the network. When a U-Net is generating its prediction, it takes previous features at each level of resolution from the contracting path to help it make more accurate predictions in the expansive path.
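To make the downsampling in the contracting path more concrete, here is a minimal NumPy sketch of 2x2 max pooling with a stride of 2, which halves the resolution of a feature map. The function name max_pool_2x2 is an assumption for this example, and the sketch assumes even height and width.

```python
import numpy as np


def max_pool_2x2(x):
    """Halve the height and width of an (H, W, C) array by taking the
    maximum of each non-overlapping 2x2 block."""
    h, w, c = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    # Split each spatial axis into (blocks, 2) and reduce over the 2s.
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))


x = np.arange(16, dtype=np.float32).reshape(4, 4, 1)
pooled = max_pool_2x2(x)
print(pooled.shape)  # (2, 2, 1)
```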

The following diagram illustrates a U-Net architecture to accomplish the task at hand:

Fig 2: Overview of how a U-Net downsamples and upsamples. Note the use of skip connections that let the network reuse features that were extracted earlier in the network.

Here we can see that the overall shape of this CNN's architecture is that of a "U", hence the name U-Net.

Recreating the SID U-Net in PerceptiLabs

With PerceptiLabs' focus on model visualization, a U-Net can be visually constructed in the Modeling Tool and arranged to look very much like the conceptual architecture shown above. Below is an example of a fully-functional U-Net built in PerceptiLabs that can be trained with the data provided in the SID project:

Fig 3: Implementation of the SID U-Net in PerceptiLabs.

Note: this model architecture is not compatible with PerceptiLabs v0.12 as the components have now changed.

At the top of the network is a Data component (1) that provides the sRGB reference images. For example, we might feed in a picture that looks like this:

Fig 4: Example of a reference image with ideal lighting.

These are fed into a Classification training component (2) that uses them to perform supervised learning. Another Data component (3) provides the corresponding pre-processed camera sensor data that is fed into a Convolution Layer component (4). Such image data could look something like the following:

Fig 5: Example of raw image data corresponding to the picture in Fig 4.

The Convolution Layer component (4) downsamples the data and sends it to a subsequent Convolution Layer (5) for further downsampling, while also extracting the features as a feature map, thus forming the contracting path.

Each feature map is fed into a Merge component (6) in the expansive path using a skip connection. The Merge component then concatenates the feature map with upsampled data from earlier in the expansive path. Starting at the bottom of the U-Net, the most down-sampled data (8) is fed into a Deconvolution Layer component (9) for upsampling. The upsampled image is concatenated with the feature map from the corresponding resolution of downsampled data using a Merge component (10), and the output is fed into a subsequent convolution layer (11).
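The upsample-then-merge step can be sketched with NumPy shapes alone. Note that this example swaps in nearest-neighbor upsampling as a stand-in for the learned Deconvolution Layer, so the channel count of the upsampled data is not reduced as it would typically be in a real model; the function names and array shapes are assumptions chosen for illustration.

```python
import numpy as np


def upsample_nearest(x, factor=2):
    """Increase the spatial resolution of an (H, W, C) array by repeating
    pixels. This is a simple stand-in for a learned deconvolution."""
    return x.repeat(factor, axis=0).repeat(factor, axis=1)


def merge(upsampled, skip):
    """Concatenate two feature maps of equal resolution along the channel axis."""
    if upsampled.shape[:2] != skip.shape[:2]:
        raise ValueError("spatial dimensions must match before merging")
    return np.concatenate([upsampled, skip], axis=-1)


bottom = np.zeros((16, 16, 256), dtype=np.float32)  # most down-sampled data
skip = np.ones((32, 32, 128), dtype=np.float32)     # from the contracting path
merged = merge(upsample_nearest(bottom), skip)
print(merged.shape)  # (32, 32, 384)
```

The merged tensor keeps the features from both paths side by side, which is what lets later convolution layers reuse the earlier, higher-resolution features.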

The convolution layers along the expansive path provide additional intermediate image-processing steps. Note that these layers use “same” padding and a stride of 1, so they do not change the resolution of the feature maps, unlike the deconvolution layers. The process is then repeated, thus forming the expansive path.
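The effect of stride on resolution can be checked with the output-size rule that TensorFlow documents for 'SAME' padding, where the output size is the input size divided by the stride, rounded up. The helper name below is an assumption for this example.

```python
import math


def same_padding_output_size(input_size, stride):
    """Spatial output size of a convolution or pooling op with 'SAME' padding."""
    return math.ceil(input_size / stride)


print(same_padding_output_size(512, 1))  # 512: resolution is preserved
print(same_padding_output_size(512, 2))  # 256: resolution is halved
```

With a stride of 1 the resolution is unchanged regardless of the patch size, which is why only the pooling and deconvolution steps alter the size of the feature maps.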

Each layer produces feature maps that can be inspected in PerceptiLabs using the preview windows of the various layers. The preview provides a view of one feature map. For example, a layer in the contracting path might generate a feature map that looks as follows:

Fig 6: Example of a feature map generated by a convolution layer in the contracting path.

Initially, an untrained layer in the expansive path might produce a feature map similar to the following:

Fig 7: Example of an output image generated by an untrained deconvolution layer in the expansive path.

However, as the model is trained, these output images will gradually begin to look more and more like the ideal image shown earlier in Fig 4.

The output of our U-Net is a 512x512x12 feature space, which has one quarter of the spatial area but four times as many channels as the desired output (a 1024x1024x3 image). We can map from one to the other to get the final output image by using TensorFlow's depth_to_space function, which is invoked in the very last convolutional layer. This trick is called a sub-pixel layer. The diagram below shows how a single color channel can be reconstructed from four smaller channels, which we do after the last layer of the U-Net.

Fig 8: Reconstructing a single color channel from four smaller channels in the last layer of a U-Net.
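The rearrangement performed by the sub-pixel layer can be sketched in NumPy. The function below is intended to follow the channel ordering documented for TensorFlow's depth_to_space with channels-last data, minus the batch dimension; it is an illustrative re-implementation, not the code used by the model.

```python
import numpy as np


def depth_to_space(x, block_size):
    """Rearrange an (H, W, C) array into (H*b, W*b, C / b**2), b = block_size."""
    h, w, c = x.shape
    if c % (block_size ** 2):
        raise ValueError("channel count must be divisible by block_size**2")
    out_c = c // (block_size ** 2)
    # Split the channel axis into (row offset, column offset, output channel).
    x = x.reshape(h, w, block_size, block_size, out_c)
    # Interleave the offsets with the spatial axes.
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h * block_size, w * block_size, out_c)


features = np.zeros((512, 512, 12), dtype=np.float32)
image = depth_to_space(features, 2)
print(image.shape)  # (1024, 1024, 3)

# Four one-pixel channels become a single 2x2 channel.
tiny = np.arange(4).reshape(1, 1, 4)
print(depth_to_space(tiny, 2)[:, :, 0].tolist())  # [[0, 1], [2, 3]]
```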

Note that the final output fed from the Convolution layer (7) in the expansive path into the Normal training component (2) has the same dimensions as the original, ideal reference image provided by the top-most Data component (1). This means that we can train the U-Net model to reconstruct these images with a pixel-wise loss function. As in the original SID project, we use a pixel-wise L1 loss score to minimize the absolute difference between predicted pixels and actual pixels. When the model converges, we have a system that can reliably predict the ideal lighting of images from low-exposure images at the same level of resolution.
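A pixel-wise L1 loss is simply the mean absolute difference between the predicted and reference images. The NumPy sketch below shows the calculation on small arrays; the function name is an assumption, and an actual training component would compute the same quantity on tensors so that gradients can be derived from it.

```python
import numpy as np


def l1_loss(predicted, target):
    """Mean absolute difference between two images of identical shape."""
    if predicted.shape != target.shape:
        raise ValueError("predicted and target must have the same shape")
    return float(np.mean(np.abs(predicted - target)))


predicted = np.full((2, 2, 3), 0.5, dtype=np.float32)
target = np.full((2, 2, 3), 0.75, dtype=np.float32)
print(l1_loss(predicted, target))  # 0.25
```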

Sample Model

We've provided a sample PerceptiLabs model of the U-Net described above, along with sample data, in this GitHub Repo.

The repo has the following structure:

  • /Data: contains pre-processed versions of the raw data/reference image pairs to use for training. These originated as raw camera data files in the SID project. The pairs of images are stored in the following subdirectories. Note that both images in a pair use the same filename:

    • /Data/short_cropped: contains the raw, dark camera sensor data images that have been preprocessed, cropped, and spatially reduced to a resolution of 512x512x4 in .tiff format.

    • /Data/long_cropped: contains the corresponding reference images with the ideal lighting that have been preprocessed and cropped to a resolution of 1024x1024x3 (sRGB) in .tiff format.
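Because both images in a pair share a filename, the pairs can be matched up with a few lines of standard-library Python. This is a minimal sketch, assuming the repo's Data directory has been downloaded locally; the function name pair_training_files is hypothetical, and reading the .tiff contents is left out.

```python
from pathlib import Path


def pair_training_files(data_dir):
    """Return (raw_path, reference_path) tuples matched by filename."""
    data_dir = Path(data_dir)
    short_dir = data_dir / "short_cropped"  # dark, pre-processed raw data
    long_dir = data_dir / "long_cropped"    # well-lit reference images
    pairs = []
    for short_path in sorted(short_dir.glob("*.tiff")):
        long_path = long_dir / short_path.name
        # Skip raw files that have no reference image with the same name.
        if long_path.exists():
            pairs.append((short_path, long_path))
    return pairs
```

Calling pair_training_files with the path to the local Data directory returns one tuple per training example.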

Hyperparameter Settings

The hyperparameters of each layer are derived from the original SID project, with minor differences. For all convolutional layers, the patch size is always 3, and the stride is always 1. The only exception is the last layer, which uses a patch size of 1.

Contracting Path

Each level of resolution has two identical convolutional layers. Each time the resolution is halved, the number of feature channels is doubled as follows:

  • 512x512 resolution: 8 feature channels

  • 256x256 resolution: 16 feature channels

  • 128x128 resolution: 32 feature channels

  • 64x64 resolution: 64 feature channels

  • 32x32 resolution: 128 feature channels

  • 16x16 resolution: 256 feature channels
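The halving and doubling pattern in the list above can be generated programmatically, which is handy when checking layer settings. The function name and default arguments below are assumptions based on the values listed; reversing the result gives the corresponding order for the expansive path.

```python
def contracting_path_schedule(start_resolution=512, start_channels=8, levels=6):
    """List (resolution, feature_channels) for each level of the contracting path."""
    schedule = []
    resolution, channels = start_resolution, start_channels
    for _ in range(levels):
        schedule.append((resolution, channels))
        resolution //= 2  # resolution is halved at each level
        channels *= 2     # feature channels are doubled at each level
    return schedule


for resolution, channels in contracting_path_schedule():
    print(f"{resolution}x{resolution}: {channels} feature channels")
```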

Expansive Path

Each level of resolution has a concatenative layer, a convolutional layer, and a deconvolutional layer. Note that in the original SID implementation of U-Net, the authors use an additional convolutional layer in the expansive path. As in the contracting path, the resolution is inversely related to the number of feature channels (i.e., when the resolution is doubled, the number of feature channels is halved):

  • 16x16 resolution: 256 feature channels

  • 32x32 resolution: 128 feature channels

  • 64x64 resolution: 64 feature channels

  • 128x128 resolution: 32 feature channels

  • 256x256 resolution: 16 feature channels

  • 512x512 resolution: 8 feature channels

Loading the Sample Model

Follow the steps below to load the sample model in PerceptiLabs:

  1. Clone or download the sample model from GitHub.

  2. On the Model Hub screen, import the sample model into PerceptiLabs. Navigate to and select the location of the sample's model.json file.

  3. Open the topmost Data component in the model workspace and set its folder to the long_cropped directory.

  4. Open the second Data component in the model workspace and set its folder to the short_cropped directory.

Comparing Raw Coding to PerceptiLabs' Visual Approach

In this section we look at how a complex model such as a U-Net can be easier to implement visually.

Part of the original U-Net TensorFlow code from the SID project is shown here:

def network(input):  # Unet
    conv1 = slim.conv2d(input, 32, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv1_1')
    conv1 = slim.conv2d(conv1, 32, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv1_2')
    pool1 = slim.max_pool2d(conv1, [2, 2], padding='SAME')
    conv2 = slim.conv2d(pool1, 64, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv2_1')
    conv2 = slim.conv2d(conv2, 64, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv2_2')
    pool2 = slim.max_pool2d(conv2, [2, 2], padding='SAME')
    conv3 = slim.conv2d(pool2, 128, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv3_1')
    conv3 = slim.conv2d(conv3, 128, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv3_2')
    pool3 = slim.max_pool2d(conv3, [2, 2], padding='SAME')
    conv4 = slim.conv2d(pool3, 256, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv4_1')
    conv4 = slim.conv2d(conv4, 256, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv4_2')
    pool4 = slim.max_pool2d(conv4, [2, 2], padding='SAME')
    conv5 = slim.conv2d(pool4, 512, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv5_1')
    conv5 = slim.conv2d(conv5, 512, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv5_2')
    up6 = upsample_and_concat(conv5, conv4, 256, 512)
    conv6 = slim.conv2d(up6, 256, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv6_1')
    conv6 = slim.conv2d(conv6, 256, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv6_2')
    up7 = upsample_and_concat(conv6, conv3, 128, 256)
    conv7 = slim.conv2d(up7, 128, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv7_1')
    conv7 = slim.conv2d(conv7, 128, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv7_2')
    up8 = upsample_and_concat(conv7, conv2, 64, 128)
    conv8 = slim.conv2d(up8, 64, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv8_1')
    conv8 = slim.conv2d(conv8, 64, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv8_2')
    up9 = upsample_and_concat(conv8, conv1, 32, 64)
    conv9 = slim.conv2d(up9, 32, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv9_1')
    conv9 = slim.conv2d(conv9, 32, [3, 3], rate=1, activation_fn=lrelu, scope='g_conv9_2')
    conv10 = slim.conv2d(conv9, 27, [1, 1], rate=1, activation_fn=None, scope='g_conv10')
    out = tf.depth_to_space(conv10, 3)
    return out

While the original code is fairly easy to follow, creating a model as complex as a U-Net in a non-visual, code-centric way suffers from the following drawbacks:

  • it's not immediately obvious what the overall model structure is (i.e., a U-Net)

  • the presence of skip connections is difficult to identify

  • there is no way to see the dimensions of the various layers

For comparison, the screenshot of the same model implemented in PerceptiLabs is shown here:

Fig 9: Implementation of the U-Net in PerceptiLabs.

Here the structure of the model is clearly that of a U-Net, and both its skip connections and the dimensions of the various layers are visible. Moreover, it's easy to identify the contracting and expansive paths, and the layers that make up each. In addition, you can easily arrange the layers and establish the connections between them visually.

The code for the various components is provided for you by PerceptiLabs whenever you add a component, and modifying that code is optional since the various hyperparameters can be configured visually as well.

The code for each component is encapsulated in a class whose name is composed of the component type and the component’s name. For example, the code below is from the second convolutional layer of the model's contracting path:

class DeepLearningConv_Convolution_2(Tf1xLayer):
    def __init__(self):
        self._scope = 'DeepLearningConv_Convolution_2'
        # TODO: implement support for 1d and 3d conv, dropout, funcs, pooling, etc
        self._patch_size = 3
        self._feature_maps = 16
        self._padding = 'SAME'
        self._stride = 1
        self._keep_prob = 1
        self._variables = {}

    def __call__(self, x):
        """ Takes a tensor as input and feeds it forward through a convolutional layer, returning a new tensor."""
        with tf.compat.v1.variable_scope(self._scope, reuse=tf.compat.v1.AUTO_REUSE):
            shape = [
                self._patch_size,
                self._patch_size,
                x.get_shape().as_list()[-1],
                self._feature_maps
            ]
            #initial = tf.random.truncated_normal(
            #    shape,
            #    stddev=np.sqrt(2/(self._patch_size)**2 + self._feature_maps)
            #)
            W = tf.compat.v1.get_variable('W', shape=shape, initializer=tf.contrib.layers.xavier_initializer())
            #initial = tf.constant(0.1, shape=[self._feature_maps])
            b = tf.compat.v1.get_variable('b', shape=[self._feature_maps], initializer=tf.zeros_initializer())
            y = tf.add(tf.nn.conv2d(x, W, strides=[1, self._stride, self._stride, 1], padding=self._padding), b)
            y = tf.compat.v1.nn.relu(y)
            y = tf.nn.max_pool(y, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

        self._variables = {k: v for k, v in locals().items() if can_serialize(v)}
        return y

    @property
    def variables(self):
        """Any variables belonging to this layer that should be rendered in the frontend.

        Returns:
            A dictionary with tensor names for keys and picklable for values.
        """
        return self._variables.copy()

    @property
    def trainable_variables(self):
        """Any trainable variables belonging to this layer that should be updated during backpropagation. Their gradients will also be rendered in the frontend.

        Returns:
            A dictionary with tensor names for keys and tensors for values.
        """
        variables = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES, scope=self._scope)
        variables = {v.name: v for v in variables}
        return variables

    @property
    def weights(self):
        """Any weight tensors belonging to this layer that should be rendered in the frontend.

        Return:
            A dictionary with tensor names for keys and tensors for values.
        """
        with tf.compat.v1.variable_scope(self._scope, reuse=tf.compat.v1.AUTO_REUSE):
            w = tf.compat.v1.get_variable('W')
            return {w.name: w}

    @property
    def biases(self):
        """Any bias tensors belonging to this layer that should be rendered in the frontend.

        Return:
            A dictionary with tensor names for keys and tensors for values.
        """
        with tf.compat.v1.variable_scope(self._scope, reuse=tf.compat.v1.AUTO_REUSE):
            b = tf.compat.v1.get_variable('b')
            return {b.name: b}

Depending on the type of component, its class may have the following:

  • __init__(): class constructor/initializer.

  • __call__(): enables an instance to be called as a function (required in order for PerceptiLabs to invoke the class as part of a model).

  • run() (included in Training components)

  • variables(): a dictionary of tensor values for the layer.

  • trainable_variables(): a dictionary of tensors to be updated during back propagation when training.

  • weights(): a dictionary of weight tensors that are updated during training.

  • biases(): a dictionary of bias tensors that are updated during training.
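The callable-class pattern used by these components can be illustrated without TensorFlow. The class below is a hypothetical, simplified stand-in (it is not a PerceptiLabs component and does not derive from Tf1xLayer); it only shows how __init__(), __call__(), and a read-only property fit together.

```python
class ScaleLayer:
    """Hypothetical component-like class that multiplies its input by a constant."""

    def __init__(self, factor=2.0):
        # Hyperparameters are stored on the instance when it is constructed.
        self._factor = factor
        self._variables = {}

    def __call__(self, x):
        # Defining __call__ lets an instance be invoked like a function.
        y = [value * self._factor for value in x]
        # Record values so that they can be inspected after the call.
        self._variables = {"x": list(x), "y": y}
        return y

    @property
    def variables(self):
        # Return a copy so that callers cannot modify the internal state.
        return self._variables.copy()


layer = ScaleLayer(factor=3.0)
print(layer([1.0, 2.0]))        # [3.0, 6.0]
print(sorted(layer.variables))  # ['x', 'y']
```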

Note

You can easily view and share the code from all of your model's components using PerceptiLabs' Notebook functionality.

Although the PerceptiLabs model ultimately results in more code, it also provides more granular control over each layer, while retaining the same level of training performance as a raw code-based TensorFlow implementation, with the added benefit of visualizing the model.

Conclusion

A U-Net is a powerful way to leverage CNNs for complex use cases such as the enhancement of dark images. However, the number of layers involved in constructing a U-Net, and special features like skip connections, can make it difficult to visualize such a model when implemented purely in code.

As we've shown in this topic, PerceptiLabs' visual modeling approach makes it much easier to architect such models while still providing full access to the underlying code.