TL;DR: The LPIPS metric might not translate well into a loss function for training an image transformation/processing model. This is despite the fact that it can serve as a good quantitative evaluator, correlating well with human perception of image quality.
Following is the code snippet I used to implement the LPIPS-based loss function for a de-blurring model.
Except for the convolutional layer after SpatialDropout, all weights were frozen. This corresponds to the lin configuration defined by Zhang et al. in The Unreasonable Effectiveness of Deep Features as a Perceptual Metric.
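As a rough sketch of the math behind this configuration: each layer's feature maps are channel-wise unit-normalized, differenced and squared, optionally weighted per channel (the trainable 1x1 convolution in the lin config), and spatially averaged. The code below is a framework-agnostic NumPy illustration, not the actual snippet; the function names, shapes, and the `eps` constant are my own, and the real model operates on Keras tensors rather than NumPy arrays.

```python
import numpy as np

def lpips_layer_distance(f0, f1, w=None, eps=1e-10):
    """LPIPS-style distance for one layer's feature maps of shape (H, W, C).

    Steps: unit-normalize each spatial position's channel vector,
    square the difference, apply optional per-channel weights (the
    1x1 conv in the 'lin' config), then spatially average to a scalar.
    """
    # Unit-normalize along the channel axis.
    n0 = f0 / (np.linalg.norm(f0, axis=-1, keepdims=True) + eps)
    n1 = f1 / (np.linalg.norm(f1, axis=-1, keepdims=True) + eps)
    diff2 = (n0 - n1) ** 2                    # (H, W, C)
    if w is not None:
        diff2 = diff2 * w                     # per-channel learnable weights
    return diff2.sum(axis=-1).mean()          # spatial average -> scalar

def lpips_loss(feats0, feats1, weights=None):
    """Sum the per-layer distances over lists of feature maps."""
    if weights is None:
        weights = [None] * len(feats0)
    return sum(lpips_layer_distance(a, b, w)
               for a, b, w in zip(feats0, feats1, weights))
```

Dropping the learnable 1x1 convolution (as the first insight below suggests) corresponds to passing `w=None`, which reduces each layer's term to a plain average of squared normalized-feature differences.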
Here are the key insights from my experiments with the LPIPS loss function.
- For my model, the loss function without the linear combination (the convolutional layer, lines 52–59) of the losses from the multilayer feature maps behaved better. With the learnable parameters included, model performance started to degrade after a few training epochs. Perhaps this configuration of the LPIPS loss is too strong a regularizer.
- Trying different kernel initializers (he_normal versus ones) for the convolutional layer made little difference to model training.
- Depending on the task at hand, applying tensor normalization before the convolution may or may not prove helpful.
- Without the Conv2D layer, a higher loss weight (0.06 versus 0.006) resulted in slightly lower peak performance on the validation dataset.
- Using latent features from a single layer (block_group4) worked better than using multilayer latent features. Using single-layer features matches the approach of Perceptual Losses for Real-Time Style Transfer and Super-Resolution by Johnson et al.
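The single-layer variant from the last point, combined with a weighted base loss as the loss-weight experiment implies, can be sketched as follows. The function names, the pixel-MSE base term, and the default weight are illustrative assumptions rather than the exact training code.

```python
import numpy as np

def perceptual_loss(feat_pred, feat_target):
    """Johnson-style single-layer perceptual loss: mean squared error
    between feature maps taken from one layer (e.g. block_group4)."""
    return np.mean((feat_pred - feat_target) ** 2)

def total_loss(y_pred, y_true, feat_pred, feat_target, loss_weight=0.006):
    """Pixel MSE plus a weighted perceptual term. The default weight
    mirrors the smaller of the two values tried above; the combination
    with pixel MSE is an assumption for illustration."""
    pixel_mse = np.mean((y_pred - y_true) ** 2)
    return pixel_mse + loss_weight * perceptual_loss(feat_pred, feat_target)
```

The loss weight trades off pixel fidelity against feature-space similarity, which is why changing it (0.06 versus 0.006) shifts where validation performance peaks.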
Note: Regarding the last two points, I used MSE to judge model performance on the validation dataset, which, as per Zhang et al., does not correlate well with human perception. Hence, it is possible that multilayer features with a higher loss weight are in fact a better training signal. So, if you have a way to bring in human judgment, you might want to try that.
When MSE is the metric used to judge model performance, the LPIPS loss function may not provide much advantage over a vanilla perceptual loss computed on the final feature vectors.