The model architecture and loss function are taken from their paper.
Briefly, the model uses a VGG-16 network as its backbone and has 3 Scale blocks, which take the resolution through
15x20 -> 60x80 -> 120x160 -> 120x160.
The input image resolution is 320x240 and the output depthmap resolution is 120x160.
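As a quick sanity check on the sizes quoted above, the short sketch below derives the backbone's downsampling factor and the upscaling factor of each Scale block. It assumes the stage sizes are written as height x width (so the 320x240 input becomes 240 rows by 320 columns); that reading is an assumption, and the variable names are only illustrative.

```python
# Feature-map sizes (height, width) quoted in the text, in order.
stages = [(15, 20), (60, 80), (120, 160), (120, 160)]

# 320x240 input, written here as (height, width). Assumed convention.
input_hw = (240, 320)

# Downsampling factor of the backbone output relative to the input.
backbone_stride = input_hw[0] // stages[0][0]

# Upscaling factor applied by each of the three Scale blocks.
scale_factors = [nxt[0] // cur[0] for cur, nxt in zip(stages, stages[1:])]

print(backbone_stride)  # 16
print(scale_factors)    # [4, 2, 1]
```

Under that reading, the backbone reduces the input by a factor of 16, the first two Scale blocks upscale by 4 and 2, and the last block refines at the same resolution.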
The loss function takes the difference of pixel values in log scale and adds up the squares of those differences. An additional image gradient component has been added so that the scene geometry and edges come out right along with the relative depth values.
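A minimal NumPy sketch of a loss with this shape is given below, not the exact implementation used here. It assumes that the gradient term is computed on the log-depth difference with simple finite differences, that both terms are averaged over pixels, and that they are combined with a weight. The names `log_depth_loss`, `grad_weight` and `eps` are hypothetical.

```python
import numpy as np

def log_depth_loss(pred, target, grad_weight=1.0, eps=1e-6):
    """Squared log-depth error plus a gradient-matching term (sketch only)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)

    # Difference of depths in log space; eps guards against log(0).
    d = np.log(pred + eps) - np.log(target + eps)

    # Pixel-wise term: mean of the squared log differences.
    pixel_term = np.mean(d ** 2)

    # Gradient term (assumed form): finite differences of d along
    # the width (x) and height (y) axes, squared and averaged.
    dx = d[..., :, 1:] - d[..., :, :-1]
    dy = d[..., 1:, :] - d[..., :-1, :]
    grad_term = np.mean(dx ** 2) + np.mean(dy ** 2)

    return pixel_term + grad_weight * grad_term


# Example: a prediction with a spurious depth edge is penalised by both terms.
target = np.full((4, 6), 2.0)
pred = target.copy()
pred[:, 3:] = 4.0
print(log_depth_loss(pred, target))
```

The gradient term is zero when the prediction differs from the target only by a constant factor, so it penalises misplaced or missing edges rather than overall scale.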
- The model predicts object boundaries well, thanks to the image gradient component added in the newer loss function.
- Prediction quality is decent, considering that depth is estimated from a single image.
- The model produces depthmaps at a lower resolution (320x240).
- Depthmaps lack clarity.
- The model is really large (~900MB), and inference takes ~2s for a mini-batch of 8 images (640x480).