Evaluation metrics for generative image models
<img alt="" src="https://softwaremill.com/images/u/x/k/a/1/uxka16w39endirj-3e9b424a.png?g-044cb45a" /> <p>Training generative image models might be challenging, but properly evaluating them might be even more difficult. The most naive metric is human expert judgment. However, this is expensive, time-consuming, and prone to bias. Human experts are subjective to the task setup, motivation, and the feedback they get about their mistakes. There was a lack of an objective function that could be used to evaluate generated images. Therefore, other alternatives have been developed to measure the quality, diversity, and fidelity of generated images. Among the most commonly used ones are <a href="https://en.wikipedia.org/wiki/Inception_score">Inception Score</a> (IS) and <a href="https://en.wikipedia.org/wiki/Fr%C3%A9chet_inception_distance">Fréchet inception distance </a>(FID).</p> <h2>Inception Score</h2> <p>The Inception Score was initially proposed in a paper, <a href="https://arxiv.org/abs/1606.03498">Improved Techniques for Training GANs 2016</a>. By design, it measures two factors:</p> <ul> <li><strong>Image fidelity.</strong> Each image has to contain a clear object.</li> <li><strong>Diversity</strong>. The model should be able to generate many different object classes, ideally following a uniform distribution. Each class should be equally likely.</li> </ul> <p>The IS is computed using the following function:<br> <img title="1" alt="1" src="/user/pages/blog/258.evaluation-metrics-for-generative-image-models/1.png?g-044cb45a"></p> <p>The p(y|x) is a conditional label distribution. The authors of the paper use the Inception network to predict the label for each generated image. If the image contains a clear object, the inception model should return a high probability for one of the classes, and the entropy will be low H(y|x)~0.</p> <p>The other part of the equation, p(y), is the probability of the inception model predicting each class. Ideally, the probability for each class should be equal, so the entropy should be high H(y)~+∞.</p> <p>KL is the <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Kullback–Leibler divergence</a>. It measures how those two distributions are similar to each other. A large value indicates that they are not similar at all. This satisfies the constraint for fidelity and diversity.</p> <p><img title="figure 1" alt="figure 1" src="/user/pages/blog/258.evaluation-metrics-for-generative-image-models/figure%201.png?g-044cb45a"></p> <h4 class="center">Figure 1. The KL divergence measures the similarity between distributions. D<sub>KL</sub>(P||Q) = ∑<sub>x</sub>P(x)ln(P(x)/Q(x)). The D<sub>KL</sub> is not symmetric D<sub>KL</sub>(P||Q) != D<sub>KL</sub>(Q||P). <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">Source</a></h4> <p>The exponential function was introduced only for aesthetic purposes. The higher the IS, the better.</p> <h3>Inception Score drawbacks</h3> <p>Inception Score, however, has some drawbacks. Those were pointed out in the paper <a href="https://arxiv.org/abs/1801.01973">A Note on the Inception Score 2018</a>. Here are the main of them:</p> <ul> <li><strong>Low generalization ability</strong> - The IS does not measure image diversity within the class. The GenAI model can generate the same images within a class, and the score will still be high. Therefore, it does not penalize the model for memorizing only a small subset of the training data.</li> <li><strong>Sensitivity to model weights</strong>. 
<h3>Inception Score drawbacks</h3>
<p>The Inception Score, however, has some drawbacks. They were pointed out in the 2018 paper <a href="https://arxiv.org/abs/1801.01973">A Note on the Inception Score</a>. The main ones are:</p>
<ul>
<li><strong>Low generalization ability.</strong> The IS does not measure image diversity within a class. A model can generate the same image over and over within each class and still score high, so the metric does not penalize memorizing a small subset of the training data.</li>
<li><strong>Sensitivity to model weights.</strong> Even small differences in the Inception network's weight values can produce large variances in evaluation results (see the sketch after this list).</li>
</ul>
<blockquote>
<p>The mean Inception Score is 3.5% higher for ImageNet validation images, and 11.5% higher for CIFAR validation images, depending on whether a Keras or Torch implementation of the Inception Network is used.</p>
</blockquote>
<ul>
<li><strong>Vulnerability to direct optimization.</strong> Directly optimizing the GenAI model using the Inception Score teaches the model to generate adversarial examples that maximize the score rather than realistic images.</li>
</ul>
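<p>Because of this sensitivity, scores computed with different Inception implementations or weights are not directly comparable. In practice, it is safer to compute the IS for every compared model with a single maintained implementation and report which one was used. As an illustration, here is a sketch using the <a href="https://github.com/Lightning-AI/torchmetrics">torchmetrics</a> package (an assumption about your tooling; at the time of writing this metric also requires the optional torch-fidelity dependency):</p>
<pre><code class="language-python">import torch
from torchmetrics.image.inception import InceptionScore

# Wraps a pretrained InceptionV3; by default it expects uint8 images of
# shape (N, 3, H, W) and computes the score over 10 splits.
metric = InceptionScore(splits=10)

# Stand-in for real generated images: random noise, just to show the API.
images = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)

metric.update(images)
mean, std = metric.compute()  # mean and standard deviation of IS across splits
print(f"IS: {mean.item():.2f} ± {std.item():.2f}")
</code></pre>
<p>Reporting the mean and standard deviation across splits, together with the exact implementation used, makes the scores reproducible and comparable across experiments.</p>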