The first assumption I make here is that David already checked that both gain and energy are encoded at the "optimal" resolution that balances bitrate and coding artefacts. To reduce the rate, we need a smarter quantizer. Below is the distribution of the pitch and energy for my training database.
So what if we were to use vector quantization to reduce the bit-rate. In theory, we could reduce the rate (for equal error) by having more codevectors in areas where the figure above shows more data. Same error, lower rate, but still a bad idea. It would be bad because it would mean that for some people, whose pitch falls into the range that is less likely, codec2 wouldn't work well. It would also mean that just changing the audio gain could make codec2 do worse. That is clearly not acceptable. We need to not just care about the mean square error (MSE), but also about the outliers. We need to be able to encode any amplitude with increments of 1-2 dB and any pitch with an increment around 0.04-0.08 (between half a semitone and a semitone). So it looks like we're stuck and the best we could do is to have uniform VQ, which wouldn't save much compared to scalar quantization.
The key here is to relax our resolution constraint above. In practice, we only need such good resolution when the signal is stationnary. For example, when the pitch in unvoiced frames jumps around randomly, it's not really important to encode it accurately. Similarly, energy error are much more perceivable when the energy is stable than when it's fluctuating. So this is where prediction becomes very useful, because stationary signals are exactly the ones that are easily predicted. By using a simple first-order recursive predictor (prediction = alpha*previous_value), we can reduce the range for which we need good resolution by a factor (1-alpha). For example, if we have a signal that ranges from 0 to 100 and we want a resolution of 1, then using alpha=0.1, the prediction error (current_value-prediction) will have a range of 0 to 10 when the signal is stationary. We still need to have quantizer values outside that range to encode variations, but we don't need a good resolution.
Now that we have reduced the domain for which we need good resolution, we can actually start using vector quantization too. By combining prediction and vector quantization, it's possible to have a good enough quantizer using only 8 bits for both the energy and the pitch, saving 4 bits, so 200 b/s. The figure below illustrates how the quantizer is trained, with the distribution of the prediction residual (actual value minus prediction) in blue, and the distribution of the code vectors in red. The prediction coefficients are 0.8 for pitch and 0.9 for energy.
First thing we notice from the residual distribution is that it's much less uniform and there's two higher-density areas that stand out. The first is around (0.3,0), which corresponds to the case where the pitch and energy are stationary and is about one fifth of the range for pitch (which has a prediction coefficient of 4/5) and one tenth of the range for energy (which has a prediction coefficient of 9/10). The second higher-density area is a line around residual energy of -2.5, and it corresponds to silence. Now looking at the codebook in red, we can see a very high density of vectors in the area of stationary speech, enough for a resolution of 1-2 dB energy and 1/2 to 1 semitone for pitch. The difference is that this time the high resolution is only needed for much smaller range. Now, the reason we see such a high density of code vectors around stationary speech and not so much around the "silence line" is that the last detail of this quantizer: weighting. The whole codebook training procedure uses weighting based on how important the quantization error is. The weight given to pitch and energy error on stationary voiced speech is much higher than it is for non-stationary speech or silence. This is why this quantizer is able to give good enough quality with 8 bits instead of 12.