How Did Google DeepMind Perform for Hurricane Melissa?
One of the most hotly discussed AI models this hurricane season has been Google DeepMind's FNV3 ensemble. Google has worked with the National Hurricane Center and made its track files publicly available in real time during this hurricane season, providing an excellent opportunity to evaluate the model across a variety of cases and understand its strengths and weaknesses. One case that will certainly draw a lot of attention is Hurricane Melissa, which severely impacted Jamaica as well as Cuba, Haiti, and the Bahamas in late October. At 892 hPa, it tied for the strongest hurricane landfall ever recorded in the Atlantic. Based on the model's performance in other Atlantic storms this year, NHC leaned on the DeepMind forecasts quite a bit in its forecasts for Melissa.
So how did it perform? For this evaluation, I compared the DeepMind ensemble mean (GDMN in the ATCF files) with five heavily used operational dynamical models: AVNI (operational GFS), HFSA (operational HAFS-A), HFSB (operational HAFS-B), HWRF (operational HWRF), and HMON (operational HMON).
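For readers who want to reproduce this kind of comparison, here is a minimal sketch of pulling those model IDs out of an ATCF a-deck. This is my own code, not NHC's or Google's; the filename is hypothetical, and the field handling assumes the standard comma-delimited a-deck layout (positions in tenths of a degree, winds in kt, pressure in hPa):

```python
import csv

TECHS = {"GDMN", "AVNI", "HFSA", "HFSB", "HWRF", "HMON"}

def parse_latlon(tok):
    """ATCF positions like '178N' or '0781W' are tenths of a degree."""
    val = int(tok[:-1]) / 10.0
    return -val if tok[-1] in "SW" else val

rows = []
with open("aal132025.dat") as f:          # hypothetical a-deck filename
    for fields in csv.reader(f):
        fields = [x.strip() for x in fields]
        if len(fields) < 10 or fields[4] not in TECHS:
            continue
        rows.append({
            "init": fields[2],            # cycle, YYYYMMDDHH
            "tech": fields[4],            # model ID
            "tau": int(fields[5]),        # forecast hour
            "lat": parse_latlon(fields[6]),
            "lon": parse_latlon(fields[7]),
            "vmax": int(fields[8] or 0),  # max wind (kt); 0 if missing
            "mslp": int(fields[9] or 0),  # min pressure (hPa); 0 if missing
        })
```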
Track Forecasts
As we have come to expect from AI forecast models so far, GDMN performed very well for track in Hurricane Melissa, with lower errors than all of the dynamical models (Figure 1). Note that the verifications used the "interpolated" versions of the models (what NHC actually has available when issuing its advisories); GDMI is the interpolated counterpart of GDMN. GDMN/GDMI had the lowest track errors by far at all forecast lead times, with only HAFS-A and HAFS-B (NCEP's hurricane models that became operational in 2023) competitive for track at Day 5.
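For reference, track error here is the great-circle distance between the forecast position and the verifying best-track position at the same lead time. A minimal sketch using the standard haversine formula (the function name and choice of nautical miles are mine, not taken from NHC's verification code):

```python
import math

def track_error_nmi(fcst_lat, fcst_lon, best_lat, best_lon):
    """Great-circle (haversine) distance between forecast and best-track
    positions, in nautical miles."""
    R_NMI = 3440.065  # mean Earth radius in nautical miles
    p1, p2 = math.radians(fcst_lat), math.radians(best_lat)
    dp = p2 - p1
    dl = math.radians(best_lon - fcst_lon)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * R_NMI * math.asin(math.sqrt(a))

# A forecast one degree of latitude too far north verifies at ~60 n mi of error:
print(track_error_nmi(18.0, -77.0, 17.0, -77.0))  # ~60.0
```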
Figure 2 shows the "frequency of superior performance" (FSP): how often one model outperforms another over the same set of forecasts. Here the FSP is calculated relative to the operational GFS, which unfortunately performed poorly for this storm. DeepMind outperformed the GFS 100% of the time at all forecast hours beyond 60, and had a higher FSP than HAFS-A and HAFS-B out to 84 h or so.
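Computationally, FSP is just a pairwise win percentage over the cases where both models verified. A sketch (here a tie counts against the candidate model; verification packages differ on how ties are handled):

```python
def fsp(errs_model, errs_baseline):
    """Frequency of superior performance: percent of paired cases where
    the candidate model's error beats the baseline's."""
    pairs = [(m, b) for m, b in zip(errs_model, errs_baseline)
             if m is not None and b is not None]
    wins = sum(1 for m, b in pairs if m < b)  # a tie is not a win
    return 100.0 * wins / len(pairs)

# Hypothetical 72 h track errors (n mi) for four matched cycles:
print(fsp([35, 60, 40, 80], [90, 55, 120, 150]))  # 75.0
```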
An example cycle (2025102306, i.e., 06 UTC 23 October) illustrates this performance (Figure 3). This was about five days prior to landfall in Jamaica. The GFS was way too far east, taking the storm toward Haiti and the Dominican Republic. DeepMind and HAFS correctly showed the slow motion and the turn back to the south of Jamaica. For this cycle, GDMN was a little too quick to make landfall, but it corrected in later cycles.
Intensity
As with several other Atlantic cases this year, DeepMind's ensemble mean performed well overall for intensity, with lower errors than all of the other models at Days 2-4, and it even outperformed NOAA's legacy HWRF model at Day 5 (Figure 4).
Looking at the intensity bias (Figure 5), we see that all of the models had a negative bias, which makes sense given that most models struggled to fully resolve the extreme intensity Melissa achieved. However, DeepMind had the smallest negative bias at Days 2-3, similar to the operational HAFS-A; HAFS-A and HAFS-B were slightly better at Day 5.
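Bias here is simply the mean of forecast-minus-observed intensity over the verifying cases, so a negative value means the model ran too weak. A sketch with made-up numbers:

```python
def intensity_bias(fcst_vmax_kt, best_vmax_kt):
    """Mean intensity bias (kt): forecast minus best track, averaged over
    cases. Negative values mean the model was too weak on average."""
    diffs = [f - o for f, o in zip(fcst_vmax_kt, best_vmax_kt)]
    return sum(diffs) / len(diffs)

# Three hypothetical Day-3 forecasts verifying against a 150 kt best track:
print(intensity_bias([120, 135, 140], [150, 150, 150]))  # about -18.3 kt
```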
A closer look at the October 23, 06 UTC cycle examined earlier shows more detail on how DeepMind performed for intensity (Figure 6). None of the models fully captured the 2-3 day rapid intensification. DeepMind was closest in intensification rate for a time, though the ensemble mean levelled off at too low a peak intensity compared to the operational HAFS.
Ensemble vs. Ensemble Mean
One thing to note is that the version of DeepMind included in the NHC "a-decks" is the ensemble mean. Google has made the full ensemble available as well, both on their website and as public downloads. Using an ensemble mean brings both strengths and weaknesses. An ensemble mean often has lower errors overall than a single deterministic forecast, as the biases of individual members average out; this is why NHC often leans on consensus aids in its official forecasts. But an ensemble mean is sometimes an average between widely disparate outcomes, which can help minimize errors while masking some of the overall uncertainty in the forecast.

An example comes from the October 21, 06 UTC DeepMind ensemble forecast (Figure 7). The mean track was not bad, but many members were too far right, showing a track towards Hispaniola (similar to the GFS), while others were too far west, near Honduras. The mean track was simply an average of two incorrect outcomes with opposite biases. Is it an apples-to-apples comparison to put this up against deterministic models like HAFS? Not necessarily, though the ability to run skillful ensembles quickly is one of the big draws of AI forecasts.
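A toy illustration of that averaging effect, with made-up longitudes (degrees east, negative west) standing in for the two member clusters:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 120 h member longitudes: half the members cluster near
# Hispaniola (too far east), half near Honduras (too far west).
east = rng.normal(loc=-71.0, scale=0.5, size=50)
west = rng.normal(loc=-86.0, scale=0.5, size=50)
members = np.concatenate([east, west])

print(round(members.mean(), 1))  # ~ -78.5, near Jamaica's longitude --
                                 # a position almost no member actually occupies
print(round(members.std(), 1))   # ~ 7.5 degrees of spread the mean alone hides
```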
One other downside of ensemble-mean forecasts is that they smooth out extreme values, which is probably why some of the GDMN intensity forecasts levelled off even though many individual members deepened to extreme values (pressures below 900 hPa). This is where the human forecaster can add a lot of value: examining the details of the ensemble and judging which intensity outcome is more likely given how the track is evolving. This is exactly what NHC did in the forecasts referenced above. Overall, this case was a success for DeepMind's forecast abilities, and it showed how the model can be used effectively as another tool by hurricane forecasters.