October 8, 2025

How Did AI Models Perform for Hurricane Humberto?

Andy Hazelton


As we discussed for Typhoon Ragasa, it is worthwhile to examine how AI models performed for high-profile or high-impact tropical cyclones (TCs) before conducting a full verification at the end of the season. The latest TC we will look at is Humberto, which did not impact land but was the second Category 5 hurricane of the 2025 Atlantic season. It was also interesting because of its interaction with Hurricane Imelda (which we will look at in a later post).

As a reminder, these are the AI models examined:

AIEN: AIFS Ensemble Control

AISN: AIFS Deterministic

PANG: Pangu Weather

AURO: Aurora

GRPC: GraphCast

GDMN: Google DeepMind's FNV3 Ensemble Mean

We also compare with operational GFS (AVNO) and HAFS-A (HFSA), two key models typically used for TC prediction.

Track

In contrast to some recent storms, where AI models dominated dynamical models on track skill, the AI models overall did not perform as well for Hurricane Humberto, especially compared with operational GFS (Figure 1). The one AI model that was competitive at most forecast hours was Google DeepMind's FNV3 (GDMN on the plots), with better skill than GFS at Days 1-2 and comparable skill at Day 5. GraphCast was also comparable to GFS and HAFS-A at times, but did considerably worse at early lead times.

Figure 1: Track errors (km, top) and skill (percentage relative to AVNO, bottom) for all models. The middle panel shows the “consistency metric” from Ditchek et al. (2023), which is a measure of significance/robustness of the results. The numbers on the bottom show the sample size at each lead time.
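For readers who want to reproduce this kind of plot, here is a minimal Python sketch of the two quantities in Figure 1. It assumes track error is the great-circle distance between forecast and observed centers, and that skill is the percentage improvement in mean error over the baseline (AVNO here). The post does not spell out either formula, so treat both as common conventions rather than the exact method used. The function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed spherical Earth)


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2.0) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2.0) ** 2)
    return 2.0 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def skill_percent(model_mean_error, baseline_mean_error):
    """Percent improvement over the baseline; positive means better than baseline."""
    if baseline_mean_error <= 0:
        raise ValueError("baseline mean error must be positive")
    return 100.0 * (baseline_mean_error - model_mean_error) / baseline_mean_error
```

With this convention, a model whose mean track error is 80 km at a lead time where the baseline has 100 km would score 20% skill, and a model with 120 km would score -20%. These numbers are illustrative only and are not taken from the figure.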

A breakdown of the track error shows that most of the AI models suffered from a slow bias (Figure 2), leading to a large along-track error compared to GFS. AIFS (both single run and ensemble control versions) had particularly large errors at Day 5.

Figure 2: Along-track error for all models. Negative means a slow bias and positive means a fast bias.
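The along-track breakdown can be sketched as follows. This is a rough illustration, not the verification code behind Figure 2. It assumes the storm heading is estimated from two consecutive observed positions, uses a local flat-Earth approximation, and follows the caption's sign convention (negative along-track means slow). The cross-track sign (positive to the right of the motion) and all names are assumptions made for this example.

```python
import math

KM_PER_DEG_LAT = 111.195  # approximate km per degree of latitude


def local_offset_km(lat_from, lon_from, lat_to, lon_to):
    """East and north offsets in km, using a local flat-Earth approximation."""
    mean_lat = math.radians(0.5 * (lat_from + lat_to))
    east = (lon_to - lon_from) * KM_PER_DEG_LAT * math.cos(mean_lat)
    north = (lat_to - lat_from) * KM_PER_DEG_LAT
    return east, north


def along_cross_track_km(obs_prev, obs_now, fcst_now):
    """Split the forecast position error into along- and cross-track parts.

    Each argument is a (lat, lon) tuple in degrees. obs_prev and obs_now are
    consecutive observed positions; fcst_now is the forecast valid at obs_now.
    Along-track: negative = slow, positive = fast.
    Cross-track: positive = right of the observed motion (assumed convention).
    """
    move_e, move_n = local_offset_km(*obs_prev, *obs_now)
    distance = math.hypot(move_e, move_n)
    if distance == 0.0:
        raise ValueError("observed storm is stationary; heading is undefined")
    unit_e, unit_n = move_e / distance, move_n / distance

    err_e, err_n = local_offset_km(*obs_now, *fcst_now)
    along = err_e * unit_e + err_n * unit_n   # projection onto the heading
    cross = err_e * unit_n - err_n * unit_e   # component to the right of heading
    return along, cross
```

For a storm moving due north, a forecast position half a degree south of the observed position gives an along-track error of about -56 km (a slow bias) and a cross-track error of zero.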

An example track forecast (from one of the earlier cycles, while Humberto was still developing) illustrates this speed issue (Figure 3). For this cycle, almost all of the AI models were too slow and to the left of the observed track (though GraphCast and DeepMind's FNV3 were close). GFS, on the other hand, had a better forward speed, though it was too far to the right (a common bias for this model). This was a complex track case with lots of moving parts, thanks to interaction with a front and with Imelda to the west, so it remains to be seen whether this along-track bias will plague AI models in other cases.

Figure 3: Track forecasts for Hurricane Humberto initialized at 00 UTC September 25, 2025, and valid out to 126 hours. The observed track is shown in black.

Intensity

For intensity, as with other cases, most AI models did not have very good skill (Figure 4). The one notable exception was Google DeepMind's FNV3, which actually had the lowest intensity errors (even lower than HAFS-A!) at almost all lead times.

Figure 4: Intensity errors (m/s, top) and skill (percentage relative to AVNO, bottom) for all models. The middle panel shows the “consistency metric” from Ditchek et al. (2023), which is a measure of significance/robustness of the results. The numbers on the bottom show the sample size at each lead time.

This intensity skill was pretty remarkable considering that Humberto rapidly intensified to a Category 5 hurricane. While FNV3 did not fully capture the rapid intensification, even its ensemble mean showed a faster intensification rate in its early forecasts than any of the other AI or dynamical models (Figure 5).

Figure 5: Maximum wind speed (kt, 1 kt = 0.51 m/s) forecasts for Hurricane Humberto initialized at 00 UTC September 25, 2025, and valid out to 126 hours. The observed wind is shown in black.
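Because Figure 4 reports errors in m/s while the wind plots are in kt, a small conversion helper is handy when comparing the two. The sketch below uses the exact definition of the knot (1852 m per hour, about 0.514 m/s) and computes a mean absolute intensity error in m/s from winds given in kt. The function names are made up for this example, and the sketch is not the verification code behind the figures.

```python
KT_TO_MS = 1852.0 / 3600.0  # 1 kt = 1 nautical mile per hour, about 0.5144 m/s


def kt_to_ms(wind_kt):
    """Convert a wind speed (or wind speed difference) from knots to m/s."""
    return wind_kt * KT_TO_MS


def mean_abs_error_ms(forecast_kt, observed_kt):
    """Mean absolute intensity error in m/s for paired winds given in kt."""
    if len(forecast_kt) != len(observed_kt):
        raise ValueError("forecast and observed lists must have the same length")
    if not forecast_kt:
        raise ValueError("need at least one forecast/observation pair")
    errors = [abs(kt_to_ms(f - o)) for f, o in zip(forecast_kt, observed_kt)]
    return sum(errors) / len(errors)
```

As an illustration with made-up numbers, forecasts of 100 kt and 90 kt verifying against 140 kt and 100 kt have absolute errors of 40 kt and 10 kt, for a mean of 25 kt, or roughly 12.9 m/s.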

Humberto did illustrate one weakness of current AI modeling for intensity. Once the storm got very strong (Category 4-5), FNV3 (GDMN) was not able to maintain the initial intensity that it was given, and dropped off to an intensity that was ~50 knots too weak 6 hours after initialization (Figure 6). It eventually recovered somewhat around 36 hours later, but this bias led to a large error in the 12-24 hour range. It is not immediately clear what is causing this issue. It could be that the low resolution is leading to this dropoff despite the bias correction. The fact that FNV3 had comparable skill to HAFS-A overall at 12-24 hours despite this issue suggests that AI intensity skill should continue to improve as better bias correction techniques and higher resolution datasets are developed.

Figure 6: Maximum wind speed (kt, 1 kt = 0.51 m/s) forecasts for Hurricane Humberto initialized at 00 UTC September 27, 2025, and valid out to 126 hours. The observed wind is shown in black.
