Abstract
Wildfires pose a significant natural disaster risk to populations and contribute to accelerated climate change. As wildfires are also affected by climate change, extreme wildfires are becoming increasingly frequent. Although they occur less frequently globally than those sparked by human activities, lightning-ignited wildfires play a substantial role in carbon emissions and account for the majority of burned areas in certain regions. While existing computational models, especially those based on machine learning, aim to predict lightning-ignited wildfires, they are typically tailored to specific regions with unique characteristics, limiting their global applicability. In this study, we present machine learning models designed to characterize and predict lightning-ignited wildfires on a global scale. Our approach involves classifying lightning-ignited versus anthropogenic wildfires, and estimating with high accuracy the probability of lightning to ignite a fire based on a wide spectrum of factors such as meteorological conditions and vegetation. Utilizing these models, we analyze seasonal and spatial trends in lightning-ignited wildfires shedding light on the impact of climate change on this phenomenon. We analyze the influence of various features on the models using eXplainable Artificial Intelligence (XAI) frameworks. Our findings highlight significant global differences between anthropogenic and lightning-ignited wildfires. Moreover, we demonstrate that, even over a short time span of less than a decade, climate changes have steadily increased the global risk of lightning-ignited wildfires. This distinction underscores the imperative need for dedicated predictive models and fire weather indices tailored specifically to each type of wildfire.
Similar content being viewed by others
Introduction
Wildfires are among the most hazardous natural disasters on Earth. Although studies have found a decrease in global burned areas due to anthropogenic activity1, the frequency of extratropical lightning-ignited wildfires appears to be on the rise as the Earth’s climate changes2. For example, the frequency of extreme lightning-ignited wildfires has drastically increased in California in the last decades3. Lightning is the main cause for wildfire activity in high latitudes both in terms of wildfire occurrence and in terms of burned areas4. This trend is of considerable concern, as extratropical forests are responsible for a substantial portion of carbon emissions. Lightning-ignited wildfires differ from anthropogenic wildfires in several aspects; for example, lightning-ignited wildfires tend to ignite in remote locations in which firefighters have difficulty extinguishing them before they evolve into extreme dimensions. These fires also tend to ignite in clusters5 while forest managers are less likely to fight these fires if they are not impacting settlements and infrastructures. While recent studies have developed machine learning (ML) models to predict wildfires on a global scale6, the different nature of lightning-ignited wildfires requires dedicated models to predict and analyze them separately.
Wildfires can be ignited by cloud-to-ground (CG) lightning strikes. As lighting ignitions occur in thunderstorms, they are often accompanied by precipitation7. However, lightning strokes that occur with little precipitation are defined as “dry lightning” and are more likely to cause an ignition. While a common definition for “dry lightning” is less than 2.5 mm of daily precipitation, some studies suggest that the risk of ignition is more complex and depends on additional factors7. An additional unique characteristic of lightning ignitions is the phenomenon of holdover wildfires, in which the ignition causes smoldering which can last for several days before a wildfire is detected. Although there exist some reports of extreme cases in which the smoldering phase lasted for several weeks, in the vast majority of cases they are limited to several days or a week8.
Recently, global lightning-ignited wildfire analysis and models obtained much attention. For example, a recent study used the EMAC model, a numerical model that aims to describe tropospheric and middle atmosphere processes and their interaction with oceans, land, and influences coming from anthropogenic emissions9. The authors used global data from 2009 to 2011 with \(2.8 \times 2.8\) degree and 12 minutes spatial and temporal resolution, respectively. Based on this model, they predict a 41% increase in global lightning flash rate. They also apply a simple model of wildfire ignition risk (lightning frequency divided by precipitation) and predict an increase in most regions around the world.
In a similar manner, researchers used the Max Planck Institute Earth System Model to explore how changes in lightning induced by climate change alter wildfire activity10. To be exact, the authors used the popular ECHAM6 simulator, which allowed them to explore different realistic atmospheric conditions, together with the JSBACH land surface vegetation model, which allows them to represent a spatio-temporal vegetation distribution. Using this setup and empirical values from the literature to obtain geophysical relevant values for the models’ parameters, the authors show a non-linear correlation between burned areas and cloud-to-ground lightning activity. Importantly, this approach, where a geophysical simulator is used does not allow the applied generalization of the phenomenon for future events as it allows us only to study the dynamics of the phenomenon and can not take into consideration hidden factors available in real-world data which are not taken into account during the simulators’ development.
To address this limitation, several studies adopted the ML model approach. Most studies in this field develop ML models on a regional scale. For example, Vecín-Arias et al. (2016) develop a Random Forest model for the Iberian Peninsula11; Malik et al. (2021) develop an Adaboost model for California12; Sayad et al. (2019) develop Artificial Neural Network and Support Vector Machine models to predict wildfire risk in Canada13. Several studies have also leveraged eXplainable Artificial Intelligence (XAI) frameworks to obtain a better understanding of the models’ predictions. XAI is an important field of research which aims to develop models which can be better understood by humans14. XAI approaches vary between models which are inherently explainable, and methods whose goal is to provide explanations of black-box models15. The application of the latter in wildfire research is becoming increasingly frequent, with the use of SHapley Additive exPlanations (SHAP) as the most common approach16,17,18,19,20,21. Other studies also leverage the intrinsic feature importance ordering by tree ensemble models such as XGBoost22,23,24. While regional models may better capture regional characteristics, their performance when applied to out of distribution regions is mostly low. In addition, effective training of ML models requires a very large number of observations, which is not available on a regional scale.
Few studies have tackled this challenge on a global scale. For instance, Janssen et al. (2023) integrated fire-caused reference, lightning, burned-area, low-impact land, intact-forest, fire-related forest loss, and carbon combustion data from multiple sources to take into consideration the factors known to play a role in lightning-ignited wildfires2. Based on this data, the authors trained two XGBoost models25. Their models predict the most likely ignition source at a 0.5-degree and monthly resolution. This novel study sheds light on the global distribution of anthropogenic versus lightning-caused wildfires, but does have some limitations. Namely, it is performed in a monthly resolution, which does not allow to verify that the lightnings occurred before the ignition; the data used for training are limited to 7 regions; finally, the study classifies wildfire types in hindsight and does not allow prediction of future ignition risk which is required for most real-world applications. Comparably, Coughlan et al. (2021) combined lightning presence, lightning-ignited wildfires, and environmental predictors such as soil water and temperature data26. The authors used this data to test Decision Tree, AdaBoost, and Random Forest models27,28,29, together with the recursive feature elimination selection method, for binary classification of lighting-integrated wildfires. While their approach is novel, the trained models reached an accuracy score of 71% which is not always sufficient in real-world applications.
Following these attempts, there is a gap in the literature for a model that is able to globally predict lightning-ignited wildfires with both a high level of accuracy and spatiotemporal stability. In this study, we build on global datasets of lightning30 and fire activity31 to develop an ML model that predicts the conditions in which lightning is expected to ignite a wildfire. We test the developed model both by using a random train-test split of the years, and by testing the year 2021 as a holdout year to demonstrate the generalization capabilities of the model. The best-performing model demonstrates high predictive performance with accuracy exceeding 90%. We analyze the different characteristics of anthropogenic and lightning-caused wildfires, and demonstrate that models developed for the former perform poorly for the latter and vice versa, emphasizing the need for dedicated models and fire weather indices for lightning-caused wildfires. Finally, we use the developed models to estimate the mean annual increase in lightning ignition risk throughout the data’s timespan, and also evaluate the predictions of these models in future climate conditions to estimate the expected trends of lightning-ignited wildfire activity. We find that a concerning increase in lightning ignition risk is already occurring and is likely to continue in the future.
The rest of the paper is organized as follows. First, we present the results of the study, starting with the dynamics analysis of the lightning-ignited wildfires, followed by the proposed model’s performance, the anthropogenic versus lightning-ignited wildfires, and the effect of climate change on these dynamics. Next, we discuss the geophysical interpretation of the obtained results and how policymakers can use the proposed model to design better policies. Finally, we formally present the methods and materials used as part of this study, including the datasets, the proposed model’s development, and the experimental design.
Results
In this section, we outline the obtained results, divided into four main outcomes: an analysis of wildfire occurrence dynamics, the performance of the obtained ML model, a comparison between models of anthropogenic and lightning-ignited wildfires, and climate change trajectory prediction based on the obtained model. Table 1 summarizes the variables used in the study and their sources, following a data pre-processing and merging, as described in detail in the “Methods and materials” section.
Dynamics analysis
In order to appropriately decide on the modeling approach to the problem, we started by analyzing the dynamics that emerge from the data. We present the Pearson correlation matrix between the source features and themselves and between the source features and the target feature in Fig. 1. Observing the data, it becomes apparent that the values in the first row are predominantly close to zero, with the highest absolute value reaching 0.48, indicating that the source features are not linearly related to the target features in a pair-wise manner. The features that are most strongly correlated to ignition probability are fire indices (fwi, ffmc) and Relative Humidity (RH). Moreover, as revealed in the figure, the source features are weakly linearly correlated to each other for most of the cases. Some exceptions are low and high vegetation cover, or fire weather indices which are correlated between themselves and some meteorological factors. These correlations are already known from previous studies26. This outcome indicates that the source space is mostly non-linear and an ML model can be useful for the proposed classification task.
In addition, Fig. 2a presents descriptive statistics of the obtained dataset and compares the different variables during thunderstorms where ignitions either did or did not occur. Eight selected variables are presented, and the remaining variables are presented in the Appendix (Fig. S1). Notably, the distributions of Relative Humidity (RH) and the Fine Fuel Moisture Code (FFMC) indexes, indicative of vegetation dryness, exhibit discernible differences between instances of ignitions and non-ignitions. This observation implies their potential value as critical variables in the predictive model. Historical (monthly) precipitation, which is correlated with the ffmc index, also presents a different distribution between the two cases. Other variables such as temperature and wind velocity appear to have similar distributions for both cases.
Model’s performance
Table 2 shows the performance of four ML models—Logistic Regression, Random Forest, AutoGluon, and XGBoost for five different configurations of source features, divided into four sets—vegetation, meteorological, and anthropogenic factors, fire history, FWIs, and spatio-temporal data. The results for each feature configuration and model are shown as the accuracy of the obtained model on the test set. For all five feature configurations, a clear order of performance emerges where the XGboost model provides the highest accuracy followed by the AutoGluon, Random Forest, and Logistic Regression models. In particular, with all the features in the dataset, the XGboost obtains a 91.6% accuracy, compared to the benchmark of a logistic regression with an accuracy of 82.1%, a 9.5% performance increase. Interestingly, comprising the fourth and fifth rows shows that the introduction of spatial data allows all models to highly improve their performance with XGboost increasing from 89.3% to 91.6%, a 3.7% increase. Also, we find that including regional characteristics such as the historical burned area contributes to the performance of the model. The ROC curves and AUC scores are presented in Fig. S4 in the Supplementary Information file.
Figure 3a,b present the mean regional accuracy of the XGBoost and logistic regression models, respectively. While the logistic regression performs relatively well in Africa, South America, and South Asia, it performs poorly in Australia, North Asia, and North America. In contrast, the XGBoost model exhibits excellent performance in all the continents, an indicator of good generalization capability. The accuracy of the model in the validation year, 2021, was almost identical, with 90.5% for the best-performing XGBoost model. These results indicate that the proposed model was able to generalize from the trained data and capture relevant patterns in the dynamics.
Figure 3c–g presents the estimated risk for lightning-ignited wildfires. We present both annual mean values (Fig. 3c) and seasonal averages (Fig. 3d–g). There is a difference in the distribution of lightning-ignited wildfires in the tropics compared with high latitudes. In the tropics, lightning fires generally occur in the dry season, or the transition between seasons, when the vegetation is dry, but there can still be isolated thunderstorms that can ignite the vegetation. In high latitudes, such as the boreal forests of Canada, Alaska, and Siberia, lightning fires always occur in the summer months that are associated with thunderstorms. While thunderstorms are also associated with rainfall, the rain falls directly below the clouds, while the lightning can strike tens of kilometers away from the rainfall. Furthermore, after a dry period without rainfall, forests can ignite from lightning even in the presence of rainfall. While it is generally thought that tropic lightning fires are rare, due to the high moisture in the tropics, this study shows a significant risk for lightning fires in the tropics throughout the year.
Model accuracy and predicted seasonal ignition risk based on the full XGBoost model. (a) Average accuracy of the XGBoost model. (b) Average accuracy of the logistic regression. (c) Predicted annual ignition risk based on the full XGBoost model. (d–g) Predicted seasonal ignition risk based on the full XGBoost model: (d) December–February, (e) March–May, (f) June–August, (g) September–November. Similar figures for additional models are presented in the Supplementary Information file (Figs. S5–S7). The figure was generated using the Cartopy library36.
Anthropogenic versus lightning-ignited wildfires
We present descriptive statistics of anthropogenic versus lightning-ignited wildfires in Fig. 2b. Eight selected variables are presented, and the remaining variables are presented in the Appendix (Fig. S3). The histograms in the figure describe the different variables on the day of ignition for both wildfire types. The histograms highlight some important differences between the two wildfire types: First, the meteorological conditions are remarkably different, with lightning ignitions occurring in high RH values and nonzero precipitation. Obviously, this does not mean that high RH or precipitation caused lightning-ignitions, but rather a description of the typical meteorological factors during thunderstorms. The typical wind velocity is small in lightning-ignitions; it is possible that strong winds at ground level are an indicator of strong winds at higher altitudes, which hinder the accumulation of thunderclouds. Second, and with a direct link to the meteorological factors, the fire weather indices in lightning-ignited ignitions are often very low. This is not surprising, as these indices are a function of the meteorological conditions, and high RH and precipitation normally indicate a low risk of ignitions. Finally, the vegetation indices are also different between the two ignition causes; lightning-ignited ignitions occur in regions with higher NDVI values.
These distinctively different characteristics suggest that the prediction of anthropogenic and lightning-ignited wildfires requires separate models, as these are two different phenomena. To examine this issue, we trained an XGBoost model to predict anthropogenic wildfires and tested its performance in the lightning-ignited wildfires dataset, and vice versa. The results, as summarized in Table 3a and b, confirm the inadequate performance of both models in predicting wildfires originating from an ignition source different from the one on which they were trained. In similarity to the XGBoost model, fire weather indices could also be adapted to reflect the risk of lightning-ignited ignitions.
Climate change influence and projections
Figure 4a presents the mean annual difference for the model’s prediction throughout the eight years of the data’s timespan (2014–2021). When averaging over the entire globe, we find a mean of 1% and a median of 0.7% annual increase in lightning-caused wildfire risk. However, this increase is not uniformly distributed in all regions. The increase in Africa, South America, and the USA is small but relatively consistent throughout these regions. In contrast, the trends in Australia, Canada, and North Asia is mixed and region-dependent. The possible decrease in ignition risk at high latitudes is consistent with previous studies, which have recognized increased precipitation as a mitigating factor of lightning-caused ignitions9. Figure 4b presents the increase in lightning ignition risk between present day and the projected risk for 2100. The model predicts an increase of 50% in lightning ignition risk. Although these two analyses are distinctively different, the regional distributions of their outcomes are remarkably similar. The lightning ignitions risk is consistently increasing in Africa, South America, and the USA, in contrast to mixed patterns in Canada and North Asia. Unlike the analysis in Fig. 4a, climate projections point to a clear increase in Australia.
(a) Mean annual difference in the model’s prediction of ignition risk between 2014–2021. (b) Increase in wildfire risk between present day and 2100. The figure was generated using the Cartopy library36.
Discussion
We created a dataset that classifies anthropogenic and lightning-ignited wildfires globally based on high-resolution spatio-temporal observations of wildfires and thunderhours. We developed ML models to predict the risk of ignition when lightning occurs and obtained a high predictive performance of 91.6% using the XGboost model. The model was trained on a global dataset and based on common and easily available variables, making it applicable everywhere. The high performance of the model was also validated on data from 2021 as a hold-out year that was not used for training, suggesting this model generalizes the patterns well and could be applicable in future applications.
We analyze the most important features in the best performing models in the Supplementary Information file. Figure S8 presents an analysis of the most important features in the XGBoost model in the form of Shapley values. Precipitation on the day of ignition is the most important factor and is negatively correlated with ignition risk, although ignitions do occur with limited precipitation during thunderstorms. The next most important feature is the fire weather index, which is mostly positively correlated with ignition risk despite a non-negligible number of low fwi values which were still associated with high ignition risk. RH, ffmc, and monthly precipitation all had the expected impact on the model’s outcome. Temperature had a relatively small effect on the model. Its small, negative correlation with ignition risk is likely a spurious result, possibly influenced by the regions where thunderstorms commonly occur. Similarly, Fig. S9 presents dependence plots for the ten most important features of the XGBoost model based on the SHAP analysis. The impact of some features on the model aligns with expectations. For instance, ffmc, RH, soil moisture, and vegetation water content values exhibit a nearly linear and strong effect on wildfire risk. The influence of other features, however, is not linear. For example, wildfire risk increases with NDVI values up to approximately 0.8 but then decreases, which might be due to very high NDVI values indicating wet and healthy vegetation that is less flammable. Fire history also shows a mixed effect on the model. Very low fire history values suggest a lower risk of wildfire ignition, while very high values also appear to have a negative effect, possibly because previous fires have already exhausted the available flammable vegetation. The fire weather index has a positive effect on the model, demonstrating that traditional indices have relatively good predictive capability. Figure S10 shows the ranked importance of features according to the XGBoost model. As expected, the one-hot encoded month variables significantly impact the model. Other key features include daily and monthly precipitation, soil moisture, NDVI, and others. This analysis aligns with the SHAP feature importance analysis, reinforcing its conclusions.
We demonstrated that anthropogenic and lightning-ignited wildfires have distinctively different characteristics and commonly occur at different RH, precipitation, and other conditions. In fact, as lightning-ignited wildfires can only occur following a lightning stroke, the meteorological conditions in which they begin are systematically biased compared to anthropogenic wildfires which can ignite without thunderstorms. This dramatically reduces the predictive performance of models trained on anthropogenic wildfires when applied to lightning-ignited ignitions, and vice versa. We demonstrated this finding by training an ML model for anthropogenic wildfires and testing it for lightning-ignited wildfires, which reduced its accuracy to as low as 75%. This highlights the need to address anthropogenic and lightning-ignited wildfires as different phenomena and develop distinct prediction models and fire weather indices for the two ignition sources.
Based on the developed model, we analyzed the influence of climate change on wildfire ignition risk throughout the years. We did so both by the annual predicted ignition risk in our dataset (2014–2021), and by applying the model to climate change projections for 2100. Both analyses provide a similar conclusion—lightning-ignited wildfires are already rising in Africa, South America, and the USA, and will likely keep rising substantially following climate change. In contrast, the trend in Australia, Canada, and North Asia is not as clear, though regional variations are expected in these areas as well.
This study provides a novel approach to prediction of lightning-caused wildfires, but it does have some noteworthy limitations. First, although the classification of anthropogenic versus lightning-caused wildfires is performed at a high resolution, the uncertainty in holdover fire duration prevents a certain classification. Namely, our classification assumes that a wildfire that ignites in the week after the thunderstorm is caused by it. Furthermore, we chose to analyze the effect of climate change in two ways: either based on our data (2014–2021), or based on climate change projections. The analysis based on the actual data makes its conclusion solid and robust, but it comes at the price of basing our findings regarding climate change on eight years of data, which is not necessarily sufficient for this matter. In contrast, using the climate projections provides insights on a longer time frame, but is not as accurate as the empirical analysis. Future studies could elaborate this analysis both for the past (by using alternative lightning datasets) or by elaborating the climate projections analysis. Future studies could also address the challenge of improving the prediction and detection of holdover fires, which could potentially serve as an early warning for large wildfires. Managing wildfires during the smoldering phase is considerably easier, making timely detection a crucial opportunity for effective wildfire mitigation. Finally, while the current study focused on ignition risk during a thunderstorm, to fully comprehend future lightning-ignited wildfires one must also consider the frequency of thunderstorms under climate change.
The accurate ML model presented in this study holds the potential to augment existing firefighting strategies. With the escalating risk of wildfires triggered by lightning, an enhanced understanding and predictive model serve as crucial initial steps for effective fire mitigation. Accurate prediction models are equally vital for promptly alerting nearby populations to imminent dangers. To fulfill this purpose, we advocate for the establishment of dedicated fire weather indices for lightning ignitions, leveraging insights from the model developed in this study.
Methods and materials
This section presents the three main components of constructing this study: the dataset, the ML model of lightning-ignited fire prediction, and the performed experiments. Initially, we outline the open-access datasets taken into consideration as well as their preparation and integration into a single dataset for the ML model. Subsequently, we formally describe the proposed ML model training and validation processes. Finally, based on both the datasets and the proposed ML model, we describe several experiments designed to approximate the effect of climate change on lightning-ignited fire on a global scale. Figure 5 presents a schematic view of the methodology used in this study.
Datasets
Wildfire data
Wildfire data are taken from31. This dataset classifies wildfires to the individual wildfire level with global coverage and includes shape files representing the daily burned area for each wildfire in each day, at 250-meter resolution. For each wildfire, we find the center of the polygon describing the burned area on its first day and define it as the ignition point. We do so for all wildfires in the years 2014–2021.
Lightning data
To classify lightning-ignited wildfires, we have used the thunder hours data recently produced from the Earth Networks Total Lightning Detection Network (ENTLDN)30. This ground-based lightning detection network observes the electromagnetic pulses radiated from lightning discharges, and using hundreds of stations around the globe can geolocate the lightning discharges in time and space. However, all lightning networks have problems with detection efficiencies since they are biased to the most intense flashes, while missing many of the weaker flashes. To overcome this deficiency, ENTLDN have gridded the lightning data into a global 5 km horizontal grid. If at least 2 flashes are detected within a 5 km gridbox during a particular hour, that gridbox is given the value of 1, otherwise 0. In this way instead of counting lightning flashes, ENTLDN counts thunderstorms per hour, or how many hours an observer would have heard thunder at that location. These thunderhous can be counted per day, month, or year for a specific location. Hence, our study uses thunderhour as the parameter representing the risk of lightning in a particular location and at a particular time.
Meteorological data and vegetation
We use ERA5 reanalysis32 for historical meteorological data. We include daily means of temperature, relative humidity, precipitation, and wind data. As studies show that time-lagged meteorological factors have a substantial impact on wildfire risk37, we also include two additional features: mean precipitation in the month prior to each observation, and the total days since the last precipitation. We also include three vegetation-related features: we obtain low vegetation cover and high vegetation cover from ERA5 reanalysis32; we use the normalized difference vegetation index (NDVI), an estimate of the density of live green vegetation, obtained from the NASA Earth Observations website33. Finally, we include vegetation water content (skin reservoir) and soil moisture from the ERA5 reanalysis32.
Historical burned areas
We include a feature describing the mean historical burned area in the years 2003–2013. For this feature, we do not include data from 2014 onwards to avoid data leakage. The historical burned areas are obtained from the ECMWF website35 in 0.25-degree resolution.
Fire weather indices
We use the Canadian Forest Fire Weather Index System38 which consists of six indices that account for the effects of fuel moisture and weather conditions on fire behavior: fine fuel moisture code, duff moisture code, drought code, initial spread index, buildup index, and the fire weather index which serves as a general danger index built on the previously mentioned ones. We obtain these variables in daily \(0.25^\circ\) resolution from the Copernicus Climate Change Service39. Although these indices are based on meteorological data, research shows that integrating significant features based on domain knowledge can improve machine learning performance40.
Dataset integration
For each wildfire in the data, we extracted its ignition point and date. We classified a wildfire as lightning-ignited if thunderhours occurred in a range of 10 kilometers on the day of ignition, or the week before it (to account for holdover wildfires). We eliminated wildfires which lasted for a single day to avoid prediction of agricultural burning. We then balanced the data by randomly sampling a similar number of observations in which a thunderhour occurred, but no wildfires ignited within a week in a range of 10 kilometers. We left the year 2021 as a hold-out year, meaning it was not used for training at all, and only served for validation purposes.
At this stage, the dataset (\(D\)) is represented as a single table with \(n = 974{,}459\) positive samples (lightning-ignited wildfires) and \(m = 18\) features. Formally, each sample contains + source features (\(x\)) and the target feature (\(y\)) which indicates if a fire is lightning ignited or not. This structure dictates a binary classification task. To this end, we first computed the Pearson coefficient between each pair of features in the dataset (including the target feature). Afterwards, since these results reveal that there is no significant linear or statistical connection between each feature to the target feature as well as a non-linear connection between the source features, we decided to use ML models for this classification task41.
To this end, we first pre-process the dataset to a form that is usable by ML models. Namely, we first removed all positive samples where at least a single datapoint is missing—resulting in a removal of \(1.6\%\) of the samples. Moreover, as the dataset is unbalanced (i.e., in the vast majority of cases lightnings do not result in wildfires), we decided to apply a down-sampling strategy. Namely, in addition to the lightning-ignited wildfires, we included an equal number of lightning observations that did not result in ignitions, sampled at random. The global distribution of lightning-caused and anthropogenic wildfires is presented in Fig. S2.
Machine learning model development
In order to find the ML model that produces the highest accuracy score on the test set, we tested multiple popular ML models, including Logistic Regression (LR)42, Random Forest (RF)27, AutoGluon43, and XGboost25. We chose these models as previous studies show these models obtain promising results in similar spatiotemporal geophysical tasks44,45. To be exact, LR is the classification version of the popular linear regression46 model where one assumes the relationship between the source and target features is linear. The RF model is an ensemble ML model that generates multiple decision tree (DT) models by providing a different and randomly sampled subset of the samples and features of the dataset to each DT and makes a final prediction using the majority-vote aggregation method. The XGBoost classifier, or eXtreme Gradient Boosting, is a robust ML technique that enhances the performance of DTs through an iterative process25. In contrast to traditional DTs, XGBoost constructs a sequence of trees, each aiming to correct the errors made by the previously trained trees. This sequential training is guided by the principles of gradient boosting. Finally, AutoGluon is an Automatic Machine Learning (AutoML) library that uses ensemble models of Machine and Deep Learning43.
On top of that, as deep learning models obtain promising results in a wide range of fields47,48,49, in general, and in geophysics, in particular50, we take advantage of the AutoGluon library43, an Automatic Machine Learning (AutoML) library which uses ensemble models of Machine and Deep Learning. This tool uses a unique search and optimization process to find and test a large number of neural network models as well as tree-based ensemble models, looking to optimize the model’s accuracy on the test set.
Experimental setup
We develop several models based on different features and different ML techniques (as presented in Table 2). The features included in the model vary from the basic model, which only includes meteorological data, population density, and vegetation data, to the full model which additionally includes regional fire history, fire weather indices, and spatiotemporal data. We present the accuracy scores of the four different models: logistic regression, XGBoost, Random Forest, and AutoGluon.
In Fig. 3a,b we present the accuracy of the best performing model, XGBoost, and logistic regression as a benchmark. Both models are based on the full set of features. Figure 3c presents the annual average of the XGBoost model’s prediction in different regions, and Fig. 3d–g provides the seasonal average of its prediction.
We evaluate the performance of the model, which is trained on lightning-caused wildfires, when tested on anthropogenic wildfires, and vice versa. To do so, we train an additional XGBoost model on all the wildfires in the dataset which were not classified as lightning-caused. In addition, we randomly chose a similar number of random observations in which there were no ignitions (and no lightning). We trained an additional XGBoost model for anthropogenic wildfires, and evaluated either model on either dataset. We repeated this process twice, using either the basic set of features or the full model.
To estimate the effect of climate change on lightning ignition risk, we perform two analyses. In the first analysis (presented in Fig. 4a), we evaluate the mean annual change in the model’s prediction. We randomly selected eight million thunder samples from the eight years of data (2014–2021). We aggregated samples that are located in similar regions, and calculated the mean model prediction in each region, in each year. We then calculated the mean annual change in each region. Next, we applied the model to climate change projections for 2100 based on the EC-Earth3-CC model under SSP245 scenario51 (presented in Fig. 4b). Inspired by the methodology of previous studies9, we added the mean projected differences between 2100 and present-day for three main meteorological factors (RH, temperature, and precipitation), while holding the rest of the variables constant. We present the difference in the model’s prediction between the two.
Data availability
The datasets generated during and/or analyzed during the current study are available from the corresponding author upon reasonable request.
References
Andela, N. et al. A human-driven decline in global burned area. Science 356(6345), 1356–1362 (2017).
Janssen, T. A. et al. Extratropical forests increasingly at risk due to lightning fires. Nat. Geosci. 16, 1136–1144 (2023).
Li, S. & Banerjee, T. Spatial and temporal pattern of wildfires in California from 2000 to 2019. Sci. Rep. 11(1), 8779 (2021).
Hanes, C. C. et al. Fire-regime changes in Canada over the last half century. Can. J. For. Res. 49(3), 256–269 (2019).
Nampak, H. et al. Characterizing spatial and temporal variability of lightning activity associated with wildfire over Tasmania, Australia. Fire 4(1), 10 (2021).
Shmuel, A. & Heifetz, E. Global wildfire susceptibility mapping based on machine learning models. Forests 13(7), 1050 (2022).
Kalashnikov, D. A. et al. Lightning-ignited wildfires in the western united states: Ignition precipitation and associated environmental conditions. Geophys. Res. Lett. 50(16), 103785 (2023).
Schultz, C. J., Nauslar, N. J., Wachter, J. B., Hain, C. R. & Bell, J. R. Spatial, temporal and electrical characteristics of lightning in reported lightning-initiated wildfire events. Fire 2(2), 18 (2019).
Perez-Invernon, F. J., Gordillo-Vazquez, F. J., Huntrieser, H. & Jöckel, P. Variation of lightning-ignited wildfire patterns under climate change. Nat. Commun. 14, 739 (2023).
Krause, A., Kloster, S., Wilkenskjeld, S. & Paeth, H. The sensitivity of global wildfires to simulated past, present, and future lightning frequency. Adv. Earth Space Sci. 119, 312–322 (2014).
Vecín-Arias, D., Castedo-Dorado, F., Ordóñez, C. & Rodríguez-Pérez, J. R. Biophysical and lightning characteristics drive lightning-induced fire occurrence in the central plateau of the Iberian Peninsula. Agric. For. Meteorol. 225, 36–47 (2016).
Malik, A. et al. Data-driven wildfire risk prediction in Northern California. Atmosphere 12(1), 109 (2021).
Sayad, Y. O., Mousannif, H. & Al Moatassime, H. Predictive modeling of wildfires: A new dataset and machine learning approach. Fire Saf. J. 104, 130–146 (2019).
Arrieta, A. B. et al. Explainable artificial intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion 58, 82–115 (2020).
Rai, A. Explainable AI: From black box to glass box. J. Acad. Mark. Sci. 48, 137–141 (2020).
Abdollahi, A. & Pradhan, B. Explainable artificial intelligence (XAI) for interpreting the contributing factors feed into the wildfire susceptibility prediction model. Sci. Total Environ. 879, 163004 (2023).
Tran, T. T. K. et al. Improving the prediction of wildfire susceptibility on Hawaii Island, Hawaii, using explainable hybrid machine learning models. J. Environ. Manag. 351, 119724 (2024).
Cilli, R. et al. Explainable artificial intelligence (XAI) detects wildfire occurrence in the Mediterranean countries of southern Europe. Sci. Rep. 12, 16349 (2022).
Kondylatos, S. et al. Wildfire danger prediction and understanding with deep learning. Geophys. Res. Lett. 49, e2022GL099368 (2022).
Fan, D., Biswas, A. & Ahrens, J. P. Explainable AI integrated feature engineering for wildfire prediction. arXiv preprint[SPACE]arXiv:2404.01487 (2024).
Van, L. N. et al. Enhancing wildfire mapping accuracy using mono-temporal sentinel-2 data: A novel approach through qualitative and quantitative feature selection with explainable ai. Ecol. Inform. 81, 102601 (2024).
Auret, L. & Aldrich, C. Empirical comparison of tree ensemble variable importance measures. Chemom. Intell. Lab. Syst. 105, 157–170 (2011).
Hang, H. T., Mallick, J., Alqadhi, S., Bindajam, A. A. & Abdo, H. G. Exploring forest fire susceptibility and management strategies in western Himalaya: Integrating ensemble machine learning and explainable ai for accurate prediction and comprehensive analysis. Environ. Technol. Innov. 35, 103655 (2024).
Zhang, B. et al. Explainable machine learning for the prediction and assessment of complex drought impacts. Sci. Total Environ. 898, 165509 (2023).
Chen, T. & Guestrin, C. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794 (Association for Computing Machinery, 2016).
Coughlan, R. et al. Using machine learning to predict fire-ignition occurrences from lightning forecasts. Nat. Geosci. 28, e1973 (2021).
Ho, T. K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20, 832–844 (1998).
Swain, P. H. & Hauska, H. The decision tree classifier: Design and potential. IEEE Trans. Geosci. Electron. 15, 142–147 (1977).
Duchi, J., Hazan, E. & Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011).
DiGangi, E. A., Stock, M. & Lapierre, J. Thunder hours: How old methods offer new insights into thunderstorm climatology. Bull. Am. Meteorol. Soc. 103, E548–E569 (2021).
Artés, T. et al. A global wildfire dataset for the analysis of fire regimes and fire behaviour. Sci. Data 6(1), 296 (2019).
Hersbach, H. et al. The era5 global reanalysis. Q. J. R. Meteorol. Soc. 146(730), 1999–2049 (2020).
Didan, K., Munoz, A. B., Solano, R. & Huete, A. Modis vegetation index user’s guide (mod13 series). Tech. Rep. 35 (University of Arizona: Vegetation Index and Phenology Lab, 2015).
Center for International Earth Science Information Network-CIESIN-Columbia University. Gridded population of the world, version 4 (gpwv4): Population density, revision 11 (NASA Socioeconomic Data and Applications Center (SEDAC), 2018).
Lizundia-Loiola, J., Otón, G., Ramo, R. & Chuvieco, E. A spatio-temporal active-fire clustering approach for global burned area mapping at 250 m from Modis data. Remote Sens. Environ. 236, 111493 (2020).
Elson, P. et al. Scitools/cartopy: Cartopy 0.18.0. Zenodo (2023).
Shmuel, A., Ziv, Y. & Heifetz, E. Machine-learning-based evaluation of the time-lagged effect of meteorological factors on 10-hour dead fuel moisture content. For. Ecol. Manag. 505, 119897 (2022).
Van Wagner, C. E. Development and Structure of the Canadian Forest Fire Weather Index System, vol. 35 (1987).
Copernicus Climate Change Service, Climate Data Store. Fire danger indices historical data from the copernicus emergency management service. Copernicus Climate Change Service (C3S) Climate Data Store (CDS), https://doi.org/10.24381/cds.0e89c522. Accessed on 01-01-2024 (2019).
Shmuel, A., Glickman, O. & Lazebnik, T. Symbolic regression as feature engineering method for machine and deep learning regression tasks. arXiv:2311.06028v1 (2023).
Kotsiantis, S. B., Zaharakis, I. D. & Pintelas, P. E. Machine learning: a review of classification and combining techniques. Artif. Intell. Rev. 26, 159–190 (2006).
Nick, T. G. & Campbell, K. M. Logistic Regression, 273–301 (Humana Press, 2007).
Erickson, N. et al. Autogluon-tabular: Robust and accurate automl for structured data. arXiv preprint[SPACE]arXiv:2003.06505 (2020).
Sidik, I., Saroji, S. & Sulistyani, S. Implementation of machine learning for volcanic earthquake pattern classification using xgboost algorithm. Acta Geophys. (2023).
Harris, J. & Grunsky, E. Predictive lithological mapping of Canada’s north using random forest classification applied to geophysical and geochemical data. Comput. Geosci. 80, 9–25 (2015).
Poole, M. A. & O’Farrell, P. The assumptions of the linear regression model. Trans. Inst. Br. Geogr. 52, 145–158 (1971).
Lazebnik, T. Data-driven hospitals staff and resources allocation using agent-based simulation and deep reinforcement learning. Eng. Appl. Artif. Intell. 126, 106783 (2023).
Bushaj, S. et al. A simulation-deep reinforcement learning (SIRL) approach for epidemic control optimization. Ann. Oper. Res. 328, 245–277 (2022).
Mao, H., Alizadeh, M., Menache, I. & Kandula, S. Resource management with deep reinforcement learning. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, 50–56 (Association for Computing Machinery, 2016).
Yu, S. & Ma, J. Deep learning for geophysics: Current and future trends. Rev. Geophys. 59, e2021RG000742 (2021).
O’Neill, B. C., Tebaldi, C. & Van Vuuren, D. P. The scenario model intercomparison project (scenariomip) for cmip6. Geosci. Model Dev. 9(9), 3461–3482 (2016).
Acknowledgements
We acknowledge the World Climate Research Programme, which, through its Working Group on Coupled Modelling, coordinated and promoted CMIP6. We thank the climate modeling groups for producing and making available their model output, the Earth System Grid Federation (ESGF) for archiving the data and providing access, and the multiple funding agencies that support CMIP6 and ESGF. We also thank the Earth Networks Lightning Network for making their novel thunderhour data available.
Author information
Authors and Affiliations
Contributions
Conceptualization: A. S., C. P.; software: A. S.; Validation: T. L., O. G., C. P.; Formal Analysis: A. S., O. G.; data curation: A. S.; Writing—Original Draft: A. S., T. L.; Visualization: A. S., T. L.; Supervision: T. L., O. G., E. H., C. P.; Project administration: A. S.; All authors designed the methodology, investigated, and reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Shmuel, A., Lazebnik, T., Glickman, O. et al. Global lightning-ignited wildfires prediction and climate change projections based on explainable machine learning models. Sci Rep 15, 7898 (2025). https://doi.org/10.1038/s41598-025-92171-w
Received:
Accepted:
Published:
DOI: https://doi.org/10.1038/s41598-025-92171-w