J. Cent. South Univ. (2012) 19: 1668-1672
DOI: 10.1007/s11771-012-1191-2
Extreme air pollution events: Modeling and prediction
ZHOU Song-mei(周松梅), DENG Qi-hong(邓启红), LIU Wei-wei(刘蔚巍)
School of Energy Science and Engineering, Central South University, Changsha 410083, China
© Central South University Press and Springer-Verlag Berlin Heidelberg 2012
Abstract: To prepare for coming extreme pollution events and minimize their harmful impacts, the first and most important step is to predict their possible intensity. Firstly, the generalized Pareto distribution (GPD) from extreme value theory was used to fit the extreme concentrations of three main pollutants, PM10, NO2 and SO2, in Changsha, China, from 2005 to 2010. Secondly, the predicted results were compared with actual data in a scatter plot, and four statistical indicators, EMA (mean absolute error), ERMS (root mean square error), IA (index of agreement) and R2 (coefficient of determination), were used to evaluate the goodness-of-fit. Thirdly, the return levels corresponding to different return periods were calculated from the fitted distributions. The fitting results show that the extremes of PM10 and SO2 follow an exponential distribution with a short tail, while those of NO2 follow a beta distribution with a bounded tail. The scatter plot and the four statistical indicators suggest that the GPD agrees well with the actual data, so the fitted distributions can reliably be used to predict the return levels corresponding to different return periods. The predicted return levels suggest that coming pollution events for PM10 and SO2 will be even more intense than those already observed, which means that adequate preparation is needed.
Key words: extreme pollution event; generalized Pareto distribution; return level; return period
1 Introduction
In many environmental processes, the average behavior of the system is much less important than the few extreme situations with large impacts [1]. One such situation, the extreme air pollution event, is of particular interest because of its acute harmful impacts on human health and the environment. A well-known extreme pollution event in human history, the “London Fog”, took at least 4 000 lives in only four days [2]. It is a lesson that reminds us to protect the environment, reduce pollutant emissions and prepare for extreme pollution events. To prepare adequately for coming extreme pollution events and minimize their damage, the first question to answer is: “Have we already seen the worst pollution events or are we going to experience even worse ones?”
To answer this question, the possible intensity of future extreme pollution events has to be predicted. Since air pollutant concentrations are influenced by emission levels, meteorological conditions and geography, they are treated as random variables. Therefore, theoretical statistical distributions can be fitted to them, and the return levels corresponding to different return periods can be obtained from the fitted distribution. One of the most popular distributions, the generalized Pareto distribution (GPD) from extreme value theory, has been widely applied to fitting extreme pollution samples. Many researchers have concluded that the GPD is a useful tool for air quality data [3]. LU and FANG used the parent distribution and the type I GPD to fit PM10 in the high concentration region and found that the GPD agrees better with high PM10 concentrations than the parent distributions [4]. HOROWITZ and BARAKAT showed that extreme O3 data could be fitted well by a type I GPD [5]. SMITH also used the GPD to analyse ground-level O3 and found a downward trend in the extreme values [6].
In this work, the GPD was used to fit the extreme value samples of the daily average concentrations of three main pollutants in Changsha, China, PM10, NO2 and SO2, from 2005 to 2010. Moreover, to evaluate the goodness-of-fit of the fitted distributions, the predicted values were compared with actual data, and four statistical indicators (EMA, ERMS, IA and R2) were calculated. Finally, the return levels corresponding to different return periods were calculated from the fitted distributions.
2 Method
2.1 Extreme value distribution
One of the most widely used methods in extreme value analysis at present is the peaks-over-threshold (POT) model, in which the excesses above a sufficiently high threshold u are fitted by the generalized Pareto distribution (GPD) [7]. The model is applied as follows.
If F(x) is the cumulative distribution function of the total sample, and Fu(x) is the conditional cumulative distribution function of the extreme value sample, which is made up of the excesses above a pre-specified threshold u, then
$$F_u(x) = P(X - u \le x \mid X > u) = \frac{F(x+u) - F(u)}{1 - F(u)}, \quad x \ge 0 \tag{1}$$
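According to the Pickands-Balkema-de Haan theorem [8-9], when the threshold u is high enough, the GPD becomes the asymptotic distribution for Fu(x). The GPD is usually expressed as a two-parameter distribution:
$$G(x) = \begin{cases} 1 - \left(1 + \dfrac{\xi x}{\beta}\right)^{-1/\xi}, & \xi \neq 0 \\ 1 - \exp\left(-\dfrac{x}{\beta}\right), & \xi = 0 \end{cases} \tag{2}$$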
where the scale parameter β>0 and the shape parameter −1<ξ<1. Depending on the value of the shape parameter, the GPD can be divided into three types. For ξ=0, the GPD has a type I short tail and is the exponential distribution. For ξ>0, it has a type II long tail and is the Pareto distribution. For ξ<0, it has a type III bounded tail and is a special case of the beta distribution [10]. It should be emphasized that, just like Fu(x), G(x) is only used to fit the extreme value sample made up of the excesses above an appropriate threshold u, and the parameters ξ and β are obtained from the extreme value sample as well.
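The three tail types can be illustrated with a minimal Python sketch of the two-parameter GPD. It assumes the standard form G(x) = 1 − (1 + ξx/β)^(−1/ξ), with the exponential limit at ξ = 0; the function names `gpd_cdf` and `gpd_tail_type` are hypothetical and are not taken from the paper.

```python
import math


def gpd_cdf(x, beta, xi):
    """CDF G(x) of the two-parameter GPD for an excess x above the threshold.

    Assumes the standard parameterization (scale beta > 0, shape xi).
    """
    if beta <= 0:
        raise ValueError("scale parameter beta must be positive")
    if x <= 0:
        return 0.0
    if abs(xi) < 1e-12:
        # Type I (xi = 0): exponential distribution, short tail.
        return 1.0 - math.exp(-x / beta)
    z = 1.0 + xi * x / beta
    if z <= 0:
        # Type III (xi < 0): x lies beyond the upper end point -beta/xi.
        return 1.0
    # Type II (xi > 0, long tail) or type III inside its support.
    return 1.0 - z ** (-1.0 / xi)


def gpd_tail_type(xi, tol=1e-12):
    """Name the tail type implied by the shape parameter."""
    if abs(xi) < tol:
        return "type I (exponential, short tail)"
    return "type II (Pareto, long tail)" if xi > 0 else "type III (beta, bounded tail)"
```

For a negative shape parameter the excess cannot be larger than −β/ξ, which is why `gpd_cdf` returns 1 beyond that point.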
From Eq. (1), we can easily get
$$F(x) = [1 - F(u)]\,F_u(x - u) + F(u), \quad x > u \tag{3}$$
In Eq. (3), we set $F(u) = (n - N)/n$, where N is the size of the extreme value sample and n is the size of the total sample.
When G(x) is used in place of Fu(x), the cumulative probability in the total sample for the extreme concentrations above u is
$$F(x) = 1 - \frac{N}{n}\left[1 + \frac{\xi (x - u)}{\beta}\right]^{-1/\xi}, \quad x > u \tag{4}$$
which reduces to $F(x) = 1 - (N/n)\exp[-(x - u)/\beta]$ for ξ=0.
Since the scale parameter β and the shape parameter ξ are obtained from the fitted G(x), the expression of F(x), which can be used to calculate the return levels and return periods, follows immediately.
2.2 Return period and return level
A major topic of interest in environmental studies is the return period. In most cases, people want to know how long a particular concentration xc can be expected to go without being exceeded. Moreover, the possible extreme pollution level in the future is of particular interest as well.
The cumulative distribution function F(x) can be used to calculate the return period as [11]
$$P_R(x_c) = \frac{1}{1 - F(x_c)} \tag{5}$$
where PR(xc) is the return period of the critical concentration xc; in other words, the concentration xc is the return level corresponding to the return period PR(xc). This means that, over a period of length PR(xc), the concentration xc is expected to be exceeded once on average.
Correspondingly, the return level xc can be predicted whenever the return period is given:
$$x_c = F^{-1}\left(1 - \frac{1}{P_R}\right) \tag{6}$$
where $F^{-1}$ is the inverse function of F.
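Eqs. (4)-(6) can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes the standard POT tail form F(x) = 1 − (N/n)[1 + ξ(x − u)/β]^(−1/ξ), and the function names are hypothetical. The return period is counted in observations (days for daily averages); how the paper converts it to years is not stated.

```python
import math


def tail_cdf(x, u, beta, xi, exceed_frac):
    """F(x) in the total sample for x > u; exceed_frac is N/n."""
    if x <= u:
        raise ValueError("the tail formula only holds above the threshold u")
    if abs(xi) < 1e-12:
        survival = math.exp(-(x - u) / beta)
    else:
        z = 1.0 + xi * (x - u) / beta
        # z <= 0 means x is beyond the upper end point of a bounded tail.
        survival = z ** (-1.0 / xi) if z > 0 else 0.0
    return 1.0 - exceed_frac * survival


def return_period(xc, u, beta, xi, exceed_frac):
    """Return period of level xc, in number of observations."""
    exceed_prob = 1.0 - tail_cdf(xc, u, beta, xi, exceed_frac)
    return math.inf if exceed_prob <= 0.0 else 1.0 / exceed_prob


def return_level(period, u, beta, xi, exceed_frac):
    """Level exceeded on average once per `period` observations."""
    m = period * exceed_frac  # expected number of exceedances of u per period
    if m <= 1.0:
        raise ValueError("period too short: the level would fall below u")
    if abs(xi) < 1e-12:
        return u + beta * math.log(m)
    return u + beta / xi * (m ** xi - 1.0)
```

With illustrative (not fitted) values such as `return_level(365, 100.0, 20.0, 0.0, 0.05)`, the function returns the level expected to be exceeded once in 365 daily observations, and `return_period` inverts it.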
2.3 Parameter estimation and goodness-of-fit criteria
Three parameters of the GPD model need to be estimated. To begin with, an appropriate threshold u has to be chosen. The choice of threshold is critical to any POT analysis: too high a threshold discards too much data, leading to high variance, while too low a threshold is likely to violate the asymptotic basis of the model, leading to bias [12]. A popular approach is to fit the GPD over a range of thresholds and plot the parameter estimates together with their variability; an appropriate threshold is the lowest one above which any higher threshold gives similar estimates [13].
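The threshold scan described above can be sketched as follows. To keep the example self-contained, it swaps in the simple method-of-moments estimator of the GPD (valid only for ξ < 1/2) instead of the maximum-likelihood fit used in the paper, and it omits the variability bands; the function names are hypothetical. The modified scale β − ξu is reported because, if the GPD holds above some threshold, both ξ and β − ξu stay constant for all higher thresholds.

```python
from statistics import mean, variance


def gpd_moment_fit(excesses):
    """Method-of-moments estimates (xi, beta); a stand-in for the ML fit."""
    m = mean(excesses)
    s2 = variance(excesses)
    ratio = m * m / s2
    xi = 0.5 * (1.0 - ratio)
    beta = 0.5 * m * (1.0 + ratio)
    return xi, beta


def threshold_scan(values, thresholds, min_excesses=30):
    """Fit the GPD above each candidate threshold.

    Returns rows (u, sample size, xi, modified scale beta - xi*u);
    thresholds leaving fewer than min_excesses points are skipped.
    """
    rows = []
    for u in thresholds:
        excesses = [v - u for v in values if v > u]
        if len(excesses) < min_excesses:
            continue
        xi, beta = gpd_moment_fit(excesses)
        rows.append((u, len(excesses), xi, beta - xi * u))
    return rows
```

Plotting the third and fourth columns against u and picking the lowest threshold from which both level off mirrors the selection rule in the text.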
Once the threshold u is fixed, the other two parameters of the GPD, the scale parameter β and the shape parameter ξ, can be estimated from the observed extreme value sample, made up of the excesses above u, by the maximum-likelihood method using the extremes toolkit package in R [14].
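For readers without access to the R toolkit, the idea of the maximum-likelihood fit can be sketched in plain Python. This is not the toolkit's algorithm: it is a simple grid search over θ = ξ/β on the profile log-likelihood, for which the best ξ at fixed θ is the mean of ln(1 + θx). The function name, the grid size and the upper search limit are assumptions made for illustration, and the shape is restricted to −1 < ξ < 1 as in Section 2.1.

```python
import math


def fit_gpd_ml(excesses, grid_size=1000):
    """Approximate ML estimates (xi, beta) of the GPD from positive excesses."""
    n = len(excesses)
    x_max = max(excesses)
    x_mean = sum(excesses) / n
    # theta = xi / beta must keep 1 + theta * x positive for every excess.
    lo = -0.999 / x_max
    hi = 2.0 / x_mean  # ad hoc upper search limit (assumption)
    best = None
    for k in range(grid_size + 1):
        theta = lo + (hi - lo) * k / grid_size
        if theta == 0.0:
            continue
        # For fixed theta, the log-likelihood is maximized by this xi.
        xi = sum(math.log1p(theta * x) for x in excesses) / n
        if xi == 0.0 or not -1.0 < xi < 1.0:
            continue
        beta = xi / theta
        # Profile log-likelihood at (theta, xi(theta)).
        loglik = -n * (math.log(beta) + 1.0 + xi)
        if best is None or loglik > best[0]:
            best = (loglik, xi, beta)
    if best is None:
        raise ValueError("no admissible parameter pair found on the grid")
    return best[1], best[2]
```

An estimated ξ close to zero points to the exponential (type I) case, and a clearly negative ξ to the bounded (type III) case; the grid resolution limits the precision of the estimates.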
After the parameters were estimated, four statistical indicators were used to evaluate the goodness-of-fit of the obtained theoretical distributions. These statistical indicators are EMA (mean absolute error), ERMS (root mean square error), IA (index of agreement) and R2 (coefficient of determination) [15]:
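$$E_{\mathrm{MA}} = \frac{1}{N}\sum_{i=1}^{N} \left|P_i - O_i\right| \tag{7}$$
$$E_{\mathrm{RMS}} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left(P_i - O_i\right)^2} \tag{8}$$
$$I_{\mathrm{A}} = 1 - \frac{\sum_{i=1}^{N} \left(P_i - O_i\right)^2}{\sum_{i=1}^{N} \left(\left|P_i - \bar{O}\right| + \left|O_i - \bar{O}\right|\right)^2} \tag{9}$$
$$R^2 = \frac{\left[\sum_{i=1}^{N} \left(P_i - \bar{P}\right)\left(O_i - \bar{O}\right)\right]^2}{\sum_{i=1}^{N} \left(P_i - \bar{P}\right)^2 \sum_{i=1}^{N} \left(O_i - \bar{O}\right)^2} \tag{10}$$
where $P_i$ and $O_i$ are the predicted and observed data; $\bar{P}$ and $\bar{O}$ are the mean values of the predicted and observed data; and N is the data size. For a good model, EMA and ERMS should approach zero and IA and R2 should be close to 1 [16].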
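The four indicators can be computed with a few lines of Python. The sketch below assumes the commonly used definitions (mean absolute error, root mean square error, Willmott's index of agreement and the squared correlation coefficient); the function name `fit_indicators` is hypothetical, and the inputs are assumed to be non-constant sequences of equal length.

```python
import math


def fit_indicators(predicted, observed):
    """Return EMA, ERMS, IA and R2 for paired predicted/observed values."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(observed)
    p_mean = sum(predicted) / n
    o_mean = sum(observed) / n
    sq_err = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    e_ma = sum(abs(p - o) for p, o in zip(predicted, observed)) / n
    e_rms = math.sqrt(sq_err / n)
    # Index of agreement: 1 minus squared error over "potential error".
    potential = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2
                    for p, o in zip(predicted, observed))
    i_a = 1.0 - sq_err / potential
    # Coefficient of determination as the squared correlation coefficient.
    cov = sum((p - p_mean) * (o - o_mean) for p, o in zip(predicted, observed))
    var_p = sum((p - p_mean) ** 2 for p in predicted)
    var_o = sum((o - o_mean) ** 2 for o in observed)
    r2 = cov ** 2 / (var_p * var_o)
    return {"EMA": e_ma, "ERMS": e_rms, "IA": i_a, "R2": r2}
```

Note that a constant offset between predictions and observations leaves R2 at 1 while lowering IA, which is one reason to report both.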
3 Results and discussion
3.1 Extreme value sample
Table 1 gives the basic statistics of daily average pollutant concentrations of PM10, NO2 and SO2 from 2005 to 2010 and the extreme value samples.
Table 1 Total samples of three main pollutants from 2005 to 2010 in Changsha and extreme value samples
Over the six years, 2 189 daily samples were obtained for each of the three pollutants, with two days of data missing. The mean concentrations of PM10, NO2 and SO2 are 101.5, 41.7 and 61.9 μg/m3, respectively. Since the standard deviation of NO2 is the smallest, the variability of NO2 concentrations is smaller than that of the other two pollutants. With thresholds of 180, 75 and 150 μg/m3 for PM10, NO2 and SO2, respectively, which are approximately twice their mean values, SO2 has the smallest extreme value sample size of 79, while PM10 and NO2 have 143 and 103 samples, respectively. It is worth noting that the thresholds were chosen after many different values had been tested.
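Building the extreme value sample is a simple filtering step, sketched below in Python. The function name is hypothetical, and the pairing of the thresholds 180, 75 and 150 μg/m3 with PM10, NO2 and SO2 follows the order in which the pollutants are listed, which is an assumption.

```python
def extreme_value_sample(concentrations, u):
    """Return the excesses above threshold u and the exceedance fraction N/n."""
    if not concentrations:
        raise ValueError("empty concentration series")
    excesses = [c - u for c in concentrations if c > u]
    return excesses, len(excesses) / len(concentrations)


# Sample sizes reported in the text: (threshold in ug/m3, N), with n = 2189.
# The threshold-to-pollutant pairing is assumed from the listing order.
REPORTED = {"PM10": (180, 143), "NO2": (75, 103), "SO2": (150, 79)}
N_TOTAL = 2189
EXCEEDANCE_FRACTION = {
    name: n_ext / N_TOTAL for name, (_, n_ext) in REPORTED.items()
}
```

With the reported sizes, the exceedance fractions N/n are roughly 6.5%, 4.7% and 3.6% for PM10, NO2 and SO2, so each threshold sits in the upper few percent of the daily record.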
3.2 Results of fitting
Once the extreme value samples of the three pollutants had been chosen, the GPD was fitted to them using the extremes toolkit in R. The fitting results and parameter estimates for the three pollutants are shown in Fig. 1.
Fig. 1 Extreme value distribution fitting of three pollutants: (a) PM10; (b) NO2; (c) SO2
On the one hand, the shape parameters of PM10 and SO2 are zero while the shape parameter of NO2 is negative. This directly suggests that PM10 and SO2 follow the type I exponential distribution with a short tail, while NO2 follows the type III beta distribution with a bounded tail. This means that the concentrations of PM10 and SO2 are unbounded and their maximum concentration cannot be determined. In contrast, the concentration of NO2 is bounded above by a maximum value.
On the other hand, the probability density of PM10 and SO2 falls slowly in the high concentration region and approaches the x axis asymptotically in the tail. For NO2, however, a different trend appears: the probability density decreases rapidly throughout until it reaches the x axis. These features of the probability density curves for the three pollutants also support the tail properties suggested by their shape parameter values.
3.3 Reliability of fitted distributions
Figure 2 compares the predicted results with the actual data. The scattered points all lie close to the straight line, so the predicted results match the actual data well for all three pollutants.
To assess the goodness-of-fit more directly, the statistical indicators were calculated as well. The results are given in Table 2. The values of EMA and ERMS for all the pollutants are very low and the values of IA and R2 are very close to 1, which means that the predicted values agree well with the observed values. Moreover, the error for NO2 appears to be the lowest, which may result from its smallest variability. All the statistical indicators suggest that the GPD is a reliable model for representing the behavior of high pollutant concentrations.
3.4 Return period and return level
Since the fitted models are reliable, they can be used to predict the return levels corresponding to different return periods. The predicted results are shown in Fig. 3. Taking a return period of 50 a as an example, the corresponding return levels of PM10, NO2 and SO2 are 405, 112 and 333 μg/m3, respectively. This means that concentrations of 405, 112 and 333 μg/m3 for PM10, NO2 and SO2, respectively, are expected to occur once every 50 a on average.
As the return period increases, the return levels of PM10 and SO2 rise linearly, which means that the return level grows without bound as time goes by. In other words, there is no upper limit on the intensity of future PM10 or SO2 extreme pollution events. However, the slope of the NO2 curve approaches zero when the return period becomes longer than 10 a. This indicates that the concentration of NO2 is bounded above: the return level changes little, with a maximum of around 115 μg/m3, for any return period longer than 10 a. Both phenomena are consistent with the distribution types and tail properties suggested above.
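Both behaviours can be read off the return-level formula. As a sketch, assume the tail of the total sample has the standard POT form $F(x) = 1 - (N/n)\left[1 + \xi (x - u)/\beta\right]^{-1/\xi}$ and that the return period $P_R$ is counted in observations. Inverting $P_R = 1/[1 - F(x_c)]$ then gives

$$x_c = \begin{cases} u + \dfrac{\beta}{\xi}\left[\left(\dfrac{N}{n} P_R\right)^{\xi} - 1\right], & \xi \neq 0 \\ u + \beta \ln\left(\dfrac{N}{n} P_R\right), & \xi = 0 \end{cases}$$

For ξ=0, the return level grows linearly in $\ln P_R$ without limit, so a straight line would be expected when the return period is plotted on a logarithmic axis. For ξ<0, the term $(N P_R / n)^{\xi}$ tends to zero as $P_R \to \infty$, so the return level approaches the finite upper end point

$$\lim_{P_R \to \infty} x_c = u - \frac{\beta}{\xi}$$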
In summary, for NO2 the worst pollution event has already occurred. However, citizens are still going to experience even worse pollution events caused by PM10 or SO2 as time goes by.
Fig. 2 Comparisons of predicted values by extreme value distributions with actual data of three pollutants: (a) PM10; (b) NO2; (c) SO2
Table 2 Results of goodness-of-fit criteria in high concentration region for different fitted distributions
Fig. 3 Return levels for different return periods of three pollutants
3.5 Discussion
Many other studies have investigated the extreme value distributions of pollutants as well. LU suggested that the exponential type of the GPD fits the daily average PM10 concentration well [17], which agrees with the results of this work. An extreme value analysis for Munich showed that the daily average NO2 concentration follows the exponential distribution [18], which differs from the results of this work. Another study in Istanbul found that the distributions of the hourly concentrations of SO2 and NO2 varied between stations: at the Alibeykoy station they followed the exponential distribution with short tails, but at the Umraniye station they followed the Pareto distribution [19].
Therefore, the distribution may differ between regions. Since the pollutant concentration is influenced by the emission levels, meteorological conditions and geography [20], the variety of extreme pollution distribution types seems reasonable.
4 Conclusions
1) The fitted GPD agrees well with the actual data. The extreme values of PM10 and SO2 are exponentially distributed with a short tail, while those of NO2 are beta distributed with a bounded tail.
2) As the return period increases, the return levels of PM10 and SO2 increase as well, while that of NO2 levels off in the tail, with a maximum concentration of around 115 μg/m3 for any return period longer than 10 a.
3) The worst NO2 pollution event has already been seen. However, even worse pollution events caused by PM10 or SO2 are still to be expected as time goes by. Adequate preparation for PM10 and SO2 pollution should be made in order to minimize the damage.
References
[1] BEGUERIA S, VICENTE-SERRANO S M. Mapping the hazard of extreme rainfall by peaks over threshold extreme value and spatial regression techniques [J]. Journal of Applied Meteorology and Climatology, 2006, 45(1): 108-124.
[2] BELL M L, DAVIS D L. Reassessment of the lethal London fog of 1952: Novel indicators of acute and chronic consequences of acute exposure to air pollution [J]. Environmental Health Perspectives, 2001, 109(S3): 389-394.
[3] MIJIC Z, TASIC M, RAJSIC S. The statistical characters of PM10 in Belgrade area [J]. Atmospheric Research, 2009, 92(4): 420-426.
[4] LU H C, FANG G C. Predicting the exceedances of a critical PM10 concentration—A case study in Taiwan [J]. Atmospheric Environment, 2003, 37(25): 3491-3499.
[5] HOROWITZ J, BARAKAT S. Statistical analysis of the maximum concentration of an air pollutant: Effects of autocorrelation and non-stationarity [J]. Atmospheric Environment, 1979, 13(6): 811-818.
[6] SMITH R L. Extreme value analysis of environmental time series: an application to trend detection in ground-level ozone [J]. Statistical Science, 1989, 4(4): 367-393.
[7] MCNEIL A J. Estimating the tails of loss severity distributions using extreme value theory [J]. ASTIN Bulletin, 1997, 27: 117-137.
[8] BALKEMA A A, DE HAAN L. Residual life time at great age [J]. The Annals of Probability, 1974, 2(5): 792-804.
[9] PICKANDS J. Statistical inference using extreme order statistics [J]. The Annals of Statistics, 1975, 3(1): 119-131.
[10] CAIRES S, GROENEWEG J, STERL A. Past and future changes in the sea extreme waves [C] // 31st International Conference on Coastal Engineering. Hamburg, Germany: ICCE Press, 2008.
[11] LU H C. Estimating the emission source reduction of PM10 in Taiwan [J]. Chemosphere, 2004, 54(7): 805-814.
[12] TANCREDI A, ANDERSON C, O’HAGAN A. Accounting for threshold uncertainty in extreme value estimation [J]. Mathematics and Statistics, 2006, 9(2): 87-106.
[13] GILLELAND E, KATZ R W. Analyzing seasonal to inter-annual extreme weather and climate variability with the extremes toolkit [C] // 86th American Meteorological Society Annual Meeting. Atlanta: Georgia Press, 2006: 2-15.
[14] ZHANG J. Likelihood moment estimation for the generalized Pareto distribution [J]. Australian and New Zealand Journal of Statistics, 2007, 49(1): 69-77.
[15] ATILLA A, KAAN Y, FERRUH E, ERCAN O. A neural network based approach for the prediction of urban SO2 concentrations in the Istanbul metropolitan area [J]. Environment and Pollution, 2010, 40(4): 301-321.
[16] KARATZAS K D, PAPADOURAKIS G, KYRIAKIDIS L. Understanding and forecasting air pollution with the aid of artificial intelligence methods in Athens, Greece [J]. Tools and Applications with Intelligence, 2009, 166: 37-50.
[17] MIJIC Z, TASIC M, RAJSIC S, NOVAKOVIC V. The statistical characters of PM10 in Belgrade area [J]. Atmospheric Research, 2009, 92(4): 420-426.
[18] KUCHENHOFF H, THAMERUS M. Extreme value analysis of Munich air pollution data [J]. Environmental and Ecological Statistics, 1996, 3(2): 127-141.
[19] ERCELEBI S G, TOROS H. Extreme value analysis of Istanbul air pollution data [J]. Clean, 2009, 37(2): 122-131.
[20] FENG J L, HU M, CHAN C K, LAU P S, FANG M, HE L Y, TANG X Y. A comparative study of the organic matter in PM2.5 from three Chinese megacities in three different climatic zones [J]. Atmospheric Environment, 2006, 40(21): 3983-3994.
(Edited by HE Yun-bin)
Foundation item: Project(51178466) supported by the National Natural Science Foundation of China; Project(200545) supported by the Foundation for the Author of National Excellent Doctoral Dissertation of China; Project(2011JQ006) supported by the Fundamental Research Funds of the Central Universities of China; Project(2008BAJ12B03) supported by the National Key Program of Scientific and Technical Supporting Programs of China
Received date: 2011-07-26; Accepted date: 2011-11-14
Corresponding author: DENG Qi-hong, Professor, PhD; Tel: +86-731-88877175; E-mail: qhdeng@csu.edu.cn