
Extreme air pollution events: Modeling and prediction

Source journal: Journal of Central South University (English Edition), 2012, No. 6

Authors: ZHOU Song-mei, DENG Qi-hong, LIU Wei-wei

Pages: 1668-1672

Key words: extreme pollution event; generalized Pareto distribution; return level; return period

Abstract:

To prepare for coming extreme pollution events and minimize their harmful impacts, the first and most important step is to predict their possible future intensity. Firstly, the generalized Pareto distribution (GPD) from extreme value theory was used to fit the extreme concentrations of three main pollutants, PM₁₀, NO₂ and SO₂, from 2005 to 2010 in Changsha, China. Secondly, the predictions were compared with actual data in a scatter plot, and four statistical indicators, E_MA (mean absolute error), E_RMS (root mean square error), I_A (index of agreement) and R² (coefficient of determination), were used to evaluate the goodness-of-fit. Thirdly, the return levels corresponding to different return periods were calculated from the fitted distributions. The fitting results show that the extremes of PM₁₀ and SO₂ follow an exponential distribution with a short tail, while those of NO₂ follow a beta distribution with a bounded tail. The scatter plot and the four statistical indicators suggest that the GPD agrees well with the actual data, so the fitted distributions can reliably be used to predict the return levels corresponding to different return periods. The predicted return levels suggest that future PM₁₀ and SO₂ pollution events will be even more intense than those observed so far, which means that adequate preparation is needed.

J. Cent. South Univ. (2012) 19: 1668-1672

DOI: 10.1007/s11771-012-1191-2


ZHOU Song-mei(周松梅), DENG Qi-hong(邓启红), LIU Wei-wei(刘蔚巍)

School of Energy Science and Engineering, Central South University, Changsha 410083, China

© Central South University Press and Springer-Verlag Berlin Heidelberg 2012


1 Introduction

In many environmental processes, the average behavior of the system is much less important than the few extreme situations with large impacts [1]. One such situation, the extreme air pollution event, is of particular interest because of its acute harmful impacts on human health and the damage it causes to the environment. A well-remembered example is the “London Fog”, which claimed at least 4 000 lives in only four days [2]. It is a lesson that reminds us to protect the environment, reduce pollutant emissions and prepare for extreme pollution events. To prepare adequately for coming extreme pollution events and minimize their damage, we first have to answer the most important question: “Have we already seen the worst pollution events or are we going to experience even worse ones?”

To answer this question, we have to predict the possible intensity of future extreme pollution events. Since air pollutant concentrations are influenced by emission levels, meteorological conditions and geography, they can be treated as random variables. Therefore, we can fit them with theoretical statistical distributions and obtain the return levels corresponding to different return periods from the fitted distribution. One of the most popular distributions, the generalized Pareto distribution (GPD) from extreme value theory, has been widely applied to fitting extreme pollution samples. Many researchers have concluded that the GPD is a useful tool for air quality data [3]. LU and FANG used a parent distribution and the type I GPD to fit PM₁₀ in the high concentration region, and found that the GPD agreed better with the high PM₁₀ concentration levels than the parent distributions [4]. HOROWITZ and BARAKAT showed that extreme O₃ data could be fitted well by a type I GPD [5]. SMITH also used the GPD to analyse ground-level O₃ and found a downward trend in the extreme values [6].

In this work, the GPD was used to fit the extreme value samples of the daily average concentrations of three main pollutants in Changsha, China, PM₁₀, NO₂ and SO₂, from 2005 to 2010. To evaluate the goodness-of-fit of the fitted distributions, the predicted values were compared with actual data, and four statistical indicators (E_MA, E_RMS, I_A and R²) were calculated. Finally, the return levels corresponding to different return periods were calculated from the fitted distributions.

2 Method

2.1 Extreme value distribution

One of the most widely used methods in extreme value analysis is the peaks-over-threshold (POT) model, in which the excesses above a sufficiently high threshold u are fitted by the generalized Pareto distribution (GPD) [7]. The model is applied in the following steps.

If F(x) is the cumulative distribution function of the total sample, and F_u(x) is the conditional cumulative distribution function of the extreme value sample, which is made up of the concentrations exceeding a pre-specified threshold u, then

$$F_u(x) = P(X \le x \mid X > u) = \frac{F(x) - F(u)}{1 - F(u)}, \quad x > u \tag{1}$$

According to the Pickands-Balkema-de Haan theorem [8-9], when the threshold u is high enough, the GPD becomes the asymptotic distribution for F_u(x). The GPD is usually expressed as a two-parameter distribution:

$$G(x) = \begin{cases} 1 - \left(1 + \xi\,\dfrac{x - u}{\beta}\right)^{-1/\xi}, & \xi \ne 0 \\ 1 - \exp\left(-\dfrac{x - u}{\beta}\right), & \xi = 0 \end{cases} \tag{2}$$

where the scale parameter β>0 and the shape parameter -1<ξ<1. Depending on the value of the shape parameter, the GPD can be divided into three types. For ξ=0, the GPD has a type I short tail and is the exponential distribution. For ξ>0, it has a type II long tail and is the Pareto distribution. For ξ<0, it has a type III bounded tail and is a special case of the beta distribution [10]. It should be emphasized that, just like F_u(x), G(x) is only used to fit the extreme value sample, which is made up of the values exceeding an appropriate threshold u, and the parameters ξ and β are obtained from the extreme value sample as well.
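The three tail types can be illustrated with a minimal Python sketch of the GPD cumulative probability. It assumes the standard form G(x) = 1 - (1 + ξ(x - u)/β)^(-1/ξ) for a concentration x above the threshold u, with ξ denoting the shape parameter. The function names and all numerical values are hypothetical and are not the fitted parameters of this work.

```python
import math

def gpd_cdf(x, u, beta, xi):
    """GPD cumulative probability for a concentration x above threshold u.

    Assumed form: 1 - (1 + xi * (x - u) / beta) ** (-1 / xi), with the
    exponential limit 1 - exp(-(x - u) / beta) when xi is zero.
    """
    if beta <= 0:
        raise ValueError("scale parameter beta must be positive")
    y = x - u  # excess over the threshold
    if y <= 0:
        return 0.0
    if abs(xi) < 1e-12:  # type I: exponential, short tail
        return 1.0 - math.exp(-y / beta)
    base = 1.0 + xi * y / beta
    if base <= 0.0:  # only possible for xi < 0: x lies beyond the upper bound
        return 1.0
    return 1.0 - base ** (-1.0 / xi)

def tail_type(xi, tol=1e-12):
    """Name the GPD type implied by the sign of the shape parameter."""
    if abs(xi) < tol:
        return "type I (exponential, short tail)"
    if xi > 0:
        return "type II (Pareto, long tail)"
    return "type III (beta, bounded tail)"

# Hypothetical parameters, chosen only to show the three behaviours.
for shape in (0.0, 0.5, -0.5):
    print(tail_type(shape), gpd_cdf(230.0, 180.0, 50.0, shape))
```

For the same excess, the long-tailed case leaves the most probability above x and the bounded case the least, which is the ordering the three types are meant to capture.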

From Eq. (1), we can easily get

$$F(x) = \left[1 - F(u)\right] F_u(x) + F(u) \tag{3}$$

In Eq. (3), we set F(u) = (n - N)/n, where N is the size of the extreme value sample and n is the size of the total sample.

When G(x) is used in place of F_u(x), we get the cumulative probability in the total sample for the extreme concentrations above u:

$$F(x) = 1 - \frac{N}{n}\left(1 + \xi\,\frac{x - u}{\beta}\right)^{-1/\xi}, \quad x > u \tag{4}$$

For ξ=0 the power term is replaced by exp(-(x - u)/β). Once the scale parameter β and the shape parameter ξ have been obtained from the fitted G(x), the expression of F(x), which can be used to calculate the return levels and return periods, follows immediately.
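The step from the conditional distribution to the total-sample distribution can be sketched in Python as follows. The sketch assumes the empirical estimate F(u) = (n - N)/n and the standard GPD survival term, with ξ denoting the shape parameter. The function name and the scale value used in the example call are hypothetical.

```python
import math

def tail_cdf(x, u, beta, xi, n_exceed, n_total):
    """Estimate F(x) in the total sample for x >= u.

    Combines the fraction of the sample above u (n_exceed / n_total)
    with the GPD survival probability of the excess x - u.
    """
    if x < u:
        raise ValueError("the tail estimate is only defined for x >= u")
    y = (x - u) / beta
    if abs(xi) < 1e-12:
        survival = math.exp(-y)  # exponential limit of the GPD
    else:
        base = 1.0 + xi * y
        survival = base ** (-1.0 / xi) if base > 0.0 else 0.0
    return 1.0 - (n_exceed / n_total) * survival

# At the threshold itself the estimate reduces to (n - N) / n.
# beta = 30.0 is a hypothetical scale, not a fitted value.
print(tail_cdf(180.0, 180.0, 30.0, 0.0, n_exceed=143, n_total=2189))
```

The sample sizes in the example call are the PM₁₀ counts reported in Section 3.1. Only the scale parameter is invented.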

2.2 Return period and return level

A major topic of interest in environmental studies is the return period. People often want to know how long it takes, on average, before a particular concentration x_c is exceeded. Moreover, the possible extreme pollution level in the future is of particular interest as well.

The cumulative distribution function F(x) can be used to calculate the return period as [11]

$$P_R(x_c) = \frac{1}{1 - F(x_c)} \tag{5}$$

where P_R(x_c) is the return period of the critical concentration x_c; in other words, the concentration x_c is the return level corresponding to the return period P_R(x_c). This means that x_c is expected to be exceeded on average once in a period of length P_R(x_c).

Correspondingly, the return level x_c can be predicted whenever the return period is given:

$$x_c = F^{-1}\left(1 - \frac{1}{P_R(x_c)}\right) \tag{6}$$

where F⁻¹ is the inverse function of F.
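For the GPD tail the inverse has a closed form, which the following Python sketch uses. It assumes F(x) = 1 - (N/n)(1 + ξ(x - u)/β)^(-1/ξ), with ξ denoting the shape parameter, and a return period of 1/(1 - F(x_c)) counted in numbers of observations (days, for daily averages). The function names and parameter values are hypothetical.

```python
import math

def return_period(x_c, u, beta, xi, n_exceed, n_total):
    """Return period of x_c, in numbers of observations: 1 / (1 - F(x_c))."""
    y = (x_c - u) / beta
    if abs(xi) < 1e-12:
        survival = math.exp(-y)
    else:
        base = 1.0 + xi * y
        if base <= 0.0:  # x_c is above the upper bound of a bounded tail
            return math.inf
        survival = base ** (-1.0 / xi)
    return 1.0 / ((n_exceed / n_total) * survival)

def return_level(period, u, beta, xi, n_exceed, n_total):
    """Return level for a return period given in numbers of observations."""
    m = period * n_exceed / n_total  # expected exceedances of u in the period
    if m < 1.0:
        raise ValueError("period too short: the return level would lie below u")
    if abs(xi) < 1e-12:
        return u + beta * math.log(m)
    return u + (beta / xi) * (m ** xi - 1.0)

# Round trip with hypothetical parameters: 50 years of daily observations,
# assuming 365 observations per year.
period_days = 50 * 365
level = return_level(period_days, 180.0, 30.0, 0.0, 143, 2189)
print(level, return_period(level, 180.0, 30.0, 0.0, 143, 2189))
```

Converting a return period in years to observations requires a number of observations per year. The value 365 used here is an assumption of the sketch.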

2.3 Parameter estimation and goodness-of-fit criteria

Three parameters of the GPD need to be estimated: the threshold u, the scale parameter β and the shape parameter ξ. To begin with, an appropriate threshold u has to be chosen. The choice of threshold is critical to any POT analysis: too high a threshold discards too much data, leading to high variance, while too low a threshold is likely to violate the asymptotic basis of the model, leading to bias [12]. One popular approach is to fit the GPD over a range of thresholds and plot the parameter estimates together with their variability; an appropriate threshold is then the lowest value above which the estimates remain approximately unchanged [13].
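A simpler, related diagnostic is the mean excess (mean residual life) over a range of thresholds, sketched below in Python. This is a stand-in for the refitting approach described above, not the procedure used in this work. It relies on the property that the mean excess of a GPD tail is linear in the threshold, and constant for an exponential tail. The function name and the synthetic data are hypothetical.

```python
import random

def mean_excess_table(values, thresholds):
    """For each threshold, return (threshold, number of exceedances, mean excess)."""
    rows = []
    for u in thresholds:
        excesses = [v - u for v in values if v > u]
        if excesses:  # skip thresholds that no observation exceeds
            rows.append((u, len(excesses), sum(excesses) / len(excesses)))
    return rows

# Synthetic stand-in for daily concentrations (NOT the Changsha data):
# a constant offset plus exponentially distributed noise with mean 30.
random.seed(0)
synthetic = [40.0 + random.expovariate(1.0 / 30.0) for _ in range(2189)]

for u, count, mean_exc in mean_excess_table(synthetic, [60, 80, 100, 120]):
    print(f"u={u:5.0f}  exceedances={count:4d}  mean excess={mean_exc:6.1f}")
```

In such a table, a threshold is a plausible candidate once the mean excess stops drifting systematically, while the number of exceedances shows how much data remains for the fit.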

Once the threshold u is fixed, the other two parameters of the GPD, the scale parameter β and the shape parameter ξ, can be estimated from the observed extreme value sample of excesses above u by the maximum-likelihood method, using the extremes toolkit package for the R language [14].
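The maximum-likelihood step can be sketched without R. The Python code below is a stand-in for the extremes toolkit, not its implementation. It uses the profile-likelihood reparameterization τ = ξ/β, for which the best ξ at fixed τ is the mean of log(1 + τy) over the excesses y, and searches τ on a grid. The function name, the grid size, the upper search limit and the synthetic data are all assumptions of the sketch.

```python
import math
import random

def fit_gpd_profile(excesses, n_grid=1000):
    """Grid-search maximum-likelihood fit of the GPD to positive excesses.

    Returns (xi, beta), with xi restricted to the interval (-1, 1).
    """
    y = [v for v in excesses if v > 0]
    if not y:
        raise ValueError("need at least one positive excess")
    n = len(y)
    y_max, y_mean = max(y), sum(y) / n
    # Exponential limit (tau -> 0): xi = 0 and beta = mean excess.
    best = (-n * (math.log(y_mean) + 1.0), 0.0, y_mean)
    lo = -1.0 / y_max   # tau must stay above this so that 1 + tau * y > 0
    hi = 5.0 / y_mean   # generous upper search limit (an assumption)
    for k in range(1, n_grid):
        tau = lo + (hi - lo) * k / n_grid
        if abs(tau) < 1e-12:
            continue
        xi = sum(math.log1p(tau * v) for v in y) / n
        if xi == 0.0 or not -1.0 < xi < 1.0:
            continue
        beta = xi / tau  # positive, because xi and tau share the same sign
        loglik = -n * (math.log(beta) + xi + 1.0)  # profile log-likelihood
        if loglik > best[0]:
            best = (loglik, xi, beta)
    return best[1], best[2]

# Synthetic exponential excesses with scale 30 (NOT measured data).
random.seed(1)
demo = [random.expovariate(1.0 / 30.0) for _ in range(500)]
print(fit_gpd_profile(demo))
```

A grid search is slow but transparent. A production fit would normally use a numerical optimizer and also report standard errors, as the R toolkit does.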

After the parameters were estimated, four statistical indicators were used to evaluate the goodness-of-fit of the obtained theoretical distributions. These statistical indicators are E_MA (mean absolute error), E_RMS (root mean square error), I_A (index of agreement) and R² (coefficient of determination) [15]:

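$$E_{MA} = \frac{1}{N}\sum_{i=1}^{N}\left|P_i - O_i\right| \tag{7}$$

$$E_{RMS} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(P_i - O_i\right)^2} \tag{8}$$

$$I_A = 1 - \frac{\sum_{i=1}^{N}\left(P_i - O_i\right)^2}{\sum_{i=1}^{N}\left(\left|P_i - \bar{O}\right| + \left|O_i - \bar{O}\right|\right)^2} \tag{9}$$

$$R^2 = \frac{\left[\sum_{i=1}^{N}\left(P_i - \bar{P}\right)\left(O_i - \bar{O}\right)\right]^2}{\sum_{i=1}^{N}\left(P_i - \bar{P}\right)^2\,\sum_{i=1}^{N}\left(O_i - \bar{O}\right)^2} \tag{10}$$

where P_i and O_i are the predicted and observed data; P̄ and Ō are the mean values of the predicted and observed data; N is the data size. For a good model, E_MA and E_RMS should approach zero and I_A and R² should be close to 1 [16].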
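The four indicators can be computed with a few lines of Python. The sketch assumes the common definitions: mean absolute error, root mean square error, Willmott's index of agreement, and the squared correlation between predicted and observed values as the coefficient of determination. The function name and the example numbers are hypothetical.

```python
import math

def goodness_of_fit(predicted, observed):
    """Return E_MA, E_RMS, I_A and R2 for paired predicted/observed values."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(observed)
    pairs = list(zip(predicted, observed))
    p_mean = sum(predicted) / n
    o_mean = sum(observed) / n
    sq_err = sum((p - o) ** 2 for p, o in pairs)
    e_ma = sum(abs(p - o) for p, o in pairs) / n
    e_rms = math.sqrt(sq_err / n)
    # Index of agreement: 1 minus squared error over "potential" error.
    potential = sum((abs(p - o_mean) + abs(o - o_mean)) ** 2 for p, o in pairs)
    i_a = 1.0 - sq_err / potential
    # Squared Pearson correlation between predictions and observations.
    cov = sum((p - p_mean) * (o - o_mean) for p, o in pairs)
    var_p = sum((p - p_mean) ** 2 for p in predicted)
    var_o = sum((o - o_mean) ** 2 for o in observed)
    r2 = cov ** 2 / (var_p * var_o)
    return {"E_MA": e_ma, "E_RMS": e_rms, "I_A": i_a, "R2": r2}

# Hypothetical predicted and observed concentrations in ug/m3.
print(goodness_of_fit([185.0, 200.0, 240.0, 310.0], [183.0, 204.0, 236.0, 320.0]))
```

The function divides by the spread of the data, so it is not defined when all predicted or all observed values are identical.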

3 Results and discussion

3.1 Extreme value sample

Table 1 gives the basic statistics of daily average pollutant concentrations of PM₁₀, NO₂ and SO₂ from 2005 to 2010 and the extreme value samples.

Table 1 Total samples of three main pollutants from 2005 to 2010 in Changsha and extreme value samples

Over the 6 years, 2 189 daily samples were obtained for each of the three pollutants, with two days of data missing. The mean concentrations of PM₁₀, NO₂ and SO₂ are 101.5, 41.7 and 61.9 μg/m³, respectively. Since the standard deviation of NO₂ is the smallest, the variability of NO₂ concentrations is smaller than that of the other two pollutants. When 180, 75 and 150 μg/m³, which are approximately twice the respective mean values, are used as the thresholds for the three pollutants, SO₂ has the smallest extreme value sample, with 79 samples, while PM₁₀ and NO₂ have 143 and 103 samples, respectively. It is worth noting that the thresholds were chosen after many different values had been tested.
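The sample sizes above fix the share of days that enter each extreme value sample. The short Python sketch below works this out from the numbers quoted in this section, assuming the empirical estimate F(u) = (n - N)/n from Section 2.1. The function and dictionary names are hypothetical.

```python
def threshold_summary(mean, threshold, n_exceed, n_total):
    """Ratio of threshold to mean, share of exceedances, and empirical F(u)."""
    fraction = n_exceed / n_total
    return {
        "threshold_over_mean": threshold / mean,
        "exceed_fraction": fraction,
        "F_u": 1.0 - fraction,  # equals (n - N) / n
    }

N_TOTAL = 2189  # daily samples per pollutant, as stated in the text
# pollutant: (mean in ug/m3, threshold in ug/m3, extreme value sample size)
POLLUTANTS = {
    "PM10": (101.5, 180.0, 143),
    "NO2": (41.7, 75.0, 103),
    "SO2": (61.9, 150.0, 79),
}

for name, (mean, threshold, n_exceed) in POLLUTANTS.items():
    s = threshold_summary(mean, threshold, n_exceed, N_TOTAL)
    print(f"{name}: u/mean={s['threshold_over_mean']:.2f}  "
          f"exceedances={100 * s['exceed_fraction']:.1f}%  F(u)={s['F_u']:.4f}")
```

Each threshold keeps only a few percent of the days, which is the trade-off between bias and variance discussed in Section 2.3.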

3.2 Results of fitting

Once the extreme value samples of the three pollutants had been chosen, the GPD was fitted to them with the extremes toolkit in the R language. The fitting results and parameter estimates for the three pollutants are shown in Fig. 1.

Fig. 1 Extreme value distribution fitting of three pollutants: (a) PM₁₀; (b) NO₂; (c) SO₂

On the one hand, the shape parameters of PM₁₀ and SO₂ are zero, while the shape parameter of NO₂ is negative. This directly suggests that PM₁₀ and SO₂ follow the type I exponential distribution with a short tail, while NO₂ follows the type III beta distribution with a bounded tail. In other words, the PM₁₀ and SO₂ concentrations are unbounded and no maximum concentration can be determined for them. In contrast, the NO₂ concentration is bounded above by a maximum value.

On the other hand, the probability density of PM₁₀ and SO₂ falls slowly in the high concentration region and approaches the x axis asymptotically in the tail. For NO₂, a different trend appears: the probability density decreases rapidly over the whole range until it reaches the x axis. These features of the probability density curves of the three pollutants also support the tail properties suggested by their shape parameter values.
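The bound for a negative shape parameter follows from the GPD itself: the term 1 + ξ(x - u)/β reaches zero at x = u - β/ξ, which is therefore the largest possible concentration. The Python sketch below computes this end point, with ξ denoting the shape parameter. The function name is hypothetical, and the scale and shape values in the example are invented for illustration only. They are not the fitted NO₂ parameters, which appear in Fig. 1.

```python
import math

def upper_endpoint(u, beta, xi):
    """Largest concentration allowed by a GPD tail above threshold u."""
    if beta <= 0:
        raise ValueError("scale parameter beta must be positive")
    if xi < 0:
        return u - beta / xi  # finite bound for a type III (bounded) tail
    return math.inf           # types I and II have no upper bound

# Illustrative values only: threshold 75 ug/m3 as used for NO2,
# with a hypothetical scale of 10 and shape of -0.25.
print(upper_endpoint(75.0, 10.0, -0.25))
print(upper_endpoint(180.0, 30.0, 0.0))
```

The closer a negative shape parameter is to zero, the further the bound lies above the threshold, so a weakly negative estimate implies a bound well beyond the observed data.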

3.3 Reliability of fitted distributions

Figure 2 compares the predicted results with the actual data. The scattered points all lie close to the straight line, which indicates that the predicted results match the actual data well for all three pollutants.

To assess the goodness-of-fit more directly, the statistical indicators were also calculated. The results are given in Table 2. The values of E_MA and E_RMS for all the pollutants are very low and the values of I_A and R² are very close to 1, which means that the predicted values agree well with the observed values. Moreover, the error for NO₂ is the lowest, which may result from its having the smallest variability. All the statistical indicators suggest that the GPD is a reliable model for representing high pollutant concentrations.

3.4 Return period and return level

The fitted models are therefore reliable enough to predict the return levels corresponding to different return periods. The predicted results are shown in Fig. 3. Taking a return period of 50 a as an example, the corresponding return levels of PM₁₀, NO₂ and SO₂ are 405, 112 and 333 μg/m³, respectively. This means that concentrations of 405, 112 and 333 μg/m³ for PM₁₀, NO₂ and SO₂, respectively, are expected to be exceeded on average once every 50 a.

As the return period increases, the return levels of PM₁₀ and SO₂ rise linearly, which means that ever higher extreme return levels will appear as time goes by. In other words, there is no upper limit to the intensity of future PM₁₀ or SO₂ extreme pollution events. However, the slope of the NO₂ curve approaches zero when the return period becomes longer than 10 a. This indicates that the concentration level of NO₂ is bounded above: the return level no longer changes appreciably, with a maximum of around 115 μg/m³ for any return period longer than 10 a. Both behaviors are consistent with the distribution types and tail properties identified above.

In summary, for NO₂, the worst pollution event has already been observed. However, citizens are still going to experience even worse pollution events caused by PM₁₀ or SO₂ as time goes by.
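What "once every 50 a" implies in probability terms can be checked with a short Python sketch. It assumes independent daily observations and 365 observations per year, neither of which is stated in this work, and the function name is hypothetical.

```python
def exceedance_probabilities(return_period_years, horizon_years, obs_per_year=365):
    """Chance of exceeding a return level on one day and within a horizon."""
    n_return = return_period_years * obs_per_year  # return period in observations
    p_single = 1.0 / n_return                      # probability for one observation
    n_horizon = horizon_years * obs_per_year
    # Probability of at least one exceedance among independent observations.
    p_horizon = 1.0 - (1.0 - p_single) ** n_horizon
    return p_single, p_horizon

p_day, p_50 = exceedance_probabilities(50, 50)
print(f"daily probability: {p_day:.2e}, within 50 a: {p_50:.3f}")
```

Under these assumptions, a 50 a return level is more likely than not to be exceeded at least once within any given 50 a window, but it is not certain to be.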

Fig. 2 Comparisons of predicted values by extreme value distributions with actual data of three pollutants: (a) PM₁₀; (b) NO₂; (c) SO₂

Table 2 Results of goodness-of-fit criteria in high concentration region for different fitted distributions

Fig. 3 Return levels for different return periods of three pollutants

3.5 Discussion

Many other studies have also investigated the extreme value distributions of pollutants. LU suggested that the exponential form of the GPD can fit the daily average PM₁₀ concentration well [17], which agrees with the results of this work. An extreme value analysis for Munich showed that the daily average NO₂ concentration follows the exponential distribution [18], which differs from the results of this work. Another study, for Istanbul, found that the distributions of the hourly concentrations of SO₂ and NO₂ varied between stations: at the Alibeykoy station they followed the exponential distribution with short tails, but at the Umraniye station they followed the Pareto distribution [19].

Therefore, the distribution may differ from region to region. Since the pollutant concentration is influenced by emission levels, meteorological conditions and geography [20], the variety of extreme pollution distribution types seems reasonable.

4 Conclusions

1) The fitted GPD agrees well with the actual data. The extreme values of PM₁₀ and SO₂ are exponentially distributed with a short tail, while those of NO₂ are beta distributed with a bounded tail.

2) As the return period increases, the return levels of PM₁₀ and SO₂ keep rising, while the return level of NO₂ levels off in the tail, with a maximum concentration of around 115 μg/m³ for any return period longer than 10 a.

3) The worst NO₂ pollution event has already been seen. However, even worse pollution events caused by PM₁₀ or SO₂ are still to be expected as time goes by. Adequate preparation for PM₁₀ and SO₂ pollution should be made in order to minimize the damage.

References

[1] BEGUERIA S, VICENTE-SERRANO S M. Mapping the hazard of extreme rainfall by peaks over threshold extreme value and spatial regression techniques [J]. Journal of Applied Meteorology and Climatology, 2006, 45(1): 108-124.

[2] BELL M L, DAVIS D L. Reassessment of the lethal London fog of 1952: Novel indicators of acute and chronic consequences of acute exposure to air pollution [J]. Environmental Health Perspectives, 2001, 109(S3): 389-394.

[3] MIJIC Z, TASIC M, RAJSIC S, NOVAKOVIC V. The statistical characters of PM₁₀ in Belgrade area [J]. Atmospheric Research, 2009, 92(4): 420-426.

[4] LU H C, FANG G C. Predicting the exceedances of a critical PM₁₀ concentration—A case study in Taiwan [J]. Atmospheric Environment, 2003, 37(25): 3491-3499.

[5] HOROWITZ J, BARAKAT S. Statistical analysis of the maximum concentration of an air pollutant: Effects of autocorrelation and non-stationarity [J]. Atmospheric Environment, 1979, 13(6): 811-818.

[6] SMITH R L. Extreme value analysis of environmental time series: an application to trend detection in ground-level ozone [J]. Statistical Science, 1989, 4(4): 367-393.

[7] MCNEIL A J. Estimating the tails of loss severity distributions using extreme value theory [J]. ASTIN Bulletin, 1997, 27: 117-137.

[8] BALKEMA A A, DE HAAN L. Residual life time at great age [J]. The Annals of Probability, 1974, 2(5): 792-804.

[9] PICKANDS J. Statistical inference using extreme order statistics [J]. The Annals of Statistics, 1975, 3(1): 119-131.

[10] CAIRES S, GROENEWEG J, STERL A. Past and future changes in the sea extreme waves [C] // 31st International Conference on Coastal Engineering. Hamburg, Germany: ICCE Press, 2008.

[11] LU H C. Estimating the emission source reduction of PM₁₀ in Taiwan [J]. Chemosphere, 2004, 54(7): 805-814.

[12] TANCREDI A, ANDERSON C, O’HAGAN A. Accounting for threshold uncertainty in extreme value estimation [J]. Mathematics and Statistics, 2006, 9(2): 87-106.

[13] GILLELAND E, KATZ R W. Analyzing seasonal to inter-annual extreme weather and climate variability with the extremes toolkit [C] // 86th American Meteorological Society Annual Meeting. Atlanta: Georgia Press, 2006: 2-15.

[14] ZHANG J. Likelihood moment estimation for the generalized pareto distribution [J]. Australian and New Zealand Journal of Statistics, 2007, 49(1): 69-77.

[15] ATILLA A, KAAN Y, FERRUH E, ERCAN O. A neural network based approach for the prediction of urban SO₂ concentrations in the Istanbul metropolitan area [J]. Environment and Pollution, 2010, 40(4): 301-321.

[16] KARATZAS K D, PAPADOURAKIS G, KYRIAKIDIS L. Understanding and forecasting air pollution with the aid of artificial Intelligence methods in Athens, Greece [J]. Tools and Applications with Intelligence, 2009, 166: 37-50.

[17] MIJIC Z, TASIC M, RAJSIC S, NOVAKOVIC V. The statistical characters of PM₁₀ in Belgrade area [J]. Atmospheric Research, 2009, 92(4): 420-426.

[18] KUCHENHOFF H, THAMERUS M. Extreme value analysis of Munich air pollution data [J]. Environmental and Ecological Statistics, 1996, 3(2): 127-141.

[19] ERCELEBI S G, TOROS H. Extreme value analysis of Istanbul air pollution data [J]. Clean, 2009, 37(2): 122-131.

[20] FENG J L, HU M, CHAN C K, LAU P S, FANG M, HE L Y, TANG X Y. A comparative study of the organic matter in PM_2.5 from three Chinese megacities in three different climatic zones [J]. Atmospheric Environment, 2006, 40(21): 3983-3994.

(Edited by HE Yun-bin)

Foundation item: Project(51178466) supported by the National Natural Science Foundation of China; Project(200545) supported by the Foundation for the Author of National Excellent Doctoral Dissertation of China; Project(2011JQ006) supported by the Fundamental Research Funds of the Central Universities of China; Project(2008BAJ12B03) supported by the National Key Program of Scientific and Technical Supporting Programs of China

Received date: 2011-07-26; Accepted date: 2011-11-14

Corresponding author: DENG Qi-hong, Professor, PhD; Tel: +86-731-88877175; E-mail: qhdeng@csu.edu.cn
