Applying machine learning algorithms to estimate PM2.5 using satellite data and meteorological data



Introduction
Air pollution is a serious environmental problem that affects human health and the environment worldwide. PM2.5 refers to fine particulate matter with a diameter of 2.5 micrometers or smaller. These tiny particles are suspended in the air and come from a variety of sources, such as construction, industrial work, forest fires, and dust. Exposure to PM2.5 has been linked to a range of health problems, particularly respiratory and cardiovascular issues. Prolonged exposure to high levels of PM2.5 can cause or worsen respiratory diseases such as asthma, bronchitis, and chronic obstructive pulmonary disease (COPD). It may also increase the risk of heart attacks, strokes, birth defects, and premature death [1].
One of the most common and accurate methods for monitoring air quality is through air quality monitoring stations. However, measurements are only available in the area surrounding each station [2]. Air quality monitoring yields pollutant concentration data that indicate whether concentration levels are good, unhealthy for sensitive groups, or at emergency levels. Because of the high personnel, infrastructure, and financial demands of establishing, operating, and maintaining them, these stations are sparsely distributed [3]. Air quality monitoring is therefore also conducted with geospatial and remote sensing techniques that gather air quality data over large areas. Inferring PM2.5 from satellite remote sensing data is attractive because satellites offer wide coverage, synchronized observation of large areas within a short time, quick access to near-real-time global observations of many natural phenomena, and efficient PM2.5 estimation [4][5]. Air quality data are either obtained directly from sensors on various satellite platforms or derived from satellite images using regression analysis or dispersion models.
As computer technology advances, machine learning techniques are being used extensively to predict PM2.5 concentrations. Classical machine learning methods such as decision trees, random forests, support vector machines, and artificial neural networks have demonstrated good predictive ability for PM2.5 concentrations [6][7]. To estimate PM2.5 concentrations, machine learning methods often combine several data sources, such as meteorological, air pollution, spatiotemporal, land use, and satellite remote sensing information [8].
The level of particulate air pollution in Nepal frequently ranks among the worst in the world. Nepal ranked 177th out of 180 nations in 2016 and 162nd in the 2022 Environmental Performance Index (EPI) for air pollution [9]. Air quality monitoring in Nepal relies on ground monitoring stations regulated by the Department of Environment under the Ministry of Forests and Environment. According to the Department of Environment of Nepal, 27 air quality monitoring stations measure the following parameters: PM1, PM2.5, PM10, and Total Suspended Particulates (TSP) [10]. These ground stations have limited coverage, cannot capture spatial variability, and depend on expensive and complex equipment. So few stations are not sufficient for measuring air quality throughout the country. By contrast, estimating air pollution concentrations from satellite remote sensing has the benefit of being highly effective and inexpensive [11].

Related works
Based on the data source, the related works can be classified into three groups. The methods in the first group use historical PM readings from ground-based air quality monitoring stations for PM estimation. Several machine learning algorithms, including linear regression, K-neighbors, decision tree, random forest (RF), gradient boosting, convolutional neural networks (CNN), and long short-term memory (LSTM) networks, were used to estimate the PM value for the current day or future days [12][13]. In addition to the PM measurements from the available stations, the second group uses satellite-derived data such as aerosol optical depth (AOD). Several of these studies have also included meteorological data (temperature, humidity, wind speed, etc.) [14][15]. In the third group, the satellite images are used directly instead of the satellite-derived products [16]. However, only a few studies have used air pollutant concentrations (SO2, CO, NO2, and O3) for PM2.5 computations [17]. The proposed study is one of the first in Nepal to directly estimate PM2.5 concentrations using Sentinel-5P pollution data.
One of the worst affected areas in Nepal is the Kathmandu Valley, located in the mid-hills of Nepal at a latitude of 27.7°N and a longitude of 85.3°E. Most studies of air pollution in the Kathmandu Valley are based on air pollution stations and satellite-derived products (AOD) [18]. The use of AOD and other atmospheric products has some difficulties, such as computational burden and uncertainties due to aerosol model selection and cloud screening schemes. AOD products have coarse spatial resolutions, which makes them good candidates for monitoring the global distribution of aerosols at large scales and with wide coverage. However, AOD products are not well suited to estimating PM2.5 concentrations over smaller areas. The purpose of this study is to develop simple machine learning models with low computational cost for the estimation of PM2.5 concentration. It uses free Sentinel-5P pollution data to estimate PM in small areas, such as cities.
The proposed study offers several key contributions. Since there are few air quality monitoring stations in Nepal and most of them are not operating properly, this study will help enhance the analysis of the near-ground PM2.5 pollution situation. It will also help estimate ground-level PM2.5 from satellite images where no ground-level AQM stations are available. Rather than using AOD products, Sentinel-5P air pollution data are used directly. We compared the effectiveness of various machine learning models on Sentinel-5P datasets, both with and without meteorological data.

Study Area
For this study, we selected Kathmandu, Nepal, as our study area. There are seven air quality monitoring stations located around the Kathmandu Valley: in Dhulikhel, Ratnapark, Shankhapark, Bhaisepati, Pulchowk, Bhaktapur, and Kirtipur. The research locations and PM2.5 monitoring sites in Kathmandu are shown in Figure 1.

Datasets
The dataset used in this study includes air pollution data from the Sentinel-5P satellite, PM2.5 data from ground stations, and meteorological factors. Two datasets were created based on ground stations, as shown in Table 2. Dataset 1 contains Ratnapark station data, while Dataset 2 combines data from three stations (Ratnapark, Shankhapark, and Pulchowk) [19]. AQM data from the Ratnapark, Shankhapark, and Pulchowk stations were used in this study. Table 2 summarizes the AQM station data and the total amount of data used, excluding missing values.

Meteorological data
To check the performance and effectiveness of PM2.5 estimation with and without meteorological datasets, minimum and maximum values of meteorological variables, namely air temperature, relative humidity (RH), and wind speed (WS), were used. These datasets were obtained from the Department of Hydrology and Meteorology, Babarmahal, Kathmandu, Nepal, after completing the payment procedure for the meteorological data request [20].

Satellite data
Sentinel-5P is the first Copernicus mission dedicated to monitoring the atmosphere. It can be used to detect atmospheric constituents such as NO2, SO2, CH4, HCHO, CO, and O3, as well as the aerosol index (AI). Google Earth Engine (GEE) has been used by several researchers for air pollutant (AP) retrieval [21][22]. GEE is a cloud-based platform developed by Google for planetary-scale analysis of environmental data. It provides a large collection of satellite imagery, geospatial datasets, and computational capabilities that help to visualize, analyze, and process remote sensing data for a wide range of applications. Therefore, AP retrieval from the Sentinel-5P satellite was done in this study using GEE.

Methodology
Data on NO2, SO2, CH4, HCHO, AI, CO, and O3 levels from Sentinel-5P were retrieved using GEE, which provides a strong toolkit for accessing and processing satellite imagery. To begin the extraction process, a region of interest (ROI) was defined; this is the geographic area from which the data were gathered. The Sentinel-5P dataset was filtered by the specified ROI, date range, and desired pollutants. The selected bands were then aggregated by calculating the mean value of each band across all available images within the specified time period. Additionally, the dataset was clipped to the ROI to retain only the data relevant to the study area. Finally, the processed data were exported in CSV format. The data gathered from Sentinel-5P, AQM stations, and meteorological sources were then preprocessed, as shown in Figure 2.
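The extraction steps above (filter by ROI, dates, and band; take the mean; clip to the ROI) can be sketched with the Earth Engine Python API. This is only an illustration under stated assumptions: the catalog IDs and band names are those of the public Sentinel-5P offline Level-3 products in GEE, but the paper does not say which products or bands were used; the helper names `S5P_PRODUCTS`, `daily_windows`, and `roi_mean` are hypothetical; and the final `reduceRegion` step, which collapses the clipped image to one number per band, is an assumed way of producing a CSV-ready value.

```python
from datetime import date, timedelta

# Earth Engine catalog IDs and band names for Sentinel-5P offline Level-3 products.
# Assumption: the paper does not state which products or bands were actually used.
S5P_PRODUCTS = {
    "NO2": ("COPERNICUS/S5P/OFFL/L3_NO2", "NO2_column_number_density"),
    "SO2": ("COPERNICUS/S5P/OFFL/L3_SO2", "SO2_column_number_density"),
    "CO": ("COPERNICUS/S5P/OFFL/L3_CO", "CO_column_number_density"),
    "O3": ("COPERNICUS/S5P/OFFL/L3_O3", "O3_column_number_density"),
    "HCHO": ("COPERNICUS/S5P/OFFL/L3_HCHO", "tropospheric_HCHO_column_number_density"),
    "CH4": ("COPERNICUS/S5P/OFFL/L3_CH4", "CH4_column_volume_mixing_ratio_dry_air"),
    "AI": ("COPERNICUS/S5P/OFFL/L3_AER_AI", "absorbing_aerosol_index"),
}


def daily_windows(start, end):
    """Yield (day, next_day) ISO date strings for each day in [start, end)."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        yield day.isoformat(), next_day.isoformat()
        day = next_day


def roi_mean(pollutant, roi_bounds, start, end):
    """Mean of one pollutant band over the ROI and the [start, end) date window.

    roi_bounds is [west, south, east, north] in degrees (hypothetical rectangle ROI).
    Requires the earthengine-api package and a prior ee.Initialize() call.
    May fail if no image exists in the window, since the band is then missing.
    """
    import ee  # imported lazily so the helpers above work without Earth Engine

    collection_id, band = S5P_PRODUCTS[pollutant]
    roi = ee.Geometry.Rectangle(roi_bounds)
    image = (
        ee.ImageCollection(collection_id)
        .filterBounds(roi)        # keep images intersecting the ROI
        .filterDate(start, end)   # keep images in the date window
        .select(band)             # keep the desired pollutant band
        .mean()                   # per-pixel mean across all remaining images
        .clip(roi)                # retain only the study area
    )
    # Assumed final step: average the clipped pixels into a single value.
    stats = image.reduceRegion(reducer=ee.Reducer.mean(), geometry=roi, scale=1113.2)
    return stats.get(band).getInfo()
```

Calling `roi_mean` once per pollutant and per window from `daily_windows` would give one row per day, which could then be written out as CSV; the ROI rectangle passed in is up to the user and is not taken from the paper.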

Data preprocessing
To enable temporal analysis and modeling, the values in the date column of the gathered datasets were transformed into a standard date format that included the day, month, and year. Missing values in the datasets were filled by linear interpolation to ensure continuity in the data [23]. This approach addresses missing data while preserving the temporal and spatial integrity of the datasets.
In this study, we used four different algorithms: random forest (RF), support vector machine (SVM), XGBoost, and k-nearest neighbors (KNN). RF is an ensemble learning technique, a family of methods that improve prediction performance by combining several models. For regression tasks, RF builds a large number of decision trees during training and outputs the average of the individual trees' predictions [24]. SVM works by finding the optimal hyperplane that best separates the data points into different classes or, for regression, best predicts continuous values. It accomplishes this by maximizing the margin between the hyperplane and the nearest data points (support vectors) [25]. XGBoost also belongs to the ensemble learning category and is based on the gradient boosting framework. It combines the predictions of multiple weak models to create a stronger, more accurate model. XGBoost employs a technique called tree boosting, in which decision trees are added one at a time and their predictions are combined with those of the previously added trees [26]. KNN is based on the principle that similar data points tend to have similar target values. In KNN, the prediction for a new data point is the majority class (for classification) or the average target value (for regression) of its k nearest neighbors in the feature space [27].
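The two preprocessing steps described at the start of this section, date standardization and linear interpolation of missing values, can be illustrated with a small standard-library sketch. The input date format (`%d/%m/%Y`), the ISO output format, and the function names are assumptions made for illustration; the paper does not specify them.

```python
from datetime import datetime


def standardize_date(text, input_format="%d/%m/%Y"):
    """Parse a date string and return it as ISO 'YYYY-MM-DD'.

    The input format is a hypothetical example, not the format used in the paper.
    """
    return datetime.strptime(text, input_format).date().isoformat()


def interpolate_linear(values):
    """Fill None gaps that lie between two known values by linear interpolation.

    Observations are assumed to be equally spaced in time (e.g. one per day).
    Leading and trailing gaps have no neighbour on one side and are left as None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            weight = (i - left) / span
            filled[i] = filled[left] + weight * (filled[right] - filled[left])
    return filled
```

For example, `interpolate_linear([10.0, None, 30.0])` fills the middle value with 20.0. The longer the run of missing values, the more the filled values depend on just two observations. With pandas, `Series.interpolate(method="linear")` performs a comparable fill on equally spaced data.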

Hyperparameter Tuning
Hyperparameters play an important role in determining the complexity, flexibility, and generalization ability of the model. In this study, we used grid search cross-validation (CV) for hyperparameter tuning [28]. It systematically searches through a specified parameter grid and evaluates each combination of hyperparameters to identify the optimal configuration. The main strengths of grid search CV are its exhaustive search and its simplicity.
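To make the procedure concrete, the following is a minimal standard-library sketch of grid search with k-fold cross-validation, applied to a KNN regressor. It is not the implementation used in the study: the fold construction, the choice of cross-validated RMSE as the selection score, and all function names are assumptions made for illustration.

```python
import itertools
import math


def knn_predict(train_X, train_y, x, k):
    """Predict the target for x as the mean target of its k nearest training points."""
    by_distance = sorted((math.dist(row, x), target) for row, target in zip(train_X, train_y))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))


def k_fold_indices(n_samples, n_folds):
    """Split sample indices into n_folds interleaved folds (deterministic, no shuffling)."""
    indices = list(range(n_samples))
    return [indices[i::n_folds] for i in range(n_folds)]


def grid_search_cv(X, y, param_grid, n_folds=5):
    """Try every hyperparameter combination and return the one with the lowest mean CV RMSE."""
    names = sorted(param_grid)
    folds = k_fold_indices(len(X), n_folds)
    best_params, best_score = None, math.inf
    for combo in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        fold_scores = []
        for held_out in folds:
            held = set(held_out)
            train_X = [X[i] for i in range(len(X)) if i not in held]
            train_y = [y[i] for i in range(len(y)) if i not in held]
            preds = [knn_predict(train_X, train_y, X[i], **params) for i in held_out]
            fold_scores.append(rmse([y[i] for i in held_out], preds))
        score = sum(fold_scores) / len(fold_scores)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score
```

A call such as `grid_search_cv(X, y, {"k": [1, 2, 5]})` evaluates each candidate `k` on every fold. The cost grows with the product of the grid sizes, which is the price paid for the exhaustive search.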

Results and Discussion
This study used the coefficient of determination (R²) and the root mean square error (RMSE) as evaluation metrics. The dataset was divided into two parts, with 80% allocated for model training and 20% reserved for testing. In this section, the PM2.5 modeling performances of RF, XGBoost, KNN, and SVM are compared. The performance of these models on Dataset 1 (Ratnapark station) is presented in Table 4. Model performance was first evaluated on Sentinel-5P data only and then on Sentinel-5P data combined with meteorological data.
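For reference, the two metrics are assumed here to follow their standard definitions, which the paper does not write out. In the notation below (introduced for illustration), y_i is the observed PM2.5 value, ŷ_i the model estimate, ȳ the mean of the observed values, and n the number of test samples:

```latex
\[
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
\]
```

RMSE is expressed in the same units as the PM2.5 measurements, so lower is better, while R² approaches 1 as the estimates explain more of the variance in the observations.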
The results reveal that RF and XGBoost performed better than SVM and KNN. In both cases, RF performed slightly better than the others. With Sentinel-5P data only, RF obtained an R² of 0.82 and an RMSE of 13.17, whereas with meteorological data added to the Sentinel-5P AP data, R² and RMSE were 0.75 and 15.60, respectively. On Dataset 1, SVM performed worst, with the lowest R² values of 0.62 and 0.67 and the highest RMSE values of 19.38 and 18.07. The addition of meteorological data to Dataset 1 therefore appears to have a significant effect on model performance.
To assess model performance on data from different locations, Dataset 2 was prepared by combining data from three stations (Ratnapark, Shankhapark, and Pulchowk). The models' performance was evaluated on Sentinel-5P air pollution data with and without meteorological data. The performance of the models on Dataset 2 is shown in Table 5. With Sentinel-5P data only, RF showed the lowest RMSE of 11.68, and after including the meteorological data, RF again showed the lowest RMSE of 11.36, indicating better accuracy in predicting PM2.5 concentrations. In Dataset 2, both with and without meteorological data, RF outperformed the other models.
The Table 4 results show that when meteorological data were added to Dataset 1, model performance declined slightly. We used only RH, wind speed, and temperature because other meteorological data, such as wind direction, cloud cover, precipitation, and air pressure, are not available from DOHM, Nepal. The exclusion of these variables might have affected the models' performance. Linear interpolation was used to handle the missing values; it may introduce some uncertainty or error in the data, especially where there are large gaps between observations. In the combined dataset, the larger number of actual values available may help mitigate the impact of missing data. Additionally, having data from multiple stations may provide more robust and representative information about the underlying patterns and relationships in the data [29].
When the dataset was enlarged by combining data from three AQM stations (Ratnapark, Shankhapark, and Pulchowk) in Dataset 2, the results in Table 5 show that model performance improved after including the meteorological data. The average changes in R² and RMSE across all algorithms are +0.032 and -0.77, respectively, indicating an overall improvement in model performance on Dataset 2 compared to Dataset 1. The size of the dataset thus had a great impact on model performance.
Using the AQI breakpoint table presented in Table 6, a further analysis was carried out to check whether the Air Quality Index (AQI) level derived from the estimated PM2.5 values matches the level derived from the actual values. The results in Table 7 show that our model can also be used to determine the AQI level: the predicted PM2.5 value falls in the same AQI category as the actual PM2.5 value.
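The AQI-level comparison can be sketched as a simple breakpoint lookup. Table 6 is not reproduced here, so the breakpoints below are an assumption: they follow the US EPA 24-hour PM2.5 categories in use before the 2024 revision and may differ from the table used in the study. The function names are hypothetical.

```python
import math

# Upper bound of each PM2.5 category in micrograms per cubic metre.
# Assumption: US EPA breakpoints (pre-2024 revision); Table 6 in the paper may differ.
PM25_BREAKPOINTS = [
    (12.0, "Good"),
    (35.4, "Moderate"),
    (55.4, "Unhealthy for Sensitive Groups"),
    (150.4, "Unhealthy"),
    (250.4, "Very Unhealthy"),
    (math.inf, "Hazardous"),
]


def aqi_category(pm25):
    """Return the AQI category label for a 24-hour mean PM2.5 concentration."""
    if pm25 < 0:
        raise ValueError("PM2.5 concentration cannot be negative")
    for upper_bound, label in PM25_BREAKPOINTS:
        if pm25 <= upper_bound:
            return label


def same_category(actual_pm25, predicted_pm25):
    """True if the actual and predicted concentrations fall in the same AQI category."""
    return aqi_category(actual_pm25) == aqi_category(predicted_pm25)
```

Under these assumed breakpoints, an actual value of 48.2 and a predicted value of 52.9 both map to "Unhealthy for Sensitive Groups", so a sizeable numeric error can still yield the correct AQI level, while a small error near a breakpoint can change it.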

Limitations of the Study
One limitation of the study is that Sentinel-5P data are available in Google Earth Engine (GEE) only from 2018 onward. Additionally, data from air quality monitoring (AQM) stations showed inconsistencies due to device malfunctions. Consequently, the models were trained on limited datasets, which may not provide sufficient information for robust machine learning (ML) model training. Future work should focus on enlarging the datasets and adding other meteorological data and relevant features. Evaluating deep learning algorithms could also be a promising direction for further work.

Conclusion
This study illustrated the use of Sentinel-5P air pollution data and meteorological data for the estimation of PM2.5 using machine learning techniques. Our findings confirm that air pollution data obtained from Sentinel-5P can be used for the estimation of PM2.5. Combining satellite data with a cloud platform like GEE is a cost-effective and efficient approach to AP retrieval. Of the four machine learning algorithms used, RF performed best, with an R² of 0.81 and an RMSE of 11.36, while SVM performed worst in all scenarios. This study also found that the addition of meteorological data had a significant impact on model performance: on the larger combined dataset (Dataset 2) performance improved after adding the meteorological data, whereas on the smaller Dataset 1 it declined slightly. This study concludes that adequately trained machine learning models, given sufficient data, hold promise for accurately estimating PM2.5 levels at local scales. The integration of Sentinel-5P air pollution data and meteorological data presents an economically feasible solution, particularly in regions such as Nepal, where the establishment and maintenance costs of traditional air quality monitoring stations are high.

Figure 1: AQM stations used for this research

Figure 2: Methodology used for the estimation of PM2.5

Ground station AQM data
Daily average PM2.5 data were collected from the Department of Environment office located at Forest Complex, Babarmahal, Kathmandu.

Table 2: Summary of AQM station data

Table 3: Best hyperparameters for Dataset 1 and Dataset 2

Table 5: Model performance on Dataset 2 with and without meteorological data

Table 7: AQI level analysis based on predicted PM2.5