Nobody likes to swim in sewage-polluted waters. On top of being smelly and unpleasant, raw sewage contains a mixture of bacteria, viruses, harmful chemicals and microplastics, which introduces a variety of ecological and environmental problems. Yet the UK faces a significant sewage pollution problem. The Environment Agency (EA) recorded 301,091 sewage spill incidents in 2022 [1], which averages to over 800 spills a day! While some of these spills help ensure sewage treatment plants are not overwhelmed during heavy rain, water companies have recently been suspected of illegally dumping sewage even when there is no rain, to avoid spending money on improving their plants [2]. This is an issue that has, rightfully, received extensive press coverage recently.
Monitoring and measuring sewage pollution is thus key to ensuring our oceans are clean and safe. Current methods rely on remote monitors or manual observations, but these face significant limitations: a large number of monitors and people are needed to achieve good coverage across both space and time, which quickly becomes expensive and infeasible.
Any ideas on what could be a much less costly solution? Data from satellite images! (No prizes for getting this right.) In theory, satellite data has many further advantages, such as greater coverage, better scalability and (potentially) more timely warnings. Bird's-eye photography and footage of sewage spills [3, 4] give us hope that sewage pollution can be seen from space, and will thus be reflected in satellite data.
However, satellite data is often very noisy and influenced by other factors, most notably moving clouds. Furthermore, at the moment it can only be obtained at certain time intervals, such as once per day, which might not align with the times when sewage gets pumped into the sea. This became the main question our team at LSE [5] hoped to answer for our master's capstone project in collaboration with Marla: can we observe and predict sewage spills with satellite data?
Our Approach
To answer the question, we built several supervised machine-learning models, then compared and evaluated them in a range of settings. Supervised machine-learning models are models trained on datasets with known, labelled outputs. In our case, we trained the models by giving them satellite data from the Copernicus Marine Service [6] across a set of dates, and telling them whether there was sewage pollution on those dates based on EA water quality data [7] at selected bathing water sites.
The satellite data used are mainly oceanographic measures such as water transparency, turbidity, suspended particulate matter and plankton concentration. After training, the models take satellite data as input and output the probability of an incident happening. This probability is converted into a simple yes/no prediction: if it is larger than 0.5, we predict that there is sewage pollution.
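The training-and-thresholding step can be sketched in a few lines of scikit-learn. Everything here is illustrative: the feature matrix is synthetic stand-in data, not our real Copernicus features, and the labels are made up rather than drawn from EA records.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for per-site, per-date satellite features
# (e.g. turbidity, suspended particulate matter, chlorophyll-a).
X = rng.normal(size=(200, 3))
# Made-up labels: 1 = sewage pollution recorded, 0 = none.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 1).astype(int)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# Probability of pollution for some observations, thresholded at 0.5
# to produce the yes/no predictions described above.
proba = model.predict_proba(X[:5])[:, 1]
predictions = (proba > 0.5).astype(int)
```

The 0.5 cut-off is the simplest choice; in practice the threshold itself can be tuned, which matters on imbalanced data like ours.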
Visualising Satellite Data
The diagram below illustrates what the satellite data we collected looks like. The axes of the graph represent latitude and longitude, and the red dot marks a point on the map for which we know the pollution status on a specific date. The squares on the graph are coloured according to the estimated concentration of chlorophyll-a (CHL), as measured by satellite in this case. Black squares are points on land, where oceanographic measures do not apply.
The first two sample graphs show satellite data when there was sewage pollution, and the latter two show the opposite. Can you spot any patterns that distinguish the polluted graphs from the unpolluted ones? It seems that graphs with sewage pollution have slightly lighter colours (larger values) along the coastline, where black squares meet coloured squares. A well-trained machine-learning model would probably do better than we can at picking up such subtle differences.
Graphs in the top row are examples of observed CHL values when sewage pollution incidents were recorded, and the bottom row shows observed CHL values when no sewage pollution incidents were recorded, all from satellite data on 27.09.2021.
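In code, each of these graphs is just a small two-dimensional grid of CHL values, with land squares marked as missing. A minimal sketch (all values and the land mask below are made up, not real Copernicus data):

```python
import numpy as np

# A tiny 4x4 grid of CHL values around a site; rows step through
# latitude, columns through longitude. NaN marks land squares,
# where oceanographic measures do not apply.
chl = np.array([
    [0.8, 0.9, 1.2, 1.1],
    [0.7, 1.0, 1.3, 1.4],
    [np.nan, 0.9, 1.1, 1.2],
    [np.nan, np.nan, 1.0, 1.1],
])

land = np.isnan(chl)            # boolean land mask (the black squares)
sea_mean = np.nanmean(chl)      # average CHL over sea squares only
```

Keeping land as NaN rather than zero matters: a zero would look like genuinely low chlorophyll to a model, while NaN can be masked out or imputed explicitly.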
Results and Takeaways
We found somewhat promising results, with our best model achieving 14 times the accuracy of a baseline model that makes random guesses [8]. Comparing and contrasting our models gave us more insight into how to build better models for sewage pollution prediction. Our key takeaways are:
Simpler models may perform better when there is limited data. Our more interpretable models with a simpler training process (random forests) outperformed models with a longer, more complex training process (neural networks). We believe this is because we did not have enough data to let the more complex models reach their full potential.
Some models performed better on raw data, while others benefitted from feature aggregation. We saw somewhat promising results from a convolutional neural network that takes our data as an image, like in the diagram above. The random forest model, however, performed best when we aggregated each two-dimensional image into a set of summary figures and used those as inputs instead. This matched our expectations based on our intuitions about how the models function: convolutional networks are built to exploit spatial structure, while random forests expect a flat table of features.
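The aggregation step can be sketched as collapsing each grid into a handful of summary numbers. The particular statistics here are illustrative, not the exact features we used:

```python
import numpy as np

def aggregate_grid(grid):
    """Collapse a 2D satellite grid into summary features suitable
    for a tabular model such as a random forest. NaN marks land
    squares and is excluded. The chosen statistics are illustrative."""
    sea = grid[~np.isnan(grid)]
    return {
        "mean": float(sea.mean()),
        "std": float(sea.std()),
        "min": float(sea.min()),
        "max": float(sea.max()),
    }

grid = np.array([[0.8, 1.2], [np.nan, 1.0]])
features = aggregate_grid(grid)
```

Each date-and-site pair then contributes one row of such features (per oceanographic variable) to the random forest's training table, while the CNN sees the untouched grid.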
All models performed better on certain subsets of data than others. For example, all models were better at predicting sewage pollution status for August and September than for May, June and July. We believe this is because a larger proportion of data points across August and September had a positive label (where sewage pollution occurred!). This highlights the challenges of building a model on an imbalanced dataset.
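Imbalance is also why plain accuracy is a misleading score here (see note [8]). A toy example with made-up labels shows the problem: a useless model that always predicts "no pollution" scores high accuracy but zero F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced labels: 1 polluted day out of 10 (made up).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]
# A model that always predicts "no pollution".
y_pred = [0] * 10

acc = accuracy_score(y_true, y_pred)              # high, but meaningless
f1 = f1_score(y_true, y_pred, zero_division=0)    # zero: no spill is ever caught
```

F1 balances precision (how many predicted spills were real) against recall (how many real spills were caught), so the always-negative model cannot hide behind the majority class.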
Next Steps
So what does this all mean for sewage pollution predictions? There is potential for using satellite data to monitor the health of our oceans, and we believe there is room to further improve our models. These improvements include running models on more historical data (we used 3 years of data for our project) and including more signals to control for changes in oceanographic values that may not be caused by sewage pollution. Improvements can also be made by using more advanced algorithms to better estimate missing data caused by cloud coverage. Over the next few months, our team at Marla aims to implement some of these improvements and make them available for our clients. Watch out for further updates from us soon!
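As a taste of what filling in cloud-obscured values involves, here is a deliberately naive gap-filling sketch: each missing square is replaced by the mean of its valid neighbours. This assumes NaN marks cloud cover (with land handled separately); the more advanced spatio-temporal approaches alluded to above would do far better.

```python
import numpy as np

def fill_missing(grid):
    """Naive gap-filling: replace each NaN square with the mean of
    its valid up/down/left/right neighbours, if any. Illustrative
    only; real cloud-gap filling would use richer interpolation."""
    filled = grid.copy()
    rows, cols = grid.shape
    for i in range(rows):
        for j in range(cols):
            if np.isnan(grid[i, j]):
                neighbours = [
                    grid[r, c]
                    for r, c in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= r < rows and 0 <= c < cols and not np.isnan(grid[r, c])
                ]
                if neighbours:
                    filled[i, j] = np.mean(neighbours)
    return filled

grid = np.array([[1.0, np.nan], [3.0, 5.0]])
result = fill_missing(grid)
```

Because it only looks at a single snapshot, this sketch ignores the temporal dimension; using the same square on neighbouring days is one of the improvements we have in mind.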
Notes and References
[3] https://www.youtube.com/watch?v=hTwjUh92j00 (Guardian News)
[4] https://www.youtube.com/watch?v=GT-NC9r-q5U (The Telegraph)
[5] The team at LSE consisted of Sally Lai, Ziyue Lu and Yuhan Wang.
[6] Primarily https://data.marine.copernicus.eu/product/OCEANCOLOUR_ATL_BGC_L3_MY_009_113/description
[8] F1 score was used as a proxy for accuracy to better measure performance, since our dataset is highly imbalanced (it contains many more sewage-pollution-free days).