A guest blog post by Brian Cleland, Ulster University
As part of the MIDAS project, the team at Ulster University has undertaken to examine the potential for using open data to improve policy-making in Northern Ireland’s health sector. Our research has focused primarily on open prescribing data from GP practices as an important information source. In previous analyses, we have combined this prescribing data with open government data from other sources such as the Northern Ireland Statistics and Research Agency (NISRA) and the Northern Ireland Open Data Portal. We have looked at questions such as the relationship between deprivation and antidepressant prescribing, and the possibility of using machine learning techniques to identify prescribing anomalies across the region. Through a review of the literature in this field, it has become apparent that one under-explored opportunity is the application of time series-based approaches to prescribing data.
In data science, a time series is simply a series of data points that are indexed in time order. It is distinct from cross-sectional data, where there is no intrinsic ordering to the observations, and from spatial data, where observations are typically related to geographic features within the dataset. Time series information can include real-valued, continuous or discrete data. There is a range of well-established techniques, including both statistical and machine-learning methods, that can be applied to time series data. In this article, we will examine the application of some common analytic approaches to open prescribing data, to see how the prescribing of different categories of drugs varies over time.
A simple starting point is the box plot, which shows the distribution of the observations in a particular dataset. Displayed in sequence, box plots can reveal trends and seasonality in time series data. In the figures below, we can visualise the distributions for GP prescribing for all drugs in Northern Ireland on a yearly (Figure 1) or monthly (Figure 2) basis. This allows us to see very easily that there is no obvious yearly trend for the 35 months of data that we are examining. In the monthly plot, however, there appears to be some seasonal variation in the data, with overall prescribing levels dipping during the summer months and rising again during the winter months.
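To make the grouping behind a monthly box plot concrete, here is a minimal sketch using only the Python standard library. It is not the code used to produce the figures: the series is synthetic (a made-up winter peak and summer dip over 35 months), and the function names and start year are hypothetical. It groups the observations by calendar month and computes the quartiles that each box would draw.

```python
import math
from collections import defaultdict
from statistics import quantiles


def synthetic_series(n_months=35, start_year=2014):
    """Return [(year, month, value), ...]; illustrative values only, not real data."""
    rows = []
    for i in range(n_months):
        year, month = start_year + i // 12, i % 12 + 1
        seasonal = 10 * math.cos(2 * math.pi * (month - 1) / 12)  # peaks in January
        rows.append((year, month, 100 + seasonal + 0.5 * (i // 12)))
    return rows


def box_stats_by_month(rows):
    """Group values by calendar month and return the statistics a box plot draws."""
    groups = defaultdict(list)
    for _year, month, value in rows:
        groups[month].append(value)
    stats = {}
    for month, values in sorted(groups.items()):
        q1, med, q3 = quantiles(values, n=4, method="inclusive")
        stats[month] = {"min": min(values), "q1": q1, "median": med,
                        "q3": q3, "max": max(values)}
    return stats


monthly_stats = box_stats_by_month(synthetic_series())
for month, s in monthly_stats.items():
    print(f"{month:2d}  median={s['median']:.1f}  IQR=({s['q1']:.1f}, {s['q3']:.1f})")
```

Grouping by year instead of by calendar month would give the yearly view. With only two or three observations per month, as in a 35-month series, the quartiles are rough estimates, so the boxes are best read as indicative.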
Digging deeper into the data, we see that seasonality is more visible in some categories of drug than in others. For example, prescribing for BNF Chapter 5 (drugs for treating infections, including antibiotics) dips sharply over the summer and rises in the wintertime (Figure 3). This corresponds with what we might expect based on the common experience of seasonal chest infections, etc. On the other hand, BNF Chapter 6 (endocrine system drugs, including insulin) exhibits much less seasonality (Figure 4). Given that these drugs are often prescribed for long-term illnesses such as diabetes, this is also unsurprising. If we were to break down the types of drugs to a more granular level, we would get an even clearer picture of how seasonality affects individual prescribing patterns.
Another method of visualising time series data is to use “decomposition”. This is a statistical method for breaking down the time series into different components: the trend component, the seasonal component, and the residual or remainder component. There are a number of popular software libraries for achieving this. In our case, we used the Python-based “statsmodels” library. The results for the entire dataset can be seen in Figure 5. There are two types of decomposition: additive, where the components are summed, and multiplicative, where they are multiplied together. Depending on the dataset these methods may show different results, but as can be seen in Figure 5 this is not always the case. If we look at a specific drug category such as gastro-intestinal treatments (BNF Chapter 1), the decomposition shows a clear increasing trend over the three-year period (Figure 6). On the other hand, there is little evidence of seasonality in this dataset.
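To show what a decomposition computes, the following is a minimal hand-written sketch of classical additive decomposition, using only the Python standard library. It is not the code behind Figures 5 and 6. The series is synthetic, the function name `decompose_additive` is hypothetical, and it assumes monthly data with a 12-month cycle. As far as we are aware, the comparable library routine in statsmodels is `seasonal_decompose`, which should be preferred in practice.

```python
import math


def decompose_additive(values, period=12):
    """Classical additive decomposition: values = trend + seasonal + residual.

    Assumes an even period (12 for monthly data) and at least two full periods.
    """
    n = len(values)
    if period % 2 or n < 2 * period:
        raise ValueError("need an even period and at least two full periods of data")
    half = period // 2

    # Trend: centred moving average over period + 1 points, end points half-weighted.
    trend = [None] * n
    for t in range(half, n - half):
        window = values[t - half : t + half + 1]
        trend[t] = (sum(window) - 0.5 * (window[0] + window[-1])) / period

    # Seasonal: average detrended value for each position in the cycle, centred on zero.
    buckets = [[] for _ in range(period)]
    for t in range(half, n - half):
        buckets[t % period].append(values[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period
    seasonal = [means[t % period] - offset for t in range(n)]

    # Residual: whatever trend and seasonality leave unexplained.
    residual = [None if trend[t] is None else values[t] - trend[t] - seasonal[t]
                for t in range(n)]
    return trend, seasonal, residual


# Synthetic 35-month series (illustrative only): rising trend plus a yearly cycle.
series = [100 + 0.5 * t + 10 * math.cos(2 * math.pi * t / 12) for t in range(35)]
trend, seasonal, residual = decompose_additive(series)
print("trend at month 12:", round(trend[12], 2))
print("seasonal pattern:", [round(s, 1) for s in seasonal[:12]])
```

The trend and residual are undefined for the first and last six months because the moving average needs a full window on either side. A multiplicative version would divide by the trend and seasonal components rather than subtract them.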
Finally, another way to view repeating patterns in time series data is to use a method called “autocorrelation”. This approach, as the name suggests, measures the correlation of a dataset with a lagged copy of itself over a range of time lags. The assumption is that repeating patterns in the data should show as spikes or peaks in the resulting graph. To illustrate the concept, we show autocorrelation charts for BNF Chapter 5 (Infections) and BNF Chapter 6 (Endocrine System) in Figures 7 and 8 below. As might be expected, the autocorrelation chart for Infections shows clear signs of a 12-month cycle. On the other hand, Endocrine prescribing shows little sign of repeating patterns in the data.
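The calculation behind an autocorrelation chart can be sketched in a few lines of standard-library Python. This is an illustration rather than the code used for Figures 7 and 8: the series is synthetic, with an assumed 12-month cycle, and the function name is hypothetical. It uses the usual sample estimator, in which the lagged covariance is divided by the total sum of squared deviations.

```python
import math


def autocorrelation(values, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    variance_sum = sum(d * d for d in deviations)
    acf = []
    for lag in range(max_lag + 1):
        # Products of each observation with the one `lag` steps later.
        cov = sum(deviations[t] * deviations[t + lag] for t in range(n - lag))
        acf.append(cov / variance_sum)
    return acf


# Synthetic 35-month series with a yearly cycle (illustrative only).
seasonal_series = [100 + 10 * math.cos(2 * math.pi * t / 12) for t in range(35)]
acf = autocorrelation(seasonal_series, max_lag=24)
for lag in (0, 6, 12, 18, 24):
    print(f"lag {lag:2d}: {acf[lag]:+.2f}")
```

For a series with a yearly cycle, the values are positive at lags 12 and 24 and negative at lags 6 and 18. The peaks shrink as the lag grows because fewer pairs of observations overlap, which matters for a series as short as 35 months.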
As can be seen from the above examples, there are a number of ways that time series based on open prescribing datasets can be visualised and analysed. We have summarised only a few common statistical methods here, and there is much scope for further investigation using more advanced machine learning techniques such as clustering, classification and automated anomaly detection. There is also the potential for linking this data to other relevant information sources such as demographic and spatial datasets. As we continue our exploration of this data in the coming months, we will post further updates on our findings on this blog.