Forecasting Unemployment in Greece

Introduction

Background

Unemployment has been a chronic problem in our country: over the last 25 years it has consistently been higher than the average of both the European Union and the OECD countries. The phenomenon worsened during the years of the economic crisis, when at its worst one quarter of the workforce was unable to find employment. The situation was even worse for young people, with youth unemployment reaching 46%, the worst performance in the EU.

Before proceeding with the analysis, I think it is important to clarify what exactly the term “unemployed” means. The definition is stricter than it appears, as it is not enough for someone simply not to be working. To be classified as unemployed, a person must simultaneously meet three criteria: not be working, be available for work, and be actively seeking it. Students, retirees, and anyone not participating in the labour force are thus excluded from the calculation. The unemployment rate is derived as follows:

\text{Unemployment rate} = \frac{\text{Number of unemployed}}{\text{Labour force}}

Of course, calculating the number of unemployed is a fairly complex process for various reasons, such as the existence of underemployment or seasonal work. There are two dominant methodologies for recording unemployment. The first is based on administrative data, that is, the number of registered unemployed at DYPA (formerly OAED). This is essentially a count of those receiving or applying for unemployment benefits. The problem is that the eligibility criteria are strict:

Dismissal, not resignation, from the last employment
125 days of work in the last 14 months (the last two months are not counted)
No more than 400 days of benefit received within a four-year period

With these criteria, registered unemployment significantly underestimates the actual figure, as it excludes those who resigned, those who do not meet the minimum days of work requirement, and those working in undeclared or temporary arrangements. The second and more reliable methodology is the Labour Force Survey (LFS). This is a sample survey conducted by national statistical agencies (in Greece, ELSTAT) based on the International Labour Organization (ILO) definition of unemployment: a person is considered unemployed if they are not working, are actively seeking work, and are available to take up employment within two weeks. This methodology is also used by Eurostat and the OECD, making the data comparable across countries.

Summary Answer

In this article, my goal is to forecast the trajectory of unemployment over the coming months. I therefore obtained some historical data on unemployment in the EU, the OECD, and our country. I will use a simple (S)ARIMA model to make an estimate of its magnitude over the coming months. The data I am using covers the period from 1998 to 2022. If you want a quick answer, in this particular analysis I forecast that the downward trend in unemployment is expected to continue over the coming months. In February 2023, it is expected to range between 10% and 13%.

Prerequisites

For this analysis I used standard libraries for data import and processing (readr, dplyr), the kableExtra and gt packages for table formatting, and the highcharter library for interactive charts. For time series analysis the following packages were used: lubridate, tseries, forecast, tsibble, feasts, fable, strucchange, and urca. The ACF and PACF diagrams as well as the unemployment trend forecasts were built with Highcharter.

Data Structure

The dataset contains unemployment data for various countries or country entities such as the European Union and the OECD countries (Organisation for Economic Co-operation and Development), which allows us to compare unemployment in Greece with that of the other developed economies.

The dataset preview (first 5 rows):

Highest unemployment rates

(a) Greece
Month	Unemployment rate (%)
2013 Nov	27.9
2013 Apr	28
2013 May	28
2013 Jul	28.1
2013 Sep	28.1

(b) EU-27 (excl. UK)
Month	Unemployment rate (%)
2013 Jun	11.6
2013 Jan	11.7
2013 Feb	11.7
2013 Mar	11.7
2013 Apr	11.7

On the other hand, it is also interesting to look at the periods with the lowest observed unemployment. In Greece this period was just before the economic crisis, in 2008, while the EU27 is going through one of its best periods in terms of unemployment, with a 20-year historical low at 6%.

Lowest unemployment rates

(a) Greece
Month	Unemployment rate (%)
2008 May	7.4
2008 Jun	7.6
2008 Jul	7.6
2008 Oct	7.6
2008 Jan	7.7

(b) EU-27 (excl. UK)
Month	Unemployment rate (%)
2022 Jul	6
2022 Aug	6
2022 Apr	6.1
2022 May	6.1
2022 Jun	6.1

All of this can be summarised in the chart below, where the enormous changes in our country stand out. We can see that the 2008 crisis affected unemployment in the EU and across the developed world, as there is an upward trajectory over the same period. The EU has since recovered, but Greece has not managed to return to pre-crisis levels, although the trend is downward.

Examining Trend and Seasonality

In time series analysis it is important to separate the sources of variation in a time series and determine where they originate from. Time series have three basic components: trend ( $T$ ), seasonality ( $S$ ), and randomness ( $E$ ). The trend describes the general direction the time series follows over time, whether upward, downward, or horizontal. In our case, for example, the sharp rise in unemployment after 2009 is a characteristic example of a strong upward trend. Seasonality refers to patterns that repeat with a fixed periodicity; for instance, if unemployment tends to increase every winter and decrease in summer, we are talking about a seasonal component. Finally, randomness (or the remainder) is what is left after removing the trend and seasonality, that is, the unpredictable fluctuations that cannot be attributed to any systematic pattern, such as the sudden increase in unemployment during the pandemic in March 2020.

y_{t} = S_{t} + T_{t} + E_{t}

Where:

$y_{t}$ denotes the data we have available,
$S_{t}$ is the seasonal component,
$T_{t}$ is the trend component,
$E_{t}$ is the random component.

The model above is additive. Its multiplicative counterpart is:

y_{t} = S_{t} \cdot T_{t} \cdot E_{t}

where the components composing the time series are multiplied rather than added.
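A minimal sketch of the additive decomposition in base R, using `stl()` on a simulated monthly series. The data, the seed, and the sizes of the components are invented for illustration and are not the unemployment series analysed here. For a strictly positive series, the multiplicative model can be handled the same way after taking logarithms, since $\log y_{t} = \log S_{t} + \log T_{t} + \log E_{t}$.

```r
# Simulated monthly series (illustrative only, not the article's data)
set.seed(1)
n <- 120
trend    <- seq(10, 20, length.out = n)        # T_t: slow upward drift
seasonal <- 1.5 * sin(2 * pi * (1:n) / 12)     # S_t: 12-month cycle
noise    <- rnorm(n, sd = 0.5)                 # E_t: irregular part
y <- ts(trend + seasonal + noise, start = c(2010, 1), frequency = 12)

# Additive decomposition: y_t = S_t + T_t + E_t
fit <- stl(y, s.window = "periodic")
components <- fit$time.series                  # columns: seasonal, trend, remainder

# The three estimated components add back up to the original series
recomposed <- rowSums(components)
max(abs(recomposed - as.numeric(y)))
```

The last line should be zero up to floating-point error, because the remainder is defined as whatever is left after the estimated trend and seasonal parts are subtracted.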

Distinguishing the seasonal component in the chart above matters because it determines the model family. Without seasonality, a non-seasonal model (AR, MA, or ARIMA) is enough; with seasonality, I need a model that incorporates it, such as a seasonal autoregressive model (SAR), a seasonal MA, or Seasonal ARIMA (SARIMA). The seasonality does not follow the same pattern throughout the entire time series. Up to 2004 it is negligible; from then until 2014 there are signs of weak seasonality; and from 2014 onwards it is stronger than at any other period since 1998. It is worth noting that most peaks occur around February and March.

The picture is therefore ambiguous: historically weak seasonality that has become more pronounced in recent years. A simple way to get a quick answer is the nsdiffs command from the {forecast} package, which estimates how many seasonal differences the series needs. In this case it returned zero, which leads me to believe that any seasonality present has generally been weak.

Seasonality tests on Greek unemployment series
Test	Value	Result
Canova-Hansen (CH)	-	No seasonal differencing required
OCSB	-	No seasonal differencing required
Seasonal Influence (STL)	0.153	Weak seasonality
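The STL row of the table is presumably a seasonal-strength measure of the form $\max(0, 1 - \mathrm{Var}(R) / \mathrm{Var}(S + R))$, where $S$ and $R$ are the seasonal and remainder components; that reading is an assumption, as the exact definition is not stated. A minimal sketch of that measure in base R follows, on simulated data with deliberately weak seasonality. The `forecast::nsdiffs()` call is optional and only runs if the package is installed.

```r
# Simulated monthly series with a deliberately weak seasonal pattern
# (illustrative only, not the article's data)
set.seed(2)
n <- 240
y <- ts(cumsum(rnorm(n, sd = 0.3)) + 0.1 * sin(2 * pi * (1:n) / 12),
        start = c(2000, 1), frequency = 12)

fit <- stl(y, s.window = "periodic")
s <- fit$time.series[, "seasonal"]
r <- fit$time.series[, "remainder"]

# Seasonal strength: close to 0 = weak seasonality, close to 1 = strong
seasonal_strength <- max(0, 1 - var(r) / var(s + r))
seasonal_strength

# Optional: number of seasonal differences suggested by the OCSB test
if (requireNamespace("forecast", quietly = TRUE)) {
  print(forecast::nsdiffs(y, test = "ocsb"))
}
```

By construction the measure lies between 0 and 1, so a value such as 0.153 sits near the weak end of the scale.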

Testing for Structural Breaks

A time series cannot be reliably interpreted, and consequently forecast, without taking exogenous factors into account. A model sees only the values of the series, not the events that produced them, so changes caused by external shocks can distort its specification. It is therefore important to determine whether there are structural breaks, that is, points in time at which a defining event changed the behaviour of the series. Greece is a complex case in this respect: unemployment skyrocketed after the 2009 economic crisis, reaching a historical high at its peak, and the series also includes the period of the pandemic, which affected the unemployment index as well.

Structural break count indicator
Breaks	RSS	BIC
0	11913.47	1928.5
1	3255.88	1559.78
2	1356.96	1314.7
3	1243.73	1300.53
4	1207.46	1303.22
5	1146.21	1299.33
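The BIC column can be reproduced from the RSS column, which makes the trade-off between fit and complexity explicit. The sketch below assumes a Gaussian likelihood for a mean-shift model with $n = 293$ observations, and counts one mean per segment, one parameter per break date, and one error variance. This parameter count is an assumption about how the figures were produced, but it matches the table to within rounding.

```r
n   <- 293                                   # number of monthly observations
m   <- 0:5                                   # number of breaks
rss <- c(11913.47, 3255.88, 1356.96, 1243.73, 1207.46, 1146.21)

# Gaussian log-likelihood evaluated at the ML variance estimate RSS / n
loglik <- -n / 2 * (log(2 * pi) + 1 + log(rss / n))

# Assumed parameter count: (m + 1) segment means + m break dates + 1 variance
k   <- (m + 1) + m + 1
bic <- -2 * loglik + log(n) * k

data.frame(breaks = m, rss = rss, bic = round(bic, 2))
```

Under this count each additional break adds $2 \log(293) \approx 11.4$ to the penalty, so a break only lowers the BIC when it reduces $-2 \log L$ by more than that.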

There is a fairly large reduction in the Bayesian Information Criterion (BIC) when moving from a model with no structural break to one with two structural breaks, with a smaller reduction for three breaks. This is an important indication that my unemployment series was indeed affected by sudden factors. Of course, we suspected this, as we have the crisis period that contributed to high unemployment rates. Having determined the number of breaks, it is time to identify which segments need to be analysed; in short, to determine the date ranges during which a significant change in the behaviour of the time series was detected. According to the results I have:

Structural break positions (by number of breaks)
Number of breaks	Break 1	Break 2	Break 3	Break 4	Break 5
1	155	—	—	—	—
2	159	240	—	—	—
3	159	207	250	—	—
4	45	159	207	250	—
5	48	119	162	207	250

and the dates of the corresponding breaks:

Structural break dates (by number of breaks)
Number of breaks	Break 1	Break 2	Break 3	Break 4	Break 5
1	2011-03	—	—	—	—
2	2011-07	2018-04	—	—	—
3	2011-07	2015-07	2019-02	—	—
4	2002-01	2011-07	2015-07	2019-02	—
5	2002-04	2008-03	2011-10	2015-07	2019-02

Based on these results, I will choose either 2 or 3 structural breaks, since most of the reduction in the BIC criterion has been achieved by that point. The third break brings a small further reduction, the fourth increases it, and the fifth lowers it only marginally below the three-break value. Let us examine the proposed breaks individually. According to the table of break dates, the 2-break model proposes breaks in July 2011 and April 2018, while the 3-break model proposes July 2011, July 2015, and February 2019. I deliberated at length over whether I could decide on my own which was the defining moment of the crisis; as this is a somewhat subjective judgement, I prefer to let the model calculate it. Furthermore, the entire period was quite turbulent and full of negative developments, meaning that in reality one cannot pinpoint a single clear break.

Looking at the dates, the three-break model is probably the one that makes the most sense to observers. 2011 saw problems that were already showing up in the unemployment index; 2015 was a period of uncertainty; and the break in early 2019 marks the country’s recovery, with the exit from the memoranda announced a few months earlier, in August 2018.

Stationarity Testing

Definition of Stationarity

An important concept in time series is stationarity. A time series is called stationary if:

$E(X_{t})$ is constant over time,
$\mathrm{Var}(X_{t})$ is constant over time, and
$\mathrm{Cov}(X_{t}, X_{s})$ depends only on the lag $|t - s|$, not on $t$ itself.

Examining Stationarity Graphically

From the unemployment chart above it is very evident that our series does not move around any specific value, violating the first condition for a time series to be considered stationary. This indicates the need to use first-order differences for Greek unemployment. From the first difference ( $Δ y_{t} = y_{t} - y_{t - 1}$ ) I observe a great improvement, as we no longer have the enormous deviations seen in the previous chart. The values mostly show no trend and stay relatively close to zero. This is a good sign, though I have a mild concern, as there are two stretches of the series with relatively large deviations from zero. The first is between points 120 and 170, corresponding to the period between 2008 and 2012 (the crisis and the deterioration of economic indicators), where the fluctuation drifts slightly away from zero. The second is the 266th point, corresponding to March-April 2020 and the imposition of movement restrictions to curb the spread of coronavirus when the first cases were detected in our country.

Taking into account the above concerns, I also computed the second differences ( $Δ^{2} y_{t} = Δ y_{t} - Δ y_{t - 1}$ ) and visualised them. The chart is almost identical, with values fluctuating around zero even at the problematic stretch near the 150th observation. The second differences fluctuate more consistently around zero, but at the same time show an even larger deviation at the onset of COVID-19 measures in our country (3.6 percentage points versus 2.6 for the first differences).
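In R, both orders of differencing are available through the base function `diff()`. A minimal sketch on a handful of made-up monthly rates (the numbers are invented for illustration):

```r
# Short illustrative series of monthly rates (made-up numbers)
y <- c(10.2, 10.5, 11.1, 12.0, 12.4, 12.5)

d1 <- diff(y)                    # first differences:  y_t - y_{t-1}
d2 <- diff(y, differences = 2)   # second differences: d1_t - d1_{t-1}

d1
d2

# The second difference is just the first difference applied twice
all.equal(d2, diff(d1))
```

Each round of differencing shortens the series by one observation, which is why the differenced unemployment series has one point fewer than the original.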

Examining Stationarity with Statistical Tests

Graphical examination of stationarity is a fairly easy way to identify the presence of trends or whether our series has generally stable behaviour. Except in some very clear-cut cases, there will be times when many people may disagree on the stationarity of a series simply from its trajectory. This is understandable given that as a measure of evaluation it is somewhat subjective, being based on the personal opinion and interpretation each person gives to the movement of the time series. The use of this method alone may therefore lead to inconsistent or unstable decisions, since one person may consider a slight upward trend that reverts as stationary, while another may interpret the pattern as a faint but real trend. By analogy with normality tests, which include both graphical (quantile-quantile plot) and statistical tests (Kolmogorov-Smirnov test, Shapiro-Wilk test), we have similar alternatives for testing stationarity. This way we can have a more objective criterion for whether our series are stationary or not. Some of the best-known stationarity tests, which are also known as unit root tests, are as follows:

The DF test (Dickey-Fuller)
The ADF test (Augmented Dickey-Fuller)
The ADF-GLS test
The PP test (Phillips-Perron)
The KPSS test (Kwiatkowski-Phillips-Schmidt-Shin) and
The ZA test (Zivot-Andrews)

And this is where the chaos begins, as I started to realise when I began learning R. When using a programming language we do not have nice menus and tick boxes with options for each test. In R, as in other languages such as Python, packages add functions and capabilities to the language, and quite a few R packages offer stationarity tests. Two fairly well-known ones are tseries and urca. In brief, I found the tseries package quite restrictive, despite the fact that many guides use it: I could not freely set the lags for the tests or the characteristics of the time series (constant, trend). The urca package addresses these limitations by allowing users to set both. The only drawback of urca is that it does not provide a p-value in the test results. Instead, the test statistic is computed and compared against the corresponding critical value; if the statistic is more extreme than the critical value, the null hypothesis is rejected at the corresponding significance level.

Summary of Results

Summarising the results of the statistical stationarity tests, I conclude that the unemployment observations in Greece cannot be characterised as stationary. Furthermore, all classical tests agree on the presence of stationarity in the first- and second-order differences.

Test	Result	Stationarity achieved at...
ADF	I(1)	first difference
PP	I(1)	first difference
KPSS	I(1)	first difference
ZA	I(1)	first difference
LS	I(1)	first difference

DF Test

The Dickey-Fuller test is one of the simplest unit root tests for determining the stationarity of a time series. This test is based on the first-order autoregressive model, $AR (1)$ :

y_{t} = ρ y_{t - 1} + e_{t}

Where $y_{t}$ is the value of the time series and $e_{t}$ is the error term. That is, it is a time series whose values are influenced by, and depend on, previous values of the series. A unit root corresponds to $ρ = 1$. Subtracting $y_{t - 1}$ from both sides gives the form that is actually tested:

y_{t} - y_{t - 1} = ρ y_{t - 1} - y_{t - 1} + e_{t}

Δ y_{t} = (ρ - 1) y_{t - 1} + e_{t}

Δ y_{t} = γ y_{t - 1} + e_{t}

where $γ = ρ - 1$ , so testing for a unit root ( $ρ = 1$ ) amounts to testing $γ = 0$ .

The test has specific variants based on the behaviour of the time series in question. More precisely, there are three variants:

Without a constant and without a trend, where the series moves around zero: $Δ y_{t} = γ y_{t - 1} + e_{t}$
With a constant, when the series moves around a fixed value (different from zero): $Δ y_{t} = α + γ y_{t - 1} + e_{t}$
With a constant and trend, when the series appears to follow a downward or upward trajectory over time: $Δ y_{t} = α + β t + γ y_{t - 1} + e_{t}$

This particular test has some problems in application, as it makes a series of assumptions: our series follows an autoregressive model with $p = 1$ , the errors are homoskedastic, and the errors are uncorrelated.

The series is clearly non-stationary in the original observations, while the trend disappears in the first-order differences. I will therefore check the assumptions of the Dickey-Fuller test. Both the Ljung-Box and the Durbin-Watson tests reject the hypothesis that the errors are uncorrelated. Finally, the ARCH test shows that the errors are not homoskedastic (their variance changes over time). These results make it clear that the plain DF test is not reliable here; in practice it is rarely used, as its assumptions are quite restrictive and typically violated. Tests such as the ADF and PP deal with this error behaviour in different ways in order to obtain valid results when assessing the stationarity of a series.

Error assumption checks (autocorrelation & homoscedasticity)
Test	Statistic	Df	p-value
Ljung-Box	158.036	17	<0.0001
Durbin-Watson	1.655	-	0.0015
ARCH	39.874	12	7.5e-05
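The check described above can be sketched in base R: run the plain Dickey-Fuller regression with `lm()` and test its residuals for autocorrelation with `Box.test()`. The series below is simulated so that its changes are autocorrelated (an invented example, not the unemployment data), and the choice of 17 lags simply mirrors the Df column of the table.

```r
# Simulated series whose changes follow an AR(1) process (illustrative only)
set.seed(3)
n <- 293
changes <- as.numeric(arima.sim(list(ar = 0.5), n = n))  # autocorrelated changes
y <- cumsum(changes)                                     # integrate: unit root

dy    <- diff(y)          # Δy_t
y_lag <- y[-length(y)]    # y_{t-1}

# Plain Dickey-Fuller regression without constant or trend: Δy_t = γ y_{t-1} + e_t
df_fit  <- lm(dy ~ 0 + y_lag)
gamma_t <- coef(summary(df_fit))["y_lag", "t value"]     # DF test statistic

# Ljung-Box test on the residuals: a small p-value means the errors are
# autocorrelated, i.e. the plain DF assumption is violated
lb <- Box.test(residuals(df_fit), lag = 17, type = "Ljung-Box")

gamma_t
lb$p.value
```

When the Ljung-Box p-value is this small, the DF statistic should not be compared with its critical values, which is the motivation for the augmented version of the test.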

ADF Test

One of the most widely used unit root tests is the Augmented Dickey-Fuller (ADF) test, a generalisation of the simple Dickey-Fuller test based on a higher-order autoregressive model ( $p > 1$ ). That is, the value of the time series depends not only on its previous value but on the $p$ preceding ones. An autoregressive model of order $p$ takes the following form:

y_{t} = ϕ_{1} y_{t - 1} + ϕ_{2} y_{t - 2} + \dots + ϕ_{p} y_{t - p} + ϵ_{t}

Analogously with the simple Dickey-Fuller test, variants are defined based on the behaviour of the time series in question:

Without a constant and without a trend:
$Δ y_{t} = γ y_{t - 1} + \sum_{i = 1}^{p} δ_{i} Δ y_{t - i} + u_{t}$
With a constant:
$Δ y_{t} = α + γ y_{t - 1} + \sum_{i = 1}^{p} δ_{i} Δ y_{t - i} + u_{t}$
With a constant and trend:
$Δ y_{t} = α + β t + γ y_{t - 1} + \sum_{i = 1}^{p} δ_{i} Δ y_{t - i} + u_{t}$

The ADF test has the following hypothesis structure:

$H_{0} : γ = 0$ (the series has a unit root and is not stationary)
$H_{1} : γ < 0$ (the series is stationary)

That is, if our results reject the null hypothesis, this indicates that the time series is stationary. Before doing so I need to determine how many time lags to include. Schwert’s rule, based on the number of observations ( $T = 293$ ), gives the maximum lag; the tables below use 10 lags, which is within this bound:

p_{max} = ⌊ 12 \cdot {(\frac{293}{100})}^{0.25} ⌋ = ⌊ 15.7 ⌋ = 15 \text{ lags}
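The arithmetic of the rule is easy to check in R. The helper name `schwert_lag` and its `constant` argument are hypothetical, introduced only for this sketch; the constant 4 corresponds to the shorter rule often used for small samples.

```r
# Schwert-type rule of thumb for the maximum lag
schwert_lag <- function(n_obs, constant = 12) {
  floor(constant * (n_obs / 100)^0.25)
}

schwert_lag(293)                 # long rule, constant 12
schwert_lag(293, constant = 4)   # short rule, constant 4
```

For 293 observations the long rule gives 15 and the short rule gives 5.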

Augmented Dickey-Fuller (ADF) — levels
Model	ADF statistic	Lag	Critical value (1%)	Critical value (5%)
ADF (none)	-0.87	10	-2.58	-1.95
ADF (drift)	-1.889	10	-3.44	-2.87
ADF (trend)	-2.039	10	-3.98	-3.42

None of the three statistics is below its 5% critical value, so the null hypothesis of a unit root cannot be rejected and I conclude that my time series is not stationary. It therefore makes sense to difference the observations and re-test them for stationarity.

Augmented Dickey-Fuller (ADF) — first differences
Model	ADF statistic	Lag	Critical value (1%)	Critical value (5%)
ADF (none)	-2.119*	10	-2.58	-1.95
ADF (drift)	-2.111	10	-3.44	-2.87
ADF (trend)	-2.194	10	-3.98	-3.42

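In the variant without constant or trend, the test statistic (−2.119) is smaller than the corresponding critical value (−1.95), so we can reject the null hypothesis of non-stationarity for the first differences of the Greek unemployment observations at the 5% significance level (though not at the 1% level, where the critical value is −2.58). The variants with a constant and with a trend do not reject at either level.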
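To make the mechanics concrete, here is a minimal base R sketch of the augmented regression without constant or trend, written by hand rather than with urca. The function name `adf_none` is hypothetical, the data are a simulated random walk rather than the unemployment series, and the function assumes at least one lag. The 5% critical value is the one quoted in the ADF tables.

```r
# Augmented Dickey-Fuller regression without constant or trend:
#   Δy_t = γ y_{t-1} + sum_i δ_i Δy_{t-i} + u_t      (requires lags >= 1)
adf_none <- function(y, lags) {
  dy      <- diff(y)
  n       <- length(dy)
  z       <- embed(dy, lags + 1)     # col 1: Δy_t, cols 2..: Δy_{t-1}, ..., Δy_{t-lags}
  y_lag   <- y[(lags + 1):n]         # y_{t-1}, aligned with column 1 of z
  dy_lags <- z[, -1, drop = FALSE]
  fit <- lm(z[, 1] ~ 0 + y_lag + dy_lags)
  unname(coef(summary(fit))["y_lag", "t value"])   # t-statistic of γ
}

# Simulated random walk (illustrative only, not the article's data)
set.seed(4)
y <- cumsum(rnorm(293))

stat_levels <- adf_none(y, lags = 10)
stat_diffs  <- adf_none(diff(y), lags = 10)

crit_5 <- -1.95   # 5% critical value for this variant, from the tables
c(levels = stat_levels, differences = stat_diffs)
c(reject_levels = stat_levels < crit_5, reject_differences = stat_diffs < crit_5)
```

The null hypothesis is rejected only when the statistic falls below the (negative) critical value, which is why a more negative statistic is stronger evidence of stationarity.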

PP Test

Another stationarity test is the Phillips-Perron test, which is based on the simple Dickey-Fuller. The logic is the same as the ADF and the hypothesis structure is identical, but they differ significantly in how they handle autocorrelation of the errors. The PP test does not add lags to reduce error autocorrelation (as the ADF does); instead, it attempts to modify the test statistic by estimating the long-run variance $\hat{λ}$ .

The only term that cannot be determined in advance is the long-run variance, as its summation depends on a parameter $q$ . For the bandwidth parameter $q$ of the long-run variance, there are calculation rules that coincide with those for the maximum lag number in the ADF test. For small samples:

q_{small} = ⌊ 4 \cdot {(\frac{T}{100})}^{0.25} ⌋

and for larger samples:

q_{large} = ⌊ 12 \cdot {(\frac{T}{100})}^{0.25} ⌋

The hypothesis structure takes the same form as the tests mentioned above, with the null hypothesis assuming non-stationarity of our time series:

$H_{0}$ : the series is not stationary
$H_{1}$ : the series is stationary

Phillips-Perron test (levels & first differences)
Series	Model	Statistic	Lag (Newey-West)	Critical value (1%)	Critical value (5%)
Levels	Constant	-1.11	15	-3.454	-2.872
Levels	Trend	-0.552	15	-3.993	-3.427
First differences	Constant	-16.959	15	-3.454	-2.872
First differences	Trend	-16.957	15	-3.993	-3.427

Consequently, we have a failure to reject the null hypothesis ( $H_{0}$ ) for the original data and a rejection for the first differences. This means that the Greek unemployment observations were not stationary, but their differences were. The PP test thus confirms the findings of the ADF test.
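Base R ships a Phillips-Perron test as `PP.test()`, which is enough for a quick illustration. The sketch below uses a simulated random walk, not the unemployment data, and is not the urca call behind the table; as far as I know, `PP.test()` always includes a constant and a linear trend, so it corresponds most closely to the "Trend" rows.

```r
# Simulated random walk (illustrative only, not the article's data)
set.seed(5)
y <- cumsum(rnorm(293))

# lshort = FALSE selects the longer truncation-lag rule
pp_levels <- PP.test(y, lshort = FALSE)
pp_diffs  <- PP.test(diff(y), lshort = FALSE)

pp_levels$parameter    # truncation lag actually used
pp_levels$p.value      # usually large for a random walk: unit root not rejected
pp_diffs$p.value       # small: the differences look stationary
```

The pattern to look for is the same as in the table: no rejection in levels, clear rejection in first differences.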

KPSS Test

Another test is the KPSS test. Although its aim is the same, its logic is the opposite of the previous tests: the null hypothesis is stationarity rather than non-stationarity, so strictly speaking it is a stationarity test rather than a unit root test.

$H_{0}$ : the series is stationary
$H_{1}$ : the series is not stationary

KPSS test (levels & differences)
Series form	KPSS statistic	Lag	Critical value (1%)	Critical value (5%)
Levels	0.973	15	0.739	0.463
First differences	0.297	15	0.739	0.463
Second differences	0.298	15	0.739	0.463

For the original observations, the KPSS statistic (0.973) exceeds the 1% critical value (0.739), so the null hypothesis of stationarity is rejected at the 1% significance level. For the differenced observations, on the other hand, the test cannot reject stationarity. These results confirm both the PP test and the ADF test.
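Because ADF and KPSS have opposite null hypotheses, it is easy to misread one of them. The short sketch below applies both decision rules to statistics and 5% critical values copied from the tables above (ADF in its no-constant variant) and translates each outcome into the same question: does the series look stationary?

```r
# Statistics and 5% critical values copied from the ADF and KPSS tables
results <- data.frame(
  test   = c("ADF", "ADF", "KPSS", "KPSS"),
  series = c("levels", "first differences", "levels", "first differences"),
  stat   = c(-0.870, -2.119, 0.973, 0.297),
  crit_5 = c(-1.95, -1.95, 0.463, 0.463)
)

# ADF rejects its null (unit root) when the statistic is BELOW the critical value;
# KPSS rejects its null (stationarity) when the statistic is ABOVE it
results$reject_null <- ifelse(results$test == "ADF",
                              results$stat < results$crit_5,
                              results$stat > results$crit_5)

# Translate each decision into the same language
results$looks_stationary <- ifelse(results$test == "ADF",
                                   results$reject_null,
                                   !results$reject_null)
results
```

Both tests point the same way: the levels do not look stationary, the first differences do.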

ZA and LS Tests

Finally, another stationarity test is the Zivot-Andrews (ZA) test. This particular test statistic differs from the previous ones in that it takes into account certain points at which the time series changes behaviour. The best-known tests for stationarity in time series with structural breaks are:

The Zivot-Andrews (ZA) test, if there is only one structural break, and
The Lee-Strazicich (LS) test, if there are two structural breaks.

For completeness I will include the Zivot-Andrews test, which is not optimal since it assumes only one structural break. Although for classical tests there are packages and commands that compute the test statistic, no equivalent commands exist for the Lee-Strazicich test. Fortunately, instead of relying on a statistical package (e.g. EViews), I found a relevant repository on GitHub, where user hannes101 has written a series of functions for this specific test that follow a logic similar to those in the urca package.

Unit root tests with structural breaks
Test	Variable	Statistic	Critical value (5%)
Zivot-Andrews	Unempl	-4.39	-5.08
Zivot-Andrews	Δ Unempl	-16.42	-5.08
Lee-Strazicich	Unempl	-4.71	-5.65
Lee-Strazicich	Δ Unempl	-7.72	-5.65

Model Identification

Above I concluded that the unemployment observations are not stationary; however, their first differences constitute a stationary time series. At this point I would like to investigate which (S)ARIMA(p, d, q) model is appropriate for my case. For this reason I will construct autocorrelation and partial autocorrelation diagrams in order to identify the ideal values of $p$ and $q$ . Finally, it is worth recalling that in an ARIMA model, $d$ represents the order of differencing, which in our case is 1.

Original Observations

Autocorrelations on levels (first 12 lags)
Lag	ACF	PACF
0	1	1
1	0.997	0.997
2	0.992	-0.1
3	0.988	-0.068
4	0.982	-0.064
5	0.977	-0.048
6	0.97	-0.112
7	0.963	-0.144
8	0.955	-0.04
9	0.946	0.005
10	0.937	-0.117
11	0.926	-0.166

For completeness, we will start with the given unemployment observations. We already know that they are not stationary, but there are specific patterns that we need to observe in the autocorrelation diagrams in order to confirm this. The autocorrelation diagram decreases at an extremely slow rate, which is a strong indication of non-stationarity of the series.

First Difference

Autocorrelations on first differences (first 12 lags)
Lag	ACF	PACF
0	1	1
1	0.172	0.172
2	0.118	0.091
3	0.145	0.116
4	0.122	0.076
5	0.191	0.148
6	0.284	0.225
7	0.178	0.087
8	0.003	-0.107
9	0.223	0.166
10	0.338	0.271
11	0.18	0.048

The non-stationarity of my data, confirmed countless times above, compels us to take the first differences and obtain the corresponding autocorrelation diagrams. I observe a rather difficult-to-interpret situation from the diagrams, as there are significant fluctuations in both diagrams that make it hard to decide on the appropriate ARIMA model. Normally, to determine the right lag orders I look for a sharp drop within the red region, but there are quite a few statistically significant exceedances. After the 10th lag this tendency diminishes or disappears significantly, so my model likely has parameters $p$ and $q$ somewhere between 1 and 10.
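The shaded region in an autocorrelation diagram is usually the approximate 95% band $\pm 1.96 / \sqrt{n}$ under white noise; that this is how the chart was drawn is an assumption. Applying it to the ACF values of the first differences, copied from the table above, shows how many lags exceed it:

```r
# ACF of the first differences, lags 1-11, copied from the table above
acf_d1 <- c(0.172, 0.118, 0.145, 0.122, 0.191, 0.284,
            0.178, 0.003, 0.223, 0.338, 0.180)

n_diff <- 293 - 1                        # observations left after one difference
band   <- qnorm(0.975) / sqrt(n_diff)    # approximate 95% band under white noise

band
which(abs(acf_d1) > band)                # lags whose autocorrelation exceeds the band
```

With a band of roughly 0.115, every lag except the eighth exceeds it, which is why the diagram is so hard to read. On real data, the base R functions `acf()` and `pacf()` compute the autocorrelations and draw this band.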

Model Building

Splitting the Time Series

Forecasting future unemployment requires us to build a model. Established methods include ARIMA models and Exponential Smoothing (ETS) models. Simply fitting a model says very little unless we also evaluate it. More generally, when building predictive models we want to assess the model’s predictive power, and for this reason we typically split our data into two parts: one for training the model (train set) and the remainder for evaluating it (test set). The usual split is 70/30 or 80/20. This approach generally works well for classical machine learning problems such as classification and regression.

Full Dataset (1998–2022, 293 obs)
        ↙               ↘
Training (1998–2015)   Evaluation (2015–2022)

However, if we have time series problems, things are not so straightforward. Let us first suppose we want to apply the same logic to these. We would then end up with a model that ignores one fifth of the observations, specifically the most recent ones, which may well influence the variability of the time series. For this reason, other methods have been proposed for time series.

Full Time Series (293 observations)
  ├─ 1st Fold: Training 1998–2010 / Forecast 2011
  ├─ 2nd Fold: Training 1998–2011 / Forecast 2012
  ├─ 3rd Fold: Training 1998–2012 / Forecast 2013
  ├─ …
  └─ N-th Fold: Training 1998–2021 / Forecast 2022
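The expanding-window scheme in the diagram can be sketched as index sets in base R. The initial window of 153 months, the 12-month horizon, and the 12-month step are assumed values chosen only to resemble the diagram; they are not taken from the analysis itself.

```r
# Expanding-window ("rolling origin") folds over 293 monthly observations
n_obs   <- 293
initial <- 153      # size of the first training window (assumed)
horizon <- 12       # months forecast in each fold (assumed)
step    <- 12       # how far the origin moves between folds (assumed)

origins <- seq(initial, n_obs - horizon, by = step)

folds <- lapply(origins, function(end_train) {
  list(train = 1:end_train,
       test  = (end_train + 1):(end_train + horizon))
})

length(folds)
range(folds[[1]]$train)
range(folds[[1]]$test)
```

Every fold trains on all observations up to its origin and is evaluated only on the months that follow, so no future information leaks into training. In the tidyverts ecosystem, `stretch_tsibble()` from {tsibble} builds the same kind of folds.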

All of this may sound somewhat complicated. I have included an interactive application below for you to try for yourselves, so that the difference is clear and you can see how the various options affect the result.

Specifying ARIMA Models

For the simple ARIMA model I need to determine my three parameters. I know that $d = 1$ , since I achieved stationarity with the first differences. For the determination of $p$ I suspect 1, 6, and 10, as there was a large sharp drop in the partial autocorrelation diagram of the first differences after these values. Finally, for $q$ I suspect 1, 3, 6, and 10. From the diagram I cannot draw a safe conclusion about their values, as there are quite a few significant autocorrelations at least up to the 10th lag. The graphical determination of $p$ and $q$ is quite subjective and provides uncertain conclusions. For this reason it would be good to evaluate a range of parameter combinations for the ARIMA time series models in order to find the model with the optimal parameters.

ARIMA models by information criteria
Model	BIC	AIC	AICc
ARIMA (1,1,1)	305.5	294.4	294.5
ARIMA (1,1,2)	311.1	296.4	296.5
ARIMA (2,1,1)	311.1	296.4	296.5
ARIMA (1,1,3)	313.8	295.4	295.6
ARIMA (3,1,1)	314.7	296.3	296.5
ARIMA (3,1,2)	314.9	292.9	293.1
ARIMA (2,1,3)	315.1	293	293.3
ARIMA (2,1,2)	319.1	293.4	293.8
ARIMA (1,1,0)	325.5	318.2	318.2
ARIMA (0,1,1)	326.8	319.5	319.5
ARIMA (2,1,0)	328.8	317.8	317.9
ARIMA (3,1,0)	330.6	315.9	316
ARIMA (0,1,2)	330.7	319.7	319.8
ARIMA (0,1,3)	333.8	319.1	319.3

Older guides or articles would likely have suggested using auto.arima to find the best ARIMA. This article did exactly that, until I realised that the {fable} library gives us the necessary information in a more organised way: calling the model command with various ARIMA parameter combinations returns a ready-made table with the Akaike information criterion (AIC), the corrected Akaike information criterion (AICc), and the Bayesian information criterion (BIC).

But what should we be looking for? This is very important because, as you can see, the criteria do not agree on the best model: AIC and AICc point to ARIMA(3,1,2), while BIC proposes ARIMA(1,1,1). In general, we look for the model with the smallest value on these criteria. I consider ARIMA(1,1,1) the best choice for three reasons. It achieves the best BIC; its AIC and AICc are only slightly worse than the best values (294.4 versus 292.9 on AIC); and the models favoured by AIC buy that marginal improvement with 5 parameters instead of 2, whereas the smaller model may generalise better to the behaviour of the time series.
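The same kind of comparison can be sketched without extra packages using base R's `arima()`. This is not the {fable} code behind the table: the data are simulated, the grid is smaller, and base R's likelihoods need not match fable's, so the numbers are not comparable with those above.

```r
# Simulated ARIMA(1,1,1) series (illustrative only, not the article's data)
set.seed(6)
y <- arima.sim(list(order = c(1, 1, 1), ar = 0.6, ma = 0.3), n = 292)

grid <- expand.grid(p = 0:2, q = 0:2)

# Fit one ARIMA(p,1,q) and return its information criteria (NA if the fit fails)
fit_one <- function(p, q) {
  fit <- tryCatch(arima(y, order = c(p, 1, q), method = "ML"),
                  error = function(e) NULL)
  if (is.null(fit)) return(c(AIC = NA_real_, BIC = NA_real_))
  c(AIC = AIC(fit), BIC = BIC(fit))
}

scores  <- t(mapply(fit_one, grid$p, grid$q))
ranking <- cbind(grid, scores)
ranking[order(ranking$BIC), ]     # lower is better on both criteria
```

Sorting by BIC rather than AIC tends to favour smaller models, because the BIC penalty per parameter, $\log(n)$, is larger than the AIC penalty of 2 for any realistic sample size.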

Specifying the ETS Model

The Exponential Smoothing method is an equally popular method for forecasting time series. Its great advantage is that it does not require the tests we performed for the ARIMA models. Despite its flexible methodology, we need to examine the time series graphically to observe its properties (trend, seasonality, cyclicality) so that I can specify the ETS model that suits our case.

ETS(Error, Trend, Seasonality)

The smoothing model I will use is based on these characteristics.

ETS models by information criteria
Model	BIC	AIC	AICc
M-A-N	1112.2	1093.8	1094
M-Ad-N	1115.9	1093.8	1094.1
M-N-N	1126.7	1115.7	1115.8
A-A-N	1152.1	1133.7	1133.9
A-Ad-N	1154.7	1132.6	1132.9
A-N-N	1172.9	1161.9	1161.9
A-A-A	1221.4	1158.9	1161.1
M-Ad-M	1224	1157.7	1160.2
A-A-M	1260.1	1197.6	1199.8

The table shows that the ETS(M,A,N) model, that is, with multiplicative error, linear trend, and no seasonality, has the best performance on the BIC criterion. Very close behind is ETS(M,Ad,N), which differs only in that the trend damps gradually rather than continuing linearly. The absence of a seasonal component (N) from the best-performing models confirms the findings of the seasonality section: seasonality in Greek unemployment is weak and does not improve the models. By contrast, models with a seasonal component (A-A-A, M-Ad-M, A-A-M) sit at the bottom of the table, with the added complexity not offset by better fit.

Prophet Model

The solutions above are potentially quite complex and require deep theoretical knowledge as well as basic statistical programming skills. An easy alternative for quick time series forecasting is the Prophet algorithm. It was created by Facebook in 2017 and is a fairly popular forecasting method. The eponymous R package, with the prophet() command, gives us an estimate in a single line of code. One point worth noting is that it requires the data as a data frame with specific column names: the dates in a column called ds and the values in a column called y. Rename your columns accordingly so that it knows which columns to analyse.

Unlike ARIMA, Prophet does not require explicit stationarity testing, and unlike both ARIMA and ETS it does not require the model form to be specified by hand, because it automatically decomposes the time series into trend, seasonality, and holidays. This ease of use makes it popular in industry applications, but it does not guarantee better performance. In the comparison table that follows we will see how it performs relative to the classical models.

Model Comparison

Up to now I used the AIC and BIC criteria to select the best parameters within each model type, for example, which ARIMA is the best among the other ARIMAs. These criteria cannot, however, tell me whether an ARIMA is better than an ETS or Prophet, because each model type computes them in a different way and they are not directly comparable across types. For this reason I need a common basis for comparison. The logic is simple: each model makes forecasts for months for which I already know the actual unemployment figure, and I measure how far off it was. The smaller the error, the better the model. The metrics I will use are three:

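MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - \hat{y}_{i} |

MSE = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}

RMSE = \sqrt{MSE}

where $y_{i}$ is the actual unemployment rate, $\hat{y}_{i}$ is the forecast, and $n$ is the number of forecasts evaluated. RMSE is expressed in the same units as the series (percentage points), which is why the tables report it instead of MSE.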
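
The metrics above take only a few lines to compute. This is a small sketch with made-up numbers, not the article's data; the function name `forecast_errors` is my own. It also returns RMSE, the square root of MSE, which is the quantity reported in the tables that follow.

```python
import math


def forecast_errors(actual, predicted):
    """Return MAE, MSE and RMSE for paired actual and forecast values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("actual and predicted must be non-empty and of equal length")
    errors = [y - y_hat for y, y_hat in zip(actual, predicted)]
    n = len(errors)
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    return {"MAE": mae, "MSE": mse, "RMSE": math.sqrt(mse)}


# Hypothetical unemployment rates (%) and forecasts, for illustration only.
metrics = forecast_errors([12.0, 12.5, 13.0], [12.5, 12.5, 12.0])
print(metrics)
```

Because the errors are squared before averaging, RMSE reacts more strongly to a few large misses than MAE does, so a model whose RMSE is far above its MAE is occasionally very wrong.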

Model error comparison (2-month horizon)
Model	RMSE	MAE
ARIMAX (1,1,1) - 2 breaks	0.6278	0.5006
ARIMA (1,1,1)	0.643	0.4942
ARIMAX (1,1,1) - 1 break	0.6503	0.4928
Naive	0.6702	0.5132
ARIMA (2,1,1)	0.6753	0.5265
ETS (M-Ad-N)	0.6791	0.5232
ETS (M-A-N)	0.6857	0.5215
Auto ARIMA	0.6863	0.5267
ARIMA (1,1,2)	0.6891	0.5344
SNaive	2.0796	1.9158
Prophet	2.7631	1.8038

Model error comparison (6-month horizon)
Model	RMSE	MAE
ARIMAX (1,1,1) - 1 break	0.8504	0.6243
ARIMA (1,1,1)	0.851	0.6252
ARIMAX (1,1,1) - 2 breaks	0.8879	0.6449
ETS (M-Ad-N)	0.8913	0.6703
ARIMA (2,1,1)	0.892	0.6682
ETS (M-A-N)	0.8996	0.6776
Naive	0.9007	0.6946
ARIMA (1,1,2)	0.9165	0.6941
Auto ARIMA	0.9411	0.7101
SNaive	2.1633	1.9108
Prophet	2.942	1.8554
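
The Naive and SNaive rows in the tables above are the simplest possible benchmarks, and the evaluation logic can be sketched in a few lines. I am assuming a rolling-origin scheme here (forecast from every possible cut-off point, then compare with the value that actually followed); the exact scheme behind the tables may differ, and all function names are hypothetical.

```python
def naive_forecast(history, horizon):
    """Naive: the forecast for any horizon is the last observed value."""
    return history[-1]


def seasonal_naive_forecast(history, horizon, period=12):
    """Seasonal naive: repeat the value from the same month one season earlier."""
    seasons_back = (horizon - 1) // period + 1
    index = len(history) + horizon - 1 - seasons_back * period
    if index < 0:
        raise ValueError("history is shorter than one full season")
    return history[index]


def rolling_origin_errors(series, forecaster, horizon, min_train):
    """Forecast `horizon` steps ahead from every origin and collect the errors."""
    errors = []
    for origin in range(min_train, len(series) - horizon + 1):
        history = series[:origin]
        target = series[origin + horizon - 1]
        errors.append(target - forecaster(history, horizon))
    return errors


# Toy series with a steady trend and no seasonality, for illustration only.
series = [float(t) for t in range(1, 31)]
naive_errors = rolling_origin_errors(series, naive_forecast, horizon=2, min_train=12)
snaive_errors = rolling_origin_errors(series, seasonal_naive_forecast, horizon=2, min_train=12)
print(naive_errors[0], snaive_errors[0])
```

On a trending series with no real seasonality, repeating last year's value misses by a full year of trend, whereas repeating last month's value misses by only a couple of months of trend. This is one plausible reading of why SNaive sits so far below Naive in the tables.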

From the tables above it emerges that for short-term forecasts ( $h \leq 6$ months) the ARIMAX models, which add exogenous variables representing the structural changes (breaks) to an ARIMA(1,1,1), perform best: the two-break version has the lowest RMSE at the 2-month horizon, while the one-break version has the lowest RMSE and MAE at the 6-month horizon. The plain ARIMA(1,1,1) model is close behind at both horizons. These three models also have lower errors than the Naive model, although the gap is modest (an RMSE of 0.6278 against 0.6702 at two months), and the remaining ARIMA and ETS specifications are roughly level with Naive. SNaive and Prophet are far behind, with errors three to four times larger. The lower errors potentially indicate that the selected models capture the patterns of the time series better than a simple benchmark, and therefore justify their existence. One point requires particular attention, however: while differences in errors clearly exist, it should be investigated whether they are statistically significant. To study this I will use the Diebold-Mariano test, which tests whether the difference in forecast accuracy between two models is statistically significant:

Diebold-Mariano test (2-month horizon)
Model 1	Model 2	DM statistic	p-value	Conclusion
ARIMAX(1,1,1) 2br.	ARIMAX(1,1,1) 1br.	-0.9218	0.3574	No difference
ARIMAX(1,1,1) 2br.	ARIMA (1,1,1)	-0.9349	0.3506	No difference
ARIMAX(1,1,1) 2br.	Naive	-2.663	0.0082	Stat. significant
ARIMAX(1,1,1) 1br.	ARIMA (1,1,1)	-0.2242	0.8228	No difference
ARIMAX(1,1,1) 1br.	Naive	-2.5129	0.0125	Stat. significant
ARIMA (1,1,1)	Naive	-2.5001	0.013	Stat. significant

According to the test, each of the selected ARIMA models differs significantly from the Naive model at the 5% level (p-values between 0.008 and 0.013). Since their errors are also lower, this demonstrates their value for forecasting future unemployment, at least in the short term. A further finding from the pairwise tests is that the selected ARIMA models do not differ significantly among themselves. That is, the ARIMAX model with one or two breaks is not significantly more accurate, at the 5% significance level, than a simple ARIMA(1,1,1) model.
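
For readers who want to see what the Diebold-Mariano test actually computes, here is a minimal sketch. It assumes a squared-error loss and a normal approximation for the p-value; statistical packages often add small-sample corrections, so their p-values can differ slightly from what this function returns. The function name and the example errors are hypothetical.

```python
import math


def diebold_mariano(errors_1, errors_2, horizon=1):
    """Diebold-Mariano test with squared-error loss (normal approximation).

    Returns (statistic, two-sided p-value). A negative statistic means the
    first model has the lower average squared error.
    """
    n = len(errors_1)
    if n != len(errors_2) or n <= horizon:
        raise ValueError("need two error series of equal length, longer than horizon")

    # Loss differential: positive when model 1 is worse than model 2.
    d = [e1 * e1 - e2 * e2 for e1, e2 in zip(errors_1, errors_2)]
    d_bar = sum(d) / n

    def autocov(lag):
        return sum((d[t] - d_bar) * (d[t - lag] - d_bar) for t in range(lag, n)) / n

    # Long-run variance uses autocovariances up to lag horizon - 1.
    long_run_var = autocov(0) + 2 * sum(autocov(k) for k in range(1, horizon))
    if long_run_var <= 0:
        raise ValueError("non-positive variance estimate; the test is not defined")

    stat = d_bar / math.sqrt(long_run_var / n)
    p_value = math.erfc(abs(stat) / math.sqrt(2))  # two-sided normal p-value
    return stat, p_value


# Hypothetical forecast errors: model 1 is clearly more accurate than model 2.
e1 = [0.1, -0.2, 0.1, -0.1, 0.2, -0.1, 0.1, -0.2]
e2 = [0.5, -0.6, 0.4, -0.5, 0.6, -0.4, 0.5, -0.6]
stat, p_value = diebold_mariano(e1, e2)
print(stat, p_value)
```

The test looks at the difference in squared errors period by period, so two models with similar average errors can still fail to differ significantly if that difference fluctuates a lot.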

Wait, Where Are the Assumptions?

Hey, where are the assumption tests? We want our money back.

Well, I have some good news and some bad news. The good news is that not only has the article not ended, but I also need to check whether the best models satisfy the assumptions we discussed, namely that the errors must be:

uncorrelated ( $Cov(ϵ_{t}, ϵ_{t - 1}) = 0$ ),
homoskedastic ( $Var(ϵ_{t}) = σ^{2}$ ), and
normally distributed.

Residual diagnostics of the best models
Model	Ljung-Box stat	ARCH stat	ARCH p	Jarque-Bera stat	Jarque-Bera p	AIC	BIC
ARIMA (1,1,1)	48.25	45.326	0.001	2346.666	0	294.442	305.472
ARIMAX (1,1,1) - 1 break	48.464	45.376	0.001	2341.208	0	296.388	311.095
ARIMAX (1,1,1) - 2 breaks	51.468	45.387	0.001	2463.176	0	295.352	313.736
Naive	170.806	54.964	0	—	—	—	—

The results are not ideal for any of the models. The Ljung-Box test rejects the null hypothesis of uncorrelated errors for all models ( $p < 0.01$ ), indicating that the errors retain some structure that the models fail to fully explain. The ARCH test likewise rejects homoskedasticity everywhere: the variance of the errors is not constant over time, which is to be expected in a time series that passed through a debt crisis and a pandemic. Finally, the Jarque-Bera test rejects normality, with extremely high test statistics suggesting heavy tails in the error distribution.

These findings mean that the confidence intervals produced by the models should be interpreted with caution, as they may be narrower or wider than the true underlying uncertainty. However, the violation of these assumptions does not automatically invalidate the point forecasts. ARIMA point forecasts generally remain usable when the distributional assumptions are not fully satisfied, particularly over a short-term horizon, although the remaining autocorrelation suggests there is still some predictable structure left to exploit. The main implication concerns the quantification of uncertainty, not the direction of the forecast.
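
Two of the statistics in the table are simple enough to compute by hand. The sketch below implements the Ljung-Box statistic and the Jarque-Bera test using only the standard library; the ARCH test is left out, and the number of lags is a parameter because the choice behind the table is not stated here. Function names and the toy residuals are hypothetical.

```python
import math


def ljung_box(residuals, lags):
    """Ljung-Box Q statistic: n(n+2) * sum over k of r_k^2 / (n - k)."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    denom = sum(c * c for c in centered)
    q = 0.0
    for k in range(1, lags + 1):
        # Sample autocorrelation of the residuals at lag k.
        r_k = sum(centered[t] * centered[t - k] for t in range(k, n)) / denom
        q += r_k * r_k / (n - k)
    return n * (n + 2) * q


def jarque_bera(residuals):
    """Jarque-Bera statistic and p-value (chi-square with 2 degrees of freedom)."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    m2 = sum(c ** 2 for c in centered) / n
    m3 = sum(c ** 3 for c in centered) / n
    m4 = sum(c ** 4 for c in centered) / n
    skewness = m3 / m2 ** 1.5
    kurtosis = m4 / m2 ** 2
    stat = n / 6 * (skewness ** 2 + (kurtosis - 3) ** 2 / 4)
    p_value = math.exp(-stat / 2)  # exact chi-square(2) tail probability
    return stat, p_value


# Toy residuals that flip sign every period, so they are strongly autocorrelated.
toy = [1.0, -1.0] * 10
q_stat = ljung_box(toy, lags=1)
jb_stat, jb_p = jarque_bera(toy)
print(q_stat, jb_stat, jb_p)
```

A large Q means the residuals are still predictable from their own past, and a large Jarque-Bera statistic means their skewness or kurtosis is far from that of a normal distribution, which is the pattern the table shows for all models.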

Forecasting Future Unemployment

All of this was somewhat tiring, but the moment has finally arrived for all of it to make sense. I found the best model and, using it, I will forecast the level of unemployment in Greece for the next 12 months. My data end in August 2022 (unemployment at 12.2%), so any estimate will be made for the months from September 2022 through to August 2023.

The table below gives the forecasts for the first six months, September 2022 to February 2023:

Unemployment forecast for the next 6 months
Month	Forecast	80% CI	95% CI
2022 Sep	12.15	[11.65, 12.66]	[11.38, 12.93]
2022 Oct	12.03	[11.28, 12.78]	[10.88, 13.18]
2022 Nov	11.91	[10.96, 12.86]	[10.45, 13.37]
2022 Dec	11.79	[10.65, 12.94]	[10.05, 13.54]
2023 Jan	11.68	[10.36, 13]	[9.66, 13.71]
2023 Feb	11.57	[10.07, 13.07]	[9.28, 13.86]

Based on the best-performing model, the downward trend in Greek unemployment is expected to continue over the next six months. The point estimate for February 2023 is 11.6%, with an 80% confidence interval between 10.1% and 13.1% (and a 95% interval between 9.3% and 13.9%). If confirmed, this would represent a continuation of the steady decline that began after the historical high of 2013.
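
The 80% and 95% columns of the table are not independent pieces of information. Assuming the intervals are Gaussian and symmetric around the point forecast (my assumption, which appears to match the table), one can be rebuilt from the other. The sketch below does this for the February 2023 row; the function name is hypothetical.

```python
from statistics import NormalDist


def rescale_interval(point, lower, upper, from_level, to_level):
    """Convert a symmetric Gaussian interval from one coverage level to another."""
    z_from = NormalDist().inv_cdf(0.5 + from_level / 2)
    z_to = NormalDist().inv_cdf(0.5 + to_level / 2)
    sigma = (upper - lower) / (2 * z_from)  # implied forecast standard deviation
    half_width = z_to * sigma
    return point - half_width, point + half_width


# February 2023 row of the forecast table: point 11.57, 80% interval [10.07, 13.07].
low95, high95 = rescale_interval(11.57, 10.07, 13.07, from_level=0.80, to_level=0.95)
print(round(low95, 2), round(high95, 2))
```

This makes the earlier caveat tangible: the width of both intervals rests on a single standard deviation and on the normality assumption, which the Jarque-Bera test rejected.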

However, it is worth highlighting certain limitations of this analysis. ARIMA models rest on the assumption that the future structure of the time series will resemble its historical one. An unforeseen event, such as a new geopolitical crisis or an energy shock, could overturn the forecast. In addition, unemployment is influenced by variables not included in the model, such as the trajectory of GDP, inflation, and fiscal policy. Finally, the confidence intervals should be interpreted with caution, given that the assumption tests revealed heteroskedasticity and non-normality in the errors.

Despite these limitations, the direction of the forecast (continuation of the downward trend) is consistent across all models examined, which reinforces the credibility of the conclusion. Greece appears to be moving steadily towards lower unemployment, but remains at levels significantly above the European average (approximately 6% in the same month), a reminder that the full recovery from the 2010 crisis has not yet been completed.

Photo by Rosy / Bad Homburg / Germany from Pixabay.