Forecasting the future value of Greece’s unemployment using an ARIMA model.
R
Time Series
Author: stesiam
Published: October 22, 2022
Introduction
Background
Unemployment has been a chronic problem in our country, as it has historically been at higher levels than the European average and OECD countries over the last 25 years. The phenomenon worsened during the years of the economic crisis, when at its worst moments one quarter of the workforce was unable to find employment. Even worse was the situation for young people in the country, with youth unemployment reaching 46%, a figure that represents the worst performance in the EU.
Before proceeding with the analysis, I think it is important to clarify what exactly the term “unemployed” means. The definition is stricter than it appears, as it is not enough for someone simply not to be working. To be classified as unemployed, a person must simultaneously meet three criteria: not be working, be available for work, and be actively seeking it. Students, retirees, and anyone not participating in the labour force are thus excluded from the calculation. The unemployment rate is derived as follows:
\text{Unemployment rate} = \frac{\text{Number of unemployed}}{\text{Labour Force}}
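As a quick illustration of the formula, with hypothetical (not official) figures:

```r
# Hypothetical illustration of the unemployment-rate formula
# (the figures below are assumed for the example, not official statistics)
unemployed   <- 500000     # assumed number of unemployed
labour_force <- 4600000    # assumed labour force (employed + unemployed)

unemployment_rate <- unemployed / labour_force * 100
round(unemployment_rate, 1)   # -> 10.9 (percent)
```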
Of course, calculating the number of unemployed is a fairly complex process for various reasons, such as the existence of underemployment or seasonal work. There are two dominant methodologies for recording unemployment. The first is based on administrative data, that is, the number of registered unemployed at DYPA (formerly OAED). This is essentially a count of those receiving or applying for unemployment benefits. The problem is that the eligibility criteria are strict:
- Dismissal, not resignation, from the last employment
- 125 days of work in the last 14 months (the last two months are not counted)
- No benefit entitlement for more than 400 days per four-year period of unemployment
With these criteria, registered unemployment significantly underestimates the actual figure, as it excludes those who resigned, those who do not meet the minimum days of work requirement, and those working in undeclared or temporary arrangements. The second and more reliable methodology is the Labour Force Survey (LFS). This is a sample survey conducted by national statistical agencies (in Greece, ELSTAT) based on the International Labour Organization (ILO) definition of unemployment: a person is considered unemployed if they are not working, are actively seeking work, and are available to take up employment within two weeks. This methodology is also used by Eurostat and the OECD, making the data comparable across countries.
Summary Answer
In this article, my goal is to forecast the trajectory of unemployment over the coming months. I therefore obtained some historical data on unemployment in the EU, the OECD, and our country. I will use a simple (S)ARIMA model to make an estimate of its magnitude over the coming months. The data I am using covers the period from 1998 to 2022. If you want a quick answer, in this particular analysis I forecast that the downward trend in unemployment is expected to continue over the coming months. In February 2023, it is expected to range between 10% and 13%.
Time series modelling with ARIMA models is often referred to as the Box-Jenkins methodology.
Once the essential libraries are included, I can import my data using the readr library. My data file is a standard CSV file and I will use the read_csv() command to load it. The file contains unemployment data for three entities:
- Greece
- OECD countries
- The 27 EU countries
Show the code
unemployment <- read_csv("data/unemployment.csv") %>%
  select(LOCATION, TIME, Value) %>%
  filter(LOCATION != "EA19")
Defining Functions
In order to avoid repeating my code, I will define a few functions. First, I want tables based on the gt library with certain appearance adjustments to make them look good. For this reason, in many tables you will see the use of gt_custom instead of a plain table command.
In addition, I have written two other functions for constructing autocorrelation (ACF) and partial autocorrelation (pACF) diagrams. I have named them acf_plot_function and pacf_plot_function respectively. Both functions are based on the highcharter library for creating the corresponding diagrams.
Show the code
acf_plot_function <- function(data, min = -1, max = 1,
                              title_text = NULL, subtitle_text = NULL) {
  ts_data <- data
  N <- length(ts_data)
  # Compute the 95% confidence limit
  conf_limit <- qnorm(0.975) / sqrt(N)
  # Compute the ACF (stats::acf; lag 0 is dropped below)
  acf_obj <- acf(ts_data, plot = FALSE, lag.max = 20)
  # Extract lags and values (drop lag 0)
  lags_acf <- as.numeric(acf_obj$lag[-1])
  acf_values <- as.numeric(acf_obj$acf[-1])
  # Build the highcharter ACF plot
  acf_chart <- highchart() %>%
    hc_chart(type = "column") %>%
    hc_title(text = title_text) %>%
    hc_subtitle(text = subtitle_text) %>%
    hc_xAxis(categories = lags_acf, title = list(text = "# Lag")) %>%
    hc_yAxis(
      min = min, max = max, title = list(text = "ACF"),
      plotLines = list(
        list(value = 0, color = "#000000", width = 3),
        list(value = conf_limit, color = "#FF0000", dashStyle = "ShortDash", width = 1),
        list(value = -conf_limit, color = "#FF0000", dashStyle = "ShortDash", width = 1)
      )
    ) %>%
    hc_legend(enabled = FALSE) %>%
    hc_add_series(name = "ACF", data = round(acf_values, digits = 4))
  acf_chart
}

pacf_plot_function <- function(data, min = -1, max = 1,
                               title_text = NULL, subtitle_text = NULL) {
  ts_data <- data
  N <- length(ts_data)
  # Compute confidence limit (same formula as for the ACF)
  conf_limit <- qnorm(0.975) / sqrt(N)
  # Compute the PACF using forecast::Pacf
  pacf_obj <- forecast::Pacf(ts_data, plot = FALSE, lag.max = 20)
  # Extract lags and PACF values
  lags_pacf <- as.numeric(pacf_obj$lag)
  pacf_values <- as.numeric(pacf_obj$acf)
  # Build the highcharter PACF plot
  pacf_chart <- highcharter::highchart() %>%
    highcharter::hc_chart(type = "column") %>%
    hc_title(text = title_text) %>%
    hc_subtitle(text = subtitle_text) %>%
    highcharter::hc_xAxis(categories = lags_pacf, title = list(text = "# Lag")) %>%
    highcharter::hc_yAxis(
      min = min, max = max, title = list(text = "PACF"),
      plotLines = list(
        list(value = 0, color = "#000000", width = 3),
        list(value = conf_limit, color = "#FF0000", dashStyle = "ShortDash", width = 1),
        list(value = -conf_limit, color = "#FF0000", dashStyle = "ShortDash", width = 1)
      )
    ) %>%
    highcharter::hc_legend(enabled = FALSE) %>%
    highcharter::hc_add_series(name = "PACF", data = round(pacf_values, digits = 4))
  pacf_chart
}
Data Structure
The file contains unemployment data for various countries or country entities such as the European Union and the OECD countries (Organisation for Economic Co-operation and Development), which allows us to compare unemployment in Greece with that of the other developed economies.
OECD Logo
The OECD (Organisation for Economic Co-operation and Development) is an international organisation with 38 member countries. These are: Australia, Austria, Belgium, France, Germany, Denmark, Switzerland, Greece, Estonia, the United States of America, the United Kingdom, Japan, Ireland, Iceland, Spain, Israel, Italy, Canada, Colombia, South Korea, Costa Rica, Latvia, Lithuania, Luxembourg, Mexico, New Zealand, Norway, the Netherlands, Hungary, Poland, Portugal, Slovakia, Slovenia, Sweden, Turkey, the Czech Republic, Finland, and Chile. It consists of a group of countries that:
- are considered developed,
- have representative democracy, and
- have market economies.
Table 1: Data preview (first 6 rows)
| Entity (LOCATION) | Month (TIME) | Observation (Value) |
|---|---|---|
| GRC | 1998-04 | 10.9 |
| GRC | 1998-05 | 11.0 |
| GRC | 1998-06 | 10.9 |
| GRC | 1998-07 | 11.0 |
| GRC | 1998-08 | 11.2 |
My data consists of 3 variables (columns). More specifically, my columns are as follows:
| Variable | Variable Type | Description |
|---|---|---|
| LOCATION | Qualitative (categorical) | Country or country entity |
| TIME | Qualitative (ordinal) | Month-Year to which the measurement refers |
| Value | Quantitative (continuous) | Unemployment level (by area and month) |
Consequently, my sample consists of 3 variables, two of which are qualitative and one quantitative, which is also the value I want to forecast (unemployment). It is worth pointing out at this stage that the LOCATION variable has three values: Greece, the OECD countries, and the 27 European countries. Finally, regarding the classification of the time variable, since my values take the form of month-year (mm/year), its type is not immediately clear. We could split it into two additional variables where one would be the year, classified as a quantitative variable, and the month would be a qualitative ordinal variable.
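Such a split could be sketched in base R as follows (the values below are illustrative stand-ins for the TIME column):

```r
# Sketch: split "YYYY-MM" strings into a numeric year (quantitative)
# and an ordered month factor (qualitative ordinal); base R only
time_values <- c("1998-04", "1998-05", "2022-08")   # illustrative TIME values

year  <- as.integer(substr(time_values, 1, 4))
month <- factor(substr(time_values, 6, 7),
                levels = sprintf("%02d", 1:12), ordered = TRUE)

year    # numeric years, e.g. 1998 1998 2022
month   # ordered factor with levels 01 < 02 < ... < 12
```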
Time Series Preprocessing
The variable indicating the month to which the corresponding unemployment figure refers (TIME) is automatically recognised as a character vector. One of the first things we need to do when handling data that indicate a time interval is to convert them to the appropriate variable type, which in R is called Date. The table below shows the existing variables, as well as the type automatically assigned to them based on the values they contain. R did a good job and identified that the Value variable is a numeric vector since it represents the unemployment level. The mapping of the three entities is also correct, since we are referring to their names and they will therefore be a character vector.
Table 2: Summary of variable types (without preprocessing)
| Variable Name | Variable Type |
|---|---|
| LOCATION | character |
| TIME | character |
| Value | numeric |
It was noted above that the dates take the form “YYYY-MM” (Year-Month) and were automatically recognised by the software as characters. With the help of the lubridate package I will convert the time variable to the Date type. And once we have made the conversion, if we check again we will see that the change was successful, with the variable now having the Date type.
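The conversion described above can be sketched with lubridate's ym() parser, which anchors each "YYYY-MM" value to the first day of the month (illustrative values below):

```r
library(lubridate)

# ym() parses "YYYY-MM" character values into Date objects,
# anchored to the first day of each month
time_chr  <- c("1998-04", "2022-08")   # illustrative values
time_date <- ym(time_chr)

class(time_date)                   # "Date"
format(time_date[1], "%Y-%m-%d")   # "1998-04-01"
```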
Phew! We have some good news. In this dataset there are a total of 0 missing values. In the event that observations were missing from my dataset, I would first need to investigate which of these variables were affected. In a second step, and depending on the type of the variable, I would either need to drop those rows entirely or attempt to impute their values using various methods.
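A missing-value check of this kind can be sketched as follows (a toy data frame stands in for the actual dataset):

```r
# Sketch of the missing-value check: count NAs per column and in total
# (toy data frame standing in for the unemployment dataset)
df <- data.frame(
  LOCATION = c("GRC", "GRC"),
  TIME     = c("1998-04", "1998-05"),
  Value    = c(10.9, 11.0)
)

colSums(is.na(df))   # NAs per column
sum(is.na(df))       # total NAs (0 here, as in the actual dataset)
```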
Descriptive Statistics
The unemployment data for Greece covers the period from April 1998 to August 2022. As for the EU data, it runs from January 2000 to August 2022. Over the last 20 years, shockingly large changes have occurred in our country’s unemployment rate, with the most abrupt change recorded during the periods of the economic crisis (after 2009). As an illustration, we can observe the difference in the unemployment level between September 2010, when it stood at 10%, and three years later, in September 2013, when it reached 28.1%. For completeness, tables are attached below showing the 5 months with the highest and lowest unemployment in Greece and in Europe over the last 20 years.
On the other hand, it is also interesting to look at the periods with the lowest observed unemployment. In Greece this period was just before the economic crisis, in 2008, while the EU27 is going through one of its best periods in terms of unemployment, with a 20-year historical low at 6%.
All of this can be summarised in the chart below, where the enormous changes in our country stand out. We can see that the 2008 crisis affected unemployment in the EU and across the developed world, as there is an upward trajectory over the same period. The EU has since recovered, but Greece has not managed to return to pre-crisis levels, although the trend is downward.
Figure 1: Comparison of unemployment time series between Greece, OECD countries, and the 27 European countries (taking into account the United Kingdom’s departure from the EU)
Examining Trend and Seasonality
In time series analysis it is important to separate the sources of variation in a time series and determine where they originate from. Time series have three basic components: trend (T), seasonality (S), and randomness (E). The trend describes the general direction the time series follows over time, whether upward, downward, or horizontal. In our case, for example, the sharp rise in unemployment after 2009 is a characteristic example of a strong upward trend. Seasonality refers to patterns that repeat with a fixed periodicity; for instance, if unemployment tends to increase every winter and decrease in summer, we are talking about a seasonal component. Finally, randomness (or the remainder) is what is left after removing the trend and seasonality, that is, the unpredictable fluctuations that cannot be attributed to any systematic pattern, such as the sudden increase in unemployment during the pandemic in March 2020.
y_t = S_t + T_t + E_t
Where:
- y_t denotes the data we have available,
- S_t is the seasonal component,
- T_t is the trend component,
- E_t is the random component.
Similarly, the multiplicative model:
y_t = S_t \cdot T_t \cdot E_t
where the components composing the time series are multiplied rather than added.
Show the code
decomp <- unemploymentType %>%
  model(stl = STL(Value)) %>%
  components() %>%
  as_tibble() %>%
  dplyr::filter(LOCATION == "GRC") %>%
  mutate(
    TIME = as.Date(TIME),
    trend = round(trend, 2),
    season_year = round(season_year, 4),
    remainder = round(remainder, 4)
  )

# Build highchart with multiple y-axes (stacked)
highchart(type = "stock") %>%
  hc_title(text = "Decomposition of the time series into trend, seasonality, and noise") %>%
  hc_subtitle(text = "The historical data on Greek unemployment suggest no clear trend. Additionally, a faint seasonality is observed until 2012, while in subsequent years the phenomenon becomes more pronounced.") %>%
  hc_caption(text = "stesiam, 2023") %>%
  # Observed
  hc_add_series(decomp, "line", hcaes(x = datetime_to_timestamp(TIME), y = Value),
                name = "Observation", yAxis = 0) %>%
  # Trend
  hc_add_series(decomp, "line", hcaes(x = datetime_to_timestamp(TIME), y = trend),
                name = "Trend", yAxis = 1) %>%
  # Seasonal
  hc_add_series(decomp, "line", hcaes(x = datetime_to_timestamp(TIME), y = season_year),
                name = "Seasonality", yAxis = 2) %>%
  # Random
  hc_add_series(decomp, "line", hcaes(x = datetime_to_timestamp(TIME), y = remainder),
                name = "Noise", yAxis = 3) %>%
  # Define 4 stacked y-axes
  hc_yAxis_multiples(
    list(title = list(text = "Data"), height = "25%", top = "0%", offset = 0,
         labels = list(enabled = FALSE), gridLineWidth = 0),
    list(title = list(text = "Trend"), height = "25%", top = "25%", offset = 0,
         labels = list(enabled = FALSE), gridLineWidth = 0),
    list(title = list(text = "Seasonality"), height = "25%", top = "50%", offset = 0,
         labels = list(enabled = FALSE), gridLineWidth = 0),
    list(title = list(text = "Residual"), height = "25%", top = "75%", offset = 0,
         labels = list(enabled = FALSE), gridLineWidth = 0)
  ) %>%
  hc_tooltip(shared = TRUE, valueDecimals = 2) %>%
  hc_rangeSelector(enabled = FALSE) %>%
  hc_scrollbar(enabled = FALSE) %>%
  hc_navigator(enabled = FALSE)
Figure 2: Time series decomposition using the STL method
From the chart above, it is very important to distinguish the seasonal component, because I need to know whether I am dealing with a model without seasonality using ARIMA, an autoregressive AR model, or a moving average MA model. If I detect seasonality I will need to use a model that incorporates it, such as Seasonal ARIMA (SARIMA), a seasonal autoregressive model (SAR), or a seasonal MA. The seasonality does not follow the same pattern throughout the entire time series. Up to 2004, seasonality is negligible; from then until 2014, there are signs of weak seasonality. From 2014 onwards, seasonality is stronger than at any other period since 1998. It is worth noting that most peaks occur, approximately, in the months of February and March.
I obtained an ambiguous picture with historically weak seasonality, which has been more pronounced in recent years. A simple method to get a quick answer is through the nsdiffs command from the forecast package. In this case we received a response of zero, which leads me to believe that any seasonality present has generally been weak.
Show the code
grc_unemployment <- unemployment %>%
  dplyr::filter(LOCATION == "GRC")

y <- ts(grc_unemployment$Value, start = c(1998, 4), frequency = 12)
nsdiffs(y, test = "ch")
Table 6: Results of various tests for evaluating seasonality
| Test | Value | Result |
|---|---|---|
| Canova-Hansen (CH) | - | No seasonal differencing required |
| OCSB | - | No seasonal differencing required |
| Seasonal Influence (STL) | 0.153 | Weak seasonality |
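The STL seasonal-strength figure reported above is typically computed as F_s = max(0, 1 - Var(remainder) / Var(seasonal + remainder)). A minimal sketch, with hypothetical components standing in for the actual STL output:

```r
# Seasonal strength of an STL decomposition:
# F_s = max(0, 1 - Var(R_t) / Var(S_t + R_t))
# (components below are hypothetical, not the actual STL output)
seasonal_strength <- function(seasonal, remainder) {
  max(0, 1 - var(remainder) / var(seasonal + remainder))
}

set.seed(1)
s <- 0.2 * sin(2 * pi * (1:120) / 12)   # weak hypothetical seasonal pattern
r <- rnorm(120, sd = 0.5)               # hypothetical remainder
seasonal_strength(s, r)                 # small value, i.e. weak seasonality
```

A value near 1 indicates a strong seasonal component; values near 0, such as the 0.153 above, indicate weak seasonality.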
Testing for Structural Breaks
Time series cannot be reliably interpreted, and consequently forecast, without taking various exogenous factors into account. In our case, we are studying and seeking to forecast unemployment in Greece over the coming months. The task becomes harder still because the model cannot understand how the patterns of change in the series came about: many movements may stem from external events that influence the specification of our model. It is therefore important to determine whether there are structural breaks, that is, defining events that may have altered the behaviour of the time series. Greece is exactly such a complex case: unemployment skyrocketed after the 2009 economic crisis, reaching a historical high at its peak, and the series also includes the pandemic period, which affected the unemployment index.
Show the code
breakPoints <- summary(breakpoints(grc_unemployment$Value ~ 1, h = 0.15))

summaryBreakPoints <- t(breakPoints[["RSS"]]) %>%
  as.data.frame() %>%
  mutate(across(everything(), ~ round(.x, 2)))

highchart() %>%
  hc_chart(type = "spline") %>%
  hc_title(text = "Performance by number of structural breaks") %>%
  hc_subtitle(text = "The Bayesian information criterion indicates the presence of two structural breaks. The lower the number, the better.") %>%
  # custom colors (muted, professional palette)
  hc_colors(c("#0072B2", "#E69F00")) %>%
  # series
  hc_add_series(data = summaryBreakPoints$RSS, name = "RSS", lineWidth = 3) %>%
  hc_add_series(data = summaryBreakPoints$BIC, name = "BIC", lineWidth = 3) %>%
  # axes
  hc_xAxis(
    type = "integer",
    lineColor = "#e6e6e6", tickColor = "#e6e6e6",
    title = list(text = "(#) Number of structural breaks"),
    labels = list(style = list(color = "#333333", fontSize = "12px"))
  ) %>%
  hc_yAxis(
    title = list(text = "Value"),
    gridLineColor = "#f0f0f0",
    labels = list(style = list(color = "#333333", fontSize = "12px"))
  ) %>%
  # tooltip
  hc_tooltip(
    shared = TRUE, crosshairs = TRUE,
    backgroundColor = "#ffffff", borderColor = "#cccccc",
    style = list(color = "#333333", fontSize = "12px"),
    xDateFormat = "%b %Y"
  ) %>%
  # plot options for styling
  hc_plotOptions(
    series = list(
      marker = list(enabled = FALSE, radius = 4, symbol = "circle"),
      states = list(hover = list(lineWidthPlus = 1))
    )
  ) %>%
  # legend
  hc_legend(
    align = "center", verticalAlign = "bottom", layout = "horizontal",
    itemStyle = list(fontSize = "13px", color = "#333333")
  )
Figure 3: Error diagram by number of structural breaks
Table 7: Table indicating the number of breaks
| Breaks | RSS | BIC |
|---|---|---|
| 0 | 11913.47 | 1928.50 |
| 1 | 3255.88 | 1559.78 |
| 2 | 1356.96 | 1314.70 |
| 3 | 1243.73 | 1300.53 |
| 4 | 1207.46 | 1303.22 |
| 5 | 1146.21 | 1299.33 |
There is a fairly large reduction in the Bayesian Information Criterion (BIC) when moving from a model with no structural break to one with two structural breaks, with a smaller reduction for three breaks. This is an important indication that my unemployment series was indeed affected by sudden factors. Of course, we suspected this, as we have the crisis period that contributed to high unemployment rates. Having determined the number of breaks, it is time to identify which segments need to be analysed; in short, to determine the date ranges during which a significant change in the behaviour of the time series was detected. According to the results I have:
Show the code
bp <- breakPoints$breakpoints
max_breaks <- nrow(bp)

result <- matrix(NA, nrow = max_breaks, ncol = max_breaks)
for (i in 1:max_breaks) {
  # take non-NA values from row i
  valid_breaks <- bp[i, !is.na(bp[i, ])]
  # put them in the first positions of row i
  result[i, 1:length(valid_breaks)] <- valid_breaks
}

cool_table <- as.data.frame(result)
colnames(cool_table) <- c("Break A", "Break B", "Break C",
                          "Break D", "Break E")[1:max_breaks]

cool_table <- cool_table %>%
  mutate("Number of Breaks" = 1:max_breaks) %>%
  select("Number of Breaks", everything())

cool_table %>%
  gt_custom(use_labels = FALSE) %>%
  sub_missing(everything(), missing_text = "-")
Table 8: Breakpoint results by number of breaks (the table reads better horizontally, by number of breaks)
Table 9: Dates of breakpoints by number of breaks (the table reads better horizontally, by number of breaks)
| Number of Breaks | Break A | Break B | Break C | Break D | Break E |
|---|---|---|---|---|---|
| 1 | 2011-03-01 | - | - | - | - |
| 2 | 2011-07-01 | 2018-04-01 | - | - | - |
| 3 | 2011-07-01 | 2015-07-01 | 2019-02-01 | - | - |
| 4 | 2002-01-01 | 2011-07-01 | 2015-07-01 | 2019-02-01 | - |
| 5 | 2002-04-01 | 2008-03-01 | 2011-10-01 | 2015-07-01 | 2019-02-01 |
Based on the error results, I will choose either 2 or 3 structural breaks, given the significant reduction in the BIC criterion at that number: at the third break there is a small further reduction, while at the fourth the criterion increases. Let us examine the proposed breaks individually. The 2-break model places breaks in July 2011 and April 2018, while the 3-break model places them in July 2011, July 2015, and February 2019. I deliberated at length over whether I could decide on my own which moment of the crisis was defining, as this is potentially a somewhat subjective judgement, which is why I prefer to let the model calculate it. Moreover, the entire period was quite turbulent and full of negative developments, so in reality no single clear break can be pinpointed.
Looking at the dates, the three-break model may be the one that makes the most sense to observers: in 2011 the problems were already showing up in the unemployment index; 2015 was a period of uncertainty; and February 2019 marks the country’s recovery, with the exit from the memoranda announced a few months earlier, in August 2018.
From Figure 1 it is very evident that our series does not move around any specific value, violating the first condition for a time series to be considered stationary. This indicates the need to use first-order differences for Greek unemployment. From the first difference (\Delta y_t = y_t - y_{t-1}) I observe a great improvement, as we no longer have the enormous deviations seen in the previous chart. The values mostly show no trend and move around values relatively close to zero. This is a good sign, though I have a mild concern as there are two points in the time series with relatively large deviations from zero. The first is between points 120 and 170, where the fluctuation around zero has deviated slightly, referring to the period between 2008 and 2012 (the crisis and deterioration of economic indicators). Another slightly problematic point is the 266th, referring to March-April 2020 and the imposition of movement restrictions to curb the spread of coronavirus when the first cases were detected in our country.
Show the code
grc_unemployment <- unemployment %>%
  dplyr::filter(LOCATION == "GRC")

grc_unemployment_diff1 <- diff(grc_unemployment$Value, differences = 1)

highchart() %>%
  hc_chart(type = "line") %>%
  hc_title(text = "First differences of unemployment in Greece") %>%
  hc_subtitle(text = "Differences hover around 0, with the exception of the early years of the Greek debt crisis and the movement restrictions due to COVID-19.") %>%
  hc_xAxis(title = list(text = "Time Point")) %>%
  hc_yAxis(title = list(text = "Unemployment Difference")) %>%
  hc_tooltip(shared = TRUE, valueDecimals = 2, valueSuffix = " %") %>%
  hc_add_series(data = grc_unemployment_diff1) %>%
  hc_legend(enabled = FALSE)
Figure 4: First-order difference time series of unemployment in Greece
Taking into account the above concerns, I also computed the second differences (\Delta^2 y_t = \Delta y_t - \Delta y_{t-1}) and visualised them. The chart is almost identical, with values fluctuating around zero even at the problematic point near the 150th observation. In the second differences I observe a more consistent fluctuation around zero, but at the same time there is an even larger deviation at the onset of COVID-19 measures in our country (3.6% versus 2.6% for the first differences).
Show the code
grc_unemployment_diff2 <- diff(grc_unemployment$Value, differences = 2)

highchart() %>%
  hc_chart(type = "line") %>%
  hc_title(text = "Second differences of unemployment") %>%
  hc_subtitle(text = "A time series with more stable fluctuation around zero, but with fairly large deviations from it at certain points.") %>%
  hc_xAxis(title = list(text = "Time Point")) %>%
  hc_yAxis(title = list(text = "Unemployment Difference (2nd order)")) %>%
  hc_tooltip(shared = TRUE, valueDecimals = 2, valueSuffix = " %") %>%
  hc_add_series(data = grc_unemployment_diff2) %>%
  hc_legend(enabled = FALSE)
Figure 5: Second-order difference time series of unemployment in Greece
Examining Stationarity with Statistical Tests
Graphical examination of stationarity is a fairly easy way to identify the presence of trends or whether our series has generally stable behaviour. Except in some very clear-cut cases, there will be times when many people may disagree on the stationarity of a series simply from its trajectory. This is understandable given that as a measure of evaluation it is somewhat subjective, being based on the personal opinion and interpretation each person gives to the movement of the time series. The use of this method alone may therefore lead to inconsistent or unstable decisions, since one person may consider a slight upward trend that reverts as stationary, while another may interpret the pattern as a faint but real trend. By analogy with normality tests, which include both graphical (quantile-quantile plot) and statistical tests (Kolmogorov-Smirnov test, Shapiro-Wilk test), we have similar alternatives for testing stationarity. This way we can have a more objective criterion for whether our series are stationary or not. Some of the best-known stationarity tests, which are also known as unit root tests, are as follows:
The DF test (Dickey-Fuller)
The ADF test (Augmented Dickey-Fuller)
The ADF-GLS test
The PP test (Phillips-Perron)
The KPSS test (Kwiatkowski-Phillips-Schmidt-Shin) and
The ZA test (Zivot-Andrews)
And this is where the chaos begins, which I started to realise when I began learning R. When using a programming language we do not have nice menus and tick boxes with options for each test. In R, as in other languages such as Python, there are packages that add functions and capabilities to each language. For stationarity tests, quite a few R packages have been built that serve a similar purpose. Some fairly well-known packages offering stationarity tests are:
tseries
urca
OK. But which one should I choose? Do they differ? We need to study our options carefully. In ready-made software the program makes certain sensible decisions for us or gives us the option to select certain values or tests. When we make this transition to a programming language, we have the freedom to see how that result was produced (by studying the corresponding package code). This increased freedom of movement comes with increased responsibilities on the part of the analyst, who will need to check what each package computes, along with its capabilities and limitations. In this case both packages, tseries and urca, are promoted as packages providing stationarity tests, that is, similar functionality. In brief, the tseries package, which I am most familiar with, is quite restrictive. Although many guides use this package, I realised that I could not set the lags for the tests or specify characteristics of the time series, such as whether it has an upward or downward trend. The urca package, on the other hand, addresses these limitations by allowing users to set the number of lags as well as time series characteristics. The only drawback of the urca package is that it does not provide a p-value in the test results, which is the most widely known measure for interpreting tests. Instead, the test statistic is computed and compared against the corresponding critical value; if the test statistic is more extreme than the critical value, the null hypothesis is rejected at the corresponding significance level.
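The difference between the two interfaces can be sketched on a toy series (a random walk, which by construction should not reject the unit-root null):

```r
library(tseries)   # adf.test(): fixed specification, reports a p-value
library(urca)      # ur.df(): configurable, reports statistics and critical values

set.seed(42)
y <- cumsum(rnorm(200))   # toy random walk (non-stationary by construction)

# tseries: a single call, p-value included in the printed output
adf.test(y)

# urca: the deterministic terms ("none", "drift", "trend") and the lag
# order are user-set; the summary shows the test statistic and the
# 1% / 5% / 10% critical values, but no p-value
summary(ur.df(y, type = "drift", lags = 12))
```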
Summary of Results
Summarising the results of the statistical stationarity tests, I conclude that the unemployment observations in Greece cannot be characterised as stationary. Furthermore, all classical tests agree on the presence of stationarity in the first- and second-order differences.
Table 10: Summary of stationarity test results
| Test | Result | Stationarity achieved at... |
|---|---|---|
| ADF | I(1) | first difference |
| PP | I(1) | first difference |
| KPSS | I(1) | first difference |
| ZA | I(1) | first difference |
| LS | I(1) | first difference |
DF Test
The Dickey-Fuller test is one of the simplest unit root tests for determining the stationarity of a time series. This test is based on the first-order autoregressive model, \text{AR}(1):
(Sir) Gilbert Walker
(Sir) Gilbert Thomas Walker (United Kingdom, 1868-1958) was a physicist and statistician who studied at Trinity College, University of Cambridge, where he emerged as the top undergraduate student in 1889. During his tenure at the Indian Meteorological Service, he developed the Yule-Walker equations (together with George Udny Yule), which today constitute a fundamental tool for autoregressive (AR) models. With these methods he successfully predicted monsoon patterns.
Photo source: Wikimedia Commons, 2021. Licence: Public Domain (published in 1925). For the original photograph please follow this link.
y_t = \phi y_{t-1} + e_t
Where y_t is the value of the time series and e_t is the error term. That is, it is a time series whose values are influenced by, and depend on, previous values of the series.
The test has specific variants based on the behaviour of the time series in question. More precisely, there are three variants:
Without a constant and without a trend, where the series moves around zero; rewriting the AR(1) model above in differences (with \gamma = \phi - 1) gives the test regression: \Delta y_t = \gamma y_{t-1} + e_t
With a constant, when the series moves around a fixed value (different from zero): \Delta y_t = \alpha + \gamma y_{t-1} + e_t
With a constant and trend, when the series appears to follow a downward or upward trajectory over time: \Delta y_t = \alpha + \beta t + \gamma y_{t-1} + e_t
This particular test has some problems in application, as it makes a series of assumptions:
our series follows an autoregressive model with p=1,
the errors are homoskedastic, and
the errors are uncorrelated.
The series is clearly non-stationary in the original observations, while the trend disappears in the first-order differences. I will therefore check the assumptions of the Dickey-Fuller test, simply to confirm in practice that this test has some significant weaknesses. First, I will build a regression model based on the DF formulas and then check the aforementioned assumptions, such as that of uncorrelated errors. In the case of the first differences I have eliminated the trend, and the constant is also around zero. I will therefore build a model of the first case:
\Delta y_t = \gamma y_{t-1} + e_t
Show the code
y <- grc_unemployment$Value
dy <- diff(y)

# DF regression without intercept (because the mean of Δy is ~ 0)
y_lag <- y[-length(y)]        # y_{t-1}, aligned with dy
mod <- lm(dy ~ y_lag + 0)     # + 0 = no intercept

Tdy <- length(dy)
h <- floor(sqrt(Tdy))

res_ljung  <- Box.test(residuals(mod), type = "Ljung-Box", lag = h)
dwtest_res <- dwtest(mod)
arch_res   <- ArchTest(residuals(mod), lags = 12)

result_table <- data.frame(
  "Test" = c("Ljung-Box", "Durbin-Watson", "ARCH Test"),
  "Test Statistic" = c(
    round(res_ljung[["statistic"]][[1]], 3),
    round(dwtest_res[["statistic"]][[1]], 3),
    round(arch_res[["statistic"]][[1]], 3)
  ),
  "Degrees of Freedom" = c(res_ljung$parameter, "-", arch_res$parameter),
  "p value" = c(
    "<0.0001",
    round(dwtest_res$p.value, 4),
    round(arch_res$p.value, 6)
  ),
  check.names = FALSE
)

result_table %>%
  gt_custom(use_labels = FALSE)
Table 11: Testing error assumptions (autocorrelation and homoskedasticity)
| Test | Test Statistic | Degrees of Freedom | p value |
|---|---|---|---|
| Ljung-Box | 158.036 | 17 | <0.0001 |
| Durbin-Watson | 1.655 | - | 0.0015 |
| ARCH Test | 39.874 | 12 | 7.5e-05 |
From the Ljung-Box and Durbin-Watson tests, I reject the hypothesis that the errors are uncorrelated. Finally, the ARCH test shows that the errors are not homoskedastic (they have varying variance). From these results it is clear that the DF test is not reliable and is generally not used, as its assumptions are quite restrictive and are typically violated. Tests such as the ADF and PP address this error behaviour in various ways to obtain valid results when assessing the stationarity of a series.
ADF Test
One of the most widely used unit root tests is the Augmented Dickey-Fuller (ADF) test, which generalises the simple Dickey-Fuller test by being based on a higher-order autoregressive model (p > 1). That is, the value of the time series depends not only on its immediately preceding value but on its p previous values. An autoregressive model of order p takes the following form: y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \dots + \phi_p y_{t-p} + u_t
Analogously with the simple Dickey-Fuller test, variants are defined based on the behaviour of the time series in question:
Without a constant and without a trend, where the series moves around zero: \Delta y_t = \gamma y_{t-1} + \sum\limits_{i=1}^{p} \delta_i \Delta y_{t-i}+ u_t
With a constant, when the series moves around a fixed value (different from zero): \Delta y_t = \alpha + \gamma y_{t-1} + \sum\limits_{i=1}^{p} \delta_i \Delta y_{t-i}+ u_t
With a constant and trend, when the series appears to follow a downward or upward trajectory over time:
\begin{aligned}
\Delta y_t & = \alpha + \beta t + \gamma y_{t-1} + \\ \\
&\quad + \sum\limits_{i=1}^{p} \delta_i \Delta y_{t-i}+ u_t
\end{aligned}
The ADF test has the following hypothesis structure:
H_0 : \gamma = 0 \text{, the series is not stationary}\\
H_1 : \gamma < 0 \text{, the series is stationary}
That is, if our results reject the null hypothesis, this indicates that the time series is stationary. Let us apply the above test to unemployment in Greece and Europe. However, before doing so I need to determine how many time lags to include. First, I need to set the maximum number of lags, p_{max}. This can be derived empirically from the form of my data: if they are quarterly I would set p_{max} = 4, and if they are monthly it is recommended to set p_{max} = 12. A more widespread approach is Schwert’s rule, which is based on the number of observations T: p_{max} = \left\lfloor 12 \left( \frac{T}{100} \right)^{1/4} \right\rfloor
To avoid prejudging the outcome, at this stage I have not yet fixed the exact number of lags for the Augmented Dickey-Fuller test; I have only set the upper bound p_{max} based on Schwert’s rule. That is, the number of lags p is an integer between 1 and 15, which can be expressed as follows: 1 \leq p \leq p_{max} = 15
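Schwert’s rule is easy to verify directly in base R; a minimal sketch (the helper name schwert_pmax is mine, and T = 293 is the number of observations quoted later in the post):

```r
# Schwert's rule: p_max = floor(12 * (T / 100)^(1/4)), T = number of observations
schwert_pmax <- function(T) {
  floor(12 * (T / 100)^(1 / 4))
}

schwert_pmax(293)  # the 293 monthly observations used here -> 15
```

With T = 293 the rule gives exactly the upper bound of 15 used above.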
Enough of that, though. It is time to find the ideal lags. There are two ways to determine them:
Sequential t-testing, or
Using various information criteria, such as AIC and BIC.
I am now in a position to run the Augmented Dickey-Fuller test via the urca package using the ur.df() command, setting the lags parameter equal to the maximum lag value p_{max} = 15. Do not be misled by the parameter name: you are not setting the number of lags but the maximum number of them.
Show the code
types <- c("none", "drift", "trend")
# run ur.df for each case
results <- lapply(types, function(t) {
  ur.df(grc_unemployment$Value, type = t, lags = 15, selectlags = "AIC")
})
result_table <- data.frame(
  "Test" = c("ADF (none)", "ADF (drift)", "ADF (trend)"),
  "Test Statistic" = c(round(results[[1]]@teststat[1], 3),
                       round(results[[2]]@teststat[1], 3),
                       round(results[[3]]@teststat[1], 3)),
  "Lag" = c(length(results[[1]]@testreg$coefficients[, 1]) - 1,
            length(results[[2]]@testreg$coefficients[, 1]) - 2,
            length(results[[3]]@testreg$coefficients[, 1]) - 3),
  "Critical Value (1%)" = c(results[[1]]@cval[1, 1],
                            results[[2]]@cval[1, 1],
                            results[[3]]@cval[1, 1]),
  "Critical Value (5%)" = c(results[[1]]@cval[1, 2],
                            results[[2]]@cval[1, 2],
                            results[[3]]@cval[1, 2]),
  check.names = FALSE
)
result_table %>% gt_custom(use_labels = FALSE)
Table 12: Augmented Dickey-Fuller (ADF) test results
| Test | Test Statistic | Lag | Critical Value (1%) | Critical Value (5%) |
|---|---|---|---|---|
| ADF (none) | -0.870 | 10 | -2.58 | -1.95 |
| ADF (drift) | -1.889 | 10 | -3.44 | -2.87 |
| ADF (trend) | -2.039 | 10 | -3.98 | -3.42 |
Given that the results in Table 12 are not statistically significant, I can conclude that my time series is not stationary. The use of differenced observations therefore makes sense, and they should be re-tested for stationarity. It is worth recalling that these differences are shown in Figure 4, where we observe that both the trend and the clustering of observations at values far from zero have been eliminated. The most appropriate course of action in this case is to evaluate the test under the simplest model, with neither a constant (drift) nor any form of trend. For completeness, all cases are included in the results below.
Show the code
types <- c("none", "drift", "trend")
# run ur.df for each case
results <- lapply(types, function(t) {
  ur.df(diff(grc_unemployment$Value, 1), type = t, lags = 10, selectlags = "Fixed")
})
result_table <- data.frame(
  "Test" = c("ADF (none)", "ADF (drift)", "ADF (trend)"),
  "Test Statistic" = c(glue(round(results[[1]]@teststat[1], 3), "*"),
                       round(results[[2]]@teststat[1], 3),
                       round(results[[3]]@teststat[1], 3)),
  "Lag" = c(length(results[[1]]@testreg$coefficients[, 1]) - 1,
            length(results[[2]]@testreg$coefficients[, 1]) - 2,
            length(results[[3]]@testreg$coefficients[, 1]) - 3),
  "Critical Value (1%)" = c(results[[1]]@cval[1, 1],
                            results[[2]]@cval[1, 1],
                            results[[3]]@cval[1, 1]),
  "Critical Value (5%)" = c(results[[1]]@cval[1, 2],
                            results[[2]]@cval[1, 2],
                            results[[3]]@cval[1, 2]),
  check.names = FALSE
)
result_table %>% gt_custom(use_labels = FALSE)
Table 13: Augmented Dickey-Fuller (ADF) test results on the differences of the original observations
| Test | Test Statistic | Lag | Critical Value (1%) | Critical Value (5%) |
|---|---|---|---|---|
| ADF (none) | -2.119* | 10 | -2.58 | -1.95 |
| ADF (drift) | -2.111 | 10 | -3.44 | -2.87 |
| ADF (trend) | -2.194 | 10 | -3.98 | -3.42 |
Since the test statistic (-2.119) is smaller than the corresponding critical value (-1.95), we can reject the null hypothesis of non-stationarity for the differences of the original Greek unemployment observations, at the 5% significance level.
Show the code
types <- c("none", "drift", "trend")
results_diff <- lapply(types, function(t) {
  ur.df(diff(grc_unemployment$Value, 1), type = t, lags = 10, selectlags = "Fixed")
})
results_level <- lapply(types, function(t) {
  ur.df(grc_unemployment$Value, type = t, lags = 15, selectlags = "AIC")
})
result_table <- data.frame(
  "Series Form" = c(rep("In levels", length(types)),
                    rep("First differences", length(types))),
  "Specification" = c("None", "Drift", "Trend", "None", "Drift", "Trend"),
  "ADF Statistic" = c(round(results_level[[1]]@teststat[1], 3),
                      round(results_level[[2]]@teststat[1], 3),
                      round(results_level[[3]]@teststat[1], 3),
                      round(results_diff[[1]]@teststat[1], 3),
                      round(results_diff[[2]]@teststat[1], 3),
                      round(results_diff[[3]]@teststat[1], 3)),
  # count coefficient rows, not matrix cells, so the lag is reported correctly
  "Lag (in regression)" = c(nrow(results_level[[1]]@testreg$coefficients) - 1,
                            nrow(results_level[[2]]@testreg$coefficients) - 2,
                            nrow(results_level[[3]]@testreg$coefficients) - 3,
                            nrow(results_diff[[1]]@testreg$coefficients) - 1,
                            nrow(results_diff[[2]]@testreg$coefficients) - 2,
                            nrow(results_diff[[3]]@testreg$coefficients) - 3),
  "Critical Value (1%)" = c(results_level[[1]]@cval[1, 1],
                            results_level[[2]]@cval[1, 1],
                            results_level[[3]]@cval[1, 1],
                            results_diff[[1]]@cval[1, 1],
                            results_diff[[2]]@cval[1, 1],
                            results_diff[[3]]@cval[1, 1]),
  "Critical Value (5%)" = c(results_level[[1]]@cval[1, 2],
                            results_level[[2]]@cval[1, 2],
                            results_level[[3]]@cval[1, 2],
                            results_diff[[1]]@cval[1, 2],
                            results_diff[[2]]@cval[1, 2],
                            results_diff[[3]]@cval[1, 2]),
  check.names = FALSE
)
# render with gt_custom and source note
result_table %>%
  gt_custom(use_labels = FALSE) %>%
  tab_source_note(
    source_note = "Note: The null hypothesis of the ADF test is the presence of a unit root. The lag refers to the number of lags used in the regression."
  )
Table 14: Summary of the Augmented Dickey-Fuller (ADF) test results in levels and in first differences
| Series Form | Specification | ADF Statistic | Lag (in regression) | Critical Value (1%) | Critical Value (5%) |
|---|---|---|---|---|---|
| In levels | None | -0.870 | 10 | -2.58 | -1.95 |
| In levels | Drift | -1.889 | 10 | -3.44 | -2.87 |
| In levels | Trend | -2.039 | 10 | -3.98 | -3.42 |
| First differences | None | -2.119 | 10 | -2.58 | -1.95 |
| First differences | Drift | -2.111 | 10 | -3.44 | -2.87 |
| First differences | Trend | -2.194 | 10 | -3.98 | -3.42 |
Note: The null hypothesis of the ADF test is the presence of a unit root. The lag refers to the number of lags used in the regression.
PP Test
Another stationarity test is the Phillips-Perron test, which is based on the simple Dickey-Fuller. The logic is the same as the ADF and the hypothesis structure is identical, but they differ significantly in how they handle autocorrelation of the errors. The PP test does not add lags to reduce error autocorrelation (as the ADF does); instead, it attempts to modify the test statistic by estimating the long-run variance \hat{\lambda}.
The only term that cannot be determined in advance is the long-run variance, as its summation depends on a parameter q. That is, we have not yet decided how many terms to include in the computation and estimation of the long-run variance \hat{\lambda}. For the bandwidth parameter q there are some simple calculation rules, which coincide with those for the maximum number of lags in the ADF test. For small samples: q_{\text{small}} = \left\lfloor 4 \left( \frac{T}{100} \right)^{1/4} \right\rfloor \text{, and for large samples: } q_{\text{large}} = \left\lfloor 12 \left( \frac{T}{100} \right)^{1/4} \right\rfloor
The hypothesis structure takes the same form as the tests mentioned above, with the null hypothesis assuming non-stationarity of our time series.
H_0 : \text{The series is not stationary} \\
H_1 : \text{The series is stationary}
The Phillips-Perron test can be executed via the urca package with the ur.pp command. It accepts as parameters our time series, the test statistic we wish to compute, and the type of series (with trend, without trend). Finally, a parameter that can be set is the maximum number of terms to use for estimating the long-run variance, referred to as lags in the package. This is nothing other than the q we discussed above. The package gives us two options: either set our own value via the use.lag parameter, or use one of two options (short or long) via the lags parameter. Reading the ur.pp source code reveals that if we choose the automatic determination of q through the two options provided, the formulas are identical to those defined above for q_{\text{small}} and q_{\text{large}}.
Show the code
types <- c("constant", "trend")
results_level <- lapply(types, function(t) {
  ur.pp(grc_unemployment$Value, type = "Z-tau", model = t, lags = "long")
})
results_diff <- lapply(types, function(t) {
  ur.pp(diff(grc_unemployment$Value, 1), type = "Z-tau", model = t, lags = "long")
})
result_table <- data.frame(
  "Variable" = c("Levels", "Levels", "First Differences", "First Differences"),
  "Specification" = c("Constant", "Trend", "Constant", "Trend"),
  "Test Statistic" = c(round(results_level[[1]]@teststat[1], 3),
                       round(results_level[[2]]@teststat[1], 3),
                       round(results_diff[[1]]@teststat[1], 3),
                       round(results_diff[[2]]@teststat[1], 3)),
  "Lag (Newey-West)" = c(results_level[[1]]@lag,
                         results_level[[2]]@lag,
                         results_diff[[1]]@lag,
                         results_diff[[2]]@lag),
  "Critical Value (1%)" = c(round(results_level[[1]]@cval[1, 1], 3),
                            round(results_level[[2]]@cval[1, 1], 3),
                            round(results_diff[[1]]@cval[1, 1], 3),
                            round(results_diff[[2]]@cval[1, 1], 3)),
  "Critical Value (5%)" = c(round(results_level[[1]]@cval[1, 2], 3),
                            round(results_level[[2]]@cval[1, 2], 3),
                            round(results_diff[[1]]@cval[1, 2], 3),
                            round(results_diff[[2]]@cval[1, 2], 3)),
  check.names = FALSE
)
result_table %>% gt_custom(use_labels = FALSE)
Table 15: Phillips-Perron test results for the original observations and the first differences
| Variable | Specification | Test Statistic | Lag (Newey-West) | Critical Value (1%) | Critical Value (5%) |
|---|---|---|---|---|---|
| Levels | Constant | -1.110 | 15 | -3.454 | -2.872 |
| Levels | Trend | -0.552 | 15 | -3.993 | -3.427 |
| First Differences | Constant | -16.959 | 15 | -3.454 | -2.872 |
| First Differences | Trend | -16.957 | 15 | -3.993 | -3.427 |
Consequently, we have a failure to reject the null hypothesis (H_0) for the original data and a rejection for the first differences. This means that the Greek unemployment observations were not stationary, but their differences were. The PP test thus confirms the findings of the ADF test.
KPSS Test
Another widely used test is the KPSS test. Although its aim is the same, its logic is the reverse of the previous tests: the null hypothesis is stationarity, whereas in the ADF and PP tests the null is non-stationarity (the presence of a unit root).
H_0 : \text{The series is stationary} \\
H_1 : \text{The series is not stationary}
Show the code
kpss_level <- ur.kpss(grc_unemployment$Value, type = "mu", lags = "long")
kpss_diff1 <- ur.kpss(diff(grc_unemployment$Value, 1), type = "mu", lags = "long")
kpss_diff2 <- ur.kpss(diff(grc_unemployment$Value, differences = 2), type = "mu", lags = "long")
result_table <- data.frame(
  "Series Form" = c("In levels", "First differences", "Second differences"),
  "KPSS Statistic" = c(round(kpss_level@teststat[1], 3),
                       round(kpss_diff1@teststat[1], 3),
                       round(kpss_diff2@teststat[1], 3)),
  "Lag" = c(kpss_level@lag, kpss_diff1@lag, kpss_diff2@lag),
  "Critical Value (1%)" = c(round(kpss_level@cval[1, 4], 3),
                            round(kpss_diff1@cval[1, 4], 3),
                            round(kpss_diff2@cval[1, 4], 3)),
  "Critical Value (5%)" = c(round(kpss_level@cval[1, 2], 3),
                            round(kpss_diff1@cval[1, 2], 3),
                            round(kpss_diff2@cval[1, 2], 3)),
  check.names = FALSE
)
result_table %>%
  gt_custom(use_labels = FALSE) %>%
  tab_source_note(
    source_note = "Note: The null hypothesis of KPSS is stationarity. The lag is used for the estimation of the long-run variance."
  )
Table 16: KPSS test results for the observations and the first differences
| Series Form | KPSS Statistic | Lag | Critical Value (1%) | Critical Value (5%) |
|---|---|---|---|---|
| In levels | 0.973 | 15 | 0.739 | 0.463 |
| First differences | 0.297 | 15 | 0.739 | 0.463 |
| Second differences | 0.298 | 15 | 0.739 | 0.463 |
Note: The null hypothesis of KPSS is stationarity. The lag is used for the estimation of the long-run variance.
The null hypothesis of stationarity is rejected for the original observations at the 1% significance level. By contrast, the test cannot reject stationarity for the differenced observations. These results are consistent with both the PP and the ADF tests.
ZA and LS Tests
Finally, another stationarity test is the Zivot-Andrews (ZA) test. This particular test statistic differs from the previous ones in that it takes into account certain points at which the time series changes behaviour. These time points could be various dates on which significant events occurred that may have affected the behaviour of the time series. It is evident that the debt crisis constitutes a time point that affected the behaviour of the time series. The use of specialised tests with structural breaks is deemed necessary for our case.
The best-known tests for stationarity in time series with structural breaks are:
The Zivot-Andrews (ZA) test, if there is only one structural break, and
The Lee-Strazicich (LS) test, if there are two structural breaks.
But how many breaks are there in a time series?
For completeness I will include the Zivot-Andrews test, even though it is not ideal here, since it assumes a single structural break and two were found above. Another problem also arises at this point. Although packages and commands exist to compute the test statistic for the classical tests, and for structural-break tests such as Zivot-Andrews, there is no equivalent implementation of the Lee-Strazicich test. Fortunately, instead of resorting to a statistical package (e.g. EViews), I found a relevant repository on GitHub, where user hannes101 has written a series of functions for this specific test, following a logic similar to that of the urca package. I should note that even the simple test takes quite a while to run; on my machine it took 5 minutes. The only drawback of this procedure is that the object returned by the function does not include critical values, which means I need to look up the critical value tables for my specific case in the original Lee-Strazicich paper.
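For the Zivot-Andrews part, urca does provide an implementation, ur.za(). A sketch on a simulated random walk standing in for the unemployment series (the model and lag choices here are illustrative assumptions, not necessarily those behind Table 17):

```r
library(urca)

set.seed(1)
y <- cumsum(rnorm(293))  # simulated random walk in place of the real series

# ZA test allowing a break in both the intercept and the trend
za <- ur.za(y, model = "both", lag = 10)
summary(za)  # statistic, critical values, and the estimated break position
```

Unlike the plain ADF output, summary() here also reports the position of the estimated break, which is useful for checking whether it lines up with the debt crisis.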
Table 17: Stationarity test results with structural breaks
| Test | Variable | Test Statistic | Critical Value (5%) |
|---|---|---|---|
| Zivot-Andrews | Unempl | -4.39 | -5.08 |
| Zivot-Andrews | Δ Unempl | -16.42 | -5.08 |
| Lee-Strazicich | Unempl | -4.71 | -5.65 |
| Lee-Strazicich | Δ Unempl | -7.72 | -5.65 |
Model Identification
Above I concluded that the unemployment observations are not stationary; however, their first differences constitute a stationary time series. At this point I would like to investigate which (S)ARIMA(p, d, q) model is appropriate for my case. For this reason I will construct autocorrelation and partial autocorrelation diagrams in order to identify the ideal values of p and q. Finally, it is worth recalling that in an ARIMA model, d represents the order of differencing, which in our case is 1.
Original Observations
Table 18: Autocorrelation results for the original observations, first 12 lags
| Lag | ACF | pACF |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0.997 | 0.997 |
| 2 | 0.992 | −0.1 |
| 3 | 0.988 | −0.068 |
| 4 | 0.982 | −0.064 |
| 5 | 0.977 | −0.048 |
| 6 | 0.97 | −0.112 |
| 7 | 0.963 | −0.144 |
| 8 | 0.955 | −0.04 |
| 9 | 0.946 | 0.005 |
| 10 | 0.937 | −0.117 |
| 11 | 0.926 | −0.166 |
For completeness, we will start with the given unemployment observations. We already know that they are not stationary, but there are specific patterns that we need to observe in the autocorrelation diagrams in order to confirm this. The autocorrelation diagram decreases at an extremely slow rate, which is a strong indication of non-stationarity of the series.
Show the code
acf_plot_function(grc_unemployment$Value, min=-0.25, max =1)
Table 19: Autocorrelation results for the first differences, first 12 lags
| Lag | ACF | pACF |
|---|---|---|
| 0 | 1 | 1 |
| 1 | 0.172 | 0.172 |
| 2 | 0.118 | 0.091 |
| 3 | 0.145 | 0.116 |
| 4 | 0.122 | 0.076 |
| 5 | 0.191 | 0.148 |
| 6 | 0.284 | 0.225 |
| 7 | 0.178 | 0.087 |
| 8 | 0.003 | −0.107 |
| 9 | 0.223 | 0.166 |
| 10 | 0.338 | 0.271 |
| 11 | 0.18 | 0.048 |
The non-stationarity of my data, confirmed countless times above, compels us to take the first differences and obtain the corresponding autocorrelation diagrams. I observe a rather difficult-to-interpret situation from the diagrams, as there are significant fluctuations in both diagrams that make it hard to decide on the appropriate ARIMA model. Normally, to determine the right lag orders I look for a sharp drop within the red region, but there are quite a few statistically significant exceedances. After the 10th lag this tendency diminishes or disappears significantly, so my model likely has parameters p and q somewhere between 1 and 10.
Show the code
acf_plot_function(grc_unemployment_diff1, min=-0.2, max =0.5)
Figure 8: Autocorrelation diagram of first-order differences
Figure 9: Partial autocorrelation diagram of first-order differences
Model Building
Splitting the Time Series
Forecasting future unemployment requires us to build a model. Established methods include ARIMA models and Exponential Smoothing (ETS) models. Simply running a model says very little unless we evaluate its parameters. More generally, in work where we try to build predictive models we want to evaluate the model’s power, and for this reason we typically split our data into two parts. One part is used for training the model (train set) and the remainder for evaluating it (test set). The usual split is 70/30 or 80/20 respectively. This approach generally works well for classical machine learning problems such as classification and regression.
flowchart LR
A[Full Dataset <br> <small>1998-2022 <br> 293 observations</small>]
A --> B[Training <br> 1998-2015]
A --> C[Evaluation <br> 2015-2022]
Figure 10: Diagram of data split into training and evaluation sets (train / test split)
However, if we have time series problems, things are not so straightforward. Let us first suppose we want to apply the same logic to these. We would then end up with a model that ignores one fifth of the observations, specifically the most recent ones, which may well influence the variability of the time series. For this reason, other methods have been proposed for time series.
flowchart LR
C[Full Time Series <br> <small>293 observations</small>]
C --> F[1st Fold <br> <small>Training: 1998-2010 <br> Forecast: 2011</small>]
C --> G[2nd Fold <br> <small>Training: 1998-2011 <br> Forecast: 2012</small>]
C --> H[3rd Fold <br> <small>Training: 1998-2012 <br> Forecast: 2013</small>]
C --> I[k-th Fold]
C --> J[N-th Fold <br> <small>Training: 1998-2021 <br> Forecast: 2022</small>]
Figure 11: Diagram of data splitting for time series
All of this may sound somewhat complicated. I am not entirely sure whether the diagram above fully conveys the concept of time series cross-validation. I have included an interactive application for you to try for yourselves, so that the difference is clear and you can see how the various options affect the result.
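In code, the same rolling scheme can be sketched with tsibble's stretch_tsibble(), which generates the expanding folds of the diagram. The toy series and fold sizes below are illustrative assumptions, not the post's actual setup:

```r
library(tsibble)
library(fable)
library(dplyr)

set.seed(2)
# toy monthly series standing in for the unemployment data
toy <- tsibble(month = yearmonth("2015 Jan") + 0:59,
               value = 15 + cumsum(rnorm(60, sd = 0.3)),
               index = month)

# expanding folds: the first uses 36 months, each later fold adds one month
folds <- toy %>% stretch_tsibble(.init = 36, .step = 1)

# one model per fold, a one-step-ahead forecast, scored against the held-out data
cv_accuracy <- folds %>%
  model(arima = ARIMA(value ~ pdq(1, 1, 1) + PDQ(0, 0, 0))) %>%
  forecast(h = 1) %>%
  accuracy(toy)
cv_accuracy
```

Each fold's model only ever sees data before the month it forecasts, which is exactly the property a plain random split would destroy.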
Specifying ARIMA Models
For the simple ARIMA model I need to determine my three parameters. I know that d = 1, since I achieved stationarity with the first differences. For the determination of p I suspect 1, 6, and 10, as there was a large sharp drop in the partial autocorrelation diagram of the first differences after these values. Finally, for q I suspect 1, 3, 6, and 10. From the diagram I cannot draw a safe conclusion about their values, as there are quite a few significant autocorrelations at least up to the 10th lag. The graphical determination of p and q is quite subjective and provides uncertain conclusions. For this reason it would be good to evaluate a range of parameter combinations for the ARIMA time series models in order to find the model with the optimal parameters.
George Box
George E.P. Box (United Kingdom, 1919-2013) was a statistician with very significant contributions to the field of time series. In collaboration with Gwilym Jenkins (United Kingdom, 1932-1982), he created the methodology for ARIMA time series, which many textbooks refer to as the Box-Jenkins methodology. Beyond this, in collaboration with the Finnish-American statistician Greta M. Ljung (b. 1941), he developed the test for autocorrelation of errors known as the Ljung-Box test, which we will need later on.
Photo source: Wikimedia Commons, 2011. Licence: CC BY-SA 3.0. For the original photograph please follow this link.
Table 20: ARIMA model results based on the Akaike information criterion
| Model | BIC | AIC | AICc |
|---|---|---|---|
| ARIMA (1,1,1) | 305.5 | 294.4 | 294.5 |
| ARIMA (1,1,2) | 311.1 | 296.4 | 296.5 |
| ARIMA (2,1,1) | 311.1 | 296.4 | 296.5 |
| ARIMA (1,1,3) | 313.8 | 295.4 | 295.6 |
| ARIMA (3,1,1) | 314.7 | 296.3 | 296.5 |
| ARIMA (3,1,2) | 314.9 | 292.9 | 293.1 |
| ARIMA (2,1,3) | 315.1 | 293.0 | 293.3 |
| ARIMA (2,1,2) | 319.1 | 293.4 | 293.8 |
| ARIMA (1,1,0) | 325.5 | 318.2 | 318.2 |
| ARIMA (0,1,1) | 326.8 | 319.5 | 319.5 |
| ARIMA (2,1,0) | 328.8 | 317.8 | 317.9 |
| ARIMA (3,1,0) | 330.6 | 315.9 | 316.0 |
| ARIMA (0,1,2) | 330.7 | 319.7 | 319.8 |
| ARIMA (0,1,3) | 333.8 | 319.1 | 319.3 |
Older guides or articles would likely have suggested using auto.arima to find the best ARIMA. Well, this article did exactly that until I realised that the fable library gives us the necessary information in a more organised way, so using the model command with various parameter combinations for ARIMA models gives us a ready-made table with:
the Akaike information criteria (AIC),
the corrected Akaike information criterion (AICc), and
the Bayesian information criterion (BIC).
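The pattern looks roughly like this; a sketch on an illustrative series (in the post it would be the unemployment tsibble, and the pdq grid shown is only a subset of the combinations in the table):

```r
library(fable)
library(tsibble)
library(dplyr)

set.seed(1)
toy <- tsibble(month = yearmonth("2010 Jan") + 0:119,
               value = 20 + cumsum(rnorm(120, sd = 0.4)),
               index = month)

# fit several candidate ARIMA specifications in one call
fits <- toy %>%
  model("ARIMA (1,1,1)" = ARIMA(value ~ pdq(1, 1, 1) + PDQ(0, 0, 0)),
        "ARIMA (2,1,1)" = ARIMA(value ~ pdq(2, 1, 1) + PDQ(0, 0, 0)),
        "ARIMA (1,1,2)" = ARIMA(value ~ pdq(1, 1, 2) + PDQ(0, 0, 0)))

# one row per model with its information criteria
ic_table <- fits %>% glance() %>% select(.model, AIC, AICc, BIC) %>% arrange(BIC)
ic_table
```

glance() is what produces the AIC/AICc/BIC columns in one tidy table, rather than fitting each model separately as auto.arima-based workflows would.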
But what should we be looking for? This is very important because, as you can see, each criterion points to a different model as the best. Based on the AIC and AICc criteria I should choose the ARIMA(3,1,2) model, while BIC proposes ARIMA(1,1,1). In general, we look for the model with the smallest values on these criteria. Certain models are clearly unsuitable for this particular time series, such as ARIMA(0,1,2), which performs poorly on all criteria. The ideal model is ARIMA(1,1,1): it achieves the best performance on the BIC criterion and sits close to the best on AIC and AICc, while the models that AIC and AICc marginally prefer use five parameters; a two-parameter model is likely to generalise better to the behaviour of the time series.
Specifying the ETS Model
The Exponential Smoothing (ETS) method is an equally popular approach to forecasting time series. Its great advantage is that it does not require the tests we performed for the ARIMA models. Despite its flexible methodology, we still need to examine the time series graphically and observe its properties (trend, seasonality, cyclicality) in order to specify the ETS model that suits our case.
ETS(\text{Error}, \text{Trend}, \text{Seasonality})
The smoothing model I will use is based on these characteristics.
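In fable, each letter of the ETS(Error, Trend, Seasonality) triple maps onto a special term in the model formula; a sketch on an illustrative positive-valued series (multiplicative errors require positive data; only two of the table's specifications are shown):

```r
library(fable)
library(tsibble)
library(dplyr)

set.seed(42)
toy <- tsibble(month = yearmonth("2010 Jan") + 0:119,
               value = 50 + cumsum(rnorm(120, sd = 0.3)),
               index = month)

# M-A-N = multiplicative error, additive trend, no seasonality, and so on
ets_fits <- toy %>%
  model("M-A-N" = ETS(value ~ error("M") + trend("A") + season("N")),
        "A-A-N" = ETS(value ~ error("A") + trend("A") + season("N")))

ets_ic <- ets_fits %>% glance() %>% select(.model, AIC, AICc, BIC)
ets_ic
```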
Table 21: ETS model results based on the BIC, AIC, AICc information criteria
| Model | BIC | AIC | AICc |
|---|---|---|---|
| M-A-N | 1,112.2 | 1,093.8 | 1,094.0 |
| M-Ad-N | 1,115.9 | 1,093.8 | 1,094.1 |
| M-N-N | 1,126.7 | 1,115.7 | 1,115.8 |
| A-A-N | 1,152.1 | 1,133.7 | 1,133.9 |
| A-Ad-N | 1,154.7 | 1,132.6 | 1,132.9 |
| A-N-N | 1,172.9 | 1,161.9 | 1,161.9 |
| A-A-A | 1,221.4 | 1,158.9 | 1,161.1 |
| M-Ad-M | 1,224.0 | 1,157.7 | 1,160.2 |
| A-A-M | 1,260.1 | 1,197.6 | 1,199.8 |
The table shows that the ETS(M,A,N) model, that is, with multiplicative error, linear trend, and no seasonality, has the best performance on the BIC criterion. Very close behind is ETS(M,Ad,N), which differs only in that the trend damps gradually rather than continuing linearly. The absence of a seasonal component (N) from the best-performing models confirms the findings of the seasonality section: seasonality in Greek unemployment is weak and does not improve the models. By contrast, models with a seasonal component (A-A-A, M-Ad-M, A-A-M) sit at the bottom of the table, with the added complexity not offset by better fit.
Prophet Model
The solutions above are potentially quite complex and require deep theoretical knowledge as well as basic statistical programming skills. An easy alternative for quick time series forecasting is the Prophet algorithm. It was created by Facebook in 2017 and is a fairly popular forecasting method. The eponymous R package, with the prophet() command, gives us an estimate in a single line of code. One point worth noting is that it requires data in data frame format. For smooth application, you should rename your columns appropriately so that it understands which columns it needs to analyse.
Show the code
prophet_df <- grc_unemployment %>% rename(ds = TIME, y = Value)
prophet_res <- prophet::prophet(df = prophet_df)
Unlike ARIMA and ETS models, Prophet does not require explicit stationarity testing or parameter specification because it automatically decomposes the time series into trend, seasonality, and holidays. This ease of use makes it popular in industry applications, but it does not guarantee better performance. In the comparison table that follows we will see how it performs relative to the classical models.
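The forecasting step itself, not shown above, is equally short. A self-contained sketch on a toy monthly series in the ds/y layout Prophet expects (in the post one would reuse prophet_res and the renamed data frame instead):

```r
library(prophet)

set.seed(3)
toy <- data.frame(ds = seq(as.Date("2015-01-01"), by = "month", length.out = 60),
                  y  = 15 + cumsum(rnorm(60, sd = 0.2)))

m <- prophet(toy)  # fit; trend and seasonality are handled automatically
future <- make_future_dataframe(m, periods = 6, freq = "month")  # extend 6 months
fc <- predict(m, future)  # yhat is the point forecast
tail(fc[, c("ds", "yhat", "yhat_lower", "yhat_upper")], 6)
```

yhat_lower and yhat_upper give Prophet's uncertainty band, the counterpart of the prediction intervals the ARIMA models produce.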
Model Comparison
Up to now I used the AIC and BIC criteria to select the best parameters within each model type, for example, which ARIMA is the best among the other ARIMAs. These criteria cannot, however, tell me whether an ARIMA is better than an ETS or Prophet, because each model type computes them in a different way and they are not directly comparable across types. For this reason I need a common basis for comparison. The logic is simple: each model makes forecasts for months for which I already know the actual unemployment figure, and I measure how far off it was. The smaller the error, the better the model. The metrics I will use are the Mean Absolute Error (MAE) and the Root Mean Squared Error (RMSE):
MAE = \frac{1}{n} \sum_{i=1}^{n} |y_i - \hat{y}_i|
RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}
y_i : actual value, that is, the unemployment figure for the given month \hat{y}_i : estimated value, that is, the unemployment figure forecast by my model
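Both metrics are one-liners in base R; a toy example with made-up actual and forecast values:

```r
y     <- c(12.5, 12.3, 12.0, 11.8)  # actual unemployment rates (illustrative)
y_hat <- c(12.4, 12.5, 11.7, 11.9)  # model forecasts (illustrative)

mae  <- mean(abs(y - y_hat))        # mean absolute error: 0.175
rmse <- sqrt(mean((y - y_hat)^2))   # root mean squared error: ~0.194
```

RMSE squares the errors before averaging, so it penalises large misses more heavily than MAE does.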
Table 23: Comparison of errors across different models for a 6-month forecast
| Model | RMSE | MAE |
|---|---|---|
| ARIMAX (1,1,1) - 1 break | 0.8504 | 0.6243 |
| ARIMA (1,1,1) | 0.8510 | 0.6252 |
| ARIMAX (1,1,1) - 2 breaks | 0.8879 | 0.6449 |
| ETS (M-Ad-N) | 0.8913 | 0.6703 |
| ARIMA (2,1,1) | 0.8920 | 0.6682 |
| ETS (M-A-N) | 0.8996 | 0.6776 |
| Naive | 0.9007 | 0.6946 |
| ARIMA (1,1,2) | 0.9165 | 0.6941 |
| Auto ARIMA | 0.9411 | 0.7101 |
| SNaive | 2.1633 | 1.9108 |
From the tables above it emerges that for short-term forecasts (h \leq 6 months) the ARIMAX model, which includes two exogenous variables representing the structural changes (breaks), performs best. The ARIMA(1,1,1) model also performs well for both the 2-month and 6-month forecasts. What stands out in our results is the significant difference in errors between the ARIMA models and the Naive model. This potentially indicates that our models identify the various patterns of the time series better, and therefore justify their existence. One point that requires particular attention is the differences in errors: while differences clearly exist, it should be investigated whether they are statistically significant. To study this, the Diebold-Mariano test will be used, which tests whether the difference in forecast errors between two models is statistically significant:
Show the code
fit <- unemployment_ts %>%
  model(
    "ARIMAX (1,1,1) - 2 breaks" = fable::ARIMA(Value ~ pdq(1, 1, 1) + break1 + break2),
    "ARIMAX (1,1,1) - 1 break" = fable::ARIMA(Value ~ pdq(1, 1, 1) + break1),
    "ARIMA (1,1,1)" = fable::ARIMA(Value ~ pdq(1, 1, 1)),
    Naive = fable::NAIVE(Value),
  )
aug <- fit %>% augment()
e1 <- aug %>% filter(.model == "ARIMAX (1,1,1) - 2 breaks") %>% arrange(TIME) %>% pull(.resid)
e2 <- aug %>% filter(.model == "ARIMAX (1,1,1) - 1 break") %>% arrange(TIME) %>% pull(.resid)
e3 <- aug %>% filter(.model == "ARIMA (1,1,1)") %>% arrange(TIME) %>% pull(.resid)
e4 <- aug %>% filter(.model == "Naive") %>% arrange(TIME) %>% pull(.resid)
t12 <- dm.test(e1, e2, alternative = "two.sided", h = 2, power = 2)
t13 <- dm.test(e1, e3, alternative = "two.sided", h = 2, power = 2)
t14 <- dm.test(e1, e4, alternative = "two.sided", h = 2, power = 2)
t23 <- dm.test(e2, e3, alternative = "two.sided", h = 2, power = 2)
t24 <- dm.test(e2, e4, alternative = "two.sided", h = 2, power = 2)
t34 <- dm.test(e3, e4, alternative = "two.sided", h = 2, power = 2)
dm_results <- tribble(
  ~"Model 1", ~"Model 2", ~"DM Statistic", ~"p-value",
  "ARIMAX(1,1,1) 2br.", "ARIMAX(1,1,1) 1br.", t12$statistic, t12$p.value,
  "ARIMAX(1,1,1) 2br.", "ARIMA (1,1,1)", t13$statistic, t13$p.value,
  "ARIMAX(1,1,1) 2br.", "Naive", t14$statistic, t14$p.value,
  "ARIMAX(1,1,1) 1br.", "ARIMA (1,1,1)", t23$statistic, t23$p.value,
  "ARIMAX(1,1,1) 1br.", "Naive", t24$statistic, t24$p.value,
  "ARIMA (1,1,1)", "Naive", t34$statistic, t34$p.value
) %>%
  mutate(
    "DM Statistic" = round(`DM Statistic`, 4),
    "p-value" = round(`p-value`, 4),
    "Conclusion" = ifelse(`p-value` < 0.05, "Statistically Significant", "No Difference")
  )
dm_results %>%
  gt_custom(head_max = 10, use_labels = FALSE) %>%
  tab_header(
    title = "Diebold-Mariano Test",
    subtitle = "Comparing Forecast Accuracy of Models"
  ) %>%
  tab_style(
    style = cell_fill(color = "#e8f5e9"),
    locations = cells_body(rows = `p-value` >= 0.05, columns = "Conclusion")
  ) %>%
  tab_style(
    style = cell_fill(color = "#fde8e8"),
    locations = cells_body(rows = `p-value` < 0.05, columns = "Conclusion")
  ) %>%
  fmt_number(columns = c("DM Statistic", "p-value"), decimals = 4)
Table 24: Diebold-Mariano tests for the best-performing models for a 2-month forecast
Diebold-Mariano Test
Comparing Forecast Accuracy of Models
| Model 1 | Model 2 | DM Statistic | p-value | Conclusion |
|---|---|---|---|---|
| ARIMAX(1,1,1) 2br. | ARIMAX(1,1,1) 1br. | −0.9218 | 0.3574 | No Difference |
| ARIMAX(1,1,1) 2br. | ARIMA (1,1,1) | −0.9349 | 0.3506 | No Difference |
| ARIMAX(1,1,1) 2br. | Naive | −2.6630 | 0.0082 | Statistically Significant |
| ARIMAX(1,1,1) 1br. | ARIMA (1,1,1) | −0.2242 | 0.8228 | No Difference |
| ARIMAX(1,1,1) 1br. | Naive | −2.5129 | 0.0125 | Statistically Significant |
| ARIMA (1,1,1) | Naive | −2.5001 | 0.0130 | Statistically Significant |
According to the test, I found that the selected ARIMA models have statistically significant differences from the Naive model, demonstrating their value for forecasting future unemployment, at least in the short term. A further finding from the combinations of tests is that the selected ARIMA models do not have statistically significant differences among themselves. That is, the ARIMAX model using one or two breaks does not have significant differences at the 5% significance level from a simple ARIMA(1,1,1) model.
Wait, Where Are the Assumptions?
Hey, where are the assumption tests? We want our money back.
Well, I have some good news and some bad news. The good news is that not only has the article not ended, but I also need to check whether the best models satisfy the assumptions we discussed, namely that the errors must be: uncorrelated, homoskedastic, and normally distributed.
The results are not ideal for any of the models. The Ljung-Box test rejects the null hypothesis of uncorrelated errors for all models (p < 0.01), indicating that the errors retain some structure that the models fail to fully explain. The ARCH test is also rejected everywhere, indicating heteroskedasticity (the variance of the errors is not constant over time), something expected in a time series that passed through a debt crisis and a pandemic. Finally, the Jarque-Bera test rejects normality, with extremely high test statistics suggesting heavy tails in the error distribution. These findings mean that the confidence intervals produced by the models should be interpreted with caution, as they may be narrower or wider than the true underlying uncertainty. However, the violation of these assumptions does not automatically invalidate the point forecasts. ARIMA models are known to produce reliable point estimates even when the distributional assumptions are not fully satisfied, particularly over a short-term horizon. The main implication concerns the quantification of uncertainty, not the direction of the forecast.
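For reproducibility, the three checks can be sketched in base R. The snippet below runs them on a simulated normal sample standing in for the model residuals (in the article they would come from augment() on the fitted model); the Jarque-Bera statistic is computed by hand to avoid an extra package, and a Ljung-Box test on the squared residuals serves as a simple stand-in for the ARCH LM test.

```r
# Sketch of the three residual checks, base R only. `res` is a simulated
# stand-in so the snippet runs on its own; substitute the model residuals.
set.seed(2022)
res <- rnorm(150)

# 1. Uncorrelated errors: Ljung-Box test (H0: no autocorrelation)
Box.test(res, lag = 12, type = "Ljung-Box")

# 2. Constant variance: Ljung-Box on squared residuals, a common
#    substitute for the ARCH LM test (H0: no ARCH effects)
Box.test(res^2, lag = 12, type = "Ljung-Box")

# 3. Normality: Jarque-Bera statistic by hand (H0: normal errors)
n  <- length(res)
m  <- res - mean(res)
s  <- (sum(m^3) / n) / (sum(m^2) / n)^(3 / 2)  # sample skewness
k  <- (sum(m^4) / n) / (sum(m^2) / n)^2        # sample kurtosis
jb <- n / 6 * (s^2 + (k - 3)^2 / 4)            # ~ chi-squared, 2 df under H0
pchisq(jb, df = 2, lower.tail = FALSE)          # p-value
```

On the article's residuals all three p-values fall below 0.01, which is exactly the "not ideal" result described above.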
Forecasting Future Unemployment
All of this was somewhat tiring, but the moment has finally arrived for all of it to make sense. Having selected the best model, I will use it to forecast the level of unemployment in Greece for the next six months. My data end in August 2022 (unemployment at 12.2%), so the estimates cover the months from September 2022 through to February 2023.
best_fit <- unemployment_ts %>%
  model("Best Model" = fable::ARIMA(Value ~ pdq(1, 1, 1) + break1 + break2))

fc_final <- best_fit %>% forecast(h = 6)

# Prepare historical data
historical <- unemployment_ts %>%
  as_tibble() %>%
  mutate(TIME = as.Date(TIME))

# Prepare forecast data
forecast_tbl <- fc_final %>%
  hilo(level = c(80, 95)) %>%
  unpack_hilo(c("80%", "95%")) %>%
  as_tibble() %>%
  mutate(TIME = as.Date(TIME)) %>%
  select(TIME, .mean, `80%_lower`, `80%_upper`, `95%_lower`, `95%_upper`)

highchart(type = "stock") %>%
  hc_title(text = "Unemployment Forecast for Greece - 6 Months") %>%
  hc_xAxis(type = "datetime") %>%
  hc_yAxis(title = list(text = "Unemployment Rate (%)")) %>%
  # Historical line
  hc_add_series(
    data = historical, type = "line",
    hcaes(x = datetime_to_timestamp(TIME), y = Value),
    name = "Historical Data", color = "#2c3e50",
    lineWidth = 2, marker = list(enabled = FALSE)
  ) %>%
  # Point forecast line
  hc_add_series(
    data = forecast_tbl, type = "line",
    hcaes(x = datetime_to_timestamp(TIME), y = round(.mean, 2)),
    name = "Forecast", color = "#e74c3c",
    lineWidth = 2, dashStyle = "Dash", marker = list(enabled = FALSE)
  ) %>%
  # 95% interval (arearange)
  hc_add_series(
    data = forecast_tbl, type = "arearange",
    hcaes(x = datetime_to_timestamp(TIME),
          low = round(`95%_lower`, 2), high = round(`95%_upper`, 2)),
    name = "95% CI", color = "#e74c3c",
    fillOpacity = 0.15, lineWidth = 0, linkedTo = ":previous"
  ) %>%
  # 80% interval (arearange)
  hc_add_series(
    data = forecast_tbl, type = "arearange",
    hcaes(x = datetime_to_timestamp(TIME),
          low = round(`80%_lower`, 2), high = round(`80%_upper`, 2)),
    name = "80% CI", color = "#e74c3c",
    fillOpacity = 0.25, lineWidth = 0, linkedTo = ":previous"
  ) %>%
  hc_tooltip(shared = TRUE, valueDecimals = 2) %>%
  hc_navigator(enabled = FALSE) %>%
  hc_scrollbar(enabled = FALSE) %>%
  hc_rangeSelector(
    enabled = TRUE, selected = 2,
    buttons = list(
      list(type = "month", count = 6, text = "6 months"),
      list(type = "year", count = 1, text = "1 year"),
      list(type = "year", count = 2, text = "2 years"),
      list(type = "all", text = "All")
    )
  )
Figure 12: Unemployment time series with forecast of future unemployment
Table 25: Unemployment forecasts for Greece for the next 6 months with corresponding confidence intervals
| Month | Forecast (%) | 80% CI | 95% CI |
|-------|-------------:|--------|--------|
| 2022 Sep | 12.15 | [11.65, 12.66] | [11.38, 12.93] |
| 2022 Oct | 12.03 | [11.28, 12.78] | [10.88, 13.18] |
| 2022 Nov | 11.91 | [10.96, 12.86] | [10.45, 13.37] |
| 2022 Dec | 11.79 | [10.65, 12.94] | [10.05, 13.54] |
| 2023 Jan | 11.68 | [10.36, 13.00] | [9.66, 13.71] |
| 2023 Feb | 11.57 | [10.07, 13.07] | [9.28, 13.86] |
Based on the best-performing model, the downward trend in Greek unemployment is expected to continue over the next six months. The point estimate for February 2023 is 11.6%, with an 80% confidence interval between 10.1% and 13.1% (and a 95% interval between 9.3% and 13.9%). If confirmed, this would represent a continuation of the steady decline that began after the historical high of 2013.
However, it is worth highlighting certain limitations of this analysis. ARIMA models rest on the assumption that the future structure of the time series will resemble its historical one. An unforeseen event, such as a new geopolitical crisis or an energy shock, could overturn the forecast. In addition, unemployment is influenced by variables not included in the model, such as the trajectory of GDP, inflation, and fiscal policy. Finally, the confidence intervals should be interpreted with caution, given that the assumption tests revealed heteroskedasticity and non-normality in the errors.
Despite these limitations, the direction of the forecast (continuation of the downward trend) is consistent across all models examined, which reinforces the credibility of the conclusion. Greece appears to be moving steadily towards lower unemployment, but remains at levels significantly above the European average (approximately 6% in the same month), a reminder that the full recovery from the 2010 crisis has not yet been completed.