Forecasting Unemployment in Greece

Make a prediction about the future value of Greece’s unemployment using ARIMA model.

R
Time Series
Author

stesiam

Published

October 22, 2022

Table of Contents


  1. Introduction
  2. Prerequisites
      2.1 Import Libraries
      2.2 Import Dataset
      2.3 Preview Datasset
      2.4 Dataset Structure
      2.5 Time Series Preprocessing
  3. Missing Values
  4. Descriptive Statistics
  5. Examine Stationarity
      5.1 Definition of stationarity
      5.2 Examine Stationarity Graphically
      5.3 Examine Stationarity with Statistical Tests
        5.3.1 ADF test
        5.3.2 PP test
        5.3.3 KPSS test
  6. Identify Model
  7. Build Time Series Model
  8. Compare Models
  9. Forecast Future Unemployment
      9.1 ARIMA (0,2,1) forecasts
      9.2 ARIMA (9,2,1) forecasts
  10. Results

Introduction

[…editing…]

Background

Time series

Procedure

Short Answer

If you are in a hurry, I predicted that unemployment on Greece is expected to further reduce. It will range between 10% - 13% in February, 2023.

Prerequisites

Import Libraries

For this analysis we will need standard libraries for importing and processing my data, such as readr (Wickham, Hester, & Bryan, 2022) and dplyr (Wickham, François, Henry, Müller, & Vaughan, 2023). The kableExtra (Zhu, 2021) package was used to print the results in table format, while the flextable (Gohel & Skintzos, 2022) package was used to print the results of the Dickey-Fuller and KPSS tests.

Then, due to the nature of the data (time series) it was deemed necessary to use relevant libraries such as lubridate(Spinu, Grolemund, & Wickham, 2023), tseries(Trapletti & Hornik, 2022) & forecast(R. Hyndman et al., 2022) packages.

Finally, the ggplot2 (Wickham et al., 2024) package was used to create some visualizations, as well as an auxiliary package, ggtext (Wilke & Wiernik, 2022), for further formatting those.

Show the code
# General purpose R libraries
library(dplyr)
library(readr)
library(kableExtra)
library(flextable)


# Graphs
library(ggplot2)
library(ggtext) # Add support for HTML/CSS on ggplot

# Time Series 

library(lubridate)
library(tseries)
library(forecast)

# Other settings
options(digits=2) # print only 2 decimals

Import dataset

After loading the libraries I am able to use the commands of the readr package to import my data. My data is in .csv format, so I’ll use the read_csv() command (Wickham et al., 2022) to import them.

Additionally, I choose not to include EA-19 values (as I investigate Greece’s unemployment).

Show the code
unemployment <- read_csv("data/unemployment.csv") %>%
  select(LOCATION, TIME, Value) %>% filter(LOCATION != "EA19")

Preview Dataset

Show the code
#| label: tbl-preview-dataset
#| tbl-cap: "Preview Dataset (first 6 rows)"
#| 
preview_dataset = head(unemployment, 10)
kbl(preview_dataset, 
    align = 'c',
    booktabs = T,
    centering = T,
    valign = T) %>%
  kable_paper() %>%
  scroll_box(width = "600px", height = "250px")
LOCATION TIME Value
GRC 1998-04 11
GRC 1998-05 11
GRC 1998-06 11
GRC 1998-07 11
GRC 1998-08 11
GRC 1998-09 11
GRC 1998-10 11
GRC 1998-11 11
GRC 1998-12 11
GRC 1999-01 11

Dataset Structure

Our dataset is consisted by 3 variables (columns). More specifically, concerning my variables, are as follows :

Variable Property Description
LOCATION qualitative
(nominal)
Specific country’s statistics
TIME qualitative
(ordinal)
Month of the reported data
Value quantitative
(continuous)
Unemployment at the given Time and Country

Thus, my sample has 3 variables, of which 2 are qualitative and 1 is quantitative property.

Time Series Preprocessing

The TIME variable needs to be a Date variable which is not fulfilled on our case.

Show the code
sapply(unemployment, class) %>% kbl() %>% kable_styling(full_width = F, position = "center")
x
LOCATION character
TIME character
Value numeric

So, above I see that I have dates in a format “YYYY-MM” (Year - Month) and they are considered as characters. With the help of lubridate package I will convert my time series on a Date format.

Show the code
unemployment$TIME <- lubridate::ym(unemployment$TIME)

And let’s check again :

Show the code
sapply(unemployment, class) %>% kbl() %>% kable_styling(full_width = F, position = "center")
x
LOCATION character
TIME Date
Value numeric

And now I got the Date format. I am able to continue my analysis.

Missing Values

Show the code
sum(is.na(unemployment))

Good news! On this dataset there are 0 missing values, in total.

In case the dataset had missing values, I would first look at which variables those were.In a second phase, it might be necessary to replace the missing values.

Descriptive Statistics

The unemployment data for Greece refer to the period from April 1998 το August 2022. Regarding Europe we have data from January 2000 to August 2022.

The last 20 years have seen particularly large changes mainly in Greece, due to the domestic economic crisis. There is a noticeable change between September 2010 (10.1%) and September 2013 (28.1 %), in terms of unemployment in Greece. For the record, below you can find a table with the months that presented the highest unemployment in Greece. As you will notice the five highest values were observed in 2013.

Show the code
unemployment %>% dplyr::filter(LOCATION == "GRC") %>% select(TIME, Value) %>% as.data.frame() %>%
  arrange(Value) %>% tail(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")
TIME Value
289 2013-11-01 28
290 2013-04-01 28
291 2013-05-01 28
292 2013-07-01 28
293 2013-09-01 28

Accordingly, the highest unemployment rates observed at the European level are the following:

Show the code
unemployment %>% dplyr::filter(LOCATION == "EU27_2020") %>% select(TIME, Value)  %>%
  as.data.frame() %>% arrange(Value) %>% tail(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")
TIME Value
268 2013-06-01 12
269 2013-01-01 12
270 2013-02-01 12
271 2013-03-01 12
272 2013-04-01 12

The lowest unemployment ever recorded in Greece was in the period of 2008:

Show the code
unemployment %>% dplyr::filter(LOCATION == "GRC") %>% select(TIME, Value) %>% as.data.frame() %>%
  arrange(Value) %>% head(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")
TIME Value
2008-05-01 7.4
2008-06-01 7.6
2008-07-01 7.6
2008-10-01 7.6
2008-01-01 7.7

On the other hand, at the moment in Europe we are at the lowest levels of the last 20 years, with unemployment hovering around 6%.

Show the code
unemployment %>% dplyr::filter(LOCATION == "EU27_2020") %>% select(TIME, Value) %>%
  as.data.frame() %>% arrange(Value) %>% head(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")
TIME Value
2022-07-01 6.0
2022-08-01 6.0
2022-04-01 6.1
2022-05-01 6.1
2022-06-01 6.1

Finally, it is evident that in the case of Greece the changes are more intense than in Europe.

All of the above can be summarized by the following graph:

Show the code
grc_unemployment <- unemployment %>% filter(LOCATION == "GRC")

unemployment %>%
  ggplot(aes(x= TIME, y=Value, color = LOCATION)) +
  geom_line() + 
  geom_vline(aes(xintercept = as.POSIXct(as.Date("2010-11-03"))), color = "#000000", size = 1, linetype = "dashed")+
  scale_color_manual(values=c('black', 'dodgerblue2', 'orange')) +
  scale_x_date(date_labels = "%m-%Y") +
  theme_minimal()
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Examine stationarity

Definition of stationarity

An important concept in studying time series is stationarity. A time series is called stationary (“Applied Time Series Analysis,” n.d.) if:

  1. E(X_t) = \text{constant}
  2. Var(X_t) = \text{constant}
  3. Cov(X_t, X_s) = \text{constant}

Examine Stationarity Graphically

It is apparent that there is a big variation on the values of unemployment. This time series is not stationary and the differencing is justified.

Show the code
plot(grc_unemployment$Value,type = "l")

Here we can see a big improvement in comparison with original data. I have some concerns about points close to 150 (mildly upwards trend) and 250 (outlier).

Show the code
grc_unemployment_diff1 <- diff(grc_unemployment$Value, differences = 1)

plot(grc_unemployment_diff1,type = "l")

Given the concerns of above, I made also a second difference plot. It seems to solve the problem on points close to 150.

Show the code
grc_unemployment_diff2<- diff(grc_unemployment$Value, differences = 2)

plot(grc_unemployment_diff2, type = "l")

Examine Stationarity with Statistical tests

The graphical interpretation of stationarity can be beneficial for a quick assessment on topic of stationarity. However it can be considered a subjective metric, which leads on a non consistent decision (someone may consider the second figure as stationary and some others not.

Thankfully, there are some statistical tests which can help us on our decisions. Some commonly used are :

  • Augmented Dickey-Fuller (ADF) test
  • Phillips- Perron (PP) test
  • Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test

Summary

Show the code
summary_stationarity_results <- data.frame(
                             "Type" = c("levels", "Diff(GRC)", "Diff2(GRC)"),
                             "ADF test" = c("Non Stationary", "Stationary", "Stationary"),
                              "PP test" = c("Non Stationary", "Stationary", "Stationary"),
                             "KPSS test" =c("Non Stationary", "Non Stationary", "Stationary")
)


summary_stationarity_results  %>% kbl() %>% kable_styling()
Type ADF.test PP.test KPSS.test
levels Non Stationary Non Stationary Non Stationary
Diff(GRC) Stationary Stationary Non Stationary
Diff2(GRC) Stationary Stationary Stationary

ADF test

\begin{array}{l} H_0 : \text{Time series is not stationary} \\ H_1 : \text{Alternatively} \end{array} \equiv \begin{array}{l} H_0 : \text{There is a unit root} \\ H_1 : \text{Alternatively} \end{array}

Show the code
adf.test(grc_unemployment$Value) %>% as_flextable()

statistic

p.value

parameter

method

alternative

-0.9

0.9485

6.0

Augmented Dickey-Fuller Test

stationary

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Συνεπώς, είναι προφανές από τα αποτελέσματα του στατιστικού ελέγχου Dickey Fuller ότι η χρονοσειρά μου δεν είναι στάσιμη. Θα πρέπει να εφαρμόσω τον έλεγχο Dickey-Fuller στις δοαφορές.

Show the code
adf.test(grc_unemployment_diff1) %>% as_flextable()

statistic

p.value

parameter

method

alternative

-3.5

0.0424 *

6.0

Augmented Dickey-Fuller Test

stationary

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

And finally the results for the second differences are :

Show the code
adf.test(grc_unemployment_diff2) %>% as_flextable()
Warning in adf.test(grc_unemployment_diff2): p-value smaller than printed
p-value

statistic

p.value

parameter

method

alternative

-9.5

0.0100 **

6.0

Augmented Dickey-Fuller Test

stationary

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

PP test

\begin{array}{l} H_0 : \text{Time series is not stationary} \\ H_1 : \text{Alternatively} \end{array} \equiv \begin{array}{l} H_0 : \text{There is a unit root} \\ H_1 : \text{Alternatively} \end{array}

Show the code
pp.test(grc_unemployment$Value,) %>% as_flextable()

statistic

p.value

parameter

method

alternative

0.2

0.9900

5.0

Phillips-Perron Unit Root Test

stationary

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Show the code
pp.test(grc_unemployment_diff1) %>% as_flextable()

statistic

p.value

parameter

method

alternative

-279.8

0.0100 **

5.0

Phillips-Perron Unit Root Test

stationary

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Show the code
pp.test(grc_unemployment_diff2) %>% as_flextable()

statistic

p.value

parameter

method

alternative

-337.1

0.0100 **

5.0

Phillips-Perron Unit Root Test

stationary

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

KPSS test

\begin{array}{l} H_0 : \text{Time series is stationary} \\ H_1 : \text{Alternatively} \end{array} \equiv \begin{array}{l} H_0 : \text{There is not unit root} \\ H_1 : \text{Alternatively} \end{array}

Show the code
kpss.test(grc_unemployment$Value,) %>% as_flextable()

statistic

p.value

parameter

method

2.5

0.0100 **

5.0

KPSS Test for Level Stationarity

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Show the code
kpss.test(grc_unemployment_diff1) %>% as_flextable()

statistic

p.value

parameter

method

0.6

0.0231 *

5.0

KPSS Test for Level Stationarity

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Show the code
kpss.test(grc_unemployment_diff2) %>% as_flextable()

statistic

p.value

parameter

method

0.0

0.1000 .

5.0

KPSS Test for Level Stationarity

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Identify Model

Show the code
par(mfrow = c(1, 2))

acf(grc_unemployment$Value)
pacf(grc_unemployment$Value )

Show the code
par(mfrow = c(1, 2))

acf(grc_unemployment_diff1 )
pacf(grc_unemployment_diff1 )

Show the code
par(mfrow = c(1, 2))

acf(grc_unemployment_diff2 )
pacf(grc_unemployment_diff2 )

Build Time Series Model

Show the code
grc_unemployment = grc_unemployment%>% select(Value)
auto_model <- auto.arima(grc_unemployment, trace = T,seasonal = TRUE)

 Fitting models using approximations to speed things up...

 ARIMA(2,2,2)                    : 302
 ARIMA(0,2,0)                    : 474
 ARIMA(1,2,0)                    : 405
 ARIMA(0,2,1)                    : 298
 ARIMA(1,2,1)                    : 301
 ARIMA(0,2,2)                    : 300
 ARIMA(1,2,2)                    : 302

 Now re-fitting the best model(s) without approximations...

 ARIMA(0,2,1)                    : 297

 Best model: ARIMA(0,2,1)                    
Show the code
auto_model 
Series: grc_unemployment 
ARIMA(0,2,1) 

Coefficients:
         ma1
      -0.908
s.e.   0.023

sigma^2 = 0.16:  log likelihood = -146
AIC=297   AICc=297   BIC=304
Show the code
arimaModel_1=arima(grc_unemployment, order=c(0,1,2))
arimaModel_1

Call:
arima(x = grc_unemployment, order = c(0, 1, 2))

Coefficients:
        ma1    ma2
      0.145  0.076
s.e.  0.059  0.057

sigma^2 estimated as 0.171:  log likelihood = -157,  aic = 320
Show the code
arimaModel_2=arima(grc_unemployment, order=c(1,1,2))
arimaModel_2

Call:
arima(x = grc_unemployment, order = c(1, 1, 2))

Coefficients:
        ar1     ma1    ma2
      0.971  -0.905  0.017
s.e.  0.019   0.067  0.063

sigma^2 estimated as 0.157:  log likelihood = -144,  aic = 296
Show the code
arimaModel_3=arima(grc_unemployment, order=c(9,2,1))
arimaModel_3

Call:
arima(x = grc_unemployment, order = c(9, 2, 1))

Coefficients:
        ar1    ar2    ar3    ar4    ar5    ar6     ar7     ar8     ar9    ma1
      -0.79  -0.71  -0.67  -0.69  -0.59  -0.38  -0.316  -0.446  -0.270  -0.15
s.e.   0.17   0.15   0.15   0.14   0.14   0.12   0.088   0.072   0.069   0.18

sigma^2 estimated as 0.139:  log likelihood = -127,  aic = 276

Compare Models

After building some ARIMA models, I should decide which is the best one to make my estimations. One metric to evaluate those models is AIC (Akaike Information Criterion). The lower the value of AIC, the better my model.

Show the code
accuracy_table <- data.frame(
                             "Name of Model" = c("Auto Model", "ARIMA Candidate #1", "ARIMA Candidate #2", "ARIMA Candidate #3"),
                             Model = c("ARIMA(0,2,1)", "ARIMA(0,1,2)", "ARIMA(1,1,2)", "ARIMA(9,2,1)"),
                             AIC =c(auto_model$aic, arimaModel_1$aic, arimaModel_2$aic, arimaModel_3$aic)
)


accuracy_table %>% kbl() %>% kable_styling()
Name.of.Model Model AIC
Auto Model ARIMA(0,2,1) 297
ARIMA Candidate #1 ARIMA(0,1,2) 320
ARIMA Candidate #2 ARIMA(1,1,2) 296
ARIMA Candidate #3 ARIMA(9,2,1) 276

So, the best model is the Auto Model (ARIMA(9,2,1)), which has the lowest AIC value.

Checking best models

Forecast Future Unemployment

Previously, I identify which is the best model. Now, I will use this model in order to predict unemployment for the next 6 months. It should be recalled that the last available value was from August of 2022 (12.2%). Therefore, I will make a prediction for unemployment in Greece until February 2023.

ARIMA (0,2,1) forecasts

Show the code
forecast(auto_model,6) %>% autoplot()

Show the code
forecast(auto_model,6) %>% kbl() %>% kable_paper()
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
294 12 11.5 13 11.2 13
295 12 11.1 13 10.7 13
296 12 10.7 13 10.2 13
297 12 10.3 13 9.7 13
298 11 10.0 13 9.2 13
299 11 9.6 13 8.8 14

ARIMA (9,2,1) forecasts

Show the code
forecast(arimaModel_3,6) %>% autoplot()

Show the code
forecast(arimaModel_3,6) %>% kbl() %>% kable_paper()
Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
294 12 11 12 11.1 13
295 12 11 13 10.9 13
296 12 11 13 10.7 13
297 12 11 13 10.1 13
298 12 10 13 9.8 13
299 12 10 13 9.4 14

Results

Given the diagram as well as the forecast table (of the best performing model, candidate #3), I conclude that a reduction in unemployment in Greece is expected in the next period of time (in the next six months). More specifically, Greece’s unemployment in February 2023 will range between 10% (9.4%) and 13% (14%) with an 80% (95%) probability (based on ARIMA(9,2,1) model).

Acknowledgements

References

Applied time series analysis. (n.d.). Retrieved October 19, 2022, from https://online.stat.psu.edu/stat510/
Gohel, D., & Skintzos, P. (2022). Flextable: Functions for tabular reporting. Retrieved from https://ardata-fr.github.io/flextable-book/
Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. Retrieved from https://www.jstatsoft.org/v40/i03/
Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and practice. OTexts.
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 26(3), 1–22. https://doi.org/10.18637/jss.v027.i03
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., Kuroptev, K., … Yasmeen, F. (2022). Forecast: Forecasting functions for time series and linear models. Retrieved from https://pkg.robjhyndman.com/forecast/
OECD. (2022). Unemployment rate (indicator). Retrieved October 14, 2022, from doi: 10.1787/52570002-en
R Core Team. (2021). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
R Graph Gallery. (2018). Time series. URL Https://r-Graph-Gallery.com/Time-Series.html.
Spinu, V., Grolemund, G., & Wickham, H. (2023). Lubridate: Make dealing with dates a little easier. Retrieved from https://lubridate.tidyverse.org
Trapletti, A., & Hornik, K. (2022). Tseries: Time series analysis and computational finance. Retrieved from https://CRAN.R-project.org/package=tseries
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from https://ggplot2.tidyverse.org
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., … van den Brand, T. (2024). ggplot2: Create elegant data visualisations using the grammar of graphics. Retrieved from https://ggplot2.tidyverse.org
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). Dplyr: A grammar of data manipulation. Retrieved from https://dplyr.tidyverse.org
Wickham, H., Hester, J., & Bryan, J. (2022). Readr: Read rectangular text data. Retrieved from https://readr.tidyverse.org
Wilke, C. O., & Wiernik, B. M. (2022). Ggtext: Improved text rendering support for ggplot2. Retrieved from https://wilkelab.org/ggtext/
Zhu, H. (2021). kableExtra: Construct complex table with kable and pipe syntax. Retrieved from http://haozhu233.github.io/kableExtra/

Citation

BibTeX citation:
@online{2022,
  author = {, stesiam},
  title = {Forecasting {Unemployment} in {Greece}},
  date = {2022-10-22},
  url = {https://stesiam.com/posts/forecasting-greek-unemployment/},
  langid = {en}
}
For attribution, please cite this work as:
stesiam. (2022, October 22). Forecasting Unemployment in Greece. Retrieved from https://stesiam.com/posts/forecasting-greek-unemployment/