Προβλέποντας την Ανεργία στην Ελλάδα

Μία βραχυπρόθεσμη πρόβλεψη για την πορεία την ανεργίας τους επόμενους μήνες στην Ελλάδα, χρησιμοποιώντας μοντέλα ARIMA.

R
Χρονοσειρες
Συγγραφέας

stesiam

Δημοσιευμένο

22 Οκτωβρίου 2022

Εισαγωγή

Υπόβαθρο

Η ανεργία αποτελεί ένα χρόνιο πρόβλημα της χώρας μας, μιας και ιστορικά τα τελευταία 25 χρόνια είναι σε υψηλότερα επίπεδα του ευρωπαικού μέσου και τις χώρες του ΟΟΣΑ. Το φαινόμενο επιδεινώθηκε τα χρόνια της οικονομικής κρίσης όπου στις χειρότερες στιγμές της το ένα τέταρτο του ενεργού πληθυσμού δεν μπορούσε να βρει εργασία. Ακόμα χειρότερη ήταν η κατάσταση για τους νέους της χώρας μιας και η νεανική ανεργία ανήλθε στο 46% , η οποία είναι και χειρότερη επίδοση στην ΕΕ.

Πριν ξεκινήσουμε να αναλύουμε κρίνω σημαντικό να δούμε τι σημαίνει ο όρος ανεργία και πώς προκύπτει αυτό το ποσοστό. Όταν αναφερόμαστε σε άνεργους αναφερόμαστε σε άτομα τα οποία δεν εργάζονται, αλλά δεν περιλαμβάνονται σε αυτά σπουδαστές, συνταξιούχοι, δηλαδή άτομα τα οποία έχουν την ηλικία (18-67), αλλά και τη δυνατότητα να εργαστούν. Για να υπολογιστεί το ποσοστό ανεργίας:

\textsf{Ποσοστό ανεργίας} = \frac{\textsf{Αριθμός ανέργων}}{\textsf{Εργατικό Δυναμικό}}

Χρονοσειρές

Σε αυτό το άρθρο έχω ως σκοπό τη πρόβλεψη της πορείας της ανεργίας τους επόμενους μήνες. Έτσι λοιπόν πήρα κάποια ιστορικά δεδομένα για την ανεργία στην ΕΕ, τον ΟΟΣΑ και την χώρα μας. Θα χρησιμοποιήσω ένα απλό μοντέλο (S)ARIMA, προκειμένου να κάνω μία εκτίμηση του μεγέθους της τους επόμενους μήνες.

Μεθοδολογία

Συνοπτική Απάντηση

Αν θέλετε μία γρήγορη απάντηση, στη συγκεκριμένη ανάλυση προβλέπω ότι η πτωτική τάση της ανεργίας αναμένεται να συνεχιστεί τους επόμενους μήνες. Τον Φεβρουάσιο του 2023, αυτή θα κυμαίνεται μεταξύ του 10% - 13%.

Προαπαιτούμενα

Εισαγωγή βιβλιοθηκών

Για την ανάλυση αυτή θα χρειαστούμε τυπικές βιβλιοθήκες για την εισαγωγή και την επεξεργασία των δεδομένων μου, όπως οι readr (Wickham, Hester, & Bryan, 2024) και dplyr (Wickham, François, Henry, Müller, & Vaughan, 2023). Το πακέτο kableExtra (Zhu, 2024) χρησιμοποιήθηκε για την αποτύπωση των αποτελεσμάτων σε μορφή πίνακα, ενώ το πακέτο flextable (Gohel & Skintzos, 2024) χρησιμοποιήθηκε για τους πίνακες των αποτελεσμάτων των τεστ Dickey-Fuller και KPSS.

Στη συνέχεια, λόγω της φύσης των δεδομένων (χρονοσειρές) κρίθηκε απαραίτητη η χρήση σχετικών βιβλιοθηκών όπως τα πακέτα lubridate(Spinu, Grolemund, & Wickham, 2024), tseries(Trapletti & Hornik, 2024) & forecast(R. Hyndman κ.ά., 2025).

Τέλος, χρησιμοποιήθηκε το highcharter, ώστε να δημιουργήσω βασικά τα διαγράμματα pacf, acf μέσω ραβδογραμμάτων και κάποια διαγράμματα πρόβλεψης τάσης της ανεργίας με απλά γραφήματα γραμμών.

Show the code
Show the code
acf_plot_function = function(data, min = -1, max = 1){
  ts_data <- data
  N <- length(ts_data)

# Compute confidence limit
conf_limit <- qnorm(0.975) / sqrt(N)

# Use forecast::Acf (biased estimator, recommended for ARIMA)
acf_obj <- Acf(ts_data, plot = FALSE, lag.max = 20)

# Extract lags and values
# Note: Acf() returns acf_obj$acf as a matrix; acf_obj$lag as a matrix
lags_acf <- as.numeric(acf_obj$lag[-1])             # skip lag 0
acf_values <- as.numeric(acf_obj$acf[-1])           # skip lag 0

# Build the highcharter ACF plot
acf_chart <- highchart() %>%
  hc_chart(type = "column") %>%
  hc_title(text = "Autocorrelation Plot") %>%
  hc_xAxis(categories = lags_acf, title = list(text = "Lag")) %>%
  hc_yAxis(min = min, max = max,
    title = list(text = "ACF"),
    plotLines = list(
      list(value = 0, color = "#000000", width = 3),
      list(value = conf_limit, color = "#FF0000", dashStyle = "ShortDash", width = 1),
      list(value = -conf_limit, color = "#FF0000", dashStyle = "ShortDash", width = 1)
    )
  ) %>%
  hc_legend(enabled = FALSE) %>%
  hc_add_series(name = "ACF", data = round(acf_values, digits = 4))

# Show chart
acf_chart
}

pacf_plot_function <- function(data, min = -1, max = 1) {
  ts_data <- data
  N <- length(ts_data)
  
  # Compute confidence limit (same formula as for ACF)
  conf_limit <- qnorm(0.975) / sqrt(N)
  
  # Compute PACF using forecast::Pacf (biased estimator, recommended for ARIMA)
  pacf_obj <- forecast::Pacf(ts_data, plot = FALSE, lag.max = 20)
  
  # Extract lags and PACF values
  lags_pacf <- as.numeric(pacf_obj$lag)  # already excludes lag 0
  pacf_values <- as.numeric(pacf_obj$acf)
  
  # Build the highcharter PACF plot
  pacf_chart <- highcharter::highchart() %>%
    highcharter::hc_chart(type = "column") %>%
    highcharter::hc_title(text = "Partial Autocorrelation Plot") %>%
    highcharter::hc_xAxis(categories = lags_pacf, title = list(text = "Lag")) %>%
    highcharter::hc_yAxis(
      min = min, max = max,
      title = list(text = "PACF"),
      plotLines = list(
        list(value = 0, color = "#000000", width = 3),
        list(value = conf_limit, color = "#FF0000", dashStyle = "ShortDash", width = 1),
        list(value = -conf_limit, color = "#FF0000", dashStyle = "ShortDash", width = 1)
      )
    ) %>%
    highcharter::hc_legend(enabled = FALSE) %>%
    highcharter::hc_add_series(name = "PACF", data = round(pacf_values, digits = 4))
  
  # Show chart
  pacf_chart
}

Εισαγωγή δεδομένων

Αφού συμπεριλάβω τις απολύτως απαραίτητες βιβλιοθήκες, μπορώ να εισάγω τα δεδομένα μου, μέσω της βιλιοθήκης readr. Το αρχείο δεδομένων μου είναι ένα τυπικό αρχείο csv και για την χρήση αυτού θα χρησιμοποιήσω την εντολή read_csv(). Το αρχείο περιέχει δεδομένα ανεργίας

Show the code
unemployment <- read_csv("data/unemployment.csv") %>%
  select(LOCATION, TIME, Value) %>% filter(LOCATION != "EA19")

Δομή δεδομένων

Το αρχείο περιέχει δεδομένα ανεργίας για διάφορες χώρες ή οντότητες χωρών όπως η Ευρωπαϊκή Ένωση και οι χώρες του ΟΟΣΑ (Οργανισμός Οικονομικής Συνεργασίας και Ανάπτυξης), γ

Show the code
head(unemployment, 6) %>%
  kbl(., 
    align = 'c',
    booktabs = T,
    centering = T,
    valign = T) %>%
  kable_styling()
Πίνακας 1: Προεπισκόπηση δεδομένων (πρώτες 6 σειρές)
LOCATION TIME Value
GRC 1998-04 10.9
GRC 1998-05 11.0
GRC 1998-06 10.9
GRC 1998-07 11.0
GRC 1998-08 11.2
GRC 1998-09 11.1

Τα δεδομένα μου αποτελούνται από 3 μεταβλητές (στήλες). Πιο συγκεκριμένα οι στήλες μου είναι οι εξής:

Μεταβλητή Τύπος Μεταβλητής Περιγραφή
LOCATION Ποιοτική
(nominal)
Χώρα ή Οντότητα χωρών
TIME Ποιοτική
(ordinal)
Μήνας-Έτος που αναφέρεται η μέτρηση
Value Ποσοτική
(συνεχής)
Ύψος ανεργίας (βάσει περιοχής και μήνα)

Thus, my sample has 3 variables, of which 2 are qualitative and 1 is quantitative property.

Time Series Preprocessing

The TIME variable needs to be a Date variable which is not fulfilled on our case.

Show the code
sapply(unemployment, class) %>% 
  kbl() %>% kable_styling(full_width = F, position = "center")
x
LOCATION character
TIME character
Value numeric

So, above I see that I have dates in a format “YYYY-MM” (Year - Month) and they are considered as characters. With the help of lubridate package I will convert my time series on a Date format.

Show the code
unemploymentType <- unemployment %>%
  mutate(TIME = yearmonth(TIME)) %>%
  as_tsibble(index = TIME, key = LOCATION)

And let’s check again :

Show the code
sapply(unemploymentType, class) %>% kbl() %>% kable_styling(full_width = F, position = "center")
x
character
x
yearmonth
vctrs_vctr
x
numeric

And now I got the Date format. I am able to continue my analysis.

Missing Values

Show the code
sum(is.na(unemployment))

Good news! On this dataset there are 0 missing values, in total.

In case the dataset had missing values, I would first look at which variables those were.In a second phase, it might be necessary to replace the missing values.

Descriptive Statistics

The unemployment data for Greece refer to the period from April 1998 το August 2022. Regarding Europe we have data from January 2000 to August 2022.

The last 20 years have seen particularly large changes mainly in Greece, due to the domestic economic crisis. There is a noticeable change between September 2010 (10.1%) and September 2013 (28.1 %), in terms of unemployment in Greece. For the record, below you can find a table with the months that presented the highest unemployment in Greece. As you will notice the five highest values were observed in 2013.

Show the code
unemploymentType %>% dplyr::filter(LOCATION == "GRC") %>% select(TIME, Value) %>% as.data.frame() %>%
  arrange(Value) %>% tail(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")
TIME Value
289 2013 Νοε 27.9
290 2013 Απρ 28.0
291 2013 Μαΐ 28.0
292 2013 Ιουλ 28.1
293 2013 Σεπ 28.1

Accordingly, the highest unemployment rates observed at the European level are the following:

Show the code
unemploymentType %>% dplyr::filter(LOCATION == "EU27_2020") %>% select(TIME, Value)  %>%
  as.data.frame() %>% arrange(Value) %>% tail(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")
TIME Value
268 2013 Ιουν 11.6
269 2013 Ιαν 11.7
270 2013 Φεβ 11.7
271 2013 Μαρ 11.7
272 2013 Απρ 11.7

The lowest unemployment ever recorded in Greece was in the period of 2008:

Show the code
unemploymentType %>% dplyr::filter(LOCATION == "GRC") %>% select(TIME, Value) %>% as.data.frame() %>%
  arrange(Value) %>% head(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")
TIME Value
2008 Μαΐ 7.4
2008 Ιουν 7.6
2008 Ιουλ 7.6
2008 Οκτ 7.6
2008 Ιαν 7.7

On the other hand, at the moment in Europe we are at the lowest levels of the last 20 years, with unemployment hovering around 6%.

Show the code
unemploymentType %>% dplyr::filter(LOCATION == "EU27_2020") %>% select(TIME, Value) %>%
  as.data.frame() %>% arrange(Value) %>% head(5) %>% kbl() %>% kable_styling(full_width = F, position = "center")
TIME Value
2022 Ιουλ 6.0
2022 Αυγ 6.0
2022 Απρ 6.1
2022 Μαΐ 6.1
2022 Ιουν 6.1

Finally, it is evident that in the case of Greece the changes are more intense than in Europe.

All of the above can be summarized by the following graph:

Show the code
g = unemploymentType %>%
  as.data.frame() %>%
  mutate(Year = year(TIME)) %>%
  select(-TIME) %>%
  group_by(LOCATION, Year) %>%
  summarise(mean_unemployment = mean(Value))
`summarise()` has grouped output by 'LOCATION'. You can override using the
`.groups` argument.
Show the code
highchart() %>%
    hc_chart(type = "line") %>%
    hc_title(text = "Unemployment Rates") %>%
    hc_xAxis(title = list(text = "Έτος")) %>%
    hc_yAxis(title = list(text = "Ανεργία (%)")) %>%
    hc_tooltip(shared = TRUE, valueDecimals = 2, valueSuffix = " %") %>%
    hc_add_series(
        data = g,
        type = "line",
        hcaes(x = Year, y = mean_unemployment, group = LOCATION)
    ) %>%
  hc_plotOptions(
  series = list(
    dataLabels = list(
      enabled = TRUE,
      formatter = JS("function() {
        if (this.point.index === this.series.data.length - 1) {
          return this.y.toFixed(2) + ' %';
        }
        return null;
      }")
    )
  )
)

Unemployment comparison in Greece, OECD & EU27 (incl. Brexit)

Examine trend and seasonality

In time series analysis, it’s important to distinguish where the variability in the final form of the series originates from. Time series consist of three main components: trend, seasonality, and randomness. The trend refers to whether the series shows a specific direction (upward/downward). Seasonality refers to recurring patterns at regular intervals.

y_t = S_t + T_t + E_t Όπου:

  • y_t δηλώνουν τα δεδομένα που έχουμε διαθέσιμα,
  • S_t
  • T_t
  • E_t

Similarly, the multiplicative model is defined:

y_t = S_t \cdot T_t \cdot E_t

where the components that make up the time series are multiplied instead of added.

Show the code
library(forecast)
library(highcharter)
decomp <- unemploymentType %>%
  model(stl = STL(Value)) %>%
  components() %>%
  as_tibble() %>%
  dplyr::filter(LOCATION == "GRC") %>%
  mutate(TIME = as.Date(TIME),
    trend = round(trend, 2),
    season_year = round(season_year, 4),
    remainder = round(remainder, 4)
  )

# Build highchart with multiple y-axes (stacked)
highchart(type = "stock") %>%
    hc_title(text = "Κατάτμηση χρονοσειράς σε τάση, εποχικότητα και θόρυβο") %>%
    hc_subtitle(text = "Τα ιστορικά δεδομένα της ανεργίας στην Ελλάδα δείχνουν ότι δεν υπάρχει κάποια ξεκάθαρη τάση. Επιπλέον, παρατηρούμε μία ανεμική εποχικότητα μέχρι το 2012, ενώ τα επόμενα χρόνια παρατηρείται εντονότερα το φαινόμενο αυτής.") %>%
    hc_caption(text="stesiam, 2023") %>%

    # Observed
    hc_add_series(decomp, "line", hcaes(x = datetime_to_timestamp(TIME), y = Value), name = "Παρατήρηση", yAxis = 0) %>%
    
    # Trend
    hc_add_series(decomp, "line", hcaes(x = datetime_to_timestamp(TIME), y = trend), name = "Τάση", yAxis = 1) %>%
    
    # Seasonal
    hc_add_series(decomp, "line", hcaes(x = datetime_to_timestamp(TIME), y = season_year), name = "Εποχικότητα", yAxis = 2) %>%
    
    # Random
    hc_add_series(decomp, "line", hcaes(x = datetime_to_timestamp(TIME), y = remainder), name = "Θόρυβος", yAxis = 3) %>%
    # 
    # # Define 4 stacked y-axes
  hc_yAxis_multiples(
  list(
    title = list(text = "Δεδομένα"),
    height = "25%", top = "0%",
    offset=0, labels = list(enabled = FALSE), gridLineWidth = 0),
  list(
    title = list(text = "Τάση"),
    height = "25%", top = "25%",
    offset=0, labels = list(enabled = FALSE), gridLineWidth = 0),
  list(
    title = list(text = "Εποχικότητα"),
    height = "25%", top = "50%",
    offset=0, labels = list(enabled = FALSE), gridLineWidth = 0),
  list(
    title = list(text = "Υπόλοιπο"),
    height = "25%", top = "75%",
    offset=0, labels = list(enabled = FALSE), gridLineWidth = 0)
) %>%
    hc_tooltip(shared = TRUE) %>%
    hc_rangeSelector(enabled = FALSE) %>%
    hc_scrollbar(enabled = FALSE) %>%
    hc_navigator(enabled = FALSE)

Έλεγχος στασιμότητας

Ορισμός στασιμότητας

Μία σημαντική έννοια στις χρονοσειρές είναι η στασιμότητα. Μία χρονοσειρά καλείται στάσιμη (‘Applied Time Series Analysis’, χ.χ.) αν:

  1. E(X_t):\text{σταθερή}
  2. Var(X_t): \text{σταθερή}
  3. Cov(X_t, X_s): \text{σταθερή}

Εξέταση στασιμότητας γραφικά

It is apparent that there is a big variation on the values of unemployment. This time series is not stationary and the differencing is justified.

Show the code
highchart() %>%
    hc_chart(type = "line") %>%
    hc_title(text = "Unemployment Rates") %>%
    hc_xAxis(title = list(text = "Year")) %>%
    hc_yAxis(title = list(text = "Value")) %>%
    hc_tooltip(shared = TRUE, valueDecimals = 2, valueSuffix = " %") %>%
    hc_add_series(
        data = g,
        type = "line",
        hcaes(x = Year, y = mean_unemployment, group = LOCATION)
    ) %>%
  hc_plotOptions(
  series = list(
    dataLabels = list(
      enabled = TRUE,
      formatter = JS("function() {
        if (this.point.index === this.series.data.length - 1) {
          return this.y.toFixed(2) + ' %';
        }
        return null;
      }")
    )
  )
)

Here we can see a big improvement in comparison with original data. I have some concerns about points close to 150 (mildly upwards trend) and 250 (outlier).

Show the code
grc_unemployment = unemployment %>% dplyr::filter(LOCATION == "GRC")
grc_unemployment_diff1 <- diff(grc_unemployment$Value, differences = 1)

highchart() %>%
    hc_chart(type = "line") %>%
    hc_title(text = "Unemployment Rates") %>%
    hc_xAxis(title = list(text = "Year")) %>%
    hc_yAxis(title = list(text = "Value")) %>%
    hc_tooltip(shared = TRUE, valueDecimals = 2, valueSuffix = " %") %>%
    hc_add_series(
        data = grc_unemployment_diff1
    )

Given the concerns of above, I made also a second difference plot. It seems to solve the problem on points close to 150.

Show the code
grc_unemployment_diff2<- diff(grc_unemployment$Value, differences = 2)

highchart() %>%
    hc_chart(type = "line") %>%
    hc_title(text = "Unemployment Rates") %>%
    hc_xAxis(title = list(text = "Year")) %>%
    hc_yAxis(title = list(text = "Value")) %>%
    hc_tooltip(shared = TRUE, valueDecimals = 2, valueSuffix = " %") %>%
    hc_add_series(
        data = grc_unemployment_diff2
    )

Examine Stationarity with Statistical tests

The graphical interpretation of stationarity can be beneficial for a quick assessment on topic of stationarity. However it can be considered a subjective metric, which leads on a non consistent decision (someone may consider the second figure as stationary and some others not.

Thankfully, there are some statistical tests which can help us on our decisions. Some commonly used are :

  • Augmented Dickey-Fuller (ADF) test
  • Phillips- Perron (PP) test
  • Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test
  • Zivot-Andrews (ZA) test

But an other problem arises. We are not in a fancy IDE with menus and checkboxes. We use R and its packages. There are many packages that have already write functions to study stationarity with the aforementioned tests. Some of them are:

  • tseries
  • urca
  • CADFtest

We should carefully evaluate our choices, as these packages advertise similar functionality but differ significantly in the scope of their functions and, more importantly, in the level of flexibility they offer for handling time series characteristics. In short, the tseries package is the one I’m most familiar with, but it is also the most limited in terms of configurable parameters. Although many tutorials use this package, I found that I couldn’t set the number of lags for tests or specify characteristics like trends when applicable. On the other hand, the urca package addresses these limitations by allowing users to specify both the trend component and the number of lags in the respective tests. One minor drawback of urca, however, is that it only provides critical values, not p-values.

Summary

Show the code
summary_stationarity_results <- data.frame(
                             "Type" = c("levels", "Diff(GRC)", "Diff2(GRC)"),
                             "ADF test" = c("Non Stationary", "Stationary", "Stationary"),
                              "PP test" = c("Non Stationary", "Stationary", "Stationary"),
                             "KPSS test" =c("Non Stationary", "Non Stationary", "Stationary")
)


summary_stationarity_results  %>% kbl() %>% kable_styling() 
Type ADF.test PP.test KPSS.test
levels Non Stationary Non Stationary Non Stationary
Diff(GRC) Stationary Stationary Non Stationary
Diff2(GRC) Stationary Stationary Stationary

ADF test

One of the most

H_0 : \text{The time series has a unit root} \\ H_1 : \text{The time series has not a unit root} Selecting the lags

Show the code
Loading required package: MASS

Attaching package: 'MASS'
The following object is masked from 'package:dplyr':

    select
Loading required package: strucchange
Loading required package: zoo

Attaching package: 'zoo'
The following object is masked from 'package:tsibble':

    index
The following objects are masked from 'package:base':

    as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: urca
Loading required package: lmtest

Attaching package: 'lmtest'
The following object is masked _by_ '.GlobalEnv':

    unemployment
The following object is masked from 'package:highcharter':

    unemployment

Attaching package: 'vars'
The following object is masked from 'package:fable':

    VAR
Show the code
G = VARselect(grc_unemployment$Value, lag.max = 20, type = "const")
Show the code
get_adf_results = function(variable){

    # Ensure variable is numeric (e.g., a time series or numeric vector)
    if (!is.numeric(variable)) {
        stop("Input variable must be numeric.")
    }
    
    # Perform ADF tests
    adf_orig = adf.test(variable)
    adf_diff1 = adf.test(diff(variable, differences = 1))
    adf_diff2 = adf.test(diff(variable, differences = 2))
    
    # Extract results
    adf_table = data.frame(
        "Transformation" = c("Level", "1st Difference", "2nd Difference"),
        "Dickey-Fuller" = c(
            adf_orig$statistic[["Dickey-Fuller"]],
            adf_diff1$statistic[["Dickey-Fuller"]],
            adf_diff2$statistic[["Dickey-Fuller"]]
        ),
        "Lag Order" = c(
            adf_orig$parameter[["Lag order"]],
            adf_diff1$parameter[["Lag order"]],
            adf_diff2$parameter[["Lag order"]]
        ),
        "p-value" = c(
            adf_orig$p.value,
            adf_diff1$p.value,
            adf_diff2$p.value
        )
    ) %>%
        kbl(digits = 5) %>%
        kable_styling(full_width = FALSE)
    
    return(adf_table)
}
Show the code
get_adf_results(grc_unemployment$Value)
Augmented Dickey-Fuller test results
Transformation Dickey.Fuller Lag.Order p.value
Level -0.93123 6 0.94846
1st Difference -3.50738 6 0.04240
2nd Difference -9.48682 6 0.01000

Συνεπώς, είναι προφανές από τα αποτελέσματα του στατιστικού ελέγχου Dickey Fuller ότι η χρονοσειρά μου δεν είναι στάσιμη. Θα πρέπει να εφαρμόσω τον έλεγχο Dickey-Fuller στις δοαφορές.

PP test

H_0 : \text{The time series has a unit root} \\ H_1 : \text{The time series has not a unit root}

Show the code
get_pp_results = function(variable){

    # Ensure variable is numeric (e.g., a time series or numeric vector)
    if (!is.numeric(variable)) {
        stop("Input variable must be numeric.")
    }
    
    # Perform ADF tests
    pp_orig = pp.test(variable)
    pp_diff1 = pp.test(diff(variable, differences = 1))
    pp_diff2 = pp.test(diff(variable, differences = 2))
    
    # Extract results
    pp_table = data.frame(
        "Transformation" = c("Level", "1st Difference", "2nd Difference"),
        "Phillips-Perron" = c(
            pp_orig$statistic[["Dickey-Fuller Z(alpha)"]],
            pp_diff1$statistic[["Dickey-Fuller Z(alpha)"]],
            pp_diff2$statistic[["Dickey-Fuller Z(alpha)"]]
        ),
        "Lag Order" = c(
            pp_orig[["parameter"]][["Truncation lag parameter"]],
            pp_diff1[["parameter"]][["Truncation lag parameter"]],
            pp_diff2[["parameter"]][["Truncation lag parameter"]]
        ),
        "p-value" = c(
            pp_orig$p.value,
            pp_diff1$p.value,
            pp_diff2$p.value
        )
    ) %>%
        kbl(digits = 5) %>%
        kable_styling(full_width = FALSE)
    
    return(pp_table)
}
Show the code
get_pp_results(grc_unemployment$Value)
Phillips-Perron test results
Transformation Phillips.Perron Lag.Order p.value
Level 0.19401 5 0.99
1st Difference -279.81456 5 0.01
2nd Difference -337.13053 5 0.01

KPSS test

H_0 : \text{The time series has not a unit root} \\ H_1 : \text{Alternatively}

Show the code
get_kpss_results = function(variable){

    # Ensure variable is numeric (e.g., a time series or numeric vector)
    if (!is.numeric(variable)) {
        stop("Input variable must be numeric.")
    }
    
    # Perform ADF tests
    kpss_orig = kpss.test(variable)
    kpss_diff1 = kpss.test(diff(variable, differences = 1))
    kpss_diff2 = kpss.test(diff(variable, differences = 2))
    
    # Extract results
    pp_table = data.frame(
        "Transformation" = c("Level", "1st Difference", "2nd Difference"),
        "KPSS" = c(
            kpss_orig[["statistic"]][["KPSS Level"]],
            kpss_diff1[["statistic"]][["KPSS Level"]],
            kpss_diff2[["statistic"]][["KPSS Level"]]
        ),
        "Lag Order" = c(
            kpss_orig[["parameter"]][["Truncation lag parameter"]],
            kpss_diff1[["parameter"]][["Truncation lag parameter"]],
            kpss_diff2[["parameter"]][["Truncation lag parameter"]]
        ),
        "p-value" = c(
            kpss_orig$p.value,
            kpss_diff1$p.value,
            kpss_diff2$p.value
        )
    ) %>%
        kbl(digits = 5) %>%
        kable_styling(full_width = FALSE)
    
    return(pp_table)
}
Show the code
get_kpss_results(grc_unemployment$Value)
KPSS test results
Transformation KPSS Lag.Order p.value
Level 2.53719 5 0.01000
1st Difference 0.59510 5 0.02308
2nd Difference 0.01426 5 0.10000

Identify Model

Show the code
acf_plot_function(unemployment$Value, min=-0.2, max=1)
Show the code
pacf_plot_function(unemployment$Value, min=-0.2, max=1)
Show the code
acf_plot_function(grc_unemployment_diff1, min=-0.1, max = 0.6)
Show the code
pacf_plot_function(grc_unemployment_diff1, min=-0.2, max=0.5)
Show the code
acf_plot_function(grc_unemployment_diff2, min=-0.5, max = 0.5)
Show the code
pacf_plot_function(grc_unemployment_diff2, min=-0.5, max = 0.5)

Build Time Series Model

Show the code
auto_model <- auto.arima(grc_unemployment$Value, trace = T,seasonal = TRUE)

 Fitting models using approximations to speed things up...

 ARIMA(2,2,2)                    : 302.1399
 ARIMA(0,2,0)                    : 473.9345
 ARIMA(1,2,0)                    : 405.2316
 ARIMA(0,2,1)                    : 297.5442
 ARIMA(1,2,1)                    : 301.0253
 ARIMA(0,2,2)                    : 299.5392
 ARIMA(1,2,2)                    : 302.154

 Now re-fitting the best model(s) without approximations...

 ARIMA(0,2,1)                    : 296.71

 Best model: ARIMA(0,2,1)                    
Show the code
auto_model 
Series: grc_unemployment$Value 
ARIMA(0,2,1) 

Coefficients:
          ma1
      -0.9075
s.e.   0.0229

sigma^2 = 0.1597:  log likelihood = -146.33
AIC=296.67   AICc=296.71   BIC=304.02
Show the code
arimaModel_1=arima(grc_unemployment$Value, order=c(0,1,2))
arimaModel_1

Call:
arima(x = grc_unemployment$Value, order = c(0, 1, 2))

Coefficients:
         ma1     ma2
      0.1445  0.0760
s.e.  0.0588  0.0571

sigma^2 estimated as 0.1714:  log likelihood = -156.83,  aic = 319.67
Show the code
arimaModel_2=arima(grc_unemployment$Value, order=c(1,1,2))
arimaModel_2

Call:
arima(x = grc_unemployment$Value, order = c(1, 1, 2))

Coefficients:
         ar1      ma1     ma2
      0.9710  -0.9048  0.0168
s.e.  0.0185   0.0666  0.0629

sigma^2 estimated as 0.1569:  log likelihood = -144.19,  aic = 296.37
Show the code
arimaModel_3=arima(grc_unemployment$Value, order=c(9,2,1))
arimaModel_3

Call:
arima(x = grc_unemployment$Value, order = c(9, 2, 1))

Coefficients:
          ar1      ar2      ar3      ar4      ar5      ar6      ar7      ar8
      -0.7879  -0.7083  -0.6715  -0.6944  -0.5922  -0.3838  -0.3160  -0.4457
s.e.   0.1713   0.1539   0.1493   0.1434   0.1379   0.1172   0.0883   0.0718
          ar9      ma1
      -0.2704  -0.1456
s.e.   0.0690   0.1767

sigma^2 estimated as 0.1388:  log likelihood = -126.92,  aic = 275.83

Compare Models

After building some ARIMA models, I should decide which is the best one to make my estimations. One metric to evaluate those models is AIC (Akaike Information Criterion). The lower the value of AIC, the better my model.

Show the code
accuracy_table <- data.frame(
                             "Name of Model" = c("Auto Model", "ARIMA Candidate #1", "ARIMA Candidate #2", "ARIMA Candidate #3"),
                             Model = c("ARIMA(0,2,1)", "ARIMA(0,1,2)", "ARIMA(1,1,2)", "ARIMA(9,2,1)"),
                             AIC =c(auto_model$aic, arimaModel_1$aic, arimaModel_2$aic, arimaModel_3$aic)
)


accuracy_table %>% kbl() %>% kable_styling()
Name.of.Model Model AIC
Auto Model ARIMA(0,2,1) 296.6684
ARIMA Candidate #1 ARIMA(0,1,2) 319.6669
ARIMA Candidate #2 ARIMA(1,1,2) 296.3705
ARIMA Candidate #3 ARIMA(9,2,1) 275.8314

So, the best model is the Auto Model (ARIMA(9,2,1)), which has the lowest AIC value.

Checking best models

Forecast Future Unemployment

Previously, I identify which is the best model. Now, I will use this model in order to predict unemployment for the next 6 months. It should be recalled that the last available value was from August of 2022 (12.2%). Therefore, I will make a prediction for unemployment in Greece until February 2023.

ARIMA (0,2,1) forecasts

Show the code
# s = forecast(auto_model,20)
# 
# month = seq(max(grc_unemployment$TIME) %m+% months(1),  # start at next month
#               by = "1 month",
#               length.out = 20)
# 
# highchart() %>%
#     hc_title(text = "Prediction of Unemployment with ARIMA(0,2,1)") %>%
#     hc_xAxis(type = "datetime")  %>%
#     hc_yAxis(title = list(text="Unemployment (%)")) %>%
#   hc_add_series(name="Historic Data",
#     data = list_parse2(data.frame(x = datetime_to_timestamp(grc_unemployment$TIME), y = grc_unemployment$Value)),
#     type = "line"
#   )  %>%
#   hc_add_series(name = "Forecast",
#                   data = list_parse2(data.frame(
#                       x = datetime_to_timestamp(month),
#                       y = round(s$mean, digits=2)
#                   )),
#                   type = "line",
#                   color = "#d62728") %>%
#     hc_add_series(name = "Confidence Interval",
#                 data = list_parse2(data.frame(
#                   x = datetime_to_timestamp(month),
#                   low = round(s$lower[, "80%"], digits=2),
#                   high = round(s$upper[, "80%"], digits=2)
#                 )),
#                 type = "arearange",
#                 color = hex_to_rgba("#d62728", 0.2),
#                 linkedTo = ":previous")
# 

ARIMA (9,2,1) forecasts

Show the code
# s1 = forecast(arimaModel_3,20)
# 
# month = seq(max(grc_unemployment$TIME) %m+% months(1),  # start at next month
#               by = "1 month",
#               length.out = 20)
# 
# highchart() %>%
#     hc_title(text = "Prediction of Unemployment with ARIMA(0,2,1)") %>%
#     hc_xAxis(type = "datetime")  %>%
#     hc_yAxis(title = list(text="Unemployment (%)")) %>%
#   hc_add_series(name="Historic Data",
#     data = list_parse2(data.frame(x = datetime_to_timestamp(grc_unemployment$TIME), y = grc_unemployment$Value)),
#     type = "line"
#   )  %>%
#   hc_add_series(name = "Forecast",
#                   data = list_parse2(data.frame(
#                       x = datetime_to_timestamp(month),
#                       y = round(s1$mean, digits=2)
#                   )),
#                   type = "line",
#                   color = "#d62728") %>%
#     hc_add_series(name = "Confidence Interval",
#                 data = list_parse2(data.frame(
#                   x = datetime_to_timestamp(month),
#                   low = round(s1$lower[, "80%"], digits=2),
#                   high = round(s1$upper[, "80%"], digits=2)
#                 )),
#                 type = "arearange",
#                 color = hex_to_rgba("#d62728", 0.2),
#                 linkedTo = ":previous")

Results

Given the diagram as well as the forecast table (of the best performing model, candidate #3), I conclude that a reduction in unemployment in Greece is expected in the next period of time (in the next six months). More specifically, Greece’s unemployment in February 2023 will range between 10% (9.4%) and 13% (14%) with an 80% (95%) probability (based on ARIMA(9,2,1) model).

Ευχαριστίες

Φωτογραφία από Rosy / Bad Homburg / Germany από το Pixabay

Αναφορές

Applied Time Series Analysis. (χ.χ.). Ανακτήθηκε 19 Οκτώβριος 2022, από https://online.stat.psu.edu/stat510/
Gohel, D., & Skintzos, P. (2024). flextable: Functions for Tabular Reporting. Ανακτήθηκε από https://ardata-fr.github.io/flextable-book/
Grolemund, G., & Wickham, H. (2011). Dates and Times Made Easy with lubridate. Journal of Statistical Software, 40(3), 1–25. Ανακτήθηκε από https://www.jstatsoft.org/v40/i03/
Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: principles and practice. OTexts.
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 27(3), 1–22. https://doi.org/10.18637/jss.v027.i03
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., Kuroptev, K., … Yasmeen, F. (2025). forecast: Forecasting Functions for Time Series and Linear Models. Ανακτήθηκε από https://pkg.robjhyndman.com/forecast/
Kunst, J. (2022). highcharter: A Wrapper for the Highcharts Library. Ανακτήθηκε από https://jkunst.com/highcharter/
O’Hara-Wild, M., Hyndman, R., & Wang, E. (2024a). fable: Forecasting Models for Tidy Time Series. Ανακτήθηκε από https://fable.tidyverts.org
O’Hara-Wild, M., Hyndman, R., & Wang, E. (2024b). fabletools: Core Tools for Packages in the fable Framework. Ανακτήθηκε από https://fabletools.tidyverts.org/
O’Hara-Wild, M., Hyndman, R., & Wang, E. (2024c). feasts: Feature Extraction and Statistics for Time Series. Ανακτήθηκε από http://feasts.tidyverts.org/
OECD. (2022). Unemployment rate (indicator). Ανακτήθηκε 14 Οκτώβριος 2022, από doi: 10.1787/52570002-en
R Core Team. (2025). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. Ανακτήθηκε από https://www.R-project.org/
R Graph Gallery. (2018). Time Series. URL https://r-graph-gallery.com/time-series.html.
Spinu, V., Grolemund, G., & Wickham, H. (2024). lubridate: Make Dealing with Dates a Little Easier. Ανακτήθηκε από https://lubridate.tidyverse.org
Trapletti, A., & Hornik, K. (2024). tseries: Time Series Analysis and Computational Finance. https://doi.org/10.32614/CRAN.package.tseries
Wang, E., Cook, D., & Hyndman, R. J. (2020). A new tidy data structure to support exploration and modeling of temporal data. Journal of Computational and Graphical Statistics, 29(3), 466–478. https://doi.org/10.1080/10618600.2019.1695624
Wang, E., Cook, D., Hyndman, R., & O’Hara-Wild, M. (2025). tsibble: Tidy Temporal Data Frames and Tools. Ανακτήθηκε από https://tsibble.tidyverts.org
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). dplyr: A Grammar of Data Manipulation. Ανακτήθηκε από https://dplyr.tidyverse.org
Wickham, H., Hester, J., & Bryan, J. (2024). readr: Read Rectangular Text Data. Ανακτήθηκε από https://readr.tidyverse.org
Zhu, H. (2024). kableExtra: Construct Complex Table with kable and Pipe Syntax. Ανακτήθηκε από http://haozhu233.github.io/kableExtra/

Αναφορά

Αναφορά BibTeX:
@online{2022,
  author = {, stesiam},
  title = {Προβλέποντας την Ανεργία στην Ελλάδα},
  date = {2022-10-22},
  url = {https://stesiam.com/posts/forecasting-greek-unemployment/},
  langid = {el}
}
Για απόδοση ευγνωμοσύνης, παρακαλούμε αναφερθείτε σε αυτό το έργο ως:
stesiam. (2022, October 22). Προβλέποντας την Ανεργία στην Ελλάδα. Retrieved from https://stesiam.com/posts/forecasting-greek-unemployment/