After loading the libraries I am able to use the commands of the readr package to import my data. My data is in .csv format, so I’ll use the read_csv() command (Wickham, Hester, et al., 2024) to import them.
Additionally, I choose not to include EA-19 values (as I investigate Greece’s unemployment).
Show the code
unemployment<-read_csv("data/unemployment.csv")%>%select(LOCATION, TIME, Value)%>%filter(LOCATION!="EA19")
So, above I see that I have dates in a format “YYYY-MM” (Year - Month) and they are considered as characters. With the help of lubridate package I will convert my time series on a Date format.
Good news! On this dataset there are 0 missing values, in total.
In case the dataset had missing values, I would first look at which variables those were.In a second phase, it might be necessary to replace the missing values.
Descriptive Statistics
The unemployment data for Greece refer to the period from April 1998 το August 2022. Regarding Europe we have data from January 2000 to August 2022.
The last 20 years have seen particularly large changes mainly in Greece, due to the domestic economic crisis. There is a noticeable change between September 2010 (10.1%) and September 2013 (28.1 %), in terms of unemployment in Greece. For the record, below you can find a table with the months that presented the highest unemployment in Greece. As you will notice the five highest values were observed in 2013.
`summarise()` has grouped output by 'LOCATION'. You can override using the
`.groups` argument.
Show the code
highchart()%>%hc_chart(type ="line")%>%hc_title(text ="Unemployment Rates")%>%hc_xAxis(title =list(text ="Year"))%>%hc_yAxis(title =list(text ="Value"))%>%hc_tooltip(shared =TRUE, valueDecimals =2, valueSuffix =" %")%>%hc_add_series( data =g, type ="line",hcaes(x =Year, y =mean_unemployment, group =LOCATION))%>%hc_plotOptions( series =list( dataLabels =list( enabled =TRUE, formatter =JS("function() { if (this.point.index === this.series.data.length - 1) { return this.y.toFixed(2) + ' %'; } return null; }"))))
Examine trend and seasonality
In time series analysis, it’s important to distinguish where the variability in the final form of the series originates from. Time series consist of three main components: trend, seasonality, and randomness. The trend refers to whether the series shows a specific direction (upward/downward). Seasonality refers to recurring patterns at regular intervals.
y_t = S_t + T_t + E_t
Όπου:
y_t δηλώνουν τα δεδομένα που έχουμε διαθέσιμα,
S_t
T_t
E_t
Similarly, the multiplicative model is defined:
y_t = S_t \cdot T_t \cdot E_t
where the components that make up the time series are multiplied instead of added.
It is apparent that there is a big variation on the values of unemployment. This time series is not stationary and the differencing is justified.
Show the code
highchart()%>%hc_chart(type ="line")%>%hc_title(text ="Unemployment Rates")%>%hc_xAxis(title =list(text ="Year"))%>%hc_yAxis(title =list(text ="Value"))%>%hc_tooltip(shared =TRUE, valueDecimals =2, valueSuffix =" %")%>%hc_add_series( data =g, type ="line",hcaes(x =Year, y =mean_unemployment, group =LOCATION))%>%hc_plotOptions( series =list( dataLabels =list( enabled =TRUE, formatter =JS("function() { if (this.point.index === this.series.data.length - 1) { return this.y.toFixed(2) + ' %'; } return null; }"))))
Here we can see a big improvement in comparison with original data. I have some concerns about points close to 150 (mildly upwards trend) and 250 (outlier).
Given the concerns of above, I made also a second difference plot. It seems to solve the problem on points close to 150.
Show the code
grc_unemployment_diff2<-diff(grc_unemployment$Value, differences =2)plot(grc_unemployment_diff2, type ="l")
Examine Stationarity with Statistical tests
The graphical interpretation of stationarity can be beneficial for a quick assessment on topic of stationarity. However it can be considered a subjective metric, which leads on a non consistent decision (someone may consider the second figure as stationary and some others not.
Thankfully, there are some statistical tests which can help us on our decisions. Some commonly used are :
Augmented Dickey-Fuller (ADF) test
Phillips- Perron (PP) test
Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test
Zivot-Andrews (ZA) test
But an other problem arises. We are not in a fancy IDE with menus and checkboxes. We use R and its packages. There are many packages that have already write functions to study stationarity with the aforementioned tests. Some of them are:
tseries
urca
CADFtest
We should carefully evaluate our choices, as these packages advertise similar functionality but differ significantly in the scope of their functions and, more importantly, in the level of flexibility they offer for handling time series characteristics. In short, the tseries package is the one I’m most familiar with, but it is also the most limited in terms of configurable parameters. Although many tutorials use this package, I found that I couldn’t set the number of lags for tests or specify characteristics like trends when applicable. On the other hand, the urca package addresses these limitations by allowing users to specify both the trend component and the number of lags in the respective tests. One minor drawback of urca, however, is that it only provides critical values, not p-values.
The following object is masked from 'package:dplyr':
select
Loading required package: strucchange
Loading required package: zoo
Attaching package: 'zoo'
The following objects are masked from 'package:base':
as.Date, as.Date.numeric
Loading required package: sandwich
Loading required package: urca
Loading required package: lmtest
Attaching package: 'lmtest'
The following object is masked _by_ '.GlobalEnv':
unemployment
The following object is masked from 'package:highcharter':
unemployment
Show the code
G=VARselect(grc_unemployment$Value, lag.max =20, type ="const")
Show the code
get_adf_results=function(variable){# Ensure variable is numeric (e.g., a time series or numeric vector)if(!is.numeric(variable)){stop("Input variable must be numeric.")}# Perform ADF testsadf_orig=adf.test(variable)adf_diff1=adf.test(diff(variable, differences =1))adf_diff2=adf.test(diff(variable, differences =2))# Extract resultsadf_table=data.frame("Transformation"=c("Level", "1st Difference", "2nd Difference"),"Dickey-Fuller"=c(adf_orig$statistic[["Dickey-Fuller"]],adf_diff1$statistic[["Dickey-Fuller"]],adf_diff2$statistic[["Dickey-Fuller"]]),"Lag Order"=c(adf_orig$parameter[["Lag order"]],adf_diff1$parameter[["Lag order"]],adf_diff2$parameter[["Lag order"]]),"p-value"=c(adf_orig$p.value,adf_diff1$p.value,adf_diff2$p.value))%>%kbl(digits =5)%>%kable_styling(full_width =FALSE)return(adf_table)}
Show the code
get_adf_results(grc_unemployment$Value)
Augmented Dickey-Fuller test results
Transformation
Dickey.Fuller
Lag.Order
p.value
Level
-0.93
6
0.948
1st Difference
-3.51
6
0.042
2nd Difference
-9.49
6
0.010
Συνεπώς, είναι προφανές από τα αποτελέσματα του στατιστικού ελέγχου Dickey Fuller ότι η χρονοσειρά μου δεν είναι στάσιμη. Θα πρέπει να εφαρμόσω τον έλεγχο Dickey-Fuller στις δοαφορές.
PP test
H_0 : \text{The time series has a unit root} \\
H_1 : \text{The time series has not a unit root}
Show the code
get_pp_results=function(variable){# Ensure variable is numeric (e.g., a time series or numeric vector)if(!is.numeric(variable)){stop("Input variable must be numeric.")}# Perform ADF testspp_orig=pp.test(variable)pp_diff1=pp.test(diff(variable, differences =1))pp_diff2=pp.test(diff(variable, differences =2))# Extract resultspp_table=data.frame("Transformation"=c("Level", "1st Difference", "2nd Difference"),"Phillips-Perron"=c(pp_orig$statistic[["Dickey-Fuller Z(alpha)"]],pp_diff1$statistic[["Dickey-Fuller Z(alpha)"]],pp_diff2$statistic[["Dickey-Fuller Z(alpha)"]]),"Lag Order"=c(pp_orig[["parameter"]][["Truncation lag parameter"]],pp_diff1[["parameter"]][["Truncation lag parameter"]],pp_diff2[["parameter"]][["Truncation lag parameter"]]),"p-value"=c(pp_orig$p.value,pp_diff1$p.value,pp_diff2$p.value))%>%kbl(digits =5)%>%kable_styling(full_width =FALSE)return(pp_table)}
Show the code
get_pp_results(grc_unemployment$Value)
Phillips-Perron test results
Transformation
Phillips.Perron
Lag.Order
p.value
Level
0.19
5
0.99
1st Difference
-279.81
5
0.01
2nd Difference
-337.13
5
0.01
KPSS test
H_0 : \text{The time series has not a unit root} \\
H_1 : \text{Alternatively}
Show the code
get_kpss_results=function(variable){# Ensure variable is numeric (e.g., a time series or numeric vector)if(!is.numeric(variable)){stop("Input variable must be numeric.")}# Perform ADF testskpss_orig=kpss.test(variable)kpss_diff1=kpss.test(diff(variable, differences =1))kpss_diff2=kpss.test(diff(variable, differences =2))# Extract resultspp_table=data.frame("Transformation"=c("Level", "1st Difference", "2nd Difference"),"KPSS"=c(kpss_orig[["statistic"]][["KPSS Level"]],kpss_diff1[["statistic"]][["KPSS Level"]],kpss_diff2[["statistic"]][["KPSS Level"]]),"Lag Order"=c(kpss_orig[["parameter"]][["Truncation lag parameter"]],kpss_diff1[["parameter"]][["Truncation lag parameter"]],kpss_diff2[["parameter"]][["Truncation lag parameter"]]),"p-value"=c(kpss_orig$p.value,kpss_diff1$p.value,kpss_diff2$p.value))%>%kbl(digits =5)%>%kable_styling(full_width =FALSE)return(pp_table)}
After building some ARIMA models, I should decide which is the best one to make my estimations. One metric to evaluate those models is AIC (Akaike Information Criterion). The lower the value of AIC, the better my model.
So, the best model is the Auto Model (ARIMA(9,2,1)), which has the lowest AIC value.
Checking best models
Forecast Future Unemployment
Previously, I identify which is the best model. Now, I will use this model in order to predict unemployment for the next 6 months. It should be recalled that the last available value was from August of 2022 (12.2%). Therefore, I will make a prediction for unemployment in Greece until February 2023.
Given the diagram as well as the forecast table (of the best performing model, candidate #3), I conclude that a reduction in unemployment in Greece is expected in the next period of time (in the next six months). More specifically, Greece’s unemployment in February 2023 will range between 10% (9.4%) and 13% (14%) with an 80% (95%) probability (based on ARIMA(9,2,1) model).
Grolemund, G., & Wickham, H. (2011). Dates and times made easy with lubridate. Journal of Statistical Software, 40(3), 1–25. Retrieved from https://www.jstatsoft.org/v40/i03/
Hyndman, R. J., & Athanasopoulos, G. (2018). Forecasting: Principles and practice. OTexts.
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3), 1–22. https://doi.org/10.18637/jss.v027.i03
Hyndman, R., Athanasopoulos, G., Bergmeir, C., Caceres, G., Chhay, L., Kuroptev, K., … Yasmeen, F. (2024). Forecast: Forecasting functions for time series and linear models. Retrieved from https://pkg.robjhyndman.com/forecast/
OECD. (2022). Unemployment rate (indicator). Retrieved October 14, 2022, from doi: 10.1787/52570002-en
R Core Team. (2024). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
R Graph Gallery. (2018). Time series. URL Https://r-Graph-Gallery.com/Time-Series.html.
Spinu, V., Grolemund, G., & Wickham, H. (2024). Lubridate: Make dealing with dates a little easier. Retrieved from https://lubridate.tidyverse.org
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag New York. Retrieved from https://ggplot2.tidyverse.org
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., … van den Brand, T. (2024). ggplot2: Create elegant data visualisations using the grammar of graphics. Retrieved from https://ggplot2.tidyverse.org
Wickham, H., François, R., Henry, L., Müller, K., & Vaughan, D. (2023). Dplyr: A grammar of data manipulation. Retrieved from https://dplyr.tidyverse.org
Wickham, H., Hester, J., & Bryan, J. (2024). Readr: Read rectangular text data. Retrieved from https://readr.tidyverse.org
Wilke, C. O., & Wiernik, B. M. (2022). Ggtext: Improved text rendering support for ggplot2. Retrieved from https://wilkelab.org/ggtext/