Kaggle's Greek Community

Introduction

Kaggle is one of the best-known platforms for data analysts and data scientists, with more than 10 million registered users. It offers a wide range of features, such as:

Creating articles (notebooks) that contain executable code as well as its output (charts, statistical results, etc.)
Communication between users through discussion forums, where questions get answered
A rich body of educational material and courses, most of which are based on Python and its libraries, covering visualisation, data analysis, machine learning and deep learning
Numerous machine-learning and deep-learning competitions that any user can enter

There are other similar platforms, but none of them has a comparable user base or offers a similar range of features. A notable alternative is DrivenData, mainly for competitions. However, DrivenData does not let users publish notebooks, and its community is significantly smaller.

The Kaggle Machine Learning & Data Science Survey is an annual survey conducted by the platform itself. Users are asked to answer questions about demographics, the analysis tools they use, work characteristics and more. The data is then published and users analyse it as part of an open competition. In this article I carry out an analysis of the 2021 survey data, comparing the characteristics of Greek users of the platform with those of the rest of the world.

Prerequisites

Importing Libraries

For this analysis I need to import and transform the data, so the {tidyverse} suite is essential: I will use functions from {readr}, {dplyr} and {tidyr}. The large volume of data offers many options for analysis, and I visualise the findings with the {highcharter} package. Finally, for variables with many values (e.g. countries), it may be necessary to present the data in interactive tables via the {reactable} and {reactablefmtr} packages.

Importing Data

Kaggle provides the dataset in CSV format, so I use the read_csv() function from the {readr} package and store the result under the name kaggle_2021. The first row of the imported data contains the wording of each question, which I remove to make the analysis easier.
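
A minimal sketch of this import step follows. It does not use the real survey file: the temporary CSV, the column names Q1 to Q3 and the question wording are illustrative assumptions that only mimic the layout described above (column names, then a row of question text, then responses).

```r
library(readr)
library(dplyr)

# Hypothetical miniature of the survey file: column names, then a row with
# the question wording, then the actual responses.
csv_lines <- c(
  "Q1,Q2,Q3",
  "What is your age?,What is your gender?,In which country do you reside?",
  "25-29,Man,Greece",
  "18-21,Woman,India",
  "30-34,Man,Japan"
)
tmp <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp)

# Read every column as character so the question row cannot distort type
# guessing, then drop that first row.
kaggle_2021 <- read_csv(tmp, col_types = cols(.default = col_character())) %>%
  slice(-1)

print(kaggle_2021)
```

Reading everything as text is a deliberate choice here: the question row is a string in every column, so letting {readr} guess types before removing it would not gain anything.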

Data Recoding

Before starting the analysis, I need to carry out some preparatory steps. Since the main purpose is to compare characteristics between Greek and other users, I need to separate the data on that basis. There are two ways:

Creating a new variable and grouping the data
Filtering and splitting the dataset into two subsets (Greek and other users)

Either way, I will be able to make the necessary comparison.
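
Both options can be sketched with {dplyr}. The toy data frame below stands in for kaggle_2021, and the assumption that column Q3 holds the country of residence is mine, as are the names group, greek_users and other_users.

```r
library(dplyr)

# Toy stand-in for kaggle_2021; Q3 is assumed to hold the country of residence.
kaggle_2021 <- tibble(
  Q1 = c("25-29", "18-21", "40-44", "22-24"),
  Q3 = c("Greece", "India", "Greece", "Japan")
)

# Option 1: add a grouping variable and summarise by it.
kaggle_grouped <- kaggle_2021 %>%
  mutate(group = if_else(Q3 == "Greece", "Greece", "Rest of the world"))

group_sizes <- kaggle_grouped %>%
  count(group)

# Option 2: split the data into two separate data frames.
greek_users <- kaggle_2021 %>% filter(Q3 == "Greece")
other_users <- kaggle_2021 %>% filter(Q3 != "Greece")

print(group_sizes)
```

The grouping variable tends to be more convenient when both groups go into the same chart or table, since a single pipeline produces both sets of figures. Note that filter() drops rows where Q3 is missing from both subsets.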

Kaggle’s Community

One of the first things I noticed when I signed up on Kaggle was the remarkable geographic diversity of its community: people from dozens of countries, gathered on a single website, sharing the same passion for programming, analysis and data science. Something like Facebook, but for statistics. To show this, I use an interactive table built with the {reactable} package. The table below shows that users from India form the largest community, with more than a quarter of the total (28.62%), followed by users from the United States (10.2%). Greek users are relatively few: 102 respondents, or 0.39% of the users in this survey.

Absolute number and share of survey participants by country.
Country	Number of users	Share (%)
India	7434	28.62
United States	2650	10.2
Other	1339	5.16
Japan	921	3.55
China	814	3.13
Brazil	751	2.89
Russia	742	2.86
Nigeria	702	2.7
United Kingdom	550	2.12
Pakistan	530	2.04
Egypt	482	1.86
Germany	470	1.81
Spain	454	1.75
Indonesia	444	1.71
Turkey	416	1.6
France	401	1.54
South Korea	359	1.38
Taiwan	334	1.29
Canada	331	1.27
Bangladesh	317	1.22
Italy	311	1.2
Mexico	279	1.07
Vietnam	277	1.07
Australia	264	1.02
Kenya	248	0.95
Colombia	225	0.87
Poland	219	0.84
Iran	195	0.75
Ukraine	186	0.72
Argentina	182	0.7
Singapore	182	0.7
Malaysia	156	0.6
Netherlands	153	0.59
South Africa	146	0.56
Morocco	140	0.54
Israel	138	0.53
Thailand	123	0.47
Portugal	119	0.46
Peru	117	0.45
United Arab Emirates	111	0.43
Tunisia	109	0.42
Philippines	108	0.42
Sri Lanka	106	0.41
Chile	102	0.39
Greece	102	0.39
Ghana	99	0.38
Saudi Arabia	89	0.34
Ireland	84	0.32
Sweden	81	0.31
Hong Kong SAR China	79	0.3
Nepal	75	0.29
Switzerland	71	0.27
Belgium	65	0.25
Czechia	63	0.24
Romania	61	0.23
Austria	51	0.2
Belarus	51	0.2
Ecuador	50	0.19
Denmark	48	0.18
Uganda	47	0.18
Kazakhstan	45	0.17
Norway	45	0.17
Algeria	44	0.17
Ethiopia	43	0.17
Iraq	43	0.17
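
The counts and shares in the table above can be derived roughly as follows. This is a sketch on toy data, again assuming that Q3 is the country column; the output names users and share_pct are my own.

```r
library(dplyr)

# Toy stand-in for kaggle_2021 with only the (assumed) country column.
kaggle_2021 <- tibble(Q3 = c("India", "India", "India", "Greece", "Japan"))

country_share <- kaggle_2021 %>%
  count(Q3, name = "users", sort = TRUE) %>%             # respondents per country
  mutate(share_pct = round(100 * users / sum(users), 2)) %>%  # share of all respondents
  rename(country = Q3)

print(country_share)
```

Passing country_share to reactable::reactable() would then render it as an interactive, sortable table.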

An important note

It is important to stress that the data, and by extension the results, come exclusively from the annual survey conducted by the Kaggle platform. They do not reflect the overall picture of the data-science community at a global level. Also, participation in the survey is optional, so we are examining the characteristics only of those who chose to take part. This may introduce self-selection bias.

Women’s Participation in the Field

Globally, a significant under-representation of women in the labour market has persisted over time. According to the World Bank, only 5 in 10 women are employed, compared with almost 7 in 10 men. Here I focus specifically on the data-science field. Does our field follow the same pattern of exclusion? The answer is yes, but with strong variation by country.

Greece performs relatively poorly on this indicator, ranking 38th with women making up 15.7% of its respondents, against an average of 17.7%.
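
A ranking of this kind could be computed as in the sketch below. It assumes that Q2 holds the gender answer with the value "Woman" and that Q3 holds the country; the data are invented and far too small to mean anything.

```r
library(dplyr)

# Toy data: Q2 = gender, Q3 = country (both column meanings are assumptions).
kaggle_2021 <- tibble(
  Q2 = c("Woman", "Man", "Man", "Man", "Woman", "Woman", "Man", "Man"),
  Q3 = c("Greece", "Greece", "Greece", "Greece", "India", "India", "India", "Japan")
)

women_share <- kaggle_2021 %>%
  group_by(Q3) %>%
  summarise(
    respondents = n(),
    women_pct = round(100 * mean(Q2 == "Woman"), 1)  # share of women per country
  ) %>%
  mutate(rank = min_rank(desc(women_pct))) %>%       # 1 = highest share
  arrange(rank)

print(women_share)
```

With real data it would be sensible to also look at the respondents column, because a percentage based on a few dozen answers is very sensitive to single responses.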

This inequality in women’s participation is not only sad but also fundamentally problematic, since a lack of diversity limits perspectives and innovation in any field. Fortunately, various communities have recognised the issue and are actively working to increase women’s participation. The group best known to me, as an R user, is the R-Ladies community, with chapters in various countries. There is a corresponding group for Python, PyLadies.

Age Distribution

The community of Greek users consists of relatively older people compared with the rest of the world. The most populous age group in Greece is 25–29, while there is also a notable share of people in their 40s. By contrast, the population of other users is particularly youthful, with the overwhelming majority concentrated in the younger age groups. Cumulatively, the three youngest age groups represent about 60% of total users worldwide, versus 42% for the Greek community.
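
The cumulative shares quoted above can be obtained with a running sum over the ordered age groups. The sketch assumes that Q1 holds age bands written like "18-21" and that Q3 is the country; the figures it produces come from toy data, not from the survey.

```r
library(dplyr)

# Toy data: Q1 = age band, Q3 = country (both column meanings are assumptions).
kaggle_2021 <- tibble(
  Q1 = c("18-21", "25-29", "40-44", "45-49", "18-21", "22-24", "25-29", "30-34"),
  Q3 = c("Greece", "Greece", "Greece", "Greece", "India", "India", "Japan", "Japan")
)

age_cum <- kaggle_2021 %>%
  mutate(group = if_else(Q3 == "Greece", "Greece", "Rest of the world")) %>%
  count(group, Q1) %>%
  group_by(group) %>%
  arrange(Q1, .by_group = TRUE) %>%   # age bands sort correctly as text
  mutate(
    share_pct = 100 * n / sum(n),     # share of each band within its group
    cum_pct = cumsum(share_pct)       # running total from youngest to oldest
  ) %>%
  ungroup()

print(age_cum)
```

Reading cum_pct at the third-youngest band of each group gives the kind of comparison made in the text.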

This difference can be interpreted as an indication that the data-science field has not yet matured in Greece to the same degree. Use of the platform appears to be concentrated among people aged 25 and over, who most likely already hold a degree and have some work experience.

Comparing the two most widespread programming languages by age group, I observe that younger users (under 29) make up the majority in both. Python has a greater concentration of users in the ages up to 29, while R has a significant presence of older users. This was expected, since R has always been aimed at statistical analysis.

Educational Background

Another finding worth investigating is that Greek users have a higher level of education than the rest of the platform’s users. The previous section showed that older age groups are better represented among Greek users, so a higher share of postgraduate degrees is to be expected. The data confirm this: approximately 78% of Greek users hold a Master’s degree or higher, compared with 50% of the other users.

Comparing the educational background by programming language, R users show a proportionally higher level of studies than Python users, which is consistent with their older age distribution.

Employment

So far I have seen that the small community of Greek data scientists consists mainly of older and more highly specialised individuals. But does this translate into years of professional experience in the field?

An interesting finding is that users with up to 1 year of work experience represent one fifth of Greek users and a quarter of the platform’s users worldwide. The data also confirm that Greek users have proportionally more work experience than the rest of the users.

Despite their extensive experience and the greater specialisation that comes with higher studies, Greek data scientists work in relatively small companies and organisations. Approximately 58% of Greek users reported working in businesses of up to 250 employees, versus 47% of other users.

The same pattern of smaller organisations is also reflected in pay: Greek users are concentrated mainly in the lower-to-middle bands of annual earnings.

The most common roles in the Greek Kaggle community are Data Analyst, Data Scientist and Student. The distribution is similar to the international one, albeit with some differences in the percentages.

Our Toolkit

Python dominates Data Science in both groups. The only notable difference between Greek and other users is that R has a proportionally larger user base in Greece. In any case, about eight in ten have chosen Python as their main language, followed by R and finally SQL.
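
If the language question is stored as a select-all-that-apply item, with one column per option, the shares per group can be computed by reshaping to long format first. The column names Q7_Part_1 to Q7_Part_3 and the layout are assumptions about the file, and the sketch measures language use in general rather than a single main language.

```r
library(dplyr)
library(tidyr)

# Toy data: one column per language option, NA when the option was not ticked.
kaggle_2021 <- tibble(
  Q3 = c("Greece", "Greece", "India", "India", "Japan"),
  Q7_Part_1 = c("Python", "Python", "Python", NA, "Python"),
  Q7_Part_2 = c("R", NA, NA, "R", NA),
  Q7_Part_3 = c(NA, "SQL", "SQL", NA, NA)
)

language_use <- kaggle_2021 %>%
  mutate(group = if_else(Q3 == "Greece", "Greece", "Rest of the world")) %>%
  add_count(group, name = "group_size") %>%      # respondents per group
  pivot_longer(
    starts_with("Q7_Part_"),
    values_to = "language",
    values_drop_na = TRUE                        # keep only ticked options
  ) %>%
  count(group, group_size, language, name = "users") %>%
  mutate(share_pct = round(100 * users / group_size, 1))

print(language_use)
```

Because a respondent can tick several languages, the shares within a group do not add up to 100%.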

Beyond software, the equipment the community uses is also of interest. The overwhelming majority rely on a laptop, with desktop computers and cloud platforms following.

Epilogue

The findings of the analysis are mixed. On the one hand, Greece does not lack human capital or people with knowledge of Data Science: the Greek community consists of individuals with a high degree of specialisation and work experience. On the other hand, the age distribution is worrying. It may indicate low penetration of the field combined with low interest from younger age groups, which could be problematic for the long-term growth of the field in our country.

Finally, I would like to clarify a few points to avoid misinterpretation. Several distributions show figures that are favourable to Greece. This in no way means that Greeks are better at Data Science; such a conclusion would be mistaken. I am comparing demographic characteristics of respondents to one platform’s survey. The distributions more probably reflect the low participation in, and penetration of, data science in our country, as indicated both by the small number of users and by their composition. The trend is in fact worrying, since the favourable figures are not accompanied by a corresponding share of younger people with an interest in the field.

Acknowledgements

Dataset based on the 2021 Kaggle Machine Learning & Data Science Survey.

Image by Christina Smith from Pixabay.