Introduction
Kaggle is one of the best-known communities of data analysts and data scientists, with over 10 million active users. Beyond its size, Kaggle offers tools (Notebooks), knowledge (through Discussions between users) and Competitions. Similar communities exist, but none matches Kaggle’s full range of features. DrivenData, for example, is an alternative for participating in ML competitions, but it neither lets users create notebooks nor has a comparably large userbase.
The Kaggle Machine Learning & Data Science Survey is conducted annually by Kaggle. The platform then invites its users to analyze the survey responses in the context of a competition. In this article, I analyze the 2021 survey in order to compare Greek data analysts with the rest of the world.
Prerequisites
Import Libraries
This analysis produces several charts, so the {ggplot2} package is necessary. Variables with many distinct values (e.g. the country of each Kaggle user) are better presented as tables, and the {reactablefmtr} package helps produce a polished result.
# General purpose R libraries
library(readr)
library(dplyr)
library(tidyr)
library(forcats)
library(gridExtra)
library(countrycode)
# Tables
library(kableExtra)
library(reactablefmtr)
# Graphs
library(ggplot2)
library(ggtext)
library(showtext)
library(sysfonts)
library(glue)
library(ggflags)
library(highcharter)
options(digits = 4)
options(warn = -1)
font_add_google(name = "Lilita One", family = "title", db_cache = FALSE)
font_add_google(name = "Ysabeau Office", family = "subtitle", db_cache = FALSE)
font_add_google(name = "Spline Sans", family = "text", db_cache = FALSE)
showtext_auto()
showtext::showtext_opts(dpi = 300)
Import Data
Using read_csv() from the {readr} package, I import the dataset and name it kaggle_2021. The second line of the file holds the question text. Because read_csv() treats the first line as the header, the question text arrives as the first data row. It is not required for the analysis, so I exclude it.
kaggle_2021 <- read_csv("data/kaggle_survey_2021.csv")
# Drop the question-text row (second line of the file, first row of the data frame)
kaggle_2021 <- kaggle_2021[-1, ]
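To make the header handling concrete, here is a minimal sketch on a made-up miniature of the file. The contents of toy_csv are invented for illustration and are not taken from the survey. It shows that the header line becomes the column names, so the question text is the first data row.

```r
library(readr)

# Hypothetical miniature of the survey file: header, question text, answers
toy_csv <- "Q1,Q3
What is your age?,In which country do you currently reside?
25-29,Greece
18-21,India
"

# I() marks the string as literal data rather than a file path
toy <- read_csv(I(toy_csv), show_col_types = FALSE)

# The header line became the column names; row 1 holds the question text
toy_clean <- toy[-1, ]

nrow(toy)        # rows before dropping the question text
nrow(toy_clean)  # rows after
```

Comparing the two row counts is a quick way to confirm that exactly one row was removed.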
Prepare Data
Since my analysis focuses on Greek users, I split the dataset into two groups: one contains only Greek users and the other contains everyone else. This lets us observe differences and similarities with Kaggle’s broader userbase.
# Recoding Q2 (gender)
kaggle_2021$Q2 <- kaggle_2021$Q2 %>%
  fct_recode(
    "Other" = "Nonbinary",
    "Other" = "Prefer not to say",
    "Other" = "Prefer to self-describe"
  )
# Recoding Q3 (country)
kaggle_2021$Q3 <- kaggle_2021$Q3 %>%
  fct_recode(
    "Hong Kong" = "Hong Kong (S.A.R.)",
    "Other" = "I do not wish to disclose my location",
    "Iran" = "Iran, Islamic Republic of...",
    "UAE" = "United Arab Emirates",
    "UK" = "United Kingdom of Great Britain and Northern Ireland",
    "USA" = "United States of America",
    "Vietnam" = "Viet Nam"
  )
# Recoding Q4 (education)
kaggle_2021$Q4 <- kaggle_2021$Q4 %>%
  fct_recode(
    "Bachelor" = "Bachelor's degree",
    "PhD" = "Doctoral degree",
    "Other" = "I prefer not to answer",
    "Master" = "Master's degree",
    "No" = "No formal education past high school",
    "ProfDoc" = "Professional doctorate",
    "UniNoDegree" = "Some college/university study without earning a bachelor's degree"
  ) %>%
  fct_relevel("No", "UniNoDegree", "Bachelor", "Master", "PhD", "ProfDoc", "Other")
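# Create comparison variable: Greece vs the rest
# (Q3 is a factor after recoding, so both branches return plain strings
# to avoid mixing factor and character types inside if_else())
kaggle_2021_compare <- kaggle_2021 %>%
  mutate(Q3 = if_else(Q3 == "Greece", "Greece", "Other"))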
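As an alternative sketch of the same split, the {forcats} helper fct_other() can collapse every level except Greece into one level. This is not the approach used above, and the toy tibble is hypothetical. It is shown only because it keeps Q3 a factor throughout.

```r
library(dplyr)
library(forcats)

# Hypothetical respondents; country labels follow the survey's wording
toy <- tibble(
  Q3 = c("Greece", "India", "Viet Nam", "Greece",
         "I do not wish to disclose my location")
)

toy_compare <- toy %>%
  mutate(
    # Shorten long labels first (the result is a factor)
    Q3 = fct_recode(Q3,
      "Vietnam" = "Viet Nam",
      "Other"   = "I do not wish to disclose my location"
    ),
    # Collapse every level except Greece into a single "Other" level
    Q3 = fct_other(Q3, keep = "Greece", other_level = "Other")
  )

count(toy_compare, Q3)
```

Counting the groups right after the split is a cheap sanity check that only two labels remain.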
Kaggle’s Community
One of the first things I noticed when I signed up on Kaggle was the wide variety of nationalities and the multicultural character of the platform. People from many countries gather on one platform, sharing the same passion for Data Science and Data Analytics. Something like Facebook but for Statistics :)
I built a reactable to see which countries the platform’s users come from. One in four users is from India, making it the most represented country on the platform. Greek users are far fewer, making up 0.39% of Kaggle’s userbase.
Assumptions Note
Note that these results come from Kaggle’s survey, that is, only from the people who participated. We therefore have to assume that respondents are distributed in the same way as the users who did not respond.
kaggle_2021 %>%
  group_by(Q3) %>%
  summarise(n = n()) %>%
  mutate(pct = round(n / nrow(kaggle_2021) * 100, digits = 2)) %>%
  arrange(desc(pct)) %>%
  reactable(
    defaultPageSize = 6,
    theme = espn(),
    columns = list(
      Q3 = colDef(name = "Country"),
      n = colDef(name = "Population", defaultSortOrder = "desc"),
      pct = colDef(name = "Percentage (%)")
    )
  )
Women’s Participation in DS
Generally, women are under-represented in the labor market. According to the World Bank, only one in two women participates in the labor market, compared with seven in ten men. Does the DS community follow the same pattern? As it turns out, it varies by country.
Greece has a relatively disappointing rate of women’s participation, holding 15th place with 15.7%, while the average across all countries is around 26%.
data <- kaggle_2021 %>%
  group_by(Q3) %>%
  summarise(
    n = n(),
    Women = sum(Q2 == "Woman"),
    pct_women = Women / n * 100
  ) %>%
  dplyr::filter(Q3 != "Other")
# ISO 3166-1 alpha-2 codes in lower case (Q3 is a factor, so convert it first)
data$iso2c <- countrycode(as.character(data$Q3), "country.name", "iso2c")
data$iso2c <- tolower(data$iso2c)
# Sort countries from highest to lowest share of women
data_decreasing <- data %>% dplyr::arrange(desc(pct_women))
highchart() %>%
  hc_chart(type = "bar", inverted = TRUE) %>%
  hc_title(text = "<b>Women participation in DS community per country</b>") %>%
  hc_xAxis(categories = data_decreasing$Q3, title = list(text = NULL)) %>%
  hc_yAxis(title = list(text = "Percent (%)")) %>%
  hc_legend(enabled = FALSE) %>%
  hc_add_series(
    name = "Women %",
    data = data_decreasing$pct_women
  )
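An "average across all countries" can be read in two ways, and the two can differ noticeably when country sizes are unequal. The sketch below uses invented counts (the toy tibble is hypothetical, not survey data) to contrast the unweighted mean of per-country percentages with the pooled share over all respondents. Which of the two the figure quoted above corresponds to is not something this sketch can settle.

```r
library(dplyr)

# Hypothetical counts, chosen only to illustrate the arithmetic
toy <- tibble(
  country = c("A", "B", "C"),
  n       = c(1000, 100, 50),
  women   = c(200, 30, 20)
) %>%
  mutate(pct_women = women / n * 100)

# Unweighted mean: every country counts equally
mean_of_countries <- mean(toy$pct_women)

# Pooled share: every respondent counts equally
pooled_share <- sum(toy$women) / sum(toy$n) * 100

c(mean_of_countries = mean_of_countries, pooled_share = pooled_share)
```

With one very large country in the data, the pooled share is pulled toward that country’s rate, while the unweighted mean is not.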
Age Distribution
Greece’s Kaggle community skews older than the rest of Kaggle’s community. More specifically, Greece’s most prevalent age group is 25–29, and a sizeable proportion of users are in their 40s. In contrast, Kaggle’s global community is quite youthful, with the three youngest age groups being the most prevalent. Together, those groups make up six out of ten of Kaggle’s global userbase.
data1 <- kaggle_2021_compare %>%
  count(Q3, Q1) %>%
  # Add zero rows for age groups missing from either side, so that both
  # series line up with the x-axis categories
  complete(Q3, Q1, fill = list(n = 0)) %>%
  group_by(Q3) %>%
  mutate(pct = round(n / sum(n) * 100, digits = 1)) %>%
  ungroup() %>%
  select(Q3, Q1, pct)
highchart() %>%
  hc_chart(type = "areaspline") %>%
  hc_title(text = "Age Distribution of Kaggle Community (🇬🇷 / 🌍)") %>%
  hc_xAxis(categories = unique(data1$Q1)) %>%
  hc_yAxis(title = list(text = "Population (%)")) %>%
  hc_add_series(
    name = "Greece",
    data = data1 %>% dplyr::filter(Q3 == "Greece") %>% pull(pct)
  ) %>%
  hc_add_series(
    name = "Rest of the World",
    data = data1 %>% dplyr::filter(Q3 == "Other") %>% pull(pct)
  )
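A statement such as "the three youngest groups make up six out of ten users" is a cumulative share. As a small sketch with invented percentages (toy_age is hypothetical and does not reproduce the survey figures), a running total over the ordered age groups gives it directly.

```r
library(dplyr)

# Hypothetical percentages per age group for one population
toy_age <- tibble(
  Q1  = c("18-21", "22-24", "25-29", "30-34", "35-39", "40+"),
  pct = c(20, 20, 20, 15, 10, 15)
)

age_cumulative <- toy_age %>%
  arrange(Q1) %>%                   # these labels sort correctly as text
  mutate(cum_pct = cumsum(pct))     # running share up to each age group

age_cumulative
```

The value of cum_pct in the third row is the combined share of the three youngest groups.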
Educational Background
Greek Kagglers have attained higher levels of education than the global Kaggle userbase. The share holding a Master’s or PhD degree is markedly higher among Greek users.
data3 <- kaggle_2021_compare %>%
  select(Q3, Q4) %>%
  group_by(Q3, Q4) %>%
  summarise(n = n()) %>%
  group_by(Q3) %>%
  mutate(pct = round(n / sum(n) * 100, digits = 1)) %>%
  ungroup()
d <- data3 %>% dplyr::filter(Q4 %in% c("Bachelor", "Master", "PhD"))
highchart() %>%
  hc_chart(type = "bar") %>%
  hc_title(text = "Educational Background of Kaggle Community (🇬🇷 / 🌍)") %>%
  hc_subtitle(text = "Greek Kagglers have attained higher levels of education.") %>%
  # One category per degree (d holds each degree once per group)
  hc_xAxis(categories = unique(as.character(d$Q4)), title = list(text = NULL)) %>%
  hc_yAxis(title = list(text = "Percentage (%) of total respondents")) %>%
  hc_series(
    list(name = "Greece", data = d %>% dplyr::filter(Q3 == "Greece") %>% pull(pct)),
    list(name = "Rest World", data = d %>% dplyr::filter(Q3 == "Other") %>% pull(pct))
  )
Programming Language
The dominant programming language in Data Science, for both Greek and global users, is Python. Around 8 in 10 users have selected Python among the languages they use, followed by R and then SQL. One noteworthy difference is that R has a proportionally stronger presence among Greek users than in the global community.
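# Share of each group that selected Python.
# Q7_Part_1 is NA for respondents who did not select it, so count() yields a
# TRUE row and an NA row; pct uses the full group size, then the NA row is dropped.
d1 <- kaggle_2021_compare %>%
  select(Q3, Q7_Part_1) %>%
  group_by(Q3) %>%
  count(Q7_Part_1 == "Python") %>%
  mutate(pct = round(n / sum(n) * 100, digits = 1)) %>%
  na.omit()
# Same computation for R (Q7_Part_2)
d2 <- kaggle_2021_compare %>%
  select(Q3, Q7_Part_2) %>%
  group_by(Q3) %>%
  count(Q7_Part_2 == "R") %>%
  mutate(pct = round(n / sum(n) * 100, digits = 1)) %>%
  na.omit()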
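The per-language blocks above can be generalised. As a hedged sketch on invented data (the toy tibble is hypothetical and assumes, as in the code above, that each Q7_Part_* column holds the language name when selected and NA otherwise), reshaping to long format computes the share for every language in one pass.

```r
library(dplyr)
library(tidyr)

# Hypothetical multi-select answers mirroring the Q7_Part_* layout
toy <- tibble(
  Q3        = c("Greece", "Greece", "Other", "Other", "Other"),
  Q7_Part_1 = c("Python", NA, "Python", "Python", NA),
  Q7_Part_2 = c("R", "R", NA, "R", NA)
)

lang_share <- toy %>%
  # Record the group size before reshaping, so it is the denominator
  add_count(Q3, name = "respondents") %>%
  pivot_longer(
    starts_with("Q7_Part_"),
    names_to = "part",
    values_to = "language",
    values_drop_na = TRUE          # keep only languages that were selected
  ) %>%
  count(Q3, respondents, language) %>%
  mutate(pct = round(n / respondents * 100, digits = 1))

lang_share
```

Because respondents can select several languages, the percentages within a group are not expected to sum to 100.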
Jobs
The most common roles in the Greek Kaggle community are Data Analyst, Data Scientist, and Student. The distribution broadly mirrors the global community, though with some differences in the specific proportions.
jobs_greece <- kaggle_2021_compare %>%
  select(Q3, Q5) %>%
  filter(Q3 == "Greece") %>%
  count(Q5) %>%
  # Compute the share over all Greek respondents before keeping the top 8
  mutate(pct = round(n / sum(n) * 100, 1)) %>%
  arrange(desc(n)) %>%
  head(8)
jobs_other <- kaggle_2021_compare %>%
  select(Q3, Q5) %>%
  filter(Q3 == "Other") %>%
  count(Q5) %>%
  # Compute the share over all other respondents before keeping the top 8
  mutate(pct = round(n / sum(n) * 100, 1)) %>%
  arrange(desc(n)) %>%
  head(8)
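To compare the two groups role by role, the shares can be placed side by side. The following is a minimal sketch on invented answers (the toy tibble and the resulting numbers are hypothetical), using pivot_wider() to get one column per group and a difference column.

```r
library(dplyr)
library(tidyr)

# Hypothetical role answers for each group
toy <- tibble(
  Q3 = c(rep("Greece", 4), rep("Other", 6)),
  Q5 = c("Student", "Data Analyst", "Data Analyst", "Data Scientist",
         "Student", "Student", "Student", "Data Scientist",
         "Data Scientist", "Data Analyst")
)

jobs_compare <- toy %>%
  count(Q3, Q5) %>%
  group_by(Q3) %>%
  mutate(pct = round(n / sum(n) * 100, digits = 1)) %>%  # share within group
  ungroup() %>%
  select(-n) %>%
  # One row per role, one column per group; absent roles get 0
  pivot_wider(names_from = Q3, values_from = pct, values_fill = 0) %>%
  mutate(diff = Greece - Other) %>%
  arrange(desc(Greece))

jobs_compare
```

A positive diff marks roles that are relatively more common among Greek respondents in the toy data.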
Conclusions
The findings of this analysis present a nuanced picture of the Greek data science community on Kaggle. Greece accounts for only 0.39% of the surveyed userbase — a small but interesting sample.
Key takeaways:
- Demographics: The Greek community skews older than the global average, with the 25–29 age group being the most prevalent, and a notable share of users in their 40s.
- Education: Greek users are more highly educated on average, with approximately 78% holding a Master’s or Doctoral degree, compared to around 50% globally.
- Gender: Women’s participation in the Greek DS community (15.7%) lags behind the global average (~26%), reflecting a broader societal pattern.
- Programming: Python dominates in both groups, but R has a proportionally stronger presence among Greek users.
- Employment: Greek data scientists tend to work in smaller companies — about 58% work in organizations with fewer than 250 employees, versus 47% globally.
The age distribution is perhaps the most telling finding: it suggests that data science has not yet fully penetrated younger demographics in Greece at the same rate as globally. This could reflect a more mature but narrower community, with lower participation from the newer generation of practitioners.
Acknowledgements
Dataset based on the 2021 Kaggle Machine Learning & Data Science Survey.
Image by Christina Smith from Pixabay.