
Where are we Still Split on Tokenization?

Apr 22, 2024

Many Natural Language Processing (NLP) tasks are labeled at the token level; for these tasks, the first step is to identify the tokens (tokenization). Because this step is often considered a solved problem, gold tokenization is commonly assumed. In this paper, we investigate whether this task is indeed solved by supervised tokenizers. To this end, we propose an efficient multi-task model for tokenization that performs on par with the state of the art. We use this model to reflect on the status of the tokenization task by evaluating on 122 languages in 20 different scripts. We show that tokenization performance depends mainly on the amount and consistency of annotated data, as well as on the difficulty the writing system poses for the task. We conclude that, aside from inconsistencies in the data and exceptional cases, the task can be considered solved for Latin-script languages in in-dataset settings (>99.5 F1). However, performance is 0.75 F1 points lower on average for datasets in other scripts, and it deteriorates further in cross-dataset setups.
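To make the evaluation concrete, here is a minimal sketch of how token-level F1 can be computed between a gold and a predicted tokenization. This is an illustrative assumption, not the paper's actual scoring code: it scores exact matches of character-offset spans, so a predicted token counts as correct only if its span coincides with a gold token's span. The function names `spans` and `token_f1` are hypothetical.

```python
# Hypothetical sketch of token-level F1 over character spans
# (not the paper's code; assumes tokens concatenate to the same text).

def spans(tokens):
    """Map a token sequence to a set of (start, end) character spans,
    computed by concatenating the tokens in order."""
    out, pos = [], 0
    for tok in tokens:
        out.append((pos, pos + len(tok)))
        pos += len(tok)
    return set(out)

def token_f1(gold_tokens, pred_tokens):
    """F1 over exact span matches between gold and predicted tokens."""
    gold, pred = spans(gold_tokens), spans(pred_tokens)
    tp = len(gold & pred)  # predicted spans that exactly match a gold span
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, splitting the clitic "'m" into two tokens costs both precision and recall: `token_f1(["I", "'m", "here"], ["I", "'", "m", "here"])` yields 4/7, while a perfectly matching tokenization yields 1.0.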