Introducing MuLCAM: Building the Language Foundation for Nepal’s Digital Safety Needs

Open Knowledge Foundation NepalFri Jun 19 2026· 5 min read

Imagine reporting harassment to your supervisor, only to hear: “Do you have proof?” The messages are deleted. The screenshots you never took. The weeks of escalation, now reduced to your word against someone else’s. You go to the Cyber Crime Bureau. They ask for evidence again. They acknowledge the challenges of cross-border coordination with social media platforms. Nothing moves.

This is not a hypothetical. It is a pattern we have heard repeatedly, from women in workplaces, from girls on social media, from individuals who finally found the courage to report and were met with a system not built to receive them.

The gap is not only in the justice system. It starts much earlier, in the digital platforms themselves. When those platforms are asked why they cannot moderate local language harassment, the answer is consistent: we do not have the data. Nepali slurs, Romanized Nepali insults, gendered abuse in Newari, Bhojpuri, Tharu, coded threats buried in informal language. None of this is captured in the global datasets that power mainstream moderation systems.

That gap is precisely what MuLCAM was built to fill.

What Is MuLCAM?

MuLCAM (Multilingual Corpus and Moderation) is Nepal’s first open, context-aware linguistic corpus focused on Technology-Facilitated Gender-Based Violence (TFGBV). It is a living, community-driven dataset that documents how harm and abuse appear in Nepali digital spaces across languages, scripts, and cultural contexts, and makes that data freely available for researchers, developers, platform trust and safety teams, journalists, and policymakers.

The project is an initiative of Open Knowledge Nepal (OKN), built with and for communities directly affected by online gender-based violence.

You can explore the project at: https://mulcam.opendatanepal.com/

Why This Exists: The Scale of the Problem

The Nepal Police Cyber Bureau’s data tracks the pattern year on year. In FY 2023/24, 19,730 cybercrime complaints were filed nationally, of which 8,745 cases involved violence against women, ranging from harassment and impersonation to blackmail and non-consensual sharing of intimate images. A further 382 complaints involved girls, and 767 involved individuals from gender and sexual minority groups. The following year, FY 2024/25, saw 18,926 cases registered (an average of 52 per day), including 1,801 bullying and harassment complaints, 3,067 fake or impersonated accounts, and 495 hate speech reports. Mid-year data from that same fiscal year showed 4,981 women and 255 girls already among those filing complaints. The official numbers are almost certainly an undercount: many survivors never report at all.

And those are only the reported cases. Advocates and practitioners working in this space know that the true number is far larger. Stigma, fear of retaliation, lack of trust in institutions, and the burden of collecting and preserving evidence all prevent survivors from coming forward.

The investigation and justice process is slow. Law enforcement cites the difficulty of obtaining evidence and the complexity of coordinating with international social media platforms. Meanwhile, the platforms themselves say they cannot moderate content effectively because of multilingual complexity and the absence of structured datasets for local languages.

This creates a circular trap: no dataset means no moderation. No moderation means harm goes undetected. Undetected harm means no evidence. No evidence means no justice. And throughout all of this, the burden falls on the survivor.

MuLCAM is an intervention at the root of this cycle.

How MuLCAM Works

The Corpus

At the heart of MuLCAM is an open, evolving dataset of slurs, derogatory terms, and contextual harassment language across Nepal’s languages. The corpus currently covers:

Nepali (Devanagari script)
Romanized Nepali (how most young people actually type)
Other mother tongues, including Newari, Bhojpuri, Tharu, and Tamang

This is not a simple word blacklist. The corpus is structured around approximately 12 harm categories, including strong slurs, gendered insults, sexualized language, dehumanizing phrases, coercive threats, body shaming, and hybrid or coded expressions. Each entry is annotated with language, category, and contextual examples of both harmful and non-harmful usage, because context matters. A word that signals abuse in one context might carry a completely different meaning in another.

Two downloadable datasets are available on the corpus page:

tfgbv_lexicon.csv: A multilingual lexicon of abusive terms and TFGBV expressions, categorized by language and subcategory.
annotated_tfgbv_dataset.csv: Full annotated sentences and messages labeled for TFGBV classification, including language, subcategory, and terms used.

Both datasets are also accessible directly on GitHub under a Creative Commons Attribution 4.0 license.

Detection and Moderation Tools

The corpus is a foundation, not an endpoint. The vision for MuLCAM’s second track is to build practical tools on top of this data: plug-ins, APIs, and network-layer detection algorithms that platforms, workplaces, and developers can integrate into their systems. One direction we have been exploring is a Slack add-on for workplace communication channels, which would be particularly relevant in Nepal’s growing tech-enabled work environment where much harassment now happens inside professional tools like Slack and WhatsApp rather than just on public social media.

This part of the work is still in progress. We are actively exploring approaches, partnerships, and technical pathways to make these tools a reality. The core design principle is clear: detection without surveillance, giving users agency over their own safety rather than building blanket monitoring systems that strip context and privacy. But getting there requires the right collaborators.

If you are a developer, researcher, or organization with ideas or capacity to support the tools development track, we would love to hear from you. Write to us at info@oknp.org.

Get Involved

MuLCAM grows with every contribution. The current dataset is a starting point, and making it a genuinely robust resource for Nepal’s digital safety ecosystem means more inputs from communities, language experts, researchers, and practitioners across the country.

You can contribute text examples and lexicon suggestions directly through the website at: https://mulcam.opendatanepal.com/contribute, or submit datasets and annotations via GitHub at: https://github.com/openknowledgenp/nepal-digitalsafety-corpus.

If your organization works in digital safety, gender justice, open data, or civic tech and you see opportunities for partnership or collaboration, write to us at info@oknp.org. We are also open to hearing new ideas on tools, research applications, and uses of the corpus.

← PreviousPublishing and Exploring Data Just Got Easier on Open Data Nepal Next →When Algorithm Decides: What Nepal’s Digital Creators Are Up Against

Content on this site, made by Open Knowledge Foundation Nepal, is licensed under a Creative Commons Attribution 4.0 International License.