User talk:Jberkel

; Archives
	2015; 2016; 2017; 2018; 2019; 2020; 2021; 2022; 2023;

Catalan pronunciations

Hi, just a note to be careful when adding Catalan pronunciations. For example, you added a pronunciation of ê to esquetx, which is wrong (it should be é) and unlikely in any case, since ê generally only occurs with inheritances and some old borrowings, and esquetx is a recent borrowing from English. I have documented the sources of pronunciation in the documentation to {{ca-IPA}}; in particular, only trust the DCVB for Balearic pronunciations and don't trust cawikt at all. Benwing2 (talk) 02:34, 28 January 2024 (UTC)

@Benwing2: Ok, I thought cawikt was fairly reliable. Btw, thanks for your great work on the Catalan corner! Jberkel 10:42, 28 January 2024 (UTC)

Statistics

Hi Jberkel, would you like to do another update of the statistics? Your last one already dates back to 1 July. Yes, I know it takes a lot of time and computing power, but I think we would all simply like to know again. :) Steinbach (talk) 17:18, 22 February 2024 (UTC)

@Steinbach Hi, I would like to do it regularly, but there are still data problems with the HTML dumps: phab:T305407. The last reasonably complete data is from last July. The WMF people are working on it, but somehow it is taking forever; I keep asking about it :( Jberkel 17:42, 22 February 2024 (UTC)

HTML Dump

Hi, I saw your posts complaining about the lack of HTML dumps as I had the same issue. I ended up creating my own HTML dump using the API to rapidly download millions of entries. I used the 20240220 XML dump as a base so that the two dumps would include exactly the same revisions. Note that the same wikitext can produce different HTML code at different points in time, so I can't guarantee that the page looks exactly as it did at the time of the XML dump.

Pages included: non-redirects in namespaces 0 (main) and 118 (reconstruction)
Number of lines: 7,952,575
Time generated: February 20, 2024, 7:49:52 PM to February 22, 2024, 1:16:18 AM (EST)
Uncompressed size: 112,213,194,308 bytes
Compressed size: 5,482,140,342 bytes

Would you be interested in the code or the dump itself?

Ioaxxere (talk) 20:05, 22 February 2024 (UTC)

@Ioaxxere Lol, I'm close to starting a project myself, given the glacial progress on the WMF side. Yes, I'm interested: how did you get the HTML, and how long does it take? Is it the Parsoid-rendered version which is used in the HTML dumps? If you want, we can join forces and run it as a community project. Jberkel 09:44, 23 February 2024 (UTC)

The script works by grabbing HTML data using a revision ID. For example: https://en.wiktionary.org/w/api.php?action=parse&oldid=65853771&format=json. I'm not sure what parser is used but it seems to correspond with "view page source" in my browser. Here is the code:

wikt_html.py

import concurrent.futures
import mmap
import re
from random import random
from time import sleep, time

import requests

BATCH_SIZE = 10000
HEADER = {"User-Agent": "User:Ioaxxere"} # replace with your username
OUTPUT_PATH = r"D:\wiktionarydumps\output.ndjson" # replace with your output location

# tuned parameters
RATE_LIMIT = 80 # requests per second, across all threads
THREAD_COUNT = 100

def fetch_data(revid):
    # fetch the API response (a JSON string) for one revision ID, retrying until HTTP 200
    print(revid)
    while True:
        starttime = time()
        try:
            result = requests.get(f"https://en.wiktionary.org/w/api.php?action=parse&oldid={revid}&format=json", headers=HEADER)
            if result.status_code == 200: # OK
                break
            print("...error:", result.status_code)
        except requests.RequestException:
            print("...error: Connection failed")
        sleep(0.5 * (1 + random())) # back off 0.5 to 1 s before retrying

    # each thread spends at least THREAD_COUNT/RATE_LIMIT seconds per request,
    # which caps the overall rate at RATE_LIMIT requests per second
    waittime = THREAD_COUNT/RATE_LIMIT - (time() - starttime)
    if waittime > 0:
        sleep(waittime)

    return result.text

def big_file_finditer(filename, pattern, flags=0):
    # iterate over regex matches in a file too large to read into memory
    compiled_pattern = re.compile(pattern.encode(), flags)
    with open(filename, "rb") as f:
        return compiled_pattern.finditer(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))

# latest revision ID of every page in namespaces 0 (main) and 118 (reconstruction)
pages = [revid.group(1).decode("utf-8") for revid in big_file_finditer("wikt_dump.xml", r"<ns>(?:0|118)</ns>\s+<id>\d+</id>\s+<revision>\s+<id>(\d+)", re.DOTALL)]

for i in range(0, len(pages), BATCH_SIZE):
    queries = pages[i:i+BATCH_SIZE]

    # executor.map preserves input order, so output lines line up with the revision IDs
    with concurrent.futures.ThreadPoolExecutor(max_workers=THREAD_COUNT) as executor:
        output = executor.map(fetch_data, queries)

    output = "\n".join(output) + "\n"
    with open(OUTPUT_PATH, "a", encoding="utf-8") as f:
        f.write(output)
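
As a rough sanity check on the tuned parameters, the pacing rule in fetch_data can be worked through with the figures reported earlier in this thread. This is only a sketch: it assumes one request per output line (retries are ignored), and every name other than RATE_LIMIT and THREAD_COUNT is made up for illustration.

```python
from datetime import datetime

THREAD_COUNT = 100
RATE_LIMIT = 80  # requests per second
TOTAL_REQUESTS = 7_952_575  # "Number of lines" reported for the dump

# Each thread waits until at least this long has passed since its request started.
min_seconds_per_request = THREAD_COUNT / RATE_LIMIT

# With every thread busy, the combined rate cannot exceed this ceiling.
max_requests_per_second = THREAD_COUNT / min_seconds_per_request

# Best-case duration if the ceiling were sustained for the whole run.
estimated_hours = TOTAL_REQUESTS / max_requests_per_second / 3600

# Start and end times reported for the dump (both in EST).
start = datetime(2024, 2, 20, 19, 49, 52)
end = datetime(2024, 2, 22, 1, 16, 18)
observed_seconds = (end - start).total_seconds()
observed_rate = TOTAL_REQUESTS / observed_seconds

print(f"minimum time per request per thread: {min_seconds_per_request} s")
print(f"best-case run time: {estimated_hours:.1f} h")
print(f"observed run time: {observed_seconds / 3600:.1f} h")
print(f"observed average rate: {observed_rate:.1f} requests/s")
```

Under these assumptions the ceiling gives a best case of roughly 27.6 hours, while the reported timestamps span about 29.4 hours, an average of about 75 requests per second.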

Then I verified the output with this code:

find_errors.py

import re

# report the line number and API error code of every failed request
with open(r"D:\wiktionarydumps\output.ndjson", "r", encoding="utf-8") as f:
    for n, line in enumerate(f, start=1):
        if line.startswith('{"error":'):
            print(n, re.findall(r'"code":"([^"]+)"', line)[0])

Which produced:

errors.txt

43725 nosuchrevid
82006 nosuchrevid
106857 nosuchrevid
248730 nosuchrevid
319048 nosuchrevid
323556 nosuchrevid
330049 nosuchrevid
394498 nosuchrevid
437859 nosuchrevid
448121 nosuchrevid
561668 nosuchrevid
590865 nosuchrevid
603650 nosuchrevid
610405 nosuchrevid
720072 nosuchrevid
749333 nosuchrevid
808355 nosuchrevid
814281 nosuchrevid
822969 nosuchrevid
859557 nosuchrevid
1021390 nosuchrevid
1036457 nosuchrevid
1058296 nosuchrevid
1084837 nosuchrevid
1157698 nosuchrevid
1229978 nosuchrevid
1248685 nosuchrevid
1285246 nosuchrevid
1323983 nosuchrevid
1324915 nosuchrevid
1385186 nosuchrevid
1396962 nosuchrevid
1486775 nosuchrevid
1497989 nosuchrevid
1513303 nosuchrevid
1581275 nosuchrevid
1609470 nosuchrevid
1678410 nosuchrevid
1725167 nosuchrevid
1735366 nosuchrevid
1735744 nosuchrevid
1814983 nosuchrevid
1854120 nosuchrevid
1907407 nosuchrevid
1921876 nosuchrevid
1963831 nosuchrevid
2010212 nosuchrevid
2073363 nosuchrevid
2166069 nosuchrevid
2177988 nosuchrevid
2183914 nosuchrevid
2184460 nosuchrevid
2278457 nosuchrevid
2330349 nosuchrevid
2358375 nosuchrevid
2499758 nosuchrevid
2501157 nosuchrevid
2520901 nosuchrevid
2591419 nosuchrevid
2621251 nosuchrevid
2630284 nosuchrevid
2671770 nosuchrevid
2696918 nosuchrevid
2697777 nosuchrevid
2746586 nosuchrevid
2769872 nosuchrevid
2831640 nosuchrevid
2857869 nosuchrevid
2910282 nosuchrevid
2911183 nosuchrevid
2915318 nosuchrevid
2967304 nosuchrevid
3014563 nosuchrevid
3063851 nosuchrevid
3124420 nosuchrevid
3137890 nosuchrevid
3185708 nosuchrevid
3225411 nosuchrevid
3230226 nosuchrevid
3241060 nosuchrevid
3259739 nosuchrevid
3261952 nosuchrevid
3301323 nosuchrevid
3318285 nosuchrevid
3320219 nosuchrevid
3324414 nosuchrevid
3336037 nosuchrevid
3443783 nosuchrevid
3481014 nosuchrevid
3527574 nosuchrevid
3585227 nosuchrevid
3589765 nosuchrevid
3614305 nosuchrevid
3734605 nosuchrevid
3821927 nosuchrevid
3843626 nosuchrevid
3914931 nosuchrevid
3925139 nosuchrevid
4025930 nosuchrevid
4244319 nosuchrevid
4246017 nosuchrevid
4260112 nosuchrevid
4278061 nosuchrevid
4330469 nosuchrevid
4331657 nosuchrevid
4412350 nosuchrevid
4413758 nosuchrevid
4432652 nosuchrevid
4485019 nosuchrevid
4602733 nosuchrevid
4608289 nosuchrevid
4720573 nosuchrevid
4737790 nosuchrevid
4858538 nosuchrevid
4889458 nosuchrevid
4908594 nosuchrevid
4973122 nosuchrevid
5010716 nosuchrevid
5052814 nosuchrevid
5150511 nosuchrevid
5154623 nosuchrevid
5182578 nosuchrevid
5223840 nosuchrevid
5235533 nosuchrevid
5246229 nosuchrevid
5259002 nosuchrevid
5344233 nosuchrevid
5364980 nosuchrevid
5368363 nosuchrevid
5369738 nosuchrevid
5469778 nosuchrevid
5507943 nosuchrevid
5598277 nosuchrevid
5607802 nosuchrevid
5631256 nosuchrevid
5648406 nosuchrevid
5659237 nosuchrevid
5729700 nosuchrevid
5752778 nosuchrevid
5774071 nosuchrevid
5790022 nosuchrevid
5833505 nosuchrevid
5861520 nosuchrevid
5864017 nosuchrevid
5871030 nosuchrevid
5877754 nosuchrevid
5983008 nosuchrevid
6006358 nosuchrevid
6067067 nosuchrevid
6085428 nosuchrevid
6138076 nosuchrevid
6138136 nosuchrevid
6188278 nosuchrevid
6248831 nosuchrevid
6276367 nosuchrevid
6286098 nosuchrevid
6289698 nosuchrevid
6293458 nosuchrevid
6303351 nosuchrevid
6309621 nosuchrevid
6311475 nosuchrevid
6391744 nosuchrevid
6392577 nosuchrevid
6396159 nosuchrevid
6409595 nosuchrevid
6412793 nosuchrevid
6424036 nosuchrevid
6484785 nosuchrevid
6562806 nosuchrevid
6568126 nosuchrevid
6580802 nosuchrevid
6633849 nosuchrevid
6741033 nosuchrevid
6797937 nosuchrevid
6900647 nosuchrevid
6903671 nosuchrevid
6996408 nosuchrevid
6996487 nosuchrevid
7030860 nosuchrevid
7043778 nosuchrevid
7048043 nosuchrevid
7059900 nosuchrevid
7091062 nosuchrevid
7091425 nosuchrevid
7130255 nosuchrevid
7169063 nosuchrevid
7184906 nosuchrevid
7244549 nosuchrevid
7276644 nosuchrevid
7331248 nosuchrevid
7359021 nosuchrevid
7537357 nosuchrevid
7578135 nosuchrevid
7585843 nosuchrevid
7595812 nosuchrevid
7641806 nosuchrevid
7651915 nosuchrevid
7697219 nosuchrevid
7778037 nosuchrevid
7781476 nosuchrevid
7782612 nosuchrevid
7802193 nosuchrevid
7808302 nosuchrevid
7820909 nosuchrevid
7885180 nosuchrevid
7914802 nosuchrevid

These correspond with pages in the XML dump that have recently been deleted.
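
A slightly stricter check than matching on the line prefix might parse each line as JSON. The sketch below assumes the default action=parse JSON layout, with a top-level "parse" object on success and a top-level "error" object on failure; classify_line and the sample lines are hypothetical and not taken from the real dump.

```python
import json

def classify_line(line):
    """Return ("error", code) or ("ok", title) for one NDJSON line."""
    record = json.loads(line)
    if "error" in record:
        return ("error", record["error"].get("code", "unknown"))
    # Assumed success layout: {"parse": {"title": ..., "text": {"*": html}}}
    return ("ok", record.get("parse", {}).get("title"))

# Hypothetical sample lines for illustration only.
sample_lines = [
    '{"parse": {"title": "example", "revid": 1, "text": {"*": "<p>...</p>"}}}',
    '{"error": {"code": "nosuchrevid", "info": "There is no revision with ID 2."}}',
]

results = [classify_line(line) for line in sample_lines]
for number, (status, detail) in enumerate(results, start=1):
    print(number, status, detail)
```

Parsing the whole line is slower than a prefix test on a file of this size, so it is probably better suited to spot checks than to a full pass.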

I don't have the time/resources to generate these on a regular basis, but you're welcome to adapt this code for your purposes!

Ioaxxere (talk) 19:56, 23 February 2024 (UTC)

Oh god, I just realized that adding &parsoid=true to the API query gives *far* better data. Time to rerun... Ioaxxere (talk) 20:09, 23 February 2024 (UTC)
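
For anyone adapting the script, the changed query could be built along these lines. This is a minimal sketch: the parsoid=true parameter is taken from the comment above, and build_parse_url and API_URL are illustrative names that do not appear in the original script.

```python
from urllib.parse import urlencode

API_URL = "https://en.wiktionary.org/w/api.php"

def build_parse_url(revid, use_parsoid=True):
    """Build an action=parse URL for one revision ID (hypothetical helper)."""
    params = {"action": "parse", "oldid": revid, "format": "json"}
    if use_parsoid:
        # Parameter name as given in the comment above.
        params["parsoid"] = "true"
    return f"{API_URL}?{urlencode(params)}"

url = build_parse_url(65853771)
print(url)
```

The returned string could replace the f-string passed to requests.get in fetch_data; the rest of the script would stay the same.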

Cool, thanks! We could run it on WMF infrastructure. Great to see that 50 lines of Python yield better results than the WMF's buzzword soup of Kafka, DAGs and what have you… How long does it take to do a full run? Jberkel 15:20, 26 February 2024 (UTC)

nm, you already had it in your post, almost 2 days… :) Jberkel 15:57, 26 February 2024 (UTC)

Even if the WMF some day manage to produce useful dumps again, we'll still need wiki-specific namespaces such as Reconstruction, so it'll be useful to have some way of generating them ourselves. Jberkel 15:58, 26 February 2024 (UTC)

ScribuntoUnit vs. UnitTests

I just discovered there are two unit testing frameworks here, Module:UnitTests used by everyone but you, and Module:ScribuntoUnit used by you. The former is older than the latter, so I'm not sure why you imported the latter from Wikipedia, but I think we should consolidate. Can you think about converting your unit tests to use Module:UnitTests? Benwing2 (talk) 20:34, 10 March 2024 (UTC)

Hi, just wondering if you got my msg. Can you at least clarify why you imported and started using Module:ScribuntoUnit in preference to our own module? BTW I just discovered a third unit test framework, Module:QFQ/UnitTests, used only on Module:mnw-translit. Benwing2 (talk) 07:43, 14 March 2024 (UTC)

Hi @Benwing2, sorry, I had a short Wiktionary hiatus. It's been a long time (~10 years), but I think when I first looked at Module:UnitTests it was a spaghetti mess and didn't have the features I wanted. That's probably no longer the case, and I agree it's better to standardize on one framework. Jberkel 09:27, 15 March 2024 (UTC)

catalogue raisonné

Wwoww, Jberkel, you're fast. Wanted to cite the same Guardian passage here, and it was already there ... MistaPPPP (talk) 12:55, 19 March 2024 (UTC)

Apologies

I need to apologise to you as well: my simple edit to my archaic paragraph about certain 'etymologies that discredit Wiktionary' somehow completely disrupted the edit section, including yours. There should really be a mechanism in place to stop this from happening, since any innocent editor could well make a similar mistake, and if it were not detected as quickly as Surjection and I detected this one, it could cause linguistic mayhem! Regards, Andrew Andrew H. Gray 11:40, 29 March 2024 (UTC)

On ass...

What Doyle said was about this:

https://en.m.wiktionary.org/wiki/arse#English

Here, ass is another way of spelling arse (as in dumb). Lunatone3000 (talk) 22:24, 4 April 2024 (UTC)
