User talk:Jberkel
Catalan pronunciations
Hi, just a note to be careful when adding Catalan pronunciations. For example, you added a pronunciation of ê to esquetx, which is wrong (it should be é) and unlikely in any case, since ê generally only occurs with inheritances and some old borrowings, and esquetx is a recent borrowing from English. I have documented the sources of pronunciation in the documentation to {{ca-IPA}}; in particular, only trust the DCVB for Balearic pronunciations and don't trust cawikt at all. Benwing2 (talk) 02:34, 28 January 2024 (UTC)
- @Benwing2: Ok, I thought cawikt was fairly reliable. Btw, thanks for your great work on the Catalan corner! Jberkel 10:42, 28 January 2024 (UTC)
Statistics
Hi Jberkel, would you like to do another update of the statistics? Your last one already dates back to 1 July. Yes, I know it takes a lot of time and computing power, but I think we would all just like to know the numbers again. :) Steinbach (talk) 17:18, 22 February 2024 (UTC)
- @Steinbach Hi, I'd like to do it regularly, but there are still data problems with the HTML dumps: phab:T305407. The last reasonably complete data is from last July. The WMF people are working on it, but somehow it is taking forever; I keep asking about it :( Jberkel 17:42, 22 February 2024 (UTC)
HTML Dump
Hi, I saw your posts complaining about the lack of HTML dumps as I had the same issue. I ended up creating my own HTML dump using the API to rapidly download millions of entries. I used the 20240220 XML dump as a base so that the two dumps would include exactly the same revisions. Note that the same wikitext can produce different HTML code at different points in time, so I can't guarantee that the page looks exactly as it did at the time of the XML dump.
- Pages included: non-redirects in namespaces 0 (main) and 118 (reconstruction)
- Number of lines: 7,952,575
- Time generated: February 20, 2024, 7:49:52 PM to February 22, 2024, 1:16:18 AM (EST)
- Uncompressed size: 112,213,194,308 bytes
- Compressed size: 5,482,140,342 bytes
Would you be interested in the code or the dump itself?
Ioaxxere (talk) 20:05, 22 February 2024 (UTC)
- @Ioaxxere Lol, I'm close to starting a project myself, given the glacial progress on the WMF side. Yes, I'm interested, how did you get the HTML, how long does it take? Is it the Parsoid rendered version which is used in the HTML dumps? If you want we can join forces and run it as a community project. Jberkel 09:44, 23 February 2024 (UTC)
The script works by grabbing HTML data using a revision ID. For example: https://en.wiktionary.org/w/api.php?action=parse&oldid=65853771&format=json. I'm not sure what parser is used but it seems to correspond with "view page source" in my browser. Here is the code:
import requests
import concurrent.futures
from time import time, sleep
from random import random
import mmap
import re

BATCH_SIZE = 10000
HEADER = {"User-Agent": "User:Ioaxxere"}  # replace with your username

# tuned parameters
RATE_LIMIT = 80  # requests per second, across all threads
THREAD_COUNT = 100

def fetch_data(revid):
    """Fetch the API response (raw JSON text) for one revision ID, retrying until HTTP 200."""
    print(revid)
    while True:
        starttime = time()
        try:
            result = requests.get(f"https://en.wiktionary.org/w/api.php?action=parse&oldid={revid}&format=json", headers=HEADER)
            if result.status_code == 200:  # OK
                break
            print("...error:", result.status_code)
        except requests.RequestException:
            print("...error: Connection failed")
        sleep(0.5 * (1 + random()))  # wait 0.5 to 1 s before retrying
    # Each thread spends at least THREAD_COUNT/RATE_LIMIT seconds per request,
    # which caps the combined rate at RATE_LIMIT requests per second.
    waittime = THREAD_COUNT/RATE_LIMIT - (time() - starttime)
    if waittime > 0:
        sleep(waittime)
    return result.text

def big_file_finditer(filename, pattern, flags=0):
    """Run a regex over a file too large to read into memory by memory-mapping it."""
    compiled_pattern = re.compile(pattern.encode(), flags)
    with open(filename, "rb") as f:
        return compiled_pattern.finditer(mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ))

# Latest revision ID of every page in namespaces 0 and 118. Redirect pages have a
# <redirect> element between <id> and <revision>, so the pattern skips them.
pages = [revid.group(1).decode("utf-8") for revid in big_file_finditer("wikt_dump.xml", r"<ns>(?:0|118)</ns>\s+<id>\d+</id>\s+<revision>\s+<id>(\d+)", re.DOTALL)]

for i in range(0, len(pages), BATCH_SIZE):
    queries = pages[i:i+BATCH_SIZE]
    with concurrent.futures.ThreadPoolExecutor(max_workers=THREAD_COUNT) as executor:
        output = executor.map(fetch_data, queries)  # results come back in input order
        output = "\n".join(q for q in output) + "\n"
    with open(r"D:\wiktionarydumps\output.ndjson", "a", encoding="utf-8") as f:  # replace with your output location
        f.write(output)
Then I verified the output with this code:
import re

n = 0
with open(r"D:\wiktionarydumps\output.ndjson", "r", encoding="utf-8") as f:
    for line in f:
        n += 1
        if line.startswith("{\"error\":"):
            print(n, re.findall("\"code\":\"([^\"]+)\"", line)[0])
Which produced:
43725 nosuchrevid
82006 nosuchrevid
106857 nosuchrevid
248730 nosuchrevid
319048 nosuchrevid
323556 nosuchrevid
330049 nosuchrevid
394498 nosuchrevid
437859 nosuchrevid
448121 nosuchrevid
561668 nosuchrevid
590865 nosuchrevid
603650 nosuchrevid
610405 nosuchrevid
720072 nosuchrevid
749333 nosuchrevid
808355 nosuchrevid
814281 nosuchrevid
822969 nosuchrevid
859557 nosuchrevid
1021390 nosuchrevid
1036457 nosuchrevid
1058296 nosuchrevid
1084837 nosuchrevid
1157698 nosuchrevid
1229978 nosuchrevid
1248685 nosuchrevid
1285246 nosuchrevid
1323983 nosuchrevid
1324915 nosuchrevid
1385186 nosuchrevid
1396962 nosuchrevid
1486775 nosuchrevid
1497989 nosuchrevid
1513303 nosuchrevid
1581275 nosuchrevid
1609470 nosuchrevid
1678410 nosuchrevid
1725167 nosuchrevid
1735366 nosuchrevid
1735744 nosuchrevid
1814983 nosuchrevid
1854120 nosuchrevid
1907407 nosuchrevid
1921876 nosuchrevid
1963831 nosuchrevid
2010212 nosuchrevid
2073363 nosuchrevid
2166069 nosuchrevid
2177988 nosuchrevid
2183914 nosuchrevid
2184460 nosuchrevid
2278457 nosuchrevid
2330349 nosuchrevid
2358375 nosuchrevid
2499758 nosuchrevid
2501157 nosuchrevid
2520901 nosuchrevid
2591419 nosuchrevid
2621251 nosuchrevid
2630284 nosuchrevid
2671770 nosuchrevid
2696918 nosuchrevid
2697777 nosuchrevid
2746586 nosuchrevid
2769872 nosuchrevid
2831640 nosuchrevid
2857869 nosuchrevid
2910282 nosuchrevid
2911183 nosuchrevid
2915318 nosuchrevid
2967304 nosuchrevid
3014563 nosuchrevid
3063851 nosuchrevid
3124420 nosuchrevid
3137890 nosuchrevid
3185708 nosuchrevid
3225411 nosuchrevid
3230226 nosuchrevid
3241060 nosuchrevid
3259739 nosuchrevid
3261952 nosuchrevid
3301323 nosuchrevid
3318285 nosuchrevid
3320219 nosuchrevid
3324414 nosuchrevid
3336037 nosuchrevid
3443783 nosuchrevid
3481014 nosuchrevid
3527574 nosuchrevid
3585227 nosuchrevid
3589765 nosuchrevid
3614305 nosuchrevid
3734605 nosuchrevid
3821927 nosuchrevid
3843626 nosuchrevid
3914931 nosuchrevid
3925139 nosuchrevid
4025930 nosuchrevid
4244319 nosuchrevid
4246017 nosuchrevid
4260112 nosuchrevid
4278061 nosuchrevid
4330469 nosuchrevid
4331657 nosuchrevid
4412350 nosuchrevid
4413758 nosuchrevid
4432652 nosuchrevid
4485019 nosuchrevid
4602733 nosuchrevid
4608289 nosuchrevid
4720573 nosuchrevid
4737790 nosuchrevid
4858538 nosuchrevid
4889458 nosuchrevid
4908594 nosuchrevid
4973122 nosuchrevid
5010716 nosuchrevid
5052814 nosuchrevid
5150511 nosuchrevid
5154623 nosuchrevid
5182578 nosuchrevid
5223840 nosuchrevid
5235533 nosuchrevid
5246229 nosuchrevid
5259002 nosuchrevid
5344233 nosuchrevid
5364980 nosuchrevid
5368363 nosuchrevid
5369738 nosuchrevid
5469778 nosuchrevid
5507943 nosuchrevid
5598277 nosuchrevid
5607802 nosuchrevid
5631256 nosuchrevid
5648406 nosuchrevid
5659237 nosuchrevid
5729700 nosuchrevid
5752778 nosuchrevid
5774071 nosuchrevid
5790022 nosuchrevid
5833505 nosuchrevid
5861520 nosuchrevid
5864017 nosuchrevid
5871030 nosuchrevid
5877754 nosuchrevid
5983008 nosuchrevid
6006358 nosuchrevid
6067067 nosuchrevid
6085428 nosuchrevid
6138076 nosuchrevid
6138136 nosuchrevid
6188278 nosuchrevid
6248831 nosuchrevid
6276367 nosuchrevid
6286098 nosuchrevid
6289698 nosuchrevid
6293458 nosuchrevid
6303351 nosuchrevid
6309621 nosuchrevid
6311475 nosuchrevid
6391744 nosuchrevid
6392577 nosuchrevid
6396159 nosuchrevid
6409595 nosuchrevid
6412793 nosuchrevid
6424036 nosuchrevid
6484785 nosuchrevid
6562806 nosuchrevid
6568126 nosuchrevid
6580802 nosuchrevid
6633849 nosuchrevid
6741033 nosuchrevid
6797937 nosuchrevid
6900647 nosuchrevid
6903671 nosuchrevid
6996408 nosuchrevid
6996487 nosuchrevid
7030860 nosuchrevid
7043778 nosuchrevid
7048043 nosuchrevid
7059900 nosuchrevid
7091062 nosuchrevid
7091425 nosuchrevid
7130255 nosuchrevid
7169063 nosuchrevid
7184906 nosuchrevid
7244549 nosuchrevid
7276644 nosuchrevid
7331248 nosuchrevid
7359021 nosuchrevid
7537357 nosuchrevid
7578135 nosuchrevid
7585843 nosuchrevid
7595812 nosuchrevid
7641806 nosuchrevid
7651915 nosuchrevid
7697219 nosuchrevid
7778037 nosuchrevid
7781476 nosuchrevid
7782612 nosuchrevid
7802193 nosuchrevid
7808302 nosuchrevid
7820909 nosuchrevid
7885180 nosuchrevid
7914802 nosuchrevid
These correspond with pages in the XML dump that have recently been deleted.
I don't have the time/resources to generate these on a regular basis, but you're welcome to adapt this code for your purposes!
Ioaxxere (talk) 19:56, 23 February 2024 (UTC)
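As a rough cross-check of the throttling in the script, the arithmetic can be worked through from the figures quoted in the post. This is only a sketch: it assumes every line of the dump corresponds to one successful request and ignores retries, and the variable names are made up for illustration.

```python
from datetime import datetime

# Figures quoted in the post
THREAD_COUNT = 100
RATE_LIMIT = 80          # target requests per second
LINE_COUNT = 7_952_575   # lines in the finished dump
START = datetime(2024, 2, 20, 19, 49, 52)
END = datetime(2024, 2, 22, 1, 16, 18)

# Each worker spends at least this long per request, so all workers
# together cannot exceed RATE_LIMIT requests per second.
min_seconds_per_request = THREAD_COUNT / RATE_LIMIT

# Lower bound on the run time if the limit were reached the whole time.
ideal_hours = LINE_COUNT / RATE_LIMIT / 3600

# What the reported timestamps imply.
elapsed_seconds = (END - START).total_seconds()
actual_hours = elapsed_seconds / 3600
effective_rate = LINE_COUNT / elapsed_seconds

print(f"{min_seconds_per_request:.2f} s per request per thread")
print(f"ideal: {ideal_hours:.1f} h, actual: {actual_hours:.1f} h")
print(f"effective rate: {effective_rate:.1f} requests/s")
```

Under these assumptions the run comes out at roughly 29.4 hours against an ideal of about 27.6 hours, i.e. an effective rate of about 75 requests per second, close to the configured limit of 80.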
- Oh god, I just realized that adding &parsoid=true to the API query gives *far* better data. Time to rerun... Ioaxxere (talk) 20:09, 23 February 2024 (UTC)
- Cool, thanks! We could run it on WMF infrastructure. Great to see that 50 lines of Python yield better results than the WMF's buzzword soup of Kafka, DAGs and what have you… How long does it take to do a full run? Jberkel 15:20, 26 February 2024 (UTC)
- nm, you already had it in your post, almost 2 days… :) Jberkel 15:57, 26 February 2024 (UTC)
- Even if the WMF some day manage to produce useful dumps again, we'll still need wiki-specific namespaces such as Reconstruction, so it'll be useful to have some way of generating them ourselves. Jberkel 15:58, 26 February 2024 (UTC)
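For anyone adapting the script, the &parsoid=true change mentioned in this thread could be sketched as below. The helper name parse_url and the constant API_URL are hypothetical and not part of the original script; the query parameters are the ones quoted in the thread, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

API_URL = "https://en.wiktionary.org/w/api.php"

def parse_url(revid, parsoid=False):
    """Build the action=parse URL for one revision ID (hypothetical helper)."""
    params = {"action": "parse", "oldid": revid, "format": "json"}
    if parsoid:
        # Extra flag reported in the thread to give better output
        params["parsoid"] = "true"
    return f"{API_URL}?{urlencode(params)}"

print(parse_url(65853771, parsoid=True))
```

Keeping the parameters in a dictionary means further options can be added without editing the f-string in fetch_data by hand.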
ScribuntoUnit vs. UnitTests
I just discovered there are two unit testing frameworks here, Module:UnitTests used by everyone but you, and Module:ScribuntoUnit used by you. The former is older than the latter, so I'm not sure why you imported the latter from Wikipedia, but I think we should consolidate. Can you think about converting your unit tests to use Module:UnitTests? Benwing2 (talk) 20:34, 10 March 2024 (UTC)
- Hi, just wondering if you got my msg. Can you at least clarify why you imported and started using Module:ScribuntoUnit in preference to our own module? BTW I just discovered a third unit test framework, Module:QFQ/UnitTests, used only on Module:mnw-translit. Benwing2 (talk) 07:43, 14 March 2024 (UTC)
- Hi @Benwing2, sorry had short Wiktionary hiatus. It's been a long time (~ 10 years), but I think when I first looked at Module:UnitTests it was a spaghetti mess and didn't have the features I wanted. That's probably no longer the case, and I agree it's better to standardize on one framework. Jberkel 09:27, 15 March 2024 (UTC)
Wwoww, Jberkel, you're fast. Wanted to cite the same Guardian passage here, and it was already there ... MistaPPPP (talk) 12:55, 19 March 2024 (UTC)
Apologies
I also need to apologise to you: a simple edit to my old paragraph about certain 'etymologies that discredit Wiktionary' completely disrupted the section being edited, including your contribution. There should really be a mechanism in place to stop this from happening, since any innocent editor could well make a similar mistake, and if it were not detected as quickly as Surjection and I detected this one, it could cause linguistic mayhem! Regards, Andrew Andrew H. Gray 11:40, 29 March 2024 (UTC)
On ass...
What Doyle said was about this:
https://en.m.wiktionary.org/wiki/arse#English
Here, ass is another way of spelling arse (as in dumb). Lunatone3000 (talk) 22:24, 4 April 2024 (UTC)