User talk:Kristian-Clausal

Welcome

Hello, welcome to Wiktionary, and thank you for your contributions so far.

If you are unfamiliar with wiki-editing, take a look at Help:How to edit a page. It is a concise list of technical guidelines to the wiki format we use here: how to, for example, make text boldfaced or create hyperlinks. Feel free to practice in the sandbox. If you would like a slower introduction we have a short tutorial.

These links may help you familiarize yourself with Wiktionary:

  • Entry layout (EL) is a detailed policy on Wiktionary's page formatting; all entries must conform to it. The easiest way to start off is to copy the contents of an existing same-language entry, and then adapt it to fit the entry you are creating.
  • Check out Language considerations to find out more about how to edit for a particular language.
  • Our Criteria for Inclusion (CFI) defines exactly which words can be added to Wiktionary; the most important part is that Wiktionary only accepts words that have been in somewhat widespread use over the course of at least a year, and citations that demonstrate usage can be asked for when there is doubt.
  • If you already have some experience with editing our sister project Wikipedia, then you may find our guide for Wikipedia users useful.
  • If you have any questions, bring them to Wiktionary:Information desk or ask me on my talk page.
  • Whenever commenting on any discussion page, please sign your posts with four tildes (~~~~) which automatically produces your username and timestamp.
  • You are encouraged to add a BabelBox to your userpage to indicate your self-assessed knowledge of languages.

Enjoy your stay at Wiktionary! Vininn126 (talk) 10:20, 8 November 2021 (UTC)

Reusing Wiktionary data

Hi there, thank you for your comments and edits here. You mentioned on the Feedback page that the company you work for uses Wiktionary data in some form. Can you share some more details about this? – Jberkel 15:37, 10 November 2021 (UTC)

I'll have to ask my boss what I can disclose. I'm pretty sure it's reasonably public and he's been active here himself before I joined, but I want to be sure. Kristian-Clausal (talk) 06:17, 11 November 2021 (UTC)
Nothing to disclose then? Anyway, whatever it is, make sure you comply with the license. Seems obvious but a lot of sites reuse content without attribution. - Jberkel 12:13, 28 November 2021 (UTC)
We ended up not actually going through with it because Wiktionary's Wikimedia-style data structures (especially the category system) made it essentially unusable for our purposes. There are some other things in the pipeline, maybe. All the licensing stuff and such is above board, and has been since before I came along, so no need to worry about that. Kristian-Clausal (talk) 06:10, 29 November 2021 (UTC)

"changes for machine-readability"

Hi. Can you explain why you need to make changes like [1] for "machine readability"? I do software development professionally and I can guarantee you that there is no need to make such changes. You changed the look of the page, which you should only do if this is the correct thing from a usability/UI standpoint, not simply to make your life easier as a programmer. Benwing2 (talk) 02:08, 23 July 2022 (UTC)

I see you did this to a lot of templates. I am going to revert them all; this is a really bad idea. Benwing2 (talk) 02:11, 23 July 2022 (UTC)
OK maybe that was a bit harsh. I think what you are trying to do is make it easier to identify the headers vs. the content. However, this causes all headers to be boldface, which changes the look of the page, which may not be the right thing for an individual page. Benwing2 (talk) 03:19, 23 July 2022 (UTC)
Thank you for not reverting the changes yet. I am on vacation right now, so I'll try to get someone else to explain what we're trying to do, if I can't do so well enough right now.
I work for Clausal Computing Oy in Finland, which runs kaikki.org. Wiktextract is a project that takes wikitext (specifically en.wiktionary.org text), processes it and tries to parse it to get data from Wiktionary word article headers and tables, then outputs it in JSON format.
And by parsing, I mean genuinely getting meaningful information out of human-written text. But this is genuinely difficult to do, because Wiktionary is written by humans and their output is variable and ambiguous (comparing one editor to another, or even one editor to themselves in a different article).
The change I have been making to tables is, as you've seen, turning table cells with content into headers when they are semantically headers, because that's one of the persistent problems we have with tables. We have some heuristic-based methods for getting something out of tables by guessing whether a given cell should actually be considered a header. That causes a lot of false positives and a lot of garbage data, because you can easily have a table with a content cell that looks like a pronoun or a grammatical term, which then scrambles everything (amongst other problems). However, when the parser finds something inside a <th></th> / ! header block, it knows that it's not a content cell and can handle it appropriately, without guessing.
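The contrast between guessing and explicit markup can be illustrated with a minimal Python sketch. This is not Wiktextract's actual code: the `GRAMMAR_TERMS` word list and both function names are hypothetical, chosen only to show how a vocabulary-based guess misfires on a content cell and how an explicit header flag removes the guess.

```python
# Hypothetical vocabulary a heuristic might treat as "header-like" text.
GRAMMAR_TERMS = {"singular", "plural", "nominative", "he", "she", "they"}


def guess_is_header(cell_text: str) -> bool:
    """Guess header status from the text alone (prone to false positives)."""
    return cell_text.strip().lower() in GRAMMAR_TERMS


def is_header(cell_text: str, marked_as_header=None) -> bool:
    """Prefer explicit markup (<th> or '!') and fall back to guessing.

    marked_as_header is True/False when the markup tells us the cell type,
    or None when no markup information is available.
    """
    if marked_as_header is not None:
        return marked_as_header
    return guess_is_header(cell_text)


# A content cell that happens to contain a pronoun fools the heuristic...
print(guess_is_header("they"))
# ...but not a parser that can see the cell was written as a data cell.
print(is_header("they", marked_as_header=False))
```

In this sketch the word "they" appearing as an inflected form in a data cell is misread as a header by the guess alone, which is the kind of scrambling described above.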
What I've been doing when I edit tables (and modules that generate tables) is change |-cells in the table into !-cells (by hand). This is the correct thing to do, from the perspective of the tables: these are headers, and 90% of the time there isn't even any kind of cosmetic change. Only sometimes, like with the Hindi modules you've apparently noticed, is there a visible change.
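As a rough illustration of why the `!` marker matters to a parser, here is a small Python sketch that reads a wikitext table and labels each cell as a header or a data cell purely from its markup. It is a simplified assumption-laden example, not how Wiktextract parses tables: the function name `classify_cells` is made up, and cell attributes (such as `style="..." |`), nested tables, and multi-line cells are ignored.

```python
def classify_cells(wikitext_table: str) -> list[tuple[str, str]]:
    """Return (kind, text) pairs, where kind is 'header' or 'data'."""
    cells = []
    for line in wikitext_table.splitlines():
        line = line.strip()
        # Skip blank lines, table start/end, row separators and captions.
        if not line or line.startswith(("{|", "|}", "|-", "|+")):
            continue
        if line.startswith("!"):
            kind, sep = "header", "!!"
        elif line.startswith("|"):
            kind, sep = "data", "||"
        else:
            continue  # continuation text; ignored in this sketch
        # Drop the leading marker, then split cells written on one line.
        for cell in line[1:].split(sep):
            cells.append((kind, cell.strip()))
    return cells


table = """{|
|-
! Person !! Singular
|-
| first || main
|}"""

for kind, text in classify_cells(table):
    print(kind, text)
```

With the first row written using `!`, the two cells in it are labelled as headers without any guessing; had it been written with `|`, they would be indistinguishable from the data row.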
This is the first time since I've started doing these changes that anyone's actually noticed anything, or at least said anything, so I've taken it to mean that the changes haven't broken anything, and they shouldn't have.
If you wish, I can leave the specific language you're worried about alone and just add it to a list of languages in our code "with known difficult tables". Eventually, as we get the list populated, the languages in that list will have the above-mentioned guess-based heuristics used on them and won't output error messages telling me to fix the table cell headers. But that will have to wait until next month when I get back to work.
If you want to contribute to Wiktextract, it is open source and everyone is welcome! Seeing as you are a Wiktionary editor and programmer, you could even make tables correct in a better way than I have done up until now, by converting cells into headers when appropriate and applying some kind of style to them to override the header bolding? Kristian-Clausal (talk) 05:43, 23 July 2022 (UTC)
I'm all for making Wiktionary more machine-readable, but maybe it would be good to discuss this with the community as well. I think most editors here aren't even aware that this project exists (and when I asked you about it previously, you didn't mention it). – Jberkel 13:51, 23 July 2022 (UTC)
I have tried not to do anything that would be big enough to involve the community and have just made what I've felt are small common-sense changes. For example, changing the <td> into <th> in tables is the most common way to do tables in general on Wiktionary, so I had convinced myself that it had to be part of some general table formatting style guide and that tables that didn't do it were just incorrect. I will not touch tables in the future, unless something comes from your deliberations. Almost all other edits I've done are very minor, and you can verify that easily by checking them out, if you are feeling suspicious. Kristian-Clausal (talk) 14:40, 23 July 2022 (UTC)

"The subst: quotation template is malformed and I can't figure out how to fix "

If you can't figure out how to fix it, ask on the BP or GP, or mark it for cleanup ({{rfc}}), instead of just deleting it. Jberkel 08:30, 15 February 2024 (UTC)

I'll keep that in mind. Kristian-Clausal (talk) 08:32, 15 February 2024 (UTC)