teo123 wrote: ↑Tue Sep 10, 2024 4:06 am
brimstoneSalad wrote:Do more work on this before you jump to the conclusion that it's flawed.
And you told me in 2016 that I can safely reject any idea I come up with as being as dumb as a rock. Does that still apply?
I don't know. It has been eight years, but you're also still pulling the same stuff you were in your flat Earth days.
teo123 wrote: ↑Tue Sep 10, 2024 4:06 am
It's simply not obvious to me that one who is, in the absence of a proper statistical analysis, claiming that some pattern is coincidental is more likely to be right than one who is claiming it's a real pattern.
The pattern must be defined more clearly. If it is merely the appearance of a pattern, dug up out of a huge amount of data by selecting ad hoc a slice of data that fits that apparent pattern, then apparent patterns likely vastly outnumber actual patterns. So yes, we should assume it's coincidence.
As you demonstrate a better P value relative to the share of the original data pool that your ad hoc selection represents, you increase the chances of it not being coincidence. P value standards for papers are actually arbitrary; in some cases they should be far stricter to match the circumstances of the pattern's "discovery". Experimental P values in the harder sciences are different, as long as we force all study results to be shared.
teo123 wrote: ↑Tue Sep 10, 2024 4:06 am
And that's kind of irrelevant here since I actually have a statistical analysis: basic information theory suggesting that the probability of that k-r pattern in the Croatian river names occurring by chance is between 1/300 and 1/17. One who is claiming that some pattern is coincidental in spite of a statistical analysis showing it's statistically significant is... probably wrong, right?
In a soft science where the only thing done was data analysis and no actual experiments, no. I explained why previously.
If it's only 1/17, it's probably coincidence.
It's not hard to look at a large data set and select a subset of that data which fits a pattern with only a 1/17 chance of arising by coincidence.
For instance, if I roll a quarter million dice of various colors, I will find different patterns among the different colors. What are the odds of the fifty six-sided yellowish green dice with silvery flecks not rolling any 2s? It seems improbable at first (roughly one in ten thousand), but once multiplied by the number of possible sets I cherry picked from, I find the odds are extremely good, and a "coincidence" of this specific nature (no 2s) somewhere in the data set is about 50-50 odds. Add to that looking for other similar coincidences (no 1s, no 3s, etc.) and the odds increase linearly.
You have looked at an entire language (and you have done so perhaps unknowingly, just by being a native speaker and intuitively looking for apparent patterns), and you've found one niche category which appears to fit a specific pattern with a very underwhelming 1/17 odds of being coincidence. I think you will find many more patterns out of coincidence with those odds or better.
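If you'd rather check the dice claim by brute force than take my word for it, here is a rough simulation sketch. It assumes the quarter million dice split into 5,000 non-overlapping color groups of 50 fair dice each; the function name, the seed, and the trial count are just illustrative choices, not part of the argument.

```python
import random

def some_subset_has_no_twos(rng, n_subsets=5_000, dice_per_subset=50):
    """Roll n_subsets groups of fair dice; True if any group shows no 2 at all."""
    for _ in range(n_subsets):
        # any() stops at the first 2, so most groups cost only a few rolls.
        if not any(rng.randint(1, 6) == 2 for _ in range(dice_per_subset)):
            return True  # found a "pattern"; no need to roll the remaining groups
    return False

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 100            # assumed trial count; more trials give a steadier estimate
hits = sum(some_subset_has_no_twos(rng) for _ in range(trials))
fraction_with_coincidence = hits / trials
print(f"{hits} of {trials} simulated rolls contained a subset with no 2s")
```

With only 100 trials the fraction will bounce around, but it should land in the general neighborhood of a coin flip rather than anywhere near one in ten thousand.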
If it's 1/300 you have a better argument, but I still don't put your odds very high due to the large number of words in Croatian and the small set you're dealing with (river names, right?).
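To put rough numbers on 1/17 versus 1/300, here is a small sketch of how quickly a coincidence becomes likely as the number of candidate patterns you could have noticed grows. The candidate counts are purely hypothetical (I don't know how many categories a native speaker implicitly scans), and the candidates are assumed independent.

```python
def p_at_least_one(p_single: float, n_candidates: int) -> float:
    """Chance that at least one of n independent candidate patterns appears by luck."""
    return 1 - (1 - p_single) ** n_candidates

# Hypothetical numbers of candidate patterns examined, knowingly or not.
for n in (1, 17, 100, 300, 1000):
    print(f"{n:>5} candidates: "
          f"1/17 -> {p_at_least_one(1 / 17, n):.3f}, "
          f"1/300 -> {p_at_least_one(1 / 300, n):.3f}")
```

At 1/17 you only need a couple dozen candidates before a fluke is more likely than not; at 1/300 it takes a couple hundred, which is still not many for a whole language.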
teo123 wrote: ↑Tue Sep 10, 2024 4:06 am
The problem is that, at least in soft sciences (I am not sure how it is in the hard sciences, but I suppose it's similar), you can always say "But maybe some more appropriate model would suggest that pattern is not actually statistically significant."
If you want to be in the soft sciences, that's something you have to deal with. In the hard sciences, that's more difficult and usually involves pointing out a flaw with the experimental setup.
In case it's not obvious, here's the equation for the odds of not rolling 2s in the colored dice analysis, so you can apply it to your own field:
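(5/6)^50 × (250,000 / 50)
Here 5/6 is the odds of a single die not rolling a 2, 50 is the number of dice in the subset chosen, 250,000 is the total number of dice, and 250,000 / 50 = 5,000 is the number of subsets of that size in the total. That comes to about 0.55. Strictly speaking, that is the expected number of subsets with no 2s rather than a probability; the chance of finding at least one is a little lower, about 0.42, which is still roughly a coin flip.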
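Here is the same arithmetic as a few lines of Python, in case you want to plug in your own numbers. The variable names are mine, and the calculation assumes fair dice and non-overlapping subsets of 50.

```python
p_no_two_one_die = 5 / 6     # a single die does not show a 2
dice_per_subset = 50         # size of the cherry-picked subset
total_dice = 250_000         # size of the whole data pool

p_subset_clean = p_no_two_one_die ** dice_per_subset     # about 1.1e-4, i.e. ~1 in 9,100
n_subsets = total_dice // dice_per_subset                # 5,000 subsets to pick from
expected_clean = p_subset_clean * n_subsets              # about 0.55 such subsets expected
p_at_least_one = 1 - (1 - p_subset_clean) ** n_subsets   # about 0.42

print(f"one subset: {p_subset_clean:.2e}")
print(f"expected clean subsets: {expected_clean:.2f}")
print(f"chance of at least one: {p_at_least_one:.2f}")
```

To apply it to river names, you would swap in your own per-item probability, your subset size, and an honest estimate of how many comparable subsets the language offers.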
This of course doesn't fully represent the entire set, because there's overlap. You could also form a set of all silver flecked dice, all green dice regardless of flecks, etc., which makes the odds of a pattern with this P value occurring even higher, since dice can be double-counted for inclusion in other mixed sets. It must also be compounded by every other hypothesis of the same general type (like all cases of not rolling a specific number). Doing that makes the odds of not finding MANY coincidences at the 1/10,000 level astronomically small.
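As a rough sketch of the compounding, here is what happens when you count "no 1s", "no 2s", and so on through "no 6s" as equally valid discoveries. It only covers the every-face part; it still assumes 5,000 non-overlapping subsets of 50 fair dice, so the overlapping sets described above would push the numbers further. The function name is mine.

```python
from math import comb

def p_some_face_missing(n_dice: int, sides: int = 6) -> float:
    """Chance that at least one face never appears in n_dice fair rolls
    (inclusion-exclusion over which faces are missing)."""
    return sum(
        (-1) ** (k + 1) * comb(sides, k) * ((sides - k) / sides) ** n_dice
        for k in range(1, sides)
    )

n_subsets = 250_000 // 50                  # 5,000 non-overlapping subsets (assumed)
p_subset = p_some_face_missing(50)         # about 6.6e-4, roughly six times the no-2s figure
expected_hits = p_subset * n_subsets       # about 3.3 "patterns" expected
p_no_hits = (1 - p_subset) ** n_subsets    # about 0.04 chance of finding none

print(f"expected coincidences: {expected_hits:.1f}")
print(f"chance of finding none: {p_no_hits:.3f}")
```

So even before any double counting, coming up empty is already the unusual outcome.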