The lexical material for this research was initially selected by the author independently, because etymological dictionaries were absent or unavailable. However, when etymological dictionaries are used to study ethnogenetic processes in prehistoric times, the selection should follow certain rules, which the author has learned in the course of the work. The main ones are set out below.
Methods of mathematical statistics rely on a random sample drawn from the "general population", which serves as a model of the data source. A random sample statistically represents the general population, provided that all subjective factors that would make the sample non-random are eliminated when it is formed. In our research, lacking full confidence in the random nature of the sample, we deliberately planned to take a larger number of elements in order to prevent distortion of the final results. A feature of the graphic-analytical method used here, however, is that the randomness and size of the sample are judged sufficient when they reflect the internal structure of the population, which cannot be achieved with a small amount of random data. The data were therefore collected in a volume corresponding to the "level of confidence" that makes it possible to construct a graphic scheme of the relationships between related languages. The very fact that such a scheme can be constructed indicates that the data have an internal structure. If a scheme of relationships between closely related languages cannot be constructed from the collected data, this indicates either errors in the data or a lack of relationship between some of the languages included in the study.
The common opinion that vocabulary is highly unstable can be explained by the fact that many languages contain a large number of loanwords. However, observations show that loanwords belong primarily to the more "cultural" layer of the vocabulary, while ancient words, which correspond to a lower cultural level, remain in the language. These ancient words are at the same time the most commonly used ones. According to A.V. Desnitskaja, native vocabulary includes a significant part of the most common words, which reflect basic concepts and form the largest number of word-formation nests (DESNITSKAJA A.V. 1966: 9). M.V. Arapov and V.V. Herz describe the dependence between the frequency of a word and its age as follows:
There is a relationship between the frequency of a word and the time of its appearance in the language… Most words with a high frequency of use are ancient words, and vice versa: the lower the frequency of a word, the more likely it is that the word is newly created (ARAPOV M.V., HERZ V.V., 1974: 3).
The authors note that this connection was first observed by George Kingsley Zipf in 1947, who appreciated its significance for the quantitative analysis of facts relating to the history of a language (Zipf's law). It should be borne in mind, however, that some low-frequency words may be ancient, and that there are many newly created words with a high frequency of use; the latter can easily be removed from lexical-statistical studies on the basis of their meaning.
It is known that there are languages whose vocabulary contains more words of foreign origin than native ones, but in which native words predominate in everyday use; such languages therefore do not give the impression of belonging to a different linguistic group, even with respect to their vocabulary. Romanian, for example, is in this situation: its vocabulary contains more words of Slavic origin, followed by Latin, Turkish and Modern Greek, yet Romanian and texts written in it give the impression of a Romance, not a Slavic, language. Ignoring or misunderstanding the dependence between the age of a word and its frequency of use confuses linguists on the question of the primary kinship of languages, complicates the distinction between ancient words and loanwords, and eventually leads scholars into a deadlock.
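The contrast between the share of borrowed words in a dictionary and their share in running text can be made concrete with a small calculation. The following is only a minimal sketch: the lexicon, the labels "native" and "borrowed", and all counts are invented for illustration and are not statistics for Romanian or any other language.

```python
from collections import Counter

# Hypothetical toy lexicon: word -> (origin, occurrences in a text sample).
# All entries and counts are invented for illustration only.
LEXICON = {
    "w1": ("native", 120),
    "w2": ("native", 95),
    "w3": ("native", 80),
    "w4": ("borrowed", 6),
    "w5": ("borrowed", 4),
    "w6": ("borrowed", 3),
    "w7": ("borrowed", 2),
}


def origin_shares(lexicon):
    """Return (share by dictionary entries, share by running-text tokens)."""
    types = Counter()   # each word counted once, as in a dictionary
    tokens = Counter()  # each word weighted by how often it is used
    for origin, freq in lexicon.values():
        types[origin] += 1
        tokens[origin] += freq
    n_types = sum(types.values())
    n_tokens = sum(tokens.values())
    type_share = {o: types[o] / n_types for o in types}
    token_share = {o: tokens[o] / n_tokens for o in tokens}
    return type_share, token_share


type_share, token_share = origin_shares(LEXICON)
print(type_share)   # borrowed words form the majority of dictionary entries
print(token_share)  # native words form the majority of running text
```

In this toy data borrowed words make up most of the dictionary entries, while native words account for almost all of the running text, which is the situation described above.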
The separation of inherited common words from later loanwords in related languages is one of the most difficult problems in historical linguistics. All comparativists understand it well, because it arises immediately in the comparative analysis of any languages. Even when we choose the most commonly used words for lexical-statistical analysis on the basis of their meanings, we always run a certain risk of including in the lists some ancient words of foreign origin. For the majority of languages, however, such words are relatively few, and if the selected lexical material is specially analysed in order to eliminate borrowed words, this risk is substantially reduced and the errors do not significantly affect the results of the research. Eliminating later borrowings is easier in cases where the donor language is known. By later borrowings we mean those dating from the time when the speakers of the languages had already left their ancestral areas. Borrowings between language groups before that time are difficult to separate from words of native origin. But for determining the primary areas of settlement, as we shall see, this is not a serious problem.
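The elimination step described above can be sketched as a simple filter. This is an illustrative assumption, not the author's actual procedure: the `Entry` record, the `donor` field and the `select_candidates` function are hypothetical names, the words and frequencies are placeholders, and ranking by corpus frequency stands in for the selection by meaning that the text describes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Entry:
    word: str
    frequency: int               # occurrences in some reference corpus (assumed)
    donor: Optional[str] = None  # known donor language, if identified as a loan


def select_candidates(entries, top_n):
    """Keep the top_n most frequent entries that have no known donor language."""
    inherited = [e for e in entries if e.donor is None]
    inherited.sort(key=lambda e: e.frequency, reverse=True)
    return [e.word for e in inherited[:top_n]]


# Placeholder data, invented for illustration only.
entries = [
    Entry("a", 500),
    Entry("b", 420, donor="X"),
    Entry("c", 300),
    Entry("d", 12),
    Entry("e", 250, donor="Y"),
]
selected = select_candidates(entries, top_n=2)
print(selected)  # ['a', 'c']
```

Only loans with an identified donor are removed here; early borrowings without a known donor would remain in the list, which matches the residual risk noted in the text.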
In principle, the selection of data would require a minimum of professional knowledge and would be purely technical work, provided that etymological dictionaries were available, accessible and complete. Unfortunately, these three conditions are not met. For some languages etymological dictionaries have not yet been compiled; for others they are only in preparation or have not been published in full. The systematization of the material was also hindered to some extent by incomplete data in the etymological dictionaries themselves. They rarely give a full set of matches in related languages; the authors often limit themselves to the best-known examples, and some erroneous etymologies wander from one dictionary to another.
All these circumstances meant that most of the search for and selection of data had to be carried out by a careful review of bilingual dictionaries, which in most cases contain very rich material. However, some dictionaries were not available. In accordance with the subject of the work, dictionaries of the Samoyedic languages should have been processed, but because they were lacking, this work was not performed at all. The most negative impact on the results of the study, however, came from missing or incomplete dictionaries of some Iranian languages.
The studies were conducted at the lexical level, without regard to grammatical forms, and lexical units were compared on two planes: sound and meaning. Coincidences of sound form without a correspondence in meaning were unconditionally ignored. In assessing the semantic aspect, matches were graded from the maximum value, synonymy, through greater or lesser semantic similarity, down to antonymy, which is sometimes a consequence of the specific nature of the concept (the classic example is the original meaning "edge", which in different languages can develop into "the beginning" and "the end"). Synonymy is understood here as a match of at least one meaning of the word in different languages (usually the dominant one), not as a complete coincidence of semantic fields. Most often, however, the material was dominated not by synonyms but by words of common origin with a similar sense, not necessarily even of the same grammatical category.
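The two-plane rule can be illustrated by counting, for each pair of languages, only those matches in which both sound and meaning correspond. The sketch below is a simplified assumption about how such counts might be tallied: the language labels A, B, C, the `CANDIDATES` records and the `shared_counts` function are hypothetical, and the graded scale of semantic similarity is reduced to a yes/no flag.

```python
from collections import Counter
from itertools import combinations

# Hypothetical candidate matches: (languages involved, sound_ok, meaning_ok).
# Language labels and judgements are invented for illustration only.
CANDIDATES = [
    ({"A", "B", "C"}, True, True),
    ({"A", "B"}, True, True),
    ({"A", "C"}, True, False),  # sound match without meaning match: ignored
    ({"B", "C"}, True, True),
]


def shared_counts(candidates):
    """Count accepted matches for every pair of languages."""
    counts = Counter()
    for langs, sound_ok, meaning_ok in candidates:
        if not (sound_ok and meaning_ok):
            continue  # both planes must correspond
        for pair in combinations(sorted(langs), 2):
            counts[pair] += 1
    return counts


counts = shared_counts(CANDIDATES)
for pair, n in sorted(counts.items()):
    print(pair, n)
```

A table of pairwise counts of this kind is one plausible form of input for a graphic scheme of relationships between languages; how the counts are then turned into such a scheme is not specified in this section.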