Data Mining
Description:
text
Author:
anclaw
Created:
October 5, 2019 at 13:02 (current version from October 11, 2019 at 14:54)
Public:
No
Dictionary type:
Texts
Whole texts separated by an empty line (a single text per dictionary is also allowed).
Contents:
1 1.1 Introduction to Data Mining. Progress in digital data acquisition and storage technology has resulted in the growth of huge databases. This has occurred in all areas of human endeavor, from the mundane (such as supermarket transaction data, credit card usage records, telephone call details, and government statistics) to the more exotic (such as images of astronomical bodies, molecular databases, and medical records). Little wonder, then, that (it is no surprise that) interest has grown in the possibility of tapping (to get or make use of something that exists, such as money, skills, information, energy, etc.) these data, of extracting from them information that might be of value to the owner of the database. The discipline concerned with (dealing with, addressing) this task has become known as data mining.
2 Defining a scientific discipline is always a controversial (disputed, contentious) task; researchers often disagree about the precise range and limits of (the boundaries of) their field of study. Bearing this in mind (with this in view, keeping this in mind), and accepting that others might disagree about the details, we shall adopt as (we will take as, we intend to take as) our working definition of data mining: Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways (in new ways) that are both understandable and useful to the data owner. The relationships and summaries derived through (obtained by means of) a data mining exercise are often referred to as (are often called) models or patterns. Examples include (as examples, one can cite) linear equations, rules, clusters, graphs, tree structures, and recurrent patterns in time series.
3 The definition above refers to "observational data," as opposed to "experimental data." Data mining typically deals with data that have already been collected for some purpose other than the data mining analysis (for example, they may have been collected in order to maintain an up-to-date record of all the transactions in a bank). This means that the objectives of the data mining exercise play no role in the data collection strategy. This is one way in which data mining differs from much of statistics, in which data are often collected by using efficient strategies to answer specific questions. For this reason, data mining is often referred to as "secondary" data analysis.
4 The definition also mentions that the data sets examined in data mining are often large. If only small data sets were involved, we would merely be discussing (we would only be discussing) classical exploratory data analysis (EDA) as practiced by statisticians. When we are faced with large bodies of data (large volumes or arrays of data), new problems arise. Some of these relate to housekeeping issues (internal, in-house matters) of how to store or access the data, but others relate to more fundamental issues, such as how to determine the representativeness of the data, how to analyze the data in a reasonable period of time, and how to decide whether an apparent relationship is merely a chance occurrence not reflecting any underlying reality. Often the available data comprise (include) only a sample from the complete population (or, perhaps, from a hypothetical superpopulation); the aim may be to generalize from the sample to the population. For example, we might wish to predict how future customers are likely to behave or to determine the properties of protein structures that we have not yet seen. Such generalizations may not be achievable through standard statistical approaches because often the data are not (classical statistical) "random samples," but rather "convenience" or "opportunity" samples. Sometimes we may want to summarize or compress a very large data set in such a way that the result is more comprehensible, without any notion of generalization. This issue would arise, for example, if we had complete census data for a particular country or a database recording millions of individual retail transactions.
5 The relationships and structures found within a set of data must, of course, be novel (new, previously unknown, innovative). There is little point in regurgitating well-established relationships (unless the exercise is aimed at "hypothesis" confirmation, in which one seeks to determine whether an established pattern also exists in a new data set) or necessary relationships (that, for example, all pregnant patients are female). Clearly, novelty must be measured relative to the user's prior knowledge. Unfortunately, few data mining algorithms take into account a user's prior knowledge. For this reason we will not say very much about novelty in this text. It remains an open research problem.
6 While novelty is an important property of the relationships we seek, it is not sufficient to qualify a relationship as being worth finding. In particular, the relationships must also be understandable. For instance, simple relationships are more readily understood than complicated ones, and may well be preferred, all else being equal.
7 Data mining is often set in the broader context of knowledge discovery in databases, or KDD. This term originated in the artificial intelligence (AI) research field. The KDD process involves several stages: selecting the target data, preprocessing the data, transforming them if necessary, performing data mining to extract patterns and relationships, and then interpreting and assessing the discovered structures. Once again the precise boundaries of the data mining part of the process are not easy to state; for example, to many people data transformation is an intrinsic part of data mining. In this text we will focus primarily on data mining algorithms rather than the overall process. For example, we will not spend much time discussing data preprocessing issues such as data cleaning, data verification, and defining variables. Instead we focus on the basic principles for modeling data and for constructing algorithmic processes to fit these models to data.
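The KDD stages listed above can be pictured as a chain of small functions, each feeding the next. The following is only a toy sketch under stated assumptions: the records, the field names, the 50000 income threshold, and the function names select, preprocess, transform, mine, and interpret are all hypothetical and invented for illustration, and the "mining" step is a trivial group summary standing in for a real algorithm.

```python
from statistics import mean

# Hypothetical raw records, invented for illustration; None marks a missing value.
raw = [
    {"customer": "c1", "income": 20000, "spending": 2000, "region": "north"},
    {"customer": "c2", "income": 60000, "spending": None, "region": "south"},
    {"customer": "c3", "income": 40000, "spending": 4000, "region": "north"},
    {"customer": "c4", "income": 80000, "spending": 9000, "region": "south"},
]


def select(records):
    """Selection: keep only the fields relevant to the question being asked."""
    return [{"income": r["income"], "spending": r["spending"]} for r in records]


def preprocess(records):
    """Preprocessing: drop records that contain a missing value."""
    return [r for r in records if None not in r.values()]


def transform(records):
    """Transformation: derive spending as a share of income."""
    return [{**r, "share": r["spending"] / r["income"]} for r in records]


def mine(records):
    """Data mining (toy stand-in): average share per income band."""
    bands = {}
    for r in records:
        band = "high" if r["income"] >= 50000 else "low"  # assumed threshold
        bands.setdefault(band, []).append(r["share"])
    return {band: mean(shares) for band, shares in bands.items()}


def interpret(pattern):
    """Interpretation: turn the summary into statements a person can assess."""
    return [
        f"{band}-income customers spend about {share:.3f} of their income"
        for band, share in sorted(pattern.items())
    ]


pattern = mine(transform(preprocess(select(raw))))
report = interpret(pattern)
for line in report:
    print(line)
```

Keeping each stage as a separate function mirrors the point that the boundaries between stages are a matter of convention: moving the derived "share" field from transform into mine would change nothing in the result.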
8 The process of seeking relationships within a data set, that is, of seeking accurate, convenient, and useful summary representations of some aspect of the data, involves a number of steps:
- determining the nature and structure of the representation to be used;
- deciding how to quantify and compare how well different representations fit the data (that is, choosing a "score" function);
- choosing an algorithmic process to optimize the score function; and
- deciding what principles of data management are required to implement the algorithms efficiently.
The goal of this text is to discuss these issues in a systematic and detailed manner. We will look at both the fundamental principles (chapters 2 to 8) and the ways these principles can be applied to construct and evaluate specific data mining algorithms (chapters 9 to 14).
9 Example 1.1 Regression analysis is a tool with which many readers will be familiar. In its simplest form, it involves building a predictive model to relate a predictor variable, X, to a response variable, Y, through a relationship of the form Y = aX + b. For example, we might build a model that would allow us to predict a person's annual credit-card spending given their annual income. Clearly the model would not be perfect, but since spending typically increases with income, the model might well be adequate as a rough characterization. In terms of the steps listed above, we would have the following scenario:
10 - The representation is a model in which the response variable, spending, is linearly related to the predictor variable, income.
- The score function most commonly used in this situation is the sum of squared discrepancies between the predicted spending from the model and observed spending in the group of people described by the data. The smaller this sum is, the better the model fits the data.
- The optimization algorithm is quite simple in the case of linear regression: a and b can be expressed as explicit functions of the observed values of spending and income. We describe the algebraic details in chapter 11.
- Unless the data set is very large, few data management problems arise with regression algorithms. Simple summaries of the data (the sums, sums of squares, and sums of products of the X and Y values) are sufficient to compute estimates of a and b. This means that a single pass through the data will yield estimates.
11 Data mining is an interdisciplinary exercise. Statistics, database technology, machine learning, pattern recognition, artificial intelligence, and visualization all play a role. And just as it is difficult to define sharp boundaries between these disciplines, so it is difficult to define sharp boundaries between each of them and data mining. At the boundaries, one person's data mining is another's statistics, database, or machine learning problem.
