Web Mining: knowledge discovery domains
Web mining applies data mining, artificial intelligence, chart technology and
related techniques to web data, traces users’ visiting characteristics, and
then extracts their usage patterns. Web mining technologies are well suited to
knowledge discovery on the Web. The knowledge extracted from the Web can be
used to improve the performance of Web information retrieval, question
answering, and Web-based data warehousing. In this paper, we provide an
introduction to Web mining as well as a review of the Web mining categories.
Keywords: Data mining; web mining; web usage mining
Web mining is the extraction of interesting and potentially useful patterns and
implicit information from artifacts or activity related to the World Wide Web.
To serve users better, web mining applies data mining, artificial
intelligence, chart technology and related techniques to web data, traces
users’ visiting characteristics, and then extracts their usage patterns. Web
mining has quickly become one of the most important areas in Computer and
Information Sciences because of its direct applications in e-commerce, e-CRM,
Web analytics, information retrieval and filtering, and Web information
systems.
According to the differences in the mining objects, there are roughly three
knowledge discovery domains that pertain to web mining: Web Content Mining,
Web Structure Mining, and Web Usage Mining.
Fig.1 Web mining categories and objects
Web content mining is the process of extracting knowledge from the content of
documents or their descriptions. Web document text mining and resource
discovery based on concept indexing or agent-based technology may also fall
into this category. Web structure mining is the process of inferring knowledge
from the organization of the World Wide Web and the links between references
and referents in the Web. Finally, web usage mining, also known as Web Log
Mining, is the process of extracting interesting patterns from web access
logs.
II.WEB MINING PROCESS
The web mining process is generally divided into five stages: data
acquisition, data preprocessing, pattern (mode) discovery, pattern analysis,
and pattern application.
(1) Data acquisition
Web mining can collect raw data from clients, servers and registered/remote
agents. Their data types are quite different, and the data processing methods
are not the same. The data collected from different sources reflect the
different access patterns that arise in the use of the Web.
(2) Data preprocessing
Data preprocessing applies a series of processing steps to the raw data to
obtain the target information. The result of data preprocessing is the input
of the mining algorithm, so it directly influences the mining quality. Data
preprocessing mainly includes data cleaning, user identification, session
identification, path completion and transaction identification. It yields
user session sets that reflect the user browsing process quite objectively,
which prepares the ground for improving the accuracy of the final mined
patterns and the effect of the recommendation.
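As an illustration of three of these steps (data cleaning, user
identification and session identification), the following is a minimal
sketch, not the method of any particular system. It assumes the log has
already been parsed into (IP, user agent, timestamp, URL) records, that a user
can be approximated by the (IP, user agent) pair, and that a gap of more than
30 minutes starts a new session. The record layout, the ignored file suffixes
and the timeout are all assumptions made for the example; path completion and
transaction identification are not covered.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical, already-parsed log records: (ip, user_agent, timestamp, url).
records = [
    ("10.0.0.1", "Mozilla", datetime(2024, 1, 1, 10, 0), "/index.html"),
    ("10.0.0.1", "Mozilla", datetime(2024, 1, 1, 10, 5), "/products.html"),
    ("10.0.0.1", "Mozilla", datetime(2024, 1, 1, 10, 6), "/logo.gif"),
    ("10.0.0.2", "Opera", datetime(2024, 1, 1, 10, 7), "/index.html"),
    ("10.0.0.1", "Mozilla", datetime(2024, 1, 1, 11, 30), "/cart.html"),
]

# Assumption: requests for embedded resources are noise for usage mining.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js")
# Assumption: a common heuristic timeout between two sessions.
SESSION_TIMEOUT = timedelta(minutes=30)


def clean(records):
    """Data cleaning: drop requests for images, stylesheets and scripts."""
    return [r for r in records if not r[3].lower().endswith(IGNORED_SUFFIXES)]


def identify_sessions(records, timeout=SESSION_TIMEOUT):
    """User and session identification.

    A user is approximated by (ip, user_agent); a new session starts when
    two consecutive requests of one user are more than `timeout` apart.
    """
    by_user = defaultdict(list)
    for ip, agent, ts, url in sorted(records, key=lambda r: r[2]):
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for user, visits in by_user.items():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, url) in zip(visits, visits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((user, current))
                current = []
            current.append(url)
        sessions.append((user, current))
    return sessions


sessions = identify_sessions(clean(records))
for user, pages in sessions:
    print(user, pages)
```

The resulting list of (user, page sequence) pairs is the kind of session set
that the later mining stages take as input.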
(3) Pattern discovery
Pattern discovery mines effective, novel, latent, useful and ultimately
understandable information and knowledge by using mining algorithms. The
techniques used in Web usage mining include statistical analysis, path
analysis, association analysis, sequential pattern analysis, classification
analysis, clustering analysis and dependency modeling. There are two kinds of
clustering on the Web: user clustering and page clustering.
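To make association analysis concrete, the sketch below counts how often two
pages occur in the same session and keeps the pairs whose support reaches a
threshold. It is a deliberately simple illustration restricted to page pairs,
not a full frequent-itemset algorithm; the sample sessions and the 0.5
threshold are assumptions chosen for the example.

```python
from collections import Counter
from itertools import combinations

# Hypothetical session set: each session is the list of pages one user visited.
sessions = [
    ["/index.html", "/products.html", "/cart.html"],
    ["/index.html", "/products.html"],
    ["/index.html", "/about.html"],
    ["/products.html", "/cart.html"],
]


def pair_support(sessions):
    """Return the support of every page pair.

    Support is the fraction of sessions that contain both pages of the pair.
    """
    counts = Counter()
    for session in sessions:
        # Use a sorted set so each unordered pair is counted once per session.
        for pair in combinations(sorted(set(session)), 2):
            counts[pair] += 1
    total = len(sessions)
    return {pair: count / total for pair, count in counts.items()}


def frequent_pairs(sessions, min_support=0.5):
    """Keep only the page pairs whose support reaches `min_support`."""
    supports = pair_support(sessions)
    return {pair: s for pair, s in supports.items() if s >= min_support}


frequent = frequent_pairs(sessions, min_support=0.5)
for pair, support in sorted(frequent.items()):
    print(pair, support)
```

Pairs that survive the threshold are candidates for association rules, for
example suggesting one page to visitors of the other.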
(4) Pattern analysis
The user behavior patterns obtained from mining need to be analyzed, explained
and visualized with suitable tools and techniques. From them we select the
interesting patterns and turn them into knowledge that people can understand
and query. There are many pattern analysis methods, such as visualization
technology, data query and OLAP. Visualization techniques, such as graphical
patterns that paint different values in different colors, can make the overall
pattern or trend stand out clearly. Content and structure information can also
be used to filter out specific patterns, such as those containing a specific
usage data class, content class, or the web with specific
(5) Pattern application
We can apply the meaningful conclusions and patterns mined, for example by
modifying web page content, improving web service design, customizing
personalized interfaces for users, and providing personalized e-commerce
services.
III.WEB CONTENT MINING
Web content mining is a form of text mining and can take advantage of the
semi-structured nature of web page text. Query interfaces share similar or
common query patterns. For instance, a frequently used pattern is a text label
followed by a selection list with numeric values. The HTML tags of today’s web
pages, and even more so the XML markup of tomorrow’s web pages, carry
information that concerns not only layout but also logical structure. HTML
may be invalid and cause problems when extracting information. Most previous
work extracts information from HTML pages, and some of it first converts
invalid HTML pages to valid HTML before applying the extraction process; in
this paper we instead use the XML format of web pages for extracting
information. The extractor system presented in this paper takes XML pages as
input and accesses the XML tags in the documents through the XML DOM API. DOM
is a standard interface that takes a web page as input and represents it as a
structured tree of interfaces, objects and the relations between them. A
sample DOM tree, the extracted form of a sample query interface, is shown in
Fig. 2.
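The following sketch shows how XML tags can be reached through a DOM API,
using Python's standard xml.dom.minidom module. It is not the extractor system
described above; the XML snippet, its element names and the
extract_select_fields helper are hypothetical and only illustrate the "text
followed by a selection list with numeric values" pattern.

```python
from xml.dom.minidom import parseString

# Hypothetical XML rendering of a small query interface.
XML_PAGE = """<form>
  <label>Passengers</label>
  <select name="passengers">
    <option>1</option>
    <option>2</option>
    <option>3</option>
  </select>
  <label>Destination</label>
  <input name="destination" type="text"/>
</form>"""


def extract_select_fields(xml_text):
    """Map each selection list's name to the list of its option values."""
    dom = parseString(xml_text)
    fields = {}
    for select in dom.getElementsByTagName("select"):
        options = [
            option.firstChild.data.strip()
            for option in select.getElementsByTagName("option")
            if option.firstChild is not None  # skip empty <option/> elements
        ]
        fields[select.getAttribute("name")] = options
    return fields


fields = extract_select_fields(XML_PAGE)
print(fields)
```

Because the input is well-formed XML, the parser builds the tree directly and
no HTML repair step is needed before the tags are queried.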
Web content mining targets knowledge discovery in which the main objects are
the traditional collections of text documents and, more recently, also the
collections of multimedia documents such as images, videos and audio that are
embedded in or linked to Web pages. Web content mining can be differentiated
from two points of view: the agent-based approach or the
Fig.2 Simple DOM tree
The first approach aims at improving information finding and filtering and
can be divided into the following three categories:
Search Agents. These agents search for relevant information using domain
characteristics and user profiles to organize and interpret the discovered
Filtering/ Categorization. These agents use information retrieval
techniques and characteristics of open hypertext Web documents to automatically
retrieve, filter, and categorize them.
Web Agents. These agents learn user preferences and discover Web
information based on these preferences, and preferences of other users with
The second approach aims at modeling the data on the Web in a more structured
form in order to apply standard database querying mechanisms and data mining
applications to analyze it. The two main categories are multilevel databases
and Web query systems.
IV.WEB STRUCTURE MINING
The challenge for Web structure mining is to deal with the structure of the
hyperlinks within the Web itself. Link analysis is an old area of research.
However, with the growing interest in Web mining, research on structure
analysis has increased, and these efforts have resulted in a newly emerging
research area called Link Mining, which is located at the intersection of the
work in link analysis, hypertext and web mining, relational learning and
inductive logic programming, and graph mining. There is a potentially wide
range of application areas for this new area of research, including the
Internet.
The Web contains a variety of objects with almost no unifying structure, with
differences in authoring style and content much greater than in traditional
collections of text documents. The objects in the WWW are web pages, and links
are in-links, out-links and co-citations (two pages that are both linked to by
the same page). Attributes include HTML tags, word appearances and anchor
texts. This diversity of objects creates new problems and challenges, since it
is not possible to directly make use of existing techniques such as those from
database management or information retrieval. Link mining has also unsettled
some of the traditional data mining tasks.
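The three link relations named above can be computed from a plain list of
hyperlinks. The sketch below is illustrative only: the page names and the
link_features helper are hypothetical, and co-citation is counted exactly as
defined in the text, as the number of pages that link to both pages of a pair.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical hyperlink list: (source page, target page).
links = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]


def link_features(links):
    """Compute out-links, in-links and co-citation counts."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for source, target in links:
        out_links[source].add(target)
        in_links[target].add(source)

    # Two pages are co-cited once for every page that links to both of them.
    cocitation = defaultdict(int)
    for targets in out_links.values():
        for pair in combinations(sorted(targets), 2):
            cocitation[pair] += 1
    return dict(out_links), dict(in_links), dict(cocitation)


out_links, in_links, cocitation = link_features(links)
print("in-links of D:", sorted(in_links["D"]))
print("co-citation counts:", cocitation)
```

Features of this kind can then serve as inputs to the link mining tasks
summarized next.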
In what follows, we summarize some of the possible link mining tasks that are
applicable in Web structure mining.
1. Link-based Classification.
Link-based classification is the most recent upgrade of a classic data mining
task to linked domains. The task is to predict the category of a web page,
based on the words that occur on the page, links between pages, anchor text,
HTML tags and other possible attributes found on the web page.
2. Link-based Cluster Analysis.
The goal in cluster analysis is to find naturally occurring sub-classes. The
data is segmented into groups, where similar objects are grouped together and
dissimilar objects are placed in different groups. Unlike the previous task,
link-based cluster analysis is unsupervised and can be used to discover hidden
patterns in the data.
3. Link Type.
There is a wide range of tasks concerning the prediction of the existence of
links, such as predicting the type of link between two entities or predicting
the purpose of a link.
Links could be associated with weights.
The main task here is to predict the number of links between objects.
There are many ways to use the link structure of the Web to create notions of
authority. The main goal in developing applications for link mining is to make
good use of the understanding of this intrinsic social organization of the
V.WEB USAGE MINING
of web usage mining
Web servers record and accumulate data about user interactions whenever
requests for resources are received. Analyzing the web access logs of
different web sites can help us understand user behavior and the web
structure, thereby improving the design of this colossal collection of
resources. There are two main tendencies in Web Usage Mining, driven by the
applications of the discoveries: General Access Pattern Tracking and
Customized Usage Tracking. Web Usage Mining mines data from the log records of
web page accesses. Log records hold much useful information, such as URL, IP
address and access time. Analyzing the logs can help us find more potential
customers and track service quality.
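As a small example of reading the URL, IP address and time out of a log
record, the sketch below parses one line in the widely used Common Log Format.
The text does not say which log format is meant, so the format, the sample
line and the parse_line helper are assumptions made for illustration.

```python
import re
from datetime import datetime

# Common Log Format: host ident authuser [date] "request" status bytes
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of the fields in one log line, or None if it is malformed."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    return record


# Hypothetical sample line.
sample = (
    '192.168.1.10 - - [10/Oct/2023:13:55:36 +0000] '
    '"GET /index.html HTTP/1.0" 200 2326'
)
record = parse_line(sample)
print(record["ip"], record["time"], record["url"])
```

Lines that do not match the pattern return None, so malformed entries can be
dropped during data cleaning.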
Web usage mining is the process of applying data mining technology to web
data in order to extract, from users’ network behavior, the patterns that
reflect what they are interested in. When a person visits a website, he or she
leaves data such as IP address, pages visited and visiting time. Web usage
mining will collect, analyze and process the log and
of web usage mining
Web usage mining generally includes the following steps: data collection, data
pretreatment, establishing the interest model, and data post-processing.
(1) Data collection
Data collection is the first step of web usage mining. The authenticity and
integrity of the data directly affect how smoothly the following work proceeds
and the quality of the final personalized recommendation service. Therefore,
scientific, reasonable and advanced techniques must be used to gather the
various data. At present, web usage mining has three main data sources:
server data, client data and intermediate data (agent server data and package
(2) Data pretreatment
Databases are often incomplete, inconsistent and noisy. Data pretreatment
applies a unifying transformation to these databases so that they become
integrated and consistent, thereby establishing a database that can be mined.
Data pretreatment mainly includes data cleaning, user identification, user
session identification and data formatting.
(3) Establish interest model
Use statistical methods to analyze and mine the pretreated data. We can then
discover the interests of a user or a user community and construct an interest
model. At present, the commonly used machine learning methods are mainly
clustering, classification, association discovery and sequential pattern
discovery. Each method has its own strengths and shortcomings, but at present
the most effective methods are mainly classification and clustering.
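As one possible illustration of clustering users by interest, the sketch below
groups users whose sets of visited pages are similar. It uses a simple
single-pass "leader" clustering with Jaccard similarity, which is a choice
made here for brevity and not a method prescribed by the text; the user data
and the 0.5 threshold are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_clustering(user_pages, threshold=0.5):
    """Assign each user to the first cluster whose leader is similar enough.

    The first user placed in a cluster acts as its leader; a user that is not
    similar to any existing leader starts a new cluster.
    """
    clusters = []  # list of (leader_pages, member_ids)
    for user, pages in user_pages.items():
        for leader_pages, members in clusters:
            if jaccard(pages, leader_pages) >= threshold:
                members.append(user)
                break
        else:
            clusters.append((pages, [user]))
    return [members for _, members in clusters]


# Hypothetical interest data: the set of pages each user visited.
user_pages = {
    "u1": {"/index.html", "/products.html", "/cart.html"},
    "u2": {"/index.html", "/products.html"},
    "u3": {"/about.html", "/contact.html"},
    "u4": {"/about.html", "/contact.html", "/index.html"},
}

clusters = leader_clustering(user_pages)
print(clusters)
```

The result depends on the order in which users are processed, which is one of
the shortcomings of this particular method; each cluster can be read as a
user community with a shared interest model.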
Carry out further analysis and induction on the interest patterns that have
already been established. First, delete the less significant rules or models
from the interest model repository; next, use OLAP and similar technology to
carry out comprehensive mining and analysis; then, visualize the discovered
data or knowledge; finally, provide the personalized service to the
e-commerce website.
In this paper we survey Web mining, focusing on its categories: Web Content
Mining, Web Structure Mining, and Web Usage Mining. We have discussed the
process of all three types of mining in detail.