DATA MINING: PROBLEMS, TOOLS AND APPLICATIONS
Basic contents of the Intelligent Systems course taught during the first year.
The exam is a programming project along with an oral examination. The programming project is
assigned to students on an individual basis, taking into account their specific interests.
During this project, students are typically asked to implement simple methods for carrying
out experimental investigations on data made available to them by online social networks, and/or
by web advertising sites, and/or by online benchmark repositories. These experimental
investigations aim both at testing the students' ability to adapt the available methods
to real-world problems and at understanding the specific nature of the problems themselves.
An accompanying report is also required. Upon positive evaluation, the project grants
access to the oral examination, whose first question is always related to the student's
programming project. During the oral examination, students are required to show significant
understanding of the methods covered by this course and to be able to discuss their advantages
and drawbacks. The overall examination is passed if a score of 18 (out of 30) or higher is obtained.
This course introduces the basic problems, methods, and tools of Data Mining that are of current technological and industrial interest in big data scenarios. Access to such data and to the required hardware/software platforms will be provided through an AWS (Amazon Web Services) for education grant. Contents of this course include: mining of association rules and sequential patterns; decision trees; linear classification and generalized linear classification (kernel methods, support vector machines, etc.); aggregation methods; methods and problems with partial information; hierarchical classification; ranking; collaborative filtering; data mining on networked data (i.e., with social components). The course objectives and expected outcomes are the following:
- Become familiar with the basic methods of Data Mining on large-scale data, and with the associated problems;
- Acquire the ability to apply this knowledge to real problems by adapting the available methods to the problems to be solved;
- Acquire the ability to learn new methods and to compare them with what is already known.
- Mining of association rules and sequential patterns; decision trees; linear classification and generalized linear classification (kernel methods, support vector machines, etc.); aggregation methods (bagging and boosting). (16 hours).
- Problems and methods with partial information (exploration-exploitation tradeoff, “bandit” problems, crowdsourcing), and with structured information (hierarchical classification, ranking, collaborative filtering). (10 hours).
- Data Mining on networked data (co-training, transfer learning, active learning and semi-supervised learning on networks of tasks, PageRank for classification of linked structures, community discovery, etc.). (14 hours).
- Lab activity. This will put an emphasis on real problems involving online social networks and on problems of Web recommendation, Web Content Mining, and Web advertising. Standard software includes MATLAB (or free alternatives to it) and/or MapReduce/Hadoop platforms. (16 hours).
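As an informal illustration of the first topic in the list above (mining of association rules), the following minimal sketch computes the support and confidence of a single rule over a toy set of transactions. The transactions, the rule, and the function names are hypothetical and chosen only for illustration. Python is used here for brevity; it is an assumption of this sketch and is not among the software listed for the lab activity.

```python
# Toy market-basket transactions (hypothetical data, for illustration only).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing the antecedent that also
    contain the consequent, i.e. count(A and C together) / count(A)."""
    antecedent = set(antecedent)
    n_antecedent = count(antecedent, transactions)
    if n_antecedent == 0:
        raise ValueError("antecedent never occurs in the transactions")
    n_both = count(antecedent | set(consequent), transactions)
    return n_both / n_antecedent


# Rule under study: {diapers} -> {beer}
rule_support = support({"diapers", "beer"}, transactions)
rule_confidence = confidence({"diapers"}, {"beer"}, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

In practice, a rule is typically retained only if both values exceed user-chosen thresholds; on large-scale data the counting step is the part that would be distributed, for example on a MapReduce/Hadoop platform.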
- B. Liu, "Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data", Springer, 2011 (chapters 1, 2, 3, 6, 7, 12)
- A. Rajaraman, J. Leskovec, J. D. Ullman, "Mining of Massive Datasets" (chapter 3)
- T. M. Mitchell, "Machine Learning", McGraw-Hill (chapter 3)
- Slides provided by the instructor
40 hours of classroom lectures; 16 hours of lab activity, held in a computer lab.