Syllabus [Word Document]
Lectures
- Lectures 1 & 2: Introduction, Ergodic Theorem, Perron-Frobenius Theorem, Power Method and Foundations of PageRank
- Lecture 3: Hyperlink-Induced Topic Search (HITS)
- Lecture 4: PageRank & SALSA
- Lecture 5: Latent Semantic Analysis
- Lecture 6: Ranking Links: Search and Surf Engines
- Lecture 7: Detecting Spam Sites
- Lecture 8: Spectral Clustering and Graph Partitioning
- Lecture 9: K-means, Hierarchical and Zoomed Clustering, Hidden Markov Models
Homework
- Homework 1 - Ergodic and Perron-Frobenius Theorems
- Homework 2 - Hubs and Authorities (HITS)
- Homework 2.1 - Sets of Hubs and Authorities
- Homework 3 - PageRank
- Homework 4 - Latent Semantic Analysis
- Homework 5 - Ranking Links
- Homework 6 - K-means and Hierarchical Clustering
- Homework 7 - Spectral Clustering
- Homework 8 - Building A Search Engine
Textbook
- A reference textbook is Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (Cambridge University Press, 2008). You may view the textbook online or print your own copy.
Core Papers
- Authoritative Sources in a Hyperlinked Environment
- The PageRank Citation Ranking: Bringing Order to the Web
- The stochastic approach for link-structure analysis (SALSA) and the TKC effect
- Introduction to Latent Semantic Analysis
- Automatic Cross-Language Information Retrieval using Latent Semantic Indexing
- Ranking Links on the Web: Search and Surf Engines
Spam Detection Papers
- Combating Web Spam with TrustRank
- Measuring Similarity to Detect Qualified Links
- Topical TrustRank: Using Topicality to Combat Web Spam
- Improving Web Spam Classifiers Using Link Structure
- A Large-Scale Study of Link Spam Detection by Graph Algorithms
Advanced Reading
- The ATHENS System for Novel Information Discovery
- Detecting Anomalies in Graphs
- Searching and Ranking Web Pages
- Self-Organization and Identification of Web Communities
- Organizing WWW Images Based On The Analysis of Page Layout and Web Link Structure
- Indexing by Latent Semantic Analysis
- Signature Based Intrusion Detection using Latent Semantic Analysis
- Symbolic Stochastic Systems
Software
- As an alternative to MATLAB, you can use Scilab, a free software package that is very similar. You can download it from http://www.scilab.org. Online help is available at http://www.scilab.org/product/man/, along with a guide: An Introduction to Scilab.
Data sets
- Abortion Refined -- Sites
- Computational Geometry Refined -- Sites
- Death Penalty Refined -- Sites
- Gun Control Refined -- Sites
- Movies Refined -- Sites
- Net Censorship Refined -- Sites