Introduction to Search Engine Theory
Syllabus [Word Document]
Lectures
- Lectures 1 & 2: Introduction, Ergodic Theorem, Perron-Frobenius Theorem, Power Method and Foundations of PageRank
- Lecture 3: Hyperlink-Induced Topic Search (HITS)
- Lecture 4: PageRank & SALSA
- Lecture 5: Latent Semantic Analysis
- Lecture 6: Ranking Links: Search and Surf Engines
- Lecture 7: Detecting Spam Sites
- Lecture 8: Spectral Clustering and Graph Partitioning
- Lecture 9: K-means, Hierarchical and Zoomed Clustering, Hidden Markov Models
Homework
- Homework 1 - Ergodic and Perron-Frobenius Theorems
- Homework 2 - Hubs and Authorities (HITS)
- Homework 2.1 - Sets of Hubs and Authorities
- Homework 3 - PageRank
- Homework 4 - Latent Semantic Analysis
- Homework 5 - Ranking Links
- Homework 6 - K-means and Hierarchical Clustering
- Homework 7 - Spectral Clustering
- Homework 8 - Building A Search Engine
Textbook
- The reference textbook is Introduction to Information Retrieval by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Cambridge University Press, 2008. You may view the textbook online or print your own copy.
Core Papers
- Authoritative Sources in a Hyperlinked Environment
- The PageRank Citation Ranking: Bringing Order to the Web
- The stochastic approach for link-structure analysis (SALSA) and the TKC effect
- Introduction to Latent Semantic Analysis
- Automatic Cross-Language Information Retrieval using Latent Semantic Indexing
- Ranking Links on the Web: Search and Surf Engines
Spam Detection Papers
- Combating Web Spam with TrustRank
- Measuring Similarity to Detect Qualified Links
- Topical TrustRank: Using Topicality to Combat Web Spam
- Improving Web Spam Classifiers Using Link Structure
- A Large-Scale Study of Link Spam Detection by Graph Algorithms
Advanced Reading
- The ATHENS System for Novel Information Discovery
- Detecting Anomalies in Graphs
- Searching and Ranking Web Pages
- Self-Organization and Identification of Web Communities
- Organizing WWW Images Based On The Analysis of Page Layout and Web Link Structure
- Indexing by Latent Semantic Analysis
- Signature Based Intrusion Detection using Latent Semantic Analysis
- Symbolic Stochastic Systems
Software
- As a free alternative to Matlab, you can use Scilab, a very similar software package. You can download it from http://www.scilab.org. Online help is available at http://www.scilab.org/product/man/, along with a guide: An Introduction to Scilab.
Data sets
- Abortion Refined -- Sites
- Computational Geometry Refined -- Sites
- Death Penalty Refined -- Sites
- Gun Control Refined -- Sites
- Movies Refined -- Sites
- Net Censorship Refined -- Sites
Personal Information
Mailing Address:
Purdue University
Dept of Computer Sciences
305 North University Street
West Lafayette, IN 47907-2066
Office: HAAS G77 (Stat)
Email: rrossi [at] purdue.edu
Phone: (843) 240-9811
I am a Ph.D. student in Computer Science at Purdue University.
My research focuses on prediction and modeling of large dynamic networks.
My work is supported by the NSF Graduate Research Fellowship, the National Defense Science and Engineering Graduate Fellowship, and the Purdue Andrews Fellowship. Previously, I was a research assistant at Lawrence Livermore National Laboratory, the Naval Research Laboratory (AI Center), the Jet Propulsion Laboratory (NASA), the California Institute of Technology, the Knowledge Discovery Lab at the University of Massachusetts Amherst, New Mexico Tech, and Coastal Carolina University.
A complete list of my publications can be found at Google Scholar, DBLP, MS Academics, or my Social Graph.
