History of Majestic-12 DSearch (distributed search engine)
Planned features can be found in the todo list.
15/02/07 v0.6.7
+ Added support for anchor text to be kept as part of page information
+ Rewritten page information storage for better on-disk compression and lower memory usage
! Smarter cut offs for titles
! Fixed anchors pointing to bad links
! Fixed incorrect counting of filtered domains
! Fixed bug in multiple ranging logic
! Fixed per domain match limits for cases when multiple sites used in site: command
04/02/07 v0.6.6
+ Completely rewritten query parsing to support full boolean operators, field limiters and other options (not all in effect though right now :( )
! Fixed memory leaks for indices that supported automatic re-indexing and re-merging
! Fixed bug affecting first document in site: limited scenarios
+ Added error status breakdown to failed urls in url submission
! Fixed (worked around, more like) bug with hash collisions between very similar domains
+ Added support for HTML cache
! Subindex/ACRank building moved into merging phase
+ Added crawl time
+ Added support for scenario where duplicate match comes from fresher index
! Fixed failure when searching for exact phrases
! Fixed bug causing excessive CPU usage in url lookups
+ Added index priorities
+ Added support for multiple domains and urls in site: qualifiers
! Fixed bug that was resulting in some searches hogging CPU
! Fixed bug that was causing exception in some of the searches
TODO:
+ Added support for field specifiers such as: title:keywords
+ Boolean OR command now supported
+ Non-unique query keywords are not excluded right now allowing for queries like: baden baden
+ Stop words are not completely ignored by default
+ Faster phrase matching
! Fixed bug where some phrase matches were not returning any results
+ Faster and better selection of text snippets with matched words
xx/08/06 v0.6.5 [LOST RELEASE]
+ Much faster anchor and text hits evaluation
! Prefetching is smarter
! Subindex usage is smarter now
! Rebuild user submitted index to merge faster
! Fixed encoding problem in XML results
+ Better caching strategy in keyword resolution stage
30/07/06 v0.6.4
+ Better recent matches analysis screen
! Fixed broken user submission
! Better notification of errors during searching (previously just no results shown)
! Fixed broken site: command with just one search keyword
! Real-time stats collection is more lightweight now
+ Much faster inter-server network communications
! Fixed broken communications with multiple-indices
+ Added Wikipedia ( :-/ ) index
! Order in which distributed results merged is more consistent now
! Fixed broken non-Latin characters in text results
! Removed some non-Latin site descriptions that were broken due to wrong encoding
! Fixed bug with incorrect number of shown search results
! Fixed rank explanation not working in some cases
23/07/06 v0.6.3
! Optimised page info requests to only get what is necessary
+ Added separate Wikipedia index (~2.5 mln articles)
! Faster cut off of same domain matches in indices with lots of urls but few domains
! Matching engine optimised for better performance
! Fixed broken (and disabled) skip subindex
! Much faster searches on site:domain limited queries
+ Overall much faster multiword searches
! Lexicon cache to minimise disk accesses
! Fixed broken site: searches that were only showing 2 matches
! Search results page changing is much faster
! Fixed incorrect counting of skipped same domain matches
+ Added new subindex
14/07/06 v0.6.2
! Fixed bug with incorrect user-submitted URL stalling whole process
+ Added logging
! Fixed high memory usage and repeat failure in user-submitted dynamic index
+ Completely rewritten support for multiple local and distributed indices
+ Multiple index stats (pages/links/etc indexed) are now taken into account correctly for all indices,
this can show real time growth of index in case of user-submitted data
! Dynamic index becomes index on its own
! Ranking formulae are applied consistently for all indices
! View Text, Explain Time, Explain Rank, Word Search are all now working correctly with multiple indices
+ Searches can now be made over specified indices only using index:name1,name2,name3 query line command
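A minimal sketch of how an index:name1,name2,name3 command could be separated from the rest of the query line. This is an illustration only, not the actual DSearch parser; the function name and the index names used in the example are hypothetical.

```python
def split_index_command(query):
    """Separate an index:name1,name2 token from the remaining search keywords."""
    indices, keywords = [], []
    for token in query.split():
        if token.lower().startswith("index:"):
            # Everything after "index:" is a comma-separated list of index names.
            indices.extend(name for name in token[6:].split(",") if name)
        else:
            keywords.append(token)
    return indices, " ".join(keywords)


# Hypothetical index names, used only for illustration.
print(split_index_command("index:main,wikipedia majestic 12"))
```

An empty list of indices would presumably mean "search everything", but that behaviour is an assumption rather than something stated in this history.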
23/03/06 v0.6.1
! Fixed error when no matches were found
! Fixed incorrect domain clustering
! Filtered matches (ie over per domain limit) will now be present in count shown
! Performance improvements in match evaluator as well as memory leaks plugged
! Fixed problem with paging when distributed index was used
+ Switched to 2-phase distributed searching making system more scalable
17/03/06 v0.6.0
+ Searching: 1,025,161,473 (!) pages, 25,577,885 links
+ Support for distributed searching
+ Support for XML format of search results
12/03/06 v0.5.1
+ ACRank: query independent page/site rank for better relevance
! Fixed incorrect priority and multiple matches per domain in dynamic index
+ Better explanation of rank
! Fixed problem with same domain filtering algorithm
! Fixed matched text highlighting that was inefficient in some circumstances
! site: command without keywords will now show most important pages from those available
! Fixed broken match counts for site: command and some improvements in the way the query string is evaluated
+ Much smarter sub-indexing on common terms
+ Much faster site: limited searches that deal with very common terms
! Reduced memory allocations during searching with additional performance improvements
+ Much faster reading of pageinfos
28/02/06 v0.5.0
+ Searching 619,418,654 pages, 19,535,430 links, 30,093,318 unique domains
+ link: command now supported (works for whole site or specific page)
+ Major scalability improvements in merging process
IMMEDIATE TODO:
searching:
+ Much faster searches when site: limit is supplied
+ Combined multi-words lexicon lookup
+ Add special logic to detect when search keywords are in fact a URL or a domain name
+ "Explain Rank" now shows referring pages for anchor hits in
! Smarter scanning of the index
+ Subdomain hits are now far more relevant
+ site: now supports sub-domain searching (crude implementation that should work in most cases)
! Fixed incorrect counting of matches - excessive same domain matches were not counted in total
+ Added protection from very heavy search queries
indexing:
! Fix some META descriptions that were not re-encoded to lowest common denominator
+ Better compression for page infos
+ Support for duplicate content analysis
! Some of the page descriptions lack space separations between sentences
! Investigate cases with duplicate content (KAMCO Computer Systems query)
! Provides support for query-independent rank calculation
+ Add filters to import non-standard data like Wikipedia's raw XML
+ Add filters to import DMOZ site descriptions, use presence in DMOZ as additional scoring param
merging:
! Fixed bad entries in lexicon
+ Faster and more scalable merging
+ Use better compression for the merged index
! Fix temporary space requirements for inverted index merging
! Better prevention of corruption of page infos
01/02/06 v0.4.1
! Fixed incorrectly shown encodings
! Fixed some untrapped errors
! Fixed negative geotargeting scores
+ Improved speed of searches
31/01/06 v0.4.0
+ Searching 205,755,670 web pages, 28,360,973 domains and 9,052,905 links
! Limited size of site descriptions (META tags)
! Fixed problem with strange corruption of pageinfos
+ Support for NOARCHIVE/NOINDEX meta tags
! Fixed page content size being wrong for Conanised barrels
! Stop words will now be present in index and can be searched for with + prefix or when used in quoted phrases
+ Better Stage 1 and 3 merging
+ Link based map now generated for computation of query independent ranks
+ Various warnings/errors from search query parser now shown
! Fixed incorrect calculation of anchor hits leading to poor relevance
! Fixed some titles that were forced to lower-case
15/01/06 v0.3.1
! Fixed broken dynamic index
! Fixed partly broken site: command
! Fixed negative geo-targeting scores
! Fixed problem with entities
14/01/06 v0.3.0
+ Searching 100,425,098 web pages and 9,052,905 links
! Fixed bug with domain hash calculation wrongly using case-sensitive approach
! Fixed bug in domain extraction: '-' was treated incorrectly!
! Memory usage by the indexer has been reduced
! Indexing is now much faster and supports multi-core or multi-processor configs
! Ignore pages that were crawled from private IPs (ie localhost)
! Length of short text elements now correctly takes into account stop words
! Number split off logic is now disabled
! Indexer supports rel="nofollow" in <a> tags
! Fixed fatal bug with sub-bucket index sort in merger
+ Full support for geo-targeting (controlled by user with loc:COUNTRY override) and g/ccTLD clustering (ie: site:uk limits)
+ Merger supports multi-core and multi-processor configs
! Lexicon merger uses a lot less memory
+ Much more efficient multi-way merge strategy for lexicon
! Fixed bug resulting in much bigger index than it should have been
+ Phrase match (words in double quotes, ie "majestic 12") is now supported
+ Added NOT boolean logic: -KEYWORD would ensure that pages that contain that word will NOT be included
! Fixed bug resulting in poor deduplication leading to poor choice of pageinfos
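The multi-way merge strategy for the lexicon is not described in detail here. As a rough illustration, a textbook k-way merge over already-sorted lexicon streams could look like the sketch below. It assumes each lexicon is a sorted sequence of (term, count) pairs; the count payload and the function name are hypothetical.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def merge_lexicons(*lexicons):
    """Merge sorted (term, count) streams in one pass, summing counts of equal terms."""
    # heapq.merge keeps only one pending entry per input stream in memory.
    merged = heapq.merge(*lexicons, key=itemgetter(0))
    for term, group in groupby(merged, key=itemgetter(0)):
        yield term, sum(count for _, count in group)


# Three small, already-sorted lexicons (made-up data).
a = [("apple", 2), ("majestic", 1)]
b = [("apple", 1), ("search", 4)]
c = [("majestic", 3)]
print(list(merge_lexicons(a, b, c)))
```

Because every input is consumed sequentially, memory use depends on the number of streams rather than on their size, which is the usual reason to prefer this over repeated two-way merges.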
26/09/05 v0.2.0
! Fixed fatal sub-index bug preventing some searches from completion
! Anchor hits are no longer evaluated in a primitive fashion (algo will still need revision to avoid potentially big sort)
! Fixed (worked around, more like) bug with duplicate docID checker
! Fixed bug with page scrolling in site limited searches
+ Admin interface with user registrations to include great many options designed to let YOU control the search engine
+ Formulae used to rank matches can now be edited and applied in real time
+ JHH interpreter now has isolated session level variables to support multi-user environment
! Fixed problem in JHH with nested IF/ELSE statements
+ New look and feel for the search engine to make it in line with main website and node's web interface
! Fixed bug with site: qualifier that was ignoring dynamic index when it was not present in the main index
(test: site:parchayi.net parchayi)
+ JHH interpreter gains array iterators
+ Rank of the match can now be explained
+ Users can create an account and create/edit their own ranking formulae
! Fixed bug with essentially similar URLs from main and dynamic index treated differently
28/07/05 v0.1.5
! Indexed: ~ 45,000,000 web pages (exact number lost :( )
! Revised stoplist to remove some words that don't belong there (MIT, IR etc). Effective after next reindexing
! Fixed incorrect canonical URL calculation for TLDs like .com.au
+ Added support for clustering by domain - site:www.example.com will now run search for that domain only: this
sounds small but it isn't trivial at all to do it effectively
! Fixed sorting issue with multiple matches having same rank
! Fixed various problems related to (lack of) url encoding of keywords
! Avoid repeating identical site META descriptions
+ Did I mention domain clustering? Not near enough! 8-O
! Canonical URL calculation takes into account url-encoding
! Many more hits can be stored per docid
+ Anchor hits have much better resolution and now have id of the referring page in place for real-time scoring
! Only one anchor text per URL was used from any given page
+ Removed acute memory dependency in urls merger and enabled recovery at frequent savepoints
+ Sorts for very frequent terms in inverted index buckets are not memory bound
14/07/05 v0.1.4
! Indexed: 28,161,400 web pages and 6,888,489 links
+ Numbers are split from words, ie: "majestic12" will be treated the same as "majestic 12"
+ Some non-alphabetic chars are now allowed to support words: .net, C#, can't etc
+ META content description tags will now be stored as part of page information
+ Added synchronisation bytes to page info barrels to minimise impact of possible data corruption
! Fixed bug in pageinfo merging resulting in abortion of merge process
! URL submission checks for new jobs every 5 seconds rather than 1 now for stability reasons.
+ Much more efficient (300%+) inverted index scanning thanks to guaranteed sequential reads and hits skipping approach
+ Sub-indices on very frequent terms to reduce CPU usage during index scan
+ Additional short inverted index for very frequent terms
! Fixed restart bug in lexicon merging
! Final sorter can now be stopped and restarted at the point where it was stopped
!!! Many hours wasted to track bug that resulted in exception in the final sorter !!! 8=O
+ PageInfo merger can be stopped and restarted at savepoints
! Better handling of corrupted page info data files
! Escaped keywords for pattern matching that was failing on words like C++ etc
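The tokenisation rules from this release (numbers split from words, plus a few non-alphabetic chars kept for words like .net, C# and can't) can be sketched roughly as below. The regular expression is an assumption made for illustration and is not the indexer's actual rule set.

```python
import re

# Assumed token shapes: an optional leading dot, letters, optional inner
# apostrophe parts, optional trailing '#' or '+' signs; or a run of digits.
TOKEN = re.compile(r"\.?[a-z]+(?:'[a-z]+)*[#+]*|\d+")


def tokenize(text):
    """Lower-case the text and split it into word and number tokens."""
    return TOKEN.findall(text.lower())


print(tokenize("majestic12"))
print(tokenize("C# and .net can't"))
```

With this rule "majestic12" and "majestic 12" produce the same token list, which matches the behaviour described for this version.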
03/07/05 v0.1.3
! Crawled data from domains that were resolved to private IP addresses will be dropped
! URL equivalence is now in force - www.example.com is treated as the same as example.com, and slashes
at the end of URLs are ignored for purposes of comparison
! Indexing of data barrels now done in order of creation, this should result in more balanced data
! Some exceptions were slipping between fingers, now all trapped and displayed.
+ Near-real-time URL submission (includes crawling, indexing and adding to searchable database) is now available
! Fixed failure to merge search results if one or more page info offsets were not found
! Fixed bug with zero results exception from dynamic index
! Fixed incorrect DocIDs used in show Text links for dynamic index results
+ Dynamic index is now kept fully in memory to minimise search costs to almost zero
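The URL equivalence rule from this release (www.example.com equals example.com, trailing slashes ignored) could be expressed as a comparison key along these lines. This is a sketch only: the function name is hypothetical, and lower-casing the host and defaulting to http:// for scheme-less input are assumptions.

```python
from urllib.parse import urlsplit


def equivalence_key(url):
    """Return a key such that URLs considered equivalent share the same key."""
    # Assumption: input without a scheme is treated as http://.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]  # www.example.com is treated like example.com
    key = host + parts.path.rstrip("/")  # trailing slashes are ignored
    if parts.query:
        key += "?" + parts.query
    return key


print(equivalence_key("http://www.example.com/"))
print(equivalence_key("example.com"))
```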
27/06/05 v0.1.2
! Indexed: 11,358,826 web pages and 4,127,646 links
+ META keywords are no longer indexed
+ Lexicon merging is 20 times faster thanks to intermediate in-memory merge
+ Very rare terms are now removed from lexicon to reduce lexicon size by 65-70% (consequently a big speed up in merging)
! Lexicon merging no longer keeps all lexterms data for remapping in memory thus allowing it to scale beyond available memory
! Merging of inverted indices into sort-buckets is 30% faster now
+++ Overall merging has been improved to scale to at least a billion URLs, with expected A64 3.5GHz merge performance of 100 mln pages/day.
! Final sort of inverted indices is faster and much more memory efficient
! Local (barrel) duplicate (by pageid) documents are now detected and removed at the indexing stage
! The following languages determined by declared charset will not be indexed: Korean, Chinese, Japanese, Hebrew, Thai, Arabic: the reason is that
I have no clue about these languages and they seem to require special word delimiters lack of which results in pointless growth in number of
unique words
! Link anchor text that is actually a URL itself will not be indexed
+ Anchor text pointing to external pages is now taken into account, making it possible to find popular pages that were not actually indexed
! Inverted index final sorter uses less memory
+ Parallel small index for user submitted pages is now active, partial index update allows for "real-time" submission and indexing
+ Search engine will read/write some random stuff from time to time to ensure hard drive does not fall into deep sleep to avoid drive awakening penalty
10/06/05 v0.1.1
! Indexed: 2,300,000 web pages
+ (!) Text snippets with highlighted matches in text are now shown
+ WebServer now compresses dynamic JHH pages for browsers that support gzip compression (IE6/Firefox etc)
+ Document offsets can be kept in memory (saves up to 10 disk seeks per shown results page)
+ Next/Previous page navigation
+ Recent searches can now be seen
+ Minimised number of lexicon file open/close operations to 1 per search
+ More efficient usage of pattern matching in rendering of results
+ Time taken to run search can now be explained
+ Non-alphanumeric chars were stripped away in titles, now titles are kept as is
+ Fixed incorrect treatment of HTML entities in parser
+ Lexicons are now sorted by the indexer, so sorting is no longer needed at the merging stage
+ Better formatted text extracted from HTML that uses tables
+ Match ratio is now taken into account for short text hits (titles, domains etc)
+ Snippets in document info data barrels are now compressed (50% reduction in space used)
+ Fixed display of characters with accents (correct term?), queries with such characters also now correctly received
+ Added ability to view indexed text as it was seen by the indexer
03/06/05 v0.1.0 First ALPHA release (that's one week's worth of effort) featuring:
! Indexed: 1,000,000 web pages
+ default (and only) AND logic
+ simple ranking based on proximity and key locations of the hits
25/05/05 Work started on actual search engine - all effort before was directed
towards getting distributed crawler, indexer and merger done.