Certain types of icons are not displayed in search results with SharePoint 2013

It looks like SharePoint 2013 search results do not show the small icons that indicate the file type next to each result, specifically for non-Office/non-PDF documents. File types like ZIP, EML, MHT and some others have their icons displayed just fine in a Document Library view, but not as part of a search result set.

After checking the basics, such as making sure the docicons.xml file had the proper extensions and icons referenced, this looks like a potential product issue.  It is easily reproducible, and I used the IE Developer Tools to see whether an image file is actually requested when searching for a Word document versus a ZIP file.

Scenario 1:

Searching for a Word document, the icdocx.png image is pulled from the Images directory as expected:

Word_Icon_Debugging

 

Scenario 2:

Searching for a ZIP file, we would expect the ‘iczip.gif’ file to be requested as per this mapping, but that is not actually happening:

<Mapping Key="zip" Value="iczip.gif" OpenControl=""/>
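If you want to confirm the mapping is present on a given server, here is a quick hedged PowerShell check (the path assumes a default SharePoint 2013 install under the 15 hive):

# Hedged sketch: confirm the zip mapping exists in DOCICON.XML on a web front-end
$docIcon = "C:\Program Files\Common Files\microsoft shared\Web Server Extensions\15\TEMPLATE\XML\DOCICON.XML"
Select-String -Path $docIcon -Pattern 'Key="zip"'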

Zip_Icon_Debugging

 

 

Here is what I think may be happening.  There is a Default_Item display template which calls the Item_CommonItem_Body template to render the body.  There is a specific call in it that just doesn’t seem to be working well (shown below):

<!--#_
if (!$isEmptyString(ctx.CurrentItem.csr_Icon)) {
_#-->
<div class="ms-srch-item-icon">
<img id="_#= $htmlEncode(id + Srch.U.Ids.icon) =#_" onload="this.style.display='inline'" src="_#= $urlHtmlEncode(ctx.CurrentItem.csr_Icon) =#_" />
</div>
<!--#_

 

As a quick test, I simply took out the “if” clause and used the FileExtension property as a workaround.  I am not sure it’s the most graceful solution, but at least now every file that has a FileExtension property populated will get an icon associated with it.  This should cover the majority of these use cases, such as ZIP, EML, etc.  Documents that do not have a defined FileExtension (such as web pages) will still be missing icons.

 

<div class="ms-srch-item-icon">
<img id="_#= $htmlEncode(id + Srch.U.Ids.icon) =#_" onload="this.style.display='inline'" src='_#= "/_layouts/15/images/ic" + $htmlEncode(ctx.CurrentItem.FileExtension) + ".gif" =#_' />
</div>

 


How to tell that a Continuous Crawl is actually working

One of the great things about having Continuous Crawl running is that it’s supposed to magically work, i.e. you enable it and ‘mini-crawls’ are kicked off periodically, keeping your index fresh.  However, it’s not very easy to tell what’s happening in case something goes wrong, or if you simply need to find out why a specific document isn’t searchable yet, since the crawl logs are aggregated over 24 hours for the Content Source.  This means that you can’t get information about each “mini-crawl” from the crawl logs themselves.

One way to see what’s happening is to take a look at the SQL table itself. The table is called MSSMiniCrawls and it resides in your SSA database. This will at least provide you with information on when a specific mini-crawl has been kicked off or will get kicked off (based on the set interval), and it can come in handy if someone mentions that they’ve just uploaded a new document and are wondering when it will get crawled/indexed.

Continuous_Crawl

 

Each mini-crawl is recorded in this table and given a MiniCrawlID. If you have many Content Sources, it’s also easy to find out which content source a ContentSourceID corresponds to via PowerShell.
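For reference, here is a minimal PowerShell sketch (assuming a single Search Service Application) to list the content sources along with their IDs:

# Hedged sketch: map ContentSourceID values back to content source names
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa | Select-Object Id, Name, Type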


“About X results”: Sharepoint 2013 Search result counts keep changing

A question that’s asked fairly often is why SharePoint Search seems to show result counts that change as you page through the results, compared to what is initially shown on the very first page. Here is an example below:

About X results

Interestingly enough, this is pretty common for most search engines out there, and I’ll start by showing a quick example of similar behavior with Google search. I’m going to run a search for “DocumentumConnector”, which you can see returns “About 13,000 results” on my first page.

DC_1

So what happens when I get to Page 10? There are now “About 12,800 results” shown, which means these numbers are just estimates and should get more precise as we move closer to the tail of the result set:

DC_2

Let’s get back to SharePoint 2013 Search and explain what happens here. The “About x results” text is presented when the total number of results is uncertain because Collapsing (i.e. duplicate detection in a typical case) is enabled and the result set is only partially processed.

  •  With TrimDuplicates = $true (the default), SharePoint Search uses a default CollapseSpecification = “DocumentSignature”. This means we are collapsing on what are considered to be duplicate documents.
  •  Processing all of the results is too costly in terms of quickly returning results, which is why an estimate is given.

To recap, this means that the full result set isn’t processed for duplicates, and the number returned as the total number of hits after duplicate removal is an estimate based on how many duplicates were found in the results that have been de-duplicated (i.e. based on the collapsing done so far).

By default, your result page shows 10 results at a time. You may notice that for result sets with 10 or fewer results, you always get an accurate number of results and thus “About” is omitted. For more than 10 results, you will see “About x results” from the first page onwards, until you reach the very last page of results; only then do we actually know the exact number of results and can once again omit the “About”.
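If you want to see the effect duplicate trimming has on the totals, here is a minimal sketch using the server-side object model from the SharePoint 2013 Management Shell (the site URL and query text are placeholders; verify the property names on your farm):

# Hedged sketch: run the same query with and without duplicate trimming and compare counts
$site = Get-SPSite "http://intranet"
$query = New-Object Microsoft.Office.Server.Search.Query.KeywordQuery($site)
$query.QueryText = "sharepoint"
$query.TrimDuplicates = $false      # set to $true to collapse on DocumentSignature again
$query.RowLimit = 10
$executor = New-Object Microsoft.Office.Server.Search.Query.SearchExecutor
$results = $executor.ExecuteQuery($query)
$table = $results.Filter("TableType", [Microsoft.Office.Server.Search.Query.ResultType]::RelevantResults) | Select-Object -First 1
$table.TotalRows                    # total hits for this query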


Interpreting “Crawl Queue” report with Sharepoint search

We noticed something interesting a few weeks ago while working on a FAST Search for SharePoint 2010 system.  One of the advanced Search Administration Reports, the Crawl Queue report, was suddenly showing a consistent spike in the “Links to Process” counter.  This report is specific to the FAST Content SSA and the two metrics shown are:

  • Links to Process:  Incoming links to process
  • Transactions Queued:  Outgoing transactions queued

New crawls would then start and complete without problems, which led us to believe that this had to do with a specific crawl that never completed.  We took a guess that perhaps one of the crawls had been left in a Paused state, which turned out to be a correct assumption and saved us from writing SQL statements to figure out what state the various crawls were in and so on.  Once this particular crawl was resumed, Links to Process went down as expected.  This did give me a reason to explore what exactly happens when a crawl starts up and is either Paused or Stopped.
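If you run into the same symptom, a quick hedged way to look for paused crawls without touching SQL is to inspect the content sources from PowerShell (the exact property names should be verified on your build):

# Hedged sketch: list each content source and its current crawl state
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
    Select-Object Name, CrawlStatus, CrawlStarted, CrawlCompleted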

My colleague Brian Pendergrass describes a lot of these details in the following articles:

http://blogs.msdn.com/b/sharepoint_strategery/archive/2012/10/30/sp2010-search-explained-crawling.aspx

http://blogs.msdn.com/b/sharepoint_strategery/archive/2014/02/10/sharepoint-search-and-deadlocks-in-sql-server.aspx

If I had to just do a very high-level description, here is what happens when a crawl starts up:

  • The HTTP protocol handler starts by making an HTTP GET request to the URL specified by the content source.
  • If the file is successfully retrieved, a Finder() method will enumerate the items to be crawled from the WFE (web front-end) and all the links found in each document will be added to the MSSCrawlQueue table.  The Gathering Manager, MSSearch.exe, is responsible for that.  This is exactly what the “Links to Process” metric shows on the graph: the links found in each document but not yet crawled.  If there are links still to process, they can be seen by querying the MSSCrawlQueue table.
  • Items actually scheduled to be crawled and waiting on callbacks are also seen in MSSCrawlURL table.  This corresponds to “Transactions queued” metric in the graph.
  • Each Crawl Component involved with the crawl will then pull a subset of links from the MSSCrawlQueue table and actually attempt to retrieve each link from the WFE.
  • These links are removed from the MSSCrawlQueue once each link has been retrieved from a WFE and there is a callback that indicates that this item has now been processed/crawled.

Once a specific crawl was set to a Paused state, the enumerated items-to-be-crawled stayed in the MSSCrawlQueue table and were not cleared out, corresponding to the “Links to Process” metric in the graph.  If instead we had attempted to Stop the crawl, these links would actually have been cleared out from the table.

This behavior should be similar with Sharepoint 2013 Search.


Notes on securing data with Sharepoint 2013 Search

A few days ago a question came up regarding looking at SharePoint 2013 Search from a security perspective, specifically at any file-storage paths where ingested content may be stored, temporarily or permanently. An example is a document that contains personally identifiable information (PII), where it’s important to know where this document may be stored on disk for auditing purposes. We are leaving SharePoint databases out of this example.

Before talking about specific file paths, here are some general tidbits on this topic I’ve been able to gather.

  •  The SharePoint 2013 Search Service does not encrypt any data.
  •  All temporary files are secured by ACLs so that sensitive information on disk is only accessible to the relevant users and Windows services.
  • If the disk is encrypted at OS-level, this is transparent to SharePoint search. It’s important to carefully benchmark indexing and search performance when using OS-level encryption due to performance impact.
  • If you do need to use OS-level disk encryption, please first contact Microsoft support to get the official guidance from the Product Group (if official documentation is not yet available on TechNet). My understanding is that currently only Bitlocker drive encryption will work with Sharepoint 2013 Search.
  • Although both the Journal and index files are compressed, they should be considered readable.

Specific paths to where data is stored on disk at some point in time:

Index and Journal files:

C:\Program Files\Microsoft Office Servers\15.0\Data\Office Server\Applications\Search\Nodes\SomeNumber\IndexComponent_SomeNumber\storage\data

Crawler: 

1. The temp path, which is where the mssdmn.exe process initially writes the files it has gathered:
◾[RegKey on the particular Crawl Component] HKLM\SOFTWARE\Microsoft\Office Server\15.0\Search\Global\Gathering Manager\TempPath

2. The Gatherer Data Path (shared with the Content Processing Component), which is where MSSearch.exe writes the files that were gathered by the mssdmn.exe process:
◾[RegKey on the particular Crawl Component] HKLM\SOFTWARE\Microsoft\Office Server\15.0\Search\Components\CrawlComponent_<Number>\GathererDataPath
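To read these values without opening the registry editor, here is a hedged PowerShell sketch (key names follow the paths above; component key names may differ per farm):

# Hedged sketch: read the gatherer paths from the registry on a crawl server
$gm = "HKLM:\SOFTWARE\Microsoft\Office Server\15.0\Search\Global\Gathering Manager"
(Get-ItemProperty -Path $gm).TempPath

Get-ChildItem "HKLM:\SOFTWARE\Microsoft\Office Server\15.0\Search\Components" |
    ForEach-Object { Get-ItemProperty $_.PSPath } |
    Select-Object PSChildName, GathererDataPath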

Content Processing Component:

This needs to be tested a bit further and the actual path may need to be updated (will update later). Temporary storage for input/output data during parsing and document conversion in the Content Processing Component is under:

C:\Program Files\Microsoft Office Servers\15.0\Data\Office Server\Applications\Search\Nodes\SomeNumber\ContentProcessingComponent_SomeNumber\Temp\


Crawling content with Sharepoint 2013 Search

Before we go any further with search, let’s discuss in more detail how content is gathered, and that’s via crawling.  Crawling is simply the process of gathering documents from various sources/repositories, making sure they obey various crawl rules, and sending them off to the Content Processing Component for further processing.

Let’s take a more in-depth look at how Sharepoint crawl works.

Crawling_In_Depth_Architecture_updated

Architecture:

There are 2 processes that you should be aware of when working with Sharepoint crawler/gatherer:  MSSearch.exe and MSSDmn.exe

  1. The MSSearch.exe process is responsible for crawling content from various repositories, such as SharePoint sites, HTTP sites, file shares, Exchange Server and more.
  2.  When a request is issued to crawl a ‘Content Source’,  MSSearch.exe invokes a ‘Filter Daemon’ process called MSSDmn.exe. This loads the required protocol handlers and filters necessary to connect, fetch and parse the content.  Another way of defining MSSDmn.exe is that it is a child process of MSSearch.exe and is a hosting process for protocol handlers.

The figure above should give you a feel for how the Crawl Component operates: it uses MSSearch.exe and MSSDmn.exe to load the necessary protocol handlers and gather documents from various supported repositories, and then sends the crawled content via a Content Plug-In API to the Content Processing Component.  There is one temporary location I should mention from the figure (as there is more than one), and that’s a network location where the crawler stores document blobs for the CPC to pick up.  It is a temporary on-disk location managed based on callbacks received by the crawler Content Plug-In from the indexer.

The last part of this architecture is the Crawl Store database.  It is used by the Crawler/Gatherer to manage crawl operations and to store crawl history, URLs, deletes, error data, etc.
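As a quick hedged check, you can confirm both processes are running on a crawl server directly from PowerShell:

# Hedged sketch: confirm the gatherer (mssearch) and filter daemon (mssdmn) processes are running
Get-Process -Name mssearch, mssdmn -ErrorAction SilentlyContinue |
    Select-Object Name, Id, StartTime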

Major Changes from SP2010 Crawler:

– Crawler is no longer responsible for parsing and extracting document properties and various other tasks such as linguistic processing as was the case with previous Sharepoint Search versions.  Its job is now much closer to FAST for Sharepoint 2010 crawler, where crawler is really just the gatherer of documents that’s tasked with shipping them off to the Content Processing Component for further processing.  This also means no more Property Store database.

– Crawl component and Crawl DB relationship.  As of SharePoint 2013, a crawl component will automatically communicate with all crawl databases if there is more than one (for a single host).  Previously, the mapping of crawl components to crawl databases resulted in a big difference in database sizes.

– Coming from FAST for SharePoint 2010, there is a single Search SSA that handles both content and people crawls.  There is no longer a need to have a FAST Content SSA to crawl documents and a FAST Query SSA to crawl People data.

– Crawl Reports.  Better clarity from a troubleshooting perspective.

Protocol Handlers:

A protocol handler is a component used for each of the target types.  Here are the target types supported by the SharePoint 2013 crawler:

•  HTTP Protocol Handler:  accessing websites, public Exchange folders and SP sites. (http://)
•  File Protocol Handler: accessing file shares (file://)
•  BCS Protocol Handler: accessing Business Connectivity Services  – (bdc://)
•  STS3 Protocol Handler:  accessing SharePoint Server 2007 sites.
•  STS4 Protocol Handler: accessing SharePoint Server 2010 and 2013 sites.
•  SPS3 Protocol Handler: accessing people profiles in SharePoint 2007 and 2010.

Note that only STS4 Protocol Handler will crawl SP sites as true Sharepoint sites.  If using HTTP protocol handler, Sharepoint sites will still be crawled but only as regular web sites.

 

Crawl Modes:

  • Full Crawl Mode – Discover and Crawl every possible document on the specific content source.
  • Incremental Crawl Mode – Discover and Crawl only documents that have changed since the last full or incremental crawl.

Both are defined on a per-content-source basis and are sequential and dedicated.  This means that they cannot run in parallel and that they process changes from the Content Source ‘change log’ in a top-down fashion.  This presents the following challenge.  Let’s say this is what we expect from an Incremental crawl as far as processing changes and the amount of time it should take:

Incremental_crawl_expected

However, there is a tendency to have some “deep change” spikes (say, a wide security update) which alter this timeline and result in incremental crawls taking longer than expected.  Since these incremental crawls are sequential, the subsequent crawls cannot start until the previous crawl has completed, leading to missed crawl schedules set by the administrator.  The figure below shows the impact:

Incremental_crawl_actual

What is the best way for a search administrator to deal with this?  Enter the new Continuous Crawl mode:

  • Continuous Crawl Mode – Enables a continuous crawl of a content source.  It eliminates the need to define crawl schedules and automatically kicks off crawls as needed to process the latest changes and ensure index freshness. Note that Continuous mode only works for SharePoint-type content sources.  Below is a figure that shows how using Continuous crawl mode, with its parallel sessions, ensures that the index is kept fresh even with unexpected deep content changes:

Continuous_crawl

There are a couple of things to keep in mind here regarding Continuous crawls:

– Each SSA will have only one Continuous Crawl running.

– They are automatically spun up every 15 minutes (the interval can be changed with PowerShell, as shown below)

–  You cannot pause or resume a Continuous crawl, it can only be stopped.
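Here is a minimal hedged sketch of enabling continuous crawl and changing the interval via PowerShell (the content source name is a placeholder, and the SSA property name is taken from TechNet guidance):

# Hedged sketch: enable continuous crawl for a content source and shorten the interval
$ssa = Get-SPEnterpriseSearchServiceApplication
Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa -Identity "Local SharePoint sites" |
    Set-SPEnterpriseSearchCrawlContentSource -EnableContinuousCrawls $true

$ssa.SetProperty("ContinuousCrawlInterval", 5)   # minutes; default is 15
$ssa.Update()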

Scaling/Performance/Load:

Some notes here:

– Add crawl components both for fault tolerance and, potentially, for better throughput (depending on the use case).  The number of crawl components figures into the calculation of how many sessions each crawl component will start with a Content Processing Component.

– Continuous crawls increase the load on the crawler and on crawl targets.  For each large content source for which you enable continuous crawls, it is recommended that you configure one or more front-end web servers as dedicated targets for crawling. For more information, take a look at http://technet.microsoft.com/en-us/library/dd335962(v=office.14).aspx

– There is a global setting that allows you to control how many worker threads each crawl component will use to target a host.  The default setting is High, which is a change from SharePoint 2010 Search where this setting was Partially Reduced.  The reason for the change is that the crawler is now far less resource-intensive than in the past, since much of the functionality moved to the Content Processing Component.  The Microsoft support team recommends changing Crawler Impact Rules rather than this setting, mainly because Crawler Impact Rules are host-based and not global.

  1. Reduced = 1 thread per CPU
  2. Partially Reduced = Number_of_CPUs + 4, but threads set to ‘low priority’, meaning another thread can make them wait for CPU time
  3. High = Number_of_CPUs + 4, with normal thread priority.  This is the default setting
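For reference, this global setting can be inspected and changed from PowerShell; note that the values exposed there are Reduced, PartlyReduced and Maximum, which do not exactly match the labels above (a hedged sketch):

# Hedged sketch: check and change the farm-wide crawler performance level
Get-SPEnterpriseSearchService | Select-Object PerformanceLevel
Set-SPEnterpriseSearchService -PerformanceLevel "PartlyReduced"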

 

– Crawler Impact Rules:  These are not global; they are host-based.  You can either set a rule to request a different number of simultaneous requests, or change it to request one document at a time and wait a specified number of seconds between requests.

Choosing to have one request at a time while waiting a specified time between requests will most likely result in a pretty slow crawl.

Impact_Rules

We will tackle other search components in future posts, hopefully providing a very clear view of how all these components interact with each other.


Search Architecture with SharePoint 2013

I’d like to revisit the topics that Leo has so well described in his 2 previous posts titled “Search 101” and “Search Architecture in SharePoint 2010”, but discuss those in the context of SharePoint 2013 Search.  This post will address the general architecture of SharePoint 2013 Search, describe all the components involved and briefly touch upon the biggest changes when coming from the FAST for SharePoint 2010 “world”.  Future posts will go deeper into each search component and provide both an overview and troubleshooting information.

  • Search 101: general concepts of search, including crawling, processing, indexing and searching (Leo’s initial post)
  • Search Architecture in SharePoint 2013: the overall architecture of search-related components in SharePoint 2013  (this post)
  • Planning and Scale (future post)
  • Installation / Deployment (future post)
  • Crawling
  • Processing (future post)
  • Indexing (future post)
  • Searching (future post)

Search 101

As Leo has described in his previous post, if you are a complete ‘newbie’ as to how a search engine should work, the very basic tasks it should perform are:

  • Crawling: acquire content from wherever it may be located (web sites, intranet, file shares, email, databases, internal systems, etc.)
  • Processing: prepare this content to make it more “searchable”. Think of a Word document, where you will want to extract the text contained in the document, or an image that has some text that you want people to be able to search, or even a web page where you want to extract the title, the body and maybe some of its HTML metadata tags.
  • Indexing: this is the magic sauce of search and what makes it different from just storing the content in a database and searching it using SQL statements. The content in a search engine is stored in a certain way, optimized for later retrieval. We typically call this optimized version of the content the search index.
  • Searching: the best-known part of search engines. You pass one or more query terms and the search engine will return results based on what is available in its search index.

Armed with this knowledge, let’s take a look at the SharePoint 2013 Search architecture, and we can immediately see that the main components do just that:

Search Architecture in SharePoint 2013

SP2013_Search_Architecture

– Crawling:                              SharePoint Crawler via SharePoint Content SSA

– Content Processing:           CPC(Content Processing Component)

– Indexing:                              Indexing Component

– Searching:                            Query Processing Component(QPC)

You’ll notice that there is one more component that we didn’t describe in our ‘basics’, but it’s quite an important one.  The Analytics Processing Component does both usage and search analytics, learning from usage by processing various events such as ‘views’, ‘clicks’ and so on.  It then enriches the index by updating index items, which impacts relevancy calculations based on the processed data, and provides valuable information in such forms as Recommendations and Usage reports.

– Analytics:                              Analytics Processing Component(APC)

Let’s take a brief look at each sub-system and its architecture:

Crawling

Simply put, the SharePoint 2013 crawler grabs content from various repositories, runs it through various crawler rules and sends it off to the Content Processing Components for further processing.  You can think of it as the initial step of your feeding chain, with the search index being the final destination.

Crawling can be scaled out using multiple crawl components and databases.  The new Continuous crawl mode ensures index freshness, while the architecture has been simplified from FAST for SharePoint by having a single SharePoint Search Service Application handle both crawling and querying.

A Continuous Crawl can have multiple continuous crawl sessions running in parallel. This capability enables the crawler to keep the search index fresher – for example, if a preceding Continuous Crawl session is busy processing a deep security change, a subsequent crawl can process content updates.  Unlike an Incremental crawl, there is no longer a need to wait for completion before new changes can be picked up; these crawls are spun up every 15 minutes and crawl the “change logs”.

SP2013_Crawl_Component

  • Invokes Connectors/Protocol Handlers to content sources to retrieve data
  • Crawling is done via a single SharePoint Search SSA
  • Crawl Database is used to store information about crawled items and to track crawl history
  • Crawl modes:  Incremental, Full and Continuous

What’s new:

  • Incremental, Full and Continuous crawl modes
  • No need for Content SSA and Query SSA:  a single Search SSA
  • FAST Web Crawler no longer exists
  • Improved Crawl Reporting/Analytics

Content Processing

In SharePoint Search 2010, there was a single role involved in the feeding chain: the Crawler.  In FAST for SharePoint 2010, the feeding chain consisted of 3 additional components (other than the crawler): the Content Distributor, the Document Processors and the Indexing Dispatcher.  With SharePoint 2013 Search, the Content Processing Component combines all three.

A simple way to describe the Content Processing Component is that it takes the content produced by the Crawler, does some analysis/processing on the content to prepare it for indexing, and sends it off to the Indexing Component.  It takes crawled properties as input from the Crawler and produces output in terms of managed properties for the Indexer.
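As a concrete (hedged) illustration of that crawled-to-managed mapping, creating a managed property and mapping an existing crawled property to it is typically done with PowerShell; the property names below are placeholders:

# Hedged sketch: create a managed property and map a crawled property to it
$ssa = Get-SPEnterpriseSearchServiceApplication
$mp  = New-SPEnterpriseSearchMetadataManagedProperty -SearchApplication $ssa -Name "MyCustomerId" -Type 1   # 1 = Text
$cp  = Get-SPEnterpriseSearchMetadataCrawledProperty -SearchApplication $ssa -Name "ows_CustomerId"
New-SPEnterpriseSearchMetadataMapping -SearchApplication $ssa -ManagedProperty $mp -CrawledProperty $cp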

Content Processing Component uses Flows and Operators to process the content.  If coming from FAST “world”, think of Flows as Pipelines and Operators as Stages.  Flows define how to process content, queries and results and each flow processes 1 item at a time.  Flows consist of operators and connections organized as graphs.  This is really where all the “magic” happens, things like language detection, word breaking, security descriptors, content enrichment(web service callout), entity and metadata extraction, deep link extraction and so on.

 

SP2013_ContentProcessingFlows

The CPC comes with pre-defined flows and operators that currently cannot be changed in a supported way.  If you search hard enough, you will find blogs that describe how to customize flows and operators in an unsupported fashion.  The flow has branches that handle different operations, like inserts, deletes and partial updates.  Notice that security descriptors are now updated in a separate flow, which should make the dreaded “security-only” crawl perform better than in previous versions.

As I’ve mentioned, the CPC has an internal mechanism to load-balance items coming from the Crawler between the available flows (analogous to the old FAST Content Distributor).  It also has a mechanism at the very end of the flow to load-balance indexing across the available Indexing Components (analogous to the old FAST Indexing Dispatcher).  We will revisit this topic in more detail in subsequent posts.

  • Stateless node
  • Analyzes content for indexing
  • Enriches content as needed via Content Enrichment Web Service (web service callout)
  • Schema mapping.  Produces managed properties from crawled properties
  • Stores links and anchors in Link database(analytics)

What’s new:

  • Web Service callout only works on managed properties and not on crawled properties, as was done with Pipeline Extensibility in FAST for SharePoint 2010.
  • Flows have different branches that can handle operations like deletes or partial updates on security descriptors separately from main content, improving performance.
  • Content Parsing is now handled by Parsers and Format Handlers(will be described in later posts)

Note:  If the Content Enrichment Web Service does not meet your current needs and you need more ESP-style functionality when it comes to pipeline customization, talk to Microsoft Consulting Services or Microsoft Premier Field Engineers regarding the CEWS Pipeline Toolkit.

http://social.technet.microsoft.com/wiki/contents/articles/19376.sharepoint-cews-pipeline-toolkit.aspx
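For reference, registering a standard CEWS endpoint is done via PowerShell; here is a minimal hedged sketch where the endpoint URL and property names are placeholders:

# Hedged sketch: register a Content Enrichment Web Service endpoint
$ssa = Get-SPEnterpriseSearchServiceApplication
$config = New-SPEnterpriseSearchContentEnrichmentConfiguration
$config.Endpoint = "http://enrichmentserver/ContentEnrichmentService.svc"
$config.InputProperties = "Author", "Filename"
$config.OutputProperties = "Author"
$config.SendRawData = $false
Set-SPEnterpriseSearchContentEnrichmentConfiguration -SearchApplication $ssa -ContentEnrichmentConfiguration $config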

Indexing

The job of the indexer is to receive all processed content from the Content Processing Component, eventually persist it to disk (store it) and have it ready to be searchable via the Query Processing Component.  It’s the “heart” of your search engine; this is where your crawled content lives.  Your index will reside in something called an Index Partition.  You may have multiple Index Partitions, with each one containing a unique subset of the index.  All of your partitions taken together make up your entire search index.  Each partition may have 1 or more replicas, which contain an exact copy of the index for that partition.  There will always be at least one replica, and one replica in each partition acts as the primary.  So when coming from FAST, think “partitions and replicas” instead of “columns and rows”.

Each index replica is an Index Component.  When we provision an Index Component, we associate it with an index partition.

Scaling:

To increase query capacity or fault tolerance:  add more index replicas

To increase content volume:  add more index partitions
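Either change is made by cloning the active search topology and activating the clone; here is a hedged sketch (the server name, partition number and index path are placeholders):

# Hedged sketch: add an index component (a replica of partition 0) on a second server
$ssa    = Get-SPEnterpriseSearchServiceApplication
$active = Get-SPEnterpriseSearchTopology -SearchApplication $ssa -Active
$clone  = New-SPEnterpriseSearchTopology -SearchApplication $ssa -Clone -SearchTopology $active
$host2  = Get-SPEnterpriseSearchServiceInstance -Identity "Server2"
New-SPEnterpriseSearchIndexComponent -SearchTopology $clone -SearchServiceInstance $host2 -IndexPartition 0 -RootDirectory "E:\SearchIndex"
Set-SPEnterpriseSearchTopology -Identity $clone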

 

IndexPartitions

There are a couple of very important changes to internals of the indexer that I’d like to touch upon:

– There is NO MORE FIXML.  Just a reminder: FIXML stood for FAST Index XML and contained an XML representation of each document that the indexer used to create the binary index.  FIXML was stored locally on disk and was frequently used to re-create the binary index without having to re-feed from scratch.  There is now a new mechanism called a ‘partial update’, which replaces the need for FIXML.

– Instant Indexing: We can now serve queries much quicker directly from memory instead of waiting for them to be persisted to disk.

– Journaling:  Think RDBMS “transaction log”, a sequential history of all operations to each index partition and its replicas.  Together with checkpointing, allows for  “instant indexing” feature above , as well as ACID features (atomicity, consistency, isolation and durability).  For the end-user, this ensures that a full document or set of documents as a group is either fully indexed or not indexed at all.  We will discuss this in much more detail in subsequent posts.

– Update Groups/Partial Update Mechanism:  All document properties (managed properties) are split into update groups. In the past with FAST, “partial updates” were quite expensive, as the indexer would have to read the whole FIXML document, find the element, update the file, save it and re-index the FIXML document.  Now, properties in one update group can be updated at a low cost without affecting the rest of the index.

There is also an updated mechanism for merging index parts, which you can somewhat compare to how FAST handled and merged what were then called “index partitions”.

Indexing_Merging

Internally, the index is built up of several smaller inverted index parts, each one being an independent portion of the index.  From time to time, based on specific criteria, they need to be merged in order to free up the resources associated with maintaining many small indices.  Typically, smaller parts are merged more often while larger ones are merged less frequently.

Keep in mind that Level/Part 0 is the in-memory section that directly allows for the “Instant Indexing” feature.   When documents come into the indexing subsystem, they come into 2 places at the same time:

  1. The Journal
  2. The Checkpoint section(Level 0 in the figure above)

The Checkpoint section contains documents that are in memory and not yet persisted to disk, yet already searchable.  If search crashes, the in-memory portion will be lost but will be restored/replayed from the Journal on the next start-up.

Query Processing

The Query Processing Component is tasked with taking a user query that comes from a search front-end and submitting it to the Index Component.  It routes incoming queries to index replicas, one from each index partition.  Results are returned to the QPC as a result set based on the processed query, and the QPC in turn processes the result set prior to sending it back to the search front-end.  It also contains a set of flows and operators, similar to the Content Processing Component.  If coming from FAST, you can compare it to the QRServer with its query processing pipelines and stages.

QueryProcessing

  • Stateless node
  • Query-side flows/operators
  • Query federation
  • Query Transformation
  • Load-balancing and health checking
  • Configurations stored in Admin database

What’s new:

  • Result Sources/Query Rules

Analytics

Analytics Processing Component is a powerful component that allows for features such as Recommendations(‘if you like this you might like that’), anchor text/link analysis and much more.  It extracts both search analytics and usage analytics, analyzes all the data and returns the data in various forms, such as via reporting or by sending it to Content Processing Component to be included in the search index for improved relevance calculations and recall.

AnalyticsComponent

Let’s quickly define both search analytics and usage analytics:

Search analytics is information such as links, anchor text, information related to people, metadata, click distance, social distance, etc., extracted from items that the APC receives via the Content Processing Component; this information is stored in the Link database.

Usage analytics is information such as the number of times an item is viewed, collected from the front-end event store and stored in the Analytics Reporting database.

  • Learns by usage
  • Search Analytics
  • Usage Analytics
  • Enriches index for better relevance calculations and recall
  • Based on Map/Reduce framework – workers execute needed tasks.

What’s new:

  • Coming from FAST ESP/FAST for SharePoint, it combines many separate features and components such as FAST Recommendations, WebAnalyzer, Click-through analysis into a single component…and adds more.

 

 

I hope to be able to do some deep dives into each component in future posts; feel free to drop me a note with any questions that may come up.


Sharepoint 2013 Search Ranking and Relevancy Part 1: Let’s compare to FS14

I’m very happy to do some “guest” blogging for my good friend Leo and continue diving into various search-related topics.  In this and upcoming posts, I’d like to jump right into something that interests me very much, and that is taking a look at what makes some documents more relevant than others as well as what factors influence rank score calculations.

Since Sharepoint 2013 is already out, I’d like to touch upon a question that comes up often when someone is considering moving from FAST ESP or FAST for Sharepoint 2010 to Sharepoint 2013 :  “So how are rank scores calculated in Sharepoint 2013 Search as opposed to previous FAST versions”?

In upcoming posts, I will go more into “internals” of the current Sharepoint 2013 ranking model as well as introduce the basics of relevancy calculation concepts that apply across many search engines and are not necessarily specific to FAST or Sharepoint Search.

There are some excellent blog posts out there that go in-depth on how Sharepoint 2013 Search rank models work, including the ones below from Alexey Kozhemiakin and Mikael Svenson.

http://powersearching.wordpress.com/2013/03/29/how-sharepoint-2013-ranking-models-work/

http://techmikael.blogspot.com/2013/04/rank-models-in-2013main-differences.html

 

To avoid being repetitive, what I’ve tried to do is create an easy-to-read comparison chart of the factors that influence rank calculations in FS14 versus SharePoint 2013 Search.  I may update this chart in the future to include FAST ESP, although the main factors involved in ESP and FS14 are somewhat similar to each other, as opposed to SharePoint 2013 Search (which is more closely related to the SharePoint 2010 Search model).

One of the main differences is that SharePoint 2013 Search uses a 2-stage process for rank calculations: a linear ranking model as the 1st stage and a neural network as the 2nd stage.  The 1st stage is “light” and we can afford to apply it to all documents in a result set.  There are specific rank features that are part of this stage that are applied to all documents.  The top 1000 documents (candidates) based on the Stage 1 rank are input to Stage 2.  This stage is more performance-intensive and re-computes the rank score for the documents used as input, which is why it is only applied to a limited set.  It consists of all the same rank features as Stage 1 plus 4 additional proximity features.

For my comparison below, I was mainly using a model called “Search Ranking Model with Two Linear Stages”, which was introduced as of the August 2013 CU.  This model is recommended as a template when creating custom rank models, as it provides you with proximity without a neural network.
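If you want to see which rank models are available on your own farm (including this one), here is a hedged PowerShell sketch:

# Hedged sketch: list the rank models registered on the Search Service Application
$ssa   = Get-SPEnterpriseSearchServiceApplication
$owner = Get-SPEnterpriseSearchOwner -Level Ssa
Get-SPEnterpriseSearchRankingModel -SearchApplication $ssa -Owner $owner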

 

Rank factor comparison: FS14 vs. SP2013 Search

Rank Models
  • FS14: 1 OOTB rank model
  • SP2013 Search: 16 rank models

Freshness
  • FS14: available OOTB and customizable
  • SP2013 Search: N/A OOTB, possible to be configured

Dynamic Ranking (field weighting/managed properties)
  • FS14: Context Boost – Title, DocSubject, Keywords, DocKeywords, urlkeywords, Description, Author, CreatedBy, ModifiedBy, MetadataAuthor, WorkEmail, Body, crawledpropertiescontent
  • SP2013 Search: document managed properties + usage/social data – Title, QLogClickedText, SocialTag, Filename, Author, AnchorText, body

FileType
  • FS14: field-boost weight/managed property boost (OOTB -4000 points). Format: Unknown Format, XML, XLS. FileExtension: CVS, TXT, MSG, OFT, ZIP, VSD, RTF. Also IsEmptyList, IsListItem.
  • SP2013 Search: FileType rank feature – PPT, SharePoint site, DOC, HTML, ListItems, Image, Message, XLS, TXT

Language
  • FS14: N/A
  • SP2013 Search: dynamic rank (query-based); LCID, i.e. locale ID, is used

Social Distance
  • FS14: N/A
  • SP2013 Search: static rank (colleague relationship to the person issuing the query). Bucket 0 – no colleague relationship; bucket 1 – first-level (direct) relationship; bucket 2 – second-level (indirect) relationship

Static Rank Boost (Query-Independent)
  • FS14: quality weight components – hwboost, docrank, siterank, urldepthrank; Authority Weight – partial and complete
  • SP2013 Search: now part of the Analytics Processing Component; static rank features calculated with search and usage analytics – QLogClicks, QLogSkips, QLogLastClicks, EventRate

Proximity
  • FS14: enabled by default
  • SP2013 Search: MinSpan (neural network 2nd stage; parameters for proximity minimal span)

Anchortext (Query-Dependent)
  • FS14: Extnumocc – part of dynamic rank calculations, query-time hits in anchor text
  • SP2013 Search: AnchortextComplete

URLDepth (Query-Dependent)
  • FS14: N/A – in FS14 this was a static rank feature
  • SP2013 Search: UrlDepth – depth of the document URL (number of slashes)

Click-Through Weight (Query-Dependent)
  • FS14: Query-Authority weight – click-through weight, dynamic rank
  • SP2013 Search: N/A – now part of the static rank features used in the Analytics Processing Component (QLogClicks, etc.)

Rank tuning comparison: FS14 vs. SP2013 Search

GUI-based applications; ease of tuning rank calculations and user-friendliness
  • FS14: N/A. Rank calculations and scores can be seen either via the ranklog output or via CodePlex tools such as the FS4SP Query Logger. However, there isn’t a user-friendly tool to help you make changes and push them live, or preferably preview them offline. A separate ‘spreladmin’ tool is needed for click analysis.
  • SP2013 Search: Rank Tuning App (coming soon) – a GUI-based and user-friendly way to tune/customize ranking and impact relevancy; includes a “preview”, i.e. offline, mode.

Rank logging availability
  • FS14: Server-side – the ranklog is available via the QRServer output, but only to admins with local access to QRServer port 13280. Client-side – N/A.
  • SP2013 Search: Server-side – Rank Tuning App/ULS logs. Client-side – the ExplainRank template is available to clients: http://powersearching.wordpress.com/2013/01/25/explain-rank-in-sharepoint-2013-search/

 

 

 


The Myths and Perils in the Pursuit of Advanced Search Options

One question that I’ve heard a lot over the years working in the search space is: “How can I provide advanced search options for users, such as exposing boolean operators?”

My answer: don’t waste your time with it (and especially don’t do it in the first iteration of your search application).

Many search usability studies confirm that most users have no idea how to use advanced search, and instead rely only on simple keyword searches to try and find what they want. As Jakob Nielsen brilliantly put it in his article on “Converting Search into Navigation”:

In study after study, we see the same thing: most users reach for search, but they don’t know how to use it.

Given this fact, my recommendation for customers is always to start simple, with a search interface that lets users enter their keywords to find what they are looking for. Then, after collecting usage logs for a few weeks/months, you can look for query patterns that could be used to trigger more advanced search functionality, such as the Costco example in Nielsen’s article about redirecting the user to a category page instead of a search results page for certain queries where you know (from inspecting usage logs) that most users just want to get to the category page.

This way you can gradually improve the performance of your search application, by “listening” to the search behaviors of your users and adjusting your search application accordingly.


The 4 Essential Concepts You Need to Know To Use Any Search Engine Efficiently

When you go to your insert-search-application-name-here, enter a query and hit the search button, what exactly are you searching on?

One of the hardest things to do in IT (or in any field, really) is to sometimes take a step back and look at the basics, at the foundational knowledge behind some things that we may use every day without necessarily understanding how they really work.

After realizing I’ve been having the same conversation with different customers/students to explain these same main concepts over the last few years (both at FAST and now at cXense), I decided to explain a little bit about these 4 essential concepts here:

  1. Type of query (AND, PHRASE, OR)
  2. Where to search (all fields, body, title, etc.)
  3. Field Importance
  4. Sorting

Type of Query

The first thing you have to think about when constructing your search interface is: how do I want the system to match the text/query specified by the user?

To answer this question you must understand the differences in each of the following three search requests explained below.

Note: the examples below are query-language-agnostic, so just replace them with whatever the proper syntax is for the search engine you are using. Even though the syntax may change, the concepts remain the same.

AND query

Example: venture AND capital

This query above will match only documents that contain all terms in the query, which in this case means that any document, in order to be returned, must have both the term venture and the term capital. Those terms can be found together (e.g. “raised more venture capital money…”), separately (e.g. “is initiating a new venture with capital raised…”), or even in a different order (e.g. “for his new venture he raised capital from…”).

This is the most common operator used across search applications and also the default operator in many search platforms (e.g. FAST ESP, FS4SP, cX::search).

PHRASE query

Example: “venture capital”

This query above, in contrast to the AND query, will only return documents that contain this exact phrase. This means that a document with a text like “raised more venture capital money…” will match, but a document with “is initiating a new venture with capital raised…” will not (due to the fact that there is an extra term – with – in between the two required terms).

This is an operator often used behind the scenes by search applications whenever a user puts some text in between quotes into the search box. It’s very useful for scenarios where the user is trying to find some exact phrase he/she is looking for.

OR query

Example: venture OR capital

This last query is the most open of all, as it will return documents that contain any of the terms in the query. With this query, a document only needs to have the term venture or capital to be returned, without the need to have both (as was the case with the AND and PHRASE queries). This means that a document with the text “he decided to venture down the hall…” will be returned, as well as a document with the text “Brasília is the federal capital of Brazil”.

Where to search

Now that you have decided what type of queries you want to execute (and, phrase, or), the next step is to decide where you want this search to occur. When asked “where do you want to search?” people usually reply with “everywhere, of course!”. Yet it is important to step back and think about whether that’s really what you want.

Imagine you go to your search application and type “financial systems” (with/without the quotes) and click the search button, what will happen then? Where in the document do you believe this query will try to find the terms financial and systems?

The answer to these questions depends heavily on which search technology you are using behind the scenes:

  • in FAST ESP – this would be a query against the default composite field, which out-of-the-box would be comprised of fields such as body, title, url, keywords, etc.
  • in FAST Search for SharePoint – this would be a query against the fulltext index, which by default contains fields such as title, author, body, etc.
  • in cX::search – this would be a query against all searchable fields in the index

In the case of cX::search, if you do not define exactly which fields should be searched on, by default the search will be executed against all the searchable fields in the index. This means that cX::search will look for the terms financial and systems in the fields title and body, but also in fields such as category, related_content, or even unitsInStock which may not be exactly what you are looking for.

When I was teaching FAST Search for SharePoint, the main confusion for students was the fact that the default search was not across ALL fields, but instead just a subset of them, which meant that for every new managed property that you wanted to search by default (just by typing some terms in the search box, that is) you needed to make sure to add it to the fulltext index as well.

As you can see, even such a simple question can have very distinct answers depending on which search platform you are using, so the best way to avoid future problems is to first understand exactly how your specific search platform handles the default queries, and then use this knowledge to control exactly which fields you want to search on by default.

For cX::search, for example, this could be done by adding the desired list of fields before the query term:

?p_aq=query(title,body,description,tags,url,author:"financial systems", token-op=and)

In the example above we are being very clear about which fields should be used when looking for the query terms defined by the user, which makes it a lot easier to debug and answer questions like “why was this document returned in the results?”.

Field Importance

By now you should know how you want to search (and, phrase, or) and also where to search (title, body, etc.), so it’s time to decide which fields matter more to you among all the ones that were selected to be searched in the previous step. As a starting point, take a look at these document examples below:

Document 1
Title: Market Research Findings – 2012
Description: This document summarizes the findings from the 2012 market research study…
Tags: research, 2012

Document 2
Title: About the market crash of 1929
Description: All the available research on the market crash of 1929…
Tags: stock, market, 1929

Document 3
Title: XYZ begins to explore new market
Description: After a few years focused on research, company XYZ began exploring a new market…
Tags: XYZ

And now consider the following query: market AND research

Based on the sample query and documents above, which document would you expect to be ranked higher?

Most people would say Document 1 listed above should be ranked higher, and the reason is that users have been trained by search engines to expect, among other things, that anything found in the title of a document should have more relevance than something found somewhere in the body of the document. This is a very reasonable expectation, because we tend to accept that if someone went through the trouble of choosing specific terms to put in the title of a document, then those terms must be important.

So, depending on your search platform of choice, there are different ways for you to be explicit about what fields should have higher importance.

In cX::search, for example, the modified query would look like this:

?p_aq=query(title^5,tags^3,body:"market research", token-op=and)

The query above is defining that cX::search should:

  • look for documents containing the terms market and research;
  • these terms must be found in the title, tags or body fields; and, even more importantly
  • terms found in the title have 5 times (title^5) more importance than terms found in the body (the default field boost is 1)
  • terms found in the tags have 3 times (tags^3) more importance than terms found in the body

In a similar fashion, FAST ESP has the composite-rank piece of the rank profile, which allows you to define how much importance you want to give for each field that is part of a composite field.

In FAST Search for SharePoint, you also have some options available both through the UI or through PowerShell, which allow you to configure which importance level a managed property should belong to when mapped to a fulltext index, as shown in the screenshot below:

Fulltext Index Mapping
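For completeness, here is a hedged FS4SP PowerShell sketch of the same kind of mapping shown in the screenshot (the full-text index name and level are assumptions to verify on your installation):

# Hedged sketch (FS4SP shell): map a managed property into the default full-text index at importance level 5
$mp  = Get-FASTSearchMetadataManagedProperty -Name "Title"
$fti = Get-FASTSearchMetadataFullTextIndex -Name "content"
New-FASTSearchMetadataFullTextIndexMapping -ManagedProperty $mp -FullTextIndex $fti -Level 5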

As you can see from the examples above, using field boosts (or any similar feature in the search platform you are using) gives you the flexibility to be very precise about which fields matter most according to your specific business rules.

Sorting

The last important piece of this puzzle of configuring basic relevance settings for your search application is to decide how results should be sorted before being returned. This is crucial because, in the end, this is what decides what results will be displayed on top.

Remember the previous example above that used field boosts to define the importance of each field? Well, now take a look at this cX::search request below:

?p_aq=query(title^5,tags^3,body:"market research", token-op=and)&p_sm=publication_date:desc

As you can see above, this query is explicitly requesting that results be sorted by publication_date in descending order. What this means is that any field boosts are completely ignored by the search engine. Yes, they are simply ignored, since we directly requested results to be sorted based on a date field, instead of the default sorting that is based on the ranking score.

Sometimes this is exactly what you want, such as the case when the user has already drilled down to a subset of results and you want to allow him/her to just sort by price or average rating, for example (two options I often use when searching for products at Amazon).

And a last option is the case when you want to mix the two approaches, in a way that you can still use the ranking score, but with extra boosts that take into consideration how recent a document is (or how many units it has sold, or what its average rating is, etc.). Those are more advanced options that we will discuss another day, but for now just keep in mind that yes, that’s also possible 🙂
