
The Curious Case of nvarchar and varchar in Entity Framework

Written by Kyle Galbraith
Published on 15 May 2019

I was working on a project recently that uses SQL Server as its primary database. This wasn't my first rodeo with SQL Server; in fact, I have a side project that makes heavy use of the .NET stack. But in this project I encountered a major performance problem that wasn't obvious and was very easy to miss.

So in the spirit of learning in public, I thought I'd write up my experience with this problem. We can dive into the problem and the solution so that you can avoid this scar tissue in the future.

The Project

First, let’s set the stage for the problem by getting into some of the background of the project. For the purposes of this blog post, I am going to use an example that mirrors the actual project.

The project is a web API built using .NET Core. The main endpoint in the API takes in an array of ids and checks a table in the database for those ids. In terms of APIs, this one is rather straightforward.
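
To make that concrete, here is a rough sketch of what such an endpoint could look like; the controller, route, and service names are hypothetical, not the actual project's code.

using System.Collections.Generic;
using Microsoft.AspNetCore.Mvc;

namespace my_api.Controllers
{
    [ApiController]
    [Route("api/fields")]
    public class FieldsController : ControllerBase
    {
        // Hypothetical service wrapping the data access shown later in this post.
        private readonly IFieldService _fieldService;

        public FieldsController(IFieldService fieldService)
        {
            _fieldService = fieldService;
        }

        // POST api/fields/matches with a JSON array of ids in the request body.
        [HttpPost("matches")]
        public ActionResult<IEnumerable<MatchedField>> GetMatches([FromBody] IEnumerable<string> fieldIds)
        {
            return Ok(_fieldService.GetMatches(fieldIds));
        }
    }
}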

In developing this API, we had access to the database but not the original schema. That wasn't a problem since we used Entity Framework code-first to represent the table we wanted to query. We looked at the schema on the real database and mirrored its columns, with their types, in our Entity Framework model.

Here is what that model looked like after that initial phase.

using System;

namespace my_api.Data
{
    // Code-first model mirroring the table we query.
    public class FieldTable
    {
        public Int64 Id { get; set; }
        public string FieldId { get; set; }
        public string Description { get; set; }
        public DateTime DateAdded { get; set; }
        public bool Available { get; set; }
    }
}

FieldId is the column we query in the API to see whether the ids that were passed in match any FieldId values in the database. The rows that match are returned to the client. Here is the API logic that does that.

public IEnumerable<MatchedField> GetMatches(IEnumerable<string> fieldIds)
{
    return _dataContext.FieldTable.Where(f => fieldIds.Contains(f.FieldId))
        .Select(f =>
            new MatchedField()
            {
                FirstSeen = f.DateAdded,
                FieldId = f.FieldId.ToLower(),
                Available = f.Available
            }
        ).ToList();
}
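
The snippet above leans on a data context and a small result type that aren't shown in the post. Here is a minimal sketch of what they might look like, assuming EF Core; the names mirror the code above, but the real project's definitions may differ.

using System;
using Microsoft.EntityFrameworkCore;

namespace my_api.Data
{
    public class MyDataContext : DbContext
    {
        public MyDataContext(DbContextOptions<MyDataContext> options) : base(options) { }

        // The DbSet the query above goes through.
        public DbSet<FieldTable> FieldTable { get; set; }

        protected override void OnModelCreating(ModelBuilder modelBuilder)
        {
            // The generated SQL in this post targets dbo.FieldData, so map the entity to that table.
            modelBuilder.Entity<FieldTable>().ToTable("FieldData");
        }
    }

    // Shape of each match returned to the client.
    public class MatchedField
    {
        public string FieldId { get; set; }
        public DateTime FirstSeen { get; set; }
        public bool Available { get; set; }
    }
}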

Entity Framework to raw SQL

At a high level, Entity Framework is going to map our code above into a raw SQL query that looks roughly like this.

select DateAdded, FieldId, Available
from dbo.FieldData
where FieldId in (N'1', N'2', N'3', N'4', N'etc')

This query looks innocent enough, right? It is doing a rather straightforward lookup on the table. But it's actually doing a bit more than that; notice the N character in front of each id value.

This character in SQL declares the type of that string literal as nvarchar. If you're not familiar with the nvarchar type, it allows you to store any Unicode value in a column. This differs from the varchar data type, which stores non-Unicode character data limited to a single code page (roughly, ASCII plus whatever extended characters the column's collation supports).
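
A quick way to see the difference is to put the same Unicode string into each type; exactly which characters survive in the varchar depends on the database's collation and code page, but anything outside that code page is lost.

-- nvarchar keeps the Unicode characters; varchar silently replaces anything
-- outside the column's code page (the snowman below typically comes back as '?').
DECLARE @AsNvarchar nvarchar(20) = N'héllo ☃';
DECLARE @AsVarchar  varchar(20)  = N'héllo ☃';
SELECT @AsNvarchar AS NvarcharValue, @AsVarchar AS VarcharValue;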

This is where the differences between nvarchar and varchar become important. When we defined our FieldTable class above, we didn't specify the SQL types of the columns on the table. So what does that mean? It means Entity Framework is going to use its default SQL types, and for .NET strings the default type is nvarchar.

This is why we see the N character in front of each of our values that the raw SQL query is looking up.

Is that a problem? Not if the column on our table is of type nvarchar. Entity Framework is sending nvarchar ids in the query and our column has that type, so we're all good.

But what if the column is of type varchar instead?

If FieldId is actually of type varchar but Entity Framework sends an nvarchar set of ids, we now have a type mismatch. When this happens, a conversion has to take place at query time.

This is a subtle nuance that can easily go overlooked. But the performance impact can be huge if our query is looking up hundreds of values.

The Lesson Learned

This subtle difference wasn't all that noticeable when looking up a single FieldId value. It seemed a bit slower than it should have been, but not too bad overall.

But querying for 500 FieldId values was not performant at all; it was taking on the order of 45-60 seconds. This led to the investigation laid out above. Looking at the raw query Entity Framework was generating, we saw that the id values were being prefixed with N: N'1', N'2', N'3'. We assumed those columns were in fact nvarchar, so that shouldn't have been the performance bottleneck.

Then we ran the same query, but without the N prefix on the id values, looking up plain varchar strings instead.

select DateAdded, FieldId, Available
from dbo.FieldData
where FieldId in ('1', '2', '3', '4', 'etc')

The results came back in less than 500 milliseconds.

Looking at the FieldId column we were able to confirm that it was actually of type varchar and not nvarchar. Performance bottleneck found ✅.
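
If you ever need to confirm this yourself, the column types are easy to pull straight out of SQL Server; a query along these lines (using the table name from this post) will list them.

SELECT COLUMN_NAME, DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'dbo' AND TABLE_NAME = 'FieldData';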

This wasn't Entity Framework's fault or even the fault of the database. It was a small bug in the data model we created, and it was very easy to overlook. When we defined the FieldTable class in code, we specified only that FieldId was of type string.

namespace my_api.Data
{
    public class FieldTable
    {
        public Int64 Id { get; set; }
        public string FieldId { get; set; }
        public string Description { get; set; }
        public DateTime DateAdded { get; set; }
        public bool Available { get; set; }
    }
}

When it came time for EF to query that table, it did what it does best: translate our code into a SQL query. But it added the N prefix to each id in the WHERE ... IN clause, operating under the assumption that the FieldId column was of type nvarchar. That is because nvarchar is the default SQL type for the .NET string type in Entity Framework.

select DateAdded, FieldId, Available
from dbo.FieldData
where FieldId in (N'1', N'2', N'3', N'4', N'etc')

Bada bing, major performance problem. But why is that? Because of SQL Server's data type precedence rules, the varchar column has to be implicitly converted to nvarchar before it can be compared to the nvarchar values in the query. That conversion happens for every row being checked, and it also prevents SQL Server from using an index on FieldId efficiently. With 500+ ids to look up, it was very costly.
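
You can reproduce the difference outside of Entity Framework with two otherwise identical queries. Assuming there is an index on the varchar FieldId column, the first can seek on it directly, while the second forces the column to be converted and typically falls back to scanning.

-- Literal type matches the varchar column: an index on FieldId can be seeked.
SELECT DateAdded, FieldId, Available
FROM dbo.FieldData
WHERE FieldId = '1';

-- nvarchar literal against a varchar column: the column side is implicitly
-- converted to nvarchar, which defeats the index seek.
SELECT DateAdded, FieldId, Available
FROM dbo.FieldData
WHERE FieldId = N'1';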

Be explicit when necessary

The fix was very straightforward to put in place. We needed to be explicit with the properties defined on FieldTable. This meant adding an attribute to each string property that tells Entity Framework the exact SQL data type it represents.

using System;
using System.ComponentModel.DataAnnotations.Schema;

namespace my_api.Data
{
    public class FieldTable
    {
        public Int64 Id { get; set; }
        [Column(TypeName = "varchar(50)")]
        public string FieldId { get; set; }
        [Column(TypeName = "varchar(500)")]
        public string Description { get; set; }
        public DateTime DateAdded { get; set; }
        public bool Available { get; set; }
    }
}

The change was to add [Column(TypeName = "varchar(50)")] to the FieldId property. This tells Entity Framework the exact SQL data type of the column. By doing that, Entity Framework now generates the appropriate SQL query, one without nvarchar strings.

select DateAdded, FieldId, Available
from dbo.FieldData
where FieldId in ('1', '2', '3', '4', 'etc')

The result? Looking up 500+ ids at a time can now be done in 200-400 milliseconds instead of 45-60 seconds.
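
As an aside, if you would rather keep mapping details out of the entity class itself, the same column types can be configured through Entity Framework's fluent API instead of attributes. A sketch of that, assuming EF Core and extending the OnModelCreating from the context sketched earlier, might look like this.

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    modelBuilder.Entity<FieldTable>(entity =>
    {
        // Map the entity to its table and, equivalent to the [Column] attributes,
        // tell EF the exact SQL Server types so it sends varchar values, not nvarchar.
        entity.ToTable("FieldData");
        entity.Property(f => f.FieldId).HasColumnType("varchar(50)");
        entity.Property(f => f.Description).HasColumnType("varchar(500)");
    });
}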

Conclusion

Building solutions where you don’t have access to all the pieces in play can be challenging. Duplicating a schema manually into your own code is error-prone as we have seen.

ORMs like Entity Framework are fantastic at hiding the complexities of database access. Most of the time this is what we want. However, as we have seen, that hiding can introduce nuances that are easy to overlook. Use ORMs where they help, but make sure you have a solid footing in the implicit decisions they make on your behalf.
