Tags: Database schemas

Then each went to his own home

Philipp Keller’s thoughts on tags and other sweets

googleyahoobing

previous post: «The Google world»

next post: «Tags with MySQL fulltext»

April 24th, 2005

Tags: Database schemas

Recently, on del.icio.us mailinglist, I asked the question “Does anyone know the database schema of del.icio.us?” .
I got a few private responses so I wanted to share the knowledge with the world.

The Problem: You want to have a database schema where you can tag a bookmark (or a blog post or whatever) with as many tags as you want. Later then, you want to run queries to constrain the bookmarks to a union or intersection of tags. You also want to exclude (say: minus) some tags from the search result.
Apparently there are three different solutions (Attention::If you are building a websites that allows users to tag, be sure to have a look at my performance tests as performance seems to be a problem on larger scaled sites.)

“MySQLicious” solution


In this solution, the schema has got just one table, it is denormalized.
This type is called “MySQLicious solution” because MySQLicious imports del.icio.us data into a table with this structure.

Intersection (AND)

Query for “search+webservice+semweb”:SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags LIKE "%semweb%"

Union (OR)

Query for “search|webservice|semweb”:

SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
OR tags LIKE "%webservice%"
OR tags LIKE "%semweb%"

Minus

Query for “search+webservice-semweb”SELECT *
FROM `delicious`
WHERE tags LIKE "%search%"
AND tags LIKE "%webservice%"
AND tags NOT LIKE "%semweb%"

Conclusion

The advantages of this solution:

Disadvantages:

  • You have a limit on the number of tags per bookmark. Normally you use a 256byte field in your DB (VARCHAR). Otherwise, if you took a text field or similar, the query times would slow down, I suppose
  • If you paid attention (as Patrice did) you notice that LIKE "%search" will also find tags with “websearch”. If you alter the query to LIKE " %search% " you end up having a messy solution: You have to add a space to the beginning of the tags value to make this work.

“Scuttle” solution

Scuttle organizes its data in two tables. That table “scCategories” is the “tag”-table and has got a foreign key to the “bookmark”-table.

Intersection (AND)

Query for “bookmark+webservice+semweb”:SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId
HAVING COUNT( b.bId )=3

First, all bookmark-tag combinations are searched, where the tag is “bookmark”, “webservice” or “semweb” (c.category IN ('bookmark', 'webservice', 'semweb')), then just the bookmarks that have got all three tags searched for are taken into account (HAVING COUNT(b.bId)=3).

Union (OR)

Query for “bookmark|webservice|semweb”:
Just leave out the HAVING clause and you have union:SELECT b.*
FROM scBookmarks b, scCategories c
WHERE c.bId = b.bId
AND (c.category IN ('bookmark', 'webservice', 'semweb'))
GROUP BY b.bId

Minus (Exclusion)

Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.SELECT b. *
FROM scBookmarks b, scCategories c
WHERE b.bId = c.bId
AND (c.category IN ('bookmark', 'webservice'))
AND b.bId NOT
IN (SELECT b.bId FROM scBookmarks b, scCategories c WHERE b.bId = c.bId AND c.category = 'semweb')
GROUP BY b.bId
HAVING COUNT( b.bId ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
Credits go to Rhomboid for helping me out with this query.

Conclusion

I guess the main advantage of this solution is that it is more normalized than the first solution, and that you can have unlimited number of tags per bookmark.

“Toxi” solution


Toxi came up with a three-table structure. Via the table “tagmap” the bookmarks and the tags are n-to-m related. Each tag can be used together with different bookmarks and vice versa. This DB-schema is also used by wordpress.
The queries are quite the same as in the “scuttle” solution.

Intersection (AND)

Query for “bookmark+webservice+semweb”SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id
HAVING COUNT( b.id )=3

Union (OR)

Query for “bookmark|webservice|semweb”SELECT b.*
FROM tagmap bt, bookmark b, tag t
WHERE bt.tag_id = t.tag_id
AND (t.name IN ('bookmark', 'webservice', 'semweb'))
AND b.id = bt.bookmark_id
GROUP BY b.id

Minus (Exclusion)

Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.
SELECT b. *
FROM bookmark b, tagmap bt, tag t
WHERE b.id = bt.bookmark_id
AND bt.tag_id = t.tag_id
AND (t.name IN ('Programming', 'Algorithms'))
AND b.id NOT IN (SELECT b.id FROM bookmark b, tagmap bt, tag t WHERE b.id = bt.bookmark_id AND bt.tag_id = t.tag_id AND t.name = 'Python')
GROUP BY b.id
HAVING COUNT( b.id ) =2

Leaving out the HAVING COUNT leads to the Query for “bookmark|webservice-semweb”.
Credits go to Rhomboid for helping me out with this query.

Conclusion

The advantages of this solution:

  • You can save extra information on each tag (description, tag hierarchy, …)
  • This is the most normalized solution (that is, if you go for 3NF: take this one 🙂

Disadvantages:

  • When altering or deleting bookmarks you can end up with tag-orphans.

If you want to have more complicated queries like (bookmarks OR bookmark) AND (webservice or WS) AND NOT (semweb or semanticweb) the queries tend to become very complicated. In these cases I suggest the following query/computation process:

  1. Run a query for each tag appearing in your “tag-query”: SELECT b.id FROM tagmap bt, bookmark b, tag t WHERE bt.tag_id = t.tag_id AND b.id = bt.bookmark_id AND t.name = "semweb"
  2. Put each id-set from the result into an array (that is: in your favourite coding language). You could cache this arrays if you want..
  3. Constrain the arrays with union or intersection or whatever.

In this way, you can also do queries like (del.icio.us|delicious)+(semweb|semantic_web)-search. This type of queries (that is: the brackets) cannot be done by using the denormalized “MySQLicious solution”.
This is the most flexible data structure and I guess it should scale pretty good (that is: if you do some caching).

Update May, 2006. This arcticle got quite some attention. I wasn’t really prepared for that! It seems people keep referring to it and even some new sites that allow tagging give credit to my articles. I think the real credit goes to the contributers of the different schemas: MySQLicious, scuttle, Toxi and to all the contributors of the comments (be sure to read them!)

P.S. Thanks to Toxi for sending me the queries for the three-table-schema, Benjamin Reitzammer for pointing me to a loughing meme article (a good reference for tag queries) and powerlinux for pointing me to scuttle.

Further reading

This entry was posted

on Sunday, April 24th, 2005 at 2:35 pm and is filed under Del.icio.us, MySQL, Tags.
You can follow any responses to this entry through the RSS 2.0 feed.

You can skip to the end and leave a response. Pinging is currently not allowed.

99 Comments
»

  1. I doubt the de-normalized solution would be “fast” by any means.

    De-Normalizing things can *sometimes* improve performance but that doesn’t include putting data into comma seperated text-fields and querying the text data.

    The performance of a query like..

    WHERE tags LIKE “%search%”
    AND tags LIKE “%webservice%”
    AND tags NOT LIKE “%semweb%”

    is going to totally BITE in comparison with the 3NF solution you describe, because “LIKE” is fundamentally slow – think about it, you’re asking the DB to load every entry, parse its value and *then* decide whether to include that row.

    Conversely with the 3NF solution you can whack an index on the tag table, the join table, and the operation of finding all bookmarks for a given tag_name becomes entirely index based. The UNIONS and INTERSECTIONS etc are then a lot faster, because they are dealing only with the small subset of relevant data instead of the full data.

    Don’t believe me? Try it.

    I haven’t tried it but, the real test of performance is always in the pudding.

    Comment by Miles Thompson — April 24, 2005 9:52 pm #comment-57

  2. […] minus) some tags from the search result. Apparently there are three different solutions: Then each went to his own home » Tags: Database schemas […]

    Pingback by HYPERGURU » del.icio.us Tags: Database schemas — April 25, 2005 1:15 am #comment-58

  3. del.icio.us schema

    Just read this discussion about possible del.icio.us database schema solutions. Interesting postulation. Here’s how I’d do it, and in fact, how I set it up for tags in my linkblog. Read on…

    Trackback by Ivan Tumanov’s Blog — April 25, 2005 2:58 am #comment-59

  4. Miles: Thank you for the remarks. I will do some performance test on the first and third solution.

    Comment by phred — April 25, 2005 6:39 am #comment-60

  5. What would a truly normalized database that covered all the current features of del.icio.us look like, I wonder? Throwing out considerations of SQL optimization, how about these six tables?:

    =====================
    Site
    =====================
    site_id
    site_url
    site_description
    site_extended
    =====================

    =====================
    Party
    =====================
    party_id
    party_screenname
    party_email
    party_passwordhash
    party_name
    party_url
    party_publicflag
    =====================

    =====================
    Tag
    =====================
    tag_id
    tag_text
    =====================

    =====================
    Site-Party
    =====================
    site_id
    party_id
    =====================

    =====================
    Site-Party–Tag
    =====================
    site_id
    party_id
    tag_id
    =====================

    =====================
    Inbox
    =====================
    party_id
    party2_id
    =====================

    Comment by Robert — April 25, 2005 9:55 am #comment-61

  6. The denormalized table using LIKE will be SLOW. Full text index and matching would be considerably faster, but I don’t know how it would work functionally.

    This isn’t an old problem by any means, though– it is a pretty common scenario in teaching database design and not much different than “books” and “subjects” or any other canonical example. The standard answer is:

    table: bookmarks – bid, bookmark, comment, etc
    table: tags – tid, tag, etc
    “linking” table: bookmarks_tags – bid, tid

    Then you can do and/or using the IN and NOT IN operators with lists and a few joins…

    Comment by Chris L — April 25, 2005 9:58 am #comment-62

  7. I use the “Scuttle” solution for the Media Manager .

    I also agree with Miles on the performance of the first solution. MySQL has to look at every single row and no index is going to speed up your queries. If you had a “LIKE ‘test%’” query, that could be sped up with an index, but never “LIKE ‘%test%’”. Additionally if you have an object tagged with “websearch” it will end up in the result list using your example query. This is not desirable IMO.

    Comment by Patrice — April 25, 2005 9:59 am #comment-63

  8. Miles, Chris, Patrice: Thank you for your comments. You are all absolutely right. I included the insights into the entry.

    Patrice: Interesting, that you use the “Scuttle” solution. I couldn’t come up with a good argument for this solution. I thought if normalized then the third schema would be the best. Guess you have some arguments?

    Comment by phred — April 25, 2005 11:55 am #comment-64

  9. In either solution #2 or #3, you’ve got two ways to do a minus. One of them is to do an inner join for the query you want to minus… (It looks something like this, but it may take some tweaking to work)

    SELECT b.*
    FROM scBookmarks b, scCategories c INNER JOIN scCategories d USING d.bId=b.bId
    WHERE c.bId = b.bId
    AND (c.category IN (‘bookmark’, ‘webservice’))
    AND d.bId=’semweb’
    AND d.bId=NULL
    GROUP BY b.bId

    This kind of query usually isn’t very fast, but how fast it is will depend on the query (what % of bookmarks are being minused out) and on how MySQL’s query optimizer decides to implement the query.

    Another way to do it (MySQL 4.1 and up) is to use a subquery

    SELECT b.*
    FROM scBookmarks b, scCategories c
    WHERE c.bId = b.bId
    AND (c.category IN (‘bookmark’, ‘webservice’)
    AND NOT EXISTS (SELECT * FROM scCategories d
    WHERE d.bId=b.bId
    AND d.category=’semweb’) GROUP BY b.id

    Comment by Paul A. Houle — April 25, 2005 2:34 pm #comment-65

  10. […]

    Tags op DB niveau

    Voor de webdevs onder u: tags en databases. [via]

    25 apr 2005 @ 19:37 in Dump […]

    Pingback by Low Weblog » Tags op DB niveau — April 25, 2005 6:37 pm #comment-66

  11. Cool comparison. I was a little surprised to see MySQLicious listed in there. 🙂

    The reason I stored the tags like that is that, well, I was being lazy. I figured people might want them as part of a backup, but I’m not actively using them at the moment. All I do with MySQLicious right now is pull my !linklog tag directly from del.iciou.us then do SELECTs based on date to get the bookmarks out onto my site.

    If I was actually using the tags for anything on my site, I would have put some thought into storing them in a more useful way. Maybe that’s something I’ll look into for 2.0. 🙂

    Comment by adam — April 25, 2005 7:50 pm #comment-68

  12. I cheat a bit. Each tag gets a unique number, each url gets a bit vector (stored in a blob) where the bits refer to tags. The good news is that unions, intersections, differences are all fairly easy; the bad news is that the bit vectors get bigger over time. In practice, I’m just interested in my personal tags, so the list is pretty small, and I load everything into memory and do the manipulations there.

    Comment by Sam Denton — April 25, 2005 7:50 pm #comment-69

  13. @Philip: About your question, why I use the “Scuttle” method. I found it a good tradeoff between a normalized table and ease of programming. In my case the tag is the primary key (so unlike the “scCategories” table in your example I don’t have a separate ID – just the tag name and the foreign key ID of the associated object).

    The advantage over the completely normalized solution is, that I can just store stuff into the database without checking whether I have to create the tag or if it already exists.

    With the “Scuttle” solution I manage storing an object with this process (simplified): “save object”, “remove object’s tags if there are any”, “insert one record into object_tags for each tag the user entered”. Of course this is very lazy, my interpretation of KISS.

    With the “Toxy” solution I’d have to do the following: “save object”, “remove object’s tagmap entries” (KISS again) then for every entered tag: “1) create the tag if it doesn’t exist yet – requiring a SELECT and an INSERT, 2) insert the association into tagmap”.

    Personally I don’t think the “Toxy” method brings me any advantage. I’m never going to rename a tag for all instances of the object, and space consumption is probably not better than my model because of the extra table and IDs. Of course, I may be missing something.

    Comment by Patrice — April 25, 2005 9:13 pm #comment-70

  14. And home I went!

    See [1] (requires Linux and Firefox to run without bugs) for a very rough implementation of a liminal[2] type architecture. I figured myself the best way to approach this would be a Toxy type solution but ‘hot switching’ to a columns/tags parity when or if a DTD is developed (though note I haven’t thought through if this is possible in every case as yet). See also [3].

    [1] http://aspn.activestate.com/ASPN/Mail/Message/perl-xml/2506692
    [2] http://www.lmnl.net/
    [3] http://www.gsowww.uklinux.net/pub/Archive/misc/Semantic_Web_Architecture.pdf

    Comment by GSO — April 25, 2005 10:17 pm #comment-71

  15. […] development

    Database schema for tagged systems

    1114461588 Database schema for tagged systems

    This e […]

    Pingback by Mischief to Data » Blog Archive » Database schema for tagged systems — April 25, 2005 10:40 pm #comment-72

  16. […] #8217;s mojo is so-so Tagging in MySQL Phillip Keller has a nice summary of a few database schemas for a tag-enabled application. Basically, […]

    Pingback by Feedmarker Blog » Blog Archive » Tagging in MySQL — April 25, 2005 10:53 pm #comment-73

  17. Why don’t you use the actual UNION SQL operator? All you do is string together a bunch of SELECTS that have the same schema:

    SELECT b.*
    FROM scBookmarks b, scCategories c
    WHERE c.bId = b.bId
    AND c.category = ‘bookmark’
    GROUP BY b.bId

    UNION

    SELECT b.*
    FROM scBookmarks b, scCategories c
    WHERE c.bId = b.bId
    AND c.category = ’semweb’
    GROUP BY b.bId

    UNION

    SELECT b.*
    FROM scBookmarks b, scCategories c
    WHERE c.bId = b.bId
    AND c.category = ‘webservice’
    GROUP BY b.bId;

    In this case, you get away with using a much simpler, more explicit set of queries, all using the power of SQL. There’s also EXCEPT and INTERSECT. They work in MySQL, too.

    Comment by Jason Scheirer — April 26, 2005 3:38 am #comment-75

  18. Thank you, Patrice for your explanation. The schema you describe makes sense.

    Robert: I wonder too what the real schema for del.icio.us is.. Interesting is, that del.icio.us does not yet provide the union constraint of two tags. In all the schemas mentioned, the union-constraint was even easier than the intersection. If this was the case for del.icio.us schema too, then Joshua would have implemented it I suppose.. So I guess that del.icio.us uses a not yet mentioned schema.

    Comment by phred — April 26, 2005 7:13 am #comment-77

  19. two questions:
    1. Does del.icio.us use MySQL at all, or a another DB with triggers? Triggers could help getting rid of orphan problems
    2. Why does the toxy db scheme have an extra id field in the tagmap? The other two keys combined do already form a key and a usefull constraint.

    Comment by Engelke — April 26, 2005 12:43 pm #comment-78

  20. Engelke: Good point with the triggers. And: I think too that the extra id field is superfluous..

    Comment by phred — April 26, 2005 1:53 pm #comment-79

  21. I may be overlooking something, but none of the models described on this page account for user data. Within del.icio.us, each user can bookmark a single URL with different tags; the database needs to store unique combinations of: (bookmarkID, userID, tagID).

    Comment by matt — April 26, 2005 4:32 pm #comment-80

  22. First version with a MySQL 4.1 fulltext index. You don’t have to add spaces and it’s as fast as the 3NF version because you don’t have to use any %.

    -hannes

    Comment by Anonymous — April 26, 2005 7:16 pm #comment-81

  23. I’m currently working on a folksonomy based web-app… Since we’re in the early development stages (although, clearly, the development doesn’t take too long as the concept is simple) I’m doing some testing. We initially went with a postgresql type #1 from above implementation, and have tested it and it performs decently. The #3 method performs better on inserts, but is currently performing horribly on searching using includes and excludes. This is probably because I’m not much of a database wizard. I’ve tried doing intersection/disjunctions, which were 5 times slower doing 3000 queries with 1 tag included and 1 exlcude in the search. I’m currently in the process of trying group-by type queries (as described in the article) but those are also going quite slowly. It seems as though full text queries might be the right way to go for us, as they’re really easy to construct and really quick to use for searching, comparatively. Perhaps this is because I’m incapable of writing good database queries 🙂

    The current test is something like this:
    SELECT b.* from bookmark b, tagmap tm, tags t WHERE b.item_id = tm.city_guide_id AND tm.tag_id = t.tag_id AND t.tag IN (‘bookmark’) GROUP by b.item_id, b.tags, b.category HAVING count(b.*) = 2 EXCEPT (SELECT b.* from city_guide b, tagmap as tm, tags as t where b.item_id = tm.city_guide_id AND tm.tag_id = t.tag_id AND t.tag IN (‘delicious’))

    Sadly, I like the fully-normalized implementation because it seems cleaner. But, it looks like full text search (depending on implementation) might be quicker, at least in some cases.

    Comment by Paul — April 26, 2005 9:44 pm #comment-82

  24. Matt: You are right. There is no User information in the schemas in the first and in the third solution. I think the user data doesn’t add complexity to the queries so I left them out. I guess you’d add user_id to table “delicious” on the first, and to table “tagmap” on the third solution..

    Hannes and Paul: Sounds very interesting! I am setting up some performance tests and will provide some graphs. If you wish to have something tested/set right please tell me as I don’t have much practice in doing DB tests.

    Comment by phred — April 27, 2005 6:42 am #comment-83

  25. What if we add multi-language tag support? Would adding variations of a term in the table Tag of the Toxy solution work? What I mean is say, by searching for “flower”, the query would understand and return tuples tagged with “fleur” as well.

    Comment by Zelnox — April 27, 2005 7:22 am #comment-84

  26. phred – I think you’ll find that accomodating user information adds significant complexity, because every user can store his or her own tags for any URL. The schema changes to store user data are trivial (as you pointed out, you can simply add a column for userID), but the index changes are not — for example, in the 3-table solution, for a single user the combination of (bookmark_id, tag_id) in the tagmap table can be unique… this is a “narrow” folksonomy, as found at Flickr. If you allow multiple users to tag an object separately (as in a “broad folksonomy”), the combination of (bookmark_id, tag_id, user_id) is unique, but there are multiple instances of (bookmark_id, tag_id). In a broad folksonomy you may want to maintain a separate table or index of (bookmark_id, tag_id) — that is, a basic inverted index table that consolidates all the users’ tags into a single (tag, bookmark) row — to simplify searches that are user-agnostic.

    You can find the distinction of of “narrow” vs. “broad” folksonomies:
    http://www.hyperorg.com/blogger/mtarchive/003840.html

    Comment by matt — April 27, 2005 6:35 pm #comment-85

  27. Good comparison. It’s nice to see some information on how these “tagging” schemes are being put together. I actually use my own which I came up with prior to all the tagging articles I’ve been reading, similar to the Scuttle method.

    As far as the Toxy solution goes, what is the need for an id field in the tagmap table? In order to prevent duplications of tag/bookmark pairs, you’d want a primary key of both the tag_id and the bookmark_id, thus eliminating the need for the id field.

    Comment by Patrick Haney — April 27, 2005 9:35 pm #comment-86

  28. Matt: Interesting! Good to have names for these things (broad/narrow folksonomy), I didn’t know that.
    I think I got your point: Have different index tables for different query types? That surely makes the insert/update/delete queries more expensive.. but at del.icio.us, this seems to be the case (slow inserts/fast queries). Even though I often wish that the inserts in del.icio.us would be faster..

    Patrick: As Engelke pointed it out, this column isn’t needed. I didn’t pay that much attention to details.. sorry 🙂

    Comment by phred — April 28, 2005 6:44 am #comment-87

  29. the problem with ‘toxy’ — or any — subselects (despite their 3nf goodness) is that they’re not optimized in mysql. try explain select * in (select …)
    the solution is to hit the database a few times and perform the queries and array joining yourself in the app logic.
    shameless self-link explaining how to do related-tags with mysql and ruby/rails: habtm.com
    [Ed:] The article, courteany refers to: Tagging again

    Comment by courteany — May 1, 2005 9:59 am #comment-89

  30. Following up on Matt’s comment, the decision to have a broad folksonomy (with user specific tags, eg, del.icio.us) or a narrow folksonomy (with only one set of tags per item, eg, flickr.com) would have a big impact on the schema.

    del.icio.us, for example, also allows each user to add comments (under the “extended” field) for each bookmark/ tag item. This approach suggests the three-table “Toxy” schema.

    The tagmap table would also need user_id or username. Then the table would have a composite primary key of user_id, tag_id, and bookmark_id.

    tagmap
    tagmap_id (optional)
    user_id
    tag_id
    bookmark_id
    user_comments_extended

    Although a generic ‘id’ is not necessary, tagmap could also use an auto increment tagmap_id in order to avoid using all three ids to reference a particular bookmark/ tag/ author’s item/ comments.

    Comment by brettstil — May 2, 2005 7:00 am #comment-91

  31. I disagree. Both systems can be done in exactly the same way. I really can’t see why you need to store user information against tags. It’s all in how you collect and present the tag information, not how you store it. Using a tags, items, tags_to_items style schema works fine even with user specific tags.. that’s because there /is/ no such things as a user specific tag. It just ‘appears’ as if there is, since if you want to show “all items with tag X by user Y”.. you just craft the query to only get those responses. If the user wants to change the tag name, you just add a new tag to the general set, and change the relationships over.

    Comment by Peter Cooper — May 5, 2005 1:04 pm #comment-92

  32. Peter, you’re correct that “there /is/ no such things as a user specific tag”. By that I mean, there’s probably only one tag_id for the tag “CSS”, and only one tag_id for the tag “macintosh”, and only one tag_id for the tag “photoshop”, within the del.icio.us database. In other words, the tags themselves are not user-specific.

    However, the association of tags to URLs is absolutely user-specific. del.icio.us allows two users to tag one URL with two different sets of tags. This implies that user information must be present in the tagmap table. In other words, the combination of (tag_id, url_id) is not unique; rather, only the 3-way combination of (tag_id, url_id, user_id) is unique.

    There are alternative implementations that would keep user data out of the tagmap table, but the user-to-tag-to-URL 3-way association has to be made somewhere, or else there is no way to distinguish my tags for a given URL from your tags for that URL.

    Comment by matt — May 5, 2005 4:01 pm #comment-93

  33. Hi Phillip,

    Thanks for the overview of potential tagging solutions. As the author of Freetag, I hope I can contribute a few notes.

    Freetag uses a Toxy-like solution, but it’s actually more extensive than that. The “Bookmarks” sample table can be anything with Freetag. One of the planned features is to support different namespaces that you can use to hold references to completely different objects / users.

    Also, Freetag does store a reference back to a user when the user tags an object. The tags themselves are stored once, but the tagging user is identified inside of the middle referential table.

    It shouldn’t present any problems with systems that display non-user-related tags, as all of the API functions allow you to either specify a user ID to filter with, or leave that parameter as NULL to bring back a user-global set of tags on that object.

    Personally, I don’t see anyone normal building complex intersections or unions with any sort of tag logic anytime soon. I’m planning a little writeup on my blog later about this topic, so check around in a few days for more of my opinions about why those aren’t that important.

    Thanks for the mention!

    -Gordon

    Comment by Gordon Luk — May 10, 2005 6:00 pm #comment-97

  34. Gordon: Interesting that nobody wants those complex intersections…
    Because I’d like to have them.. I’d like to browse delicious with more complex combination of tags. For instance to track new urls as rss feed or so.

    I have a followup of Matts comment: Replicating data is the key for fast applications.. just look how flickr has done it:

    Normalized data is for sissies.

    I like that!

    Comment by phred — May 12, 2005 12:40 pm #comment-98

  35. Hello Folks,

    I have been blogging about exactly these issues and from fundamental principles have derived a database schema that appears to encompass all these ideas in one place. It uses lessons from data warehousing where massive datasets and complex queries are common. I come from a relational database background (worked at Sybase and Ingres in the 90’s) and am pretty certain that multi-table joins will not scale to the data volumes that folksonomy apps will generate. My blog, tagschema, contains both philosophical musings about tag based apps and hard core logical database design that addresses a lot of the issues that appear here.
    I am glad I stumbled upon this site via deli.icio.us of course, and am glad to see that other practitioners have also been thinking along similar lines.

    Philip thanks very much for starting this thread, it is vital that such knowledge be shared and devloped in the community to be able to build next generation tagapps.

    Looking forward to more …..

    Comment by Nitin Borwankar — June 25, 2005 12:19 am #comment-357

  36. phred –

    Freetag now does support tag combos, or intersections. Sorry if the post above came off as condescending. I think what I meant to say was that before intersections become useful, it’s more important to have a well tagged database. Delicious now has them, which is actually useful, because of the sheer size of the database. For smaller apps, it may not be as handy except for extreme power users who are using them to enforce ad-hoc namespaces. However, I do understand that there are people like yourself who want them, so i’m doing my best to accomodate that.

    Thanks,

    Gordon

    Comment by Gordon Luk — June 28, 2005 4:57 pm #comment-437

  37. They’re called “conjunction” and “disjunction” queries (and they are all “restriction” queries), not “union” and “intersection”. “union” and “intersection” come from set theory. A single table (or joined group of tables, or view, etc.) is a set, so any query with WHERE criteria against that set will form a subset – literally a restriction of the original set.

    UNION is a standard SQL keyword that allows you to join together the results of *separate* queries, views, etc. It has nothing to do with AND criteria in the WHERE clause, just as the term “intersection” (also a SQL keyword, but not nearly as widely supported as UNION) has nothing to do with OR criteria.

    Remember, within the context of a single query, AND and OR criteria in any combination always create a subset, not multiple sets. Only talk about UNION and INTERSECT when you are working with the results of multiple queries. It’ll be easy to know when to use the term UNION, because it will actually be part of your query! Not that it’s impossible to create a union without the UNION keyword, but usually UNION makes the most sense. INTERSECT, on the other hand, is not supported in many databases, but it’s still easy to identify whether you are working with one set or multiple sets.

    Actually, your usage of these terms is interesting, as it indicates that you view a table as many possible sets derived from all possible queries. That’s a legitimate point of view, but relational theory is quite clear about the concept that a table is to be considered a set, and potential sets are just that: potential. Once you create one of those possible sets with, say, a query. Hey, now you’ve got another set and can deal with it accordingly – but not until it actually exists.

    By the way, your “three-table-schema” is simply called a many-to-many schema: there is no other way to acheive a many-to-many relationship in an RDBMS. The “tagmap” is called a “link” table or, more formally, an association table (particularly if additional information is stored in each “link” record, like what user posted the tag, post time, etc.). You would usually name it “bookmark_tag” to clearly indicate that its only purpose is to join the bookmark and tag tables (and to leave room for other link table names like “photo_tag”, etc.).

    OK, far too much preaching. Sorry.

    Comment by Ryan — August 29, 2005 8:28 pm #comment-1179

  38. thank you very much guys, it helped a lot while looking for “related tags hell”

    🙂

    our site will be online in a few days.

    Comment by sathia — December 6, 2005 4:28 pm #comment-1879

  39. […] En el post se ponen a prueba los cuatro enfoques de modelado de sistemas de tagging que ya propuso en su anterior post “Tags: database schemas“. Los cuatro enfoques propuestos son: […]

    Pingback by mildiez.net » Archivo de bit�cora » C�mo modelar un sistema de tags — December 8, 2005 5:54 pm #comment-1889

  40. […] Quando ho un po’ di tempo libero, mi diletto a scrivere e provare alcune cose che riguardano il web, ultimamente ajax In questi giorni sto buttando giù una semplice Image Gallery che avesse tra le feature un sistema di tagging per le immagini (assegnare ad ogni immagine alcune parole chiave). Per far ciò ho esaminato alcune soluzioni di Database Schema e ho deciso di implementare la “Toxi Solution“, con 3 tabelle separate: una per le foto, una per i tag e una che fa da collegamento a queste due tabelle. A questo punto le query per la ricerca delle foto venivano suggerite, resta da effettuare una query per avere tutti i tag con relativo conteggio, per avere una Tag Cloud. Leggendo questo articolo su html.it (con codice per ColdFusion) e quest’altro articolo di Augusto Murri (con codice per PHP), ho seguito le indicazioni per permettere la visualizzazione di diverse grandezze di carattere in base all’utilizzo del tag. Arrivando al punto, l’unico problema è che li non viene utilizzata la sopracitata “Toxi Solution”, quindi c’è bisogno di modificare la query. Ecco il codice relativo: PLAIN TEXT PHP: […]

    Pingback by napyfab:blog » Programmare una Tag Cloud in PHP (Toxi-solution) — March 7, 2006 12:20 am #comment-2330

  41. Why would you have the fields tagmap.id and tag.tag_id in the tables? I would think that in tagmap table the bookmark_id and the tag_id would serve as a primary key, and in the tag table the name of the tag would work just fine as a primary key?

    Comment by Torkil — March 9, 2006 10:26 am #comment-2346

  42. Torkil: tagmap.id is superflous, you are right (this has already been suggested in one of the numerous comments). I wouldn’t leave out the tag_id column. Call me old-fashioned but I don’t like string fields as foreign keys. Then, if you take the name of the tag as primary key then you don’t need that table, do you? Then you’d end up in the Scuttle solution.

    Comment by Philipp Keller — March 9, 2006 10:56 am #comment-2347

  43. […] J’ai trouvé un article sur le blog de Philipp Keller qui parle du schéma de base de données utilisé par del.icio.us. Quelques notes : […]

    Pingback by chris.is-a-geek.net » Blog Archive » Schémas de bases de données pour la gestion de tags — March 16, 2006 7:50 pm #comment-2397

  44. Hi,
    the source code and the relational schema of connotea (a clone of del.icio.us for academics) is available at http://www.connotea.org/code.

    Pierre

    Comment by Pierre — March 22, 2006 9:49 pm #comment-2443

  45. Pierre: Thanks for the link!
    I quickly skimmed the database schema and they have separate tables for bookmarks, tags and users (as well as annotations, comments, etc.)
    Then there is a bookmark_user table and finally a user_bookmark_tag table that joins table `tag` with table `user_bookmark`. They use InnoDB and have really nice schema. Here’s the SQL of their user_bookmark_tag table:

    DROP TABLE IF EXISTS `user_bookmark_tag`;
    CREATE TABLE `user_bookmark_tag` (
      `user_bookmark_tag_id` int(7) unsigned NOT NULL auto_increment,
      `user_bookmark` int(7) unsigned NOT NULL default '0',
      `tag` int(7) unsigned NOT NULL default '0',
      `created` datetime NOT NULL default '0000-00-00 00:00:00',
      PRIMARY KEY  (`user_bookmark_tag_id`),
      UNIQUE KEY `user_bookmark_tag_idx` (`user_bookmark`,`tag`),
      KEY `user_bookmark_idx` (`user_bookmark`),
      KEY `tag_idx` (`tag`),
      CONSTRAINT `user_bookmark_tag_user_bookmark_fk` FOREIGN KEY (`user_bookmark`) REFERENCES `user_bookmark` (`user_bookmark_id`),
      CONSTRAINT `user_bookmark_tag_tag_fk` FOREIGN KEY (`tag`) REFERENCES `tag` (`tag_id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
    

    Comment by Philipp Keller — March 23, 2006 8:45 am #comment-2448

  46. It’s true LIKE “%search%” in the MySQLicious solution will return websearch too, but if you change it to LIKE “% search %” if space separated or LIKE “%,search,%” if comma separated, you don’t have this problem.

    Furthermore, the Toxi solution enable you to rename and delete tags easily (if you store tags separately for every user, else it may be a bit more complicated), while MySQLicious may have perfomance problems with this. Counting how many times a tag is used, tag suggestion (!), displaying tags for different users, popular tags… is easier too with the last solution. Scuttle is here always in between the other two solutions. It isn’t as slow as MySQLicious, but it’s not as fast as Toxi too.
    You could say MySQLicious operates from in the bookmarks: it does everything with regard to the bookmarks. Toxi sees the tags as separate entities: you can do things with your bookmarks, but you can also do operations with your tags (CRUD: not only Create, but also Retrieve, Update and Delete). Scuttle is somewhere in between. I think that solution is not really good, you can’t rename and delete tag like in Toxi, and you haven’t got the performance when displaying bookmarks only like in MySQLicious.
    Personally, I would use Toxi, the last one. Probably it will scale the best, especially when you want the user to be able to search by tag, edit and delete them, and if you want tag suggestion. Because tags are in a separated table, you can approach them as separated objects, independent of the bookmarks.

    About del.icio.us: probably they use the last solution, although their rename and delete tag functions has been experimental for ages (maybe they use the third solution with tagmap mapping the bookmark_id and tag_id, and this needs a user_id too). However, the tag suggestion doesn’t load slow, so probably they use something like the third solution.

    Comment by JW — April 17, 2006 6:52 pm #comment-2667

  47. I just wanted to thank you for taking the time to pull all of this information together in one easy to use resource. Based off of the work you’ve done to compare various systems I decided to use a modified Toxi system with a heirarcial structure built-in to organize photos and url’s. I’m still tweaking some of the display pieces, but the database schema works great for my new oceancity.shownbyphotos.com site. Cheers!

    Comment by r2 — August 1, 2006 1:12 pm #comment-5416

  48. […] —- UPDATE —- I ran across http://www.pui.ch/phred/archives/2005/04/tags-database-schemas.html which has essentially solved my initial question with a question of ‘which method or DB schema should I use’. I’m leaning more toward the Toxi solution. Seeing those schemas in some relation to delicious is calming. I’ll se how it works out for me. […]

    Pingback by Josh Houghtelin » Blog Archive » Help implementing tagging on PD. — August 1, 2006 11:29 pm #comment-5425

  49. […] Update: OK–here’s a bit on all that. I’m still processing. Actually, some of this should be obvious from my own “Invitation Managment System.” Whatever. Moving on. […]

    Pingback by nillazilla.com » Blog Archive » Reaching 3NF with tags — August 2, 2006 3:23 am #comment-5434

  50. Database Schema for Tagging

    Every good database driven website needs a solid, scalable database backend. With bleeek.com I’ve decided to implement a tagging system for users to label their photos. Tagging is nothing new anymore and so the question is how are sites like del.icio….

    Trackback by drupal — September 9, 2006 11:43 am #comment-7277

  51. […] There were a few approaches available on the net  but no one ever gave so many details about a possible implementation. […]

    Pingback by Any folksonomies around? at NoosFactory — October 9, 2006 4:30 pm #comment-11239

  52. […] I decided to add tagging to a web app that I’m working on in my spare time and wondered after the best way to do it in the database. I was curious if current services that use tags (flickr, del.icio.us, etc) did it using a normalized or denormalized schema. The first article I came across was “Tags: Database schemas” by Philipp Keller which gave three example schemas and expounded on thier different pros and cons. I quickly realized as I was reading that the answer depends greatly on the question (as is usually the case) and how you want to implement tagging is the first thing to figure out. Putting the how aside I continued reading the article and the great comments. One of the commenter’s was Nitin Borwankar who mentioned his blog focused solely on tagging schemas for the new breed of “folksonomy” based tools. I immediately read the archive from start to finish and added his RSS feed to my GReader. The main thing that I got out of his blog was that if your going to do tagging you might as well do it right and treat tags as full-fledged schema elements. […]

    Pingback by dev|sushi » Thinking about database schemas — October 14, 2006 6:52 pm #comment-11939

  53. […] The best resource that I’ve found on the subject is this plog post on the different database schemas .  I’m also quite impressed with the author’s breakdown of performance times for tagging datastructures as the number of tags increased. For most of the applications that I build, there would be a minimal number of tags as they will be managed from one central administration (unlike Flickr or Delicious where users can define their own tags). […]

    Pingback by That’s What She Said — October 16, 2006 1:44 am #comment-12083

  54. Hi guys… I was really interested by the topics…
    but also a bit dispointed because I have a TOXI based structure but i do the queries a COMPLETELY DIFFERENT WAY…
    Maybe i’m totally wrong about performance or maybe right?
    I don’t know (i’m actually dealing with a very little database and just finished to write the piece of code for building the query which made be a little lazy…)!!!
    The only thing i know is that it’s REALLY SIMPLE to generate a really complex query like (del.icio.us|delicious)+(semweb|semantic_web)-search and that was my only goal!

    I WILL GREATLY APPRECIATE ANY COMMENT ABOUT THE PERFORMANCE OF THE QUERIES BELOW!

    PS : It’s possible that i missed something in the queries because i rewrote them from memory but if I did, you should be able to correct them, if not, you can write me at dominique.hamet@hotmail.com

    Intersection (AND)
    Query for “bookmark+webservice+semweb”
    assuming ids are 1,2,3
    SELECT b.*
    FROM tagmap bt0,tagmap bt1,tagmap bt2, bookmark b, tag t
    WHERE (b.id = bt0.bookmark_id)
    AND (bt0.tag_id = t.tag_id)
    AND (bt0.tag_id=bt1.tag_id) AND (bt1.tag_id=bt2.tag_id)
    AND ((bt0.tag_id=1) AND (bt1.tag_id=2) AND (bt2.tag_id=3))

    Union (OR)
    Query for “bookmark|webservice|semweb”
    assuming ids are 1,2,3
    SELECT b.*
    FROM tagmap bt0,tagmap bt1,tagmap bt2, bookmark b, tag t
    WHERE (b.id = bt0.bookmark_id)
    AND (bt0.tag_id = t.tag_id)
    AND (bt0.tag_id=bt1.tag_id) AND (bt1.tag_id=bt2.tag_id)
    AND ((bt0.tag_id=1) OR (bt1.tag_id=2) OR (bt2.tag_id=3))

    Minus (Exclusion)
    Query for “bookmark+webservice-semweb”, that is: bookmark AND webservice AND NOT semweb.
    assuming ids are 1,2,3
    SELECT b.*
    FROM tagmap bt0,tagmap bt1,tagmap bt2, bookmark b, tag t
    WHERE (b.id = bt0.bookmark_id)
    AND (bt0.tag_id = t.tag_id)
    AND (bt0.tag_id=bt1.tag_id) AND (bt1.tag_id=bt2.tag_id)
    AND ((bt0.tag_id=1) AND (bt1.tag_id=2) AND (bt3.tag_id3))

    Comment by HAMET Dominique — October 16, 2006 10:11 pm #comment-12143

  55. If you want to show the tags as well in on field, you can try this one!

    SELECT b.*,GROUP_CONCAT(t.NAME) AS tags FROM tagmap bt, bookmark b, tag t WHERE bt.tag_id = t.tag_id AND (t.name IN (‘bookmark’, ‘webservice’, ’semweb’)) AND b.id = bt.bookmark_id GROUP BY b.id HAVING COUNT( b.id )=3

    I used group_concat

    Comment by Matthijs — October 24, 2006 2:06 pm #comment-12837

  56. One problem I can’t get around when I am using the toxi structure is to display everything except a certain tag. For example if I want to find NOT “cats”. The problem is to get it working when there are entries that doesn’t have any tags at all, since the search will exclude them when requiring a link to the tag bookmark table (since the link doesn’t exist). Has anyone else had this problem and found a solution?

    Comment by Jim — October 30, 2006 11:10 am #comment-13357

  57. In the toxi solution, why doesn’t the ‘tagmap’ table use a composite primary key?

    I believe this would provide the key uniqueness required while preventing an additional index being internally created and maintained for the ‘id’ column.

    Comment by Zeeshan — November 2, 2006 1:03 pm #comment-13652

  58. I previously looked into the methods of creating a tag database, there was very little material to cover it.

    Well done on creating this article.

    Comment by HM2K — November 8, 2006 4:16 pm #comment-14213

  59. Using the toxi solution, if I wanted to query for “bookmark+webservice|semweb” how would I accomplish it?

    I am no SQL expert but it seems like you must us either AND or OR without mixing them.

    Comment by Dustin — March 23, 2007 6:35 am #comment-43754

  60. Solved my own problem. I justed created two separate statements, one for AND and one for OR, and UNIONed them together. By the way, great article!

    Comment by Dustin — March 24, 2007 2:54 am #comment-43940

  61. Awesome article, been looking for something like this. I had sketched my tag solution like Toxi’s but wanted to see what others use… your breakdown is great!

    Comment by phugit — May 29, 2007 11:41 pm #comment-57952

  62. Hi, nice post!

    Did you find out which system is used by the del.icio.us? I mean – do they use RDBMS or not?

    Comment by Koudesnik — June 13, 2007 10:07 pm #comment-63209

  63. Koudesnik: See Using a non-RDBMS system: I guess del.icio.us has still some parts stored in RDBMS but others not.
    What astonishes me is that their search is still very slow. That looks for me that their not yet using lucene indexes for all parts.

    Comment by Philipp Keller — June 14, 2007 7:14 am #comment-63283

  64. Great article. Perfect for me. Sure I am building something 2 years after you wrote the article – but still perfect 🙂

    Just a comment on Tag Orphans – this is only a problem with some RDBMSs – if you use proper forign key solutions they will automatically get removed – which is clearly desirable. I won’t get into which SQL service here as I think they all have their advantages and disadvantages and I don’t want to start a debate on that. But I will post my SCHEMA once it is done – and if lucky, work will allow me to open source the library.

    Thanks

    Scott

    Comment by Scott Penrose — June 27, 2007 10:48 pm #comment-67308

  65. hi,

    What do u think about organizing table structure in star-schema?

    i would consider building a star-schema and creating a cube of that however a cube that is organized for query speed…size would be….well lets just say…BIG 🙂 🙂 🙂

    regards,
    dado

    Comment by dado — July 6, 2007 5:08 pm #comment-69989

  66. The best solution in my opinion is to Toxi / WordPress solution on a PostGreSQL database. That way, you have a fairly normalized structure that is easily maintainable and clear to view. To solve the ophaned-element problem, you have triggers / stored procedures at your fingertips. I have always preferred PostGreSQL over MySQL because working with MySQL always seemed a bit like working with half of a product really. The support for it may be embedded in every weblanguage available, but compiling PHP from source with –with-pgsql=”" (recommended) or editing your extensions.ini isn’t really a reason to stick with MySQL. PostGreSQL is superior to MySQL in my opinion because it supports ALL of the features expected from a production-grade database system, whereas MySQL only supports those that are commonly used by amateur webapps. Proof: http://monstera.man.poznan.pl/wiki/index.php/Mysql_vs_postgres.

    Good day 🙂

    Comment by James — July 21, 2007 1:59 am #comment-73490

  67. Dado: I’m not into star schema/fact tables, but Nitin wrote about this.

    Comment by Philipp Keller — July 22, 2007 3:49 pm #comment-73761

  68. The main problem of the MySQLicious is not the using of Like, this is not the performance killer because it is quite fast.
    The problem is using the %.
    if you put the % at the end mysql can use the index to search for the right data, as in a telephonebook you know a name begins with “Mill” so you can use the index and jump “directly” to the namens beginning with “Mill”.
    But know you want to search all persons whos name end with “ller” (we are searching for Miller). Try to use the index, hmm, you can’t? correct!
    And thats the Problem of using % at the beginning of an String in a Like-Query, it cannot use the index and so it will be veeeery slow, if the data is increasing.

    Comment by Tobias Petry — July 27, 2007 3:13 pm #comment-74697

  69. Thank you for the post. I was looking for the information about database schemas for tagging. The post and discussion cover everything I need!

    By the way I think you might be interested in the approach described here (it is too serious for my needs but may be a good solution for big web sites): http://scalablecorner.blogspot.com/2007/06/implementing-tag-cloud-nasty-way-part-2.html

    Comment by MikeTM — September 2, 2007 5:42 pm #comment-81676

  70. Really useful post and comments, read all of them thoroughly and it’s helped me get my head around some of the key complexities in building a tag-based architecture.

    Good work.

    Comment by Toby Skinner — September 5, 2007 2:09 am #comment-82182

  71. Excellent Article, Thanks a lot!

    Comment by Daniel — September 17, 2007 7:53 pm #comment-85243

  72. Some people have suggested that the Toxi solution could be used with duplicate tags in the tag table in order to simplify inserts and deletes, wouldnt this mean it was effectively a scuttle solution?

    Comment by Gordo — November 30, 2007 9:53 am #comment-101283

  73. Has anyone found a quick way of getting a list of related tags for a set of tags using the “Toxi” schema.

    For example, say I wanted to find all the tags that appeared in bookmarks which were tagged with ‘webservice’ and ‘bookmark’. Ideally it would be good to get the related tags and the count of how many bookmarks have that related tag. Does anyone have any ideas on an efficient way of getting this related tag data?

    Comment by noodles — January 27, 2008 8:00 pm #comment-107520

  74. Thank you, this article helps me a lot!

    Comment by Mefi — June 4, 2008 4:29 pm #comment-129546

  75. Thank you for this article ! very useful

    Comment by Mike — November 4, 2008 10:44 am #comment-130014

  76. Thank you for your time! Your article helped me to quickly create what I wanted.

    All the best!

    Comment by Joao — November 5, 2008 6:12 pm #comment-130016

  77. thanks, this article help me to clarify the things.

    Comment by Marco — January 10, 2009 3:23 am #comment-130067

  78. how to store facets or faceted data in a database with example can any body try and provide?

    Comment by sai — March 5, 2009 12:18 pm #comment-130322

  79. Thanks for your very interesting and useful blog posts, Philipp. This one in particular was of great help to me in figuring out how to make a tag system, which I’ve been using in multiple projects. My tag system was based on the “Toxi” solution.

    I recently released my own free, open source, public domain web links organizer program – Astroblahhh Links Organizer, a component of my incomplete software package Astroblahhh Desktop:

    http://astroblahhh.com/software/abdesktop

    I would welcome any feedback if you’re ever in the mood to try it out.

    Quote by sai:
    “how to store facets or faceted data in a database with example can any body try and provide?”

    I’m not sure exactly what faceted data is, but, judging by the Wikipedia article on faceted search, the concept kind of reminds me of Astroblahhh Desktop a bit, because of its filtering capabilities. Maybe my program will be of interest to you.

    Comment by Apollia — March 7, 2009 12:41 am #comment-130336

  80. here is another idea.

    one table, 2 columns

    Then make a 2 column primary key (to disallow duplicates)

    so…. select * from tag_table

    URL | TAG
    http://www.wikipedia.org | encyclopedia
    http://www.wikipedia.org | reference
    http://www.wikipedia.org | awesome

    but how do I right the queries (intersection, etc)

    thanks

    Comment by syssyphus — May 18, 2009 4:13 pm #comment-130682

  81. Thank you very much, your article helped me a lot!

    Comment by Hua Chen — August 23, 2009 12:55 pm #comment-130767

  82. In the “toxi” solution, why is there an ID column on the “tagmap” table? I don’t really see why that’s needed – or am I missing something?

    Comment by Nick Gilbert — September 25, 2009 10:52 am #comment-130774

  83. You have help me build this site! http://www.TagThis.com thanks for the article! TagThis is a tag based search and suggestion engine that gives you tailor made results. it is still in its infancy. would greatly appreciate any feedback if you have time to check it out! cheers

    Comment by Marwan — September 27, 2009 6:13 pm #comment-130775

  84. I was checking Google to see if there were any better solutions than an intersection table (Toxi), and found this. Good article, if not precisely what I was looking for.

    One thing I did notice that you might care about is your disadvantage for Toxi’s schema;
    “When altering or deleting bookmarks you can end up with tag-orphans.”
    This isn’t the case if you use foreign keys (which Phillip Keller implements above), and if you add ON DELETE CASCADE to your reference clause any deletion of tags or items will also delete references to it in the intersection table- for better or worse.

    Comment by Tozetre — September 30, 2009 6:49 pm #comment-130776

  85. this is very helpful – concise and well written.

    THANKS!

    Comment by turd — October 8, 2009 12:32 am #comment-130778

  86. Some good discussion and sharing of ideas in these comments.

    I never even considered anything other than a relational 3 table schema. The system will be 99% selects so I suppose that’s best anyway:
    I’m making an e-commerce site.. Tags would only be added when new products were added (no user contributions).

    I’m considering using tags rather than categories because the categories aren’t forming a nice hierarchy.
    Now I’m going to look into hierarchical tags systems.

    Comment by Hamish — March 18, 2010 1:43 am #comment-133603

  87. Just been reading this good introduction to faceted classification:

    http://www.miskatonic.org/library/facet-web-howto.html

    I hope someone else finds it useful.

    Comment by Hamish — March 18, 2010 2:25 am #comment-133604

  88. Here’s how one might do this with Postgres:
    The REFERENCES on TMAP address the “orphans” issues when deleting records.

    Big Upz to the MySQL wiki for the tips – ironic huh?
    http://forge.mysql.com/wiki/TagSchema#Toxi

    CREATE TABLE links (
    l_id SERIAL,
    title varchar(55),
    url varchar(255),
    description varchar(300),
    CONSTRAINT “links_pkey” PRIMARY KEY(“l_id”)
    );

    CREATE TABLE tmap (
    l_id integer REFERENCES links ON DELETE CASCADE,
    t_id integer REFERENCES tags,
    PRIMARY KEY (l_id, t_id)
    );

    CREATE INDEX tag_id_idx ON tmap (t_id);

    CREATE TABLE tags (
    t_id SERIAL,
    tag varchar(30) UNIQUE,
    CONSTRAINT “tags_pkey” PRIMARY KEY(“t_id”)
    );

    /////////////////////////////////

    // grab data for a TAG cloud
    SELECT tag, COUNT(*) as count
    FROM tmap i2t
    INNER JOIN tags t
    ON i2t.t_id = t.t_id
    GROUP BY tag;

    // grab data for links tagged with any of the 3 tags
    SELECT tmap.l_id
    FROM tmap
    INNER JOIN tags t
    ON tmap.t_id = t.t_id
    WHERE t.tag IN (‘transcripts’,’registration’,’courses’)
    GROUP BY tmap.l_id;

    // grab data for links tagged with all three tags being searched for
    SELECT i2t3.l_id
    FROM tags t1 CROSS JOIN tags t2 CROSS JOIN tags t3
    INNER JOIN tmap i2t1
    ON t1.t_id = i2t1.t_id
    INNER JOIN tmap i2t2
    ON i2t1.l_id = i2t2.l_id AND i2t2.t_id = t2.t_id
    INNER JOIN tmap i2t3
    ON i2t2.l_id = i2t3.l_id AND i2t3.t_id = t3.t_id
    WHERE t1.tag = ‘transcripts’
    AND t2.tag = ‘registration’
    AND t3.tag = ‘courses’;

    Comment by KodeZilla — March 26, 2010 8:13 pm #comment-133608

  89. This is by far the best article on the world wide web that not only illustrates in clear concise language how to implement tags three different ways but also gives real world easy to follow examples and illustrations. My hat’s off to you sir. You deserve all the accolades endowed to you by your readers.

    Comment by Jules Manson — May 21, 2010 8:01 am #comment-133627

  90. Thanks a ton, searching for this..

    Comment by Sahan — June 30, 2010 10:40 am #comment-133638

  91. Very Nice post, quiet informative.
    Database designing problems being the buzzwords these days, one of the most common designing mistakes is the “Poor Planning / Architecture” of the database or mart. Follow the link to know more…
    http://www.sqllion.com/2010/08/database-design-and-modeling-i/

    Comment by SQL Lion — August 3, 2010 7:36 am #comment-133657

  92. What if The primary index on the Tags table was the tag itself? So it would look like this

    Tags
    —-
    Tag
    PostID

    Posts
    —-
    PostID
    Post

    QUERY
    —–
    SELECT * FROM Posts WHERE PostID IN (SELECT PostID FROM Tags WHERE Tag = ‘$tag1′ OR Tag = ‘$tag2′)

    Comment by Sean Macdonald — August 6, 2010 11:34 am #comment-133658

  93. What is “semweb” and “webservice”?

    Comment by Sam — December 22, 2010 11:24 pm #comment-133929

  94. @Sam: Semweb = Semantic Web, Webservice see Wikipedia

    Comment by Philipp Keller — December 25, 2010 1:09 pm #comment-133963

  95. This is by far the best article on the world wide web that not only illustrates in clear concise language how to implement tags three different ways but also gives real world easy to follow examples and illustrations. My hat’s off to you sir. You deserve all the accolades endowed to you by your readers.

    Comment by altin çilek form seti — January 26, 2011 6:40 pm #comment-134366

  96. In the intersection for toxi solution i got following error : what should i do

    Msg 8120, Level 16, State 1, Line 1
    Column ‘item.id’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

    sql:
    SELECT b.*
    FROM tagmap bt, bookmark b, tag t
    WHERE bt.tag_id = t.tag_id
    AND (t.name IN (‘bookmark’, ‘webservice’, ’semweb’))
    AND b.id = bt.bookmark_id
    GROUP BY b.id
    HAVING COUNT( b.id )=3

    Comment by Dhiraj — February 17, 2011 6:47 am #comment-134377

  97. @Dhiraj: I don’t see any token “item” in your query – did you post the query you actually tried?

    Comment by Philipp Keller — February 17, 2011 11:34 am #comment-134378

  98. Does anyone know the database schema of any b2b online marketing websites (like indiamart.com,toocle.com,tradeindia.com)

    Comment by arjun — March 13, 2011 3:55 am #comment-134390

  99. I have visited this very informative post many times, I have used mainly the scuttle design without even knowing what its name was, but in some other more controled environments I used the toxi solution. For a structured (admin controled tag solution I would go for toxi, but for a any other solution I still consider scuttle a suitable one

    Comment by Moklet — April 23, 2011 3:04 pm #comment-134397

RSS feed for comments on this post.

Leave a comment

Name (required)

Mail (will not be published) (required)

Website

This page and it’s content is licenced under creative commons

发表评论

电子邮件地址不会被公开。 必填项已用*标注