The SCO Group's list of "infringing" files -- How Did They Come Up With This List?


As seen in my article on Groklaw.

In IBM's Reply Memorandum in Support of their (First) Motion to Compel Discovery (text here), IBM includes SCO's Supplemental Responses to IBM's First Set of Interrogatories (text here) and tells the Judge that SCO is still not answering their questions. One of the responses SCO provided was a list of files that may or may not be infringing, according to SCO. Why might IBM view the list as inadequate? To someone without a programming background, it might be hard to know.

A closer look by a computer programmer, with English translation for nonprogrammers, may give a clearer picture of why SCO's responses were neither "responsive nor identified with meaningful particularity", according to IBM. It also reveals the likely method SCO used to draw up the list, which bears on SCO's earlier claims that it had three groups of analysts, including the MIT mathematicians, analyzing the code.

SCO's response includes five lists from several categories:

A list of "source code files identified by SCO thus far ... part of which include information (including methods) that IBM was required to maintain as confidential or proprietary...and/or which constitute trade secrets misused by IBM..." It's a list of 115 files.
A list of "source code files identified by SCO thus far...which may...include information (including methods) that IBM was required to maintain as confidential or proprietary...and/or which constitute trade secrets misused by IBM..." It's a list of 591 files.
A list of people at IBM that SCO claims to be aware of "in which part of the confidential or proprietary and/or trade secrets [were] known or [have] been disclosed." There are 5 lists of people whose names appear in the Linux code base, adding up to about 74 people.
A list of IBM copyrights. This is a list of 22 names.
A list of people who "likely have knowledge, although their names do not appear in the Linux code base." It's a list of 62 names.

First, a little background on Linux/Unix utilities and tools, then we will examine each of these lists, how they may have been created, and what (if anything) they mean. We conclude with some general comments.

Background

Linux/Unix Terms and Utilities

There are a number of useful utilities in Linux/Unix. Because we will be using some of them in our discussion, we'll briefly mention a few before moving on:

One utility is called grep, and it is a utility designed to search inside a file (or files) for lines containing a certain pattern. In its simplest form, it is usually used like this: 'grep string filename', but it also accepts numerous flags (options) to allow it to perform various functions. When calling grep as egrep, extended pattern matches are enabled. Here, we will use grep to quickly find files containing strings that we are interested in.

Another commonly used utility is find, which is used to search a directory for files having certain properties, such as a specific name or pattern. Here, it will be used to locate files that we are interested in searching the contents of.

sort does just what it says; it sorts a list of strings. It can also be used with the -u option (unique) to remove duplicate lines.

cat is used to type out the contents of files, and is very similar to type under DOS/Windows.

xargs is used to execute commands on the output of a previous command. We will be using it to reprocess the output of find commands and the output of other utilities.
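To see how these utilities fit together, here is a minimal sketch that runs them against a small throwaway directory. The directory layout, file names, and file contents are invented purely for illustration; only the utilities themselves come from the discussion above.

```shell
# Build a small throwaway tree to search (names and contents are made up).
demo=$(mktemp -d)
mkdir -p "$demo/kernel" "$demo/docs"
printf 'int smp_enabled;\n' > "$demo/kernel/a.c"
printf 'int nothing_here;\n' > "$demo/kernel/b.c"
printf 'SMP notes\n'         > "$demo/docs/notes.txt"

# find:    list the files whose names end in .c
# xargs:   hand those file names to grep as arguments
# grep -il: print the names of files containing "smp", ignoring case
# sort -u: sort the names and drop any duplicates
matches=$(find "$demo" -type f -name '*.c' | xargs grep -il 'smp' | sort -u)
echo "$matches"
```

Only the .c file that mentions "smp" is reported; notes.txt is never searched because find did not pass it along.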

Computer and Operating System Terms and Concepts

SMP

Symmetric Multiprocessing is a method for using more than one processor (CPU) in a computer. There are a number of ways in which multiple processors can be used: One processor could be the boss and instruct the other processors what to do, or all the processors can take turns being the controlling processor and all perform work as well, etc. With SMP, there is no manager/worker split. All processors have equal status.
SMP isn't unique to SCO, IBM, or anybody in particular. It's just a well-known method for keeping multiple processors busy. SMP distributes tasks across all processors. The processors take turns distributing the workload, and they share machine resources. With SMP, often the challenge is making sure that the work that one processor does is independent of the other processors. Problems occur when more than one processor attempts to manage a resource (disk, memory, screen, etc.), and this can lead to system instability. To prevent problems, resources are "locked" by a process while changes are made.
In SCO's List 1 (the list of "definitely infringing" files), they state that "the methods include technical UNIX categories, such as multi-processor locking and unlocking methods, methods for avoiding locking requirements, ...". SMP resource locking is what they are referring to.

JFS

JFS stands for IBM's Journaled Filesystem. A journaled filesystem is one in which every write to the filesystem is written to a special logfile before actually being done. The filesystem is then updated, and the journal is updated with a "Done" or "Completed" status.
Journaling filesystems are used because when a filesystem transaction (such as writing your file to the disk) occurs, there is a slight chance that something bad (like a power failure or stray tachyon beams from passing Starfleet vessels) may happen. If the failure occurs just at the right time, large portions of the filesystem or the file could be corrupted, leading to massive loss of data.
By recording the fact that a write will be performed, performing the write, then noting the fact that it has completed, the filesystem cannot be left in a bad state. Either the write has occurred, or it hasn't. The system can be recovered quickly by replaying the portions of the log that have not been completed. In this way, the system is protected from glitches and is more stable and reliable.
SCO's list includes "...methods for implementing filing systems..." which is just what JFS does. Journaled filesystems are becoming a more common way of implementing filesystems, and a number of journaled filesystems are available for Linux, including JFS, XFS (from SGI), and ext3 (extensions on the common Linux ext2 filesystem).

RCU

RCU stands for Read-Copy Update, and is a method for sharing data between multiple processes. Since a resource must be locked before use in an SMP or multithreaded system, it is possible for an entire system to have to wait for that resource, and it's also possible for two processors to lock up waiting for each other. An example of this could occur when one processor has locked a section of memory and is waiting to lock the disk, and another processor has locked the disk and is waiting to lock the memory. This is called deadlock, and is often solved by locking everything necessary up-front. By locking everything, the system won't deadlock, but may have to wait for these resources to become available again, resulting in wasted processing time.
One way to improve the reliability and speed of a system like this is to come up with methods to avoid locking the system resources and to improve sharing the resources. This is what RCU is designed to do. It attempts to reduce the chance of deadlock, and can result in great speed benefits when the system does mainly reads. RCU is a method designed by Sequent for DYNIX/ptx, but has been used in a number of systems. Here is a resource for additional information. This is part of what SCO refers to when they say "methods for avoiding locking requirements."

NUMA

NUMA stands for Non-Uniform Memory Architecture, and comes into play when using an SMP system where some of the memory is faster to access from one processor than another. Most SMP systems don't have this situation. NUMA support takes into consideration this architecture when using the memory, resulting in a faster overall system.
NUMA support is very similar to using web mirror sites. With mirror sites, servers are placed around the world, and everyone is encouraged to visit a site "near" where they are accessing the internet. In this way, everyone sees a speedup because there is less contention for the main site, and everyone is accessing data "closer" (faster) to them. Here is a source of additional information about NUMA.

SCO's lists of files

Let's start with List 2: The list of "source code files identified by SCO thus far...which may...include information (including methods) that IBM was required to maintain as confidential or proprietary...and/or which constitute trade secrets misused by IBM..." This is a list of 591 files.

This list contains 591 files from Linux, but SCO fails to mention which kernel version they come from, saying only that they're from 2.4 and/or 2.5 kernels. As IBM correctly points out, "This is no small problem since there are 75 different releases of the Linux kernel 2.5 alone." SCO also says that they do not claim the entire source code found in those files, but that this information is interspersed in the 330,000 lines of code those files contain.

IBM also points out that since it is Unix code (SVRx) that SCO claims was misappropriated, pointing to the Linux source code does not really answer their question, which is from where the trade secrets were misappropriated. SCO passes this argument off by saying that they have not completed discovery, and that since IBM hasn't given them everything they've asked for, they don't know exactly where it came from.

Because SCO is claiming that it is IBM's trade secrets that were misappropriated, they don't have the trade secrets yet themselves. In other words, they need IBM to reveal more information. The question becomes "Why does SCO believe that this list contains their trade secrets if they don't know the trade secrets and need IBM to point them out?"

In attempts to answer this, a number of discussions have occurred, here on Groklaw, on the Linux Kernel Mailing List, and elsewhere. Here on Groklaw, Lev managed to narrow the Linux kernel version down to either 2.5.68 or 2.5.69. Many people were quick to point out that most files on the list contained one or more strings that SCO likes to claim as theirs: SMP, JFS, RCU, and NUMA.

By using the appropriate utilities, it is possible to reproduce SCO's list (number 2) without any manual investigation of the contents of any of those files. A sorted (and cleaned up) copy of SCO's list number 2 is located here for reference. While this solution is certainly not the only one, and is probably not optimal, it is the one that the author managed to construct:

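# Whole-word, case-insensitive search of every .c and .h file for smp, rcu, or numa
egrep -wilr --include "*.[ch]" 'smp|rcu|numa' * > /tmp/output1

# Append every .c and .h file under fs/jfs, whatever its contents
find fs/jfs -type f -path "*.[ch]" >> /tmp/output1

# Drop unwanted architectures and subsystems, then sort and remove duplicates
egrep -v 'alpha|parisc|sparc|sound|drivers' /tmp/output1 \
   | sort -u > /tmp/SCOFiles-list2.output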

This may look like quite a mess, but it can be deconstructed into manageable pieces. Each command saves its results in a file, using > to create the file or >> to add to the end of it. The last command also strings two commands together using the |, or pipe. This means that the results of one command are used as input to the next command.

Picking apart these lines, first I found all files with a filename ending in .c or .h (C source code and header files). I searched the contents of these files for any of the strings 'smp', 'rcu', or 'numa' as whole words (without caring about upper- or lower-case). I placed the names of these matching files into the file /tmp/output1. Next, I included all the JFS filesystem code (.c or .h filenames). The results were appended to /tmp/output1. Finally, I searched the /tmp/output1 file and removed all file names referring to alpha, parisc, or sparc (essentially Sun and HP). References to driver files and sound were then also removed.
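For anyone who wants to watch the three steps work without downloading a kernel, here is a small sketch that runs the same kind of pipeline against a mock source tree. Every path and file body in it is invented for illustration, and it uses grep -E, which behaves the same as egrep. It is not a copy of what SCO ran, only a demonstration of what each step keeps and discards.

```shell
# Mock kernel tree; every path and file body below is invented for illustration.
tree=$(mktemp -d)
out=$(mktemp -d)          # results go in a separate directory
cd "$tree"
mkdir -p kernel fs/jfs arch/sparc drivers/net
printf '/* SMP support */\n'  > kernel/sched.c      # whole word "SMP": kept
printf 'int smplock;\n'       > kernel/misc.c       # "smplock" is not the word "smp"
printf 'int x;\n'             > fs/jfs/jfs_inode.c  # no keyword, added by the find step
printf '/* NUMA node */\n'    > arch/sparc/mm.c     # dropped by the "sparc" filter
printf '/* rcu callback */\n' > drivers/net/foo.c   # dropped by the "drivers" filter
printf 'SMP\n'                > kernel/README       # not a .c or .h file

# Step 1: whole-word, case-insensitive keyword search in .c/.h files.
grep -E -wilr --include '*.[ch]' 'smp|rcu|numa' * > "$out/output1"
# Step 2: append every .c/.h file under fs/jfs.
find fs/jfs -type f -name '*.[ch]' >> "$out/output1"
# Step 3: filter out unwanted paths, then sort and de-duplicate.
grep -E -v 'alpha|parisc|sparc|sound|drivers' "$out/output1" | sort -u > "$out/list2"
cat "$out/list2"
```

Note that nobody has to open or read a single file to get the final list; the file names fall out of the keyword search and the path filter alone.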

When applying this process to the kernel versions identified by Lev, we get 3 false positives and 3 false negatives with the 2.5.68 kernel and just one false positive with the 2.5.69 kernel. As the list is otherwise identical to SCO's, I believe that SCO used the Linux 2.5.69 kernel to generate these lists.
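One way to count false positives and false negatives is to compare the two sorted lists with comm. The sketch below uses two short made-up lists as stand-ins for SCO's list and the generated one; the file names in them are placeholders chosen for illustration, not the actual differences.

```shell
work=$(mktemp -d)

# Two made-up lists standing in for SCO's list and the generated one.
printf '%s\n' fs/jfs/inode.c kernel/rcupdate.c kernel/sched.c \
    | sort > "$work/sco.txt"
printf '%s\n' fs/jfs/inode.c include/asm-h8300/smplock.h kernel/sched.c \
    | sort > "$work/ours.txt"

# comm expects sorted input.
# -13 keeps lines found only in the second file (our false positives).
# -23 keeps lines found only in the first file (our false negatives).
false_pos=$(comm -13 "$work/sco.txt" "$work/ours.txt")
false_neg=$(comm -23 "$work/sco.txt" "$work/ours.txt")
echo "false positives: $false_pos"
echo "false negatives: $false_neg"
```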

The false positive was include/asm-h8300/smplock.h. There may be a number of explanations for this, one of the most likely being that someone at SCO messed up, and missed a line when sending the list to the lawyers. This is, of course, presuming that the person preparing the list used a similar process, which I believe is likely.

What does this mean? Essentially, that SCO searched for any reference in the Linux kernel source for SMP, JFS, RCU, and NUMA, and claimed all of those files as possibly infringing. They also included the entire JFS source code.

A number of people have pointed out that some of the files are so trivial that they could not contain trade secrets. For example, include/asm-arm/spinlock.h contains only 6 lines, but is included in the list because it contains the string SMP (as in "we don't do SMP"):

#ifndef __ASM_SPINLOCK_H
#define __ASM_SPINLOCK_H

#error ARM architecture does not support SMP spin locks

#endif /* __ASM_SPINLOCK_H */

In providing this list to IBM, it appears that all SCO has done is to make vague claims over all of SMP, JFS, RCU, and NUMA, which is hardly news, but they have given no explanation of how they created their list of possibly infringing files. They haven't answered IBM's question at all (which relates to original SVRx code), and they look silly in the process, at least to those who understand the code and the list.

It is obvious that SCO did not spend a great deal of time or effort at answering IBM's question with valuable information. If they actually did spend time and effort to produce this list, their technical person is not extremely skilled.

List 1: A list of "source code files identified by SCO thus far ... part of which include information (including methods) that IBM was required to maintain as confidential or proprietary...and/or which constitute trade secrets misused by IBM...", the list of 115 files.

The first thing to note is that the files in this list are actually a subset of the files in List 2. For reference, a copy of SCO's list number 2 can be found here. Using our trusty Linux utilities, we can again construct a sequence of commands that produces SCO's list automatically. The following commands will produce all of SCO's files (again, 100%) with just 2 false positives:

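# Pass 1: files in List 2 that carry an obvious IBM marker
cat /tmp/SCOFiles-list2.output \
  | xargs egrep -l 'International Business Machines|ibm\.|IBM Corp' > /tmp/output1

# Pass 2: files mentioning IBM or RCU as a whole word, unless they also contain "sco"
cat /tmp/SCOFiles-list2.output \
  | xargs egrep -wl 'IBM|RCU' \
  | xargs egrep -L 'sco' >> /tmp/output1

# Merge both passes, sorted and without duplicates
sort -u /tmp/output1 > /tmp/SCOFiles-list1.output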

These commands first search (List 2) for anything that would be easily identifiable as coming from IBM, files containing "International Business Machines", "IBM Corp", or "ibm." (as could be contained in an email address like username@ibm.com). Next, any mention whatsoever of "IBM" or "RCU" is included, as long as the file does not also contain "sco".
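The same two passes can be tried on a handful of mock files. In this sketch the four files and their contents are invented for illustration, and the list is fed to xargs with input redirection rather than cat, which has the same effect.

```shell
src=$(mktemp -d)
cd "$src"

# Four invented files standing in for entries on List 2.
printf '/* Copyright (c) International Business Machines Corp. */\n' > a.c
printf '/* RCU-based lookup */\n'               > b.c
printf '/* IBM port, with thanks to sco */\n'   > c.c
printf '/* nothing relevant */\n'               > d.c
printf '%s\n' a.c b.c c.c d.c > list2

# Pass 1: files carrying an obvious IBM marker (matches a.c only).
xargs grep -E -l 'International Business Machines|ibm\.|IBM Corp' < list2 > output1

# Pass 2: whole-word IBM or RCU (b.c and c.c), then keep only the files
# that never mention "sco" (-L lists files WITHOUT a match), leaving b.c.
xargs grep -E -wl 'IBM|RCU' < list2 | xargs grep -L 'sco' >> output1

# Merge both passes.
sort -u output1 > list1
cat list1
```

The file c.c mentions IBM but is thrown out because it also contains "sco", and d.c never qualifies at all.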

Again, we do not know for certain that this is the method that SCO used to produce this list, and our commands do not produce an identical list. Even so, they demonstrate that SCO need not have spent much more time creating this list than List 2.

We are unable to determine whether someone messed up and omitted the two false positives, arch/ppc/kernel/setup.c and include/linux/list.h, or whether our search string is not sufficiently developed to produce the same list. What we do know is that this list of "definitely infringing files" is little more than files with IBM mentioned, minus files referring to SCO. IBM is asking for specifics because SCO has given no explanation of how they built their list. Also, they've avoided the question of where in SVRx these trade secrets came from, and why SCO believes they are trade secrets.

List 3: A list of people at IBM that SCO claims to be aware of "in which part of the confidential or proprietary and/or trade secrets [were] known or [have] been disclosed." This consists of 5 lists of authors, for a total of about 74 people.

In SCO's supplemental response, they identify a number of people as having disclosed proprietary information and/or trade secrets. They break down these names into "US Authors" (30), "German Authors" (24), "Australian Authors" (2), "Other" (15), and "Austin Office (JFS)" (3). We won't be going into the same detail in analyzing this section because it involves the names and email addresses of people and we have redacted this information from the text version of the document. Those curious should view SCO's filing to see examples.

Suffice it to say that these lists can be regenerated by searching the kernel source for all files containing an email address at IBM. SCO's list contains actual lines from the copyright notices contained in the Linux kernel. On more than one entry, the line also contained references to other email addresses that the person used, and at least one just ends like this: "username@vnet.ibm.com or". The next line in the kernel source file contains the alternate address.
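A search of this kind might look like the sketch below. The mock files, names, and addresses are invented, and the pattern is just one reasonable guess at how to match an ibm.com address; we do not know what pattern SCO actually used.

```shell
src=$(mktemp -d)

# Invented comment lines standing in for kernel source headers.
printf ' * Author: A. Hacker <ahacker@us.ibm.com>\n' > "$src/one.c"
printf ' * Maintainer: someone@example.org\n'        > "$src/two.c"
printf ' * B. Coder (bcoder@de.ibm.com or\n *     bcoder@example.net)\n' > "$src/three.c"

# -r: search the whole tree, -h: omit file names,
# -o: print only the text that matched, -E: extended patterns.
addrs=$(grep -rhoE '[[:alnum:]._-]+@([[:alnum:]-]+\.)*ibm\.com' "$src" | sort -u)
echo "$addrs"

# Without -o, grep prints whole lines, which is how a line ending in a
# dangling "or" can end up in a list that was pasted without editing.
lines=$(grep -rh 'ibm\.com' "$src")
echo "$lines"
```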

This list is fairly easy to generate, but does require a bit more manual intervention than most of the others. Since some people have contributed using multiple names (such as Pat and Patrick), someone has manually merged these names together. It was done sloppily, though, since there are other IBM-related email addresses in the source code which are not mentioned (for example {fred|bob}@ibm.com).

Here, SCO is telling IBM that they believe that every contribution from IBM is tainted, but they'll need all the source code ever written from IBM in order to prove it. I have serious doubts that everyone that ever contributed to Linux from IBM has done so under such suspicious circumstances (I have serious doubts that _any_ contributions are tainted in this way). At any rate, as these copyright notices are coming from the Linux kernel source, they are copyrighted, GPL'd code. SCO has either GPL'd their court filing (as a derivative work), or they are breaking the GPL by only providing the results (analogous to binaries?) of their search. :)

List 4: A list of IBM copyrights (a list of 22 names)

This list is as easy to generate as List 3. It is merely a list of all the various Copyright notices involving IBM in the kernel source. It's actually a pretty boring list, and doesn't seem to tell anyone much, including IBM. It can be regenerated merely by searching for "Copyright" or "(C)" in the same line as "IBM Corporation". They're all just lines like:
Fred So-and-So, IBM Corporation
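
As a sketch of that search, the commands below look for lines that contain "Copyright" or "(C)" and also "IBM Corporation". The sample files and the names in them are invented for illustration; chaining two greps is simply one way to require both strings on the same line in either order.

```shell
src=$(mktemp -d)

# Invented header lines standing in for kernel source comments.
printf ' * Copyright (C) 2002 Fred So-and-So, IBM Corporation\n' > "$src/a.c"
printf ' * (C) 2001 Jane Doe, IBM Corporation\n'                 > "$src/b.c"
printf ' * Thanks to IBM Corporation for hardware\n'             > "$src/c.c"
printf ' * Copyright (C) 2000 Some Other Company\n'              > "$src/d.c"

# First grep: lines with a copyright marker. Second grep: of those,
# keep only lines that also name IBM Corporation.
notices=$(grep -rhE 'Copyright|\(C\)' "$src" | grep 'IBM Corporation' | sort -u)
echo "$notices"
```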

List 5: A list of people who "likely have knowledge, although their names do not appear in the Linux code base." (a list of 62 names).

We've left the best for last. Here, we've left the kernel source, but where has SCO gotten this list? Ready? Okay... Here goes. They got it from a Google search.

Well, at least that is what it appears. The fact is that you can find the names on this list by searching on Google for email addresses from IBM that posted to the Linux Kernel Mailing List (LKML). Like I said, I don't actually know that this is how SCO did it, but if you're really curious, look at SCO's filing, then check out Google Groups for messages that hit the Linux Kernel Mailing List: '"ibm.com" group:fa.linux.kernel' (for example).

Without doing an extensive study, it is difficult to know exactly how much (or little) work was done to actually build the list, but it is clear that SCO believes that these individuals "likely have knowledge" because their email address can be found on the Linux Kernel Mailing List. To test this theory (in a highly unscientific manner), we chose 5-10 email addresses from the LKML (courtesy of Google) and all were located on SCO's list. We then tested things the other way around, and had similar results. The addresses we chose were easy to find on the LKML. One brief example: SCO's list includes the email address fubar@us.ibm.com, which is easy to find here.

So SCO produced a list that they believe holds the names of people with knowledge of Linux. They may have actually searched the Changelogs, as well. A list of names you can find on Google hardly qualifies as a response to IBM's interrogatory.

Some General Comments

In SCO's list, in the legal document, SCO has replaced all the slashes (/) in the file names with periods (.). There are several theories in the Linux community as to why. One possibility is that the lawyers may have written it up using a program that doesn't like slashes, instead of using Unix or Linux. While I used GNU utilities such as grep, the person preparing the list may have used a different platform.

Regular file/path names can be converted to the dotted format with the following command (if you so desire): 'cat /tmp/SCOFiles | sed s:/:.:g' At any rate, they could be converted back easily enough. Interestingly, the path /arch/ppc64/kernel was also changed to .arch.ppc.64.kernel for some as-yet-unknown reason.
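
Here is a small sketch of the conversion in both directions, using two of the file names mentioned in this article as sample input. The reverse step is our own guess at a workable approach: it assumes the only period that belongs in a path is the one before the final .c or .h, which would not hold for any directory or file name that contains a period of its own.

```shell
paths='arch/ppc/kernel/setup.c
include/linux/list.h'

# Forward: every slash becomes a period (the command quoted above, with quotes added).
dotted=$(printf '%s\n' "$paths" | sed 's:/:.:g')
echo "$dotted"

# Backward: turn every period into a slash, then restore the final ".c" or ".h".
restored=$(printf '%s\n' "$dotted" | sed -e 's:\.:/:g' -e 's:/\([ch]\)$:.\1:')
echo "$restored"
```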

Whoever prepared these lists was rather sloppy. They didn't pay attention to detail, missed obvious files and email addresses, and didn't edit very well. Possible references to SCO or Caldera appear to remain. For example, the list includes some contributions to JFS by Christoph Hellwig (once an employee of SCO). Presumably, at least some of those contributions occurred while he was working for SCO.

Some of the files included are trivial and obviously contain no relevant information. The 6-line files that just say "we don't do SMP" come to mind.

It is easy for coders to understand IBM's contention that SCO has not been answering their questions, regardless of the amount of data that they have produced. They don't explain how anything they have reported is a trade secret. And the fact that their lists can be recreated over a weekend using simple scripts indicates to us that their answers are too broad to qualify as answers to the questions they were asked.

The community of Linux coders is left with a number of questions unanswered, including:

Does IBM know how these lists were created?
Will they present details similar to this document to the Judge during oral arguments in connection with their Motion to Compel Discovery?
By not producing anything specific, has SCO harmed their case?
Does SCO know that this data is a nonresponse?
Do their lawyers know?
Did SCO not understand the old saying "Never tangle with a Geek when source code is on the line?"

Prepared by Frank Sorenson - frank AT byu DOT net
With numerous helpful comments from other Groklaw Regulars

Links (mentioning my analysis):
ZDNet UK - How to score SCO's legal games by Rupert Goodwins
GROKLAW and SCO's infringing files from Yahoo message boards