SAS: best practices for working with datasets and libraries

The way SAS handles datasets and libraries of datasets is fiddly, and can cause big problems with analysis if not handled properly. Here, I explain briefly what SAS is doing behind the scenes and how I avoid common pitfalls.

Datasets in SAS

A SAS dataset is essentially a glorified Excel sheet (each variable is a column; each record is a row). SAS saves datasets in folders somewhere on your computer (more on that below). These files have a .sas7bdat extension. The name of the dataset in SAS matches the first part of the filename (the part before the .sas7bdat).

Libraries in SAS

A library is a collection of datasets. It maps to a folder on your computer with .sas7bdat files in it. You can give the library a short alias that's 1 to 8 characters long; it must start with a letter or underscore and can contain only letters, digits, and underscores. You do this by placing libname baz "c:\path\to\folder"; at the top of your SAS Editor file.

In the screenshot above, the first line of the Editor window tells SAS to look in c:\path\to\folder whenever the library "baz" is referenced. The third line tells SAS to look in the folder with the alias "baz" (that's c:\path\to\folder) for a file called foo.sas7bdat.
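In text form, an Editor file like the one described might look like the sketch below. The exact layout is an assumption for illustration (a blank second line, with proc contents as the step on the third line); the folder must already exist and contain foo.sas7bdat.

```sas
libname baz "c:\path\to\folder";   /* "baz" now points at this folder */

proc contents data=baz.foo;        /* reads c:\path\to\folder\foo.sas7bdat */
run;
```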

The default library (work)

Perhaps the most confusing thing in SAS is what happens when you don't specify a library name when referencing a dataset. SAS has a temporary library it calls "work", and by default this is used whenever you don't specify a library. Datasets in "work" are deleted when your SAS session ends, so they are gone every time you restart SAS.

This makes the following two editor windows exactly equivalent.
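As a minimal sketch of that equivalence, assuming a dataset named foo already exists in "work", these two steps refer to the same dataset:

```sas
/* No library given: SAS assumes the "work" library */
proc contents data=foo;
run;

/* Library given explicitly: same dataset as above */
proc contents data=work.foo;
run;
```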

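This is convenient but confusing, because proc contents data=foo; and proc contents; can reference two entirely different datasets. The first always means "work.foo"; the second, with no dataset named at all, uses whichever dataset was created most recently.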

Note that the output of proc contents shows the library and name of the dataset it ran on, so you can always check which one SAS actually used.
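The difference can be seen with a small sketch. The datasets foo and bar and their variables are hypothetical, created here only to show which dataset each proc contents picks up:

```sas
/* Two throwaway datasets in "work" (hypothetical names and variables) */
data work.foo;
    x = 1;
run;

data work.bar;
    y = 2;
run;

/* Explicit name: describes WORK.FOO */
proc contents data=foo;
run;

/* No data= option: SAS falls back to the most recently created
   dataset, which at this point is WORK.BAR, not WORK.FOO */
proc contents;
run;
```

The header of each proc contents report lists the dataset name in "library.dataset" form, which is a quick way to confirm which dataset was used.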

Best practices

Here's how I avoid confusion about datasets and libraries in my own code:

  • I always use the full "libname.datasetname" syntax so there's no ambiguity about whether I'm using "work" or my own manually defined library.
  • I only use "work" for something that's truly temporary. Anything I want to reference in a later proc, I save to a manually defined library so it won't disappear if I close SAS.
  • If I create a dataset based on an existing dataset, I prefix the name with "drv" like this. This stands for "derived", which tells me that it's based on another dataset and could theoretically be recreated by running my code again. In contrast, my raw datasets with no "drv" prefix can't be reproduced by running my code.
  • I use descriptive names for my derived datasets. "drv_foo1" and "drv_foo2" are not particularly helpful when trying to remember what has changed between them. "drv_foo_drop_missing" is much better.
  • SAS will allow you to run proc whatever; without explicitly specifying a dataset. I never do this because it makes it ambiguous which dataset I'm using.
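Put together, these habits might look like the sketch below. It assumes the "baz" library and raw dataset foo from earlier; the rule used to drop records (any missing value in any variable) is only an illustration, not a recommendation for a particular analysis.

```sas
libname baz "c:\path\to\folder";

/* Derived dataset, saved to a permanent library with a descriptive name.
   cmiss(of _all_) counts missing values across all variables in the row;
   the subsetting IF keeps only rows where that count is zero. */
data baz.drv_foo_drop_missing;
    set baz.foo;
    if cmiss(of _all_) = 0;
run;

/* Always name the library and the dataset explicitly */
proc contents data=baz.drv_foo_drop_missing;
run;
```

Because the raw dataset baz.foo is never overwritten, the derived dataset can be deleted and rebuilt at any time by rerunning the code.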

An example

If I were doing a Broad Street Pump analysis, I would make a folder called c:\projects\broadstreet where I would save all my SAS Editor (.sas) files.

I would also make a subfolder called c:\projects\broadstreet\data and point my SAS library at it with libname broadst "c:\projects\broadstreet\data"; at the top of my Editor file.

If my primary dataset was called "primary", then I would expect there to be a file on my computer called c:\projects\broadstreet\data\primary.sas7bdat.

I would then use "broadst.primary" to reference this dataset in my procs. For example: proc contents data=broadst.primary;.
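The top of an Editor file for this project might then look like the following sketch. The proc print step and its obs=10 limit are additions for illustration only; the folder and primary.sas7bdat are assumed to exist already.

```sas
/* Point the "broadst" library at the project's data folder */
libname broadst "c:\projects\broadstreet\data";

/* Describe c:\projects\broadstreet\data\primary.sas7bdat */
proc contents data=broadst.primary;
run;

/* Peek at the first 10 records, still using the full library.dataset name */
proc print data=broadst.primary(obs=10);
run;
```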