Reading multiple csv files into R while skipping the first lines

Today I needed to read multiple csv files into R and merge them all into one table. I’ve done this before successfully using a nice and short piece of code:

files <- list.files(".", full.names = TRUE, recursive = TRUE)
listcsv <- lapply(files, function(x) read.csv(paste0(x)))
data <- do.call("rbind", listcsv)

This code works nicely when you have csv files that are of identical structure and don’t include any extra rows at the top of your data.
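One thing to watch, assuming the folder might also hold files that are not csv (scripts, notes and so on): list.files with no pattern returns everything it finds, and read.csv is then called on all of it. Here is a minimal sketch of restricting the listing with the pattern argument. The temporary folder and the file names are made up for the demo.

```r
# Hypothetical demo folder: two csv files and one stray text file
demo_dir <- tempfile("csvdemo")
dir.create(demo_dir)
write.csv(data.frame(id = 1:2, value = c("a", "b")),
          file.path(demo_dir, "one.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c("c", "d")),
          file.path(demo_dir, "two.csv"), row.names = FALSE)
writeLines("not a csv", file.path(demo_dir, "notes.txt"))

# pattern is a regular expression; "\\.csv$" keeps only names ending in .csv
files <- list.files(demo_dir, pattern = "\\.csv$",
                    full.names = TRUE, recursive = TRUE)

# Same read-and-bind steps as before, now applied to csv files only
listcsv <- lapply(files, read.csv)
data <- do.call("rbind", listcsv)
```

With the pattern in place, notes.txt is never handed to read.csv.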

Today, however, I’m stuck with csv files that have an extra seven lines at the top of each file that I’d like to strip. Normally, skipping lines while reading in a csv is easy: just specify the skip argument. Like so:

file <- read.csv("filename", skip=7)

This would give me the file just nicely, as a data frame. However, the code above, which reads the files one by one into a list and then binds them together into one data frame, doesn’t work as it stands, because I need to get rid of the seven extra lines at the beginning of each file.

I could, of course, strip the seven lines from each file manually, since I currently only have 13 files to read in. But I’m sure that there will come a day when I have many more files, so why not do this properly right from the start?

After several trial-and-error approaches I resorted to Google, and found a nice Stack Overflow thread on the subject.

I tried a couple of the options with no luck, always failing on some detail, like how to pass the arguments to read.csv. Eventually I tried this one:

data <- do.call(rbind,
                lapply(files, read.csv, as.is = TRUE, skip = 7, header = FALSE))

And it works! In this format passing the extra arguments (as.is, skip and header*) to read.csv works fine. And with my amount of data (only 55000+ rows in the final data frame), it is also fast enough:

   user  system elapsed 
   0.38    0.00    0.37
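The numbers above have the shape of system.time() output. Here is a sketch of how such a timing could be taken, assuming that is how it was measured. The generated files are small stand-ins for the real data, so the figures it prints will not match the ones above.

```r
# Hypothetical stand-in data: 3 files, each with 7 extra lines and 100 data rows
demo_dir <- tempfile("timing")
dir.create(demo_dir)
for (i in 1:3) {
  lines <- c(paste("extra line", 1:7), paste(1:100, "text", sep = ","))
  writeLines(lines, file.path(demo_dir, paste0("file", i, ".csv")))
}
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# system.time() evaluates the expression and reports user, system and elapsed time
timing <- system.time(
  data <- do.call(rbind,
                  lapply(files, read.csv, as.is = TRUE, skip = 7, header = FALSE))
)
print(timing)
```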

So now I’m all ready to start joining this data with some other data and get on with my data analysis!

* The as.is argument makes sure that strings are read in as character and not as factors (since R 4.0.0 this is the default behaviour of read.csv anyway, so as.is mainly matters on older versions). The skip argument allows you to skip the specified number of rows. The header argument lets you specify whether the first row should be used as a header row or not.

2 thoughts on “Reading multiple csv files into R while skipping the first lines”

Thank you for your comment! I think you’re absolutely right and had actually already tried your suggestion. Alas, it failed with an error that I didn’t know how to correct (“Error in match.names(clabs, names(xi)) : names do not match previous names”).

Now, after sleeping on it and reading your comment, I decided to try it again and realised that it needs a little tweak to work as anticipated, i.e. the “header=FALSE” argument. So I would use more or less the same lines of code as before, only tweaking the middle line as follows:

files <- list.files(".", full.names = TRUE, recursive = TRUE)
listcsv <- lapply(files, function(x) { read.csv(x, skip = 7, header = FALSE, as.is = TRUE) })
data <- do.call("rbind", listcsv)

Now it works like a charm!
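For anyone hitting the same “names do not match previous names” error, here is a small reproduction. It assumes that the line right after the seven skipped ones is already data rather than a header row, which I believe was the case with my files. The file names and contents are made up.

```r
# Hypothetical files: 7 extra lines, then data rows with no header row
demo_dir <- tempfile("headerdemo")
dir.create(demo_dir)
write_demo <- function(name, rows) {
  writeLines(c(paste("extra line", 1:7), rows), file.path(demo_dir, name))
}
write_demo("a.csv", c("1,x", "2,y"))
write_demo("b.csv", c("3,z", "4,w"))
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# With the default header = TRUE, the first data row of each file is used as
# column names, so the names differ from file to file and rbind refuses to bind
bad <- lapply(files, read.csv, skip = 7, as.is = TRUE)
same_names <- identical(names(bad[[1]]), names(bad[[2]]))
res <- try(do.call(rbind, bad), silent = TRUE)
failed <- inherits(res, "try-error")

# With header = FALSE every file gets the default names V1, V2, ... and rbind works
good <- lapply(files, read.csv, skip = 7, header = FALSE, as.is = TRUE)
data <- do.call(rbind, good)
```

If your files do have a proper header row after the extra lines, keeping header = TRUE should be fine, as long as the header is the same in every file.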

I think lapply(files, function(x) read.csv(paste0(x))) would be more succinctly expressed as lapply(files, read.csv): we are pasting a length-1 vector, and lapply does not care whether you define an unnamed function or just pass an existing one. Although when passing stuff to read.csv, the formulation with an unnamed function is kinda natural, because you could do lapply(files, function(d) { read.csv(d, skip = 7, as.is = TRUE)}).

the thing is that function(x) { read.csv(x) } just defines a single argument function that calls read.csv with a single parameter, you can just add whatever parameters you need to that read.csv call.
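A quick check of the point above, as a sketch with made-up files: lapply forwards any extra arguments to the function it is given, so passing read.csv directly with the arguments after it should give the same result as wrapping the call in an unnamed function.

```r
# Hypothetical files: 7 extra lines followed by two data rows each
demo_dir <- tempfile("lapplydemo")
dir.create(demo_dir)
for (i in 1:2) {
  lines <- c(paste("extra line", 1:7), paste0(i, ",a"), paste0(i, ",b"))
  writeLines(lines, file.path(demo_dir, paste0("file", i, ".csv")))
}
files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# Form 1: pass read.csv itself; lapply hands skip, header and as.is on to it
direct <- lapply(files, read.csv, skip = 7, header = FALSE, as.is = TRUE)

# Form 2: an unnamed function that makes the same read.csv call
wrapped <- lapply(files, function(x) {
  read.csv(x, skip = 7, header = FALSE, as.is = TRUE)
})

both_equal <- identical(direct, wrapped)
```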
