Today I needed to read multiple CSV files into R and merge them all into one table. I’ve done this before successfully using a nice and short piece of code:
files <- list.files(".", full.names = TRUE, recursive = TRUE)
listcsv <- lapply(files, function(x) read.csv(x))
data <- do.call("rbind", listcsv)
This code works nicely when your CSV files have an identical structure and don’t include any extra rows at the top of the data.
Today, however, I’m stuck with CSV files that have an extra seven lines at the top of each file that I’d like to strip. Normally, skipping lines while reading in a CSV file is easy: just specify the skip argument. Like so:
file <- read.csv("filename", skip=7)
This would give me the file just nicely, as a data frame. However, my earlier code for reading the files one by one into a list and then binding them together into one data frame doesn’t work as such, because it has no way to get rid of the seven extra lines at the beginning of each file.
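To make the effect of skip concrete on a single file, here is a minimal sketch. The file contents, the seven preamble lines and the column names are made up for illustration; only read.csv and its skip argument come from the text above.

```r
# Hypothetical example file: seven lines of preamble, then a header and data.
path <- tempfile(fileext = ".csv")
writeLines(c(paste("preamble line", 1:7),
             "id,value",
             "1,10",
             "2,20"), path)

# With skip = 7, reading starts at the eighth line, which here is the header.
one_file <- read.csv(path, skip = 7)
print(one_file)
```

In this made-up file the line after the preamble is a header row, so the default header = TRUE is what you want; my real files needed header = FALSE instead.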
I could, of course, strip the seven lines from each file manually, since I currently only have 13 files to read in. But I’m sure that there will come a day when I have many more files, so why not do this properly right from the start?
After several trial-and-error approaches I resorted to Google and found a nice Stack Overflow thread on the subject.
I tried a couple of the options with no luck, always failing on some detail, like how to pass the arguments to read.csv. Eventually I tried this one:
data <- do.call(rbind, lapply(files, read.csv, as.is = TRUE, skip = 7, header = FALSE))
And it works! In this format, passing the extra arguments (as.is, skip and header*) to read.csv works fine, because lapply hands any additional named arguments on to the function it calls. And with my amount of data (only 55,000+ rows in the final data frame), it is also fast enough:
user system elapsed
0.38 0.00 0.37
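If you want to try the whole thing without my files, the following sketch reproduces the approach end to end. The temporary directory, the file names and the file contents are all invented for the example; the do.call/lapply line is the same as above, wrapped in system.time to get the timing.

```r
# Create three hypothetical CSV files in a fresh temporary directory, each
# with seven preamble lines followed by four data rows (no header row).
demo_dir <- tempfile("csv_demo")
dir.create(demo_dir)
for (i in 1:3) {
  writeLines(c(paste("preamble", 1:7),
               paste0("file", i, ",", 1:4)),
             file.path(demo_dir, paste0("part", i, ".csv")))
}

files <- list.files(demo_dir, pattern = "\\.csv$", full.names = TRUE)

# lapply() forwards as.is, skip and header to read.csv() through `...`;
# do.call(rbind, ...) then stacks the resulting data frames.
timing <- system.time(
  combined <- do.call(rbind, lapply(files, read.csv,
                                    as.is = TRUE, skip = 7, header = FALSE))
)

dim(combined)   # 12 rows, 2 columns (V1 and V2)
print(timing)   # user / system / elapsed, in seconds
```

The timings will of course differ from mine, since they depend on the machine and on the size of the files.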
So now I’m all ready to start joining this data with some other data and get on with my data analysis!
* The as.is argument makes sure that strings are read in as character vectors and not as factors. The skip argument lets you skip the specified number of lines at the start of the file. The header argument lets you specify whether the first row that is read should be used as a header row or not.
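A small sketch of what as.is and header actually change, using a made-up three-line file (the column names and values are assumptions for the example). As far as I know, R 4.0.0 and later already read strings as character by default, so as.is mainly matters on older versions or when stringsAsFactors has been switched on.

```r
# Hypothetical CSV with a text column and a numeric column.
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "anna,1", "ben,2"), path)

# header = TRUE (the read.csv default): the first line becomes column names.
with_header <- read.csv(path, as.is = TRUE)
names(with_header)              # "name" "score"

# header = FALSE: the first line is treated as data and the columns are
# named V1, V2. The text "score" makes the second column character too.
no_header <- read.csv(path, as.is = TRUE, header = FALSE)
names(no_header)                # "V1" "V2"

# as.is = TRUE keeps text as character; stringsAsFactors = TRUE would
# convert it to a factor instead.
class(with_header$name)                               # "character"
class(read.csv(path, stringsAsFactors = TRUE)$name)   # "factor"
```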