Suppose you have to import, say, country names and country codes into a database from a CSV file like this (source):
Code,English Name,French Name
AD,Andorra,Andorre
AE,United Arab Emirates,Émirats arabes unis
AF,Afghanistan,Afghanistan
AG,Antigua and Barbuda,Antigua-et-Barbuda
The database schema is as follows:
country_code {country_code,country_name}
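The post doesn't include the actual CREATE TABLE statement, so here is a minimal sketch of what it could look like, using the same $db wrapper as the snippets below; the column types and sizes are my assumptions, not taken from the source:
$db->query("CREATE TABLE IF NOT EXISTS country_code (
    country_code CHAR(2)      NOT NULL, -- assumed: ISO 3166-1 alpha-2 codes
    country_name VARCHAR(100) NOT NULL  -- assumed maximum length
)");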
In PHP, you could have done:
(Method 1)
$file_handle = fopen(dirname(__FILE__) . "/countries.csv", "r");
fgetcsv($file_handle, 1024); // skip the header row
while (!feof($file_handle)) {
    $line = fgetcsv($file_handle, 1024);
    if ($line === false) { // blank line or EOF
        continue;
    }
    $line[1] = mysql_real_escape_string($line[1]);
    $sql = "INSERT INTO country_code (country_code,country_name)
            VALUES ('{$line[0]}','{$line[1]}');";
    $db->query($sql); // one round trip to MySQL per CSV record
}
fclose($file_handle);
But it takes about 3 seconds to import the attached CSV. With a file of, say, 50,000 records, it might take hours to run. It is really annoying to have a Core i7 computer with lots of RAM and yet run a process that takes hours while using only 0-1% of the CPU.
Hard-drive I/O limits the speed tremendously: each time a record is read, the HDD is accessed, then MySQL accesses the HDD again to save the row, and so on. Even a solid-state drive (SSD) doesn't solve the problem (I tried; it still took hours).
I had to find a way to load a reasonably large chunk of the file into memory and then execute a single insert query for every 1,000 records. I came up with the following code:
(Method 2)
$file_handle = fopen(dirname(__FILE__) . "/countries.csv", "r");
fgetcsv($file_handle, 1024); // skip the header row
$init_sql = "INSERT INTO country_code (country_code,country_name) VALUES ";
$sql = $init_sql;
$count = 0;
while (!feof($file_handle)) {
    $line = fgetcsv($file_handle, 1024);
    if ($line === false) { // blank line or EOF
        continue;
    }
    $line[1] = mysql_real_escape_string($line[1]);
    $sql .= "('{$line[0]}','{$line[1]}'),";
    if (++$count >= 1000) { // flush one multi-row INSERT per 1,000 records
        $count = 0;
        $db->query(rtrim($sql, ',') . ';');
        $sql = $init_sql;
    }
}
if ($sql !== $init_sql) { // flush the last, possibly partial batch
    $db->query(rtrim($sql, ',') . ';');
}
fclose($file_handle);
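Each flush above sends a single multi-row INSERT, so MySQL parses one statement and writes one batch instead of doing both once per record. With the sample rows from the CSV at the top, the generated statement looks like this (the code builds it on one line; it is wrapped here for readability):
INSERT INTO country_code (country_code,country_name) VALUES
('AD','Andorra'),
('AE','United Arab Emirates'),
('AF','Afghanistan'),
('AG','Antigua and Barbuda'),
...;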
Method 1 exec time: 3.0624158382416 s
Method 2 exec time: 0.010763883590698 s
Method 2 is 284× faster
The import time is considerably improved: across my benchmarks, Method 2 is consistently 270-285× faster.
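The post doesn't show how these timings were taken; a minimal way to reproduce this kind of measurement is to wrap each version with microtime(). The import_method_1() and import_method_2() wrappers below are hypothetical names for the two listings above, not functions from the original code:
$start = microtime(true);
import_method_1(); // hypothetical wrapper around the Method 1 listing
echo "Method 1 exec time: " . (microtime(true) - $start) . " s\n";

$start = microtime(true);
import_method_2(); // hypothetical wrapper around the Method 2 listing
echo "Method 2 exec time: " . (microtime(true) - $start) . " s\n";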
I was once importing US ZIP codes into a database. The file contained about 43,000 records. The second method allowed importing it in less than 5 seconds.