Suppose you have to import, say, country names and country codes into a database from a CSV file like this: (source)
Code,English Name,French Name
AD,Andorra,Andorre
AE,United Arab Emirates,Émirats arabes unis
AF,Afghanistan,Afghanistan
AG,Antigua and Barbuda,Antigua-et-Barbuda
The database schema is as follows:
country_code {country_code,country_name}
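The post only names the table and its two columns, so here is a minimal sketch of a matching table definition (the column types and primary key are my assumptions):

// Assumed DDL: the schema above only names the table and columns,
// so the types and key here are guesses.
$db->query("CREATE TABLE country_code (
    country_code CHAR(2)      NOT NULL PRIMARY KEY,
    country_name VARCHAR(100) NOT NULL
)");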
In PHP, you could have done:
(Method 1)
$file_handle = fopen(dirname(__FILE__) . "/countries.csv", "r");
while (!feof($file_handle)) {
    $line = fgetcsv($file_handle, 1024);
    if ($line === false) {
        continue; // fgetcsv() returns false on blank lines
    }
    $line[1] = mysql_real_escape_string($line[1]);
    // one INSERT statement -- and one round trip to MySQL -- per record
    $sql = "INSERT INTO country_code (country_code,country_name) VALUES ('{$line[0]}','{$line[1]}');";
    $db->query($sql);
}
fclose($file_handle);
But it takes about 3 seconds to import the attached CSV. With a file of, say, 50,000 records, it might take hours to run. It is really annoying to own a Core i7 machine with lots of RAM and still watch a process run for hours while using only 0-1% of the CPU.
Hard-drive I/O limits the speed tremendously: each time a record is read, the disk is accessed, then MySQL accesses the disk again to save the row, and so on. Even a solid-state drive (SSD) doesn't solve the problem (I tried; it still takes hours).
I had to find a way to buffer a fairly large chunk of the data in memory and then execute a single INSERT for every 1,000 records. I came up with the following code:
(Method 2)
$file_handle = fopen(dirname(__FILE__) . "/countries.csv", "r");
$init_sql = "INSERT INTO country_code (country_code,country_name) VALUES ";
$sql = $init_sql;
$count = 1;
$i = 0; // batch counter
while (!feof($file_handle)) {
    $line = fgetcsv($file_handle, 1024);
    if ($line === false) {
        continue; // fgetcsv() returns false on blank lines
    }
    $line[1] = mysql_real_escape_string($line[1]);
    $sql .= "('{$line[0]}','{$line[1]}'),";
    if ($count++ >= 1000) {
        // flush ~1000 buffered rows in a single multi-row INSERT
        $i++;
        $count = 0;
        $sql = rtrim($sql, ',') . ";";
        $db->query($sql);
        $sql = $init_sql;
    }
}
// flush the final partial batch, if any rows are still buffered
if ($sql !== $init_sql) {
    $sql = rtrim($sql, ',') . ";";
    $db->query($sql);
}
fclose($file_handle);
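As an aside, the mysql_* functions used above were removed in PHP 7, so on a current setup the same batching technique would be written with mysqli (or PDO). A minimal sketch, assuming a mysqli connection with placeholder credentials:

// Placeholder connection details -- replace with your own.
$mysqli = new mysqli("localhost", "user", "password", "mydb");
$file_handle = fopen(__DIR__ . "/countries.csv", "r");
$init_sql = "INSERT INTO country_code (country_code,country_name) VALUES ";
$sql = $init_sql;
$count = 0;
while (($line = fgetcsv($file_handle, 1024)) !== false) {
    $code = $mysqli->real_escape_string($line[0]);
    $name = $mysqli->real_escape_string($line[1]);
    $sql .= "('{$code}','{$name}'),";
    if (++$count >= 1000) {               // flush every 1000 rows
        $mysqli->query(rtrim($sql, ',') . ";");
        $sql = $init_sql;
        $count = 0;
    }
}
if ($count > 0) {                          // flush the final partial batch
    $mysqli->query(rtrim($sql, ',') . ";");
}
fclose($file_handle);
$mysqli->close();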
Method 1 exec time: 3.0624158382416 s
Method 2 exec time: 0.010763883590698 s
Method 2 is 284x faster.
The import time is considerably improved; across my benchmark runs, Method 2 is consistently 270-285x faster.
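For reference, timings like these can be reproduced with a simple microtime() wrapper (import_csv() here is a hypothetical stand-in for either method above):

$start = microtime(true);                  // high-resolution start time
import_csv();                              // hypothetical stand-in for Method 1 or 2
echo "exec time: " . (microtime(true) - $start) . " s\n";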
I was once importing US ZIP codes into a database. The file contained about 43k records; the second method imported them in less than 5 seconds 😉