
Different ways of loading datasets for machine learning and deep learning

Loading the dataset is the first thing we have to do before running any model. As a beginner in machine learning, I struggled with loading datasets because different tutorials load them in different ways. For this reason, I documented the popular ways of loading a dataset into your model.

While training a model, we sometimes download the dataset and feed it in from a local directory. When we pass the code to someone else, it breaks because the file does not exist on their machine. If the dataset is available online, it is good practice to load it directly from the URL and process it afterwards, rather than downloading it to your drive first. Here I will describe a few techniques for loading CSV datasets into your machine learning and deep learning models.

Method 1: Load data using basic Python modules

The first method I will discuss is a little lengthy and has two steps.

  • Download or load the raw data into RAM
  • Convert the data into a standard format such as a list or a NumPy array

Open File into RAM

#Method 1: Open file from a remote URL
from urllib.request import urlopen
path = "path_of_the_data"
rawdata = urlopen(path)  # opens the URL; reading the response yields bytes

#Method 2: Open file from a local path
filename = "path_of_the_file"   # placeholder, replace with your own path
rawdata = open(filename, 'rt')  # r = read mode | t = text mode
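One detail worth keeping in mind here: urlopen() hands back bytes, while open(filename, 'rt') hands back text. Below is a minimal sketch of a helper that returns a text-mode file object in both cases. The name open_text, the list of URL schemes, and the UTF-8 default are my own assumptions for illustration; they are not part of any library.

```python
import io
from urllib.parse import urlparse
from urllib.request import urlopen


def open_text(source, encoding="utf-8"):
    """Return a text-mode file object for a URL or a local path (hypothetical helper)."""
    scheme = urlparse(source).scheme
    if scheme in ("http", "https", "ftp", "file"):
        # urlopen() yields bytes, so wrap the response to decode it while reading
        return io.TextIOWrapper(urlopen(source), encoding=encoding, newline="")
    # newline="" is what the csv module recommends for files it will parse
    return open(source, "rt", encoding=encoding, newline="")
```

With this in place, rawdata = open_text(path) can be passed to csv.reader() whether path is a URL or a local file.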

Convert Data Into NumPy Array

import numpy as np

#Method 1: Convert to a numpy array using loadtxt()
data_np = np.loadtxt(rawdata, delimiter=',')  # returns numpy array

#Method 2 (alternative to Method 1): Convert to a numpy array using csv and numpy
# csv.reader needs text lines; decode urlopen() bytes first, e.g. with io.TextIOWrapper
import csv
csvObj = csv.reader(rawdata, delimiter=',', quoting=csv.QUOTE_NONE)
listObj = list(csvObj)               # convert csv object into list
data_np = np.array(listObj)          # convert list object into numpy array
data_np = data_np.astype('float')    # convert string array into float
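To check that both conversion routes give the same result, here is a small self-contained sketch. It uses an in-memory string as a stand-in for the opened file, so the sample values are made up purely for illustration.

```python
import csv
import io

import numpy as np

# Hypothetical sample standing in for the raw text of a CSV file
sample = "5.1,3.5,0\n4.9,3.0,0\n6.2,2.9,1\n"

# Route 1: let NumPy parse the text directly
via_loadtxt = np.loadtxt(io.StringIO(sample), delimiter=",")

# Route 2: parse the rows with csv, then convert the strings to floats
rows = list(csv.reader(io.StringIO(sample), delimiter=","))
via_csv = np.array(rows).astype(float)

# Both routes should produce the same 3x3 float array
same = np.array_equal(via_loadtxt, via_csv)
```

Each route gets its own StringIO because a file object is consumed once it has been read.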

Method 2: Load data using Pandas

The second method uses the Pandas module, which is easy and straightforward. It has a read_csv() function that accepts either a URL or a local path and returns the data as a pandas DataFrame.

import pandas as pd
df1 = pd.read_csv(remote_url) #read data from remote url
df2 = pd.read_csv(local_path) #read data from local path
df3 = pd.read_csv(local_path, header=None)  # header=None when the dataset has no header row
# return the numpy representation of the dataframe
np_data = df1.to_numpy()
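The header argument is easy to get wrong, so here is a short sketch of what it changes. The column names and values are invented for illustration, and io.StringIO stands in for a real URL or file path.

```python
import io

import pandas as pd

# Hypothetical data: the same two rows, with and without a header line
with_header = "height,weight\n1.5,60\n1.8,75\n"
no_header = "1.5,60\n1.8,75\n"

# Default behaviour: the first line becomes the column names
df_a = pd.read_csv(io.StringIO(with_header))

# Pitfall: without header=None the first data row is consumed as the header
df_lost = pd.read_csv(io.StringIO(no_header))

# header=None keeps every row and numbers the columns 0, 1, ...
df_b = pd.read_csv(io.StringIO(no_header), header=None)

# names= supplies your own column labels for headerless data
df_c = pd.read_csv(io.StringIO(no_header), header=None, names=["height", "weight"])

arr = df_c.to_numpy()  # NumPy representation of the DataFrame
```

Comparing len(df_lost) with len(df_b) is a quick way to notice that a data row was swallowed as a header.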

Method 3: Load data from Google Drive

While working in Google Colab, it is convenient to load data from Google Drive.

Suppose this is the structure of your Google Drive folder, and you uploaded the dataset into dataset_folder:

My Drive
|-- folder_1
|-- dataset_folder

To access the dataset in dataset_folder, we have to write the following code.

from google.colab import drive
import os
# Mount Google Drive
drive.mount('/content/drive')
# Make the Google Drive dataset folder the working directory
os.chdir(r"/content/drive/My Drive/dataset_folder")
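If you would rather not change the working directory, one alternative is to build the full path to the file. This is only a sketch: dataset_path and the file name data.csv are hypothetical, and the root path is simply the one used in the snippet above. The import is guarded so the code also runs outside Colab.

```python
from pathlib import PurePosixPath

# Same Drive location as in the os.chdir() call above
DRIVE_ROOT = "/content/drive/My Drive"


def dataset_path(filename, folder="dataset_folder", root=DRIVE_ROOT):
    """Build the absolute path of a file inside a Drive folder (hypothetical helper)."""
    return str(PurePosixPath(root) / folder / filename)


try:
    from google.colab import drive  # only importable inside Google Colab
except ImportError:
    drive = None

if drive is not None:
    drive.mount("/content/drive")  # asks for authorization on first use

csv_path = dataset_path("data.csv")  # "data.csv" is a made-up file name
```

The resulting csv_path can then be passed to pd.read_csv() or open() like any other local path.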