Another post starts, you beautiful people!
Today we will explore various file import options in Python, which I learned from a great learning site, DataCamp.
In order to import data into Python, we should first have an idea of what files are in our working directory. We will work through step-by-step examples, as given below-
Importing entire text files-
In this exercise, we'll be working with the file mobydick.txt [download here]
It is a text file that contains the opening sentences of Moby Dick, one of the great American novels!
Here you'll get experience opening a text file, printing its contents to the shell and, finally, closing it-
# Open a file: file
file = open('mobydick.txt', mode='r')
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)
Importing text files line by line-
For large files, we may not want to print all of their content to the shell. We may wish to print only the first few lines.
Enter the readline() method, which allows us to do this. When a file called file is open,
we can print out the first line by executing file.readline().
If we execute the same command again, the second line will print, and so on.
We can bind a variable file by using a context manager construct:
with open('huck_finn.txt') as file:
While still within this construct, the variable file will be bound to open('huck_finn.txt');
thus, to print the file to the shell, all the code we need to execute is:
with open('huck_finn.txt') as file:
    print(file.read())
# Read & print the first 3 lines
with open('mobydick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
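The behaviour of readline() can also be tried without any file on disk, using an in-memory buffer (the three lines below are an invented stand-in for the novel's opening, not the real text):

```python
from io import StringIO

# An invented stand-in for the first lines of mobydick.txt
text = StringIO("Call me Ishmael.\nSome years ago.\nNever mind how long.\n")

# Each call to readline() returns the next line, newline included
print(text.readline())
print(text.readline())
```

Because the buffer remembers its position, a third call would return the third line, just as with a real open file.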
You're now well-versed in importing text files, and you're about to become a whiz at importing flat files.
Using NumPy to import flat files-
In this exercise, we're now going to load the MNIST digit recognition dataset using the numpy function loadtxt() and see just how easy it can be:
The first argument will be the filename.
The second will be the delimiter which, in this case, is a comma.
You can find more information about the MNIST dataset [what is mnist] on the webpage of Yann LeCun who is currently Director of AI Research at Facebook and Founding Director of the NYU Center for Data Science, among many other things.
# Import package
import numpy as np
# Assign filename to variable: file
file = 'digits.csv'
# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')
# Print datatype of digits
print(type(digits))
# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))
# Plot reshaped data
import matplotlib.pyplot as plt
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
Customizing your NumPy import-
What if there are rows, such as a header, that you don't want to import?
What if your file has a delimiter other than a comma? What if you only wish to import particular columns?
There are a number of arguments that np.loadtxt() takes that we'll find useful:
- delimiter changes the delimiter that loadtxt() is expecting; for example, we can use ',' and '\t' for comma-delimited and tab-delimited files respectively.
- skiprows allows us to specify how many rows (not indices) we wish to skip.
- usecols takes a list of the indices of the columns we wish to keep.
The file that we'll be importing, digits_header.txt, has a header and is tab-delimited.
# Import numpy
import numpy as np
# Assign the filename: file
file = 'digits_header.txt'
# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0,2])
# Print data
print(data)
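Since digits_header.txt isn't bundled with this post, the same skiprows/usecols combination can be tried on a small invented tab-delimited block held in memory:

```python
from io import StringIO

import numpy as np

# A tiny, made-up tab-delimited file with a one-line header
txt = "a\tb\tc\n1\t2\t3\n4\t5\t6"

# Skip the header row; keep only the first and third columns
data = np.loadtxt(StringIO(txt), delimiter='\t', skiprows=1, usecols=[0, 2])

print(data)
```

The result is a 2x2 float array holding columns 'a' and 'c' only.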
Importing different datatypes-
The file seaslug.txt has a text header consisting of strings and is tab-delimited.
These data consist of the percentage of sea slug larvae that had metamorphosed in a given time period.
Due to the header, if we tried to import it as-is using np.loadtxt(), Python would throw us a ValueError and tell us that it could not convert string to float.
There are two ways to deal with this:
Firstly, we can set the data type argument dtype equal to str (for string).
Alternatively, we can skip the first row as we have seen before, using the skiprows argument.
# Assign filename: file
file = 'seaslug.txt'
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the first element of data
print(data[0])
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()
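The contrast between the two approaches (dtype=str versus skiprows) can be reproduced without seaslug.txt itself, using an invented two-row sample in memory:

```python
from io import StringIO

import numpy as np

# A made-up stand-in for seaslug.txt: text header, then numeric rows
txt = "Time\tPercent\n0\t0.5\n10\t0.8"

# Option 1: import everything as strings, header included
as_str = np.loadtxt(StringIO(txt), delimiter='\t', dtype=str)
print(as_str[0])   # the header row

# Option 2: skip the header and parse the numbers as floats
as_float = np.loadtxt(StringIO(txt), delimiter='\t', dtype=float, skiprows=1)
print(as_float[0])
```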
Working with mixed datatypes (1)-
Much of the time we will need to import datasets which have different datatypes in different columns; one column may contain strings and another floats, for example. The function np.loadtxt() will choke on this.
There is another function, np.genfromtxt(), which can handle such structures. If we pass dtype=None to it, it will figure out what types each column should be.
Import 'titanic.csv' using the function np.genfromtxt() as follows:
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
Here, the first argument is the filename, the second specifies the delimiter, and the third argument, names, tells the function that there is a header. Because the data are of different types, data is an object called a structured array.
Because numpy arrays have to contain elements that are all the same type, the structured array solves this by being a 1D array,
where each element of the array is a row of the flat file imported. We can test this by checking out the array's shape in the shell by executing np.shape(data).
Accessing rows and columns of structured arrays is super-intuitive: to get the ith row, merely execute data[i], and to get the column with name 'Fare', execute data['Fare'].
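This indexing can be sketched on a tiny in-memory example (the two-column CSV text below is invented for illustration; the real titanic.csv isn't needed):

```python
from io import StringIO

import numpy as np

# A tiny, made-up CSV standing in for titanic.csv
csv_text = "Name,Fare\nAllen,7.25\nBraund,71.28\nCumings,53.10"

# names=True reads the header; dtype=None infers each column's type
data = np.genfromtxt(StringIO(csv_text), delimiter=',',
                     names=True, dtype=None, encoding='utf-8')

print(np.shape(data))   # a 1D structured array: one element per row
print(data[0])          # the first row
print(data['Fare'])     # the whole 'Fare' column
```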
Working with mixed datatypes (2)-
We have just used np.genfromtxt() to import data containing mixed datatypes.
There is also another function, np.recfromcsv(), that behaves similarly to np.genfromtxt(),
except that its default dtype is None. In this exercise, we'll practice using it to achieve the same result.
# Assign the filename: file
file = 'titanic.csv'
# Import file using np.recfromcsv: d
d = np.recfromcsv(file)
# Print out first three entries of d
print(d[:3])
Using pandas to import flat files as DataFrames (1)-
In the last exercise, we were able to import flat files containing columns with different datatypes as numpy arrays.
However, the DataFrame object in pandas is a more appropriate structure in which to store such data and,
thankfully, we can easily import files of mixed data types as DataFrames using the pandas functions read_csv() and read_table().
# Import pandas
import pandas as pd
# Assign the filename: file
file = 'titanic.csv'
# Read the file into a DataFrame: df
df = pd.read_csv(file)
# View the head of the DataFrame
print(df.head())
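The read_table() function mentioned above is essentially read_csv() with a tab as the separator; a minimal sketch of the tab-delimited case, using a small invented in-memory table rather than a real file:

```python
from io import StringIO

import pandas as pd

# A tiny, made-up tab-delimited table
tsv_text = "Name\tAge\nAllen\t35\nBraund\t22"

# read_csv with sep='\t' handles tab-delimited data
df = pd.read_csv(StringIO(tsv_text), sep='\t')

print(df.head())
```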
Using pandas to import flat files as DataFrames (2)-
In the last exercise, we were able to import flat files into a pandas DataFrame.
As a bonus, it is then straightforward to retrieve the corresponding numpy array using the attribute values.
We'll now have a chance to do this using the MNIST dataset, which is available as digits.csv.
# Assign the filename: file
file = 'digits.csv'
# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)
# Build a numpy array from the DataFrame: data_array
data_array = data.values
# Print the datatype of data_array to the shell
print(type(data_array))
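The DataFrame-to-array round trip via the values attribute can also be seen on a tiny hand-made frame (the column names and numbers are invented):

```python
import numpy as np
import pandas as pd

# A small, invented DataFrame
df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

# .values hands back the underlying numpy array
arr = df.values

print(type(arr))
print(arr)
```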
Customizing your pandas import-
The pandas package is also great at dealing with many of the issues we will encounter when importing data as a data scientist,
such as comments occurring in flat files, empty lines and missing values. Note that missing values are also commonly referred to as NA or NaN.
To wrap up this post, we're now going to import a slightly corrupted copy of the Titanic dataset titanic_corrupt.txt, which contains comments after the character '#' and is tab-delimited.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
data[['Age']].hist()
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
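The comment and na_values handling can be reproduced without titanic_corrupt.txt, using an invented "corrupted" sample in memory with a comment line and a missing value written as 'Nothing':

```python
from io import StringIO

import pandas as pd

# A made-up, slightly corrupted tab-delimited sample
txt = "# an invented comment line\nName\tAge\nAllen\t35\nBraund\tNothing"

# comment='#' drops the comment; na_values='Nothing' becomes NaN
data = pd.read_csv(StringIO(txt), sep='\t', comment='#', na_values='Nothing')

print(data.head())
```

The comment line is skipped entirely, and the 'Nothing' entry appears in the DataFrame as NaN.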
That's it for today, guys. I have given the code snippets so that you can use them directly in your notebook.
Please try them out and share your thoughts!