Data Manipulation and Exploration with Numpy and Pandas

Written By: Deborah Dormah Kanubala

Numpy and Pandas are two packages that are often used for data manipulation and exploration. It is therefore important that, as data scientists or machine learning engineers, we learn the basics of these packages. This tutorial therefore seeks to show you how these packages work.

Introduction

NumPy stands for Numerical Python and it is an open-source Python library. It is a standard Python package for working with numerical data and performing mathematical operations on arrays. Importantly, it adds powerful data structures to Python that can help guarantee efficient calculations with arrays.

Installing Numpy

We will first look at how to install Numpy. If you already have Python you can install Numpy via your terminal using this:

pip install numpy

Otherwise, if you have Anaconda installed then Numpy is already included and there is no need to install it separately. Using Anaconda can save you time and effort in setting up your Python environment.

Once you have managed to install Numpy, then you are ready to import it and begin to explore the package.

import numpy as np

If no error has been displayed, then we can be sure the package has been installed.
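
As a quick extra check, you can also print the installed version. This is only a small sketch; the version string you see depends on your own installation.

```python
import numpy as np

# Print the installed version string (the exact value depends on your setup)
version = np.__version__
print(version)
```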

Create Numpy Arrays

Numpy arrays are homogeneous, which means that all of their elements share one data type (integer, float, string, etc.). You can create a Numpy array from a Python list or a nested list. To do this, you can use the np.array() function provided by Numpy.

For example, to create a one-dimensional array of integers from 1 to 6, you can use the following code:

Create an array using lists

#create using python lists
np.array([1, 2, 3, 4, 5, 6]) # output: array([1, 2, 3, 4, 5, 6])

#create an array of names of countries
np.array(["Ghana", "Germany", "Togo", "Senegal", "Belgium"])
#output: array(['Ghana', 'Germany', 'Togo', 'Senegal', 'Belgium'], dtype='<U7')

You can also create a two-dimensional array using a nested list. For example, to create a 2x2 array of integers, you can use the following code:

arr_numbs = np.array([[1, 2], [3, 4]])
print(arr_numbs) #output: [[1 2] [3 4]]
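
To see what "homogeneous" means in practice, the sketch below mixes types in one list and inspects the resulting dtype, along with the shape of a 2x2 array. The variable names (mixed, as_text, grid) are only illustrative.

```python
import numpy as np

# Mixing integers and a float: Numpy upcasts every element to float64
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)            # float64

# Mixing a number and text: every element becomes a string
as_text = np.array([1, "two"])
print(as_text.dtype.kind)     # U (unicode string)

# Shape and number of dimensions of a 2x2 array
grid = np.array([[1, 2], [3, 4]])
print(grid.shape, grid.ndim)  # (2, 2) 2
```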

Create an array of zeros or ones

#creating arrays of zeros
np.zeros(10, dtype='int') #output: array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])


#creating a 5 row x 5 column matrix
np.zeros((5,5), dtype=float)
#output: array([[0., 0., 0., 0., 0.],[0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.], [0., 0., 0., 0., 0.]])

#creating a 3 row x 5 column matrix
np.ones((3,5), dtype=float)
#output: array([[1., 1., 1., 1., 1.],[1., 1., 1., 1., 1.],[1., 1., 1., 1., 1.]])

Create an array with a predefined value

#creating a matrix with a predefined value
np.full((3,5), 1.23)
# output: array([[1.23, 1.23, 1.23, 1.23, 1.23],[1.23, 1.23, 1.23, 1.23, 1.23],[1.23, 1.23, 1.23, 1.23, 1.23]])

Create an array of evenly spaced values

Numpy's np.linspace() function is used to create an array of evenly spaced values within a specified range. The function takes three arguments: the start value, the end value, and the number of values to generate. The resulting array will contain the specified number of values between the start and end values, inclusive. Here's an example of how to use np.linspace() to create an array of 5 equally spaced values between 0 and 10:

my_array = np.linspace(0, 10, 5)
print(my_array) #output: [ 0.   2.5  5.   7.5 10. ]

The np.linspace() function is useful when you need to create an array with a specific number of values, and you want those values to be evenly spaced across a range. It's commonly used in plotting and visualization applications, as well as in mathematical and scientific calculations.
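
Two optional keyword arguments of np.linspace() are worth knowing about: endpoint and retstep. The sketch below reuses the 0 to 10 range from the example above; the variable names are only illustrative.

```python
import numpy as np

# Default behaviour: the end value 10 is included
print(np.linspace(0, 10, 5))   # [ 0.   2.5  5.   7.5 10. ]

# endpoint=False leaves the end value out, so the spacing becomes 10 / 5 = 2
open_ended = np.linspace(0, 10, 5, endpoint=False)
print(open_ended)              # [0. 2. 4. 6. 8.]

# retstep=True also returns the spacing between consecutive values
values, step = np.linspace(0, 10, 5, retstep=True)
print(step)                    # 2.5
```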

Create an array with a set of sequences

Numpy's np.arange() function is used to create an array containing a sequence of values with a fixed step. The function takes three arguments: the start value, the end value (exclusive), and the step size. The resulting array will contain values starting from the start value, incrementing by the step size, and stopping before the end value. The np.arange() function is useful when you need to create an array with a specific step size, and you want to control the start and end values. Here's an example of how to use np.arange() to create an array of values between 0 and 9, with a step size of 2:

#np.arange(start, stop, step)
np.arange(0, 10, 2) #output: array([0, 2, 4, 6, 8])

You may have noticed that np.linspace() and np.arange() look similar, since both are called with three values. It is therefore common to confuse them. The main difference between np.arange() and np.linspace() is the way they generate the array values. np.arange() takes the start value, the end value (exclusive), and the step size between consecutive values. np.linspace(), on the other hand, takes the start value, the end value (inclusive), and the number of values to generate. With np.linspace() the number of values is fixed and they are evenly spaced across the range; with np.arange() the step is fixed and the number of values follows from it.

The image below will also help in explaining these. Credit: RealPython
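
The difference between the two functions can also be seen by calling both over the same range. This is a minimal sketch; by_step and by_count are illustrative names.

```python
import numpy as np

# Fixed step: the number of values follows from the step, and 10 is excluded
by_step = np.arange(0, 10, 2.5)
print(by_step)    # [0.  2.5 5.  7.5]

# Fixed count: the step follows from the count, and 10 is included
by_count = np.linspace(0, 10, 5)
print(by_count)   # [ 0.   2.5  5.   7.5 10. ]

print(len(by_step), len(by_count))   # 4 5
```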

Create an array with random numbers

np.random.normal() is a NumPy function that generates random numbers following a normal (Gaussian) distribution. The function takes three arguments:

loc: the mean of the distribution
scale: the standard deviation of the distribution
size: the shape of the output array (optional)

#create an array of 5 random numbers following a normal distribution with a mean of 0 and standard deviation of 1:
np.random.normal(0, 1, 5) #output: array([ 0.53276999, -0.63569854,  0.30679869, -0.79091318,  0.02547093])

#create a 3x3 array of random numbers with mean 0 and standard deviation 1
#np.random.normal(loc, scale, size)
np.random.normal(0, 1, (3,3))
#output: array([[ 1.10941128, -0.53880979, -1.25965531],[ 0.10272582, -0.28797406,  0.68592358],[ 2.51164442,  0.58626229, -1.85221223]])
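
The numbers shown above will differ every time the code runs. If you need reproducible results, one option (not used in the examples above) is a seeded generator created with np.random.default_rng(). The sketch below assumes an arbitrary seed of 42 and only checks that two generators with the same seed produce the same sample.

```python
import numpy as np

# Two generators created with the same seed (42 is an arbitrary choice)
rng_a = np.random.default_rng(42)
rng_b = np.random.default_rng(42)

# Generator.normal takes the same loc, scale and size arguments
sample_a = rng_a.normal(loc=0, scale=1, size=(3, 3))
sample_b = rng_b.normal(loc=0, scale=1, size=(3, 3))

print(sample_a.shape)                       # (3, 3)
print(np.array_equal(sample_a, sample_b))   # True
```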

Create identity matrix

An identity matrix is an n-by-n square matrix with ones on its main diagonal and zeros everywhere else.

# create a 2 by 2 identity matrix
np.eye(2, 2)
#output: array([[1., 0.],[0., 1.]])
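
The defining property of an identity matrix is that multiplying a matrix by it leaves the values unchanged. A minimal sketch, using an illustrative 3x3 matrix:

```python
import numpy as np

identity = np.eye(3)                      # 3x3 identity matrix
matrix = np.arange(1, 10).reshape(3, 3)   # values 1..9 arranged in 3 rows

# Matrix multiplication with the identity returns the same values
product = matrix @ identity
print(np.array_equal(product, matrix))    # True

# np.identity(n) builds the same square matrix as np.eye(n)
print(np.array_equal(np.identity(3), identity))   # True
```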

Numpy Array Indexing

Now that you have an overview of the different ways to create Numpy arrays, you may need to access a particular array element. Numpy array indexing means accessing one or more elements of a Numpy array. Here we will look at the different ways you can index elements of an array.

However, we need to remind ourselves that counting starts at index 0. This means that the first element has an index of 0, the second has an index of 1, and so on.

Assume we are given an array data = np.array([1, 2, 3])

Indexing our array with data[0] returns 1, which is the first element in our Numpy array. The index data[1] returns the second element in our array, which is 2, and of course data[2] returns 3.

Similarly, you may need to access more than one element of a Numpy array. To access more than one element, we use : to specify the start and stop. For example, indexing with data[0:2] means we want the elements from index 0 up to index 2. However, Python excludes the element at the stop index. In our example, data[0:2] returns [1, 2], which are the elements at index 0 and index 1. The element at index 2 is excluded.

NOTE: Python indexing starts at 0 and not 1. When using index slicing, all elements from the start index to the stop-1 index will be returned. For example, data[start:stop] will return all elements from the start index up to the element before the stop index.

Negative indices can also be used in Numpy array indexing. Negative indices start counting from the end of the array. For instance, data[-2:] will return the elements starting from the second to the last element of the array up to the last element. In our example, this will return [2, 3].
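
The same indexing ideas extend to two-dimensional arrays, where a row index and a column index are separated by a comma. The sketch below uses an illustrative 2x3 array named table.

```python
import numpy as np

table = np.array([[1, 2, 3],
                  [4, 5, 6]])

print(table[0, 1])     # 2 -> row 0, column 1
print(table[-1, -1])   # 6 -> last row, last column
print(table[1])        # [4 5 6] -> the whole second row
print(table[:, 0])     # [1 4] -> the first column
```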

Numpy Slicing

Slicing means accessing a specific portion, or subset, of an array. Reading a slice leaves the original array unchanged, but note that a basic Numpy slice is a view of the original data, so assigning to the elements of a slice also changes the original array. To take a slice we use [start:end], or [start:end:step] to include a step.

Note:

If we do not pass a start, it defaults to 0. For example, if we have an array arr = np.array([1, 2, 3, 4, 5]), we can slice it using:

arr[1:4]    # output array([2, 3, 4])

This returns the elements from index 1 up to index 4 (exclusive). We can also slice the array from the beginning by not passing the start index:

arr[:3]    # output array([1, 2, 3])

Similarly, we can slice the array up to the end by not passing the end index:

arr[2:]    # output array([3, 4, 5])

This returns the elements from index 2 to the end of the array. We can also specify a step size between elements using the syntax [start:end:step]. If we do not pass a step size, it is considered to be 1. For example, if we have an array arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]), we can slice it using:

arr[1:8:2]    # returns array([2, 4, 6, 8])

This returns the elements from index 1 up to index 8 (exclusive), with a step size of 2.

Negative Indices

In Numpy, we can also use negative indices to slice arrays. Negative indices refer to the elements at the end of the array. For example, if we have the same array arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9]), we can slice it using:

arr[-3:-1]    # array([7, 8])

This returns the elements starting from the third element from the end up to, and including, the second element from the end. The last element (index -1) is excluded.
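
One detail worth checking when slicing: a basic Numpy slice is a view of the original data rather than a copy, so writing into a slice changes the original array. The sketch below shows this, along with .copy() for an independent copy and a negative step to reverse an array. The names view and safe are illustrative.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])

view = arr[1:4]           # a view: shares memory with arr
view[0] = 99              # also changes arr[1]
print(arr)                # [ 1 99  3  4  5]

safe = arr[1:4].copy()    # an independent copy
safe[0] = -1              # arr is not affected this time
print(arr)                # [ 1 99  3  4  5]

print(arr[::-1])          # [ 5  4  3 99  1] -> a negative step reverses the array
```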

Built-in Methods

Numpy also provides a set of built-in functions that allow us to perform certain actions quickly. In this section, we will go through a few of these built-in functions and what they help us do.

np.zeros(): This returns a new array of zeros of a given shape. It takes the shape of the new array and, optionally, the dtype as inputs.

  #create an array of 5-dim vector of zeros
  np.zeros(5) # output:array([0., 0., 0., 0., 0.])

  #create a 2 X 2 matrix of zeros
  np.zeros((2, 2), dtype = int) # output: array([[0, 0],[0, 0]])

np.reshape(): This function is used to change the shape of an array without changing its data. It takes the array and the new shape as parameters and returns an array with the same data in the new shape.
```
  arr = np.array([1, 2, 3, 4, 5, 6])
  new_arr = np.reshape(arr, (2, 3))
  print(new_arr) # output: [[1 2 3][4 5 6]]
```
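
The new shape must hold exactly the same number of elements as the original array. As a small sketch of this, one dimension can be given as -1 so that Numpy works it out, and a shape that does not fit raises an error.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# -1 asks Numpy to work out that dimension: 6 elements / 3 rows = 2 columns
inferred = np.reshape(arr, (3, -1))
print(inferred.shape)   # (3, 2)

# A shape that does not fit the 6 elements raises a ValueError
try:
    np.reshape(arr, (4, 2))
except ValueError as err:
    print("cannot reshape:", err)
```
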
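np.transpose(): This function is used to permute the dimensions of an array. It takes the array and, optionally, the new order of the axes, and returns the array with its dimensions permuted. When no axes are given, the order of the dimensions is reversed, which for a two-dimensional array swaps rows and columns.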

arr = np.array([[1, 2], [3, 4]])
new_arr = np.transpose(arr)
print(new_arr) #output: [[1 3] [2 4]]
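
For arrays with more than two dimensions, the optional axes argument controls the new order of the dimensions. A minimal sketch with an illustrative array of shape (2, 3, 4):

```python
import numpy as np

cube = np.zeros((2, 3, 4))

# Default: the order of the axes is reversed
print(np.transpose(cube).shape)    # (4, 3, 2)

# Explicit order: swap the first two axes, keep the last one
swapped = np.transpose(cube, (1, 0, 2))
print(swapped.shape)               # (3, 2, 4)

# .T is shorthand for the default transpose
print(np.array_equal(cube.T, np.transpose(cube)))   # True
```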

np.concatenate(): This method is used to join two or more arrays along a specified axis. It takes the arrays to be joined and the axis as parameters and returns a new array with the concatenated data.

arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
new_arr = np.concatenate((arr1, arr2))
print(new_arr) #output: [1 2 3 4 5 6]
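
With two-dimensional arrays, the axis argument decides whether rows or columns are joined. The sketch below uses two small illustrative arrays, top and bottom.

```python
import numpy as np

top = np.array([[1, 2], [3, 4]])
bottom = np.array([[5, 6]])

# axis=0 (the default) stacks rows: shapes (2, 2) and (1, 2) give (3, 2)
rows = np.concatenate((top, bottom), axis=0)
print(rows.shape)    # (3, 2)

# axis=1 joins columns: shapes (2, 2) and (2, 1) give (2, 3)
cols = np.concatenate((top, bottom.T), axis=1)
print(cols.shape)    # (2, 3)
```

All dimensions other than the one being joined have to match, which is why bottom is transposed before joining along axis=1.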

np.split(): This function is used to split an array into multiple sub-arrays along a specified axis. It takes the array to be split and either the number of equal-sized sub-arrays or the indices at which to split, and returns a list of the sub-arrays.

arr = np.array([1, 2, 3, 4, 5, 6])
new_arr = np.split(arr, 3)
print(new_arr)
#output: [array([1, 2]), array([3, 4]), array([5, 6])]
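
The second form, splitting at given indices, can be sketched as follows. The example also shows that np.split() raises an error when an equal division is not possible, and that np.array_split() accepts uneven parts instead.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6])

# Split before index 2 and before index 5
parts = np.split(arr, [2, 5])
print(parts)   # [array([1, 2]), array([3, 4, 5]), array([6])]

# np.split needs an equal division: 6 elements cannot form 4 equal parts
try:
    np.split(arr, 4)
except ValueError as err:
    print("cannot split:", err)

# np.array_split allows uneven parts instead
uneven = np.array_split(arr, 4)
print([len(p) for p in uneven])   # [2, 2, 1, 1]
```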

Broadcasting

There are times when you might want to carry out an operation between an array and a single number (also called an operation between a vector and a scalar) or between arrays of two different sizes. Broadcasting is a mechanism that allows NumPy to perform operations on arrays of different shapes.

NumPy follows a small set of broadcasting rules to ensure that the operations are performed correctly. The shapes of the two arrays are compared dimension by dimension, starting from the last (trailing) dimension. Two dimensions are compatible when they are equal or when one of them is 1. If one array has fewer dimensions than the other, its shape is treated as if it were padded with ones on the left. If any pair of dimensions is incompatible, NumPy raises a ValueError.
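
As a rough way to check which shapes are compatible, np.broadcast_shapes() reports the resulting shape without building any arrays. This sketch assumes a reasonably recent NumPy (the function was added in version 1.20); the shapes used are only illustrative.

```python
import numpy as np

# Trailing dimensions are compared first: (3, 1) with (1, 4) gives (3, 4)
print(np.broadcast_shapes((3, 1), (1, 4)))   # (3, 4)

# A 1-D shape is padded on the left: (4,) acts like (1, 4)
print(np.broadcast_shapes((3, 4), (4,)))     # (3, 4)

# 3 and 4 are neither equal nor 1, so these shapes are incompatible
try:
    np.broadcast_shapes((3,), (4,))
except ValueError as err:
    print("incompatible:", err)
```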

Broadcasting between an array and a scalar

# create an array
a = np.array([1, 2, 3])
# multiply the array by a scalar
b = a * 2
print(b) # Output: [2 4 6]

In this example, we multiply the array a by the scalar 2 using the * operator. NumPy automatically broadcasts the scalar 2 to an array of the same shape as a (i.e., [2, 2, 2]) and then performs the multiplication element-wise.

Broadcasting between two arrays of different sizes

# create two arrays of different sizes
a = np.array([1, 2, 3])
b = np.array([4])
# add the two arrays
c = a + b
print(c) # Output: [5 6 7]

In this example, we add the arrays a and b using the + operator. Since the arrays have different sizes, NumPy automatically broadcasts the smaller array b to the same shape as a (i.e., [4, 4, 4]) and then performs the addition element-wise.
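
To inspect what the smaller array looks like once it has been stretched, np.broadcast_to() can be used. The sketch below reuses a and b from the example above and adds an illustrative column array to show broadcasting in two dimensions.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([4])

# What b looks like once stretched to the shape of a
stretched = np.broadcast_to(b, a.shape)
print(stretched)    # [4 4 4]

# A (3, 1) column plus a (3,) row broadcasts to a (3, 3) grid
column = np.array([[10], [20], [30]])
grid = column + a
print(grid.shape)   # (3, 3)
print(grid[1])      # [21 22 23]
```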

References

Numpy for beginners: numpy.org/doc/stable/user/absolute_beginner..
Numpy Indexing: geeksforgeeks.org/indexing-and-selecting-da..
Numpy Documentation: numpy.org/doc
Real Python: https://realpython.com/numpy-tutorial/
