2_Data Manipulation and Exploration with Numpy and Pandas - Part 2

Written By: Deborah Dormah Kanubala, Twitter

This article continues my previous article on NumPy. If you are already familiar with NumPy, you can jump right into this tutorial. Otherwise, I strongly encourage you to go through that article first and then come back to this one.

Pandas Installation

Pandas is a Python library for working with datasets. It provides functions and methods for analyzing, cleaning, exploring, and manipulating data.

To install pandas from the terminal, run: pip install pandas
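A quick way to confirm that the installation worked is to import the library and print its version. This is only a minimal sketch, and it assumes pandas was installed into the same environment as the Python interpreter you are running:

```python
import pandas as pd

# If pandas is missing from the active environment, the import above
# raises an ImportError instead of reaching this line.
installed_version = pd.__version__
print(installed_version)
```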

Why Use Pandas

  • Pandas allows us to analyze big datasets easily.

  • Pandas makes it easy to clean messy datasets and make them readable and relevant.

  • Pandas also provides useful methods that allow us to merge and concatenate different datasets.
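As a rough illustration of the last point, the sketch below merges and concatenates a few small tables. The tables and the variable names (prices, cities, more_prices) are made up for this example:

```python
import pandas as pd

# Two small, made-up tables that share a "name" column.
prices = pd.DataFrame({"name": ["Ama", "Kwame"], "price": [3200, 1400]})
cities = pd.DataFrame({"name": ["Ama", "Kwame"], "location": ["Buipe", "Tamale"]})

# merge() joins the two tables side by side on the shared column.
merged = pd.merge(prices, cities, on="name")

# concat() stacks tables on top of each other;
# ignore_index=True renumbers the rows from 0.
more_prices = pd.DataFrame({"name": ["Kwasi"], "price": [1200]})
stacked = pd.concat([prices, more_prices], ignore_index=True)

print(merged)
print(stacked)
```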

Import Pandas

import pandas as pd

Once we have installed and imported pandas, we can refer to it as pd. The code below creates a pandas data frame using pd.DataFrame() and then summarizes and sorts it.

data = pd.DataFrame({'group':['a', 'a', 'a', 'b','b', 'b', 'c', 'c','c'],'ounces':[4, 3, 12, 6, 7.5, 8, 3, 5, 6]})
data
data.describe()  # computes summary statistics for the numeric columns
# Sort the data frame by ounces. With inplace=False (the default) a sorted copy is returned and data is unchanged; inplace=True would modify data itself.
data.sort_values(by=['ounces'], ascending=True, inplace=False)
# Sort by multiple columns: group ascending, then ounces descending within each group.
data.sort_values(by=['group', 'ounces'], ascending=[True, False], inplace=False)
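The effect of the inplace argument is easy to miss, so here is a small sketch of the difference. It rebuilds the same data frame so that it can run on its own; the helper names (sorted_copy, first_before, returned, first_after) are only for illustration:

```python
import pandas as pd

data = pd.DataFrame({
    'group': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6],
})

# Default (inplace=False): a new sorted DataFrame is returned,
# and the original keeps its row order.
sorted_copy = data.sort_values(by='ounces')
first_before = data['ounces'].iloc[0]

# inplace=True: the original DataFrame is reordered and the call returns None.
returned = data.sort_values(by='ounces', inplace=True)
first_after = data['ounces'].iloc[0]

print(first_before, first_after, returned)
```

Assigning the result of a call that uses inplace=True is a common mistake, because the variable ends up holding None.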

Pandas DataFrames

Datasets in pandas are stored as two-dimensional tables of rows and columns, called DataFrames.

We can create a data frame from a dictionary that maps each column name to a list of values, using the code below:

series= {"calories": [420, 380, 390, 210, 671, 111, 321, 214, 566, 222],
  "duration": [50, 40, 45, 35, 60, 25, 62, 95, 55, 45]}
df = pd.DataFrame(series)
df
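Each column of a data frame is a pandas Series. As a minimal sketch, the same kind of table can also be built from explicit pd.Series objects; the shortened value lists and the name df_from_series are chosen only for this example:

```python
import pandas as pd

# Two Series that share the default integer index 0, 1, 2.
calories = pd.Series([420, 380, 390], name="calories")
duration = pd.Series([50, 40, 45], name="duration")

# The dictionary keys become column names; rows are aligned on the index.
df_from_series = pd.DataFrame({"calories": calories, "duration": duration})
print(df_from_series)
```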

Pandas Indexing and Data Selection

There are a lot of different ways to select elements, rows, and columns from a data frame. In the next few lines of code, we will look at different ways to index/select:

  • dataframe.loc[]: used for label-based selection

  • dataframe.iloc[]: used for position-based (integer) selection

  • dataframe.ix[]: used for both label-based and integer-based selection (NB: this was deprecated and has been removed in recent pandas versions, so use loc or iloc instead)

series = {"calories": [420, 380, 390, 210, 671, 111, 321, 214, 566, 222, 610, 882],
          "month": ["January", "February", "March", "April", "May", "June", "July",
                    "August", "September", "October", "November", "December"],
          "duration": [50, 40, 45, 35, 60, 25, 62, 95, 55, 45, 11, 23],
          "names": ["Kwame", "Ama", "Kwasi", "Kwadwo", "Ebo", "Abena", "Akua",
                    "Asi", "Abena", "Yaw", "Akosua", "Afiba"],
          "location": ["Tamale", "Buipe", "Damango", "Dungu", "Savelugu",
                       "Bolgatanga", "Yendi", "Sawla", "Tuna", "Yapei", "Walewale", "Navrongo"],
          "prices": [1400, 3200, 1200, 4560, 7761, 9900, 4955, 8722, 9111, 6093, 7021, 2531]}
df = pd.DataFrame(series)

We can then perform the following operations on the data frame we just created.

#column names
df.columns

#select a single column
df["calories"]

#select multiple columns by passing a list of column names
df[["calories", "month", "location"]]
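Single and double brackets return different types, which matters for what you can do next with the result. A minimal sketch, using a small made-up frame named df_small:

```python
import pandas as pd

df_small = pd.DataFrame({"calories": [420, 380], "location": ["Tamale", "Buipe"]})

# Single brackets with one column name return a Series.
single = df_small["calories"]

# Double brackets (a list of names) return a DataFrame, even for one column.
double = df_small[["calories"]]

print(type(single).__name__, type(double).__name__)
```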

dataframe.loc[]

# first row
df.loc[0]

#select multiple rows
df.loc[[0, 1, 3]]

#select rows(first, second and sixth rows) and columns(calories, month, names) using loc
df.loc[[0, 1, 5], ["calories", "month", "names"]]

#select all rows and specific columns
df.loc[:, ["calories", "month"]]

#select the second to fourth rows (labels 1 to 3; loc slices include the end label) and all columns
df.loc[1:3,:]
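loc also accepts a boolean condition in place of row labels, which is a common way to filter rows. The sketch below assumes a cut-down version of the data frame (df_small) and an arbitrary threshold of 385 calories:

```python
import pandas as pd

df_small = pd.DataFrame({
    "calories": [420, 380, 390, 210],
    "month": ["January", "February", "March", "April"],
})

# Keep only the rows where the condition is True, and two chosen columns.
high = df_small.loc[df_small["calories"] > 385, ["month", "calories"]]
print(high)
```

The original row labels (0 and 2 here) are kept in the result, so they can still be used with loc afterwards.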

dataframe.iloc[]

#select a single row
df.iloc[3]

#select all rows and some columns
df.iloc[:, [1, 2]]
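With the default integer index, loc and iloc can look interchangeable. The difference is clearer when the index holds names. This is a minimal sketch on a made-up frame (df_small) indexed by name:

```python
import pandas as pd

df_small = pd.DataFrame(
    {"calories": [420, 380, 390, 210]},
    index=["Kwame", "Ama", "Kwasi", "Kwadwo"],
)

# loc works on labels, and a label slice includes its end point.
by_label = df_small.loc["Ama":"Kwasi"]

# iloc works on integer positions, and a position slice excludes its end point.
by_position = df_small.iloc[1:2]

print(len(by_label), len(by_position))
```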

Pandas Methods

Function                       Description/Use
dataframe.head()               Return the top n rows of a data frame (5 by default).
dataframe.tail()               Return the bottom n rows of a data frame (5 by default).
dataframe.shape                Return the number of rows and columns as a tuple. This is an attribute, so it is written without parentheses.
dataframe.groupby()            Group rows by the values in one or more columns so that an aggregation can be applied to each group.
dataframe.isnull().sum()       Return the number of missing values in each column.
dataframe.insert()             Insert a column into the data frame at the specified location.
dataframe.copy()               Make a copy of the original data frame.
dataframe.value_counts()       Return the counts of unique rows (on a single column, the counts of its unique values).
dataframe.drop_duplicates()    Return a data frame with duplicate rows removed.
dataframe.memory_usage()       Return the memory usage of each column (in bytes).
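To see several of these in action, here is a short sketch on a made-up data frame (df_demo) that deliberately contains one duplicate row and one missing value:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "ounces": [4.0, 4.0, 6.0, np.nan, 8.0],
})

print(df_demo.head(2))                   # first two rows
n_rows, n_cols = df_demo.shape           # attribute, no parentheses
missing = df_demo.isnull().sum()         # missing values per column
deduped = df_demo.drop_duplicates()      # the second row repeats the first
group_means = df_demo.groupby("group")["ounces"].mean()  # NaN is skipped
counts = df_demo["group"].value_counts() # occurrences of each group label

backup = df_demo.copy()                  # independent copy of the table
df_demo.insert(0, "id", range(len(df_demo)))  # new first column, in place
print(df_demo.memory_usage())            # bytes used by the index and each column
```

Because backup was copied before the insert, it still has only the two original columns.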

Load Data Using Pandas

One of the best parts of pandas is that it can load datasets stored in many different file formats. In this section, we will briefly look at how to load files of different types.

Reading CSV files

df_csv = pd.read_csv('data.csv')

Reading JSON files

df_json = pd.read_json('data.json')

Reading Excel files

df_excel = pd.read_excel('data.xlsx')

Reading HTML files

#read_html returns a list of data frames, one for each table found in the page
df_html_tables = pd.read_html('data.html')
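The file names above are placeholders. To try the readers without any files on disk, one option is to wrap text in io.StringIO, which behaves like an open file. The sketch below does this for CSV and JSON only; the sample text is made up:

```python
import io
import pandas as pd

# CSV text standing in for the contents of a .csv file.
csv_text = "name,calories\nKwame,420\nAma,380\n"
df_csv = pd.read_csv(io.StringIO(csv_text))

# JSON text (a list of records) standing in for a .json file.
json_text = '[{"name": "Kwame", "calories": 420}, {"name": "Ama", "calories": 380}]'
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```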

Working with the Titanic dataset

  • You will now have to work in break-out rooms. All instructions should be carefully followed.

  • Each room will have one volunteer.

  • One participant will voluntarily or randomly be chosen to share their screen with their notebook.

  • The participant sharing their screen should not be the one speaking; the other participants should direct what is written.

  • Download the Titanic data from the shared Google Drive folder.

  • Now, follow the instructions in the notebook.

Final tips:

  • Continue practicing daily: learning programming == learning a new language like EN, DE, FR, etc.

  • Use StackOverflow if you are not sure of how to do something.

  • Instead of copying and pasting code, write the code out by yourself.

  • Read other people's code as much as possible.

  • Don't freak out about errors; read them and try to understand them.

  • Copy the error message and search for it on Google. There is a high chance it is already a solved problem.

  • If you are preparing for technical interviews, make sure to use resources such as HackerRank, LeetCode, and Codewars.

References