Exploring a Python Pandas dataframe


Pandas is an open-source, BSD-licensed library used for data analysis in the Python programming language. It is built on top of NumPy, and its plotting features rely on Matplotlib. NumPy provides multidimensional array objects for easy data manipulation, which pandas uses to store data, while Matplotlib provides powerful data visualization capabilities. Pandas is most popularly used for data manipulation and data visualization. It is designed to work with tabular data, which it represents as a dataframe. All values within a column share the same data type, but different columns can have different data types.

Most Common Methods and Attributes of a Dataframe

Suppose df is a dataframe holding some dataset.

  • df.head() – returns the first few rows (the “head” of the DataFrame), five by default.
  • df.info() – shows information on each of the columns, such as the data type and the number of non-null values, from which the number of missing values follows.
  • df.shape – returns the number of rows and columns of the DataFrame. This is an attribute of the dataframe object, not a method, so it is written without parentheses.
  • df.describe() – calculates a few summary statistics for each numeric column.
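As a rough illustration of the four items above, here is a minimal sketch. The sample rows are made up for this example and are not taken from any real dataset.

```python
import pandas as pd

# Made-up sample data, used only for illustration.
df = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max", "Stella"],
    "height_cm": [56, 43, 46, 49, 59, 18],
    "weight_kg": [24.0, 24.0, 17.0, 29.0, None, 2.0],  # one missing value
})

first_rows = df.head()      # first five rows by default
print(first_rows)

df.info()                   # prints the dtype and non-null count of each column

n_rows, n_cols = df.shape   # an attribute, so no parentheses
print(n_rows, n_cols)

summary = df.describe()     # summary statistics for the numeric columns
print(summary)
```

Passing a number, as in df.head(3), changes how many rows are returned.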

Parts of a Dataframe

  • df.values – A two-dimensional NumPy array of values.
  • df.columns – An index of columns: the column names.
  • df.index – An index for the rows: either row numbers or row names.
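The three parts can be inspected directly. The snippet below is a small sketch on a made-up dataframe; the names and values are illustrative only.

```python
import pandas as pd

# Made-up sample data, used only for illustration.
dog = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "height_cm": [56, 43, 46],
})

values = dog.values      # two-dimensional NumPy array holding the data
columns = dog.columns    # Index object holding the column names
index = dog.index        # row labels; row numbers 0, 1, 2 here

print(values.shape)
print(list(columns))
print(list(index))
```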

Sorting of a dataframe

Suppose we have a dog dataframe with 50 rows and the columns [name, breed, color, height_cm, weight_kg, date_of_birth].

  • dog.sort_values('weight_kg') – lightest dog at the top and heaviest at the bottom
  • dog.sort_values('weight_kg', ascending=False) – heaviest at the top and lightest at the bottom
  • dog.sort_values(['weight_kg', 'height_cm']) – sorts by weight_kg from lightest to heaviest, then by height_cm from shortest to tallest among dogs of equal weight
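The sorting calls above could look like this in practice. This is a sketch on four made-up rows rather than the 50-row dataframe described in the text; the last call, which passes a list to ascending, is an extra variation not covered above.

```python
import pandas as pd

# Made-up sample data, used only for illustration.
dog = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [24, 24, 17, 29],
})

# Single column, ascending (default) and descending.
lightest_first = dog.sort_values("weight_kg")
heaviest_first = dog.sort_values("weight_kg", ascending=False)

# Several columns: pass a list; later columns break ties in earlier ones.
by_weight_then_height = dog.sort_values(["weight_kg", "height_cm"])

# One sort direction per column: weight ascending, height descending.
mixed = dog.sort_values(["weight_kg", "height_cm"], ascending=[True, False])

print(by_weight_then_height)
```

Note that sort_values returns a new dataframe and leaves dog itself unchanged.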

Subsetting columns of a dataframe

Suppose we have the same dog dataframe from above.

  • dog["name"] – selects a single column, returned as a Series
  • dog[["breed", "height_cm"]] – the inner brackets create the list of column names to select, and the outer brackets do the subsetting
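A short sketch of the difference between single and double brackets, again on made-up rows:

```python
import pandas as pd

# Made-up sample data, used only for illustration.
dog = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "breed": ["Labrador", "Poodle", "Chow Chow"],
    "height_cm": [56, 43, 46],
})

name_col = dog["name"]                  # single brackets: a Series
two_cols = dog[["breed", "height_cm"]]  # double brackets: a DataFrame

# The list can also be built first, which some find easier to read.
cols_to_keep = ["breed", "height_cm"]
same_two_cols = dog[cols_to_keep]

print(type(name_col).__name__, type(two_cols).__name__)
```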

Subsetting rows of a dataframe

Suppose we have the same dog dataframe from above.

  • dog['height_cm'] > 50 – returns True or False for each row, depending on whether it satisfies the condition
  • dog[dog['height_cm'] > 50] – returns only the rows that satisfy the condition
  • dog[dog['breed'] == 'Labrador'] – subsetting based on text data
  • dog[dog['date_of_birth'] == '2015-01-01'] – subsetting based on dates
  • dog[(dog['breed'] == 'Labrador') & (dog['color'] == 'Brown')] – subsetting based on multiple conditions; each condition needs its own parentheses
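The row filters above can be tried on a small dataframe. The sketch below uses made-up rows and assumes date_of_birth has been parsed into datetimes with pd.to_datetime; if the column holds plain strings instead, the same comparison is an ordinary string match.

```python
import pandas as pd

# Made-up sample data, used only for illustration.
dog = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Chow Chow", "Labrador"],
    "color": ["Brown", "Black", "Brown", "Black"],
    "height_cm": [56, 43, 46, 59],
    "date_of_birth": pd.to_datetime(
        ["2013-07-01", "2016-09-16", "2015-01-01", "2011-12-11"]
    ),
})

is_tall = dog["height_cm"] > 50          # boolean Series, one value per row
tall_dogs = dog[is_tall]                 # keeps rows where the mask is True

labradors = dog[dog["breed"] == "Labrador"]               # text condition
born_2015 = dog[dog["date_of_birth"] == "2015-01-01"]     # date condition

# Combine conditions with & (and) or | (or); wrap each one in parentheses.
brown_labs = dog[(dog["breed"] == "Labrador") & (dog["color"] == "Brown")]

print(brown_labs)
```

Text comparisons are case-sensitive, so 'labrador' would not match 'Labrador' here.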

Subsetting using .isin() – suitable for categorical variables

is_black_or_brown = dog['color'].isin(['Black', 'Brown'])
dog[is_black_or_brown]
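For a runnable version of the .isin() filter, here is a minimal sketch on made-up rows. It also shows that the same result can be written as several == conditions joined with |, which is what .isin() saves you from typing.

```python
import pandas as pd

# Made-up sample data, used only for illustration.
dog = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "color": ["Brown", "Black", "White", "Gray"],
})

# True where the color is one of the listed values.
is_black_or_brown = dog["color"].isin(["Black", "Brown"])
black_or_brown_dogs = dog[is_black_or_brown]

# Equivalent filter written with explicit conditions.
same_result = dog[(dog["color"] == "Black") | (dog["color"] == "Brown")]

# The ~ operator inverts the mask to select all other colors.
other_dogs = dog[~is_black_or_brown]

print(black_or_brown_dogs)
```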

Also check this Medium article.
