Exploring a Python Pandas dataframe
pandas is an open-source, BSD-licensed library for data analysis in the Python programming language. It is built on top of NumPy, which provides the multidimensional array objects that pandas uses to store data, and it integrates with Matplotlib, whose plotting capabilities it uses for data visualization. It is most often used for data manipulation and data visualization. pandas is designed to work with tabular data, which it holds in an object called a DataFrame. All values within a column share the same data type, but different columns can have different data types.
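As a small illustration of the point about column data types, here is a minimal sketch. The DataFrame and its values are made up for this example and are not part of any real dataset.

```python
import pandas as pd

# Hypothetical sample data; the names and numbers are invented for illustration.
df = pd.DataFrame({
    "name": ["Bella", "Max", "Lucy"],
    "height_cm": [56, 43, 46],
    "weight_kg": [24.5, 23.0, 22.1],
})

# Each column has a single data type, but the types differ between columns:
# text for name, integers for height_cm, floats for weight_kg.
print(df.dtypes)
```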
Most Common Methods and Attributes of a Dataframe
Suppose df is a dataframe holding some dataset.
- df.head() – returns the first few rows (the “head” of the DataFrame).
- df.info() – shows information on each of the columns, such as the data type and the number of non-null (non-missing) values.
- df.shape – returns the number of rows and columns of the DataFrame. This is an attribute of the DataFrame object, not a method, so it is written without parentheses.
- df.describe() – calculates a few summary statistics for each column.
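The calls above can be tried on a small example. This is only a sketch: the dataframe below is a made-up sample standing in for whatever dataset df holds.

```python
import pandas as pd

# Hypothetical sample data standing in for a real dataset.
df = pd.DataFrame({
    "name": ["Bella", "Max", "Lucy", "Charlie", "Cooper"],
    "height_cm": [56, 43, 46, 18, 49],
    "weight_kg": [24.5, 23.0, 22.1, 2.3, 17.4],
})

# First three rows; head() with no argument returns the first five.
print(df.head(3))

# Column names, data types and non-null counts (printed directly by info()).
df.info()

# (rows, columns) tuple; shape is an attribute, so no parentheses.
print(df.shape)

# Summary statistics (count, mean, std, min, quartiles, max) for numeric columns.
summary = df.describe()
print(summary)
```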
Parts of a Dataframe
- df.values – A two-dimensional NumPy array of values.
- df.columns – An index of columns: the column names.
- df.index – An index for the rows: either row numbers or row names.
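To see the three parts side by side, here is a minimal sketch using a made-up dataframe (the data is hypothetical):

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "name": ["Bella", "Max", "Lucy"],
    "height_cm": [56, 43, 46],
})

values = df.values    # two-dimensional NumPy array holding the data
columns = df.columns  # Index object holding the column names
index = df.index      # row labels; row numbers by default

print(values)
print(columns)
print(index)
```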
Sorting of a dataframe
Suppose we have a dog dataframe with 50 rows and the columns [name, breed, color, height_cm, weight_kg, date_of_birth].
- dog.sort_values('weight_kg') – lightest dog at the top and heaviest at the bottom
- dog.sort_values('weight_kg', ascending=False) – heaviest at the top and lightest at the bottom
- dog.sort_values(['weight_kg', 'height_cm']) – sorts by weight_kg first, then by height_cm (shortest to tallest) among dogs of equal weight; the column names must be passed as a list
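The three sorts can be sketched on a tiny stand-in for the dog dataframe. The four rows below are invented for illustration, with two dogs sharing the same weight so that the second sort key has a visible effect.

```python
import pandas as pd

# Hypothetical stand-in for the dog dataframe (the real one has 50 rows).
dog = pd.DataFrame({
    "name": ["Bella", "Max", "Lucy", "Charlie"],
    "height_cm": [56, 43, 46, 50],
    "weight_kg": [24, 22, 24, 20],
})

# Lightest to heaviest.
by_weight = dog.sort_values("weight_kg")

# Heaviest to lightest.
heaviest_first = dog.sort_values("weight_kg", ascending=False)

# Sort by weight, then break ties by height; the columns go in a list.
by_weight_then_height = dog.sort_values(["weight_kg", "height_cm"])

print(by_weight_then_height)
```

Passing a list to ascending as well, for example ascending=[True, False], lets each column be sorted in its own direction.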
Subsetting of a dataframe
Suppose we have the same dog dataframe from above.
- dog["name"] – selects a single column
- dog[["breed", "height_cm"]] – the inner brackets create the list of column names to select, and the outer brackets do the subsetting
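A quick sketch of the difference between the two forms, again on a made-up sample of the dog dataframe: single brackets with one column name return a Series, while a list of names returns a DataFrame.

```python
import pandas as pd

# Hypothetical stand-in for the dog dataframe.
dog = pd.DataFrame({
    "name": ["Bella", "Max", "Lucy"],
    "breed": ["labrador", "poodle", "labrador"],
    "height_cm": [56, 43, 46],
})

# One column name: the result is a pandas Series.
names = dog["name"]

# A list of column names: the result is a DataFrame with just those columns.
cols_to_subset = ["breed", "height_cm"]
breed_and_height = dog[cols_to_subset]

print(type(names))
print(breed_and_height)
```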
Subsetting the rows of a dataframe
Suppose we have the same dog dataframe from above.
- dog['height_cm'] > 50 – returns True or False for each row, depending on whether it satisfies the condition
- dog[dog['height_cm'] > 50] – returns only the rows that satisfy the condition
- dog[dog['breed'] == 'labrador'] – subsetting based on text data
- dog[dog['date_of_birth'] == '2015-01-01'] – subsetting based on dates
- dog[(dog['breed'] == 'labrador') & (dog['color'] == 'brown')] – returns only the rows that satisfy both conditions; each condition needs its own parentheses
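The row filters can be sketched as follows. The data is a made-up sample, the lowercase breed and color values are an assumption about how the text is stored (string comparison is case-sensitive), and the date column is assumed to have been converted with pd.to_datetime.

```python
import pandas as pd

# Hypothetical stand-in for the dog dataframe.
dog = pd.DataFrame({
    "name": ["Bella", "Max", "Lucy", "Charlie"],
    "breed": ["labrador", "poodle", "labrador", "chihuahua"],
    "color": ["brown", "black", "black", "tan"],
    "height_cm": [56, 43, 59, 18],
    "date_of_birth": pd.to_datetime(
        ["2013-07-01", "2015-01-01", "2015-01-01", "2016-09-16"]
    ),
})

# Boolean Series: one True/False per row.
is_tall = dog["height_cm"] > 50

# Use the boolean Series to keep only the matching rows.
tall = dog[is_tall]

# Text condition.
labradors = dog[dog["breed"] == "labrador"]

# Date condition, compared against an explicit Timestamp.
born_2015_01_01 = dog[dog["date_of_birth"] == pd.Timestamp("2015-01-01")]

# Two conditions combined with &; each one is wrapped in parentheses.
brown_labradors = dog[(dog["breed"] == "labrador") & (dog["color"] == "brown")]

print(brown_labradors)
```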
Subsetting using .isin() – suitable for categorical variables
is_black_or_brown = dog['color'].isin(['black', 'brown'])
dog[is_black_or_brown]
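Here is the same .isin() filter as a runnable sketch on a made-up sample; the color values are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the dog dataframe.
dog = pd.DataFrame({
    "name": ["Bella", "Max", "Lucy", "Charlie"],
    "color": ["brown", "black", "white", "tan"],
})

# True where the color is one of the listed values.
is_black_or_brown = dog["color"].isin(["black", "brown"])

# Keep only the black or brown dogs.
black_or_brown_dogs = dog[is_black_or_brown]
print(black_or_brown_dogs)
```

This is equivalent to joining several == conditions with |, but it stays readable as the list of allowed values grows.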
Also check this Medium article.