AMA #3 Questions & Answers Part 2.

Your most burning questions from AMA #3, continued. 

I noticed that to work on columns in DataFrames, I sometimes need to use df['colName'].method and at other times df.colName.method. Is there a best practice for doing this or is it just preference?


It is usually preference, although I find it better to use df['colName'] because bracket notation always works: dot notation fails when the column name is two words (it contains a space), and it can clash with the name of an existing DataFrame attribute or method. It also depends on the context. For instance, when you are iterating over column names in a for loop and colname is the loop variable, you have to use df[colname]. The string form is also coloured differently in Jupyter, which helps the reader see that it is a column. Dot notation cannot be used to create a new column: assigning to df.newCol only sets an attribute on the DataFrame object. You may find out more here: https://github.com/pandas-dev/pandas/issues/7175
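
To make the difference concrete, here is a small sketch. The DataFrame and its column names ("price", "unit cost", "margin") are made up for illustration.

```python
import pandas as pd

# Hypothetical data: one single-word column name and one two-word name.
df = pd.DataFrame({"price": [1.0, 2.0], "unit cost": [0.5, 0.8]})

# Both notations return the same Series for a simple column name.
same = df["price"].equals(df.price)

# Only brackets can reach a column whose name contains a space.
two_words = df["unit cost"]

# In a loop the column name lives in a variable, so brackets are required.
maxima = {colname: df[colname].max() for colname in df.columns}

# Brackets create a new column; df.margin = ... would only set an attribute.
df["margin"] = df["price"] - df["unit cost"]

print(same, maxima, list(df.columns))
```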

I notice pandas can import data with data = pd.read_csv('data.csv') and convert it to an array with data.values, so what is the use of importing data with NumPy?


NumPy arrays are the underlying data structure of the DataFrame in the pandas library. NumPy is written in Python and C, and has optimized, efficient code for working with arrays. Operations on NumPy arrays are generally faster than the same operations on Python lists. If you’re interested, you can read about the implementation in the NumPy documentation or search for related questions on Stack Overflow: https://stackoverflow.com/a/994010

data.values extracts the values from data (a DataFrame) as a NumPy array and throws away the column labels, row index and other metadata. df.values is a useful hack when the automatic alignment of columns/indexes in pandas arranges the result in an unwanted order, for example when you combine two objects whose rows or columns are labelled in a different order. The key point is that extracting only the values hides the row/column index information, so no alignment takes place.
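
A minimal sketch of that alignment behaviour, using two made-up Series whose labels are in opposite order:

```python
import pandas as pd

# Two hypothetical Series with the same labels in a different order.
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["z", "y", "x"])

# pandas aligns on labels: x is added to x, y to y, z to z.
aligned = a + b

# .values drops the labels, so NumPy adds element by element, by position.
positional = a.values + b.values

print(aligned)
print(positional)
```

Whether label alignment or positional arithmetic is the "right" result depends on what the rows mean, so extracting the values is only appropriate when position is what you intend.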

Does anyone know of any Python package (eg: DataMaid in R) that can go thru a DataFrame and produce summary stats as output? eg: top 20 values, outlier detection, etc, etc.


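You can use df.describe(). It shows basic summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for the numeric columns. For the most frequent values of a column, df['colName'].value_counts().head(20) returns the top 20.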
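
As a rough sketch of building such a summary by hand with pandas alone: the data and column names below are invented, and the 1.5 × IQR rule is just one common convention for flagging outliers, not something built into describe().

```python
import pandas as pd

# Hypothetical data with one obviously extreme value.
df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "A"],
    "sales": [10, 12, 11, 13, 100],
})

# Basic statistics for a numeric column.
stats = df["sales"].describe()

# Most frequent values of a column (top 20 at most).
top_values = df["city"].value_counts().head(20)

# Simple outlier check: flag values more than 1.5 * IQR outside the quartiles.
q1 = df["sales"].quantile(0.25)
q3 = df["sales"].quantile(0.75)
iqr = q3 - q1
is_outlier = (df["sales"] < q1 - 1.5 * iqr) | (df["sales"] > q3 + 1.5 * iqr)
outliers = df[is_outlier]

print(stats)
print(top_values)
print(outliers)
```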

What is the difference between map, apply and applymap?

Map: It applies a function to each element of a Series. df['column1'].map(lambda x: 10 + x) adds 10 to each element of column1. df['column2'].map(lambda x: 'AV' + x) prepends “AV” to each element of column2 (the column must contain strings).
Apply: As the name suggests, it applies a function along an axis of the DataFrame. df[['column1', 'column2']].apply(sum) returns the sum of the values in column1 and the sum of the values in column2, one result per column.
ApplyMap: It applies a function to each element of a DataFrame. With func = lambda x: x + 2, df.applymap(func) adds 2 to each element of the DataFrame (all columns must be numeric). Note that pandas 2.1 and later deprecate applymap in favour of DataFrame.map.
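
The three can be compared side by side in a short sketch. The DataFrames are made up, and the fallback between DataFrame.map and applymap is only there so that the example runs on both older and newer pandas versions.

```python
import pandas as pd

df = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "b", "c"]})

# map: element by element over a single Series.
plus_ten = df["column1"].map(lambda x: 10 + x)
prefixed = df["column2"].map(lambda x: "AV" + x)

# apply: the function receives a whole column (axis=0) or a whole row (axis=1).
numeric = pd.DataFrame({"column1": [1, 2, 3], "column3": [4, 5, 6]})
col_sums = numeric.apply(sum)
row_sums = numeric.apply(sum, axis=1)

# applymap: element by element over every cell of a DataFrame.
# Newer pandas versions name this DataFrame.map, so use it when available.
elementwise = numeric.map if hasattr(numeric, "map") else numeric.applymap
plus_two = elementwise(lambda x: x + 2)

print(plus_ten.tolist(), prefixed.tolist())
print(col_sums.tolist(), row_sums.tolist())
print(plus_two)
```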

Can we page through the results of the help() command instead? At the moment, if, for example, I type help(“numpy.array”) in IPython, it scrolls through the entire help output. It would be easier to read in pages.


Not easily, unless you get hacky. I would suggest that the simplest method is to just open the documentation in a separate browser window.
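
For anyone who does want to get hacky, one possible approach is to render the help text to a string with the standard pydoc module and split it into pages yourself. This is only a sketch: help_pages and lines_per_page are names invented here, and dict stands in for whatever object you want help on.

```python
import pydoc


def help_pages(obj, lines_per_page=20):
    """Return the help text for obj as a list of pages (strings)."""
    # render_doc builds the same text help() shows; plaintext avoids
    # the terminal bold/overstrike formatting.
    text = pydoc.render_doc(obj, renderer=pydoc.plaintext)
    lines = text.splitlines()
    return [
        "\n".join(lines[start:start + lines_per_page])
        for start in range(0, len(lines), lines_per_page)
    ]


pages = help_pages(dict, lines_per_page=20)
print(f"{len(pages)} pages")
print(pages[0])  # show the first page; print pages[1], pages[2], ... as needed
```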

What is the difference between series.plot(‘hist’), series.plot.hist() and series.hist()?


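They are different ways of drawing a histogram of the same data. series.plot('hist') and series.plot.hist() are equivalent calls to the pandas plotting interface, while series.hist() is a separate convenience method with slightly different defaults (for example, it draws a grid).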

There are many platforms that can be used for Python programming, such as Jupyter Notebook, Nvidia Digits and Spyder. Which one is the best?


This depends on your workflow and personal preference. For example, if you are used to MATLAB or other GUI-based environments, you can use Spyder. Jupyter notebooks work well if you are trying to quickly experiment with different pieces of code or are creating a tutorial for people to follow, since you can insert explanations and pictures in between blocks of code and run each block separately.

What is the difference between a method and an attribute, and how do I decide which one to use?


Attributes are the features of an object, stored as variables on the object or its class, whereas methods are the operations or activities performed by that object, defined as functions in the class.

For example, if dog is an object of class Animal,

Limbs=4
Eyes=2
Tail=1

are the attributes or features.

Move()
Bark()
Eat(food)

are the methods or functions.
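
In Python, the same idea could be sketched as follows. The name attribute and the value "Rex" are added purely for illustration.

```python
class Animal:
    def __init__(self, name):
        # Attributes: data stored on the object.
        self.name = name
        self.limbs = 4
        self.eyes = 2
        self.tail = 1

    # Methods: functions defined in the class that act on the object.
    def bark(self):
        return f"{self.name} says woof"

    def eat(self, food):
        return f"{self.name} eats {food}"


dog = Animal("Rex")
print(dog.limbs)          # attribute access: no brackets
print(dog.bark())         # method call: brackets run the function
print(dog.eat("kibble"))  # methods can take arguments
```

As a rule of thumb, use an attribute to read a stored value and a method when something has to be computed or done.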

I use the “beautifulsoup” package to import HTML as an object called “soup”. “soup.title” can be accessed, but it is not shown in the list from “dir(soup)”. “dir()” is supposed to show all the attributes, right?


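No, dir() does not show an exhaustive list of attributes. It lists the names it can discover on the object and its class, but attributes that are resolved dynamically at lookup time do not appear. BeautifulSoup does this through __getattr__, which turns soup.title into a search for the first title tag. Learn more here: https://docs.python.org/2/library/functions.html#dir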
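
The effect can be reproduced without BeautifulSoup. The Page class below is a hypothetical stand-in that resolves attributes dynamically in a similar spirit; it is not how the library is actually implemented.

```python
class Page:
    """Hypothetical stand-in for a parsed document."""

    def __init__(self, tags):
        self._tags = tags

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails.
        tags = self.__dict__.get("_tags", {})
        if name in tags:
            return tags[name]
        raise AttributeError(name)


page = Page({"title": "AMA #3"})

print(page.title)            # works: resolved by __getattr__
print("title" in dir(page))  # False: dir() cannot see dynamic attributes
print("_tags" in dir(page))  # True: a real instance attribute
```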

It seems “df.info” and “df.info()” are not the same. Why did “df.info” return the df?

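df.info() is a method that prints summary information about the DataFrame. Without the ‘()’ brackets, the method does not get run. Instead, you get back a reference to the method itself, and because that method is bound to df, its printed representation includes the DataFrame, which is why it looks as if df was returned.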
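
A short sketch of the difference, using a made-up one-column DataFrame. The buf argument is used here only so that the summary can be captured as a string.

```python
import io

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})

# No brackets: nothing runs, we just get the bound method object.
method = df.info
print(repr(method))  # "<bound method DataFrame.info of ...>" followed by df

# With brackets: the method runs and writes the summary.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()
print(summary)
```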

What is the difference between “\d*” and “\d+”?


Assuming this question is about regex, “\d*” matches zero or more digits, so it also matches when no digit is present at all. “\d+” matches one or more digits, so at least one digit must be present.
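
A quick check with Python's re module shows the difference. The "item" prefix is an arbitrary example string.

```python
import re

# "item" followed by zero or more digits.
star_no_digits = re.fullmatch(r"item\d*", "item")
star_digits = re.fullmatch(r"item\d*", "item42")

# "item" followed by one or more digits.
plus_no_digits = re.fullmatch(r"item\d+", "item")
plus_digits = re.fullmatch(r"item\d+", "item42")

print(star_no_digits is not None)  # True: zero digits is acceptable for *
print(plus_no_digits is not None)  # False: + needs at least one digit
print(star_digits is not None, plus_digits is not None)  # True True
```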

Are there any legal considerations to be made when scraping data from the web?

Yes. There are many sites that have a “no crawl” rule. If one ignores that rule, that site can ban you or even seek legal action against you. Be cognizant and courteous of where you point your spiders.
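
One common place where a site publishes its crawling rules is its robots.txt file (assuming that is the kind of “no crawl” rule meant here). As a sketch, Python's standard library can check a URL against such rules before a spider requests it. The rules, the user agent name and the example.com URLs below are made up, and passing this check says nothing about a site's terms of service or the law.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download the
# file from the site it intends to crawl.
rules = """User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

blocked = parser.can_fetch("my-spider", "https://example.com/private/data.html")
allowed = parser.can_fetch("my-spider", "https://example.com/public/index.html")

print(blocked)  # False: the path is under a Disallow rule
print(allowed)  # True: no rule forbids this path
```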

AMA #3 Q&A Part 1 here.
Watch AMA #3 here for a quick recap. 
