API

Scraping Cost of Living

scrape_cost_of_living.get_cost_of_living_table(place_name: str, country=True)[source]

Get the cost of living table for a place.

Parameters:

place_name (str) – name of the place.
country (bool) – whether the place is a country or city.

Returns:

cost of living table.

Return type:

pd.DataFrame

scrape_cost_of_living.clean_numbeo_table(numbeo_df: DataFrame) → DataFrame[source]

Cleans the default Numbeo cost of living table.

Parameters: numbeo_df (pd.DataFrame) – pandas dataframe
Returns: pandas dataframe that has been cleaned up.
Return type: pd.DataFrame

scrape_cost_of_living.check_enough_data(numbeo_df: DataFrame) → float[source]

Checks whether the table contains enough data points to be used for estimating the cost of living.

Parameters: numbeo_df (pd.DataFrame) – pandas dataframe.
Returns: number of filled cells as a proportion of the total number of categories.
Return type: float
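
As a rough illustration of the ratio described above, the sketch below computes the share of filled entries in plain Python. The `fill_ratio` helper and the list input are hypothetical stand-ins: the documented function takes a DataFrame, and its exact counting rule is not shown in this reference.

```python
from typing import Optional, Sequence


def fill_ratio(values: Sequence[Optional[float]]) -> float:
    """Return the share of entries that are not None (hypothetical helper)."""
    if not values:
        return 0.0
    filled = sum(1 for v in values if v is not None)
    return filled / len(values)


# One made-up price per cost category; None marks a missing value.
prices = [12.5, None, 3.0, 40.0]
ratio = fill_ratio(prices)
```

A caller could compare `ratio` against a threshold of its own choosing before trusting an estimate built from the table.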

scrape_cost_of_living.get_country_cost_of_living(country: str, percentile: int = 90) → float[source]

Get the cost of living for a country.

Parameters:

country (str) – country to get the cost of living for.
percentile (int) – percentile to use.

Returns:

cost of living for the country.

Return type:

float

scrape_cost_of_living.get_city_cost_of_living(city: str, percentile: int = 90) → float[source]

Get the cost of living for a city.

Parameters:

city (str) – city to get the cost of living for.
percentile (int) – percentile to use.

Returns:

cost of living for the city.

Return type:

float

scrape_cost_of_living.get_cost_of_living(place_name: str, simulations: int = 10000, percentile: int = 90) → float[source]

For every cost category, get the cost and multiply it by the number of units. If a category has both a lower and an upper bound, I simulate its cost using a triangular distribution. If it has no bounds, I simply take the mode.

Parameters:

place_name (str) – name of the place to get the cost of living for.
simulations (int) – number of simulations to run.
percentile (int) – percentile to use.

Returns:

cost of living in the input place.

Return type:

float
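
The simulation described for `get_cost_of_living` can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the module's implementation: the `simulate_total_cost` name, the tuple layout, the fixed seed, and the simple nearest-rank percentile are all hypothetical, and each mode is assumed to lie between its bounds.

```python
import random
from typing import List, Optional, Tuple

# (mode, low, high, units) per cost category; the layout is an assumption.
Category = Tuple[float, Optional[float], Optional[float], float]


def simulate_total_cost(categories: List[Category], simulations: int = 10000,
                        percentile: int = 90, seed: int = 0) -> float:
    """Simulate the total cost and return the requested percentile."""
    rng = random.Random(seed)
    totals = []
    for _ in range(simulations):
        total = 0.0
        for mode, low, high, units in categories:
            if low is not None and high is not None and low < high:
                # Both bounds known: draw from a triangular distribution.
                cost = rng.triangular(low, high, mode)
            else:
                # No usable bounds: fall back to the mode.
                cost = mode
            total += cost * units
        totals.append(total)
    totals.sort()
    # Nearest-rank style percentile, clamped to the last element.
    index = min(len(totals) - 1, int(len(totals) * percentile / 100))
    return totals[index]


# Made-up categories: one with bounds, one with only a mode.
example = [(10.0, 8.0, 15.0, 2.0), (5.0, None, None, 1.0)]
estimate = simulate_total_cost(example, simulations=1000)
```

Taking a high percentile rather than the mean gives a deliberately conservative figure, which matches the default of 90 in the signature above.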

scrape_cost_of_living.get_numbeo_countries() → list[source]

This function returns a list of countries that have been scraped from Numbeo. The countries get standardized using the pycountry library.

Returns: list of countries.
Return type: list

scrape_cost_of_living.main()[source]

Scraping Indices

scrape_numbeo_indices.scrape_index(url: str = 'https://www.numbeo.com/pollution/rankings_by_country.jsp', columns: tuple = ('Country', 'Pollution')) → None[source]

Scrapes an index table from the Numbeo website. With the default arguments it scrapes the pollution index by country.

Parameters:

url (str) – url to scrape.
columns (tuple) – columns to scrape.

Returns:

pandas dataframe.

Return type:

pd.DataFrame

scrape_numbeo_indices.to_pandas_df(rows: list) → DataFrame[source]

Converts a list of HTML rows to a pandas dataframe.

Parameters: rows (list) – list of rows
Returns: pandas dataframe
Return type: pd.DataFrame

Scraping Climate Data

scrape_temperatures.f_to_c(value: float) → float[source]

Converts Fahrenheit to Celsius.

Parameters: value (float) – value to convert, in degrees Fahrenheit.
Returns: value in degrees Celsius.
Return type: float

scrape_temperatures.in_to_mm(value: float) → float[source]

Converts inches to mm.

Parameters: value (float) – value to convert, in inches.
Returns: value in millimetres.
Return type: float
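
For reference, the two unit conversions documented above follow the standard formulas. The sketch below uses hypothetical names (`fahrenheit_to_celsius`, `inches_to_mm`) to make clear it is not the module's own code, which may for instance round its results.

```python
def fahrenheit_to_celsius(value: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (value - 32.0) * 5.0 / 9.0


def inches_to_mm(value: float) -> float:
    """Convert inches to millimetres (1 inch is exactly 25.4 mm)."""
    return value * 25.4


boiling_c = fahrenheit_to_celsius(212.0)
one_inch_mm = inches_to_mm(1.0)
```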

scrape_temperatures.check_float(potential_float: str) → bool[source]

Checks whether a string can be parsed as a float.

Parameters: potential_float (str) – string to check.
Returns: True if the string is a float, False otherwise.
Return type: bool
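
A common way to implement such a check is to attempt the conversion and catch the failure. The `looks_like_float` name below is hypothetical, and the module's own rule may be stricter.

```python
def looks_like_float(text: str) -> bool:
    """Return True if float() accepts the string, False otherwise."""
    try:
        float(text)
    except ValueError:
        return False
    return True


ok = looks_like_float("3.14")
bad = looks_like_float("n/a")
```

Note that `float()` also accepts strings such as `"nan"` and `"inf"`, so a scraper that wants only finite numbers would need an extra check.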

scrape_temperatures.get_stats(table: DataFrame) → dict[source]

Aggregates the climate table to get the maximum, minimum and average values.

Parameters: table (pd.DataFrame) – pandas dataframe of the table to aggregate.
Returns: dictionary of the maximum, minimum and average values.
Return type: dict
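
The kind of aggregation described above might look like the following sketch. It uses a plain dictionary of rows instead of a DataFrame to stay dependency-free; the `summarise_rows` name, the row labels, and the output key format are assumptions, not the module's actual output.

```python
from statistics import mean
from typing import Dict, List


def summarise_rows(table: Dict[str, List[float]]) -> Dict[str, float]:
    """Return the max, min and average of every row (hypothetical helper)."""
    stats = {}
    for label, values in table.items():
        stats[f"{label} max"] = max(values)
        stats[f"{label} min"] = min(values)
        stats[f"{label} avg"] = mean(values)
    return stats


# Made-up values standing in for a few months of a climate table.
climate = {
    "Average high": [10.0, 14.0, 21.0],
    "Average low": [2.0, 5.0, 11.0],
}
stats = summarise_rows(climate)
```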

scrape_temperatures.get_country_stats(soups: List[BeautifulSoup]) → dict[source]

For every country, get the stats on its climate.

Parameters: soups (List[BeautifulSoup]) – list of the soups of the pages.
Returns: dictionary of the stats.
Return type: dict

scrape_temperatures.main()[source]

Main Scraper Module

scrape_urls.get_table(soup: BeautifulSoup, table_num: int = 2, row_start: int = 1, row_end: int = 5) → DataFrame[source]

Pulls a table out of a BeautifulSoup HTML document.

The expected format is a table whose row labels contain ‘Average’. With the default arguments, only rows 1-4 inclusive are returned, which suits the tables found on the climate pages.

Parameters:

soup (BeautifulSoup) – BeautifulSoup object.
table_num (int) – The table number to pull out.
row_start (int) – The row to start pulling from.
row_end (int) – The row to end pulling from.

Returns:

a pandas dataframe of the table.

Return type:

pd.DataFrame
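
The defaults `row_start=1` and `row_end=5` returning rows 1-4 inclusive is consistent with Python's half-open slicing. The snippet below illustrates that reading with a plain list; the row labels are made up, and whether `get_table` slices exactly this way is an assumption.

```python
# Made-up row labels standing in for the rows of a climate table.
rows = [
    "header",
    "Average high",
    "Average low",
    "Average precipitation",
    "Average snowfall",
    "footer",
]

row_start, row_end = 1, 5
# A half-open slice keeps indices 1, 2, 3 and 4 and excludes index 5.
selected = rows[row_start:row_end]
```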

scrape_urls.find_html_class(soup: BeautifulSoup, class_name: str) → List[BeautifulSoup][source]

Finds all elements with a given class name.

Parameters:

soup (BeautifulSoup) – BeautifulSoup object.
class_name (str) – The class name to find.

Returns:

A list of elements with the given class name.

Return type:

List[BeautifulSoup]

scrape_urls.find_in_html(soup: BeautifulSoup, element: Union[str, list]) → Optional[BeautifulSoup][source]

Finds an element in a BeautifulSoup object.

Parameters:

soup (BeautifulSoup) – BeautifulSoup object.
element (Union[str, list]) – The element to find.

Returns:

The element if found, else None.

Return type:

Optional[BeautifulSoup]

scrape_urls.find_id_in_html(soup: BeautifulSoup, id: str) → Optional[BeautifulSoup][source]

Finds an element with a given id in a BeautifulSoup object.

Parameters:

soup (BeautifulSoup) – BeautifulSoup object.
id (str) – The id to find.

Returns:

The element if found, else None.

Return type:

Optional[BeautifulSoup]

scrape_urls.proxy_generator() → dict[source]

This function scrapes a list of free proxies from:

https://sslproxies.org/

It then returns a random proxy from the list.

Returns: A random proxy from the list.
Return type: dict

scrape_urls.scrape_page(url: str, spoof: bool = False) → Optional[BeautifulSoup][source]

This function tries to get page information by spoofing the header and trying a random proxy. If successful, it returns the soup of the page.

Parameters:

url (str) – The url to scrape.
spoof (bool) – Whether to spoof the header and use a proxy.

Returns:

The soup of the page.

Return type:

Optional[BeautifulSoup]

scrape_urls.multi_thread_func(func: Callable, values: List, threads: int = 126) → List[source]

This function takes a function and a list of values. It then runs the function on each value in the list using a thread pool.

Parameters:

func (Callable) – The function to run.
values (List) – The values to run the function on.
threads (int) – The number of threads to use.

Returns:

A list of the results of the function.

Return type:

List
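
Running a function over a list of values with a thread pool can be sketched with `concurrent.futures`. The `run_in_threads` name and the smaller default thread count are hypothetical; the module's own implementation may use a different pool or error handling.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_in_threads(func: Callable, values: List, threads: int = 8) -> List:
    """Apply func to every value using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() yields results in the same order as the input values.
        return list(pool.map(func, values))


squares = run_in_threads(lambda x: x * x, [1, 2, 3, 4])
```

Threads suit this kind of workload because scraping is dominated by waiting on the network rather than by computation.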

Get All Data

get_data.main()[source]

get_data.standardise_country_names(dfs: List[DataFrame]) → list[source]

Standardises the country names across all the dataframes.

Parameters: dfs (List[pd.DataFrame]) – list of dataframes.
Returns: list of dataframes with standardised country names.
Return type: List[pd.DataFrame]

get_data.import_data(suffix: str = ' by Country.csv') → list[source]

Imports all the data into a list of dataframes.

Parameters: suffix (str) – suffix of the file names.
Returns: list of dataframes.
Return type: List[pd.DataFrame]

get_data.join_data(df1: DataFrame, dfs: list) → DataFrame[source]

Joins the dataframes together.

Parameters:

df1 (pd.DataFrame) – dataframe to be joined.
dfs (List[pd.DataFrame]) – list of dataframes to be joined to df1.

Returns:

joined dataframe.

Return type:

pd.DataFrame

get_data.clean_pop_density(df: DataFrame) → DataFrame[source]

Renames the columns in the population density dataframe.

Parameters: df (pd.DataFrame) – dataframe to be cleaned.
Returns: cleaned dataframe.
Return type: pd.DataFrame

get_data.promote_to_index(dfs: list, col_name: str) → list[source]

Promotes the specified column to the index of the dataframes.

Parameters:

dfs (List[pd.DataFrame]) – list of dataframes.
col_name (str) – name of column to be promoted.

Returns:

list of dataframes with the specified column promoted to the index.

Return type:

List[pd.DataFrame]

Clustering

clustering.reduce_dimensions_pca(embeddings: List[List[float]], dimensions: int = 80) → List[List[float]][source]

Reduces the number of dimensions using PCA.

Parameters:

embeddings (List[List[float]]) – list of lists size m*n where n is the number of dimensions.
dimensions (int) – number of principal components to keep.

Returns:

list of lists size m*dimensions (reduced data)

Return type:

List[List[float]]

clustering.reduce_dimensions_umap(embeddings: List[List[float]], dimensions: int = 80, n_neighbors: int = 10) → List[List[float]][source]

Uses UMAP to reduce the dimensionality of the embeddings.

Parameters:

embeddings (List[List[float]]) – list of lists size m*n where n is the number of dimensions.
dimensions (int) – number of components to keep.
n_neighbors (int) – number of neighbouring points UMAP considers for each point.

Returns:

list of lists size m*dimensions (reduced data).

Return type:

List[List[float]]

clustering.shuffle(df: DataFrame) → DataFrame[source]

Shuffles the values of a pandas dataframe within each column or row.

Parameters: df (pd.DataFrame) – pandas dataframe shaped m*n
Returns: pandas dataframe shuffled.
Return type: pd.DataFrame
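
Shuffling each column independently keeps every column's values but breaks the relationships between columns. The sketch below shows that idea on a dictionary of columns rather than a DataFrame; the `shuffle_columns` name, the seed argument, and the column-wise reading of the description are assumptions.

```python
import random
from typing import Dict, List


def shuffle_columns(columns: Dict[str, List[float]],
                    seed: int = 0) -> Dict[str, List[float]]:
    """Return a copy in which every column is permuted independently."""
    rng = random.Random(seed)
    shuffled = {}
    for name, values in columns.items():
        permuted = list(values)  # copy so the input is left untouched
        rng.shuffle(permuted)
        shuffled[name] = permuted
    return shuffled


data = {"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 40.0]}
permuted = shuffle_columns(data)
```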

clustering.single_sample_t_test(sample: array, population_stat: float = 0.0) → float[source]

Runs a one-sample t-test to see whether the sample mean is significantly different from the population mean.

Parameters:

sample (np.array) – numpy array of floats.
population_stat (float) – float for the population mean.

Returns:

float for the t statistic.

Return type:

float
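
The one-sample t statistic is the difference between the sample mean and the population mean, divided by the standard error of the mean. A minimal standard-library sketch follows; the `one_sample_t` name is hypothetical, and the documented function works on a numpy array and may rely on a statistics library instead.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Sequence


def one_sample_t(sample: Sequence[float], population_mean: float = 0.0) -> float:
    """Return the t statistic of a sample against a population mean."""
    n = len(sample)
    # stdev() uses the sample (n - 1) denominator, as the t-test requires.
    standard_error = stdev(sample) / sqrt(n)
    return (mean(sample) - population_mean) / standard_error


t_stat = one_sample_t([2.0, 4.0, 6.0])
```

The sample needs at least two distinct values; otherwise the standard deviation is undefined or zero and the statistic cannot be computed.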

clustering.calc_perm_variance(pca, embeddings_df: DataFrame, n_simulations: int = 5) → DataFrame[source]

Calculates the variance explained for a PCA of the permuted data.

Parameters:

pca – sklearn pca object
embeddings_df (pd.DataFrame) – a pandas dataframe of the embedding vectors (m*n).
n_simulations (int) – integer for the number of permutations to run it on.

Returns:

list of lists as a pandas dataframe with the variance explained.

Return type:

pd.DataFrame

clustering.get_optimal_n_components(embeddings: List[List[float]], n_simulations: int = 5) → int[source]

Calculates the optimal number of principal components to keep when reducing dimensions. It first builds a ‘noise’ threshold by permuting the variables to remove any correlations, then calculates the variance explained by the principal components of the permuted data.

I then run a t test to find which principal components are significantly different from the baseline noise.

Parameters:

embeddings (List[List[float]]) – list of lists (m*n) of floats, where m is the number of vectors and n is the number of variables in each vector.
n_simulations (int) – the number of permutations used to build a distribution of the noise.

Returns:

integer for the optimal number of principal components.

Return type:

int

clustering.kmeans_clustering(reduced: List[List[float]], num_clusters: int = -1, max_num_clusters: int = 75) → List[int][source]

This function calculates clusters based on the reduced vectors. It also finds the best number of clusters using the elbow method. max_num_clusters is the maximum number of clusters to try when searching for the optimal number.

Parameters:

reduced (List[List[float]]) – list of lists (m*n) of floats, where m is the number of vectors and n is the number of variables in each vector.
max_num_clusters (int) – integer for the upper bound number of clusters.

Returns:

list of cluster numbers for each element in reduced.

Return type:

List[int]
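
The elbow method can be automated in several ways. One common heuristic, sketched below, picks the point on the inertia curve that lies farthest from the straight line joining its first and last points. This is an assumption about how an elbow might be located, not a description of what `kmeans_clustering` does; the `elbow_index` name and the inertia values are made up.

```python
from typing import List


def elbow_index(inertias: List[float]) -> int:
    """Index of the point farthest from the line joining the end points."""
    n = len(inertias)
    if n < 3:
        return 0
    x1, y1 = 0.0, inertias[0]
    x2, y2 = float(n - 1), inertias[-1]
    best_index, best_distance = 0, -1.0
    for i, y in enumerate(inertias):
        # Unnormalised point-to-line distance; the constant denominator
        # is dropped because it does not change which point is farthest.
        distance = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1)
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index


# Made-up k-means inertia values for k = 1..6.
inertias = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]
best_k = elbow_index(inertias) + 1  # +1 because the list starts at k = 1
```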

clustering.hdbscan_clustering(reduced: List[List[float]], min_cluster_size: int = 4, allow_single_cluster: bool = False) → List[int][source]

Uses HDBSCAN to calculate clusters from the reduced data.

Parameters: reduced (List[List[float]]) – list of lists (m*n) of floats, where m is the number of vectors and n is the number of variables in each vector.
Returns: list of cluster numbers for each element in reduced (note that -1 is an outlier).
Return type: List[int]

Estimate Cost to Retire

estimate_cost_to_retire.estimate_cost_to_retire(country: str, weekly_cost: float, r: float = 0.06, n: int = 68, moving_cost: float = 8000, buffer_pa: float = 10000)[source]

Estimate the cost to retire in a country.

Parameters:

country (str) – The country to estimate the cost to retire in.
weekly_cost (float) – The weekly cost of living in the country.
r (float) – The rate of return on investments.
n (int) – The number of years to retire.
moving_cost (float) – The cost of moving to a new country.
buffer_pa (float) – The buffer cost in the annual cost of living.

Returns:

The cost to retire in the country.

Return type:

float
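
This reference does not state the formula behind `estimate_cost_to_retire`. One plausible reading of the parameters is the present value of an annuity that pays the annual cost of living for `n` years at return `r`, plus the one-off moving cost. The sketch below implements that reading only; the `retirement_fund_sketch` name and the formula itself are assumptions, and the `country` argument is left out because its role is not documented.

```python
def retirement_fund_sketch(weekly_cost: float, r: float = 0.06, n: int = 68,
                           moving_cost: float = 8000.0,
                           buffer_pa: float = 10000.0) -> float:
    """Present value of n years of annual costs plus a one-off moving cost."""
    # Assumed annual spending: 52 weeks of living costs plus the yearly buffer.
    annual_cost = weekly_cost * 52 + buffer_pa
    if r == 0:
        # Without investment returns the fund is just the sum of all years.
        present_value = annual_cost * n
    else:
        # Standard present value of an ordinary annuity.
        present_value = annual_cost * (1 - (1 + r) ** -n) / r
    return present_value + moving_cost


fund = retirement_fund_sketch(weekly_cost=400.0)
```

Under this reading a higher rate of return lowers the required fund, since more of each year's spending is covered by investment growth.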

new_estimate_cost_to_retire.main()[source]

Purchasing Power Parity (PPP) Conversion Rates

predict_PPP.estimate_PPP_conversion_rate_long_term_change(country: str) → float[source]

Estimate the long term change in the PPP conversion rate of a country.

Parameters: country (str) – The country to estimate the long term change in the PPP conversion rate of.
Returns: The long term change in the PPP conversion rate of the country.
Return type: float