API

Scraping Cost of Living

scrape_cost_of_living.get_cost_of_living_table(place_name: str, country=True)

Get the cost of living table for a place.

Parameters:
  • place_name (str) – name of the place.

  • country (bool) – whether the place is a country or city.

Returns:

cost of living table.

Return type:

pd.DataFrame

scrape_cost_of_living.clean_numbeo_table(numbeo_df: DataFrame) → DataFrame

Cleans the default Numbeo cost of living table.

Parameters:

numbeo_df (pd.DataFrame) – pandas dataframe

Returns:

pandas dataframe that has been cleaned up.

Return type:

pd.DataFrame

scrape_cost_of_living.check_enough_data(numbeo_df: DataFrame) → float

Checks that there are enough data points in the table to use it for estimating the cost of living.

Parameters:

numbeo_df (pd.DataFrame) – pandas dataframe.

Returns:

number of filled cells as a proportion of the total number of categories.

Return type:

float
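The coverage ratio described above could be computed roughly as follows. This is a minimal sketch, not the module's source: it assumes one row per cost category and a hypothetical `Cost` column that is missing (NaN) when no price is available. The helper name `filled_proportion` is also an assumption.

```python
import pandas as pd


def filled_proportion(numbeo_df: pd.DataFrame, value_col: str = "Cost") -> float:
    """Return the share of categories (rows) that have a non-missing value.

    `value_col` is a hypothetical column name used for illustration only.
    """
    total_categories = len(numbeo_df)
    if total_categories == 0:
        return 0.0  # an empty table carries no usable data
    filled = numbeo_df[value_col].notna().sum()
    return float(filled) / total_categories


# Four categories, one of them without a price.
table = pd.DataFrame(
    {
        "Category": ["Rent", "Milk", "Bread", "Internet"],
        "Cost": [1200.0, None, 2.5, 40.0],
    }
)
coverage = filled_proportion(table)
```

A caller could then compare `coverage` against a threshold of its choosing before trusting the estimate.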

scrape_cost_of_living.get_country_cost_of_living(country: str, percentile: int = 90) → float

Get the cost of living for a country.

Parameters:
  • country (str) – country to get the cost of living for.

  • percentile (int) – percentile to use.

Returns:

cost of living for the country.

Return type:

float

scrape_cost_of_living.get_city_cost_of_living(city: str, percentile: int = 90) → float

Get the cost of living for a city.

Parameters:
  • city (str) – city to get the cost of living for.

  • percentile (int) – percentile to use.

Returns:

cost of living for the city.

Return type:

float

scrape_cost_of_living.get_cost_of_living(place_name: str, simulations: int = 10000, percentile: int = 90) → float

For every cost category, get the cost and multiply it by the number of units. Where a category has both a lower and an upper bound, I simulate the cost using a triangular distribution. Otherwise, I simply take the mode.

Parameters:
  • place_name (str) – name of the place.

  • simulations (int) – number of simulations to run.

  • percentile (int) – percentile to use.

Returns:

cost of living in the input place.

Return type:

float
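The simulation described for get_cost_of_living can be sketched with the standard library alone. This is an illustration under stated assumptions, not the module's implementation: the basket is a hypothetical list of `(low, mode, high, units)` tuples, the name `simulate_total_cost` is invented here, and the percentile uses the nearest-rank rule.

```python
import math
import random


def simulate_total_cost(items, simulations=10_000, percentile=90, seed=None):
    """Simulate the total basket cost and return the requested percentile.

    items: list of (low, mode, high, units); low and high may be None
    when no range is available (hypothetical input format).
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(simulations):
        total = 0.0
        for low, mode, high, units in items:
            if low is None or high is None or low == high:
                cost = mode  # no usable range: fall back to the mode
            else:
                # random.triangular takes (low, high, mode) in that order
                cost = rng.triangular(low, high, mode)
            total += cost * units
        totals.append(total)
    totals.sort()
    # Nearest-rank percentile, clamped to a valid position
    rank = min(len(totals), max(1, math.ceil(percentile / 100 * len(totals))))
    return totals[rank - 1]


# Two categories: one with a price range, one with only a single price.
basket = [(0.8, 1.0, 1.5, 4), (None, 30.0, None, 1)]
estimate = simulate_total_cost(basket, simulations=2000, seed=0)
```

Taking a high percentile such as 90 rather than the mean gives a deliberately conservative estimate.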

scrape_cost_of_living.get_numbeo_countries() → list

This function returns a list of countries that have been scraped from Numbeo. The countries get standardized using the pycountry library.

Returns:

list of countries.

Return type:

list

scrape_cost_of_living.main()

Scraping Indices

scrape_numbeo_indices.scrape_index(url: str = 'https://www.numbeo.com/pollution/rankings_by_country.jsp', columns: tuple = ('Country', 'Pollution')) → None

Scrapes an index table from the Numbeo website. By default, this is the pollution index by country.

Parameters:
  • url (str) – url to scrape.

  • columns (tuple) – columns to scrape.

Returns:

pandas dataframe.

Return type:

pd.DataFrame

scrape_numbeo_indices.to_pandas_df(rows: list) → DataFrame

Converts a list of HTML rows to a pandas dataframe.

Parameters:

rows (list) – list of rows

Returns:

pandas dataframe

Return type:

pd.DataFrame

Scraping Climate Data

scrape_temperatures.f_to_c(value: float) → float

Converts Fahrenheit to Celsius.

Parameters:

value (float) – float of the value to convert.

Returns:

Celsius value.

Return type:

float

scrape_temperatures.in_to_mm(value: float) → float

Converts inches to mm.

Parameters:

value (float) – float of the value to convert.

Returns:

mm value.

Return type:

float
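Both unit conversions above follow standard formulas. The sketch below shows them for reference. It mirrors the documented names, but it is written from the formulas rather than taken from the module's source.

```python
def f_to_c(value: float) -> float:
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (value - 32.0) * 5.0 / 9.0


def in_to_mm(value: float) -> float:
    """Convert inches to millimetres (1 inch is exactly 25.4 mm)."""
    return value * 25.4


freezing_c = f_to_c(32.0)
rain_mm = in_to_mm(2.0)
```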

scrape_temperatures.check_float(potential_float: str) → bool

Checks whether a string can be parsed as a float.

Parameters:

potential_float (str) – string to check.

Returns:

True if the string is a float, False otherwise.

Return type:

bool
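A common way to write such a check is to attempt the conversion and catch the failure. The sketch below assumes that approach. The module may implement the check differently.

```python
def check_float(potential_float: str) -> bool:
    """Return True if the string parses as a float, False otherwise."""
    try:
        float(potential_float)
        return True
    except ValueError:
        return False


is_number = check_float("3.14")
is_placeholder_number = check_float("n/a")
```

Note that with this approach strings such as "nan" and "inf" also count as floats, which may or may not be wanted when cleaning scraped tables.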

scrape_temperatures.get_stats(table: DataFrame) → dict

Aggregates the climate table data to get the maximums, minimums, and averages.

Parameters:

table (pd.DataFrame) – pandas dataframe of the table to aggregate.

Returns:

dictionary of the maximums, minimums, and averages.

Return type:

dict

scrape_temperatures.get_country_stats(soups: List[BeautifulSoup]) → dict

For every country, get the stats on its climate.

Parameters:

soups (List[BeautifulSoup]) – list of the soups of the pages.

Returns:

dictionary of the stats.

Return type:

dict

scrape_temperatures.main()

Main Scraper Module

scrape_urls.get_table(soup: BeautifulSoup, table_num: int = 2, row_start: int = 1, row_end: int = 5) → DataFrame

Pulls a table out of a BeautifulSoup HTML document.

The expected format is a table whose row labels contain ‘Average’. With the default arguments, only rows 1-4 inclusive are returned. This layout matches the tables on the climate page.

Parameters:
  • soup (BeautifulSoup) – BeautifulSoup object.

  • table_num (int) – The table number to pull out.

  • row_start (int) – The row to start pulling from.

  • row_end (int) – The row to end pulling from.

Returns:

a pandas dataframe of the table.

Return type:

pd.DataFrame

scrape_urls.find_html_class(soup: BeautifulSoup, class_name: str) → List[BeautifulSoup]

Finds all elements with a given class name.

Parameters:
  • soup (BeautifulSoup) – BeautifulSoup object.

  • class_name (str) – The class name to find.

Returns:

A list of elements with the given class name.

Return type:

List[BeautifulSoup]

scrape_urls.find_in_html(soup: BeautifulSoup, element: Union[str, list]) → Optional[BeautifulSoup]

Finds an element in a BeautifulSoup object.

Parameters:
  • soup (BeautifulSoup) – BeautifulSoup object.

  • element (Union[str, list]) – The element to find.

Returns:

The element if found, else None.

Return type:

Optional[BeautifulSoup]

scrape_urls.find_id_in_html(soup: BeautifulSoup, id: str) → Optional[BeautifulSoup]

Finds an element with a given id in a BeautifulSoup object.

Parameters:
  • soup (BeautifulSoup) – BeautifulSoup object.

  • id (str) – The id to find.

Returns:

The element if found, else None.

Return type:

Optional[BeautifulSoup]

scrape_urls.proxy_generator() → dict

This function scrapes a list of free proxies from:

https://sslproxies.org/

It then returns a random proxy from the list.

Returns:

A random proxy from the list.

Return type:

dict

scrape_urls.scrape_page(url: str, spoof: bool = False) → Optional[BeautifulSoup]

This function tries to get page information by spoofing the header and trying a random proxy. If successful, it returns the soup of the page.

Parameters:
  • url (str) – The url to scrape.

  • spoof (bool) – Whether to spoof the header and use a proxy.

Returns:

The soup of the page.

Return type:

Optional[BeautifulSoup]

scrape_urls.multi_thread_func(func: Callable, values: List, threads: int = 126) → List

This function takes a function and a list of values. It then runs the function on each value in the list using a thread pool.

Parameters:
  • func (Callable) – The function to run.

  • values (List) – The values to run the function on.

  • threads (int) – The number of threads to use.

Returns:

A list of the results of the function.

Return type:

List
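Running a function over a list of values with a thread pool could look like the sketch below. It assumes `concurrent.futures.ThreadPoolExecutor` from the standard library. The helper name `run_in_threads` is hypothetical, and the module itself may use a different pool implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_in_threads(func: Callable, values: List, threads: int = 8) -> List:
    """Apply func to every value using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # Executor.map yields results in the same order as the inputs
        return list(pool.map(func, values))


squares = run_in_threads(lambda x: x * x, [1, 2, 3, 4])
```

Threads suit this workload because scraping is dominated by waiting on network responses rather than by CPU work.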

Get All Data

get_data.main()

get_data.standardise_country_names(dfs: List[DataFrame]) → list

Standardises the country names across all the dataframes.

Parameters:

dfs (List[pd.DataFrame]) – list of dataframes.

Returns:

list of dataframes with standardised country names.

Return type:

List[pd.DataFrame]

get_data.import_data(suffix: str = ' by Country.csv') → list

Imports all the data into a list of dataframes.

Parameters:

suffix (str) – suffix of the file names.

Returns:

list of dataframes.

Return type:

List[pd.DataFrame]

get_data.join_data(df1: DataFrame, dfs: list) → DataFrame

Joins the dataframes together.

Parameters:
  • df1 (pd.DataFrame) – dataframe to be joined.

  • dfs (List[pd.DataFrame]) – list of dataframes to be joined to df1.

Returns:

joined dataframe.

Return type:

pd.DataFrame

get_data.clean_pop_density(df: DataFrame) → DataFrame

Renames the columns in the population density dataframe.

Parameters:

df (pd.DataFrame) – dataframe to be cleaned.

Returns:

cleaned dataframe.

Return type:

pd.DataFrame

get_data.promote_to_index(dfs: list, col_name: str) → list

Promotes the specified column to the index of the dataframes.

Parameters:
  • dfs (List[pd.DataFrame]) – list of dataframes.

  • col_name (str) – name of column to be promoted.

Returns:

list of dataframes with the specified column promoted to the index.

Return type:

List[pd.DataFrame]
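Promoting a shared column to the index and then joining on it, as promote_to_index and join_data describe, might be sketched as follows. The function bodies are illustrative guesses. In particular, the outer join is an assumption, since the documentation does not say which join type is used.

```python
from typing import List

import pandas as pd


def promote_to_index(dfs: List[pd.DataFrame], col_name: str) -> List[pd.DataFrame]:
    """Set the named column as the index of every dataframe."""
    return [df.set_index(col_name) for df in dfs]


def join_data(df1: pd.DataFrame, dfs: List[pd.DataFrame]) -> pd.DataFrame:
    """Join each dataframe onto df1 by index (outer join is an assumption)."""
    joined = df1
    for df in dfs:
        joined = joined.join(df, how="outer")
    return joined


cost = pd.DataFrame({"Country": ["Chile", "Japan"], "Cost": [1.0, 2.0]})
pollution = pd.DataFrame({"Country": ["Japan", "Peru"], "Pollution": [40.0, 80.0]})

first, *rest = promote_to_index([cost, pollution], "Country")
merged = join_data(first, rest)
```

With an outer join, countries present in only one source are kept and their missing values are filled with NaN.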

Clustering

clustering.reduce_dimensions_pca(embeddings: List[List[float]], dimensions: int = 80) → List[List[float]]

Reduces the number of dimensions using PCA.

Parameters:
  • embeddings (List[List[float]]) – list of lists size m*n where n is the number of dimensions.

  • dimensions (int) – number of principal components to keep.

Returns:

list of lists size m*dimensions (reduced data)

Return type:

List[List[float]]

clustering.reduce_dimensions_umap(embeddings: List[List[float]], dimensions: int = 80, n_neighbors: int = 10) → List[List[float]]

Uses UMAP to reduce the dimensionality of the embeddings.

Parameters:
  • embeddings (List[List[float]]) – list of lists size m*n where n is the number of dimensions.

  • dimensions (int) – number of components to keep.

Returns:

list of lists size m*dimensions (reduced data).

Return type:

List[List[float]]

clustering.shuffle(df: DataFrame) → DataFrame

Shuffles the data by each column or row for a pandas dataframe.

Parameters:

df (pd.DataFrame) – pandas dataframe shaped m*n

Returns:

pandas dataframe shuffled.

Return type:

pd.DataFrame

clustering.single_sample_t_test(sample: array, population_stat: float = 0.0) → float

Run a simple t test on a sample to see if it is significantly different from the population mean.

Parameters:
  • sample (np.array) – numpy array of floats.

  • population_stat (float) – float for the population mean.

Returns:

float for the t statistic.

Return type:

float
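For reference, the one-sample t statistic is the difference between the sample mean and the population mean, divided by the standard error of the mean. The sketch below computes it with the standard library. The name `one_sample_t` is hypothetical, and the module may rely on a statistics package instead.

```python
import math
import statistics
from typing import Sequence


def one_sample_t(sample: Sequence[float], population_mean: float = 0.0) -> float:
    """Return the t statistic of a sample against a population mean."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("sample has zero variance")
    standard_error = sd / math.sqrt(n)
    return (mean - population_mean) / standard_error


t_stat = one_sample_t([2.0, 4.0, 6.0, 8.0], population_mean=3.0)
```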

clustering.calc_perm_variance(pca, embeddings_df: DataFrame, n_simulations: int = 5) → DataFrame

Calculates the variance explained for a PCA of the permuted data.

Parameters:
  • pca – sklearn pca object

  • embeddings_df (pd.DataFrame) – a pandas dataframe of the embedding vectors (m*n).

  • n_simulations (int) – integer for the number of permutations to run it on.

Returns:

list of lists as a pandas dataframe with the variance explained.

Return type:

pd.DataFrame

clustering.get_optimal_n_components(embeddings: List[List[float]], n_simulations: int = 5) → int

Calculates the optimal number of principal components to keep when reducing dimensions. A ‘noise’ threshold is obtained by permuting the variables to remove any correlations, then calculating the variance explained by the principal components of the permuted data.

I then run a t test to find which principal components are significantly different from the baseline noise.

Parameters:
  • embeddings (List[List[float]]) – list of lists (m*n) of floats, where m is the number of vectors and n is the number of variables in each vector.

  • n_simulations (int) – the number of permutations used to build the noise distribution.

Returns:

integer for the optimal number of principal components.

Return type:

int
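The permutation idea above can be illustrated with NumPy. This is a simplified sketch, not the module's code. It computes PCA through an SVD rather than scikit-learn, and it swaps the t test for a plain comparison against the mean explained variance of the permuted data. All names are hypothetical.

```python
import numpy as np


def explained_variance_ratio(data: np.ndarray) -> np.ndarray:
    """Share of variance explained by each principal component."""
    centered = data - data.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def n_components_above_noise(data: np.ndarray, n_simulations: int = 5, seed: int = 0) -> int:
    """Count leading components that explain more variance than permuted data."""
    rng = np.random.default_rng(seed)
    observed = explained_variance_ratio(data)
    noise = np.empty((n_simulations, observed.size))
    for i in range(n_simulations):
        # Shuffle each column independently to destroy correlations
        permuted = np.column_stack([rng.permutation(col) for col in data.T])
        noise[i] = explained_variance_ratio(permuted)
    threshold = noise.mean(axis=0)  # simplification: mean instead of a t test
    above = observed > threshold
    # Index of the first component that fails to beat the noise
    return int(above.size) if above.all() else int(np.argmin(above))


# Five variables driven by one shared latent factor plus small noise.
gen = np.random.default_rng(1)
latent = gen.normal(size=(200, 1))
data = latent + 0.1 * gen.normal(size=(200, 5))
n_keep = n_components_above_noise(data)
```

Because the five variables share one latent factor, only the first component should stand out from the permuted baseline.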

clustering.kmeans_clustering(reduced: List[List[float]], num_clusters: int = -1, max_num_clusters: int = 75) → List[int]

This function calculates clusters based on the reduced vectors. It also calculates the best number of clusters using the elbow method. max_num_clusters is the maximum number of clusters to try when searching for the optimal number.

Parameters:
  • reduced (List[List[float]]) – list of lists (m*n) of floats, where m is the number of vectors and n is the number of variables in each vector.

  • max_num_clusters (int) – integer for the upper bound number of clusters.

Returns:

list of cluster numbers for each element in reduced.

Return type:

List[int]
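The documentation does not say how the elbow is located. One common heuristic is to pick the point on the inertia curve that lies farthest from the straight line joining its first and last points. The sketch below assumes that heuristic and a precomputed list of inertias for k = 1, 2, 3, and so on. The name `elbow_index` and the sample numbers are invented for illustration.

```python
from typing import Sequence


def elbow_index(inertias: Sequence[float]) -> int:
    """Return the index of the point farthest from the first-to-last chord."""
    n = len(inertias)
    if n < 3:
        return 0  # too few points to have an elbow
    x1, y1 = 0, inertias[0]
    x2, y2 = n - 1, inertias[-1]
    best_index, best_distance = 0, -1.0
    for i, y in enumerate(inertias):
        # Numerator of the point-to-line distance; the denominator is
        # the same for every point, so it can be skipped when ranking.
        distance = abs((y2 - y1) * i - (x2 - x1) * y + x2 * y1 - y2 * x1)
        if distance > best_distance:
            best_index, best_distance = i, distance
    return best_index


# Made-up inertias for k = 1..6; the curve flattens after k = 3.
inertias = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]
best_k = elbow_index(inertias) + 1
```

The result depends on the relative scale of the two axes, so normalising the inertias before measuring distances is a reasonable refinement.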

clustering.hdbscan_clustering(reduced: List[List[float]], min_cluster_size: int = 4, allow_single_cluster: bool = False) → List[int]

Uses HDBSCAN to calculate clusters from the reduced data.

Parameters:

reduced (List[List[float]]) – list of lists (m*n) of floats, where m is the number of vectors and n is the number of variables in each vector.

Returns:

list of cluster numbers for each element in reduced. (note that -1 is an outlier)

Return type:

List[int]

Estimate Cost to Retire

estimate_cost_to_retire.estimate_cost_to_retire(country: str, weekly_cost: float, r: float = 0.06, n: int = 68, moving_cost: float = 8000, buffer_pa: float = 10000)

Estimate the cost to retire in a country.

Parameters:
  • country (str) – The country to estimate the cost to retire in.

  • weekly_cost (float) – The weekly cost of living in the country.

  • r (float) – The rate of return on investments.

  • n (int) – The number of years to retire.

  • moving_cost (float) – The cost of moving to a new country.

  • buffer_pa (float) – The buffer cost in the annual cost of living.

Returns:

The cost to retire in the country.

Return type:

float
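The documentation does not give the formula behind this estimate. One plausible reading of the parameters is the present value of an annuity: the lump sum that, invested at rate r, funds a fixed annual spend for n years, plus the one-off moving cost. The sketch below is only that assumption written out. The name `annuity_cost_to_retire` is hypothetical, and the `country` argument is omitted because it does not enter the arithmetic here.

```python
def annuity_cost_to_retire(
    weekly_cost: float,
    r: float = 0.06,
    n: int = 68,
    moving_cost: float = 8000.0,
    buffer_pa: float = 10000.0,
) -> float:
    """Lump sum needed to fund n years of spending at return rate r.

    Assumes an ordinary annuity (payments at year end); this is an
    illustrative model, not the documented function's actual formula.
    """
    annual_spend = weekly_cost * 52 + buffer_pa
    if r == 0:
        present_value = annual_spend * n  # no growth: simple sum
    else:
        present_value = annual_spend * (1 - (1 + r) ** -n) / r
    return present_value + moving_cost


lump_sum = annuity_cost_to_retire(weekly_cost=500.0)
```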

new_estimate_cost_to_retire.main()

Purchasing Power Parity (PPP) Conversion Rates

predict_PPP.estimate_PPP_conversion_rate_long_term_change(country: str) → float

Estimate the long term change in the PPP conversion rate of a country.

Parameters:

country (str) – The country to estimate the long term change in the PPP conversion rate of.

Returns:

The long term change in the PPP conversion rate of the country.

Return type:

float