API
Scraping Cost of Living
- scrape_cost_of_living.get_cost_of_living_table(place_name: str, country=True)[source]
Get the cost of living table for a place.
- Parameters:
place_name (str) – name of the place.
country (bool) – whether the place is a country or city.
- Returns:
cost of living table.
- Return type:
pd.DataFrame
- scrape_cost_of_living.clean_numbeo_table(numbeo_df: DataFrame) DataFrame[source]
Cleans the default Numbeo cost of living table.
- Parameters:
numbeo_df (pd.DataFrame) – pandas dataframe
- Returns:
pandas dataframe that has been cleaned up.
- Return type:
pd.DataFrame
- scrape_cost_of_living.check_enough_data(numbeo_df: DataFrame) float[source]
Checks whether the table contains enough data points to be used for estimating the cost of living.
- Parameters:
numbeo_df (pd.DataFrame) – pandas dataframe.
- Returns:
number of filled cells as a proportion of the total number of categories.
- Return type:
float
- scrape_cost_of_living.get_country_cost_of_living(country: str, percentile: int = 90) float[source]
Get the cost of living for a country.
- Parameters:
country (str) – country to get the cost of living for.
percentile (int) – percentile to use.
- Returns:
cost of living for the country.
- Return type:
float
- scrape_cost_of_living.get_city_cost_of_living(city: str, percentile: int = 90) float[source]
Get the cost of living for a city.
- Parameters:
city (str) – city to get the cost of living for.
percentile (int) – percentile to use.
- Returns:
cost of living for the city.
- Return type:
float
- scrape_cost_of_living.get_cost_of_living(place_name: str, simulations: int = 10000, percentile: int = 90) float[source]
For every cost category, gets the cost and multiplies it by the number of units. Where a category has both a lower and an upper bound, the cost is simulated using a triangular distribution; otherwise the mode is used directly.
- Parameters:
place_name (str) – name of the place to get the cost of living for.
simulations (int) – number of simulations to run.
percentile (int) – percentile to use.
- Returns:
cost of living in the input country.
- Return type:
float
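The simulation described for get_cost_of_living can be sketched as follows. This is a minimal illustration using only the standard library, not the module's actual code: the function name simulate_total_cost, the layout of rows, the seed argument, and the sample prices are all hypothetical.

```python
import random
import statistics


def simulate_total_cost(rows, simulations=10000, percentile=90, seed=0):
    """Simulate a total cost and return the requested percentile.

    rows is a hypothetical list of (mode, low, high, units) tuples;
    low and high are None when the table gives no price range.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(simulations):
        total = 0.0
        for mode, low, high, units in rows:
            if low is not None and high is not None:
                # Draw from a triangular distribution peaked at the mode.
                price = rng.triangular(low, high, mode)
            else:
                # No range available: fall back to the mode itself.
                price = mode
            total += price * units
        totals.append(total)
    # quantiles(n=100) returns the 99 cut points for percentiles 1..99.
    return statistics.quantiles(totals, n=100)[percentile - 1]


rows = [
    (3.0, 2.0, 5.0, 10),    # item with a price range, 10 units
    (50.0, None, None, 1),  # item without a range, 1 unit
]
estimate = simulate_total_cost(rows, simulations=2000)
```

Taking a high percentile of the simulated totals rather than their mean gives a deliberately conservative estimate.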
Scraping Indices
- scrape_numbeo_indices.scrape_index(url: str = 'https://www.numbeo.com/pollution/rankings_by_country.jsp', columns: tuple = ('Country', 'Pollution')) None[source]
Scrapes an index table from the Numbeo website. With the default arguments this is the pollution index by country.
- Parameters:
url (str) – url to scrape.
columns (tuple) – columns to scrape.
- Returns:
pandas dataframe.
- Return type:
pd.DataFrame
Scraping Climate Data
- scrape_temperatures.f_to_c(value: float) float[source]
Converts Fahrenheit to Celsius.
- Parameters:
value (float) – float of the value to convert.
- Returns:
Celsius value.
- Return type:
float
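For reference, the standard Fahrenheit to Celsius conversion is shown below. This is a sketch of the formula, not a copy of the module's source.

```python
def f_to_c(value: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    # Subtract the 32 degree offset, then rescale by 5/9.
    return (value - 32.0) * 5.0 / 9.0
```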
- scrape_temperatures.in_to_mm(value: float) float[source]
Converts inches to mm.
- Parameters:
value (float) – float of the value to convert.
- Returns:
mm value.
- Return type:
float
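The inch to millimetre conversion relies on the exact definition of 1 inch as 25.4 mm. A minimal sketch follows; the constant name MM_PER_INCH is an assumption made for this example.

```python
MM_PER_INCH = 25.4  # exact by definition of the international inch


def in_to_mm(value: float) -> float:
    """Convert a length from inches to millimetres."""
    return value * MM_PER_INCH
```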
- scrape_temperatures.check_float(potential_float: str) bool[source]
Checks if a string is indeed a float.
- Parameters:
potential_float (str) – string to check.
- Returns:
True if the string is a float, False otherwise.
- Return type:
bool
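A common way to implement a check like check_float is to attempt the conversion and catch the failure. The sketch below assumes that approach; the module may do it differently.

```python
def check_float(potential_float: str) -> bool:
    """Return True if the string can be parsed by float(), else False."""
    try:
        float(potential_float)
    except ValueError:
        return False
    return True
```

With this approach, strings such as "nan" and "inf" also count as floats, because float() accepts them.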
- scrape_temperatures.get_stats(table: DataFrame) dict[source]
Aggregates the climate table data to get the maxima, minima, and averages.
- Parameters:
table (pd.DataFrame) – pandas dataframe of the table to aggregate.
- Returns:
dictionary of the maxima, minima, and averages.
- Return type:
dict
Main Scraper Module
- scrape_urls.get_table(soup: BeautifulSoup, table_num: int = 2, row_start: int = 1, row_end: int = 5) DataFrame[source]
Extracts a table from a parsed BeautifulSoup HTML document.
The expected format has row labels containing ‘Average’. With the default arguments only rows 1-4 inclusive are returned, which matches the tables found on the climate page.
- Parameters:
soup (BeautifulSoup) – BeautifulSoup object.
table_num (int) – The table number to pull out.
row_start (int) – The row to start pulling from.
row_end (int) – The row to end pulling from.
- Returns:
a pandas dataframe of the table.
- Return type:
pd.DataFrame
- scrape_urls.find_html_class(soup: BeautifulSoup, class_name: str) List[BeautifulSoup][source]
Finds all elements with a given class name.
- Parameters:
soup (BeautifulSoup) – BeautifulSoup object.
class_name (str) – The class name to find.
- Returns:
A list of elements with the given class name.
- Return type:
List[BeautifulSoup]
- scrape_urls.find_in_html(soup: BeautifulSoup, element: Union[str, list]) Optional[BeautifulSoup][source]
Finds an element in a BeautifulSoup object.
- Parameters:
soup (BeautifulSoup) – BeautifulSoup object.
element (Union[str, list]) – The element to find.
- Returns:
The element if found, else None.
- Return type:
Optional[BeautifulSoup]
- scrape_urls.find_id_in_html(soup: BeautifulSoup, id: str) Optional[BeautifulSoup][source]
Finds an element with a given id in a BeautifulSoup object.
- Parameters:
soup (BeautifulSoup) – BeautifulSoup object.
id (str) – The id to find.
- Returns:
The element if found, else None.
- Return type:
Optional[BeautifulSoup]
- scrape_urls.proxy_generator() dict[source]
This function scrapes a list of free proxies from:
It then returns a random proxy from the list.
- Returns:
A random proxy from the list.
- Return type:
dict
- scrape_urls.scrape_page(url: str, spoof: bool = False) Optional[BeautifulSoup][source]
This function tries to get page information by spoofing the header and trying a random proxy. If successful, it returns the soup of the page.
- Parameters:
url (str) – The url to scrape.
spoof (bool) – Whether to spoof the header and use a proxy.
- Returns:
The soup of the page.
- Return type:
Optional[BeautifulSoup]
- scrape_urls.multi_thread_func(func: Callable, values: List, threads: int = 126) List[source]
This function takes a function and a list of values. It then runs the function on each value in the list using a thread pool.
- Parameters:
func (Callable) – The function to run.
values (List) – The values to run the function on.
threads (int) – The number of threads to use.
- Returns:
A list of the results of the function.
- Return type:
List
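Running a function over a list with a thread pool can be sketched with the standard library as follows. The name run_in_threads and the smaller default of 8 threads are assumptions for this example, not the module's code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_in_threads(func: Callable, values: List, threads: int = 8) -> List:
    """Apply func to every value using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as executor:
        # executor.map returns results in the same order as the inputs.
        return list(executor.map(func, values))


squares = run_in_threads(lambda x: x * x, [1, 2, 3, 4])
```

Threads suit this kind of work when func is dominated by waiting on network requests, as is typical for scraping.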
Get All Data
- get_data.standardise_country_names(dfs: List[DataFrame]) list[source]
Standardises the country names across all the dataframes.
- Parameters:
dfs (List[pd.DataFrame]) – list of dataframes.
- Returns:
list of dataframes with standardised country names.
- Return type:
List[pd.DataFrame]
- get_data.import_data(suffix: str = ' by Country.csv') list[source]
Imports all the data into a list of dataframes.
- Parameters:
suffix (str) – suffix of the file names.
- Returns:
list of dataframes.
- Return type:
List[pd.DataFrame]
- get_data.join_data(df1: DataFrame, dfs: list) DataFrame[source]
Joins the dataframes together.
- Parameters:
df1 (pd.DataFrame) – dataframe to be joined.
dfs (List[pd.DataFrame]) – list of dataframes to be joined to df1.
- Returns:
joined dataframe.
- Return type:
pd.DataFrame
- get_data.clean_pop_density(df: DataFrame) DataFrame[source]
Renames the columns in the population density dataframe.
- Parameters:
df (pd.DataFrame) – dataframe to be cleaned.
- Returns:
cleaned dataframe.
- Return type:
pd.DataFrame
- get_data.promote_to_index(dfs: list, col_name: str) list[source]
Promotes the specified column to the index of the dataframes.
- Parameters:
dfs (List[pd.DataFrame]) – list of dataframes.
col_name (str) – name of column to be promoted.
- Returns:
list of dataframes with the specified column promoted to the index.
- Return type:
List[pd.DataFrame]
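Promoting a shared column to the index and then joining on it can be sketched with pandas as below. The helper names set_index_all and join_all, the sample data, and the choice of an outer join are assumptions for illustration; the source does not state which join type get_data.join_data uses.

```python
import pandas as pd


def set_index_all(dfs, col_name):
    """Return copies of the dataframes indexed by col_name."""
    return [df.set_index(col_name) for df in dfs]


def join_all(df1, dfs):
    """Join each dataframe in dfs onto df1 using the shared index."""
    joined = df1
    for df in dfs:
        # Assumption: an outer join, so no country is dropped.
        joined = joined.join(df, how="outer")
    return joined


# Hypothetical sample data with a shared "Country" column.
pollution = pd.DataFrame({"Country": ["A", "B"], "Pollution": [10.0, 20.0]})
crime = pd.DataFrame({"Country": ["B", "C"], "Crime": [5.0, 7.0]})

indexed = set_index_all([pollution, crime], "Country")
combined = join_all(indexed[0], indexed[1:])
```

Countries missing from one of the inputs end up with NaN in that input's columns.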
Clustering
- clustering.reduce_dimensions_pca(embeddings: List[List[float]], dimensions: int = 80) List[List[float]][source]
Reduces the number of dimensions using PCA.
- Parameters:
embeddings (List[List[float]]) – list of lists size m*n where n is the number of dimensions.
dimensions (int) – number of principal components to keep.
- Returns:
list of lists size m*dimensions (reduced data)
- Return type:
List[List[float]]
- clustering.reduce_dimensions_umap(embeddings: List[List[float]], dimensions: int = 80, n_neighbors: int = 10) List[List[float]][source]
Uses UMAP to reduce the dimensionality of the embeddings.
- Parameters:
embeddings (List[List[float]]) – list of lists size m*n where n is the number of dimensions.
dimensions (int) – number of components to keep.
- Returns:
list of lists size m*dimensions (reduced data).
- Return type:
List[List[float]]
- clustering.shuffle(df: DataFrame) DataFrame[source]
Shuffles the data by each column or row for a pandas dataframe.
- Parameters:
df (pd.DataFrame) – pandas dataframe shaped m*n
- Returns:
pandas dataframe shuffled.
- Return type:
pd.DataFrame
- clustering.single_sample_t_test(sample: array, population_stat: float = 0.0) float[source]
Runs a one-sample t test to see whether the sample mean is significantly different from the population mean.
- Parameters:
sample (np.array) – numpy array of floats.
population_stat (float) – float for the population mean.
- Returns:
float for the t statistic.
- Return type:
float
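The one-sample t statistic is the difference between the sample mean and the population mean, divided by the standard error of the sample mean. A standard-library sketch follows; the name one_sample_t_statistic is hypothetical, and the module may rely on a statistics library instead.

```python
import math
import statistics


def one_sample_t_statistic(sample, population_mean=0.0):
    """Return t = (mean - population_mean) / (s / sqrt(n))."""
    n = len(sample)
    sample_mean = statistics.fmean(sample)
    # statistics.stdev uses the n - 1 (sample) denominator.
    sample_std = statistics.stdev(sample)
    return (sample_mean - population_mean) / (sample_std / math.sqrt(n))


t_stat = one_sample_t_statistic([1.0, 2.0, 3.0, 4.0, 5.0])
```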
- clustering.calc_perm_variance(pca, embeddings_df: DataFrame, n_simulations: int = 5) DataFrame[source]
Calculates the variance explained for a PCA of the permuted data.
- Parameters:
pca – sklearn pca object
embeddings_df (pd.DataFrame) – a pandas dataframe of the embedding vectors (m*n).
n_simulations (int) – integer for the number of permutations to run it on.
- Returns:
list of lists as a pandas dataframe with the variance explained.
- Return type:
pd.DataFrame
- clustering.get_optimal_n_components(embeddings: List[List[float]], n_simulations: int = 5) int[source]
Calculates the optimal number of principal components to keep when reducing dimensions. It establishes a ‘noise’ threshold by permuting the variables to remove any correlations between them, and then calculates the variance explained by the principal components of the permuted data.
A t test is then run to find which principal components are significantly different from the baseline noise.
- Parameters:
embeddings (List[List[float]]) – list of lists (m*n) of floats, where m is the number of vectors and n is the number of variables in each vector.
n_simulations (int) – the number of permutations used to get a distribution of the noise.
- Returns:
integer for the optimal number of principal components.
- Return type:
int
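The permutation idea behind get_optimal_n_components can be sketched with NumPy alone. This is a simplified illustration and not the module's implementation: instead of the t test described above, a component is kept when its explained variance exceeds the largest value seen for the same rank in the permuted data. All function names and the synthetic data are assumptions.

```python
import numpy as np


def explained_variance_ratio(data):
    """Share of total variance captured by each principal component."""
    centred = data - data.mean(axis=0)
    # Squared singular values of the centred data are proportional to
    # the variances along the principal components.
    singular_values = np.linalg.svd(centred, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


def permute_columns(data, rng):
    """Shuffle every column independently to break cross-column correlations."""
    return np.column_stack(
        [rng.permutation(data[:, j]) for j in range(data.shape[1])]
    )


def count_components_above_noise(data, n_simulations=5, seed=0):
    """Count the leading components that explain more variance than noise."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    observed = explained_variance_ratio(data)
    noise = np.array(
        [
            explained_variance_ratio(permute_columns(data, rng))
            for _ in range(n_simulations)
        ]
    )
    # Simplification (not a t test): compare each component with the
    # largest permuted value for the same rank.
    threshold = noise.max(axis=0)
    count = 0
    for is_signal in observed > threshold:
        if not is_signal:
            break
        count += 1
    return count


# Synthetic data: five noisy copies of a single hidden factor.
rng = np.random.default_rng(1)
factor = rng.normal(size=(200, 1))
data = factor + 0.1 * rng.normal(size=(200, 5))
n_components = count_components_above_noise(data)
```

Because all five columns follow one hidden factor, only the first component should stand out from the permuted baseline.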
- clustering.kmeans_clustering(reduced: List[List[float]], num_clusters: int = -1, max_num_clusters: int = 75) List[int][source]
This function calculates clusters based on the reduced vectors. It also calculates the best number of clusters using the elbow method. max_num_clusters is the maximum number of clusters to try when searching for the optimal number.
- Parameters:
reduced (List[List[float]]) – list of lists (m*n) of floats, where m is the number of vectors and n is the number of variables in each vector.
max_num_clusters (int) – integer for the upper bound number of clusters.
- Returns:
list of cluster numbers for each element in reduced.
- Return type:
List[int]
- clustering.hdbscan_clustering(reduced: List[List[float]], min_cluster_size: int = 4, allow_single_cluster: bool = False) List[int][source]
Uses HDBSCAN to calculate clusters from the reduced data.
- Parameters:
reduced (List[List[float]]) – list of lists (m*n) of floats, where m is the number of vectors and n is the number of variables in each vector.
- Returns:
list of cluster numbers for each element in reduced. (note that -1 is an outlier)
- Return type:
List[int]
Estimate Cost to Retire
- estimate_cost_to_retire.estimate_cost_to_retire(country: str, weekly_cost: float, r: float = 0.06, n: int = 68, moving_cost: float = 8000, buffer_pa: float = 10000)[source]
Estimate the cost to retire in a country.
- Parameters:
country (str) – The country to estimate the cost to retire in.
weekly_cost (float) – The weekly cost of living in the country.
r (float) – The rate of return on investments.
n (int) – The number of years to retire.
moving_cost (float) – The cost of moving to a new country.
buffer_pa (float) – The buffer cost in the annual cost of living.
- Returns:
The cost to retire in the country.
- Return type:
float
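The source does not give the formula behind estimate_cost_to_retire. One plausible reading of its parameters, shown purely as an assumption, is the present value of an annuity: the annual cost (52 weeks of weekly_cost plus buffer_pa) paid for n years and discounted at rate r, plus the one-off moving_cost. The function name below is hypothetical.

```python
def present_value_of_retirement(
    weekly_cost: float,
    r: float = 0.06,
    n: int = 68,
    moving_cost: float = 8000.0,
    buffer_pa: float = 10000.0,
) -> float:
    """Assumed model: moving cost plus the present value of n annual payments."""
    annual_cost = weekly_cost * 52 + buffer_pa
    if r == 0:
        # Without investment returns the payments are simply summed.
        annuity_factor = float(n)
    else:
        # Present value factor of an ordinary annuity.
        annuity_factor = (1 - (1 + r) ** -n) / r
    return moving_cost + annual_cost * annuity_factor
```

Under this assumed model a higher r lowers the lump sum required, because the invested funds cover more of the future costs.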
Purchasing Power Parity (PPP) Conversion Rates
- predict_PPP.estimate_PPP_conversion_rate_long_term_change(country: str) float[source]
Estimate the long term change in the PPP conversion rate of a country.
- Parameters:
country (str) – The country whose long-term change in PPP conversion rate is estimated.
- Returns:
The long term change in the PPP conversion rate of the country.
- Return type:
float