What is involved:

import pandas as pd

df = pd.read_csv('Categorical.csv')

  • Gather relevant data from appropriate sources, addressing any quality or privacy concerns.
## Get textbook data, for example:
 
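import re

def read_file(filename):
    # Read the file and flatten it to one line: drop '[edit]' markers and the BOM,
    # and replace line breaks and ideographic spaces (U+3000) with ordinary spaces.
    with open(filename, "r", encoding='UTF-8') as file:
        contents = file.read().replace('\n\n',' ').replace('[edit]', '').replace('\ufeff', '').replace('\n', ' ').replace('\u3000', ' ')
    return contents

text = read_file('Data various/Monte_Cristo.txt')

# Keep only the text between the start marker and the Project Gutenberg end marker.
text_start = [m.start() for m in re.finditer('VOLUME ONE', text)]
text_end = [m.start() for m in re.finditer('End of Project Gutenberg', text)]
# The second 'VOLUME ONE' match is used (the first is presumably in the table of contents).
text = text[text_start[1]:text_end[0]]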

How would you approach a colleague who is hesitant to share their data?

  • Explain the purpose and benefits of sharing the data.
  • Ensure confidentiality (GDPR), for example with data masking.
  • Find common ground to address any concerns or objections.
  • Build trust.
  • Agree on terms of use and ownership, and document the data access process.
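
As an illustration of the data masking point above, here is a minimal sketch using only the Python standard library. The records, the `SECRET_KEY` value, and the helper names `pseudonymize` and `mask_email` are all made up for this example; it shows one possible approach, not a complete GDPR solution.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; in practice, load it from a secure store.
SECRET_KEY = b"replace-with-a-real-secret"

def pseudonymize(value, key=SECRET_KEY):
    """Return a keyed SHA-256 digest: the same input always gives the same token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()

def mask_email(email):
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

# Made-up records standing in for a colleague's dataset.
records = [
    {"name": "Alice Example", "email": "alice@example.com", "score": 7},
    {"name": "Bob Example", "email": "bob@example.com", "score": 9},
]

# Replace direct identifiers but keep the analytical column untouched.
masked = [
    {
        "person_id": pseudonymize(r["name"]),
        "email": mask_email(r["email"]),
        "score": r["score"],
    }
    for r in records
]
```

Because the token is deterministic, rows belonging to the same person can still be joined across tables. Note that pseudonymised data of this kind is generally still treated as personal data, so access agreements remain necessary.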

How would you go about obtaining the necessary permissions for a dataset?

  • Establish clear communication channels within the organisation.
  • Obtain the necessary approvals.
  • Emphasize the value of collaboration.

How would you gather sensitive data?

  • Get consent.
  • Ensure anonymity (follow regulations).

How do you ensure data is unbiased and representative?

  • Use stratified sampling (group the data, then randomly sample within each group).
  • Examine the data sources.
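
The "group, then randomly sample" idea can be sketched in plain Python as follows. The function name `stratified_sample`, the column name `group`, and the rows are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, frac, seed=0):
    """Group rows by `key`, then randomly draw the same fraction from each group."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * frac))  # keep at least one row per group
        sample.extend(rng.sample(members, k))
    return sample

# Made-up data: 80 rows in group "A" and 20 rows in group "B".
rows = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]

sample = stratified_sample(rows, key="group", frac=0.1)
counts = {g: sum(r["group"] == g for r in sample) for g in ("A", "B")}
print(counts)
```

The sample keeps the 80/20 split of the full data, whereas a simple random sample of 10 rows could miss group "B" entirely. With a pandas DataFrame (pandas 1.1 or later), `df.groupby("group").sample(frac=0.1, random_state=0)` does the same kind of draw.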