ITS is a Data Privacy Week champion! Organized by the National Cybersecurity Alliance, Data Privacy Week, which runs January 21 – 27, is designed to help you safeguard your data online.
Artificial intelligence (AI) tools have raised a lot of questions about copyright, ethics, bias and the potential for societal disruption. But here’s one AI concern you may not have thought of: data privacy.
AI tools such as chatbots, large language models (LLMs) and image generators need lots of data to work. Data is used to train these tools and to compile their responses, and more data is generated every time you interact with one. All of this data raises a lot of privacy questions.
Where did the data come from? How does AI use my data? How can I protect my data?
The answers won’t be the same for every AI tool, but here’s the general landscape of AI and data privacy.
Your data powers AI
AI tools pull their training and answer data from multiple sources. These sources vary by tool and companies don’t often share details about where or how they collect data.
For example, OpenAI, ChatGPT’s parent company, says it develops models using “three primary sources of information: (1) information that is publicly available on the internet, (2) information that we license from third parties and (3) information that our users or our human trainers provide.”
Let’s look at the first item — information that is publicly available on the internet. This generally translates to pieces of information you can find using a search engine. At first thought, this may not concern you much. Aren’t most web results from companies or entities that want their data to be found? Sure, but there’s probably more of your data out there than you think.
A lot of publicly available information is rounded up through a process called “web scraping.” Web scraping is essentially downloading all the content of public websites, including social media sites and public databases.
AI tools may already have this data about you from scraping the web:
- Your name and date of birth
- Current and past addresses
- Photos of you or photos you have uploaded to sharing sites
- Where you work, from your LinkedIn profile or from a company website
- Your voter affiliation and voting history
Your data is for sale
AI companies may scrape the web themselves, or they may buy data sets from a third party, as OpenAI’s statement references. “Licensed from third parties” data can be almost anything from anywhere. It could be scraped public data, or it could be data that’s collected from a website or an app and resold.
And unfortunately, these data sets may include information that was never meant to be public, like medical records. Companies compile data sets from whatever information they can get, including data that was hacked, leaked or otherwise unlawfully collected.
Earlier this month, the Federal Trade Commission issued its first data tracking settlement against a company that sold de-anonymized location data of consumers without their consent. Not only were consumers largely unaware of the tracking, but the company also sold data sets that included visits to sensitive locations like domestic abuse shelters or medical offices.
“Proprietary data” is also your data
Another vague response from companies is that they train their AI models on “proprietary data.” Proprietary data is data that the company itself collects and owns.
So, what makes up proprietary data and how would a company use it? It depends. As a fictional, but not particularly far-fetched example, imagine a book publishing company wanted to use AI to design high-performing book covers. They might train the model using proprietary data from their company, like earlier cover designs and corresponding sales data.
But many of the major players in AI — Alphabet (Google), Microsoft, Meta (Instagram and Facebook), Amazon — are also companies that collect your data. The data you generate — searches you make, websites you browse, what you buy, posts you share, where you go or videos you watch — becomes the company’s proprietary data.
The companies can then use this data to train models, compile answers or customize responses.
For example, in September 2023, Amazon disclosed that Alexa voice searches would be used to train its future Alexa AI model. And Meta said that it trained a new image generator using 1.1 billion public Facebook and Instagram photos.
And an AI company’s proprietary data also includes how you interact with the tool itself. It’s common for AI tools to collect and store user queries and responses. Companies then analyze this interaction data and use it to train future models.
Choices that protect your data
Admittedly, there is a lot that is out of your control when it comes to AI and your data privacy. You generate data in your daily life. Much of that data is necessary — imagine if your doctor didn’t keep track of your test results or if the DMV didn’t know who owned a car. And you probably enjoy suggestions based on videos you watch or books you read. But there are some ways you can be more in control of your data privacy.
First, think about what you share publicly online. Keep in mind that changing settings now, like making your Instagram account private, won’t retroactively remove formerly public information from data sets.
Second, be careful about what you send to and share with the AI tools themselves. Don’t send sensitive information to a chatbot and don’t expect privacy. There are some exceptions, like Microsoft Copilot with Data Protection, but unless you’re sure, don’t share.
Lastly, review settings for services you use and devices you own. You might be able to limit how much data an app or service collects. Look for options such as turning off personalized ads or restricting an app’s permissions to only while you are using it.
AI and data privacy at UNC
The AI landscape has evolved rapidly over the past year, and it can be hard to keep up. To help you use AI securely and ethically, UNC has created resources tailored to students, faculty and staff.
One tool with enhanced data privacy protection is available from ITS. Last fall, ITS rolled out an institutionally scoped AI tool, Microsoft Copilot with Data Protection, for UNC faculty and staff.
Copilot with Data Protection does not store or view your chats. Your queries are encrypted, and Microsoft does not use UNC-Chapel Hill data or queries to train any of its models. By comparison, tools like ChatGPT, Google Bard and the consumer version of Microsoft Copilot store your searches and use your queries and chat history to train future models.
While Microsoft Copilot with Data Protection is a more secure alternative to ChatGPT or other LLMs, you should not use it for all University data. Do not use AI tools with protected health information (PHI), data subject to HIPAA (the Health Insurance Portability and Accountability Act) or Tier 3 data.
While Copilot with Data Protection has safeguards for Tier 1 and Tier 2 data, check to make sure any external obligations, data management plans or system security plans allow for this tool’s use. Those obligations would override any “safe for Tier 1 or Tier 2” designation.
Learn more about Microsoft Copilot with Data Protection in this article, 4 questions about Microsoft Copilot, answered.