---
title: "Assignment: Describing Data"
author: "Maya Estell"
date: "2022-10-22"
output:
  html_document:
    df_print: paged
  pdf_document: default
  html_notebook: default
editor_options:
  chunk_output_type: console
---

## Instructions

This second assignment reviews the *Describing Data* content. 
You will use the *describing_data.Rmd* file I reviewed as part of the lectures for this week to complete this assignment. 
You will *copy and paste* relevant code from that file and update it to answer the questions in this assignment. 
In each section, you will respond to the questions after executing the code needed to answer them.
You will submit this assignment to its *Submissions* folder on *D2L*.
You will submit *two* files:

1. this completed *R Markdown* script, and 
2. a rendered version of it, preferably as a *PDF* (if you already installed `TinyTeX` properly), or otherwise as an *HTML* file (if you have not installed `TinyTeX` properly yet).

To start:

First, create a folder on your computer to save all relevant files for this course. 
If you did not do so already, you will want to create a folder named *GSB 519* that contains all of the materials for this course.

Second, inside of *GSB 519*, you will create a folder to host assignments.
You can name that folder *assignments*.

Third, inside of *assignments*, you will create folders for each assignment.
You can name the folder for this assignment: *describing_data*.

Fourth, create three additional folders in *describing_data* named *scripts*, *data*, and *plots*.
Store this script in the *scripts* folder and the data for this assignment in the *data* folder.

Fifth, go to the *File* menu in *RStudio*, select *New Project...*, choose *Existing Directory*, and then select your *~/GSB 519/assignments/describing_data* folder as the top-level directory for this **R Project**.
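
The folder layout described above can also be created from the R console instead of by hand. The following is only a minimal sketch: `course_root` and `assignment_dir` are hypothetical names, and the sketch points to a temporary directory so it is safe to run anywhere. Swap `tempdir()` for the location where you keep your course files.

```r
## hypothetical root location; replace tempdir() with your own folder (e.g., "~")
course_root <- file.path(tempdir(), "GSB 519")

## path to the folder for this assignment
assignment_dir <- file.path(course_root, "assignments", "describing_data")

## create the three subfolders;
## recursive = TRUE also creates any missing parent folders
for (subfolder in c("scripts", "data", "plots")) {
  dir.create(
    file.path(assignment_dir, subfolder),
    recursive = TRUE,
    showWarnings = FALSE
  )
}

## list the folders that now exist inside the assignment folder
list.files(assignment_dir)
```

You would still create the **R Project** itself through the *File* menu as described in the fifth step.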

## Global Settings

The first code chunk sets the global settings for the remaining code chunks in the document.
Do *not* change anything in this code chunk.

```{r, setup, include = FALSE}
### specify echo setting for all code chunks
knitr::opts_chunk$set(echo = TRUE)
```

## Load Packages

In this code chunk, we load four packages we need for this assignment:

1. **here**,
2. **tidyverse**,
3. **readxl**, and
4. **skimr**.

Make sure you installed these packages when you reviewed the analytical lecture.

We will use functions from these packages to examine the data. 
Do *not* change anything in this code chunk.

```{r, libraries, message = FALSE}
### load libraries for use in current working session
## library "here" for project workflow
library(here)

## tidyverse for data manipulation and plotting
# loads eight different libraries simultaneously
library(tidyverse)

## readxl to import Excel data
library(readxl)

## skimr to summarize data
library(skimr)
```
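
If you are unsure whether all four packages are installed, you can check before knitting. This is a small sketch and not part of the required code; `needed` and `missing_pkgs` are illustrative names.

```r
## packages this assignment loads
needed <- c("here", "tidyverse", "readxl", "skimr")

## requireNamespace() returns TRUE when a package is installed
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

## names of any packages that still need to be installed
missing_pkgs <- needed[!installed]
missing_pkgs
```

If `missing_pkgs` is not empty, run `install.packages(missing_pkgs)` once in the console rather than inside this script.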

## Task 1: Import Data

We will use the same data as in the analytical lecture: **clients.xlsx**.
After you load the data, you will execute other commands on the data.

Use the **read_excel()** and **here()** functions to load the *second sheet* from this Excel data file for this working session. 
Save the data as the object **clients_s2_raw**. 

**Question 1.1**: After you load the data, look at your *Global Environment* window. 
How many observations and variables are there in the data?

**Response 1.1**: *919 observations and 14 variables*.

**Question 1.2**: Use the **glimpse()** function to view a preview of values for each variable in the data. 
Which is the first value of the **product_type** variable?
How many variables are treated as numeric (i.e., **dbl**) when you import the data?

**Response 1.2**: *First value of product type is computers and 9 variables are treated as numeric variables*.

```{r, task1}
#### Q1.1
### import and save data as object
## use read_excel() to import the Excel data file
clients_s2_raw <- read_excel(
  ## use here() to locate file in our project directory;
  here("data", "clients.xlsx"),
  ## specify sheet
  sheet = 2
)

#### Q1.2
### glimpse data
glimpse(clients_s2_raw)
```
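
The counts asked about in Questions 1.1 and 1.2 can also be computed rather than read off the *Global Environment* window. The sketch below uses a small made-up table named `toy_raw` (not the assignment data) so that it runs on its own. With the real data, you would apply the same calls to **clients_s2_raw**.

```r
library(tidyverse)

## toy stand-in for imported data (made-up values, not clients.xlsx)
toy_raw <- tibble(
  month = c(7, 8, 9),
  income = c(21000, 36000, 50000),
  sex = c("male", "female", "male")
)

## number of observations (rows) and variables (columns)
dim(toy_raw)

## number of variables stored as double-precision numeric (dbl)
n_dbl <- toy_raw %>%
  select(where(is.double)) %>%
  ncol()
n_dbl
```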

## Task 2: Clean Data

For your second task, you will clean the data.
Apply the **rename()** function to rename: 

1. **credit.term** to **credit_term**, and
2. **HavingChildren_flg** to **having_children_flg**.

In the same piped command, use **mutate()** and **across()** functions to convert to factors:

1. **month**,
2. **sex**,
3. **education**,
4. **product_type**,
5. **having_children_flg**,
6. **region**,
7. **family_status**,
8. **phone_operator**,
9. **is_client**, and
10. **bad_client_target**.

Save the result as a new data object named: **clients_s2_work**.
Apply **glimpse()** to **clients_s2_work** to preview the working data.

**Question 2.1**: How many *factor* variables (indicated by *fct*) are there now in the data?

**Response 2.1**: *10*.

Use **map()** and **unique()** functions to examine the levels for all of the factor variables.
Then, recode the factor levels for the various factors appropriately:

1. for **month**, recode the levels from **7-12** to **July-December**;
2. for **sex**, recode the levels just like in the lecture script;
3. for **education**, recode the levels just like in the lecture script and add **"Incomplete Secondary Education" = "Incomplete secondary education"** and **"PhD Degree" = "PhD degree"**;
4. for **product_type**, recode the levels just like in the lecture script except remove **Garden Equipment** and **Children's Goods** and add **"Fishing and Hunting Supplies" = "Fishing and hunting supplies"**; 
5. for **region**, recode the levels **0-2** to **East**, **Midwest**, and **West**, respectively (i.e., **0** = **East**, and so on);
6. for **having_children_flg**, **is_client**, and **bad_client_target**, recode the levels **0-1** to **No** and **Yes**, respectively.

Save the changes to **clients_s2_work**.
Use **map()** and **unique()** functions again to examine the levels for all of the factor variables.

**Question 2.2**: How many factor levels are there for **product_type**?
How many factor levels are there for **education**?

**Response 2.2**: *6 for education, and 19 for product type*.

```{r, task2}
### rename variables and convert categorical variables to factors
clients_s2_work <- clients_s2_raw %>%
  rename(
    credit_term = credit.term,
    having_children_flg = HavingChildren_flg
  ) %>%
  mutate(
    across(
      c(month, sex, education, product_type, having_children_flg, region,
        family_status, phone_operator, is_client, bad_client_target),
      factor
    )
  )

### preview working data
glimpse(clients_s2_work)

### examine levels of the factor variables
clients_s2_work %>%
  select(where(is.factor)) %>%
  map(unique)

clients_s2_work <-clients_s2_work %>% 
  mutate(month=fct_recode(month,
                          "July"="7",
                          "August"="8",
                          "September"="9",
                          "October"="10",
                          "November"="11",
                          "December"="12"),
         sex=fct_recode(sex,
                      "Male"="male",
                      "Female"="female"),
         education=fct_recode(education,
                              "Secondary Special Education"="Secondary special education",
                              "Higher Education"="Higher education",
                              "Incomplete Higher Education"="Incomplete higher education",
                              "Secondary Education"="Secondary education",
                              "Incomplete Secondary Education"="Incomplete secondary education",
                              "PhD Degree"="PhD degree"),
         product_type=fct_recode(product_type,
                                 "Cell Phones"="Cell phones",
                                 "Household App"="Household app",
                                 "Cosm. & Beaut. Serv."="Cosmetics and beauty services",
                                 "Medical Services"="Medical services",
                                 "Sporting Goods"="Sporting goods",
                                 "Fishing and Hunting Supplies"="Fishing and hunting supplies"
                              
                                 ),
         region=fct_recode(region,
                           "East"="0",
                           "Midwest"="1",
                           "West"="2"),
         having_children_flg=fct_recode(having_children_flg,
                                        "No"="0",
                                        "Yes"="1"),
         is_client=fct_recode(is_client,
                              "No"="0",
                              "Yes"="1"),
         bad_client_target=fct_recode(bad_client_target,
                                      "No"="0",
                                      "Yes"="1"))

### examine levels of the factor variables after recoding
clients_s2_work %>%
  select(where(is.factor)) %>%
  map(unique)
```
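
To count factor levels directly instead of reading them from the printed output, you can combine `map_int()` with `nlevels()`. The sketch below is illustrative only: `toy_work` holds made-up values and follows the same convert-then-recode pattern as the chunk above.

```r
library(tidyverse)

## toy data with made-up values (not the assignment data)
toy_work <- tibble(
  region = c(0, 1, 2, 1),
  is_client = c(0, 1, 1, 0),
  income = c(21000, 36000, 50000, 42000)
) %>%
  ## convert the selected columns to factors
  mutate(across(c(region, is_client), factor)) %>%
  ## recode levels with the pattern "new level" = "old level"
  mutate(
    region = fct_recode(region, "East" = "0", "Midwest" = "1", "West" = "2"),
    is_client = fct_recode(is_client, "No" = "0", "Yes" = "1")
  )

## count the levels of every factor column
level_counts <- toy_work %>%
  select(where(is.factor)) %>%
  map_int(nlevels)
level_counts
```

Note that `fct_recode()` matches the old level as a string, which is why the numeric codes appear in quotes.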

## Task 3: Summary of Variables

Summarize **clients_s2_work** using the **skim_without_charts()** function.
Group by **sex** and select the **credit_term**, **education**, **is_client**, and **income** variables to examine.

**Question 3.1**: What is the *75th percentile* of *income* for *men*?
How many *women* are clients?

**Response 3.1**: *36000*.

```{r, task3}
clients_s2_work%>%
  group_by(sex)%>%
  select(credit_term,education,is_client,income)%>%
  skim_without_charts()
```
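
If you want to confirm single values from the skim output, the same kind of statistics can be computed with `group_by()` and `summarize()`. This is a sketch on made-up data (`toy_work`), so the numbers it returns are not the answers to Question 3.1.

```r
library(tidyverse)

## toy data with made-up values (not the assignment data)
toy_work <- tibble(
  sex = factor(c("Male", "Male", "Male", "Female", "Female")),
  is_client = factor(c("Yes", "No", "Yes", "Yes", "No")),
  income = c(20000, 30000, 40000, 25000, 35000)
)

## 75th percentile of income and number of clients within each sex
toy_summary <- toy_work %>%
  group_by(sex) %>%
  summarize(
    income_p75 = quantile(income, probs = 0.75, names = FALSE),
    n_clients = sum(is_client == "Yes")
  )
toy_summary
```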

## Task 4: Discrete Variables

For this task, you will plot a discrete variable.

Produce a horizontal bar plot of **product_type** that calculates the percentages of individuals in each category.
Correctly label the axes.
Do NOT order the categories yet.

**Question 4.1**: Which two product types are most popular?

**Response 4.1**: *Cell phones and household appliances.*

Examine the present order of the **product_type** categories with the **levels()** function.
Reorder the categories by their reverse frequency.
Examine the changed order of the **product_type** categories.

**Question 4.2**: What were the first and last categories originally?
What are the first and last categories after the reordering?

**Response 4.2**: *Originally, the first category was Audio & video and the last was Windows & doors. After reordering, the first is Construction materials and the last is computers*.

Produce a second horizontal bar plot of **product_type** that calculates the percentages of individuals in each category.
Correctly label the axes.
This plot should use the now ordered version of **product_type**.
Add text to the middle of the bars to indicate their percentage values.
Change the size of the text to a value of **2**.

**Question 4.3**: What percentage of bought items were *windows and doors*?
What percentage of bought items were *furniture*?

**Response 4.3**: *1.4 and 26.3*.

```{r, task4}
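### first plot: percentages by product type, categories not yet ordered
clients_s2_work %>%
  ggplot(aes(x = product_type, y = after_stat(count / sum(count) * 100))) +
  geom_bar() +
  ## flip the axes to make the bars horizontal
  coord_flip() +
  labs(x = "Product Type", y = "Percentage")

### examine present order of categories
levels(clients_s2_work$product_type)

### count categories, sort by frequency, and compute percentages
product_type_order <- clients_s2_work %>%
  count(product_type) %>%
  arrange(desc(n)) %>%
  mutate(pct = round(n / sum(n) * 100, 1))

### reorder categories to follow the sorted frequencies
clients_s2_work$product_type <- factor(
  clients_s2_work$product_type,
  levels = product_type_order$product_type
)

### examine changed order of categories
levels(clients_s2_work$product_type)

### second plot: ordered categories with percentage labels inside the bars
product_type_order %>%
  ggplot(aes(x = reorder(product_type, desc(pct)), y = pct, label = pct)) +
  geom_col() +
  geom_text(size = 2, position = position_stack(vjust = 0.5)) +
  coord_flip() +
  labs(x = "Product Type", y = "Percentage")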
```
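
Another way to reorder categories by reverse frequency is with the **forcats** helpers `fct_infreq()` and `fct_rev()`. The sketch below is one possible approach on made-up data (`toy_products`), not necessarily the approach from the lecture script.

```r
library(tidyverse)

## toy data with made-up values (not the assignment data)
toy_products <- tibble(
  product_type = factor(c("Computers", "Furniture", "Computers",
                          "Cell Phones", "Computers", "Furniture"))
)

## original order of the levels is alphabetical
levels(toy_products$product_type)

## fct_infreq() orders levels from most to least frequent;
## fct_rev() reverses that order so the most frequent level comes last
toy_products <- toy_products %>%
  mutate(product_type = fct_rev(fct_infreq(product_type)))
levels(toy_products$product_type)

## horizontal bar plot of percentages with labels in the middle of the bars
toy_plot <- toy_products %>%
  count(product_type) %>%
  mutate(pct = round(n / sum(n) * 100, 1)) %>%
  ggplot(aes(x = product_type, y = pct, label = pct)) +
  geom_col() +
  geom_text(size = 2, position = position_stack(vjust = 0.5)) +
  coord_flip() +
  labs(x = "Product Type", y = "Percentage")
toy_plot
```

Because `coord_flip()` draws the last level at the top, the reverse-frequency order places the most frequent category at the top of the horizontal plot.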

## Task 5: Continuous Variables

For this task, you will plot a continuous variable.

Produce a histogram plot for **credit_term**.
Color the histogram **blue**, choose **10** bins, add text that prints the count value for each bin, and use **8** breaks for the x-axis.
Label the axes appropriately.

**Question 5.1**: Are there more individuals with a *credit term* of roughly *10-12.5* or *35-37.5*?
Are there more than *10* individuals with a *credit term* in the *21-23* range?

**Response 5.1**: *There are more individuals with a credit term of 10-12.5 than 35-37.5. There are not more than 10 individuals with a credit term in the 21-23 range*.

Produce a density plot for **credit_term**.
Fill the histogram **purple**, make it half transparent, and make the color of the density curve **white**.
Label the axes appropriately.
Provide a title and subtitle for the plot.

**Question 5.2**: How many modes do you see for *credit terms* falling between *0* to *20* months?

**Response 5.2**: *There are 3 modes*.

```{r, task5}
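### histogram of credit term with count labels above the bars
clients_s2_work %>%
  ggplot(aes(x = credit_term)) +
  geom_histogram(bins = 10, fill = "blue") +
  stat_bin(
    aes(y = after_stat(count), label = after_stat(count)),
    geom = "text", bins = 10, vjust = -0.5
  ) +
  scale_x_continuous(breaks = c(5, 10, 15, 20, 25, 30, 35, 40)) +
  labs(x = "Credit Term", y = "Count")

### density curve of credit term overlaid on a density-scaled histogram
clients_s2_work %>%
  ggplot(aes(x = credit_term, y = after_stat(density))) +
  geom_histogram(bins = 10, fill = "purple", alpha = 0.5) +
  geom_density(color = "white") +
  stat_bin(
    aes(label = after_stat(count)),
    geom = "text", bins = 10, vjust = -0.5
  ) +
  scale_x_continuous(breaks = c(5, 10, 15, 20, 25, 30, 35, 40)) +
  labs(
    x = "Credit Term", y = "Density",
    title = "Distribution of Credit Term",
    subtitle = "Histogram with density curve"
  )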
```
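
For comparison, a density plot without the histogram can be drawn with `geom_density()` alone, which takes the fill, transparency, and curve color directly. The sketch below uses simulated values (`toy_terms`), so its shape says nothing about the real **credit_term** variable, and the title and subtitle are placeholders.

```r
library(tidyverse)

## simulated credit terms with two clusters (made-up, not the assignment data)
set.seed(519)
toy_terms <- tibble(
  credit_term = c(rnorm(100, mean = 6, sd = 1.5), rnorm(100, mean = 12, sd = 2))
)

## density plot: purple fill, half transparent, white density curve
toy_density_plot <- toy_terms %>%
  ggplot(aes(x = credit_term)) +
  geom_density(fill = "purple", alpha = 0.5, color = "white") +
  labs(
    x = "Credit Term",
    y = "Density",
    title = "Density of Credit Term",
    subtitle = "Simulated values for illustration"
  )
toy_density_plot
```

Each local peak in the curve is a mode, which is what Question 5.2 asks you to count.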

## Task 6: Multiple Variables

For this task, you will produce plots involving multiple variables.

Produce a bar plot calculating the *percentage* of *men* and *women* with various *phone operators*.
Place **phone_operator** on the x-axis and **sex** as the **fill** and **group** aesthetics.
Produce a *dodged* bar plot.
Add text on top of the bars with the relevant percentages.
Label the axes and legend appropriately.
Provide an appropriate title.
Save the plot as **phone_gender_plot**.
Print the plot by highlighting the saved object.

**Question 6.1**: Which *phone operator* is most often used by *men* and *women*?

**Response 6.1**: *AT&T is used most often by men, and US Cellular is used most often by women*.

Produce a scatterplot of **credit_term** (x-axis) and **credit_amount** (y-axis).
Add a **loess** line and color the points **blue** with transparency.
Save the plot as **credit_term_amount_plot**.

**Question 6.2**: Using the **loess** line as a guide, are larger *credit amounts* associated with *lower* or *higher* *credit terms*?

**Response 6.2**: *WRITE YOUR ANSWER BETWEEN THESE ASTERISKS*.

Produce a boxplot of **income** (y-axis) as a function of **family_status** (x-axis).
Fill the boxplots by **family_status**, color the outliers in **darkred**, include the jittered data points, exclude the legend, and appropriately label the axes.
Save the plot as **income_fam_stat_plot**.

**Question 6.3**: Which category of *family status* has the largest outliers?

**Response 6.3**: *WRITE YOUR ANSWER BETWEEN THESE ASTERISKS*.

```{r, task6}
### percentage of men and women within each phone operator
phone_gender_plot <- clients_s2_work %>%
  count(sex, phone_operator) %>%
  group_by(phone_operator) %>%
  mutate(pct = round(n / sum(n) * 100, 1)) %>%
  ggplot(aes(x = phone_operator, y = pct, fill = sex, group = sex)) +
  geom_col(position = "dodge") +
  ## dodge the labels by the same width as the bars
  geom_text(aes(label = paste0(pct, "%")),
            position = position_dodge(width = 0.9), vjust = -0.5) +
  coord_cartesian(ylim = c(0, 100)) +
  labs(x = "Phone Operator", y = "Percentage", fill = "Sex",
       title = "Distribution of Phone Operator by Sex")

### print plot
phone_gender_plot
```
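
The scatterplot and boxplot described in this task could be built along the following lines. This is a hedged sketch on simulated data: `toy_clients`, `toy_scatter`, and `toy_boxplot` are illustrative names, the family status labels are made up, and the patterns in the plots do not reflect the real data. With the assignment data, you would start from **clients_s2_work** and save the plots under the names requested above.

```r
library(tidyverse)

## simulated data (made-up values and category labels)
set.seed(519)
toy_clients <- tibble(
  credit_term = sample(3:36, size = 120, replace = TRUE),
  credit_amount = 1000 * credit_term + rnorm(120, sd = 5000),
  income = rlnorm(120, meanlog = 10.3, sdlog = 0.4),
  family_status = factor(
    sample(c("Status A", "Status B", "Status C"), size = 120, replace = TRUE)
  )
)

## scatterplot with transparent blue points and a loess line
toy_scatter <- toy_clients %>%
  ggplot(aes(x = credit_term, y = credit_amount)) +
  geom_point(color = "blue", alpha = 0.4) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  labs(x = "Credit Term", y = "Credit Amount")
toy_scatter

## boxplot filled by category, dark red outliers, jittered points, no legend
toy_boxplot <- toy_clients %>%
  ggplot(aes(x = family_status, y = income, fill = family_status)) +
  geom_boxplot(outlier.color = "darkred") +
  geom_jitter(width = 0.2, alpha = 0.3) +
  theme(legend.position = "none") +
  labs(x = "Family Status", y = "Income")
toy_boxplot
```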

## Task 7: Save Plots and Data

For this task, you will save the plots from the sixth task and the working data.
Save the working data, **clients_s2_work**, as the data file **clients_s2_work.csv** in the **data** folder of the project directory.

Save the three plots from the sixth task as **png** files in the **plots** folder of the project directory.
Save **phone_gender_plot** as **phone_gender.png**, **credit_term_amount_plot** as **credit_term_amount.png**, and **income_fam_stat_plot** as **income_fam_stat.png**.
Use a width of *6 inches* and height of *6 inches*.

```{r, task7}

```
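
Saving a data file and a plot typically combines `write_csv()` and `ggsave()`. The sketch below is illustrative: it writes a made-up table and plot to a temporary folder (`out_dir`, a hypothetical name) so it can run anywhere. In the project, you would build the paths with `here("data", ...)` and `here("plots", ...)` and use the object and file names listed in this task.

```r
library(tidyverse)

## hypothetical output folder; in the project use here("data") and here("plots")
out_dir <- file.path(tempdir(), "describing_data_demo")
dir.create(out_dir, showWarnings = FALSE)

## toy data and plot (made-up values, not the assignment objects)
toy_work <- tibble(x = 1:5, y = c(2, 4, 3, 6, 5))
toy_plot <- ggplot(toy_work, aes(x = x, y = y)) +
  geom_point()

## save the data as a csv file
write_csv(toy_work, file.path(out_dir, "toy_work.csv"))

## save the plot as a png file, 6 inches wide and 6 inches high
ggsave(
  filename = file.path(out_dir, "toy_plot.png"),
  plot = toy_plot,
  width = 6,
  height = 6,
  units = "in"
)
```

`ggsave()` picks the file type from the extension in `filename`, so ending the name in `.png` is enough to produce a png file.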

## Task 8: Conceptual Questions

For your last task, you will respond to conceptual questions based on the conceptual lectures for this week.

**Question 8.1**: What is a percentile of a variable?

**Response 8.1**: *WRITE YOUR ANSWER BETWEEN THESE ASTERISKS*.

**Question 8.2**: What is the difference between the variance and standard deviation of a variable?

**Response 8.2**: *WRITE YOUR ANSWER BETWEEN THESE ASTERISKS*.

**Question 8.3**: What five statistics are computed in a boxplot?
How is a boxplot useful for evaluating variables?

**Response 8.3**: *WRITE YOUR ANSWER BETWEEN THESE ASTERISKS*.
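
The statistics named in these questions can be explored with base R functions before you write your responses. This is a sketch on a made-up vector `x`. Note that `fivenum()` returns Tukey's five-number summary, whose hinges can differ slightly from the quartiles that `quantile()` reports.

```r
## made-up values for illustration
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

## 75th percentile of x
quantile(x, probs = 0.75)

## variance is expressed in squared units of x
var(x)

## standard deviation is the square root of the variance
sd(x)
all.equal(sd(x), sqrt(var(x)))

## five-number summary: minimum, lower hinge, median, upper hinge, maximum
fivenum(x)
```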
