Background

In this project, I imported three CSV datasets, tidied them, and answered questions about them. The datasets and questions were chosen by my classmates and posted to the DATA 607 BlackBoard forum.

Datasets

Here are links to the original datasets and the classmates who posted them.

U.S. Marriage and Divorce Rates, posted by Jiadi Li
Time Use by Gender in Europe, posted by Nicholas Schettini
School Attendance, posted by Baron Curtin

Packages

tidyr, dplyr, and stringr – to reshape, replace, and tidy the data
knitr and kableExtra – to create responsive HTML tables
ggplot2 – to visualize the data
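
The report does not show the chunk that loads these packages. A minimal setup sketch, assuming all six packages are already installed, might look like this:

```r
# Load the packages listed above.
# Assumes they are already installed (e.g. via install.packages()).
library(tidyr)       # separate(), gather()
library(dplyr)       # joins, mutate(), group_by(), summarise()
library(stringr)     # str_replace_all()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(ggplot2)     # plotting
```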

U.S. Marriage and Divorce Rates

1. Import and examine the dataset.

I imported the CSV from a folder on my desktop.

df1 <- read.table("C:/Users/Kavya/Desktop/Education/MS Data Science/DATA 607 - Data Acquisition and Management/Projects/Project 02/national_marriage_divorce_rates_00-16.csv",sep = ",", fill = TRUE, header = TRUE)

This dataset contains two tables: one that describes the number of marriages per year in the U.S. population from 2000 to 2016, and one that describes the number of divorces and annulments during the same period.

The dataset also contains many extraneous columns, and variables that aren’t formatted correctly.
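
Because the import above points at an absolute path on one machine, it will not run elsewhere as written. The sketch below shows one more portable pattern. It writes a tiny stand-in file to a temporary folder so that it runs anywhere; the file name example_rates.csv and the three columns are made up for illustration, with values copied from the first two marriage rows.

```r
# Write a tiny stand-in CSV to a temporary folder so the sketch runs anywhere.
# The file name is hypothetical; the real project file would go here instead.
csv_path <- file.path(tempdir(), "example_rates.csv")
writeLines(c("Year,Marriages,Rate",
             "2016,\"2,245,404\",6.9",
             "2015,\"2,221,579\",6.9"),
           csv_path)

# read.csv() is read.table() with sep = "," and header = TRUE preset.
# stringsAsFactors = FALSE keeps text columns as character vectors.
example_df <- read.csv(csv_path, stringsAsFactors = FALSE)

str(example_df)
```

With text columns kept as character vectors, the as.character() wrappers used in the later coercion steps become optional.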

2. Prepare the dataset.

I split the original dataset into two – one for marriages and one for divorces – and renamed the headers to be consistent.

# Choose the data related to marriage
marriage <- df1[3:19, 1:4]

# Rename the headers
names(marriage) <- c("Year", "Marriages", "Population", "Marriage_Rate")

head(marriage) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"))

	Year	Marriages	Population	Marriage_Rate
3	2016	2,245,404	323,127,513	6.9
4	2015	2,221,579	321,418,820	6.9
5	2014/1	2,140,272	308,759,713	6.9
6	2013/1	2,081,301	306,136,672	6.8
7	2012	2,131,000	313,914,040	6.8
8	2011	2,118,000	311,591,917	6.8

# Choose the data related to divorce
divorce <- df1[32:48, 1:4]

# Rename the headers
names(divorce) <- c("Year", "Divorces", "Population", "Divorce_Rate")

head(divorce) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Year	Divorces	Population	Divorce_Rate
32	2016/1	827,261	257,904,548	3.2
33	2015/2	800,909	258,518,265	3.1
34	2014/2	813,862	256,483,624	3.2
35	2013/2	832,157	254,408,815	3.3
36	2012/3	851,000	248,041,986	3.4
37	2011/3	877,000	246,273,366	3.6

3. Clean up the variables.

Remove notes

First, I focused on removing the notes that were added to the “Year” column with a forward slash.

# Separate the "Year" column by the forward slash.
marriage_sep <- marriage %>%
  separate(Year, c("Year", "X"), sep = "[\\/]")

## Warning: Expected 2 pieces. Missing pieces filled with `NA` in 14 rows [1,
## 2, 5, 6, 7, 8, 9, 10, 12, 13, 14, 15, 16, 17].

# Remove the extra column created by the separation.
marriage <- marriage_sep[, -2]

# Separate the "Year" column by the forward slash.
divorce_sep <- divorce %>%
  separate(Year, c("Year", "X"), sep = "[\\/]")

# Remove the extra column created by the separation.
divorce <- divorce_sep[, -2]
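
As an alternative to separating and then dropping a column, the note can be stripped in place with a regular expression. This is a sketch using base R and a few of the year labels shown above, not the method used in this report; it avoids both the warning and the throwaway "X" column.

```r
# Year labels as they appear in the raw table, some with a "/n" note marker.
year_raw <- c("2016", "2015", "2014/1", "2013/1")

# Drop the slash and everything after it, then convert to numeric.
year_clean <- as.numeric(sub("/.*$", "", year_raw))

year_clean
```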

Coerce into numeric

Then, I removed the commas from the datasets and coerced each variable into a numeric.

I started with the “Marriages” dataset:

# Coerce "Year" into a numeric
marriage$Year <- as.numeric(
                 as.character(marriage$Year))

# Remove commas and coerce "Marriages" into a numeric
m_replace1 <- str_replace_all(marriage$Marriages, "[\\,]", "")

marriage$Marriages <- as.numeric(
                      as.character(m_replace1))

# Remove commas and coerce "Population" into a numeric
m_replace2 <- str_replace_all(marriage$Population, "[\\,]", "")

marriage$Population <- as.numeric(
                       as.character(m_replace2))

# Coerce "Marriage_Rate" into a numeric
marriage$Marriage_Rate <- as.numeric(
                          as.character(marriage$Marriage_Rate))

I did the same to the “Divorces” dataset:

# Coerce "Year" into a numeric
divorce$Year <- as.numeric(
                as.character(divorce$Year))

# Remove commas and coerce "Divorces" into a numeric
d_replace1 <- str_replace_all(divorce$Divorces, "[\\,]", "")

divorce$Divorces <- as.numeric(
                    as.character(d_replace1))

# Remove commas and coerce "Population" into a numeric
d_replace2 <- str_replace_all(divorce$Population, "[\\,]", "")

divorce$Population <- as.numeric(
                      as.character(d_replace2))

# Coerce "Divorce_Rate" into a numeric
divorce$Divorce_Rate <- as.numeric(
                         as.character(divorce$Divorce_Rate))
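
The same remove-commas-then-coerce pattern repeats for every column, so it could be wrapped in a small helper. The sketch below is one way to do that; to_number and marriage_demo are hypothetical names introduced here, and the two rows are copied from the marriage table above.

```r
# Hypothetical helper (not part of the original report):
# strip thousands separators and convert to numeric.
to_number <- function(x) {
  as.numeric(gsub(",", "", as.character(x), fixed = TRUE))
}

# Small stand-in for the marriage table, using rows shown above.
marriage_demo <- data.frame(
  Year = c("2016", "2015"),
  Marriages = c("2,245,404", "2,221,579"),
  Population = c("323,127,513", "321,418,820"),
  Marriage_Rate = c("6.9", "6.9"),
  stringsAsFactors = FALSE
)

# Apply the helper to every column in one step, keeping the data frame shape.
marriage_demo[] <- lapply(marriage_demo, to_number)

str(marriage_demo)
```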

View cleaned data

At the end of this step, I got two datasets that had clean variables in the correct format.

head(marriage) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Year	Marriages	Population	Marriage_Rate
3	2016	2245404	323127513	6.9
4	2015	2221579	321418820	6.9
5	2014	2140272	308759713	6.9
6	2013	2081301	306136672	6.8
7	2012	2131000	313914040	6.8
8	2011	2118000	311591917	6.8

head(divorce) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	Year	Divorces	Population	Divorce_Rate
32	2016	827261	257904548	3.2
33	2015	800909	258518265	3.1
34	2014	813862	256483624	3.2
35	2013	832157	254408815	3.3
36	2012	851000	248041986	3.4
37	2011	877000	246273366	3.6

4. Analyze the data.

Is the decrease in divorce rate due to the decrease in marriage rate?

Jiadi posed this question in her original post on BlackBoard. While we may not be able to prove causality using this dataset, we can certainly investigate whether the two rates move in the same direction, and what that might mean.

Reshape the data

The data I wanted to visualize were the divorce rates and marriage rates over time. The two datasets have the “Year” column in common, so I performed a left join based on year.

Then, I created a separate dataframe called d1_viz with just the year and rates, and gathered the data into columns by Rate_Type – Marriage or Divorce – and Rate.

# Join the "Marriage" and "Divorce" datasets by Year
d1_joined <- left_join(marriage, divorce, by="Year")

# Create a new dataset with Year, Marriage Rate, and Divorce Rate
d1_viz <- data.frame(d1_joined$Year, d1_joined$Marriage_Rate, d1_joined$Divorce_Rate)

# Rename the columns of the new dataset
names(d1_viz) <- c("Year", "Marriage_Rate", "Divorce_Rate")

# Gather the dataset
d1_viz <- gather(d1_viz, "Rate_Type", "Rate", 2:3)

head(d1_viz) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Year	Rate_Type	Rate
2016	Marriage_Rate	6.9
2015	Marriage_Rate	6.9
2014	Marriage_Rate	6.9
2013	Marriage_Rate	6.8
2012	Marriage_Rate	6.8
2011	Marriage_Rate	6.8
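
In current versions of tidyr, gather() is superseded by pivot_longer(). The sketch below shows the same join-and-reshape step written that way, on a two-row stand-in built from the rates shown earlier; marriage_demo, divorce_demo, and rates_long are names made up for this example.

```r
library(dplyr)
library(tidyr)

# Two-row stand-ins for the cleaned tables (rates copied from the output above).
marriage_demo <- data.frame(Year = c(2016, 2015), Marriage_Rate = c(6.9, 6.9))
divorce_demo  <- data.frame(Year = c(2016, 2015), Divorce_Rate  = c(3.2, 3.1))

# Join on Year, then stack the two rate columns into name/value pairs.
rates_long <- marriage_demo %>%
  left_join(divorce_demo, by = "Year") %>%
  pivot_longer(cols = c(Marriage_Rate, Divorce_Rate),
               names_to = "Rate_Type",
               values_to = "Rate")

rates_long
```

Note that pivot_longer() orders the result row by row (both rates for 2016, then both for 2015), whereas gather() stacks one whole column after the other.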

Visualize the data

I used ggplot2 to visualize the data in a smoothed line graph, which helped to uncover trends.

ggplot(d1_viz, aes(x = Year, y = Rate, group = Rate_Type, colour = Rate_Type)) +
  geom_point() +
  labs(title = "U.S. Marriage and Divorce Rates from 2000 - 2016", colour = "") +
  xlab("Year") +
  ylab("Rate (per 1000 people)") +
  geom_smooth(method = "auto")

## `geom_smooth()` using method = 'loess'

Analysis

The chart shows that both marriage and divorce rates in the United States have declined since 2000. However, I was surprised to see that the marriage rate appears to increase slightly after 2010, while the divorce rate continues to decline. This complicates the hypothesis that a decline in marriage causes a decline in divorce.

A TIME Magazine article from 2016 observed the same trend of an increasing marriage rate, and noted that the two measurements – marriage rate and divorce rate – may not necessarily even be related. The article also mentions that divorce and marriage rates vary drastically depending on factors like income level, political preference, and location.

Is the decrease in divorce rate due to the decrease in marriage rate?

We don’t have enough information to answer this question. The question assumes that the marriage and divorce rates decline together and directly affect one another. However, the data show that they do not always move together, and we don’t have enough information on individuals, such as income level and location, to assign causality.
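
One quick way to check whether two series move together is a correlation coefficient. The sketch below uses only the six most recent years (2016 back to 2011) that appear in the cleaned tables above, not the full 2000 to 2016 series, so it is illustrative rather than conclusive, and a correlation says nothing about causation.

```r
# Rates for 2016 back to 2011, copied from the cleaned tables shown earlier.
marriage_rate <- c(6.9, 6.9, 6.9, 6.8, 6.8, 6.8)
divorce_rate  <- c(3.2, 3.1, 3.2, 3.3, 3.4, 3.6)

# Pearson correlation: positive if the rates rise and fall together,
# negative if they move in opposite directions.
rate_cor <- cor(marriage_rate, divorce_rate)

rate_cor
```

For this six-year subset the coefficient is negative, which is consistent with the marriage rate edging up while the divorce rate fell.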

Personally, I would be interested in building a model to predict whether a person will divorce, taking into account the divorce rate for each of the individual’s relevant attributes.

Time Use by Gender in Europe

1. Import and examine the dataset.

df2 <- read.table("C:/Users/Kavya/Desktop/Education/MS Data Science/DATA 607 - Data Acquisition and Management/Projects/Project 02/TimeUse.csv",sep = ",", fill = TRUE, header = TRUE)

This dataset records how females and males in 14 European countries spend their time. It shows the time they spend in one day on 55 different activities.

The dataset is very wide, since each activity is in its own column. In addition, the time spent is recorded in a time format (HH:MM), which makes it hard to analyze and compare.

2. Gather the data.

To begin, I decided to gather the activity data into a new dataframe with two columns – Activity Type and Time Spent. The dataset transformed from having 58 columns and 28 rows to having 5 columns and 1,540 rows.

t <- gather(df2, "Activity_Type", "Time_Spent", 4:58)

## Warning: attributes are not identical across measure variables;
## they will be dropped

head(t) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

SEX	GEO.ACL00	Total	Activity_Type	Time_Spent
Males	Belgium	24:00	Personal.care	10:45
Males	Bulgaria	24:00	Personal.care	11:54
Males	Germany (including former GDR from 1991)	24:00	Personal.care	10:40
Males	Estonia	24:00	Personal.care	10:35
Males	Spain	24:00	Personal.care	11:11
Males	France	24:00	Personal.care	11:44
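
The row count after gathering is simply the original number of rows multiplied by the number of gathered columns: 28 x 55 = 1,540. The sketch below checks that relationship on a small stand-in table. The Personal.care values are copied from the output above, while the other two columns hold made-up times. Importing the columns as character rather than factor (an assumption about the cause) should also avoid the attributes warning shown above.

```r
library(tidyr)

# Stand-in for the wide table: 2 rows, 3 activity columns.
wide_demo <- data.frame(
  SEX = c("Males", "Males"),
  GEO.ACL00 = c("Belgium", "Bulgaria"),
  Total = c("24:00", "24:00"),
  Personal.care = c("10:45", "11:54"),
  Household.and.family.care = c("02:00", "02:30"),  # made-up values
  TV.and.video = c("02:15", "02:45"),               # made-up values
  stringsAsFactors = FALSE
)

# Gather the three activity columns (positions 4 to 6) into two columns.
long_demo <- gather(wide_demo, "Activity_Type", "Time_Spent", 4:6)

# 2 rows x 3 gathered columns = 6 rows; 3 id columns + 2 new columns = 5.
dim(long_demo)
```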

2. Convert the time to minutes.

First, I separated the “Time Spent” variable into columns for Hours and Minutes, and coerced those columns into a numeric.

t_sep <- t %>%
  separate(Time_Spent, c("Hours_Spent", "Minutes_Spent"), sep = "[\\:]")

t_sep$Hours_Spent <- as.numeric(
                     as.character(t_sep$Hours_Spent))

t_sep$Minutes_Spent <- as.numeric(
                       as.character(t_sep$Minutes_Spent))

head(t_sep) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

SEX	GEO.ACL00	Total	Activity_Type	Hours_Spent	Minutes_Spent
Males	Belgium	24:00	Personal.care	10	45
Males	Bulgaria	24:00	Personal.care	11	54
Males	Germany (including former GDR from 1991)	24:00	Personal.care	10	40
Males	Estonia	24:00	Personal.care	10	35
Males	Spain	24:00	Personal.care	11	11
Males	France	24:00	Personal.care	11	44

Then, I converted the hours column into minutes using the mutate function.

t_mutate <- mutate(t_sep, Hours_Spent = Hours_Spent * 60)

## Warning: package 'bindrcpp' was built under R version 3.4.3

head(t_mutate) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

SEX	GEO.ACL00	Total	Activity_Type	Hours_Spent	Minutes_Spent
Males	Belgium	24:00	Personal.care	600	45
Males	Bulgaria	24:00	Personal.care	660	54
Males	Germany (including former GDR from 1991)	24:00	Personal.care	600	40
Males	Estonia	24:00	Personal.care	600	35
Males	Spain	24:00	Personal.care	660	11
Males	France	24:00	Personal.care	660	44

Finally, I added together the new hours spent column with the minutes column to get the total number of minutes spent on each activity. I removed the leftover Hours_Spent, Minutes_Spent, and Total columns.

t_sum <- mutate(t_mutate, Time_Spent = Hours_Spent + Minutes_Spent)

# Drop the leftover columns by name rather than by position
t_sum <- select(t_sum, -Hours_Spent, -Minutes_Spent, -Total)

head(t_sum) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

SEX	GEO.ACL00	Activity_Type	Time_Spent
Males	Belgium	Personal.care	645
Males	Bulgaria	Personal.care	714
Males	Germany (including former GDR from 1991)	Personal.care	640
Males	Estonia	Personal.care	635
Males	Spain	Personal.care	671
Males	France	Personal.care	704
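
The three steps above (separate, multiply, add) can also be collapsed into one small function. This is a sketch in base R; hhmm_to_minutes is a hypothetical helper name, and the inputs are the first three times from the table above.

```r
# Hypothetical helper (not in the original report):
# convert "HH:MM" strings to a total number of minutes.
hhmm_to_minutes <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts,
         function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

minutes <- hhmm_to_minutes(c("10:45", "11:54", "10:40"))
minutes
```

The results match the Time_Spent values for Belgium, Bulgaria, and Germany in the table above.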

3. Calculate time spent as a percentage.

Since there are 1,440 minutes in a day, and the time in the dataset refers to time spent in a day, we can calculate the time spent on each activity as a percentage.

t_percent <- mutate(t_sum, Percent_Time_Spent = (Time_Spent / 1440)*100)

names(t_percent) <- c("Sex", "Country", "Activity_Type", "Time_Spent", "Percent_Time_Spent")

head(t_percent) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Sex	Country	Activity_Type	Time_Spent	Percent_Time_Spent
Males	Belgium	Personal.care	645	44.79167
Males	Bulgaria	Personal.care	714	49.58333
Males	Germany (including former GDR from 1991)	Personal.care	640	44.44444
Males	Estonia	Personal.care	635	44.09722
Males	Spain	Personal.care	671	46.59722
Males	France	Personal.care	704	48.88889
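
As a quick check of the arithmetic, the first two rows can be recomputed by hand. The sketch below assumes nothing beyond the values in the table above.

```r
# Minutes in a day: 24 hours x 60 minutes.
minutes_per_day <- 24 * 60

# Personal care minutes for males in Belgium and Bulgaria (from the table above).
time_spent <- c(645, 714)

# Share of the day, expressed as a percentage.
percent <- time_spent / minutes_per_day * 100

round(percent, 5)
```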

4. Analyze the data.

My classmates identified two questions of interest in this dataset:

What is the activity individuals in a country spend the greatest percent of their time doing?
How do males and females spend their time?

Activity with greatest percent of time

To answer the first question, I grouped the data by Country and Activity Type, calculated the average percentage of time spent on each activity, and filtered the data by the largest percentages.

t_max <- group_by(t_percent, Country, Activity_Type) %>%
           summarise(Avg_Time_Spent = mean(Percent_Time_Spent)) %>%
           filter(Avg_Time_Spent == max(Avg_Time_Spent))

t_max %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

Country	Activity_Type	Avg_Time_Spent
Belgium	Personal.care	45.69444
Bulgaria	Personal.care	49.02778
Estonia	Personal.care	43.92361
Germany (including former GDR from 1991)	Personal.care	45.06944
Italy	Personal.care	46.80556
Latvia	Personal.care	45.10417
Lithuania	Personal.care	45.45139
Poland	Personal.care	45.38194
Slovenia	Personal.care	43.85417
Spain	Personal.care	46.38889
United Kingdom	Personal.care	43.92361

The table shows that, averaged across men and women, people in every country listed spend the biggest proportion of their time on “Personal Care”: sleeping, eating, or grooming.
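
The filter above works per country because summarise() drops the last grouping variable (Activity_Type), leaving the result grouped by Country. The sketch below makes that grouping explicit and uses slice_max() from dplyr. It also passes na.rm = TRUE to mean(); whether missing values explain why only 11 of the 14 countries appear in the table is an assumption I have not verified. All values in percent_demo are illustrative except the Belgian male personal care figure.

```r
library(dplyr)

# Illustrative stand-in for t_percent (one country, two activities).
percent_demo <- data.frame(
  Country = rep("Belgium", 4),
  Sex = c("Males", "Females", "Males", "Females"),
  Activity_Type = c("Personal.care", "Personal.care",
                    "TV.and.video", "TV.and.video"),
  Percent_Time_Spent = c(44.79167, 46.59722, 10.0, NA),
  stringsAsFactors = FALSE
)

top_activity <- percent_demo %>%
  group_by(Country, Activity_Type) %>%
  # Average over sexes, ignoring missing values.
  summarise(Avg_Time_Spent = mean(Percent_Time_Spent, na.rm = TRUE),
            .groups = "drop") %>%
  # Regroup explicitly and keep the top activity per country.
  group_by(Country) %>%
  slice_max(Avg_Time_Spent, n = 1) %>%
  ungroup()

top_activity
```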

Male and female comparison

I decided to narrow down this question to compare how much time men and women in different countries spend on specific activities.

Household and Family Care

First, I filtered the dataframe by the “Household and Family Care” activity type, and grouped the data by country and sex. Then, I created a summary of the data that added up time spent.

I then plotted the data using a stacked bar graph, and ordered the bars from most to least overall time.

t_plot <- filter(t_percent, Activity_Type == "Household.and.family.care") %>%
          group_by(Country, Sex) %>%
          summarise(Time_Sum = sum(Time_Spent))

ggplot(data = t_plot, aes(x = reorder(Country, Time_Sum), y = Time_Sum, fill = Sex)) + 
                geom_bar(stat = "identity") + 
                coord_flip() +
                xlab("Country") +
                ylab("Time Spent (Min.)") +
                labs(title = "Time Spent on Household and Family Care by Gender", fill = "")

The chart shows a clear trend: in every country, females spend more minutes per day on household and family care than males.

Bulgarians spend the most time on household care overall, but Italy appears to be the most unequal.

TV and Video

I followed the same procedure as above to visualize the time spent on watching TV and Video by gender.

t_plot2 <- filter(t_percent, Activity_Type == "TV.and.video") %>%
          group_by(Country, Sex) %>%
          summarise(Time_Sum = sum(Time_Spent))

ggplot(data = t_plot2, aes(x = reorder(Country, Time_Sum), y = Time_Sum, fill = Sex)) + 
                geom_bar(stat = "identity") + 
                coord_flip() +
                xlab("Country") +
                ylab("Time Spent (Min.)") +
                labs(title = "Time Spent on Watching TV / Video by Gender", fill = "")

It seems that males and females in all countries watch TV / Video about equally. Interestingly, Bulgarians are at the top of the list for both TV-watching and household care, which makes me wonder whether that’s due to a bias in the data, just a coincidence, or a sign that the two activities are somehow related.

School Attendance

1. Import and examine the dataset.

df3 <- read.table("C:/Users/Kavya/Desktop/Education/MS Data Science/DATA 607 - Data Acquisition and Management/Projects/Project 02/attendance.csv",sep = ",", fill = TRUE)

This dataset records different metrics related to average daily student attendance by state. It was collected in 2003-04 and 2007-08 as part of a Schools and Staffing Survey from the U.S. Department of Education.

The data itself is quite untidy. It contains three different tables in one:

Columns 2-9 are a summary table for Elementary and Secondary schools
Columns 10-13 refer to Elementary schools
Columns 14-17 refer to Secondary schools

2. Trim the data.

For the purposes of this assignment, I decided to remove the numbers in parentheses – which refer to standard errors – from the dataframe.

I chose all of the columns that did not contain standard error and placed them in a dataframe called a1.

# Trim the data and choose only distinct rows
a <- distinct(df3[2:65, 1:16])

a1 <- data.frame(a$V1, a$V2, a$V4, a$V6, a$V8, a$V10, a$V12, a$V14, a$V16)

Then, I went ahead and prepared the first column – State – by removing the periods after each state name.

a_replace <- str_replace_all(a1$a.V1, "[\\.]", "")

a1$a.V1 <- as.character(a_replace)
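
To illustrate what this step does, here is a sketch on made-up labels. The exact layout of the raw state names is an assumption (dot leaders after the name, as is common in published tables), and the sketch also trims leftover whitespace, which the code above does not do.

```r
# Hypothetical raw labels: state names padded with dot leaders.
state_raw <- c("Alabama ......", "Alaska ........", "United States ....")

# Remove every literal period, then trim leftover whitespace.
state_clean <- trimws(gsub(".", "", state_raw, fixed = TRUE))

state_clean
```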

Finally, I removed two rows that did not contain useful information: the 3rd row of the dataframe, and then the 9th row of what remained.

a2 <- a1[-3, ]

a1 <- a2[-9, ]

head(a1) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	a.V1	a.V2	a.V4	a.V6	a.V8	a.V10	a.V12	a.V14	a.V16
1		Total elementary, secondary, and combined elementary/secondary schools				Elementary schools		Secondary schools
2		ADA as percent of enrollment	Average hours in school day	Average days in school year	Average hours in school year	ADA as percent of enrollment	Average hours in school day	ADA as percent of enrollment	Average hours in school day
4	United States	93.1	6.6	180	1,193	94.0	6.7	91.1	6.6
5	Alabama	93.8	7.0	180	1,267	93.8	7.0	94.6	7.1
6	Alaska	89.9	6.5	180	1,163	91.3	6.5	93.2	6.2
7	Arizona	89.0	6.4	181	1,159	88.9	6.4	89.0	6.4

3. Separate and clean the data.

Next, my goal was to separate the data into 3 dataframes based on type of attendance data:

a_total – total attendance data
a_elem – elementary school attendance data
a_sec – secondary school attendance data

Total Attendance Data

To get this data, I chose the relevant columns from the full dataframe, removed irrelevant rows, and coerced the columns into numerics. I used similar methods below to get elementary school and secondary school data.

# Choose columns that relate to total attendance data and place them in a new dataframe.
a_total <- data.frame(a1$a.V1, a1$a.V2, a1$a.V4, a1$a.V6, a1$a.V8)

# Rename the columns
names(a_total) <- c("State","ADA as percent of enrollment", "Average hours in school day", "Average days in school year", "Average hours in school year")

# Remove the first 3 rows from the dataset
a_total <- a_total[ 4:54, ]

# Remove the comma from the "Average hours in school year" column
a_replace2 <- str_replace_all(a_total$`Average hours in school year`, "[\\,]", "")

# Coerce the column into a numeric
a_total$`Average hours in school year` <- as.numeric(as.character(a_replace2))

# Coerce the rest of the columns into a numeric
a_total[, 2] <- as.numeric(as.character(a_total[, 2]))
a_total[, 3] <- as.numeric(as.character(a_total[, 3]))
a_total[, 4] <- as.numeric(as.character(a_total[, 4]))

head(a_total) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	State	ADA as percent of enrollment	Average hours in school day	Average days in school year	Average hours in school year
4	Alabama	93.8	7.0	180	1267
5	Alaska	89.9	6.5	180	1163
6	Arizona	89.0	6.4	181	1159
7	Arkansas	91.8	6.9	179	1229
8	California	93.2	6.2	181	1129
9	Colorado	93.9	7.0	171	1199

Elementary School Attendance Data

# Choose columns that relate to elementary school attendance data and place them in a new dataframe.
a_elem <- data.frame(a1$a.V1, a1$a.V10, a1$a.V12)

# Rename the columns
names(a_elem) <- c("State","ADA as percent of enrollment", "Average hours in school day")

# Remove the first 3 rows from the dataset
a_elem <- a_elem[ 4:54, ]

# Coerce the columns into a numeric
a_elem[, 2] <- as.numeric(as.character(a_elem[, 2]))
a_elem[, 3] <- as.numeric(as.character(a_elem[, 3]))

head(a_elem) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	State	ADA as percent of enrollment	Average hours in school day
4	Alabama	93.8	7.0
5	Alaska	91.3	6.5
6	Arizona	88.9	6.4
7	Arkansas	92.1	6.9
8	California	94.9	6.3
9	Colorado	94.5	7.0

Secondary School Attendance Data

# Choose columns that relate to secondary school attendance data and place them in a new dataframe.
a_sec <- data.frame(a1$a.V1, a1$a.V14, a1$a.V16)

# Rename the columns
names(a_sec) <- c("State","ADA as percent of enrollment", "Average hours in school day")

# Remove the first 3 rows from the dataset
a_sec <- a_sec[ 4:54, ]

# Coerce the columns into a numeric
a_sec[, 2] <- as.numeric(as.character(a_sec[, 2]))

## Warning: NAs introduced by coercion

a_sec[, 3] <- as.numeric(as.character(a_sec[, 3]))

## Warning: NAs introduced by coercion

head(a_sec) %>% 
  kable("html") %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"))

	State	ADA as percent of enrollment	Average hours in school day
4	Alabama	94.6	7.1
5	Alaska	93.2	6.2
6	Arizona	89.0	6.4
7	Arkansas	90.8	6.8
8	California	89.4	6.1
9	Colorado	91.2	7.0
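
The "NAs introduced by coercion" warnings mean that some entries in the secondary school columns were not plain numbers. A sketch of how one might list the offending entries before coercing is shown below; the column contents, including the "--" marker, are made up for illustration.

```r
# Hypothetical column: mostly numbers, plus a marker for an unreported value.
ada_raw <- c("94.6", "93.2", "--", "89.0")

# Coerce quietly so the warning does not clutter the output.
ada_num <- suppressWarnings(as.numeric(ada_raw))

# Entries that were present as text but failed to convert.
failed <- ada_raw[is.na(ada_num) & !is.na(ada_raw)]

failed
```

Inspecting these entries first makes it easier to decide whether the resulting NA values are genuinely missing data or a formatting problem that can be fixed.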

4. Analyze the data.

Population Plot

I thought I would try a plot I had never attempted before to compare attendance at the elementary school and secondary school levels – a population plot. To make this kind of plot, I had to change the percentage attendance at the elementary school level to be negative.

# Create a new dataframe with the variables to plot
a_plot <- data.frame(a_elem$State, a_elem$`ADA as percent of enrollment`, a_sec$`ADA as percent of enrollment`)

# Rename the dataframe
names(a_plot) <- c("State", "Elem_School", "Secondary School")

# Add a column with Elementary School attendance as negative.
a_mutate <- mutate(a_plot, "Elementary School" = Elem_School * -1)

# Gather the data into two columns.
a_plot2 <- gather(a_mutate, "School_Level", "Attendance_Percentage", 2:4)

Once I put together the dataframe, I created the plot and ordered it by lowest to highest elementary school attendance.

(In retrospect, a population plot may not have been the best way to represent this data, since the y-axis is very long and the data isn’t that varied.)

# Specify range and markers of x-axis
brks <- seq(-100, 100, 25)

# Specify labels of x-axis and take the absolute value to keep them positive
lbls <- abs(seq(-100, 100, 25))

# Create the plot
ggplot(data = a_plot2, aes(x = reorder(State, Attendance_Percentage), y = Attendance_Percentage, fill = School_Level)) + 
                geom_bar(data = subset(a_plot2, School_Level == "Elementary School"), stat = "identity") + 
                geom_bar(data = subset(a_plot2, School_Level == "Secondary School"), stat = "identity") + 
                scale_y_continuous(breaks = brks, labels = lbls) + # Make a continuous x-axis
                coord_flip() +
                xlab("State") +
                ylab("Average Daily Attendance as Percent of Enrollment") +
                labs(title = "Attendance in Elementary and Secondary Schools by State", fill = "")

## Warning: Removed 5 rows containing missing values (position_stack).
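
A somewhat simpler route to the same kind of mirrored chart is to store one signed value per state and school level, and draw a single bar layer. The sketch below is an alternative under that assumption, not the code used for the plot above; it uses three states whose percentages appear in the tables earlier, and attendance, Signed, and p are names introduced for this example.

```r
library(ggplot2)

# Illustrative subset: three states, values copied from the tables above.
attendance <- data.frame(
  State = rep(c("Alabama", "Alaska", "Arizona"), times = 2),
  School_Level = rep(c("Elementary School", "Secondary School"), each = 3),
  Percent = c(93.8, 91.3, 88.9,   # elementary
              94.6, 93.2, 89.0),  # secondary
  stringsAsFactors = FALSE
)

# Mirror the elementary values to the left of zero.
attendance$Signed <- ifelse(attendance$School_Level == "Elementary School",
                            -attendance$Percent,
                            attendance$Percent)

# One bar layer; axis labels show absolute values so both sides read as positive.
p <- ggplot(attendance, aes(x = State, y = Signed, fill = School_Level)) +
  geom_col() +
  scale_y_continuous(breaks = seq(-100, 100, 25),
                     labels = abs(seq(-100, 100, 25))) +
  coord_flip() +
  labs(x = "State",
       y = "Average Daily Attendance as Percent of Enrollment",
       fill = "")

p
```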

Analysis

Elementary schools in Washington have the lowest average attendance, and elementary schools in North Dakota have the highest average attendance.

I wondered if state population had something to do with elementary school attendance. Interestingly, North Dakota has one of the lowest state populations (at 0.7 million people), and Washington is within the top 15 most-populous states (at roughly 7 million people). It’s also worth noting how much lower the attendance rate of Washington is compared to Connecticut, the state with the next-lowest attendance rate.

I did some digging and it seems like Washington continues to struggle with school attendance – as of 2014, it is still ranked among the lowest in the nation.

However, the article mentions that other states may actually be under-reporting attendance data, since (according to the article) Washington educators are more faithfully recording attendance.

If that’s the case, I would want to compare the sample sizes included in this dataset with the student population of each state, to get a rough estimate of the sample’s representativeness.

DATA 607: Project 02

Kavya Beheraj

March 11, 2018
